
## Introduction

In the previous articles, we introduced the basics of reinforcement learning (RL) and explored its various applications in business. To delve deeper into the inner workings of reinforcement learning algorithms, it is important to understand the Deadly Triad (DT), a concept that matters to anyone who wants to master these algorithms and develop robust and reliable AI systems. In this article, we will learn what the Deadly Triad is, what impact it has on RL systems and how to overcome it.

This article is divided into three sections. The first section gives a brief overview of the reinforcement learning concepts (deep reinforcement learning and overestimation of Q-values) needed to understand the Deadly Triad. The second section introduces the Deadly Triad itself and explains how it affects the training of reinforcement learning algorithms. The third section discusses how the Deadly Triad problem can be addressed when developing robust RL-based AI systems.

## Advanced Reinforcement Learning: Overview

### Introduction to Deep Reinforcement Learning

Our Reinforcement Learning Terminology blog article explains reinforcement learning using a very simple example where the state and action spaces are small enough to create a Q-table for the reinforcement learning agent. Complex use cases in companies, such as those in this blog article, involve a very large number of states and actions. Creating a huge table to store the Q-values is then computationally inefficient and requires a large amount of memory. Therefore, we use function approximators such as neural networks to approximate the Q-values, which is why these neural networks are called Deep Q-Networks. There are multiple benefits of using neural networks in RL:

- They can be updated much more efficiently than a Q-table.
- They can generalize better to new states and actions that the agent has not seen before.
- They can be used to solve problems with continuous state and action spaces.

For training of Deep Q-Networks, we collect hundreds of transitions (state, action, reward, next-state, termination) and then select a small batch from it to train the neural network after every few iterations. When we update the neural network, we also update the policy (state-action mapping) used by the agent.
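The collect-then-sample procedure described above can be sketched with a small replay buffer. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the names `Transition` and `ReplayBuffer`, the capacity of 500 and the batch size of 32 are all hypothetical choices, and the transitions are dummy data.

```python
import random
from collections import deque
from typing import NamedTuple


class Transition(NamedTuple):
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool  # termination flag


class ReplayBuffer:
    """Fixed-size store of past transitions (hypothetical helper)."""

    def __init__(self, capacity: int, seed: int = 0):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement from everything stored so far.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


buffer = ReplayBuffer(capacity=500)
for step in range(600):
    # Dummy transitions; a real agent would receive these from the environment.
    buffer.add(Transition((step,), step % 4, 1.0, (step + 1,), False))

batch = buffer.sample(batch_size=32)  # mini-batch for one network update
```

Because the buffer holds transitions from many past steps, a sampled batch usually mixes data produced by several earlier versions of the policy.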


### Overestimation of Q-values

When the agent starts learning in Temporal Difference (TD) learning, the accuracy of the Q-values depends on which actions it has tried and which neighbouring states it has explored. Moreover, at the beginning of training, the agent does not have enough information about the best action in a given state. There is therefore no guarantee early on that the action with the highest Q-value is actually the best action for the transition to the next state: the maximum Q-value is noisy, so the action it points to may be suboptimal. If the agent has not explored the environment sufficiently, the Q-values of suboptimal actions can be higher than the Q-values of the optimal actions. This is what is meant by overestimating the Q-values. It can lead to the agent making poor decisions and receiving fewer cumulative rewards.
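The effect of taking a maximum over noisy estimates can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: all five actions are assumed to have the same true value of 0, and the estimation noise is assumed to be Gaussian with standard deviation 1. Neither assumption comes from a real agent.

```python
import random

rng = random.Random(0)
n_actions = 5
true_q = [0.0] * n_actions  # assumption: every action is equally good
noise_std = 1.0             # assumption: Gaussian estimation noise
n_trials = 10_000

sum_of_max = 0.0
for _ in range(n_trials):
    # Each trial is one noisy snapshot of the agent's Q-value estimates.
    noisy_q = [q + rng.gauss(0.0, noise_std) for q in true_q]
    sum_of_max += max(noisy_q)

avg_max_estimate = sum_of_max / n_trials
true_max = max(true_q)
print(f"true max Q: {true_max:.2f}")
print(f"average max of noisy estimates: {avg_max_estimate:.2f}")
```

Although each individual estimate is unbiased, the average of the maximum lies well above the true maximum of 0, which is the upward bias that the term overestimation refers to.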

Now let’s understand what the Deadly Triad is, how it affects the learning process of RL agents, and how to mitigate its negative effects.


## What is the Deadly Triad?

In their book Reinforcement Learning: An Introduction, Sutton and Barto coined the term Deadly Triad to describe three properties of reinforcement learning that can pose significant hurdles to the stable and efficient learning of optimal strategies. These properties are bootstrapping, off-policy learning and function approximation. Together, these three properties shape the landscape in which reinforcement learning algorithms operate. Understanding the interplay between these properties is critical for developing robust and reliable RL systems, especially in scenarios where complex real-world challenges require sophisticated decision strategies.

Let's understand each of these properties and their effects:

**1. Bootstrapping** is a method for using value estimates of one state to update the value estimates of other states. This approach is often used in reinforcement learning algorithms to disseminate knowledge and improve the accuracy of value functions or policy estimates. Bootstrapping plays an important role in the learning process as it allows an agent to use its existing knowledge to refine its understanding of the environment.

In the simplest form of TD learning, TD(0), the immediate reward is added to the discounted value of the subsequent state (as in the Bellman equation). The result is then used as the target value for updating the current state's value:

$$Q(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}),$$

where

- $Q$ is the Q-value
- $R$ is the reward
- $s_t$ and $a_t$ are the state and the action at time $t$
- $\gamma$ is the discount factor

Although this method can speed up the learning process, it can also introduce biases that lead to overestimation or underestimation of the true value of an action, as explained in the previous section. These biases can then propagate to other state-action pairs and thus affect the entire learning process. It is therefore important to prevent the overestimation of Q-values and the propagation of bias.
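The bootstrapped update above can be sketched for the tabular case as follows. The function name `q_learning_update`, the state labels and the learning rate `alpha` are assumptions for illustration: the equation in the text writes the target directly, which corresponds to `alpha = 1`, while practical implementations usually move only part of the way towards the target.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Apply one TD(0) Q-learning update in place and return the TD error."""
    if done:
        bootstrap = 0.0  # no future value after a terminal state
    else:
        # Bootstrapping: reuse the current estimates of the next state.
        bootstrap = max(q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * bootstrap
    td_error = td_target - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q_table = defaultdict(float)  # unseen state-action pairs start at 0.0
q_table[("s1", 0)] = 2.0
q_table[("s1", 1)] = 5.0      # if this estimate is too high, the error spreads

td_error = q_learning_update(q_table, "s0", 0, reward=1.0,
                             next_state="s1", done=False, n_actions=2)
```

Here the value of `("s0", 0)` is pulled towards `1.0 + 0.99 * 5.0`, so any error in the estimate for `("s1", 1)` is passed on to the preceding state-action pair.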

**2. Function approximators**: In complex reinforcement learning systems, neural networks are mostly used as function approximators, as they allow dealing with larger state spaces. In some use cases, such as autonomous driving, they also play a key role in processing input images to produce the correct state representations. The use of neural networks offers numerous advantages, as mentioned in the previous section, but at the same time leads to non-linearity and approximation errors. This can affect the stability and convergence of the learning process. Therefore, it is important to control how the neural network updates itself and what impact this has on the estimates of the values.

**3. Off-policy learning** involves learning from data generated by a policy other than the current one. One common technique is experience replay: past transitions are stored in a buffer, and small batches sampled from it are used to train the Q-network (as mentioned in the previous section). The sampled transitions were not all generated by the same version of the neural network (or policy), so the network is updated with data from several policies. This is a very powerful technique, as it improves the agent's generalization ability and lets it learn from suboptimal policies on the way to an optimal one. Although these transitions are very helpful for learning, they are biased toward the policies that generated them, and older transitions can contradict the agent's current policy. This can impair the convergence and stability of the learning process.

Let's see what happens when these three are combined. When we use function approximation, we are only estimating the state-action values. When we combine bootstrapping with neural networks, we use the value estimate of one state to update the value estimate of another state and thereby also propagate the approximation errors. Because each update changes the parameters of the entire neural network, it inadvertently affects the value estimates of all other states too. When we then add off-policy learning, i.e., transitions from older policies, there may be a big difference between the current policy and the policy that generated the transitions. The bootstrapped targets now also carry approximation errors from older policies. Together, the three properties amplify each other's negative effects, leading to instability, overestimation of value functions, and ultimately divergence of the learning curve of RL agents. Now let's see how we can mitigate these effects.
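A well-known minimal illustration of this interaction is the two-state example from Sutton and Barto's book, in which one shared weight `w` represents both states (estimated values `w` and `2w`) and only the transition from the first state to the second is ever updated. The sketch below reproduces that setup; the step size, discount factors and number of updates are assumed values chosen for illustration.

```python
def semi_gradient_td_updates(w, gamma, alpha, n_updates):
    """Repeatedly apply semi-gradient TD(0) to the single transition A -> B.

    Linear function approximation with one shared weight w:
    V(A) = 1 * w and V(B) = 2 * w. The reward is always 0, so the
    correct weight is w = 0.
    """
    history = [w]
    for _ in range(n_updates):
        v_a, v_b = 1.0 * w, 2.0 * w
        td_error = 0.0 + gamma * v_b - v_a  # bootstrapped target minus estimate
        w += alpha * td_error * 1.0         # gradient of V(A) w.r.t. w is 1
        history.append(w)
    return history


# Each update multiplies w by (1 + alpha * (2 * gamma - 1)).
diverging = semi_gradient_td_updates(w=1.0, gamma=0.9, alpha=0.1, n_updates=100)
shrinking = semi_gradient_td_updates(w=1.0, gamma=0.4, alpha=0.1, n_updates=100)

print(f"gamma=0.9: w after 100 updates = {diverging[-1]:.1f}")
print(f"gamma=0.4: w after 100 updates = {shrinking[-1]:.3f}")
```

With a discount factor above 0.5, every update raises the target `2 * gamma * w` faster than the estimate `w` can catch up, so the weight grows without bound even though all rewards are zero. All three ingredients are present: a shared parameter (function approximation), a target built from the current estimate (bootstrapping), and an update distribution that ignores what happens after the second state (off-policy).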


## How do you deal with the Deadly Triad?

Overcoming the challenges posed by the Deadly Triad in reinforcement learning (RL) requires a combination of careful algorithm design, regularisation techniques and strategies to mitigate the negative interactions between function approximation, bootstrapping and off-policy learning. RL researchers have investigated several algorithmic components that contribute to the divergence of the learning process. Here are some of the main approaches to address the Deadly Triad:

- **Regularisation techniques**: Regularisation methods can help control the complexity of the learned models and mitigate the impact of function approximation errors. Techniques like weight decay, dropout, and batch normalisation can stabilise the training of neural networks and reduce overfitting, which can contribute to inaccurate value estimates.
- **Capacity and size**: If all values are stored independently of each other, as in a table, there is no divergence. Similarly, if a function approximator (a neural network) is large enough (wider and deeper), it might behave similarly to the tabular case. Experiments showed that the best-performing runs used the bigger network architectures.
- **Target networks**: The hypothesis here is that there is less divergence when bootstrapping on a separate network, i.e., when another network (the target network) estimates the value of the TD(0) target. Decoupling the target network from the network being updated can alleviate the problem of error propagation.
- **Overestimation**: Double Deep Q-learning decouples action selection from action evaluation, which reduces overestimation. Coupled with target networks, this reduces divergence even further.
- **Prioritisation**: Prioritisation assigns a priority value to each experience in the replay buffer, indicating its relative importance. During sampling, experiences with higher priority are more likely to be selected for training the RL agent. Because high-priority experiences are sampled more frequently, prioritised sampling introduces a bias; importance sampling weights are used during training to correct this imbalance and keep the learning process stable.
- **Multi-step**: When bootstrapping immediately after a single step, the contraction in the learning update is proportional to the discount factor γ. When bootstrapping after two steps, the expected contraction is γ². Divergence may therefore be reduced by multi-step updates, even with neural networks. Experiments showed that instability decreased as the number of steps increased.
- **Exploration strategies**: Proper exploration strategies, such as epsilon-greedy or UCB exploration, help the agent collect diverse experiences. This is especially important in off-policy learning, where the agent needs to explore various situations so that its data is representative. Over time, these strategies let the agent gradually shift towards valuable and highly rewarding experiences.
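Two of the approaches above change how the bootstrap target is built, which can be sketched in a few lines. This is a simplified illustration, not a full training loop: the function names are hypothetical, the Q-values are hand-picked numbers rather than network outputs, and the online and target networks are represented only by their predicted values for the next state.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Bootstrap target that selects with one network and evaluates with another.

    next_q_online / next_q_target: Q-values for the next state, one entry per
    action, from the online network and the target network respectively.
    """
    if done:
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... while action evaluation uses the separate target network.
    return reward + gamma * next_q_target[best_action]


def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus gamma**n times the bootstrapped value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


online_q = [1.0, 3.0, 2.0]  # online network prefers action 1
target_q = [1.5, 2.0, 4.0]  # target network holds a high estimate for action 2

ddqn = double_dqn_target(1.0, False, online_q, target_q, gamma=0.9)
plain = 1.0 + 0.9 * max(target_q)  # target using max over the target network
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=5.0, gamma=0.9)
```

In this example the double Q-learning target is lower than the plain maximum, because the possibly inflated estimate for action 2 is not chosen by the online network. In the three-step return, the bootstrapped value is weighted by `0.9 ** 3`, so an error in that estimate has less influence on the target than in a one-step update.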


## Conclusion

In reinforcement learning, the Deadly Triad (the combination of function approximation, bootstrapping and off-policy learning) marks a difficult point in learning optimal decisions. The interplay of these three factors can amplify the challenges for reinforcement learning algorithms, leading to instability, overestimation and suboptimal learning outcomes. By understanding these interactions and designing algorithms carefully, we can develop stable and powerful RL systems for complex real-world situations.
