In our previous articles, we introduced the basics of RL and explored its different business applications. To dig deeper into the inner workings of RL algorithms, it is essential to understand a critical concept: the “Deadly Triad”. Understanding the Deadly Triad is crucial for anyone seeking to master RL algorithms and build robust, reliable AI systems. In this article, we look at what the Deadly Triad is, what its implications are for RL systems, and how to overcome them.
This article is divided into three sections. The first section gives a brief overview of the RL concepts (Deep RL and the overestimation of Q-values) needed to understand the Deadly Triad. The second section introduces the Deadly Triad itself and explains how it impacts the training of RL algorithms. Finally, the third section discusses how to tackle the Deadly Triad when developing robust RL-based AI systems.
The blog article on RL terminology explains RL using a very simple example where the state and action spaces are small enough to create a Q-table for the RL agent. For complex business use cases, such as the ones explained in this blog article, there is a very large number of states and actions, so storing all the Q-values in a huge table is computationally inefficient and requires a lot of memory. Therefore, we use function approximators such as neural networks to approximate the Q-values, which is why these networks are called Deep Q-Networks. Besides the savings in memory, neural networks also generalize across similar states instead of learning a value for every state independently.
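To make this concrete, here is a minimal sketch of what such a function approximator could look like in PyTorch. The class name QNetwork and the dimensions (state_dim=8, n_actions=4) are illustrative assumptions, not taken from a specific implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small MLP that maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action instead of one table cell per (state, action)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Instead of looking up Q(s, a) in a table, we query the network:
q_net = QNetwork(state_dim=8, n_actions=4)
state = torch.randn(1, 8)               # a single example state
q_values = q_net(state)                 # shape (1, 4): one estimated Q-value per action
greedy_action = q_values.argmax(dim=1)  # action with the highest estimated Q-value
```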
To train a Deep Q-Network, we collect hundreds of transitions (state, action, reward, next state, termination flag) in a replay buffer and, every few iterations, sample a small batch from it to update the neural network. When we update the neural network, we also update the policy (the state-action mapping) used by the agent.
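Collecting transitions and sampling a training batch could look like the following sketch; the buffer size, the batch size and the assumption that states are already stored as tensors are illustrative choices, not requirements.

```python
import random
from collections import deque

import torch

# Hypothetical replay buffer holding (state, action, reward, next_state, done) tuples.
buffer = deque(maxlen=10_000)

def store(transition):
    """Append one transition collected while the agent interacts with the environment."""
    buffer.append(transition)

def sample_batch(batch_size: int = 32):
    """Sample a random minibatch and stack it into tensors for one training step."""
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return (torch.stack(states),
            torch.tensor(actions),
            torch.tensor(rewards, dtype=torch.float32),
            torch.stack(next_states),
            torch.tensor(dones, dtype=torch.float32))
```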
In TD-learning, when the agent starts learning, the accuracy of its Q-values depends on which actions it has tried and which neighboring states it has explored. At the start of training, the agent does not yet have enough information about the best action to take in a given state, so there is no guarantee that the action with the highest Q-value is actually the best one. Taking the action with the maximum (noisy) Q-value may therefore be suboptimal. In other words, if the agent has not explored the environment enough, the Q-values of suboptimal actions may end up higher than the Q-values of optimal actions. This is what is meant by the overestimation of Q-values, and it can cause the agent to make poor decisions and collect less cumulative reward.
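A tiny numerical experiment (an illustration of the effect, not taken from the article) shows why picking the maximum of noisy estimates is biased upwards: even when all actions are equally good, the maximum of the noisy estimates is systematically higher than the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose every one of 5 actions in a state is equally good: true Q-value = 1.0 for all of them.
true_q = np.ones(5)

# Early in training, the agent only has noisy estimates of these values.
noisy_estimates = true_q + rng.normal(scale=0.5, size=(10_000, 5))

print(true_q.max())                        # 1.0   -> value of the truly best action
print(noisy_estimates.max(axis=1).mean())  # ~1.58 -> the max over noisy estimates overestimates it
```

The max operator turns zero-mean noise into a positive bias, which is exactly the overestimation described above.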
Now let’s understand what the Deadly Triad is, how it affects the learning process of RL agents, and how to mitigate its negative effects.
In their book “Reinforcement Learning: An Introduction”, Sutton and Barto coined the term Deadly Triad for three properties of RL that can pose significant hurdles to the stable and efficient learning of optimal policies: bootstrapping, off-policy learning, and function approximation. These three properties collectively shape the landscape in which reinforcement learning algorithms operate, and understanding their interplay is crucial for building robust and reliable RL systems, particularly in scenarios where intricate real-world challenges demand sophisticated decision-making strategies.
Let's understand each of these properties and their effects:
In the simplest form of TD learning, i.e. TD(0), the immediate reward is added to the discounted value estimate of the subsequent state (the Bellman equation), and this sum is used as the target for updating the current state's value. Because we update one estimate using another estimate, this is called bootstrapping.
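In code, a single TD(0) update could look like the following sketch, assuming a tabular value function V indexed by state and one observed transition:

```python
def td0_update(V, s, r, s_prime, done, alpha=0.1, gamma=0.99):
    """One TD(0) update of the value estimate for state s."""
    # Bootstrapped target: immediate reward + discounted value *estimate* of the next state.
    target = r + gamma * V[s_prime] * (1.0 - done)
    # Move the current estimate a small step towards that target.
    V[s] = V[s] + alpha * (target - V[s])
    return V
```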
Although bootstrapping can speed up the learning process, it can also introduce bias, which can lead to overestimation or underestimation of the true value of an action, as explained in the previous section. These biases can then propagate to other state-action pairs and thus affect the entire learning process. So, it is important to prevent the overestimation of Q-values and the propagation of bias.
Let's see what happens when these three properties are combined. With function approximation, we are estimating the state-action values rather than storing them exactly. When we combine bootstrapping with a neural network, we use the value estimate of one state to update the value estimate of another state and thereby also propagate the approximation errors. Moreover, because we update the parameters of the entire network, we inadvertently change the value estimates of all other states as well. When we additionally use off-policy learning, i.e. transitions generated by older policies, there may be a large mismatch between the current policy and the policy that generated the data, so the bootstrapped targets also carry approximation errors from those older policies. Together, the three properties amplify each other's negative effects, leading to instability, overestimation of value functions, and ultimately divergence of the RL agent's learning curve.
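The following sketch of a single, standard DQN-style update (building on the QNetwork and replay buffer sketched above, with an assumed optimizer such as torch.optim.Adam(q_net.parameters())) shows where each ingredient of the triad enters:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step that combines function approximation, bootstrapping and off-policy data."""
    # Off-policy learning: the batch was sampled from a replay buffer filled by older policies.
    states, actions, rewards, next_states, dones = batch

    # Function approximation: Q(s, a) comes from the neural network, for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapping: the target uses the network's own (noisy) estimate of the next state,
    # including the max operator responsible for overestimation.
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    # Updating the network parameters shifts the estimates of *all* states, not only those in the batch.
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With this picture in mind, let's see how we can mitigate these effects.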
Tackling the challenges posed by the Deadly Triad in reinforcement learning requires a combination of careful algorithm design, regularization techniques, and strategies that mitigate the negative interactions between function approximation, bootstrapping, and off-policy learning. RL researchers have investigated which algorithmic components contribute to the divergence of the learning process and have proposed several remedies, such as using a separate, slowly updated target network for the bootstrapped targets, Double Q-learning to reduce overestimation, and multi-step returns that lessen the reliance on bootstrapping. The first two of these are sketched below.
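As an illustration (a sketch under the assumptions of the earlier snippets, not the article's own implementation), a target network and a Double Q-learning target can be combined as follows:

```python
import copy

import torch

# 1. Target network: a slowly updated copy of q_net provides the bootstrap targets,
#    so the targets do not shift with every single gradient step.
target_net = copy.deepcopy(q_net)

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    """Double Q-learning target: the online network selects the action, the target network evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection: online network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation: target network
        return rewards + gamma * (1.0 - dones) * next_q

# 2. Periodically (e.g. every few thousand updates) the target network is synchronised with the online one:
# target_net.load_state_dict(q_net.state_dict())
```

Decoupling action selection from action evaluation removes much of the upward bias introduced by the max operator, and the slowly moving target network keeps the bootstrapped targets from chasing their own errors.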
In reinforcement learning, the Deadly Triad (the combination of function approximation, bootstrapping, and off-policy learning) marks a delicate juncture in the process of learning optimal decisions. The interaction of these three factors can amplify the challenges faced by RL algorithms, leading to instability, overestimation, and suboptimal learning outcomes. With an understanding of these dynamics, meticulous algorithm design, and a solid grasp of how the three interact, we can develop stable and powerful RL systems for real-world complexities.
Reference: Hado van Hasselt et al., “Deep Reinforcement Learning and the Deadly Triad” (https://arxiv.org/pdf/1812.02648.pdf)