In our previous articles, we introduced the basics of RL and explored its different business applications. To dig deeper into the inner workings of RL algorithms, it is essential to understand a critical concept: the “Deadly Triad”. Understanding the Deadly Triad is crucial for anyone seeking to master RL algorithms and build robust, reliable AI systems. In this article, we look at what the Deadly Triad is, what its implications are for RL systems, and how to overcome them.
This article is divided into three sections. The first section gives a brief overview of the RL concepts (Deep RL and the overestimation of Q-values) needed to understand the Deadly Triad. The second section introduces the Deadly Triad itself and explains how it impacts the training of RL algorithms. Finally, the third section discusses how to tackle the Deadly Triad when developing robust RL-based AI systems.
The blog article on RL terminology explains RL using a very simple example where the state and action spaces are small enough to create a Q-table for the RL agent. For complex business use cases, such as the ones explained in this blog article, there is a very large number of states and actions, so storing all the Q-values in a huge table is computationally inefficient and requires a lot of memory. Therefore, we use function approximators such as neural networks to approximate the Q-values, which is why these networks are called Deep Q-Networks. Besides the savings in memory, neural networks also generalize across similar states instead of learning a value for every state independently.
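To make this concrete, here is a minimal sketch of what such a function approximator could look like in PyTorch. The class name QNetwork and the dimensions (state_dim=8, n_actions=4) are illustrative assumptions, not taken from a specific implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small MLP that maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action instead of one table cell per (state, action)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Instead of looking up Q(s, a) in a table, we query the network:
q_net = QNetwork(state_dim=8, n_actions=4)
state = torch.randn(1, 8)               # a single example state
q_values = q_net(state)                 # shape (1, 4): one estimated Q-value per action
greedy_action = q_values.argmax(dim=1)  # action with the highest estimated Q-value
```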
To train a Deep Q-Network, we collect hundreds of transitions (state, action, reward, next state, termination flag) in a replay buffer and, every few iterations, sample a small batch from it to update the neural network. When we update the neural network, we also update the policy (the state-action mapping) used by the agent.
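Collecting transitions and sampling a training batch could look like the following sketch; the buffer size, the batch size and the assumption that states are already stored as tensors are illustrative choices, not requirements.

```python
import random
from collections import deque

import torch

# Hypothetical replay buffer holding (state, action, reward, next_state, done) tuples.
buffer = deque(maxlen=10_000)

def store(transition):
    """Append one transition collected while the agent interacts with the environment."""
    buffer.append(transition)

def sample_batch(batch_size: int = 32):
    """Sample a random minibatch and stack it into tensors for one training step."""
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return (torch.stack(states),
            torch.tensor(actions),
            torch.tensor(rewards, dtype=torch.float32),
            torch.stack(next_states),
            torch.tensor(dones, dtype=torch.float32))
```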
In TD-learning, when the agent starts learning, the accuracy of its Q-values depends on which actions it has tried and which neighboring states it has explored. At the start of training, the agent does not yet have enough information about the best action to take in a given state, so there is no guarantee that the action with the highest Q-value is actually the best one. Taking the action with the maximum (noisy) Q-value may therefore be suboptimal. In other words, if the agent has not explored the environment enough, the Q-values of suboptimal actions may end up higher than the Q-values of optimal actions. This is what is meant by the overestimation of Q-values, and it can cause the agent to make poor decisions and collect less cumulative reward.
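A tiny numerical experiment (an illustration of the effect, not taken from the article) shows why picking the maximum of noisy estimates is biased upwards: even when all actions are equally good, the maximum of the noisy estimates is systematically higher than the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose every one of 5 actions in a state is equally good: true Q-value = 1.0 for all of them.
true_q = np.ones(5)

# Early in training, the agent only has noisy estimates of these values.
noisy_estimates = true_q + rng.normal(scale=0.5, size=(10_000, 5))

print(true_q.max())                        # 1.0   -> value of the truly best action
print(noisy_estimates.max(axis=1).mean())  # ~1.58 -> the max over noisy estimates overestimates it
```

The max operator turns zero-mean noise into a positive bias, which is exactly the overestimation described above.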
Now let’s understand what the Deadly Triad is, how it affects the learning process of RL agents, and how to mitigate its negative effects.
In their book “Reinforcement Learning: An Introduction”, Sutton and Barto coined the term Deadly Triad for three properties of RL that can pose significant hurdles to the stable and efficient learning of optimal policies: bootstrapping, off-policy learning, and function approximation. These three properties collectively shape the landscape in which reinforcement learning algorithms operate, and understanding their interplay is crucial for building robust and reliable RL systems, particularly in scenarios where intricate real-world challenges demand sophisticated decision-making strategies.
Let's understand each of these properties and their effects:
In the simplest form of TD learning, i.e. TD(0), the immediate reward is added to the discounted value estimate of the subsequent state (the Bellman equation), and this sum is used as the target for updating the current state's value. Because we update one estimate using another estimate, this is called bootstrapping.
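In code, a single TD(0) update could look like the following sketch, assuming a tabular value function V indexed by state and one observed transition:

```python
def td0_update(V, s, r, s_prime, done, alpha=0.1, gamma=0.99):
    """One TD(0) update of the value estimate for state s."""
    # Bootstrapped target: immediate reward + discounted value *estimate* of the next state.
    target = r + gamma * V[s_prime] * (1.0 - done)
    # Move the current estimate a small step towards that target.
    V[s] = V[s] + alpha * (target - V[s])
    return V
```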
Although bootstrapping can speed up the learning process, it can also introduce bias, which can lead to overestimation or underestimation of the true value of an action, as explained in the previous section. These biases can then propagate to other state-action pairs and thus affect the entire learning process. So, it is important to prevent the overestimation of Q-values and the propagation of bias.
Let's see what happens when these three properties are combined. With function approximation, we are estimating the state-action values rather than storing them exactly. When we combine bootstrapping with a neural network, we use the value estimate of one state to update the value estimate of another state and thereby also propagate the approximation errors. Moreover, because we update the parameters of the entire network, we inadvertently change the value estimates of all other states as well. When we additionally use off-policy learning, i.e. transitions generated by older policies, there may be a large mismatch between the current policy and the policy that generated the data, so the bootstrapped targets also carry approximation errors from those older policies. Together, the three properties amplify each other's negative effects, leading to instability, overestimation of value functions, and ultimately divergence of the RL agent's learning curve.
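The following sketch of a single, standard DQN-style update (building on the QNetwork and replay buffer sketched above, with an assumed optimizer such as torch.optim.Adam(q_net.parameters())) shows where each ingredient of the triad enters:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step that combines function approximation, bootstrapping and off-policy data."""
    # Off-policy learning: the batch was sampled from a replay buffer filled by older policies.
    states, actions, rewards, next_states, dones = batch

    # Function approximation: Q(s, a) comes from the neural network, for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapping: the target uses the network's own (noisy) estimate of the next state,
    # including the max operator responsible for overestimation.
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    # Updating the network parameters shifts the estimates of *all* states, not only those in the batch.
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With this picture in mind, let's see how we can mitigate these effects.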
Tackling the challenges posed by the Deadly Triad in reinforcement learning requires a combination of careful algorithm design, regularization techniques, and strategies that mitigate the negative interactions between function approximation, bootstrapping, and off-policy learning. RL researchers have investigated which algorithmic components contribute to the divergence of the learning process and have proposed several remedies, such as using a separate, slowly updated target network for the bootstrapped targets, Double Q-learning to reduce overestimation, and multi-step returns that lessen the reliance on bootstrapping. The first two of these are sketched below.
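As an illustration (a sketch under the assumptions of the earlier snippets, not the article's own implementation), a target network and a Double Q-learning target can be combined as follows:

```python
import copy

import torch

# 1. Target network: a slowly updated copy of q_net provides the bootstrap targets,
#    so the targets do not shift with every single gradient step.
target_net = copy.deepcopy(q_net)

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    """Double Q-learning target: the online network selects the action, the target network evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection: online network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation: target network
        return rewards + gamma * (1.0 - dones) * next_q

# 2. Periodically (e.g. every few thousand updates) the target network is synchronised with the online one:
# target_net.load_state_dict(q_net.state_dict())
```

Decoupling action selection from action evaluation removes much of the upward bias introduced by the max operator, and the slowly moving target network keeps the bootstrapped targets from chasing their own errors.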
In reinforcement learning, the Deadly Triad (the combination of function approximation, bootstrapping, and off-policy learning) marks a delicate juncture in the process of learning optimal decisions. The interaction of these three factors can amplify the challenges faced by RL algorithms, leading to instability, overestimation, and suboptimal learning outcomes. With an understanding of these dynamics, meticulous algorithm design, and a solid grasp of how the three interact, we can develop stable and powerful RL systems for real-world complexities.
Reference: Hado van Hasselt et al., “Deep Reinforcement Learning and the Deadly Triad” (https://arxiv.org/pdf/1812.02648.pdf)