In this article, we aim to bridge the gap between a basic understanding of reinforcement learning (RL) and solving a problem using RL methods. The article is divided into three sections. The first section is a brief introduction to RL. The second section explains the most important terms required to formulate an RL problem using an example. In the third and final section, we present a basic implementation for solving a problem with RL.
Reinforcement learning (RL) is a type of machine learning in which agents learn to make decisions in an environment from the consequences of their actions. The agent receives feedback in the form of “rewards” or “punishments,” which it uses to update its decision-making strategy and improve its performance over time. In other words, the agent learns through trial and error: it tries out different actions and observes the feedback they produce. RL can be applied to any process or system that involves sequential decision-making and can be optimized.
RL is used in areas such as autonomous driving, robotics, control systems, game AI, and economics and finance, e.g., in optimizing decision-making in portfolio optimization and resource allocation. It has also been used to improve the performance of dialogue systems (chatbots) such as ChatGPT, which have caused quite a stir in recent months.
This section provides an overview of the definitions and descriptions of the most important terms required to understand the dynamics of RL algorithms. Complex terms are always easier to grasp with a simple visual example, so we will explain them below using a simple maze problem.
The main goal of our robot QT is to win the game by reaching the target field (also known as the target state), which is marked by coins, by finding and following the optimal path. To move around the maze, QT can perform four actions: go up, go down, go left, and go right. In this environment, there are some obstacles such as the lightning field and the impassable block (gray box). These obstacles can either kill QT or prevent it from performing its intended action (movement). Since QT has no prior knowledge of the maze, the idea is that QT tries to explore the environment by navigating randomly and finding a possible path to the goal.
To better understand how QT can achieve this, it is important to understand some basic RL components and terms that are also used to formulate the problem mathematically.
The maze is referred to as the environment and the robot QT as the agent. In reinforcement learning, the environment is the system, simulation, or space in which the agent can act. The most important thing is that the agent cannot change the rules or dynamics of the environment in any way. The environment is the world in which the agent interacts by performing certain actions and receiving feedback. The agent interacts with the environment and makes decisions. This can be a robot, a software program, or another system that can perform actions and then receive a reward. In the example above, QT (the RL agent) interacts with the maze (environment) by moving in one of the directions and updating its position in the maze. Below are the various components and characteristics of agent-environment interaction:
During the exploration phase, the agent tries out new actions to learn more about the environment and improve its decision-making. This helps the agent discover high-reward states and the actions that lead to them. During the exploitation phase, it uses the knowledge it has already gathered and chooses the actions it currently considers best in order to maximize the cumulative reward. In the reinforcement learning example above, QT would first try out actions to discover different states and possible paths to the goal. Sometimes it will end up in the lightning field and sometimes in the coin field, but it will gather knowledge and learn which state-action combinations lead to these states. With increasing experience, it will gradually learn to find the optimal path (in our example, the shortest path).
The balance between exploration and exploitation is often referred to as the exploration-exploitation trade-off. In general, the more an agent explores, the more knowledge it gains about the environment, but the longer it may take to obtain the highest reward and find the optimal strategy. Conversely, the more an agent exploits, the faster it collects high rewards, but it may settle on a suboptimal strategy and miss very rewarding states. Common strategies for striking this balance include ε-greedy action selection, softmax (Boltzmann) exploration, and gradually decaying the exploration rate over time; a sketch of ε-greedy follows below.
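As an illustration, here is a minimal sketch of ε-greedy action selection. It assumes the agent's estimated action values are stored in a Q-table keyed by (state, action); the names q_table and epsilon are illustrative and not taken from the original example.

import random

def epsilon_greedy_action(q_table, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)   # explore
    # exploit: choose the action with the highest Q-value in this state
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# Example: state (0, 0) with four possible moves and two known Q-values
actions = ["up", "down", "left", "right"]
q_table = {((0, 0), "right"): 0.5, ((0, 0), "down"): 0.2}
print(epsilon_greedy_action(q_table, (0, 0), actions, epsilon=0.2))

In practice, ε is often decayed over time so that the agent explores a lot at the beginning and exploits more as its value estimates become reliable.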
Now that we have learned the basic terminology related to the “maze problem,” let's formulate and understand this terminology from a mathematical perspective.
The loop above represents one step taken by the agent. The agent must take several such steps {S0, A0, R1, S1, T0} to explore its environment and identify the good and bad states along with the actions that lead to them. Finally, it must navigate the optimal path to the goal.
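To make this step loop concrete, here is a minimal sketch of the agent-environment interaction, written against a hypothetical MazeEnv with a reset()/step() interface in the style of common RL libraries. The class, its states, and its reward values are simplified assumptions, not the exact maze from the figure.

import random

class MazeEnv:
    """A tiny one-dimensional stand-in for the maze: states 0..4, state 4 holds the coins."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 = move left, +1 = move right; the walls clip the movement
        self.state = min(max(self.state + action, 0), 4)
        done = self.state == 4                  # reached the coin field
        reward = 1.0 if done else -0.01         # small step cost, larger goal reward
        return self.state, reward, done

env = MazeEnv()
state = env.reset()
done = False
while not done:                                 # one episode = a sequence of steps
    action = random.choice([-1, +1])            # pure random exploration for now
    next_state, reward, done = env.step(action)
    # each step yields a (state, action, reward, next state, termination) record
    state = next_state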
In order to formulate and solve a decision-making or optimization problem using reinforcement learning, some very important concepts and terms must be explained.
DISCLAIMER: Now begins a mathematical deep dive!
1. Markov Decision Process (MDP): A Markov decision process is a process in which future states depend only on the current state and are in no way connected to the past states that led to the current state. This mathematical framework is used to model decision problems in reinforcement learning. Given a sequence of states, actions, and rewards s0, a0, r0, s1, a1, r1, s2, a2, r2, …, st, at, rt (the so-called history) observed by an agent, the state signal is said to have the Markov property if it satisfies the following equation:
P(St+1 = s′, Rt+1 = r′ | st, at) = P(St+1 = s′, Rt+1 = r′ | st, at, rt, …, s0, a0, r0), where st, at, and rt denote the state, action, and reward observed at time step t, and s′ and r′ are a possible next state and next reward. In words: the probability of what happens next depends only on the current state and action, not on the full history.
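As a small illustration of the Markov property, the dynamics of a grid world can be written as a table keyed only by the current state and action; the history never appears. The 2x2 grid below is a simplified, made-up stand-in for the maze, not its exact layout.

# Deterministic transition model for a 2x2 grid world.
# The key is (state, action) only -- no history is needed to determine the next state.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(state, action, n_rows=2, n_cols=2):
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), n_rows - 1)
    col = min(max(state[1] + dc, 0), n_cols - 1)
    return (row, col)

transitions = {
    (s, a): next_state(s, a)
    for s in [(r, c) for r in range(2) for c in range(2)]
    for a in ACTIONS
}
print(transitions[((0, 0), "right")])   # -> (0, 1), regardless of how the agent reached (0, 0)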
2. Cumulative Return and Discount Factor
a) RL agents learn by selecting actions that maximize the cumulative future reward. The cumulative future reward is called the return and is often denoted by R. The return at time t is the sum of all rewards received from time step t until the end of the episode at time T:

Rt = rt+1 + rt+2 + rt+3 + … + rT
b) This definition only makes sense if the learning task ends after a small number of steps. If the task requires many sequential decisions (e.g., balancing a pole), the number of steps is high or even infinite, so the value of the return can grow without bound. Therefore, it is more common to use the future cumulative discounted reward G, which is expressed as:

Gt = rt+1 + γ · rt+2 + γ² · rt+3 + …
Here, γ is referred to as the discount factor and lies in the range [0, 1]. Gamma (γ) controls the importance of future rewards relative to immediate rewards. The lower the discount factor (closer to 0), the less important future rewards are, and the agent will tend to focus only on actions that bring maximum immediate rewards. The higher the value of γ (closer to 1), the more future rewards are weighted almost as heavily as immediate ones. If γ is set to 1, the expression for the cumulative discounted reward G is identical to the return R.
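The following small helper makes the effect of γ tangible; the reward sequence is made up purely for illustration.

def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 0, 1]                   # the goal reward arrives after four steps
print(discounted_return(rewards, 1.0))   # 1.0   -> identical to the undiscounted return R
print(discounted_return(rewards, 0.9))   # 0.729 -> future reward is worth less today
print(discounted_return(rewards, 0.1))   # 0.001 -> a short-sighted agent barely sees it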
3. Policy: A policy (denoted as π(s)) is the set of rules or the algorithm that the agent uses to determine its actions, i.e., it determines how the agent behaves in each state and which action it should choose there. For an agent, a policy is a mapping of states to actions. At the beginning, when the agent starts learning, the policy maps states to more or less random actions. During training, however, the agent receives rewards from the environment that help it improve this mapping. The optimal policy, denoted π*, is the one that maximizes the cumulative reward and maps every state to the best action in that state (see the sketch below).
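For a small, discrete problem, a policy can literally be stored as a lookup table from states to actions. The sketch below contrasts a random initial policy with a hand-written optimal one for a toy 2x2 grid whose goal is the bottom-right corner; both tables are illustrative assumptions.

import random

states = [(r, c) for r in range(2) for c in range(2)]
actions = ["up", "down", "left", "right"]

# Initial policy: the agent knows nothing, so every state maps to a random action.
random_policy = {s: random.choice(actions) for s in states}

# A hand-crafted optimal policy pi* for a goal in the corner (1, 1):
optimal_policy = {(0, 0): "right", (0, 1): "down", (1, 0): "right", (1, 1): "down"}

print(random_policy[(0, 0)])    # arbitrary; it will change as the agent learns
print(optimal_policy[(0, 0)])   # "right" -- a first step on a shortest path to the goal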
4. Action Value and State Value Functions: To optimize the policy, the agent must figure out two things. First, which states are good and which are bad, and second, which actions are good and which are bad for each state. Two different value functions can be used for this purpose, namely the state value function V(s) and the action value function Q(s,a).
a) State value function – The state value function measures how good it is for the agent to be in a given state. Mathematically speaking, it estimates the expected future return of a particular state: the return the agent can expect if it starts in that state and then follows its policy π.
It is denoted as V(s) or Vπ(s), where s is the current state, and is mathematically defined as:

Vπ(s) = E[ Gt | St = s ]

i.e., the expected cumulative discounted reward when starting in state s and following policy π.
b) Action value function – The action value function evaluates “the quality of an action in a state,” i.e., how good it is to take a particular action in a particular state. Mathematically speaking, it estimates the expected future return of taking a particular action in a particular state and then following the policy from that point on. It is similar to the state value function, except that it is more specific: it evaluates each individual action in a state, rather than the state alone.
It is denoted as Q(s,a) or Qπ(s,a), where s is the current state and a is the action taken:

Qπ(s,a) = E[ Gt | St = s, At = a ]

i.e., the expected cumulative discounted reward when taking action a in state s and following policy π afterwards.
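In tabular RL, both value functions are simply tables of numbers. Once estimates of the action values Q(s, a) are available, the state value under a greedy policy and the greedy policy itself can be read off directly; the numbers below are invented for illustration.

# Estimated action values Q(s, a) for one state of the maze (illustrative numbers).
q_values = {
    ((0, 0), "up"): -0.2,
    ((0, 0), "down"): 0.1,
    ((0, 0), "left"): -0.2,
    ((0, 0), "right"): 0.6,
}
actions = ["up", "down", "left", "right"]
state = (0, 0)

# State value under a greedy policy: V(s) = max_a Q(s, a)
v = max(q_values[(state, a)] for a in actions)

# Greedy policy (optimal if the Q-values are accurate): pi(s) = argmax_a Q(s, a)
best_action = max(actions, key=lambda a: q_values[(state, a)])

print(v, best_action)   # 0.6 right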
5. Bellman Equation: The Bellman equation is fundamental to estimating the two value functions. It relates the current state-action pair (s, a), the observed reward, and the possible successor state-action pairs, and this relationship is used to find the optimal value function. In its optimality form, it defines the value of a state-action pair Q(s,a) as the immediate reward received for taking action a in state s plus the discounted value of the best state-action pair reachable in the successor state.
The Bellman equation can be written for both the state value function V(s) and the action value function Q(s,a) as follows:
V(s) = max_a [ R(s,a) + γ · V(s′) ]

Q(s,a) = R(s,a) + γ · max_a′ [ Q(s′,a′) ]

where R(s,a) is the immediate reward for taking action a in state s, s′ is the resulting next state, and a′ ranges over the actions available in s′.
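When the transition and reward model of the environment is known, the Bellman equation can be applied repeatedly as an update rule until the values stop changing (value iteration). The three-state chain below is a deliberately tiny, made-up model used only to show the backup V(s) ← max_a [ R(s,a) + γ · V(s′) ].

# A tiny deterministic chain: state 0 -> 1 -> 2, where state 2 is terminal.
# model[(state, action)] = (reward, next_state)
model = {
    (0, "right"): (0.0, 1), (0, "stay"): (0.0, 0),
    (1, "right"): (1.0, 2), (1, "stay"): (0.0, 1),
}
gamma = 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0}      # the terminal state 2 keeps the value 0

for _ in range(50):               # repeated Bellman backups until the values settle
    for s in (0, 1):
        V[s] = max(r + gamma * V[s2]
                   for (s_, a), (r, s2) in model.items() if s_ == s)

print(V)   # roughly {0: 0.9, 1: 1.0, 2: 0.0}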
The process of gradually estimating the action values from observed state-action-reward-state tuples is called Q-learning. The Q values are updated using the Bellman equation as an update rule:

Q(s,a) ← Q(s,a) + α · [ r + γ · max_a′ Q(s′,a′) − Q(s,a) ]

where α is the learning rate, which controls how strongly each new observation overwrites the old estimate.
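Putting the pieces together, the sketch below shows tabular Q-learning on a small deterministic grid, combining ε-greedy action selection with the update rule above. The grid layout, rewards, and hyperparameters are illustrative assumptions and not the article's original maze.

import random

N = 3                                            # a 3x3 grid; the goal is the corner (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (2, 2)
alpha, gamma, epsilon = 0.5, 0.9, 0.2            # learning rate, discount factor, exploration rate

def step(state, action):
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), N - 1), min(max(state[1] + dc, 0), N - 1))
    reward = 1.0 if nxt == GOAL else -0.01       # goal reward vs. small step cost
    return nxt, reward, nxt == GOAL

Q = {((row, col), a): 0.0 for row in range(N) for col in range(N) for a in ACTIONS}

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(list(ACTIONS))
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = reward + gamma * max(Q[(nxt, a)] for a in ACTIONS) * (not done)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt

# Read the greedy policy off the learned Q-table
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in {key[0] for key in Q}})

After a few hundred episodes, the greedy policy derived from the Q-table points every state toward the goal corner, which is exactly the kind of optimal path QT is looking for in the maze.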