With this article we want to bridge the gap between the basic understanding of reinforcement learning (RL) and solving a problem with RL methods. The article is divided into three sections. The first section is a brief introduction to RL. The second section explains the main concepts needed to formulate an RL problem with an example. Finally, in the third and last section, we present a basic implementation for solving a problem with RL.
Table of Contents
What is Reinforcement Learning?
Reinforcement learning (RL) is a type of machine learning. It involves training agents to make decisions in an environment by learning from the consequences of their actions. The agent receives feedback in the form of "rewards" or "punishments", which it uses to update its decision-making strategy and improve its performance over time. The agent learns through trial and error, i.e. it tries different actions and receives feedback in the form of rewards or punishments. RL can be used for any process or system that involves a sequential decision-making process that could be optimised.
RL has applications in areas such as autonomous driving, robotics, control systems and gaming AI, as well as in business and finance, for example in optimising decision-making in portfolio optimisation and resource allocation. It has also been used to improve the performance of dialogue systems (chatbots) such as ChatGPT, which have attracted a great deal of attention in recent months.
Reinforcement Learning - Example & Definitions
This section provides an overview of the definitions and descriptions of the most important terms needed to understand the dynamics of RL algorithms. Complex terms are always easier to understand with a simple visual example. Therefore, the various terms are explained below with the help of a simple maze problem.
The main goal of our robot QT is to win the game by reaching the target field (also known as the target state), which is marked by coins, by finding and following the optimal path. To move around the maze, QT can perform four actions: go up, go down, go left and go right. There are some obstacles in this environment such as the lightning field and the impassable block (grey box). These obstacles can either kill QT or prevent him from performing his intended action (movement). Since QT has no prior knowledge of the maze, the idea is that QT tries to explore the environment by randomly navigating and finding a possible path to the goal.
To better understand how QT can achieve this, it is important to understand some basic RL components and terms that are also used to formulate the problem mathematically.
The labyrinth is called the environment and the robot QT is called the agent. In reinforcement learning, the environment is the system, simulation or space in which the agent can act. Most importantly, the agent cannot change the rules or dynamics of the environment in any way. The environment is the agent's world, in which it interacts by performing certain actions and receiving feedback. The agent interacts with the environment and makes decisions. This can be a robot, a software programme or another system that can perform actions and then receive a reward. In the example above, QT (the RL agent) interacts with the maze (environment) by moving in one of the directions and updating its position in the maze. Below are the different components and features of agent-environment interaction:
For a compact introduction to the definition and terminology behind reinforcement learning, read our basic article on the methodology:
Reinforcement Learning Fundamental Definitions
 State: The parameters or observations that describe the agent's environment are collectively called the state of the environment. The state includes everything that the agent can observe about the environment, as well as all of the agent's internal variables that may be relevant to its actions. In the reinforcement learning example above, the state of the robot is its position in the maze. Some environments, like the example above, have special states called terminal states, which include the target state (coin field) and other terminal states (lightning field). A terminal state is a state of the environment from which the agent cannot proceed once it has been reached. In this case, the agent ends its current learning cycle and a new learning cycle begins with the newly gained knowledge about the environment.
 Action: An action is a decision made by the agent and is the agent's means of interaction with the environment. It can be a physical movement, a change in the agent's internal state or any other type of decision that the agent can make. The consequences of the actions performed by the agent are the change in the state of the environment and the feedback it receives. In the reinforcement learning example above, our robot has four possible actions (up, down, left or right) that it can perform to change its state.
 Reward: When an agent performs an action, the environment not only provides feedback in the form of the new state, but also a reward signal. A reward represents the benefit the agent gains on entering the new state as a result of its last action. The reward component is of utmost importance in RL, as it guides the agent in finding the optimal path or process. The concepts in the next section explain that the agent essentially tries to maximise the cumulative reward it receives from the environment in order to find the shortest path, as is the case for QT in the example above. In the example, the reward for stepping on the target state (coins) is +10 points, the reward for stepping on the lightning field is -10 points, and for all other states it could be 0 or -1 points.
 Transition probability: This is one of the most important properties of some environments. It defines the probability of passing from one state to another by performing a certain action in the current state. This parameter determines whether the environment is stochastic or deterministic. Accordingly, the agent's behaviour can differ in the same environment. An example makes this concept easier to understand.
 Deterministic environment: For such environments, the transition probability is set to 1. This means that the transition of the agent from one state to another is determined only by the action it takes in the current state. In the following example, if QT decides to move to the right, it would end up in the lightning field and definitely die! If, on the other hand, it decides to move upwards, it would be one step closer to the target state. For the rest of the article, the environment will remain deterministic.

 Stochastic environment: In such environments, the transition of the agent from one state to another is determined both by the action it performs and by the transition probability distribution, which determines which states it can transition to. The transition probabilities can be defined as follows (taken from the Frozen Lake problem): the agent moves in its intended direction with a probability of 1/3 (33 %) or in one of the perpendicular directions with the same probability of 1/3 (33 %) each. Assuming QT decides to move to the right, the probability of it landing in the lightning field is 33 %, and of landing in one of the two blocks above and below it 33 % each. This means that unlike the environment above, where QT must avoid moving to the right at all costs in order to survive, it can now take the action of moving to the right, as this offers a survival chance of 66 %.
 Exploration vs. exploitation: In RL, exploration and exploitation are two concepts that refer to the balance between trying new actions to discover new states (exploration) and taking the best known action in the current state (according to QT's current assessment) to achieve the greatest possible reward (exploitation).
 During the exploration phase, the agent tries out new actions to learn more about the environment and improve its decision-making. This helps the agent discover states with high rewards and learn actions to reach them. In the exploitation phase, it tries to reach the terminal states in the optimal way to maximise the cumulative reward. In the reinforcement learning example above, QT would first try actions to discover different states and possible paths to the goal. Sometimes it will end up in the lightning field and sometimes in the coin field, but it will gather knowledge and learn which state-action combinations lead to these states. With increasing experience, it will gradually learn to find the optimal path (in our example, the shortest path).
 The balance between exploration and exploitation is often referred to as the exploration-exploitation trade-off. In general, the more an agent explores, the more knowledge it gains about the environment, but it may take longer to obtain the highest reward and find the optimal strategy. On the other hand, the more an agent exploits, the faster it can obtain a high reward, but it does not necessarily find the optimal strategy and may miss out on discovering very rewarding states. Some common strategies that help to balance exploration and exploitation are described below:
 Epsilon-greedy algorithm (ε-greedy): In this algorithm, the agent starts with a high exploration probability (ε) and a low exploitation probability (1-ε). This allows the agent to explore new actions initially and gradually exploit the best known actions from experience. The rate at which the value of ε changes is a hyperparameter that depends on the complexity of the RL problem. It can be linear, exponential or defined by a mathematical expression.
 Upper Confidence Bound (UCB): In this algorithm, the agent chooses an action based on the balance between the expected reward and the uncertainty of the action. This allows the agent to explore actions with high uncertainty, but also to exploit actions with high expected rewards. In UCB, the agent keeps an estimate of the expected reward of each action and a measure of the uncertainty of that estimate. The agent chooses the action with the highest "Upper Confidence Bound" value, which is the sum of the expected reward and a term representing the uncertainty of the estimate. The uncertainty term is often chosen proportional to the square root of the natural logarithm of the total number of steps divided by the number of times the action has been chosen. The intuition behind this is that the uncertainty of the estimate should decrease as an action is chosen more often, as the agent has more data on which to base its estimate.
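The ε-greedy strategy described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete RL loop; the function names and the exponential decay schedule with a lower floor are assumptions made for this sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

def decay_epsilon(epsilon, decay_rate=0.99, min_epsilon=0.05):
    """Exponentially shrink epsilon towards a floor after each episode."""
    return max(min_epsilon, epsilon * decay_rate)
```

With ε = 0 the agent always exploits (it picks the action with the highest estimated value); with ε = 1 it always explores. Decaying ε over time shifts the agent gradually from exploration to exploitation.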
Mathematical formulation of the terminologies based on the reinforcement learning example
Having learned the basic terminology associated with the "labyrinth problem", let us formulate and understand this terminology from a mathematical perspective.
 The robot QT (agent) receives the state S_{0} from the environment - the current block of the robot in the environment
 Based on this state S_{0}, QT (the agent) performs an action A_{0} - say, QT moves upwards
 The environment changes to a new state S_{1} - a new block
 The environment gives the agent a reward R_{1} - QT is not dead yet (positive reward, +1)
 The transition probability is expressed as P(S_{1} | S_{0}, A_{0}) or T - here it is 1, since the environment is deterministic
The above loop represents one step of the agent. The agent must take several such steps {S_{0}, A_{0}, R_{1}, S_{1}, T_{0}} to explore its environment and identify the good and bad states together with the respective actions that lead to them. Finally, it must navigate the optimal path to the goal. Now, in order to formulate and solve a decision or optimisation problem using reinforcement learning, some very important concepts and terms need to be explained.
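The interaction loop above can also be sketched as code. The following Python function is a hypothetical deterministic step function for the maze example; the 3 × 4 layout, the state numbering (0 to 11, row by row), the positions of the coin and lightning fields and the exact reward values are assumptions made for illustration (the impassable grey block is omitted for brevity):

```python
# Minimal sketch of one agent-environment step for the maze example.
# States are numbered 0..11 row by row on an assumed 3 x 4 grid; the
# coin and lightning positions and the rewards are illustrative only.

GOAL, LIGHTNING = 11, 7          # assumed positions of coin and lightning fields

def step(state, action, cols=4):
    """Deterministic transition: return (next_state, reward, done)."""
    row, col = divmod(state, cols)
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, 2)
    elif action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, cols - 1)
    next_state = row * cols + col
    if next_state == GOAL:
        return next_state, 10, True      # coin field: +10, episode ends
    if next_state == LIGHTNING:
        return next_state, -10, True     # lightning field: -10, episode ends
    return next_state, -1, False         # ordinary move: small step penalty
```

Each call corresponds to one loop iteration: the agent supplies (S, A) and the environment returns (S', R) plus a flag indicating whether a terminal state was reached.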
In our Deep Dive, we highlight the interactions between business methods, neuroscience and reinforcement learning in artificial and biological intelligence.
DISCLAIMER: Now begins a mathematical deep dive!
Mathematical modelling of a decision problem
1. Markov Decision Process (MDP): A Markov decision process is one in which the future states depend only on the present state and are in no way connected to the past states that led to the present state. This mathematical framework is used for modelling decision problems in reinforcement learning. Given a sequence of states, actions and rewards s0, a0, r0, s1, a1, r1, s2, a2, r2, ..., st, at, rt (the so-called history) observed by an agent, the state signal is said to have the Markov property if it satisfies the following equation:
P(S_{t+1} = s′, R_{t+1} = r′ | s_{t}, a_{t}) = P(S_{t+1} = s′, R_{t+1} = r′ | s_{t}, a_{t}, r_{t}, ..., s_{0}, a_{0}, r_{0}), where:
 S is a set of states, S = {s0, s1, ...}
 A is a set of actions, A = {a0, a1, ...}
 T is a set of transition probabilities, T = {P(s_{t+1} | s_{t}, a_{t}), P(s_{t+2} | s_{t+1}, a_{t+1}), P(s_{t+3} | s_{t+2}, a_{t+2}), ...}
 R is a set of rewards, R = {r0, r1, ...}
2. Cumulative Return and Discount Factor
a) RL agents learn by choosing an action that maximises the cumulative future reward. The cumulative future reward is called the return and is often denoted by R. The return at time t is defined as follows:
R_{t} = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_{T}
b) This equation only makes sense if the learning task ends after a small number of steps. If the learning task requires many sequential decisions (e.g. balancing a bar), the number of steps is high (or even infinite), so the value of the return can be unbounded. Therefore, it is more common to use the cumulative discounted future reward G, which is expressed as follows:
G_{t} = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ_{k=0}^{∞} γ^{k}·r_{t+k+1}
Here γ is called the discount factor and lies in the range [0, 1]. Gamma (γ) controls the importance of future rewards relative to immediate rewards. The lower the discount factor (closer to 0), the less important the future rewards are, and the agent will tend to focus only on the actions that yield maximum immediate rewards. When the value of γ is higher (closer to 1), each future reward is almost equally important. If the value of γ is set to 1, the expression for the cumulative discounted reward G is the same as that for the return R.
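The effect of γ can be illustrated with a small helper that computes the cumulative discounted reward for a list of observed rewards (the function name is our own, chosen for this sketch):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward G = r1 + gamma*r2 + gamma^2*r3 + ...

    Computed backwards for numerical simplicity: G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 1 the discounted return equals the plain sum of rewards;
# with gamma = 0 only the immediate reward counts.
```

For the reward sequence [0, 0, 10], for example, γ = 1 gives G = 10, while γ = 0.5 gives G = 0.5² · 10 = 2.5: the further away the reward, the more strongly it is discounted.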
3. Policy: A policy (denoted π(s)) is a set of rules or an algorithm that the agent uses to determine its actions. The policy determines how the agent behaves in each state and which action it should choose in a particular state. For an agent, a policy is a mapping from states to their respective optimal actions. Initially, when the agent starts learning, a policy consists of a mapping of states to random actions. During training, however, the agent receives various rewards from the environment that help it optimise the policy. The optimal policy is the one that allows the agent to maximise the reward in each state. It is denoted by π* and is a mapping from states to the respective optimal actions in those states.
4. Action Value and State Value Functions: To optimise the policy, the agent has to find out two things: first, which are the good and bad states, and second, which are the best actions for each state. Two different value functions can be used for this purpose, namely the state value function V(s) and the action value function Q(s,a).
a) State value function: The state value function describes how "good" it is for the agent to be in a given state. Mathematically, the state value function estimates the expected future utility of a given state. These values indicate the expected return, assuming the agent starts in that state and follows its policy π.
It is referred to as V(s) or V_{π}(s), where s is the current state. It is mathematically defined as:
V_{π}(s) = E_{π}[G_{t} | S_{t} = s]
b) Action value function: The action value function assesses "the quality of an action in a state", i.e. how good a particular action is in a particular state. Mathematically, it estimates the expected future utility of performing a given action in a given state and subsequently following the policy. It is similar to the state value function, except that it is more specific, because it evaluates the quality of each action in a state and not just the state itself.
It is referred to as Q(s,a) or Q_{π}(s,a), where s is the current state and a is the action performed:
Q_{π}(s,a) = E_{π}[G_{t} | S_{t} = s, A_{t} = a]
5. Bellman Equation: The Bellman equation is fundamental to the estimation of the two value functions. It defines the relationship between the current state-action pair (s, a), the observed reward and the possible successor state-action pairs. This relationship is used to find the optimal value function. The Bellman equation defines the value of a state-action pair Q(s,a) as the immediate reward in state s plus the discounted value of the best next state-action pair (according to the current policy).
The Bellman function can be defined for both the statevalue function V(s) and the actionvalue function Q(s, a) as follows:
V(s) = max(a) [R(s,a) + γ * V(s')].
Q(s,a) = R(s,a) + γ * max(a') [Q(s',a')], where
 V(s) is the value of state s
 Q(s,a) is the action value function for performing action a in state s
 s' is the next state
 a' is the next action
 γ is the discount factor Gamma
 max(a) is the maximum value over all possible actions
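The Bellman equation for V(s) can be applied iteratively until the values stop changing (value iteration). The following sketch runs this update on a tiny hypothetical 3-state chain, not on the maze example; all transitions, rewards and the discount factor are invented for illustration:

```python
# Tiny value-iteration sketch applying the Bellman equation
# V(s) = max_a [R(s,a) + gamma * V(s')] on a hypothetical 3-state chain.
# State 2 is terminal and has no outgoing actions; its value stays 0.

GAMMA = 0.9
# transitions[s][a] = (reward, next_state); all numbers are invented
transitions = {
    0: {"right": (-1, 1), "stay": (-1, 0)},
    1: {"right": (10, 2), "stay": (-1, 1)},
}

def value_iteration(transitions, gamma, sweeps=50):
    v = {0: 0.0, 1: 0.0, 2: 0.0}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            # Bellman backup: best action value over all successors
            v[s] = max(r + gamma * v[s2] for r, s2 in actions.values())
    return v

values = value_iteration(transitions, GAMMA)
```

At convergence V(1) = 10 (one step from the +10 reward) and V(0) = -1 + 0.9 · 10 = 8: each additional step away from the reward costs the step penalty and one factor of γ.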
Deepen your understanding of the concept of the "Deadly Triad" in reinforcement learning, its implications and how to deal with it. This Deep Dive gives you an overview of RL concepts, an introduction to the "Deadly Triad" and strategies for coping with it.
The implementation of reinforcement learning
The procedure for stepwise estimation of action values from the observed state-action-reward-state tuples is called Q-learning. The Q-values are updated using the Bellman equation:
 Q(s, a) = Q(s, a) + α * (R(s,a) + γ * max(a') [Q(s', a')] - Q(s, a)), where
 Q(s,a) is the Q-value for performing action a in state s
 R(s,a) is the immediate reward for performing action a in state s
 s' is the next state
 a' is the next action
 γ is the discount factor gamma
 α is the learning rate alpha
 Q-tables: Q-tables are a data structure (a two-dimensional table) in which all possible action values Q(s,a) are stored. Each entry in the table represents the Q-value for a particular state-action pair. The Q-table is initially filled with arbitrary values (or zeros) and is updated over time as the agent discovers new states and receives rewards as it transitions to those states. The policy is derived from the Q-table to determine the best action for each state. In the example above, there are 12 states (a 3 × 4 grid) and 4 actions (up, down, left, right). Therefore, the Q-table is a matrix with 12 rows (one for each state) and 4 columns (one for each action).
 Training the reinforcement learning agent (training means estimating the Q-values) requires iterative interaction of the agent with the environment. This training is performed in episodes: an episode refers to a single run of the agent-environment interaction that begins with an initial state (or start state) and ends at a terminal state or when a certain number of steps have been performed. An episode consists of all the decisions an agent makes during that single interaction. In the example above, the episode ends when QT reaches one of the terminal states, the lightning or the coin field. This is because in the lightning field QT is killed and training must be restarted, while in the coin field QT has successfully reached its goal and training must be restarted to improve its performance. Once QT has been trained and the Q-table optimised, the optimal strategy (or "optimal policy") can be extracted by selecting the best action (highest expected future reward) for each state.
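Putting the pieces together, a Q-learning training loop for the maze example could be sketched as follows. The grid layout, the positions of the coin and lightning fields, the reward values, the hyperparameters and the omission of the impassable grey block are all simplifying assumptions made for this sketch:

```python
import random

# Q-learning sketch for an assumed 3 x 4 maze. States are numbered
# 0..11 row by row; coin field 11 and lightning field 7 are assumptions,
# as are the rewards and hyperparameters below.

N_STATES, N_ACTIONS = 12, 4          # actions: 0=up, 1=down, 2=left, 3=right
GOAL, LIGHTNING = 11, 7
ALPHA, GAMMA = 0.5, 0.9              # learning rate and discount factor
EPSILON = 0.1                        # exploration probability

def step(state, action):
    """Deterministic maze transition: return (next_state, reward, done)."""
    row, col = divmod(state, 4)
    row = max(row - 1, 0) if action == 0 else min(row + 1, 2) if action == 1 else row
    col = max(col - 1, 0) if action == 2 else min(col + 1, 3) if action == 3 else col
    s2 = row * 4 + col
    if s2 == GOAL:
        return s2, 10, True
    if s2 == LIGHTNING:
        return s2, -10, True
    return s2, -1, False

def train(episodes=500, seed=0):
    random.seed(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q-table, all zeros
    for _ in range(episodes):
        state, done = 0, False                         # each episode starts at state 0
        while not done:
            if random.random() < EPSILON:              # epsilon-greedy action choice
                action = random.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])
            s2, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(q[s2])
            q[state][action] += ALPHA * (target - q[state][action])   # Q-update
            state = s2
    return q

q_table = train()
# Greedy policy: best action per state, read off the trained Q-table
policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(N_STATES)]
```

After training, the row of the Q-table for state 10 (directly left of the coin field) has its highest value for "right", and the extracted policy follows the learned path from the start state to the goal while avoiding the lightning field.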
Read about the use of reinforcement learning in industry and other relevant sectors in our technical article: