Reinforcement Learning - Algorithms in the Brain

by Dr Philipp Schwartenbeck, Dr Luca Bruder | 18 July 2023 | Tech Deep Dive


Introduction

Our blog articles usually deal with business use cases and the analysis of business data. In this article we would like to take a different approach: we discuss how methods we use to solve business problems have provided insights into neuroscience and the function of biological brains over the last century. The research fields of artificial and biological intelligence have developed in parallel and have strongly influenced each other. Here we focus on reinforcement learning (RL), a great example of how research into the functioning of biological systems on the one hand and research in statistics and computer science on the other can cross-fertilise to develop novel insights.

RL is originally a theory of learning that has been known in psychology for a long time. The underlying idea: actions that are reinforced by rewards are likely to be repeated, while actions that lead to no reward or even to punishment become less frequent (Figure 1). Sutton and Barto formalised the mathematical basis of this "reinforcement learning" and thus founded a new branch of research in artificial intelligence (Sutton, R. S., & Barto, A. G. (1999). Reinforcement learning: An introduction. MIT Press), which forms the basis for the sophisticated algorithms that beat world champions in chess and Go today. Neuroscientists, in turn, use these mathematical formulations to study different brain regions and identify those parts of our brain responsible for learning based on rewards. Decades of research suggest that RL, as formalised by Sutton, Barto and colleagues, may be one of the central mechanisms by which humans and other animals learn.

**Figure 1** Early experiments by Edward Thorndike ("Law of Effect") and B.F. Skinner ("Operant Conditioning") showed that animals repeat rewarded actions and avoid unrewarded or punished actions - the key principle of reinforcement learning. Here in this "Skinner Box", a pigeon learns to press the right button in response to a certain stimulus (picture) in order to receive a reward (food).

For a compact introduction to the definition and terminology behind reinforcement learning, read our basic article on the methodology:

Reinforcement Learning: compactly explained

Reinforcement Learning in Algorithms and Biological Brains

The basic concept of reinforcement learning is to improve behaviour by learning through trial and error. Here, agents learn a so-called "value" V(t) of a state (or state-action pair) at a given time t, which tells the agent how much reward to expect for different options (e.g. how much would I enjoy chocolate or vanilla ice cream if I ordered it?). The key trick is to learn this value iteratively (I just ordered chocolate ice cream - how much did I like it?). To understand this, we can look at one of the most central equations in RL, namely the iterative update of the value function V(t):

V(t+1) = V(t) + α ⋅ (Reward-V(t))

This equation states that the value at the next time step, V(t+1), which formally reflects the expectation of future rewards (how much pleasure do I expect if I choose chocolate ice cream now?), is formed by updating the current value expectation V(t) by the difference between the reward received and the current expectation of what the reward should be (Reward - V(t)). This difference is often referred to as the reward prediction error (RPE). The strength of the update based on the RPE is determined by the so-called learning rate (α, alpha). α essentially determines how much the reward received changes the expectation of future rewards: with α = 0 the expectation never changes, and with α = 1 it is replaced entirely by the most recent reward. You can find a more detailed description in our blog article on basic RL terminology.
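The update rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the constant reward of 1.0 and the two learning rates are assumptions chosen for the example, not values from a real experiment.

```python
def update_value(value, reward, alpha):
    """One step of V(t+1) = V(t) + alpha * (Reward - V(t))."""
    rpe = reward - value  # reward prediction error
    return value + alpha * rpe, rpe


def learn(rewards, alpha, initial_value=0.0):
    """Apply the update once per observed reward and record the value after each step."""
    value = initial_value
    history = []
    for reward in rewards:
        value, _ = update_value(value, reward, alpha)
        history.append(value)
    return history


# Hypothetical scenario: the same reward (1.0) is received five times in a row.
rewards = [1.0] * 5
slow = learn(rewards, alpha=0.1)  # small learning rate: cautious updates
fast = learn(rewards, alpha=0.5)  # large learning rate: quick updates

print("alpha = 0.1:", [round(v, 3) for v in slow])
print("alpha = 0.5:", [round(v, 3) for v in fast])
```

With the larger learning rate the value approaches the true reward within a few steps, while the smaller learning rate makes the estimate change more slowly but also makes it less sensitive to single outcomes.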

To make the connection between RL and the brain, we need to take a short detour into anatomy. The human brain is divided into the neocortex and the subcortical areas (Figure 2). The neocortex is a large, folded area on the outside of the brain and is the structure that people often think of when they talk about the brain as a whole. It consists of several different areas named after their position in relation to the human skull. Of particular importance to us is the frontal lobe, whose front part is formed by the prefrontal cortex. Among many other important functions, such as language and executive control, the prefrontal cortex is thought to play a crucial role in representing important RL variables, such as the value we place on objects and other rewards. Subcortical areas, on the other hand, are located deep in the brain. The most important area we would like to highlight here is a collection of neural structures called the basal ganglia. These structures play an important role in movement control, decision making and reward learning. Another important concept is that of the neurotransmitter.

Neurotransmitters control the communication between different parts of the brain (more precisely: the communication between individual neurons at their synapses). There are different types of neurotransmitters that are thought to control different neuronal processes. The most important neurotransmitter that controls the interaction of regions in the basal ganglia and prefrontal cortex is dopamine, which is also considered the most important neurotransmitter for RL in the brain.

**Figure 2** Important parts of the reward circuit are the nucleus accumbens in the basal ganglia (subcortex) and the prefrontal cortex in the frontal lobe. These regions are strongly influenced by the neurotransmitter dopamine, which originates in the ventral tegmental area behind the basal ganglia. Source: https://openbooks.lib.msu.edu/neuroscience/chapter/motivation-and-reward/ .

In the brain, dopamine has many roles, but here we want to focus on its role in reward learning (Figure 3A). How does it work? Contrary to public perception, dopamine release is not a pleasure or reward signal per se, but a way to encode the difference between expected and actual reward, i.e. the reward prediction error defined above. In other words, the brain releases a large amount of dopamine in response to an unexpected or surprising reward, and dopamine release is dampened when an expected reward does not occur. A fully expected reward, on the other hand, does not alter the dopaminergic firing pattern. The discovery of the close connection between dopaminergic firing patterns and RPEs is one of the most significant neuroscientific findings of the last decades (Schultz, Dayan, & Montague, Science 1997).
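The three cases described above map directly onto the sign of the prediction error. The following sketch makes this explicit; the function name and the reward values of 0 and 1 are illustrative assumptions, and the numbers stand for the direction of the signal, not for actual firing rates.

```python
def reward_prediction_error(expected_reward, received_reward):
    """Positive when the outcome is better than expected, negative when it is worse."""
    return received_reward - expected_reward


# (expected reward, received reward) for the three cases described in the text.
scenarios = {
    "unexpected reward": (0.0, 1.0),
    "fully expected reward": (1.0, 1.0),
    "expected reward omitted": (1.0, 0.0),
}

rpes = {
    name: reward_prediction_error(expected, received)
    for name, (expected, received) in scenarios.items()
}

for name, rpe in rpes.items():
    print(f"{name}: RPE = {rpe:+.1f}")
```

A positive error corresponds to increased dopamine release, an error of zero to an unchanged firing pattern, and a negative error to dampened release.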

For a more in-depth technical introduction that explains reinforcement learning (RL) using a practical example, see our blog post:

Reinforcement Learning - Framework and Application Example

Impact of RL algorithms in neuroscience and beyond

Since this discovery, much work has gone into developing experiments to uncover the dynamics of reinforcement learning in humans and other animals. RL experiments in neuroscience tend to be a little different from what data scientists do with RL models in business use cases. In data science, we create a simulation ("digital twin") in which an agent learns how to optimise a specific task, such as controlling traffic lights or minimising energy costs. Agents learn to take optimal actions to maximise rewards. In this way, we hope to create an agent that can autonomously perform and solve the task by optimising its behaviour to obtain the best possible rewards. In contrast, researchers in neuroscience ask (biological) participants to solve specific learning tasks and analyse how different manipulations affect their (reinforcement) learning behaviour. To do this, reinforcement learning agents are trained to mimic the behaviour observed in the real participants of the experiment. Based on these trained agents, researchers can describe the behaviour in mathematical terms by comparing the actions of the participants with the actions of the trained RL agents.

A very simple example of such an experiment is the bandit task (Figure 3B). In this task, participants have to optimise their actions by choosing the best possible bandit. Each action yields a certain reward with a certain probability. The probabilities are not communicated to the participants, but have to be learned from experience by trying out the different options and observing the results. Often the probability of receiving a certain reward for each action (= choice of a certain option) changes in the course of the experiment. This leads to a dynamic environment where participants have to keep track of the current value of each action. By using the above RL formulation and optimising the parameter α for each participant, we can estimate how efficiently each participant updates their reward expectation and how this affects their behaviour.
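To illustrate how α can be estimated from behaviour, the sketch below first simulates a participant in a two-armed bandit task and then recovers the learning rate by searching for the value of α under which the observed choices are most likely. Several things here are assumptions made for the example: the softmax choice rule with a fixed, known sensitivity `beta`, the stationary reward probabilities (0.8 and 0.2), the grid search, and all function names. Real studies usually fit several parameters at once and use more careful optimisation.

```python
import math
import random


def softmax_probs(values, beta):
    """Turn learned values into choice probabilities (assumed choice rule)."""
    highest = max(values)
    exps = [math.exp(beta * (v - highest)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_participant(reward_probs, alpha, beta, n_trials, rng):
    """Simulate choices and rewards of an agent using the value update from the text."""
    values = [0.0] * len(reward_probs)
    choices, rewards = [], []
    for _ in range(n_trials):
        probs = softmax_probs(values, beta)
        choice = rng.choices(range(len(values)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[choice] else 0.0
        values[choice] += alpha * (reward - values[choice])
        choices.append(choice)
        rewards.append(reward)
    return choices, rewards


def negative_log_likelihood(alpha, beta, choices, rewards, n_arms):
    """How unlikely the observed choices are under a given learning rate."""
    values = [0.0] * n_arms
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        probs = softmax_probs(values, beta)
        nll -= math.log(probs[choice])
        values[choice] += alpha * (reward - values[choice])
    return nll


def fit_alpha(choices, rewards, n_arms, beta, grid):
    """Pick the learning rate on the grid that best explains the choices."""
    return min(
        grid,
        key=lambda a: negative_log_likelihood(a, beta, choices, rewards, n_arms),
    )


rng = random.Random(0)
choices, rewards = simulate_participant(
    reward_probs=[0.8, 0.2], alpha=0.3, beta=5.0, n_trials=500, rng=rng
)
grid = [i / 20 for i in range(1, 20)]  # candidate learning rates 0.05 ... 0.95
estimated_alpha = fit_alpha(choices, rewards, n_arms=2, beta=5.0, grid=grid)
print("estimated learning rate:", estimated_alpha)
```

In an actual experiment the simulated choices would be replaced by the recorded choices and rewards of each participant, giving one estimated learning rate per person.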

**Figure 3** A Results showing that dopamine encodes a reward prediction error (RPE) (Schultz, Dayan & Montague, Science 1997). When an unexpected reward occurs, dopamine neurons fire when the reward arrives (left). However, if the reward is predicted by a stimulus, dopamine neurons fire when the reward is predicted, but not when the reward itself occurs (centre). If the reward is predicted but then fails to occur, the dopamine neurons reduce their firing (right).
**Figure 3** B Bandit tasks are a classic testing ground for reinforcement learning algorithms, where agents must learn to perform the best actions (i.e. choosing the best bandit) based on trial and error, similar to choosing the best slot machine (figures from https://en.wikipedia.org/wiki/Multi-armed_bandit and https://towardsdatascience.com/solving-the-multi-armed-bandit-problem-b72de40db97c)

Studies with this and similar designs have produced a number of clinically important results. These findings may help improve diagnosis and pave the way to new treatments for mental disorders by identifying the mechanisms behind clinically abnormal behaviour. For example, it has been suggested that clinical disorders such as depression may be related to differences in learning from positive and negative feedback (Chong Chen, ... & Ichiro Kusumi, Neuroscience and Biobehavioural Reviews 2015; Reinen, ..., & Schneier, European Neuropsychopharmacology 2021). Imagine you have not only one learning rate (α in the example above) but two, one for learning from positive feedback and one for negative feedback. What happens if they are not equal, but your "positive" learning rate is lower than the "negative" one? You would then update your knowledge of the world much more strongly on the basis of negative feedback, resulting in a negatively biased representation of your environment. A second example relates to the representation of values themselves. In reinforcement learning, it is not only important to learn from experience, but also to accurately represent the values of the options themselves (V(t) in the example above).
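The effect of two unequal learning rates can be shown with a small simulation. The sketch below is a toy illustration under stated assumptions: an option that pays a reward of 1 on half of the trials, learning rates of 0.05 and 0.2, and hypothetical function names. It is not a model taken from the cited studies.

```python
import random


def average_learned_value(rewards, alpha_pos, alpha_neg, initial_value=0.5):
    """Learn with separate learning rates for positive and negative prediction errors.

    Returns the learned value averaged over the second half of the trials.
    """
    value = initial_value
    trace = []
    for reward in rewards:
        rpe = reward - value
        alpha = alpha_pos if rpe > 0 else alpha_neg
        value += alpha * rpe
        trace.append(value)
    second_half = trace[len(trace) // 2:]
    return sum(second_half) / len(second_half)


# Assumed environment: reward of 1 on 50% of trials, so the true average reward is 0.5.
rng = random.Random(1)
rewards = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(2000)]

balanced = average_learned_value(rewards, alpha_pos=0.1, alpha_neg=0.1)
pessimistic = average_learned_value(rewards, alpha_pos=0.05, alpha_neg=0.2)

print("equal learning rates:      ", round(balanced, 2))
print("lower positive learning rate:", round(pessimistic, 2))
```

Both learners see exactly the same outcomes, yet the learner with the lower "positive" learning rate ends up with a clearly lower value estimate. In this setup its estimate is pulled towards p·α⁺ / (p·α⁺ + (1 - p)·α⁻) = 0.2 rather than towards the true average of 0.5.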

This has been another important focus of research on individual differences and clinical symptoms. This individual sensitivity to value differences determines how much you actually care about whether one option is better than the other. Let's say you have learned that you like bananas very much and apples rather less. If these preference differences are very important to you, you will always choose bananas and ignore apples. If you are less sensitive to your preferences, you will choose both fruits more or less equally often. What sounds like an artificial example is a central aspect of the way artificial and biological agents find their way in the world: If they are not sensitive enough to what they think is good or bad in the world, they will behave too arbitrarily, but if they are too sensitive to their preferences, they will always stick to one option and not be able to explore other alternatives or detect changes in the world, such as a sudden improvement in the quality of apples (see the description of the changing bandits above).
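One common way to formalise this sensitivity is a softmax choice rule with a single sensitivity parameter (often called inverse temperature). The article does not commit to a specific rule, so the softmax, the parameter name `sensitivity` and the fruit values of 0.8 and 0.4 are assumptions used purely for illustration.

```python
import math


def choice_probabilities(values, sensitivity):
    """Softmax: higher sensitivity makes choices follow value differences more strictly."""
    highest = max(values)
    exps = [math.exp(sensitivity * (v - highest)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical learned values: bananas are liked more than apples.
fruit_values = {"banana": 0.8, "apple": 0.4}

banana_probability = {}
for sensitivity in (0.0, 2.0, 20.0):
    probs = choice_probabilities(list(fruit_values.values()), sensitivity)
    banana_probability[sensitivity] = probs[0]
    print(f"sensitivity {sensitivity:>4}: P(banana) = {probs[0]:.3f}")
```

With a sensitivity of zero both fruits are chosen equally often, while a very high sensitivity means that apples are almost never tried again, so a sudden improvement in their quality would go unnoticed.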

This problem is at the core of the so-called exploration-exploitation trade-off, which is at the heart of reinforcement learning: how much should agents rely on their knowledge to choose the best option (exploit), and how much should they try new options to learn more about the changing world (explore)? You will not be surprised to hear that both individual learning and resolving the trade-off between exploration and exploitation are associated with dopaminergic function (Chakroun, ..., & Peters, 2020 eLife; Cremer, ..., Schwabe, Neuropsychopharmacology 2022), which is also an important focus of clinical neuroscience (Foley, Psychiatry and Preclinical Psychiatric Studies 2019; Iglesias, ..., & Stephan, WIREs Cognitive Science 2016).

Learn how large language models such as ChatGPT are improved through the use of Reinforcement Learning from Human Feedback (RLHF).

Reinforcement Learning from Human Feedback in the Field of Large Language Models

Current connections between AI and neuroscience

The bidirectional flow of information between artificial intelligence and neuroscience research continues to be fruitful (Hassabis, ..., & Botvinick, Neuron 2017; Botvinick, ..., & Kurth-Nelson, Neuron 2020). For example, it has been shown that storing experiences in a replay buffer and "re-experiencing" them to increase an agent's RL training data can significantly improve the efficiency of reinforcement learning algorithms in AI (Schaul, ..., & Silver, arXiv 2015; Sutton, Machine Learning Proceedings 1990). Recent work has also shown that these algorithms bear a striking resemblance to the neural replay found in the hippocampus, a central brain region for memory formation and navigation in the biological brain (Roscow, ..., Lepora, Trends in Neurosciences 2021; Ambrose, ... Foster, Neuron 2016).
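The idea of storing experiences and learning from them again, commonly known as experience replay, can be sketched as follows. This is a deliberately simple version with uniform random sampling in a two-armed bandit; it is not the prioritised variant of the cited work, and the class name, buffer size, batch size and reward probabilities are all assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores past experiences and returns random batches of them for re-learning."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def store(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)


# Assumed toy environment: two actions with reward probabilities 0.8 and 0.2.
reward_probs = (0.8, 0.2)
values = [0.0, 0.0]
alpha = 0.1
rng = random.Random(0)
buffer = ReplayBuffer(capacity=100, seed=0)

for _ in range(50):
    # One real interaction with the environment.
    action = rng.randrange(2)
    reward = 1.0 if rng.random() < reward_probs[action] else 0.0
    buffer.store((action, reward))

    # Several additional updates from replayed experiences.
    for past_action, past_reward in buffer.sample(batch_size=8):
        values[past_action] += alpha * (past_reward - values[past_action])

print("learned values:", [round(v, 2) for v in values])
```

Each real interaction is used for many value updates instead of one, which is why replay can reduce the amount of real experience an agent needs.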

Much research is also looking at the exact nature of the learning signal found in dopaminergic neurons. The key signature, a gradual shift of learning signals moving backwards in time, has been both supported in recent work (Amo, ..., Uchida, Nature Neuroscience 2022) and questioned (Jeong, ..., Namboodiri, Science 2022; "A decades-old model of animal (and human) learning is under fire", Economist 2023). Such biological insights are crucial for the development of resource-efficient algorithms for reinforcement learning in artificial intelligence.
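The backward shift of the learning signal is what the classic temporal-difference (TD) account of dopamine predicts. The sketch below uses a minimal TD(0) learner in which a cue is followed by a reward a few time steps later. The number of time steps, the parameters, the assumption that the cue itself arrives unpredictably (so the value before the cue stays at zero) and the function name are all simplifications chosen for illustration.

```python
def run_trial(values, alpha, gamma, reward):
    """Run one cue-reward trial and return the prediction error at each time step.

    values[t] is the learned value t steps after the cue; the last entry is a
    terminal state that always has value 0. The first returned error belongs to
    cue onset, the last one to the time of the (possible) reward.
    """
    n_steps = len(values) - 1
    errors = [gamma * values[0]]  # error at cue onset, pre-cue value assumed to be 0
    for t in range(n_steps):
        r = reward if t == n_steps - 1 else 0.0
        delta = r + gamma * values[t + 1] - values[t]
        values[t] += alpha * delta
        errors.append(delta)
    return errors


n_steps, alpha, gamma = 5, 0.2, 1.0
values = [0.0] * (n_steps + 1)

first = run_trial(values, alpha, gamma, reward=1.0)
for _ in range(500):
    run_trial(values, alpha, gamma, reward=1.0)
late = run_trial(values, alpha, gamma, reward=1.0)
omission = run_trial(values, alpha, gamma, reward=0.0)

print("first trial   - cue:", round(first[0], 2), "reward time:", round(first[-1], 2))
print("after training - cue:", round(late[0], 2), "reward time:", round(late[-1], 2))
print("reward omitted - cue:", round(omission[0], 2), "reward time:", round(omission[-1], 2))
```

Before learning, the error occurs when the reward arrives. After training it occurs at the cue instead, and leaving out the expected reward produces a negative error at the time the reward should have occurred.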

Another exciting piece of work suggests that dopaminergic learning signals may not reflect the updating of individual numbers, but rather the learning of the entire distribution of possible rewards and their respective probabilities (Bakermans, Muller, Behrens, Current Biology 2020; Dabney, Kurth-Nelson, ..., Botvinick, Nature 2020). This has important biological and algorithmic implications. Algorithmically, this means that reinforcement learning approximates not only the expected value of a reward, but its entire distribution, which provides a much richer representation of the environment and thus greatly accelerates learning and action control. Biologically, an animal on the verge of starvation needs to know where to find enough food to survive, even if this option is less likely than a safer alternative that does not provide enough food. This and similar work provide important insights into the nature of resource-efficient yet powerful reinforcement learning algorithms, and thus into the nature of artificial and biological intelligence itself.
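One simple way to learn more than a single expected value is to run several learners in parallel that weight positive and negative prediction errors differently. The sketch below is loosely inspired by this idea and is not a reimplementation of the cited models. The three "optimism" levels, the learning rate and the risky option (a reward of 10 on 20% of trials, otherwise nothing) are assumptions chosen for the example.

```python
import random


def update_unit(value, reward, alpha, optimism):
    """Update one value estimate, scaling positive and negative errors differently.

    optimism close to 1 emphasises positive errors, close to 0 negative errors.
    """
    rpe = reward - value
    scale = optimism if rpe > 0 else 1.0 - optimism
    return value + alpha * scale * rpe


rng = random.Random(2)
optimism_levels = [0.1, 0.5, 0.9]  # pessimistic, balanced, optimistic learner
values = [0.0 for _ in optimism_levels]

for _ in range(5000):
    # Assumed risky food source: large reward (10) with probability 0.2, else 0.
    reward = 10.0 if rng.random() < 0.2 else 0.0
    values = [
        update_unit(value, reward, 0.05, optimism)
        for value, optimism in zip(values, optimism_levels)
    ]

for optimism, value in zip(optimism_levels, values):
    print(f"optimism {optimism}: learned value = {value:.2f}")
```

The balanced learner fluctuates around the average reward of 2, while the pessimistic and optimistic learners settle at lower and higher values. Taken together, the estimates carry information about the spread of possible outcomes, which a single average cannot provide.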


Read about the use of reinforcement learning in industry and other relevant sectors in our technical article:

Reinforcement Learning Use Cases for Business Applications

Conclusion

In summary, the flow of knowledge between neuroscience and theoretical RL approaches over the last century has provided essential insights into the principles of both biological and artificial intelligence. Reinforcement learning is a key aspect of modern artificial intelligence applications, ranging from solving challenging control problems to the success story of large language models, but it is strongly rooted in biological science. The development of increasingly powerful algorithms is leading to new insights into how our brains understand the world, and insights into biological intelligence are leading to more efficient and influential algorithms for reinforcement learning in a business context.

Author

Dr Philipp Schwartenbeck

Philipp is a Principal Data Scientist and joined [at] in January 2023. Among other things, he works on reinforcement learning, which he became interested in during his previous work as a computational neuroscientist. When he is not analysing data or thinking about reinforcement learning algorithms, he is interested in various topics ranging from Bayesian inference to competing in Schafkopf tournaments.

Dr Luca Bruder

Dr Luca Bruder has been a Senior Data Scientist at Alexander Thamm GmbH since 2021. Luca completed his doctorate in the field of computational neuroscience and was able to gain experience in AI and data science consulting alongside his doctorate. He can draw on a wide range of experience in the fields of statistics, data analysis and artificial intelligence and leads a large project on the topic of Explainable AI and autonomous driving at Alexander Thamm GmbH. In addition, Luca is the author of several publications in the field of modelling and neuroscience.