Temporal Difference Learning

What is Temporal Difference Learning?

Temporal Difference Learning (also called TD Learning) is a form of reinforcement learning. Reinforcement learning is one of the three learning paradigms of machine learning, alongside supervised learning and unsupervised learning.

As with other reinforcement learning methods, Temporal Difference Learning does not require labelled training data as a starting point. The system, a so-called software agent, learns through a trial-and-error process in which it receives a reward for a sequence of decisions or actions and adjusts its future strategy accordingly. The algorithm is modelled as a Markov decision process, in which the benefit for a software agent results from a sequence of actions.

Unlike other learning methods, TD learning updates the evaluation function with the appropriate reward after each individual action, rather than only after a sequence of actions has been completed. In this way, the strategy iteratively approaches the optimal function. This process is called bootstrapping and aims to reduce the variance in finding a solution.
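The per-action update described above can be sketched in a few lines. The toy random-walk environment, the step size alpha, and the discount factor gamma below are illustrative assumptions, not part of the article:

```python
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V(s) toward the bootstrapped target r + gamma * V(s_next)
    immediately after a single action, instead of waiting for the
    end of the episode."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

# Toy 5-state random walk: start in the middle, reward 1 on reaching
# the right end (state 4); episodes end at either end.
random.seed(0)
V = [0.0] * 5
for _ in range(500):
    s = 2
    while s not in (0, 4):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 4 else 0.0
        V = td0_update(V, s, r, s_next)
        s = s_next
```

After training, states closer to the rewarding end carry higher value estimates, even though each update only ever looked one step ahead.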

What algorithms exist in TD learning?

Within Temporal Difference Learning, several algorithms exist to implement the method.

In Q-learning, the software agent evaluates the utility of an action to be performed instead of the utility of a state, and chooses the action with the greatest increase in utility based on the current evaluation function. For this reason, Q-learning is said to learn an "action-value function" instead of a "state-value function".
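A minimal sketch of the Q-learning update rule, assuming a tiny hand-built two-state example (all names, rewards, and parameters here are illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy update: the target bootstraps from the greedy
    (maximum) action value in the next state, regardless of which
    action the agent will actually take there."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q

# Two states, two actions: in state 0, action 1 leads to the terminal
# state 1 with reward 1; action 0 stays in state 0 with reward 0.
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
    Q = q_learning_update(Q, s=0, a=0, r=0.0, s_next=0)
```

The values converge to Q(0, 1) = 1 and Q(0, 0) = 0.9: even the "stay" action acquires value, because its target bootstraps from the best action available afterwards.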

SARSA (short for "state-action-reward-state-action") is likewise an algorithm with an action-value function. Despite this commonality, SARSA differs from Q-learning in that Q-learning is an off-policy algorithm, whereas SARSA is an on-policy algorithm. An off-policy algorithm updates its estimates using the best possible action in the next state, regardless of which action its strategy would actually choose, whereas an on-policy algorithm takes into account both the next state and the action its current strategy actually selects there; the agent thus remains true to its strategy when calculating the subsequent action. The algorithms considered so far only take into account the immediate reward of the next action.
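The on-policy/off-policy contrast is easiest to see in the update rule itself. This sketch uses the same illustrative kind of toy values as above; only the bootstrap term differs from Q-learning:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: the target bootstraps from Q(s_next, a_next),
    where a_next is the action the current strategy actually selects
    in the next state, not the greedy maximum as in Q-learning."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
    return Q

# Single deterministic update: Q(0, 0) moves halfway toward the
# target 1.0 + 0.9 * Q(1, 0) = 1.45, giving 0.725.
Q = [[0.0, 0.0], [0.5, 1.0]]
Q = sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0)
```

Had the update been greedy (off-policy), it would have bootstrapped from max(Q[1]) = 1.0 instead of Q[1][0] = 0.5 and produced a larger target.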

So-called TD n-step methods, on the other hand, also include the rewards of the next n steps.
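An n-step target can be sketched as the discounted sum of the next n rewards plus the bootstrapped value of the state reached after n steps (function and variable names are illustrative):

```python
def n_step_return(rewards, V, s_after_n, gamma=0.9):
    """Target for an n-step TD update, where n = len(rewards):
    the discounted sum of the observed rewards plus the discounted
    value estimate of the state reached after n steps."""
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    return G + gamma ** len(rewards) * V[s_after_n]

# Three observed rewards, then bootstrap from V of the state reached:
# 1.0 + 0.9*0.0 + 0.81*1.0 + 0.729*2.0 = 3.268
V = [0.0, 0.0, 0.0, 2.0]
G = n_step_return([1.0, 0.0, 1.0], V, s_after_n=3)
```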

TD Lambda, written TD(λ), is an extension of the Temporal Difference Learning algorithm. Here, not only a single state leads to an adjustment of the evaluation function; within a sequence, the values of several states can be adjusted. The decay rate λ regulates the extent of the possible change for each individual state: the further a state lies back from the state currently under consideration, the smaller its adjustment becomes, decreasing exponentially with each step. TD-Lambda can also be applied to the Q-learning and SARSA methods.
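One common way to implement this is with accumulating eligibility traces, which realise the exponential decay described above. The small three-state example and all parameters below are illustrative assumptions:

```python
def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step with accumulating eligibility traces:
    the TD error of the current transition adjusts every recently
    visited state, weighted by a trace that decays exponentially."""
    delta = r + gamma * V[s_next] - V[s]
    e[s] += 1.0                       # current state becomes fully eligible
    for i in range(len(V)):
        V[i] += alpha * delta * e[i]  # all eligible states share the error
        e[i] *= gamma * lam           # traces fade by gamma * lambda per step
    return V, e

# Two consecutive transitions 0 -> 1 -> 2; the second reward also
# propagates back to state 0 via its remaining trace.
V, e = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
V, e = td_lambda_update(V, e, s=0, r=1.0, s_next=1)
V, e = td_lambda_update(V, e, s=1, r=1.0, s_next=2)
```

After the second transition, state 0 has been adjusted twice (0.1 from its own step, 0.072 via its decayed trace), while state 1 received only its own share of 0.1.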

What are these algorithms used for in practice?

The areas of application of Temporal Difference Learning in the context of reinforcement learning methods are manifold. A striking example of its use is the program TD-Gammon, which plays the game Backgammon and was developed using a TD-Lambda algorithm. The same applies to AlphaGo, which is based on the board game Go.

One application of Q-learning can be found in autonomous driving in road traffic, where the system independently learns collision-free overtaking manoeuvres and lane changes and then maintains a constant speed.

SARSA, on the other hand, can be used to detect credit card fraud, for example. In one such approach, the SARSA method provides the algorithm for detecting fraud, while the classification and regression method of a Random Forest optimises the accuracy of the credit card default prediction.
