What is overfitting?

Overfitting is a situation in artificial intelligence in which a model fits a given data set too closely. Statistically speaking, too many explanatory variables are used to specify the model. Overfitting can thus be compared to a human hallucination, where things are seen that are not actually there.

In machine learning, overfitting is undesirable because the algorithm recognises patterns in the data set that do not actually exist and builds its model on them. Machine learning and deep learning algorithms are supposed to derive rules that can be successfully applied to completely unknown inputs and deliver accurate predictions.

An overfitted algorithm can also deliver incorrect results due to faulty inferences. The algorithm is trained on the same data so often that it practically learns the data by heart, yet cannot deliver a useful result for a new input. Overfitting usually shows up as a significant gap between the training error and the test error. Several factors favour overfitting; among them, the number of observations and measurement points plays a major role in model building.
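The gap between training and test error can be illustrated with a small sketch. The example below (an illustration using NumPy, not taken from the text) fits polynomials of low and high degree to a few noisy points sampled from a simple linear trend; the flexible high-degree model memorises the training points almost exactly but generalises worse to held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy sample of a simple underlying trend y = x + noise
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(0, 0.1, size=10)

# Held-out points from the same trend, between the training points
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test + rng.normal(0, 0.1, size=10)

def fit_and_score(degree):
    # Fit a polynomial of the given degree to the training points
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 9):
    tr, te = fit_and_score(degree)
    print(f"degree {degree}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The degree-9 polynomial passes through all ten training points (near-zero training error), yet its oscillations between the points inflate the test error: the signature of overfitting described above.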

The selection of the data set decides whether assumptions derived from the data allow inferences about reality. If certain rules or trends are to be determined from the available data, the data set must contain data suitable for this. Overfitting is also favoured by a biased sample selection, which can stem from bias in data collection or evaluation. Finally, training can be too intensive: an overtrained system handles the existing data very well, but not new, unknown data.

How to avoid overfitting?

There are several techniques used in predictive data mining (with neural networks, classification and regression trees) to avoid overfitting. They are used to control the complexity (flexibility) of the model.
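One common way to control model complexity, sketched below as an assumed illustration (the text does not name a specific technique), is regularisation: a penalty term shrinks the model's coefficients, limiting its flexibility. The example uses closed-form ridge regression on polynomial features with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a simple linear trend
x = np.linspace(0, 1, 12)
y = x + rng.normal(0, 0.1, size=12)

# Degree-9 polynomial features: a deliberately flexible model
X = np.vander(x, 10)

def ridge_fit(lam):
    # Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# A larger lam shrinks the coefficients, reducing flexibility
for lam in (0.0, 1.0):
    w = ridge_fit(lam)
    print(f"lambda={lam}: coefficient norm {np.linalg.norm(w):.2f}")
```

Without the penalty (lambda = 0) the coefficients grow very large as the model chases the noise; with lambda = 1 they are strongly shrunk, which keeps the fitted curve close to the underlying trend.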

To avoid overfitting, one can plan a sufficiently large time window, which allows time for a truly unbiased and thus representative sample. Careful preliminary considerations are also important: it must be clarified which variables are relevant. Finally, the data set should be divided into training and test data sets.
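The train/test split mentioned above can be done in a few lines. This is a minimal sketch with NumPy on made-up toy data (the 80/20 ratio is a common convention, not prescribed by the text): shuffle the indices, hold out a fifth as the test set, and evaluate the trained model only on that held-out part.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data set: 100 observations, 3 features, binary labels
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Shuffle the row indices, then hold out 20% as a test set
idx = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))
```

A model is then fitted on `X_train`/`y_train` only; a large gap between its training and test accuracy is the warning sign of overfitting described earlier.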
