What is bagging?

Bagging is an abbreviation of the term "bootstrap aggregating" and refers to a procedure in machine learning for reducing the variance of models such as classification and regression trees.

Besides this increase in the accuracy of classification and regression models, bagging is also used to counter the well-known problem of overfitting. The results of the algorithm are particularly good when the individual learners, i.e. the classification and regression trees, are unstable and have a high variance.

As the word components suggest, the method combines bootstrapping and aggregating in two process steps. Bootstrapping is a procedure in statistics in which random samples are repeatedly drawn from a given data set in order to estimate an unknown distribution function of that data set. Bootstrapping can therefore be classified as resampling, since sub-samples are repeatedly drawn on the basis of one sample (the data set). A prediction model or weak classifier is then trained on each of these individual samples, and their predictions are finally aggregated into a single predicted value.

This is where the name bootstrap aggregating comes from: data is first drawn through repeated sampling (the bootstrapping step) and the resulting prediction models are then combined (the aggregating step). In this way, the methodology fuses information from many models, which can increase the classification or regression performance.
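
To illustrate the bootstrapping step on its own, the following minimal sketch repeatedly draws samples with replacement from a small data set and computes the mean of each resample. The data values, the number of resamples and the choice of the mean as the statistic are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# A small, illustrative data set (the "original sample").
data = np.array([2.3, 1.7, 3.1, 2.9, 2.2, 1.9, 3.4, 2.6])

n_resamples = 1000
bootstrap_means = np.empty(n_resamples)

for i in range(n_resamples):
    # Draw a bootstrap sample: same size as the data, drawn with replacement.
    sample = rng.choice(data, size=data.size, replace=True)
    bootstrap_means[i] = sample.mean()

# The spread of the resampled means approximates the sampling distribution
# of the mean without further assumptions about the underlying data.
print(bootstrap_means.mean(), bootstrap_means.std())
```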

How does the ensemble method work?

An ensemble method, or ensemble learning, means that several (weak) learners or classifiers are combined and evaluated together, creating a so-called ensemble. Ensemble methods are therefore also referred to as a meta-approach to machine learning, since several models are combined into one prediction.

As described at the beginning, bagging (bootstrap aggregating) draws multiple samples from a data set and then trains and tests the same algorithm in parallel on each sample. Usually these are random sub-samples of the data set, although it would also be possible to split the entire data set and generate the data distribution from that. When the data is selected by random sampling, this corresponds to drawing with replacement: certain data points can end up in a sample several times (through repeated random selection), while others may not be included at all.
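
The following small sketch (with purely illustrative sizes and index values) shows this effect of drawing with replacement: some indices of the original data set appear several times in a bootstrap sample, while others do not appear at all.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10                                          # size of the original data set
indices = rng.choice(n, size=n, replace=True)   # drawing with replacement

print("drawn indices:", np.sort(indices))
# Indices drawn more than once appear in the sample several times ...
print("duplicates:", [i for i in range(n) if np.sum(indices == i) > 1])
# ... while a substantial share of the indices is typically not drawn at all.
print("left out:  ", [i for i in range(n) if i not in indices])
```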

After the samples have been generated, the learning algorithm is applied to each ensemble member, with the individual models being trained in parallel and independently of each other. Finally, the individual prediction models are aggregated, resulting in the final ensemble classifier. The individual models or algorithms can enter the combined classifier either with equal weights or with different weights.
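
Put together, a minimal bagging ensemble can be sketched as follows. The sketch assumes decision trees as base learners, a synthetic data set, and a simple unweighted majority vote as the aggregation step; all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_estimators = 25
trees = []

for _ in range(n_estimators):
    # Step 1 (bootstrapping): draw a sample of the data with replacement.
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    trees.append(tree)

# Step 2 (aggregating): each tree votes, the majority decides.
votes = np.array([tree.predict(X) for tree in trees])   # shape (n_estimators, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)         # unweighted majority vote for labels 0/1

print("training accuracy of the ensemble:", (y_pred == y).mean())
```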

What is the difference between bagging and boosting?

In addition to bagging, so-called boosting is another ensemble method in machine learning.

In contrast to bagging, the (weak) classifiers are not trained in parallel but sequentially. In both methods, a base sample is drawn at the beginning. Thanks to the iterative, sequential approach of this ensemble method, the findings from previous steps can be applied to subsequent steps. This is achieved by weighting incorrectly classified examples differently from correctly classified ones.

The aim of boosting is to create a strong classifier from a large number of weak classifiers. While weights can in principle also be used in bagging, in boosting their size depends on the progress of the previous sequential steps, whereas in bagging the weights are defined in advance, because the process runs in parallel.
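
To make this difference concrete, the following sketch shows an AdaBoost-style weight update, one common form of boosting: after each sequential round, misclassified points are weighted up so that the next weak learner focuses on them. The update formula follows the classical AdaBoost scheme; the data set, variable names and number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)           # AdaBoost conventionally uses labels -1 / +1

n_rounds = 25
w = np.full(len(X), 1 / len(X))         # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this round's weak learner.
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # influence of this learner in the final vote

    # Sequential step: increase the weights of misclassified points,
    # decrease those of correctly classified ones, then renormalize.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final strong classifier: weighted vote of all weak learners.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
y_pred = np.sign(scores)
print("training accuracy:", (y_pred == y).mean())
```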

Another difference between the two methods is their objective. Bagging aims to reduce the variance of the individual classifiers by combining them, while boosting aims to reduce the systematic error, i.e. the bias, of the predictions. In this sense, bagging can help solve the overfitting problem, whereas boosting does not.

Both methods can be implemented in Python, where the scikit-learn library provides ready-made implementations of these ensemble methods, so they can be used with relatively little effort.
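
As a rough sketch of how this looks with scikit-learn (class and parameter names as in recent scikit-learn versions; in older versions the `estimator` argument of `BaggingClassifier` is called `base_estimator`), using a synthetic data set for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: decision trees trained in parallel on bootstrap samples.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))

# Boosting: weak learners trained sequentially on reweighted examples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)
print("boosting accuracy:", boosting.score(X_test, y_test))
```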