What is bootstrapping?

Bootstrapping is a statistical method for estimating distribution functions that relies on so-called resampling. Resampling means repeatedly drawing sub-samples from an initial sample in order to draw conclusions about characteristics of the original distribution, such as its mean or standard deviation.
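
As a minimal sketch of this idea, the following R snippet (with an illustrative, randomly generated sample and an arbitrary choice of 2,000 resamples) estimates the standard error of the mean purely by resampling:

```r
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical initial sample
# draw 2,000 resamples of the same size (with replacement) and record each mean
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
sd(boot_means)                        # bootstrap estimate of the standard error
```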

Bootstrapping is used when the distribution function of a statistic is unknown and must therefore be estimated. A prerequisite for the method to work is that the initial sample is sufficiently large.

The name bootstrapping derives from the English idiom of pulling oneself up by one's own bootstraps and has its origin in the stories of Baron von Münchhausen. Just as the Baron, according to the story, pulled himself out of a swamp by his own boots, the bootstrap method pulls its sub-samples out of the sample itself.

Which statistical methods are used in bootstrapping?

Since bootstrapping only describes a general statistical procedure, various concrete bootstrap methods have been developed for different areas of application. In the i.i.d. bootstrap (independent and identically distributed), for example, sub-samples of a fixed size are repeatedly drawn from the initial sample with replacement. After the defined number of sample repetitions, the approximated distribution of the statistic can be used to construct a confidence interval.
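
A minimal sketch of the i.i.d. bootstrap in R, assuming an illustrative exponential sample and an arbitrary 5,000 repetitions, constructs a percentile confidence interval for the median:

```r
set.seed(42)
x <- rexp(80, rate = 0.2)                 # hypothetical skewed initial sample
B <- 5000                                 # defined number of sample repetitions
# resample with replacement and record the median of each resample
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))   # 95% percentile confidence interval
```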

Since the i.i.d. bootstrap cannot reproduce temporal correlation in the data, the block bootstrap is used in such cases. In this method, the data are divided into contiguous blocks in a preparatory step. Decomposing the time series into a trend component and a residual component creates the basis for the procedure; the residual component corresponds to the difference between each measured value and its estimated value. Residual blocks are then repeatedly drawn with replacement until the length of the original signal, i.e. the initial sample, is reached, and are added back onto the trend component. By repeating this sample repetition, the bootstrap can finally reproduce the temporal correlation structure of the data.
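
The following R sketch illustrates one such residual block bootstrap under simplifying assumptions: a synthetic series with a linear trend and AR(1) noise, an arbitrary block length of 10, and a simple linear fit as the trend estimate:

```r
set.seed(1)
n <- 120
t <- 1:n
# hypothetical series: linear trend plus autocorrelated noise
y <- 0.05 * t + as.numeric(arima.sim(model = list(ar = 0.6), n = n))
trend <- fitted(lm(y ~ t))               # estimated trend component
res   <- y - trend                       # residual: measurement minus estimate
block_len <- 10
n_blocks  <- n %/% block_len
starts    <- seq_len(n - block_len + 1)  # admissible block start positions
one_replicate <- function() {
  # draw residual blocks with replacement until the original length is reached
  idx <- unlist(lapply(sample(starts, n_blocks, replace = TRUE),
                       function(s) s:(s + block_len - 1)))
  trend + res[idx]                       # add resampled residuals onto the trend
}
boot_series <- replicate(500, one_replicate())  # one bootstrap series per column
```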

In contrast to the non-parametric bootstrapping methods described above, parametric bootstrapping assumes a specific distribution for the initial sample. Non-parametric bootstrapping exploits the principle of non-parametric statistics that no distributional assumptions are required, since the distribution emerges from the sample under consideration itself. In parametric bootstrapping, the focus is instead on estimating the parameters of the assumed distribution, from which new samples are then drawn.
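
As a sketch, assuming a normal distribution and an illustrative sample, a parametric bootstrap in R estimates the distribution parameters and then resamples from the fitted distribution rather than from the data:

```r
set.seed(7)
x <- rnorm(40, mean = 5, sd = 1.5)      # hypothetical sample, assumed normal
mu_hat <- mean(x)                       # estimated parameters of the
sd_hat <- sd(x)                         # assumed distribution
# resample from the fitted normal distribution instead of from the data
boot_medians <- replicate(2000, median(rnorm(length(x), mu_hat, sd_hat)))
quantile(boot_medians, c(0.025, 0.975)) # 95% interval for the median
```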

Where is the process used in machine learning?

In machine learning, the procedure is used within the framework of so-called bagging (short for "bootstrap aggregating"). Bagging is applied in particular to regression and classification trees, where bootstrapping is used to reduce variance and thus improve predictions: bootstrap samples are drawn from the training data (bootstrap), a prediction model is trained on each of them, and the individual predictions are finally aggregated into one prediction value (aggregating). The procedure is also used in temporal difference learning in the reinforcement learning environment, where the objective function is optimised iteratively with reduced variance.
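
A minimal bagging sketch in R, using the built-in mtcars data, regression trees from the rpart package, and an arbitrary ensemble size of 100, illustrates the draw-train-aggregate loop:

```r
library(rpart)
set.seed(3)
B <- 100
# fit B trees, each on a bootstrap sample of the training data
preds <- sapply(seq_len(B), function(b) {
  idx  <- sample(nrow(mtcars), replace = TRUE)    # bootstrap sample of rows
  tree <- rpart(mpg ~ ., data = mtcars[idx, ])    # regression tree on resample
  predict(tree, newdata = mtcars)                 # predict on the full data
})
bagged <- rowMeans(preds)   # aggregate: average the B tree predictions
```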

The programming language R offers an implementation of non-parametric bootstrapping, for example via the boot package. By specifying the relevant parameters, the bootstrap can be run for individual variables or vectors; in a next step, the associated confidence intervals can also be determined.
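
Using the boot package, a minimal example (with an illustrative sample and an arbitrary 2,000 resamples) looks as follows; boot() expects a statistic function that takes the data and a vector of resampled indices:

```r
library(boot)
set.seed(5)
x <- rnorm(60, mean = 5, sd = 2)                          # hypothetical sample
stat_mean <- function(data, indices) mean(data[indices])  # statistic to bootstrap
res <- boot(data = x, statistic = stat_mean, R = 2000)    # 2,000 resamples
boot.ci(res, type = "perc")                               # percentile interval
```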

Statistical software packages such as SPSS from IBM or Stata also offer largely automated bootstrapping procedures. SPSS even provides a dedicated module with many functionalities. After the individual parameters are entered, the sampling distribution is estimated using the method described above.