Random Forest

What is Random Forest?

Random Forest is an algorithm from the field of machine learning (a branch of artificial intelligence) that can be applied to classification or regression tasks. Classification, or categorisation, is about assigning an observation to a particular class. Regression, on the other hand, aims to estimate the value of a variable based on its dependence on other variables.

The term Random Forest was introduced by the statistician Leo Breiman and is based on the use of decision trees: building many randomised decision trees yields a "random forest" of trees.

How does a Random Forest work?

To create a forest of trees, many individual decision trees must first be generated. They are created in a randomised way so that they are as uncorrelated as possible. Each tree consists of several branches and nodes, which lead after several levels to an end point (a leaf, which stands for a class). At each node, a decision rule sends the data object down one of the branches, where it is tested again at the next node, until the object reaches an end point.
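
The path of a data object through a single tree can be sketched in a few lines of Python. This is only an illustration: the names Leaf, Node and classify, the threshold rule "go left if the value is at most the threshold", and the toy tree over height and weight are all assumptions made for this example and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """End point of the tree: holds the predicted class."""
    label: str


@dataclass
class Node:
    """Internal node: tests one feature against a threshold."""
    feature: int                 # index of the feature to test
    threshold: float             # go left if value <= threshold, else right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree, sample):
    """Follow the branches from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hypothetical two-level tree over the features (height_cm, weight_kg).
toy_tree = Node(
    feature=0,
    threshold=150.0,
    left=Leaf("child"),
    right=Node(
        feature=1,
        threshold=60.0,
        left=Leaf("light adult"),
        right=Leaf("heavy adult"),
    ),
)

print(classify(toy_tree, (140.0, 35.0)))
print(classify(toy_tree, (180.0, 80.0)))
```

In a real random forest the features and thresholds are not written by hand as they are here; they are learned from the training data.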

To prevent the decision trees from correlating with each other, the principle of bagging (short for bootstrap aggregation) is applied. For this purpose, each decision tree is trained on its own random sample of the training data, drawn with replacement, so that every tree sees a slightly different distribution of the data. The resulting variation in the decision nodes is intended to reduce the correlation between the trees.
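
The sampling step of bagging can be illustrated with the Python standard library alone. The following is a minimal sketch, assuming that each tree receives a sample of the same size as the original training set; the function names, the seed and the toy data are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def make_bootstrap_samples(data, n_trees, seed=0):
    """Create one bootstrap sample per tree (a fixed seed keeps runs repeatable)."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(n_trees)]


# Hypothetical training set: ten observations identified by their index.
training_data = list(range(10))
samples = make_bootstrap_samples(training_data, n_trees=3)

for i, sample in enumerate(samples):
    # Some observations typically appear several times, others not at all.
    print(f"tree {i}: {sorted(sample)}")
```

Because the draws are made with replacement, each tree is trained on a slightly different view of the same data, which is what keeps the trees from becoming identical.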

After the defined number of decision trees has been created, the algorithm works as an ensemble method: it considers the predictions of all the decision trees rather than of a single one. Compared with a single decision tree, this has the advantage that the decisions of a large number of predictors can counteract outliers and thus increase the reliability of the result. The prediction of a random forest regressor, for example, corresponds to the average of the predictions of the individual decision trees.
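
How the individual predictions are combined can be sketched as follows. The averaging for regression follows the description above; the majority vote for classification is a common aggregation rule that is assumed here for illustration. The function names and the example predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_predictions):
    """Regression: the forest predicts the average of the tree predictions."""
    return mean(tree_predictions)


def aggregate_classification(tree_predictions):
    """Classification (assumed rule): the class named by the most trees wins.

    On a tie, the class that was encountered first is returned.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees; the fourth tree is an outlier.
regression_votes = [10.0, 12.0, 11.0, 30.0, 12.0]
print(aggregate_regression(regression_votes))

class_votes = ["spam", "ham", "spam", "spam", "ham"]
print(aggregate_classification(class_votes))
```

The single outlying tree shifts the average only moderately, and the more trees the forest contains, the smaller the influence of any one of them becomes.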

Random Forest belongs to the category of supervised learning. In this type of machine learning, the algorithm's training data is labelled, meaning that each input is already mapped to its correct target value. Based on this, the system is supposed to learn to predict the targets of new data correctly.
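
What learning from labelled data means can be shown with the simplest possible decision tree, a so-called decision stump with a single threshold. This is a simplified sketch and not the training procedure of a full random forest; the function names and the data (hours studied, exam passed or failed) are invented for illustration.

```python
def fit_stump(xs, ys):
    """Pick the threshold that misclassifies the fewest labelled examples.

    The stump predicts 1 if x > threshold, else 0.
    """
    best_threshold, best_errors = None, len(xs) + 1
    for candidate in sorted(set(xs)):
        errors = sum((x > candidate) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def predict_stump(threshold, x):
    """Apply the learned rule to a new, unlabelled input."""
    return 1 if x > threshold else 0


# Hypothetical labelled training data: hours studied -> passed (1) or failed (0).
hours = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 1, 1, 1]

threshold = fit_stump(hours, passed)
print(threshold)
print(predict_stump(threshold, 5.0))
```

The labels are what allow the error count to be computed during training; without them there would be nothing to compare the predictions against.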

In which software can a random forest be implemented?

Among others, the method can be used in Scikit-learn, the R programming language, H2O or Weka.

  • Scikit-learn is a Python library that is mainly used for classification and regression algorithms as well as visualisations in the field of machine learning.
  • The programming language R is an interpreted language that was developed for statistical computing and is very widely used in both science and business. The name R can be traced back to the first letter of the first names of its creators, Ross Ihaka and Robert Gentleman, as well as to the name of the programming language S, on which the syntax of R is strongly based.
  • H2O is open-source software from the company H2O.ai and is mainly used for algorithms in the field of statistics and machine learning. The software can also be operated from Microsoft Excel via an API, for example. While an algorithm is running, approximate results are displayed, so that parameters can still be changed during the calculation. Visualisation of the method is generally regarded as one of its advantages.
  • Weka (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato in New Zealand and offers tools for classification and cluster analysis as well as for neural networks, which can be combined with the application of Random Forest.
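
As an illustration of the first option, a random forest classifier can be trained in Scikit-learn in a few lines. This is a minimal sketch and assumes that Scikit-learn is installed; the Iris example dataset, the 75/25 split, the number of trees and the random seeds are choices made for this example and not recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Labelled example data: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data to check the predictions on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# n_estimators is the number of decision trees in the forest.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# score() returns the share of correctly classified test samples.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

For regression tasks, RandomForestRegressor from the same module is used in the same way.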
