What is a classification procedure?

Classification procedures are methods and criteria used to divide (classify) objects and situations into classes. Many of these methods can be implemented as algorithms, in which case they are referred to as machine or automatic classification. Classification procedures are always tied to a specific application, which is why many different methods exist. They play a role in pattern recognition, artificial intelligence, documentation science and information retrieval.

What are types of classification procedures?

Classification methods differ in their properties. There are automatic and manual methods, numerical and non-numerical methods, statistical and distribution-free methods, supervised and unsupervised methods, fixed-dimension and learning methods, and parametric and non-parametric methods.

In data mining, decision trees, neural networks, Bayes classification and the nearest-neighbour method are used to classify objects. Most classification procedures have a two-stage structure: a learning phase, in which a model is built from training data, followed by the classification phase, in which the model assigns new objects to classes.
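The two-stage structure can be illustrated with a minimal sketch. The nearest-centroid classifier used here is only a stand-in chosen for brevity and is not one of the methods described in this article; the class name, the method names `fit` and `predict`, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


class NearestCentroidClassifier:
    """Toy classifier, used only to show the two phases."""

    def fit(self, samples, labels):
        # Learning phase: average the feature vectors of each class.
        grouped = defaultdict(list)
        for x, y in zip(samples, labels):
            grouped[y].append(x)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, x):
        # Classification phase: choose the class whose centroid is closest.
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


# Invented training data: two features per object, two classes.
train_x = [[1.0, 1.0], [1.5, 2.0], [8.0, 9.0], [9.0, 8.5]]
train_y = ["low", "low", "high", "high"]

model = NearestCentroidClassifier().fit(train_x, train_y)
print(model.predict([2.0, 1.5]))
```

The training data is touched only in `fit`; `predict` works purely on what was learned.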

Decision trees

In this procedure, an object passes through a decision tree. At each node, one of the object's feature values is checked to determine which branch to follow next. The object always ends up at a leaf node, and that leaf gives the object's class. The decision tree is built from training objects using a recursive divide-and-conquer algorithm. The advantage is that the resulting rules are easy to interpret. Decision trees can also help to interpret a cluster analysis: applying them to the clusters found shows which features distinguish the classes.
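The path through the tree can be sketched as follows. The tree here is written by hand purely for illustration; in practice it would be learned from training objects. The feature names, thresholds and class labels are invented.

```python
# Hypothetical, hand-written tree. Inner nodes test one feature against a
# threshold; leaf nodes carry a class label.
tree = {
    "feature": "temperature",
    "threshold": 80.0,
    "left": {"label": "does not overheat"},        # temperature <= 80
    "right": {                                      # temperature > 80
        "feature": "fan_speed",
        "threshold": 1000.0,
        "left": {"label": "machine overheats"},     # fan_speed <= 1000
        "right": {"label": "does not overheat"},    # fan_speed > 1000
    },
}


def classify(node, obj):
    """Follow one branch per node until a leaf is reached."""
    while "label" not in node:
        branch = "left" if obj[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(classify(tree, {"temperature": 95.0, "fan_speed": 800.0}))
```

Each path from the root to a leaf reads as an if-then rule, which is what makes the result easy to interpret.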

Neural networks

Neural networks consist of nodes (neurons) that are connected to each other and arranged in several layers. The nodes of one layer are connected to the nodes of the next layer, and each connection has its own edge weight. At the beginning of training, these weights are set randomly. The edge weights determine how strongly each node's signal is passed on to the next layer, and thus which output node an object finally activates. Each output node in the output layer represents a class. Learning takes place by comparing the actual results with the target results from the training data. The errors are fed back into the network and the edge weights are adjusted step by step. Neural networks are considered to handle outliers in the data well. On the other hand, it is hard to trace how a classification result came about.
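The training loop described above can be sketched with a very small network. This is a minimal illustration under stated assumptions, not a reference implementation: the layer sizes, the tanh and softmax activations, the learning rate, the fixed random seed and the toy data are all chosen for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the random initial weights are reproducible

N_IN, N_HIDDEN, N_OUT = 2, 3, 2  # one output node per class


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w_hidden = rand_matrix(N_HIDDEN, N_IN + 1)  # +1 column for the bias
w_out = rand_matrix(N_OUT, N_HIDDEN + 1)


def forward(x):
    """Return hidden activations and class probabilities for one object."""
    xb = x + [1.0]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    scores = [sum(w * v for w, v in zip(row, hb)) for row in w_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


def train_step(x, target, lr=0.1):
    hidden, probs = forward(x)
    xb, hb = x + [1.0], hidden + [1.0]
    # Output error: actual result minus target result.
    delta_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Feed the error back to the hidden layer (using the weights before the update).
    delta_hidden = [
        (1.0 - hidden[j] ** 2) * sum(delta_out[k] * w_out[k][j] for k in range(N_OUT))
        for j in range(N_HIDDEN)
    ]
    for k in range(N_OUT):
        for j in range(N_HIDDEN + 1):
            w_out[k][j] -= lr * delta_out[k] * hb[j]
    for j in range(N_HIDDEN):
        for i in range(N_IN + 1):
            w_hidden[j][i] -= lr * delta_hidden[j] * xb[i]


def mean_loss(data):
    return sum(-math.log(forward(x)[1][y]) for x, y in data) / len(data)


def predict(x):
    probs = forward(x)[1]
    return probs.index(max(probs))


# Invented training data: (features, class index).
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]

loss_before = mean_loss(data)
for _ in range(500):
    for x, y in data:
        train_step(x, y)
loss_after = mean_loss(data)

print(loss_before, loss_after, [predict(x) for x, _ in data])
```

The learned weights carry no readable rules, which illustrates why the results are hard to trace.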

Bayes classification

In Bayesian classification, a class is assigned on the basis of the probabilities of the object's features. Each object is assigned to the class for which its feature combination is most probable. These probabilities are estimated from the training data. The advantage is that high classification accuracy can be achieved when the method is applied to large amounts of data. The disadvantage is that the results become inaccurate or outright wrong if the assumed distribution or the assumed independence of the features does not hold.
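As an illustration, the following sketch implements a naive Bayes classifier for categorical features, which assumes that the features are independent given the class. The Laplace smoothing, the function names and the training data are assumptions added for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each feature value occurs within each class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values = defaultdict(set)              # feature index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict_naive_bayes(model, row):
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Log prior plus the log likelihood of every feature value.
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen values.
            numerator = feature_counts[(label, i)][value] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: (load, cooling) -> class.
rows = [("high", "off"), ("high", "on"), ("low", "off"), ("low", "on"), ("high", "off")]
labels = ["overheats", "ok", "ok", "ok", "overheats"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("high", "off")))
```

With only five training rows the estimates are crude; they become more reliable as the amount of training data grows.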

Nearest-neighbour method

With this method, an object is compared directly with the training objects and assigned to a class. The comparison uses a previously defined distance or similarity measure to find the most similar training objects. The class that occurs most frequently among these nearest neighbours is the result class. An advantage is that the method can be applied to both qualitative and quantitative features. A disadvantage is the time-consuming classification phase, because the object must be compared with the entire training data every time.
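A minimal sketch of the method, assuming numerical features, the Euclidean distance as distance measure and k = 3 neighbours (all three are choices made for this example):

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the most frequent class among the k nearest training objects."""
    # Every training object is compared with the query, which is why the
    # classification phase is expensive for large training sets.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented training data: (feature vector, class).
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.0), "B"), ((8.0, 9.0), "B"),
]

print(knn_classify(train, (2.0, 2.0)))
```

For qualitative features, `math.dist` would have to be replaced by a suitable similarity measure, for example the number of matching feature values.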

Examples from the field of data science

In the field of data mining, analyses of Big Data are carried out. The aim is to process large amounts of data efficiently and to obtain reliable, easily interpretable results within a short processing time. It should be possible to process different types of data, such as texts, images, numbers, coordinates and the like.

Text mining is used to extract interesting and non-trivial knowledge from unstructured or weakly structured texts. It draws on information retrieval, data mining, machine learning, statistics and computational linguistics. Typical text mining tasks include cluster analysis, text classification and the construction of question-answering systems.

What is the difference between classification and regression?

Regression is the prediction of continuous values. In classification, on the other hand, group membership is predicted. When a neural network is used for either task, training is carried out with the help of backpropagation. This is an optimisation procedure in which the error of a forward propagation is calculated and a gradient method is used to adjust the weights so that the error decreases. Repeating this step gradually yields suitable weights.
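In its simplest form, the weight adjustment used by gradient methods can be written as follows. The symbols are assumptions introduced for this illustration: $w_{ij}$ is the weight of one connection, $E$ is the error of the forward propagation and $\eta > 0$ is the learning rate.

```latex
\[
  w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}
\]
```

Each weight is moved a small step in the direction opposite to the gradient of the error, so that the error decreases when the step is small enough.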

Mathematically, regression and classification do not differ too much from each other. In fact, many classification methods can also be used for regression with only a few adjustments, and vice versa.

Artificial neural networks, nearest-neighbour methods and decision trees are examples of methods that are used in practice for both classification and regression. What differs, however, is the purpose: with regression, one wants to predict continuous values (such as the temperature of a machine), and with classification, one wants to distinguish classes (such as "machine overheats" or "does not overheat").
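How small the adjustment between the two tasks can be is shown in the following sketch based on the nearest-neighbour idea. Both functions share the neighbour search and differ only in the last step. The function names, the single feature "load" and the data are invented for the example.

```python
from collections import Counter


def nearest(train, x, k):
    """Return the k training pairs whose feature value is closest to x."""
    return sorted(train, key=lambda item: abs(item[0] - x))[:k]


def knn_predict_value(train, x, k=3):
    # Regression: average the continuous targets of the neighbours.
    neighbours = nearest(train, x, k)
    return sum(target for _, target in neighbours) / len(neighbours)


def knn_predict_class(train, x, k=3):
    # Classification: majority vote over the neighbours' classes.
    neighbours = nearest(train, x, k)
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


# Invented data: machine load in percent, temperature in degrees, class.
loads = [10, 20, 30, 70, 80, 90]
temperatures = [40.0, 45.0, 50.0, 85.0, 90.0, 95.0]
classes = ["does not overheat"] * 3 + ["machine overheats"] * 3

regression_data = list(zip(loads, temperatures))
classification_data = list(zip(loads, classes))

print(knn_predict_value(regression_data, 75))
print(knn_predict_class(classification_data, 75))
```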

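One of the most common methods for tackling classification problems in supervised machine learning is logistic regression. Despite its name, it is a classification method: it estimates the probability that an object belongs to a class.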
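A minimal sketch of logistic regression with a single feature, trained by plain gradient descent. The learning rate, the number of epochs, the decision threshold of 0.5 and the scaled sample data are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=1.0, epochs=5000):
    """Fit weight w and bias b by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_class(w, b, x):
    # The model outputs a probability; the class is obtained by thresholding.
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Invented data: temperature scaled to [0, 1]; 1 = overheats, 0 = does not.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(predict_class(w, b, 0.1), predict_class(w, b, 0.9))
```

The continuous output of `sigmoid` is a probability, and only the final thresholding turns it into a class.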