Data mining is one of the basic terms in digitalisation and data science. It comes up particularly in Big Data projects and data analytics. The term generally refers to the systematic application of mathematical and statistical methods to data, with the goal of finding patterns, relationships and correlations in large amounts of data. This article gives an overview of the underlying theory and illustrates the topic with three practical examples. Data mining is not a universally applicable tool, however; rather, it is a group of algorithms that promise very effective solutions in certain cases.
What is data mining?
The term data mining is used in the Big Data environment. It subsumes the exploratory methods by which insights are gained from large amounts of data, some fully automated and others only semi-automated. The goal is to identify dependencies, regularities and patterns in otherwise disjointed or unstructured raw data. In keeping with the English word "mining", a metaphor borrowed from the mining industry, the term "prospecting" is sometimes also used in this context. Data mining methods are statistical procedures that allow the data to be analysed according to certain criteria. These can be roughly divided into four categories:
- Segmentation or clustering
- Association
- Classification
- Prediction
Types of data mining
Data mining is the generic term for the systematic attempt to identify correlations, patterns and trends in data sets. It uses a range of computer-assisted methods based on statistical algorithms, and it is becoming increasingly important as the volume of data keeps growing (Big Data).
Segmentation
Segmentation, or clustering, is a method in which objects with similar characteristics are grouped together. The objects within each resulting group are therefore homogeneous.
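Grouping by similarity can be sketched with k-means, one common clustering algorithm among many. The following is only a minimal illustration: the customer data, the function name `k_means` and the choice of the first k points as starting centroids are assumptions made for this example, not part of any particular tool.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters around iteratively updated centroids."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical customers described by (age, monthly spend).
customers = [(22, 30), (25, 35), (24, 28), (61, 120), (58, 110), (63, 130)]
clusters, centroids = k_means(customers, k=2)
print(clusters)
```

With this data the younger, low-spending customers end up in one group and the older, high-spending customers in the other. In practice the result depends on the starting centroids and on how the characteristics are scaled.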
Association
Association stands for the discovery of dependencies. It includes association analysis and sequence analysis. Association analysis helps users derive rules from data sets without having to specify a target variable. One area of application is shopping basket analysis: a rule can state that customers who buy item A are also likely to buy item B. Sequence analysis extends association analysis with additional rules or statistics, typically by taking the order of events into account.
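How strongly the purchase of item A points to the purchase of item B is usually expressed with two measures, support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the basket contents and the function names are assumptions for illustration only.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    # Note: raises ZeroDivisionError if the antecedent never occurs.
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(f"bread -> butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in two of four baskets, and two of the three baskets with bread also contain butter. Real association analyses search many candidate rules and keep only those above chosen support and confidence thresholds.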
Classification
In classification, individual data objects are assigned to specific classes. The classes must be defined in advance, and objects are assigned to them based on characteristics that are also defined in advance. The basis is formed by data sets with several independent characteristics and a dependent target variable.
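One simple way to assign a new object to a predefined class is the k-nearest-neighbour method, shown here as a minimal sketch. The training records, the two class labels and the function name `classify` are hypothetical; other classifiers such as decision trees would serve the same purpose.

```python
from collections import Counter
from math import dist

def classify(training, sample, k=3):
    """Assign `sample` to the most common class among its k nearest training records."""
    nearest = sorted(training, key=lambda row: dist(row[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: independent characteristics
# (income in thousands, number of existing loans) and the dependent class.
training = [
    ((20, 3), "high risk"), ((25, 4), "high risk"), ((22, 2), "high risk"),
    ((60, 0), "low risk"), ((75, 1), "low risk"), ((68, 0), "low risk"),
]

print(classify(training, (65, 1)))
print(classify(training, (23, 3)))
```

The characteristics play the role of the independent variables and the class label is the dependent target variable. Because distances are sensitive to scale, characteristics are usually normalised before this method is applied.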
Prediction
In data mining, prediction is the forecasting of previously unknown values based on previously gained knowledge. The basis is a training data set, which is used to train models that predict how certain dependent variables will develop.
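As a minimal sketch of training a model and then predicting with it, the following fits a straight line to a small training set using ordinary least squares. The data values and the names `fit_line` and `predict` are assumptions chosen for illustration; real projects typically use richer models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

model = fit_line(xs, ys)
print(predict(model, 6))
```

The fitted model summarises the knowledge gained from the training data, and `predict` applies it to a value that was not part of that data.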
Data mining specialisations
Most data mining approaches can be applied universally to different types of data. In addition, there are specialisations that are tailored to specific kinds of data.
Text mining
Text mining is a data mining method applied specifically to exploring text data sets. Text data pose a special challenge: because of their high-dimensional and unstructured character, they require special preparation before further processing. In this process, the dimensionality of the text data must be reduced and the data must be given a structure. Complex statistical and linguistic procedures can then be used to extract information and patterns from text documents. Natural language sources are also the subject of text mining. A typical application is the computer-assisted detection of textual plagiarism.
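The preparation step and a very simple plagiarism check can be sketched as follows: each text is reduced to a set of overlapping word sequences (often called shingles), and two texts are compared with the Jaccard similarity of those sets. This is one basic technique among several, and the example sentences and function names are assumptions for illustration.

```python
import re

def shingles(text, n=3):
    """Structure raw text as the set of its overlapping n-word sequences."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Share of elements common to both sets, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical documents.
original = "Data mining finds patterns in large amounts of data"
suspect = "Data mining finds patterns in huge amounts of data"
unrelated = "The weather was sunny all week long"

similarity = jaccard(shingles(original), shingles(suspect))
print(similarity)
print(jaccard(shingles(original), shingles(unrelated)))
```

A high similarity marks a pair of documents as worth a closer look; it does not prove plagiarism. Replacing a single word already removes every shingle that contained it, which is why real systems combine several measures.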
Web mining
Web mining is used to tap into various kinds of internet data. The object of analysis is not only the web pages themselves but also the relations between them, for example in the form of hyperlinks. Web mining identifies both clusters and outliers in web data. Web data sets change constantly, which poses a particular challenge.
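The relations between pages can be treated as a graph of hyperlinks. The sketch below counts inbound links per page in a small, invented site structure and flags pages that nothing links to as a simple kind of outlier; the page names and the function `inbound_counts` are hypothetical.

```python
from collections import Counter

def inbound_counts(links):
    """Count how many hyperlinks point to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in targets:
            counts[target] += 1
    return counts

# Hypothetical site: each page maps to the pages it links to.
links = {
    "home": ["products", "blog", "contact"],
    "products": ["home", "contact"],
    "blog": ["home", "products"],
    "contact": ["home"],
    "old-landing-page": ["home"],
}

counts = inbound_counts(links)
orphans = [page for page, n in counts.items() if n == 0]
print(counts.most_common(1))
print(orphans)
```

Because web data change constantly, a link graph like this is only a snapshot and would have to be rebuilt or updated regularly.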
Time series analysis
Time series analysis is one of the data mining specialisations whose goal is a forecast. Future values of a time series are estimated in order to derive predictions about future trends, for example.
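One of the simplest forecasting techniques is simple exponential smoothing, sketched here under the assumption of a short, made-up sales series. The smoothing factor `alpha` and the function name are illustrative choices, and more capable models exist for series with trend or seasonality.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """Return a one-step-ahead forecast using simple exponential smoothing."""
    if not series:
        raise ValueError("series must not be empty")
    # Start from the first observation, then blend in each newer value.
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 14]
forecast = exponential_smoothing_forecast(monthly_sales)
print(forecast)
```

A larger `alpha` gives recent observations more weight. This method lags behind a steady upward or downward trend, so it suits series that fluctuate around a roughly stable level.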
Typical tasks
In the course of these data evaluations, new business fields and models can be created or developed. In the automotive sector, for example, fleet analyses make it possible to offer customers a completely new aftersales service model: if conspicuous patterns in the data indicate a possible defect in a component, it can be replaced before it causes damage (predictive maintenance). Further characteristic tasks of data mining are:
- Outlier detection: Identification of unusual data records such as outliers, errors and changes
- Cluster analysis: Grouping of objects based on similarities
- Classification: Unassigned elements are assigned to existing classes
- Association analysis: Identification of correlations and dependencies in the data in the form of rules such as "From A and B usually follows C".
- Regression analysis: Identification of relationships between (several) dependent and independent variables
- Summary: Reduction of the data set to a more compact description without significant loss of information.
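Outlier detection, the first task in the list above, can be sketched with a z-score test: a reading is flagged when it lies more than a chosen number of standard deviations from the mean. The sensor readings, the threshold of 2.0 and the function name `find_outliers` are assumptions for illustration; in a predictive maintenance setting the threshold would have to be tuned to the component.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical temperature readings from one component.
readings = [70, 71, 69, 70, 72, 71, 70, 95]
print(find_outliers(readings))
```

The z-score assumes that normal readings cluster around a single mean. A large outlier also inflates the mean and standard deviation it is measured against, so more robust statistics such as the median are often preferred on small samples.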