What is Data Augmentation?

Data augmentation is a process that artificially generates new data from an existing data set in order to increase the total amount of data. It is therefore regarded as a preparatory step in applied machine learning. The functionality can be implemented with ready-made Python libraries such as PyTorch.

Benefits and challenges

One advantage of data augmentation is that it can reduce overfitting. Overfitting occurs when a model cannot generalise sufficiently beyond its training data, for example because the amount of training data is too small. Generating augmented data can counteract this problem because it increases the amount of data.

A further benefit of artificial data generation is that it can prevent potential data protection problems, because the additional data is generated purely through augmentation. Furthermore, the technique offers a cost-effective alternative to collecting and labelling more data.

One challenge is that the augmented data, once generated, must be subjected to a qualitative assessment through a rating system in order to capture the added value of the data expansion. In addition, biases in the original data cannot be eliminated by this method but are carried over into the augmented data. To reduce this problem, an optimal augmentation strategy can be developed.

How it works

In the standard model, data augmentation works as follows. The original data (e.g. an image) is loaded into the data augmentation pipeline. In this pipeline, so-called transformation functions are applied to the input data, each with a certain probability. Examples are rotating or mirroring (flipping) the image. After passing through the pipeline, the generated results are evaluated by a human expert. If the generated data passes this inspection, it is added to the training data as augmented data.
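The pipeline described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: images are assumed to be nested lists of pixel values, and the names `run_pipeline`, `augment_dataset` and `passes_review` are hypothetical. In particular, `passes_review` is a stand-in callback for the human expert's inspection.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def run_pipeline(image, steps, rng):
    """Apply each (transform, probability) step with its own probability."""
    result = [list(row) for row in image]
    for transform, probability in steps:
        if rng.random() < probability:
            result = transform(result)
    return result


def augment_dataset(originals, steps, passes_review, rng):
    """Return the originals plus every augmented image that passes review."""
    training_data = list(originals)
    for image in originals:
        candidate = run_pipeline(image, steps, rng)
        # Stand-in for the human inspection step (an assumption of this sketch).
        if passes_review(image, candidate):
            training_data.append(candidate)
    return training_data


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    originals = [[[1, 2], [3, 4]]]
    steps = [(flip_horizontal, 0.5), (rotate_90, 0.5)]
    # Accept only candidates that actually differ from their original.
    data = augment_dataset(originals, steps, lambda old, new: old != new, rng)
    print(len(data), "images in the training set")
```

Passing the random number generator in explicitly keeps the sketch reproducible, which makes it easier to trace which transformations produced a given augmented sample.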

What are data augmentation techniques?

In image classification and segmentation, several techniques can be used to expand the training data. After the original image has been loaded into the pipeline, it can, for example, be extended by a frame, mirrored horizontally or vertically, rescaled, moved along the x- or y-axis, rotated, cropped or zoomed into. In addition to these geometric modifications, there are others that concern colour or contrast, such as brightening or darkening the image, converting it to greyscale, changing the contrast, adding noise or deleting parts of the image. Each of these operations is applied to the original image with a certain probability, ultimately creating augmented data.
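A few of the image modifications listed above can be sketched without any imaging library. The example assumes a greyscale image stored as nested lists with pixel values from 0 to 255; the function names and parameters are hypothetical and chosen only for illustration. Established libraries provide far more complete versions of these operations.

```python
def add_frame(image, width, fill=0):
    """Extend the image by a frame of `width` pixels on every side."""
    columns = len(image[0]) + 2 * width
    top = [[fill] * columns for _ in range(width)]
    middle = [[fill] * width + list(row) + [fill] * width for row in image]
    bottom = [[fill] * columns for _ in range(width)]
    return top + middle + bottom


def shift_right(image, offset, fill=0):
    """Move the image along the x-axis by `offset` pixels (offset >= 0)."""
    width = len(image[0])
    offset = min(offset, width)
    return [[fill] * offset + list(row[:width - offset]) for row in image]


def adjust_brightness(image, factor):
    """Brighten (factor > 1) or darken (factor < 1), clipped to 0..255."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in image]


def erase_patch(image, top, left, height, width, fill=0):
    """Delete a rectangular part of the image by overwriting it with `fill`."""
    result = [list(row) for row in image]
    for r in range(top, min(top + height, len(result))):
        for c in range(left, min(left + width, len(result[r]))):
            result[r][c] = fill
    return result


if __name__ == "__main__":
    image = [[100, 200], [50, 25]]
    print(add_frame(image, 1))
    print(shift_right(image, 1))
    print(adjust_brightness(image, 1.5))
    print(erase_patch(image, 0, 0, 1, 1))
```

Each function returns a new image and leaves its input untouched, so the operations can be chained in any order or applied with individual probabilities.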

In addition to image classification and segmentation, the technique is also applied in the field of Natural Language Processing (NLP). Since NLP deals with the processing of natural language, generating meaningful data is more difficult. Applicable techniques are synonym substitution and the insertion, exchange or deletion of words, which can be summarised under the term Easy Data Augmentation (EDA). Another method is back-translation. Here, a text is translated into a target language and then translated back into the original language; the resulting paraphrase expands the training data set. Augmented data can also be created by means of contextualised word embeddings.
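The four word-level operations grouped under EDA can be illustrated with the following sketch. It works on a list of tokens and uses a tiny hand-written synonym table, `SYNONYMS`, which is purely hypothetical; a real application would draw synonyms from a proper lexical resource. The function names and the default deletion probability are likewise assumptions made for this example.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}


def synonym_replacement(words, rng, synonyms=SYNONYMS):
    """Replace one randomly chosen word that has an entry in the table."""
    result = list(words)
    candidates = [i for i, word in enumerate(result) if word in synonyms]
    if candidates:
        index = rng.choice(candidates)
        result[index] = rng.choice(synonyms[result[index]])
    return result


def random_insertion(words, rng, synonyms=SYNONYMS):
    """Insert a synonym of a randomly chosen word at a random position."""
    result = list(words)
    candidates = [word for word in result if word in synonyms]
    if candidates:
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        result.insert(rng.randint(0, len(result)), new_word)
    return result


def random_swap(words, rng):
    """Exchange the positions of two randomly chosen words."""
    result = list(words)
    if len(result) >= 2:
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def random_deletion(words, rng, probability=0.1):
    """Drop each word with the given probability, keeping at least one."""
    if not words:
        return []
    kept = [word for word in words if rng.random() >= probability]
    return kept if kept else [rng.choice(words)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    sentence = "the quick dog is happy".split()
    for operation in (synonym_replacement, random_insertion,
                      random_swap, random_deletion):
        print(operation.__name__, " ".join(operation(sentence, rng)))
```

Because each operation changes only a small part of the sentence, the augmented text usually stays close to the original meaning, although this is not guaranteed and the results should still be checked.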

Where is Data Augmentation used?

The process is particularly well represented in the medical imaging sector, for example in the segmentation of tumours or in the identification of diseases on X-ray images. Since only a limited data set is available for rare diseases, it can be expanded through data augmentation. Another use case is autonomous driving, where data augmentation is used to extend the simulation environment. Data augmentation is also used in Natural Language Processing, where it augments the training data for NLP applications.