The Effect of Corrupt Labels on Computer Vision Performance

from | 6 October 2022 | Tech Deep Dive

An Empirical Study with High Relevance in Medical AI Applications

Find all data of this study here.


Whether it's X-rays of lungs or images of the eyeball - medical data sets are never perfect. A small number of misdiagnoses are often accompanied by a much bigger number of incorrect labels or annotations that can be traced back to incorrect documentation of the images. Using these erroneous data sets for training of convolutional neural networks affects the model's classification quality. To investigate and quantify this effect, we artificially created a dataset with 100 percent correct labels and injected various ratios of corrupted labels into the training set. Finally, we measured the model performance on image classification. Results show more complex models generally to perform better. However, the decrease of model performance with increasing corrupted labels in training data does not solely depend on model complexity. In several cases, model performance plateaus and sometimes even slightly increases at very low levels of corrupted labels ratios. A strong correlation of model performance and corrupted labels ratio can be used as a potential basis to assess the unknown corrupted labels ratio in existing data sets.


Computer vision already has a tangible impact in many industries. Especially in healthcare, the potential for the use of Artificial Intelligence (AI) is high. Algorithms and Convolutional Neural Networks (CNNs) have long been able to detect pneumonia (Patel et al. 2019; Rajpurkar et al. 2017; Stephen et al. 2019; Varshni et al. 2019), skin cancer (Esteva et al. 2017), malaria (Yang et al. 2017) and many other diseases with greater or at least the same accuracy as the best specialists in the respective field. Exemplary medical images used to train disease classification CNNs are shown in figure 1.

However, these models are subject to limitations because physicians often disagree markedly on the diagnosis of medical images. For example, in the evaluation of diabetic retinopathy. Doctors looked at images of the eyeball and classified the visual impairment on a scale of 1 to 5 as - in this order - full vision, slightly impaired, impaired, significantly impaired vision, and blind. The assessments of the medical experts often differ by several scales (Griffith et al. 1993; McKenna et al. 2018; Sussman et al. 1982). In addition, medical findings are occasionally documented incorrectly, or the labels (annotations interchangeably) are extracted from findings using NLP models. This adds further sources of error (Olatunji et al. 2019) in addition to potential incorrect diagnoses, for example in lung x-rays (Brady et al. 2012; Busby et al. 2018; Cohen et al. 2020; Oakden-Rayner 2019). Figure 2 shows example scans of tuberous sclerosis complex (TSC) patients with detected and missed annotations.

Figure 2: FLAIR images of TSC-subjects and the lesions that were detected (in blue) and missed (in red) by an experienced annotator in the first reading (Karimi et al. 2020)

Humans are said to learn from their mistakes, which is true only if the errors are recognised. This can only marginally be applied to Artificial Intelligence. These algorithms depend on input data (often images in the medical field) being correctly labelled, i.e., given the right diagnoses, to produce the best performance on unseen data. The exact effect of incorrect labels in an image data set used to train self-learning algorithms is difficult to assess, but its overall negative impact on model performance has been proven and documented in various settings (Moosavi-Dezfooli et al. 2017; Pengfei Chen et al. 2019; Quinlan 1986; Speth and Hand. Emily M. 2019; Yu et al. 2018; Wang et al. 2018; Zhu and Wu 2004). In settings such as healthcare, each performance point won is valuable and potentially live saving.

In this work, we focus on studying the impairment of image classification on model performance due to corrupted labels (i.e., wrong label attribution of an observation) in the training data set. We artificially generate "diseases" on images with the help of computer vision augmentation - and consequently label them 100% correctly without medical discrepancies. We then introduce and steadily increase the ratio of corrupted labels and measure the effect of the corrupted labels ratio (CLR) on model performance. Thereby, we hope to draw and generalize the conclusion of said effect for potential inference.


Noisy label training

Deep learning neural networks in general as well as CNNs in specific are typically trained on large data sets with annotated labels. This process is called supervised learning. The source for errors in such data sets, which are used by the algorithm to learn certain relationships and patterns within the data, can be manifold and difficult to circumvent in many business settings. Often, correctly labelled data is cost intensive or generally difficult to obtain (Guan et al. 2018; Pechenizkiy et al. 2006) or labelling, even by experts, can still result in noisy data (Smyth 1996).

Other deep learning approaches to overcome these problems have already been explored, such as learning with noisy labels (Joulin et al. 2016; Natarajan et al. 2013; Song et al. 2022; Veit et al. 2017), self-supervised learning (Pinto et al. 2016; Wang and Gupta 2015) or unsupervised learning (Krizhevsky 2009; Le 2013 - 2013). These approaches and their measured performances demonstrate that deep learning models can tolerate some small amount of noise in the training set.

Multiple existing studies have probed the impact of noisy data on deep learning methods. These studies can generally be categorised into two groups (Rolnick et al. 2017). First, approaches that focus on noise-robust models to learn using noisy annotations (Beigman and Klebanov 2009; Joulin et al. 2016; Krause et al. 2015; Manwani and Sastry 2011; Misra et al. 2015; Natarajan et al. 2013; Reed et al. 2014; Rolnick et al. 2017; Liu et al. 2020). Some of these explicitly focus on image classification and CNNs (Ali et al. 2017; Xiao et al. 2015). This first group is comparatively larger, as the noise-robust approach has more scaling potential as well as can optimally lead to a "train-and-forget" implementation of such models due to their robustness. Second, approaches that focus on identifying and removing or correcting corrupt data labels (Aha et al. 1991; Brodley and Friedl 1999; Skalak 1994). Karimi and colleagues provide an in-depth overview for various methods of both groups (Karimi et al. 2020).

Corrupted labels effect

Our study diverges from previous approaches, as the experiment is set up to have full control over the labelling process and thereby the data labels themselves. We then modify the CLR in the training data and consequently measure the changes of model performance. Furthermore, compared to other similar studies (Veit et al. 2016; Sukhbaatar et al. 2014), the model architecture used to train on the clean and then partially corrupted data is not changed. We likewise hope that focusing on the performance change, and not the performance level itself, will provide valuable insights.

The closest are two studies experimenting with incrementally changing the ratio of corrupted labels and its effect on model performance (van Horn et al. 2015; Zhang et al. 2016). The first of which finds that the increase of classification error due to label corruption in training data is surprisingly low, independent of the number of classes or computer vision algorithm. They conclude that for low CLRs (≤ 15%) the increase in classification error is smaller than the proportion of which the CLR is increased. When corruption is introduced not only to training data but also the test data set, a significant drop in model performance is found (van Horn et al. 2015). As model performance is measured based on CLR with large intervals (5%, 15%, 50%) we will extend on this study by using smaller intervals with a focus on the 0 % ≤ CLR ≤ 10% range. The second study independently corrupts the train labels based on a given probability with a stepwise increase of 10%running the experiment with two different CNN architectures on two different data sets (Zhang et al. 2016). They conclude that label noise slows down the fitting convergence time with an increasing level of label noises. Again, we choose a more granular change in CLR and evaluate the magnitude of model performance change while using the same model architecture on the same data set across the different ratios. Thereby, we expect this to be a meaningful extension of these two studies.


Data set augmentation and labelling

The base data set used is the public and freely available joined data sets PascalVoc from 2007 and 2012, where certain objects, such as people, bicycles, chairs, bottles, or sofas are originally labelled and annotated. Using these images as a basis, typical patterns in certain pathologies are artificially replicated onto the images. The focus is on ensuring that these patterns are created to be unambiguous in certain cases, and barely or not at all recognizable to the human eye in others. Some of the results are shown in Figures 3a through 3c.

Figure 3a: Easy (top) and difficult (bottom) to detect pixel changes of the class "Distortion". For each image from left to right: original image, original image with drawn-in region of the change, original image with drawn-in region including changed pixel values - and changed image with which the neural networks were trained.
Figure 3b: Easy (top) and difficult (bottom) to detect pixel changes of the class "Blur". For each image from left to right: see Figure 3a.
Figure 3c: Easy (top) and difficult (bottom) to detect pixel change of the class "Blob". For each image from left to right: see Figure 3a.

The respective image changes are based on two steps. First, a random image section is chosen, either as a rectangle or as a four-sided polygon. Then, within the selected image section, the pixel values of the image are randomly changed, and the images are labelled based on the type of pixel value change. The changes consist of four main classes and 14 subclasses as listed in table 1. For the corruption of labels, the annotation of an image is randomly changed to one of the not correct main or subclass respectively.

Main class (with description)

Pixel values in the region of interest are randomly changed within a specified interval.
- R: Red channel only
- G: Green channel only
- B: Blue channel only
- All: All channels
Pixel values in the region in question are blurred.
- No subclasses
Random number of dots of random size are added to the region in question.
- R: Red dots
- G: Green dots
- B: Blue dots
- All: Dots of random colour
In the region in question, the colour channels are randomly swapped.
- RBG: RGB (red-green-blue sequence) becomes RBG
- BGR: RGB becomes BGR
- GRB: RGB becomes GRB
BRG: RGB becomes BRG
- BBR: RGB becomes GBR
Table 1: Main classes including subclasses used for the project

Overall, 22,077 pictures are altered and labelled. The distribution of main and subclasses is shown in figure 4. The distribution of main classes is roughly equal, with around 5,500 images per class. For subclasses, an imbalance is given due to Blur-class having no subclasses, resulting in ~ 5,500 images. For all other subclasses, between 1,100 and 1,400 pictures are altered and labelled. As this study focuses on change in model performance and not optimization of model performance or prediction performance of a certain class, the impact of class imbalance on overall performance will not be further discussed. All images including the created polygons are reseized after alteration to a pixel width and height of 244 before being loaded into the neuronal networks, resulting in an input shape of (244, 244, 3) .

Figure 4: Main classes and subclasses distribution

Base and pretrained model architecture

Two models are used in the experiment setup. A self-developed, basic CNN (bCNN) with 7.2 million parameters as well as a pretrained ResNet50 (resnet) with 27.8 million parameters. The bCNN consists of nine convolutional layers, with each being encapsulated by batch normalization, pooling, and dropout layers (rate = 0.1) ). A quadratic kernel_size = 3 is selected and as activation function LeakyReLu with a α = 0.3 is implemented. After flattening, two hidden dense layers are added, again accompanied by batch normalization, dropout layers (rate = 0.1 ) and the LeakyReLu activation function (α = 0.3 ). For the output layer, softmax activation function is selected. The ResNet model (He et al. 2015) is extended with a single hidden dense layer (relu activation function) and an output layer using the softmax activation function. A detailed overview of the model architecture can be viewed on the provided GitHub.

Model training setup with increasing CLR

The neuronal networks are trained to classify the images across multiple experiment iterations. Within each iteration, the CLR is gradually increased within the training data and model performance is measured using an uncorrupted test set. The random split of training test data is 77.5% to 22.5%, resulting in 13,700 images for training and 4,952 for testing. During the model training phase, the training data is randomly split using 20% (3,425 images) of the training images for validation. The training of a single model consists of 20 epochs and batch size of 32. Each experiment iteration includes training both models on one of the classification tasks (either four main classes or 14 sub classes) for ten different CLRs with 0% ≤ CLR ≤ 10% .

Within each run, the higher CLR includes already corrupted labels from the previous ratio, i.e., 10% CLR includes all corrupt labels from the 7.5% CLR which includes all corrupt labels from the 5% CLR and so forth. 20 iterations are performed for each classification task, resulting in 800 trained models (see table 2) and >400 compute hours. Model classification performance was measured using accuracy, weighted average precision, weighted average recall, and weighted average f1-score.

Experiment parametersExperiment Parameter StatesNo. of States
ModelbCNN, ResNet502
Classification TaskMain class classification,
subclass classification
Corrupted Labels Ratio0.0%, 0.25%, 0.5%, 1.0%, 2.0%, 3.0%,
4.0%, 5.0%, 7.5%, 10.0%
Training Iterations2020
= 800 models
Table 2: Experiment setup overview


CLR based model performance decrease

Overall, model performance decreased continuously with increase of CLR. An average test accuracy of 0.842 (std = 0.016) for the bCNN with 14 classes and 0% CLR is achieved, whereas 0.878 average accuracy (std = 0.064) is achieved by the ResNet model for the same classification task and CLR setup. The average accuracy decreases to 0.68 (std = 0.036 - bCNN, 10% CLR, 14 class prediction) and 0.76 (std = 0.04 - ResNet, 10% CLR, 14 class prediction).

Figure 5a: Measured classification test metrics including outliers
Figure 5b*: Measured classification test metrics excluding outliers (z-score threshold of 3.0 for each metric)

* Number of removed model results using a z-score threshold of >3.0 calculated per metric - out off 800 model results: accuracy of 8 models removed, weighted avg. precision of 2 models removed, weighted avg. recall of 8 models removed, weighted avg. F1-score of 6 models removed

For four classes prediction, the bCNN average test accuracy with 0% CLR is 0.901 (std = 0.018). For the same setup (0% CLR, four class prediction), the ResNet achieves an average accuracy of 0.937 (std = 0.028 ). The average accuracy decreases to 0.789 (std = 0.072 - bCNN, 10% CLR, 4 class prediction) and 0.831 (std = 0.04  - ResNet, 10% CLR, 4 class prediction) respectively. Figure 5a gives an overview on the performance of the different model and classification task performances based on test data metrics.

Not only a steady decrease of more than 10 percentage points for each setup on the average accuracy can be observed, but also an increase of model performance variance as the CLR is increased. This behaviour remains true for weighted average precision, weighted average recall and weighted average F1-score across all setups. A detailed overview can be found in appendix A.

Model performance anomalies of low level CLR

For the ResNet neural network trained on 14 classes prediction task, the standard deviation of accuracy, weighted average recall and weighted average F1-score is surprisingly high for a CLR of 0.0%, 0.25%, 0.5% and 1.0% when compared to the other setups (figure 5a, for details see appendix A). At the same time, an unusual characteristic is recognizable in the average performance change of the CLR for the 14 class ResNet: a peculiar spike in the test performance metrics when only a small number of labels are deliberately corrupted, i.e., at CLR = 0.5%. Even when removing outliers (see figure 5b), this pattern persists.

Figure 6: Average test accuracy delta for CLR compared to its baseline (0% CLR)

Figure 6 shows the average accuracy delta of each training setup compared to its average baseline (model trained with 0% CLR for same classification task). For both models, bCNN and ResNet, a further visual plateauing can be observed for the same low level CLRs (0.0% ≤ CLR ≤ 1.0% ) mentioned above when trained to classify four labels as well as the ResNet trained on 14 classes.

Classification performance relationship on CLR

A simple linear regression was run to probe the prediction of CLR based on a given accuracy, using the classification results as input. Table 3 confirms a moderately strong to strong negative correlation between the classification test accuracy (regressor) and the CLR (regressand) for all experiment setups with -0.75 ≤ r ≤ -0.5, except for the bCNN model trained on 14 classes with a very strong negative relationship of r < -0.9 .

ModelClassification TaskPearson's rp-valueCoefficientRMSE trainRMSE test
bCNN4 classes-0.64758< 0.00001-2.159622.398082.77765
ResNet4 classes-0.72335< 0.00001-2.326662.162452.31352
bCNN14 classes-0.90859< 0.00001-2.869651.373721.17014
ResNet14 classes-0.54147< 0.00001-1.510712.926372.16982
Table 3: Regression results on classification test accuracy over CLR

The coefficients, which can be interpreted as higher values indicating more robust models and lower ones indicating models that are more susceptible to false labels, shows the biggest effect of CLR on accuracy for the bCNN with 14 classes. While models used to choose and classify fewer number of classes achieve a better overall total accuracy performance, this observation does not automatically translate to a higher robustness of the models with respect to corrupted labels. Their performance decreases significantly faster than the ResNet trained to recognise 14 classes.


With higher proportion of corrupted labels in the training data, model performance becomes less reliable. Performance declines faster than in previous similar studies (van Horn et al. 2015), which may be due to the overall comparatively low complexity of the model architecture used in the present study, as measured by the parameters that can be trained. Moreover, when fewer class types need to be classified by the model, performance is better across different CLR levels. This is likely because the differences in features, and thus underlying patterns used for class differentiation, are larger between main classes than between subclasses, especially between subclasses belonging to the same main class. It is plausible that as similarity increases (fewer differences in the underlying patterns), the predictive power of the model decreases regardless of the number of falsified labels.

Models with a higher number of trainable parameters perform better than those with fewer, across all CLR levels examined. The size of a network determines how much it can remember in terms of patterns observed during training. Here, a higher number of trainable parameters leads to improved recognition of patterns within images, as well as the relationship between these patterns and their respective classes. These results support previous research (Moosavi-Dezfooli et al. 2017; Pengfei Chen et al. 2019; Speth and Hand. Emily M. 2019; Wang et al. 2018; Yu et al. 2018; van Horn et al. 2015; Zhang et al. 2016). Since a model was implemented with pre-trained parameters, the potential for fine-tuning pattern recognition of already learned relationships was further increased. This is consistent with previous studies (Chandeep Sharma 2022; Hassan et al. 2021; Hussain et al. 2019; Wang et al. 2019).

An implication of the observed spike in performance at low CLR values for some of the experimental setups studied is that the model is better able to discriminate between frequent and less frequent patterns of correctly labelled images of the same class when the number of mislabelled images is extremely low. Thereby, robustness is increased for classification of unseen data by focusing on the most important class differences. As not all setups show this behaviour, this implication cannot be concluded for certain.

Like the anomaly described above, the plateau in classification performance at the same low level of CLR allows for some interpretation. An inclusion of false labels does not immediately decrease pattern recognition of the models but potentially supports the identification of relationships between image and respective classes through pattern stimulation and emphasis on class differences. Only the model with the least trainable parameters per classes does not show described behaviour and decreases steadily. Based on these observations, the conclusion might be drawn that injecting a very low number of corrupted labels into the training data might increase model performance. The results suggest this to be feasible only if the model is complex enough in terms of trainable parameter per class ratio.

Based on the regression results, an initial estimate can be made about the existing corrupted labels in a dataset, even if the CLR is unknown. For instance, models trained on a dataset with uncertain CLR can be re-trained by intentionally inserting different ratios of falsified labels. Inferences can then be made about the data and its potential corrupted labels ratio used in the original model. The strong correlation between accuracy and CLR could potentially serve as an indicator for predicting the number of falsified labels and thus represents an interesting result for further research.


This study focuses on the effect of corrupted labels in training data for image classification models. We find that more complex models generally perform better over various ratios of corrupted labels in the training data and across different classification task compared to lesser complex models. At the same time, results suggest that robustness does not automatically come with higher complexity of model architecture, as decrease of model performance does not seem to solely depend on model complexity. A surprising result is observed for one of the four training setups which opens further questions. Coming studies can draw on the present results and focus on performance plateauing and potential increase at very low levels of corrupted labels ratios, potentially as a source to overall improve model performance.

Data shows a case for using the strong relationship between model performance and corrupted labels ratio for inference of unknown CLR in existing data sets by deliberately inserting corrupted labels and measuring performance change. It is recommended to further validate this assumption by either increasing the number of model architectures measured or testing similar setups with different data sets.

Find all data of this study here.


Appendix A - Classification metrics

ModelClassesRatio (%)Accuracy
- mean
Weighted Avg. F1 Score
- mean
Weighted Avg. Precision
- mean
Weighted Avg. Recall
- mean
- std
Weighted Avg. f1 Score
- std
Weighted Avg. Precision
- std
Weighted Avg. Recall
- std


Daniel Czwalinna

Daniel has been in Consulting for over 5 years. He joined [at] in early 2020 and is currently working as a Senior Data Scientist. His focus lies on Computer Vision although he is interested everything related to ML engineering. In his free time, he is passionate about photography and bouldering.

0 Kommentare