1. Introduction
In recent years, the progress in artificial intelligence models has been remarkable, driven primarily by the increasing availability of large datasets and improvements in computing power [1]. In particular, labeled datasets have been instrumental in enabling the training of supervised learning algorithms for tasks such as image classification, object detection, and image segmentation. Well-known datasets such as MNIST, CIFAR-10, and ImageNet are widely used in the deep learning community as benchmarks to evaluate and compare the performance of newly developed models [2]. However, the utility of these datasets is closely related to the accuracy of their labels, as labeling errors can significantly impact model performance.
Recent studies have shown that even low label error rates can significantly affect the model’s generalization and accuracy, further highlighting the importance of addressing label noise in datasets [3,4,5,6]. As a result, research on label noise detection and correction has become increasingly important to ensure the reliability and robustness of artificial intelligence systems. Labeling errors are a common challenge in various types of data, including text, audio, image, and video. These errors can distort the learning process, introduce noise, and reduce the generalization capabilities of machine learning models. Even widely used datasets such as MNIST, CIFAR-10, and ImageNet are not exempt from these problems, highlighting the need for effective methods to detect and correct mislabeled samples [3,7].
Detecting and correcting mislabeled samples are critical steps in data preprocessing, as these anomalies can introduce bias and noise into the learning process. Techniques such as supervised learning and anomaly detection algorithms have been developed to systematically identify and correct these errors [
3]. By ensuring data quality through accurate labeling and a deeper understanding of the class relationships, more robust and efficient convolutional neural networks (CNNs) can be developed. Addressing this issue can not only improve model accuracy but also reduce computational complexity by reducing the need to retrain models on noisy data [4,5].
The impact of label noise on classification has been addressed in the literature for several application areas. One study identified two types of label noise in medical images, disagreement and single-target label noise, and related them to model performance on dermatology, ophthalmology, and pathology datasets [8]. Research has also addressed the label noise problem for segmentation applications in clinical settings [9]. A recent overview of label noise learning for medical image analysis can be found in [10]. In remote sensing, label noise has been studied in single-label and multi-label scene classification for land cover mapping [11,12] and in hyperspectral image classification [13].
Several approaches have been proposed to address the data quality issues caused by labeling errors or biased labels. Methods based on a constrained optimization framework apply a training sample reweighting scheme to learn an equivalent unbiased labeling function [14]. Alternative approaches attempt to identify instances to which a classifier previously trained on the same data assigns a very low probability for the original label, suggesting that the model considers these instances to be mislabeled and in need of revision [5]. Confidence-based learning removes erroneous samples from the dataset by applying a probabilistic threshold to the confidence scores given by a trained model; the model is then retrained on the cleaned dataset to improve its robustness [7].
A variety of approaches have also been proposed for learning with noisy labels. In DivideMix, two networks are trained simultaneously with semi-supervised learning, using dataset co-division and label co-refinement [15]. In CORES (COnfidence REgularized Sample Sieve), corrupted examples are filtered out so that clean and corrupted examples can be treated separately when training a deep learning model [16]. In contrast, methods such as PES (Progressive Early Stopping) [17] and ELR+ (Early-Learning Regularization) [18] propose techniques to prevent deep learning models from memorizing the labels of noisy examples, either by controlling the early stopping point or by applying regularization techniques.
Researchers at MIT conducted a remarkable study, known as Cleanlab, analyzing ten widely used datasets, including MNIST, CIFAR-10, and ImageNet. Using confident learning, they proposed a method specifically designed to detect and correct mislabeled data, identifying numerous labeling inaccuracies. These results were later validated using Amazon’s Mechanical Turk platform to ensure the reliability of the corrections [3,7].
Many of these approaches rely on training deep learning models, which is effective but often requires significant computational resources. That is, despite advances in labeling error detection and correction, many existing techniques demand large computational resources, limiting their applicability in resource-constrained scenarios. This limitation underscores the importance of exploring alternative strategies that are efficient and scalable, allowing for the accurate detection of mislabeled or atypical samples without excessive computational overhead.
Moreover, characterizing datasets prior to training machine learning models is critical to ensure optimal performance. Several studies have focused on quantifying the intrinsic complexity of the data to identify the challenges that models may face during training [19,20,21]. Analysis of the interclass similarity has proven to be essential, as it highlights potential overlaps and areas of confusion between classes. High interclass similarity can lead to increased misclassification rates, ultimately reducing the overall accuracy of CNNs [3,5,19,21]. By quantifying these similarities, researchers can design more robust learning strategies and refine datasets to improve the generalization of machine learning models.
Therefore, in this paper, we propose a new methodology for detecting errors in the labeling of image datasets based on an evaluation of the intrinsic complexity of the data using the Cumulative Spectral Gradient (CSG) metric. Unlike traditional approaches based on confident learning, the use of the CSG allows for the quantification of the probabilistic divergence between classes within a spectral clustering framework, which facilitates the identification of samples that do not conform to the expected distribution. By identifying and correcting these anomalies, we aim to significantly improve the generalization of deep learning models, especially in scenarios where the data quality is a limiting factor. The main contribution of this work is to identify samples with labeling errors without the need to train a machine learning model. This reduces the time associated with the data analysis phase.
The rest of this article is organized as follows: Section 2 presents the background of this article related to the comparison metrics, the deep learning architecture, and the datasets. Section 3 describes the proposed methodology in detail, including the selected datasets, the scoring system, the relabeling, and the fine-tuning of the DL model. Section 4 presents the experimental results obtained and an analysis of them. Finally, Section 5 presents the conclusions of this study.
3. Materials and Methods
The proposed method is based on the application of the CSG metric, which allows a confidence score to be obtained that facilitates relabeling a given number of samples according to their score. In this study, the CIFAR-10/100 and CIFAR-10n/100n datasets (using the worst case) are analyzed and compared class by class to identify mislabeled samples. This means that the complexity of each pair of classes is evaluated by applying the CSG metric (using a subset consisting of class i from CIFAR-10/100 and class i from CIFAR-10n/100n). When the CSG is applied to each pair of classes, each of the CIFAR-10n/100n samples contributes to a certain degree to the complexity of this subset. This degree of complexity is used as a confidence score to select N samples from each class for relabeling. This process is repeated for all classes, resulting in a new version of the dataset with N corrected samples per class.
Subsequently, a deep learning model is trained on the newly created dataset, and its performance can be compared to the performance of training the same model on the original CIFAR-10/100 and the CIFAR-10n/100n datasets. This comparison allows for an evaluation of the impact of label correction on the model performance, providing insights into the effectiveness of the proposed methodology in improving the dataset quality and consequently the accuracy and robustness of the deep learning model. The proposed methodology is illustrated in
Figure 3, with the details provided below.
3.1. Datasets
This research was conducted using the CIFAR-10, CIFAR-100, CIFAR-10n-Worst, and CIFAR-100n-noisy datasets. Since the goal is to identify mislabeled images in the CIFAR-10n-Worst and CIFAR-100n-noisy datasets (each with about 40% label noise), the most viable approach is to compare the performance of models trained on them with the performance of models trained on the original CIFAR-10/100 data.
It is important to note that the subset of training data in CIFAR-10/100 contains 5000 and 500 samples in each class, respectively, while in CIFAR-10n/100n, the number of training images per class is variable. For example, the car class in CIFAR-10n-worst has the largest number of samples (6053), and the deer class has the fewest (4040). In addition, each of the classes in CIFAR-10n/100n represents a mixture of images that include not only the respective class but also multiple images from other classes as a result of the labeling process explained above. In any case, the total number of images in the dataset is always the same (50,000 samples). Both CIFAR-10 and CIFAR-100 have an additional test subset of 10,000 images each, i.e., 1000 test images per class in CIFAR-10 and 100 test images per class in CIFAR-100.
Since the proposed methodology follows a data-centric approach, the training of the models was performed using the entire training set (50,000 images in both CIFAR-10 and CIFAR-100), i.e., no validation subset was used for hyperparameter tuning. The original test set (10,000 images in both CIFAR-10 and CIFAR-100) was used for the model evaluation.
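As an illustration, the clean CIFAR-10 labels can be loaded through Keras and the noisy label sets from the file distributed with the public CIFAR-N release. The file name and dictionary key used below ("CIFAR-10_human.npy", "worse_label") follow that release and are assumptions to verify against the downloaded copy; this is a minimal sketch, not the authors' loading code.

```python
import numpy as np
from tensorflow.keras.datasets import cifar10

# Clean CIFAR-10 training data: 50,000 images, 5,000 per class.
(x_train, y_clean), (x_test, y_test) = cifar10.load_data()
y_clean = y_clean.flatten()

# CIFAR-10N human annotations; the "worst" label set carries roughly 40% noise.
# File name and key are taken from the public CIFAR-N release (assumption).
noisy = np.load("CIFAR-10_human.npy", allow_pickle=True).item()
y_worst = noisy["worse_label"]

# Per-class sample counts in the noisy labels (variable, unlike the clean set).
print(np.bincount(y_worst, minlength=10))
```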
3.2. The Confidence Scores
The CSG metric was applied to subsets consisting of one class from CIFAR-10 (train) and the corresponding class from CIFAR-10n. For example, the airplane class from CIFAR-10 was compared to the airplane class from CIFAR-10n-Worst, followed by comparing the automobile class from CIFAR-10 to the automobile class from CIFAR-10n-Worst, and so on for all remaining classes. This approach ensured a direct and meaningful comparison between the datasets and facilitated the identification of mislabeled samples in CIFAR-10n-Worst. A similar process was used for CIFAR-100 and CIFAR-100n but using their 100 classes.
In addition, the CSG was applied using the number of samples of class i in the noisy dataset as the sample size, with 500 neighbors in the k-nearest-neighbor algorithm for CIFAR-10n and 50 for CIFAR-100n. By applying the CSG metric to the data subsets, a similarity array was obtained that allowed the most confused samples to be detected according to the probability of each sample. This probability was used as a confidence score, which made it possible to order the samples from the most confused to the least confused.
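A minimal sketch of this per-class scoring step is given below. It assumes the CumulativeGradientEstimator interface documented for the spectral-metric package (constructor arguments M_sample and k_nearest, a fit(data, target) method, and a csg attribute), which should be checked against the installed version. Because the exact way the per-sample probabilities are read from the library is not detailed here, the sketch derives an equivalent confusion score with a plain k-nearest-neighbor estimate; this is an illustrative stand-in, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from spectral_metric.estimator import CumulativeGradientEstimator

def class_pair_scores(x_clean_i, x_noisy_i, k_nearest=500):
    """Pair class i of CIFAR-10 with class i of CIFAR-10n-Worst, compute the
    pairwise CSG, and derive a per-sample confusion score for the noisy side."""
    n_clean, n_noisy = len(x_clean_i), len(x_noisy_i)
    data = np.concatenate([x_clean_i, x_noisy_i]).reshape(n_clean + n_noisy, -1) / 255.0
    target = np.array([0] * n_clean + [1] * n_noisy)  # 0 = clean class i, 1 = noisy class i

    # Pairwise complexity: M_sample set to the size of the noisy class,
    # k_nearest = 500 for CIFAR-10n (50 for CIFAR-100n), as described above.
    estimator = CumulativeGradientEstimator(M_sample=n_noisy, k_nearest=k_nearest)
    estimator.fit(data=data, target=target)

    # Stand-in for the similarity array mentioned in the text (not the
    # library's internal accessor): the estimated probability that each noisy
    # sample does NOT belong to the clean class, from a k-NN class estimate.
    # Higher values mean "more confused", i.e., more likely mislabeled.
    knn = KNeighborsClassifier(n_neighbors=k_nearest).fit(data, target)
    confusion = knn.predict_proba(data[target == 1])[:, 1]
    return estimator.csg, confusion
```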
3.3. Relabeling
After sorting the samples of each class according to the confidence score (from highest to lowest), the first N samples of each class are selected and relabeled according to the clean labels contained in the CIFAR-10n dataset. In this way, for each value of N, a new version of the CIFAR-10n-worst dataset is obtained, where N samples of each class have been relabeled.
This process was applied by varying N from 200 to 3200 samples with increments of 200 samples in CIFAR-10n and varying N from 40 to 640 samples with increments of 40 samples in CIFAR-100n. After relabeling 3200 samples per class in CIFAR-10n and 640 samples per class in CIFAR-100n, the dataset is balanced, i.e., the number of samples in each class equals the number of samples per class in the original CIFAR-10/100 datasets (5000 and 500, respectively).
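The selection and relabeling step can be sketched as follows; per_class_scores, y_noisy, and y_clean are illustrative names for the confusion scores produced by the scoring step above, the noisy label vector, and the verified labels shipped with CIFAR-10n.

```python
import numpy as np

def relabel_top_n(y_noisy, y_clean, per_class_scores, n):
    """Return a copy of the noisy labels where, for each class, the n samples
    with the highest confusion score are replaced by their clean labels.

    per_class_scores maps a class index to (sample_indices, confusion_scores),
    both NumPy arrays; this structure is illustrative, not prescribed here.
    """
    corrected = y_noisy.copy()
    for cls, (indices, scores) in per_class_scores.items():
        order = np.argsort(scores)[::-1]      # most confused first
        selected = indices[order[:n]]
        corrected[selected] = y_clean[selected]
    return corrected

# One corrected version of CIFAR-10n-Worst per value of N (200, 400, ..., 3200):
# versions = {n: relabel_top_n(y_worst, y_clean, per_class_scores, n)
#             for n in range(200, 3201, 200)}
```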
3.4. Fine-Tuning
During the fine-tuning process, the pre-trained DenseNet121 model is used with the weights obtained from its training on the ImageNet dataset. The top layers of the model are excluded to adapt it to the 10 classes of the CIFAR-10 dataset (100 in the case of CIFAR-100). In DenseNet, a global average pooling layer is then added, followed by a fully connected layer of 256 units with a dropout of 0.5 and a final fully connected layer whose number of units equals the number of classes in the dataset. The first 141 layers of the model are frozen, as they contain the general features learned during pre-training that are useful for a wide range of tasks. Layers from the 141st onward are set as trainable, allowing the model to adapt and fine-tune its parameters to the specific datasets being evaluated. In Xception, the top layer is adjusted through a global average pooling layer and a fully connected layer with the same number of units as the number of classes in the dataset, and the first 117 layers are frozen.
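The DenseNet121 head described above can be assembled in Keras roughly as follows. This is a sketch consistent with the description; the ReLU activation of the 256-unit layer, the 32x32 input size, and the optimizer/loss are assumptions not stated in this paragraph.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

NUM_CLASSES = 10  # 100 for CIFAR-100n

# ImageNet-pre-trained backbone with the original classification head removed.
base = DenseNet121(weights="imagenet", include_top=False, input_shape=(32, 32, 3))

# Freeze the first 141 layers; layers from the 141st onward remain trainable.
for layer in base.layers[:141]:
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),   # activation assumed, not stated in the text
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Optimizer and loss are placeholders for illustration.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```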
3.5. Evaluation
After fitting and training the DenseNet121- and Xception-based models, inference is performed on the CIFAR-10/100 test data. This ensures that the model is evaluated on an independent dataset, providing a reliable assessment of its performance. Each result is consolidated into a 10-class (100-class for CIFAR-100n) confusion matrix, from which different metrics can be calculated. In particular, the accuracy (7) and the F-score (8) are obtained.
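Assuming that (7) and (8) refer to the standard definitions of accuracy and the macro-averaged F-score, these metrics can be obtained from the test-set predictions as sketched below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

def evaluate_model(model, x_test, y_test, num_classes=10):
    """Confusion matrix, accuracy, and F-scores on the held-out test set."""
    y_true = np.asarray(y_test).ravel()
    y_pred = np.argmax(model.predict(x_test), axis=1)
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    acc = accuracy_score(y_true, y_pred)
    f_macro = f1_score(y_true, y_pred, average="macro")   # overall F-score
    f_per_class = f1_score(y_true, y_pred, average=None)  # used in the class-level analysis
    return cm, acc, f_macro, f_per_class
```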
3.6. The Experimental Environment
Google Colab, a cloud platform that provides high-performance computing resources, was used for sample correction. The development environment was configured in Python 3.11, taking advantage of specialized libraries such as spectral-metric (CSG) [
28]. This environment can be run with basic resources such as a CPU (the CSG does not require a GPU) and the RAM available in the basic version (12.67 GB).
Google Colab Pro+ was used to train and evaluate the deep learning models. The development environment was configured in Python, taking advantage of specialized libraries such as Keras/TensorFlow. To speed up the training process, an A100 GPU was used, which significantly reduced the training time. In addition, 83.48 GB of RAM was available.
4. Results
The results after applying the proposed methodology are structured in terms of the balance of the number of samples per class, the overall performance metrics, the performance metrics per class, a multi-class scenario, and comparison with other methods.
4.1. Balance of the Number of Samples per Class
As mentioned above, due to the labeling process in CIFAR-10n-worst, each class has a high level of noise, i.e., it contains a large number of images that do not correspond to that class; in addition, the number of images in each class differs from the number of samples per class in the CIFAR-10 dataset.
By applying the proposed methodology, a certain number of erroneous samples is identified in each class and transferred to their correct classes. Thus, each class not only sends images to other classes but also receives images from other classes. Since the number of images sent and received is not necessarily equal, the number of samples per class can both increase and decrease. In fact, the standard deviation in the number of images per class initially stays around 600 and begins to decrease from 1200 corrected samples per class until it reaches zero at 3200 corrected samples per class, i.e., the point at which the classes are balanced (see Figure 4).
The variation in the number of samples in each class is shown in
Figure 5. In this case, there are classes that gradually increase in the number of samples (deer, frog), classes that gradually decrease (car, dog), and classes that first decrease and then increase. In all cases, once the relabeling process is complete, the dataset reaches equilibrium, with an equal number of samples per class.
4.2. Overall Performance
In addition to the balance of the dataset, it is important to evaluate whether the versions of the dataset obtained after applying the proposed method have an advantage in terms of the deep learning model’s performance.
For this purpose, the models based on DenseNet121 and Xception were trained with each of the 17 available dataset versions (the uncorrected CIFAR-10n-worst; the versions with 200–3000 corrected images per class; and the clean version, equivalent to 3200 corrected images per class). These results in terms of the accuracy and F-score are shown in
Table 1 and
Figure 6a–d.
According to
Figure 6a–d and
Table 1, there is a strong correlation between the metrics. This is to be expected since both metrics measure the overall performance of the classifier. In this case, the values are practically equal, indicating a good balance between precision and recall on average. It is also observed that the accuracy and the average F-score improve significantly when the labels are corrected. Starting from a value of 0.66 in “worst”, they reach a value of 0.92 with 3000 corrections. This is an improvement of 26 percentage points, which is significant.
Another aspect to consider has to do with the fact that most of the average improvement is concentrated in the early stages of correction, especially up to 2200 corrected samples. After this point, the improvement grows at a slower rate.
Finally, the performance on the clean dataset reaches an accuracy and an F-score of 0.94 with DenseNet and 0.89 with Xception. This gives an idea of the standard performance achievable through transfer learning, which serves as a benchmark for evaluating the effectiveness of the label identification and correction. It is observed that the model comes quite close to this ideal performance with 3000 corrections.
To analyze the impact of the proposed method at the class level,
Table 2 and
Figure 7 are shown. As shown in
Table 2, the CSG-based label correction results in a significant improvement in the F-score for all classes. The absolute improvements vary between 0.19 and 0.31, which represents a significant relative increase between 25% and 51.79%. This confirms the effectiveness of the method in mitigating the effects of noise on the labels. In addition, considerable variability in the magnitude of the improvement between classes is observed.
In particular, the classes cat, deer, and dog show the largest relative improvements (about 50%), while frog shows the smallest (25%). This suggests that some classes have more initial noise than others and that the classifier has more difficulty learning the discriminative patterns in certain classes.
The column “Greatest Improvement (ΔF-score)” in
Table 2 indicates the greatest increase in the F-score between two consecutive correction points, complemented by the column “Range of Greatest Improvement”, which indicates the range of corrected samples in which this greatest improvement occurs. It is also observed that in most classes, the greatest improvement is centered between 600 and 2200 corrected samples. That is, most of the improvement is observed in the intermediate stages of correction (up to about 1600–2000 corrected samples). After this point, the improvement becomes more gradual, indicating a stabilization point. This suggests that correcting a large number of labels may not always be necessary and that a moderate number of strategic corrections (selected using the CSG) may be sufficient to achieve a significant improvement.
As can be seen in
Figure 7, some classes exhibit distinctive behavior in the early stages of the correction process. This is the case for the airplane, automobile, and truck classes. Airplane shows an initial drop in the F-score between “worst” and 200 corrected samples, followed by a rapid recovery and subsequent stabilization. This could indicate an initial correction of samples that, although mislabeled, contained useful information for the classifier. Finally, the frog and ship classes show the highest initial performance (≈0.75), suggesting that the noise level in these classes is lower compared to that in the other categories.
4.3. High-Dimensional Multi-Class Scenario
Taking into account that in practice there may be more complex problems with many more classes, the proposed method was also applied in a scenario with 100 classes. In this case, the CIFAR-100n dataset was used, whose samples were labeled with a noise level similar to that of CIFAR-10n, i.e., ≈40% noise in the fine labels of the dataset (considering 100 classes).
The original CIFAR-100 training subset has 500 images per class, while in CIFAR-100n, the number of images per class varies from 199 to 865 samples. Therefore, for CIFAR-100n, the number of corrected images per class was varied from 40 to 640 images, the point at which all of the labels in the dataset have been corrected.
Using this dataset and the corresponding sample-corrected versions of the data, the DenseNet121 model was fine-tuned, and the results were evaluated in a similar manner to the process for CIFAR-10n. These results are shown in
Figure 8.
According to the results shown in
Figure 8, the model performance improves as the number of corrected images per class increases. The greatest impact is seen when the correction is performed on the first samples identified by the proposed method (40–240 samples per class). In contrast, increasing the number of relabeled samples per class has no significant impact for higher values (280–600).
Finally, two of the advantages of the CSG-based methodology are that this metric has been extensively validated on diverse datasets and that its performance is closely related to the generalization of CNNs. Its performance has been tested on low-resolution grayscale image datasets such as MNIST and notMNIST, low-resolution color datasets such as CIFAR-10, and medium-resolution image datasets such as STL-10 or CompCars. Similarly, the results of applying this metric have been contrasted with the generalization performance of models such as ResNet-50 or Xception. In any case, it may be appropriate to extend the validation of the proposed methodology to datasets with higher-resolution images or with a larger number of classes [19].
4.4. Comparison with Other Methods
To put the performance of the models trained on CIFAR-10/100n into context, the results obtained are compared with those reported in the original CIFAR-10/100n article. In this case, the three best results from that evaluation are compared with the best results of the proposed methodology. These values are presented in
Table 3.
It is important to note that those methods were designed to train models in the presence of noisy labels, whereas the proposed methodology focuses on identifying samples in the dataset whose labels may be noisy.