4.2. Experimental Results
In this study, the algorithm was evaluated on three datasets: the NEU dataset [16] from Northeastern University (steel surface defects), the DAGM2007 dataset [17] (synthetic textured-surface defects for industrial optical inspection), and the GC10-DET dataset [18] (ten common metal surface defects). The aim was to improve the identification and classification of equipment faults by exploiting the characteristics and patterns of each dataset.
To assess model performance, three aggregation strategies were compared systematically: the classic FedAvg algorithm (simple averaging of client models to construct the global model), the FedAvg-Data algorithm (which weights each client’s model by the size of its dataset during aggregation), and the FedAvg-ContribData algorithm proposed in this paper.
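The two baseline strategies can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes each client model is a dictionary of flat parameter lists, and the names `aggregate`, `fedavg`, and `fedavg_data` are hypothetical. FedAvg is written as an unweighted mean, following its description in this section.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]  # assumed layout: parameter name -> flat list of values


def aggregate(client_params: Sequence[Params], weights: Sequence[float]) -> Params:
    """Weighted average of client parameters; weights are normalised to sum to 1."""
    if not client_params or len(client_params) != len(weights):
        raise ValueError("need one weight per client")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]
    merged: Params = {}
    for name, values in client_params[0].items():
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(norm, client_params))
            for i in range(len(values))
        ]
    return merged


def fedavg(client_params: Sequence[Params]) -> Params:
    """Simple averaging: every client gets the same weight."""
    return aggregate(client_params, [1.0] * len(client_params))


def fedavg_data(client_params: Sequence[Params], num_samples: Sequence[int]) -> Params:
    """Size-weighted averaging: each client is weighted by its sample count."""
    return aggregate(client_params, [float(n) for n in num_samples])


clients = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
print(fedavg(clients))               # equal weights
print(fedavg_data(clients, [1, 3]))  # second client counts three times as much
```

With equal weights the two clients contribute identically, whereas with sample counts of 1 and 3 the global parameters move towards the larger client.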
The evaluation focused on four core performance indicators: accuracy, precision, recall, and F1 score. Accuracy measures the overall predictive capability of the model. Precision measures the proportion of samples predicted as positive that are truly positive. Recall measures the proportion of truly positive samples that the model identifies. The F1 score, calculated as the harmonic mean of precision and recall, is particularly informative under class imbalance and provides a balanced assessment of the model’s performance.
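The four indicators can be computed from true and predicted labels as in the sketch below. The section does not state how per-class values are averaged in the multi-class setting, so macro averaging (an unweighted mean over classes) is an assumption here, and the function name `classification_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 (averaging scheme assumed)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives for class c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


print(classification_metrics([0, 0, 1, 1], [0, 1, 1, 1]))
```

Under macro averaging, the averaged F1 score is in general not the harmonic mean of the averaged precision and recall, because the harmonic mean is taken per class before averaging.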
Table 2 reports the accuracy, precision, recall, and F1 score of the algorithm in this paper, FedAvg, and FedAvg-Data. It also details the improvement of the proposed algorithm over each of the other two algorithms on every metric, providing direct data support for comparing the algorithms.
- (1) Accuracy
From the table, it can be observed that the accuracies of FedAvg, FedAvg-Data, and FedAvg-ContribData are 49.7%, 81.5%, and 89.4%, respectively. FedAvg-ContribData is clearly the most accurate of the three algorithms.
Under imbalanced data distributions, the simple averaging adopted by FedAvg struggles to maintain accuracy. FedAvg-Data’s size-based weighting overlooks the quality and variety of each client’s data, which limits its gains. FedAvg-ContribData combines dataset weights with an analysis of each client’s contribution and achieves the best result, improving accuracy by 39.7 and 7.9 percentage points over FedAvg and FedAvg-Data, respectively.
- (2) Precision
The precisions of FedAvg, FedAvg-Data, and FedAvg-ContribData are 50.7%, 82.0%, and 91.1%, respectively. The proposed algorithm is markedly more reliable in its positive-class predictions than the other two algorithms.
FedAvg does not adequately consider the characteristics of data from different clients, which leads to confusion between positive and negative class samples and hence lower precision. FedAvg-Data assigns weights based on dataset size, which partly accounts for data imbalance and improves the prediction of positive classes, but it does not reflect the importance of each dataset. FedAvg-ContribData combines both considerations, capturing positive class features more precisely and distinguishing positive from negative class samples more reliably; it improves precision by 40.4 and 9.1 percentage points over FedAvg and FedAvg-Data, respectively.
- (3) Recall
The recall rates of FedAvg, FedAvg-Data, and FedAvg-ContribData are 42.5%, 75.7%, and 83.5%, respectively. The proposed algorithm is markedly better at identifying positive class samples than the other two algorithms.
FedAvg’s simple averaging loses a large amount of positive class information, resulting in poor recall. FedAvg-Data assigns weights based only on dataset size, without considering the importance of each dataset, so some crucial data are neglected and overall model performance suffers. FedAvg-ContribData adjusts the weights dynamically by combining dataset size with the value of client data, and so mines the features of positive class samples more comprehensively. It improves recall by 41.0 and 7.8 percentage points over FedAvg and FedAvg-Data, respectively.
- (4) F1 Score
The F1 scores of FedAvg, FedAvg-Data, and FedAvg-ContribData are 37.4%, 74.7%, and 84.0%, respectively. The algorithm proposed in this paper achieves a markedly better balance between precision and recall than the other two algorithms.
FedAvg performs poorly in both precision and recall, resulting in a low F1 score. FedAvg-Data assigns weights based on dataset size, which accounts for data imbalance to some extent but does not capture the more complex characteristics of client data, so its gains in precision and recall are limited. FedAvg-ContribData combines contribution degree with dataset weight and achieves a better balance between precision and recall, improving the F1 score by 46.6 and 9.3 percentage points over FedAvg and FedAvg-Data, respectively.
In summary, FedAvg-ContribData performs best under complex data distributions, making it better suited to applications that demand strong overall model performance.
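The improvement figures quoted in items (1) to (4) equal the differences between the reported scores, that is, gains in percentage points. The short sketch below recomputes them from the scores quoted in this section and, for comparison, also prints the relative gain with respect to each baseline. The helper name `gains` is hypothetical.

```python
from typing import Tuple

# Scores in % as quoted in this section: (FedAvg, FedAvg-Data, FedAvg-ContribData).
REPORTED = {
    "accuracy": (49.7, 81.5, 89.4),
    "precision": (50.7, 82.0, 91.1),
    "recall": (42.5, 75.7, 83.5),
    "f1": (37.4, 74.7, 84.0),
}


def gains(baseline: float, proposed: float) -> Tuple[float, float]:
    """Return (gain in percentage points, relative gain in %) of proposed over baseline."""
    points = proposed - baseline
    relative = points / baseline * 100.0
    return round(points, 1), round(relative, 1)


for metric, (fedavg_score, fedavg_data_score, contrib_score) in REPORTED.items():
    print(
        metric,
        "vs FedAvg:", gains(fedavg_score, contrib_score),
        "vs FedAvg-Data:", gains(fedavg_data_score, contrib_score),
    )
```

The first element of each pair reproduces the figures in the text; the second shows how large the same gain is relative to the baseline score.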
Figure 6 compares the original image with the feature maps generated by the different algorithms.
Figure 6A presents the original image, serving as a reference for subsequent algorithm feature-map analyses.
Figure 6B showcases the feature map produced by the FedAvg algorithm. It is evident that this feature map primarily reflects the basic shape of the image. Visually, its structure appears relatively simple, and it contains limited information, indicating a weaker capability to capture detailed image information.
Figure 6C displays the feature map generated by the FedAvg-Data algorithm. Compared to the feature maps from the FedAvg algorithm, those from the FedAvg-Data algorithm are more complex. These feature maps not only include edge information, but also present more detailed content, to some extent. This suggests that the FedAvg-Data algorithm can extract richer image information, particularly in terms of edge detection and detail capture.
Figure 6D shows the feature map generated by the algorithm proposed in this paper. It is more complex still: in addition to edge and texture information similar to that in the FedAvg-Data feature map, it depicts local shapes and patterns. The proposed algorithm thus captures the intrinsic structure and detailed features of images more comprehensively during feature extraction. Consequently, its feature maps represent image content more effectively and offer richer clues for subsequent image classification or recognition tasks, which supports the accuracy and reliability of those tasks.
Through the analysis of these four types of images (the original image and feature maps generated by three different algorithms), it is clear that there are differences in image feature extraction among the algorithms, with the one proposed in this paper having significant advantages in the richness and effectiveness of feature representation.
Figure 7 shows the performance of each algorithm on every evaluation metric for the training, test, and validation sets. FedAvg’s aggregation method is ill-suited to imbalanced data and performs worst on all three sets. FedAvg-Data, which weights clients by dataset size, improves substantially on simple averaging across all sets. However, FedAvg-Data considers only the volume of data and overlooks other essential attributes of the datasets, so it does not reach the best performance. Although FedAvg-ContribData has marginally lower precision than FedAvg-Data on the test set, it outperforms the other algorithms on every other metric across all sets. This indicates that it uses the available data more fully and efficiently, and that the algorithm in this paper is more flexible and dependable in complex data settings.
Overall, FedAvg-ContribData performs best across the datasets, which supports its effectiveness in handling diverse data.
Figure 8 displays the confusion matrices for FedAvg-Data and FedAvg-ContribData. Class imbalance in the dataset causes FedAvg-Data to underperform, especially on underrepresented classes such as classes 25 and 26, which have fewer samples. The algorithm in this paper outperforms FedAvg-Data in all categories. For example, on classes 17 and 18 it attains accuracies of 73.9% and 80.6%, respectively, compared with 47.8% and 22.6% for FedAvg-Data. FedAvg-Data exhibits significant classification bias on category 14, with 30.3% of the samples misclassified as category 11. On category 18 its accuracy is only 22.6%, and samples of this category are often misclassified as category 19 or 20. On category 23 its accuracy is 36.7%: because the features of categories 23 and 24 are similar, it struggles to distinguish between the two. With the algorithm proposed in this paper, only 12.1% of category 14 samples are misclassified as category 11, far fewer than with FedAvg-Data. On category 18 the accuracy reaches 80.6%, which is 58.0 percentage points higher than FedAvg-Data. On category 23 the accuracy improves to 66.7%, and far fewer samples are misclassified as category 24. The algorithm in this paper therefore performs better on difficult categories.
The algorithm in this paper outperforms FedAvg-Data in precision and in the other metrics, showing a clear advantage in handling diverse data. Because its design accounts better for data imbalance, it maintains strong classification performance on rarely occurring classes.
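Per-class accuracies and misclassification rates of the kind quoted for Figure 8 can be read from a confusion matrix as sketched below. The sketch assumes that rows index the true class and columns the predicted class; the 3 x 3 matrix is made-up illustrative data, not taken from the paper, and the function names are hypothetical.

```python
from typing import List


def per_class_accuracy(cm: List[List[int]]) -> List[float]:
    """Diagonal count divided by row total (rows assumed to index the true class)."""
    result = []
    for i, row in enumerate(cm):
        total = sum(row)
        result.append(row[i] / total if total else 0.0)
    return result


def confusion_rate(cm: List[List[int]], true_cls: int, pred_cls: int) -> float:
    """Share of samples of true_cls that were predicted as pred_cls."""
    total = sum(cm[true_cls])
    return cm[true_cls][pred_cls] / total if total else 0.0


# Hypothetical counts for three classes, for illustration only.
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 10],
]
print(per_class_accuracy(cm))
print(confusion_rate(cm, 1, 2))  # share of class-1 samples predicted as class 2
```

Normalising by row totals makes classes with few samples directly comparable with well-represented ones, which is why rare classes can show very low per-class accuracy even when overall accuracy is high.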
Figure 9 shows the variation curves of accuracy, precision, recall, and F1 score. These graphs provide a detailed view of how the performance of each algorithm changes with iterations.
The FedAvg algorithm does not fully consider the differences between the data of individual clients, and it performs poorly when the data are highly dispersed. When the data distributions of the clients differ significantly, FedAvg cannot aggregate the client models effectively, resulting in a decline in model performance.
The FedAvg-Data algorithm aggregates according to data weight and achieves better results on heterogeneous data. By assigning different weights to different clients, it balances the contribution of each client to the global model more effectively and so improves model performance. However, determining the weight from data size alone may ignore other important characteristics of the data, which limits the improvement and may leave accuracy, recall, and other indicators below the ideal level.
The algorithm in this paper combines the weight of the dataset with the contribution of the client to the overall model. It achieves good results on the training and test sets and shows significant advantages on the validation set. Because it considers both dataset weight and client contribution, the model is more robust to unknown data distributions. The algorithm not only fits the training data well, but also shows stronger generality and adaptability: it can transfer the knowledge learned from training data to new, independent datasets and maintain high prediction performance on data that have not been seen before.
The FedAvg algorithm is relatively simple in terms of aggregation. It uses simple averaging and does not take differences in data size into account. The FedAvg-ContribData algorithm analyzes the data statistically and adjusts the value weight of each client’s data accordingly. Although the FedAvg-Data algorithm takes data size into account, it does not measure the value of data in a fine-grained way and cannot mine that value as deeply as FedAvg-ContribData. As a result, its model cannot fully leverage the advantages of high-contribution data during learning.
The FedAvg-Data algorithm aggregates mainly according to data size, without examining the internal structure of the data or the different contributions that different parts make to model learning. For example, consider an image classification task in which the number of images per category is severely unbalanced, so that some categories have very few images yet play a key role in the integrity of the overall classification system. FedAvg-Data may not evaluate the contribution of these categories reasonably. Consequently, it assigns inappropriate weights, which ultimately leads to low recognition accuracy for rare-category images. The FedAvg-ContribData algorithm calculates the contribution degree and allocates weights accordingly. Even under very high data heterogeneity, this helps the model learn the characteristics of each category, especially those of the key minority categories. This effectively reduces the classification bias caused by data imbalance, so FedAvg-ContribData outperforms FedAvg-Data in overall classification accuracy.
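The idea of combining dataset size with client contribution can be illustrated with the sketch below. This section does not give the actual contribution measure or combination rule of FedAvg-ContribData, so everything here is an assumption made for illustration: the contribution scores are taken as given non-negative numbers, and the convex combination with the mixing parameter `alpha` and the function name `combined_weights` are hypothetical.

```python
from typing import List, Sequence


def combined_weights(
    num_samples: Sequence[int],
    contributions: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Mix size shares and contribution shares; the result sums to 1.

    alpha = 1 reduces to pure size weighting (as in FedAvg-Data);
    alpha = 0 uses only the (assumed) contribution scores.
    """
    if not num_samples or len(num_samples) != len(contributions):
        raise ValueError("need one sample count and one contribution per client")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if min(num_samples) < 0 or min(contributions) < 0:
        raise ValueError("sample counts and contributions must be non-negative")
    total_n = sum(num_samples)
    total_c = sum(contributions)
    if total_n <= 0 or total_c <= 0:
        raise ValueError("totals must be positive")
    return [
        alpha * n / total_n + (1.0 - alpha) * c / total_c
        for n, c in zip(num_samples, contributions)
    ]


# A small client holding rare but valuable classes versus a large, redundant client.
print(combined_weights([100, 300], [0.9, 0.1], alpha=0.5))
print(combined_weights([100, 300], [0.9, 0.1], alpha=1.0))  # size only
```

In this toy setting the small client receives a weight of 0.25 under pure size weighting but a larger weight once its contribution score is mixed in, which mirrors the rare-category argument above.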
During training, the FedAvg algorithm cannot distinguish clients by the size or contribution of their data, so it wastes considerable computing resources on redundant data that are of low quality, noisy, or weakly relevant to the model’s learning objectives. With a reasonable contribution evaluation and weight allocation mechanism, FedAvg-ContribData can identify, in each iteration, the data that most improve the model. This reduces the interference of uninformative data and helps the model converge more quickly. Although FedAvg-Data also weights client data, its weight distribution is not accurate enough, and it may still be adversely affected by low-contribution data during optimization, resulting in inefficient convergence. In large-scale distributed data-processing scenarios, FedAvg-ContribData can focus its learning on high-value data. This reduces training time and improves the final performance of the model, enabling it to adapt better to rapidly changing application needs and complex data environments.
To sum up, the algorithm in this paper has clear advantages in dealing with data heterogeneity: it uses the data of each client effectively during training, and it also shows good prediction ability on new data.