3.1. Experimental Procedure
PyTorch—a mainstream open-source framework for deep learning—was used in the experiment. The experimental environment was Windows 10 with an E5-2640 2.40-GHz CPU and an NVIDIA Quadro P2200 GPU. The model training steps were as follows:
- (1) The 1141 gear images in the dataset were divided into a training set, a validation set, and a test set at a ratio of approximately 7:2:1.
- (2) A multiscale feature reconstruction model was developed, pretrained weights were applied, and the initial learning rate was set to 0.001. The learning rate was adjusted automatically according to the step size. The ReLU activation function was used, and the number of epochs was 300.
- (3) The gear image validation samples were input, and the detection results of the segmentation model were evaluated every seven epochs using the loss function as the evaluation metric.
- (4) The parameters were adjusted according to the evaluation results to obtain the optimal network model.
In step (2), the learning rate was updated automatically every 15 iterations so that a higher learning rate could be used in the initial training phase to quickly find the direction of gradient descent, after which the learning rate was updated automatically to approach the optimal network parameters. In step (3), the model detection results were evaluated every seven epochs to monitor the network training effectively and to facilitate model review and resuming training from checkpoints. To evaluate the performance of the three comparison models, U-Net, PSPNet (ResNet50), and DeepLabv3+, these models were trained sequentially according to steps (1) to (4), and their detection results were compared with those of the proposed multiscale feature reconstruction model.
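The schedule described in step (2) (initial learning rate 0.001, updated automatically every 15 iterations) can be sketched in pure Python. The decay factor `gamma = 0.1` is an assumption for illustration, since the paper states only that the rate is updated automatically by the step size:

```python
def step_decay_lr(initial_lr, epoch, step_size=15, gamma=0.1):
    """Step-decay schedule: the learning rate is multiplied by `gamma`
    every `step_size` epochs (mirrors torch.optim.lr_scheduler.StepLR).
    `gamma` is an assumed value, not taken from the paper."""
    return initial_lr * (gamma ** (epoch // step_size))

# Step (3): evaluate every seven epochs over the 300-epoch run
eval_epochs = [e for e in range(300) if e % 7 == 0]
```

In a PyTorch training loop, the same behavior would typically be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)`.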
3.2. Wind Turbine Gearbox Surface Defect Dataset Production
The study of wind turbine gearbox gear surface defect detection relies on a large number of gear surface defect images, but no publicly available dataset contains images of similar gear surface defects; therefore, the most important aspect of this study was to build a high-quality gear surface image dataset. A GE industrial endoscope was used to obtain real defect images from a wind turbine gearbox, and the original images were augmented by geometric and color-space transformations to improve the accuracy of the model. The specific methods included horizontal flipping, histogram equalization, noise addition, and random rotation. After image augmentation, there were 1350 images, which were classified into five types of defects: rust, deviation, bonding, spalling, and cracking. Finally, these images were filtered to obtain 1141 valid images. The particular defect types are shown in Figure 6.
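The four augmentation operations named above can be sketched with NumPy alone; these are minimal illustrative implementations (the noise strength `sigma` and the restriction of rotation to 90° multiples are assumptions, not details from the paper):

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-right along the width axis."""
    return img[:, ::-1]

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise; `sigma` is an assumed strength."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def rotate90(img, k=1):
    """Rotation restricted to 90-degree multiples for simplicity."""
    return np.rot90(img, k)

def equalize_hist(gray):
    """Histogram equalization for a single-channel uint8 image:
    remap intensities through the normalized cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[gray]
```

In practice, libraries such as OpenCV or Albumentations provide equivalent, better-tested transforms (including arbitrary-angle rotation).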
A total of 1144 valid images were labeled, each with a single defect according to the classification. The open-source tool LabelMe was used for labeling, and the files generated after labeling were converted into the standard VOC dataset format used by the network. Before training, the dataset was randomly divided into training, validation, and test sets at a ratio of approximately 7:2:1; the numbers of images in these three sets were 800, 228, and 116, respectively. The specific annotations are presented in Table 4.
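The random split described above can be sketched as follows; the filenames and the fixed seed are illustrative assumptions, while the set sizes (800/228/116) are those reported in the text:

```python
import random

def split_dataset(items, counts=(800, 228, 116), seed=42):
    """Randomly divide `items` into training/validation/test sets using
    the exact counts reported in the paper (approximately 7:2:1)."""
    assert sum(counts) == len(items), "counts must cover the whole dataset"
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train, n_val, _ = counts
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical filenames; the paper's actual naming is not specified
train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(1144)])
```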
3.3. Defect Detection Effect Evaluation Index
In image semantic segmentation research, evaluation metrics are commonly used to assess algorithms. The most important metrics are the mean pixel accuracy (mPA) and the mean intersection over union (mIoU). The mIoU is the average of the per-category IoU values, and the mPA is the average of the per-category proportions of correctly classified pixels.
The intersection over union (IoU) represents, for each category, the ratio of the intersection of the predicted region and the ground truth to their union. It is given by Equation (12), where $p_{ii}$ represents the number of correctly detected pixels (in this study, the defective pixels) and $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$:

$$\mathrm{IoU} = \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (12)$$
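As a concrete reference, the per-class IoU can be computed directly from prediction and ground-truth masks. This is an illustrative sketch, assuming masks encoded as integer class labels per pixel:

```python
import numpy as np

def class_iou(pred, target, cls):
    """IoU for one class: |pred ∩ target| / |pred ∪ target|,
    computed over binary pixel masks for class `cls`."""
    p = (pred == cls)
    t = (target == cls)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    # Undefined when the class appears in neither mask
    return inter / union if union > 0 else float("nan")
```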
The one-way setting is primarily used for the segmentation of few-sample gear defects, i.e., there is only one target defect in the foreground. In this case, the background occupies a large image area, which easily leads to the dominance of the background pixels, resulting in an IoU value that does not intuitively represent the segmentation performance of the model. For this reason, the mIoU was adopted as an evaluation index; it is more reasonable for gear defect evaluation because, rather than being dominated by the background class, it averages the IoU over all classes.
In this study, to describe the segmentation performance of the model intuitively, the mIoU was selected as the evaluation index, and the number of categories was six (five defect categories + one background category). The mIoU is calculated as follows:

$$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (13)$$
The pixel accuracy (PA) represents the proportion of pixels that are correctly detected. It is obtained using Equation (14), where $p_{ii}$ represents the number of correctly detected pixels and $\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}$ is the total number of pixels:

$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \quad (14)$$
In gear images, the defect pixels are far fewer than the background pixels, so misclassified defect pixels have little impact on the overall PA; therefore, the PA is not suitable for gear defect detection. Instead, the mPA first calculates, for each class, the proportion of pixels correctly classified relative to all pixels of that class, and then averages these proportions, so that the accuracy of each defect is reflected. The mPA is given by Equation (15):

$$\mathrm{mPA} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \quad (15)$$
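The mIoU and mPA definitions can both be evaluated from a single confusion matrix, whose entry $[i, j]$ counts pixels of true class $i$ predicted as class $j$. The sketch below is an illustrative NumPy implementation (the six-class default matches the paper's five defects plus background):

```python
import numpy as np

def confusion_matrix(pred, target, num_classes=6):
    """Entry [i, j] counts pixels of true class i predicted as class j."""
    idx = target.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(cm):
    """mIoU: average over classes of p_ii / (row_i + col_i - p_ii),
    skipping classes absent from both prediction and ground truth."""
    diag = np.diag(cm).astype(float)
    union = cm.sum(axis=1) + cm.sum(axis=0) - diag
    iou = np.where(union > 0, diag / np.maximum(union, 1), np.nan)
    return np.nanmean(iou)

def mean_pixel_accuracy(cm):
    """mPA: average over classes of p_ii / row-sum (per-class recall)."""
    diag = np.diag(cm).astype(float)
    row = cm.sum(axis=1)
    pa = np.where(row > 0, diag / np.maximum(row, 1), np.nan)
    return np.nanmean(pa)
```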
3.4. Comparison of the Results and Analysis
The Swin-T, Swin-S, and Swin-B backbones were used in the algorithm of this study for the gear surface defect segmentation task, and the mIoU and mPA results for these models are presented in Table 5. As shown, the mIoU and mPA improved with increasing model size; the mIoU and mPA with Swin-B as the backbone were improved by 0.9% and 2.24%, respectively, compared with those with Swin-T.
To investigate the extraction of image features by the differently sized Swin-T, Swin-S, and Swin-B models, the detection performance was visualized and analyzed using Gradient-weighted Class Activation Mapping (Grad-CAM) [27], as shown in Figure 7. In the segmentation of cracks (rows 1 and 2 both show crack detection), Swin-T showed the highest responsiveness to crack defects when there was a large contrast between the crack and the background (row 1). However, when the difference between the crack and the background was not obvious (row 2), all three networks had low responsiveness; that is, such samples were not easily analyzed, and this class of samples had a negative impact on the network convergence. Among the three Swin Transformer models, Swin-B showed the most accurate distribution of high-response areas and could effectively focus on and cover most of the defect regions, demonstrating its ability to learn the characteristics of gear surface defects, whereas the high-response areas of Swin-S were not obvious. For Swin-T, the high-response areas covered the defect regions but also suffered from attentional bias. These results showed that the detection performance of the Swin-T, Swin-S, and Swin-B models improved sequentially, and the Swin-B model exceeded the other models in recognition efficiency and generalization.
To investigate the effect of adding the FSM on the detection of each defect, the quantitative detection results were examined, as shown in Figure 8. In spalling defect detection, the accuracy and mIoU were almost unchanged after the addition of the FSM, indicating that the FSM provided no obvious advantage for this type of defect. In the detection of bonding and rust, which resemble the background and are therefore difficult to detect, the FSM led to improvements of 3% and 2%, respectively; thus, the FSM performed best for these types of defects. In crack defect detection, owing to the small sample size of the crack defects, the addition of the FSM improved the performance only slightly, and a few pixels were still missed. In summary, the FSM filtered the image features well and showed a better detection performance for most gear surface defects; however, when detecting defects with very few images, such as cracks, the efficacy of the FSM was limited by the number of images, and the performance gain was reduced because fewer features were available for the model to learn.
To evaluate the effect of each module on the network detection performance, ablation experiments were designed. The experimental results are presented in Table 6, where “√” indicates that the module is used and “---” indicates that it is not. The mIoU and mPA of the baseline algorithm, i.e., without the FSM, were 73.77% and 80.31%, respectively. For Model 2, with the addition of the FSM, the mIoU and mPA were improved by 0.12% and 3.34%, respectively. For Model 3, with the addition of Swin-B, the mIoU and mPA were improved by 1.00% and 3.70%, respectively. For Model 4, with the addition of both the FSM and Swin-B, the mIoU and mPA were improved by 1.21% and 3.88%, respectively. According to these results, the detection network combining Swin-B and the FSM achieved a higher detection accuracy than networks using Swin-B or the FSM alone.
For the established training and validation sets, both the proposed algorithm and PSPNet were trained for 300 epochs with a batch size of 4, and the Adam optimizer was used to optimize the network parameters. The mIoU and loss curves of the proposed algorithm and the PSPNet network with respect to the number of iterations are shown in Figure 9.
As indicated by Figure 9a, the accuracy of the proposed algorithm rose rapidly in the early stage of training; it converged faster than the PSPNet model, and the segmentation accuracy on the validation set exceeded 70%. At approximately 70 iterations, the mIoU of the proposed algorithm on the validation set fluctuated significantly. A transfer-learning strategy was used to freeze a large number of network parameters before 70 iterations to accelerate convergence; after 70 iterations, these parameters were unfrozen and updated, causing the mIoU curve of the validation set to oscillate, while the loss value also broke through its bottleneck and decreased again. The loss curves in Figure 9b indicate that after 270 iterations, the loss values of the two models stabilized; the loss of the proposed algorithm remained below 0.3, and that of PSPNet remained below 0.6. The comparison of the loss curves and the mean intersection over union showed that the detection performance of the proposed model was better than that of PSPNet overall, making it suitable for the gear surface defect detection task.
To evaluate the segmentation efficiency of the different models, tests were performed on the test set; 50 images were selected for each test, including rust, deviation, bonding, crack, and spalling images. The test results are presented in Table 7. Compared with the original PSPNet network, the proposed model has an advantage in segmenting gear defects in complex backgrounds, outperforming PSPNet (ResNet50) by 1.21% and 3.88% in mIoU and mPA, respectively. A comparison of the number of parameters and the GPU inference time of the different models shows that the proposed model achieved a higher accuracy at the cost of only a small increase in model parameters and inference time.
The comparison results for the proposed algorithm, U-Net, DeepLabv3+, and PSPNet are shown in Figure 10. The segmentation results obtained by the proposed method closely matched the ground-truth label images, with a good segmentation effect and fine segmentation edges, showing that the method is suitable for gear surface defect detection.