1. Introduction
As the demand for renewable energy sources grows, wind power has emerged as an efficient and safe solution. This sector has experienced significant expansion in recent years, driven by governmental incentives, lower production costs, and advancements in turbine technology [1]. Within the United States, wind power accounted for 22% of new electricity capacity installed in 2022, reaching a total of 90,000 installed turbines [2]. Globally, the trend is equally robust. Europe increased wind generation capacity by 18.3 GW in 2023 as part of its continued commitment to green energy and 2030 energy targets [3]. This rapid growth in wind energy production presents new challenges in maintenance and inspection, necessitating innovative solutions to sustain efficiency and reliability.
Inspection and early preventive maintenance are vital to ensure the safe production of wind energy. Throughout the lifetime of a turbine, the stresses and harsh operating environment lead to damage that, if not addressed, can result in total loss. Blade repairs cost between 3000 and 16,000 USD, depending on the severity and size of the damage, while total blade replacement requires significantly more capital [4]. Common damage to wind turbine blades (WTBs) includes delamination, structural cracks, and erosion at the leading edge. These commonly result from external factors such as strong wind, rain, snow, salt, and temperature fluctuations [5]. Delamination and surface cracks result from adhesive joint failures, core debonding, or excess loads, while leading-edge erosion is often caused by the blade cutting through particulate matter over the lifetime of the turbine.
Current WTB fault detection methods include sensor-based monitoring, Supervisory Control and Data Acquisition (SCADA) system analysis, and visual inspection [6]. Visual inspection is used to identify and monitor blade defects. Its methods have included RGB imaging, thermal imaging, and a fusion of these modes that makes auxiliary information available during inference [7]. Deep learning has proved pivotal in advancing this field [6]. Additionally, the integration of deep learning and drones allows for an autonomous solution to the inspection problem, which is a promising direction as turbines continue to scale [8]. Existing work on the detection of WTB faults typically addresses architectural changes to a single deep learning network to improve performance for a given task. While this is an important contribution, the comparison between distinct architectural designs, namely single-stage and two-stage detectors, remains underexplored. This study investigates the performance of these design patterns through YOLO and Mask R-CNN to determine which achieves the best performance on turbine blade fault localization. The dataset used in this study was introduced in our previous classification research [9]. Building upon this, the dataset is annotated for object detection architectures, allowing for the model comparison presented here. Additionally, methods such as the Adaptive Kriging Damage Assessment (AK-DA) improve inspection efficiency and reliability, further aiding in maintenance and extending equipment lifespans [10].
The focus of this study is on the halted wind turbine scenario, which allows for the inspection of WTBs without rotation. For turbines in operation, wind speed and direction can cause drone destabilization, changes in turbine yaw, and spinning of the blades. This leads to blurry or unfocused aerial images, which makes blade fault inspection more challenging. Although this is not the direction of our work here, recent works explore CNN-BiLSTM networks for predictive analysis of wind speeds [11], which can help stabilize the drone while collecting images, and pre-processing techniques to de-blur the aerial images [12]. These could be incorporated to allow inspection in the active turbine scenario. The contributions of this work are as follows:
Examination of single-stage and two-stage detection algorithms, YOLO and Mask R-CNN, in WTB fault localization
Modifications to the backbone of Mask Region-based convolutional neural network (R-CNN) to reduce computational complexity and increase accuracy
In the following section, we investigate the current methods and architectures that exist for WTB fault localization and their performance. Then, an overview of the architectures investigated in our research is discussed along with the model precursors, simulations, and results. Finally, a discussion of the findings is made, and conclusions are drawn.
2. Literature Review
The growing availability of high-performance computational resources coupled with the surge in research interest surrounding deep learning has increased the application of these algorithms to the domain of wind turbine inspection [13]. Of these inspection methods, object detection remains dominant in the literature, providing an accurate solution to the localization and classification of WTB anomalies. The algorithms deployed in this sector consist of either a single-stage model or a two-stage model. The inherent trade-offs in these different architecture design methods include inference computation time and accuracy, with two-stage models traditionally garnering higher accuracy overall. However, in resource-constrained or time-sensitive environments, single-stage architectures can provide close to real-time inference. Here, the literature introduces novel modifications to increase the accuracy, inference speed, or both. In this section, a breakdown of the recent methods and results is discussed for both single-stage detectors and two-stage detectors, followed by alternative methods.
To reduce the computational complexity of the backbone of You Only Look Once (YOLO) version 4, Zhang, Yang, and Yang incorporated a MobileNetv1 architecture as the feature extractor [14]. The model was further enhanced with attention modules, including Squeeze and Excitation Networks (SENet), Convolutional Block Attention Module (CBAM), and Efficient Channel Attention for Deep Convolutional Neural Networks (ECANet) [15]. In a sensitivity study, the backbone-only adjustment reduced overall model complexity but decreased performance compared to base YOLOv4. With the addition of attention modules, specifically SENet and ECANet, the model precision overtook base YOLOv4 while maintaining the reduced model complexity from the backbone alteration [14]. The proposed model, MobileNetv1-YOLOv4, achieved an mAP50 of 88.61% on their created dataset consisting of spalling, pitting, crack, and contamination faults on WTBs [14]. Similarly, Mohammadi and Sharifian investigated the performance of YOLOv4 by adding Spatial Attention Modules (SAMs) and the Mish activation function [16]. With the added generalization of pre-trained COCO weights for distillation and data augmentation schemes to introduce training variance, the proposed model achieved 86.17% mAP50 [16].
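The Mish activation mentioned above has a simple closed form, x · tanh(softplus(x)). The following is a minimal scalar sketch in Python; the function names are illustrative and are not taken from the reviewed work, which may implement the activation differently inside its training framework.

```python
import math


def softplus(x: float) -> float:
    """Softplus, log(1 + exp(x)), with a shortcut that avoids overflow for large x."""
    if x > 20.0:
        # For large x, log(1 + exp(x)) is numerically indistinguishable from x.
        return x
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Smooth and non-monotonic: small negative inputs give small negative outputs.
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(value, round(mish(value), 4))
```

Unlike ReLU, the function passes a small negative signal for negative inputs instead of clipping it to zero, and it approaches the identity for large positive inputs.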
To increase the performance of YOLOv5 on minor target defects in WTBs, Ran et al. proposed novel architecture modifications, including an improved feature pyramid network, a Coordinate Attention (CA) module, and integration of an Efficient Intersection over Union (EIoU) loss function [17]. The alteration to the pyramid network utilizes the Bi-directional Feature Pyramid Network (BiFPN) to enable the network to learn input features of varying importance through weighted feature fusion [17]. Additionally, the CA module allows the model to augment the representations of interest. Finally, with EIoU implemented, the model achieved better localization of the WTB defects. To test their proposed model, a dataset of 599 images collected with Unmanned Aerial Vehicles (UAVs) was utilized after being synthetically augmented to 2995 images. The subsequent ablation experiment indicated that adding the CA module gave the largest improvement, increasing mAP50 from 80.5% in base YOLOv5s to 82.5%. Finally, with all modifications, the model achieved 83.7% [17].
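To make the idea of weighted feature fusion concrete, the sketch below shows the fast normalized fusion commonly associated with BiFPN: each input feature receives a learnable scalar weight, the weights are kept non-negative with a ReLU, and the result is normalized by their sum. This form is an assumption made for illustration; the reviewed paper may use a different variant. The function name `fuse_features` and the flat-list feature representation are hypothetical.

```python
from typing import List, Sequence


def fuse_features(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fast normalized fusion: sum_i(w_i * F_i) / (eps + sum_j w_j), with w_i >= 0.

    `features` holds same-length feature vectors (already resized to a common
    resolution); `weights` holds one scalar per feature, learned during training
    in a real network.
    """
    if len(features) != len(weights):
        raise ValueError("one weight is required per input feature")
    # ReLU keeps every weight non-negative so the normalization stays stable.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps
    length = len(features[0])
    return [
        sum(w * feature[k] for w, feature in zip(clipped, features)) / total
        for k in range(length)
    ]


if __name__ == "__main__":
    shallow = [1.0, 2.0]
    deep = [3.0, 4.0]
    # Equal weights give (approximately) the element-wise mean of the inputs.
    print(fuse_features([shallow, deep], [1.0, 1.0]))
    # A larger weight lets one input dominate the fused output.
    print(fuse_features([shallow, deep], [0.1, 2.0]))
```

Because the weights are learned, the network can decide how much each resolution level contributes at every fusion node instead of summing all inputs equally.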
Echoing the performance abilities of YOLOv5, Liu, An, and Yang also investigated BiFPN as a replacement for the base YOLOv5 Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) to allow better feature fusion on WTBs [18]. Additional modifications in the proposed architecture included focal loss, which emphasizes hard-to-classify examples to better handle class imbalance. In addition, SENet attention modules were included to enhance feature extraction and reduce redundant feature information. The C3 module was also replaced with the C2F module, resulting in a more abundant gradient flow [18]. With these architecture changes, precision increased by 1.9% over the base YOLOv5s [18]. Using the performance gains of BiFPN and CBAM, Yu et al. proposed an enhanced YOLOv8 architecture [19]. As in [17,18], BiFPN replaced PANet, which prioritizes small target features and increases multi-scale fusion. To further improve small-target acquisition, the CBAM attention module was integrated into the backbone. Finally, EIoU was deployed to help with model training and convergence. For the validation of model performance, a dataset was created through UAV inspection of WTBs with the following classes: breakages, cracks, and scratches [19]. The ablation experiment indicated the highest performance increase through BiFPN, which achieved an mAP50 of 83.3% compared to a base YOLOv8 of 81.9% [19].
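The EIoU loss used in several of the works above extends the plain IoU loss with three penalty terms: the squared distance between box centers, and the squared differences in width and in height, each normalized by the smallest enclosing box. The sketch below follows that standard formulation for axis-aligned boxes given as (x1, y1, x2, y2); the function name `eiou_loss` is illustrative, and the reviewed implementations may differ in detail.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def eiou_loss(pred: Box, target: Box, eps: float = 1e-9) -> float:
    """EIoU loss: 1 - IoU + center term + width term + height term."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union of the two boxes.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both inputs.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared center distance, normalized by the enclosing box diagonal.
    dx = (px1 + px2) / 2.0 - (tx1 + tx2) / 2.0
    dy = (py1 + py2) / 2.0 - (ty1 + ty2) / 2.0
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Squared width and height differences, normalized separately.
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + center_term + width_term + height_term


if __name__ == "__main__":
    print(eiou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # perfect match
    print(eiou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))  # no overlap
```

Because the center, width, and height terms stay informative when the boxes do not overlap, the loss still provides a gradient in cases where the plain IoU loss is flat, which is one reason it is reported to help convergence.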
In an effort to mitigate the complexity of YOLOx and enhance multi-scale small-target WTB feature detection, Yao, Wang, and Fan introduced WT-YOLOx [20]. This novel architecture incorporates a RepVGG backbone, which effectively diminishes the computational burden of YOLOx, thereby augmenting real-time performance while preserving extensive feature fusion through a cascade feature fusion module. To further optimize model performance and address class imbalance, focal loss was employed. For the purpose of evaluating the architecture, a total of 725 images were captured using UAVs at a wind farm in Mongolia. To augment the dataset’s variability, image augmentation techniques were applied, culminating in a comprehensive dataset of 5800 images encompassing the classes of pollution, fix, crack, and break [20]. These architectural enhancements culminated in an impressive 94.29% mAP50, surpassing the performance of both YOLOx and SSD architectures [20]. The modifications of the YOLO architectures for WTB fault diagnosis can be summarized as follows.
Backbone Alteration: By adjusting the backbone feature extraction architecture, higher performance can be gained in low-level feature propagation through the network. This can result in higher accuracy in hard-to-detect classes in fault diagnosis. Additionally, modifications can aim to reduce overall computational complexity, resulting in better real-time and edge performance.
Loss Function: By changing the loss function, the model can be driven toward specific goals. Commonly, focal loss is deployed, which achieves better performance on hard-to-detect or partially obscured defects by weighting these more heavily than easy objects of interest.
Attention Modules: To enhance the feature maps produced by the network, attention modules apply a learned weighting to prominent features to increase their representation through the network. This has been shown to be a valuable addition in fault identification, specifically for performance on smaller anomalies.
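As an illustration of the loss-function point above, the sketch below implements the binary form of focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the commonly quoted values and are assumptions here, not settings reported by the reviewed works; the function name is illustrative.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for a predicted probability `p` and a label `y` in {0, 1}."""
    # Clamp the probability to keep the logarithm finite.
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = focal_loss(0.9, 1)  # confident and correct
    hard = focal_loss(0.1, 1)  # confident and wrong
    print(easy, hard, hard / easy)
```

With γ = 0 and α = 0.5 the expression reduces to half of the ordinary cross-entropy; increasing γ progressively suppresses the contribution of easy examples so that rare or hard defects dominate the gradient.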
Mask R-CNN, a prominent network architecture in WTB visual analysis, offers instance segmentation alongside precise bounding box predictions. In this context, the term mask refers to the capability of the model to generate pixel-wise segmentation masks for each detected object instance. These masks outline the exact shape and extent of the object within the image, providing detailed information about its spatial distribution. Consequently, the masks generated by Mask R-CNN enable researchers to perform further analysis, such as assessing the size and severity of defects on WTBs. For instance, Zhang, Wen, and Liu proposed a pre-processing pipeline aimed at refining mask data obtained from UAV imagery to filter out background noise [21]. This was accomplished through the proposed MinReact Network (MRnet), where the Graham scan algorithm extracts the convex hull points of a fault, allowing the image to be oriented so that the area of the bounding box is minimized. This was followed by DenseNet-121 forming a new predictive head [21]. This method achieved an mAP50 of 98.7% on the utilized dataset, showing promising accuracy in WTB fault detection. However, the authors indicated a reduction in inference speed due to redundant calculations [21].
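The orientation step described above can be illustrated with a small geometric sketch. It computes the convex hull of a set of fault points and then tests each hull edge as a candidate orientation, keeping the rotation whose axis-aligned bounding box has the smallest area. Note two assumptions: the hull is built with Andrew's monotone chain, a close variant of the Graham scan, rather than the Graham scan itself, and the edge-by-edge search is a generic minimum-area rectangle method, not necessarily the procedure used in the reviewed network. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def min_area_rect(points: Sequence[Point]) -> Optional[Tuple[float, float]]:
    """Return (area, angle in radians) of the smallest enclosing rectangle.

    A minimum-area enclosing rectangle always has one side collinear with a
    hull edge, so only the hull edge directions need to be tested.
    """
    hull = convex_hull(points)
    best: Optional[Tuple[float, float]] = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y2 - y1, x2 - x1)
        # Rotate by -theta so the current edge lies along the x-axis.
        c, s = math.cos(-theta), math.sin(-theta)
        xs = [x * c - y * s for x, y in hull]
        ys = [x * s + y * c for x, y in hull]
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        if best is None or area < best[0]:
            best = (area, theta)
    return best


if __name__ == "__main__":
    # A diamond-shaped fault outline: the upright bounding box has area 4,
    # while the box aligned with the diamond's edges has area 2.
    diamond = [(1.0, 0.0), (2.0, 1.0), (1.0, 2.0), (0.0, 1.0), (1.0, 1.0)]
    print(convex_hull(diamond))
    print(min_area_rect(diamond))
```

Aligning the box with the fault in this way reduces the amount of background enclosed with each defect, which is the motivation given for the orientation step.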
Identifying gaps in object detection performance metrics, Zhang, Cosma, and Watkins proposed three new evaluation measures for fault detection, namely Prediction Box Accuracy (PBA), Recognition Rate (RR), and False Label Rate (FLR) [22]. Additionally, an image augmentation pipeline was introduced to further enhance the predictive capabilities of the Mask R-CNN architecture. The simulations used a dataset created through routine WTB maintenance and consisting of the following classes: cracks, erosion, voids, and others. With the proposed image enhancement and augmentation pipeline consisting of flipping, rotation, and histogram equalization, the model achieved an mAP50 of 84.21% [22]. Seeking computational performance gains, Diaz and Tittus utilized depthwise separable convolution in the ResNet50 backbone of the Cascade Mask R-CNN model [23]. This reduction in backbone complexity resulted in a 13 million parameter decrease overall. Inference time, when compared to Mask R-CNN, was reduced dramatically from 368 s to 136 s [23]. The proposed model was evaluated on a dataset consisting of lightning damage, rust, erosion, crack, and broken parts, along with data augmentation. Here, the decoupled convolution layers in the proposed architecture provide comparable accuracy with a considerable inference speed increase, at an overall mAP of 84.36% [23].
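The parameter savings of depthwise separable convolution follow from simple counting. A standard k × k convolution needs k·k·C_in·C_out weights, whereas the separable version needs k·k·C_in weights for the depthwise step plus C_in·C_out for the 1 × 1 pointwise step. The sketch below works through one layer; the layer size (3 × 3, 256 to 256 channels) is an illustrative assumption and is not taken from the reviewed model, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise (per-channel k x k) plus pointwise (1 x 1) convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256  # illustrative layer size
    standard = standard_conv_params(k, c_in, c_out)          # 589,824 weights
    separable = depthwise_separable_params(k, c_in, c_out)   # 67,840 weights
    print(standard, separable, round(standard / separable, 2))
```

For this layer the separable form uses roughly one ninth of the weights, which indicates how replacing many such layers in a backbone can add up to a reduction on the order of millions of parameters.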
After reviewing all these related works, it is clear that while significant advancements have been made in applying deep learning to WTB fault detection, there remain critical gaps that our study addresses. Current research predominantly focuses on individual architectural modifications without comprehensive comparative analyses between single-stage and two-stage detectors. Our study is crucial as it provides a detailed comparison of YOLO and Mask R-CNN architectures, highlighting their performance in terms of accuracy and computational efficiency under consistent conditions. Additionally, the creation of a new synthetic dataset simulating real-world conditions fills a vital resource gap, enabling more practical and scalable solutions. By addressing these gaps, our study enhances the robustness and efficiency of WTB fault detection systems, contributing to improved maintenance and reliability in wind turbine operations. The modifications of the Mask R-CNN architectures for WTB fault diagnosis can be summarized as follows.
Image Enhancement and Data Augmentation: With the inherent difficulty of image collection for WTB inspection, image enhancement techniques can provide de-blurring or feature enrichment. Additionally, the use of data augmentation has been heavily explored for further performance gains.
Backbone Alteration: Because of the two-stage architecture design of Mask R-CNN, its inference speed is considerably slower than that of YOLO. Novel backbone changes that reduce complexity can bring this architecture closer to real-time detection with minimal accuracy impacts. Additionally, architecture improvements can further increase the accuracy of this model for defect identification.
5. Discussion
Further investigation of the model performance was conducted following the optimal performance and comparison of each architecture. The top model and size were selected with corresponding test set inference results for visualization, allowing trend and pattern analysis in model prediction.
Figure 10 shows a sample of the predictions from the YOLOv9C model. Figure 10a illustrates the misclassifications of the model. Reflections from the shiny blade produced features that looked like cracks, leading to an increased number of false detections. Additionally, crack-like features outside the blade area, arising from the background and turbine chassis, also contributed to false detections. Figure 10b illustrates the confident predictions from the YOLOv9C model. The model was capable of localizing each small hole feature and crack along the width of the turbine blade. Furthermore, the erosion damage, which is angled to offer more background information, attained high confidence values. Analyzing the inference results for each YOLO model and size, including v5, v8, and v9, shows a common trend in hole classification that explains the lower mAP values in Table 5. Figure 11 contains a sample of these inference results along with their corresponding ground truth labels. Here, the model commonly classifies the mounting hardware as hole damage. While the mounting points contained distinctly hole-like features, they were not labeled as such during dataset creation. This led the models to generalize the hole damage features and still detect and classify in these cases, resulting in lower overall performance. Additionally, the crack classification results observed in YOLOv9 continued in each YOLO architecture, with turbine chassis detections and light reflections causing false detections.
As discussed earlier, Table 7 describes the performance of Mask R-CNN with three backbones, VGG19, ResNet50, and ResNet18, on the CAI-SWTB dataset. According to the inference results in this table, Mask R-CNN with ResNet18 shows the most promising results, with a higher combined mAP50 box score and a lower computational cost. Figure 12 shows a sample of inference results from the Mask R-CNN model with the ResNet18-FPN backbone network. The instance segmentation and localization abilities suffer when the turbine chassis is present within the data. The shadow created by the hub, chassis, and blade, shown in Figure 12a, leads to crack and erosion false detections, explaining the lower erosion class metrics for each of the modified backbone architectures. However, when the blade is the main component of the composition, localization confidence is considerably higher. This is further illustrated in Figure 12b, where the confidence for erosion, crack, and hole detection is well above 90% and the model provides instance segmentation, allowing for pixel-level classification. Figure 13a shows additional samples of these common misclassifications from Mask R-CNN with a ResNet18 backbone, which were also observed using ResNet50 but with different confidence values. The corresponding ground truth labels of healthy blade images are illustrated in Figure 13b.
This study investigated two popular architectures for the localization of faults in WTBs: YOLO and Mask R-CNN. While traditionally two-stage detectors, like Mask R-CNN, are considered more accurate, our study shows that the single-stage design of YOLO achieves a comparable performance. This raises the question of which architecture should be chosen for fault localization in WTBs. To answer this, multiple aspects must be considered. YOLO achieves promising precision while maintaining low computational costs and the ability to achieve close-to-real-time inference. However, Mask R-CNN provides comparable precision with the addition of instance segmentation. For the case of fault localization in WTBs, real-time performance is not essential, and thus, the greater level of detail could be considered a larger asset for maintenance engineers, allowing more precise size estimation. Alternatively, if computational performance is limited, a light YOLO model could be deployed to provide quick inspection results in a constrained environment.
6. Conclusions and Discussion on Limitations
With the rise in installed turbines comes a mounting need for improved maintenance and inspection techniques. In this research, two popular object detection architectures were deployed and compared for performance on the localization task of WTB faults. The YOLO family of architectures was compared across versions 5, 8, and 9, each at three parameter sizes, to allow a comprehensive performance comparison. This study showed that the most recent version of YOLO, YOLOv9 with size c, provided the highest mAP50 and mAP50-95, with 0.849 and 0.539, respectively. In addition to the YOLO architectures, Mask R-CNN was investigated along with altered feature extraction layers for improved performance. Here, it was shown that the base ResNet50-FPN network provides adequate performance but can be improved through a reduction in layers and parameters with a ResNet18-FPN backbone. This alteration provided a higher box mAP50 of 0.8415 and an instance segmentation mAP50 of 0.7933 while reducing computation through a 32-layer reduction in the backbone. While the scores obtained by the modified Mask R-CNN did not surpass the latest YOLOv9 architecture results, they did provide a greater level of detail with instance segmentation masks. The comparative study on WTB fault localization obtained promising results. This enhanced precision can be leveraged to improve inspection accuracy and throughput as increased demand drives turbine installation numbers.
While this study provides promising results on the identification and localization of wind turbine blade faults, there remain challenges to be addressed. The dataset utilized in this study was created in a controlled environment simulating the halted wind turbine scenario, which is the current trend of inspection. However, the loss of power production is a constraint of this method. If automated methods can collect images while the wind turbine is active, this loss of power can be eliminated. Here, the rotation of the blades, along with the yaw of the turbine due to the environmental wind direction and speed, can blur the aerial images taken with the drone, resulting in poor image quality. The drone itself can experience turbulence, causing unfocused imaging. To tackle these issues, more wind-resistant drones with better camera payloads can be a solution at an additional cost. As an alternative, de-blurring methods such as Lucy–Richardson or Wiener filtering can be applied as a pre-processing stage to mitigate blur in the captured images before feeding them to the deep learning stage for blade fault identification and localization [12].
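To indicate how the Lucy–Richardson method mentioned above operates, the following is a one-dimensional toy sketch. It assumes a known, normalized, odd-length blur kernel (point spread function) and noise-free data; real aerial images would require a two-dimensional version and an estimate of the motion blur, neither of which is covered here. The function names are hypothetical and the code is not taken from the cited work.

```python
from typing import List, Sequence


def correlate_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide `kernel` over `signal`, replicating edge samples at the borders."""
    half = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - half, 0), last)
            acc += signal[idx] * k
        out.append(acc)
    return out


def richardson_lucy_1d(
    observed: Sequence[float],
    psf: Sequence[float],
    iterations: int = 30,
    eps: float = 1e-12,
) -> List[float]:
    """Iteratively sharpen `observed`, assuming it was blurred by `psf`."""
    mean = sum(observed) / len(observed)
    estimate = [mean] * len(observed)  # flat, non-negative starting guess
    flipped = list(psf)[::-1]
    for _ in range(iterations):
        # Blur the current estimate (convolution = correlation with flipped kernel).
        reblurred = correlate_same(estimate, flipped)
        # Compare the observation with the re-blurred estimate.
        ratio = [o / (r + eps) for o, r in zip(observed, reblurred)]
        # Spread the ratio back through the kernel and apply it multiplicatively.
        correction = correlate_same(ratio, psf)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


if __name__ == "__main__":
    psf = [0.25, 0.5, 0.25]
    blurred_edge = [0.0, 0.0, 0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0]
    restored = richardson_lucy_1d(blurred_edge, psf, iterations=30)
    print([round(v, 3) for v in restored])
```

The update is multiplicative, so the estimate stays non-negative, and each iteration concentrates the blurred peak further toward its original location.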
Another solution includes mounting an anemometer, such as the TriSonica Mini Ultrasonic Anemometer, to obtain real-time wind speed and direction data to better control and stabilize the drone as it takes aerial images of the blades. This approach would allow the drone to collect more focused images of halted turbine blades in the harsh environments present in onshore and offshore wind farms. As an alternative, CNN-based wind forecasting approaches can also be used for drone stabilization and path-planning [11].
Finally, to allow for a more generic solution for inference on commercial WTBs, more publicly available datasets are required. While the presented dataset is limited in scope and size, the goal was to increase the availability of wind turbine fault datasets through open access. Furthermore, the dataset leveraged in this study is limited to three major fault types, i.e., erosion, holes, and cracks. While these are some of the most common faults, they are not a comprehensive list of those seen in operational wind turbines. Further expanding the scope of faults will allow even greater levels of generalization for fault detection.