1. Introduction
As a leader in global wind energy development, China ranked first in the world in installed wind power capacity by 2021. In light of China's dual carbon targets, it is essential not only to increase the total installed capacity of wind turbines but also to address faults in the earlier installations. Fault detection of wind turbines is therefore particularly important against the background of the dual carbon goals, and this need in turn promotes the development of wind turbine fault detection technology [1]. At present, fault detection of wind turbine impeller blades has become a research focus in this field. As main components of the wind turbine [2], the impeller blades operate in harsh environments and are subjected to alternating loads for long periods, which leads to a number of failures. For example, when the impeller rotates, external alternating loads can easily loosen the blade bolts and, in serious cases, even fracture them [3]. The failure of blade bolts directly affects the aerodynamic performance of the impeller, causing additional and unbalanced loads on the wind turbine and ultimately reducing its overall service life. It is therefore vitally necessary to study fault detection methods for wind turbine blade bolts; such methods are also of great significance for reducing the maintenance cost of the wind turbine and increasing its power generation.
The fault detection methods for blade bolts mainly include vibration signal detection technology [4], the finite element analysis method [5], and machine-learning methods [6]. In recent years, machine learning has been widely used in bolt failure detection. Reference [7] used a multi-scale fuzzy entropy damage index to extract important bolt features and construct a dataset, which was then input into a least-squares support vector machine classifier to detect bolt faults. Reference [8] applied visual processing to collected bolt images and extracted relevant features to realize bolt-loosening detection; the feasibility of the method was verified in simulation experiments. Reference [9] combined deep learning with a mechanics-based manual-torque method for bolt fault detection, which effectively reduced the cost of manual inspection and improved the performance of the fault detection model.
When machine-learning methods are used to detect faults in wind turbine components, handling class imbalance in the data has also become a research hotspot. In machine learning, common approaches to class imbalance include oversampling, undersampling, cost-sensitive learning, and optimized classification algorithms [10]. The main approach studied in this paper is oversampling. Representative oversampling methods include random oversampling [11], SMOTE (Synthetic Minority Oversampling Technique) [12], Borderline-SMOTE [13], and K-means SMOTE [14]. Since the samples produced by random oversampling are duplicates of the raw samples, they contribute little to the training of the model, so researchers at home and abroad mostly use SMOTE to expand fault samples.
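The core idea of SMOTE, as opposed to random duplication, is to interpolate new minority samples between a sample and one of its k nearest minority neighbors. The sketch below illustrates only that interpolation step in NumPy; it omits the edge-case handling of the full algorithm, and in practice a library implementation such as imbalanced-learn's `SMOTE` class would be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    X_min : (n, d) array of minority-class samples.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise squared distances between minority samples
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # exclude the sample itself
    k = min(k, n - 1)
    neighbors = np.argsort(d2, axis=1)[:, :k]  # k nearest minority neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = neighbors[i, rng.integers(k)]      # one of its k neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority samples, SMOTE ignores the overall distribution of the minority class and can amplify noise samples, which is exactly the weakness the improved variants below try to address.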
Obtaining high-quality synthetic samples is sensitive to noise samples and to the distribution characteristics of the raw data. Directly applying SMOTE to expand the data can achieve a certain effect, but it easily causes data redundancy for datasets with extreme class imbalance. Improved variants of SMOTE are therefore generally used to obtain high-quality synthetic samples. For example, Li et al. [15] proposed a hybrid cluster-boundary fuzzification sampling method to address the low accuracy of detection models caused by class imbalance, which not only balanced the classes but also alleviated the data redundancy problem. However, this method does not fully consider the original distribution characteristics of the fault data. Aiming at class imbalance in intrusion detection data, Zhang et al. [16] combined traditional SMOTE with GMM-based random undersampling to balance the dataset. Yi et al. [17] proposed a SMOTE sampling method based on fault-class clustering to handle class imbalance in wind turbine SCADA data, which solved the problem that traditional SMOTE does not consider the distribution characteristics of the original fault samples when synthesizing new ones. However, the above methods remain susceptible to noise samples.
Cost-sensitive learning is also widely used to handle class imbalance. Aiming at class imbalance in the operational data of wind turbine gearboxes, Tang et al. [18] introduced a cost-sensitive matrix into the LightGBM algorithm, which mitigated the degradation of detection performance caused by the imbalance. To make cost-sensitive learning independent of a manually designed cost matrix, Reference [19] proposed a genetic-algorithm-based method to construct the cost-sensitive matrix automatically; the improved matrix was then used to adjust the attention weight of the associated classification algorithm on fault-class samples. Introducing cost-sensitive learning into classification algorithms has been well verified for class imbalance problems in various fields. However, when the class imbalance of the dataset is severe, merely using cost-sensitive learning to adjust the classifier's learning weight on fault-class samples has a limited effect.
LightGBM is widely used in wind turbine fault detection because of its strong classification performance and fast training speed [20]. For example, aiming at the frequent faults of wind turbine gearboxes, Tang et al. [21] proposed a fault diagnosis model based on adaptive LightGBM, which was well verified in practical cases. Wang et al. [22] constructed a cascaded framework of SAE-based abnormal-state detection and LightGBM-based abnormal-state classification, which solved the insensitivity of the wind turbine alarm system to early faults. However, when a LightGBM-based fault detection model is applied to class-imbalanced data, its detection performance is difficult to guarantee.
Therefore, to address the above problems, this paper proposes a fault detection method for wind turbine blade bolts based on GSG combined with CS-LightGBM. The method combines the new sampling method proposed in this paper with cost-sensitive learning to solve the class imbalance problem in the operating data of wind turbine blade bolts.
3. Wind Turbine Blade Bolt Fault Detection Model
The fault detection model of wind turbine blade bolts is shown in Figure 7. It consists of three main models: a data preprocessing model, a GSG oversampling model, and a CS-LightGBM training and evaluation model.
3.1. Data Preprocessing
The data preprocessing model performs the relevant processing operations on the original wind turbine blade bolt dataset, such as data cleaning and standardization, encoding the text labels of the dataset into numerical labels with a one-hot encoder, and using XGBoost to rank feature importance for feature selection.
Data cleaning mainly handles the abnormal and missing values in the wind turbine blade bolt dataset. Since the dataset contains far more normal-class samples than fault-class samples, the approach adopted in this paper is to directly remove normal-class samples with abnormal or missing values, while abnormal or missing values in fault-class samples are filled with the mean, thereby reducing the loss of fault-class samples.
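The asymmetric cleaning rule above can be sketched in a few lines of pandas. The column names here are illustrative placeholders, not the actual SCADA fields of the bolt dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bolt dataset; label == 1 marks fault samples.
# ("A_bolt_load" is a hypothetical feature name, not the paper's actual field.)
df = pd.DataFrame({
    "A_bolt_load": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "label":       [0,   0,      0,   1,   1,      1],
})

normal = df[df["label"] == 0].dropna()   # drop incomplete normal samples
fault = df[df["label"] == 1].copy()
fault = fault.fillna(fault.mean())       # mean-fill incomplete fault samples

clean = pd.concat([normal, fault], ignore_index=True)
```

The design choice is deliberate: normal samples are abundant, so dropping a few costs little, whereas every fault sample carries scarce information and is worth imputing.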
One-hot encoding converts the text labels in the wind turbine blade bolt dataset into numerical labels for machine identification. Since the model solves a binary classification problem, the normal and fault labels in the blade-bolt operating data are defined as P = [0, 1].
When machine-learning algorithms learn from data, the best feature set is one that is sufficient and necessary. Too few features easily lead to underfitting of the model, while too many features not only cause overfitting but also consume excessive computing resources, so selecting an appropriate number of features is very important. In this paper, the XGBoost feature importance algorithm is used to rank the features of the wind turbine blade bolt dataset. The principle is as follows: for each single decision tree, the improvement in the split-point performance measure attributable to a feature is computed, where the performance measure is generally the Gini impurity of the selected split node; these improvement values are weighted by the number of observations each node is responsible for and accumulated, and a feature's importance is determined from the resulting weighted sum and count. The wind turbine blade bolt dataset in this study has a total of 48 features, and Table 1 lists the importance values of some of them, where the initial letter of a feature parameter denotes a particular blade of the same wind turbine. According to the feature importance values and expert experience, 11 features are selected as the feature training set.
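Once importance scores are available (e.g., from the `feature_importances_` attribute of a trained tree ensemble such as `xgboost.XGBClassifier`), the selection step itself reduces to taking the top-k scores. A minimal sketch with hypothetical feature names:

```python
import numpy as np

def select_top_k(feature_names, importances, k=11):
    """Return the k features with the largest importance scores, descending."""
    order = np.argsort(importances)[::-1][:k]
    return [feature_names[i] for i in order]

# Illustrative scores; the paper's real dataset has 48 features, of which 11
# are kept based on these scores plus expert experience.
names = ["a", "b", "c", "d"]
scores = np.array([0.10, 0.40, 0.30, 0.20])
selected = select_top_k(names, scores, k=2)
```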
Finally, the dataset after feature selection is standardized according to Equation (10), which converts it to a dataset with mean 0 and variance 1. This improves the training speed and accuracy of the wind turbine blade bolt fault detection model:

x′ = (x − μ)/δ,  (10)

where x′ is the standardized feature value, x is the original feature value, μ is the feature mean, and δ is the feature standard deviation.
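Column-wise z-score standardization per Equation (10) is a one-liner in NumPy:

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: x' = (x - mu) / delta, per Equation (10)."""
    mu = X.mean(axis=0)
    delta = X.std(axis=0)
    return (X - mu) / delta

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Xs = standardize(X)   # each column now has mean 0 and variance 1
```

In a real pipeline, μ and δ would be estimated on the training split only and reused on the test split, so that no test-set information leaks into training.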
3.2. GSG Oversampling
GSG is a new oversampling method proposed in this paper. Based on the fault-class samples in the wind turbine blade bolt training set, GSG expands the fault class to a given scale while preserving the distribution characteristics of the original fault samples as much as possible. The expanded fault-class samples are then concatenated with the normal-class samples to obtain a new training dataset, thereby alleviating the imbalance between normal-class and fault-class samples in the original training set.
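The distribution-preserving idea can be illustrated with a deliberately simplified sketch: fit a Gaussian model to the fault-class samples and draw new points from it, so that synthetic samples follow the estimated fault distribution rather than line segments between neighbors. GSG itself combines GMM with SMOTE-style steps (detailed elsewhere in the paper); a single-component Gaussian is used here only as an assumption-laden stand-in.

```python
import numpy as np

def gaussian_oversample(X_fault, n_new, rng=None):
    """Simplified distribution-preserving expansion: fit one Gaussian to the
    fault-class samples and sample new points from it.  (GSG proper uses a
    multi-component GMM; one component is a simplification for illustration.)"""
    rng = np.random.default_rng(rng)
    mu = X_fault.mean(axis=0)                    # estimated fault-class mean
    cov = np.cov(X_fault, rowvar=False)          # estimated covariance
    return rng.multivariate_normal(mu, cov, size=n_new)

# The rebalanced training set then concatenates normal samples, original
# fault samples, and the synthetic fault samples, e.g.:
# X_new = np.vstack([X_normal, X_fault, gaussian_oversample(X_fault, n_new)])
```

For a multi-modal fault distribution one would instead fit `sklearn.mixture.GaussianMixture` and sample per component, which is closer to what GSG does.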
3.3. CS-LightGBM Training and Evaluation Model
The CS-LightGBM training and evaluation model combines cost-sensitive learning with the LightGBM algorithm. It is trained on the new training set produced by GSG, and its classification performance is evaluated on the test set. Finally, the Bayesian optimization algorithm searches for the optimal hyperparameter combination to obtain an optimal wind turbine blade bolt fault detection model.
3.3.1. CS-LightGBM
Cost-Sensitive (CS) learning is a machine-learning approach to class-imbalanced problems that is mainly applied within classification algorithms. When a classifier misclassifies samples, the cost of confusing one class for another can be very serious. Cost-sensitive learning therefore assigns different costs to different types of misclassification, so as to minimize the number of high-cost errors and the total misclassification cost.
LightGBM is optimized on the basis of XGBoost, so its basic principle is similar: both use decision trees as base learners. LightGBM is mainly optimized for training speed, so its efficiency is generally higher than that of XGBoost. XGBoost splits the nodes of each layer indiscriminately, which wastes considerable computing and time resources, since the information gain contributed by some nodes is almost zero. To address this, LightGBM selects only the node with the largest splitting gain for splitting and proceeds recursively under a tree-depth limit. In addition, LightGBM parallelizes over both the data and the data features, which further accelerates training, so that the overall performance of the LightGBM model is better than that of the XGBoost model. LightGBM is therefore used as the classification algorithm of the detection model proposed in this paper.
CS-LightGBM, a method proposed in recent years to optimize the LightGBM classifier, introduces cost-sensitive learning into LightGBM. Specifically, a cost-sensitive function replaces the information gain in the weight function of traditional LightGBM, so that during iterative updates the fault-class samples receive larger attention weights, improving the classification of class-imbalanced samples.
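A common, simpler way to realize the same intent at training time is to translate a cost matrix into per-sample weights, which can then be passed to a classifier (e.g., via the `sample_weight` argument of `lightgbm.LGBMClassifier.fit`). The cost values below are hypothetical, chosen only to illustrate that missing a fault is treated as costlier than a false alarm; this weighting scheme is a sketch, not the paper's exact cost-sensitive function.

```python
import numpy as np

# Hypothetical cost matrix: rows = true class (0 normal, 1 fault),
# columns = predicted class; off-diagonal entries are misclassification costs.
cost = np.array([[0.0, 1.0],    # false alarm: cost 1
                 [5.0, 0.0]])   # missed fault: cost 5 (assumed)

def sample_weights(y, cost):
    """Weight each sample by the total cost of misclassifying its true class."""
    per_class = cost.sum(axis=1)
    return per_class[y]

y = np.array([0, 0, 0, 1])
w = sample_weights(y, cost)     # fault samples get weight 5.0
```

With such weights, the gradient contribution of each fault sample is scaled up during boosting, which is the practical effect the cost-sensitive weight function aims for.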
3.3.2. Bayesian Optimization Algorithm
Bayesian Optimization (BO) is a global optimization algorithm based on probability distributions. Given the hyperparameter search space of the model, BO constructs a probabilistic surrogate of the function to be optimized, f: χ → ℝ, uses the surrogate to select the next evaluation point, and iterates this cycle to obtain the optimal hyperparameters:

x* = arg max_{x∈χ} f(x),

where x* is the optimal hyperparameter combination, χ is the decision space, and f(x) is the objective function.
Different hyperparameter combinations affect the classification performance of the detection model, so optimizing the model's hyperparameters is very important. The LightGBM model has many hyperparameters, but some have little effect on classification performance. This study mainly optimizes four hyperparameters of the LightGBM model, as shown in Table 2.
3.3.3. Model Evaluation Index
The performance evaluation in this experiment not only uses the false alarm rate (FAR) and missing alarm rate (MAR) to evaluate the classification performance of the fault detection model, but also introduces the F1-score to better describe the performance of the LightGBM classifier on imbalanced data.
The expressions of MAR, FAR, and F1-score are as follows:

MAR = FN/(TP + FN),
FAR = FP/(FP + TN),
F1-score = 2PR/(P + R),

where P = TP/(TP + FP) is the precision rate and R = TP/(TP + FN) is the recall rate; TP, TN, FP, and FN are the entries of the confusion matrix, as shown in Table 3.
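The three indexes follow directly from the confusion-matrix counts, taking the fault class as positive; the counts below are made-up numbers for illustration:

```python
def evaluation_indexes(tp, tn, fp, fn):
    """MAR, FAR and F1-score from confusion-matrix counts (fault = positive)."""
    mar = fn / (tp + fn)          # missing alarm rate: faults not detected
    far = fp / (fp + tn)          # false alarm rate: normals flagged as faults
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall (= 1 - MAR)
    f1 = 2 * p * r / (p + r)
    return mar, far, f1

mar, far, f1 = evaluation_indexes(tp=8, tn=90, fp=10, fn=2)
```

Note that recall and MAR are complementary, so a low MAR is equivalent to a high recall on the fault class.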
5. Conclusions
Wind turbine blade bolts operate under alternating loads for long periods, which often causes them to fail. When machine-learning algorithms are used for fault detection, detection performance is affected not only by external factors but also by class imbalance in the data. To solve this problem, this paper analyzed and compared the shortcomings of traditional sampling methods and algorithms, and proposed a fault detection method for wind turbine blade bolts based on GSG combined with CS-LightGBM. Its main contributions are as follows:
A new oversampling method, GSG, was proposed. GSG builds on the basic principles of GMM and SMOTE: it retains the distribution characteristics of the original data to the greatest extent when expanding the samples, avoiding the blindness of traditional SMOTE during oversampling. In addition, the method effectively alleviates the influence of noise samples during oversampling and reduces the generation of overlapping samples.
We combined GSG and CS-LightGBM for the fault detection of wind turbine blade bolts. The model addresses class imbalance in the wind turbine blade bolt data both by expanding the fault-class samples and by introducing cost-sensitive learning. Specifically, we used the proposed GSG to expand the fault-class samples in the training set to obtain a new training dataset; the new training set was then input into a LightGBM classifier with a cost-sensitive function for training, with the operational status of the blade bolt as the output value.
Both the proposed sampling method and the fault diagnosis model were experimentally verified. Analysis of GSG sampling on simulated datasets showed that its sampling effect was better than that of traditional SMOTE, verifying the effectiveness of GSG. In addition, compared with other models, the proposed fault diagnosis model achieved lower missing alarm and false alarm rates and a higher F1-score, verifying its feasibility and superiority.
When the dataset has a severe class imbalance problem, the proposed fault detection model shows good detection performance. However, when the imbalance is only slight, how to find an optimal sampling strategy for GSG requires further consideration and will be the focus of future research.