1. Introduction
In recent years, with the increasing proportion of clean energy in total energy consumption, the wind power industry has been rapidly developing, and wind power has become one of the most widely used energy sources [1,2]. Wind turbines are typically installed in high-altitude outdoor environments, where they operate continuously under dynamic loads and face complex and varied operating conditions. Consequently, the gearbox in the entire power transmission chain has a relatively high probability of failure [3,4,5,6,7]. The gearbox in wind turbines is relatively complex in structure. If a component within it fails and is not promptly detected and replaced, it can lead to downtime of the turbine, increasing operational costs [8,9]. Therefore, monitoring the operational status of the gearbox, detecting any anomalies in a timely manner, and diagnosing the type of failure have significant engineering value [10].
In practical gearbox fault diagnosis in wind turbines, most studies have focused on analyzing vibration signals, which have proven to be effective. However, research on fault diagnosis based on sound signals is relatively limited, and most of it has been conducted in laboratory settings. For example, Lu Wenbo et al. developed a gearbox fault diagnosis scheme based on near-field acoustic holography and acoustic field spatial distribution characteristics, and achieved satisfactory diagnostic results [11]. Chuan Li et al. fused acoustic and vibration signals using deep random forests to explore gearbox fault diagnosis under different operating conditions [12]. Chen Peng et al. proposed a method for diagnosing roller faults using audio wavelet packet decomposition and convolutional neural networks, significantly improving the efficiency of diagnosing faults in sand-carrying roller drums [13]. Liu Shaokang et al. proposed an improved method of local mean decomposition which separates the frequency-modulated and amplitude-modulated components in the sound signal, enabling composite fault diagnosis of gearboxes [14]. Yang Mingjin proposed an online fault diagnosis system for belt conveyor rollers based on stacked sparse autoencoders, convolutional neural networks, and spectral clustering algorithms. This system collects audio data through sensors, extracts fault features through analysis, and then performs diagnosis [15]. Jiachi Yao et al. applied Fourier decomposition to sound signals for fault diagnosis with limited sample data, demonstrating superior diagnostic performance compared to vibration signals under experimental conditions [16]. The aforementioned scholars analyzed the sound signals of industrial equipment, diagnosed mechanical faults, and achieved fruitful research results. However, due to technological limitations, the detection accuracy can be further improved. With the development of artificial intelligence, deep learning models have achieved higher accuracy in recognizing and classifying sounds and images. Considering the advantages of sound over vibration signals, namely strong fault sensitivity, easy acquisition, and non-invasive measurement, this study preprocesses wind turbine sound data using Mel spectrograms. Subsequently, a deep learning model is employed to classify the data and diagnose the types of faults present.
Using advanced signal processing techniques and deep learning classification models is of great significance for improving the accuracy and efficiency of gear fault diagnosis. Compared with the shallow fault features of traditional time-domain and frequency-domain signals, the Mel spectrogram is a two-dimensional feature map that contains both time- and frequency-domain information and simulates the human ear’s perception of sound. It handles low-frequency sound signals better, reduces the dimensionality of the frequency-domain signal by mapping the original frequency to the logarithmic Mel frequency, decreases the redundant information in the features, and improves the efficiency of signal processing. The deep learning model adopted is an enhanced ResNeXt50, a residual neural network derived from the integration of ResNet and Inception networks. It combines the repetition strategy with the split–transform–merge strategy, effectively addressing the issue of gradient vanishing or exploding during training, increasing the network’s width, and enhancing the model’s performance. Adding the CBAM (convolutional block attention module) to ResNeXt increases the model’s focus on important features. This, combined with the ArcLoss function, enables the model to learn more discriminative features and enhances its generalization capability. To address the challenge of achieving a high recognition rate for gearbox faults under complex and variable loading conditions, this study proposes an acoustic intelligent diagnosis model that incorporates the attention mechanism and the ArcLoss function into ResNeXt, known as CBAM-ResNeXt50-ArcLoss. The STFT (short-time Fourier transform) is used to extract the time–frequency matrix of the acoustic signal, its dimensionality is reduced through Mel filters, and the resulting spectrogram is used as the input to the model. The deep learning model then performs classification to determine the fault type in wind turbine gearboxes.
This article consists of four parts: audio signal preprocessing, the establishment of the CBAM-ResNeXt50-ArcLoss deep learning classification model, experiments and training results, and conclusions.
2. Preprocessing
Mel spectrograms integrate the temporal, frequency, and energy information of sound signals, with time on the horizontal axis, frequency on the vertical axis, and the distribution of energy displayed through the intensity of color [17,18]. The processing procedure first involves preprocessing the audio signal, including pre-emphasis, framing, and windowing. Then, the short-time Fourier transform (STFT) is applied to each frame, and the frame spectra are concatenated along the time dimension to generate a spectrogram. Due to various factors in recording equipment and transmission processes, high-frequency components tend to attenuate, leading to an imbalance in spectral characteristics. Pre-emphasis is a technique that applies a high-pass filter to the audio signal, emphasizing the high-frequency components in order to improve the balance of the signal. The process of pre-emphasis is as follows:
$$y(n) = x(n) - \lambda\, x(n-1) \tag{1}$$
In Equation (1), y(n) represents the signal after pre-emphasis, x(n) is the input signal, x(n − 1) is the value of the input signal one sampling point before the current point n, and λ is the pre-emphasis coefficient. The fault frequencies of the wind turbine gearbox lie in a high range, whereas the background wind noise is dominated by low frequencies. In order to emphasize the high-frequency fault range [19], λ = 0.97 is selected in this study.
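The effect of the pre-emphasis filter can be illustrated with a minimal sketch of the first-order difference y(n) = x(n) − λx(n − 1) with λ = 0.97. The function name `pre_emphasis` and the choice to pass the first sample through unchanged are assumptions made for illustration only.

```python
def pre_emphasis(x, lam=0.97):
    """Apply y(n) = x(n) - lam * x(n - 1).

    Assumption: the first sample has no predecessor and is passed through unchanged.
    """
    if not x:
        return []
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(x[n] - lam * x[n - 1])
    return y


# A constant (zero-frequency) signal is almost cancelled after the first sample,
# while a sign-alternating (highest-frequency) signal is almost doubled.
constant = pre_emphasis([1.0] * 4)
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

The low-frequency input is attenuated to about 3% of its amplitude, whereas the high-frequency input is amplified, which is the behaviour the filter is chosen for here.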
Because audio signals are non-stationary, the Fourier transform cannot be applied directly to the entire signal. Instead, the audio is segmented into short frames that are processed individually. Framing is achieved by sliding a finite-length window along the signal: the pre-emphasized signal is multiplied by a window function to give a windowed frame. A short-time Fourier analysis is then conducted, assuming that the audio signal remains stationary for a short duration, so that a steady-state analysis method can be applied [20]. In order to reduce signal distortion, the Hamming window is selected, and its formula is as follows:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{2}$$
In Equation (2), N is the length of the window, n is the sampling point index in the window, and w(n) is the value of the Hamming window at the nth sampling point.
The signal of the l-th frame after windowing is expressed as follows:
$$x_l(m) = w(m)\, x(n_l + m), \quad 0 \le m \le M-1 \tag{3}$$
In Equation (3), x_l(m) is the windowed l-th frame, a vector of length M; m is the index of the sample point within the frame, ranging from 0 to M−1; w(m) is the Hamming window; and n_l is the index of the starting sample point of the l-th frame, so that x(n_l + m) is the pre-emphasized signal at sample n_l + m.
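The framing and windowing steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the standard Hamming coefficients 0.54 and 0.46 are used, frames start at n_l = l · hop, and the names `hamming`, `frame_signal`, and `hop` are hypothetical.

```python
import math


def hamming(N):
    """Standard Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def frame_signal(x, frame_len, hop):
    """Return windowed frames x_l(m) = w(m) * x(n_l + m), assuming n_l = l * hop."""
    w = hamming(frame_len)
    frames = []
    # Only complete frames are kept; trailing samples that do not fill a frame are dropped.
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([w[m] * x[start + m] for m in range(frame_len)])
    return frames


# Ten samples, frames of length 4 with 50% overlap (hop of 2 samples).
frames = frame_signal([1.0] * 10, frame_len=4, hop=2)
```

Overlapping frames are a common choice because the window tapers each frame toward zero at its edges; the overlap actually used in the study is not stated.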
The short-time Fourier transform of the windowed frame is as follows:
$$X_l(k) = \sum_{m=0}^{M-1} x_l(m)\, e^{-j 2\pi k m / M} \tag{4}$$
In Equation (4), X_l(k) represents the STFT result of the l-th frame, k is the frequency index, and x_l(m) is the signal after windowing in the l-th frame.
After the Fourier transform, the frequency-domain signal is obtained. The energy differs between frequency bands, and the energy spectra produced by different factors also differ. The energy spectrum is calculated as follows:
$$E_l(k) = \left| X_l(k) \right|^2 \tag{5}$$
In Equation (5), E_l(k) is the energy of the l-th frame at the frequency index k, and X_l(k) represents the result of the STFT of the l-th frame.
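The transform and energy steps for a single frame can be sketched as below. The sketch assumes the energy is the squared magnitude of the discrete Fourier transform without any normalisation factor, and it uses a direct O(M²) sum for clarity; a practical implementation would use an FFT routine. The function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """X_l(k) = sum over m of x_l(m) * exp(-j*2*pi*k*m / M), for k = 0..M-1."""
    M = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / M) for m in range(M))
        for k in range(M)
    ]


def energy_spectrum(frame):
    """E_l(k) = |X_l(k)|^2 for the non-redundant bins k = 0..M/2 of a real frame."""
    X = dft(frame)
    return [abs(X[k]) ** 2 for k in range(len(frame) // 2 + 1)]


# A cosine with exactly two cycles in an 8-sample frame concentrates its energy in bin k = 2.
frame = [math.cos(2 * math.pi * 2 * m / 8) for m in range(8)]
E = energy_spectrum(frame)
```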
The Mel frequency is approximately linear in the actual frequency below 1000 Hz and grows logarithmically above 1000 Hz. By setting the upper and lower limits of the frequency, unwanted or noisy frequencies can be filtered out before conversion to the Mel frequency [21]. Then, a triangular filter bank with K channels is configured on the Mel frequency axis, and the frequency response of each filter is as follows:
$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \tag{6}$$
In Equation (6), H_m(k) is the frequency response of the m-th filter at the frequency index k, m is the filter number, and f(m) is the center frequency of the m-th filter, obtained from equally spaced points on the Mel frequency scale.
In Equations (7) and (8), fh and fl represent the highest and lowest frequencies of the filter bank, respectively; M is the number of Mel filters; fS is the sampling frequency of the gearbox audio signal, where fS = 16 kHz; and N is the frame length for STFT.
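A widely used form of the center-frequency mapping that matches the symbols fh, fl, M, fS, and N is given below. It is stated as an assumption rather than as the exact expressions of this study, since several Mel-scale conventions exist (for example, the base-10 form shown here versus a natural-logarithm form, or piecewise linear-logarithmic variants).

```latex
f(m) = \frac{N}{f_S}\, F_{\mathrm{mel}}^{-1}\!\left( F_{\mathrm{mel}}(f_l)
       + m\, \frac{F_{\mathrm{mel}}(f_h) - F_{\mathrm{mel}}(f_l)}{M + 1} \right),
       \qquad m = 0, 1, \dots, M + 1

F_{\mathrm{mel}}(f) = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right),
\qquad
F_{\mathrm{mel}}^{-1}(b) = 700 \left( 10^{\,b/2595} - 1 \right)
```

Under this convention, the M + 2 points are equally spaced on the Mel axis between F_mel(fl) and F_mel(fh), and the factor N/fS converts each point from hertz to an STFT frequency index.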
By using Mel filters to reduce the dimensionality of the data, the size of the data is reduced, and subsequent model training and recognition are simplified. The process of generating the Mel spectrogram is as follows: first, the audio signal is pre-emphasized, framed, and windowed; second, the FFT (fast Fourier transform) of each frame is computed and the power spectrum is obtained; finally, the Mel spectrogram is generated using Mel filter banks.
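The Mel filtering step can be sketched as follows. This is a minimal illustration under several assumptions: the base-10 Mel formula 2595 log10(1 + f/700), unnormalised triangular filters placed at fractional FFT-bin positions, and illustrative values of 8 filters and a 512-point FFT that are not taken from the study. All function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """Assumed Mel mapping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(b):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)


def mel_filter_bank(num_filters, n_fft, fs, f_low, f_high):
    """Build triangular filters H_m(k) over the FFT bins k = 0..n_fft/2."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    # num_filters + 2 points equally spaced on the Mel axis, mapped to fractional FFT bins.
    points = [
        n_fft / fs * mel_to_hz(lo + i * (hi - lo) / (num_filters + 1))
        for i in range(num_filters + 2)
    ]
    bank = []
    for m in range(1, num_filters + 1):
        left, mid, right = points[m - 1], points[m], points[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left <= k <= mid:
                row.append((k - left) / (mid - left))    # rising edge
            elif mid < k <= right:
                row.append((right - k) / (right - mid))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_energies(energy, bank):
    """Weight one frame's energy spectrum by every filter and sum over frequency."""
    return [sum(h * e for h, e in zip(row, energy)) for row in bank]


bank = mel_filter_bank(num_filters=8, n_fft=512, fs=16000, f_low=0.0, f_high=8000.0)
flat = [1.0] * (512 // 2 + 1)  # a flat energy spectrum, used only for demonstration
mel = mel_energies(flat, bank)
```

Each frame's 257 frequency bins are reduced to 8 Mel-band energies in this example, which illustrates the dimensionality reduction described above; stacking the per-frame results along time and taking the logarithm gives a log-Mel spectrogram.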
4. Experiment
4.1. Wind Turbine Gearbox Audio Data
The gearbox consists of one planetary stage and two parallel gear stages. The main shaft of the wind turbine at the front end is connected to the planet carrier of the low-speed planetary stage, and the high-speed shaft, which carries the high-speed stage pinion at the rear end, is connected to the generator shaft. The structure is shown in Figure 6a, and Figure 6b shows the installation position of the sound pressure sensor in the laboratory.
In Figure 6a, PS is the low-speed planetary stage carrier, IS is the large gear of the intermediate stage, and HSS is the large gear of the high-speed stage. To prevent the amplitude of the collected sound signal from being weakened by an excessive distance between the sensor and the gearbox, the sound pressure sensor is installed at position S in Figure 6a,b, which is below the first-stage parallel gear. The sound pressure sensor used is the YSV5001, which is composed of an electret microphone and a dedicated preamplifier. It has the characteristics of high sensitivity, good linearity, and stable performance. Its frequency range is 10 Hz–20 kHz, and its measurement range is 20–136 dB. During operation of the wind turbine, the load on the rotor driven by the blades changes constantly, so all data are collected under variable load conditions. Each audio recording is 10 s long, the sampling frequency is 16 kHz, the frequency resolution is 0.1 Hz, and the generator speed is about 1580 r/min.
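The stated frequency resolution is consistent with the recording length if the whole 10 s recording is treated as a single analysis window, which is an assumption made here for the worked arithmetic:

```latex
N_{\text{samples}} = f_S \, T = 16\,000\ \text{Hz} \times 10\ \text{s} = 160\,000,
\qquad
\Delta f = \frac{f_S}{N_{\text{samples}}} = \frac{1}{T} = 0.1\ \text{Hz},
\qquad
f_{\max} = \frac{f_S}{2} = 8\ \text{kHz}
```

The last term is the Nyquist limit: at a 16 kHz sampling frequency, only content below 8 kHz is represented in the recordings.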
4.2. Data Processing
Features of the wind turbine signal are extracted as a Mel spectrogram, and a 256 × 256 feature map is finally generated. Figure 7 shows the Mel spectrograms of different gear faults. Each of the four fault conditions and the healthy condition yields 2000 samples, and the samples are then divided into a training set and a test set at a ratio of 8:2. The specific fault sample distribution is shown in Table 1.
A wind turbine (mainly comprising the gearbox, blades, bearings, etc.) is usually installed in a high-altitude field environment. It runs continuously for long periods under dynamic heavy loads, and its operating conditions are complex and changeable. The gearbox has a high probability of failure in the entire power transmission chain, so it is the research object of this paper.
Under laboratory conditions, four types of faults (chipped tooth, missing tooth, root fault, source fault) are artificially implanted in the wind turbine gearbox, and the audio signal is then collected under each fault condition.
The chipped tooth fault is created by milling small pieces of material from the tooth surface to simulate a tooth surface defect caused by impact. For the missing tooth fault, a tooth is completely removed by mechanical processing to simulate severe fracture failure. The root fault simulates fatigue cracking at the fillet of the tooth root due to mechanical impact. The source fault is created by chemically etching local wear into the bearing raceway to simulate early surface degradation.
The spectrum is uniformly distributed in the healthy state, intermittent impulse noise appears in the chipped tooth state, strong periodic peaks appear in the missing tooth state, continuous harmonic components appear in the root fault state, and early weak clutter appears in the source fault state.
Each state has 1600 training samples, which are labeled as follows: health 0001–1600, chipped 0001–1600, missing 0001–1600, root 0001–1600, and source 0001–1600.
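The sample counts can be reproduced with a per-class 8:2 split, sketched below. Shuffling within each class, the fixed seed, and the function name `stratified_split` are assumptions for illustration; the study does not state how samples were assigned to the two sets.

```python
import random


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Split sample indices per class so that every class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_ratio))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


classes = ["health", "chipped", "missing", "root", "source"]
labels = [c for c in classes for _ in range(2000)]  # 2000 samples per state
train_idx, test_idx = stratified_split(labels)
```

With five states of 2000 samples each, this gives 1600 training and 400 test samples per state, or 8000 and 2000 in total.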
4.3. Model Training
All experiments in this study are carried out using the PyTorch deep learning framework and run on a Pusai deep learning server equipped with a 3080 graphics card. The Adam optimization algorithm and the LambdaLR custom learning rate adjustment strategy are used to adjust the parameters of the model. The size of the input Mel spectrogram, batch size, learning rate, the output of the fully connected layer, and the number of training iterations are set as shown in Table 2.
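PyTorch's `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)` sets the learning rate at each epoch to the initial learning rate multiplied by a user-supplied function of the epoch index. The rule can be mirrored in plain Python as below. The exponential decay factor of 0.95 and the base learning rate of 1e-3 are hypothetical placeholders, since the schedule function used in the study is not given in the text.

```python
def lr_lambda(epoch, decay=0.95):
    """Hypothetical multiplicative factor; stands in for the unspecified schedule function."""
    return decay ** epoch


def scheduled_lr(base_lr, epoch):
    """Mirror the LambdaLR rule: lr(epoch) = base_lr * lr_lambda(epoch)."""
    return base_lr * lr_lambda(epoch)


# Learning rates for the first three epochs under the placeholder settings.
schedule = [scheduled_lr(1e-3, epoch) for epoch in range(3)]
```

In an actual training loop, the same function would be passed as `lr_lambda` to the scheduler, and the scheduler's `step()` method would be called once per epoch after the Adam optimizer step.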
4.4. Comparison of CBAM-ResNeXt50-ArcLoss with Classical Models
As the number of iterations increases, the accuracy of the CNN on the training set and test set reaches 86.8% and 86.5%, respectively, and remains stable. The loss value drops significantly at first and then tends to stabilize. Eventually, the loss value of the training set stabilizes around 0.388, and the loss value of the test set stabilizes around 0.368. The test results are shown in Figure 8.
The training results of the ResNet50 model are shown in Figure 9. As the number of iterations increases, the accuracy of the model gradually improves and then tends to stabilize. The accuracy of the training set and the test set finally stabilizes at 95.9% and 96.2%, respectively. The model loss value drops and then slowly tends to stabilize. Eventually, the loss values of the training set and the test set stabilize at 0.34 and 0.256, respectively. Compared with the CNN, the accuracy of the training set is increased by 9.1 percentage points, the accuracy of the test set is increased by 9.7 percentage points, and the loss values of the training set and the test set are reduced by 0.048 and 0.112, respectively.
The ResNeXt50 model is an improvement and enhancement of the ResNet50 model, which utilizes residual connections and group convolutions to maximize the feature extraction capability. Experiments show that the accuracy of the training set and test set of the ResNeXt50 model is 97.2% and 97.4%, respectively, which is an increase of 1.3 and 1.2 percentage points compared to the ResNet50 model. The loss values of the training set and test set are 0.159 and 0.146, respectively, which are 0.181 and 0.11 lower than those of the ResNet50 model. There is no overfitting or underfitting phenomenon. The accuracy and loss values of the ResNeXt50 model are shown in Figure 10.
The results of the CBAM-ResNeXt50-ArcLoss model are shown in Figure 11. As the number of iterations increases, the accuracy of the model on the training set and test set gradually increases and finally stabilizes at 99.6% and 99.8%, respectively. The loss value of the model gradually decreases and then tends to be stable, finally stabilizing at 0.082 and 0.054, respectively.
Table 3 verifies the effectiveness of the algorithm: the test accuracy of the CBAM-ResNeXt50-ArcLoss model is improved by 13.3, 3.6, 2.4, and 1.3 percentage points compared with CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50, respectively. Due to the introduction of the CBAM and the use of the more complex ArcLoss loss function, the computational cost of the model increases. Therefore, the training time is slightly longer than that of the other models, but the robustness and generalization ability of the model are improved.
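The first three gains follow directly from the test accuracies reported in this section. The CBAM-ResNeXt50 test accuracy is not stated in the text, so the value of 98.5% in the last line is only what the reported 1.3-point gain implies.

```latex
\begin{aligned}
99.8\% - 86.5\% &= 13.3 \text{ percentage points (CNN)} \\
99.8\% - 96.2\% &= 3.6 \text{ percentage points (ResNet50)} \\
99.8\% - 97.4\% &= 2.4 \text{ percentage points (ResNeXt50)} \\
99.8\% - 1.3 \text{ points} &= 98.5\% \text{ (implied for CBAM-ResNeXt50)}
\end{aligned}
```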
Using PyTorch 2.5.1, we computed the confusion matrices of the trained CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50-ArcLoss models on the test set, as shown in Figure 12. The horizontal axis represents the predicted labels of different faults, and the vertical axis represents the true categories of different faults. The numbers on the main diagonal of the matrix represent the number of samples correctly classified for each type of fault. It can be seen that the diagnostic accuracy of the proposed method for the normal state reaches 99.8%, a high accuracy rate.
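A confusion matrix with true categories as rows and predicted labels as columns, as described above, can be computed with the following sketch. The toy labels are invented for illustration and are unrelated to the reported results, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total, that is, the recall of each class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy example with three classes.
true = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 2, 0]
cm = confusion_matrix(true, pred, 3)
acc = per_class_accuracy(cm)
```

The main diagonal holds the correctly classified counts, and each off-diagonal entry shows which class a misclassified sample was confused with.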
4.5. The Influence of Different Loss Functions on Model Results
The loss function used in CBAM-ResNeXt50-ArcLoss is the additive angular margin loss, which is an improvement over softmax loss. In order to verify its effect on the model results, network models using different loss functions are compared. As shown in Table 4, using the ArcLoss loss function increased the accuracy by 1.3 and 0.7 percentage points compared with softmax and Triplet Loss, respectively, and decreased the loss value by 0.011 and 0.059, respectively. However, ArcLoss involves complex trigonometric function operations and conditional judgments, so it is more time-consuming to compute than the other loss functions. Based on the comparison results, the ArcLoss loss function enhances the performance of the model and is suitable for the fault diagnosis of wind turbine gearboxes using sound signals. In summary, compared with other methods, the method proposed in this study has excellent fault identification capabilities.
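The additive angular margin idea can be sketched for a single sample as follows: the margin m is added to the angle between the normalised feature and the weight vector of the true class before a scaled softmax cross-entropy is applied. The scale s = 30 and margin m = 0.5 are illustrative assumptions, not the settings of this study, and the sketch omits the conditional handling needed when the angle plus the margin exceeds π.

```python
import math


def arc_margin_loss(cosines, target, s=30.0, m=0.5):
    """Additive angular margin loss for one sample.

    cosines[j] is cos(theta_j) between the normalised feature and the class-j weight.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))
            logits.append(s * math.cos(theta + m))  # penalise the true-class angle
        else:
            logits.append(s * c)
    # Numerically stable log-sum-exp, then cross-entropy for the target class.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2, 0.0, 0.3]  # five classes, the true class is index 0
with_margin = arc_margin_loss(cosines, target=0, m=0.5)
plain = arc_margin_loss(cosines, target=0, m=0.0)
```

With m = 0 the function reduces to a scaled softmax loss on cosine similarities. The margin makes the same sample incur a larger loss, which pushes training toward smaller intra-class angles and larger inter-class angles.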
5. Conclusions
This study addresses the difficulty of obtaining a high recognition rate for gear faults when the gearbox of a wind turbine generator operates under variable, heavy loads and the dataset is small. To this end, we proposed a fault diagnosis method for wind turbine gearboxes based on Mel spectrograms and a CBAM-ResNeXt50-ArcLoss transfer learning model. The original sound signal is processed by Mel spectrograms to generate a two-dimensional feature map containing both time and frequency domains. The dimension of the frequency-domain signal is reduced from frequency to logarithmic Mel scale, which reduces the redundant information of the features, reduces the interference of noise, and improves the processing efficiency of the signal. In the sound signal classification task containing the operating state information of the gearbox, the generated Mel spectrogram is input to CBAM-ResNeXt50-ArcLoss for training and verification, and a high accuracy rate of fault classification is achieved. The CBAM enables the network to adaptively adjust the degree of attention to different feature channels and spatial locations, and ArcLoss reduces the intra-class differences, increases the inter-class differences, and enhances feature discrimination. The ResNeXt structure improves the accuracy without increasing the complexity of the parameters, and the transfer learning method using fine-tuning gives the model better generalization. Compared with the CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50 methods, the CBAM-ResNeXt50-ArcLoss model improves the test accuracy by 13.3, 3.6, 2.4, and 1.3 percentage points, respectively. This method provides a new idea for the fault diagnosis of wind turbine gearboxes and has potential engineering value for fault diagnosis, operation, and maintenance.
It should be noted that the proposed method was validated on a balanced dataset. In practical applications, the number of healthy-state samples far exceeds that of fault samples, and this class imbalance issue may reduce diagnostic sensitivity. Although the ArcLoss loss function and CBAM themselves can enhance the ability to distinguish features of minority classes, further improvement measures are still recommended for scenarios with severe imbalance. Future research will focus on verifying the robustness of the model under imbalanced data distributions in the real world.
In this study, controlled experimental data from a single wind turbine gearbox were utilized. Known faults were artificially implanted, and audio signals were collected under laboratory operating conditions. This ensures the reliability of the fault labels and supports the initial validation of the method. However, a limitation of this study is that the training and validation data are derived from a single gearbox, which may not fully cover the complexity of real-life scenarios (such as operating environments with extreme weather, differences between gearboxes in various scenarios, etc.). Future research will validate the method under these conditions.