Fault Diagnosis of Wind Turbine Gearbox Based on Mel Spectrogram and Improved ResNeXt50 Model

Zhang, Xiaojuan; Jia, Feixiang; Chen, Yayu

doi:10.3390/app15158563


by Xiaojuan Zhang ¹, Feixiang Jia ²,³ and Yayu Chen ²,³,*

¹ School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China

² School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China

³ Key Laboratory of Intelligent Industrial Equipment Technology of Hebei Province, Handan 056038, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(15), 8563; https://doi.org/10.3390/app15158563

Submission received: 2 July 2025 / Revised: 24 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025


Abstract

Wind turbine gearboxes operate under complex and variable loads, and the limited amount of available sound data makes fault identification difficult. To address this problem, this study focuses on sound signals and proposes an intelligent diagnostic method based on deep learning. Adding the CBAM module to ResNeXt strengthens the model's attention to important features, and combining it with the ArcLoss loss function lets the model learn more discriminative features, which strengthens its generalization ability. We used a fine-tuning transfer learning strategy, transferring pre-trained model parameters to the CBAM-ResNeXt50-ArcLoss model and training it on Mel spectrograms extracted from the sound signals to extract and classify audio features of the wind turbine gearbox. Experimental validation on collected sound signals showed the effectiveness and superiority of the proposed method. Compared with the CNN, ResNet50, ResNeXt50, and CBAM-ResNet50 methods, the CBAM-ResNeXt50-ArcLoss model improved accuracy by 13.3, 3.6, 2.4, and 1.3 percentage points, respectively. Comparison with these classical algorithms demonstrates that the proposed method has better diagnostic capability in classifying wind turbine gearbox sound signals.

Keywords:

fault diagnosis; wind turbine gearbox; Mel spectrogram; CBAM-ResNeXt50-ArcLoss; transfer learning

1. Introduction

In recent years, with the increasing proportion of clean energy in total energy consumption, the wind power industry has been rapidly developing, and wind power has become one of the most widely used energy sources [1,2]. Wind turbines are typically installed in high-altitude outdoor environments, where they operate continuously under dynamic loads and face complex and varied operating conditions. Consequently, the gearbox in the entire power transmission chain has a relatively high probability of failure [3,4,5,6,7]. The gearbox in wind turbines is relatively complex in structure. If a component within it fails and is not promptly detected and replaced, it can lead to downtime of the turbine, increasing operational costs [8,9]. Therefore, monitoring the operational status of the gearbox, detecting any anomalies in a timely manner, and diagnosing the type of failure have significant engineering value [10].

In practical gearbox fault diagnosis in wind turbines, most studies have focused on analyzing vibration signals, which have proven to be effective. However, research on fault diagnosis based on sound signals is relatively limited, and most of it has been conducted in laboratory settings. For example, Lu Wenbo et al. developed a gearbox fault diagnosis scheme based on near-field acoustic holography and acoustic field spatial distribution characteristics, and achieved satisfactory diagnostic results [11]. Chuan Li et al. utilized the fusion of acoustic and vibration signals using deep random forests to explore gearbox fault diagnosis under different operating conditions [12]. Chen Peng et al. proposed a method for diagnosing roller faults using audio wavelet packet decomposition and convolutional neural networks, significantly improving the efficiency of diagnosing faults in sand-carrying roller drums [13]. Liu Shaokang et al. proposed an improved method of local mean decomposition which separates the frequency-modulated and amplitude-modulated components in the sound signal, enabling composite fault diagnosis of gearboxes [14]. Yang Mingjin proposed an online fault diagnosis system for belt conveyor rollers based on stacked sparse autoencoders, convolutional neural networks, and spectral clustering algorithms. This system collects audio data through sensors, extracts fault features through analysis, and then performs diagnosis [15]. Jiachi Yao et al. utilized sound signals for fault diagnosis and employed Fourier decomposition for fault diagnosis under conditions of limited sample data, demonstrating superior diagnostic performance compared to vibration signals under experimental conditions [16]. The aforementioned scholars analyzed the sound signals of industrial equipment, diagnosed mechanical faults, and achieved fruitful research results. However, due to technological limitations, the detection accuracy can be further improved. 
With the development of artificial intelligence, deep learning models have achieved higher accuracy in recognizing and classifying sounds and images. Considering the advantages of strong fault sensitivity, easy acquisition, and non-invasive measurement of sound compared to vibration signals, this study preprocesses wind turbine sound data using Mel spectrograms. Subsequently, a deep learning model is employed to classify the data and diagnose the types of faults present.

Advanced signal processing techniques and deep learning classification models are of great significance for improving the accuracy and efficiency of gear fault diagnosis. Compared with the shallow fault features of traditional time-domain and frequency-domain signals, the Mel spectrogram is a two-dimensional feature map that covers both the time and frequency domains and simulates the human ear’s perception of sound. It handles low-frequency sound signals better, reduces the dimensionality of frequency-domain signals by mapping the original frequency axis to the logarithmic Mel frequency axis, decreases redundant feature information, and improves the efficiency of signal processing. The deep learning model adopted is an enhanced ResNeXt50, a residual neural network derived from the integration of ResNet and Inception networks. It combines the repetition strategy of ResNet with the split–transform–merge strategy of Inception, effectively addressing vanishing or exploding gradients during training, increasing the network’s width, and enhancing the model’s performance. Adding the CBAM (convolutional block attention module) to ResNeXt increases the model’s focus on important features. Combined with the ArcLoss function, this enables the model to learn more discriminative features and enhances its generalization capability. To achieve a high recognition rate for gearbox faults under complex and variable loading conditions, this study proposes an acoustic intelligent diagnosis model that incorporates the attention mechanism into ResNeXt together with the ArcLoss function, referred to as CBAM-ResNeXt50-ArcLoss. The STFT (short-time Fourier transform) extracts the time–frequency matrix of the acoustic signal, Mel filters reduce its dimensionality, and the resulting spectrogram is used as the input to the model. The deep learning model then performs classification to determine the fault type in wind turbine gearboxes.

This article consists of four parts: audio signal preprocessing, establishment of the CBAM-ResNeXt50-ArcLoss deep learning classification model, experiments and training results, and conclusions.

2. Preprocessing

Mel spectrograms integrate the temporal, frequency, and energy information of sound signals, with time on the horizontal axis, frequency on the vertical axis, and the distribution of energy displayed through the intensity of color [17,18]. The processing procedure first involves preprocessing the audio signal, including pre-emphasis, framing, and windowing. Then, short-time Fourier transform (STFT) is applied to generate a spectrogram, mapping the individual frame signals and concatenating them along the time dimension. Due to various factors in recording equipment and transmission processes, high-frequency components tend to attenuate, leading to an imbalance in spectral characteristics. Pre-emphasis is a technique that applies a high-pass filter to the audio signal, emphasizing the high-frequency components in order to improve the balance of the signal. The process of pre-emphasis is as follows:

y (n) = x (n) - λ x (n - 1)

(1)

In Equation (1), y(n) represents the signal after pre-emphasis, x(n) is the input signal, x(n − 1) is the input signal one sample before the current sample n, and λ is the pre-emphasis coefficient. The fault frequency range of the wind turbine gearbox is high, whereas the background wind noise is dominated by low frequencies. To emphasize the high-frequency fault range [19], λ = 0.97 is selected in this study.
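As a minimal sketch of Equation (1), the filter can be written in a few lines of Python. The function name `pre_emphasis` and the handling of the first sample (passed through unchanged because it has no predecessor) are assumptions made for illustration, not details given in the text.

```python
def pre_emphasis(x, lam=0.97):
    """Apply y(n) = x(n) - lam * x(n - 1) to a sequence of samples."""
    if not x:
        return []
    # Assumption: the first sample has no predecessor, so it is passed through.
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(x[n] - lam * x[n - 1])
    return y


# A constant (zero-frequency) signal is almost cancelled,
# while a signal alternating from sample to sample is amplified.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```

The two toy inputs illustrate the high-pass behaviour: the slowly varying signal is attenuated to about 0.03, and the rapidly varying one grows to about 1.97 in magnitude.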

Because audio signals are non-stationary, we cannot directly apply Fourier Transform to the entire audio signal. Instead, we need to use framing techniques to segment the audio into smaller frames and process them individually. Frame segmentation is achieved by applying a movable finite-length window for weighting. A certain window function is multiplied with the pre-emphasized signal, resulting in a windowed audio signal. This process allows for the audio to be segmented into smaller frames for further analysis and processing. Then, a short-time Fourier analysis is conducted, assuming that the audio signal remains stationary for a short duration, and a steady-state analysis method is applied for processing [20]. In order to reduce signal distortion, the Hamming window is selected, and its formula is as follows:

w (n) = \begin{cases} 0.54 - 0.46 \cos [2 π n / (N - 1)], & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise} \end{cases}

(2)

In Equation (2), N is the length of the window, n is the sampling point index in the window, and w(n) is the value of the Hamming window at the nth sampling point.
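The window of Equation (2) can be evaluated directly, as in the following sketch. The function name `hamming` and the special case for N = 1 are illustrative assumptions.

```python
import math


def hamming(N):
    """Return the N-point Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if N < 1:
        raise ValueError("window length must be positive")
    if N == 1:
        return [1.0]  # degenerate case: avoids dividing by N - 1 = 0
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


w = hamming(5)
print([round(v, 2) for v in w])  # [0.08, 0.54, 1.0, 0.54, 0.08]
```

The window is symmetric, peaks at 1 in the middle, and tapers to 0.08 rather than 0 at both ends.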

The signal of the l-th frame after windowing is expressed as follows:

x_{l} (m) = w (m) x (n_{l} + m), \quad 0 \leq m \leq M - 1

(3)

In Equation (3), x_l(m) is the windowed l-th frame, a vector of length M; m is the index of the sample point within the frame, ranging from 0 to M − 1; w(m) is the Hamming window; and n_l is the index of the starting sample point of the l-th frame, so that x(n_l + m) is the sample of the pre-emphasized signal located m points after the start of the frame.
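The framing step of Equation (3) might be sketched as follows. The hop size between frames is not stated in the text, so the `hop` parameter and the choice n_l = l · hop are assumptions, as is discarding a trailing partial frame.

```python
import math


def frame_signal(x, frame_len, hop):
    """Split x into Hamming-windowed frames x_l(m) = w(m) * x(n_l + m)."""
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop must be >= 1")
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (frame_len - 1))
         for m in range(frame_len)]
    frames = []
    start = 0  # n_l, assumed here to be l * hop
    while start + frame_len <= len(x):
        frames.append([w[m] * x[start + m] for m in range(frame_len)])
        start += hop
    return frames


frames = frame_signal([1.0] * 10, frame_len=4, hop=2)
print(len(frames), len(frames[0]))  # 4 4
```

With a hop smaller than the frame length, consecutive frames overlap, which compensates for the samples that the window tapers at the frame edges.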

The short-time Fourier transform of the windowed frame is as follows:

X_{l} (k) = \sum_{m = 0}^{M - 1} x_{l} (m) \cdot e^{- j 2 π k m / M}

(4)

In Equation (4), X_l(k) represents the STFT result of the l-th frame, and k is the frequency index. x_l(m) is the signal after windowing in the l-th frame.

After the Fourier transform, the frequency-domain signal is obtained, and the energy spectrum is calculated from it. The energy differs from one frequency band to another, and the energy spectra of different factors also differ. The calculation formula is as follows:

E_{l} (k) = | X_{l} (k) |^{2}

(5)

In Equation (5), E_l(k) is the energy of the l-th frame at the frequency index k, and X_l(k) represents the result of the STFT of the l-th frame.
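Equations (4) and (5) can be illustrated with a direct evaluation of the transform on a single frame. In this sketch the per-bin energy is taken as the squared magnitude of X_l(k); the function names and the test tone are illustrative, and a practical implementation would use an FFT routine instead of the O(M²) sum shown here.

```python
import cmath
import math


def dft(frame):
    """Direct evaluation of X_l(k) = sum_m x_l(m) * exp(-j*2*pi*k*m/M)."""
    M = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
            for k in range(M)]


def energy_spectrum(frame):
    """Per-bin energy, taken here as |X_l(k)|^2."""
    return [abs(X) ** 2 for X in dft(frame)]


# One frame holding exactly two cycles of a cosine: energy sits in bins 2 and M - 2.
M = 16
frame = [math.cos(2.0 * math.pi * 2 * m / M) for m in range(M)]
E = energy_spectrum(frame)
print(max(range(M // 2 + 1), key=lambda k: E[k]))  # 2
```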

The Mel frequency is approximately linear in the actual frequency below 1000 Hz and grows logarithmically above 1000 Hz. By setting the upper and lower limits of the frequency, unwanted or noisy frequencies can be filtered out before the conversion to the Mel frequency [21]. Then, a triangular filter bank with K channels is configured on the Mel frequency axis, and the frequency response of each filter is as follows:

H_{m} (k) = \begin{cases} 0, & k < f (m - 1) \\ \frac{k - f (m - 1)}{f (m) - f (m - 1)}, & f (m - 1) \leq k \leq f (m) \\ \frac{f (m + 1) - k}{f (m + 1) - f (m)}, & f (m) \leq k \leq f (m + 1) \\ 0, & k > f (m + 1) \end{cases}

(6)

In Equation (6), H_m(k) is the frequency response of the m-th filter at the frequency index k, and m is the filter number. f(m) is the center frequency of the m-th filter, expressed as a frequency index. Frequencies are converted to the Mel scale with F_Mel(f) = 1125 ln(1 + f/700), whose inverse is given in Equation (7); Equation (8) then places the filter centers at equal intervals on the Mel axis.

F_{Mel}^{- 1} (f) = 700 (e^{f / 1125} - 1)

(7)

f (m) = (\frac{N}{f_{s}}) F_{Mel}^{- 1} [F_{Mel} (f_{l}) + m \frac{F_{Mel} (f_{h}) - F_{Mel} (f_{l})}{M + 1}]

(8)

In Equations (7) and (8), f_h and f_l are the highest and lowest frequencies covered by the filter bank, respectively; M is the number of Mel filters; f_s is the sampling frequency of the sound signal, where f_s = 16 kHz; and N is the frame length for STFT.
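Equations (6) to (8) can be combined into a small filter-bank builder, sketched below. The forward mapping `hz_to_mel` is the inverse of Equation (7). The filter count, frame length, and frequency limits in the example call are hypothetical values chosen for illustration; only f_s = 16 kHz comes from the text. The filter edges are kept as fractional bin positions instead of being rounded, which is a simplification.

```python
import math


def hz_to_mel(f):
    """F_Mel(f) = 1125 * ln(1 + f / 700), the inverse of Equation (7)."""
    return 1125.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(mel):
    """Equation (7): F_Mel^{-1}(mel) = 700 * (exp(mel / 1125) - 1)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def filter_edges(n_filters, n_fft, fs, f_low, f_high):
    """Equation (8): f(m) for m = 0..n_filters+1, in (fractional) FFT-bin units."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [(n_fft / fs) * mel_to_hz(lo + m * step) for m in range(n_filters + 2)]


def triangular_response(k, left, center, right):
    """Equation (6): response of one filter at bin k."""
    if k < left or k > right:
        return 0.0
    if k <= center:
        return (k - left) / (center - left)
    return (right - k) / (right - center)


# Hypothetical settings: 8 filters, 512-point frames, 0 to 8000 Hz; fs = 16 kHz as in the text.
f = filter_edges(n_filters=8, n_fft=512, fs=16000, f_low=0.0, f_high=8000.0)
bank = [[triangular_response(k, f[m - 1], f[m], f[m + 1]) for k in range(257)]
        for m in range(1, 9)]
print(len(bank), len(bank[0]))  # 8 257
```

Multiplying each row of `bank` by the energy spectrum of a frame and summing gives one Mel-band energy per filter, which is how the filter bank reduces the number of frequency values per frame.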

Using Mel filters reduces the dimensionality and size of the data and simplifies subsequent model training and recognition. The process of generating the Mel spectrogram is as follows: first, the audio signal is pre-emphasized, framed, and windowed; second, the FFT (fast Fourier transform) of each frame is computed and the power spectrum is obtained; finally, the Mel spectrogram is generated using Mel filter banks.

3. Model

3.1. ResNeXt50

ResNeXt is an improvement over ResNet and Inception networks. By combining the repetition strategy of ResNet with the split–transform–merge strategy of Inception, it alleviates vanishing or exploding gradients during training, increases the width of the network, and improves the performance of the model without increasing parameter complexity [22,23]. It transforms a single-path convolution into multiple convolutions in multiple branches and stacks residual blocks with the same structure in parallel [24]. Introducing the cardinality to control the number of groups makes the model easier to expand. The aggregation transformation in ResNeXt can be expressed as follows:

F (x) = \sum_{i = 1}^{C} T_{i} (x)

(9)

In Equation (9), T_i is the transformation applied by the i-th branch, with all branches sharing the same topology; C is the number of identical branches in a module, i.e., the cardinality.

Figure 1 shows the original module structure of ResNeXt. The input feature is passed through a convolution with a kernel size of 1 × 1 and is divided into 32 low-dimensional embeddings. Then, the 32 low-dimensional embeddings are transformed, and the transformed outputs are aggregated by addition. On the right side of the figure is the equivalent representation of the left side, where the branch structure is replaced by grouped convolution. In practice, a relatively simple graph structure is adopted, i.e., the basic module of ResNeXt is implemented in the form of grouped convolution.
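The claim that the split into 32 branches does not raise parameter complexity can be checked with a short count. The channel widths used below (256 input and output channels, 64 inner channels for the single-path bottleneck, 4 inner channels per branch) are taken from the commonly cited ResNeXt 32×4d configuration and are assumptions here, since the text only states the number of branches. Biases and normalization layers are ignored.

```python
def bottleneck_params(in_ch, mid_ch, out_ch):
    """Weights (no biases) of a 1x1 -> 3x3 -> 1x1 bottleneck."""
    return in_ch * mid_ch + 3 * 3 * mid_ch * mid_ch + mid_ch * out_ch


def resnext_block_params(in_ch, cardinality, width, out_ch):
    """C parallel branches, each a bottleneck with `width` inner channels."""
    return cardinality * bottleneck_params(in_ch, width, out_ch)


resnet = bottleneck_params(256, 64, 256)
resnext = resnext_block_params(256, cardinality=32, width=4, out_ch=256)
print(resnet, resnext)  # 69632 70144
```

Under these assumed widths, the two blocks differ by well under 1% in weight count, while the aggregated block applies 32 separate transformations.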

3.2. Attention Mechanism

The attention mechanism selectively focuses on and processes information and is widely used in artificial intelligence algorithms. It can adaptively learn and calculate the influence weights of input data on output data, focus on important information, ignore irrelevant information, and adjust the weights of information in different situations to improve the feature extraction and information representation capabilities of neural networks, making them more robust and easier to expand [25]. The convolutional block attention module (CBAM) is adopted in this study [26] because it is lightweight and effectively enhances the resolution of features, so it can better capture the key fault features in the spectrogram.

Given an intermediate feature map F ∈ R^{C×H×W} as input, the CBAM sequentially derives a one-dimensional channel feature map M_c ∈ R^{C×1×1} and a two-dimensional spatial feature map M_s ∈ R^{1×H×W}. The entire attention process can be summarized as follows:

\begin{array}{l} F' = M_{c} (F) \otimes F \\ F'' = M_{s} (F') \otimes F' \end{array}

(10)

In Equation (10), ⨂ represents element-wise multiplication, and the attention values are propagated forward accordingly during the multiplication process. F″ is the final output after calculation.
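The two-step weighting of Equation (10) can be sketched with plain nested lists indexed as F[c][h][w]. The function `cbam` and its two callable arguments are hypothetical stand-ins: any function returning C channel weights and any function returning an H × W weight map can be plugged in. The sketch only shows the order of operations and the broadcasting, not the attention computations themselves.

```python
def cbam(F, channel_attention, spatial_attention):
    """Return F'' = M_s(F') * F', where F' = M_c(F) * F, for F[c][h][w]."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    Mc = channel_attention(F)  # C weights, broadcast over all H x W positions
    F1 = [[[Mc[c] * F[c][h][w] for w in range(W)] for h in range(H)]
          for c in range(C)]
    Ms = spatial_attention(F1)  # H x W weights, computed from F' and broadcast over channels
    return [[[Ms[h][w] * F1[c][h][w] for w in range(W)] for h in range(H)]
            for c in range(C)]


# Fixed stand-in weights, used only to make the broadcasting visible.
F = [[[2.0, 2.0]], [[4.0, 4.0]]]
out = cbam(F, lambda f: [0.5, 1.0], lambda f: [[1.0, 0.5]])
print(out)  # [[[1.0, 0.5]], [[4.0, 2.0]]]
```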

3.2.1. Channel Attention Mechanism

Channel attention mechanisms evaluate the significance of individual feature channels, adaptively amplifying useful channels while suppressing less relevant ones. The process begins with an input feature map F of dimensions H × W × C. To condense spatial information, global max pooling and global average pooling are applied across the spatial dimensions (H × W), yielding two compact 1 × 1 × C feature descriptors. This spatial compression facilitates efficient subsequent learning of channel characteristics. Both pooled results are then fed into a shared MLP (multi-layer perceptron). The MLP architecture employs a bottleneck design: its first layer contains C/r neurons with ReLU activation, followed by a second layer restoring the dimensionality to C neurons. The outputs from the MLP (corresponding to the two pooling paths) are element-wise summed. This combined signal is then passed through a sigmoid activation function, producing the final channel attention weight matrix M_c ∈ R^{C×1×1}, where r is the dimensionality reduction coefficient used in the MLP.

\begin{array}{rl} M_{c} (F) & = σ (MLP (AvgPool (F)) + MLP (MaxPool (F))) \\ & = σ (W_{1} (W_{0} (F_{avg}^{c})) + W_{1} (W_{0} (F_{max}^{c}))) \end{array}

(11)

In Equation (11), σ represents the sigmoid activation function; F^c_avg and F^c_max represent the global average pooling and max pooling features, respectively; and W₀ ∈ R^{C/r×C} and W₁ ∈ R^{C×C/r} are the weights of the shared MLP.
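A minimal, framework-free sketch of Equation (11) is given below for a feature map stored as F[c][h][w]. The weights in the example are hand-picked for illustration and biases are omitted; in the actual model these weights are learned.

```python
import math


def channel_attention(F, W0, W1):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))) for F[c][h][w]."""
    # Global pooling over the spatial positions gives one value per channel.
    flat = [[v for row in ch for v in row] for ch in F]
    avg = [sum(vals) / len(vals) for vals in flat]
    mx = [max(vals) for vals in flat]

    def mlp(x):
        # Shared two-layer MLP: W0 is (C/r) x C with ReLU, W1 is C x (C/r).
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W0]
        return [sum(w * hi for w, hi in zip(row, hidden)) for row in W1]

    summed = [a + b for a, b in zip(mlp(avg), mlp(mx))]
    return [1.0 / (1.0 + math.exp(-z)) for z in summed]


# Two channels, reduction ratio r = 2, illustrative weights.
F = [[[1.0, 3.0]], [[0.0, 0.0]]]
W0 = [[1.0, 0.0]]    # shape (C/r) x C = 1 x 2
W1 = [[1.0], [0.0]]  # shape C x (C/r) = 2 x 1
weights = channel_attention(F, W0, W1)
print(weights)
```

In this example the first channel has an average of 2 and a maximum of 3, so the summed MLP output is 5 and its weight is close to 1, while the all-zero second channel receives the neutral weight 0.5.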

3.2.2. Spatial Attention Mechanism

Spatial attention pinpoints the key regions of the feature map that need processing and uses the spatial relationships within the feature map to generate a spatial attention map. First, for an input feature map F with dimensions H × W × C, max pooling and average pooling are performed along the channel dimension, resulting in two two-dimensional feature maps F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W}. Pooling over the channels reduces the channel size and facilitates the subsequent learning of spatial features. Next, the two maps are concatenated along the channel dimension to form an effective feature representation, and a standard convolutional layer with a 7 × 7 kernel is applied to the concatenated result, producing a feature map of size H × W × 1. This feature map is then passed through the sigmoid activation function to obtain the spatial attention weight matrix M_s ∈ R^{1×H×W}. To sum up, the calculation formula of spatial attention is as follows:

\begin{array}{rl} M_{s} (F) & = σ (f^{7 \times 7} ([AvgPool (F); MaxPool (F)])) \\ & = σ (f^{7 \times 7} ([F_{avg}^{s}; F_{max}^{s}])) \end{array}

(12)

In Equation (12), f^{7×7} represents a convolution operation with a 7 × 7 convolution kernel.
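Equation (12) can be sketched without a deep learning framework as follows, again with F stored as F[c][h][w]. Zero padding (so that the output keeps the size H × W), the absence of a bias term, and the kernel layout `kernel[i][dy][dx]` are assumptions of this sketch; the kernel in the example is hand-picked, whereas the real one is learned.

```python
import math


def spatial_attention(F, kernel):
    """M_s(F) = sigmoid(conv([AvgPool(F); MaxPool(F)])) for F[c][h][w].

    kernel[i][dy][dx] holds the weights for input map i (0 = average, 1 = max).
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # Pool along the channel axis: two H x W maps.
    avg = [[sum(F[c][h][w] for c in range(C)) / C for w in range(W)] for h in range(H)]
    mx = [[max(F[c][h][w] for c in range(C)) for w in range(W)] for h in range(H)]
    maps = [avg, mx]
    K = len(kernel[0])
    pad = K // 2  # zero padding keeps the output at H x W
    out = []
    for h in range(H):
        row = []
        for w in range(W):
            z = 0.0
            for i in range(2):
                for dy in range(K):
                    for dx in range(K):
                        y, x = h + dy - pad, w + dx - pad
                        if 0 <= y < H and 0 <= x < W:
                            z += kernel[i][dy][dx] * maps[i][y][x]
            row.append(1.0 / (1.0 + math.exp(-z)))
        out.append(row)
    return out


# A 7 x 7 kernel whose only non-zero tap is the centre of the average map.
K = 7
kernel = [[[0.0] * K for _ in range(K)] for _ in range(2)]
kernel[0][3][3] = 1.0
F = [[[0.0, 2.0]], [[0.0, 4.0]]]
att = spatial_attention(F, kernel)
print(att)
```

With this kernel the attention map is simply the sigmoid of the channel-averaged input, so the position with larger activations (average 3) gets a weight near 0.95 and the empty position gets 0.5.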

3.2.3. Mixed Attention CBAM

Channel attention identifies significant features (“what”), while spatial attention localizes their positions (“where”) within the image. Integrating both modules enables complementary attention modeling. These modules can be arranged sequentially or in parallel. Empirical results demonstrate that the channel-first sequential configuration (Channel → Spatial) yields marginally superior performance.

Consequently, the CBAM adopts a lightweight sequential architecture (Channel + Spatial Attention). Its compact design introduces minimal computational overhead or parameters, making it easy to insert into other networks. Figure 2 illustrates the CBAM-enhanced residual module structure.

3.3. Loss Function

In this study, the adopted loss function is Additive Angular Margin Loss (ArcLoss). As an improved version of softmax loss, it addresses the issue of softmax being sensitive to the modulus of feature vectors while being insensitive to angles [27]. By optimizing the angular connections between feature embeddings, more distinguishable features can be acquired. Moreover, adjusting the parameter values of the loss function enables feature quantities to achieve the effects of intra-class compactness and inter-class separation. The ArcLoss loss function is as follows:

L = - \frac{1}{N} \sum_{i = 1}^{N} \log \frac{e^{s \cos (θ_{y_{i}} + m)}}{e^{s \cos (θ_{y_{i}} + m)} + \sum_{j = 1, j \neq y_{i}}^{n} e^{s \cos θ_{j}}}

(13)

In Equation (13), N is the number of training samples, n is the number of classes, y_i is the true label of sample i, and θ_{y_i} is the angle between the feature vector of sample i and the weight vector of the true category y_i. m is the added angular margin; s is a scaling factor used to scale the cosine value. θ_j is the angle between the feature vector of sample i and the weight vector of category j, where j ≠ y_i.
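The loss of Equation (13) can be sketched for a small batch as follows. The inputs are assumed to be the cosines between L2-normalized features and class weight vectors. The default values s = 30 and m = 0.5 are illustrative assumptions, since the text does not report the values used. The special handling that practical implementations apply when θ + m exceeds π is omitted.

```python
import math


def arc_loss(cosines, labels, s=30.0, m=0.5):
    """Mean ArcLoss over a batch.

    cosines[i][j] is cos(theta_j) between the feature of sample i and the
    weight vector of class j (both L2-normalised); labels[i] is y_i.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        logits = []
        for j, c in enumerate(cos_row):
            if j == y:
                theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for safety
                logits.append(s * math.cos(theta + m))      # margin on the true class
            else:
                logits.append(s * c)
        # Numerically stable negative log-softmax of the true-class logit.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[y]
    return total / len(labels)


cos = [[0.9, 0.1, -0.2]]
print(arc_loss(cos, [0], m=0.0) < arc_loss(cos, [0], m=0.5))  # True
```

With m = 0 the expression reduces to a scaled softmax loss on the cosines. A positive margin makes the same sample look harder, which pushes training toward smaller angles to the true class, i.e., more compact classes.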

3.4. Transfer Learning

Transfer learning is an important strategy in deep learning; its core lies in transferring the knowledge learned from the source domain to the target domain, thereby improving the generalization performance of the model in the target domain [28]. Specifically, in this study, we first take the original network model pre-trained on the ImageNet dataset, initialize the network parameters with the weights of this pre-trained model, and then transfer the model to the task of recognizing Mel spectrograms generated from the sound signals of wind turbines. Fine-tuning the weights and biases of the pre-trained model effectively improves the generalization ability and classification performance of the model. The advantage of transfer learning is that the model does not need to be trained from scratch, which not only reduces the training time and computing resources required for new tasks, but also further enhances the generalization ability of the model. Figure 3 is a schematic diagram of the fine-tuning transfer learning strategy.
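The parameter transfer described above can be illustrated without any framework: pre-trained values are copied wherever the name and shape of a parameter match, and everything else (for example newly added attention layers or a classifier head with a different number of classes) keeps its fresh initialization before fine-tuning. The function, the parameter names, and the shapes below are hypothetical; the text does not specify how the matching is done.

```python
def transfer_parameters(pretrained, target):
    """Copy every pretrained entry whose name and shape match the target model.

    Both arguments map parameter names to (shape, values) tuples. Returns the
    names that were copied and the names left at their fresh initialisation.
    """
    copied, kept = [], []
    for name, (shape, _) in target.items():
        if name in pretrained and pretrained[name][0] == shape:
            target[name] = pretrained[name]
            copied.append(name)
        else:
            kept.append(name)
    return copied, kept


# Hypothetical names and shapes: a 1000-class source head versus a 5-class target head.
pretrained = {"conv1.weight": ((64, 3, 7, 7), "imagenet"),
              "fc.weight": ((1000, 2048), "imagenet")}
target = {"conv1.weight": ((64, 3, 7, 7), "random"),
          "cbam.conv.weight": ((1, 2, 7, 7), "random"),
          "fc.weight": ((5, 2048), "random")}
copied, kept = transfer_parameters(pretrained, target)
print(copied, kept)  # ['conv1.weight'] ['cbam.conv.weight', 'fc.weight']
```

After this initialization, all parameters, copied or not, would be updated during fine-tuning on the Mel spectrograms.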

3.5. The Overall Structure of the Model

The CBAM-ResNeXt50-ArcLoss network is a further improvement based on ResNeXt. The ResNeXt network adds the processing of contextual information in high-dimensional spaces, while the CBAM adds the processing of the relationships between the channels of features and the processing of the relationships between spatial features. The transformation of the CBAM is regarded as the non-identity branch of the residual module, and both the channel attention and the spatial attention act before being added to the identity branch. Finally, the ArcLoss loss function is added to the overall module to improve the classification accuracy of the model. Here, the asterisk * in the module structure represents convolution operation, which is used to implement the element-wise multiplication of attention weights and feature maps to apply attention mechanisms. Figure 4 shows the core CBAM-ResNeXt50 module of the network; it shows a network module structure that combines residual connection and a dual-channel attention mechanism, including three core parts: residual block, channel attention, and spatial attention.

The overall structure of the Mel spectrogram and CBAM-ResNeXt50-ArcLoss transfer learning model is shown in Figure 5. Different colors in the figure are used to distinguish the functional modules of the model. Among them, blue represents the backbone network, yellow represents the Pre-training weight, purple represents the Global average pooling, cyan represents the ArcLoss module, and red represents the Output module. Firstly, the collected noisy sound of the wind turbine gearbox is preprocessed to generate Mel spectrograms which are divided into a training set and test set, and the CBAM-ResNeXt50-ArcLoss model of transfer learning is introduced. The model can extract complex features of the image from low levels to high levels through multi-layer convolution operations. In the initial convolution stage, the model can obtain local and detailed information of the image. In the deep convolution stage, the model can obtain complex and abstract information of the image. After the convolution operation of all convolution layers, the feature matrix is obtained, and then the classification result is transformed into a probability distribution through global average pooling and the softmax function, thereby realizing the fault diagnosis of the wind turbine gearbox.

4. Experiment

4.1. Wind Turbine Gearbox Audio Data

The gearbox consists of one planetary stage and two parallel gear stages. The main shaft of the front-end wind turbine is connected to the low-speed planetary stage carrier of the wind power gearbox, and the high-speed shaft with a high-speed stage pinion at the rear end is connected to the generator shaft. Its structure is shown in Figure 6a, and Figure 6b shows the installation position of the sound pressure sensor in the laboratory.

In Figure 6a, PS is the low-speed planetary stage carrier, IS is the large gear of the intermediate stage, and HSS is the large gear of the high-speed stage. To prevent the amplitude of the collected sound signal from being weakened by an excessive distance between the sensor and the gearbox, the sound pressure sensor is installed at position S in Figure 6a,b, which is below the first-stage parallel gear. The sound pressure sensor used is the YSV5001, which is composed of an electret microphone and a dedicated preamplifier. It offers high sensitivity, good linearity, and stable performance. Its frequency range is 10 Hz–20 kHz, and its measurement range is 20–136 dB. During operation of the wind turbine, the load on the rotor driven by the blades changes constantly, so all data are collected under variable load conditions. Each audio recording is 10 s long, the sampling frequency is 16 kHz, the frequency resolution is 0.1 Hz, and the generator speed is about 1580 r/min.
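The acquisition figures quoted above are linked by simple arithmetic, sketched below. The 0.1 Hz resolution corresponds to a transform over the full 10 s recording, and the sampling theorem limits the analyzable band to half the sampling frequency; the variable names are illustrative.

```python
fs = 16_000    # sampling frequency in Hz (from the text)
duration = 10  # length of one recording in seconds (from the text)

n_samples = fs * duration  # samples per recording
resolution = 1 / duration  # frequency resolution of a full-length DFT, in Hz
nyquist = fs / 2           # highest frequency that can be represented, in Hz
print(n_samples, resolution, nyquist)  # 160000 0.1 8000.0
```

Although the sensor responds up to 20 kHz, a 16 kHz sampling frequency means that only content below 8 kHz is available to the Mel spectrogram.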

4.2. Data Processing

Features of the wind turbine sound signal are extracted as a Mel spectrogram, which finally yields a 256 × 256 feature map. Figure 7 shows the Mel spectrograms of different gear faults. Each of the four fault conditions and the healthy condition yields 2000 samples, which are then divided into a training set and a test set at a ratio of 8:2. The specific fault sample distribution is shown in Table 1.

A wind turbine (mainly comprising the gearbox, blades, bearings, etc.) is usually installed in a high-altitude field environment. It bears dynamic heavy loads during long-term continuous operation, and its operating conditions are complex and changeable. The gearbox has a high probability of failure within the entire power transmission chain, so it is the research object of this paper.

Under laboratory conditions, four types of faults (chipped tooth, missing tooth, root fault, source fault) are artificially implanted in the wind turbine gearbox, and then the audio signal is collected under different fault conditions.

A chipped tooth is produced by milling small pieces of material off the tooth surface to simulate a tooth surface defect caused by impact. A missing tooth is produced by completely removing a tooth through mechanical processing to simulate severe fracture failure. A root fault simulates fatigue cracking at the fillet of the tooth root due to mechanical impact. A source fault is produced by creating local wear in the bearing raceway via chemical etching to simulate early surface degradation.

The spectrum distribution is uniform in the healthy state, intermittent impulse noise appears in the chipped tooth state, strong periodic peaks appear in the missing tooth state, continuous harmonic components appear in the root fault state, and early weak clutter appears in the source fault state.

Each state has 1600 training samples, which are labeled as follows: health 0001–1600, chipped 0001–1600, missing 0001–1600, root 0001–1600, and source 0001–1600.
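The 8:2 split and the labeling scheme can be reproduced with a short sketch. The text does not say how samples are assigned to the two sets, so the random shuffle with a fixed seed is an assumption, as are the function names.

```python
import random

STATES = ["health", "chipped", "missing", "root", "source"]


def split_per_class(n_per_class=2000, train_ratio=0.8, seed=0):
    """Shuffle the sample indices of each state and split them 8:2."""
    rng = random.Random(seed)
    cut = round(n_per_class * train_ratio)
    split = {}
    for state in STATES:
        ids = list(range(n_per_class))
        rng.shuffle(ids)
        split[state] = (ids[:cut], ids[cut:])
    return split


def training_labels(state, n_train):
    """Labels of the form 'health 0001' ... 'health 1600'."""
    return [f"{state} {k:04d}" for k in range(1, n_train + 1)]


split = split_per_class()
train_total = sum(len(train) for train, _ in split.values())
test_total = sum(len(test) for _, test in split.values())
print(train_total, test_total)  # 8000 2000
print(training_labels("health", 1600)[0], training_labels("health", 1600)[-1])
```

Splitting within each state keeps the two sets balanced: every state contributes 1600 training and 400 test samples.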

4.3. Model Training

All experiments in this study are carried out by using the PyTorch deep learning framework and run on a Pusai deep learning server equipped with a 3080 graphics card. The Adam optimization algorithm and the LambdaLR custom learning rate adjustment strategy are used to adjust the parameters of the model. The size of the input Mel spectrogram, batch size, learning rate, the output of the fully connected layer, and the number of training iterations are set as shown in Table 2.

4.4. Comparison of CBAM-ResNeXt50-ArcLoss with Classical Models

As the number of iterations increases, the accuracy of the CNN on the training set and test set reaches 86.8% and 86.5%, respectively, and remains stable. The loss value drops significantly at first and then tends to stabilize. Eventually, the loss value of the training set stabilizes around 0.388, and the loss value of the test set stabilizes around 0.368. The test results are shown in Figure 8.

The training results of the ResNet50 model are shown in Figure 9. As the number of iterations increases, the accuracy of the model gradually improves and then tends to stabilize. The accuracy of the training set and the test set finally stabilizes at 95.9% and 96.2%, respectively. The model loss value drops and then slowly tends to stabilize. Eventually, the loss values of the training set and the test set stabilize at 0.34 and 0.256, respectively. Compared with the CNN, the accuracy of the training set is increased by 9.1 percentage points, the accuracy of the test set is increased by 9.7 percentage points, and the loss values of the training set and the test set are reduced by 0.048 and 0.112, respectively.

The ResNeXt50 model is an improvement and enhancement of the ResNet50 model, which utilizes residual connections and group convolutions to maximize the feature extraction capability. Experiments show that the accuracy of the training set and test set of the ResNeXt50 model is 97.2% and 97.4%, respectively, which is an increase of 1.3 and 1.2 percentage points compared to the ResNet50 model. The loss values of the training set and test set are 0.159 and 0.146, respectively, which are reduced by 0.181 and 0.11, respectively. There is no overfitting or underfitting phenomenon. The accuracy and loss values of the ResNeXt50 model are shown in Figure 10.

The results of the CBAM-ResNeXt50-ArcLoss model are shown in Figure 11. As the number of iterations increases, the accuracy of the model on the training set and test set gradually increases and finally stabilizes at 99.6% and 99.8%, respectively. The loss value of the model gradually decreases and then tends to be stable, finally stabilizing at 0.082 and 0.054, respectively.

As shown in Table 3, which verifies the effectiveness of the algorithm, the test accuracy of the CBAM-ResNeXt50-ArcLoss model is 13.3, 3.6, 2.4, and 1.3 percentage points higher than that of the CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50 models, respectively. The introduction of the CBAM and the more complex ArcLoss loss function adds to the computation of the model. Therefore, its training time is slightly longer than that of the other models, but its robustness and generalization ability are improved.
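The first three improvements can be recomputed from the test accuracies reported earlier in this section, which confirms that they are differences in percentage points. The dictionary below only contains values stated in the text; the variable names are illustrative.

```python
test_accuracy = {"CNN": 86.5, "ResNet50": 96.2, "ResNeXt50": 97.4}  # %, from the text
proposed = 99.8  # % test accuracy of CBAM-ResNeXt50-ArcLoss

gains = {name: round(proposed - acc, 1) for name, acc in test_accuracy.items()}
print(gains)  # {'CNN': 13.3, 'ResNet50': 3.6, 'ResNeXt50': 2.4}
```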

After training, the confusion matrices obtained by testing CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50-ArcLoss with PyTorch 2.5.1 are shown in Figure 12. The horizontal coordinate represents the predicted labels of different faults, and the vertical coordinate represents the true categories of different faults. The numbers on the main diagonal of the matrix represent the number of samples correctly classified for each type of fault. The proposed method reaches a diagnostic accuracy of 99.8% for the normal state, which is a high accuracy rate.
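A confusion matrix with the layout described above (rows for true categories, columns for predicted labels) and the per-class accuracy read from its diagonal can be computed as in the following sketch. The labels in the example are toy values, not results from the study.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total (samples of that true class)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels, not results from the study.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)                      # [[2, 0, 0], [0, 1, 1], [0, 0, 2]]
print(per_class_accuracy(cm))  # [1.0, 0.5, 1.0]
```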

4.5. The Influence of Different Loss Functions on Model Results

The loss function used in CBAM-ResNeXt50-ArcLoss is the additive angular margin loss, which is an improvement over softmax loss. To verify its effect on the model results, we compared network models trained with different loss functions. As shown in Table 4, ArcLoss increased the accuracy by 1.3 and 0.7 percentage points compared with softmax and Triplet Loss, respectively, and decreased the loss value by 0.059 and 0.048, respectively. However, ArcLoss involves trigonometric function operations and conditional judgments, so it is more time-consuming to compute than the other loss functions. Based on the comparison results, the ArcLoss loss function enhances the performance of the model and is suitable for the fault diagnosis of wind turbine gearboxes using sound signals. In summary, compared with the other methods, the method proposed in this study shows excellent fault identification capability.
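The trigonometric operations and the conditional judgment mentioned above can be seen in a minimal NumPy sketch of the additive angular margin. It follows the common formulation of this loss, not necessarily the authors' implementation. The scale `s = 30.0`, the margin `m = 0.5`, the function names, and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np


def arc_margin_logits(features, weights, labels, s=30.0, m=0.5):
    """Scaled cosine logits with an angular margin m added to the true class.

    features: (N, D) embeddings, weights: (C, D) class centres, labels: (N,).
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)           # cos(theta) per class
    sin = np.sqrt(1.0 - cos ** 2)
    phi = cos * np.cos(m) - sin * np.sin(m)     # cos(theta + m)
    # Conditional step: if theta + m would exceed pi, cos(theta + m) stops
    # decreasing, so a linear penalty is used instead.
    phi = np.where(cos > np.cos(np.pi - m), phi, cos - m * np.sin(np.pi - m))
    one_hot = np.eye(weights.shape[0])[labels]
    return s * (one_hot * phi + (1.0 - one_hot) * cos)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()


# Hypothetical inputs: 8 samples, 16-D embeddings, 5 fault classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
centres = rng.normal(size=(5, 16))
y = rng.integers(0, 5, size=8)

plain_logits = arc_margin_logits(feats, centres, y, m=0.0)   # no margin
margin_logits = arc_margin_logits(feats, centres, y, m=0.5)  # with margin
loss_plain = cross_entropy(plain_logits, y)
loss_margin = cross_entropy(margin_logits, y)
print(loss_plain, loss_margin)
```

The margin lowers only the true-class logit, so for the same embeddings the loss is larger. To reduce it, training has to pull each sample closer to its own class centre in angle, which is the source of the smaller intra-class and larger inter-class differences.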

5. Conclusions

This study addresses two difficulties in diagnosing wind turbine gearbox faults: the gearbox bears variable and heavy loads under working conditions, and the available dataset is small, which makes it difficult to obtain a high recognition rate for gear faults. We therefore proposed a fault diagnosis method for wind turbine gearboxes based on Mel spectrograms and a CBAM-ResNeXt50-ArcLoss transfer learning model. The original sound signal is converted into a Mel spectrogram, a two-dimensional feature map that contains both time-domain and frequency-domain information. Mapping the frequency axis to the logarithmic Mel scale reduces its dimension, which removes redundant feature information, reduces noise interference, and improves signal processing efficiency. In the sound signal classification task, which carries the operating state information of the gearbox, the generated Mel spectrograms are input to CBAM-ResNeXt50-ArcLoss for training and verification, and a high fault classification accuracy is achieved. The CBAM enables the network to adaptively adjust its attention to different feature channels and spatial locations, while ArcLoss reduces intra-class differences, increases inter-class differences, and enhances feature discrimination. The ResNeXt structure improves accuracy without increasing parameter complexity, and the fine-tuning transfer learning strategy gives the model better generalization. Compared with the CNN, ResNet50, ResNeXt50, and CBAM-ResNeXt50 methods, the CBAM-ResNeXt50-ArcLoss model improves test accuracy by 13.3, 3.6, 2.4, and 1.3 percentage points, respectively (Table 3). This method provides a new idea for the fault diagnosis of wind turbine gearboxes and has potential engineering value for fault diagnosis, operation, and maintenance.
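The dimension reduction obtained by moving to the Mel scale can be sketched with the commonly used conversion m = 2595 log10(1 + f/700). This formula, the 8 kHz upper frequency, and the choice of 8 bands are assumptions made for illustration only and may differ from the settings used in the experiments.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale (common HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    """Edges (Hz) of n_mels triangular bands spaced evenly on the Mel scale."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(mels)


# Hypothetical setting: 8 Mel bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
widths = np.diff(edges)
print(np.round(edges, 1))
print(np.round(widths, 1))
```

Because the bands are evenly spaced in Mel but not in Hz, they are narrow at low frequencies and wide at high frequencies. Many linear frequency bins are therefore merged into a few Mel bands, which is how the spectrogram becomes more compact.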

It should be noted that the proposed method was validated on a balanced dataset. In practical applications, the number of healthy-state samples far exceeds that of fault samples, and this class imbalance issue may reduce diagnostic sensitivity. Although the ArcLoss loss function and CBAM themselves can enhance the ability to distinguish features of minority classes, further improvement measures are still recommended for scenarios with severe imbalance. Future research will focus on verifying the robustness of the model under imbalanced data distributions in the real world.

In this study, controlled experimental data from a single wind turbine gearbox were used. Known faults were artificially implanted, and audio signals were collected under laboratory operating conditions. This ensures the reliability of the fault labels and supports an initial validation of the method. However, a limitation of this study is that the training and validation data are derived from a single gearbox, which may not fully cover the complexity of real-life scenarios (such as operation in extreme weather or differences between gearboxes in different installations). Future research will validate the method under these conditions.

Author Contributions

X.Z.: Methodology, Formal analysis, Software, Validation, and Writing—original draft. F.J.: Software and Validation. Y.C.: Funding acquisition, Methodology, Project administration, and Writing review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Research Project of Hebei Education Department (CXY2024016).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Figure 1. A structural diagram of 32-channel ResNeXt.

Figure 2. ResBlock + CBAM attention mechanism module.

Figure 3. Transfer learning.

Figure 4. The structure of CBAM-ResNeXt50.

Figure 5. Overall structure of fault diagnosis based on transfer learning model.

Figure 6. Structure of gearbox and sensor position. (a) Structure of gearbox, (b) Sound pressure sensor installation position.

Figure 7. Mel spectrograms generated by different fault features.

Figure 8. Loss value and accuracy of CNN model.

Figure 9. Loss value and accuracy of ResNet50 model.

Figure 10. Loss value and accuracy of ResNeXt50 model.

Figure 11. Loss value and accuracy of CBAM-ResNeXt50-ArcLoss model.

Figure 12. Confusion matrix plot.

Table 1. Division of fault sample datasets.

Fault Label	Training Set	Test Set
Healthy working state	1600	400
Chipped tooth	1600	400
Missing tooth	1600	400
Root fault	1600	400
Source fault	1600	400

Table 2. Parameters of model.

Parameter	Value
Size of Mel spectrogram	256 × 256
Learning rate	0.001
Batch size	32
Number of classification categories	5
Number of training iterations	100

Table 3. Comparison of test results with the classical models.

Model	Accuracy (%)	Loss Value
CNN	86.5	0.368
ResNet50	96.2	0.256
ResNeXt50	97.4	0.146
CBAM-ResNeXt50	98.5	0.113
CBAM-ResNeXt50-ArcLoss	99.8	0.054

Table 4. The accuracy and loss value of the CBAM-ResNeXt50 model using different loss functions.

Loss Function	Accuracy (%)	Loss Value
Softmax	98.5	0.113
Triplet Loss	99.1	0.102
ArcLoss	99.8	0.054

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, X.; Jia, F.; Chen, Y. Fault Diagnosis of Wind Turbine Gearbox Based on Mel Spectrogram and Improved ResNeXt50 Model. Appl. Sci. 2025, 15, 8563. https://doi.org/10.3390/app15158563

