1. Introduction
With the increasing global demand for renewable energy, wind power, as a clean and efficient energy source, is playing an increasingly significant role in the global energy transition [1,2,3]. Wind turbines are the core equipment responsible for converting wind energy into electrical energy, and their stable and efficient operation directly affects the overall performance of wind power generation [4,5]. Over the long-term operation of wind turbines, the health of mechanical components, particularly the gearbox and generator, often determines the turbine’s service life and maintenance costs [6,7]. Therefore, efficient monitoring and prediction of the operational status of these critical components, especially their temperature variations, have become essential research topics in modern wind power generation [8]. Temperature fluctuations serve as a crucial indicator of wind turbine operational status, reflecting the condition of mechanical components, particularly under long-term operation, and abnormal fluctuations or sudden changes often signal potential failures [9]. For example, excessive gearbox temperatures can degrade the lubricating oil, increasing wear on mechanical parts and potentially causing system failures [10,11]. Consequently, real-time monitoring of wind turbine temperature and accurate forecasting of future temperature trends are vital for ensuring smooth turbine operation and minimizing downtime caused by failures [12].
In recent years, researchers have proposed various traditional mathematical modeling techniques for temperature prediction, which have yielded notable results under specific conditions and laid a solid foundation for the application of temperature forecasting [13]. These traditional methods rely primarily on models grounded in physical principles or statistical techniques, which can describe temperature variation to some extent [14]. For instance, Gu et al. [15] proposed a method that combines the HTcT model with multi-output least squares support vector regression (MOLSSVR). By replacing traditional temperature factors with temperature field features extracted through clustering, they overcame the issue of multicollinearity and introduced a gray wolf optimization (GWO) algorithm to optimize hyperparameters, thus improving prediction efficiency and accuracy. Chen et al. [16] introduced a novel hybrid model, S-GM-ARIMA, which integrates the GM model with the ARIMA model and is optimized through a linear combination weight calculation method; after comparing different weight calculation techniques, they adopted the standard deviation method to compute the weights, enhancing prediction accuracy. Das et al. [17] proposed a model-free temperature prediction approach. Other studies have focused on developing one-step prediction models and prediction intervals to improve forecasting accuracy, particularly for long, locally stationary time series, outperforming the widely used RAMPFIT algorithm. While these methods have made significant progress in certain areas, traditional mathematical models generally rely on prior physical knowledge or assumptions. When confronted with complex, nonlinear temperature variations, they struggle to capture the underlying patterns in the data fully and accurately [18]. As a result, traditional methods still face substantial challenges in practical applications, which limits their broader adoption in industrial settings.
Compared to traditional analytical methods, machine learning techniques have shown considerable advantages in handling complex, nonlinear data, as they can automatically identify and extract key features [19]. Among the various machine learning models, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) have been widely applied across numerous tasks [20]. LSTM is particularly well suited to time series prediction: by incorporating memory units, it can effectively capture long-term dependencies, making it especially suitable for dynamic sequential data and significantly enhancing the stability and accuracy of long-term forecasting [21]. For example, Wang et al. [22] proposed a short-term water quality prediction model that combines variational mode decomposition (VMD) with an improved grasshopper optimization algorithm (IGOA) to optimize an LSTM network, significantly improving short-term prediction accuracy. Shahid et al. [23] developed a short-term wind power prediction model based on LSTM that incorporates a wavelet kernel function to capture the dynamic characteristics of wind power data; by combining LSTM with wavelet transforms, the model addressed the nonlinear mapping problem and thereby enhanced prediction accuracy. Huang et al. [24] designed an oil and gas production prediction model based on LSTM, using particle swarm optimization (PSO) to tune the LSTM configuration. The model effectively captured the dependencies in oil and gas production time series while integrating production constraints, and it demonstrated high prediction accuracy in multiple real-world applications, particularly in complex oil and gas production systems, outperforming traditional prediction methods and showing strong adaptability to various production environments.
Although the LSTM model excels at handling time series data, it can only capture unidirectional temporal dependencies, which may limit its ability to recognize certain complex sequential patterns. To overcome this limitation, the bidirectional long short-term memory network (BiLSTM) was introduced. BiLSTM learns both forward and backward information from a time series, enabling it to capture richer contextual information; this enhances its capacity to model complex temporal features and ultimately improves prediction accuracy. For instance, Cui et al. [25] proposed a method that combines singular spectrum analysis (SSA) with BiLSTM to accurately predict missing values in MODIS land surface temperature (LST) data; validation showed that the method maintained high prediction accuracy even at high missing rates. Zhang et al. [26] introduced a monthly average temperature prediction model based on CEEMDAN-BO-BiLSTM and applied it to temperature forecasting in Jinan; the results demonstrated that the model significantly outperformed other models in prediction accuracy and adaptability, offering an effective solution for temperature forecasting. Similarly, Jiang et al. [27] proposed a method combining an elite-preserving genetic algorithm (EGA) with BiLSTM for temperature prediction in battery energy storage power stations (BESP). Validated with real-world data, this method effectively improved temperature prediction accuracy, providing reliable forecasts for the safe operation of BESP.
Although the BiLSTM model has shown success in various fields, its limitations become evident when dealing with complex temperature data. Temperature data often exhibit nonlinearity and periodicity, and a single model may not fully exploit the complementary features across different types of data. As a result, hybrid models, which combine the strengths of multiple models to enhance prediction accuracy, have become increasingly popular [28]. For example, Tabrizchi et al. [29] proposed an efficient temperature prediction model for data centers by combining CNNs with a multi-layer BiLSTM, significantly improving prediction accuracy and reducing errors. Ji et al. [30] introduced a novel hybrid prediction model that combines CNNs, BiLSTM, and squeeze-and-excitation (SE) networks, aiming to leverage the strengths of multiple deep learning models to enhance furnace temperature prediction accuracy. Similarly, Jiang et al. [31] proposed a deep learning model combining LSTM, an encoder–decoder structure, and an attention mechanism for short-term indoor temperature forecasting; it outperformed traditional LSTM and GRU models, demonstrating higher prediction accuracy and greater stability.
In addition to traditional CNNs, depthwise separable convolutional neural networks (DSCNNs) have gained increasing attention for their ability to extract spatial features efficiently while significantly reducing computational complexity. In recent years, DSCNNs have also been explored for time series signal processing. For example, Yu et al. [32] combined GRUDMU with DSCNNs and deployed the model on edge devices, enhancing real-time fault diagnosis performance in edge computing scenarios. Xie et al. [33] combined a 1D-DSCNN with global max pooling (GMP) to create the 1D-DSCNN-GMP model, which was optimized using TensorRT and deployed on edge devices, achieving improved fault diagnosis with a smaller model size and faster inference. Wang et al. [34] combined principal component analysis (PCA) and Gramian angular field (GAF) methods with a DSCNN for operating-state recognition of hydroelectric generating units, achieving high fault-diagnosis accuracy.
These studies highlight the growing potential of DSCNNs across diverse applications, demonstrating their effectiveness in improving both the efficiency and accuracy of fault diagnosis and state recognition tasks. Building on these advances, this paper introduces a hybrid model that combines DSCNNs with BiLSTM for accurate wind turbine gearbox temperature prediction. As in previous applications, the DSCNN component significantly reduces the parameter count by using depthwise separable convolutions, improving computational efficiency while preserving the ability to extract robust spatial features. The BiLSTM component, in turn, captures bidirectional dependencies in the time series, strengthening the model’s ability to represent periodic and long-term dependencies. By leveraging the strengths of both components, the proposed DSCNN-BiLSTM hybrid integrates spatial and temporal learning, handles large-scale, high-dimensional temperature data, and enables precise wind turbine temperature prediction. To evaluate its performance, this study conducts temperature prediction experiments on two real-world datasets from a wind farm in Shaanxi. The experimental results show that the DSCNN-BiLSTM model significantly outperforms traditional methods in prediction accuracy and generalization, underscoring its feasibility and effectiveness for real-world engineering applications. The main contributions of this paper are as follows:
(1) A novel DSCNN-BiLSTM hybrid model is proposed, enhancing prediction accuracy by combining spatial feature extraction with time series modeling.
(2) An efficient model architecture, tailored to the characteristics of wind turbine temperature data, is designed to handle large-scale and complex temperature data.
(3) Experimental validation demonstrates the model’s effectiveness in predicting wind turbine motor and gearbox temperatures, highlighting its potential for practical applications.
The structure of this article is organized as follows: Section 2 presents the basic theory of the proposed model; Section 3 details the implementation of the proposed method; Section 4 conducts several experiments to evaluate the performance of the proposed model; and, finally, Section 5 provides the conclusion.
3. Methodology
3.1. The Proposed DSCNN-BiLSTM Model
In this paper, we propose a hybrid model that combines depthwise separable convolutional neural networks (DSCNNs) and bidirectional long short-term memory (BiLSTM) for temperature prediction. The model retains the advantages of both DSCNNs and BiLSTM, enabling efficient extraction of spatial and temporal features while reducing the number of trainable parameters, thereby accelerating training and allowing the model to reach optimal performance more quickly. In this hybrid model, the DSCNN is used for spatial feature extraction. As an enhanced version of the traditional convolutional neural network, the DSCNN effectively captures spatial features from the input data. Compared with conventional convolutional networks, DSCNNs significantly reduce the number of required convolutional kernels, thereby decreasing the number of parameters to be trained. This not only makes the training process more efficient but also improves the model’s performance on high-dimensional data. The BiLSTM is employed to extract temporal features. By integrating both forward and backward network structures, BiLSTM allows the model to use information from both past and future time steps, enabling it to model long-term dependencies and periodic patterns in time series data and thereby improving the accuracy of temperature prediction. The hybrid model combines the strengths of both components, significantly improving performance in temperature prediction tasks. The workflow of the model is shown in Figure 4.
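To make the architecture concrete, the following is a minimal PyTorch sketch of one possible DSCNN-BiLSTM arrangement: a depthwise separable 1D convolution block for spatial feature extraction, followed by a BiLSTM and a fully connected output layer. The channel counts, kernel size, hidden units, and dropout rate shown here are illustrative assumptions, not the exact configuration reported in Table 1.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise separable 1D convolution: a per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

class DSCNNBiLSTM(nn.Module):
    """Illustrative DSCNN-BiLSTM: spatial features via a depthwise separable
    convolution, temporal features via a bidirectional LSTM."""
    def __init__(self, n_features, conv_ch=32, lstm_hidden=64, dropout=0.2):
        super().__init__()
        self.conv = nn.Sequential(
            DepthwiseSeparableConv1d(n_features, conv_ch),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(conv_ch, lstm_hidden, batch_first=True,
                              bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * lstm_hidden, 1)  # one-step temperature output

    def forward(self, x):  # x: (batch, time, n_features)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, conv_ch)
        out, _ = self.bilstm(z)
        return self.fc(self.dropout(out[:, -1]))  # predict from the last time step
```

Because the depthwise and pointwise stages factorize a standard convolution, this block uses far fewer weights than an equivalent full Conv1d, which is the parameter reduction discussed above.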
3.2. Data Preprocessing
In this study, the preprocessing of temperature data involves two key steps: data normalization and slicing. Data preprocessing is crucial to ensure the model can effectively learn the data features, avoid potential issues during training, and improve the model’s overall performance. The following provides a detailed description of these two preprocessing techniques. Normalization of the temperature data is an essential step. Since output values from different sensors can vary significantly and the units of different features may not be consistent, unnormalized data could lead to certain features dominating the model training process, potentially affecting the model’s convergence and stability. To address this, the min-max normalization method was employed. Min-max normalization compresses the data into a specified range, typically [0, 1], ensuring that all features are scaled uniformly during training. This process is implemented using the following formula:

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

In this equation, $x$ represents the original data, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the data, and $x'$ is the normalized data. By employing this method, we can effectively mitigate issues arising from differences in data scales or extreme values, ensuring that each feature contributes equally during model training. Additionally, normalization improves the efficiency of the gradient descent optimization algorithm, reduces the model’s convergence time, and prevents numerical instability during training. However, if the input data during inference fall outside the [0, 1] range, the model may produce erroneous or unstable predictions. To address this, we apply a clipping strategy: values below 0 are set to 0, and values above 1 are set to 1. This keeps the input data within the expected range, preserving the model’s stability and performance.
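As a concrete illustration, the following is a minimal sketch of min-max normalization with the clipping strategy described above. The function names, the use of NumPy, and the example values are assumptions for illustration, not the exact implementation used in this study.

```python
import numpy as np

def fit_min_max(train_data):
    """Compute per-feature min and max from the training data only."""
    return train_data.min(axis=0), train_data.max(axis=0)

def min_max_normalize(data, x_min, x_max):
    """Scale data to [0, 1] using the training min/max, then clip so that
    out-of-range inference inputs stay within the expected range."""
    scaled = (data - x_min) / (x_max - x_min)
    return np.clip(scaled, 0.0, 1.0)

# Example usage (hypothetical sensor readings):
train = np.array([[20.1, 55.0], [25.3, 60.2], [30.8, 58.7]])
x_min, x_max = fit_min_max(train)
test = np.array([[32.0, 54.0]])  # slightly outside the training range
print(min_max_normalize(test, x_min, x_max))  # out-of-range values are clipped
```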
After data normalization, to effectively prepare the dataset for model training, this study further divides the data using a time-window-based sampling strategy. As illustrated in Figure 5, we applied a sliding window approach to transform the original time series data into training samples and corresponding labels. This strategy splits the original time series into multiple fixed-length samples, with each sample corresponding to a prediction label at a specific time step. The steps are as follows:
Step 1: Initial Time Window Selection. The first n consecutive data points from the dataset are selected as the input sample, with the data point immediately following these points serving as the label. These data points represent the historical input of the model, while the label corresponds to the prediction target.
Step 2: Sliding Window Update. The first data point in the dataset is removed, and the window of the training sample slides forward. The new training sample consists of the next n consecutive data points, with the label being the data point immediately following these points. Each sliding window generates a new sample-label pair.
Step 3: Sample Generation Repetition. Steps 1 and 2 are repeated until there are insufficient remaining data points to form a new training sample. The sliding window continues to generate new sample-label pairs until all available data have been processed.
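The following is a minimal sketch of this sliding-window sample generation (Steps 1 to 3), assuming a univariate NumPy series and a window length n; the function name and return format are illustrative assumptions.

```python
import numpy as np

def make_sliding_windows(series, n):
    """Split a 1D time series into (sample, label) pairs: each sample is n
    consecutive points and its label is the point that immediately follows."""
    samples, labels = [], []
    for start in range(len(series) - n):         # stop when no label remains
        samples.append(series[start:start + n])  # window of n historical points
        labels.append(series[start + n])         # next point is the target
    return np.array(samples), np.array(labels)

# Example: a series of 10 points with a window length of 4 yields 6 pairs.
series = np.arange(10, dtype=float)
X, y = make_sliding_windows(series, n=4)
print(X.shape, y.shape)  # (6, 4) (6,)
```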
3.3. Model Configuration and Training Parameters
In this study, the proposed DSCNN-BiLSTM model leverages carefully chosen training parameters and strategies to ensure efficient temperature prediction. The overall network architecture and parameter configuration are summarized in Table 1. To optimize performance, we explored various network configurations, including different depths for the DSCNN layers, the number of BiLSTM units, and the design of the fully connected layers. The final architecture was selected based on its validation performance and computational efficiency. Hyperparameter tuning was used to determine the optimal dropout rate, achieving a balance between generalization and training stability. No significant overfitting was observed during the validation phase, as further elaborated in Section 4.
During the training process, the Adam optimizer was used. Owing to its adaptive learning rate, Adam automatically adjusts the step size according to the update requirements of each parameter, accelerating the model’s convergence. Specifically, the learning rate was set to 0.001, the batch size to 32, and the number of training epochs to 200, ensuring that the model could learn the deeper features of the data through sufficient iterations. The DSCNN-BiLSTM model was implemented in Python 3.9 with PyTorch 1.10, a popular deep learning framework. All experiments were carried out on a workstation running Windows 11, equipped with an Intel i5-12400F CPU and an RTX 3060 Ti GPU.
In terms of the loss function, this study adopted mean squared error (MSE) for the temperature prediction task. MSE is a widely used loss function for regression problems, as it quantifies the difference between the model’s predictions and the actual values, making it suitable for continuous numerical data. The loss function $L_{\mathrm{MSE}}$, which represents the average squared difference between the predicted values and the actual values, is defined as follows:

$$ L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 $$

Here, $\hat{y}_i$ represents the predicted output of the i-th sample, $y_i$ represents the true value of the i-th sample, and $N$ is the total number of samples. By minimizing the MSE, the model can better approximate the true temperature variations, thereby improving the prediction accuracy.
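To tie the configuration together, the following is a minimal sketch of the training setup described above (Adam with a learning rate of 0.001, batch size 32, 200 epochs, MSE loss), reusing the illustrative DSCNNBiLSTM class sketched in Section 3.1. The dataset construction and variable names are assumptions for illustration, not the authors’ exact training script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# X: (num_samples, window_length, n_features), y: (num_samples, 1), assumed to
# come from the normalization and sliding-window steps in Section 3.2.
def train_model(model, X, y, epochs=200, batch_size=32, lr=1e-3):
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # adaptive step sizes
    criterion = nn.MSELoss()                                  # mean squared error
    model.train()
    for epoch in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)  # average squared prediction error
            loss.backward()
            optimizer.step()
    return model
```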