Article

A Novel Data-Driven Approach with a Long Short-Term Memory Autoencoder Model with a Multihead Self-Attention Deep Learning Model for Wind Turbine Converter Fault Detection

by Joel Torres-Cabrera 1, Jorge Maldonado-Correa 1,2,*, Marcelo Valdiviezo-Condolo 1, Estefanía Artigao 2, Sergio Martín-Martínez 2 and Emilio Gómez-Lázaro 2

1 Technological and Energy Research Center (CITE), National University of Loja, Loja 110150, Ecuador
2 Renewable Energy Research Institute (IIER), University of Castilla-La Mancha, 02071 Albacete, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7458; https://doi.org/10.3390/app14177458
Submission received: 24 July 2024 / Revised: 14 August 2024 / Accepted: 18 August 2024 / Published: 23 August 2024
(This article belongs to the Special Issue Advanced Forecasting Techniques and Methods for Energy Systems)

Abstract
The imminent depletion of oil resources and increasing environmental pollution have driven the use of clean energy, particularly wind energy. However, wind turbines (WTs) face significant challenges, such as critical component failures, which can cause unexpected shutdowns and affect energy production. To address this challenge, we analyzed Supervisory Control and Data Acquisition (SCADA) data to identify significant differences in the relationships between variables, based on the reconstruction errors between actual and predicted values. This study proposes a hybrid long short-term memory autoencoder model with multihead self-attention (LSTM-MA-AE) for WT converter fault detection. The proposed model identifies anomalies in the data by comparing the reconstruction errors of the variables involved. However, this alone is not sufficient. To address this limitation, we developed a fault prediction system that employs an adaptive threshold based on an Exponentially Weighted Moving Average (EWMA) together with a fixed threshold. This system analyzes the anomalies of several variables and generates fault warnings in advance. We thus propose an outlier detection method based on data preprocessing and unsupervised learning, using SCADA data collected from a wind farm located in complex terrain and including real converter faults. The LSTM-MA-AE is shown to predict converter failure 3.3 months in advance, with an F1-score greater than 90% in the tests performed. The results provide evidence of the potential of the proposed model to improve converter fault diagnosis with SCADA data in complex environments, highlighting its ability to increase the reliability and efficiency of WTs.

1. Introduction

The world is currently facing the depletion of traditional energy resources such as oil and coal, and the challenges of climate change. At the same time, global energy demand continues to grow to meet economic and social needs. In response, clean energy is emerging as an alternative solution, releasing less carbon dioxide and offering unlimited energy sources to drive sustainable development [1]. The initiative to achieve carbon neutrality by 2050, proposed in the Paris Agreement, has increased interest and investment in the renewable energy sector.
Wind energy, in particular, represents a viable alternative to fossil fuel power generation, using natural sources to produce electricity. The cost of producing energy with this technology is decreasing yearly, with this trend being expected to continue [2]. Moreover, the continued development of wind turbines (WTs), driven by the integration of advanced technologies, is expected to significantly increase their efficiency and generating capacity.
However, wind power generation faces challenges regarding the availability, reliability, and lifetime of WTs. For this reason, manufacturers have focused their efforts on prolonging the lifetime of electrical and mechanical systems, which has resulted in reduced failures during operation and consistent production of high-quality power [3].
Unexpected failures in these systems can negatively affect availability and production rate. Components such as blades [4], generators [5], power converters [3], and gearboxes [6] are especially vulnerable to failure due to harsh environmental and operating conditions, leading to extended downtime for maintenance. If unaddressed, these challenges could significantly impact the renewable energy industry.
Fault detection is a critical problem with two main aspects: accuracy and cost [7]. The high vibration and noise levels of generators complicate the accurate measurement of faults. In addition, the diversity of faults depending on environmental conditions requires an accurate classification of each situation [8]. Implementing multiple sensors and complex algorithms is costly, especially for large-scale wind farms, and challenging to apply in smaller installations. As WTs increase in size and number, the need for cost reduction in monitoring and repair becomes more urgent [9,10].
One of the main WT components is the converter. The converters in WTs play a crucial role in facilitating the control and efficient use of energy, especially in complex terrain and with lower maintenance requirements [11]. The converter is a critical component in the operation of the WT, as it converts the variable frequency and voltage output from the generator into a stable, grid-compatible alternating current (AC) with a consistent frequency and voltage, ensuring that the electrical power generated is suitable for transmission and use [12]. According to [13,14], converters are one of the components that fail most in WTs, responsible for approximately 23% of the failures; the authors indicate that humidity, contamination, and other factors play an essential role in the incidence of these failures. Additionally, with the increasing development of direct-drive WTs, the converter is one of the most failure-prone components. This aspect becomes especially critical in complex terrains, where environmental and operating conditions are significantly more challenging than in less demanding terrains [15].
This study proposes an improved approach to fault detection using Supervisory Control and Data Acquisition (SCADA) system data. The goal is to detect faults in the converter of a WT located in complex terrain. This approach aims to optimize reliability and reduce the costs associated with operation and maintenance (O&M). The ultimate objective is to ensure the greater long-term viability and sustainability of wind energy.
Two common approaches to detecting WT faults are classification algorithms (supervised) and reconstruction models (unsupervised) [16]. Classification algorithms train machine learning (ML) models, such as decision trees and k-nearest neighbors, using labeled data and testing them on new data. However, labeling data can be a complex and expensive process. An example of recent work in this area is that by Khan in 2022 [17], who proposed a new classifier called AdaBoost, K-nearest neighbors, and logistic regression-based stacking ensemble (AKL-SE) to classify faults in WT monitoring systems, obtaining promising results.
The aim of reconstruction models, in contrast, is to understand the time series and reconstruct the variables to detect anomalies, using the reconstruction error as a measure to identify outliers. A notable example of this approach is the research by Xiang et al. [18], which proposes a method for WT fault detection. This method, which combines a convolutional neural network (CNN) with a long- and short-term memory (LSTM) network based on an attention mechanism (AM), is designed to alert on generator and gearbox faults, making it highly relevant to the field of WT systems.
In addition, deep learning (DL) networks have also been used in this area. Liu et al. [19] propose a new Deep Residual Network (DRN) for WT fault detection. In their method, raw data from the SCADA system are directly used as inputs to the DRN, which incorporates a Convolutional Residual Building Block (CRBB) with convolutional layers and compression and excitation units. This approach performs fault recognition and classification only when faults occur or are imminent, although, in practice, it is preferable to provide early warnings before a fault occurs.
A notable limitation is that most studies focus on detecting failures in mechanical or electrical components [20,21,22], while only some address failures in electronic components using SCADA data [23,24]. In this context, one of the studies exploring converters is that by Xiao et al. [25], who propose an improved structure called attention octave convolution (AOctConv), applied to the ResNet50 backbone (called AOC–ResNet50) for detecting faults in WT converters using SCADA data with 10 min intervals. In line with this research, we study this component for its recurrent failures, as demonstrated in Section 2.3.
Meanwhile, hybrid models are more robust than their base models, as evidenced in recent studies [26,27]. However, a limitation of these approaches is that they tend to detect alarms and do not attempt to predict failures.
An emerging approach in this context is the use of attention mechanisms, as presented in the study by Wang et al. [28]. The authors propose a novel method using an integrated AM with a multivariate query pattern for anomaly detection and underlying cause identification. The proposed anomaly detection model comprises multiple cascaded encoders and decoders with a multihead self-attention mechanism to extract correlations between variables. Inspired by this approach, we propose integrating a similar mechanism into our architecture.

1.1. Contribution of the Present Work

The aim of implementing an autoencoder (AE) model is to reconstruct the input data. However, the literature has shown that hybrid models yield superior performance. For this reason, this work implements LSTM and Multihead Attention (MA) layers in the AE architecture, specifically in the encoding and decoding layers, to capture more complex temporal relationships.
Specifically, we propose an architecture for fault prediction in WT converters using SCADA data, with the objective of improving reliability and configurability through fault prediction. This study implements an unsupervised learning approach using an LSTM-AE with multihead attention (LSTM-MA-AE), which incorporates temporal features from SCADA data.
This model is evaluated on a real dataset from a wind farm located in a complex terrain in southern Ecuador. We evaluate it with different architecture configurations to measure its performance in anomaly detection.
Therefore, the main contributions of this research are:
  • Development of a DL model that integrates autoencoders, LSTM networks, and MA for fault detection in WT converters.
  • Converter fault detection using SCADA data from a wind farm in complex terrain.
  • A system for predicting converter failures under real WT operating conditions using unsupervised learning.

1.2. Background and Motivations

Ref. [29] reports several methods for fault detection in WTs, which can be classified into model-based, signal-processing, and data-based approaches. Furthermore, ref. [11] presents a review of the literature on converter faults; the authors cover model-based, signal-based, and data-based methods and conclude that data-based methods have a high fault diagnosis capability.
Furthermore, in [25], a study is presented for detecting faults in WT converters from SCADA data using CNN; the effectiveness of the research was verified through a comparative study. In addition, ref. [30] is a study detailing the most common WT converter faults, and presents a Transfer Learning (TL)-based method; the results demonstrate the accuracy and efficiency of the TL method in diagnosing WT faults.
A strategy using the wavelet transform, feature analysis, and a Back Propagation Neural Network (BPNN) to accurately identify open-circuit faults in WT converters is presented in [31]; the results show that the proposed strategy can successfully classify converter faults. In addition, ref. [32] describes a data-based approach to detect WT converter faults using an LSTM; the results show that the proposed method has powerful data processing capabilities and high diagnostic accuracy. Similarly, early anomaly detection and root cause analysis in WTs using SCADA data are proposed in [33]. For this purpose, a hybrid model using an LSTM-based asymmetric variational autoencoding Gaussian mixture model (LSTM-AVAGMM) is employed. The robustness and competitiveness of the model are demonstrated in two case studies.
Moreover, a hybrid DL model combining recurrent neural networks (RNN) and LSTM for WT condition monitoring is presented in [34]. LSTM-AE is employed for data processing and feature extraction in the proposed model. SCADA data and simulated data are then used to provide a complete learning model of the WT behavior. The experimental results unequivocally demonstrate that the proposed model surpasses existing ML algorithms in terms of fault prediction accuracy.
Likewise, the approach proposed in [35] is based on a supervised implementation of the variational AE model, which allows for the projection of the WT system into a low-dimensional representation space for early fault prediction. Another similar work is presented in [36], where a deep AE (DAE), enhanced by fault cases, is developed for anomaly detection in WTs. With the help of fault cases, the DAE can capture normal operation data patterns and acquire deep embedding features. Experimental results show that the method outperformed current AE-based methods in WT anomaly detection across multiple evaluation metrics.
In [37], a technique is presented to monitor the aging of insulated-gate bipolar transistor (IGBT) modules in offshore WT converters using SCADA data and a hybrid AE and attention-based LSTM (AT-LSTM) model. The AT-LSTM model is used to learn from SCADA data and establish a temperature prediction. AE is used to detect anomalies. Experimental results validate the effectiveness of the proposed model.
Meanwhile, in a previous study, we determined that the scientific literature on using SCADA data to predict failures in WT converters located in complex terrain is limited [24]. The Villonaco Wind Farm (VWF), located in Ecuador in a mountainous area at approximately 2700 masl, presents a challenging climate and irregular terrain, entailing unique characteristics that distinguish it from other wind farms [38]. These unique conditions highlight the need for advanced fault detection approaches and underlie the motivation for this study.
In modern WTs, the converter controls the speed and torque of the generator in addition to the power transfer to the grid [39]. Moreover, the study of the converter is a relevant research topic currently because a failure in this component can cause an unexpected WT shutdown, resulting in a decrease in energy production. This situation is even more critical in wind farms located in complex terrain, which can lead to long downtimes of the WT.
In addition, early fault detection is crucial for WT maintenance, as it saves significant time and costs. Implementing hybrid models and effectively using SCADA data can improve the reliability and efficiency of predictive maintenance, leading to a reduction in wind farm O&M costs.
Finally, Table 1 summarizes key studies on failure prediction, from which some conclusions can be drawn. Most studies do not indicate how far in advance the models predict the failure before it occurs. Furthermore, there is a small body of scientific research focused on predicting WT converter failures, while more studies are needed to address the prediction of failures in WTs located in complex terrains, such as the wind farm studied in this work. Therefore, our motivation is to help fill the knowledge gap mentioned above.
The remainder of the paper is structured as follows: Section 2 describes the methodological process, the model architecture, the failure prediction system, and the evaluation methods. Section 3 details the results of the evaluation of the fault prediction model. Finally, Section 4 offers conclusions and suggestions for future research.

2. Materials and Methods

Figure 1 shows the methodological process for failure prediction employed in this study. The figure describes the process from raw data acquisition to feature cleaning and selection, proposed model development, and fault prediction. The raw data undergoes a data cleaning and filtering process, with variables related to the target component then being selected. These variables are the input to the LSTM-MA-AE model, which reconstructs the output signal for each variable. Subsequently, the predicted signals enter a fault prediction system where they are compared with each actual signal, and the reconstruction error is used to calculate the abnormality score. This system generates an alarm when a significant discrepancy is detected, allowing it to predict and alert of the possibility of a failure.

2.1. Description of the Data Set and Study Area

This paper uses SCADA data from the VWF, located in Ecuador, between UTM coordinates 693,030 E 9,558,392 N and 693,526 E 9,556,476 N. The VWF is located in a mountainous area at approximately 2720 m.a.s.l., and includes 11 GOLDWIND GW70/1500 WTs of 1.5 MW nominal power with Direct Drive technology [10,15]. Based on the available information, there are few wind farms located at that altitude or higher.
The SCADA operational data and fault records correspond to WT2 of the VWF, recorded between 1 January 2014, and 31 October 2021, representing 386,288 records and 69 variables.

2.2. Data Processing

For data processing, raw files in .txt and .tmp format were initially taken from the SCADA system, processed, and concatenated into a unified file. The steps performed for data cleaning and filtering to ensure data integrity are described below.
  • Variables with null values exceeding 10,000 records were removed from the dataset, as these could potentially skew the analysis.
  • Variables that, based on domain knowledge, do not contribute significant features to the analysis were eliminated.
  • Null values per complete row were imputed, representing a maximum of 7 non-significant records in the WT2 variables.
  • Data were filtered to include only records with an active power greater than 0 kW and less than 1600 kW, ensuring the WT is within the guaranteed operating range.
  • The operating mode of the WT must be 5, which indicates, in this case, that it is not in power limitation or maintenance.
  • The wind speed range was set between 3 m/s and 25 m/s, as these are the start and cut-off values of the GOLDWIND GW70/1500 WT, ensuring the data reflects typical operating conditions.
  • Temperature outliers were eliminated: specifically, IGBT temperatures higher than 120 °C, considering the average is 60 °C.
  • Temperatures should be greater than 0 °C, since no sub-zero temperatures have been recorded in the study area, which has an average historical temperature of approximately 15 °C.
This approach ensures that the data used for analysis and model training are consistent and high-quality, which is essential for accurate and reliable results.
After data processing, the final set was reduced to 317,323 records and 55 variables. That is, 17.85% of the data were eliminated.
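As an illustration, the following minimal sketch applies these cleaning rules with pandas. It is not the authors' code: the filter implementation is an assumption, and while most column names (grid_active_power_avg, wind_speed_avg, igbt_temperature_max, ambient_temperature_avg) appear in the paper, operating_mode is a hypothetical name.

```python
# Minimal sketch of the Section 2.2 cleaning rules, assuming a pandas
# DataFrame of 10 min SCADA records; names and details are illustrative.
import pandas as pd

def clean_scada(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns whose null count exceeds 10,000 records (rule 1).
    df = df.loc[:, df.isna().sum() <= 10_000]
    # Impute the few fully null rows (rule 3); forward fill is one option.
    df = df.ffill()
    # Guaranteed operating range: 0 < active power < 1600 kW (rule 4).
    df = df[(df["grid_active_power_avg"] > 0) & (df["grid_active_power_avg"] < 1600)]
    # Operating mode 5: no power limitation or maintenance (rule 5).
    df = df[df["operating_mode"] == 5]
    # Cut-in/cut-out wind speeds of the GW70/1500 (rule 6).
    df = df[df["wind_speed_avg"].between(3, 25)]
    # Temperature sanity limits (rules 7 and 8).
    df = df[(df["igbt_temperature_max"] <= 120) & (df["ambient_temperature_avg"] > 0)]
    return df
```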

2.3. Target Component

The converter failure analysis was performed following the recommendations of VWF operating technicians. A review of the maintenance records and SCADA alarms confirmed that the converter has the highest failure rate at 89.3% (see Figure 2a). This percentage differs from the 23% mentioned in [14] for converter failures, since the cited article includes failures of other components of the WT, such as the generator, gearbox, pitch system, among others. In our case, the WT is relatively new, so no failures in these other components have been recorded, resulting in a higher percentage of failures attributed to the converter.
Additionally, SCADA system alarms on the converter account for 86.4% of the total, significantly outnumbering those on the pitch system (see Figure 2b). In this study, only the converter and pitch components were analyzed because both required replacement as critical parts of the WT.
In addition, it is worth noting the conceptual difference between failures and alarms. A failure involves the shutdown of the WT, and generally requires the replacement of the affected component, while an alarm may cause a temporary shutdown of the wind turbine, but does not necessarily lead to the replacement of the component. Failures can stop production and cause significant losses. It is thus essential to predict when a failure will occur, allowing preventive measures to be taken to minimize the impact on production.

2.4. Feature Selection

As mentioned above, the SCADA dataset used in this study consists of 55 variables. The variables were reduced to prevent the model from becoming too complex and to thus improve the learning of the features related to the target component. For this purpose, the variables were selected based on Pearson’s correlation.
Figure 3 depicts a correlation matrix for several variables related to the operation of the wind turbines and their components, particularly the converter and the IGBT. This matrix shows the strength and direction of the relationships between pairs of variables, represented by colors, where red indicates a high positive correlation and blue a high negative correlation.
The variable igbt_temperature_max shows a strong positive correlation with several other variables, such as wind_speed_avg, grid_active_power_avg, generator_speed_avg, and ambient_temperature_avg. This suggests that the IGBT temperature is closely linked to the overall WT operation and environmental conditions.
Three converter faults were identified in the preliminary analysis, which prevented us from observing significant correlations between the IGBT failure variable and the other variables. To address this challenge, the 10,000 records prior to each failure were labeled under the assumption that the component degrades over time and that indications of failure may exist in such data. This allowed the dataset to be balanced and correlations to be observed. This procedure is similar to that employed in [24]. In addition, although the observed correlations are small, they provide insight into the variables that could be involved in the faults.
Despite the low direct correlation between IGBT failures and the other variables, including tags in the pre-failure data provides valuable insight into understanding conditions that could contribute to converter failures.
In summary, this study chose variables based on their correlation with the target variable, correlation with converter failures, correlation with converter alarms, and the authors’ domain knowledge. Thus, nine variables related to the target component were selected, and are shown in Table 2.
This selection ensures that the model effectively captures the operating conditions of the target component, anticipates possible failures, and minimizes the detection of failures in other components, improving the WTs reliability and efficiency.
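For reference, a correlation-based screening step of this kind could look as follows in pandas. The thresholds, the pre_failure label column, and the helper name are illustrative assumptions; the paper's final selection also weighs alarm correlations and domain knowledge.

```python
# Hypothetical sketch of Pearson-correlation screening for Section 2.4.
import pandas as pd

def select_features(df: pd.DataFrame, target: str = "igbt_temperature_max",
                    label: str = "pre_failure") -> list:
    # Absolute correlation of every variable with the target variable.
    corr_target = df.corr(numeric_only=True)[target].abs()
    # Absolute correlation with the binary pre-failure label (small by nature).
    corr_fault = df.drop(columns=[label]).corrwith(df[label]).abs()
    # Keep variables passing either (illustrative) threshold.
    keep = corr_target[corr_target > 0.5].index.union(
        corr_fault[corr_fault > 0.1].index)
    return [c for c in keep if c != label]
```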

2.5. Data Splitting

In this study, our dataset consists of 317,323 records and 9 variables after initial filtering and feature selection. During the time series analysis of the variables, we identified a problem with sensor reconfiguration in April 2017, which changed the measurement range in some cases. Therefore, we only considered data from after this date because including earlier data could have a negative impact on model training. After removing the affected data, we retained 176,707 records and 9 variables, which accounts for 55.69% of the original data.
For data splitting, a period of approximately one year of normal operation, between April 2017 and April 2018, was selected for training, which allows capturing seasonal patterns and short and long term variations. Subsequently, the testing phase was performed using the data labeled with failures.
It is important to note that no failure data were included in the training, as the model focuses on reconstructing the input depending on the relationships with other variables. The objective is that the model attempts to reconstruct the expected behavior, with a high discrepancy potentially indicating an anomaly.

2.6. Data Scaling

To improve the performance of a DL model, it is essential to standardize or normalize the data before feeding it into the model. These preprocessing techniques help the model learn more efficiently and effectively.
Standardization transforms each input variable x into a distribution with zero mean and unit standard deviation, as shown in Equation (1), where $\mu$ is the mean of the original data, $\sigma$ is its standard deviation, and $\tilde{x}$ is the standardized value:
$$\tilde{x} = \frac{x - \mu}{\sigma}$$
This technique ensures that all features have a comparable scale, which can significantly improve the convergence and performance of the DL model. In this study, the StandardScaler function from the Scikit-learn library was used to perform data standardization.
After the model’s predictions were obtained, it was necessary to reverse the standardization process to interpret the results in their original scale. This step, known as back-standardization or rescaling, is crucial for understanding the model’s outputs in a practical context. This rescaling was implemented using the inverse transform function of the StandardScaler from the Scikit-learn library.
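A minimal sketch of this scaling workflow with scikit-learn's StandardScaler, as named above; the array shapes and synthetic data are placeholders.

```python
# Standardization (Eq. (1)) and back-standardization with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.random.rand(1000, 9)   # placeholder for the 9 selected SCADA variables
test = np.random.rand(200, 9)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # fit mu and sigma on training data only
test_scaled = scaler.transform(test)         # reuse the training statistics

# After prediction, map outputs back to physical units (inverse of Eq. (1)).
reconstruction = scaler.inverse_transform(test_scaled)
```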

2.7. Long Short-Term Memory (LSTM) Network

LSTM networks can improve the fusion of temporal features from the state of different parts. This study combines our model with LSTM to extract temporal features from WT data. The LSTM contains an input gate, an output gate, and a forget gate. Its structure is shown in Figure 4.
The information to be forgotten is controlled by the forget gate $f_t$, defined as:
$$f_t = \sigma(W_f x_t + W_h h_{t-1} + b_f)$$
where $\sigma$ is the sigmoid function, and $W_f$ and $b_f$ represent the weight matrix and the bias of the forget gate, respectively. In the next step, the input gate $i_t$ receives $h_{t-1}$ and $x_t$ to determine the new data that should be stored in the cell state. At the same time, a vector of candidate values $\tilde{C}_t$ is created by a hyperbolic tangent layer (tanh), which returns values between −1 and 1. The previous cell state $C_{t-1}$ is then updated to the new cell state $C_t$, as described in Equations (3)–(5) [40].
$$i_t = \sigma(W_i x_t + W_h h_{t-1} + b_i)$$
$$\tilde{C}_t = \tanh(W_c x_t + W_h h_{t-1} + b_c)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Finally, the output gate $o_t$ determines the hidden state $h_t$ as follows:
$$o_t = \sigma(W_o x_t + W_h h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
where $W_o$ and $b_o$ are the weight matrix and the output gate bias, respectively. Output $h_t$ is obtained according to the updated cell state $C_t$.
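For clarity, the gating equations above can be transcribed almost literally into PyTorch. The following single-step function is a didactic sketch, with the weight dictionaries W and b as assumed inputs; in practice, torch.nn.LSTM implements the same gating (with separate input and hidden weight matrices) far more efficiently.

```python
# One LSTM time step following Equations (2)-(7); illustrative only.
import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Each gate sees the concatenation of the input and previous hidden state.
    z = torch.cat([x_t, h_prev], dim=-1)
    f_t = torch.sigmoid(z @ W["f"] + b["f"])     # forget gate, Eq. (2)
    i_t = torch.sigmoid(z @ W["i"] + b["i"])     # input gate, Eq. (3)
    c_tilde = torch.tanh(z @ W["c"] + b["c"])    # candidate values, Eq. (4)
    c_t = f_t * c_prev + i_t * c_tilde           # cell state update, Eq. (5)
    o_t = torch.sigmoid(z @ W["o"] + b["o"])     # output gate, Eq. (6)
    h_t = o_t * torch.tanh(c_t)                  # hidden state, Eq. (7)
    return h_t, c_t
```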

2.8. Autoencoder (AE)

The AE is an unsupervised deep neural network that reconstructs the input data with minimal error based on the encoded data. The input data are encoded in this network by mapping it to a low-dimensional space [41]. It then attempts to minimize the loss between the input and decoded data. As illustrated in Figure 5, the AE generally consists of two parts: an encoder and a decoder.
The encoder transforms the input data x into the encoded representation z (i.e., the code layer). According to the depicted architecture, the mathematical formulation of the code layer encoding process is given by Equation (8); thus,
$$z = \varsigma(W_2 h + b_2)$$
where W and b are the weight matrix and the network bias vector, respectively. Next, the decoder attempts to reconstruct the input data from code layer z with the smallest discrepancy between the original input x and the reconstructed output $\hat{x}$, calculated with Equation (9):
$$\hat{x} = \varsigma(W_4 h + b_4)$$
where $\varsigma$ is the activation function.
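A minimal dense autoencoder matching this encoder/decoder split might look as follows in PyTorch; the layer sizes are illustrative, not the paper's configuration.

```python
# Minimal autoencoder sketch mirroring Figure 5 and Eqs. (8)-(9).
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features: int, code_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),   # hidden layer h
            nn.Linear(32, code_dim), nn.ReLU(),     # code layer z, Eq. (8)
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features),              # reconstruction x_hat, Eq. (9)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```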

2.9. Multi-Head Attention (MA)

AM has become a vital component in neural networks for handling long sequential data. By computing attention weights, the network learns to focus on the most significant parts of the input. An important innovation was introduced with MA, proposed by Vaswani et al. [42]. This approach improves attention by using multiple parallel attention layers or “heads” to focus on different input segments. MA greatly improves the modeling of complex dependencies in the data and increases model performance.
Unlike conventional attention models, which compute attention scores between a query vector and key-value pairs from the input, self-attention generates the query, key, and value vectors through transformations of the input itself. This allows the model to effectively extract important features and relationships within the data through self-reference. Specifically, self-attention compares different items within a single input sequence against each other. It computes interaction scores between a query matrix Q, a key matrix K, and value matrices V from the input data using Equation (10).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key matrix. MA extends this by employing multiple independent attention heads, with each head learning distinct patterns. Multiple heads essentially allow for parallelization within the attention layer, providing a richer representation of the input.
As depicted in Figure 6, the input sequence is linearly projected into Q, K, and V by learned weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$. Then, Q and K are multiplied and scaled to obtain the attention scores. The attention weights are multiplied by V to obtain the output values of each head $Att_h$. These per-head outputs are concatenated and linearly projected using a learned matrix $W^O$ to obtain the final multi-head attention output. This process is described by Equations (11) and (12).
$$Att_h = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$
$$\mathrm{Output}(Q, K, V) = \mathrm{Concat}(Att_1, Att_2, \ldots, Att_n)\,W^O$$
Masking is an optional component that can be applied to the attention scores before the softmax to prevent certain positions from attending to others. It was not used here, since masking might exclude patterns helpful in detecting small changes across the sequence, and the full sequence context was preferred in order to capture all global dependencies.
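As a reference implementation, Equation (10) and the multi-head case can be expressed in PyTorch as follows; the embedding size, head count, and tensor shapes are illustrative, not the paper's settings.

```python
# Scaled dot-product attention (Eq. (10)) and multi-head self-attention.
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    d_k = K.size(-1)
    scores = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return scores @ V

# Multi-head self-attention: the same tensor supplies query, key, and value.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(32, 20, 64)        # (batch, sequence, features)
out, weights = mha(x, x, x)        # no mask: full sequence context
```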

2.10. Positional Encoding (PE)

PE is an essential technique in Transformers that incorporates information about the position of input sequences. Unlike RNNs, which process sequence data in an ordered fashion, Transformers, with their AM, process sequences in a parallel fashion, which means that the position of elements within the sequence is not implicit in the model [43]. To address this limitation, PE is introduced, which adds positional information to the input vectors to the attention layer. This encoding is performed using trigonometric functions (sine and cosine) to generate a unique vector for each position, which makes it easier for the model to capture the positional relationship of the elements in the sequence.
Equations (13) and (14) are used in this process.
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ is the position in the sequence, $i$ is the dimension index, and $d_{model}$ is the size of the feature vector.
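A standard implementation of Equations (13) and (14) is sketched below; it assumes an even d_model.

```python
# Sinusoidal positional encoding from Eqs. (13)-(14): one unique vector
# per position, to be added to the attention input (d_model assumed even).
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1).float()       # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()                # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)          # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                         # Eq. (13)
    pe[:, 1::2] = torch.cos(angle)                         # Eq. (14)
    return pe
```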

2.11. Hybrid Model LSTM-MA-AE

This model efficiently captures the spatial and temporal characteristics of SCADA data, improving fault prediction and diagnosis capabilities. The LSTM-MA-AE model developed in this study processes the input signals to reconstruct the time series and predict faults before they occur.
The LSTM-MA-AE model has two main parts: the encoder and the decoder. The structure of the model is presented in Figure 7. In the encoder, a combination of LSTM layers, PE, and MA processes the input sequences. The decoder uses a similar structure to the encoder to reconstruct the time series from the encoded representations. This means that two LSTM layers with 64 hidden units each are employed in the encoder, which are responsible for capturing complex short-term temporal dependencies in the data sequence, embedding patterns relevant to the reconstruction task. These layers are followed by normalization and a dropout of 0.1 to prevent overfitting. At the output of the LSTMs, a PE is integrated for the 8-head MA mechanism to understand the positions of the time series data. These attention features are added to a residual connection from the previous output so that no data information is lost. Each head of the MA allows the model to focus on different aspects of the input sequence simultaneously, enhancing the capability of long-term feature embedding and temporality.
Furthermore, in the decoder, we initiate the extraction of the encoder-embedded features using an 8-head MA mechanism supplemented with two additional LSTM layers of 64 hidden units each. Initially, zero tensors are used as the query in the MA mechanism, whereas the encoder embeddings are employed as the key and value. These are then passed to the LSTM layers, which process the historical values of the dataset along with the attention output, allowing the relevant sequential structure to be extracted and the first value to be predicted through a fully connected layer. This output is subsequently fed back as the query of the MA mechanism for subsequent predictions. In this way, the predictions generated by the decoder are iteratively employed as queries, achieving accurate reconstruction throughout the entire sequence. Finally, we apply normalization and dropout to optimize training and prevent model overfitting.
The input to the model is a time series of length T with d features:
$$X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times d}$$
The temporal sequence is then encoded through a 2-layer LSTM with dropout in between and a LayerNorm normalization layer at the end of the second layer, to obtain the encoded representation, as expressed in Equations (16) and (17):
$$H_{\mathrm{out}} = \mathrm{LSTM}_{\mathrm{enc}}(X)$$
$$H_{\mathrm{enc}} = \mathrm{LayerNorm}(\mathrm{LSTM}_{\mathrm{enc}}(H_{\mathrm{out}}))$$
PE is added to the output series $H_{\mathrm{enc}}$ so that the MA can understand the positions within the time series, as shown in Equation (18).
$$H_{\mathrm{PE}} = H_{\mathrm{enc}} + PE$$
Once PE is added, the MA is calculated to capture complex relationships in the time sequence, as shown in Equations (19)–(22).
$$Q = H_{\mathrm{PE}} W^Q$$
$$K = H_{\mathrm{PE}} W^K$$
$$V = H_{\mathrm{PE}} W^V$$
$$att_{\mathrm{enc}} = \mathrm{MultiHeadAttention}(Q, K, V)$$
where $W^Q$, $W^K$, and $W^V$ are learnable weight matrices for the query Q, key K, and value V.
To avoid losing data information, a residual connection is made from the output of $H_{\mathrm{PE}}$ to the output of the MA, using Equation (23).
$$out_{\mathrm{enc}} = H_{\mathrm{PE}} + att_{\mathrm{enc}}$$
The encoded representation $out_{\mathrm{enc}}$ is first decoded by means of an MA, which uses
$$att_{\mathrm{decod}} = \mathrm{MultiHeadAttention}(H_{t-1}, out_{\mathrm{enc}}, out_{\mathrm{enc}})$$
where $H_{t-1}$ is the previous hidden state of the LSTM unit at the end of the decoder, which is initially a tensor of zeros. Next, the MA output of the decoder is concatenated with the historical data $R_t$ to be reconstructed:
$$combined_{\mathrm{att}} = \mathrm{Concat}(att_{\mathrm{decod}}, R_t)$$
A fully connected layer FL is then used to learn a richer representation of the data, and its output is fed into two LSTM layers, followed by LayerNorm:
$$H_t,\ out_{\mathrm{decod}} = \mathrm{LayerNorm}(\mathrm{LSTM}(\mathrm{LSTM}(\mathrm{FL}(combined_{\mathrm{att}}))))$$
where $H_t$ is fed back as the query Q to the MA to obtain the reconstructed predicted data. Finally, the decoder output is projected to the original dimension with an FL to reconstruct the time series $\hat{Z}$, using Equation (27).
$$\hat{Z} = \mathrm{FL}(out_{\mathrm{decod}})$$
The pseudocode for LSTM-MA-AE is presented in Algorithm 1.
Algorithm 1 Training procedure for LSTM-MA-AE with early stopping
1:  Input: SCADA data $X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times v}$, where T is the time series length and v is the number of variables.
2:  Initialize:
3:    Load configuration parameters: hidden layers $h_l$, batch size $b_s$, epochs e, patience p
4:    Initialize model, loss criterion L, Adam optimizer, and learning rate $L_r$
5:    Initialize variables: best_val_loss = ∞, patience_counter = 0, $L_{val}$ = 0
6:    Load $X_{training}$, $X_{test}$, $X_{val} \in \mathbb{R}^{(e, q, v)}$, where e is the number of sequences and q is the sequence length.
7:  Create DataLoader iterators for the training and testing data with $b_s$
8:  for each epoch e do
9:    Training phase:
10:     Set model to training mode
11:     Initialize training loss to zero
12:     for each batch in $X_{training}$ do
13:       Zero the parameter gradients
14:       output = LSTM-MA-AE(batch)
15:       Compute loss L = criterion(output, batch)
16:       Concatenate params_without_bias = Concat_params(LSTM-MA-AE)
17:       Compute regularized loss $L_{reg}$ = L + Reg · L1norm(params)
18:       Perform backpropagation
19:       Update model parameters: $L_{reg}$.backward()
20:       Accumulate training loss $L_{train}$ += $L_{reg}$
21:     end for
22:     Compute average training loss $L_{train\_avg}$ = $L_{train}$ / number of batches
23:   Validation phase:
24:     Set model to evaluation mode
25:     for each batch in $X_{val}$ do
26:       Disable gradient computation
27:       $output_{val}$ = LSTM-MA-AE(batch)
28:       Compute loss $L_{val}$ = criterion($output_{val}$, batch)
29:       Accumulate validation loss
30:     end for
31:     Compute average validation loss $L_{val\_avg}$ = $L_{val}$ / number of batches
32:   Early stopping:
33:     if $L_{val\_avg}$ < best_val_loss then
34:       Update best_val_loss = $L_{val\_avg}$
35:       Save model parameters and reset patience_counter = 0
36:     else
37:       Increment patience_counter += 1
38:     end if
39:     if patience_counter > p then
40:       Break training loop early
41:     end if
42: end for
43: Test the model using $X_{test}$
44:   Reconstruct the data: $output_{test}$ = LSTM-MA-AE($X_{test}$)
45:   Fault prediction system:
46:     Compute the loss of each variable: $L_{test}$ = criterion($output_{test}$, $X_{test}$)
47:     Smooth $L_{test}$
48:     Apply the EWMA threshold and sliding-window anomaly detection
49:     Issue fault warnings
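For illustration, the sketch below condenses the encoder path of the model (Eqs. (16)–(23)) into PyTorch. The 64 hidden units, 8 heads, and 0.1 dropout follow Table 3 and the description above, but the exact layer wiring and argument choices are assumptions rather than the authors' implementation.

```python
# Non-authoritative skeleton of the LSTM-MA-AE encoder: stacked LSTMs,
# LayerNorm, positional encoding, multi-head self-attention, and a residual.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, heads: int = 8):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop = nn.Dropout(0.1)
        self.norm = nn.LayerNorm(hidden)
        self.mha = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, x, pe):                  # pe: precomputed positional encoding
        h, _ = self.lstm1(x)                   # Eq. (16)
        h, _ = self.lstm2(self.drop(h))
        h = self.norm(h)                       # Eq. (17)
        h_pe = h + pe                          # Eq. (18)
        att, _ = self.mha(h_pe, h_pe, h_pe)    # Eqs. (19)-(22)
        return h_pe + att                      # residual connection, Eq. (23)
```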

2.12. Exponential Weighted Moving Average (EWMA)

Within the failure prediction mechanism, an EWMA is used, a technique that applies greater weight to more recent data with higher errors. It is defined by Equation (28) [44].
$$\mathrm{EWMA}_t = \alpha a_t + (1 - \alpha)\,\mathrm{EWMA}_{t-1}$$
where $\alpha$ is the smoothing factor, $a_t$ is the value of the reconstruction loss at time t, and $\mathrm{EWMA}_{t-1}$ is the EWMA value at time t−1.
The threshold value at the upper bound of the EWMA is calculated as shown in Equation (29).
$$\mathrm{threshold}_{\mathrm{EWMA}} = \overline{\mathrm{EWMA}} + \lambda\,\theta_{\mathrm{EWMA}}$$
where $\overline{\mathrm{EWMA}}$ is the mean of the EWMA values, $\theta_{\mathrm{EWMA}}$ is their standard deviation, and $\lambda$ is a constant that defines the position of the threshold. When training with normal data, $\lambda$ can be adjusted to set an appropriate threshold.
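A sketch of Equations (28) and (29) in NumPy; the smoothing factor α, the constant λ, and the placeholder losses are illustrative, since the text leaves these values tunable.

```python
# Recursive EWMA (Eq. (28)) and the adaptive threshold (Eq. (29)).
import numpy as np

def ewma(a: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    out = np.empty_like(a, dtype=float)
    out[0] = a[0]
    for t in range(1, len(a)):
        out[t] = alpha * a[t] + (1 - alpha) * out[t - 1]
    return out

losses = np.abs(np.random.randn(1000))       # placeholder reconstruction losses
smoothed = ewma(losses)
# Mean of the EWMA values plus lambda times their standard deviation.
threshold_ewma = smoothed.mean() + 3.0 * smoothed.std()   # lambda = 3 (assumed)
```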

2.13. Fault Prediction System

Initially, the model reconstructs all input variables to capture deviations in each one. As depicted in Figure 8, the fault prediction system calculates the reconstruction score for each data point for each variable. A smoothing vector is then used to distinguish between anomalies possibly caused by degradation or situations where the WT is operating at its power limits. Next, this new score is passed through a threshold with EWMA. If the score exceeds this threshold, a label vector is created, where 1 indicates that the threshold has been exceeded and 0 indicates that it has not.
This label vector is processed through a sliding window anomaly detector to improve anomaly labeling. This detector takes a binary vector as input; if it finds more than $N_{\mathrm{anomalies}}$ anomalies within a time window of size $N_{\mathrm{window}}$, it labels the entire span between the first and last anomaly within the window as anomalous (1). This method assumes that nearby anomalies within a maximum window size of $N_{\mathrm{window}}$ can represent the same fault.
Once each variable's final label vector is obtained, all the vectors are added together. A fixed threshold is then applied to the resulting sum to determine the number of simultaneously anomalous variables that indicates a failure warning.
For each variable $v_i$ in the dataset, we perform the following:
  • Calculation of the reconstruction score for each variable $v_i$:
    $$L_v(t) = |y_i - \hat{y}_i|$$
    where $L_v(t)$ is the reconstruction score over time t, $y_i$ represents the actual values, and $\hat{y}_i$ the predicted values.
  • Calculation of the smoothed reconstruction vector $L_s(t)$:
    Let $P(t)$ be the nominal power at instant t and $P_{90}$ the 90th percentile value of the maximum power. We define the smoothing vector $V_s(t)$ as follows:
    $$V_s(t) = \begin{cases} 0.85 & \text{if } P(t) > P_{90} \\ 1 & \text{otherwise} \end{cases}$$
    $$L_s(t) = V_s(t) \cdot L_v(t)$$
  • Calculation of the EWMA:
    $$S_i(t) = \alpha \cdot L_s(t) + (1 - \alpha) \cdot S_i(t-1)$$
    where $S_i(t)$ is the EWMA-smoothed score.
  • Comparison with the threshold:
    $$R_i(t) = \begin{cases} 1 & \text{if } L_s(t) > \mathrm{threshold}_{\mathrm{EWMA}} \\ 0 & \text{otherwise} \end{cases}$$
    where $R_i(t)$ is the vector of labels indicating an anomaly.
  • Improvement of detection with the sliding anomaly window:
    For a maximum window size $N_{\mathrm{window}}$:
    $$R_i(t) = \begin{cases} 1 & \text{for } t \in [t_1, t_n] \text{ if } \sum_{t_1}^{t_n} R_i(t) \geq N_{\mathrm{anomalies}} \text{ within } N_{\mathrm{window}} \\ 0 & \text{otherwise} \end{cases}$$
    where $t_1$ and $t_n$ are the temporal positions of the first and last anomalies within the window of size $N_{\mathrm{window}}$. If two anomalies are found within this window, all data points between $t_1$ and $t_n$ are labeled as anomalous.
  • Addition and fixed threshold:
    $$T_{\mathrm{add}}(t) = \sum_{i=1}^{v} R_i(t)$$
    where v is the number of variables.
    $$\text{Fault Warning} = \begin{cases} \text{Warning} & \text{if } T_{\mathrm{add}}(t) > \mathrm{threshold}_{\mathrm{fault}} \\ \text{normal} & \text{otherwise} \end{cases}$$
    where $\mathrm{threshold}_{\mathrm{fault}}$ represents the final fixed failure threshold. A compact code sketch of these steps is given below.
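The sketch below condenses the per-variable steps (Eqs. (30)–(35)) and the final aggregation (Eqs. (36)–(37)) into NumPy/pandas. All numeric settings (α, λ, window size, fixed threshold) and the synthetic data are placeholders, not the paper's tuned values.

```python
# Hedged end-to-end sketch of the fault prediction scoring pipeline.
import numpy as np
import pandas as pd

def variable_labels(y, y_hat, power, p90, alpha=0.1, lam=3.0, n_window=50):
    l_v = np.abs(y - y_hat)                                   # reconstruction score, Eq. (30)
    v_s = np.where(power > p90, 0.85, 1.0)                    # smoothing vector, Eq. (31)
    l_s = v_s * l_v                                           # smoothed score, Eq. (32)
    s = pd.Series(l_s).ewm(alpha=alpha, adjust=False).mean()  # EWMA, Eq. (33)
    thr = s.mean() + lam * s.std()                            # adaptive threshold, Eq. (29)
    r = (l_s > thr).astype(int)                               # label vector, Eq. (34)
    idx = np.flatnonzero(r)
    for a, b in zip(idx[:-1], idx[1:]):                       # sliding window, Eq. (35):
        if b - a <= n_window:                                 # nearby anomalies are
            r[a:b + 1] = 1                                    # treated as one fault span
    return r

# Aggregate over the nine selected variables, then apply the fixed
# threshold (Eqs. (36)-(37)); data and threshold are synthetic placeholders.
rng = np.random.default_rng(0)
power = rng.uniform(0, 1500, 5000)
t_add = sum(
    variable_labels(rng.normal(size=5000), rng.normal(size=5000),
                    power, np.percentile(power, 90))
    for _ in range(9)
)
fault_warning = t_add > 5                                     # threshold_fault = 5 (assumed)
```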

2.14. Model Hyperparameters

In this study, the early stopping technique is adopted during the training process to avoid overfitting the model. Overfitting can prevent the model from adequately understanding the temporal and spatial relationships between variables, as it fits too closely to the true values and loses generalization ability. Early stopping ensures that the model fits with small errors in normal cases. In contrast, in anomalous situations, the model cannot fit the input features, resulting in a significant discrepancy with the normal data. This improves the model’s ability to detect anomalies effectively.
The hyperparameters with which the model achieved the best performance are shown in Table 3.

2.15. Model Evaluation Metrics

To evaluate the proposed model, we used different error metrics, common in prediction analysis, such as the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination ( R 2 ). Equations (38)–(40) define these metrics.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where y i represents the actual values, y ^ i the predicted values, n is the number of samples, and y ¯ is the mean value of y i .
In addition, we use the Precision ( P r ), Recall ( R e ), and F1-Score ( F 1 ) metrics, which are commonly employed in anomaly detection [45]. These metrics are defined by Equations (41)–(43).
$$Pr = \frac{TP}{TP + FP}$$
$$Re = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$
where True Positive ( T P ) represents alarms that detected a fault correctly within a specific time window, False Positive ( F P ) corresponds to alarms that do not result in an actual fault, and False Negative ( F N ) indicates faults that were not correctly detected.
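These metrics map directly onto scikit-learn; the following sketch uses placeholder arrays in place of the real reconstructions and anomaly labels.

```python
# Computing Eqs. (38)-(43) with scikit-learn; inputs are placeholders.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             precision_score, recall_score, f1_score)

y = np.array([1.0, 2.0, 3.0, 4.0])          # actual values (assumed)
y_hat = np.array([1.1, 1.9, 3.2, 3.8])      # reconstructed values (assumed)
labels = np.array([0, 0, 1, 1])             # true anomaly labels (assumed)
preds = np.array([0, 1, 1, 1])              # predicted anomaly labels (assumed)

rmse = np.sqrt(mean_squared_error(y, y_hat))   # Eq. (38)
mae = mean_absolute_error(y, y_hat)            # Eq. (39)
r2 = r2_score(y, y_hat)                        # Eq. (40)
precision = precision_score(labels, preds)     # Eq. (41)
recall = recall_score(labels, preds)           # Eq. (42)
f1 = f1_score(labels, preds)                   # Eq. (43)
```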
To conclude this section, the experiments were conducted in Python with the PyTorch library in a Google Colab environment. The environment had an Intel(R) Xeon(R) CPU @ 2.20 GHz, 51.00 GB of RAM, a Tesla K80 accelerator, and 12 GB of GDDR5 VRAM.

3. Results

This section presents the results in two subsections. Section 3.1 analyzes the model's ability to generalize the temporal data for each feature. Section 3.2 describes the performance of the model in predicting faults and the lead time before their occurrence.

3.1. Model Evaluation

The results shown in Table 4 were achieved using the hyperparameters outlined in Table 3. We evaluated three model configurations:
  • The LSTM-MA-AE approach.
  • The same approach without the AM (LSTM-AE).
  • A version without PE and residual connection.
This enables us to confirm that the implemented improvements effectively enhance the performance of the proposed model.
The data presented in Table 4 indicate that incorporating MA into the LSTM-MA-AE approach is highly effective for capturing the complexities within temporal data, showing that the AM enhances model performance. The MA assists the model in comprehending a wide range of sequences, thereby retaining crucial information necessary for effectively reconstructing the time series. In addition, the LSTM-MA-AE approach without positional encoding shows inferior performance compared to the approach that includes PE. This is because the addition of PE makes it easier for the model to understand the positional characteristics of each data point in the time series, improving its ability to interpret and predict accurately.
The R 2 coefficient shows the difference between the series; a value of R 2 close to 1 reflects a high similarity between the data sets. The MAE measures the loss between two data points, while the RMSE penalizes outliers. As can be seen, the best performance was obtained by the LSTM-MA-AE model.
Table 4. Comparison of variable reconstruction results. The best results are shown in bold.

Feature | LSTM-MA-AE (RMSE / MAE / R²) | LSTM-AE (RMSE / MAE / R²) | LSTM-MA-AE without PE (RMSE / MAE / R²)
3 | **0.44 / 0.33 / 0.91** | 1.50 / 1.22 / −0.09 | 13.12 / 13.09 / −80.65
2 | **0.52 / 0.37 / 0.88** | 1.71 / 1.35 / −0.23 | 20.35 / 20.32 / −143.71
6 | **0.80 / 0.67 / 0.84** | 2.64 / 2.24 / −0.69 | 36.92 / 36.18 / −19.17
8 | **1.55 / 1.17 / 0.82** | 3.34 / 2.68 / 0.18 | 58.62 / 56.84 / −13.99
9 | **1.75 / 1.38 / 0.94** | 6.99 / 5.86 / 0.11 | 70.15 / 68.15 / −15.24
7 | **3.63 / 2.60 / 0.93** | 16.28 / 13.15 / −0.35 | 358.37 / 358.36 / −9026.75
4 | **4.21 / 3.17 / 0.93** | 22.23 / 16.44 / −0.88 | 1169.89 / 1052.11 / −4.21
5 | **125.20 / 96.16 / 0.92** | 326.12 / 260.14 / 0.51 | 1094.50 / 986.58 / −4.32
1 | **137.11 / 109.85 / 0.91** | 453.77 / 361.64 / −0.12 | 1169.89 / 1052.11 / −4.21
During training, we set the maximum number of epochs to 75. However, using early stopping, the model reached its best performance at epoch 25, when the validation loss stopped improving significantly. As shown in Figure 9a, the validation loss decreases rapidly during the first 25 epochs and then stabilizes. Early stopping acted from this point to avoid overfitting, which could have occurred if training had continued up to the initially configured epochs.
This analysis is crucial to determine the optimal point where the model generalizes best without overfitting, thus ensuring robust performance on unseen data. Figure 9a depicts the behavior of losses during the training epochs, highlighting the importance of early stopping in model optimization.
Furthermore, as seen in Figure 9a, a batch size of 256 was selected due to its comparable performance in terms of loss with smaller batch sizes, such as 128, but with a lower computational time, as shown in Figure 10. In Figure 9b, the sequence length is analyzed; although a sequence length of 10 performs slightly better than 20, the latter was chosen, as it provides more history to the model, allowing more general learning of the time series. Figure 9c shows that regularization factors of $1 \times 10^{-4}$ and $1 \times 10^{-6}$ provide similar results, so a value of $1 \times 10^{-4}$ was chosen. Figure 9d shows that increasing the number of heads improves the capture of long-term patterns in the time series. Figure 9e shows that a single layer performs poorly, while a larger number of layers, such as two or three, reduces the loss but increases the computational cost. Figure 9f shows that a higher dropout affects the model's generalization capability, with a value of 0.1 being the most appropriate in this case. Figure 9g reveals that a hidden layer size of 128 improves the fit to the real values, although it increases the computational cost and time, as shown in Figure 10. Finally, Figure 9h shows that the initial value of the learning rate, together with the scheduler, significantly influences the model fit. We aim to generalize the time series with a good fit, without overfitting, to identify anomalous patterns over time. Although a higher learning rate would be adequate for forecasting tasks, in this case, we need one that allows us to generalize the time series and obtain adequate results, as presented in Table 4.
Table 5 indicates that adding an attention mechanism increases the time required for model training. This is due to the increased computations and the need to learn more complex features. The LSTM-MA-AE approach has a time per epoch of 13.28 s, while the LSTM-AE approach, without the AM, requires only 7.48 s per epoch. Furthermore, the use of PE in the LSTM-MA-AE approach slightly reduces the training time compared to the version without PE, with times of 13.28 s and 13.40 s per epoch, respectively. This suggests that PE and residual connection help make training slightly faster.
These results underscore the importance of balancing model accuracy with training time. Although the attention mechanism may increase training time, it can significantly improve model performance by capturing complex features and retaining crucial information in sequential data.
Additionally, the impact of the amount of training data on overall model performance was evaluated, as shown in Table 6. The model for the variable ambient_temperature_avg accurately reconstructs the data even with a reduced dataset, yielding promising results.
Figure 11a displays the reconstruction of the IGBT maximum temperature variable. This figure shows an increase in temperature from early May to late June 2018. This trend coincides with a converter failure documented in the maintenance record. This fault may be due to an accelerated degradation of the converter since, during that period, according to the power output log, the WT operated at maximum capacity most of the time.
Figure 11 reveals that the true time series and the model reconstruction reflect this increase in temperature. However, when examining the reconstruction loss (errors) shown in Figure 11b, several anomalies exceeded the threshold established prior to the failure event. Although this reconstruction loss is not significant in absolute terms, it is noteworthy when compared to normal values.

3.2. Fault Prediction

Detecting faults in WT components may depend on anomalies in one or more variables related to the target component. In addition, the presence of indications of faults in SCADA variables may not be evident, especially in electronic components. Therefore, a fault prediction system was developed to more accurately detect and anticipate these anomalies.
After implementing the failure prediction system, we obtained Figure 12, demonstrating that the prediction system can provide alerts about potential failures up to two months in advance. Figure 12 also displays two alert events that preceded failures recorded in the maintenance file: one on 10 June, and another on 24 June 2018. These alerts occurred consecutively, indicating the system’s effectiveness in identifying impending failures.
Figure 13, an enlargement of the window shown in Figure 12, shows that the model with the fault prediction system, supplemented with the sliding anomaly window at the final stage, can warn of the presence of anomalies an average of approximately four months in advance for the failures that occurred on 10 and 24 June 2018, and 2 March 2021. However, it is important to note that the model also signaled anomalies at the end of November 2018, although no failure appears in the maintenance record.
Both a fixed threshold and an adaptive threshold with EWMA were used, the latter to reduce the noise generated by the fixed threshold. A limitation of a fixed threshold is that it can only capture anomalies when there is a trend in the reconstruction loss. Therefore, in this study, since anomalies must be detected over the WT's entire operating range, an adaptive threshold with EWMA is used.
Figure 14 shows the performance of the adaptive threshold with EWMA applied to each reconstruction error of each variable, which improved fault detection and reduced false positives.
Figure 15 shows the final result of the failure prediction system. The model correctly detected the converter failures approximately 2.5 and 4.2 months before they occurred. This demonstrates the proposed model’s capability for early fault detection.
In addition, as seen in Figure 15, a warning at the end of 2019 corresponds to an FP. According to the maintenance record, this event may be due to a failure in the yaw system, whose deviations the model could have captured.
The study was also conducted for WT3 of the VWF to determine the effectiveness of the proposed model. Figure 16 validates this failure prediction method because the model detected anomalies approximately 2.3, 3.7, and 4.3 months in advance of the failures, which occurred on 25 May 2019, 23 December 2019, and 31 March 2021, respectively.
The metrics described in Section 2.15 were used to evaluate the model’s performance. Table 7 shows that the model is able to anticipate failures an average of 3.39 months in advance for both WTs. The advance to fault (Adv.) is expressed in months.
Finally, we have developed a comparison with similar recent studies in WT converter failure prediction. Although direct comparisons are limited due to variability in the datasets used in each study, this comparison provides an overview of our model’s relative performance in relation to previous work. In Table 8, an impact matrix summarizing the contributions of each study is presented.

4. Conclusions

Since electronic converters are critical components, the ability to predict failures months in advance is crucial to reducing O&M costs. Therefore, this study addresses the prediction of converter failures in WTs located in complex terrain.
We developed a hybrid model using SCADA data that combines LSTM, Multi-head Attention, and Autoencoder, using their strengths to learn temporal and spatial characteristics between variables. The LSTM provides better generalization, while the Multi-head Attention mechanism captures complex patterns inherent in the data. The Autoencoder mechanism allows the model to reconstruct features with spatial and temporal information, facilitating the detection of malfunctions in variables related to the target component.
The LSTM-MA-AE model can predict failures an average of approximately 3.3 months in advance and with an average F1 of 90% in the evaluated WTs, showing a low false positive rate. These medium-term failures could be indicative of converter degradation.
This method can facilitate early fault detection and provides a robust mechanism for predictive maintenance in wind farms, ensuring greater reliability and generation efficiency. Moreover, we are confident that the methodological process we have developed could be successfully replicated in other wind farms. In fact, we believe it could even be applied to other components of the WT, not just the converter, further extending the reach of our research.
As a future work, it is proposed to use the parallelization capabilities of the Transformer model and the strengths of the self-attention mechanism to determine the variables that are intrinsically related to the component failure and to determine the root cause.

Author Contributions

J.T.-C. and J.M.-C. contributed to the conceptualization and methodological design of the proposal; were responsible for data preparation and analysis; J.T.-C. and J.M.-C. prepared the original draft; M.V.-C., E.A. and S.M.-M. contributed to the review and editing, and E.G.-L. was responsible for project supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Junta de Comunidades de Castilla-La Mancha and the E.U. FEDER (SBPLY/23/180225/000226) and by the Spanish Ministry of Economy and Competitiveness and the European Union (PID2021-126082OB-C21).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because there is a confidentiality agreement between the owners and the authors. Requests for access to the datasets should be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge the support received from the National University of Loja through the research project 20-DI-FEIRNNR-2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sun, T.; Yu, G.; Gao, M.; Zhao, L.; Bai, C.; Yang, W. Fault Diagnosis Methods Based on Machine Learning and its Applications for Wind Turbines: A Review. IEEE Access 2021, 9, 147481–147511.
2. Bošnjaković, M.; Katinić, M.; Santa, R.; Marić, D. Wind Turbine Technology Trends. Appl. Sci. 2022, 12, 8653.
3. Kouadri, A.; Hajji, M.; Harkat, M.F.; Abodayeh, K.; Mansouri, M.; Nounou, H.; Nounou, M. Hidden Markov model based principal component analysis for intelligent fault diagnosis of wind energy converter systems. Renew. Energy 2020, 150, 598–606.
4. Mishnaevsky, L. Root Causes and Mechanisms of Failure of Wind Turbine Blades: Overview. Materials 2022, 15, 2959.
5. Xu, Y.; Nascimento, N.M.M.; de Sousa, P.H.F.; Nogueira, F.G.; Torrico, B.C.; Han, T.; Jia, C.; Rebouças Filho, P.P. Multi-sensor edge computing architecture for identification of failures short-circuits in wind turbine generators. Appl. Soft Comput. 2021, 101, 107053.
6. López-Uruñuela, F.J.; Fernández-Díaz, B.; Pagano, F.; López-Ortega, A.; Pinedo, B.; Bayón, R.; Aguirrebeitia, J. Broad review of “White Etching Crack” failure in wind turbine gearbox bearings: Main factors and experimental investigations. Int. J. Fatigue 2021, 145, 106091.
7. Gao, Z.; Liu, X. An Overview on Fault Diagnosis, Prognosis and Resilient Control for Wind Turbine Systems. Processes 2021, 9, 300.
8. Liu, Z.; Zhang, L. A review of failure modes, condition monitoring and fault diagnosis methods for large-scale wind turbine bearings. Measurement 2020, 149, 107002.
9. Hossain, M.L.; Abu-Siada, A.; Muyeen, S.M. Methods for Advanced Wind Turbine Condition Monitoring and Early Diagnosis: A Literature Review. Energies 2018, 11, 1309.
10. Maldonado-Correa, J.; Valdiviezo-Condolo, M.; Viñan-Ludeña, M.S.; Samaniego-Ojeda, C.; Rojas-Moncayo, M. Wind power forecasting for the Villonaco wind farm. Wind Eng. 2020, 45, 1145–1159.
11. Liang, J.; Zhang, K.; Al-Durra, A.; Muyeen, S.; Zhou, D. A state-of-the-art review on wind power converter fault diagnosis. Energy Rep. 2022, 8, 5341–5369.
12. Yang, Z.; Chai, Y. A survey of fault diagnosis for onshore grid-connected converter in wind energy conversion systems. Renew. Sustain. Energy Rev. 2016, 66, 345–359.
13. Qiao, W.; Lu, D. A Survey on Wind Turbine Condition Monitoring and Fault Diagnosis—Part I: Components and Subsystems. IEEE Trans. Ind. Electron. 2015, 62, 6536–6545.
14. Fischer, K.; Pelka, K.; Puls, S.; Poech, M.H.; Mertens, A.; Bartschat, A.; Tegtmeier, B.; Broer, C.; Wenske, J. Exploring the Causes of Power-Converter Failure in Wind Turbines based on Comprehensive Field-Data and Damage Analysis. Energies 2019, 12, 593.
15. López, G.; Arboleya, P. Short-term wind speed forecasting over complex terrain using linear regression models and multivariable LSTM and NARX networks in the Andes Mountains, Ecuador. Renew. Energy 2022, 183, 351–368.
16. Helbing, G.; Ritter, M. Deep Learning for fault detection in wind turbines. Renew. Sustain. Energy Rev. 2018, 98, 189–198.
17. Waqas Khan, P.; Byun, Y.C. Multi-Fault Detection and Classification of Wind Turbines Using Stacking Classifier. Sensors 2022, 22, 6955.
18. Xiang, L.; Wang, P.; Yang, X.; Hu, A.; Su, H. Fault detection of wind turbine based on SCADA data analysis using CNN and LSTM with attention mechanism. Measurement 2021, 175, 109094.
19. Liu, J.; Wang, X.; Wu, S.; Wan, L.; Xie, F. Wind turbine fault detection based on deep residual networks. Expert Syst. Appl. 2023, 213, 119102.
20. Zhang, K.; Tang, B.; Deng, L.; Yu, X. Fault Detection of Wind Turbines by Subspace Reconstruction-Based Robust Kernel Principal Component Analysis. IEEE Trans. Instrum. Meas. 2021, 70, 1–11.
21. Santolamazza, A.; Dadi, D.; Introna, V. A Data-Mining Approach for Wind Turbine Fault Detection Based on SCADA Data Analysis Using Artificial Neural Networks. Energies 2021, 14, 1845.
22. Shen, Y.; Tang, B.; Li, B.; Tan, Q.; Wu, Y. Remaining useful life prediction of rolling bearing based on multi-head attention embedded Bi-LSTM network. Measurement 2022, 202, 111803.
23. Liu, Z.; Xiao, C.; Zhang, T.; Zhang, X. Research on Fault Detection for Three Types of Wind Turbine Subsystems Using Machine Learning. Energies 2020, 13, 460.
24. Maldonado-Correa, J.; Torres-Cabrera, J.; Martín-Martínez, S.; Artigao, E.; Gómez-Lázaro, E. Wind turbine fault detection based on the transformer model using SCADA data. Eng. Fail. Anal. 2024, 162, 108354.
25. Xiao, C.; Liu, Z.; Zhang, T.; Zhang, X. Deep Learning Method for Fault Detection of Wind Turbine Converter. Appl. Sci. 2021, 11, 1280.
26. Ghazimoghadam, S.; Hosseinzadeh, S. A novel unsupervised deep learning approach for vibration-based damage diagnosis using a multi-head self-attention LSTM autoencoder. Measurement 2024, 229, 114410.
27. Lee, Y.; Park, C.; Kim, N.; Ahn, J.; Jeong, J. LSTM-Autoencoder Based Anomaly Detection Using Vibration Data of Wind Turbines. Sensors 2024, 24, 2833.
28. Wang, A.; Pei, Y.; Zhu, Y.; Qian, Z. Wind turbine fault detection and identification through self-attention-based mechanism embedded with a multivariable query pattern. Renew. Energy 2023, 211, 918–937.
29. Aksan, F.; Janik, P.; Suresh, V.; Leonowicz, Z. Review of the application of deep learning for fault detection in wind turbine. In Proceedings of the 2022 IEEE International Conference on Environment and Electrical Engineering and 2022 IEEE Industrial and Commercial Power Systems Europe (EEEIC/I&CPS Europe), Prague, Czech Republic, 28 June–1 July 2022; pp. 1–6.
30. Zhang, G.; Li, Y.; Jiang, W.; Shu, L. A fault diagnosis method for wind turbines with limited labeled data based on balanced joint adaptive network. Neurocomputing 2022, 481, 133–153.
31. Zhang, J.; Sun, H.; Sun, Z.; Dong, W.; Dong, Y. Fault Diagnosis of Wind Turbine Power Converter Considering Wavelet Transform, Feature Analysis, Judgment and BP Neural Network. IEEE Access 2019, 7, 179799–179809.
32. Xue, Z.Y.; Xiahou, K.S.; Li, M.S.; Ji, T.Y.; Wu, Q.H. Diagnosis of Multiple Open-Circuit Switch Faults Based on Long Short-Term Memory Network for DFIG-Based Wind Turbine Systems. IEEE J. Emerg. Sel. Top. Power Electron. 2020, 8, 2600–2610.
33. Zhang, C.; Hu, D.; Yang, T. Research of artificial intelligence operations for wind turbines considering anomaly detection, root cause analysis, and incremental training. Reliab. Eng. Syst. Saf. 2024, 241, 109634.
34. Rama, V.S.B.; Hur, S.H.; Yang, J.M. Short-Term Fault Prediction of Wind Turbines Based on Integrated RNN-LSTM. IEEE Access 2024, 12, 22465–22478.
35. Oliveira-Filho, A.; Zemouri, R.; Cambron, P.; Tahan, A. Early Detection and Diagnosis of Wind Turbine Abnormal Conditions Using an Interpretable Supervised Variational Autoencoder Model. Energies 2023, 16, 4544.
36. Liu, J.; Yang, G.; Li, X.; Wang, Q.; He, Y.; Yang, X. Wind turbine anomaly detection based on SCADA: A deep autoencoder enhanced by fault instances. ISA Trans. 2023, 139, 586–605.
37. Zhong, Y.; Lakshminarayan, S.; Ran, L.; Mawby, P.; Jia, C.; Ng, C. Detecting Power Module Thermal Resistance Change in Wind Turbine Converters with an Attention-based LSTM-Autoencoder Architecture. In Proceedings of the 2023 IEEE Energy Conversion Congress and Exposition (ECCE), Nashville, TN, USA, 29 October–2 November 2023; pp. 314–320.
38. Ayala, M.; Maldonado, J.; Paccha, E.; Riba, C. Wind Power Resource Assessment in Complex Terrain: Villonaco Case-study Using Computational Fluid Dynamics Analysis. Energy Procedia 2017, 107, 41–48.
39. Zhang, J.; Sun, H.; Sun, Z.; Dong, W.; Dong, Y.; Gong, S. Reliability Assessment of Wind Power Converter Considering SCADA Multistate Parameters Prediction Using FP-Growth, WPT, K-Means and LSTM Network. IEEE Access 2020, 8, 84455–84466.
40. Zhu, Y.; Zhu, C.; Tan, J.; Tan, Y.; Rao, L. Anomaly detection and condition monitoring of wind turbine gearbox based on LSTM-FS and transfer learning. Renew. Energy 2022, 189, 90–103.
41. Wu, X.; Jiang, G.; Wang, X.; Xie, P.; Li, X. A Multi-Level-Denoising Autoencoder Approach for Wind Turbine Fault Detection. IEEE Access 2019, 7, 59376–59387.
42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
43. Chen, P.C.; Tsai, H.; Bhojanapalli, S.; Chung, H.W.; Chang, Y.W.; Ferng, C.S. A Simple and Effective Positional Encoding for Transformers. arXiv 2021, arXiv:2104.08698. Available online: http://arxiv.org/abs/2104.08698 (accessed on 12 June 2024).
44. Abbas, N.; Riaz, M.; Does, R.J.M.M. Enhancing the performance of EWMA charts. Qual. Reliab. Eng. Int. 2011, 27, 821–833.
45. Wu, Z.; Lin, W.; Ji, Y. An Integrated Ensemble Learning Model for Imbalanced Fault Diagnostics and Prognostics. IEEE Access 2018, 6, 8394–8402.
Figure 1. General flow chart of the methodological process used in this study.
Figure 2. Percentage of failures and alarms of all wind turbines at VWF. (a) Representative faults from the maintenance record. (b) SCADA alarms related to the fault component.
Figure 3. Feature selection by correlation.
Figure 4. Schematic diagram of the LSTM architecture.
Figure 5. Architecture of a deep AE.
Figure 6. Schematic diagram of the MA mechanism.
Figure 7. Architecture of the LSTM-MA-AE model.
Figure 8. Flowchart of the failure prediction system. The symbol * represents multiplication by the threshold of the smoothing vector. Dotted lines highlight the sliding-window anomaly process.
Figure 9. Loss comparison during training and validation with different model hyperparameters. (a) Loss comparison with different batch sizes. (b) Loss comparison with different sequence lengths. (c) Loss comparison with different regularization factors. (d) Loss comparison with different number of heads. (e) Loss comparison with different number of layers. (f) Loss comparison with different dropout values. (g) Loss comparison with different hidden layer sizes. (h) Loss comparison with different learning rate values.
Figure 10. Comparison of training time between different hyperparameter changes.
Figure 11. Reconstruction of the IGBT temperature variable. (a) True and reconstructed signal. (b) Reconstruction error.
Figure 12. Failure prediction with fixed threshold and limited data.
Figure 13. Results of the failure prediction system without EWMA for WT2. Parameters used: Threshold_fixed of each variable corresponds to the 98th percentile of the data; N_window1: 30 days; N_anomalies1: 30.
Figure 14. EWMA adaptive threshold.
Figure 15. Results of the failure prediction system for WT2. Parameters used: α: 0.3; λ: 1; N_window1: 15 days; N_anomalies1: 30. The 24 June 2018 failure (purple) may be a consequence of the failure preceding it; as it did not result from degradation and only 14 days separated the two failures, the model cannot generate a warning for it.
Figure 16. Results of the failure prediction system for WT3. Parameters used: α: 0.32; λ: 1.2; N_window1: 30 days; N_anomalies1: 50.
Table 1. Summary of key studies on WT fault detection.

| Ref. & Year | Approach | Tools/Methods | Main Component | Results & Limitations |
|---|---|---|---|---|
| [31], 2019 | Classification | BPNN | Converter | Accuracy = 99.98% in identifying converter faults. Limited discussion of the scalability of the proposed strategy to a wider variety of fault types. Does not use SCADA data. |
| [39], 2020 | Prediction | K-means, LSTM, FP-Growth, wavelet packet transform (WPT) | Converter | The proposed strategy shows promise for assessing converter reliability and fits well with SCADA data. However, it does not show an early prediction of failure or a considerable time window before failure. |
| [25], 2021 | Detection | AOC–ResNet50 | Converter | With accuracy = 96.11%, the proposed model outperformed other models in detecting failures. However, it does not show the variables with a high abnormality rate and does not test the model over long dataset periods. |
| [30], 2022 | Transfer learning | Balanced Joint Adaptive Network (BJAN), LSTM | WT | BJAN achieves accuracy = 99.3% in WT fault diagnosis. The study also addresses converter fault detection but does not present the advance time to failure and does not distinguish between faults and alarms. |
| [36], 2022 | Detection | DAE | Blade | Demonstrates high performance with F1 > 89.00%. The study focuses on a mechanical component of the WT and does not clearly describe the advance time to failure. |
| [35], 2023 | Detection | Variational AE | WT | Focuses on detecting and diagnosing WT mechanical component failures. However, it does not present model evaluation metrics. |
| [37], 2023 | Prediction | AE, AT-LSTM | IGBT | The model can dynamically reconstruct the variable related to the target component. However, it does not present a prediction system to determine when the failure will occur. |
| [24], 2024 | Detection | Transformers | IGBT | The proposed models can predict failures in the target component using SCADA data from a high-altitude wind farm. Due to the model architecture, the fault detection system is complex. |
| [34], 2024 | Prediction | RNN, LSTM | WT | Experimental results show that the model reconstructs and fits normal data well. However, it does not detect faults in a specific target component. |
| [33], 2024 | Detection | LSTM-AVAGMM | Gearbox | The model's accuracy of 96.44% shows its efficiency in detecting failures in a WT mechanical component. It does not present a failure prediction system. |
Table 2. Variables selected for converter failure analysis.

| # | Variable | Description |
|---|---|---|
| 1 | grid_active_power_avg | Indicates the performance of the WT and the operating conditions of the converter. |
| 2 | generator_speed_avg | Closely linked to the performance of the WT; can influence the converter's behavior. |
| 3 | ambient_temperature_avg | Ambient temperature affects several components of the WT, including the converter. |
| 4 | grid_U1_avg | The average grid voltage can influence converter failures. |
| 5 | grid_I1_avg | A key indicator of the operational status of the WT; can influence converter failures. |
| 6 | topbox_temperature_max | May indicate thermal problems affecting the converter. |
| 7 | igbt_temperature_max | Directly related to the target component. |
| 8 | rectifier_temperature_max | Critical converter status indicator. |
| 9 | step_up_igbt_temperature_max | Monitors the step-up IGBT temperature. |
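As context for how such a shortlist might be produced (cf. Figure 3, feature selection by correlation), the following is a minimal sketch of correlation-based selection. The synthetic data frame and the 0.5 cut-off are illustrative assumptions, not the study's actual data or threshold; real use would load the 10-min SCADA signals named in Table 2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
power = rng.normal(1000, 200, n)
# synthetic stand-in for the SCADA frame (real data would hold all Table 2 signals)
df = pd.DataFrame({
    "grid_active_power_avg": power,
    "igbt_temperature_max": 30 + 0.01 * power + rng.normal(0, 1, n),
    "ambient_temperature_avg": rng.normal(15, 3, n),
})

# rank candidate signals by absolute correlation with the target-component variable
corr = df.corr(numeric_only=True)["igbt_temperature_max"].abs()
corr = corr.drop("igbt_temperature_max")
selected = corr[corr > 0.5].index.tolist()  # illustrative 0.5 cut-off
print(selected)
```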
Table 3. Hyperparameters used in LSTM-MA-AE.

| Parameter | Value |
|---|---|
| Batch size | 256 |
| Sequence length | 20 |
| Dropout | 0.1 |
| LSTM hidden size | 64 |
| Learning rate | 1 × 10⁻⁶ |
| Epochs | 75 |
| Regularization factor | 1 × 10⁻⁴ |
| LSTM layers (encoder) | 2 |
| LSTM layers (decoder) | 2 |
| Attention heads (encoder) | 8 |
| Attention heads (decoder) | 8 |
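Wiring these values into a training loop might look as follows. This is a sketch under two assumptions: that the regularization factor acts as L2 weight decay in Adam, and that the LSTMMAAutoencoder class from the sketch in the Conclusions is in scope; the random tensors stand in for real healthy-operation SCADA windows.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

windows = torch.randn(2048, 20, 9)  # placeholder for healthy 20-step SCADA windows
loader = DataLoader(TensorDataset(windows), batch_size=256, shuffle=True)

model = LSTMMAAutoencoder()  # hypothetical class from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-4)

for epoch in range(75):
    for (batch,) in loader:
        opt.zero_grad()
        # MAE reconstruction loss, chosen here because Table 6 reports MAE
        loss = torch.nn.functional.l1_loss(model(batch), batch)
        loss.backward()
        opt.step()
```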
Table 5. Comparison of training speed by epoch.

| Approach | Time per Epoch |
|---|---|
| LSTM-AE | 7.48 s |
| LSTM-MA-AE without PE | 13.40 s |
| LSTM-MA-AE | 13.28 s |
Table 6. Model performance for different dataset percentages.

| Data Percentage | Variable 3 (MAE) |
|---|---|
| 90% | 0.35 |
| 50% | 1.45 |
| 25% | 3.62 |
Table 7. Fault prediction system performance metrics.

| WT | TP | FP | FN | Re | Pr | F1 | Adv. 1 | Adv. 2 | Adv. 3 | Adv. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| WT2 | 2 | 1 | 0 | 1.00 | 0.67 | 0.80 | 2.50 | - | 4.20 | 3.35 |
| WT3 | 3 | 0 | 0 | 1.00 | 1.00 | 1.00 | 2.30 | 3.70 | 4.30 | 3.43 |
| Avg. | | | | | | | | | | 3.39 |
Table 8. Impact matrix comparing this study with previous studies.

| Criterion | [37] | [24] | Our Work |
|---|---|---|---|
| Fault prediction (F1) | - | 1.00 | 0.90 |
| Early fault detection | Medium-High | High | Medium-High |
| Computational efficiency | High | High | High |
| Use of Multi-Head Attention | No | Yes | Yes |
| Application in complex terrains | No | Yes | Yes |
| Advance | 4.32 months | 4 months | 3.39 months |
| Limit to failure | 3 months | 3 months | 2.30 months |

Note: Bold values indicate the best performance.