**Citation:** Xiao, Y.; Xue, J.; Li, M.; Yang, W. Low-Pass Filtering Empirical Wavelet Transform Machine Learning Based Fault Diagnosis for Combined Fault of Wind Turbines. *Entropy* **2021**, *23*, 975. https://doi.org/10.3390/e23080975

Academic Editor: Sotiris Kotsiantis

Received: 24 June 2021; Accepted: 26 July 2021; Published: 29 July 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**1. Introduction**

With the improvement of people's environmental awareness, sustainable and carbon-neutral renewable energy has gradually developed to replace oil, coal and other traditional fossil fuels [1]. According to a recent report on renewable capacity statistics [2], the world's wind energy capacity reached 622,704 MW in 2019, accounting for 24.55% of total renewable energy capacity, second only to hydropower, the oldest renewable energy source [3]. The annual growth rate of wind energy was 10.44% in 2019, second only to that of rapidly developing solar energy. Improving the efficiency of wind turbines has always been a hot issue in wind energy utilization. In addition to studying the selection of wind turbines [4–6], it is useful to design the wind turbine structure reasonably [7,8]. At the same time, wind turbines are usually exposed to dynamic and harsh weather conditions and experience variable, rough working environments, which makes them more prone to failure than ordinary machinery. If a component of a wind turbine breaks without the workers' awareness, it may well damage other components and even shut down the wind turbine, resulting in huge economic losses [9]. Operating and maintenance costs account for more than 25% of total costs for onshore wind farms, and these costs are even higher for offshore projects [10]. Therefore, detecting wind turbine faults in time is of great significance for reducing maintenance costs and improving the efficiency of wind farms.

Many studies have been carried out on fault diagnosis of wind turbines. For example, Liu et al. [11] introduced local mean decomposition (LMD) to analyze wind turbine gearbox vibration signals for fault diagnosis. Feng et al. [12] proposed a frequency demodulation analysis method based on ensemble empirical mode decomposition (EEMD) and an energy separation algorithm to detect and locate faults of wind turbine planetary gearboxes by analyzing vibration signals. Chen et al. [13] applied the empirical wavelet transform (EWT) to vibration signals to diagnose wind turbine generator bearing faults. These methods depend on experienced analysts to interpret the signals and determine the faults of wind turbine drivetrains; although their precision is guaranteed, they lack efficiency. In recent years, with the rise of machine learning (ML), some scholars have tried to use ML methods to diagnose wind turbine drivetrains. For example, Liu et al. [14] extracted features from vibration signals by diagonal spectrum and employed clustering binary tree support vector machines to diagnose wind turbine gearboxes. Tang et al. [15] proposed a fault diagnosis method for wind turbine drivetrains based on manifold learning and a Shannon wavelet support vector machine. Gao et al. [16] decomposed vibration signals by integral extension local mean decomposition (IELMD) and calculated multiscale entropy values as features for least squares support vector machines to identify fault types of rolling bearings in wind turbine gearboxes. Lei et al. [17] introduced long short-term memory (LSTM) networks in wind turbine fault diagnosis. Jiang et al. [18] proposed a multiscale convolutional neural network (MSCNN) to diagnose wind turbine gearbox faults.

Almost two-thirds of ML-based wind turbine fault diagnosis methods use classification, whose procedure includes data preprocessing, class equalization, feature extraction, feature selection, hyperparameter tuning, cross-validation and use of the best model [19]. This intelligent approach frees the diagnosis from expert experience.
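The workflow above can be sketched with scikit-learn; the toy dataset and parameter grid below are illustrative stand-ins, not the data or settings used in this paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative dataset standing in for extracted vibration features
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Preprocessing + classifier in one pipeline; cross-validated grid search
# covers the hyperparameter-tuning and cross-validation steps
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
grid = {"svm__C": [1, 10, 100], "svm__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=StratifiedKFold(n_splits=5))
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is then the "best model" kept for diagnosis.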

However, most of these ML-based wind turbine fault diagnosis methods only studied single faults [15–19]. In reality, a wind turbine is a complex system in which failures can happen one after another or simultaneously, so a wind turbine may have more than one fault at the same time, i.e., a combined fault occurs. For example, misalignment may cause a gear or bearing to fail, so that multiple faults coexist. Gear faults in different stages are also a common combined fault [20]. A combined fault (also called a compound fault) is more difficult to diagnose than a single fault because the typical fault features become difficult to extract. At present, combined fault diagnosis of wind turbines usually depends on manual analysis to calculate, extract and show the frequencies of different faults in spectra [21–27]. Only a few scholars have studied combined fault diagnosis by ML. For example, Zhong et al. [28] decomposed the vibration signal into a series of intrinsic mode functions (IMFs) by the Hilbert-Huang transform (HHT) with ensemble empirical mode decomposition (EEMD), selected useful IMFs by correlation coefficients, and calculated the energy vector of the selected IMFs, together with the maximum amplitude, the corresponding frequency and six time-domain statistical indices, as features of a pairwise-coupled sparse Bayes extreme learning machine to detect several common gearbox single faults and simultaneous faults.

This paper focuses on an ML-based fault diagnosis method for combined faults and single faults of wind turbines. In our method, a combined fault is treated as a class equivalent to a single fault, which means the output for a combined fault is not a set of binary tags, one per constituent single fault (a multilabel classification problem). The remainder of this paper is structured as follows: Section 2 introduces the proposed method and related theories. Section 3 presents the test rig, the experiments and the results. Finally, Section 4 concludes the paper.

#### **2. Methods**

The fault diagnosis method we propose for combined faults of wind turbines can be described as follows. First, extract features from vibration signals by low-pass filtering empirical wavelet transform (LPFEWT). Then, build feature datasets for different conditions (normal, single faults and combined fault). Last, train a support vector machine (SVM) for classification, using the grey wolf optimizer (GWO) for hyperparameter tuning. After training, the obtained SVM model can identify faults of wind turbines from the features of input vibration signals. The flow chart of the method is shown in Figure 1.

**Figure 1.** The flow chart of the proposed ML-based fault diagnosis method for combined fault of wind turbines.

#### *2.1. Low Pass Filtering Empirical Wavelet Transform (LPFEWT)*

Empirical Wavelet Transform (EWT) is a new adaptive signal processing approach proposed by Gilles in 2013 [29]. The main idea is to adaptively decompose the modes of a signal from its Fourier spectrum by an appropriately built wavelet filter bank. The steps of EWT are summarized as follows:

• Fast Fourier Transform (FFT);

Convert the signal *f* to the frequency domain by FFT to get its Fourier spectrum (frequency ω ∈ [0, π]).

• Fourier Spectrum Segmentation;

Divide the Fourier spectrum into *N* contiguous segments. Let *ωn* denote the limits between segments, so each segment is Λ*n* = [*ωn*−1, *ωn*]. With each *ωn* as center, a transition phase of width 2*τn* is defined.

• Mode Extraction;

Let ˆ*f* and ˇ*f* denote the Fourier transform and its inverse, respectively. Choose *τn* proportional to *ωn*: *τn* = *γωn*, where 0 < *γ* < 1. Consequently, ∀*n* > 0, the empirical scaling function ˆ*φn*(*ω*) and the empirical wavelets ˆ*ψn*(*ω*) are as follows, where *β*(*x*) is a smooth transition function satisfying *β*(*x*) = 0 for *x* ≤ 0, *β*(*x*) = 1 for *x* ≥ 1 and *β*(*x*) + *β*(1 − *x*) = 1; a common choice is *β*(*x*) = *x*⁴(35 − 84*x* + 70*x*² − 20*x*³):

$$\hat{\phi}\_n(\omega) = \begin{cases} 1, & |\omega| \le (1 - \gamma)\omega\_n \\ \cos\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega\_n}\left(|\omega| - (1 - \gamma)\omega\_n\right)\right)\right], & (1 - \gamma)\omega\_n \le |\omega| \le (1 + \gamma)\omega\_n \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

and

$$\hat{\psi}\_n(\omega) = \begin{cases} 1, & (1 + \gamma)\omega\_n \le |\omega| \le (1 - \gamma)\omega\_{n+1} \\ \cos\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega\_{n+1}}\left(|\omega| - (1 - \gamma)\omega\_{n+1}\right)\right)\right], & (1 - \gamma)\omega\_{n+1} \le |\omega| \le (1 + \gamma)\omega\_{n+1} \\ \sin\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega\_n}\left(|\omega| - (1 - \gamma)\omega\_n\right)\right)\right], & (1 - \gamma)\omega\_n \le |\omega| \le (1 + \gamma)\omega\_n \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

To construct a tight frame set of empirical wavelets, choose

$$\gamma < \min\_{n} \left( \frac{\omega\_{n+1} - \omega\_n}{\omega\_{n+1} + \omega\_n} \right) \tag{3}$$

The detail coefficients W*ε f*(*n*, *t*) are given by the inner products with the empirical wavelets *ψn*, and the approximation coefficients W*ε f*(0, *t*) are given by the inner product with the scaling function *φ*1.

$$\mathcal{W}\_f^{\varepsilon}(n, t) = \langle f, \psi\_n \rangle = \int f(\tau)\, \overline{\psi\_n(\tau - t)}\, d\tau = \left( \hat{f}(\omega)\, \overline{\hat{\psi}\_n(\omega)} \right)^{\vee} \tag{4}$$

$$\mathcal{W}\_f^{\varepsilon}(0, t) = \langle f, \phi\_1 \rangle = \int f(\tau)\, \overline{\phi\_1(\tau - t)}\, d\tau = \left( \hat{f}(\omega)\, \overline{\hat{\phi}\_1(\omega)} \right)^{\vee} \tag{5}$$

The reconstruction is obtained by

$$f(t) = \mathcal{W}\_f^{\varepsilon}(0, t) \star \phi\_1(t) + \sum\_{n=1}^N \mathcal{W}\_f^{\varepsilon}(n, t) \star \psi\_n(t) = \left( \hat{\mathcal{W}}\_f^{\varepsilon}(0, \omega)\, \hat{\phi}\_1(\omega) + \sum\_{n=1}^N \hat{\mathcal{W}}\_f^{\varepsilon}(n, \omega)\, \hat{\psi}\_n(\omega) \right)^{\vee} \tag{6}$$

There are multiple algorithms to automatically segment the Fourier spectrum, such as local-maxima, local-maxima-minima and scale-space (including Otsu, half-normal, empirical law, means and k-means) [29,30]. The scale-space algorithms are parameterless, but they take a long time to compute when processing a long signal, and different signals are often decomposed into different numbers of modes, which is inconvenient when comparing signals with each other. Considering these factors, we choose the simplest and fastest algorithm, local-maxima, which allows setting the maximum number of segments.
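A rough sketch of local-maxima segmentation under our own simplifying assumptions (no spectrum smoothing; segment limits placed midway between the retained maxima; function name and toy signal are ours):

```python
import numpy as np

def local_maxima_boundaries(signal, max_segments):
    """Place segment limits midway between the largest local maxima
    of the magnitude spectrum, mapped to normalized frequency [0, pi]."""
    mag = np.abs(np.fft.rfft(signal))
    # indices of interior local maxima of the spectrum
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    # keep the max_segments largest peaks, sorted by frequency
    top = np.sort(peaks[np.argsort(mag[peaks])[-max_segments:]])
    # boundaries midway between consecutive retained peaks
    mids = (top[:-1] + top[1:]) / 2.0
    return mids * np.pi / (len(mag) - 1)

# toy signal: two tones -> spectrum has two dominant maxima
fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
bounds = local_maxima_boundaries(x, max_segments=2)
print(bounds)  # one boundary between the 50 Hz and 200 Hz peaks
```

Setting `max_segments` fixes the number of modes, which is what makes signals comparable with each other.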

Based on EWT, LPFEWT is proposed to extract features. First, design a low-pass FIR filter with an appropriate cut-off frequency for the signal. Next, apply EWT to the filtered signal to decompose it into several empirical modes. Then, exclude the empirical mode of the highest frequencies, which is most affected by the filter. Last, calculate indices of the remaining modes as features. With this approach, the features required for fault diagnosis can be obtained easily.
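The filtering step can be sketched with SciPy; the cut-off frequency, filter length and toy signal below are illustrative choices of ours, not necessarily those used in this paper:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

fs = 1000                        # sampling frequency, Hz (illustrative)
t = np.arange(0, 1, 1 / fs)
# toy vibration signal: low-frequency component plus a high-frequency tone
x = np.sin(2 * np.pi * 30 * t) + 0.8 * np.sin(2 * np.pi * 400 * t)

# Step 1: low-pass FIR filter with an appropriate cut-off frequency
taps = firwin(numtaps=101, cutoff=100, fs=fs)   # 100 Hz cut-off
x_filt = filtfilt(taps, 1.0, x)                 # zero-phase filtering

# Steps 2-4 (EWT, dropping the highest-frequency mode, computing indices
# such as mode energies) would follow; here we only check that the
# filter removed the 400 Hz component while keeping the 30 Hz one
energy = lambda s: np.sum(s**2)
print(energy(x_filt) / energy(x))
```

Zero-phase filtering (`filtfilt`) avoids introducing a phase shift into the vibration signal before decomposition.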

Compared with the traditional wavelet transform, LPFEWT is adaptive: it decomposes the signal based on the information contained in the signal itself, so there is no need to choose or design a specific wavelet basis for the signal.

### *2.2. Support Vector Machine (SVM)*

SVM is a very powerful and versatile ML model, particularly well suited to classifying complex small- or medium-sized datasets [31].

The simplest linear SVM for binary classification can be described as follows. For all samples to be classified *xi* (*i* = 1, 2, . . . , *n*), the output is

$$y\_i = \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\_i + b\right) \tag{7}$$

i.e., *yi* = −1 if *w*T*xi* + *b* < 0 and *yi* = +1 if *w*T*xi* + *b* > 0, so the hyperplane *w*T*x* + *b* = 0 is the decision boundary. To make the decision boundary best for separation, construct two hyperplanes *w*T*x* + *b* = −1 and *w*T*x* + *b* = 1, which are parallel and at equal distance to the decision boundary, i.e., *yi* = −1 if *w*T*xi* + *b* ≤ −1 and *yi* = +1 if *w*T*xi* + *b* ≥ 1. Training the SVM means finding the values of *w* and *b* that make the width of the margin, 2/‖*w*‖, as large as possible. That is the constrained optimization problem
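The margin width 2/‖*w*‖ follows from the standard point-to-hyperplane distance formula: for any point *x*⁺ on the hyperplane *w*T*x* + *b* = 1, its distance to *w*T*x* + *b* = −1 is

$$d = \frac{\left|\left(\mathbf{w}^T \mathbf{x}^+ + b\right) - (-1)\right|}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$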

$$\max\_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|} \quad \text{s.t. } y\_i \left(\mathbf{w}^T \mathbf{x}\_i + b\right) \ge 1, \ i = 1, 2, \dots, n \tag{8}$$

which can be converted to an equivalent problem

$$\min\_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t. } y\_i\left(\mathbf{w}^T \mathbf{x}\_i + b\right) \ge 1, \ i = 1, 2, \dots, n \tag{9}$$

This is a convex quadratic optimization problem with linear constraints, known as a quadratic programming (QP) problem, which can be solved by the method of Lagrange multipliers. Introducing Lagrange multipliers *λ* = (*λ*1, *λ*2, ··· , *λn*), the objective function of the optimization can be expressed as

$$\mathcal{L}(\mathbf{w}, b, \lambda) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum\_{i=1}^n \lambda\_i\left[1 - y\_i\left(\mathbf{w}^T \mathbf{x}\_i + b\right)\right], \quad \lambda\_i \ge 0,\ i = 1, 2, \dots, n \tag{10}$$

The problem is to solve

$$\min\_{\mathbf{w},b} \max\_{\lambda} \mathcal{L}(\mathbf{w},b,\lambda) \tag{11}$$

The dual problem is

$$\max\_{\lambda} \min\_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \lambda) \tag{12}$$

Calculate the gradients with respect to both *w* and *b*, and set them equal to zero.

$$\nabla\_{\mathbf{w}} \mathcal{L}(\mathbf{w}, b, \lambda) = \mathbf{w} - \sum\_{i=1}^{n} \lambda\_i y\_i \mathbf{x}\_i = 0 \tag{13}$$

$$\frac{\partial}{\partial b}\mathcal{L}(w, b, \lambda) = -\sum\_{i=1}^{n} \lambda\_i y\_i = 0 \tag{14}$$

Substituting (13) and (14) into problem (12), we obtain

$$\begin{aligned} \max\_{\lambda} & \sum\_{i=1}^{n} \lambda\_i - \frac{1}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \lambda\_i \lambda\_j y\_i y\_j (\mathbf{x}\_i \cdot \mathbf{x}\_j) \\ \text{s.t. } & \sum\_{i=1}^{n} \lambda\_i y\_i = 0,\ \lambda\_i \ge 0,\ i = 1, 2, \dots, n \end{aligned} \tag{15}$$

Consequently, the original minimization problem over *w* and *b* is converted to a QP problem of solving for *λ*.

To make the model more flexible, soft margin classification is proposed, which allows a few instances to fall between the margins or even on the wrong side. The soft margin SVM introduces slack variables *ξi* (*i* = 1, 2, ··· , *n*), so the problem becomes

$$\min\_{\mathbf{w}, b, \xi} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum\_{i=1}^{n} \xi\_i \quad \text{s.t. } y\_i\left(\mathbf{w}^T \mathbf{x}\_i + b\right) \ge 1 - \xi\_i,\ \xi\_i \ge 0,\ i = 1, 2, \dots, n \tag{16}$$

where *C* is the penalty parameter. The larger *C* is, the more the SVM is penalized for misclassification, the lower its tolerance and the smaller the margin.
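The effect of *C* can be seen on a small example (toy data of our own; `SVC` from scikit-learn):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: the soft margin must tolerate some violations
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Larger C -> heavier penalty on violations -> narrower margin ->
# fewer points end up inside the margin as support vectors
sv_counts = {C: SVC(kernel="linear", C=C).fit(X, y).n_support_.sum()
             for C in (0.01, 1, 100)}
print(sv_counts)
```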

The QP problem equivalent to soft margin SVM classification is

$$\begin{aligned} \max\_{\lambda} & \sum\_{i=1}^{n} \lambda\_i - \frac{1}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \lambda\_i \lambda\_j y\_i y\_j (\mathbf{x}\_i \cdot \mathbf{x}\_j) \\ \text{s.t. } & \sum\_{i=1}^{n} \lambda\_i y\_i = 0,\ 0 \le \lambda\_i \le C,\ i = 1, 2, \dots, n \end{aligned} \tag{17}$$

For problems that are not linearly separable, a transformation *φ* is introduced to map *x* from the original space to a higher-dimensional space *φ*(*x*), which makes it easier to find a linear decision boundary in the new feature space. The kernel function *K*(*xi*, *xj*) = *φ*(*xi*) · *φ*(*xj*) is proposed to obtain the result without computing the coordinates of the data in the new space. This kernel trick makes the whole process much more computationally efficient. Problem (17) can be rewritten as

$$\begin{aligned} \max\_{\lambda} & \sum\_{i=1}^{n} \lambda\_i - \frac{1}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \lambda\_i \lambda\_j y\_i y\_j K(\mathbf{x}\_i, \mathbf{x}\_j) \\ \text{s.t. } & \sum\_{i=1}^{n} \lambda\_i y\_i = 0,\ 0 \le \lambda\_i \le C,\ i = 1, 2, \dots, n \end{aligned} \tag{18}$$

In this paper, we use the radial basis function (RBF) kernel

$$K(\mathbf{x}\_i, \mathbf{x}\_j) = e^{-\gamma\|\mathbf{x}\_i - \mathbf{x}\_j\|^2}, \quad \gamma > 0 \tag{19}$$

The RBF kernel is one of the most commonly used kernel functions and can deal with both linear and nonlinear classification problems; for linear problems, its results are comparable to those of the linear kernel [32,33].
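As a minimal illustration on a toy XOR-style dataset of our own (not the turbine features), the RBF kernel separates data that no linear boundary can:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no straight line separates the two classes
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])

# A linear kernel cannot fit this, but the RBF kernel can
clf = SVC(kernel="rbf", C=100, gamma=1.0).fit(X, y)
print(clf.predict(X))  # → [0 1 1 0]
```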

#### *2.3. Grey Wolf Optimizer*

Grey Wolf Optimizer (GWO) is a swarm intelligence (SI) algorithm proposed by Mirjalili et al. [34] in 2014 that imitates the leadership hierarchy and hunting mechanism of grey wolves in nature. In this paper, it is used to optimize the hyperparameters of the SVM. The social hierarchy of grey wolves is shown in Figure 2. Grey wolves are divided into four levels, from α to ω. The upper-level wolves dominate the lower-level ones, and the lower-level wolves follow the upper-level ones.

**Figure 2.** The social hierarchy of grey wolves.

In the GWO algorithm, imitating the social hierarchy of grey wolves, the first best candidate solution is regarded as *α*, the second best candidate solution is regarded as *β*, the third best candidate solution is regarded as *δ*, the remaining candidate solutions are regarded as *ω*. The hunting (optimization) is guided by *α*, *β* and *δ*, while *ω* follow them. The encircling behavior is modeled as follows:

$$D = \left| \mathcal{C} \cdot \mathbf{X}\_p(t) - \mathbf{X}(t) \right| \tag{20}$$

$$X(t+1) = X\_p(t) - A \cdot D \tag{21}$$

where *t* represents the number of iterations, *A* and *C* are coefficient vectors, *X<sup>p</sup>* is the position vector of the prey (optimum), *X* is the position vector of a grey wolf, and *D* represents the distance between the grey wolf and the prey.

The vectors *A* and *C* are defined as follows:

$$A = 2\mathbf{a} \cdot \mathbf{r}\_1 - \mathbf{a} \tag{22}$$

$$\mathbf{C} = 2 \cdot \mathbf{r}\_2 \tag{23}$$

where the components of *a* are linearly decreased from 2 to 0 over the course of iterations, and the components of *r*1 and *r*2 are random numbers in [0, 1].

The random vectors *r*1 and *r*2 allow grey wolves to move to any position within a certain range of the prey. As the vector *a* decreases, grey wolves encircle and pursue the prey. Since the location of the prey (optimum) is unknown, it is replaced by the decisions of the three best grey wolves *α*, *β* and *δ*. The following equations are used to update the position of each grey wolf.

$$\begin{cases} \mathbf{D}\_{\alpha} = |\mathbf{C}\_1 \cdot \mathbf{X}\_{\alpha} - \mathbf{X}(t)| \\ \mathbf{D}\_{\beta} = |\mathbf{C}\_2 \cdot \mathbf{X}\_{\beta} - \mathbf{X}(t)| \\ \mathbf{D}\_{\delta} = |\mathbf{C}\_3 \cdot \mathbf{X}\_{\delta} - \mathbf{X}(t)| \end{cases} \tag{24}$$

$$\begin{cases} \mathbf{X}\_1 = \mathbf{X}\_{\alpha} - \mathbf{A}\_1 \cdot \mathbf{D}\_{\alpha} \\ \mathbf{X}\_2 = \mathbf{X}\_{\beta} - \mathbf{A}\_2 \cdot \mathbf{D}\_{\beta} \\ \mathbf{X}\_3 = \mathbf{X}\_{\delta} - \mathbf{A}\_3 \cdot \mathbf{D}\_{\delta} \end{cases} \tag{25}$$

$$X(t+1) = \frac{X\_1 + X\_2 + X\_3}{3} \tag{26}$$

Since each component of *A* is a random value in the interval [−*a*, *a*], the next position of a wolf approaches the prey if |*A*| < 1 and moves away from the prey if |*A*| > 1. This means that grey wolves not only pursue and attack the current prey but can also leave it to search for other prey. In other words, the GWO algorithm has an exploration feature that helps avoid local optima. The random vector *C* simulates the obstacles to approaching prey in nature.
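The update rules above can be sketched as follows (a minimal implementation without the usual best-so-far elitism; the population size, iteration budget and sphere test function are our own choices):

```python
import numpy as np

def gwo(objective, dim, lb, ub, n_wolves=20, max_iter=100, seed=0):
    """Minimal Grey Wolf Optimizer following Equations (20)-(26)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_wolves, dim))        # wolf positions
    for t in range(max_iter):
        fitness = np.array([objective(x) for x in X])
        alpha, beta, delta = X[np.argsort(fitness)[:3]]  # three best wolves
        a = 2 - 2 * t / max_iter                         # a drops 2 -> 0
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                       # Equation (22)
                C = 2 * r2                               # Equation (23)
                D = np.abs(C * leader - X[i])            # Equation (24)
                new_pos += leader - A * D                # Equation (25)
            X[i] = np.clip(new_pos / 3, lb, ub)          # Equation (26)
    fitness = np.array([objective(x) for x in X])
    return X[np.argmin(fitness)]

# Sanity check on the sphere function, whose minimum is at the origin
best = gwo(lambda x: np.sum(x**2), dim=2, lb=-10.0, ub=10.0)
print(best)
```

In the setting of this paper, `objective` would instead return 1 minus the cross-validated accuracy of an SVM whose hyperparameters are decoded from the wolf's position (e.g., C and γ on logarithmic scales).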

GWO can make hyperparameter tuning of the SVM more effective than the usual approaches (grid search or randomized search). A GWO-tuned SVM also achieves better classification accuracy than the typical one-versus-one multiclass SVM [35]. Compared with particle swarm optimization (PSO), GWO has fewer parameters to determine (only the population size and the maximum number of iterations), because it updates the positions of search agents using the positions of the three best wolves, whereas PSO updates them using the global best and personal best positions, and each PSO search agent has a velocity in addition to a position.
