**1. Introduction**

With the gradual depletion of non-renewable energy sources and the deterioration of the living environment, wind energy has developed rapidly as a renewable energy source [1]. However, wind turbines (WTs), the main equipment for wind power generation, are mostly installed in remote areas, and the harsh operating environment causes frequent failures of key components such as gearboxes and bearings [2]. Therefore, to ensure the safe operation of WTs and reduce operation and maintenance (O&M) costs, it is crucial to develop effective fault diagnosis methods for gearboxes [3].

As vibration and acoustic emission signals are sensitive to machine faults, condition monitoring systems based on vibration [4,5] and acoustic emission [6–8] have been widely used for condition monitoring and fault diagnosis. To monitor the health condition of WTs, the wind energy industry currently uses such systems to collect large amounts of real-time data for diagnosing gearbox faults. As the amount of data collected from gearboxes grows, traditional fault diagnosis methods cannot effectively analyze massive data and automatically give accurate diagnosis results [9]. Therefore, intelligent fault diagnosis methods based on artificial intelligence techniques are gaining more attention.

Generally, intelligent fault diagnosis involves two main steps: feature extraction and fault classification [10]. Traditional methods such as artificial neural networks (ANN) and support vector machines (SVM) are used to classify faults [11–13]. However, these common machine learning methods rely on well-selected features and have limited ability to learn from complex time-series signals; moreover, they struggle to identify faults under variable working conditions and yield low classification accuracy, so a more effective fault identification method is needed [14–17]. In recent years, deep learning has attracted great attention in various fields due to its powerful feature-learning ability and its superiority in processing massive data. To date, deep learning networks such as deep belief networks (DBN) [18], convolutional neural networks (CNN) [19] and recurrent neural networks (RNN) [20] have been widely applied in fault diagnosis. However, gearbox faults exhibit strong time dependence because of the gearbox's relatively long operating time [21]. Compared with other deep learning methods, the long short-term memory (LSTM) neural network has great advantages in learning long-term time-dependent characteristics of sequences [22,23].

For fault diagnosis methods based on LSTM neural networks, softmax cross entropy is usually used as the loss function for fault classification. However, recent studies have found that the traditional softmax loss provides insufficient discriminating power for classification. To obtain better discriminating performance, Wang et al. [18] proposed a novel loss function called large margin cosine loss (LMCL) for learning highly discriminative deep features for face recognition; their results show that a loss function based on cosine distance performs well in classification. Therefore, this paper proposes an optimized fault diagnosis method using an LSTM network with a cosine loss (Cos-LSTM) to improve classification ability. The energy sequence features and the wavelet energy entropy of fault vibration data collected on a gearbox fault diagnosis experimental platform are used to validate the Cos-LSTM network. The Cos-LSTM achieves higher diagnosis accuracy, as demonstrated through gear transmission experiments and comparison with other fault diagnosis methods.

The rest of the paper is organized as follows. In Section 2, the typical architecture of LSTM and the process of fault diagnosis are briefly introduced. Section 3 details the Cos-LSTM method and the process of gearbox fault diagnosis based on the Cos-LSTM method. The gearbox fault diagnosis experiment and the comparisons of our proposed method and other fault diagnosis methods are presented in Section 4. Finally, the conclusions are drawn in Section 5.

### **2. LSTM Neural Network for Fault Diagnosis**

As a special type of recurrent neural network (RNN), the LSTM neural network was proposed by Hochreiter and Schmidhuber [24] to solve the vanishing and exploding gradient problems of RNNs [25] while retaining the ability of RNNs to process sequential data. In this section, we describe the LSTM in more detail.

### *2.1. Structure of LSTM*

The main component of an LSTM neural network is the LSTM cell, which can decide whether to update the state information of a memory cell. The structure of the LSTM cell is shown in Figure 1.

**Figure 1.** Schematic diagram of an LSTM cell.

As shown in Figure 1, *h*(*t*) and *x*(*t*) are the output hidden state and the input of the current time step, and *h*(*t* − 1) is the hidden state of the previous time step; sigm denotes the sigmoid function and tanh the hyperbolic tangent function. *C*(*t*) is a memory cell used to preserve information, and the flow of information into or out of *C*(*t*) is regulated by three different gates: the input gate *i*(*t*), the forget gate *f*(*t*) and the output gate *o*(*t*).

The internal state node *s*(*t*) and the input node *g*(*t*) are also integral parts of the LSTM cell. The calculation procedures of the LSTM cell are as follows:

$$g(t) = \Phi(W\_{gx}x(t) + W\_{gh}h(t-1) + b\_g),\tag{1}$$

$$i(t) = \sigma(W\_{ix}x(t) + W\_{ih}h(t-1) + b\_i),\tag{2}$$

$$f(t) = \sigma(W\_{fx}x(t) + W\_{fh}h(t-1) + b\_f),\tag{3}$$

$$o(t) = \sigma(W\_{ox}x(t) + W\_{oh}h(t-1) + b\_o),\tag{4}$$

$$s(t) = g(t) \* i(t) + C(t-1) \* f(t),\tag{5}$$

$$h(t) = \Phi(s(t)) \* o(t). \tag{6}$$

In the above equations, *Wjx*, *Wjh* and *bj*, *j* = *g*, *i*, *f*, *o*, denote the input weight matrices, hidden weight matrices and bias vectors, respectively; ∗, σ and Φ denote the element-wise multiplication of two vectors, the *sigmoid* function and the *tanh* function, respectively.

The LSTM neural network automatically learns when to open or close each gate to control the flow of information through the LSTM cells, so it can select useful information for training the model.
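To make Equations (1)–(6) concrete, the following is a minimal NumPy sketch of a single LSTM cell step; the dictionary-based parameter layout and all variable names are our own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the gate activation (sigm in Figure 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell, following Equations (1)-(6).

    params["W"], params["U"], params["b"] hold the input weights W_jx,
    hidden weights W_jh and biases b_j for j in {g, i, f, o}.
    """
    W, U, b = params["W"], params["U"], params["b"]
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # input node, Eq. (1)
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate, Eq. (2)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate, Eq. (3)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate, Eq. (4)
    s = g * i + c_prev * f                                # internal state, Eq. (5)
    h = np.tanh(s) * o                                    # hidden output, Eq. (6)
    return h, s

# Illustrative shapes: a 2-dimensional input and 4 hidden units, matching
# the configuration reported later in Section 3.2.
rng = np.random.default_rng(0)
W = {j: 0.1 * rng.standard_normal((4, 2)) for j in "gifo"}
U = {j: 0.1 * rng.standard_normal((4, 4)) for j in "gifo"}
b = {j: np.zeros(4) for j in "gifo"}
h, s = lstm_cell_step(rng.standard_normal(2), np.zeros(4), np.zeros(4),
                      {"W": W, "U": U, "b": b})
```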

### *2.2. Architecture of LSTM for Fault Diagnosis*

The LSTM neural network is used for fault classification in fault diagnosis. The architecture of the LSTM network includes five layers: an input layer, an LSTM hidden layer, a fully connected layer, a softmax layer and a classification output layer at the end. The architecture of the LSTM network is shown in Figure 2.

**Figure 2.** The architecture of the LSTM network.

During the training process, the fault features are first fed into the input layer; the data then flow through the LSTM cells, whose outputs form the LSTM hidden layer. The last output of the LSTM hidden layer is taken as the output of the LSTM network and is connected to a fully connected layer that maps it into the result space. The softmax layer follows the fully connected layer to calculate the probabilities of all the fault patterns. Finally, the fault diagnosis results are output by the classification output layer. After training is complete, the weights and biases have been adjusted to their optimal values, and the test set is then input into the LSTM for fault diagnosis.
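To make this data flow concrete, the following is a minimal sketch of the five-layer arrangement in Figure 2. PyTorch, the class name `LSTMClassifier` and the layer sizes are our own illustrative choices; the paper does not specify an implementation framework.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Input -> LSTM hidden layer -> fully connected layer -> softmax -> output."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time_steps, input_dim). As described in the
        # text, only the last output of the LSTM hidden layer is kept and
        # mapped into the result space by the fully connected layer.
        out, _ = self.lstm(x)
        logits = self.fc(out[:, -1, :])
        return torch.softmax(logits, dim=1)  # probabilities per fault pattern
```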

## **3. Cos-LSTM**

The softmax cross entropy is often used as the loss function of the LSTM neural network; however, the softmax loss provides insufficient discriminating power for classification [26,27]. To solve this problem, the cosine loss function is adopted to optimize the LSTM neural network. This section provides details about the Cos-LSTM.

### *3.1. Cosine Loss*

Built on the softmax loss, the cosine loss retains its advantage of enlarging the differences between classes [15] but is less sensitive to signal strength, paying more attention to the directional difference between vectors. The schematic of the cosine loss is shown in Figure 3.

**Figure 3.** Schematic of Cosine Loss.


Suppose there are two signals *q*<sub>1</sub> and *q*<sub>2</sub> with the same fault, and the corresponding fault label is *p*<sub>1</sub>. When softmax is taken as the loss function, the softmax loss can be formulated as follows:

$$Loss\_{soft} = \frac{1}{B}\sum\_{i=1}^{B} -\log\left(\frac{e^{\|W\_i\|\|x\|\cos\theta\_i}}{\sum\_{j=1}^{N} e^{\|W\_j\|\|x\|\cos\theta\_j}}\right)\tag{7}$$

where *B* is the number of training samples and *N* is the number of classes; *x* and *W* represent the hidden layer output and the weight matrix, respectively, and θ<sub>*i*</sub> is the angle between *W*<sub>*i*</sub> and *x*. Formula (7) shows that the softmax loss depends on the signal strength, whereas the cosine loss evaluates the difference between classes according to the cosine similarity between two feature vectors. The cosine similarity is defined as follows:

$$\text{similarity}(A,B) = \frac{A\cdot B}{\|A\|\,\|B\|} = \frac{\sum\_{i=1}^{n} A\_i B\_i}{\sqrt{\sum\_{i=1}^{n} A\_i^2}\sqrt{\sum\_{i=1}^{n} B\_i^2}}\tag{8}$$

Taking 1 − cosine similarity as the loss function, the cosine loss can be formulated as follows:

$$Loss\_{\cos} = \frac{1}{B}\sum\_{i=1}^{B}\left(1 - \frac{y\_i}{\sqrt{\sum\_{j=1}^{N} y\_j^2}}\right) = \frac{1}{B}\sum\_{i=1}^{B}\left(1 - \frac{\|W\_i\|\|x\|\cos\theta\_i}{\sqrt{\sum\_{j=1}^{N}\|W\_j\|^2\|x\|^2\cos^2\theta\_j}}\right)\tag{9}$$

In Formula (9), the ‖*x*‖ terms cancel, so the cosine loss is independent of the signal strength. Therefore, taking the cosine loss as the loss function in gearbox fault diagnosis converts the loss from Euclidean space to angular space, which eliminates the effect of signal strength and reduces the burden of network fitting.
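As a sketch of Equations (8) and (9), the following computes the cosine loss from the outputs of the fully connected layer, assuming those outputs are used directly as the vector *y*; the function name and the PyTorch framework are our own choices.

```python
import torch

def cosine_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cosine loss of Equation (9): one minus the cosine similarity between
    the output vector y and the one-hot fault label, averaged over the batch.

    logits: (B, N) outputs y_j = ||W_j|| ||x|| cos(theta_j);
    labels: (B,) integer class indices.
    """
    # The cosine similarity of Equation (8) with a one-hot vector reduces
    # to y_i / ||y||_2, where i is the true class.
    norms = torch.linalg.norm(logits, dim=1)  # sqrt(sum_j y_j^2)
    y_true = logits[torch.arange(logits.size(0)), labels]
    return (1.0 - y_true / norms).mean()

# The loss is invariant to signal strength: scaling the outputs by any
# positive constant leaves it unchanged.
y = torch.randn(5, 11)
t = torch.randint(0, 11, (5,))
print(cosine_loss(y, t), cosine_loss(3.0 * y, t))  # identical values
```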

### *3.2. The Process of Cos-LSTM for Fault Diagnosis*

In this paper, two kinds of fault features are extracted to evaluate the proposed method: the energy sequence feature and the wavelet energy entropy.

The energy sequence feature: The energy sequence features are extracted by wavelet packet decomposition (WPD). WPD is a signal decomposition tool that decomposes a signal into a set of nodes, where each node represents a set of coefficients in a specified frequency band [28,29]. The wavelet packet is defined as follows:

$$\phi(t) = \sqrt{2} \sum\_{k} h(k) \phi(2t - k) \tag{10}$$

$$\psi(t) = \sqrt{2} \sum\_{k} g(k) \phi(2t - k) \tag{11}$$

where *h*(*k*) and *g*(*k*) are a low-pass filter and a high-pass filter, respectively, and φ(*t*) and ψ(*t*) represent the scaling function and the wavelet function, respectively. Additionally, *g*(*k*) can be expressed in terms of *h*(*k*) as *g*(*k*) = (−1)<sup>*k*</sup>*h*(1 − *k*).

The signal is decomposed by Equations (12) and (13):

$$d\_{j+1,2n}(t) = \sum\_{l \in \mathbb{Z}} h\_{l-2k} d\_{j,n}(t) \tag{12}$$

$$d\_{j+1,2n+1}(t) = \sum\_{l \in \mathbb{Z}} g\_{l-2k} d\_{j,n}(t) \tag{13}$$

where *j* denotes the decomposition layer, *n* ∈ {0, 1, 2, ..., 2<sup>*j*</sup> − 1} is the node index in layer *j*, *l* indexes the wavelet coefficients and *d*<sub>*j*,*n*</sub> represents the coefficient sequence at the *n*th node of the *j*th layer.

Due to the large amount of data, we divided the vibration data into four segments, and a three-layer WPD was performed on each segment using the Daubechies 3 (db3) wavelet to obtain eight nodes [30–32]. The energy *E*<sub>*j*,*n*</sub> of each node can then be calculated through Formula (14):

$$E\_{j,n} = \sum\_{k} \left| d\_{j,n}(k) \right|^2. \tag{14}$$

The total energy of the signal *E* is the sum of the energy of each node in layer three. It can be computed by (15):

$$E = \sum\_{n=0}^{2^3 - 1} E\_{j,n}.\tag{15}$$

and the relative energy *P*<sub>*j*,*n*</sub> is defined by (16):

$$P\_{j,n} = \frac{E\_{j,n}}{E}.\tag{16}$$

Each signal can thus be decomposed into eight nodes, and the energy sequence feature can be expressed as Equation (17) according to Equations (14)–(16):

$$x(i) = \left(P\_{2,i}^{v1},\ P\_{2,i}^{v2}\right) \tag{17}$$

where *x*(*i*) is the energy sequence feature, *i* = 0, 1, ..., 7; *P*<sub>2,*i*</sub><sup>*v*1</sup> and *P*<sub>2,*i*</sub><sup>*v*2</sup> denote the *P*<sub>2,*i*</sub> of *s*<sub>*v*1</sub>(*t*) and *s*<sub>*v*2</sub>(*t*), the vibration signals of the gearbox in the horizontal and vertical directions, respectively.
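A minimal sketch of Equations (14)–(17) using the PyWavelets library; the library choice, the function names and the placeholder channel names `sv1`/`sv2` are our own assumptions, as the paper does not name its implementation.

```python
import numpy as np
import pywt  # PyWavelets

def node_energies(signal: np.ndarray, wavelet: str = "db3",
                  level: int = 3) -> np.ndarray:
    """Relative node energies P_{j,n} of Equations (14)-(16) from a
    three-layer WPD (2^3 = 8 nodes)."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")
    E = np.array([np.sum(node.data ** 2) for node in nodes])  # Eq. (14)
    return E / E.sum()                                        # Eqs. (15)-(16)

def energy_sequence_feature(sv1: np.ndarray, sv2: np.ndarray) -> np.ndarray:
    """Energy sequence feature of Equation (17): the relative energies of
    the horizontal and vertical vibration channels, paired node by node."""
    return np.stack([node_energies(sv1), node_energies(sv2)], axis=1)  # (8, 2)
```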

Wavelet energy entropy: The signal is reconstructed from each of the eight node coefficient sets obtained from the three-layer WPD above, and the reconstructed signal is divided into *N* segments on the basis of the time characteristics of the signal. The energy of each segment is calculated by Formula (14) and normalized by Formulas (15) and (16) to obtain the wavelet energy entropy. The wavelet energy entropy of the *n*th node in the *j*th layer of the WPD is denoted *H*<sub>*j*,*n*</sub> and formulated as follows:

$$H\_{j,n} = -\sum\_{i=1}^{N} P\_{j,n}(i) \log P\_{j,n}(i) \tag{18}$$

where *P*<sub>*j*,*n*</sub>(*i*) is the normalized energy of the *i*th segment of the signal, *i* = 1, 2, ..., *N*. The value of *N* is 50 in this article.

According to the calculated wavelet energy entropy of each node, the wavelet energy entropy feature is formed by Equation (19):

$$T = \left[H\_{3,1}, H\_{3,2}, H\_{3,3}, H\_{3,4}, H\_{3,5}, H\_{3,6}, H\_{3,7}, H\_{3,8}\right] \tag{19}$$
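A sketch of Equations (18) and (19) under the same PyWavelets assumption; we read the text as reconstructing each node's contribution separately, and the small epsilon guarding the logarithm is our own addition.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_energy_entropy(signal: np.ndarray, n_segments: int = 50,
                           wavelet: str = "db3", level: int = 3) -> np.ndarray:
    """Wavelet energy entropy feature T of Equations (18)-(19)."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    entropies = []
    for node in wp.get_level(level, order="freq"):
        # Reconstruct the signal contribution of this node alone.
        sub = pywt.WaveletPacket(data=None, wavelet=wavelet, maxlevel=level)
        sub[node.path] = node.data
        rec = sub.reconstruct(update=False)
        # Segment energies (Eq. (14)), normalized as in Eqs. (15)-(16);
        # N = 50 segments, as stated in the text.
        E = np.array([np.sum(s ** 2) for s in np.array_split(rec, n_segments)])
        P = E / E.sum()
        entropies.append(-np.sum(P * np.log(P + 1e-12)))  # Eq. (18)
    return np.array(entropies)  # feature T of Eq. (19), one entry per node
```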

The fault features obtained above are fed into the Cos-LSTM network to diagnose the gearbox fault. The flow chart of fault diagnosis based on the Cos-LSTM is shown in Figure 4.

**Figure 4.** The flow chart of the Cos-LSTM method for gearbox fault diagnosis.

We used one LSTM hidden layer with eight LSTM cells to extract deeper features. The fault features are first normalized and then fed into the input layer. In this paper, we used *N* samples (*N* = 2200) to train the model. Therefore, the size of the input layer is *N* × 8 (time steps) × 2 (2-dimensional features), and the input size of each LSTM cell is *N* × 2. The last output *h*(7) of the LSTM hidden layer is connected to a fully connected layer with 11 neurons, and the cosine loss is used to calculate the probabilities of the 11 fault patterns.

The parameters of the LSTM neural network are as follows: time steps for the LSTM = 8; LSTM hidden layer neurons = 4; fully connected layer neurons = 11; learning rate = 0.01; number of training iterations = 10,000. The workflow of the Cos-LSTM is shown in Figure 5.

**Figure 5.** The workflow of the Cos-LSTM.
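Putting the pieces together, the following is a hypothetical training loop with the stated hyperparameters. The SGD optimizer and the random placeholder data are assumptions, not taken from the paper; in practice, the normalized fault features and labels described above would be used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stated configuration: 8 time steps, 2-dimensional features, 4 hidden
# neurons, 11 fault classes, learning rate 0.01, 10,000 iterations.
lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
fc = nn.Linear(4, 11)
opt = torch.optim.SGD(list(lstm.parameters()) + list(fc.parameters()), lr=0.01)

features = torch.randn(2200, 8, 2)      # placeholder for the 2200 samples
labels = torch.randint(0, 11, (2200,))  # placeholder fault labels

for step in range(10_000):
    out, _ = lstm(features)
    y = fc(out[:, -1, :])                 # outputs y_j of Eq. (9)
    norms = torch.linalg.norm(y, dim=1)
    y_true = y[torch.arange(y.size(0)), labels]
    loss = (1.0 - y_true / norms).mean()  # cosine loss, Eq. (9)
    opt.zero_grad()
    loss.backward()
    opt.step()
```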
