• Dempster's rule of combination:

Dempster's rule of combination is used to combine the BPA functions of multiple pieces of evidence. Although the rule is controversial, the authors of [31] showed that it behaves well when the evidences do not conflict with each other; it needs to be improved only when highly conflicting evidences are integrated. In this study, there is no serious or complete conflict among the evidences, namely the outputs derived from the vibration signal, temperature signal, and stator current signal. Therefore, Dempster's rule is still used here.

Suppose there are *n* independent pieces of evidence (sensors or expert opinions) with BPA functions *m*1, *m*2, ··· , *mn*, whose focal elements *H*1, *H*2, ··· , *Hn* are subsets of Ω. Then, Dempster's rule for combining the BPA functions on Ω is as follows:

$$m = m\_1 \oplus m\_2 \oplus \dots \oplus m\_n \tag{3}$$

Specifically, it can be expressed as follows:

$$m(H) = \frac{\sum\_{H\_1 \cap \cdots \cap H\_n = H} m\_1(H\_1) m\_2(H\_2) \cdots m\_n(H\_n)}{1 - K}, \quad H \neq \emptyset \tag{4}$$

where the expression of *K* is as follows:

$$K = \sum\_{H\_1 \cap \cdots \cap H\_n = \emptyset} m\_1(H\_1) m\_2(H\_2) \cdots m\_n(H\_n) \tag{5}$$

where *K* measures the degree of conflict between the evidences. When *K* = 1, the evidences are completely conflicting and cannot be synthesized by this formula; when *K* tends to 1, the evidences are highly conflicting, and synthesizing them with this formula may lead to results contrary to fact [32].
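To make the rule concrete, here is a minimal Python sketch of Equations (3)–(5) for BPAs over a finite frame of discernment; the dict layout, function names, and example masses are illustrative, not from the paper:

```python
from itertools import product

def combine_two(m1, m2):
    """Combine two BPAs given as dicts {frozenset_of_hypotheses: mass}."""
    combined, conflict = {}, 0.0
    for (h1, w1), (h2, w2) in product(m1.items(), m2.items()):
        inter = h1 & h2
        if inter:                       # non-empty intersection: mass supports H
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:                           # empty intersection: mass goes to K, Eq. (5)
            conflict += w1 * w2
    if conflict >= 1.0:
        raise ValueError("K = 1: completely conflicting evidence, Eq. (4) undefined")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}  # Eq. (4)

def dempster_combine(*masses):
    """Fold n BPAs: m = m1 (+) m2 (+) ... (+) mn, Eq. (3)."""
    result = masses[0]
    for m in masses[1:]:
        result = combine_two(result, m)
    return result

# Illustrative example on a two-state frame:
N, F = frozenset({"normal"}), frozenset({"fault"})
m_a = {N: 0.7, F: 0.2, N | F: 0.1}
m_b = {N: 0.6, F: 0.3, N | F: 0.1}
print(dempster_combine(m_a, m_b))
```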

• Decision rules:

The decision rule draws a diagnosis based on the uncertainty interval [*bel*(*H*), *pl*(*H*)] of the evidence. Within the interval [0, 1], the uncertainty of a proposition is represented as shown in Figure 1.

**Figure 1.** Uncertainty representation of a proposition.

In Figure 1, [0, *bel*(*H*)] is the support interval, [0, *pl*(*H*)] is the acceptance interval, [*pl*(*H*), 1] is the rejection interval, and [*bel*(*H*), *pl*(*H*)] is the uncertainty interval.

When making a decision, a value in the uncertainty interval is chosen as the final trustworthiness of the proposition. The hypothesis with the highest trustworthiness among all possible hypotheses is taken as the final decision result.
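As an illustration, the sketch below computes *bel*(*H*) and *pl*(*H*) from a fused BPA in the dict layout of the previous sketch and picks the hypothesis with the highest trustworthiness; taking the interval midpoint as the chosen value is one possible choice, not prescribed by the paper:

```python
def bel(m, H):
    """Belief: total mass of the subsets of H."""
    return sum(w for A, w in m.items() if A <= H)

def pl(m, H):
    """Plausibility: total mass of the sets intersecting H."""
    return sum(w for A, w in m.items() if A & H)

def decide(m, hypotheses):
    """Pick the hypothesis with the highest chosen trustworthiness."""
    scores = {}
    for h in hypotheses:
        H = frozenset({h})
        scores[h] = 0.5 * (bel(m, H) + pl(m, H))   # a value inside [bel, pl]
    return max(scores, key=scores.get)
```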

#### *2.2. Posterior Probability Least Squares Support Vector Machine*

In this study, the number of collected fault samples is limited, while the support vector machine (SVM) and least squares support vector machine (LSSVM) can achieve high diagnosis accuracy on small sample sets. Moreover, LSSVM is faster than SVM, so LSSVM is selected as the base classifier for state recognition. As the input parameters of D–S evidence fusion are basic probability assignments over all classes, the hard output (class label) of the traditional classifier has to be converted into a soft one (a probability) [33]; that is, the output of the classifier must be changed into a posterior probability output. For the two-class problem, the posterior probability can be calculated by using the sigmoid function to map the output *f*(*x*) of the LSSVM onto the [0, 1] interval. Assuming that the probability follows the sigmoid distribution, the posterior probability can be calculated as [34]:

$$p(y=1/f) = \frac{1}{1 + \exp(Af + B)}\tag{6}$$

$$p(y=-1/f) = 1 - p(y=1/f) \tag{7}$$

where *f* is the output of the standard LSSVM, *p*(*y* = 1/ *f*) is the probability that the classification is correct given the output value *f*, *p*(*y* = −1/ *f*) is the probability that the classification is wrong given the output value *f*, and *A* and *B* are parameters. Thus, the key to calculating the posterior probability is to obtain the parameters *A* and *B*. The posterior probability least squares support vector machine model is usually established by first building a standard LSSVM model, and then fitting *A* and *B* on the training set (*fi*, *ti*), where *ti* is the target probability output of the standard LSSVM:

$$t\_i = \begin{cases} \dfrac{N\_+ + 1}{N\_+ + 2}, & f\_i = +1 \\ \dfrac{1}{N\_- + 2}, & f\_i = -1 \end{cases} \tag{8}$$

where *N*+ is the number of positive samples and *N*− is the number of negative samples. Obtaining the parameters *A* and *B* then amounts to solving the following negative log-likelihood minimization problem, i.e.,

$$\min \left\{ -\sum\_{i=1}^{n} [t\_i \log(p\_i) + (1 - t\_i) \log(1 - p\_i)] \right\} \tag{9}$$

where

$$p\_i = \frac{1}{1 + \exp(Af\_i + B)}\tag{10}$$

The Hessian matrix for solving (9) is as follows:

$$H(z) = \begin{bmatrix} \sum\_{i} f\_i^2 p\_i (1 - p\_i) & \sum\_{i} f\_i p\_i (1 - p\_i) \\ \sum\_{i} f\_i p\_i (1 - p\_i) & \sum\_{i} p\_i (1 - p\_i) \end{bmatrix} \tag{11}$$

To reach the minimum of (9), the Hessian matrix must be positive definite; that is, all eigenvalues of the matrix must be greater than zero. *A* and *B* are then obtained by solving this optimization problem, and the posterior probability follows.
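As an illustration, the sketch below fits *A* and *B* with Newton steps built from the gradient of (9) and the Hessian (11); the choice of Newton's method and all names are assumptions, since the paper does not specify the solver:

```python
import numpy as np

def targets(labels):
    """Target probabilities of Eq. (8) from the +1/-1 training labels."""
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
    return np.where(labels == 1, (n_pos + 1.0) / (n_pos + 2.0),
                    1.0 / (n_neg + 2.0))

def fit_sigmoid(f, t, iters=100, tol=1e-12):
    """Fit A, B in p = 1/(1 + exp(A*f + B)) by Newton steps on Eq. (9)."""
    A, B = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))                  # Eq. (10)
        g = np.array([np.sum((t - p) * f), np.sum(t - p)])   # gradient of (9)
        w = p * (1.0 - p)
        H = np.array([[np.sum(f * f * w), np.sum(f * w)],    # Hessian, Eq. (11)
                      [np.sum(f * w),     np.sum(w)]])
        step = np.linalg.solve(H + 1e-12 * np.eye(2), g)     # keep H invertible
        A, B = A - step[0], B - step[1]
        if np.abs(step).max() < tol:
            break
    return A, B
```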

It has been shown that the posterior probability least squares support vector machine with the sigmoid function works well in practical applications [35], but this method can only be used for the two-class problem. The main methods for extending LSSVM from two-class to multi-class classification are the "one-versus-one" and "one-versus-all" strategies. The Platt algorithm calculates the probability for each classifier as follows, where *pm* is the probability that sample *x* belongs to the *m*-th class [35]:

$$p\_m(y = m/x) = \frac{1}{1 + \exp(A\_m f(x) + B\_m)} \tag{12}$$
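A small sketch of Equation (12): each class's sigmoid converts its LSSVM decision value into a probability, and the vector is normalized so it can serve as a BPA (the normalization step is an added assumption, as are the names):

```python
import numpy as np

def class_probabilities(decision_values, sigmoid_params):
    """decision_values: f_m(x) for each class m; sigmoid_params: (A_m, B_m) pairs."""
    p = np.array([1.0 / (1.0 + np.exp(A * f + B))      # Eq. (12) per class
                  for f, (A, B) in zip(decision_values, sigmoid_params)])
    return p / p.sum()                                  # normalize to act as a BPA
```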

#### *2.3. The Improved Artificial Bee Colony*

Three kinds of kernel functions are commonly used in LSSVM: the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel (*K*(*xi*, *xj*) = exp(−‖*xi* − *xj*‖<sup>2</sup>/*σ*<sup>2</sup>), where *σ* is the kernel width). Many studies and experiments [36] show that, compared with other kernel functions, the RBF kernel maps the original space into an infinite-dimensional space and finds the separating hyperplane better, making it the preferred choice. Therefore, it is necessary to select the regularization parameter γ (required by LSSVM, determining the trade-off between training error minimization and smoothness) and the squared kernel bandwidth *σ*<sup>2</sup>.

Choosing good parameter values can greatly improve the performance of the LSSVM classifier and the accuracy of diagnosis. The commonly used methods include trial and error, cross validation, grid search, and intelligent optimization algorithms [37]. Among them, trial and error not only consumes time and effort, but also makes the choice of parameters strongly dependent on subjective factors; cross validation divides the data set into training, validation, and testing subsets, and different proportions lead to different optimal models and optimal parameters; and grid search steps through the parameter ranges at a set step size to determine the optimal parameters, so its search speed is slow and its precision is limited. This highlights the advantages of intelligent optimization algorithms. Such an algorithm finds the optimal distribution of food sources by simulating the behavior of animals in a population (information exchange and cooperation among individuals). A swarm intelligence optimization algorithm is easy to implement and highly efficient, so it is applied to the parameter optimization of LSSVM.

Swarm intelligence optimization algorithms include the genetic algorithm, particle swarm optimization, the artificial fish swarm algorithm, the artificial bee colony algorithm, and so on. Among them, the artificial bee colony (ABC) algorithm is a recently proposed optimization algorithm that not only has good optimization ability, but also requires few control parameters; furthermore, it is simple, flexible, and easy to implement. Research [38] shows that the optimization performance of ABC is better than that of the genetic algorithm and particle swarm optimization, and that the diagnosis accuracy of LSSVM optimized by ABC is higher than that of LSSVM optimized by either of those two algorithms.

However, ABC has some shortcomings, such as slow convergence in the later stage of a run and a tendency to fall into local optima. Therefore, in this paper, on the one hand, chaotic initialization is introduced into the artificial bee colony algorithm to initialize the population positions, improving the diversity of the population and the ergodicity of the search process. On the other hand, in the employed bee stage, the bees are divided into two parts: one part collects the optimal information of the region according to the original algorithm, and the other performs Lévy flights around the global optimal solution to improve the global search capability. At the same time, in the onlooker bee stage, a search strategy based on the current local optimal solutions (called pbest) is adopted to improve the local search ability of the algorithm.

(1) The logistic chaotic map is used to initialize the population. The equation of the logistic chaotic map is as follows:

$$y\_{t+1} = \mu y\_t (1 - y\_t) \quad t = 0, 1, \dots, l \tag{13}$$

where *yt* ∈ (0, 1), *t* is the iteration index of the chaotic sequence, and *μ* is the control parameter of the chaotic sequence, whose value range is [3.75, 4] [39].
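A minimal sketch of chaotic initialization with Equation (13), scaling each chaotic value onto the search bounds; the scaling step and all names are illustrative assumptions:

```python
import numpy as np

def chaotic_init(n_solutions, dim, lower, upper, mu=4.0, y0=0.37):
    """Initialize the population by iterating the logistic map, Eq. (13)."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    y = y0                                   # y_0 in (0, 1)
    pop = np.empty((n_solutions, dim))
    for i in range(n_solutions):
        for j in range(dim):
            y = mu * y * (1.0 - y)           # Eq. (13), mu in [3.75, 4]
            pop[i, j] = lower[j] + y * (upper[j] - lower[j])  # scale to bounds
    return pop

# e.g., 20 candidate (gamma, sigma^2) pairs for the LSSVM parameter search:
print(chaotic_init(20, 2, lower=[0.01, 0.01], upper=[1000.0, 100.0])[:3])
```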

(2) Lévy flight is introduced into the evolution strategy to improve the performance of the algorithm, with good results [39]. The calculation is based on

$$L(\alpha) = \left[ \frac{\Gamma(1+\alpha)\sin\left(\frac{\pi\alpha}{2}\right)}{\Gamma\left(\frac{1+\alpha}{2}\right)\alpha 2^{\left(\frac{\alpha-1}{2}\right)}} \right]^{1/\alpha} \tag{14}$$

where *α* is the characteristic index, which usually satisfies 0 < *α* < 2. Γ(·) is the Gamma function defined as

$$
\Gamma(z) = \int\_0^\infty t^{z-1} e^{-t} dt.
$$

Its update equation is as follows:

$$v\_{ij} = x\_{ij} + \alpha \left(x\_{ij} - x\_{best}\right) L(\alpha) \tag{15}$$

where *α* is the step length, which usually follows the standard normal distribution; *L*(·) is the random search path of the Lévy flight; and *vij* is the new candidate solution.
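One common way to realize this update is Mantegna's algorithm, in which Equation (14) supplies the scale of the Gaussian numerator of the Lévy variate; its use here is an assumption, as the paper does not state how the Lévy step is sampled. A sketch:

```python
import numpy as np
from math import gamma, sin, pi

def levy_scale(alpha):
    """Eq. (14): scale of the Gaussian numerator in Mantegna's algorithm."""
    return (gamma(1 + alpha) * sin(pi * alpha / 2)
            / (gamma((1 + alpha) / 2) * alpha * 2 ** ((alpha - 1) / 2))) ** (1 / alpha)

def levy_flight_update(x_i, x_best, alpha=1.5, rng=np.random.default_rng()):
    """Eq. (15): new candidate around the global best via a Levy step."""
    sigma = levy_scale(alpha)
    u = rng.normal(0.0, sigma, x_i.shape)
    v = rng.normal(0.0, 1.0, x_i.shape)
    L = u / np.abs(v) ** (1.0 / alpha)       # Levy-distributed search path
    step = rng.normal()                      # step length ~ N(0, 1)
    return x_i + step * (x_i - x_best) * L
```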

(3) In the onlooker bee stage, for each current solution in each generation, one solution is randomly selected from the top *p*% of all current solutions; this solution (called pbest) is used to balance the global search and local exploitation capabilities. The neighborhood search formula is as follows:

$$v\_{ij} = x\_{ij} + \phi\_{ij} \left(x\_{ij} - x\_{kj}\right) + \varphi\_{ij} \left(x\_{randp,j}^{best} - x\_{ij}\right) \tag{16}$$

where *k* ∈ {1, 2, ··· , *SN*}, *SN* is the number of solutions in the bee colony, *j* ∈ {1, 2, ··· , *D*}, *D* is the dimension of the optimization problem, *k* ≠ *i*, *ϕij* ∈ [−1, 1], *φij* ∈ [0, 1.5], and *x*<sup>best</sup><sub>randp,j</sub> is a solution randomly selected from the top *p*% solutions.
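A sketch of this onlooker-bee move, with one random solution drawn from the top *p*% (fitness sorted for minimization, an assumed convention); names are illustrative:

```python
import numpy as np

def pbest_search(pop, fitness, i, p=10.0, rng=np.random.default_rng()):
    """One onlooker-bee move on solution i following Eq. (16)."""
    SN, D = pop.shape
    j = rng.integers(D)                                   # random dimension
    k = rng.choice([s for s in range(SN) if s != i])      # neighbor, k != i
    n_top = max(1, int(np.ceil(SN * p / 100.0)))
    top = np.argsort(fitness)[:n_top]                     # top p% (minimization)
    x_pbest = pop[rng.choice(top)]                        # random pick among them
    phi = rng.uniform(-1.0, 1.0)                          # phi_ij in [-1, 1]
    psi = rng.uniform(0.0, 1.5)                           # second coefficient in [0, 1.5]
    v = pop[i].copy()
    v[j] += phi * (pop[i, j] - pop[k, j]) + psi * (x_pbest[j] - pop[i, j])
    return v
```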

#### **3. Specific Steps for Misalignment Diagnosis**

D–S evidence theory is used to carry out the fault diagnosis of wind turbines. The specific steps are as follows.

(1) Identify the frame of discernment of the fault diagnosis system

The frame of discernment consists of the common misalignment faults of wind turbines considered in this study, together with the normal working state of the unit. It is expressed as follows: {normal, parallel misalignment, angular misalignment, integrated misalignment}.

(2) Determination of evidence

The posterior probability least squares support vector machines are trained separately on the vibration signal, temperature signal, and stator current signal feature vectors. The hard outputs of the traditional LSSVM are mapped onto the [0, 1] interval using the sigmoid function, and the resulting soft outputs are used as evidences for D–S evidence theory.

(3) Determination of basic probability assignment function, belief function, and plausibility function

Each of the three least squares support vector machines gives a probability vector over all classes on the entire frame of discernment. These probability vectors are used directly as the basic probability assignments, from which the belief and plausibility functions are calculated.

(4) Evidence synthesis and diagnosis

According to Dempster's rule, the probability vectors participate directly in the evidence fusion process. After the fused probability vector is obtained, the final diagnosis result is derived from it.
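Since the BPAs here are probability vectors over singletons, Dempster's rule reduces to an elementwise product followed by renormalization. The sketch below illustrates steps (2)–(4); the soft outputs are made-up numbers and all names are illustrative:

```python
import numpy as np

STATES = ["normal", "parallel", "angular", "integrated"]

def fuse(prob_vectors):
    """Fuse singleton BPAs (probability vectors that sum to 1)."""
    fused = np.ones(len(STATES))
    for p in prob_vectors:
        fused *= p                    # numerator of Eq. (4) for singletons
    K = 1.0 - fused.sum()             # conflict, Eq. (5)
    return fused / (1.0 - K)

p_vib  = np.array([0.10, 0.70, 0.10, 0.10])   # soft output, vibration LSSVM
p_temp = np.array([0.20, 0.50, 0.20, 0.10])   # soft output, temperature LSSVM
p_cur  = np.array([0.10, 0.60, 0.20, 0.10])   # soft output, stator current LSSVM
fused = fuse([p_vib, p_temp, p_cur])
print(STATES[int(np.argmax(fused))], fused)   # final diagnosis
```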

Figure 2 summarizes the process of D–S evidence-based misalignment diagnosis.

**Figure 2.** Process of Dempster–Shafer (D–S) evidence fusion. LSSVM, least square support vector machine; BPA, basic probability assignment.

#### **4. The Simulation Case Studies of Misalignment Fault Diagnosis**

The simulated wind turbine system is established with ADAMS 2013, MATLAB R2014a, and Ansys 17.0. The three-dimensional (3D) model of the 1.5 MW wind turbine is built in SolidWorks and then imported into ADAMS 2013, where the Marker point is moved according to the type and degree of misalignment; that is, parallel misalignment is simulated by offsetting the center of mass from the center of rotation by a certain distance; angular misalignment is simulated by rotating the Marker by a certain angle around the y-axis and placing the rotation axis of the coupling, relative to the ground, on the z-axis of the Marker point; and integrated misalignment is simulated by applying the parallel and angular misalignments simultaneously in the local coordinate system (Marker) of the left half coupling. The correctness of these models has been verified in the literature [40]. The vibration signals were extracted at an input speed of 81.3°/s, using a step function as the ADAMS input, with a simulation time of 1.5 s and 6000 simulation steps. The wind turbine model and its control system are built in SIMULINK/MATLAB, where the stator current is sampled at the same speed as the vibration signal, with a sampling frequency of 200 kHz. The correctness of these models has been verified in the literature [41]. After that, the high-speed gear shaft and the main shaft of the generator are imported into HyperMesh for meshing, and the model is then imported into Ansys Workbench to obtain the corresponding temperature signals (details in the literature [42]). In this paper, 100 samples are taken for each of the four diagnostic states (normal, parallel misalignment, angular misalignment, and integrated misalignment), of which 60 are for training and 40 are for testing. Thus, there are 240 (60 × 4) samples in the training set and 160 (40 × 4) samples in the testing set.

#### *4.1. Data Processing*

After the vibration, temperature, and stator current signals under the four working conditions are collected, feature indexes in the time, frequency, and time–frequency domains are extracted in order to make better use of the signals and obtain good diagnosis results. Table 2 shows the 21-dimension mixed feature library of the vibration signal.

**Table 2.** Mixed feature library of vibration signals. IEMD, image extended empirical mode decomposition.


Suppose signal *x* = (*x*0, *x*1, *x*2, ··· , *x*<sub>*N*−1</sub>) is a discrete time series of finite length *N*. The calculation formulas of the time domain characteristic indexes are shown in Table 3, where *x̄* is the mean value of the signal, the average amplitude is the mean of |*x*|, and *xp* is the peak value of the signal.

**Table 3.** Time domain characteristic index.
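Assuming the usual definitions for indexes of this kind, the sketch below computes several common time-domain indexes from the quantities defined above; the exact set listed in Table 3 is not reproduced here, so treat the selection as illustrative:

```python
import numpy as np

def time_domain_indexes(x):
    mean    = np.mean(x)                       # x-bar, the mean value
    abs_avg = np.mean(np.abs(x))               # average amplitude, mean of |x|
    rms     = np.sqrt(np.mean(x ** 2))         # root mean square value
    peak    = np.max(np.abs(x))                # x_p, the peak value
    std     = np.std(x)
    return {
        "rms": rms,
        "skewness": np.mean((x - mean) ** 3) / std ** 3,
        "kurtosis": np.mean((x - mean) ** 4) / std ** 4,
        "crest_factor": peak / rms,
        "shape_factor": rms / abs_avg,
        "impulse_factor": peak / abs_avg,
        "margin_factor": peak / np.mean(np.sqrt(np.abs(x))) ** 2,
    }
```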


In signal analysis, power spectrum analysis is usually used to extract the frequency domain indexes. The center of gravity frequency, mean square frequency, root mean square frequency, and frequency variance are commonly used. With the sampling frequency set as *fs*, the calculation formula of each index is shown in Table 4, where *S*(*ω*) is the power spectrum of the discrete time series, *S*(*ω*) = *X*(*ω*)·*X*\*(*ω*), *X*(*ω*) = ∑<sub>*i*=0</sub><sup>*N*−1</sup> *x*(*i*)*e*<sup>−*jωi*</sup>, and *ω* is the angular frequency.


**Table 4.** Frequency domain characteristic index.
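As a sketch of how these four indexes can be computed from the power spectrum, the snippet below uses their standard definitions (an assumption, as the formulas of Table 4 are not reproduced here):

```python
import numpy as np

def frequency_domain_indexes(x, fs):
    X = np.fft.rfft(x)
    S = (X * np.conj(X)).real                   # power spectrum S(w) = X(w)X*(w)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)     # frequency axis from fs
    fc   = np.sum(f * S) / np.sum(S)            # center of gravity frequency
    msf  = np.sum(f ** 2 * S) / np.sum(S)       # mean square frequency
    rmsf = np.sqrt(msf)                         # root mean square frequency
    vf   = np.sum((f - fc) ** 2 * S) / np.sum(S)  # frequency variance
    return fc, msf, rmsf, vf
```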

Time–frequency analysis is a fault diagnosis method that captures how the frequency content of a signal changes with time. In this paper, image extended empirical mode decomposition (IEMD) is used to process the vibration signal, and the dual tree complex wavelet transform (DTCWT) is used to process the stator current signal (see the literature [43] for details).

The gearbox tooth temperature *T*<sub>1</sub> and the generator rotor shaft temperature *T*<sub>2</sub> are selected as the characteristic values of the temperature signal, forming a two-dimensional feature vector *X* = [*T*<sub>1</sub>, *T*<sub>2</sub>].

Table 5 shows the 29-dimension mixed feature library of the stator current signal in the time, frequency, and time–frequency domains (see the literature [41] for details).


**Table 5.** Mixed feature library of stator current signals.

In order to eliminate the influence of different input dataset dimensions and large numerical differences, the original dataset is normalized, i.e.,

$$y = \frac{y\_{\max} - y\_{\min}}{x\_{\max} - x\_{\min}} (x - x\_{\min}) + y\_{\min} \tag{17}$$

where *x* is the value to be normalized, *ymin* is the lower bound of the normalized interval, and *ymax* is the upper bound of the normalized interval. In this paper, *ymin* = 0, *ymax* = 1, and the vector is normalized by column.
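A minimal sketch of Equation (17) applied column by column with *ymin* = 0 and *ymax* = 1, as stated above; the function name is illustrative:

```python
import numpy as np

def minmax_normalize(X, y_min=0.0, y_max=1.0):
    """Eq. (17), applied per column as in the paper."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)   # per-column bounds
    return (y_max - y_min) / (x_max - x_min) * (X - x_min) + y_min
```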

Because of the high dimensionality of the feature vectors constructed from the vibration and stator current signals, the computational burden increases and fault diagnosis becomes more difficult [44]. In order to make better use of the information and obtain good diagnostic results, the feature vectors are reduced in dimensionality using t-SNE.

t-SNE retains the similarity between high-dimensional and low-dimensional data based on conditional probabilities, adopts a symmetric objective function, and replaces the Gaussian distribution with the t distribution in the low-dimensional space, which alleviates the crowding problem and yields a clear visualization [45]. Its implementation steps are as follows:

(1) Compute the conditional probabilities *pj*/*i* and the joint probabilities *pij* of the high-dimensional data points, and define the cost function *C* as the Kullback–Leibler divergence between the joint distributions of the high-dimensional and low-dimensional spaces:

$$C = \sum\_{i} \sum\_{j} p\_{ij} \log \frac{p\_{ij}}{q\_{ij}} \tag{18}$$

(2) Determine the bandwidth of the Gaussian centered on each data point from a preset perplexity:

$$\text{perp}(p\_i) = 2^{H(p\_i)} \tag{19}$$

$$H(p\_i) = -\sum\_j p\_{j/i} \log\_2 p\_{j/i} \tag{20}$$

where *pi* is the conditional probability distribution of the other data points with respect to *xi*, *pj*/*i* is the conditional probability of the high-dimensional data, *pij* is the joint probability density in the high-dimensional space, and *qij* is the joint probability density in the low-dimensional mapping space.

(3) Define the optimization parameters: the number of iterations *T*, the learning rate *η*, and the momentum factor *α*(*t*) (0 < *α*(*t*) < 1) at the *t*-th (*t* ≤ *T*) iteration. The cost function *C* is minimized by the gradient descent method, and the low-dimensional mapping of the high-dimensional data is finally obtained:

$$\frac{\delta C}{\delta y\_i} = 4 \sum\_{j} (p\_{ij} - q\_{ij}) \left( y\_i - y\_j \right) \left( 1 + \left\| y\_i - y\_j \right\|^2 \right)^{-1} \tag{21}$$

where *yi* and *yj* are the mappings of the high-dimensional data points *xi* and *xj* in the low-dimensional space.

In order to speed up the optimization and prevent it from being trapped in local minima, a relatively large momentum term is imposed on the descent process: at each iteration, the current gradient is added to an exponentially decaying sum of the previous gradients to determine the coordinates of the low-dimensional data. The momentum update is as follows:

$$y^{(t)} = y^{(t-1)} + \eta \frac{\delta C}{\delta y} + \alpha(t) \left( y^{(t-1)} - y^{(t-2)} \right) \tag{22}$$

where *y* is the data in the low-dimensional space.
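A sketch of one iteration combining the gradient of Equation (21) with the momentum update of Equation (22); the gradient term is subtracted here so the cost decreases, the joint probability matrix P is assumed precomputed, and all names are illustrative:

```python
import numpy as np

def tsne_step(Y, Y_prev, P, eta=200.0, alpha_t=0.5):
    """One gradient/momentum update of the low-dimensional coordinates Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # (1 + |y_i - y_j|^2)^-1
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                  # low-dimensional q_ij
    PQ = (P - Q) * W
    grad = 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)   # Eq. (21)
    return Y - eta * grad + alpha_t * (Y - Y_prev)        # Eq. (22), descent
```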
