**3. Methodology**

In this section, the CCA-JITL FD model is introduced in detail, taking into account the characteristics of the signals of the running gears.

### *3.1. Canonical Correlation Analysis and Just-in-Time Learning Methods*

CCA transforms the covariance matrices of the input and output datasets into two subspaces with the greatest correlation by computing linear combinations of the latent dimensions. The algorithm is implemented using singular value decomposition (SVD), which preserves the original trend of the data [4,32]. The algorithm maximizes the Pearson correlation between $X_x$ and $Y_y$, which can be expressed as [4]

$$R\left(X_x, Y_y\right) = \max_{\boldsymbol{u},\boldsymbol{v}} \frac{\boldsymbol{u}^T S_{XY} \boldsymbol{v}}{\sqrt{\boldsymbol{u}^T S_{XX} \boldsymbol{u}} \sqrt{\boldsymbol{v}^T S_{YY} \boldsymbol{v}}} \tag{4}$$

where $S_{XY} = X_x^T Y_y$, $S_{XX} = X_x^T X_x$, and $S_{YY} = Y_y^T Y_y$. The datasets $X_x$ and $Y_y$ given above are first standardized, respectively. The coefficient matrix is then computed as [6]

$$W = S_{XX}^{-\frac{1}{2}} S_{XY} S_{YY}^{-\frac{1}{2}} = \frac{X_x^T Y_y}{\sqrt{X_x^T X_x} \sqrt{Y_y^T Y_y}} \tag{5}$$

The matrix $W$ is then factorized by SVD as [4]

$$\mathcal{W} = \Gamma D V^T \tag{6}$$

where $\Gamma = (u_1, \cdots, u_t)$, $D = \begin{pmatrix} D_h & 0 \\ 0 & 0 \end{pmatrix}$, $V = (v_1, \cdots, v_n)$, $h$ represents the number of top-ranking singular values, and $D_h = \operatorname{diag}(\rho_1, \cdots, \rho_h)$. Through $\omega = \Gamma(:, 1:h)$ and $\psi = V(:, 1:h)$, the related subspaces $H_x$ and $H_y$ are generated as [6]

$$\begin{aligned} H_x &= S_{XX}^{-\frac{1}{2}} \omega = \frac{1}{\sqrt{X_x^T X_x}} \omega \\ H_y &= S_{YY}^{-\frac{1}{2}} \psi = \frac{1}{\sqrt{Y_y^T Y_y}} \psi \end{aligned} \tag{7}$$
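Equations (5)–(7) can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function and variable names, the eigendecomposition-based inverse square root, and the synthetic data are all assumptions.

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

def cca_subspaces(X, Y, h):
    """Return the related subspaces H_x, H_y and top-h singular values (Eqs 5-7).

    X: (N, l) input data, Y: (N, m) output data, both standardized.
    """
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    Sxx_i, Syy_i = inv_sqrt(Sxx), inv_sqrt(Syy)
    W = Sxx_i @ Sxy @ Syy_i                  # Eq (5)
    Gamma, rho, Vt = np.linalg.svd(W)        # Eq (6): W = Gamma D V^T
    omega, psi = Gamma[:, :h], Vt.T[:, :h]   # top-h singular directions
    Hx, Hy = Sxx_i @ omega, Syy_i @ psi      # Eq (7)
    return Hx, Hy, rho[:h]

# Synthetic strongly coupled data: Y is a noisy linear map of X
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
Y = X @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((500, 3))
Hx, Hy, rho = cca_subspaces(X, Y, h=2)
print(rho)  # canonical correlations in [0, 1], near 1 for coupled data
```

The singular values `rho` are the canonical correlations; because `W` is built from whitened data they cannot exceed one, which is the property Eq (4) maximizes.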

The latent space of $X_x$ is divided into two subspaces: the subspace related to $Y_y$ and the subspace unrelated to $Y_y$. Similarly, the latent space of $Y_y$ is divided into the subspace related to $X_x$ and the subspace unrelated to $X_x$. Using the above parameters, the original data inputs are mapped to the related latent spaces $H_x$ and $H_y$, yielding two associated matrices, $P_x$ and $P_y$. When only the first canonical variate pair of CCA is considered, the correlation coefficient between $H_x$ and $H_y$ is $\rho_1$. The data matrices can be formulated as

$$\begin{aligned} P_x &= H_x \left( X_x \left( H_x^T H_x \right)^{-1} H_x^T \right)^T \\ P_y &= H_y \left( Y_y \left( H_y^T H_y \right)^{-1} H_y^T \right)^T \end{aligned} \tag{8}$$

The following JITL algorithm is then applied to $P_x$ and $P_y$, respectively, for data fitting. Different from the traditional global-model approach, this JITL-based approach uses an online local model structure that can effectively track the current status of the system.

JITL improves the prediction of the local FD model using similarity measures. After the most relevant normal data are selected from the database, a distance measure, e.g., the Euclidean distance $d(t_s, t_c) = \|t_s - t_c\|_2$, is employed to evaluate the similarity between $t_s$ and $t_c$. Here, $t_s$ is a data point of the training set and $t_c$ is a data point of the test set; a smaller distance implies a greater similarity between the two vectors [26]. The inverse of the Euclidean distance is used to measure the correlation between the two vectors:

$$S_{i,k} = \frac{1}{\sqrt{e^{\left(\|t_s - t_c\|_2\right)^2}}}, \quad i = 1, 2, 3, \dots, N \tag{9}$$
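Equation (9) can be evaluated directly; note that $1/\sqrt{e^{d^2}} = e^{-d^2/2}$, a Gaussian-type similarity that decays with distance. A minimal sketch, with illustrative variable names and sample data:

```python
import numpy as np

def similarity(train, tc):
    """Eq (9): S_i = 1/sqrt(exp(d_i^2)) = exp(-d_i^2 / 2) for each training point."""
    d2 = np.sum((train - tc) ** 2, axis=1)   # squared Euclidean distances
    return 1.0 / np.sqrt(np.exp(d2))         # identical to np.exp(-d2 / 2)

train = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
tc = np.array([0.0, 0.0])
S = similarity(train, tc)
order = np.argsort(S)[::-1]  # descending similarity, as used by JITL
```

A zero distance yields the maximum similarity of 1, and distant points decay toward 0, which is why a descending sort ranks the most relevant training points first.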

where $S_{i,k}$ represents the magnitude of correlation.

**Remark 3.** *The JITL algorithm arranges the $S_{i,k}$ values in descending order. The number of data points to be selected is determined by calculating the accumulated contribution of $S_{i,k}$ to the variance of the overall data. The mean of $S_{i,k}$ is $\theta_i = \sum_{i=1}^{N} S_{i,k} / N$, and the variance contribution is $G = (S_{i,k} - \theta_i)^2$. The contribution parameter $G$ determines how many data points are included in the testing sample data. For example, the algorithm picks 900 data points when the accumulated $G$ value reaches* 90%*.*
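The selection rule in Remark 3 can be sketched as follows. The 90% target comes from the remark; the exact form of the accumulation is one plausible reading of the formulas, and all names are illustrative:

```python
import numpy as np

def select_by_contribution(S, target=0.90):
    """Keep the top-ranked points whose accumulated contribution G reaches `target`.

    S: similarity values S_{i,k}; G = (S - mean(S))^2 per Remark 3.
    """
    order = np.argsort(S)[::-1]        # descending S_{i,k}
    G = (S[order] - S.mean()) ** 2     # variance contribution of each sorted point
    cum = np.cumsum(G) / G.sum()       # accumulated contribution ratio
    n = int(np.searchsorted(cum, target) + 1)
    return order[:n]                   # indices of the selected data points

rng = np.random.default_rng(1)
S = rng.random(1000)
idx = select_by_contribution(S, target=0.90)
```

The returned indices are still in descending-similarity order, so the local model is built from the most relevant points first.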

**Remark 4.** *JITL selects the testing data points that have lower correlations with the training dataset. In the experiment, the system only takes the last 900 data points from the sorted testing dataset.*

### *3.2. Monitoring Statistics of FD Models*

This section describes the test statistics used for FD; this article uses $T^2$ and SPE. First, the data matrices to be detected are given. Let the input and output matrices obtained after the JITL processing be $\mu_x = [\alpha_r(1), \cdots, \alpha_r(N)] \in \mathbb{R}^{l \times N}$ and $\mu_y = [\alpha_c(1), \cdots, \alpha_c(N)] \in \mathbb{R}^{m \times N}$. According to Formulas (5)–(7), the residual vector is obtained as [6]

$$\mathbf{s} = H_y^T \mu_y - M^T \mu_x \tag{10}$$

where $M^T = D_h H_x^T$.
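The residual of Eq (10) can be formed as below, written with $H_y^T$ acting on the output data and $M^T = D_h H_x^T$ on the input data so that the matrix dimensions match. The matrices here are random placeholders for $H_x$, $H_y$, and the JITL-processed data $\mu_x$, $\mu_y$, purely for illustration:

```python
import numpy as np

l, m, h, N = 4, 3, 2, 100                # input dim, output dim, rank, samples
rng = np.random.default_rng(2)
Hx = rng.standard_normal((l, h))         # related subspace of the input (Eq 7)
Hy = rng.standard_normal((m, h))         # related subspace of the output (Eq 7)
Dh = np.diag([0.95, 0.80])               # top-h canonical correlations
mu_x = rng.standard_normal((l, N))       # JITL-processed input matrix
mu_y = rng.standard_normal((m, N))       # JITL-processed output matrix

M_T = Dh @ Hx.T                          # M^T = D_h H_x^T, shape (h, l)
s = Hy.T @ mu_y - M_T @ mu_x             # Eq (10): residual, one column per sample
```

Each column of `s` is the $h$-dimensional residual for one sample; in the fault-free case the two projections agree up to the canonical correlations, so the residual stays small.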

In the FD algorithm, the statistics and their corresponding thresholds define the boundaries of system prediction. $T^2$ and SPE are the two most commonly used statistics in FD [33–36]. FD is performed separately on the two data matrices $P_x$ and $P_y$; the detection in the subspaces is the same as the routine detection process. Whether the input signals are normal is then judged as follows [4]

$$\begin{aligned} SPE &= \mathbf{s}^T \mathbf{s} \\ SPE_x \le J_{x,th} \quad &\text{and} \quad SPE_y \le J_{y,th} \Rightarrow \text{fault-free} \\ SPE_x > J_{x,th} \quad &\text{or} \quad SPE_y > J_{y,th} \Rightarrow \text{fault} \end{aligned} \tag{11}$$

where $J_{x,th}$ and $J_{y,th}$ are the thresholds for $SPE_x$ and $SPE_y$, respectively. Similarly, the $T^2$ statistic is evaluated as [4]

$$\begin{aligned} T^2 &= \mathbf{s}^T \Lambda^{-1} \mathbf{s} \\ T_x^2 \le T_{x,th} \quad &\text{and} \quad T_y^2 \le T_{y,th} \Rightarrow \text{fault-free} \\ T_x^2 > T_{x,th} \quad &\text{or} \quad T_y^2 > T_{y,th} \Rightarrow \text{fault} \end{aligned} \tag{12}$$

where $\mathbf{s}$ is the residual vector, $\Lambda$ is the covariance matrix of the residuals, and $T_{x,th}$ and $T_{y,th}$ are the thresholds for $T_x^2$ and $T_y^2$, respectively.
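The decision logic of Eqs (11) and (12) can be sketched per sample as follows. The residual matrix, the covariance $\Lambda$, and the threshold values are illustrative placeholders; in practice the thresholds would be estimated from fault-free training data:

```python
import numpy as np

def spe(s):
    """Eq (11): SPE = s^T s, evaluated per sample; s has shape (h, N)."""
    return np.sum(s ** 2, axis=0)

def t2(s, Lam):
    """Eq (12): T^2 = s^T Lam^{-1} s, evaluated per sample."""
    Li = np.linalg.inv(Lam)
    return np.einsum('in,ij,jn->n', s, Li, s)

rng = np.random.default_rng(3)
s = rng.standard_normal((2, 100))        # placeholder residuals, 100 samples
Lam = np.cov(s)                          # covariance of the residuals
J_th, T_th = 9.0, 9.0                    # placeholder thresholds
fault = (spe(s) > J_th) | (t2(s, Lam) > T_th)   # True where a fault is flagged
```

Flagging a fault when either statistic exceeds its threshold mirrors the "or" branches of Eqs (11) and (12).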

### *3.3. Offline Training and Online Detection Algorithms*

The procedures in Algorithm 1 are used for offline training. The steps in Algorithm 2 are used for online detection.
