#### *2.1. Introduction of the TCN*

Rooted in the convolutional neural network (CNN) [30], the TCN has been shown to match or even outperform the recurrent neural network (RNN) [31] in dealing with temporal data [27]. The TCN is mainly composed of three parts: causal convolution, dilated convolution, and the residual module, as depicted in Figure 1. Each part is introduced in detail below.

Causal convolution is a one-way structure, i.e., the value at time *t* in the upper layer depends only on the values at and before time *t* in the layer below [27], as shown in Figure 1a. Causal convolution thus enforces the temporal-order constraint. Dilated convolution is designed to solve the problem that the modeling length for temporal data is restricted by the size of the convolution kernel. In Figure 1b, for example, the dilation factor *d* = 1 in the first layer means that every sample enters the convolution, while *d* = 4 means that only every fourth sample does. As a result, the TCN can obtain a larger receptive field with fewer layers. The function *F* of dilated convolution at element *s* of the sequence *X* is:

$$F(s) = (X \*\_d f)(s) = \sum\_{i=0}^{k-1} f(i) \cdot X\_{s-d \cdot i} \tag{1}$$

where *f* is the convolution kernel (filter), *k* is the size of the convolutional kernel, and *d* is the dilation factor.
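As a quick illustration of Equation (1), the following NumPy sketch (illustrative only, not the paper's implementation) computes a dilated causal convolution, treating out-of-range past samples as zero:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Dilated causal convolution, Equation (1):
    F(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i], with x[t] = 0 for t < 0."""
    k = len(f)
    y = np.zeros_like(x, dtype=float)
    for s in range(len(x)):
        for i in range(k):
            t = s - d * i
            if t >= 0:  # causality: only current and past samples contribute
                y[s] += f[i] * x[t]
    return y

x = np.arange(8, dtype=float)        # toy input sequence 0..7
f = np.array([1.0, 1.0])             # kernel of size k = 2
y1 = dilated_causal_conv(x, f, d=1)  # combines neighbouring samples
y4 = dilated_causal_conv(x, f, d=4)  # combines every fourth sample
```

With kernel size *k* and dilation *d*, each output depends on a receptive field of (*k* − 1)*d* + 1 input samples, which is why stacking layers with growing *d* enlarges the receptive field quickly.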

As shown in Figure 1c, the residual module contains two layers of dilated causal convolution with ReLU activations. In addition, the TCN applies dropout after each convolution layer to achieve regularization. The residual module can be expressed as:

$$z^{(i)} = z^{(i-1)} + F(z^{(i-1)}) \tag{2}$$
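The skip connection of the residual module can be sketched as follows; the two "convolution" layers are stand-in dense maps with toy weights `w1`, `w2` (illustrative only), and dropout is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(z, w1, w2):
    """Sketch of the TCN residual module: the block output is the
    input plus the transformed branch F(z), i.e. z_i = z_{i-1} + F(z_{i-1})."""
    branch = relu(z @ w1)       # stand-in for first dilated causal conv + ReLU
    branch = relu(branch @ w2)  # stand-in for second dilated causal conv + ReLU
    return z + branch           # skip connection

z = rng.standard_normal((4, 8))          # (time steps, channels)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
out = residual_block(z, w1, w2)
```

The skip connection lets gradients bypass the branch, which is what keeps deep TCN stacks trainable.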

**Figure 1.** Structure of the time convolutional network (TCN) with (**a**) causal convolution, (**b**) dilated convolution, and (**c**) the residual module.

#### *2.2. Introduction of the DANN*

Proposed by Ganin et al. [28], the DANN has become a widely used domain adaptation model. The structure of the DANN is shown in Figure 2. Given source domain samples (*x<sup>s</sup><sub>i</sub>*, *yi*) and domain-labeled samples (*xi*, *di*), the green part is a feature extractor *Gf*(·; *θf*), the blue part is a source domain classifier *Gy*(·; *θy*), and the red part is a domain classifier *Gd*(·; *θd*). The adversarial training strategy means that *Gy*(·; *θy*) learns to classify the source domain data using the features extracted by *Gf*(·; *θf*) while ensuring that *Gd*(·; *θd*) cannot recognize which domain the data come from.

The training process of the DANN mainly concentrates on optimizing *Gy*(·; *θy*) and *Gd*(·; *θd*). Then, the loss function of *Gy*(·; *θy*) is:

$$L\_y^i(\theta\_f, \theta\_y) = L\_y(G\_y(G\_f(\mathbf{x}\_i^s; \theta\_f); \theta\_y), y\_i) \tag{3}$$

The loss function of *Gd*(·; *θd*) is:

$$L\_d^i(\theta\_f, \theta\_d) = L\_d(G\_d(G\_f(\mathbf{x}\_i; \theta\_f); \theta\_d), d\_i) \tag{4}$$

By adding a gradient reversal layer between *Gf*(·; *θf*) and *Gd*(·; *θd*), the total loss function of the DANN becomes:

$$\begin{split} E(\theta\_f, \theta\_y, \theta\_d) &= \sum\_{i=1}^{N} L\_y(G\_y(G\_f(\mathbf{x}\_i; \theta\_f); \theta\_y), y\_i) - \lambda \sum\_{i=1}^{N} L\_d(G\_d(G\_f(\mathbf{x}\_i; \theta\_f); \theta\_d), d\_i) \\ &= \sum\_{i=1}^{N} L\_y^i(\theta\_f, \theta\_y) - \lambda \sum\_{i=1}^{N} L\_d^i(\theta\_f, \theta\_d) \end{split} \tag{5}$$

Using a back-propagation optimization algorithm such as stochastic gradient descent (SGD), Equation (5) can be optimized to reach equilibrium, and a domain-invariant feature representation can then be determined. The test data can then be classified using *Gy*(*Gf*(*xi*; *θf*); *θy*).
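A toy numerical sketch of Equation (5), with made-up classifier probabilities and *λ* = 0.5 (purely illustrative; in practice these losses come from the trained networks):

```python
import numpy as np

def cross_entropy(p, label):
    """Binary cross-entropy for a single predicted probability."""
    return -np.log(p if label == 1 else 1.0 - p)

# Hypothetical per-sample outputs: label-classifier probabilities on the
# source data and domain-classifier probabilities on the same samples.
p_label  = np.array([0.9, 0.8, 0.95])   # G_y outputs, true labels all 1
p_domain = np.array([0.6, 0.5, 0.55])   # G_d outputs, true domains all 1
lam = 0.5                               # trade-off parameter lambda

L_y = sum(cross_entropy(p, 1) for p in p_label)   # label loss (small: G_y is good)
L_d = sum(cross_entropy(p, 1) for p in p_domain)  # domain loss (large: G_d is confused)
E = L_y - lam * L_d                               # total objective, Equation (5)
```

The minus sign rewards a confused domain classifier: the better the features fool *Gd*, the larger *L_d* and the smaller *E*.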

**Figure 2.** Schematic diagram of the domain adversarial neural network (DANN).

#### **3. Proposed Approach of Early Fault Online Detection**

It is worth noting that the proposed method is unsupervised, i.e., all training data are whole-life degradation sequences with no state labels. The training data include a sufficient amount of data from offline working conditions (for example, in a laboratory) and a small amount of data from online working conditions (for example, in a real application). The goal of the proposed method is to recognize the occurrence of an early fault in the online data of a target bearing.

To reach this goal, the proposed method contains an offline stage and an online stage. In the offline stage, the method first obtains the label information of the offline data via state assessment and then extracts a domain-invariant feature representation between the offline and online working conditions. Following this idea, a new state assessment method and a novel deep domain adaptation model named DTDA are proposed. In the online stage, each sequentially collected data batch is fed into this feature representation to obtain discriminative features for the online task and directly produce the detection results. The steps stated above are shown in Figure 3.

#### *3.1. State Assessment*

The offline data are a whole-life degradation sequence, so we need to determine the label information of the normal state and early fault before conducting domain adaptation. Therefore, an efficient state assessment method is first presented in this section. Given a raw vibration signal sequence of a rolling bearing, the steps of the state assessment are as follows:

(1) Process the raw vibration signal with the Hilbert–Huang transform (HHT). First, decompose the raw signal *x*(*t*) as: $x(t) = \sum\_{i=1}^{k} c\_i(t) + r\_k(t)$, where *ci*(*t*) is the *i*-th intrinsic mode function (IMF) component and *rk*(*t*) is the residual term. Second, run the Hilbert transform for each IMF component: $H[x(t)] = \frac{1}{\pi} \int\_{-\infty}^{+\infty} \frac{x(\tau)}{t - \tau} d\tau$, and get the corresponding analytic signal: $C\_i^A(t) = c\_i(t) + j c\_i^H(t) = a\_i(t) e^{j\theta\_i(t)}$, where $c\_i^H(t) = \frac{1}{\pi} \int\_{-\infty}^{+\infty} \frac{c\_i(s)}{t - s} ds$, $a\_i(t) = \sqrt{c\_i^2 + (c\_i^H)^2}$, and $\theta\_i(t) = \arctan(c\_i^H / c\_i)$, with the instantaneous frequency $\omega = \frac{d\theta(t)}{dt}$. Third, calculate the Hilbert spectrum: $H(\omega, t) = \sum\_{i=1}^{n} a\_i(t) e^{j\theta\_i(t)}$, and obtain the marginal spectrum: $H(\omega) = \int H(\omega, t) dt$.
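The analytic-signal step can be sketched in NumPy via the FFT-based discrete Hilbert transform; the 50 Hz tone below is only a stand-in for a real IMF component:

```python
import numpy as np

def analytic_signal(c):
    """Analytic signal c + j*H[c] via the FFT (discrete Hilbert transform),
    a sketch of the second HHT step for one IMF component."""
    n = len(c)
    C = np.fft.fft(c)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:n // 2] = 2.0          # double the positive frequencies
    if n % 2 == 0:
        h[n // 2] = 1.0        # keep the Nyquist bin for even n
    return np.fft.ifft(C * h)  # negative frequencies are zeroed out

fs = 1000.0                                  # sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
c = np.cos(2 * np.pi * 50 * t)               # toy 'IMF': a 50 Hz tone
ca = analytic_signal(c)
a = np.abs(ca)                               # instantaneous amplitude a_i(t)
theta = np.unwrap(np.angle(ca))              # instantaneous phase theta_i(t)
freq = np.diff(theta) * fs / (2 * np.pi)     # instantaneous frequency in Hz
```

For this pure tone the recovered instantaneous amplitude is constant and the instantaneous frequency sits at 50 Hz, matching the definitions above.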

The HHT is chosen here as the signal processing method because it has two merits for analyzing signals: (1) no orthogonal basis needs to be preset, and (2) it is well suited to processing non-stationary signals.

(2) Select the initial 500 samples of the whole-life degradation sequence. Set the HHT marginal spectrum of these samples as the normal state data, and train an SVDD model. Specifically, the optimization target of the SVDD is:

$$\begin{array}{l} \min\_{a, R, \xi} R^2 + C \sum\_{i=1}^{n} \xi\_i \\ \text{s.t. } \left\| \phi(\mathbf{x}\_i) - a \right\|^2 \le R^2 + \xi\_i, \; \xi\_i \ge 0, \; \forall i = 1, 2, \dots, n \end{array} \tag{6}$$

where *ξi* are the slack variables, *R* and *a* are the radius and center of the hyper-sphere in the feature space, and *C* is the regularization parameter.

(3) Select the sample sequentially from the beginning and feed the spectrum of the sample into the obtained SVDD. Calculate the distance between this sample and the hyper-sphere center of the SVDD:

$$d = \sqrt{K(\mathbf{x}\_{test}, \mathbf{x}\_{test}) - 2\sum\_{i=1}^{n} \alpha\_i K(\mathbf{x}\_{test}, \mathbf{x}\_i) + \sum\_{i=1}^{n} \sum\_{j=1}^{n} \alpha\_i \alpha\_j K(\mathbf{x}\_i, \mathbf{x}\_j)} \tag{7}$$

where *K*(*xi*, *xj*) is the kernel function and *αi* is the Lagrange coefficient. If *d* ≤ *R*, the sample *xtest* is recognized as being in the normal state; otherwise, it is a fault sample. As a result, the boundary between the normal state and the fault state can be determined.
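A minimal sketch of the distance computation in Equation (7); the uniform coefficients `alpha` are only a stand-in for the Lagrange coefficients obtained by solving the dual of Equation (6) (in SVDD the *αi* sum to one), and the Gaussian kernel width is arbitrary:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two sample vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 2)) * 0.3   # toy 'normal state' spectra
alpha = np.full(20, 1.0 / 20)            # assumed pre-solved coefficients

def svdd_distance(x_test, X, alpha):
    """Distance from x_test to the hyper-sphere center, Equation (7)."""
    k_tt = rbf(x_test, x_test)
    k_ti = sum(a * rbf(x_test, xi) for a, xi in zip(alpha, X))
    k_ij = sum(ai * aj * rbf(xi, xj)
               for ai, xi in zip(alpha, X)
               for aj, xj in zip(alpha, X))
    return np.sqrt(k_tt - 2.0 * k_ti + k_ij)

d_normal = svdd_distance(np.array([0.0, 0.0]), X, alpha)  # near the data
d_fault  = svdd_distance(np.array([5.0, 5.0]), X, alpha)  # far outlier
```

A sample near the training data lies close to the center, while an outlier lands well outside; comparing *d* with the learned radius *R* turns this distance into the normal/fault decision.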

### *3.2. Proposed DTDA Model*

To realize an effective domain adaptation between offline and online working conditions, the DANN is chosen as the baseline algorithm. The DANN adopts the adversarial training strategy and can obtain a good domain-invariant feature representation even between quite different domains [28]. As training the DANN requires the label information of the source domain, the results of the state assessment presented in Section 3.1 can be used. Different from mature fault data, early fault data generally have a temporal characteristic that reflects the degradation process from the normal state to the fault state. More importantly, the degradation part of the early fault is similar across different bearing sequences [1]. Therefore, the effect of domain adaptation by the DANN can be further enhanced by extracting common temporal information. To extract this temporal information well, the TCN is adopted as the feature extractor in the classical DANN, yielding the proposed DTDA model.

Specifically, a strategy of dual adaptation is proposed in the DTDA model. This strategy stems from two observations: (1) different domains have different requirements on the amount of temporal information, e.g., the degradation length, so the TCN may perform poorly if its receptive field is not sufficiently large; (2) the adversarial training strategy used in the DANN may be unstable when tackling data with a large distribution difference. Following this analysis, an adaptation layer with the maximum mean discrepancy (MMD) [33] is first added after the TCN's residual blocks. This layer shrinks, to some extent, the distribution difference of the temporal features between the source domain and the target domain. The DANN is then run on these adapted TCN features, which also improves the stability of the DANN training.

**Figure 3.** Flowchart of the proposed online detection method of early fault. SVDD, support vector data description; DTDA, dual temporal domain adaptation.

The above idea is shown in Figure 4. Specifically, the orange part represents the source data with labels that are obtained by the state assessment in Section 3.1. The purple part represents the available input data in the target domain. The green part is the feature extractor using the TCN, linked by an MMD adaptation layer. The blue part is the source domain label classifier that aims to recognize the normal state data from the fault data. The pink part is the domain classifier whose task is to discriminate the source domain data and the target domain data. The blue part and pink part are the same as the ones in Figure 2.

The training process of the DTDA model can be summarized as follows:

Step 1. Randomly initialize the weights *w* and biases *b*.

Step 2. Combine the source domain data and the target domain data as a whole, and feed them into the TCN to get the output:

$$H\_1 = G\_f(\sum\_{i=1}^{m+n} w\_i \mathbf{x}\_i - b) \tag{8}$$

where *m* and *n* are the sample number in the source domain and target domain, respectively, and *Gf*(·) is the feature extractor of the TCN.

Step 3. Denote by *X<sup>s</sup>* and *X<sup>t</sup>* the source domain feature and target domain feature in *H*1, respectively. Then, realize the domain adaptation for *X<sup>s</sup>* and *X<sup>t</sup>* by using an MMD layer. The definition of the MMD is as follows:

$$MMD(X^{s}, X^{t}) = \left\| \frac{1}{m} \sum\_{i=1}^{m} \phi(\mathbf{x}\_{i}^{s}) - \frac{1}{n} \sum\_{j=1}^{n} \phi(\mathbf{x}\_{j}^{t}) \right\|\_{\mathcal{H}} \tag{9}$$

where the function *φ*(·) denotes a nonlinear mapping into a reproducing kernel Hilbert space (RKHS) and the subscript *H* indicates the norm in this RKHS.
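Equation (9) can be estimated empirically with the kernel trick, MMD² = E[*K*(*s*,*s*)] + E[*K*(*t*,*t*)] − 2E[*K*(*s*,*t*)]; a Gaussian-kernel sketch with an arbitrarily chosen bandwidth:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between row-sample sets A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def mmd(Xs, Xt, gamma=1.0):
    """Empirical estimate of Equation (9) via the kernel trick."""
    m2 = (rbf_kernel(Xs, Xs, gamma).mean()
          + rbf_kernel(Xt, Xt, gamma).mean()
          - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())
    return np.sqrt(max(m2, 0.0))   # clip tiny negative values from sampling noise

rng = np.random.default_rng(0)
Xs = rng.standard_normal((100, 4))              # 'source domain' features
Xt_same = rng.standard_normal((100, 4))         # same distribution
Xt_shift = rng.standard_normal((100, 4)) + 2.0  # shifted distribution

mmd_same = mmd(Xs, Xt_same)
mmd_shift = mmd(Xs, Xt_shift)
```

Samples from the same distribution give a near-zero MMD, while a mean shift inflates it, which is exactly the quantity the adaptation layer drives down.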

**Figure 4.** Structure diagram of the proposed DTDA model.

Step 4. Denote by *X<sup>s</sup><sub>MMD</sub>* the source domain feature set adapted by the MMD layer. Feed *X<sup>s</sup><sub>MMD</sub>* into the source domain label classifier *Gy*(·) in Figure 4, and get the output *Gy*(*X<sup>s</sup><sub>MMD</sub>*; *wy*). The loss function of *Gy*(·) can be expressed as:

$$L\_y^i(w\_f, w\_y) = L\_y(G\_y(X\_{MMD}^s; w\_y), y\_i) \tag{10}$$

where *wf* is the model parameter of the feature extractor composed of the TCN and MMD adaptation layer and *wy* is the model parameter of the source domain label classifier.

Step 5. Feed the adapted feature set *XMMD* of the source and target domains into the domain classifier *Gd*(·), then get the output: *Gd*(*XMMD*; *wd*). The loss function of *Gd*(·) can be expressed as:

$$L\_d^i(w\_f, w\_d) = L\_d(G\_d(X\_{MMD}; w\_d), d\_i) \tag{11}$$

where *wd* is the model parameter of the domain classifier.

Step 6. After combining Equations (9)–(11), the optimization function of the DTDA model is:

$$E(w\_f, w\_y, w\_d) = L\_y^i(w\_f, w\_y) - \lambda L\_d^i(w\_f, w\_d) + \mu MMD(X^s, X^t) \tag{12}$$

where *λ*, *μ* > 0 are regularization parameters that tune the trade-off among the three terms during learning. Specifically, the larger the value of *μ*, the stronger the requirement for extracting common features, and vice versa. Similarly, if the value of *λ* becomes smaller, the effect of the domain classifier is effectively enhanced, and correspondingly, it becomes harder to tell whether the samples come from the source domain or the target domain. In Section 4, a reverse cross-validation approach, which was adopted in the original DANN model [28], is employed to update the regularization parameters *λ* and *μ*. It is worth noting that the minus sign in Equation (12) corresponds to gradient reversal, which reduces the distribution difference between the source domain features and the target domain features.
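A toy sketch of how Equation (12) combines the three terms; the loss values and the parameter settings below are made up purely for illustration (during training they come from the label classifier, the domain classifier, and the MMD layer):

```python
# Hypothetical loss values standing in for the three terms of Equation (12).
L_y, L_d, mmd_val = 0.40, 0.65, 0.20

def dtda_objective(L_y, L_d, mmd_val, lam, mu):
    """Optimization function of the DTDA model, Equation (12):
    E = L_y - lambda * L_d + mu * MMD."""
    return L_y - lam * L_d + mu * mmd_val

E_base = dtda_objective(L_y, L_d, mmd_val, lam=0.5, mu=1.0)
# A larger mu penalizes the remaining distribution gap more strongly,
# i.e., a stricter demand for common (domain-invariant) features.
E_strict = dtda_objective(L_y, L_d, mmd_val, lam=0.5, mu=2.0)
```

Raising *μ* raises the cost of any residual MMD gap, so the optimizer is pushed harder toward domain-invariant features; the −*λL_d* term plays the same adversarial role as in Equation (5).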

Another point worth noting is the optimization directions: the classifier parameters *wy* and *wd* are optimized to minimize their errors on the training set, while the feature extractor parameter *wf* is optimized to minimize the loss of the source domain label classifier and, simultaneously, to maximize the loss of the domain classifier. The SGD algorithm is employed on Equation (12) to update all three parameter sets.

Step 7. In the training process, if the iteration number reaches a pre-defined number *ρ* or the difference between two consecutive training errors is less than a pre-defined threshold, the training is terminated; otherwise, go to Step 6.

After dual adaptation, i.e., MMD adaptation plus the DANN, the feature distributions of the source domain data and the target domain data tend to become consistent. After reaching convergence, the DTDA model can extract a common temporal feature representation across the domain data. In the experiments in this paper, the offline working condition is set as the source domain and the online working condition as the target domain; this common feature representation then provides a channel to transfer fault information from the offline data to the online task.
