#### *2.2. Convolutional Neural Network*

A convolutional neural network is a typical deep feed-forward artificial neural network that can process time sequences and images through convolution operations. These operations reduce the number of weights and biases and thereby decrease the complexity of the model. A standard CNN consists of convolutional layers, pooling layers, fully connected layers and a classification layer. In a convolutional layer, multiple convolutional kernels are used to convolve the input, and the weights and biases are shared among hidden neurons. The process in the convolutional layer can be expressed as follows:

$$\mathbf{z}\_{n}^{l} = f^{l}(\sum\_{k} \mathbf{x}\_{k}^{l-1} \ast \mathbf{w}\_{n}^{l} + \mathbf{b}\_{n}^{l}) \tag{3}$$

where $\mathbf{x}\_{k}^{l-1}$ is the *k*-th input feature map of layer *l*−1 and \* represents the convolution operation. $\mathbf{w}\_{n}^{l}$ and $\mathbf{b}\_{n}^{l}$ denote the weight of the *n*-th kernel and the corresponding bias, respectively. Additionally, $f^{l}(\cdot)$ represents the activation function.

A pooling layer usually follows the convolutional layer; its subsampling operation reduces the spatial dimension and thereby the risk of overfitting. Mathematically, the maximum pooling operation is defined as follows:

$$po\_j = \max\_{i \in m\_j} \{ \mathbf{c}\_j(i) \} \tag{4}$$

where $\mathbf{c}\_j$ represents the *j*-th pooling region, $m\_j$ is the set of positions within that region, and $po\_j$ is the output of the pooling operation. Average pooling and stochastic pooling are also commonly used in pooling layers.

After several convolutional and pooling layers, a fully connected layer flattens the output feature maps into a single vector. The last layer is usually a softmax output layer, in which the softmax function predicts the probability of each target class.
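The following minimal PyTorch sketch illustrates this standard layer ordering: convolution (Equation (3)), maximum pooling (Equation (4)), a fully connected layer and a softmax output. The input size, channel counts and class number are illustrative assumptions, not the settings used in this study.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN sketch: conv -> pool -> conv -> pool -> FC -> softmax."""

    def __init__(self, in_channels=1, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # shared kernels w and biases b
            nn.ReLU(),                                             # activation f(.)
            nn.MaxPool2d(kernel_size=2),                           # subsampling, Equation (4)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)       # fully connected layer

    def forward(self, x):                                          # x: (batch, channels, 32, 32)
        z = self.features(x)
        z = torch.flatten(z, 1)                                    # flatten feature maps to a vector
        return torch.softmax(self.classifier(z), dim=1)            # class probabilities
```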

#### *2.3. Deep Adversarial Convolutional Neural Network*

Generally, a deep adversarial convolutional neural network (DACNN) consists of a feature extractor $\mathcal{G}\_f$, a domain discriminator $\mathcal{G}\_d$ and a classifier $\mathcal{G}\_y$ [13]. The feature extractor, which acts as one competitor in the DACNN, is typically composed of several convolutional blocks or fully connected layers. It can be expressed as $\mathcal{G}\_f = \mathcal{G}\_f(\mathbf{x}; \theta\_f): \mathbf{x} \rightarrow \mathbb{R}^D$ with parameters $\theta\_f$, which indicates that the input sample $\mathbf{x}$ is transformed into *D*-dimensional features. In addition, the domain discriminator (a binary classifier) is treated as the opponent and is expressed as $\mathcal{G}\_d = \mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}); \theta\_d)$ with parameters $\theta\_d$. The source and target samples are fed into the feature extractor, and the resulting features are then distinguished by the domain discriminator $\mathcal{G}\_d$. The binary cross-entropy (BCE) loss is taken as the objective function, which can be described as follows:

$$L\_d(\mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}\_i)), d\_i) = d\_i \log \frac{1}{\mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}\_i))} + (1 - d\_i) \log \frac{1}{1 - \mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}\_i))} \tag{5}$$

where $d\_i$ denotes the binary domain label of $\mathbf{x}\_i$. By conducting adversarial training between the two parts, the feature extractor $\mathcal{G}\_f$ tends to extract features common to both types of data, making it difficult for the domain discriminator to output the correct zero/one label. Thus, the model can perform well on both the source and target datasets. Assuming *n* samples in the source domain dataset and *N*−*n* samples in the target domain dataset, the objective function is expressed as follows:

$$E(\theta\_f, \theta\_d) = -\left(\frac{1}{n} \sum\_{i=1}^{n} L\_d^i(\theta\_f, \theta\_d) + \frac{1}{N - n} \sum\_{i=n+1}^{N} L\_d^i(\theta\_f, \theta\_d)\right) \tag{6}$$

where $L\_d^i(\theta\_f, \theta\_d) = L\_d(\mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}\_i; \theta\_f); \theta\_d), d\_i)$. This equation includes a maximization problem with respect to $\theta\_d$ and a minimization problem with respect to $\theta\_f$.
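As a concrete illustration, the sketch below computes the discriminator objective of Equations (5) and (6) for one source batch and one target batch. It assumes that $\mathcal{G}\_d$ ends with a sigmoid so its outputs lie in (0, 1), that source samples carry domain label 0 and target samples label 1 (an assumed convention), and that the two batches have equal size so the two averages in Equation (6) coincide.

```python
import torch
import torch.nn.functional as F

def domain_objective(G_f, G_d, x_src, x_tgt):
    """BCE of the domain discriminator, cf. Equations (5) and (6)."""
    feats = torch.cat([G_f(x_src), G_f(x_tgt)], dim=0)   # shared feature extractor
    d_hat = G_d(feats).squeeze(1)                        # sigmoid outputs in (0, 1)
    d = torch.cat([torch.zeros(len(x_src)),              # source: d_i = 0
                   torch.ones(len(x_tgt))])              # target: d_i = 1
    # binary_cross_entropy averages the per-sample losses of Equation (5);
    # minimizing it over theta_d is equivalent to maximizing E in Equation (6).
    return F.binary_cross_entropy(d_hat, d)
```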

Additionally, all the labeled samples should be trained in a supervised manner to ensure diagnostic accuracy during the adversarial procedure. Therefore, a classifier is established, expressed as $\mathcal{G}\_y = \mathcal{G}\_y(\mathcal{G}\_f(\mathbf{x}); \theta\_y): \mathbb{R}^D \rightarrow \mathbb{R}^L$ with parameters $\theta\_y$, where *L* is the number of classes. The cross-entropy loss is applied to the softmax output, and it can be described as follows:

$$L\_y(\mathcal{G}\_y(\mathcal{G}\_f(\mathbf{x}\_i)), y\_i) = \log \frac{1}{\mathcal{G}\_y(\mathcal{G}\_f(\mathbf{x}\_i))\_{y\_i}} \tag{7}$$

By adding Equation (7) to the objective function in Equation (6), the overall optimization objective can be expressed as follows:

$$E(\theta\_f, \theta\_y, \theta\_d) = \frac{1}{n} \sum\_{i=1}^{n} L\_y^i(\theta\_f, \theta\_y) - \lambda \left(\frac{1}{n} \sum\_{i=1}^{n} L\_d^i(\theta\_f, \theta\_d) + \frac{1}{N - n} \sum\_{i=n+1}^{N} L\_d^i(\theta\_f, \theta\_d)\right) \tag{8}$$

where $L\_y^i(\theta\_f, \theta\_y) = L\_y(\mathcal{G}\_y(\mathcal{G}\_f(\mathbf{x}\_i; \theta\_f); \theta\_y), y\_i)$. The entire training process of the DACNN optimizes the parameters $\theta\_f$, $\theta\_y$ and $\theta\_d$, and it is expressed as follows:

$$(\hat{\theta}\_f, \hat{\theta}\_y) = \underset{\theta\_f, \theta\_y}{\text{argmin}}\, E(\theta\_f, \theta\_y, \hat{\theta}\_d) \tag{9}$$

$$\hat{\theta}\_d = \underset{\theta\_d}{\text{argmax}}\, E(\hat{\theta}\_f, \hat{\theta}\_y, \theta\_d) \tag{10}$$

In the training stage, the parameters of the two adversaries are updated in opposite directions with respect to the gradient of the domain loss. The update of the domain discriminator $\mathcal{G}\_d$ reduces the loss in order to improve its discriminative ability, whereas the update of the feature extractor $\mathcal{G}\_f$ maximizes the loss in order to fool the discriminator. To allow a straightforward implementation with standard stochastic gradient descent (SGD) algorithms when training the DACNN, a circuitous method is used: the loss is rewritten as $L\_d^{\prime i}(\theta\_f, \theta\_d) = L\_d(\mathcal{G}\_d(\mathcal{G}\_f(\mathbf{x}\_i; \theta\_f); \theta\_d), 1 - d\_i)$ when updating the feature extractor parameters. Maximizing $L\_d^i(\theta\_f, \theta\_d)$ can then be accomplished by minimizing $L\_d^{\prime i}(\theta\_f, \theta\_d)$. During backpropagation, the feature extractor takes the gradient of the rewritten $L\_d^{\prime i}(\theta\_f, \theta\_d)$ from the domain discriminator and updates its parameters with SGD. Overall, the update rules of the parameters $\theta\_f$, $\theta\_y$ and $\theta\_d$ can be formulated as follows:

$$\theta\_f \leftarrow \theta\_f - \mu \left( \frac{\partial L\_y^i}{\partial \theta\_f} + \lambda \frac{\partial L\_d^{\prime i}}{\partial \theta\_f} \right) \tag{11}$$

$$
\theta\_d \leftarrow \theta\_d - \mu \,\,\frac{\partial L\_d^i}{\partial \theta\_d} \tag{12}
$$

$$
\theta\_y \leftarrow \theta\_y - \mu \,\,\frac{\partial L\_y^i}{\partial \theta\_y} \tag{13}
$$

where *μ* represents the learning rate.
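A minimal PyTorch sketch of one training step under Equations (11)–(13) is given below, using the flipped-label trick described above for the extractor update. The networks, optimizers, batch tensors and trade-off weight λ are placeholders; $\mathcal{G}\_d$ is assumed to end with a sigmoid and $\mathcal{G}\_y$ to output logits.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G_f, G_y, G_d, opt_f, opt_y, opt_d, x_s, y_s, x_t, lam=1.0):
    """One SGD step sketching Equations (11)-(13)."""
    d_s = torch.zeros(len(x_s), 1)   # source domain label d_i = 0 (assumed convention)
    d_t = torch.ones(len(x_t), 1)    # target domain label d_i = 1

    # Equation (12): update the discriminator with the true domain labels.
    with torch.no_grad():            # freeze the extractor for this step
        f_s, f_t = G_f(x_s), G_f(x_t)
    loss_d = F.binary_cross_entropy(G_d(f_s), d_s) \
           + F.binary_cross_entropy(G_d(f_t), d_t)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Equations (11) and (13): update extractor and classifier together.
    f_s, f_t = G_f(x_s), G_f(x_t)
    loss_y = F.cross_entropy(G_y(f_s), y_s)   # L_y on labeled source samples
    # L'_d: the same BCE with flipped labels 1 - d_i, so minimizing it
    # maximizes the original L_d and thereby fools the discriminator.
    loss_dp = F.binary_cross_entropy(G_d(f_s), 1 - d_s) \
            + F.binary_cross_entropy(G_d(f_t), 1 - d_t)
    opt_f.zero_grad()
    opt_y.zero_grad()
    (loss_y + lam * loss_dp).backward()       # gradients of Equations (11) and (13)
    opt_f.step()                              # discriminator gradients from this pass
    opt_y.step()                              # are discarded at the next zero_grad call
```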

Through the above optimization process, the DACNN trains a feature extractor $\mathcal{G}\_f$ that extracts suitable representations from input samples (from either the source or the target domain dataset) which can be classified accurately by the classifier $\mathcal{G}\_y$, while weakening the ability of the domain discriminator $\mathcal{G}\_d$ to tell whether a sample comes from the source or the target domain. In the testing phase, domain-insensitive features are extracted by the feature extractor $\mathcal{G}\_f$ and fed directly into the classifier $\mathcal{G}\_y$ to identify the health states.

#### **3. The Proposed Method**

#### *3.1. Data-Level Fusion*

Data-level fusion is a relatively direct fusion method that can retain the effective fault information hidden in the measured signals and reduce the complexity of the model. Therefore, a data-level fusion strategy is designed in this study to fuse one-dimensional vibration signals and two-dimensional infrared thermal images.

In general, measured time-domain vibration signals are weak and carry limited information, particularly in the early stages of a failure. Although many deep learning methods use only raw time-domain signals for fault diagnosis, they rely heavily on the deep learning structure. Frequency features are known to play a significant role in rotating machinery failure diagnostics; the frequency-domain signal and the squared envelope spectrum, in particular, are widely used in classical signal processing methods [31,32]. Therefore, in this study the measured raw signals are transformed to acquire frequency-domain signals by the fast Fourier transform (FFT) and CS2 features by the squared envelope spectrum. Then, they are reshaped into two-dimensional matrices as part of the input of the convolution layer, as shown in Figure 1. A possible implementation is sketched below.
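The sketch below, using NumPy and SciPy, computes the FFT magnitude spectrum and the squared envelope spectrum (via the Hilbert transform) of a raw vibration signal and reshapes each into a square matrix. The signal length and the matrix side are illustrative assumptions, not the sizes used in this study.

```python
import numpy as np
from scipy.signal import hilbert

def vibration_to_channels(x, side=64):
    """Transform a raw vibration signal x into two (side, side) matrices.

    Assumes x is long enough that both spectra contain at least
    side**2 frequency bins (i.e., len(x) >= 2 * side**2 roughly).
    """
    n = side * side
    # frequency-domain signal: magnitude of the FFT of the raw signal
    spectrum = np.abs(np.fft.rfft(x))[:n]
    # squared envelope: squared magnitude of the analytic signal
    envelope_sq = np.abs(hilbert(x)) ** 2
    # squared envelope spectrum: FFT of the squared envelope
    ses = np.abs(np.fft.rfft(envelope_sq))[:n]
    return spectrum.reshape(side, side), ses.reshape(side, side)
```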

After that, the three RGB channels of each infrared thermal image are combined with the two-dimensional frequency-domain signal and the squared envelope spectrum to synthesize 5-channel data. The 5-channel fused data are then used for the subsequent domain-adaptation CNN training.
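Under the assumption that the thermal image and the two spectral matrices share the same spatial size (resizing beforehand if necessary), this fusion step amounts to a simple channel stack:

```python
import numpy as np

def fuse_sample(ir_image_rgb, spectrum_2d, ses_2d):
    """Stack RGB thermal channels with the two spectral matrices.

    ir_image_rgb: (side, side, 3) array; spectrum_2d, ses_2d: (side, side).
    Returns a (5, side, side) array ready for a channels-first CNN input.
    """
    channels = [ir_image_rgb[..., c] for c in range(3)]  # R, G, B channels
    channels += [spectrum_2d, ses_2d]                    # two spectral channels
    return np.stack(channels, axis=0)
```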

#### *3.2. Fusion Domain-Adaptation CNN Construction*

Domain adaptation based on adversarial networks is an effective approach for cross-domain fault diagnosis. In this section, the fused 5-channel data are used to train a domain-adaptation CNN for cross-domain fault diagnosis of gearboxes.

Firstly, a feature extractor is constructed, which contains multiple convolutional blocks and several fully connected layers. The feature extractor is used to extract domain-insensitive features from the 5-channel fused data. The extracted features are fed into the state classifier to recognize health states. Meanwhile, they are input into the domain discriminator to determine whether they come from the source or the target domain.

In the training process, the feature extractor minimizes the state classification loss and maximizes the domain discrimination loss, so that it extracts features that are both insensitive to the domain and easy for the state classifier to classify. As the opponent, the domain discriminator aims to minimize the domain discrimination loss so that it can distinguish whether a feature comes from the source or the target domain. Finally, through extensive adversarial training, the diagnosis model composed of the feature extractor and the state classifier can accurately recognize fault states with multi-source heterogeneous data under cross-working conditions. A minimal assembly of these three components is sketched below.
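The sketch assumes 64 × 64 fused inputs, a 128-dimensional feature space *D* and five health states *L*; all sizes are illustrative, not the architecture used in this study.

```python
import torch.nn as nn

# G_f: feature extractor taking the 5-channel fused data as input
feature_extractor = nn.Sequential(
    nn.Conv2d(5, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128),   # D = 128 features for a 64x64 input
)
# G_y: state classifier over L = 5 health states (outputs logits)
state_classifier = nn.Linear(128, 5)
# G_d: binary source/target discriminator with sigmoid output
domain_discriminator = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
```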
