#### *2.1.3. Distance Loss*

The Wasserstein distance, which is often used to measure the discrepancy between distributions, can be understood as the minimum transport cost under an optimal transport plan. It is used as the distance loss because it accurately reflects the distance between two distributions even when their supports have little or no overlap, with the objective of measuring the overall distance between the feature centers and the target-domain features [26]. The formula is as follows:

$$L_D(P_1, P_2) = \inf_{\gamma \in \Pi(P_1, P_2)} E_{(x, y) \sim \gamma} \left[ \|x - y\| \right] \tag{3}$$

where $\Pi(P_1, P_2)$ is the set of all possible joint distributions whose marginals are the $P_1$ and $P_2$ distributions, and $\gamma$ represents one such joint distribution for each possible fault type. $x$ represents a feature-center sample feature, and $y$ represents a target-domain sample feature.
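To make Equation (3) concrete, the following minimal Python sketch (our illustration, not code from the paper) computes the empirical 1-Wasserstein distance in the one-dimensional case, where the optimal coupling simply pairs the sorted samples; the function name and the equal-sample-size assumption are ours:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan pairs the sorted samples,
    so Eq. (3) reduces to the mean absolute difference of order statistics.
    """
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    return np.mean(np.abs(x_sorted - y_sorted))

# Two distributions with almost disjoint supports still yield a finite,
# informative distance, which is the property the distance loss relies on.
print(wasserstein_1d(np.random.normal(0, 1, 500), np.random.normal(5, 1, 500)))
```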

#### *2.2. Training Process*

The goal of the model is to identify the fault types of the target-domain samples and to reduce the domain distance between identical fault types in the two domains. At the same time, the fault types of unknown samples in the target domain are identified. The training process of the model is shown in Figure 3 and proceeds as follows:

Step 1: As Figure 3 shows, the network learns the features of the fault types from the source-domain samples to form the feature centers of the various fault types. The classifier pre-classifies the target-domain samples and attempts to shorten the distance discrepancy between the domains. Therefore, the source-domain classification loss is introduced into the model. It is defined as follows:

$$L_C = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \mathbf{I}\{y^{(i)} = j\} \log \frac{\exp(\theta_j^T x^{(i)})}{\sum_{l=1}^{k} \exp(\theta_l^T x^{(i)})} \right] \tag{4}$$

where $m$ represents the number of samples in the source domain and $\mathbf{I}\{\cdot\}$ is an indicator function that equals 1 when the sample's true label is $j$ and 0 otherwise. $\theta_1, \theta_2, \ldots, \theta_k$ are the parameters of the model, and the factor $1/\sum_{l=1}^{k} \exp(\theta_l^T x^{(i)})$ normalizes the distribution so that the predicted class probabilities sum to 1.
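A minimal NumPy sketch of Equation (4) follows; the function name, array shapes, and integer label encoding are illustrative assumptions:

```python
import numpy as np

def source_classification_loss(theta, X, y):
    """Softmax cross-entropy over m source-domain samples (Eq. 4).

    theta: (k, d) class parameter vectors theta_j.
    X:     (m, d) source sample features x^(i).
    y:     (m,)   integer fault labels in [0, k).
    """
    logits = X @ theta.T                               # theta_j^T x^(i), (m, k)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The indicator I{y^(i) = j} selects each sample's true-class log-probability.
    return -log_probs[np.arange(len(y)), y].mean()
```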

**Figure 3.** Model training steps.

Step 2: Tiny noise is mixed into the target-domain samples during classification. This noise is built from the target-domain sample features extracted by the feature extractor G, so the fault features of the noise-mixed target-domain samples undergo only a slight change. Because the mixed tiny noise is related to the target sample features themselves, the extracted fault sample features move closer to the feature center of their own fault type. The formula is as follows:

$$X_{to} = X_t + \lambda \cdot \hat{x}_t \cdot o_t \tag{5}$$

where $\lambda$ represents the coefficient of the tiny noise, $\hat{x}_t$ is the sample feature extracted from the target-domain sample, and $o_t$ is Gaussian white noise used as the tiny noise for network training.
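A minimal sketch of the perturbation in Equation (5), assuming the extracted feature $\hat{x}_t$ has the same shape as the sample and using an illustrative noise coefficient:

```python
import numpy as np

def add_feature_noise(x_t, x_hat, lam=0.05, rng=None):
    """Mix feature-dependent tiny noise into a target sample (Eq. 5).

    x_t:   target-domain sample.
    x_hat: feature extracted from x_t (assumed to match x_t's shape here).
    lam:   small noise coefficient (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    o_t = rng.standard_normal(np.shape(x_t))   # Gaussian white noise o_t
    return x_t + lam * x_hat * o_t             # scale the noise by the feature
```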

Step 3: After the noise is added, the network recalculates the distance between the target sample features and the center of each fault type, and the distance loss between the target sample features and the feature centers is calculated to judge the fault type. The specific distance loss function [25] is given by the following formula:

$$L_{dis} = \left| \frac{1}{m^s} \sum_{i=1}^{m^s} T(\mathbf{x}_{si}) - \frac{1}{m^t} \sum_{i=1}^{m^t} T(\mathbf{x}_{ti}) \right| \tag{6}$$

where $\mathbf{x}_{si}$ and $\mathbf{x}_{ti}$ are the $i$-th features extracted from the source domain $X_s$ and the target domain $X_t$, respectively, through the fully connected layer, and $m^s$ and $m^t$ are the numbers of source and target samples.
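A short sketch of Equation (6), reading the outer $|\cdot|$ as the Euclidean norm of the difference between the two mean feature vectors (our assumption):

```python
import numpy as np

def domain_distance_loss(feat_s, feat_t):
    """Mean-feature distance between the two domains (Eq. 6).

    feat_s: (m_s, d) source features T(x_si) from the fully connected layer.
    feat_t: (m_t, d) target features T(x_ti).
    """
    return np.linalg.norm(feat_s.mean(axis=0) - feat_t.mean(axis=0))
```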

The three steps of model training are repeated until the expected performance is achieved, as shown in Figure 3. The network repeatedly adds tiny noise carrying the characteristics of the target-domain samples to those samples and measures the feature distances to ensure accurate diagnosis of their fault types. Stable samples that have been classified correctly do not change classification after multiple small disturbances are added, whereas active samples that have been classified incorrectly jump between the feature centers or drift away from them. A sketch of the full loop follows.
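Putting the three steps together, the following PyTorch sketch shows one possible form of the training loop. The network sizes, loss weights, nearest-center pull in Step 3, and random stand-in data are all illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

d, k = 64, 4                                    # feature size, fault types
G = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # feature extractor G
C = nn.Linear(d, k)                             # classifier
opt = torch.optim.Adam([*G.parameters(), *C.parameters()], lr=1e-3)

X_s = torch.randn(128, d)                       # stand-in source samples
y_s = torch.randint(0, k, (128,))               # stand-in source labels
X_t = torch.randn(128, d)                       # stand-in target samples
lam, alpha = 0.05, 0.1                          # noise coefficient, loss weight

for epoch in range(100):
    # Step 1: source classification loss (Eq. 4) and per-class feature centers
    feat_s = G(X_s)
    loss_c = nn.functional.cross_entropy(C(feat_s), y_s)
    centers = torch.stack([feat_s[y_s == j].mean(0) for j in range(k)])

    # Step 2: feature-dependent tiny noise on the target samples (Eq. 5)
    with torch.no_grad():
        x_hat = G(X_t)                          # extracted target features
    X_to = X_t + lam * x_hat * torch.randn_like(X_t)

    # Step 3: pull perturbed target features toward the nearest center and
    # align the domain mean features (cf. Eq. 6)
    feat_t = G(X_to)
    loss_center = torch.cdist(feat_t, centers).min(dim=1).values.mean()
    loss_dis = (feat_s.mean(0) - feat_t.mean(0)).norm()

    loss = loss_c + alpha * (loss_center + loss_dis)
    opt.zero_grad()
    loss.backward()
    opt.step()
```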

#### **3. Experimental Verification**
