**1. Introduction**

As indispensable parts of rotating machinery, bearings and gears require accurate fault identification and diagnosis to keep the machinery operating normally. Since traditional fault diagnosis methods rely on manual processing of vibration signals, it is difficult for them to uncover deep fault diagnosis knowledge. With the wide application of deep learning technology in industry and academia, it has become possible to mine effective diagnostic knowledge from massive amounts of fault data [1–3]. Therefore, such methods have been extensively applied in the fault diagnosis of rotating machinery [4–7].

Li et al. [8] proposed a fault diagnosis framework based on multi-scale permutation entropy (MPE) and multi-channel fusion convolutional neural networks (MCFCNN). Since it considers the structure and spatial information between different sensor measurement points, fault diagnosis with high accuracy and speed is realized. Valtierra-Rodriguez et al. [9] proposed a convolutional neural network-based methodology for the automatic detection of broken rotor bars that considers different severity levels. This method applies a notch filter to remove the fundamental frequency component of the current signal, and the short-time Fourier transform (STFT) is used to obtain the time-frequency plane. Experimental results show that the method is capable of identifying the healthy condition of the induction motor. However, the distributions of the collected datasets may differ due to changes in the operating environment. The diagnostic knowledge in the original training data will no longer be fully applicable to the new testing data when the working condition changes [10–14]. In this case, fault diagnosis methods for variable working conditions based on transfer learning come into being. Recently, several transfer learning-based methods have been developed to solve the cross-domain fault diagnosis problem. Mao et al. [15] proposed a deep dual temporal domain adaptation (DTDA) model which can recognize whether an early fault occurs and achieve an earlier detection location and a lower false alarm rate. An et al. [16] proposed to apply the maximum mean discrepancy (MMD) based on multiple kernels to intelligent fault diagnosis, where the features of different layers were involved in the domain adaptation process. Wang et al. [17] presented a deep adaptive adversarial network (DAAN) which narrows the discrepancy to learn domain-invariant features. Chen et al. [18] proposed an unsupervised domain adaptation method which maximizes the mutual information between the target feature space and the entire feature space and minimizes the feature-level discrepancy between the two domains. Hasan et al. [19] proposed a multitask-aided transfer learning-based diagnostic framework. This method applies a multitask learning-based convolutional network to identify working conditions, and then identifies the health status of the rolling element bearings based on transfer learning. In a word, transfer learning techniques provide an efficient solution to cross-domain fault diagnosis problems.

Although transfer learning-based methods have made great progress, the partial transfer fault diagnosis problem has not been well solved. Partial transfer diagnosis means that the number of fault types in the test data is smaller than that in the training data. Since the machine is in a healthy working state most of the time, the test data may contain only a few types of fault data. That is, the distributions of the two domains are different and the label space of the target domain is a subset of that of the source domain [20–22]. Training data can cover as many health types as possible through a long period of data accumulation, while it is difficult to guarantee that the health types in the testing data match those in the training data. Therefore, this setting is closer to engineering practice than the scenario targeted by standard domain adaptation. Since most transfer fault diagnosis methods use all source samples for domain adaptation, the source samples of outlier types can cause the network to learn false classification knowledge during domain adaptation, which is the major challenge in partial transfer fault diagnosis. Actually, the partial transfer problem has been studied in the fields of object detection and computer vision. Cao et al. [23] proposed a selective adversarial network (SAN) to facilitate positive transfer by selecting the source samples highly correlated with the target samples. Chen et al. [24] proposed the reinforced transfer network (RTNet), which applies both high-level and pixel-level information to solve the partial transfer problem. In addition, importance weighted adversarial nets [25] and the example transfer network (ETN) [26] also obtained excellent performance on image classification tasks. These works have laid a solid foundation for solving the partial transfer problem in mechanical fault diagnosis.

Recently, initial progress has been made on the partial transfer problem in fault diagnosis. Jiao et al. [27] applied a weighted cross-entropy loss to give smaller weights to the unique source samples, where the weights are determined by the predicted outputs of two classifiers [28]. Li et al. [29] presented a weighted adversarial transfer network (WATN) which uses adversarial training to reweight the source domain samples. Yang et al. [30] proposed a deep partial transfer learning network (DPTL-Net) which learns a domain-asymmetry factor to weight the source samples and thus block unnecessary knowledge. These previous partial domain adaptation methods mainly tried to obtain the weights of the source samples from a global perspective, without considering the relationships between corresponding subdomains [31] in the source and target domains, which is not conducive to obtaining the fine-grained transferable information in each type of data. To solve the above problem, this paper proposes a weighted subdomain adaptation network (WSAN) to improve the efficiency of partial transfer diagnosis of machinery. All the samples are divided into class-level subdomains, and the subdomain distributions of deep features in multiple layers are aligned. In order to block the samples of outlier source types, an auxiliary classifier is introduced to conduct adversarial training with the feature generator so as to obtain the class-level weights. To achieve weighted subdomain adaptation, we propose a weighted local maximum mean discrepancy (WLMMD) that measures the Hilbert-Schmidt norm between the kernel mean embeddings of the empirical distributions of the relevant subdomains. The main innovations of this work are summarized as follows:


The remainder of this work begins with the theoretical background in Section 2. Section 3 then introduces the proposed methodology, and Section 4 applies the proposed model to partial transfer fault diagnosis and verifies its advantages by comparison with other methods. Finally, some conclusions are drawn in Section 5.

#### **2. Theoretical Background**

#### *2.1. Partial Transfer Fault Diagnosis*

For standard domain adaptation-based frameworks, the target domain *Dt* and source domain *Ds* are collected under different but related working conditions [26]. As shown in the upper part of Figure 1, the goal of standard transfer fault diagnosis is to transfer knowledge from the labeled source data {*Xs*, *Cs*} to the unlabeled target dataset *Xt*. However, different from closed-set transfer fault diagnosis, the source label space *Cs* and target label space *Ct* are different in the partial transfer diagnosis problem. In the bottom part of Figure 1, there are more source classes than target classes, i.e., *Ct* ⊆ *Cs*. In addition, it should be noted that the sample types in the target domain do not deviate from the scope of the source domain, which ensures the validity of the diagnostic knowledge in the source domain. The purpose of partial transfer fault diagnosis is to find the categories shared with the source domain and classify them accurately.
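To make the label-space relationship concrete, a minimal sketch follows; the health-type names and the split are illustrative assumptions, not taken from the datasets used in this work.

```python
# Source domain covers all accumulated health types; the target domain,
# collected under a new working condition, contains only a subset of them.
source_labels = {"healthy", "inner_race", "outer_race", "ball", "gear_chip"}
target_labels = {"healthy", "inner_race"}          # C_t is a subset of C_s

assert target_labels <= source_labels              # the partial transfer premise
outlier_classes = source_labels - target_labels    # source-only types to down-weight
print(sorted(outlier_classes))                     # ['ball', 'gear_chip', 'outer_race']
```

The outlier classes are exactly the source samples that, if used naively for domain adaptation, would inject false classification knowledge into the model.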

**Figure 1.** Comparison of standard transfer fault diagnosis and partial transfer fault diagnosis.

#### *2.2. Subdomain Adaptation*

The source and target domains may consist of several subdomains that can be defined according to different criteria, such as class or category. For partial transfer fault diagnosis, the number of sample types in the source domain must be no less than that in the target domain, so it is practicable to delimit the subdomains based on the number of types in the source domain. Although this may not be appropriate for the target domain, it ensures that local discrepancies in the data distributions are aligned. As can be seen from Figure 2a,b, it is difficult to match two data distributions directly in the process of global or partial domain adaptation. In Figure 2c,d, subdomain adaptation has superior feature representation ability because the fine-grained transferable information within the subdomains is utilized [31]. However, the data in the target domain are unlabeled, which prevents the target domain from being partitioned directly. Fortunately, we can take the prediction probability output of the model for the target samples as pseudo-labels to divide them into subdomains. In this way, subdomain adaptation enables the model to focus more on local differences in the data distributions.
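The pseudo-label partition of the target domain described above can be sketched as follows; the probability matrix is a made-up stand-in for real classifier softmax outputs.

```python
import numpy as np

# Hypothetical classifier outputs for 6 unlabeled target samples over C = 3 classes.
softmax_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

# The arg-max of the predicted probabilities serves as the pseudo-label, and
# target samples sharing a pseudo-label form one class-level subdomain.
pseudo_labels = softmax_probs.argmax(axis=1)
subdomains = {c: np.flatnonzero(pseudo_labels == c) for c in range(3)}
print(pseudo_labels)   # [0 1 2 0 1 2]
print(subdomains[0])   # target samples assigned to class 0 -> [0 3]
```

Each target subdomain can then be aligned with the source subdomain of the same class during training.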

**Figure 2.** Comparison of standard domain adaptation (**a**,**b**) and subdomain adaptation (**c**,**d**). The box represents the data distribution range and the arrows represent the domain or subdomain adaptation process. The dotted circles represent the divided subdomains.

#### *2.3. Weighted Local Maximum Mean Discrepancy*

In the field of transfer learning, MMD [32] is a common nonparametric metric that measures the discrepancy between two distributions. It compares the mean embeddings of the two distributions in a Reproducing Kernel Hilbert Space (RKHS), thereby avoiding explicit density estimation. MMD can be defined as:

$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) \stackrel{\Delta}{=} \left\| \mathbf{E}_p[\phi(\mathbf{x}^s)] - \mathbf{E}_q[\phi(\mathbf{x}^t)] \right\|_{\mathcal{H}}^2 \tag{1}$$

where *φ*(·) is the feature mapping function that maps the original data to the RKHS H, and *p* and *q* denote the distributions of *Ds* and *Dt*, respectively. Therefore, an estimate of the MMD compares the squared distance between the empirical kernel mean embeddings as:

$$\begin{aligned}\hat{d}_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) &= \left\| \frac{1}{n_s} \sum_{\mathbf{x}_i \in \mathcal{D}_s} \phi(\mathbf{x}_i) - \frac{1}{n_t} \sum_{\mathbf{x}_j \in \mathcal{D}_t} \phi(\mathbf{x}_j) \right\|_{\mathcal{H}}^2 \\ &= \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k\left(\mathbf{x}_i^s, \mathbf{x}_j^s\right) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k\left(\mathbf{x}_i^t, \mathbf{x}_j^t\right) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k\left(\mathbf{x}_i^s, \mathbf{x}_j^t\right) \end{aligned} \tag{2}$$

where ˆ*dH*(*Ds*, *Dt*) is the empirical estimate of *dH*(*Ds*, *Dt*), *k*(·,·) = 〈*φ*(·), *φ*(·)〉 is the kernel function, and *ns* and *nt* are the numbers of source and target samples, respectively.
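As a concrete illustration, Equation (2) can be evaluated with a Gaussian (RBF) kernel as sketched below; the kernel choice, bandwidth `sigma`, and the synthetic data are assumptions for demonstration, not prescribed by the method.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(xs, xt, sigma=1.0):
    """Empirical squared MMD per Equation (2)."""
    ns, nt = len(xs), len(xt)
    return (rbf_kernel(xs, xs, sigma).sum() / ns**2
            + rbf_kernel(xt, xt, sigma).sum() / nt**2
            - 2 * rbf_kernel(xs, xt, sigma).sum() / (ns * nt))

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(100, 8))   # synthetic source samples
xt = rng.normal(0.5, 1.0, size=(100, 8))   # mean-shifted target samples
# Identical domains give (numerically) zero; shifted domains give a larger value.
print(mmd2(xs, xs), mmd2(xs, xt))
```

A larger value of `mmd2` indicates a larger distribution discrepancy, which domain adaptation methods then minimize as a training loss.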

Most previous domain adaptation methods apply MMD to narrow the distribution discrepancy without considering the internal distribution of the data. However, such methods may result in poor alignment because the relationship between related subdomains is ignored. Furthermore, these methods also fail to selectively involve source samples in the adaptation process due to the asymmetry of data types across the two domains. Considering the above problems, we propose the WLMMD to achieve weighted subdomain adaptation:

$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) \stackrel{\Delta}{=} \mathbf{E}_c \left\| \mathbf{E}_{p^{(c)}}[\phi(\mathbf{x}^s)] - \mathbf{E}_{q^{(c)}}[\phi(\mathbf{x}^t)] \right\|_{\mathcal{H}}^2 \tag{3}$$

where *x<sup>s</sup>* and *x<sup>t</sup>* are the instances in *Ds* and *Dt*, and *p*<sup>(c)</sup> and *q*<sup>(c)</sup> are the distributions of the class-level subdomains *D<sub>s</sub>*<sup>(c)</sup> and *D<sub>t</sub>*<sup>(c)</sup>, respectively. So we can calculate an unbiased estimator of WLMMD as:

$$\hat{d}_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = \frac{1}{C} \sum_{k=1}^{C} \left\| \sum_{\mathbf{x}_i^s \in \mathcal{D}_s} w_i^{sk} \phi(\mathbf{x}_i^s) - \sum_{\mathbf{x}_j^t \in \mathcal{D}_t} w_j^{tk} \phi\left(\mathbf{x}_j^t\right) \right\|_{\mathcal{H}}^2 \tag{4}$$

where *w<sub>i</sub><sup>sk</sup>* and *w<sub>j</sub><sup>tk</sup>* denote the weights of *x<sub>i</sub><sup>s</sup>* and *x<sub>j</sub><sup>t</sup>* belonging to class *k*, respectively. Obviously, ∑<sub>*i*=1</sub><sup>*n<sub>s</sub>*</sup> *w<sub>i</sub><sup>sk</sup>* = ∑<sub>*j*=1</sub><sup>*n<sub>t</sub>*</sup> *w<sub>j</sub><sup>tk</sup>* = 1, and the weight *w<sub>i</sub><sup>k</sup>* for a sample *x<sub>i</sub>* can be computed as:

$$w_i^k = \frac{y_{ik}}{\sum_{(\mathbf{x}_j, \mathbf{y}_j) \in \mathcal{D}} y_{jk}} \tag{5}$$

where *y<sub>ik</sub>* is the *k*-th entry of the label vector *y<sub>i</sub>*. Since the source samples are labeled with one-hot vectors, we can directly calculate the weight *w<sub>i</sub><sup>sk</sup>* from the labels. Although the samples of the target domain are unlabeled, it is feasible to use pseudo-labels to partition the related subdomains. Note that the predicted output *ŷ<sub>i</sub><sup>t</sup>* given by the classifier can be used as the pseudo target label, as it measures the probability that the target sample belongs to the corresponding category. *ŷ<sub>i</sub><sup>t</sup>* can be regarded as the probability of assigning *x<sub>i</sub><sup>t</sup>* to each of the C classes, so the weight *w<sub>i</sub><sup>tk</sup>* of the target samples can be acquired. Thus, we can approximate Equation (4) as:

$$\begin{aligned}\hat{d}_l(\mathcal{D}_s, \mathcal{D}_t) = \frac{1}{C} \sum_{k=1}^{C} \Bigg[ &\sum_{i=1}^{n_s} \sum_{j=1}^{n_s} w_i^{sk} w_j^{sk}\, k\left(\mathbf{z}_i^{sl}, \mathbf{z}_j^{sl}\right) + \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} w_i^{tk} w_j^{tk}\, k\left(\mathbf{z}_i^{tl}, \mathbf{z}_j^{tl}\right) \\ &- 2 \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} w_i^{sk} w_j^{tk}\, k\left(\mathbf{z}_i^{sl}, \mathbf{z}_j^{tl}\right) \Bigg] \end{aligned} \tag{6}$$

where *z<sub>i</sub><sup>sl</sup>* and *z<sub>j</sub><sup>tl</sup>* are the *l*-th layer activations of the corresponding samples, among the *L* layers. By using Equation (6), the distribution discrepancy between the two subdomains at a particular activation layer can be calculated.
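Putting Equations (4)–(6) together, the WLMMD at a single activation layer can be sketched as follows; the RBF kernel, its bandwidth, and the synthetic activations, labels, and pseudo-label probabilities are all illustrative assumptions.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def wlmmd(zs, zt, ws, wt, sigma=1.0):
    """Equation (6): weighted local MMD over C class-level subdomains.
    zs (ns, d), zt (nt, d): l-th layer activations; ws (ns, C), wt (nt, C):
    per-class weights from Equation (5) (each column sums to 1)."""
    Kss, Ktt, Kst = rbf(zs, zs, sigma), rbf(zt, zt, sigma), rbf(zs, zt, sigma)
    C = ws.shape[1]
    total = 0.0
    for k in range(C):
        a, b = ws[:, k], wt[:, k]
        total += a @ Kss @ a + b @ Ktt @ b - 2 * a @ Kst @ b
    return total / C

rng = np.random.default_rng(1)
zs, zt = rng.normal(size=(60, 16)), rng.normal(size=(40, 16))
ys = np.eye(3)[np.arange(60) % 3]        # one-hot source labels (20 per class)
pt = rng.dirichlet(np.ones(3), size=40)  # target pseudo-label probabilities
ws = ys / ys.sum(axis=0, keepdims=True)  # Equation (5) for the source domain
wt = pt / pt.sum(axis=0, keepdims=True)  # Equation (5) with pseudo-labels
print(wlmmd(zs, zt, ws, wt))             # non-negative subdomain discrepancy
```

Note that each class term is a squared RKHS distance between weighted subdomain mean embeddings, so the value is non-negative and vanishes when the two domains coincide.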

#### **3. Proposed Method**
