*Article* **Intelligent Diagnosis of Rolling Bearings Fault Based on Multisignal Fusion and MTF-ResNet**

**Kecheng He 1,2,\*, Yanwei Xu 1,2,\*, Yun Wang 1,2, Junhua Wang <sup>1</sup> and Tancheng Xie 1,2**


**Abstract:** Existing diagnosis methods for bearing faults often neglect the temporal correlation of signals, resulting in easy loss of crucial information. Moreover, these methods struggle to adapt to complex working conditions for bearing fault feature extraction. To address these issues, this paper proposes an intelligent diagnosis method for compound faults in metro traction motor bearings. This method combines multisignal fusion, Markov transition field (MTF), and an optimized deep residual network (ResNet) to enhance the accuracy and effectiveness of diagnosis in the presence of complex working conditions. At the outset, the acquired vibration and acoustic emission signals are encoded into two-dimensional color feature images with temporal relevance by Markov transition field. Subsequently, the image features are extracted and fused into a set of comprehensive feature images with the aid of the image fusion framework based on a convolutional neural network (IFCNN). Afterwards, samples representing different fault types are presented as inputs to the optimized ResNet model during the training phase. Through this process, the model's ability to achieve intelligent diagnosis of compound faults in variable working conditions is realized. The results of the experimental analysis verify that the proposed method can effectively extract comprehensive fault features while working in complex conditions, enhancing the efficiency of the detection process and achieving a high accuracy rate for the diagnosis of compound faults.

**Keywords:** metro traction motor bearings; multisignal fusion; Markov transition field; optimized deep residual network; diagnosis of compound faults

## **1. Introduction**

Traction motors are the power source of metro trains, and the quality of their bearings directly affects normal motor operation. The frequent starting and stopping of the metro causes alternating changes in the speed of the traction motor bearings and in the loads they carry. Under long-term harsh working conditions, the inner rings, outer rings and rolling elements of the bearings develop varying degrees of pitting, cracking and more complex forms of failure. The adverse vibrations generated by a faulty bearing, when transmitted into the entire system over an extended period, not only damage the traction motor but also endanger other structural components, posing a serious threat to the safety and reliability of metro trains. Intelligent diagnosis of bearing faults under complex working conditions enables the timely identification of fault types, facilitating early maintenance intervention and providing significant engineering value for practical applications.

Conventional approaches for bearing fault diagnosis predominantly rely on signal processing techniques. To address the issue of noise interference during feature extraction, wavelet thresholding was employed to effectively eliminate significant noise components from the raw data [1,2]. In an effort to enhance the signal-to-noise ratio, ref. [3,4] adopted empirical mode decomposition (EMD) to decompose the signal into multiple intrinsic mode functions. Furthermore, ref. [5] introduced an optimized variational mode decomposition (VMD) method to facilitate the selection of intrinsic mode functions containing pertinent fault information. Despite the promising outcomes achieved by these traditional methods in bearing fault diagnosis, they are accompanied by inherent limitations. These drawbacks encompass restricted generalization capability, challenges in extracting deep fault features, and complexities associated with parameter optimization. Signal analysis technology, as a research hotspot, has been receiving attention from scholars. Subsequently, the introduction of new methods has successfully addressed many challenges [6,7].

**Citation:** He, K.; Xu, Y.; Wang, Y.; Wang, J.; Xie, T. Intelligent Diagnosis of Rolling Bearings Fault Based on Multisignal Fusion and MTF-ResNet. *Sensors* **2023**, *23*, 6281. https://doi.org/10.3390/s23146281

Academic Editors: Yuxing Li and Luca Fredianelli

Received: 20 June 2023 Revised: 8 July 2023 Accepted: 9 July 2023 Published: 10 July 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

With the development of artificial intelligence technology, machine learning and deep learning [8] have gained significant attention in various fields, and numerous researchers have started extracting deeper features and making notable contributions [9–11]. A convolutional neural network (CNN), as one of their important representatives, possesses a powerful adaptive feature extraction capability. Moreover, CNN has demonstrated remarkable performance in the field of image processing. As such, scholars have increasingly introduced CNN into the field of fault diagnosis and conducted a series of research studies in this area. Ref. [12] proposed a CNN model that utilizes widened convolutional kernels to improve the feature extraction efficiency of the network. Ref. [13] deployed a CNN to extract features from the Mel spectrum generated from the voiceprint signals of motors. Ref. [14] presented a multiscale CNN model that effectively extracts signal features at different frequencies; this model is further combined with LSTM to identify fault types. In the field of medical imaging, ref. [15] proposed an improved CNN model architecture for the identification of lung nodules and early-stage cancer diagnosis by comparing multiple photos. In big data environments, to reduce the costs associated with data collection and processing, some researchers have explored unsupervised learning techniques. To synchronously extract local and global structural information from the raw unlabeled industrial data, ref. [16] proposed a new multiple-order graphical deep extreme learning machine (MGDELM) algorithm. Ref. [17] proposed a novel self-training semi-supervised deep learning (SSDL) approach to train a fault diagnosis model with few labeled and abundant unlabeled samples. The previously discussed research studies have made notable advances in fault diagnosis. However, because of their reliance on single-sensor signals, there may be limitations in accurately characterizing fault information, which could ultimately reduce their overall reliability.

Multisignal fusion technology enables the simultaneous processing of time-series data obtained from multiple sensors, thereby capturing a broader range of system variability while offering heightened complementarity and fault tolerance. In one study, feature extraction was performed on original vibration and acoustic signals, which were subsequently fused using a 1DCNN-based network model [18]. Another approach proposed a frequency-domain multilinear principal component analysis to effectively identify faults by integrating diverse vibration and acoustic signals [19]. Similarly, a two-dimensional matrix was constructed from multi-axial vibration signals, and an enhanced 2DCNN model was employed for fault diagnosis [20]. These methods have demonstrated commendable enhancements in diagnostic accuracy. However, it is worth noting that a limitation common to these approaches is the omission of time correlation among signals, which may result in the loss of crucial fault-related information.

Upon a comprehensive analysis of existing literature, it has been observed that diagnostic approaches leveraging deep learning techniques frequently employ increasing network depths to enhance the model's learning capacity and improve diagnostic performance. Nevertheless, the utilization of progressively deeper networks may give rise to challenges such as the vanishing or exploding gradient problem. To address this issue, deep residual networks were introduced [21], effectively mitigating the aforementioned problem. Furthermore, an innovative activation function named STAC-tanh was proposed by [22], which enables adaptive feature extraction in the bearing system by employing the hyperbolic tangent function with slope and threshold adaptivity. Another compelling approach involved the fusion of Gramian angular field (GAF) with ResNet, leading to notable advancements in bearing fault diagnosis [23]. Additionally, ref. [24] combined transfer learning with ResNet, utilizing a pretrained ResNet model on ImageNet as a fault feature extractor, which yielded remarkably accurate results. These aforementioned studies have demonstrated promising outcomes in the realm of bearing fault diagnosis. However, certain limitations persist, including the sole reliance on a single sensor signal and the absence of experimental verification through the use of a purpose-built platform.

In summary, most of the studies are based on open source datasets with simple working conditions and failure forms, but the actual working conditions of bearings are complex and can present different parts and degrees of failure. To address the challenges faced in compound bearing fault diagnosis under complex working conditions, such as the low reliability of single sensor signals, the tendency for traditional data processing methods to result in important information loss, the degradation of diagnostic models with increasing network depth, and the difficulty of feature extraction, this paper proposes an intelligent diagnosis method for compound bearing faults in metro traction motors by combining MTF-processed acoustic-vibration signals using IFCNN for feature fusion along with an optimized version of ResNet. The main contributions of the paper are expressed as follows:


The remaining sections of this paper are arranged as follows: In Section 2, the data processing method used in this study and the construction of the dataset are introduced. Section 3 focuses on the multisignal fusion technology used in this study. Section 4 provides a detailed description of the fault diagnosis model and the corresponding diagnostic process. Section 5 explains the specific experimental design, as well as the diagnostic scheme adopted in this study. Section 6 analyzes the experimental results and carries out a series of method comparisons to validate the effectiveness of the proposed approach. Section 7 summarizes the main content of the paper and draws conclusions.

## **2. Data Preprocessing**

In this study, a signal acquisition system was built to obtain a large amount of raw data using acoustic emission sensors, vibration sensors and PCI acquisition cards. The research focuses on compound faults, with pitting as the main defect, and the location of the defect is used as the classification criterion. A total of eight bearing states, including the normal bearing, are designed and labeled for subsequent study. The fault types and labels are shown in Table 1.

**Table 1.** Label settings for different fault types.


## *2.1. Dataset Construction*

The vibration and acoustic emission signals were acquired using a PCI data acquisition card with a sampling frequency of 50 kS/s and a sampling time of 10 s, giving a total of 5 × 10<sup>5</sup> sampling points. In this experiment, the minimum speed of the bearing is determined to be 800 rpm. Based on this speed, the number of sampling points obtained from one cycle of bearing rotation can be calculated to be 3750. In order to ensure the completeness of the sampled fault information, it is recommended that the number of sampling points be at least twice that of the calculated value, resulting in a sampling length of 8192 (2<sup>13</sup>). With a limited amount of data, the vibration and acoustic emission signals were data augmented using overlapping sampling so that each fault type under each working condition contained 1000 samples for a total of 8000 samples, which were randomly divided into a training set and a testing set at 9:1. Under fixed working conditions, the dataset is divided as shown in Table 2.
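The sampling arithmetic above can be checked with a short sketch. The function names are hypothetical, and the equally spaced window placement is only one plausible reading of "overlapping sampling"; the stride actually used in the experiments is not stated in the text, and the sketch assumes one 10 s recording per signal.

```python
import math


def points_per_revolution(sample_rate_hz, speed_rpm):
    """Number of samples recorded during one shaft revolution."""
    return sample_rate_hz * 60 / speed_rpm


def sample_length(sample_rate_hz, min_speed_rpm, revolutions=2):
    """Smallest power of two covering `revolutions` turns at the lowest speed."""
    needed = math.ceil(revolutions * points_per_revolution(sample_rate_hz, min_speed_rpm))
    return 1 << (needed - 1).bit_length()


def window_starts(n_points, length, n_windows):
    """Start indices of equally spaced, overlapping windows (assumed scheme)."""
    last_start = n_points - length
    if last_start < 0 or n_windows < 1:
        raise ValueError("signal shorter than one window, or no windows requested")
    if n_windows == 1:
        return [0]
    stride = last_start // (n_windows - 1)
    return [k * stride for k in range(n_windows)]


length = sample_length(50_000, 800)             # two revolutions at 800 rpm
starts = window_starts(500_000, length, 1000)   # 1000 samples from one 10 s record
stride = starts[1] - starts[0]
print(length, stride, length - stride)          # window length, stride, overlap
```

Under these assumptions, consecutive windows share most of their points, which is what allows 1000 samples of 8192 points to be drawn from a single record of 5 × 10<sup>5</sup> points.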


**Table 2.** Dataset partitioning under fixed working conditions.

## *2.2. MTF Image Encoding*

In this paper, MTF is used to process vibration and acoustic emission signal data, converting the acquired data samples into image samples. MTF is an image encoding method that converts original vibration or acoustic emission signals into two-dimensional time series images through Markov transition probabilities [25].

Suppose a discretized segment of time series data *X* = {*x*<sub>1</sub>, *x*<sub>2</sub>, · · · , *x*<sub>*n*</sub>} has its value domain partitioned into *Q* quantile intervals. Each *x*<sub>*t*</sub> in the sequence can be mapped to the corresponding interval *q*<sub>*n*</sub> (*n* ∈ [1, *Q*]). By calculating the state transfer probabilities through the Markov chain principle, a state transfer probability matrix *W* of size *Q* × *Q* can be obtained, as shown in Equation (1), where *w*<sub>*ij*</sub> denotes the probability that a sample point in interval *q*<sub>*j*</sub> at moment *t* is transferred to interval *q*<sub>*i*</sub> at moment *t* + 1 [26].

$$W = \begin{bmatrix} w\_{11|P(x\_{t+1} \in q\_1 | x\_t \in q\_1)} & \cdots & w\_{1Q|P(x\_{t+1} \in q\_1 | x\_t \in q\_Q)} \\ w\_{21|P(x\_{t+1} \in q\_2 | x\_t \in q\_1)} & \cdots & w\_{2Q|P(x\_{t+1} \in q\_2 | x\_t \in q\_Q)} \\ \vdots & \ddots & \vdots \\ w\_{Q1|P(x\_{t+1} \in q\_Q | x\_t \in q\_1)} & \cdots & w\_{QQ|P(x\_{t+1} \in q\_Q | x\_t \in q\_Q)} \end{bmatrix} \tag{1}$$

By incorporating the temporal information into the state transfer probability matrix *W* and arranging each state transition probability *w*<sub>*ij*</sub> in time sequence, a Markov transition field (MTF) matrix *M* of size *n* × *n* is obtained, as shown in Equation (2), where *m*<sub>*ij*</sub> denotes the transition probability *w*<sub>*ij*</sub> between the intervals (*q*<sub>*j*</sub> → *q*<sub>*i*</sub>) in which the sample points are located in time sequence.

$$M = \begin{bmatrix} m\_{11} & m\_{12} & \cdots & m\_{1n} \\ m\_{21} & m\_{22} & \cdots & m\_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m\_{n1} & m\_{n2} & \cdots & m\_{nn} \end{bmatrix} = \begin{bmatrix} w\_{ij|x\_1 \in q\_i, x\_1 \in q\_j} & \cdots & w\_{ij|x\_1 \in q\_i, x\_n \in q\_j} \\ w\_{ij|x\_2 \in q\_i, x\_1 \in q\_j} & \cdots & w\_{ij|x\_2 \in q\_i, x\_n \in q\_j} \\ \vdots & \ddots & \vdots \\ w\_{ij|x\_n \in q\_i, x\_1 \in q\_j} & \cdots & w\_{ij|x\_n \in q\_i, x\_n \in q\_j} \end{bmatrix} \tag{2}$$

The elements *m*<sub>*ij*</sub> in the MTF matrix are transformed as pixel points into a two-dimensional feature image with temporal correlation. As the number of sample points selected directly affects the size of the generated coded image, it is clearly inappropriate for an image with too large a size to be used directly as input to the CNN. To improve computational efficiency, a fuzzy kernel {1/*m*<sup>2</sup>}<sub>*m*×*m*</sub> is used to pixel average each region without overlap. Figure 1 shows images of different fault types after encoding each sample, consisting of 8192 sampling points, using MTF image encoding and subsequently subjecting them to pixel averaging processing. Compared to traditional time domain analysis methods, MTF encoding images preserve time-related information and enable clearer differentiation of various fault types in rolling bearings.

**Figure 1.** MTF-encoded images of 8 types of fault: (**a**) normal; (**b**) inner ring; (**c**) outer ring; (**d**) rolling element; (**e**) inner ring + outer ring; (**f**) inner ring + rolling element; (**g**) outer ring + rolling element; (**h**) inner ring + outer ring + rolling element.
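The encoding described by Equations (1) and (2), followed by non-overlapping pixel averaging, can be sketched in dependency-free Python as below. This is an illustrative sketch rather than the code used in the study: the function names, the bin count of 8, the synthetic test signal and the block size of 4 are all assumptions, and an array library would normally be used in practice because a full 8192 × 8192 field is very large.

```python
import math
from bisect import bisect_right


def quantile_states(x, n_bins):
    """Map each sample to a quantile bin index in 0..n_bins-1."""
    ordered = sorted(x)
    n = len(ordered)
    # Interior edges chosen so that each bin holds roughly n / n_bins samples.
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(edges, v) for v in x]


def transition_matrix(states, n_bins):
    """W[i][j] = P(x_{t+1} in q_i | x_t in q_j), so each column sums to 1."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for prev, nxt in zip(states, states[1:]):
        counts[nxt][prev] += 1
    col_sums = [sum(counts[i][j] for i in range(n_bins)) for j in range(n_bins)]
    return [[counts[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n_bins)] for i in range(n_bins)]


def markov_transition_field(x, n_bins=8):
    """M[k][l] = w_ij with x_k in q_i and x_l in q_j, as in Equation (2)."""
    states = quantile_states(x, n_bins)
    w = transition_matrix(states, n_bins)
    return [[w[si][sj] for sj in states] for si in states]


def block_average(image, m):
    """Average non-overlapping m x m blocks (the {1/m^2} kernel)."""
    size = len(image) // m
    return [[sum(image[r * m + a][c * m + b] for a in range(m) for b in range(m)) / (m * m)
             for c in range(size)] for r in range(size)]


# Synthetic stand-in for a vibration sample (assumption, not measured data).
signal = [math.sin(0.05 * t) + 0.3 * math.sin(0.7 * t) for t in range(256)]
states = quantile_states(signal, 8)
w = transition_matrix(states, 8)
mtf = markov_transition_field(signal, n_bins=8)
image = block_average(mtf, m=4)
print(len(mtf), len(image), len(image[0]))
```

Each entry of the averaged image is a probability between 0 and 1, so it can be mapped directly to a pixel intensity or passed through a color map.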

## **3. Multisignal Fusion**

To enhance system stability and increase diagnostic reliability, this article collected vibration signals and acoustic emission signals and fused them for processing. This fusion processing can establish correlations between multiple signal sources. Usually, information fusion can be divided into three levels: data-level fusion, feature-level fusion, and decision-level fusion. Considering that the sample data in this study consist of MTF encoded images of different fault types, it is advantageous to employ CNN for image processing. Therefore, this paper adopted the IFCNN for feature-level fusion of the data.

IFCNN consists of three modules, namely, the feature extraction module, the feature fusion module and the feature reconstruction module [27], and the structure of this framework is shown in Figure 2.

**Figure 2.** The structure of IFCNN.

The feature extraction module consists of two convolutional layers. The first layer uses the first convolutional layer of the ResNet101 network model, pretrained on the ImageNet dataset. This layer includes 64 convolutional kernels with a size of 7 × 7 and retains the training parameters, enabling effective extraction of image features. The second convolutional layer includes 64 convolutional kernels with a size of 3 × 3, which are used to adjust the features extracted by the first layer in order to adapt to feature fusion. For this study, the feature fusion module adopts an element-wise maximum fusion strategy. The final module is the image reconstruction module, in which the third convolutional layer includes 64 convolutional kernels with a size of 3 × 3. This layer adjusts the fused convolutional features and plays an important role in reconstructing the image. The fourth convolutional layer reconstructs the feature map with three-channel output, and it includes 3 convolutional kernels with a size of 1 × 1.

This framework uses the mean squared error (MSE) as the basic loss function and adds a perceptual loss to optimize the model. The expression for the perceptual loss (*P*<sub>loss</sub>) is as follows:

$$P\_{\rm loss} = \frac{1}{C\_f H\_f W\_f} \sum\_{i, x, y} \left[ f\_p^i(x, y) - f\_g^i(x, y) \right]^2 \tag{3}$$

where *f*<sub>*p*</sub> and *f*<sub>*g*</sub> are the feature maps of the predicted fused image and the true fused image, respectively; *i* is the feature map channel index; *C*<sub>*f*</sub>, *H*<sub>*f*</sub> and *W*<sub>*f*</sub> are the number of channels, height and width of the feature map, respectively. The expression for the basic loss (*B*<sub>loss</sub>) is as follows:

$$B\_{\rm loss} = \frac{1}{3 H\_g W\_g} \sum\_{i, x, y} \left[ I\_p^i(x, y) - I\_g^i(x, y) \right]^2 \tag{4}$$

where *I*<sub>*p*</sub> and *I*<sub>*g*</sub> are the predicted fused image and the true fused image, respectively; *i* is the RGB image channel index; *H*<sub>*g*</sub> and *W*<sub>*g*</sub> are the height and width of the true fused image, respectively. The expression for the total loss (*T*<sub>loss</sub>) is as follows:

$$T\_{\rm loss} = w\_1 B\_{\rm loss} + w\_2 P\_{\rm loss} \tag{5}$$

where *w*<sub>1</sub> and *w*<sub>2</sub> are the weighting coefficients. For the fusion of MTF-encoded images in this study, both coefficients are set to 1.
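As a small worked illustration of the element-wise maximum fusion strategy and of Equations (3) to (5), the following sketch operates on nested lists indexed as `[channel][row][column]`. The function names and toy values are hypothetical; the feature extractor that would produce the feature maps is not modeled here.

```python
def elementwise_max(a, b):
    """Fuse two equally shaped feature maps by keeping the larger value."""
    return [[[max(p, q) for p, q in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def mean_squared_difference(pred, target):
    """Mean of squared differences over all channels, rows and columns."""
    total, count = 0.0, 0
    for ch_p, ch_t in zip(pred, target):
        for row_p, row_t in zip(ch_p, ch_t):
            for p, t in zip(row_p, row_t):
                total += (p - t) ** 2
                count += 1
    return total / count


def total_loss(img_pred, img_true, feat_pred, feat_true, w1=1.0, w2=1.0):
    """T_loss = w1 * B_loss + w2 * P_loss, with both weights defaulting to 1."""
    b_loss = mean_squared_difference(img_pred, img_true)    # Eq. (4): 1 / (3 H W)
    p_loss = mean_squared_difference(feat_pred, feat_true)  # Eq. (3): 1 / (C H W)
    return w1 * b_loss + w2 * p_loss


# Toy 3-channel 2 x 2 images and a 1-channel 1 x 2 feature map.
img_pred = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
img_true = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
loss = total_loss(img_pred, img_true, [[[0.0, 2.0]]], [[[0.0, 0.0]]])
fused = elementwise_max([[[1, 5]]], [[[3, 2]]])
print(loss, fused)
```

Dividing by the element count reproduces the normalization factors in Equations (3) and (4), since the count equals the number of channels times height times width of the array being compared.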

## **4. Fault Diagnosis Method**

## *4.1. Optimized Deep Residual Network*

ResNet is built on the basis of CNN and solves the gradient vanishing problem by adding skip connections between the input and output of each convolutional layer. The classic residual module structure is shown in Figure 3.

**Figure 3.** Residual module structure.

The structure contains two mappings: the main path is called the residual mapping, and the bypass connection is called the identity mapping. The final output of the residual block is therefore the superposition of the outputs obtained from the two mappings:

$$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x} \tag{6}$$

The structure of the residual network model constructed in this study is shown in Table 3. It includes an input layer, a maximum pooling layer, convolutional layers, an average pooling layer, a fully connected layer and a softmax classifier. Conv2, Conv3, Conv4 and Conv5 are residual modules.

**Table 3.** ResNet model structure.

| Layer Name | Kernel Size | Channel | Stride | Padding | Output |
|---|---|---|---|---|---|
| Input Image | - | - | - | - | 3 × 224 × 224 |
| Conv1 | 7 × 7 | 64 | 2 | 3 | 64 × 112 × 112 |
| Maxpool | 3 × 3 | 64 | 2 | 1 | 64 × 56 × 56 |
| Conv2\_1 | 3 × 3 | 64 | 1 | 1 | 64 × 56 × 56 |
| Conv2\_2 | 3 × 3 | 64 | 1 | 1 | 64 × 56 × 56 |
| Conv2\_3 | 3 × 3 | 64 | 1 | 1 | 64 × 56 × 56 |
| Conv2\_4 | 3 × 3 | 64 | 1 | 1 | 64 × 56 × 56 |
| Conv3\_1 | 3 × 3 | 128 | 2 | 1 | 128 × 28 × 28 |
| Conv3\_2 | 3 × 3 | 128 | 1 | 1 | 128 × 28 × 28 |
| Conv3\_3 | 3 × 3 | 128 | 1 | 1 | 128 × 28 × 28 |
| Conv3\_4 | 3 × 3 | 128 | 1 | 1 | 128 × 28 × 28 |
| Conv4\_1 | 3 × 3 | 256 | 2 | 1 | 256 × 14 × 14 |
| Conv4\_2 | 3 × 3 | 256 | 1 | 1 | 256 × 14 × 14 |
| Conv4\_3 | 3 × 3 | 256 | 1 | 1 | 256 × 14 × 14 |

Convolutional layers are the core of CNNs, responsible for extracting features from large amounts of input data. Typically, convolutional layers can be described by the following expression:

$$\mathbf{x}\_{j}^{l} = \sigma(\sum\_{i \in M\_{j}} \mathbf{x}\_{j}^{l-1} \* k\_{ij}^{l} + b\_{j}^{l}) \tag{7}$$

where *x*<sub>*j*</sub><sup>*l*−1</sup> is the input of the (*l* − 1)-th layer of the network; *x*<sub>*j*</sub><sup>*l*</sup> is the output of the *l*-th layer of the network; *k*<sub>*ij*</sub><sup>*l*</sup> is the weight matrix of the convolutional kernel; *b*<sub>*j*</sub><sup>*l*</sup> is the bias term; *M*<sub>*j*</sub> is the set of input feature maps; *σ* is the nonlinear activation function; and ∗ represents the convolution operation.

Pooling aims to reduce the size of feature maps while retaining the most important feature information. It can effectively reduce computational complexity and improve the model's robustness and generalization capabilities. The pooling process involves four steps: input feature map, sliding window coverage, feature aggregation, and output feature map. The pooling process can be described by the following expression:

$$\mathbf{x}\_{j}^{l} = \sigma(\beta\_{j}^{l}down(\mathbf{x}\_{j}^{l-1}) + b\_{j}^{l}) \tag{8}$$

where *x*<sub>*j*</sub><sup>*l*−1</sup> is the input of the (*l* − 1)-th layer of the network; *x*<sub>*j*</sub><sup>*l*</sup> is the output of the *l*-th layer of the network; *b*<sub>*j*</sub><sup>*l*</sup> is the bias term; *σ* is the nonlinear activation function; *down*(·) is the down-sampling function; and *β*<sub>*j*</sub><sup>*l*</sup> is the weight.

To improve the efficiency of fault diagnosis, a convolutional block attention module (CBAM) is introduced to optimize the model by focusing it more on important features [28]. CBAM consists of channel attention module, which captures the connections between channels of the feature map, and spatial attention module, which captures the connections between spatial regions of the feature map.

The channel attention module applies average pooling and max pooling over the spatial dimensions of each channel to obtain the features *F*<sup>*c*</sup><sub>*avg*</sub> and *F*<sup>*c*</sup><sub>*max*</sub>, feeds each of them into a shared convolutional network, and sums the two results before output. The process is described as:

$$M\_c(F) = \sigma(W\_1(W\_0(F\_{avg}^c)) + W\_1(W\_0(F\_{max}^c))) \tag{9}$$

where *σ* is a sigmoid function; *W*<sub>0</sub> and *W*<sub>1</sub> are convolution operations with a convolution kernel size of 1 × 1.
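Equation (9) can be illustrated with a minimal sketch. Because a 1 × 1 convolution applied to a pooled descriptor of shape *C* × 1 × 1 reduces to a matrix-vector product, *W*<sub>0</sub> and *W*<sub>1</sub> are represented as plain matrices here. The ReLU between them follows the original CBAM design and is an assumption with respect to this text, as are the function names and the toy weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Matrix-vector product; equivalent to a 1 x 1 convolution on C x 1 x 1 input."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def channel_attention(feature, w0, w1):
    """Return one weight in (0, 1) per channel of a [C][H][W] feature map.

    w0 has shape [C_reduced][C] and w1 has shape [C][C_reduced].
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]

    def shared(v):
        hidden = [max(0.0, h) for h in matvec(w0, v)]  # ReLU (assumed, per CBAM)
        return matvec(w1, hidden)

    return [sigmoid(a + m) for a, m in zip(shared(avg), shared(mx))]


# Toy two-channel feature map with hypothetical weights.
feature = [[[1.0, 1.0], [1.0, 1.0]],
           [[0.0, 0.0], [0.0, 4.0]]]
weights = channel_attention(feature, w0=[[1.0, 1.0]], w1=[[1.0], [-1.0]])
print(weights)
```

The resulting weights would then multiply their respective channels before the spatial attention step is applied.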

The spatial attention module performs a convolution operation on the features *F*<sup>*s*</sup><sub>*avg*</sub> and *F*<sup>*s*</sup><sub>*max*</sub>, which are obtained by average pooling and max pooling along the channel dimension and then concatenated. The process is described as:

$$M\_s(F) = \sigma(f^{7 \times 7}([F\_{avg}^s; F\_{max}^s])) \tag{10}$$

where *σ* is a sigmoid function; *f*<sup>7×7</sup> is a convolution operation with a convolution kernel size of 7 × 7.

This study introduced CBAM into ResNet without changing the overall structure of the network. The input data are MTF feature images of size 224 × 224. After passing through the first convolutional layer with a kernel size of 7 × 7 and a stride of 2, the image size is reduced to 112 × 112. This is followed by a max pooling layer with a stride of 2, which further reduces the data dimensionality and the image size to 56 × 56. The channel attention and spatial attention modules are added sequentially after the batch normalization (BN) layer at the end of the residual modules Conv2, Conv3, Conv4 and Conv5, respectively. After passing through the Conv2, which has 64 channels and convolutional kernels of size 3 × 3 with a stride of 1, deeper features are extracted while maintaining the same image size as the previous layer. The channels in Conv3, Conv4, and Conv5 are doubled successively to 128, 256 and 512. At the same time, down-sampling is implemented in the first convolutional layer with a stride of 2 in each residual module. This results in output image sizes that progressively decrease to 28 × 28, 14 × 14 and 7 × 7, respectively. Afterwards, the network passes through an average pooling layer to reduce the number of parameters and mitigate the occurrence of overfitting. Then, a fully connected layer is used for nonlinear combination of the extracted features, followed by a softmax classifier to produce the final output.
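The feature map sizes quoted above follow from the standard output-size formula for convolution and pooling layers, ⌊(*n* + 2*p* − *k*)/*s*⌋ + 1. The sketch below traces a 224 × 224 input through the down-sampling stages. Kernel, stride and padding values follow the Conv1 to Conv4 settings listed for Table 3; the Conv5 settings are assumed to match those of Conv3 and Conv4, and only the first (down-sampling) layer of each residual module is traced.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the layers that can change the image size.
stages = [
    ("Conv1", 7, 2, 3),
    ("Maxpool", 3, 2, 1),
    ("Conv2", 3, 1, 1),
    ("Conv3", 3, 2, 1),
    ("Conv4", 3, 2, 1),
    ("Conv5", 3, 2, 1),  # assumed to mirror Conv3 and Conv4
]

size = 224
sizes = []
for name, kernel, stride, padding in stages:
    size = conv_output_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name}: {size} x {size}")
```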

The proposed model uses a cross-entropy loss function to evaluate the error between the predicted and true values, avoiding gradient dispersion, which is defined in the context of a multiclassification problem as:

$$L = \frac{1}{N} \sum\_{i} L\_i = -\frac{1}{N} \sum\_{i} \sum\_{c=1}^{M} y\_{ic} \log(p\_{ic}) \tag{11}$$

where *N* is the number of samples; *M* is the number of categories; *y*<sub>*ic*</sub> is the indicator function, taking 1 if the true class of sample *i* is *c* and 0 otherwise; and *p*<sub>*ic*</sub> is the predicted probability that sample *i* belongs to category *c*.
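For one-hot labels, Equation (11) reduces to the average negative log-probability assigned to the true class. The following sketch shows this together with a numerically stable softmax; the function names and toy inputs are illustrative only.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, labels):
    """Equation (11) with one-hot y_ic: mean of -log p for the true class."""
    return -sum(math.log(p[c]) for p, c in zip(probabilities, labels)) / len(labels)


# Two samples, two classes; true classes are 0 and 1.
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])

# An untrained 8-class model with equal scores gives a loss of ln(8).
uniform_loss = cross_entropy([softmax([0.0] * 8)], [3])
print(loss, uniform_loss)
```

The uniform case is a useful reference point: a loss close to ln 8 for the eight bearing states means the classifier is doing no better than guessing.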

An initial test was carried out at a constant speed of 1600 rpm and a load of 7 kN, with the number of epochs set to 50. The loss and accuracy (Acc) during training are shown in Figure 4.


**Figure 4.** Loss and accuracy in training with 50 epochs: (**a**) loss; (**b**) acc.

Overall, the graph shows that by epoch 40 the loss and accuracy have essentially converged, and the accuracy has reached nearly 100%. This indicates that the model performs well on the training set and has good generalization ability, which also supports the model structure and parameters chosen in this paper. Setting the number of epochs too large can significantly prolong the training time and even cause overfitting, while setting it too small may prevent the model from reaching the global optimal solution. After multiple tests, this paper set the learning rate to 0.001 and the number of epochs to 40.

To intuitively demonstrate the advantages of the proposed method in extracting fault features, this paper utilized the uniform manifold approximation and projection (UMAP) algorithm to perform dimensionality reduction on the data and visualize the results. Taking the steady state condition with a speed of 1600 rpm and a load of 7 kN as an example, this paper conducted a layer-by-layer analysis of ResNet models with and without CBAM and extracted the output features of the intermediate layers for calculation. Then, UMAP was utilized to reduce the dimensionality of the extracted features to two dimensions. This paper extracted the fault features from the avgpool layer and visualized the results using a scatter plot where different fault types are marked with different colors. The visualization is shown in Figure 5.

In the CBAM spatial attention module, $\sigma$ is a sigmoid function and $f^{7 \times 7}$ is a convolution operation with a convolution kernel of size 7 × 7.

This study introduced CBAM into ResNet without changing the overall structure of the network. The input data are MTF feature images of size 224 × 224. After passing through the first convolutional layer with a kernel size of 7 × 7 and a stride of 2, the image size is reduced to 112 × 112. This is followed by a max pooling layer with a stride of 2, which further reduces the data dimensionality and the image size to 56 × 56. The channel attention and spatial attention modules are added sequentially after the batch normalization (BN) layer at the end of each of the residual modules Conv2, Conv3, Conv4 and Conv5. Conv2 has 64 channels and convolutional kernels of size 3 × 3 with a stride of 1, so it extracts deeper features while maintaining the same image size as the previous layer. The channels in Conv3, Conv4 and Conv5 are doubled successively to 128, 256 and 512. At the same time, down-sampling is implemented in the first convolutional layer of each of these residual modules with a stride of 2, so the output image sizes progressively decrease to 28 × 28, 14 × 14 and 7 × 7, respectively. Afterwards, the network passes through an average pooling layer to reduce the number of parameters and mitigate overfitting. Then, a fully connected layer is used for nonlinear combination of the extracted features.
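The feature-map sizes quoted above follow from the usual output-size formula for convolution and pooling layers. The sketch below traces them in plain Python. The padding values and the 3 × 3 pooling kernel are assumptions borrowed from the common ResNet-18 layout, since the text gives only kernel sizes and strides; the names `conv_out`, `LAYERS` and `trace_sizes` are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding). The padding values and the pooling kernel
# are assumed (standard ResNet-18 choices), not taken from the paper.
LAYERS = [
    ("conv1 7x7", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("Conv2 (64 ch)", 3, 1, 1),
    ("Conv3 (128 ch)", 3, 2, 1),
    ("Conv4 (256 ch)", 3, 2, 1),
    ("Conv5 (512 ch)", 3, 2, 1),
]


def trace_sizes(input_size: int = 224) -> list[int]:
    """Return the feature-map side length after each stage in LAYERS."""
    sizes = []
    size = input_size
    for _name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for (name, *_), size in zip(LAYERS, trace_sizes()):
        print(f"{name}: {size} x {size}")
```

Under these assumptions the trace reproduces the sequence 112, 56, 56, 28, 14 and 7 given in the text.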

The proposed model uses a cross-entropy loss function, which avoids gradient dispersion, to evaluate the error between the predicted and true values. It is defined as:

$$L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\sum_{c=1}^{M} y_{ic}\log(p_{ic}) \tag{11}$$

where *N* is the number of samples; *M* is the number of categories; $y_{ic}$ is the sign function, taking 1 if the true value of sample *i* is equal to *c* and 0 otherwise; and $p_{ic}$ is the predicted probability that sample *i* belongs to category *c*.

An initial test was carried out with a constant speed of 1600 rpm and a load of 7 kN. The number of epochs was set to 50, and the loss and accuracy (Acc) in training are shown in Figure 4.
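As a numerical check of the cross-entropy loss in Equation (11), the following minimal sketch evaluates it in plain Python. It assumes integer class labels in place of the one-hot indicator $y_{ic}$, so the inner sum over categories reduces to a single term per sample; the function name `cross_entropy` and the example probabilities are illustrative, not taken from the paper.

```python
import math


def cross_entropy(probs, labels):
    """Mean cross-entropy over N samples.

    probs[i][c] is the predicted probability p_ic; labels[i] is the true
    class index of sample i (equivalent to a one-hot y_ic).
    """
    n = len(labels)
    total = 0.0
    for p_i, true_class in zip(probs, labels):
        # y_ic is 1 only for the true class, so only log(p_i[true_class]) remains.
        total -= math.log(p_i[true_class])
    return total / n


if __name__ == "__main__":
    # Two hypothetical samples over three categories.
    probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    labels = [0, 1]
    print(cross_entropy(probs, labels))
```

A confident correct prediction contributes a loss near zero, while a uniform two-class prediction contributes log 2 per sample.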

**Figure 5.** Visualization of fault features extracted from the avgpool layer: (**a**) before the introduction of CBAM; (**b**) after the introduction of CBAM.


As can be seen from the figure above, the degree of clustering of the data samples differs significantly between the two models: introducing CBAM into ResNet yields a more distinct clustering effect in the avgpool layer. Therefore, it can be concluded that the proposed optimized ResNet has excellent abilities in extracting fault features under complex working conditions.

#### *4.2. Fault Diagnosis Process*

This paper proposes a compound fault diagnosis method of rolling bearings based on multisignal fusion and MTF-ResNet. The fused MTF-encoded images are input into the ResNet model for training, and the fault is intelligently diagnosed under different working conditions. The basic process is shown in Figure 6, and the main steps are as follows: (1) acquire vibration and acoustic emission signals; (2) generate feature images of size 224 × 224 by MTF encoding of the original data to build a training set and a test set; (3) fuse the MTF encoded images of the two signals using IFCNN; (4) input the training set into the optimized ResNet model built for training, and save the optimal parameters; and (5) test the test samples and output the results to complete the intelligent fault diagnosis.

**Figure 6.** Fault diagnosis process of rolling bearings based on multisignal fusion and MTF-ResNet.
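Step (2) of the process, MTF encoding, can be illustrated with a small pure-Python sketch. It assumes rank-based quantile binning and a first-order transition matrix, which is the usual construction of a Markov transition field. The function name `markov_transition_field` and the bin count are illustrative choices rather than the authors' settings, and the resizing to 224 × 224 and the color mapping are omitted.

```python
def markov_transition_field(series, n_bins=4):
    """Return the n x n Markov transition field of a 1-D series as nested lists."""
    n = len(series)
    if n < 2:
        raise ValueError("series must contain at least two samples")

    # 1) Quantile binning by rank, so every bin holds about n / n_bins samples.
    #    Tied values may land in neighbouring bins (a simplification).
    order = sorted(range(n), key=lambda k: series[k])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n

    # 2) Count first-order transitions between consecutive samples.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1

    # 3) Row-normalise the counts into transition probabilities.
    trans = []
    for row in counts:
        total = sum(row)
        trans.append([c / total if total else 0.0 for c in row])

    # 4) Entry (i, j) is the transition probability from the bin of sample i
    #    to the bin of sample j, which preserves the temporal ordering.
    return [[trans[bins[i]][bins[j]] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    field = markov_transition_field([0, 1, 2, 3, 0, 1, 2, 3], n_bins=4)
    for row in field:
        print(row)
```

Each entry of the field is a probability between 0 and 1, so the matrix can be rendered directly as an image; one field per signal (vibration and acoustic emission) would then be passed to the fusion step.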


#### **5. Fault Diagnosis Experiment**

#### *5.1. Experimental Design*

An NU216 cylindrical roller bearing was selected as the experimental bearing. Defects were artificially introduced to the inner and outer rings, as well as the rolling elements, using a YLP-MDF-152 laser marking machine from Han's Laser. In terms of the failure mechanism of bearings in actual working environments, alternating loads can cause cracks to form at a certain depth below the surface, which may then propagate to the surface and cause spalling. Fatigue spalling increases vibration and noise during rotation and is usually the main form of rolling bearing failure. Therefore, pitting was produced on the surface of the bearing at different locations to simulate early defects. The pitting diameter was set to 40 µm, and the depth was controlled by setting the laser energy to 30%. Eight types of faults, as described in Section 2, were designed using the fault position as the classification criterion.

In order to simulate the working conditions of metro traction motors, three additional speeds and three additional loads were included in the experimental design. In consideration of both actual working conditions and minimizing the impact of bearing degradation on the experiment, gradient speeds of 800 rpm (low), 1600 rpm (medium) and 2400 rpm (high) were chosen, along with gradient equivalent dynamic loads of 5 kN (light), 7 kN (medium) and 9 kN (heavy) as the radial loads. There are a total of 72 (8 × 3 × 3) subexperiments. The experimental arrangement is shown in Table 4.
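The count of 72 subexperiments is the Cartesian product of fault types, speeds and loads. The short sketch below enumerates it; the order of the fault types in `FAULT_TYPES` is an assumption, and the names `FAULT_TYPES`, `SPEEDS_RPM`, `LOADS_KN` and `sub_experiments` are illustrative.

```python
from itertools import product

# Eight fault types (including the normal state); the ordering is assumed.
FAULT_TYPES = [
    "Normal",
    "Inner Ring",
    "Outer Ring",
    "Rolling Element",
    "Inner Ring + Outer Ring",
    "Inner Ring + Rolling Element",
    "Outer Ring + Rolling Element",
    "Inner Ring + Outer Ring + Rolling Element",
]
SPEEDS_RPM = [800, 1600, 2400]  # low, medium, high
LOADS_KN = [5, 7, 9]            # light, medium, heavy


def sub_experiments():
    """Return every (fault type, speed, load) combination."""
    return list(product(FAULT_TYPES, SPEEDS_RPM, LOADS_KN))


if __name__ == "__main__":
    runs = sub_experiments()
    print(len(runs))  # 8 x 3 x 3 combinations
    print(runs[0])
```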



#### *5.2. Construction of the Signal Acquisition System*

This study utilized the intelligent testing platform for comprehensive bearing performance, jointly developed by Henan University of Science and Technology, Luoyang Bearing Research Institute, and Intelligent Numerical Control Equipment Henan Provincial Engineering Laboratory, as the signal acquisition system. The testing machine allows for a maximum inner diameter of 120 mm, a maximum speed of 5000 r/min, a maximum radial load of 300 kN, and a maximum axial load of 200 kN for the bearing. The platform is equipped with a PCI-8 acoustic emission transmitter, two R50S-TC acoustic emission sensors, two LC0151T acceleration sensors, two LC0201-5 signal conditioners, and a PCI8510 data acquisition card.

During the experiment, a healthy bearing and a faulty bearing were installed at the two ends of the testing machine's spindle, and vibration and acoustic emission signals were collected from both bearings simultaneously. The loading system applies radial loads to the spindle via a pair of NU2218 cylindrical roller bearings, and these loads are in turn transferred to the test bearings at both ends of the spindle. The sensor signals are amplified and conditioned by signal amplifiers and signal conditioners and then input to the computer through a PCI acquisition card. The principle of the signal acquisition system is shown in Figure 7. The physical set-up of the system is shown in Figure 8.

**Table 4.** Experimental arrangement. Entries are the radial loads (kN) applied for each fault type at each speed.

| Speed/rpm | Normal | Inner Ring | Outer Ring | Rolling Element | Inner Ring + Outer Ring | Inner Ring + Rolling Element | Outer Ring + Rolling Element | Inner Ring + Outer Ring + Rolling Element |
|---|---|---|---|---|---|---|---|---|
| 800 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 800 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 800 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| 1600 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 1600 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 1600 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| 2400 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 2400 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 2400 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |

**Figure 7.** Schematic diagram of the signal acquisition system.

**Figure 8.** Photograph of the built signal acquisition system.

#### *5.3. Diagnostic Scheme Design*

To further validate the effectiveness of the proposed method, three types of diagnostic schemes were designed for single working condition changes, compound working condition changes, and generic working conditions, considering two different factors (speed and load) that affect the test results.

When studying single working condition changes, the speed is first held constant: data for two different loads are placed in the training set, and data for the remaining load are placed in the test set to verify the robustness of the model. When the load is held constant, the procedure is analogous. The specific diagnostic scheme is shown in Table 5.

**Table 5.** Diagnostic scheme for single working condition change.

| Constant factor | Varied factor | Training set | Test set |
|---|---|---|---|
| Speed/rpm | Load/kN | 5, 7 | 9 |
| | | 5, 9 | 7 |
| | | 7, 9 | 5 |
| Load/kN | Speed/rpm | 800, 1600 | 2400 |
| | | 800, 2400 | 1600 |
| | | 1600, 2400 | 800 |
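The single-condition schemes of Table 5 are leave-one-out splits over one factor while the other is held constant. The sketch below enumerates them. The dictionary layout and the name `single_condition_schemes` are illustrative, and the ordering (constant speed first, held-out value descending) is an assumption chosen to be consistent with the discussion in Section 6.1, where every third item tests the lightest load or the lowest speed.

```python
SPEEDS_RPM = [800, 1600, 2400]
LOADS_KN = [5, 7, 9]


def single_condition_schemes():
    """Enumerate train/test splits in which only one factor changes."""
    schemes = []
    # Constant speed: train on two loads, test on the remaining load.
    for speed in SPEEDS_RPM:
        for held_out in reversed(LOADS_KN):
            schemes.append({
                "constant": ("speed_rpm", speed),
                "varied": "load_kN",
                "train": [v for v in LOADS_KN if v != held_out],
                "test": [held_out],
            })
    # Constant load: train on two speeds, test on the remaining speed.
    for load in LOADS_KN:
        for held_out in reversed(SPEEDS_RPM):
            schemes.append({
                "constant": ("load_kN", load),
                "varied": "speed_rpm",
                "train": [v for v in SPEEDS_RPM if v != held_out],
                "test": [held_out],
            })
    return schemes


if __name__ == "__main__":
    for number, scheme in enumerate(single_condition_schemes(), start=1):
        print(number, scheme)
```

This yields 18 schemes in total: nine with constant speed and nine with constant load.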

When studying compound working condition changes, the training set is required to contain data with different speeds and loads at the same time. For generic working conditions, data for all fault types under all conditions are required to exist in both the training and testing sets.

#### **6. Experimental Results and Comparison of Methods**

During the operation of a metro system, variations in bearing speed and load are inevitable. Because steady-state tests alone have certain limitations, it is crucial to analyze the results of variable working condition tests to validate the effectiveness of the proposed method. To further explore the changes in compound working conditions, an additional analysis comparing the fusion of acoustic emission and vibration signals with a single signal was incorporated to emphasize the advantages of the proposed method. In the generic working condition tests, the feature extraction capabilities of four models, namely the proposed model, RepVGG, CBAM-CNN and ResNet, were compared to evaluate their performance.

#### *6.1. Single Working Condition Changes*

Based on the fault diagnosis method proposed in Section 5.3, with the control of constant speed and load, the training set was input into the model constructed in this paper, and fault diagnosis was performed on the test set. The diagnostic results are shown in Table 6.


**Table 6.** The diagnostic results for single working condition changes.

The table shows that when the speed is held constant and the load is varied, the fault diagnosis accuracy reaches nearly 100%. Conversely, when the load is held constant and the speed is varied, the fault diagnosis accuracy decreases, indicating a substantial influence of rotational speed on diagnostic outcomes. Further analysis reveals that the accuracy of items numbered 12, 15 and 18 is significantly lower, whereas items numbered 3, 6 and 9 achieve accuracy close to 100%, albeit slightly lower than the other items among the first nine. This discrepancy can be attributed to the fact that fault characteristics extracted under medium-to-high speed and medium-to-heavy load conditions are more discernible than those extracted under low-speed and light-load conditions.

#### *6.2. Compound Working Condition Changes*

Mixed data with different speeds and loads were included in the training set and used to train the model proposed for fault diagnosis on the testing set. Subsequently, a comparison was made between the fusion of acoustic emission and vibration signals and using a single signal. The diagnostic results are shown in Table 7.


**Table 7.** The diagnostic results for compound working condition changes.

The table clearly indicates that the diagnostic results of items numbered 4 to 6 surpass those of items numbered 1 to 3. Notably, the training and testing sets for items numbered 1 to 3 encompass varying rotation speeds, whereas items numbered 4 to 6 involve different loads. It is observed that the diagnostic accuracy of items numbered 4 to 6 remains relatively stable, whereas item numbered 3 exhibits significantly lower accuracy compared to items numbered 1 and 2. The underlying reason behind this phenomenon aligns with the findings presented in Section 6.1 of this paper.

From the standpoint of signal acquisition, the fusion of acoustic emission and vibration signals yields higher diagnostic accuracy in fault diagnosis compared to utilizing a single signal. This finding provides further substantiation that the application of multisignal fusion technology can effectively enhance system stability and diagnostic accuracy. Furthermore, it is evident that employing a single vibration signal for diagnostics yields superior results in comparison to employing a single acoustic emission signal. This can be attributed to the fact that the acoustic emission acquisition system exhibits heightened sensitivity to environmental noise, primarily stemming from the operational testing equipment, which poses challenges in noise elimination.

#### *6.3. Generic Working Conditions*

To evaluate the performance of the proposed fault diagnosis model, all fault samples involving three different speeds and three different loads were included in both the training and testing sets. The sample ratio between the two sets was set to 9:1 to ensure the training set was large enough to enable the model to effectively learn the fault data while still reserving an adequate number of samples for testing. Subsequently, the model was applied to diagnose faults on the testing set. To visualize the diagnostic results, a confusion matrix was employed, providing an intuitive and reliable representation of classifications made by the model. The confusion matrix is presented in Figure 9.

**Figure 9.** The diagnostic results for generic working condition: (**a**) based on vibration signal (with an accuracy rate of 97%); (**b**) based on acoustic emission signal (with an accuracy rate of 94.88%); (**c**) based on the fusion of acoustic emission and vibration signals (with an accuracy rate of 99.25%).
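For readers unfamiliar with the representation, a confusion matrix of the kind shown in Figure 9 can be built as in the minimal sketch below. The labels in the example are hypothetical and are not the experimental data; rows are taken to be true classes and columns predicted classes, which is a common but not universal convention.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Overall accuracy: correctly classified samples over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    # Hypothetical labels for three classes, for illustration only.
    y_true = [0, 1, 1, 2]
    y_pred = [0, 1, 2, 2]
    cm = confusion_matrix(y_true, y_pred, n_classes=3)
    print(cm)
    print(accuracy(cm))
```

Off-diagonal entries identify which fault types are confused with each other, which is how a misclassification between two similar fault types shows up in the figure.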

The confusion matrix provides a clear and intuitive visualization of the model's misclassifications and the types of errors. It can be seen that the overall diagnostic performance is good, and the accuracy rate for the fusion of acoustic emission and vibration signals is almost 100%. However, the diagnosis accuracy rate for label 6, which corresponds to the "outer ring + rolling element pitting" fault type, is relatively low. The model misclassified three test samples as "rolling element pitting". Further analysis revealed that the two types of faults have similar features, making it difficult to extract differences between them. Comparing (a–c) in Figure 9 further confirms that multisignal fusion technology has higher reliability and accuracy than a single signal, especially under changing working conditions.

To compare the feature extraction capabilities of different models, the training and testing set samples of the above-mentioned generic working conditions were input into the RepVGG, CBAM-CNN and ResNet models, respectively, for diagnosis. Two types of faults were selected as examples: label 1 (corresponding to "inner ring pitting") with better diagnostic results and label 6 (corresponding to "outer ring + rolling element pitting") with poorer results. The precision–recall (PR) curves and receiver operating characteristic (ROC) curves were generated for the optimized ResNet, RepVGG, CBAM-CNN and ResNet models, and evaluation indicators such as average precision (AP) and area under the curve (AUC) were introduced.

The precision–recall (PR) curve is a graphical representation of the performance of a binary classification model, with recall on the x-axis and precision on the y-axis. It illustrates the trade-off between precision and recall at various classification thresholds. The relevant theoretical formulas for the PR curve are as follows:

$$Precision = \frac{TP}{TP + FP} \tag{12}$$

$$Recall = \frac{TP}{TP + FN} \tag{13}$$

where *TP* represents the number of true positive instances; *FP* represents the number of false positive instances; and *FN* represents the number of false negative instances.

The principle of average precision (AP) is to summarize the Precision-Recall (PR) curve by calculating the average precision value. It can be obtained by computing the area under the PR curve. It provides a comprehensive assessment of how well the model balances precision and recall across different recall levels.
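Equations (12) and (13) and the AP summary can be illustrated with a small sketch. It assumes binary labels (1 for the fault type of interest, 0 otherwise, i.e., a one-vs-rest view) and distinct scores, and it computes AP as the step-wise sum of precision over recall increments, which is one common convention for the area under the PR curve. The function names and the example scores are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts, as in Equations (12) and (13)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(labels, scores):
    """Step-wise area under the PR curve for binary labels."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Precision at this rank; each positive raises recall by 1 / n_pos.
            ap += tp / rank
    return ap / n_pos


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(precision_recall(tp=8, fp=2, fn=2))
    print(average_precision(labels, scores))
```

For a multi-class problem such as the eight fault types here, the same calculation would be repeated per class and then averaged.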

The receiver operating characteristic (ROC) curve is a tool used to evaluate the performance of binary classification models. It plots the false positive rate (*FPR*) on the x-axis and the true positive rate (*TPR*) on the y-axis. The principle of the ROC curve can be described using the following formulas:

$$TPR = \frac{TP}{TP + FN} \tag{14}$$

$$FPR = \frac{FP}{FP + TN} \tag{15}$$

where *FP* represents the number of negative instances incorrectly classified as positive; *TN* represents the number of negative instances correctly classified as negative; *TP* represents the number of positive instances correctly classified as positive; and *FN* represents the number of positive instances incorrectly classified as negative.

Area under the curve (AUC) is obtained by calculating the area under the ROC curve. The resulting AUC value ranges from 0 to 1, where 0.5 represents a random classifier and 1 represents a perfect classifier. A higher AUC value indicates better classifier performance.
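The rates in Equations (14) and (15) and the AUC can be sketched as follows. Instead of integrating the ROC curve numerically, this sketch uses the equivalent pairwise formulation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. Binary labels are assumed, and the function names and example scores are illustrative.

```python
def tpr_fpr(tp, fp, tn, fn):
    """True and false positive rates, as in Equations (14) and (15)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def auc_pairwise(labels, scores):
    """Area under the ROC curve via positive/negative score comparisons."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(tpr_fpr(tp=8, fp=1, tn=9, fn=2))
    print(auc_pairwise(labels, scores))
```

A classifier that ranks every positive above every negative scores 1, and one that assigns the same score to all samples scores 0.5, matching the interpretation given above.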

The diagnostic results are presented in the form of PR and ROC curves in Figures 10 and 11. The overall accuracy rate, AP and AUC for all fault types were calculated for the four models, and the weighted average values were recorded in Table 8.

**Figure 10.** PR curves of four models with two fault types.

**Figure 11.** ROC curves of four models with two fault types.

**Table 8.** The accuracy evaluation indicators of the four models.

| Model | Accuracy/% | AP | AUC |
|---|---|---|---|
| Proposed | 99.25 | 0.989 | 1.000 |
| RepVGG | 96.72 | 0.967 | 0.996 |
| CBAM-CNN | 94.16 | 0.953 | 0.993 |
| ResNet | 88.35 | 0.935 | 0.988 |



Generally, the closer the PR curve in Figure 10 is to the upper right corner, the larger the AP value and the better the model performance. The closer the ROC curve in Figure 11 is to the upper left corner, the larger the AUC value and the better the model performance. For the two selected fault types with different diagnostic effects, the PR and ROC curves of the proposed model are both closer to the corresponding corner than those of RepVGG, CBAM-CNN and ResNet, indicating better performance. Combined with the data in Table 8, the three accuracy evaluation indicators of the proposed model are higher than those of the compared models, validating the good feature extraction ability of the proposed model.

#### **7. Conclusions**

This paper focused on the study of the feature extraction ability of the model for complex working conditions, using the metro traction motor bearings as the research object. On the basis of ResNet, CBAM was introduced to optimize the ResNet model. Nine different working conditions and eight compound fault types were designed for experimentation. In addition, a dataset was constructed using MTF image encoding and IFCNN image fusion technology. During the model training process, UMAP was used for visualization to intuitively demonstrate the feature extraction effect of the proposed model. After the experiment, three evaluation indicators were used for objective evaluation of the feature extraction ability of the optimized ResNet, RepVGG, CBAM-CNN and ResNet models.

The results of the experiment show that the MTF-ResNet model with multisignal fusion performs well under complex working conditions, with a diagnostic accuracy rate of up to 99.25%. Based on the results, some important conclusions can be drawn. In terms of sensors, using only vibration signals produces better diagnostic results than using only acoustic emission signals. In addition, compared with a single signal, fusing acoustic emission and vibration signals provides more comprehensive and integrated information and reduces the misclassifications caused by the limitations of a single signal, thereby improving fault diagnosis accuracy and making the diagnosis result more reliable. In terms of data processing, MTF image encoding is a simple data processing method that retains the time correlation of the data, making it easier for the model to extract more comprehensive fault features. For feature extraction models, introducing CBAM after the batch normalization layers of the ResNet model makes the model focus on capturing important features, so that it quickly distinguishes different types of fault features and improves diagnostic efficiency. Furthermore, the ResNet structure can effectively alleviate the vanishing gradient problem that occurs as the network deepens, thereby preventing model degradation.

Undoubtedly, this study presents several avenues for future research in the proposed methodologies. Firstly, the inclusion of additional sensors or exploration of different sensor types holds promise. For instance, incorporating multidirectional vibration sensors or temperature sensors could offer a more comprehensive spectrum of fault information, thereby enhancing diagnostic fault tolerance. Secondly, exploring more advanced data processing techniques warrants investigation to enhance the quality of input signals. The acoustic emission signals acquired in this study exhibited significant levels of environmental noise that proved challenging to eliminate. Therefore, employing sophisticated techniques may substantially improve the value derived from these acoustic emission signals. Moreover, conducting model testing on larger datasets utilizing more complex compound faults can effectively confirm the feature extraction capabilities and generalization of the model. This approach will serve as a more robust means of validation. Furthermore, future research focusing on feature extraction models should prioritize the development of lightweight and efficient models to facilitate practical implementation.

Despite the inherent limitations of the methods proposed in this paper, they exhibit commendable feature extraction capabilities within intricate operational scenarios. Consequently, these methods hold potential for application in fault diagnosis tasks related to metro traction motor bearings, thereby possessing appreciable value in engineering applications.

**Author Contributions:** Conceptualization, K.H. and Y.X.; methodology, K.H. and Y.X.; validation, K.H., Y.X. and Y.W.; formal analysis, K.H.; investigation, K.H.; data curation, K.H. and Y.W.; writing—original draft preparation, K.H.; writing—review and editing, K.H., Y.X. and J.W.; visualization, K.H.; supervision, T.X.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number: 51805151) and the Key Scientific Research Project of the University of Henan Province of China (grant number: 21B460004).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
