### **a. Outer race fault**

When a local fault, such as a spall, exists in the outer race, the switch function *βj* is expressed as

$$\beta_j = \begin{cases} 1 & \text{if } \phi_d < \phi_j < \phi_d + \Delta\phi_d \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

In this case, the spall is fixed in the outer race, spanning from the defined angular position *φd* to *φd* + Δ*φd*. Here, *φd* is a constant value, and Δ*φd* is related to the fault length.
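As a minimal sketch, Equation (8) can be evaluated numerically as follows; the function and variable names (`beta_outer`, `phi_j`, `phi_d`, `delta_phi_d`) are illustrative, and the sketch assumes the spall interval does not wrap past 2π:

```python
import numpy as np

def beta_outer(phi_j, phi_d, delta_phi_d):
    """Switch function for an outer race fault, Equation (8).

    Returns 1 where the angular position phi_j of rolling element j
    falls inside the fixed spall region (phi_d, phi_d + delta_phi_d).
    """
    phi_j = np.mod(phi_j, 2 * np.pi)  # wrap angles to [0, 2*pi)
    return np.where((phi_j > phi_d) & (phi_j < phi_d + delta_phi_d), 1.0, 0.0)
```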

### **b. Inner race fault**

In the case of the inner race fault, the local fault rotates with the inner race and the shaft. The switch function *βj* is given as

$$\beta_j = \begin{cases} 1 & \text{if } \omega_s t + \phi_{d0} < \phi_j < \omega_s t + \phi_{d0} + \Delta\phi_d \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

In this case, the angular position of the fault *φd* changes with the speed of the shaft. Here, *φd* = *ωst* + *φd*0, where *ωs* denotes the angular velocity of the shaft and *φd*0 is the initial angular position of the fault.
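Under the same assumptions as the sketch above, the rotating spall of Equation (9) only requires shifting the reference frame by the shaft angle; `beta_inner` and its arguments are again illustrative names:

```python
def beta_inner(phi_j, t, omega_s, phi_d0, delta_phi_d):
    """Switch function for an inner race fault, Equation (9).

    The spall rotates with the shaft; its leading edge sits at
    phi_d(t) = omega_s * t + phi_d0.
    """
    # Angular distance of element j past the moving spall edge, wrapped to [0, 2*pi)
    rel = np.mod(phi_j - (omega_s * t + phi_d0), 2 * np.pi)
    return np.where(rel < delta_phi_d, 1.0, 0.0)
```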

### **c. Fault in rolling elements**

It is more complicated when a local fault occurs in a rolling element, since the fault rotates with the rolling element itself. The angular position of the fault is described as

$$\begin{aligned} \phi_s &= \omega_r t + \phi_{d0} \\ \omega_r &= \frac{\omega_s}{2} \frac{D_p}{D_b} \left(1 - \left(\frac{D_b}{D_p} \cos \alpha\right)^2\right) \end{aligned} \tag{10}$$

where *ωr* is the angular velocity of the rolling element, and *α* is the contact angle.

When a fault exists in rolling element *k*, it makes contact with both the inner and outer races. The switch values and the fault periods differ between the two races due to the difference in raceway curvature. Therefore, the switch function *βj* is defined as

$$\beta_j = \begin{cases} 0, & j \neq k \\ 1, & \text{if } 0 < \phi_s < \Delta\phi_{do},\ j = k \\ \frac{c_{dr} + c_{di}}{c_{dr} - c_{do}}, & \text{if } \pi < \phi_s < \pi + \Delta\phi_{di},\ j = k \\ 0, & \text{otherwise},\ j = k \end{cases} \tag{11}$$

with

$$c_{dr} = \frac{D_b}{2} - \sqrt{\left(\frac{D_b}{2}\right)^2 - x^2}, \quad c_{di} = r_i - \sqrt{r_i^2 - x^2}, \quad c_{do} = r_o - \sqrt{r_o^2 - x^2}, \quad r_i = \frac{D_p - D_b}{2}, \quad r_o = \frac{D_p + D_b}{2}$$
$$\Delta\phi_{do} = \frac{2x}{r_o}, \quad \Delta\phi_{di} = \frac{2x}{r_i}$$

where *x* is half of the spall width. For more details, please refer to [35].
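A sketch of the rolling element case in Equations (10) and (11), for scalar inputs, might read as follows; it assumes the raceway radii *ri* = (*Dp* − *Db*)/2 and *ro* = (*Dp* + *Db*)/2 reconstructed above, and all names are illustrative:

```python
import numpy as np

def beta_ball(t, omega_r, phi_d0, D_b, D_p, x, j, k):
    """Switch function for a spall on rolling element k, Equations (10)-(11)."""
    if j != k:
        return 0.0                                     # only element k carries the fault
    r_i = (D_p - D_b) / 2.0                            # inner raceway radius (assumed)
    r_o = (D_p + D_b) / 2.0                            # outer raceway radius (assumed)
    # Spall depth on the ball and the curvature corrections of each race
    c_dr = D_b / 2.0 - np.sqrt((D_b / 2.0) ** 2 - x ** 2)
    c_di = r_i - np.sqrt(r_i ** 2 - x ** 2)
    c_do = r_o - np.sqrt(r_o ** 2 - x ** 2)
    # Angular extent of the fault as seen by each race
    dphi_do = 2.0 * x / r_o
    dphi_di = 2.0 * x / r_i
    phi_s = np.mod(omega_r * t + phi_d0, 2.0 * np.pi)  # fault position on the ball
    if 0.0 < phi_s < dphi_do:                          # fault faces the outer race
        return 1.0
    if np.pi < phi_s < np.pi + dphi_di:                # fault faces the inner race
        return (c_dr + c_di) / (c_dr - c_do)
    return 0.0
```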

### *2.2. Deep Residual Network for Fault Detection*

The deep residual network (Resnet) is a deep learning method with an extremely deep architecture that shows outstanding accuracy and convergence. It introduces shortcut connection modules into the framework to learn residuals, which avoids the degradation problem of deep networks. High-level representative features can be better extracted because the data information propagates directly throughout the network [36,37].

A residual learning unit is shown in Figure 4, which can be expressed as:

$$\begin{aligned} \mathbf{y}_l &= h(\mathbf{x}_l) + \mathbf{F}(\mathbf{x}_l, \mathbf{W}_l) \\ \mathbf{x}_{l+1} &= f(\mathbf{y}_l) \end{aligned} \tag{12}$$

where **x***l* and **x***l*+1 denote the input and output vectors of the *l*th residual unit, which generally comprises multiple layers. **F** is the residual function, which represents the learned residual, *h*(**x***l*) = **x***l* denotes the identity mapping, and *f*(**y***l*) is the activation function. Based on Equation (12), when *f* is also taken as an identity mapping, the features learned from a shallow layer *l* to a deeper layer *L* are described as

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathbf{F}(\mathbf{x}_i, \mathbf{W}_i) \tag{13}$$
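To see why Equation (13) holds, note that with *h* and *f* both identity mappings, Equation (12) reduces to **x***l*+1 = **x***l* + **F**(**x***l*, **W***l*). Applying this recursion one more step already shows the telescoping pattern:

$$\mathbf{x}_{l+2} = \mathbf{x}_{l+1} + \mathbf{F}(\mathbf{x}_{l+1}, \mathbf{W}_{l+1}) = \mathbf{x}_l + \mathbf{F}(\mathbf{x}_l, \mathbf{W}_l) + \mathbf{F}(\mathbf{x}_{l+1}, \mathbf{W}_{l+1})$$

Continuing up to layer *L* yields Equation (13).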

**Figure 4.** A residual learning unit.
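As a concrete illustration of the residual unit in Figure 4 and Equation (12), a minimal PyTorch sketch is given below; the 1-D convolutional layers, channel count, and kernel size are placeholder assumptions for vibration-signal input, not the configuration used in this paper:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual unit, Equation (12): y = h(x) + F(x, W), x_next = f(y)."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x, W): two weighted layers, as in a basic residual block
        self.residual = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )
        self.f = nn.ReLU()                  # activation f applied after the addition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x + self.residual(x)            # h(x) = x is the identity shortcut
        return self.f(y)

unit = ResidualUnit(channels=16)
out = unit(torch.randn(8, 16, 1024))        # batch of 8 signals, 16 channels, 1024 samples
```

Stacking such units from layer *l* to layer *L* reproduces the summation in Equation (13) when *f* is the identity.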

With regard to backpropagation, assuming the loss function is *E*, the gradient of the reverse process can be obtained according to the chain rule of backpropagation:

$$\frac{\partial E}{\partial \mathbf{x}_l} = \frac{\partial E}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial E}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathbf{F}(\mathbf{x}_i, \mathbf{W}_i)\right) \tag{14}$$

where *∂E*/*∂***x***L* denotes the gradient of the loss function with respect to **x***L*. The 1 in parentheses indicates that the shortcut mechanism propagates the gradient losslessly, whereas the residual gradient must pass through the weighted layers and is therefore not passed on directly. The residual gradient will not coincidentally equal −1 everywhere, and even when it is small, the presence of the 1 prevents the gradient from vanishing. The advantage of the Resnet is that it can be used to train very deep networks while ensuring high classification accuracy.
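The role of the constant 1 in Equation (14) can be checked numerically with automatic differentiation; in this toy example the residual function is chosen as F(*x*) = *wx*², so the full gradient is 1 + 2*wx* and the shortcut contributes the 1:

```python
import torch

x = torch.tensor(1.5, requires_grad=True)
w = 0.3
y = x + w * x ** 2      # residual unit with identity shortcut: y = x + F(x)
y.backward()            # gradient of E = y with respect to x
print(x.grad)           # tensor(1.9000) = 1 (shortcut) + 2*w*x (residual path)
```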
