*2.1. Convolutional Neural Networks*

Convolutional neural networks are an important branch of neural networks [20]. Unlike back propagation (BP) neural networks, convolutional neural networks have a strong feature extraction capability: through the convolution operation, the network can extract features directly from the signal fed into it.

A CNN consists of convolutional layers, pooling layers, and fully connected layers. Since the convolutional neural network was first proposed, a rich variety of CNN structures have been developed over the decades, including LeNet, AlexNet, VGG, GoogLeNet, ResNet, and DenseNet. A typical convolutional neural network structure is shown in Figure 1.

**Figure 1.** Typical convolutional neural network structure.
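For orientation, a minimal sketch of the layer sequence in Figure 1 (convolution → nonlinearity → pooling → fully connected → softmax) could look as follows; the channel counts, kernel sizes, and the 1 × 28 × 28 input are illustrative assumptions rather than values taken from this paper.

```python
import torch
import torch.nn as nn

# Illustrative layer sequence only; all sizes are assumed for the example.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3),  # convolutional layer
    nn.ReLU(),                                                  # nonlinear layer
    nn.MaxPool2d(kernel_size=2),                                # pooling layer
    nn.Flatten(),                                               # reshape before the fully connected layer
    nn.Linear(16 * 13 * 13, 10),                                # fully connected layer, 10 output classes
    nn.Softmax(dim=1),                                          # class probabilities
)

x = torch.randn(4, 1, 28, 28)   # a batch of 4 single-channel 28 x 28 inputs
print(model(x).shape)           # torch.Size([4, 10])
```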

The forward computation of a CNN can be expressed as follows.

$$G(X) = g^{(L)}\Big(\dots g^{(2)}\big(g^{(1)}(X, \theta^{(1)}), \theta^{(2)}\big) \dots, \theta^{(L)}\Big).\tag{1}$$

*G* is the mapping function of the network, *g* is the nonlinear function of each layer inside the network, *θ* denotes the connection parameters of each layer, and *L* is the number of layers of the neural network. $X = [x_1, x_2, \dots, x_p, \dots, x_Q]$ is the input to the network, which can be one- or two-dimensional, and *Q* is the number of samples in the dataset.
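Equation (1) is a repeated function composition; the following sketch (with hypothetical layer functions `layers` and parameters `params`) makes the nesting explicit.

```python
def network_forward(x, layers, params):
    """Compute G(X) = g^(L)( ... g^(2)(g^(1)(X, theta^(1)), theta^(2)) ..., theta^(L))."""
    out = x
    for g, theta in zip(layers, params):  # apply layer 1 through layer L in order
        out = g(out, theta)
    return out
```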

The convolutional layer of a CNN consists of a convolution kernel and a bias. The input of the layer is first convolved with the kernel, the bias of the layer is then added, and, finally, the output is obtained by passing the result through the nonlinear layer. The equation for the convolution layer is shown as follows [20].

$$O_i^{(l)} = g^{(l)}\left( \sum_{j=1}^{n^{(l-1)}} w_{ij}^{(l)} \ast X_j^{(l-1)} + b_i^{(l)} \right). \tag{2}$$

In the formula, $O_i^{(l)}$ is the *i*th output of the *l*th layer, $i = 1, 2, \dots, n^{(l)}$, where $n^{(l)}$ is the output size of the *l*th layer, and $j = 1, 2, \dots, n^{(l-1)}$, where $n^{(l-1)}$ is the output size of the (*l*−1)th layer. *w* is the value of the convolution kernel and *b* is the value of the bias. The pooling layer is used to further extract information from the convolutional output. The pooling operation can be max pooling, down pooling, or average pooling; after the pooling operation, the representative features in each local area are further extracted. Taking down pooling as an example, it takes the smallest value inside the pooling window and generates a new output. As shown in Equation (3), let the pooling size be *p* × *p*.

$$x_{ijl} = \min\left(o_{i'j'l} : i \le i' < i + p,\; j \le j' < j + p\right). \tag{3}$$
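The sketch below illustrates Equations (2) and (3) with NumPy/SciPy, assuming 2-D feature maps, a ReLU nonlinearity for *g*, cross-correlation in place of strict convolution (the usual CNN convention), and non-overlapping *p* × *p* pooling windows; these are assumptions for illustration, not details fixed by the paper.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(X, W, b):
    """Equation (2): sum the per-channel convolutions, add the bias, apply g (here ReLU).

    X: list of n^(l-1) input feature maps X_j
    W: W[i][j] is the kernel w_ij linking input map j to output map i
    b: b[i] is the bias of output map i
    """
    outputs = []
    for i in range(len(W)):
        s = sum(correlate2d(X[j], W[i][j], mode="valid") for j in range(len(X)))
        outputs.append(np.maximum(s + b[i], 0.0))  # ReLU as the nonlinear function g
    return outputs

def down_pool(o, p):
    """Equation (3): take the minimum inside every non-overlapping p x p window."""
    h, w = (o.shape[0] // p) * p, (o.shape[1] // p) * p
    blocks = o[:h, :w].reshape(h // p, p, w // p, p)
    return blocks.min(axis=(1, 3))
```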

The input information to the network passes through a number of convolution, pooling, and nonlinear computations. The output of the last pooling layer is then reshaped and fed into the fully connected layer. Finally, the diagnosis result is provided by a softmax layer. We assume that the network output has *K* classes. $Y = [y_1, y_2, \dots, y_q, \dots, y_Q]$ is the output set of the dataset, and *Q* is the number of samples in the dataset. If $y_q \in \{1, 2, \dots, k, \dots, K\}$, the predicted output of the network is given by Equation (4):

$$P\left(y_q = k \,\middle|\, o^{(L-1)}; w^{(L)}\right) = \frac{e^{w_k^{(L)} o^{(L-1)}}}{\sum_{i=1}^{K} e^{w_i^{(L)} o^{(L-1)}}}.\tag{4}$$

where $w^{(L)}$ is the network parameter of the softmax layer and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_q, \dots, \hat{y}_Q]$ is the set of predicted outputs.
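A minimal sketch of the softmax computation in Equation (4), assuming the output of layer *L*−1 has already been reshaped into a vector and that `W` stacks the class weight vectors $w_k^{(L)}$ row by row:

```python
import numpy as np

def softmax_output(o, W):
    """Equation (4): P(y_q = k | o^(L-1); w^(L)) for every class k.

    o: flattened output of layer L-1, shape (n,)
    W: softmax-layer weights, shape (K, n), one row per class
    """
    logits = W @ o                  # w_k^(L) . o^(L-1) for each class k
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    e = np.exp(logits)
    return e / e.sum()              # K class probabilities summing to 1
```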

The network parameters are updated using the mini-batch stochastic gradient descent method after the network has completed forward propagation. *J* is the loss function of the network, which can be the mean square error (MSE), cross-entropy, etc. After each forward calculation is completed, the gradients of *J* with respect to the parameters are obtained by backward differentiation, and the network parameters are updated according to Equations (5) and (6), where *γ* is the learning rate.

$$W = W - \gamma \frac{\partial J(W, B; X, \hat{Y})}{\partial W}.\tag{5}$$

$$B = B - \gamma \frac{\partial J(W, B; X, \hat{Y})}{\partial B}.\tag{6}$$
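A sketch of the update in Equations (5) and (6) for one mini-batch; the gradients `grad_W` and `grad_B` are assumed to have been obtained beforehand by backpropagation, so the snippet only shows the descent step itself.

```python
def sgd_step(W, B, grad_W, grad_B, gamma):
    """Equations (5) and (6): one mini-batch gradient descent update with learning rate gamma."""
    W = W - gamma * grad_W  # update the weights (convolution kernels, FC weights, ...)
    B = B - gamma * grad_B  # update the biases
    return W, B

# Example: W, B = sgd_step(W, B, dJ_dW, dJ_dB, gamma=0.01)
```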
