#### *2.7. Pooling Layers*

Even though parameter sharing greatly reduces the number of parameters in a CNN, for a large-scale image it is still necessary to find a reasonable way to perform subsampling. Pooling layers can be used for this job.

For a continuous signal (like an image), it is intuitive to perform downsampling by grouping a fixed number of adjacent values and then picking one output value from each group. The selection could be based on the average, maximum, or minimum of the group. Among these methods, max pooling has shown the best results and is now the most commonly used.
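
As a concrete illustration, the following minimal NumPy sketch (the helper name `max_pool_2x2` is ours, not from the paper) performs 2 × 2 max pooling with stride 2, reducing each non-overlapping group of four adjacent values to its maximum:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """A minimal sketch of 2x2 max pooling with stride 2.

    `feature_map` is assumed to be a 2-D array; any odd trailing
    row/column is dropped, and each non-overlapping 2x2 group of
    adjacent values is reduced to its maximum.
    """
    h, w = feature_map.shape
    # Group adjacent values into 2x2 blocks, then take the max of each block.
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [3, 1, 4, 6]])
print(max_pool_2x2(x))
# [[6 4]
#  [7 9]]
```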

However, care must be taken, as not all feature maps can use pooling as the downsampling method. According to the previous description, the values in a group need to be adjacent, which means they must actually have some spatial relationship, and each group also needs to have the same spatial meaning. Therefore, pooling layers might not be suitable in some CNN use cases, such as game maps [26].

#### *2.8. Fully Connected Layers*

Fully connected (FC) layers are similar to a typical MLP. The feature maps are flattened before entering this layer, transforming from several dimensions into a single vector. Most of the parameters in a CNN reside in the FC layers, and the size of the FC layers determines the capacity of the network.
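
A sketch of this flatten-and-multiply step follows; all shapes and names here are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume 32 feature maps of size 7x7 coming out of the last pooling layer.
feature_maps = rng.standard_normal((32, 7, 7))

# Flatten: several dimensions collapse into a single vector.
x = feature_maps.reshape(-1)            # shape (1568,)

# A fully connected layer is a plain matrix-vector product plus a bias,
# exactly as in a classic MLP. Its weight matrix holds most of the
# network's parameters: here 1568 * 10 weights + 10 biases.
W = rng.standard_normal((10, x.size)) * 0.01
b = np.zeros(10)
logits = W @ x + b                      # one output per class
print(logits.shape)                     # (10,)
```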

#### *2.9. Loss Function*

A neural network can be used for classification and regression, each of which needs a different loss function, and these functions all need to be differentiable:

$$\mathrm{Loss}_{\mathrm{mse}}(\mathbf{y}, \mathbf{t}) = \frac{1}{2}(\mathbf{y} - \mathbf{t})^2. \tag{7}$$

Equation (7) is the mean square error (MSE) loss function, which is often used in regression tasks. It directly shows the difference between the output value and the target value. Another loss function, often used in classification, is cross entropy, which usually works with the softmax logistic function. In Equation (7), $\mathbf{y} = (y_1, \cdots, y_J)^T$ is the output vector coming from the FC layer and *J* is the number of final classes. Softmax converts this vector into a probability distribution over the classification result: after passing through the function, the elements of the output vector $\mathbf{S}$ sum to 1, and $S_j$ represents the probability of the input being classified as class *j*:

$$S_j(\mathbf{y}) = \frac{e^{y_j}}{\sum_{k=1}^{J} e^{y_k}} \tag{8}$$

$$\mathrm{Loss}_{\mathrm{cross\_entropy}}(\mathbf{y}, \mathbf{t}) = -\sum_{j=1}^{J} t_j \log S_j(\mathbf{y}) \tag{9}$$

$$\mathrm{Loss}_{\mathrm{cross\_entropy}} = -\log S_j(\mathbf{y}). \tag{10}$$

The purpose of cross entropy is to estimate the difference between two vectors by calculating the log-likelihood function; the result is the same as that shown by Equation (9). In most classification cases, the target is a one-hot vector, in which the element for target class *j* has a value of one and all other elements are zero, so only the term containing $S_j$ survives. The loss function can therefore be simplified to Equation (10).
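
A minimal NumPy sketch of Equations (8)–(10) follows; the function names are ours, for illustration only:

```python
import numpy as np

def softmax(y):
    """Equation (8): turn the FC-layer output into a probability distribution."""
    e = np.exp(y - y.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    """Equation (9); with a one-hot target t it reduces to Equation (10)."""
    s = softmax(y)
    return -np.sum(t * np.log(s))

y = np.array([2.0, 1.0, 0.1])        # illustrative logits for J = 3 classes
t = np.array([1.0, 0.0, 0.0])        # one-hot target: class j = 1

print(softmax(y).sum())              # 1.0: the probabilities sum to one
print(cross_entropy(y, t))           # equals -log S_1(y), as in Equation (10)
```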

#### *2.10. Model Initialization*

In a network, there are numerous parameters that need to be decided, so it is normal to consider a way to initialize them. An ideal, properly initialized network would have the following property: if we feed a series of random inputs into the network, the outputs should be fairly distributed across the classes, without any particular trend at the beginning. Obviously, randomly initializing the parameters will not have this effect. Glorot and Bengio proposed normalized initialization [27] to keep the variance constant from the layer input to the output:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right] \tag{11}$$

In Equation (11), $n_j$ denotes the number of inputs in layer *j*. Equation (11) performs well for linear layers, but for nonlinear layers like ReLU, the equation needs to be adjusted.

He et al. proposed another method [28] to fix the formula, in which the $n_{j+1}$ term in Equation (11) can simply be ignored. Our experiment used He's method to initialize the network.
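
The following NumPy sketch contrasts the two schemes; the helper names `glorot_uniform` and `he_uniform` and the layer sizes are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(n_in, n_out):
    """Normalized initialization of Equation (11) [27]:
    W ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_uniform(n_in, n_out):
    """He et al. [28], as described above: drop the n_{j+1} (fan-out)
    term of Equation (11) so the variance survives ReLU layers."""
    limit = np.sqrt(6.0) / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = he_uniform(256, 128)
print(W.std())   # roughly sqrt(2/256): the forward variance stays stable
```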

#### *2.11. Batch Normalization*

In the previous section, it was mentioned that ReLU needs a supporting method to arrange its output values; the distribution of the output values also needs to be controlled. Ioffe and Szegedy proposed a method called *batch normalization* [29]. The main concept of this method is to insert a linear transform before the nonlinear layer to force the variance and mean of the nonlinear layer's input $X$, $X \in \mathbb{R}^{i \times j \times k}$ with $i \times j \times k = m$, into a certain range:

$$\mu_\beta \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i \tag{12}$$

$$\sigma_\beta^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} \left( x_i - \mu_\beta \right)^2 \tag{13}$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_\beta}{\sqrt{\sigma_\beta^2 + \varepsilon}} \tag{14}$$

In Equations (12) and (13), the value of *m* is the total number of elements across the mini-batch and channels. Equations (12)–(14) find the current mean $\mu_\beta$ and current variance $\sigma_\beta^2$ of the input and then adjust the values so that their mean and variance become 0 and 1:

$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i). \tag{15}$$

Learnable transform parameters $\gamma$ and $\beta$ are then added into the formula, giving the final result in Equation (15). These two variables let the normalized input values be slightly rescaled and shifted, which helps to solve the dead ReLU problem. It is essential to use batch normalization in a deep network.
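
A minimal NumPy sketch of the training-time forward pass of Equations (12)–(15) follows; the function name and the single-channel simplification are our illustrative choices, not from the paper:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time forward pass of Equations (12)-(15) over a mini-batch.

    `x` holds the m input values of one channel gathered across the
    mini-batch (a simplification; a CNN computes the same statistics
    per channel).
    """
    mu = x.mean()                          # Equation (12): batch mean
    var = x.var()                          # Equation (13): batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Equation (14): normalize to ~N(0, 1)
    return gamma * x_hat + beta            # Equation (15): learnable scale/shift

x = np.random.default_rng(0).standard_normal(128) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(y.mean(), y.var())   # close to 0 and 1
```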
