2.1.4. Batch Normalization

Batch normalization (BN) was proposed in [36] to accelerate the training of deep neural networks by reducing internal covariate shift. The batch normalization layer is usually inserted after a convolutional or fully connected layer and before the activation layer. Given a p-dimensional input to the BN layer, *X* = (*x*(1), ... , *x*(*p*)), the operation of the BN layer can be expressed as expression (4):

$$\begin{aligned} \hat{x}^{(i)} &= \frac{x^{(i)} - E[x^{(i)}]}{\sqrt{Var[x^{(i)}]}} \\ y^{(i)} &= \gamma^{(i)}\hat{x}^{(i)} + \beta^{(i)}, \end{aligned} \tag{4}$$

where *y*(*i*) denotes the *i*-th component of the p-dimensional output of the BN layer, and *γ*(*i*) and *β*(*i*) are the scale and shift parameters of the BN layer, which are learned during neural network training.
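As an illustration of expression (4), the following is a minimal sketch of the BN forward pass in Python/NumPy; the names `batch_norm_forward`, `gamma`, `beta`, and the small constant `eps` are our own choices, and `eps` is added to the variance only for numerical stability (it does not appear in (4)):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Apply expression (4) to a mini-batch.

    x     : array of shape (batch_size, p) -- p-dimensional inputs
    gamma : array of shape (p,)            -- learned scale, one per dimension
    beta  : array of shape (p,)            -- learned shift, one per dimension
    eps   : small constant for numerical stability (assumption, not in (4))
    """
    mean = x.mean(axis=0)                    # E[x^(i)], estimated over the batch
    var = x.var(axis=0)                      # Var[x^(i)], estimated over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each dimension
    y = gamma * x_hat + beta                 # scale and shift
    return y

# Example: a batch of 4 samples with p = 3 dimensions
x = np.random.randn(4, 3) * 5.0 + 2.0
gamma = np.ones(3)
beta = np.zeros(3)
y = batch_norm_forward(x, gamma, beta)   # each dimension of y has (near-)zero mean, unit variance
```

In a full implementation, the batch mean and variance are also accumulated into running estimates that replace the per-batch statistics at inference time; the sketch above shows only the training-time computation described by (4).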
