Many studies have shown that deep neural networks suffer from severe over-parameterization; that is, there is huge redundancy in the internal parameters of the network model, and a deep neural network may need to train only about 5% of its parameters and use them to predict the remaining parameters while achieving accuracy comparable to the original model [28]. Therefore, as deep neural networks have gradually moved from academia to industry in recent years, network compression and lightweight methods have become a research hotspot. Among them, low-rank decomposition [29,30], sparse training [31], structural pruning [32,33,34,35,36], weight quantization [37,38], knowledge distillation [39,40], and compact convolution kernel design have all proved to be very effective network compression methods. This paper proposes a hybrid compression optimization method that integrates sparse training, structural pruning, and knowledge distillation. The specific implementation process is shown in Figure 2 and is explained in detail below.
3.2.1. Sparse Training and Structural Pruning
As feature information is extracted by the convolutional layers of a deep neural network, its distribution shifts as the network deepens and as training progresses, generally moving toward the upper and lower limits of the input interval of the nonlinear activation function. Taking the sigmoid and tanh activation functions as examples, as shown in Figure 3, the gradient of the network in the highlighted (saturated) intervals is very small, which hinders backpropagation, slows the convergence of the network, and can also lead to overfitting.
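As a quick numerical illustration of this saturation effect (a minimal PyTorch sketch, not taken from the paper), the gradient of the sigmoid shrinks rapidly once its input drifts away from zero toward either saturated end:

```python
import torch

# Gradient of the sigmoid at increasingly large inputs: the further an activation
# sits in the saturated region of the curve, the less gradient is available for
# backpropagation.
x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly 0.25, 0.105, 0.0066, 0.000045
```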
Therefore, deep neural networks generally combine convolutional layers, batch normalization (BN) layers, and an activation function: the feature information extracted by the convolutional layers first has its data distribution adjusted by the BN layer, and nonlinearity is then introduced by the activation function, thereby improving the expressive ability of the network; the specific structure can be found in the CBL module shown in Figure 1. In detail, the BN layer re-adjusts the data distribution of the feature information of each channel extracted by the convolutional layer so that it approximately follows a standard normal distribution, which ensures that the result of each convolution calculation is transmitted within an effective range, thus avoiding vanishing gradients and speeding up convergence. The specific implementation is as follows:
Assuming that the input of the BN layer is a mini-batch consisting of $m$ samples, $\mathcal{B}=\{x_1, x_2, \ldots, x_m\}$, the batch-normalized output can be obtained by the following steps:
(1) Firstly, compute the mean $\mu_{\mathcal{B}}$ and variance $\sigma_{\mathcal{B}}^2$ of the mini-batch $\mathcal{B}$:
$$\mu_{\mathcal{B}}=\frac{1}{m}\sum_{i=1}^{m}x_i \qquad (1)$$
$$\sigma_{\mathcal{B}}^2=\frac{1}{m}\sum_{i=1}^{m}\left(x_i-\mu_{\mathcal{B}}\right)^2 \qquad (2)$$
(2) Secondly, standardize each $x_i$ based on the above mean and variance:
$$\hat{x}_i=\frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2}} \qquad (3)$$
(3) Thirdly, to ensure that the denominator in Equation (3) is greater than 0, a tiny constant $\epsilon>0$ is introduced, and Equation (3) is corrected as Equation (4):
$$\hat{x}_i=\frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} \qquad (4)$$
(4) Further, since the standardized $\hat{x}_i$ is essentially restricted to a standard normal distribution, which reduces the expressive ability of the network, two learnable parameters are introduced to further adjust the data distribution, namely the scaling parameter $\gamma$ and the bias parameter $\beta$, both of which are updated automatically during backpropagation by gradient descent. At this point, Equation (4) can be further corrected as follows:
$$y_i=\gamma\hat{x}_i+\beta \qquad (5)$$
Through the above process, the BN layer completes the batch normalization of the feature information of each channel, and the activation function then introduces further nonlinear information to the features. It can be seen from Equation (5) that when the scaling parameter $\gamma$ is close to 0, the result of batch normalization is determined only by the bias parameter $\beta$, regardless of the channel feature information. Therefore, we can judge the importance of a network channel according to $\gamma$: during training, when $\gamma$ approaches 0, the feature information of the corresponding channel becomes useless and the network becomes sparser, which is the essence of network sparse training.
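To make the role of $\gamma$ concrete, the following is a minimal sketch (PyTorch, illustrative only) of Equations (1)-(5) for a single channel; setting $\gamma$ to 0 shows that the output collapses to the bias $\beta$ no matter what the channel contains:

```python
import torch

def batch_norm_channel(x, gamma, beta, eps=1e-5):
    """Batch normalization of one channel over a mini-batch of m samples."""
    mu = x.mean()                                  # Equation (1): mini-batch mean
    var = x.var(unbiased=False)                    # Equation (2): mini-batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)       # Equations (3)-(4): standardization
    return gamma * x_hat + beta                    # Equation (5): scale and shift

x = torch.randn(8) * 3.0 + 1.5                     # arbitrary activations of one channel
print(batch_norm_channel(x, gamma=1.2, beta=0.1))  # output depends on the channel content
print(batch_norm_channel(x, gamma=0.0, beta=0.1))  # every output equals beta = 0.1
```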
Since the scaling parameter $\gamma$ and the bias parameter $\beta$ are involved in the training process of the network, a penalty term can be added to the loss function to constrain them so that more of the $\gamma$ values converge to 0, thus achieving a greater degree of sparsity. In this paper, L1 regularization is added to the original loss function as a penalty term to drive the scaling parameters $\gamma$ toward 0 during training, resulting in a sparse network model. The new loss function with the penalty term added is shown in Equation (6):
$$L=\sum_{(x,y)}l\left(f\left(x,W\right),y\right)+\lambda\sum_{\gamma}\left|\gamma\right| \qquad (6)$$
where $L$ denotes the loss function with the penalty term added; $l(f(x,W),y)$ denotes the original loss function; $(x,y)$ denotes the data and labels provided by the dataset; $W$ denotes the weights trained by the network; and $\lambda$ is the regularization coefficient.
As shown in Equation (6), the loss function with the penalty term differs from the original loss function by additionally taking $\sum_{\gamma}\left|\gamma\right|$ into account. In detail, since the overall loss function is driven toward its minimum during training, this term is not allowed to increase; as training progresses, the values of $\gamma$ gradually approach 0, which causes the feature information of the corresponding channels to become useless and achieves the purpose of network sparsity. The value of the regularization coefficient $\lambda$ determines the speed and strength of network sparsification: the larger the value, the faster the network becomes sparse and the stronger the constraint.
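A minimal training-side sketch of Equation (6) could look as follows (PyTorch; the model, criterion, and the value of $\lambda$ are placeholders, not settings from the paper): the L1 norm of all BN scaling factors is simply added to the task loss before backpropagation.

```python
import torch.nn as nn

def loss_with_bn_sparsity(model: nn.Module, base_loss, lam: float = 1e-4):
    """Equation (6): original task loss plus an L1 penalty on all BN scaling factors (gamma)."""
    l1_gamma = sum(m.weight.abs().sum()
                   for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return base_loss + lam * l1_gamma

# Typical use inside one training step (model/criterion/optimizer are hypothetical):
#   loss = loss_with_bn_sparsity(model, criterion(model(images), targets), lam=1e-4)
#   loss.backward()
#   optimizer.step()
```

A larger value of `lam` sparsifies the scaling factors faster but constrains the task loss more strongly, matching the trade-off described above.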
Based on the sparse training of the network, the value of $\gamma$ can be used as a criterion to measure the importance of the network channels. By pruning the channels whose $\gamma$ is closest to 0, a more lightweight network can be obtained. Further, since the calculation result of each channel in the output feature layer is also closely related to its filter, the filter weights are another factor for measuring the importance of the channel. Therefore, we propose a channel importance scoring standard that comprehensively considers the BN layer scaling parameters and the filter weights, as shown in Equations (7) and (8).
where $W_x$ denotes the sum of the absolute values of the weights of filter $x$; $n$ denotes the number of convolution kernels in the filter; $K_j^x$ denotes the $j$-th convolution kernel in filter $x$; $\left\|K_j^x\right\|_1$ indicates the L1 norm of the convolution kernel $K_j^x$; $\gamma_i$ denotes the scaling parameter of the $i$-th channel; and $S_i$ is the importance score of the $i$-th filter.
The evaluation of the importance of each channel can be completed by Equation (8), and the comprehensive score set $S$ can be obtained. Finally, the pruning threshold of the filters in each convolutional layer can be obtained from Equation (9) and the preset pruning rate.
where $\theta$ is the pruning threshold; and $f_{\mathrm{sort}}(\cdot)$ denotes the ascending sort function, which outputs the value at position $pr$.
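The following sketch (PyTorch) illustrates one way the scoring and thresholding described by Equations (7)-(9) could be computed for a Conv-BN pair; the multiplicative combination of $\gamma_i$ and the filter weight sum in `channel_scores` is an assumption made for illustration, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

def channel_scores(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> torch.Tensor:
    # Equation (7): W_i = sum of the absolute kernel weights of the i-th filter.
    w = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    gamma = bn.weight.detach().abs()               # BN scaling factor of each channel
    # Combining gamma_i and W_i by multiplication is an assumption of this sketch;
    # the paper's Equation (8) defines the exact combination.
    return gamma * w

def pruning_threshold(scores: torch.Tensor, pr: float) -> torch.Tensor:
    # Equation (9): sort the scores in ascending order and read off the value at
    # the position given by the preset pruning rate.
    sorted_scores, _ = torch.sort(scores)
    idx = min(int(pr * scores.numel()), scores.numel() - 1)
    return sorted_scores[idx]

conv, bn = nn.Conv2d(16, 32, kernel_size=3), nn.BatchNorm2d(32)
scores = channel_scores(conv, bn)
thr = pruning_threshold(scores, pr=0.5)
keep_mask = scores > thr                           # channels scoring above the threshold survive
print(int(keep_mask.sum()), "of", scores.numel(), "channels kept")
```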
From the network structure of Yolov5-s shown in Figure 1, it can be found that the BN layers are distributed in three positions in the network: the CBL (Conv_BN_LeakyRelu) module used for channel number adjustment in the backbone, the CSP1_X module used for network expansion in the backbone, and the CSP2 module used for feature enhancement in the neck. Based on the distribution characteristics of the BN layers in the network, this paper proposes three pruning strategies, which differ in how the BN layers in the CSP modules are processed.
- (1) Pruning strategy 1: In the structure of the CSP1_X module shown in Figure 1, there is a residual-like unit called Resunit, whose structure is also given. It can be clearly seen that a residual edge (shortcut) and a convolutional layer are spliced together through an add operation for subsequent feature extraction. Because the add operation is used, the two feature layers must have consistent dimensions. Therefore, in pruning strategy 1, we do not prune the two convolutional layers directly connected at the beginning and end of the shortcut, so as to avoid dimensional processing.
- (2) Pruning strategy 2: Since pruning strategy 1 cannot sufficiently compress the network, in pruning strategy 2 we also prune the two convolutional layers directly connected to the shortcut. During dimension processing, the number of remaining channels of the convolutional layer connected to the front end of the shortcut is taken as a reference for pruning the channels of the convolutional layer inside the CSP module.
- (3) Pruning strategy 3: In pruning strategy 3, the pruning threshold no longer refers to the importance scores of the channels within each individual convolutional layer. Instead, after sparse training, a global comprehensive score set is obtained based on Equation (8), and the pruning threshold is determined according to Equation (9) and the manually set pruning rate. However, a global pruning threshold cannot guarantee the integrity of some special network structures, such as residual blocks, so we also introduce a local safety threshold to ensure the integrity of the channel connections.
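The sketch below (PyTorch) shows how a global threshold in the spirit of strategy 3 could be combined with a local safety threshold; the minimum-keep ratio of 10% is an illustrative value, not a setting reported in the paper:

```python
import torch

def global_threshold(per_layer_scores, pr: float) -> torch.Tensor:
    """Pool the comprehensive scores of all prunable layers and apply Equation (9) globally."""
    scores = torch.cat([s.flatten() for s in per_layer_scores])
    sorted_scores, _ = torch.sort(scores)
    idx = min(int(pr * scores.numel()), scores.numel() - 1)
    return sorted_scores[idx]

def layer_keep_mask(scores: torch.Tensor, thr: torch.Tensor, min_keep_ratio: float = 0.1):
    """Prune by the global threshold, but keep at least min_keep_ratio of each layer's
    channels (the local safety threshold) so that structures such as residual blocks
    never lose all of their channel connections."""
    mask = scores > thr
    min_keep = max(1, int(min_keep_ratio * scores.numel()))
    if int(mask.sum()) < min_keep:
        # Relax the global threshold locally: retain this layer's top-scoring channels.
        mask = torch.zeros_like(mask)
        mask[torch.topk(scores, min_keep).indices] = True
    return mask

per_layer_scores = [torch.rand(32), torch.rand(64), torch.rand(128)]
thr = global_threshold(per_layer_scores, pr=0.7)
print([int(layer_keep_mask(s, thr).sum()) for s in per_layer_scores])  # surviving channels per layer
```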