2.3.2. Network Pruning

The key to on-board ship detection is a lightweight on-board SAR ship detector that balances detection accuracy and model complexity under the limited memory and computation resources of satellites. Thus, it is of great significance to find a model compression method that does not sacrifice much accuracy. Network pruning is a mainstream model compression method: by pruning unimportant neurons, filters, or channels, it effectively compresses the parameters and computation of the model. Therefore, to obtain a more compact detector, we follow the scheme illustrated in Figure 5 to prune the Conv and BN layers of the network.

**Figure 5.** Flow-chart of iterative network pruning procedure.

As shown in Figure 5, the model is first trained with sparse regularization, which drives some parameters of the initial network towards or to zero and yields a network with sparse weights. The model is then pruned to remove the sparse channels. Next, the pruned model is fine-tuned to restore its accuracy. Finally, by iterating the above procedure, we obtain the ultimate compact network.
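For concreteness, the loop in Figure 5 can be summarized by the following minimal sketch; the helpers `sparsity_train`, `prune_channels`, and `fine_tune` are hypothetical placeholders for the three stages, not the paper's implementation.

```python
# Minimal sketch of the iterative pruning loop in Figure 5.
# sparsity_train, prune_channels, and fine_tune are hypothetical
# placeholders for the three stages described in the text.
def iterative_prune(model, n_rounds=3, prune_ratio=0.5, lam=1e-3):
    for _ in range(n_rounds):
        model = sparsity_train(model, lam=lam)      # L1 penalty, Eq. (5)
        model = prune_channels(model, prune_ratio)  # drop near-zero channels
        model = fine_tune(model)                    # restore accuracy
    return model                                    # ultimate compact network
```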

The scaling factors, sparsity training, channel pruning, and fine-tuning are described below.

(1) Scale factors in BN layers: Conv and BN layers are widely used in CNNs. In a Conv layer, reducing the number of filters effectively reduces the network's parameters and computation and accelerates inference. In a BN layer, each activation channel has its own scaling factor, which reflects the activation degree of that channel. The operation of the BN layer is formulated by

$$Z_{\mathrm{out}} = \alpha \frac{Z_{\mathrm{in}} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \varepsilon}} + \beta \tag{4}$$

where *Z*in denotes the input, *Z*out denotes the output, *μ*B and *σ*B² denote the mean and variance of the input activations over a mini-batch, respectively, *ε* is a small constant for numerical stability, and *α* and *β* denote the scaling factor and offset factor of the corresponding activation channel, respectively.
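For reference, a minimal PyTorch sketch of where *α* and *β* live in a BN layer (the framework choice is our assumption; the paper does not name one):

```python
import torch.nn as nn

# In PyTorch, the per-channel scale factor α of Eq. (4) is bn.weight and
# the offset β is bn.bias (framework choice is our assumption).
bn = nn.BatchNorm2d(64)           # one (α, β) pair per channel
alpha, beta = bn.weight, bn.bias  # both of shape (64,)
```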

In practice, every BN layer in our network directly follows a Conv layer. Therefore, pruning a channel of a BN layer requires pruning the corresponding filter of the preceding Conv layer and removing the corresponding input channel from the kernels of the following Conv layer.
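A minimal PyTorch sketch of this correspondence is given below; it assumes a plain Conv–BN–Conv chain and a boolean mask `keep` over the BN channels to retain (both are illustrative assumptions, and skip connections in the real network would need extra bookkeeping).

```python
import torch
import torch.nn as nn

# Sketch: prune one Conv–BN pair plus the input channels of the next Conv.
# `keep` is a boolean mask over the BN channels to retain (hypothetical).
def prune_conv_bn(conv, bn, next_conv, keep):
    idx = torch.nonzero(keep).squeeze(1)
    # Prune the output filters of the preceding Conv layer ...
    conv.weight.data = conv.weight.data[idx].clone()
    if conv.bias is not None:
        conv.bias.data = conv.bias.data[idx].clone()
    conv.out_channels = len(idx)
    # ... the matching BN channels ...
    for name in ("weight", "bias", "running_mean", "running_var"):
        getattr(bn, name).data = getattr(bn, name).data[idx].clone()
    bn.num_features = len(idx)
    # ... and the corresponding input channels of the following Conv layer.
    next_conv.weight.data = next_conv.weight.data[:, idx].clone()
    next_conv.in_channels = len(idx)
    return conv, bn, next_conv
```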

(2) Training with the L1 sparsity constraint: The sparsifying effect of L1 regularization on CNNs has been proven and is widely used [38,39]. In this paper, a penalty term is added to the loss function to constrain the weights of the Conv layers and the scaling factors of the BN layers, making the model sparse. The larger the regularization coefficient *λ*, the stronger the constraint. Specifically, the loss function with the sparsity constraint is defined by

$$L = L_{\mathrm{raw}} + \lambda \sum_{\gamma \in \Gamma} g(\gamma) \tag{5}$$

where *L*raw denotes the loss function of the raw detector, *g*(*γ*) = |*γ*| denotes the L1 regularization term, Γ denotes the set of scaling factors, and *λ* denotes the regularization coefficient, which is adjusted according to the dataset.
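Since *g*(*γ*) = |*γ*| is not differentiable at zero, a common way to apply Eq. (5) in practice is to add the subgradient *λ*·sign(*γ*) to the gradients of the BN scale factors after backpropagation, as in [39]. A minimal PyTorch sketch, assuming the penalty is applied to the BN scale factors only:

```python
import torch
import torch.nn as nn

# Sketch of the L1 sparsity constraint of Eq. (5): add the subgradient
# λ·sign(γ) to the gradient of every BN scale factor. Call this between
# loss.backward() and optimizer.step().
def add_l1_subgradient(model, lam=1e-3):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(lam * torch.sign(m.weight.data))
```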

During the sparsity training procedure, we visualize the scale factors of the BN layers. In general, the smaller the scale factors are, the sparser the channel parameters of the network are. Following [39], we visualize the scale factors for three typical regularization coefficients (i.e., λ = 0, λ = 10<sup>−4</sup>, and λ = 10<sup>−3</sup>). Obviously, λ = 0 means no sparsity training (i.e., normal training). From Figure 6, when λ = 0, the distribution of the BN scale factors is approximately normal. When λ = 10<sup>−4</sup>, some scale factors are pressed towards 0, but not enough to guarantee sparsity. When λ = 10<sup>−3</sup>, most of the scale factors are pressed towards 0, which is sufficient for the subsequent channel-wise pruning. Therefore, in our implementation, Lite-YOLOv5 is sparsity-trained with λ = 10<sup>−3</sup> to guarantee channel-wise sparsity.
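A plot similar to Figure 6 can be produced by collecting all BN scale factors and plotting their histogram; the sketch below is our own illustration, not the paper's plotting code.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Collect the absolute BN scale factors of a model and plot their
# distribution, one histogram per regularization setting λ.
def plot_bn_scales(model, lam_label):
    scales = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    plt.hist(scales.cpu().numpy(), bins=100)
    plt.xlabel("|scale factor|")
    plt.ylabel("count")
    plt.title(f"BN scale factors, lambda = {lam_label}")
    plt.show()
```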

(3) Channel pruning and fine-tuning: After sparse regularization training, a model with sparser weights is obtained, in which many weights are close to zero. We then prune the channels with near-zero scaling factors by deleting all their incoming and outgoing connections and the corresponding weights. Channels are pruned across all BN layers according to the prune ratio *Pr*, whose corresponding threshold is the *Pr*-th percentile of all scaling factor values; *Pr* is determined experimentally in Section 5.2.2. Afterwards, the pruned model is fine-tuned to restore its accuracy. Note that fine-tuning is simply a training run identical to normal training. In this way, we obtain a lightweight detector without sacrificing too much accuracy. In addition, an iterative network pruning procedure can lead to an even more compact network [28], as shown in Figure 5.
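A minimal sketch of the percentile-based threshold selection, assuming PyTorch and that only BN scale factors enter the statistic:

```python
import torch
import torch.nn as nn

# Derive the global pruning threshold from the prune ratio Pr, taken as
# the Pr-th percentile of all BN scale factors, and build per-layer
# keep-masks (True = channel survives pruning).
def bn_keep_masks(model, pr=0.5):
    scales = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    thresh = torch.quantile(scales, pr)
    return {name: m.weight.data.abs() > thresh
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```

The masks returned here would feed the Conv–BN pruning sketch shown earlier, after which fine-tuning restores the accuracy of the pruned model.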
