Article

Optimization Based Layer-Wise Pruning Threshold Method for Accelerating Convolutional Neural Networks

School of Mathematical Science, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(15), 3311; https://doi.org/10.3390/math11153311
Submission received: 16 June 2023 / Revised: 24 July 2023 / Accepted: 25 July 2023 / Published: 27 July 2023
(This article belongs to the Special Issue Artificial Intelligence Applications in Complex Networks)

Abstract

Among various network compression methods, network pruning has developed rapidly due to its superior compression performance. However, a poorly chosen pruning threshold limits the compression performance of pruning. Most conventional pruning threshold methods are based on well-known hard or soft thresholding techniques that rely on time-consuming handcrafted tests or domain experience. To mitigate these issues, we propose a simple yet effective general pruning threshold method from an optimization point of view. Specifically, the pruning threshold problem is formulated as a constrained optimization program that minimizes the size of each layer. More importantly, our pruning threshold method, combined with conventional pruning algorithms, achieves a better performance across various pruning scenarios on many advanced benchmarks. Notably, for the L1-norm pruning algorithm with VGG-16, our method achieves higher FLOPs reductions without requiring time-consuming sensitivity analysis; the compression ratio increases from 34% to 53%, a substantial improvement. Similar experiments with ResNet-56 show that, even for compact networks, our method achieves competitive compression performance without skipping any sensitive layers.

1. Introduction

Deep neural networks have achieved great success in various applications [1,2], ranging from image classification [3,4] to image segmentation [5,6] and object detection [7,8], but their limitations are gradually emerging. Extreme model complexity hinders large-scale deployment on resource-constrained edge devices. For instance, to classify a single 224 × 224 image, the VGG-16 model requires 550 MB of storage and 15.5 billion floating point operations (FLOPs). For embedded applications, these resource demands are prohibitive. Thus, how to deploy deep neural networks on embedded systems has become an urgent problem for both academia and industry.
To reduce the storage and energy required to run inference of large networks on mobile devices, many different compression and acceleration methods [9,10] have been proposed. Among them, pruning is a classic technique in machine learning: pruning decision trees [11] reduces model complexity and improves generalization. For neural networks, pruning [12,13] reduces storage and computational costs by removing unimportant connections and achieves excellent compression performance. Network pruning can be roughly grouped into three categories, namely metric-based pruning, training-based pruning, and reconstruction-based pruning. Metric-based pruning methods [14,15,16] typically propose a metric to measure whether a filter/channel is critical and then retrain the pruned network to recover the accuracy drop. Three metrics are commonly used, i.e., the weight, the activation, and the gradient. Weight pruning methods evaluate the importance of connections using the weight magnitude. In 2015, unstructured weight pruning [17] proposed a simple yet effective strategy in which weights whose absolute values fall below a given threshold are removed. L1-norm pruning [18] extends this idea to structured pruning by removing a certain percentage of filters with smaller L1-norms. Activation pruning methods [19,20] focus on neuron importance in terms of activation values. Network trimming [20] observes that many neurons output zero regardless of the input data; such neurons only increase the risk of overfitting. Therefore, the Average Percentage of Zeros (APoZ) criterion is proposed to evaluate the importance of different neurons. Fundamentally, these two kinds of pruning methods are heuristic and do not guarantee the effectiveness of pruning. Gradient pruning [21,22] starts directly from the loss function and prunes the neurons that have the smallest impact on it. The most representative approach is Taylor expansion pruning [22], which uses a gradient-based score to determine which filters are pruned.
There have been other attempts to prune networks. Training-based pruning methods [23,24] add regularization terms to different network parameters during training. Different from metric-based pruning methods, training-based pruning methods focus more on sparse terms and then employ relatively simple pruning criteria. Reconstruction-based pruning methods [25,26,27] mainly evaluate the parameter importance by minimizing the feature map reconstruction error of the next layer. Given the pruning threshold, the reconstruction method selects a subset of channels as the pruned network.
Regardless of the specific algorithm, the performance of these pruning methods is limited by a common challenge: the pruning threshold must be determined either by hand or by handcrafted heuristic rules, and determining a layer-wise threshold is a difficult task. If the pruning threshold is too small, the network is compressed insufficiently; if it is too large, the accuracy loss is difficult to recover by fine-tuning. With a well-chosen pruning threshold, conventional pruning algorithms can achieve significant reductions in the number of parameters and FLOPs. A question therefore arises: how should the pruning threshold of each layer be determined? The commonly used pruning threshold methods fall into three types: sensitivity analysis [18], the automatic method [28], and the soft threshold [29]; we summarize them in Figure 1. Sensitivity analysis selects, for each layer, the threshold with the smallest accuracy drop. To obtain dynamic layer-wise pruning thresholds, the automatic method derives the pruning degree of each layer from a single global threshold; for example, network slimming [24] trains the scale factors of the batch normalization (BN) layers under an L1 regularization constraint, which automatically determines the pruning degree of each layer.
Nevertheless, these pruning threshold methods still suffer from several problems and limitations. Sensitivity analysis requires a large number of search experiments to obtain the optimal threshold setting, and the time cost grows exponentially with the layer depth: for a network with L layers and M candidate thresholds per layer, sensitivity analysis requires M^L experiments, which is prohibitively expensive. The automatic method, in turn, has difficulty assessing the local importance of parameters from different layers, since the contribution of each layer to the model differs significantly. It may remove layers that have few parameters but account for substantial FLOPs, such as an initial convolutional layer. Hence, while such models may significantly reduce the number of non-zero parameters, their FLOPs may still be large, making pruning based on a single global threshold insufficient and non-robust. To mitigate these shortcomings, various soft threshold methods [29,30,31,32] have been proposed to automatically learn the pruning threshold of each layer using a soft thresholding operator or a close variant of it. However, most of the discussed pruning threshold techniques still require handcrafted heuristics or domain experts to explore a large design space. More seriously, the aforementioned methods lack a corresponding recoverability guarantee. Thus, how to obtain pruning thresholds offline and give them a reasonable interpretation is a critical issue, and it is appealing to design the pruning threshold of each layer from an optimization point of view.
Unlike traditional pruning threshold algorithms, we propose a general layer-wise pruning threshold method for metric-based pruning algorithms from an optimization point of view. Concretely, we transform the pruning threshold problem into a constrained optimization program, whose solution reveals that how many connections are pruned depends on the pruning metric of the specific layer and statistical information computed from the next layer. Besides saving time, the proposed method enhances the interpretability of the pruned network, and it can improve the performance of most metric-based pruning algorithms. Extensive experiments on a variety of datasets demonstrate that our pruning threshold method, combined with conventional metric-based pruning algorithms, achieves higher compression performance. Obtaining pruning thresholds via an optimization problem and, in turn, improving the compression performance of conventional pruning algorithms is the main contribution of this paper.

2. Related Work

Different methods have been proposed to compress neural network models, which can be roughly categorized as follows: (i) network quantization; (ii) knowledge distillation; (iii) network pruning. Currently, network pruning has emerged as an efficient strategy for compressing models with limited accuracy degradation, and it can be broadly divided into unstructured and structured pruning. Early pruning works focused on unstructured pruning, which removes individual weights from the neural network. Although unstructured pruning achieves much higher weight sparsity, it is considered less hardware friendly, as irregular sparsity makes it difficult to leverage the speedup provided by commodity hardware during training and inference. Structured pruning takes structure into account, making the pruned model efficient on commodity hardware with standard computation techniques and architectures.
Although fruitful pruning algorithms have been proposed, most unstructured and structured pruning methods generally require hand-tuned thresholds to achieve an excellent performance. These methods iteratively prune weights smaller than a certain threshold and retrain the network to regain the performance lost during pruning. However, when the wrong pruning threshold is set, the accuracy of the pruned model degrades considerably. In general, a small pruning threshold is needed for "sensitive" layers of the network and a large one for "non-sensitive" layers; the key challenge in pruning is thus to find an optimal setting of these thresholds. For simplicity, some methods set the same threshold for all layers, which may be inappropriate since the distribution and range of weights can differ greatly between layers. Moreover, different layers may have varying sensitivities to pruning, depending on their location in the network or their type, and the optimal threshold setting should take these layer-wise features into account. However, most pruning threshold methods are based on the well-known hard [17,18,26] and soft threshold [29,30,31,32] techniques, which rely on handcrafted features or require domain experts to explore the large design space trading off model size, speed, and accuracy; this is often suboptimal and time-consuming. More seriously, existing pruning threshold methods lack theoretical guarantees on compression performance. Unlike the aforementioned methods, our approach is a general pruning threshold method that achieves a better performance across various pruning scenarios.

3. Methodology

The goal of pruning is to compress the network as much as possible with little accuracy drop. To achieve higher compression, it is necessary to design an appropriate pruning threshold for each layer. In this section, we give a comprehensive introduction to our pruning threshold method, which prunes redundant filters/channels using a three-step strategy. First, we define the optimization problem used to obtain a more reasonable pruning threshold for each layer. Next, we relax the optimization condition so that it can be solved easily. Finally, we obtain a suboptimal solution and discuss the impact of the error bound.

3.1. Optimization Problem

Suppose the l-th layer output is defined as follows:
F^{(l)}(x) = \mathrm{ReLU}\left( w^{(l)} x + b^{(l)} \right)    (1)
where x denotes the input of the l-th layer, F^{(l)} is the output tensor, and the parameters w^{(l)} and b^{(l)} represent the weight and bias, respectively.
Our goal is to maximize the pruning threshold of the l-th layer for metric-based pruning methods, which means keeping the number of channels as small as possible. Our objective is therefore to minimize the number of surviving channels. Given an error bound between the outputs before and after pruning, we formulate the optimization problem as follows:
\arg\min_{\tau} \; \| s_l \|_1
\quad \text{s.t.} \quad \sum_{m=1}^{M} \left\| F^{(l)}\bigl(x_l^{(m)}\bigr) - F^{(l)}\bigl(s_l \odot x_l^{(m)}\bigr) \right\|_1 \le \epsilon    (2)
where \|\cdot\|_1 denotes the L1-norm, M denotes the number of test samples, x_l^{(m)} is the input tensor of the l-th layer, \epsilon is the tolerable error bound between the outputs before and after pruning, and s_l is a binary vector denoting the channel prune indicator of the l-th layer. Suppose N_l channels are to be kept in the l-th layer after pruning: s_{l,i} = 1 if and only if the importance score of the i-th channel is among the top N_l values in the l-th layer.

3.2. Relaxing Optimization Condition

We have formulated the pruning threshold problem as an optimization program but, due to the ReLU function, it is difficult to obtain an analytical solution through direct optimization. In order to obtain a suboptimal solution, we derive an upper bound on the original optimization condition, starting from the fact that ReLU is 1-Lipschitz:
\left| \mathrm{ReLU}(x) - \mathrm{ReLU}(y) \right| \le | x - y |
where | · | denotes the element-wise absolute value. Then, we can obtain
\left| F^{(l)}(x) - F^{(l)}(y) \right| \le \left| w^{(l)} x - w^{(l)} y \right| \le \left| w^{(l)} \right| \cdot | x - y |    (3)
By substituting x = x_l^{(m)} and y = s_l \odot x_l^{(m)} into (3), we obtain
\left\| F^{(l)}\bigl(x_l^{(m)}\bigr) - F^{(l)}\bigl(s_l \odot x_l^{(m)}\bigr) \right\|_1 \le \sum_i \sum_j \left| w_{i,j}^{(l)} \right| \cdot (1 - s_{l,j}) \cdot \left| x_{l,j}^{(m)} \right|    (4)
Summing over all M test samples,
\sum_{m=1}^{M} \left\| F^{(l)}\bigl(x_l^{(m)}\bigr) - F^{(l)}\bigl(s_l \odot x_l^{(m)}\bigr) \right\|_1 \le \sum_{m=1}^{M} \sum_i \sum_j \left| w_{i,j}^{(l)} \right| \cdot (1 - s_{l,j}) \cdot \left| x_{l,j}^{(m)} \right|    (5)
Since the input data \left| x_{l,j}^{(m)} \right| are bounded, there exists a constant C such that \sum_{m=1}^{M} \left| x_{l,j}^{(m)} \right| \le C. The optimization condition can then be written as
\sum_{m=1}^{M} \left\| F^{(l)}\bigl(x_l^{(m)}\bigr) - F^{(l)}\bigl(s_l \odot x_l^{(m)}\bigr) \right\|_1 \le C \cdot \sum_i \sum_j \left| w_{i,j}^{(l)} \right| \cdot (1 - s_{l,j}) \le \epsilon    (6)
Defining a new error bound δ as below, we can finally state the optimization condition as follows:
\sum_i \sum_j \left| w_{i,j}^{(l)} \right| \cdot s_{l,j} \;\ge\; \sum_i \sum_j \left| w_{i,j}^{(l)} \right| - \epsilon / C = \delta    (7)
where s_{l,j} = 0 denotes that the j-th feature map is pruned; otherwise, it is preserved. Equation (7) shows that, given a suitable error bound, the pruning threshold is entirely determined by the pruning metric of the specific layer and the weight information computed from the next layer. The detailed steps are given in the following sections.

3.3. Solution

Now, if we further define k_j = \sum_i \left| w_{i,j}^{(l)} \right|, the optimization condition (7) can be simplified as follows:
\sum_j k_j \cdot s_{l,j} \ge \delta    (8)
Metric-based pruning methods typically propose a metric to evaluate whether a channel is critical, and the unimportant channels are pruned. To minimize the objective function, i.e., \|s_l\|_1, we rank the channels' importance via the proposed metric and accumulate the corresponding k_j values of the top-ranked channels until the optimization condition (8) is satisfied. Among the importance scores of the remaining channels, the maximum or minimum value is the pruning threshold of the l-th layer. The detailed pruning procedure for the l-th layer is summarized below:
  • For each feature map x_{l,j} from the l-th layer, calculate its importance a_j using the proposed metric.
  • For each filter w_{n,j} from the (l+1)-th layer, calculate k_j = \sum_n |w_{n,j}|.
  • According to the distribution of k_j, determine the error bound δ.
  • Rank the a_j by importance.
  • Select the first m filters by a_j and accumulate the corresponding k_j until the optimization condition (8) is satisfied. Among the importance scores a_j of the remaining filters, the maximum or minimum value is the pruning threshold of the l-th layer.
Generally, applicable metrics include but are not limited to the L1-norm of filters, which also leads to the removal of the corresponding feature maps. Thus, any pruning work that results in the removal of feature maps, such as [24,33,34], can use this method, and its pruning performance can be improved by our proposed pruning threshold method.
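To make the procedure concrete, the following minimal NumPy sketch selects the kept channels of one layer and the resulting pruning threshold from an importance vector a, the next-layer statistics k, and an error bound δ. The function and variable names are illustrative assumptions of ours, not taken from any released code.

```python
import numpy as np

def layerwise_threshold(a, k, delta, larger_is_important=True):
    """Greedy solution of condition (8) for one layer.

    a     : importance score of each channel of layer l (the pruning metric).
    k     : k_j = sum_n |w_{n,j}| computed from the next layer.
    delta : error bound; condition (8) requires the kept channels' k_j to sum to >= delta.

    Returns a boolean keep-mask s_l and the pruning threshold of layer l.
    """
    a = np.asarray(a, dtype=float)
    k = np.asarray(k, dtype=float)
    # Rank channels from most to least important under the chosen metric.
    order = np.argsort(-a) if larger_is_important else np.argsort(a)

    keep = np.zeros(a.shape[0], dtype=bool)
    acc = 0.0
    for j in order:            # keep the top-ranked channels one by one ...
        if acc >= delta:       # ... and stop as soon as condition (8) holds
            break
        keep[j] = True
        acc += k[j]

    pruned = a[~keep]
    if pruned.size == 0:
        return keep, None      # nothing is pruned in this layer
    # The layer's threshold is the extreme importance score among pruned channels.
    threshold = pruned.max() if larger_is_important else pruned.min()
    return keep, threshold
```

Under one reading of Section 3.4, δ would be instantiated as a fraction of \sum_j k_j, e.g., `layerwise_threshold(a, k, delta=r * k.sum())` for a pruning factor r; this instantiation is our assumption, not a prescription from the paper.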

3.4. Analysis of Choice of Error Bound

In this subsection, we investigate the effect of the error bound on the pruning performance. The error bound δ determines the degree of pruning, so it needs to be chosen carefully; in practice, it is hard to set because of the uniqueness of each layer. In 2019, a study on filter pruning via geometric median (FPGM) [34] pointed out the problem of the "smaller-norm-less-important" criterion adopted by previous works, arguing that a large norm deviation and a small minimum norm are required for norm-based pruning to be effective. Inspired by FPGM, we find that the error bound is related to the statistical information computed from the next layer. Therefore, we set the error bound as r \cdot \sum_i \sum_j |w_{i,j}^{(l)}|, where r is related to the Std/Min of the distribution of k_j (Std and Min denote the standard deviation and the minimum of the k_j, respectively).
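For illustration, the short sketch below computes the Std/Min statistic of a layer's k_j values. How exactly it is mapped to the pruning factor r (used directly when small, as for ResNet-56 in Table 5, or replaced by a fixed value such as 0.5 when large, as for VGG-16 in Table 2) is a design choice; the helper below is only our assumed reading of it.

```python
import numpy as np

def std_over_min(k):
    """Std/Min of the next-layer column sums k_j for one layer."""
    k = np.asarray(k, dtype=float)
    return k.std() / k.min()

def pruning_factor(k, cap=0.5):
    """Assumed mapping from Std/Min to the pruning factor r:
    use Std/Min itself when it is small, otherwise fall back to a fixed cap."""
    return min(std_over_min(k), cap)
```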

4. Experiments

In this section, we evaluate our method on the popular VGG [35] and ResNet [36] architectures using the CIFAR-10 and ImageNet datasets. There are two reasons for using the VGG model and the CIFAR-10 dataset. First, both are highly redundant, so almost all pruning algorithms report VGG/CIFAR-10 results to verify their effectiveness. Second, all five metric-based pruning works considered here use the VGG model and the CIFAR-10 dataset, which allows a direct comparison in this section. We organize the experiments according to different pruning metrics, including the L1-norm and L2-norm of weights, activations, BN scaling factors, and the geometric median, which come from five different works on metric-based pruning. We first introduce these pruning works and use their default settings, and we then improve their pruning performance with the proposed pruning threshold method. Where accuracy is not reported in a table, the accuracy of the pruned model decreased by less than 0.1 compared to the unpruned model.

4.1. L1-Norm-Based Filter Pruning [18]

L1-norm-based filter pruning [18] is one of the earliest works on structured pruning; it selects the L1-norm as the pruning metric and prunes redundant filters using the three-step pipeline in Figure 2. First, the unpruned network is trained to learn which filters are important. Next, the filters with small L1-norms are removed. Finally, the pruned network is retrained to fine-tune the remaining filters. We improve this pruning work with the proposed pruning threshold method. The model parameters and FLOPs before and after pruning are shown in Table 1. The detailed procedure for pruning the l-th layer is as follows (a code sketch of the required filter statistics follows the list):
  • For each filter w_{j,m} from the l-th layer, calculate the L1-norm a_j = \sum_m |w_{j,m}|.
  • For each filter w_{n,j} from the (l+1)-th layer, calculate k_j = \sum_n |w_{n,j}|.
  • According to the distribution of k_j, determine the error bound δ.
  • Rank the a_j in descending order.
  • Select the first m filters by a_j and accumulate the corresponding k_j until the optimization condition (8) is satisfied. Among the importance scores a_j of the remaining filters, the maximum value is the pruning threshold of the l-th layer.
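A PyTorch sketch of the first two steps for two consecutive convolutional layers is shown below; `layerwise_threshold` refers to the helper sketched in Section 3.3, and the layer sizes in the example are hypothetical.

```python
import torch
import torch.nn as nn

def l1_filter_scores(conv_l: nn.Conv2d, conv_next: nn.Conv2d):
    """Steps 1-2 for L1-norm filter pruning.

    a_j : L1-norm of the j-th filter of layer l (the importance metric).
    k_j : sum_n |w_{n,j}| of layer l+1, i.e. the L1 mass of the weights that
          read the j-th input channel of the next layer.
    """
    w_l = conv_l.weight.detach()        # shape: (out_l, in_l, kH, kW)
    w_next = conv_next.weight.detach()  # shape: (out_next, out_l, kH, kW)

    a = w_l.abs().sum(dim=(1, 2, 3))    # one score per filter of layer l
    k = w_next.abs().sum(dim=(0, 2, 3)) # one value per input channel of layer l+1
    assert a.numel() == k.numel(), "layer l outputs must feed layer l+1 inputs"
    return a.cpu().numpy(), k.cpu().numpy()

# Hypothetical layer sizes for illustration:
conv_l = nn.Conv2d(64, 128, kernel_size=3, padding=1)
conv_next = nn.Conv2d(128, 256, kernel_size=3, padding=1)
a, k = l1_filter_scores(conv_l, conv_next)
# keep, tau = layerwise_threshold(a, k, delta)   # see the sketch in Section 3.3
```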
Detailed experiment results are shown as follows:
  • VGG-16 on CIFAR-10: According to the value of Std/Min, we can distinguish the pruning sensitivities of different layers. Table 2 shows our results. We observe that the Std/Min values of Conv-1 and Conv-8 to 13 are significantly higher than those of Conv-2 to 7, so we only prune Conv-1 and Conv-8 to 13. This finding is consistent with L1-norm-based filter pruning [18], but we do not use its time-consuming sensitivity analysis. For simplicity, we set the pruning factor r to 0.5 for these pruned layers. Exactly how many filters are pruned depends on the L1-norm and the corresponding k_j. As shown in Table 3 and Table 4, we achieved higher FLOPs reductions at a lower time cost. In particular, in Table 3 the compression ratio increases from 34% to 53%, a substantial improvement. To achieve even higher compression, different pruning factors r can be set carefully for different layers based on the value of Std/Min; a sigmoid-shaped mapping may be a good choice and is a direction of our future work.
  • ResNet-56 on CIFAR-10: Similar to the VGG-16 experiment on CIFAR-10, we can distinguish the pruning sensitivities of different layers according to Std/Min. Table 5 shows our results. Unlike VGG-16, the Std/Min values of ResNet-56 are very small, so we take Std/Min itself as the pruning factor r of each layer. As the stage grows, the average value of Std/Min decreases, so the pruning threshold should also decrease; this finding is consistent with the pruning rate setting in the original paper (0.6, 0.3, 0.1). However, we do not skip any sensitive layers in the pruning process, which avoids the time-consuming sensitivity analysis. As shown in Table 1, ResNet-56 also achieves a competitive compression performance.
  • ResNet-34 on ImageNet: Given the promising results on VGG-16 and ResNet-56, we also test our method on a more complicated network, ResNet-34, on the ImageNet dataset. Different from ResNet-56, ResNet-34 uses a projection shortcut when the output feature maps are downsampled, which makes pruning more difficult than for ResNet-56. Following the pruning strategy of [18], we only prune the first layer of each residual block. The results are shown in Table 1. Although the FLOPs are slightly higher than the pruned result of the original paper, our method uses neither time-consuming sensitivity analysis nor the skipping of sensitive layers. Moreover, our pruning threshold method achieves more parameter savings than L1-norm-based filter pruning [18].
To understand the pruning sensitivity of each layer, L1-norm-based filter pruning [18] prunes each layer independently and evaluates the pruned network's accuracy on the validation set, which requires hundreds of rounds of testing and retraining. Such time demands are prohibitive for network pruning. As shown in Table 1, in each row our method achieves a better compression performance with few extra experiments, which greatly reduces the time cost.

4.2. Network Trimming [20]

Network trimming [20] selects the activation output as the pruning metric and prunes redundant neurons based on an analysis of network outputs over a large dataset. It defines the Average Percentage of Zeros (APoZ) to measure the percentage of zero activations of a neuron after the ReLU mapping. A neuron whose output is mostly zero contributes very little, so a certain percentage of neurons with large APoZ values are pruned. To avoid pruning too many neurons in one step, an iterative scheme is chosen to prune the network; instead, we use a one-shot scheme for a fast comparison. To decide how many neurons to prune, network trimming [20] prunes neurons whose APoZ is more than one standard deviation above the mean APoZ of the target layer. We compare this pruning scheme with our proposed pruning threshold scheme. Detailed results are shown in Table 6: our proposed pruning threshold method achieves a better compression performance than the original pruning scheme.
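For reference, APoZ itself is easy to compute from post-ReLU feature maps collected on a validation set; the NumPy sketch below is a simplified per-channel version of ours, not the authors' implementation.

```python
import numpy as np

def apoz_per_channel(activations):
    """Average Percentage of Zeros per channel.

    activations : post-ReLU outputs of one layer, shape (N, C, H, W),
                  collected over N validation images.
    Returns C values in [0, 1]; channels with a large APoZ are pruning candidates.
    """
    acts = np.asarray(activations)
    return (acts == 0).mean(axis=(0, 2, 3))   # fraction of zeros per channel

def trimming_mask(apoz):
    """Network trimming's rule of thumb: prune channels whose APoZ exceeds
    the layer mean by more than one standard deviation."""
    apoz = np.asarray(apoz)
    return apoz > apoz.mean() + apoz.std()
```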

4.3. Network Slimming [24]

Network slimming [24] imposes L1 sparsity on the scaling factors of the batch normalization layers during training and selects these scaling factors as the pruning metric. Since the channel scaling factors are compared across layers, this approach automatically produces the target architecture. However, network slimming does not account for differences between layers: evaluating the local importance of each layer's parameters with a global threshold is difficult because the parameters and contributions of each layer differ significantly. As a result, layers that have few parameters but account for substantial FLOPs may be removed entirely. In fact, a 70% global pruning rate is the limit for VGG-16: at an 80% pruning rate, some layers are left with zero channels, which deletes the whole layer. In addition, the pruned filters mainly operate on 2 × 2 feature maps, where removing them saves few FLOPs. To avoid these problems and further compress the network, we improve this pruning approach with our proposed pruning threshold method. As shown in Table 6, network slimming achieves fewer FLOPs when the pruning threshold is set reasonably.
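The sparsity term used by network slimming is straightforward to reproduce: an L1 penalty on the BN scaling factors added to the task loss during training. The PyTorch sketch below is one common way to implement it; the penalty weight `lam` is a hyperparameter we choose arbitrarily here.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 sparsity term on the BN scaling factors (gamma); add it to the task loss."""
    terms = [m.weight.abs().sum()
             for m in model.modules()
             if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d))]
    return lam * torch.stack(terms).sum() if terms else torch.tensor(0.0)

# Training step (sketch):
#   loss = criterion(model(x), y) + bn_l1_penalty(model)
#   loss.backward(); optimizer.step()
```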

4.4. Soft Filter Pruning [33]

In previous hard filter pruning works, the pruned filters are deleted permanently and are no longer updated during training, so the model capacity is reduced along with the model size. To tackle this problem, Soft Filter Pruning (SFP) [33] was proposed, selecting the L2-norm as the pruning metric. Specifically, the pruned filters are still updated while training the pruned model, and the effectiveness of this scheme has been demonstrated on various advanced network architectures. For simplicity, SFP prunes the same percentage of parameters in each layer, which limits its pruning performance. To achieve a better pruned model, we improve the pruning performance of SFP with our proposed pruning threshold method. Please note that SFP prunes filters with small L2-norms. Detailed results are shown in Table 7. For ResNet-20, our method achieves higher compression with a 0.25% accuracy increase. For the pre-trained ResNet-56, our method also achieves a better pruning performance than SFP.
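The soft pruning step itself is simple: after each training epoch, the filters with the smallest L2-norms are zeroed but remain trainable so that they can recover. A hedged PyTorch sketch of one such step follows; in our setting, the number of filters to zero per layer would come from the proposed threshold method rather than a fixed percentage.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def soft_prune_step(conv: nn.Conv2d, num_pruned: int) -> None:
    """One SFP-style step: zero the num_pruned filters with the smallest L2-norm.
    The zeroed filters stay in the model and keep receiving gradient updates."""
    norms = conv.weight.flatten(1).norm(p=2, dim=1)   # L2-norm per filter
    idx = torch.argsort(norms)[:num_pruned]           # smallest-norm filters
    conv.weight[idx] = 0.0
    if conv.bias is not None:
        conv.bias[idx] = 0.0
```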

4.5. Filter Pruning via Geometric Median [34]

Filter pruning via geometric median (FPGM) [34] points out the problem of the "smaller-norm-less-important" criterion used in previous works, whose effectiveness depends on a small minimum norm and a large norm deviation; the statistical information collected from pre-trained models suggests that these two requirements do not always hold. To solve this problem, FPGM abandons norm-based importance and introduces the geometric median as the pruning metric. Similar to SFP [33], using the same pruning threshold for all layers limits the pruning performance of FPGM. We further improve its pruning performance using our proposed pruning threshold method. As shown in Table 8, for ResNet our method achieves a higher compression performance with competitive and sometimes higher accuracy. For example, for ResNet-20, our method without a pre-trained model achieves a 0.9% accuracy improvement with fewer FLOPs. Compared to FPGM on ResNet-56, our method without a pre-trained model achieves accuracy competitive with even the pre-trained setting. These results demonstrate that our pruning threshold method can achieve a state-of-the-art pruning performance.
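FPGM's criterion can be approximated by ranking each filter by its total distance to all other filters in the same layer: filters closest to the geometric median are considered the most replaceable and are pruned first. A minimal sketch of that score, written by us rather than taken from the FPGM code:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fpgm_scores(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of pairwise Euclidean distances from each filter to all the others.
    Filters with the smallest score lie closest to the geometric median of the
    layer's filters and are the first candidates for pruning under FPGM."""
    w = conv.weight.flatten(1)       # (out_channels, in_channels * kH * kW)
    dists = torch.cdist(w, w, p=2)   # pairwise distances between filters
    return dists.sum(dim=1)
```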
In the preceding subsections, we introduced five different works on metric-based pruning and improved their pruning performance with the proposed pruning threshold method. The experimental results show that our method achieves higher FLOPs reductions while saving time. Thus, designing the pruning threshold of each layer from an optimization point of view is an appealing idea. In the future, our method can also be combined with more modern networks such as EfficientNet and MobileNet to achieve higher compression, since different compression directions are complementary.

5. Conclusions

Traditional pruning threshold methods require extensive search experiments or handcrafted heuristics and domain experts to obtain optimal pruning threshold settings. Unlike these algorithms, which usually target a single pruning metric, we propose a general pruning threshold method from an optimization point of view. By formalizing the pruning threshold as an optimization problem, we provide a reasonable, low-cost pruning threshold setting for all metric-based pruning algorithms. Numerical experiments on a range of popular models and datasets demonstrate the effectiveness of our pruning threshold method. Notably, the compression ratio increases from 34% to 53% for the L1-norm pruning algorithm with VGG-16, a substantial improvement. In the future, more refined pruning thresholds can be set carefully for different layers, which can further improve the compression performance; a sigmoid-shaped mapping may be a better choice.

Author Contributions

Y.D.: conceptualization, methodology, data curation, software and writing—original draft. D.-R.C.: supervision and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Natural Science Foundation (L222018) and the National Natural Science Foundation of China (11971048).

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lilhore, U.K.; Imoize, A.L.; Lee, C.C.; Simaiya, S.; Pani, S.K.; Goyal, N.; Kumar, A.; Li, C.T. Enhanced convolutional neural network model for cassava leaf disease identification and classification. Mathematics 2022, 10, 580.
  2. Mahajan, A.; Sharma, N.; Aparicio-Obregon, S.; Alyami, H.; Alharbi, A.; Anand, D.; Sharma, M.; Goyal, N. A novel stacking-based deterministic ensemble model for infectious disease prediction. Mathematics 2022, 10, 1714.
  3. Ma, J.; Wang, L.; Zhang, L.; Zhang, Q. Restoration and enhancement on low exposure raw images by joint demosaicing and denoising. Neural Netw. 2023, 162, 557–570.
  4. Batchuluun, G.; Nam, S.H.; Park, K.R. Deep learning-based plant-image classification using a small training dataset. Mathematics 2022, 10, 3091.
  5. Liu, F.; Kong, Y.; Zhang, L.; Feng, G.; Yin, B. Local-global coordination with transformers for referring image segmentation. Neurocomputing 2023, 522, 39–52.
  6. Yan, B.; Zhang, S.; Yang, Z.; Su, H.; Zheng, H. Tongue segmentation and color classification using deep convolutional neural networks. Mathematics 2022, 10, 4286.
  7. Sun, J.; Yao, W.; Jiang, T.; Wang, D.; Chen, X. Differential evolution based dual adversarial camouflage: Fooling human eyes and object detectors. Neural Netw. 2023, 163, 256–271.
  8. Zhu, L.; Chen, J.; Hu, X.; Fu, C.W.; Xu, X.; Qin, J.; Heng, P.A. Aggregating attentional dilated features for salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3358–3371.
  9. Nekooei, A.; Safari, S. Compression of deep neural networks based on quantized tensor decomposition to implement on reconfigurable hardware platforms. Neural Netw. 2022, 150, 350–363.
  10. Wang, Y.A.; Shen, B.; Zou, L. Recursive fault estimation with energy harvesting sensors and uniform quantization effects. IEEE-CAA J. Autom. Sin. 2022, 9, 926–929.
  11. Lazebnik, T.; Bunimovich-Mendrazitsky, S. Decision tree post-pruning without loss of accuracy using the SAT-PP algorithm with an empirical evaluation on clinical data. Data Knowl. Eng. 2023, 145, 102173.
  12. Oliveira, D.V.R.; Cavalcanti, G.D.C.; Sabourin, R. Online pruning of base classifiers for dynamic ensemble selection. Pattern Recognit. 2017, 72, 44–58.
  13. Tan, J.H.; Chan, C.S.; Chuah, J.H. End-to-end supermask pruning: Learning to prune image captioning models. Pattern Recognit. 2022, 122, 108366.
  14. Yao, K.; Cao, F.; Leung, Y.; Liang, J. Deep neural network compression through interpretability-based filter pruning. Pattern Recognit. 2021, 119, 108056.
  15. Ziv, Y.; Goldberger, J.; Raviv, T.R. Stochastic weight pruning and the role of regularization in shaping network structure. Neurocomputing 2021, 462, 555–567.
  16. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
  17. Han, S.; Pool, J.; Dally, W.J. Learning both weights and connections for efficient neural network. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143.
  18. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  19. Guo, Q.; Wu, X.J.; Kittler, J.; Feng, Z. Weak sub-network pruning for strong and efficient neural networks. Neural Netw. 2021, 144, 614–626.
  20. Hu, H.; Peng, R.; Tai, Y.W.; Tang, C.K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv 2016, arXiv:1607.03250.
  21. Maatta, J.; Bazaliy, V.; Kimari, J.; Djurabekova, F.; Nordlund, K.; Roos, T. Gradient-based training and pruning of radial basis function networks with an application in materials physics. Neural Netw. 2021, 133, 123–131.
  22. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  23. Lebedev, V.; Lempitsky, V. Fast convnets using group-wise brain damage. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2554–2564.
  24. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2736–2744.
  25. He, Y.; Zhang, X.; Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1389–1397.
  26. Luo, J.H.; Wu, J.; Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5058–5066.
  27. Yu, R.; Li, A.; Chen, C.F.; Lai, J.H.; Morariu, V.I.; Han, X.; Gao, M.; Lin, Y.; Davis, L.S. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9194–9203.
  28. Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the value of network pruning. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  29. Kusupati, A.; Ramanujan, V.; Somani, R.; Wortsman, M.; Jain, P.; Kakade, S.; Farhadi, A. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5544–5555.
  30. Manessi, F.; Rozza, A.; Bianco, S.; Napoletano, P.; Schettini, R. Automated pruning for deep neural network compression. In Proceedings of the 24th International Conference on Pattern Recognition, Beijing, China, 20–24 August 2018; pp. 657–664.
  31. Zheng, Z.; Ghodrati, S.; Yazdanbakhsh, A.; Esmaeilzadeh, H.; Kang, M. Accelerating attention through gradient-based learned runtime pruning. In Proceedings of the 49th IEEE/ACM International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; pp. 902–915.
  32. Xu, Z.; Sun, J.; Liu, Y.; Sun, G. An efficient channel-level pruning for CNNs without fine-tuning. In Proceedings of the 2021 International Joint Conference on Neural Networks, Shenzhen, China, 18–22 July 2021; pp. 1–8.
  33. He, Y.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2234–2240.
  34. He, Y.; Liu, P.; Wang, Z.; Hu, Z.; Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4335–4344.
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
Figure 1. Three typical pruning threshold methods. However, these methods still suffer from different problems and limitations.
Figure 2. A typical three-step pruning pipeline. The L1-norm-based filter pruning [18] also follows this pipeline.
Table 1. Overall results. The accuracy, parameters and FLOPs of different models are reported. We achieve higher FLOPs reductions on VGG-16 and ResNet-56. Although the FLOPs are slightly higher on ResNet-34, our method uses neither time-consuming sensitivity analysis nor the skipping of sensitive layers.

Dataset | Model | Method | Accuracy | Params | FLOPs
CIFAR-10 | VGG-16 | Unpruned | 93.63 | 1.50 × 10^7 | 3.14 × 10^8
CIFAR-10 | VGG-16 | Haoli | 93.41 | 5.40 × 10^6 | 2.07 × 10^8
CIFAR-10 | VGG-16 | Ours | 93.54 | 3.80 × 10^6 | 1.83 × 10^8
CIFAR-10 | ResNet-56 | Unpruned | 93.14 | 8.50 × 10^5 | 1.27 × 10^8
CIFAR-10 | ResNet-56 | Haoli | 92.67 | 7.40 × 10^5 | 9.22 × 10^7
CIFAR-10 | ResNet-56 | Ours | 92.97 | 6.20 × 10^5 | 8.29 × 10^7
ImageNet | ResNet-34 | Unpruned | 73.31 | 2.18 × 10^7 | 3.67 × 10^9
ImageNet | ResNet-34 | Haoli | 72.29 | 1.95 × 10^7 | 2.79 × 10^9
ImageNet | ResNet-34 | Ours | 72.57 | 1.80 × 10^7 | 2.87 × 10^9
Table 2. Pruning factor and surviving maps of VGG-16 on CIFAR-10. The values of Std/Min from Conv-1 and Conv-8 to 13 are significantly higher than those of Conv-2 to 7, so we only prune Conv-1 and Conv-8 to 13.

Layer Type | Used Layer | Std/Min | r | Maps
Conv-1 | Conv-2 | 485.87 | 0.5 | 15
Conv-2 | Conv-3 | 0.209 | 0 | 64
Conv-3 | Conv-4 | 0.247 | 0 | 128
Conv-4 | Conv-5 | 0.102 | 0 | 128
Conv-5 | Conv-6 | 0.112 | 0 | 256
Conv-6 | Conv-7 | 0.129 | 0 | 256
Conv-7 | Conv-8 | 0.179 | 0 | 256
Conv-8 | Conv-9 | 0.581 | 0.5 | 214
Conv-9 | Conv-10 | 1.039 | 0.5 | 199
Conv-10 | Conv-11 | 1.597 | 0.5 | 175
Conv-11 | Conv-12 | 1.667 | 0.5 | 178
Conv-12 | Conv-13 | 1.609 | 0.5 | 181
Conv-13 | Linear | 1.329 | 0.5 | 175
Table 3. Other pruning factors and FLOPs of VGG-16 on CIFAR-10. We achieved a higher FLOPs reduction.

Prune Layer | r | Maps | r | Maps
Conv-1 | 0.5 | 15 | 0.6 | 11
Conv-2 | 0.1 | 56 | 0 | 64
Conv-3 | 0.1 | 112 | 0 | 128
Conv-4 | 0.1 | 114 | 0 | 128
Conv-5 | 0.1 | 227 | 0 | 256
Conv-6 | 0.1 | 227 | 0 | 256
Conv-7 | 0.1 | 226 | 0 | 256
Conv-8 | 0.5 | 214 | 0.6 | 167
Conv-9 | 0.5 | 199 | 0.6 | 150
Conv-10 | 0.5 | 175 | 0.6 | 132
Conv-11 | 0.5 | 178 | 0.6 | 133
Conv-12 | 0.5 | 181 | 0.6 | 136
Conv-13 | 0.5 | 175 | 0.6 | 129
FLOPs | | 0.148 G | | 0.173 G
Table 4. Pruning results of VGG-16 on CIFAR-10. We achieved a higher FLOPs reduction than the original method.

Layer Type | Maps (VGG-16) | Maps (Haoli) | Maps (Ours) | FLOPs (VGG-16) | FLOPs (Haoli) | FLOPs (Ours)
Conv-1 | 64 | 32 | 15 | 1.77 × 10^6 | 8.85 × 10^5 | 4.15 × 10^5
Conv-2 | 64 | 64 | 64 | 3.77 × 10^7 | 1.89 × 10^7 | 8.85 × 10^6
Conv-3 | 128 | 128 | 128 | 1.89 × 10^7 | 1.89 × 10^7 | 1.89 × 10^7
Conv-4 | 128 | 128 | 128 | 3.77 × 10^7 | 3.77 × 10^7 | 3.77 × 10^7
Conv-5 | 256 | 256 | 256 | 1.89 × 10^7 | 1.89 × 10^7 | 1.89 × 10^7
Conv-6 | 256 | 256 | 256 | 3.77 × 10^7 | 3.77 × 10^7 | 3.77 × 10^7
Conv-7 | 256 | 256 | 256 | 3.77 × 10^7 | 3.77 × 10^7 | 3.77 × 10^7
Conv-8 | 512 | 256 | 214 | 1.89 × 10^7 | 9.44 × 10^6 | 7.89 × 10^6
Conv-9 | 512 | 256 | 199 | 3.77 × 10^7 | 9.44 × 10^6 | 6.13 × 10^6
Conv-10 | 512 | 256 | 175 | 3.77 × 10^7 | 9.44 × 10^6 | 5.01 × 10^6
Conv-11 | 512 | 256 | 178 | 9.44 × 10^6 | 2.36 × 10^6 | 1.12 × 10^6
Conv-12 | 512 | 256 | 181 | 9.44 × 10^6 | 2.36 × 10^6 | 1.16 × 10^6
Conv-13 | 512 | 256 | 175 | 9.44 × 10^6 | 2.36 × 10^6 | 1.14 × 10^6
Total | | | | 3.14 × 10^8 | 2.07 × 10^8 | 1.83 × 10^8
Table 5. The Std/Min of ResNet-56 on CIFAR-10. We find that the Std/Min values of ResNet-56 are very small, so we take Std/Min as the pruning factor r of each layer.

Layer | Used Layer | Std/Min | Layer | Used Layer | Std/Min
Conv-2 | Conv-3 | 0.22 | Conv-30 | Conv-31 | 0.41
Conv-4 | Conv-5 | 0.47 | Conv-32 | Conv-33 | 0.38
Conv-6 | Conv-7 | 0.58 | Conv-34 | Conv-35 | 0.41
Conv-8 | Conv-9 | 0.47 | Conv-36 | Conv-37 | 0.47
Conv-10 | Conv-11 | 0.24 | Conv-38 | Conv-39 | 0.13
Conv-12 | Conv-13 | 0.39 | Conv-40 | Conv-41 | 0.15
Conv-14 | Conv-15 | 0.36 | Conv-42 | Conv-43 | 0.11
Conv-16 | Conv-17 | 0.32 | Conv-44 | Conv-45 | 0.12
Conv-18 | Conv-19 | 0.27 | Conv-46 | Conv-47 | 0.13
Conv-20 | Conv-21 | 0.15 | Conv-48 | Conv-49 | 0.11
Conv-22 | Conv-23 | 0.12 | Conv-50 | Conv-51 | 0.13
Conv-24 | Conv-25 | 0.32 | Conv-52 | Conv-53 | 0.10
Conv-26 | Conv-27 | 0.27 | Conv-54 | Conv-55 | 0.12
Conv-28 | Conv-29 | 0.39 | | |
Table 6. Results of network trimming [20] and network slimming [24] on CIFAR-10, which show that our proposed pruning threshold method achieves a better compression performance than the original pruning schemes.

Dataset | Model | Method | Pruning Threshold | Acc | FLOPs
CIFAR-10 | VGG-16 | Network Trimming [20] | Unpruned | 93.63 | 3.14 × 10^8
CIFAR-10 | VGG-16 | Network Trimming [20] | Pruned | 93.41 | 2.34 × 10^8
CIFAR-10 | VGG-16 | Network Trimming [20] | Ours | 93.54 | 1.84 × 10^8
CIFAR-10 | VGG-19 | Network Slimming [24] | Unpruned | 93.53 | 3.99 × 10^8
CIFAR-10 | VGG-19 | Network Slimming [24] | Pruned | 93.60 | 1.95 × 10^8
CIFAR-10 | VGG-19 | Network Slimming [24] | Ours | 93.56 | 1.74 × 10^8
Table 7. Comparison of pruned ResNet on CIFAR-10. For ResNet-20, our method achieves higher compression with a 0.25% accuracy increase. For the pre-trained ResNet-56, our method also achieves a better pruning performance than SFP.

Depth | Method | Pre-Trained | Accuracy | FLOPs
20 | SFP(10%) | N | 92.24 | 3.54 × 10^7
20 | SFP(20%) | N | 91.20 | 2.93 × 10^7
20 | SFP(30%) | N | 90.83 | 2.47 × 10^7
20 | Ours | N | 91.45 | 2.57 × 10^7
56 | SFP(20%) | N | 93.47 | 9.31 × 10^7
56 | SFP(30%) | N | 93.10 | 7.87 × 10^7
56 | SFP(40%) | N | 92.26 | 6.31 × 10^7
56 | SFP(40%) | Y | 93.35 | 6.31 × 10^7
56 | Ours | N | 92.73 | 5.74 × 10^7
56 | Ours | Y | 93.41 | 5.74 × 10^7
Table 8. Comparison of pruned ResNet on CIFAR-10. For ResNet, our method achieves a higher compression performance with competitive and even higher accuracy.

Depth | Method | Pre-Trained | Accuracy | FLOPs
20 | FPGM(30%) | N | 91.09 | 2.47 × 10^7
20 | FPGM(40%) | N | 90.44 | 1.95 × 10^7
20 | Ours | N | 91.99 | 2.32 × 10^7
56 | FPGM(40%) | N | 92.93 | 6.31 × 10^7
56 | FPGM(40%) | Y | 93.49 | 6.31 × 10^7
56 | Ours | N | 93.33 | 5.54 × 10^7
56 | Ours | Y | 93.45 | 5.54 × 10^7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
