Article

Channel-Pruning Convolutional Neural Network with Learnable Kernel Element Position Convolution Utilizing the Symmetric Whittaker–Shannon Interpolation Function

1 Key Laboratory of Independent Intelligent Technology and System, Tiangong University, Tianjin 300387, China
2 School of Computer Science and Technology, Tiangong University, Tianjin 300387, China
3 School of Software and Communication, Tianjin Sino-German University of Applied Sciences, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(3), 390; https://doi.org/10.3390/sym17030390
Submission received: 26 January 2025 / Revised: 23 February 2025 / Accepted: 26 February 2025 / Published: 4 March 2025
(This article belongs to the Section Computer)

Abstract

Large convolution kernels offer clear performance advantages: a single convolution operation covers a wider area and captures a broader range of spatial information, which matters for tasks with significant spatial variation. However, increasing the kernel size imposes substantial memory and computational costs on deep convolutional neural networks. We therefore propose a learnable kernel element position convolution based on the symmetric Whittaker–Shannon interpolation function (WSIPC), and apply channel-level pruning (CP) to the resulting large-kernel network to compress it. WSIPC permits any number of kernel elements, and the positions of the non-zero elements are learned in a gradient-based manner. Exploiting the normal distribution of effective receptive fields reduces computation and parameter complexity while improving classification performance; the method achieves its best results on the large-kernel ConvNeXt. CP uses the scaling factor of the Layer Normalization (LN) layers in ConvNeXt as a proxy for channel selection and, during training, automatically identifies and prunes unimportant channels in wide, large models. The result is a streamlined model with comparable accuracy that is more compact in model size, runtime memory, and computational operations.

1. Introduction

The receptive field of a deep convolutional network is a crucial factor for recognition and downstream tasks in computer vision. Araujo et al. [1], for example, observed a logarithmic relationship between classification accuracy and receptive field size, indicating that the large receptive field provided by large convolutional kernels is essential for visual tasks and helps models perform better. However, increasing the kernel size raises both memory and computational costs, so the kernel cannot be enlarged indefinitely. To expand the receptive field without increasing the kernel size, Yu et al. [2] proposed standard dilated convolution (DC), which enlarges the receptive field of convolutional layers without adding learnable parameters or computational cost. DC periodically inserts spaces between kernel elements to expand the kernel, so its units sample the input feature map at fixed positions. Although this expands the receptive field, the sampling is limited: when several dilated convolutions with the same dilation rate are stacked, gaps remain between the non-zero elements, so not all pixel values within the covered range are actually sampled. The lost pixels carry detailed information whose absence hinders model learning, and some of the long-distance information that is captured may be irrelevant and disrupt the continuity of local information. Since Yu et al.'s work, DC achieved initial success in classification tasks but has become less popular over time, and its application has largely been limited to the downstream tasks mentioned above. Ding et al. [3] attempted to use DC in their RepLKNet architecture but did not obtain good results; the failure of standard dilated convolution in classification tasks stems from the inherent gridding (mesh) effect it produces. In view of this, we propose the learnable kernel element position convolution using Whittaker–Shannon interpolation (WSIPC). WSIPC exploits the normal distribution of effective receptive fields to reduce computational and parameter complexity and ultimately improve classification accuracy.
In the WSIPC method, the positions of the non-zero elements in the convolution kernel are learned from gradients. Since positions in a kernel are integers, interpolation is used to avoid producing fractional positions (Figure 1c). Unlike standard convolution (Figure 1a) and dilated convolution (Figure 1b), this method has no fixed kernel element grid; instead, it allows any number of kernel elements, an adjustable hyperparameter we call the "kernel count". In this paper, we set it so that the number of parameters matches, or is lower than, that of the baseline we compare against. Unlike fixed kernels, we define the kernel size as the maximum extent within which kernel elements may move, which we call the "expanded kernel size"; this is also an adjustable hyperparameter. Kernel element positions are randomly initialized and are allowed to move within the expanded kernel size throughout learning. Moreover, sharing positions among multiple blocks within the same convolution stage further improves accuracy while reducing the number of learnable parameters, and in our experience combining WSIPC with other learning techniques continues to improve overall performance. In summary, WSIPC is a novel convolution method that permits any number of kernel elements, whose positions are not fixed but learned from gradients. It uses Whittaker–Shannon interpolation and ensures that the distances between elements follow a normal distribution, fitting the effective receptive field and thereby reducing computational and parameter complexity while improving classification accuracy. Standard convolution and dilated convolution have a fixed kernel element grid, whereas WSIPC does not, giving the convolution operation more flexibility.
Although deep convolutional neural networks have clear advantages in computer vision tasks, they demand more resources. The performance of CNNs is currently constrained in three respects: (1) model size: the representational power of neural networks stems from their millions of trainable parameters, which must be stored on disk together with the network structure and loaded into memory during inference; (2) runtime memory: during inference, even with a batch size of 1, the intermediate activations of a convolutional network may occupy more memory than the model parameters themselves; (3) computational operations: convolutions on high-resolution images require a large amount of computation. Pruning has proven effective and practical in many network-compression paradigms [4,5]. The aim of pruning is to remove redundant parameters from a given network so as to reduce its size and speed up inference as much as possible. Mainstream pruning methods fall roughly into two schemes: structured pruning [6,7] and unstructured pruning [8]. The core difference is that structured pruning modifies the network structure by physically removing parameters, whereas unstructured pruning zeroes some weights without changing the structure. Because structured pruning does not rely on specialized software or hardware accelerators to cut memory consumption and computational cost, it has found wider use in practice.
We therefore propose a structured channel pruning (CP) method for compressing networks. The method uses the scaling factor γ of the normalization layer as a proxy for channel selection and achieves network compression by enforcing channel-level sparsity on these scaling factors. CP can be summarized as a sparsity-induced penalty, weighted by the coefficient λ, on the scaling factors. We associate a scaling factor γ with each channel and multiply it by that channel's output, jointly train the network weights with sparse regularization applied to the factors, and finally prune the channels with small factors and fine-tune the pruned network.
This article therefore has two main aims: to tackle the high memory and computational costs that large convolution kernels impose on deep convolutional neural networks, and to address the inability of the fixed grid structure of traditional convolutions to flexibly adjust the receptive field. We propose a learnable kernel element position convolution (WSIPC) based on Whittaker–Shannon interpolation and combine it with the channel-pruning (CP) method, thereby improving classification performance while reducing the number of model parameters and the computational complexity. Our contributions are summarized as follows.
  • Our proposed dilated convolution with Whittaker–Shannon interpolation learns the positions of kernel elements in an input-independent manner, alleviating the rigid grid constraint imposed by standard dilated convolutions. On the large-kernel ConvNeXt, replacing the depthwise separable convolution with WSIPC convolution improves accuracy by 1.1% with the same number of parameters and low throughput cost.
  • We used the CIFAR100 dataset, and as the size or kernel count of the WSIPC convolution kernel increased, the overall classification accuracy on the large kernel ConvNeXt showed an upward trend. When the kernel count was 26 for a 23 × 23 kernel size, the Top-1 acc. was 82.28%, achieving the best performance.
  • We use the scaling factor of the normalization layer as a proxy for channel selection to perform channel-level pruning and compression on the network.
  • The experiments of the CP method on CIFAR100 and SVHN datasets show that when ConvNeXt-WSIPC (20% Pruned) reduces the number of parameters by 17.8% compared to before pruning, FLOPs decrease by 15.1% and testing error decreases by 0.6%.

2. Related Works

2.1. Effective Receptive Field

One of the research motivations behind the WSIPC method is the study of the effective receptive field. Luo et al. [9] characterized the extent to which each input pixel within the receptive field impacts the output of a unit, revealing that in traditional convolution not all pixels in the receptive field contribute equally to the output response: the influence of the kernel center is significantly greater, and the size of the effective receptive field has a quadratic relationship with kernel size. Simply enlarging the convolution kernel to expand the effective receptive field therefore makes the number of trainable parameters grow rapidly to a daunting level. In light of these findings, introducing new degrees of freedom by learning the positions of the non-zero weights in an expanded kernel may enhance the expressive ability of convolutional neural networks. Inspired by the Gaussian distribution property of the effective receptive field, Chen et al. proposed the Gaussian Mask Convolutional Kernel [10], which introduces concentric circular or elliptical receptive fields into the CNN kernel. Similarly, Khalfaoui-Hassani et al. put forward a learnable-spacing convolution [11] in which a Gaussian interpolation, following a normal distribution and decaying rapidly from the center, matches the effective receptive field.
The Gaussian attention bias with learnable standard deviation proposed by Kim et al. [12] has been successfully used for position embedding in the attention module of the ViT model of Dosovitskiy et al. [13] and brings reasonable benefits.

2.2. Large Convolution Kernels

The vision transformer proposed by Dosovitskiy et al. [13], the CNNs proposed by Liu et al. [14] and Ding et al. [3], and the recent studies by Trockman and Kolter [15] and Liu et al. [16] have emphasized that, following the success of vision transformers, large kernels offer better returns than the traditional 3 × 3 kernels used in previous state-of-the-art CNN models. However, when the kernel size is increased naively, accuracy quickly saturates or even declines; for example, in ConvNeXt the optimal accuracy is achieved with a 7 × 7 kernel [14]. Ding et al. used structural reparameterization techniques [3] to demonstrate the benefits of increasing the kernel size to 31 × 31, and Liu et al. subsequently showed that there was still room for improvement by increasing the kernel size to 51 × 51 [16].
Ding et al. developed an implicit GEMM (general matrix multiplication) method for depthwise convolution [3], which has been integrated into the open-source framework MegEngine [17]. In addition, they spatially separated the depthwise kernel and accumulated the resulting activations. However, all of these improvements incur memory and computation costs, and it seems impossible to increase the kernel size indefinitely.

2.3. High-Performance Convolution Operation

In the field of convolutional neural networks, various methods have been explored to improve the strength and efficiency of high-performance convolution operations. Celarek et al. [18] studied fitting input channels with Gaussian mixtures in Gaussian mixture convolutional networks, while Chen et al. used Gaussian masks in their work. Kim and Park [19] studied continuous kernel convolution in the context of image processing, with a method similar to the linear correlation introduced by Thomas et al. [20]. Romero et al. [21,22] also made significant contributions to learning continuous functions that map positions to weights. Jacobsen et al. [23] proposed Structured Receptive Fields, whose core idea is to introduce more spatial structural information into the convolution operation so that the receptive field can better capture and integrate local and global information in the image. Pintea et al. [24] extended this method by learning Gaussian widths, effectively optimizing resolution, and Shelhamer et al. [25] introduced a kernel decomposition in which the kernel is represented as the composition of a standard kernel and a structured Gaussian kernel; in these three works, the Gaussian model is centered on the kernel. Hassani et al. proposed a learnable spacing convolution (DCLS), in which the spacing between non-zero elements is not fixed but is learned through bilinear interpolation and backpropagation. ADCNN [26] is an input-dependent convolution method that learns dilation rates rather than kernel positions, but unlike DCLS it uses regular grids with learnable rates. In addition, building on deformable convolution v1 (DCNv1) [27] and v2 (DCNv2) [28], Wang et al. proposed deformable convolution v3 [29], which enhances the spatial sampling positions in the module with additional offsets learned from the target task and trained end to end through standard backpropagation; the resulting InternImage network uses deformable convolution as its core operator to provide the large effective receptive field required by downstream tasks such as detection and segmentation. Our method differs from deformable convolution in several respects. First, deformable convolution applies a regular convolution to obtain input-dependent offsets, which are then passed to the deformable operation; in our method, a kernel with learnable positions is constructed and then passed to the convolution operation, so the positions are independent of the input. Second, unlike deformable convolution, the convolution positions in our method are channel dependent. Third, deformable convolutions were developed for regular convolutions rather than for depthwise separable convolutions. Finally, the number of additional learnable parameters in deformable convolution equals the number of kernel elements of the initial convolution, which establishes a strong dependency between the offsets and the input feature map, whereas in our method the number of additional parameters dedicated to learning positions is only twice the number of kernel weights.

2.4. Network Compression

Many works have been proposed to compress large convolutional neural networks, and model pruning is a widely used compression method that directly reduces the number of parameters in the model. One disadvantage of unstructured pruning is that the resulting weight matrix is sparse, so compression and acceleration cannot be realized without dedicated hardware or libraries. In contrast, channel pruning preserves the original convolutional structure, so no dedicated hardware or libraries are needed. Huang et al. [30] used sensitivity-analysis pruning to evaluate the contribution of channels to model performance and remove channels with little impact, but this relies on manually set thresholds, which may cause performance loss. Zhang et al. [31] dynamically adjusted pruning strategies based on channel activations during training, which is highly flexible and adapts to changes during training but increases training time and complexity. Kokol et al. [32] used graph-theoretic methods to identify and remove redundant channels, which depend strongly on the graph structure and are not suitable for some network models. Sellars et al. [33] used reinforcement learning to select channels and improve pruning flexibility, but training is long and requires large amounts of data and computation. Liebl et al. [34] considered the relationships between layers and used cross-layer pruning to achieve a more comprehensive pruning effect, at the cost of complex model design and a pruning process that is difficult to debug.

3. Learnable Kernel Element Position Convolution Utilizing the Symmetric Whittaker–Shannon Interpolation

3.1. Method

We use $m \in \mathbb{N}^*$ to denote the number of kernel elements in the convolutional kernel, which is called the "kernel count", and $(S_x, S_y) \in \mathbb{N}^* \times \mathbb{N}^*$ to denote the sizes of the constructed kernel along the x and y axes; the latter can be regarded as the limiting size of the expanded kernel, called the "expanded kernel size". The real numbers $w$, $p^x$, $\sigma^x$, $p^y$, and $\sigma^y$ denote, respectively, the weight, the mean position and standard deviation along the x axis, and the mean position and standard deviation along the y axis. The space of $S_1 \times S_2$ matrices over $\mathbb{R}$ is denoted $\mathcal{M}_{S_1,S_2}(\mathbb{R})$. Our method learns the coordinates $p^x$ and $p^y$ of each weight $w$ within the expanded kernel. The construction relies on interpolation and can be described using the following Whittaker–Shannon interpolation function:
$$F : \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m \to \mathcal{M}_{S_x,S_y}(\mathbb{R})$$
$$(\mathbf{w}, \mathbf{p}^x, \mathbf{p}^y) \mapsto K = \sum_{k=1}^{m} f\!\left(w_k, p_k^x, p_k^y\right)$$
The function $f$ is defined as follows:
$$f : \mathbb{R} \times \mathbb{R} \times \mathbb{R} \to \mathcal{M}_{S_x,S_y}(\mathbb{R}), \qquad (w, p^x, p^y) \mapsto K$$
In Dai et al.'s study, the bilinear interpolation is described by the g function in Formula (6), which is referred to as the triangle function (as shown in Figure 2); the triangle function is also widely used in kernel density estimation. In Hassani et al.'s study, a triangle function with a coefficient of 1 was used, which corresponds to bilinear interpolation [26].
The derivation formula is as follows:
$$K_{ij} = \begin{cases} w\,(1 - r^x)(1 - r^y) & i = \lfloor p^x \rfloor,\; j = \lfloor p^y \rfloor \\ w\,r^x (1 - r^y) & i = \lfloor p^x \rfloor + 1,\; j = \lfloor p^y \rfloor \\ w\,(1 - r^x)\, r^y & i = \lfloor p^x \rfloor,\; j = \lfloor p^y \rfloor + 1 \\ w\,r^x r^y & i = \lfloor p^x \rfloor + 1,\; j = \lfloor p^y \rfloor + 1 \\ 0 & \text{otherwise} \end{cases}$$
in which $i \in \{1, \dots, S_x\}$, $j \in \{1, \dots, S_y\}$, and
$$r^x = p^x - \lfloor p^x \rfloor, \qquad r^y = p^y - \lfloor p^y \rfloor$$
In summary, $K_{ij}$ in Formula (3) is equivalent to the following:
$$K_{ij} = w \cdot g\!\left(p^x - i\right) \cdot g\!\left(p^y - j\right)$$
The g function is written as:
$$\forall x \in \mathbb{R}, \quad g(x) = \max\!\left(0,\, 1 - |x|\right)$$
Here, the parameter $\sigma \in \mathbb{R}$ of the triangle function is set to $\sigma = 1$.
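For concreteness, the following minimal Python sketch (our own illustration, using 0-based indices rather than the 1-based notation above, with σ = 1) checks numerically that the piecewise bilinear form of $K_{ij}$ coincides with the closed form $w \cdot g(p^x - i) \cdot g(p^y - j)$; the function names are ours, not from any library:

```python
import math

def g(x):
    # Triangle (tent) function used by bilinear interpolation, sigma = 1.
    return max(0.0, 1.0 - abs(x))

def K_bilinear(w, px, py, Sx, Sy):
    # Spread a single weight w located at the real-valued position (px, py)
    # over the four neighbouring integer cells of an Sx x Sy kernel.
    K = [[0.0] * Sy for _ in range(Sx)]
    ix, iy = math.floor(px), math.floor(py)
    rx, ry = px - ix, py - iy
    for (i, j, v) in [(ix,     iy,     w * (1 - rx) * (1 - ry)),
                      (ix + 1, iy,     w * rx       * (1 - ry)),
                      (ix,     iy + 1, w * (1 - rx) * ry),
                      (ix + 1, iy + 1, w * rx       * ry)]:
        if 0 <= i < Sx and 0 <= j < Sy:
            K[i][j] = v
    return K

# The closed form w * g(px - i) * g(py - j) reproduces the same four entries.
w, px, py, Sx, Sy = 1.0, 2.3, 4.6, 7, 7
K = K_bilinear(w, px, py, Sx, Sy)
for i in range(Sx):
    for j in range(Sy):
        assert abs(K[i][j] - w * g(px - i) * g(py - j)) < 1e-12
```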
We replace the triangle function with the Whittaker–Shannon interpolation function [35] to fit the effective receptive field by ensuring that the distances between elements follow a normal distribution (as shown in Figure 2).
$$\forall x \in \mathbb{R}, \forall \sigma \in \mathbb{R}, \quad S_\sigma(x) = \frac{\sin(\sigma x)}{\sigma x}$$
We found that scaling parameter σ can be learned through backpropagation, which can improve the performance of convolution. In practice, the Whittaker–Shannon interpolation function uses the following formula:
$$\forall x \in \mathbb{R}, \forall \sigma \in \mathbb{R}, \quad S_{\frac{\sigma_0}{\sigma}}(x) = \frac{\sin\!\left(\frac{\sigma_0}{\sigma}x\right)}{\frac{\sigma_0}{\sigma}x}$$
where $\sigma_0 \in \mathbb{R}$ is a constant that determines the minimum standard deviation the interpolation can reach; we set $\sigma_0 = \pi$. Finally, so that the interpolation over the kernel sums to 1, the interpolation is divided by the following normalization term:
$$A = \varepsilon + \sum_{i=1}^{S_x} \sum_{j=1}^{S_y} S_{\frac{\sigma_0}{\sigma^x}}\!\left(p^x - i\right) \cdot S_{\frac{\sigma_0}{\sigma^y}}\!\left(p^y - j\right)$$
where $\varepsilon = 10^{-7}$ avoids division by zero.
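As an illustration of the interpolation just described, the following NumPy sketch (our own, with hypothetical function names; the $\sigma_0/\sigma$ scaling follows our reading of the formula above, with $\sigma_0 = \pi$ folded into NumPy's normalized sinc and $\varepsilon = 10^{-7}$ as in the text) computes the normalized interpolation weights of a single kernel element over the expanded kernel:

```python
import numpy as np

EPS = 1e-7   # avoids division by zero in the normalisation term

def sinc_interp(x, sigma):
    # S_{sigma0/sigma}(x) = sin((sigma0/sigma) x) / ((sigma0/sigma) x);
    # with sigma0 = pi this is exactly numpy's normalised sinc of x / sigma.
    return np.sinc(x / sigma)

def interpolation_weights(px, py, sx, sy, Sx, Sy):
    # Per-cell interpolation weights of one kernel element located at (px, py),
    # normalised so that they sum to (approximately) 1 over the Sx x Sy kernel.
    i = np.arange(Sx)[:, None]      # row indices
    j = np.arange(Sy)[None, :]      # column indices
    H = sinc_interp(px - i, sx) * sinc_interp(py - j, sy)
    return H / (EPS + H.sum())

H = interpolation_weights(px=11.4, py=11.4, sx=1.0, sy=1.0, Sx=23, Sy=23)
print(H.shape, round(float(H.sum()), 6))    # (23, 23), sums to ~1
```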

3.2. Learning and Optimizing Kernel Element Positions During the Training Process

WSIPC allows convolution kernels to have any number of kernel elements, and the positions of the non-zero elements are not fixed but learned in a gradient-based manner. The kernel is constructed with the Whittaker–Shannon interpolation function, which fits the effective receptive field by ensuring that the distances between elements follow a normal distribution, thereby reducing computational and parameter complexity and improving classification accuracy. The interpolation function computes the values of the kernel elements, and its scaling parameters can be learned through backpropagation to improve convolution performance. To make the interpolation sum to 1 over the kernel, it is divided by the normalization term A, whose ε term prevents division by zero.
Weight decay: Weight decay is a regularization method widely used in deep learning models. Although it benefits generalization, we observed that when applied to the kernel positions in the WSIPC method it tends to concentrate the positions excessively around the kernel center, resulting in poor accuracy. We therefore set the weight decay of the kernel positions to 0 and keep all other settings unchanged.
Position initialization: In the triangle-function (bilinear) case, positions are initialized to 0; the positional parameters of our method are initialized with $\sigma_0 = \pi$ so that the initialization resembles that of the triangle function at the beginning.
Position restriction/overlap: In previous triangle interpolation, kernel elements that reached the expanded kernel size limit were clipped [11]; this was carried out at the end of each batch step to force the kernel positions to remain within the limit. Agglomeration near these limits is sometimes observed, indicating that the expanded kernel size is too small and should be increased. This operation is no longer required when using Whittaker–Shannon interpolation.
Expanded kernel size adjustment: In ConvNeXt, a 7 × 7 convolution kernel was found to be optimal, as larger sizes do not improve accuracy [14]. For our method, however, there appears to be no strict limit on the expanded kernel size: as it increases, accuracy tends to increase logarithmically, and improvements are still observed when the kernel size reaches 51. Increasing the expanded kernel size does not affect the number of trainable parameters, but it does affect throughput; setting the expanded kernel size to 23 therefore trades off accuracy against throughput.
Kernel count adjustment: This hyperparameter is set to the largest integer value that keeps the number of trainable parameters below that of the baseline being compared. Note that each additional element in our method introduces five learnable parameters: the weight, the vertical and horizontal positions, and their respective scaling parameters σ. For simplicity, the same kernel count is applied to all model layers.
Position learning rate scaling: To keep positions and standard deviations consistent, we apply the same learning rate scaling ratio of 5 to both, while the learning rate of the weights remains unchanged (a minimal optimizer setup is sketched at the end of this subsection).
Position synchronization: We share kernel positions and standard deviations among convolutional layers of the same stage that have the same number of parameters, but we do not share weights; the gradients from these blocks are accumulated into the shared position parameters.
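The weight-decay and learning-rate settings above can be expressed as PyTorch optimizer parameter groups. The sketch below is a minimal illustration with hypothetical attribute names (`weight`, `pos`, `sigma`) and an assumed base learning rate; it is not the authors' training code:

```python
import torch

class WSIPCParams(torch.nn.Module):
    # Hypothetical container: one weight, one (x, y) position and one
    # (x, y) scaling parameter per kernel element (kernel count m).
    def __init__(self, m):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(m))
        self.pos = torch.nn.Parameter(torch.zeros(2, m))       # positions start at 0
        self.sigma = torch.nn.Parameter(torch.ones(2, m))

module = WSIPCParams(m=26)
base_lr = 4e-3                                                 # assumed value
optimizer = torch.optim.AdamW([
    {"params": [module.weight], "lr": base_lr},                # weights: unscaled lr, default decay
    {"params": [module.pos, module.sigma], "lr": 5 * base_lr,  # positions and sigmas: 5x lr,
     "weight_decay": 0.0},                                     # no weight decay on positions
], lr=base_lr, weight_decay=0.05)
```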

3.3. Kernel Construction Algorithm

Algorithm 1 below gives pseudocode for the kernel construction used in WSIPC, where S denotes the interpolation function.
Algorithm 1 Whittaker–Shannon Interpolation Kernel Construction
Require: $w, p^x, \sigma^x, p^y, \sigma^y$: vectors of dimension m
Ensure: $K$: the constructed kernel, of size $S_x \times S_y$
1: $K \leftarrow$ zero tensor of size $(S_x, S_y)$
2: for $k = 0$ to $m - 1$ do
3:   $H \leftarrow 0_{S_x, S_y}$
4:   $p_k^x \leftarrow p_k^x + S_x \,//\, 2$;  $p_k^y \leftarrow p_k^y + S_y \,//\, 2$
5:   $\sigma_k^x \leftarrow \sigma_0 / \sigma_k^x$;  $\sigma_k^y \leftarrow \sigma_0 / \sigma_k^y$
6:   for $i = 0$ to $S_x - 1$ do
7:     for $j = 0$ to $S_y - 1$ do
8:       $H_{i,j} \leftarrow S_{\sigma_k^x}\!\left(p_k^x - i\right) \cdot S_{\sigma_k^y}\!\left(p_k^y - j\right)$
9:     end for
10:   end for
11:   $H \leftarrow H \,/\, \big(\varepsilon + \sum_{i=0}^{S_x-1} \sum_{j=0}^{S_y-1} H_{i,j}\big)$
12:   $K \leftarrow K + H \cdot w_k$
13: end for
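A differentiable PyTorch sketch of Algorithm 1 is given below. It is our own minimal illustration (function and variable names are assumptions; $\sigma_0 = \pi$ is folded into `torch.sinc` and $\varepsilon = 10^{-7}$ as in the text); the resulting dense kernel can then be used in a standard convolution so that the weights, positions, and scaling parameters all receive gradients:

```python
import torch

EPS = 1e-7  # normalisation constant from the text

def construct_kernel(w, px, sx, py, sy, Sx, Sy):
    """Differentiable version of Algorithm 1.

    w, px, sx, py, sy: tensors of shape (m,) holding the weight, position and
    scaling parameter of each kernel element along the two axes.
    Returns a dense (Sx, Sy) kernel.
    """
    i = torch.arange(Sx, dtype=w.dtype).view(1, Sx, 1)        # grid rows
    j = torch.arange(Sy, dtype=w.dtype).view(1, 1, Sy)        # grid columns
    cx = px.view(-1, 1, 1) + Sx // 2                          # recentre positions (line 4)
    cy = py.view(-1, 1, 1) + Sy // 2
    # Whittaker-Shannon interpolation S_{sigma0/sigma}(p - i); torch.sinc
    # already includes the factor pi = sigma_0.
    H = torch.sinc((cx - i) / sx.view(-1, 1, 1)) * torch.sinc((cy - j) / sy.view(-1, 1, 1))
    H = H / (EPS + H.sum(dim=(1, 2), keepdim=True))           # normalisation (line 11)
    return (w.view(-1, 1, 1) * H).sum(dim=0)                  # accumulate K (line 12)

m = 26                                                        # kernel count
w = torch.randn(m, requires_grad=True)
px = torch.zeros(m, requires_grad=True); py = torch.zeros(m, requires_grad=True)
sx = torch.ones(m, requires_grad=True);  sy = torch.ones(m, requires_grad=True)
K = construct_kernel(w, px, sx, py, sy, Sx=23, Sy=23)         # dense (23, 23) kernel;
K.sum().backward()                                            # gradients reach positions and sigmas
```

In practice, the constructed kernel would be expanded to the shape expected by the convolution (for example one such kernel per channel of a depthwise convolution) before being passed to the convolution operation.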

3.4. Impact

Increasing computational complexity may potentially slow down training speed: In the WSIPC method, each kernel element introduces additional learnable parameters such as weights, vertical and horizontal positions, and their respective scaling parameters. More learnable parameters mean that the computational cost of calculating gradients and updating parameters will increase during each training iteration. In the process of backpropagation, it is necessary to calculate the gradients of these newly added parameters, which undoubtedly increases the complexity and time cost of computation compared to traditional convolution operations with fixed kernel positions and may slow down training speed.
The positive impact of the optimization mechanism on training speed: Although learning kernel positions increases parameters and computational complexity, WSIPC utilizes the normal distribution of effective receptive fields to reduce computational and parameter complexity. By fitting the effective receptive field, the model can more efficiently capture key information and reduce unnecessary calculations, which can to some extent offset the increased computational load caused by the position of the learning kernel. The efficiency improvement brought by this optimization can compensate for the additional computational losses, so training speed will not be significantly negatively affected.
The relationship between hyperparameter adjustment and training speed: The article mentions various hyperparameter adjustment strategies, such as kernel count adjustment and kernel expansion size adjustment. The reasonable setting of these hyperparameters has a significant impact on training speed. Configuring the kernel count to an appropriate value can ensure model performance and control the number of learnable parameters, avoiding slow training speed caused by too many parameters. Adjusting the size of the expansion kernel not only affects the accuracy of the model but also has an impact on throughput, which, in turn, affects training speed.
Reduced parameter and computational complexity: WSIPC reduces the model’s parameter and computational complexity by learning the positions of kernel elements and utilizing the normal distribution of effective receptive fields. When constructing a convolutional kernel, the effective receptive field is fitted using the Whittaker–Shannon interpolation function, which enables the kernel to capture image features more efficiently and reduces unnecessary calculations. This means that during inference, the amount of data and computational operations that the model needs to process is reduced, thereby avoiding the introduction of additional overhead.

4. Channel Pruning

4.1. Sparsity-Induced Penalty Weighted by the Factor λ

Channel pruning is performed on the network through channel-level sparsity to achieve network slimming. This section introduces how to effectively identify and prune unimportant channels in a network using the scaling factor in normalization.
Our idea is to introduce a scaling factor γ for each channel, which is multiplied by the output of that channel. Then, we jointly train the network weights and apply sparse regularization to the factor. Finally, we prune the channels with small factors and fine-tune the trimmed network. Specifically, the training objective of our method is given by the following equation:
$$L = \sum_{(x,y)} l\big(f(x, W),\, y\big) + \lambda \sum_{\gamma \in \Gamma} q(\gamma)$$
where $(x, y)$ denotes a training input and its target, $W$ denotes the trainable weights, the first term is the normal CNN training loss, $q(\cdot)$ is the sparsity-induced penalty on the scaling factors, and $\lambda$ balances the two terms.
We choose $q(s) = |s|$, the L1 norm, which is widely used to induce sparsity, and adopt subgradient descent as the optimization method for the non-smooth L1 penalty term. Since pruning a channel essentially removes all of its incoming and outgoing connections, we directly obtain a narrow network (see Figure 3) without using any special sparse computation packages.
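A minimal PyTorch sketch of this training objective is shown below; it simply adds λ·Σ|γ| over all LayerNorm scaling factors to the task loss, letting autograd supply the L1 subgradient. The model and function names are hypothetical placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

def ln_scaling_factors(model):
    # Collect the gamma (weight) tensors of every LayerNorm in the model;
    # these act as the channel-selection proxies described in the text.
    return [m.weight for m in model.modules() if isinstance(m, nn.LayerNorm)]

def training_step(model, x, y, optimizer, lam=1e-5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)                      # normal training loss
    loss = loss + lam * sum(g.abs().sum() for g in ln_scaling_factors(model))
    loss.backward()                                                      # autograd gives the L1 subgradient
    optimizer.step()
    return loss.item()

# Toy usage on a stand-in model (assumed names; any LN-based backbone works the same way).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.LayerNorm(128), nn.Linear(128, 100))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,))
training_step(model, x, y, opt, lam=1e-5)
```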
Using the LN scaling factor as a proxy for channel selection: Layer normalization has been adopted as a standard technique for fast convergence and better generalization, and the way it normalizes activations prompted us to design a simple and effective approach to incorporating channel-wise scaling factors.
Because the scaling factors are jointly optimized with the network weights, the network can automatically identify unimportant channels that can be safely removed without significantly affecting generalization performance. Let X and Y be the input and output of an LN layer; the LN layer performs the following transformation:
$$\hat{X} = \frac{X - E[X]}{\sqrt{\mathrm{Var}[X] + \xi}}; \qquad Y = \gamma \hat{X} + \beta$$
where $E[X]$ and $\mathrm{Var}[X]$ are the mean and variance of the input activations, and $\gamma$ and $\beta$ are trainable affine parameters that can linearly transform the normalized activations back to any scale. The usual practice is to insert an LN layer, with its channel-wise scaling/shifting parameters, after the convolutional layer, so we can directly use the $\gamma$ parameter of the LN layer as the scaling factor required for network slimming. A major advantage is that this introduces no extra network overhead, and it is in fact the effective way to learn scaling factors for channel pruning: (1) if we add a scaling layer to a CNN without an LN layer, the scaling factor values are meaningless for evaluating channel importance, because the convolutional layer and the scaling layer are both linear transformations and the same output can be obtained by shrinking the scaling factors while enlarging the convolution weights; (2) if a scaling layer is inserted before the LN layer, its scaling effect is completely cancelled by the normalization in LN; (3) if a scaling layer is inserted after the LN layer, each channel ends up with two consecutive scaling factors.

4.2. Channel Trimming

Channel pruning and fine-tuning: After training under channel-level sparsity-inducing regularization, we obtain a model in which many scaling factors are close to zero (see Figure 3). We then prune the channels with near-zero scaling factors by removing all of their incoming and outgoing connections and the corresponding weights. Channels are pruned across all layers with a global threshold, defined as a percentile of all scaling factor values; for example, with a 70% percentile threshold, we prune the 70% of channels with the lowest scaling factors. This yields a more compact network with fewer parameters, less runtime memory, and fewer computational operations. When the pruning ratio is high, pruning may temporarily cause some loss of accuracy, but this can largely be recovered by subsequently fine-tuning the pruned network; in our experiments, the fine-tuned narrow network often even achieves higher accuracy than the original unpruned network.
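The global-threshold selection can be sketched as follows (our own illustration with hypothetical names): all LN scaling factors are pooled, a single percentile threshold is taken, and a keep/prune mask is derived per layer. Physically rebuilding the narrower layers and fine-tuning are model specific and omitted here:

```python
import torch
import torch.nn as nn

def channel_masks(model, prune_ratio=0.2):
    # Gather every LN scaling factor, pick a single global threshold at the
    # requested percentile, and mark the channels whose |gamma| falls below it.
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.LayerNorm)])
    threshold = torch.quantile(gammas, prune_ratio)
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.LayerNorm):
            masks[name] = m.weight.detach().abs() > threshold   # True = keep channel
    return masks

# Example: count surviving channels per LN layer; the index lists would be used
# when rebuilding the narrower network before fine-tuning.
model = nn.Sequential(nn.Linear(64, 128), nn.LayerNorm(128), nn.Linear(128, 10))
with torch.no_grad():                      # fake some spread in gamma for the demo;
    model[1].weight.uniform_(0.0, 1.0)     # after sparse training many gammas are near 0
masks = channel_masks(model, prune_ratio=0.2)
kept = {k: int(v.sum()) for k, v in masks.items()}
print(kept)                                # about 80% of the 128 channels survive
```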
Multi-round scheme: We can also extend the proposed method from a single round learning scheme (trained through sparse regularization, pruning, and fine-tuning) to a multi-round scheme. Specifically, the process of network refinement results in a narrow network to which we can apply the entire training process again to learn an even more compact model, as indicated by the dashed line in Figure 4. The experimental results show that this method has better performance in terms of compression ratio.

5. Experiments

5.1. Dataset

CIFAR-100: The CIFAR-100 dataset consists of 32 × 32 natural images drawn from 100 classes; the training and test sets contain 50,000 and 10,000 images, respectively. On CIFAR-100, a validation set of 5000 images was split from the training set to search for λ (in Formula (10)) for each model. CP normalizes the input data with the channel means and standard deviations. After training or fine-tuning on all training images, we report the final classification accuracy.
Street View House Numbers (SVHN): The dataset consists of 32 × 32 color digit images. As usual, we use all 604,388 training images and split off a validation set of 6000 images for model selection during training. The test set contains 26,032 images.
ImageNet 1k: This is a subset of ImageNet, consisting of 1000 categories and corresponding images selected from the vast class and image data of ImageNet. It retains some key features and advantages of ImageNet, while being relatively small in data size and number of categories, making it easier to process and use. The training set has approximately 1.2 million images, the validation set has 50,000 images, and the test set has 100,000 images. The 1000 categories selected from ImageNet have broad representativeness, covering multiple fields such as animals, plants, transportation, daily necessities, and natural scenes. Images have a rich diversity of shooting angles, lighting conditions, backgrounds, and other aspects, which increases the difficulty and challenge of model training and recognition and helps improve the generalization ability of the model.

5.2. Empirical Evaluation of the Effectiveness of WSIPC

Table 1 shows the results of experiments on ResNet-50. The purpose is not to surpass this architecture but to provide a first set of experimental evidence relating the WSIPC method to non-separable convolution in one of the most popular CNN architectures. With standard dilated convolutions, the results only get worse as the dilation rate increases. With standard convolution, increasing the kernel size from 3 × 3 to 7 × 7 improves accuracy, but at the cost of lower throughput and a threefold increase in the number of parameters. Using fewer parameters, the ResNet-50-WSIPC model exceeds the baseline, though at the expense of throughput because ResNet-50 uses non-separable convolutions. As Table 2 shows, for the ConvNeXt model this additional throughput cost is minimal thanks to its reliance on separable convolutions.
We replaced the depthwise separable convolutions in ConvNeXt with WSIPC convolutions with a kernel size of 17 × 17 and a kernel count of 26. From Table 2, ConvNeXt with WSIPC convolution consistently outperforms ConvNeXt in accuracy, with a gain of 0.6 over the baseline, the same number of parameters, and a low throughput cost. The improvement is notable given that this only replaces the depthwise convolution in the ConvNeXt baseline with WSIPC convolution: the depthwise convolution layers account for only about 1% of the total parameters and 2% of the total FLOPs of the ConvNeXt model. The ConvNeXt model with a standard dilation rate of 2 performs poorly (see ConvNeXt-T-dil2). The SLaK model performs better than WSIPC, but at a higher cost in throughput and parameter count.
Table 3 uses two of the latest convolutional architectures, ConvNeXt and ConvFormer. All depthwise separable convolutions in these architectures are replaced with WSIPC convolutions, and the results are reported in terms of training loss and testing accuracy. If a model has a training loss slightly lower than another model with the same number of parameters, it is likely (but not certain) that its average testing accuracy will be slightly higher. In Table 3, as the convolution kernel size or kernel count of WSIPC on the ConvNeXt baseline gradually increases, the classification accuracy generally shows an upward trend. When the convolution kernel size is 23 and the kernel count is 34, top-five accuracy reaches the best result of 94.920%, and when the convolution kernel size is 23 and the kernel count is 26, top-one accuracy attains the best result of 82.28%. Moreover, the results on ConvFormer are also discussed. There is room for improvement by increasing the size and number of convolution kernels, even though this slightly raises the number of trainable parameters.
The kernel count is configured to the maximum integer value, while still being lower than the baseline to which it is compared in terms of trainable parameters. This is because each additional element in the WSIPC method introduces five learnable parameters. In order to control the complexity and number of parameters of the model, avoid overfitting, and ensure model performance, an appropriate kernel count is selected to make the model competitive in the number of trainable parameters. In ConvNeXt, the traditional convolution kernel size of 7 × 7 is optimal, but for the WSIPC method, there seems to be no strict limit on the size of the expansion kernel, and accuracy often increases with size. By setting the kernel size to 23, a trade-off between accuracy and throughput was achieved. This is because increasing the kernel size for expansion does not affect the number of trainable parameters, but it does affect throughput. Considering the accuracy and running efficiency of the model, it is reasonable to choose 23 as the expansion size.
In the ablation experiment in Figure 5, ResNet-50 is the basic convolutional architecture, and ConvFormer, like ConvNeXt, adopts the efficient convolution method of depthwise separable convolution. When the number of parameters is similar, using the WSIPC model improves top-one accuracy compared to the original model. The top-one accuracy of ResNet-50 is 75.6%, while ResNet-50-WSIPC reaches 76.7%, an increase of 1.1 percentage points; the top-one accuracy of ConvNeXt is 82.1%, while ConvNeXt-WSIPC is 82.3%, an increase of 0.2 percentage points; and the top-one accuracy of ConvFormer is 81.24%, while ConvFormer-WSIPC is 81.84%, an improvement of 0.6 percentage points. This indicates that under the same parameter quantity, introducing WSIPC can effectively improve the classification ability of the model and make the model perform better in recognition tasks. From the perspective of models with different architectures, whether it is ResNet-50, ConvNeXt, or ConvFormer, the addition of WSIPC has improved the accuracy of the models, indicating that the optimization strategy of WSIPC does not depend on specific model architectures and has a certain universality. It can work on multiple models to improve model performance, which is of great significance for selecting suitable models and optimizing them in different scenarios. Different architectures of models can be selected according to actual needs, and their performance can be improved by introducing the WSIPC method.

5.3. Empirical Evaluation of CP Effectiveness

Normal training: We used a weight decay of $10^{-4}$ and the weight initialization introduced in reference [36]. In all of our experiments, we initialized all channel scaling factors to 0.5, which gave higher accuracy for the baseline model.
Training with sparsity: For the CIFAR and SVHN datasets, when training with channel sparsity regularization, the hyperparameter λ that controls the trade-off between the empirical loss and sparsity is determined by a grid search over $10^{-4}$, $10^{-5}$, and $10^{-6}$ on the CIFAR-100 validation set. For ConvNeXt, we chose λ = $10^{-5}$. All other settings were the same as in normal training.
Classification results on CIFAR-100 (Table 4a): ConvNeXt-WSIPC (Baseline) denotes no CP pruning, while 20% Pruned and 50% Pruned denote 20% and 50% CP pruning. Pruning 20% of the channels substantially reduces the test error. At a pruning rate of up to 50%, accuracy retains an advantage of about 1% over models of the same size; however, when the pruning rate exceeds 60%, accuracy drops by about 8% relative to the baseline. Classification results on SVHN (Table 4b), with the same pruning notation: at a 20% pruning rate, accuracy stays within 0.5% of the baseline; at a 50% pruning rate, accuracy drops by 1.7% but still holds a 0.4% advantage over models of the same magnitude.
For ConvNeXt-WSIPC (20% Pruned), we obtain a more highly compressed network with a test error of 2.5%, which is 0.6% lower than before pruning, while the parameter count drops by 17.8% and FLOPs drop by 15.1%, giving comparable overall performance. For example, in the sixth row of Table 4a, HRNet-W32 is a lightweight variant of the HRNet series; our 20% pruned network has 0.2 M more parameters than HRNet-W32 and reduces the test error by 0.3%, possibly at the cost of higher FLOPs. In the fifth row of Table 4a, NFNet-F0 is a small variant of the large NFNet network, which has no normalization layers but achieves a similar effect with adaptive activation functions and adjusted convolution operations; our ConvNeXt-WSIPC uses layer normalization and matches NFNet-F0's error rate at a cost of 0.8 M more parameters and 0.5 G more FLOPs. In Table 4b, evaluated on SVHN, ConvNeXt-WSIPC (20% Pruned) has the same parameter count and test error as the lightweight network EfficientNetV3-M but lower FLOPs. In addition, the reduction rate on CIFAR-100 is usually slightly lower than on SVHN, which may be because CIFAR-100 contains more categories. Compared with lightweight networks of the same magnitude, the CP-pruned ConvNeXt-WSIPC network has advantages in certain respects, and the CP method is effective at compressing the network.
In summary, regarding FLOPs and memory usage, the experimental results on CIFAR-100 show that replacing ConvNeXt's depthwise separable convolution with WSIPC convolution keeps the parameter count unchanged (both ConvNeXt and ConvNeXt-WSIPC at 28.6 M) with similar FLOPs (both 4.5 G). After channel pruning (CP), parameter count and FLOPs decrease significantly: ConvNeXt-WSIPC (20% Pruned) drops to 23.8 M parameters and 3.9 G FLOPs, and ConvNeXt-WSIPC (50% Pruned) drops to 18.6 M parameters and 2.9 G FLOPs. This indicates that combining WSIPC with CP effectively reduces FLOPs and memory usage (memory usage correlates with parameter count, so fewer parameters usually means less memory).
This ablation experiment shown in Table 5 focuses on ConvNeXt WSIPC and ConvFormer WSIPC, exploring performance changes in the models under different pruning ratios. The experimental results showed that the accuracy of both models decreased after 20% pruning, with ConvNeXt WSIPC decreasing from 82.23% to 78.13% and ConvFormer WSIPC decreasing from 81.84% to 77.22%, with the latter showing a greater decrease. However, pruning effectively reduced computational complexity, with both FLOPs significantly reduced. ConvNeXt WSIPC decreased from 4.5 G to 3.8 G (a decrease of 15.10%), and ConvFormer WSIPC decreased from 4.5 G to 3.9 G (a decrease of 13.33%). ConvNeXt WSIPC and ConvFormer WSIPC can achieve a good balance between maintaining relatively high accuracy and significantly reducing computational complexity when pruned at 20%.
In Table 6, RS-SP achieves structured pruning of convolutional neural networks through regularized sparsity and ensures high classification accuracy and computational efficiency by constraining the control of sparsity during the pruning process. DSP is a dynamic structured pruning algorithm that optimizes the network structure by dynamically selecting the layers and channels that need pruning, enabling the network to maintain high computational efficiency in real-time applications. The joint strategy of structured pruning and knowledge distillation in SPKD reduces the computational complexity of the model while maintaining high classification accuracy through distillation. ASP is an adaptive structured pruning method that achieves more efficient utilization of computing resources and lower network complexity by adaptively adjusting the pruning rate. ConvFormer CP and ConvNeXt CP achieved optimal performance through 20% pruning using the CP pruning method. The above table shows a comparison between advanced pruning techniques such as RS-SP, DSP, SPKD, ASP, ConvFormer CP, and ConvNeXt CP, considering three key dimensions: top-one accuracy, number of parameters, and FLOPs. The results showed that in terms of classification accuracy, ConvFormer CP (77.22%) and ConvNeXt CP (78.13%) were significantly better than RS-SP (76.2%), SPKD (75.8%), and ASP (76.5%), and also higher than DSP (77.1%). In terms of parameter count, ConvFormer CP (24.2 M) is the same as RS-SP, slightly higher than ASP (21.4 M), but lower than DSP (24.5 M), while ConvNeXt CP (23.5 M) is lower than most models, demonstrating a stronger ability to reduce parameters and avoid overfitting. The FLOPs values of each model are similar, and ConvNeXt CP achieved the highest accuracy with a lower FLOPs of 3.8 G, demonstrating its computational efficiency advantage.

6. Analysis

Fault situation analysis: Multiple fault situations may occur during model training and pruning. When the pruning ratio is too high, the model may excessively prune important channels, resulting in a significant loss of information and a significant decrease in classification accuracy. When using the channel-pruning (CP) method, if the global threshold set is unreasonable, it may accidentally delete channels that are crucial to the performance of the model, making it difficult for the model to recognize certain complex samples or specific categories. When weight decay is applied to kernel positions in the WSIPC method, it can lead to excessive concentration of kernel positions around the center, affecting the model’s ability to capture features and ultimately reducing classification accuracy.
Class-specific accuracy: The WSIPC method utilizes the normal distribution of effective receptive fields to reduce computational and parameter complexity and improve classification performance. For some categories with concentrated feature distributions that conform to normal distribution characteristics, WSIPC may be able to capture their features more accurately, thereby improving the classification accuracy of these categories. When dealing with image categories with obvious central features, by fitting effective receptive fields, the model can better focus on key regions and enhance its recognition ability for these categories. The CP method automatically identifies and prunes unimportant channels, reducing the burden on the model and allowing it to focus more on learning important features. For categories with obvious features and significant contributions to the overall performance of the model, the CP method may retain more channels related to them, thereby improving the classification accuracy of these categories.
The impact of pruning on different categories: Due to the differences in sample distribution and feature complexity among different categories in the dataset, the impact of pruning on different categories is also not the same. For categories with a small sample size, pruning may be more prone to information loss because these categories themselves contain limited feature information. Once important channels are mistakenly deleted, the model’s recognition ability for these categories will be seriously affected, and classification accuracy may significantly decrease. For categories with complex features, pruning may make it difficult for the model to learn a complete feature representation, thereby reducing classification accuracy. For categories with a large sample size and relatively simple features, the impact of pruning may be relatively small, because even if some channels are pruned, the model may still accurately identify these categories based on the remaining information.
Confidence interval and average accuracy value: With respect to the CIFAR100 dataset, experiments were conducted using the ConvNeXt WSIPC model. After 50 repeated experiments, the average accuracy was 82.0%, the standard deviation was 1.5%, and the 95% confidence interval was (81.5%, 82.5%).

7. Conclusions

In order to address the sharp increase in parameter count caused by large convolution kernels and the inability of the fixed grid structure of traditional convolution to flexibly adjust the receptive field, this paper proposes a new convolution algorithm, WSIPC, together with a channel-pruning (CP) method to compress the resulting large convolutional neural network. Experiments verify that the method significantly reduces network computation cost while maintaining comparable classification performance, reduces model size and computational complexity, introduces minimal training overhead, and yields a model that can run inference efficiently without specialized libraries or hardware. However, the classification accuracy of the compressed network decreases, and improving it remains a key challenge for future work.
The WSIPC method has potential value in monitoring video analysis in the field of intelligent security, vehicle visual perception in the autonomous driving industry, image recognition and AR/VR applications for mobile devices, etc. It can improve processing efficiency and user experience in various scenarios with its low computational cost and high classification performance. For example, in surveillance video analysis, real-time processing of large amounts of image data is required to identify target objects. WSIPC can improve receptive field flexibility and classification accuracy at low computational costs, quickly and accurately detecting abnormal behavior, identifying personnel and vehicles, and providing strong support for safety monitoring. The intelligent video surveillance system can efficiently analyze surveillance footage using WSIPC, detect suspicious activities in a timely manner, and issue alerts to improve security efficiency.
In the future, WSIPC can be improved by further optimizing the kernel structure, exploring more complex adaptive designs to enhance the model’s adaptability to complex scenes, and integrating advanced technologies such as attention mechanisms and generative adversarial networks to enhance model performance.

Author Contributions

Conceptualization, methodology, validation, and writing-manuscript preparation, X.J.; writing-review and editing, visualization and project management, C.Y.; Formal analysis, investigation, resource, data management and supervision, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Tianjin Enterprise Science and Technology Commissioner Project, Technology Innovation Guidance Special Fund (Fund) (24YDTPJC00410).

Data Availability Statement

The original contributions proposed in this study are included in the article. The flowchart and pseudocode provided in this article can be replicated using Python language for the proposed method. For further inquiries, please contact the author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WSIPC: Learnable kernel element position convolution utilizing Whittaker–Shannon interpolation
CP: Channel pruning
LN: Layer normalization
DC: Dilated convolution
DCLS: Learnable spacing convolution

References

  1. Araujo, A.; Norris, W.; Sim, J. Computing Receptive Fields of Convolutional Neural Networks. Distill 2019, 4, e21. [Google Scholar] [CrossRef]
  2. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar] [CrossRef]
  3. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. arXiv 2022, arXiv:2203.06717. [Google Scholar] [CrossRef]
  4. Ding, X.; Hao, T.; Tan, J.; Liu, J.; Han, J.; Guo, Y.; Ding, G. ResRep: Lossless CNN Pruning via Decoupling Remembering and Forgetting. In Proceedings of the International Conference on Computer Vision, Nashville, TN, USA, 19–25 June 2021. [Google Scholar] [CrossRef]
  5. Gao, S.; Huang, F.; Cai, W.; Huang, H. Network Pruning via Performance Maximization. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021. [Google Scholar] [CrossRef]
  6. Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. DepGraph: Towards Any Structural Pruning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar]
  7. Meng, F.; Cheng, H.; Li, K.; Luo, H.; Guo, X.; Lu, G.; Sun, X. Pruning Filter in Filter. arXiv 2020, arXiv:2009.14410. [Google Scholar] [CrossRef]
  8. Park, S.; Lee, J.; Mo, S.; Shin, J. Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning. arXiv 2020, arXiv:2002.04809. [Google Scholar]
  9. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. arXiv 2017, arXiv:1701.04128. [Google Scholar] [CrossRef]
  10. Chen, Q.; Li, C.; Ning, J.; He, K. Gaussian Mask Convolution for Convolutional Neural Networks. arXiv 2023, arXiv:2302.04544. [Google Scholar]
  11. Khalfaoui-Hassani, I.; Pellegrini, T.; Masquelier, T. Dilated Convolution with Learnable Spacings: Beyond bilinear interpolation. arXiv 2023, arXiv:2306.00817. [Google Scholar]
  12. Kim, B.J.; Choi, H.; Jang, H.; Kim, S.W. Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields. In Proceedings of the British Machine Vision Conference, Aberdeen, UK, 20–24 November 2023. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  14. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S.A. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar] [CrossRef]
  15. Trockman, A.; Zico Kolter, J. Patches Are All You Need? arXiv 2022, arXiv:2201.09792. [Google Scholar] [CrossRef]
  16. Liu, S.; Chen, T.; Chen, X.; Chen, X.; Xiao, Q.; Wu, B.; Kärkkäinen, T.; Pechenizkiy, M.; Mocanu, D.; Wang, Z. More ConvNets in the 2020s: Scaling up Kernels Beyond 51 × 51 using Sparsity. arXiv 2022, arXiv:2207.03620. [Google Scholar]
  17. Megvii. Megengine: A Fast, Scalable and Easy-to-Use Deep Learning Framework. 2020. Available online: https://github.com/MegEngine/MegEngine (accessed on 15 May 2023).
  18. Celarek, A.; Hermosilla, P.; Kerbl, B.; Ropinski, T.; Wimmer, M. Gaussian Mixture Convolution Networks. arXiv 2022, arXiv:2202.09153. [Google Scholar] [CrossRef]
  19. Kim, S.; Park, E. SMPConv: Self-Moving Point Representations for Continuous Convolution. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10289–10299. [Google Scholar]
  20. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and Deformable Convolution for Point Clouds. arXiv 2019, arXiv:1904.08889. [Google Scholar] [CrossRef]
  21. Romero, D.W.; Bruintjes, R.J.; Tomczak, J.M.; Bekkers, E.J.; Hoogendoorn, M.; van Gemert, J.C. FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes. arXiv 2021, arXiv:2110.08059. [Google Scholar] [CrossRef]
  22. Romero, D.W.; Kuzina, A.; Bekkers, E.J.; Tomczak, J.M.; Hoogendoorn, M. CKConv: Continuous Kernel Convolution for Sequential Data. arXiv 2021, arXiv:2102.02611. [Google Scholar] [CrossRef]
  23. Jacobsen, J.H.; Van Gemert, J.; Lou, Z.; Smeulders, A.W. Structured Receptive Fields in CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2610–2619. [Google Scholar]
  24. Pintea, S.L.; Tomen, N.; Goes, S.F.; Loog, M.; van Gemert, J.C. Resolution Learning in Deep Convolutional Networks Using Scale-Space Theory. IEEE Trans. Image Process. 2021, 30, 8342–8353. [Google Scholar] [CrossRef]
  25. Shelhamer, E.; Wang, D.; Darrell, T. Blurring the Line Between Structure and Learning to Optimize and Adapt Receptive Fields. arXiv 2019, arXiv:1904.11487. [Google Scholar] [CrossRef]
  26. Yao, J.; Wang, D.; Hu, H.; Xing, W.; Wang, L. ADCNN: Towards learning adaptive dilation for convolutional neural networks. Pattern Recognit. J. Pattern Recognit. Soc. 2022, 123, 108369. [Google Scholar] [CrossRef]
  27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22 October 2017. [Google Scholar] [CrossRef]
  28. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar] [CrossRef]
  29. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  30. Huang, T.H.; Huang, C.Y.; Ding, C.K.; Hsu, Y.C.; Giles, C.L. CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset. arXiv 2020, arXiv:2005.02367. [Google Scholar] [CrossRef]
  31. Chiliang, Z.; Tao, H.; Yingda, G.; Zuochang, Y. Accelerating Convolutional Neural Networks with Dynamic Channel Pruning. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; p. 563. [Google Scholar] [CrossRef]
  32. Kokol, P.; Kokol, M.; Zagoranski, S. Machine learning on small size samples: A synthetic knowledge synthesis. arXiv 2021, arXiv:2103.01002. [Google Scholar] [CrossRef]
  33. Sellars, P.; Aviles-Rivero, A.I.; Schnlieb, C.B. LaplaceNet: A Hybrid Energy-Neural Model for Deep Semi-Supervised Classification. arXiv 2021, arXiv:2106.04527. [Google Scholar] [CrossRef]
  34. Liebl, H.; Schinz, D.; Sekuboyina, A.; Malagutti, L.; Löffler, M.T.; Bayat, A.; El Husseini, M.; Tetteh, G.; Grau, K.; Niederreiter, E.; et al. A Computed Tomography Vertebral Segmentation Dataset with Anatomical Variations and Multi-Vendor Scanner Data. Sci. Data 2021, 8, 284. [Google Scholar] [CrossRef]
  35. Shannon, C.E. Communication in the Presence of Noise. In Proceedings of the IRE; IEEE: Piscataway, NJ, USA, 1949; Volume 37, pp. 10–21. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the International Conference on Computer Vision, Las Condes, Chile, 11–18 December 2015. [Google Scholar] [CrossRef]
Figure 1. (a) A standard 3 × 3 kernel. (b) A standard 3 × 3 dilated convolution kernel. (c) A WSIPC kernel with nine kernel elements and a dilated kernel size of nine. For readability, all numbers in the figure are rounded.
Figure 2. A one-dimensional view of the symmetric Whittaker–Shannon interpolation function and the trigonometric function, where σ = 1. In the figure, A is the interpolation with σ₀ = π and ε = 1 × 10⁻⁷, the latter added to avoid division by zero. In WSIPC, the positions of kernel elements are randomly initialized and may move within the limits of the dilated kernel size. The effective receptive field is fitted with the Whittaker–Shannon interpolation function (Equations (7) and (8)) so that the distances between elements follow a normal distribution. During kernel construction (the kernel construction algorithm), the interpolation function computes the value of each kernel element from its weight, position, and standard deviation (algorithm step 8), finally yielding a convolutional kernel that meets the requirements.
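As a minimal illustrative sketch of this construction (not the authors' released code), the following PyTorch snippet builds a dilated kernel from learnable element weights and continuous positions using a separable Whittaker–Shannon (sinc-style) interpolation along each axis. The helper names ws_interp and WSIPCKernel, the separable formulation, and the initialization are our own assumptions, with σ and ε following the settings in Figure 2.

```python
import math

import torch
import torch.nn as nn


def ws_interp(delta: torch.Tensor, sigma: float = 1.0, eps: float = 1e-7) -> torch.Tensor:
    # Symmetric Whittaker-Shannon (sinc-style) interpolation weight for the
    # signed offset `delta` between a learnable element position and an
    # integer grid coordinate; `eps` keeps the ratio finite near zero,
    # where the limit of sin(x)/x is 1 (cf. Figure 2).
    x = math.pi * delta / sigma
    return torch.sin(x + eps) / (x + eps)


class WSIPCKernel(nn.Module):
    """Illustrative sketch: build a (channels, K, K) depthwise kernel from
    `n_elements` learnable (weight, position) pairs per channel."""

    def __init__(self, channels: int, n_elements: int, dilated_size: int, sigma: float = 1.0):
        super().__init__()
        self.dilated_size = dilated_size
        self.sigma = sigma
        # Learnable element weights and continuous (y, x) positions,
        # randomly initialized inside the dilated kernel support.
        self.weight = nn.Parameter(0.02 * torch.randn(channels, n_elements))
        self.pos = nn.Parameter(torch.rand(channels, n_elements, 2) * (dilated_size - 1))

    def forward(self) -> torch.Tensor:
        grid = torch.arange(self.dilated_size, dtype=torch.float32, device=self.weight.device)
        # Separable interpolation along y and x: each learnable element
        # spreads its weight over nearby integer grid positions
        # (an approximation of step 8 of the kernel construction algorithm).
        wy = ws_interp(self.pos[..., 0:1] - grid, self.sigma)  # (C, E, K)
        wx = ws_interp(self.pos[..., 1:2] - grid, self.sigma)  # (C, E, K)
        return torch.einsum('ce,cek,cel->ckl', self.weight, wy, wx)  # (C, K, K)


# Example: a 17 x 17 dilated kernel with 26 learnable elements per channel,
# usable as the weight of a depthwise convolution (F.conv2d with groups=C).
kernel = WSIPCKernel(channels=96, n_elements=26, dilated_size=17)()
print(kernel.shape)  # torch.Size([96, 17, 17])
```

Because the interpolation is differentiable with respect to the positions, gradients from the task loss can move the kernel elements during training, which is how the element positions are learned.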
Figure 3. We associate the scaling factor (reused from the normalization layer) with each channel in the convolutional layer. During training, sparse regularization is applied to these scaling factors to automatically identify unimportant channels. Channels with small scaling-factor values (orange, left) are pruned. After pruning, we obtain a compact model (right), which is then fine-tuned to recover comparable accuracy.
Figure 4. Network compression flowchart. The dashed line represents a multi-round scheme.
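A minimal sketch of the channel-selection mechanism in Figures 3 and 4, assuming a PyTorch model whose Layer Normalization scale parameters serve as the per-channel proxies: an L1 penalty pushes unimportant scales toward zero during sparse training, and channels whose scale magnitude falls below a global quantile threshold are then marked for removal before fine-tuning. The helper names and the single global threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def ln_sparsity_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    # L1 penalty on LayerNorm scale factors; added to the task loss during
    # sparse training, e.g. loss = criterion(out, y) + ln_sparsity_penalty(model).
    scales = [m.weight.abs().sum() for m in model.modules() if isinstance(m, nn.LayerNorm)]
    return lam * torch.stack(scales).sum()


@torch.no_grad()
def channel_keep_masks(model: nn.Module, prune_ratio: float = 0.2) -> dict:
    # Boolean keep-mask per LayerNorm: channels whose scale magnitude falls
    # below the global `prune_ratio` quantile are pruned (orange in Figure 3).
    all_scales = torch.cat([m.weight.abs().flatten()
                            for m in model.modules() if isinstance(m, nn.LayerNorm)])
    threshold = torch.quantile(all_scales, prune_ratio)
    return {name: m.weight.abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.LayerNorm)}


# One round of the flow in Figure 4:
#   1. train with ln_sparsity_penalty added to the loss,
#   2. compute channel_keep_masks and build a narrower model from them,
#   3. fine-tune the compact model.
```

Repeating sparse training, mask extraction, and fine-tuning on the already pruned network gives the multi-round scheme indicated by the dashed line in Figure 4.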
Figure 5. Ablation experiments on ImageNet-1k comparing the number of parameters and top-1 accuracy of the original model and the model using WSIPC convolution. The results indicate that WSIPC offers advantages in improving model performance.
Table 1. Classification accuracy using ResNet-50 on CIFAR-100. Throughput is measured during inference on an A100-40GB GPU with 224 × 224 cropped images. When the model includes WSIPC convolution, we report the dilated kernel size and the number of kernel elements; otherwise, we report the kernel size, and the element count is the square of that size.
Model | Kernel Size/Count | Dil | Param | Throughput (Image/s) | Top-1 acc.
ResNet-50 | 3/9 | 1 | 25.6 M | 1021.9 | 75.8
ResNet-50 | 7/49 | 2 | 75.9 M | 642.6 | 77.0
ResNet-50 | 3/9 | 2 | 25.6 M | 931.8 | 71.7
ResNet-50 | 3/9 | 3 | 25.6 M | 943.4 | 70.1
ResNet-50-WSIPC | 7/5 | – | 24.0 M | 637.7 | 76.8
ResNet-50-WSIPC | 7/6 | – | 26.0 M | 637.9 | 76.8
Table 2. Classification accuracy on CIFAR-100. Inference throughput is measured on an A100-40GB GPU, taking into account all optimizations used by Liu et al. [14]. For the SLaK model, we report the parameters and effective FLOPs returned by PyTorch 1.10.0, as well as the parameters and effective FLOPs reported by Liu et al. [14].
Model | Kernel Size/Count | Param | FLOPs | Throughput (Image/s) | Top-1 acc.
Swin | – | 28 M | 4.5 G | 757.9 | 81.3
ConvNeXt | 7 × 7/49 | 29 M | 4.5 G | 774.7 | 81.5
ConvNeXt-dil2 | 7 × 7/49 | 29 M | 4.5 G | 773.6 | 80.8
ResNet-50-GMConv | 7 × 7/49 | 26 M | 4.2 G | 652.4 | 76.4
SLaK | 51 × 51 | 30 M | 5 G | 583.5 | 82.5
PeLK | – | 29 M | 5.6 G | 623.5 | 82.6
ConvNeXt-WSIPC | 17 × 17/26 | 28 M | 4.5 G | 725.3 | 81.9
Table 3. Classification accuracy and training loss on CIFAR-100. We adopt two recent convolutional architectures, ConvNeXt and ConvFormer, and replace all depthwise separable convolutions with WSIPC convolutions.
Model @224 | Kernel Size/Count | Interpolation | Param | Train Loss | Top-5 acc. | Top-1 acc.
ConvNeXt | 17 × 17/26 | Whittaker–Shannon | 28.59 M | 0.888 | 94.770 | 81.90
ConvNeXt | 23 × 23/26 | Whittaker–Shannon | 28.59 M | 0.874 | 94.900 | 82.28
ConvNeXt | 23 × 23/34 | Whittaker–Shannon | 28.69 M | 0.875 | 94.920 | 81.80
ConvFormer | 17 × 17/26 | Whittaker–Shannon | 28.76 M | 0.900 | 94.670 | 81.54
ConvFormer | 23 × 23/26 | Whittaker–Shannon | 28.76 M | 0.894 | 94.790 | 81.60
ConvFormer | 23 × 23/34 | Whittaker–Shannon | 28.86 M | 0.896 | 94.870 | 81.84
Table 4. Results on the CIFAR-100 and SVHN datasets. “Baseline” denotes normal training without sparse regularization, and “20% Pruned” in the first column denotes a fine-tuned model obtained by removing 20% of the channels from a sparsely trained model. The pruned ratios of parameters and FLOPs are shown in columns 4 and 6, respectively. Pruning a moderate fraction of channels, such as 20%, greatly reduces parameters and FLOPs with only a modest increase in test error, and we find that when the proportion of pruned channels is at most 50%, model accuracy can generally be maintained.
(a) Test Errors in CIFAR-100
Model | Test Error (%) | Param | Pruned | FLOPs | Pruned
ConvNeXt-WSIPC (Baseline) | 17.7 | 28.6 M | – | 4.5 G | –
ConvNeXt-WSIPC (20% Pruned) | 20.8 | 23.8 M | 16.70% | 3.9 G | 13.60%
ConvNeXt-WSIPC (50% Pruned) | 23 | 18.6 M | 41.95% | 2.9 G | 36.10%
ConvNeXt-WSIPC (60% Pruned) | 25.2 | 13.1 M | 54.20% | 2.3 G | 48.89%
ResNeSt-50 | 21.5 | 22 M | – | 3.5 G | –
NFNet-F0 | 20.8 | 23 M | – | 3.4 G | –
HRNet-W32 | 21.2 | 24 M | – | 3.6 G | –
MobileNetV4 | 22.7 | 18 M | – | 3.2 G | –
EfficientNetV2-S | 22 | 19 M | – | 3.1 G | –
ResNet-18-TS | 24.2 | 19 M | – | – | –

(b) Test Errors in SVHN
Model | Test Error (%) | Param | Pruned | FLOPs | Pruned
ConvNeXt-WSIPC (Baseline) | 1.9 | 28.6 M | – | 4.5 G | –
ConvNeXt-WSIPC (20% Pruned) | 2.5 | 23.5 M | 17.80% | 3.8 G | 15.10%
ConvNeXt-WSIPC (50% Pruned) | 3.6 | 16.1 M | 43.46% | 2.7 G | 38.70%
ConvNeXt-WSIPC (60% Pruned) | 5.4 | 11.4 M | 60.1% | 1.9 G | 57.8%
EfficientNetV3-M | 2.5 | 23.5 M | – | 3.9 G | –
RegNetY-16GF | 3 | 24.2 M | – | 3.8 G | –
EfficientNetV3-S | 2.8 | 16.1 M | – | 3.1 G | –
GhostNetV2 | 4 | 15.9 M | – | 3 G | –
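As a quick consistency check of how the pruned ratios in columns 4 and 6 are obtained (our own back-calculation, not taken from the paper), the 20%-pruned ConvNeXt-WSIPC on CIFAR-100 gives, up to rounding of the reported parameter and FLOP counts:

```latex
1 - \frac{23.8\,\mathrm{M}}{28.6\,\mathrm{M}} \approx 0.168 \quad (\text{reported } 16.70\%),
\qquad
1 - \frac{3.9\,\mathrm{G}}{4.5\,\mathrm{G}} \approx 0.133 \quad (\text{reported } 13.60\%).
```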
Table 5. Ablation experiments on ImageNet-1k. “Baseline” indicates no CP pruning; “20% Pruned” indicates CP pruning with a pruning rate of 20%.
Model | Top-1 acc. (%) | Param | Pruned | FLOPs | Pruned
ConvNeXt-WSIPC (Baseline) | 82.23 | 28.6 M | – | 4.5 G | –
ConvNeXt-WSIPC (20% Pruned) | 78.13 | 23.5 M | 17.80% | 3.8 G | 15.10%
ConvFormer-WSIPC (Baseline) | 81.84 | 29 M | – | 4.5 G | –
ConvFormer-WSIPC (20% Pruned) | 77.22 | 24.2 M | 16.55% | 3.9 G | 13.33%
Table 6. Comparison of the CP pruning technique with other state-of-the-art pruning techniques on ImageNet-1k.
Model | Top-1 acc. (%) | Param | FLOPs
RS-SP | 76.2 | 24.2 M | 3.8 G
DSP | 77.1 | 24.5 M | 3.85 G
SPKD | 75.8 | 23.8 M | 3.9 G
ASP | 76.5 | 21.4 M | 3.8 G
ConvFormer-CP | 77.22 | 24.2 M | 3.9 G
ConvNeXt-CP | 78.13 | 23.5 M | 3.8 G
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
