Article

Towards Super Compressed Neural Networks for Object Identification: Quantized Low-Rank Tensor Decomposition with Self-Attention

Baichen Liu, Dongwei Wang, Qi Lv, Zhi Han and Yandong Tang
1 State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(7), 1330; https://doi.org/10.3390/electronics13071330
Submission received: 29 February 2024 / Revised: 26 March 2024 / Accepted: 29 March 2024 / Published: 2 April 2024
(This article belongs to the Special Issue Deep Learning-Based Image Restoration and Object Identification)

Abstract

Deep convolutional neural networks have a large number of parameters and require a significant number of floating-point operations during computation, which limits their deployment in situations where the storage space is limited and computational resources are insufficient, such as in mobile phones and small robots. Many network compression methods have been proposed to address the aforementioned issues, including pruning, low-rank decomposition, quantization, etc. However, these methods typically fail to achieve a significant compression ratio in terms of the parameter count, and even when high compression rates are achieved, the network's performance is often significantly degraded, making it difficult to perform tasks effectively. In this study, we propose a more compact representation for neural networks, named Quantized Low-Rank Tensor Decomposition (QLTD), to super compress deep convolutional neural networks. Firstly, we employed low-rank Tucker decomposition to compress the pre-trained weights. Subsequently, to further exploit redundancies within the core tensor and factor matrices obtained through Tucker decomposition, we employed vector quantization to partition and cluster the weights. Simultaneously, we introduced a self-attention module for each core tensor and factor matrix to enhance the training responsiveness in critical regions. The object identification results in the CIFAR-10 experiment showed that QLTD achieved a compression ratio of 35.43× with less than a 1% loss in accuracy, and a compression ratio of 90.61× with less than a 2% loss in accuracy. QLTD was able to achieve a significant compression ratio in terms of the parameter count and realize a good balance between compressing parameters and maintaining identification accuracy.

1. Introduction

Deep convolutional neural networks (DCNNs) have achieved great success in many computer vision tasks, such as object identification [1,2,3], object detection [4,5,6] and image segmentation [7,8,9]. The design of DCNNs is becoming increasingly intricate, and their performance improves accordingly. However, this progress comes at the cost of escalating model complexity and, consequently, heightened storage requirements: while we strive for a more powerful performance, we must also manage ever larger models. Hence, many methods for compressing DCNNs have been proposed to address this issue and obtain light-weight networks with significantly fewer parameters, including pruning methods, low-rank tensor or matrix decomposition methods, and quantization methods.
Network pruning methods involve trimming away less important weight connections within a model, reducing the network’s complexity, and enhancing the model’s inference speed. Pruning at the layer, feature map, and kernel levels is called structured pruning [10,11], while weight pruning within the kernel itself is unstructured [12,13,14]. The primary advantage of weight pruning lies in the fact that weights are the smallest, most fundamental elements in a network. Therefore, a substantial pruning of these connections can be performed without significantly impacting the network performance. Additionally, widely-used deep learning frameworks such as PyTorch allow easy access to all parameters of a network, making the implementation of this method straightforward.
Pre-trained models often exhibit low-rank characteristics in their convolutional kernels, making it possible to leverage low-rank decomposition methods for model compression. In low-rank tensor or matrix decomposition methods, the weights of convolutional or fully connected layers are compressed via a low-rank tensor or matrix decomposition. By employing such approaches, the parameter count of DCNNs is substantially reduced, leading to a significant improvement in their inference speed.
Network quantization is a method of saving network storage space by reducing the number of bits required to store weights. Quantization methods are typically categorized into low-bit representation and weight-sharing methods. Weight-sharing quantization methods primarily employ clustering techniques: the weights are grouped into several clusters, and all weights within a cluster are represented by that cluster's centroid. During an inference calculation, it is only necessary to look up the index of the centroid to obtain the corresponding weight value.
Even though these DCNN compression methods reduce the parameter count and accelerate the inference speed of networks, they face a significant decline in accuracy when extremely large parameter compression ratios of over 50× are required. In this study, we propose a Quantized Low-Rank Tensor Decomposition (QLTD) method with self-attention to achieve an extremely large compression rate with only a minimal loss in accuracy. The main scheme of our QLTD method is shown in Figure 1. As pre-trained weights in DCNNs often contain a significant amount of redundant information [15], we first conducted a Tucker-2 decomposition on the pre-trained weights of the convolutional layers. Meanwhile, we introduced a self-attention module for each core tensor and factor matrix obtained through the Tucker-2 decomposition to identify and focus on the key positions, which mitigated the performance losses arising from the permutation and quantization. The self-attention module was a trainable and sparse convolutional layer. Then, to further mine the redundancy in the core tensor and factor matrices, we applied a permutation and quantization approach [16] using concepts from rate-distortion theory. Finally, the compressed network was fine-tuned for several epochs to recover the accuracy. The identification results in the CIFAR-10 experiment showed that our approach achieved a compression ratio of 35.43× with less than a 1% loss in accuracy and a compression ratio of 90.61× with less than a 2% loss in accuracy. Additionally, depending on the requirements, our method can achieve a parameter compression ratio of up to 200 times.
Our contributions can be summarized as follows:
(i) We propose a QLTD method with self-attention that is easy to integrate into DCNNs and compresses them to an extreme degree. Our method incurs minimal losses in the identification accuracy at parameter compression ratios below 100 times and can achieve a parameter compression ratio of up to 200 times if needed.
(ii) Our QLTD framework unifies network pruning, low-rank decomposition and quantization compression methods to fully leverage their advantages. The framework compresses DCNNs by sequentially performing Tucker decomposition, permutation, and quantization. Meanwhile, a sparse self-attention layer is designed to identify and focus on the key positions of each core tensor and factor matrix obtained through the Tucker decomposition.

2. Related Work

Here, we review the related work on DCNN compression methods, including network pruning methods, low-rank decomposition methods and network quantization methods.
Network pruning. Structured pruning techniques focus on zeroing out organized groups of convolutional kernels at the layer, feature map, and kernel levels. Fang et al. [17] proposed a universal automated method named DepGraph for the structural pruning of neural network architectures. This method explicitly modeled two types of dependencies (inter-layer and intra-layer), constructed a dependency graph, and introduced general structural pruning through sparse training that constrained the parameter grouping, making the scheme applicable across architectures. Despite compressing the parameter count and accelerating the inference speed, these approaches lead to a decline in the identification accuracy.
Unstructured pruning methods involve deactivating connections associated with small weights or applying sparsity regularization to the weights. DEEP-R [18] takes a Bayesian perspective and performs sampling for pruning and regrowth decisions. Sparse evolutionary training (SET) [19] simplifies pruning-regrowth cycles by pruning the smallest and most negative weights and growing new weights in random locations. Dynamic Sparse Reparameterization (DSR) [20] utilizes a pruning–redistribution–regrowth cycle and addresses the limitations of previous techniques, such as their high computation costs and the use of manual configuration for the number of free parameters allocated to each layer. Sparse networks from scratch [21] are more general and use a fully sparse setting. Soft threshold weight reparameterization (STR) [22] smoothly induces the sparsity while learning pruning thresholds. The authors of [23] proposed a novel adversarial training method called inverse weight inheritance, which imposed sparse weight distribution on a large network by inheriting weights from a small network, thereby improving the robustness of the large network. However, these methods are not able to realize an extremely large parameter compression ratio.
Low-rank decomposition. Low-rank matrix and tensor decomposition methods can be directly applied to compress DCNNs. Canonical polyadic (CP) decomposition [24] was proposed to compress the number of parameters and speed up networks. Tucker decomposition [25] was proposed to compress convolutional layers for fast and low-power mobile applications. Tensor-train (TT) decomposition [26] was proposed and found to be effective in solving dense connection problems to avoid the curse of dimensionality. Tensor-ring (TR) decomposition [27] can be viewed as a linear combination of TT decompositions, in which the cyclic permutation invariance of the latent cores is maintained through the use of tracing operations and the equitable treatment of the latent cores. These methods directly apply a low-rank decomposition to the weights of DCNNs. CP and Tucker decomposition cannot achieve large compression ratios, and while TT and TR decomposition achieve large compression ratios of over 10×, they result in a significant decrease in accuracy.
Some other approaches have made improvements to low-rank decomposition methods. Jaderberg et al. [28] proposed a linear combination of a smaller basis set of 2D separable filters to approximate a 2D filter set to speed up the evaluation of convolutional neural networks. The authors of [29] proposed the automatic selection of ranks in a recent study of tensor-ring decomposition in each convolutional layer, which was inspired by reinforcement learning. The authors of [30] derived filter pruning and low-rank decomposition by simply changing the way in which the sparsity regularization was enforced. The authors of [31] proposed a systematic framework for the tensor–decomposition-based model compression using the Alternating Direction Method of Multipliers (ADMM). The Trained Rank Pruning (TRP) [32] alternated between the low-rank approximation and training, and the low-rank approximation and regularization were integrated into the training process. The authors of [33] showed that, with a suitable formulation, determining the optimal rank of each layer was amenable to a mixed discrete continuous optimization jointly over the ranks and matrix elements. The authors of [34] proposed a novel compact design of convolutional layers with a spatial transformation towards a lower-rank representation, and they applied trainable spatial transformations to low-rank convolutional kernels in a predefined Tucker product form to enhance the versatility of convolutional kernels. These approaches can improve the identification accuracy of a compressed network to some extent. However, they cannot achieve very large compression ratios of over 30×.
The original intention of low-rank decomposition methods was to accelerate network inference and reduce network storage. The common issue with low-rank decomposition is the difficulty of achieving high compression ratios of over 30×. Our method is based on the Tucker decomposition. Beyond the fact that the Tucker decomposition alone cannot achieve very high compression ratios, the more important observation is that the core tensor and factor matrices obtained through the Tucker decomposition still contain redundant information, which can be further exploited through permutation and quantization methods. Through our QLTD method, network weights achieve a more fundamentally lightweight representation and a high parameter compression ratio of over 30×. The experiment in Section 4.4 "Ablation Study" validates that our QLTD method realizes a better balance between the parameter compression ratio and identification accuracy.
Network quantization. Quantization methods based on weight sharing primarily aggregate different weights into several clusters through techniques such as clustering, and the centroid of each cluster is used to represent the values of all weights within that cluster. During the inference calculation, one only needs to look up the index of the centroid to obtain the corresponding weight value. Stock et al. [35] proposed a vector quantization method based on product quantization (PQ), focusing on the importance of activation values rather than weights. They combined this with knowledge distillation and minimized the reconstruction error between teacher and student networks to achieve compression. The “Permute, Quantize, and Fine-Tune” method [16] establishes a connection with a rate-distortion theory to search for permutations that make a network more amenable to compression. Through a quantization algorithm that was subjected to annealing, this method achieved a high network compression ratio and identification accuracy. The authors of [36] introduced an adaptive method for a data-free quantization (AdaDFQ), and they treated the training process of a generative network in generator-based data-free quantization (GDFQ) as a zero-sum game. This method optimized the adaptability of the generative network by constructing boundaries between inconsistent and consistent samples. The authors of [37] addressed the problem of the module recovery loss oscillation in post-training quantization (PTQ) methods. They proposed a solution by introducing the concept of module topological homogeneity to optimize modules with significantly different capacities, thereby reducing the recovery loss of quantized networks. The authors of [38] presented a zero-shot quantization (ZSQ) method called HAST. It enhanced the difficulty of matching synthetic samples for a quantization model by elevating the difficulty of generating pseudo-samples. This approach mitigated the issue of synthetic samples being easy to fit and ensured a similarity between the quantization model and the full-precision model through feature alignment. While these methods achieve large compression ratios, they often come with a substantial decrease in accuracy.

3. Quantized Low-Rank Tensor Decomposition with Self-Attention

3.1. Preliminary Notations for Low-Rank Tensor Decomposition

A tensor is a multi-dimensional array, and a d-order tensor represents a d-dimensional multi-way array. Scalars, vectors and matrices are zero-order, one-order and two-order tensors, respectively. In this study, we denote scalars, vectors and matrices with lowercase letters $(x, y, z, \ldots)$, bold lowercase letters $(\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots)$ and uppercase letters $(X, Y, Z, \ldots)$, respectively. An N-order tensor $(N \geq 3)$ is denoted as $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and each element is denoted as $x_{i_1, i_2, \ldots, i_N}$.
The mode-n matrix form of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is obtained by reshaping the tensor into a matrix $X_{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$. For a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, its Tucker decomposition is defined as:
$$\mathcal{X} = \sum_{s_1=1}^{S_1} \cdots \sum_{s_N=1}^{S_N} g_{s_1 s_2 \cdots s_N} \bigl( \mathbf{u}_{s_1}^{(1)} \circ \mathbf{u}_{s_2}^{(2)} \circ \cdots \circ \mathbf{u}_{s_N}^{(N)} \bigr) = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \cdots \times_N U^{(N)}, \qquad (1)$$
where $\mathcal{G} \in \mathbb{R}^{S_1 \times S_2 \times \cdots \times S_N}$ denotes the core tensor and $U^{(n)} = [\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \ldots, \mathbf{u}_{S_n}^{(n)}] \in \mathbb{R}^{I_n \times S_n}$ denotes a factor matrix.
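To make the notation concrete, the following minimal NumPy sketch (our own illustrative code, not part of the original paper) implements the mode-n matricization and rebuilds a small tensor from a core tensor and factor matrices via mode-n products, as in Equation (1):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: reshape an N-way tensor into a matrix of shape (I_n, product of the rest)."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1))

def mode_n_product(tensor, matrix, mode):
    """Mode-n product G x_n U, where matrix has shape (I_n, S_n)."""
    result = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(result, 0, mode)

# Rebuild X = G x_1 U^(1) x_2 U^(2) x_3 U^(3) for a toy 3-order tensor.
core_sizes, full_sizes = (2, 3, 4), (5, 6, 7)
G = np.random.randn(*core_sizes)
U = [np.random.randn(i, s) for i, s in zip(full_sizes, core_sizes)]
X = G
for n, U_n in enumerate(U):
    X = mode_n_product(X, U_n, n)
print(X.shape)              # (5, 6, 7)
print(unfold(X, 1).shape)   # (6, 35) = (I_2, I_1 * I_3)
```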

3.2. Low-Rank Decomposition on Pre-Trained Weights

A convolution operation maps an input tensor $\mathcal{X} \in \mathbb{R}^{H \times W \times C_{in}}$ to an output tensor $\mathcal{Y} \in \mathbb{R}^{H' \times W' \times C_{out}}$ with the following equation:
$$\mathcal{Y}_{h', w', o} = \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{i=1}^{C_{in}} \mathcal{W}_{h - h' + 1,\, w - w' + 1,\, i,\, o}\, \mathcal{X}_{h, w, i}, \qquad (2)$$
where $H \times W$ are the height and width of the input feature map, $C_{in}$ is the number of channels of the input feature map, $H' \times W'$ are the height and width of the output feature map, $C_{out}$ is the number of channels of the output feature map, and $\mathcal{Y}_{h', w', o}$ denotes a voxel of $\mathcal{Y}$.
$\mathcal{W} \in \mathbb{R}^{d \times d \times C_{in} \times C_{out}}$ is a four-order tensor, where $d \times d$ represents the filter width and height, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. It can be decomposed via the Tucker decomposition [39] in Equation (1), as follows:
$$\mathcal{W}_{i,j,s,t} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} \mathcal{G}_{r_1, r_2, r_3, r_4}\, U^{(1)}_{i, r_1}\, U^{(2)}_{j, r_2}\, U^{(3)}_{s, r_3}\, U^{(4)}_{t, r_4}, \qquad (3)$$
where $\mathcal{W}_{i,j,s,t}$ denotes a voxel of $\mathcal{W}$, $\mathcal{G}$ is the core tensor of size $(R_1 \times R_2 \times R_3 \times R_4)$, and $U^{(1)}$, $U^{(2)}$, $U^{(3)}$ and $U^{(4)}$ are factor matrices of sizes $(d \times R_1)$, $(d \times R_2)$, $(C_{in} \times R_3)$ and $(C_{out} \times R_4)$, respectively.
As the kernel size $(d \times d)$ of a DCNN is usually $(3 \times 3)$ or $(1 \times 1)$, which is too small to decompose, we only decompose the input- and output-channel modes. Equation (3) is then reformulated into a variant form called the Tucker-2 decomposition, as follows:
$$\mathcal{W}_{i,j,s,t} = \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} \mathcal{G}_{i, j, r_3, r_4}\, U^{(3)}_{s, r_3}\, U^{(4)}_{t, r_4}. \qquad (4)$$
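As a concrete illustration, the Tucker-2 factors in Equation (4) can be initialized from a pre-trained kernel with truncated SVDs of the two channel-mode unfoldings. The sketch below is a simple HOSVD-style initialization under our own naming; it is not the exact decomposition routine used in the paper, and a tensor library would typically refine it further (e.g., with alternating least squares):

```python
import numpy as np

def tucker2(W, ranks):
    """HOSVD-style Tucker-2 initialization of Equation (4).
    W: kernel of shape (d, d, C_in, C_out); ranks = (R3, R4)."""
    d1, d2, C_in, C_out = W.shape
    R3, R4 = ranks
    # Leading left singular vectors of the mode-3 and mode-4 unfoldings give the channel factors.
    U3 = np.linalg.svd(np.moveaxis(W, 2, 0).reshape(C_in, -1), full_matrices=False)[0][:, :R3]
    U4 = np.linalg.svd(np.moveaxis(W, 3, 0).reshape(C_out, -1), full_matrices=False)[0][:, :R4]
    # Core tensor G = W x_3 U3^T x_4 U4^T, shape (d, d, R3, R4).
    G = np.tensordot(np.tensordot(W, U3, axes=(2, 0)), U4, axes=(2, 0))
    return G, U3, U4

# Toy check: reconstruct W_hat as in Equation (4) and compare with W.
W = np.random.randn(3, 3, 64, 128)
G, U3, U4 = tucker2(W, ranks=(32, 64))
W_hat = np.tensordot(np.tensordot(G, U3, axes=(2, 1)), U4, axes=(2, 1))  # (3, 3, 64, 128)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # large for a random W, small for genuinely low-rank kernels
```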
After the Tucker-2 decomposition, the ordinary convolution operation in Equation (2) can be replaced with three sequential convolution operations [25] that use the core tensor and factor matrices as convolution kernels. The three convolution operations are formulated as follows:
$$\mathcal{Z}_{h, w, r_3} = \sum_{s=1}^{C_{in}} U^{(3)}_{s, r_3}\, \mathcal{X}_{h, w, s}, \quad \mathcal{Z}'_{h', w', r_4} = \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{r_3=1}^{R_3} \mathcal{G}_{i, j, r_3, r_4}\, \mathcal{Z}_{h_i, w_j, r_3}, \quad \mathcal{Y}_{h', w', t} = \sum_{r_4=1}^{R_4} U^{(4)}_{t, r_4}\, \mathcal{Z}'_{h', w', r_4}, \qquad (5)$$
where $\mathcal{Z}$ and $\mathcal{Z}'$ are two intermediate feature maps. Equations (2) and (5) are equivalent. The advantages are that the total parameter count of the three sequential convolution operations is much smaller than that of the original convolution operation, and the inference speed is much faster.
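Building on Equation (5), the sketch below shows one way to realize the three sequential convolutions in PyTorch, assuming the core tensor and factor matrices are already available; tucker2_conv, the default padding and the toy sizes are our own illustrative choices rather than the paper's code:

```python
import torch
import torch.nn as nn

def tucker2_conv(G, U3, U4, stride=1, padding=1):
    """Three sequential convolutions of Equation (5) built from the Tucker-2 factors.
    G: core tensor, shape (d, d, R3, R4); U3: (C_in, R3); U4: (C_out, R4)."""
    d, _, R3, R4 = G.shape
    C_in, C_out = U3.shape[0], U4.shape[0]

    first = nn.Conv2d(C_in, R3, kernel_size=1, bias=False)            # U^(3): channel reduction
    core = nn.Conv2d(R3, R4, kernel_size=d, stride=stride,
                     padding=padding, bias=False)                     # G: spatial convolution
    last = nn.Conv2d(R4, C_out, kernel_size=1, bias=False)            # U^(4): channel restoration

    with torch.no_grad():
        first.weight.copy_(U3.t().reshape(R3, C_in, 1, 1))
        core.weight.copy_(G.permute(3, 2, 0, 1))                      # to the (R4, R3, d, d) layout
        last.weight.copy_(U4.reshape(C_out, R4, 1, 1))
    return nn.Sequential(first, core, last)

# Toy usage: a 3x3 kernel with 64 input / 128 output channels compressed to ranks (32, 64).
G, U3, U4 = torch.randn(3, 3, 32, 64), torch.randn(64, 32), torch.randn(128, 64)
y = tucker2_conv(G, U3, U4)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 128, 56, 56])
```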

3.3. Permuting the Kernels

Vector quantization inevitably comes with a large reconstruction error, so the identification accuracy drops significantly after vector quantization. The reason is that the sub-vectors are highly diverse and cannot be well represented by their centroids.
To address this problem, we search for permutations applied to the low-rank convolutional kernels of each layer, which results in sub-vectors that are easier to quantize, by minimizing the determinant of the covariance of the resulting sub-vectors [16]. It is worth mentioning that the network is invariant under the permutation of its weights, as long as the same permutation is applied to the output dimension for parent layers and the input dimension for children layers. We refer the reader to Equations (4)–(16) of paper [16] for more details about how to find the permutation matrix for each layer.
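The full search procedure is given in [16]; as a rough illustration of the objective only, the toy sketch below scores a candidate permutation of the input dimension by the log-determinant of the covariance of the resulting sub-vectors and keeps the best of a set of random trials. The actual method uses a guided, annealed search rather than random sampling, and the chosen permutation must also be applied to the parent layer, as noted above:

```python
import numpy as np

def subvector_logdet(U, d):
    """Log-determinant of the covariance of the length-d column sub-vectors of U.
    Lower values indicate sub-vectors that are easier to quantize [16]."""
    C_in, R3 = U.shape
    sub = U.reshape(C_in // d, d, R3).transpose(0, 2, 1).reshape(-1, d)   # all sub-vectors as rows
    cov = np.cov(sub, rowvar=False) + 1e-6 * np.eye(d)                    # regularized covariance
    return np.linalg.slogdet(cov)[1]

def random_permutation_search(U, d=2, trials=1000, seed=0):
    """Toy stand-in for the permutation search of [16]: keep the random row permutation
    of U that minimizes the sub-vector covariance determinant."""
    rng = np.random.default_rng(seed)
    best_perm, best_score = np.arange(U.shape[0]), subvector_logdet(U, d)
    for _ in range(trials):
        perm = rng.permutation(U.shape[0])
        score = subvector_logdet(U[perm], d)
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

perm, score = random_permutation_search(np.random.randn(256, 128))
```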

3.4. Vector Quantization

In this section, we introduce how to apply vector quantization to the low-rank convolution kernels $U^{(3)}$, $\mathcal{G}$ and $U^{(4)}$.
We learn an encoding $\beta$ that takes considerably less memory to store the low-rank convolution kernels; it consists of a codebook $C_b$ and a set of codes $B$. Taking $U^{(3)} \in \mathbb{R}^{C_{in} \times R_3}$ as an example, we split the matrix into column sub-vectors $\mathbf{u}_{i,j} \in \mathbb{R}^{d \times 1}$. Equation (6) shows the divided factor matrix:
$$U^{(3)} = \begin{bmatrix} \mathbf{u}_{11} & \cdots & \mathbf{u}_{1 R_3} \\ \mathbf{u}_{21} & \cdots & \mathbf{u}_{2 R_3} \\ \vdots & \ddots & \vdots \\ \mathbf{u}_{\hat{C}_{in} 1} & \cdots & \mathbf{u}_{\hat{C}_{in} R_3} \end{bmatrix}, \qquad (6)$$
where $\hat{C}_{in} = C_{in}/d$. Notably, $\mathcal{G}$ needs to be reshaped into a 2D matrix before being divided. Instead of storing all of the sub-vectors in Equation (6), we apply K-means to the sub-vectors to obtain k centroids and approximate them with this smaller set, which we call the codebook $C_b$ of the layer. All centroids are stored, as follows:
$$C_b = [\mathbf{c}^{(1)}, \ldots, \mathbf{c}^{(k)}], \quad \mathbf{c}^{(i)} \in \mathbb{R}^{d \times 1}. \qquad (7)$$
Each sub-vector $\mathbf{u}_{i,j}$ is assigned to a centroid by the minimum Euclidean distance, $b_{i,j} = \arg\min_{t \in \{1, \ldots, k\}} \| \mathbf{u}_{i,j} - \mathbf{c}^{(t)} \|_2^2$, which is the index of the element in $C_b$ that is closest to $\mathbf{u}_{i,j}$ in the Euclidean space.
The codes $B$ store the index of each sub-vector as:
$$B = \begin{bmatrix} b_{11} & \cdots & b_{1 R_3} \\ b_{21} & \cdots & b_{2 R_3} \\ \vdots & \ddots & \vdots \\ b_{\hat{C}_{in} 1} & \cdots & b_{\hat{C}_{in} R_3} \end{bmatrix} \in \mathbb{R}^{\hat{C}_{in} \times R_3}. \qquad (8)$$
Then, the low-rank convolutional kernel $U^{(3)}$ can be reconstructed for the forward propagation by decoding $C_b$ and $B$, as follows:
$$U^{(3)} = \begin{bmatrix} \mathbf{c}^{(b_{11})} & \cdots & \mathbf{c}^{(b_{1 R_3})} \\ \mathbf{c}^{(b_{21})} & \cdots & \mathbf{c}^{(b_{2 R_3})} \\ \vdots & \ddots & \vdots \\ \mathbf{c}^{(b_{\hat{C}_{in} 1})} & \cdots & \mathbf{c}^{(b_{\hat{C}_{in} R_3})} \end{bmatrix}. \qquad (9)$$
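The following minimal NumPy/scikit-learn sketch implements Equations (6)-(9); the function names and the use of sklearn's KMeans are our own choices, as the paper does not prescribe a particular K-means implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_factor(U, d=2, k=256, seed=0):
    """Vector-quantize a factor matrix U of shape (C_in, R3) as in Equations (6)-(8):
    split each column into length-d sub-vectors, cluster them with K-means, and return
    the codebook C_b (k x d) and the integer codes B (C_in/d x R3)."""
    C_in, R3 = U.shape
    sub = U.reshape(C_in // d, d, R3).transpose(0, 2, 1)        # sub[i, j] = u_ij
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sub.reshape(-1, d))
    return km.cluster_centers_, km.labels_.reshape(C_in // d, R3)

def dequantize_factor(codebook, codes, d):
    """Reconstruct U from C_b and B for the forward pass (Equation (9))."""
    n_blocks, R3 = codes.shape
    return codebook[codes].transpose(0, 2, 1).reshape(n_blocks * d, R3)

# Example: quantize a 256 x 128 factor with 2-dimensional sub-vectors and 256 centroids.
U = np.random.randn(256, 128)
C_b, B = quantize_factor(U, d=2, k=256)
U_hat = dequantize_factor(C_b, B, d=2)   # same shape as U, built from only 256 distinct sub-vectors
```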
The schematic diagrams for permutation and vector quantization are shown in Figure 2.

3.5. Self-Attention Module

To further recover the accuracy lost to the quantization and permutation of the low-rank convolutional kernels, we introduce a self-attention module that focuses on the key positions of the low-rank convolutional kernels. Taking $U^{(3)} \in \mathbb{R}^{C_{in} \times R_3}$ as an example, to retain the important information of $U^{(3)}$, we define a self-attention module $S^{U^{(3)}}$ and preset the self-attention percentage $s$ at one of three levels (0.1%, 0.5% or 1%) to focus on the key positions of $U^{(3)}$ that have large absolute values. The self-attention module is a sparse layer whose weight is a trainable sparse tensor with the same dimensions as $U^{(3)}$. The sparse weight of $S^{U^{(3)}}$ is initialized as follows.
We first sort the elements of $U^{(3)}$ by their absolute values. Let $t = C_{in} \times R_3 \times s$; then, the truncation value $\lambda$ is the absolute value of the element with the $t$-th largest absolute value in $U^{(3)}$. Each position $S^{U^{(3)}}_{ij}$ of $S^{U^{(3)}}$ is initialized as:
$$S^{U^{(3)}}_{ij} = \operatorname{sgn}\bigl(U^{(3)}_{ij}\bigr)\, \max\bigl\{\, \bigl|U^{(3)}_{ij}\bigr| - \lambda,\; 0 \,\bigr\}. \qquad (10)$$
In addition, we design a (0, 1) mask with the same dimensions as the sparse weight. For the positions of the sparse weight whose values are 0, the values of the corresponding positions on the mask are also 0; the values of the other positions on the mask are set to 1. The mask is multiplied with the sparse weight to block gradient back-propagation and training at the positions with a value of 0.
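A minimal PyTorch sketch of the initialization in Equation (10) and the gradient-blocking mask is given below; the class name is our own, and the way the masked weight is combined with the reconstructed kernel (e.g., added to it during the forward pass) is an illustrative assumption rather than a prescription from the paper:

```python
import torch
import torch.nn as nn

class SparseSelfAttention(nn.Module):
    """Sparse, trainable self-attention weight for a factor matrix (Section 3.5).
    Only roughly the s-fraction of positions with the largest |U| survive initialization
    (Equation (10)); a fixed (0,1) mask blocks gradients everywhere else."""
    def __init__(self, U, s=0.005):
        super().__init__()
        t = max(1, int(U.numel() * s))                               # number of retained positions
        lam = U.abs().flatten().kthvalue(U.numel() - t + 1).values   # t-th largest absolute value
        init = torch.sign(U) * torch.clamp(U.abs() - lam, min=0.0)   # Equation (10)
        self.weight = nn.Parameter(init)
        self.register_buffer("mask", (init != 0).float())            # (0,1) mask, not trained

    def forward(self):
        # Masking zeroes both the value and the gradient at the non-key positions.
        return self.weight * self.mask
```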

3.6. The Overall Framework of QLTD

We summarize the overall framework of our QLTD method in Algorithm 1.
Algorithm 1 The overall framework of QLTD
Input: weights $\mathcal{W} \in \mathbb{R}^{d \times d \times C_{in} \times C_{out}}$, the channel compression ratio r of the Tucker decomposition, the length d of the sub-vectors in vector quantization clustering, the number of cluster centroids k, and the self-attention percentage s.
Output: codebooks $C_b^{U^{(3)}}$, $C_b^{U^{(4)}}$ and $C_b^{\mathcal{G}}$; codes $B^{U^{(3)}}$, $B^{U^{(4)}}$ and $B^{\mathcal{G}}$; and the initial weights of the self-attention modules $S^{U^{(3)}}$, $S^{U^{(4)}}$ and $S^{\mathcal{G}}$.
1: Obtain $U^{(3)}$, $\mathcal{G}$ and $U^{(4)}$ by applying the Tucker-2 decomposition to $\mathcal{W}$ as in Equation (4).
2: Split $U^{(3)}$ into sub-vectors with $\hat{C}_{in} = C_{in}/d$ as in Equation (6). Cluster the sub-vectors into k centroids with K-means to obtain the codebook $C_b^{U^{(3)}}$ and code $B^{U^{(3)}}$ as in Equations (7) and (8).
3: Similarly, obtain the codebook $C_b^{U^{(4)}}$ and code $B^{U^{(4)}}$ of $U^{(4)}$.
4: Similarly, obtain the codebook $C_b^{\mathcal{G}}$ and code $B^{\mathcal{G}}$ of $\mathcal{G}$.
5: Apply the same permutation to the output dimension of parent layers and the input dimension of children layers.
6: Obtain the initial weight of the self-attention module $S^{U^{(3)}}$ from $U^{(3)}$ and the self-attention percentage s as in Equation (10).
7: Similarly, obtain the initial weight of the self-attention module $S^{U^{(4)}}$.
8: Similarly, obtain the initial weight of the self-attention module $S^{\mathcal{G}}$.
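To make the steps concrete, the schematic sketch below strings together the illustrative helpers from the previous subsections (tucker2 from Section 3.2, quantize_factor from Section 3.4 and SparseSelfAttention from Section 3.5); the permutation step (line 5) is omitted for brevity, the function name and return structure are our own, and the flattening of $\mathcal{G}$ assumes its row count is divisible by d and large enough for k clusters:

```python
import torch

def qltd_compress_layer(W, r=2, d=2, k=256, s=0.005):
    """Schematic walk-through of Algorithm 1 for one convolutional layer.
    W: pre-trained kernel of shape (d_k, d_k, C_in, C_out)."""
    d_k, _, C_in, C_out = W.shape

    # Line 1: Tucker-2 decomposition of the pre-trained weights (Equation (4)).
    G, U3, U4 = tucker2(W, ranks=(C_in // r, C_out // r))

    # Lines 2-4: codebooks and codes for U3, U4 and the core tensor G
    # (G is flattened into a 2D matrix before being split into sub-vectors).
    G2d = G.reshape(-1, G.shape[-1])
    cb_u3, B_u3 = quantize_factor(U3, d, k)
    cb_u4, B_u4 = quantize_factor(U4, d, k)
    cb_g, B_g = quantize_factor(G2d, d, k)

    # Lines 6-8: sparse self-attention modules initialized from the factors (Equation (10)).
    att_u3 = SparseSelfAttention(torch.as_tensor(U3, dtype=torch.float32), s)
    att_u4 = SparseSelfAttention(torch.as_tensor(U4, dtype=torch.float32), s)
    att_g = SparseSelfAttention(torch.as_tensor(G2d, dtype=torch.float32), s)

    return {"codebooks": (cb_u3, cb_u4, cb_g),
            "codes": (B_u3, B_u4, B_g),
            "attention": (att_u3, att_u4, att_g)}
```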

4. Experiment

In this section, we validate our proposed approach on three datasets, including CIFAR10, CIFAR100 [40], and ImageNet [41]. To validate the effectiveness of our approach, we compare it with several recent low-rank tensor decomposition approaches, vector quantization approaches, scalar quantization approaches and network pruning methods. The low-rank tensor decomposition approaches include SVD [42], LCT [34], TDNR [43], HALOC [44], Maestro [45] and ELRT [46]. The vector quantization approaches include PQF [16] and BGD [47]. The scalar quantization approaches include ABC-Net [48], DC [12], LR-Net [49], HAQ [50] and BWN [51]. The network pruning methods include CLIP-Q [52], TRP [32] and SSS [53].

4.1. Training Settings

Compressing each layer independently causes errors in the activations to accumulate, resulting in a degradation of performance. To address this problem, the compressed model is fine-tuned to recover its performance. During fine-tuning, we fix the codes and permutations and only fine-tune the centroids. Notably, each centroid in the compressed network is differentiable. Therefore, the centroids will be updated, as follows:
$$\mathbf{c}^{(i+1)} = \mathbf{c}^{(i)} - \eta \frac{\partial L}{\partial \mathbf{c}^{(i)}},$$
where $i$ is the step of back-propagation, $L$ is the original loss function and $\eta$ is the learning rate.
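In practice, this can be implemented by storing the codebook as a trainable parameter and the codes as a fixed buffer, so that automatic differentiation applies the centroid update above while the codes and permutations stay fixed. A minimal PyTorch sketch (the module name and tensor layout are our own):

```python
import torch
import torch.nn as nn

class QuantizedFactor(nn.Module):
    """Factor matrix stored as a trainable codebook plus fixed integer codes.
    Only the centroids receive gradients during fine-tuning."""
    def __init__(self, codebook, codes, C_in, R3, d):
        super().__init__()
        self.codebook = nn.Parameter(torch.as_tensor(codebook, dtype=torch.float32))  # (k, d), trainable
        self.register_buffer("codes", torch.as_tensor(codes, dtype=torch.long))       # (C_in/d, R3), fixed
        self.C_in, self.R3, self.d = C_in, R3, d

    def forward(self):
        # Decode the factor matrix; gradients flow back only to the looked-up centroids.
        u = self.codebook[self.codes]                            # (C_in/d, R3, d)
        return u.permute(0, 2, 1).reshape(self.C_in, self.R3)

# Toy usage: decode a 256 x 128 factor from a 256-entry codebook of 2-d centroids.
qf = QuantizedFactor(torch.randn(256, 2), torch.randint(0, 256, (128, 128)), C_in=256, R3=128, d=2)
opt = torch.optim.Adam(qf.parameters(), lr=1e-3)   # only the codebook is a trainable parameter
loss = qf().pow(2).mean()                          # stand-in loss computed through the decoded factor
loss.backward()
opt.step()                                         # gradients reach the centroids only
```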
All of the compressed models were fine-tuned with the Adam optimizer. For the CIFAR-10 and CIFAR-100 experiments, we set the initial learning rate to $10^{-3}$ and decreased it by a factor of 10 every 20 epochs (50 epochs in total). For the ImageNet experiments, the initial learning rate was $10^{-5}$ and was decreased by a factor of 10 every 5 epochs (15 epochs in total). The batch size was 256. We implemented our approach with PyTorch, and all of the following experiments were run on a single Nvidia GeForce RTX 3090 GPU.
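For reference, the CIFAR schedule above corresponds to a standard Adam plus step-decay setup; in the sketch below, a stock torchvision ResNet-18 and a random batch stand in for the compressed model and the CIFAR-10 loader, so only the optimizer and scheduler settings reflect the paper:

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(num_classes=10)                 # stand-in for the compressed model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)           # initial learning rate 1e-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # divide by 10 every 20 epochs

for epoch in range(50):                                             # 50 epochs in total
    images = torch.randn(256, 3, 32, 32)                            # dummy batch of size 256
    labels = torch.randint(0, 10, (256,))
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```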

4.2. CIFAR-10 and CIFAR-100 Experiment

The CIFAR-10 and CIFAR-100 datasets consist of colored natural images of 32 × 32 pixels in 10 and 100 classes, respectively. Each dataset contains 50k training images and 10k testing images. All of the following CIFAR-10 and CIFAR-100 experiments used the data augmentation method provided in [54] for training: 4 zero-valued pixels were padded on each side of a 32 × 32 image, and a 32 × 32 crop was randomly sampled from the padded image or its horizontal flip. For testing, we only evaluated the original 32 × 32 images.
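In torchvision terms, the training augmentation described above can be written as follows; the exact pipeline used by the authors is not given in the paper, so this is the standard formulation of pad-and-crop plus horizontal flipping:

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4, fill=0),   # pad 4 zero-valued pixels per side, then crop 32 x 32
    T.RandomHorizontalFlip(),              # random horizontal flip
    T.ToTensor(),
])
test_transform = T.ToTensor()              # testing uses the original 32 x 32 images
```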
We chose a popular DCNN ResNet-18 [55] as the baseline network. The uncompressed baseline ResNet-18 model had 11.2 M parameters and reached 95.09% Top-1 accuracy on the CIFAR-10 dataset and 75.58% Top-1 accuracy on the CIFAR-100 dataset. We applied our compression scheme to all the convolutional layers of ResNet-18 except for the first convolutional layer, as it only had three input channels, making it too small to compress. In order to achieve similar compression ratios with the different network compression approaches used for comparison, we compressed the model under different compression regimes. The compression ratio of the network could be controlled by adjusting the following parameters: (1) the channel compression rate of Tucker decomposition, (2) the length of subvectors in vector quantization clustering, (3) the number of cluster centers in vector quantization clustering and (4) the sparsity of the self-attention module. It is worth mentioning that our approach was able to realize an extremely high compression ratio of over 90×.
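To see how these four compression parameters interact, the rough per-layer accounting below estimates the storage of a compressed layer. The treatment of the self-attention entries (one 32-bit value plus one 32-bit index each) and the use of per-layer codebooks are our own assumptions, so the resulting number is only indicative and will not exactly match the network-level ratios reported in the tables:

```python
import math

def layer_storage_bits(d_k, C_in, C_out, r=2, d=2, k=256, s=0.005, index_bits=32):
    """Rough storage estimate (in bits) for one QLTD-compressed convolutional layer,
    assuming 32-bit codebook entries, ceil(log2(k))-bit codes, and one 32-bit value
    plus one index per retained self-attention entry."""
    R3, R4 = C_in // r, C_out // r
    n_elem = C_in * R3 + C_out * R4 + d_k * d_k * R3 * R4   # elements of U3, U4 and G
    codebooks = 3 * k * d * 32                               # three codebooks C_b
    codes = (n_elem // d) * math.ceil(math.log2(k))          # codes B for all sub-vectors
    attention = int(n_elem * s) * (32 + index_bits)          # sparse self-attention modules
    return codebooks + codes + attention

# Example: a 3 x 3 convolution with 256 input and 256 output channels.
original_bits = 3 * 3 * 256 * 256 * 32
print(f"per-layer compression ratio ~ {original_bits / layer_storage_bits(3, 256, 256):.1f}x")
```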
Table 1 shows the experimental results of ResNet-18 on the CIFAR-10 dataset for our approach and the approaches used for comparison. The uncompressed baseline ResNet-18 model reached 95.09% Top-1 accuracy. Our approach was able to achieve an accuracy drop of less than 1% while achieving a compression ratio of 35.43×. As PQF [16] also uses quantization and permutation to compress the network, we conducted a more detailed comparison with it. The results showed that our approach achieved a better trade-off between the compression ratio and Top-1 accuracy than PQF did: the last three rows for our approach achieved compression ratios roughly 10 points (×) higher, together with a higher Top-1 accuracy, than the corresponding three rows for the PQF approach [16]. Conventional compression approaches such as Maestro [45] and TDNR [43] were only able to achieve a compression ratio of about 10×, and their accuracy was not as high as that of our QLTD approach.
Table 2 shows the experimental results of ResNet-18 on the CIFAR100 dataset with our approach and the PQF approach. The uncompressed baseline ResNet-18 model reached a Top-1 accuracy of 75.58%. We conducted a more detailed experiment with the PQF approach [16]. The results showed that our approach achieved a better balance between the compression ratio and Top-1 accuracy than PQF did. Our approach was able to reach an accuracy of 74.07% while compressing the parameters by 36.27×. Even when our approach achieved a compression ratio as high as 88.51×, it still maintained an identification accuracy of over 70%.

4.3. ImageNet Experiment

We extensively validate our approach on the most popular benchmark ImageNet. ImageNet contains 1.28 million training images and 50 thousand validation images of 1000 different classes. These images cover various categories found in everyday life. The data augmentation for training included a random resized cropping and a random horizontal flip. The image was center-cropped to match the input size for validation.
We chose the pre-trained ResNet-18 and ResNet-50 from the PyTorch model zoo as baseline networks. The baseline ResNet-18 and ResNet-50 achieved 69.1% and 76.15% identification accuracy on the ImageNet dataset, respectively. We compressed the baseline ResNet-18 and ResNet-50 and conducted a series of experiments. Table 3 shows the experimental results of ResNet-18 on the ImageNet dataset with our approach and the approaches used for comparison. The HALOC [44], LCT [34] and ELRT [46] approaches aim to compress the parameter count while maintaining the identification accuracy and, thus, only achieved compression ratios of less than 3×. Direct low-rank decomposition and pruning approaches such as TRP [32] and SVD [42] were not able to achieve high compression ratios, and their accuracy noticeably decreased. ABC-Net [48] and LR-Net [49] were able to achieve high compression ratios, but the identification accuracy also noticeably decreased. BGD [47] and PQF [16] achieved a good identification accuracy while achieving relatively high compression ratios for the parameter count. Compared with PQF [16], our approach achieved a higher identification accuracy with a similar compression ratio for the parameter count. To sum up, our approach achieved the best balance between compressing parameters and maintaining identification accuracy among all of the approaches shown in Table 3.
Table 4 shows the experimental results of ResNet-50 on the ImageNet dataset with our approach and the approaches used for comparison. ResNet-50 uses a bottleneck block that includes two 1 × 1 convolutional layers and one 3 × 3 convolutional layer. The 1 × 1 convolutional layers are not very suitable for compression with approaches based on low-rank decomposition, pruning, and scalar quantization. Therefore, SSS [53], TRP [32], SVD [42], CLIP-Q [52], HAQ [50], and DC [12] did not achieve very high compression ratios, and their accuracy significantly decreased compared with that of the baseline ResNet-50. The vector quantization approaches, BGD [47] and PQF [16], achieved very high accuracies of 74.81% and 75.42%, respectively, with compression ratios of around 15× in terms of the parameter count. Compared with BGD [47] and PQF [16], our approach achieved a higher identification accuracy and compression ratio at the same time. Notice that our approach achieved a compression ratio of 18.53× with an accuracy drop of only 0.38% compared with the baseline ResNet-50.

4.4. Ablation Study

We studied the impact on our QLTD approach with/without the (1) Tucker decomposition, (2) quantization, (3) permutation, and (4) self-attention module. For the sake of discussion, we kept the following: (1) the channel compression rate of the Tucker decomposition at 2, (2) the length of sub-vectors in the vector quantization clustering at 2, (3) the number of cluster centers in the vector quantization clustering at 256, and (4) the sparsity of the self-attention module at 99.5%, if the corresponding scheme was used.
We chose the CIFAR-10 dataset and ResNet-18 as the baseline network. Table 5 shows the results of our QLTD approach with different schemes. In the table, we can observe the following conclusions: (1) Significant parameter compression relies primarily on quantization. (2) Permutation does not introduce parameters and can even enhance the identification accuracy. (3) The self-attention module introduces a small number of parameters, but it contributes significantly to improving the accuracy. (4) When not aiming for an extremely high compression ratio, using only the Tucker decomposition can reduce the loss caused by compression. To sum up, the proposed QLTD approach achieves the best trade-off between the parameter compression ratio and identification accuracy when utilizing all these modules.

5. Discussion

In this section, we discuss the compression potential and limitations of our QLTD method. When using the Tucker decomposition to compress tensors, it is common practice not to compress dimensions that are too small, such as the 3 × 3 spatial dimensions of convolutional kernels, and also not to compress the three input channels when the input is an RGB image. As the low-rankness of small dimensions is poor, using low-rank decomposition methods to compress them leads to a significant performance decrease, which is not conducive to achieving a better balance between the compression ratio and accuracy. Additionally, the degree of redundancy in network parameters is correlated with the complexity of the dataset and the difficulty of the identification task. As shown in Table 3, our QLTD method achieved a parameter compression ratio of 54.35× and an accuracy of 61.57%, representing a 7.53% decrease in accuracy compared with that of the baseline ResNet-18 on the ImageNet dataset. As shown in Table 1, on the relatively simple CIFAR-10 dataset, our QLTD method was able to achieve a parameter compression ratio of 90.61× and an accuracy of 93.24%, representing a 1.85% decrease in accuracy compared with that of the baseline ResNet-18, i.e., a higher parameter compression ratio and a smaller accuracy decrease than on the ImageNet dataset.
It is also worth mentioning that early convolutional neural networks used relatively large receptive fields with kernel sizes of, for example, 11 × 11 and 7 × 7. VGGNet [56] achieved the same receptive field as that attained with larger convolutional kernels by stacking multiple layers of 3 × 3 kernels, thereby reducing the number of network parameters. However, adding more layers to a suitably deep model increases the training error and reduces the accuracy [55,57,58]. An effective strategy for achieving both a large receptive field and parameter efficiency without adding more layers is to employ larger convolutional kernels, such as 11 × 11 kernels, and compress them to 3 × 3 with our QLTD method during the Tucker decomposition. This approach preserves a substantial receptive field while mitigating parameter redundancy. We will explore this strategy in future work.

6. Conclusions

In this study, we proposed a compact representation for neural networks named Quantized Low-Rank Tensor Decomposition (QLTD) to super compress deep convolutional neural networks. We found that the parameter redundancy in the low-rank space can be alleviated by vector quantization, which enables the network to realize an ultra-light-weight structure. Furthermore, the self-attention module, which is designed as a trainable sparse convolutional layer, contributes significantly to improving the accuracy. Extensive experiments on object identification showed that our approach achieves state-of-the-art results in the super-compression-ratio regime and realizes the best balance between compressing parameters and maintaining identification accuracy.

Author Contributions

B.L., Conceptualization, formal analysis, investigation, writing—original draft preparation and validation; D.W., Methodology, software, investigation and data curation; Q.L., Software, investigation, validation and visualization; Z.H., Methodology, writing—review and editing and funding acquisition; Y.T., Supervision, resources and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (Grant Number: U23A20343, 11971374, 61821005 and 61903358), CAS Project for Young Scientists in Basic Research (Grant Number: YSBR-041), Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant Number: Y202051 and 2022196).

Institutional Review Board Statement

Not applicable. This work only involves testing identification tasks on publicly available datasets and does not involve any ethical review content.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CIFAR-10 and CIFAR-100 datasets are open datasets and can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 1 January 2024. The ImageNet dataset is also an open dataset and can be downloaded at https://www.image-net.org/download.php, accessed on 1 January 2024. Our code to reproduce the experimental results in the manuscript will be shared at https://github.com/liubc17/QLTD, accessed on 25 March 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, J.; Chen, G.; Jin, M.; Mao, W.; Lu, H. AE-Qdrop: Towards Accurate and Efficient Low-Bit Post-Training Quantization for a Convolutional Neural Network. Electronics 2024, 13, 644. [Google Scholar] [CrossRef]
  2. Smagulova, K.; Bacha, L.; Fouda, M.E.; Kanj, R.; Eltawil, A. Robustness and Transferability of Adversarial Attacks on Different Image Classification Neural Networks. Electronics 2024, 13, 592. [Google Scholar] [CrossRef]
  3. Yu, C.C.; Chen, T.Y.; Hsu, C.W.; Cheng, H.Y. Incremental Scene Classification Using Dual Knowledge Distillation and Classifier Discrepancy on Natural and Remote Sensing Images. Electronics 2024, 13, 583. [Google Scholar] [CrossRef]
  4. Yang, W.; Wang, X.; Luo, X.; Xie, S.; Chen, J. S2S-Sim: A Benchmark Dataset for Ship Cooperative 3D Object Detection. Electronics 2024, 13, 885. [Google Scholar] [CrossRef]
  5. Jia, L.; Tian, X.; Hu, Y.; Jing, M.; Zuo, L.; Li, W. Style-Guided Adversarial Teacher for Cross-Domain Object Detection. Electronics 2024, 13, 862. [Google Scholar] [CrossRef]
  6. Chen, R.; Lv, D.; Dai, L.; Jin, L.; Xiang, Z. AdvMix: Adversarial Mixing Strategy for Unsupervised Domain Adaptive Object Detection. Electronics 2024, 13, 685. [Google Scholar] [CrossRef]
  7. Wang, C.; Li, Y.; Wei, G.; Hou, X.; Sun, X. Robust Localization-Guided Dual-Branch Network for Camouflaged Object Segmentation. Electronics 2024, 13, 821. [Google Scholar] [CrossRef]
  8. Rudnicka, Z.; Szczepanski, J.; Pregowska, A. Artificial Intelligence-Based Algorithms in Medical Image Scan Segmentation and Intelligent Visual Content Generation—A Concise Overview. Electronics 2024, 13, 746. [Google Scholar] [CrossRef]
  9. Li, H.; Li, L.; Zhao, L.; Liu, F. ResU-Former: Advancing Remote Sensing Image Segmentation with Swin Residual Transformer for Precise Global–Local Feature Recognition and Visual–Semantic Space Learning. Electronics 2024, 13, 436. [Google Scholar] [CrossRef]
  10. He, Y.; Zhang, X.; Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1389–1397. [Google Scholar]
  11. He, Y.; Liu, P.; Wang, Z.; Hu, Z.; Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4340–4349. [Google Scholar]
  12. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
  13. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar]
  14. Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; Pensky, M. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 806–814. [Google Scholar]
  15. Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.; De Freitas, N. Predicting parameters in deep learning. Adv. Neural Inf. Process. Syst. 2013, 26, 1–9. [Google Scholar]
  16. Martinez, J.; Shewakramani, J.; Liu, T.W.; Bârsan, I.A.; Zeng, W.; Urtasun, R. Permute, quantize, and fine-tune: Efficient compression of neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15699–15708. [Google Scholar]
  17. Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar]
  18. Bellec, G.; Kappel, D.; Maass, W.; Legenstein, R. Deep Rewiring: Training very sparse deep networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  19. Mocanu, D.C.; Mocanu, E.; Stone, P.; Nguyen, P.H.; Gibescu, M.; Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nat. Commun. 2018, 9, 2383. [Google Scholar] [CrossRef] [PubMed]
  20. Mostafa, H.; Wang, X. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 4646–4655. [Google Scholar]
  21. Dettmers, T.; Zettlemoyer, L. Sparse Networks from Scratch: Faster Training without Losing Performance. arXiv 2019, arXiv:1907.04840. [Google Scholar]
  22. Kusupati, A.; Ramanujan, V.; Somani, R.; Wortsman, M.; Jain, P.; Kakade, S.; Farhadi, A. Soft Threshold Weight Reparameterization for Learnable Sparsity. In Proceedings of the ICML 2020: 37th International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; Volume 1, pp. 5544–5555. [Google Scholar]
  23. Liao, N.; Wang, S.; Xiang, L.; Ye, N.; Shao, S.; Chu, P. Achieving adversarial robustness via sparsity. Mach. Learn. 2021, 111, 685–711. [Google Scholar] [CrossRef]
  24. Lebedev, V.; Ganin, Y.; Rakhuba, M.; Oseledets, I.; Lempitsky, V. Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition. In Proceedings of the ICLR 2015: International Conference on Learning Representations 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  25. Kim, Y.D.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; Shin, D. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  26. Oseledets, I.V. Tensor-train Decomposition. SIAM J. Sci. Comput. 2011, 33, 2295–2317. [Google Scholar] [CrossRef]
  27. Zhao, Q.; Zhou, G.; Xie, S.; Zhang, L.; Cichocki, A. Tensor Ring Decomposition. arXiv 2016, arXiv:1606.05535. [Google Scholar]
  28. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. In Proceedings of the British Machine Vision Conference 2014, Nottingham, UK, 1–5 September 2014. [Google Scholar]
  29. Cheng, Z.; Li, B.; Fan, Y.; Bao, Y. A novel rank selection scheme in tensor ring decomposition based on reinforcement learning for deep neural networks. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3292–3296. [Google Scholar]
  30. Li, Y.; Gu, S.; Mayer, C.; Gool, L.V.; Timofte, R. Group Sparsity: The Hinge between Filter Pruning and Decomposition for Network Compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8018–8027. [Google Scholar]
  31. Yin, M.; Sui, Y.; Liao, S.; Yuan, B. Towards Efficient Tensor Decomposition-Based DNN Model Compression With Optimization Framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 10674–10683. [Google Scholar]
  32. Xu, Y.; Li, Y.; Zhang, S.; Wen, W.; Wang, B.; Qi, Y.; Chen, Y.; Lin, W.; Xiong, H. Trp: Trained rank pruning for efficient deep neural networks. arXiv 2020, arXiv:2004.14566. [Google Scholar]
  33. Idelbayev, Y.; Carreira-Perpinán, M.A. Low-rank Compression of Neural Nets: Learning the Rank of Each Layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8049–8059. [Google Scholar]
  34. Liu, B.; Han, Z.; Shao, W.; Jia, H.; Wang, Y.; Tang, Y. A novel compact design of convolutional layers with spatial transformation towards lower-rank representation for image classification. Knowl.-Based Syst. 2022, 255, 109723. [Google Scholar] [CrossRef]
  35. Merolla, P.; Appuswamy, R.; Arthur, J.; Esser, S.K.; Modha, D. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv 2016, arXiv:1606.01981. [Google Scholar]
  36. Qian, B.; Wang, Y.; Hong, R.; Wang, M. Adaptive Data-Free Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7960–7968. [Google Scholar]
  37. Ma, Y.; Li, H.; Zheng, X.; Xiao, X.; Wang, R.; Wen, S.; Pan, X.; Chao, F.; Ji, R. Solving Oscillation Problem in Post-Training Quantization through a Theoretical Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7950–7959. [Google Scholar]
  38. Li, H.; Wu, X.; Lv, F.; Liao, D.; Li, T.H.; Zhang, Y.; Han, B.; Tan, M. Hard Sample Matters a Lot in Zero-Shot Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24417–24426. [Google Scholar]
  39. Tucker, L.R. Some mathematical notes on three-mode factor analysis. Psychometrika 1966, 31, 279–311. [Google Scholar] [CrossRef] [PubMed]
  40. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  41. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  42. Zhang, X.; Zou, J.; He, K.; Sun, J. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1943–1955. [Google Scholar] [CrossRef] [PubMed]
  43. Liu, Y.; Ng, M.K. Deep neural network compression by Tucker decomposition with nonlinear response. Knowl.-Based Syst. 2022, 241, 108171. [Google Scholar] [CrossRef]
  44. Xiao, J.; Zhang, C.; Gong, Y.; Yin, M.; Sui, Y.; Xiang, L.; Tao, D.; Yuan, B. HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks. arXiv 2023, arXiv:2301.09422. [Google Scholar] [CrossRef]
  45. Horvath, S.; Laskaridis, S.; Rajput, S.; Wang, H. Maestro: Uncovering Low-Rank Structures via Trainable Decomposition. arXiv 2023, arXiv:2308.14929. [Google Scholar]
  46. Sui, Y.; Yin, M.; Gong, Y.; Xiao, J.; Phan, H.; Yuan, B. ELRT: Efficient Low-Rank Training for Compact Convolutional Neural Networks. arXiv 2024, arXiv:2401.10341. [Google Scholar]
  47. Stock, P.; Joulin, A.; Gribonval, R.; Graham, B.; Jégou, H. And the bit goes down: Revisiting the quantization of neural networks. arXiv 2019, arXiv:1907.05686. [Google Scholar]
  48. Lin, X.; Zhao, C.; Pan, W. Towards accurate binary convolutional neural network. Adv. Neural Inf. Process. Syst. 2017, 30, 1–9. [Google Scholar]
  49. Shayer, O.; Levi, D.; Fetaya, E. Learning discrete weights using the local reparameterization trick. arXiv 2017, arXiv:1710.07739. [Google Scholar]
  50. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 8612–8620. [Google Scholar]
  51. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542. [Google Scholar]
  52. Tung, F.; Mori, G. Deep neural network compression by in-parallel pruning-quantization. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 568–579. [Google Scholar] [CrossRef]
  53. Huang, Z.; Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 304–320. [Google Scholar]
  54. Lee, C.Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-supervised nets. In Proceedings of the Artificial Intelligence and Statistics, PMLR, San Diego, CA, USA, 9–12 May 2015; pp. 562–570. [Google Scholar]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  56. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  57. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
  58. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387. [Google Scholar]
Figure 1. The main framework of QLTD. Firstly, the original convolutional kernel is decomposed into $U^{(3)}$, $\mathcal{G}$ and $U^{(4)}$ by the Tucker-2 decomposition. Meanwhile, self-attention modules $S^{U^{(3)}}$, $S^{\mathcal{G}}$ and $S^{U^{(4)}}$ are introduced to focus on the key positions of $U^{(3)}$, $\mathcal{G}$ and $U^{(4)}$, respectively. The self-attention modules are sparse and trainable. Additionally, a permutation and vector quantization approach is applied to $U^{(3)}$, $\mathcal{G}$ and $U^{(4)}$ to further reduce the parameter storage, which is explained in detail in Section 3.3 and Section 3.4.
Figure 2. Permutation and vector quantization of QLTD.
Table 1. Experimental results of ResNet-18 on the CIFAR-10 dataset. We report the Top-1 accuracy and compression ratios of our approach and the approaches used for comparison. For our approach, we provide the compression scheme. For example, "2-4-256-0.5" denotes that: (1) the channel compression rate of the Tucker decomposition is 2, (2) the length of subvectors in the vector quantization clustering is 4, (3) the number of cluster centers in the vector quantization clustering is 256 and (4) the sparsity of the self-attention module is 99.5%. The following tables follow the same notation.

| Method | Compression Scheme | Top-1 Accuracy (%) | Com. Ratio (×) |
|---|---|---|---|
| Maestro [45] | - | 93.86 | 9.1 |
| TDNR [43] | - | 92.82 | 11.82 |
| PQF [16] | - | 93.77 | 40.97 |
| PQF [16] | - | 93.21 | 60.23 |
| PQF [16] | - | 93.15 | 79.34 |
| QLTD (ours) | 2-2-256-0.5 | 94.15 | 35.43 |
| QLTD (ours) | 2-4-256-0.5 | 93.87 | 48.13 |
| QLTD (ours) | 4-2-256-0.1 | 93.51 | 72.07 |
| QLTD (ours) | 4-4-256-1 | 93.24 | 90.61 |
Table 2. Experimental results of ResNet-18 on the CIFAR-100 dataset. We report the Top-1 accuracy and compression ratios of our approach and the PQF approach [16]. For our approach, we provide the compression scheme.

| Method | Compression Scheme | Top-1 Accuracy (%) | Com. Ratio (×) |
|---|---|---|---|
| PQF [16] | - | 72.68 | 22.73 |
| PQF [16] | - | 70.85 | 45.57 |
| PQF [16] | - | 69.97 | 75.12 |
| QLTD (ours) | 2-2-256-0.1 | 74.07 | 36.27 |
| QLTD (ours) | 4-2-256-0.5 | 71.59 | 65.86 |
| QLTD (ours) | 4-4-256-0.5 | 70.35 | 88.51 |
Table 3. Experimental results of ResNet-18 on the ImageNet dataset. We report the Top-1 accuracy and compression ratios of our approach and the approaches used for comparison. For our approach, we provide the compression scheme.

| Method | Compression Scheme | Top-1 Accuracy (%) | Com. Ratio (×) |
|---|---|---|---|
| HALOC [44] | - | 70.65 | 2.75 |
| LCT [34] | - | 67.87 | 2.57 |
| ELRT [46] | - | 68.65 | 2.17 |
| TRP [32] | - | 65.51 | 2.59 |
| SVD [42] | - | 63.10 | 1.41 |
| ABC-Net [48] | - | 62.8 | 32.05 |
| LR-Net [49] | - | 59.9 | 31.89 |
| BGD [47] | - | 64.12 | 35.21 |
| BGD [47] | - | 61.17 | 43.23 |
| PQF [16] | - | 65.23 | 35.14 |
| PQF [16] | - | 59.87 | 56.74 |
| PQF [16] | - | 58.92 | 59.53 |
| QLTD (ours) | 2-2-256-0.1 | 65.32 | 36.72 |
| QLTD (ours) | 4-2-512-0.5 | 61.57 | 54.35 |
| QLTD (ours) | 4-2-256-1 | 59.85 | 63.41 |
Table 4. Experimental results of ResNet-50 on the ImageNet dataset. We report the Top-1 accuracy and compression ratio of our approach and the approaches used for comparison. For our approach, we provide the compression scheme.

| Method | Compression Scheme | Top-1 Accuracy (%) | Com. Ratio (×) |
|---|---|---|---|
| SSS [53] | - | 72.98 | 1.20 |
| TRP [32] | - | 72.69 | 2.30 |
| SVD [42] | - | 71.80 | 1.50 |
| CLIP-Q [52] | - | 73.77 | 14.9 |
| HAQ [50] | - | 70.63 | 15.2 |
| DC [12] | - | 68.91 | 5.18 |
| BGD [47] | - | 74.81 | 15.2 |
| BGD [47] | - | 71.53 | 25.91 |
| PQF [16] | - | 75.42 | 16.82 |
| PQF [16] | - | 70.22 | 29.37 |
| PQF [16] | - | 69.13 | 32.16 |
| QLTD (ours) | 2-4-256-0.5 | 75.77 | 18.53 |
| QLTD (ours) | 4-2-256-1 | 72.25 | 34.24 |
| QLTD (ours) | 4-2-256-0.1 | 71.46 | 35.72 |
Table 5. Experimental results of the ablation study of our proposed method on the CIFAR-10 dataset. A "✓" notation denotes that the corresponding module is utilized.

| Tucker | Quantization | Permutation | Self-Attention | Com. Ratio (×) | Accuracy (%) |
|---|---|---|---|---|---|
| ✓ | | | | 3.75 | 94.47 |
| ✓ | ✓ | | | 37.33 | 92.57 |
| ✓ | ✓ | ✓ | | 37.33 | 93.26 |
| ✓ | | | ✓ | 3.66 | 94.81 |
| ✓ | ✓ | | ✓ | 35.43 | 93.38 |
| ✓ | ✓ | ✓ | ✓ | 35.43 | 94.15 |