Review

Advances in the Neural Network Quantization: A Comprehensive Review

Lu Wei, Zhong Ma, Chaojie Yang and Qin Yao
1 School of Software, Northwestern Polytechnical University, Xi’an 710072, China
2 Xi’an Microelectronics Technology Institute, Xi’an 710065, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7445; https://doi.org/10.3390/app14177445
Submission received: 20 June 2024 / Revised: 3 August 2024 / Accepted: 20 August 2024 / Published: 23 August 2024

Abstract
Artificial intelligence technologies based on deep convolutional neural networks and large language models have made significant breakthroughs in many tasks, such as image recognition, target detection, semantic segmentation, and natural language processing, but they also face a conflict between the high computational demands of the algorithms and limited deployment resources. Quantization, which converts floating-point neural networks into low-bit-width integer networks, is an important and essential technique for efficient deployment and cost reduction in edge computing. This paper analyzes various existing quantization methods, showcases the deployment accuracy of advanced techniques, and discusses the future challenges and trends in this domain.

1. Introduction

With the rapid development of artificial intelligence and machine learning technologies, deep learning models have achieved excellent results in many fields, such as target detection [1,2,3,4], intelligent obstacle avoidance [5], semantic segmentation [6,7,8,9], and situational awareness [10]. However, the high performance of these models often relies on a large number of parameters and complex computational processes, leading to significant challenges in computational resources and energy consumption in practical applications. Existing deep learning models still demand high storage capacity, memory bandwidth, and computational power [11], which limits their use on resource-constrained devices in many practical applications. Therefore, reducing model size, improving inference speed, and enabling existing deep models to run on embedded platforms are key to solving the current challenges of deep learning applications. To address these challenges, neural network quantization techniques have been extensively investigated in recent years; they compress neural network models and accelerate inference by converting high-precision data to low-precision data [12,13]. For example, quantizing 32-bit multiplication operations to 8-bit can reduce power consumption by about 94.59% (an 8-bit multiplication consumes 0.2 pJ, while a 32-bit multiplication consumes 3.7 pJ), and the data transfer speed of the quantized model can be increased to four times that of the original model [14,15]. Neural network quantization is therefore a key enabling technology for artificial intelligence, with urgent application needs and broad prospects in embedded high-speed inference and large-model compression.
(1) There is a need for efficient computation of intelligent algorithms on embedded devices. With the continuous progress of artificial intelligence technology, the application fields of neural networks keep expanding. Because of the high computational complexity of intelligent algorithm models represented by neural networks, the contradiction between limited computational resources and the high computational demand of intelligent models seriously restricts the deployment of high-performance neural network models on end-side computing devices and has become a bottleneck limiting the promotion and application of artificial intelligence technology [16,17,18]. To address this problem, there is an urgent need for research on quantization techniques for neural network models that reduce data storage, data transmission, and computational cost. Modern neural networks, while highly effective, often require substantial computational resources and memory, which can be prohibitive for deployment on edge devices with limited hardware capabilities. By reducing the precision of the model’s weights and activations, quantization techniques aim to shrink the model size and enhance efficiency without significantly sacrificing performance.
(2) Deployment Requirements for Large Language Models. In recent years, large models trained on ultra-large-scale text data, represented by BERT [19] and GPT [20,21], have flourished and made substantial progress in tasks such as natural language reasoning and human-computer dialogue, reaching or even surpassing human performance on some datasets. The most famous is the GPT series of models introduced by OpenAI, whose latest version has reached the scale of 100 billion parameters and is regarded as the most advanced language model. With the rapid growth in the scale of large models, the demand for GPU memory and computing power keeps increasing, leading to bottlenecks in storage, memory-access bandwidth, and computation speed; it is increasingly difficult for high-performance servers to meet the hardware requirements of large models. Deloitte Touche Tohmatsu, one of the world’s Big Four accounting firms, predicts that the market for dedicated chips optimized for generative AI will exceed $50 billion in 2024. Quantizing large models allows companies that rely on them to significantly reduce their need for HPC hardware, which in turn significantly reduces chip procurement and operational costs. Quantizing floating-point large models to 8-bit models for inference is also a mainstream practice in the industry today [22,23].
Therefore, the research and development of efficient, low-loss neural network quantization methods is not only of great significance for promoting the wide application of deep learning techniques, but also has a long-term impact on advancing the development of computer science and artificial intelligence. Neural network quantization reduces the complexity and size of the model, accelerates model inference, and lowers resource consumption, giving it significant research value for the development of artificial intelligence technology. The main challenge of quantization is how to minimize the loss of computational accuracy caused by the reduction of numerical bit-width. A large amount of research has tried to find quantization strategies suitable for different network architectures and application scenarios, as well as the optimal quantization parameters under different strategies, but several technical difficulties remain. Quantization methods face challenges regarding how to allocate the quantization bit-width, how to adaptively optimize the quantization parameters according to differences in the model and the data distributions, and how to balance computation speed with computational accuracy. Post-training quantization is simpler and faster to implement since it quantizes a model after training without requiring retraining, making it ideal for rapid deployment. However, it can lead to notable accuracy loss, especially in models sensitive to precision. In contrast, quantization-aware training integrates quantization into the training phase, leading to better-optimized models with minimal accuracy degradation, but it requires more computational resources and time due to the need for retraining.
Regarding techniques for assessing the accuracy loss of quantized neural network models, traditional assessment criteria tend to use a single metric, and this accuracy loss is the cumulative result on a small number of test sets, which often fails to comprehensively reflect the loss of accuracy after quantization. Research on how to quickly and accurately assess the loss of accuracy after model quantization is therefore crucial to the fields of model compression and model quantization.
For neural network bit-width allocation techniques, existing methods have made great progress compared with traditional uniform-precision quantization, automating the bit-width search for the weights and activations of different layers. However, these methods still have room for improvement. Most existing methods do not take the accuracy and efficiency loss on the embedded device into account in the objective function of the bit-width search, and since different hardware devices affect the behavior of the quantized model differently, establishing a complete co-optimization method is of great importance. It is necessary to study the hardware-side feedback on the accuracy and efficiency of the deployed model, through both an accuracy-loss assessment and a speed assessment, and to use this feedback to optimize the bit-width selection process. In addition, most existing methods are computationally demanding, and some rely on manual configuration to initialize the optimization, posing challenges for different model and device setups; a fast and initialization-free bit-width search technique is needed to address these issues.
For quantization parameter generation techniques, current quantization methods are usually designed for a single application scenario. Most of them adopt a unified symmetric uniform quantization mapping strategy, but their robustness is poor, and they cannot adapt to different neural network structures, hardware resource limitations, or computational task requirements, especially for tasks with high accuracy requirements or high difficulty, such as small-target detection and identification. Existing quantization methods have difficulty meeting the accuracy requirements of such tasks, resulting in a huge loss of performance.
Comprehensive research on quantization methods centers on how to improve the accuracy of the quantized neural network model. This is the key problem to be solved, and it requires studying how to automatically select the appropriate quantization threshold according to the data distribution, how to take the quantized features and the quantization error into account during training, how to resolve the mismatch between quantized forward propagation and back propagation, and how to make neural network models more quantization-friendly.
In this paper, we compare and analyze existing neural network quantization methods, give the deployment accuracy results of the advanced methods in the field, and summarize the technical difficulties and development trends of future neural network quantization technology, providing practical guidelines. The rest of the paper is organized as follows: Section 2 describes the quantization fundamentals. Section 3 describes several key quantization techniques. Section 4 provides details of the experimental results and analysis. Section 5 concludes our work and discusses future challenges and trends.

2. Quantization Fundamentals

Neural network models use floating-point data types. To achieve high-speed computation, it is necessary to quantize the floating-point neural network models into integer neural networks. As shown in Figure 1, quantization reduces computational cost by decreasing the precision of the data type, which minimizes the bit-width of data storage and the amount of data passing through the deep neural network. Computing and storing data at lower bit-widths enables fast inference and reduces energy consumption. The quantization process of a neural network model involves choosing a quantization mapping strategy, deciding whether to train the model, computing the quantization parameters, generating the inverse quantization parameters, and deploying the inference on resource-limited devices.
When performing quantization on a neural network model, the activation and weight values of the model are restricted to a discrete set of numbers, which can follow different distributions: uniform or non-uniform. A typical non-uniform quantization method uses a logarithmic distribution of quantization levels [24]. When a quantized neural network model is run on a hardware platform, the quantized computational results must be processed through certain inverse-quantization steps to obtain the true floating-point output. Non-uniform quantization methods have complex inverse-quantization computations and are therefore not hardware-friendly; most current hardware supports only uniform quantization. Consequently, the most widely used quantization scheme is the uniform-distribution method [25] with a uniform step size.
The key issue of quantization is to design a proper quantization mapping function and a proper method to calculate quantization parameters. For uniform quantization, most existing approaches use either asymmetric or symmetric quantization mapping functions [26]. The asymmetric quantization mapping function is as follows:
r = f(Q) = s \cdot Q + D    (1)
Q = f^{-1}(r) = \mathrm{round}\left( \frac{r - D}{s} \right)    (2)
where f is the quantization mapping function and f^{-1} is its inverse, round(·) is the rounding operation, r is the floating-point real value, Q is the integer value after quantization, and s and D are the quantization parameters: s is the scaling factor and D is the zero point, chosen so that the real value 0 maps exactly to a quantized value. Symmetric quantization is a simplified version of the general asymmetric case [27]; the symmetric quantizer restricts the quantization parameter D to 0 [28].
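As an illustration of Equations (1) and (2), the following Python sketch implements the asymmetric mapping and its symmetric special case with NumPy. The signed integer grid and the min/max-based parameter choice in the usage example are simplifying assumptions for illustration, not part of any specific method discussed in this review.

```python
import numpy as np

def quantize(r, s, D, bit_width=8, symmetric=False):
    """Eq. (2): map real values r to integers Q = round((r - D) / s), then clip to the grid."""
    if symmetric:
        D = 0.0  # the symmetric quantizer restricts the zero point D to 0
    qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return np.clip(np.round((r - D) / s), qmin, qmax).astype(np.int32)

def dequantize(q, s, D=0.0):
    """Eq. (1): recover an approximate real value r = s * Q + D."""
    return s * q + D

# Toy usage: 8-bit asymmetric quantization of a small activation tensor,
# with the scale and zero point derived from the tensor's min/max range.
r = np.array([-1.2, 0.0, 0.37, 2.5], dtype=np.float32)
s = (r.max() - r.min()) / (2 ** 8 - 1)
D = r.min() - s * (-(2 ** 7))          # align the grid so that qmin maps back to min(r)
q = quantize(r, s, D)
print(q, dequantize(q, s, D))          # the dequantized values closely approximate r
```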
In Figure 2, we take symmetric quantization to 8-bit as an example. We can see that quantization converts continuous floating-point data into discrete integers, which brings accuracy loss. The quantization parameters are very important for both asymmetric and symmetric quantization, and affect the performance of the quantized neural network. The quantization parameters depend on the clipping range, and the scaling factors divide the clipping range into a number of partitions.
The optimal clipping range for the input is [min_in, max_in], the optimal clipping range for the output is [min_out, max_out], and the clipping threshold of the weights is th_w. The quantization parameters s_in, D_in, s_out, D_out, and s_w of a layer are computed from these clipping thresholds according to Equations (3)–(7):
s_{in} = \frac{max_{in} - min_{in}}{2^{bw_{in}} - 1}    (3)
D_{in} = \frac{min_{in} - max_{in}}{2^{bw_{in}} - 1} \cdot \mathrm{round}\left( \frac{(2^{bw_{in}-1} - 1) \cdot min_{in} + 2^{bw_{in}-1} \cdot max_{in}}{min_{in} - max_{in}} \right)    (4)
s_{out} = \frac{max_{out} - min_{out}}{2^{bw_{out}} - 1}    (5)
D_{out} = \frac{min_{out} - max_{out}}{2^{bw_{out}} - 1} \cdot \mathrm{round}\left( \frac{(2^{bw_{out}-1} - 1) \cdot min_{out} + 2^{bw_{out}-1} \cdot max_{out}}{min_{out} - max_{out}} \right)    (6)
s_{w} = \frac{th_{w}}{2^{bw_{w}-1} - 1}    (7)
where bw_in, bw_out, and bw_w are the bit-widths of the input, output, and weights, respectively; 8 bits is the most commonly used value.
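The following sketch transliterates Equations (3)–(7), as reconstructed above, into NumPy; the clipping ranges and weight threshold in the example are arbitrary illustrative values.

```python
import numpy as np

def activation_params(vmin, vmax, bw=8):
    """Asymmetric scale and zero point for an activation clipping range [vmin, vmax], Eqs. (3)-(6)."""
    s = (vmax - vmin) / (2 ** bw - 1)
    # The zero point is snapped onto the integer grid by the round() term.
    D = (vmin - vmax) / (2 ** bw - 1) * np.round(
        ((2 ** (bw - 1) - 1) * vmin + 2 ** (bw - 1) * vmax) / (vmin - vmax))
    return s, D

def weight_scale(th_w, bw=8):
    """Symmetric weight scale from the weight clipping threshold th_w, Eq. (7)."""
    return th_w / (2 ** (bw - 1) - 1)

# Example: 8-bit parameters for an input range of [-0.8, 2.4], an output range of
# [0.0, 6.0], and a weight threshold of 0.05 (all illustrative values).
s_in, D_in = activation_params(-0.8, 2.4)
s_out, D_out = activation_params(0.0, 6.0)
s_w = weight_scale(0.05)
```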
To effectively quantize the convolutional layer, it is essential to understand its underlying computational principles. The fundamental operation of the convolutional layer is:
\sum \left( input^{f}_{i,j,k} \cdot w^{f}_{k,n} \right) + bias^{f}_{k} = output^{f}_{l,m,n}    (8)
where input^f_{i,j,k} denotes the input of the convolutional layer, w^f_{k,n} its weights, bias^f_k the k-th bias of the convolutional layer, and output^f_{l,m,n} the output of the convolutional layer. All of these values are in floating-point format.
Based on the computational principles of the convolutional layer and the hybrid asymmetric quantization strategy [43], the method for quantizing the convolutional layer can be derived. The activations of the convolutional layer (both the input and the output) use asymmetric quantization mapping, while the weights use symmetric quantization mapping. The computation principle for the quantized convolutional layer is:
\left\{ \left( \frac{s_{in} \, s_{w}}{s_{out}} \cdot 2^{S} \right) \cdot \sum \left( \frac{input^{f}_{i,j,k} - D_{in}}{s_{in}} \cdot \frac{w^{f}_{k,n}}{s_{w}} \right) + \frac{bias^{f}_{k} + D_{in} \cdot \sum w^{f}_{k,n} - D_{out}}{s_{out}} \cdot 2^{S} \right\} \cdot 2^{-S} = \frac{output^{f}_{l,m,n} - D_{out}}{s_{out}}    (9)
where s_out and D_out are the quantization parameters for the convolutional layer’s output, and S is the shift parameter used during the inference process of the convolutional layer.
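To make the roles of the terms in Equation (9) concrete, the sketch below checks, for a single output channel computed as a dot product, that accumulating in the integer domain and then rescaling with a fixed-point multiplier and shift reproduces the quantized floating-point output. The min/max-based quantization parameters, the shift value S = 16, and the use of a dot product instead of a full convolution are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_f = rng.uniform(-1.0, 2.0, size=(4, 64))        # four input patches (floating point)
w_f = rng.uniform(-0.05, 0.05, size=64)           # weights of one output channel
bias_f, S = 0.1, 16                               # bias and the fixed-point shift parameter
out_f = x_f @ w_f + bias_f                        # floating-point reference output

s_in, D_in = (x_f.max() - x_f.min()) / 255, x_f.min()         # asymmetric input parameters
s_w = np.abs(w_f).max() / 127                                  # symmetric weight scale
s_out, D_out = (out_f.max() - out_f.min()) / 255, out_f.min()  # asymmetric output parameters

q_in = np.round((x_f - D_in) / s_in)              # quantized activations
q_w = np.round(w_f / s_w)                         # quantized weights (zero point 0)

m = np.round(s_in * s_w / s_out * 2 ** S)         # fixed-point multiplier
b = np.round((bias_f + D_in * w_f.sum() - D_out) / s_out * 2 ** S)
q_out = (m * (q_in @ q_w) + b) / 2 ** S           # integer accumulate, rescale, shift right by S

print(np.round(q_out))                            # left-hand side of Eq. (9)
print((out_f - D_out) / s_out)                    # right-hand side of Eq. (9); values nearly agree
```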
As shown in Figure 3, the data distribution of neural network activations and weights is often asymmetric, which poses a significant challenge for selecting the clipping range and quantization parameters. A good quantization method should address the following two questions to improve deployment performance. The first is the trade-off between accuracy and the difficulty of deployment. The second is the trade-off between the clipping range and the quantization resolution, which significantly influences the computation of the quantization parameters. There are two main forms of quantization methods: post-training quantization [14,29] and quantization-aware training [30,31]. Post-training quantization (PTQ) is the quick quantization of floating-point weights and activations after the model has been trained. Quantization-aware training (QAT) updates the weights of the model by taking the quantization process into account during training. PTQ is time-saving and convenient, while QAT can obtain a higher accuracy. In addition to PTQ and QAT, the selection of the quantization bit-width and the evaluation of the accuracy loss of deep learning models after quantization are also very important quantization-related techniques. In this paper, we analyze the development status and challenges of each of these techniques.

3. Quantization Techniques

3.1. PTQ

Post-training quantization (PTQ) is an optimization technique applied to neural network models after they have been trained. It is designed to reduce the model’s memory footprint and accelerate inference while attempting to maintain the model’s accuracy. The steps of PTQ are shown in Figure 4. When the data distribution is not Gaussian-like, it is difficult for the quantized model to meet the accuracy requirements, especially for complicated tasks with higher accuracy demands.
The key point of PTQ is how to calculate the quantization parameters, which depend on the clipping range. Usually, a set of calibration data is fed through the neural network to compute the typical range of activations [30,32]. A straightforward choice is to use the min/max of the data as the clipping range [30], which may unnecessarily widen the range and reduce the quantization resolution. One approach is to use the i-th largest/smallest value instead of the min/max value as the clipping range [33]. Another approach is to select the clipping range by minimizing some measure of information loss between the original real values and the quantized values [34,35], including KL divergence [36,37], mean squared error (MSE) [38,39,40,41], or entropy [42]. Wei [43] proposes an activation redistribution-based hybrid asymmetric quantization method for neural networks, which takes the data distribution into consideration and can resolve the contradiction between quantization accuracy and ease of implementation. Choosing the optimal quantization parameters and reducing the accuracy loss of the quantized neural network model remain the key problems to be solved.
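The clipping-range choices mentioned above can be illustrated with a short sketch: min/max, a percentile (the i-th largest/smallest value), and an MSE-driven search over symmetric thresholds. The grid size and percentile below are arbitrary illustrative values, not settings recommended by the cited works.

```python
import numpy as np

def clip_range_minmax(acts):
    """Clipping range taken directly from the min/max of the calibration activations."""
    return acts.min(), acts.max()

def clip_range_percentile(acts, pct=99.99):
    """Use a percentile (the i-th largest/smallest value) to discard extreme outliers."""
    return np.percentile(acts, 100 - pct), np.percentile(acts, pct)

def clip_range_mse(acts, bw=8, n_grid=100):
    """Search a symmetric threshold minimizing the MSE between real and quantized values."""
    best_th, best_err = None, np.inf
    max_abs = np.abs(acts).max()
    for th in np.linspace(max_abs / n_grid, max_abs, n_grid):
        s = th / (2 ** (bw - 1) - 1)
        q = np.clip(np.round(acts / s), -(2 ** (bw - 1) - 1), 2 ** (bw - 1) - 1)
        err = np.mean((acts - q * s) ** 2)
        if err < best_err:
            best_th, best_err = th, err
    return -best_th, best_th

calib = np.random.randn(10000) * 0.5    # stand-in for activations collected on calibration data
print(clip_range_minmax(calib), clip_range_percentile(calib), clip_range_mse(calib))
```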

3.2. QAT

Quantization-aware training (QAT) is a technique in deep learning that integrates the quantization process into the training phase of a neural network model. Unlike post-training quantization (PTQ), which applies quantization after the model has been trained, QAT allows the model to learn and adapt to the quantization effects during training. The steps of QAT are shown in Figure 5.
QAT is a powerful technique for preparing neural network models for deployment on quantization-sensitive hardware. By training the model to be aware of quantization, it can better adapt and maintain accuracy even with reduced precision. However, most QAT methods use the straight-through estimator (STE), which causes a significant gradient error. To address this problem, researchers have proposed alternative approaches. PACT [44] explores the impact of activation clipping on quantization performance. Gong [45] adopts a differentiable tanh function to gradually approximate the quantization function. DoReFa [25] proposes tailoring the weight range prior to quantization. SAT [46] adjusts the gradient update process and the weight scale during training to improve quantization performance. Sharpness-aware quantization (SAQ) [47] provides a unified view of quantization and sharpness-aware minimization (SAM) by treating them as introducing quantization noise and adversarial perturbations to the model weights. Zhuang [48] trains the low-precision network with a full-precision auxiliary module and constructs a mixed-precision network by augmenting the original low-precision network with the full-precision auxiliary module. It is worth noting that although QAT methods can deliver better quantization performance, they often require large training datasets and significant computational resources, with training times exceeding 100 GPU hours [49].
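Below is a minimal PyTorch sketch of the simulated (fake) quantization node used in QAT, with the straight-through estimator that most of the methods above start from. The fixed scale and the symmetric 8-bit grid are illustrative assumptions; methods such as PACT [44] or the differentiable soft quantization of Gong [45] replace or smooth exactly this non-differentiable step.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated quantization for QAT: quantize-dequantize in the forward pass,
    straight-through (identity) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale, bit_width=8):
        qmax = 2 ** (bit_width - 1) - 1
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend round() and clamp() were the identity, so gradients flow unchanged.
        return grad_output, None, None

x = torch.randn(4, 8, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.05))
y.sum().backward()          # x.grad is all ones, as if no quantization had happened
```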

3.3. The Selection of Quantization Bit-Width

Traditional model quantization methods quantize the weight parameters and activation values of the whole model to a fixed bit-width. However, while high bit-width quantization ensures high accuracy, it also causes a larger memory footprint and more computation, whereas low bit-width quantization has lower accuracy but a smaller memory footprint and less computation. Because different layers have different redundancies and computational loads, simply assigning the same bit-width to every layer does not guarantee optimal network performance. Therefore, mixed-precision quantization (MPQ) is needed to achieve further efficient compression of the model. Wu [50] proposes a differentiable neural architecture search for the bit-width of each layer, without considering the inference latency on hardware. Wang [51] proposes the HAQ algorithm, which uses the hardware latency and energy consumption fed back from a hardware simulator to constrain the bit-width search at each layer, resulting in a hardware-aware mixed-precision policy. In order to quickly allocate the quantization bit-width of each layer, Dong [52] proposes HAWQ, which uses second-order information about the model parameters to assess the sensitivity of each layer to quantization and then allocates the bit-width of each layer based on this sensitivity to improve search efficiency. Dong [53] proposes the EMQ method, which automates the search for mixed-precision configurations with the help of evolutionary algorithms. Tang [54] proposes the SEAM method, which uses small proxy datasets to perform a fast search in order to uncover effective MPQ strategies applicable to large-scale training datasets, improving search efficiency and practicality. HAWQv2 [55] enhances model quantization by using second-order information (the Hessian trace) to determine optimal quantization levels, thereby preserving accuracy while significantly reducing computational and memory costs. Tang [56] proposes the LIMPQ method, which speeds up the indicator training process by parallelizing the originally sequential training; with these learned importance indicators, the MPQ search problem is treated as a one-time integer linear programming (ILP) problem.
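As a toy illustration of the sensitivity-driven idea these methods share (not an implementation of any of them), the sketch below greedily spends a weight-storage budget on the layers with the largest sensitivity gain per extra bit; the sensitivity scores, layer sizes, and budget are made-up values.

```python
def allocate_bitwidths(sensitivities, sizes, budget_bits, candidates=(2, 4, 8)):
    """Toy mixed-precision allocation: start every layer at the lowest bit-width, then
    repeatedly raise the bit-width of the layer with the best sensitivity-per-extra-bit
    ratio until the total weight-storage budget (in bits) would be exceeded."""
    bits = [min(candidates)] * len(sizes)
    while True:
        used = sum(b * n for b, n in zip(bits, sizes))
        gains = []
        for i, b in enumerate(bits):
            higher = [c for c in candidates if c > b]
            if not higher:
                continue                      # layer already at the maximum bit-width
            nb = min(higher)
            if used + (nb - b) * sizes[i] <= budget_bits:
                # Sensitivity (e.g., a Hessian-style score) divided by the extra storage cost.
                gains.append((sensitivities[i] / ((nb - b) * sizes[i]), i, nb))
        if not gains:
            return bits
        _, i, nb = max(gains)
        bits[i] = nb

# Example: 4 layers with illustrative sensitivities and parameter counts, 6-bit average budget.
sens, sizes = [0.9, 0.2, 0.05, 0.6], [1e6, 2e6, 4e6, 1e6]
print(allocate_bitwidths(sens, sizes, budget_bits=6 * sum(sizes)))
```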
Existing mixed bit-width allocation methods still face the following challenges. Most methods do not consider adaptability to hardware; since the efficiency and accuracy of a quantized neural network model may vary significantly across different intelligent processors, it is necessary to incorporate hardware feedback on accuracy and speed into the objective function of the bit-width search. Meanwhile, these methods either still require substantial computational resources or are very sensitive to hyperparameters or even initialization, so a parsimonious and fast search technique is important.

3.4. The Accuracy Loss Evaluation of the Quantized Models

The quantized model needs to be tested to ensure that the loss of accuracy is within an acceptable range. Some researchers have designed quantization accuracy-loss assessment criteria from the perspective of statistical data analysis to improve the performance of quantized models. Qualcomm [57] designed the Signal-to-Quantization-Noise Ratio (SQNR) to measure the quantization accuracy of different quantization bit-widths. Feng [58] proposed a method to determine the quantization coefficients using the quantization mean squared error as a metric, together with a method to update the statistical parameters for small networks with serious performance loss. However, these accuracy-loss assessment criteria evaluate the loss from the perspective of data statistics, which deviates from the actual application data and lacks practicability in obtaining the accuracy loss on the task. Wang [59] constructs a quantization accuracy predictor based on the highly flexible once-for-all network [60], which encodes the model structure and the quantization strategy to directly predict the accuracy of the quantized model. However, collecting the quantization dataset for this method takes 16,000 GPU hours, which is expensive and time-consuming, and the predictor can only handle neural network models of a predetermined structure.
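For reference, SQNR can be computed directly from a tensor and its quantize-dequantize reconstruction, as in the short sketch below; the 8-bit symmetric scheme and the random test tensor are illustrative choices rather than the exact procedure of [57].

```python
import numpy as np

def sqnr_db(x_fp, x_deq):
    """Signal-to-Quantization-Noise Ratio in dB between a full-precision tensor and its
    dequantized reconstruction; higher values indicate lower quantization noise."""
    noise = x_fp - x_deq
    return 10 * np.log10(np.sum(x_fp ** 2) / np.sum(noise ** 2))

x = np.random.randn(1000).astype(np.float32)
s = np.abs(x).max() / 127                                  # 8-bit symmetric scale
x_q = np.clip(np.round(x / s), -127, 127) * s              # quantize-dequantize round trip
print(f"SQNR: {sqnr_db(x, x_q):.1f} dB")                   # drops as the bit-width is reduced
```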
Existing studies on the quantization of neural network models have rarely analyzed systematically the mechanism by which the quantization process generates computational accuracy loss, making it difficult to effectively measure the accuracy loss of different neural network models and different computational tasks after quantization. Therefore, how to accurately measure the accuracy loss of quantized models and how to provide a sound basis for choosing a quantization method are challenges that still need to be solved.

3.5. Quantization of LLMs

Large language models (LLMs) are state-of-the-art deep learning models designed for natural language processing tasks [61]. By integrating the latest quantization techniques, these models can achieve significant reductions in computational and memory costs while maintaining high accuracy. However, existing solutions may still face challenges in preserving performance on highly complex tasks and in ensuring generalizability across diverse datasets.
Existing quantization methods for large language models (LLMs) can be mainly divided into two categories: weight-only quantization and joint weight and activation quantization. The former compresses a large number of weights into lower bit-widths [62], effectively reducing the memory footprint of the models. The latter quantizes both weights and activations into mixed bit-widths [63], which accelerates matrix multiplications and significantly enhances computational speed.
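The following sketch illustrates the weight-only route in its simplest form: per-output-channel symmetric round-to-nearest quantization of a linear layer’s weights, with on-the-fly dequantization during the matrix multiplication. The 4-bit setting, the layer shape, and the naive rounding are illustrative assumptions; practical methods such as GPTQ [24] or QLoRA [62] add error compensation and more elaborate storage formats on top of this basic idea.

```python
import torch

def quantize_weight_only(w_fp, bits=4):
    """Per-output-channel symmetric weight-only quantization: low-bit integer weights
    plus one floating-point scale per row of the weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = w_fp.abs().amax(dim=1, keepdim=True) / qmax     # one scale per output channel
    w_q = torch.clamp(torch.round(w_fp / scale), -qmax, qmax).to(torch.int8)
    return w_q, scale

def linear_weight_only(x, w_q, scale, bias=None):
    """y = x @ W^T with W reconstructed from its integer representation on the fly;
    activations stay in floating point."""
    w_deq = w_q.to(x.dtype) * scale
    return torch.nn.functional.linear(x, w_deq, bias)

w = torch.randn(4096, 4096) * 0.02          # stand-in for an LLM projection matrix
x = torch.randn(1, 16, 4096)                # a short token sequence
w_q, s = quantize_weight_only(w, bits=4)
y = linear_weight_only(x, w_q, s)           # memory footprint shrinks ~8x vs. FP32 weights
```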
However, when dealing with significant activation outliers, existing methods often exhibit limited improvements or result in unstable gradients. This challenge highlights the need for more robust quantization techniques capable of effectively managing outliers without compromising stability. Future advancements in this field may focus on adaptive quantization strategies that dynamically adjust bit-widths based on activation distributions, ensuring both efficiency and accuracy. Additionally, exploring novel quantization-aware training methods could further enhance the robustness and generalizability of quantized LLMs, paving the way for their broader application in resource-constrained environments.

4. Experimental Results

4.1. Experimental Setting

We perform two sets of experiments, one comparing PTQ methods and the other comparing QAT methods with mixed bit-widths. To obtain the experimental results conveniently, software based on fake-quantization modules (v1.0) [30] is used to simulate the accuracy on the neural network accelerator. Fake quantization models the quantization error in the forward pass. The main reason to apply fake quantization is that it allows the effects of quantization to be simulated quickly using simulated quantization operations. The purpose of the experiments is to compare the state-of-the-art quantization methods. The CPU is an Intel(R) Core(TM) i7-8700K, 3.70 GHz, and the GPU is an NVIDIA GeForce GTX1070.
In the PTQ experiment, we use the classic image classification models GoogleNet and VGG16, the YOLOv1 model from the YOLO series for object detection, and the classic U-net model for image segmentation. GoogleNet and VGG16 are tested on the publicly available and authoritative ImageNet dataset to ensure a rigorous and fair comparison. The YOLOv1 model is applied to a custom ship detection dataset, while the U-net model is used with a custom remote sensing image segmentation dataset. The aim is to assess the precision loss of quantized models across targets of various sizes.
In the QAT experiment, we use the classic image classification model ResNet50, which is evaluated through comparative experiments on the publicly available and authoritative ImageNet dataset.
The accuracy is verified using software. The evaluation metrics are the accuracy metrics of the respective models. For image classification, we use Top-1 accuracy (the class with the highest predicted probability must exactly match the expected answer). For small-target detection, we use mAP (mean Average Precision), calculated in the same way as in the internationally renowned PASCAL VOC Challenge object detection competition. For image segmentation, we use the mean intersection over union (mIoU) metric.

4.2. Results

First, we compare the PTQ methods [41,42,43] on the image classification models, the small-target detection model, and the segmentation model. The results of the PTQ methods are shown in Table 1. Higher accuracy can be achieved with hybrid asymmetric quantization [43]. When the hybrid asymmetric quantization approach [43] is applied to the GoogleNet classification model, the resulting accuracy loss is a mere 0.39%, which is minimal compared with the traditional symmetric quantization methods [41,42]. Similarly, when applied to the VGG16 classification model, this method achieves an accuracy loss of only 0.52%. For object detection, applying the hybrid asymmetric quantization method [43] to the YOLOv1 model yields an accuracy loss of just 0.72%. Furthermore, in the U-net segmentation model, the method results in an accuracy loss of a mere 0.64%. The hybrid asymmetric quantization technique effectively reconciles the trade-off between the precision of quantized neural networks and the simplicity of their implementation, striking a balance between the clipping range and the quantization precision. Therefore, for PTQ, using hybrid quantization with symmetric weights and asymmetric activations can improve the accuracy of the quantized model as much as possible while keeping it hardware-friendly.
Then, we compare the QAT methods with the mixed bit-widths on the image classification model. The results of the QAT methods are shown in Table 2. We compare three methods for mixed-precision allocation and QAT: PACT [44], HAWQv2 [55], and LIMPQ [56].
It is evident that LIMPQ, owing to its ability to assign different precision levels to different layers based on their sensitivity, achieves the best results with minimal loss. LIMPQ gauges the importance of each neural network layer and assigns precision accordingly: pivotal layers keep the high precision needed to preserve functionality, while less critical layers operate at reduced precision. This adaptive allocation maintains the integrity of the model’s performance while enhancing overall efficiency, and it lays the groundwork for more compact yet capable network deployments. The approach is also applicable across a spectrum of neural network configurations, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer models, which makes LIMPQ a versatile tool for diverse computational needs. This advantage allows LIMPQ to strike the best balance between model-size reduction and accuracy retention among the three compared methods.
In conclusion, the quantization of neural network models can follow two technical routes, depending on the application requirements. When a large amount of training data is not available, neural network models can be quickly quantized using a PTQ method, as in [43]. When a large amount of training data can be obtained, the weights and quantization parameters of the model can be co-optimized using a mixed bit-width training framework such as [56].

5. Future Challenges and Trends

Quantization is an essential technique for model deployment and practical application, helping to balance model performance and efficiency. The main challenge of quantization is how to minimize the accuracy loss caused by the reduction of numerical bit-width. From the application perspective, the requirement for the quantization of neural network models is that the bit-width should be set as low as possible to reduce computational energy consumption and improve inference speed, while the accuracy loss due to quantization should be as small as possible. To solve this problem, the weights of the model and the quantization parameters need to be co-optimized. Co-optimization raises the problems of how to build a quantization operator that is differentiable everywhere, how to evaluate the loss of quantization accuracy, and how to improve the robustness of the model after quantization. To solve the problem that the back propagation of the quantization operation is not differentiable during the training of existing neural network models, it is necessary to model a full-domain differentiable quantization operator. To solve the problem that existing quantization accuracy evaluation methods rely on actual measurement, which requires manual participation and is too time-consuming to be integrated into a co-optimization algorithm, it is necessary to study methods for modelling the quantization accuracy loss. To solve the problem of the poor performance of existing quantization methods in real intelligent application scenarios, methods to improve the robustness of neural network models after quantization need to be investigated. Therefore, the future development trends of quantization technology are to construct fast adaptive updating of quantization parameters, explore the mechanism of the precision loss caused by quantization, and propose automatic compensation and robustness enhancement methods for quantizing neural network models.

Author Contributions

Conceptualization, L.W. and Z.M.; methodology, L.W.; software, C.Y.; validation, C.Y. and Q.Y.; formal analysis, L.W. and Z.M.; investigation, L.W. and C.Y.; resources, L.W. and C.Y.; data curation, C.Y.; writing—original draft preparation, L.W.; writing—review and editing, Z.M.; project administration, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  2. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  3. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In International Conference on Learning Representations. arXiv 2023, arXiv:2203.03605. [Google Scholar]
  4. Zong, Z.; Song, G.; Liu, Y. Detrs with collaborative hybrid assignments training. In Proceedings of the International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6725–6735. [Google Scholar]
  5. Zhou, X.S.; Wu, W.L. Unmanned system swarm intelligence and its research progresses. Microelectron. Comput. 2021, 38, 1–7. [Google Scholar] [CrossRef]
  6. Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision transformer adapter for dense predictions. In International Conference on Learning Representations. arXiv 2023, arXiv:2205.08534. [Google Scholar]
  7. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19358–19369. [Google Scholar]
  8. Su, W.; Zhu, X.; Tao, C.; Lu, L.; Li, B.; Huang, G.; Qiao, Y.; Wang, X.; Zhou, J.; Dai, J. Towards all-in-one pre-training via maximizing multi-modal mutual information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15888–15899. [Google Scholar]
  9. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  10. Tang, L.; Ma, Z.; Li, S.; Wang, Z.X. The present situation and developing trends of space-based intelligent computing technology. Microelectron. Comput. 2022, 39, 1–8. [Google Scholar] [CrossRef]
  11. Bianco, S.; Cadene, R.; Celona, L.; Napoletano, P. Benchmark analysis of representative deep neural network architectures. IEEE Access 2018, 6, 64270–64277. [Google Scholar] [CrossRef]
  12. Hong, J.; Duan, J.; Zhang, C.; Li, Z.; Xie, C.; Lieberman, K.; Diffenderfer, J.; Bartoldson, B.; Jaiswal, A.; Xu, K.; et al. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs under Compression. Computing Research Repository. arXiv 2024, arXiv:2403.15447. [Google Scholar]
  13. Agustsson, E.; Theis, L. Universally quantized neural compression. Adv. Neural Inf. Process. Syst. 2020, 33, 12367–12376. [Google Scholar]
  14. Banner, R.; Nahshan, Y.; Soudry, D. Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv 2018, arXiv:1810.05723. [Google Scholar]
  15. Bulat, A.; Martinez, B.; Tzimiropoulos, G. High-capacity expert binary networks. International Conference on Learning Representations. arXiv 2021, arXiv:2010.03558. [Google Scholar]
  16. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. arXiv 2021, arXiv:2101.09671. [Google Scholar] [CrossRef]
  17. Garg, S.; Jain, A.; Lou, J.; Nahmias, M. Confounding tradeoffs for neural network quantization. arXiv 2021, arXiv:2102.06366. [Google Scholar]
  18. Garg, S.; Lou, J.; Jain, A.; Guo, Z.; Shastri, B.J.; Nahmias, M. Dynamic precision analog computing for neural networks. arXiv 2021, arXiv:2102.06365. [Google Scholar] [CrossRef]
  19. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics. arXiv 2018, arXiv:1810.04805, 4171–4186. [Google Scholar]
  20. Tom, B.B.; Benjamin, M.; Nick, R.; Melanie, S.; Jared, K. Language Models are Few-Shot Learners. Conf. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  21. Zhu, X.; Li, J.; Liu, Y.; Ma, C.; Wang, W. A Survey on Model Compression for Large Language Models. CoRR arXiv 2023, arXiv:2308.07633. [Google Scholar]
  22. Zhang, Y.; Huang, D.; Liu, B.; Tang, S.; Lu, Y.; Chen, L.; Bai, L.; Chu, Q.; Yu, N.; Ouyang, W. MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7368–7376. [Google Scholar] [CrossRef]
  23. Xu, Z.; Cristianini, N. QBERT: Generalist Model for Processing Questions. In Advances in Intelligent Data Analysis XXI; Springer: Berlin/Heidelberg, Germany, 2022; pp. 472–483. [Google Scholar]
  24. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
  25. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  26. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv 2021, arXiv:2103.13630. [Google Scholar]
  27. Nagel, M.; Fournarakis, M.; Amjad, R.A.; Bondarenko, Y.; Van Baalen, M.; Blankevoort, T. A White Paper on Neural Network Quantization. arXiv 2021, arXiv:2106.08295. [Google Scholar]
  28. Li, Y.; Dong, X.; Wang, W. Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks. arXiv 2020, arXiv:1909.13144. [Google Scholar]
  29. Liu, Z.; Wang, Y.; Han, K.; Zhang, W.; Ma, S.; Gao, W. Post-training quantization for vision transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 28092–28103. [Google Scholar]
  30. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integerarithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [Google Scholar]
  31. Li, Y.; Xu, S.; Zhang, B.; Cao, X.; Gao, P.; Guo, G. Q-vit: Accurate and fully quantized low-bit vision transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 34451–34463. [Google Scholar]
  32. Yao, Z.; Dong, Z.; Zheng, Z.; Gholami, A.; Yu, J.; Tan, E.; Wang, L.; Huang, Q.; Wang, Y.; Mahoney, M. Hawqv3: Dyadic neural network quantization. arXiv 2020, arXiv:2011.10680. [Google Scholar]
  33. McKinstry, J.L.; Esser, S.K.; Appuswamy, R.; Bablani, D.; Arthur, J.V.; Yildiz, I.B.; Modha, D.S. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv 2018, arXiv:1809.04191. [Google Scholar]
  34. Krishnamoorthi, R. Quantizing deep convolutional net-works for efficient inference: A whitepaper. arXiv 2018, 8, 667–668. [Google Scholar]
  35. Wu, H.; Judd, P.; Zhang, X.; Isaev, M.; Micikevicius, P. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv 2020, arXiv:2004.09602. [Google Scholar]
  36. Migacz, S. 8-Bit Inference with TensorRT. GPU Technology Conference 2, 7. 2017. Available online: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf (accessed on 8 May 2017).
  37. Chen, T.; Moreau, T.; Jiang, Z.; Zheng, L.; Yan, E. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, USA, 8–10 October 2018; pp. 578–594. [Google Scholar]
  38. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-bit quantization of neural networks for efficient inference. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3009–3018. [Google Scholar]
  39. Shin, S.; Hwang, K.; Sung, W. Fixed-point performance analysis of recurrent neural networks. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 976–980. [Google Scholar]
  40. Sung, W.; Shin, S.; Hwang, K. Resiliency of deep neural networks under quantization. arXiv 2015, arXiv:1511.06488. [Google Scholar]
  41. Zhao, R.; Hu, Y.W.; Dotzel, J. Improving neural network quantization without retraining using outlier channel splitting. arXiv 2019, arXiv:1901.09504. [Google Scholar]
  42. Park, E.; Ahn, J.; Yoo, S. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 26 July 2017; pp. 5456–5464. [Google Scholar]
  43. Wei, L.; Ma, Z.; Yang, C. Activation Redistribution Based Hybrid Asymmetric Quantization Method of Neural Networks. CMES 2024, 138, 981–1000. [Google Scholar] [CrossRef]
  44. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.-J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv 2018, arXiv:1805.06085. [Google Scholar]
  45. Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.; Yan, J. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4852–4861. [Google Scholar]
  46. Jin, Q.; Yang, L.; Liao, Z. Towards efficient training for neural network quantization. arXiv 2019, arXiv:1912.10207. [Google Scholar]
  47. Liu, J.; Cai, J.; Zhuang, B. Sharpness-aware Quantization for Deep Neural Networks. arXiv 2021, arXiv:2111.12273. [Google Scholar] [CrossRef]
  48. Zhuang, B.; Liu, L.; Tan, M.; Shen, C.; Reid, I. Training Quantized Neural Networks With a Full-Precision Auxiliary Module. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1488–1497. [Google Scholar] [CrossRef]
  49. Diao, H.; Li, G.; Xu, S.; Kong, C.; Wang, W. Attention Round for post-training quantization. Neurocomputing 2024, 565, 127012. [Google Scholar] [CrossRef]
  50. Wu, B.; Wang, Y.; Zhang, P.; Tian, Y.; Vajda, P.; Keutzer, K. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv 2018, arXiv:1812.00090. [Google Scholar]
  51. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8612–8620. [Google Scholar]
  52. Dong, Z.; Yao, Z.; Gholami, A.; Mahoney, M.; Keutzer, K. Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 293–302. [Google Scholar]
  53. Dong, P.; Li, L.; Wei, Z.; Niu, X.; Tian, Z.; Pan, H. Emq: Evolving training-free proxies for automated mixed precision quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17076–17086. [Google Scholar]
  54. Tang, C.; Ouyang, K.; Chai, Z.; Bai, Y.; Meng, Y.; Wang, Z.; Zhu, W. SEAM: Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7971–7980. [Google Scholar]
  55. Dong, Z.; Yao, Z.; Cai, Y.; Arfeen, D.; Gholami, A.; Mahoney, M.W.; Keutzer, K. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. In Advances in Neural Information Processing Systems; JMLR: New York, NY, USA, 2020; pp. 18518–18529. [Google Scholar]
  56. Tang, C.; Ouyang, K.; Wang, Z.; Zhu, Y.; Ji, W.; Wang, Y.; Zhu, W. Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar] [CrossRef]
  57. Sheng, T.; Feng, C.; Zhuo, S.; Zhang, X.; Shen, L.; Aleksic, M. A quantization-friendly separable convolution for mobilenets. In Proceedings of the 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Williamsburg, VA, USA, 25 March 2018. [Google Scholar]
  58. Feng, P.; Yu, L.; Tian, S.W.; Geng, J.; Gong, G.L. Quantization of 8-bit deep neural networks based on mean square error. Comput. Eng. Des. 2022, 43, 1258–1264. [Google Scholar]
  59. Wang, T.; Wang, K.; Cai, H.; Lin, J.; Liu, Z.; Wang, H.; Lin, Y.; Han, S. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; Volume 2006.08509, pp. 2075–2084. [Google Scholar]
  60. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 23318–23340. [Google Scholar]
  61. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
  62. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. arXiv 2023, arXiv:2305.14314. [Google Scholar]
  63. Wei, X.; Zhang, Y.; Zhang, X.; Gong, R.; Zhang, S.; Zhang, Q.; Yu, F.; Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models. NeurIPS 2022, 35, 17402–17414. [Google Scholar]
Figure 1. Quantization: Turning the 32-bit floating-point neural network to an 8-bit or even lower bit integer network.
Figure 2. An example of quantization.
Figure 3. The activation distributions of three representative convolutional layers of the YOLOv3-tiny model. These distributions are asymmetric. The horizontal axis represents the activation value, and the vertical axis represents the activation density. (a) The activation distribution of layer 1. (b) The activation distribution of layer 23. (c) The activation distribution of layer 3.
Figure 4. The steps of PTQ.
Figure 5. The steps of QAT.
Table 1. Results of the PTQ methods.

Task                   Model      PC Accuracy (FP32)   Method [42]   Method [41]   Method [43]
Image Classification   GoogleNet  67.04                65.22         65.91         66.65
Image Classification   VGG16      66.13                64.72         62.53         65.61
Object Detection       YOLOv1     61.99                59.41         60.94         61.27
Image Segmentation     U-net      82.78                82.13         81.71         82.14
Table 2. Results of the mixed-precision allocation and QAT methods for ResNet50 on the ImageNet dataset. “W-C” stands for weight compression rate.

Method        W-Bits              B-Bits              Top-1   W-C
PACT [44]     3                   3                   67.57   10.67x
HAWQv2 [55]   3-Mixed-Precision   3-Mixed-Precision   68.62   12.2x
LIMPQ [56]    3-Mixed-Precision   4-Mixed-Precision   70.15   12.3x
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
