Article

Learning Bilateral Clipping Parametric Activation for Low-Bit Neural Networks

School of Mathematical Science, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(9), 2001; https://doi.org/10.3390/math11092001
Submission received: 27 March 2023 / Revised: 14 April 2023 / Accepted: 22 April 2023 / Published: 23 April 2023
(This article belongs to the Special Issue Artificial Intelligence Applications in Complex Networks)

Abstract

Among various network compression methods, network quantization has developed rapidly due to its superior compression performance. However, trivial activation quantization schemes limit the compression performance of network quantization. Most conventional activation quantization methods directly utilize rectified activation functions to quantize models, yet their unbounded outputs generally yield drastic accuracy degradation. To tackle this problem, we propose a comprehensive activation quantization technique, namely the Bilateral Clipping Parametric Rectified Linear Unit (BCPReLU), as a generalized version of all rectified activation functions, which limits the quantization range more flexibly during training. Specifically, trainable slopes and thresholds are introduced for both positive and negative inputs to find more flexible quantization scales. We theoretically demonstrate that BCPReLU has approximately the same expressive power as the corresponding unbounded version and establish its convergence in low-bit quantization networks. Extensive experiments on a variety of datasets and network architectures demonstrate the effectiveness of our trainable clipping activation function.

1. Introduction

Deep neural networks (DNNs) have achieved great success in various computer vision tasks, such as image classification [1,2], object detection [3,4], and image segmentation [5,6]. However, their high model complexity hinders large-scale deployment on resource-constrained devices. For instance, to classify a 224 × 224 image, the VGG-16 model consumes 550 MB of storage and 15.5 billion floating-point operations (FLOPs). For embedded applications, these resource demands are prohibitive. Determining how to deploy deep neural networks on embedded systems has therefore become an urgent problem for both academia and industry.
Among the wide range of neural network compression and acceleration methods [7,8], network quantization [7,9] reduces storage and computational cost by converting 32-bit floating-point values to low-precision representations and achieves excellent compression performance. Network quantization can be roughly grouped into two categories, namely weight quantization and activation quantization. A plethora of weight quantization methods have been studied and proposed. Han et al. [10] quantized the model jointly with pruning, which reduced the storage required by AlexNet from 240 MB to 6.9 MB. Liu et al. [11] proposed a one-shot mixed-precision quantization method to adaptively adjust bit-width precision.
In addition to weight quantization, rectified activation functions also play an important role in both full-precision and quantized networks: their non-saturating behavior alleviates the exploding/vanishing gradient problem, thereby speeding up model convergence. Among these activation functions, the Rectified Linear Unit (ReLU) is one of the keys to the recent success of deep networks. To alleviate the problem of zero gradients, leaky ReLU (LReLU) applies a small fixed slope to negative inputs. PReLU [12] introduces a trainable slope for negative inputs rather than a fixed one; this trainable slope was a key factor in surpassing human-level classification performance on the ImageNet 2012 dataset. Furthermore, RReLU [13], SE [14], and DY-ReLU [15] have also been proposed to further improve classification performance. To better illustrate the differences between the above-mentioned activation functions, we compare them in Figure 1.
While rectified activation functions have achieved great success in floating-point networks, their unbounded outputs make activation quantization difficult. The quantization range determines the trade-off between model accuracy and quantization error. Specifically, a smaller quantization range mitigates the rounding error but restricts the numerical range available during training, whereas a larger quantization range preserves a more flexible numerical range that can yield higher accuracy. Thus, a question arises: how does one obtain a reasonable quantization range? Traditional activation quantization methods [16,17,18,19] directly transform activations into [−1, 1] or [0, 1] to limit the output range, and this trivial clipping strategy limits the performance of activation quantization. So far, several clipping activation functions have been proposed to address the unboundedness problem. The quantization scheme in [20] with ReLU6 achieves less accuracy degradation by placing a fixed upper bound on the activation values, which shows that a fixed range is easier to quantize. Half-Wave Gaussian Quantization (HWGQ) [21] utilizes the statistics of network activations to propose a variant of ReLU that constrains the unbounded values. Nevertheless, a fixed upper bound during training yields a drastic accuracy drop. To improve quantization performance more flexibly, Parameterized Clipping Activation (PACT) [22] places a trainable upper bound on the output of standard ReLU, which achieves full-precision accuracy with 4-bit precision. However, these conventional clipping functions mainly focus on standard ReLU and lack an adaptive treatment of all rectified activation functions.
In this paper, we propose a Bilateral Clipping Parametric Rectified Linear Unit (BCPReLU) as a generalization of ReLU and its variants. Learnable two-sided thresholds and non-zero slopes for the activation input are introduced during training. We prove that the expressive ability of BCPReLU is almost the same as that of the corresponding unbounded function, and we establish the convergence of the quantization error. Extensive experiments on a variety of popular models and datasets demonstrate that our proposed clipping activation function achieves better compression performance in both full-precision and quantized networks. Our contributions are summarized below:
  • We propose a novel clipping activation function as a generalized version of rectified activation functions. Trainable slopes and thresholds for both positive and negative inputs are introduced in BCPReLU.
  • We theoretically prove that BCPReLU has almost the same expressive ability as the corresponding unbounded function in a full-precision network, and we establish the convergence of BCPReLU in the quantized network.
  • Extensive experiments on CIFAR-10 and ImageNet datasets demonstrate the effectiveness of BCPReLU.
The rest of the paper is organized as follows: Section 2 provides a summary of prior work on quantization. Section 3 proposes a novel clipping activation function, in which the expressive ability in the full-precision network is analyzed, and the convergence in the quantized network is established. In Section 4, we demonstrate the effectiveness of our quantization scheme.

2. Related Work

Currently, network quantization [23,24,25] has become an efficient strategy for compressing models with limited accuracy degradation. Quantization approaches can be roughly divided into weight quantization and activation quantization. Early quantization works [26,27,28,29] are mainly concerned with weight quantization, which quantizes weights into one bit (binary) or two bits (ternary). Recently, other weight quantization schemes [30,31,32] based on optimization or approximation have been proposed. For example, Yang et al. [30] formulated weight quantization as a differentiable nonlinear mapping function. Cai et al. [21] used piece-wise backward approximators to overcome the gradient mismatch problem. In addition to weight quantization, activation quantization is another essential factor that affects the final quantization performance. In order to maximally utilize bit-wise operations, activation quantization methods [16,33] have gained considerable attention. Conventional activation quantization methods directly utilize the activation distribution or the range of activation outputs to quantize models, in which the quantization range is essential to quantization performance: the range must be large enough to reduce the clipping error, while at the same time small enough to keep the rounding error from becoming too large. The quantization scheme of [20] is limited to the ReLU function and chooses a quantization range of [0, c]. Outlier Channel Splitting (OCS) [34] exploits channel splitting to avoid outliers: it duplicates channels containing outliers and then halves their input activations. Furthermore, ACIQ [35] also limits the range of activation values, approximating the optimal clipping value analytically from the distribution of the tensor by minimizing the mean-square-error measure. These works reveal that a fixed clipping range, when chosen well, can achieve state-of-the-art accuracy. However, a fixed quantization range on the output is suboptimal due to differences across layers and channels. Instead of fixing the quantization range, recent quantization methods [22,36] propose learnable soft-clipping quantization, which automatically learns a suitable quantization range for each layer and then linearly quantizes the values in the optimized range to M bits.

3. BCPReLU and Quantization Analysis

In this section, we first propose a novel clipping activation function and then prove that it has almost the same representation ability as the corresponding unbounded function in the full-precision network. Under a reasonable assumption on the activation distribution, we finally establish the convergence of the quantization error.

3.1. Novel Activation Function

As stated in the previous section, traditional clipping activation quantization mainly focuses on standard ReLU and lacks an adaptive consideration of all rectified activation functions. To overcome this limitation, we propose a more comprehensive trainable clipping activation function (BCPReLU) as follows:
y = f(x; \mu, k_1, \alpha, k_2) =
\begin{cases}
-k_1 \mu, & x \in (-\infty, -\mu) \\
k_1 x, & x \in [-\mu, 0) \\
k_2 x, & x \in [0, \alpha) \\
k_2 \alpha, & x \in [\alpha, +\infty)
\end{cases}
Here, x is the input of the nonlinear activation and y is its output; the parameters k_1 and k_2 are trainable and control the slopes of the negative and positive inputs, respectively, while μ and α are the clipping thresholds of the negative and positive inputs, respectively. In fact, the proposed activation function includes a large number of traditional activations as special cases: when k_1 = 0 and k_2 = 1, the clipping activation function becomes PACT (the trainable clipping ReLU); when k_1 is a small fixed value and k_2 = 1, it becomes a clipped form of LReLU; and when k_1 is a trainable variable and k_2 = 1, it becomes a clipped form of PReLU.
As shown in Figure 2, when μ → +∞ and α → +∞, the output of BCPReLU is unbounded, which incurs a large quantization error. To limit the range of the output, it is necessary to place reasonable clipping values. The clipping values −k_1μ and k_2α in BCPReLU limit the output range, while k_1 and μ also enhance the expressive power for negative inputs and avoid dead neurons.
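For concreteness, the following minimal PyTorch sketch of Equation (1) is our own illustration (the class name, the per-layer scalar parameterization, and the initial values are assumptions), not the authors' released code:

import torch
import torch.nn as nn

class BCPReLU(nn.Module):
    # Bilateral Clipping Parametric ReLU (Equation (1)); initial values are illustrative.
    def __init__(self, k1=0.25, k2=1.0, mu=5.0, alpha=10.0):
        super().__init__()
        self.k1 = nn.Parameter(torch.tensor(float(k1)))        # slope of the negative input
        self.k2 = nn.Parameter(torch.tensor(float(k2)))        # slope of the positive input
        self.mu = nn.Parameter(torch.tensor(float(mu)))        # negative clipping threshold
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))  # positive clipping threshold

    def forward(self, x):
        zero = torch.zeros_like(x)
        # Negative branch: k1*x on [-mu, 0), saturating at -k1*mu for x < -mu.
        neg = self.k1 * torch.maximum(torch.minimum(x, zero), -self.mu)
        # Positive branch: k2*x on [0, alpha), saturating at k2*alpha for x >= alpha.
        pos = self.k2 * torch.minimum(torch.maximum(x, zero), self.alpha)
        return neg + pos

act = BCPReLU()
y = act(torch.randn(4, 8))   # drop-in replacement for ReLU/PReLU in a network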

3.2. Expressive Ability

In this subsection, we consider the effect of the clipping values and compare BCPReLU with its unbounded counterpart in a full-precision network. Following PACT [22], we summarize the result in the following theorem:
Theorem 1. 
Assume that x is an activation input and y is the corresponding output. The network with BCPReLU can be trained to produce the same output as the corresponding unbounded function and converges faster than the network with the unbounded function.
Proof of Theorem 1. 
Assume that y* is the label corresponding to x; the cost function is defined as the mean-square error (MSE):
L(y) = \frac{1}{2} (y - y^*)^2
We define the unbounded function corresponding to BCPReLU as g:
g = g(x; k_1, k_2) =
\begin{cases}
k_1 x, & x \in (-\infty, 0) \\
k_2 x, & x \in [0, +\infty)
\end{cases}
If x ∈ [−μ, α], the network with BCPReLU (Equation (1)) behaves the same as the network with the unbounded function g (Equation (2)).
If x < −μ, then y = −k_1μ and g = k_1 x. The parameters k_2 and α are not updated because
\frac{\partial y}{\partial k_2} = \frac{\partial y}{\partial \alpha} = 0.
Updating k_1 and μ gives
k_1^{\mathrm{new}} = k_1 - \eta \frac{\partial L}{\partial k_1} = k_1 - \eta \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial k_1} = k_1 + \eta \mu \frac{\partial L}{\partial y},
\mu^{\mathrm{new}} = \mu - \eta \frac{\partial L}{\partial \mu} = \mu - \eta \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial \mu} = \mu + \eta k_1 \frac{\partial L}{\partial y}.
The updates of k_1 and μ depend on ∂L/∂y, so we consider the different cases of ∂L/∂y = y − y*:
Case 1: If y* < k_1 x, then y* < y and ∂L/∂y > 0, so k_1 increases and μ increases; thus, y = −k_1μ decreases until −μ < x, i.e., y = g. In this case, the network with BCPReLU (Equation (1)) behaves the same as the network with the unbounded function g (Equation (2)).
Case 2: If k_1 x < y* < y, then ∂L/∂y > 0, so k_1 increases and μ increases; thus, y = −k_1μ decreases and converges to y*.
Case 3: If y < y*, then ∂L/∂y < 0, so k_1 decreases and μ decreases; thus, y = −k_1μ increases and converges to y*. Note that in Cases 2 and 3, the output of BCPReLU converges to the target y* faster than the corresponding unbounded function g.
If x > α, then y = k_2α, and k_1 and μ are not updated. The analysis is similar to the case x < −μ. □
From Theorem 1, we know that BCPReLU has almost the same expressive power as the corresponding unbounded function.
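As a quick sanity check of the updates used in the proof (our own illustration, not part of the paper), automatic differentiation of the BCPReLU sketch above reproduces ∂y/∂k_1 = −μ and ∂y/∂μ = −k_1 when the lower clip is active (x < −μ):

import torch

k1 = torch.tensor(0.25, requires_grad=True)   # slope of the negative input
mu = torch.tensor(5.0, requires_grad=True)    # negative clipping threshold
x = torch.tensor(-8.0)                        # x < -mu, so y = -k1*mu (lower clip active)

y = k1 * torch.maximum(torch.minimum(x, torch.zeros(())), -mu)
y.backward()
print(k1.grad.item(), mu.grad.item())         # -5.0 and -0.25, i.e., -mu and -k1

With the MSE loss of Theorem 1, multiplying these gradients by ∂L/∂y = y − y* recovers the update directions analyzed in Cases 1–3.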

3.3. Convergence Analysis

The proposed activation function limits the range of activations to [−k_1μ, k_2α], and the bounded output range is then linearly quantized to M bits as follows:
y_q = \mathrm{round}\left( y \cdot \frac{2^M - 1}{k_1 \mu + k_2 \alpha} \right) \cdot \frac{k_1 \mu + k_2 \alpha}{2^M - 1}
During the training stage, k_1, μ, k_2, and α are trainable variables. The gradients with respect to these parameters can be computed using the Straight-Through Estimator (STE) [37]. Thus,
\frac{\partial y_q}{\partial \mu} = \frac{\partial y_q}{\partial y} \cdot \frac{\partial y}{\partial \mu} =
\begin{cases}
-k_1, & x \in (-\infty, -\mu) \\
0, & x \in [-\mu, +\infty)
\end{cases}
\frac{\partial y_q}{\partial k_1} = \frac{\partial y_q}{\partial y} \cdot \frac{\partial y}{\partial k_1} =
\begin{cases}
-\mu, & x \in (-\infty, -\mu) \\
x, & x \in [-\mu, 0) \\
0, & x \in [0, +\infty)
\end{cases}
\frac{\partial y_q}{\partial k_2} = \frac{\partial y_q}{\partial y} \cdot \frac{\partial y}{\partial k_2} =
\begin{cases}
0, & x \in (-\infty, 0) \\
x, & x \in [0, \alpha) \\
\alpha, & x \in [\alpha, +\infty)
\end{cases}
\frac{\partial y_q}{\partial \alpha} = \frac{\partial y_q}{\partial y} \cdot \frac{\partial y}{\partial \alpha} =
\begin{cases}
0, & x \in (-\infty, \alpha) \\
k_2, & x \in [\alpha, +\infty)
\end{cases}
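A minimal sketch of this uniform M-bit quantizer with the STE (our own illustration; the function name and arguments are assumptions, and the clipping is assumed to have already been applied by BCPReLU) could look as follows:

import torch

def quantize_ste(y, lower, upper, num_bits):
    # Uniformly quantize y, assumed to lie in [lower, upper] = [-k1*mu, k2*alpha],
    # with 2**num_bits levels; gradients pass straight through the rounding.
    step = (upper - lower) / (2 ** num_bits - 1)
    y_q = torch.round(y / step) * step
    # Forward value is y_q; backward gradient is d y_q / d y = 1 (STE), so gradients
    # with respect to k1, mu, k2, alpha flow through y as in the derivatives above.
    return y + (y_q - y).detach()

For example, one might call quantize_ste(act(x), -act.k1 * act.mu, act.k2 * act.alpha, 4) for 4-bit activations, so that gradients reach k_1, μ, k_2, and α only through y.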
Next, we consider the quantization error of BCPReLU. Under a reasonable bell-shaped distribution assumption [35], the convergence of the quantization error is established as follows:
Theorem 2. 
Assume that the activation input x is a random variable with probability density function f(x) = Laplace(0, b). Then the quantization error of BCPReLU converges to zero as the bit-width M → ∞.
A proof of Theorem 2 is given in Appendix A. This theorem shows that our proposed activation function, combined with the common quantization method, satisfies the desired property of the quantization error; i.e., the quantization error converges to zero as the bit-width M approaches infinity.
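To illustrate Theorem 2 numerically, the following Monte-Carlo sketch (our own check under assumed parameter values, not one of the paper's experiments) estimates E[(y − y_q)^2] for Laplace(0, b) inputs and shows it shrinking as M grows:

import torch

def expected_quant_error(num_bits, k1=0.25, k2=1.0, mu=5.0, alpha=10.0, b=1.0, n=200000):
    # Sample x ~ Laplace(0, b), apply BCPReLU (Equation (1)), then M-bit uniform quantization.
    x = torch.distributions.Laplace(0.0, b).sample((n,))
    y = k1 * torch.clamp(x, min=-mu, max=0.0) + k2 * torch.clamp(x, min=0.0, max=alpha)
    step = (k1 * mu + k2 * alpha) / (2 ** num_bits - 1)
    y_q = torch.round(y / step) * step
    return torch.mean((y - y_q) ** 2).item()

for m in (2, 4, 8):
    print(m, expected_quant_error(m))   # the estimated error decreases as M increases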

4. Experiments

To demonstrate the effectiveness of BCPReLU, we evaluate it on several well-known models: ResNet20/32 [1] on the CIFAR-10 dataset and ResNet18 [1] on the ImageNet dataset. The CIFAR-10 dataset contains images from 10 classes; each image is a 32 × 32 RGB image, and there are 50,000 training images and 10,000 test images. The ImageNet dataset consists of 1000 classes, with 1.28 million training images and 50,000 validation images; each image is a 256 × 256 RGB image. All experiments are based on the PyTorch framework, and we used an NVIDIA Tesla K20c GPU for training and testing.
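For reference, a standard CIFAR-10 input pipeline in PyTorch is sketched below; the augmentation and normalization choices are common defaults and our own assumptions, since the paper does not specify them:

import torch
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),          # assumed augmentation, not stated in the paper
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)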

4.1. Equivalent Form of BCPReLU

In general, the more trainable parameters a model has, the more easily it overfits. To mitigate this problem, without loss of generality, we consider the case where the convolution and BCPReLU are fused to derive an equivalent form of BCPReLU. Consider a single-neuron network with BCPReLU, where (b, y*) is a training sample and w denotes the weight. The activation input is then x = wb, and the activation output is:
y =
\begin{cases}
-k_1 \mu, & wb \in (-\infty, -\mu) \\
k_1 wb, & wb \in [-\mu, 0) \\
k_2 wb, & wb \in [0, \alpha) \\
k_2 \alpha, & wb \in [\alpha, +\infty)
\end{cases}
Denoting k_2 w, k_2 μ, and k_2 α by w*, μ*, and α*, respectively, we obtain
y =
\begin{cases}
-\frac{k_1}{k_2} \mu^*, & w^* b \in (-\infty, -\mu^*) \\
\frac{k_1}{k_2} w^* b, & w^* b \in [-\mu^*, 0) \\
w^* b, & w^* b \in [0, \alpha^*) \\
\alpha^*, & w^* b \in [\alpha^*, +\infty)
\end{cases}
The slope k_2 of the positive input can thus be absorbed into the training of k_1, μ, α, and the weight. The proposed clipping activation function is therefore equivalent to the following bounded form of PReLU [12]:
y = f(x; \mu, k, \alpha) =
\begin{cases}
-k \mu, & x \in (-\infty, -\mu) \\
k x, & x \in [-\mu, 0) \\
x, & x \in [0, \alpha) \\
\alpha, & x \in [\alpha, +\infty)
\end{cases}
The parameter k controls the slope of the negative input, and −kμ and α are the lower and upper bounds of y, respectively. Note that we use this equivalent form (Equation (3)) in the following experiments.
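A sketch of this three-parameter equivalent form (again our own illustration; the class name is an assumption), initialized as in Section 4.2, might be:

import torch
import torch.nn as nn

class BCPReLUEq(nn.Module):
    # Equivalent bounded form of Equation (3): slope k for negative inputs,
    # output clipped to [-k*mu, alpha]; k2 has been absorbed into the weights.
    def __init__(self, k=0.25, mu=5.0, alpha=10.0):
        super().__init__()
        self.k = nn.Parameter(torch.tensor(float(k)))
        self.mu = nn.Parameter(torch.tensor(float(mu)))
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))

    def forward(self, x):
        zero = torch.zeros_like(x)
        neg = self.k * torch.maximum(torch.minimum(x, zero), -self.mu)
        pos = torch.minimum(torch.maximum(x, zero), self.alpha)
        return neg + pos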

4.2. Trainable Parameters

We have transformed the original function into a simpler equivalent form. We use α = 10, k = 0.25, and μ = 5 as the initialization, with regularization. To better understand the more flexible quantization scale, the learned parameters of BCPReLU are shown in Table 1. It can be observed that the trainable parameter α converges to a much smaller value after several epochs of training, and the other learned parameters k and μ rarely have a magnitude larger than 1 in our experiment. Thus, our learned activation function limits the dynamic range of activations and reduces the quantization error, leading to better performance in terms of accuracy.

4.3. Comparison of PReLU and BCPReLU

For the BCPReLU experiments, we only replace PReLU with BCPReLU and keep the other hyper-parameters the same in the full-precision network. Figure 3 shows the validation error of PReLU and BCPReLU on CIFAR10-ResNet20. The curves show that BCPReLU has almost the same expressive ability as PReLU in the full-precision network, which is consistent with the theoretical analysis of Theorem 1.

4.4. Different Bit-Width Quantization Performance

In this subsection, we compare the full-precision network with low-bit (4-bit and 8-bit) networks. The validation error curves for different bit-widths are shown in Figure 4. We observe that the validation error gradually decreases as the bit-width increases. This result is in line with the theoretical analysis of Theorem 2; i.e., as the bit-width M increases, the quantization error of BCPReLU decreases.

4.5. Accuracy Performance Comparison

To evaluate our method, we compare it with PACT [22] on several well-known CNNs: ResNet [1] on the CIFAR-10 and ImageNet datasets. ResNet20/32 on CIFAR-10 is trained with stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate starts at 0.1 and is scaled by 0.1 at epochs 60 and 120. A mini-batch size of 128 is used, and the maximum number of epochs is 200. ResNet18 on ImageNet uses the same training parameters as PACT. During training, the initial values of k, μ, and α are set to 0.25, 5, and 10, respectively. Table 2 summarizes the quantization accuracy on the CIFAR-10 and ImageNet datasets. As shown in Table 2, BCPReLU consistently achieves competitive accuracy across the networks, and with 2-bit quantization it achieves higher accuracy than PACT. This is reasonable because the parameters of BCPReLU adapt more flexibly during training than those of PACT.
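These hyper-parameters map directly onto a standard PyTorch training setup; the sketch below is our own illustration of that configuration (the placeholder model and the loader from Section 4 are assumptions), not the authors' training script:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # placeholder; the paper uses ResNet20/32 with BCPReLU
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):                      # maximum of 200 epochs
    for images, labels in train_loader:       # mini-batch size 128 (loader sketched in Section 4)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # learning rate scaled by 0.1 at epochs 60 and 120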

5. Conclusions

In this paper, we have proposed a learnable clipping activation function to improve low-bit quantization performance, which dynamically determines the clipping range through its adaptive learning capability. By analyzing the representation power in the full-precision network and the convergence in the low-bit network, we show that the quantized model's accuracy can be improved while preserving the accuracy of the full-precision model. Extensive experiments on various classification datasets further demonstrate that our proposed activation function achieves higher quantization accuracy in low-bit networks. In the future, we plan to develop corresponding clipping methods for other types of activation functions, such as the sigmoid function, and to combine our quantization method with other acceleration algorithms to achieve higher compression performance.

Author Contributions

Y.D.: Conceptualization, Methodology, Data curation, Software, Writing—original draft. D.-R.C.: Supervision, Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Beijing Natural Science Foundation (L222018) and the National Natural Science Foundation of China (11971048).

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In the quantized network, we use the following notation:
\eta_1 = \mathrm{round}\left( \frac{-k_1 \mu (2^M - 1)}{k_1 \mu + k_2 \alpha} \right), \quad \beta = \frac{k_1 \mu + k_2 \alpha}{2^M - 1},
\eta_2 = \mathrm{round}\left( \frac{k_2 \alpha (2^M - 1)}{k_1 \mu + k_2 \alpha} \right), \quad y_q = \mathrm{round}\left( \frac{y}{\beta} \right) \cdot \beta.
Assume that the activation input x is a random variable with probability density function f(x) = Laplace(0, b); then, for a positive slope k, kx has the density function Laplace(0, kb).
Thus, the activation output y has the probability density function
h(y) =
\begin{cases}
\frac{1}{2 k_1 b} e^{\frac{y}{k_1 b}}, & y \in (-\infty, 0) \\
\frac{1}{2 k_2 b} e^{-\frac{y}{k_2 b}}, & y \in [0, +\infty)
\end{cases}
The quantization error between the activation output y and its quantized version y_q is denoted by error(y) and can be written as follows:
\mathrm{error}(y) = \int_{-k_1 \mu}^{k_2 \alpha} h(y) (y - y_q)^2 \, dy
= \int_{-k_1 \mu}^{(\eta_1 + 0.5)\beta} h(y) (y - \eta_1 \beta)^2 \, dy
+ \int_{(\eta_2 - 0.5)\beta}^{k_2 \alpha} h(y) (y - \eta_2 \beta)^2 \, dy
+ \sum_{i=1}^{2^M - 2} \int_{(\eta_1 + i - 0.5)\beta}^{(\eta_1 + i + 0.5)\beta} h(y) \left[ y - (\eta_1 + i)\beta \right]^2 \, dy
Because the probability density of the positive input differs from that of the negative input, we need to consider integrals over different intervals when calculating the quantization error. Obviously, y_q = 0 when y = 0. Assume that η_1 + i* = 0; then 0 ∈ [(η_1 + i* − 0.5)β, (η_1 + i* + 0.5)β]. Thus,
\mathrm{error}(y) = \int_{-k_1 \mu}^{(\eta_1 + 0.5)\beta} h(y) (y - \eta_1 \beta)^2 \, dy
+ \sum_{i=1}^{i^* - 1} \int_{(\eta_1 + i - 0.5)\beta}^{(\eta_1 + i + 0.5)\beta} h(y) \left[ y - (\eta_1 + i)\beta \right]^2 \, dy
+ \int_{-0.5\beta}^{0} h(y) y^2 \, dy + \int_{0}^{0.5\beta} h(y) y^2 \, dy
+ \sum_{i=i^* + 1}^{2^M - 2} \int_{(\eta_1 + i - 0.5)\beta}^{(\eta_1 + i + 0.5)\beta} h(y) \left[ y - (\eta_1 + i)\beta \right]^2 \, dy
+ \int_{(\eta_2 - 0.5)\beta}^{k_2 \alpha} h(y) (y - \eta_2 \beta)^2 \, dy
Define the antiderivatives
h_1(y) = \frac{1}{2} e^{\frac{y}{k_1 b}} (y - \eta_1 \beta)^2 - (k_1 b) e^{\frac{y}{k_1 b}} (y - \eta_1 \beta) + (k_1 b)^2 e^{\frac{y}{k_1 b}},
h_2(y) = -\frac{1}{2} e^{-\frac{y}{k_2 b}} (y - \eta_2 \beta)^2 - (k_2 b) e^{-\frac{y}{k_2 b}} (y - \eta_2 \beta) - (k_2 b)^2 e^{-\frac{y}{k_2 b}}.
The quantization error is calculated using the integration by parts as follows:
\mathrm{error}(y) = -h_1(-k_1 \mu) + (k_1 b)^2 + (k_2 b)^2 + h_2(k_2 \alpha)
- \sum_{i=1}^{i^* - 1} (k_1 b \beta) e^{\frac{(\eta_1 + i - 0.5)\beta}{k_1 b}}
- \sum_{i=i^* + 1}^{2^M - 2} (k_2 b \beta) e^{-\frac{(\eta_1 + i - 0.5)\beta}{k_2 b}}
If M \to \infty, then
-h_1(-k_1 \mu) \to -(k_1 b)^2 e^{-\frac{k_1 \mu}{k_1 b}}, \quad h_2(k_2 \alpha) \to -(k_2 b)^2 e^{-\frac{k_2 \alpha}{k_2 b}},
- \sum_{i=1}^{i^* - 1} (k_1 b \beta) e^{\frac{(\eta_1 + i - 0.5)\beta}{k_1 b}} \to (k_1 b)^2 \left( e^{-\frac{k_1 \mu}{k_1 b}} - 1 \right),
- \sum_{i=i^* + 1}^{2^M - 2} (k_2 b \beta) e^{-\frac{(\eta_1 + i - 0.5)\beta}{k_2 b}} \to (k_2 b)^2 \left( e^{-\frac{k_2 \alpha}{k_2 b}} - 1 \right).
Summing these limits with the remaining terms (k_1 b)^2 + (k_2 b)^2, the quantization error error(y) converges to zero as M → ∞. The proof is complete. □

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  5. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  6. Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-scnn: Gated shape cnns for semantic segmentation. In Proceedings of the 17th IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5228–5237. [Google Scholar]
  7. Lee, D.; Kim, C.; Kim, S.; Cho, M.; Han, W.S. Autoregressive Image Generation using Residual Quantization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11523–11532. [Google Scholar]
  8. Li, Y.; Adamczewski, K.; Li, W.; Gu, S.; Timofte, R.; Van Gool, L. Revisiting Random Channel Pruning for Neural Network Compression. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 191–201. [Google Scholar]
  9. Liu, S.; Li, B.; Zhao, B.; Huang, L.; Li, Q.; Huang, M.; Wu, Y.; Bao, W. Two-Bit Quantization for Harmonic Suppression. In Proceedings of the 5th International Conference on Information Communication and Signal Processing, Shenzhen, China, 26–28 November 2022; pp. 696–699. [Google Scholar]
  10. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  11. Liu, H.; Elkerdawy, S.; Ray, N.; Elhoushi, M. Layer importance estimation with imprinting for neural network quantization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 2408–2417. [Google Scholar]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 15th IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1026–1034. [Google Scholar]
  13. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv 2015, arXiv:1505.00853. [Google Scholar]
  14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  15. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic ReLU. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 351–367. [Google Scholar]
  16. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  17. Yamamoto, K. Learnable companding quantization for accurate low-bit neural networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Online, USA, 19–25 June 2021; pp. 5027–5036. [Google Scholar]
  18. Jung, S.; Son, C.; Lee, S.; Son, J.; Han, J.J.; Kwak, Y.; Hwang, S.J.; Choi, C. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4345–4354. [Google Scholar]
  19. Faraone, J.; Fraser, N.; Blott, M.; Leong, P.H. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4300–4309. [Google Scholar]
  20. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [Google Scholar]
  21. Cai, Z.; He, X.; Sun, J.; Vasconcelos, N. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5406–5414. [Google Scholar]
  22. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv 2018, arXiv:1805.06085. [Google Scholar]
  23. Chen, W.; Wang, P.; Cheng, J. Towards mixed-precision quantization of neural networks via constrained optimization. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5330–5339. [Google Scholar]
  24. Hubara, I.; Nahshan, Y.; Hanani, Y.; Banner, R.; Soudry, D. Accurate post training quantization with small calibration sets. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4466–4475. [Google Scholar]
  25. Wang, L.; Dong, X.; Wang, Y.; Liu, L.; An, W.; Guo, Y. Learnable Lookup Table for Neural Network Quantization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12413–12423. [Google Scholar]
  26. Courbariaux, M.; Bengio, Y.; David, J.P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 3123–3131. [Google Scholar]
  27. Li, F.; Zhang, B.; Liu, B. Ternary weight networks. arXiv 2016, arXiv:1605.04711. [Google Scholar]
  28. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 525–542. [Google Scholar]
  29. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv 2016, arXiv:1612.01064. [Google Scholar]
  30. Yang, J.; Shen, X.; Xing, J.; Tian, X.; Li, H.; Deng, B.; Huang, J.; Hua, X. Quantization networks. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7300–7308. [Google Scholar]
  31. Wang, P.; Hu, Q.; Zhang, Y.; Zhang, C.; Liu, Y.; Cheng, J. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4376–4384. [Google Scholar]
  32. Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4820–4828. [Google Scholar]
  33. Kryzhanovskiy, V.; Balitskiy, G.; Kozyrskiy, N.; Zuruev, A. QPP: Real-Time Quantization Parameter Prediction for Deep Neural Networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Online, USA, 19–25 June 2021; pp. 10679–10687. [Google Scholar]
  34. Zhao, R.; Hu, Y.; Dotzel, J.; De Sa, C.; Zhang, Z. Improving neural network quantization without retraining using outlier channel splitting. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 13012–13021. [Google Scholar]
  35. Banner, R.; Nahshan, Y.; Soudry, D. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  36. Huang, C.T.; Chuang, Y.C.; Lin, M.G.; Wu, A.A. Automated Quantization Range Mapping for DAC/ADC Non-linearity in Computing-In-Memory. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems, Austin, TX, USA, 27 May–1 June 2022; pp. 2998–3002. [Google Scholar]
  37. Bengio, Y.; Leonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432. [Google Scholar]
Figure 1. ReLU, LReLU/PReLU, RReLU, and SE. For PReLU, k is learned, and for LReLU, k is fixed. For RReLU, k is randomized during training in a given range. For SE, the slope is decided by a trainable dynamic function.
Figure 2. BCPReLU: the k 1 and k 2 control the slopes of the negative input and the positive input, respectively. The k 1 μ and k 2 α control the lower bound and upper bound of output, respectively. The k 1 , k 2 , μ , α are adaptively learned in training and then fixed in the testing.
Figure 3. The validation error of PReLU and BCPReLU on CIFAR10-ResNet20.
Figure 4. The validation error of different bit-width BCPReLU on CIFAR10-ResNet20.
Table 1. The learned parameters of BCPReLU using a ResNet20 model on the CIFAR10 dataset.
Layer      α        k        μ
Conv-1     1.5829   0.2151   0.5267
Conv-2     1.3588   0.2425   0.4372
Conv-3     1.1250   0.2336   0.3098
Conv-4     1.0106   0.1555   0.2061
Conv-5     1.5837   0.2528   0.4292
Conv-6     1.1262   0.3445   0.4317
Conv-7     2.0118   0.3648   0.3836
Conv-8     1.6620   0.2455   0.7877
Conv-9     1.3133   0.6584   2.4953
Conv-10    1.7046   0.2471   0.5996
Conv-11    2.9583   0.1456   0.9196
Conv-12    1.7303   0.3160   0.5303
Conv-13    4.3142   0.4195   1.3533
Conv-14    2.0902   0.1666   0.6147
Conv-15    3.1975   0.2384   0.9554
Conv-16    2.5048   0.3540   0.8222
Conv-17    5.2674   0.3605   1.9852
Conv-18    1.9860   0.3511   0.8580
Conv-19    2.2096   0.3535   0.9461
Table 2. The Comparison of top-1 accuracy between PACT and BCPReLU on CIFAR-10 and ImageNet datasets.
Dataset     Model      FullPrec   PACT 2b   PACT 4b   PACT 5b   BCPReLU 2b   BCPReLU 4b   BCPReLU 5b
CIFAR-10    ResNet20   0.916      0.897     0.913     0.917     0.903        0.914        0.915
CIFAR-10    ResNet32   0.923      0.907     0.921     0.921     0.912        0.921        0.922
ImageNet    ResNet18   0.702      0.644     0.692     0.698     0.652        0.695        0.698
