1. Introduction
Deep neural networks (DNNs), with their strong fitting capabilities, have extended the range of applications of artificial intelligence in recent years. Faced with increasingly complicated learning tasks, researchers have deepened and widened neural networks to achieve better representation capabilities. The price of this increased capability is a large consumption of computing resources, which poses a hefty challenge to hardware design.
As neural networks grow in size, their training becomes more dependent on specialized hardware consisting of thousands of computing units. General matrix multiply (GEMM) is at the core of deep learning because a tremendous number of matrix multiplication operations are required for neural network training. To accelerate the training and inference of deep neural networks, hardware vendors keep increasing the chip area and the number of processing units. Nevertheless, accelerating matrix multiplication remains cost-ineffective compared with accelerating simple operations such as addition, subtraction, and bit shifts.
In terms of hardware implementation, multipliers are bulky and power-intensive compared to other logic resources. Consequently, this prohibits deployment in scenarios where multipliers are insufficient. For instance, the digital signal processing (DSP) slices used for floating-point multiplication are often a scarce resource when implementing neural networks on field-programmable gate arrays (FPGAs). To reduce the usage of hardware multipliers, many researchers use a coordinate rotation digital computer (CORDIC) [1] module to compute layer activations, rather than using DSPs directly, when hyperbolic activation functions are applied. This method involves only additions, subtractions, bit shifts, and look-up tables, and has been confirmed to be more hardware efficient [2,3,4]. Although widely used in microcontrollers and FPGAs, CORDIC can only partially eliminate the dependence on multipliers: a large number of multipliers are still required for inference and error backpropagation, which hinders on-FPGA neural network training.
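For reference, the following minimal Python sketch (the iteration count and names are ours, not from the cited works) shows how rotation-mode CORDIC produces cosine and sine values from additions, subtractions, shifts, and a small angle table; in a fixed-point datapath, every multiplication by 2.0 ** -i becomes a bit shift, while the angle table and gain constant are pre-computed.

```python
import math

N_ITERS = 16
# Pre-computed micro-rotation angles atan(2^-i); a small look-up table in hardware.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITERS)]
# Aggregate CORDIC gain, also pre-computed and folded into the starting vector.
K = 1.0
for i in range(N_ITERS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_cos_sin(theta):
    """Rotation-mode CORDIC for |theta| < pi/2: each iteration uses only a
    sign decision, additions/subtractions, and scaling by 2^-i (a bit shift
    in fixed point)."""
    x, y, z = K, 0.0, theta
    for i in range(N_ITERS):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * (y * 2.0 ** -i), y + d * (x * 2.0 ** -i)
        z -= d * ANGLES[i]
    return x, y  # (cos(theta), sin(theta)) up to ~2^-N_ITERS accuracy

print(cordic_cos_sin(0.5), (math.cos(0.5), math.sin(0.5)))
```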
Efforts have also been put into hardware optimization. Many optimizations are motivated by inherent characteristics of CNN models. For instance, the spatial correlation of CNNs, i.e., the fact that adjacent activations within an output feature map share close values, motivated Shomron et al. to propose a value-prediction-based method that reduces MAC operations in DNNs [5]. In [6], a lightweight CNN is used to predict zero-valued activations; the prediction step is computed prior to the convolution step, so a large share of convolution operations can be saved. In addition, the sparsity of DNN values causes underutilization of the underlying hardware. For example, DNN tensors usually follow a bell-shaped distribution concentrated around zero, so a high percentage of values are represented by only a portion of the least-significant bits (LSBs). These zero-valued bits cause inefficiencies when executed on hardware. To address this, a non-blocking simultaneous multithreading (NB-SMT) method [7] was designed to better utilize the execution resources. Many quantization algorithms have also been proposed to reduce the usage of multipliers in DNNs. For example, by quantizing all network weights into powers of 2, one can avoid multiplications entirely [8]. However, multiplications involving the weight parameters account for only part of neural network training. Quantizing the layer inputs as well reduces the use of multipliers to a greater extent, but causes a large accuracy loss and is only suitable for simple tasks [9].
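To illustrate the power-of-two trick, here is a hedged sketch (ours, not the exact scheme of [8]): once a weight is constrained to a signed power of two, multiplying a fixed-point activation by it reduces to a sign flip plus a bit shift.

```python
import numpy as np

def quantize_pow2(w):
    """Round each weight to the nearest signed power of two (illustrative)."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
    return sign, exp  # w ~ sign * 2**exp

w = np.array([0.13, -0.02, 0.4])
s, e = quantize_pow2(w)
print(s * 2.0 ** e)      # ~ [0.125, -0.015625, 0.5]

# Multiplication by a quantized weight +2^-3 = 0.125 becomes a 3-bit right shift:
x = 96                   # integer activation in fixed point
y = x >> 3               # 96 >> 3 == 12, i.e., 96 * 0.125, with no multiplier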
For circumstances in which hardware multipliers are insufficient, we propose a trigonometric approach to eliminate the dependence on multipliers.
Consider a simple case in which only fully connected layers are used, and the non-linearity is omitted. As shown in
Figure 1, to backpropagate error signals, each layer conducts multiplications between its weight matrix and the error matrix passed back from the following layer (see subfigure (b)). In addition, to update the weights, the product of the activation and error matrices is computed (see subfigure (c)). It is clear that multiplications between errors, activations, and weights dominate the computation cost, which the equations below make explicit.
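For concreteness (our notation: the symbols $a^{(l)}$, $e^{(l)}$, and $W^{(l)}$ for the activations, errors, and weights of layer $l$ are assumptions rather than labels taken from Figure 1), the two multiplication-heavy steps in the linear case are

$$ e^{(l)} = \big(W^{(l+1)}\big)^{\top} e^{(l+1)}, \qquad \Delta W^{(l)} \propto e^{(l)} \big(a^{(l-1)}\big)^{\top}, $$

and every entry of both products is a sum of scalar products between small-magnitude values.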
However, in modern neural network models, both the weight parameters and the error signals are highly concentrated around zero with extremely small variance. This characteristic enables us to approximate the original errors and weights using their sine values; then, using the product-to-sum formula, we can convert the multiplications into cheaper operations. Here, we briefly introduce the ideas behind this study.
From Equation (1), we know that when $x$ is infinitesimal, we can regard $\sin x$ and $x$ as equivalent. Moreover, we have $\sin x \approx x$ when $x$ is a small value, and thus we can approximate $x$ by $\sin x$. For example, since $|\sin x - x| \le |x|^3/6$, the error gap is within $2 \times 10^{-7}$ when $|x|$ is smaller than 0.01. The left subfigure of
Figure 2 shows a typical distribution of weight parameters extracted from a random layer of a 28-layer WideResNet trained on the CIFAR-100 dataset. The right subfigure of
Figure 2 shows the error curve when we replace $x$ with $\sin x$. Apparently, using the sine value as an approximation does not add much noise to the network when the parameters are highly concentrated near 0.
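This can be verified numerically; a minimal check (the ranges and tolerances below are our own choices):

```python
import numpy as np

x = np.linspace(-0.1, 0.1, 100001)
print(np.abs(np.sin(x) - x).max())   # ~1.7e-4 over the wider range |x| <= 0.1
print(abs(np.sin(0.01) - 0.01))      # ~1.7e-7, within the 2e-7 gap stated above
```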
By applying Equation (2), i.e., the product-to-sum formula $\sin\alpha \sin\beta = \frac{1}{2}\left[\cos(\alpha - \beta) - \cos(\alpha + \beta)\right]$, we can convert multiplications of sine values into simpler addition, subtraction, and bit shift operations, which are far more economical than classical multiplication. Note that the cosine values can be computed by the aforementioned CORDIC engine, which is also hardware friendly and requires only simple operations. We also introduce a sine-based activation function. Rather than a mere sine activation, we adopt a rectified variant that combines the ReLU [10] and sine activations. In this way, we realize the sine-value replacement of activations, errors, and weights, between which the multiplications become removable in inference and training. We call this method trigonometric inference.
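The following sketch illustrates the replacement; rect_sin is one plausible form of the rectified sine activation (our assumption; the paper's definition is given in Section 3), and in fixed-point hardware the factor 0.5 is a one-bit right shift while both cosines come from the CORDIC engine.

```python
import numpy as np

def sine_product(a, b):
    """sin(a)*sin(b) via the product-to-sum identity: two adds/subtracts,
    two CORDIC cosines, and a halving (a one-bit shift in fixed point)."""
    return 0.5 * (np.cos(a - b) - np.cos(a + b))

def rect_sin(x):
    """Hypothetical rectified sine activation combining ReLU and sine."""
    return np.sin(np.maximum(x, 0.0))

a, b = 0.03, -0.02
print(sine_product(a, b), np.sin(a) * np.sin(b), a * b)
# The identity is exact; for small a and b, it also approximates the plain
# product a*b, since sin(a) ~ a and sin(b) ~ b.
```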
Trigonometric inference offers an alternative training method when hardware multipliers are insufficient. In addition, the method is superior when a hyperbolic function, such as $\tanh$, is adopted as the activation function, since such activations can be computed by the same CORDIC module. The approach is evaluated on image classification tasks, and the experimental results confirm a performance comparable to that of the classical training method.
The remainder of this paper is organized as follows. We introduce related research in
Section 2. A detailed explanation of the methodology is provided in
Section 3, and
Section 4 summarizes the experimental results. We discuss future work in
Section 5, and some concluding remarks are provided in
Section 6. Our contributions are listed as follows:
We propose a novel training and inference method that utilizes trigonometric approximations. To the best of our knowledge, this is the first work to show that trigonometric inference can support learning in deep neural networks;
By replacing the model parameters and activations with their sine values, we show that the multiplications in training and inference can be converted into shift-and-add operations. To achieve this, a rectified sine activation function is proposed;
We evaluate trigonometric inference on several models and show that it achieves performance close to that of conventional CNNs on the MNIST, CIFAR-10, and CIFAR-100 datasets.
5. Future Research and Challenges
To the best of our knowledge, this is the first study to utilize a trigonometric approximation of all parameters in the training of deep neural networks. However, we were unable to demonstrate the efficiency of our training method on modern GPUs and CPUs because they are specifically optimized for multiplication. We therefore set the design of a hardware computation engine for trigonometric inference as future work. Here, we briefly discuss the challenges. Although a simple serial CORDIC module requires significantly fewer logic resources than a multiplier, it has certain drawbacks in terms of speed. To accelerate the CORDIC module for neural network inference, the following two schemes are worth exploring:
(1) CORDIC optimization for small values:
Although randomly distributed, layer inputs, weights, and backpropagated errors tend to have small variances in large deep neural networks. Starting the rotation from the mean value can reduce the number of iterations of the CORDIC module because the mean value can be stored as a pre-computed angle, as sketched below.
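A hedged Python sketch of this scheme (the function and its parameters are illustrative assumptions, not an implemented design): the rotation starts from a pre-computed mean angle and skips the coarse early stages, so only the small residual angle must be resolved.

```python
import math

def cordic_from_mean(theta, mean_angle, i0=4, n_iters=8):
    """Start from a pre-computed rotation by mean_angle and run only the
    fine CORDIC stages i0 .. i0+n_iters-1; valid while the residual
    |theta - mean_angle| stays below the sum of the remaining angles."""
    K = 1.0  # gain of the executed stages, folded into the start vector
    for i in range(i0, i0 + n_iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y = K * math.cos(mean_angle), K * math.sin(mean_angle)  # table constants
    z = theta - mean_angle  # small residual for concentrated weights/errors
    for i in range(i0, i0 + n_iters):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * (y * 2.0 ** -i), y + d * (x * 2.0 ** -i)
        z -= d * math.atan(2.0 ** -i)
    return x, y  # ~ (cos(theta), sin(theta))

print(cordic_from_mean(0.05, 0.0), (math.cos(0.05), math.sin(0.05)))
```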
(2) Neural network training with lower-precision CORDIC:
As mentioned in
Section 2, training with lower precision can provide comparable performance in many scenarios. Reducing the model precision to 16 bits, 8 bits, or even fewer will significantly decrease the number of CORDIC iterations, since each CORDIC iteration resolves roughly one additional bit of angular precision. Therefore, the training efficiency is higher when lower precision suffices.