1. Introduction
With the rapid development of artificial intelligence (AI) technology, more and more AI applications are being developed and deployed on Internet of Things (IoT) node devices. The intelligent Internet of Things (AI + IoT, AIoT), which combines the advantages of AI and IoT technology, has gradually become a research hotspot in IoT-related fields [1]. Traditional cloud computing is not well suited to AIoT because of its high latency and poor mobility [2]. For this reason, a new computing paradigm, edge computing, has been proposed: to reduce computing latency and network congestion, computation is migrated from the cloud server to the device [2]. Edge computing brings innovation to IoT systems, but it also challenges the AI computing performance of IoT node processors, which must be improved while still meeting the power consumption and area limitations of IoT node devices [3]. To improve the AI computing power of IoT node processors, some IoT chip manufacturers provide artificial intelligence acceleration libraries for their processors, but these only optimize and tailor algorithms at the software level and are merely a stopgap. It is necessary to design AI accelerators suited to IoT node processors at the hardware level.
Among all kinds of AI algorithms, the convolutional neural network (CNN) algorithm is widely used in IoT systems involving image scenes because of its excellent performance in image recognition. Compared to traditional signal processing algorithms, it achieves higher recognition accuracy, avoids complex and tedious manual feature extraction, and adapts better to different image scenes [4]. The common hardware for CNN acceleration is the GPU or TPU, but such accelerators are mainly used in high-performance servers and are not suitable for IoT node devices. In References [5,6], most of the constituent units of the CNN are implemented in hardware; these designs achieve high acceleration performance but consume considerable hardware resources and cannot meet the resource constraints of nodes in IoT systems. In Reference [7], the matrix multiplications in the CNN algorithm are reduced by means of the FFT, but the FFT itself consumes considerable hardware resources. The sparse CNN (SCNN) accelerator proposed in Reference [8] exploits the data sparsity produced by neural network pruning to design a special MAC computing unit, but its structure is highly customized and its application scenario is narrow. In Reference [9], an acceleration structure based on in-memory computing is proposed; by shortening the distance between computing and storage, it reduces memory accesses and improves computing efficiency. However, the cost of in-memory-computing chips is high, so they are not suitable for large-scale deployment in IoT systems. The work in Reference [10] proposes several alternative reconfiguration schemes that significantly reduce the complexity of sum-of-products operations but does not explicitly propose a CNN architecture. In Reference [11], an FPGA implementation of a CNN addressing portability and power efficiency is designed, but the proposed architecture is not reconfigurable. In Reference [12], a CNN accelerator that accelerates both standard convolution and depthwise separable convolution is implemented on the Xilinx ZYNQ 7100 (Xilinx, San Jose, CA, USA) hardware platform, but this accelerator cannot be configured for other algorithms in the IoT system. Several research works [13,14,15] use the RISC-V ecosystem to design accelerator-centric SoCs or multicore processors, but they neither integrate the designed accelerator as a coprocessor nor design corresponding custom coprocessor instructions to speed up algorithm processing.
Based on Reference [16], this paper further optimizes the structure of the CNN accelerator. The basic operation modules in the acceleration chain designed in Reference [16] are interconnected through a crossbar, which diversifies the flow of input data, enriching the functions of the acceleration unit and forming a reconfigurable CNN accelerator. Exploiting the extensibility of the RISC-V instruction set architecture, the reconfigurable CNN accelerator is attached to the E203 core (Nucleisys, Beijing, China) as a coprocessor, and the corresponding custom coprocessor instructions are designed. A compiler environment and library functions are established for the designed instructions, the hardware and software design of the coprocessor is completed, and the implementation of common IoT algorithms on the coprocessor is described. Finally, a resource evaluation of the coprocessor is completed on a Xilinx FPGA. The evaluation results show that the coprocessor consumes only 8534 LUTs, accounting for 47.6% of the total SoC system. Four basic algorithms, convolution, pooling, ReLU, and matrix addition, are implemented both with the RISC-V standard instruction set and with the custom coprocessor instruction set, and the acceleration performance of the coprocessor is evaluated by comparing the operation cycles of each algorithm under the two approaches. The results show that the coprocessor instruction set yields a significant acceleration for all four algorithms; convolution is accelerated by 6.27 times compared with the standard instruction set.
The rest of this paper is organized as follows: Section 2 introduces some background knowledge, including the RISC-V instruction set and the E203 CPU. Section 3 presents the hardware design of the reconfigurable CNN-accelerated coprocessor. Section 4 introduces the software design of the reconfigurable CNN-accelerated coprocessor. Section 5 introduces the implementation of commonly used IoT algorithms on the coprocessor. Section 6 presents the experiments and resource analysis. Section 7 summarizes the conclusions of this work.
3. Hardware Design of CNN-Accelerated Coprocessor
The reconfigurable CNN acceleration coprocessor designed in this paper is a further extension of the compact CNN accelerator designed in Reference [16]. It mainly optimizes the acceleration chain structure and adds the coprocessor-related modules. The other basic components are consistent with those in Reference [16].
3.1. Accelerator Structure Optimization
In the compact CNN accelerator designed in Reference [16], each operation unit in the acceleration chain is connected in a serial structure, so the data flow direction is fixed. When some algorithms are implemented, data must be moved between the memory and the accelerator many times, which reduces calculation efficiency and increases the power consumption of the processor. In this paper, four basic operation modules, convolution, pooling, ReLU, and matrix addition, are interconnected by a crossbar, and a reconfigurable CNN accelerator is designed. The accelerator mainly comprises four reconfigurable computing acceleration processing elements (PEs). By configuring the PE units with different parameters, hardware acceleration of various algorithms can be realized. The structure is shown in Figure 4.
The accelerator includes a source convolution kernel cache module (COE RAM), a reconfigurable circuit controller (Reconfigure Controller), two ping-pong buffer blocks (BUF RAM BANK), and four PE units. Each PE unit contains four basic computing components and a configurable crossbar (Crossbar). The configuration of the crossbar allows data to flow to different computing components, which can speed up different algorithms. The entire accelerator can be parameterized and data accessed through an external bus.
The key to the acceleration of various algorithms by the four PE units is the design of the crossbar circuit. By configuring the crossbar, the input data flow can be routed through any one or more of the calculation modules. Its structure is shown in Figure 5.
The crossbar mainly consists of an input buffer (FIFO), a configuration register group (Cfg Regs), and five multiplexers (MUX). The five MUXes select the data path to be opened according to the configuration information in the Cfg Regs, so the data stream passes through different calculation modules in different orders. For example, the edge detection operation in image processing usually downsamples the input image and then convolves it with the edge detection operator, which requires configuring the MUXes as the path shown in red in Figure 5. In a convolutional neural network algorithm, it is usually necessary to perform convolution, ReLU, and pooling operations on the source matrix, for which the MUXes are configured as the path shown in blue in Figure 5. The crossbar circuit and each calculation module use the ICB bus for data transmission. The ICB bus is a bus protocol defined by the open-source E203 CPU; it combines the advantages of the AXI bus and the AHB bus. For more information about the ICB bus, please refer to Reference [19].
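As an illustration of how software might program the Cfg Regs, the following C sketch packs a processing order into a single configuration word. The field layout, module identifiers, and the existence of such a packed word are assumptions for illustration, not the actual crossbar encoding.

```c
/* Minimal sketch of one possible Cfg Regs path encoding.
 * The module IDs and the 4-bit-per-stage layout are hypothetical. */
#include <stdint.h>

enum pe_module {               /* hypothetical module IDs selectable by each MUX */
    MOD_NONE = 0,
    MOD_CONV = 1,
    MOD_RELU = 2,
    MOD_POOL = 3,
    MOD_ADD  = 4
};

/* Pack the processing order into one 32-bit configuration word:
 * stage i (0..4) occupies bits [4*i+3 : 4*i]. */
static inline uint32_t crossbar_cfg(const enum pe_module order[5])
{
    uint32_t cfg = 0;
    for (int i = 0; i < 5; i++)
        cfg |= ((uint32_t)order[i] & 0xFu) << (4 * i);
    return cfg;
}

/* Example: the CNN path "convolution -> ReLU -> pooling" (the blue path in Figure 5). */
static const enum pe_module cnn_path[5] = { MOD_CONV, MOD_RELU, MOD_POOL, MOD_NONE, MOD_NONE };
```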
3.2. Coprocessor Design
In addition to optimizing the accelerator chain module designed in Reference [16], this paper adds an EAI controller, a decoder, and a data fetcher, completing the hardware design of the reconfigurable CNN-accelerated coprocessor, whose structure is shown in Figure 6.
The EAI controller handles the timing of the EAI interface; it passes the instruction information and source operands obtained from the EAI request channel to the decoder for decoding, and hands the memory data obtained from the EAI memory response channel to the data fetcher for allocation to the corresponding cache. The decoder decodes the custom extended instructions. Configuration instructions are passed to the reconfiguration controller to configure each functional module; memory access instructions are passed to the data fetcher, which carries out the memory access. The data fetcher implements the processing of memory access instructions. It reads data from external memory into the corresponding cache through the memory response channel of the EAI interface: convolution kernel coefficients are loaded into the COE CACHE unit, and the source matrix is loaded into CACHE BANK1 or CACHE BANK2. The calculation results are written back to external memory through the memory request channel of the EAI interface.
When the main processor executes an instruction, its decoding unit first determines whether the instruction belongs to the custom instruction group according to the opcode. For instructions in the custom instruction group, the xs1 and xs2 bits of the instruction determine whether source operands must be read; if so, the operands are read from the register file according to rs1 and rs2. The main processor also maintains the data dependences between instructions: if a dependence exists, the pipeline is stalled until it is resolved. At the same time, if the instruction must write back to a destination register, that register rd is also used in the dependence checks of subsequent instructions. The instruction is then sent to the coprocessor through the EAI request channel for processing. After the coprocessor receives the instruction, it is further decoded and dispatched to different units for execution according to its type. The coprocessor executes instructions in a blocking manner: only after one instruction completes can the main processor dispatch the next. Finally, after the instruction is executed, the result is returned to the main processor through the response channel; for instructions that write back, the result is written to the destination register.
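From the software side, a custom coprocessor instruction of this kind can be issued from C with the GNU assembler's ".insn" directive, as in the hedged sketch below. The opcode (custom-0, 0x0b) and the funct3/funct7 values are placeholders, not the actual encodings defined for the EAI custom instruction group.

```c
/* Hedged sketch: issuing one custom coprocessor instruction from C.
 * Encoding fields are illustrative placeholders; the real values are
 * those defined for the designed coprocessor instruction set. */
#include <stdint.h>

static inline uint32_t cop_issue(uint32_t src1, uint32_t src2)
{
    uint32_t result;
    asm volatile (
        ".insn r 0x0b, 0x0, 0x0, %0, %1, %2"   /* rd <- coprocessor(rs1, rs2) */
        : "=r"(result)
        : "r"(src1), "r"(src2)
        : "memory");                            /* the instruction may access memory */
    return result;
}
```

Because the coprocessor executes in a blocking manner, such a call returns only after the instruction has completed and its result has been returned over the response channel.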
The reconfigurable CNN acceleration coprocessor designed in this paper and the compact CNN accelerator designed in Reference [16] are both used to process multimedia data such as voice and images. The data flows are shown in Figure 7.
Figure 7a shows the data flow of the compact CNN accelerator designed in Reference [16]. It is an external device mounted on the SoC bus, so the CPU core must handle the data transfer. The multimedia data must first be moved by the CPU core from the external interface to the data memory, and then to the accelerator for processing. After processing, the CPU core must move the calculation results from the accelerator back to the data memory and then to the external interface, which sends them to the network interface. Data therefore have to be moved between the accelerator and the data memory twice.
Figure 7b shows the data flow of the reconfigurable CNN-accelerated coprocessor designed in this paper. Compared with the accelerator designed in Reference [16], the coprocessor can read multimedia data directly from the external interface, and after processing, the results can be sent directly to the network interface via the external interface, so no data need to be moved between the coprocessor and the data memory.
The coprocessor approach reduces data movement, further speeds up algorithm processing, and saves power. In addition, the coprocessor provides coprocessor instruction support, with higher code density and simpler programming.
5. Implementation of Common Algorithms on Coprocessor
The reconfigurable CNN-accelerated coprocessor designed in this paper can accelerate not only the CNN algorithm but also some other commonly used algorithms in the IoT system. This section describes the implementation of the LeNet-5 network, Sobel edge detection, and FIR filtering algorithms on the coprocessor.
5.1. LeNet-5 Network Implementation
In order to verify the acceleration of the CNN algorithm by the coprocessor, the classical LeNet-5 network is used in this paper. The structure of the LeNet-5 network is shown in Figure 12.
The LeNet-5 network mainly comprises six hidden layers [20] (the layer-size arithmetic is checked after the list):
- (1) Convolution layer C1. Six 5 × 5 convolution kernels are convolved with the 32 × 32 original image to generate six 28 × 28 feature maps, and each feature map is activated using the ReLU function.
- (2) Pooling layer S2. The S2 layer uses a 2 × 2 pooling filter to perform maximum pooling on the output of C1, obtaining six 14 × 14 feature maps.
- (3) Partially connected layer C3. The C3 layer uses 16 5 × 5 convolution kernels that are partially connected with the six feature maps output by S2, calculating 16 10 × 10 feature maps. The partial connection relationship and calculation process are shown in Figure 13. Take the calculation of the 0th output feature map as an example: first, the 0th convolution kernel is convolved with feature maps 0, 1, and 2 output by the S2 layer; the three convolution results are then added together, a bias is added, and the sum is finally activated to obtain the 0th feature map of the C3 layer.
- (4) Pooling layer S4. This layer uses a 2 × 2 pooling filter to pool the output of C3 into 16 5 × 5 feature maps.
- (5) Expand layer S5. This layer combines the 16 5 × 5 feature maps output by S4 into a one-dimensional matrix of size 400.
- (6) Fully connected layer S6. The S6 layer fully connects the one-dimensional matrix output by S5 with 10 convolution operators and obtains 10 classification results as the recognition result of the input image.
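As a check on the sizes above, the layer dimensions follow the usual stride-1 "valid" convolution and 2 × 2 non-overlapping pooling arithmetic (an assumption consistent with the numbers given):

```latex
W_{\mathrm{conv}} = W_{\mathrm{in}} - K + 1, \qquad W_{\mathrm{pool}} = \frac{W_{\mathrm{conv}}}{2}
% C1: 32 - 5 + 1 = 28;  S2: 28 / 2 = 14;  C3: 14 - 5 + 1 = 10;  S4: 10 / 2 = 5
```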
A comprehensive analysis of the operating characteristics of each layer of the network yields the per-layer calculations of LeNet-5 summarized in Table 4.
As can be seen from Table 4, the implementation of the LeNet-5 network on the coprocessor can be performed in four steps (sketched in code after the list):
- (1) First, map the C1 layer and the S2 layer. Configure the coprocessor to use the convolution, ReLU, and pooling modules of the PE unit, and configure the parameters of these three modules. Configure the crossbar so that the data flow follows the sequence shown in Figure 14a, start the coprocessor, and calculate the C1 and S2 layers.
- (2) Map the C3 layer and the S4 layer. Configure the coprocessor to use the convolution, matrix addition, ReLU, and pooling modules of the PE unit, and configure the parameters of these four modules. Configure the crossbar so that the data flow follows the order shown in Figure 14b, start the coprocessor, calculate the C3 and S4 layers, and cache the calculation result in BUF RAM BANK1.
- (3) The S5 layer is calculated by the CPU. Use a software program to expand the calculation result of (2) into a 1 × 400 one-dimensional matrix, which is buffered in BUF RAM BANK2.
- (4) Map the S6 layer. Configure the coprocessor to use the convolution module of the PE unit so that the convolution module supports one-dimensional convolution. Configure the crossbar so that the data flow follows the sequence in Figure 14c and only the convolution module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate the S6 layer.
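A hedged C sketch of this four-step mapping is given below. The wrapper functions, path identifiers, and buffer identifiers are hypothetical stand-ins for the coprocessor library calls of Section 4; only the ordering of the configuration steps is taken from the text above.

```c
/* Hypothetical library wrappers around the custom coprocessor instructions. */
enum path  { PATH_CONV_RELU_POOL, PATH_CONV_ADD_RELU_POOL, PATH_CONV_ONLY };
enum where { SRC_EXTERNAL, BUF_BANK1, BUF_BANK2, DST_EXTERNAL };

void pe_cfg_conv(int ksize, const short *coeffs);   /* load kernel into COE RAM    */
void pe_cfg_relu(void);
void pe_cfg_pool(int win);                          /* 2 -> 2x2 max pooling        */
void pe_cfg_matadd(void);
void crossbar_cfg_path(enum path p);                /* program the Cfg Regs        */
void cop_start(enum where src, enum where dst);     /* blocking custom instruction */
void cpu_flatten_bank1_to_bank2(void);              /* S5: software reshape        */

void lenet5_forward(const short *c1_k, const short *c3_k, const short *s6_w)
{
    /* (1) C1 + S2: convolution -> ReLU -> pooling (Figure 14a) */
    pe_cfg_conv(5, c1_k);  pe_cfg_relu();  pe_cfg_pool(2);
    crossbar_cfg_path(PATH_CONV_RELU_POOL);
    cop_start(SRC_EXTERNAL, BUF_BANK1);

    /* (2) C3 + S4: convolution -> matrix add -> ReLU -> pooling (Figure 14b) */
    pe_cfg_conv(5, c3_k);  pe_cfg_matadd();  pe_cfg_relu();  pe_cfg_pool(2);
    crossbar_cfg_path(PATH_CONV_ADD_RELU_POOL);
    cop_start(BUF_BANK1, BUF_BANK1);

    /* (3) S5 on the CPU: expand the 16 5x5 maps into a 1x400 vector in BANK2 */
    cpu_flatten_bank1_to_bank2();

    /* (4) S6: one-dimensional convolution only (Figure 14c) */
    pe_cfg_conv(1, s6_w);
    crossbar_cfg_path(PATH_CONV_ONLY);
    cop_start(BUF_BANK2, DST_EXTERNAL);
}
```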
5.2. Sobel Edge Detection and FIR Algorithm Implementation
In image processing algorithms, edge detection is often required, and edge detection based on the Sobel operator is a commonly used method. The Sobel operators are first-order gradient operators that can effectively filter out noise interference and extract accurate edge information [21]. Sobel edge detection convolves two 3 × 3 matrix operators with the input image to obtain the gray values of the horizontal and vertical edges, respectively. If A is the original image and Gx and Gy are the horizontal and vertical edge images, respectively, the calculation formulas of Gx and Gy are shown in Figure 15.
The complete edge gray value of the image can be approximated by Equation (1).
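Figure 15 and Equation (1) are not reproduced here; for reference, the standard Sobel forms, consistent with the steps described below, are:

```latex
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A,
\qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A,
\qquad
|G| \approx |G_x| + |G_y|
```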
Before image edge detection, downsampling is usually performed to reduce the image size and the amount of data to be processed.
In summary, the implementation of Sobel edge detection on the accelerator can be performed in three steps (the CPU step is sketched in code after the list):
- (1) Calculate Gx and Gy. Configure the coprocessor to use two PE units. Each PE unit uses the pooling and convolution modules, and the convolution kernels used by the convolution modules in the two PEs are the two operators of the Sobel operator. Configure the crossbar so that the data flow follows the sequence in Figure 16a. Start the coprocessor, calculate Gx and Gy, and cache the calculation results in BUF RAM BANK1.
- (2) Use the CPU to calculate |Gx| and |Gy|. Use a software program to take the absolute value of each element in the cached result of (1) to obtain |Gx| and |Gy|, and cache the calculation result in BUF RAM BANK2.
- (3) Calculate |G|. Configure the coprocessor to use the matrix addition module of the PE unit and configure the crossbar so that the data flow follows the order in Figure 16b and only the matrix addition module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate |G|.
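Step (2) is the only purely software part of this flow. A minimal sketch, assuming 16-bit signed elements and that |Gx| and |Gy| are stored contiguously, is:

```c
/* CPU pass of step (2): take the absolute value of every element produced
 * by step (1) in BUF RAM BANK1 and write it into BUF RAM BANK2, before the
 * coprocessor adds |Gx| and |Gy| in step (3). Buffer types and layout are
 * assumptions for illustration. */
#include <stdlib.h>
#include <stdint.h>

void sobel_abs_pass(const int16_t *bank1, int16_t *bank2, size_t n_elems)
{
    /* n_elems covers both edge images: |Gx| followed by |Gy| */
    for (size_t i = 0; i < n_elems; i++)
        bank2[i] = (int16_t)abs(bank1[i]);
}
```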
In speech signal processing, FIR filters are often used for denoising. FIR filtering is a one-dimensional convolution operation, so only the convolution module of a PE needs to be used. Configure the crossbar so that the data flow follows the sequence shown in Figure 17, configure the convolution module for one-dimensional convolution, load the convolution kernel into COE RAM, start the coprocessor, and calculate the FIR filtering result.
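For clarity, a plain-C reference of the one-dimensional convolution that this configuration asks the convolution module to compute is given below; the tap count, data widths, and the absence of saturation or scaling are illustrative assumptions.

```c
/* Reference FIR filter as a one-dimensional convolution.
 * x: input samples, h: kernel (loaded into COE RAM in hardware),
 * y: n - taps + 1 output samples. */
#include <stddef.h>
#include <stdint.h>

void fir_reference(const int16_t *x, size_t n,
                   const int16_t *h, size_t taps,
                   int32_t *y)
{
    for (size_t i = 0; i + taps <= n; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)h[k] * x[i + k];
        y[i] = acc;
    }
}
```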
6. Experiment and Resource Analysis
This section presents the resource and performance analysis of the designed reconfigurable CNN acceleration coprocessor based on a Xilinx FPGA. The FPGA model is the Xilinx xc7a100tftg256-1 (Xilinx, San Jose, CA, USA), and the synthesis tool is Vivado 2018.1 (Xilinx, San Jose, CA, USA). FPGA circuit synthesis is performed on the E203 SoC connected to the coprocessor.
Table 5 shows the resource consumption of each main functional unit.
As shown in Table 5, the E203 core and the coprocessor account for most of the resource consumption in the E203 SoC. The E203 core accounts for 24.2% of the total LUT consumption, while the designed coprocessor accounts for 47.6% of the LUT resources in the SoC.
To evaluate the performance of the coprocessor, the four basic algorithms of convolution, pooling, ReLU, and matrix addition are implemented in two ways: one using the I and M subsets of the RISC-V standard instruction set, and the other using the coprocessor instructions designed in this paper; the number of cycles of algorithm execution under the two implementations is then compared. A testbench file loads the compiled binary file as the input stimulus for the entire E203 SoC, which is simulated with the ModelSim simulation software to count the execution cycles of each algorithm under the two implementations. The experimental results are shown in Table 6.
As can be seen from Table 6, using the coprocessor accelerates all four algorithms; the acceleration ratio of each algorithm is shown in Figure 18.
It can be seen from Figure 18 that the coprocessor has the most obvious acceleration effect on the convolution algorithm, which is 6.27 times faster than with the standard instruction set. This is because, on the one hand, the coprocessor performs the convolution calculation in a dedicated hardware unit, whereas the RISC-V main processor performs it in software; on the other hand, the coprocessor architecture reduces data movement, further speeding up algorithm processing.