Article

FPGA Implementation of Pillar-Based Object Classification for Autonomous Mobile Robot

by Chaewoon Park 1, Seongjoo Lee 2,3 and Yunho Jung 1,4,*
1 School of Electronics and Information Engineering, Korea Aerospace University, Goyang-si 10540, Republic of Korea
2 Department of Electrical Engineering, Sejong University, Seoul 05006, Republic of Korea
3 Department of Convergence Engineering of Intelligent Drone, Sejong University, Seoul 05006, Republic of Korea
4 Department of Smart Air Mobility, Korea Aerospace University, Goyang-si 10540, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3035; https://doi.org/10.3390/electronics13153035
Submission received: 14 June 2024 / Revised: 29 July 2024 / Accepted: 30 July 2024 / Published: 1 August 2024
(This article belongs to the Special Issue System-on-Chip (SoC) and Field-Programmable Gate Array (FPGA) Design)

Abstract:
With the advancement of artificial intelligence technology, autonomous mobile robots have been utilized in various applications. In autonomous driving scenarios, object classification is essential for robot navigation. To perform this task, light detection and ranging (LiDAR) sensors, which can obtain depth and height information and have higher resolution than radio detection and ranging (radar) sensors, are preferred over camera sensors. The pillar-based method employs a pillar feature encoder (PFE) to encode 3D LiDAR point clouds into 2D images, enabling high-speed inference using 2D convolutional neural networks. Although the pillar-based method is employed to ensure the real-time responsiveness of autonomous driving systems, research on accelerating the PFE has not been actively conducted, even though the PFE consumes a significant share of the computation time within the system. Therefore, this paper proposes a PFE hardware accelerator and a pillar-based object classification model for autonomous mobile robots. The proposed object classification model was trained and tested on a dataset of 2971 samples spanning eight classes, achieving a classification accuracy of 94.3%. The PFE hardware accelerator was implemented on a field-programmable gate array (FPGA) through a register-transfer level design and achieved a 40-fold speedup compared with firmware running on an ARM Cortex-A53 microprocessor unit; the object classification network was implemented on the FPGA using the FINN framework. By integrating the PFE and the object classification network, we implemented a real-time pillar-based object classification acceleration system on an FPGA with a latency of 6.41 ms.

1. Introduction

Autonomous mobile robots recognize their surroundings and perform physical tasks that were previously carried out by people. Advances in computer vision, robot control, and artificial intelligence (AI) have enhanced the ability of autonomous mobile robots to perceive and respond to their environment. Consequently, these robots are utilized in several areas of human life, including manufacturing, transportation, healthcare, and daily living [1,2]. Object classification is essential for autonomous mobile robots to identify objects and plan travel paths [3].
Various types of sensors, such as cameras and radio detection and ranging (radar) sensors, have been used to perceive the surrounding environment in autonomous driving situations. Camera sensors cannot obtain depth information, making it impossible to accurately determine the distance to an object, which can lead to serious issues when two objects overlap [4]. In addition, camera sensors are sensitive to brightness and have a limited field of view, which results in operational constraints [3]. Radar sensors cannot determine the height of objects and have very low angular resolution, preventing precise observation of objects [5]. In contrast, point clouds acquired by light detection and ranging (LiDAR) sensors contain 3D physical coordinate information that provides both depth and height information about objects. Furthermore, LiDAR sensors have fewer operational constraints, and their high resolution allows for precise observation of objects [3]. Therefore, point clouds acquired by LiDAR sensors are preferred for autonomous driving.
Various deep learning algorithms are available for processing point clouds. MVCNN [6] projects a point cloud in multiple directions and classifies objects using a convolutional neural network (CNN) based on the projected images. Although it uses a simple method to convert 3D data into 2D images and utilize a CNN, information loss occurs during the projection process [7]. PointNet [8] is a permutation-invariant network designed using a symmetry function to handle unordered point sets. VoxelNet [9] encodes features on a voxel-by-voxel basis using PointNet and conducts inferences using a region proposal network. Because the inference is based on a 3D CNN, a long inference time of 225 ms is required [10]. SECOND [11] achieved an improved inference speed of 50 ms by applying a sparse CNN to VoxelNet, but the complexity of the 3D CNN computation remained high. PointPillars [12] proposed a pillar-based method for encoding 3D point clouds into 2D images. Unlike MVCNN, which only uses information from the highest element, the pillar-based method uses information from all points within the pillar. Conducting 2D CNN-based inference on an encoded image enables pillar-based methods to achieve high-speed inference and excellent performance [13,14,15,16].
Autonomous mobile robots must respond in real time to prevent physical accidents. Under this constraint, running pillar-based methods on a microprocessor unit (MPU) is extremely time-consuming [17]. Graphics processing units (GPUs) can accelerate inference based on 2D CNNs; however, using GPUs on edge devices is inefficient in terms of power consumption [18]. Therefore, refs. [19,20,21] implemented hardware accelerators for pillar-based methods using a field-programmable gate array (FPGA). However, the pillar feature encoder (PFE) was computed on an MPU in [19], requiring 71.2 ms and accounting for 19% of the total response time. Similarly, the PFE was computed on an MPU in 9.51 ms in [20], which was 22% of the total response time. Additionally, only a portion of the PFE was accelerated in [21], requiring 37.7 ms for computation, which accounted for 59% of the total response time. Because the PFE occupies a significant portion of the overall response time of the pillar-based method, a hardware accelerator for the PFE is needed.
The pillar-based method can perform inference on an image encoded by the PFE using a network based on a depth-wise separable CNN (DSCNN). The DSCNN, used in [22,23,24], has been widely adopted in recent CNN models owing to its low computational cost and small number of parameters. Because of these advantages, DSCNNs are often used in resource-constrained or latency-sensitive applications [25]. Additionally, despite their low complexity, they have shown successful performance in image classification models by enhancing representational efficiency [26].
In this study, we propose an efficient pillar-based object classification model for autonomous mobile robots using a PFE and a DSCNN, achieving a classification accuracy of 94.3% through training and validation with a dataset acquired using an Ouster OS1-32 sensor [27]. In addition, we propose an FPGA-based acceleration system for the real-time operation of the proposed object classification model. The proposed acceleration system comprises a PFE accelerator and a classification network accelerator. We designed the PFE accelerator to achieve a computation time of 0.23 ms, a 40-fold speedup compared with the MPU. The classification network accelerator was implemented on an FPGA using FINN. For operation in edge environments, we implemented the proposed acceleration system on an FPGA and confirmed a real-time response time of 6.41 ms.
The remainder of this paper is structured as follows. Section 2 provides an overview of the proposed acceleration system, including the PFE and DSCNN-based classification networks, dataset used, model configuration, and evaluation results. The design of the proposed acceleration system is described in Section 3. Section 4 presents the implementation results and acceleration effects. Finally, Section 5 concludes the study.

2. Proposed System

Figure 1 shows an overview of the proposed acceleration system. The point cloud input to the system is encoded into an image by the PFE through three steps: (a) converting the point cloud into a set of pillar units, (b) adding hand-crafted features (HCFs), and (c) expanding the features through point-wise (PW) convolution, compressing them through a max-pool operation, and scattering them back to their original position on the xy plane to generate an image. The generated image is then fed into a classification network to classify the objects. The classification network comprises a DSCNN-based network.

2.1. PFE

The PFE, which encodes 3D point clouds into 2D images to enable 2D CNN-based inference, was proposed in [12]. First, a grid is generated by discretizing the xy plane, and the point cloud is then converted into a set of pillar units. The pillar set g is defined by Equation (1).

g = [p_1, p_2, ..., p_i, ..., p_P]        (1)

Here, g is a set of pillars p_i, where P is the maximum number of pillars that contain points. Each pillar p_i is defined by Equation (2).

p_i = [n_1, n_2, ..., n_j, ..., n_N]        (2)

Here, p_i is a set of points n_j, where N is the maximum number of points that a pillar can contain. Each point n_j is defined by Equation (3).

n_j = [x, y, z, r]        (3)

The point n_j comprises the x, y, z coordinates and the reflectivity value r of the point. The point cloud is therefore represented by a tensor of size (P, N, 4), where P and N are hyperparameters. If the number of points in a pillar exceeds N, random sampling is used; if it is less than N, zero padding is used.

After the point cloud is converted into a set of pillar units, an HCF is appended to each point n_j. The HCF comprises the differences x_m, y_m, z_m between the arithmetic mean coordinates of the points inside the pillar and the x, y, z coordinates of each point, and the differences x_c, y_c between the center coordinates of the pillar and the x, y coordinates of each point. The point n_j with the appended HCF is given by Equation (4).

n_j = [x, y, z, r, x_m, y_m, z_m, x_c, y_c]        (4)

The resulting tensor of size (P, N, 9) then undergoes PW convolution, batch normalization, and a ReLU layer, transforming it into a tensor of size (P, N, 64). Subsequently, max pooling is applied along the second dimension to yield a tensor of size (P, 64). Finally, each pillar is scattered back to its original position on the xy plane, completing the 2D image data with 64 channels.
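The PFE pipeline described above can be summarized in a short NumPy sketch: grouping into pillars, appending the HCF, applying the shared PW convolution with ReLU, max pooling over points, and scattering back onto the grid. This is a minimal conceptual sketch, not the authors' implementation; the cell size, random weights, and the simplified handling of overflowing pillars are illustrative assumptions.

```python
import numpy as np

P, N = 512, 16          # max pillars, max points per pillar (hyperparameters)
GRID, CELL = 128, 0.25  # 128 x 128 output image, hypothetical 0.25 m cell size

def pillar_feature_encoder(points, weights, bias):
    """points: (M, 4) array of [x, y, z, r]; weights: (9, 64); bias: (64,)."""
    # 1. Grouping: map each point to a grid cell (pillar) on the xy plane.
    cells = np.floor(points[:, :2] / CELL).astype(int)
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    keys = keys[:P]                                  # keep at most P pillars

    pillars = np.zeros((P, N, 9), dtype=np.float32)
    counts = np.zeros(P, dtype=int)
    for pt, idx in zip(points, inverse):
        if idx >= P or counts[idx] >= N:             # random sampling omitted for brevity
            continue
        pillars[idx, counts[idx], :4] = pt
        counts[idx] += 1

    # 2. Hand-crafted features: offsets from the pillar mean and the pillar center.
    for i in range(len(keys)):
        n = counts[i]
        if n == 0:
            continue
        mean = pillars[i, :n, :3].mean(axis=0)
        center = (keys[i] + 0.5) * CELL
        pillars[i, :n, 4:7] = pillars[i, :n, :3] - mean   # x_m, y_m, z_m
        pillars[i, :n, 7:9] = pillars[i, :n, :2] - center # x_c, y_c

    # 3. PW convolution (a shared 1x1 conv is a matmul), ReLU, max-pool over points.
    feats = np.maximum(pillars @ weights + bias, 0.0)     # (P, N, 64)
    pooled = feats.max(axis=1)                            # (P, 64)

    # 4. Scatter pillars back onto the xy grid to form a 64-channel image.
    image = np.zeros((64, GRID, GRID), dtype=np.float32)
    for i in range(len(keys)):
        cx, cy = keys[i]
        if counts[i] > 0 and 0 <= cx < GRID and 0 <= cy < GRID:
            image[:, cx, cy] = pooled[i]
    return image

# Hypothetical usage with random data:
pts = np.random.rand(2000, 4).astype(np.float32) * [30, 30, 3, 1]
img = pillar_feature_encoder(pts, np.random.randn(9, 64).astype(np.float32),
                             np.zeros(64, dtype=np.float32))
print(img.shape)  # (64, 128, 128)
```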

2.2. Classification Network

The proposed classification network is based on the DSCNN. Figure 2 shows how the DSCNN works. The DSCNN comprises depth-wise (DW) and PW convolutions. In a DW convolution, the numbers of input channels, filters, and output channels are the same. Each filter has only one channel and is convolved with only one channel of the input image to produce one channel of the output image. For example, in the DW convolution shown in Figure 2, one channel of the input image, marked by a dashed line, is convolved with one filter to produce the corresponding channel of the output image, also marked by a dashed line. Let iC be the number of input channels and oC the number of output channels; a PW convolution uses oC filters of size 1 × 1 with iC channels each. Each filter is convolved with all channels of the input image to produce one channel of the output image. For example, in the PW convolution shown in Figure 2, one pixel of the input image, marked by a dashed line, yields one pixel of the output image by applying all the filters at that position.
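The DW and PW operations described above map directly onto standard framework primitives. The following PyTorch sketch shows one depth-wise separable block (a grouped 3 × 3 convolution followed by a 1 × 1 point-wise convolution); the channel counts and stride are illustrative and are not the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DW conv: one 3x3 filter per input channel (groups=in_ch).
    PW conv: out_ch filters of size 1x1 spanning all input channels."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: a 64-channel encoded image of size 128 x 128 (batch of 1).
x = torch.randn(1, 64, 128, 128)
block = DepthwiseSeparableConv(64, 128, stride=2)
print(block(x).shape)  # torch.Size([1, 128, 64, 64])
```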

2.3. Dataset

To train the proposed object classification model, we obtained point clouds using an Ouster OS1-32 sensor. Figure 3 illustrates examples from a dataset comprising objects typically encountered by autonomous mobile robots in their driving environments. To enhance the training quality, we augmented the dataset using samples from the nuScenes [28] dataset. The dataset includes classes such as building, tree, vehicle, bicycle, obstacle, greenery, person, and urban fixture (street light, bicycle rack, traffic light, etc.), as depicted in Figure 4, which shows the configuration of the dataset. Out of a total of 2971 data points, 2617 were used for training and 354 were used for validation.

2.4. Performance Evaluation

In the PFE, the image size, number of pillars, and number of points per pillar are hyperparameters that significantly influence both the performance and complexity of the model. The image size determines the resolution and dimensions of the encoded image. A larger image size results in higher-resolution images; however, it also increases the computational demand of the CNN. Similarly, increasing the number of pillars and points per pillar allows more detailed information to be captured from the point cloud. However, this also increases the computational load of the PFE owing to the larger amount of data that must be processed. Thus, finding a balanced combination of these hyperparameters is crucial for achieving both high performance and efficient computation in pillar-based object classification models.
Figure 5 illustrates the architectures of the networks utilized for performance evaluation, all of which were based on the DSCNN. Table 1 presents the accuracies corresponding to various hyperparameters and network structures, with the overall accuracy serving as the evaluation metric for object classification. We trained each configuration for 300 epochs with a batch size of 16, and the results of the best-performing epochs are presented in Table 1. The training used a learning rate of 0.001 and a weight decay rate of 0.0001. First, Networks 1 and 5 exhibited lower performance than Networks 2, 3, and 4. Among Networks 2, 3, and 4, Network 4 stands out for its balanced performance, demonstrating accuracy comparable to Networks 2 and 3 while having lower computational complexity and fewer parameters. Additionally, an image size of 128 × 128 exhibited outstanding performance, whereas the performance was comparable between 1024 and 512 pillars. The best performance was achieved with 16 points per pillar. Consequently, the proposed object classification model adopted an image size of 128 × 128, 512 pillars, and 16 points per pillar, with the classification network structured according to Network 4.
Table 2 presents a performance comparison between the proposed classification network and other classification networks. Based on the DSCNN, the proposed network achieved the highest classification accuracy of 94.6% while requiring only 37 M multiply-accumulate (MAC) operations and 50.8 K parameters.
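For reference, the complexity gap in Table 2 can be reproduced analytically: a depth-wise separable layer requires roughly 1/oC + 1/k² of the MACs of a standard k × k convolution. The helper below counts MACs and parameters for one layer of each type; it is an illustrative calculation under assumed layer dimensions, not the exact accounting used for Table 2.

```python
def standard_conv_cost(in_ch, out_ch, k, out_h, out_w):
    """MACs and parameters of a standard k x k convolution (no bias)."""
    params = out_ch * in_ch * k * k
    macs = params * out_h * out_w
    return macs, params

def dsconv_cost(in_ch, out_ch, k, out_h, out_w):
    """MACs and parameters of a depth-wise (k x k) + point-wise (1 x 1) pair."""
    dw_params = in_ch * k * k
    pw_params = out_ch * in_ch
    macs = (dw_params + pw_params) * out_h * out_w
    return macs, dw_params + pw_params

# Example: 64 -> 128 channels, 3 x 3 kernel, 64 x 64 output feature map.
std = standard_conv_cost(64, 128, 3, 64, 64)
dsc = dsconv_cost(64, 128, 3, 64, 64)
print(f"standard:  {std[0] / 1e6:.1f} M MACs, {std[1] / 1e3:.1f} K params")
print(f"separable: {dsc[0] / 1e6:.1f} M MACs, {dsc[1] / 1e3:.1f} K params")
```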

3. Implementation of Acceleration System

The proposed acceleration system was implemented on an FPGA by integrating the PFE hardware accelerator intellectual property (IP) and classification network hardware accelerator IP with an MPU. Section 3.1 presents the bit width of each IP obtained through quantization experiments. Section 3.2 describes the structure and operations of the proposed acceleration system. Section 3.3 describes the structure and behavior of the PFE hardware accelerator, and Section 3.4 describes the FINN used to implement the object classification network hardware accelerator.

3.1. Quantization

In this study, we quantized the proposed object classification model to implement the proposed acceleration system in hardware. To minimize the performance reduction caused by quantization errors, we performed quantization-aware training (QAT). The results of experiments using different bit formats for the PFE and the classification network are listed in Table 3. The PFE is more accurate with 16 and 32 bits than with 8 bits, but there is no difference in accuracy between 16 and 32 bits; therefore, we adopted a 16 bit format for the PFE. The object classification network showed higher accuracy with 4 and 8 bits, with no difference between the two; therefore, we used 4 bits.
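As a conceptual illustration of QAT, the sketch below inserts a uniform fake-quantization step with a straight-through estimator into the forward pass, so the network trains while observing quantization error. The bit widths (16 bit for PFE activations, 4 bit for network weights) follow Table 3, but the code is a generic sketch and is not the quantization flow actually used in this work.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Uniform symmetric quantization with a straight-through gradient."""
    @staticmethod
    def forward(ctx, x, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through: pass the gradient unchanged

def quantize(x, bits):
    return FakeQuant.apply(x, bits)

# During QAT, weights are fake-quantized to 4 bits and activations to 16 bits
# before every forward computation, e.g. for one PW layer:
w = torch.randn(64, 9, requires_grad=True)       # full-precision master weights
features = torch.randn(512, 16, 9)               # (P, N, 9) pillar features
out = features @ quantize(w, 4).t()              # 4 bit weights
out = quantize(torch.relu(out), 16)              # 16 bit activations
out.sum().backward()                             # gradients reach full-precision w
print(w.grad.shape)
```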

3.2. System Architecture

Figure 6 shows the structure of the proposed acceleration system. To validate the system in an FPGA environment, the accelerators were integrated with an MPU via an advanced extensible interface (AXI) interconnect and implemented in an FPGA. Each accelerator IP is a slave to the MPU, and the MPU sends the start and end DRAM addresses that the IP will access, the number of words to be read from DRAM, AXI protocol parameters, and start and stop signals to the slave registers of the IP. The IP acts as a master to DRAM, generating read and write addresses through its master interface to fetch the required data from DRAM during computation and write back the results to DRAM.
The MPU first sets the slave registers of the PFE accelerator IP, which starts by reading the point cloud from DRAM through the master interface to organize the point cloud into pillars. The PFE accelerator checks the coordinates of the points and stores them in FIFO1 one after another. To store the point cloud in DRAM pillar by pillar, it generates the corresponding DRAM addresses and stores them in FIFO2. When all the DRAM addresses for the point cloud have been generated and stored in FIFO2, the master interface references FIFO1 and FIFO2 to store the point cloud at the appropriate locations in DRAM.
The PFE accelerator organizes the entire point cloud into pillars by appropriately storing it in DRAM, and then reads the point cloud back and stores it in FIFO1 and FIFO2. It also reads the weights and biases for performing PW convolution from DRAM and stores them in the weight memory (WM) and bias memory (BM). After adding the HCF to the point cloud, PW convolution is performed to expand the features, and a max-pool operation is performed. Thus, one pillar is encoded with 64 features, which is one pixel with 64 channels in a 2D image. The encoded pixel is stored in DRAM by considering its position.
After the PFE accelerator encodes all point clouds into 2D images, the MPU sets the slave register of the classification network accelerator IP and starts the accelerator to read images from DRAM through the master interface. After the classification network accelerator has read all images and has performed all computations, it writes the classification results to the slave register. The MPU completes the system operation by accessing the slave register of the classification network accelerator IP and reading the classification results.
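The control sequence in Section 3.2 amounts to a handful of slave-register accesses from the MPU. The Python sketch below models that flow from a Linux userspace process via /dev/mem; every base address, register offset, bit position, and buffer address is a hypothetical placeholder and not the actual register map of the implemented IPs.

```python
import mmap, os, struct

# Hypothetical physical base addresses and register offsets (placeholders only).
PFE_BASE, CLS_BASE = 0xA000_0000, 0xA001_0000
REG_SRC, REG_DST, REG_WORDS, REG_CTRL, REG_STATUS, REG_RESULT = 0x00, 0x04, 0x08, 0x0C, 0x10, 0x14

def reg_write(m, off, val): m[off:off + 4] = struct.pack("<I", val)
def reg_read(m, off): return struct.unpack("<I", m[off:off + 4])[0]

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
pfe = mmap.mmap(fd, 0x1000, offset=PFE_BASE)
cls = mmap.mmap(fd, 0x1000, offset=CLS_BASE)

# 1. Configure the PFE accelerator: DRAM addresses of the point cloud and the
#    encoded image, number of words to transfer, then assert the start bit.
reg_write(pfe, REG_SRC, 0x1000_0000)        # point cloud buffer in DRAM (placeholder)
reg_write(pfe, REG_DST, 0x1100_0000)        # encoded 64-channel image in DRAM
reg_write(pfe, REG_WORDS, 512 * 16 * 4)
reg_write(pfe, REG_CTRL, 1)                 # start
while reg_read(pfe, REG_STATUS) & 1 == 0:   # poll the done bit
    pass

# 2. Start the classification network accelerator on the encoded image and
#    read the class index back from its slave register.
reg_write(cls, REG_SRC, 0x1100_0000)
reg_write(cls, REG_CTRL, 1)
while reg_read(cls, REG_STATUS) & 1 == 0:
    pass
print("predicted class:", reg_read(cls, REG_RESULT))
```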

3.3. PFE Hardware Architecture

Figure 7 shows a block diagram of the PFE accelerator. The proposed PFE accelerator comprises a grouping unit (GU), an HCF unit (HU), a PFN unit (PU), and eight memories. The memories are FIFO1, FIFO2, number of points per pillar (NPPP) buffer (NB), pillar index buffer (PB), HCF buffer1 (HB1), HB2, WM, and BM. HB1 and HB2 are utilized in a ping-pong scheme by the HU and PU within a pipelined structure. As the HU operates, it writes results to HB1, while the PU concurrently accesses data from HB2. Subsequently, the HU switches to writing computation results to HB2, while the PU utilizes the data stored in HB1 during this phase.
The point clouds are stored in FIFO1 in order, and the grid locator in the GU checks the coordinates of the points to obtain the coordinates of the pillars containing the points. For example, if the width of a grid cell is an integer value of 2^5, the coordinates of the pillar containing a point can be determined by dividing the point's coordinates by 2^5; this division is replaced by a 5 bit right shift. The coordinates of the pillars are used as the read address of the NB to determine the NPPP. If the NPPP is 0, a new pillar is created; thus, the pillar counter is incremented by one, and its value is written to the PB to be used as the index of the pillar. If the NPPP is less than 16, it is incremented by 1 and written back to the NB; otherwise, it remains unchanged. The DRAM address is generated by using the pillar index multiplied by 16 (the maximum NPPP) as an offset and then adding the NPPP to this offset. The generated address is stored in FIFO2. With the points in FIFO1 and the DRAM addresses in FIFO2, the point cloud is stored at the correct locations in DRAM.
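The grouping logic above reduces to a shift, a counter lookup, and an address computation. The following Python model mirrors that behavior bit by bit; the grid shift of 5 and the 16-point offset follow the text, whereas the DRAM base address, the per-point word count, and the handling of a full pillar budget are illustrative assumptions.

```python
GRID_SHIFT = 5          # cell width of 2**5 -> divide by a 5 bit right shift
MAX_PILLARS, MAX_NPPP = 512, 16
POINT_WORDS = 4         # x, y, z, r stored per point (assumption)

nppp = {}               # NB: number of points per pillar, keyed by pillar coords
pillar_index = {}       # PB: pillar index per pillar coords
pillar_count = 0        # pillar counter

def grouping_address(x_int, y_int, dram_base=0x1000_0000):
    """Return the DRAM word address at which this point is stored, or None
    if its pillar is full or the pillar budget is exhausted."""
    global pillar_count
    cell = (x_int >> GRID_SHIFT, y_int >> GRID_SHIFT)   # pillar coordinates
    if cell not in pillar_index:                        # NPPP == 0: new pillar
        if pillar_count >= MAX_PILLARS:
            return None
        pillar_index[cell] = pillar_count
        nppp[cell] = 0
        pillar_count += 1
    if nppp[cell] >= MAX_NPPP:                          # pillar full: drop point
        return None
    offset = pillar_index[cell] * MAX_NPPP + nppp[cell] # index * 16 + NPPP
    nppp[cell] += 1
    return dram_base + offset * POINT_WORDS

print(hex(grouping_address(37, 70)))   # first point of pillar (1, 2)
print(hex(grouping_address(40, 65)))   # second point of the same pillar
```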
After all the point clouds are stored in DRAM in pillar order, they are read back into FIFO1 and FIFO2. In FIFO2, the DRAM write addresses are stored during the grouping operation, while the DRAM read data are stored at other times. The PFE accelerator then processes all point clouds pillar by pillar. The averaging unit (AU) of the HU adds the x, y, and z coordinates of all points in the pillar and divides the sums by the NPPP to obtain the arithmetic mean coordinates of the points. The offset unit in the HU computes x_m, y_m, z_m by subtracting the mean coordinates obtained from the AU from the x, y, z coordinates of each point, and x_c, y_c by subtracting the center coordinates of the pillar from the x, y coordinates of each point. The values x, y, z, r, x_m, y_m, z_m, x_c, y_c of each point with the HCF added are then stored in the HB.
Figure 8a shows the tensor-level operation performed by the PU. The PU executes PW convolution, batch normalization, max-pool, and ReLU operations on the HCF. The weights and biases of the batch normalization can be combined with those of the PW convolution layer. In the PW convolution, only the weights are used, and the biases are added after a biased ReLU operation, which thresholds values at -bias instead of 0.
Figure 8b represents the operation in Figure 8a at the matrix level. The PU performs the multiplication of the HCF matrix of size (16, 9) and the weight matrix of size (9, 64) using an 8 × 32 MAC array. To do this, the HCF matrix is divided into two matrices of size (8, 9), and the weight matrix is divided into two matrices of size (9, 64), performing the entire matrix multiplication in four parts. For instance, the matrix multiplication of HCF tile 2 and weight tile 1 produces result tile 2, and these tiles are max-pooled to create max-pooled tile 1.
Figure 8c illustrates the hardware-level operation of the PU in chronological order. The PU reads HCF tile 1 and weight tile 1 from the HB and WM, respectively. The PU reads HCF tile 1 column by column and weight tile 1 row by row over 9 cycles, feeding them into the MAC array. The MAC array processes the HCF and weight inputs over 9 cycles and outputs result tile 1 of size (8, 32). Result tile 1 is then processed by the parallel max-pool unit, producing intermediate tile 1 of size (1, 32). Because the parallel max-pool unit immediately performs max-pool operations on the MAC array output, it saves cycles and buffer memory space by not storing and reloading result tile 1. Intermediate tile 1 is compared with the −bias value in the serial max-pool unit, and the larger value is stored in the register. By performing the biased ReLU operation in the serial max-pool unit, the ReLU operation is integrated into the max-pool operation. After creating intermediate tile 1, the PU reads HCF tile 2 and weight tile 1 to produce intermediate tile 2. Intermediate tile 2 undergoes a max-pool operation with intermediate tile 1 stored in the serial max-pool unit's register, resulting in max-pooled tile 1. After the bias is added and the result is rounded and clipped, max-pooled tile 1 forms half of a pixel. This half pixel is stored at the correct location in DRAM based on its coordinates in the image. To create the DRAM write address, the x coordinate of the pixel multiplied by the image's width is used as an offset, and the y coordinate is added to this offset. When the PU has created both max-pooled tiles 1 and 2, one complete pixel resides in DRAM.
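The tile schedule in Figure 8b,c can be checked against a plain matrix computation. The NumPy sketch below splits the (16, 9) HCF matrix and the (9, 64) weight matrix into the same tiles as the 8 × 32 MAC array, fuses the max-pool into the tile loop (a parallel max over 8 rows followed by a running serial max), and verifies that the result matches a direct matmul followed by max-pool. The biased-ReLU, rounding, and clipping steps are omitted for clarity, and the values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
hcf = rng.standard_normal((16, 9)).astype(np.float32)      # one pillar, HCF added
weights = rng.standard_normal((9, 64)).astype(np.float32)  # PW convolution weights

# Reference: full matmul, then max-pool over the 16 points of the pillar.
reference = (hcf @ weights).max(axis=0)                     # (64,)

# Tiled schedule of the 8 x 32 MAC array: 2 HCF tiles x 2 weight tiles.
pixel = np.full(64, -np.inf, dtype=np.float32)
for wt in range(2):                                         # weight tiles -> output halves
    serial_max = np.full(32, -np.inf, dtype=np.float32)     # serial max-pool register
    for ht in range(2):                                      # HCF tiles -> point halves
        hcf_tile = hcf[ht * 8:(ht + 1) * 8, :]               # (8, 9)
        w_tile = weights[:, wt * 32:(wt + 1) * 32]           # (9, 32)
        result_tile = hcf_tile @ w_tile                      # MAC array output (8, 32)
        intermediate = result_tile.max(axis=0)               # parallel max-pool (1, 32)
        serial_max = np.maximum(serial_max, intermediate)    # serial max-pool
    pixel[wt * 32:(wt + 1) * 32] = serial_max                # max-pooled tile -> half pixel

assert np.allclose(pixel, reference)
print("tiled result matches the direct computation")
```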

3.4. Classification Network Accelerator Using FINN

Xilinx offers various tools for implementing deep learning networks. Vitis AI is a platform used to implement and optimize artificial intelligence and machine learning applications on FPGAs and adaptive system-on-chip platforms. Vitis AI supports various AI and machine learning frameworks and libraries and generates a deep learning processing unit (DPU) IP. FINN is an open-source framework for FPGA-based quantized deep learning network inference. FINN takes an Open Neural Network Exchange (ONNX) model as input, optimizes the network, and generates a hardware accelerator.
In this study, we used FINN to implement a hardware version of an object classification network. Whereas the DPU serves as a general-purpose accelerator for deep learning operations, FINN optimizes the hardware architecture and data flow specifically for the deep learning network. Consequently, FINN offers lower latency, enables real-time response speeds, and can achieve high throughput by optimizing the pipeline [29]. Moreover, networks with low complexity, such as the proposed classification network, can significantly reduce hardware resource usage. In addition, FINN supports 4 bit integers, providing an advantage over DPUs. Leveraging the benefits of FINN, ref. [19] implemented a backbone layer of PointPillars using FINN.

4. Implementation Results

The proposed acceleration system was implemented on a Xilinx Zynq Ultrascale+ ZCU104 [30]. Table 4 summarizes the hardware resource usage and operating frequency of the proposed acceleration system. The proposed acceleration system utilized 75,548 configurable logic block (CLB) look-up tables (LUTs), 56,931 CLB registers, 374 digital signal processors (DSPs), and 65 block RAMs. The operating frequency was 187.5 MHz, which enables high-speed operation, and the response time of the proposed object classification system was 6.41 ms.
Table 5 compares the response times of the firmware-based PFE implementation on an ARM Cortex A53 (baseline) operating at 1.2 GHz and the proposed PFE hardware accelerator (proposed) in terms of the execution time for each operation. The grouping operation, responsible for converting a point cloud into a set of pillar units, experienced a remarkable acceleration of 23.67 times, reducing from 0.71 ms to 0.03 ms. Notably, the HCF and PFN operations achieved a significant acceleration of approximately 42.6 times, decreasing from 8.52 ms to 0.20 ms. This notable acceleration was enabled by leveraging the parallel computation capability of the MAC array in the PFE accelerator and the pipelined structure of the HCF and PFN operations. The overall execution time for PFE operations was reduced from 9.23 ms to 0.23 ms, demonstrating a 40 times acceleration effect compared with the firmware-based implementation.
Table 6 presents a comparison between the proposed object classification system and pillar-based methods implemented in previous studies. In [19,20,21], a heterogeneous system between the MPU and FPGA was constructed, whereas our system represents an end-to-end hardware implementation of all operations on an FPGA. In [19], PFE computation was performed on an MPU, and deep learning network computation was executed using an accelerator implemented through FINN. Ref. [20] computed the PFE on an MPU and executed a deep learning network on an accelerator implemented using a register-transfer level (RTL) design. Ref. [21] utilized both an MPU and DPU for PFE computation, and a DPU for deep learning network computation. We computed the PFE with an accelerator implemented through an RTL design and deep learning network with an accelerator implemented through FINN.
Our method achieved the fastest response time of 6.41 ms because it was the only one that accelerated the PFE and because the proposed object classification model was designed, through various experiments, to have low computational complexity. Moreover, these experiments resulted in an object classification model with fewer parameters. Our method used more DSPs than that in [19] but significantly fewer LUTs, registers, and memory. Ref. [19] used the same edge-oriented FPGA platform as ours, the ZCU104, achieving low power consumption; however, it still consumed more power than our system, which implements an efficient deep learning network. Compared with the method in [20], our method used fewer resources across the board, with particularly notable differences in DSP and memory usage. Ref. [20] also consumed low power by implementing an efficient deep learning network. Additionally, since ref. [20] designed the deep learning network accelerator at the RTL level, it consumed even less power than ours, which was designed based on HLS. In addition, our method used more LUTs but fewer registers and DSPs, and substantially less BRAM and URAM, than that in [21]. Furthermore, our method consumed significantly less power than that in [21]. Ref. [21] used the U280 FPGA platform targeting server environments, which has much higher power consumption and is not suitable for edge environments.

5. Conclusions

In this study, we propose a deep learning object classification model based on a pillar-based method for real-time responses of autonomous mobile robots. The object classification network of the deep learning model is based on the DSCNN, which has a low complexity and few parameters, making it suitable for edge device environments. The object classification model was trained using a dataset that considered the operating environment of autonomous mobile robots. To minimize the performance reduction due to quantization, we quantized and tested the model using QAT, achieving a classification accuracy of 94.3%. Because the computation of the PFE in pillar-based object classification models was time-consuming, we proposed a hardware accelerator for the PFE. The PFE accelerator was implemented through RTL design and achieved a response time of 0.23 ms, which was 40 times faster than the firmware program. The accelerator for the object classification network of the proposed deep learning model was implemented through the FINN framework and achieved an execution time of 6.18 ms. We also proposed an object classification system that integrates the proposed PFE accelerator and object classification network accelerator through an AXI interconnect. The proposed object classification system was implemented and verified on an FPGA and achieved a real-time response time of 6.41 ms.
The proposed PFE accelerator can be combined with other deep learning network accelerators to accelerate more diverse deep learning models. However, in this study, we only performed acceleration on the object classification models. In future work, we plan to implement a system that supports acceleration for more diverse and challenging deep learning models.

Author Contributions

C.P. designed and implemented the proposed acceleration system, performed the experiment and evaluation, and wrote the paper. S.L. evaluated the proposed acceleration system and revised this manuscript. Y.J. conceived of and led the research, analyzed the experimental results, and wrote the paper. All authors read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2022-0-00960), and the CAD tools were supported by IDEC.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Varlamov, O. “Brains” for Robots: Application of the Mivar Expert Systems for Implementation of Autonomous Intelligent Robots. Big Data Res. 2021, 25, 100241. [Google Scholar] [CrossRef]
  2. Liu, Y.; Li, Z.; Liu, H.; Kan, Z. Skill Transfer Learning for Autonomous Robots and Human–robot Cooperation: A Survey. Robot. Auton. Syst. 2020, 128, 103515. [Google Scholar] [CrossRef]
  3. Yoshioka, M.; Suganuma, N.; Yoneda, K.; Aldibaja, M. Real-time Object Classification for Autonomous Vehicle using LIDAR. In Proceedings of the 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 24–26 November 2017; pp. 210–211. [Google Scholar]
  4. Gao, H.; Cheng, B.; Wang, J.; Li, K.; Zhao, J.; Li, D. Object Classification using CNN-based Fusion of Vision and LIDAR in Autonomous Vehicle Environment. IEEE Trans. Ind. Inform. 2018, 14, 4224–4231. [Google Scholar] [CrossRef]
  5. Zhou, Y.; Liu, L.; Zhao, H.; López-Benítez, M.; Yu, L.; Yue, Y. Towards Deep Radar Perception for Autonomous Driving: Datasets, Methods, and Challenges. Sensors 2022, 22, 4208. [Google Scholar] [CrossRef] [PubMed]
  6. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar]
  7. Hoang, L.; Lee, S.H.; Lee, E.J.; Kwon, K.R. GSV-NET: A Multi-modal Deep Learning Network for 3D Point Cloud Classification. Appl. Sci. 2022, 12, 483. [Google Scholar] [CrossRef]
  8. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  9. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end Learning for Point Cloud based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  10. Bhanushali, D.; Relyea, R.; Manghi, K.; Vashist, A.; Hochgraf, C.; Ganguly, A.; Kwasinski, A.; Kuhl, M.E.; Ptucha, R. LiDAR-camera Fusion for 3D Object Detection. Electron. Imaging 2020, 32, 1–9. [Google Scholar] [CrossRef]
  11. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  12. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  13. Lis, K.; Kryjak, T. PointPillars Backbone Type Selection for Fast and Accurate LiDAR Object Detection. In Proceedings of the International Conference on Computer Vision and Graphics, Warsaw, Poland, 19–21 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 99–119. [Google Scholar]
  14. Shu, X.; Zhang, L. Research on PointPillars Algorithm based on Feature-Enhanced Backbone Network. Electronics 2024, 13, 1233. [Google Scholar] [CrossRef]
  15. Wang, Y.; Han, X.; Wei, X.; Luo, J. Instance Segmentation Frustum–PointPillars: A Lightweight Fusion Algorithm for Camera–LiDAR Perception in Autonomous Driving. Mathematics 2024, 12, 153. [Google Scholar] [CrossRef]
  16. Agashe, P.; Lavanya, R. Object Detection using PointPillars with Modified DarkNet53 as Backbone. In Proceedings of the 2023 IEEE 20th India Council International Conference (INDICON), Hyderabad, India, 14–17 December 2023; pp. 114–119. [Google Scholar]
  17. Choi, Y.; Kim, B.; Kim, S.W. Performance Analysis of PointPillars on CPU and GPU Platforms. In Proceedings of the 2021 36th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), Jeju, Republic of Korea, 27–30 June 2021; pp. 1–4. [Google Scholar]
  18. Silva, A.; Fernandes, D.; Névoa, R.; Monteiro, J.; Novais, P.; Girão, P.; Afonso, T.; Melo-Pinto, P. Resource-constrained onboard Inference of 3D Object Detection and Localisation in Point Clouds Targeting Self-driving Applications. Sensors 2021, 21, 7933. [Google Scholar] [CrossRef] [PubMed]
  19. Stanisz, J.; Lis, K.; Gorgon, M. Implementation of the Pointpillars Network for 3D Object Detection in Reprogrammable Heterogeneous Devices using FINN. J. Signal Process. Syst. 2022, 94, 659–674. [Google Scholar] [CrossRef]
  20. Li, Y.; Zhang, Y.; Lai, R. TinyPillarNet: Tiny Pillar-based Network for 3D Point Cloud Object Detection at Edge. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1772–1785. [Google Scholar] [CrossRef]
  21. Latotzke, C.; Kloeker, A.; Schoening, S.; Kemper, F.; Slimi, M.; Eckstein, L.; Gemmeke, T. FPGA-based Acceleration of Lidar Point Cloud Processing and Detection on the Edge. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–8. [Google Scholar]
  22. Tan, M.; Le, Q. Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  23. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  25. Lu, G.; Zhang, W.; Wang, Z. Optimizing Depthwise Separable Convolution Operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 70–87. [Google Scholar] [CrossRef]
  26. Kaiser, L.; Gomez, A.N.; Chollet, F. Depthwise Separable Convolutions for Neural Machine Translation. arXiv 2017, arXiv:1706.03059. [Google Scholar]
  27. Ouster. Ouster OS1 Lidar Sensor. Available online: https://ouster.com/products/hardware/os1-lidar-sensor (accessed on 22 May 2024).
  28. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
  29. Xilinx Inc. Vitis™ AI Documentation Frequently Asked Questions. Available online: https://xilinx.github.io/Vitis-AI/3.0/html/docs/reference/faq.html#what-is-the-difference-between-the-vitis-ai-integrated-development-environment-and-the-finn-workflow (accessed on 22 May 2024).
  30. AMD. UltraScale+ ZCU104. Available online: https://www.xilinx.com/products/boards-and-kits/zcu104.html (accessed on 22 May 2024).
Figure 1. Overview of the proposed acceleration system.
Figure 2. Operation mechanism of DSCNN.
Figure 3. Examples of dataset.
Figure 4. Configuration of dataset classes: (a) building; (b) tree; (c) vehicle; (d) bicycle; (e) obstacle; (f) greenery; (g) person; (h) urban fixture.
Figure 5. Architecture of DSCNN-based object classification network: (a) Network 1; (b) Network 2; (c) Network 3; (d) Network 4; (e) Network 5.
Figure 6. Architecture of proposed acceleration system on FPGA.
Figure 7. Block diagram of PFE accelerator.
Figure 8. Operation of the PU: (a) tensor-level operation; (b) matrix-level operation; (c) hardware-level operation.
Table 1. Accuracy of model by various hyperparameters and networks.

| Image Size | Number of Pillars | Points per Pillar | Network 1 | Network 2 | Network 3 | Network 4 | Network 5 |
|---|---|---|---|---|---|---|---|
| 256 × 256 | 2048 | 32 | 91.8% | 93.5% | 93.5% | 92.9% | 89.7% |
| 256 × 256 | 2048 | 16 | 92.9% | 92.9% | 92.4% | 92.4% | 88.0% |
| 256 × 256 | 2048 | 8 | 92.9% | 92.9% | 92.9% | 92.4% | 89.7% |
| 256 × 256 | 1024 | 32 | 92.4% | 92.4% | 91.8% | 92.9% | 89.1% |
| 256 × 256 | 1024 | 16 | 92.4% | 92.9% | 92.4% | 92.9% | 90.2% |
| 256 × 256 | 1024 | 8 | 93.5% | 93.5% | 92.4% | 92.4% | 89.7% |
| 256 × 256 | 512 | 32 | 92.4% | 92.9% | 93.5% | 93.5% | 89.7% |
| 256 × 256 | 512 | 16 | 92.9% | 94.0% | 92.4% | 93.5% | 89.7% |
| 256 × 256 | 512 | 8 | 94.6% | 92.9% | 92.4% | 91.8% | 90.8% |
| 128 × 128 | 1024 | 32 | 90.8% | 91.8% | 92.4% | 94.6% | 90.8% |
| 128 × 128 | 1024 | 16 | 90.8% | 94.6% | 92.9% | 92.4% | 91.3% |
| 128 × 128 | 1024 | 8 | 90.8% | 91.8% | 91.8% | 94.0% | 91.8% |
| 128 × 128 | 512 | 32 | 90.8% | 94.0% | 91.8% | 93.5% | 92.4% |
| 128 × 128 | 512 | 16 | 91.3% | 92.9% | 93.5% | 94.6% | 92.9% |
| 128 × 128 | 512 | 8 | 90.8% | 93.5% | 91.8% | 92.9% | 92.4% |
| 128 × 128 | 256 | 32 | 91.8% | 93.5% | 92.9% | 91.8% | 90.8% |
| 128 × 128 | 256 | 16 | 90.8% | 92.9% | 92.4% | 91.3% | 90.8% |
| 128 × 128 | 256 | 8 | 90.8% | 94.0% | 94.0% | 92.4% | 91.3% |
| 64 × 64 | 512 | 32 | 89.7% | 89.7% | 92.9% | 91.8% | 91.3% |
| 64 × 64 | 512 | 16 | 88.6% | 90.2% | 93.5% | 91.8% | 91.8% |
| 64 × 64 | 512 | 8 | 88.6% | 90.8% | 92.9% | 92.4% | 92.4% |
| 64 × 64 | 256 | 32 | 88.6% | 90.2% | 93.5% | 90.8% | 90.8% |
| 64 × 64 | 256 | 16 | 87.5% | 89.7% | 93.5% | 91.8% | 91.8% |
| 64 × 64 | 256 | 8 | 87.5% | 90.8% | 93.5% | 91.8% | 90.8% |
| 64 × 64 | 128 | 32 | 88.6% | 89.1% | 89.7% | 90.8% | 89.7% |
| 64 × 64 | 128 | 16 | 89.1% | 90.2% | 91.3% | 90.2% | 90.8% |
| 64 × 64 | 128 | 8 | 86.4% | 92.4% | 91.3% | 89.1% | 91.8% |
Table 2. Performance comparison with other networks.

| Network | Accuracy | Number of MACs | Number of Parameters |
|---|---|---|---|
| LeNet | 92.2% | 787.6 M | 2.0 M |
| ResNet18 | 83.8% | 1.4 G | 11.4 M |
| ResNet34 | 82.3% | 2.0 G | 21.5 M |
| MobileNet | 85.7% | 128.3 M | 3.2 M |
| Ours | 94.6% | 37.0 M | 50.8 K |
Table 3. Accuracy according to bit format of PFE and classification network.

| PFE Bit Width | Classification Network 2 bit | Classification Network 4 bit | Classification Network 8 bit |
|---|---|---|---|
| 8 | 90.3% | 91.7% | 92.3% |
| 16 | 89.7% | 94.3% | 94.3% |
| 32 | 88.7% | 92.7% | 94.3% |
Table 4. Hardware resources of proposed object classification system.

| Unit | CLB LUTs | CLB Registers | DSPs | Block RAM | Frequency (MHz) |
|---|---|---|---|---|---|
| PFE | 35,567 | 28,513 | 302 | 10.5 | |
| Classification Network | 34,788 | 21,625 | 72 | 54.5 | |
| AXI Interconnect | 5,193 | 6,793 | 0 | 0 | |
| Total | 75,548 | 56,931 | 374 | 65 | 187.5 |
Table 5. PFE response time comparison with baseline.

| Operation | Baseline (Firmware) | Proposed (Hardware) |
|---|---|---|
| Grouping | 0.71 ms | 0.03 ms |
| HCF | 0.20 ms | 0.20 ms (HCF + PFN, pipelined) |
| PFN | 8.32 ms | |
| Total | 9.23 ms | 0.23 ms |
Table 6. Comparison with other pillar-based implementations.

| | [19] | [20] | [21] | Ours |
|---|---|---|---|---|
| Platform | ZCU104 | ZC706 | U280 | ZCU104 |
| Computation Device | MPU & FPGA (FINN) | MPU & FPGA (RTL) | MPU & FPGA (DPU) | FPGA (RTL & FINN) |
| Execution Time (ms) | 377.1 | 43.26 | 64.1 | 6.41 |
| CLB LUTs | 189,074 | 128,721 | 48,006 | 75,548 |
| CLB Registers | 150,187 | 111,118 | 85,801 | 56,931 |
DSPs88883525374
Block RAM159382.5131.565
Ultra RAM--641
Frequency (MHz)150150300187.5
Power (W)6.53.673.84.4
