Article

Improved 3D Object Detection Based on PointPillars

Weiwei Kong, Yusheng Du, Leilei He and Zejiang Li
1 School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
2 Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an 710121, China
3 Xi’an Key Laboratory of Big Data and Intelligent Computing, Xi’an 710121, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2915; https://doi.org/10.3390/electronics13152915
Submission received: 27 May 2024 / Revised: 16 July 2024 / Accepted: 22 July 2024 / Published: 24 July 2024
(This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications)

Abstract: Despite recent advances in 3D object detection, conventional 3D point cloud detection algorithms exhibit limited accuracy for small objects. To address this challenge, this paper adopts the PointPillars algorithm as the baseline model and proposes a two-stage 3D object detection approach in which the point cloud is processed with Transformer models and a redefined attention mechanism is introduced to further enhance detection capability. The first stage follows PointPillars, whose central concept is to divide the point cloud space into equal-sized pillars. During feature extraction, when the features from all pillars are transformed into pseudo-images, the proposed algorithm incorporates an attention mechanism adapted from the Squeeze-and-Excitation (SE) method to emphasize informative features and suppress uninformative ones. Furthermore, the 2D convolution of the traditional backbone network is replaced with dynamic convolution, and the added attention mechanism further improves the feature representation ability of the network. In the second stage, the candidate boxes generated in the first stage are refined with a Transformer-based approach: the encoder constructs initial point features from the candidate boxes for encoding, while the decoder applies channel weighting to enhance channel information, improving detection accuracy and reducing false detections. Experimental results on the KITTI dataset show that the proposed method significantly improves small-object detection compared with the baseline PointPillars: at the moderate difficulty level, the average precision (AP) values for cars, pedestrians, and cyclists increased by 5.30%, 8.1%, and 10.6%, respectively. Moreover, the proposed method surpasses existing mainstream approaches in the cyclist category.

1. Introduction

LiDAR technology plays a pivotal role in the development of self-driving vehicles, providing a reliable and robust means for environmental sensing and decision-making, which solidifies its status as a foundational component in this rapidly advancing field [1]. Unlike conventional image data, which only provide two-dimensional information, 3D point cloud data offer richer spatial details and a more comprehensive understanding of the surrounding environment. This direct access to three-dimensional scene information enables a more accurate and realistic perception of the world, which is critical for self-driving applications. Point cloud data are primarily acquired by scanning LiDAR sensors and are characterized by disorder, lack of structure, inconsistent density, and incomplete information. Therefore, networks that are well studied in 2D object detection cannot be directly used to process point cloud data [2]. Research in the field of 3D object detection is primarily categorized into three main branches according to the data sources used: LiDAR-based 3D object detection, camera-based 3D object detection [3], and multi-modal 3D object detection, which integrates both LiDAR and camera data [4]. This paper focuses on LiDAR-based 3D object detection. LiDAR-based 3D object detection models process the data in different forms and are broadly categorized into four types: point-based, grid-based, point-voxel-based, and range-based methods [5].
The effectiveness of point-based 3D object detection methods is largely limited by the sampling strategy: using more context points strengthens the representation, but it also leads to excessive memory requirements. Moreover, the uneven distribution of points in point cloud data may lead to the oversampling of dense areas and the undersampling of sparse areas, thus reducing detection accuracy. A significant advantage of processing 3D data directly is the availability of rich spatial information, which enables the extraction of more effective target features; the point representation also suits complex environments, as it captures environmental information more comprehensively, leading to improved performance in various challenging scenarios. Relevant methods include PointNet++ [6], Pointformer [7], Point-GNN [8], and 3DSSD [9]. Grid-based 3D object detection methods first transform the point cloud into a discrete grid representation, a process known as “voxelization”; the voxelized point cloud is then converted into a pseudo-image and fed into a traditional 2D convolutional neural network for feature extraction, so that mature 2D image processing techniques can be applied to 3D data. The advantage of this family of algorithms lies in the discrete grid-based representation, which simplifies the point cloud and improves efficiency. Typical algorithms include PointPillars [10], CenterPoint [11], VoTr [12], and Part-A2 [13]. Point-voxel-based 3D object detection methods usually integrate feature information from both points and voxels: the point cloud captures detailed geometric information, while voxels provide a structured, mesh-like representation for efficient computation. Compared to voxel-based detection methods, the point-voxel combination offers improved detection accuracy, albeit with the trade-off of a longer inference time. Representative algorithms include SA-SSD [14], PVGNet [15], and CT3D [16]. Range-based 3D object detection methods, such as RangeDet [17], To the Point [18], and RSN [19], process point cloud data by generating range images based on the distance information between points, rather than working directly with the original 3D spatial coordinates. This approach has proven effective in capturing local spatial information while avoiding the challenges associated with traditional point-based and voxel-based approaches.
In this article, PointPillars, the most typical grid-based 3D target detection method, is selected as the baseline model. It operates directly on the native point cloud data and extracts features with a two-dimensional convolutional neural network, both of which significantly improve inference speed. PointPillars divides the native point cloud into a set of vertically aligned pillars and then performs feature extraction on the pillar representation with 2D convolution. The advantage of this design is that large-scale point cloud data can be processed more quickly while important information in the original space is preserved. It strikes a good balance between speed and accuracy but suffers from the poor detection of small-sized objects. To address this, we introduce an adapted attention mechanism in the feature encoding stage, optimize the 2D convolutional neural network, and finally perform candidate box refinement with the help of the Transformer [20]. Extensive comparison experiments and ablation studies on the KITTI dataset [21] demonstrate that the proposed method yields significant improvements in small target detection.

2. Related Work

This study focuses on 3D object detection using a grid-based approach. The fundamental idea of this approach is to first divide the cluttered point cloud into cells of the same size and then utilize three-dimensional or two-dimensional convolution for feature extraction. The VoxelNet [22] algorithm proposed by Yin Zhou and Oncel Tuzel separates the native point cloud data into equal-sized three-dimensional voxels and uses voxel feature encoding (VFE) to convert all the points within each voxel into a uniform feature vector. It then utilizes the extracted features for object detection and semantic segmentation. This network architecture is directly applicable to 3D point cloud data without the need for manual feature engineering (e.g., BEV). Lang, A. H. et al. proposed the PointPillars [10] algorithm, which achieves object detection by dividing the native point cloud into equal-sized pillars, then using a feature encoder network to transform the input point cloud into a sparse pseudo-image and obtaining high-level features through 2D convolution. Compared with the traditional point-based and voxel-based methods, the pillar-based method improves processing efficiency and reduces computational complexity. However, the pillarization process used to generate the pseudo-image can result in the loss of fine-grained details within the point cloud data. This, in turn, can have a detrimental effect on the accurate detection of small targets, as the subtle features that are crucial for precise identification and localization may be compromised; consequently, the overall effectiveness and reliability of the detection algorithm can be adversely impacted. Yin, T. et al. proposed the two-stage CenterPoint algorithm [11], which is based on a keypoint detector for object detection, representation, and tracking. Its detection pipeline consists of two stages. In the first stage, it extracts the BEV features of LiDAR point clouds using a voxel or pillar representation; 2D CNN detection heads are used to determine the target centers, and these center features are used to regress the full 3D bounding box. During the second stage, the detection box generated in the first stage is leveraged to extract point features at the center of the box, which are then used to regress the detection box score and perform subsequent refinement, resulting in a robust and reliable multi-stage approach for 3D target detection. In the VoTr model [12] developed by Mao, J. et al., the Transformer serves as the 3D backbone, replacing the 3D sparse convolution backbone of the single-stage detector SECOND (VoTr-SSD) and of the two-stage detector PV-RCNN (VoTr-TSD). Additionally, it addresses the difficulty of applying the Transformer directly to sparse, non-empty voxels: through the fast voxel query and attention mechanism proposed by the authors, attention operations can be performed effectively on the sparse, non-empty voxels, which exploits the power of the Transformer for voxel-based tasks.
In recent years, numerous advancements have been made in pillar-oriented 3D target detection. As a notable example, Anshul Paigwar et al. proposed Frustum-PointPillars [23], which incorporates both point cloud features and RGB images to significantly enhance LiDAR-based 3D object detection. Additionally, a novel approach that applies Gaussian-based masking to 3D points is introduced, which effectively distinguishes foreground objects from background clutter and leads to the more accurate localization of objects in three-dimensional space. Zhang, Lin, et al. developed TGPP [24], which segments the original point cloud into multiple pillars for subsequent processing and utilizes a multi-head attention mechanism to extract both global context features and local structure features. The effectiveness of Transformer models in learning context-aware representations has made them a promising direction for computer vision research. Building on this potential, Hualian Sheng et al. developed the CT3D [16] method. This method utilizes SECOND [25] to generate candidate frames, which are then refined by leveraging the strengths of Transformer models to learn and represent the intricate contextual information of 3D point cloud data. This approach has shown promising results in improving the accuracy of 3D object detection. Beyond applying Transformer models for frame refinement, CT3D also captures global context information among points by implementing a multi-tier self-attention mechanism. This mechanism refines the points through a channel attention decoding module, which first performs repeated query and key matrix multiplication, followed by dot-product re-weighting of the key to generate decoding weights. This approach preserves global information while emphasizing local information at the channel level. The Voxel Transformer for 3D object detection [12] was introduced by Jiageng Mao and colleagues. The authors proposed a generalized Transformer-based 3D backbone, which primarily comprises a series of sparse and submanifold voxel modules. Through a special attention mechanism and fast voxel querying, the sparse voxels can effectively perform self-attention and capture information over a large range.
In general, various 3D object detection methods possess distinct advantages. In this work, we propose a new attention method to enhance feature expressiveness based on the SE [26] attention mechanism. Additionally, we introduce the Transformer [20] to refine candidate frames and modify the backbone network using dynamic convolution. Further details are described below.

3. Network Architecture Overview

The algorithmic network framework in this paper adopts the PointPillars network because of its very high runtime efficiency, with an inference rate exceeding the LiDAR scanning frequency. The algorithm takes a point cloud as its input and is able to detect road vehicles, pedestrians, and cyclists, using 3D bounding boxes to enclose the predicted objects. The algorithm is mainly divided into three parts, and the network structure is shown in Figure 1.
  • Pillar feature attention net (PFANet).
  • Backbone (2D CNN).
  • Redefined detection head based on Transformer (RDHT).
In this section, we decouple them and briefly review each section.

3.1. Point Feature Attention Net

This part of the PFANet network architecture is used to convert the original point cloud into a pseudo-image. The algorithm structure is shown in Figure 2 below.
First, each pillar is defined as a small three-dimensional cell, obtained by dividing the point cloud in the xy plane (Cartesian coordinate system) at specific intervals. Then, the points in each pillar are encoded into a nine-dimensional vector D. In light of the computational complexity associated with 3D point cloud processing, it is necessary to place some limitations on the number of pillars and feature vectors used in the algorithm. Specifically, the number of non-empty pillars will be restricted to a maximum of P, while each pillar will contain no more than N feature vectors. By following the above method, a point cloud data frame is encoded as a tensor with dimensions ( D , P , N ) .
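As an illustration of this encoding step, the sketch below groups raw LiDAR points into pillars and builds the (D, P, N) tensor; the detection range, pillar size, and the P and N limits are illustrative values, not the exact settings used in this paper.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16, max_pillars=12000, max_points=32):
    """Group LiDAR points (x, y, z, reflectance) into vertical pillars and
    encode each point as a 9-D vector (x, y, z, r, xc, yc, zc, xp, yp):
    offsets to the pillar's point mean and to the pillar's geometric center.
    Returns a dense (D, P, N) tensor, zero-padded for empty slots."""
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int32)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int32)
    pillars = {}
    for idx, key in enumerate(zip(ix, iy)):
        pillars.setdefault(key, []).append(idx)

    D, P, N = 9, max_pillars, max_points
    tensor = np.zeros((D, P, N), dtype=np.float32)
    for p, ((i, j), idxs) in enumerate(list(pillars.items())[:P]):
        pts = points[idxs][:N]                       # truncate crowded pillars
        mean = pts[:, :3].mean(axis=0)               # pillar point mean
        cx = x_range[0] + (i + 0.5) * pillar_size    # pillar center x
        cy = y_range[0] + (j + 0.5) * pillar_size    # pillar center y
        feat = np.concatenate([pts[:, :4],
                               pts[:, :3] - mean,
                               pts[:, :2] - np.array([cx, cy])], axis=1)
        tensor[:, p, :len(pts)] = feat.T
    return tensor
```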
Secondly, we propose a point feature attention net (PFANet). In this work, we have modified the squeeze-and-excitation network (SENet) and will introduce the Redefined-SENet (R-SENet) in detail. By introducing R-SENet, this module can better utilize global information.

3.1.1. SENet

The SENet includes a squeeze operation and an excitation operation to capture inter-channel relationships [26]. In the squeeze phase, the module compresses the output feature map of the convolutional layer into a feature vector using global average pooling, converting an input of H × W × C into an output of 1 × 1 × C. Subsequently, in the excitation phase, fully connected layers with a nonlinear activation function generate a channel weighting vector. This vector is then applied to each channel of the original feature map, so the module learns to adjust the importance of features across the channels. The specific structure is shown in Figure 3.

3.1.2. Squeeze

The H × W × C feature map containing global information is directly squeezed into a 1 × 1 × C feature vector Z. The channel features of each of the C feature maps are compressed into a single value, which makes the generated channel-level statistics Z contain contextual information, alleviating the problem of channel dependency.
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)

3.1.3. Excitation

To capitalize on the information aggregated by the squeeze operation, the model fully captures the channel dependencies through the excitation operation. This function must satisfy two essential criteria: first, it must be flexible enough to learn nonlinear interactions between channels; second, it must be capable of learning non-mutually exclusive relationships, allowing the model to emphasize multiple channels simultaneously, unlike a one-hot activation, which would limit this capability.
s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma\left(W_2\, \delta\left(W_1 z\right)\right)
By the formula above, we can see that the SE module employs a gating mechanism composed of two fully connected layers. The first fully connected layer reduces the computational load by compressing the C channels down to C / r channels. This is followed by a ReLU nonlinear activation layer. The second fully connected layer then expands the dimensionality of the channels back to C. In the end, a Sigmoid activation function is utilized to obtain the weights s, where the dimension of s is 1 × 1 × C . These weights are used to adjust the C feature maps in the feature map U. Here, r represents the compression ratio.
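For reference, a minimal PyTorch sketch of the standard SE block described above (global average pooling followed by the two-layer gating with reduction ratio r) might look as follows; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over channels: global average pooling (squeeze),
    two fully connected layers with a C/r bottleneck (excitation), and a
    sigmoid gate that re-weights each channel of the input feature map."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # compress C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # expand back to C
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W)
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))            # squeeze: (B, C)
        s = self.fc(z).view(b, c, 1, 1)   # excitation weights in [0, 1]
        return x * s                      # channel-wise re-weighting
```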

3.1.4. Redefined-SENet

This paper presents a novel improvement to the existing squeeze-and-excitation network (SENet) attention module. The improved module can be interpreted as a mechanism that automatically learns the relative weights and importance of features, thereby enhancing the representation of 3D point cloud data in the algorithm. In the squeeze stage, the input features are compressed by a global average pooling operation to obtain global feature statistics. In the excitation stage, the compressed features are nonlinearly mapped through a pair of fully connected layers and activation functions to learn the weighting relationships between features. The traditional SENet only compresses features along the spatial dimensions, turning each two-dimensional feature channel into a single real number. However, the pseudo-images generated from the point cloud cannot be compressed along one dimension in the same way as traditional 2D images. To solve this problem, we squeeze and excite the second and third dimensions of the (D, P, N) tensor, i.e., P and N, respectively. Effective feature maps thus receive large weights, while invalid or uninformative feature maps receive small weights, so that the trained model achieves better results. As illustrated in Figure 4, our proposed point feature attention net (PFANet) incorporates a novel redefined SENet (R-SENet) module, which outperforms the traditional SENet in terms of feature recognition capability. This improvement enables the module to adaptively select and emphasize critical features, resulting in a more accurate object representation in 3D point clouds. Empirical evidence presented in our experiments demonstrates that squeezing and exciting P and N separately significantly outperforms squeezing and exciting only P or only N.
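One possible reading of this redefined scheme is sketched below: SE-style gates are computed separately along the P and N axes of the (B, D, P, N) pillar tensor and both are applied to the features. The exact fusion used in the paper is not specified here, so the element-wise combination of the two gates is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RSENet(nn.Module):
    """Sketch of the redefined SE idea: squeeze-and-excite along the pillar
    axis (P) and the point axis (N) of a (B, D, P, N) pillar tensor.
    Applying both gates by element-wise multiplication is an assumption."""
    def __init__(self, num_pillars, num_points, r=16):
        super().__init__()
        self.gate_p = nn.Sequential(
            nn.Linear(num_pillars, max(num_pillars // r, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(num_pillars // r, 1), num_pillars), nn.Sigmoid())
        self.gate_n = nn.Sequential(
            nn.Linear(num_points, max(num_points // r, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(num_points // r, 1), num_points), nn.Sigmoid())

    def forward(self, x):                          # x: (B, D, P, N)
        wp = self.gate_p(x.mean(dim=(1, 3)))       # squeeze over D and N -> (B, P)
        wn = self.gate_n(x.mean(dim=(1, 2)))       # squeeze over D and P -> (B, N)
        return x * wp[:, None, :, None] * wn[:, None, None, :]
```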

3.2. Backbone (2D CNN)

In the original PointPillars backbone network, the architecture is divided into two main stages: downsampling and upsampling. During the upsampling stage, a traditional CNN network is employed for feature extraction. However, this type of network exhibits significant limitations when it comes to the detection of small objects. To tackle this problem, we introduce SA-Net [27] into the downsampling phase of the backbone network, aiming to enhance the detection capabilities for small targets. Additionally, dynamic convolution [28] is incorporated to augment the expressive power of the shallow network. Below is the illustration of the backbone structure, as depicted in Figure 5. The detailed architecture of the 2D backbone network is described in the following sections.
SA-Net (shuffle attention network) [27] is a deep learning module designed to enhance feature extraction capabilities, particularly for the detection of small targets. SA-Net introduces an attention mechanism that emphasizes key regions within an image, enabling the network to more effectively identify and locate small targets. During the downsampling phase, SA-Net utilizes global contextual information to adjust the feature maps, thereby improving the accuracy and sensitivity of the feature representation. This allows SA-Net to significantly improve the model's capability when detecting small targets.
Omni-dimensional dynamic convolution (ODConv) [28] is a technique aimed at enhancing the expressive power of convolutional neural networks. Traditional convolution operations use fixed convolution kernels, whereas ODConv dynamically adjusts the convolution kernel parameters based on the input data, adapting to different features and contextual information. This dynamic adjustment mechanism enables the network to capture diverse features more flexibly, particularly in shallow layers, significantly enriching and refining the feature representation. ODConv optimizes the convolution operation by introducing learnable attention weights over the kernels; thus, the overall performance of the model is improved.
In our research, we replace the traditional convolution in the algorithm with dynamic convolution, enabling different convolution kernels to be applied to different inputs. The attention-based kernel weighting increases the average number of correctly predicted objects while reducing computational effort compared with traditional convolution, since the weighted kernels can be aggregated and applied in parallel rather than computed sequentially, leading to a more efficient and effective detection process. Our hypothesis that introducing ODConv alone could reduce recognition accuracy was validated through ablation experiments. Attention mechanisms are commonly divided into spatial attention and channel attention; SA-Net effectively combines the two, and this attention compensates for the accuracy decline introduced by ODConv. However, the change still increases the number of parameters, which remains a limitation.
Although dynamic convolution and attention mechanisms share the common goal of enhancing the performance of neural networks, they operate in fundamentally different ways. While dynamic convolution enables the algorithm to adaptively select convolution kernels for different inputs, the attention mechanism works to focus the network on critical information. This synergistic combination of both approaches enhances the network’s adaptability to small targets and complex scenes while maintaining computational efficiency, a significant improvement over traditional convolution-based methods.
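To make the dynamic-convolution idea concrete, the sketch below mixes K candidate kernels per sample using input-conditioned attention weights, in the spirit of CondConv/DyConv. It keeps only the kernel-wise attention; the full ODConv [28] formulation additionally attends over the spatial, input-channel, and output-channel dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Simplified dynamic convolution: K candidate kernels are mixed per sample
    with attention weights computed from the input (global pooling + FC),
    then applied as a single grouped convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4, padding=1):
        super().__init__()
        self.padding = padding
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels))

    def forward(self, x):                              # x: (B, C_in, H, W)
        b = x.size(0)
        alpha = F.softmax(self.attn(x), dim=1)         # (B, K) kernel attention
        # Mix the K kernels per sample, then run one grouped convolution.
        w = torch.einsum('bk,koihw->boihw', alpha, self.weight)
        w = w.reshape(-1, *self.weight.shape[2:])      # (B*out, in, k, k)
        x = x.reshape(1, -1, *x.shape[2:])             # (1, B*in, H, W)
        out = F.conv2d(x, w, padding=self.padding, groups=b)
        return out.reshape(b, -1, *out.shape[2:])
```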

3.3. Redefined Detection Head Based on Transformer

Currently, candidate box refinement methods mainly rely on manual design, which cannot fully capture the rich contextual information between points. In contrast, the Transformer, which is widely used in natural language processing and computer vision, can effectively overcome this restriction. The Transformer is invariant to the order of its input, so the order of the points in a point cloud does not need to be defined, and feature learning can be performed through self-attention mechanisms. This enhances the feature extraction of native point cloud data, making it more comprehensive and accurate. Numerous algorithms already apply the Transformer to 3D object detection tasks, such as PCT [29], Point Transformer [30], SOE-Net [31], VoxSeT [32], and FlatFormer [33], all of which have achieved notable results. The structure of the Transformer is shown in Figure 6.
In this paper, proposal information is initially encoded into each original point through an effective proposal-to-point embedding approach. Subsequently, remote interactions between points are captured using self-attention mechanisms. After feature encoding, the point features undergo conversion into a global proposal-aware representation via an extended channel re-weighting scheme, ensuring valid decoding weights for all points. This procedure enables the network to effectively leverage global information and refine its predictions, leading to more accurate object detection. The details of this process will be elaborated on in the next subsection of the paper.

3.3.1. Embedding

This embedding step maps the proposal to the original point cloud space (the proposal-to-point embedding approach), resulting in a better representation of the object and improved feature extraction. This is achieved as follows. The generated 3D bounding box is transformed into a cylinder with no restriction on its height, whose radius r is given by:
r = \alpha \sqrt{\left(\frac{l}{2}\right)^{2} + \left(\frac{w}{2}\right)^{2}}
Here, \alpha is a hyperparameter, and w and l denote the width and length of the region, respectively. A total of 256 points are randomly sampled from the cylinder for subsequent processing.
The relative coordinates of each point with respect to the center point and the eight corner points of the corresponding candidate box are then calculated to construct the point features, which can be represented as:
f_i = \mathcal{A}\left(\left[\Delta p_i^{c}, \Delta p_i^{1}, \ldots, \Delta p_i^{8}, f_i^{r}\right]\right) \in \mathbb{R}^{D}
\mathcal{A} is a linear layer that maps the point features to a higher-dimensional space, f_i^{r} is the reflected intensity of the point, and \Delta p_i^{j} = p_i - p^{j}, j = 1, \ldots, 8, are the relative coordinates, where p^{j} is the coordinate of the j-th corner point (\Delta p_i^{c} is defined analogously with respect to the center). The point feature is first mapped to a high-dimensional space via this linear layer and then input into a multi-head attention layer. The feed-forward network with a residual structure encodes the intricate contextual relationships between points, effectively enriching and refining the original point features. This process contributes to the overall robustness of the network, enabling more accurate object detection and localization. A minimal sketch of this embedding step is given below.
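The sketch below illustrates the cylinder radius and the 28-dimensional raw point feature (offsets to the center and the eight corners plus reflectance) before the linear mapping; the corner ordering and the embedding width of 256 channels are assumptions for illustration.

```python
import torch
import torch.nn as nn

def cylinder_radius(l, w, alpha=1.0):
    """r = alpha * sqrt((l/2)^2 + (w/2)^2): radius of the height-unbounded
    cylinder used to gather points around a proposal."""
    return alpha * ((l / 2) ** 2 + (w / 2) ** 2) ** 0.5

def proposal_to_point_features(points, refl, center, corners, embed):
    """points: (N, 3) coordinates sampled inside the cylinder (e.g. N = 256),
    refl: (N, 1) reflectance, center: (3,), corners: (8, 3) proposal corners,
    embed: nn.Linear mapping the 28-D raw feature to a D-dim embedding."""
    rel_center = points - center                                   # (N, 3)
    rel_corners = (points[:, None, :] - corners[None]).reshape(len(points), -1)  # (N, 24)
    raw = torch.cat([rel_center, rel_corners, refl], dim=1)        # (N, 28)
    return embed(raw)                                              # (N, D)

# Example wiring (illustrative dimension): D = 256 channels.
embed = nn.Linear(3 + 24 + 1, 256)
```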

3.3.2. Encoder

The encoder layer comprises three components: a multi-head attention mechanism, Add & Norm layers, and a feed-forward neural network. The self-attention encoding mechanism further refines point features by modeling the relative relationships between the points within the proposal. This self-attention process aggregates global context information and dependencies, resulting in more comprehensive feature representations; by leveraging self-attention, the mechanism captures long-range interactions among points, enabling better feature extraction and the representation of complex spatial relationships. The feature extraction process applied to the input is shown in Figure 7. The multi-head attention layer is mainly responsible for the attention computation, i.e., the matrix operations on Q, K, and V. In the Add & Norm layers, “Add” stands for the residual connection, which helps prevent network degradation, while “Norm” stands for layer normalization, which normalizes the activation values of each layer. The feed-forward layer then extracts features for forward propagation. The self-attention mechanism is computed using the matrices Q (query), K (key), and V (value), obtained by linearly transforming the inputs with the learned matrices W_Q, W_K, and W_V during training. The output of self-attention is calculated as follows:
Z = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
The dimension of the Q and K matrices is d_k, and dividing by \sqrt{d_k} prevents the inner product from becoming too large. After applying the softmax function, the result is multiplied by the V matrix to obtain the output. Multi-head attention consists of multiple self-attention layers: the input X is first passed through h different self-attention heads, producing h output matrices Z, which are then concatenated and passed through a linear layer to produce the final output.
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)
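For clarity, the two formulas above can be written directly in code; this minimal sketch uses plain weight tensors and omits masking and dropout.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., n_q, n_k)
    return F.softmax(scores, dim=-1) @ V            # (..., n_q, d_v)

def multi_head(X, Wq, Wk, Wv, Wo, h):
    """Project the input, split into h heads, attend per head, concatenate,
    and apply the output projection W_O (all weights are plain tensors)."""
    n, d = X.shape
    def split(M):                                    # (n, d) -> (h, n, d//h)
        return M.reshape(n, h, d // h).transpose(0, 1)
    heads = attention(split(X @ Wq), split(X @ Wk), split(X @ Wv))
    return heads.transpose(0, 1).reshape(n, d) @ Wo
```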
The Add and Norm layer is composed of two distinct parts, Add and Norm, and their calculations are performed in the following manner:
\mathrm{LayerNorm}\left(X + \mathrm{MultiHeadAttention}(X)\right), \qquad \mathrm{LayerNorm}\left(X + \mathrm{FeedForward}(X)\right)
In this context, X denotes the input to either the multi-head attention or the feed-forward layer. The “Add” operation refers to the residual connection X + \mathrm{MultiHeadAttention}(X), which is typically used to ease the training of deep networks: the residual connection lets the network concentrate on learning the residual (the difference), a technique widely used in ResNet. The structure of the residual network is shown in Figure 8.
“Norm” refers to Layer Normalization, which is typically used in RNN structures. This process transforms the inputs of each layer of neurons to have the same mean and variance, thereby accelerating convergence.
The feed-forward layer is straightforward, consisting of two fully connected layers: the first layer applies the ReLU activation function, while the second layer does not apply any activation function. This structure corresponds to the following equation.
\max\left(0, X W_1 + b_1\right) W_2 + b_2
The input to the feed-forward layer is the output of multi-head attention after residual connection and normalization. The feed-forward layer then conducts two linear transformations to delve deeper into the feature space. Its main purpose is to transform data from a high-dimensional space to a lower-dimensional space, facilitating the extraction of more complex features.
With the multi-head attention, feed-forward, and Add & Norm components described above, an encoder block can be constructed, which receives an input matrix and outputs a matrix of the same shape. The encoder is then formed by stacking multiple encoder blocks, as sketched below.
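A minimal sketch of such an encoder block (post-norm, as in the original Transformer [20]) follows; the model width, number of heads, feed-forward size, and stack depth are illustrative choices, not the exact values used in this paper.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention and a two-layer
    feed-forward network, each wrapped in a residual connection followed
    by layer normalization (Add & Norm)."""
    def __init__(self, d_model=256, n_heads=4, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (B, N_points, d_model)
        a, _ = self.attn(x, x, x)             # self-attention over point features
        x = self.norm1(x + a)                 # Add & Norm
        x = self.norm2(x + self.ff(x))        # Add & Norm
        return x

# A full encoder is simply a stack of such blocks:
encoder = nn.Sequential(*[EncoderBlock() for _ in range(3)])
```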

3.3.3. Decoder

In this module, all the point features output by the encoder are decoded; the structure is shown in Figure 9. Unlike the conventional Transformer decoder, which processes multiple query embeddings through self- and encoder-decoder attention, the decoder in this paper discards the multiple queries because the proposed model requires only a single prediction. The decoder opts for a single prediction rather than M query embeddings for two primary reasons. Firstly, employing M query embeddings can result in high memory latency, particularly when handling numerous proposals. Secondly, whereas each of the M query embeddings typically transforms independently into one of M words or objects, the proposal refinement model needs only a single prediction for streamlined processing. The decoder therefore operates on a single query embedding that aggregates point features across all channels and generates one prediction, which makes the refinement of point features for 3D object detection more efficient and effective. The standard Transformer decoder aggregates global point features using learnable vectors, and the standard decoding weight vector over all point features for each attention head is:
w_h^{(S)} = \sigma\!\left(\frac{\hat{q}_h \hat{K}_h^{T}}{\sqrt{D}}\right), \quad h = 1, \ldots, H
\hat{K}_h is the key embedding of the h-th attention head and \hat{q}_h is the corresponding query embedding. In order to emphasize the channel information in \hat{K}_h and \hat{q}_h, a decoding weight vector over all channels is introduced:
w_h^{(EC)} = s \cdot \sigma\!\left(\frac{\rho\!\left(\hat{q}_h \hat{K}_h^{T}\right) \odot \hat{K}_h^{T}}{\sqrt{D}}\right), \quad h = 1, \ldots, H
s is a re-weighting scalar obtained by linearly projecting and compressing the decoded values, and \rho(\cdot) is the repetition operator that maps \mathbb{R}^{1 \times N} to \mathbb{R}^{D \times N}. Such an approach enriches localized and detailed channel interactions compared with conventional decoding methods. Finally, the decoded proposal representation can be expressed as:
y = \left[ w_1^{(EC)} \cdot \hat{V}_1, \ldots, w_H^{(EC)} \cdot \hat{V}_H \right]
Here, the value embedding \hat{V} is derived as a linear projection of \hat{X} (the encoder output).
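The sketch below is one possible reading of this single-query, channel-re-weighted decoding: the per-head decoding scores are repeated to the head dimension, modulated by the key embedding, gated by a sigmoid, and scaled by a value-derived scalar s. The projections and dimensions are assumptions for illustration rather than the exact formulation used in the paper.

```python
import torch
import torch.nn as nn

class ChannelWiseDecoder(nn.Module):
    """A single learnable query attends to the N encoded point features.
    Per head: the 1xN standard decoding scores are repeated to D_h x N (rho),
    modulated element-wise by the key embedding, passed through a sigmoid,
    and scaled by s (projected from the values). Head outputs are
    concatenated into one proposal descriptor."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.h, self.dh = n_heads, d_model // n_heads
        self.q = nn.Parameter(torch.randn(n_heads, 1, self.dh) * 0.02)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.s_proj = nn.Linear(self.dh, 1)

    def forward(self, x):                       # x: (N, d_model) encoded points
        n = x.size(0)
        K = self.k_proj(x).reshape(n, self.h, self.dh).permute(1, 2, 0)   # (h, dh, N)
        V = self.v_proj(x).reshape(n, self.h, self.dh).permute(1, 0, 2)   # (h, N, dh)
        scores = (self.q @ K) / self.dh ** 0.5                            # (h, 1, N)
        s = self.s_proj(V.mean(dim=1))                                    # (h, 1)
        w = s[..., None] * torch.sigmoid(scores.expand(-1, self.dh, -1) * K)  # (h, dh, N)
        y = (w.transpose(1, 2) * V).sum(dim=1)                            # (h, dh)
        return y.reshape(-1)                                              # (h*dh,)
```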

3.4. Detection Head and Loss

The outputs of the encoding–decoding module are fed into two feed-forward neural networks (FFNs) to obtain confidence scores and box error values relative to the input proposals. The positional error between the true box and the predicted box is calculated as follows:
\Delta x = \frac{x^{g} - x}{d}, \quad \Delta y = \frac{y^{g} - y}{d}, \quad \Delta z = \frac{z^{g} - z}{h}, \quad \Delta w = \log\frac{w^{g}}{w}, \quad \Delta l = \log\frac{l^{g}}{l}, \quad \Delta h = \log\frac{h^{g}}{h}, \quad \Delta \theta = \theta^{g} - \theta, \quad d = \sqrt{l^{2} + w^{2}}
In the formula, x, y, and z are the center coordinates of the box, w, l, and h are its width, length, and height, \theta denotes the heading angle of the predicted box, and the superscript g marks the parameters of the ground-truth box. A sketch of this target encoding is given below.
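This is a direct transcription of the residual targets above (a sketch; the box parameterization order x, y, z, w, l, h, θ is assumed for illustration).

```python
import math

def encode_box_targets(gt, proposal):
    """Regression targets between a ground-truth box g and a proposal box:
    center offsets normalized by the proposal diagonal d (or its height for z),
    log size ratios, and the heading difference."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xp, yp, zp, wp, lp, hp, tp = proposal
    d = math.sqrt(lp ** 2 + wp ** 2)
    return (
        (xg - xp) / d, (yg - yp) / d, (zg - zp) / hp,              # center offsets
        math.log(wg / wp), math.log(lg / lp), math.log(hg / hp),   # size ratios
        tg - tp,                                                   # heading
    )
```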
c^{t} = \min\!\left(1, \max\!\left(0, \frac{\mathrm{IoU} - a_{B}}{a_{F} - a_{B}}\right)\right)
where the superscript t denotes the regression target encoded from the proposal, and a_F and a_B are the IoU thresholds for the foreground and background, respectively. The loss of the network consists of the RPN loss \mathcal{L}_{\mathrm{rpn}}, the bounding box regression loss \mathcal{L}_{\mathrm{reg}}, and the confidence prediction loss \mathcal{L}_{\mathrm{conf}}.
\mathcal{L} = \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{conf}} + \mathcal{L}_{\mathrm{rpn}}
The confidence prediction loss uses binary cross-entropy loss:
\mathcal{L}_{\mathrm{conf}} = -c^{t} \log(c) - \left(1 - c^{t}\right) \log(1 - c)
Moreover, the box regression loss adopts:
\mathcal{L}_{\mathrm{reg}} = \mathbb{1}\!\left(\mathrm{IoU} \geq \alpha_{R}\right) \sum_{\mu \in \{x, y, z, l, w, h, \theta\}} \mathcal{L}_{\mathrm{smooth\text{-}L1}}\!\left(\mu, \mu^{t}\right)
The \mathcal{L}_{\mathrm{rpn}} loss consists of a focal-loss classification branch and a smooth-L1-based regression branch.
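A minimal sketch of the second-stage loss terms defined above; the foreground/background IoU thresholds a_F and a_B are illustrative values, conf_pred is assumed to already be a sigmoid probability, and the first-stage RPN loss is computed separately by the region proposal network.

```python
import torch
import torch.nn.functional as F

def confidence_target(iou, a_b=0.25, a_f=0.75):
    """IoU-guided confidence target c^t = min(1, max(0, (IoU - a_B)/(a_F - a_B))).
    The thresholds a_B and a_F here are illustrative, not the paper's values."""
    return ((iou - a_b) / (a_f - a_b)).clamp(0.0, 1.0)

def second_stage_loss(conf_pred, iou, reg_pred, reg_target, fg_mask):
    """Confidence branch: binary cross-entropy against c^t (conf_pred in [0, 1]).
    Regression branch: smooth-L1 over the 7 box residuals, applied only to
    foreground proposals selected by fg_mask."""
    c_t = confidence_target(iou)
    l_conf = F.binary_cross_entropy(conf_pred, c_t)
    l_reg = F.smooth_l1_loss(reg_pred[fg_mask], reg_target[fg_mask])
    return l_conf + l_reg
```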

4. Experimental Setup and Evaluation Indicators

In this section, the algorithm is evaluated on the public KITTI dataset. The training process and evaluation criteria are elaborated, and a comprehensive ablation study is conducted to verify the usefulness of each module in the algorithm.

4.1. Experimental Data

The models were trained and tested using the KITTI 3D object detection benchmark (Geiger, Lenz, and Urtasun 2012) [21], which contains 7481 training LiDAR samples and 7518 testing LiDAR samples. All experiments used the same dataset partitioning as PointPillars, with the official training dataset divided into 3712 training samples and 3769 validation samples. We assessed the accuracy and performance of our model by training it on the available training split, followed by a comprehensive comparison with the results achieved by state-of-the-art methods on both the validation and test sets. To ensure a fair and objective evaluation, we not only used the 3769 validation samples but also tested the model on the 7518 test samples, uploading the generated labels to the official KITTI website for independent assessment. This rigorous testing process allowed us to obtain unbiased results that reflect the true performance of our model.

4.2. Model Training

This experiment uses the OpenPCDet 3D object detection framework. The CPU is an AMD EPYC 9754, and the GPU is an NVIDIA GeForce RTX 4090D with 24 GB of memory. The model is trained on the Ubuntu 20.04 platform. For training, the Adam_onecycle optimizer minimizes the loss function with a maximum of 160 iterations, a batch size of 5, an initial learning rate of 0.001, a momentum optimization coefficient of 0.8, and a weight decay rate of 0.01.
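For reference, the reported training hyperparameters can be summarized as a plain configuration dictionary (this is only an illustrative summary, not the exact OpenPCDet configuration file format):

```python
# Illustrative summary of the training setup reported above.
train_cfg = {
    "optimizer": "adam_onecycle",
    "max_iterations": 160,   # reported as "a maximum of 160 iterations"
    "batch_size": 5,
    "learning_rate": 0.001,
    "momentum": 0.8,
    "weight_decay": 0.01,
}
```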

4.3. Testing Results

The algorithm developed by this work is evaluated on the KITTI 3D target detection benchmark, which contains three targets: car, pedestrian, and cyclist. Test scenarios for each category are segmented into easy, moderate, and hard levels. For evaluation purposes, this paper utilizes the average precision (AP) metric to compare different methods, with 3D IoU thresholds set at 0.7 for cars and 0.5 for cyclists and pedestrians.
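As a reference for how this metric is computed, the sketch below averages interpolated precision over 40 equally spaced recall positions (the R40 protocol referred to later in the comparison with other methods, following [36]); the `precision_at_recall` callback is a hypothetical helper that returns the best precision achievable at a given recall.

```python
import numpy as np

def average_precision_r40(precision_at_recall):
    """KITTI-style AP: average the interpolated precision sampled at 40
    equally spaced recall positions for one class and difficulty level.
    precision_at_recall(r) must return the maximum precision achievable
    at recall >= r."""
    recalls = np.linspace(1.0 / 40, 1.0, 40)
    return float(np.mean([precision_at_recall(r) for r in recalls]))
```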

4.3.1. Compare with PointPillars on Validation Samples

For fair comparison, the algorithm used in this paper is trained on a desktop workstation with the same loss function and the same hyperparameters. Table 1 and Table 2 present the comparison results between our algorithm and the PointPillars algorithm.
The reported results are based on the average precision across 40 recall positions. Table 1 reveals that our proposed method achieves a significant improvement over the baseline model, PointPillars. The car detection AP increased by 4.73%, 4.47%, and 5.18% across the three difficulty levels, respectively; the pedestrian detection AP increased by 10%, 7.45%, and 7.51%; and the cyclist detection AP increased by 6.85%, 7.22%, and 6.26%. Meanwhile, as shown in Table 2, our method also greatly improved BEV detection accuracy: across the three difficulty levels, the car detection AP increased by 1.51%, 1%, and 2.17%, the pedestrian detection AP increased by 5.8%, 4.23%, and 4.32%, and the cyclist detection AP increased by 18.22%, 5.49%, and 4.25%, respectively.
To provide a thorough analysis of the detection performance across all categories, the mean average precision (mAP) was computed by averaging the AP values under moderate difficulty for each category. Compared with the existing PointPillars algorithm, the proposed method achieved a significant improvement of 6.26% in 3D target detection accuracy and a 3.74% increase in BEV detection accuracy. Moreover, our model maintained a competitive inference speed, running only 20 ms slower than the baseline model on identical hardware and software configurations.

4.3.2. Compare with Others Methods

To validate the algorithm’s universality, it was tested and compared on both the validation and test sets in this article. The experimental results are summarized in Table 3 and Table 4, with bold formatting highlighting the highest-performing detection outcomes.
To evaluate the improved algorithm on the KITTI dataset, it was compared with several typical algorithms. VoxelNet [22], SECOND [25], PointPillars [10], Pointformer [7], SVGA-Net [34], TGPP [24], and PSA-Det3D [35] algorithms were selected for comparison. At the three difficulty levels, our proposed method achieves results comparable to or better than state-of-the-art methods, confirming the effectiveness of our approach, as shown in Table 3.
Furthermore, to facilitate a more equitable comparison with other state-of-the-art methods, the method proposed in this paper was evaluated using the 3D detection benchmark on the KITTI test server. First, following the KITTI website and the suggestions made by the Mapillary team in their paper [36], we used 40 recall positions instead of 11. Second, the algorithm was tested on the test set, and the comparison data in Table 4 below are taken from the KITTI website. We selected typical methods from recent years, such as PointPillars [10], PointRCNN [37], SA-SSD [14], Pointformer [7], RangeDet [17], SVGA-Net [34], IA-SSD [38], and EOTL [39]. Our method shows clear progress in the cyclist category, performs satisfactorily in the car category, and exhibits some effectiveness for small object detection. The experiments confirm the effectiveness of the proposed algorithmic enhancements.

4.3.3. Visual Comparison Analysis

This paper exclusively utilizes the point cloud dataset for training, with visualization employed to facilitate a more intuitive comparison. To further analyze and visually demonstrate the usefulness of our proposed method compared to the baseline, this subsection presents the bounding box prediction results of both approaches in two different scenarios. Each scenario is presented separately for detailed analysis, observation, and illustration, and includes an RGB image and two point cloud images with the detected boxes, which are compared visually for clarity. The color coding for each category is as follows: cars are depicted with green bounding boxes, pedestrians with blue bounding boxes, and cyclists with yellow bounding boxes. We randomly selected two scenes for testing. Figure 10 shows the original image of Scene 1, while Figure 11 displays the detection results of both algorithms in this scene. Similarly, for Scene 2, depicted in Figure 12, the visual comparison between the two algorithms is shown in Figure 13.
The test results in Figure 11a show some common detection failures. In contrast, Figure 11b shows tight, correctly oriented 3D bounding boxes: the prediction results for cars are more accurate, with no misclassifications of pedestrians and cyclists. The same can be seen in Figure 13a,b. Detecting pedestrians and cyclists proves to be more challenging; pedestrians and cyclists are often misclassified, and environmental noise can easily be misinterpreted as cyclists or pedestrians. The visual analysis further demonstrates that the proposed algorithm significantly improves upon the baseline model.

5. Ablation Studies

To affirm the effectiveness of our proposed method, we conducted a series of ablation experiments that objectively evaluated the performance of each algorithm presented in this paper. In this subsection, we present the results of the ablation experiments in detail, which were designed to assess the efficacy of each module. Following standard 3D object detection practices, we split the KITTI dataset into “train” and “val” subsets, with 3712 and 3769 samples, respectively.
Referring to the data shown in Table 5, the AP for each category is averaged across the three difficulty levels. Evidently, the method with the Transformer module added achieves a significant improvement of 3.96% over the baseline model. The method with both ODConv and the Transformer module added improves on the baseline model by 4.88%. The method that adds the ODConv, Transformer, and attention modules achieves a 7.38% improvement over the baseline model. Building on this analysis, it can first be seen that introducing the Transformer for candidate box refinement has a significant effect and captures the rich contextual information between points better than the baseline model. Second, on top of the Transformer, the traditional convolution in some of the convolutional layers of the backbone network is replaced with dynamic convolution; by dynamically adjusting the convolution kernels, the performance of the convolutional neural network is improved by a further 0.92%. Our analysis reveals that introducing the adapted SE attention in the feature encoding network and the SA attention in the backbone network plays a critical role in bolstering the representation capability of the features, resulting in a further improvement of 3.42% in detection accuracy compared with adding only the Transformer module.
According to Table 6, the R-SENet module demonstrates superior detection capability compared to the traditional SENet module. For the three categories under moderate difficulty, the improvement is 0.11%, 2.27%, and 1.19%, respectively. This confirms that squeezing and exciting both P and N is much better than squeezing and exciting only P or only N.

6. Conclusions

This paper introduces an enhanced Transformer-based PointPillars feature encoding network aimed at improving small object detection. The introduced candidate box refinement module is the core part of the algorithm and significantly enhances the detection capability for small targets. Experimental results on the KITTI 3D detection benchmark show that the algorithm outperforms PointPillars in target detection performance, achieving a 6.26% average accuracy improvement under comparable conditions, which places it competitively among recent advances in the field. The ablation experiments confirm that our improvements to the PointPillars algorithm achieve better performance than the baseline model. Additionally, the comparison between our proposed R-SENet and the traditional SENet shows that R-SENet yields improvements of various degrees in all three categories, suggesting that the redefined SENet module is a valuable addition to our model. While the incorporation of the Transformer module has undoubtedly enhanced the performance of the proposed model, it has also introduced trade-offs in computational efficiency: it slows computation and expands the parameter count. Moving forward, our goal is to explore new architectural designs that streamline computational efficiency and reduce parameter overhead while preserving the high accuracy of the model.

Author Contributions

Conceptualization, W.K. and Y.D.; methodology, Y.D.; software, Y.D.; validation, W.K., Y.D., L.H. and Z.L.; formal analysis, W.K. and Y.D.; resources, L.H. and Z.L.; data curation, Y.D., L.H. and Z.L.; writing—original draft preparation, W.K. and Y.D.; writing—review and editing, Y.D., L.H. and Z.L.; visualization, Y.D. and L.H.; project administration, W.K. and Y.D.; funding acquisition, W.K. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China, grant number 61902296, and the Natural Science Foundation of Shaanxi Province of China, grant number 2022JM-369.

Data Availability Statement

The data presented in this research are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739. [Google Scholar] [CrossRef]
  2. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432. [Google Scholar] [CrossRef]
  3. Ma, X.; Ouyang, W.; Simonelli, A.; Ricci, E. 3d object detection from images for autonomous driving: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3537–3556. [Google Scholar] [CrossRef]
  4. Singh, A. Transformer-Based Sensor Fusion for Autonomous Driving: A Survey. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Paris, France, 2–6 October 2023; pp. 3312–3317. [Google Scholar]
  5. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D Object Detection for Autonomous Driving: A Comprehensive Survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  6. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
  7. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3D Object Detection with Pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7463–7472. [Google Scholar]
  8. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
  9. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  10. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  11. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  12. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3164–3173. [Google Scholar]
  13. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2647–2664. [Google Scholar] [CrossRef]
  14. He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure Aware Single-Stage 3D Object Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882. [Google Scholar]
  15. Miao, Z.; Chen, J.; Pan, H.; Zhang, R.; Liu, K.; Hao, P.; Zhu, J.; Wang, Y.; Zhan, X. PVGNet: A Bottom-Up One-Stage 3D Object Detector with Integrated Multi-Level Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3279–3288. [Google Scholar]
  16. Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2743–2752. [Google Scholar]
  17. Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2918–2927. [Google Scholar]
  18. Chai, Y.; Sun, P.; Ngiam, J.; Wang, W.; Caine, B.; Vasudevan, V.; Zhang, X.; Anguelov, D. To the Point: Efficient 3D Object Detection in the Range Image with Graph Convolution Kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16000–16009. [Google Scholar]
  19. Sun, P.; Wang, W.; Chai, Y.; Elsayed, G.; Bewley, A.; Zhang, X.; Sminchisescu, C.; Anguelov, D. RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5725–5734. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  21. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  22. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  23. Paigwar, A.; Sierra-Gonzalez, D.; Erkent, Ö.; Laugier, C. Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2926–2933. [Google Scholar]
  24. Zhang, L.; Meng, H.; Yan, Y.; Xu, X. Transformer-based global PointPillars 3D object detection method. Electronics 2023, 12, 3092. [Google Scholar] [CrossRef]
  25. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Zhang, Q.L.; Yang, Y.B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar]
  28. Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar]
  29. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. Pct: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  30. Engel, N.; Belagiannis, V.; Dietmayer, K. Point transformer. IEEE Access 2021, 9, 134826–134840. [Google Scholar] [CrossRef]
  31. Xia, Y.; Xu, Y.; Li, S.; Wang, R.; Du, J.; Cremers, D.; Stilla, U. SOE-Net: A self-attention and orientation encoding network for point cloud based place recognition. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11348–11357. [Google Scholar]
  32. He, C.; Li, R.; Li, S.; Zhang, L. Voxel set transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
  33. Liu, Z.; Yang, X.; Tang, H.; Yang, S.; Han, S. Flatformer: Flattened Window Attention for Efficient Point Cloud Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1200–1211. [Google Scholar]
  34. He, Q.; Wang, Z.; Zeng, H.; Zeng, Y.; Liu, Y. Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 22 February–1 March 2022; Volume 36, pp. 870–878. [Google Scholar]
  35. Huang, Z.; Zheng, Z.; Zhao, J.; Hu, H.; Wang, Z.; Chen, D. PSA-Det3D: Pillar set abstraction for 3D object detection. Pattern Recognit. Lett. 2023, 168, 138–145. [Google Scholar] [CrossRef]
  36. Simonelli, A.; Bulo, S.R.; Porzi, L.; Antequera, M.L.; Kontschieder, P. Disentangling monocular 3d object detection: From single to multi-class recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1219–1231. [Google Scholar] [CrossRef] [PubMed]
  37. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  38. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not All Points are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
  39. Yang, R.; Yan, Z.; Yang, T.; Wang, Y.; Ruichek, Y. Efficient online transfer learning for road participants detection in autonomous driving. IEEE Sens. J. 2023, 23, 23522–23535. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of network structure.
Figure 2. Schematic diagram of the PFANet algorithm structure.
Figure 3. Schematic diagram of squeeze-and-excitation network.
Figure 4. Schematic diagram of redefined-SENet (R-SENet).
Figure 5. Schematic diagram of a 2D Backbone structure.
Figure 6. Schematic diagram of Transformer structure.
Figure 7. Schematic diagram of embedding-and-encoder structure.
Figure 8. Schematic diagram of the residual network.
Figure 9. Schematic diagram of decoder structure.
Figure 10. Original photo of Scene 1.
Figure 11. Detection results of two algorithms in Scenario 1.
Figure 12. Original photo of Scene 2.
Figure 13. Detection results of two algorithms in Scenario 2.
Table 1. Three-dimensional object detection accuracy compared to PointPillars (%).

| Methods | Runtime (ms) | AP Car Easy | AP Car Mod | AP Car Hard | AP Ped Easy | AP Ped Mod | AP Ped Hard | AP Cyc Easy | AP Cyc Mod | AP Cyc Hard | mAP |
| PointPillars | 13 | 87.23 | 78.58 | 75.72 | 52.38 | 47.12 | 43.18 | 85.49 | 64.42 | 59.80 | 63.37 |
| Ours | 33 | 92.06 | 83.05 | 80.90 | 62.38 | 54.57 | 50.69 | 89.43 | 71.27 | 67.02 | 69.63 |

The data in bold font in the table is the best performance.
Table 2. BEV detection accuracy compared to PointPillars (%).

| Methods | Runtime (ms) | AP Car Easy | AP Car Mod | AP Car Hard | AP Ped Easy | AP Ped Mod | AP Ped Hard | AP Cyc Easy | AP Cyc Mod | AP Cyc Hard | mAP |
| PointPillars | 13 | 91.89 | 88.16 | 86.84 | 58.17 | 52.55 | 48.86 | 87.87 | 67.97 | 63.40 | 69.56 |
| Ours | 33 | 93.40 | 89.16 | 89.01 | 63.97 | 56.78 | 53.18 | 90.12 | 73.96 | 69.69 | 73.30 |

The data in bold font in the table is the best performance.
Table 3. Three-dimensional object detection accuracy comparison with other algorithms on validation set (%).

| Methods | AP Car Easy | AP Car Mod | AP Car Hard | AP Ped Easy | AP Ped Mod | AP Ped Hard | AP Cyc Easy | AP Cyc Mod | AP Cyc Hard | mAP |
| VoxelNet | 81.97 | 65.46 | 62.85 | 57.86 | 53.42 | 48.87 | 67.17 | 47.65 | 45.11 | 53.99 |
| SECOND | 90.55 | 81.61 | 78.61 | 55.94 | 51.14 | 46.17 | 82.96 | 66.74 | 66.74 | 64.84 |
| PointPillars | 86.42 | 77.29 | 75.60 | 53.60 | 48.36 | 45.22 | 82.38 | 64.24 | 60.05 | 62.25 |
| Pointformer | 90.05 | 79.65 | 78.89 | - | - | - | - | - | - | - |
| SVGA-Net | 90.59 | 80.23 | 79.15 | - | - | - | - | - | - | - |
| TGPP | 87.74 | 77.89 | 74.65 | 56.92 | 59.85 | 45.09 | 80.05 | 62.95 | 59.67 | 61.97 |
| PSA-Det3D | 87.46 | 78.80 | 74.47 | 49.72 | 42.81 | 39.58 | 75.82 | 61.79 | 55.12 | 60.05 |
| Ours | 92.06 | 83.05 | 80.90 | 62.38 | 54.57 | 50.69 | 89.43 | 71.27 | 67.02 | 69.63 |

The data in bold font in the table is the best performance.
Table 4. Three-dimensional object detection accuracy comparison with other algorithms on test set (%).

| Methods | AP Car Easy | AP Car Mod | AP Car Hard | AP Ped Easy | AP Ped Mod | AP Ped Hard | AP Cyc Easy | AP Cyc Mod | AP Cyc Hard | mAP |
| PointPillars | 82.58 | 74.31 | 68.99 | 51.45 | 41.92 | 38.89 | 77.10 | 58.65 | 51.92 | 58.29 |
| PointRCNN | 86.96 | 75.64 | 70.70 | 47.98 | 39.37 | 36.01 | 74.96 | 58.82 | 52.53 | 57.94 |
| SA-SSD | 88.75 | 79.79 | 74.16 | - | - | - | - | - | - | - |
| Pointformer | 87.13 | 77.06 | 69.25 | 50.67 | 42.43 | 39.60 | 75.01 | 59.80 | 53.99 | 59.76 |
| RangeDet | 85.41 | 77.36 | 72.60 | - | - | - | - | - | - | - |
| SVGA-Net | 87.33 | 80.47 | 75.91 | 48.48 | 40.39 | 37.92 | 78.58 | 62.28 | 54.88 | 61.04 |
| IA-SSD | 88.34 | 80.13 | 75.04 | 46.51 | 39.03 | 35.61 | 78.35 | 61.94 | 55.70 | 60.36 |
| EOTL | 79.97 | 69.13 | 58.57 | 48.65 | 40.11 | 35.99 | 75.20 | 58.96 | 50.41 | 56.06 |
| Ours | 87.33 | 78.90 | 74.31 | 43.88 | 36.49 | 34.19 | 77.61 | 63.69 | 56.88 | 59.69 |

The data in bold font in the table is the best performance.
Table 5. Performance across different modules of the algorithm (%).

| ODConv | Transformer | Attention | 3D mAP (%) | Car. 3D (%) | Ped. 3D (%) | Cyc. 3D (%) |
| | | | 64.37 | 80.03 | 47.78 | 65.30 |
| | | | 69.95 | 82.17 | 53.14 | 74.53 |
| | | | 65.68 | 80.80 | 48.91 | 67.33 |
| | | | 65.52 | 80.44 | 48.23 | 67.90 |
| | | | 70.87 | 85.44 | 51.51 | 75.65 |
| | | | 72.37 | 85.33 | 55.88 | 75.90 |

The data in bold font in the table is the best performance, the “✔” indicates that this module is added.
Table 6. R-SENet performance compared to SENet.

| Methods | Car. 3D (%) | Ped. 3D (%) | Cyc. 3D (%) | 3D mAP |
| +SENet | 82.94 | 52.3 | 70.08 | 68.44 |
| +R-SENet | 83.05 | 54.57 | 71.27 | 69.63 |

The data in bold font in the table is the best performance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
