Article

AEPF: Attention-Enabled Point Fusion for 3D Object Detection

by Sachin Sharma, Richard T. Meyer * and Zachary D. Asher
Department of Mechanical and Aerospace Engineering, Western Michigan University, 1903 West Michigan Ave, Kalamazoo, MI 49008, USA
* Author to whom correspondence should be addressed.
Sensors 2024, 24(17), 5841; https://doi.org/10.3390/s24175841
Submission received: 6 August 2024 / Revised: 2 September 2024 / Accepted: 7 September 2024 / Published: 9 September 2024
(This article belongs to the Section Sensors and Robotics)

Abstract

Current state-of-the-art (SOTA) LiDAR-only detectors perform well for 3D object detection tasks, but point cloud data are typically sparse and lack semantic information. Detailed semantic information obtained from camera images can be combined with existing LiDAR-based detectors to create a robust 3D detection pipeline. With two different data types, a major challenge in developing multi-modal sensor fusion networks is to achieve effective data fusion while managing computational resources. With separate 2D and 3D feature extraction backbones, feature fusion can become more challenging, as these modalities generate different gradients, leading to gradient conflicts and suboptimal convergence during network optimization. To this end, we propose a 3D object detection method, Attention-Enabled Point Fusion (AEPF). AEPF uses images and voxelized point cloud data as inputs and estimates the 3D bounding boxes of object locations as outputs. An attention mechanism is introduced into an existing feature fusion strategy to improve 3D detection accuracy, and two variants are proposed. These two variants, AEPF-Small and AEPF-Large, address different needs. AEPF-Small, with a lightweight attention module and fewer parameters, offers fast inference. AEPF-Large, with a more complex attention module and more parameters, provides higher accuracy than baseline models. Experimental results on the KITTI validation set show that AEPF-Small maintains SOTA 3D detection accuracy while inferencing at higher speeds. AEPF-Large achieves average precision scores of 91.13, 79.06, and 76.15 for the car class's easy, medium, and hard targets, respectively, on the KITTI validation set. Results from ablation experiments are also presented to support the choice of model architecture.

1. Introduction

Three-dimensional object detection remains one of the most critical tasks within the perception subsystem for various applications such as autonomous driving, robotics, drone navigation, and augmented reality [1]. The goal of 3D object detection is to predict the location and classes of the objects in the scene and localize them with respect to some known reference. Safety-critical robotic systems require highly accurate information about an object’s depth, position, and volume in a scene for accurate perception. Advancements in computer vision technology have resulted in highly effective 2D object detectors that deliver excellent results using only image data [2,3], but since camera data are 2D by nature and stereo cameras have limited depth detection range, these detectors are unable to provide accurate depth and spatial positioning of objects. Therefore, data from sensors like LiDAR and RADAR are often fused with camera images to provide the high accuracy needed for 3D object detection.
Cameras provide images as an array of pixels where each pixel has three color channels. Although cameras offer rich semantic information about a scene, they inherently lack the ability to directly capture its 3D structure, and depth information estimated from images typically contains significant errors [1]. Existing literature on camera-based 3D object detectors [4,5,6] demonstrates lower performance primarily due to imprecise depth estimation. On the other hand, LiDARs can provide accurate depth and geometric information via point clouds, but these point clouds are usually sparse due to factors such as small object sizes, long distances between objects, or occlusion. Despite achieving competitive performance on 3D detection benchmarks, LiDAR-based 3D object detectors [7,8,9,10,11] struggle under such challenging conditions because there is insufficient context to distinguish sparse distant regions.
Several multi-modal sensor fusion methods [12,13,14,15,16,17,18] have been proposed in the literature to improve 3D object detection by utilizing geometric and semantic information from images and point clouds. Three distinct groups of deep-learning-based multi-modal sensor fusion exist: early (data-level), middle (feature-level), and late (decision-level) fusion. In early fusion methods [17,19], raw sensory inputs are fused to help the network learn a joint representation. For example, the authors in [17] performed data-level fusion by adding semantic information to the LiDAR-only detection pipeline. Late-fusion-based methods [14] process information from different sensor modalities separately and fuse the outputs at the decision-making stage. With middle-fusion-based methods [12,13,16,18,20], individual features are extracted from multi-modal inputs, and an intermediate stage is then used to learn joint representations. Although feature-level fusion methods [12,16,21,22] have shown remarkable success in 3D object detection benchmarks [23,24], considerable research is still focused on determining at what stage the features should be fused. Methods like [12,13,16] combine semantic and geometric features near the end of each modality's feature extraction pipeline: coarse-grained features from the individual modalities are fused to regress 3D bounding boxes. Extracting coarse-grained features from both modalities incurs higher training and inference costs, and these networks also fail to learn shared features between modalities early. Feature-fusion methods such as those in [18,25] combine features at an earlier stage. The authors in [18] conducted voxel-level fusion by projecting non-empty voxels onto the image, allowing them to extract image features for every voxel.
Feature fusion at early stages between different modalities has the most significant opportunity for cross-modal interaction. Given a camera image and a corresponding LiDAR point cloud of a scene, can an early feature-fusion-based method be created that uses the most prominent features from individual modalities and outputs 3D bounding boxes with improved accuracy? The overall hypothesis is that image and point cloud features can be fused while selecting the most prominent features from individual modalities pre-fusion with attention mechanisms to enhance 3D detection results. To this end, a novel multi-modal and multi-class 3D object detector named Attention-Enabled Point Fusion (AEPF) for 3D object detection is proposed as shown in Figure 1, which takes in images and point cloud data as inputs and outputs 3D bounding boxes after passing through the attention-enabled sensor fusion layers. Two AEPF model variants are proposed: AEPF-Small (AEPF-S) and AEPF-Large (AEPF-L). AEPF-S employs attention mechanisms for both image and point cloud features before fusion, while AEPF-L utilizes multi-head self-attention and uses image features to highlight important point cloud features before fusion. The contributions of this work are as follows:
  • A novel feature fusion methodology—AEPF is proposed for 3D object detection. The proposed feature fusion methodology utilizes an attention mechanism to highlight important features within individual sensor modalities.
  • Two object detection variants based on AEPF architecture are presented and validated. AEPF-S maintains the accuracy of state-of-the-art (SOTA) algorithms while inferencing at higher speeds. AEPF-L obtains competitive results in the overall 3D mean average precision (mAP) category on the KITTI validation set and is intended for scenarios prioritizing higher accuracy with sufficient computational resources available.
  • The proposed 3D object detection method is validated through extensive experiments on the KITTI dataset [23]. The effectiveness of key network design components is verified by performing ablation studies.
An overview of this paper is as follows. Section 2 presents a literature review on current SOTA 3D object detection methods across different sensor modalities, the architecture for the proposed 3D object detection method is described in Section 3, experimental results on KITTI data and ablation studies are presented in Section 4, and Section 5 concludes the research contribution and suggests future research directions. Although our use of the KITTI dataset primarily establishes the use of AEPF for automated driving applications, we contend that it can be used for other applications, such as robotics, drone navigation, and augmented reality.

2. Related Work

Three-dimensional object detection methods can be classified into three types: camera (or stereo)-based, LiDAR-based, and multi-modal fusion-based. As camera (or stereo) and LiDAR are the most common sensor setups for 3D object perception, the focus will be on methods involving these two technologies.

2.1. 3D Object Detection Using Images

Given the success of 2D detection methods in regressing 2D boxes in images, a straightforward approach to extending this paradigm to 3D detection is to directly regress 3D localization parameters using a convolutional neural network (CNN). The shift from 2D to 3D detection involves utilizing the feature extraction capabilities of CNNs and extending them to accommodate the additional (albeit missing from individual camera images) spatial dimension present in 3D data. For instance, approaches such as those in [5,26,27] predict 3D bounding boxes using images as the sole input. These methods usually involve creating specialized loss functions to guide the learning of 3D parameters effectively and designing architectures that can capture essential depth cues and contextual information. Stereo-based methods [28,29] detect 3D objects from pairs of images, leveraging the additional geometric information from stereo images to infer depth using a disparity map. Since RGB images lack inherent depth information, methods like those in [30,31] perform depth estimation and generate pseudo-LiDAR representations for 3D object detection. With recent advancements in transformer-based architectures [32], researchers [33,34] have utilized 3D object queries and 3D–2D correspondence for 3D object detection. Furthermore, techniques like incremental structure-from-motion [35] and machine-learning-based image translation methods [36] have been developed to improve 3D point cloud reconstruction from RGB data, providing spatial information that images alone cannot offer. Although these advancements in 3D structure recovery from images are crucial for 3D object detection and localization, LiDAR sensor data typically provide more accurate 3D information as a point cloud without requiring additional processing. Given the challenges of accurate depth estimation from images, LiDAR-based 3D object detectors tend to outperform all image-only methods for 3D object detection.

2.2. 3D Object Detection Using Point Clouds

Thanks to the direct depth information provided by LiDAR, point-cloud-based 3D object detectors have been the primary focus in recent years. A challenge with LiDAR data is that, in its original representation, the point cloud consists of sparse, unordered points and therefore cannot serve directly as input to convolutional layers. However, a key advantage of LiDAR-based 3D object detection over multi-modal fusion-based methods is that these models do not require multi-sensor calibration and alignment. Even so, LiDAR-based 3D object detectors do not perform well for distant objects. These methods can be distinguished by how they encode raw LiDAR data to extract features from the point cloud and can be categorized into point-based and grid-based methods.
In the point-based category, PointRCNN [37] utilizes the original point cloud data and employs PointNet++ [38] to learn per-point features to generate 3D proposals and segmentation masks. 3DSSD [39] increases network inference speed by replacing feature propagation and refinement modules with fusion sampling and candidate generation layers.
VoxelNet [8], a grid-based approach, uses voxelization to encode the raw point cloud into fixed-size voxels and employs 3D CNNs to learn voxel features for classification and bounding box regression. SECOND [9] upgrades the original VoxelNet [8] approach by introducing sparse 3D CNNs to accommodate the sparse structure of point cloud data while significantly improving inference time. PointPillars [7] adopts PointNets [40] as an encoder and organizes point clouds in vertical columns (pillars), which are processed by a 2D CNN detection head to perform 3D object detection, enabling even faster inference than [9]. Overall, we observe a tradeoff between accuracy and runtime that influences the choice of method. These findings have prompted researchers to investigate alternative multi-modal fusion methods.

2.3. 3D Object Detection Using Multi-Modal Fusion Methods

When examining several 3D object detectors on popular detection benchmarks [23,24], most LiDAR-based methods surpass fusion-based methods because a significant share of the measured objects are cars, which are often larger than cyclists and pedestrians. Several comparisons on smaller objects show that fusion-based methods do not perform worse than LiDAR-based methods [14]. Combining two modalities comes with the additional computational load of processing extra sensor information, which forces fusion-based methods to limit the number of convolutional operations. One of the earliest fusion methods, MV3D [12], transforms the point cloud into a BEV representation and a front-view representation and then fuses these data with RGB image information. It begins by generating 3D proposals on the BEV feature map, projects these proposals onto the other two feature maps, and ultimately fuses region-based features to make the final prediction. AVOD [13] extracts features from RGB images and BEV maps before fusing them for 3D object proposal generation. Converting point cloud data into BEV and front-view representations, however, loses spatial information present in the original point cloud.
Cascaded-fusion methods [41,42] narrow regions for 3D data processing within a point cloud using information from 2D detectors. For example, Frustum Pointnet [43] uses a frustum-based methodology, where 2D proposals are lifted into 3D spaces using a frustum. The major drawback, however, is that these methods rely heavily on the 2D object proposal generation stage and would perform poorly in cases where 2D object proposal generation fails.
The late-fusion method CLOCs [14] fuses 2D and 3D object detection candidates to exploit geometric and semantic consistency between 2D and 3D detections. Fast-CLOCs [15] uses a 3D-detector-cued 2D image detector to reduce the memory and computational load of the original CLOCs [14] implementation. Although these late-fusion methods perform well in benchmark detection tasks [23,24], the intermediate features and representations from images and point clouds are not correlated, which leads to a loss of valuable contextual information captured by one sensor that may not be effectively complemented by the other.
MVX-Net [18] proposes two fusion strategies—point-level and voxel-level—to fuse image and voxel features early. PointFusion [42] correlates LiDAR points with image features by projecting each point onto the image using the calibration matrix to obtain point-wise image features. Inspired by recent success in attention-based mechanisms [32] in focusing essential features, we extend the PointFusion approach by incorporating an attention mechanism to highlight important point-wise image features and voxel features before fusion to obtain 3D bounding boxes.

2.4. Research Gaps

Although methods like MVXNet [18] and PointFusion [42] fuse image and point cloud features early, these methods fail to highlight features from individual modalities. A recent work, AVFP-MVX [44], uses an attention mechanism within the 2D feature extraction module and processes the fused representation with Voxel-FPN. While this method uses attention to highlight essential features from images, it does not account for the most prominent point cloud features pre-fusion, as the attention mechanism is absent in the point feature extraction module. As existing voxel-based 3D object detectors that solely use LiDAR data [9,11,45] demonstrate strong performance, the current literature lacks methods that focus on the most prominent voxel features before fusion with image features such that the fused structure can be processed with any voxel-based 3D backbone. In summary, existing 3D object detection methods fail to emphasize prominent complementary features from image and voxel data before feature fusion. To the best of our knowledge, AEPF is the first attempt to emphasize voxel and image features separately before their fusion with two different techniques, ensuring that complementary features are effectively highlighted.

3. Proposed Fusion Methodology

Herein, two AEPF variants are presented to fuse images and point cloud information for 3D object detection, as shown in Figure 1. AEPF-S employs attention mechanisms for both image and point cloud features before fusion, while AEPF-L utilizes multi-head self-attention and uses image features to highlight important point cloud features before fusion. For both networks, the first stage involves feature extraction from 2D images. Post-feature extraction, following [18,42], points from the LiDAR point cloud are projected onto the camera image to obtain point-wise image features. Point-wise image and voxel features are highlighted with attention mechanisms for each before the feature fusion step. After feature fusion, a 3D backbone is used to regress bounding boxes from the multi-modal features. Each major step of AEPF is described in the following sections.

3.1. 2D Image Feature Extraction

The first step for AEPF involves feature extraction from 2D RGB images, as shown in Figure 1. CNNs have been proven effective at extracting semantic information from images [3]. ResNet [46] (Residual Network) is a pioneering work in 2D computer vision that uses residual blocks with skip connections for improved feature extraction. Residual learning, introduced in [46], addresses information loss and gradient explosion issues in traditional CNNs. ResNet architectures are designed to balance network depth and computational efficiency. While ResNet-18 and ResNet-34 can run with limited computational resources, they are relatively shallow for complex feature extraction. The much deeper ResNet-101 and ResNet-152 variants add considerable computational cost, making them unsuitable for real-time inference given current typical computational resources. To balance depth and computational efficiency, ResNet-50 was used for feature extraction in both proposed network variants.
The 50-layer ResNet is organized into four stages, each containing several residual blocks, which generate feature maps with channel sizes of [256, 512, 1024, 2048], respectively. Since the first stage captures basic features like edges and textures, the batch normalization layers and parameters of this stage were frozen to help stabilize the training process. For the first network variant, AEPF-Small (AEPF-S), hierarchical feature maps from the second and third stages of the ResNet-50 backbone were used for point cloud projection to obtain point-wise image features. For the second network variant, AEPF-Large (AEPF-L), we add a feature pyramid network (FPN) that takes the outputs of all ResNet stages to construct a pyramid of feature maps, allowing the capture of multi-scale information for object detection.
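To make the feature routing concrete, the following is a minimal sketch of how stage-2 and stage-3 feature maps could be pulled from a ResNet-50 with the first stage frozen. It assumes a recent torchvision and uses torchvision's layer names (conv1/bn1/layer1–layer3); it is illustrative only and not the authors' implementation.

```python
import torch
import torchvision

# Build a ResNet-50 and freeze stage 1 (stem + layer1), mirroring the frozen stage described above.
resnet = torchvision.models.resnet50(weights=None)
stage1 = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1)
for p in stage1.parameters():
    p.requires_grad = False   # freezing BN running statistics additionally requires .eval() on these layers

def extract_stage_features(image):
    """Return the stage-2 and stage-3 feature maps used by AEPF-S."""
    x = stage1(image)          # stage 1: 256 channels
    c2 = resnet.layer2(x)      # stage 2: 512 channels
    c3 = resnet.layer3(c2)     # stage 3: 1024 channels
    return c2, c3

c2, c3 = extract_stage_features(torch.randn(1, 3, 375, 1242))   # KITTI-sized RGB input
print(c2.shape, c3.shape)
```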
After feature extraction for both network variants, each 3D point from the LiDAR point cloud is projected onto the image using a known calibration matrix. We use a similar approach as [18,42] to attach the corresponding image feature to each 3D point. As this feature attachment happens early, both network variants can learn and summarize a joint multi-modal representation for accurately regressing 3D bounding boxes.
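The sketch below illustrates one way such point-wise image features could be gathered: project each point with a composed 3x4 LiDAR-to-image projection matrix and bilinearly sample the feature map. The matrix composition (for KITTI, P2 combined with the rectification and LiDAR-to-camera transforms) and the bilinear sampling are assumptions for illustration, as the paper does not specify these details.

```python
import torch
import torch.nn.functional as F

def pointwise_image_features(points_xyz, proj_mat, feat_map, img_hw):
    """Project LiDAR points into the image and sample one feature vector per point (sketch).

    points_xyz: (N, 3) LiDAR points; proj_mat: assumed 3x4 LiDAR-to-image projection matrix;
    feat_map: (C, H', W') image feature map; img_hw: full image (height, width) for normalization.
    """
    n = points_xyz.shape[0]
    pts_h = torch.cat([points_xyz, points_xyz.new_ones(n, 1)], dim=1)   # (N, 4) homogeneous coords
    uvw = pts_h @ proj_mat.T                                            # (N, 3) projected coords
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                       # (N, 2) pixel coordinates

    h, w = img_hw
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,                     # normalize to [-1, 1] for grid_sample
                        uv[:, 1] / (h - 1) * 2 - 1], dim=-1).view(1, 1, n, 2)
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return sampled.view(feat_map.shape[0], n).T                         # (N, C) point-wise image features
```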

3.2. Point Cloud Voxelization

After obtaining point-wise image features, we extract point-wise voxel features to enable fusion in the subsequent step. To manage the sparse and unstructured nature of the point cloud, the point cloud can be divided into equally spaced voxels to allow grid-based convolutional operations, as shown in Figure 1. VoxelNet [8] introduced a voxel feature encoding (VFE) layer to encode raw point clouds at the individual voxel level. A raw point cloud can be transformed into a 3D space divided into equally spaced voxels. A point in a voxel is represented as $P_i = \{x_i, y_i, z_i, I_i, c_x, c_y, c_z\}$, where $\{x_i, y_i, z_i\}$ are the XYZ coordinates of the point, $I_i$ is its intensity, and $\{c_x, c_y, c_z\}$ is the centroid of the voxel in which $P_i$ is located. For both network variants, we use dynamic voxelization [47], which establishes a bi-directional relationship between points and voxels, laying the foundation for cross-view feature fusion.
Stacks of VFE layers containing fully connected networks (FCNs) transform the original point cloud into high-dimensional voxel features. Both network variants use the same voxel size, but the feature encoder for AEPF-S has two layers with 32 channels each and outputs point cloud features with 32 channels before fusion, whereas the feature encoder for AEPF-L has two layers with 32 channels each and outputs point cloud features with 64 channels before fusion.
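As a concrete illustration of the point decoration step described above, the following sketch computes voxel indices from the voxel size and range origin and appends each point's voxel centroid, producing the $\{x, y, z, I, c_x, c_y, c_z\}$ representation. The grouping strategy is an assumption made for illustration; production pipelines typically use optimized voxelization ops.

```python
import torch

def decorate_points(points, voxel_size=(0.05, 0.05, 0.1), pc_origin=(0.0, -40.0, -3.0)):
    """Append the enclosing voxel's centroid to each point: {x, y, z, I, cx, cy, cz} (sketch).

    `points` is assumed to be a float (N, 4) tensor of x, y, z, intensity values.
    """
    vs = torch.tensor(voxel_size)
    origin = torch.tensor(pc_origin)
    coords = torch.floor((points[:, :3] - origin) / vs).long()            # (N, 3) voxel indices

    # Group points by voxel index and compute per-voxel centroids.
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    sums = torch.zeros(uniq.shape[0], 3).index_add_(0, inverse, points[:, :3])
    counts = torch.zeros(uniq.shape[0]).index_add_(0, inverse, torch.ones(points.shape[0]))
    centroids = sums / counts.unsqueeze(1)

    return torch.cat([points, centroids[inverse]], dim=1)                 # (N, 7) decorated points

decorated = decorate_points(torch.rand(1000, 4) * torch.tensor([70.4, 40.0, 1.0, 1.0]))
print(decorated.shape)   # torch.Size([1000, 7])
```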

3.3. Attention Mechanisms and Point Fusion

After extracting point-wise image and voxel features, it is necessary to highlight the most important features from both image and voxel data before fusion. Attention-based mechanisms [32] are introduced to enhance fusion between the voxel and point-wise image features, as shown in Figure 1. Using an attention-based mechanism, different weights are assigned to different features, ensuring that the most essential features for fusion have a more significant impact on the final output. The selective weighting mechanism forces the model to focus on the most important features to improve object detection performance and plays a crucial role in addressing common issues such as false positives and false negatives. By assigning higher attention weights to prominent object features, the model can detect objects that might otherwise be missed, thus reducing false negatives. Conversely, the attention mechanism diminishes the impact of less relevant features, which helps minimize false positives and prevent incorrect classifications. In cluttered or complex environments, this focused approach ensures that only the most significant features are considered for detection, thereby improving the accuracy and reliability of the detection model.
For AEPF-S, a linear layer followed by a rectified linear unit (ReLU) activation transforms the input point-wise image features, as shown in Figure 1. An additional linear layer then produces a single scalar value as the attention weight for each point-wise image feature. The same transformation process is applied to the voxel features to produce their attention weights.
The calculated attention scores are used to weight the original features as:
$F_{\mathrm{img}}^{\mathrm{att}} = F_{\mathrm{img}} \cdot \mathrm{ReLU}(A_{\mathrm{img}})$
$F_{\mathrm{vxl}}^{\mathrm{att}} = F_{\mathrm{vxl}} \cdot \mathrm{ReLU}(A_{\mathrm{vxl}})$
where $F_{\mathrm{img}} \in \mathbb{R}^{N \times D_{\mathrm{img}}}$ denotes the point-wise image features, $F_{\mathrm{vxl}} \in \mathbb{R}^{N \times D_{\mathrm{vxl}}}$ denotes the voxel features, $F_{\mathrm{img}}^{\mathrm{att}}$ and $F_{\mathrm{vxl}}^{\mathrm{att}}$ denote the attended image and voxel features, respectively, $N$, $D_{\mathrm{img}}$, and $D_{\mathrm{vxl}}$ represent the number of features, the dimension of the input image features, and the dimension of the input voxel features, respectively, and $A_{\mathrm{img}}$ and $A_{\mathrm{vxl}}$ contain the attention scores for the image and voxel features, representing their importance. Weighted features after attention are concatenated to form the fused feature representation for AEPF-S, $F_{\mathrm{fused}}^{S}$:
$F_{\mathrm{fused}}^{S} = F_{\mathrm{img}}^{\mathrm{att}} + F_{\mathrm{vxl}}^{\mathrm{att}}$
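The sketch below illustrates the AEPF-S attention and fusion step just described: a linear-ReLU-linear scoring branch produces a non-negative scalar weight per point, which scales the features before the two streams are joined. The hidden width (64), feature dimensions, and the final concatenation are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn as nn

class ScalarAttention(nn.Module):
    """Per-point scalar attention along the lines of AEPF-S: linear -> ReLU -> linear -> weight."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):                      # feats: (N, D)
        weights = torch.relu(self.score(feats))    # (N, 1) non-negative attention scores
        return feats * weights                     # attended features, same shape as input

img_att, vxl_att = ScalarAttention(dim=128), ScalarAttention(dim=32)
f_img, f_vxl = torch.randn(2000, 128), torch.randn(2000, 32)
f_fused = torch.cat([img_att(f_img), vxl_att(f_vxl)], dim=1)    # fused AEPF-S representation
print(f_fused.shape)                                            # torch.Size([2000, 160])
```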
AEPF-L employs multi-headed self-attention (MHSA) [32], as shown in Figure 1, to enhance the most significant voxel features using information from the image features before fusion. MHSA extends the self-attention process across multiple parallel heads, each with learnable parameters to produce query, key, and value vectors. Queries represent the features being focused on, keys represent the features against which the queries are compared, and values contain the information that is aggregated based on the attention weights. For AEPF-L, point-wise image features are used as queries, as they capture rich semantic information that helps identify relevant point-wise voxel features. Point-wise voxel features are used as keys and help determine the relevance of each voxel feature in relation to the query. Values are also derived from the voxel features and are used to update them based on the weights determined by the attention mechanism. By comparing queries with keys, the MHSA computes attention weights indicating the significance of each voxel feature and applies them to the values (point-wise voxel features). Point-wise image and voxel features are linearly transformed into queries ($Q$), keys ($K$), and values ($V$) as:
$Q = \mathrm{Linear}_Q(F_{\mathrm{img}})$, reshaped to $(B, H, -1)^{T}$
$K = \mathrm{Linear}_K(F_{\mathrm{vxl}})$, reshaped to $(B, H, -1)^{T}$
$V = \mathrm{Linear}_V(F_{\mathrm{vxl}})$, reshaped to $(B, H, -1)^{T}$
where $B$ and $H$ denote the batch size and the number of attention heads, respectively. The attention weights $A$ are computed using the scaled dot-product attention mechanism [32] as:
$A = \mathrm{softmax}\left(\dfrac{Q \cdot K^{T}}{\sqrt{D_k}}\right)$
where $D_k$ is the dimension of the keys. The attention weights are applied to the values to obtain the attended values, $Z$:
$Z = A \cdot V$
The combined attention heads are transformed to produce the final attended voxel features:
$F_{\mathrm{vxl}}^{\mathrm{att}} = \mathrm{Linear}_O(Z)$
The final attended voxel features are concatenated with the point-wise image features to form the fused feature representation for AEPF-L, $F_{\mathrm{fused}}^{L}$:
$F_{\mathrm{fused}}^{L} = F_{\mathrm{img}} + F_{\mathrm{vxl}}^{\mathrm{att}}$
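The following sketch shows the AEPF-L cross-modal attention idea using PyTorch's standard multi-head attention module: image-feature queries attend over voxel-feature keys/values, and the attended voxel features are then fused with the image features. The common feature dimension (64), the use of nn.MultiheadAttention, and the concatenation-based fusion are assumptions made to illustrate the mechanism, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Image-feature queries attend over voxel-feature keys/values, then fuse (sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

    def forward(self, f_img, f_vxl):               # both: (B, N, dim)
        attended_vxl, _ = self.mhsa(query=f_img, key=f_vxl, value=f_vxl)
        return torch.cat([f_img, attended_vxl], dim=-1)     # (B, N, 2 * dim) fused features

fusion = CrossModalAttentionFusion(dim=64, heads=4)          # 4 heads, as in the ablation study
f_img, f_vxl = torch.randn(1, 2000, 64), torch.randn(1, 2000, 64)
print(fusion(f_img, f_vxl).shape)                            # torch.Size([1, 2000, 128])
```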

3.4. 3D Backbone Network

After point-wise image and voxel feature fusion, the fused feature representation is passed through a 3D backbone network for 3D object detection, as shown in Figure 1. The fused representation for both network variants can be passed through any voxel-based 3D backbone network [7,9,37,45]. SECOND [9] introduced 3D sparse convolution to handle sparse LiDAR point clouds. We use the single-stage SECOND [9] backbone for both of our network variants because of its computational efficiency and its effectiveness in handling sparse data compared with its counterparts. For both network variants, the voxel structure fed into the SECOND-based 3D backbone network consists of a 1600 × 1408 × 40 voxel grid with a voxel size of 0.05 × 0.05 × 0.1 m.
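As a quick arithmetic check on these settings, dividing the detection range used for training (given in Section 4.2) by the voxel size reproduces the stated grid dimensions:

```python
# Detection range from Section 4.2: x in [0, 70.4] m, y in [-40, 40] m, z in [-3, 1] m.
x_extent, y_extent, z_extent = 70.4 - 0.0, 40.0 - (-40.0), 1.0 - (-3.0)
grid = (round(y_extent / 0.05), round(x_extent / 0.05), round(z_extent / 0.1))
print(grid)   # (1600, 1408, 40), matching the 1600 x 1408 x 40 voxel grid above
```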
AEPF-S processes 128 input channels in the SECOND backbone with two stages, each consisting of five layers. AEPF-L processes 258 input channels in the SECOND backbone to handle the richer fused feature representation. Both network variants also output higher dimensional feature maps (128 and 256 for AEPF-S and AEPF-L, respectively) using the FPN from SECOND to enhance multi-scale representation.
Since we use the SECOND backbone, we follow the same multitask loss function used in [9], which is a combination of classification loss ($L_{cls}$), regression loss ($L_{reg}$), and direction classification loss ($L_{dir}$):
$L_{\mathrm{total}} = \beta_1 L_{cls} + \beta_2 L_{reg} + \beta_3 L_{dir}$
where $\beta_1$, $\beta_2$, and $\beta_3$ are the weights for the classification, regression, and direction classification losses, respectively. We set $\beta_1 = 2.5$, $\beta_2 = 1$, and $\beta_3 = 0.2$ for training both AEPF variants. These weights were determined through extensive experimentation, systematically varying the weight of each component to assess its impact on model performance. We also follow [9] in parametrizing the 3D ground truth boxes and 3D anchors.
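A minimal sketch of how the weighted total loss combines the three terms, using the weights reported above; the individual loss terms themselves (in SECOND, typically a focal classification loss, smooth-L1 regression loss, and direction cross-entropy) are assumed to be computed elsewhere, and the numeric inputs here are placeholders for illustration only.

```python
import torch

def total_loss(l_cls, l_reg, l_dir, betas=(2.5, 1.0, 0.2)):
    """Weighted multitask loss from the equation above; the weights follow the text."""
    b1, b2, b3 = betas
    return b1 * l_cls + b2 * l_reg + b3 * l_dir

# Placeholder loss values, for illustration only.
print(total_loss(torch.tensor(0.4), torch.tensor(0.8), torch.tensor(0.1)))   # tensor(1.8200)
```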

4. Experimental Validation

4.1. Dataset

The proposed AEPF method is evaluated on the KITTI Vision Benchmark [23], which provides 7481 training samples and 7518 testing samples for the 3D and bird's-eye view (BEV) object detection tasks. The difference between the 3D and BEV object detection tasks is that BEV does not consider the object's height. Ground truth labels are provided for the 7481 training samples, and testing samples are evaluated by submitting results to the online KITTI server [23]. Each sample contains a LiDAR point cloud, a corresponding RGB image, and their calibration parameters. The dataset divides objects into three difficulty levels: "easy" (fully visible with slight truncation), "moderate" (partly occluded with moderate truncation), and "hard" (difficult to see with severe truncation).

4.2. Training Configuration

The range of the point cloud data was limited to $[0, 70.4] \times [-40, 40] \times [-3, 1]$ meters along the $(x, y, z)$ axes to remove points outside the detection range. Following [48], the training data were divided into train and validation splits containing 3712 and 3769 frames, respectively. The three target detection classes are cars, pedestrians, and cyclists. We use the same data augmentation techniques described in [49].
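A minimal sketch of the range cropping described above, assuming an (N, 4) array of x, y, z, intensity values; the function and variable names are illustrative only.

```python
import numpy as np

def crop_to_range(points, pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Drop points outside the (x, y, z) detection range used for training (sketch)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = ((x >= pc_range[0]) & (x <= pc_range[3]) &
            (y >= pc_range[1]) & (y <= pc_range[4]) &
            (z >= pc_range[2]) & (z <= pc_range[5]))
    return points[mask]

cloud = np.random.uniform(-80, 80, size=(10000, 4)).astype(np.float32)   # synthetic example cloud
print(crop_to_range(cloud).shape)
```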

4.3. Training Settings

Both networks were trained on a single NVIDIA A6000 GPU with the ADAM optimizer. The total batch size was set to 6, and the cosine annealing strategy was used to adjust the learning rates dynamically. This scheduler decreases the learning rate following a cosine curve, starting at 0.0003 and reducing it to a minimum of 0.0001.
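A sketch of the learning-rate schedule as described: Adam at 0.0003 with a cosine decay to a floor of 0.0001. PyTorch's standard CosineAnnealingLR stands in for the scheduler; the placeholder model, the epoch count, and the absence of warmup are assumptions, and the authors' exact schedule may differ.

```python
import torch

model = torch.nn.Linear(8, 8)                          # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Cosine decay from 0.0003 down to a floor of 0.0001; T_max (total epochs) is a placeholder.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40, eta_min=1e-4)

for epoch in range(40):
    # ... run one training epoch (total batch size 6) ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])                 # close to 1e-4 by the final epoch
```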

4.4. Evaluation Metrics

KITTI uses average precision (AP) for 3D object detection and BEV detection to evaluate each category within each difficulty level, calculated with 40 recall positions. For multi-class evaluation across multiple difficulty levels, we use mAP as the evaluation metric, the mean AP of all categories across all difficulty levels. The IoU thresholds for this metric for cars, pedestrians, and cyclists are 0.7, 0.5, and 0.5, respectively, as suggested in the KITTI [23] server. Predictions are considered correct when the IoU of the predicted bounding box and ground-truth box exceeds those thresholds.
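To make the aggregation concrete, the snippet below averages the three difficulty-level AP scores for the car class (AEPF-L values from Table 1) and reproduces the Car 3D mAP reported in Table 2; for the multi-class metric described above, the mean additionally runs over all categories.

```python
import numpy as np

# Car 3D AP (R40) for AEPF-L at easy/moderate/hard difficulty, taken from Table 1.
car_3d_ap = np.array([91.13, 79.06, 76.15])

# Averaging over the difficulty levels reproduces the Car 3D mAP reported in Table 2.
print(round(car_3d_ap.mean(), 2))   # 82.11
```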

4.5. Results

Table 1 shows the validation results for our methods and compares them against LiDAR-only detection methods as well as combined LiDAR and image-based 3D detection methods. We did not include image-only methods in the comparison because LiDAR- and fusion-based methods consistently outperform image-only methods in 3D object detection tasks. For LiDAR-based methods, we specifically chose voxel-based methods, since the fused representation for both AEPF variants can be processed with any voxel-based 3D backbone. The proposed fusion techniques achieved improved performance compared to the original SECOND [9] method, with AP scores increasing across all categories by +0.06 to +7.05. AEPF-L outperforms MVXNet [18], which also employs the PointFusion [42] strategy, with AP gains across all categories ranging from +3.49 to +8.75. The BEV AP score for car detection in the easy category was the highest among the camera-LiDAR fusion-based methods, with AEPF-L scoring 95.27 and AEPF-S coming in second at 94.40. Among fusion-based methods, AEPF-L demonstrated the second-highest AP score for 3D car detection in all categories, just below CLOCs [14], which uses a late-fusion strategy combining detection candidates from PV-RCNN (LiDAR) and Cascade R-CNN [50] (image), making it more computationally expensive than AEPF-L. Although the AEPF-L score for the easy category of 3D car detection was close to that of CLOCs (−1.65), the AP scores for the moderate and hard categories for both AEPF variants were significantly lower than those of CLOCs, with differences ranging from −6.88 to −9.80. This could be addressed in future work by thoroughly exploring advanced attention-enabled fusion strategies to improve performance across all detection categories. The fused point-wise image and voxel features in AEPF can be processed with any voxel-based 3D backbone, which allows AEPF-based networks to swap the existing single-stage SECOND-based 3D backbone for other multi-stage 3D backbones such as Part-A2 [51], Voxel-RCNN [45], and PointRCNN [37] for tasks that require greater accuracy at the expense of computational resources. Given the strong evidence of accuracy improvements over the baseline SECOND when using a SECOND-based 3D backbone, we argue that employing a two-stage 3D backbone network, similar to the LiDAR-only methods in Table 1 [11,45,52], would yield better accuracy for AEPF-based methods compared to those methods.
Qualitative results for both detection variants are displayed in Figure 2. The 3D object detection outcomes, based on image and point cloud data, are projected onto the image for visualization. AEPF-L successfully addressed the false positives and missed detections in Figure 2A,B from AEPF-S shown in Figure 2C,D, reinforcing the rationale behind proposing two variants: one optimized for inference speed and the other for improved accuracy.
Additionally, AEPF-S, AEPF-L, and an early feature fusion method (MVXNet [18]) with a similar backbone configuration were run on the same machine for a fair comparison, with the results shown in Table 2. We chose to compare our approach with MVXNet [18], as it was readily available for implementation [53] and shares architectural similarities with our method. Notably, AEPF-S demonstrated faster inference than both the baseline MVXNet and AEPF-L, exceeding MVXNet by +4.8 fps. This suggests that the attention mechanism used for AEPF-S enabled a more straightforward configuration while maintaining the accuracy of other fusion-based methods. Additionally, AEPF-L outperformed MVXNet in terms of mAP scores, by +1.5 in 3D detection and +0.32 in BEV detection, despite a slightly slower inference speed (−1.2 fps), all while keeping the same 2D and 3D backbone configurations. This suggests that AEPF-L's attention mechanisms significantly improve detection performance at the expense of only a minor increase in inference time. This becomes particularly evident in scenes with numerous pedestrians, cyclists, and cars. Figure 3 illustrates detection results in a scene containing multiple pedestrians, cyclists, and cars. Due to the limited training data for pedestrians and cyclists, both MVXNet and AEPF-S fail to detect a cyclist, as shown in Figure 3A,B. In contrast, AEPF-L, with its attention mechanism, successfully detects the cyclist, as shown in Figure 3C. AEPF-based detection frameworks also work well in cluttered environments; for instance, where MVXNet fails to detect a car amidst object clutter, the AEPF-based methods accurately identify it, as shown in Figure 3. These results further demonstrate the effectiveness of attention mechanisms within AEPF-based networks for accurate object detection.

4.6. Ablation Studies

To evaluate the contribution of specific components in the proposed detection pipeline, we conducted ablation experiments for AEPF-S and AEPF-L. Given the need for AEPF-S to infer at faster speeds, the most computationally expensive part lies in the image feature extraction process, specifically ResNet-50. To determine which stages of features are most critical, we compared three AEPF-S variants, each extracting features from different ResNet stages, as shown in Table 3. For baseline comparison, we also included results from a fusion procedure that uses features from all stages without applying any attention mechanism. The results show that using features from stages 2 and 3 with the AEPF-S attention mechanism yields the best performance, with an improvement of +3.81 in Car 3D mAP and +1.96 in Car BEV mAP compared to the baseline.
We also performed ablation experiments to evaluate the impact of the number of attention heads in the attention mechanism used for AEPF-L. We tested three different settings with the number of attention heads set to 4, 8, and 12. As shown in Table 4, AEPF-L achieved the best results when the number of attention heads was set to 4. The best performance of AEPF-L with four attention heads can be attributed to a balance between model complexity and capacity, allowing it to capture essential features without overfitting. Moreover, using fewer attention heads improves computational resource utilization, reducing redundant feature extraction and highlighting important point cloud features for more focused learning.

5. Conclusions

This paper introduced a novel multi-modal and multi-class 3D object detection framework named Attention-Enabled Point Fusion (AEPF), which leverages an attention mechanism to fuse features from images and point clouds, thereby enhancing the accuracy of 3D object detection compared to traditional methods. Our results highlight the potential of early feature fusion and attention mechanisms in enhancing 3D object detection. Through extensive experiments on the KITTI dataset, the effectiveness of our method was validated, showcasing competitive results in both 3D and BEV object detection tasks across different difficulty levels.
Two model variants, AEPF-S and AEPF-L, are proposed, each tailored to a different speed-accuracy trade-off, providing flexibility for various application needs. AEPF-S is designed for scenarios that demand faster inference speeds; it is well suited to hardware-limited, real-time applications with additional functional oversight (e.g., a human driver in an advanced driver assistance system). Conversely, AEPF-L prioritizes higher accuracy, making it well suited for safety-critical tasks with limited oversight, such as autonomous driving, where safety is paramount. The complexity of the AEPF framework is effectively managed by introducing two variants tailored to different computational needs, ensuring efficient resource utilization while delivering SOTA performance. The trade-offs between inference speed and detection accuracy were thoroughly analyzed. AEPF-S achieved significantly higher inference speeds, making it particularly appealing for resource-constrained environments while maintaining SOTA accuracy. AEPF-L, although inferencing 4.6% slower than the compared baseline, provided substantial improvements in detection performance (+1.63 mAP in car 3D detection and +0.49 mAP in car BEV detection), making it ideal for applications where accuracy is critical, even at the expense of increased computational demands.
Future work will investigate the scalability and adaptability of these models, including a more exhaustive analysis across varying lighting conditions, noise levels, and other challenging scenarios to further evaluate and enhance the robustness of the AEPF variants. This will involve further refinement of the attention mechanisms to enhance detection accuracy while keeping computational demands low, and exploring hybrid strategies that dynamically switch between AEPF-S and AEPF-L based on real-time assessments of the environment and available compute resources. Furthermore, the modularity of the AEPF framework allows for the integration of advanced multi-stage networks with minimal customization, as the fused point-wise image and voxel features can be processed with any voxel-based 3D backbone network according to task criticality. This preliminary work on AEPF is promising, and further exploration could make the model scalable to larger datasets and adaptable for real-time applications in robotics, navigation, and autonomous driving.

Author Contributions

Conceptualization, S.S., R.T.M., and Z.D.A.; methodology, S.S.; software, S.S.; formal analysis, S.S. and R.T.M.; investigation, S.S., R.T.M., and Z.D.A.; resources, R.T.M. and Z.D.A.; writing—original draft preparation, S.S.; writing—review and editing, S.S., R.T.M., and Z.D.A.; project administration, R.T.M. and Z.D.A.; funding acquisition, R.T.M. and Z.D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the US DOE’s Office of Energy Efficiency and Renewable Energy (EERE) under the Energy Efficient Mobility Systems program under DE–EE–0009657.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used for this research are publicly available at https://www.cvlibs.net/datasets/kitti/index.php (accessed on 15 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  2. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 36, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  4. Mousavian, A.; Anguelov, D.; Flynn, J.; Košecká, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5632–5640. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Lu, J.; Zhou, J. Objects are Different: Flexible Monocular 3D Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3288–3297. [Google Scholar] [CrossRef]
  6. Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 913–922. [Google Scholar] [CrossRef]
  7. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705. [Google Scholar]
  8. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  9. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 222–237. [Google Scholar]
  10. Li, X.; Zhang, W.; Liu, X.; Zhou, B.; Xie, P.; Yuille, A.L. SESSD: Self-Supervised 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7527–7536. [Google Scholar]
  11. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Guo, Z.; Xiang, K. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
  12. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar] [CrossRef]
  13. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar] [CrossRef]
  14. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24–27 October 2020; pp. 10386–10393. [Google Scholar] [CrossRef]
  15. Pang, S.; Morris, D.; Radha, H. Fast-CLOCs: Fast Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 3747–3756. [Google Scholar] [CrossRef]
  16. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar] [CrossRef]
  17. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 4603–4611. [Google Scholar] [CrossRef]
  18. Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-Net: Multimodal VoxelNet for 3D Object Detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar] [CrossRef]
  19. Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An Efficient Multi-Sensor 3D Object Detector with Point-Based Attentive Cont-Conv Fusion Module. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12460–12467. [Google Scholar] [CrossRef]
  20. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 17161–17170. [Google Scholar] [CrossRef]
  21. Wang, C.H.; Chen, H.W.; Chen, Y.; Hsiao, P.Y.; Fu, L.C. VoPiFNet: Voxel-Pixel Fusion Network for Multi-Class 3D Object Detection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 8527–8537. [Google Scholar] [CrossRef]
  22. Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y.; et al. LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17524–17534. [Google Scholar]
  23. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. IJRR 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  24. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  25. Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. VPFNet: Improving 3D Object Detection With Virtual Point Based LiDAR and Stereo Data Fusion. IEEE Trans. Multimed. 2023, 25, 5291–5304. [Google Scholar] [CrossRef]
  26. He, T.; Soatto, S. Mono3D++: Monocular 3D vehicle detection with two-scale 3D hypotheses and task priors. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Menlo Park, CA, USA, 2019. [Google Scholar] [CrossRef]
  27. Liu, Z.; Wu, Z.; Tóth, R. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 4289–4298. [Google Scholar] [CrossRef]
  28. Wang, Y.; Yang, B.; Hu, R.; Liang, M.; Urtasun, R. PLUMENet: Efficient 3D Object Detection from Stereo Images. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3383–3390. [Google Scholar] [CrossRef]
  29. Chen, Y.; Liu, S.; Shen, X.; Jia, J. DSGN: Deep Stereo Geometry Network for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 12533–12542. [Google Scholar] [CrossRef]
  30. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 8437–8445. [Google Scholar] [CrossRef]
  31. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving. In Proceedings of the International Conference on Learning Representations, Virtual Event, 26–30 April 2020. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  33. Zhang, R.; Qiu, H.; Wang, T.; Xu, X.; Guo, Z.; Qiao, Y.; Gao, P.; Li, H. MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision ICCV 2023, Paris, France, 1–6 October 2023. [Google Scholar]
  34. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–18. [Google Scholar]
  35. Hazzat, S.; Merras, M.; El Akkad, N.; Saaidi, A.; Satori, K. 3D reconstruction system based on incremental structure from motion using a camera with varying parameters. Vis. Comput. 2018, 34, 1443–1460. [Google Scholar] [CrossRef]
  36. Dippold, E.J.; Tsai, F. Enhancing Building Point Cloud Reconstruction from RGB UAV Data with Machine-Learning-Based Image Translation. Sensors 2024, 24, 2358. [Google Scholar] [CrossRef] [PubMed]
  37. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  38. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  39. Yang, B.; Wang, J.; Clark, R.; Hu, H.; Wang, S.; Markham, A. 3DSSD: Point-based 3D Single Stage Object Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  40. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  41. Wang, Z.; Jia, K. Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749. [Google Scholar] [CrossRef]
  42. Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 244–253. [Google Scholar] [CrossRef]
  43. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar] [CrossRef]
  44. Wang, X.; Lan, J.; Wang, B.; Chen, C.; Chen, S. AVFP-MVX: Multimodal VoxelNet With Attention Mechanism and Voxel Feature Pyramid. IEEE Sens. J. 2023, 23, 6139–6149. [Google Scholar] [CrossRef]
  45. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  47. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Proceedings of the Conference on Robot Learning, PMLR Virtual Event, 16–18 November 2020; pp. 923–932. [Google Scholar]
  48. Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. In Proceedings of the Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  49. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-view Spatial Feature Fusion for 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 720–736. [Google Scholar]
  50. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
  51. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From Points to Parts: 3D Object Detection From Point Cloud With Part-Aware and Part-Aggregation Network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
  52. Shenga, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3D Object Detection with Channel-wise Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2723–2732. [Google Scholar] [CrossRef]
  53. Contributors, M. MMDetection3D: OpenMMLab Next-Generation Platform for General 3D Object Detection. 2020. Available online: https://github.com/open-mmlab/mmdetection3d (accessed on 17 April 2024).
Figure 1. Architecture for AEPF: Attention-Enabled Point Fusion for 3D object detection. Blocks illustrate processes from Section 3.1, Section 3.2, Section 3.3 and Section 3.4. Attention mechanisms for AEPF-Small and AEPF-Large are also shown.
Figure 2. Visualization of detection results for two AEPF variants. Panels (A,B) display results for AEPF-S, while panels (C,D) show results for AEPF-L. False positives and missed detections from AEPF-S, highlighted by dotted yellow lines, are effectively addressed by AEPF-L. Red bounding boxes indicate cars and purple bounding boxes indicate pedestrians.
Figure 3. Visualization of detection results for (A) MVXNet (obtained from [53]), (B) AEPF-S, and (C) AEPF-L. Dotted green lines indicate false negatives, while dotted yellow lines indicate false positives. AEPF-L effectively resolves false negatives identified by MVXNet and AEPF-S. Purple bounding boxes indicate pedestrians and red bounding boxes indicate cars.
Table 1. Car 3D detection results on the KITTI validation set. We use [9] for baseline comparison. Cells are left blank for methods that did not report their validation statistics in their paper. The best and second best performance among fusion-based methods only for every category is highlighted in black and blue, respectively.
| Method | Modality | Car 3D AP (R40) | | | Car BEV AP (R40) | | |
| | | Easy | Mod. | Hard | Easy | Mod. | Hard |
| VoxelNet [8] | LiDAR | 81.97 | 65.46 | 62.85 | 89.60 | 84.81 | 78.57 |
| SECOND [9] (Baseline) | LiDAR | 87.43 | 76.48 | 69.10 | 89.96 | 87.07 | 79.66 |
| PointPillars [7] | LiDAR | 86.46 | 77.28 | 74.65 | - | - | - |
| PointRCNN [37] | LiDAR | 88.72 | 78.61 | 77.82 | - | - | - |
| PV-RCNN [11] | LiDAR | 92.57 | 84.83 | 82.69 | 95.76 | 91.11 | 88.93 |
| Voxel-RCNN [45] | LiDAR | 92.38 | 85.29 | 82.86 | 95.52 | 91.25 | 88.99 |
| CT3D [52] | LiDAR | 92.85 | 85.82 | 83.46 | 96.14 | 91.88 | 89.63 |
| MV3D [12] | LiDAR+RGB | 71.29 | 62.68 | 56.56 | 86.55 | 78.10 | 76.67 |
| F-PointNet [43] | LiDAR+RGB | 83.76 | 70.92 | 63.65 | 88.16 | 84.02 | 76.44 |
| CLOCs [14] | LiDAR+RGB | 92.78 | 85.94 | 83.25 | 93.48 | 91.98 | 89.48 |
| MVXNet [18] | LiDAR+RGB | 85.50 | 73.30 | 67.40 | 89.50 | 84.9 | 79.00 |
| AEPF-S (Ours) | LiDAR+RGB | 89.87 | 77.83 | 73.45 | 94.40 | 87.13 | 84.74 |
| AEPF-L (Ours) | LiDAR+RGB | 91.13 | 79.06 | 76.15 | 95.27 | 88.39 | 85.91 |
Table 2. Comparison of three different methods for 3D object detection. We used open-sourced implementation in MMDetection3D [53] for ResNet-50 and SECOND-FPN-configured MVXNet [18]. The metrics for the top-performing model in each category are highlighted in bold.
| Method | Modality | 2D Backbone | 3D Backbone | Speed (fps) | Car 3D mAP | Car BEV mAP |
| MVXNet [18] (Baseline) | LiDAR+RGB | ResNet-50 + FPN | SECOND + FPN | 26.2 | 80.48 | 89.37 |
| AEPF-S (Ours) | LiDAR+RGB | ResNet-50 | SECOND + FPN | 31.0 | 80.38 | 88.75 |
| AEPF-L (Ours) | LiDAR+RGB | ResNet-50 + FPN | SECOND + FPN | 25.0 | 82.11 | 89.86 |
Table 3. Ablation experiments to choose feature extraction pipeline for AEPF-S before fusion. Features from stages 2 and 3 were used without an FPN for the final AEPF-S architecture. The metrics for the top-performing configuration in each category are highlighted in bold.
| ResNet-50 Stages | Attention | Car 3D mAP | Car BEV mAP |
| All | - | 76.57 | 86.79 |
| 1 and 2 | AEPF-S | 79.37 | 87.31 |
| 2 and 3 | AEPF-S | 80.38 | 88.75 |
| 3 and 4 | AEPF-S | 79.58 | 87.54 |
Table 4. Ablation experiments to choose the number of attention heads for MHSA in AEPF-L. For the final model, the number of heads was set to 4. The metrics for the top-performing configuration in each category are highlighted in bold.
| Num. of Heads | Attention | Car 3D mAP | Car BEV mAP |
| 4 | AEPF-L | 82.11 | 89.86 |
| 8 | AEPF-L | 81.54 | 89.47 |
| 12 | AEPF-L | 80.06 | 88.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
