MDFusion: Multi-Dimension Semantic–Spatial Feature Fusion for LiDAR–Camera 3D Object Detection

Qiao, Renzhong; Yuan, Hao; Guan, Zhenbo; Zhang, Wenbo

doi:10.3390/rs17071240


¹ The School of Electronic Engineering, Xidian University, Xi’an 710071, China

² The 54th Research Institute of CETC, Shijiazhuang 050081, China

³ Hebei Key Laboratory of Intelligent Information Perception and Processing, Shijiazhuang 050081, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2025, 17(7), 1240; https://doi.org/10.3390/rs17071240

Submission received: 28 January 2025 / Revised: 14 March 2025 / Accepted: 17 March 2025 / Published: 31 March 2025


Abstract:

Accurate 3D object detection is becoming increasingly vital for the development of robust perception systems, particularly in applications such as autonomous vehicles and robotic systems. Many existing approaches rely on bird’s eye view (BEV) feature maps to facilitate multi-modal interaction, as BEV representations enable efficient operations. However, the inherent sparsity of LiDAR BEV features often leads to misalignment with the dense semantic information in camera images, resulting in suboptimal fusion quality and degraded detection performance, especially in complex and dynamic environments. To mitigate these issues, this paper proposes a novel multi-dimension semantic–spatial feature fusion (MDFusion) method that combines LiDAR and image features in 2D and 3D spaces. Specifically, image semantic features are extracted using the DeepLabV3 segmentation network, which captures rich contextual information, and are aligned with LiDAR point cloud voxel features through a summation operation to achieve precise semantic fusion. Additionally, LiDAR BEV features are fused with downsampled image features in 2D space via concatenation and spatially adaptive dilated convolution. This mechanism dynamically adjusts to the spatial characteristics of the data, ensuring robust feature integration. Extensive experiments on the KITTI and ONCE datasets demonstrate that our method achieves competitive performance in complex scenes, significantly improving the multi-modal fusion quality and detection accuracy while maintaining computational efficiency.

Keywords: 3D object detection; multi-dimension feature fusion; multi-modal

1. Introduction

In environmental perception systems, the most commonly used sensors include cameras and LiDAR. Monocular cameras provide texture and fine-grained data for target objects, while stereo cameras can additionally capture depth information. Nonetheless, both types of sensors remain sensitive to adverse lighting conditions and object occlusion. In contrast, LiDAR is widely employed in sensing applications as it can provide a high resolution for angle, distance, and speed measurements. It remains uninfluenced by the lighting conditions and provides detailed 3D information. Table 1 displays the attributes of these sensors.

Recent advancements in LiDAR-based perception systems have yielded sophisticated architectures such as DVST [1], which employs deformable attention mechanisms and an offset generation module to prioritize foreground features dynamically. The Local-to-Global Semantic Learning Network (LGSLNet) [2] exploits complementary multi-view LiDAR representations to improve the detection robustness for distant small objects. Nevertheless, the inherent sparsity of LiDAR point clouds fundamentally limits the detection capabilities of LiDAR-based systems under adverse conditions. In Figure 1a, the marked area within the LiDAR image displays a point cloud of distant cars. The image corresponding to the point cloud is shown in Figure 1b, which contains rich semantic information about the object. In Figure 1c, the direct projection of sparse LiDAR points onto the image creates feature misalignment, highlighting the challenges in cross-modal fusion. Therefore, the synergistic integration of camera–LiDAR data through learned feature alignment strategies is critical to enhance the robustness and detection outcomes.

LiDAR–camera fusion has emerged as a critical paradigm for 3D object detection, with existing methods broadly categorized into point-level, region-level, and BEV-based fusion strategies. While point-level approaches [3,4,5,6] augment LiDAR points with pixel semantics through calibration matrices, their effectiveness depends on precise sensor alignment and they struggle to resolve misalignment between modalities. Region-level methods [7,8] mitigate this by fusing instance-level RoI features but introduce background noise from proposals. BEV fusion [9,10] addresses these issues by unifying multi-modal features in a shared BEV space, yet the current implementations prioritize modality-specific processing over joint feature learning, leading to incomplete cross-modal interaction and suboptimal fusion.

Moreover, fundamental shortcomings persist in the current fusion frameworks. First, modality-specific biases arise from naive fusion operations such as concatenation, which inadequately reconcile LiDAR’s sparse geometric representations with cameras’ dense semantic features. This mismatch amplifies spatial misalignment, especially under sensor degradation or calibration errors. Second, oversimplified fusion frameworks homogenize 2D and 3D features as interchangeable inputs, thereby neglecting their intrinsic independence. For instance, high-resolution image textures and depth-aware LiDAR structures are treated as equivalent representations, disregarding their complementary roles in resolving occlusion or low-light scenarios. Such limitations critically undermine cross-modal synergy, where one modality should compensate for the weaknesses of the other under challenging conditions.

To address these gaps, we propose MDFusion, a multi-dimensional fusion network that tackles modality-specific biases and spatial misalignment through 3D semantic enhancement and 2D geometric-aware alignment. At the 3D semantic level, MDFusion integrates semantic-aware image segmentation priors with LiDAR voxel features via a sparse convolutional network, enhancing object boundary representation while suppressing sparse point cloud noise, directly resolving the severe semantic dilution problem in region-level fusion. At the 2D spatial level, learnable dilated convolution dynamically aligns BEV features with image features using spatial offsets, adapting to calibration errors and viewpoint variations that inherently degrade the fusion. Subsequently, we leverage a spatially adaptive alignment (SAA) module to retain the spatial awareness and adjust the feature aggregation positions. In summary, our contributions include the following aspects.

We propose a 3D semantic-level feature fusion method to address the semantic sparsity of object point clouds. By employing an inverse operation, we precisely align the 3D voxel features with the semantic features of images projected into 3D space. This module efficiently achieves dense semantic feature fusion through summation.
We present a 2D spatial-level feature fusion module to tackle the spatial sparsity of point clouds. RGB image features and LiDAR BEV features are aligned using learnable spatial position offsets and fused with dilated convolutions.
Our proposed MDFusion demonstrates competitive results on the KITTI and ONCE datasets, performing particularly well on complex-shaped cyclists and on long-range objects (50 m–inf), where it outperforms several advanced methods. Extensive experiments demonstrate that our novel fusion method effectively integrates multi-modal features and overcomes the sparsity of the point cloud.

The remaining parts of this article are composed as follows. Section 2 introduces related work. Section 3 describes the proposed MDFusion. Section 4 tests our proposed method with extensive experiments. Section 5 concludes our work.

2. Related Work

2.1. LiDAR-Only 3D Detection

Seeking to tackle the issues posed by point clouds’ unstructured and irregular nature, the current 3D object detection methods are primarily divided into three categories: point-based, voxel-based, and point–voxel methods.

Point-based approaches directly generate regression boxes from points. PointRCNN [11] employs PointNet++ [12] to generate a global feature that represents the geometric structure of the whole point set. VoteNet [13] utilizes PointNet++ for feature extraction and applies surface point voting to predict bounding boxes. CenterPoint [14] adapts the principles of CenterNet [15] into a three-dimensional framework, enabling the anchor-free detection of 3D objects through the prediction of key points. While precise 3D positioning information is preserved, these algorithms often experience elevated computational demands.

Voxel-based methods address the irregularity of raw point clouds by quantizing them into structured voxel grids, enabling efficient feature processing. VoxelNet [16] is a popular framework that introduces a voxel feature encoder (VFE) to obtain richer feature representations. SECOND [17] proposes a sparse convolution technique to improve 3D convolution. PointPillars [18] partitions the 3D space into pillars. It encodes the features of individual points into features of non-empty pillars. After this, it combines these to form pseudo-images. Finally, it makes use of 2D convolutional networks for rapid detection. Voxel-RCNN [19] incorporates voxel pooling to combine one-stage features and improve the precision of RPN-predicted boxes, thereby boosting the detection performance. VoTr [20] employs sparse transformers in place of the traditional backbone to build its feature extraction networks.

Point–voxel detectors combine the merits of point-based and voxel-based methods to boost 3D detection. The STD [21] method utilizes point networks in its initial stage, converting features into voxels in the following stage to refine the predictions. Similarly, PV-RCNN [22] and PVRCNN++ [23] start with voxel networks, integrating point features during the later stage for enhanced prediction accuracy.

2.2. Camera-Only Detection

Camera-based methods typically obtain depth values through direct or indirect depth estimation and then perform 3D object detection. SMOKE [24] refines the keypoint concept, arguing that the 2D detection module adds unnecessary noise. It retains only the 3D detection module, eliminating the reliance on 2D proposals for 3D estimation. Instead, it combines single keypoint estimation with 3D variable regression to predict 3D bounding boxes. Keypoint3D [25] proposes an anchor-free 3D detection framework that projects a 3D object’s geometric centroid onto a 2D plane as a key point. It uses an adaptive elliptical Gaussian filter on the heatmap for precise keypoint localization. Lift-splat-shoot (LSS) [26] and BEVDet [27] offer two distinct approaches for BEV feature generation based on multi-view camera images, offering a more accurate and efficient detection pipeline for camera-only methods. However, a constraint is the absence of precise depth details in camera images, necessitating depth reconstruction from 2D data. Additionally, the passive imaging modality of cameras constrains the efficacy of image-based methods, rendering them susceptible to variations in lighting and weather conditions.

2.3. LiDAR–Camera 3D Detection

Diverse LiDAR–camera fusion techniques have been explored to improve the representation and enhance the performance, leveraging the complementary strengths of point clouds and images. AVOD [28] and MV3D [8] were among the first proposals in multi-modal fusion, where 2D and 3D regions of interest are combined directly prior to box prediction. Meanwhile, 3D-CVF [29] integrates multi-view image features. CLOCs [30] enhances the 3D candidate confidence using 2D candidates through a learnable approach. Certain studies have achieved detailed fusion by creating correlations between images and point clouds, subsequently using point clouds to reference image features. Nonetheless, the image data that they reference are constrained by the limited correspondence between images and point clouds. SFD [31] and TED [32] utilize depth completion to resolve the limited alignment between images and point clouds. However, the depth completion network restricts the advancement of this fusion approach. SupFusion [33] improves the model robustness under adverse conditions by leveraging multi-task learning frameworks, although its effectiveness relies on intricate joint loss optimization. FusionRCNN [34] achieves enhanced detection precision through a two-stage region proposal refinement mechanism. However, this architecture demonstrates significant sensitivity to sensor calibration inaccuracies. FusionViT [35] employs transformer-based architectures for global cross-modal feature integration but faces challenges with computational efficiency during large-scale deployment. MSMDFusion [36] facilitates multi-scale feature fusion through depth-aware convolutional modules but exhibits a substantial dependence on pre-trained depth estimation networks for optimal performance.
These methods commonly exhibit sensitivity to sensor calibration errors, inefficiency in aligning sparse LiDAR points with dense image pixels, and a strong reliance on auxiliary networks for depth estimation or feature completion, limiting their robustness and scalability in practical scenarios.

2.4. Deformable Convolution

In deep learning, CNNs are pivotal for visual computing tasks but struggle with geometric transformations because of their fixed receptive fields. Deformable convolutional networks, introduced by Dai et al. [37], address this issue by incorporating learnable offsets that adjust the sampling positions dynamically, making CNNs more adaptable to object deformations. Subsequent work, such as Deformable ConvNets v2 by Zhu et al. [38], introduced modulation mechanisms to further enhance the flexibility and representational power. Deformable convolution has since been widely adopted in computer vision applications, and its ability to adjust the receptive field dynamically marks a significant advance over traditional convolution.

3. Method

In this section, we provide an overview of the key components of our proposed MDFusion. Section 3.1 introduces the network’s overall architecture. Section 3.2 delves into the implementation specifics of the 3D semantic-level fusion component. Finally, Section 3.3 elucidates the distinct design principles of our 2D spatial-level fusion methodology.

3.1. Overall Architecture

Our proposed MDFusion 3D object detection network, illustrated in Figure 2, primarily consists of the 3D semantic-level fusion module (3D-SLF) and the 2D spatial-level fusion module (2D-SLF). The initial point cloud is voxelized using a voxel encoder and downsampled through the first layer of sparse convolution (C1). We utilize only the voxel features from the C1 layer to eliminate some redundant information and ensure that the geometric shapes of features can be distinguished. Semantic features are extracted from the RGB images using a semantic segmentation network and integrated into the 3D-SLF module to enhance the sparse point cloud features. The voxel features obtained from the first fusion stage are then passed to the second layer of sparse convolution (C2), followed by height compression to generate BEV features. In the 2D-SLF module, image features extracted via a convolutional neural network are fused with the BEV features. To effectively align their spatial characteristics, we implement learnable spatial offsets to adjust the fusion positions adaptively, resulting in high-quality fused features. This method enables more accurate and robust detection, particularly for complex-shaped and small objects.

3.2. Three-Dimensional Semantic-Level Fusion

In this study, we posit that the sparsity of LiDAR point clouds is reflected not only in geometric space but also in semantics. This sparsity poses difficulties for object detection, especially for occluded and small objects. Traditional 3D sparse convolution cannot enhance the representation ability of sparse point cloud features, and it further blurs the distinctions between object and background features. To mitigate this, we utilize image segmentation to extract semantic features and align them with voxel features. By integrating the two modalities at the 3D semantic level, we enhance the semantic separability of different objects, leveraging the fact that RGB image features generally contain abundant appearance details and extensive receptive fields.

For the semantic segmentation network, we employ the Semdeeplabv3 [39] network to extract semantics from the RGB image. As shown in Figure 3, it contains a ResNet50 backbone and a segmentation head, with the primary component being an atrous spatial pyramid pooling (ASPP) module.

The process of semantic feature extraction can be represented as

$$F_{I}^{Se} = \mathrm{B\text{-}Inter}\left[\mathrm{SegHead}\left(\mathrm{Res}(F_{I})\right)\right] \tag{1}$$

$$\mathrm{SegHead}(F) = \mathrm{Conv}^{3}\left[\mathrm{Conv}^{1}\left[\mathrm{Concat}\left(\mathrm{ASPP}_{(1, 6, 12, 8)}(F)\right)\right]\right] \tag{2}$$

where $F_{I}^{Se}$ is the extracted semantic feature, $F_{I}$ is the RGB image, $\mathrm{B\text{-}Inter}$ represents bilinear interpolation, $\mathrm{SegHead}$ represents the segmentation head, $\mathrm{Res}$ represents the ResNet50 backbone network, $\mathrm{Concat}$ is the concatenation operation, and $\mathrm{Conv}^{k}$ ($k = 1, 3$) denotes a convolution with a kernel size of $k$.

In the 3D-SLF module, as shown in Figure 4, there are two specific processing steps: feature alignment and feature fusion. For feature alignment, a challenge is 2D-to-3D projection misalignment. The reason for this is that point clouds are generally transformed and enhanced using various methods, such as flipping, rescaling, rotating, and translation, before being input into the model. Additionally, augmentation techniques often include sampling ground-truth data and copying objects from different scenes. To manage reversible transformations, we reverse the coordinates of the sparse features by applying transformation parameters. In the process of sampling the ground truth, we suitably superimpose the relevant 2D objects onto the images. The specific steps involved in feature alignment are outlined as follows:

$$F_{L}^{C1} = \mathrm{Inverse}_{Rot}\left(\mathrm{Inverse}_{Flip}\left(F_{Vox}^{C1}\right)\right) \tag{3}$$

$$F_{I'}^{Se} = \mathrm{Project}_{2D \to 3D}\left(F_{I}^{Se}\right) \tag{4}$$

where $F_{L}^{C1}$ is the LiDAR feature obtained through the inverse operation, $F_{Vox}^{C1}$ is the voxel feature from the C1 layer inputted into the fusion module, and $F_{I'}^{Se}$ is the image feature projected into the 3D space.
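The alignment step in Equations (3) and (4) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the forward augmentation order (scale, then rotate about the z-axis, then flip), and the single 3x4 projection matrix `P` that is assumed to already combine the LiDAR-to-camera extrinsics with the camera intrinsics are all assumptions made for illustration. The inverse undoes the flip first and the rotation second, matching the order written in Equation (3).

```python
import math


def augment(point, flip_y, rot_z, scale):
    """Assumed forward augmentation: scale, rotate about z, then flip y."""
    x, y, z = point
    x, y, z = x * scale, y * scale, z * scale
    c, s = math.cos(rot_z), math.sin(rot_z)
    x, y = c * x - s * y, s * x + c * y
    if flip_y:
        y = -y
    return (x, y, z)


def inverse_augment(point, flip_y, rot_z, scale):
    """Undo the augmentation in reverse order: flip, rotation, scale."""
    x, y, z = point
    if flip_y:  # a flip is its own inverse
        y = -y
    c, s = math.cos(-rot_z), math.sin(-rot_z)  # rotate by the negative angle
    x, y = c * x - s * y, s * x + c * y
    return (x / scale, y / scale, z / scale)


def project_to_image(point, P):
    """Project a 3D point with a 3x4 matrix P; return (u, v) or None."""
    hom = (point[0], point[1], point[2], 1.0)
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    if w <= 0:  # the point lies behind the image plane
        return None
    return (u / w, v / w)
```

In this sketch each voxel centre is first mapped back to the un-augmented sensor frame and then projected to a pixel, where the semantic feature can be looked up. This realises the 2D-to-3D correspondence of Equation (4) from the voxel side.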

To achieve feature fusion, the aligned RGB semantic features are seamlessly combined with the sparse features using a straightforward summation approach. This method is employed to maintain efficiency, as both types of features share the same channel dimensions:

$$F_{F}^{C1} = \mathrm{Add}\left(F_{L}^{C1}, F_{I'}^{Se}\right) \tag{5}$$

where $\mathrm{Add}$ represents element-wise summation and $F_{F}^{C1}$ represents the fused features output from the C1 layer. The fused multi-modal feature serves as the basis for predicting different objects’ importance, which can be jointly trained.
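As a hedged illustration of the summation in Equation (5), the sketch below adds a semantic feature to each non-empty voxel whose projected pixel falls inside the image. The data layout (plain nested lists), the floor-based pixel lookup, and the name `fuse_by_summation` are assumptions for illustration only. Voxels without a valid pixel keep their LiDAR feature unchanged.

```python
import math


def fuse_by_summation(voxel_feats, voxel_pixels, sem_map):
    """Add image semantics to sparse voxel features, channel by channel.

    voxel_feats:  list of C-dimensional feature lists (one per voxel)
    voxel_pixels: list of (u, v) pixel coordinates, or None if unprojectable
    sem_map:      semantic feature map indexed as sem_map[v][u] -> C values
    """
    height, width = len(sem_map), len(sem_map[0])
    fused = []
    for feat, pix in zip(voxel_feats, voxel_pixels):
        out = list(feat)
        if pix is not None:
            u, v = math.floor(pix[0]), math.floor(pix[1])
            if 0 <= v < height and 0 <= u < width:
                # Both modalities share the channel dimension, so a plain
                # element-wise sum is sufficient.
                out = [a + b for a, b in zip(feat, sem_map[v][u])]
        fused.append(out)
    return fused
```

Because the sum requires equal channel counts, the semantic feature map has to be produced with the same number of channels as the voxel features.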

To ascertain the optimal stage for fusion, we carried out a comprehensive set of experiments, the findings of which are detailed in Section 4.3.

3.3. Two-Dimensional Spatial-Level Fusion

In the pipeline of the voxel-based detector, voxel features are transformed into a BEV through height compression, enabling the detector to adapt to complex scenes. Although BEV simplifies the analysis of spatial relationships, its sparser representation may lead to the loss of the spatial shapes of objects, resulting in false positives and missed detections. RGB images not only contain the distinguishable contours and texture features of objects but also include the spatial relationships between them, providing reliable complementary features for BEV representations.
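Height compression can be sketched as folding the depth (height) axis into the channel axis. The example below assumes a dense feature volume indexed as `voxels[c][d][h][w]`, with empty voxels already filled with zeros. The function name is hypothetical and the layout is an assumption rather than a detail stated in the text.

```python
def height_compress(voxels):
    """Fold the depth axis into channels: voxels[c][d][h][w] -> bev[c * D + d][h][w]."""
    return [plane for channel in voxels for plane in channel]


# Two channels and three height bins on a 1 x 1 grid give six BEV channels.
volume = [[[[c * 10 + d]] for d in range(3)] for c in range(2)]
bev = height_compress(volume)
```

Each BEV cell then carries the features of its whole vertical column, which is why columns containing few occupied voxels yield sparse BEV responses.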

Based on the above analysis, we propose a 2D spatial-level fusion (2D-SLF) method, as shown in Figure 5, to integrate both spatial features. Inspired by deformable convolution, we implement a learnable offset in the spatially adaptive alignment (SAA) module to align the BEV features with the image features and employ dilated convolution to enhance the receptive field of the features.

Before the SAA module, we perform channel-level integration via concatenation between the image features and BEV features to achieve the coarse fusion of the features. In the SAA module, we use convolution kernels of size $K$ and channel number $N$ to adaptively learn the position offsets of each sampling point. Here, $N$ is twice the square of the convolution kernel size ($N = 2K^{2}$), ensuring that the offsets in both the $x$ and $y$ directions can be learned for each sampling point. To simplify the learning parameters, we input each coordinate vector into a feedforward neural network (FNN) to obtain a unique positional offset $\Delta p$ for each sampling point. The specific alignment process can be expressed as follows:

$$F_{c} = \mathrm{Concat}\left(F_{BEV}, F_{I}'\right) \tag{6}$$

$$\left(\Delta p_{0}, \dots, \Delta p_{(N/2)-1}\right) = \mathrm{FNN}\left(\mathrm{Reshape}\left(\mathrm{Conv}^{K}(F_{c})\right)\right) \tag{7}$$

where $F_{c}$ is the concatenated feature, $F_{BEV}$ represents the input BEV features, and $F_{I}'$ denotes the input image features.

To integrate BEV and image features while expanding the receptive field of the fused features, we introduce dilated convolution to achieve the fusion of these two spatial features. Notably, learnable position offsets guide the dilated convolution to adapt to the shapes of objects in complex environments through element-wise summation. The dense fusion features are then fed into the RPN to complete the first stage of prediction. In a convolutional kernel with predefined sampling points, $w_{k}$ and $p_{k}$ correspond to the weight and the predetermined offset for the $k$-th sampling location. The dense fusion features at a given location $p$, derived from the input feature maps, are represented by $F_{F}'(p)$. With $K^{2} = N/2$ sampling points, the fusion mechanism can be mathematically described as follows:

$$F_{F}'(p) = \sum_{k=0}^{(N/2)-1} w_{k} \cdot F_{c}\left(p + p_{k} + \Delta p_{k}\right) \tag{8}$$
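To make Equation (8) concrete, the following sketch evaluates the response at one location of a single-channel map. It assumes a K x K kernel (so K² sampling points and N = 2K² offset values, i.e. 18 for K = 3), a dilated regular grid for the predetermined offsets, and bilinear interpolation with zero padding for fractional positions, as in deformable convolution [37]. The function names and the default dilation of 2 are illustrative assumptions, not values from the paper.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly sample a 2D map at fractional (y, x); outside is zero."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * fmap[yi][xi]
    return value


def deformable_dilated_response(fmap, p, weights, offsets, kernel=3, dilation=2):
    """Sum of w_k * F_c(p + p_k + dp_k) over the K*K sampling points."""
    half = kernel // 2
    # Predetermined offsets p_k: a regular grid stretched by the dilation.
    grid = [(dy * dilation, dx * dilation)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]
    assert len(weights) == len(offsets) == len(grid)
    out = 0.0
    for w_k, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        # Learned offsets dp_k are given as (dy, dx) and shift each sample.
        out += w_k * bilinear_sample(fmap, p[0] + gy + oy, p[1] + gx + ox)
    return out
```

With all learned offsets set to zero, the function reduces to an ordinary dilated convolution at location p, which makes the role of the offsets easy to isolate in experiments.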

To demonstrate the enhanced performance of our proposed SAA module, we carried out an ablation study, detailed in Section 4.3. This experiment was designed to analyze and evaluate the impact of the proposed alignment on the overall effectiveness of the fusion process.

3.4. Loss Function

We adopt the same loss function utilized by Voxel-RCNN [19]. The RPN loss comprises two terms: a classification loss, which identifies object categories, and a box regression loss, which predicts 3D box positions. The RPN loss is as follows:

$$L_{RPN} = \frac{1}{N_{fg}}\left[\sum_{i} L_{cls}\left(p_{i}^{a}, c_{i}^{*}\right) + \mathbb{1}\left(c_{i}^{*} \geq 1\right) \sum_{i} L_{reg}\left(\delta_{i}^{a}, t_{i}^{*}\right)\right] \tag{9}$$

where $N_{fg}$ denotes the total count of foreground anchors, $p_{i}^{a}$ and $\delta_{i}^{a}$ correspond to the outputs generated by the classification and bounding box regression branches, respectively, and $c_{i}^{*}$ and $t_{i}^{*}$ represent the ground truth labels for the classification and regression targets. Notably, the indicator function $\mathbb{1}\left(c_{i}^{*} \geq 1\right)$ ensures that the regression loss is computed exclusively for foreground anchors. To optimize the model, the focal loss is employed for classification, and the Huber loss is utilized for bounding box regression.
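The structure of Equation (9) can be sketched as follows. This is a simplified illustration, not the training code: classification is reduced to a binary foreground/background decision, and the focal loss parameters (`alpha = 0.25`, `gamma = 2.0`) and the Huber threshold (`delta = 1.0`) are commonly used defaults assumed here because the text does not state them.

```python
import math


def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor; prob is the foreground probability."""
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear for residuals larger than delta."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def rpn_loss(probs, labels, box_preds, box_targets):
    """Classification over all anchors, regression over foreground only,
    both normalised by the number of foreground anchors."""
    n_fg = max(1, sum(1 for c in labels if c >= 1))
    cls = sum(focal_loss(p, 1 if c >= 1 else 0) for p, c in zip(probs, labels))
    reg = sum(
        huber_loss(d - t)
        for c, pred, tgt in zip(labels, box_preds, box_targets) if c >= 1
        for d, t in zip(pred, tgt)
    )
    return (cls + reg) / n_fg
```

The filter `c >= 1` plays the role of the indicator in Equation (9): background anchors contribute to the classification term but never to the regression term.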

The confidence branch is designed to predict a value associated with the IoU, expressed as follows:

$$l_{i}^{*}\left(\mathrm{IoU}_{i}\right) = \begin{cases} 0 & \mathrm{IoU}_{i} < \theta_{L} \\ \dfrac{\mathrm{IoU}_{i} - \theta_{L}}{\theta_{H} - \theta_{L}} & \theta_{L} \leq \mathrm{IoU}_{i} < \theta_{H} \\ 1 & \mathrm{IoU}_{i} \geq \theta_{H} \end{cases} \tag{10}$$

where $\mathrm{IoU}_{i}$ represents the IoU between the $i$-th proposal and its associated ground truth bounding box, while $\theta_{H}$ and $\theta_{L}$ denote the IoU thresholds for the foreground and background, respectively. For the prediction of confidence, the binary cross-entropy loss is employed. Additionally, the box regression branch employs the Huber loss, consistent with the approach used in the region proposal network (RPN). The overall losses for our detection head are calculated as follows:

$$L_{head} = \frac{1}{N_{s}}\left[\sum_{i} L_{cls}\left(p_{i}, l_{i}^{*}\left(\mathrm{IoU}_{i}\right)\right) + \mathbb{1}\left(\mathrm{IoU}_{i} \geq \theta_{reg}\right) \sum_{i} L_{reg}\left(\delta_{i}, t_{i}^{*}\right)\right] \tag{11}$$

where $N_{s}$ represents the total number of region proposals sampled during the training phase, and the indicator function $\mathbb{1}\left(\mathrm{IoU}_{i} \geq \theta_{reg}\right)$ ensures that only those region proposals meeting the condition $\mathrm{IoU}_{i} \geq \theta_{reg}$ are factored into the regression loss calculation.
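The piecewise confidence target of Equation (10) is straightforward to express in code. The threshold values used below (0.25 and 0.75) are assumptions chosen for illustration, since the text does not report them.

```python
def confidence_target(iou, theta_l=0.25, theta_h=0.75):
    """Map a proposal's IoU to a soft confidence target in [0, 1]."""
    if iou < theta_l:
        return 0.0  # treated as background
    if iou < theta_h:
        # Linear ramp between the background and foreground thresholds.
        return (iou - theta_l) / (theta_h - theta_l)
    return 1.0  # treated as confident foreground
```

A soft target of this kind lets the binary cross-entropy loss grade proposals by localisation quality instead of forcing a hard positive/negative split.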

4. Experiments

4.1. Implementation Details

4.1.1. Network Settings

Our network is implemented with OpenPCDet v0.5.0 [40], an open-source codebase based on PyTorch v2.5.1. In the 3D-SLF process, the Semdeeplabv3 network is employed to extract semantic information from images. To align the dimensions of the extracted semantic features with those of the point cloud features, the number of output channels of the Semdeeplabv3 [39] semantic feature extraction is configured as 16. In the context of 2D-SLF, the Yolox [41] network serves as the image feature extraction mechanism. For the execution of two-dimensional spatial fusion, the extracted image features are aligned with the BEV features using bilinear interpolation, adjusting the features to a length of 200 and a width of 176, ensuring dimensional congruence and facilitating effective fusion. For the KITTI dataset, the input point cloud is constrained within specific boundaries: the Z-axis spans from −3 to 1 m, the X-axis ranges from 0 to 70.4 m, and the Y-axis extends from −40 to 40 m. The voxel dimensions are set at [0.05, 0.05, 0.1] meters, with each voxel capable of holding a maximum of 5 points. For the ONCE dataset, the point cloud is confined to a Z-axis range of −5 to 3 m, an X-axis range of −75.2 to 75.2 m, and a Y-axis range of −75.2 to 75.2 m. The voxel size here is configured to [0.1, 0.1, 0.2] meters, with each voxel also accommodating up to 5 points.
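The voxel grid resolution implied by these settings follows from dividing each axis extent by the voxel size, as the short calculation below shows. The overall BEV downsampling factor of 8 is an assumption. It is included because it reproduces the 200 x 176 BEV size quoted above for KITTI.

```python
def grid_size(pc_range, voxel_size):
    """Number of voxels along (x, y, z).

    pc_range = (x_min, y_min, z_min, x_max, y_max, z_max) in metres.
    """
    return tuple(
        round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)
    )


KITTI_GRID = grid_size((0.0, -40.0, -3.0, 70.4, 40.0, 1.0), (0.05, 0.05, 0.1))
ONCE_GRID = grid_size((-75.2, -75.2, -5.0, 75.2, 75.2, 3.0), (0.1, 0.1, 0.2))

# Assumed 8x spatial downsampling before the BEV map, ordered as (y, x).
KITTI_BEV = (KITTI_GRID[1] // 8, KITTI_GRID[0] // 8)

print(KITTI_GRID, ONCE_GRID, KITTI_BEV)
```

Under these assumptions the KITTI grid is 1408 x 1600 x 40 voxels and the ONCE grid is 1504 x 1504 x 40 voxels.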

4.1.2. Training Configuration

All experiments are conducted using an RTX 4090 GPU. The network undergoes training for 60 epochs with a batch size of 2 for the KITTI [42] dataset, while, for the ONCE [43] dataset, it undergoes 80 epochs of training with the same batch size. We implement a cosine annealing strategy to decay the learning rate, capping the maximum learning rate at 0.005. Optimization is carried out using the ADAM optimizer.
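A cosine annealing schedule with the stated maximum learning rate can be sketched as below. The minimum learning rate of zero and the absence of a warm-up phase are assumptions, and the step count simply combines the batch size of 2 with the 3712-sample KITTI training split and 60 epochs.

```python
import math


def cosine_lr(step, total_steps, max_lr=0.005, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine decay."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = math.ceil(3712 / 2)  # KITTI training split, batch size 2
total_steps = steps_per_epoch * 60     # 60 epochs
halfway_lr = cosine_lr(total_steps // 2, total_steps)
```

The schedule starts at the maximum rate, passes through half of it at the midpoint, and decays smoothly towards the minimum at the final step.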

4.1.3. Dataset

The KITTI dataset stands as a cornerstone benchmark for 3D object detection in the realm of autonomous driving. It integrates a variety of sensor data, including stereo cameras and a 64-beam Velodyne LiDAR, encompassing 7481 training samples and 7518 test samples. Following standard practice, we split the training set into two subsets: a training split with 3712 examples and a validation split with 3769 examples. We evaluate the 3D detection performance of MDFusion alongside cutting-edge 3D detectors across three levels of difficulty on the validation set to assess the model effectiveness. The ONCE dataset is a comprehensive, up-to-date resource with one million LiDAR scenes, tailored to 3D object detection in self-driving systems. In addition, the ONCE dataset has 16,000 annotated point cloud scenes, each providing a full 360-degree field of view. To train and validate our supervised model, we use 4961 training samples and 3321 validation samples, which contain three classes: vehicles, cyclists, and pedestrians. It is worth noting that “vehicles” is a superclass that includes cars, buses, and trucks. We carried out tests on typical vehicle and pedestrian types, including cars, buses, trucks, pedestrians, and cyclists. We present the 3D detection results of MDFusion and leading 3D detectors on the validation set for model assessment.

4.1.4. Evaluation Metrics

Adhering to the KITTI dataset protocol, we assess the performance of 3D and BEV detection using the average precision (AP) computed at 40 recall positions (R40) and 11 recall positions (R11). To compare the overall performance of different models in multi-class object detection, we also use the mAP of the difficulty levels as another metric. For a more reliable evaluation of different ranges of object detection, the orientation is factored into the AP calculation in the ONCE dataset. We utilize the orientation-aware AP to evaluate the performance of MDFusion with separate ranges, “0–30 m”, “30–50 m”, and “50 m-inf”, while the mAP is calculated by averaging the overall AP of all classes (Figure 6).
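The difference between the R40 and R11 variants of AP lies only in the recall positions at which the interpolated precision is averaged. The sketch below assumes the usual convention: 40 positions at 1/40, 2/40, ..., 1 for R40 and 11 positions at 0, 0.1, ..., 1 for R11, with the interpolated precision taken as the maximum precision at any recall not smaller than the position. The function name and input format are illustrative.

```python
def interpolated_ap(recalls, precisions, num_points=40):
    """Average interpolated precision over fixed recall positions.

    recalls, precisions: matching points on a precision-recall curve.
    num_points: 40 for AP|R40 (recall 0 excluded) or 11 for AP|R11.
    """
    if num_points == 11:
        thresholds = [i / 10 for i in range(11)]
    else:
        thresholds = [i / num_points for i in range(1, num_points + 1)]
    total = 0.0
    for r in thresholds:
        # Interpolated precision: best precision at recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / len(thresholds)
```

For a toy curve with precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0, the two variants give 0.75 (R40) and about 0.773 (R11), which illustrates why results under the two protocols are not directly comparable.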

4.2. Experimental Results

4.2.1. Main Results on KITTI

We assessed MDFusion on KITTI’s validation and test sets, comparing the outcomes with those of leading methods such as PV-RCNN [22] and PointPillars [18] and the multi-modal fusion methods EPNet [5] and AVOD-FPN [28]. The results reported for these methods were taken from the corresponding publications, as detailed in Table 2. The widely recognized KITTI 3D detection benchmark relies on the AP metric, which is calculated using an IoU threshold of 0.7 and 40 recall points.

Our method outperforms other approaches on the validation set. Compared to existing methods in multi-class 3D object detection tasks, the proposed MDFusion demonstrates significant advantages. As illustrated in Table 2, when compared to the top-performing multi-modal method EPNet, the proposed MDFusion achieves superior performance across all three object categories. Under the 40 recall points metric, it improves by 4.1, 7.38, and 5.4 for the car category at easy, moderate, and hard difficulty levels, respectively. For the cyclist category, the improvements are 7.73, 8.7, and 7.03 across the three difficulty levels, while, for the pedestrian category, the improvements are 1.79 and 1.6 at the easy and moderate difficulty levels, respectively. Since pedestrians and cyclists often lack complete point cloud information in LiDAR data, these results highlight the effectiveness of our 3D-SLF approach.

Our method performs well on the cyclist data. On the test set, as illustrated in Table 3, it outperforms nearly all approaches in the cyclist category. Compared with STD [21], the proposed MDFusion improves by 5.56 at the moderate difficulty level and 5.51 at the hard difficulty level in the cyclist class. Since cyclists usually have irregular structures, this result supports the effectiveness of 2D-SLF for non-rigid object detection.

4.2.2. Visualizations

In the image shown in Figure 7a, distant pedestrians and vehicles are visible. In the ground-truth boxes shown in Figure 7b, however, these distant objects are unlabeled. Figure 7c shows that the predictions of FocalConv contain missed detections and false positives. In contrast, Figure 7d shows that our proposed method identifies the objects accurately.

Figure 8a presents the detection results of our research method, and Figure 8b shows the detection results of the FocalConv method. In Figure 8b, the area marked by the red circle is the misdetection part of the FocalConv method. This method wrongly identifies a pedestrian sitting on a chair as a cyclist. Through in-depth analysis, it is found that this is because, at the point cloud feature level, the features in such a situation are similar to those of cyclists, resulting in misjudgment during the detection process. In contrast, the proposed 2D-SLF module can effectively avoid such situations. The 2D-SLF module supplements image features for bird’s eye view (BEV) features, enabling the BEV features to accurately identify that the object is not a cyclist when generating 3D detection proposals through the region proposal network (RPN).

In Figure 9, because the geometric contours of flower beds resemble those of cars, the FocalConv method fails to detect some cars. For long-distance objects, our method successfully detects cars that the FocalConv method misses in Figure 9b. However, for cars at ultra-long distances, our method still has certain limitations. This is mainly because, in our method, image semantic features only serve as an auxiliary supplement to point cloud features, and 3D object detection mainly relies on point cloud features. In future work, we plan to explore how to make the point cloud features and image features operate independently and efficiently, so as to avoid such failures (Figure 10).

4.2.3. Main Results on ONCE

We evaluated the MDFusion model using the ONCE validation set and contrasted the results with those of other approaches, including PointContrast [52], PointPillars [18], DepthContrast [53], SwAV [54], and SECOND [17]. All performance values were obtained from the benchmark, as indicated in Table 4.

MDFusion outperforms the other methods in the “overall” metric for the pedestrian and cyclist classes, as well as in the “mAP” metric across all classes. Although it is not the best in the vehicle class, our model achieves the best performance in long-distance and small-object detection. In the pedestrian class, its AP is 5.53 higher than that of SECOND, which performed best among the compared methods at 30–50 m. In the cyclist class, its AP is 0.98 higher than that of PV-RCNN, which performed best among the compared methods at 50 m–inf. The promising long-distance detection results demonstrate the efficacy of the 3D-SLF module. The improvement is more pronounced in the pedestrian class. This is because pedestrians are non-rigid objects, and the spatially adaptive alignment adopted in the proposed 2D spatial fusion module can adapt to the shapes of such objects (Figure 11).

Figure 12 shows that our proposed model achieves the highest detection accuracy in terms of the average values across different distances. Additionally, its standard deviation lines fluctuate only minimally across distances, indicating that MDFusion offers more stable detection than the other models.

4.3. Ablation Study

To assess the efficacy of the modules introduced by MDFusion, we split the KITTI training dataset into 3712 samples for training and 3769 for validation. Using the baseline TED-S model, we conducted ablation studies focusing on the car, pedestrian, and cyclist categories. Performance was evaluated using the average precision metrics for bird’s eye view (AP-BEV) and 3D detection at the moderate difficulty level (3D AP-R40), demonstrating the impact of each proposed component.
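For reference, the R40 average precision can be sketched as follows. The snippet assumes the commonly used KITTI convention of averaging the interpolated precision at the 40 recall positions 1/40, 2/40, ..., 1; the function name `ap_r40` and the input format (parallel lists of recall and precision values taken from a precision–recall curve) are illustrative assumptions rather than the evaluation code used in the experiments.

```python
def ap_r40(recalls, precisions, num_points=40):
    """Interpolated average precision (in percent) over equally spaced recall positions.

    recalls, precisions: parallel lists describing points on a PR curve.
    For each recall position r = i / num_points (i = 1..num_points), the
    interpolated precision is the maximum precision among curve points whose
    recall is at least r, or 0 if the detector never reaches that recall.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    total = 0.0
    for i in range(1, num_points + 1):
        r = i / num_points
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable, default=0.0)
    return 100.0 * total / num_points


# Toy PR curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
print(ap_r40([0.5, 1.0], [1.0, 0.5]))
```

In this toy curve, half of the recall positions see precision 1.0 and the other half see 0.5, so the score lies midway between the two.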

Given that 3D-SLF represents a lightweight module and 2D-SLF embodies a more refined fusion approach, we conducted a series of runtime tests for each module and evaluated their AP-BEV performance at the moderate level to confirm their effectiveness. We carried out separate ablation experiments for the two proposed modules on the KITTI validation set, with the results presented in Table 5. The 3D-SLF module significantly reduced the model’s runtime, while 2D-SLF demonstrated notable enhancements in BEV detection. Overall, the network experienced an improvement of 1.83, 2.22, and 4.08 at the easy, moderate, and hard levels and a runtime reduction of 14 ms when compared to the baseline.

The results in Table 5 indicate that the 3D-SLF module improves detection for hard-level objects, and that the 2D-SLF module improves the detection speed more significantly. Both fusion modules deliver competitive performance. The reason is that adding the dense semantic features of the image enriches the sparse point cloud features and their granularity, which leads to better detection performance for occluded and small objects. The spatially adaptive module dynamically adjusts the sampling positions of the convolution kernel by introducing offsets, so as to capture features more accurately in a specific area, which reduces redundant computation and improves the inference efficiency.
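The offset mechanism can be illustrated with a minimal pure-Python sketch of a single dilated 3×3 kernel whose taps are shifted by per-tap offsets and read through bilinear interpolation, in the spirit of deformable convolution [37]. This is not the authors' implementation: in a trained network the offsets would be predicted by a learned layer, whereas here they are passed in by hand, and the names `bilinear_sample` and `adaptive_dilated_conv_at` are hypothetical.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map (list of lists) at fractional (y, x).

    Positions outside the map contribute zero.
    """
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * fmap[yi][xi]
    return value


def adaptive_dilated_conv_at(fmap, weights, offsets, cy, cx, dilation=2):
    """Response of one 3x3 dilated kernel centred at (cy, cx).

    weights: 3x3 list of kernel weights.
    offsets: 3x3 list of (dy, dx) shifts added to the regular dilated grid.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            y = cy + (i - 1) * dilation + dy
            x = cx + (j - 1) * dilation + dx
            out += weights[i][j] * bilinear_sample(fmap, y, x)
    return out


# Toy 7x7 feature map whose value encodes its position: row * 10 + column.
fmap = [[r * 10 + c for c in range(7)] for r in range(7)]
ones = [[1.0] * 3 for _ in range(3)]
zero_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
print(adaptive_dilated_conv_at(fmap, ones, zero_offsets, 3, 3))
```

With all offsets set to zero, the kernel reduces to an ordinary dilated convolution; non-zero offsets move individual taps off the regular grid so that the kernel can follow the shape of an object.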

To ascertain the optimal stage for the integration of 3D-SLF, we performed an additional ablation study. We selected various stages to merge image information with convolution operations and assessed the average precision (AP40) to gauge the impact. The findings are presented in Table 6. Compared with fusion in other stages, the effect of fusion in the first stage is the best at all levels of difficulty in all categories. This is because some redundant information is removed after the first sparse convolution downsampling step, and the sparse point cloud features of the object are not confused. In the fourth stage of fusion, the voxel features cannot be used in the three-dimensional semantic feature fusion module. This is because the voxel features after multi-layer sparse convolution downsampling do not have separable object features and cannot be effectively matched with the image semantic features.

To validate the effectiveness of our proposed spatial fusion method, we carried out an ablation comparison experiment, and the results are presented in Table 7. The results show that simple element-wise addition and plain convolution fusion cannot produce high-quality fused features, while element-wise multiplication performs better for easy-level cyclists. The fusion method that combines dilated convolution with SAA (the 2D-SLF module) can effectively adapt to the deformation of complex objects, and its effect on small objects is more pronounced. The reason is that addition yields overly uniform interaction features, so that important features are insufficiently learned, whereas ordinary convolution cannot adapt to the spatial deformation of objects. The 2D-SLF module is more comprehensive and flexible, which helps the model to learn more important features, especially when the available information is rich.
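The difference between the compared fusion operators can be made concrete on the feature vectors of a single BEV cell. The sketch below is illustrative only: the function names are hypothetical, and the fixed projection matrix merely stands in for the learned convolution that would follow concatenation in an actual network.

```python
def _check_same_length(bev, img):
    if len(bev) != len(img):
        raise ValueError("element-wise fusion needs equal channel counts")


def fuse_add(bev, img):
    """Element-wise addition: both modalities are merged with equal, fixed weight."""
    _check_same_length(bev, img)
    return [b + i for b, i in zip(bev, img)]


def fuse_mul(bev, img):
    """Element-wise multiplication: one modality gates the other channel by channel."""
    _check_same_length(bev, img)
    return [b * i for b, i in zip(bev, img)]


def fuse_concat(bev, img, proj):
    """Concatenate channels, then mix them with a projection matrix.

    proj has one row per output channel and len(bev) + len(img) columns;
    it stands in for the weights of a learned convolution.
    """
    stacked = list(bev) + list(img)
    if any(len(row) != len(stacked) for row in proj):
        raise ValueError("projection width must match the concatenated channels")
    return [sum(w * v for w, v in zip(row, stacked)) for row in proj]


bev_cell = [1.0, 2.0]   # LiDAR BEV features at one cell (toy values)
img_cell = [3.0, 4.0]   # image features mapped to the same cell (toy values)
proj = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(fuse_add(bev_cell, img_cell), fuse_mul(bev_cell, img_cell), fuse_concat(bev_cell, img_cell, proj))
```

Addition and multiplication fix in advance how the two modalities interact, whereas concatenation keeps every channel of both modalities available so that the subsequent learned layer can decide how to weight them.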

Figure 13a,b visualize the intensity of the sparse features. Figure 13c,d show the sparse features mapped to colors after PCA dimensionality reduction. In the area circled in red, the semantic features are visibly denser after passing through the 3D-SLF fusion module, and, after PCA dimensionality reduction, the color changes from green to blue. These results further confirm the effectiveness of the proposed 3D-SLF module in semantic supplementation.

Figure 14 presents the feature heatmaps before and after the 2D-SLF module. Figure 14a shows the heatmap before fusion, while Figure 14b displays the heatmap after fusion. By comparison, it is evident that the feature heatmap after the 2D spatial fusion module becomes significantly denser. This finding highlights the efficacy of our 2D spatial fusion module, utilizing dense image features to mitigate 3D point cloud sparsity.

To verify the effectiveness of adopting the DeepLabV3 network, we conducted ablation experiments using different segmentation networks; the results are shown in Table 8. Experiments on the KITTI validation set demonstrate that our method with DeepLabV3 achieves superior performance across the car, cyclist, and pedestrian categories. While pedestrian detection at the hard level remains suboptimal, the overall performance is significantly higher than that obtained with PSPNet. This supports the suitability of DeepLabV3 for our proposed framework.

We also compared the number of parameters and the runtime of our model with those of the lightweight model FocalConv-F, and additionally included the LiDAR-based model PV-RCNN. The results are shown in Table 9. Compared with FocalConv-F, our model achieves significant improvements in both the number of parameters and the runtime. Compared with PV-RCNN, our model has 1.95 M fewer parameters and a runtime that is 5 ms shorter. These results confirm that our proposed fusion method is lightweight.

5. Limitations and Discussion

While MDFusion achieves robust performance in multi-modal 3D detection, several challenges persist. The framework shows limited gains for small or heavily occluded objects, such as pedestrians in sparse point clouds, where fine-grained texture integration remains imperfect despite semantic-aware fusion. Additionally, while the geometric alignment module accommodates minor calibration shifts, its robustness diminishes under gradual calibration drift, a practical constraint in real-world sensor systems. The multi-dimensional architecture, though effective in fusing complementary modalities, introduces a computational overhead that may challenge real-time deployment. Furthermore, while the model excels on trained categories, its generalization to unseen object types or scenarios with concurrent sensor degradation remains underexplored, particularly in complex environments with multi-modal noise.

To advance multi-modal 3D detection, future work will focus on multi-scale adaptive mechanisms to enhance feature aggregation for small and occluded objects in sparse point clouds and pursue architectural optimizations to reduce the computational complexity while maintaining fusion accuracy. For calibration robustness, online self-correction strategies will be integrated into the alignment module to dynamically compensate for parameter drift under varying conditions, complemented by cross-modal contrastive learning to enforce geometric consistency between LiDAR and camera features during fusion. Domain-adaptive fusion frameworks will be developed using self-supervised learning techniques to distinguish relevant features from multi-modal noise in challenging environments. To address generalization limitations, open-set object detection principles will be integrated through uncertainty estimation mechanisms, such as defining confidence thresholds to categorize low-confidence detections as unknown objects, pseudo-unknown sample generation during training, and specialized loss functions that suppress the confidence scores for unseen object types. These enhancements aim to strengthen the real-world adaptability and generalization capabilities while preserving the framework’s core strengths in cross-modal synergy.
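As a rough illustration of the confidence-threshold idea mentioned above, the sketch below relabels low-confidence detections as unknown objects. It is a hypothetical post-processing step, not part of MDFusion; the function name, the `(label, score)` detection format, and the threshold value of 0.3 are assumptions chosen for the example.

```python
def relabel_low_confidence(detections, threshold=0.3):
    """Mark detections whose confidence falls below the threshold as 'unknown'.

    detections: list of (label, score) pairs, with score in [0, 1].
    Returns a new list; boxes are kept but their label is replaced.
    """
    return [
        (label if score >= threshold else "unknown", score)
        for label, score in detections
    ]


example = [("car", 0.92), ("cyclist", 0.18), ("pedestrian", 0.30)]
print(relabel_low_confidence(example))
```

Keeping the box while changing only the label lets a downstream planner still treat the region as occupied, even when the detector cannot name the object.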

6. Conclusions

The proposed MDFusion presents a robust LiDAR–camera fusion framework for 3D object detection that addresses the geometric structure degradation caused by sparse point cloud features through multi-dimension fusion. Key contributions include (1) a 3D semantic-level fusion module that generates object-specific semantic fusion features by summing image semantic features with aligned voxel features, significantly improving the classification accuracy on KITTI hard cases, and (2) a learnable 2D spatial-level fusion module with dynamic position offsets that achieves cross-modal feature alignment, reducing false positives through geometric consistency. Extensive experiments on the KITTI and ONCE datasets demonstrate state-of-the-art performance, particularly for complex-shaped and small objects. In the future, we will explore adaptive fusion mechanisms for adverse weather conditions and real-time deployment optimizations to address the remaining challenges in sensor calibration sensitivity and computational complexity scaling.

Author Contributions

Conceptualization, H.Y. and R.Q.; methodology, H.Y. and R.Q.; software, H.Y.; validation, H.Y. and Z.G.; formal analysis, H.Y. and R.Q.; investigation, Z.G.; resources, W.Z.; data curation, H.Y. and R.Q.; writing—original draft preparation, H.Y. and R.Q.; writing—review and editing, H.Y., R.Q. and W.Z.; visualization, H.Y. and Z.G.; supervision, W.Z.; project administration, W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62276204.

Data Availability Statement

Two publicly available datasets were analyzed in this paper. The ONCE dataset can be found here: https://once-for-auto-driving.github.io/index.html (accessed on 25 January 2025). The KITTI dataset can be found at https://www.cvlibs.net/datasets/kitti/ (accessed on 25 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

3D	Three-dimensional
LiDAR	Light detection and ranging
BEV	Bird’s eye view
2D	Two-dimensional
CNN	Convolutional neural network
IoU	Intersection over union
VFE	Voxel feature encoding
RPN	Region proposal network
RoI	Region of interest
3D-SLF	3D semantic-level fusion
2D-SLF	2D spatial-level fusion
ASPP	Atrous spatial pyramid pooling
FNN	Feedforward neural network
mAP	Mean average precision
AP	Average precision

References

1. Ning, Y.; Cao, J.; Bao, C.; Hao, Q. DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds. Remote Sens. 2023, 15, 5612.
2. Qiao, R.; Ji, H.; Zhu, Z.; Zhang, W. Local-to-Global Semantic Learning for Multi-View 3D Object Detection from Point Cloud. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9371–9385.
3. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612.
4. Zhang, Y.; Huang, D.; Wang, Y. PointAugmenting: Cross-modal augmentation for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11794–11803.
5. Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing point features with image semantics for 3D object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 35–52.
6. Xie, S.; Yang, D.; Jiang, K.; Zhong, Y. Pixels and 3-d points alignment method for the fusion of camera and lidar data. IEEE Trans. Instrum. Meas. 2018, 68, 3661–3676.
7. Wu, Q.; Li, X.; Wang, K.; Bilal, H. Regional feature fusion for on-road detection of objects using camera and 3d-lidar in high-speed autonomous vehicles. Soft Comput. 2023, 27, 18195–18213.
8. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
9. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781.
10. Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Peng, Q.; Liu, K.; et al. FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Object Detection. arXiv 2023, arXiv:2309.05257.
11. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 770–779.
12. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; NIPS: Red Hook, NY, USA, 2017; Volume 30.
13. Ding, Z.; Han, X.; Niethammer, M. Votenet: A deep learning label fusion method for multi-atlas segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part III 22; Springer: Cham, Switzerland, 2019; pp. 202–210.
14. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793.
15. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
16. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499.
17. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
18. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705.
19. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209.
20. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3164–3173.
21. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 1951–1960.
22. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
23. Wu, P.; Gu, L.; Yan, X.; Xie, H.; Wang, F.L.; Cheng, G.; Wei, M. Pv-rcnn++: Semantical point-voxel feature interaction for 3d object detection. Vis. Comput. 2022, 39, 2425–2440.
24. Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 996–997.
25. Li, Z.; Gao, Y.; Hong, Q.; Du, Y.; Serikawa, S.; Zhang, L. Keypoint3d: Keypoint-based and anchor-free 3d object detection for autonomous driving with monocular vision. Remote Sens. 2023, 15, 1210.
26. Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
27. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790.
28. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8.
29. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer: Cham, Switzerland, 2020; pp. 720–736.
30. Pang, S.; Morris, D.; Radha, H. Clocs: Camera-lidar object candidates fusion for 3d object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393.
31. Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse fuse dense: Towards high quality 3d detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427.
32. Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-equivariant 3d object detection for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2795–2802.
33. Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 22014–22024.
34. Xu, X.; Dong, S.; Xu, T.; Ding, L.; Wang, J.; Jiang, P.; Song, L.; Li, J. Fusionrcnn: Lidar-camera fusion for two-stage 3d object detection. Remote Sens. 2023, 15, 1839.
35. Xiang, X.; Zhang, J. Fusionvit: Hierarchical 3d object detection via lidar—Camera vision transformer fusion. arXiv 2023, arXiv:2311.03620.
36. Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.-G. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652.
37. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
38. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 9308–9316.
39. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. SemDeepLabV3: Semantic segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
40. OpenPCDet Development Team. OpenPCDet: An Open-Source Toolbox for 3D Object Detection from Point Clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 20 November 2024).
41. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
42. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
43. Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One million scenes for autonomous driving: Once dataset. arXiv 2021, arXiv:2106.11037.
44. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Part-A^2: Part-aware and aggregation neural network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
45. Wang, Y.; Chen, X.; Ma, Y. BtcDet: A bidirectional temporal context network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
46. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927.
47. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. F-ConvNet: Feature fusion convolutional network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 641–650.
48. Li, Y.; Chen, X.; Ma, Y. CasA: A context-aware sparse attention network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
49. Wang, Y.; Chen, X.; Ma, Y. M3DETR: Multi-modal 3D Detection Transformer for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
50. Chen, Y.; Zhang, X.; Sun, J. PI-RCNN: An Efficient Multi-sensor Fusion Framework for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
51. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437.
52. Xie, S.; Gu, J.; Guo, D.; Qi, C.R.; Guibas, L.; Litany, O. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16; Springer: Cham, Switzerland, 2020; pp. 574–591.
53. Zhang, Z.; Girdhar, R.; Joulin, A.; Misra, I. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10252–10263.
54. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924.

Figure 1. Visualization of LiDAR and image feature disparities. (a) LiDAR point cloud: While the LiDAR data enable the precise spatial localization of objects, their sparse nature results in incomplete feature representation, particularly for distant objects. The distant object within the red circle lacks a complete structure due to the sparsity of the point cloud. (b) Image representation: The corresponding RGB image captures dense semantic details (e.g., texture and shape boundaries), enabling comprehensive object characterization. The object within the blue circle lacks complete semantics in the image due to occlusion. (c) Projection analysis: Directly aligning LiDAR points with the image reveals a critical misalignment. The proximal regions show a moderate point density, whereas distant areas exhibit severe sparsity.

Figure 2. The overall framework of our proposed MDFusion. A voxel encoder converts the point cloud into voxels, followed by the extraction of voxel features using a 3D backbone network. The C1 voxel features are integrated with semantic information from images in the 3D-SLF module. In the 2D-SLF module, image features are fused with LiDAR BEV features directly. The detection results are obtained by the detection head through a region proposal network (RPN).

Figure 3. Semantic segmentation network. k: convolutional kernel size, d: convolutional kernel dilation rate.

Figure 4. The fusion process of 3D semantic-level fusion. Initially, the voxel features (C1) are subjected to feature alignment, after which an inverse operation is applied to the processed voxel features. Subsequently, the aligned voxel features are integrated with the image features through a summation-based fusion approach.

Figure 5. The fusion process of 2D spatial-level fusion. The image features and BEV features are concatenated, and spatially adaptive alignment is then applied to integrate them, generating the spatially fused feature.

Figure 6. (a–c) Precision–recall (PR) curves for the 3D and BEV metrics of our proposed method on the KITTI test dataset for the three object categories. The curve colors indicate the detection difficulty: purple for the easy level, green for the moderate level, and blue for the hard level.

Figure 7. Qualitative comparison between FocalConv [51] and our MDFusion on the KITTI dataset. Sub-image (a) depicts the original image data, sub-image (b) illustrates the ground truth, sub-image (c) presents the FocalConv predictions, and sub-image (d) presents the MDFusion predictions. In this figure, red denotes bicycles, green signifies cars, and blue indicates pedestrians.

Figure 8. Qualitative comparison between FocalConv and our MDFusion on the KITTI dataset. Sub-image (a) presents the MDFusion predictions and sub-image (b) presents the FocalConv predictions. In this figure, red denotes bicycles, green signifies cars, and blue indicates pedestrians.

Figure 9. Qualitative comparison between FocalConv and our MDFusion on the KITTI dataset. Sub-image (a) presents the MDFusion predictions; in the red circle, an obstructed car is not detected by MDFusion. Sub-image (b) presents the FocalConv predictions; in the red circle, a distant car and a severely obstructed car are not detected by FocalConv. In this figure, green signifies cars.

Figure 10. Visualization of detection in LiDAR. The green boxes represent the ground-truth boxes of the cars, the yellow boxes represent cyclists, and the red boxes represent the predicted boxes of the cars, pedestrians, and cyclists.

Figure 11. (a–c) Visualization results of dense vehicle object detection across three scenarios. Green boxes indicate ground truth, while blue boxes denote predicted values.

Figure 12. Comparison of the 3D detection performance on the “vehicle” category at different distances in the ONCE dataset.

Figure 13. Feature intensity maps before and after the 3D semantic-level fusion module. (a) shows the feature intensity map before fusion. (b) presents the feature intensity map after fusion. (c) displays the feature intensity map with PCA dimensionality reduction before fusion, and (d) shows the feature intensity map with PCA dimensionality reduction after fusion. The area within the red circle represents the region where the feature intensity changes remarkably before and after the fusion.

Figure 14. Feature heatmaps before and after the 2D spatial fusion module. (a) represents the heatmap before fusion, and (b) represents the heatmap after fusion.

Table 1. Comparative analysis of the merits and limitations of common sensors: LiDAR, monocular camera, and stereo camera.

Sensor Type	Merits	Limitations
LiDAR	High resolution, light-independent, rich 3D information	Poor performance in nighttime, rainy, and foggy scenes
Monocular Camera	Fine-grained object details	Susceptible to lighting variations and occlusion, lacks direct depth perception
Stereo Camera	Rich fine-grained object details with depth acquisition	Vulnerable to changes in lighting and occlusion

Table 2. Performance evaluation of 3D detection models on the KITTI validation set, with results presented as AP at a 0.7 IoU threshold, evaluated across 40 recall positions. Methods are categorized into two groups: ‘L’ for approaches using LiDAR data only and ‘L+C’ for techniques combining LiDAR point clouds with camera imagery to enhance the detection accuracy.

Method	Modality	mAP	Car-R40 Easy	Car-R40 Mod.	Car-R40 Hard	Cyclist-R40 Easy	Cyclist-R40 Mod.	Cyclist-R40 Hard	Pedestrian-R40 Easy	Pedestrian-R40 Mod.	Pedestrian-R40 Hard
PointRCNN [11]	L	69.96	86.99	76.09	73.12	85.32	70.98	66.59	65.03	56.75	48.78
Part-A2 [44]	L	63.99	87.81	78.49	73.51	79.17	63.52	56.93	53.10	43.35	40.06
STD [21]	L	63.60	87.95	79.71	75.09	78.69	61.59	55.30	53.29	42.47	38.35
BtcDet [45]	L	65.96	90.64	82.86	78.09	82.81	68.68	61.81	47.80	41.63	39.30
PV-RCNN [22]	L	70.67	91.73	82.55	80.06	88.71	71.27	66.44	58.50	50.72	46.06
F-PointNet [46]	L+C	57.85	82.19	69.79	60.59	72.27	56.12	49.01	50.53	42.15	38.08
AVOD-FPN [28]	L+C	56.84	83.07	71.76	65.73	50.46	42.27	39.04	63.76	50.55	44.93
F-ConvNet [47]	L+C	63.15	87.36	76.39	66.69	81.98	65.07	56.54	52.16	43.38	38.80
EPNet [5]	L+C	70.96	88.76	78.65	78.32	83.88	65.50	62.70	66.74	59.29	54.82
CasA [48]	L+C	69.77	91.58	83.06	80.08	87.91	73.47	66.17	54.04	47.09	44.56
MDFusion (Ours)	L+C	75.76	92.96	86.03	83.72	91.61	74.20	69.73	68.53	60.89	54.24
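The R40 and R11 protocols named in the table captions differ only in the recall positions at which precision is sampled. The following is a minimal sketch, assuming the usual KITTI conventions (R40 samples recall at 1/40, 2/40, ..., 1, while R11 samples at 0, 0.1, ..., 1). The precision-recall points are hypothetical, and this is not the official evaluation code.

```python
def interpolated_ap(recalls, precisions, num_points=40):
    """Average interpolated precision over fixed recall positions.

    recalls, precisions: parallel lists describing a precision-recall curve.
    num_points=40 samples recall at 1/40, ..., 1 (R40);
    num_points=11 samples recall at 0, 0.1, ..., 1 (R11).
    """
    if num_points == 11:
        positions = [i / 10 for i in range(11)]
    else:
        positions = [(i + 1) / num_points for i in range(num_points)]

    total = 0.0
    for r in positions:
        # Interpolated precision: best precision at any recall >= r
        # (0 if the detector never reaches recall r).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r - 1e-12]
        total += max(candidates, default=0.0)
    return 100.0 * total / len(positions)


# Hypothetical precision-recall points, for illustration only.
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [1.0, 0.9, 0.8, 0.6]
ap_r40 = interpolated_ap(recalls, precisions, num_points=40)
ap_r11 = interpolated_ap(recalls, precisions, num_points=11)
```

On the same curve the two protocols generally give different numbers, because R11 includes the recall-0 position, so R11 and R40 results should not be compared directly.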

Table 3. Evaluation of model performance on the KITTI benchmark for the car and cyclist categories, using AP metrics across 11 recall points at an IoU threshold of 0.7. Models labeled as ‘L’ use only LiDAR data, while those marked as ‘L+C’ combine both LiDAR point clouds and camera imagery for improved detection.

Method	Modality	Car-R11 Easy	Car-R11 Mod.	Car-R11 Hard	Cyclist-R11 Easy	Cyclist-R11 Mod.	Cyclist-R11 Hard
PointRCNN [11]	L	86.96	75.64	70.70	74.96	58.82	52.53
PV-RCNN [22]	L	90.25	81.43	76.82	78.60	63.71	57.65
Part-A2 [44]	L	87.81	78.49	73.51	79.17	63.52	56.93
M3DETR [49]	L	90.28	81.73	76.96	83.83	66.74	59.03
STD [21]	L	87.95	79.71	75.09	78.69	61.59	55.30
F-PointNet [46]	L+C	82.19	69.79	60.59	72.27	56.12	49.01
AVOD-FPN [28]	L+C	83.07	71.76	65.73	63.76	50.55	44.93
PI-RCNN [50]	L+C	84.37	74.82	70.03	-	-	-
EPNet [5]	L+C	89.81	79.28	74.59	-	-	-
3D-CVF [29]	L+C	89.20	80.05	73.11	-	-	-
MDFusion (Ours)	L+C	88.35	82.24	77.43	83.37	67.15	60.81

Table 4. Performance evaluation on the ONCE validation set. The Improvement row reports the gain of our proposed model over the baseline.

Method	Vehicle Overall	Vehicle 0–30 m	Vehicle 30–50 m	Vehicle 50 m–inf	Pedestrian Overall	Pedestrian 0–30 m	Pedestrian 30–50 m	Pedestrian 50 m–inf	Cyclist Overall	Cyclist 0–30 m	Cyclist 30–50 m	Cyclist 50 m–inf	mAP
Baseline (SECOND) [17]	71.19	84.04	63.02	47.25	26.44	29.33	24.05	18.05	58.04	69.96	52.43	34.61	51.89
Point Contrast [52]	71.07	83.31	64.90	49.34	22.52	23.73	21.81	16.06	56.36	68.11	50.35	34.06	49.98
Depth Contrast [53]	71.88	84.26	65.58	49.97	23.57	26.36	21.15	14.39	56.63	68.26	50.82	34.67	50.69
PV-RCNN [22]	77.77	89.39	72.55	58.64	23.50	25.61	22.84	17.27	59.37	71.66	52.58	36.17	53.55
SwAV [54]	72.71	83.68	65.91	50.10	25.13	27.77	22.77	16.36	58.05	69.99	52.23	34.86	51.96
PointRCNN [11]	52.09	74.45	40.89	16.81	4.28	6.17	2.40	0.91	29.84	46.03	20.94	5.46	28.74
PointPillars [18]	68.57	80.86	62.07	47.04	17.63	19.74	15.15	10.23	46.81	58.33	40.32	25.86	44.34
MDFusion (Ours)	73.78	84.59	68.05	51.16	31.50	34.69	29.58	18.02	59.23	71.49	52.19	37.15	54.83
Improvement	+2.59	+0.55	+5.03	+3.91	+5.06	+5.36	+5.53	−0.03	+1.19	+1.53	−0.24	+2.54	+2.94
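The Improvement row is the element-wise difference between the MDFusion row and the SECOND baseline row. The short sketch below recomputes it from the values printed in Table 4; the variable names are illustrative.

```python
# Values copied from Table 4, in column order
# (Vehicle x4, Pedestrian x4, Cyclist x4, then mAP).
baseline = [71.19, 84.04, 63.02, 47.25, 26.44, 29.33, 24.05, 18.05,
            58.04, 69.96, 52.43, 34.61, 51.89]
mdfusion = [73.78, 84.59, 68.05, 51.16, 31.50, 34.69, 29.58, 18.02,
            59.23, 71.49, 52.19, 37.15, 54.83]

# Improvement = ours - baseline, rounded to two decimals as in the table.
improvement = [round(ours - base, 2) for ours, base in zip(mdfusion, baseline)]
print(improvement)
```

The two negative entries (pedestrians beyond 50 m and cyclists at 30–50 m) show that the gain is not uniform across categories and distance ranges.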

Table 5. BEV detection AP for vehicles at easy, moderate, and hard difficulty levels on the KITTI validation set, along with the inference time of each configuration. The Improvement row reports the change of the full multi-dimension fusion model relative to the baseline. Bold numbers indicate the best value for each metric. √: the module is used.

Method	3D-Level Fusion	2D-Level Fusion	BEV Easy	BEV Mod.	BEV Hard	Runtime
Baseline			89.19	87.44	85.16	112 ms
3D-SLF	√		90.58	88.56	88.63	122 ms
2D-SLF		√	90.25	88.14	87.97	86 ms
Multi-Dimension Fusion	√	√	91.02	89.66	89.24	98 ms
Improvement	-	-	+1.83	+2.22	+4.08	−14 ms

Table 6. Performance in terms of the AP-R40 metric for different fusion stages on the KITTI validation set.

Fusion Stage	Car-R40 Easy	Car-R40 Mod.	Car-R40 Hard	Cyclist-R40 Easy	Cyclist-R40 Mod.	Cyclist-R40 Hard	Pedestrian-R40 Easy	Pedestrian-R40 Mod.	Pedestrian-R40 Hard	mAP
C1	92.93	86.09	83.74	92.63	76.24	71.62	71.54	62.47	55.59	76.98
C2	92.85	85.72	83.45	90.06	73.47	68.96	64.74	56.99	50.18	74.05
C3	92.47	83.25	80.63	83.35	66.96	63.46	53.08	43.79	37.29	67.14
C4	-	-	-	-	-	-	-	-	-	-

Table 7. Comparative performance in terms of AP-R40 metric for cyclist and pedestrian categories on the KITTI validation set, using various spatial fusion approaches. The bolded numbers are the values of the metric with the best performance.

Fusion Approach	Cyclist-R40 Easy	Cyclist-R40 Mod.	Cyclist-R40 Hard	Pedestrian-R40 Easy	Pedestrian-R40 Mod.	Pedestrian-R40 Hard	mAP
Add	85.80	72.72	69.44	68.81	60.88	55.29	68.82
Multiple	86.78	73.35	70.06	69.15	61.23	57.29	69.54
Conv	86.53	73.04	71.58	68.64	60.52	57.14	69.58
SAA+Dilated-Conv	86.48	74.16	72.07	69.50	61.51	58.15	70.31
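The first three rows of Table 7 correspond to simple operators for merging two feature maps defined on the same spatial grid. The NumPy sketch below gives a rough idea of what such baselines usually look like. The tensor shapes, the random weights, and the reading of "Conv" as channel concatenation followed by a 1×1 convolution are assumptions made for illustration, and the SAA+Dilated-Conv module itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
lidar_bev = rng.normal(size=(4, 6, 6))   # hypothetical (C, H, W) LiDAR BEV features
image_bev = rng.normal(size=(4, 6, 6))   # hypothetical image features on the same grid

fused_add = lidar_bev + image_bev        # "Add": element-wise sum
fused_mul = lidar_bev * image_bev        # "Multiple": element-wise product

# "Conv" (assumed): concatenate along channels, then mix with a 1x1 convolution,
# which is a per-pixel linear map over the channel axis.
stacked = np.concatenate([lidar_bev, image_bev], axis=0)    # (2C, H, W)
weight = rng.normal(size=(4, 8))                            # (C_out, 2C), random stand-in
fused_conv = np.einsum("oc,chw->ohw", weight, stacked)      # (C_out, H, W)
```

All three operators treat every spatial location identically, whereas a spatially adaptive module can weight locations differently.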

Table 8. Comparison of AP3D-R40 values of the model with different segmentation networks. The bolded numbers are the values of the metric with the best performance.

Method	Car-R40 Easy	Car-R40 Mod.	Car-R40 Hard	Cyclist-R40 Easy	Cyclist-R40 Mod.	Cyclist-R40 Hard	Pedestrian-R40 Easy	Pedestrian-R40 Mod.	Pedestrian-R40 Hard	mAP
HRNet	92.58	85.25	82.87	90.40	73.76	69.35	66.08	59.71	53.82	74.87
PSPNet	92.55	84.77	82.48	88.63	71.45	67.20	66.29	59.26	54.81	74.16
Ours (DeepLabV3)	92.96	86.03	83.72	91.61	74.20	69.73	68.53	60.89	54.24	75.77

Table 9. Comparison of model parameters and runtime with different methods. The bolded numbers are the values of the metric with the best performance.

Method	Parameters	Runtime
FocalConv-F	13.70 M	125 ms
PV-RCNN	13.16 M	103 ms
Ours	11.21 M	98 ms

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Qiao, R.; Yuan, H.; Guan, Z.; Zhang, W. MDFusion: Multi-Dimension Semantic–Spatial Feature Fusion for LiDAR–Camera 3D Object Detection. Remote Sens. 2025, 17, 1240. https://doi.org/10.3390/rs17071240

