In this section, we introduce Fast Transfusion, a novel approach tailored for multimodal 3D object detection. Addressing the high computational demands of the original Transfusion model, we propose three key technological advancements: QConv (Quick convolution), EH Decoder (Efficient and Hybrid Decoder), and semi-dynamic query selection.
As illustrated in Figure 2, multimodal data first undergo feature extraction through the QConv Network, which serves as the backbone. Subsequently, image features are processed via semi-dynamic query selection to complete Query Initialization. Object queries and image features are then fed into the EH Decoder, and the final prediction is obtained through a Feed-Forward Network (FFN). By integrating these innovations, Fast Transfusion achieves a favorable balance, significantly improving both inference speed and detection accuracy in multimodal 3D object detection tasks. In addition, the backbone and decoder of Fast Transfusion can be scaled using a depth multiplier and a width multiplier. Below, we first introduce the method of fusing LiDAR and camera data.
3.2. QConv Network
This paper employs the feature vectors extracted by the QConv (Quick convolution) network as inputs for the fusion process. The initial step upon receiving LiDAR and camera data is to extract the pertinent features. Processing these extracted features as opposed to the raw data is more efficient and yields superior results. Our proposed QConv network, serving as the backbone of Fast Transfusion, excels at extracting features from both LiDAR and camera data, surpassing the performance of traditional convolutional neural networks.
For image data as given in
Figure 3a, we consider their three dimensions during feature extraction: the horizontal and vertical coordinates on their plane, and their depth, i.e., the number of channels. This paper optimizes the extraction of features from both the plane and depth of the data separately. The overview of the QConv structure is shown in
Figure 3b. Deformable convolution is applied to a portion of the channels, while the remaining channels are left untouched and subsequently mixed by a pointwise convolution. The deformable convolutional operation optimizes the extraction of planar features, as shown in Figure 3c: convolution operations are no longer limited to regular squares, and each sampling position can be offset. The partial convolutional operation optimizes the extraction of depth features, as shown in Figure 3d: the dark parts are convolved, while the light parts remain unchanged. In this way, only a portion of the data is convolved, which saves considerable computation. Given that the plane and the depth are orthogonal in geometric space, the operations performed on them do not interfere with each other. This allows us to fuse the two operations in parallel within Quick convolution, achieving comprehensive optimization of feature extraction. The specific steps are given in Algorithm 1.
For the extraction of depth features, this paper uses the partial convolutional operation to speed up the process by leveraging the feature maps’ redundancy. The feature maps exhibit considerable redundancy across different channels, a phenomenon widely acknowledged but underutilized in existing literature. To address this efficiently, we introduce a streamlined partial convolution approach designed to simultaneously reduce computational redundancy and minimize memory access. As
Figure 3 shows, our method employs a standard convolutional operation on a subset of the input channels to extract spatial features, while leaving the rest untouched. For efficient memory access, either the first or the last $d_p$ consecutive channels are used as representatives of the entire feature map in the computation.
Algorithm 1 Quick convolution (QConv)
Input: image data $I$
Output: feature map $F$
1: Select the first $d_p$ channels of the input data
2: Apply the convolution kernels to them
3: for each unit $f$ on $F$ do
4:  Compute the offsets for the convolution operations
5:  if there is no offset at this position $f$ then
6:   Perform the convolution operation as usual
7:  else
8:   Use a regular grid $R$ over the input
9:   Add the offset to $f$
10:  Calculate the back propagation of gradients via the bilinear operations
11: end if
12: end for
13: Leave the rest of the input channels untouched
14: Concatenate the convolved and untouched channels
15: return the feature map $F$
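To make these steps concrete, the following PyTorch sketch shows one way to realize a QConv block as we read Algorithm 1 and Figure 3: a deformable convolution over the first $d_p$ channels, the remaining channels left untouched, and a pointwise convolution mixing all channels afterwards. The class, the partial ratio, and the way the offsets are predicted are our assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch (assumptions: partial ratio 1/4, offsets predicted from the
# partial slice itself); not the authors' implementation.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class QConvBlock(nn.Module):
    def __init__(self, channels: int, partial_ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.d_p = max(1, int(channels * partial_ratio))        # channels that get convolved
        # one (dx, dy) offset pair per kernel position, predicted from the partial slice
        self.offset_conv = nn.Conv2d(self.d_p, 2 * k * k, k, padding=k // 2)
        # deformable convolution applied only to the partial channels
        self.deform_conv = DeformConv2d(self.d_p, self.d_p, k, padding=k // 2)
        # pointwise convolution lets information flow through all channels afterwards
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_p, x_rest = x[:, :self.d_p], x[:, self.d_p:]          # split partial vs untouched
        offsets = self.offset_conv(x_p)                         # learned sampling offsets
        x_p = self.deform_conv(x_p, offsets)                    # convolve only the partial slice
        x = torch.cat([x_p, x_rest], dim=1)                     # keep the remaining channels as-is
        return self.pointwise(x)                                # "T-shaped" effective receptive field


# Example: a 64-channel feature map; only 16 channels go through the deformable conv.
feat = torch.randn(2, 64, 32, 32)
print(QConvBlock(64)(feat).shape)                               # torch.Size([2, 64, 32, 32])
```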
Without loss of generality, we maintain consistency by ensuring that the input and output feature maps contain an equal number of channels $d$ and that the input and output have the same spatial size. For an input $I \in \mathbb{R}^{d \times h \times w}$, our QConv applies $d$ filters of size $k \times k$ to compute the output $O \in \mathbb{R}^{d \times h \times w}$. When we convolve only a portion of the channels, namely $d_p$ of them, the FLOPs are only $h \times w \times k^2 \times d_p^2$. Therefore, with a typical partial ratio $r = d_p/d = \tfrac{1}{4}$, the FLOPs of a partial convolution are only $\tfrac{1}{16}$ of those of a regular Conv. Note that the method keeps the remaining channels untouched instead of removing them from the feature maps, because they are useful for the subsequent pointwise convolution layer, which allows feature information to flow through all channels. Our architecture integrates a partial convolution layer with a pointwise convolution layer, synergistically enhancing feature extraction efficiency. Their combined effective receptive field on the input feature maps resembles a T-shaped Conv, which focuses more on the center position than a regular Conv that processes a patch uniformly. Despite the reduction in computation, the performance is nearly equivalent to that of a regular convolution, because the center position is most frequently the salient position among the filters. In other words, the center position weighs more than its surrounding neighbors.
Finally, we delineate how QConv accelerates inference during the extraction of depth features. The FLOPs of a QConv are only $h \times w \times k^2 \times d_p^2$, while the FLOPs of a regular Conv are $h \times w \times k^2 \times d^2$. Under normal circumstances, $d$ is greater than $d_p$. For example, when $d_p = \tfrac{d}{4}$ and the overhead of the offset computation is negligible, we take the corresponding factor as 1 as usual; theoretically, the FLOPs of QConv are then only $\tfrac{1}{16}$ of the FLOPs of the regular Conv. This means that our QConv is capable of running multiple times faster than regular convolution.
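As a quick sanity check of this ratio, the short snippet below (our own arithmetic, with illustrative tensor sizes) compares the FLOPs of a partial convolution over $d_p = d/4$ channels with those of a regular convolution.

```python
# Back-of-the-envelope FLOPs comparison for the partial convolution in QConv,
# assuming equal input/output channels and spatial size (illustrative numbers).
def conv_flops(h, w, k, channels):
    """FLOPs of a k x k convolution with `channels` input and output channels."""
    return h * w * k * k * channels * channels

h, w, k, d = 80, 80, 3, 256
d_p = d // 4                              # typical partial ratio r = 1/4
regular = conv_flops(h, w, k, d)          # h * w * k^2 * d^2
partial = conv_flops(h, w, k, d_p)        # h * w * k^2 * d_p^2
print(partial / regular)                  # 0.0625, i.e. 1/16 of a regular Conv
```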
For the extraction of planar features, the method employs the deformable convolutional operation to optimize our convolution and pooling processes. The prior implementation of the partial convolutional operation ensures our inference speed, while this adjustment guarantees the precision of our recognition and detection tasks. Given the variance in object scales or deformations at different locations, an adaptive approach to determining scales or receptive field sizes is essential for achieving accurate visual recognition coupled with precise localization. What is more, both modules exhibit lightweight characteristics, introducing a minimal number of parameters and computational overhead for offset learning. They can seamlessly substitute their conventional counterparts in deep CNNs and are amenable to end-to-end training using standard back propagation techniques.
Subsequently, we explicate the mechanism by which QConv facilitates free-form deformation of the sampling grid during the extraction of planar features. For QConv, the regular grid $R$ is augmented with offsets $\{\Delta f_n \mid n = 1, \ldots, k^2\}$, whereas a regular Conv samples with the regular grid $R$ over the input feature map $I$. As depicted in Figure 3, the offsets are derived by applying a convolutional layer on the same input feature map. This convolution kernel has the same spatial resolution and dilation as the original convolutional layer, and the resulting offset fields match the spatial resolution of the input feature map. Throughout the training phase, the convolutional kernels tasked with output feature generation and the offsets are learned concurrently. The offsets are learned by back-propagating gradients through the bilinear sampling operations as follows:

$$O(f) = \sum_{f_n \in R} w(f_n)\, I(f + f_n + \Delta f_n), \qquad R = \{(0,0), (0,1), \ldots, (k-1, k-1)\},$$

where $k$ is the convolution kernel size and $f$ is the top-left corner pixel. The principle of deformable pooling parallels that of deformable convolution, entailing a positional displacement, and is therefore not reiterated for brevity.
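For completeness, the bilinear kernel commonly used to evaluate $I$ at the fractional locations above (following standard deformable convolution) can be written as below; the notation is ours and is included only to make the gradient path explicit.

```latex
% Bilinear sampling at the fractional location p = f + f_n + \Delta f_n;
% the gradient with respect to the offset is obtained by differentiating G.
I(p) = \sum_{q} G(q, p)\, I(q), \qquad
G(q, p) = \max\bigl(0,\, 1 - \lvert q_x - p_x \rvert\bigr)\cdot
          \max\bigl(0,\, 1 - \lvert q_y - p_y \rvert\bigr),
```

where $q$ enumerates the integral spatial locations of $I$.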
Next, we integrate these two components into a unified framework. As the partial operation and the deformable operation occupy orthogonal geometric spaces without interference, their integration into QConv merely requires a simple superposition. Specifically, in contrast to a regular convolution, our QConv applies the deformable convolution only to the partial channels of the feature map. One might question whether this combination of techniques keeps the computational cost of QConv below that of regular convolution. The answer is affirmative. Although the deformable operation slightly increases the computational demand, it remains lightweight, involving only additional additive operations. For instance, with a convolution kernel size of $3 \times 3$, besides the original 9 multiplications and 9 additions, there are only 9 extra additions. Notably, the computational cost of a multiplication significantly exceeds that of an addition. Thus, the theoretical FLOPs of QConv carry an extra factor $b$, a value close to 1, representing the manageable computational burden introduced by the deformable operation. In total, the FLOPs of QConv are as follows:

$$\mathrm{FLOPs}_{\mathrm{QConv}} = b \times h \times w \times k^2 \times d_p^2.$$
Furthermore, not only theoretically but also through experimental validation presented later, the QConv network demonstrates superior performance compared to other models’ backbones.
3.3. EH Decoder
Although the Transformer decoder has demonstrated commendable performance in object detection, its substantial computational demands curtail practical deployment and hinder the full realization of its advantages, such as the elimination of post-processing steps like non-maximum suppression. Specifically, incorporating multiscale features, while beneficial for hastening training convergence and enhancing performance, also lengthens the sequence fed into the decoder. Because the accuracy gain from multiscale features is substantial, we cannot simply sacrifice it for inference speed; the experiments presented later also verify the necessity of this mechanism. Consequently, the multiscale feature fusion of the transformer decoder emerges as the computational bottleneck within the model.
In response, this paper introduces the Enhanced Hybrid (EH) decoder as a substitute for the traditional transformer decoder. This novel approach separates the interactions within the same scale from the fusion across different scales, thereby enabling the efficient processing of multiscale features. Analyzing the computational redundancies inherent in the multiscale transformer decoder makes it evident that handling intra-scale and cross-scale features concurrently is computationally onerous. Given that high-level features, rich in semantic content, are derived from lower-level features, performing feature interaction over concatenated multiscale features is largely redundant. Thus, by disentangling the multiscale feature interaction into distinct phases of intra-scale interaction and cross-scale fusion, we significantly diminish the computational overhead while enhancing the decoder's efficacy.
This paper proposes a reevaluation of the decoder's architecture. As depicted in
Figure 4, the redesigned decoder is composed of two key modules: the Attention-based Intra-scale feature Interaction Module (AIM) and the CNN-based Cross-scale Feature Fusion Module (CCM). The AIM, refining the previous decoder's approach, facilitates intra-scale interaction only within the last three layers of QConv. This design posits that applying self-attention to high-level features, imbued with dense semantic information, captures the relationships among conceptual entities within images, thereby aiding subsequent modules in object detection and recognition. Conversely, intra-scale interactions among lower-level features are deemed unnecessary owing to their semantic paucity and the potential for overlap and confusion with high-level feature interactions. The CCM, also an advancement over the previous decoder, incorporates multiple fusion blocks, consisting of convolutional layers, into the fusion pathway. The specific fusion steps are shown in Algorithm 2. The primary function of these fusion blocks is to amalgamate adjacent features into a cohesive new feature, thereby streamlining subsequent feature processing.
Algorithm 2 Fusion algorithm
Input: feature vectors $A$, $B$
Output: fusion vector $F$
1: Concatenate $A$ and $B$ to obtain $C$
2: Apply a 1 × 1 Conv to $C$
3: for each RepBlock do
4:  Branch 1: calculate by a 1 × 1 Conv and BN
5:  Branch 2: calculate by a 3 × 3 Conv and BN
6:  Branch 3: calculate by BN
7:  Sum the outputs of the three branches
8:  Apply ReLU to the sum
9: end for
10: Concatenate the resulting features
11: return the fusion vector $F$
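The sketch below illustrates how one fusion block of the CCM could be assembled from Algorithm 2: two adjacent-scale features are concatenated, reduced by a 1 × 1 convolution, and passed through RepVGG-style blocks whose three branches (1 × 1 Conv + BN, 3 × 3 Conv + BN, and a BN identity branch) are summed before ReLU. Layer names and the number of blocks are our assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a CCM fusion block following Algorithm 2 (assumes the two
# inputs already share the same spatial size and channel count).
import torch
import torch.nn as nn


class RepBranchBlock(nn.Module):
    """Three parallel branches summed before ReLU (Algorithm 2, steps 4-8)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.branch3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.branch1(x) + self.branch3(x) + self.identity(x))


class FusionBlock(nn.Module):
    """Fuses two adjacent-scale features into one new feature."""
    def __init__(self, channels: int, num_blocks: int = 3):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[RepBranchBlock(channels) for _ in range(num_blocks)])

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        c = self.reduce(torch.cat([a, b], dim=1))    # steps 1-2: concatenate, then 1 x 1 Conv
        return self.blocks(c)                        # steps 3-9: repeated three-branch blocks


a, b = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20)
print(FusionBlock(256)(a, b).shape)                  # torch.Size([1, 256, 20, 20])
```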
3.4. Semi-Dynamic Query Selection
The concept of Object Query is a pivotal element within the Transformer framework for object detection, representing a vector that denotes the predicted bounding boxes for detected targets. These vectors encapsulate a fusion of content and positional information, where the content information serves to distinguish between different targets, and the positional information describes the location of the targets within the image. The design of Object Query stands as a significant contribution of the Transformer, addressing the issue present in traditional object detection methods that necessitate the pre-definition of anchor boxes. In conventional object detection approaches, the size and position of anchor boxes are pre-specified, potentially resulting in the inadequate coverage of certain targets. By employing Object Query vectors in lieu of anchor boxes, the Transformer facilitates object detection without relying on pre-defined anchor boxes, thereby better accommodating targets of varying sizes and shapes.
Given that the Object Query comprises both content and positional information for detected targets, its initialization, or query selection, is of paramount importance. Currently, there are two primary methods of query selection. Static query selection is learned from the dataset, is independent of the input image, and remains fixed during inference, hence the name. However, because static methods do not fully leverage the information in the input image, dynamic query selection has been proposed: it uses dense features extracted from the input image to predict categories and bounding boxes, thereby initializing the content and position of the Object Query. Nevertheless, dynamic query selection has its own shortcoming: the dynamically obtained content query may not be optimal for detection, as these features are unrefined and may harbor ambiguities. For instance, for the “person” category, the selected feature may cover only a portion of the person or objects surrounding the person, rendering it less precise than the static method. Conversely, the position query tends to be more accurate. In light of these considerations, this paper advocates semi-dynamic query selection, wherein the static part is employed for content query selection, while the dynamic part is utilized for position query selection. In this way, we make full use of the advantages of both, yielding improved detection performance.
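A hedged sketch of how semi-dynamic query selection could be implemented is given below: the content part of each object query is a learned (static) embedding, while the position part is taken dynamically from the boxes predicted for the top-K most confident encoded image features. The head names, shapes, and the 4-dimensional box parameterization are illustrative assumptions.

```python
# Sketch of semi-dynamic query selection: static content queries, dynamic
# position queries selected from the encoded image features (illustrative only).
import torch
import torch.nn as nn


class SemiDynamicQuerySelection(nn.Module):
    def __init__(self, num_queries: int, embed_dim: int):
        super().__init__()
        self.content_query = nn.Embedding(num_queries, embed_dim)  # static content part
        self.score_head = nn.Linear(embed_dim, 1)                  # per-token confidence
        self.box_head = nn.Linear(embed_dim, 4)                    # per-token box proposal

    def forward(self, feats: torch.Tensor):
        """feats: (batch, num_tokens, embed_dim) flattened image features."""
        k = self.content_query.num_embeddings
        scores = self.score_head(feats).squeeze(-1)                       # (B, N)
        topk = scores.topk(k, dim=1).indices                              # K most confident tokens
        picked = torch.gather(feats, 1,
                              topk.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        pos_query = self.box_head(picked).sigmoid()                       # dynamic position part
        content_query = self.content_query.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        return content_query, pos_query                                   # fed to the EH Decoder


feats = torch.randn(2, 1200, 256)
content, pos = SemiDynamicQuerySelection(300, 256)(feats)
print(content.shape, pos.shape)   # torch.Size([2, 300, 256]) torch.Size([2, 300, 4])
```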
3.5. Scaled Fast Transfusion
To provide a scalable version of Fast Transfusion, this paper simultaneously scales the QConv and the Enhanced Hybrid (EH) decoder using a depth multiplier and a width multiplier. This adjustment results in two variants of Fast Transfusion, differentiated by their parameter counts and frames per second. For QConv, the design of the dynamic network structures enables the adaptation of the network’s depth during both training and inference processes, according to the characteristics of the input data or the requirements of the task at hand. This approach typically employs conditional logic or learning strategies to dynamically determine the engagement of each layer or module. Additionally, network pruning techniques can be leveraged to reduce the complexity and computational demands of the model while maintaining or enhancing its performance.
For the EH decoder, we modulate the depth and width multipliers by altering the number of RepBlocks in the Cross-scale Feature Fusion Module (CCM) and the embedding dimension of the decoder, respectively. Variable-structure decoders allow for the dynamic modification of the decoder’s architecture, enabling the adjustment of the number of parameters or modules in response to the needs of diverse tasks or data attributes. This methodology involves the dynamic addition or removal of neurons, layers, or modules, which is achieved by constraining the number of parameters or the dimension of the encoded space. Such constraints enable the decoder to adaptively adjust within different data distributions or feature spaces. By introducing regularization or sparsity constraints, the decoder is incentivized to exhibit robust performance across various representations of input data, while maintaining model simplicity and generalization capabilities. It is noteworthy that our scaled versions of Fast Transfusion preserve a uniform decoder architecture, which aids in the knowledge distillation from high-precision, large-scale DETR models to lighter detectors. This aspect presents a promising avenue for future exploration.
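As an illustration of how such scaling could be parameterized (the multiplier values and base sizes below are hypothetical, not the configurations reported in this paper), a depth multiplier can control the number of RepBlocks in the CCM while a width multiplier controls the decoder embedding dimension:

```python
# Hypothetical scaling configuration for Fast Transfusion variants.
from dataclasses import dataclass


@dataclass
class FastTransfusionConfig:
    base_repblocks: int = 3        # RepBlocks per CCM fusion block
    base_embed_dim: int = 256      # decoder embedding dimension
    depth_multiplier: float = 1.0
    width_multiplier: float = 1.0

    @property
    def num_repblocks(self) -> int:
        return max(1, round(self.base_repblocks * self.depth_multiplier))

    @property
    def embed_dim(self) -> int:
        # keep the width divisible by 32 so attention heads split evenly
        return max(32, int(self.base_embed_dim * self.width_multiplier) // 32 * 32)


small = FastTransfusionConfig(depth_multiplier=0.5, width_multiplier=0.5)
large = FastTransfusionConfig()
print(small.num_repblocks, small.embed_dim)   # 2 128
print(large.num_repblocks, large.embed_dim)   # 3 256
```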