To validate the accuracy of the proposed model, we conducted a series of experiments. This section outlines the experimental setup, methods, and evaluation metrics used to assess the performance of our object detection model; the datasets, implementation details, and ablation studies are described in detail to ensure clarity and reproducibility of results.
3.1. Datasets
This paper selects the PASCAL VOC 2007 and MS COCO 2017 datasets for model training and validation. These datasets are chosen for their comprehensive annotation details, including object bounding boxes and class labels, which facilitate precise object detection studies. The MS COCO 2017 dataset, in particular, contains a large number of small objects and occlusions, making it suitable for validating model performance in scenarios involving small objects and occlusions.
For PASCAL VOC, the training data comprise 16,551 images spanning 20 categories (e.g., airplane, bicycle, bird, boat, and bottle), and the PASCAL VOC 2007 test set of 4952 images was used for performance assessment. Additionally, we used the MS COCO 2017 dataset to validate our method further: during development, the 2017 training set was used to train the algorithm and the 2017 validation set was used for hyperparameter selection and validation, and the final comparison with state-of-the-art methods was also conducted on MS COCO 2017.
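For readers reproducing this setup, the sketch below shows one way the two benchmarks can be loaded with torchvision; the directory paths and `download` flags are illustrative assumptions rather than the authors' exact data pipeline.

```python
from torchvision import datasets

# Illustrative paths; the exact splits follow the description above.
voc_trainval = datasets.VOCDetection("data/voc", year="2007",
                                     image_set="trainval", download=True)
voc_test = datasets.VOCDetection("data/voc", year="2007",
                                 image_set="test", download=True)

coco_train = datasets.CocoDetection("data/coco/train2017",
                                    "data/coco/annotations/instances_train2017.json")
coco_val = datasets.CocoDetection("data/coco/val2017",
                                  "data/coco/annotations/instances_val2017.json")
```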
In terms of evaluation metrics, we primarily used the mean average precision (mAP) and frames per second (FPS) to measure the performance of the object detection task. These metrics not only reflect the accuracy of the algorithm but also demonstrate its processing speed in practical applications, thereby providing a comprehensive assessment of the algorithm’s usability and efficiency. Through these rigorous evaluations, we demonstrated the effectiveness and superiority of the proposed algorithm.
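As a concrete illustration of the FPS metric, the following minimal sketch times repeated single-image forward passes on the GPU; the function name, warm-up count, and iteration count are our own assumptions, not the authors' measurement protocol, and mAP would typically be computed with the standard VOC/COCO evaluation tools.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, image_size=512, warmup=10, iters=100, device="cuda"):
    """Estimate FPS as the average single-image inference rate."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, image_size, image_size, device=device)
    for _ in range(warmup):            # warm up CUDA kernels and caches
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(dummy)
    torch.cuda.synchronize()           # wait for all GPU work before stopping the clock
    return iters / (time.time() - start)
```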
3.2. Implementation Details
This work is implemented in PyTorch. All experiments were conducted on a PC equipped with an Intel Core i7-12700K CPU and an Nvidia RTX 3090 GPU.
In this study, we selected the lightweight MobileNetV3 as the backbone network for feature extraction; it was pre-trained and then fine-tuned on the dataset, with detailed parameter configurations presented in Table 1. During the experiments, the input image resolution was set to 512 × 512 and the batch size to 16. We initially used the Adam optimizer with an initial learning rate of 0.004; as training progressed, we switched to the SGD optimizer to adjust the learning rate dynamically for more refined model tuning. This strategy was intended to balance training efficiency and model performance.
The primary reason for choosing MobileNetV3 as the backbone network for feature extraction was its outstanding performance and efficient computational characteristics. MobileNetV3 combines lightweight depthwise separable convolutions and the SE channel attention mechanism. This combination significantly reduces the model’s parameter count and computational complexity. At the same time, it enhances the expressiveness of features and the model’s sensitivity to small-sized targets. Furthermore, the structural optimizations of MobileNetV3 make it particularly suitable for mobile devices and edge computing scenarios.
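To make the two building blocks mentioned above concrete, the sketch below shows a simplified depthwise separable convolution combined with an SE channel attention module; it uses plain ReLU and sigmoid rather than MobileNetV3's hard-swish/hard-sigmoid activations, and the class names and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE channel attention: global pooling -> two 1x1 convs -> channel-wise scaling."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                    # squeeze: global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excite: per-channel weights
        return x * s                                            # reweight the input channels

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.se = SqueezeExcite(out_ch)

    def forward(self, x):
        return self.se(torch.relu(self.bn(self.pointwise(self.depthwise(x)))))
```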
During the training process, we initially chose to use the Adam optimizer because it can achieve convergence faster than traditional SGD optimizers, especially when dealing with complex non-convex optimization issues. The Adam optimizer calculates first-order moment estimates and second-order moment estimates of gradients. Such calculations allow for the adaptive adjustment of learning rates for different parameters, making the training process more stable and efficient. As training progressed, to refine the network’s ability to fit the data, we switched to using the SGD optimizer. SGD makes more precise learning rate adjustments in the later stages of training, which helps the model achieve a better local optimum and reduces the risk of overfitting. Consequently, the model achieves better performance on the validation and test sets.
Furthermore, fixing the input image resolution at 512 × 512 ensures that the details of the images are fully utilized during feature extraction. Setting the batch size to 16 aims to maximize the use of GPU resources while ensuring computational efficiency. This setup helps the model better capture details and features when processing high-resolution images, especially in images where the targets are small or the scenes are complex.
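A minimal training-loop sketch of the optimizer-switching strategy described above follows; the switch epoch, total epochs, SGD learning rate, and momentum are illustrative assumptions (the paper specifies only the batch size, input resolution, and initial Adam learning rate), and `model`, `train_dataset`, and `criterion` stand in for components defined elsewhere.

```python
import torch
from torch.utils.data import DataLoader

# `model`, `train_dataset`, and `criterion` are assumed to be defined elsewhere.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)   # 512x512 inputs, batch size 16

optimizer = torch.optim.Adam(model.parameters(), lr=0.004)        # early phase: Adam, lr = 0.004
switch_epoch, total_epochs = 60, 120                              # illustrative schedule

for epoch in range(total_epochs):
    if epoch == switch_epoch:
        # Later phase: hand over to SGD for finer, more stable learning-rate control.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```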
The specific parameter settings of MobileNetV3 are shown in Table 1.
3.3. Ablation Experiments
To clarify the contributions of the various components of the proposed model to detecting small and occluded objects, we performed extensive ablation studies.
3.3.1. Effectiveness of Bottleneck with Separable Convolution Skip Connection
To validate the effectiveness of the separable CenterNet detection network based on MobileNetV3 proposed in this study, we conducted detailed ablation experiments on the PASCAL VOC dataset. The baseline was the traditional CenterNet network, and the ablation experiments were divided into six groups, configured as follows: group 1 used DLA-34 as the backbone of the traditional CenterNet; group 2 used Hourglass; group 3 employed ResNet-18 with deconvolution; group 4 introduced MobileNetV3 as the backbone; group 5 further added the DBi-FPN on top of MobileNetV3; and the final group additionally integrated the smooth loss function. In the experimental labels, B, v3, and FPN* denote the original CenterNet, the introduced MobileNetV3, and the DBi-FPN, respectively, and the remaining label denotes the smooth loss function. The results of the ablation experiments are shown in Table 2.
From the results in Table 2, we observe variations in base model performance across backbone networks. Models 1, 2, and 3, which use B-DLA34, B-Hourglass, and B-ResNet-18 as backbones, achieve mAP@0.5 scores of 80.7%, 81.5%, and 79.5%, respectively, with frame rates (FPS) of 33.0, 32.0, and 31.0. B-Hourglass thus has a slight advantage in accuracy, though its FPS is marginally lower than that of B-DLA34; comparatively, B-DLA34 offers a better balance between accuracy and speed. With the introduction of MobileNetV3 as the backbone (model 4), the model's mAP@0.5 increases significantly to 83.2% and its FPS rises to 52.0, indicating that MobileNetV3's lightweight design and efficient feature extraction markedly enhance detection accuracy and speed.
Building on this, the introduction of the FPN structure (model 5) further raises the mAP@0.5 to 84.9%, with FPS also increasing to 56.0. Through multi-scale feature fusion, the FPN structure effectively improves the model's detection capability across different object sizes, significantly boosting overall performance.
Finally, when the smooth loss function is integrated into the model (model 6), the mAP@0.5 reaches 85.6%. Although the FPS decreases slightly to 55.0, the overall performance gain remains significant: by regressing target box positions more precisely, this loss function further increases detection accuracy.
These experimental results demonstrate that the introduction of MobileNetV3 as the backbone significantly enhances the model’s computational efficiency and detection accuracy. The multi-scale feature fusion capability of the FPN structure enhances the model’s adaptability to targets of varying sizes. The use of the loss function further refines target box regression, enhancing detection accuracy.
The motivation for this research stems from the demand in the object detection field for efficient and high-accuracy detection models. Particularly in real-time application scenarios such as autonomous driving and video surveillance, it is crucial for models to maintain high accuracy while also possessing rapid processing capabilities. By incorporating the lightweight MobileNetV3 and DBi-FPN, this study not only optimizes feature extraction efficiency but also enhances recognition capabilities for small targets and complex scenes through structural improvements. Additionally, the application of the smooth loss function further enhances the stability of model training and the accuracy of prediction results.
Through comparative analysis, this research demonstrates the superiority of the proposed method over traditional approaches, including improved processing speed and accuracy, as well as stronger adaptability in complex environments.
3.3.2. Effectiveness of the Dual-Path Bi-FPN
To validate the effectiveness of the DBi-FPN proposed in this paper, we applied it to both the original CenterNet model and the separable CenterNet detection network based on MobileNetV3 proposed in this study, and conducted experiments on the PASCAL VOC dataset. The experimental results, shown in Table 3, indicate that models 1, 2, and 3, which use B-DLA34, B-Hourglass, and B-ResNet-18 as backbone networks, achieve mAP@0.5 scores of 80.7%, 81.5%, and 79.5%, with FPS of 33.0, 32.0, and 31.0, respectively. B-Hourglass has a slight advantage in accuracy, though its FPS is slightly lower than that of B-DLA34; in contrast, B-DLA34 offers a better balance between accuracy and speed. When we introduced MobileNetV3 as the backbone network (model 4), the model's mAP@0.5 increased significantly to 83.2% and the FPS rose to 52.0, demonstrating the distinct advantages of MobileNetV3's lightweight design and efficient feature extraction in improving detection accuracy and speed; compared with the other backbone networks, MobileNetV3 markedly increases inference speed while maintaining high accuracy. Building on this, we further introduced the FPN structure (model 5), raising the mAP@0.5 to 84.9% and the FPS to 56.0. Through multi-scale feature fusion, the FPN effectively enhances the model's detection capability across different object sizes, allowing it to better capture the details of targets of varying sizes and thereby significantly boosting detection accuracy. When we additionally introduced the smooth loss function (model 6), the mAP@0.5 reached 85.6%; although the FPS decreased slightly to 55.0, the overall performance gain was still significant. By regressing the positions of target boxes more accurately, this loss function further improves detection accuracy, indicating that a carefully designed loss function can yield higher detection accuracy while maintaining high speed.
The DBi-FPN adopted in our study combines top-down and bottom-up feature fusion mechanisms. This structure not only resolves the insufficient feature utilization caused by the unidirectional flow of information in traditional FPNs but also greatly enriches the network's learning capability and adaptability by enhancing the interaction between features at different levels. In addition, we chose the lightweight MobileNetV3 as the backbone network, further reducing the model's parameter count and computational complexity, so the model is not only highly accurate but also better suited to deployment on resource-constrained devices.
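The sketch below illustrates the general idea of a bidirectional (top-down plus bottom-up) fusion pass over a three-level feature pyramid; it is a simplified stand-in that assumes power-of-two feature map sizes and equal channel widths, not the exact DBi-FPN used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalFusion(nn.Module):
    """One bidirectional fusion pass over three levels (p3 finest ... p5 coarsest).

    Top-down: propagate semantics from coarse to fine levels.
    Bottom-up: propagate localization detail from fine back to coarse levels.
    """
    def __init__(self, channels=128):
        super().__init__()
        self.smooth = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))

    def forward(self, p3, p4, p5):
        # Top-down path: upsample the coarser map and add it to the finer one.
        td4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
        td3 = p3 + F.interpolate(td4, size=p3.shape[-2:], mode="nearest")
        # Bottom-up path: downsample the finer map and add it to the coarser one.
        bu4 = td4 + F.max_pool2d(td3, kernel_size=2)
        bu5 = p5 + F.max_pool2d(bu4, kernel_size=2)
        return [conv(x) for conv, x in zip(self.smooth, (td3, bu4, bu5))]
```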
By introducing the DBi-FPN, this paper significantly enhances the performance of the object detection network, particularly in multi-scale feature fusion and operational efficiency.
3.3.3. Configuration of Training Parameters
To comprehensively evaluate the separable CenterNet detection network algorithm based on MobileNetV3 proposed in this study, Table 4 presents comparative data between this model and the original CenterNet model in terms of backbone network parameters and overall model size. The comparison shows the significant impact of different FPN configurations on performance within the base CenterNet model. Without FPN, the base model achieves an mAP of 80.7% and an FPS of 33.0. When a standard FPN is incorporated, the mAP increases to 82.1%, but the FPS decreases slightly to 32.0. With the introduction of the improved FPN (FPN*), however, the mAP rises further to 83.3% and the FPS increases to 35.0. This demonstrates that FPN effectively enhances detection accuracy through multi-scale feature fusion; in particular, the improved FPN* not only maintains or improves detection accuracy but also increases inference speed.
In our model, different FPN configurations also show significant performance variations. The base model without FPN achieves an mAP of 83.3% and an FPS of 52.0. Compared to the base CenterNet model, our model already exhibits higher detection accuracy and faster inference speed under the same conditions. When a standard FPN is added, the mAP is raised to 84.2%, with the FPS remaining at 52.0. Introducing the improved FPN* further increases the mAP to 85.6%, with the FPS rising to 55.0. These results indicate that our model not only surpasses the traditional CenterNet model in accuracy but also offers significant advantages in speed.
A comparative analysis confirms that the foundational setup of our model capitalizes on MobileNetV3's streamlined architecture and superior feature extraction capabilities. The multi-scale feature fusion capability of FPN significantly improves detection accuracy, and the improved FPN* in particular delivers exceptional results in our model, enhancing both accuracy and inference speed. Across all configurations, our model outperforms the CenterNet model in both mAP and FPS on the VOC dataset, demonstrating that our approach achieves faster inference while maintaining high detection accuracy.
3.4. Comparison with State-of-the-Art Methods
To comprehensively validate the performance of the separable CenterNet detection network algorithm based on MobileNetV3 proposed in this study, we trained and tested it against other mainstream detection algorithms on the MS COCO 2017 dataset. Two-stage object detection algorithms, although highly accurate, are computationally heavy and less practical for real-time applications, so they were not included in this comparison. The selected comparative algorithms include anchor-based one-stage detectors, such as the YOLO series and the EfficientDet series, as well as anchor-free methods such as CornerNet and CenterNet, and DETR, which is based on self-attention mechanisms. The performance comparison results of all the algorithms are shown in Table 5.
Our comparison focuses on the models' performance in practical applications, including key metrics such as detection accuracy, speed, and model size, to demonstrate the advantages of the model proposed in this article in modern object detection tasks. Traditional anchor-based methods such as YOLO and EfficientDet have achieved significant success in commercial applications, but they rely on dense prior-box predictions, which not only increases the computational burden but also often leads to a higher false detection rate. In contrast, an anchor-free strategy such as the separable CenterNet based on MobileNetV3 proposed in this paper reduces model complexity while maintaining or enhancing detection accuracy.
In terms of overall detection precision (AP), our method based on MobileNetV3 achieved an AP of 54.8%, significantly better than traditional YOLOv4 (43.5%) and EfficientDet-D5 (51.5%). This improvement is mainly attributed to the lightweight design and efficient feature extraction capabilities of MobileNetV3. Although YOLOv4 and EfficientDet-D5 perform well in some respects, they still fall short of our model in terms of overall performance.
Analyzing detection accuracy at different IoU thresholds (AP50 and AP75), our method reached 72.9% on AP50 and 59.8% on AP75, indicating that the model maintains high detection accuracy even at stricter IoU thresholds. In comparison, YOLOv4 achieved 65.7% on AP50 and 47.3% on AP75, both lower than our model, reflecting the advantages of MobileNetV3 in high-precision detection tasks.
Regarding detection performance for targets of different sizes, our method achieved 38.5% on small targets (APS), while performance on medium-sized (APM) and large-sized (APL) targets was 59.8% and 68.9%, respectively. These results show that our model performs well on targets of various sizes, especially in small-target detection, where it has a distinct advantage over EfficientDet-D5 (33.9%) and YOLOv4 (26.7%). This advantage likely stems from the feature pyramid network (FPN) used in our model, which, through multi-scale feature fusion, enhances its ability to detect small-sized targets.
Models like Deformable DETR that incorporate self-attention mechanisms perform excellently in complex scenes but require substantial computational resources, limiting their application in resource-constrained environments. In contrast, our model leverages the lightweight nature of MobileNetV3 and efficient feature fusion, not only enhancing execution speed but also reducing the overall size of the model, making it more suitable for mobile device applications.
The experimental results show that although our model requires far fewer parameters and computational resources than traditional models, its performance on the MS COCO 2017 dataset is comparable to that of advanced models such as YOLOv5 and EfficientDet; moreover, it performs better on small targets and partially occluded targets. This validates the effectiveness of the DBi-FPN in enhancing detection performance, particularly in balancing detection accuracy against operational speed.
Compared to traditional two-stage detection methods, the algorithm proposed in this study achieves significant improvements in feature extraction and detection speed. Additionally, while substantially reducing the model's parameters and computational load, it continues to improve detection accuracy, demonstrating its efficiency and practicality in modern object detection tasks. To showcase the algorithm's advantages more comprehensively, we conducted a comparative analysis against current mainstream anchor-based one-stage algorithms, selecting RetinaNet [24], YOLOv3 [21], YOLOv4 [22], YOLOv5 [23], and their variants for performance evaluation.
Detailed experimental comparisons indicate that our algorithm excels across multiple important performance metrics. In particular, compared to YOLOv3, our model's AP is higher by 21.8 percentage points, a significant gain that demonstrates the model's capability in handling complex visual scenes. Compared to YOLOv4 and YOLOv5, the AP is 11.3 and 10.3 percentage points higher, respectively, further validating our model's sustained advantage in accuracy. Although our model's AP is 0.7 percentage points lower than the latest YOLOv4-P7 version, primarily due to differences in the backbone network, this slight gap still highlights the efficiency and competitiveness of our algorithm given its significant advantages in parameter count and computational cost.
Moreover, compared to existing algorithms that employ an anchor-free method, the algorithm presented in this study not only exhibits a clear advantage in detection accuracy but also excels in detection speed. This is attributed to the use of separable convolution technology within the algorithm, which significantly reduces the model’s parameter count, thereby enhancing operational speed.
Figure 2 visually contrasts the performance of the separable CenterNet network framework based on MobileNetV3 with several mainstream models in real-world applications. As shown in Figure 2a,c,d, the traditional CenterNet model relies primarily on a target's center point for prediction. Although this method performs well in open scenes, it struggles with heavily occluded targets such as the small sheep depicted, exposing its limitations in handling occlusion. This is because CenterNet depends on precise feature point localization, and occlusion degrades the visibility and accuracy of these feature points, making it difficult for the model to locate occluded targets. Additionally, CenterNet's bounding box regression depends on the overall features of a target; when a target is partially occluded, the model receives incomplete feature information, leading to inaccurate regression results. Occlusion also interferes with the feature extraction process, especially in complex scenes: occluding objects may share similar features with the target, making it difficult for the model to distinguish target from non-target regions and thereby degrading detection performance.
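To ground this discussion, the following simplified decoding sketch shows how a CenterNet-style head turns a center heatmap into boxes by keeping only local-maximum peaks; the function name and tensor layout are our assumptions, but it illustrates why a peak weakened by occlusion removes the entire detection.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Keep local-maximum peaks of the center heatmap and read out the top-k boxes.

    heatmap: (B, C, H, W) per-class center scores in [0, 1].
    wh:      (B, 2, H, W) predicted box width and height at each location.
    """
    # Non-maximum responses are zeroed: only 3x3 local maxima survive as candidates.
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float() * heatmap
    B, C, H, W = peaks.shape
    scores, idx = peaks.view(B, -1).topk(k)                       # top-k over classes and positions
    cls = torch.div(idx, H * W, rounding_mode="floor")
    pos = idx % (H * W)
    ys = torch.div(pos, W, rounding_mode="floor").float()
    xs = (pos % W).float()
    w = wh[:, 0].reshape(B, -1).gather(1, pos)
    h = wh[:, 1].reshape(B, -1).gather(1, pos)
    boxes = torch.stack([xs - w / 2, ys - h / 2, xs + w / 2, ys + h / 2], dim=-1)
    return boxes, scores, cls
```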
Meanwhile, although the SSD model is widely used in various detection tasks, it exhibits significant detection omissions in scenes with multiple targets and occlusions. The SSD model relies on preset anchor boxes to detect targets, and when a scene contains many dense targets, the preset anchor boxes may not cover all of them, causing some to be missed. Under occlusion, parts of a target's features are obscured by occluders, making it difficult for the model to extract complete feature information and directly degrading its detection capability. Additionally, during multi-scale feature extraction the SSD model may not adequately capture the detailed features of occluded targets, especially on lower-resolution feature maps, where the loss of detail further exacerbates missed detections. Moreover, when there are numerous targets in the scene, the non-maximum suppression (NMS) step may incorrectly suppress some targets, especially those adjacent to occluders or other targets.
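As a brief illustration of the suppression behaviour described above, the following greedy NMS sketch (using torchvision's `box_iou`; the function name is ours) shows how a lower-scoring box that overlaps a kept box beyond the IoU threshold is discarded, which is exactly how an occluded object adjacent to another detection can be lost.

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any remaining box
    whose IoU with it exceeds the threshold. Two distinct but heavily overlapping
    objects (e.g., one occluding the other) can thus collapse into a single detection."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]    # discard boxes overlapping the kept one
    return torch.tensor(keep, dtype=torch.long)
```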
As for YOLOv5, although it is known for its high detection speed and strong performance on large targets, it still shows shortcomings in small-target detection accuracy. Its anchor box design and resolution configuration are better suited to larger targets, so small targets receive insufficient attention, which affects their detection accuracy. Additionally, while YOLOv5's feature extraction network captures rich semantic information, detail information for small targets can be diluted or lost during multi-level feature fusion, making their features less distinct. In YOLOv5's regression and classification stages, larger targets occupy a greater proportion of the feature map and thus obtain higher confidence scores, whereas small targets occupy a smaller proportion and receive lower confidence, making them more likely to be suppressed during NMS. During training, the loss function is more sensitive to errors on large targets, so optimization converges more quickly toward detecting large targets while neglecting small-target accuracy.

Compared to the aforementioned models, our proposed separable CenterNet detection network based on MobileNetV3 demonstrates clear advantages in handling small targets and occlusion. In particular, through the introduction of the DBi-FPN, the model achieves more effective feature fusion and bidirectional information flow, greatly enhancing its ability to recognize small and partially occluded targets in complex scenes. The DBi-FPN not only improves the hierarchy and richness of features but also expands the model's perceptual range, thereby significantly improving small-target detection accuracy and the overall robustness of the model.