To validate the accuracy of the proposed model, we conducted a series of experiments. This section outlines the experimental setup, methods, and evaluation metrics used to assess the performance of our object detection model; the datasets, implementation details, and ablation studies are described in detail to ensure clarity and reproducibility of results.
3.1. Datasets
This paper selects the PASCAL VOC 2007 and MS COCO 2017 datasets for model training and validation. These datasets are chosen for their comprehensive annotation details, including object bounding boxes and class labels, which facilitate precise object detection studies. The MS COCO 2017 dataset, in particular, contains a large number of small objects and occlusions, making it suitable for validating model performance in scenarios involving small objects and occlusions.
For PASCAL VOC, the training data comprise 16,551 images spanning 20 categories (e.g., airplane, bicycle, bird, boat, and bottle), and the PASCAL VOC 2007 test set of 4952 images was used for performance assessment. Additionally, we used the MS COCO 2017 dataset to validate our method further: during development, the 2017 training set was used to train the algorithm and the 2017 validation set was used for hyperparameter selection and validation, and the final comparison with state-of-the-art methods was also conducted on MS COCO 2017.
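For readers reproducing this setup, the sketch below shows one way the two benchmarks can be loaded with torchvision; the directory paths and `download` flags are illustrative assumptions rather than the authors' exact data pipeline.

```python
from torchvision import datasets

# Illustrative paths; the exact splits follow the description above.
voc_trainval = datasets.VOCDetection("data/voc", year="2007",
                                     image_set="trainval", download=True)
voc_test = datasets.VOCDetection("data/voc", year="2007",
                                 image_set="test", download=True)

coco_train = datasets.CocoDetection("data/coco/train2017",
                                    "data/coco/annotations/instances_train2017.json")
coco_val = datasets.CocoDetection("data/coco/val2017",
                                  "data/coco/annotations/instances_val2017.json")
```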
In terms of evaluation metrics, we primarily used the mean average precision (mAP) and frames per second (FPS) to measure the performance of the object detection task. These metrics not only reflect the accuracy of the algorithm but also demonstrate its processing speed in practical applications, thereby providing a comprehensive assessment of the algorithm’s usability and efficiency. Through these rigorous evaluations, we demonstrated the effectiveness and superiority of the proposed algorithm.
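As a concrete illustration of the FPS metric, the following minimal sketch times repeated single-image forward passes on the GPU; the function name, warm-up count, and iteration count are our own assumptions, not the authors' measurement protocol, and mAP would typically be computed with the standard VOC/COCO evaluation tools.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, image_size=512, warmup=10, iters=100, device="cuda"):
    """Estimate FPS as the average single-image inference rate."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, image_size, image_size, device=device)
    for _ in range(warmup):            # warm up CUDA kernels and caches
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(dummy)
    torch.cuda.synchronize()           # wait for all GPU work before stopping the clock
    return iters / (time.time() - start)
```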
3.2. Implementation Details
This work is implemented in PyTorch. All experiments were conducted on a PC equipped with an Intel Core i7-12700K CPU and an Nvidia RTX 3090 GPU.
In this study, we selected the lightweight MobileNetV3 as the backbone network for feature extraction; it was pre-trained and then fine-tuned on the dataset, with detailed parameter configurations presented in Table 1. During the experiments, the input image resolution was set to 512 × 512 and the batch size to 16. We initially used the Adam optimizer with an initial learning rate of 0.004; as training progressed, we switched to the SGD optimizer to adjust the learning rate dynamically for more refined model tuning. This strategy was intended to balance training efficiency and model performance.
The primary reason for choosing MobileNetV3 as the backbone network for feature extraction was its outstanding performance and efficient computational characteristics. MobileNetV3 combines lightweight depthwise separable convolutions and the SE channel attention mechanism. This combination significantly reduces the model’s parameter count and computational complexity. At the same time, it enhances the expressiveness of features and the model’s sensitivity to small-sized targets. Furthermore, the structural optimizations of MobileNetV3 make it particularly suitable for mobile devices and edge computing scenarios.
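To make the two building blocks mentioned above concrete, the sketch below shows a simplified depthwise separable convolution combined with an SE channel attention module; it uses plain ReLU and sigmoid rather than MobileNetV3's hard-swish/hard-sigmoid activations, and the class names and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE channel attention: global pooling -> two 1x1 convs -> channel-wise scaling."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                    # squeeze: global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excite: per-channel weights
        return x * s                                            # reweight the input channels

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.se = SqueezeExcite(out_ch)

    def forward(self, x):
        return self.se(torch.relu(self.bn(self.pointwise(self.depthwise(x)))))
```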
During the training process, we initially chose to use the Adam optimizer because it can achieve convergence faster than traditional SGD optimizers, especially when dealing with complex non-convex optimization issues. The Adam optimizer calculates first-order moment estimates and second-order moment estimates of gradients. Such calculations allow for the adaptive adjustment of learning rates for different parameters, making the training process more stable and efficient. As training progressed, to refine the network’s ability to fit the data, we switched to using the SGD optimizer. SGD makes more precise learning rate adjustments in the later stages of training, which helps the model achieve a better local optimum and reduces the risk of overfitting. Consequently, the model achieves better performance on the validation and test sets.
Furthermore, fixing the input image resolution at 512 × 512 ensures that the details of the images are fully utilized during feature extraction. Setting the batch size to 16 aims to maximize the use of GPU resources while ensuring computational efficiency. This setup helps the model better capture details and features when processing high-resolution images, especially in images where the targets are small or the scenes are complex.
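A minimal training-loop sketch of the optimizer-switching strategy described above follows; the switch epoch, total epochs, SGD learning rate, and momentum are illustrative assumptions (the paper specifies only the batch size, input resolution, and initial Adam learning rate), and `model`, `train_dataset`, and `criterion` stand in for components defined elsewhere.

```python
import torch
from torch.utils.data import DataLoader

# `model`, `train_dataset`, and `criterion` are assumed to be defined elsewhere.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)   # 512x512 inputs, batch size 16

optimizer = torch.optim.Adam(model.parameters(), lr=0.004)        # early phase: Adam, lr = 0.004
switch_epoch, total_epochs = 60, 120                              # illustrative schedule

for epoch in range(total_epochs):
    if epoch == switch_epoch:
        # Later phase: hand over to SGD for finer, more stable learning-rate control.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```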
The specific parameter settings of MobileNetV3 are shown in Table 1.
3.3. Ablation Experiments
To clarify the contributions of the various components of the proposed model to detecting small and occluded objects, we performed extensive ablation studies.
3.3.1. Effectiveness of Bottleneck with Separable Convolution Skip Connection
To validate the effectiveness of the separable CenterNet detection network based on MobileNetV3 proposed in this study, we conducted detailed ablation experiments on the PASCAL VOC dataset. The baseline was the traditional CenterNet network, and the ablation experiments were divided into six groups, configured as follows: group 1 used DLA-34 as the backbone of the traditional CenterNet; group 2 used Hourglass; group 3 employed ResNet-18 with deconvolution; group 4 introduced MobileNetV3 as the backbone; group 5 further added the DBi-FPN on top of MobileNetV3; and the final group additionally integrated the smooth loss function. In the experimental labels, B, v3, and FPN* denote the original CenterNet, the introduced MobileNetV3, and the DBi-FPN, respectively, and the remaining label denotes the smooth loss function. The results of the ablation experiments are shown in Table 2.
From the results in Table 2, we observe variations in base model performance across backbone networks. Models 1, 2, and 3, which use B-DLA34, B-Hourglass, and B-ResNet-18 as backbones, achieve mAP@0.5 scores of 80.7%, 81.5%, and 79.5%, respectively, with frame rates (FPS) of 33.0, 32.0, and 31.0. B-Hourglass thus has a slight advantage in accuracy, though its FPS is marginally lower than that of B-DLA34; comparatively, B-DLA34 offers a better balance between accuracy and speed. With the introduction of MobileNetV3 as the backbone (model 4), the model's mAP@0.5 increases significantly to 83.2% and its FPS rises to 52.0, indicating that MobileNetV3's lightweight design and efficient feature extraction markedly enhance detection accuracy and speed.
Building on this, the introduction of the FPN structure (model 5) further raises the mAP@0.5 to 84.9%, with FPS also increasing to 56.0. Through multi-scale feature fusion, the FPN structure effectively improves the model's detection capability across different object sizes, significantly boosting overall performance.
Finally, when the smooth loss function is integrated into the model (model 6), the mAP@0.5 reaches 85.6%. Although the FPS decreases slightly to 55.0, the overall performance gain remains significant: by regressing target box positions more precisely, this loss function further increases detection accuracy.
These experimental results demonstrate that the introduction of MobileNetV3 as the backbone significantly enhances the model’s computational efficiency and detection accuracy. The multi-scale feature fusion capability of the FPN structure enhances the model’s adaptability to targets of varying sizes. The use of the loss function further refines target box regression, enhancing detection accuracy.
The motivation for this research stems from the demand in the object detection field for efficient and high-accuracy detection models. Particularly in real-time application scenarios such as autonomous driving and video surveillance, it is crucial for models to maintain high accuracy while also possessing rapid processing capabilities. By incorporating the lightweight MobileNetV3 and DBi-FPN, this study not only optimizes feature extraction efficiency but also enhances recognition capabilities for small targets and complex scenes through structural improvements. Additionally, the application of the smooth loss function further enhances the stability of model training and the accuracy of prediction results.
Through comparative analysis, this research demonstrates the superiority of the proposed method over traditional approaches, including improved processing speed and accuracy, as well as stronger adaptability in complex environments.
3.3.2. Effectiveness of the Dual-Path Bi-FPN
To validate the effectiveness of the DBi-FPN proposed in this paper, we applied it to both the original CenterNet model and the separable CenterNet detection network based on MobileNetV3 proposed in this study, and conducted experiments on the PASCAL VOC dataset. The experimental results, shown in Table 3, indicate that models 1, 2, and 3, which use B-DLA34, B-Hourglass, and B-ResNet-18 as backbone networks, achieve mAP@0.5 scores of 80.7%, 81.5%, and 79.5%, with FPS of 33.0, 32.0, and 31.0, respectively. B-Hourglass has a slight advantage in accuracy, though its FPS is slightly lower than that of B-DLA34; in contrast, B-DLA34 offers a better balance between accuracy and speed. When we introduced MobileNetV3 as the backbone network (model 4), the model's mAP@0.5 increased significantly to 83.2% and the FPS rose to 52.0, demonstrating the distinct advantages of MobileNetV3's lightweight design and efficient feature extraction in improving detection accuracy and speed; compared with the other backbone networks, MobileNetV3 markedly increases inference speed while maintaining high accuracy. Building on this, we further introduced the FPN structure (model 5), raising the mAP@0.5 to 84.9% and the FPS to 56.0. Through multi-scale feature fusion, the FPN effectively enhances the model's detection capability across different object sizes, allowing it to better capture the details of targets of varying sizes and thereby significantly boosting detection accuracy. When we additionally introduced the smooth loss function (model 6), the mAP@0.5 reached 85.6%; although the FPS decreased slightly to 55.0, the overall performance gain was still significant. By regressing the positions of target boxes more accurately, this loss function further improves detection accuracy, indicating that a carefully designed loss function can yield higher detection accuracy while maintaining high speed.
The DBi-FPN adopted in our study combines top-down and bottom-up feature fusion mechanisms. This structure not only resolves the insufficient feature utilization caused by the unidirectional flow of information in traditional FPNs but also greatly enriches the network's learning capability and adaptability by enhancing the interaction between features at different levels. In addition, we chose the lightweight MobileNetV3 as the backbone network, further reducing the model's parameter count and computational complexity, so the model is not only highly accurate but also better suited to deployment on resource-constrained devices.
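The sketch below illustrates the general idea of a bidirectional (top-down plus bottom-up) fusion pass over a three-level feature pyramid; it is a simplified stand-in that assumes power-of-two feature map sizes and equal channel widths, not the exact DBi-FPN used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalFusion(nn.Module):
    """One bidirectional fusion pass over three levels (p3 finest ... p5 coarsest).

    Top-down: propagate semantics from coarse to fine levels.
    Bottom-up: propagate localization detail from fine back to coarse levels.
    """
    def __init__(self, channels=128):
        super().__init__()
        self.smooth = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))

    def forward(self, p3, p4, p5):
        # Top-down path: upsample the coarser map and add it to the finer one.
        td4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
        td3 = p3 + F.interpolate(td4, size=p3.shape[-2:], mode="nearest")
        # Bottom-up path: downsample the finer map and add it to the coarser one.
        bu4 = td4 + F.max_pool2d(td3, kernel_size=2)
        bu5 = p5 + F.max_pool2d(bu4, kernel_size=2)
        return [conv(x) for conv, x in zip(self.smooth, (td3, bu4, bu5))]
```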
By introducing the DBi-FPN, this paper significantly enhances the performance of the object detection network, particularly in multi-scale feature fusion and operational efficiency.
3.3.3. Configuration of Training Parameters
To comprehensively evaluate the separable CenterNet detection network algorithm based on MobileNetV3 proposed in this study, Table 4 presents comparative data between this model and the original CenterNet model in terms of backbone network parameters and overall model size. The comparison shows the significant impact of different FPN configurations on performance within the base CenterNet model. Without FPN, the base model achieves an mAP of 80.7% and an FPS of 33.0. When a standard FPN is incorporated, the mAP increases to 82.1%, but the FPS decreases slightly to 32.0. With the introduction of the improved FPN (FPN*), however, the mAP rises further to 83.3% and the FPS increases to 35.0. This demonstrates that FPN effectively enhances detection accuracy through multi-scale feature fusion; in particular, the improved FPN* not only maintains or improves detection accuracy but also increases inference speed.
In our model, different FPN configurations also show significant performance variations. The base model without FPN achieves an mAP of 83.3% and an FPS of 52.0. Compared to the base CenterNet model, our model already exhibits higher detection accuracy and faster inference speed under the same conditions. When a standard FPN is added, the mAP is raised to 84.2%, with the FPS remaining at 52.0. Introducing the improved FPN* further increases the mAP to 85.6%, with the FPS rising to 55.0. These results indicate that our model not only surpasses the traditional CenterNet model in accuracy but also offers significant advantages in speed.
A comparative analysis confirms that the foundational setup of our model capitalizes on MobileNetV3's streamlined architecture and superior feature extraction capabilities. The multi-scale feature fusion capability of FPN significantly improves detection accuracy, and the improved FPN* in particular delivers exceptional results in our model, enhancing both accuracy and inference speed. Across all configurations, our model outperforms the CenterNet model in both mAP and FPS on the VOC dataset, demonstrating that our approach achieves faster inference while maintaining high detection accuracy.
3.4. Comparison with State-of-the-Art Methods
To comprehensively validate the performance of the separable CenterNet detection network algorithm based on MobileNetV3 proposed in this study, we trained and tested it against other mainstream detection algorithms on the MS COCO 2017 dataset. Two-stage object detection algorithms, although highly accurate, are computationally heavy and less practical for real-time applications, so they were not included in this comparison. The selected comparative algorithms include anchor-based one-stage detectors, such as the YOLO series and the EfficientDet series, as well as anchor-free methods such as CornerNet and CenterNet, and DETR, which is based on self-attention mechanisms. The performance comparison results of all the algorithms are shown in Table 5.
Our comparison focuses on the models' performance in practical applications, including key metrics such as detection accuracy, speed, and model size, to demonstrate the advantages of the model proposed in this article in modern object detection tasks. Traditional anchor-based methods such as YOLO and EfficientDet have achieved significant success in commercial applications, but they rely on dense prior-box predictions, which not only increases the computational burden but also often leads to a higher false detection rate. In contrast, an anchor-free strategy such as the separable CenterNet based on MobileNetV3 proposed in this paper reduces model complexity while maintaining or enhancing detection accuracy.
In terms of overall detection precision (AP), our method based on MobileNetV3 achieved an AP of 54.8%, significantly better than traditional YOLOv4 (43.5%) and EfficientDet-D5 (51.5%). This improvement is mainly attributed to the lightweight design and efficient feature extraction capabilities of MobileNetV3. Although YOLOv4 and EfficientDet-D5 perform well in some respects, they still fall short of our model in terms of overall performance.
Analyzing detection accuracy at different IoU thresholds (AP50 and AP75), our method reached 72.9% on AP50 and 59.8% on AP75, indicating that the model maintains high detection accuracy even at stricter IoU thresholds. In comparison, YOLOv4 achieved 65.7% on AP50 and 47.3% on AP75, both lower than our model, reflecting the advantages of MobileNetV3 in high-precision detection tasks.
Regarding detection performance for targets of different sizes, our method achieved 38.5% on small targets (APS), while performance on medium-sized (APM) and large-sized (APL) targets was 59.8% and 68.9%, respectively. These results show that our model performs well on targets of various sizes, especially in small-target detection, where it has a distinct advantage over EfficientDet-D5 (33.9%) and YOLOv4 (26.7%). This advantage likely stems from the feature pyramid network (FPN) used in our model, which, through multi-scale feature fusion, enhances its ability to detect small-sized targets.
Models like Deformable DETR that incorporate self-attention mechanisms perform excellently in complex scenes but require substantial computational resources, limiting their application in resource-constrained environments. In contrast, our model leverages the lightweight nature of MobileNetV3 and efficient feature fusion, not only enhancing execution speed but also reducing the overall size of the model, making it more suitable for mobile device applications.
The experimental results show that although our model requires far fewer parameters and computational resources than traditional models, its performance on the MS COCO 2017 dataset is comparable to that of advanced models such as YOLOv5 and EfficientDet; moreover, it performs better on small targets and partially occluded targets. This validates the effectiveness of the DBi-FPN in enhancing detection performance, particularly in balancing detection accuracy against operational speed.
Compared to traditional two-stage detection methods, the algorithm proposed in this study achieves significant improvements in feature extraction and detection speed. Additionally, while substantially reducing the model's parameters and computational load, it continues to improve detection accuracy, demonstrating its efficiency and practicality in modern object detection tasks. To showcase the algorithm's advantages more comprehensively, we conducted a comparative analysis against current mainstream anchor-based one-stage algorithms, selecting RetinaNet [24], YOLOv3 [21], YOLOv4 [22], YOLOv5 [23], and their variants for performance evaluation.
Detailed experimental comparisons indicate that our algorithm excels across multiple important performance metrics. In particular, compared to YOLOv3, our model's AP is higher by 21.8 percentage points, a significant gain that demonstrates the model's capability in handling complex visual scenes. Compared to YOLOv4 and YOLOv5, the AP is 11.3 and 10.3 percentage points higher, respectively, further validating our model's sustained advantage in accuracy. Although our model's AP is 0.7 percentage points lower than the latest YOLOv4-P7 version, primarily due to differences in the backbone network, this slight gap still highlights the efficiency and competitiveness of our algorithm given its significant advantages in parameter count and computational cost.
Moreover, compared to existing algorithms that employ an anchor-free method, the algorithm presented in this study not only exhibits a clear advantage in detection accuracy but also excels in detection speed. This is attributed to the use of separable convolution technology within the algorithm, which significantly reduces the model’s parameter count, thereby enhancing operational speed.
Figure 2 visually contrasts the performance of the separable CenterNet network framework based on MobileNetV3 with several mainstream models in real-world applications. As shown in Figure 2a,c,d, the traditional CenterNet model relies primarily on a target's center point for prediction. Although this method performs well in open scenes, it struggles with heavily occluded targets such as the small sheep depicted, exposing its limitations in handling occlusion. This is because CenterNet depends on precise feature point localization, and occlusion degrades the visibility and accuracy of these feature points, making it difficult for the model to locate occluded targets. Additionally, CenterNet's bounding box regression depends on the overall features of a target; when a target is partially occluded, the model receives incomplete feature information, leading to inaccurate regression results. Occlusion also interferes with the feature extraction process, especially in complex scenes: occluding objects may share similar features with the target, making it difficult for the model to distinguish target from non-target regions and thereby degrading detection performance.
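To ground this discussion, the following simplified decoding sketch shows how a CenterNet-style head turns a center heatmap into boxes by keeping only local-maximum peaks; the function name and tensor layout are our assumptions, but it illustrates why a peak weakened by occlusion removes the entire detection.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Keep local-maximum peaks of the center heatmap and read out the top-k boxes.

    heatmap: (B, C, H, W) per-class center scores in [0, 1].
    wh:      (B, 2, H, W) predicted box width and height at each location.
    """
    # Non-maximum responses are zeroed: only 3x3 local maxima survive as candidates.
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float() * heatmap
    B, C, H, W = peaks.shape
    scores, idx = peaks.view(B, -1).topk(k)                       # top-k over classes and positions
    cls = torch.div(idx, H * W, rounding_mode="floor")
    pos = idx % (H * W)
    ys = torch.div(pos, W, rounding_mode="floor").float()
    xs = (pos % W).float()
    w = wh[:, 0].reshape(B, -1).gather(1, pos)
    h = wh[:, 1].reshape(B, -1).gather(1, pos)
    boxes = torch.stack([xs - w / 2, ys - h / 2, xs + w / 2, ys + h / 2], dim=-1)
    return boxes, scores, cls
```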
Meanwhile, although the SSD model is widely used in various detection tasks, it exhibits significant detection omissions in scenes with multiple targets and occlusions. The SSD model relies on preset anchor boxes to detect targets, and when a scene contains many dense targets, the preset anchor boxes may not cover all of them, causing some to be missed. Under occlusion, parts of a target's features are obscured by occluders, making it difficult for the model to extract complete feature information and directly degrading its detection capability. Additionally, during multi-scale feature extraction the SSD model may not adequately capture the detailed features of occluded targets, especially on lower-resolution feature maps, where the loss of detail further exacerbates missed detections. Moreover, when there are numerous targets in the scene, the non-maximum suppression (NMS) step may incorrectly suppress some targets, especially those adjacent to occluders or other targets.
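As a brief illustration of the suppression behaviour described above, the following greedy NMS sketch (using torchvision's `box_iou`; the function name is ours) shows how a lower-scoring box that overlaps a kept box beyond the IoU threshold is discarded, which is exactly how an occluded object adjacent to another detection can be lost.

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any remaining box
    whose IoU with it exceeds the threshold. Two distinct but heavily overlapping
    objects (e.g., one occluding the other) can thus collapse into a single detection."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]    # discard boxes overlapping the kept one
    return torch.tensor(keep, dtype=torch.long)
```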
As for YOLOv5, although it is known for its high detection speed and strong performance on large targets, it still shows shortcomings in small-target detection accuracy. Its anchor box design and resolution configuration are better suited to larger targets, so small targets receive insufficient attention, which affects their detection accuracy. Additionally, while YOLOv5's feature extraction network captures rich semantic information, detail information for small targets can be diluted or lost during multi-level feature fusion, making their features less distinct. In YOLOv5's regression and classification stages, larger targets occupy a greater proportion of the feature map and thus obtain higher confidence scores, whereas small targets occupy a smaller proportion and receive lower confidence, making them more likely to be suppressed during NMS. During training, the loss function is more sensitive to errors on large targets, so optimization converges more quickly toward detecting large targets while neglecting small-target accuracy.

Compared to the aforementioned models, our proposed separable CenterNet detection network based on MobileNetV3 demonstrates clear advantages in handling small targets and occlusion. In particular, through the introduction of the DBi-FPN, the model achieves more effective feature fusion and bidirectional information flow, greatly enhancing its ability to recognize small and partially occluded targets in complex scenes. The DBi-FPN not only improves the hierarchy and richness of features but also expands the model's perceptual range, thereby significantly improving small-target detection accuracy and the overall robustness of the model.