Article

DB-YOLOv5: A UAV Object Detection Model Based on Dual Backbone Network for Security Surveillance

Yuzhao Liu, Wan Li, Li Tan, Xiaokai Huang, Hongtao Zhang and Xujie Jiang
1 School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048, China
2 Chongqing Institute of Microelectronics Industry Technology, University of Electronic Science and Technology of China, Chongqing 400031, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(15), 3296; https://doi.org/10.3390/electronics12153296
Submission received: 28 June 2023 / Revised: 28 July 2023 / Accepted: 29 July 2023 / Published: 31 July 2023
(This article belongs to the Special Issue AI-Driven Network Security and Privacy)

Abstract:
Unmanned aerial vehicle (UAV) object detection technology is widely used in security surveillance applications, allowing for real-time collection and analysis of image data from camera equipment carried by a UAV to determine the category and location of all targets in the collected images. However, small-scale targets can be difficult to detect and can compromise the effectiveness of security surveillance. In this work, we propose a novel dual-backbone network detection method (DB-YOLOv5) that uses multiple composite backbone networks to enhance the extraction capability of small-scale targets’ features and improve the accuracy of the object detection model. We introduce a bi-directional feature pyramid network for multi-scale feature learning and a spatial pyramidal attention mechanism to enhance the network’s ability to detect small-scale targets during the object detection process. Experimental results on the challenging UAV aerial photography dataset VisDrone-DET demonstrate the effectiveness of our proposed method, with a 3% improvement over the benchmark model. Our approach can enhance security surveillance in UAV object detection, providing a valuable tool for monitoring and protecting critical infrastructure.

1. Introduction

The use of drones for remotely detecting and tracking persons or vehicles has become increasingly prevalent in the field of security surveillance, particularly in urban areas. However, the widespread use of drones also raises concerns about network security and privacy. Due to their small size and limited power supply performance, UAVs have low computing power, which poses significant challenges for accurate object detection tasks [1]. In 2012, the introduction of the deep convolutional neural network (CNN) by Krizhevsky et al. [2] revolutionized the field of computer vision, leading to the development of more efficient and accurate object detection models, such as the RCNN model proposed by Girshick et al. [3] in 2015. Since then, deep learning-based object detection technology has undergone rapid development, providing significant potential for enhancing security surveillance capabilities while also requiring careful consideration of network security and privacy concerns. Recent studies such as [4] have focused on developing AI-driven solutions to address network security and privacy challenges in the context of UAV object detection.
Currently, object detection methods based on deep learning can be divided mainly into two-stage and one-stage categories. The two-stage method detects targets based on candidate regions: it extracts candidate regions and performs deep learning on the corresponding regions, which yields high classification accuracy. This type of algorithm mainly includes RCNN [3], Fast R-CNN [5], Faster R-CNN [6], R-FCN [7], CoupleNet [8], Mask RCNN [9], Cascade RCNN [10], Libra R-CNN [11], Reasoning-RCNN [12], EfficientDet [13], D2Det [14], DeFeat [15], MViTv2 [16], and AdaMixer [17]. In contrast, the one-stage method directly regresses the category probability and position coordinates of the object, and it includes SSD [18], DSSD [19], FSSD [20], ScratchDet [21], ExtremeNet [22], MimicDet [23], I3Net [24], SaFT [25], CapDet [26], YOLO [27], YOLOv2 [28], YOLOv3 [29], YOLOv4 [30], YOLOv5, YOLOx [31], YOLOF [32], YOLOv6 [33], YOLOv7 [34], YOLOv8, and other methods. The detection result is obtained directly after a single pass, which greatly improves the speed of the model. For example, the YOLOv5 model improves on the accuracy and speed of YOLOv4; its training phase includes improvements such as mosaic data enhancement, adaptive image scaling, and adaptive anchor box calculation.
With the successive emergence of object detection models with superior performance and improved experimental results on the MS COCO dataset, object detection for UAVs has also attracted increasing attention as an important computer vision task. UAV object detection plays an important role in understanding long-range images. Unlike general object detection, UAV aerial images are captured from a downward-looking angle, which leads to many targets of small scale in each image. Therefore, object detection models that work well on general datasets cannot achieve high robustness on UAV aerial photography. As shown in Figure 1, when the YOLOv5 model is employed for object detection on the VisDrone UAV dataset, pedestrian targets with small scales are more likely to be missed than vehicle targets with larger scales, which reflects the weak generalization ability of the UAV object detection model.
In general, small-scale targets can be defined as the size range of objects in an image that are smaller than the minimum detectable size threshold of the object detection model. One way to define this threshold is to use the concept of object coverage ratio, which refers to the proportion of the image area covered by an object. For example, a commonly used threshold for object detection is 0.005, which means that an object must occupy at least 0.5% of the image area to be detected by the model. In this case, too-small scales would refer to objects that are smaller than the minimum size required to achieve an object coverage ratio of 0.005.
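To make the coverage-ratio criterion concrete, the following minimal Python sketch (illustrative, not taken from the paper's code) flags a bounding box as small-scale when it covers less than 0.5% of the image area:

```python
def is_small_target(box_w: float, box_h: float,
                    img_w: float, img_h: float,
                    min_coverage: float = 0.005) -> bool:
    """Return True if the box covers less than `min_coverage` of the image area."""
    coverage = (box_w * box_h) / (img_w * img_h)
    return coverage < min_coverage


# Example: a 40 x 30 pixel pedestrian in a 640 x 640 frame covers about 0.29%
# of the image, so it falls below the 0.5% threshold and counts as small-scale.
print(is_small_target(40, 30, 640, 640))  # True
```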
Based on the above discussion, this paper proposes the DB-YOLOv5 object detection model for UAVs, aiming at the problem of very small targets in UAV images. The model adopts a composite backbone network, which connects multiple identical backbone networks in a composite manner and fuses the high-level and low-level features of the multiple backbones, thereby expanding the receptive field of the network. A bi-directional feature pyramid network structure is also introduced in the feature extraction stage, which can fuse multi-scale features conveniently, quickly, and effectively to improve the detection accuracy of small-scale targets. A spatial pyramid attention mechanism is used in the output stage, which maintains the feature representation and spatial location information of the target and further strengthens the ability to identify and locate small targets. Finally, EIoU_loss is used to further optimize the bounding boxes of small-scale targets and alleviate the bounding box regression problem in small target detection.
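The paper applies EIoU_loss without restating its definition; for reference, a commonly used Efficient-IoU formulation (our assumption about the variant intended) augments the IoU term with center-distance and width/height penalties measured against the smallest enclosing box:

$$\mathcal{L}_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c_w^2 + c_h^2} + \frac{\rho^2(w, w^{gt})}{c_w^2} + \frac{\rho^2(h, h^{gt})}{c_h^2}$$

where $b$, $w$, and $h$ are the center, width, and height of the predicted box; $b^{gt}$, $w^{gt}$, and $h^{gt}$ are those of the ground-truth box; $\rho(\cdot)$ is the Euclidean distance; and $c_w$ and $c_h$ are the width and height of the smallest box enclosing both boxes. The extra terms penalize center offset and size mismatch directly, which matters for small boxes, where IoU alone changes very little.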
The main contributions of this paper are as follows:
  • The dual-backbone network model DB-YOLOv5 is proposed. Aiming at the problem of missed and false detection of small targets in UAV images, the model integrates the high-level and low-level features of multiple identical backbone networks to expand the receptive field of the network for small target features.
  • The model uses a bi-directional feature pyramid network, which fuses multi-scale features through the bottom-up and top-down approaches to strengthen the network’s multi-scale feature fusion for small targets.
  • The spatial pyramid attention mechanism enables the model to maintain both the feature information and the location information of small targets, which strengthens their identification and positioning.
The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 details the proposed method. The experiments and results analyses are provided in Section 4 while introducing the selected dataset and evaluation indicators. Finally, conclusions are drawn in Section 5.

2. Related Work

With the increasing demand for security surveillance, object detection models are being used more frequently in aerial remote sensing to detect potential security threats. However, one of the major challenges in this field is detecting small targets in aerial images. To address this problem, researchers have proposed various multi-angle improvements to different general object detection models with the aim of enhancing the robustness and generalization ability of the detection models for aerial photography datasets in the context of network security and privacy. For example, Cheng et al. [35] proposed a concept of coarse-grained density maps for the problem of dense small objects and uneven distribution in aerial remote sensing images and designed a density map-based clustering region generation algorithm. They improved the Mosaic data augmentation method to divide the image into multiple sub-regions so that dense small objects could be adjusted to a reasonable scale. This method improved the detection performance of rare objects and difficult samples and alleviated the foreground–background and class imbalances. Huang et al. [36] designed a unified foreground assembly and multi-agent detection network for the problem of dense small targets and target shape similarity in aerial images. The method combines the subregions provided by a coarse detector to suppress the background through clustering and then assembles the resulting subregions into mosaics for a single inference. It models the object distribution in a fine-grained manner through multi-agent learning, thereby significantly reducing the overall time cost and improving the efficiency and accuracy of detection. Wang et al. [37] proposed a model based on multiple center points to solve the problem of small target detection in images. The method first locates multiple center points and then estimates the offsets and scales of the corresponding targets, which improves the detection performance of small targets. Xu et al. [38], targeting the detection of small objects in aerial images, argued that IoU, the most commonly used indicator in object detection tasks, is not suitable for small targets. They proposed a simple and effective dot distance method, defined as the normalized Euclidean distance between the center points of two bounding boxes. This method was suitable for small target detection and achieved better detection performance. Tan et al. [39] proposed the YOLOv4_Drone method based on YOLOv4, aiming at the problems of small targets, complex backgrounds, and mutual occlusion of targets in UAV images. This method employed dilated (hole) convolution. It introduced an ultra-lightweight subspace attention mechanism and Soft-NMS to resample the same feature map and implement multi-scale feature representation, solving the problem of missed detections caused by adjacent or even occluded targets captured by drones. To improve the precision of UAV object detection while keeping the model lightweight, Yang et al. [40] modified the YOLOv5s model. To address the small object detection problem, a prediction head was added to better retain small object feature information. The CBAM attention module was also integrated to better find attention regions in dense scenes. The original IoU-NMS was replaced by NWD-NMS in post-processing to alleviate the sensitivity of IoU to small objects.
Although the above algorithms have carried out various studies and explorations on the problem of small targets in UAV images, the proposed models still cannot obtain real-time and efficient detection results of small targets in practical UAV applications. Therefore, based on the YOLOv5 algorithm, this paper introduces a composite backbone network, a bidirectional feature pyramid network structure, and a spatial pyramid attention mechanism and proposes the UAV image object detection model DB-YOLOv5.
The YOLOv5 model achieves excellent speed and accuracy, in part because it uses the CSPDarknet53 backbone network to extract features. CSPDarknet53 is based on the Darknet53 backbone of YOLOv3 combined with CSPNet [41]. This network contains five CSP modules, which use convolution kernels with a size of 3 × 3 and a stride of 2 and therefore also perform downsampling. The CSPDarknet53 backbone enhances the feature learning ability of the network, maintains model accuracy while remaining lightweight, and reduces the computational cost of the model. In addition, the input of the model adopts the mosaic method for data enhancement, which splices four pictures together through random scaling, cropping, and arrangement; this method is highly effective for small target detection. In the Neck stage of YOLOv5, the PANet [42] structure of FPN + PAN [43] is adopted, which strengthens the multi-scale feature fusion ability, accurately preserves the location information of small targets, and helps locate targets correctly. Although the DB-YOLOv5 model focuses on improving object detection in low-altitude aerial images, several studies address other challenges in video surveillance. Sun et al. [44] proposed a dynamic partial-parallel data layout (DPPDL) for green video surveillance storage, which aims to reduce energy consumption and improve storage efficiency. Similarly, Yu et al. [45] introduced an extra-parity energy-saving data layout for video surveillance, which reduces energy consumption and optimizes storage utilization. These studies demonstrate the importance of developing efficient and sustainable solutions for video surveillance, which can have significant implications for applications such as security and public safety. Zhang et al. [46] conducted research on backdoor attacks on deep neural network models used in image classification; they analyzed the impact of these attacks on classification accuracy and proposed a defense mechanism to mitigate them. This study emphasizes the importance of developing secure and robust deep neural network models for reliable image classification.
In low-altitude aerial images, the visual information contained in tiny targets is limited by the condition of looking down, which results in significant difficulties in aerial target detection. Therefore, improving the detection performance of small and ambiguous targets and reducing the occurrence of missed and false detections is an urgent problem for UAV image object detection. This paper is dedicated to providing an effective solution to this purpose, namely, the UAV image object detection model DB-YOLOv5.

3. Methods

3.1. Overall Structure

DB-YOLOv5, the proposed UAV object detection model, improves the capabilities of feature extraction and fusion by introducing a composite backbone network, a bidirectional feature pyramid network, and a spatial pyramid attention mechanism. This design addresses the false and missed detections caused by the very small scale of targets in the UAV environment and thus improves the detection accuracy for small targets. The structure of DB-YOLOv5 is shown in Figure 2.
The model performs data enhancement using the Mosaic operation in the image preprocessing stage, scales the input UAV image to the prescribed input size of 640 × 640 for this network model, and performs data preprocessing operations such as normalization. After that, Focus slicing and convolution are performed on the input image to obtain 320 × 320 × 32 feature maps. In the N-layer assisting backbone network (yellow part in Figure 2), the H × W × C feature map of layer N − 1 is processed by a 3 × 3 convolution and normalization to obtain the H/2 × W/2 × 2C feature map of layer N. In the N-layer main backbone network (green part in Figure 2), the Nth-layer H × W × C feature map is obtained by superimposing and fusing the 2H × 2W × C/2 feature map of layer N − 1 with the Nth-layer feature map of the assisting backbone network. This process is shown in Equation (1), where $x_{main}^{N}$ and $x_{main}^{N-1}$ denote the feature maps of the main backbone network at layer N and layer N − 1, respectively; $x_{assist}^{N}$ denotes the feature map of the assisting backbone network at layer N; UP(·) denotes the upsampling operation; and ⊕ denotes the superimposing-and-fusion operation. The value of N used in this paper is 5.

$$x_{main}^{N} = x_{main}^{N-1} \oplus \mathrm{UP}(x_{assist}^{N}) \quad (1)$$
In the feature fusion stage, there are N′ output scales. The N′th output feature map, with dimension H × W × C, is obtained by fusing the N′ − 1st output feature map, with a scale of 2H × 2W × C/2, with the feature map of the same H × W × C dimension from the shallow network processed by the bidirectional feature pyramid network, as shown in Equation (2), where $x_{fpn}^{N'}$ and $x_{fpn}^{N'-1}$ denote the N′th and N′ − 1st feature fusion outputs, respectively; $x_{backward}^{N'}$ denotes the feature map of the same size as the N′th output in the shallow network; and Bi(·) denotes the bidirectional feature pyramid network. In this paper, the value of N′ is taken as 3.

$$x_{fpn}^{N'} = x_{fpn}^{N'-1} \oplus Bi(x_{backward}^{N'}) \quad (2)$$
The feature maps after feature extraction and multiscale fusion are adaptively averaged pooled at three scales of 80 × 80, 40 × 40, and 20 × 20 through a spatial pyramid structure to generate an attention map. The generated attention maps are weighted by a combination of a fully connected layer and a sigmoid activation layer to generate attention weights in the corresponding feature maps, which label the small targets in the original images more accurately.

3.2. Composite Backbone Network Based on CSPDarknet53

The backbone of the YOLOv5 model adopts the CSPDarknet53 network combined with the CSP structure. However, the feature extraction ability of this network for small-scale targets cannot achieve satisfactory results. Most current research on the backbone network focuses on deepening or widening the backbone. To deepen the network without introducing additional pre-training overhead, we introduce the structure of the composite backbone network (CBNet) in the backbone stage of DB-YOLOv5. The structure superimposes multiple backbones of the same type to expand the feature receptive field of the network, thereby enhancing the backbone's feature extraction capability for small targets in UAV images.
CBNet is divided into two parts: the main backbone and the assisting backbone. The purpose of the assisting backbone is to complement the features extracted by the main backbone. Each backbone has L stages, each of which contains a series of convolutional layers whose feature maps have the same size. The nonlinear transformation implemented in the lth stage is denoted $F^{l}$. The output of the lth stage of the assisting backbone (denoted $x_{assist}^{l}$) is fused with the output of the l − 1st stage of the main backbone ($x_{main}^{l-1}$), and the result is the input to the parallel stage l of the main backbone, as shown in Equation (3):

$$x_{main}^{l} = F_{main}^{l}\left(x_{main}^{l-1} + g(x_{assist}^{l})\right), \quad l \geq 2 \quad (3)$$
where g(∙) represents a layer of 1 × 1 convolution and a layer of batch normalization, whose purpose is to align the features of the assisting backbone with those of the main backbone before they are fused.
As shown in Figure 3, after the 80 × 80 feature image obtained by convolution in the assisting CSPDarknet53 backbone is input to the main backbone, it is superimposed and fused with the feature image obtained after the 640 × 640 original image is processed by Focus slicing, and the obtained result is input as the input content to the starting position of the main backbone. The 40 × 40 feature image obtained by convolution processing in the assisting backbone model and the 80 × 80 feature image in the main backbone model are then superimposed and fused, and the result is used as the input to continue the next convolution process. Finally, the 20 × 20 feature image output by the assisting backbone is superimposed and fused with the 40 × 40 feature image in the main backbone model, and the result is input to the main backbone model to continue the convolution operation. Thus, the obtained feature image is passed to the next module. After that, the feature information extracted from the backbone network is used as the input in Section 3.3 to perform multi-scale stacking and processing of features through the bidirectional feature pyramid network. Using the features learned by the network, combined with the spatial pyramid attention mechanism in Section 3.4, the objects in the input images of the model are classified and localized.
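As a reference for how this fusion can be wired up, here is a minimal PyTorch sketch of the assist-to-main connection in Equations (1) and (3); the module name, channel sizes, and nearest-neighbor upsampling are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AssistToMainFusion(nn.Module):
    """Fuses an assisting-backbone stage into the parallel main-backbone stage.

    Implements g(.) from Equation (3): a 1x1 convolution followed by batch
    normalization, with upsampling so the assisting feature matches the spatial
    size of the main-backbone input (Equation (1)). Channel sizes are illustrative.
    """

    def __init__(self, assist_channels: int, main_channels: int):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv2d(assist_channels, main_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(main_channels),
        )

    def forward(self, x_main_prev: torch.Tensor, x_assist: torch.Tensor) -> torch.Tensor:
        # Upsample the assisting feature to the spatial size of the main feature,
        # project it with 1x1 conv + BN, and add it element-wise (Eqs. (1) and (3)).
        x_assist = F.interpolate(x_assist, size=x_main_prev.shape[-2:], mode="nearest")
        return x_main_prev + self.g(x_assist)


# Example: an 80x80 assisting feature fused into a 160x160 main-backbone feature.
fusion = AssistToMainFusion(assist_channels=128, main_channels=64)
out = fusion(torch.randn(1, 64, 160, 160), torch.randn(1, 128, 80, 80))
print(out.shape)  # torch.Size([1, 64, 160, 160])
```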
Meanwhile, to further enhance the operating efficiency and cut down on time cost, we no longer connect the low-level features of the first two layers in the composite backbone network module and only connect and stack the features of the last two layers of backbone networks. The high-level semantic feature information is further saved and learned while retaining lower-level location information, thereby easing the contradiction between time and accuracy to a certain extent. We named this module CBNet-tiny, as shown in Figure 4.

3.3. Bidirectional Feature Pyramid Network

The feature information extracted by the composite backbone network in Section 3.2 differs in its sensitivity to small targets depending on the feature extraction scale. In order to coordinate the feature information extracted at different scales, the model needs to combine the extracted features for multi-scale fusion and learning, that is, to represent and process multi-scale features effectively, which is one of the difficulties in target detection. Early detection models usually made predictions directly through a pyramid structure built on features extracted from the backbone. In this process, the feature pyramid network plays an important role by combining multi-scale features in a top-down manner. Inspired by this idea, PANet, based on FPN, adds a bottom-up path to further aggregate feature information; although it achieves good performance, it also consumes considerable time, especially in the training phase. Since nodes with only one input edge contribute little to the feature fusion network, Bi-FPN removes the intermediate nodes at P3 and P7 in PANet to form a simplified bidirectional network and reduce model overhead. Additionally, this module adds a skip connection from input nodes to output nodes at the same scale, incorporating more features without excessive computational overhead. At the same time, the model regards the Bi-FPN module, which achieves feature fusion through a bidirectional path, as a network layer and reuses it several times to achieve better feature fusion, as shown in Figure 5.
In the bidirectional feature pyramid network, the output results of each node are shown in Equations (4)–(6):
$$x_{fpn}^{l} = F_{conv}\left(x^{l} + F_{conv}(x^{l+1})\right) \quad (4)$$

$$x_{fpn}^{l} = F_{conv}\left(x^{l} + F_{conv}\left(x^{l} + F_{conv}(x^{l+1})\right) + x_{fpn}^{l-1}\right) \quad (5)$$

$$x_{fpn}^{l} = F_{conv}\left(x^{l} + x_{fpn}^{l-1}\right) \quad (6)$$

where $F_{conv}(\cdot)$ represents the convolutional layer, $x^{l}$ represents the input features of the bidirectional pyramid network, and $x_{fpn}^{l}$ represents the output features. Equation (4) gives the output of node P3 in Figure 5, Equation (5) gives the output of the intermediate nodes in Figure 5, and Equation (6) gives the output of node P7 in Figure 5. Although the PANet structure of YOLOv5 can fuse multi-scale target feature information, it can easily lose features in the process of fusing small-target and large-target features, thus affecting the model's ability to detect small-scale targets. To address this, we use the Bi-FPN module four times in DB-YOLOv5 to bi-directionally fuse multi-scale feature information on the three output branches, further strengthening the model's feature fusion and extraction capabilities for small targets and ensuring the accuracy of positioning and classification of small targets.
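The sketch below shows how one such fusion node might look in PyTorch. It is a simplified, equal-weight reading of Equation (5); the learned fusion weights, channel counts, and exact resizing operators are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """3x3 conv + BN + SiLU used as F_conv(.) in Equations (4)-(6); illustrative."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class BiFPNNode(nn.Module):
    """One intermediate bidirectional-fusion node (Equation (5)), simplified.

    It sums the same-scale input, a top-down feature from the next (coarser)
    level, and the bottom-up output of the previous (finer) level, then applies
    F_conv. Equal-weight summation is assumed instead of learned fusion weights.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv_td = ConvBlock(channels)   # processes the top-down branch
        self.conv_out = ConvBlock(channels)  # produces the node output

    def forward(self, x_same, x_coarser, x_prev_out):
        # Top-down path: upsample the coarser level and fuse with the same-scale input.
        td = self.conv_td(x_same + F.interpolate(x_coarser, scale_factor=2, mode="nearest"))
        # Bottom-up path: downsample the previous output level, then fuse it with
        # the same-scale input and the top-down feature (skip connection).
        bu = F.max_pool2d(x_prev_out, kernel_size=2)
        return self.conv_out(x_same + td + bu)


# Example: fuse 40x40 features with a 20x20 (coarser) level and an 80x80 previous output.
node = BiFPNNode(channels=256)
y = node(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 20, 20), torch.randn(1, 256, 80, 80))
print(y.shape)  # torch.Size([1, 256, 40, 40])
```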

3.4. Spatial Pyramid Attention Mechanism

The bidirectional feature pyramid network described in Section 3.3 strengthens the model's feature fusion and extraction capabilities for small objects, but this alone is not enough. In practical applications, the model also needs to ignore the complex background and other interfering information in the image and pick out the desired feature information of the small target. To achieve accurate classification and positioning and to avoid missed and false detections of small targets in the drone environment, we use the spatial pyramid attention network (SPANet) in DB-YOLOv5. The attention mechanism makes our model focus on the small-target regions of the image and selectively extract key information while ignoring irrelevant information such as the background. This improves the localization and classification performance of the entire model for small-scale targets and the accuracy of the detection model.
The feature processing of the spatial pyramid attention mechanism is shown in Equation (7):
$$x_{weight} = \mathrm{sigmoid}\left(F_{fc}\left(P_{1\times1}(x) + P_{2\times2}(x) + P_{4\times4}(x)\right)\right) \quad (7)$$

where $F_{fc}(\cdot)$ represents the fully connected layer; sigmoid(·) represents the activation function layer; $P_{1\times1}(\cdot)$, $P_{2\times2}(\cdot)$, and $P_{4\times4}(\cdot)$ represent the 1 × 1, 2 × 2, and 4 × 4 adaptive average pooling layers, respectively; and $x_{weight}$ represents the output weight. The spatial pyramid attention mechanism locates the information of interest by using a spatial pyramid structure instead of global average pooling, and it consists of two parts. As shown in Figure 6, the input feature maps are passed through the spatial pyramid structure, which applies adaptive average pooling at three scales, to generate attention maps. The 1 × 1 adaptive average pooling layer captures the key category information in the feature map, the 2 × 2 pooling layer preserves secondary key feature information in the image, and the 4 × 4 average pooling effectively captures the key position information in the feature map. The generated attention maps are then passed through a weight module, composed of a fully connected layer and a sigmoid activation layer, to generate the attention weights for the corresponding feature map. Through the attention weights output by the attention module, the small objects in the original image are marked more accurately.
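A minimal PyTorch sketch of this attention block follows. Because the 1 × 1, 2 × 2, and 4 × 4 pooled maps have different spatial sizes, the sketch flattens and concatenates them before the fully connected layer; that combination step, along with all layer sizes, is an assumption rather than the authors' exact design.

```python
import torch
import torch.nn as nn


class SpatialPyramidAttention(nn.Module):
    """Channel-attention sketch of the spatial pyramid attention in Equation (7).

    Adaptive average pooling at 1x1, 2x2, and 4x4 captures category, secondary,
    and positional cues; a fully connected layer and a sigmoid turn them into
    per-channel weights that re-scale the input feature map.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in (1, 2, 4)])
        pooled_len = 1 * 1 + 2 * 2 + 4 * 4  # 21 pooled values per channel
        self.fc = nn.Linear(pooled_len, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Pool at the three pyramid scales and flatten per channel: (B, C, 21).
        pyramid = torch.cat([p(x).flatten(2) for p in self.pools], dim=2)
        # Fully connected layer + sigmoid produce one weight per channel.
        weights = self.sigmoid(self.fc(pyramid)).view(b, c, 1, 1)
        # Re-weight the input feature map so small-target channels are emphasized.
        return x * weights


# Example usage on a 40x40 detection-head feature map.
spa = SpatialPyramidAttention(channels=256)
out = spa(torch.randn(1, 256, 40, 40))
print(out.shape)  # torch.Size([1, 256, 40, 40])
```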

4. Experiments

4.1. Datasets

The data selected for the experiments were from the VisDrone-DET dataset. This dataset was collected by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University, China. The dataset consisted of 10,209 still images captured by various drone-mounted cameras, including different locations (taken from 14 different cities that were thousands of kilometers apart in China), different environments (urban and rural), different objects (pedestrians, vehicles, bicycles, etc.), and different densities (sparse and crowded scenes). There were mainly 10 categories of objects. The number of samples for each category is shown in Figure 7.

4.2. Experimental Details

We conducted our experiment using 6471 images from VisDrone2019-DET-train as the training set, 548 images from VisDrone2019-DET-val as the validation set, and 1610 images from VisDrone2019-DET-test-dev as the test set.
For DB-YOLOv5, we set the model input image size to 640 × 640, the batch size to 16, the confidence threshold to 0.25, and the Intersection over Union (IoU) threshold to 0.45. The learning rate was initialized to $10^{-4}$ and halved after every 50% of the training batches. We implemented our model on the Torch 1.8.0 platform and conducted 300 epochs of training on the training and validation sets using a single NVIDIA GeForce RTX 3070.
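For convenience, the settings above can be collected into a single configuration object; the sketch below only summarizes the reported hyperparameters (the key names are ours, not the authors' configuration file):

```python
# Hyperparameters reported for the DB-YOLOv5 experiments, gathered in one place.
train_config = {
    "input_size": (640, 640),      # model input image size
    "batch_size": 16,
    "epochs": 300,
    "initial_lr": 1e-4,            # halved after every 50% of the training batches
    "conf_threshold": 0.25,        # confidence threshold
    "iou_threshold": 0.45,         # Intersection over Union threshold
    "framework": "torch==1.8.0",
    "gpu": "NVIDIA GeForce RTX 3070 (single)",
    "splits": {
        "train": "VisDrone2019-DET-train (6471 images)",
        "val": "VisDrone2019-DET-val (548 images)",
        "test": "VisDrone2019-DET-test-dev (1610 images)",
    },
}
```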

4.3. Quantitative Experiments

To verify the detection effect of the model proposed in the paper, we compared our model with other models in the field of object detection. The detection results are shown in Table 1.
The mean average precision (mAP) results in Table 1 show that, compared with the anchor-based approach adopted by DB-YOLOv5, the anchor-free detection model CornerNet [47] was not well suited to UAV detection. The comparison of our model with FPN [42] also verifies that the BiFPN structure with a bidirectional path, combined with the composite backbone network, can achieve better results than the basic FPN in UAV object detection. The comparisons with Cascade RCNN [10] and Sparse R-CNN [48] show that the one-stage approach outperformed the two-stage approach for the detection of small UAV targets. Compared with YOLO series models such as YOLOv4 [30] and YOLOv4_Drone [39], our model clearly improved both the overall detection performance and the performance on individual categories. For example, in the three categories of "pedestrian", "people", and "motor", our model reached 346% and 292%, 244% and 216%, and 213% and 180% of the corresponding YOLOv4 and YOLOv4_Drone scores, respectively, which verifies the high accuracy of our model for small objects such as pedestrians and motorcycles. However, it can also be observed that our model had lower accuracy on the three target categories of "truck", "tricycle", and "awning-tricycle". Our analysis suggests that detection of small targets with similar semantics will improve only once the model's detection ability for small targets improves further: because of the semantic similarity between tricycles and bicycles, trucks and vans, and awning-tricycles and cars, it is a significant challenge for the model to distinguish them during the learning process. The above experiments demonstrate the effectiveness of the proposed improvements for detecting small UAV targets. We therefore also hope that the improvements proposed in this paper can be applied to other YOLO-family algorithms and bring performance gains to those versions as well, which is work we will continue to study in the future.
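As a quick sanity check of the per-category comparison (no new results, just arithmetic on the Table 1 entries):

```python
# Relative accuracy of DB-YOLOv5 against YOLOv4 and YOLOv4_Drone on three
# categories, computed from the Table 1 scores.
table1 = {  # category: (YOLOv4, YOLOv4_Drone, DB-YOLOv5)
    "pedestrian": (15.20, 18.00, 52.70),
    "people":     (11.50, 13.00, 28.10),
    "motor":      (22.10, 26.00, 47.00),
}
for category, (v4, v4_drone, ours) in table1.items():
    print(f"{category}: {ours / v4:.1f}x YOLOv4, {ours / v4_drone:.1f}x YOLOv4_Drone")
# pedestrian: 3.5x YOLOv4, 2.9x YOLOv4_Drone
# people: 2.4x YOLOv4, 2.2x YOLOv4_Drone
# motor: 2.1x YOLOv4, 1.8x YOLOv4_Drone
```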

4.4. Qualitative Experiments

The detection results of the DB-YOLOv5 model are visualized in Figure 8. The figure includes detection results under various conditions: insufficient light, sufficient light, darkness, blurred images, and a top-down angle. We can see that our method detects small and dense objects well, especially in the central region. Each target in an image is marked by a bounding box whose color is randomly generated; the same category in an image is marked with the same color, whereas different categories are marked with different colors.

4.5. Ablation Experiment

4.5.1. Influence of the Number of Backbone Networks

In the composite backbone network module of DB-YOLOv5, the visualization of the experimental results shows that using two backbone networks worked better than using three backbone networks in this model, as shown in Figure 9. Therefore, in this paper, we connect two CSPDarknet53 backbone networks through the connection module so that the assisting backbone strengthens the main backbone, thereby improving the feature extraction capability of the backbone.

4.5.2. Influence of Composite Backbone Network on Model Parameters

In the CBNet module, we improve the ability of the backbone network to extract image features by combining two identical CSPDarknet53 backbone networks. However, introducing a second backbone substantially increases the number of parameters of the entire network model. Our experiments demonstrate that the proposed improvements can still meet the requirements of high precision and short time consumption for UAV object detection. The parameter comparison for the improved model is shown in Table 2. Based on these results, although the introduction of the composite backbone network structure led to a significant increase in parameters, the parameter count did not grow by a large multiple (for example, from 7.2 M to 12.9 M for the s variant, a factor of about 1.8). At the same time, the floating point operations (FLOPs) of the model roughly doubled or more. Therefore, after experimental verification, it can be concluded that adding a composite backbone network structure to the model is feasible and has practical application prospects.

4.5.3. The Influence of Each Module on the Model

To verify the performance of the improved detection model, we compared the proposed model with the original YOLOv5 model and calculated the mean average precision (mAP) for evaluation. As discussed in Section 3, we refer to the model with the CBNet module as YOLOv5_cb, the model with the CBNet-tiny module as YOLOv5_cbty, the YOLOv5_cb model with the BiFPN module as YOLOv5_bi, and the full model proposed in this paper as DB-YOLOv5. The experimental results are shown in Table 3.
The results in Table 3 show that the three improvements proposed in this paper significantly increased the detection accuracy on the UAV dataset VisDrone-DET. Compared with the baseline model, the YOLOv5_cbty model with the faster CBNet-tiny module achieved a 0.82% improvement in mAP. The model with the complete CBNet module improved mAP by 1.21% over the benchmark model, a further 0.4% over the YOLOv5_cbty model. On this basis, the model obtained by adding the BiFPN module improved by 2.3% over the benchmark model and by 1.1% over the YOLOv5_cb model. Moreover, our proposed DB-YOLOv5 improved by nearly 3.1% over the benchmark model and by 0.8% over YOLOv5_bi. The contributions of the three modules to our model, from high to low, were CBNet, BiFPN, and SPANet.

5. Conclusions

In this paper, we proposed a DB-YOLOv5 UAV object detection algorithm to address the issue of detecting small targets in UAV images. We built the model on top of YOLOv5 by incorporating a composite backbone network, bidirectional feature pyramid, and pyramid attention mechanism, which improved the network’s capability for multi-scale feature fusion and small target detection. Our experiments on the VisDrone-DET dataset demonstrated that the proposed model achieved better performance in terms of objective detection metrics, making it suitable for small target detection tasks in UAV images. The proposed method has significant implications for security surveillance, particularly in the field of network security and privacy. By identifying small targets in UAV images, our approach can aid in detecting potential security threats, such as identifying security vulnerabilities in critical infrastructure and monitoring public events for potential security risks. Overall, this research provides a valuable contribution to the field of security surveillance by enhancing the capabilities of object detection algorithms for small target detection in UAV images, ultimately improving network security and privacy. Our next step is to develop a practical platform based on the simulation experiments. However, since we need to collect image data before learning, this process may be relatively slow. Our ultimate goal is to create a real-time platform and conduct experiments in a real environment. These are the directions for our future work.

Author Contributions

Conceptualization: L.T. and Y.L.; methodology: Y.L. and X.H.; formal analysis and investigation: Y.L.; writing—original draft preparation: Y.L.; writing—review and editing: Y.L., W.L., L.T., H.Z., X.H. and X.J.; supervision: L.T. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Chongqing (CSTB2022NSCQ-MSX1415).

Data Availability Statement

The dataset used during the current study is available at: http://aiskyeye.com/download/object-detection-2/ (accessed on 27 June 2023).

Conflicts of Interest

All authors have read the final manuscript, have approved the submission to the journal, and have accepted full responsibilities pertaining to the manuscript’s delivery and contents. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Erdelj, M.; Natalizio, E.; Chowdhury, K.R. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Comput. 2017, 16, 24–32. [Google Scholar] [CrossRef]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  3. Girshick, R.; Donahue, J.; Darrell, T. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef]
  4. Zou, Z.; Shi, Z.; Guo, Y. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE Press: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Dai, J.; Li, Y.; He, K. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. (NIPS) 2016, 29, 379–387. [Google Scholar]
  8. Zhu, Y.; Zhao, C.; Wang, J. Couplenet: Coupling global structure with local parts for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4126–4134. [Google Scholar]
  9. He, K.; Gkioxari, G.; Dollár, P. Mask r-cnn. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  10. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  11. Pang, J.; Chen, K.; Shi, J. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  12. Xu, H.; Jiang, C.; Liang, X. Reasoning-rcnn: Unifying adaptive global reasoning into large-scale object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6412–6421. [Google Scholar]
  13. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  14. Cao, J.; Cholakkal, H.; Rao, M.A. D2Det: Towards high quality object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11482–11491. [Google Scholar]
  15. Guo, J.; Han, K.; Wang, Y. Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 2154–2164. [Google Scholar]
  16. Li, Y.; Wu, C.Y.; Fan, H. Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4804. [Google Scholar]
  17. Gao, Z.; Wang, L.; Han, B. AdaMixer: A fast-converging query-based object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5354–5363. [Google Scholar]
  18. Liu, W.; Anguelov, D.; Erhan, D. Ssd: Single Shot Multibox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Amsterdam, The Netherlands, 2016; pp. 21–37. [Google Scholar]
  19. Fu, C.Y.; Liu, W.; Ranga, A. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  20. Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960. [Google Scholar]
  21. Zhu, R.; Zhang, S.; Wang, X. ScratchDet: Training single-shot object detectors from scratch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2263–2272. [Google Scholar]
  22. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 850–859. [Google Scholar]
  23. Lu, X.; Li, Q.; Li, B. MimicDet: Bridging the gap between one-stage and two-stage object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference (ECCV), Glasgow, UK, 23–28 August 2020; Springer International Publishing: Glasgow, UK, 2020; pp. 541–557. [Google Scholar]
  24. Chen, C.; Zheng, Z.; Huang, Y. I3Net: Implicit instance-invariant network for adapting one-stage object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 12571–12580. [Google Scholar]
  25. Zhao, Y.; Guo, X.; Lu, Y. Semantic-aligned fusion transformer for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7591–7601. [Google Scholar]
  26. Long, Y.; Wen, Y.; Han, J. CapDet: Unifying dense captioning and open-world detection pretraining. arXiv 2023, arXiv:2303.02489. [Google Scholar]
  27. Redmon, J.; Divvala, S.; Girshick, R. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  28. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  29. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  31. Ge, Z.; Liu, S.; Wang, F. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  32. Chen, Q.; Wang, Y.; Yang, T. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13034–13043. [Google Scholar]
  33. Li, C.; Li, L.; Jiang, H. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  34. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2023, arXiv:2207.02696. [Google Scholar]
  35. Cheng, D.; Zhi, W.; Chi, Z. CMDNet: Coarse-grained density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 2789–2798. [Google Scholar]
  36. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward accurate and efficient object detection on drone imagery. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1026–1033. [Google Scholar]
  37. Wang, J.; Yang, W.; Guo, H. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3791–3798. [Google Scholar]
  38. Xu, C.; Wang, J.; Yang, W. Dot distance for tiny object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 1192–1201. [Google Scholar]
  39. Tan, L.; Lv, X.; Lian, X. YOLOv4_Drone: Uav image target detection based on an improved yolov4 algorithm. Comput. Electr. Eng. 2021, 93, 107261. [Google Scholar] [CrossRef]
  40. Yang, J.; Yang, H.; Wang, F. A modified yolov5 for object detection in uav-captured scenarios. In Proceedings of the 2022 IEEE International Conference on Networking, Sensing and Control (ICNSC), Shanghai, China, 15–18 December 2022; pp. 1–6. [Google Scholar]
  41. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 13–19 June 2020; pp. 1571–1580. [Google Scholar]
  42. Liu, S.; Qi, L.; Qin, H. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  43. Lin, T.; Dollár, P.; Girshick, R. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  44. Sun, Z.; Zhang, Q.; Li, Y.; Tan, Y.A. DPPDL: A Dynamic Partial-Parallel Data Layout for Green Video Surveillance Storage. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 193–205. [Google Scholar] [CrossRef]
  45. Yu, X.; Zhang, C.; Xue, Y.; Zhu, H.; Li, Y.; Tan, Y.A. An Extra-Parity Energy Saving Data Layout for Video Surveillance. Multimed. Tools Appl. 2018, 77, 4563–4583. [Google Scholar]
  46. Zhang, Q.; Ma, W.; Wang, Y. Backdoor Attacks on Image Classification Models in Deep Neural Networks. Chin. J. Electron. 2022, 31, 199–212. [Google Scholar] [CrossRef]
  47. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  48. Sun, P.; Zhang, R.; Jiang, Y. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14454–14463. [Google Scholar]
Figure 1. After the YOLOv5 model performs object detection on the VisDrone dataset, there is a phenomenon of both missed detection and false positives. The figure shows the detected targets marked with positioning boxes of different colors, but many vehicles and pedestrians are still not detected or are incorrectly identified.
Figure 2. The overall architecture of the DB-YOLOv5 object detection model. The green module represents the image feature obtained through the model network; the number above the module represents the channel number of the feature.
Figure 3. Diagram of the CBNet module’s structure.
Figure 4. Diagram of the CBNet-tiny module’s structure.
Figure 5. Comparison among different feature extraction modules. (a) Top-down FPN structure; (b) top-down and bottom-up PANET structure; (c) BiFPN structure with bidirectional path.
Figure 6. The structure of the spatial pyramid attention mechanism.
Figure 7. The number of samples in the categories in the VisDrone-DET dataset.
Figure 8. The detection effect of the DB-YOLOv5 model in different scenarios. The images on the left are the original images in the dataset, and the images on the right are the output images of the model. (a) The detection effect under insufficient light; (b,c) the detection results when it is dark; (d,g) the detection results when the light is sufficient; (e) the detection effect when the light is sufficient but the image is blurred; (f,h) the detection effect diagram from the top-down angle when the light is sufficient; (i,j) the detection results from the top-down angle when it is dark.
Figure 9. Schematic diagram of the detection results of two backbone networks (a) and three backbone networks (b) in the composite backbone network. From the comparison of the two figures, the number of targets detected in (b) is significantly less than in (a), which proves that the effect of the two backbone networks in the composite backbone network is better than that of the three backbone networks.
Table 1. Experimental results of different object detection models on the VisDrone-DET dataset.
| Method | mAP | Ped | People | Bicycle | Car | Van | Truck | Tricycle | Awn. * | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CornerNet [47] | 17.41 | 20.43 | 6.55 | 4.56 | 40.94 | 20.23 | 20.54 | 14.03 | 9.25 | 24.39 | 12.1 |
| Light-RCNN [48] | 16.53 | 17.02 | 4.83 | 5.73 | 32.39 | 22.12 | 18.39 | 16.63 | 11.91 | 29.02 | 11.93 |
| FPN [42] | 16.51 | 15.69 | 5.02 | 4.93 | 38.47 | 20.82 | 18.82 | 15.03 | 10.84 | 26.72 | 12.83 |
| Cascade RCNN [10] | 16.09 | 16.28 | 6.16 | 14.85 | 4.18 | 37.29 | 17.11 | 14.48 | 12.37 | 20.38 | 24.31 |
| Sparse R-CNN [48] | 36.70 | 26.50 | 18.40 | 11.80 | 56.00 | 35.80 | 25.40 | 19.50 | 12.20 | 43.30 | 26.10 |
| YOLOv4 [30] | 40.99 | 15.20 | 11.50 | 22.40 | 65.40 | 60.70 | 59.40 | 33.60 | 52.60 | 71.40 | 22.10 |
| YOLOv4_Drone [39] | 45.67 | 18.00 | 13.00 | 23.00 | 69.00 | 62.00 | 68.00 | 42.00 | 60.00 | 76.00 | 26.00 |
| YOLOv5 | 48.44 | 49.20 | 24.80 | 25.90 | 74.30 | 62.20 | 69.60 | 31.40 | 30.50 | 73.40 | 43.10 |
| DB-YOLOv5 | 51.53 | 52.70 | 28.10 | 28.40 | 77.10 | 72.30 | 65.40 | 33.80 | 33.90 | 76.60 | 47.00 |
* The meaning of awn. is awning-tricycle.
Table 2. Comparison of parameters of different detection models.
| Model | Size (Pixels) | Params (M) | FLOPs@640 (B) |
|---|---|---|---|
| YOLOv5s | 640 | 7.2 | 12.6 |
| DB-YOLOv5s | 640 | 12.9 | 41.6 |
| YOLOv5m | 640 | 21.2 | 49.0 |
| DB-YOLOv5m | 640 | 32.8 | 102.6 |
| YOLOv5l | 640 | 46.5 | 109.1 |
| DB-YOLOv5l | 640 | 88.4 | 220.3 |
| YOLOv5x | 640 | 86.7 | 205.7 |
| DB-YOLOv5x | 640 | 176.5 | 440.6 |
Table 3. Ablation experiment results of the model on the VisDrone-DET dataset.
| Model | CBNet-tiny | CBNet | Bi-FPN | SPANet | mAP |
|---|---|---|---|---|---|
| YOLOv5 | | | | | 48.44% |
| YOLOv5_cbty | ✓ | | | | 49.26% |
| YOLOv5_cb | | ✓ | | | 49.65% |
| YOLOv5_bi | | ✓ | ✓ | | 50.74% |
| DB-YOLOv5 | | ✓ | ✓ | ✓ | 51.53% |
