Article

An Efficient UAV Image Object Detection Algorithm Based on Global Attention and Multi-Scale Feature Fusion

1 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
2 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 3989; https://doi.org/10.3390/electronics13203989
Submission received: 20 August 2024 / Revised: 20 September 2024 / Accepted: 10 October 2024 / Published: 10 October 2024

Abstract

Object detection technology holds significant promise in unmanned aerial vehicle (UAV) applications. However, traditional methods face challenges in detecting denser, smaller, and more complex targets within UAV aerial images. To address issues such as target occlusion and dense small objects, this paper proposes a multi-scale object detection algorithm based on YOLOv5s. A novel feature extraction module, DCNCSPELAN4, which combines CSPNet and ELAN, is introduced to enhance the receptive field of feature extraction while maintaining network efficiency. Additionally, a lightweight Vision Transformer module, the CloFormer Block, is integrated to provide the network with a global receptive field. Moreover, the algorithm incorporates a three-scale feature fusion (TFE) module and a scale sequence feature fusion (SSFF) module in the neck network to effectively leverage multi-scale spatial information across different feature maps. To address dense small objects, an additional small object detection head was added to the detection layer. The original large object detection head was removed to reduce computational load. The proposed algorithm has been evaluated through ablation experiments and compared with other state-of-the-art methods on the VisDrone2019 and AU-AIR datasets. The results demonstrate that our algorithm outperforms other baseline methods in terms of both accuracy and speed. Compared to the YOLOv5s baseline model, the enhanced algorithm achieves improvements of 12.4% and 8.4% in AP50 and AP metrics, respectively, with only a marginal parameter increase of 0.3 M. These experiments validate the effectiveness of our algorithm for object detection in drone imagery.

1. Introduction

In recent years, the decreasing manufacturing cost of unmanned aerial vehicles (UAVs) has led to the gradual replacement of traditional aerial photography with UAV-based technology, transitioning from military to civilian applications [1]. The high maneuverability of UAVs makes aerial photography more flexible and adaptable, resulting in video data with richer information in both time and content. Consequently, UAV object detection technology has found widespread applications in intelligent monitoring, traffic control, flight path planning, map drawing, and other scenarios, offering vast potential for further use [2].
Object detection technology essentially extends image classification tasks, aiming to distinguish moving targets from backgrounds of varying complexity to accomplish tasks like image segmentation [3], scene understanding [4], and target tracking [5]. Recent advancements in object detection technology are closely tied to the development of deep convolutional neural networks and the promotion of benchmark datasets such as MS COCO [6], PASCAL VOC [7], and ImageNet [8]. However, these commonly used datasets, derived from conventional viewpoints, cannot be directly applied to UAV aerial footage.
UAV-mounted cameras primarily capture images from a top-down perspective, presenting unique challenges for object detection tasks [9], as illustrated in Figure 1. Compared to datasets from conventional viewpoints (often side views), UAV-perspective datasets exhibit several distinct characteristics: (1) Aerial photography challenges: during aerial photography, issues such as poor drone stability, limited camera resolution, and environmental changes can introduce view jitter, lighting variations, reduced resolution, and camera distortion in video frames. (2) Object density and scale: Objects in aerial footage are typically unevenly distributed and small in scale, often occupying only a few pixels. This presents challenges such as non-maximum suppression (NMS), affecting object boundaries and causing target distortion. (3) Occlusions: In conventional viewpoints, occlusions primarily occur between objects. From a UAV perspective, occlusions are predominantly caused by environmental elements like trees and buildings, which obstruct object features and complicate feature extraction. These characteristics highlight the need for specialized object detection algorithms tailored specifically for UAV aerial imagery.
Additionally, for small drones, the ideal application scenario involves synchronizing the processing of aerial image data on embedded systems to enable automatic obstacle avoidance and mission planning. To meet the operational requirements of embedded hardware, it is crucial to control the model parameters within a certain range and ensure that the model’s processing speed meets real-time demands.
This paper presents a series of enhancements to the YOLOv5s neural network model, effectively improving detection accuracy while preserving the lightweight characteristics of YOLOv5s. Deploying the enhanced algorithm on embedded systems in small drones can enhance UAV competitiveness in object detection and related tasks. The key contributions of this paper are as follows:
  • DCNCSPELAN4 module: A new lightweight feature extraction module, DCNCSPELAN4, is designed, which enhances the ELAN [10] structure based on CSPNet [11] and incorporates deformable convolution (DCN) [12]. Compared to the C3 structure in the YOLOv5 backbone network, this module provides improved feature extraction results and lightweight advantages.
  • CloFormer Block: This paper introduces the CloFormer Block [13], a lightweight Vision Transformer module, into the backbone network. By efficiently integrating low-frequency global information with high-frequency local information, this module provides the neural network with a comprehensive global receptive field.
  • Three-scale feature fusion modules: Two three-scale feature fusion modules are incorporated into the neck network. These modules adeptly handle semantic information and spatial features from both deep and shallow layers, thus improving detection accuracy for multi-scale targets, particularly small ones.
  • Detection head improvement: The detection scales of YOLOv5 are enhanced by adding an additional small detection head and removing the original large detection head. This adjustment balances accuracy with model size, enhancing overall performance.
The improved algorithm in this paper performs excellently on the VisDrone-DET2019 dataset and AU-AIR dataset. The ablation experiments demonstrate the effectiveness of the proposed modules. The experimental results show that the improved YOLOv5 model achieves significant performance improvement with minimal growth in network parameters and computational complexity. Additionally, comparative experiments were designed to prove the efficiency of the proposed network model among mainstream object detection algorithms.
The organization of this paper is as follows: Section 2 reviews related work, Section 3 details the proposed method, Section 4 presents the experimental results, and Section 5 concludes the paper.

2. Related Work

2.1. Object Detection

The neural network models used for UAV object detection can be broadly categorized into single-stage and two-stage networks. Two-stage network models originated from design concepts such as sliding windows, image pyramids, and bounding box regression. Classic two-stage networks include R-CNN [14], SPP-NET [15], Fast-RCNN [16], Mask-RCNN, Cascade RCNN [17], and Faster-RCNN [16]. However, due to their complexity and large number of parameters, two-stage networks are not suitable for deployment on mobile devices. With limited computational resources, these networks exhibit slow computation speeds, which may not meet the real-time requirements for object detection from a UAV perspective.
Single-stage algorithms, on the other hand, bypass the need to select target regions, directly classifying and regressing the candidate frames. Typical networks include the YOLO series [18], RefineNet [19], and SSD [20]. The YOLO algorithm is currently one of the most widely used algorithms in UAV object detection, offering notable advantages in computational speed. YOLOv4 [21] introduced CSPDarkNet as the backbone network, enhancing the learning capability of the convolutional neural network while enabling the network to maintain accuracy in feature map extraction, all while being lightweight. YOLOv5 [22] introduced the Focus module in the initial structure of the backbone, improving the processing speed of input images, and adopted the PAFPN module in the neck. YOLOv7 [23] introduced the E-ELAN structure, which allows for multi-scale feature training without increasing the shortest gradient path length. YOLOv8 introduced the C2f module in the backbone stage. In our work, improvements are made based on YOLOv5, YOLOv7, and YOLOv8. The ablation experiments verified that the improved YOLOv5s algorithm achieved the best balance between accuracy and speed.

2.2. YOLO-Based UAV Object Detection Algorithm

Since the number and positions of YOLO’s candidate boxes are fixed, its performance in target box regression is relatively weaker. As a result, researchers have proposed various modifications to the YOLO algorithm’s neural network structure to enhance object detection performance from a UAV perspective.
Skakun et al. [24] improved the YOLOv4 network by proposing the YOLOv4eff network. This version utilizes four sets of Cross-Stage-Partial (CSP) [11] structures to link the backbone network with the neck network, changes the activation function to Swish, and sets Letterbox to 1 for efficiency. However, this model’s complex structure and high computational resource requirements make it unsuitable for resource-constrained devices. Liu et al. [25] introduced the Multi-branch Parallel Feature Pyramid Network (MPFPN) to extract richer small target features and incorporated a Supervised Spatial Attention Module (SSAM) to reduce background noise impact through an attention mechanism. Despite these enhancements, this network struggles with imbalanced training samples, which can lead to detection errors. Liu et al. [26] proposed the TridentFPN backbone structure along with a new attention mechanism to improve multi-scale object prediction in UAV images. However, this model suffers from slower detection speeds, making it less suitable for real-time detection. Drone-YOLO [27], an enhancement of YOLOv8, integrated feature fusion modules and designs detection models of different sizes tailored for various application scenarios. Nonetheless, the processing speed of each model in this algorithm is suboptimal, failing to meet the real-time demands of UAV object detection tasks. To improve UAV object detection performance, Wang et al. introduced an efficient end-to-end detector named SPB-YOLO [28]. This detector features a new Strip Bottleneck (SPB) module and employs an upsampling strategy based on the Path Aggregation Network (PANet). However, this network is prone to missed detections and false alarms in complex backgrounds during the detection process. Enhancing the feature extraction module of the YOLO network, integrating multi-scale feature fusion techniques, and expanding the network’s receptive field have proven effective in improving UAV object detection performance. However, existing methods often struggle to balance detection accuracy, model size, and computational speed. To address these challenges, we propose lightweight feature extraction modules and multi-scale fusion modules. These modules optimize the feature extraction process across multi-scale feature maps and improve the performance of detection heads.

2.3. Vision Transformer

In recent years, Vision Transformers (ViTs) equipped with global self-attention mechanisms have shown exceptional performance in various visual tasks, including object detection, instance segmentation, and image classification. Prominent ViT models such as Swin Transformer [29] and Mobile ViT [30] transform images into sequence data and feed them into Transformer models to capture the relationships between different positions within the images, enabling the global modeling of image features. Introducing the Multi-Head Self-Attention (MSHA) mechanism [31] of Transformers into YOLO networks provides the network with a global receptive field, allowing it to extract higher-level feature representations and capture the critical features of objects in images. For instance, the TPH-YOLOv5 [32] model enhanced YOLOv5 by incorporating a Transformer prediction head and an attention model (CBAM), which effectively detects dense small objects from a UAV perspective. Feng et al. [33] designed a fusion of MSHA and CSP Bottleneck, embedding it into the connecting layers of the YOLOv5 backbone and neck networks, yielding positive results in UAV vehicle recognition tasks. To improve the model’s focus on significant object information, Wang et al. [34] employed a dynamic sparse attention mechanism called Biformer to enhance the YOLOv8 backbone network. This enhancement boosts feature extraction with limited computational resources while accurately detecting occluded targets. Additionally, PVswin-YOLOv8 [35] integrated a Swin Transformer Block into the lower layers of the backbone network, enhancing the detection of small objects through global feature extraction capabilities. In our work, a CNN-Transformer hybrid module, CloFormer Block, is introduced to the backbone network, enabling the model to acquire global self-attention.
By leveraging limited computational resources, these improvements bolster the network’s capability to extract features from small and occluded targets. Consequently, the enhanced YOLOv5 algorithm achieves high speed, efficiency, and ease of deployment in UAV object detection tasks.

3. Methodologies

YOLOv5, as one of the most mature versions in the YOLO series, uses CSPDarknet53 [36] and the Path Aggregation Network (PANet) [37] as the backbone and neck networks, achieving a good balance between detection accuracy and speed. Additionally, YOLOv5 provides flexible and user-friendly training and deployment interfaces, making it easier for users to apply this algorithm in various practical scenarios. The YOLOv5 algorithm has a very adaptable network structure, divided into five versions based on neural network depth and the number of channels: YOLOv5s [22], YOLOv5m, YOLOv5n [22], YOLOv5l, and YOLOv5x. In this paper, we select YOLOv5s, with its relatively simple structure, as the baseline model.
The modified YOLOv5 neural network structure is depicted in Figure 2. First, a lightweight feature extraction module, DCNCSPELAN4, replaces the original C3 module in the backbone network, enhancing feature extraction efficiency. Following this, a lightweight Transformer module, the CloFormer Block, is integrated at the base of the backbone network to effectively manage both global and local information. The neck network then incorporates a three-scale feature fusion operation (SSFF) and a three-scale feature map channel concatenation operation (TFE), allowing for better utilization of spatial and channel information across different scale feature maps. Finally, an additional small-object detection head is introduced in the output layer, replacing the original large-object detection head, thereby improving the algorithm’s ability to detect dense small objects effectively.

3.1. DCNCSPELAN4 Module

In the YOLO series, a common method to enhance the feature extraction capability of the backbone network is by deepening and widening the network model. However, these operations increase computational complexity, making deployment on mobile devices difficult and failing to meet real-time requirements. To address this issue, we designed a new feature extraction module, DCNCSPELAN4, to replace the C3 module in the original YOLOv5 backbone network.
The DCNCSPELAN4 module mimics the design of CSPNet [11], as shown in Figure 3a, allowing the convolution operation in the ELAN [10] structure to be replaced with any computable module. The ELAN structure, depicted in Figure 3b, is a classic module in deep learning primarily used for label propagation tasks. The structure of the CSPELAN module [38], shown in Figure 3c, introduces cross-stage partial connections based on ELAN, enabling the network to better utilize feature information from previous stages, thereby enhancing the network’s efficiency and performance.
When the underlying feature map of a certain stage passes through the DCNCSPELAN4 module, the input's channels are divided into two parts, $x_0 = [x_0', x_0'']$. One part is directly connected to the end of the stage, while the other part goes through several feature extraction modules. This ensures that neither side contains redundant gradient information belonging to the other. Overall, the DCNCSPELAN4 module retains the advantages of the ELAN module while preventing excessive redundant gradient information by truncating the gradient flow. Introducing the DCNCSPELAN4 module effectively improves the performance of deep neural networks, reducing computational and parameter overhead.
The overall structure of our DCNCSPELAN4 module is illustrated in Figure 4. Central to achieving high performance in this module is the use of deformable convolution. Deformable convolution (DCN) [12] represents a specialized convolution operation that introduces spatial variability, thereby enhancing adaptability to spatial changes. In Figure 4, the distribution of targets under aerial photography is depicted with gray shading. Traditional convolution operations are static, where the shape and position of the convolution kernel remain fixed and cannot adjust to the spatial variations in targets. In contrast, deformable convolution dynamically modifies the shape and position of the convolution kernel by incorporating learnable offset values.
The advantage of DCNCSPELAN4 is that it introduces a series of offset predictions. These predictions are used to adjust the sampling position of the convolution kernel, enabling spatial dynamic adjustments based on the content of the input feature map. Consider a standard 3 × 3 convolution kernel on a 2D plane. For each position $P_0$ on the output feature map $y$, the convolution operation can be expressed as follows:
$$y(P_0) = \sum_{P_n \in R} w(P_n)\, x(P_0 + P_n)$$
where $w$ is the convolution kernel weight, $x$ is the input feature map, $P_n$ enumerates the positions in the convolution kernel, and $R$ is the regular sampling grid. In the DCNCSPELAN4 module, a learnable offset $\Delta P_n$ and a modulation scalar $\Delta m_n$ are introduced on the regular grid:
$$y(P_0) = \sum_{P_n \in R} w(P_n)\, x(P_0 + P_n + \Delta P_n)\, \Delta m_n$$
The modulation scalar $\Delta m_n$ lies within the range $[0, 1]$, and $\Delta P_n$ is a real-valued offset without constraints. For adaptive spatial aggregation, both the sampling offsets and modulation scalars are learnable and adjusted by the input.
With this design, DCNCSPELAN4 can effectively adapt to the spatial changes in targets, expand the receptive field of the convolution kernel, and enhance the model’s capacity to handle target deformations. Overall, this module improves feature extraction capabilities while achieving lightweight efficiency.
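To make the mechanism concrete, the sketch below shows one way such a block could be assembled in PyTorch using torchvision's DeformConv2d (which accepts a modulation mask in recent versions): a small convolution predicts the offsets $\Delta P_n$ and modulation scalars $\Delta m_n$, and a CSP/ELAN-style wrapper splits the channels and concatenates the intermediate outputs. The class names, stage count, and channel widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ModulatedDeformBlock(nn.Module):
    """3x3 modulated deformable convolution: offsets and modulation
    scalars are predicted from the input feature map."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # 2 offsets (dx, dy) + 1 modulation value per kernel position
        self.offset_mask = nn.Conv2d(c_in, 3 * k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(c_in, c_out, k, padding=k // 2)
        self.k = k

    def forward(self, x):
        om = self.offset_mask(x)
        n = 2 * self.k * self.k
        offset, mask = om[:, :n], om[:, n:]
        mask = torch.sigmoid(mask)          # modulation scalars in [0, 1]
        return self.dcn(x, offset, mask)


class DCNCSPELAN4Sketch(nn.Module):
    """CSP/ELAN-style wrapper: split channels, run one part through stacked
    deformable stages, and concatenate all intermediate outputs."""

    def __init__(self, c_in, c_out, c_hidden):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, 2 * c_hidden, 1)
        self.stage1 = ModulatedDeformBlock(c_hidden, c_hidden)
        self.stage2 = ModulatedDeformBlock(c_hidden, c_hidden)
        self.cv2 = nn.Conv2d(4 * c_hidden, c_out, 1)

    def forward(self, x):
        y1, y2 = self.cv1(x).chunk(2, dim=1)      # split x0 into two parts
        y3 = self.stage1(y2)                      # deformable feature extraction
        y4 = self.stage2(y3)
        return self.cv2(torch.cat([y1, y2, y3, y4], dim=1))
```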

3.2. CloFormer Block

In recent years, the Vision Transformer (ViT) architecture has demonstrated excellent performance in object detection. However, early ViTs were limited by their large models and computational requirements, making deployment on mobile devices challenging. To address this, current research focuses on integrating Transformer modules with global self-attention mechanisms and traditional CNN modules to create more lightweight models. This paper introduces the CloFormer Block from the lightweight ViT CloFormer [13] into the backbone network to effectively integrate local and global information.
As shown in Figure 5, the CloFormer Block consists of a local branch and a global branch. In the global branch, depicted on the left side of the figure, the module first downsamples K and V, and then applies standard attention operations on Q, K, and V to extract low-frequency global information. The output of the global branch $X_{global}$ can be expressed as follows:
$$X_{global} = \mathrm{Attention}(Q_g, \mathrm{Pool}(K_g), \mathrm{Pool}(V_g))$$
where Attention represents the self-attention mechanism, and Pool represents the pooling operation.
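A minimal PyTorch sketch of this pooled-attention computation is given below; the use of average pooling, the pooling size, and the head count are assumptions made for illustration rather than details taken from the paper.

```python
import torch.nn.functional as F


def global_branch_attention(q, k, v, num_heads=4, pool_size=2):
    """Low-frequency global branch: standard attention over pooled K and V.

    q, k, v: (B, C, H, W) maps produced by the block's linear projections.
    """
    B, C, H, W = q.shape
    head_dim = C // num_heads
    k = F.avg_pool2d(k, pool_size)             # Pool(K_g)
    v = F.avg_pool2d(v, pool_size)             # Pool(V_g)

    def to_tokens(x):                          # (B, C, h, w) -> (B, heads, h*w, head_dim)
        b, c, h, w = x.shape
        return x.reshape(b, num_heads, head_dim, h * w).transpose(-1, -2)

    q_t, k_t, v_t = to_tokens(q), to_tokens(k), to_tokens(v)
    attn = (q_t @ k_t.transpose(-1, -2)) * head_dim ** -0.5
    attn = attn.softmax(dim=-1)
    out = attn @ v_t                           # (B, heads, H*W, head_dim)
    return out.transpose(-1, -2).reshape(B, C, H, W)
```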
The global branch of the Transformer utilizes the common global self-attention mechanism, effectively reducing the computational cost of repetitive attention segments while providing the CloFormer Block with a global receptive field. Although the global branch can efficiently handle low-frequency global information, its ability to process high-frequency local information is limited. Therefore, the convolution operation AttnConv [13] from CNN architecture is used to extract local information. The right part of Figure 5 illustrates the AttnConv structure.
In AttnConv, a linear transformation is first applied to obtain Q, K, and V. Then, a local feature aggregation process with shared weights is applied to V:
$$V_s = \mathrm{DWconv}(V)$$
where DWconv [39] represents depth-wise convolution. This operation applies a simple depth-wise convolution to the value vectors to aggregate local information, with weights shared globally. Following this, a context-aware local enhancement is performed on the processed Q, K, and V.
In the local enhancement operation, the first step is to combine Q and K to generate context-aware weights. Two DWconvs are used to aggregate local information from Q and K, respectively. Next, the Hadamard product of the locally aggregated Q and K is computed, and the result undergoes a series of transformations to obtain context-aware weights between −1 and 1. The output of the local branch $X_{local}$ can be expressed as follows:
$$Q_l = \mathrm{DWconv}(Q)$$
$$K_l = \mathrm{DWconv}(K)$$
$$\mathrm{Attn}_t = \mathrm{FC}(\mathrm{Swish}(\mathrm{FC}(Q_l \odot K_l)))$$
$$\mathrm{Attn} = \mathrm{Tanh}\left(\frac{\mathrm{Attn}_t}{\sqrt{d}}\right)$$
$$X_{local} = \mathrm{Attn} \odot V_s$$
where $d$ is the number of channels of the feature map, and $\odot$ represents the Hadamard product. Swish and Tanh are two activation functions.
Compared to ordinary attention mechanisms, AttnConv introduces stronger nonlinearity. While the ordinary attention mechanism employs a single nonlinear operator, Softmax, AttnConv uses Tanh and Swish as activation functions. This increased nonlinearity enables more efficient generation of context-aware weights.
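The following PyTorch sketch mirrors the equations above, with SiLU standing in for Swish (they are the same function) and 1 × 1 convolutions acting as the per-position FC layers; the 3 × 3 depth-wise kernel size is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnConvSketch(nn.Module):
    """Local branch of the CloFormer Block: depth-wise convs on Q, K, V,
    context-aware weights from the Hadamard product, Tanh/Swish nonlinearities."""

    def __init__(self, dim):
        super().__init__()
        self.dw_q = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.dw_k = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # shared-weight aggregation of V
        self.fc1 = nn.Conv2d(dim, dim, 1)    # "FC" layers applied per position
        self.fc2 = nn.Conv2d(dim, dim, 1)
        self.scale = dim ** -0.5             # 1 / sqrt(d)

    def forward(self, q, k, v):
        v_s = self.dw_v(v)                              # V_s = DWconv(V)
        q_l, k_l = self.dw_q(q), self.dw_k(k)           # Q_l, K_l
        attn = self.fc2(F.silu(self.fc1(q_l * k_l)))    # FC(Swish(FC(Q_l ⊙ K_l)))
        attn = torch.tanh(attn * self.scale)            # Tanh(Attn_t / sqrt(d))
        return attn * v_s                               # X_local = Attn ⊙ V_s
```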
The CloFormer Block’s dual-branch structure effectively integrates high-frequency local information with low-frequency global information, enhancing the feature extraction capability of the backbone network. The nonlinearity introduced by this module enables the YOLOv5 network to capture complex relationships within the data more effectively, thereby improving the model’s generalization ability. Although CloFormer, as a lightweight Transformer, has relatively few parameters and computational demands, it still imposes a certain computational load on the CNN model. Therefore, this module is applied only at the lower-resolution stages of the backbone network, allowing for superior feature extraction while preserving the model’s lightweight characteristics.

3.3. Three-Scale Feature Fusion Modules

In the neck network, this paper introduces a three-scale feature fusion framework to replace the original network’s channel splicing operation of two-scale feature maps. This framework consists of two feature fusion modules:
(1) Scale sequence feature fusion (SSFF) module: this module combines global- or high-level semantic information from multiple scale images, enhancing the network's ability to utilize diverse feature representations effectively.
(2) Triple Feature Encoder (TFE) module: this module captures the local details of small targets, significantly improving the detection accuracy for small objects by encoding features from three different scales [40].
To address the challenge of multi-scale feature extraction, YOLOv5 employs a feature pyramid structure to merge pyramid features using addition or concatenation methods. However, this approach may not fully exploit the relationships between feature maps of different scales. As depicted in Figure 6a, the scale sequence feature fusion (SSFF) method enhances this capability by effectively integrating the high-dimensional information from deep feature maps with the fine-grained details from shallow feature maps. While a blurry image may lose some details, its structural features remain preserved [41]. Thus, the scaled images used as input for SSFF can be derived through the following operation:
$$F_\sigma(w, h) = G_\sigma(w, h) \times f(w, h), \qquad G_\sigma(w, h) = \frac{1}{2\pi\sigma^2} e^{-(w^2 + h^2)/(2\sigma^2)}$$
where $f(w, h)$ is the original 2D input image, with $w$ and $h$ denoting its width and height, respectively. The input image, processed through the two-dimensional Gaussian filter $G_\sigma(w, h)$, produces the output $F_\sigma(w, h)$. Here, $\sigma$ is the standard deviation (scale parameter) of the Gaussian filter used for the convolution.
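As a concrete illustration, the sketch below builds the Gaussian kernel from this formula and applies it channel-wise to a feature map; the truncation radius of roughly 3σ and the channel-wise application are implementation assumptions, not details specified by the paper.

```python
import math

import torch
import torch.nn.functional as F


def gaussian_kernel(sigma, radius=None):
    """Discrete 2D Gaussian kernel G_sigma(w, h) from the formula above."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    coords = torch.arange(-radius, radius + 1, dtype=torch.float32)
    w, h = torch.meshgrid(coords, coords, indexing="ij")
    g = torch.exp(-(w ** 2 + h ** 2) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return g / g.sum()                       # normalize so smoothing preserves intensity


def gaussian_smooth(feat, sigma):
    """Apply the filter channel-wise to a (B, C, H, W) feature map."""
    k = gaussian_kernel(sigma).to(feat.device)
    c = feat.shape[1]
    k = k.expand(c, 1, *k.shape)             # one identical kernel per channel
    return F.conv2d(feat, k, padding=k.shape[-1] // 2, groups=c)
```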
In this paper, the output of the SSFF module is concatenated with the top output of the YOLOv5 feature pyramid using a common channel and spatial attention mechanism called CPAM [40]. This combined output is then directed to the newly added ultra-small object detection head, enhancing the detection performance for dense small objects.
Figure 6b illustrates the internal structure of the TFE three-scale feature map channel concatenation module. The original YOLOv5 neck network uses an FPN structure to propagate feature information. However, this mechanism only upsamples small feature maps and adds them to the previous layer, overlooking the rich detailed information present in large-scale feature layers. The TFE module addresses this limitation by combining large, medium, and small feature maps, amplifying the features of large-scale maps and enhancing detailed feature information.
At the input stage, the number of channels in the large-scale feature map is adjusted to 1C, followed by a mixed structure of max pooling and average pooling for downsampling. This approach preserves the effectiveness and diversity of high-resolution feature maps while reducing the channel count and size of large-scale feature maps, thereby lowering computational complexity and memory usage. Similarly, the channel count of the small-scale feature map is adjusted using the same convolution operation, followed by upsampling with the nearest neighbor interpolation method to prevent information loss in small-scale feature maps. Finally, the feature maps from the three scales are convolved separately and concatenated along the channel dimension.
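A rough PyTorch sketch of this three-scale concatenation is shown below; the equal-weight mixing of max and average pooling, the 1 × 1 channel-alignment convolutions, and the channel counts are illustrative assumptions rather than the exact TFE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TFESketch(nn.Module):
    """Three-scale channel concatenation: downsample the large map with mixed
    max/avg pooling, upsample the small map with nearest interpolation, align
    channels with 1x1 convs, then concatenate along the channel dimension."""

    def __init__(self, c_large, c_mid, c_small, c_out):
        super().__init__()
        self.align_l = nn.Conv2d(c_large, c_mid, 1)
        self.align_s = nn.Conv2d(c_small, c_mid, 1)
        self.fuse = nn.Conv2d(3 * c_mid, c_out, 1)

    def forward(self, f_large, f_mid, f_small):
        # f_large: (B, c_large, 2H, 2W), f_mid: (B, c_mid, H, W), f_small: (B, c_small, H/2, W/2)
        l = self.align_l(f_large)
        l = 0.5 * (F.max_pool2d(l, 2) + F.avg_pool2d(l, 2))   # mixed pooling, stride 2
        s = F.interpolate(self.align_s(f_small), scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([l, f_mid, s], dim=1))
```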
Through the SSFF and TFE operations, YOLOv5 effectively combines spatial and multi-scale features, extracting information more comprehensively than the original feature pyramid. This enhances the detection capabilities for multi-scale objects, especially small objects, without leading to a substantial increase in parameter count or computational complexity. Compared to the original YOLOv5 network’s method of upsampling and channel concatenation, this approach better preserves the effectiveness and diversity of high-resolution feature maps.

3.4. Detection Head Improvement

The original YOLOv5 model is designed for three-scale detection. When a 640 × 640 pixel image is processed, it passes through the backbone and undergoes two upsampling stages in the neck network before being downsampled and output to the detection layer, resulting in output images of sizes 80 × 80, 40 × 40, and 20 × 20 pixels. While downsampling with a stride of 2 in the backbone captures more semantic information, it also causes a loss of detailed feature information, particularly affecting the semantic features of tiny objects. To address this, we introduce a shallow detection layer focused on smaller objects and add a related feature fusion layer to the neck network. This additional detection scale enhances the detection capabilities for tiny objects by providing richer positional information, making it more suitable for aerial scenes with a wide range of object sizes.
The newly added feature pyramid layer includes the TFE feature fusion operation. To integrate the three-scale feature maps onto the 160 × 160 pixel detection layer, the feature maps downscaled for the first and second time in the backbone network are concatenated with the previous layer of the feature pyramid. The output of the small object detection layer, after the fusion of three-scale features, shows superior performance in detecting small objects. However, this additional detection head increases network parameters and computational complexity. Given that large objects are rare in aerial shots, the original large object detection head and related structures in the neck network are removed to improve computational speed.
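The shift in detection scales can be summarized with a small calculation, assuming the standard YOLOv5 strides of 8, 16, and 32 for the original P3-P5 heads and strides of 4, 8, and 16 for the modified configuration with the added 160 × 160 layer:

```python
# Output grid sizes for a 640 x 640 input, assuming the standard YOLOv5
# strides: the original P3-P5 heads versus the modified P2-P4 heads used here.
INPUT_SIZE = 640
original_strides = [8, 16, 32]   # -> 80x80, 40x40, 20x20 detection layers
modified_strides = [4, 8, 16]    # -> 160x160, 80x80, 40x40 detection layers

for name, strides in [("original", original_strides), ("modified", modified_strides)]:
    grids = [INPUT_SIZE // s for s in strides]
    print(name, [f"{g}x{g}" for g in grids])
```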

4. Experiments and Results

4.1. Dataset and Experiments Environment

All experiments in this study were conducted using the VisDrone2019 dataset [42] and the AU-AIR dataset [43] for training and testing. The VisDrone2019 dataset, compiled by the AISKYEYE team at Tianjin University’s Machine Learning and Data Mining Laboratory, comprises 288 video clips, 261,908 frames, and 10,209 static images. Of these static images, 6471 are used for training, 548 for validation, and 3190 for testing. The images were captured by various drones in diverse scenes, including urban and rural environments, and feature a wide range of objects (vehicles, pedestrians, etc.), lighting conditions (daytime and nighttime), and densities (sparse, dense, and occluded). Over 2.6 million objects have been manually annotated into 10 different categories, addressing all challenges in drone-based object detection tasks. The AU-AIR dataset is the first multimodal dataset specifically designed for detecting small-scale objects in drone imagery. It contains two hours of raw video captured from an aerial device, featuring 32,823 annotated frames and 8 object categories for traffic monitoring purposes. The object instances include cars, vans, trucks, humans, trailers, bicycles, buses, and motorbikes.
The experimental environment is PyTorch 1.9.0 and CUDA 11.1, running on an NVIDIA GeForce RTX 3080 GPU. During training, input images were resized to 640 × 640 pixels, and the SGD optimizer was used with an initial learning rate of 0.01 and a weight decay coefficient of 0.005.
Due to the extensive modifications made to the YOLO model in this study, only a few modules could be initialized with pre-trained weights during the ablation experiments. To better validate the model’s performance and ensure experimental completeness, no YOLO pre-trained weights were loaded during training. This approach may result in slower model convergence. Therefore, the model was trained on the training set for 200 epochs, with the first 3 epochs used for warm-up to ensure full model convergence.

4.2. Experiment Metrics

The key metrics involved in object detection tasks are Precision, Recall, Average Precision (AP), and mean Average Precision (mAP). Precision and Recall can be defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP represents true positive detections, FP represents false positive detections, and FN represents false negative detections.
AP can be obtained by plotting Recall on the x-axis and Precision on the y-axis to produce the Precision-Recall (P-R) curve for a single category, and then integrating the area under this curve.
The mean Average Precision (mAP) comprehensively evaluates the model’s performance under various overlapping demands. It is calculated by averaging the AP values across multiple classes as follows:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where AP adheres to the following:
$$AP = \int_0^1 P \, dR$$
The two most commonly used average precision metrics are AP50 (IoU = 0.5) and AP (IoU ∈ [0.5:0.95]).
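For reference, a common way to compute these quantities from per-class precision-recall data is sketched below; the monotone precision envelope and trapezoidal integration are one standard implementation choice, not necessarily the exact procedure used in the experiments.

```python
import numpy as np


def average_precision(recall, precision):
    """Area under the P-R curve for one class (recall sorted ascending),
    using the common monotone precision envelope and trapezoidal integration."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    return float(np.trapz(p, r))


def mean_average_precision(ap_per_class):
    """mAP: average of the per-class AP values."""
    return float(np.mean(ap_per_class))


# Example: AP50 scores for two classes -> their mAP50
print(mean_average_precision([0.62, 0.48]))   # ~0.55
```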
Additionally, the metrics for evaluating object detection performance include the model size (number of parameters), computational complexity (measured in FLOPs), and operational speed (measured in Frames Per Second, FPS).

4.3. Comparison with Different Detection Algorithms

To validate the efficiency of the proposed algorithm, we conducted comparative experiments on the Visdrone2019-val set and AU-AIR dataset using several representative YOLO-based object detection algorithms, including YOLOv4 [21], YOLOv5 [22], TPH-YOLOv5 [32], YOLOv7 [23], and YOLOv8. We assessed these algorithms based on detection accuracy, model parameters, and detection speed. The experimental results are presented in Table 1.
As shown in Table 1, the baseline model YOLOv5s, without loading pre-trained weights, demonstrates a relatively balanced performance in detection accuracy and speed. On Visdrone2019-val, the improved YOLOv5s algorithm proposed in this paper achieves an accuracy of 43.7% on AP50, which is 12.4% higher than the baseline model. Additionally, our model's AP50 is 11.2% and 9.1% higher than that of models of similar size, YOLOv7-tiny and YOLOX-S [44], respectively. With 80% fewer parameters than TPH-YOLOv5 and YOLOv8l, our model's AP50 is 0.9% and 0.5% higher and its detection speed is 11.3 FPS and 12.9 FPS faster, respectively. On the AU-AIR dataset, our algorithm performs best on AP, achieving 45.3%, while YOLOv8l performs best on AP50.
Next, we compared our model with some classic object detection algorithms, including SSD512 [20], RetinaNet [45], Faster-RCNN [16], Light-RCNN [46], CenterNet [47], CornerNet [48], and Grid-RCNN [49], on the Visdrone2019-val set. The experimental results, shown in Table 2, indicate that the proposed algorithm outperforms these alternatives in both accuracy and speed.
Classic two-stage algorithms struggle with datasets containing complex backgrounds and exhibit low detection efficiency due to their intricate structures. For instance, while Grid-RCNN achieves high detection accuracy, its detection speed is only 10.4 FPS, making it difficult to meet real-time requirements. In contrast, the improved YOLOv5s model delivers superior detection results in real-time conditions, maintaining high accuracy and speed.
Finally, the algorithm proposed in this article was validated on the Visdrone2019-test. Representative single-stage and two-stage algorithms were selected for comparison, including Cascade-RCNN [17], mSODANet [50], UCGNet [51], YOLOv3_ReSAM [26], YOLOv8s, PVswin-YOLOv8s [35], and TPH-YOLOv5 [32], several of which are SOTA models. The experiment analyzed the 10 object categories in the Visdrone dataset separately, with the evaluation metric being AP50. The experimental results are shown in Figure 7. Thanks to the global receptive field and multi-scale feature fusion, the improved YOLOv5s algorithm excels in detecting small target categories such as pedestrians and motorcycles, achieving AP50 scores of 40.6% and 42.6%, respectively, the highest among all the algorithms presented. Additionally, for categories prone to misdetection, such as vans and trucks, the improved YOLOv5s algorithm achieved AP50 scores of 44.1% and 37.6%, outperforming the other object detection algorithms. Overall, our proposed algorithm achieved the highest AP50 on the test set, reaching 36.5%. In comparison to the SOTA TPH-YOLOv5, which has an AP50 of 36.1%, our model is much lighter and more accurate.
From the above three comparative experiments, it is evident that the enhanced YOLOv5s algorithm leverages optimized feature extraction, feature fusion, and lightweight modifications, resulting in a satisfying performance in detection accuracy and model size on UAV images.

4.4. Ablation Experiment

To assess the impact of the proposed enhancements on the model’s detection performance, ablation experiments were conducted using the original YOLOv5s as the baseline. The experimental results are presented in Table 3.
The ablation experiments fully demonstrate the superiority of the proposed object detection algorithm. As depicted in Table 3, the DCNCSPELAN4 module effectively enhances efficiency with notable reductions in parameters and computational costs, while also slightly increasing mAP. The CloFormer Block module effectively improves feature extraction accuracy. Following the optimizations to the backbone network, the new model reduced parameters and FLOPs by 0.5 M and 2.2 G, respectively, on the VisDrone2019-val dataset, while AP50 and AP increased by 1.3% and 0.9%, respectively, and the frame rate decreased by 18.7 FPS. The addition of a small object detection head in the output layer significantly enhanced accuracy, reflecting increases of 9.4% and 6.3% in AP50 and AP, respectively. Integrating the small object detection layer into the improved backbone network resulted in AP50 and AP improvements of 9.0% and 6.2%, respectively. The SSFF and TFE feature fusion operations, despite adding few parameters, notably boosted object detection accuracy by 2.1% and 1.3% in AP50 and AP, respectively. Moreover, optimizations in handling high-resolution feature maps, primarily by reducing the non-maximum suppression (NMS) processing time, kept the detection speed nearly unchanged.
Overall, compared to the original YOLOv5s algorithm, the improved YOLOv5s algorithm achieved increases of 12.4% and 8.4% on AP50 and AP, respectively. Although the enhanced model increased parameters by 0.3M and FLOPs by 12.9 GFLOPs, it retains a lightweight advantage in neural network object detection models. Regarding detection speed, the final algorithm achieves 46.8 FPS, fully meeting real-time requirements.
To further verify the effectiveness of each module, we conducted ablation experiments based on the representative YOLO models, YOLOv8s and YOLOv7-tiny, as baseline models. The experimental results are shown in Table 4. After introducing DCNCSPELAN4 and the CloFormer Block into the backbone network, the AP50 of the YOLOv8s and YOLOv7-tiny models on the VisDrone2019-val set increased by 1.8% and 0.9%, respectively, while AP increased by 0.7% and 0.6%. After adding the feature fusion modules SSFF and TFE, as well as an additional small object detection head, the final YOLOv8s and YOLOv7-tiny models achieved improvements of 7.9% and 6.4% in AP50 over the baseline models, reaching 46.3% and 38.9%, respectively. However, the model sizes increased by 1.2 M and 0.4 M, and the speeds decreased by 36.7 FPS and 51.1 FPS, respectively. Compared to the improved YOLOv5s algorithm, although the YOLOv8-based model achieved higher accuracy, its detection speed was only 25.7 FPS, failing to meet the minimum real-time requirement of 30 FPS [52], rendering it unsuitable for deployment on airborne platforms for target detection tasks.
Comparative experiments on the detection accuracy of 10 object classes on VisDrone2019-val were conducted for each improved module, as shown in Table 5. From the table, it is evident that the original YOLOv5 network achieved only 10.3% accuracy for bicycles and canopy tricycles. The improved network in this study increased AP50 for these classes by 9.0% and 4.1%, respectively, underscoring its effectiveness in detecting dense small objects. Across all object classes, the proposed modules consistently improved mAP compared to the original YOLOv5, demonstrating robust multi-scale detection and generalization capabilities suitable for UAV image object detection.

4.5. Visualization

The confusion matrix is a visual representation that illustrates the algorithm's performance, where the abscissa represents the true class and the ordinate represents the predicted class. Figure 8a–c depict the confusion matrices of the baseline YOLOv5s model, TPH-YOLOv5, and our improved model on the Visdrone2019-test, respectively. In Figure 8a, it is evident that bicycles, tricycles, and people were predominantly detected as background by the YOLOv5s algorithm, indicating a high rate of missed detections. Additionally, the YOLOv5 algorithm shows significant misclassifications among similar targets; for instance, 44% of vans and 19% of trucks were incorrectly identified as cars. In Figure 8b, the addition of multiple attention mechanisms and the Transformer detection head in TPH-YOLOv5 [32] effectively mitigated the false detection issues of the original model. For instance, the probability of misclassifying vans and trucks as cars has been reduced to 36% and 15%, respectively. While these modules also reduced the overall false detection rate, it remains relatively high. Comparing the final model (Figure 8c) with the original model reveals higher values along the diagonal and lower values in the upper-right and lower-left triangles. This shift indicates that the improved model achieves higher probabilities of correct predictions, leading to notable reductions in both false detections and missed detections. Overall, these enhancements contribute to outstanding object detection performance.
The application of our improved algorithm on the AU-AIR dataset and Visdrone2019-test vividly demonstrates its advantages. Figure 9 compares the performance of the original YOLOv5 algorithm (Figure 9a, on the left) and TPH-YOLOv5 (Figure 9b, in the middle) with our improved algorithm (Figure 9c, on the right). As shown in the first row of Figure 9, in the AU-AIR dataset, a fast-moving car faces issues such as blurriness, poor lighting, and obstruction by trees, making accurate detection difficult. The original YOLOv5s fails to recognize it, but our proposed algorithm successfully detects the car with higher accuracy than TPH-YOLOv5. In the complex backgrounds typical of the Visdrone2019 dataset, the original YOLOv5 algorithm often exhibits misdetections and omissions, particularly struggling with objects obstructed by structures, blurred objects, and objects in dimly lit scenes. Following the optimizations to the YOLOv5 algorithm, the improved model successfully detects previously undetected objects such as cars obscured by bridges, small pedestrians on playgrounds, and vehicles in low-light conditions. Misdetection rates have also been reduced. As shown in Figure 9b,c, compared to TPH-YOLOv5, the algorithm proposed in this paper performs better in detecting dense small objects, such as pedestrians on bridges and roads. These experiments underscore that our proposed algorithm is more robust and adept at handling object detection tasks in challenging, real-world scenarios.
The application of our improved YOLOv5s algorithm in urban supervision scenarios is illustrated in Figure 10 and Figure 11. These figures depict complex urban traffic and street scenes during both day and night. In these environments, various objects of different scales, including pedestrians, motorcycles, cars, trucks, and crowds, are densely distributed. Additionally, objects often face occlusions from surrounding elements like trees and buildings, representing typical challenges in real-world scenarios.
Figure 10 highlights the algorithm's effectiveness in detecting large-scale cars and trucks, as well as small-scale motorcycles and pedestrians, amidst the complexity of urban traffic. This capability holds true in both well-lit daytime and dimly lit nighttime conditions. The algorithm exhibits consistent performance regardless of distribution density, lighting changes, or environmental occlusions, underscoring its high accuracy and robustness.
Figure 11 showcases the detection performance of our improved YOLOv5s algorithm in urban street scenes during both day and night. From the UAV perspective, these scenes feature densely packed pedestrians, motorcycles, and other small objects with significant overlap. Our algorithm maintains high detection accuracy under these challenging conditions, with minimal missed detections and virtually no false positives. These results demonstrate the algorithm’s suitability for complex urban object detection tasks.

5. Conclusions

This paper introduces an efficient object detection algorithm based on YOLOv5s, specifically tailored for UAV aerial photography. The algorithm aims to enhance detection accuracy while preserving the lightweight and fast response characteristics of the YOLOv5s baseline model. Key innovations include the lightweight feature extraction module DCNCSPELAN4 and the CloFormer Block, a Vision Transformer module that expands the network’s receptive field to improve feature extraction for small and occluded targets. Additionally, two feature fusion modules, SSFF and TFE, efficiently integrate channel information from different scale feature maps in the neck network, enhancing the detection of multi-scale targets.
The experimental results demonstrate that the improved YOLOv5s algorithm surpasses other YOLO variants in detection accuracy on the AU-AIR dataset and Visdrone2019-val. It also outperforms classical object detection algorithms in both accuracy and speed metrics. On the Visdrone2019-test, the algorithm outperforms some state-of-the-art models with a 0.4% increase in AP50 while maintaining fewer parameters compared to TPH-YOLOv5. The ablation experiments, conducted without pre-trained weights, show notable improvements with increases of 12.4% and 8.4% in AP50 and AP, respectively, over the YOLOv5 baseline model. Visual analysis of the experimental outcomes reveals superior performance in complex urban scenarios compared to the original YOLOv5 algorithm, particularly in UAV object detection tasks from an aerial viewpoint.
The proposed algorithm is primarily intended for high-performance embedded platforms integrated with UAVs. However, the ablation experiments indicate that the improved YOLOv8s algorithm offers higher accuracy compared to the YOLOv5s-based algorithm, while the improved YOLOv7-tiny algorithm achieves faster detection speeds. This suggests that optimizing these algorithms (e.g., through pruning and distillation) could yield even better detection performance. Additionally, deploying different algorithms of varying scales on corresponding drone platforms for specific scenarios can better exploit their potential. Therefore, future work will focus on extending the proposed algorithm to a wider range of applications in drone-based object detection.

Author Contributions

R.Q.: Conceptualization, data curation, methodology, investigation, writing—original draft preparation. Y.D.: resources, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, X.; Han, Y.; Tang, L.; Deng, C. Multi Target Detection and Tracking Algorithm for UAV Platform Based on Deep Learning. J. Signal Process. 2022, 38, 157–163. [Google Scholar] [CrossRef]
  2. Ravindran, R.; Santora, M.J.; Jamali, M.M. Multi-object detection and tracking, based on DNN, for autonomous vehicles: A review. IEEE Sens. J. 2020, 21, 5668–5677. [Google Scholar] [CrossRef]
  3. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  4. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  5. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  6. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
  7. Hoiem, D.; Divvala, S.K.; Hays, J.H. Pascal VOC 2008 challenge. World Lit. Today 2009, 24, 1–4. [Google Scholar]
  8. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  9. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  10. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 649–667. [Google Scholar]
  11. Wang, C.; Liao, H.; Wu, Y.; Chen, P.; Hsieh, J.; Yeh, I. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  12. Wang, R.; Shivanna, R.; Cheng, D.; Jain, S.; Lin, D.; Hong, L.; Chi, E. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference, Ljubljana, Slovenia, 19–23 April 2021; pp. 1785–1797. [Google Scholar]
  13. Fan, Q.; Huang, H.; Guan, J.; He, R. Rethinking local perception in lightweight vision transformer. arXiv 2023, arXiv:2303.17803. [Google Scholar]
  14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  15. Purkait, P.; Zhao, C.; Zach, C. SPP-Net: Deep absolute pose regression with synthetic views. arXiv 2017, arXiv:1712.03452. [Google Scholar]
  16. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on COMPUTER Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  17. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  19. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  21. Bochkovskiy, A.; Wang, C.; Liao, H. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J.; Skalski, P.; Hogan, A. ultralytics/yolov5: v6. 0-YOLOv5n’Nano’models, Roboflow integration, TensorFlow export, OpenCV DNN support. Zenodo 2021. Available online: https://zenodo.org/records/5563715 (accessed on 10 August 2024).
  23. Wang, C.-Y.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  24. Saetchnikov, I.; Skakun, V.; Tcherniavskaia, E. Efficient objects tracking from an unmanned aerial vehicle. In Proceedings of the 2021 IEEE 8th International Workshop on Metrology for AeroSpace (MetroAeroSpace), Virtual, 23–25 June 2021; pp. 221–225. [Google Scholar]
  25. Liu, Y.; Yang, F.; Hu, P. Small-object detection in UAV-captured images via multi-branch parallel feature pyramid networks. IEEE Access 2020, 8, 145740–145750. [Google Scholar] [CrossRef]
  26. Liu, B.; Luo, H.; Wang, H.; Wang, S. YOLOv3_ReSAM: A small-target detection method. Electronics 2022, 11, 1635. [Google Scholar] [CrossRef]
  27. Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
  28. Wang, X.; Li, W.; Guo, W.; Cao, K. SPB-YOLO: An efficient real-time detector for unmanned aerial vehicle images. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju, Republic of Korea, 13–16 April 2021; pp. 099–104. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  31. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar] [CrossRef]
  32. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  33. Feng, J.; Yi, C. Lightweight detection network for arbitrary-oriented vehicles in UAV imagery via global attentive relation and multi-path fusion. Drones 2022, 6, 108. [Google Scholar] [CrossRef]
  34. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
  35. Tahir, N.U.A.; Long, Z.; Zhang, Z.; Asim, M.; ELAffendi, M. PVswin-YOLOv8s: UAV-based pedestrian and vehicle detection for traffic management in smart cities using improved YOLOv8. Drones 2024, 8, 84. [Google Scholar] [CrossRef]
  36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  37. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  38. Wang, C.; Yeh, I.; Liao, H. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  39. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  40. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  41. Lindeberg, T. Scale-Space Theory in Computer Vision; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 256. [Google Scholar]
  42. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  43. Bozcan, I.; Kayacan, E. Au-air: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8504–8510. [Google Scholar]
  44. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  45. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  46. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
  47. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  48. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  49. Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7363–7372. [Google Scholar]
  50. Chalavadi, V.; Jeripothula, P.; Datla, R.; Ch, S.B. mSODANet: A network for multi-scale object detection in aerial images using hierarchical dilated convolutions. Pattern Recognit. 2022, 126, 108548. [Google Scholar] [CrossRef]
  51. Barnwal, R.P.; Bharti, S.; Misra, S.; Obaidat, M.S. UCGNet: Wireless sensor network-based active aquifer contamination monitoring and control system for underground coal gasification. Int. J. Commun. Syst. 2017, 30, e2852. [Google Scholar] [CrossRef]
  52. Singh, B.; Li, H.; Sharma, A.; Davis, L.S. R-fcn-3000 at 30fps: Decoupling detection and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1081–1090. [Google Scholar]
Figure 1. Sample images taken from UAVs.
Figure 2. Improved YOLOv5 network structure.
Figure 3. The structure of (a) CSPNet, (b) ELAN, and (c) CSPELAN. Modeled after CSPNet, CSPELAN generalizes the convolution modules in ELAN to arbitrary computational blocks.
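Since the paper does not include code for CSPELAN, the following PyTorch sketch only illustrates the idea in Figure 3c: a CSP-style channel split combined with ELAN-style aggregation of every intermediate output. The class and argument names (ConvBNAct, CSPELANBlock, n_blocks, block) are illustrative choices, not the authors' implementation.

import torch
import torch.nn as nn


class ConvBNAct(nn.Module):
    # Convolution + BatchNorm + SiLU, the basic unit used throughout YOLOv5-style networks.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class CSPELANBlock(nn.Module):
    # CSP split into two branches, a chain of interchangeable sub-blocks whose
    # intermediate outputs are all kept (ELAN-style), and a 1x1 fusion convolution.
    def __init__(self, c_in, c_out, n_blocks=2, block=lambda c: ConvBNAct(c, c, k=3)):
        super().__init__()
        c_hidden = c_out // 2
        self.stem = ConvBNAct(c_in, 2 * c_hidden, k=1)
        self.blocks = nn.ModuleList(block(c_hidden) for _ in range(n_blocks))
        self.fuse = ConvBNAct((2 + n_blocks) * c_hidden, c_out, k=1)

    def forward(self, x):
        y1, y2 = self.stem(x).chunk(2, dim=1)       # CSP split
        outs = [y1, y2]
        for blk in self.blocks:                     # keep every intermediate output
            outs.append(blk(outs[-1]))
        return self.fuse(torch.cat(outs, dim=1))

Because the sub-block is passed in as a callable, the same wrapper can host any computable module; plugging in a deformable-convolution block (sketched after Figure 4) is one way a DCNCSPELAN4-style module could be assembled.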
Figure 4. The overall structure of DCNCSPELAN4. In the DCN structure, the gray grid simulates the distribution of targets in aerial photography, while the solid and hollow circles represent the receptive fields of DCN and regular convolutions in UAV images, respectively.
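Assuming the deformable convolution follows the standard torchvision.ops.DeformConv2d interface, a DCN building block of the kind suggested by Figure 4 could be sketched as below; the wrapper name DeformableConvBlock and the zero-initialized offset branch are our illustrative choices, not the paper's code.

import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableConvBlock(nn.Module):
    # A 3x3 deformable convolution: a plain convolution predicts (dx, dy) offsets for
    # every kernel element and output position, so the sampling grid can adapt to the
    # irregular spatial distribution of aerial targets instead of staying fixed.
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.offset_conv = nn.Conv2d(c_in, 2 * k * k, k, s, k // 2)
        self.deform_conv = DeformConv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
        nn.init.zeros_(self.offset_conv.weight)    # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)
        return self.act(self.bn(self.deform_conv(x, offset)))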
Figure 5. The structure of the CloFormer Block, consisting of a global branch and a local branch.
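The CloFormer Block pairs a global attention branch with a convolutional local branch. The sketch below is a deliberately simplified rendering of that two-branch idea, using pooled keys/values for the global branch and a depthwise convolution for the local branch; it is not the original CloFormer implementation, and all module names are ours.

import torch
import torch.nn as nn


class GlobalLocalBlock(nn.Module):
    # Simplified two-branch token mixer: the global branch attends from every position
    # to a pooled (low-resolution) key/value map, the local branch mixes neighborhoods
    # with a depthwise convolution, and the two results are summed with a residual path.
    def __init__(self, dim, num_heads=4, pool=8):   # dim must be divisible by num_heads
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, P*P, C) pooled keys/values
        g, _ = self.attn(q, kv, kv)
        g = g.transpose(1, 2).reshape(b, c, h, w)     # back to feature-map layout
        return x + self.proj(g + self.local(x))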
Figure 6. Structure of the three-scale feature fusion operations: (a) SSFF; (b) TFE.
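Figure 6 can be read as two complementary fusion operators. The sketch below assumes that TFE concatenates three feature maps rescaled to a common resolution and that SSFF stacks them along an explicit scale axis before a 3D convolution, in the spirit of ASF-YOLO [40]; channel handling, pooling choices, and layer names are our assumptions, and details of the published SSFF (e.g., its scale-space smoothing) are omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TFE(nn.Module):
    # Three-scale feature fusion: bring the large and small maps to the medium
    # resolution and concatenate along channels (inputs assumed to share one channel count).
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, p_large, p_medium, p_small):
        size = p_medium.shape[2:]
        down = F.adaptive_max_pool2d(p_large, size)              # shrink the high-resolution map
        up = F.interpolate(p_small, size=size, mode="nearest")   # enlarge the low-resolution map
        return self.fuse(torch.cat([down, p_medium, up], dim=1))


class SSFF(nn.Module):
    # Scale sequence feature fusion: stack rescaled maps along a new scale axis and
    # mix them with a 3D convolution so the network reasons over scale explicitly.
    def __init__(self, channels):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, p_large, p_medium, p_small):
        size = p_large.shape[2:]                                 # fuse at the highest resolution
        feats = [p_large,
                 F.interpolate(p_medium, size=size, mode="nearest"),
                 F.interpolate(p_small, size=size, mode="nearest")]
        stack = torch.stack(feats, dim=2)                        # (B, C, S=3, H, W)
        return self.conv3d(stack).max(dim=2).values              # collapse the scale axis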
Figure 7. Experimental results for all categories of VisDrone2019-test.
Figure 8. Confusion matrices of (a) the original YOLOv5s model, (b) TPH-YOLOv5, and (c) the improved YOLOv5s model.
Figure 9. Performance of the original YOLOv5, TPH-YOLOv5, and the proposed algorithm on the AU-AIR and VisDrone2019-test datasets. (a) Original YOLOv5; (b) TPH-YOLOv5; (c) the proposed algorithm.
Figure 10. Detection results of our improved YOLOv5s in the urban traffic surveillance scenario. (a) Daytime urban traffic scenario; (b) nighttime urban traffic scenario.
Figure 11. Detection results of our improved YOLOv5s in the urban street surveillance scenario. (a) Daytime urban street scenario; (b) nighttime urban street scenario.
Table 1. Comparison with different YOLO algorithms on VisDrone2019-val and AU-AIR.

Methods | Param. (M) | FLOPs (G) | VisDrone AP50 (%) | VisDrone AP (%) | AU-AIR AP50 (%) | AU-AIR AP (%) | FPS (f·s⁻¹)
YOLOX-S [44] | 9.0 | 26.8 | 34.6 | 19.9 | 47.2 | 41.3 | 53.1
YOLOv4 | 9.12 | 25.3 | 35.7 | 18.5 | 46.6 | 40.7 | 57.6
YOLOv5s | 7.2 | 16.5 | 31.3 | 16.6 | 46.4 | 40.8 | 84.0
YOLOv5l | 46.5 | 109.1 | 40.3 | 23.4 | 50.9 | 43.2 | 38.4
TPH-YOLOv5 [32] | 47.9 | 157.2 | 42.8 | 24.6 | 50.4 | 42.9 | 35.5
YOLOv7-tiny | 6.2 | 13.1 | 32.5 | 16.5 | 45.9 | 39.6 | 110.1
YOLOv8s | 11.2 | 28.6 | 38.4 | 23.1 | 49.5 | 44.0 | 62.4
YOLOv8l | 43.7 | 165.2 | 43.2 | 25.4 | 52.8 | 45.1 | 32.9
Ours | 7.5 | 29.4 | 43.7 | 25.0 | 52.6 | 45.3 | 46.8
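For reference, the AP50 and AP values in Tables 1–5 are assumed to follow the standard COCO-style definitions:

\[
\mathrm{AP}_t = \int_0^1 p_t(r)\,\mathrm{d}r, \qquad
\mathrm{AP50} = \mathrm{AP}_{t=0.5}, \qquad
\mathrm{AP} = \frac{1}{10}\sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{AP}_t,
\]

where p_t(r) is the category-averaged precision–recall curve at IoU threshold t.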
Table 2. Comparison with other classic object detection algorithms on VisDrone2019-val.

Methods | Backbone | AP50 (%) | AP (%) | FPS (f·s⁻¹)
RetinaNet [45] | ResNet-50 | 22.5 | 11.9 | 33.4
SSD512 [20] | VGG16 | 26.8 | 12.2 | 37.2
Faster-RCNN [16] | ResNet-50 | 29.6 | 14.1 | 36.3
Light-RCNN [46] | ResNet-50 | 32.8 | 16.5 | 33.2
CenterNet [47] | ResNet-101 | 33.6 | 18.5 | 29.8
CornerNet [48] | ResNet-101 | 34.2 | 17.9 | 41.6
Grid-RCNN [49] | ResNet-50 | 39.3 | 25.1 | 10.4
Ours | Darknet-53 | 43.7 | 25.0 | 46.8
Table 3. Ablation experiment results based on the YOLOv5s baseline model on VisDrone2019-val.

DCNCSPELAN4 | CloFormer Block | Head | SSFF + TFE | AP50 (%) | AP (%) | Param. (M) | FLOPs (G) | FPS (f·s⁻¹)
– | – | – | – | 31.3 | 16.6 | 7.2 | 16.5 | 84.0
√ | – | – | – | 31.7 | 17.1 | 6.4 | 13.2 | 78.5
√ | √ | – | – | 32.6 | 17.5 | 6.7 | 14.3 | 65.3
– | – | √ | – | 40.7 | 22.9 | 7.8 | 28.9 | 66.2
√ | √ | √ | – | 41.6 | 23.7 | 7.3 | 26.7 | 47.1
√ | √ | √ | √ | 43.7 | 25.0 | 7.5 | 29.4 | 46.8
Note: √ indicates that the corresponding module is added.
Table 4. Ablation experiment results based on the YOLOv8s and YOLOv7-tiny baseline models on VisDrone2019-val.

Methods | Improved Backbone | SSFF + TFE + Head | AP50 (%) | AP (%) | Param. (M) | FLOPs (G) | FPS (f·s⁻¹)
YOLOv8s | – | – | 38.4 | 23.1 | 11.2 | 28.6 | 62.4
YOLOv8s | √ | – | 40.2 | 23.8 | 10.1 | 23.9 | 46.9
YOLOv8s | √ | √ | 46.3 | 26.4 | 12.4 | 34.6 | 25.7
YOLOv7-tiny | – | – | 32.5 | 16.5 | 6.2 | 13.1 | 110.1
YOLOv7-tiny | √ | – | 33.4 | 17.1 | 6.1 | 12.7 | 86.5
YOLOv7-tiny | √ | √ | 38.9 | 22.8 | 6.6 | 20.3 | 59.0
Note: √ indicates that the corresponding module is added.
Table 5. AP50 for all categories on VisDrone2019-val.

Methods | All | Pedestrian | People | Bicycle | Car | Van | Truck | Tricycle | Awning-Tricycle | Bus | Motor
Base | 31.3 | 38.6 | 32.0 | 10.3 | 71.6 | 31.0 | 25.5 | 16.7 | 10.3 | 39.9 | 36.9
Improved Backbone | 32.6 | 40.4 | 32.5 | 10.9 | 73.1 | 36.4 | 25.7 | 17.4 | 9.2 | 41.4 | 38.9
Improved Backbone + Head | 41.6 | 51.4 | 39.8 | 18.5 | 82.5 | 46.1 | 34.1 | 27.7 | 15.4 | 52.3 | 48.2
Ours | 43.7 | 50.8 | 40.0 | 19.3 | 83.1 | 47.2 | 39.2 | 31.6 | 14.4 | 60.3 | 50.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
