Article

A UAV Aerial Image Target Detection Algorithm Based on YOLOv7 Improved Model

Jie Qin, Weihua Yu, Xiaoxi Feng, Zuqiang Meng and Chaohong Tan
1 School of Computer Science and Electronic Engineering, Guangxi University, Nanning 530004, China
2 School of Electrical Engineering, Guangxi University, Nanning 530004, China
3 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning 530201, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3277; https://doi.org/10.3390/electronics13163277
Submission received: 24 July 2024 / Revised: 11 August 2024 / Accepted: 15 August 2024 / Published: 19 August 2024
(This article belongs to the Special Issue Applications of Computer Vision, 2nd Edition)

Abstract

To address the challenges of multi-scale objects, dense distributions, occlusions, and numerous small targets in UAV image detection, we present CMS-YOLOv7, a real-time target detection method based on an enhanced YOLOv7 model. First, a small-target detection layer (P2) was added to YOLOv7 to improve the detection of small and medium-sized targets, and the deep detection head (P5) was removed to mitigate the effect of excessive downsampling on small targets; the anchor boxes were then recomputed with the K-means++ method. Using the concept of Inner-IoU, the Inner-MPDIoU loss function was constructed to control the scale of the auxiliary bounding box and improve detection performance. Furthermore, the CARAFE module was introduced to replace the traditional upsampling operator, integrating semantic information more effectively during upsampling and improving the accuracy of the feature maps. In the feature extraction stage, a non-strided convolution module, SPD-Conv, was constructed using a space-to-depth operation; it replaces certain strided convolutions to reduce the loss of fine-grained information and strengthen feature extraction for small targets. Experiments on the UAV aerial photography dataset VisDrone2019 show that, compared with the baseline YOLOv7 detector, CMS-YOLOv7 improves mAP@0.5 by 3.5% and mAP@0.5:0.95 by 3.0% while reducing the number of parameters by 18.54 M, significantly enhancing small-target detection.

1. Introduction

In today’s rapidly advancing technological landscape, UAV technology has also progressed swiftly. Unmanned aerial vehicles (UAVs) are now widely utilized across various fields, including military, agriculture, emergency rescue, and geological surveys, owing to their unique advantages. This has made UAV technology a prominent focus in modern science and technology [1,2,3]. Target detection, an important research area in computer vision [4], includes identifying and locating objects within images or videos. Traditional target detection algorithms often fall short in effectively extracting features for accurate detection. However, with advancements in deep learning technology [5], the performance of target detection algorithms has significantly improved, enabling UAVs to perform target detection in complex scenes [6].
Four main factors cause problems when UAVs perform target detection. First, UAVs typically shoot from an overhead perspective, resulting in a single, restricted viewing angle for feature extraction, which may affect the detection algorithm’s ability to identify and locate targets [7,8]. Second, UAVs may encounter complex background environments during detection, including season, weather, target occlusion, lighting, and similarly shaped objects, increasing the complexity of small target detection [9]. Third, during inspection or surveillance, targets often exhibit the characteristics of small targets, with few pixels and relatively simple features, so they are easily overlooked or misdetected. Fourth, UAV target detection often needs to be performed in real time with high accuracy requirements.
At present, UAV target detection methods are usually designed for specific scenarios, typically scenes where the background is uniform and the target sizes are similar, such as maritime or agricultural monitoring. In scenes with multiple target classes and complex backgrounds, detection performance therefore degrades, features are lost at different scales, and the risk of missed detections increases. UAVs also frequently observe numerous small objects with limited detail, which can make them difficult to distinguish from the background or from other similar objects, since the limited resolution causes these small objects to blend into their surroundings or be easily misidentified.
To address these challenges, this paper introduces an enhanced model, CMS-YOLOv7, which builds upon the YOLOv7 framework. The proposed algorithm was evaluated on the VisDrone2019 dataset [10] and aims to advance UAV applications in small target detection by improving both precision and efficiency. This study accomplished the following:
(1) The small target detection layer (P2) was added and the deep detection layer (P5) was removed. At the same time, the K-means++ method [11] was used to recompute the anchor boxes for the modified detection layers. This enhanced the model’s performance in detecting small targets and reduced the likelihood of false or missed detections.
(2) The Inner-MPDIoU loss function was constructed using the idea of Inner-IoU [12], replacing CIoU [13], to improve the detection of small targets in complex surroundings.
(3) The CARAFE module [14] was introduced as a replacement for traditional upsampling methods, providing a larger receptive field and effectively aggregating contextual information to enhance the acquisition of target feature information.
(4) A new convolution module, SPD-Conv [15], was introduced to improve computational efficiency, enhance the model’s performance and generalization ability, reduce information loss, and strengthen feature extraction for small targets.

2. Related Work

2.1. Common Target Detection Algorithms

Traditional target detection techniques are based on manually designed feature extractors [16,17]. These methods require relatively few parameters for a specific detection task, which makes them easy to integrate into small platforms, but they also have notable drawbacks. Designing a different feature extractor for each detection task limits adaptability, and despite reasonable accuracy, these traditional methods often suffer from high computational complexity, slow processing speed, and reduced robustness.
Deep learning-based target detection frameworks are primarily classified into two categories—anchor-based and anchor-free methods—depending on whether they rely on predefined anchor boxes. Within the anchor-based category, there are two primary types: multi-stage and single-stage methods. Multi-stage approaches, including R-CNN [18], Fast R-CNN [19], and Faster R-CNN [20], begin by generating region proposals and then apply convolutional neural networks to categorize objects and perform bounding box regression. These techniques are designed to improve detection accuracy and speed while addressing issues such as class imbalance. Conversely, single-stage methods, for example, the YOLO series [21,22,23,24,25,26,27,28], SSD [29], and RetinaNet [30], integrate feature extraction, classification, and localization into a single process, which leads to faster detection times. Although these single-stage detectors might have slightly lower accuracy compared with multi-stage methods, they significantly reduce computational requirements and are thus more suitable for real-time applications. However, despite the advancements offered by these deep learning techniques, their direct application to UAV aerial image detection may still present challenges.

2.2. YOLO Architectures Suitable for Aerial Imagery

YOLO series algorithms have gained widespread recognition in academia and industry for their excellent detection efficiency and accuracy in target detection. Among them, YOLOv5 [25] and YOLOv7 [27] are currently two of the most widely adopted models. YOLOv5 enhances real-time target detection tasks through advanced deep learning techniques, boasting improvements over its predecessor, YOLOv4 [24], in model architecture, training strategies, and overall performance. YOLOv5 integrates the CSP (Cross Stage Partial) network architecture, which significantly mitigates redundant computations and improves overall computational efficiency. Despite its advancements, YOLOv5 still faces difficulties in detecting small and densely packed objects and grappling with occlusions and pose variations. To overcome these challenges and boost the performance of real-time object detectors, YOLOv7 introduces a novel training scheme known as the Trainable Bag of Freebies (TBoF). This innovative approach has markedly enhanced the accuracy and generalization capabilities of various object detection models.
Extensive research has made numerous advancements to improve YOLO models, specifically for UAV target detection. Researchers have introduced multiple modifications and improvements to optimize these models for the unique challenges associated with aerial surveillance and detection tasks. Zhu et al. [31] proposed TPH-YOLOv5, a modification of YOLOv5 that substitutes the conventional prediction heads with Transformer Prediction Heads. This adaptation improves the model’s capacity to handle complex scene variations. Qin et al. [32] developed the MCA-YOLOv7 algorithm, an improvement on the YOLOv7 model, by optimizing the Feature Pyramid Network (FPN) structure, incorporating attention mechanisms, and enhancing context aggregation blocks to better detect small targets. Wu et al. [33] amended the spatial pyramid pooling framework by integrating and cascading manifold pooling layers, which enhances the network’s capacity for feature learning. Liu et al. [34] proposed EdgeYOLO, which features a lightweight decoupled head for target detection, achieving faster inference speeds and higher accuracy. Similarly, Zhao et al. [35] introduced the MS-YOLOv7 model, which leverages the Swin Transformer and attention mechanisms to improve the feature extraction capabilities of the network’s neck, strengthening detection precision.
These studies have successfully enhanced UAV target detection performance to a certain extent. However, in practical UAV applications, target objects still face issues, such as complex and variable backgrounds, small sizes, and mutual occlusion, leading to missed and false detections. Constructing a more efficient target detection algorithm remains a significant challenge.

2.3. YOLOv7 Network Structure

YOLOv7 [27] represents a sophisticated single-stage target detection model known for its exceptional balance between speed and precision. Its advanced architecture ensures versatility and effectiveness across various application contexts. YOLOv7 offers different network configurations by adjusting width and depth parameters to address varying complexity and performance needs. Each configuration provides a tailored solution for diverse target detection tasks. The architecture of YOLOv7 is organized into three fundamental components: Backbone, Neck, and Head.
The Backbone network performs feature extraction; it consists of two main components: ELAN and MP. The ELAN module captures more contextual information by expanding the convolutional layer’s receptive field, enhancing the network’s learning ability. The MP module enhances feature extraction capability through multi-path convolution and receptive field expansion. The use of these two modules allows the backbone network to efficiently extract and represent image features while maintaining high computational efficiency.
To optimally merge feature information across multiple scales, the Neck network employs both the Feature Pyramid Network (FPN) architecture and the SPP-PANet design. The SPPCSPC structure, which combines Spatial Pyramid Pooling (SPP) with Cross Stage Partial Connections (CSPC), effectively enhances the network’s ability to extract and exploit feature information.
In the Head part, the Rep structure is used to flexibly regulate the number of image channels in the output features. Then, through 1 × 1 convolution operations, the network can accurately predict object confidence, class, and anchor box position.
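As a concrete illustration of this output layout (following the standard YOLO head convention rather than code from the YOLOv7 repository), the channel count of each detection scale follows directly from the number of anchors and classes:

```python
# Output channels of a YOLO-style detection head: one (x, y, w, h, objectness)
# tuple plus class scores per anchor. Illustrative calculation, not YOLOv7 source.
def head_channels(num_anchors, num_classes):
    return num_anchors * (5 + num_classes)

print(head_channels(3, 80))  # 255 -> the 80x80x255 / 40x40x255 / 20x20x255 maps (COCO)
print(head_channels(3, 10))  # 45  -> channel count when training on VisDrone2019's 10 classes
```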

3. Method

3.1. CMS-YOLOv7

In aerial small target detection, while YOLOv7 demonstrates impressive performance due to its advanced network architecture, it faces challenges in extracting fine features from small targets due to the limited pixel information available. Additionally, the restricted receptive field of the model may impede its ability to capture comprehensive contextual information, complicating detection in complex environments. To solve these problems, we propose an enhanced model based on the YOLOv7 architecture named CMS-YOLOv7, specifically designed for UAV aerial image target detection tasks. The network architecture is depicted in Figure 1.
In this model, a small target detection layer is introduced while the deep detection layer is removed, allowing for better extraction of target pixel information and improved accuracy in detecting small targets. To enhance both regression and classification performance, we propose the Inner-MPDIoU loss function and incorporate an auxiliary bounding box for faster and more effective regression results. To overcome the limitations of traditional upsampling methods and improve target feature retrieval, we replace them with the CARAFE module. Furthermore, the integration of the SPD-Conv module enhances the network’s ability to obtain image features and minimize information loss, leading to improved detection of small objects in images.

3.2. Small Target Detection Layer

In UAV images characterized by complex backgrounds, detecting ground objects can be particularly challenging due to factors such as their small size, dense environments, and occlusion. To address these issues, we integrated an additional detection head into the baseline YOLOv7 framework, thereby enhancing its capability to more effectively identify small objects within UAV imagery.
The benchmark YOLOv7 network effectively detects objects of different scales from large to small using three different scale feature maps (80 × 80 × 255, 40 × 40 × 255, and 20 × 20 × 255). By adding a small target detection layer P2 (160 × 160 × 255), more small object feature information can be obtained from the shallow feature map, significantly enhancing the network’s ability to capture medium and small object features. By directly feeding the feature map obtained from this layer into the prediction module, the accuracy of medium and small target detection is effectively improved, and the possibility of detecting errors and detecting omissions is significantly reduced, enhancing the network’s adaptability to target scales and detection robustness.
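The grid sizes quoted above follow directly from the network strides; the short sketch below (ours, for illustration only) reproduces them for a 640 × 640 input and makes explicit how coarse the removed P5 grid is compared with the added P2 grid.

```python
# Feature-map sizes for a 640x640 input at the strides used by the P2-P5 heads.
# Illustrative only; the layer names mirror the text, not YOLOv7 source code.
INPUT_SIZE = 640
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for head, stride in STRIDES.items():
    side = INPUT_SIZE // stride
    print(f"{head}: stride {stride:2d} -> {side} x {side} grid ({side * side} cells)")
# P2: stride  4 -> 160 x 160 grid (25600 cells)  # added in CMS-YOLOv7
# P5: stride 32 -> 20 x 20 grid (400 cells)      # removed in CMS-YOLOv7
```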
Downsampling in deep feature maps often leads to a significant loss of information for small objects, making it difficult to capture their features effectively in the deep detection layer and potentially impacting final predictions during feature fusion. To address this issue, we introduce the P2 small target detection layer and remove the P5 deep detection layer. While adding the P2 detection head increases the network’s parameter count, removing the P5 head reduces the large number of parameters, achieving a balanced adjustment in the overall parameter count.
However, eliminating the P5 detection head may also result in a partial loss of semantic information. Therefore, we further optimized the connectivity channels in the neck network to preserve more semantic information and strengthen the fusion of features.
Since YOLOv7 is an anchor-based target detection algorithm, its performance is sensitive to the sizes of anchor boxes. To optimize the size of anchor boxes, we used the K-means++ method [11]. Table 1 displays the optimized anchor box dimensions tailored for the VisDrone2019 dataset, configured for an image resolution of 640 × 640 pixels.
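As an illustration of this step, the short sketch below clusters ground-truth box sizes with K-means++ (via scikit-learn, whose KMeans uses k-means++ initialization by default); the synthetic data and helper name are our own, not the authors’ script.

```python
import numpy as np
from sklearn.cluster import KMeans  # k-means++ is the default initialization

def compute_anchors(wh, n_anchors=9, seed=0):
    """Cluster ground-truth (width, height) pairs into anchor sizes."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=seed).fit(wh)
    anchors = km.cluster_centers_
    # Sort by area so anchors can be assigned to P2/P3/P4 from small to large.
    return anchors[np.argsort(anchors.prod(axis=1))]

# Example with synthetic box sizes; replace with the dataset's label statistics.
rng = np.random.default_rng(0)
wh = rng.uniform(3, 60, size=(5000, 2))
print(np.round(compute_anchors(wh), 1))
```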

3.3. Inner-MPDIoU

In YOLOv7, the CIoU loss function [13] is employed to enhance the accuracy of bounding box alignment evaluation. Unlike the traditional IoU loss, CIoU provides a more nuanced assessment by incorporating the overlap area, the distance between box centers, and the difference in aspect ratios. This detailed evaluation improves the precision of the predicted bounding boxes. The calculation formula of CIoU is provided in Equation (1).
$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \qquad (1)$
where $\mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$ indicates the extent of overlap between the predicted box and the truth box, $v = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$ measures the consistency of the aspect ratios, and $\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$ is the balancing parameter. $b^{gt}$ and $b$ denote the center points of the truth box and the predicted box, respectively, $\rho(b, b^{gt})$ is the Euclidean distance between them, and $c$ is the diagonal length of the smallest box enclosing both. $L_{IoU} = 1 - \mathrm{IoU}$ is defined as the loss corresponding to IoU.
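For reference, the following is a compact Python sketch of Equation (1) for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format; it follows the standard CIoU definition and is our own illustration rather than YOLOv7’s implementation.

```python
import math

def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss (Eq. 1) for two boxes given as (x1, y1, x2, y2)."""
    # Overlap term (IoU).
    iw = max(0.0, min(box[2], box_gt[2]) - max(box[0], box_gt[0]))
    ih = max(0.0, min(box[3], box_gt[3]) - max(box[1], box_gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]
    iou = inter / (w * h + wg * hg - inter + eps)

    # Normalized center distance rho^2 / c^2 (c = diagonal of the enclosing box).
    rho2 = ((box[0] + box[2]) / 2 - (box_gt[0] + box_gt[2]) / 2) ** 2 + \
           ((box[1] + box[3]) / 2 - (box_gt[1] + box_gt[3]) / 2) ** 2
    cw = max(box[2], box_gt[2]) - min(box[0], box_gt[0])
    ch = max(box[3], box_gt[3]) - min(box[1], box_gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

print(round(ciou_loss((10, 10, 50, 40), (12, 14, 48, 44)), 4))
```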
Many IoU-based Bounding Box Regression (BBR) loss functions aim to enhance convergence speed by incorporating additional loss terms, yet they frequently overlook the intrinsic limitations of IoU loss. Inner-IoU [12] addresses these limitations by integrating an auxiliary bounding box loss and applying a scaling factor to adjust the auxiliary bounding box size. This approach refines the bounding box regression process and improves detection accuracy. The detailed formulas are provided in Equations (2)–(6):
$b_l = x_c - \frac{\omega \times ratio}{2}, \qquad b_r = x_c + \frac{\omega \times ratio}{2} \qquad (2)$
$b_t = y_c - \frac{h \times ratio}{2}, \qquad b_b = y_c + \frac{h \times ratio}{2} \qquad (3)$
$inter = \left(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\right) \times \left(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\right) \qquad (4)$
$union = (\omega^{gt} \times h^{gt}) \times ratio^{2} + (\omega \times h) \times ratio^{2} - inter \qquad (5)$
$\mathrm{IoU}_{inner} = \frac{inter}{union} \qquad (6)$
where $ratio \in [0.5, 1.5]$; when $ratio = 1$, Inner-IoU is identical to the ordinary IoU. When $ratio > 1$, the auxiliary bounding box is larger than the actual bounding box, promoting the regression of low-IoU samples, which is beneficial for detecting small objects in the image. When $ratio < 1$, the auxiliary bounding box is smaller than the actual bounding box, accelerating the convergence of high-IoU samples, which is beneficial for detecting large objects in the image.
MPDIoU [36] is an advanced bounding box regression metric designed to improve the precision of object detection. Unlike traditional IoU, which only measures the overlap between the predicted and truth bounding boxes, MPDIoU additionally penalizes the distances between the corresponding corner points of the two boxes, providing a more direct assessment of localization accuracy. This metric helps address challenges related to precise object boundary delineation and enhances the evaluation of detection models, particularly in complex scenarios where traditional IoU may fall short.
$d_1^2 = (x_1^{prd} - x_1^{gt})^{2} + (y_1^{prd} - y_1^{gt})^{2} \qquad (7)$
$d_2^2 = (x_2^{prd} - x_2^{gt})^{2} + (y_2^{prd} - y_2^{gt})^{2} \qquad (8)$
$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^{2} + h^{2}} - \frac{d_2^2}{w^{2} + h^{2}} \qquad (9)$
$L_{MPDIoU} = 1 - \mathrm{MPDIoU} \qquad (10)$
Here, $d_1^2$ and $d_2^2$ denote the squared distances between the top-left corners and between the bottom-right corners of the predicted and truth boxes, respectively, and $w$ and $h$ are the width and height of the input image. Building on the concept of Inner-IoU, MPDIoU is enhanced by employing a scale factor to generate auxiliary boxes of varying scales for the loss calculation. This adjustment yields faster and more effective regression. The Inner-MPDIoU calculation formula is shown in Equation (11).
$L_{Inner\text{-}MPDIoU} = L_{MPDIoU} + \mathrm{IoU} - \mathrm{IoU}_{inner} \qquad (11)$
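Putting Equations (2)–(11) together, the following is a plain-Python sketch of the Inner-MPDIoU loss for boxes given as (center x, center y, width, height); it reflects our reading of the formulas (with $w$ and $h$ in Equation (9) taken as the image width and height), not the authors’ implementation.

```python
def inner_mpdiou_loss(box, box_gt, img_w, img_h, ratio=1.33, eps=1e-9):
    """Inner-MPDIoU loss (Eqs. 2-11); boxes are (xc, yc, w, h) in pixels."""
    (xc, yc, w, h), (xcg, ycg, wg, hg) = box, box_gt

    def corners(cx, cy, bw, bh):
        return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

    def overlap(a, b):
        return max(0.0, min(a[2], b[2]) - max(a[0], b[0])) * \
               max(0.0, min(a[3], b[3]) - max(a[1], b[1]))

    p, g = corners(xc, yc, w, h), corners(xcg, ycg, wg, hg)

    # Plain IoU of the original boxes.
    inter = overlap(p, g)
    iou = inter / (w * h + wg * hg - inter + eps)

    # Inner-IoU: the same overlap computed on ratio-scaled auxiliary boxes (Eqs. 2-6).
    p_in = corners(xc, yc, w * ratio, h * ratio)
    g_in = corners(xcg, ycg, wg * ratio, hg * ratio)
    inter_in = overlap(p_in, g_in)
    union_in = (wg * hg) * ratio ** 2 + (w * h) * ratio ** 2 - inter_in
    iou_inner = inter_in / (union_in + eps)

    # MPDIoU: corner-distance penalties normalized by the squared image diagonal (Eqs. 7-10).
    d1 = (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2      # top-left corners
    d2 = (p[2] - g[2]) ** 2 + (p[3] - g[3]) ** 2      # bottom-right corners
    mpdiou = iou - d1 / (img_w ** 2 + img_h ** 2) - d2 / (img_w ** 2 + img_h ** 2)

    # Eq. 11: MPDIoU loss plus the gap between the ordinary and inner IoU.
    return (1 - mpdiou) + iou - iou_inner

print(round(inner_mpdiou_loss((30, 25, 40, 30), (32, 28, 36, 34), img_w=640, img_h=640), 4))
```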

3.4. CARAFE

Upsampling is a crucial operation in convolutional neural networks (CNNs) primarily used in the feature fusion stage to improve feature resolution. This process increases the size of feature maps, allowing for finer detail representation in images. In deep learning and computer vision, various methods for upsampling exist, including interpolation techniques like nearest-neighbor, bilinear, and trilinear interpolation. YOLOv7 uses nearest-neighbor interpolation for upsampling, which relies only on adjacent pixels and does not fully utilize the semantic information within the feature map.
CARAFE, introduced by Wang et al. [14], represents a lightweight and resource-efficient image upsampling technique designed to overcome the limitations of traditional nearest-neighbor interpolation methods, particularly in handling small targets where feature degradation often occurs. CARAFE enhances feature reconstruction by leveraging larger receptive fields to incorporate a richer context of feature information, thus improving the accuracy of upsampled images.
In this study, we integrated the CARAFE module into the neck region of our network, replacing the previous upsampling component. The CARAFE module is composed of two main parts: the upsampling prediction module and the feature reassembly module, as depicted in Figure 2.
In the upsampling prediction module, the process begins with an input feature map of size $H \times W \times C$. For an upsampling factor $\sigma$, a $1 \times 1$ convolutional layer first reduces the number of channels. A convolution kernel of size $k_{up} \times k_{up}$ then performs content encoding, expanding the channel count to $\sigma^{2} \times k_{up}^{2}$ so that a reassembly kernel is predicted for each output position; each predicted kernel is subsequently normalized so that its weights sum to one. Next, the feature reassembly module applies each normalized reassembly kernel to the corresponding region of the original feature map through element-wise multiplication and summation. This operation reconstructs the upsampled output from the information preserved in the original features. By carefully reassembling and enhancing these features, the module achieves a high-quality upsampling result that maintains fidelity and detail from the initial input.
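The sketch below is a minimal PyTorch rendering of this two-part design, assuming an upsampling factor of 2 and a 5 × 5 reassembly kernel; it is our simplified reading of the module described above, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Simplified CARAFE upsampler: kernel prediction + feature reassembly (our sketch)."""

    def __init__(self, c, c_mid=64, scale=2, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                      # 1x1 channel reduction
        self.encode = nn.Conv2d(c_mid, scale**2 * k_up**2, k_enc,   # predict reassembly kernels
                                padding=k_enc // 2)
        self.shuffle = nn.PixelShuffle(scale)                       # -> (k_up^2, s*H, s*W)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up

        # Kernel prediction: one normalized k_up x k_up kernel per output pixel.
        kernels = F.softmax(self.shuffle(self.encode(self.compress(x))), dim=1)

        # Feature reassembly: gather the k_up x k_up neighborhood of each source pixel,
        # then let every output pixel reuse the neighborhood of its source location.
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, s * h, s * w)

        # Weighted sum over the reassembly window.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)          # (b, c, s*h, s*w)

x = torch.randn(1, 128, 40, 40)
print(CARAFE(128)(x).shape)   # torch.Size([1, 128, 80, 80])
```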

3.5. SPD-Conv

In the domains of target detection and image classification, convolutional neural networks (CNNs) have demonstrated exceptional performance and set new benchmarks. However, for small target detection tasks, especially when these small objects overlap, occlude, or are very small and blurry in the image, traditional CNN architectures often face challenges. These scenarios require the network to capture and retain fine-grained image information, but current designs often struggle to handle these details, leading to a decline in feature learning ability and significantly affecting model performance. To tackle this challenge, the paper presents a novel convolutional neural network architecture, SPD-Conv [15], integrated into the backbone network. The SPD-Conv structure is depicted in Figure 3.
The SPD-Conv architecture introduces an innovative approach to convolutional neural networks by incorporating two key components: an SPD (space-to-depth) layer and a non-strided convolutional layer. The SPD layer preserves all information during downsampling by moving spatial information into the channel dimension, avoiding any information loss. A non-strided convolutional layer is added after each SPD layer; it reduces the number of channels through learned parameters, minimizing the loss of discriminative information. This design is particularly advantageous for low-resolution images and small target detection. The SPD layer works by slicing the original feature map into multiple sub-feature maps, each obtained by sampling the input at the downsampling stride with a different offset. These sub-feature maps are then concatenated along the channel dimension, producing a feature map with reduced spatial scale but richer channel information. When this method is applied to the backbone, more of the image’s feature information is preserved through the concatenation of the sub-feature maps; for the same downsampling factor, the computational efficiency of the model is improved, model performance and generalization are enhanced, and information loss is minimized.
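A minimal PyTorch sketch of such a block follows, assuming a downsampling factor of 2, a 3 × 3 non-strided convolution, and a BatchNorm + SiLU pairing (our choices for illustration, not necessarily those of the original module).

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (our sketch of SPD-Conv).
    Downsamples spatially by `scale` without discarding any pixels."""

    def __init__(self, c_in, c_out, scale=2):
        super().__init__()
        self.scale = scale
        # Non-strided 3x3 conv reduces the expanded channel count back to c_out.
        self.conv = nn.Sequential(
            nn.Conv2d(c_in * scale**2, c_out, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU())

    def forward(self, x):
        s = self.scale
        b, c, h, w = x.shape
        # Space-to-depth: fold every s x s spatial block into the channel dimension.
        x = x.view(b, c, h // s, s, w // s, s)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(b, c * s * s, h // s, w // s)
        return self.conv(x)

x = torch.randn(1, 64, 160, 160)
print(SPDConv(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```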

4. Experiments

4.1. Dataset

To assess the ability of CMS-YOLOv7, we used the VisDrone2019 dataset [10], which was compiled by the AISKYEYE team at Tianjin University’s Machine Learning and Data Mining Laboratory. This dataset comprises 6471 training images, 548 validation images, and 1610 testing images, all captured from a drone’s perspective. It includes ten kinds of objects: pedestrians, people, bicycles, cars, vans, trucks, tricycles, awning-tricycles, buses, and motors. The images feature numerous small to medium-sized objects, some of which overlap, and cover diverse scenes such as highways and intersections under various weather conditions, like sunny, rainy, cloudy, and nighttime conditions. These characteristics pose significant challenges for target detection. The VisDrone2019 dataset is shown in Figure 4.

4.2. Parameter Settings

The experiments were performed on a Windows 11 system featuring an Intel® Core™ i7-13700F CPU, an RTX 4090 24 GB GPU, CUDA 11.1, and Python 3.8, with PyTorch 1.13.0 as the deep learning framework. The training configuration included a batch size of 8 and 300 epochs and input image dimensions of 640 × 640 pixels. All other parameters were configured to the default settings of YOLOv7.
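For convenience, these settings can be summarized as follows (an illustrative Python summary in our own notation, not a configuration file from the YOLOv7 repository; unlisted hyperparameters keep their YOLOv7 defaults).

```python
# Illustrative summary of the training environment and hyperparameters used here.
TRAIN_CONFIG = {
    "framework": "PyTorch 1.13.0, CUDA 11.1, Python 3.8",
    "gpu": "NVIDIA RTX 4090 (24 GB)",
    "batch_size": 8,
    "epochs": 300,
    "img_size": (640, 640),
    "loss": "Inner-MPDIoU (ratio = 1.33)",
}
```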

4.3. Evaluation Metrics

Five metrics are used in this paper to evaluate the performance of the target detection algorithm: mAP@0.5, mAP@0.5:0.95, the number of parameters, GFLOPs, and FPS.
In the context of object detection evaluations, mAP@0.5 refers to the mean average precision (mAP) computed at a fixed IoU threshold of 0.5. This metric assesses the accuracy of the model’s predictions by measuring how well the predicted bounding boxes overlap with the truth boxes when this overlap is at least 50%.
By contrast, mAP@0.5:0.95 offers a more detailed performance assessment by averaging mAP values over IoU thresholds from 0.5 to 0.95 in increments of 0.05. This broader evaluation captures the model’s performance across various levels of localization accuracy, providing a more comprehensive picture of its robustness and effectiveness in different detection scenarios. By incorporating multiple IoU thresholds, mAP@0.5:0.95 reflects how well the model performs under varying degrees of overlap, offering a more nuanced understanding of its overall detection capabilities.
Precision (P) measures the percentage of predicted positive samples that are actually positive. The formula for calculating Precision is given in Equation (12).
$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (12)$
Recall (R) indicates the percentage of actual positive samples that are correctly predicted as positive. The formula for calculating Recall is given in Equation (13).
$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (13)$
mAP is an integral metric for assessing the performance of object detection models across a range of classes. It is calculated as the mean of the average precision (AP) scores over all classes in the dataset. The specific formula of mAP is given in Equation (14).
$\mathrm{mAP} = \frac{\sum_{n=1}^{N} \int_{0}^{1} P(R)\,dR}{N} \qquad (14)$
where $N$ denotes the number of object classes and $P(R)$ is the precision–recall curve.
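To make these definitions concrete, the sketch below (ours) computes precision, recall, and a single-class AP from scored detections that have already been matched to ground truth at a chosen IoU threshold; averaging the resulting AP over all classes gives the mAP of Equation (14).

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """Precision, recall, and AP for a single class (Eqs. 12-14 for one class).
    scores: detection confidences; is_tp: 1 if the detection matched a ground truth."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    precision = tp / (tp + fp)                      # Eq. 12 at each confidence cutoff
    recall = tp / max(num_gt, 1)                    # Eq. 13 at each confidence cutoff
    # AP: area under the precision-recall curve using the monotone precision envelope.
    prec_env = np.maximum.accumulate(precision[::-1])[::-1]
    recall_pts = np.concatenate(([0.0], recall))
    ap = float(np.sum((recall_pts[1:] - recall_pts[:-1]) * prec_env))
    return float(precision[-1]), float(recall[-1]), ap

# Five detections (four true positives) against six ground-truth objects.
p, r, ap = precision_recall_ap([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 1], num_gt=6)
print(round(p, 3), round(r, 3), round(ap, 3))
```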
Additionally, GFLOPs serve as a metric for evaluating the computational complexity of a model, reflecting the volume of arithmetic operations required during inference. FPS expresses the inference speed of the model when it actually performs the detection task, and the number of parameters measures the size of the model.

4.4. Ablation Experiments

In order to further evaluate the effectiveness of the CMS-YOLOv7 model, ablation studies were performed using the VisDrone2019 dataset.

4.4.1. Comparison with Baseline Model

The validation set of the VisDrone2019 dataset was used to assess the performance of each component introduced. To effectively demonstrate changes in algorithm performance, we measured mAP@0.5, mAP@0.5:0.95, the number of parameters, GFLOPs, and FPS. All experiments were conducted under uniform conditions with consistent parameters to ensure accuracy.
Table 2 outlines the progressive introduction of several key improvement strategies relative to the baseline YOLOv7. This detailed the development process of the CMS-YOLOv7 on the VisDrone2019 dataset and analyzed the resulting performance changes. The introduction of these strategies led to a significant enhancement in the model’s detection performance.
Initially, the addition of the P2 detection head produced notable improvements, with mAP@0.5 and mAP@0.5:0.95 increasing by 1.2% and 1.8%, respectively. Subsequently, removing the P5 detection head and optimizing the connectivity channels in the neck network led to further gains, with mAP@0.5 and mAP@0.5:0.95 increasing by an additional 0.3% and 0.2%. These modifications significantly enhanced the model’s capability to detect small objects while reducing the parameter count by 18.82 M, at the cost of a small increase in GFLOPs compared with the baseline YOLOv7.
Next, the Inner-MPDIoU loss function was adopted in place of the original CIoU loss function. This change yielded an increase of 0.4% in mAP@0.5 and 0.2% in mAP@0.5:0.95 without adding any parameters or GFLOPs.
Following this, replacing the traditional upsampling method with the CARAFE module slightly increased both the parameter count and GFLOPs, while improving mAP@0.5 by 0.4% and mAP@0.5:0.95 by 0.3%.
Finally, SPD-Conv was introduced into the network’s backbone to enhance the detection of small objects. This addition led to a notable improvement, with mAP@0.5 and mAP@0.5:0.95 increasing by 1.2% and 0.5%, respectively. Although it slightly increased the parameter count and added some computational overhead, the impact on the model’s real-time capability remained small.
Overall, these four improvements increased mAP@0.5 by 3.5% and mAP@0.5:0.95 by 3.0% while reducing the parameter count by 18.54 M, achieving a degree of model lightweighting.

4.4.2. Determining Parameters in Inner-MPDIoU

In Inner-IoU, the scaling factor $ratio$ adjusts the size of the auxiliary bounding box used in the loss calculation. Increasing $ratio$ enlarges the auxiliary bounding box, which helps the model better capture feature information for small and medium-sized targets, thereby enhancing detection accuracy. To assess the effectiveness of this approach, an empirical study was conducted to examine how varying $ratio$ affects the performance of the target detection algorithm.
First, the P2 detection head was introduced into the YOLOv7 network, the P5 detection head was removed, and the neck connection channels were adjusted. On this basis, Inner-MPDIoU was then introduced, and the $ratio$ value was tuned to adapt to the changes in feature map size brought about by the previous improvements. This adjustment ensured that the model could fully exploit the feature information of small and medium-sized targets extracted by the earlier modifications, further improving detection accuracy.
Table 3 shows how varying $ratio$ within the range $[1.30, 1.45]$ affects both mAP@0.5 and mAP@0.5:0.95. The best improvement was observed at $ratio = 1.33$; therefore, the ratio in Inner-MPDIoU was set to 1.33 for all subsequent experiments.

4.5. Detection Results Visualization

From the detailed analysis of the experimental data, it is evident that the CMS-YOLOv7 model significantly outperforms the baseline YOLOv7 on the VisDrone2019 dataset. To illustrate the performance differences between the two models in target detection, we present predictions on three representative images. These examples offer an intuitive comparison of their relative effectiveness.
Figure 5 presents the prediction results categorized into three groups—(a), (b), and (c)—for comparative analysis. In each group, the left image shows the detection outcomes from the baseline YOLOv7 model, while the right image depicts the results from the CMS-YOLOv7 model.
In group (a) images, the YOLOv7 baseline model detected 72 pedestrians, 10 individuals, 1 bicycle, 17 cars, 4 vans, 2 trucks, 1 tricycle, 1 tricycle with awning, and 5 motorcycles. By contrast, the CMS-YOLOv7 model detected 85 pedestrians, 11 individuals, 2 bicycles, 19 cars, 4 vans, 2 trucks, 1 tricycle, 1 tricycle with awning, and 6 motorcycles.
In group (b) images, the baseline YOLOv7 model detected 1 pedestrian, 2 individuals, 39 cars, 7 trucks, 2 buses, 2 motorcycles, and 3 other vehicles. The CMS-YOLOv7 model detected 1 pedestrian, 2 individuals, 43 cars, 9 trucks, 3 buses, 2 motorcycles, and 3 other vehicles.
In group (c) images, the baseline YOLOv7 model detected 61 pedestrians, 10 individuals, 1 bicycle, 1 car, and 2 motorcycles. The CMS-YOLOv7 model detected 73 pedestrians, 10 individuals, 1 bicycle, 1 car, and 2 motorcycles.
The regions marked with red boxes in the figure show more intuitively that CMS-YOLOv7 detects more small targets.
In summary, the CMS-YOLOv7 model exhibited markedly superior detection performance compared with the baseline YOLOv7 on the VisDrone2019 dataset. The CMS-YOLOv7 model not only detected a greater number of targets but also demonstrated enhanced accuracy and improved capability in detecting small objects. These experimental results strongly support the application of the CMS-YOLOv7 model for tasks such as drone target detection.

4.6. Comparison with Other Algorithms

To thoroughly assess the performance of the CMS-YOLOv7 model for target detection, a comparative analysis was performed against several established models, including YOLOv4, YOLOv5l, TPH-YOLOv5 [31], YOLOv6m, YOLOv7, and YOLOv8. Comprehensive experiments were conducted on the VisDrone2019 dataset, focusing on key evaluation metrics such as mAP@0.5 and mAP@0.5:0.95 to gauge detection accuracy across various IoU thresholds. This comparative evaluation facilitates a nuanced understanding of the CMS-YOLOv7 model’s effectiveness in target detection, with the detailed results summarized in Table 4.
The experimental findings reveal that the CMS-YOLOv7 model exhibits outstanding performance on the VisDrone2019 dataset. It shows notable improvements in both mAP@0.5 and mAP@0.5:0.95, highlighting its effectiveness and superiority in target detection tasks. Given that the VisDrone2019 dataset closely mirrors real-world scenarios in drone target detection, these results underscore the practicality and effectiveness of CMS-YOLOv7 for real-world applications.

5. Conclusions

Addressing prevalent challenges in UAV image object detection, such as multi-scale objects, dense distributions, occlusions, and the high prevalence of small targets, this paper introduces the CMS-YOLOv7 algorithm. This approach enhances detection performance by incorporating a specialized small target detection layer (P2), eliminating the deep detection layer (P5), and adjusting the neck connection channels, thereby significantly improving the detection of small objects in aerial imagery. Replacing CIoU with Inner-MPDIoU accelerated the bounding box regression process by incorporating auxiliary bounding boxes into the loss calculation, thereby enhancing learning on small target samples in complex backgrounds. Substituting the CARAFE module for the traditional upsampling module effectively aggregated contextual information and improved feature acquisition. Finally, integrating SPD-Conv into the backbone mitigated information loss in images and bolstered the model’s capacity to extract features from small targets, enhancing overall detection performance. The experimental results demonstrate that CMS-YOLOv7 achieves significantly higher detection accuracy than other advanced algorithms, particularly excelling at detecting small targets.
In addition, CMS-YOLOv7 delivers this detection performance while significantly reducing the number of model parameters, yielding a lighter model. Although the GFLOPs increase somewhat, the additional computation contributes to the improved detection performance. In summary, the model meets the accuracy and real-time requirements of UAV image detection.

Author Contributions

Conceptualization, J.Q.; methodology, J.Q.; software, J.Q. and W.Y.; validation, J.Q. and W.Y.; formal analysis, J.Q. and X.F.; investigation, J.Q. and X.F.; resources, J.Q.; data curation, J.Q.; writing—original draft preparation, J.Q., W.Y. and X.F.; writing—review and editing, J.Q., W.Y. and X.F.; visualization, J.Q.; supervision, Z.M. and C.T.; project administration, Z.M. and C.T.; funding acquisition, Z.M. and C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (under Grant No. 62266004), supported by the Open Project Program of Guangxi Key Laboratory of Digital Infrastructure (under Grant No. GXDINBC202401) and the National Training Program of Innovation and Entrepreneurship for Undergraduates (under Grant No. 202310593066).

Data Availability Statement

Data set: https://github.com/VisDrone (accessed on 23 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fan, B.; Li, Y.; Zhang, R.; Fu, Q. Review on the technological development and application of UAV systems. Chin. J. Electron. 2020, 29, 199–207. [Google Scholar] [CrossRef]
  2. Do-Duy, T.; Nguyen, L.D.; Duong, T.Q.; Khosravirad, S.R.; Claussen, H. Joint optimisation of real-time deployment and resource allocation for UAV-aided disaster emergency communications. IEEE J. Sel. Areas Commun. 2021, 39, 3411–3424. [Google Scholar] [CrossRef]
  3. Villarreal, C.A.; Garzón, C.G.; Mora, J.P.; Rojas, J.D.; Ríos, C.A. Workflow for capturing information and characterizing difficult-to-access geological outcrops using unmanned aerial vehicle-based digital photogrammetric data. J. Ind. Inf. Integr. 2022, 26, 100292. [Google Scholar] [CrossRef]
  4. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  5. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  6. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  7. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  8. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  9. Oksuz, K.; Cam, B.C.; Kalkan, S.; Akbas, E. Imbalance problems in object detection: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3388–3415. [Google Scholar] [CrossRef] [PubMed]
  10. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  11. Arthur, D.; Vassilvitskii, S. k-Means++: The Advantages of Careful Seeding; Stanford: Stanford, CA, USA, 2006. [Google Scholar]
  12. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  13. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3007–3016. [Google Scholar]
  15. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459. [Google Scholar]
  16. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  17. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  18. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  19. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Las Condes, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  23. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  24. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  25. Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 23 July 2024).
  26. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  27. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  28. Wang, C.; Yeh, I.; Liao, H. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37, Proceedings, Part I 14. [Google Scholar]
  30. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  31. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  32. Qin, Z.; Chen, D.; Wang, H. MCA-YOLOv7: An Improved UAV Target Detection Algorithm Based on YOLOv7. IEEE Access 2024, 12, 42642–42650. [Google Scholar] [CrossRef]
  33. Wu, H.; Hua, Y.; Zou, H.; Ke, G. A lightweight network for vehicle detection based on embedded system. J. Supercomput. 2022, 78, 18209–18224. [Google Scholar] [CrossRef]
  34. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7507–7512. [Google Scholar]
  35. Zhao, L.; Zhu, M. MS-YOLOv7: YOLOv7 based on multi-scale for object detection on UAV aerial photography. Drones 2023, 7, 188. [Google Scholar] [CrossRef]
  36. Siliang, M.; Yong, X. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
Figure 1. Network architecture of CMS-YOLOv7.
Figure 2. Structure of CARAFE.
Figure 3. Structure of SPD-Conv.
Figure 4. Partial image of the VisDrone2019 dataset.
Figure 5. Comparative analysis of baseline YOLOv7 and CMS-YOLOv7 on the VisDrone2019 dataset.
Table 1. Anchor box size setting.

Detection Layer | Feature Map Size | Anchor Frame Setting
P2 | 160 × 160 | [3, 4, 4, 8, 7, 6]
P3 | 80 × 80 | [7, 12, 14, 8, 11, 17]
P4 | 40 × 40 | [27, 15, 21, 28, 48, 38]
Table 2. Experimental results on the VisDrone2019 dataset.

Add P2 | Remove P5 with Optimized Neck | Inner-MPDIoU | CARAFE | SPD-Conv | mAP@0.5 | mAP@0.5:0.95 | Params (M) | GFLOPs | FPS
– | – | – | – | – | 48.8% | 27.7% | 36.53 | 103.3 | 119
✓ | – | – | – | – | 50.0% | 29.5% | 37.08 | 117.1 | 96
✓ | ✓ | – | – | – | 50.3% | 29.7% | 17.71 | 116.9 | 102
✓ | ✓ | ✓ | – | – | 50.7% | 29.9% | 17.71 | 116.9 | 104
✓ | ✓ | ✓ | ✓ | – | 51.1% | 30.2% | 17.84 | 117.9 | 88
✓ | ✓ | ✓ | ✓ | ✓ | 52.3% | 30.7% | 17.99 | 166.0 | 73
Table 3. Effect of ratio values on experimental results.

Ratio | mAP@0.5 | mAP@0.5:0.95
1.30 | 50.6% | 29.8%
1.33 | 50.7% | 29.9%
1.35 | 50.7% | 29.8%
1.37 | 50.6% | 29.7%
1.40 | 50.5% | 29.7%
1.41 | 50.5% | 29.6%
1.45 | 50.3% | 29.5%
Table 4. Comparison between different models.

Model | mAP@0.5 | mAP@0.5:0.95
YOLOv4 | 47.5% | 26.1%
YOLOv5l | 39.8% | 22.9%
TPH-YOLOv5 | 46.4% | 27.6%
YOLOv6m | 31.9% | 21.8%
YOLOv7 | 48.8% | 27.7%
YOLOv8 | 45.5% | 27.8%
CMS-YOLOv7 | 52.3% | 30.7%

