A Small Object Detection Method for Drone-Captured Images Based on Improved YOLOv7

Zhao, Dewei; Shao, Faming; Liu, Qiang; Yang, Li; Zhang, Heng; Zhang, Zihan

doi:10.3390/rs16061002

by Dewei Zhao, Faming Shao, Qiang Liu ^*, Li Yang, Heng Zhang and Zihan Zhang

College of Field Engineering, Army Engineering University of PLA, Nanjing 210007, China

^* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(6), 1002; https://doi.org/10.3390/rs16061002

Submission received: 28 January 2024 / Revised: 8 March 2024 / Accepted: 8 March 2024 / Published: 12 March 2024

Abstract

Due to the widespread use of drones, the demand for more accurate object detection algorithms for images captured by drone platforms has become increasingly urgent. This article addresses this issue by first analyzing the characteristics of drone-related datasets. We then select the widely used YOLOv7 algorithm as the foundation, analyze its limitations, and propose targeted solutions. To enhance the network’s ability to extract features from small objects, we introduce non-strided convolution modules and integrate attention-based modules into the baseline network. Additionally, we improve the semantic information expression for small targets by optimizing the feature fusion process in the network. During training, we adopt the Lion optimizer and the MPDIoU loss to further boost the overall performance of the network. The improved network achieves mAP₅₀ scores of 56.8% and 94.6% on the VisDrone2019 and NWPU VHR-10 datasets, respectively, and performs particularly well in detecting small objects.

Keywords: object detection; drone; improved YOLOv7

1. Introduction

As drone manufacturing and control technology has matured, lightweight, compact, and inexpensive drones have come into wide use in fields such as agricultural production, disaster relief, safety prevention and control, and express delivery, with a tremendous impact on people’s production and lives [1]. With the continuous expansion of drone applications, the demand for drone object detection is also increasing. Because drones usually capture images from high positions, objects in the images are often small and are seen from a limited range of angles, which makes drone object detection different from, and more difficult than, ordinary object detection. Exploring a high-precision object detection algorithm suitable for drone platforms is a popular research direction in object detection today [2,3,4].

Currently, object detection is primarily classified into two domains: traditional object detection techniques and neural network-based object detection approaches. The traditional object detection algorithm first extracts features from the target through manually designed feature extractors and then performs object detection. This type of algorithm is not widely used due to the poor generalization of manually designed feature extractors. Neural network-based object detection not only avoids the heavy workload of manual design but also has strong generalization performance for different objects [2]. Drone object detection needs to cater to the demands for a compact model size and rapid processing speed, and for this reason, we have chosen the YOLOv7 [5] algorithm with real-time inference speed as the research foundation in neural network-based object detection algorithms.

We first analyzed the object sizes in the VisDrone2019 [6] and NWPU VHR-10 [7] datasets, as shown in Figure 1. Following the relevant literature [6], objects smaller than 32 × 32 pixels are referred to as small objects. Figure 1 shows that small objects account for 88.12% and 47.02% of the objects in the two datasets, respectively, and that objects smaller than 64 × 64 pixels account for 92.7% and 83.06%, respectively.
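As an illustration of how such proportions can be computed, the following minimal Python sketch counts the share of bounding boxes below a given size. It assumes that "below t × t" means a box area smaller than t² pixels, a convention the text does not state explicitly. The function name `size_fractions` and the example boxes are hypothetical and are not taken from either dataset.

```python
def size_fractions(boxes, thresholds=(32, 64)):
    """Return, for each threshold t, the fraction of boxes whose area is below t * t.

    boxes: iterable of (width, height) pairs in pixels.
    Assumption: "below t x t" is interpreted as area < t * t.
    """
    boxes = list(boxes)
    if not boxes:
        raise ValueError("boxes must not be empty")
    fractions = {}
    for t in thresholds:
        below = sum(1 for w, h in boxes if w * h < t * t)
        fractions[t] = below / len(boxes)
    return fractions


# Hypothetical boxes, for illustration only.
example_boxes = [(10, 12), (20, 40), (50, 50), (100, 80)]
example_fractions = size_fractions(example_boxes)
print(example_fractions)  # {32: 0.5, 64: 0.75}
```

Applied to the annotation files of a dataset, the same counting would yield proportions of the kind reported in Figure 1, provided the same size convention is used.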

Secondly, we thoroughly investigated the YOLOv7 network. Although the network offers good detection accuracy and fast detection speed, we found that its internal structure exhibits several deficiencies when applied to drone object detection. Firstly, pooling and downsampling in the backbone erase small targets that already carry little information. Secondly, the 160 × 160 feature layer in the backbone, which is the layer richest in feature information about small targets below 32 × 32 pixels, is not fully utilized. Finally, the feature fusion of the network is insufficient. Inspired by recent advances in deep learning research, we decided to improve both the structure and the training of the YOLOv7 network.

In this article, our contributions mainly include the following aspects:

By analyzing relevant datasets, we identified the shortcomings of the original network. We improved the backbone of the network, reducing the loss of small object information caused by operations such as pooling and strided convolution and enhancing the network’s capability to capture intricate features of small objects.
We included attention mechanism modules as components of the baseline, integrating the advantages of attention mechanisms into the backbone of the object detection network and enhancing the network’s generalization performance.
Continuing the design concept of the original network, we connected more feature layers through cross connections in the neck (feature fusion) part of the network and included the 160 × 160 feature layer of the backbone in the feature fusion, enriching the network’s output features with semantic information about small objects.

This article is structured into seven parts. In Section 2, we introduce the work related to our research. In Section 3, we describe the improvements and their implementation details. In Section 4, we provide an overview of the model training process, including the configuration, parameters, and training methods. In Section 5, we perform a comparative analysis of the training outcomes and present visualizations of selected results. In Section 6, we examine various phenomena and engage in further discussion. Finally, in Section 7, we summarize the current work and propose prospects for the next steps.

2. Related Work

In this section, we first give an overview of the historical progression of object detection and its prominent directions. Secondly, we introduce the research that inspired our improvements to the network. Finally, we compare our research with similar works.

The research goal of object detection is to enable machines to recognize the category and position of objects in images correctly and quickly. The position of a target is currently expressed by outlining the object with a rectangular box. Traditional object detection is based on manually designed feature extractors [8,9,10,11]. The advantage of these methods is that they have few parameters for specific detection targets and can be easily embedded into various small platforms. Their disadvantage is that a distinct feature extractor must be devised for each kind of target, so the range of application of the algorithm is limited. Deep learning-based object detection algorithms obtain the relevant features of the target through learning and generalize well. They are progressively supplanting traditional algorithms and have become the prevailing research direction.

Based on network types, object detection algorithms using deep learning can be classified into two categories: those built on a convolutional neural network (CNN) architecture [12] and those built on a transformer network architecture [13]. Within the CNN-based algorithms, there are one-stage and two-stage networks. The two-stage networks include R-CNN [14], Fast R-CNN [15], Mask R-CNN [17], Faster R-CNN [18], Cascade R-CNN [19], and more. A drawback of this network type is that it generates numerous candidate boxes during inference, leading to resource-intensive computation and slower performance. One-stage detection algorithms include SSD [16], YOLOv1~YOLOv5 [20,21,22,23,24], YOLOX [25], and YOLOv6~YOLOv7 [5,26]. The advantages of this type of network are fast inference speed, the ability to achieve real-time inference, short model training time, and detection accuracy that keeps improving with version updates. Algorithms based on a transformer network architecture, such as DETR [27], Deformable DETR [28], Swin Transformer [29], and Vision Transformer [30], have the disadvantages of long network training time, large model parameters, and mediocre detection accuracy. Considering the speed and computational resource requirements of drones for object detection, we have chosen the YOLOv7 network as the basis for our research.

We analyzed the image data obtained by drones and found that the main characteristic of drone object detection is the large number of small objects. In response to this characteristic, we improved the YOLOv7 network by drawing on recent achievements in neural network research. Firstly, the non-strided convolution (SPD-Conv) [31] module avoids the information loss caused by strided convolution and pooling operations. The deformable attention (DAT) [32] module determines the range of attention through learning, improving the feature extraction performance of the network. We use the efficient multi-scale attention (EMA) [33] module to reduce the model’s excessive attention to individual features and improve its generalization ability by smoothing attention weights. These modules are mainly used to improve the backbone network. Secondly, the multi-branch connection method is widely used in the ELAN module of the YOLOv7 network [5]; it extracts features along multiple paths and enriches the semantic expression ability of the network. We apply the same idea to the feature fusion part of our network. In addition, we use Mosaic image data augmentation [34], the MPDIoU [35] loss, and the high-performance Lion optimizer [36] to train our network.

There are many improved networks based on the YOLO network, and some are improved by using a single module [37,38,39], some are embedded in the network based on the large model [40,41,42], and some add new structures to the original network [43]. At the same time, there are also many improvements in small object detection based on the YOLO series in different application scenarios [44,45,46,47]. Our network is mainly aimed at the application of image acquisition for drones.

Our work differs from these works in the following aspects. Firstly, the starting point for network improvement is different: we propose improvements based on the characteristics of datasets obtained from drone platforms, so our solution is more targeted. Secondly, in improving the network, we combine the advantages of several techniques to enhance its performance from different perspectives. Thirdly, our work aims to inherit the YOLO network framework and explore an object detection algorithm suitable for drones without increasing network complexity or computational resource consumption during model inference.

3. Our Method

In this section, we first provide a detailed comparison and explanation of the network before and after improvement. Secondly, we introduce the modules used in the improvement. Lastly, we present the training loss function and the metrics associated with it.

3.1. Improvement of Network Structure

By analyzing the VisDrone2019 and NWPU VHR-10 datasets, we found that objects in images obtained by drones are generally small. To enhance the precision of object detection, we improve the widely used YOLOv7 algorithm, whose network structure is depicted in Figure 2.

In Figure 2, each rectangular cube represents the feature map of an input or output network node, each line with an arrow represents the direction of the data flow, and the letters next to a line are the abbreviation of the relevant network module. Feature maps in the figure are annotated as H × W × C, where H represents the height, W represents the width, and C represents the number of channels. This annotation is provided in the bottom left corner of the diagram, where we also illustrate the combination modules represented by letters in the network diagram. In these formulas, “=” indicates that the module on the left is composed of the network on the right. Figure 3 uses the same notation as Figure 2.

The YOLOv7 network is an upgraded version in the YOLO series; its advantages are small model parameters, easy deployment to small terminals, fast detection speed, and real-time inference. These advantages meet the application requirements of drone target detection. However, through in-depth study of the YOLOv7 network, we identified its shortcomings and made improvements to it. The specific details of the improvements are shown in Figure 3.

In Figure 3, the areas whose colors differ from those in Figure 2 are our improvements to the model, while the areas whose colors are unchanged remain consistent with the original network structure. We also use dashed boxes to highlight the modifications. We made three improvements to the original YOLOv7 framework. Firstly, in the backbone section, we replaced the original CBS module with the SPD-Conv module. Secondly, in some areas, we replaced the CBS module with DAT and EMA modules to introduce attention mechanisms, which further enhance the network’s ability to extract target features. Thirdly, in the feature fusion section (also known as the neck section), we proposed a new cross-linking method, as shown by the orange connection lines in the figure. We improved the part of the bottom layer that outputs the detection feature map for small objects, using all three information output layers at the bottom as the feature fusion part. Specifically, we merged the 160 × 160 feature map into the detection end through cross-linking, further enhancing the feature extraction ability for small objects.

3.2. The Modules Used in the Improvement

3.2.1. Non-Strided Convolutional (SPD-Conv) Module

The mathematical expression of Figure 4 is Equation (1):

Y_{0,0} = X[0 : S - 1 : 2, 0 : S - 1 : 2],
Y_{0,1} = X[1 : S : 2, 0 : S - 1 : 2],
Y_{1,0} = X[0 : S - 1 : 2, 1 : S : 2],
Y_{1,1} = X[1 : S : 2, 1 : S : 2],
X' = [Y_{0,0}; Y_{0,1}; Y_{1,0}; Y_{1,1}],

(1)

In Equation (1), X represents the input feature map; Y_{0,0}, Y_{0,1}, Y_{1,0}, and Y_{1,1} represent the feature maps after the separation operation; and X' represents the output feature map. The slice 0 : S - 1 : 2 represents taking every other element, where S represents the width or height of the feature map for each channel.

Combining Figure 4 with Equation (1), we can see that the SPD-Conv module achieves the same downsampling functionality as ordinary 3 × 3 convolutions, Maxpooling, and Avgpooling through simple separation and combination operations. Moreover, compared to these operations, such operations placed at the bottom of the entire network preserve more original information about the features used for detecting small objects.
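The separation step can be sketched in a few lines of plain Python. This is a minimal illustration for a single channel stored as a list of rows, not the authors' implementation. The function name `space_to_depth` is hypothetical, and because the text does not say which axis each slice refers to, the sub-maps are keyed here by (row offset, column offset).

```python
def space_to_depth(x, scale=2):
    """Split one channel (a list of rows) into scale * scale sub-maps.

    The sub-map with key (i, j) keeps the elements whose row index is
    congruent to i and whose column index is congruent to j modulo `scale`.
    """
    height, width = len(x), len(x[0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    return {
        (i, j): [row[j::scale] for row in x[i::scale]]
        for i in range(scale)
        for j in range(scale)
    }


# A 4 x 4 single-channel feature map with distinct values.
x = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
sub_maps = space_to_depth(x)
print(sub_maps[(0, 0)])  # [[0, 2], [8, 10]]
print(sub_maps[(1, 1)])  # [[5, 7], [13, 15]]
```

Each of the four 2 × 2 sub-maps has half the spatial resolution of the input, and together they contain every input element exactly once. This is why the rearrangement downsamples without discarding information; in SPD-Conv the sub-maps are stacked along the channel dimension before a non-strided convolution is applied.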

3.2.2. Multi Branch Convolution (MConv) Module

The mathematical expression of the network shown in Figure 5 is Equation (2):

Y = f^{1 \times 1}([f^{3 \times 3}(X), Depf^{3 \times 3}(X)]),

(2)

In Equation (2), Y denotes the output feature map; Depf^{3 \times 3}, f^{3 \times 3}, and f^{1 \times 1} represent depthwise convolution, 3 × 3 convolution, and 1 × 1 convolution, respectively; and X denotes the input feature map.

3.2.3. Deformable Attention (DAT) Module

In Figure 6, the offsets are a set of learned position shifts that determine the key points used for calculating the values. Equation (3) gives the mathematical formula for the deformable attention in Figure 6. In this equation, the attention weights and the offsets are obtained through a linear transformation of the Query features, and the values are then sampled at the positions given by the offsets.

DAT(z_{q}, x) = \sum_{m=1}^{M} W_{m} [\sum_{k \in Ω_{k}} A_{mqk} \cdot W_{m}^{'} x(p_{q} + Δp_{mqk})],

(3)

In Equation (3), the variables are as follows: z_{q} denotes the feature vector for the Query, M denotes the total number of attention heads, x denotes the input feature map, A_{mqk} denotes the weight for the k-th Key element, W_{m} denotes the weights for the linear transformation, m denotes the attention head index, Ω_{k} denotes the set of possible values for k (normalized weight vector), p_{q} denotes the position without bias change, W_{m}^{'} denotes the encoding of the Key element, and Δp_{mqk} denotes the specific bias magnitude.

In the network, a deformable attention module was introduced, expanding the receptive field range of each feature point and enhancing the ability of the network to extract features effectively.

3.2.4. Efficient Multi-Scale Attention (EMA) Module

In Figure 7, the input feature map is denoted as X; it is separated into several X' through grouping operations, and each X' feature map obtains the output X'' through the operations shown in the figure. The mathematical formula for Figure 7 is Equation (4):

M_{1}(X') = AvgPool(f^{3 \times 3}(X')),
M_{2}(X') = Softmax(f^{3 \times 3}(X')),
M_{3}(X') = f^{1 \times 1}([XAvgPool(X'), XAvgPool(X')]) ⨀ X',
X'' = σ(AvgPool(M_{3}) ⨂ M_{2} + Softmax(M_{3}) ⨂ M_{1}) ⨀ X',

(4)

In Equation (4), X' denotes the input feature map. M_{1}(X'), M_{2}(X'), and M_{3}(X') represent intermediate computation results. AvgPool represents average pooling. f^{3 \times 3} and f^{1 \times 1} represent 3 × 3 convolution and 1 × 1 convolution, respectively. σ represents the sigmoid operation. ⨀ represents the channel-wise convolution operation. ⨂ represents the element-wise multiplication operation.

Combining Equation (4) with Figure 7, we can see that the efficient multi-scale attention integrates information more fully from the original feature map, enriching the network’s ability to understand complex relationships within the data.

3.3. Loss Function and Model Evaluation-Related Metrics

3.3.1. Concepts Related to Losses

In Figure 8, for a certain detection target, the red border represents the predicted border of the network model. The green border represents the manually pre-labeled true border, also known as the ground truth. In the illustration, towards the right-hand side is the mathematical expression and graphical representation of IoU. Based on the representation of IoU, Figure 9 and Equation (5) demonstrate the fundamental concept underlying MPDIoU.

Figure 9 shows a visual representation of the meaning of MPDIoU-related symbols, with the red border representing the predicted border of the network model and the green border representing the true border. The mathematical expression of MPDIoU is Equation (5), and the bounding box loss function obtained from it is expressed as Equation (6):

d_{1}^{2} = (x_{1}^{pr} - x_{1}^{gt})^{2} + (y_{1}^{pr} - y_{1}^{gt})^{2},
d_{2}^{2} = (x_{2}^{pr} - x_{2}^{gt})^{2} + (y_{2}^{pr} - y_{2}^{gt})^{2},
MPDIoU = IoU - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}},

(5)

L_{MPDIoU} = 1 - MPDIoU,

(6)

In Equation (5), (x_{1}^{pr}, y_{1}^{pr}) represents the top-left corner of the predicted bounding box and (x_{2}^{pr}, y_{2}^{pr}) represents its bottom-right corner. Similarly, (x_{1}^{gt}, y_{1}^{gt}) represents the top-left corner of the ground truth bounding box and (x_{2}^{gt}, y_{2}^{gt}) represents its bottom-right corner. d_{1} is the distance between the top-left corners of the predicted and ground truth boxes, and d_{2} is the distance between their bottom-right corners. w is the width and h is the height of the image. L_{MPDIoU} represents the bounding box loss function.
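Equations (5) and (6) translate directly into code. The following is a minimal single-box sketch in plain Python rather than the batched tensor version used in training. The function name `mpdiou_loss`, the argument layout (x1, y1, x2, y2), and the example boxes are assumptions made for illustration.

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """Bounding box loss 1 - MPDIoU for a single pair of boxes.

    pred, gt: (x1, y1, x2, y2) with (x1, y1) the top-left corner and
    (x2, y2) the bottom-right corner; img_w, img_h: image width and height.
    """
    # Plain IoU of the two axis-aligned boxes.
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners, as in Equation (5).
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2
    norm = img_w ** 2 + img_h ** 2

    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou  # Equation (6)


# Hypothetical boxes on a 100 x 100 image.
perfect = mpdiou_loss((0, 0, 10, 10), (0, 0, 10, 10), 100, 100)
shifted = mpdiou_loss((0, 0, 10, 10), (5, 5, 15, 15), 100, 100)
print(perfect)  # 0.0
print(shifted)
```

A perfect prediction gives a loss of exactly 0. Two boxes with the same IoU but different corner offsets give different losses, because the corner-distance terms still contribute; this is the property that distinguishes MPDIoU from plain IoU.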

3.3.2. Loss Function

In Figure 10, the values at each position in the detection results represent specific meanings. The numbers “10, 10” located at the top of the illustration indicate that the output values correspond to a 10 × 10 grid on the image being detected. Additionally, 3*N = 3*(5 + (N − 1)), where 3 represents the three different shapes of predicted bounding boxes, as depicted by the blue, yellow, and green boxes in the top left of the figure; 5 represents the information about the predicted bounding box’s position, size, and class confidence; and N − 1 represents the probabilities of the object inside the predicted bounding box being classified into each class. The specific meanings of these values are explained in the text in the lower left of the figure.

L_{net} = λ_{1} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{obj} L_{MPDIoU}^{ij} + λ_{2} \sum_{i=0}^{S^{2}} I_{ij}^{obj} L(p_{ij}, \hat{p}_{i}) + λ_{3} (\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{obj} (c_{i} - \hat{c}_{i})^{2} + λ_{g} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{noobj} (c_{i} - \hat{c}_{i})^{2}),

(7)

In Equation (7), the overall loss of the network is denoted as L_{net}. I_{ij}^{obj} signifies whether there is a genuine object in the j-th bounding box that corresponds to the i-th grid. If there is a real object, it is denoted as 1; otherwise, it is 0. Here, j = 1, 2, 3 represents the three predicted bounding boxes for small objects in the network output. The coefficients λ_{g}, λ_{1}, λ_{2}, and λ_{3} correspond to the weightings applied to the negative sample confidence loss, bounding box loss, class loss, and confidence loss, respectively. In training, these coefficients are set to 0.2, 0.3, 0.15, and 0.05, respectively. I_{ij}^{noobj} corresponds to negative samples, where no object is present. c_{i} and \hat{c}_{i} represent the true confidence value and predicted confidence value, respectively, for the i-th grid.

3.3.3. Training Metrics

The evaluation metrics for measuring model training results include AP and mAP [49], and their specific meanings and mathematical expressions are shown in Figure 11 and Equations (8) and (9).

In Figure 11, TY indicates that the object is truly present and successfully detected (a true positive). TN indicates that the object is truly present but not detected (a missed detection, i.e., a false negative in standard terminology). FY indicates that the object is not present but mistakenly detected (a false positive). FN indicates that the object is truly absent and not detected (a true negative in standard terminology).

AP = \int_{0}^{1} P(R) \, dR,

(8)

mAP = \frac{1}{classes} \sum_{i=1}^{classes} AP_{i},

(9)

In Equations (8) and (9), R represents recall and P represents precision. Their calculation methods are shown in Figure 11. AP stands for Average Precision and corresponds to the area under the precision–recall curve. mAP is the abbreviation for Mean Average Precision and is the average of the per-category average precisions AP_{i} over all categories.
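The following minimal Python sketch shows one way to evaluate these quantities numerically. It approximates the integral of the precision–recall curve with the trapezoidal rule, which is an assumption of this sketch; evaluation toolkits often use an interpolated curve instead. The function names and example values are hypothetical, and the counts follow the labels used in the description of Figure 11 (TY: present and detected, FY: absent but detected, TN: present but not detected).

```python
def precision_recall(ty, fy, tn):
    """Precision and recall from counts labeled as in the text for Figure 11."""
    precision = ty / (ty + fy) if ty + fy else 0.0
    recall = ty / (ty + tn) if ty + tn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the integral of P(R) over R.

    recalls must be sorted in ascending order; precisions[k] is the
    precision measured at recalls[k].
    """
    ap = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        ap += width * (precisions[k] + precisions[k - 1]) / 2.0
    return ap


def mean_average_precision(ap_per_class):
    """Average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical values, for illustration only.
pr_point = precision_recall(ty=8, fy=2, tn=2)
ap_example = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
map_example = mean_average_precision([ap_example, 0.625])
print(pr_point)     # (0.8, 0.8)
print(ap_example)   # 0.875
print(map_example)  # 0.75
```

For mAP₅₀, a detection counts as correct when its IoU with a ground truth box is at least 0.5, and the precision–recall points for each category are computed under that threshold.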

In the gradient update of loss, we used a more advanced Lion optimizer for network training. The update process of the Lion optimizer is shown in Equation (10):

u_{t} = sign(β_{1} m_{t-1} + (1 - β_{1}) g_{t}) + λ_{t} θ_{t-1},
θ_{t} = θ_{t-1} - η_{t} u_{t},
m_{t} = β_{2} m_{t-1} + (1 - β_{2}) g_{t},

(10)

In Equation (10), θ_{t} denotes the network parameters and m_{t} the momentum maintained during the gradient update process, β_{1} and β_{2} are two hyperparameters with values of 0.9 and 0.99, respectively, and η_{t} represents the learning rate, with a value of 0.0003. λ_{t} represents the weight decay rate, with a value of 0.01. g_{t} = \nabla_{θ} L(θ_{t-1}) represents the gradient of the loss and sign corresponds to the sign function, that is, positive numbers become 1 and negative numbers become −1.
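A single update step of Equation (10) can be written out for a flat list of parameters as follows. This is a minimal sketch in plain Python, not a drop-in optimizer for a deep learning framework. The function name `lion_step` and the example values are hypothetical; the default hyperparameters are the values quoted above.

```python
def sign(v):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (v > 0) - (v < 0)


def lion_step(theta, m, grad, lr=0.0003, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """Apply one update of Equation (10) to every parameter.

    theta: parameters at step t-1; m: momentum at step t-1;
    grad: gradient of the loss at theta. Returns (theta_t, m_t).
    """
    new_theta, new_m = [], []
    for th, mo, g in zip(theta, m, grad):
        # Update direction: sign of the interpolated momentum plus weight decay.
        u = sign(beta1 * mo + (1 - beta1) * g) + weight_decay * th
        new_theta.append(th - lr * u)
        # The momentum is refreshed with beta2 after the parameter update.
        new_m.append(beta2 * mo + (1 - beta2) * g)
    return new_theta, new_m


# Hypothetical two-parameter example starting from zero momentum.
theta_1, m_1 = lion_step([1.0, -2.0], [0.0, 0.0], [0.5, -0.25])
print(theta_1)
print(m_1)
```

Because only the sign of the interpolated momentum is used, every parameter moves by the same magnitude η_{t} per step (apart from weight decay), regardless of the size of its gradient.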

4. Experiment

This section primarily covers the target dataset employed, the equipment and settings utilized in our experiment, as well as an overview of the training procedure.

4.1. Dataset

In order to fulfill the demands of drone target detection tasks, the VisDrone2019 [6] and NWPU VHR-10 [7] datasets were selected as our primary datasets for experimentation and verification. The VisDrone2019 dataset, shown in Figure 12, comprises a total of 10,209 static images, split into 6471 for model training, 3190 for testing, and 548 for validation. It consists of 10 predefined object categories: tricycle, van, bus, awning-tricycle, truck, car, person, bicycle, pedestrian, and motor. Similarly, the NWPU VHR-10 dataset includes 800 static images, with 520 designated for model training, 150 for testing, and 130 for validation. This dataset also encompasses 10 object categories: storage tank, airplane, basketball court, baseball diamond, ship, vehicle, bridge, harbor, ground track field, and tennis court.

From Figure 12a,d, we can see that most of the detected targets in UAV-captured scenes are relatively small, and there are also cases where small objects are densely packed. Figure 12b,f show that in drone-captured scenes, there are cases where the object and background are mixed together, making it difficult to distinguish, and there are also cases of occlusion. By observing the dataset images, we have concluded that there are two difficulties in drone object detection: one is that small objects appear frequently in some scenes, and the other is that some objects are difficult to recognize due to complex backgrounds. Improving the ability to accurately identify objects in various complex environments is the goal we seek to achieve through network improvement.

4.2. Image Data Augmentation

The training effectiveness of a network depends not only on its internal structure but also on the quantity of data used. With the help of relevant data augmentation techniques, especially mosaic image data augmentation, we enhanced the data before training, as shown in Figure 13.

In Figure 13, we illustrate the methods we employ for data augmentation. The main process consists of several steps. Firstly, we enlarge the dataset size by applying random flipping, cropping, scaling, adjusting image brightness, and occluding generated images. Then, we shuffle the dataset and utilize the mosaic operation to further expand the dataset size. Finally, we shuffle the dataset once again and select a batch of data for the model.

The right side of the image presents a schematic diagram of the mosaic operation. This operation randomly resizes, crops, and stitches images together, thereby increasing the data volume and enriching the proportion of small objects. This approach enhances the detection performance on small targets.
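The stitching part of the mosaic operation can be illustrated with a deliberately simplified sketch. It assumes four equally sized images placed in a fixed 2 × 2 layout and omits the random resizing and cropping described above; the function name `simple_mosaic` and the toy inputs are hypothetical. Images are lists of pixel rows, and boxes are (x1, y1, x2, y2) tuples in pixel coordinates.

```python
def simple_mosaic(images, boxes_per_image):
    """Stitch four equally sized images into one 2 x 2 canvas.

    Returns the canvas and all boxes shifted into canvas coordinates.
    Simplification: no random scaling, cropping, or mosaic centre.
    """
    if len(images) != 4 or len(boxes_per_image) != 4:
        raise ValueError("exactly four images and four box lists are required")
    h, w = len(images[0]), len(images[0][0])

    # Top half: images 0 and 1 side by side; bottom half: images 2 and 3.
    top = [images[0][r] + images[1][r] for r in range(h)]
    bottom = [images[2][r] + images[3][r] for r in range(h)]
    canvas = top + bottom

    # Offset of each tile's top-left corner inside the canvas.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    boxes = []
    for (dx, dy), tile_boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2 in tile_boxes:
            boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, boxes


# Four toy 2 x 2 "images" filled with the values 0..3.
toy_images = [[[k] * 2 for _ in range(2)] for k in range(4)]
toy_boxes = [[(0, 0, 1, 1)], [], [], [(0, 0, 2, 2)]]
canvas, mosaic_boxes = simple_mosaic(toy_images, toy_boxes)
print(canvas)        # [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
print(mosaic_boxes)  # [(0, 0, 1, 1), (2, 2, 4, 4)]
```

When the stitched canvas is resized back to the network input size, every object occupies fewer pixels than in its source image, which is how the operation raises the proportion of small objects seen during training.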

4.3. Experimental Settings

We present the equipment and related parameter settings used for model training in Table 1 and Table 2.

Our experimental setup utilized the devices listed in Table 1. We conducted our experiments on a Linux operating system using the PyTorch framework, with Python 3.8.16 as the programming language. The computer system was equipped with an RTX 6000 Ada GPU and a Xeon(R) w9-3495X CPU, with a total system memory of 64 GB of RAM.

Table 2 displays the parameter configurations used in the experiments. Based on the performance of our devices, the Batch Size is 16 and the Epoch is 200. For the remaining parameters, we refer to the settings in YOLOv7: Image Size is set to 640 × 640, Momentum [50] is set to 0.937, Optimizer is set to Lion, Weight Decay [51] is set to 0.0005, and Learning Rate is set to 0.01.

4.4. Training Process

In our model training, in addition to using data augmentation techniques, we utilized the K-means clustering technique [52] to establish a suitable range for initializing the bounding box values, enabling faster convergence during training. The specific training process is shown in Figure 14.
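A minimal sketch of anchor clustering is given below. It assumes that boxes are compared with 1 − IoU of their widths and heights, a distance commonly used for anchor clustering in the YOLO family; the text does not state which distance was used. The function names, the deterministic initialization, and the example boxes are hypothetical.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic initialization: evenly spaced boxes from the area-sorted list.
    step = len(boxes) / k
    centers = [boxes[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is equivalent to smallest 1 - IoU distance.
            best = max(range(k), key=lambda c: wh_iou(box, centers[c]))
            clusters[best].append(box)
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Hypothetical box sizes: three small objects and three large ones.
example_wh = [(3, 4), (4, 5), (5, 4), (30, 40), (32, 38), (28, 42)]
anchors = kmeans_anchors(example_wh, k=2)
print(anchors)
```

In practice the clustering would be run on all training boxes with k equal to the total number of anchors, and the resulting anchors would be split among the detection heads by size.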

5. Results and Analysis

Our main focus is on analyzing the data produced during the training of the modified model within this section. It consists of six aspects: firstly, a statistical analysis of the target dataset; secondly, a comparative analysis of the effects before and after model improvement; thirdly, conducting ablation experiments on the various modules used in the article; fourthly, comparing the model results with other models; fifthly, visualizing and analyzing some experimental results; sixthly, visualizing the network’s effectiveness using GradCAM [53] heatmaps.

5.1. Statistics of Relevant Data in the Dataset

To gain a deeper insight into the network model’s performance, we conducted a statistical analysis on the size distribution of bounding boxes and the number distribution of different types of targets in both datasets prior to conducting the training process. The results are presented in Figure 15 and Figure 16.

The left side of Figure 15 shows box plots in which blue represents the distribution of bounding box widths and red represents the distribution of bounding box heights. On the right is a heatmap of frequency statistics for different sizes of bounding boxes, with colors ranging from blue to red indicating a decrease in frequency. The box plots in Figure 15a,c show that the object sizes in the datasets we use are relatively concentrated, with the height of the objects generally greater than their width. From Figure 15b,d, we can see that the object sizes in both datasets are all within 100 × 100, and those in the VisDrone2019 dataset, shown in (b), are smaller and concentrated within 50 × 50. Furthermore, considering the specific frequencies, the objects in the VisDrone2019 dataset are smaller and more numerous than those in the NWPU VHR-10 dataset.

In Table 3, different sets of anchor box sizes are obtained for the different datasets. The first, second, and third sets of anchor boxes correspond to the small, medium, and large object detection heads of the network, respectively. There are three pairs of anchor box sizes in each set. For example, in the pair [3,4], 3 means that the anchor box is 3 pixels wide and 4 means that it is 4 pixels high.

From Figure 16, we can see that the number of objects differs between categories, with the differences being most significant in the VisDrone2019 dataset shown in (a). Comparing Figure 16a and Figure 16b, the number of objects in each category in VisDrone2019 is much higher than that in NWPU VHR-10.

5.2. The Effect of Model Improvement

For each object category in the validation data of the two datasets, we plot the detection outcomes of the enhanced network as precision–recall curves, presented in Figure 17a and Figure 17b, respectively. Figure 17 also lists the average detection precision for each category, with the IoU threshold set at 0.5. Different colors are used to represent different categories, and the average precision–recall curve for all categories is also plotted.

Firstly, consider the two subgraphs in Figure 17. In Figure 17a, the car category has the highest AP and the awning-tricycle category has the lowest. Figure 16a shows that cars are the most numerous and awning-tricycles the least numerous in the training dataset. Similarly, in Figure 17b, the airplane category has the highest AP and the bridge category has the lowest, and Figure 16b shows that airplanes are the most numerous and bridges the least numerous in the training dataset. Similar patterns appear for other category pairs, such as pedestrian versus people and ship versus storage tank: the category with fewer samples in the training dataset also has a lower value in the validation results.

From Figure 17 alone, we have not yet seen the advantages of our improved model. Therefore, we performed a comparative analysis of the object detection outcomes obtained before and after enhancing the model. This evaluation resulted in the generation of the mAP curve presented in Figure 18, the radar graph shown in Figure 19, and the confusion matrix graph illustrated in Figure 20. We demonstrated the advantages of our improved model from different angles.

Figure 18 compares the training results before and after model improvement. The horizontal axis represents the number of training epochs, while the vertical axis represents the mAP₅₀ value. The blue curve shows the training results before model improvement, while the orange curve shows the results after model improvement. Both subgraphs in Figure 18 clearly show the advantage of the improved model: beyond the 150th epoch, the orange curve consistently surpasses the blue curve. As depicted in (a), the improved model achieves a 17.1% improvement in mAP₅₀ on the VisDrone2019 dataset. Similarly, as shown in (b), the improved model achieves a 7.1% improvement in mAP₅₀ on the NWPU VHR-10 dataset. To further validate the effectiveness of the model improvement, Table 4 and Table 5 provide AP₅₀ statistics for each category within the two datasets.

Table 4 and Table 5 show the AP₅₀ of each category in the two datasets before and after model improvement. Original denotes the training results before model improvement, and Improved denotes the training results after model improvement. Table 4 corresponds to the categories in the VisDrone2019 dataset, and Table 5 corresponds to the categories in the NWPU VHR-10 dataset. To make the information in Table 4 and Table 5 easier to interpret and compare, we also present it as a radar chart in Figure 19.

The dark green polygons in Figure 19 represent the training results before model improvement, while the light green polygons represent the training results after model improvement; each vertex of a polygon corresponds to a category. The blue polygons in the figure are contour lines, with values gradually increasing from the inside out. The two subgraphs in Figure 19 show that the AP₅₀ values of the different categories improved to varying degrees after model improvement. In subgraph (a), the growth of the pedestrian, motor, and people categories is the most significant, with increases of 23.6%, 23.9%, and 26.1%, respectively, compared to before model improvement. In subgraph (b), the growth of the ship and storage tank categories is also significant, with increases of 11.9% and 11.8%, respectively. Moreover, targets in these categories are smaller in the image than other targets, indicating that the improved model is more effective at small object detection.

Comparing the left and right subgraphs in Figure 20, we can see that the right subgraphs are sparser overall, reflecting the improved accuracy of the model in detecting objects of various categories. Comparing the upper and lower subgraphs in Figure 20, we find that the lower subgraphs are sparser than the upper ones. The upper subgraphs correspond to the VisDrone2019 dataset, which contains more objects of relatively smaller size than the NWPU VHR-10 dataset, and this leads to the difference in detection results. Therefore, the confusion matrix for the VisDrone2019 dataset is more complex. Combined with the two subgraphs in Figure 19, we can further conclude that the smaller the objects in a dataset, the poorer the detection results.

5.3. Ablation Experiment

To further explore the practical effects of the newly added network modules in our improvement, ablation experiments were conducted on five parts: the non-strided convolution (SPD-Conv) module, deformable attention (DAT) module, efficient multi-scale attention (EMA) module, multi-branch convolution (MConv) module, and MPDIoU loss. The experimental outcomes are presented in Table 6.
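Of the five components listed above, the MPDIoU loss is the easiest to state compactly. The sketch below follows the definition in the MPDIoU reference: the IoU is penalized by the squared distances between the two top-left corners and between the two bottom-right corners, each normalized by the squared diagonal of the input image. The function name and the plain-tuple box format are assumptions for illustration, and the training code used in the experiments may differ.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """1 - MPDIoU for two boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # Plain IoU.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union
    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2   # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2   # bottom-right corners
    norm = img_w ** 2 + img_h ** 2                # squared image diagonal
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou

print(mpdiou_loss((0, 0, 10, 10), (5, 5, 15, 15), img_w=100, img_h=100))
```

Unlike plain IoU, the loss keeps growing as two non-overlapping boxes move further apart, which gives a useful signal when a small predicted box does not yet overlap its target.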

Table 6 shows the impact of using different modules to improve the original network structure on training results for different datasets.

For the VisDrone2019 dataset, comparing each row of the table with the first row for that dataset shows that SPD-Conv brings a noteworthy improvement across the model indicators: the accuracy increased by 7.5, mAP₅₀ increased by 5.1, and mAP_50:95 increased by 3.3. The reason is that the SPD-Conv module is used mainly in the backbone of the network, and its key feature is that it enhances the detection of small objects. Since the VisDrone2019 dataset contains a substantial number of small objects, the SPD-Conv module produced the most significant performance improvement for the entire model.
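The reason SPD-Conv helps with small objects can be seen from its space-to-depth step, which halves the spatial resolution by moving pixels into the channel dimension instead of discarding them. The following is a minimal NumPy sketch of that rearrangement only, assuming a scale of 2 and a (C, H, W) layout; the function name is hypothetical, and the non-strided convolution that follows this step in the module is omitted.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange a (C, H, W) feature map into (C * scale**2, H / scale, W / scale)."""
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0, "H and W must be divisible by scale"
    # Each slice keeps every scale-th pixel, starting from a different offset.
    slices = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(slices, axis=0)

x = np.arange(2 * 4 * 6).reshape(2, 4, 6)
y = space_to_depth(x)
strided = x[:, ::2, ::2]  # what a stride-2 sampling grid would keep
print(x.shape, y.shape, strided.shape)
```

Every input value survives in `y`, whereas the stride-2 sampling grid keeps only a quarter of the positions; for an object that covers only a few pixels, that difference can decide whether any of its features reach the deeper layers.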

For the NWPU VHR-10 dataset, comparing each row of the table with the first row for that dataset shows that DAT brings a significant improvement in the model indicators: the accuracy increased by 1.8, mAP₅₀ increased by 2.5, and mAP_50:95 increased by 1.1. The DAT module is also used in the backbone of the network, and its key attribute is its ability to enlarge the receptive field and enhance feature extraction. At the same time, Figure 19b shows that the bridge category improved significantly after model improvement. We therefore suggest that the DAT module mainly improves the detection performance of the model on medium-sized objects.

In addition, for each dataset, every row in the table improves on the row before it, which implies that every module contributes to the overall improvement of the model's performance.

In addition to conducting ablation experiments on the modules in the overall network, we also explored the performance of the MConv module. The relevant results of the experiments are shown in Table 7. Depthwise Conv, Common Conv, and Pointwise Conv are abbreviated as DC, CC, and PC, respectively.

As a multi-branch structure module, MConv integrates ordinary convolution and depthwise separable convolution. Comparing the data in Table 7, we find that the combination of Depthwise Conv, Common Conv, and Pointwise Conv has the best performance on both datasets. The combination of Common Conv and Pointwise Conv alone does not significantly improve the network performance. These data indicate that the MConv module as a whole achieves better network performance.
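One motivation for mixing ordinary and depthwise separable convolution is parameter cost. The short sketch below compares the weight counts of a standard k × k convolution and a depthwise convolution followed by a pointwise (1 × 1) convolution; bias terms are ignored, the channel sizes are arbitrary illustrative values, and the actual branch layout of MConv is not reproduced here.

```python
def conv_params(c_in, c_out, k):
    """Weights in an ordinary k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution plus a pointwise 1 x 1 convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, separable / standard)
```

The ratio equals 1/c_out + 1/k², so the separable branch is much cheaper, while the ordinary convolution branch retains full cross-channel spatial filtering.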

5.4. Comparison Experiment

To further illustrate the efficacy of our model enhancements, we conducted a comparative analysis with other existing models. The corresponding data can be found in Table 8, Table 9, Table 10 and Table 11.

Table 8 compares the experimental results of our model with those of five other high-performing models on the VisDrone2019 validation dataset. For mAP₅₀, our model has the highest value, 56.8%, while TPH-YOLOv5 has the second highest, 48.9%, so our model is 7.9 percentage points higher than TPH-YOLOv5. For mAP_50:95, our model has the highest value, 34.9%, followed by the original model at 28.1%, so the improved model is 6.8 percentage points higher than the original one. Compared to the newer network YOLOv8, our improved model still has an advantage on the VisDrone2019 validation dataset, which shows that our improvements for this specific application scenario are effective. In terms of inference speed, our network differs only slightly from the original network and remains efficient.

Table 9 shows the comparison of the experimental results of our model with the other five models on the VisDrone2019 validation dataset for each category of object AP₅₀. There are a total of 11 rows of data listed in the table, with the first row representing the average AP₅₀ value of all categories and the remaining 10 rows representing the AP₅₀ value of each category. Based on the data in Table 9, we created Figure 21 to visualize the comparison results between models.

Based on the data in each row of Figure 21 and Table 9, we can see that our model achieves significantly better results for targets such as Motor, People, and Pedestrian, with values of 68.9%, 57.9%, and 69.6%, respectively. Compared to the second highest values in each row, 60.2%, 49.8%, and 60.1%, they are 8.7%, 8.1%, and 9.5% higher, respectively. These targets are all relatively small objects, which again verifies that our model has an advantage in detecting small objects.

Table 10 compares the experimental results of our model with those of five other models on the NWPU VHR-10 validation dataset. For mAP₅₀, our model has the highest score, 94.6%, while TPH-YOLOv5 has the second highest, 92.3%, so our model is 2.3% higher. For mAP_50:95, our model has the highest score, 58.3%, and TPH-YOLOv5 has the second highest, 56.8%, so our model is 1.5% higher. Compared to the newer network YOLOv8, our improved model still has an advantage on the NWPU VHR-10 validation dataset, especially for images that contain a large number of small objects. It can also be seen that our network maintains high efficiency in terms of inference speed.

Table 11 shows the comparison of our model with five other models on the NWPU VHR-10 validation dataset for the experimental results of AP₅₀ for each category of targets. There are a total of 11 rows of data listed in the table, with the last row representing the average AP₅₀ value of all categories, and the remaining 10 rows representing the AP₅₀ value of each category. Based on the data in Table 11, we also created Figure 22 to visualize the comparison between models.

Based on the data in each row of Figure 22 and Table 11, we can see that our model has significantly improved experimental results for targets such as Storage tank and Baseball diamond, with values of 93.5% and 99.4%, respectively. Compared to the second-highest values of 88.3% and 97.4% in the row, they are 5.2% and 2.0% higher, respectively. These data further validate the superiority of our model in object detection performance.

5.5. Visualization

In order to further validate the effectiveness of our model enhancements, we conducted a comparative analysis between the improved model and YOLOv3-SPP as well as TPH-YOLOv5. We extracted partial data from the VisDrone2019 and NWPU VHR-10 validation datasets and compared the prediction results. The specific content is shown in Figure 23 and Figure 24.

In Figure 23 and Figure 24, rectangular boxes of different colors represent the different types of detected objects. In addition, to show the differences between the models more clearly, the results of the different models on the same image are arranged side by side in a row.

Figure 23 illustrates the detection results of the three models on the VisDrone2019 validation dataset, with each column of images showing the results of one model. In the first row, our model detects more pedestrians, marked with blue rectangular boxes, and more cars, marked with blue-green rectangular boxes, in the center of the image. In the second row, our model detects more pedestrians marked with blue rectangular boxes in the upper left part of the image. In the third row, our model detects more cars marked with blue-green rectangular boxes on the highway in the upper and middle parts of the image. Overall, our model detects more small objects than the other two models.

Figure 24 shows the detection results of the three models on the NWPU VHR-10 validation dataset, with each column of images showing the results of one model. In the first row, our model identifies more airplanes, marked with red rectangular boxes, at the center of the image. In the second row, our model detects more storage tanks marked with blue-green rectangular boxes at the center of the image. In the third row, our model detects more ships marked with blue rectangular boxes on the right-hand side of the image. Again, our model detects more small objects than the other two models.

In summary, by comparing each row of images, we can find that the three models have differences in detecting small-sized targets. Our proposed model has better detection performance and can detect more small targets under the same conditions.

5.6. Heatmap Analysis

We used Grad-CAM heatmap visualization to analyze the model, and selected the TPH-YOLOv5 model, which has comparatively good detection performance, for comparison. The results are shown in Figure 25 and Figure 26.

Figure 25 shows some images in the NWPU VHR-10 validation dataset, generated using GradCAM on the TPH-YOLOv5 model and the proposed model, respectively. The top row displays the original input image for the model. The heat map highlights the object of interest in red, with darker shades of red indicating a higher level of network attention in that specific area.
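For readers unfamiliar with how such heatmaps are produced, the sketch below shows the core Grad-CAM computation: the gradients of the target score are averaged over the spatial dimensions to give one weight per channel, the activations are summed with these weights, and a ReLU keeps only positive evidence. The inputs are assumed to be the activations and gradients already extracted from one convolutional layer; the function name, the toy arrays, and the final normalization to [0, 1] for display are illustrative choices, not details taken from the experiments.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from (C, H, W) activations and gradients of one layer."""
    weights = gradients.mean(axis=(1, 2))               # one weight per channel
    cam = np.einsum("c,chw->hw", weights, activations)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                          # ReLU: keep positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam              # scale to [0, 1] for display

# Toy example: channel 1 supports the target at one location, channel 0 opposes it.
acts = np.array([[[1.0, 1.0], [1.0, 1.0]],
                 [[0.0, 2.0], [0.0, 0.0]]])
grads = np.array([[[-1.0, -1.0], [-1.0, -1.0]],
                  [[1.0, 1.0], [1.0, 1.0]]])
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, which yields displays like those in Figure 25.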

In Figure 25, there is only one type of target in each column of the image, where the target in column (a) is an airplane, the target in column (b) is a storage tank, and the target in column (c) is a ship. Upon comparing the image in the third row with the image in the second row, we can observe an increased presence of brighter red areas in the former. This indicates that our proposed model places a greater emphasis on detecting small-sized targets.

Figure 26 shows some images in the VisDrone2019 validation dataset, generated using GradCAM on the TPH-YOLOv5 model and the proposed model, respectively. The first row in subgraphs (a) and (b) is the original image of the input model.

In Figure 26a,b, each column of heatmaps corresponds to one object category: the first column corresponds to the car category, the second to the motor category, and the third to the people category. In both subgraph (a) and subgraph (b), a comparison between the second- and third-row images reveals that the latter contain more bright red areas. This indicates that our proposed model focuses more strongly on small-sized objects.

6. Discussion

After conducting a comparative analysis between different models, we have found that our proposed model has advantages in detecting small targets. However, there are still some aspects that seem counterintuitive from certain perspectives, and we would like to provide our own insight into this matter.

Firstly, in Figure 17a, there is a notable disparity in the average precision (AP) values across target types. Even after model improvement, this difference still exists, as shown in Figure 19a. Analyzing this phenomenon together with the bar chart in Figure 16, we can see that the difference in AP values between object categories is caused by imbalanced data. Generally speaking, a model performs better on categories with more training samples than on those with fewer.

Secondly, comparing the “bus” and “pedestrian” categories in Figure 17a, we can see that the average precision for “bus” is higher than that for “pedestrian”. However, in Figure 16a, there are far more pedestrians than buses, which seems to contradict our previous conclusion. The reason is that the buses in the dataset are much larger than the pedestrians. Since large targets are easier for the network to learn to recognize, the training results for the “bus” category can be better despite there being fewer samples.

Thirdly, combining the bar chart in Figure 16 and the radar chart in Figure 19, we observe that the VisDrone2019 dataset, which has significantly more data, yields worse results than the NWPU VHR-10 dataset, even though both datasets contain 10 target categories. This seems to contradict the notion that more training data leads to better model performance. Upon closer analysis, we find that the NWPU VHR-10 dataset typically has only 1–2 types of targets per image, with a relatively simple background, whereas the VisDrone2019 dataset is much more complex, containing multiple types of small targets in each image and a varied and intricate background. This makes it more difficult to train a model to the same level of performance as on the NWPU VHR-10 dataset. The confusion matrices in Figure 20a–d illustrate this: the confusion matrix for the NWPU VHR-10 dataset is sparser, while the confusion matrix for the VisDrone2019 dataset is more complex, indicating that the small target size and complex background in the VisDrone2019 dataset make it more prone to misclassification.

Finally, comparing the results in Table 8 and Table 10, it can be seen that although YOLOv8 is a relatively new network, our method outperforms it. This can be explained from two aspects. Firstly, our method is designed specifically for small objects. Images captured by drones contain many small objects, and our method has an advantage on data that contain a large number of small objects. Secondly, the DAT, SPD-Conv, EMA, and MConv modules and the MPDIoU loss that we use enhance the extraction of small-target features, which improves the detection accuracy for small objects compared to the YOLOv7 network. The network structure of YOLOv8 differs significantly from that of our improved network, and on these datasets our network achieves better results. We also see that our network is slightly slower than YOLOv8 in inference speed; reducing this gap will be the next direction for improving our model.

Overall, we believe that these insights can help provide a better understanding of the factors that affect model performance in object detection tasks.

7. Conclusions

This article introduces a network model enhancement approach to improve the detection of small objects in images captured by drones. The approach considers the specific characteristics of the images and integrates deep insights into the YOLOv7 object detection network. In the proposed scheme, non-strided convolution is used to minimize the information loss pertaining to small-sized objects during the feature extraction process. Efficient multi-scale attention and deformable attention modules are introduced to improve the network’s feature extraction and semantic information expression abilities. Furthermore, the low-level feature maps are connected to the feature fusion section of the network, and the MPDIoU is incorporated into the loss calculation to enhance the performance of detecting small objects even further. The improved model outperforms other models on the NWPU VHR-10 and VisDrone2019 datasets. This enhancement scheme plays a significant role in advancing the application of object detection in drone scenarios.

Our analysis of selected detection results shows that there is still room for improvement compared to the desired outcomes. The ability to detect and recognize objects needs further enhancement, particularly when dealing with imbalanced data containing various object categories. Additionally, the model's performance in unfamiliar scenes not present in the datasets requires further verification and targeted optimization. These limitations will be the focus of our future exploration and development.

Author Contributions

Methodology, F.S. and Q.L.; project administration, Q.L.; funding acquisition, F.S. and H.Z.; software, F.S.; data curation, D.Z. and L.Y.; validation, Q.L. and D.Z.; conceptualization, D.Z. and F.S.; writing (original draft preparation), D.Z.; resources, F.S.; investigation, Z.Z.; writing (review and editing), D.Z. and F.S.; formal analysis, F.S. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the National Natural Science Foundation of China under grant number 61671470.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Rolly, R.M.; Malarvezhi, P.; Lagkas, T.D. Unmanned aerial vehicles: Applications, techniques, and challenges as aerial base stations. Int. J. Distrib. Sens. Netw. 2022, 18, 15501329221123933. [Google Scholar] [CrossRef]
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for UAV-based Object Detection and Tracking: A Survey. arXiv 2021, arXiv:2110.12638. [Google Scholar]
Kang, J.; Tariq, S.; Oh, H.; Woo, S.S. A Survey of Deep Learning-based Object Detection Methods and Datasets for Overhead Imagery. IEEE Access 2022, 10, 20118–20134. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2022; pp. 7464–7475. [Google Scholar]
Zhu, P.F.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2018; pp. 213–226. [Google Scholar]
Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
Lienhart, R.; Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; IEEE: Piscataway, NJ, USA, 2002; Volume 1, pp. I–I. [Google Scholar]
Viola, P.A.; Jones, M.J. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001. [Google Scholar]
Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
Forsyth, D. Object Detection with Discriminatively Trained Part-Based Models; IEEE: Piscataway, NJ, USA, 2014. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Girshick, R.B. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14; Springer International Publishing: New York, NY, USA, 2016; pp. 21–37. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Jocher, G.; Stoken, A.; Borovec, J.; Chaurasia, A.; Changyu, L.; Hogan, A.; Hajek, J.; Diaconu, L.; Kwon, Y.; Defretin, Y.; et al. ultralytics/yolov5: v5.0—YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations. 2021. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022. [Google Scholar]
Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4784–4793. [Google Scholar]
Ouyang, D.; He, S.; Zhan, J.; Guo, H.; Huang, Z.; Luo, M.; Zhang, G. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
Dadboud, F.; Patel, V.; Mehta, V.; Bolic, M.; Mantegh, I. Single-Stage UAV Detection and Classification with YOLOV5: Mosaic Data Augmentation and PANet. In Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8. [Google Scholar]
Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
Chen, L.; Liu, B.; Liang, K.; Liu, Q. Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. arXiv 2023, arXiv:2310.05898. [Google Scholar]
Tang, S.; Fang, Y.; Zhang, S. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection. arXiv 2023, arXiv:2309.16393. [Google Scholar]
Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861. [Google Scholar] [CrossRef]
Hu, J.; Zhi, X.; Shi, T.; Zhang, W.; Zhao, S. PAG-YOLO: A Portable Attention-Guided YOLO Network for Small Ship Detection. Remote Sens. 2021, 13, 3059. [Google Scholar] [CrossRef]
Chengji, X.U.; Xiaofeng, W.; Yadong, Y. Attention-YOLO:YOLO Detection Algorithm That Introduces Attention Mechanism. Comput. Eng. Appl. 2019, 55, 13–23. [Google Scholar]
Zhou, Z.; Yu, X.; Chen, X. Object Detection in Drone Video with Temporal Attention Gated Recurrent Unit Based on Transformer. Drones 2023, 7, 466. [Google Scholar] [CrossRef]
Ma, J.; Wang, X.; Xu, C.; Ling, J. SF-YOLOv5: Improved YOLOv5 with swin transformer and fusion-concat method for multi-UAV detection. Meas. Control 2023, 56, 1436–1445. [Google Scholar] [CrossRef]
Feng, Q.; Shao, Z.; Wang, Z. Boundary-aware small object detection with attention and interaction. Vis. Comput. 2023, 11, 1–14. [Google Scholar] [CrossRef]
Ji, S.J.; Ling, Q.H.; Han, F. An improved algorithm for small object detection based on YOLO v4 and multi-scale contextual information. Comput. Electr. Eng. 2023, 105, 108490. [Google Scholar] [CrossRef]
Mahaur, B.; Mishra, K.K. Small-Object Detection based on YOLOv5 in Autonomous Driving Systems. Pattern Recognit. Lett. 2023, 168, 115–122. [Google Scholar] [CrossRef]
Tian, Z.; Huang, J.; Yang, Y.; Nie, W. KCFS-YOLOv5: A High-Precision Detection Method for Object Detection in Aerial Remote Sensing Images. Appl. Sci. 2023, 13, 649. [Google Scholar] [CrossRef]
Qi, G.; Zhang, Y.; Wang, K.; Mazur, N.; Liu, Y.; Malaviya, D. Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion. Remote Sens. 2022, 14, 420. [Google Scholar] [CrossRef]
Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2016; pp. 1800–1807. [Google Scholar]
Henderson, P.; Ferrari, V. End-to-End Training of Object Class Detectors for Mean Average Precision. arXiv 2016, arXiv:1607.03476.
Fu, J.; Wang, B.; Zhang, H.; Zhang, Z.; Chen, W.; Zheng, N. When and Why Momentum Accelerates SGD: An Empirical Study. arXiv 2023, arXiv:2306.09000.
Andriushchenko, M.; D’Angelo, F.; Varre, A.; Flammarion, N. Why Do We Need Weight Decay in Modern Deep Learning? arXiv 2023, arXiv:2310.04415.
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965; University of California Press: Berkeley, CA, USA, 1967.
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250.
Jocher, G. Ultralytics YOLOv8: v6. Available online: https://github.com/ultralytics/ultralytics (accessed on 23 October 2023).

Figure 1. The target size distribution in the dataset: (a) displays the size histogram of objects in the VisDrone2019 dataset; (b) represents the object size distribution within the NWPU VHR-10 dataset.

Figure 2. Network before improvement.

Figure 3. Improved network structure. In this figure, SPD Conv represents non-strided convolution, EMA represents efficient multi-scale attention, DAT represents deformable attention, and MConv represents multi-branch convolution.

Figure 4. Schematic diagram of non-strided convolution module. Split is the pixel separation operation, and concatenate is the combination operation in the channel direction.
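
The split and concatenate operations named in the Figure 4 caption can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map and a downsampling factor of 2; the function name `space_to_depth`, the ordering of the sub-maps, and the toy 4 × 4 input are illustrative choices rather than details taken from the paper, and the stride-1 convolution that follows in the module is omitted.

```python
from typing import List

Grid = List[List[float]]  # one channel of a feature map, indexed [row][col]


def space_to_depth(channel: Grid, scale: int = 2) -> List[Grid]:
    """Split one H x W channel into scale*scale sub-maps of size (H/scale) x (W/scale).

    Sub-map (dy, dx) keeps the pixels at rows dy, dy+scale, ... and
    columns dx, dx+scale, ...; no pixel is discarded.
    """
    height, width = len(channel), len(channel[0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    return [
        [row[dx::scale] for row in channel[dy::scale]]
        for dy in range(scale)
        for dx in range(scale)
    ]


# Toy 4 x 4 feature map (hypothetical values, chosen so each pixel is identifiable).
feature = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

# "Concatenate" then stacks these sub-maps along the channel axis: 1 channel -> 4 channels.
sub_maps = space_to_depth(feature, scale=2)
for index, sub_map in enumerate(sub_maps):
    print(index, sub_map)
```

Each of the four 2 × 2 sub-maps becomes one channel after concatenation, so the spatial resolution halves while every input pixel is retained, which is the property that motivates using this module for small objects.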

Figure 5. Schematic diagram of MConv module. Depthwise Conv represents depthwise convolution [48], Common Conv represents ordinary 3 × 3 convolution, and Pointwise Conv represents 1 × 1 convolution.

Figure 6. Schematic diagram of deformable attention module.

Figure 7. Schematic diagram of EMA.

Figure 8. A graphical representation of the Intersection over Union (IoU) metric.

Figure 9. Schematic diagram of MPDIoU.
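
The two quantities illustrated in Figures 8 and 9 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: boxes are given as `(x1, y1, x2, y2)` corner coordinates in pixels, and MPDIoU is taken to be IoU minus the squared distances between the two top-left corners and between the two bottom-right corners, each normalised by the squared width plus squared height of the input image. The function names and the example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred: Box, target: Box, img_w: float, img_h: float) -> float:
    """IoU penalised by normalised corner distances (assumed MPDIoU form)."""
    d1_sq = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return iou(pred, target) - d1_sq / norm - d2_sq / norm


# Hypothetical prediction and ground truth inside a 640 x 640 input image.
pred = (10.0, 10.0, 50.0, 50.0)
target = (20.0, 20.0, 60.0, 60.0)
print("IoU        :", iou(pred, target))
print("MPDIoU     :", mpdiou(pred, target, 640, 640))
print("MPDIoU loss:", 1.0 - mpdiou(pred, target, 640, 640))
```

The corner-distance terms vanish only when both corners coincide, so two predictions with the same IoU but different offsets from the ground truth receive different scores.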

Figure 10. A visual representation illustrating the detection results generated by the network.

Figure 11. Schematic diagram of evaluation metrics for the model.

Figure 12. Sample images from the datasets. (a–c) show images from the VisDrone2019 dataset; (d–f) show images from the NWPU VHR-10 dataset.

Figure 13. Schematic diagram of image data augmentation.

Figure 14. Training flowchart.

Figure 15. Distribution of bounding-box sizes of the objects to be detected. (a,b) show the width and height distributions of objects in the VisDrone2019 dataset; (c,d) show the width and height distributions of objects in the NWPU VHR-10 dataset.

Figure 16. Number of objects per category in each dataset. (a) shows the counts for each object category in the VisDrone2019 dataset; (b) shows the counts for each object category in the NWPU VHR-10 dataset.

Figure 17. The P-R curve of every category object at an IoU threshold of 0.5. (a) P-R curves on the VisDrone2019 validation dataset; (b) P-R curves on the NWPU VHR-10 validation dataset.

Figure 18. Comparison of mAP₅₀ curves before and after model improvement. (a) Comparison chart of curves for the VisDrone2019 dataset mAP₅₀; (b) comparison chart of the curves of mAP₅₀ in the NWPU VHR-10 dataset.

Figure 19. Radar charts of AP₅₀ for each category. (a) Comparison of AP₅₀ for each category in the VisDrone2019 dataset before and after model improvement; (b) comparison of AP₅₀ for each category in the NWPU VHR-10 dataset before and after model improvement.

Figure 20. Comparison of confusion matrices. (a) Confusion matrix diagram for the VisDrone2019 validation set before model improvement; (b) confusion matrix diagram for the VisDrone2019 validation set after model improvement; (c) confusion matrix diagram for NWPU VHR-10 validation set before model improvement; (d) confusion matrix diagram for the NWPU VHR-10 validation set after model improvement.

Figure 21. Comparison experiments with the modules on VisDrone2019 for each category of AP₅₀.

Figure 22. Comparison experiments with the modules on NWPU VHR-10 for each category of AP₅₀.

Figure 23. Detection results of different models on the VisDrone2019 validation dataset.

Figure 24. Detection results of different models on the NWPU VHR-10 validation dataset.

Figure 25. Comparison of heat maps generated using GradCAM for different model detection results.

Figure 26. Comparison of heat maps generated using GradCAM for different model detection results. (a,b) are the results generated from two different images, respectively.

Table 1. Experimental configuration.

Parameter	Value
Programming language	Python (3.8.16)
GPU	NVIDIA GeForce RTX 6000 Ada
CPU	Intel(R) Xeon(R) w9-3495X
RAM	64 GB
Operating system	Ubuntu 20.04
Neural network framework	PyTorch (torch 1.13.1 + cu116)

Table 2. Experimental setting.

Hyperparameter	Value
Image size	640 × 640
Batch size	16
Epochs	200
Weight decay	0.0005
Momentum	0.937
Learning rate	0.01
Optimizer	Lion

Table 3. Anchor box sizes for different datasets.

Dataset	Group One			Group Two			Group Three
VisDrone2019	[3 4]	[4 9]	[8 7]	[8 14]	[15 9]	[14 21]	[28 15]	[31 34]	[62 46]
NWPU VHR-10	[6 7]	[10 15]	[21 12]	[19 28]	[41 23]	[34 49]	[73 41]	[59 86]	[128 82]

Table 4. Value of AP₅₀ for each category in the VisDrone2019 dataset before and after model improvement.

Models	All	Bicycle	Pedestrian	Truck	People	Motor	Car	Bus	Van	Tricycle	Awning-Tricycle
Original	48.5	23.8	56.3	48.2	45.9	56.3	85.1	64.9	48.9	37.1	18.6
Improved	56.8	38.3	69.6	49.7	57.9	69.8	89.7	65.1	57.5	43.1	26.2

Table 5. The AP₅₀ values of each category in the NWPU VHR-10 dataset before and after model improvement.

Models	All	Tennis Court	Airplane	Storage Tank	Basketball Court	Ship	Ground Track Field	Harbor	Baseball Diamond	Bridge	Vehicles
Original	88.3	95.9	96.8	83.5	85.6	83.6	96.1	95.5	97.4	62.1	86.8
Improved	94.6	97.3	98.5	93.5	93.6	93.6	98.2	95.5	99.4	83.5	92.8
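
As a consistency check on Tables 4 and 5, the "All" column can be compared with the plain mean of the ten per-category AP₅₀ values, since mAP₅₀ averages AP₅₀ over categories. The sketch below copies the numbers from the two tables; the names `TABLES` and `largest_gain` are illustrative. Small gaps between the recomputed mean and the reported value are to be expected, presumably because the per-category entries are already rounded to one decimal place.

```python
from statistics import mean

# Per-category AP50 (%) copied from Tables 4 and 5.
# Each model maps to (reported "All" value, per-category values in column order).
TABLES = {
    "VisDrone2019": {
        "categories": ["Bicycle", "Pedestrian", "Truck", "People", "Motor",
                       "Car", "Bus", "Van", "Tricycle", "Awning-Tricycle"],
        "Original": (48.5, [23.8, 56.3, 48.2, 45.9, 56.3, 85.1, 64.9, 48.9, 37.1, 18.6]),
        "Improved": (56.8, [38.3, 69.6, 49.7, 57.9, 69.8, 89.7, 65.1, 57.5, 43.1, 26.2]),
    },
    "NWPU VHR-10": {
        "categories": ["Tennis Court", "Airplane", "Storage Tank", "Basketball Court", "Ship",
                       "Ground Track Field", "Harbor", "Baseball Diamond", "Bridge", "Vehicles"],
        "Original": (88.3, [95.9, 96.8, 83.5, 85.6, 83.6, 96.1, 95.5, 97.4, 62.1, 86.8]),
        "Improved": (94.6, [97.3, 98.5, 93.5, 93.6, 93.6, 98.2, 95.5, 99.4, 83.5, 92.8]),
    },
}


def largest_gain(categories, before, after):
    """Return (category, gain in percentage points) for the biggest AP50 increase."""
    gains = [round(a - b, 1) for b, a in zip(before, after)]
    best = max(range(len(gains)), key=gains.__getitem__)
    return categories[best], gains[best]


for dataset, table in TABLES.items():
    for model in ("Original", "Improved"):
        reported, per_class = table[model]
        print(f"{dataset} {model}: mean AP50 = {mean(per_class):.2f}, reported = {reported}")
    name, gain = largest_gain(table["categories"], table["Original"][1], table["Improved"][1])
    print(f"{dataset}: largest gain is {name} (+{gain} points)")
```

The same helper makes it easy to see which categories benefit most from the improved network on each dataset.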

Table 6. Ablation experiments with the modules on VisDrone2019 and NWPU VHR-10 datasets.

Dataset	Models	P (%)	R (%)	mAP₅₀ (%)	mAP_50:95 (%)
VisDrone2019	Baseline	53.1	52.3	48.5	28.1
	+SPD-Conv	60.6	53.2	53.6	31.4
	+DAT + SPD-Conv	61.1	56.7	54.8	32.6
	+DAT + SPD-Conv + EMA	62.3	57.6	55.3	33.1
	+DAT + SPD-Conv + EMA + MConv	63.3	58.5	56.1	34.1
	+DAT + SPD-Conv + EMA + MConv + MPDIoU	63.9	59.2	56.8	34.9
NWPU VHR-10	Baseline	90.8	82.6	88.3	55.8
	+SPD-Conv	91.3	83.4	89.0	56.1
	+DAT + SPD-Conv	93.1	86.5	91.5	57.2
	+DAT + SPD-Conv + EMA	94.6	89.1	93.6	57.9
	+DAT + SPD-Conv + EMA + MConv	94.9	89.5	94.1	58.1
	+DAT + SPD-Conv + EMA + MConv + MPDIoU	95.3	90.2	94.6	58.3
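
Because the rows of Table 6 are cumulative, the contribution of each component is easier to read as the difference between consecutive rows. The sketch below copies the mAP₅₀ column from the table and labels each row by the component it adds; the names `ABLATION_MAP50` and `step_gains` are illustrative. These per-step gains depend on the order in which the components were added, so they are a reading aid rather than an isolated measure of each module.

```python
# mAP50 (%) column of Table 6, one entry per row, labelled by the component added in that row.
ABLATION_MAP50 = {
    "VisDrone2019": [("Baseline", 48.5), ("SPD-Conv", 53.6), ("DAT", 54.8),
                     ("EMA", 55.3), ("MConv", 56.1), ("MPDIoU", 56.8)],
    "NWPU VHR-10": [("Baseline", 88.3), ("SPD-Conv", 89.0), ("DAT", 91.5),
                    ("EMA", 93.6), ("MConv", 94.1), ("MPDIoU", 94.6)],
}


def step_gains(rows):
    """Gain in percentage points of each row over the row directly above it."""
    return {
        name: round(score - previous, 1)
        for (_, previous), (name, score) in zip(rows, rows[1:])
    }


for dataset, rows in ABLATION_MAP50.items():
    gains = step_gains(rows)
    total = round(rows[-1][1] - rows[0][1], 1)
    biggest = max(gains, key=gains.get)
    print(f"{dataset}: {gains}")
    print(f"{dataset}: total gain = {total} points, largest single step = {biggest}")
```

Read this way, the largest single step comes from different components on the two datasets, which fits the different object-size distributions the paper reports for them.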

Table 7. Ablation experiments on VisDrone2019 and NWPU VHR-10.

Dataset	Models	P (%)	R (%)	mAP₅₀ (%)	mAP_50:95 (%)
VisDrone2019	Baseline	53.1	52.3	48.5	28.1
	+DD + PC	53.8	53.1	49.1	28.9
	+CC + PC	53.3	52.3	48.6	28.1
	+DD + CC + PC	53.9	53.2	49.2	29.1
NWPU VHR-10	Baseline	90.8	82.6	88.3	55.8
	+DD + PC	91.0	82.9	88.6	56.0
	+CC + PC	90.9	82.6	88.4	55.8
	+DD + CC + PC	91.1	82.9	88.7	56.1

Table 8. Comparison experiments on VisDrone2019.

Algorithm	FPS	mAP₅₀ (%)	mAP_50:95 (%)
YOLOv3-SPP [22]	32	47.7	28.6
YOLOv4 [23]	35	47.3	26.4
YOLOv5l [24]	28	48.1	25.3
YOLOv7 [5]	53	48.5	28.1
TPH-YOLOv5 [54]	32	48.9	26.1
PP-YOLOE [55]	36	39.6	24.6
YOLOv8 [56]	59	46.3	27.8
Ours	51	56.8	34.9

Table 9. Per-category AP₅₀ (%) of the compared models on VisDrone2019.

Category	YOLOv3-SPP [22]	YOLOv4 [23]	YOLOv5l [24]	TPH-YOLOv5 [54]	YOLOv7 [5]	Ours
All	47.7	47.3	48.1	48.9	48.5	56.8
Motor	55.4	58.2	55.8	60.2	56.3	69.8
Bus	63.8	54.3	64.3	56.1	64.9	65.1
Awning-tricycle	18.3	21.8	18.4	22.6	18.6	26.2
Tricycle	36.5	35.9	36.7	37.2	37.1	43.1
Truck	47.4	41.4	47.8	42.9	48.2	49.7
Van	48.1	47.9	48.5	49.5	48.9	57.5
Car	83.6	74.8	84.3	77.3	85.1	89.7
Bicycle	23.4	31.9	23.6	33.2	23.8	38.3
People	45.1	48.3	45.5	49.8	45.9	57.9
Pedestrian	55.3	58.1	55.8	60.1	56.3	69.6

Table 10. Comparison experiments on NWPU VHR-10.

Algorithm	FPS	mAP₅₀ (%)	mAP_50:95 (%)
YOLOv3-SPP [22]	39	89.1	55.2
YOLOv4 [23]	42	86.9	54.3
YOLOv5l [24]	36	87.6	54.9
YOLOv7 [5]	61	88.3	55.8
TPH-YOLOv5 [54]	41	92.3	56.8
PP-YOLOE [55]	45	81.8	54.2
YOLOv8 [56]	67	89.9	56.3
Ours	58	94.6	58.3

Table 11. Per-category AP₅₀ (%) of the compared models on NWPU VHR-10.

Category	YOLOv3-SPP [22]	YOLOv4 [23]	YOLOv5l [24]	TPH-YOLOv5 [54]	YOLOv7 [5]	Ours
Tennis court	89.1	89.6	87.8	92.1	95.9	97.3
Baseball diamond	94.4	95.4	92.4	97.0	97.4	99.4
Storage tank	65.3	67.1	85.5	88.3	83.5	93.5
Airplane	94.6	96.1	92.5	96.1	96.8	98.5
Ship	90.8	92.0	84.9	89.2	83.6	93.6
Basketball court	87.5	73.0	85.7	90.8	85.6	93.6
Ground track field	92.5	94.1	89.8	94.9	96.3	98.2
Harbor	93.9	80.9	83.5	89.1	95.2	95.6
Bridge	90.4	86.7	82.6	89.4	62.1	83.5
Vehicles	92.3	93.9	91.2	95.7	86.8	92.8
All	89.1	86.9	87.6	92.3	88.3	94.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
