1. Introduction
With the advancement of UAV technology, UAVs have become an ideal platform for multi-target detection due to their superior payload capacity, ease of operation, and flexible mobility. UAVs play an important role in industrial risk monitoring [1], terrain exploration [2], crowd [3] and vehicle safety monitoring [4], and air and water quality monitoring [5], especially in applications where pollution sources are tracked. In these applications, the UAV’s target acquisition and tracking technology is particularly important. However, because remote sensing platforms such as drones operate at high altitude, images often contain a large number of small objects [6], as shown in Figure 1. The detection of these small targets faces several challenges: low resolution, limited feature information, and susceptibility to clutter and occlusion in complex environments. These factors make small target detection particularly difficult.
With the rapid advancement of deep learning technology, the capabilities of object-detection models have improved significantly. Mainstream object-detection algorithms can be categorized into three types: two-stage algorithms, one-stage algorithms, and vision transformers. Two-stage object detection first generates candidate regions and then performs classification and regression on these regions; typical methods include the R-CNN series, such as Faster R-CNN [7], Mask R-CNN [8], Cascade R-CNN [9], and R-FCN [10]. One-stage object-detection methods treat detection as a single regression problem and directly output object category and location information. The most common are anchor-based detectors, including the You Only Look Once (YOLO) series [11,12,13,14], the Single-Shot MultiBox Detector (SSD) [15], and RetinaNet [16]. These algorithms determine the presence, location, and size of the target object by dividing the image into anchors and performing classification and regression on each anchor. Vision transformers divide the image into sequential patches and process these sequences with a self-attention mechanism for object detection. Algorithms in this class include the Swin transformer [17], the detection transformer (DETR) [18], MViT [19], and their successors.
These algorithms differ markedly in their design principles and architectures, and each has different advantages and disadvantages in terms of speed, accuracy, efficiency, and model size. For example, two-stage algorithms achieve high accuracy, performing especially well on small objects, and their detection accuracy can be further improved by methods such as multi-scale fusion and model distillation. Their main drawback is slow processing: generating a large number of candidate regions increases the computational load and time complexity. Vision transformers provide an end-to-end object-detection scheme that simplifies the overall architecture, but their high computational requirements make them unsuitable for embedded devices with limited resources. One-stage algorithms, by contrast, do not need to generate candidate regions, which simplifies the detection pipeline; their speed and real-time performance make them very effective in scenes that require immediate responses, although their accuracy in detecting small objects may be slightly lower.
Balancing speed, accuracy, and model size is critical for UAV missions, so single-stage object-detection algorithms such as YOLO are preferred for their excellent real-time performance and high accuracy. Although the YOLO algorithm is mainly designed for full-size objects, its performance can be limited in scenes with unusual object scales. Researchers have developed various strategies to address the challenge of small object detection, with successful applications of YOLO series models. For example, Ding et al. [20] extended the YOLOv4 model by integrating the coordinate attention mechanism (CA) [21] into MobileNetV2 [22], replacing the original backbone, and using depthwise separable convolution and the squeeze-and-excitation (SE) module [23]. These modifications improved the learning ability of the convolutional neural network, maintaining the accuracy of feature extraction while reducing complexity and yielding better overall performance; however, YOLOv4 still contains some redundant structures. Similarly, Wu et al. [24] introduced a region proposal network and improved YOLOv5’s small object detection capabilities by incorporating the multi-scale anchor mechanism from Faster R-CNN, enabling YOLOv5 to adapt well to images of different sizes. Zhu et al. [25] integrated transformer prediction heads (TPHs) into the YOLOv5 structure and proposed the TPH-YOLOv5 model. This model introduced an additional prediction head for detecting objects of different scales and used the self-attention mechanism of TPHs; it also incorporated the Convolutional Block Attention Module (CBAM) [26] into YOLOv5. These additions improved the model’s ability to discriminate cross-scale objects and to identify key regions in dense object scenes, although they introduced additional computational and parameter complexity, which affected detection efficiency. MS-YOLOv7, proposed by Zhao et al. [27], built on YOLOv7 by increasing the number of detection heads from three to four to better extract features at different scales. They also incorporated the Swin transformer, Window Multi-Head Self-Attention (W-MSA), Shifted Window Multi-Head Self-Attention (SW-MSA) [28], and CBAM attention mechanisms to improve the neck features of the network; in addition, the soft-NMS method improved the performance of NMS in densely distributed object detection. Lin et al. [29] proposed YOLOv8n-SLIM-CA, which uses mosaic data augmentation to generate many small target instances, thereby increasing the overall robustness of the network. The backbone network was augmented with the coordinate attention mechanism to improve focus on key regions in complex backgrounds and suppress interference from irrelevant features, and in the neck network they adopted a slim-neck structure and added a small object detection layer to improve the detection of small targets in complex backgrounds. Zhang et al.’s Drone-YOLO [30] used a three-layer path aggregation feature pyramid network (PAFPN) structure with a dedicated small-object detection head. A sandwich fusion module was used to optimize semantic features, giving the detection head feature vectors with high spatial resolution and accurate semantic information and improving overall detection performance, while the RepVGG reparameterized convolution module improved the model’s ability to understand features of different scales and thus the detection accuracy for small objects. Finally, Wang et al. [31] developed UAV-YOLOv8, which introduced a BiFormer attention mechanism to optimize the backbone network and improve the model’s focus on critical information. They also introduced the Focal FasterNet Block (FFNB) for feature processing. This model added two new detection scales that effectively fuse shallow and deep feature information, significantly improving detection performance and reducing the miss rate for small objects.
These contributions collectively demonstrate the continuous progress in adapting YOLO models for small object detection. Nevertheless, complex backgrounds, large differences in spatial resolution, and the ubiquitous presence of irregularly arranged small objects mean that small objects still need to be detected more effectively. To address these challenges and optimize detection accuracy for small objects, this paper proposes an improved neural network model based on the YOLOv8n architecture. Our approach enhances the C2f module, integrates advanced attention mechanisms, introduces a novel feature pyramid network, and strengthens the detection head to improve the feature representation of small objects and thus the overall detection performance. The proposed model structure and corresponding experimental results demonstrate the effectiveness of the model in detecting small objects in complex scenes. Comparative analysis with existing models indicates that it exhibits superior performance in drone target detection.
The key contributions of this work encompass the following:
Introducing the DualC2f module, which integrates 1 × 1 and 3 × 3 dual-convolution kernels to simultaneously process the identical input feature map channels. The use of group convolution techniques efficiently arranges the convolution filters, solving the problems of inter-channel communication and information preservation within the original input feature map, which significantly improves the accuracy of the model in identifying small targets.
The proposed DCNv3LKA attention mechanism combines large convolution kernels and deformable convolutions to simulate receptive fields similar to those of self-attention. This approach adapts to the wide variations in targets while avoiding the high computational costs associated with traditional self-attention mechanisms.
To mitigate the common problem of misidentification and omission of small targets in aerial imagery, we developed the SDI-FPN feature pyramid network. This architecture integrates both semantic and detailed information to improve target detection accuracy. Inspired by BiFPN, it achieves more efficient feature fusion by fully integrating shallow and deep features. By inserting four CBS blocks with a stride of 1 and a kernel size of 1 between the backbone and neck components, the storage and transmission of image feature information is improved. Furthermore, we introduce a novel detection scale optimized for small objects, which effectively mitigates the loss of contextual information within the model.
We propose a spatial feature fusion method with coordinate adaptivity for optimizing the original detection head. This method first integrates features from different levels and adjusts them to the same resolution using weights, effectively filtering out spatial information conflicts. Then, it enhances the spatial expression of features by incorporating the coordinate attention (CA) mechanism. This mechanism assists the model in learning regions in the feature map that are relevant to the target position, thereby improving the model’s localization accuracy. This optimization strategy enhances the model’s ability to represent features of small-sized targets, especially in complex backgrounds, thereby increasing detection accuracy.
This paper is divided into five sections. In Section 2, we review related work relevant to our research. Section 3 provides a detailed explanation of the improvements and implementation details. Section 4 provides an overview of the experimental process, including the configurations and specific experimental methods, followed by a comparative analysis of the experimental results and a visualization of the results. Finally, Section 5 summarizes the current work and suggests future directions.
3. Methods
The YOLOv8 detector consists of three primary components, the backbone, the neck, and the detection head, and is available in five variants, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, each varying in channel width, depth, and maximum number of channels. All input images are resized to a uniform size of 640 × 640. The backbone network extracts feature maps from the input images through repeated convolutions, creating three layers of feature maps (80 × 80, 40 × 40, and 20 × 20). The backbone uses Cross-Stage Partial Darknet (CSPDarknet) [12] for feature extraction, enriching gradient information by replacing the original Cross-Stage Partial (CSP) module with the C2f module. In addition, a Spatial Pyramid Pooling Fast (SPPF) module at the end of the backbone pools the input feature maps to a fixed size to accommodate outputs of different sizes. The neck uses a PAN-FPN structure, forming a top-down and bottom-up network architecture. The detection head adopts a decoupled structure, using two independent branches for object classification and bounding box regression, with distinct loss functions for each task: Binary Cross-Entropy (BCE) loss for classification and Distribution Focal Loss (DFL) [39] plus Complete IoU Loss (CIoU) [40] for regression. To address the detection of small and multi-scale objects within the YOLOv8 network, we propose the DDSC-YOLO network model, which is based on the YOLOv8n baseline combined with a multi-scale attention module and is designed specifically for detecting small targets in drone aerial images. The focus is on improving the spatial-feature-extraction capability for target detection and building an efficient multi-scale feature fusion network to overcome the challenges posed by background clutter and large target variations in UAV aerial imagery. The detailed network structure is shown in Figure 2.
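For reference, the following is a minimal PyTorch sketch of the CBS (Conv-BN-SiLU) block and the SPPF module as they are commonly implemented in YOLOv8; class names and channel choices are illustrative, not taken from the authors’ code.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard YOLOv8-style CBS block: Conv2d -> BatchNorm -> SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling Fast: three chained 5x5 max-pools, concatenated."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1, 1)
        self.cv2 = ConvBNSiLU(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # Concatenating the successively pooled maps emulates pooling at several scales.
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```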
3.1. DualC2f
In YOLOv8, the C2f module is employed to integrate low-level and high-level feature maps, facilitating the capture of rich gradient information flow. However, as the number of layers in a convolutional neural network grows, the semantic information within feature maps is progressively extracted and aggregated, so deep feature maps contain redundant information. Additionally, because of the weight-sharing mechanism of convolutional layers, kernel parameters are shared across different positions of deep feature maps, further exacerbating this redundancy. The bottleneck module, which consists of numerous complex convolutions, greatly increases the parameter size and computational complexity. To address this, we propose the DualC2f module, in which the bottleneck is replaced with the newly designed DualBottleneck module. This structural design not only maintains the network depth and representational capability, but also reduces the redundancy of feature map information, as shown in Figure 3.
The DualConv [41] module optimizes convolutional operations by employing group convolution and heterogeneous convolution techniques to process the input feature map channels efficiently. Within this module, some kernels perform both 3 × 3 and 1 × 1 convolution operations, while others solely execute 1 × 1 convolution. The structure is shown in Figure 4, where M represents the number of input channels (i.e., the depth of the input feature map), N denotes the number of convolution filters and output channels (i.e., the depth of the output feature map), and G signifies the number of groups in the group convolution and dual convolution. The N convolution filters are partitioned into G groups, with each group processing the entire input feature map. Within these groups, M/G input feature map channels undergo simultaneous 3 × 3 and 1 × 1 convolution operations, while the remaining M − M/G input channels are processed using 1 × 1 convolution kernels only.
The design of DualConv preserves the original information of the input feature map by employing the group convolution strategy. This approach fosters improved information sharing between convolutional layers and facilitates maximum cross-channel communication across the M input channels, thereby achieving efficient information flow and integration across different feature map channels. Therefore, replacing the bottleneck structure in C2f with the DualBottleneck described above enriches the gradient flow representation, improves feature extraction, and reduces false and missed detections, making the network better suited to a variety of small target scenarios.
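As an illustration of the dual-convolution idea described above, here is a hedged PyTorch sketch in which a grouped 3 × 3 branch and a pointwise 1 × 1 branch operate on the same input and their outputs are summed; the class names, group count, and SiLU activation are assumptions rather than the authors’ reference implementation.

```python
import torch.nn as nn

class DualConv(nn.Module):
    """Sketch of dual convolution: a grouped 3x3 branch plus a 1x1 pointwise
    branch over the same input channels; their outputs are summed.
    c_in and c_out must be divisible by `groups`."""
    def __init__(self, c_in, c_out, stride=1, groups=4):
        super().__init__()
        # Grouped 3x3 convolution: each filter group sees c_in // groups channels.
        self.gc3 = nn.Conv2d(c_in, c_out, 3, stride, 1, groups=groups, bias=False)
        # Pointwise 1x1 convolution over all input channels preserves
        # cross-channel information flow.
        self.pw1 = nn.Conv2d(c_in, c_out, 1, stride, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.gc3(x) + self.pw1(x)))

class DualBottleneck(nn.Module):
    """Sketch of a bottleneck whose convolutions are replaced by DualConv,
    with an optional residual shortcut as in the C2f bottleneck."""
    def __init__(self, c, shortcut=True, groups=4):
        super().__init__()
        self.conv1 = DualConv(c, c, groups=groups)
        self.conv2 = DualConv(c, c, groups=groups)
        self.add = shortcut

    def forward(self, x):
        y = self.conv2(self.conv1(x))
        return x + y if self.add else y
```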
3.2. DCNv3LKA Attention Mechanism
The target size in UAV aerial photography is small and highly variable, making it difficult for traditional convolution to extract features efficiently. As network depth increases during training, critical inter-layer information is often lost, significantly affecting object detection accuracy. To tackle this issue, we integrated a DCNv3LKA attention mechanism, based on DCNv3 [42] and Deformable Large Kernel Attention, into the backbone network. This mechanism enhances the focus on crucial small-target information within the input features, suppresses non-essential details, and reduces the impact of background noise. The attention mechanism, which combines a large convolution kernel with deformable convolution, is illustrated in Figure 5. The DCNv3LKA module can be expressed as follows:

$$\mathrm{Attention} = \mathrm{Conv}_{1\times 1}\big(\mathrm{DCN}_{\mathrm{DW\text{-}D}}\big(\mathrm{DCN}_{\mathrm{DW}}(F)\big)\big)$$

$$\mathrm{Output} = \big(\mathrm{Attention} \otimes F\big) \oplus F$$

where the input feature is denoted by F, with F ∈ ℝ^{C×H×W}, and DCN_DW and DCN_DW-D denote a deformable depthwise convolution and a deformable depthwise dilated convolution, respectively, which together emulate a large receptive field. The attention component, represented as an attention map in ℝ^{C×H×W}, assigns a relative importance to each corresponding feature. The operator ⊗ signifies matrix multiplication, while the ⊕ operator denotes elementwise summation.
Deformable convolution v3 enables the sampling grid to deform adaptively using offsets learned from the features themselves, resulting in a flexible convolution kernel. Initially, the input feature maps are divided into G groups, with convolution performed on each group. This multi-group mechanism enhances the expressiveness of deformable convolution and achieves shared convolution weights w_g within each group. Multiple sets of convolution kernel offsets and modulation factors are computed, with softmax normalization applied to the modulation scalars to stabilize network training. The output feature map is then calculated from these sampling points and offsets. The formulation for deformable convolution v3 is as follows:

$$y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g\!\left(p_0 + p_k + \Delta p_{gk}\right)$$

where G is the total number of aggregation groups and K is the total number of sampling points. For the g-th group, w_g ∈ ℝ^{C×C′} represents the location-irrelevant projection weights of the group, where C′ = C/G represents the group dimension; m_gk ∈ ℝ denotes the modulation scalar of the k-th sampling point in the g-th group, normalized by the softmax function along the k dimension; x_g ∈ ℝ^{C′×H×W} represents the sliced input feature map; and Δp_gk is the offset corresponding to the k-th grid sampling location p_k in the g-th group.
DCNv3LKA overcomes the limitations of traditional convolution concerning long-range dependencies and adaptive spatial aggregation, improving the network’s ability to adapt to object deformation. This makes it more suitable for detecting small targets in complex natural backgrounds and enhances the model’s accuracy in target detection.
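To make the structure above concrete, here is a rough PyTorch sketch of a deformable large-kernel attention block. It uses torchvision’s DeformConv2d (a DCNv2-style modulated deformable convolution) as a stand-in for DCNv3, and the 5 × 5 / dilated 7 × 7 kernel decomposition is an assumption borrowed from common large-kernel attention designs, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableDWConv(nn.Module):
    """Depthwise deformable convolution: offsets and modulation masks are
    predicted from the input; DeformConv2d stands in for DCNv3 here."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.k2 = kernel_size * kernel_size
        # Predict 2*k*k offsets and k*k modulation scalars per spatial location.
        self.offset_mask = nn.Conv2d(channels, 3 * self.k2, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size,
                                   padding=padding, dilation=dilation,
                                   groups=channels, bias=False)

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k2]
        mask = torch.sigmoid(om[:, 2 * self.k2:])  # modulation scalars in [0, 1]
        return self.deform(x, offset, mask)

class DCNv3LKA(nn.Module):
    """Deformable large-kernel attention: a deformable depthwise conv, a deformable
    depthwise dilated conv, and a 1x1 conv produce an attention map that reweights
    the input, followed by a residual (elementwise) summation."""
    def __init__(self, channels):
        super().__init__()
        self.dw = DeformableDWConv(channels, kernel_size=5, dilation=1)
        self.dw_dilated = DeformableDWConv(channels, kernel_size=7, dilation=3)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))  # attention map
        return attn * x + x                          # reweight, then residual sum
```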
3.3. Semantics and Detail Infusion Feature Pyramid Network
The original YOLOv8 network utilizes a path aggregation feature pyramid network (PAN-FPN) as the neck network. By bi-directionally aggregating features from both bottom-up and top-down pathways, it promotes the fusion of detailed and abstract information, effectively shortening the information path between them. During the upsampling and downsampling processes, features of the same dimension are stacked together to preserve information about small objects. However, for small targets with pixel-level resolution, due to insufficient attention to large-scale feature mappings, the demands for positional information are not fully met, reducing the quality of detection. Moreover, the reuse rate of features is low, and some information is lost after the long paths of upsampling and downsampling. Therefore, to enhance the performance in detecting low-pixel small targets and multidimensional feature fusion, we propose the SDI-FPN feature pyramid network.
First, we introduce four CBS modules between the backbone and the neck, each with a stride of 1 and a convolution kernel size of 1. These modules serve as containers for storing feature information from the backbone, aggregating both shallow feature maps (which have high resolution but weak semantic information) and deep feature maps (which have low resolution but rich semantic information). Feature information is transmitted along a designated path, enabling strong positional features from lower layers to be propagated upwards. These operations further enhance the expressive capability of multi-scale features. On this basis, using these four containers as new inputs of backbone feature information for the subsequent network, we add a new detection scale for small objects. This effectively addresses the loss of context information in the model and thus improves the detection of small targets with low pixel counts.
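A minimal sketch of these containers is given below, assuming the CBS block is the standard Conv-BN-SiLU composition; the class and function names are illustrative.

```python
import torch.nn as nn

def cbs(c_in, c_out, k=1, s=1):
    """Conv-BN-SiLU block; with k=1, s=1 it re-projects a feature map without
    changing its spatial resolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class BackboneContainers(nn.Module):
    """Four 1x1, stride-1 CBS 'containers' that store the backbone feature maps
    so they can be reused by the neck and by the extra small-object scale."""
    def __init__(self, channels_list):  # e.g., channels of the four backbone stages
        super().__init__()
        self.containers = nn.ModuleList(cbs(c, c) for c in channels_list)

    def forward(self, backbone_feats):
        return [m(f) for m, f in zip(self.containers, backbone_feats)]
```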
Furthermore, because the feature maps involved in fusion have different resolutions, rescaling is required, so the SDI mechanism is introduced to improve multidimensional feature fusion. The structure is shown in Figure 6.
For feature maps of various sizes, we begin by applying spatial and channel attention mechanisms to each level-i feature f_i^0, where i ∈ {1, 2, 3, 4} indexes the four feature levels, and we use CBAM to implement the spatial and channel attention. This process allows the features to incorporate both local spatial information and global channel information, expressed as:

$$f_i^{1} = \varphi_i\left(f_i^{0}\right)$$

where f_i^1 represents the processed feature map at the i-th level, and the calculation of the spatial and channel attentions at level i is denoted by φ_i. Subsequently, we apply a 1 × 1 convolution to adjust the channels of f_i^1 to c, resulting in the feature map f_i^2 ∈ ℝ^{H_i×W_i×c}, where H_i, W_i, and c are the height, width, and channels of f_i^2, respectively. We then resize the feature maps at each level j to match the resolution of f_i^2, as formulated below:

$$f_{ij}^{3} = \begin{cases} D\left(f_j^{2}, (H_i, W_i)\right), & j < i \\ I\left(f_j^{2}\right), & j = i \\ U\left(f_j^{2}, (H_i, W_i)\right), & j > i \end{cases}$$

$$f_{ij}^{4} = \theta_{ij}\left(f_{ij}^{3}\right)$$

where D, I, and U stand for adaptive average pooling, identity mapping, and bilinear interpolation of f_j^2 to the resolution H_i × W_i, respectively. In Equation (6), θ_ij represents the parameters of the smoothing operation, and f_ij^4 is the j-th smoothed feature map at the i-th level.

Next, we enhance the i-th level features by applying the elementwise Hadamard product to all the resized feature maps, integrating semantic information and finer details as follows:

$$f_i^{5} = H\left(f_{i1}^{4}, f_{i2}^{4}, f_{i3}^{4}, f_{i4}^{4}\right)$$

where H(·) denotes the Hadamard product.
The SDI-FPN pyramid network further improves the quality of feature extraction after feature fusion, obtains richer gradient flow information, effectively compensates for the loss of context information in the model, and improves the accuracy of small target detection.
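A compact PyTorch sketch of the fusion step described above follows; the common channel width, the 3 × 3 smoothing kernel, and the class name are illustrative assumptions, and the CBAM refinement of each input is assumed to have been applied beforehand.

```python
import torch.nn as nn
import torch.nn.functional as F

class SDIFusion(nn.Module):
    """Semantics-and-detail infusion: project every level to a common channel
    width, resize to the target level's resolution (adaptive average pooling to
    shrink, identity if equal, bilinear interpolation to enlarge), smooth with a
    3x3 convolution, and fuse with an elementwise (Hadamard) product."""
    def __init__(self, in_channels_list, c=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(ci, c, 1) for ci in in_channels_list)
        self.smooth = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1)
                                    for _ in in_channels_list)

    def forward(self, feats, target_level):
        # feats: attention-refined maps f_i^1, ordered high resolution -> low resolution.
        f2 = [p(f) for p, f in zip(self.proj, feats)]      # f_j^2, common channel width
        h, w = f2[target_level].shape[-2:]
        fused = None
        for j, fj in enumerate(f2):
            if fj.shape[-2] > h:        # higher resolution: adaptive average pooling (D)
                fj = F.adaptive_avg_pool2d(fj, (h, w))
            elif fj.shape[-2] < h:      # lower resolution: bilinear upsampling (U)
                fj = F.interpolate(fj, size=(h, w), mode="bilinear", align_corners=False)
            # equal resolution: identity mapping (I)
            fj = self.smooth[j](fj)                         # f_ij^4, smoothed
            fused = fj if fused is None else fused * fj     # Hadamard product
        return fused
```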
3.4. Coordinate Adaptive Spatial Feature Fusion Mechanism Detection Head
To further enhance the accuracy of small target detection and address the impact of conflicting information among targets during cross-scale fusion, we have improved the detection head of YOLOv8. This study introduces a novel coordinate adaptive spatial feature fusion mechanism (CASFF), as shown in Figure 7. Initially, inspired by the adaptive spatial feature fusion (ASFF) strategy, features of adjacent scales are resized to the same dimensions, and spatial importance weights representing the four different layer feature mappings are calculated, effectively filtering out conflicting information and enhancing scale invariance. By using these learned weights for weighted fusion, we obtain a new multi-scale feature map tensor, calculated as:

$$y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \rightarrow l} + \beta_{ij}^{l}\, x_{ij}^{2 \rightarrow l} + \gamma_{ij}^{l}\, x_{ij}^{3 \rightarrow l} + \delta_{ij}^{l}\, x_{ij}^{4 \rightarrow l}$$

where l ∈ {1, 2, 3, 4} denotes the different feature layers, x_ij^{n→l} represents the feature vector at position (i, j) resized from layer n to layer l, and α_ij^l, β_ij^l, γ_ij^l, and δ_ij^l are the learned weights. These weights are defined by using the softmax function with control parameters λ_α, λ_β, λ_γ, and λ_δ, respectively, and α_ij^l is defined as follows:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}} + e^{\lambda_{\delta_{ij}}^{l}}}$$

ensuring that α_ij^l + β_ij^l + γ_ij^l + δ_ij^l = 1.
To prevent the loss of spatial information during feature fusion, a CA mechanism is introduced to improve the network’s focus on essential information, reducing false positives and missed detections. This mechanism involves two one-dimensional global pooling operations that aggregate the input features in vertical and horizontal directions, creating two independent direction-aware feature maps. These maps are then concatenated and processed through a convolutional layer, followed by a batch normalization (BN) layer and a non-linear activation layer. Subsequently, the feature maps are separated and processed through another convolutional layer, and the coordinate features are normalized using the Sigmoid function to obtain coordinate attention weights. These two direction-embedded feature maps are encoded into attention maps, each capturing long-range dependencies along one spatial direction of the input feature map. Finally, these attention maps are multiplied with the input feature map to enhance its representational power.
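For completeness, a minimal PyTorch sketch of the coordinate attention operation described above is shown below, following the usual formulation; the reduction ratio and activation are conventional choices, not necessarily those used in the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along the horizontal and vertical directions,
    encode the two direction-aware maps jointly, split them back into two
    attention maps, and multiply them with the input."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, 1)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(hidden, channels, 1)
        self.conv_w = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # pool along width  -> (N, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # pool along height -> (N, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)                       # concatenate direction-aware maps
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (N, C, 1, W)
        return x * a_h * a_w
```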
By integrating the adaptive spatial feature fusion (ASFF) strategy and the coordinate attention (CA) module into the detection head, the model not only leverages the multi-scale feature fusion advantages of the ASFF module, but also incorporates coordinate information through the CA attention mechanism. This combination significantly enhances the spatial expressiveness and precision of the final feature map. Consequently, the model’s ability to perceive the positional information of small targets and long-range dependencies is greatly improved. This enables more accurate localization of small targets within detected images, thereby significantly enhancing the detection accuracy for small targets.
4. Experiments
Our experimental environment used Ubuntu 20.04.5 LTS as the operating system, with processing performed by an Intel(R) Xeon(R) CPU E5-2680 v4 and an NVIDIA GeForce RTX 3090 GPU (24 GB). The software framework consisted of Python 3.8 and torch 2.0.0+cu117. We adopted the YOLOv8n model as our experimental baseline, employing a batch size of 8 for 200 training epochs. The optimizer chosen was Stochastic Gradient Descent (SGD), initialized with a learning rate of 0.01. The input images were standardized to a resolution of 640 × 640 pixels. Throughout the training procedure, mosaic data augmentation was applied to strengthen the model’s robustness.
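Assuming the standard Ultralytics training interface, the configuration above corresponds roughly to the following call; the model and dataset YAML paths are placeholders, and the exact augmentation settings used by the authors may differ.

```python
from ultralytics import YOLO

# Placeholder paths: a model definition and a VisDrone-style dataset YAML.
model = YOLO("yolov8n.yaml")

model.train(
    data="VisDrone.yaml",   # dataset configuration (placeholder path)
    epochs=200,             # training epochs
    batch=8,                # batch size
    imgsz=640,              # input resolution
    optimizer="SGD",        # stochastic gradient descent
    lr0=0.01,               # initial learning rate
    mosaic=1.0,             # mosaic data augmentation enabled
)
```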
4.1. Dataset and Evaluation Criteria
We assessed the efficacy of our model using the VisDrone2019 dataset [43], a UAV-based dataset tailored for object detection and tracking tasks, facilitating comprehensive evaluation and research of visual analysis algorithms for UAV platforms. This dataset offers a rich and diverse collection of images captured from various perspectives and tasks using drones. Noteworthy features include a wide array of detected objects, ranging from highly diverse to monotone, varying numbers and distributions of detected objects, and observations captured under both day and night lighting conditions. Comprising 10 categories, the training set consisted of 6471 images, the validation set of 548 images, and the test set of 1610 images, with the weight of each class in the training set proportional to the number of labels. Notably, the dataset exhibits class imbalance and includes many small objects, as depicted in Figure 8. According to the COCO standard [44], we categorized the objects into three sizes: small, medium, and large. Specifically, objects with bounding box areas smaller than 32 × 32 pixels were classified as small objects, those with areas larger than 32 × 32 pixels but smaller than 96 × 96 pixels were classified as medium objects, and those with areas larger than 96 × 96 pixels were classified as large objects.
Table 1 shows the distribution of large, medium, and small objects for each category. As can be seen, small objects account for 60.49% of the targets in the VisDrone2019 dataset. This high proportion of small targets makes it an ideal dataset for evaluating the performance of small object detection.
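For illustration, the size rule above can be expressed as a small helper (the function name is hypothetical):

```python
def coco_size_category(box_w: float, box_h: float) -> str:
    """Classify a bounding box by area following the COCO convention used above."""
    area = box_w * box_h
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```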
To assess the effectiveness of our proposed enhanced detection model, we employed the precision (P), recall (R), average precision (AP), mean average precision at an IoU threshold of 0.5 (mAP0.5), and mean average precision over IoU thresholds from 0.5 to 0.95 (mAP0.5:0.95) as the evaluation metrics. Precision denotes the ratio of correctly detected targets to the total number of predicted targets, while recall indicates the ratio of correctly detected targets to the total number of actual targets. The formulas for these metrics are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP (True Positive) represents the count of instances where the model correctly identifies positive-class samples as positive, FP (False Positive) signifies the count of instances where the model incorrectly identifies negative-class samples as positive, and FN (False Negative) denotes the count of instances where the model incorrectly classifies positive-class samples as negative.

The F1 score, ranging from 0 to 1, represents the harmonic mean of precision and recall. It takes into account the significance of both precision and recall, effectively assessing algorithm performance. The formula is as follows:

$$F1 = \frac{2 \times P \times R}{P + R}$$

The average precision (AP) evaluates the performance of object detection models for a particular class. It is computed as the area under the precision–recall (P–R) curve, with recall plotted on the x-axis and precision on the y-axis:

$$AP = \int_{0}^{1} P(R)\, dR$$

The mean average precision (mAP) signifies the average precision across all categories, serving as a comprehensive performance metric for multi-class detection tasks:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

Specifically, mAP0.5 denotes the mean average precision at an IoU threshold of 0.5, while mAP0.5:0.95 represents the mean average precision averaged over IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05.
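The definitions above can be sketched numerically as follows; this uses a generic all-point interpolation for AP and is not necessarily the exact evaluation code used in the experiments.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from raw detection counts, as defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve via all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of per-class AP values; mAP0.5:0.95 averages this over
# IoU thresholds from 0.5 to 0.95 in steps of 0.05.
```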
4.2. Experimental Results
Remotely sensed images, such as those from UAVs, have significantly different characteristics from images captured by ground personnel. As a result, UAV image recognition is more challenging than traditional image recognition. To confirm the superiority of the proposed DDSC-YOLO model over the YOLOv8n model in remote sensing detection tasks, comparative experiments were conducted on the VisDrone2019 validation set to assess the performance of both models. To better evaluate the model’s object-detection capability at small scales, the small-object mAP (mAP_s) indicator was added. The results, presented in Table 2, indicate that, compared to the baseline model, our strategy achieved significant improvements of 9.2% in P, 7.4% in R, 8.3% in F1, 9.3% in mAP0.5, and 6.4% in mAP0.5:0.95. Additionally, mAP_s shows significant growth compared to YOLOv8n, YOLOv8s, and YOLOv8m. These improvements demonstrate an overall enhancement in detection accuracy and indicate that the enhanced model significantly boosts the detection accuracy for small targets.
The precision–recall (P–R) curve illustrates the trade-off between precision and recall across various thresholds. A larger area under the P–R curve signifies better overall performance, reflecting higher average precision (AP). This means the model achieves greater accuracy and recall in classification tasks. As depicted in Figure 9, the area under the P–R curve for the DDSC-YOLO model (0.422) surpasses that of YOLOv8n (0.329), confirming the superior detection capability of our enhanced model over the original YOLOv8n.
To visually compare the models before and after optimization, we used Gradient-weighted Class Activation Mapping (GradCAM) to visualize the output layers for small objects from both YOLOv8n and DDSC-YOLO. As shown in Figure 10, without improvements to YOLOv8n, the model is affected by densely packed objects and complex backgrounds, resulting in less focus on small and distant targets.
After applying the above optimization steps to YOLOv8n, the model shows an improved focus on small target regions. For example, compared to the baseline, the improved model better highlights distant cars in the background of the square in the first image, people and vehicles under night lighting conditions in the second image, and distant cars in the background of the elevated bridge in the third image. The baseline method shows a weak classification ability when dealing with edge, clustered, or blurred objects, often misclassifying background as foreground targets. In contrast, our proposed approach shows promising performance, indicating that the improved model effectively prevents information loss when downsampling small targets. Our method has distinct advantages in handling small targets and overall image understanding, effectively suppressing irrelevant background interference. As a result, we have improved the detection accuracy of small targets.
4.3. Ablation Experiment
To evaluate the impact of the newly added and modified modules on the performance of the baseline model, a series of controlled experiments was conducted for evaluation and comparison. Using YOLOv8n as the baseline, we performed training and testing under consistent experimental conditions, employing the VisDrone2019 dataset and identical training parameters. The experimental outcomes are summarized in Table 3.
As shown in Table 3, each enhancement strategy applied to the baseline model resulted in varying degrees of improved detection performance. In particular, replacing the original path aggregation feature pyramid network with the SDI-FPN structure and adding a detection head specifically for small targets resulted in significant improvements in all metrics: the precision (P), recall (R), mAP0.5, mAP0.5:0.95, and mAP_s increased by 5.2%, 5.1%, 5.8%, 4.1%, and 4.9%, respectively. These improvements are attributed to effectively preventing the loss of contextual information during the multi-level fusion process, facilitating the integration of low-level and high-level information, and significantly enhancing the overall performance metrics of the model. To adapt to the wide variation in target sizes, more emphasis was placed on critical information in small targets within the input features, while suppressing non-critical feature information: the inclusion of the DCNv3LKA module in the feature-extraction process resulted in a 0.8% increase in the mAP, and the mAP_s improved by 0.6%. Using the DualC2f module instead of the C2f module improved the mAP0.5 by 0.6% by reducing the redundancy of feature map information and promoting information sharing between convolutional layers, thus enriching the gradient flow. In addition, the introduction of the coordinate adaptive spatial feature fusion mechanism in the detection head effectively mitigated conflicts between different feature levels, filtered out conflicting information, and improved the model’s ability to recognize images at different scales, with increases of 3.8%, 1.3%, 2.1%, 1.1%, and 0.9% in the P, R, mAP0.5, mAP0.5:0.95, and mAP_s, respectively. These experimental results demonstrate that detailed optimizations at different stages of the algorithm can significantly improve the model’s learning efficiency and confirm the effectiveness of each optimization measure. The optimized network structure effectively addresses the challenges of detecting small-sized targets.
In the detection head, the addition of the CA attention mechanism was compared with commonly used attention mechanisms. Table 4 shows the performance of the model after incorporating different attention mechanisms. Compared to SGE [45], SimAM [46], and the Convolutional Block Attention Module (CBAM), the CA attention mechanism demonstrated the best detection performance. The inclusion of the CA allows the model to extract features more effectively, particularly for small and dense objects that often blend into blurred edges and backgrounds in the dataset. This significantly reduces the influence of irrelevant information on detection results, thereby enhancing the model’s ability to detect small targets.
4.4. Comparative Experiment
To assess the enhancements made to the YOLOv8 network model, we compared the performance of our improved model with that of current mainstream object-detection algorithms. The experimental results from this quantitative analysis and comparison are presented in Table 5.
On the mAP0.5 metric, our proposed DDSC-YOLO model achieved significant performance improvements: it outperformed Faster R-CNN by 20.5%, RetinaNet by 28.3%, Cascade R-CNN by 19%, and CornerNet by 24.8%. Faster R-CNN relies on relatively low-resolution feature maps extracted by its backbone network, resulting in comparatively poor detection accuracy for small objects. RetinaNet and CornerNet can struggle with dense objects and tend to ignore overlapping, extremely small objects, reducing detection accuracy. Cascade R-CNN’s multi-level detection enhances overall performance but adds complexity and training difficulty, necessitating further optimization. In contrast, our proposed model delivers superior results while avoiding the limitations of these networks.
Compared to YOLOv3-tiny, YOLOv4, YOLOv5s, YOLOv5m, CDNet, YOLOv6, YOLOv7-tiny, YOLOv8n, and YOLOv8s, the improvements were 18.6%, 1.8%, 9.2%, 4.9%, 8%, 13%, 6.7%, 9.3%, and 2.9%, respectively. YOLOv3-tiny and YOLOv7-tiny achieve lightweight models but sacrifice much detection accuracy. YOLOv4 achieved good detection accuracy but performed poorly in terms of model size. CDNet showed high detection accuracy in two categories, tricycle awning and tricycle, but suffered severe misses and false detections in other categories. YOLOv6 performed poorly in both detection accuracy and complexity. YOLOv5s and YOLOv8n have small model sizes, but these two models cannot meet the detection requirements of a high proportion of small objects. Notably, the DDSC-YOLO model has only 4.99 million parameters, an 80.7% reduction compared to the 25.9 million parameters of YOLOv8m; despite the significantly smaller parameter count, its detection accuracy remained on par with that of YOLOv8m, demonstrating its efficiency. Compared to improved algorithms designed to detect small targets in UAV and other remote sensing images, such as DBAI-Det, YOLOv5s-pp, LW-YOLO v8, DC-YOLOv8, and Drone-YOLO (nano), our method showed significant improvements, achieving increases of 14.2%, 0.5%, 5.9%, 0.7%, and 4.1%, respectively.
Specifically, in four categories—people, car, van, and motor—the DDSC-YOLO model demonstrated exceptional detection performance. This indicates that the model significantly outperformed other classical algorithms in small-target-detection tasks. This achievement is attributed to the proposed feature pyramid network, which enhances the fusion of semantic and detailed information, improves the overall feature representation and network performance, and implements four-scale detection and enhanced feature extraction capabilities. Moreover, the integration of a coordinate adaptive spatial feature fusion mechanism in the detection head optimizes the model’s recognition capabilities across different image scales. Comprehensive experimental results showed that the DDSC-YOLO model performed exceptionally well in detecting targets in aerial images captured by drones. The model effectively addressed the challenges posed by complex backgrounds in drone detection and considered the issue of class imbalance in the dataset, demonstrating robustness.
4.5. Generalization Test
To further evaluate the effectiveness of our method, we conducted comparative analyses with the YOLO series object-detection algorithms on the SSDD [53] and RSOD [54] datasets. For a fair comparison, we used the same hyperparameters and training policies as those employed for the VisDrone dataset. The SSDD dataset, designed for ship detection in satellite imagery, includes 1160 high-resolution remotely sensed images with a single category, ships, comprising a total of 2456 instances. This dataset is divided into training, validation, and testing sets with an 8:1:1 ratio, resulting in 928 images for training, 116 for validation, and 116 for testing. The RSOD dataset is tailored to object detection in remote sensing images and features four target categories: airplanes, oil tanks, playgrounds, and overpasses. It contains 4993 aircraft instances, 1586 oil tank instances, 191 playground instances, and 180 overpass instances. This dataset was also split into the same 8:1:1 ratio. Both datasets contain small and very small objects, making detection difficult and prone to false positives and false negatives.
According to Table 6, the DDSC-YOLO algorithm proposed in this paper outperformed Faster R-CNN, YOLOv3-tiny, YOLOv5s, YOLOv6, YOLOv7-tiny, and YOLOv8n in terms of precision, recall, mAP0.5, and mAP0.5:0.95 on both the SSDD and RSOD datasets. Specifically, compared to YOLOv8n, the mAP0.5 metric increased by 1.3% and 3.2% on the SSDD and RSOD datasets, respectively. This demonstrates that the DDSC-YOLO model introduced in our research is versatile and exhibits superior detection capabilities across various remote sensing scenarios, not being limited to specific datasets.
4.6. Visualization
Figure 11 shows the visualization results of the DDSC-YOLO model in some complex scenes and dense object areas in the VisDrone2019 dataset, with bounding box labels and confidence scores displayed. The results indicate that the DDSC-YOLO model can accurately detect small targets even in these challenging scenarios, maintaining high confidence scores and demonstrating its good performance in handling complex natural backgrounds and dense object detection.
To clearly demonstrate the detection performance of our method, we compared the detection results of DDSC-YOLO and the baseline method using visualization techniques. We selected some representative images from the VisDrone, SSDD, and RSOD datasets as experimental data. These images include small objects under different lighting conditions, backgrounds, and density levels, making them suitable for comparative analysis experiments.
Figure 12 presents these comparative results. As objects in drone and remote sensing images are typically small, we enlarged some samples to clearly display the differences. The results show that our method was more effective and robust than YOLOv8n in various complex situations. For instance, in group a, the baseline method was significantly impacted by occlusion from pedestrians and vehicles, whereas our method demonstrated notable superiority under low-light conditions. In groups b and e, where numerous small and occluding objects are present, the baseline method experienced a significant decrease in detection accuracy due to the influence of dense objects, while our model showed advanced detection capabilities in complex scenes. In groups c and d, our model showed a lower false negative rate than YOLOv8n despite the small size of the boats and the blurred remote sensing images. In addition, in group f, YOLOv8n produced significant false positives, identifying a single aircraft with two detection boxes, while our method accurately identified the object. These visualizations show that DDSC-YOLO can effectively guide the detector to focus on challenging regions, suppress non-critical information, and enhance the focus on critical information. It demonstrates good detection performance in practical scenarios with varying lighting conditions, backgrounds, and object scales.
5. Conclusions
This study addresses the difficulty of detecting small targets in UAV aerial scenes, the high missed detection rate, and the frequent false detections, taking the YOLOv8n model as its basis. We propose a small-target-detection model specifically designed for UAV aerial scenes, called DDSC-YOLO. First, DualC2f addresses cross-channel communication issues in feature maps while preserving critical information. In addition, the DCNv3LKA attention mechanism improves inter-layer information retention and suppresses non-critical details, which is critical for detecting small targets against complex backgrounds. The SDI-FPN structure refines multi-level feature fusion, preventing the loss of contextual information; this integration enables seamless blending of low-level and high-level information, which is essential for the accurate detection of small targets. In addition, a novel coordinate adaptive spatial feature fusion mechanism in the detection head improves feature representation across different target scales.
Experimental validation on the VisDrone2019 dataset demonstrates the superior performance of our approach. Compared to YOLOv8n, DDSC-YOLO increases the mAP0.5 by 9.3% and the small-object mAP (mAP_s) by 6.6%, effectively improving small target detection accuracy and outperforming the other compared models on the mAP0.5 and mAP0.5:0.95 metrics. Our model also demonstrates robust generalization across the SSDD and RSOD datasets. Going forward, our research will focus on further reducing the model’s size and improving its detection accuracy and processing speed. We plan to explore efficient network optimization techniques such as quantization compression and sparse training to adapt to the characteristics of high-resolution images, thereby accelerating network training and improving validation accuracy. In addition, we will consider applying model pruning and knowledge distillation to eliminate redundant and unnecessary network connections. This will further improve the efficiency and speed of lightweight models, opening up new possibilities for object-detection technology in remote sensing image scenes.