1. Introduction
Unmanned aerial vehicles (UAVs) have experienced widespread adoption in recent years, driven by their cost-effectiveness and adaptability. Outfitted with cameras and various sensors, UAVs find application in diverse scenarios such as surveillance, rescue operations, agriculture, express delivery, and more. The effectiveness of these applications relies heavily on the seamless and robust recognition of objects within the UAV's field of view, as noted in recent comprehensive reviews [1,2,3].
Solutions including the R-CNN series [4,5], the YOLO series [6,7] and, most recently, query-based detectors [8,9,10] have shown leading performance on public datasets such as MS COCO [11]. Despite these accomplishments, when applied to UAV images, particularly datasets like VisDrone [12], the performance of generic detectors falls short in terms of both accuracy and efficiency. Challenges arise from hardware limitations, complexities in the imaging environment, and flight trajectory intricacies, leading to the following issues.
Scale variations: Owing to variations in flight altitude and the complex object distribution in UAV images, object sizes vary with distance from the camera, exhibiting a broad range of scales with a predominance of small-scale objects. As illustrated in Figure 1a, within the VisDrone dataset, the proportion of small objects (object size below 32 × 32 pixels) is markedly higher than in the COCO dataset, whereas large objects (object size above 96 × 96 pixels) account for only a small fraction. The limited feature representation of small objects poses a considerable challenge, leading to decreased accuracy and reliability in detecting them and thus significantly affecting overall detection performance.
Class imbalance: UAV images collected in urban environments exhibit a long-tailed class imbalance: a small subset of head classes (e.g., car and people) dominates the dataset, collectively representing over 70% of object instances, while numerous tail classes (e.g., bicycle and van) are severely underrepresented, constituting only 1% to 7% of instances (Figure 1b). This skewed distribution biases the model towards head classes during training; detectors prioritize optimizing features for high-frequency categories at the expense of tail classes. Consequently, tail classes suffer from poor localization accuracy and high false-negative rates due to insufficient discriminative feature learning.
To address these challenges, researchers have made many attempts. For scale variations, refs. [13,14,15,16] focus on multi-scale feature fusion, while refs. [17,18,19] emphasize hard sample mining. Nevertheless, due to input scale limitations, there is still room for improvement in accuracy. Another direct method involves cropping the image into subregions before applying detection, as in uniform cropping. However, such cropping strategies cannot adapt to the semantic information in UAV images and may include extensive background areas. Li et al. [20] utilize a density map to guide the cropping of images, enhancing the semantics of the resulting subregions. However, this approach requires a density map generation network, which increases model complexity and demands ground-truth density maps. Compared with scale variations, the challenge of class imbalance is often neglected in UAV images. In recent years, research on long-tailed detection has primarily fallen into three types: (1) resampling and data augmentation techniques [21]; (2) loss function reweighting methods [17,22]; and (3) decoupled optimization strategies [23]. However, existing approaches still struggle to adapt dynamically to changes in the distribution of tail classes. Zhang et al. [24] employ a multi-model fusion (MMF) strategy to handle the head and tail classes distinctly, effectively enhancing the detection performance of tail classes. However, this approach discards a substantial amount of valuable data during the training of each model, which could compromise the model's representational capacity.
In this paper, we propose AD-Det, a novel framework for object detection in UAV images, which adopts a coarse-to-fine strategy and mainly consists of two key components: adaptive small object enhancement (ASOE) and dynamic class-balanced copy–paste (DCC). Specifically, ASOE employs a high-resolution feature map from the classification head to pinpoint small objects. As shown in Figure 1c, most small objects in the image show significant activation in the high-resolution feature map. These positions of interest are clustered into K subregions, which are then enlarged and processed by a fine-grained detector, thereby enhancing the detection performance for small objects. DCC, on the other hand, performs object-level resampling through a dynamic copy-and-paste strategy tailored to tail-class objects. It leverages the clustering cues from ASOE to dynamically search for suitable pasting positions around each cluster center and maintains a dynamic memory bank for each tail class. Additionally, data augmentation is applied to avoid overfitting. In this way, AD-Det can extract regions with small objects for fine-grained detection and dynamically perform reasonable resampling for tail-class objects, thereby improving overall detection performance.
Figure 1. (a) Comparison of scale distribution between VisDrone and COCO. (b) Class distribution of VisDrone and UAVDT. (c) Visualization of the high-resolution feature map (P3) in GFL [25] for VisDrone. It can be seen that the high-resolution feature map mainly focuses on small objects. The masked areas on the left indicate regions containing large objects, which are easier to handle and can be ignored in the fine-grained detection stage.
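To make the ASOE procedure concrete, below is a minimal sketch of the region extraction step, assuming a per-position score map taken from the P3-level classification head. The names (asoe_subregions, conf_thresh) and the use of k-means from scikit-learn are illustrative choices under these assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the ASOE idea: pick high-activation positions from the
# detector's high-resolution classification map and cluster them into K
# subregions for fine-grained re-detection.
import numpy as np
from sklearn.cluster import KMeans

def asoe_subregions(heatmap: np.ndarray, conf_thresh: float = 0.5,
                    k: int = 4, stride: int = 8):
    """heatmap: (H, W) per-position max class score from the P3-level head."""
    ys, xs = np.where(heatmap > conf_thresh)       # positions of interest
    if len(xs) < k:                                # too few points to cluster
        return []
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10).fit(pts)
    regions = []
    for c in range(k):
        member = pts[km.labels_ == c]
        x0, y0 = member.min(axis=0)
        x1, y1 = member.max(axis=0)
        # map feature-map coordinates back to image coordinates via the stride
        regions.append((int(x0 * stride), int(y0 * stride),
                        int((x1 + 1) * stride), int((y1 + 1) * stride)))
    return regions  # each subregion is then enlarged and re-detected
```

Each returned box would then be enlarged and fed to the fine-grained detector, while the cluster centers also serve as the pasting cues used by DCC.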
Compared to existing coarse-to-fine solutions, our approach is a plug-and-play, unsupervised scheme. The advantages lie in leveraging the inherent properties of the detector to focus on small object regions that truly contribute to accuracy gains, thereby enhancing the precision of small object detection without extra learnable parameters. Additionally, we propose a training-only image augmentation technique for tail classes, which helps to alleviate class imbalance issues.
We summarize four main contributions of this work as follows.
We propose AD-Det, a novel object detection framework that addresses two key challenges in UAV image object detection—complex scale variations and class imbalance—ultimately enhancing detection performance.
To enhance small object detection, we propose ASOE, which uses a high-resolution feature map to cluster regions containing small objects, followed by processing them for fine-grained detection.
To enhance tail-class object detection, we propose DCC, which performs reasonable object-level resampling by dynamically pasting tail classes around clustering centers obtained by ASOE.
We conduct extensive experiments on VisDrone and UAVDT, which demonstrate that our approach significantly outperforms existing competitive alternatives.
2. Related Works
In object detection tasks, unlike traditional solutions that rely on handcrafted features and classifiers, DL-based solutions can automatically learn discriminative features and achieve promising results. Here, we briefly review object detection solutions in the deep learning era, including (1) generic object detection (Section 2.1), which focuses on natural images such as MS COCO [11]; (2) object detection in aerial images (Section 2.2), which focuses on aerial images such as VisDrone [12] and DOTA [26]; and (3) long-tailed object detection (Section 2.3), which focuses on long-tail scenarios such as LVIS [27].
2.1. Generic Object Detection
Sparked by impressive success in image classification tasks [28], DL-based methods have come to dominate generic object detection. Existing DL-based detection solutions, when analyzed according to their pipeline, can be roughly grouped into two-stage and one-stage methods. Two-stage detectors are represented by R-CNN [4], Fast R-CNN [29], Faster R-CNN [5], and Mask R-CNN [30]. These detectors first reduce the search space significantly by extracting candidate region proposals; the extracted proposals are then classified into specific categories and refined to proper locations. Such a pipeline leads to relatively higher accuracy but lower efficiency. One-stage methods, in contrast, directly predict objects' categories and locations without region proposals. Representative approaches include the YOLO series [6,7,31], SSD [32], and RetinaNet [17]. Such a design yields relatively higher efficiency but at the cost of accuracy. More recently, CornerNet [33] and FCOS [34] bypassed the anchor-box mechanism and its associated hyperparameter setting, presenting promising anchor-free alternatives for one-stage methods. DETR [35] pioneered a fully end-to-end object detector with a transformer-based architecture, eliminating reliance on anchor generation and non-maximum suppression (NMS). Although these detectors have achieved impressive progress in generic object detection, their performance on UAV images is far from satisfactory due to scale variations and class imbalance.
2.2. Object Detection in Aerial Images
Compared with generic objects, which are mostly captured from a ground-level view, object detection in UAV images presents heightened challenges due to object scale and object/category distribution problems.
Many studies focus on multi-scale/tiny-scale object problems in aerial images. Early research mostly migrated classical generic object detection algorithms. Yang et al. [36] proposed a novel query mechanism that effectively and efficiently incorporates an additional high-resolution layer to promote small object detection. In the realm of knowledge distillation, Zhu et al. [37] enhanced the performance of lightweight models at no extra cost by incorporating scale-aware knowledge from more complex ones. To detect small, weak objects in UAV images, Han et al. [38] presented a context- and scale-aware detector combining the strengths of context-aware learning and multi-scale feature extraction. Cao et al. [39] introduced a self-reconstructed difference map approach to enhance feature visibility for challenging tiny object detection tasks. DQ-DETR [40] specialized in tiny object detection by employing dynamic query selection and counting-guided feature enhancement. SDPDet [41] employed scale-separated dynamic proposals and activation pyramids to enhance the efficiency and accuracy of object detection in UAV views. Zhang et al. [42] enabled detectors to focus on discriminative features while reducing false positives in cluttered backgrounds. Zhang et al. [43] proposed IMPR-Det, which integrally mixes multi-scale pyramid representations for different components of an instance.
Besides object scale problems, many studies focus on object/category distribution problems. Inspired by research on crowd counting, Li et al. [20] injected density map estimation into a conventional object detection framework to predict object distribution, reduce the influence of background, and generate more balanced candidate proposals. Duan et al. [44] further adopted a coarse-grained density map to identify subregions more accurately. Similar observations regarding object distribution led [45] to improve the RPN in order to cluster region proposals. GLSAN [46] utilized a scale-aware algorithm to fuse global and local detection results. CZDet [47] enhanced cascade detection by incorporating high-density subregion labels into the repurposed detector. YOLC [48] employed a local scale module and deformable convolutions to enhance accuracy in aerial small object detection. Zhang [49] proposed structured adversarial self-supervised pretraining to strengthen both clean accuracy and adversarial robustness. Yu et al. [50] investigated long-tail category distribution issues, designing dedicated samplers and detection heads to address the distinct characteristics of tail and head classes.
These works have indeed advanced aerial image object detection. Nevertheless, when addressing scale variation and class imbalance challenges, most preceding studies treat these essential issues rigidly and separately, ignoring both the complexity of UAV images and the potential synergy between the two problems.
2.3. Long-Tail Object Detection
Much like the paradigm of long-tail classification, research on long-tail object detection predominantly follows two methodologies: resampling and reweighting. Commonly employed resampling techniques either oversample the minority classes or undersample the majority classes during training, aiming to mitigate imbalanced class distribution. Repeat factor sampling (RFS) [27] is an image-level sampling approach following the resampling paradigm. SimCal [51] proposed a bi-level class-balanced sampling approach to alleviate classification head bias. In reweighting, the fundamental concept is to assign diverse weights to training samples, with a focus on enhancing the training of tail samples. Cui et al. [52] introduced a loss function that improves class imbalance handling by dynamically adjusting the weights of training samples based on their effective numbers.
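For concreteness, both paradigms can be summarized in a few lines. The sketch below implements the published repeat factor of RFS [27] and the effective-number weights of Cui et al. [52]; the function names and data layout are our own illustrative choices.

```python
import numpy as np

def repeat_factors(img_cats, t=0.001):
    """Image-level repeat factors as in RFS [27]:
    r(c) = max(1, sqrt(t / f(c))), r(I) = max over categories c in image I,
    where f(c) is the fraction of training images containing category c."""
    n_imgs = len(img_cats)
    freq = {}
    for cats in img_cats:                      # img_cats: list of category lists
        for c in set(cats):
            freq[c] = freq.get(c, 0) + 1
    r_cat = {c: max(1.0, np.sqrt(t / (n / n_imgs))) for c, n in freq.items()}
    return [max(r_cat[c] for c in set(cats)) if cats else 1.0
            for cats in img_cats]

def effective_number_weights(counts, beta=0.999):
    """Class weights from Cui et al. [52]: w_c proportional to
    (1 - beta) / (1 - beta ** n_c), where n_c is the instance count."""
    counts = np.asarray(counts, dtype=np.float64)
    w = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return w / w.sum() * len(counts)           # normalize to mean weight 1
```

Images with a repeat factor above 1 are duplicated (possibly fractionally, via stochastic rounding) in each epoch, while the effective-number weights are applied per class inside the classification loss.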
Besides the aforementioned strategies, many researchers have tackled the long-tail problem from other perspectives. Kang et al. [23] advocated a decoupled training paradigm that disentangles learning into distinct phases: representation learning and classifier training. BAGS [53] grouped classes by instance frequency and applied a softmax loss within each group. ROG [54] proposed a multi-task learning approach that concurrently optimizes object-level classification and global-level score ranking. Hyun et al. [55] introduced the effective class-margin (ECM) loss as a surrogate objective for optimizing margin-based binary classification error on the imbalanced training set.
Many earlier coarse-to-fine counterparts pinpoint regions of interest by relying on extra learnable modules or by post-processing sparse boxes. Our method, however, differs by incorporating dense indicators from a high-resolution feature map. This map is adept at preserving extensive information about small objects, facilitating adaptive region generation. Additionally, we enhance this framework by addressing class imbalance issues through the integration of object-level copy–paste techniques. This novel combination effectively tackles two distinct challenges, which are overlooked in existing solutions.
4. Experiments
4.1. Experimental Setup
Datasets. We evaluate AD-Det on two publicly available datasets for UAV image object detection: VisDrone [12] and UAVDT [58]. Details of the two datasets are summarized in Table 1.
VisDrone is a challenging large-scale dataset captured by various camera devices mounted on multiple UAVs. It contains data collected from several Chinese cities, covering different weather conditions and scenarios. Ten predefined categories, including pedestrian, car, van, etc., are manually annotated with bounding boxes. VisDrone provides 6471 images for training, 548 images for validation, and 3190 images for testing, with a maximal resolution of 2000 × 1500 pixels per image. Given that the testing set annotations are not publicly available, we follow ClusDet [45] and DMNet [20] in training the model on the training set and evaluating it on the validation set.
UAVDT is a large-scale vehicle detection and tracking dataset for UAV scenarios, released by UCAS. It contains data collected in complex environments with different weather conditions, occlusions, and flying heights. Bounding boxes are manually annotated for three predefined categories: car, truck, and bus. It contains 23,258 images for training and 15,069 images for testing, each with a resolution of 1080 × 540 pixels.
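Since both datasets are typically evaluated with COCO-style tooling, a format conversion step is usually required. Below is a hedged sketch for VisDrone-DET, assuming the commonly described one-txt-per-image annotation layout with comma-separated fields (left, top, width, height, score, category, truncation, occlusion); paths and the category mapping should be verified against the actual release.

```python
import csv, json, os
from PIL import Image

# Hypothetical VisDrone-DET -> COCO converter; category ids 1..10 are assumed
# to map to the ten annotated classes, with 0 (ignored regions) and 11
# (others) skipped.
CLASSES = ['pedestrian', 'people', 'bicycle', 'car', 'van', 'truck',
           'tricycle', 'awning-tricycle', 'bus', 'motor']

def visdrone_to_coco(img_dir, ann_dir, out_json):
    images, annotations = [], []
    ann_id = 1
    for img_id, name in enumerate(sorted(os.listdir(img_dir)), start=1):
        w, h = Image.open(os.path.join(img_dir, name)).size
        images.append(dict(id=img_id, file_name=name, width=w, height=h))
        txt = os.path.join(ann_dir, os.path.splitext(name)[0] + '.txt')
        with open(txt) as f:
            for row in csv.reader(f):
                left, top, bw, bh, _, cat = map(int, row[:6])
                if cat < 1 or cat > 10:       # skip ignored regions / others
                    continue
                annotations.append(dict(id=ann_id, image_id=img_id,
                                        category_id=cat,
                                        bbox=[left, top, bw, bh],
                                        area=bw * bh, iscrowd=0))
                ann_id += 1
    cats = [dict(id=i + 1, name=n) for i, n in enumerate(CLASSES)]
    with open(out_json, 'w') as f:
        json.dump(dict(images=images, annotations=annotations,
                       categories=cats), f)
```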
Evaluation metrics. Following the evaluation protocols of MS COCO [11], we employ average precision (AP) as our primary metric, spanning various categories and IoU thresholds. The metrics are outlined as follows:
AP: average precision calculated across all categories, considering IoU values within the range [0.5, 0.95] at intervals of 0.05.
AP_50, AP_75: average precision computed across all categories at IoU thresholds of 0.5 and 0.75, respectively.
AP_S, AP_M, AP_L: average precision calculated for small (area below 32 × 32 pixels), medium (from 32 × 32 to 96 × 96 pixels), and large (above 96 × 96 pixels) objects.
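All of these metrics can be reproduced with the standard pycocotools evaluator once the ground truth and detections are in COCO format; the file names below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names for COCO-format VisDrone annotations and results.
coco_gt = COCO('visdrone_val_coco.json')
coco_dt = coco_gt.loadRes('ad_det_results.json')

ev = COCOeval(coco_gt, coco_dt, iouType='bbox')
ev.evaluate()
ev.accumulate()
ev.summarize()   # prints AP, AP_50, AP_75, AP_S, AP_M, AP_L
```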
Implementation details. We implement AD-Det using the MMDetection toolbox [59] with PyTorch [60] as the basic framework. The model is trained on a system equipped with two Intel Silver 4210R CPUs and two NVIDIA A4000 GPUs. The baseline detection network is GFL [25] with four uniform cropping parts.
Training phase: For training on VisDrone and UAVDT, the input image sizes follow [61] for both the coarse and fine-grained detectors. Concerning hyperparameters in ASOE, the maximum subregion number K is set to 4 on VisDrone and 3 on UAVDT (as discussed in Section 4.4). In the DCC module, all categories except pedestrian, people, and car are considered tail classes on VisDrone, while truck and bus are considered tail classes on UAVDT. All models are trained for 12 epochs with an SGD optimizer and a linear warm-up strategy; the learning rate decays by a factor of 10 at epochs 8 and 11.
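For reference, this schedule corresponds to an MMDetection 2.x-style configuration fragment along the following lines; the exact learning rate, momentum, and weight decay are not stated above, so the values shown are common defaults rather than the paper's settings.

```python
# Illustrative MMDetection (2.x-style) schedule: SGD, linear warm-up,
# 12 epochs, learning rate decayed 10x at epochs 8 and 11.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
```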
In the testing phase, the input image size and ASOE hyperparameters remain consistent with the training phase unless otherwise specified. When fusing detections, the NMS threshold and maximum detection number are set to 0.5 and 500, respectively, across all datasets.
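The fusion step can be sketched as follows: subregion detections are shifted back to full-image coordinates and merged with the coarse detections under class-wise NMS. The function below is an illustrative sketch under those assumptions (it omits the rescaling needed when subregions are enlarged before fine-grained detection) and relies only on torchvision's standard nms operator.

```python
import torch
from torchvision.ops import nms

def fuse_detections(global_dets, region_dets, iou_thr=0.5, max_det=500):
    """Merge coarse (full-image) and fine-grained (subregion) detections.
    global_dets: (boxes[N,4], scores[N], labels[N]) in image coordinates.
    region_dets: list of ((x0, y0), (boxes, scores, labels)) per subregion,
                 with boxes in subregion-local coordinates."""
    boxes, scores, labels = [global_dets[0]], [global_dets[1]], [global_dets[2]]
    for (x0, y0), (b, s, l) in region_dets:
        b = b.clone()
        b[:, [0, 2]] += x0            # shift subregion boxes back to
        b[:, [1, 3]] += y0            # full-image coordinates
        boxes.append(b); scores.append(s); labels.append(l)
    boxes, scores, labels = map(torch.cat, (boxes, scores, labels))
    keep = []
    for c in labels.unique():         # class-wise NMS
        idx = (labels == c).nonzero(as_tuple=True)[0]
        keep.append(idx[nms(boxes[idx], scores[idx], iou_thr)])
    keep = torch.cat(keep)
    keep = keep[scores[keep].argsort(descending=True)][:max_det]
    return boxes[keep], scores[keep], labels[keep]
```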
4.2. Comparison with Representative Solutions
Results on VisDrone. We compare AD-Det with other SOTA solutions on VisDrone; the results on AP, AP_50, and AP_75 are listed in Table 2. Our baseline method employs GFL with ResNet-50/101 and ResNeXt-101 as backbones. From the table, we make the following observations.
(1) Our proposed method achieves consistent improvements over all compared solutions, reaching 37.5%, 60.9%, and 39.2% in terms of AP, AP_50, and AP_75. Among the compared solutions, ClusDet, DMNet, CDMNet, GLSAN, CZDet, AMRNet, and YOLC employ a coarse-to-fine strategy similar to ours. Even compared to CZDet, the best among them, our model improves performance by 3.1%, 1.2%, and 4.6% in AP, AP_50, and AP_75, respectively.
(2) Despite using weaker backbones, our method surpasses or rivals all compared methods. As illustrated at the bottom of Table 2, even with the relatively lightweight ResNet-50 backbone, our model remains highly competitive, exhibiting substantial AP improvements over methods with stronger backbones. For example, compared to YOLC with a ResNeXt-101 backbone, our method with ResNet-50 still improves performance by 2.2%, 1.4%, and 3.4% in AP, AP_50, and AP_75, respectively.
To further evaluate the accuracy and computational complexity of AD-Det, we compare it with existing solutions in terms of AP, parameter counts (Params), floating-point operations (FLOPs), and inference speed (s/img), as shown in Table 3. Under the same ResNeXt-101 backbone, AD-Det demonstrates superior AP, Params, FLOPs, and inference speed compared with ClusDet, DMNet, and GLSAN. Although our method exhibits marginally higher FLOPs and slightly slower inference than CZDet and YOLC, it achieves higher detection accuracy with significantly fewer Params. These results highlight that our approach delivers competitive performance with a better accuracy–speed trade-off, fulfilling practical engineering requirements for real-world deployment.
Results on UAVDT. A similar pattern emerges on UAVDT; the detection accuracy compared with SOTA methods on the UAVDT testing set is shown in Table 4. For comparison, we use GFL with a ResNet-50 backbone as our baseline. Even with this relatively weak backbone, our method attains a new SOTA performance, achieving 20.1% AP, 34.2% AP_50, and 21.9% AP_75. Compared with GLSAN, our model improves by 3.1%, 6.1%, and 3.1% in AP, AP_50, and AP_75, respectively.
4.3. Qualitative Analysis
To qualitatively contrast performance, Figure 5 depicts the detection results of the baseline detector and our AD-Det. Our method evidently surpasses the baseline detector and reaches satisfactory detection accuracy. Specifically, as shown in the dashed areas of Figure 5 and the corresponding zoom-in views, our approach excels at detecting small objects and tail categories. This superiority can be attributed to ASOE and DCC, which collectively enhance UAV image object detection from two distinct perspectives.
4.4. Ablation Study
We conduct extensive studies on VisDrone to validate the effectiveness of our key design, including (1) the effectiveness of ASOE and DCC modules, (2) the choice of key hyperparameters, (3) the selection of the base detector, (4) the design of the ASOE module, and (5) the design of the DCC module. For all ablation studies, we employ ResNet-50 as our backbone.
The effectiveness of ASOE and DCC modules: As the key components of our method, ASOE and DCC are incorporated into the model step by step to evaluate their respective efficacies. We also demonstrate the improvements resulting from integrating ASOE and DCC together, emphasizing their complementarity. A comprehensive presentation of this ablation study is provided in Table 5 and Figure 6, which illustrate the accuracy improvements across different metrics after integrating our key components.
As illustrated in Table 5, incorporating ASOE yields a notable improvement of 2.2% in AP and 3.0% in AP_S over the baseline. Similarly, DCC elevates detection performance from 35.3% to 35.9% in AP. Compared with the baseline under identical parameter settings, our method achieves a notable increase in accuracy with only a slight increase in time consumption.
As illustrated in Figure 6, results for tail-category objects are consistently enhanced, especially for bicycle, tricycle, awning-tricycle, and bus, which improve by 0.9%∼1.9%, underscoring the ability of DCC to reliably augment detection performance. Notably, the complementary samples added by DCC do not adversely affect the performance of other categories.
The choice of key hyperparameters: We analyze the impact of two key hyperparameters: the number of subregions K (Figure 7a) and the confidence threshold τ (Figure 7b).
The number of subregions K is a pivotal parameter that significantly influences detection speed and accuracy. The effects of different values of K on detection time and accuracy are summarized in Figure 7a. As the number of subregions increases, detection accuracy improves, but with an associated increase in detection time. Considering the trade-off between accuracy and efficiency, we select K = 4 as the optimal number of subregions on VisDrone. A similar experiment on UAVDT shows that K = 3 is the optimal choice there.
The confidence threshold τ determines the quality of the subregions (Equation (2)). We experiment with τ values in {0.1, 0.3, 0.5, 0.7}, quantifying the number of sample points and the corresponding detection accuracy. As depicted in Figure 7b, as τ increases, the number of sample points decreases, proportionally accelerating clustering. However, the accuracy benefits are not linear in τ: at τ = 0.7, an excessive number of points are disregarded, leading to a significant decline in detection accuracy. We therefore select τ = 0.5 as the threshold parameter.
The selection of a base detector: In Table 6, we compare GFL with several leading base detectors, among which GFL achieves superior performance. As indicated by AP_S, both GFL and CenterNet are effective for small object detection, but GFL exhibits more balanced accuracy across object scales. Compared with FCOS and YOLOv8, GFL outperforms both in AP and AP_S. We therefore choose GFL as the base detector of AD-Det.
The design of the ASOE module: In ASOE, we empirically select the lowest layer of the feature maps, namely P3, as our primary focus [56]. In Figure 8, we visualize the regions of interest selected by ASOE using different feature maps. The visualization shows that the P3 layer focuses more on small object regions, while higher layers such as P4 and P5 gradually focus on larger objects. To substantiate this choice, we conduct a straightforward comparison with the higher adjacent layer P4 as an alternative focus and summarize the results in Table 7. Experimental results indicate that the P3 layer yields the highest performance, while clustering from higher-level layers such as P4 leads to performance degradation. Note that AP and AP_S drop sharply from P3 to P4, highlighting our approach's effectiveness in detecting small objects within higher-resolution layers.
The design of the DCC module: DCC comprises two major components: diversity augmentation (DA) and dynamic search (DS). Building on ASOE, we verify these two components step by step. As shown in Table 8, applying DA to tail-class instances increases AP_S and AP_M by 0.4% and 0.7%, respectively. Further combining it with DS raises AP to 35.9%, which proves its effectiveness.
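To illustrate how DA and DS interact, the following is a hypothetical sketch of a DCC-style memory bank: tail-class crops are stored up to a fixed capacity, augmented (here, a simple horizontal flip stands in for diversity augmentation), and pasted at jittered positions around ASOE cluster centers. The capacity and jitter values are illustrative, not the paper's settings.

```python
import random
import numpy as np

class DCCMemoryBank:
    """Sketch of dynamic class-balanced copy-paste: a bounded pool of
    tail-class crops, pasted near ASOE cluster centers during training."""

    def __init__(self, tail_classes, capacity=200):
        self.bank = {c: [] for c in tail_classes}
        self.capacity = capacity

    def update(self, cls, crop):
        pool = self.bank[cls]
        pool.append(crop)
        if len(pool) > self.capacity:
            pool.pop(0)                      # FIFO eviction keeps the bank fresh

    def paste(self, image, centers, jitter=64):
        """image: (H, W, 3) array; centers: list of (cx, cy) ASOE centers."""
        for cls, pool in self.bank.items():
            if not pool or not centers:
                continue
            crop = random.choice(pool)
            if random.random() < 0.5:        # stand-in for diversity augmentation
                crop = np.fliplr(crop)
            cx, cy = random.choice(centers)  # dynamic search near a cluster center
            x = int(cx + random.uniform(-jitter, jitter))
            y = int(cy + random.uniform(-jitter, jitter))
            h, w = crop.shape[:2]
            H, W = image.shape[:2]
            x = int(np.clip(x, 0, W - w))
            y = int(np.clip(y, 0, H - h))
            image[y:y + h, x:x + w] = crop   # paste; labels updated accordingly
        return image
```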
5. Conclusions
In this paper, we propose AD-Det, a novel object detection framework for UAV images, which addresses the challenges of scale variations and class imbalance. AD-Det consists of two key components: adaptive small object enhancement (ASOE) and dynamic class-balanced copy–paste (DCC). ASOE employs a high-resolution feature map to cluster regions containing small objects, followed by processing them for fine-grained detection. DCC performs reasonable object resampling by dynamically pasting tail classes around clustering centers obtained by ASOE. We conduct extensive experiments on VisDrone and UAVDT and demonstrate that our approach significantly outperforms existing competitive alternatives.
Although our proposed approach achieves satisfactory results in UAV image object detection, several limitations remain to be addressed in our future endeavors. (1) ASOE achieves precise detection of small objects by leveraging global location cues, but the usage of global features is still limited. ASOE ignores the feature interaction between the global image and the local regions, which may help to better understand the scene’s semantic knowledge. (2) DCC conducts uniform sampling copy–paste, but there is room for improvement by considering instance-specific metrics and the complex relationships between classes for more effective hard example mining. In the future, we plan to explore more effective ways of handling UAV images under complex scenarios, including occlusion and background clutter issues.