4.4.1. Ablation Experiments
In this section, we conduct ablation experiments to verify the rationality and effectiveness of the proposed modules when combined with the ResNet-101 baseline. The DOTA-V2.0 dataset provides the data foundation for these experiments. To ensure the rigor of the experimental procedure, we analyze the results from two perspectives:
(1) Quantitative Analysis. We use AP_S, AP_M, and AP_L to denote the average detection accuracy for small, medium, and large objects in the dataset, respectively. In addition, AP_50 denotes the average accuracy over all object categories at an IoU threshold of 0.5, and AP_50:95 denotes the average accuracy over all categories when the IoU threshold ranges from 0.5 to 0.95 in steps of 0.05. Among these, AP_50:95 is the strictest and therefore the most informative metric for assessing how well the proposed modules combine with the baseline framework.
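As a point of reference, the sketch below shows how such COCO-style metrics are typically obtained with the standard pycocotools evaluator; the annotation and result file names are placeholders, and this is not the paper's own evaluation code.

```python
# Minimal sketch of computing COCO-style AP metrics with pycocotools;
# file paths are hypothetical placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("dota_v2_val_coco.json")               # ground-truth annotations (assumed path)
coco_dt = coco_gt.loadRes("ursnet_detections.json")   # detections in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# evaluator.stats follows the fixed COCO ordering:
# [0] AP@[0.50:0.95], [1] AP@0.50, [3] AP_S, [4] AP_M, [5] AP_L
ap_50_95, ap_50 = evaluator.stats[0], evaluator.stats[1]
ap_s, ap_m, ap_l = evaluator.stats[3], evaluator.stats[4], evaluator.stats[5]
print(f"AP_50={ap_50:.4f}  AP_50:95={ap_50_95:.4f}  "
      f"AP_S={ap_s:.4f}  AP_M={ap_m:.4f}  AP_L={ap_l:.4f}")
```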
Table 4 reports the detection results of integrating BMSFPN, FPM, ARAM, and DAOM with the baseline framework, with the best results highlighted in bold. Specifically, integrating the FPM, ARAM, and DAOM brings significant improvements on the AP metrics, exceeding the worst results by 9.18% and 8.54% on two of them and by a further 12.88% and 9.89% on two others. These gains are attributed to the decoupling and refinement of classification and regression features achieved by these three modules, which simplifies the model's handling of remotely sensed objects with critical features and variable orientations. The combination of BMSFPN, FPM, ARAM, and DAOM yields the best overall performance, with AP_S, AP_M, and AP_L reaching 65.29%, 85.30%, and 88.17%, respectively, and AP_50 and AP_50:95 reaching 75.03% and 66.93%, respectively. This demonstrates that the four modules designed in this paper cooperate effectively with the baseline framework to comprehensively address issues such as noise and rotational distribution in drone RS image object detection.
To further validate the efficiency and superiority of each module combination with the baseline framework in detecting objects of various scales, we analyze the precision and recall of the combinations listed in Table 4 and compare them using P-R curves. The combinations in Table 4 are numbered sequentially from I to VI, e.g., I (Baseline + BMSFPN), II (Baseline + FPM), and so on.
The P-R curves are presented in Figure 11. As can be seen, VI (the proposed URSNet) occupies the best position for every object size. For large and medium objects, the precision of VI is slightly higher than that of the second-ranked V, and both significantly outperform the third-ranked IV. This indicates that, after image filtering, decoupling of key features, and anchor box refinement, URSNet maintains consistently high precision. For small objects, VI shows the most pronounced advantage, indicating that URSNet's classification and localization of small objects are strengthened by BMSFPN highlighting object detail textures and by the FPM rotating and optimizing the anchor boxes. Furthermore, the distribution of the P-R curves for all object sizes is consistent with the data in Table 4, comprehensively reflecting the distinct contribution and indispensability of each module and confirming the superiority of the proposed URSNet.
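For clarity, the sketch below illustrates how a P-R curve of this kind can be drawn from ranked detections; the `scores`, `is_tp`, and `num_gt` inputs are illustrative toy values and are not taken from Table 4 or Figure 11.

```python
# Illustrative sketch: precision/recall at every confidence threshold,
# sorted from the highest-scoring detection downwards.
import numpy as np
import matplotlib.pyplot as plt

def pr_curve(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)                 # cumulative true positives
    fp_cum = np.cumsum(1.0 - tp)           # cumulative false positives
    precision = tp_cum / (tp_cum + fp_cum)
    recall = tp_cum / max(num_gt, 1)
    return recall, precision

# Toy example: 6 detections, 5 ground-truth objects.
recall, precision = pr_curve(
    scores=[0.95, 0.90, 0.80, 0.70, 0.60, 0.40],
    is_tp=[1, 1, 0, 1, 0, 1],
    num_gt=5,
)
plt.plot(recall, precision, label="example combination")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()
```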
Based on the above analysis, we now validate experimentally the rationale for selecting the ResNet-101 [41] backbone network, again using the DOTA-V2.0 dataset. Following the models adopted by most researchers [53,54,55,56,57,58], we select several advanced network architectures for comparison, namely ResNet-50 [59], VGG-16 [60], LSKNet [61], Swin-Trans [44], and DLA-34 [62].
Figure 12 illustrates the P-R curves for these six backbone networks. As can be seen, ResNet-101 exhibits superior precision and recall compared with the other networks. In the low-recall region, its precision is markedly higher than that of the other networks, indicating excellent performance in handling complex backgrounds and distinguishing similar targets. In the high-recall region, ResNet-101 still maintains high precision, demonstrating strong generalization ability and robustness to noise and interference. This performance analysis therefore supports the use of ResNet-101 as the backbone network.
Additionally, Table 5 presents the evaluation results for the candidate backbone frameworks. The ResNet-101 baseline selected in this paper exhibits the best overall performance, achieving an F1-Score of 85.72 and a top-1 accuracy of 82.95%, ranking first with an 8.85% margin over the top-1 accuracy of DLA-34. This further confirms that selecting ResNet-101 as the baseline framework for our model is reasonable.
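As a brief reminder of how the F1-Score reported in Table 5 relates to precision and recall, a small worked example follows; the precision and recall values used here are illustrative only and are not the paper's measurements.

```python
# F1-Score is the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2.0 * precision * recall / (precision + recall)

# Illustrative values: precision 0.87 and recall 0.845 give an F1 of about 0.857.
print(f1_score(precision=0.87, recall=0.845))
```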
(2) Qualitative Analysis. To present the ablation results more intuitively, we visualize them in Figure 13. Specifically, the three top-ranked methods in Table 4 (IV, V, and VI) are applied to five challenging scenarios from drone RS images.
Observing the output results, the heatmaps of our method (VI) cover the largest number of object regions and maintain relatively accurate capture even in complex, noisy images. Additionally, for difficult objects with large scale differences, dense distributions, and rotational characteristics, VI utilizes the FPM, ARAM, and DAOM to enhance the expression of object boundary features and the anchor regression capability, extracting more feature information than V and IV and producing more accurate detections. These results demonstrate that combining the four proposed modules with ResNet-101 is both reasonable and efficient.
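Region heatmaps of this kind can, in principle, be produced with a Grad-CAM-style visualization; the sketch below is one such approach, and the model, target layer, and classification-style output it assumes are illustrative rather than URSNet's actual interface.

```python
# Grad-CAM-style heatmap sketch (assumed setup, not the paper's code).
import torch
import torch.nn.functional as F

def gradcam_heatmap(model, image, target_layer, class_idx):
    """Return an H x W heatmap in [0, 1] for `class_idx`; `image` is a (C, H, W) tensor."""
    feats, grads = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]     # assumes (B, num_classes) logits
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)   # channel-wise importance weights
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-6)).squeeze().detach()
```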
4.4.2. Comparison Experiments with SOTA Methods
To fully demonstrate the unique advantages of the proposed URSNet in the task of object detection in drone RS images, it is necessary to conduct a comprehensive performance comparison experiment with similar SOTA algorithms. Therefore, based on the stage division of deep learning detection algorithms, we selected over twenty of the most advanced and classic object detection models for drone RS images from three categories: anchor-free, single-stage, and two-stage algorithms. These include DRN [64], O²-DNet [65], AOPG [66], CenterMap [67], S²ANet [68], and AO²-DETR [69], among others.
This experiment was conducted on four RS datasets: DOTA-V2.0, RSOD, DIOR, and UCAS-AOD. The results for each dataset are discussed below:
(1) The detection results for the DOTA-V2.0 dataset are presented in Table 6. The data in the table are calculated and evaluated strictly according to the AP_50 and AP_50:95 standards of MS COCO [70]. For ease of presentation, the object category names in the dataset are abbreviated: swimming pool (SWP), helicopter (HC), bridge (BE), large vehicle (LVE), ship (SP), plane (PE), soccer ball field (SBF), basketball court (BC), airport (AT), container crane (CCE), ground track field (GTF), small vehicle (SV), harbor (HB), baseball diamond (BDD), tennis court (TCT), roundabout (RT), storage tank (ST), and helipad (HD). From the table, our model URSNet achieves the highest overall score (84.03%), 2.75 percentage points higher than the second-ranked LSKNet-S, indicating that URSNet has the best overall detection performance across the object categories in DOTA-V2.0. In terms of per-category performance, SGR-Net, built on the advanced Swin-Trans architecture, achieves the highest result for the small object HB (72.04%) but falls behind URSNet on SP, ST, and HC, suggesting that URSNet remains competitive in small object detection. For elongated objects such as SWP and BE, URSNet has a unique advantage owing to the carefully designed spatial attention convolution kernel in the FPM. For medium and large objects with pronounced scale and edge features, such as BC, TCT, and GTF, both URSNet and AO²-DETR achieve over 85%, with URSNet outperforming AO²-DETR by 0.71%, 2.16%, and 0.40%, respectively.
In addition, based on the data in Table 6, we will further analyze the object detection performance of the different SOTA models in challenging scenarios in the subsequent results visualization section.
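For reference, the category abbreviations introduced above can be kept in a simple lookup table, for example:

```python
# Category-name abbreviations used in Table 6 for the DOTA-V2.0 classes.
DOTA_V2_ABBREVIATIONS = {
    "swimming pool": "SWP", "helicopter": "HC", "bridge": "BE",
    "large vehicle": "LVE", "ship": "SP", "plane": "PE",
    "soccer ball field": "SBF", "basketball court": "BC", "airport": "AT",
    "container crane": "CCE", "ground track field": "GTF", "small vehicle": "SV",
    "harbor": "HB", "baseball diamond": "BDD", "tennis court": "TCT",
    "roundabout": "RT", "storage tank": "ST", "helipad": "HD",
}
```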
(2) The detection results for the RSOD dataset are presented in Table 7. The proposed URSNet achieves the best scores on both evaluation metrics for the four object categories in RSOD, surpassing the powerful YOLOv7 [43] and Vision-Trans [82]. In particular, the Anchor Rotation Alignment Module (ARAM) and Feature Polarization Module (FPM) in URSNet are highly effective for aircraft with multi-directional distributions and for overpasses with large differences in length and width.
(3) The evaluation results for the DIOR dataset are presented in Table 8. Our method URSNet achieves a detection score of 85.13% at a processing speed of 108.20 FPS, indicating superior performance in both detection accuracy and image processing speed, which we attribute to URSNet's modular architecture design. It is worth noting, however, that URSNet is not the best in terms of model parameters (Params) and computational complexity (FLOPs), suggesting that further model lightweighting and better hardware resource allocation are needed in future work.
(4) The performance evaluation results for the UCAS-AOD dataset are presented in Table 9. UCAS-AOD contains only two object categories, cars and planes, but both exhibit arbitrary orientations, small scales, and dense distributions, making them sufficiently complex targets for further validating the proposed method. As the table shows, most SOTA models detect planes less accurately than cars. URSNet outperforms the second-ranked YOLOv7 by 4.05%, 0.15%, and 1.77% on the reported metrics for the two targets. This suggests that for cars with regular edges, most models can effectively extract the key features needed for classification and localization, whereas for planes with complex boundary information, most models lack efficient key-feature extraction and localization refinement, resulting in lower detection accuracy. In contrast, URSNet maintains high accuracy thanks to the design of BMSFPN and DAOM.
(5) Visualization of experimental results.
Figure 14 illustrates the visualized detection results of our proposed URSNet on the large-scale RS dataset DOTA-V2.0. As can be seen, targets with multi-directional variation, such as planes, ships, large vehicles, and small vehicles, are all detected with high accuracy. This demonstrates that the designed ARAM and DAOM fully exploit the key features of such targets for anchor rotation refinement and further improve URSNet's classification and localization of these targets through label assignment.
Furthermore, for medium- and large-scale objects such as baseball diamonds, tennis courts, and ground track fields, URSNet maintains a superior detection level, with accuracy above 70%. For densely distributed small objects such as storage tanks and harbors, URSNet ensures accuracy above 60%. This is attributed to the powerful feature extraction and detail representation capabilities of the designed BMSFPN and FPM.
Figure 15 shows the visualization results of URSNet on the DIOR dataset. Since DIOR contains 20 different object categories, it provides a thorough test of URSNet's generalization and robustness. Elongated objects such as bridges, airports, and swimming pools have widely varying aspect ratios, which makes it difficult for conventional models to extract effective feature information; the unique design of the spatial attention convolution kernel in the proposed FPM effectively overcomes this issue. As can be seen in the figure, URSNet performs excellently on such objects.
Furthermore, for targets such as golf courses with blurred backgrounds, small-scale chimneys, multi-scale ships, and planes, URSNet effectively reduces the misalignment between predicted and actual bounding boxes by smoothing out redundant background details and dynamically optimizing the target anchor boxes. This enhancement in both classification and localization capabilities results in generally impressive detection performance.
Figure 16 shows the detection results of several advanced SOTA models on the RSOD dataset. Based on Table 7, we visualize the detections of our proposed URSNet, along with Vision-Trans, YOLOv7, and RoI-Trans, for four object categories: aircraft, oil tank, overpass, and playground. URSNet more accurately detects the oil tanks, which are large objects that appear at a small scale in the imagery, and the overpasses, which are set against variable backgrounds, while aircraft and playgrounds are also detected well. This efficient performance further validates URSNet's advantages and reliability.
Figure 17 shows the detection results of URSNet, YOLOv7, R³Det-DCL, and Oriented Rep. on the UCAS-AOD dataset. As can be seen, for densely arranged cars and planes against similar backgrounds, our proposed method detects all targets with relatively high accuracy, while the other three models exhibit varying degrees of missed detections and poorer accuracy. This illustrates that the advanced FPM architecture designed for URSNet effectively highlights the key features of the targets, which are then precisely captured and optimized by the ARAM and DAOM. This efficient performance further validates URSNet's reliability and applicability when handling small targets in complex scenarios.
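Finally, oriented detection boxes like those visualized in Figures 14–17 can be rendered with OpenCV along the lines of the sketch below; the (cx, cy, w, h, angle) box format, the score threshold, and the input names are assumptions for illustration rather than URSNet's actual output interface.

```python
# Minimal sketch of drawing oriented (rotated) boxes on an image with OpenCV.
import cv2
import numpy as np

def draw_oriented_boxes(image, boxes, scores, labels, score_thr=0.3):
    """boxes: iterable of (cx, cy, w, h, angle_deg); draws on `image` in place."""
    for (cx, cy, w, h, angle), score, label in zip(boxes, scores, labels):
        if score < score_thr:
            continue
        # Convert the rotated rectangle to its four corner points.
        corners = cv2.boxPoints(((cx, cy), (w, h), angle)).astype(np.int32)
        cv2.polylines(image, [corners], isClosed=True, color=(0, 255, 0), thickness=2)
        cv2.putText(image, f"{label} {score:.2f}", (int(cx), int(cy)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image
```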