Article

AG-Yolo: Attention-Guided Yolo for Efficient Remote Sensing Oriented Object Detection

1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1027; https://doi.org/10.3390/rs17061027
Submission received: 17 February 2025 / Revised: 12 March 2025 / Accepted: 13 March 2025 / Published: 15 March 2025

Abstract:
Remote sensing can acquire information efficiently and is widely used in many areas, with object detection being a key component of most applications. However, the complex backgrounds in remote sensing images severely degrade detection performance, and current methods cannot effectively suppress background interference while maintaining fast detection speeds. This paper proposes Attention-Guided Yolo (AG-Yolo), an efficient oriented object detection (OOD) method tailored for remote sensing. AG-Yolo incorporates an additional rotation parameter into the head of Yolo-v10 and extends its dual label assignment strategy to maintain high efficiency in OOD. An attention branch is further introduced to generate attention maps from shallow input features, guiding feature aggregation to focus on foreground objects and suppress complex background interference. Additionally, a three-stage curriculum learning strategy, derived from background complexity, is designed to train the model starting from much easier samples generated from the labeled data. This gives the model a better starting point, improving its ability to handle complicated datasets and increasing detection precision. On the DOTA-v1.0 and DOTA-v1.5 datasets, compared with other advanced methods, our algorithm reduces the processing latency from 33.8 ms to 19.7 ms (a roughly 40% decrease) while also yielding a measurable improvement in the mAP metric.


1. Introduction

Remote sensing is a technology that can obtain a large amount of information from a long distance without direct contact. This ability to acquire information efficiently makes it widely used in many areas, such as smart agriculture [1], transportation planning [2], intelligence reconnaissance [3], and disaster prevention [4]. Remote sensing technology is highly valued for its ability to provide extensive, high-resolution information on the Earth’s surface. Yet, the value of this information depends heavily on our ability to extract useful knowledge from it. This requires object detection (OD), which automatically identifies and localizes targets of interest in remote sensing images. Once a large number of images has been acquired for an application, the locations of the relevant objects usually need to be determined before any further processing.
Traditional OD methods typically slide windows across the whole image and compare each window against predefined handcrafted features. These features are hard to design, and the procedure is slow. In recent years, deep learning methods have achieved excellent performance. They can be roughly divided into two categories. Two-stage methods first generate region proposals and then classify the targets and refine the bounding boxes. Representative methods include R-CNN [5], Fast R-CNN [6], Faster R-CNN [7], Mask R-CNN [8], and Cascade R-CNN [9]. The proposed regions can eliminate some background disturbance, which benefits detection precision, but they add computation and slow down detection. Single-stage methods omit the region proposal step and use a single unified network to perform feature extraction, target classification, and bounding-box regression together, achieving end-to-end learning. The most famous single-stage family is Yolo. Yolo-v1 [10] incorporated many advances from the field of deep learning into its network, and although its initial design was not conducive to precision, the Yolo family, aided by theoretical breakthroughs of its own, has achieved the most comprehensive overall performance and become the most popular object detection approach. The recently proposed Yolo-v10 [11] further advanced the performance–efficiency frontier of YOLOs through both post-processing and model architecture. Its consistent dual assignments bring competitive performance and low inference latency simultaneously, while its holistic efficiency–accuracy-driven model design introduces a lightweight classification head, spatial–channel decoupled down-sampling, large-kernel convolution, and partial self-attention, leading to comprehensive performance improvements.
However, these methods cannot easily achieve good performance in remote sensing scenarios. Because of the long distance between the sensor and the objects, images captured by a remote sensing camera are usually very large, and the targets appear with large differences in size against much more complex backgrounds [12]. To increase detection accuracy, many works have incorporated direction information into remote sensing object detection. These strategies can be broadly divided into two categories: oriented bounding-box (OBB) regression and segmentation methods. Segmentation is the most precise way to depict an arbitrary target, but it needs many more annotations and a more complex model, which severely hinders detection speed. OBB regression is a relatively efficient approach. To date, there have mainly been two ways to express an OBB, viz. a five-dimensional vector or an eight-dimensional vector. Although the eight-dimensional vector can describe more types of OBBs, it has more parameters to predict and is more susceptible to the ambiguity problem [13]. Yolo-v8 [14] incorporates an additional rotation angle to realize OBB detection and adopts an anchor-free split head, which contributes to better accuracy and a more efficient detection process compared to anchor-based approaches.
Despite the advantages of each method, certain issues remain. The structure of Yolo-v8 is somewhat redundant and less effective, and its complex post-processing also hinders speed. Yolo-v10 [11] is an efficient model, but it is designed for general detection and cannot output OBBs directly. Extending the head of Yolo-v10 can produce a fast OOD algorithm, but outputting an OBB alone is not enough: complex backgrounds can still severely hinder detection.
To tackle the issues mentioned above, this paper proposes an Attention-Guided Yolo (AG-Yolo). Inspired by the attention map-assisted inference block introduced in [15], an attention branch is incorporated with Yolo-v10. Its head is also refined to adapt to OBB detection. In addition, a curriculum learning strategy is established to simulate the human learning process to promote accuracy by generating some much easier samples. Extensive experiments on DOTA-v1.0 and DOTA-v1.5 are implemented to verify our advantages in addressing complex background challenges in remote sensing images both subjectively and objectively.
The main contributions of this research can be summarized as follows:
  • An oriented object detection (OOD) method is proposed, namely AG-Yolo. An additional rotation parameter is supplemented to the head of Yolo-v10 to achieve efficient OOD. The dual label assignment strategy is also extended for OOD, which can bring both strong supervision during training and high efficiency during inference. This improves the precision–latency balance.
  • In order to deal with the complex background interference in remote sensing images, an attention branch is constructed and paralleled with the backbone. It generates attention maps from the shallow features of inputs, mimicking the human recognition process, to guide the feature aggregation of the neck to focus on foregrounds. The attention branch enables the model to concentrate on the objects themselves, thereby facilitating precision.
  • When targeting complex background issues, a three-stage curriculum learning strategy is designed to further improve detection performance. Derived from the background complexity, some much easier samples are generated from the labeled dataset. Training from these samples step by step provides the model with a better starting point to handle complicated information, which is conducive to precision.
The remainder of this paper is organized as follows: Related work on remote sensing object detection is reviewed in Section 2. Section 3 introduces the details of our method. Extensive experiments on standard datasets are presented in Section 4 to evaluate the detection performance. Finally, the discussion and conclusions are given in the last two sections.

2. Related Work

2.1. Oriented Bounding-Box Detection

Similar to horizontal bounding-box (HBB) detection, most OBB detection methods contain four steps. First, features are extracted from the input image with a backbone. Then, these features are aggregated by a neck. Third, heads perform classification and bounding-box regression. Finally, post-processing, usually a non-maximum-suppression operation, filters out bounding boxes with high overlap. The key difference between HBB detection and OBB detection is the expression of the bounding box. Two representations are mainly used to express an OBB, viz. a five-dimensional vector or an eight-dimensional vector. The former contains the four parameters of a regular HBB and an additional argument, usually the rotation angle. R3Det [16] designed a feature refinement module and an approximate SkewIoU loss to construct a refined single-stage detector. In SCRDet [17], a sampling fusion network is devised by combining multilayer features with effective anchor sampling, and an IoU constant factor is added to the smooth L1 loss to address the boundary problem of rotating bounding boxes. S2ANet [18] proposed a network consisting of a feature alignment module and an oriented detection module to alleviate the inconsistency between the classification score and the localization accuracy. To deal with the boundary discontinuity problem, a phase-shifting angle coder is proposed in [19]. The eight-dimensional vector consists of the coordinates of the OBB’s four vertices. Gliding Vertex [20] regresses four length ratios that characterize the relative gliding offset on each corresponding side. GGHL [21] proposed an OBB representation component that directly represents the OBB, in an anchor-free manner, using the horizontal and vertical components of the distances from each Gaussian candidate position to the four vertices of the OBB.
To identify an OBB, the easiest and most efficient way is to add an additional parameter to the four-parameter HBB. In most cases, the additional parameter is a rotation angle, and the whole annotation of a bounding box becomes (x, y, w, h, θ). Many specific definitions for this five-dimensional vector have been proposed [22], and keeping the format consistent during processing is important. The θ used here is shown in Figure 1: it is the positive angle (clockwise rotation) between the width of the rectangle and the positive semi-axis of x, and its value is limited to (0, π/2].
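As a concrete illustration of this convention, the following is a minimal sketch, not the authors' implementation, of how an arbitrary (x, y, w, h, θ) annotation (θ in radians) could be normalized so that θ falls in (0, π/2], swapping width and height when necessary. The function name and interface are hypothetical.

```python
import math

def normalize_obb(x, y, w, h, theta):
    """Map an arbitrary (x, y, w, h, theta) box, with theta measured clockwise
    from the positive x-axis to the box's width edge, onto the convention used
    here: theta in (0, pi/2], exchanging w and h when needed.
    Illustrative helper only, not the authors' exact implementation."""
    # A rectangle is unchanged by a rotation of pi, so wrap theta into [0, pi).
    theta = theta % math.pi
    if theta == 0.0:
        theta = math.pi  # treat 0 as pi so the interval stays half-open
    # If theta exceeds pi/2, describe the same box with width and height
    # exchanged and the angle reduced by pi/2.
    if theta > math.pi / 2:
        w, h = h, w
        theta -= math.pi / 2
    return x, y, w, h, theta
```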

2.2. Attention Mechanism

The attention mechanism acts like a resource scheduler within human vision. By allocating resources to prioritize the handling of critical targets, it can improve the efficiency of information acquisition. Inspired by neurons of the primate visual system, the attention mechanism was applied to computer vision by Itti et al. [23] as early as 1998. They merged multi-scale feature maps and selected focal regions in order of saliency to realize rapid scene analysis. Concise theories and excellent results have enticed more and more researchers to study attention mechanisms.
In a groundbreaking study integrating attention mechanisms into neural networks, RAM [24] introduced dynamic positioning and information selection in 2014. STN [25] further built a spatial transformer module to actively spatially transform feature maps. In addition, SENet [26] concentrated on channel relationships and proposed a squeeze-and-excitation block, integrating attention mechanisms into the image processing pipeline in a different way. After that, CBAM [27] constructed a module containing both spatial and channel attention mechanisms, but this leads to significant computational overhead for complex models. To balance speed and precision, ECA [28] proposed a local cross-channel interaction strategy without dimensionality reduction, which is more efficient than SENet. CA [29] embeds positional information in channel attention to capture long-range dependencies without introducing excessive computational complexity. These methods are mostly plug-and-play modules. Although they can improve performance, their attention weights are generated only from adjacent features and without strong supervision, which may limit performance.
Zhang et al. [15] propose the generation of a mask to mark background and foreground information in a segmentation branch. With the mask, the disturbance in complex environments can be filtered out and the detector can focus on the features of the targets. They use a dedicated mask to implement the spatial attention mechanism. And a specialized reference is generated from the annotation to constrain the training of the segmentation branch. These enable the detection model to distinguish between foreground and background information more accurately. Consequently, localization and classification become easier. But the mask only takes effect at the final stage. An earlier application may produce a better result.

2.3. Curriculum Learning

Curriculum learning is a model training strategy that trains machine learning models in a meaningful order, mainly from easy samples to hard ones [30]. It can encourage performance improvements without any additional computational costs during inference. Curriculum learning has been successfully deployed in many areas of machine learning, such as natural language processing [31,32], image classification [33,34], and image generation [35,36]. In the field of object detection, most related works apply curriculum learning in weakly or semi-supervised tasks. Their training pipelines are usually established by building a new difficulty measurer or a new training scheduler. Wang et al. [37] trained an SVM on a subset of fully annotated data and measured the mean average precision per image to determine the easiness of a sample. The difficulty measurer used by Feng et al. [38] is the summation of location deviation, scale variation, and pixel inconsistency. Zhang et al. [39] built an IoU–confidence–standard-deviation difficulty measurer and a batch-package training scheduler to improve detection performance. Although each design has its own advantages, their final training data are still the original dataset, with just the order of each sample changing. What if the simplest sample in the dataset is still difficult for a model to learn? On the basis of this consideration, some much easier samples are generated in our work.

3. Methods

The framework of Attention-Guided Yolo (AG-Yolo) is shown in Figure 2. Based on the structure of Yolo-v10m, an additional attention branch is added to generate an attention map for each input. Under the guidance of the attention map, important features will be emphasized during the aggregation stage. To realize OBB regression, an angle head is supplemented. And the dual label assignment design is retained by analyzing the HBB, the angle, and the classification as a whole. After learning in a predefined curriculum pipeline, AG-Yolo can achieve better precision. Each key component, including the curriculum training strategy, is detailed below.

3.1. Attention Branch

The attention mechanism is derived from human vision. It can improve the efficiency of information processing by allocating more resources to key targets. Many works [26,27,28,29] have been proposed to realize the same effect in computer vision. Most of them are plug-and-play modules. Although they are effective and easy to use, their attention weights are mostly generated from some adjacent features and without strong supervision, which may limit performance. Inspired by the foreground mask implemented in [15], a specialized branch is built to generate an attention map in a manner closer to the human recognition process.
When a person identifies a pet in a store, they can tell where the animal is and where the cage is at a glance. They may then pay more attention to the animal and look for key features to determine whether it is a cat or a dog. Because identifying the region of interest (ROI) takes only a glance, the attention map should be generated from shallow features, while the deep features within the ROI are what mainly determine the final result. Following this consideration, an attention branch is constructed to extract ROIs. Three maps are used as its input: the original input image and the first two feature maps of the backbone. They are scaled to the same size as the original input image, concatenated, and convolved by a conv block to 32 channels. The conv block consists of a 3 × 3 convolution, batch normalization, and a SiLU activation. After another 3 × 3 channel-preserving conv block, the result is finally convolved and sigmoid-activated into a one-channel attention map. The whole structure of the attention branch is shown in Figure 3.
Each output of the attention branch has a reference during training. The reference is generated by setting all values of the pixels inside the region of the OBB label to 1 and the others to 0. A corresponding loss function ($l_{att}$), as shown below, based on the binary cross-entropy loss, is added to the original loss to constrain the generation of the attention map.

$$l_{att} = -\frac{1}{H \times W} \sum_{i=1}^{H \times W} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right].$$

Here, $y_i$ is the value of the i-th sample in the reference, $p_i$ is the predicted value, and H and W are the height and width of the predicted attention map.
The attention branch not only generates a full-sized attention map with the same size as the original input image but also down-samples it to three smaller sizes to control the feature aggregation. The down-sample operation is implemented as a max pooling. Each input of the feature aggregation is weighted by the corresponding attention map. The neck is forced to pay more attention to the foreground and suppress the background interference.
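The following is a minimal PyTorch sketch of an attention branch structured as described above. The channel counts of the two shallow backbone feature maps (c1, c2), the bilinear rescaling, the 1 × 1 final convolution, and the down-sampling strides (8/16/32) are assumptions for illustration; the actual AG-Yolo implementation inside ultralytics may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """3x3 convolution + batch normalization + SiLU, as used in the branch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class AttentionBranch(nn.Module):
    def __init__(self, c_img=3, c1=48, c2=96):  # c1, c2 are assumed channel counts
        super().__init__()
        self.block1 = ConvBlock(c_img + c1 + c2, 32)
        self.block2 = ConvBlock(32, 32)          # channel-preserving conv block
        self.to_map = nn.Conv2d(32, 1, 1)        # final conv to one channel

    def forward(self, image, feat1, feat2):
        h, w = image.shape[-2:]
        # Scale the two shallow feature maps back to the input resolution.
        feat1 = F.interpolate(feat1, size=(h, w), mode="bilinear", align_corners=False)
        feat2 = F.interpolate(feat2, size=(h, w), mode="bilinear", align_corners=False)
        x = torch.cat([image, feat1, feat2], dim=1)
        att = torch.sigmoid(self.to_map(self.block2(self.block1(x))))  # B x 1 x H x W
        # Down-sample the full-sized map to three smaller scales via max pooling
        # so each neck input can be weighted element-wise.
        att_scales = [F.max_pool2d(att, kernel_size=s, stride=s) for s in (8, 16, 32)]
        return att, att_scales

def attention_loss(att, reference):
    """Binary cross-entropy between the predicted map and the 0/1 OBB mask."""
    return F.binary_cross_entropy(att, reference)
```

Each down-sampled map would then weight the corresponding neck input element-wise (for example, `neck_in = feat * att_s`), so that foreground features are emphasized during aggregation.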

3.2. Oriented Bounding-Box Regression Head

Because of restrictions on imaging positions, remote sensing objects often appear at arbitrary rotation angles. This makes the regular bounding box in object detection, the HBB, a poor choice for depicting their locations. An HBB can contain much more background than the object itself, especially when the target has a large aspect ratio and is rotated near 45 degrees, as shown in Figure 4a. In addition, if two rotated objects are densely distributed in a small area, such as the parking lot shown in Figure 4b, the HBBs of two adjacent objects will overlap too much, which can easily cause one box to be filtered out by the non-maximum-suppression post-processing procedure, resulting in a missed detection. The OBB alleviates this problem: it suppresses background interference and significantly reduces the overlap of two adjacent targets.
The simplest but still very effective way to acquire an OBB is to add an additional rotation angle parameter. Then, a five-dimensional vector can be used to define an OBB. Considering its similarity to the classification task, the angle regression task is assigned to a head consisting of two 3 × 3 depth-wise separable convolutions [40,41] and a 1 × 1 convolution, which is the same as the classification head. This head can predict an angle for each inputted feature with balanced performance.
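A hedged sketch of such an angle head is given below: two 3 × 3 depth-wise separable convolutions followed by a 1 × 1 convolution, mirroring the lightweight classification head. The channel counts and the sigmoid scaling of the output into (0, π/2] are assumptions for illustration, not the authors' exact design.

```python
import math
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depth-wise 3x3 convolution followed by a point-wise 1x1 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class AngleHead(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.body = nn.Sequential(DWSeparableConv(c_in, c_in),
                                  DWSeparableConv(c_in, c_in))
        self.pred = nn.Conv2d(c_in, 1, 1)  # one angle per spatial location

    def forward(self, x):
        # Squash the raw prediction into (0, pi/2) to match the OBB convention.
        return torch.sigmoid(self.pred(self.body(x))) * (math.pi / 2)
```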
Furthermore, the dual label assignment [11] strategy is retained and extended for OBB regression as shown in Figure 2. A one-to-one head is established. It shares an identical structure and the same optimization objectives as the original one-to-many branch. The one-to-many head only takes effect during training. It can provide plentiful supervision signals to facilitate the optimization. The one-to-one head leverages the one-to-one matching to obtain label assignments. Both heads can be optimized consistently and harmoniously, which means that the best positive sample for the one-to-many head tends to also be the best for the one-to-one head [11]. During inference, the one-to-one head assigns only one prediction to each ground truth, which can avoid the time-consuming NMS post-processing.
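To make the practical benefit of the one-to-one head concrete, the sketch below shows how inference-time post-processing can collapse to a simple confidence filter once duplicate predictions are already suppressed by the one-to-one assignment. The tensor layouts and thresholds are assumptions, not the exact AG-Yolo interface.

```python
import torch

def nms_free_postprocess(boxes, angles, scores, conf_thres=0.25, max_det=300):
    """boxes: (N, 4) as (x, y, w, h); angles: (N, 1); scores: (N, C) class scores."""
    conf, cls = scores.max(dim=1)                 # best class per prediction
    keep = conf > conf_thres                      # simple confidence filter
    boxes, angles, conf, cls = boxes[keep], angles[keep], conf[keep], cls[keep]
    if conf.numel() > max_det:                    # optional cap on detections
        topk = conf.topk(max_det).indices
        boxes, angles, conf, cls = boxes[topk], angles[topk], conf[topk], cls[topk]
    # No rotated NMS is needed here: the one-to-one assignment already
    # suppresses duplicate predictions during training.
    return torch.cat([boxes, angles, conf[:, None], cls[:, None].float()], dim=1)
```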

3.3. Curriculum Learning Strategy

Humans usually learn new things in the form of a curriculum, starting from a simple instance, establishing some basic concepts, and then extending to more complicated scenes. Many works [37,38,39] have tried to train an object detection model in a similar process, mainly from easy samples to hard ones. They use different methods to sort the samples or organize the batches, but the whole training set still includes the original data. Each object appears in many different scenarios. Even the simplest one may still be difficult for a model to learn. Training a model from some much easier samples may improve the final precision.
The background is an important component of a picture and can significantly affect detection performance. A commonly observed phenomenon is that the simpler the background, the more precise the recognition. So, an object appearing on a white flat background is the easiest situation to be identified. Then, considering that one picture often contains many objects in remote sensing, many objects appearing simultaneously on a white background is a harder situation. And the original training data with many objects and complicated backgrounds are in the hardest set. Based on this idea, a three-stage curriculum learning strategy is proposed. The training process is shown in Figure 5. Before training, the easiest and the relatively harder samples from the labeled dataset are generated. Then, the model is trained from easy to hard. The values of all pixels outside the bounding boxes are set to 0 to generate easy samples. And the values of all pixels inside the bounding boxes are set to 1 to generate their references for the training of the attention head. The curriculum scheduler is designed like a warming-up strategy with constant epochs for each stage. Considering its simplicity, only one epoch is trained in the first two stages. More epochs are trained in the last stage to acquire the final model. The checkpoint is saved at the end of the previous stage and loaded in the next stage. With the help of these easier samples, the model can achieve a better starting point to learn complex information and finally achieve a superior detection performance.
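The easy samples and the attention references could be generated along the following lines; this is a sketch using OpenCV and NumPy with illustrative function and variable names, under the assumption that each OBB is available as its four corner points. Pixels outside every labeled box are zeroed for the easy image, and the same polygons filled with 1 give the reference mask.

```python
import cv2
import numpy as np

def make_easy_sample(image, obb_polygons):
    """image: HxWx3 uint8 array; obb_polygons: list of (4, 2) arrays, the four
    corner points of each oriented box."""
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for poly in obb_polygons:
        cv2.fillPoly(mask, [poly.astype(np.int32)], 1)  # 1 inside each OBB
    easy_image = image * mask[..., None]                # zero out the background
    return easy_image, mask                             # mask doubles as the attention reference
```

For the stage-1 samples, the same routine can be applied to one polygon at a time so that each generated image contains a single object on a blank background.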

4. Experiments and Results

4.1. Benchmark Evaluations

4.1.1. Experimental Configuration

Experiments were conducted on two datasets (DOTA-v1.0 [42] and DOTA-v1.5 [43]) to compare the performance of AG-Yolo with several state-of-the-art oriented object detection methods, including Yolov8m-obb [14], S2ANet [18], Rotated_FCOS [44], Oriented_RCNN [45], and Oriented_Reppoints [46].
YOLOv8 is built upon the advancements of previous YOLO versions with new features and optimizations, including advanced backbone, neck architectures, and an anchor-free split head, to maintain an optimal balance between accuracy and speed. S2ANet [18] is constructed by a feature alignment module and an oriented detection module to deal with the misalignment between anchor boxes and axis-aligned convolutional features, which also provides a better trade-off between speed and accuracy for the OD in large-size images. Rotated_FCOS [44] proposed an anchor box-free and proposal-free network to realize OOD, which demonstrates a much simpler and flexible detection framework, achieving improved detection accuracy. Oriented_RCNN [45] is a general two-stage oriented detector with promising accuracy and efficiency, whose oriented Region Proposal Network can directly generate high-quality oriented proposals in a nearly cost-free manner. Oriented_Reppoints [46] proposes an effective adaptive-point-learning approach to OOD by taking advantage of the adaptive point representation, which is able to capture the geometric information of the arbitrarily oriented instances.
DOTA is a large-scale dataset for object detection in aerial images. The images are collected from different sensors and platforms including Google Earth, GF-2, the JL-1 satellite provided by the China Centre for Resources Satellite Data and Application, and aerial images provided by CycloMedia B.V. Each image’s size ranges from 800 × 800 to 20,000 × 20,000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. The instances in DOTA images are annotated by experts in aerial image interpretation by arbitrary (8 d.o.f.) quadrilaterals. It is a continuously updated dataset. The first version is DOTA-v1.0, which was proposed in 2018 with 15 common categories, 2806 images, and 188,282 instances. Then, DOTA-v1.5 was released in 2019 with the same images as DOTA-v1.0 but with one new category and more comprehensive annotations. The extremely small instances (less than 10 pixels) are further annotated. It contains 403,318 instances in total.
AG-Yolo was implemented through ultralytics [47] and trained on a single GPU (NVIDIA RTX A6000, manufactured by LEADTEK in Taiwan, China) on an Ubuntu 22.04 system with the batch size set to 4. The CPU is an Intel(R) Core(TM) i7-10870H manufactured by Intel Corporation in Santa Clara, CA, United States. Table 1 summarizes the environment settings. The ADAM optimizer [48] was used to update the parameters for 50 epochs in the last stage. The learning rate was initially configured to 1 × 10⁻⁴ during the first two stages; it was then multiplied by ten for the third stage and reduced by a factor of 10 at epochs 36 and 48. Other parameters were left at their defaults. The compared methods were implemented through mmrotate [22] or ultralytics [47]. They were also trained on the same GPU with the same learning rate schedule as the last stage of AG-Yolo, with the default settings of their other parameters retained. Table 2 summarizes the main parameter settings for each method. Each picture in the datasets was split into patches of size 1024 × 1024 following the common pipeline; specifically, the gap was set at 500 for the multi-scale training set and 200 for the validation set. For better analysis, the validation part of each dataset, whose labels are publicly available, was examined. The distribution of all annotations in each dataset is shown in Figure 6. Each validation set has a similar structure to its corresponding training set. The experimental results are listed below.
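The stage-3 learning rate schedule described above can be expressed as a step decay; the following is a minimal PyTorch sketch (Adam, initial rate 1 × 10⁻³, decayed by a factor of 10 at epochs 36 and 48 over 50 epochs). The model and the per-batch update loop are placeholders for illustration only.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)                 # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[36, 48], gamma=0.1)

for epoch in range(50):
    # ... one pass over the stage-3 (original) training data, batch size 4 ...
    optimizer.step()      # stand-in for the real per-batch update loop
    scheduler.step()      # epoch-level decay: 1e-3 -> 1e-4 (epoch 37) -> 1e-5 (epoch 49)
```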

4.1.2. Qualitative Comparison

The qualitative comparison of each method on each dataset is shown in Figure 7 and Figure 8. The confidence threshold is set at 0.1 for each method. The input and its ground truth are also presented. The objects belonging to different classes are annotated in different colors.
In Figure 7, the vehicle on the left of the input is a hard target; only ours and Oriented_Reppoints successfully identify it. In Figure 8, many vehicles appear against different backgrounds, including lawn (top left), tarmac (top right), and concrete (bottom left). AG-Yolo detects the vehicles in each of these regions. In particular, at the top right of the input, only our method recognizes the three targets.

4.1.3. Quantitative Comparison

The performance of each method was evaluated on the two datasets separately. The average precision at an IoU threshold of 50% (AP50) for each category, the mean average precision at the same threshold (mAP50), the prediction time for one validation picture (latency), and the number of model parameters (#Params) were used as metrics. Table 3 shows the comparison on DOTA-v1.0. Although S2ANet, Rotated_FCOS, and Oriented_Reppoints propose some innovative detection patterns, they fail to achieve sufficient detection accuracy, and Oriented_RCNN is relatively slow. With many advanced modules integrated, Yolo-v8-obb achieves a good balance between speed and accuracy. Compared to the other methods, our AG-Yolo achieved better results, with mAP50 reaching 0.796 and a latency of only 19.70 ms. Detection was sped up while the mAP50 also improved slightly, and the AP50 of most classes was also improved. As for the comparison on DOTA-v1.5 shown in Table 4, AG-Yolo improved the mAP50 by 0.005. The latency and #Params are almost unchanged because the image size of the two datasets is the same. AG-Yolo is also the most lightweight model, with only 16.56 M parameters. Overall, our algorithm demonstrates superior performance in remote sensing OOD.

4.2. Model Analyses

4.2.1. Ablation Study

To verify the role of each component of AG-Yolo, an ablation experiment was designed, and the results are shown in Table 5. The test set used for the quantitative comparison is the validation part of DOTA-v1.0, and latency and mAP50 are used for performance analysis. The baseline is Yolov10m-obb, which is built by replacing the head of Yolov10m with the head of Yolov8m-obb. Despite some additional computation, the attention branch suppresses the interference of complex backgrounds and improves the mAP50 by 0.021. The NMS-free head effectively speeds up detection during inference. The three-stage curriculum learning strategy helps the model find a better initial point and improves precision without any extra computational burden in the final application. With all three components, our AG-Yolo achieves the best balance.

4.2.2. Analyses for the Attention Branch

The attention branch explicitly generates an attention map to guide the feature aggregation. Figure 9 shows the map of a test sample. The brighter the pixel, the closer the value is to 1, and the better the corresponding feature is preserved. In this sample, the positions of many planes and vehicles are highlighted. With the shallow features used as inputs, these attention maps are more object-like instead of OBB-like. This is more conducive for the model to focus on foregrounds and suppress the interference of backgrounds.

4.2.3. Analyses for the NMS-Free OBB Head

The NMS-free OBB head can provide both strong supervision during training and high efficiency during inference under the dual label assignment strategy. Table 6 compares training with only the one-to-many (o2m) branch, only the one-to-one (o2o) branch, and both branches. The o2m-only head is more precise, while the o2o-only head is faster. With both branches, the model achieves a better precision–latency trade-off.

4.2.4. Analyses for the Three-Stage Curriculum Learning Strategy

The three-stage curriculum learning (TSCL) strategy generates some much easier samples from the labeled dataset to facilitate the training of detection models. Training in this easy-to-hard process provides the model with a better initial point, which subsequently leads to better performance. The validation mAP50–epoch curve of AG-Yolo with and without TSCL during training is shown in Figure 10. Both models are trained from the same initial point. Because the first two stages of TSCL are trained on the generated datasets, only the last stage is plotted. With TSCL, although the mAP50 is lower at the beginning of the third stage, it increases more rapidly and ultimately reaches a higher value. The lower start might stem from the discrepancy in background interference during bounding-box regression, but the parameter adjustments made during the first two training stages are more beneficial to the model’s final performance. The model trained with the TSCL strategy is better equipped to handle the complex dataset.
To further verify its effectiveness, the TSCL strategy was combined with Yolov8m-obb. As shown in Table 7, after training from the same initial point with the same hyperparameters, the curriculum learning strategy achieves an mAP50 improvement of 0.017 without any speed degradation.

5. Discussion

The AG-Yolo proposed in this paper establishes an attention-guided oriented object detection model based on Yolo-v10 to deal with the complex backgrounds that appear in remote sensing images. A three-stage curriculum learning strategy is designed to train the model starting from much easier samples, which provides a better initial point for learning on a complicated dataset and improves performance even when combined with another model. The NMS-free OBB head with dual label assignments gives the model OBB detection capabilities and achieves a better mAP–latency balance. The attention branch further suppresses background interference to facilitate precision. However, our attention design is so far based more on intuition than on solid theory. We have explored the development of current attention mechanism technologies and validated the effectiveness of our attention branch; further work will investigate its underlying cause. Furthermore, when targets are too small, our method cannot improve precision effectively: the improvement on DOTA-v1.5 is not as large as on DOTA-v1.0, especially for small vehicles, whose limited features make identification difficult. Appropriate context information could potentially help.

6. Conclusions

In this paper, we propose a remote sensing object detection method, namely AG-Yolo, which is built on Yolo-v10. An additional rotation parameter is added to the head and the dual label assignment strategy is extended to achieve efficient OBB detection, which is more suitable for remote sensing applications. An attention branch is built to guide the feature aggregation of the neck to concentrate on foregrounds by generating attention maps from the shallow features of inputs, which further suppresses the interference of complicated backgrounds. In addition, a three-stage curriculum learning strategy is designed to train the model starting from much easier samples, which provides the model with a better initial point for learning on a complicated dataset and is conducive to precision. Compared with other advanced methods on the DOTA-v1.0 and DOTA-v1.5 datasets, the processing latency of our algorithm is reduced from 33.8 ms to 19.7 ms, a decrease of about 40%, and the mAP also achieves a certain improvement. These experimental results demonstrate the advantages of our method.

Author Contributions

Conceptualization, X.W.; data curation, X.W.; formal analysis, X.W.; funding acquisition, T.N.; investigation, X.W. and H.L.; methodology, X.W.; project administration, L.H. and M.L.; resources, X.W.; software, X.W.; supervision, C.H., L.H. and T.N.; validation, T.N. and M.L.; visualization, X.W.; writing—original draft, X.W.; writing—review and editing, X.W., C.H., L.H., T.N., X.L., H.L. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62105328).

Data Availability Statement

The DOTA dataset used in the research is available on the website at the link https://captain-whu.github.io/DOTA/index.html (accessed on 5 January 2025).

Acknowledgments

The authors would like to thank D.J., X.N., X.G., B.X., Y.W., Y.M., B.S., L.J., D.M., P.M. and Z.L. for providing the data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, Z.; Wu, M.; Chen, L.; Wang, C.; Xiong, J.; Wei, L.; Huang, X.; Wang, S.; Huang, W.; Du, D. A robust and efficient citrus counting approach for large-scale unstructured orchards. Agric. Syst. 2024, 215, 103867. [Google Scholar] [CrossRef]
  2. Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl. Intell. 2022, 52, 8448–8463. [Google Scholar] [CrossRef]
  3. Liu, H.; Yu, Y.; Liu, S.; Wang, W. A Military Object Detection Model of UAV Reconnaissance Image and Feature Visualization. Appl. Sci. 2022, 12, 12236. [Google Scholar] [CrossRef]
  4. Ju, Y.; Xu, Q.; Jin, S.; Li, W.; Su, Y.; Dong, X.; Guo, Q. Loess Landslide Detection Using Object Detection Algorithms in Northwest China. Remote Sens. 2022, 14, 1182. [Google Scholar] [CrossRef]
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  6. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  9. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  10. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  11. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  12. Yu, J.; Sun, L.; Song, S.; Guo, G.; Chen, K. BAIDet: Remote sensing image object detector based on background and angle information. Signal Image Video Process. 2024, 18, 9295–9304. [Google Scholar] [CrossRef]
  13. Wang, J.; Yang, W.; Li, H.C.; Zhang, H.; Xia, G.S. Learning center probability map for detecting objects in aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4307–4323. [Google Scholar] [CrossRef]
  14. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 31 May 2024).
  15. Zhang, H.; Yang, K.F.; Li, Y.J.; Chan, L.L.H. Night-Time Vehicle Detection Based on Hierarchical Contextual Information. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14628–14641. [Google Scholar] [CrossRef]
  16. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  17. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  18. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  19. Yu, Y.; Da, F. On Boundary Discontinuity in Angle Regression Based Arbitrary Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6494–6508. [Google Scholar] [CrossRef] [PubMed]
  20. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  21. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A general Gaussian heatmap label assignment for arbitrary-oriented object detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef] [PubMed]
  22. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. MMRotate: A Rotated Object Detection Benchmark using PyTorch. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; Available online: https://github.com/open-mmlab/mmrotate (accessed on 1 December 2024).
  23. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  24. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  25. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  29. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  30. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum learning: A survey. Int. J. Comput. Vis. 2022, 130, 1526–1565. [Google Scholar] [CrossRef]
  31. Li, B.; Liu, T.; Wang, B.; Wang, L. Label noise robust curriculum for deep paraphrase identification. In Proceedings of the 2020 International Joint Conference On Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  32. Jafarpour, B.; Sepehr, D.; Pogrebnyakov, N. Active curriculum learning. In Proceedings of the First Workshop on Interactive Learning for Natural Language Processing, Virtual, 7 November 2021; pp. 40–45. [Google Scholar]
  33. Dogan, Ü.; Deshmukh, A.A.; Machura, M.B.; Igel, C. Label-similarity curriculum learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 174–190. [Google Scholar]
  34. Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; Shinozaki, T. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Adv. Neural Inf. Process. Syst. 2021, 34, 18408–18419. [Google Scholar]
  35. Soviany, P.; Ardei, C.; Ionescu, R.T.; Leordeanu, M. Image difficulty curriculum for generative adversarial networks (CuGAN). In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 3463–3472. [Google Scholar]
  36. Doan, T.; Monteiro, J.; Albuquerque, I.; Mazoure, B.; Durand, A.; Pineau, J.; Hjelm, R.D. On-line adaptative curriculum learning for gans. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3470–3477. [Google Scholar]
  37. Wang, J.; Wang, X.; Liu, W. Weakly-and semi-supervised faster r-cnn with curriculum learning. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2416–2421. [Google Scholar]
  38. Feng, J.; Jiang, Q.; Zhang, J.; Liang, Y.; Shang, R.; Jiao, L. CFDRM: Coarse-to-Fine Dynamic Refinement Model for Weakly Supervised Moving Vehicle Detection in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626413. [Google Scholar] [CrossRef]
  39. Zhang, Y.J.; Li, N.N.; Xie, B.H.; Zhang, R.; Lu, W.D. Semi-supervised object detection framework guided by curriculum learning. J. Comput. Appl. 2023, 43, 1234–1245. [Google Scholar]
  40. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  41. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  42. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  43. Jian, D.; Nan, X.; Yang, L.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Detecting Oriented Objects in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  44. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
  45. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  46. Wentong, L.; Yijie, C.; Kaixuan, H.; Jianke, Z. Oriented RepPoints for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  47. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 December 2024).
  48. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. The definition of the θ in OBBs. It is the positive angle (clockwise rotation) between the width of the rectangle and the positive semi-axis of x.
Figure 2. The framework of the AG-Yolo. This model is built based on the structure of yolo-v10. The main alterations are highlighted in red boxes. PAN is the path aggregation network. Scale means resizing. The inputs and outputs of the attention head are scaled to fit the processing requirements of the subsequent components. The three-stage training data are gradually fed into the model for optimization.
Figure 3. The whole structure of the attention branch.
Figure 4. A comparison of HBBs (red boxes) and OBBs (yellow boxes). (a) An HBB can contain a single object and much more background. (b) Two adjacent HBBs may overlap too much. (c) The OBB of a single object. (d) The OBBs of two adjacent objects.
Figure 5. The training process of the curriculum learning strategy. The easiest and the relatively harder samples are generated from the labeled dataset, then inputted to the model stage by stage. The original labeled dataset, which is also the hardest dataset, is used to train the model in the third stage.
Figure 6. The distribution of all annotations in each dataset.
Figure 7. Qualitative comparison of six methods on DOTA-v1.0 datasets with the input (IN) and the ground truth (GT).
Figure 8. Qualitative comparison of six methods on DOTA-v1.5 datasets with the input (IN) and the ground truth (GT).
Figure 9. The attention map example of a test sample. The brighter the pixel, the closer the value is to 1.
Figure 10. The validation mAP50 of AG-Yolo with/without TSCL at the end of each epoch during training. TSCL means three-stage curriculum learning.
Table 1. Experimental environment.

Item | Value
CPU | Intel (R) Core (TM) i7-10870H
GPU | NVIDIA RTX A6000
OS | Ubuntu 22.04
Table 2. Main hyperparameter settings.

Method | Stage | Batch Size | Epochs | Learning Rate
AG-Yolo | Stage 1 | 4 | 1 | 1 × 10⁻⁴
AG-Yolo | Stage 2 | 4 | 1 | 1 × 10⁻⁴
AG-Yolo | Stage 3 | 4 | 36 | 1 × 10⁻³
AG-Yolo | Stage 3 | 4 | 12 | 1 × 10⁻⁴
AG-Yolo | Stage 3 | 4 | 2 | 1 × 10⁻⁵
Other compared methods | — | 4 | 36 | 1 × 10⁻³
Other compared methods | — | 4 | 12 | 1 × 10⁻⁴
Other compared methods | — | 4 | 2 | 1 × 10⁻⁵
Table 3. Quantitative comparison of six methods on DOTA-v1.0. The best two results of each row are shown in bold and in italics separately.

Category | S2ANet | Rotated_FCOS | Oriented_RCNN | Oriented_Reppoints | Yolov8-obb | AG-Yolo
plane | 0.777 | 0.794 | 0.793 | 0.780 | 0.915 | 0.913
ship | 0.792 | 0.869 | 0.875 | 0.858 | 0.905 | 0.897
storage tank | 0.704 | 0.687 | 0.693 | 0.750 | 0.849 | 0.852
baseball diamond | 0.732 | 0.718 | 0.752 | 0.742 | 0.831 | 0.842
tennis court | 0.891 | 0.895 | 0.896 | 0.892 | 0.940 | 0.931
basketball court | 0.766 | 0.753 | 0.810 | 0.717 | 0.723 | 0.741
ground track field | 0.524 | 0.413 | 0.627 | 0.505 | 0.629 | 0.659
harbor | 0.680 | 0.657 | 0.759 | 0.671 | 0.858 | 0.853
bridge | 0.517 | 0.485 | 0.575 | 0.504 | 0.649 | 0.655
large vehicle | 0.773 | 0.818 | 0.842 | 0.791 | 0.855 | 0.862
small vehicle | 0.586 | 0.612 | 0.593 | 0.657 | 0.631 | 0.643
helicopter | 0.626 | 0.649 | 0.733 | 0.665 | 0.823 | 0.840
roundabout | 0.694 | 0.666 | 0.662 | 0.653 | 0.704 | 0.726
soccer ball field | 0.513 | 0.631 | 0.684 | 0.410 | 0.659 | 0.702
swimming pool | 0.661 | 0.701 | 0.705 | 0.663 | 0.812 | 0.827
mAP50 | 0.682 | 0.690 | 0.733 | 0.684 | 0.786 | 0.796
latency/ms | 38.89 | 47.06 | 61.29 | 78.57 | 33.80 | 19.70
#Params/M | 35.58 | 31.92 | 41.14 | 35.43 | 26.41 | 16.56
Table 4. Quantitative comparison of six methods on DOTA-v1.5. The best two results of each row are shown in bold and in italics separately.

Category | S2ANet | Rotated_FCOS | Oriented_RCNN | Oriented_Reppoints | Yolov8-obb | AG-Yolo
plane | 0.705 | 0.794 | 0.774 | 0.776 | 0.862 | 0.840
ship | 0.762 | 0.773 | 0.821 | 0.790 | 0.839 | 0.847
storage tank | 0.605 | 0.611 | 0.614 | 0.697 | 0.796 | 0.793
baseball diamond | 0.718 | 0.726 | 0.718 | 0.705 | 0.800 | 0.761
tennis court | 0.868 | 0.812 | 0.885 | 0.883 | 0.857 | 0.855
basketball court | 0.776 | 0.709 | 0.719 | 0.671 | 0.640 | 0.744
ground track field | 0.463 | 0.327 | 0.543 | 0.463 | 0.556 | 0.560
harbor | 0.604 | 0.687 | 0.673 | 0.682 | 0.767 | 0.773
bridge | 0.464 | 0.484 | 0.479 | 0.485 | 0.636 | 0.577
large vehicle | 0.693 | 0.771 | 0.780 | 0.710 | 0.791 | 0.804
small vehicle | 0.591 | 0.639 | 0.628 | 0.661 | 0.659 | 0.663
helicopter | 0.534 | 0.654 | 0.642 | 0.595 | 0.736 | 0.739
roundabout | 0.620 | 0.658 | 0.651 | 0.654 | 0.666 | 0.689
soccer ball field | 0.475 | 0.550 | 0.657 | 0.360 | 0.601 | 0.609
swimming pool | 0.653 | 0.611 | 0.657 | 0.609 | 0.795 | 0.811
container crane | 0.471 | 0.492 | 0.527 | 0.486 | 0.566 | 0.579
mAP50 | 0.625 | 0.644 | 0.673 | 0.639 | 0.723 | 0.728
latency/ms | 38.89 | 47.06 | 61.29 | 78.57 | 33.80 | 19.70
#Params/M | 35.58 | 31.92 | 41.14 | 35.43 | 26.41 | 16.56
Table 5. Comparison for the ablation study. The best result of each column is shown in bold.

Att. Bran. | NMS-Free | Three-Stage C.L. | mAP50 | Latency/ms
(Yolov10m-obb baseline) | | | 0.782 | 23.86
✓ | | | 0.803 | 24.61
✓ | ✓ | | 0.778 | 19.70
✓ | ✓ | ✓ | 0.796 | 19.70
Table 6. Comparison for different OBB heads. The best result of each column is shown in bold.

o2m | o2o | mAP50 | Latency/ms
✓ | | 0.817 | 24.61
 | ✓ | 0.781 | 19.70
✓ | ✓ | 0.796 | 19.70
Table 7. Comparison for Yolov8m-obb with/without the three-stage curriculum learning strategy.

CL | mAP50 | Latency/ms
✓ | 0.803 | 33.80
 | 0.786 | 33.80