Article

GCA2Net: Global-Consolidation and Angle-Adaptive Network for Oriented Object Detection in Aerial Imagery

1 National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Chengdu 610209, China
2 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1077; https://doi.org/10.3390/rs17061077
Submission received: 11 January 2025 / Revised: 26 February 2025 / Accepted: 14 March 2025 / Published: 19 March 2025

Abstract

Enhancing the detection capabilities of rotated objects in aerial imagery is a vital aspect of the burgeoning field of remote sensing technology. The objective is to identify and localize objects oriented in arbitrary directions within the image. In recent years, the capacity for rotated object detection has seen continuous improvement. However, existing methods largely employ traditional backbone networks, where static convolutions excel at extracting features from objects oriented at a specific angle. In contrast, most objects in aerial imagery are oriented in various directions. This poses a challenge for backbone networks to extract high-quality features from objects of different orientations. In response to the challenge above, we propose the Dynamic Rotational Convolution (DRC) module. By integrating it into the ResNet backbone network, we form the backbone network presented in this paper, DRC-ResNet. Within the proposed DRC module, rotation parameters are predicted by the Adaptive Routing Unit (ARU), employing a data-driven approach to adaptively rotate convolutional kernels to extract features from objects oriented in various directions within different images. Building upon this foundation, we introduce a conditional computation mechanism that enables convolutional kernels to more flexibly and efficiently adapt to the dramatic angular changes of objects within images. To better integrate key information within images after obtaining features rich in angular details, we propose the Multi-Order Spatial-Channel Aggregation Block (MOSCAB) module, which is aimed at enhancing the integration capacity of key information in images through selective focusing and global information aggregation. Meanwhile, considering the significant semantic gap between features at different levels during the feature pyramid fusion process, we propose a new multi-scale fusion network named AugFPN+. This network reduces the semantic gap between different levels before feature fusion, achieves more effective feature integration, and minimizes the spatial information loss of small objects to the greatest extent possible. Experiments conducted on popular benchmark datasets DOTA-V1.0 and HRSC2016 demonstrate that our proposed model has achieved mAP scores of 77.56 % and 90.4 % , respectively, significantly outperforming current rotated detection models.

1. Introduction

In recent years, advances in remote sensing technology have enabled us to obtain more high-quality image datasets. Using remote sensing image object detection technology, specific objects can be effectively identified in these images, with wide applications in smart agriculture, intelligent transportation, and disaster relief. However, because remote sensing images are captured from high altitude, they typically contain a large number of objects smaller than 32 × 32 pixels [1] as well as complex backgrounds, so remote sensing object detection poses significant challenges compared to natural image object detection. Exploring more accurate object detection algorithms therefore remains an active research frontier.
Many researchers have proposed remote sensing object detection algorithms, most of which improve upon natural image object detection methods and can be broadly divided into single-stage and two-stage improvements. For example, single-stage methods include improvements to YOLO-series networks for small object feature representation in remote sensing images [2,3,4,5], and two-stage methods include improvements to R-CNN-series methods [6,7,8,9]; both have brought performance gains for remote sensing objects. However, as shown in Figure 1, the comparison between (a) and (b) reveals that objects in remote sensing images are oriented arbitrarily and that horizontal boxes contain more background information, making it difficult to accurately localize objects in arbitrary remote sensing scenes. Therefore, we focus on object detection with oriented bounding boxes to reduce the difficulty of object localization.
Researchers have conducted extensive research on object detection with oriented bounding boxes and have made good progress. For example, some have proposed methods to address rotational variance [10,11,12] or adopted new oriented box encoding schemes [13]. In addition, designing appropriate loss functions [14,15,16,17,18] for rotated objects, representation methods for rotated objects [13,19,20,21,22], and label assignment strategies [23,24] have also been widely studied. Moreover, with the rapid advancement of large model technologies such as LLaVA [25] and SAM [26], fine-tuning for downstream tasks [27] has also achieved significant performance improvements. However, on the one hand, the static convolution in the standard backbone networks used by these methods often only extracts good features for targets oriented at a fixed angle, which makes it difficult to obtain high-quality features in remote sensing scenes with numerous rotated targets. On the other hand, the scarcity of small target information and the complex backgrounds in remote sensing images pose significant challenges to object detection. As the depth of the feature extraction backbone increases, operations such as downsampling and pooling may cause the loss of spatial location information of small targets [28]. These two issues must be alleviated through enhanced feature extraction and more effective feature fusion. Therefore, we aim to propose a feature extraction backbone network that adapts to target orientation and a feature fusion network that is more conducive to small target detection.
In this paper, we aim to solve two challenges: the difficulty of extracting features of rotated targets with standard convolution, and the loss of spatial position information of small targets as the network deepens. To address them, we design a feature extraction backbone network that adapts to target orientation, together with a feature fusion network that is more conducive to small target detection. Firstly, for feature extraction, since targets in remote sensing images are oriented arbitrarily while standard backbone networks only extract features of targets at a fixed orientation well, we design a feature extraction backbone network with adaptive target orientation, called DRC-ResNet. In this backbone, we propose Dynamic Rotational Convolution (DRC) to adaptively extract features of targets with different orientations. At the same time, to more effectively distinguish the complex background from the foreground on the basis of the angle-rich features captured by DRC, we propose the Multi-Order Spatial-Channel Aggregation Block (MOSCAB), which enhances the perception of key foreground information in images through selective focusing and global information aggregation. Secondly, we propose a feature fusion network called AugFPN+, built on AugFPN [29]. This network reduces the information loss caused by channel reduction of high-level features in the feature pyramid and, at the same time, alleviates the aliasing effect caused by bilinear interpolation by adaptively combining contextual features, thereby strengthening feature fusion. Our network (abbreviated as GCA2Net) has undergone extensive experiments and shows significant advantages over other mainstream algorithms. The main contributions of this article are as follows:
  • To solve the problem of arbitrary orientation of objects in remote sensing images, we propose the DRC module, which adaptively extracts features of objects with different orientations through a data-driven method.
  • Regarding the complex backgrounds of remote sensing images, DRC is still essentially a convolutional kernel that extracts local features, making it difficult to obtain global information and focus on the object area. We therefore propose the MOSCAB module, which achieves selective feature focusing and global information aggregation through attention and gating mechanisms.
  • Given that the features at the highest level of the feature pyramid lose information due to channel reduction during feature fusion, this paper improves the feature pyramid and proposes AugFPN+, which realizes the effective fusion of features rich in low-order texture information with features rich in high-order semantic information.
  • The performance of our network (GCA2Net) has been verified on a wide range of datasets. Compared with mainstream methods, our model achieves superior results in detection accuracy and GFLOPs (see Figure 2).

2. Related Work

2.1. Object Detection in Aerial Images

In recent years, with the continuous improvement in the quality of aerial and remote sensing images and the remarkable success of deep learning-based object detection in computer vision, more and more researchers have shown interest in aerial image object detection. Deep learning-based object detection methods are mainly divided into single-stage and two-stage methods, so this section reviews aerial image object detection methods along this distinction. Most single-stage or two-stage remote sensing object detection methods are adapted from detectors designed for natural images. Because remote sensing images are taken at high altitude, far from the ground, most objects are very small. We first review some classic two-stage methods. Many researchers have improved R-CNN [30], Faster-RCNN [31], and other networks. For example, Ren et al. [6] improved the RPN and feature extraction networks to propose more anchors and obtain high-resolution feature maps. Bai et al. [9] improved the backbone using DRNet on top of Faster-RCNN and achieved good results in building detection in remote sensing images. Zheng et al. [7] proposed a boundary-aware network (BANet) for salient object detection (SOD) in remote sensing images, which can predict salient objects with clear edges and complete structures while reducing model parameters.
To detect faster and meet real-time requirements, single-stage detection methods offer great advantages. The YOLO [32] series has achieved great success and has been widely used because of its balance between speed and accuracy. When YOLO-style methods are applied to remote sensing images, many researchers have made targeted improvements for small objects. For example, for detecting small ships in SAR images, SSS-YOLO [33] removed the low-resolution level and designed the path argument fusion network (PAFN) to capture more local textures and patterns from shallow features. In addition, Zhang et al. [3] proposed FFCA-YOLO, which effectively enhances the feature representation of small objects and suppresses background interference through three lightweight modules: the feature enhancement module (FEM), the feature fusion module (FFM), and the spatial context-aware module (SCAM). Xie et al. [2] proposed CSPPartial-YOLO to enhance the detection of small objects with complex distributions by introducing the partial hybrid dilated convolution (PHDC) block and a coordinate attention module.
While most existing deep learning-based methods have achieved significant success, there are notable differences between aerial imagery and natural scene imagery. Aerial images are characterized by large-scale variations, arbitrary object orientations, and cluttered backgrounds. These features impose certain limitations on the feature extraction of objects when using object detection algorithms designed for natural scenes. Therefore, based on the characteristics of aerial imagery, we have designed a rotation box object detection model that can more accurately capture features from objects oriented in different directions.

2.2. Rotated Object Detection

In recent years, with the continuous improvement of the quality of aerial and remote sensing images, an increasing number of researchers have shown interest in the field of aerial image object detection. Unlike traditional horizontal detection boxes, rotated object detection employs oriented bounding boxes (OBBs), as illustrated in Figure 1. Compared to horizontal bounding boxes, OBBs provide more accurate object localization while including less background information. Along the path of rotated object detection, researchers have primarily focused on two areas of work.
The first aspect involves the design of feature fusion methods in the detector’s neck [34,35,36], oriented region proposal networks [10,37], mechanisms for extracting regions of interest with rotation [10,11], the design of the detector’s head [12,38], and the design of label assignment strategies [23,24].
The second aspect focuses on designing more flexible object representation methods. For instance, Oriented RepPoints [21] represents objects as a set of sample points. CFA [19] models the layout and shape of irregular objects using convex-hull modeling. G-Rep [20] proposes a unified Gaussian representation method for constructing Gaussian distributions for OBBs, quadrilateral bounding boxes, and point sets. Concurrently, loss functions for various rotated object representations have been extensively studied. For example, KFIoU [18] introduces an approximate SkewIoU loss based on Kalman filtering and Gaussian modeling. Additionally, GWD [16] and KLD [17] transform oriented bounding boxes into two-dimensional Gaussian distributions for loss calculation, using the Gaussian Wasserstein distance and the Kullback–Leibler divergence, respectively.
It can be observed that the aforementioned efforts have typically focused on the design of object detectors, achieving commendable performance. However, there has been less exploration in the design of feature extraction backbone networks. Most backbone networks for rotated object detection are derived from traditional object detection backbones, leading to a challenge: traditional static convolutions often perform well in extracting features of objects at specific angles, but the orientation of objects in aerial imagery is arbitrary, and traditional static convolutions cannot effectively extract features of objects at various angles. Therefore, regarding the characteristic of arbitrary object orientation in aerial imagery, this paper proposes the Dynamic Rotational Convolution (DRC) and has achieved promising results.

2.3. Feature Fusion Method

Previous studies have indicated that the accurate localization of small objects requires sufficiently rich spatial information. However, as the depth of feature extraction networks increases, the spatial information of small objects can be compromised. Consequently, to mitigate this loss, researchers have employed multi-scale feature fusion to preserve spatial information for small objects. Representative works include the following: Lin et al. [39] first proposed the Feature Pyramid Network (FPN), which enhances object detection by upsampling deep-layer feature maps and fusing them with shallower, higher-resolution feature maps, thus combining deep semantic information with low-level structural information. Following this, researchers conducted more in-depth studies on FPN, leading to numerous improved variants. While FPN addressed the scarcity of semantic information in shallow layers, it did not resolve the issue of inaccurate localization information in deep feature maps. Subsequently, PANet [40] added a bottom-up pathway to FPN to fuse low-level features with high-level features, thereby enhancing the localization information of high-level features. Furthermore, to better balance speed and accuracy, NAS-FPN [41] utilized Neural Architecture Search (NAS) to automatically optimize FPN, reducing network complexity and improving computational efficiency while maintaining performance. Although PANet and NAS-FPN achieved the fusion of low-level and high-level features, the contributions of input features of different resolutions to the fused output are uneven. Therefore, BiFPN [42] introduced more lateral connections and nodes, employing learnable weights to determine the importance of different input features.
Although significant progress has been made in the methods of feature fusion, there are still some issues that need to be addressed. Before feature fusion, there is a certain gap in the semantic information between different feature layers. Directly fusing these features can reduce the expressive power of multi-scale feature fusion. During the feature fusion process, as features propagate from top to bottom, higher-level features in the feature pyramid may lose information due to the reduction in channels. After feature fusion, the features for each RoI are selected based on the scale of the proposal, but the ignored feature layers also contain rich information, which can affect the final classification and regression. To address these issues, we have improved upon the AugFPN by introducing Residual Feature Augmentation+, Adaptive Spatial Fusion+, and soft RoI Selection. These enhancements alleviate the limitations of previous methods before, during, and after feature fusion, significantly improving the accuracy of object detection.

3. Methodology

This section offers an in-depth analysis of the methods we employed. The overall pipeline of the proposed detector is outlined in Section 3.1. Next, Section 3.2 and Section 3.3 describe, respectively, the rotation mechanism of the convolution kernels and the Dynamic Rotational Convolution (DRC), including a detailed description of its constituent submodules, the Convolution Generation Block (CGB) and the Adaptive Routing Unit (ARU). Finally, Section 3.4 introduces the proposed MOSCAB module, and Section 3.5 presents the AugFPN+ feature fusion network.

3.1. Overall Framework

When aerial imagery is input into the detector’s pipeline, GCA2Net aims to accurately localize the position and category of objects within the input images. The overall structure of GCA2Net is depicted in Figure 3. Specifically, the images are first fed into the DRC-ResNet backbone network, which completes the feature extraction for rotated objects and generates feature maps. Subsequently, feature fusion across different stages is required, where feature maps from various stages of the backbone network are input into the AugFPN+ feature pyramid network. Then, high-level features within the pyramid are enhanced through the Residual Feature Augmentation+ submodule, employing residual connections to extract scale-invariant contextual information, thereby reducing information loss in the highest-level pyramid feature maps. The feature maps obtained at different levels are adaptively combined through the Adaptive Spatial Fusion+ submodule. Following this, the fused feature maps obtained from AugFPN+ are input into the Oriented RPN to generate oriented candidate bounding boxes. Finally, these oriented candidate boxes undergo Rotated RoIAlign and are fed into the regression and classification heads for the prediction of location and category.
In this paper, the DRC-ResNet backbone is built upon ResNet, taking ResNet-50 as an example: the Dynamic Rotational Convolution (DRC) module can be integrated into the ResNet backbone as a plug-and-play module. The specific approach is as follows: considering that 1 × 1 convolutions are rotation-invariant, DRC replaces the 3 × 3 convolutions in the last three stages of ResNet, namely Stage 2, Stage 3, and Stage 4. The detailed architecture of the DRC-ResNet backbone is depicted in Figure 4. For feature extraction, compared with previous convolution-based remote sensing object detection algorithms such as Oriented R-CNN, our proposed DRC can adaptively extract features of targets with different orientations through data-driven methods, whereas static convolution often only extracts features well for targets at a certain orientation. It therefore has a stronger feature extraction ability in remote sensing scenes with numerous rotated targets. Compared with previous attention-based methods such as CBAM, our proposed MOSCAB enhances the perception of key foreground information in images through selective focusing and global information aggregation to distinguish between the background and foreground.
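To make the plug-and-play replacement concrete, the sketch below (PyTorch assumed; not the authors' code) walks a torchvision ResNet-50 and swaps every 3 × 3 convolution in Stages 2–4 (layer2–layer4) while leaving the rotation-invariant 1 × 1 convolutions untouched. The factory make_drc is hypothetical: in DRC-ResNet it would construct the DRC module, and here it simply returns an ordinary 3 × 3 convolution as a placeholder.

```python
import torch.nn as nn
import torchvision

def make_drc(conv: nn.Conv2d) -> nn.Module:
    # Placeholder factory: in DRC-ResNet this would build the DRC module with the
    # same input/output shape; an ordinary 3x3 conv is returned for illustration.
    return nn.Conv2d(conv.in_channels, conv.out_channels, kernel_size=3,
                     stride=conv.stride, padding=conv.padding,
                     bias=conv.bias is not None)

def replace_3x3(stage: nn.Module) -> None:
    # Recursively replace 3x3 convolutions, keeping 1x1 convolutions intact.
    for name, child in stage.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3):
            setattr(stage, name, make_drc(child))
        else:
            replace_3x3(child)

backbone = torchvision.models.resnet50(weights=None)
for stage in (backbone.layer2, backbone.layer3, backbone.layer4):  # Stages 2-4
    replace_3x3(stage)
```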

3.2. Rotation Mechanism of the Convolution Kernels

Standard convolution uses fixed parameters to extract features uniformly across all images, which works well when the objects have consistent orientations. However, in aerial images, the orientation of objects can vary significantly, making it difficult for standard convolution to adapt to these changes in object orientation. To address this issue, the convolution kernel needs to be rotated, enabling a better representation of features. In this section, we briefly explain the rotation mechanism of the convolution kernel [43].
Initially, we obtain the original convolution kernel, as shown in Figure 5a. Here, we consider the parameters of the convolution kernel as a sampling from the kernel space. Thus, the original convolution kernel undergoes bilinear interpolation to span a kernel space, as depicted in Figure 5b. Subsequently, if we need to rotate the convolution kernel counterclockwise by an angle θ , we can obtain the rotated coordinate system, as illustrated in Figure 5c. Finally, sampling from the rotated coordinate system yields the rotated convolution kernel, as shown in Figure 5d.
To illustrate the rotation mechanism more clearly, we provide an example, shown in Figure 6. Firstly, we initialize a 2 × 2 convolution kernel and then perform bilinear interpolation to obtain a 4 × 4 kernel, similar to the bilinear interpolation of images. The original 2 × 2 kernel is then rotated by 45 degrees and the weights are recalculated: for each grid cell covered by one or more weights, averaging those weights yields the final rotated 2 × 2 convolution kernel.
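As a concrete illustration of this mechanism, the following sketch (PyTorch assumed; not the authors' implementation) rotates a [C_out, C_in, k, k] kernel by an angle θ using an affine sampling grid with bilinear interpolation, which mirrors the "sample from the rotated coordinate system" step in Figure 5. The sign convention of the rotation is an assumption and may need flipping to match a particular angle definition.

```python
import math
import torch
import torch.nn.functional as F

def rotate_kernel(weight: torch.Tensor, theta: float) -> torch.Tensor:
    """Rotate a [C_out, C_in, k, k] kernel by `theta` radians via bilinear sampling."""
    c_out, c_in, k, _ = weight.shape
    cos, sin = math.cos(theta), math.sin(theta)
    # Sampling grid of the rotated coordinate system (Figure 5c/d).
    affine = torch.tensor([[cos, -sin, 0.0],
                           [sin,  cos, 0.0]], dtype=weight.dtype, device=weight.device)
    grid = F.affine_grid(affine.repeat(c_out, 1, 1),
                         size=(c_out, c_in, k, k), align_corners=False)
    return F.grid_sample(weight, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)

rotated = rotate_kernel(torch.randn(64, 64, 3, 3), theta=math.pi / 4)  # 45 degrees
```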

3.3. Dynamic Rotational Convolution (DRC)

Based on the convolutional kernel rotation mechanism described in Section 3.2, the DRC (Dynamic Rotational Convolution) module has been designed. Conventional convolutional kernels are static and often excel at capturing features with a fixed orientation; in a typical convolutional layer, the same kernel is applied to all input images. The DRC module instead employs a data-driven approach: based on the input feature maps, it predicts the rotation angles and the combination weights of multiple orientation experts and generates a Dynamic Rotational Convolution, enabling adaptive feature capture for the input feature maps.
Specifically, the DRC module is composed of the Adaptive Routing Unit (ARU), the Convolution Generation Block (CGB), the Rotated Convolution, and MOSCAB. The ARU conducts preliminary feature extraction on the input feature maps, predicting the rotation angles θ required for the convolutional kernels and their combination weights λ. Subsequently, based on these angles and combination weights, the convolutional kernels are rotated and combined, which is the operation performed by the CGB. Upon completion of this step, the desired Rotated Convolution is obtained. The Rotated Convolution is then applied to the input feature maps through convolutional operations. Finally, the resulting feature maps are input into MOSCAB, enabling efficient capture of contextual representations based on the obtained rotational features. MOSCAB is introduced in detail in Section 3.4.
Convolution Generation Block (CGB):
Due to the arbitrary orientation of objects in aerial imagery, multiple convolutional kernels must be rotated and then combined with weights to obtain features enriched with angular details, which yields better feature extraction for any rotation direction. Consequently, a Dynamic Rotational Convolution must be generated. In light of this, this paper proposes the Convolution Generation Block (CGB).
The specific structure of the Convolution Generation Block (CGB) is shown in Figure 7b. CGB takes as input a set of n convolutional kernels W = [W_1, ..., W_n], each with a shape of [C_out, C_in, k, k]. It first rotates the kernels based on the angles θ = [θ_1, ..., θ_n] obtained from the Adaptive Routing Unit (ARU), resulting in the rotated kernels W'_1, ..., W'_n. Finally, a combination operation is performed on these rotated kernels to yield the final convolutional kernel $\widetilde{W}$.
The specific implementation is as follows. Firstly, we input the original convolution kernels W and rotate them to obtain the rotated convolution kernels W':
$W'_i = \mathrm{Rotate}(W_i;\ \theta_i), \quad i = 1, 2, \ldots, n$    (1)
where W' = [W'_1, ..., W'_n] are the resulting rotated convolution kernels and Rotate(·) is implemented by a rotation matrix. A simple use of these rotated kernels is to perform a weighted combination of them, as illustrated in Equation (2).
Once the rotated convolution kernels W' are obtained, the combination operation yields the final convolution kernel $\widetilde{W}$:
$\widetilde{W} = \lambda_1 \times W'_1 + \cdots + \lambda_n \times W'_n$    (2)
where $\widetilde{W}$ is the resulting combination of the rotated convolution kernels. A straightforward application of $\widetilde{W}$ is to employ it as a convolutional filter to perform convolution operations on feature maps.
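A minimal sketch of the CGB combination step in Equations (1) and (2) is given below (PyTorch assumed; names are illustrative, not the authors' code). It reuses the rotate_kernel helper sketched in Section 3.2 and then applies the combined kernel as an ordinary convolution filter.

```python
import torch
import torch.nn.functional as F

def combine_kernels(kernels: torch.Tensor, thetas: torch.Tensor,
                    lambdas: torch.Tensor) -> torch.Tensor:
    """kernels: [n, C_out, C_in, k, k]; thetas, lambdas: [n] (predicted by the ARU)."""
    combined = torch.zeros_like(kernels[0])
    for W_i, theta_i, lam_i in zip(kernels, thetas, lambdas):
        # Eq. (1): rotate each expert kernel; Eq. (2): weighted sum of the rotated kernels.
        combined = combined + lam_i * rotate_kernel(W_i, float(theta_i))
    return combined  # the final kernel, used as a convolution filter

# Applying the combined kernel to a feature map x of shape [B, C_in, H, W]:
# y = F.conv2d(x, combine_kernels(kernels, thetas, lambdas), padding=1)
```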
Adaptive Routing Unit (ARU): Due to the arbitrary orientation of objects in aerial imagery, the convolutional kernels must be rotated and a linear combination of multiple rotated kernels must be employed to obtain features enriched with angular details. Consequently, the rotation angle of each kernel and the combination weights need to be predicted. Therefore, this paper introduces the Adaptive Routing Unit (ARU). Through a data-driven approach, the ARU predicts the rotation angles and combination weights of the kernels and is a key component of the DRC.
The specific structure of the Adaptive Routing Unit (ARU) is shown in Figure 7c. The locality of convolution limits the receptive field and prevents the efficient capture of long-distance spatial dependencies. Meanwhile, convolution shares parameters across all spatial positions, which weakens its ability to adapt to different visual patterns at different spatial locations. These issues limit the initial spatial encoding capability in remote sensing images with complex backgrounds and numerous small targets. Therefore, we use the Involution operator [44] to spatially encode the input of the ARU. This encoding has two benefits: firstly, it can obtain a wider range of spatial contextual information; secondly, it can adaptively allocate different weights to different spatial positions, which helps find visual elements that contribute more to the foreground. In the ARU, the input feature map, of size [C_in, H, W], first undergoes feature extraction through Involution [44]. The resulting feature maps are then subjected to normalization, activation, and pooling before being fed into two separate branches: the angle θ prediction branch and the combination weight λ prediction branch. This process yields the predicted values of both the angle θ and the combination weight λ. The implementation is as follows:
Initially, the input feature map X is subjected to Involution(·) for image encoding, as shown in Formula (3). Next, after layer normalization and ReLU(·) activation, the average pooling layer is applied for downsampling, as shown in Formula (4). Subsequently, the obtained feature maps are sent to two branches with different activation functions for predicting the rotation angle θ and the combined weight λ . For the rotation angle prediction branch, a fully connected layer is first passed through, and then sigmoid(·) activation is performed to obtain the predicted value θ , as shown in Formula (5). For the combination weight prediction branch, a fully connected layer is first passed through, and then softsign(·) activation is performed to obtain the predicted value λ , as shown in Formula (6).
$Y = \mathrm{Involution}(X)$    (3)
$Z = \mathrm{AvgPool}(\mathrm{ReLU}(\mathrm{LayerNorm}(Y)))$    (4)
$\theta = \mathrm{sigmoid}(\mathrm{Linear}(Z))$    (5)
$\lambda = \mathrm{softsign}(\mathrm{Linear}(Z))$    (6)
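The following sketch summarises the ARU computation of Formulas (3)–(6) (PyTorch assumed; not the authors' code). A depthwise 3 × 3 convolution stands in for the Involution operator [44], and the number of experts n as well as how the sigmoid output is mapped to an angle range are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRoutingUnit(nn.Module):
    """Predicts rotation angles theta and combination weights lambda (Formulas (3)-(6))."""
    def __init__(self, in_channels: int, n_experts: int = 4):
        super().__init__()
        # Stand-in spatial encoder; the paper uses the Involution operator here.
        self.encode = nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels)
        self.norm = nn.LayerNorm(in_channels)
        self.to_theta = nn.Linear(in_channels, n_experts)
        self.to_lambda = nn.Linear(in_channels, n_experts)

    def forward(self, x):                                    # x: [B, C, H, W]
        y = self.encode(x)                                   # Formula (3)
        y = self.norm(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        z = F.adaptive_avg_pool2d(F.relu(y), 1).flatten(1)   # Formula (4), z: [B, C]
        theta = torch.sigmoid(self.to_theta(z))              # Formula (5), in (0, 1)
        lam = F.softsign(self.to_lambda(z))                  # Formula (6), in (-1, 1)
        return theta, lam   # theta is still to be scaled to an angle range downstream
```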

3.4. Multi-Order Spatial-Channel Aggregation Block (MOSCAB)

The background in remote sensing images is complex, and to extract effective features we need to focus on distinguishing between the background and foreground. The rotated convolution kernels generated in DRC capture features rich in angular details but cannot effectively distinguish the relationship between background and foreground. Feature integration theory [45] suggests that human vision perceives objects by extracting basic contextual features and associating individual features with attention. However, relying solely on regional perception or contextual aggregation is not sufficient to simultaneously learn different contextual features and multi-order interactions. This paper therefore proposes the Multi-Order Spatial-Channel Aggregation Block (MOSCAB), shown in Figure 7: its convolutional structure extracts local multi-order features, while the MGEM block models their interactions, so that different contextual features and multi-order interactions can be learned simultaneously. Building on the angle-rich features obtained from DRC, MOSCAB enhances the perception of key foreground information in images through selective focusing and global information aggregation.
The specific structure of MOSCAB is depicted in Figure 8. Initially, the input feature maps pass through spatial attention and channel attention layers: channel attention weighs the importance of each channel to strengthen the focus on key information, while spatial attention captures long-range dependencies to identify the regions where key information is located, so that the model achieves a selective focus on features. Traditional deep neural networks, which combine local perception and context aggregation in an incompatible way, tend to overemphasize extreme-order interactions while suppressing the most discriminative mid-order interactions [46,47]. After selective focusing, the features need to be effectively aggregated. Therefore, at the end of the module, the MGEM block is utilized to achieve an effective aggregation of features.
The proposed MOSCAB consists of three cascaded components and can be represented as follows, where SFM(·) denotes the Selective Focusing Module (SFM), FDM(·) the Feature Decomposition Module (FDM), and MGEM(·) the Multi-Order Gated Enhancement Module, which includes a gating branch and a multi-scale context branch:
$Z = \mathrm{SFM}(X) + \mathrm{MGEM}(\mathrm{FDM}(\mathrm{Norm}(\mathrm{SFM}(X))))$
Selective Focusing Module (SFM): After the DRC module extracts feature maps containing rich angular details, two issues need to be addressed: what constitutes the critical information, and where this critical information is located. The Selective Focusing Module (SFM) is proposed to enhance the focus on critical information and to identify the regions containing it, using channel attention along the channel dimension and spatial attention along the spatial dimension. The SFM we designed is defined as
$X_{CA} = X \otimes M_C(X)$
$X_{SA} = X_{CA} \otimes M_S(X_{CA})$
$Z = X_{SA} \otimes M_C(X_{SA})$
where M_C and M_S represent the channel attention and spatial attention modules, respectively, and the symbol ⊗ denotes element-wise multiplication. Intermediate broadcasting mechanisms are employed for dimension transformation and alignment.
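The sketch below gives one CBAM-style realisation of the SFM equations above (PyTorch assumed). The exact form of M_C and M_S, and whether the two channel-attention applications share weights, are not specified in the paper and are assumptions here.

```python
import torch.nn as nn

class SelectiveFocusingModule(nn.Module):
    """X_CA = X * M_C(X); X_SA = X_CA * M_S(X_CA); Z = X_SA * M_C(X_SA)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = nn.Sequential(                 # M_C (weights shared here by assumption)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(                 # M_S on the channel-averaged map
            nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x_ca = x * self.channel_att(x)                                   # broadcast over H, W
        x_sa = x_ca * self.spatial_att(x_ca.mean(dim=1, keepdim=True))   # broadcast over channels
        return x_sa * self.channel_att(x_sa)
```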
Feature Decomposition Module (FDM): When extracting multi-order features, two complementary features emerge: fine-grained local texture features (low-order) and complex global shape features (mid-order), which are obtained by Conv1×1(·) and GAP(·), respectively. To compel the network to focus on multi-order interactions, we design FDM(·) to dynamically exclude trivial interactions, which is defined as
$Y = \mathrm{Conv}_{1\times1}(X)$
$Z = \mathrm{GELU}(Y + \gamma_s \cdot (Y - \mathrm{GAP}(Y)))$
where $\gamma_s \in \mathbb{R}^{C \times 1}$ is a scaling factor initialized to 0. FDM(·) enhances feature diversity by reweighting the interaction component $Y - \mathrm{GAP}(Y)$.
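A minimal sketch of FDM(·) as defined above (PyTorch assumed; the per-channel shape chosen for gamma_s is an implementation assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecompositionModule(nn.Module):
    """Z = GELU(Y + gamma_s * (Y - GAP(Y))), with Y = Conv1x1(X) and gamma_s initialised to 0."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)
        self.gamma_s = nn.Parameter(torch.zeros(1, channels, 1, 1))  # per-channel scale

    def forward(self, x):
        y = self.proj(x)                            # Y = Conv1x1(X)
        gap = F.adaptive_avg_pool2d(y, 1)           # GAP(Y)
        return F.gelu(y + self.gamma_s * (y - gap))
```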
Multi-Order Gated Enhancement Module (MGEM): Previously, some works simply combined DWConv and self-attention to model the interaction between local and global features. Unlike these approaches, the multi-scale context branch of our MGEM(·) encodes multi-order features using DWConv_5×5, capturing low-order, mid-order, and high-order features through different dilation rates d = 1, 2, 3. The specific implementation is as follows:
First, the input feature X ∈ R^{C×H×W} is processed using DWConv_5×5(·) with a dilation of 1 to capture low-order features. It is then decomposed along the channel dimension into X_l ∈ R^{C_l×H×W}, X_m ∈ R^{C_m×H×W}, and X_h ∈ R^{C_h×H×W}, where C = C_l + C_m + C_h. Then, X_l, X_m, and X_h are processed using DWConv_3×3(·) with a dilation of 1, DWConv_5×5(·) with a dilation of 2, and DWConv_7×7(·) with a dilation of 3, respectively, resulting in the following outputs:
$Y_l = \mathrm{DWConv}_{3\times3}(X_l)$
$Y_m = \mathrm{DWConv}_{5\times5}(X_m)$
$Y_h = \mathrm{DWConv}_{7\times7}(X_h)$
The outputs obtained are then concatenated to form the multi-order context $Y_c = \mathrm{Concat}(Y_l, Y_m, Y_h)$. Subsequently, a Conv_1×1 operation is applied, which performs a weighted summation on each pixel of the input feature map, facilitating information exchange and feature fusion across different channels. The enhanced context information is then passed through a SiLU(·) activation function to yield $Y_{cs} = \mathrm{SiLU}(\mathrm{Conv}_{1\times1}(Y_c))$.
For the other, gated branch, the input is first processed using Conv_1×1 to adjust the number of channels to C and then activated using SiLU to obtain $Y_{gs} = \mathrm{SiLU}(\mathrm{Conv}_{1\times1}(X))$.
Thus, we can write the formula for MGEM(·) as
$Z = Y_{gs} \otimes Y_{cs}$
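Putting the two branches together, the sketch below is one possible reading of MGEM(·) (PyTorch assumed; the class name and the channel split ratio across the low/mid/high-order groups are assumptions, since the paper only requires C = C_l + C_m + C_h):

```python
import torch
import torch.nn as nn

class MGEM(nn.Module):
    """Multi-order context branch gated by a 1x1-conv branch: Z = Y_gs * Y_cs."""
    def __init__(self, channels: int, split=(0.25, 0.25, 0.5)):
        super().__init__()
        c_l = int(channels * split[0]); c_m = int(channels * split[1])
        c_h = channels - c_l - c_m
        self.splits = (c_l, c_m, c_h)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)        # d = 1
        self.dw_l = nn.Conv2d(c_l, c_l, 3, padding=1, groups=c_l)                      # d = 1
        self.dw_m = nn.Conv2d(c_m, c_m, 5, padding=4, dilation=2, groups=c_m)          # d = 2
        self.dw_h = nn.Conv2d(c_h, c_h, 7, padding=9, dilation=3, groups=c_h)          # d = 3
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Conv2d(channels, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.dw5(x)                                    # low-order encoding of the input
        x_l, x_m, x_h = torch.split(y, self.splits, dim=1)
        y_c = torch.cat([self.dw_l(x_l), self.dw_m(x_m), self.dw_h(x_h)], dim=1)
        y_cs = self.act(self.fuse(y_c))                    # multi-scale context branch
        y_gs = self.act(self.gate(x))                      # gated branch
        return y_gs * y_cs
```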

3.5. AugFPN+

As the depth of the backbone network increases, operations such as downsampling and pooling may lead to the loss of spatial location information of small targets [28]. To address this issue, we propose a feature pyramid network called AugFPN+ based on AugFPN. As shown in Figure 9, it consists of two main components, Adaptive Spatial Fusion+ (ASF+) and Residual Feature Augmentation+ (RFA+). Compared to AugFPN, ASF+ and RFA+ are improved to better preserve the spatial position information of small targets through additional branches. Specifically, we initially construct the feature pyramid based on the multi-scale features C2, C3, C4, and C5 derived from the backbone. As the highest-level C5 feature map propagates along the top-down pathway of the Feature Pyramid Network (FPN), the reduction of feature channels leads to information loss, yielding single-scale contextual information that is incompatible with the resulting features of the other levels. To address this, the Residual Feature Augmentation+ (RFA+) module is introduced, which improves the feature representation of M5 through residual connections. Concurrently, our upsampling employs bilinear interpolation; considering the aliasing effects induced by interpolation, we design an Adaptive Spatial Fusion+ (ASF+) module to adaptively combine these contextual features rather than simply summing them.
Residual Feature Augmentation+: In the Feature Pyramid Network (FPN), as the highest-level features M5 propagate from top to bottom, they enhance lower-level features with high-level semantics, producing features with varying contextual information. On the other hand, the highest-level features may suffer from information loss due to the reduction in channels, leading to semantic discrepancies between their contextual information and the features at other levels. To address this, the Residual Feature Augmentation+ (RFA+) module is proposed; it integrates diverse spatial information back into the original branch through residual connections so as to minimize information loss at the highest feature level. Specifically, we perform adaptive pooling at different scales on C5 to generate four contextually distinct features of sizes (α_1, α_2, α_3, α_4) × [W, H], each of which is processed independently by a 1 × 1 convolutional layer to reduce the feature channels to 256. Finally, these contextual features are upsampled to H × W through bilinear interpolation and fed into ASF+ for feature fusion.
Adaptive Spatial Fusion+: Considering the aliasing effects induced by interpolation, a simple summation of contextual features is not viable. Therefore, we designed the Adaptive Spatial Fusion+ (ASF+) module for the adaptive combination of contextual information. The ASF+ module generates a spatial weight for each feature, leveraging these weights to aggregate context. Specifically, multiple upsampled feature maps are fed into two branches. One branch undergoes a 1 × 1 convolution to reduce the dimensionality of feature channels. The other branch first performs a concatenation, which is followed by a 3 × 3 convolution to maintain the size of the feature maps, then a 1 × 1 convolution to adjust the channels and a Hadamard product, culminating in feature fusion.
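A sketch of the RFA+ context-generation step described above (PyTorch assumed; the pooling ratios α_1–α_4 are illustrative values, since their exact settings are not restated in the text above):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualFeatureAugmentationPlus(nn.Module):
    """Pool C5 to several context scales, project each to 256 channels, and
    upsample back to the C5 resolution; the outputs are then combined by ASF+."""
    def __init__(self, in_channels: int, out_channels: int = 256,
                 alphas=(0.1, 0.2, 0.3, 0.4)):
        super().__init__()
        self.alphas = alphas
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, 1) for _ in alphas)

    def forward(self, c5):                                  # c5: [B, C, H, W]
        h, w = c5.shape[-2:]
        contexts = []
        for alpha, conv in zip(self.alphas, self.convs):
            pooled = F.adaptive_avg_pool2d(c5, (max(1, int(alpha * h)),
                                                max(1, int(alpha * w))))
            contexts.append(F.interpolate(conv(pooled), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return contexts                                     # fed into ASF+ for adaptive weighting
```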

4. Experimental Setup

This section provides a detailed account of the experiments conducted. Initially, Section 4.1 presents an introduction to Datasets and Metrics. Subsequently, Section 4.2 delves into the Data Preprocessing aspect. Thereafter, Section 4.3 specifically elucidates the Implementation Details. Concluding this section, Section 4.4 offers an in-depth exposition of the Model Training Process.

4.1. Datasets and Metrics

The HRSC2016 dataset is a collection of high-resolution optical remote sensing images specifically designed for ship detection and recognition research, which were released by Northwestern Polytechnical University in 2016. This dataset comprises 1680 images, of which 1061 contain valid annotations, covering maritime and coastal vessels in six significant ports. The image resolutions range from 0.4 m to 2 m with dimensions varying from 300 × 300 to 1500 × 900 pixels. The annotations are provided in the format of oriented bounding boxes (OBBs), offering three types of annotation information, including bounding box, oriented bounding box, and pixel-based segmentation, as well as additional metadata such as port information, data source, and capture time. The ship models in the dataset are organized into a three-tiered structure, comprising Ship Class, Ship Category, and Ship Type, and they are divided into training, validation, and testing sets, containing 436, 181, and 444 images, respectively. The HRSC2016 dataset has become an important benchmark for evaluating and developing new ship detection methods and algorithms due to its high image resolution, numerous object instances, rich annotation information, and challenging characteristics.
The DOTA-V1.0 dataset [48] represents an extensive optical remote sensing imagery dataset, consisting of 2806 images with a total of 188,282 instances categorized into 15 distinct object classes: Ground Track Fields (GTFs), Basketball Courts (BCs), Swimming Pools (SPs), Roundabouts (RAs), Soccer Fields (SFs), Baseball Diamonds (BDs), Storage Tanks (STs), Helicopters (HCs), Bridges (BRs), Tennis Courts (TCs), Harbors (HAs), Planes (PLs), Large Vehicles (LVs), Small Vehicles (SVs), and Ships (SHs). This dataset is distinguished by its diverse imagery sources and scenes, as well as its variable image dimensions, with the maximum resolution reaching 4000 × 4000 pixels. For experimental consistency, images were cropped to a standardized size of 1024 × 1024 pixels, maintaining a 20 % overlap, and any resulting gaps were filled with a black background. The dataset is segmented into three subsets: a training set comprising 15,749 images, a validation set containing 5297 images, and a test set with 5296 images.
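The cropping procedure described above can be sketched as follows (NumPy assumed; in practice a dedicated splitting tool such as the one shipped with the DOTA devkit is typically used, and annotations must be clipped to each patch as well):

```python
import numpy as np

def tile_image(image: np.ndarray, patch: int = 1024, overlap: float = 0.2):
    """Tile `image` into patch x patch crops with the given overlap ratio,
    zero-padding (black) any crop that extends past the image border."""
    stride = int(patch * (1 - overlap))          # ~819 px for 1024 px patches, 20% overlap
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            crop = image[top:top + patch, left:left + patch]
            padded = np.zeros((patch, patch) + image.shape[2:], dtype=image.dtype)
            padded[:crop.shape[0], :crop.shape[1]] = crop    # fill the gap with black
            tiles.append(((top, left), padded))
    return tiles
```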
Here, we present several fundamental concepts and their corresponding computational methodologies. For the evaluation of model performance, the VOC-format Average Precision (AP) and mean Average Precision (mAP) were primarily utilized. The computation of these metrics requires the use of True Positives (TPs), False Positives (FPs), True Negatives (TNs), and False Negatives (FNs). The definitions of TP, FP, TN, and FN are illustrated in Figure 10. Based on the obtained values of TP, FP, TN, and FN, Precision (P) and Recall (R) were calculated using the following methods:
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
For each class, the corresponding AP was computed based on P and R. The mAP was then derived by averaging the AP values across all classes, providing a comprehensive measure of detection effectiveness. Here, AP denotes the area under the P–R curve relative to the coordinate axes, and mAP represents the average of AP values across Nc categories. The calculation methods for AP and mAP are as follows:
$AP = \int_0^1 P(R)\, dR$
$mAP = \frac{1}{N_c} \sum_{i=1}^{N_c} AP_i$
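For reference, a minimal sketch of the AP and mAP computation from a per-class precision-recall curve (all-points VOC-style interpolation assumed; TP/FP matching at a fixed IoU threshold is presumed to have been done already):

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve, integrated over all recall points."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing before integrating.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class: dict) -> float:
    """mAP: the mean of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))
```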

4.2. Data Preprocessing

Data augmentation techniques are a critical factor in achieving optimal model performance. Therefore, to enhance the effectiveness of model training, we initially applied image augmentation to the dataset. We employed a variety of data augmentation methods, including contrast adjustment, addition of Gaussian noise, and rotation, to increase the complexity of the background and diversify the objects. The specific augmentation methods are illustrated in Figure 11.
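The augmentations listed above can be sketched as follows (NumPy assumed; the parameter ranges are illustrative, rotation is restricted to 90-degree steps for simplicity, and in practice the oriented box annotations must be transformed consistently with any rotation):

```python
import random
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    """Illustrative contrast adjustment, Gaussian noise, and right-angle rotation."""
    img = image.astype(np.float32)
    mean = img.mean()
    img = np.clip((img - mean) * random.uniform(0.8, 1.2) + mean, 0, 255)   # contrast
    img = np.clip(img + np.random.normal(0.0, 5.0, img.shape), 0, 255)      # Gaussian noise
    img = np.rot90(img, k=random.randint(0, 3)).copy()                      # rotation (90-degree steps)
    return img.astype(np.uint8)
```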

4.3. Implementation Details

We constructed the backbone network DRC-ResNet based on the commonly employed ResNet architecture. Leveraging the fact that 1 × 1 convolutions are rotationally invariant, we replaced all 3 × 3 convolutions in the final three stages with the proposed DRC module while keeping the 1 × 1 convolutions intact. For training, we conducted the entire network training on the DOTA-V1.0 dataset using the Oriented-RCNN framework. Following standard training configurations, the specific training parameters are presented in Table 1. All experiments were conducted on hardware comprising Nvidia GeForce RTX3090 GPUs and Intel Xeon E5-2660 v3 CPUs. The results for DOTA-V1.0 and HRSC2016 were obtained using the MMRotate toolbox, except for Oriented R-CNN [10], which was implemented using the OBBDetection codebase.

4.4. Model Training Process

Our training process is depicted in Figure 12. The blue boxes primarily contain descriptions of training operations, while the black arrows represent the sequence of steps. Initially, we incorporate the concept of transfer learning by pre-training the backbone on the MS COCO dataset. Subsequently, the network is fine-tuned on the target dataset from the pre-trained weights. Regarding dataset partitioning, considering the limited number of images and the high resolution of aerial imagery datasets, our training process does not include a separate validation step. Instead, we combine the training and validation sets and use the test set for performance evaluation of the model.

5. Experimental Results

This section reports the findings of our experiments. To ensure a fair comparison, we present the Average Precision (AP) and mean Average Precision (mAP) for each category. In Section 5.1, we replaced the backbone network of Oriented-RCNN with the proposed DRC-ResNet and conducted experiments on the DOTA-V1.0 and HRSC2016 datasets, comparing with several State-Of-The-Art methods. Subsequently, in Section 5.2, we performed ablation studies. Finally, in Section 5.3, we present some visualized detection results on the DOTA-V1.0 and HRSC2016 datasets.

5.1. Comparative Experiments

To evaluate the performance of the model proposed in this paper, we conducted comprehensive comparisons. Experiments were carried out on the DOTA-V1.0 and HRSC2016 datasets. We report the Average Precision (AP) for 15 categories and the mean Average Precision (mAP) in the DOTA-V1.0 dataset, as well as the mean Average Precision (mAP) for the HRSC2016 dataset.

5.1.1. Result on DOTA-V1.0

The Oriented-RCNN selected in this article is widely applied and representative in rotated object detection for aerial images. It has been widely used as a benchmark in previous research, so comparing against it allows a fairer and more meaningful evaluation of performance. Unlike Oriented-RCNN, we use DRC-ResNet as the backbone network for angle-adaptive feature extraction and use AugFPN+ to adaptively combine contextual information at different scales. The method proposed in this article achieves 77.56 % mAP, an improvement over the selected baseline (Oriented-RCNN), demonstrating better detection performance.
Table 2 illustrates the results of various detectors on the DOTA-V1.0 dataset, validating the effectiveness of the method proposed in this paper. By combining our proposed DRC-ResNet backbone network and AugFPN+ feature fusion network with Oriented R-CNN, the model’s ability to represent features and integrate contextual information for targets at different angles and scales has been enhanced. It is evident that our model achieved an mAP of 77.56 % , which represents an increase of 2.64 % compared to the baseline and an increase of more than 1.5 % compared to some of the latest models. In addition, our model demonstrated significant improvements over the previous methods in categories such as Bridge, Ground Track Field, Small Vehicle, Ship, Soccer Field, Harbor, Swimming Pool, and Helicopter, effectively balancing the detection performance across objects of different scales.
In addition, considering the real-time requirements of remote sensing object detection algorithms, we also compared the computational efficiency of the algorithms, as shown in Table 3. It can be observed that our algorithm has lower FLOPs and thus holds a certain advantage over other methods. This can reduce hardware and cost expenses, which is beneficial for running multiple low-FLOPs models simultaneously on devices with limited resources to complete different tasks. However, our algorithm still has room for optimization in terms of FPS and Params.

5.1.2. Result on HRSC2016

Table 4 presents the experimental outcomes on the HRSC2016 dataset, affirming the efficacy of the approach introduced in this paper. By combining our proposed DRC-ResNet backbone network and AugFPN+ feature fusion network with Oriented R-CNN, the model has stronger capabilities in feature representation and the integration of contextual information for targets in scenes with arbitrary orientation and scale diversity. Our proposed method achieves an mAP of 90.4 %, an increase of approximately 0.4 % to 1.91 % compared to some of the latest models.
From Figure 13, it can be seen that our model has achieved competitive results in ship detection. To be specific, due to the adaptability of the model in handling objects of different orientations and scales, our method has excellent localization and recognition capabilities for small, medium, and large ships.

5.2. Ablation Experiments

We conducted an ablation study to analyze the impact of different designs on the performance of rotated object detection. For convenience, we use abbreviations for module names. Among them, Baseline represents the original network, “DRC” represents the Dynamic Rotational Convolution, “MOSCAB” represents the Multi-Order Spatial-Channel Aggregation Block, and “ARU” represents the Adaptive Routing Unit. Initially, we demonstrated the effectiveness of the DRC and MOSCAB modules. Subsequently, we further explored how various replacement strategies affect detection performance. Lastly, we investigated the architectural design of the ARU module.
Effectiveness of DRC and MOSCAB: The DRC module is responsible for adaptively extracting features of targets oriented at various angles from the input remote sensing images. As shown in the second row of Table 5, when the baseline (Oriented RCNN) was equipped with DRC, the mAP on the DOTA-V1.0 dataset increased by 1.09 % (from 75.81 % to 76.9 %). This gain demonstrates the effectiveness of the DRC module in improving remote sensing target detection and in extracting features of targets at various orientations. The module generates different convolution kernels through data-driven methods to improve the quality of feature extraction in scenes with inconsistent target orientations.
Specifically, we use the Involution operator to encode the input image, which has two benefits. Firstly, it can obtain a wider range of spatial contextual information. Secondly, it can adaptively allocate different weights to spatial positions with different statuses, which is more conducive to finding visual elements with higher contributions to the foreground in complex background scenes. Finally, the obtained effective features are passed to two branches, namely the angle prediction branch and the combined weight prediction branch, providing effective guidance for the parameters required for the rotation of the convolution kernel.
MOSCAB is responsible for selectively focusing on and globally aggregating the angle-rich features captured by the DRC's rotated convolution kernels, enhancing the perception of key foreground information in the image. As shown in the second and third rows of Table 5, when the baseline was equipped with both DRC and MOSCAB, the mAP increased by 0.56 % (from 76.9 % to 77.46 %) compared to a network equipped only with DRC. This gain demonstrates the effectiveness of MOSCAB in enhancing the ability to perceive critical foreground information.
As mentioned earlier, human vision perceives objects by extracting basic contextual features and associating individual features with attention. To make object perception more accurate, it is necessary to simultaneously learn different contextual features and multi-order interactions. Specifically, MOSCAB has a convolutional structure that extracts local multi-order features, and by introducing MGEM to model the interaction of multi-order features, different contextual features and multi-order interactions can be learned simultaneously.
Effectiveness of AugFPN+: AugFPN+ is responsible for combining contextual information of different scales more effectively, alleviating the loss of spatial position information of small targets caused by increased backbone depth, downsampling, and pooling operations. As shown in Table 6, comparing the detection performance of different feature fusion networks, the mAP is significantly improved when equipped with AugFPN+ relative to the baseline, increasing from 76.02 % to 77.56 %. This improvement demonstrates the effectiveness of the proposed AugFPN+ in feature fusion. This fusion method not only adaptively combines contextual information of different scales through ASF+ but also reduces the information loss of the highest-level features through residual connections with the features extracted by RFA+.
Specifically, compared to AugFPN, we use more scale branches on ASF and RFA modules to enhance the ability to maintain spatial position information of different scales, especially small targets. The RFA+ formed from this can integrate various spatial information back into the original branch through residual connections, reducing the information loss of the highest-level features. In addition, the resulting ASF+ can generate a spatial weight for each feature and use the weight to adaptively combine the context.
Comparison of different feature enhancement methods: MOSCAB is a feature enhancement method responsible for selectively focusing on and globally aggregating the angle-adaptive features extracted by the DRC module to increase attention to foreground target regions. As shown in Table 7, compared with other feature enhancement methods such as CBAM, Context Anchor Attention (CAA), and SE-Net, the mAP is significantly improved when the model is equipped with MOSCAB, reaching the highest value of 77.56 %. This improvement shows that the MOSCAB proposed in this paper is more effective than the other attention mechanisms.
Specifically, to make object identification more precise, it is essential to learn various contextual features and interactions at different stages at the same time. MOSCAB has a convolutional design that can extract local features at different stages. By incorporating MGEM to model the interaction of features at different stages, the model can learn various contextual features and interactions at different stages simultaneously, thus boosting its ability to identify targets.
Replacement strategy of DRC: Since the FPN fuses the feature maps from the last three stages of the backbone, this paper only replaces the convolutional layers in these last three stages. Additionally, since 1 × 1 convolutions are rotationally invariant, an ablation study was conducted to explore the impact of replacing the 3 × 3 convolutions in the last three stages of the backbone on network performance. The ablation experiments were carried out on the ResNet50 backbone.
According to the results in Table 8, it can be found that when replacing one stage (Stage 2), mAP increased from 74.92 % of the baseline to 76.75 % . By further replacing the latter two stages, mAP continued to improve, proving that DRC contributed to the improvement of detection performance in different stages.
Architecture Design of ARU: We conducted ablation studies on the ARU module as well. Considering that the ARU module is required to predict the rotation angle θ and the combination weights λ based on the input feature map x, the spatial encoding method of the input features necessitates careful consideration. Spatial encoding here refers to the part between the input features and the Average Pooling. In terms of the ablation of spatial encoding methods, we compared Involution with the standard DWConv-LN-RELU approach.
From the results in Table 9, it can be found that among the different spatial encoding methods in the ARU, using the simplest DWConv results in poorer performance than the other convolutions. The spatial encoding based on Involution performed better than the other two, achieving an mAP of 77.21 %.

5.3. Visualization Results

In order to demonstrate the effectiveness of our method more intuitively, we show the results of GCA2Net on the HRSC2016 dataset (Figure 13) and the DOTA-V1.0 dataset (Figure 14). GCA2Net performs well on object detection in high-resolution aerial images; the visualizations show that it handles the rotated object detection task even in scenes with complex backgrounds, large variations in object scale, and dense object distributions.
To further evaluate GCA2Net, we randomly selected image samples from DOTA-V1.0 and ran the baseline model, S2ANet, RetinaNet, and GCA2Net on them; the detection results are compared in Figure 15. GCA2Net effectively identifies objects of different categories and scales against more complex backgrounds. In particular, for small objects, the numbers of missed and false detections are significantly reduced, yielding superior detection performance.
In addition, to reflect more intuitively how our model addresses the core challenges, we also compared its detection capability under different resolutions and observation angles on the DOTA-V1.0 dataset.
The experiment on spatial resolution changes: We simulated resolution changes by applying crops of different sizes to the original test images and observed how the detection performance changes for typical categories. Due to space limitations, only the visualization results for typical categories are presented, as shown in Figure 16. To keep the experiment general, the crop size is chosen at random. To clearly reflect the change in image resolution, the original image size is retained after each crop and the remaining area is filled with zeros, i.e., black. It can be observed that our model still effectively locates typical objects such as planes, tennis courts, and storage tanks and maintains good detection accuracy when the image resolution changes.
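A minimal sketch of this crop-and-zero-pad procedure is given below. The crop-size range is an assumption for illustration, since the text only states that crop sizes are random; the function name is our own.

import random
import numpy as np

def random_crop_zero_pad(image):
    """Keep a random crop in place and zero out (blacken) the rest of the original canvas."""
    h, w = image.shape[:2]
    ch, cw = random.randint(h // 2, h), random.randint(w // 2, w)    # assumed crop-size range
    y0, x0 = random.randint(0, h - ch), random.randint(0, w - cw)
    canvas = np.zeros_like(image)                                    # zero (black) background
    canvas[y0:y0 + ch, x0:x0 + cw] = image[y0:y0 + ch, x0:x0 + cw]   # original size is preserved
    return canvas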
The experiment on observation angle changes: We simulated changes in the observation angle by rotating the original test images by different angles and observed how the detection performance changes for typical categories. Due to space limitations, only the visualization results for typical categories are presented, as shown in Figure 17. For ease of comparison, the original images were rotated by 30, 60, 90, 120, 150, and 180 degrees. Our model still detects typical objects such as planes, tennis courts, harbors, and ships effectively when the observation angle changes, and the oriented boxes remain accurately positioned.
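The rotation procedure can be reproduced with a few lines of OpenCV, as sketched below. The image path "P0001.png" is a placeholder, and regions rotated out of the frame are clipped while empty corners become black; this is an illustrative sketch, not the authors' exact script.

import cv2

def rotate_keep_size(image, angle_deg):
    """Rotate an image about its center by angle_deg, keeping the original canvas size."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))   # out-of-frame pixels become black

# the angles used in this experiment
image = cv2.imread("P0001.png")                    # placeholder test image path
rotated = [rotate_keep_size(image, a) for a in (30, 60, 90, 120, 150, 180)]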

6. Conclusions

This paper introduces a novel rotated object detector, GCA2Net, which adapts to the orientations of objects within images for the task of aerial image object detection. We propose Dynamic Rotational Convolution (DRC) to extract features at various angles through the rotation of convolutional kernels, obtaining features that are better suited to the characteristics of aerial imagery. Considering the varying density between background and foreground in remote sensing images, we further introduce the Multi-Order Spatial-Channel Aggregation Block (MOSCAB), which enables the model to selectively focus on the regions that require closer attention. Finally, we propose a new feature fusion approach that builds upon the Feature Pyramid Network (FPN) with the Residual Feature Augmentation+ (RFA+) and Adaptive Spatial Fusion+ (ASF+) modules; by reducing the semantic gap between hierarchical levels before fusion, it minimizes the loss of spatial information for small objects. Compared with other mainstream rotated object detectors, our model achieves the best accuracy on the DOTA-V1.0 and HRSC2016 datasets. In the future, we plan to extend and optimize the algorithm in two directions: real-time detection capability and lightweight model design. Specifically, more efficient inference strategies can be explored to improve detection speed, and model compression and quantization techniques can substantially reduce computational complexity while maintaining detection accuracy. Our method can also be applied and extended in other fields. In autonomous driving, where background scenarios are complex, the MOSCAB module's selective focusing and global information aggregation can effectively distinguish foreground from background and enhance the perception of complex traffic scenes. In industrial automation, where the angles and orientations of parts vary, the DRC module can better extract the features of parts at different orientations, improving the accuracy of part detection.

Author Contributions

Conceptualization, S.Z.; software, S.Z.; validation, J.Z., Y.W. and H.Z.; formal analysis, S.Z. and G.Q.; resources, G.Q., Y.W. and H.L.; original draft preparation, S.Z., H.L. and Z.L.; funding acquisition, Y.L. and J.Z.; review and editing, S.Z., H.L., Z.L., J.Z., Y.W. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Frontier Research Fund of the Institute of Optics and Electronics, Chinese Academy of Sciences under Grant number C24K003 and Western Young Scholar Program of the Chinese Academy of Sciences.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. Visdrone-det2018: The vision meets drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 437–468. [Google Scholar]
  2. Xie, S.; Zhou, M.; Wang, C.; Huang, S. CSPPartial-YOLO: A lightweight YOLO-based method for typical objects detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2023, 17, 388–399. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  4. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912. [Google Scholar] [CrossRef]
  5. Sharma, M.; Dhanaraj, M.; Karnam, S.; Chachlakis, D.G.; Ptucha, R.; Markopoulos, P.P.; Saber, E. YOLOrs: Object detection in multimodal remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1497–1508. [Google Scholar] [CrossRef]
  6. Ren, Y.; Zhu, C.; Xiao, S. Small object detection in optical remote sensing images via modified faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef]
  7. Zheng, Q.; Zheng, L.; Bai, Y.; Liu, H.; Deng, J.; Li, Y. Boundary-aware network with two-stage partial decoders for salient object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605713. [Google Scholar] [CrossRef]
  8. Gao, P.; Tian, T.; Zhao, T.; Li, L.; Zhang, N.; Tian, J. Double FCOS: A two-stage model utilizing FCOS for vehicle detection in various remote sensing scenes. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4730–4743. [Google Scholar] [CrossRef]
  9. Bai, T.; Pang, Y.; Wang, J.; Han, K.; Luo, J.; Wang, H.; Lin, J.; Wu, J.; Zhang, H. An optimized faster R-CNN method based on DRNet and RoI align for building detection in remote sensing images. Remote Sens. 2020, 12, 762. [Google Scholar] [CrossRef]
  10. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  11. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  12. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  13. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  14. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou loss: Towards accurate oriented object detection in complex environments. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 195–211. [Google Scholar]
  15. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. Proc. Aaai Conf. Artif. Intell. 2021, 35, 2458–2466. [Google Scholar] [CrossRef]
  16. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, PMLR, Shenzhen China, 26 February–1 March 2021; pp. 11830–11841. [Google Scholar]
  17. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
  18. Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU loss for rotated object detection. arXiv 2022, arXiv:2201.12558. [Google Scholar]
  19. Guo, Z.; Liu, C.; Zhang, X.; Jiao, J.; Ji, X.; Ye, Q. Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8792–8801. [Google Scholar]
  20. Hou, L.; Lu, K.; Yang, X.; Li, Y.; Xue, J. G-rep: Gaussian representation for arbitrary-oriented object detection. Remote Sens. 2023, 15, 757. [Google Scholar] [CrossRef]
  21. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
  22. Yang, X.; Zhang, G.; Yang, X.; Zhou, Y.; Wang, W.; Tang, J.; He, T.; Yan, J. Detecting rotated objects as gaussian distributions and its 3-d generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4335–4354. [Google Scholar] [CrossRef]
  23. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Yang, X. Sparse label assignment for oriented object detection in aerial images. Remote Sens. 2021, 13, 2664. [Google Scholar] [CrossRef]
  24. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic anchor learning for arbitrary-oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 2355–2363. [Google Scholar]
  25. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2024, 36, 34892–34916. [Google Scholar]
  26. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  27. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 4701117. [Google Scholar] [CrossRef]
  28. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Amini, M.R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Springer: Cham, Switzerland, 2023; pp. 443–459. [Google Scholar]
  29. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604. [Google Scholar]
  30. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  31. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  32. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  33. Wang, J.; Lin, Y.; Guo, J.; Zhuang, L. SSS-YOLO: Towards more accurate detection for small ships in SAR image. Remote Sens. Lett. 2021, 12, 93–102. [Google Scholar] [CrossRef]
  34. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. AAAI Conf. Artif. Intell. 2021, 35, 3163–3171. [Google Scholar] [CrossRef]
  35. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. Scrdet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
  36. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  37. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
  38. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. AAAI Conf. Artif. Intell. 2022, 36, 923–932. [Google Scholar] [CrossRef]
  39. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  41. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  42. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  43. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6589–6600. [Google Scholar]
  44. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12321–12330. [Google Scholar]
  45. Quinlan, P.T. Visual feature integration theory: Past, present, and future. Psychol. Bull. 2003, 129, 643. [Google Scholar] [CrossRef]
  46. Deng, H.; Ren, Q.; Zhang, H.; Zhang, Q. Discovering and explaining the representation bottleneck of dnns. arXiv 2021, arXiv:2111.06236. [Google Scholar]
  47. Cheng, X.; Chu, C.; Zheng, Y.; Ren, J.; Zhang, Q. A game-theoretic taxonomy of visual concepts in dnns. arXiv 2021, arXiv:2106.10938. [Google Scholar]
  48. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  49. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11207–11216. [Google Scholar]
  50. Xie, X.; Cheng, G.; Rao, C.; Lang, C.; Han, J. Oriented object detection via contextual dependence mining and penalty-incentive allocation. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5618010. [Google Scholar] [CrossRef]
  51. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296. [Google Scholar]
  52. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
  53. Li, C.; Xu, C.; Cui, Z.; Wang, D.; Zhang, T.; Yang, J. Feature-attentioned object detection in remote sensing imagery. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway Township, NJ, USA, 2019; pp. 3886–3890. [Google Scholar]
  54. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A semantic attention-based mask oriented bounding box representation for multi-category object detection in aerial images. Remote Sens. 2019, 11, 2930. [Google Scholar] [CrossRef]
  55. Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795. [Google Scholar]
  56. Lee, C.; Son, J.; Shon, H.; Jeon, Y.; Kim, J. FRED: Towards a full rotation-equivariance in aerial image object detection. AAAI Conf. Artif. Intell. 2024, 38, 2883–2891. [Google Scholar] [CrossRef]
  57. Zeng, Y.; Chen, Y.; Yang, X.; Li, Q.; Yan, J. ARS-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610315. [Google Scholar] [CrossRef]
  58. Song, B.; Li, J.; Wu, J.; Xue, S.; Chang, J.; Wan, J. Single-stage oriented object detection via Corona Heatmap and Multi-stage Angle Prediction. Knowl.-Based Syst. 2024, 295, 111815. [Google Scholar] [CrossRef]
  59. Xu, Y.; Wu, X.; Wang, L.; Xu, L.; Shao, Z.; Fei, A. HOFA-Net: A High-Order Feature Association Network for Dense Object Detection in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1513–1522. [Google Scholar] [CrossRef]
  60. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  61. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
  62. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Dong, Y.; Yang, X. Task interleaving and orientation estimation for high-precision oriented object detection in aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 241–255. [Google Scholar] [CrossRef]
  63. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5605814. [Google Scholar] [CrossRef]
  64. Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579. [Google Scholar]
  65. Pan, Y.; Xu, Y.; Wu, Z.; Wei, Z. Joint Multiscale Spatial-Frequency Domain Network for Oriented Object Detection in Remote Sensing Images. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; IEEE: Piscataway Township, NJ, USA, 2024; pp. 9425–9429. [Google Scholar]
  66. Xu, Y.; Dai, M.; Zhu, D.; Yang, W. Adaptive Angle Module and Radian Regression Method for Rotated Object Detection. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 6006505. [Google Scholar] [CrossRef]
Figure 1. The comparison between horizontal bounding boxes and oriented bounding boxes. (a) Horizontal bounding box annotation of objects. (b) Oriented bounding box annotation of objects.
Figure 2. Comparison of mainstream networks and our proposed network on the DOTA-V1.0 test set in terms of mAP and GFLOPs. The X-axis represents GFLOPs and the Y-axis represents mAP. GCA2Net achieves an excellent balance between mAP and GFLOPs.
Figure 3. The overall framework of GCA2Net. The detector first acquires feature maps through the backbone network DRC-ResNet, which incorporates Dynamic Rotational Convolution. Subsequently, the AugFPN+ network is employed for feature fusion to obtain enhanced feature maps that provide a better representation of rotational characteristics. The fusion network consists of the Residual Feature Augmentation+ module and the Adaptive Spatial Fusion+ module. Finally, the Oriented RPN network generates oriented candidate bounding boxes, which are then passed to the classification and regression heads for prediction.
Figure 4. The overall framework of DRC-ResNet backbone. This backbone network is built on ResNet50. Specifically, we replaced the 3 × 3 convolution of the last three stages (Stage 3, Stage 4, and Stage 5) with a sequential combination of DRC and MOSCAB.
Figure 5. Rotation mechanism of the convolution kernels. Different colors represent different convolution weights. (a) Initialize a convolutional kernel; (b) expand the kernel into a finer convolutional space through bilinear interpolation; (c) rotate the kernel and resample it, where each output weight is the average of the weights of all cells that the corresponding grid cell covers; (d) obtain the rotated convolution kernel.
Figure 6. Example of the convolution kernel rotation process. We use a 2 × 2 convolution kernel and rotate it by 45 degrees. "Interpolation" denotes the kernel expanded by bilinear interpolation; for convenience, the kernel is expanded to 4 × 4. Finally, we resample and take the average value over the region occupied by each grid cell to obtain the final rotated convolution kernel.
Figure 7. The overall framework of Dynamic Rotational Convolution. MOSCAB acts on the feature map output by DRC. The weighted addition operation is shown in Formula (2).
Figure 8. The overall framework of MOSCAB.
Figure 9. The overall framework of Adaptive Spatial Fusion+ and Residual Feature Augmentation+. The Hadamard product denotes element-wise multiplication.
Figure 10. Schematic diagram of confusion matrix.
Figure 11. Data augmentation methods. The augmentation strategies shown above were employed during the training phase; no data augmentation was used during testing.
Figure 12. The model training progress.
Figure 13. The detection results of the HRSC2016 dataset. The model achieved accurate detection results for ships of different categories and scales, demonstrating its adaptability to target orientation and target scale.
Figure 14. Visualization performance on the DOTA-V1.0 dataset is shown in the figure, which displays 15 object categories. The model has good adaptability to target orientation and target scale.
Figure 15. Comparison of the detection performance of different methods on the DOTA-V1.0 dataset. Our method outperforms the others in reducing false positives. For example, Oriented RCNN and S2ANet both produced false detections for swimming pool (pink box), large vehicle (orange box), and ship (red box), and S2ANet additionally produced a false detection for helicopter (blue box). RetinaNet missed the harbor (yellow box) and also made errors on the swimming pool, harbor, and ground track field.
Figure 16. Experimental results of spatial resolution variation. Due to space limitations, we have provided results for typical categories. When the resolution of the image changes, typical objects such as planes (purple box), tennis courts (light blue box), and storage tanks (green box) can be effectively located, and good detection accuracy can be maintained.
Figure 17. Experimental results of observation angle changes. Due to space limitations, we provide visualization results for typical categories. Our model effectively detects typical objects such as planes (purple box), tennis courts (light blue box), harbors (yellow box), and ships (red box) when the observation angle changes, and the oriented boxes remain accurately positioned.
Table 1. The model training parameters.
Parameter | Value
Batch Size | 2
Input Size | 1024 × 1024
Epoch | 12
Optimizer | SGD
Learning Rate | 0.005
Weight Decay | 0.0001
Momentum | 0.9
Table 2. Quantitative comparisons with state-of-the-art methods on DOTA-V1.0 test set. The best and second-best results are highlighted in bold and underline, respectively. ↑ indicates that the higher the numerical value, the better the model performance.
Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP (↑)
One-stage
DRN [49] | 88.91 | 80.22 | 43.52 | 63.35 | 73.48 | 70.69 | 84.94 | 90.14 | 83.85 | 84.11 | 50.12 | 58.41 | 67.62 | 68.60 | 52.50 | 70.70
R3Det [34] | 88.76 | 83.09 | 50.91 | 67.27 | 76.23 | 80.39 | 86.72 | 90.78 | 84.68 | 83.24 | 61.98 | 61.35 | 66.91 | 70.63 | 53.94 | 73.79
RSDet [15] | 89.80 | 82.90 | 48.60 | 65.20 | 69.50 | 70.10 | 70.20 | 90.50 | 85.60 | 83.40 | 62.50 | 63.90 | 65.60 | 67.20 | 68.00 | 72.20
DAL [24] | 88.68 | 76.55 | 45.08 | 66.80 | 67.00 | 76.76 | 79.74 | 90.84 | 79.54 | 78.45 | 57.71 | 62.27 | 69.05 | 73.14 | 60.11 | 71.44
S2ANet [12] | 89.30 | 80.11 | 50.97 | 73.91 | 78.59 | 77.34 | 86.38 | 90.91 | 85.14 | 84.84 | 60.45 | 66.94 | 66.78 | 68.55 | 51.65 | 74.13
G-Rep [20] | 88.89 | 74.62 | 43.92 | 70.24 | 67.26 | 67.26 | 79.80 | 90.87 | 84.46 | 78.47 | 54.59 | 62.60 | 66.67 | 67.98 | 52.16 | 70.59
CFA [19] | 89.08 | 83.20 | 54.37 | 66.87 | 81.23 | 80.96 | 87.17 | 90.21 | 84.32 | 86.09 | 52.34 | 69.94 | 75.52 | 80.76 | 67.96 | 76.67
DFDet [50] | 89.41 | 82.42 | 49.93 | 70.63 | 79.57 | 79.02 | 87.22 | 90.91 | 82.80 | 84.49 | 62.05 | 64.26 | 72.30 | 72.90 | 58.29 | 75.08
Two-stage
VitDet [51] | 88.38 | 75.86 | 52.24 | 74.42 | 78.52 | 83.22 | 88.47 | 90.86 | 77.18 | 86.98 | 48.95 | 62.77 | 76.66 | 72.97 | 57.48 | 74.41
CAD-Net [52] | 87.80 | 82.40 | 49.40 | 73.50 | 71.10 | 63.50 | 76.60 | 90.90 | 79.20 | 73.30 | 48.40 | 60.90 | 62.00 | 67.00 | 62.20 | 69.90
RoI Trans [11] | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56
SCRDet [36] | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61
G Vertex [13] | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 | 75.02
FAOD [53] | 90.21 | 79.58 | 45.49 | 76.41 | 73.18 | 68.27 | 79.56 | 90.83 | 83.40 | 86.48 | 53.40 | 65.42 | 74.17 | 69.69 | 64.86 | 73.28
Mask OBB [54] | 89.61 | 85.09 | 51.85 | 72.90 | 75.28 | 73.23 | 85.57 | 90.37 | 82.08 | 85.05 | 55.73 | 68.39 | 71.61 | 69.87 | 66.33 | 74.86
ReDet [55] | 88.79 | 82.64 | 53.97 | 74.00 | 78.13 | 84.06 | 88.04 | 90.89 | 87.78 | 85.75 | 61.76 | 60.39 | 75.96 | 68.07 | 63.59 | 76.25
AOPG [37] | 89.14 | 82.74 | 51.87 | 69.28 | 77.65 | 82.42 | 88.08 | 90.89 | 86.26 | 85.13 | 60.60 | 66.30 | 74.05 | 67.76 | 58.77 | 75.39
Oriented RCNN [10] | 86.42 | 78.97 | 52.47 | 69.84 | 77.30 | 75.99 | 86.72 | 90.89 | 82.63 | 85.66 | 60.13 | 68.25 | 73.98 | 72.22 | 62.37 | 74.92
GCA2Net (Ours) | 90.06 | 84.48 | 59.70 | 81.30 | 81.01 | 84.91 | 88.20 | 90.90 | 87.61 | 86.42 | 67.27 | 70.33 | 81.18 | 78.73 | 71.08 | 77.56
Other Models
FRED [56] | 89.37 | 82.12 | 50.84 | 73.89 | 77.58 | 77.38 | 87.51 | 90.82 | 86.30 | 84.25 | 62.54 | 65.10 | 72.65 | 69.55 | 63.41 | 75.56
ARS-DETR [57] | 87.65 | 76.54 | 50.64 | 69.85 | 79.76 | 83.91 | 87.92 | 90.26 | 86.24 | 85.09 | 54.58 | 67.01 | 75.62 | 73.66 | 63.39 | 75.47
SODC [58] | 88.54 | 84.72 | 51.21 | 70.18 | 79.43 | 80.82 | 88.56 | 90.66 | 87.28 | 86.49 | 55.37 | 64.59 | 68.22 | 71.03 | 64.85 | 76.06
HOFA-Net [59] | 90.42 | 76.64 | 47.57 | 59.01 | 73.41 | 85.64 | 89.29 | 90.76 | 73.30 | 89.44 | 71.15 | 69.39 | 75.16 | 67.05 | 74.77 | 75.53
Table 3. Comparison of computational efficiency and accuracy. (↑) indicates that higher values are better; (↓) indicates that lower values are better. Computational efficiency was tested on an Nvidia GeForce RTX 4090D GPU.
ModelParams (M) (↓)FLOPs (G) (↓)FPS (img/s) (↑)mAP
G. Vertex43.05211.722.975.02
OrientedRCNN42.92211.7123.174.92
RetinaNet_OBB38.19216.1923.368.42
RoITransformer56.90225.5614.469.56
S2ANet38.02171.3723.074.13
GCA2Net (Ours)80.5196.7512.577.56
Table 4. Quantitative comparisons with preceding state-of-the-art methods on HRSC2016 test set. The best and second-best results are highlighted in bold and underline, respectively.
Method | Backbone | mAP
Rotated RetinaNet [60] | ResNet-50 | 85.10
R2PN [61] | IMP-VGG-16 | 79.6
TIOE-Det [62] | ResNet-101 | 90.16
CFC-Net [63] | ResNet-101 | 89.70
RSDet [15] | ResNet-50 | 86.5
R2CNN [64] | IMP-ResNet-101 | 73.10
RoI Trans [11] | IMP-ResNet-101 | 86.20
G. Vertex [13] | IMP-ResNet-101 | 88.20
R3Det [34] | IMP-ResNet-101 | 89.30
GWD [16] | ResNet-101 | 88.95
DAL [24] | IMP-ResNet-101 | 89.80
SLA [23] | ResNet-101 | 89.51
S2ANet [12] | IMP-ResNet-101 | 90.20
MSFN [65] | - | 90.00
AAM [66] | ResNet-50-AAM | 88.49
GCA2Net (Ours) | DRC-ResNet-50 | 90.40
Table 5. Ablation studies of the influence of DRC and MOSCAB on the DOTA-V1.0 dataset where “✓” indicates that the module is used. The best result is highlighted in bold.
Baseline | DRC | MOSCAB | mAP
✓ |  |  | 75.81
✓ | ✓ |  | 76.9
✓ | ✓ | ✓ | 77.46
Table 6. The ablation study of feature fusion methods on the DOTA-V1.0 dataset. Here, “✓” indicates that the module is used. ↑ indicates that the higher the numerical value, the better the model performance. The best result is highlighted in bold.
FPN | AugFPN | AugFPN+ | mAP (↑)
✓ |  |  | 76.02
 | ✓ |  | 76.59
 |  | ✓ | 77.56
Table 7. The ablation study of different feature enhancement methods on the DOTA-V1.0 dataset. Here, “✓” indicates that the module is used. ↑ indicates that the higher the numerical value, the better the model performance. The best result is highlighted in bold. In this table, CAA stands for Context Anchor Attention.
MOSCAB | SE-Net | CAA | CBAM | mAP (↑)
✓ |  |  |  | 77.56
 | ✓ |  |  | 76.33
 |  | ✓ |  | 68.78
 |  |  | ✓ | 76.89
Table 8. Ablation study of the replacement strategy on the DOTA-V1.0 dataset. Here, “✓” indicates that the stage is replaced. The best result is highlighted in bold.
Stage2 | Stage3 | Stage4 | mAP
✓ |  |  | 76.75
✓ | ✓ |  | 77.08
✓ | ✓ | ✓ | 77.56
Table 9. Ablation study of the spatial encoding ways of the ARU on the DOTA-V1.0 dataset. Here, “✓” indicates that the module is used. The best result is highlighted in bold.
DWConv | WTConv | Involution | mAP
✓ |  |  | 76.24
 | ✓ |  | 76.30
 |  | ✓ | 77.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
