Article

Task-Aligned Oriented Object Detection in Remote Sensing Images

1 College of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China
2 XJ Electric Co., Ltd., Xuchang 461000, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(7), 1301; https://doi.org/10.3390/electronics13071301
Submission received: 31 January 2024 / Revised: 23 March 2024 / Accepted: 28 March 2024 / Published: 30 March 2024
(This article belongs to the Special Issue Image and Video Processing Based on Deep Learning)

Abstract

Oriented object detection (OOD) can recognize and locate various objects more precisely than horizontal object detection; however, two problems have not been satisfactorily resolved so far. Firstly, the absence of interaction between the classification and regression branches leads to inconsistent performance in the two tasks of object detection. Secondly, the traditional convolution operation cannot precisely extract the features of objects with extreme aspect ratios in remote sensing images (RSIs). To address the first problem, the task-aligned detection module (TADM) and the task-aligned loss function (TL) are proposed in this paper. On the one hand, a spatial probability map and a spatial offset map are inferred from the shared features in the TADM and separately incorporated into the classification and regression branches to obtain consistency in the two tasks. On the other hand, the TL combines the generalized intersection over union (GIoU) metric with the classification loss to further enhance the consistency in the two tasks. To address the second problem, a two-stage detection framework based on alignment convolution (TDA) is proposed. The features extracted from the backbone network are refined through alignment convolution in the first stage, and the final OOD results are inferred from the refined features in the second stage. The ablation study verifies the effectiveness of the TADM, TL, and TDA. Comparisons with other advanced methods on two RSI benchmarks demonstrate the overall effectiveness of our method.

1. Introduction

In remote sensing images (RSIs), oriented object detection (OOD) methods have demonstrated greater precision in identifying objects of arbitrary orientation compared to horizontal object detection (HOD) [1,2]. OOD's superiority primarily lies in its ability to accurately identify and locate objects with irregular shapes or orientations, thereby extracting more accurate information from complex RSIs [3,4]. The applications of OOD are extensive, including intelligent transportation [5], disaster rescue [6], military base reconnaissance [7], and the analysis of X-ray images [8], MRI images [9], and CT scans [10], etc. [11,12].
So far, most OOD models have been devised to address the challenge of arbitrary-oriented object detection in RSIs. For example, the rotated region proposal network [13] imitated the practice of setting anchors of different scales at anchor points in the horizontal domain and additionally set anchors of different angles, allowing the boxes to surround the object more closely. The RoI Transformer [14] takes this concept further by transforming horizontal boxes into oriented ones, thereby learning the offsets in the process. Essentially, the RoI Transformer consists of two main components: the rotated RoI (RRoI) learner and RRoI warping. The key idea advanced by this model is the transition from the traditional horizontal RoI to a more sophisticated RRoI. Conversely, the Oriented R-CNN detection framework [15] employs a simpler approach: it directly transforms horizontal boxes into oriented proposal boxes using a six-parameter midpoint offset representation and achieves precise prediction results with only three preset anchors at each position. Furthermore, the recently developed EIA-PVT object detector [16] is based on an efficient inductive vision transformer framework for OOD, comprising an adaptive multigrained routing mechanism, a compact dual-path encoding architecture, and an angle tokenization technique. OASL [17] is another recently developed detection method, which introduced an orientation-aware structured information extraction module for capturing spatial contextual features. Furthermore, DDC-SCAM [18] is a novel method which integrates dynamic deformable convolution and self-normalizing channel attention for OOD in RSIs. Other recently proposed methods include KFIoU [19], RIDet [20], and CFCNet [21].
Despite significant advancements in the performance of traditional OOD methods, two major obstacles remain to be addressed. Firstly, these methods have failed to bridge the gap between the classification branch and the bounding box regression (BBR) branch, because the two branches are independent of each other. Specifically, the classification branch predicts the category of an object according to the features extracted from it, while the BBR branch uses the features to predict the location of an object, indicated by the angle, height, width, and center-point coordinates of the bounding box. Due to the lack of feature interaction, it is difficult to achieve consistency between the results of classification and regression. Consequently, this leads to a dilemma: the classification score may be the highest, but the box predicted by the corresponding regression branch is not necessarily the most accurate one, and vice versa.
Secondly, these methods struggle with the problem of extreme aspect ratio (EAR) present in RSIs. Because RSIs contain elongated objects, the sampling points of standard convolution kernels struggle to accurately represent these types of objects. Although dilated convolution can effectively enlarge the sampling range, it only achieves proportional expansion. Figure 1a illustrates the impact of the sampling positions of a standard 3 × 3 convolution on the effectiveness of feature extraction. When extracting object features in RSIs with traditional convolution kernels, a lot of irrelevant background information is often included, which can cause the network to learn inaccurate object features.
In order to address the two problems mentioned above, a task-aligned oriented object detection (TAOD) model is proposed in this paper. For the first problem, two key components are introduced, i.e., the task-aligned detection model (TADM) and task-aligned loss function (TL). On the one hand, the TADM infers a spatial probability map (SPM) and spatial offset map (SOM) from the feature interaction module, both of which are separately incorporated into the classification and regression branches to ensure the consistency of the two tasks. On the other hand, the TL further enhances the consistency of the two tasks by combining the generalized intersection over union (GIoU) with the classification loss.
For the second problem, a two-stage detection framework on the basis of AlignConv (TDA) is proposed. In the initial stage, to precisely capture the objects with EAR, the features extracted from the backbone network are refined by AlignConv, which is guided by the OOD result inferred from the TADM. In the refinement stage, the TADM is used to infer the OOD result again on the basis of the refined features, where the SPM (SOM) is replaced by the average value of the SPM (SOM) given by the TADM in both stages.
The main contributions of this paper can be summarized as the following points:
  • The TADM and TL are proposed to solve the inconsistency problem between the classification and regression branches.
  • A two-stage scheme based on AlignConv is proposed to solve the problem that the common feature extraction methods cannot handle the objects with EAR.

2. Related Work

As the domain of deep learning continues to advance [22,23], object detection methodologies grounded in deep learning have emerged as the standard methods of choice [24,25,26,27]. Existing OOD methods typically fall into two primary types: anchor-based methods and anchor-free methods. The anchor-based methods can be further subdivided into one-stage and two-stage methods. Each of these methods has made considerable headway in OOD. Notable studies employing these methods and their inherent limitations, which motivate our method, are outlined below.

2.1. Anchor-Based One-Stage Methods

YOLO [28], a revolutionary one-stage method introduced by J. Redmon et al. in 2015, reshaped object detection with its unique whole-image, single-pass detection approach. Following YOLO, the SSD [29] model was developed by W. Liu et al. This model went a step further by implementing multi-reference and multi-resolution techniques and uniquely detecting objects of different scales at different network layers, marking an evolution in the field. RetinaNet [30] introduced focal loss to address the problem of class imbalance, thereby improving the model’s focus on ‘hard’, difficult-to-classify samples during the training phase. S2ANet [31] used a feature alignment module for better anchor alignment. GWD [32] enhances object detection models by representing rotated bounding boxes as two-dimensional Gaussian distributions. KFIoU [19] addresses the inconsistency between detection metrics and regression loss by using Gaussian modeling and Kalman filtering to approximate the skew intersection over union loss. DRN [33] adapted neuronal receptive fields to the shape and orientation of objects by using an advanced feature selection module. Other prominent anchor-based one-stage methods include CG-Net [34], RefineDet [35], RON [36], DSSD [37], DSOD [38], DARNet [39], and Double-Heads [40], etc.

2.2. Anchor-Based Two-Stage Methods

Over time, significant advancements have been made in two-stage detection methods, evolving from the region-proposal-dependent RCNN [41] to quicker and more accurate techniques. A major step forward in object detection has been the development of the RCNN family, including Fast [42] and Faster RCNN [43]. These advanced models, derived from RCNN, have streamlined the detection process by integrating the detection network and bounding box regressor into a single structure. This innovative setup has resulted in substantial speed improvements, exceeding 200 times the speed of the original RCNN. Subsequently, SPPNet [44] introduced a unique layer for processing varying image sizes, increasing speed but facing training challenges. Feature pyramid networks (FPN) [45] further advanced the field with a new structure that improved performance and influenced subsequent models. ReDet [46] enables accurate orientation prediction and a significant reduction in model size by explicitly encoding rotation equivariance and invariance. CenterMap-Net [47] introduces an additional feature map in the head of the network specifically for handling angle information. These excellent methods inspired our task alignment network. Anchor-based detectors frequently assign anchor boxes by calculating the intersection over union (IoU) between the anchor boxes and the ground truth; however, the ideal anchors for classification and regression often do not align and can differ significantly based on the shape and attributes of the objects. Following this, several models, like MPFP-Net [48] and CSL [49], have been developed to capture rotational features in detection. Other notable anchor-based two-stage methods include FRIOU [50], DPGN [51], Smooth GIoU [52], and R3Det [53], etc.

2.3. Anchor-Free Methods

Anchor-free methods in object detection use key points of objects to shape bounding boxes. In the early stages, anchor-free detectors used geometry-based assignment methods to identify a number of either pre-established or autonomously acquired key points, which served the dual purposes of classification and regression.
An example of these early methods was CornerNet [54], which was further refined by CornerNet-Lite [55] to enhance processing speed. Meanwhile, Grid R-CNN [56] introduced a second object detection stage by employing grid points to guide bounding box creation. ExtremeNet [57] leverages extreme and center points for bounding boxes, and CenterNet-DLA [58] identifies object centers along with other attributes. CenterNet [59] boosts precision and recall with a key point triplet, and RepPoints [60] uses sample points to delineate the size and shape of the object. To enhance the training process, oriented RepPoints [5] introduced an adaptive point learning scheme for the selection of representative samples. All of the above methods improved the capacity of object detectors to align the classification and regression tasks through the use of anchor-free detectors.
On the other hand, anchor-free methods in object detection use the center points of objects to shape bounding boxes, such as DenseBox [61] and GA-RPN [62], which concentrate on an object’s center for detection and use diverse strategies for positive predictions. FSAF [63] and FCOS-O [64] pioneer new techniques for detection, while CSP [65] employs the object’s center for specific detection tasks. DODet [66] adopts a two-stage approach. Initially, it is used to generate valuable proposals, and, subsequently, it performs a fine-grained evaluation of the generated proposals and locates and corrects potential misalignments in order to address the problem of spatial and feature misalignments. In a similar vein, AOPG [67] first generates primitive oriented bounding boxes without using predefined anchor boxes, which are subsequently fine-tuned into high-precision oriented proposals. The Gliding Vertex [68] method regresses four length ratios to improve the detection accuracy of objects with EAR by clarifying the ambiguity of near-horizontal objects. Other prominent anchor-free methods include FoveaBox [69], MidNet [70], and CentripetalNet [71], etc.

3. Proposed Method

3.1. Overview

The baseline method selected for this paper is RetinaNet-O, which employs R-50-FPN (denoted as ResNet50-FPN [72]) as its backbone network. Horizontal FPN features are selected with scaling factors of 16, 32, 64, 128, and 256. This section first discusses the alignment module incorporated within the detector; then, based on the coarsely aligned features, the proposed TDA module is introduced to refine the results. Finally, the TL is employed as the loss function for training the classification and regression branches in both stages.

3.2. Task-Aligned Detection Model

As depicted in Figure 2, features are first extracted by the ‘Backbone + FPN’, and then four successive convolutions are performed. This can be represented as
$$F_{1\sim k} = \begin{cases} \delta\left(\mathrm{conv}_k\left(P_h\right)\right), & k = 1 \\ \delta\left(\mathrm{conv}_k\left(F_{k-1}\right)\right), & k > 1 \end{cases}, \quad k \in \{1, 2, 3, 4\} \tag{1}$$
where $\delta$ and $\mathrm{conv}_k$ represent the scaled exponential linear unit (SELU) [73] and the $k$th convolution operation on the input feature layer, respectively, and $P_h \in \mathbb{R}^{H \times W \times C}$ represents the FPN features; richer features are obtained through the superposition of multiple convolutions. Equations (2) and (3) describe a layer attention operation that is computed separately for the classification and regression tasks, and Equations (4) and (5) calculate the dense classification scores and object bounding box coordinates for the respective tasks, as follows:
$$F_{1\sim k}^{cls} = \omega^{cls} \cdot F_{1\sim k} \tag{2}$$
$$\omega^{cls} = \sigma\left(fc_2\left(\delta\left(fc_1\left(F_{1\sim k}\right)\right)\right)\right) \tag{3}$$
$$C_{F_1} = \sigma\left(\delta\left(\mathrm{conv}_1\left(F_{1\sim k}^{cls}\right)\right)\right) \tag{4}$$
$$R_{F_1} = \Omega\left(\delta\left(\mathrm{conv}_3\left(F_{1\sim k}^{reg}\right)\right)\right) \tag{5}$$
where $\sigma$ denotes the sigmoid function, $\Omega$ represents the process of transforming distance predictions into a bounding box, and $\mathrm{conv}_1$ and $\mathrm{conv}_3$ are two 1 × 1 convolution kernels used to adjust the number of output channels. Only the classification scores $C_{F_1} \in \mathbb{R}^{H \times W \times K}$ are passed through the sigmoid function, whereas the oriented bounding boxes (OBBs) $R_{F_1} \in \mathbb{R}^{H \times W \times 5}$ are obtained from convolution operations without the sigmoid function.
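To make Equations (1)–(4) concrete, a minimal PyTorch sketch of the interactive feature stack, the layer attention, and the dense classification head is given below. The channel sizes, the number of classes, the global average pooling applied before the fully connected layers, and the 1 × 1 head convolution are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class TaskInteractionHead(nn.Module):
    """Minimal sketch of Eqs. (1)-(4): stacked interactive convolutions, layer
    attention, and a dense classification head. Channel sizes, the global pooling
    before the fc layers, and the 1x1 head are illustrative assumptions."""
    def __init__(self, channels: int = 256, num_convs: int = 4, num_classes: int = 15):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_convs)])
        self.act = nn.SELU()                                      # delta in Eq. (1)
        self.fc1 = nn.Linear(channels * num_convs, channels // 4)
        self.fc2 = nn.Linear(channels // 4, channels * num_convs)
        self.cls_head = nn.Conv2d(channels * num_convs, num_classes, 1)  # conv_1 in Eq. (4)

    def forward(self, p_h: torch.Tensor) -> torch.Tensor:
        feats, x = [], p_h
        for conv in self.convs:                                   # Eq. (1): F_{1~k}
            x = self.act(conv(x))
            feats.append(x)
        stack = torch.cat(feats, dim=1)                           # stacked interactive features
        w = stack.mean(dim=(2, 3))                                # global pooling (assumption)
        w = torch.sigmoid(self.fc2(self.act(self.fc1(w))))        # Eq. (3): omega^cls
        cls_feat = stack * w[:, :, None, None]                    # Eq. (2): F_{1~k}^cls
        return torch.sigmoid(self.act(self.cls_head(cls_feat)))   # Eq. (4): C_{F_1}

head = TaskInteractionHead()
scores = head(torch.randn(2, 256, 64, 64))                        # (2, 15, 64, 64) scores
```

The regression branch of Equation (5) follows the same pattern with its own attention weights and a 1 × 1 convolution producing five channels, followed by the box-decoding operation $\Omega$.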
In the alignment stage, the predictions of the classification and BBR branches are aligned by jointly considering the two tasks on the basis of the computed task-interaction features. The spatial distributions of the two predictions are adjusted via Equations (6)–(9) to further ensure consistency across both tasks, as follows:
$$C_{F_1}^{a} = C_{F_1} \times M_1 \tag{6}$$
$$R_{F_1}^{a}(i, j, v) = R_{F_1}\left(i + O_1(i, j, 2v),\; j + O_1(i, j, 2v + 1),\; v\right) \tag{7}$$
$$M_1 = \sigma\left(\mathrm{conv}_2\left(\delta\left(\mathrm{conv}_1\left(F_{1\sim k}\right)\right)\right)\right) \tag{8}$$
$$O_1 = \mathrm{conv}_4\left(\delta\left(\mathrm{conv}_3\left(F_{1\sim k}\right)\right)\right) \tag{9}$$
where $\mathrm{conv}_2$ and $\mathrm{conv}_4$ are two separate 3 × 3 convolution kernels; the former generates the classification probability prediction, whereas the latter produces the regression offset prediction. $C_{F_1}^{a} \in \mathbb{R}^{H \times W \times K}$ denotes the task-aligned classification features, $R_{F_1}^{a} \in \mathbb{R}^{H \times W \times 5}$ denotes the task-aligned regression features, the index $(i, j, v)$ indicates the spatial location $(i, j)$ in the $v$th channel of the tensor, and $M_1$ ($O_1$) denotes the SPM (SOM).
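The two alignment operations of Equations (6) and (7) can be sketched as follows; the per-channel bilinear sampling at the offset locations is an assumption, since the equations only specify the indexing.

```python
import torch
import torch.nn.functional as F

def align_predictions(cls_map, reg_map, M, O):
    """Minimal sketch of Eqs. (6)-(7). cls_map: (B, K, H, W); reg_map: (B, 5, H, W);
    M: (B, 1, H, W) spatial probability map; O: (B, 10, H, W) spatial offset map.
    Bilinear sampling at the offset locations is an assumption."""
    cls_aligned = cls_map * M                                   # Eq. (6)

    B, C, H, W = reg_map.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(reg_map)            # (H, W, 2) as (x, y)

    aligned = []
    for v in range(C):                                          # Eq. (7), one channel at a time
        off = O[:, 2 * v:2 * v + 2].permute(0, 2, 3, 1)         # (B, H, W, 2) as (di, dj)
        loc = base + off[..., [1, 0]]                           # reorder (dj, di) to (x, y)
        grid = torch.stack((2 * loc[..., 0] / (W - 1) - 1,      # normalize to [-1, 1]
                            2 * loc[..., 1] / (H - 1) - 1), dim=-1)
        aligned.append(F.grid_sample(reg_map[:, v:v + 1], grid, align_corners=True))
    return cls_aligned, torch.cat(aligned, dim=1)

cls_a, reg_a = align_predictions(torch.rand(2, 15, 64, 64), torch.randn(2, 5, 64, 64),
                                 torch.rand(2, 1, 64, 64), torch.randn(2, 10, 64, 64))
```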

3.3. Two-Stage Detection Framework Based on AlignConv

The features extracted from the backbone network are refined through alignment convolution in the first stage. Specifically, following the acquisition of the initial aligned OBBs, key points such as the midpoints of the box edges, the vertices of the predicted anchors, and the central point are taken as the sampling locations of the deformable convolution (DConv), which define its offsets. A further convolution layer is applied to the feature map to obtain the offsets in the x and y coordinates. This transition is visually represented in Figure 3a,b, where the original positions (marked by red dots) in Figure 3a are shifted to new locations (highlighted by blue dots) in Figure 3b. During the training phase, the convolution kernels generating the output features and those producing the offsets are learned synchronously. In particular, the learning of the offsets harnesses interpolation algorithms and is facilitated via backpropagation.
The final OOD results are inferred from the refined features in the second stage. In order to generate the refined features, the original FPN features and the aligned $R_{F_1}^{a}$ from the first stage are processed through an alignment convolution operation referred to as AlignConv, as illustrated in Figure 1b. The sampling points of the AlignConv kernel fully cover the transport vessel in Figure 1b; the central component of this operation is a DConv layer directed by an offset field for feature alignment.
The stack $AF_{1\sim k}$ of interactive features is derived following the procedure outlined in Equation (1). Subsequently, the alignment maps, denoted as $M_2$ and $O_2$, are learned from $AF_{1\sim k}$. Both $AF_{1\sim k}^{cls}$ and $\omega^{cls}_{a}$ are derived following the same methodology as presented in Equations (2) and (3). In particular, a mean operation integrates $M_1$ from the initial stage with $M_2$ from the refinement stage, as well as $O_1$ with $O_2$. Averaging the SPM and SOM of the two stages ensures that no valuable information is discarded. After the mean operation, the fused SPM and SOM guide $C_{F_2}$ and $R_{F_2}$ in executing the alignment operation. This process culminates in the derivation of the aligned classification branch $C_{F_2}^{a}$ and the aligned regression branch $R_{F_2}^{a}$ in the refinement stage, which are used as the TAOD output results. $C_{F_2}$, $R_{F_2}$, $C_{F_2}^{a}$, and $R_{F_2}^{a}$ are obtained in accordance with the methods detailed in Equations (4)–(7).
At the refinement stage, TDA corrects for misalignment between kernel sampling and object features, improving extraction for objects with EAR.
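A central step of the TDA is converting the decoded OBBs of the first stage into sampling offsets for the deformable 3 × 3 kernel. The sketch below illustrates one plausible way to do this for a batch of boxes; the exact construction of the box-aligned grid and the rounding of the base location are assumptions for illustration, not the implementation used in the paper. The fused SPM and SOM of the refinement stage are then simply the element-wise means of the two stages' maps.

```python
import torch

def alignconv_offsets(obbs: torch.Tensor, stride: float = 8.0, k: int = 3) -> torch.Tensor:
    """Sketch: derive k x k sampling offsets for a deformable convolution from
    oriented boxes (x, y, w, h, theta), with theta in radians and (x, y, w, h)
    in image pixels. Returns (N, 2 * k * k) offsets in feature-map units.
    The box-aligned grid construction is an illustrative assumption."""
    x, y, w, h, theta = obbs.unbind(dim=1)
    idx = torch.arange(-(k // 2), k // 2 + 1, dtype=obbs.dtype, device=obbs.device)
    ky, kx = torch.meshgrid(idx, idx, indexing="ij")            # regular k x k grid

    # k x k grid spanning the oriented box, rotated by theta
    gx = kx[None] * (w / (k - 1) / stride)[:, None, None]
    gy = ky[None] * (h / (k - 1) / stride)[:, None, None]
    cos, sin = torch.cos(theta)[:, None, None], torch.sin(theta)[:, None, None]
    sx = (x / stride)[:, None, None] + gx * cos - gy * sin      # aligned sampling x
    sy = (y / stride)[:, None, None] + gx * sin + gy * cos      # aligned sampling y

    # offsets = aligned sampling points minus the regular DConv sampling points
    base_x = (x / stride).round()[:, None, None] + kx[None]
    base_y = (y / stride).round()[:, None, None] + ky[None]
    offsets = torch.stack((sy - base_y, sx - base_x), dim=-1)   # DConv expects (dy, dx)
    return offsets.reshape(obbs.size(0), -1)

# one 100 x 20 pixel box rotated by 30 degrees, on a stride-8 feature map
boxes = torch.tensor([[256.0, 256.0, 100.0, 20.0, 0.5236]])
print(alignconv_offsets(boxes).shape)                           # torch.Size([1, 18])
```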

3.4. Task-Aligned Loss Function

In regression tasks, traditional methods employ horizontal bounding boxes (HBBs) to define the position and size of objects. However, this approach fails to accurately capture the rotation information and precise shapes of objects with arbitrary orientations, such as ships. To address this issue, the HBB calculation scheme can be modified to use OBBs, which better adapt to the rotation and shape characteristics of objects. A computation method based on the GIoU can be employed, adjusting the bounding box representation and calculation formula to cater to the properties of OBBs. The IoU is calculated between the predicted bounding box and the ground truth, represented as Q and P, respectively, as follows:
$$\mathrm{IoU} = \frac{\mathrm{Reg}(Q \cap P)}{\mathrm{Reg}(Q \cup P)}, \quad \mathrm{IoU} \in [0, 1] \tag{10}$$
where $\mathrm{Reg}(Q \cap P)$ and $\mathrm{Reg}(Q \cup P)$ correspond to the overlapping area (intersection) and the combined area (union) of the HBBs Q and P, respectively. With the modified GIoU calculation scheme [52], the similarity between two OBBs can be compared, and the detection results can be evaluated and optimized according to the similarity index. The computation of the GIoU in OOD is as follows:
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{\mathrm{Reg}(J) - \mathrm{Reg}(Q \cup P)}{\mathrm{Reg}(J)}, \quad \mathrm{GIoU} \in (-1, 1] \tag{11}$$
$$L_{\mathrm{GIoU}} = 1 - \mathrm{GIoU} \tag{12}$$
where $J$ is the minimal OBB that encloses both the ground truth box and the predicted box, $\mathrm{Reg}(J)$ is its area, and $L_{\mathrm{GIoU}}$ signifies the GIoU-based BBR loss. To align the classification task with the regression task, the TL combines the GIoU with the classification loss to further improve the consistency of the two tasks. This integration provides the basis for the classification loss, which is detailed below:
$$TL(p, \mathrm{GIoU}) = \begin{cases} -\mathrm{GIoU}\left[\mathrm{GIoU}\log p + (1 - \mathrm{GIoU})\log(1 - p)\right], & \mathrm{IoU} \geq 0.5 \\ -\alpha p^{\gamma}\log(1 - p), & \mathrm{IoU} < 0.5 \end{cases} \tag{13}$$
where $p \in [0, 1]$ signifies the predicted classification score, and the variables $\gamma$ and $\alpha$ serve as hyperparameters, with $\gamma$ set to 2 and $\alpha$ set to 0.75. When $\mathrm{IoU} \geq 0.5$, the positive sample is weighted with the GIoU, which transforms the GIoU between the predicted bounding box and the ground truth into a soft classification label. Notably, this approach is particularly effective in extracting information from positive samples, especially those that exhibit a high degree of overlap. The loss for the initial stage consists of two parts, the TL loss for classification and the GIoU loss for regression, articulated as follows:
$$TL_1 = TL + L_{\mathrm{GIoU}} \tag{14}$$
where $TL_1$ represents the training loss for the initial stage; the GIoU loss is chosen for the BBR branch, and the TL for the classification branch. Similarly to the first stage, the loss at the refinement stage is characterized as follows:
$$L = TL_1 + TL_2 \tag{15}$$
where $TL_2$ represents the loss function for the refinement stage and is calculated in the same way as Equation (14). The overall training loss $L$ is the sum of $TL_1$ and $TL_2$.
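A minimal sketch of the TL classification term, as reconstructed in Equation (13), is given below; the element-wise formulation, the handling of negative locations, and the sum reduction are assumptions for illustration.

```python
import torch

def tl_classification_loss(p, giou, iou, alpha: float = 0.75, gamma: float = 2.0):
    """Sketch of the TL classification term in Eq. (13). p: predicted scores in
    [0, 1]; giou, iou: GIoU / IoU of each prediction with its matched ground
    truth (pass iou = 0 for unmatched locations)."""
    p = p.clamp(1e-6, 1.0 - 1e-6)
    # IoU >= 0.5: GIoU-weighted soft label (binary cross-entropy against GIoU)
    pos = -giou * (giou * torch.log(p) + (1.0 - giou) * torch.log(1.0 - p))
    # IoU < 0.5: focal-style term that suppresses easy negatives
    neg = -alpha * p.pow(gamma) * torch.log(1.0 - p)
    return torch.where(iou >= 0.5, pos, neg).sum()

# toy usage: two well-matched predictions and one background location
p = torch.tensor([0.9, 0.4, 0.2])
giou = torch.tensor([0.8, 0.6, 0.0])
iou = torch.tensor([0.85, 0.65, 0.0])
print(tl_classification_loss(p, giou, iou))
```

Per Equations (14) and (15), this term is added to the GIoU regression loss in each stage, and the two stage losses are summed to obtain the overall training loss.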

4. Experiment

4.1. Datasets

DOTA is a comprehensive aerial image dataset [74] encompassing 2806 images and 188,282 instances across 15 common object categories, namely plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The training, validation, and testing sets include 1403, 467, and 936 images, respectively. The resolution of the images varies from 800 × 800 to 4000 × 4000 pixels. For single-scale training and testing, each image is divided into several sub-images of 1024 × 1024 pixels by using a step of 824 pixels, which ensures an overlap of 200 pixels between adjacent sub-images. For multiscale training and testing, each image is resized at three scales (0.5, 1.0, 1.5), and each resized image is divided into multiple sub-images of 1024 × 1024 pixels by using a step of 524 pixels.
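For reference, the sub-image origins implied by the single-scale splitting described above can be computed as in the sketch below; the handling of the image border is an assumption, since only the patch size and step are specified.

```python
def tile_origins(side: int, patch: int = 1024, step: int = 824) -> list[int]:
    """Sketch of the single-scale tiling described above: origins are spaced
    `step` pixels apart (a 200-pixel overlap for 1024-pixel patches); shifting
    the last tile inwards to cover the border is an assumption."""
    if side <= patch:
        return [0]
    origins = list(range(0, side - patch + 1, step))
    if origins[-1] != side - patch:          # cover the right/bottom border
        origins.append(side - patch)
    return origins

# a 4000-pixel image side yields origins [0, 824, 1648, 2472, 2976]
print(tile_origins(4000))
```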
The DIOR-R dataset [67] includes 23,463 images and 192,518 instances of 20 terrestrial object classes. These classes include airplane (APL), airport (APO), baseball field (BF), basketball court (BC), bridge (BR), chimney (CH), expressway service area (ESA), expressway toll station (ETS), dam (DAM), golf course (GF), ground track field (GTF), harbor (HA), overpass (OP), ship (SH), stadium (STA), storage tank (STO), tennis court (TC), train station (TS), vehicle (VE), and windmill (WM). The training, validation and testing sets include 5862, 5863 and 11,738 images, respectively. The resolution of all images is 800 × 800 pixels.
For the above two RSI datasets, the existing OOD models used the training and validation sets for training, and the testing set was used to evaluate the performance of the OOD model. For fair comparison, the above setting is also adopted by our method.

4.2. Implementation Details

Training details: This study employs the stochastic gradient descent algorithm for network optimization, with a momentum of 0.9 and a weight decay of 0.0001 [75]. A batch size of two is used, along with an initial learning rate and epoch count of 0.005 and 12 [5], respectively. The learning rate is further reduced by a factor of 1/10 at the 8th and 11th epochs [67]. Generally, the non-maximum suppression threshold is set to 0.1 for the DOTA dataset and 0.5 for the DIOR-R dataset. The ResNet-50 network is used as the backbone.
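The training schedule described above corresponds to a standard PyTorch setup such as the following sketch; the placeholder model and the loop body are illustrative only, since the actual detector is built with mmdetection.

```python
import torch

# Sketch of the optimizer and learning-rate schedule described above; the tiny
# placeholder model and dummy batch stand in for the detector.
model = torch.nn.Conv2d(3, 256, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0001)
# 12 epochs in total; the learning rate is divided by 10 at epochs 8 and 11
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # one dummy iteration stands in for a full epoch over batches of size 2
    loss = model(torch.randn(2, 3, 64, 64)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```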
Environment settings: All experiments were conducted using the PyTorch framework, and both the proposed method and the results of all comparative algorithms were obtained using the mmdetection library. The experiments were performed on a workstation equipped with E5-2630 V4 CPUs (2.2 GHz, 12 × 2 cores), 256 GB of memory, and a single Nvidia RTX 2080Ti GPU (with a total of 11 GB of memory).
Evaluation Metrics: Quantitative assessment of the experimental results was conducted using the average precision (AP) for each separate class, alongside the mean average precision (mAP) for all classes combined, with an Intersection over Union (IoU) threshold of 0.5.

4.3. Ablation Study

4.3.1. Ablation Study of Activation Function

As shown in Table 1, several ablation studies are performed on the DOTA dataset to evaluate the effectiveness of different activation functions. To evaluate the impact of random (natural) variability, the experiments are conducted ten times and the average value and standard deviation (STD) of the mAP are provided.
As shown in Table 1, the ReLU [76], leaky ReLU [77], exponential linear unit (ELU) [78], and scaled exponential linear unit (SELU) [73] are compared with each other. Obviously, the SELU obtains the highest average value and the minimum STD of the mAP. Therefore, the SELU is adopted as the activation function in our OOD models.

4.3.2. Ablation Study of the Proposed Modules

As shown in Table 2, several ablation studies are performed on the DOTA dataset to evaluate the effectiveness of the proposed modules. To evaluate the impact of random (natural) variability, the experiments are conducted ten times and the average value and standard deviation (STD) of the mAP are provided.
  • Effectiveness of TDA: To evaluate the impact of the proposed TDA, the baseline + TDA experiment is conducted. As shown in Table 2, the baseline + TDA method achieves 75.08% mAP with 0.039 STD, which shows an increase (decrease) of 6.65 (1.8) percentage points of the mAP (STD) in comparison to the baseline. Consequently, the effectiveness of the TDA is affirmed.
  • Effectiveness of the TADM: To evaluate the impact of the proposed TADM, the baseline + TADM experiment is conducted. As shown in Table 2, the baseline + TADM method achieves 73.77% mAP with 0.018 STD, which shows an increase (decrease) of 5.34 (3.9) percentage points of the mAP (STD) in comparison to the baseline. Consequently, the effectiveness of the TADM is affirmed.
  • Effectiveness of the TL: To evaluate the impact of the proposed TL, the baseline + TL experiment is conducted. As shown in Table 2, the baseline + TL method achieves 71.09% mAP with 0.025 STD, which shows an increase (decrease) of 2.66 (3.2) percentage points of the mAP (STD) in comparison to the baseline. Consequently, the effectiveness of the TL is affirmed.
When the TADM + TDA method was employed, the combination of the two modules achieved 76.45% mAP with 0.031 STD, which shows an increase (decrease) of 8.02 (2.6) percentage points of the mAP (STD) in comparison to the baseline. This indicates a strong interdependence between the TADM and TDA. In the case of adopting the TL + TADM and TL + TDA methods, the mAP reached 76.09% with 0.019 STD and 76.13% with 0.015 STD, respectively, which shows an increase (decrease) of 7.66 (3.8) percentage points and 7.7 (4.2) percentage points comparatively above the baseline. This suggests a substantial complementary correlation between the TADM and TL. The combination of TADM + TDA + TL approach resulted in 76.73% mAP with 0.014 STD, which shows an increase (decrease) of 8.3 (4.3) percentage points of the mAP (STD) in comparison to the baseline. Therefore, the efficacy of combining these approaches is effectively corroborated.

4.4. Comparison with the SOTA Methods

To further substantiate the comprehensive effectiveness of our method, the compiled results are compared with mainstream OOD methods on two commonly referenced RSI datasets. The approach is analyzed in comparison to six OOD methods, i.e., RetinaNet-O [30], Faster R-CNN [43], S2ANet [31], RoI Transformer [14], AOPG [67], and Oriented R-CNN [15], on both RSI datasets. Furthermore, comparisons are made with nine other OOD methods, such as DRN [33], LO-Det-GGHL [75], CenterMap-Net [47], DPGN [51], Oriented RepPoints [5], YOLOv2-O [28], Rep-YOLO [79], RSI-YOLOv5 [80], and YOLOv8-O [81] on the DOTA dataset. The assessment is broadened with an additional four OOD methods, including Double-Heads [40], Gliding Vertex [68], QPDet [53], and DODet [66] on DIOR-R. As shown in Table 3, the DOTA dataset includes 15 different object categories, denoted as PL, BD, BR, GTF, SV, LV, SH, TC, BC, ST, SBF, RA, HA, SP, and HC. The average precision (AP) of each category and mean average precision (mAP) of the 15 categories are compared for all OOD methods. The relevant experimental results and their corresponding analysis are presented as follows.
  • Results on DOTA Dataset: As indicated in Table 3, the experimental results demonstrate that our method achieved superior results under the single scale condition, registering a 76.73% mAP. Considering the mAP metric, our method surpasses DRN by 6.03 percentage points, YOLOv2-O by 37.53 percentage points, RSI-YOLOv5 by 5.53 percentage points, Rep-YOLO by 3.31 percentage points, YOLO v8-O by 2.99 percentage points, S2ANet by 2.61 percentage points, DPGN by 2.48 percentage points, RoI Transformer by 2.12 percentage points, AOPG by 1.49 percentage points, and Oriented RepPoints by 0.76 percentage points. For multiscale training and testing, the TAOD head (with 12 epochs) achieves an mAP of 80.76%, outperforming the best-in-class detector AOPG (also with 12 epochs), which trails behind by a 0.1 percentage point margin in terms of mAP. Although the highest AP in each category is not achieved by our model, the highest mAP is obtained by our model. To the best of our knowledge, no OOD methods can achieve the highest AP across all categories.
  • Results on the DIOR-R Dataset: As indicated in Table 4, our method reaches an mAP of 67.12%. Compared to SOTA OOD methods, our method shows better performance. Specifically, it outperforms Gliding Vertex with a 7.06 percentage point increase in mAP, outperforms S2ANet with a 4.22 percentage point increase, and outperforms RoI Transformer with a 3.25 percentage point improvement. It also beats QPDet with an increase of 2.92 percentage points, beats AOPG with an increase of 2.71 percentage points, beats Oriented R-CNN with an increase of 2.49 percentage points, and beats DODet with an increase of 2.02 percentage points.
In summary, the comparison of our method with others on the DOTA and DIOR-R datasets demonstrates its effectiveness. Figure 4 and Figure 5 offer a visual representation of the detection results on both datasets.

4.5. Analysis of the Computational Performance

To evaluate the computational efficiency of our method, the giga floating point operations (GFLOPs) and frames per second (FPS) of our method are provided in this section. As shown in Table 5, the GFLOPs of our model are 293.7, and the FPS is 14.9 (24.4) on the DOTA (DIOR-R) dataset using only a single RTX 2080Ti.

5. Discussion

As shown in Table 3 and Table 4, the overall performance of our method is the best on the DOTA and DIOR-R datasets, and the proposed TDA, TADM and TL play an important role in the improvement. For example, as shown in Table 4, the existing OOD methods give poor performance in the dam category because of the extreme aspect ratio of a dam; however, our method gives better performance by using the TDA which introduces alignment convolution. For another example, as shown in Table 4, the existing OOD methods also give poor performance in the vehicle category because the accurate location of a vehicle is difficult to ascertain; however, our method gives the best performance because the TADM and TL can enhance the interaction between the classification and bounding box regression.
Although better performance has been achieved in this paper, incorrect results are occasionally produced under some circumstances. The detailed reasons for these occasional incorrect results are as follows.
1. Dense detection: When objects from the same group are tightly clustered and share comparable visual characteristics, it may obstruct the process of identifying individual entities. This situation presents difficulties in accurately distinguishing between adjacent objects, potentially resulting in multiple objects being enclosed within a single oriented bounding box. Figure 6a demonstrates this, where three vehicles positioned closely together with similar appearances caused two vehicles to be contained within a single oriented bounding box.
2. Sunlight shadows: Shadows projected by foreground objects often have similar characteristics to the objects that cast them. This similarity can cause the shadow to be misidentified as part of the object, leading to inaccurate results. Instances of such occurrences are evident in Figure 6b,d, where the shadows of the storage tank and the windmill are contained within the oriented bounding box.
3. Analogous background: Background areas can share similar features with the objects in the foreground, adding complexity to the process of object identification, so the model might struggle to differentiate between the objects and the background. This is shown in Figure 6c, where the lorry compartment and the black car are similar in color to the asphalt road.

6. Conclusions

This article proposes the TADM and TL to address the lack of interaction between the classification and regression branches that causes inconsistent task performance. Firstly, an SPM and an SOM are inferred from the shared features in the TADM and are separately incorporated into the classification and regression branches to improve the consistency of the two tasks. Secondly, the TL combines the GIoU with classification loss to further enhance the consistency of the two tasks. Furthermore, a new TDA strategy is proposed to solve the problem that traditional convolution operations cannot precisely extract the features of objects with EAR in RSIs. Specifically, the features extracted from the backbone network are refined through AlignConv in the first stage, and then the TADM is used to infer the OOD result again based on refined features in the second stage. The SPM (SOM) in the second stage is replaced by the average value of SPM (SOM) given by the TADM in both stages. Ablation studies verify the effectiveness of the TADM, TL and TDA. Comparisons with other OOD methods on two RSI benchmarks further demonstrate the overall effectiveness of our method.
Considering the intra-class diversity and inter-class similarity in RSIs, solely using the features of objects may result in misclassification problems. Therefore, finding methods of leveraging contextual information is the key aim of our future work.

Author Contributions

Conceptualization, X.Q., J.Z. and B.W.; methodology, X.Q. and J.Z.; software, J.Z.; validation, X.Q. and W.W.; formal analysis, X.Q., J.Z. and Z.C.; resources, H.K.; writing—original draft, J.Z.; writing—review and editing, X.Q.; supervision, W.W.; project administration, H.K.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grants No. 62076223), Key Research Project of Henan Province Universities (Grant No. 24ZX005) and the Key Science and Technology Program of Henan Province (Grant No. 232102211018).

Data Availability Statement

The DOTA and DIOR-R datasets are available at following URLs: https://captain-whu.github.io/DOTA/dataset.html (accessed on 25 December 2022) and https://gcheng-nwpu.github.io/#Datasets (accessed on 25 December 2022), respectively.

Conflicts of Interest

Author Baokun Wu was employed by the company XJ Electric Co., Ltd. He declares no conflicts of interest. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, L.; Yao, X.; Wang, X.; Hong, D.; Cheng, G.; Han, J. Robust few-shot aerial image object detection via unbiased proposals filtration. IEEE Trans. Geosci. Remote Sens. 2023, 60, 5617011. [Google Scholar] [CrossRef]
  2. Xie, X.; Cheng, G.; Feng, X.; Yao, X.; Qian, X.; Han, J. Attention Erasing and Instance Sampling for Weakly Supervised Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600910. [Google Scholar] [CrossRef]
  3. Zeng, L.; Huo, Y.; Qian, X.; Chen, Z. High-Quality Instance Mining and Dynamic Label Assignment for Weakly Supervised Object Detection in Remote Sensing Images. Electronics 2023, 12, 2758. [Google Scholar] [CrossRef]
  4. Yao, Y.; Cheng, G.; Wang, G. On Improving Bounding Box Representations for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5600111. [Google Scholar] [CrossRef]
  5. Li, W.; Chen, Y.; Hu, K. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  6. Han, X.; Zhong, Y.; Zhang, L. An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sens. 2017, 9, 666. [Google Scholar] [CrossRef]
  7. Qian, X.; Li, J.; Cao, J.; Wu, Y.; Wang, W. Micro-cracks detection of solar cells surface via combining short-term and long-term deep features. Neural Netw. 2020, 127, 132–140. [Google Scholar] [CrossRef] [PubMed]
  8. Yen, T.-Y.; Ho, C.-S.; Chen, Y.-P.; Pei, Y.-C. Diagnostic Accuracy of Deep Learning for the Prediction of Osteoporosis Using Plain X-rays: A Systematic Review and Meta-Analysis. Diagnostics 2024, 14, 207. [Google Scholar] [CrossRef] [PubMed]
  9. Alhussainan, N.F.; Ben Youssef, B.; Ben Ismail, M.M. A Deep Learning Approach for Brain Tumor Firmness Detection Based on Five Different YOLO Versions: YOLOv3–YOLOv7. Computation 2024, 12, 44. [Google Scholar] [CrossRef]
  10. George, J.; Skaria, S.; Varun, V.V. Using YOLO based deep learning network for real time detection and localization of lung nodules from low dose CT scans. In Medical Imaging 2018: Computer-Aided Diagnosis; SPIE: Bellingham, WA, USA, 2018; Volume 10575, pp. 347–355. [Google Scholar]
  11. Montero-Valverde, J.A.; Organista-Vázquez, V.D.; Martínez-Arroyo, M.; de la Cruz-Gámez, E.; HernándezHernández, J.L.; Hernández-Bravo, J.M.; Hernández-Hernández, M. Automatic Detection of Melanoma in Human Skin Lesions. In Proceedings of the International Conference on Technologies and Innovation, Guayaquil, Ecuador, 13–16 November 2023; Springer Nature: Cham, Switzerland, 2023; pp. 220–234. [Google Scholar]
  12. Yu, J.; Gao, H.; Chen, Y.; Zhou, D.; Liu, J.; Ju, Z. Deep object detector with attentional spatiotemporal LSTM for space human–robot interaction. IEEE Trans. Hum. Mach. Syst. 2022, 52, 784–793. [Google Scholar] [CrossRef]
  13. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
  14. Ding, J.; Xue, N.; Long, Y. Learning roi transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  15. Xie, X.; Cheng, G.; Wang, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  16. Zhang, C.; Su, J.; Ju, Y.; Lam, K.M.; Wang, Q. Efficient inductive vision transformer for oriented object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5616320. [Google Scholar] [CrossRef]
  17. Zhao, Z.; Li, S. OASL: Orientation-aware adaptive sampling learning for arbitrary oriented object detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103740. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Ma, C.; Zhuo, L.; Li, J. Arbitrary-Oriented Object Detection in Aerial Images with Dynamic Deformable Convolution and Self-Normalizing Channel Attention. Electronics 2023, 12, 2132. [Google Scholar] [CrossRef]
  19. Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J. The KFIoU Loss for Rotated Object Detection. arXiv 2022, arXiv:2201.12558. [Google Scholar] [CrossRef]
  20. Ming, Q.; Miao, L.; Zhou, Z.; Yang, X.; Dong, Y. Optimization for arbitrary-oriented object detection via representation invariance loss. arXiv 2021, arXiv:2103.11636. [Google Scholar] [CrossRef]
  21. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote-Sensing Images. arXiv 2021, arXiv:2101.06849. [Google Scholar] [CrossRef]
  22. Qian, X.; Zeng, Y.; Wang, W.; Zhang, Q. Co-saliency Detection Guided by Group Weakly Supervised Learning. IEEE Trans. Multimed. 2023, 25, 1810–1818. [Google Scholar] [CrossRef]
  23. Cheng, G.; Li, Q.; Wang, G.; Xie, X.; Min, L.; Han, J. SFRNet: Fine-Grained Oriented Object Recognition via Separate Feature Refinement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610510. [Google Scholar] [CrossRef]
  24. Qian, X.; Wang, C.; Li, C.; Li, Z.; Zeng, L.; Wang, W.; Wu, Q. Multi-Scale Image Splitting Based Feature Enhancement and Instance Difficulty Aware Training for Weakly Supervised Object Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7497–7506. [Google Scholar] [CrossRef]
  25. Xie, X.; Lang, C.; Miao, S.; Cheng, G.; Li, K.; Han, J. Mutual-Assistance Learning for Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15171–15184. [Google Scholar] [CrossRef]
  26. Qian, X.; Li, C.; Wang, W.; Yao, X.; Cheng, G. Semantic segmentation guided pseudo label mining and instance re-detection for weakly supervised object detection in remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103301. [Google Scholar] [CrossRef]
  27. Qian, X.; Huo, Y.; Cheng, G.; Gao, C.; Yao, X.; Wang, W. Mining High-quality Pseudo Instance Soft Labels for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607615. [Google Scholar] [CrossRef]
  28. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  30. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  31. Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  32. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss. arXiv 2021, arXiv:2101.11952. [Google Scholar]
  33. Pan, X.; Ren, Y.; Sheng, K. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  34. Wei, Z.; Liang, D.; Zhang, D. Learning calibrated-guidance for object detection in aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2721–2733. [Google Scholar] [CrossRef]
  35. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4203–4212. [Google Scholar]
  36. Kong, T.; Sun, F.; Yao, A.; Liu, H.; Lu, M.; Chen, Y. Ron: Reverse connection with objectness prior networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5936–5944. [Google Scholar]
  37. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  38. Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.-G.; Chen, Y.; Xue, X. Dsod: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1919–1927. [Google Scholar]
  39. Zhang, F.; Wang, X.; Zhou, S. DARDet: A dense anchor-free rotated object detector in aerial images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8024305. [Google Scholar] [CrossRef]
  40. Wu, Y.; Chen, Y.; Yuan, L. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  41. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  42. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  43. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  45. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  46. Han, J.; Ding, J.; Xue, N. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  47. Wang, J.; Yang, W.; Li, H.-C. Learning center probability map for detecting objects in aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4307–4323. [Google Scholar] [CrossRef]
  48. Shamsolmoali, P.; Chanussot, J.; Zareapoor, M. Multipatch feature pyramid network for weakly supervised object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610113. [Google Scholar] [CrossRef]
  49. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 677–694. [Google Scholar]
  50. Qian, X.; Wu, B.; Cheng, G.; Yao, X.; Wang, W.; Han, J. Building a Bridge of Bounding Box Regression Between Oriented and Horizontal Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605209. [Google Scholar] [CrossRef]
  51. Li, Q.; Cheng, G.; Miao, S. Dynamic Proposal Generation for Oriented Object Detection in Aerial Images. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022. [Google Scholar]
  52. Qian, X.; Zhang, N.; Wang, W. Smooth GIoU Loss for Oriented Object Detection in Remote Sensing Images. Remote Sens. 2023, 15, 1259. [Google Scholar] [CrossRef]
  53. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  54. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  55. Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. Cornernet-lite: Efficient keypoint based object detection. arXiv 2019, arXiv:1904.08900. [Google Scholar]
  56. Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7363–7372. [Google Scholar]
  57. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859. [Google Scholar]
  58. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  59. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  60. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
  61. Huang, L.; Yang, Y.; Deng, Y.; Yu, Y. Densebox: Unifying landmark localization with end to end object detection. arXiv 2015, arXiv:1509.04874. [Google Scholar]
  62. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2965–2974. [Google Scholar]
  63. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  64. Tian, Z.; Chu, X.; Wang, X.; Wei, X.; Shen, C. Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images. arXiv 2022, arXiv:2205.13764. [Google Scholar]
  65. Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5187–5196. [Google Scholar]
  66. Cheng, G.; Yao, Y.; Li, S. Dual-aligned oriented detector. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618111. [Google Scholar] [CrossRef]
  67. Cheng, G.; Wang, J.; Li, K. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
  68. Xu, Y.; Fu, M.; Wang, Q. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  69. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  70. Jie, F.; Liang, Y.; Zhang, J.; Zhang, X.; Yao, Q.; Jiao, L. MidNet: An Anchor-and-Angle-Free Detector for Oriented Ship Detection in Aerial Images. arXiv 2021, arXiv:2111.10961. [Google Scholar]
  71. Dong, Z.; Li, G.; Liao, Y.; Wang, F.; Ren, P.; Qian, C. Centripetalnet: Pursuing high-quality keypoint pairs for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10519–10528. [Google Scholar]
  72. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  73. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  74. Xia, G.-S.; Bai, X.; Ding, J. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  75. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef] [PubMed]
  76. Hanin, B. Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. Mathematics 2019, 7, 992. [Google Scholar] [CrossRef]
  77. Padshetty, S.; Ambika. Leaky ReLU-ResNet for Plant Leaf Disease Detection: A Deep Learning Approach. Eng. Proc. 2023, 59, 39. [Google Scholar] [CrossRef]
  78. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  79. Qing, Y.; Liu, W.; Feng, L.; Gao, W. Improved YOLO Network for Free-Angle Remote Sensing Target Detection. Remote Sens. 2021, 13, 2171. [Google Scholar] [CrossRef]
  80. Li, Z.; Yuan, J.; Li, G.; Wang, H.; Li, X.; Li, D.; Wang, X. RSI-YOLO: Object Detection Method for Remote Sensing Images Based on Improved YOLO. Sensors 2023, 23, 6414. [Google Scholar] [CrossRef]
  81. Vats, A.; Anastasiu, D.C. Enhancing Retail Checkout Through Video Inpainting, YOLOv8 Detection, and DeepSort Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5529–5536. [Google Scholar]
Figure 1. Illustration of sampling locations between (a) the traditional convolutional kernel and (b) the AlignConv kernel in RSIs. The red dots denote sampling locations.
Figure 2. Framework of our method.
Figure 3. Illustration of (a) common convolution and (b) AlignConv. The red and blue dots denote the sampling points of common convolution and AlignConv, respectively. The arrows denote the offsets between red and blue dots. The green line denotes the boundary of sampling region.
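To make the sampling behaviour in Figures 1 and 3 concrete, the short sketch below derives AlignConv-style sampling locations for a single oriented box. This is an illustrative sketch only, not the implementation used in this paper: the function name, the (cx, cy, w, h, θ) box convention, and the choice to scale the kernel grid by (w/k, h/k) are assumptions, and in a full detector the resulting offsets would be passed to a deformable-convolution operator.

```python
import math
import torch

def alignconv_sampling_points(box, kernel_size=3, stride=1):
    """Hypothetical sketch: sampling locations of a k x k AlignConv-style kernel
    for one oriented box (cx, cy, w, h, theta), with theta in radians.

    A common convolution samples a fixed axis-aligned grid around each location
    (red dots in Figure 3a); here the grid is scaled to the box size and rotated
    by theta so the samples fall inside the oriented box (blue dots in Figure 3b).
    """
    cx, cy, w, h, theta = box
    k = kernel_size

    # Regular kernel grid, e.g. {-1, 0, 1} x {-1, 0, 1} for a 3x3 kernel.
    idx = torch.arange(-(k // 2), k // 2 + 1, dtype=torch.float32)
    gy, gx = torch.meshgrid(idx, idx, indexing="ij")
    grid = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=-1)      # (k*k, 2)

    # Scale the grid to the box size, rotate by the box angle, move to the centre.
    scale = torch.tensor([w / k, h / k])
    rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                        [math.sin(theta),  math.cos(theta)]])
    pts = (grid * scale) @ rot.T + torch.tensor([cx, cy])             # (k*k, 2)

    # Offsets relative to the regular stride-spaced grid centred at (cx, cy);
    # these are what a deformable-convolution operator would consume.
    offsets = pts - (grid * stride + torch.tensor([cx, cy]))
    return pts, offsets

# Example: a long, thin box rotated by 30 degrees, e.g. a ship or large vehicle.
points, offsets = alignconv_sampling_points((50.0, 40.0, 48.0, 12.0, math.pi / 6))
print(points.shape, offsets.shape)  # torch.Size([9, 2]) torch.Size([9, 2])
```

Compared with the axis-aligned grid of common convolution, the rotated and scaled grid keeps the samples inside the oriented box, which is what the blue dots in Figure 3b depict.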
Figure 4. Visualizations of our detection results on the DOTA dataset. OBBs with different colors indicate different categories; the same applies to the figures below.
Figure 5. Visualizations of our detection results on the DIOR-R dataset.
Figure 6. Illustration of several incorrect results on the (a,b) DOTA dataset and (c,d) DIOR-R dataset. The meanings of the colored boxes in (a,b) and (c,d) are the same as in Figure 4 and Figure 5, respectively.
Table 1. Ablation study of the activation function on the DOTA dataset.
Activation Function | mAP (STD)
ReLU | 76.42 (0.049)
Leaky ReLU | 76.52 (0.037)
ELU | 76.61 (0.026)
SELU | 76.73 (0.014)
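For readers who wish to reproduce an ablation like Table 1, the snippet below is a minimal, hypothetical sketch of how the activation function can be swapped while the rest of the head stays identical. It assumes a PyTorch-style convolutional head; the layer sizes, helper names, and activation registry are illustrative and are not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical helper for an activation-function ablation such as Table 1:
# build otherwise identical convolutional heads, swapping only the activation.
ACTIVATIONS = {
    "ReLU": nn.ReLU,
    "Leaky ReLU": nn.LeakyReLU,
    "ELU": nn.ELU,
    "SELU": nn.SELU,   # best mAP (76.73) and smallest STD (0.014) in Table 1
}

def make_head(name, in_channels=256, num_convs=4):
    """Small conv head in which only the activation differs between runs."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1))
        layers.append(ACTIVATIONS[name]())
    return nn.Sequential(*layers)

for name in ACTIVATIONS:
    head = make_head(name)
    out = head(torch.randn(1, 256, 64, 64))   # smoke test: spatial shape is preserved
    print(name, tuple(out.shape))
```

Because only the activation differs between runs, any difference in mAP in such an ablation can be attributed to that single choice.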
Table 2. Ablation study of the proposed modules on the DOTA dataset.
Modules | mAP (STD)
Baseline (RetinaNet-O) | 68.43 (0.057)
Different settings of TAOD (combinations of TDA, TADM, and TL) | 75.08 (0.039), 73.77 (0.018), 71.09 (0.025), 76.45 (0.031), 76.13 (0.015), 76.09 (0.019), 76.73 (0.014)
Full model (TDA + TADM + TL) | 76.73 (0.014)
Table 3. Comparison with state-of-the-art (SOTA) methods on the DOTA dataset. † denotes multiscale training and testing.
Methods | Backbone | Epoch | PL | BD | BR | GTF | SV | LV | SH | TC
YOLOv2-O [28] | Darknet-19 | 200 | 76.90 | 33.87 | 22.73 | 34.88 | 38.73 | 32.02 | 52.37 | 61.65
RetinaNet-O [30] | R-50-FPN | 12 | 88.67 | 77.62 | 41.81 | 58.17 | 74.58 | 71.64 | 79.11 | 90.29
Faster R-CNN [43] | R-50-FPN | 12 | 88.44 | 73.06 | 44.86 | 73.25 | 59.09 | 71.49 | 77.11 | 90.84
DRN [33] | H-104 | 120 | 88.91 | 80.22 | 43.52 | 63.35 | 73.48 | 70.69 | 84.94 | 90.14
RSI-YOLOv5 [80] | CSP-Darknet53 | 120 | 87.93 | 75.83 | 47.91 | 60.31 | 67.15 | 75.85 | 87.71 | 89.20
LO-Det-GGHL [78] | MobileNetv2 | 12 | 89.66 | 83.02 | 38.55 | 77.09 | 72.57 | 71.86 | 82.47 | 90.78
CenterMap-Net [47] | R-50-FPN | 12 | 88.88 | 81.24 | 53.15 | 60.65 | 78.62 | 66.55 | 78.10 | 88.83
R3Det [53] | R-101-FPN | 20 | 89.24 | 80.81 | 51.11 | 65.62 | 70.67 | 76.03 | 78.32 | 90.83
Rep-YOLO [79] | RepVGG | 500 | 90.27 | 79.34 | 52.34 | 64.35 | 71.02 | 76.27 | 77.41 | 91.04
YOLOv8-O [81] | CSP-Darknet53 | 120 | 89.49 | 81.17 | 50.53 | 66.10 | 70.92 | 78.66 | 78.21 | 90.81
S2ANet [31] | R-50-FPN | 12 | 89.11 | 82.84 | 48.37 | 71.11 | 78.11 | 78.39 | 87.25 | 90.83
DPGN [51] | R-101-FPN | 12 | 89.15 | 79.57 | 51.26 | 77.61 | 76.29 | 81.67 | 79.95 | 90.90
RoI Transformer [14] | R-50-FPN | 12 | 88.65 | 82.60 | 52.53 | 70.87 | 77.93 | 76.67 | 86.87 | 90.71
AOPG [67] | R-50-FPN | 12 | 89.27 | 83.49 | 52.50 | 69.97 | 73.51 | 82.31 | 87.95 | 90.89
Oriented R-CNN [15] | R-50-FPN | 12 | 89.46 | 82.12 | 54.78 | 70.86 | 78.93 | 83.00 | 88.20 | 90.90
OrientedRepPoints [5] | R-50-FPN | 40 | 87.02 | 83.17 | 54.13 | 71.16 | 80.18 | 78.40 | 87.28 | 90.90
AOPG † [67] | R-50-FPN | 12 | 89.88 | 85.57 | 60.90 | 81.51 | 78.70 | 85.29 | 88.85 | 90.89
Ours:
TAOD | R-50-FPN | 12 | 89.43 | 84.48 | 52.94 | 74.81 | 79.10 | 83.76 | 88.17 | 90.89
TAOD † | R-50-FPN | 12 | 90.42 | 83.68 | 59.09 | 81.50 | 79.83 | 84.76 | 88.50 | 91.12
Methods | Backbone | Epoch | BC | ST | SBF | RA | HA | SP | HC | mAP
YOLOv2-O [28] | Darknet-19 | 200 | 48.54 | 33.91 | 29.27 | 36.83 | 36.44 | 38.26 | 11.61 | 39.20
RetinaNet-O [30] | R-50-FPN | 12 | 82.18 | 74.32 | 54.75 | 60.60 | 62.57 | 69.57 | 60.64 | 68.43
Faster R-CNN [43] | R-50-FPN | 12 | 78.94 | 83.90 | 48.59 | 62.95 | 62.18 | 64.91 | 56.18 | 69.05
DRN [33] | H-104 | 120 | 83.85 | 84.11 | 50.12 | 58.41 | 67.62 | 68.60 | 52.50 | 70.70
RSI-YOLOv5 [80] | CSP-Darknet53 | 120 | 83.90 | 83.62 | 54.83 | 67.85 | 67.70 | 65.13 | 53.20 | 71.20
LO-Det-GGHL [78] | MobileNetv2 | 12 | 78.05 | 83.56 | 47.74 | 67.83 | 64.21 | 67.83 | 54.16 | 71.26
CenterMap-Net [47] | R-50-FPN | 12 | 77.80 | 83.61 | 49.36 | 66.19 | 72.10 | 72.36 | 58.70 | 71.74
R3Det [53] | R-101-FPN | 20 | 84.89 | 84.42 | 65.10 | 57.18 | 68.10 | 68.98 | 60.88 | 72.81
Rep-YOLO [79] | RepVGG | 500 | 86.21 | 84.17 | 66.82 | 63.07 | 67.23 | 69.75 | 62.07 | 73.42
YOLOv8-O [81] | CSP-Darknet53 | 120 | 85.26 | 84.23 | 61.81 | 63.77 | 68.16 | 69.83 | 67.17 | 73.74
S2ANet [31] | R-50-FPN | 12 | 84.90 | 85.64 | 60.36 | 62.60 | 65.26 | 69.13 | 57.94 | 74.12
DPGN [51] | R-101-FPN | 12 | 85.52 | 84.71 | 61.72 | 62.82 | 75.31 | 66.34 | 50.86 | 74.25
RoI Transformer [14] | R-50-FPN | 12 | 83.83 | 82.51 | 53.95 | 67.61 | 74.67 | 68.75 | 61.03 | 74.61
AOPG [67] | R-50-FPN | 12 | 87.64 | 84.71 | 60.01 | 66.12 | 74.19 | 68.30 | 57.80 | 75.24
Oriented R-CNN [15] | R-50-FPN | 12 | 87.50 | 84.68 | 63.97 | 67.69 | 74.94 | 68.84 | 52.28 | 75.87
OrientedRepPoints [5] | R-50-FPN | 40 | 85.97 | 86.25 | 59.90 | 70.49 | 73.53 | 72.27 | 58.97 | 75.97
AOPG † [67] | R-50-FPN | 12 | 87.60 | 87.65 | 71.66 | 68.69 | 82.31 | 77.32 | 73.10 | 80.66
Ours:
TAOD | R-50-FPN | 12 | 86.57 | 84.52 | 62.49 | 68.15 | 75.13 | 69.20 | 61.26 | 76.73
TAOD † | R-50-FPN | 12 | 84.15 | 87.48 | 72.09 | 71.89 | 81.71 | 78.26 | 76.87 | 80.76
Table 4. Comparison with SOTA methods on the DIOR-R dataset.
Method | Backbone | Epoch | APL | APO | BF | BC | BR | CH | DAM | ETS | ESA | GF
FCOS-O [64] | R-50-FPN | 12 | 48.70 | 24.88 | 63.57 | 80.97 | 18.41 | 68.99 | 23.26 | 42.37 | 60.25 | 64.83
RetinaNet-O [30] | R-50-FPN | 12 | 61.49 | 28.52 | 73.57 | 81.17 | 23.98 | 72.54 | 19.94 | 72.39 | 58.20 | 69.25
Double-Heads [40] | R-50-FPN | 12 | 62.13 | 19.53 | 71.50 | 87.09 | 28.01 | 72.17 | 20.35 | 61.19 | 64.56 | 73.37
Faster R-CNN [43] | R-50-FPN | 12 | 62.79 | 26.80 | 71.72 | 80.91 | 34.20 | 72.57 | 18.95 | 66.45 | 65.75 | 66.63
Gliding Vertex [68] | R-50-FPN | 12 | 65.35 | 28.87 | 74.96 | 81.33 | 33.88 | 74.31 | 19.58 | 70.72 | 64.70 | 72.30
S2ANet [31] | R-50-FPN | 12 | 65.40 | 42.04 | 75.15 | 83.91 | 36.01 | 72.61 | 28.01 | 65.09 | 75.11 | 75.56
RoI Transformer [14] | R-50-FPN | 12 | 63.34 | 37.88 | 71.78 | 87.53 | 40.68 | 72.60 | 26.86 | 78.71 | 68.09 | 68.96
QPDet [53] | R-50-FPN | 12 | 63.22 | 41.39 | 71.97 | 88.55 | 41.23 | 72.63 | 28.82 | 78.90 | 69.00 | 70.07
AOPG [67] | R-50-FPN | 12 | 62.39 | 37.79 | 71.62 | 87.63 | 40.90 | 72.17 | 31.08 | 65.42 | 77.99 | 73.20
Oriented R-CNN [15] | R-50-FPN | 12 | 62.00 | 44.92 | 71.78 | 87.93 | 43.84 | 72.64 | 35.46 | 66.39 | 81.35 | 74.10
DODet [66] | R-50-FPN | 12 | 63.40 | 43.35 | 72.11 | 81.32 | 43.12 | 72.59 | 33.32 | 78.77 | 70.84 | 74.15
Ours:
TAOD | R-50-FPN | 12 | 64.39 | 44.73 | 72.74 | 88.57 | 42.71 | 73.65 | 35.72 | 65.12 | 80.54 | 74.76
Method | Backbone | Epoch | GTF | HA | OP | SH | STA | STO | TC | TS | VE | WM | mAP
FCOS-O [64] | R-50-FPN | 12 | 60.66 | 31.84 | 40.80 | 73.09 | 66.32 | 56.61 | 77.55 | 38.10 | 30.69 | 55.87 | 51.39
RetinaNet-O [30] | R-50-FPN | 12 | 79.54 | 32.14 | 44.87 | 77.71 | 65.57 | 61.09 | 81.46 | 47.33 | 38.01 | 60.24 | 57.55
Double-Heads [40] | R-50-FPN | 12 | 81.97 | 40.68 | 42.40 | 80.36 | 73.12 | 62.37 | 87.09 | 54.94 | 41.32 | 64.86 | 59.45
Faster R-CNN [43] | R-50-FPN | 12 | 79.24 | 34.95 | 48.79 | 81.14 | 64.34 | 71.21 | 81.44 | 47.31 | 50.46 | 65.21 | 59.54
Gliding Vertex [68] | R-50-FPN | 12 | 78.68 | 37.22 | 49.64 | 80.22 | 69.26 | 61.13 | 81.49 | 44.76 | 47.71 | 65.04 | 60.06
S2ANet [31] | R-50-FPN | 12 | 80.47 | 35.91 | 52.10 | 82.33 | 65.89 | 66.08 | 84.61 | 54.13 | 48.00 | 69.67 | 62.90
RoI Transformer [14] | R-50-FPN | 12 | 82.74 | 47.71 | 55.61 | 81.21 | 78.23 | 70.26 | 81.61 | 54.86 | 43.27 | 65.52 | 63.87
QPDet [53] | R-50-FPN | 12 | 83.01 | 47.83 | 55.54 | 81.23 | 72.15 | 62.66 | 89.05 | 58.09 | 43.38 | 65.36 | 64.20
AOPG [67] | R-50-FPN | 12 | 81.94 | 42.32 | 54.45 | 81.17 | 72.69 | 71.31 | 81.49 | 60.04 | 52.38 | 69.99 | 64.41
Oriented R-CNN [15] | R-50-FPN | 12 | 80.95 | 43.52 | 58.42 | 81.25 | 68.01 | 62.52 | 88.62 | 59.31 | 43.27 | 66.31 | 64.63
DODet [66] | R-50-FPN | 12 | 75.47 | 48.00 | 59.31 | 85.41 | 74.04 | 71.56 | 81.52 | 55.47 | 51.86 | 66.40 | 65.10
Ours:
TAOD | R-50-FPN | 12 | 83.44 | 47.65 | 58.81 | 82.73 | 76.81 | 73.79 | 89.74 | 62.33 | 53.68 | 70.58 | 67.12
Table 5. GFLOPs and FPS of our method on the DOTA and DIOR-R datasets.
Dataset | GPU | FPS | GFLOPs
DOTA | 1 × RTX 2080Ti | 14.9 | 293.7
DIOR-R | 1 × RTX 2080Ti | 24.4 |