Article

Lightweight Progressive Fusion Calibration Network for Rotated Object Detection in Remote Sensing Images

1 Xi’an Key Laboratory of Human-Machine Integration and Control Technology for Intelligent Rehabilitation, School of Computer Science, Xijing University, Xi’an 710123, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
3 School of Information Science and Engineering, Wuchang Shouyi University, Wuhan 430072, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(16), 3172; https://doi.org/10.3390/electronics13163172
Submission received: 5 July 2024 / Revised: 2 August 2024 / Accepted: 7 August 2024 / Published: 11 August 2024

Abstract

Rotated object detection is a crucial task in aerial image analysis. To address challenges such as multi-directional object rotation, complex backgrounds with occlusions, and the trade-off between speed and accuracy in remote sensing images, this paper introduces a lightweight progressive fusion calibration network for rotated object detection (LPFC-RDet). The network comprises three main modules: the Retentive Meet Transformers (RMT) feature extraction block, the Progressive Fusion Calibration module (PFC), and the Shared Group Convolution Lightweight detection head (SGCL). The RMT feature extraction block integrates a retentive mechanism with global context modeling to learn rotation-insensitive features. The PFC module employs pixel-level, local-level, and global-level weights to calibrate features, enhancing feature extraction from occluded objects while suppressing background interference. The SGCL detection head uses decoupled detection tasks and shared group convolution layers to achieve parameter sharing and feature interaction, improving accuracy while maintaining a lightweight structure. Experimental results demonstrate that our method surpasses state-of-the-art detectors on three widely used remote sensing object datasets: HRSC2016, UCAS_AOD, and DOTA.

1. Introduction

Remote sensing imagery serves as a critical tool for acquiring surface information about the Earth, with extensive applications in environmental monitoring, urban planning, military reconnaissance, and more. With the rapid growth of remote sensing data, the precise detection of objects from large volumes of images has garnered increasing attention. In recent years, deep learning technologies have revolutionized the analysis of remote sensing imagery by efficiently capturing complex patterns, thereby significantly enhancing the accuracy of detection and recognition tasks. Presently, deep learning-based detectors are categorized into two main types: two-stage detectors and one-stage detectors. Typical representatives of two-stage detectors include Faster R-CNN [1], Mask RCNN [2], Cascade RCNN [3], and R-FCN [4]. Typical one-stage detectors include SSD [5], RetinaNet [6], YOLO [7,8,9,10,11,12,13], etc. These detectors have demonstrated robust applications in various computer vision tasks; this is particularly true for the YOLO series, which is based on a one-stage design philosophy, excels in efficient feature extraction, and is widely recognized as one of the fastest and most accurate model families. Additionally, recent advancements such as Visual Transformers [14], DETR [15], and Mamba [16] have further propelled significant breakthroughs in different domains. However, challenges persist in the precise detection of objects in remote sensing images due to factors like varying shooting angles, sensor attitudes, the multi-directional rotations of objects, complex backgrounds with occlusions, and the difficulty in balancing speed and accuracy under limited hardware conditions.
In remote sensing images, targets exhibit diverse rotational angles, making it challenging for traditional horizontal bounding box-based detection methods to accurately represent their shapes and positions, as depicted in Figure 1a. For objects with high aspect ratios, shown in Figure 1b, horizontal bounding boxes often encapsulate a substantial portion of the background, diminishing detection accuracy, particularly in dense or overlapping scenarios where the impact is more pronounced. In remote sensing detection tasks, rotated bounding boxes offer advantages over axis-aligned horizontal boxes. Horizontal boxes fail to tightly enclose irregularly shaped objects and do not capture their true boundaries and orientations, thus including an excessive amount of background and reducing precision [17]. When dealing with tilted or rotated objects, horizontal boxes tend to produce significant localization errors. Conversely, rotated bounding boxes define object boundaries more precisely and effectively encompass targets, thereby reducing false and missed detections. This enhances the adaptability and accuracy of detection models [18].
The presence of complex backgrounds and occlusions is another major challenge. Remote sensing images, due to their wide perspective, often contain extensive background information such as terrain, vegetation, and buildings, as shown in Figure 1c. The overwhelming richness of background information reduces the proportion of foreground information, causing the model to lean towards learning background features during training, potentially neglecting crucial foreground information. Additionally, instances of target overlap and occlusion in images (as shown in Figure 1d) further complicate feature extraction for the model.
The imbalance between speed and accuracy poses yet another issue. To achieve satisfactory detection accuracy amidst background interference, multi-scale challenges, and multi-directional complexities, detection algorithms often employ more complex network structures, multi-layer concatenation fusion, and attention mechanisms in backbone and neck layers. However, this increases computational complexity. Some algorithms prioritize high frame rates (FPS) for real-time performance, potentially compromising attention to detail or smaller targets during detection, thereby impacting overall detection accuracy.
In response to these challenges, this paper proposes the lightweight progressive fusion and calibration-rotated object detection network (LPFC-RDet) for remote sensing images. This network consists of three key modules: the Retentive Meet Transformers (RMT) feature extraction block, the Progressive Fusion Calibration module (PFC), and the Shared Group Convolution Lightweight Detection Head (SGCL). The primary contributions of this paper are as follows:
  • In the feature extraction stage, we introduce the RMT block, which integrates the retention mechanism of Retentive Networks with the global context modeling capabilities of Vision Transformers. By injecting spatial prior information into self-attention mechanisms, the RMT block learns feature representations insensitive to rotation. By considering contextual information around targets, the RMT block more accurately describes the shapes and positions of targets, improving the model’s ability to extract features. By incorporating explicit spatial priors and attention decomposition methods, the RMT block effectively reduces computational burden while maintaining a linear complexity.
  • We propose the PFC module, which calibrates features by learning joint weights at pixel, channel, and spatial levels. This module fully blends features from three levels to ensure information interaction, enhancing the model’s robustness to variations in input images (such as lighting, viewpoint, and scale). This fusion strategy better handles interference and occlusion from complex backgrounds in remote sensing images.
  • We propose the SGCL detection head. It decouples the detection task into multiple sub-tasks and utilizes shared group convolution layers for parameter sharing and feature interaction, reducing the number of parameters. Additionally, group normalization is incorporated to enhance stability and generalization capabilities. SGCL achieves both improved accuracy in rotated object detection in remote sensing images and lightweight performance.
The structure of the paper is as follows: Section 2 reviews the literature on remote sensing object detectors in detail. Section 3 provides detailed descriptions of our model and its modules. In Section 4, we conduct ablation experiments to demonstrate the feasibility of the designed modules. Comparative experiments and a visual analysis of detection performance are also presented. Finally, Section 5 presents our conclusions and explores future research directions.

2. Related Work

Currently, CNN-based data processing has become a predominant trend in remote sensing object detection research due to its outstanding feature representation and generalization capabilities. The core design philosophy revolves around the combination of feature extraction and bounding box regression. Existing remote sensing object detection frameworks are primarily categorized into one-stage and two-stage detectors.

2.1. Two-Stage Rotated Object Detectors

R²CNN [19] utilizes different sizes of RoI pooling to extract features from candidate regions, combining the concatenated features with fully connected layers for text/non-text classification and the prediction of axis-aligned boxes and rotated minimum-area boxes. It employs rotated NMS for post-processing to achieve rotated text detection. ICN [20] combines image cascades and feature pyramid networks to extract multi-scale features. It employs rotated NMS to eliminate redundant detections and enhance detection performance for rotated objects. R²PN [21] introduces rotated anchor generation to create candidate regions in arbitrary orientations. It integrates these using a rotation RoI pooling layer within a unified network for end-to-end training, accurately regressing bounding box orientations. RRPN [22] incorporates orientation parameters into anchor boxes and designs rotated anchor boxes, initially used for inclined text detection and later extended to rotated object detection in remote sensing. RoI Transformer [23] enhances feature extraction for object recognition and detection by learning rotated regions of interest (RoIs), thereby improving rotated object detection performance. Oriented RCNN [24] directly generates high-quality oriented candidate regions using an oriented RPN. It uses rotated RoI pooling to extract features for classification and regression, thereby improving detection accuracy. Libra-RCNN [25] tackles sample, feature, and objective imbalance through IoU-balanced sampling, balanced feature pyramids, and balanced L1 loss. It reduces interference from large-gradient samples during model training, thereby improving detection performance.

2.2. One-Stage Rotated Object Detectors

S²A-Net [26] generates high-quality anchors by employing the Feature Alignment Module (FAM) and adaptively utilizes convolutional features through the Oriented Detection Module (ODM). It encodes directional information using active rotation filters to resolve discrepancies between classification and localization. R³Det [27] is a progressive rotated detector that enhances speed and recall using horizontal anchors. It refines accuracy through fine rotated anchors and the Feature Refinement Module (FRM), addressing non-differentiable issues with an approximate SkewIoU loss. SCRNet [28] utilizes the SF-Net feature fusion structure and an IoU constant factor to predict coordinates at any angle. SCRDet++ [29] enhances detection capabilities for small objects through instance-level denoising and new IoU constant factors to address boundary issues. DRN [30] enhances detection accuracy by employing the Feature Selection Module (FSM) and Dynamic Refinement Head (DRH) to select optimal features tailored to different objects. ReDet [18] embeds a rotation-equivariant module to adapt to rotational changes in input images, accurately identifying rotated objects from small to large sizes. RSDet [31] introduces modulated rotation losses to address rotational sensitivity errors and improves various parameterization methods uniformly. Deng [32] introduces variable Gaussian labels and feature fusion with CPA to achieve a global receptive field and more robust features. Yang [33] treats angle prediction as a classification problem, designing Circular Smooth Labels (CSL) to adapt to angle periodicity and improve classification tolerance. PKINet [34] extracts object features and long-range context information through multi-scale convolutional kernels and the Context Anchored Attention (CAA) module. LSKNet [35] dynamically adjusts spatial receptive fields to model the varying ranging context of objects in remote sensing scenarios. These methods each enhance the performance of rotated object detection from a different perspective.
While prior rotated target detection methods for remote sensing have contributed significantly, they fall short on small targets and occlusion. Two-stage approaches, prioritizing accuracy over speed, first identify regions of interest before detection; one-stage methods boost speed but compromise accuracy, so neither achieves a good speed-accuracy balance. Our LPFC-RDet introduces an RMT block for rotation-insensitive feature extraction and a PFC module that calibrates features via joint pixel, channel, and spatial weights, enhancing robustness. The SGCL detection head, with shared group convolutions and group normalization, balances parameter count, feature interaction, stability, and generalization. LPFC-RDet thus efficiently tackles rotated target detection in remote sensing while balancing speed and accuracy.

3. Method

In this section, we present LPFC-RDet, a lightweight progressive fusion and calibration-rotated object detection network proposed in this paper. We provide an overview of our approach and describe the overall network architecture in detail.

3.1. Overall Architecture

YOLOv8-obb (Oriented Bounding Box) is selected as our benchmark framework. YOLOv8-obb is an extended version of YOLOv8, specifically designed for object detection tasks involving rotating bounding boxes. Compared to the standard YOLOv8, YOLOv8-obb adds a prediction of rotation angle in the output layer, enabling the generation of bounding boxes in any direction. This is particularly important for object detection in remote sensing images, as targets in these images often appear at various angles.
We propose LPFC-RDet, depicted in Figure 2, which comprises three main components: the backbone, neck, and head. The main change to the backbone is the addition of an RMT block to YOLOv8’s CSPDarknet. The backbone employs CSPDarknet to construct a five-layer feature pyramid C1, C2, C3, C4, C5 for extracting features across different scales [36]. Only C3, C4, and C5 are used due to their appropriate resolutions relative to the original image size. A pivotal addition to the backbone is the RMT block, which integrates spatial prior information. This block enhances the model’s capability to detect rotated objects by implementing attention decay tailored to different distances within the image, thereby improving spatial feature extraction while reducing computational complexity. In the neck, an FPN integrates C5 with C4 and C3 in a top-down fashion; this differs from the PANet used in YOLOv8. We propose the Progressive Fusion Calibration (PFC) module, which recalibrates and fuses features using learned joint weights across pixels, channels, and spatial dimensions. This module enhances the semantic information in feature maps, bolstering the model’s resilience to variations in input images such as lighting, viewpoint, and scale, and effectively mitigating background interference in remote sensing scenarios. The SGCL feature-decoupling detection head processes the feature maps F3, F4, and F5. It includes branches for angle prediction, bounding box regression, and classification. SGCL adopts group normalization instead of batch normalization (BN) and shared convolutions to optimize feature sharing, balancing detection accuracy and computational efficiency.
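To make this data flow concrete, the following PyTorch-style sketch wires the components together in the way described above. It is a minimal illustration under our reading of Figure 2, not the released implementation; the constructor arguments (backbone, RMT block, PFC modules, and head) are hypothetical stand-ins for the components detailed in Sections 3.2, 3.3, and 3.4.

```python
import torch
import torch.nn as nn


class LPFCRDetSketch(nn.Module):
    """Minimal sketch of the LPFC-RDet data flow; module arguments are placeholders."""

    def __init__(self, backbone, rmt_block, pfc_p4, pfc_p3, head):
        super().__init__()
        self.backbone = backbone                     # CSPDarknet yielding C3, C4, C5
        self.rmt = rmt_block                         # RMT block added to the backbone
        self.pfc_p4, self.pfc_p3 = pfc_p4, pfc_p3    # progressive fusion calibration modules
        self.head = head                             # SGCL detection head
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        c3, c4, c5 = self.backbone(x)        # multi-scale backbone features
        f5 = self.rmt(c5)                    # spatial priors injected at the deepest level (assumed placement)
        f4 = self.pfc_p4(self.up(f5), c4)    # top-down FPN fusion with calibration
        f3 = self.pfc_p3(self.up(f4), c3)
        # each level yields angle, box, and class predictions
        return self.head([f3, f4, f5])
```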

3.2. RMT Block

Rotational angles of objects in remote sensing images vary significantly, leading to substantial differences in their feature representations within the feature space. Accurately capturing spatial relationships is crucial for improving object detection accuracy. Conventional deep learning models often lack the capability to learn rotation invariance during feature extraction, making it challenging to distinguish and recognize objects with different rotation angles.
To address these issues and integrate spatial prior information for extracting rotation-invariant features, we introduce the RMT block [37] into the backbone of LPFC-RDet. The RMT block combines the retention mechanism of Retentive Networks with the global context modeling capability of Vision Transformers. This integration effectively utilizes holistic information surrounding the targets to reduce background interference. By incorporating self-attention mechanisms and explicit spatial priors, the RMT block learns features that remain relatively stable under rotational transformations and are therefore helpful for recognition, enhancing the capability to extract features of rotated objects and resolving the feature distortion caused by differences in rotation angle.
The RMT block considers the spatial characteristics of two-dimensional space and designs a decay matrix based on two-dimensional distances. To mitigate the computational burden introduced by a large number of tokens in the early stages of visual backbone processing, the computation is split along the two axes of the image. This mechanism, tailored for images, is termed the Manhattan Self-Attention (MaSA) mechanism, which enriches spatial priors. The structure of the RMT block based on the MaSA mechanism is depicted in Figure 3. Darker colors represent smaller spatial decay rates, while lighter colors represent larger ones. The spatial decay rates that change with distance provide the model with rich spatial priors.
As shown in Figure 3, the final output of the RMT block can be represented as
$X_{out} = \mathrm{MaSA}(X) + \mathrm{LCE}(V)$  (1)
where LCE denotes the Local Context Enhancement module, implemented with depth-wise convolution (DWConv), and MaSA is decomposed along the two axes of the image as given in Equation (2). Specifically, attention scores are computed separately along the horizontal and vertical directions of the image, and a one-dimensional bidirectional decay matrix is then applied to these attention weights.
$Attn_H = \mathrm{Softmax}(Q_H K_H^T) \odot D_H, \quad Attn_W = \mathrm{Softmax}(Q_W K_W^T) \odot D_W, \quad \mathrm{MaSA}(X) = Attn_H (Attn_W V)$  (2)
The RMT block explicitly injects spatial prior information into the self-attention mechanism, aiding the model in better understanding spatial relationships within images. By introducing explicit spatial priors and attention decomposition, the RMT block effectively reduces the computational burden while maintaining linear complexity. Additionally, the RMT block enhances robustness to arbitrary rotation angles by incorporating rotational invariance properties, thereby improving detection accuracy for rotated objects in remote sensing images. This capability gives RMT significant advantages in handling rotated targets within remote sensing applications.
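The following is a minimal sketch of the decomposed Manhattan Self-Attention in Equation (2), assuming a simple exponential decay D[i, j] = γ^|i−j| and omitting multi-head splitting, normalization details, and the LCE branch; the function and parameter names here are ours, not taken from the RMT implementation.

```python
import torch


def decay_matrix(length: int, gamma: float) -> torch.Tensor:
    """1-D bidirectional decay: D[i, j] = gamma ** |i - j| (a simplified choice)."""
    idx = torch.arange(length)
    return gamma ** (idx[None, :] - idx[:, None]).abs().float()


def masa_decomposed(q, k, v, gamma: float = 0.9):
    """Decomposed Manhattan Self-Attention sketch for tensors of shape (B, H, W, C).

    Attention is computed separately along the W and H axes, each modulated by a
    distance-based decay matrix, following Equation (2); scaling is simplified.
    """
    B, H, W, C = q.shape
    d_h = decay_matrix(H, gamma).to(q.device)          # (H, H)
    d_w = decay_matrix(W, gamma).to(q.device)          # (W, W)

    # Attention along the width axis: each row is treated independently.
    attn_w = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)   # (B, H, W, W)
    attn_w = attn_w * d_w                                                # apply spatial decay
    out = attn_w @ v                                                     # (B, H, W, C)

    # Attention along the height axis: transpose so H becomes the token axis.
    q_h, k_h, out_h = (t.transpose(1, 2) for t in (q, k, out))           # (B, W, H, C)
    attn_h = torch.softmax(q_h @ k_h.transpose(-2, -1) / C ** 0.5, dim=-1)  # (B, W, H, H)
    attn_h = attn_h * d_h
    return (attn_h @ out_h).transpose(1, 2)                              # back to (B, H, W, C)
```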

3.3. PFC Module

In the neck stage of object detection networks, the primary task is to fuse the multi-scale features extracted by the backbone to accurately identify targets. Typically, methods such as Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN) are employed to integrate local and global features. However, remote sensing images often contain abundant complex background information such as terrain, vegetation, and buildings. These backgrounds share similarities in color, texture, and shape with foreground objects such as vehicles, ships, and specific buildings. This similarity can cause the model to overlook the effective learning of foreground information during training, resulting in insufficient feature extraction or confusion for certain classes. Additionally, occlusions in remote sensing images can disrupt the continuity and integrity of targets, making it difficult for the model to derive significant features from the restricted visible areas. To alleviate these challenges, we propose a Progressive Fusion Calibration (PFC) module, which consists of three progressive levels: pixel-level, local-level, and global-level. PFC adjusts features by learning joint weights across pixels, channels, and spatial dimensions, effectively blending features from all three levels while ensuring information interaction. This approach efficiently addresses issues caused by severe background interference and occlusions in remote sensing images. Figure 4 illustrates the details of the proposed progressive fusion scheme based on the PFC module. The core of the PFC module is to compute feature-space weights from pixel-level, local-level, and global-level features to adaptively represent the importance of different regions. Weighted fusion is then performed to recalibrate the features. Skip connections add the input features back, mitigating gradient vanishing and simplifying the learning process. Finally, a 1 × 1 convolution layer projects the fused features, producing the final feature map.
In the backbone and neck stages, extensive convolution operations are employed. However, due to the locality and limited receptive field of convolutions, they are insufficient for modeling global features. In contrast, Transformers are proficient at extracting global features and capturing long-range dependencies via attention mechanisms; convolutions and attention mechanisms therefore complement each other in modeling local and global features. Accordingly, the global branch employs a self-attention mechanism to capture extensive contextual information from the remote sensing data, whereas the local branch concentrates on extracting localized features.
During the global integration stage, we first apply 1 × 1 convolutions along with 3 × 3 depth-wise convolutions to produce query (Q), key (K), and value (V) tensors of size H × W × C. We then reshape Q to $\hat{Q} \in \mathbb{R}^{HW \times C}$ and K to $\hat{K} \in \mathbb{R}^{C \times HW}$. The attention map is computed through the interaction between $\hat{Q}$ and $\hat{K}$, with the output of the global integration branch defined as follows:
$O_{Global} = C_{1 \times 1}(\mathrm{Softmax}(\hat{K}\hat{Q}/\alpha) \times \hat{V}) + X_i$  (3)
where $\alpha$ is a trainable scaling parameter that regulates the magnitude of the matrix product of $\hat{K}$ and $\hat{Q}$ before the softmax function is applied.
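A compact sketch of the global branch of Equation (3) is given below, assuming the softmax attention is taken over a C × C channel-wise map (consistent with the shapes of Q̂ and K̂ above); the exact layer widths and the placement of the depth-wise convolution are assumptions.

```python
import torch
import torch.nn as nn


class GlobalBranch(nn.Module):
    """Sketch of the PFC global branch (Equation (3)); layer sizes are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, channels * 3, kernel_size=1),
            nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                      padding=1, groups=channels * 3),   # depth-wise 3x3
        )
        self.project = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.ones(1))          # trainable scaling parameter

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q_hat = q.flatten(2).transpose(1, 2)              # (B, HW, C)
        k_hat = k.flatten(2)                              # (B, C, HW)
        v_hat = v.flatten(2)                              # (B, C, HW)
        attn = torch.softmax(k_hat @ q_hat / self.alpha, dim=-1)   # (B, C, C) channel-wise map
        out = (attn @ v_hat).reshape(b, c, h, w)
        return self.project(out) + x                      # 1x1 projection plus skip connection
```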
In the local branch, 1 × 1 convolutions are initially applied to adjust the channel dimensions for improving inter-channel interactions and feature fusion. This is followed by channel shuffle operations, which split the input tensor along the channel dimension into groups, perform depth-wise separable convolutions to shuffle the channels within each group, and concatenate the resulting tensors along the channel dimension. A 3 × 3 × 3 convolution is then used to extract features, as given by
$X_i = \mathrm{Concat}(p_i, c_i), \quad O_{Local} = C_{3 \times 3 \times 3}(\mathrm{CS}(C_{1 \times 1}(X_i)))$  (4)
where $O_{Local}$ represents the output of the local branch, $C_{1 \times 1}$ is a 1 × 1 convolution, $C_{3 \times 3 \times 3}$ refers to a 3 × 3 × 3 convolution, $\mathrm{CS}$ denotes the channel shuffle operation, and $p_i$ and $c_i$ are the two levels of features in the input.
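Below is a hedged sketch of the local branch of Equation (4), using a standard channel shuffle and approximating the 3 × 3 × 3 convolution with a 2-D depth-wise 3 × 3 convolution; the group count and channel handling are assumptions.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Standard channel-shuffle operation used in this local-branch sketch."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    return x.transpose(1, 2).reshape(b, c, h, w)


class LocalBranch(nn.Module):
    """Sketch of the PFC local branch (Equation (4)); the 3x3x3 convolution is
    approximated by a 2-D depth-wise 3x3 convolution, and groups=4 is an assumption."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        # the 1x1 convolution adjusts the doubled channel count after concatenation
        self.pw = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                            padding=1, groups=channels)
        self.groups = groups

    def forward(self, p_i: torch.Tensor, c_i: torch.Tensor) -> torch.Tensor:
        x = torch.cat([p_i, c_i], dim=1)          # X_i = Concat(p_i, c_i)
        x = self.pw(x)                            # 1x1 channel adjustment
        x = channel_shuffle(x, self.groups)       # CS(.) mixes information across groups
        return self.dw(x)                         # local feature extraction
```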
To effectively integrate information at the pixel level, the input features undergo average pooling, while the sum of the global-level and local-level fusion outputs undergoes max pooling; the two results are then added and passed through a 7 × 7 convolution to achieve pixel-level fusion:
$O_{Pixel} = C_{7 \times 7}(M_p(O_{Global} + O_{Local}) + A_p(X_i))$  (5)
where $M_p$ represents the max pooling operation and $A_p$ denotes the average pooling operation. The final output of the three-level fusion is
$\varepsilon = \mathrm{sigmoid}(O_{Pixel}), \quad F_{out} = C_{1 \times 1}(P_i \cdot \varepsilon + C_i \cdot (1 - \varepsilon) + P_i + C_i)$  (6)
After the aforementioned three-level progressive fusion calibration, the global-level fusion captures broader remote sensing data information through self-attention mechanisms, aiding the model in understanding global context and long-range dependencies within the images. The local-level fusion enhances inter-channel interactions, facilitating the extraction of richer feature representations critical for detecting details and texture information within targets. Pixel-level fusion, achieved through average pooling and maxpooling operations, further processes and integrates feature maps obtained from global and local fusion stages, reducing feature redundancy and noise while preserving essential information. This fusion strategy enables the model to better handle complex background interference in remote sensing images, exhibiting enhanced robustness against challenges such as lighting variations, viewing angles, and occlusions. It effectively mitigates issues related to background interference and occlusion, thereby improving the model’s capability in feature extraction.
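To illustrate how Equations (5) and (6) combine the branch outputs, the following sketch implements the pixel-level fusion and the sigmoid gate; stride-1 pooling (so the gate stays per-pixel) and a shared channel count across all inputs are assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn


class PixelFusionGate(nn.Module):
    """Sketch of pixel-level fusion and final calibration (Equations (5) and (6)).
    Pooling uses stride 1 so the gate remains per-pixel; all inputs are assumed
    to share the same channel count."""

    def __init__(self, channels: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.avg_pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.conv7 = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, o_global, o_local, x_i, p_i, c_i):
        # O_Pixel = C_7x7(Mp(O_Global + O_Local) + Ap(X_i))
        o_pixel = self.conv7(self.max_pool(o_global + o_local) + self.avg_pool(x_i))
        eps = torch.sigmoid(o_pixel)                        # per-pixel fusion weight
        fused = p_i * eps + c_i * (1.0 - eps) + p_i + c_i   # weighted fusion with skip terms
        return self.project(fused)                          # final 1x1 projection
```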

3.4. SGCL Detection Head

In object detection tasks, the head component of the network architecture is tasked with predicting both the category and spatial location of targets. This component can be designed as either a coupled head or a decoupled head. In a coupled head design, predictions for target category and position are intertwined and share network layers, as illustrated in Figure 5a. This design offers the advantage of simplicity in model structure and high computational efficiency. However, since category and position predictions are merged into a single head, errors in one task may affect the other due to their distinct optimization objectives (classification and regression). Using the same network layers for both predictions can potentially increase optimization difficulty.
In contrast, the design of a decoupled head separates the tasks of category and position prediction, employing different network layers for each task, as shown in Figure 5b. This approach can reduce mutual error interference, thereby enhancing model performance. However, the independent operation of these two task branches limits their ability to fully leverage advanced features extracted by the backbone network, resulting in lower model efficiency and increased complexity and computational demands.
To effectively enhance the accuracy and speed of rotated object detection in remote sensing and to address the challenges associated with both coupled and decoupled head designs, this paper proposes a novel Shared Group Convolution Lightweight Detection Head (SGCL). SGCL integrates the advantages of both coupled and decoupled heads by incorporating three branches: an angle prediction branch, a bounding box regression branch, and a classification branch, each of which is employed to perform a specific task. It computes bounding box regression loss, confidence loss, classification loss, and rotated box loss. In SGCL, group normalization (GN) is employed instead of conventional batch normalization (BN) to mitigate normalization issues related to batch-size dependency. By using shared convolution, SGCL achieves parameter sharing across three feature maps of different scales, enhancing feature interaction and improving detection accuracy while maintaining lightweight efficiency. The structure is illustrated in Figure 6.
After feature fusion in the neck, resulting in F3, F4, and F5 feature maps, they are fed into the detection head. The decoupled head consists of three branches, which extract information through multiple 3 × 3 convolutions followed by a 1 × 1 convolution. Typically, batch normalization (BN) is applied after convolution layers to normalize each channel’s input data, stabilizing the distribution, reducing gradient vanishing and exploding problems, accelerating model training, and enhancing generalization by reducing dependency on input data.
BN operates on the batch dimension, and its effectiveness depends on the batch size: batches that are too small increase errors significantly, while batches that are too large are computationally expensive on GPUs. To address these issues, this paper uses group normalization (GN) [38] instead of traditional BN. GN has been demonstrated in FCOS [39] to improve the localization and classification performance of detection heads. In the diagram, GNConv indicates a convolution followed by GN, offering a flexible alternative that avoids the issues of BN.
To overcome the lack of information interaction among tasks in typical decoupled detection heads and to fully utilize the high-level features extracted by the backbone network, SGCL employs shared convolution, denoted by the red box in the diagram. This significantly reduces the parameter count, making the model more lightweight, which is especially valuable on resource-constrained devices. Shared convolution is also used in the subsequent classification and regression branches because the F3, F4, and F5 branches learn the same objects at different scales, so parameter sharing is beneficial. For the regression task, scale layers adjust the feature scales to handle inconsistencies in detected object scales across different heads, enabling parameter sharing across scales.
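The sketch below illustrates the SGCL idea: one stack of GN-normalized convolutions is shared across the F3-F5 levels and feeds separate classification, regression, and angle outputs, with a learnable scale layer per level on the regression branch. Channel counts, group sizes, the activation function, and the output parameterization are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    """Learnable per-level scale factor used to align regression outputs across levels."""

    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale


def gn_conv(c_in: int, c_out: int, groups: int = 16):
    """3x3 convolution followed by GroupNorm and SiLU (the GNConv unit in Figure 6);
    c_out must be divisible by the GN group count."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.GroupNorm(groups, c_out),
        nn.SiLU(inplace=True),
    )


class SGCLHeadSketch(nn.Module):
    """Sketch of the SGCL head: shared GN convolutions feed classification,
    box-regression, and angle branches at every level (F3-F5)."""

    def __init__(self, channels: int, num_classes: int, num_levels: int = 3):
        super().__init__()
        self.shared = nn.Sequential(gn_conv(channels, channels),
                                    gn_conv(channels, channels))   # shared across levels
        self.cls_out = nn.Conv2d(channels, num_classes, 1)
        self.reg_out = nn.Conv2d(channels, 4, 1)                   # box offsets
        self.ang_out = nn.Conv2d(channels, 1, 1)                   # rotation angle
        self.scales = nn.ModuleList([Scale() for _ in range(num_levels)])

    def forward(self, feats):
        outputs = []
        for feat, scale in zip(feats, self.scales):
            x = self.shared(feat)                  # same weights for F3, F4, F5
            cls = self.cls_out(x)
            reg = scale(self.reg_out(x))           # per-level scale for regression
            ang = self.ang_out(x)
            outputs.append((cls, reg, ang))
        return outputs
```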

4. Experiment

4.1. DataSet

This study conducted experiments on the HRSC2016 [40], UCAS_AOD [41], and DOTA 1.0 [42] datasets. The HRSC2016 dataset, published by Northwestern Polytechnical University in 2016, is primarily used for ship detection. Images in this dataset are extracted from Google Earth, covering six major ports with various types of ships including aircraft carriers, destroyers, and cruisers. The dataset includes ships both sailing at sea and docked near the coast. Although it contains 1680 images, only 1061 are effectively annotated. The image dimensions range from 300 × 300 to 1500 × 900 pixels, with most larger than 1000 × 600 pixels. The dataset features densely packed ships near the shore, often with overlapping annotation boxes. The complex backgrounds of remote sensing images and the similarity between target ships and coastal textures pose significant challenges, and ships vary widely in scale within the same image.
The UCAS_AOD dataset is specifically designed for remote sensing object detection, containing samples of airplanes, cars, and background instances. The dataset comprises aerial images captured from different regions worldwide using Google Earth software, totaling 2420 images and 14,596 instances. Aircraft and car targets in UCAS_AOD often appear as small objects in remote sensing images with low pixel coverage and minimal detail. Targets are densely distributed with significant size variations amidst various terrain types such as buildings, roads, and vegetation, accompanied by noise interference. Background clutter poses risks of false positives and misses due to confusion with targets.
DOTA 1.0 is a large-scale dataset designed for aerial image object detection tasks, encompassing 2806 images sized at 4000 × 4000 pixels, totaling 188,282 object instances across 15 different categories. Images in DOTA originate from various sensors and platforms collecting aerial images, exhibiting challenges such as significant scale and orientation variations of targets, complex background interference, diverse target shapes, and the presence of small objects. These factors collectively increase the algorithmic complexity of object detection. To perform multi-scale training and testing, the approach began by resizing the original images to three different scales: 0.5, 1.0, and 1.5. These scaled images were then cropped into 1024 × 1024 pixel patches, using a 500-pixel stride. This configuration ensured an overlap of 524 pixels between adjacent patches.
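For reference, a simple sketch of the multi-scale cropping described above (scale factors 0.5/1.0/1.5, 1024 × 1024 patches, 500-pixel stride, and therefore a 524-pixel overlap) is shown below; annotation splitting and border padding are omitted, so it is illustrative only.

```python
from itertools import product

from PIL import Image


def crop_patches(image_path: str, scales=(0.5, 1.0, 1.5),
                 patch_size: int = 1024, stride: int = 500):
    """Multi-scale cropping sketch for DOTA-style images.

    Each image is resized by the given scale factors and tiled into
    patch_size x patch_size crops with the given stride, so adjacent
    patches overlap by patch_size - stride = 524 pixels. Annotation
    handling and boundary padding are intentionally omitted.
    """
    img = Image.open(image_path)
    patches = []
    for s in scales:
        w, h = int(img.width * s), int(img.height * s)
        scaled = img.resize((w, h))
        xs = range(0, max(w - patch_size, 0) + 1, stride)
        ys = range(0, max(h - patch_size, 0) + 1, stride)
        for x, y in product(xs, ys):
            patches.append(scaled.crop((x, y, x + patch_size, y + patch_size)))
    return patches
```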

4.2. Evaluation Metrics and Environment

4.2.1. Evaluation Metrics

To evaluate the effectiveness of object detection methods for remote sensing images, it is crucial to employ a set of quantitative metrics. These metrics, which include Precision, Recall, and mean Average Precision (mAP), are fundamental for the experimental assessment of performance.
Precision assesses the ratio of correctly predicted positive samples among all predicted positive outcomes. It is computed using the following formula, where true positives (TP) and false positives (FP) are used as criteria:
$\mathrm{Precision} = \frac{TP}{TP + FP}$  (7)
Recall evaluates the proportion of correctly predicted positive instances among all actual positive samples, based on the actual samples as the criteria. Recall is represented as follows:
$\mathrm{Recall} = \frac{TP}{TP + FN}$  (8)
In object detection, a high Recall may indicate that the model detects most targets but may come with a higher false positive rate (i.e., lower Precision). Conversely, a high Precision may indicate a higher proportion of correctly predicted positive samples among predicted positives, but it may also mean the model misses more true targets (i.e., lower Recall). mAP combines Precision and Recall to provide a comprehensive evaluation of object detection algorithms. In tasks where multiple object categories need to be detected, mAP computes the Average Precision (AP) for each category and then averages these values to derive the overall mAP score. The formula for mAP is
$\mathrm{mAP} = \frac{\sum_{n=1}^{N} AP_n}{N}$  (9)
where N is the total number of object categories and $AP_n$ represents the average precision of the n-th class of objects, calculated as the area under the Precision–Recall curve. A higher AP indicates a higher average accuracy of the model.
$\mathrm{AP} = \int_0^1 \mathrm{Precision}(t)\, dt$  (10)
Thus, mAP comprehensively considers the detection results of multiple categories, avoiding the limitations of evaluating single-category metrics. It provides a comprehensive, accurate measure of object detection algorithm performance, addressing multi-category issues, assessing model stability, and facilitating easy understanding and comparison. Therefore, mAP has become a widely used evaluation metric in object detection tasks.
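As a concrete illustration of how AP and mAP are computed from a Precision–Recall curve, the following sketch uses the common all-point interpolation; the exact evaluation protocol used for each dataset may differ.

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation),
    a common way to compute AP; the paper's exact protocol may differ."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically decreasing before integrating
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(ap_per_class: dict) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```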
In addition to the primary metrics, other evaluation criteria are often employed. Parameters refer to the number of trainable parameters in a model, with #P indicating the parameter size. This metric reflects the model’s complexity. FLOPs denote the number of floating-point operations required during execution and are a crucial measure of computational complexity.

4.2.2. Experimental Environment

The training parameter settings are as follows: the training runs for 300 epochs with a batch size of 16 and 8 worker processes. Input images are resized to 640 × 640 pixels. The model utilizes the SGD optimizer to adjust the learning rate, with a maximum learning rate set to 1 × 10⁻³ and a minimum of 1 × 10⁻⁵. Weight decay is applied at a rate of 5 × 10⁻⁴ to prevent overfitting, with a momentum of 0.937. Early Stopping is implemented during training; the process terminates automatically when the validation loss stabilizes, indicating basic convergence of the model.
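A minimal sketch of this optimizer configuration is shown below; the decay of the learning rate from 1 × 10⁻³ to 1 × 10⁻⁵ is modeled here with a cosine schedule, which is an assumption since the schedule type is not stated in the paper.

```python
import torch


def build_optimizer_and_scheduler(model, epochs: int = 300):
    """SGD with momentum 0.937 and weight decay 5e-4, as described above;
    the cosine decay from 1e-3 to 1e-5 is an assumed schedule."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.937, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5)
    return optimizer, scheduler
```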
The experimental setup consists of the following components: Red Hat 4.8.5-28 served as the operating system, with PyTorch 2.1, Python 3.11, and CUDA 12.1 being the primary software components. The experiments utilize a cluster environment with three A800 GPUs, each equipped with 256 GB of memory. To balance experimental efficiency and result accuracy, different levels of data augmentation are applied based on the model size. This setup ensures the efficient and effective training of the model while maintaining the accuracy and reliability of the results.

4.3. Ablation Study

Ablation experiments were conducted on the three datasets discussed earlier to evaluate the effectiveness of each module in our approach. YOLOv8n-obb was employed as the baseline, and the results for the HRSC2016 dataset are displayed in Table 1. The cross mark indicates that the module is not added, while the check mark indicates that the module is included. The symbols in Table 2 and Table 3 have the same meaning.
The above experimental results demonstrate the performance of several model configurations, involving three potential components (RMT, PFC, and SGCL), along with their corresponding parameters, FLOPs, and mAP. The baseline model, without any additional modules, had 3,077,414 parameters and achieved an mAP of 87.8%. YOLOv8-obb, as a deep learning framework for real-time object detection, inherently features efficiency and accuracy. Its backbone includes multiple convolutional and pooling layers designed to extract both low-level and high-level features from images. The use of a C2f structure in convolutional layers, augmented with additional skip connections and split operations, enriches feature information while reducing computational overhead. However, these operations, effective in natural scenes, face limitations in detecting numerous small targets amidst the complex backgrounds typical of remote sensing images.
Upon adding the RMT block to the baseline, mAP improves by 1.9%, accompanied by reductions in both Parameters and FLOPs. This improvement stems from RMT’s explicit incorporation of spatial priors and spatial attenuation matrices, enabling the better utilization of spatial information within images and thereby enhancing performance across various visual tasks. Additionally, decomposing the global modeling process along the image’s two axes reduces the computational burden of global information modeling, further lowering the required parameter count. The integration of the progressive fusion approach based on PFC in the neck section increases Parameters and FLOPs, yet provides a 2.7% mAP improvement. This approach combines multi-level feature fusion and skip connections to enhance model representation capability and learning efficiency by adaptively representing the importance levels of different regions and integrating both local and global information effectively. The introduction of the SGCL rotation detection head leads to decreases in Parameters and FLOPs, with an mAP increase of 2.9%. SGCL employs shared convolution, significantly reducing the parameter count by sharing the same convolution kernels or weights across multiple feature maps. Using GN instead of BN enhances computational efficiency by normalizing across channel dimensions without relying on batch-wide statistics. SGCL’s adoption of shared convolution and GN improves information utilization, reduces spatial misalignment issues in the detection head, and enables a more comprehensive exploitation of the input data for predictions, thereby greatly enhancing precision. Adding both the PFC and SGCL modules results in a 2.1% mAP improvement over the baseline; the gains of the individual modules partly offset each other because they operate sequentially and each approaches its performance limit. Conversely, integrating the RMT and SGCL modules substantially improves performance compared to adding a single module, demonstrating their effectiveness across different stages of the detection task. The integration of all three modules achieves optimal performance, enhancing mAP by 3.6%. Combining these modules allows for the joint optimization of multiple network components, thereby improving the network’s performance on complex target detection tasks in remote sensing images.
Similar experimental results were obtained for UCAS_AOD and DOTA datasets, as shown in Table 2 and Table 3. Networks with different combinations of modules outperformed those with single modules. Combining all three modules allows the model to capture and leverage spatial information from remote sensing images, interpret feature dimensions, adaptively integrate features, mitigate background interference, and achieve superior outcomes in regression and classification tasks. Furthermore, the experiments indicate no conflicts among the proposed modules when all methods are employed; the model achieves optimal performance.
For the three ablation experiments described, the trend in Parameters for the same algorithm is similar across experiments, though the values differ slightly. This variation is due to differences in the number of classes in the datasets, which affects the number of parameters learned by the model. However, the complexity of the algorithm remains constant across all three experiments, so the FLOPs for the same algorithm are consistent.

4.4. Comparative Experiment

To provide additional validation for our approach, we compared it with existing classic methods on the HRSC2016 dataset, with the results detailed in Table 4. The up and down arrows next to each metric indicate whether a higher or a lower value is better. Two-stage detectors such as R²CNN, RRPN, R²PN, and the RoI Transformer performed poorly, all achieving detection rates below 80%. This is primarily because these models were not adequately designed to account for characteristics specific to remote sensing images, such as arbitrary orientation and large aspect ratios. Despite considering the transformation of horizontal anchor boxes into rotated ones, the RoI Transformer still relies on the horizontal anchor outputs of the RPN, which limits its detection accuracy to some extent. In contrast, one-stage methods have shown superior performance over two-stage methods, primarily because they directly predict the position and category of targets; these methods typically feature fewer hyperparameters and simpler network structures. Targets in remote sensing images vary significantly in size, shape, and orientation and are often embedded in complex backgrounds. One-stage methods excel in capturing comprehensive contextual information from images and perform end-to-end prediction of target position and category, thereby maintaining a competitive advantage. Among the one-stage methods, all models except ReDet show a significant reduction in the number of parameters (#P) without noticeably affecting their performance. YOLOv8-obb and LPFC-RDet greatly reduce the number of parameters through architectural optimization. Notably, LPFC-RDet has only 4.7 M parameters, indicating a more compact model.
LPFC-RDet showed improvements in mAP compared to the baseline and to both one-stage and two-stage methods, achieving a 3.6% improvement over the baseline and 1% and 1.2% improvements over ReDet and S²A-Net, respectively. This demonstrates that LPFC-RDet, by introducing an RMT feature extraction block, a PFC module, and a rotation-decoupled SGCL detection head, achieves a lightweight design while maintaining high precision, making it more advantageous for practical applications.
A visual comparison of the detection results is shown in Figure 7. The first row displays detections from LPFC-RDet, while the second row shows results from the baseline method. The HRSC2016 dataset includes various types of ships such as aircraft carriers, destroyers, and cruisers, which are predominantly elongated and have high aspect ratios.
From the detections, it is evident that our method correctly identifies the targets, indicating robust recognition capabilities for high aspect ratio objects. In the first three columns of the figure, the baseline method exhibits missed detections. In the fourth column, where there are no actual targets (only islands), the baseline method erroneously detects the island as a target.
This is mainly attributed to the integration of the RMT block into the backbone network, which incorporates spatial prior information into the self-attention mechanism. This enhancement helps the model better understand spatial relationships within images and extract rotation-insensitive features. Additionally, the PFC module employs a three-level progressive fusion calibration strategy. This fusion approach enables the model to better handle interference from complex backgrounds in remote sensing images, enhancing robustness against factors like lighting, viewpoint variations, and occlusions, and effectively suppressing background interference and other issues affecting feature extraction. The SGCL detection head combines the advantages of decoupled and coupled heads. It accomplishes different tasks through angle prediction, bounding box regression, and classification branches. The use of GN instead of conventional BN addresses normalization issues that depend on batch size. Shared convolution is employed across three feature maps of different scales to promote feature interaction, thereby enhancing detection accuracy while achieving a lightweight model design.
As previously described, the backbone primarily utilizes feature pyramids to extract multi-scale features, reflecting the model’s understanding of images at different levels. Figure 8 depicts the feature maps for the first and third columns of Figure 7. The first and third rows show our method, and the second and fourth rows show the baseline method. The colors indicate different activation intensities, where green and yellow regions represent higher activation strengths, indicating that these areas contain features contributing significantly to the model’s decisions. From the visual appearance of the feature maps, our method is more focused on specific image regions and extracts more distinct features. In contrast, the baseline method’s feature maps show more dispersed colors, suggesting broader activation areas where features are less concentrated on target regions, thereby affecting detection accuracy and leading to false positives or missed detections.
On the UCAS-AOD dataset, the detection performance for cars and airplanes is illustrated in Figure 9. The first row depicts the car detection results, revealing dense arrangements of cars with numerous shadows and occlusions. The proposed method effectively detects these targets. In the second row, airplane detection is shown, where airplanes vary in direction and scale, amidst backgrounds containing similar non-target objects. LPFC-RDet accurately identifies the target areas under these conditions, highlighting its robustness against complex background interferences. This demonstrates that the proposed method excels in detecting targets amidst challenging backgrounds.
To assess the performance of our proposed method, we compared it with several state-of-the-art techniques on the DOTA dataset. The comparison results are illustrated in Table 5. The DOTA dataset consists of 15 target classes: Plane (PL), Ship (SH), Storage Tank (ST), Baseball Diamond (BD), Tennis Court (TC), Basketball Court (BC), Ground Track Field (GTF), Harbor (HA), Bridge (BR), Large Vehicle (LV), Small Vehicle (SV), Helicopter (HE), Roundabout (RA), Soccer Ball Field (SBF), and Swimming Pool (SP). In terms of #P and FLOPs, one-stage methods generally exhibit lower values. Notably, YOLOv8-obb and our proposed method have parameter counts of only 6.53 M and 6.1 M, respectively, with FLOPs reduced to 34.8 G and 24.4 G. These reductions are significant compared to other one-stage and two-stage methods. Our method achieved the highest mAP scores in several categories: PL (94.4%), SH (90.3%), TC (94.1%), HA (83.6%), BR (53.5%), LV (85%), RA (66.7%), and SP (76%). Additionally, we achieved the overall best average results across all categories with an mAP of 72.8%. Furthermore, our method not only achieves a lightweight design but also enhances accuracy compared to other methods. Table 5 provides comprehensive experimental results, including average precisions for each category and the overall mAP on the DOTA dataset. Visual detection results for some aerial targets in DOTA are illustrated in Figure 10.
From the detection results shown in the images, our proposed method performs well in detecting various types of targets amidst complex backgrounds. The detection boxes accurately cover different types of objects and demonstrate precise detection in densely populated target areas, showcasing the robustness of our method in diverse environments. Objects with rotation, such as ships and vehicles, are accurately detected with bounding boxes aligned correctly to their actual orientations, indicating that our method effectively addresses rotation challenges in remote sensing images. In complex scenes like harbors and parking lots depicted in the images, our method successfully detects targets, showcasing strong background suppression capabilities through the PFC module. Even in scenarios with occluded objects, such as vehicles next to buildings in the images, our method shows good performance in detecting partially occluded targets. These outstanding results validate the effectiveness of LPFC-RDet.
For a more intuitive comparison of our proposed method, we select representative images from the test set containing complex scenes with a small proportion of targets and dense sample distribution. These images are used to compare with several neural network algorithms. The detection results are shown in Figure 11, where the first column presents the results of our method, the second column shows results from the one-stage detection model Rotate-Retinanet, and the third column shows results from the two-stage detection model Rotated Faster-RCNN.
The first row depicts an area with an airport, parking lot, and factory, featuring airplanes, small vehicles, and large objects (containers and trucks) densely arranged. Rotate-Retinanet missed some small vehicles in the middle area and containers in the upper-right corner, while Faster-RCNN missed detections in the container area. The second row showcases a parking lot with small and large vehicles densely parked, some of which are partially occluded. Both alternative methods experienced missed detections, especially on the right side where vehicles are not fully visible. The third row illustrates a scene with water bodies where both alternative methods missed detections. This highlights our method’s significant advantage in handling targets with varying rotation angles, complex background interference, and occlusions.
Therefore, our proposed method, LPFC-RDet, leverages the RMT block, PFC module, and SGCL detection head to address challenges in detecting rotated multi-angle targets, complex backgrounds, and occlusions, achieving lightweight design. Compared to popular one-stage and two-stage detectors, our method demonstrates superior performance in rotated object detection, achieving noticeable improvements in detection accuracy.

5. Conclusions

This study introduces LPFC-RDet, a lightweight rotated object detection network designed for remote sensing images. It incorporates three key modules: the RMT feature extraction block, PFC module, and SGCL head. These modules aim to tackle challenges such as varying rotation angles, complex backgrounds, occlusions, and the trade-off between speed and accuracy in remote sensing target detection.
The RMT block integrates retention mechanisms from Retentive Networks and global contextual modeling capabilities from Vision Transformers. By injecting spatial prior information and self-attention mechanisms, it enhances feature extraction capabilities to learn rotation-insensitive feature representations. The PFC module introduces joint weighting of pixels, channels, and spatial dimensions to calibrate features, effectively blending features from multiple levels to enhance model robustness against input variations and improve detection accuracy in complex backgrounds. The SGCL design decouples the detection task into multiple sub-tasks, implements parameter sharing and feature interaction using shared group convolution layers, reduces parameter count, and enhances model stability and generalization with GN, achieving a balance between detection accuracy and speed. To validate the effectiveness of our approach, extensive experiments were conducted on HRSC2016, DOTA, and UCAS-AOD datasets. The results demonstrate improved precision over current mainstream one-stage and two-stage methods. Furthermore, ablation studies validate the efficacy of each module, providing insights for further improvements.
Although LPFC-RDet performs well in rotated object detection in remote sensing, there are areas for further exploration. Future research could explore integrating advanced deep learning architectures or attention mechanisms to efficiently extract features, further improving detection precision and speed. Additionally, expanding and utilizing larger-scale, more diverse remote sensing image datasets for training and validation could enhance model generalization and practicality.
At the same time, VC theory [45] is an important theory in machine learning for studying the capacity of learning algorithms, in particular the relationship between model complexity and generalization ability. In future work, we will continue to investigate regularization methods and parameter reduction guided by VC theory, in order to achieve a balance between model complexity and generalization ability.

Author Contributions

Conceptualization, J.L. and Y.C.; methodology, J.L., Y.W. and P.S.; software, D.J., C.G. and H.Z.; validation, P.S.; visualization, Y.C.; writing—original draft, J.L. and H.Z.; writing—review & editing, Y.W. and C.G. All authors have read and agreed to the published version of this manuscript.

Funding

This research received no external funding.

Data Availability Statement

The HRSC2016 is available at https://www.kaggle.com/datasets/guofeng/hrsc2016 (accessed on 5 March 2024); The UCAS-AOD is available at https://github.com/ming71/UCAS-AOD-benchmark (accessed on 5 March 2024); The DOTA is available at https://captain-whu.github.io/DOTA/dataset.html (accessed on 12 April 2024).

Acknowledgments

The authors thank the editor and the anonymous reviewers for carefully reading this paper and providing many insightful and helpful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  2. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  3. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  4. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  6. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  9. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  12. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  13. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar]
  14. Tarasiou, M.; Chavez, E.; Zafeiriou, S. Vits for sits: Vision transformers for satellite image time series. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10418–10428. [Google Scholar]
  15. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  16. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  17. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote. Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  18. Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795. [Google Scholar]
  19. Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579. [Google Scholar]
  20. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing imagery. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 150–165. [Google Scholar]
  21. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geosci. Remote. Sens. Lett. 2018, 15, 1745–1749. [Google Scholar] [CrossRef]
  22. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
  23. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  24. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  25. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards Balanced Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  26. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  27. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  28. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  29. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. Scrdet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2384–2399. [Google Scholar] [CrossRef] [PubMed]
  30. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11207–11216. [Google Scholar]
  31. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 2458–2466. [Google Scholar]
  32. Zhao, T.; Liu, N.; Celik, T.; Li, H.C. An arbitrary-oriented object detector based on variant gaussian label in remote sensing images. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  33. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 677–694. [Google Scholar]
  34. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. arXiv 2024, arXiv:2403.06258. [Google Scholar]
  35. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
  36. Liu, J.; Jing, D.; Zhang, H.; Dong, C. SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images. Electronics 2024, 13, 2358. [Google Scholar] [CrossRef]
  37. Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. RMT: Retentive networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2024; pp. 5641–5651. [Google Scholar]
  38. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  39. Tian, Z.; Chu, X.; Wang, X.; Wei, X.; Shen, C. Fully convolutional one-stage 3d object detection on lidar range images. Adv. Neural Inf. Process. Syst. 2022, 35, 34899–34911. [Google Scholar]
  40. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A High Resolution Optical Satellite Image Dataset for Ship Recognition and Some New Baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods—ICPRAM, Porto, Portugal, 24–26 February 2017; pp. 324–331. [Google Scholar] [CrossRef]
  41. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar]
  42. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  43. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918. [Google Scholar]
  44. Ren, Z.; Tang, Y.; He, Z.; Tian, L.; Yang, Y.; Zhang, W. Ship detection in high-resolution optical remote sensing images aided by saliency information. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  45. Devroye, L.; Györfi, L.; Lugosi, G. Vapnik-Chervonenkis Theory. In A Probabilistic Theory of Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 1996; pp. 187–213. [Google Scholar]
Figure 1. Difficulties in remote sensing target detection: (a) targets rotated at different angles; (b) the challenge posed by horizontal bounding boxes; (c) complex background interference; (d) occluded targets.
Figure 2. Overall architecture of LPFC-RDet. The backbone, built from Conv and RMT blocks, produces feature maps C3, C4, and C5; these are fused by the PFC module in the neck to yield F3–F5. The SGCL detection head, using shared group convolutions, predicts class labels and regresses rotated bounding boxes.
Figure 3. Overall architecture of the RMT block: a 3 × 3 depthwise convolution (DWConv), layer normalization (LN), Manhattan self-attention, and a feed-forward network (FFN), with residual connections throughout.
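To relate the Figure 3 caption to code, the following PyTorch-style sketch mirrors the block's data flow. It is an illustration only: standard multi-head attention with an additive Manhattan-distance decay bias stands in for the Manhattan self-attention of [37], and the width, head count, decay rate, and MLP ratio are placeholders rather than the values used in LPFC-RDet.

import torch
import torch.nn as nn

class RMTBlockSketch(nn.Module):
    # Sketch of the Figure 3 block: DWConv -> LN + attention -> LN + FFN,
    # each with a residual connection. Not the official RMT implementation.
    def __init__(self, dim=64, num_heads=4, decay=0.9, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # 3x3 depthwise conv
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.decay = decay

    def manhattan_bias(self, h, w, device):
        # Additive attention bias that decays with the Manhattan distance between tokens.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float().to(device)
        dist = (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)     # (HW, HW)
        return dist * torch.log(torch.tensor(self.decay, device=device))

    def forward(self, x):                       # x: (B, C, H, W)
        x = x + self.dwconv(x)                  # residual depthwise convolution
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # tokens: (B, HW, C)
        q = self.ln1(t)
        a, _ = self.attn(q, q, q, attn_mask=self.manhattan_bias(h, w, x.device))
        t = t + a                               # residual attention
        t = t + self.ffn(self.ln2(t))           # residual feed-forward network
        return t.transpose(1, 2).reshape(b, c, h, w)

The block preserves spatial resolution, e.g. RMTBlockSketch(64)(torch.randn(1, 64, 32, 32)) returns a tensor of the same shape; farther token pairs receive a more negative bias, so attention decays with spatial distance.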
Figure 4. Architecture of the PFC module. Pixel-level, local-level, and global-level joint weights calibrate the features; skip connections and a 1 × 1 convolution layer produce the final feature map.
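For illustration, the sketch below shows one plausible way to combine pixel-, local-, and global-level weights to calibrate a single feature map, echoing the Figure 4 caption; the gating layers, kernel size, and multiplicative fusion are assumptions, and the actual PFC module additionally fuses features progressively across levels as described in the main text.

import torch
import torch.nn as nn

class PFCCalibrationSketch(nn.Module):
    # Hypothetical calibration step: three weight maps (per-pixel, local context,
    # global channel) gate the input, and a skip connection plus a 1x1 convolution
    # produce the output.
    def __init__(self, ch, local_k=7):
        super().__init__()
        self.pixel_w = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.local_w = nn.Sequential(
            nn.Conv2d(ch, ch, local_k, padding=local_k // 2, groups=ch), nn.Sigmoid())
        self.global_w = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(ch, ch, 1)        # final 1x1 convolution

    def forward(self, x):
        w = self.pixel_w(x) * self.local_w(x) * self.global_w(x)  # joint weight
        return self.fuse(x * w + x)             # calibrated features + skip connection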
Figure 5. Comparison of decoupled and coupled head structures.
Figure 6. Structure of the decoupled SGCL detection head for rotated targets. It includes an angle prediction branch (Ori.), a bounding box regression branch (Reg.), and a classification branch (Cls.), with the shared convolution indicated by the red box.
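The head layout of Figure 6 can likewise be paraphrased in code. In the sketch below, a stem of grouped 3 × 3 convolutions is shared by the classification, regression, and angle branches, which is how parameter sharing and feature interaction are obtained while each branch stays a single 1 × 1 convolution; the channel width, group count, and normalization are placeholders, not the paper's exact configuration.

import torch
import torch.nn as nn

class SGCLHeadSketch(nn.Module):
    # Decoupled head with a shared group-convolution stem (the red box in
    # Figure 6). Branch widths and depths are illustrative placeholders.
    def __init__(self, ch=64, num_classes=15, groups=4):
        super().__init__()
        self.shared = nn.Sequential(                       # parameters reused by all tasks
            nn.Conv2d(ch, ch, 3, padding=1, groups=groups), nn.GroupNorm(groups, ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1, groups=groups), nn.GroupNorm(groups, ch), nn.SiLU())
        self.cls = nn.Conv2d(ch, num_classes, 1)           # classification branch (Cls.)
        self.reg = nn.Conv2d(ch, 4, 1)                     # box regression branch (Reg.)
        self.ori = nn.Conv2d(ch, 1, 1)                     # angle prediction branch (Ori.)

    def forward(self, feats):                              # feats: pyramid maps F3..F5
        outs = []
        for f in feats:
            s = self.shared(f)                             # shared, task-interacting features
            outs.append((self.cls(s), self.reg(s), self.ori(s)))
        return outs

Because the same shared stem is applied at every pyramid level, the per-task branches add very few parameters, which is consistent with the lightweight goal of the SGCL head.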
Figure 7. Comparison of detection results on the HRSC2016 dataset; the first row shows our method and the second row the baseline method.
Figure 8. Visual comparison of feature maps. For the images in columns 1 and 3 of Figure 7, the feature maps in rows 1 and 3 are generated by our method, while those in rows 2 and 4 are generated by the baseline method.
Figure 9. Performance on the UCAS-AOD dataset.
Figure 10. Performance on the DOTA dataset.
Figure 11. Comparison of detection results for several algorithms on the DOTA dataset.
Table 1. Ablation experiments on the HRSC2016 dataset.
RMT | PFC | SGCL | Parameters | FLOPs  | mAP
 ×  |  ×  |  ×   | 3,077,414  | 30.9 G | 87.8
 ✓  |  ×  |  ×   | 2,769,638  | 27.9 G | 89.7
 ×  |  ✓  |  ×   | 3,230,535  | 32.0 G | 90.5
 ×  |  ×  |  ✓   | 2,376,711  | 24.6 G | 90.7
 ×  |  ✓  |  ✓   | 2,529,832  | 25.7 G | 89.9
 ✓  |  ×  |  ✓   | 2,068,935  | 21.2 G | 90.9
 ✓  |  ✓  |  ✓   | 2,222,056  | 22.3 G | 91.4
Table 2. Ablation experiments on the UCAS_AOD dataset.
RMT | PFC | SGCL | Parameters | FLOPs  | mAP
 ×  |  ×  |  ×   | 3,006,038  | 30.9 G | 94.9
 ✓  |  ×  |  ×   | 2,769,833  | 27.9 G | 95.5
 ×  |  ✓  |  ×   | 3,230,730  | 32.0 G | 93.8
 ×  |  ×  |  ✓   | 2,376,776  | 24.6 G | 95.1
 ×  |  ✓  |  ✓   | 2,529,897  | 25.7 G | 95.3
 ✓  |  ×  |  ✓   | 2,069,000  | 21.2 G | 95.0
 ✓  |  ✓  |  ✓   | 2,222,121  | 22.3 G | 96.8
Table 3. Ablation experiments on the DOTA 1.0 dataset.
RMT | PFC | SGCL | Parameters | FLOPs  | mAP
 ×  |  ×  |  ×   | 3,008,573  | 30.9 G | 67.5
 ✓  |  ×  |  ×   | 2,772,368  | 27.9 G | 71.9
 ×  |  ✓  |  ×   | 3,233,265  | 32.0 G | 71.6
 ×  |  ×  |  ✓   | 2,377,621  | 24.6 G | 70.8
 ×  |  ✓  |  ✓   | 2,530,742  | 25.7 G | 71.5
 ✓  |  ×  |  ✓   | 2,069,845  | 21.2 G | 71.8
 ✓  |  ✓  |  ✓   | 2,222,966  | 22.3 G | 72.8
Table 4. Experimental results comparing with state-of-the-art methods on the HRSC2016 dataset.
Method               | Backbone       | Size      | #P ↓    | FLOPs ↓  | mAP ↑
Two-stage:
R²CNN [19]           | ResNet101      | 800 × 800 | 41.12 M | 198.4 G  | 73.1
RRPN [22]            | ResNet101      | 800 × 800 | 38.71 M | 180.25 G | 79.1
R²PN [21]            | VGG16          | 384 × 384 | 55.11 M | 120.43 G | 79.6
RoI Transformer [23] | ResNet101      | 800 × 800 | 55.05 M | 162.58 G | 86.2
One-stage:
R³Det [27]           | ResNet101      | 800 × 800 | 41.53 M | 156.78 G | 89.3
ReDet [18]           | ResNet101      | 800 × 800 | 56.21 M | 130.58 G | 90.4
RRD [43]             | VGG16          | 384 × 384 | 53.17 M | 128.56 G | 84.3
R-RetinaNet [6]      | ResNet101      | 800 × 800 | 36.13 M | 209.58 G | 89.2
S²A-Net [26]         | ResNet101      | 800 × 800 | 38.5 M  | 196.21 G | 90.2
SDet [44]            | ResNet101      | 800 × 800 | 45.26 M | 129.4 G  | 89.2
YOLOv8-obb           | CSPDarknet     | 640 × 640 | 6.2 M   | 33.2 G   | 87.8
LPFC-RDet            | CSPDarknet+RMT | 640 × 640 | 4.7 M   | 24 G     | 91.4
Table 5. Performance evaluation on the DOTA dataset.
Method               | #P ↓    | FLOPs ↓  | mAP ↑ | PL    | SH    | ST    | BD    | TC    | BC    | GTF   | HA    | BR    | LV    | SV    | HE    | RA    | SBF   | SP
Two-Stage:
R²CNN [19]           | 41.14 M | 198.41 G | 61.25 | 81.26 | 56.17 | 73.21 | 65.88 | 91.24 | 67.25 | 68.23 | 56.21 | 35.63 | 51.35 | 60.32 | 55.98 | 53.32 | 55.65 | 53.48
R²PN [21]            | 39.6 M  | 187.3 G  | 60.56 | 88.66 | 57.23 | 67.21 | 71.20 | 90.56 | 72.34 | 59.35 | 53.28 | 31.76 | 56.28 | 51.93 | 53.58 | 52.84 | 56.71 | 52.34
RoI Transformer [23] | 55.13 M | 199.52 G | 69.75 | 88.75 | 83.69 | 81.65 | 78.56 | 90.98 | 77.35 | 75.96 | 62.96 | 43.56 | 73.58 | 68.88 | 62.93 | 53.65 | 58.63 | 58.94
SCRNet [28]          | -       | -        | 72.35 | 89.33 | 72.25 | 86.61 | 80.14 | 90.31 | 87.47 | 68.32 | 66.21 | 52.12 | 60.13 | 68.21 | 66.33 | 66.36 | 64.32 | 68.12
One-Stage:
R-RetinaNet [6]      | 36.42 M | 215.92 G | 68.43 | 88.54 | 72.43 | 76.52 | 81.32 | 90.56 | 82.79 | 65.81 | 63.73 | 44.65 | 55.65 | 67.32 | 63.31 | 63.58 | 54.13 | 69.72
RSDet [31]           | 37.52 M | 239.1 G  | 72.13 | 89.50 | 70.13 | 83.42 | 82.91 | 90.34 | 85.36 | 65.17 | 65.53 | 48.43 | 71.20 | 69.32 | 65.31 | 63.46 | 62.17 | 67.32
R³Det [27]           | 41.9 M  | 335.74 G | 71.72 | 89.69 | 77.36 | 83.34 | 81.87 | 90.88 | 81.36 | 62.35 | 65.36 | 48.66 | 74.36 | 70.56 | 65.36 | 59.78 | 61.36 | 67.86
YOLOv8-obb           | 6.53 M  | 34.8 G   | 67.5  | 92.8  | 88.8  | 69.9  | 71.7  | 92.7  | 58.1  | 55.7  | 81.5  | 41.9  | 83.8  | 59.5  | 55.6  | 54.2  | 40.9  | 66.21
LPFC-RDet            | 6.1 M   | 24.4 G   | 72.8  | 94.4  | 90.3  | 76.5  | 78.9  | 94.1  | 63.1  | 62.6  | 83.6  | 53.5  | 85    | 63.5  | 58.9  | 66.7  | 44.3  | 76.13

