1. Introduction
Remote sensing images, a crucial source of information about the Earth’s surface, have demonstrated irreplaceable value in applications such as environmental monitoring, urban development, and military surveillance. However, as remote sensing technology advances rapidly, the volume of remote sensing data being generated has grown sharply, and accurately and efficiently identifying target objects in vast collections of remote sensing images has become a significant challenge in contemporary research. In recent years, the flourishing development of deep learning has opened a new research direction for remote sensing image analysis: with its outstanding feature extraction and learning capabilities, deep learning significantly enhances the precision of object detection and recognition in remote sensing applications.
Object detection frameworks are generally divided into two main categories: one-stage detectors and two-stage detectors. Two-stage detectors first extract Regions of Interest (RoI) and then use detection heads to classify the RoI features into target categories and regress their positions; typical representatives include Faster-RCNN [1], Mask RCNN [2], and Cascade-RCNN [3]. In contrast, one-stage detectors have no foreground-filtering step and instead directly classify and regress on the feature maps generated by the backbone network; typical one-stage detectors include SSD [4], RetinaNet [5], and the YOLO series [6,7,8,9,10,11,12]. Many object detection methods build on these two detector frameworks. Recently, emerging methods such as the Vision Transformer [13], DETR [14], and Mamba [15] have achieved new state-of-the-art results in various fields. Owing to changes in factors such as shooting angle and sensor attitude, target detection faces challenges including large scale differences, complex and variable backgrounds, and an imbalance between easy and hard samples. Among these, scale difference is a core challenge in remote sensing target detection, as it directly affects the effectiveness of feature extraction, fusion, regression, and classification.
Feature Extraction: Significant differences in target scales in remote sensing images lead to notable changes in the visual features of targets, including texture, shape, and edge information [16]. Traditional feature extraction methods often struggle to capture target features across all scales [17,18,19]; this limitation arises from their reliance on fixed scales, which renders feature extractors unable to adapt to scale changes. Deep learning methods primarily mitigate large scale differences by constructing multi-scale feature pyramids [16,20]. However, small targets in remote sensing images may be gradually diluted or lost during feature pyramid construction, and large targets exhibit complex and diverse appearances, which also complicates feature extraction [21]. The model proposed in this article therefore designs a Scale-Robust Feature Network (SRFN) in the Backbone stage, which includes a downsampling module (ADown) and a Deformable Attention (DAttention) mechanism [22]. ADown reduces the information loss caused by downsampling during feature pyramid construction, while DAttention dynamically adjusts sampling positions and attention weights to strengthen the modeling of distant targets, thereby improving feature extraction.
Feature Fusion: Multi-scale feature fusion is a pivotal technique for enhancing detection performance in remote sensing target detection. Features at different scales exhibit significant differences in expression form, resolution, and semantic information [20,23], which makes it difficult for simple fusion methods to combine them directly and exploit their information effectively. Many methods employ simple feature stacking or concatenation for fusion, overlooking the differences and complementarity between features at different scales [4,20,24]. Moreover, most methods only perform fusion at a single level, such as fusing solely at the last layer of the backbone network, thereby failing to fully exploit the semantic information and resolution of different layers. Furthermore, current fusion methods lack adaptability and cannot adjust their fusion strategies to specific tasks and datasets; consequently, models generalize poorly across different scenarios and fail to adapt to diverse remote sensing image data [4,5,25]. To address these challenges, an Adaptive Feature Aggregation and Diffusion (AFAD) module is designed in the Neck, which accurately selects multi-scale features through a channel-aware selection mechanism, aggregates local and contextual information using multiple convolutional kernels, and transmits information to various scales through a dedicated diffusion mechanism. This effectively addresses large scale differences, background interference, and the need for adaptive fusion strategies, significantly improving the model’s feature fusion ability.
Regression and Classification: In object detection, regression predicts the positions and sizes of bounding boxes. However, when target scales differ greatly, regression algorithms may struggle to adapt to targets of varying scales, resulting in significant deviations between the predicted and actual positions and sizes [26,27,28,29]. Because the visual features of targets vary with scale, classifiers may also struggle to recognize targets across all scales accurately, degrading classification performance. This article proposes a Focaler-GIoU Loss to enhance the similarity measurement between predicted and ground truth bounding boxes and to prioritize challenging samples during training, thereby improving detection accuracy.
Therefore, this article proposes a Scale-Robust Feature Aggregation and Diffusion Network (SRFAD-Net) for remote sensing image detection, which comprises a Backbone, Neck, and Head. The network addresses the challenges of feature extraction, fusion, regression, and classification caused by large scale differences in remote sensing images through different strategies. In the Backbone stage, we design SRFN to extract key features from images. To enhance the robustness of the network to targets of different scales and shapes and to alleviate the loss of small-target information during downsampling, we adopt a special ADown module that combines the advantages of average pooling and max pooling, thereby highlighting important features while retaining background information. At the upper level of the feature pyramid, we introduce a DAttention mechanism that can adjust its focus according to the shape and size of the object, enabling the network to capture distant targets in the image more accurately. In the feature fusion section, we propose an AFAD module that intelligently selects and integrates feature information from different channels and effectively transmits this information to various scales through a diffusion mechanism to achieve comprehensive feature fusion. Finally, in the detection head section, the network converts the fused features into the final prediction result. To evaluate the match between the predicted results and the actual targets more accurately, we adopt a Focaler-GIoU loss function, which helps the model focus on challenging samples and effectively handles non-overlapping bounding boxes, further improving detection accuracy. Our main contributions are as follows:
The ADown downsampling module combines the benefits of average pooling and max pooling. This not only extends the contextual information capture range but also adeptly preserves crucial features like edges, corners, and textures during the construction of a multi-scale feature pyramid. Consequently, it greatly enhances the network’s resilience to targets of varied scales and shapes.
The DAttention module merges Deformable Convolution and self-attention mechanisms to dynamically adjust sampling positions and attention weights. This enables the model to concentrate on distant target areas in the image, extract more precise and accurate feature representations, and enhance its adaptability to objects of diverse shapes and scales. Consequently, it improves the model’s capability to model long-range targets.
The AFAD module employs a channel-aware selection mechanism to precisely select features. It aggregates local and contextual information via multiple convolutional kernels, and then uniformly disperses features containing rich contextual information across various scales through a unique diffusion mechanism. This greatly enhances the network’s information fusion capability and effectively mitigates issues arising from large-scale differences and background interference.
The Focaler-GIoU Loss merges the benefits of Focaler-IoU in emphasizing challenging samples and the advantages of GIoU in managing non-overlapping bounding boxes. This results in higher accuracy in describing target positions and shapes, thereby enhancing detection accuracy.
The structure of this article is as follows:
Section 2 provides a comprehensive review of relevant literature on remote sensing target detection and feature extraction.
Section 3 details our proposed model, thoroughly outlining each module.
Section 4 shows the results of ablation and comparative experiments conducted on our model architecture, highlighting the effectiveness of the module design and model performance, along with a visual analysis. Lastly,
Section 5 concludes the article and outlines potential directions for future research.
3. Method
3.1. Overall Architecture
Our proposed SRFAD-Net follows the structure of mainstream object detection networks, comprising Backbone, Neck, and Heads. In
Figure 1, the Backbone constructs SRFN and utilizes a CNN to build a five-layer feature pyramid for extracting information at varying scales. Layers 2 to 4 consist of the ADown module and the C2f module; ADown combines average pooling and max pooling to achieve downsampling while minimizing the loss of image features. The DAttention module is introduced in the fifth layer to accurately model long-range information in remote sensing images. Subsequently, the extracted feature maps are further processed and aggregated in the Neck via the AFAD module, which consists of the AFA module and the C2f module; multi-scale feature aggregation and diffusion are achieved through channel-aware selection, multi-kernel feature extraction, and feature diffusion. The Head layer receives features from the Neck layer and performs the core object detection tasks, accurately predicting the position and category of each target. To improve the precision of describing the target’s position and shape during regression, the Focaler-GIoU Loss is utilized. The final output of the detection network comprises precise bounding box coordinates and class probabilities, ensuring accurate detection results.
3.2. Scale-Robust Feature Network
Feature extraction holds a crucial role in remote sensing target detection, yet the significant variations in target scale within remote sensing images present formidable challenges. These differences manifest as substantial variations in target visual features such as texture, shape, and edge information. Traditional feature extraction methods, constrained by fixed scales and feature extractors, often struggle to comprehensively capture target features across diverse scales. To address these challenges, deep learning approaches commonly construct multi-scale feature pyramids. However, building feature pyramids for remote sensing images presents difficulties: small targets may be diluted or lost, while large targets exhibit intricate and diverse features, complicating feature extraction. In response, this paper proposes SRFN in its Backbone component, as illustrated in
Figure 2. Layer 1 conducts convolution operations to extract low-level feature information from input images. Layers 2–4 consist of ADown and C2f modules; the C2f module enriches the model’s gradient flow by connecting additional branches. Layer 5 comprises ADown, DAttention, and SPPF modules.
The core of SRFN lies in the design of the ADown downsampling module and the DAttention mechanism. The ADown module aims to minimize information loss during downsampling while preserving the integrity of small-target features. Concurrently, the DAttention mechanism dynamically adjusts sampling positions and attention weights, enhancing the model’s ability to capture features of distant targets and markedly improving feature extraction. By integrating these two strategies, SRFN effectively extracts robust features for targets of various scales in remote sensing images, laying a solid foundation for subsequent object detection tasks.
3.2.1. DAttention Module
Given an input feature map $x \in \mathbb{R}^{H \times W \times C}$, where $H$ denotes the height, $W$ the width, and $C$ the number of channels, the feature map is first linearly projected to obtain the query tokens $q = x W_q$. Subsequently, the reference points on the upper path are downsampled with a scaling coefficient $r$ and arranged into a uniform grid $p$. The offsets $\Delta p$ are obtained from the offset network with the queries as input, $\Delta p = \theta_{\mathrm{offset}}(q)$, and added to the reference points to obtain the deformed positions $p + \Delta p$. The features at the deformed reference points are then sampled with the bilinear interpolation function $\phi(\cdot\,;\cdot)$ to obtain $\tilde{x}$, $\tilde{k}$, and $\tilde{v}$, as shown below:
$$\tilde{x} = \phi(x;\, p + \Delta p), \qquad \tilde{k} = \tilde{x} W_k, \qquad \tilde{v} = \tilde{x} W_v.$$
Here, $W_q$, $W_k$, and $W_v$ are the projection matrices. Multi-head attention can then be calculated with relative position encoding added:
$$z^{(m)} = \sigma\!\left(\frac{q^{(m)} \tilde{k}^{(m)\top}}{\sqrt{d}} + B_r\right) \tilde{v}^{(m)},$$
where $d$ is the dimension of each head, $\sigma(\cdot)$ denotes the softmax function, $B_r$ denotes the relative position bias, and $z^{(m)}$ denotes the embedding output from the $m$-th attention head. The final output $z$ is obtained by concatenating the multi-head outputs, $z = \mathrm{Concat}\big(z^{(1)}, \dots, z^{(M)}\big)$.
From Figure 2, it can be seen that there are three aircraft targets in the leftmost image (as shown in the red box). The heat map obtained after four layers of downsampling is shown in the second image. After passing through the DAttention module, the network fully captures the target information while ignoring the interference of the complex background, as shown in the rightmost image. Therefore, the introduction of DAttention not only helps the model focus on the key regions related to the target objects, ignoring the background and other irrelevant information, but also helps the model locate the targets more accurately, improving detection performance.
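To make the sampling-and-attention pipeline above concrete, the following PyTorch sketch implements a single-head variant of deformable attention. It is an illustrative reading of the equations rather than the exact DAttention configuration used in SRFAD-Net: the offset-network layout, the tanh scaling of offsets, and the omission of the multi-head split and relative position bias are simplifying assumptions, and `grid_sample` plays the role of the bilinear interpolation function $\phi$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Single-head sketch of deformable attention (illustrative, not the exact SRFAD-Net layer)."""

    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.scale = dim ** -0.5
        self.r = r                                # downsampling factor for the reference grid
        self.proj_q = nn.Conv2d(dim, dim, 1)
        self.proj_k = nn.Conv2d(dim, dim, 1)
        self.proj_v = nn.Conv2d(dim, dim, 1)
        self.proj_out = nn.Conv2d(dim, dim, 1)
        # Offset network: predicts a 2-D offset for every (downsampled) reference point.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=r, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.proj_q(x)                                    # queries q = x W_q
        offsets = self.offset_net(q).permute(0, 2, 3, 1)      # Δp = θ_offset(q), shape (B, H/r, W/r, 2)
        Hr, Wr = offsets.shape[1], offsets.shape[2]

        # Uniform reference grid p in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, Hr, device=x.device)
        xs = torch.linspace(-1, 1, Wr, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack((ref_x, ref_y), dim=-1).expand(B, -1, -1, -1)

        # Deformed positions p + Δp, then bilinear sampling (the φ operator).
        pos = (ref + offsets.tanh()).clamp(-1, 1)
        x_sampled = F.grid_sample(x, pos, mode="bilinear", align_corners=True)

        k = self.proj_k(x_sampled).flatten(2)                 # k̃ = x̃ W_k, shape (B, C, Hr*Wr)
        v = self.proj_v(x_sampled).flatten(2)                 # ṽ = x̃ W_v
        q = q.flatten(2)                                      # (B, C, H*W)

        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)   # (B, HW, HrWr)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return self.proj_out(out)


if __name__ == "__main__":
    layer = DeformableAttention(dim=64, r=2)
    print(layer(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

In this sketch, the downsampling factor `r` controls how many deformed reference points each query attends to, which is what keeps the attention affordable on large feature maps.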
3.2.2. ADown Module
During the processing of remote sensing image detection tasks, feature pyramids are constructed through downsampling. Typically, this involves using convolution with a stride of 2 to reduce the image size, as illustrated in
Figure 3a. Remote sensing images, characterized by diverse imaging methods, often exhibit substantial variations in complex backgrounds and target orientations. This variability poses a challenge, as downsampling can easily result in the loss of information about small targets, leading to false positives and missed detections, thereby compromising the overall detection performance. Hence, the implementation of an effective downsampling strategy is crucial for extracting key features while minimizing computational costs. This article adopts the ADown module, as shown in
Figure 3b. This approach combines the benefits of average and max pooling, effectively preserving both rich background information and salient features during downsampling. It also enhances the network’s robustness in detecting targets of varying scales and shapes.
The ADown module first accepts an input feature map, which typically comes from the previous layer of a convolutional neural network, and performs average pooling. The resulting output is simultaneously fed into two branches: one branch applies a convolution with a stride of 2, and the other executes a max pooling operation. The outcomes from both branches are then concatenated and forwarded as the output. Average pooling retains the global information of the feature map, expanding the receptive field of subsequent convolutional layers and facilitating the capture of a broader range of contextual information. Concurrently, max pooling accentuates the local maxima within the feature map, which represent its most salient features. The output feature maps from the two branches are subsequently fused; this fusion combines the benefits of average pooling and max pooling, yielding a new feature map that encompasses both global information and prominent features.
The ADown module enlarges the receptive field of subsequent convolutional layers and captures contextual information over a broader range while preserving significant features such as edges, corners, and textures. This boosts the network’s resilience to targets of varied scales and shapes. At the same time, by reducing the feature dimensionality, it maintains high computational efficiency.
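A minimal PyTorch sketch of this two-branch downsampling is given below. It follows the description above (average pooling, then a stride-2 convolution branch and a max pooling branch whose outputs are concatenated); the kernel sizes, the pointwise convolution after max pooling, the BatchNorm/SiLU choices, and the equal channel allocation between branches are assumptions rather than the exact ADown configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADown(nn.Module):
    """Sketch of the ADown block: avg pool, then a strided-conv branch and a max-pool branch."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_half = c_out // 2
        # Branch 1: 3x3 convolution with stride 2 (learned downsampling).
        self.conv_s2 = nn.Sequential(
            nn.Conv2d(c_in, c_half, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )
        # Branch 2: 1x1 convolution applied after max pooling (assumed detail).
        self.conv_pw = nn.Sequential(
            nn.Conv2d(c_in, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stride-1 average pooling: enlarges the receptive field and retains
        # global/background information before the resolution is reduced.
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)
        a = self.conv_s2(x)                                        # learned stride-2 path
        b = self.conv_pw(F.max_pool2d(x, 3, stride=2, padding=1))  # salient-feature path
        return torch.cat((a, b), dim=1)                            # fuse global + salient cues


if __name__ == "__main__":
    block = ADown(64, 128)
    print(block(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```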
3.3. AFAD Module
Multi-scale feature fusion plays a pivotal role in optimizing the detection performance for remote sensing target detection. However, integrating features across different scales poses significant challenges due to variations in expression form, resolution, and semantic information. Traditional fusion methods, such as simple feature stacking or concatenation, often fail to fully leverage the complementarity and differences between features at different scales, leading to suboptimal fusion outcomes. Moreover, most methods only fuse features at a single level, limiting the model’s exploitation of semantic information and resolution across different levels, while overlooking the importance of feature weights at different scales. To tackle these challenges, we introduce the AFAD module. The AFAD module adopts a channel-aware strategy, fully considering the weight and importance of features at different scales, effectively integrating local and contextual information, and spreading this information to various scales, thereby significantly enhancing the model’s feature fusion capability. This design not only considers the complementarity between features at different scales but also endows the model with the ability to adaptively adjust fusion strategies, allowing it to flexibly adjust according to the characteristics of different tasks and datasets. The AFAD module achieves superior results in multi-scale feature fusion. This approach enhances the accuracy and generalization capabilities for remote sensing target detection.
The AFAD module structure is shown in
Figure 1. It is composed of an AFA module and a feature diffusion mechanism. The AFA module accepts inputs at three different scales; it uses a channel-aware adaptive selection mechanism combined with a set of parallel depthwise convolution operations to accurately select and aggregate multi-channel information. The feature diffusion mechanism consists of a C2f module, upsampling, and concat operations, which effectively spread features containing rich contextual information to various scales, making them more conducive to subsequent target detection and classification.
The AFA module, depicted in Figure 4, aligns the high-dimensional features $F_h$ and the low-dimensional features $F_l$ with the current layer’s features $F_c$ through convolution ($\mathrm{Conv}_{k\times k}$, a convolution with kernel size $k \times k$), upsampling ($\mathrm{Up}$), and split ($\mathrm{Split}$, quartering along the channel dimension) operations. The aligned features are divided into four equal channel partitions, where $F_h^i$, $F_l^i$, and $F_c^i$ denote the $i$-th partition of the high-dimensional, low-dimensional, and current layer’s features, respectively. The channel-aware selection and aggregation of each partition can be expressed as follows:
$$\alpha_i = \delta\big(F_c^i\big), \qquad p_i = \alpha_i \odot F_l^i + (1 - \alpha_i) \odot F_h^i, \qquad i = 1, \dots, 4,$$
where $\delta(\cdot)$ is the activation function, $\alpha_i$ is the value obtained by applying it to $F_c^i$, and $p_i$ is the selective aggregation result for each partition. The variable $\alpha$ thus possesses the capability to perceive and select feature dimensions: when $\alpha_i \to 1$, the model prioritizes the fine-grained features $F_l^i$, whereas when $\alpha_i \to 0$, it emphasizes the contextual features $F_h^i$. The channel-perception aggregation feature is calculated as follows:
$$\hat{F} = \mathrm{ReLU}\Big(\mathrm{BN}\big(\mathrm{Conv}\big(\mathrm{Concat}(p_1, p_2, p_3, p_4)\big)\big)\Big),$$
where $\mathrm{Concat}$, $\mathrm{Conv}$, $\mathrm{BN}$, and $\mathrm{ReLU}$ denote concatenation, convolution, batch normalization, and the rectified linear unit (ReLU), respectively.
After merging the partitions $p_i$ along the channel dimension in this way, $\hat{F}$ is processed through a set of parallel depthwise convolutions to extract contextual features. These convolutions use kernels of size 3, 5, 7, and 9 without dilation, thereby avoiding overly sparse feature extraction. Subsequently, a 1 × 1 convolution is employed to fuse the local and contextual features; this 1 × 1 convolution acts as a channel fusion mechanism, integrating features with different receptive field sizes. Finally, the output feature of the AFA module is obtained.
Through this approach, the AFAD module effectively handles multi-scale inputs, adaptively selects feature dimensions, captures broad contextual information while preserving feature integrity, and fuses local and contextual features. Additionally, the rich features obtained earlier are efficiently diffused to various scales through a diffusion mechanism, thereby enhancing detection accuracy.
3.4. Focaler-GIoU Loss
In object detection, regression models predict the position and size of bounding boxes, but scale differences challenge their performance. Small-scale targets lack feature information, hampering accurate bounding box prediction, while large-scale targets suffer reduced localization accuracy due to noise. To address these challenges, we adopt a Focaler-GIoU Loss that measures bounding box similarity more accurately and emphasizes hard samples, enhancing detection accuracy. Our model shows significant improvements in regression and localization, particularly in handling scale differences, demonstrating its robustness and accuracy.
The Focaler-GIoU Loss combines the ideas of Focaler-IoU [27] and GIoU Loss [48], defined as follows:
$$\mathrm{IoU}^{focaler} = \begin{cases} 0, & \mathrm{IoU} < d \\ \dfrac{\mathrm{IoU} - d}{u - d}, & d \le \mathrm{IoU} \le u \\ 1, & \mathrm{IoU} > u \end{cases}$$
where $\mathrm{IoU}^{focaler}$ is the reconstructed Focaler-IoU, $\mathrm{IoU} = \frac{|B \,\cap\, B^{gt}|}{|B \,\cup\, B^{gt}|}$ is the original IoU value, and $[d, u] \in [0, 1]$. $B$ and $B^{gt}$ represent the predicted box and ground truth (GT) box, respectively. By adjusting the values of $d$ and $u$, we can make $\mathrm{IoU}^{focaler}$ focus on different regression samples. The corresponding loss is defined below:
$$L_{Focaler\text{-}GIoU} = L_{GIoU} + \mathrm{IoU} - \mathrm{IoU}^{focaler}, \qquad L_{GIoU} = 1 - \mathrm{IoU} + \frac{|C \setminus (B \cup B^{gt})|}{|C|},$$
where $C$ is the smallest enclosing box covering both $B$ and $B^{gt}$.
The GIoU loss, as an enhancement of IoU, calculates losses by incorporating a minimum bounding rectangle encompassing both the predicted and true boxes, thereby providing a more accurate measure of their overlap. With the introduction of Focaler-GIoU, the model is encouraged to pay greater attention to challenging hard samples during training, specifically those with lower GIoU values. This combination allows Focaler-GIoU to not only address cases excluded by standard IoU but also prioritize samples that significantly influence detection performance. Consequently, it provides a more refined measure of the similarity between predicted and true bounding boxes, leading to improved detection accuracy.
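For reference, a compact PyTorch sketch of the loss described above is shown below for axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ format; the default values of $d$ and $u$ are placeholders rather than the settings tuned for SRFAD-Net.

```python
import torch

def focaler_giou_loss(pred, target, d: float = 0.0, u: float = 0.95, eps: float = 1e-7):
    """Sketch of a Focaler-GIoU loss: GIoU penalty plus the Focaler re-mapping of IoU."""
    # Intersection area.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # Union area and plain IoU.
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    iou = inter / union

    # Smallest enclosing box C and the GIoU value.
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c_area = cw * ch + eps
    giou = iou - (c_area - union) / c_area

    # Focaler reconstruction: linear re-mapping of IoU onto [d, u], clipped to [0, 1].
    iou_focaler = ((iou - d) / (u - d)).clamp(0.0, 1.0)

    # L_Focaler-GIoU = L_GIoU + IoU - IoU^focaler
    return (1.0 - giou) + iou - iou_focaler


if __name__ == "__main__":
    pred = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
    gt = torch.tensor([[1.0, 1.0, 3.0, 3.0]])
    print(focaler_giou_loss(pred, gt))
```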
5. Conclusions
This article has introduced the Scale-Robust Adaptive Feature Aggregation and Diffusion Network (SRFAD-Net) for object detection in remote sensing images. By employing a carefully designed Backbone, Neck, and Head, the network has effectively addressed the challenges in feature extraction, fusion, regression, and classification posed by scale variations in remote sensing images. In the Backbone, SRFN has constructed a multi-scale pyramid and incorporated an ADown downsampling module and a DAttention module. These components have preserved rich background information and salient features while dynamically adjusting sampling positions and attention weights to accommodate targets of varying shapes and scales. This design has notably enhanced the network’s robustness to targets of diverse scales and shapes, mitigating information loss for small targets and improving the modeling of long-range targets. In the Neck, the AFAD module has utilized channel awareness to aggregate local and contextual information, effectively propagating it across various scales through its diffusion mechanism. AFAD has significantly enhanced the model’s feature fusion capability, enabling it to fully exploit the complementarity among features at different scales and further improve detection performance. In the Head, incorporating a Focaler-GIoU Loss has enabled SRFAD-Net to accurately measure the similarity between predicted and ground truth bounding boxes and to prioritize challenging samples during training. This approach has enhanced both the model’s detection accuracy and its adaptability to complex scenes and diverse datasets.
Extensive experimental results have demonstrated that SRFAD-Net achieves higher accuracy than current mainstream one-stage and two-stage methods. Furthermore, the effectiveness of each module has been discussed in the ablation study. Future efforts will focus on further optimizing the network model, reducing computational complexity, and strengthening the detection of small targets to improve the robustness and generalization of the model.