Article

Feature Symmetry Fusion Remote Sensing Detection Network Based on Spatial Adaptive Selection

Heng Xiao, Donglin Jing, Fujun Zhao and Shaokang Zha
1 School of Information and Intelligent Engineering, University of Sanya, Sanya 572022, China
2 Academician Rong Chunmin Workstation, University of Sanya, Sanya 572022, China
3 Shanghai Aerospace Control Technology Institute, Shanghai 201109, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(4), 602; https://doi.org/10.3390/sym17040602
Submission received: 17 March 2025 / Revised: 10 April 2025 / Accepted: 14 April 2025 / Published: 16 April 2025
(This article belongs to the Special Issue Symmetry and Asymmetry Study in Object Detection)

Abstract

This paper proposes a spatially adaptive feature fine fusion network consisting of a Fast Convolution Decomposition Sequence (FCDS) and a Spatial Selection Mechanism (SSM). Firstly, in the FCDS, a large-kernel convolution decomposition operation breaks dense convolution kernels down into small convolutions with gradually increasing dilation rates, forming a continuous kernel sequence that captures finer-scale features. This approach significantly reduces the number of parameters and improves inference efficiency while preserving the spatial feature expression ability of the network. Notably, the decomposed convolution kernel sequence adopts a symmetric dilation-rate increment strategy, maintaining symmetry constraints on the kernel weight distribution while expanding the receptive field. On this basis, the spatial selection mechanism enhances the differences between key target-location features and the background in the feature map, dynamically allocates weights to the different fine-scale feature maps, and improves multi-scale adaptivity. This mechanism employs symmetric attention weight allocation (symmetric channel attention plus spatial attention) to establish complementary symmetric response patterns across feature maps in both the channel and spatial dimensions. Extensive experiments show that, compared with existing advanced detection networks, our method achieves higher performance, with 81.64%, 91.34%, and 91.20% mAP on three commonly used remote sensing target datasets (DOTA, UCAS-AOD, and HRSC2016).

1. Introduction

Remote sensing (RS) images are images of the Earth’s surface obtained at long distances through remote sensing platforms such as airplanes and satellites, providing rich information for monitoring the Earth’s surface. Object detection is an important task in the RS field, used to classify and locate ground targets, and is widely applied in different fields such as military reconnaissance, resource investigation, disaster relief, and environmental monitoring. The detection objects of remote sensing images include multiple types of targets with complex backgrounds, such as airplanes, vehicles, and ships. Due to the arbitrary orientation, multiple scales, and dense distribution of remote sensing targets, as well as the influence of different environmental factors on remote sensing images, multi-target detection in remote sensing scenes is highly challenging. Traditional detection algorithms are mainly based on handcrafted feature modeling: prior rules are used to select candidate regions, features are extracted from each region, and feature representations of the target are established; finally, classifiers are used to eliminate false targets and regressors are used to locate real targets to obtain the final detection results. However, such approaches rely on a single type of feature with weak expressive and generalization ability, making them difficult to apply to complex and diverse remote sensing scenes [1,2,3].
With the development of deep learning technology and the continuous expansion of data samples, convolutional neural networks are gradually being used to solve object detection problems in RS scenes. Current deep learning-based RS image object detection algorithms can be divided into two categories: two-stage and single-stage algorithms. Among two-stage algorithms based on candidate regions, Zhao et al. [4] proposed a non-local feature enhancement detection method. Firstly, a stepwise regression network model was constructed using a region-proposal-based real-time object detector (Faster R-CNN [5]) and RoI Trans [6] to achieve precise localization from coarse to fine granularity. Secondly, NFEM was proposed based on a non-local network to enhance the expressive power of features in the network. Zhong et al. [7] proposed a remote sensing image semantic segmentation network model based on multi-scale information fusion. In the encoding stage, a multi-scale feature fusion strategy across convolutional layers based on the DenseNet network was designed to handle complex background regions. In the decoding stage, a short decoder was designed that can fuse convolutional features from different levels. Finally, a hierarchically supervised network model with multiple outputs was adopted to obtain supervision information from different levels. Chen et al. [8] proposed the attenuated NMS algorithm to solve the problem of overlapping remote sensing targets at different scales. Firstly, target upsampling and re-evolution were designed into a given deep ResNet-101 model to achieve accurate detection of small targets. Secondly, the attenuated NMS technique was used to overcome erroneous elimination in dense candidate target areas.
Among single-stage algorithms, Qu et al. [9] proposed a dilated convolution and feature fusion single-shot detector (SSD [10]) network. This algorithm used the FPN [11] structure to fuse high-resolution low-level feature maps with high-level feature maps rich in semantic information, and simultaneously used dilated convolution to enlarge the receptive field of the third-level feature map of the network, improving small object detection. Li et al. [12] proposed an improved object detection and recognition method based on the concatenated rectified linear unit (CReLU) and the feature pyramid network. Firstly, CReLU was added to the shallow layers of the SSD network to improve the efficiency of transmitting shallow features; then, FPN was used to gradually fuse the multi-scale feature maps from the deep to the shallow layers of the network.
The multi-scale problem of targets in RS images is complex; for example, targets of different scales, or targets of different sizes within the same category, may appear in the same field of view. As a result, deep convolutional networks based on fixed feature maps cannot effectively capture the features of multi-scale targets, which degrades detection accuracy. Therefore, RS target detection under wide scale variation remains a challenging problem.
In response to the multi-scale phenomenon of such targets, the mainstream approaches primarily involve constructing image or feature pyramids. For image pyramids, networks are trained with randomly scaled images to enforce multi-scale object detection adaptability, while testing employs multi-scale detection on the same image followed by non-maximum suppression (NMS) for result integration. Singh et al. [13] proposed a Scale Normalized Image Pyramid (SNIP) strategy that exclusively provides scale-appropriate supervision signals at each pyramid level, thereby mitigating interference from extremely large or small targets. The AZ-Net algorithm developed by Lu et al. [14] further enhances contextual utilization through adjacent region prediction and zoom indicators to focus on critical areas, improving detection performance in complex scenarios. Although image pyramids facilitate contextual information exploitation, their high-resolution inputs incur substantial memory and computational overhead.
Regarding feature pyramids, Pang et al. [15] advocated for balanced utilization of all feature layers by unifying multi-level features into intermediate-scale representations, subsequently employing non-local modules [16] to strengthen inter-feature dependencies and integrate cross-scale contextual information. Additionally, the spatial pyramid alignment strategy proposed by Lazebnik et al. [17] enhances contextual extraction through multi-scale image partitioning into sub-regions, followed by feature statistics collection and aggregation to form comprehensive image representations. This approach enables more holistic capture of target-surrounding contextual information, delivering refined feature representations for detectors.
However, these methodologies exhibit limitations when extracting multi-scale features: (1) Overly dense scale distributions introduce significant background noise; (2) Complex network architectures hinder broader practical applications. Future research directions should focus on optimizing scale granularity and simplifying model structures while maintaining detection accuracy.
In summary, in remote sensing detection tasks with complex target scale changes, current methods struggle to achieve an effective balance between detection accuracy and efficiency. Specifically, remote sensing images are often high-resolution bird’s-eye views, and remote sensing targets exhibit significant scale differences that change gradually. For example, when detecting aircraft at airports, aircraft sizes differ significantly across models and the scale varies gradually. When detecting targets of different scales, existing detectors often fuse multi-scale features with fixed-scale feature pyramids at different levels, as shown in Figure 1. However, the other convolutional layers that have not undergone scale feature fusion cannot accurately extract the features of targets at those scales: increasing the number of pyramid levels significantly increases computational complexity, while downsampling scale features to reduce complexity makes it difficult for small targets to learn rich feature representations. Therefore, the large, gradually varying scale differences of targets in remote sensing images make it difficult for existing methods to dynamically adapt to targets at different, finer scales.
Current feature pyramid designs often neglect the crucial role of geometric symmetry in scale adaptation. The gradient variations of remote sensing targets inherently imply spatial symmetry laws—the feature distribution of large-scale targets at high levels should maintain scale symmetry with the feature response of small-scale targets at low levels. However, as shown in Figure 1, traditional fixed-level feature fusion disrupts this symmetry, leading to geometrically inconsistent cross-scale feature representations. This paper innovatively introduces symmetry constraints into dynamic feature fusion: (1) Constructing symmetric bidirectional feature pathways to achieve interactive compensation between high-level semantics and low-level details through symmetric mapping. (2) Designing learnable symmetric weight matrices to maintain mirror symmetry of feature distribution in channel dimensions. This symmetry-enhanced mechanism enables the network to adaptively adjust the topology of feature pyramids, effectively resolving the detection challenges of gradient-varying targets.
Specifically, this article proposes a lightweight convolution structure that can cover different scale ranges. Firstly, a large kernel is decomposed into a sequence of small convolutions to significantly reduce the number of parameters. By utilizing depthwise convolutions with continuously expanding dilation rates, the receptive field can be controlled by increasing the dilation rate. Combined with the spatial selection mechanism, suitable convolution kernels can be dynamically selected to match target scales of different sizes and obtain finer-scale features. Finally, a spatial attention mechanism is introduced to transform the pooled features into multiple spatial attention feature maps through a convolution fusion operation and a self-attention mechanism, enhancing the key regions of the feature maps and improving the model’s ability to detect targets with arbitrary scale changes in remote sensing images.

2. Related Works

RS images are more complex than natural scene images. In addition to containing many similar objects that can easily be confused with the target, the diverse scales and large variations in aspect ratio of the targets make accurate localization and recognition difficult, posing a huge challenge for target detection. Existing research focuses on rotation feature extraction, feature fusion, and related aspects.

2.1. Rotation Feature Extraction

Yang et al. [18] proposed circular smooth labels to address the problem of boundary discontinuity in rotated object detection methods. This method transformed angle prediction into a classification task, used cyclic label encoding to adapt to the periodic changes of angles, and limited the classification range of the prediction results, increasing the distance between adjacent angle categories. Zhao et al. [19] proposed a robust anchor-free detector for oriented object detection that targets the periodicity of angles and the aspect ratio of objects. Firstly, the variant Gaussian label (VGL) generation method was used to generate discrete Gaussian labels to adapt to the different aspect ratios of objects. Then, channel-pixel attention (CPA) was used to fuse features between channels and pixels, obtaining a global receptive field and extracting more stable features. Han et al. [20] proposed a rotation-equivariant detector (ReDet), which uses a rotation-equivariant network as the backbone of the detector, extracts rotation-equivariant features to effectively model the spatial orientation of the target, and adaptively extracts rotation-invariant features from the equivariant features according to the orientation of the RoI; the orientation-dimension features are aligned by circularly switching orientation channels and feature interpolation. Zhu et al. [21] proposed a new adaptive period embedding (APE) method, which used two two-dimensional vectors with different periods to represent the rotation angle of the bounding box, thereby reducing the complexity of angle regression. At the same time, a two-stage cascaded R-CNN model with length-independent IoU (LIIoU) was adopted to detect slender objects, improving the quality of the R-CNN regression and making the detector more robust to slender targets. Yang et al. [22] proposed a cascaded refinement rotation target detection algorithm that uses a coarse-to-fine stepwise regression method to detect targets quickly and accurately. The algorithm first used horizontal anchors to achieve faster speed and more proposals, and then used refined rotated anchors to adapt to dense scenes. In addition, refined anchor position information was obtained using the Feature Refinement Module (FRM) to achieve feature alignment.

2.2. Feature Fusion

Huang et al. [23] proposed a non-local-aware pyramid and multi-task refinement approach. Firstly, a non-local excitation pyramid attention mechanism was proposed to effectively suppress background noise. Then, a top-down dual-branch structure was used to perform bidirectional aggregation of features to enhance fusion efficiency, and a new type of dilated convolution with multiple receptive fields was constructed to refine misalignment. Yang et al. [24] proposed a novel multi-class rotation detector, SCRDet, which designed a sampling fusion network (SF-Net) that integrates multi-layer features with effective anchor sampling. It combined a supervised multi-dimensional attention learner (MDA-Net) to highlight target features and improve the detection accuracy of dense small targets. They also used rotation NMS (R-NMS) to address the drastic changes in loss caused by rotated bounding boxes. Xu et al. [25] proposed a dynamic refinement network whose feature selection module enables neurons to adjust their receptive fields according to the shape and orientation of the target object, effectively alleviating the misalignment between the receptive field and the object; the dynamic refinement head (DRH) models the uniqueness and specificity of each sample, enabling the model to dynamically optimize predictions in an object-aware way. Zhu et al. [26] addressed the issue that excessive contextual information may cause negative effects in deformable convolution due to uncontrollable offsets; they upgraded deformable convolution to learn not only the offsets but also the weight of each sampling point, which is equivalent to a local attention mechanism. Chen et al. [27] proposed dynamic convolution, which aggregates multiple parallel convolution kernels based on the input and fuses them via an attention mechanism, thus providing stronger adaptive feature expression ability.

3. Methodology

3.1. Basic Architecture

In response to the gradual changes in target scale, existing methods often use fixed-scale feature pyramids. Convolutional kernels have a fixed spatial receptive field, and existing pyramids fuse these fixed-scale features. However, the other convolutional layers that have not undergone scale feature fusion cannot accurately extract the features of targets at those scales, making it difficult for this architecture to dynamically adapt to targets at different, finer scales. In addition, increasing the number of layers in the feature pyramid significantly increases computational complexity.
To achieve an effective balance between scale diversity and computational complexity, this paper proposes a multi-scale dynamic selection mechanism. The basic framework is shown in Figure 2. Convolutional kernel sequences with gradually expanding dilation rates are used to obtain finer-scale features. In addition, by combining the spatial selection mechanism, attention maps that differentiate features at different scales are obtained, and the refined multi-scale features are weighted and fused to dynamically select convolution kernels of appropriate sizes, achieving adaptivity to targets in different spatial ranges.

3.2. Convolutional Sequence

Firstly, a lightweight network is built that can cover different scale ranges. This network consists of a series of depthwise convolutions with different kernel sizes and gradually increasing dilation rates. The rapid growth of the receptive field is controlled by increasing the convolution kernel size and the dilation rate in order to meet the needs of targets of different sizes. At the same time, to ensure that the dilated convolutions do not introduce gaps between feature maps, an upper limit on the dilation rate is also set. Assuming that the size of the $i$-th depthwise convolution in the kernel sequence is $k_i$ ($k_{i-1} < k_i$), its receptive field is $r_i$, and its dilation rate is $e_i$ ($e_{i-1} < e_i \le r_{i-1}$), they satisfy the following relationship:
$$e_1 = 1, \qquad r_1 = k_1, \qquad r_i = e_i (k_i - 1) + r_{i-1}$$
Assuming the original receptive field is 23 (i.e., a 23 × 23 convolution kernel with dilation 1), decoupling the convolution yields the kernel sequence (5, 1), (7, 3), i.e., $k_1 = 5$, $e_1 = 1$, $r_1 = k_1 = 5$; $k_2 = 7$, $e_1 < e_2 = 3 \le r_1$, $r_2 = e_2 (k_2 - 1) + r_1 = 3 \times (7 - 1) + 5 = 23$. This design can generate a series of lightweight convolution kernels of different sizes, which can be used to extract features from different spatial ranges, obtain image detail information at different scales, adapt to gradually varying scale requirements, and improve target detection. In addition, decoupling the convolutional kernel sequence decomposes the large convolution kernel into multiple lightweight small kernels, effectively reducing the parameter count of the model.
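To make this receptive-field arithmetic concrete, the short Python sketch below (an illustrative helper written for this explanation, not code from the paper) applies the recurrence $r_i = e_i(k_i - 1) + r_{i-1}$ to a (kernel, dilation) sequence and confirms that (5, 1) followed by (7, 3) reproduces an effective receptive field of 23, while the variants listed later in Table 5 reach 29.

```python
def receptive_fields(kernel_dilation_seq):
    """Apply r_i = e_i * (k_i - 1) + r_{i-1}, with r_1 = k_1 for the first kernel."""
    fields = []
    r = 0
    for i, (k, e) in enumerate(kernel_dilation_seq):
        if i == 0:
            r = k                 # first kernel: receptive field equals its size (dilation 1)
        else:
            r = e * (k - 1) + r   # recurrence from the formula above
        fields.append(r)
    return fields

# The decomposition used in the paper: a 23x23 kernel becomes (5, 1) -> (7, 3).
print(receptive_fields([(5, 1), (7, 3)]))            # [5, 23]
# The 29-field variants from the ablation study (Table 5).
print(receptive_fields([(5, 1), (7, 4)]))            # [5, 29]
print(receptive_fields([(3, 1), (5, 2), (7, 3)]))    # [3, 11, 29]
```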
Next, the series of convolution kernels is applied as depthwise convolution operations to obtain rich background information from different regions of the input data $X$:
$$F_0 = X, \qquad F_{i+1} = \mathrm{Conv}^{dw}_{i}(F_i), \qquad \tilde{F}_i = \mathrm{Conv}^{1 \times 1}_{i}(F_i), \qquad \text{for } i \in [1, N]$$
Here, $X$ is the input data, $\mathrm{Conv}^{dw}_{i}$ is the depthwise convolution with kernel size $k_i$ and dilation rate $e_i$, $F_i$ is the feature map, and $\tilde{F}_i$ is the fusion result. Assuming there are $N$ decoupled convolution kernels, each depthwise convolution is followed by a 1 × 1 convolution layer for channel fusion of the spatial feature vectors.
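A minimal PyTorch sketch of this decomposed kernel sequence is given below, assuming details the text leaves open (channel count, "same" padding, and the default (5, 1) → (7, 3) pair); the module name ConvSequence and its interface are ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvSequence(nn.Module):
    """Decompose a large kernel into depthwise convolutions with growing dilation,
    each followed by a 1x1 convolution for channel fusion (a sketch of Section 3.2)."""
    def __init__(self, channels, kernel_dilation_seq=((5, 1), (7, 3))):
        super().__init__()
        self.dw_convs = nn.ModuleList()
        self.pw_convs = nn.ModuleList()
        for k, d in kernel_dilation_seq:
            pad = d * (k - 1) // 2  # keep spatial size unchanged
            self.dw_convs.append(
                nn.Conv2d(channels, channels, k, padding=pad, dilation=d, groups=channels)
            )
            self.pw_convs.append(nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        feats = []                   # \tilde{F}_i for each receptive field
        f = x                        # F_0 = X
        for dw, pw in zip(self.dw_convs, self.pw_convs):
            f = dw(f)                # F_{i+1} = Conv_i^{dw}(F_i)
            feats.append(pw(f))      # \tilde{F}_i = Conv_i^{1x1}(F_i)
        return feats

# Example: two decoupled kernels covering receptive fields 5 and 23.
x = torch.randn(1, 64, 128, 128)
feats = ConvSequence(64)(x)
print([f.shape for f in feats])      # two feature maps, each 1 x 64 x 128 x 128
```

Each element of the returned list corresponds to one $\tilde{F}_i$ with a progressively larger receptive field.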

3.3. Spatial Selection Mechanism

For different targets, based on the multi-scale features extracted by the convolutional sequence, a spatial selection mechanism is used to dynamically select convolution kernels of appropriate sizes to capture spatial information at different scales. Specifically, it is implemented as follows. Firstly, the features from convolution kernels with different receptive fields are concatenated, $\tilde{F} = [\tilde{F}_1; \dots; \tilde{F}_N]$, and then channel-wise average pooling $P_{avg}$ and max pooling $P_{max}$ are applied to extract spatial feature descriptors. The spatial pooling features obtained by the average pooling and max pooling operations are denoted $S_{avg}$ and $S_{max}$, respectively:
$$S_{avg} = P_{avg}(\tilde{F}), \qquad S_{max} = P_{max}(\tilde{F})$$
To facilitate information exchange between different spatial features, we fuse the spatially pooled features. The pooled features of the two channels, $[S_{avg}; S_{max}]$, are concatenated and converted into $N$ spatial attention feature maps through convolution operations and self-attention (SA). This combined pooling, used together with the self-attention mechanism, further enhances the key regions of the feature maps. $\mathrm{Conv}_{2 \to N}$ denotes the convolution operation combined with the spatial attention mechanism, whose input is the two spatially pooled feature maps $[S_{avg}; S_{max}]$ and whose output has $N$ channels, and $\widehat{FA}$ denotes the enhanced feature information:
$$\widehat{FA} = \mathrm{Conv}_{2 \to N}([S_{avg}; S_{max}])$$
Then, the Sigmoid function is applied to the attention feature maps to obtain an independent spatial selection mask for each decoupled large convolution kernel. This selection network, which combines an ordinary two-dimensional convolution with the Sigmoid function, obtains selection weights for the different convolution kernel sizes and dynamically selects among the different receptive field sizes, further enhancing attention to the target area and achieving better detection results:
$$\widetilde{FA}_i = \sigma(\widehat{FA}_i)$$
Next, the features of the decoupled large-kernel sequence are weighted by the corresponding spatial selection masks and fused through a convolutional layer to obtain the attention features. This step weights the more important regions in the feature map to enhance the network’s ability to detect specific objects in remote sensing images. For example, in the detection of ships on the sea surface, the sizes of different types of ships differ greatly; for small ships with less obvious characteristics (such as fishing boats), it is necessary to appropriately expand the coverage of the receptive field and obtain contextual information around the target to determine its category and improve recognition accuracy. At the same time, because ship shapes differ, there are small inter-class differences and large intra-class differences, and the attention mechanism is needed to guide the model to pay more attention to difficult-to-classify samples or target areas.
$$F_a = \mathrm{Conv}\left( \sum_{i=1}^{N} \widetilde{FA}_i \cdot \tilde{F}_i \right)$$
Finally, the input features are multiplied element-wise with the attention features $F_a$ to obtain the output:
$$Y = X \cdot F_a$$
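The PyTorch sketch below shows one plausible reading of this spatial selection mechanism. It is a simplified illustration: the self-attention part of $\mathrm{Conv}_{2 \to N}$ is reduced to a single 7 × 7 convolution, and the names SpatialSelection and num_branches are ours rather than the paper's.

```python
import torch
import torch.nn as nn

class SpatialSelection(nn.Module):
    """Spatial selection over N multi-scale feature maps (sketch of Section 3.3):
    channel-wise avg/max pooling -> Conv(2 -> N) -> Sigmoid masks -> weighted fusion."""
    def __init__(self, channels, num_branches=2, kernel_size=7):
        super().__init__()
        # Conv_{2->N}: maps the two pooled descriptors to N attention maps.
        self.attn_conv = nn.Conv2d(2, num_branches, kernel_size, padding=kernel_size // 2)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x, feats):
        # feats: list of N feature maps \tilde{F}_i, each of shape (B, C, H, W)
        stacked = torch.cat(feats, dim=1)
        s_avg = stacked.mean(dim=1, keepdim=True)          # S_avg: channel-wise average pooling
        s_max = stacked.max(dim=1, keepdim=True).values    # S_max: channel-wise max pooling
        masks = torch.sigmoid(self.attn_conv(torch.cat([s_avg, s_max], dim=1)))  # \tilde{FA}_i
        fused = sum(feats[i] * masks[:, i:i + 1] for i in range(len(feats)))
        f_a = self.fuse(fused)                             # F_a
        return x * f_a                                     # Y = X · F_a
```

Fed with the feature list produced by the ConvSequence sketch above, e.g. `SpatialSelection(64, num_branches=2)(x, feats)`, it returns the scale-adaptively weighted output $Y$.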
Overall, this method has two advantages. Firstly, it explicitly generates multiple features with different large receptive fields. By rapidly expanding the receptive field in the shallow layers of the network, it can capture spatial information at different (finer) scales and extract detailed features of targets of different sizes; combined with the attention mechanism for dynamic selection, it improves the network’s ability to focus on the spatial regions most relevant to the target, thereby obtaining better discriminative power. Secondly, it decomposes the large convolution kernel into multiple small kernels in sequence, reducing the number of parameters and improving network efficiency.

3.4. Loss Function

In object detection tasks, the regression model is responsible for predicting the position and size of bounding boxes; however, scale differences pose significant challenges to its performance. For small-scale targets, the available feature information is relatively limited, so the regression model often has greater difficulty predicting their bounding boxes, leading to less accurate predictions. Conversely, for large-scale targets, although the feature information is relatively rich, the large area occupied by the target makes the detector more susceptible to noise and interference, reducing localization accuracy. To address these challenges, this paper introduces the Focaler-GIoU loss, which not only measures the similarity between predicted and ground-truth bounding boxes but also pays special attention to samples that are difficult to regress, effectively improving detection accuracy. With this approach, our model achieved significant performance improvements in regression and localization, demonstrating stronger robustness and accuracy, especially for targets with large scale differences. The Focaler-IoU formulation focuses bounding box regression on difficult samples in detection tasks with a high proportion of such samples. Specifically, a linear interval mapping is used to reconstruct the IoU loss and improve box regression. The formula is as follows:
$$IoU^{focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases}$$
Here, $IoU^{focaler}$ is the reconstructed Focaler-IoU and $[d, u] \in [0, 1]$. By adjusting the values of $d$ and $u$, $IoU^{focaler}$ can focus on different regression samples. The corresponding loss is defined as follows:
$$L_{\text{Focaler-IoU}} = 1 - IoU^{focaler}$$
Applying the Focaler-IoU mapping to GIoU-based bounding box regression, the Focaler-GIoU loss can be expressed as follows:
$$L_{\text{Focaler-GIoU}} = L_{GIoU} + IoU - IoU^{focaler}$$
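As a minimal sketch of how these pieces combine, the function below implements the Focaler mapping on top of a plain axis-aligned GIoU. The paper works with oriented boxes, which would require a rotated IoU/GIoU routine instead, and the interval endpoints d = 0.0, u = 0.95 are assumed hyperparameters, not values reported by the authors.

```python
import torch

def giou(pred, target, eps=1e-7):
    """IoU and GIoU for axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    inter_wh = (torch.min(pred[:, 2:], target[:, 2:]) -
                torch.max(pred[:, :2], target[:, :2])).clamp(min=0)
    inter = inter_wh[:, 0] * inter_wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box gives the GIoU penalty term.
    enc_wh = (torch.max(pred[:, 2:], target[:, 2:]) -
              torch.min(pred[:, :2], target[:, :2])).clamp(min=0)
    enclose = enc_wh[:, 0] * enc_wh[:, 1]
    return iou, iou - (enclose - union) / (enclose + eps)

def focaler_giou_loss(pred, target, d=0.0, u=0.95):
    """L_Focaler-GIoU = L_GIoU + IoU - IoU^focaler, with the linear interval mapping [d, u]."""
    iou, g = giou(pred, target)
    iou_focaler = ((iou - d) / (u - d)).clamp(0.0, 1.0)  # piecewise-linear reconstruction of IoU
    return (1.0 - g) + iou - iou_focaler
```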

4. Experiments

The experimental results are presented on three typical public datasets: DOTA-v1.0 [28], HRSC2016 [29], and UCAS-AOD [30]. The following sections discuss in detail the information about the dataset, method implementation, evaluation metrics, and experimental results.

4.1. Experimental Dataset

UCAS-AOD is used for aircraft and vehicle detection. It includes airplanes and cars, as well as a certain number of negative examples, totaling 2420 images and 14,596 instances. The objects in the dataset are evenly distributed in orientation, and the images cover a variety of complex natural environments and weather conditions, so models trained on this dataset have stronger generalization ability and robustness in practical applications. All images are in PNG format, with sizes ranging from 1280 × 659 to 1372 × 941. UCAS-AOD adopts OBB (oriented bounding box) annotations; each annotation includes the four vertices of the rotated rectangular box, its tilt angle, width, and height. This article uses the four-vertex representation (x1, y1, x2, y2, x3, y3, x4, y4).
The images of the HRSC2016 dataset are sourced from six famous ports, which have different scales and geographical environments, providing diverse samples for research. The image scale distribution ranges from 300 × 300 to 1500 × 900, and diverse image sizes contribute to the algorithm’s generalization. The dataset classifies ships into three levels, from their type and purpose to specific features, providing more refined guidance for remote sensing target detection.
The images in the DOTA dataset are up to approximately 4000 × 4000 pixels in size. Fifteen target categories are labeled, with a total of 188,282 instances, each annotated with an arbitrary quadrilateral determined by four points. Each object in the image label information has 10 values: the first 8 are the coordinates of the 4 corners of the box, the 9th is the object category, and the 10th is the recognition difficulty level, where 0 denotes simple and 1 denotes difficult. The dataset achieves a good balance among instances in different orientations, which greatly helps improve the robustness of detectors.
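For illustration, the small parser below reads one DOTA-style label file into structured records following the 10-value-per-object layout described above; it assumes the commonly distributed plain-text format (optional header lines, whitespace-separated fields) and is not tied to the authors' data pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OrientedBox:
    corners: List[float]   # x1, y1, x2, y2, x3, y3, x4, y4
    category: str          # object class label
    difficult: int         # 0 = simple, 1 = difficult

def parse_dota_label(path: str) -> List[OrientedBox]:
    """Parse one DOTA-style label file: each object line holds 8 corner coordinates,
    a category, and a difficulty flag (a sketch; shorter metadata lines are skipped)."""
    boxes = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 10:          # skip header/metadata lines
                continue
            boxes.append(OrientedBox(
                corners=[float(v) for v in parts[:8]],
                category=parts[8],
                difficult=int(parts[9]),
            ))
    return boxes
```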

4.2. Comparative Experiments

4.2.1. Comparative Experiment Results on DOTA

In this section, the obtained results are compared with current state-of-the-art methods. Table 1 shows the detection results of our proposed SASNet and other remote sensing image detection methods on the DOTA dataset. SASNet achieved the highest performance for categories such as plane, baseball diamond, and bridge, with accuracies of 92.38%, 86.74%, and 67.24%, respectively, and obtained the best average result with an mAP of 81.64%. From Table 1, it can be seen that, apart from SASNet, RVSA performed best. It uses rotated varied-size window attention to improve the performance of ViT and can generate windows with different shapes and positions to adapt to targets of any orientation and size in remote sensing images; it can also extract rich contextual information and thereby learn better object representations. However, it lacks the ability to handle local visual structures and scale changes, as the model treats images as one-dimensional sequences of visual tokens, making it difficult to model local visual structure and adapt to scale variation. In contrast, our method adaptively selects matching convolution kernels to extract features of targets at different scales.
We visualize partial results of SASNet on DOTA in Figure 3. Ports have a large scale and extreme aspect ratios, making them difficult to detect; however, our detection results still closely fit the shape of the port, showing adaptability to objects with extreme aspect ratios. At the same time, small ship targets, whose scale differs greatly from the port targets in the image, can also be detected accurately. The scene in the first image of the second row is complex and its features are not obvious, yet our method can still accurately identify bridge targets of different scales, demonstrating good generalization ability and adaptation to different complex environments. The second image in the second row exhibits high image noise, unclear texture boundaries, and small target scale; nevertheless, our model can still accurately detect and recognize the targets and shows good anti-interference ability. In the first and second images of the third row, the targets are densely arranged, with small gaps between adjacent objects and overlapping boundaries, and the scale differences between different types of targets are very large. Our method detects the different types of targets well, demonstrating good adaptability to dense and complex scenes.
The above experimental results indicate that SASNet can dynamically adjust the receptive field for feature selection, achieve adaptability at different scales, and accurately detect remote sensing targets of different scales and categories, ensuring high efficiency and accuracy.

4.2.2. Comparative Experiment Results on UCAS-AOD

Partial detection results are visualized in Figure 4, where the first row shows the performance under different lighting conditions. Our method handles different backgrounds and lighting conditions well, demonstrating generalization to complex environments and partial occlusion. The second row shows object detection in the presence of occlusion, such as the main body of a vehicle being obscured by tree shade or some vehicles being blocked by buildings or structures. For such scenarios, our method can accurately detect partially occluded targets. This is because the convolutional kernel sequence proposed in this article generates multiple features with different large receptive fields, and the spatial selection mechanism can dynamically adjust the receptive field, supplement the contextual information of the target area, and enrich the main features of the target. At the same time, combining the attention mechanism improves the network’s ability to focus on the spatial regions most relevant to the target, thereby obtaining better discriminative ability. The comparative results are shown in Table 2; compared with other advanced detection networks, our method achieved the highest accuracy of 91.34%.

4.2.3. Comparative Experimental Results on HRSC2016

The visualization of the detection results on the HRSC2016 dataset is shown in Figure 5. The images in the first row have extreme aspect ratios, while the images in the second row have similar interferences and blurry targets, such as cargo on a ship obstructing the hull and resembling land objects, and the hull merging with the sea surface. Our method can accurately detect target objects in images, and the detection results can closely fit the targets, reflecting the good anti-interference ability of this method in complex scenes, as well as its adaptability to targets in any direction and aspect ratio in remote sensing images. The detection results compared to other advanced models are shown in Table 3.

4.3. Ablation Studies

4.3.1. Analysis Experiment of Different Components

The basic structure of SASNet mainly consists of the convolutional kernel sequence and the spatial selection mechanism. We conducted ablation experiments on the HRSC2016 dataset to validate the performance of the proposed modules. Table 4 lists the results on HRSC2016. The mAP of the baseline model is 82.95%, because ordinary convolution kernels struggle to model multi-scale targets well. When the spatial selection mechanism is used, the attention mechanism enhances the focus on the target area, resulting in a roughly 4.6% improvement in detection accuracy. By replacing ordinary convolutions with the convolutional kernel sequence and combining it with the spatial selection mechanism, the detection accuracy was improved to 91.2%. This indicates that decomposing large-kernel convolutions extracts target features better, and the spatial selection mechanism matches appropriate receptive fields to targets of different scales, improving the network’s ability to focus on the spatial regions most relevant to the target.

4.3.2. Convolutional Sequence Experiment

The experiment tested the impact of different convolutional kernel sequences on detection accuracy, as shown in Table 5. When single large-kernel convolutions of 23 × 23 and 29 × 29 were used, the number of parameters exceeded 40 K. When the large kernel was decoupled into multiple depthwise convolution sequences, the number of parameters dropped sharply to 11.3 K and the detection accuracy was improved to a maximum of 91.2%. This indicates that decomposing a large-kernel convolution into a sequence of smaller kernels reduces parameters, improves speed, and also enhances detection accuracy. When the 29 × 29 convolution kernel was decomposed into three kernels, the detection accuracy was lower than with two kernels, whereas the highest accuracy was obtained when the receptive field was 23 and the kernel was decoupled into the two convolutions (5, 1) and (7, 3). The experiments show that when the receptive field is too large and there are too many decomposition layers, background noise is introduced, which affects detection accuracy, while when the receptive field is too small, it is difficult to obtain global or contextual information about the target, resulting in a decrease in detection accuracy.

4.3.3. Analysis of Parameters Related to Spatial Dynamic Selection Mechanism

The experiment tested the effectiveness of the spatial dynamic selection mechanism; the results are shown in Table 6. Among the convolutional kernel sequences of different sizes, the one with a receptive field of 23 performed best. On this basis, adding our proposed spatial selection mechanism significantly improved the detection accuracy. This indicates that the spatial dynamic selection mechanism can capture spatial information at different scales and, through attention-assisted dynamic selection, obtain detailed target features of different sizes, thereby achieving better discriminative ability.

5. Conclusions

In this paper, a spatially adaptive dynamic selection network was proposed to address the imbalance between detection accuracy and efficiency caused by multi-scale targets, whose inherent characteristics require a broader and more adaptive understanding of context. We generated multiple feature maps with various large receptive fields through convolutional kernel sequences, dynamically selected among them according to the target scale, effectively modeled the subtle contextual differences between object types, and combined attention mechanisms to weight and fuse the features. Extensive experiments showed that SASNet achieves advanced performance on highly competitive remote sensing benchmarks. Our contributions are as follows: (1) constructing symmetric bidirectional feature pathways to achieve interactive compensation between high-level semantics and low-level details through symmetric mapping; (2) designing learnable symmetric weight matrices to maintain mirror symmetry of the feature distribution in the channel dimension. This symmetry-enhanced mechanism enables the network to adaptively adjust the topology of feature pyramids, effectively resolving the detection challenges of gradient-varying targets.

Author Contributions

Conceptualization, H.X. and D.J.; methodology, H.X.; software, H.X.; validation, F.Z. and S.Z.; formal analysis, H.X.; investigation, D.J.; resources, D.J.; data curation, F.Z.; writing—original draft preparation, H.X.; visualization, S.Z.; supervision, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by Hainan Provincial Natural Science Foundation of China (Project No.: 624RC529) and Talent Introduction Project of University of Sanya (Project No.: USYRC22-08).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The HRSC2016 is available at following https://aistudio.baidu.com/712aistudio/datasetdetail/31232 (accessed on 18 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yi, X.; Gu, S.; Wu, X.; Jing, D. AFEDet: A Symmetry-Aware Deep Learning Model for Multi-Scale Object Detection in Aerial Images. Symmetry 2025, 17, 488. [Google Scholar] [CrossRef]
  2. Deng, C.; Jing, D.; Han, Y.; Wang, S.; Wang, H. FAR-Net: Fast Anchor Refining for Arbitrary-Oriented Object Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6505805. [Google Scholar] [CrossRef]
  3. Zhu, H.; Jing, D. Optimizing Slender Target Detection in Remote Sensing with Adaptive Boundary Perception. Remote Sens. 2024, 16, 2643. [Google Scholar] [CrossRef]
  4. Chen, P.; Li, Q.; Li, Q.; Wu, Z. Remote Sensing Image Object Detection Method with Feature Denoising Fusion Module. In Proceedings of the 2024 IEEE 7th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 15–17 March 2024. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  7. Zhong, W.; Guo, F.; Xiang, S. Ship Target Detection Model for Remote Sensing Images of Rotating Rectangular Regions. J. Comput. Aided Des. Graph. 2019, 31, 11. [Google Scholar]
  8. Chen, H.B.; Jiang, S.; He, G.; Zhang, B.; Yu, H. TEANS: A Target Enhancement and Attenuated Nonmaximum Suppression Object Detector for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 632–636. [Google Scholar] [CrossRef]
  9. Qu, J.S.; Su, C.; Zhang, Z.; Razi, A. Dilated Convolution and Feature Fusion SSD Network for Small Object Detection in Remote Sensing Images. IEEE Access 2020, 8, 82832–82843. [Google Scholar] [CrossRef]
  10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905 LNCS. [Google Scholar] [CrossRef]
  11. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  12. Li, H.; Zhou, K.; Han, T. SSD ship target detection based on CReLU and FPN improvement. J. Instrum. 2020, 41, 183–190. [Google Scholar]
  13. Singh, B.; Davis, L.S. An Analysis of Scale Invariance in Object Detection—SNIP. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  14. Lu, Y.; Javidi, T.; Lazebnik, S. Adaptive Object Detection Using Adjacency and Zoom Prediction. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2351–2359. [Google Scholar] [CrossRef]
  15. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards Balanced Learning for Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  16. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  17. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006. [Google Scholar]
  18. Yang, X.; Yan, J. Arbitrary-Oriented Object Detection with Circular Smooth Label. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  19. Zhao, T.; Liu, N.; Celik, T.; Li, H.C. An Arbitrary-Oriented Object Detector Based on Variant Gaussian Label in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8013605. [Google Scholar] [CrossRef]
  20. Han, J.; Ding, J.; Xue, N.; Xia, G.S. ReDet: A Rotation-equivariant Detector for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  21. Zhu, Y.; Du, J.; Wu, X. Adaptive Period Embedding for Representing Oriented Objects in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7247–7257. [Google Scholar] [CrossRef]
  22. Yang, X.; Liu, Q.; Yan, J.; Li, A.; Zhang, Z.; Yu, G. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  23. Huang, Z.; Li, W.; Xia, X.G.; Wu, X.; Tao, R. A Novel Nonlocal-Aware Pyramid and Multiscale Multitask Refinement Detector for Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601920. [Google Scholar] [CrossRef]
  24. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Xian, S.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  25. Xu, C.; Ma, C.; Yuan, H.; Sheng, K.; Dong, W.; Guo, X.; Pan, X.; Ren, Y. Dynamic Refinement Network for Oriented and Densely Packed Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  26. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  27. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Liu, Z. Dynamic Convolution: Attention Over Convolution Kernels. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  28. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  29. Liu, Z.; Yuan, L.W.L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017. [Google Scholar]
  30. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015. [Google Scholar] [CrossRef]
  31. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022. [Google Scholar]
  32. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  33. Cheng, G.; Yao, Y.; Li, S.; Li, K.; Xie, X.; Wang, J.; Yao, X.; Han, J. Dual-Aligned Oriented Detector. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618111. [Google Scholar] [CrossRef]
  34. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free Oriented Proposal Generator for Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
  35. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  36. Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU Loss for Rotated Object Detection. arXiv 2022, arXiv:2201.12558. [Google Scholar]
  37. Di, W.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5607315. [Google Scholar]
  38. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Yang, X. Sparse Label Assignment for Oriented Object Detection in Aerial Images. Remote Sens. 2021, 13, 2664. [Google Scholar] [CrossRef]
  39. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605814. [Google Scholar] [CrossRef]
  40. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Dong, Y.; Yang, X. Task interleaving and orientation estimation for high-precision oriented object detection in aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 241–255. [Google Scholar] [CrossRef]
  41. Ming, Q.; Miao, L.; Zhou, Z.; Yang, X.; Dong, Y. Optimization for Arbitrary-Oriented Object Detection via Representation Invariance Loss. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021505. [Google Scholar] [CrossRef]
  42. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  43. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning Modulated Loss for Rotated Object Detection. In Proceedings of the National Conference on Artificial Intelligence, Virtual, 19–21 May 2021. [Google Scholar]
  44. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021. [Google Scholar]
Figure 1. Fixed scale feature pyramid fusion method.
Figure 2. Network structure.
Figure 3. Detection effect of DOTA dataset. The first line of the scene demonstrates the adaptability of the model to extreme aspect ratio object detection. The second line demonstrates the generalization ability of object detection in complex scenes with different scales and unclear boundaries. The third line demonstrates the adaptability of detecting scenes with large scale differences and dense targets.
Figure 4. Detection effect of UCAS-AOD. The first line demonstrates the model’s generalization to complex environments such as different backgrounds and lighting conditions. The second line shows the detection effect under partial occlusion.
Figure 5. Visualization result on the HRSC2016 dataset. Ships that are densely docked side by side and have a high aspect ratio can be accurately detected.
Table 1. The detection performance on DOTA dataset.
Method | SASM [31] | S²ANet [32] | R3Det [22] | SCRDet [24] | RoI-Trans [6] | ReDet [20] | DODet [33] | AOPG [34] | ORCNN [35] | KFIoU [36] | RVSA [37] | Ours
PL | 89.54 | 88.89 | 89.80 | 89.98 | 88.65 | 88.81 | 89.96 | 89.88 | 89.84 | 89.44 | 88.97 | 92.38
BD | 85.94 | 83.60 | 83.77 | 80.65 | 82.60 | 82.48 | 85.52 | 85.57 | 85.43 | 84.41 | 85.76 | 86.74
BR | 57.73 | 57.74 | 48.11 | 52.09 | 52.53 | 60.83 | 58.01 | 60.90 | 61.09 | 62.22 | 61.46 | 67.24
GTF | 78.41 | 81.95 | 66.77 | 68.36 | 70.87 | 80.82 | 81.22 | 81.51 | 79.82 | 82.51 | 81.27 | 84.68
SV | 79.78 | 79.94 | 78.76 | 68.36 | 77.93 | 78.34 | 78.71 | 78.70 | 79.71 | 80.10 | 79.98 | 71.02
LV | 84.19 | 83.19 | 83.27 | 60.32 | 76.67 | 86.06 | 85.46 | 85.29 | 85.35 | 86.07 | 85.31 | 80.47
SH | 89.25 | 89.11 | 87.84 | 72.41 | 86.87 | 88.31 | 88.59 | 88.85 | 88.82 | 88.68 | 88.30 | 88.01
TC | 90.87 | 90.78 | 90.82 | 90.85 | 90.71 | 90.87 | 90.89 | 90.89 | 90.88 | 90.90 | 90.84 | 89.86
BC | 58.80 | 84.87 | 85.38 | 87.94 | 83.83 | 88.77 | 87.12 | 87.60 | 86.68 | 87.32 | 85.06 | 90.36
ST | 87.27 | 87.81 | 85.51 | 86.86 | 82.51 | 87.03 | 87.80 | 87.65 | 87.73 | 88.38 | 87.50 | 79.84
SBF | 63.82 | 70.30 | 65.57 | 65.02 | 53.95 | 68.65 | 70.50 | 71.66 | 72.21 | 72.80 | 66.77 | 70.32
RA | 67.81 | 68.25 | 62.68 | 66.68 | 67.61 | 66.90 | 71.54 | 68.69 | 70.80 | 71.95 | 73.11 | 84.37
HA | 78.67 | 78.30 | 67.53 | 66.25 | 74.67 | 79.26 | 82.06 | 82.31 | 82.42 | 78.96 | 84.75 | 79.51
SP | 79.35 | 77.01 | 78.56 | 68.24 | 68.75 | 79.71 | 77.43 | 77.32 | 78.18 | 74.95 | 81.88 | 68.63
HC | 69.37 | 69.58 | 72.62 | 65.2 | 61.03 | 74.67 | 74.47 | 73.10 | 74.11 | 75.27 | 77.58 | 69.41
mAP | 79.17 | 79.42 | 76.47 | 72.61 | 74.61 | 80.10 | 80.62 | 80.66 | 80.87 | 80.93 | 81.24 | 81.64
Table 2. Comparison of detection results of different models on UCAS-AOD dataset.
Model | Backbone | Input Size | Car | Airplane | mAP
Faster RCNN [5] | ResNet50 | 800 × 800 | 86.87 | 89.86 | 88.36
RoI Transformer [6] | ResNet50 | 800 × 800 | 88.02 | 90.02 | 89.02
SLA [38] | ResNet50 | 800 × 800 | 88.57 | 90.30 | 89.44
CFC-Net [39] | ResNet50 | 800 × 800 | 89.29 | 88.69 | 89.49
TIOE-Det [40] | ResNet50 | 800 × 800 | 88.83 | 90.15 | 89.49
RIDet-O [41] | ResNet50 | 800 × 800 | 88.88 | 90.35 | 89.62
DAL [42] | ResNet50 | 800 × 800 | 89.25 | 90.49 | 89.87
S2ANet [32] | ResNet50 | 800 × 800 | 89.56 | 90.42 | 89.99
Ours | ResNet50 | 800 × 800 | 90.28 | 92.19 | 91.34
Table 3. Comparison of detection results on the HRSC2016 dataset.
Model | Backbone | mAP (07)
RoI-Transformer [6] | ResNet101 | 86.20
RSDet [43] | ResNet50 | 86.50
BBAVectors [44] | ResNet101 | 88.60
R3Det [22] | ResNet101 | 89.26
S2ANet [32] | ResNet101 | 90.17
ReDet [20] | ResNet101 | 90.46
Oriented R-CNN [35] | ResNet101 | 90.50
Ours | LargeKernel | 91.20
Table 4. Comparison of module selection.
Ordinary Convolution | Convolutional Kernel Sequence | Spatial Selection Mechanism | Accuracy mAP (%)
✓ | | | 82.95
✓ | | ✓ | 87.53
 | ✓ | | 88.76
 | ✓ | ✓ | 91.20
Table 5. Comparison of convolution sequences on HRSC2016. Arrows represent decoupling large convolution kernels in order.
Convolutional Kernel Sequence (k, d) | Receptive Field (RF) | Quantity | Parameters (#P) | Accuracy mAP (%)
(23, 1) | 23 | 1 | 40.4 K | 82.95
(5, 1) → (7, 3) | 23 | 2 | 11.3 K | 91.20
(29, 1) | 29 | 1 | 60.4 K | 83.94
(5, 1) → (7, 4) | 29 | 2 | 11.3 K | 89.68
(3, 1) → (5, 2) → (7, 3) | 29 | 3 | 11.3 K | 89.37
Table 6. The influence of the spatial selection mechanism on accuracy and speed.
Convolutional Kernel Sequence (k, d) | Receptive Field (RF) | Spatial Selection (SS) | Speed (FPS) | Accuracy mAP (%)
(3, 1) → (5, 2) | 11 | - | 22.1 | 88.21
(5, 1) → (7, 3) | 23 | - | 21.3 | 88.93
(5, 1) → (7, 4) | 29 | - | 21.7 | 89.68
(5, 1) → (7, 3) | 23 | ✓ | 20.7 | 91.20
