Article

A Progressive Saliency-Guided Small Ship Detection Method for Large-Scene SAR Images

1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 School of Communication Officers, Army Engineering University, Chongqing 400035, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3085; https://doi.org/10.3390/rs17173085
Submission received: 16 August 2025 / Revised: 31 August 2025 / Accepted: 3 September 2025 / Published: 4 September 2025

Highlights

What are the main findings?
  • A progressive saliency-guided (PSG) method is proposed which employs saliency-derived positional priors to progressively enhance feature extraction and proposal learning for small ships in large-scene SAR images.
  • The PSG framework effectively alleviates weak responses and missed detections, improving small ship feature representation and proposal quality.
What are the implications of the main findings?
  • Extensive experiments on LS-SSDD and HRSID demonstrate that the PSG method significantly improves the detection performance compared with that of state-of-the-art methods.
  • The method provides an effective solution for accurate and robust small ship detection in large-scale SAR imagery.

Abstract

Large-scene space-borne SAR images with a high resolution are particularly effective for monitoring vast oceanic areas globally. However, ships are easily overlooked in such large scenes due to their small size and cluttered backgrounds, making SAR ship detection challenging for the existing methods. To address this challenge, we propose a progressive saliency-guided (PSG) method, which uses saliency-derived positional priors to guide the model in focusing on small targets and extracting their features. Specifically, a dual-guided perception enhancement (DGPE) module is developed, which introduces additional target saliency maps as prior information to cross-guide and highlight key regions in SAR images at the feature level, enhancing small object feature representation. Additionally, a saliency confidence aware assessment (SCAA) mechanism is designed to strengthen small object proposal learning at the proposal level, guided by classification and localization scores at key locations. The DGPE and SCAA modules jointly enhance small object learning across different network levels. Extensive experiments demonstrate that the PSG method significantly improves the detection performance (+4.38% AP on LS-SSDD and +4.35% on HRSID) for small ships in large-scene SAR images compared to that of the baseline, providing an effective solution for small ship detection in large scenes.

1. Introduction

Large-scene space-borne Synthetic Aperture Radar (SAR) images are obtained using the ScanSAR/TOPS wide-swath imaging mode, which allows a single scene to cover tens to hundreds of square kilometers, offering both wide-area observation and detailed imaging capabilities [1]. As global marine resource development intensifies and territorial disputes grow, satellite-based SAR systems in high-resolution wide-swath (HRWS) mode have become essential for all-weather maritime monitoring [2], particularly for military applications requiring wide-area surveillance and strategic maritime security.
Large-scene SAR ship detection is critical for maritime security and monitoring [3]. Due to the vast coverage and complex backgrounds in large-scene SAR images, the traditional methods followed a two-step process: first performing sea–land segmentation (SLS) to remove irrelevant areas and narrow the search space, followed by ship detection using the constant false alarm rate (CFAR) method [4,5,6]. These CFAR-based approaches depend heavily on manual modeling and fixed parameter settings, making them less robust in complex backgrounds. For example, unclear land–sea boundaries and scattering interference near coasts often lead to segmentation errors, requiring manual adjustment of the thresholds and morphological parameters. Moreover, the inherent physical properties of SAR cause ships and background clutter to appear to be similar, increasing the false alarm rates in CFAR detectors. Consequently, the reliance on manual feature representation constrains the performance of traditional methods in large-scene SAR ship detection.
To overcome the limitations of the traditional CFAR-based methods, deep learning (DL) techniques are increasingly applied to large-scene SAR ship detection. The first category of DL-based approaches follows a cascaded framework, where semantic segmentation or sea–land classification is performed prior to detection [3]. For example, fully convolutional networks (FCNs) are employed to replace traditional sea–land segmentation, improving automation and accuracy. The segmented patches are then processed by detectors such as Faster R-CNN [7] and YOLO [8] for ship localization. In addition, some studies replace the segmentation step with slice-level classification, where a classification network first selects patches likely to contain ships, followed by detection [9]. Although cascaded frameworks improve the efficiency, they increase the annotation costs, ignore global context, reduce the robustness in complex backgrounds, and require separate training and maintenance, which conflicts with the trend toward full automation. The second category adopts end-to-end frameworks that directly process large-scene images, removing redundant preprocessing and jointly optimizing context modeling and localization. However, both approaches still face the fundamental challenge of detecting extremely small targets in large scenes. As shown in Figure 1, ships occupy only a tiny fraction of pixels, often appearing as a few bright points. This results in three problems: (1) Small ships are easily confused with speckle noise. (2) Their weak features are often lost during downsampling and multi-scale fusion, making them difficult to distinguish and localize. (3) The conventional feature extractors and detection heads respond poorly to small ships, resulting in low confidence scores and frequent missed detections. To address these challenges, additional prior information is needed to explicitly guide the detector toward potential target areas during feature extraction and localization.
The Gaussian centerness model employs a 2D Gaussian distribution to accentuate the confidence disparity between the target center and the background. It has been validated in arbitrary-oriented object detection (AOOD), where it significantly improves the responses in the target center region [10]. Similarly, in remote sensing, Dai et al. [11] propose a Gaussian significance map generation network that helps the detector coarsely locate objects and then focus on and recognize their small regions. This suggests that such physical priors can also be exploited to detect small ships in large-scene SAR images. Specifically, the strong-center, weak-edge scattering characteristics of ships in SAR images align well with the Gaussian centerness model, whereas background clutter and speckle noise exhibit more complex, multi-peaked, and spatially irregular distributions that conform poorly to a single Gaussian model (SGM).
Based on the above observations, in this paper, we propose a progressive saliency-guided (PSG) method to enhance the detection accuracy for small ships in large-scene SAR imagery. Firstly, a dual-guided perception enhancement (DGPE) module is introduced to strengthen the responses to small targets by injecting ship saliency maps as priors for cross-feature guidance and focused attention on ship regions. This enables accurate localization while alleviating the feature degradation during downsampling. Secondly, to address the misidentification of small ships as the background due to low confidence in large scenes, a saliency confidence aware assessment (SCAA) mechanism is designed. This mechanism combines the spatial information based on saliency priors with the confidence difference between the primary and auxiliary branches. It evaluates the proposal quality and generates classification weights, guiding the model to focus on small objects with accurate localization but low classification confidence during training. Guided by saliency maps, the proposed PSG encourages the model to enhance small object feature representation across various network levels and improves discriminative learning through dynamic proposal classification calibration, driven by the synergy of the DGPE and SCAA modules. The main contributions of this work are summarized as follows:
  • To enhance the small ship detection performance in large-scene SAR images, we propose PSG, which uses saliency-derived positional priors to guide the model to focus on small targets and extract their features.
  • In response to the feature disappearance introduced by downsampling small ships, DGPE is designed, incorporating saliency maps as additional prior information to facilitate cross-branch guidance and critical area alignment for SAR image features.
  • To tackle missed detections of small ships caused by an excessively small field of view, the SCAA mechanism computes the classification and localization scores at key locations under saliency prior guidance, enhancing learning from hard proposals.
  • Extensive experiments are conducted on the LS-SSDD and HRSID datasets. The results demonstrate that PSG achieves a 4% AP improvement over the baseline on each dataset, outperforming the existing methods.
The remainder of this paper is organized as follows. Section 2 reviews related work on large-scene SAR ship detection. Section 3 introduces the proposed PSG method. The experimental results and model analysis are presented in Section 4. Section 5 discusses the results and future work. Finally, Section 6 concludes this paper.

2. Related Work

2.1. Large-Scene SAR Ship Detection

The existing methods for large-scene SAR ship detection are primarily classified into three frameworks: (1) traditional cascaded frameworks, (2) DL-based cascaded frameworks, and (3) end-to-end DL frameworks. The following provides a detailed review of these methodologies.
Traditional cascaded frameworks. The early approaches mainly adopted a cascaded framework which sequentially performed sea–land segmentation and CFAR-based detection. The process typically begins with segmentation, achieved either through intensity-based techniques such as clustering and thresholding [5] or via geospatial masking with external coastline databases such as GSHHG [6]. Subsequently, CFAR algorithms identify candidate regions by statistically modeling background clutter. To handle complex sea environments, researchers introduce advanced CFAR variants, including those based on K-distribution modeling [12] and adaptive schemes like SO-CFAR and GO-CFAR [13,14]. Finally, the framework suppresses false alarms by verifying candidate regions using texture-based feature discriminators. However, the dependence on handcrafted features and physical priors makes them sensitive to segmentation errors and limits their adaptability to diverse backgrounds. These limitations have led to the adoption of deep learning for more robust and adaptable detection.
DL-based cascaded frameworks. Recent advances in DL techniques have enabled the integration of convolutional neural networks (CNNs) into large-scene SAR ship detection, reducing the reliance on handcrafted features and physical priors. A typical cascaded framework first performs sea–land segmentation or coarse classification to remove irrelevant areas, followed by ship detection in the remaining regions. For example, Liu et al. [15] propose SLS-CNN, which combines spectral residual and corner-guided segmentation with a sliding-window detection strategy. Cui et al. [3] present a detection process using coarse sea–land classification, while the TNN [16] improves the segmentation accuracy for target extraction. Jia et al. [17] present a progressive algorithm that integrates both DL-based and traditional methods, leveraging hierarchical spatial constraints and multi-level feature fusion to enhance the detection performance. Although DL-based cascaded frameworks improve the accuracy over that of the traditional methods, their multi-stage design complicates training, increases the annotation costs, and overlooks the full-scene context, limiting their applicability to fully automated large-scene SAR ship detection.
End-to-end DL frameworks. Another approach employs unified end-to-end DL frameworks, which remove multi-stage processing and preserve the full-scene context. These methods typically divide large-scene SAR images into patches and process them using a DL detector. Anchor-based approaches include both single-stage detectors such as SSD [18] and YOLO [8], as well as two-stage detectors like Faster R-CNN [7]. Notable advancements include Shao et al. [20] fusing pixel-level and feature-level CFAR-CNN outputs to enhance the location determination accuracy and Zhang et al. [19] developing a multi-scale detection framework with dual-stage post-processing to improve the recall. Various YOLO-based enhancements incorporating lightweight feature pyramids and attention mechanisms have also appeared [21,22]. Conversely, anchor-free methods [23,24,25] bypass the anchor design by leveraging scattering centers for candidate localization.
Although end-to-end frameworks achieve a better overall performance, they often fail to accurately pinpoint small ships, resulting in high false negative rates and inadequate localization for such targets. For instance, Shao et al.’s hybrid method struggles with activating the central features of small targets [20], and Zhang et al.’s multi-scale approach demonstrates similarly weak feature activation for low-resolution vessels, especially in large-scene settings [19]. In practical scenarios, large-scene SAR ship detection is crucial for wide-area maritime monitoring and defense surveillance, where the sparsity and small size of ships amid cluttered backgrounds exacerbate these limitations further. To address this issue, this paper integrates saliency map priors into the detection framework, explicitly guiding the model’s attention toward small object regions to enhance the recognition robustness.

2.2. Small Object Detection

In large-scene SAR ship detection, the inherent challenges of small object detection (SOD), namely weak feature representation, noise interference, and localization sensitivity, are further exacerbated by the wide-area coverage. The existing SOD methods, including data augmentation [26,27,28], improved feature extraction [29,30,31,32], multi-scale learning [33,34,35,36], and better sample allocation [37,38,39], are effective in optical images but perform poorly in SAR imagery due to coherent scattering and speckle noise.
Activating and modeling key positional features are essential for SOD. Downsampling small targets in deep networks often leads to excessive spatial degradation and weak feature responses, compromising the localization accuracy. To address this issue, Wang et al. [40] utilize deformable convolutions in YOLOv8 to enlarge the receptive fields while preserving the local structure. Gao et al. [41] propose multi-wave upsampling (MRU) to reconstruct high-frequency details in the frequency domain for improved edge localization. Yu et al. [42] developed the SBAM module, which integrates bilinear interpolation and attention mechanisms to suppress background interference and enhance the focus on weak targets. Furthermore, Bai et al. [43] employ reinforcement learning to optimize the wavelet time-frequency channels for enhanced target feature representation. While these methods improve the feature representation, they rely on clear textures and structural details, which are often absent in large-scene SAR images due to scattering, resulting in a suboptimal performance in detecting small targets.
Optimizing the sample assignment mitigates matching bias in SOD tasks. Because small objects yield low IoU values with candidate anchors, proposals that lie close to a target are often assigned as negatives, leading to frequent missed detections. To address this, Wang et al. [38] and Xu et al. [39] model the anchors and ground truths as 2D Gaussian distributions, using the distribution similarity instead of the IoU for box matching. Zhu et al. [44] proposed AutoAssign, which adaptively adjusts the thresholds based on the target scale to improve the sampling probability for small objects. Yuan et al. [45] introduce a size-adaptive dynamic threshold to generate high-quality proposals in an RPN. While these methods improve the assignment quality, they still rely on geometric consistency or confidence scores, which are easily disturbed by speckle noise and weak target signatures in SAR imagery.
Incorporating priors and explicit guidance structures enhances small object perception from both architectural and supervisory perspectives. For example, Zhao et al. [46] developed SCDNet, which integrates scene classification and foreground enhancement to focus on semantically relevant regions. Zhang et al. [47] propose an object reconstruction network with multi-receptive field adaptive enhancements to guide backbone optimization. The Adaptive Region Proposal Network (ARPN) [48] combines progressive attention with density map predictions to improve the proposal quality and recall. These methods attempt to improve the modeling of small objects using backbone networks through intermediate feature guidance. However, the scattering noise in SAR images complicates target feature extraction, making it challenging to generate high-quality priors for detection.
The existing SOD methods in large-scene SAR detection often perform poorly due to the scattering properties of SAR images. This paper addresses these challenges by introducing an approach that uses saliency maps as external priors to guide the feature extraction and sample assignment, enabling accurate recognition and localization of small ships in large-scene SAR images.

3. Methods

In this section, this paper presents the technical details of the proposed PSG method. Section 3.1 describes the overall network architecture. Section 3.2, Section 3.3 and Section 3.4 detail the motivation and implementation of the Gaussian saliency map generation, the DGPE module, and the SCAA module, respectively. Finally, Section 3.5 defines the overall optimization objective for the model.

3.1. The Overall Architecture

The overall architecture of the proposed PSG method is illustrated in Figure 2. Built on the two-stage Faster R-CNN framework, it employs a dual-branch structure to improve the detection of small targets in large-scene SAR images. The primary branch receives SAR images and performs the main detection task, while the auxiliary branch receives target saliency maps to train the detection network, acting as explicit positional priors. During the training phase, both branches use networks with identical architectures but independent parameters to extract multi-level features. In the backbone, the dual-guided perception enhancement (DGPE) module uses saliency features from the auxiliary branch to guide the primary branch’s attention towards small target regions, enhancing the feature responses. Each branch employs an RPN with an identical architecture but separate parameters, producing proposal regions denoted as rois and rois_hm for the primary and auxiliary branches, respectively. To further focus the primary branch on small targets, the saliency confidence aware assessment (SCAA) mechanism adjusts the classification loss weights based on the localization quality and inter-branch confidence discrepancy. This strategy emphasizes well-localized but semantically ambiguous proposals, improving the primary detector’s classification optimization. Both branches are jointly optimized during training. During inference, only the primary branch is activated, ensuring efficient and effective detection.

3.2. Gaussian Saliency Map Generation

Given that small ships are often overlooked during downsampling, this paper introduces an auxiliary detection branch that takes saliency maps as the input to provide explicit spatial guidance. The auxiliary branch has the same architecture as the primary detector with independent parameters. During training, the auxiliary branch provides saliency priors that progressively guide the primary detector to learn small targets. The auxiliary branch is discarded at inference, incurring no runtime overhead. As shown in Figure 3, each saliency map is generated by centering a Gaussian distribution at the target centroid, with a controlled amplitude and spatial extent to model the target region and enhance the spatial attention during training.
Specifically, given the bounding box center coordinates $C(x_i, y_i)$, each target is modeled as a normalized 2D Gaussian distribution on the image plane. The positional saliency map is defined as follows:
$$ G_i(x, y) = \exp\!\left( -\frac{(x - x_i)^2}{2\sigma_{x,i}^2} - \frac{(y - y_i)^2}{2\sigma_{y,i}^2} \right) \quad (1) $$
Here, $\sigma_{x,i}$ and $\sigma_{y,i}$ are proportional to the target width $w_i$ and height $h_i$, respectively, i.e., $\sigma_{x,i} = w_i / k$ and $\sigma_{y,i} = h_i / k$. The scaling factor $k$ controls the spread of the Gaussian distribution. The final saliency map is obtained by summing the individual target Gaussian kernels:
$$ G(x, y) = \sum_{i=1}^{N} G_i(x, y) \quad (2) $$
where $N$ denotes the number of targets in the image. The resulting saliency map $G$ is added as an input channel alongside the SAR image during training, providing an explicit spatial attention prior that enhances feature activation in small ship regions and suppresses irrelevant background responses.
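To make the construction concrete, the following NumPy sketch builds such a saliency map for a set of axis-aligned boxes. The function name, the (cx, cy, w, h) box format, the sigma = size / k convention, and the clipping of the summed map to [0, 1] are illustrative assumptions rather than details taken from the paper.

import numpy as np

def gaussian_saliency_map(boxes, height, width, k=4.0):
    """Build a positional saliency map by summing one 2D Gaussian per target.

    boxes  : iterable of (cx, cy, w, h) boxes in pixel coordinates
    height : image height in pixels
    width  : image width in pixels
    k      : scaling factor controlling the spread (sigma = size / k, assumed)
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    saliency = np.zeros((height, width), dtype=np.float32)
    for cx, cy, w, h in boxes:
        sigma_x = max(w / k, 1e-3)
        sigma_y = max(h / k, 1e-3)
        g = np.exp(-((xs - cx) ** 2) / (2 * sigma_x ** 2)
                   - ((ys - cy) ** 2) / (2 * sigma_y ** 2))
        saliency += g
    # Clip so that overlapping ships do not push the prior above 1 (our choice).
    return np.clip(saliency, 0.0, 1.0)

# Example: two small ships in an 800 x 800 patch.
sal = gaussian_saliency_map([(120, 340, 12, 6), (560, 205, 10, 8)], 800, 800)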

3.3. Dual-Guided Perception Enhancement

In large-scene SAR images, small ships are often difficult to detect because they are easily buried in cluttered backgrounds and speckle noise. To address this issue, a DGPE module is designed to make use of saliency priors and improve the representation of small targets. As shown in Figure 4, DGPE takes the feature maps from both the SAR branch ( F c i ) and the saliency-guided branch ( F h i ) as input. Since saliency maps highlight potential targets more clearly, Cross-Guided Attention (CGA) is applied, where saliency-derived features are used to guide the learning of the target features in the SAR branch. However, relying solely on this guidance may result in unstable learning under noisy SAR backgrounds. To address this, critical area consistency alignment (CACA) is introduced to enforce consistency between saliency attention and SAR feature attention, which helps stabilize feature learning. Compared with existing cross-attention methods, such as CAINet [49], which mainly model feature interactions, DGPE strengthens the target representation using saliency-guided attention and focuses on small targets through attention alignment between the two branches.
Specifically, for each input SAR feature $F_c \in \mathbb{R}^{C \times H \times W}$ and saliency feature map $F_h \in \mathbb{R}^{C \times H \times W}$, the spatial attention weights $\rho_c \in \mathbb{R}^{1 \times H \times W}$ and $\rho_h \in \mathbb{R}^{1 \times H \times W}$ are first extracted to quantify the strength of spatial activation in each branch. Here, $\rho_c$ represents the response distribution of the primary branch, while $\rho_h$ encodes saliency-driven spatial priors.
$$ \rho = \mathrm{AT}(F) \quad (3) $$
where $\mathrm{AT}(\cdot)$ denotes a spatial attention module designed to capture and enhance target regions. The spatial attention weights from the saliency branch modulate the primary branch features, resulting in the cross-guided SAR feature representation $F_c'$, expressed as
$$ F_c' = F_c \odot \rho_h \quad (4) $$
where $\odot$ denotes element-wise multiplication. To enhance the backbone’s ability to capture target features, consistency between the two attention weights in critical regions is enforced through a discrepancy constraint, which is defined as follows:
$$ L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \rho_i^{c} - \rho_i^{h} \right)^2 \quad (5) $$
This loss term encourages the primary branch’s features to align with the saliency map priors in target regions, enabling focus and feature alignment within those regions. Overall, the process first guides the primary feature map with small target location information from the saliency map. It then enhances the alignment between the two branches in critical regions through a consistency constraint. This enables effective learning of small or weak targets without introducing background noise.
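A minimal PyTorch sketch of the DGPE computation is given below. The spatial attention module AT(·) is reduced to a 1 × 1 convolution followed by a sigmoid for brevity, and all class and variable names are illustrative assumptions; the authors’ actual attention design and channel configuration may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Simplified AT(.): project C-channel features to a single spatial weight map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return torch.sigmoid(self.conv(x))  # shape (B, 1, H, W)

class DGPE(nn.Module):
    """Dual-guided perception enhancement (sketch).

    Cross-Guided Attention: the saliency-branch weights modulate the SAR features.
    Critical area consistency: an MSE term pulls the two attention maps together.
    """
    def __init__(self, channels):
        super().__init__()
        self.at_sar = SpatialAttention(channels)
        self.at_sal = SpatialAttention(channels)

    def forward(self, f_c, f_h):
        rho_c = self.at_sar(f_c)             # primary-branch attention rho_c
        rho_h = self.at_sal(f_h)             # saliency-branch attention rho_h
        f_c_guided = f_c * rho_h             # Equation (4): element-wise cross guidance
        loss_mse = F.mse_loss(rho_c, rho_h)  # Equation (5): critical area consistency
        return f_c_guided, loss_mse

# Usage on one backbone level:
dgpe = DGPE(channels=256)
f_c = torch.randn(2, 256, 100, 100)   # primary (SAR) features
f_h = torch.randn(2, 256, 100, 100)   # auxiliary (saliency) features
f_c_guided, loss_mse = dgpe(f_c, f_h)

In a full detector, one such module would be attached to the selected backbone levels (C2 and C3 in the ablation of Section 4.3.1), with the returned consistency term accumulated into the overall loss.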

3.4. Saliency Confidence Aware Assessment

In anchor-based detectors, positive and negative proposals are assigned based on the intersection-over-union (IoU) between candidate anchors and ground truth (GT) boxes. However, in large-scene SAR images, small ship targets often fail to meet the IoU threshold for positive proposals. As illustrated in Figure 5d–f, a one-pixel shift in the anchor position causes only a minor change in the IoU for large objects. In contrast, the same shift leads to a significant drop in the IoU for small ships, as shown in Figure 5b,c. This demonstrates that small targets are more sensitive to IoU variations, increasing the risk of them being overlooked during training and causing an imbalance in the distribution of positive and negative proposals, which degrades the detection performance.
The imbalance between positive and negative proposals, especially for small targets, hinders accurate classification and efficient model training. Unlike existing dynamic sample assignment methods, such as AutoAssign [44], that adopt a global assignment strategy to determine positive samples and their weights, this paper introduces saliency priors to provide spatial guidance. A saliency confidence aware assessment (SCAA) strategy is further developed to adaptively reweight critical proposals, as shown in Figure 6. This method combines spatial priors from the saliency branch with the confidence difference between classification outputs to create a dual-quality evaluation. First, the spatial quality of each proposal in the main branch is measured by its maximum IoU with positive proposals from the saliency branch. Second, the classification reliability is assessed based on the confidence difference between the primary and saliency branches. These indicators generate dynamic weights for the classification loss, strengthening the supervision of proposals with high spatial consistency but a low classification agreement, improving small target discrimination. The detailed procedure of the SCAA is summarized in Algorithm 1.
Specifically, in the RPN stage, the primary branch and the saliency-guided auxiliary branch generate their respective candidate proposals and assigned labels, denoted as $\{b_i\}_{i=1}^{N}$ with labels $\{y_i\}_{i=1}^{N}$ and $\{k_j\}_{j=1}^{N}$ with labels $\{t_j\}_{j=1}^{N}$, respectively. Here, $y_i$ and $t_j$ are labels, where a value of 1 indicates a positive sample and a value of 0 indicates a negative sample. Proposals with label 1 from the saliency branch are treated as high-quality spatial priors. For each proposal $b_i$ in the primary branch, the IoU is calculated with all positive proposals from the saliency branch, with the maximum IoU value used as a measure of spatial alignment. The corresponding mathematical formulation is shown below.
$$ \alpha_i = \max_{1 \le j \le N} \left\{ \mathrm{IoU}(b_i, k_j) \;\middle|\; t_j = 1 \right\} \quad (6) $$
where $\mathrm{IoU}(\cdot,\cdot)$ denotes the IoU computation, and $N$ represents the number of candidate proposals. A matching threshold $\tau_q$ is defined to evaluate the spatial consistency between each proposal and the saliency proposals. If $\alpha_i \ge \tau_q$, the proposal $b_i$ is considered spatially consistent with the saliency proposals and assigned a spatial quality score $Q_{\mathrm{space}} = \alpha_i$. If $\alpha_i < \tau_q$, $b_i$ is treated as a negative sample with $Q_{\mathrm{space}} = 0$. This gating prioritizes consistent proposals and suppresses mismatched ones. Next, classification confidence differences are introduced to assess the reliability of the proposals, as shown in the following equation.
$$ \beta(i) = \left| P_{\mathrm{sar}}(i) - P_{h}(i) \right| \quad (7) $$
Algorithm 1 The saliency confidence aware assessment (SCAA) strategy.
Input: The main branch proposals $R = \{b_i\}_{i=1}^{N}$ with labels $L = \{y_i\}_{i=1}^{N}$; the saliency branch proposals $R_h = \{k_j\}_{j=1}^{N}$ with labels $L_h = \{t_j\}_{j=1}^{N}$; the IoU threshold $\tau_q$; the confidence threshold $\tau_p$
Output: The classification loss $L_{cls}$
1: for all $i = 1, 2, \ldots, N$ do
2:   for all $j = 1, 2, \ldots, N$ do
3:     if $t_j = 1$ then
4:       $M_i = \mathrm{IoU}(b_i, k_j)$
5:       $\mathrm{IoU}_i = \max(\mathrm{IoU}_i, M_i)$
6:     end if
7:   end for
8: end for
9: The primary branch foreground confidences $P_{\mathrm{sar}} = \{p_i\}_{i=1}^{N}$
10: The saliency branch foreground confidences $P_h = \{q_i\}_{i=1}^{N}$
11: The classification discrepancy $\Delta P = \{|p_i - q_i|\}_{i=1}^{N}$
12: Initialize the composite quality assessment weights $H = \{1\}_{i=1}^{N}$
13: for $i = 1$ to $N$ do
14:   if $\mathrm{IoU}_i > \tau_q$ and $\Delta P_i > \tau_p$ then
15:     $H_i = e^{\Delta P_i} + \mathrm{IoU}_i$
16:   end if
17: end for
18: Compute the weighted classification loss $L_{\mathrm{SCAA}} = \frac{1}{\sum_{i=1}^{N} H(i) + \varepsilon} \sum_{i=1}^{N} H(i)\, \mathrm{CE}(y_i, \hat{y}_i)$
19: return $L_{cls}$
Specifically, $P_{\mathrm{sar}}(i)$ and $P_h(i)$ denote the foreground class probabilities for the $i$th box in the primary and saliency branches, respectively. The absolute difference between these two probabilities, $\beta(i)$, reflects the classification discrepancy. A large discrepancy indicates weak semantic activation despite accurate spatial localization, leading to classification uncertainty and warranting stronger supervision. Based on the spatial alignment score $Q_{\mathrm{space}}$ and the confidence difference $\beta(i)$, a composite quality assessment weight is defined for the $i$th candidate box.
$$ H(i) = \begin{cases} e^{\beta(i)} + Q_{\mathrm{space}}(i), & \text{if } Q_{\mathrm{space}}(i) > \tau_q \text{ and } \beta(i) > \tau_p \\ 1, & \text{otherwise} \end{cases} \quad (8) $$
Here, $\tau_q$ and $\tau_p$ denote the thresholds for spatial alignment and confidence discrepancy filtering, respectively. In this work, they are empirically set to $\tau_q = 0.7$ and $\tau_p = 0.3$. The resulting quality assessment weight is incorporated into the cross-entropy classification loss to guide the network in focusing on informative proposals during training.
$$ L_{\mathrm{SCAA}} = \frac{1}{\sum_{i=1}^{n} H(i) + \varepsilon} \sum_{i=1}^{n} H(i)\, \mathrm{CE}(y_i, \hat{y}_i) \quad (9) $$
where $n$ represents the total number of proposals. $\mathrm{CE}(y_i, \hat{y}_i)$ denotes the cross-entropy loss for each proposal between the predicted class probability $\hat{y}_i$ and the GT label $y_i$. $H(i)$ is the quality-aware weight assigned to the $i$th candidate proposal, and $\varepsilon$ is a small constant added for numerical stability. This strategy encourages the model to focus on each proposal that is spatially consistent but semantically uncertain, improving its ability to distinguish small ships.
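The sketch below mirrors Algorithm 1 and Equations (6)-(9) in PyTorch, assuming the two branches’ proposals, labels, and foreground confidences are already aligned index-wise; the function name, tensor layout, and the use of torchvision’s box_iou are our own choices for illustration, not the authors’ implementation.

import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def scaa_loss(main_boxes, main_logits, main_labels,
              sal_boxes, sal_labels,
              p_sar, p_sal, tau_q=0.7, tau_p=0.3, eps=1e-6):
    """Saliency confidence aware assessment loss (sketch of Algorithm 1).

    main_boxes  : (N, 4) primary-branch proposals in (x1, y1, x2, y2) format
    main_logits : (N, C) primary-branch classification logits
    main_labels : (N,)   integer class labels assigned to the primary proposals
    sal_boxes   : (M, 4) saliency-branch proposals
    sal_labels  : (M,)   saliency-branch labels (1 = positive)
    p_sar, p_sal: (N,)   foreground confidences of the primary / saliency branches
    """
    pos_sal = sal_boxes[sal_labels == 1]
    if pos_sal.numel() > 0:
        # Spatial quality Q_space: max IoU with positive saliency proposals (Eq. 6).
        q_space = box_iou(main_boxes, pos_sal).max(dim=1).values
    else:
        q_space = torch.zeros(main_boxes.size(0), device=main_boxes.device)

    beta = (p_sar - p_sal).abs()                     # confidence discrepancy (Eq. 7)
    h = torch.ones_like(q_space)                     # default weight H(i) = 1
    mask = (q_space > tau_q) & (beta > tau_p)
    h[mask] = torch.exp(beta[mask]) + q_space[mask]  # H(i) = e^beta(i) + Q_space(i) (Eq. 8)

    ce = F.cross_entropy(main_logits, main_labels, reduction="none")
    return (h * ce).sum() / (h.sum() + eps)          # weighted classification loss (Eq. 9)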

3.5. The Loss Function

The PSG module is optimized using a multi-task joint loss function, consisting of the primary detection loss $L_{\mathrm{det}}^{\mathrm{SAR}}$, the saliency-guided auxiliary branch loss $L_{\mathrm{det}}^{H}$, and the consistency guidance loss for critical regions $L_{\mathrm{MSE}}$.
For the SAR primary detection loss $L_{\mathrm{det}}^{\mathrm{SAR}}$, the standard Faster R-CNN loss formulation [7] is applied, which includes a classification loss $L_{\mathrm{SCAA}}$ and a bounding box regression loss $L_{\mathrm{reg}}$.
$$ L_{\mathrm{det}}^{\mathrm{SAR}} = L_{\mathrm{SCAA}} + L_{\mathrm{reg}} \quad (10) $$
where $L_{\mathrm{SCAA}}$ denotes the classification loss of the SAR detection branch, enhanced by the SCAA strategy. This strategy assigns adaptive weights to each proposal based on its IoU with foreground boxes from the auxiliary branch and the discrepancy in the classification confidence, offering stronger supervision for spatially aligned but semantically ambiguous proposals. $L_{\mathrm{SCAA}}$ is defined in Equation (9) of Section 3.4.
For the auxiliary branch loss $L_{\mathrm{det}}^{H}$, the same Faster R-CNN loss formulation is utilized, comprising a classification loss $L_{\mathrm{cls}}^{H}$ and a bounding box regression loss $L_{\mathrm{reg}}^{H}$.
$$ L_{\mathrm{det}}^{H} = L_{\mathrm{cls}}^{H} + L_{\mathrm{reg}}^{H} \quad (11) $$
For the consistency guidance loss for critical regions $L_{\mathrm{MSE}}$, the DGPE minimizes the discrepancy between the attention maps from the SAR features and those from the saliency branch. This encourages the primary network to focus on potential target regions aligned with saliency priors. The spatial consistency loss $L_{\mathrm{MSE}}$ is defined in Equation (5) of Section 3.3.
The total training objective is the weighted combination of the following components:
$$ L_{\mathrm{total}} = L_{\mathrm{det}}^{\mathrm{SAR}} + \lambda_0 L_{\mathrm{det}}^{H} + \lambda_1 L_{\mathrm{MSE}} \quad (12) $$
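As a minimal sketch, the weighted combination of Equation (12) can be written as a one-line helper; the function name is illustrative, and the default weights follow the settings reported in Section 4.1.

def psg_total_loss(loss_det_sar, loss_det_h, loss_mse, lam0=1.0, lam1=1.0):
    """Equation (12): weighted sum of the three loss terms (defaults per Section 4.1)."""
    return loss_det_sar + lam0 * loss_det_h + lam1 * loss_mse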

4. Results

In this section, this paper presents an evaluation of our proposed PSG method. Section 4.1 describes the datasets, evaluation metrics, and implementation details. Section 4.2 presents a comparison with other state-of-the-art detection methods. Section 4.3 provides the model analysis and discussion. Additional details are provided below.

4.1. The Experimental Setup

In this study, this paper evaluates the performance of the proposed PSG method using two SAR image datasets: the large-scene SAR dataset LS-SSDD [50] and the small object SAR dataset HRSID [51]. The detailed specifications of these datasets are presented in Table 1, and several representative sample images from each dataset are shown in Figure 7. The datasets exhibit a rich diversity of scene types, including offshore and nearshore environments, in which the ship targets are generally small and sparsely distributed.
The LS-SSDD dataset comprises 15 large-scene SAR images, each of which was captured by Sentinel-1 and covers a swath of approximately 250 km. With an average ship target size of 351 pixels, the dataset is well suited to large-scene object detection tasks. As shown in Figure 7a, directly feeding full LS-SSDD images into the network is impractical due to their high resolution. Therefore, each image is divided into smaller 800 × 800 pixel patches, which are input into the network for detection.
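For illustration, a simple non-overlapping tiling into 800 × 800 patches could look like the sketch below; zero-padding of border tiles and the absence of overlap are assumptions, since the paper does not specify its cropping scheme.

import numpy as np

def tile_scene(image, patch=800):
    """Split a large-scene SAR image (H, W) into non-overlapping patch x patch tiles.

    Border tiles are zero-padded to the full patch size (an assumption).
    Each tile is returned with its (row, col) offset so that detections can be
    mapped back to full-scene coordinates.
    """
    h, w = image.shape
    tiles = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tile = np.zeros((patch, patch), dtype=image.dtype)
            block = image[r:r + patch, c:c + patch]
            tile[:block.shape[0], :block.shape[1]] = block
            tiles.append((tile, (r, c)))
    return tiles

# Example: a 16,000 x 24,000 pixel scene yields 20 x 30 = 600 tiles of 800 x 800,
# consistent with the 600 patches per LS-SSDD scene used for timing in Section 4.1.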
As shown in Figure 7b, the HRSID dataset contains small ships, making it a suitable benchmark for evaluating the detection of small objects in SAR imagery.
In this paper, the performance of the detection model is evaluated using the metrics of precision, recall, and average precision (AP) [52]. First, the IoU between each predicted bounding box B p and the corresponding ground truth bounding box B gt is defined as follows:
$$ \mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \quad (13) $$
Next, precision is used to measure the accuracy of the detection model. It is calculated as follows:
$$ \mathrm{Precision} = \frac{TP}{TP + FP} \quad (14) $$
Here, TP represents the number of positive samples correctly predicted as ship targets, and FP denotes the number of negative samples incorrectly predicted as ships. Recall is used to measure the model’s ability to identify ships, defined as the proportion of correctly predicted ships among all actual ships.
$$ \mathrm{Recall} = \frac{TP}{TP + FN} \quad (15) $$
where FN denotes the number of positive samples that are incorrectly classified as negative.
Additionally, the average precision (AP) is used to evaluate the overall performance of the model; it corresponds to the area under the precision–recall curve.
$$ \mathrm{AP} = \int_{0}^{1} P(r)\, \mathrm{d}r \quad (16) $$
where $P(r)$ denotes the precision corresponding to a given recall $r$. The F1-Score is used to evaluate the overall performance of the model in terms of precision and recall, defined as follows:
$$ \mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (17) $$
The inference performance of the model is evaluated using frames per second (FPS), defined as $\mathrm{FPS} = 1/t$, where $t$ denotes the inference time for a single small patch. For the LS-SSDD dataset, the total detection time for a full, large-scene satellite SAR image is $T = 600t$.
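A small sketch of how these metrics could be computed from matched detections is given below; the all-point trapezoidal integration of P(r) is a numerical approximation of Equation (16) and may differ from the interpolated AP used by standard toolkits, and all argument names are illustrative.

import numpy as np

def detection_metrics(tp, fp, fn, recall_curve, precision_curve, t_patch):
    """Compute precision, recall, F1-Score, AP, FPS, and full-scene time T.

    tp, fp, fn      : counts of true positives, false positives, false negatives
    recall_curve    : increasing recall values r
    precision_curve : precision values P(r) at those recall levels
    t_patch         : inference time (seconds) for one 800 x 800 patch
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    ap = np.trapz(precision_curve, recall_curve)  # numerical form of AP = integral of P(r) dr
    fps = 1.0 / t_patch
    T = 600 * t_patch                             # full LS-SSDD scene = 600 patches
    return precision, recall, f1, ap, fps, T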
The proposed PSG model is implemented in PyTorch-1.8.0 and built upon the Faster R-CNN framework with ResNet-50 [53] as the backbone. The feature extractor is initialized with ImageNet-pretrained weights, while the remaining parameters are randomly initialized. This paper employs stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 as the optimizer. The initial learning rate is set to 0.001. The entire model is trained end-to-end for seven epochs on an NVIDIA GeForce RTX 4090 GPU with a batch size of 4. During both training and inference, the input image size is fixed at 800 × 800 pixels.
For the object detection task, in Equation (12), the hyperparameter $\lambda_0$ for the saliency-guided auxiliary branch is set to 1, and the hyperparameter $\lambda_1$ for the critical area consistency alignment is also set to 1. The other hyperparameter settings are as follows: in Equation (8), the matching threshold $\tau_q$ is set to 0.7, and the confidence discrepancy threshold $\tau_p$ is set to 0.3.

4.2. Comparisons with the State-of-the-Art

To demonstrate the superiority of the proposed PSG method, this paper compares it with eight state-of-the-art object detection methods, including anchor-based detectors such as RetinaNet [54] and YOLOX [55]; anchor-free frameworks like ATSS [56], AutoAssign [44], and TOOD [57]; and small object detection methods such as NWD [38] and RFLA [39]. Faster R-CNN with the FPN [7] serves as the baseline. All methods are evaluated on the LS-SSDD and HRSID datasets, using ResNet50 [53] as the backbone for a fair comparison. Quantitative evaluations are conducted using the metrics from Section 4.1, and qualitative comparisons are made by visualizing the detection results of the different algorithms. The following section presents the detailed experimental results.

4.2.1. Quantitative Evaluation

Table 2 and Table 3 present the performance of the various object detection methods on the LS-SSDD and HRSID datasets. As shown in Table 2, the existing detection models perform relatively poorly in large-scene scenarios on the LS-SSDD dataset. Among the anchor-based methods, RetinaNet lacks the stable performance of two-stage models. Anchor-free methods such as ATSS and TOOD show an improved AP but have low recall rates, while YOLOx achieves the highest precision but struggles with recall. Specialized small object detectors like NWD and RFLA underperform due to the small target size and frequent land-based false alarms. These results suggest that the existing SOD methods are not well suited to large-scene SAR detection. In contrast, PSG achieves the best overall accuracy, with a precision of 95.60%, recall of 80.28%, F1-Score of 87.27%, and AP of 73.38%, outperforming all competitors. Table 3 compares PSG with the SOTA on the HRSID dataset. Although the anchor-free methods AutoAssign and TOOD achieve a high AP, their precision is notably lower than the baseline. Similarly, advanced small object detection methods like NWD and RFLA show a poor F1-Score performance. These results indicate that many advanced methods sacrifice precision for small object detection capability, leading to a suboptimal overall performance. In comparison, PSG outperforms the others, with a precision of 99.47%, recall of 87.08%, and F1-Score of 92.86%, with an AP 4.35% higher than the baseline. From the results in Table 2 and Table 3, in terms of inference speed, single-stage models like YOLOx achieve the fastest performance, while two-stage models are generally slower. PSG matches the baseline’s speed, achieving 21.50 FPS on LS-SSDD and 18.10 FPS on HRSID, while providing higher accuracy. Moreover, T is measured in seconds and reflects the practical detection efficiency in large-scene SAR scenarios, with smaller values indicating faster inference. YOLOx achieves the fastest inference time of 9.33 s, while RFLA has the slowest at 37.74 s. PSG has an inference time of 27.91 s, offering a balanced performance and significantly improving the precision and recall. These results demonstrate that PSG achieves a superior balance between the detection accuracy and inference speed. Its strong performance across both datasets confirms its generalization capability and robustness, particularly in detecting small ships in large-scene SAR imagery.

4.2.2. Qualitative Evaluation

Figure 8 and Figure 9 show the visualization results of different algorithms on the LS-SSDD and HRSID datasets. In Figure 8a, the green boxes represent the ground truth of the SAR ship targets. Figure 8b–j show the ship detection results on the LS-SSDD dataset using the baseline, RetinaNet, YOLOx, ATSS, AutoAssign, TOOD, NWD, RFLA, and PSG, respectively. In the visualizations, yellow boxes indicate detected targets, blue boxes highlight missed detections, and orange ovals mark false alarms. Figure 8 shows that PSG effectively locates small ship targets in vast environments, reducing missed detections. In offshore areas (the first and second rows), PSG overcomes background scattering noise to detect small targets resembling strong points. In nearshore regions (the third and fourth rows), where land-based false alarms are prevalent, PSG accurately identifies most targets, outperforming the other methods and addressing the issue of missed detections in large-scene SAR imagery. On the HRSID dataset, as shown in Figure 9, PSG achieves a superior performance in locating small targets, without missed detections, as seen in the first row. Combining the detection results from Figure 8 and Figure 9, it is clear that PSG excels in detecting small ship targets in large scenes, especially in densely distributed nearshore areas, outperforming the other methods and mitigating the missed detection issues in the existing approaches.

4.3. Further Analysis

In this section, a comprehensive evaluation of the proposed PSG method is conducted through ablation studies and a hyperparameter sensitivity analysis. Due to its extensive coverage and diversity, the LS-SSDD dataset is particularly well suited to large-scene ship detection, and all subsequent experiments are conducted on this dataset.

4.3.1. Ablation Studies

To analyze the proposed PSG method further, this paper performs comprehensive ablation experiments on its key components: DGPE and SCAA. Each module is removed individually to assess its contribution. Since DGPE consists of the CGA and CACA submodules, ablation experiments are also conducted on them separately. First, DGPE enables dual-guided learning for backbone feature extraction using saliency priors. CGA directs PSG to attend to salient regions via cross-attention, while CACA enforces consistency between the two attention maps in critical regions. Removing DGPE (i.e., “Ours w/o DGPE”) causes a 2.19% drop in precision and a 1.86% drop in the AP (Table 4). As shown in Figure 10b, PSG without DGPE struggles to concentrate on the target regions, leading to background noise and irrelevant textures. This confirms that DGPE effectively guides PSG to focus on small target regions. Second, the SCAA mechanism enhances classification learning for small and challenging proposals under saliency guidance. Removing SCAA (i.e., “Ours w/o SCAA”) reduces precision by 1.35% and AP by 1.95% (Table 4). Figure 10c shows that the absence of SCAA weakens the responses in key target regions, highlighting its role in improving the discrimination of difficult proposals. Moreover, removing CGA (i.e., “Ours w/o CGA”) leads to decreases of 1.00% in the precision, 0.38% in the recall, and 1.28% in the AP (Table 4). As visualized in Figure 10d, although PSG still attends to most target areas, it produces more noise, indicating that CGA is crucial for saliency-guided feature representation. Similarly, removing CACA (i.e., “Ours w/o CACA”) reduces the recall by 1.39% and the AP by 1.41% (Table 4). Figure 10e reveals that without CACA, PSG fails to adequately respond to many small targets, leading to a degraded localization accuracy. This highlights the importance of CACA in improving the learning of small ship regions.
In this section, the impact of DGPE across different feature levels is investigated to explore its optimal application scenarios. As shown in Table 5, the experiments utilize “Ours w/o DGPE” as the baseline, with C2, C3, and C4 representing feature layers extracted by ResNet-50. The findings are as follows: First, applying DGPE to any single layer consistently improves the AP, with the enhancement diminishing in deeper layers, indicating that shallower features contain richer semantic information. When augmented by DGPE, these layers offer greater discriminative power for aligning key regions. Second, the best performance is achieved by integrating DGPE progressively at the C2 and C3 stages, where these intermediate layers retain substantial semantic richness, providing structured guidance for the backbone network. Third, applying DGPE to multiple layers degrades the performance, suggesting that multi-level feature attention may lead to information redundancy, hindering the learning of critical patterns. Notably, the significant feature discrepancy between the C2 and C4 layers introduces conflicts when DGPE is applied, disrupting feature synergy and compromising the detection accuracy. Therefore, the application of DGPE at this level is avoided.

4.3.2. Sensitivity of the Hyperparameters

This section conducts a detailed investigation into the sensitivity of key hyperparameters in the proposed PSG method. The evaluation is focused on two categories of parameters: (1) the loss balancing weights λ 0 and λ 1 , which control the contributions of the saliency-guided auxiliary detection branch and the CACA module during model optimization, respectively, and (2) the spatial matching threshold τ q and the confidence difference threshold τ p in the SCAA mechanism, which are introduced to enhance the training of hard examples. Hyperparameter experiments are conducted on the LS-SSDD dataset using a controlled variable strategy, where one hyperparameter is varied while others are fixed to their empirically optimal values (see Section 4.1). The results are shown in Figure 11, which demonstrates the variation in the model’s metrics. For the auxiliary branch, the parameter λ 0 is set to 0.1, 0.3, 0.5, 0.7, and 1.0. The performance is observed to be optimal at λ 0 = 1.0 . This suggests that in the absence of parameter sharing between the saliency-guided auxiliary network and the primary detector, equal loss weighting is imperative to ensure effective training of the auxiliary network and the provision of valuable prior information. For the CACA module, λ 1 is evaluated at values of 0.1, 0.3, 0.5, 0.7, and 1.0, with the optimal performance occurring at λ 1 = 0.1 . This indicates that a stronger constraint on feature consistency may lead to overfitting, hindering the learning of other discriminative representations. For the spatial matching threshold τ q , which evaluates the alignment between proposals and saliency-based proposals, values of 0.1, 0.3, 0.5, 0.7, and 0.9 are tested. The results show that excessively low thresholds may lead to noisy proposals being considered as matches, while excessively high thresholds may exclude valuable hard proposals. The optimal AP is achieved when τ q = 0.7 , demonstrating that a moderate threshold effectively balances these effects. Lastly, for the confidence difference threshold τ p which measures the foreground classification discrepancy between the two branches, the model performs best when τ p = 0.3 . This is because lower thresholds facilitate the identification of spatially aligned but semantically inconsistent proposals. Conversely, higher thresholds may result in the challenging proposals being overlooked, thus diminishing the overall performance.

5. Discussion

This study demonstrates the effectiveness of the proposed progressive saliency-guided (PSG) framework in addressing the challenges of small ship detection in large-scene SAR images. It should be noted that the saliency maps in this framework provide reliable target location information, which significantly enhances the model’s ability to detect small targets. It is important to note that these saliency maps are generated from ground-truth annotations, which may require additional labeling effort when applied to new datasets. Furthermore, the generation of these saliency maps introduces additional computational complexity, which could impact the efficiency of the training process. While our method alleviates the challenges of small target detection in SAR images such as weak feature representation and missed detections, inherent limitations of SAR imaging such as coherent scattering and speckle noise persist. Future work will focus on exploring solutions for small target detection in large-scene SAR images further, specifically addressing the inherent characteristics of SAR imaging to enhance the detection accuracy and robustness.

6. Conclusions

In this paper, we propose a progressive saliency-guided (PSG) method for SAR ship detection in large scenes. The proposed PSG method addresses the challenges of detecting small ships in vast and complex backgrounds and the missed detections caused by a small target size. By introducing a dual-guided perception enhancement (DGPE) module and a saliency confidence aware assessment (SCAA) mechanism, the PSG method effectively enhances small target features and adaptively re-weights the training supervision for hard examples. It is particularly suited to wide-area maritime surveillance scenarios where ships are sparse and embedded into cluttered backgrounds. Extensive experimentation on the large-scene dataset LS-SSDD and the small object dataset HRSID demonstrates that the PSG method consistently outperforms the existing approaches in its detection performance while maintaining inference stability. Additional ablation and comparison experiments further confirm the effectiveness and robustness of the PSG method in large-scene SAR ship detection tasks. While the PSG method alleviates the challenges of small target detection in SAR images, inherent SAR imaging issues such as coherent scattering and speckle noise remain. Future work will focus on addressing these limitations and further exploring solutions to enhance the detection accuracy and robustness.

Author Contributions

Conceptualization: H.Z., D.L. and S.L.; formal analysis: H.Z. and D.L.; funding acquisition: D.L. and J.L.; investigation: S.L.; methodology: H.Z., D.L. and S.L.; project administration: D.L.; supervision: D.L. and S.L.; validation: H.Z., D.L., H.W., R.Y., J.L., S.L. and J.W.; writing—original draft: H.Z.; writing—review and editing: D.L., H.W., R.Y., J.L., S.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grants 62371079, 62201099, and 62473373; the Defense Industrial Technology Development Program under Grant JCKY2022110C171; the Key Laboratory of Cognitive Radio and Information Processing, the Ministry of Education under Grant CRKL220202; the Opening Project of the Guangxi Wireless Broadband Communication and Signal Processing Key Laboratory under Grant GXKL06200214 and Grant GXKL06200205; the Sichuan Science and Technology Program under Grant 2022SZYZF02; the General Project of Chongqing Natural Science Foundation under Grant cstb2022nscq-msx1125.

Data Availability Statement

No new data were created or analyzed in this study. The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Attema, E.; Davidson, M.; Snoeij, P.; Rommen, B.; Floury, N. Sentinel-1 mission overview. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; Volume 1, pp. 1–36. [Google Scholar]
  2. Asiyabi, R.M.; Ghorbanian, A.; Tameh, S.N.; Amani, M.; Jin, S.; Mohammadzadeh, A. Synthetic aperture radar (SAR) for ocean: A review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9106–9138. [Google Scholar] [CrossRef]
  3. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship detection in large-scale SAR images via spatial shuffle-group enhance attention. IEEE Trans. Geosci. Remote Sens. 2020, 59, 379–391. [Google Scholar] [CrossRef]
  4. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for SAR ship detection: Past, present and future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  5. Ciecholewski, M. Review of segmentation methods for coastline detection in sar images. Arch. Comput. Methods Eng. 2024, 31, 839–869. [Google Scholar] [CrossRef]
  6. Mittal, H.; Joshi, A. Automatic Ship Detection Using CFAR Algorithm for Quad-Pol UAV-SAR Imagery. In Proceedings of the International Conference on Unmanned Aerial System in Geomatics, Roorkee, India, 2–4 April 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 199–210. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 1, 91–99. [Google Scholar] [CrossRef]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
9. Tang, C.; Cui, Z.; Liu, N.; Cao, Z. D-ATR via deep neural network for large scene SAR images. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2326–2329.
10. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A general Gaussian heatmap label assignment for arbitrary-oriented object detection. IEEE Trans. Image Process. 2022, 31, 1895–1910.
11. Dai, H.; Gao, S.; Huang, H.; Mao, D.; Zhang, C.; Zhou, Y. An adaptive sample assignment network for tiny object detection. IEEE Trans. Multimed. 2023, 26, 2918–2931.
12. Frery, A.C.; Muller, H.J.; Yanasse, C.d.C.F.; Sant’Anna, S.J.S. A model for extremely heterogeneous clutter. IEEE Trans. Geosci. Remote Sens. 1997, 35, 648–659.
13. Novak, L.; Hesse, S. On the performance of order-statistics CFAR detectors. In Proceedings of the Conference Record of the Twenty-Fifth Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 4–6 November 1991; IEEE Computer Society: Pacific Grove, CA, USA, 1991; pp. 835–836.
14. Novak, L.M.; Owirka, G.J.; Netishen, C.M. Performance of a high-resolution polarimetric SAR automatic target recognition system. Linc. Lab. J. 1993, 6, 835–840.
15. Liu, Y.; Zhang, M.H.; Xu, P.; Guo, Z.W. SAR ship detection using sea-land segmentation-based convolutional neural network. In Proceedings of the 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), Shanghai, China, 18–21 May 2017; pp. 1–4.
16. Cui, J.; Jia, H.; Wang, H.; Xu, F. A fast threshold neural network for ship detection in large-scene SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6016–6032.
17. Jia, H.; Pu, X.; Liu, Q.; Wang, H.; Xu, F. A fast progressive ship detection method for very large full-scene SAR images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5206615.
18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
19. Zhang, C.; Yang, C.; Cheng, K.; Guan, N.; Dong, H.; Deng, B. MSIF: Multisize inference fusion-based false alarm elimination for ship detection in large-scale SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224811.
20. Shao, Z.; Zhang, X.; Xu, X.; Zeng, T.; Zhang, T.; Shi, J. CFAR-guided convolution neural network for large scale scene SAR ship detection. In Proceedings of the 2023 IEEE Radar Conference (RadarConf23), San Antonio, TX, USA, 1–5 May 2023; pp. 1–5.
21. Yan, G.; Chen, Z.; Wang, Y.; Cai, Y.; Shuai, S. LssDet: A lightweight deep learning detector for SAR ship detection in high-resolution SAR images. Remote Sens. 2022, 14, 5148.
22. Gao, S.; Liu, J.; Miao, Y.; He, Z. A high-effective implementation of ship detector for SAR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4019005.
23. Sun, Y.; Sun, X.; Wang, Z.; Fu, K. Oriented ship detection based on strong scattering points network in large-scale SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5218018.
24. Fu, K.; Fu, J.; Wang, Z.; Sun, X. Scattering-keypoint-guided network for oriented ship detection in high-resolution and large-scale SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11162–11178.
25. Sun, Y.; Wang, Z.; Sun, X.; Fu, K. SPAN: Strong scattering point aware network for ship detection and classification in large-scale SAR imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1188–1204.
26. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1257–1265.
27. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296.
28. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3791–3798.
29. Yu, J.; Zhou, G.; Zhou, S.; Qin, M. A fast and lightweight detection network for multi-scale SAR ship detection under complex backgrounds. Remote Sens. 2021, 14, 31.
30. Xu, X.; Zhang, X.; Zhang, T. Lite-YOLOv5: A lightweight deep learning detector for on-board ship detection in large-scene Sentinel-1 SAR images. Remote Sens. 2022, 14, 1018.
31. Zhou, Y.; Zhang, F.; Ma, F.; Xiang, D.; Zhang, F. Small vessel detection based on adaptive dual-polarimetric feature fusion and sea–land segmentation in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2519–2534.
32. Zhu, M.; Hu, G.; Li, S.; Zhou, H.; Wang, S. FSFADet: Arbitrary-oriented ship detection for SAR images based on feature separation and feature alignment. Neural Process. Lett. 2022, 54, 1995–2005.
33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
34. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
35. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8018505.
36. Li, H.; Zhang, R.; Pan, Y.; Ren, J.; Shen, F. LR-FPN: Enhancing remote sensing object detection with location refined feature pyramid network. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8.
37. Xu, C.; Wang, J.; Yang, W.; Yu, L. Dot distance for tiny object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1192–1201.
38. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389.
39. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 526–543.
40. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419.
41. Gao, S.; Chen, Y.; Cui, N.; Qin, W. Enhancing object detection in low-resolution images via frequency domain learning. Array 2024, 22, 100342.
42. Yu, C.; Liu, Y.; Wu, S.; Xia, X.; Hu, Z.; Lan, D.; Liu, X. Pay attention to local contrast learning networks for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3512705.
43. Bai, J.; Ren, J.; Yang, Y.; Xiao, Z.; Yu, W.; Havyarimana, V.; Jiao, L. Object detection in large-scale remote-sensing images based on time-frequency analysis and feature optimization. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5405316.
44. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496.
45. Yuan, X.; Cheng, G.; Yan, K.; Zeng, Q.; Han, J. Small object detection via coarse-to-fine proposal generation and imitation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6317–6327.
46. Zhao, Z.; Du, J.; Li, C.; Fang, X.; Xiao, Y.; Tang, J. Dense tiny object detection: A scene context guided approach and a unified benchmark. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606913.
47. Liu, D.; Zhang, J.; Qi, Y.; Wu, Y.; Zhang, Y. Tiny object detection in remote sensing images based on object reconstruction and multiple receptive field adaptive feature enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616213.
48. Yu, Y.; Zhang, K.; Wang, X.; Wang, N.; Gao, X. An adaptive region proposal network with progressive attention propagation for tiny person detection from UAV images. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4392–4406.
49. Wang, F.; Su, Y.; Wang, R.; Sun, J.; Sun, F.; Li, H. Cross-modal and cross-level attention interaction network for salient object detection. IEEE Trans. Artif. Intell. 2023, 5, 2907–2920.
50. Zhang, T.; Zhang, X.; Ke, X.; Zhan, X.; Shi, J.; Wei, S.; Pan, D.; Li, J.; Su, H.; Zhou, Y.; et al. LS-SSDD-v1.0: A deep learning dataset dedicated to small ship detection from large-scale Sentinel-1 SAR images. Remote Sens. 2020, 12, 2997.
51. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254.
52. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vision 2010, 88, 303–338.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
54. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
55. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
56. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
57. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE Computer Society: Montreal, QC, Canada, 2021; pp. 3490–3499.
Figure 1. Challenges in large-scene SAR ship detection. The large-scene SAR image shows the detection results, where yellow boxes indicate predicted detections, green boxes denote ground truth annotations, and red boxes highlight missed targets. From the corresponding local patch and its ground truth, it can be observed that ships occupy extremely small spatial extents and are easily overlooked by the model, resulting in severe miss rates.
Figure 2. An illustration of the overall architecture of the proposed PSG method. SAR images and Gaussian saliency maps are fed into two independent detectors. Firstly, the DGPE module computes the primary branch attention weights ρ_i^c and the saliency branch weights ρ_i^h, which enable cross-branch guidance. Next, the Region Proposal Network (RPN) produces primary proposals rois and saliency-guided proposals rois_hm. The SCAA mechanism adaptively reweights the classification loss, using spatial priors from saliency maps and confidence discrepancies to prioritize well-located but low-confidence candidates. Through the combined operations of DGPE and SCAA, the network progressively refines the feature representation and improves the localization of small targets.
Figure 3. An illustration of saliency map generation from the ground truths. (a) The ground truth bounding boxes y_target on the original SAR image. (b) The grid-based saliency map response centered at the target location. (c) An adaptive Gaussian saliency map G generated based on the target's position and size.
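To make the construction in Figure 3 concrete, the snippet below is a minimal NumPy sketch of an adaptive Gaussian saliency map whose spread follows each ground-truth box's size. The function name `gaussian_saliency_map` and the `sigma_scale` factor are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaussian_saliency_map(image_shape, boxes, sigma_scale=0.5):
    """Render an adaptive Gaussian response for each ground-truth box.

    image_shape : (H, W) of the SAR image patch.
    boxes       : iterable of (x1, y1, x2, y2) ground-truth boxes in pixels.
    sigma_scale : scales the Gaussian spread relative to the box size
                  (a hypothetical knob, not a value from the paper).
    """
    h, w = image_shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    saliency = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # box centre
        sx = max((x2 - x1) * sigma_scale, 1.0)          # spread follows box width
        sy = max((y2 - y1) * sigma_scale, 1.0)          # spread follows box height
        g = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) +
                     ((ys - cy) ** 2) / (2 * sy ** 2)))
        saliency = np.maximum(saliency, g)              # keep the strongest response per pixel
    return saliency

# Example: two small ships in a 512 x 512 patch
heatmap = gaussian_saliency_map((512, 512), [(100, 120, 112, 130), (300, 310, 309, 322)])
```

Taking the element-wise maximum rather than the sum keeps overlapping ship responses bounded by 1.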
Figure 4. An illustration of the dual-guided perception enhancement (DGPE) module. First, DGPE computes the spatial attention weights, ρ^c and ρ^h, for the SAR primary branch and the saliency auxiliary branch, respectively, to quantify their spatial response intensities. Next, ρ^h modulates the primary features by providing cross-branch spatial guidance, resulting in cross-enhanced features F_c that increase the responsiveness of the primary detector to potential small targets. Additionally, a difference consistency loss L_MSE is computed between the attention-modulated feature maps to ensure coherence in salient regions, facilitating accurate feature focusing in ship areas.
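The cross-guidance described in Figure 4 can be sketched in a few lines of PyTorch. The layer choices here (1×1 convolutions with sigmoid gating and a residual modulation) are assumptions made only to illustrate how ρ^c, ρ^h, F_c, and L_MSE interact; the paper's DGPE module may use different internal layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualGuidedPerception(nn.Module):
    """Sketch of cross-branch spatial guidance; layer details are assumptions."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convs turn each branch's feature map into a single-channel spatial weight map
        self.attn_primary = nn.Conv2d(channels, 1, kernel_size=1)
        self.attn_saliency = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat_primary, feat_saliency):
        rho_c = torch.sigmoid(self.attn_primary(feat_primary))    # primary-branch weights
        rho_h = torch.sigmoid(self.attn_saliency(feat_saliency))  # saliency-branch weights
        # Cross-guidance: saliency-branch weights re-weight the primary features
        # (residual form, so regions outside the highlighted areas are preserved)
        feat_cross = feat_primary * (1.0 + rho_h)
        # Consistency loss keeps the two attention maps aligned on salient regions
        loss_mse = F.mse_loss(rho_c, rho_h)
        return feat_cross, loss_mse

# Usage with two C3-level feature maps of shape (B, 256, H, W)
dgpe = DualGuidedPerception(256)
f_c, l_mse = dgpe(torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64))
```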
Figure 5. An illustration of how the anchor-based detector determines positive and negative proposals. The green box represents the ground truth (GT), and the yellow box represents the anchor.
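For readers less familiar with this step, the sketch below shows the standard IoU-thresholding rule used by anchor-based detectors to split anchors into positives and negatives. The 0.7/0.3 thresholds are generic Faster R-CNN-style defaults, not values taken from this paper.

```python
import torch
from torchvision.ops import box_iou

def assign_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Label anchors as positive (1), negative (0), or ignored (-1) by IoU.

    anchors, gts : (N, 4) and (M, 4) tensors in (x1, y1, x2, y2) format.
    """
    iou = box_iou(anchors, gts)          # (N, M) pairwise IoU
    max_iou, _ = iou.max(dim=1)          # best-matching GT per anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)  # -1: ignored in training
    labels[max_iou < neg_thr] = 0        # clear background
    labels[max_iou >= pos_thr] = 1       # confident foreground
    return labels
```

Small ships tend to produce few anchors above the positive threshold, which is exactly the weakness the saliency-guided proposals are meant to compensate for.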
Figure 6. An illustration of the saliency confidence aware assessment (SCAA) module. The strategy evaluates each proposal’s spatial alignment using the maximum IoU with positive proposals from the saliency branch and its semantic reliability based on the confidence discrepancy between the two branches. These indicators are combined to create a dynamic weight that modulates the classification loss, enhancing supervision of proposals that are spatially aligned but semantically ambiguous.
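A minimal sketch of this reweighting idea follows. The combination `1 + alpha * alignment * discrepancy` is an assumed form chosen only to show how spatial alignment and confidence discrepancy could jointly scale the classification loss; it is not the exact formula used by SCAA.

```python
import torch
from torchvision.ops import box_iou

def scaa_weights(rois, rois_hm, conf_primary, conf_saliency, alpha=1.0):
    """Sketch of saliency confidence aware weighting (the exact formula is assumed).

    rois          : (N, 4) primary-branch proposals.
    rois_hm       : (K, 4) positive proposals from the saliency branch.
    conf_primary  : (N,) classification scores from the primary branch.
    conf_saliency : (N,) scores for the same proposals under saliency guidance.
    """
    align = box_iou(rois, rois_hm).max(dim=1).values             # spatial alignment prior
    discrepancy = (conf_saliency - conf_primary).clamp(min=0.0)  # saliency-confident, primary-uncertain
    # Up-weight proposals that sit on salient locations but score low in the primary branch
    return 1.0 + alpha * align * discrepancy

# The weights then scale a per-proposal classification loss, e.g.
# loss = (scaa_weights(...) * F.cross_entropy(logits, labels, reduction='none')).mean()
```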
Figure 7. Sample images from the two experimental datasets. (a) LS-SSDD. (b) HRSID.
Figure 8. The detection results of different methods on the LS-SSDD dataset. (a) Ground truth. (b) Baseline. (c) RetinaNet. (d) YOLOx. (e) ATSS. (f) AutoAssign. (g) TOOD. (h) NWD. (i) RFLA. (j) PSG (Ours). Green boxes represent the ground truth locations of the SAR ship targets, yellow boxes denote predicted targets, blue boxes indicate missed detections, and orange ovals highlight false alarms.
Figure 9. The detection results of different methods on the HRSID dataset. (a) Ground truth. (b) Baseline. (c) RetinaNet. (d) YOLOx. (e) ATSS. (f) AutoAssign. (g) TOOD. (h) NWD. (i) RFLA. (j) PSG (Ours). Green boxes represent the ground truth locations of the SAR ship targets, yellow boxes denote predicted targets, blue boxes indicate missed detections, and orange ovals highlight false alarms.
Figure 10. Visualization of attention maps for different methods on the LS-SSDD dataset. (a) Ground truth. (b) Ours w/o DGPE. (c) Ours w/o SCAA. (d) Ours w/o CGA. (e) Ours w/o CACA. (f) Ours (full model). These attention maps reflect each model's regions of interest: the brighter a region, the more strongly it contributes to the model's classification decision.
Figure 11. Effect of the hyperparameters λ_0, λ_1, τ_q, and τ_p on model performance.
Table 1. Information on the LS-SSDD and HRSID datasets.
Dataset | LS-SSDD | HRSID
Satellite | Sentinel-1 | Sentinel-1, TerraSAR-X
Polarization | VV, VH | HH, HV, VV
Scenes | inshore and offshore | inshore and offshore
Swath (km) | 250 | 80
Image Size | 24,000 × 16,000 | 800 × 800
Image Number | 9000 | 5604
Ship Number | 3350 | 16,951
Ship Pixel Proportion | 0.0001% | 0.2800%
Table 2. Performance comparison on the LS-SSDD dataset.
Method | Precision (%) | Recall (%) | F1-Score (%) | AP (%) | T (s) | FPS (img/s)
Baseline [7] | 92.61 | 77.50 | 84.38 | 69.00 | 28.57 | 21.00
RetinaNet [54] | 89.78 | 50.98 | 65.03 | 66.50 | 23.44 | 25.60
YOLOx [55] | 96.83 | 40.99 | 57.60 | 59.80 | 9.33 | 64.30
ATSS [56] | 95.12 | 51.66 | 66.96 | 73.10 | 23.53 | 25.50
AutoAssign [44] | 59.76 | 72.86 | 65.67 | 58.80 | 23.26 | 25.80
TOOD [57] | 84.20 | 67.88 | 75.16 | 72.80 | 34.48 | 17.40
NWD [38] | 60.76 | 69.39 | 64.79 | 68.80 | 29.27 | 20.50
RFLA [39] | 76.52 | 71.49 | 73.92 | 71.50 | 37.74 | 15.90
PSG (Ours) | 95.60 | 80.28 | 87.27 | 73.38 | 27.91 | 21.50
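As a quick consistency check on Table 2 (no new results), each reported F1-score follows directly from the corresponding precision and recall; for the PSG row:

```latex
\mathrm{F1} = \frac{2PR}{P + R}
            = \frac{2 \times 95.60 \times 80.28}{95.60 + 80.28}
            \approx 87.27\%
```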
Table 3. Performance comparison on the HRSID dataset.
Method | Precision (%) | Recall (%) | F1-Score (%) | AP (%) | FPS (img/s)
Baseline [7] | 99.35 | 82.34 | 90.05 | 78.81 | 18.20
RetinaNet [54] | 58.10 | 68.41 | 62.83 | 68.70 | 24.50
YOLOx [55] | 94.16 | 67.91 | 78.91 | 78.80 | 60.50
ATSS [56] | 96.06 | 47.78 | 63.82 | 61.20 | 24.30
AutoAssign [44] | 75.57 | 85.96 | 80.43 | 86.60 | 28.00
TOOD [57] | 78.81 | 84.45 | 81.53 | 85.60 | 18.80
NWD [38] | 71.26 | 82.43 | 76.44 | 83.00 | 16.40
RFLA [39] | 72.82 | 77.92 | 75.28 | 82.23 | 13.20
PSG (Ours) | 99.47 | 87.08 | 92.86 | 83.16 | 18.10
Table 4. Ablation studies on the effectiveness of different components on the LS-SSDD dataset.
Method | SCAA | CGA | CACA | Precision (%) | Recall (%) | AP (%)
Baseline | | | | 92.61 | 77.50 | 69.00
Ours w/o DGPE | ✓ | | | 93.41 | 80.53 | 71.52
Ours w/o SCAA | | ✓ | ✓ | 94.25 | 80.11 | 71.43
Ours w/o CGA | ✓ | | ✓ | 94.60 | 79.90 | 72.10
Ours w/o CACA | ✓ | ✓ | | 95.58 | 78.89 | 71.97
Ours (Full Model) | ✓ | ✓ | ✓ | 95.60 | 80.28 | 73.38
Table 5. Impact of applying DGPE at different feature levels on the LS-SSDD dataset. The best AP is obtained with the C2 & C3 configuration.
Method | Precision (%) | Recall (%) | AP (%)
Baseline | 93.41 | 80.53 | 71.52
C2 | 93.18 | 81.54 | 72.86
C3 | 95.43 | 80.11 | 72.03
C4 | 94.58 | 79.69 | 71.84
C2 & C3 | 95.60 | 80.28 | 73.38
C2 & C4 | 93.47 | 78.81 | 70.56
C3 & C4 | 94.03 | 78.68 | 71.30
C2 & C3 & C4 | 90.08 | 80.19 | 70.02
