Article

MFT-Reasoning RCNN: A Novel Multi-Stage Feature Transfer Based Reasoning RCNN for Synthetic Aperture Radar (SAR) Ship Detection

Institute of Intelligent Computing, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(7), 1170; https://doi.org/10.3390/rs17071170
Submission received: 9 February 2025 / Revised: 18 March 2025 / Accepted: 24 March 2025 / Published: 26 March 2025

Abstract

Conventional ship detection in synthetic aperture radar (SAR) images is typically limited to the fully focused spatial features of the ship target itself. In this paper, we propose a multi-stage feature transfer (MFT)-based reasoning RCNN (MFT-Reasoning RCNN) to detect ships in SAR images. The algorithm detects SAR ship targets by combining the MFT strategy with an adaptive global reasoning module over all object regions, exploiting the diverse knowledge shared between a ship and its surrounding elements. Specifically, we first calculate the probability of the simultaneous occurrence of environmental and target elements. Then, taking the environmental and target elements as entities, we construct the relationships between them using an adjacency matrix. Finally, we propose the MFT strategy and use filter-based feature enhancement in the backbone layer to better extract the target features of SAR images and to transfer knowledge between datasets. The method has been tested on more than 10,000 images, and the experimental results demonstrate that it can effectively detect ships of different scales in SAR images.

1. Introduction

Ship detection is a key technology in automatic synthetic aperture radar (SAR) recognition, and it has a significant impact on the effectiveness of surveillance and reconnaissance systems [1,2]. Among the various remote sensing platforms, spaceborne SAR sensors have become the dominant platform for ship detection due to their advantages, such as all-weather capability, multi-polarization, strong penetration, and wide coverage area [3]. Due to their unique imaging characteristics, SAR images present special challenges for ship detection that differ from those faced by traditional optical remote sensing methods [4,5]. Today, SAR ship target detection algorithms can be categorized into three types: feature-extraction-based methods, convolutional neural network (CNN)-based methods, and transformer-based methods.
In the field of feature extraction-based ship-detection methods, traditional SAR image target detection relies on manually designed feature extraction algorithms. Among them, the Constant False Alarm Rate (CFAR) technique has been widely used in SAR image detection due to its simple model structure and adaptive thresholding [6]. Huang et al. [7] proposed a multi-scale heterogeneity-based ship-detection scheme under a reverse decision framework. This approach determines the detection threshold based on a pre-established clutter statistical model and relies on information from around the target cell. However, this model is highly dependent on manually designed features and is vulnerable to image noise, which often leads to suboptimal detection results. Traditional SAR ship-detection algorithms based on manual feature extraction generally suffer from issues such as low detection efficiency, weak generalization ability, and long processing times.
In recent years, with the release of numerous SAR ship-detection datasets, deep-learning-based SAR ship detection has become a major research direction [8]. Compared to optical images, the spatial features of the target in SAR images are blurred due to their reflective polarization features. Additionally, SAR images are heavily influenced by noise. These factors result in lower image quality, which in turn affects the accuracy of SAR ship detection. Therefore, improving SAR ship-detection performance remains a challenge [9,10]. The emergence of neural network algorithms has led to significant breakthroughs in fields such as target detection.
In the field of Convolutional Neural Network (CNN)-based ship-detection methods, with the widespread application of deep learning algorithms in target detection, researchers have developed numerous ship-detection algorithms for SAR images in recent years. Wang et al. [11] addressed the issue of insufficient ground truth for training in SAR target detection by applying two strategies: data augmentation and transfer learning. They proposed an enhanced small-sample multi-target detector for SAR targets. Yang et al. [12] combined the advantages of RetinaNet and rotatable bounding boxes, proposing an improved one-stage detector to solve problems such as the imbalance between positive and negative samples in SAR ship training data. Tang et al. [13] used YOLOv7 as the backbone network and integrated three modules—DCNets, BiFormer, and Wise-IoU—proposing the DBW-YOLO algorithm to improve SAR ship-detection rates. Zhou et al. [14] addressed the issues of low real-time performance and detection accuracy in SAR ship-detection models by proposing a lightweight SAR ship-detection network based on transformation and feature enhancement. Zhou et al. [15] proposed a sidelobe-aware small ship-detection network specifically designed for detecting small ships in SAR images. Guo et al. [16] proposed a Mask Efficient Adaptive Network (MEA-Net) for SAR ship detection to address the imbalance in the number of samples across different ship sizes in publicly available SAR datasets. The model combines morphological processing and ship label data to solve the imbalance between inshore and offshore ship samples, significantly improving the model’s ability to detect inshore ships. Similarly, Tang et al. [17] proposed a Pyramid Pooling Attention Network (PPA-Net) for multi-scale ship detection in SAR images, designing a Pyramid Pooling Attention Module (PPAM) to enhance the model’s detection capability. However, the CNN-based SAR ship-detection models mentioned above are limited by their receptive fields, which results in varying detection performance across ships of different sizes and types in large-scale scenarios. They also overlook the potential knowledge features that may exist between different targets and between the target and the environment.
In the field of Transformer-based ship-detection algorithms, with the widespread application of Transformer-based algorithms in machine vision, Transformer-based SAR target detection algorithms have also become a research hotspot. Compared to CNN-based detection models, Transformer-based models excel at learning long-term feature dependencies, making them particularly suitable for tasks that require a comprehensive understanding of global features. Specifically, Dong et al. [18] applied the Vision Transformer (ViT) framework to SAR image classification tasks, treating images as patches and not relying on CNNs for encoding. In ViT, images are reduced to one-dimensional vectors, with learnable classification patches and position patches inserted. In 2020, Carion et al. [19] proposed the Detection Transformer (DETR) based on a set-based global loss, enforcing unique predictions through a global loss function and bipartite matching. This approach simplified the detection process by eliminating anchor generation and non-maximum suppression (NMS) steps. In 2022, Xia et al. [20] designed the CRTransSar architecture, which combines global context awareness from Transformers with local feature extraction from CNNs. In the same year, Li et al. [21] proposed ESTDNet, which uses the FESwin module for feature extraction and the AFF module for feature fusion; ablation studies confirmed the effectiveness of these two modules. In 2024, Ahmed et al. [22] designed a feature-enhanced DETR model for SAR images. They improved feature extraction and clutter filtering through preprocessing techniques such as filtering, denoising, and varying the pooling operation between max pooling and median pooling. Additionally, they used metrics like the Line Stretch Function (LSF) and Peak Signal-to-Noise Ratio (PSNR) to predict the optimal pooling operation, improving edge clarity and image fidelity. However, Transformer-based SAR ship-detection methods face challenges such as high training costs, slow convergence speed, and poor detection performance for small targets.
Although the aforementioned methods have achieved good performance, most of the research optimizes detection models from the perspective of target features or target structure to improve the detection capability for SAR ship targets. In contrast, when the human brain identifies targets, it not only considers the features of the target itself but also takes into account the surrounding environment, assessing the target in context to comprehensively determine its type. Inspired by this discrimination process, this paper proposes a global adaptive SAR ship target reasoning detection method based on the MFT of target features. The method first employs the MFT module to learn the features of SAR images in a stepwise manner. Through this multi-stage learning approach, images of ship targets at different scales are learned gradually, enabling the proposed detection model to detect ships at different scales more accurately. Specifically, the features extracted by the MFT filters are robust and information-rich, providing enhanced information derived from the original images. By using these filter-extracted features as auxiliary information together with the original image, the feature representation of blurred SAR images can be strengthened. Next, the features obtained from the first layer of the bbox head are used to build a global semantic pool, and an adjacency matrix is used to establish a mapping for global reasoning. This ensures that knowledge is transferred through the pooling network. The computed enhanced features are then combined with the original features, and the final enhanced result is the output.
The main contributions of this paper are as follows:
  • Knowledge level: This paper constructs a knowledge graph for ship targets. Through the analysis of a large amount of data, it not only builds the knowledge between targets and scenes but also establishes the relationships between different targets.
  • Feature level: This paper proposes a multi-stage feature transfer (MFT) strategy to train target images of different scales and enhance the generalization ability of the target-detection model. In this module, the original features of the targets are extracted using a small number of target samples at each layer. These features are then transferred to the next level for training. Through the combination of features at different scales, the targets at different scales can be simultaneously represented.
  • Model level: This paper proposes a SAR ship target detection method that utilizes global adaptive reasoning. By building a knowledge graph that links targets to one another and to scene elements, the algorithm achieves global reasoning for ship targets, thus improving the target detection rate.
The rest of the article is organized as follows. In Section 2, we provide a brief review of the knowledge graph. In Section 3, the proposed MFT-Reasoning RCNN method based on knowledge reasoning is introduced. The ablation experiment and comparison experiment of our method are analyzed in detail in Section 4. In Section 5, the conclusions of the proposed method are presented briefly.

2. Background

Knowledge Graph

When a target is blurry, the human brain often infers its identity by analyzing the types of nearby clear targets and leveraging the implicit relationships between them. These implicit relationships can be modeled as a graph structure, where nodes represent different categories, and edges (or their attributes) represent the relationships between these categories. This type of structure is referred to as a knowledge graph.
In deep learning, visual reasoning focuses on combining interactions between various pieces of information or specific objects. Examples can be seen in tasks like target detection [23] and visual relationship detection [24]. These approaches often consider factors such as the relationships between objects or shared attributes [25,26]. Some methods also explore similarity-based approaches [27], such as attributes found in language space. More recently, researchers have turned to graph structures to integrate knowledge [23,24]. However, these approaches typically limit reasoning to local regions, which can struggle when images contain severe occlusion or blur, resulting in poor feature-based reasoning. To address this, our method extends reasoning across all categories, implementing an adaptive global reasoning module.
For instance, Qian et al. [28] proposed a knowledge graph based on relationships and attributes for airplane components. Given the unique nature of SAR images compared to optical images, our knowledge graph focuses on target-to-target relationships, such as those between bridges and ships or ports and ships. In a broader context, a relationship-based knowledge graph (KG) can be formalized as $G^R$, which captures relationships between targets (e.g., land, drive, on, and near):
$$G^R = \langle N, \varepsilon \rangle,$$
where $N$ represents the category nodes, and each edge $e_{ij} \in \varepsilon$ signifies a form of knowledge shared between two categories. We can enhance the target detection features by utilizing a relational knowledge graph. In SAR images, this relationship can be abstracted as the probability of different targets appearing within the same scene.
Specifically, we begin by initializing a statistical matrix $R^C$ of size $C \times C$, where the counts reflect the probability of targets co-occurring in the same scene. Next, we construct an undirected graph $G^R$, merging the transpose of $R^C$, denoted $(R^C)^T$, with the original matrix $R^C$. Finally, normalization is applied to produce a learnable representation of the knowledge graph $G$, where $e_{ij}^R$ forms a symmetric matrix of size $C \times C$:
$$e_{ij}^R = \frac{R_{ij}^C}{\sqrt{S_{ii}\, S_{jj}}},$$
where $S_{ii}$ represents the sum of the $i$-th row of $R^C$, $S_{jj}$ represents the sum of the $j$-th column of $R^C$, and $R_{ij}^C$ denotes the value at the $i$-th row and $j$-th column of $R^C$:
$$S_{ii} = \sum_{j=1}^{C} R_{ij}^C$$
$$S_{jj} = \sum_{i=1}^{C} R_{ij}^C$$
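To make this construction concrete, the following is a minimal sketch, assuming a list of per-image category index sets is available and that the symmetric square-root normalization above is used; the counting rule and all names are illustrative rather than the released implementation.

```python
import numpy as np

def build_relation_graph(per_image_labels, num_classes):
    """Sketch: co-occurrence counting followed by symmetric normalization."""
    R = np.zeros((num_classes, num_classes), dtype=np.float64)
    for labels in per_image_labels:            # labels: category ids present in one image
        present = sorted(set(labels))
        for i in present:                      # count pairwise co-occurrence within the scene
            for j in present:
                R[i, j] += 1.0
    R = R + R.T                                # merge R^C with its transpose -> undirected graph
    S_row = R.sum(axis=1)                      # S_ii: row sums
    S_col = R.sum(axis=0)                      # S_jj: column sums
    denom = np.sqrt(np.outer(S_row, S_col))
    E = np.divide(R, denom, out=np.zeros_like(R), where=denom > 0)
    return E                                   # symmetric C x C matrix of e_ij^R values

# toy usage: three images containing {ship}, {ship, harbor}, {ship, bridge}
edges = build_relation_graph([[0], [0, 2], [0, 1]], num_classes=3)
```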

3. Proposed Method

This section introduces the proposed method, focusing on two modules: multi-stage feature transfer (MFT) and knowledge graph reasoning.
The proposed method is illustrated in Figure 1. We designed the MFT, which trains the model in stages using three different datasets. The scale of the three datasets progressively increases, and the satellite sources of the data also differ. By using the MFT module, the model learns the features of SAR images step by step, gradually improving its capabilities. The entire method can be divided into two main steps. The first step involves using filters in the backbone layer to better extract features from the SAR images. The feature maps are then enhanced through an FPN. The output feature maps have clearer features, facilitating subsequent training. The second step introduces a knowledge graph in the RoI head classification stage to improve classification scores and detection accuracy. The knowledge graph is input into the network in the form of an adjacency matrix, which is an $(n+1) \times (n+1)$ matrix with a value range of [0, 1], derived through probabilistic statistics. Here, $n$ represents the total number of target categories, and the first row and first column of the adjacency matrix are set to zero to simplify model computation.
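As a small illustration of how such an adjacency matrix might be prepared before being fed to the network (the random values are placeholders for the actual co-occurrence statistics; index 0 stands for the background entry):

```python
import numpy as np

n = 5                                   # number of target categories
adj = np.random.rand(n + 1, n + 1)      # placeholder for co-occurrence probabilities in [0, 1]
adj[0, :] = 0.0                         # zero the first row (background) ...
adj[:, 0] = 0.0                         # ... and first column to simplify model computation
```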

3.1. Feature Enhancement via Knowledge Reasoning

3.1.1. Global Semantic Pool

Most methods [23] typically propagate visual features locally between regions. However, when significant occlusion or class ambiguity occurs in the image, these methods may fail in reasoning as the feature representations become poor or dispersed, a common issue in large-scale detection tasks. In contrast, our approach aims to propagate global information across all categories, not just those present in the image. To achieve this, we create a global semantic pool that stores high-level semantic representations for all categories, which is conceptually similar to memory in the human brain, where one recalls the appearance of a specific category.
In existing works, generating a global semantic pool often involves averaging the features of each category or using clustering techniques to find centroids as reference features [29]. However, these methods require gathering information from all the data, which imposes a significant computational burden. Other works attempt to train classifiers to fit the weights for unseen or unfamiliar categories [30,31,32,33].
Our method introduces a novel approach to generating the global semantic pool. The classifier weights for each category inherently contain high-level semantic information because they capture feature activations learned from all images. Formally, let $M \in \mathbb{R}^{C \times D}$ represent the weights of the classifiers for all $C$ categories. Our model's global semantic pool is derived by copying the parameters $M$ from the previous classification layer into the bbox head of the detection network. During each training iteration, the classifiers are updated, ensuring that the global semantic pool $M$ becomes more accurate over time.
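A minimal sketch of this weight-copying step, assuming a standard linear classification layer in the first bbox head; the layer and variable names are illustrative:

```python
import torch
import torch.nn as nn

C, D = 6, 1024                       # number of categories and feature dimension (illustrative)
first_bbox_cls = nn.Linear(D, C)     # stand-in for the classifier of the first bbox head

# Global semantic pool M in R^{C x D}: a detached copy of the current classifier weights,
# refreshed every training iteration so that M tracks the classifier as it improves.
with torch.no_grad():
    M = first_bbox_cls.weight.detach().clone()   # shape (C, D)
```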

3.1.2. Feature Enhancement

After creating a global semantic pool $M$ for all categories $C$, we propagate the relationships with $M$ through a knowledge graph, represented by the adjacency matrix's edges $\varepsilon \in \mathbb{R}^{C \times C}$. This enables the sharing and propagation of information across all $C$ categories based on the knowledge, represented as $\varepsilon M$. To enhance the region features, we need to establish a mapping between the region proposals $N_r$ and the categories $C$. Intuitively, this mapping can be derived from the classification results of the previous stage in the detection network. We introduce a soft mapping approach: instead of a direct hard mapping from region proposals to categories, we use the classification probability distribution $P \in \mathbb{R}^{N_r \times C}$ over all $C$ categories. Compared to hard mapping, the soft mapping approach delivers better results. The matrix $P$ is calculated by applying the softmax function to the scores of each of the $C$ categories from the previous classifier. This process can be expressed as $P \varepsilon M W^G$, where $W^G \in \mathbb{R}^{D \times E}$ is the shared transformation weight matrix across all graphs and $E$ represents the output dimension of the reasoning module. Since global graph reasoning is based on all categories, it can introduce noise. Therefore, an adaptive reasoning mechanism is necessary to combine the specific visual patterns of each image. This is why we incorporate an attention-based adaptive reasoning mechanism.

3.2. Adaptive Attention Module

As shown in Figure 2, when considering the global features $\varepsilon M$, it is essential to highlight the information and relevant categories while suppressing those that are less useful, in order to achieve adaptive reasoning for each image. We observe that not all category information is beneficial for identifying objects within a specific image. Humans, when recognizing items in a scene, tend to focus only on a few potentially relevant categories. In this paper, we leverage the squeeze-and-excitation technique to further rescale the categories being considered [34]. In the squeeze step, we input the entire image feature $F \in \mathbb{R}^{W \times H \times D}$ into a CNN (with a kernel size of $3 \times 3$ and an output channel of $D/64$), followed by a global pooling operation that reduces the feature map size by half. The excitation step involves a fully connected layer with input $Z^s \in \mathbb{R}^{D/64}$. We then apply the softmax function to obtain the attention weights for each category:
$$\alpha = \mathrm{softmax}(Z^s W^s M^T),$$
where $W^s \in \mathbb{R}^{D/64 \times D}$ is the weight matrix of the fully connected layer, and $\alpha \in \mathbb{R}^{C}$. The adaptive reasoning enhanced feature $f'$ can be computed as
$$f' = P\,(\alpha \otimes \varepsilon M)\, W^G,$$
where $\otimes$ is the channel-wise product and the rest is matrix multiplication. $f' \in \mathbb{R}^{N_r \times E}$ is the enhanced feature of dimension $E$ obtained through adaptive global graph reasoning.
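Putting the two equations together, a hedged sketch of the adaptive reasoning forward pass could look as follows; the squeeze convolution, the global pooling choice, and all module names are assumptions made for illustration and do not reproduce the exact released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGlobalReasoning(nn.Module):
    """Sketch: category attention over the semantic pool, graph propagation, soft mapping to regions."""
    def __init__(self, feat_dim, out_dim):
        super().__init__()
        self.squeeze = nn.Conv2d(feat_dim, feat_dim // 64, kernel_size=3, padding=1)
        self.excite = nn.Linear(feat_dim // 64, feat_dim)        # plays the role of W^s
        self.W_G = nn.Linear(feat_dim, out_dim, bias=False)       # shared transform W^G

    def forward(self, image_feat, region_scores, M, edges):
        # image_feat: (1, D, H, W); region_scores: (N_r, C) raw scores from the previous classifier
        # M: (C, D) global semantic pool; edges: (C, C) knowledge graph adjacency (epsilon)
        z = F.adaptive_avg_pool2d(self.squeeze(image_feat), 1).flatten(1)   # squeeze -> (1, D/64)
        alpha = torch.softmax(self.excite(z) @ M.t(), dim=-1)               # (1, C) category attention
        P = torch.softmax(region_scores, dim=-1)                            # soft mapping regions -> categories
        G = alpha.t() * (edges @ M)                                         # alpha (x) (epsilon M), shape (C, D)
        return self.W_G(P @ G)                                              # f' in R^{N_r x E}

# toy shapes only, to show the tensor flow
module = AdaptiveGlobalReasoning(feat_dim=1024, out_dim=256)
f_prime = module(torch.randn(1, 1024, 32, 32), torch.randn(10, 6),
                 torch.randn(6, 1024), torch.rand(6, 6))
```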

3.3. Adaptive Global Reasoning Module

As shown in Figure 3, global reasoning is performed on the global semantic pool $M$ based on the prior knowledge graph edges $\varepsilon$. Image-wise adaptive attention $\alpha$ is calculated through a squeeze-and-excitation operation on the image base features, emphasizing the relevant categories. The adaptive global reasoning with $\alpha$ is obtained by the channel-wise product. A soft mapping from categories to regions is applied based on the probability matrix $P$. The final enhanced feature $f'$ is then derived by matrix multiplication with a fully connected weight matrix $W^G$.
Finally, the enhanced feature $f'$ is concatenated with the original region features $f$, forming the combined feature $[f; f']$, which is then fed into the bounding box regression layer and classification layer to produce the final detection results. Note that $f'$ contains distilled information across categories connected by edges, such as similar attributes or relationships. This approach helps address issues such as occlusion, class ambiguity, and small object detection by incorporating adaptive contextual information from the global semantic pool guided by external knowledge. In the field of object detection, these three types of issues are highly prevalent. Occlusion occurs when the information about an object in an image is incomplete. Class ambiguity arises when the abstract features of multiple categories exhibit a certain degree of similarity after feature extraction, making it difficult for the model to accurately distinguish between them, leading to incorrect detection results. Small objects in an image are often insufficiently clear, resulting in the model extracting blurry abstract features.
In the real world, an object not only carries its own information but also embodies implicit relationships with other objects. For example, the likelihood of a cat appearing on a table is significantly higher than that of a bear appearing on a table. By leveraging a knowledge graph, the abstract features of an object can be further refined based on their correlations with surrounding entities. For instance, if only half of a boat is visible in an image (i.e., the object is occluded), and another clear boat is present in the same image, the abstract features of the occluded boat can be enhanced through the knowledge graph, enabling the model to detect it more effectively.

3.4. Multi-Stage Feature Transfer Module

We designed the MFT module to allow the model to progressively learn features from SAR images, thereby enhancing its generalization ability. Although many ship-detection datasets, such as ASDD, are available, for SAR ship detection we trained the model using three different publicly available SAR datasets: SSDD [35], HRSID [36], and SARDet-100k [37]. These datasets vary in scale, image quantity, number of targets, and scene complexity, as shown in Figure 4.
To ensure knowledge transfer during training and to reduce the differences between targets of different scales, we employed filters to extract image features and subsequently fused these features with the original images. The filter-enhanced features of the original data can be defined as
$$M_i^x = T_i(x), \quad i \in \{\mathrm{HOG}, \mathrm{Canny}, \mathrm{Haar}, \mathrm{WST}, \mathrm{GRE}\},$$
where $T_i$ is a predefined transformation, similar to the information residual design in ResNet [38]. The histogram of oriented gradients (HOG) captures the shape characteristics of objects by statistically analyzing the distribution of gradient orientations within localized regions of an image. This method is particularly effective in helping models identify and delineate the contours of targets in synthetic aperture radar (SAR) images. The Canny edge detector (Canny) extracts prominent edges through a multi-step process, including Gaussian filtering and gradient computation, effectively reducing noise interference and enhancing edge clarity. Haar-like features (Haar) describe the brightness differences in local regions of an image through predefined rectangular templates, and they can extract the linear structures and local contrast features in SAR images. The wavelet scattering transform (WST), through the combination of multi-scale wavelet transforms and scattering paths, can enhance the representation ability for small and occluded objects. The gradient by ratio edge (GRE) calculates the edge response through the gradient ratio, which can suppress speckle noise and improve the clarity of target boundaries. HOG, Canny, Haar, WST, and GRE are five different manually designed feature extraction methods. In the MFT module, we first use GRE to suppress speckle noise and then use WST to enhance the features. The original SAR image $x$ is concatenated with the filter-enhanced features $M_i^x$, constructing the model's filter-enhanced input:
$$\mathrm{Inp} = \mathrm{concat}(x, M_i^x)$$
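As a rough illustration of this filter-enhanced input, the sketch below stacks two readily available filter responses (Canny and HOG from scikit-image) onto a single-channel SAR chip; it substitutes these for the GRE and WST filters actually used in the MFT module, so the preprocessing choices are assumptions, not the exact pipeline.

```python
import numpy as np
from skimage.feature import canny, hog

def filter_enhanced_input(x):
    """x: 2-D SAR amplitude image in [0, 1]. Returns a (3, H, W) filter-enhanced input."""
    edge = canny(x, sigma=2.0).astype(np.float32)                # Canny edge map
    _, hog_img = hog(x, pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                     visualize=True)                             # HOG response rendered as an image
    hog_img = hog_img.astype(np.float32)
    if hog_img.max() > 0:
        hog_img /= hog_img.max()                                 # normalize to [0, 1]
    return np.stack([x.astype(np.float32), edge, hog_img])       # concat(x, M_i^x) along the channel axis

inp = filter_enhanced_input(np.random.rand(256, 256))            # toy chip; inp.shape == (3, 256, 256)
```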
After the filter-based enhancement, the differences between the images are significantly reduced. Following the MFT process, knowledge is iteratively transferred during training, and our model achieves improved generalization ability. The multi-stage training process can be expressed by the following formulas:
$$A_1 = \mathrm{Train}(A_0)(D_{\mathrm{SSDD}})$$
$$A_2 = \mathrm{Train}(A_1)(D_{\mathrm{HRSID}})$$
$$A_{\mathrm{final}} = \mathrm{Train}(A_2)(D_{\mathrm{SARDet\text{-}100k}}),$$
where Equation (9) indicates that the initialized model $A_0$ is trained on the SSDD dataset ($D_{\mathrm{SSDD}}$) using our method to obtain $A_1$. Equation (10) indicates that the model $A_1$ is trained on the HRSID dataset ($D_{\mathrm{HRSID}}$) using our method to obtain the model $A_2$. Similarly, Equation (11) indicates that the model $A_2$ is trained on the SARDet-100k dataset ($D_{\mathrm{SARDet\text{-}100k}}$) using our method to obtain the final model $A_{\mathrm{final}}$.
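A schematic of this staged hand-off, in which the same model object is fine-tuned dataset by dataset so that the weights from one stage initialize the next; the train_one_stage helper, the toy model, and the stand-in loss are illustrative assumptions rather than the actual mmdetection training scripts.

```python
import torch
import torch.nn as nn

def train_one_stage(model, loader, epochs=1, lr=1e-4):
    """One MFT stage: fine-tune the incoming model on one dataset and return it."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    loss_fn = nn.MSELoss()                      # stand-in loss; the detector uses the losses of Section 3.5
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# A_0 -> A_1 -> A_2 -> A_final: weights flow from stage to stage (SSDD -> HRSID -> SARDet-100k)
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 4))             # toy stand-in for the detector
toy_loader = [(torch.randn(2, 4, 4), torch.randn(2, 4)) for _ in range(3)]
stage_loaders = {"SSDD": toy_loader, "HRSID": toy_loader, "SARDet-100k": toy_loader}
for name, loader in stage_loaders.items():      # the same model object carries knowledge across stages
    model = train_one_stage(model, loader)
```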

3.5. Loss Function

For the classification stage, the loss function $loss_{cls}$ is calculated using CrossEntropyLoss, which helps reduce the discrepancy between the predicted and actual values in the target domain. The formula for the classification loss is
$$loss_{cls} = -\sum_{x} p(x) \log q(x),$$
where $p(x)$ represents the expected output, and $q(x)$ is the actual output of the model, which is computed by using the softmax function. For the detection stage, the bounding box regression loss $loss_{bbox}$ is calculated by using SmoothL1Loss. Compared to L1-Loss, the SmoothL1Loss function has a smoother curve, which helps reduce sensitivity to outliers. The formula for SmoothL1Loss is as follows:
$$loss_{bbox} = \frac{1}{n}\sum_{i=1}^{n}\begin{cases} 0.5\,(y_i - f(x_i))^2, & \text{if } |y_i - f(x_i)| < 1 \\ |y_i - f(x_i)| - 0.5, & \text{otherwise} \end{cases}$$
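In PyTorch terms, the two losses correspond directly to built-in criteria; a minimal sketch with illustrative tensor shapes:

```python
import torch
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()     # softmax + negative log-likelihood, as in the classification loss
bbox_criterion = nn.SmoothL1Loss()        # quadratic near zero, linear for large errors, as in the bbox loss

logits = torch.randn(8, 6)                # 8 RoIs, 6 classes (raw scores before softmax)
labels = torch.randint(0, 6, (8,))        # ground-truth class indices
pred_deltas = torch.randn(8, 4)           # predicted box regression targets
gt_deltas = torch.randn(8, 4)             # ground-truth regression targets

total_loss = cls_criterion(logits, labels) + bbox_criterion(pred_deltas, gt_deltas)
```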

4. Experimental Details

4.1. Experimental Settings

All experiments were performed on a consistent computer setup. The hardware configuration included a Tesla T4 GPU (manufactured by NVIDIA) with 16 GB VRAM. The software environment featured Python 3.8.0, CUDA 12.2, PyTorch 2.1.1+cu121, mmdet 3.3.0, mmcv 2.1.0, and mmengine 0.10.5. The system ran on Ubuntu 24.02. For training, we did not use any pre-training files. To ensure the fairness and generalizability of the experiments, all relevant parameters were kept identical in the config files, except for the batch_size, which was adjusted according to the number of GPUs. These settings included using the AdamW optimizer, 12 epochs, the same dataset, an optimizer weight decay of 0.05, and gradient clipping, among others.

4.2. Datasets

To thoroughly validate the effectiveness and superiority of our proposed method, we conducted experiments using the SSDD, HRSID, and SARDet-100k datasets. Additionally, we tested the generalization capability of our model.
SAR Ship-Detection Dataset (SSDD), released in 2017, is a publicly available SAR image dataset designed for object detection. It contains 1160 images with 2456 ship annotations, averaging 2.12 ships per image. Annotations in SSDD are manually labeled for ships with pixel values greater than 3. The images in this dataset are sourced from the Sentinel-1, RadarSat-2, and TerraSAR-X satellites.
The High-Resolution SAR Image Dataset (HRSID) is based on data collected from Sentinel-1B, TerraSAR-X, and TanDEM-X. It includes 5604 images with a total of 16,969 ship targets. Among these, inshore scenes account for 18.4 % , while offshore scenes make up 81.6 % . Additionally, HRSID provides 400 pure-background SAR images, which are useful for testing the robustness of trained models.
As shown in Table 1, SARDet-100k is the first SAR remote sensing dataset of COCO-scale magnitude. It aggregates images from 10 publicly available SAR datasets through preprocessing and slicing operations, resulting in a large-scale dataset. SARDet-100k consists of over 116,000 images and more than 245,000 targets. The images in the SARDet-100k dataset have five different polarizations, with the angles of incidence ranging from 10 to 60 degrees. These images are sourced from seven different satellites and Airborne SAR synthetic slices. Most of the images are sourced from the Gaofen-3 satellite and Sentinel-1 sensors, with spatial resolutions ranging from 0.1 m to 25 m. The dataset includes both satellite and airborne data, offering diverse and challenging scenarios for testing.
These datasets allow for a comprehensive evaluation of the proposed method’s performance across varying resolutions, sources, and scene complexities, demonstrating its potential in real-world SAR image detection tasks.

4.3. Evaluation Metrics

In evaluating the effectiveness of our proposed method, we introduce the key performance metrics: Average Precision (AP), Recall, and F1 score. When calculating the Average Precision, four components are involved: True Positives (TPs), False Positives (FPs), True Negatives (TNs), and False Negatives (FNs).
TPs (True Positives): The number of positive samples correctly predicted by the model. FPs (False Positives): The number of negative samples incorrectly classified as positive by the model. TNs (True Negatives): The number of negative samples correctly predicted as negative (background). FNs (False Negatives): The number of positive samples that are incorrectly predicted as negative by the model. In the context of SAR image analysis, due to the unique features of these images, the focus of SAR target detection tasks is more on accurately detecting targets. The Average Precision ($AP$), which indicates the proportion of correctly predicted targets, is computed from the Precision ($P$) and Recall ($R$), defined as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
The calculation of Average Precision ($AP$) involves the relationship between Precision ($P$) and Recall ($R$):
$$AP = \int_{0}^{1} P(R)\, dR$$
Based on different Intersection over Union (IoU) thresholds, AP can be further subdivided into IoU = 0.5 ($AP_{50}$) and IoU = 0.75 ($AP_{75}$). These two indicators will be used to verify the detection accuracy of our model.
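A compact numerical illustration of these metrics, computing P, R, and an area-under-the-curve AP from detections sorted by confidence; the toy detections and the non-interpolated integration are assumptions for illustration only.

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """scores: detection confidences; is_tp: 1 if the detection matches a GT box at the chosen IoU."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    precision = tp / np.maximum(tp + fp, 1e-12)          # P = TP / (TP + FP)
    recall = tp / max(num_gt, 1)                         # R = TP / (TP + FN)
    delta_r = recall - np.concatenate(([0.0], recall[:-1]))
    ap = float(np.sum(delta_r * precision))              # approximates the integral of P(R) dR
    return precision, recall, ap

# toy example: 5 detections evaluated against 4 ground-truth ships at IoU >= 0.5
P, R, ap = precision_recall_ap([0.9, 0.8, 0.7, 0.6, 0.3], [1, 1, 0, 1, 1], num_gt=4)
```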

4.4. Ablation Experiment

In this section, we present the ablation experiments, conducted on nine images chosen from three different datasets. The experimental results are shown in Figure 5, Figure 6 and Figure 7. In each figure, (a)–(c) are the original images, (d)–(f) are the ground truth, (g)–(i) are the detection results of Two-stage RCNN, (j)–(l) are the detection results of RCNN + Reasoning, and (m)–(o) are the detection results of RCNN + Reasoning + Multi-stage Feature Transfer (our method).

4.4.1. Global Reasoning Module

In the RoI_head layer, we use the first bbox_head to obtain the initial classification features. Then, in the second bbox_head, a knowledge graph is added, and the mapping relationship is established using the global semantic pool obtained from the first layer. We trained two networks: one without a knowledge graph and one with a knowledge graph. The only difference between the two deep networks is the inclusion of a knowledge graph; all other configurations, training datasets, testing datasets, etc., are the same. The training was conducted for six epochs. In the examples, the (g)–(i) images show the detection results of the Two-stage RCNN model, and the (j)–(l) images show the detection results of the same image processed by the Reasoning RCNN model.
Figure 5 comes from the SSDD dataset. In Figure 5j, the Reasoning RCNN model generates fewer detection boxes, which is due to the increase in the feature vector dimension when enhancing features with a knowledge graph (e.g., from 1024 to 1280). The increase in feature vector dimensions leads to different results when generating prediction boxes. In Figure 5h,k, both models produce false alarms, which is due to insufficient training epochs, causing the model to not yet learn enough knowledge. In Figure 5i,l, the Reasoning RCNN model generally achieved higher prediction scores than the Two-stage RCNN model. This is because a large number of ships appear in the same image, and under the influence of a knowledge graph, their scores are aggregated, thereby boosting the scores of most of the targets. In the right part of the image, there is one fewer false alarm target for Reasoning RCNN than in Two-stage RCNN.
Figure 6 shows the prediction results from both models on the HRSID dataset. In Figure 6j, inside the green box, the result of Reasoning RCNN is clearer than that of Two-stage RCNN. This is attributed to the increase in the feature dimension of Reasoning RCNN, which enables it to obtain clearer features of the clustered targets. Similarly, in Figure 6k, the score of the ship inside the green box is higher than the score from the Two-stage RCNN model, as the knowledge graph helps improve detection. In Figure 6l, the existence of a knowledge graph enables the model to reduce false alarm targets. The HRSID dataset contains more complex and larger scenes compared to the SSDD dataset. Simply put, the ships in the HRSID dataset are smaller within the scene.
Figure 7 shows the prediction results from both models on the SARDet-100k dataset. In Figure 7j, the Reasoning RCNN model detects one more false target compared to the Two-stage RCNN model. This is due to the increased dimensions of the enhanced features, which can affect predictions, especially when the number of training epochs is limited. With fewer training iterations, the model has not learned sufficient knowledge, making it more susceptible to the influence of increased feature dimensions. In Figure 7k, due to the large number of ships in the same scene, the average score increase brought by the knowledge graph reaches 20. Such scenarios are common in real-world applications, demonstrating that the introduction of a knowledge graph is meaningful for SAR image detection. In Figure 7l, both models fail to produce satisfactory results due to the overly blurred and complex nature of the image. However, the Reasoning RCNN model's results contain fewer unexpected detections than those of the Two-stage RCNN model. In Figure 7i, the Two-stage RCNN model produces unexpected results (aircraft) in the upper region of the image, which is clearly an error.
Table 2 shows the detection performance of the two models on different targets within the SARDet-100k dataset. For targets sourced from satellites, the two models exhibit varying strengths and weaknesses, due to the limited number of training epochs and the strong randomness this introduces. For targets such as cars sourced from Airborne SAR data, the Reasoning RCNN model shows significant advantages. This could be attributed to the clearer imaging quality of the synthetic slices produced by Airborne SAR.

4.4.2. Multi-Stage Feature Transfer Module

Given the characteristics of SAR images, the MFT module can better extract target features. Unlike ResNet in the Faster-RCNN network, the MFT module is specifically designed for SAR images. The following test compares two models: the first is the Reasoning RCNN model, and the second is the MFT-Reasoning RCNN model. Both are trained for six epochs. The images in Figure 5j–l, Figure 6j–l and Figure 7j–l show the detection results of the Reasoning RCNN model, and the images in (m)–(o) show the detection results of the same images using the MFT-Reasoning RCNN model.
Figure 5 shows the detection results of Reasoning RCNN and MFT-Reasoning RCNN on the SSDD dataset. In Figure 5m, MFT-Reasoning RCNN not only achieves higher scores compared to the Reasoning RCNN model but also eliminates one false positive target. This demonstrates that the method of enhancing raw features with filters in the MFT algorithm is effective. Similarly, in Figure 5n, the results of MFT-Reasoning RCNN are more convergent or reliable. The detection results of MFT-Reasoning RCNN are almost exactly the same as the labeled results. In Figure 5o, we can observe similar results: the predictions of MFT-Reasoning RCNN are more credible, as it detects one more correct target than Reasoning RCNN.
Figure 6 shows the prediction results of the two models on the HRSID dataset. In Figure 6m, although MFT-Reasoning RCNN still fails to detect some targets, the existence of the MFT module significantly boosts the confidence scores of this model for familiar targets. In Figure 6j, Reasoning RCNN scores 42.9, while in Figure 6m, MFT-Reasoning RCNN scores 100. In Figure 6n, the results of the MFT-Reasoning network are nearly identical to the ground truth. This demonstrates that the MFT module plays a significant role. Knowledge was transferred from the SSDD dataset to the HRSID dataset through this module, and the feature enhancement by filters also contributed to the improvement. In Figure 6o, since the score already reached 100, the results of both models are almost identical.
Figure 7 shows the prediction results of the two models on the SARDet-100k dataset. In Figure 7m, the miss detections caused by feature expansion in Reasoning RCNN are eliminated by MFT-Reasoning RCNN. This demonstrates the effectiveness of the MFT module we designed. In Figure 7n, the influence of the MFT module is evident as the average score decreases by 2.5. However, in the lower part of the image, one false positive target is eliminated. In Figure 7o, MFT-Reasoning RCNN achieves better results than Reasoning RCNN, with more targets detected. We believe that knowledge from the SSDD and HRSID datasets was transferred to the training of SARDet-100k through the MFT module. This approach is, to some extent, similar to humans’ learning process.
Table 3 shows the detection results of the two models on different targets in the SARDet-100k dataset. The inclusion of the MFT module improved the detection performance for ship targets. In terms of Average Precision, the inclusion of this module also brought improvements. The module enhanced the detection of targets like ships, but there was a slight decline in the detection of car targets. Our analysis suggests that during the initial stages of learning, MFT-Reasoning RCNN performs better for elongated targets, while car targets are more compact. Since the two models were trained for only six epochs, we believe that the MFT-Reasoning RCNN’s performance will improve with further training.
Table 4 shows the precision of three different models on three datasets. As can be seen, with the introduction of the knowledge graph and the MFT module, the model’s performance gradually improves.
The addition of the knowledge graph mainly increases the confidence scores and reduces false detections, as in Figure 5j,l, Figure 6k,l and Figure 7k,l. The addition of the MFT mainly further reduces false detections and also reduces missed detections, as in Figure 5m,o, Figure 6m,n and Figure 7m,n. After the knowledge graph is added, the change in the dimension of the feature vectors may cause a slight increase in false detections, as in Figure 7j. To reduce this kind of error, we further introduced the MFT module, thereby weakening the negative impact brought by the knowledge graph, as in Figure 7m. In summary, knowledge graph reasoning lets the model learn the relationships between targets from implicit human knowledge, increasing the confidence scores and lowering the false detection rate, although it may predict incorrect results in some scenarios; the MFT module was designed to weaken these negative impacts and to further reduce both the false detection rate and the missed detection rate.

4.5. Comparison with Others’ Methods

In the MFT-Reasoning RCNN we designed, we conducted multi-stage training using the SSDD, HRSID, and SARDet-100k datasets. Upon completion of the training, we evaluated the model. We first performed comparative experiments on the SSDD dataset, comparing our designed detector with various existing detectors. These methods include Faster-RCNN [31], Reasoning-RCNN, FCOS [39], YOLOX [40], DETR [19], ESTDNet [21], ELLK-Net [41] and LHSDNet [42].
Table 5 shows the results tested on the SSDD dataset, where all models were trained for 100 epochs. After introducing the knowledge graph and the MFT module, our model improved the Average Precision ($AP$) by 5.1 over ELLK-Net and 3.3 over LHSDNet. Regarding $AP_{50}$ (%), the differences between models were relatively small. Still, our model significantly outperformed the others in $AP$, indicating that as the Intersection over Union (IoU) threshold increased, the gap between our model's $AP$ and those of the other models widened. This is due to the impact of feature enhancement through filtering in the MFT module. On the SSDD dataset, because there is only one class, the role of the knowledge graph is greatly diminished.
Table 6 shows the results tested on the HRSID dataset. Our model was trained for 48 epochs, while the other models were trained for 100 epochs. Due to the difference in training epochs, our model did not achieve the same significant advantage as it did on the SSDD dataset. Nevertheless, because the MFT module transferred knowledge from the SSDD dataset and utilized feature enhancement through filtering, our model still achieved better results. Compared with ELLK-Net and LHSDNet, our method can achieve better results with fewer training epochs.
Table 7 shows the results tested on the SARDet-100k dataset. Due to computational resource limitations, each model was trained for only 12 epochs. The SARDet-100k dataset contains more targets, and since our training was limited, the results were not as strong as those on the SSDD and HRSID datasets. However, unlike the SSDD and HRSID datasets, the SARDet-100k dataset includes more categories, allowing our knowledge graph to begin to have an impact. As shown in the ablation experiment, the model's performance on the SARDet-100k dataset improved significantly. Although DETR has broad potential in object detection, it still cannot compare with convolutional neural networks (CNNs) in the short term, which is why DETR performs the worst on these datasets. Once the knowledge graph started to take effect, our model began to significantly outperform the other models, demonstrating the effectiveness of our proposed approach. Compared with ELLK-Net and LHSDNet, our model achieved better results in terms of Precision, AP, and F1 score on the SARDet-100k dataset. Specifically, MFT-Reasoning RCNN has 61.5 M parameters and requires 236 GFLOPs of computation.

5. Conclusions

This paper proposed a target detection method for SAR images based on Reasoning RCNN and MFT, named MFT-Reasoning RCNN. Compared to traditional CNN-based target detection algorithms, we introduce a knowledge graph to infer the categories of targets by utilizing the associative relationships between targets and their surrounding elements, which improves the detection rate and confidence scores of the algorithm for SAR ship targets. To enhance the generalization ability of the model, that is, to accurately identify targets of different scales, we also introduce an MFT module to transfer and fuse the features of ship targets of different scales, further improving the detection ability of the model. We use publicly available SAR ship target datasets, namely the SSDD, HRSID, and SARDet-100k datasets, for training and testing. The experimental results show that the SAR ship target detection model proposed in this paper can accurately identify targets of different scales and improve the confidence scores of the detected targets. In addition, the model can reduce the false alarm rate of target detection. Compared with other algorithms, the target detection rate is improved by approximately 2% in AP.

Author Contributions

Conceptualization, S.Z.; Methodology, M.Z.; Formal analysis, X.Z.; Investigation, M.Z. and Y.Y.; Resources, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Sichuan Central-Guided Local Science and Technology Development under grant 2023ZYD0165.

Data Availability Statement

Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, Z.; Yao, X.; Dumitru, C.O.; Datcu, M.; Han, J. Physically explainable cnn for sar image classification. ISPRS J. Photogramm. Remote Sens. 2021, 190, 25–37. [Google Scholar]
  2. Zhao, S.; Luo, Y.; Zhang, T.; Guo, W.; Zhang, Z. A feature decomposition-based method for automatic ship detection crossing different satellite sar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5234015. [Google Scholar] [CrossRef]
  3. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K. A tutorial on synthetic aperture radar. Geosci. Remote Sens. Mag. IEEE 2013, 1, 6–43. [Google Scholar] [CrossRef]
  4. Xue, R.; Bai, X.; Zhou, F. Spatial–temporal ensemble convolution for sequence sar target classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1250–1262. [Google Scholar] [CrossRef]
  5. He, Q.; Sun, X.; Diao, W.; Yan, Z.; Yin, D.; Fu, K. Retracted: Transformer-induced graph reasoning for multimodal semantic segmentation in remote sensing. ISPRS J. Photogramm. Remote Sens. 2022, 193, 90–103. [Google Scholar]
  6. Cui, J.; Jia, H.; Wang, H.; Xu, F. A fast threshold neural network for ship detection in large-scene sar images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6016–6032. [Google Scholar] [CrossRef]
  7. Huang, X.; Yang, W.; Zhang, H.; Xia, G.-S. Automatic ship detection in sar images using multi-scale heterogeneities and an a contrario decision. Remote Sens. 2015, 7, 7695–7711. [Google Scholar] [CrossRef]
  8. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for sar ship detection: Past, present and future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  9. Liu, T.; Yang, Z.; Gao, G.; Marino, A.; Chen, S.-W. Simultaneous diagonalization of hermitian matrices and its application in polsar ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5220818. [Google Scholar]
  10. Yang, Z.; Fang, L.; Shen, B.; Liu, T. Polsar ship detection based on azimuth sublook polarimetric covariance matrix. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8506–8518. [Google Scholar]
  11. Wang, Z.; Du, L.; Mao, J.; Liu, B.; Yang, D. Sar target detection based on ssd with data augmentation and transfer learning. IEEE Geosci. Remote Sens. Lett. 2019, 16, 150–154. [Google Scholar]
  12. Yang, R.; Pan, Z.; Jia, X.; Zhang, L.; Deng, Y. A novel cnn-based detector for ship detection based on rotatable bounding box in sar images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1938–1958. [Google Scholar]
  13. Tang, X.; Zhang, J.; Xia, Y.; Xiao, H. Dbw-yolo: A high-precision sar ship detection method for complex environments. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7029–7039. [Google Scholar]
  14. Zhou, S.; Zhang, M.; Wu, L.; Yu, D.; Li, J.; Fan, F.; Zhang, L.; Liu, Y. Lightweight sar ship detection network based on transformer and feature enhancement. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4845–4858. [Google Scholar]
  15. Zhou, Y.; Liu, H.; Ma, F.; Pan, Z.; Zhang, F. A sidelobe-aware small ship detection network for synthetic aperture radar imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5205516. [Google Scholar] [CrossRef]
  16. Guo, Y.; Zhou, L. Mea-net: A lightweight sar ship detection model for imbalanced datasets. Remote Sens. 2022, 14, 4438. [Google Scholar] [CrossRef]
  17. Tang, G.; Zhao, H.; Claramunt, C.; Zhu, W.; Wang, S.; Wang, Y.; Ding, Y. Ppa-net: Pyramid pooling attention network for multi-scale ship detection in sar images. Remote Sens. 2023, 15, 2855. [Google Scholar] [CrossRef]
  18. Dong, H.; Zhang, L.; Zou, B. Exploring vision transformers for polarimetric sar image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5219715. [Google Scholar]
  19. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  20. Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. Crtranssar: A visual transformer based on contextual joint representation learning for sar ship detection. Remote Sens. 2022, 14, 1488. [Google Scholar] [CrossRef]
  21. Li, K.; Zhang, M.; Xu, M.; Tang, R.; Wang, L.; Wang, H. Ship detection in sar images based on feature enhancement swin transformer and adjacent feature fusion. Remote Sens. 2022, 14, 3186. [Google Scholar] [CrossRef]
  22. Ahmed, M.; El-Sheimy, N.; Leung, H. A novel detection transformer framework for ship detection in synthetic aperture radar imagery using advanced feature fusion and polarimetric techniques. Remote Sens. 2024, 16, 3877. [Google Scholar] [CrossRef]
  23. Chen, X.; Li, L.-J.; Li, F.-F.; Gupta, A. Iterative visual reasoning beyond convolutions. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7239–7248. [Google Scholar]
  24. Dai, B.; Zhang, Y.; Lin, D. Detecting visual relationships with deep relational networks. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3298–3308. [Google Scholar]
  25. Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the 2009 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 951–958. [Google Scholar]
  26. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
  27. Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning deep representations of fine-grained visual descriptions. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 49–58. [Google Scholar]
  28. Qian, Y.; Pu, X.; Jia, H.; Wang, H.; Xu, F. Arnet: Prior knowledge reasoning network for aircraft detection in remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205214. [Google Scholar]
  29. Lee, K.-H.; He, X.; Zhang, L.; Yang, L. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5447–5456. [Google Scholar]
  30. Wang, X.; Ye, Y.; Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6857–6866. [Google Scholar]
  31. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [PubMed]
  32. Gidaris, S.; Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4367–4375. [Google Scholar]
  33. Gong, C.; He, D.; Tan, X.; Qin, T.; Wang, L.; Liu, T.Y. Frage: Frequency-agnostic word representation. In Proceedings of the 2018 Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 1341–1352. [Google Scholar]
  34. Hu, R.; Dollár, P.; He, K.; Darrell, T.; Girshick, R. Learning to segment every thing. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4233–4241. [Google Scholar]
  35. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. In Proceedings of the Conference on SAR in Big Data Era-Models, Methods and Applications (BIGSARDATA), Beijing, China, 13–14 November 2017; pp. 1–6. [Google Scholar]
  36. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar]
  37. Li, Y.; Li, X.; Li, W.; Hou, Q.; Liu, L.; Cheng, M.M.; Yang, J. SARDet-100k: Towards open-source benchmark and toolkit for large-scale sar object detection. arXiv 2024, arXiv:2403.06534. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  40. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  41. Shen, J.; Bai, L.; Zhang, Y.; Chowdhuray Momi, M.; Quan, S.; Ye, Z. Ellk-net: An efficient lightweight large kernel network for sar ship detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5221514. [Google Scholar]
  42. Dai, D.; Wu, H.; Wang, Y.; Ji, P. LHSDNet: A Lightweight and High-Accuracy SAR Ship Object Detection Algorithm. Remote Sens. 2024, 16, 4527. [Google Scholar] [CrossRef]
Figure 1. The total framework of our model.
Figure 2. Global reasoning via knowledge graph.
Figure 3. Detailed flowchart of the adaptive global reasoning.
Figure 4. Multi-stage feature transfer module.
Figure 5. Experiment on the SSDD dataset. ((ac) are the original images, (df) are the ground truth, (gi) are the detection results of Two-stage RCNN, (jl) are the detection results of RCNN + Reasoning, and (mo) are the detection results of RCNN + Reasoning + Multi-stage Feature Transfer).
Figure 6. Experiment on the HRSID dataset. ((ac) are the original images, (df) are the ground truth, (gi) are the detection results of Two-stage RCNN, (jl) are the detection results of RCNN + Reasoning, and (mo) are the detection results of RCNN + Reasoning + Multi-stage Feature Transfer).
Figure 7. Experiment on the SARDet-100k dataset. ((ac) are the original images, (df) are the ground truth, (gi) are the detection results of Two-stage RCNN, (jl) are the detection results of RCNN + Reasoning, and (mo) are the detection results of RCNN + Reasoning + Multi-stage Feature Transfer).
Table 1. The SARDet-100k dataset's source datasets.

| Datasets | Targets | Res. (m) | Band | Polarization | Satellites |
|---|---|---|---|---|---|
| AIR_SARShip | S | 1 m, 3 m | C | VV | GF-3 |
| HRSID | S | 0.5–3 m | C/X | HH, HV, VH, VV | S-1B, TerraSAR-X, TanDEM-X |
| MSAR | A, T, B, S | <1 m | C | HH, HV, VH, VV | HISEA-1 |
| SADD | A | 0.5–3 m | X | HH | TerraSAR-X |
| SAR-AIRcraft | A | 1 m | C | Uni-polar | GF-3 |
| ShipDataset | S | 3–25 m | C | HH, HV, VH, VV | S-1, GF-3 |
| SSDD | S | 1–15 m | C/X | HH, HV, VH, VV | S-1, TerraSAR-X, RadarSat-2 |
| OGSOD | B, H, T | 3 m | C | VV, VH | GF-3 |
| SIVED | C | 0.1, 0.3 m | Ka, Ku, X | VV, VH | Airborne SAR synthetic slice |
Table 2. Detection capabilities of two models for different targets (Two-stage RCNN and Reasoning RCNN).

| Model | Ship | Aircraft | Car | Bridge | Harbor |
|---|---|---|---|---|---|
| Two-stage RCNN | 79% | 73% | 69% | 34% | 49% |
| Reasoning RCNN | 81% | 71% | 88% | 36% | 42% |
Table 3. Detection capability of two models for different targets (Reasoning RCNN and MFT-Reasoning RCNN).

| Model | Ship | Aircraft | Car | Bridge | Harbor |
|---|---|---|---|---|---|
| Reasoning RCNN | 81% | 71% | 88% | 36% | 42% |
| MFT-Reasoning RCNN | 83% | 70% | 79% | 39% | 52% |
Table 4. Comparison of 3 models on 3 datasets.

| Dataset | Model | AP (%) | AP50 (%) | AP75 (%) |
|---|---|---|---|---|
| SSDD | Two-stage RCNN | 63.6 | 87.1 | 72.1 |
| SSDD | Reasoning RCNN | 63.8 | 87.2 | 72.3 |
| SSDD | MFT-Reasoning RCNN | 64.5 | 89.5 | 75.6 |
| HRSID | Two-stage RCNN | 60.9 | 82.7 | 68.9 |
| HRSID | Reasoning RCNN | 61.5 | 83.7 | 69.3 |
| HRSID | MFT-Reasoning RCNN | 61.9 | 84.4 | 70.2 |
| SARDet-100k | Two-stage RCNN | 30.8 | 54.4 | 25.3 |
| SARDet-100k | Reasoning RCNN | 31.6 | 56.4 | 25.4 |
| SARDet-100k | MFT-Reasoning RCNN | 32.1 | 56.8 | 28.6 |
Table 5. Comparison of different methods for the SSDD dataset.

| Method | Precision (%) | Recall (%) | AP (%) | AP50 (%) | AP75 (%) | F1 (%) |
|---|---|---|---|---|---|---|
| Faster-RCNN [31] | 93.6 | 81.8 | 62.1 | 93.4 | 72.1 | 87.5 |
| Reasoning-RCNN | 94.3 | 76.1 | 66.9 | 93.5 | 78.6 | 84.3 |
| FCOS [39] | 93.0 | **82.4** | 60.1 | 93.6 | 68.9 | 87.4 |
| YOLOX [40] | 95.7 | 78.2 | 53.1 | 92.2 | 54.7 | 86.1 |
| DETR [19] | 90.4 | 81.9 | 55.7 | 92.2 | 61.6 | 86.1 |
| ESTDNet [21] | 89.8 | 80.7 | 59.4 | 93.8 | 69.1 | 85.1 |
| ELLK-Net [41] | 96.3 | 79.3 | 63.9 | 94.6 | 74.6 | 86.7 |
| LHSDNet [42] | 95.9 | 82.3 | 65.7 | 94.8 | 76.5 | **88.3** |
| Our method | **96.7** | 81.2 | **69.0** | **94.9** | **82.5** | **88.3** |
Note: The bolded number represents the best result of each index.
Table 6. Comparison of different methods for the HRSID dataset.

| Method | Precision (%) | Recall (%) | AP (%) | AP50 (%) | AP75 (%) | F1 (%) |
|---|---|---|---|---|---|---|
| Faster-RCNN [31] | 92.3 | 74.6 | 65.9 | 87.4 | 76.6 | 82.3 |
| Reasoning-RCNN | 93.6 | 72.8 | 66.1 | 90.3 | 77.1 | 82.1 |
| FCOS [39] | 84.5 | 72.5 | 64.7 | 89.0 | 73.6 | 78.3 |
| YOLOX [40] | **94.2** | 75.2 | 61.9 | 88.1 | 70.7 | 83.7 |
| DETR [19] | 68.7 | 71.4 | 45.8 | 71.9 | 51.6 | 70.2 |
| ESTDNet [21] | 87.9 | 74.6 | 63.6 | 89.4 | 74.8 | 80.5 |
| ELLK-Net [41] | 93.6 | 75.4 | 66.8 | 90.4 | 76.0 | 83.7 |
| LHSDNet [42] | 91.4 | **77.6** | 64.9 | 89.8 | 75.8 | 83.8 |
| Our method | 93.2 | 76.3 | **67.4** | **90.5** | **77.2** | **84.1** |
Note: The bolded number represents the best result of each index.
Table 7. Comparison of different methods for the SARDet-100k dataset.

| Method | Precision (%) | Recall (%) | AP (%) | AP50 (%) | AP75 (%) | F1 (%) |
|---|---|---|---|---|---|---|
| Faster-RCNN [31] | 71.4 | 46.8 | 40.8 | 68.6 | 43.3 | 56.4 |
| Reasoning-RCNN | 73.0 | 46.1 | 41.7 | 69.4 | **45.2** | 56.5 |
| FCOS [39] | 72.3 | 51.2 | 39.6 | 70.1 | 44.2 | 60.0 |
| YOLOX [40] | 72.4 | 50.7 | 38.5 | 69.4 | 41.0 | 59.6 |
| DETR [19] | 60.8 | 45.7 | 26.7 | 58.8 | 28.3 | 52.0 |
| ESTDNet [21] | 72.7 | **58.3** | 38.9 | 70.8 | 45.1 | 64.8 |
| ELLK-Net [41] | 74.1 | 57.6 | 41.2 | 72.5 | 44.5 | 65.0 |
| LHSDNet [42] | 75.9 | 55.0 | 42.7 | 73.7 | 44.5 | 63.8 |
| Our method | **76.5** | 57.5 | **43.2** | **74.0** | 44.9 | **65.5** |
Note: The bolded number represents the best result of each index.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
