Article

Intelligent Target Detection in Synthetic Aperture Radar Images Based on Multi-Level Fusion

1 Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China
2 National Key Laboratory of Scattering and Radiation, Shanghai 200438, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(1), 112; https://doi.org/10.3390/rs17010112
Submission received: 24 October 2024 / Revised: 24 December 2024 / Accepted: 30 December 2024 / Published: 1 January 2025
(This article belongs to the Special Issue SAR-Based Signal Processing and Target Recognition (Second Edition))

Abstract: Due to the unique imaging mechanism of SAR, targets in SAR images present complex scattering characteristics. As a result, intelligent target detection in SAR images faces many challenges, which mainly lie in the insufficient exploitation of target characteristics, the inefficient characterization of scattering features, and the inadequate reliability of decision models. In this respect, we propose an intelligent target detection method based on multi-level fusion, where pixel-level, feature-level, and decision-level fusions are designed to enhance scattering feature mining and improve the reliability of decision making. The pixel-level fusion method, which channel-fuses the original images with their scattering-enhanced features, represents an initial exploration of image fusion. Two feature-level fusion methods are conducted using respective migratable fusion blocks, namely DBAM and FDRM, presenting a higher level of fusion. Decision-level fusion based on DST can not only consolidate the complementary strengths of different models but also incorporate human or expert involvement in proposition setting to guide effective decision making; it represents the highest-level fusion, integrating results by proposition setting and statistical analysis. Experiments with different fusion methods integrating different features were conducted on typical target detection datasets. As the results show, the proposed method increases the mAP by 16.52%, 7.1%, and 3.19% in ship, aircraft, and vehicle target detection, respectively, demonstrating high effectiveness and robustness.

1. Introduction

Synthetic aperture radar (SAR) is an all-weather, all-time active microwave sensor for high-resolution earth observation that images by emitting coherent electromagnetic waves toward the surface and receiving the scattered echoes [1]. It is critically important in both military and civil applications such as reconnaissance, terrain mapping, and disaster assessment [2].
Before the application of deep learning (DL) to SAR automatic target recognition (ATR), most research focused on the detection, discrimination, and classification processes proposed by Lincoln Laboratory [3,4,5]. These model-driven methods, which rely on mathematical theories and expert knowledge to design shallow models, can offer strong interpretability. However, due to their heavy dependence on manual features and complex parameter adjustments, these models exhibit limited robustness and generalization capabilities across different scenarios [6]. Consequently, developing reliable and efficient predictive models for complex and diverse scenarios has become a crucial research need.
In the era of big data and the rapid advancement of artificial intelligence, deep neural networks (DNNs) have demonstrated exceptional information extraction and processing capabilities in computer vision, achieving remarkable performance in object classification [7], object detection [8,9], change detection [10,11], etc. These data-driven methods, leveraging their powerful feature representation capabilities, can extract higher-level abstract semantic features, resulting in superior performance and robustness. However, such powerful feature extraction and representation capabilities require the support of high-quality, large-scale datasets. Low-quality image data or an insufficient volume of effective data can exacerbate decision-making uncertainty, leading to more false alarms and missed detections, which does not meet the current practical need for high-precision detection of targets in SAR images. Moreover, because an explainable theoretical foundation for applying deep learning to SAR image interpretation has not been thoroughly established, frontier developments are still guided by empirical and experimental results rather than theoretical insights [12], resulting in insufficient model interpretability. Consequently, many target-characteristic-driven methods have been proposed to improve the reliability of the decision model; these methods are specifically designed for SAR images and integrate imaging mechanisms [13], scattering characteristics [14], etc.
Our earlier study explored how to integrate traditional manual features and their scattering information into deep learning algorithms, leveraging the powerful automatic feature extraction and representation capabilities of the latter while allowing mathematically interpretable manual features to guide the training and learning process of the model. At first, we carried out feature fusion experiments through image channel concatenation, which is a pixel-level fusion method. Channel fusion is one of the primary approaches for pixel-level image fusion, where different feature channels or image channels are combined to form a comprehensive feature representation. Although this method can enrich the detailed information in the input, the utilization of scattering information remains at the level of low-level visual details, leading to insufficient exploitation of enhanced scattering features. Therefore, to more thoroughly extract and utilize the information contained in these manual features, we improved our network to first extract abstract information from manual feature images and original images separately and then fuse them at the feature level. This process constitutes a higher level of image fusion: feature-level fusion.
After completing the experiments based on pixel-level and feature-level fusion, we found that different manual features fused using different fusion methods have varying impacts on different categories. This means that we cannot rely solely on structural improvements to deep learning algorithms, nor can we depend entirely on traditional scattering features and fully mathematically interpretable pattern recognition methods. Instead, we should combine both approaches as much as possible to achieve complementary advantages. Given that decision-level fusion has a strong capacity for integrating complementary, redundant or competing, and synergistic data, as well as results from different fusion approaches, we proposed a decision-level fusion method. This method maximizes the advantages of both features and fusion strategies across different categories, thereby enabling high-performance and robust target detection in SAR images.
Based on the above discussion, we conducted research on SAR target detection based on multi-level fusion, where pixel-level and feature-level fusion are leveraged for better mining and exploiting scattering features, and decision-level fusion is leveraged for result integration and the improvement of the model’s decision reliability.
The contributions of our work can be summarized as follows:
  • To address the insufficient exploitation of scattering features, a pixel-level fusion method with an improved backbone network, ST-PA_RCNN, and a scattering feature enhancement module is proposed, representing an initial exploration of image fusion through the channel fusion of original images and their features.
  • To further enhance feature mining and characterization, two feature-level fusion methods are conducted based on respective migratable fusion blocks, namely DBAM and FDRM. These represent a higher fusion level than the pixel level, integrating abstract features by network design.
  • To address the inadequate reliability of decision making in DL-based methods, a decision-level fusion method based on DST is proposed for multi-model integration. It can not only consolidate the complementary strengths of different models but also incorporate human or expert involvement in proposition setting to guide effective decision making.
  • The proposed method was validated on typical ground and maritime surface target detection datasets, achieving mAP increases of 16.52%, 7.1%, and 3.19% for ships, aircraft, and vehicles, demonstrating our method's effectiveness and robustness.

2. Materials and Methods

2.1. Related Work

With the gradual maturation of SAR imaging hardware and algorithms, achieving high-precision SAR automatic target recognition (SAR ATR) of typical ground and maritime surface targets is of significant importance. The related work is primarily divided into three parts: model-driven methods, data-driven methods, and target-characteristic-driven methods.

2.1.1. Model-Driven Methods

Traditional model-driven SAR ATR methods rely on mathematical theories and expert knowledge for background clutter modeling, manual feature extraction, and classifier design, which are divided into three stages: detection, discrimination, and classification. The detection stage, usually implemented based on background clutter modeling including Gaussian, Rayleigh, exponential, and K-distribution [15], is conducted using a Constant False Alarm Rate (CFAR) [16] and its numerous improved forms, such as Two-Parameter CFAR (TP-CFAR) [17], CA-CFAR [18], GO-CFAR [19], and SO-CFAR [20]. In the discrimination stage, feature information is extracted to distinguish between targets and clutters, which is essentially a binary classification problem aiming to obtain as many regions of interest (RoIs) as possible. At last, the classification stage processes the RoIs through feature extraction and classifier design to further eliminate false alarms [21], ultimately obtaining class prediction information.
In short, the main components of traditional SAR ATR are manual feature extraction and classifier design. Manual features, such as geometric features, texture features, grayscale statistical features, and angular features, are input into model-driven methods based on template matching or Machine Learning (ML) [22]. Among template-matching classifiers, the statistical pattern recognition approach proposed by Ross et al. [23] is the most typical, predicting by comparison against a standard template base. ML-based classifiers perform predictions using support vector machines (SVMs) [24], Decision Trees (DT) [25], Random Forests (RF) [26], Perceptrons [27], and Naive Bayes [28].
In model-driven methods, the applicability of traditional classifiers based on templates or ML has become increasingly limited, especially when facing increasingly complex and diverse detection and recognition scenarios. However, manual features retain application potential owing to their relatively strong interpretability and reliability.

2.1.2. Data-Driven Methods

Nowadays, the advancement of deep learning has led to its extensive application in the field of SAR ATR, owing to its superior capabilities in extracting and mining features. Chen et al. [7] were the first to apply deep neural networks to target classification in SAR images in 2014, and their proposed A-ConvNets achieved an average accuracy of 99% on the MSTAR ten-class dataset. Zhang et al. [29] proposed a cascaded three-view network that combines the advantages of Faster R-CNN and residual units in feature extraction. An et al. [30] alleviated the issue of positive and negative sample imbalance by applying Focal Loss (FL) [31] and a hard negative mining module in their rotated bounding box SAR image target detection method. Liu et al. [32] constructed a multi-scale fully convolutional deep neural network for rotated bounding box ship target detection in port areas, significantly enhancing ship target detection performance in complex nearshore scenarios. Li et al. [33] addressed the problem of limited effective samples and the difficulty in distinguishing negative samples by employing Generative Adversarial Networks (GANs) [34] for the robust detection of ship targets in SAR images. Sun et al. [35] conducted extensive experiments on the AIR-SARShip-1.0 dataset, demonstrating the superiority and robustness of densely connected networks. For ground vehicle detection tasks, Du et al. [36] combined deep learning with transfer learning theory to achieve superior performance on the miniSAR dataset compared to CFAR.
Numerous research efforts have validated the superior feature characterization capabilities of data-driven methods compared to traditional SAR ATR. However, some issues have also emerged: on one hand, the limited number of effective SAR image samples or imbalanced samples can directly affect model performance; on the other hand, these data-driven methods are primarily inherited from computer vision methods based on optical image processing, and their feature mining capabilities in SAR images still require guidance and enhancement.

2.1.3. Target-Characteristic-Driven Methods

Model-driven methods rely on manual feature extraction, offering strong algorithm interpretability but exhibiting poor robustness in complex backgrounds. Purely data-driven methods or those that simply incorporate SAR image characteristics have stronger feature characterization, but their feature fusion effectiveness and model interpretability remain inadequate [6]. As detection scenarios become increasingly complex and diversified, and as the requirements for model performance and decision credibility in target detection rise, more DL-based methods that combine model-driven and data-driven approaches based on target characteristics are gradually being proposed. The relationship of methods driven by model, data, and target characteristics is shown in Figure 1.
Du et al. [37] proposed a feature decomposition-based SSD for ship target detection. Deep abstract features extracted by the backbone network are decomposed into discriminative and interference features, constraining the network to focus on learning discriminative features to achieve low false alarm and missed detection rates. Wang et al. [38] utilized Haar transforms to obtain texture information in the frequency domain to describe subtle differences between objects and their surrounding backgrounds. They designed and constructed a multi-feature fusion network for ship target detection. He et al. [39] addressed the sparsity and diversity issues brought by the scattering mechanism of synthetic aperture radar by proposing a method that leverages deep features and prior component structures. Li et al. [40] extracted grayscale features and enhanced spatial information, combined with scattering mechanism features, to suppress interference and redundant information in complex environments, achieving superior detection performance while assisting aircraft in rapid and efficient detection. Ginner et al. [41] designed adaptive noise suppression and scattering center feature extraction modules to accurately characterize the scattering features of aircraft targets. Mix MSTAR [42] is the first publicly available SAR vehicle target detection dataset, specifically designed for detecting rotating objects in large-scale scenes with complex backgrounds. Due to its recent release, related research based on this dataset has not yet commenced. Prior to the release of the Mix MSTAR dataset, research on vehicle target detection was very limited, with most target-characteristic-driven methods focusing primarily on vehicle target classification within the MSTAR dataset. Guo [43] proposed a multi-feature fusion decision CNN that achieves higher recognition rates than other models without increasing the number of model training parameters. Tang and Chen [44] employed a multi-set canonical correlation analysis method to fuse multi-view SAR images into a single feature vector, and then used joint sparse representation to characterize and classify feature vectors from different view sets. This approach achieved optimal classification performance even under extended working conditions, including noise interference and target occlusion. Liu et al. [45] proposed a causal reasoning framework for vehicle target recognition in complex scenes by modeling the SAR ATR task as a causal graph, representing the sources of background-related bias, which reduces false correlations between the background and prediction results, achieving robust recognition performance.
The aforementioned research on target-characteristic-driven methods for typical ground and maritime surface targets is highly commendable. By integrating target characteristics into properly designed DL-based methods, interpretable characteristic information has, to some extent, guided and provided insight to DNNs that would otherwise rely only on abstract features. However, existing target-characteristic-driven methods still exhibit limited human interaction in decision making, particularly in reconnaissance, where environmental conditions are variable and more human involvement in proposition formulation is needed.

2.2. The Proposed Method

2.2.1. Overview

In this paper, we propose an intelligent method based on multi-level fusion for target detection in SAR images; its structure is shown in Figure 2. First, feature extraction is conducted to capture four features. Then, the original images and their features are fed into deep neural networks based on one pixel-level fusion method and two feature-level fusion methods to mine and exploit scattering features as much as possible. Lastly, the detection results and prior information settings are utilized by the decision-level fusion method, from which the final results are obtained.

2.2.2. Scattering Feature Extraction

Scattering features of single-channel SAR images are composed of three main parts: geometric features, grayscale statistical features, and texture features. Geometric features include the target's outline, component shapes, structural dimensions, etc. Grayscale statistical features present the differences in structural and material properties between targets and clutter, reflected by the gray amplitude. Texture features, which focus on local grayscale changes, make categories more distinguishable, so targets are often recognized by defining and designing local texture features. Moreover, compared to point and vector features, texture features in image form can provide richer, more detailed information and exhibit great adaptability to various fusion methods, which suits the research topic of this work. Thus, four texture features were selected for feature fusion in the experiments: SAR-SIFT, SAR-HOG, NSLP, and LC. The corresponding feature extraction methods are introduced below.
  • SAR-SIFT
Scale-Invariant Feature Transform (SIFT) [46] is an important local feature description tool for image processing proposed by Lowe in 1999. SIFT was designed to identify and describe local features in an image that remain invariant to rotation, scale, and brightness. However, due to the multiplicative speckle noise specific to SAR images, SIFT, primarily designed for optical images, does not work as effectively on SAR images. On the one hand, multiplicative noise results in stronger gradient magnitudes in homogeneous regions with high reflectivity than in areas with lower reflectivity; on the other hand, the computation of orientations and local feature descriptors in SIFT relies on the classical differential gradient, which is not robust to multiplicative noise. Considering the potential of applying SIFT-like local features to SAR image feature extraction, SAR-SIFT [47] was utilized in our study, which has been proven to adapt well to multiplicative noise. The pseudocode is presented in Algorithm 1, followed by a brief code sketch of the ratio-based gradient.
Algorithm 1: SAR-SIFT
  • Input: Image $I(x,y)$
  • Output: Feature SAR-SIFT $I_{SIFT}$
  • $M_{1,\alpha}^{(i=1)} = \sum_{x=1}^{R} \sum_{y=-R}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $M_{2,\alpha}^{(i=1)} = \sum_{x=-R}^{-1} \sum_{y=-R}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $R_{1,\alpha} = M_{1,\alpha}^{(1)} / M_{2,\alpha}^{(1)}$;
  • $M_{1,\alpha}^{(i=3)} = \sum_{x=-R}^{R} \sum_{y=1}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $M_{2,\alpha}^{(i=3)} = \sum_{x=-R}^{R} \sum_{y=-R}^{-1} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $R_{3,\alpha} = M_{1,\alpha}^{(3)} / M_{2,\alpha}^{(3)}$;
  • $G_{x,\alpha} = \log(R_{1,\alpha})$;
  • $G_{y,\alpha} = \log(R_{3,\alpha})$;
  • $C_{SH}(x,y,\alpha) = \mathcal{G}_{\sqrt{2}\alpha} \ast \begin{bmatrix} (G_{x,\alpha})^2 & G_{x,\alpha} G_{y,\alpha} \\ G_{x,\alpha} G_{y,\alpha} & (G_{y,\alpha})^2 \end{bmatrix}$;
  • $R_{SH}(x,y,\alpha) = \det\big(C_{SH}(x,y,\alpha)\big) - d \cdot \operatorname{tr}\big(C_{SH}(x,y,\alpha)\big)^2$;
  • $S_{pk} \leftarrow$ Local extrema detection $\big(R_{SH}(x,y,\sigma)\big)$;
  • $G_{n,\alpha} = \sqrt{(G_{x,\alpha})^2 + (G_{y,\alpha})^2}$;
  • $G_{t,\alpha} = \arctan\big(G_{y,\alpha} / G_{x,\alpha}\big)$;
  • $S_{os} = \varnothing$;
  • for $p(x,y,\sigma)$ in $S_{pk}$ do
  •   if $R < 6\sigma$ then
  •     $p(x,y,\sigma,\theta) \leftarrow$ Select main orientations $\big(p(x,y,\sigma), G_{n,\alpha}, G_{t,\alpha}\big)$;
  •     $p(x,y,\sigma,\theta) \rightarrow S_{os}$;
  •   end
  • end
  • for $p(x,y,\sigma,\theta)$ in $S_{os}$ do
  •   if $R < 12\sigma$ then
  •     $I_{SIFT} \leftarrow$ gather $\big(p(x,y,\sigma,\theta), G_{n,\alpha}, G_{t,\alpha}\big)$;
  •   end
  • end
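For readers who wish to experiment with the ratio-based (log-ratio) gradient shared by Algorithms 1 and 5, a minimal NumPy sketch is given below. It is an illustration rather than the authors' implementation; the exponentially weighted half-window means and the helper name gradient_by_ratio are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def gradient_by_ratio(img, alpha=2.0, R=None):
    """Illustrative log-ratio gradient in the spirit of SAR-SIFT/SAR-Harris.
    Half-window means are weighted by exp(-(|x|+|y|)/alpha)."""
    img = img.astype(np.float64)
    R = int(3 * alpha) if R is None else R
    y, x = np.mgrid[-R:R + 1, -R:R + 1]
    w = np.exp(-(np.abs(x) + np.abs(y)) / alpha)      # exponential weights

    # Opposite half-window kernels along x (for G_x) and along y (for G_y)
    wx1, wx2 = w * (x > 0), w * (x < 0)
    wy1, wy2 = w * (y > 0), w * (y < 0)

    eps = 1e-8                                        # avoid division by zero
    gx = np.log((convolve(img, wx1) + eps) / (convolve(img, wx2) + eps))
    gy = np.log((convolve(img, wy1) + eps) / (convolve(img, wy2) + eps))
    gn = np.hypot(gx, gy)                             # magnitude G_{n,alpha}
    gt = np.arctan2(gy, gx)                           # orientation G_{t,alpha}
    return gn, gt
```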
  • SAR-HOG
Since SAR images are more sensitive to changes in the target's aspect angle and attitude than optical images, more attention should be paid to stable pixels in the process of scattering feature extraction, i.e., pixels dominated by strong backscattered echoes from attitude-insensitive structures. In order to better capture the scattering features corresponding to such stable structures, a modified algorithm based on the histogram of oriented gradients (HOG) [48], called SAR-HOG [49], was employed in our study. This algorithm extracts stable structures in SAR images through ratio-based gradient computation. The pseudocode is presented in Algorithm 2.
Algorithm 2: SAR-HOG
  • Input: Image $I(x,y)$, odd size of the averaging region $w$, orientations $o$, pixels per cell $p$, cells per block $c$, block norm $n$
  • Output: Feature SAR-HOG $I_{HOG}$
  • $I(x,y) = I(x,y)^{\gamma}$;
  • $M_1^{1}(x,y) = \text{local mean}\big(I;\ (x-\tfrac{w-1}{2},\, y-\tfrac{w-1}{2}),\ (x+\tfrac{w-1}{2},\, y)\big)$;
  • $M_2^{1}(x,y) = \text{local mean}\big(I;\ (x-\tfrac{w-1}{2},\, y),\ (x+\tfrac{w-1}{2},\, y+\tfrac{w-1}{2})\big)$;
  • $M_1^{3}(x,y) = \text{local mean}\big(I;\ (x-\tfrac{w-1}{2},\, y-\tfrac{w-1}{2}),\ (x,\, y+\tfrac{w-1}{2})\big)$;
  • $M_2^{3}(x,y) = \text{local mean}\big(I;\ (x,\, y-\tfrac{w-1}{2}),\ (x+\tfrac{w-1}{2},\, y+\tfrac{w-1}{2})\big)$;
  • $R_1(x,y) = M_1^{1} / M_2^{1}$;
  • $R_3(x,y) = M_1^{3} / M_2^{3}$;
  • $G_H(x,y) = \log\big(R_1(x,y)\big)$;
  • $G_V(x,y) = \log\big(R_3(x,y)\big)$;
  • $G_m(x,y) = \sqrt{G_H(x,y)^2 + G_V(x,y)^2}$;
  • $G_{\theta}(x,y) = \tan^{-1}\big(G_V(x,y) / G_H(x,y)\big)$;
  • divide $I(x,y)$ into cells by $p$;
  • for cell in cells do
  •   build HOG histogram(cell, $o$, $p$, $c$);
  • end
  • combine cells into blocks by $c$;
  • for block in blocks do
  •   block = norm(block, $n$);
  • end
  • NSLP
The Non-Subsampled Laplacian Pyramid (NSLP) [50,51] is an algorithm used in image processing and computer vision for multi-scale spatial representation. Unlike traditional Laplacian or Gaussian pyramids, NSLP does not perform subsampling, thereby maintaining the spatial resolution of the image across different scales. This approach is particularly effective for capturing texture features in images, as it preserves detailed information at various scales. The pseudocode is presented in Algorithm 3.
Algorithm 3: NSLP
  • Input: Image $I(x,y)$, filter $f$
  • Output: Feature image $I_{NSLP}$
  • $f$ = Gaussian kernel ($3 \times 3$);
  • $I_{conv} = I(x,y) \ast f$;
  • $I_{NSLP} = I(x,y) - I_{conv}$;
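As a quick illustration of Algorithm 3, a minimal sketch of the single-level detail layer is given below (not part of the original implementation); the OpenCV Gaussian blur stands in for the filter $f$, and the function name is an assumption.

```python
import cv2
import numpy as np

def nslp_feature(img: np.ndarray) -> np.ndarray:
    """One non-subsampled Laplacian level: low-pass without downsampling,
    then subtract to keep the high-frequency texture (I_NSLP = I - I_conv)."""
    img = img.astype(np.float32)
    low = cv2.GaussianBlur(img, (3, 3), 0)   # I_conv = I * f with a 3x3 Gaussian kernel
    return img - low
```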
  • LC
Saliency features come from models that mimic the characteristics of human visual attention, enabling computers to selectively ignore non-essential background information in a scene during image processing and to focus on regions of interest. For example, in SAR ATR of ship targets, the ship's hull is primarily made of metal, resulting in strong backscattering that contrasts sharply with the sea surface and forms a simple scene in which saliency features aid detection. However, in practical scenarios, backgrounds are often complex and diverse, with significant electromagnetic interference, making it challenging to achieve optimal SAR ATR performance using saliency features alone. In this study, we used the Linear Color (LC) algorithm [52] to extract saliency features, which has been proven to improve ship target detection performance in SAR images with complex backgrounds. The pseudocode is presented in Algorithm 4, followed by a brief code sketch.
Algorithm 4: LC
  • Input: Image $I(x,y)$
  • Output: Feature image $I_{LC}$
  • $h$ = histogram of $I(x,y)$;
  • for $I_k$ in $I$ do
  •   for $n$ in range(256) do
  •     $f_n$ = frequency of gray level $n$;
  •     $I_k \mathrel{+}= f_n \, |I_k - I_n|$;
  •   end
  •   $I_k$ = norm($I_k$);
  • end
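The gray-level-contrast saliency of Algorithm 4 can likewise be sketched in a few lines of NumPy; the lookup-table formulation below is an illustrative assumption that matches the histogram-weighted absolute differences in the pseudocode.

```python
import numpy as np

def lc_saliency(img: np.ndarray) -> np.ndarray:
    """Saliency(g) = sum over gray levels n of freq(n) * |g - n|, then normalized."""
    gray = img.astype(np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    # Saliency value for every possible gray level, built once as a lookup table
    lut = np.array([(hist * np.abs(g - levels)).sum() for g in levels])
    sal = lut[gray]                                             # map pixels through the table
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)  # normalize to [0, 1]
```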

2.2.3. Image Fusion Theory

Data fusion is a strategy for processing information from multiple sources to optimize decision making; it is common in multimodal data processing and often takes the form of image fusion [53]. Through data fusion, more accurate and unified information can be extracted to improve the efficiency of decision making. In this study, we use the term image fusion for precision, since the data fusion here is conducted in image form.
Image fusion can be categorized into three main levels according to different stages of the process [54], which are pixel-level, feature-level, and decision-level fusion, as shown in Figure 3.
  • Pixel-level image fusion
Pixel-level fusion is at a low level in image fusion; it is conducted by combining pixel information from two or more images. Pixel-level fusion allows data pre-processing operations to be performed prior to formal fusion, which then results in a comprehensive representation that contains as much detailed information as possible.
b. Feature-level image fusion
Feature-level fusion is at a middle level in image fusion, where abstract features are extracted first, then followed by the further screening of redundant features, forming new features that can be effectively utilized. Feature-level fusion is often used to obtain and enhance more abstract and comprehensive feature information.
c. Decision-level image fusion
Decision-level fusion is at a high level in image fusion. In its process, extracted features are fed into a feature identification module to obtain initial results, and these initial results are then consolidated by decision fusion into final results. Decision-level fusion has a strong ability to integrate results and to exploit the complementary strengths of different models, with better fault tolerance; however, its requirements for pre-processing, feature extraction, and the determination of initial results are relatively high.

2.2.4. Backbone Network

In this paper, the backbone network, named ST-PA_RCNN, is built on Oriented RCNN [55] with a Swin Transformer and PA-FPN. Its structure is shown in Figure 4. Oriented RCNN is a simple and efficient generic two-stage network for rotated detection with good accuracy and timeliness, whose feature extraction is composed of CNN kernels. However, due to their limited receptive field, CNN kernels only capture local features within the region they cover, which cannot effectively meet higher-quality feature extraction requirements, especially in complex scenes. In this regard, the Swin Transformer can capture long-range dependencies through a self-attention mechanism without using CNN kernels, providing a more global perspective. The Path Aggregation Feature Pyramid Network (PA-FPN) [56], an improved Feature Pyramid Network (FPN) [57], is adopted to enhance feature extraction through path augmentation.
  • Swin Transformer
As shown in Figure 4, the Swin Transformer optimizes feature extraction through its unique processing. At first, the input image is divided into multiple non-overlapping patches by patch partition and then mapped to specified dimensions through linear embedding. Next, primary stage feature maps are generated from two successive Swin Transformer blocks. Then, through stage 2, stage 3, and stage 4, deeper features are obtained.
The details of two successive Swin Transformer blocks are shown on the right of Figure 4, where their focus on self-attention computation within local and non-overlapping windows is enhanced by replacing the multi-head self-attention (MSA) in standard Transformers with window multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA). This modification allows the blocks to concentrate more on self-attention calculations, while reducing the computational complexity to a linear level. At the same time, to overcome the potential limitations of W-MSA in building long-range dependencies, specifically the lack of cross-window connections, SW-MSA is employed.
b. PA-FPN
The structure of PA-FPN is shown in Figure 5, which enhances information flow between different scale features by bottom-up path augmentation based on the FPN, especially for the better performance of multiple-scale target detection in a complex background. PA-FPN mainly consists of an FPN and bottom-up path augmentation.
  • FPN
The FPN is designed to enhance feature mining at different scales by constructing a multi-level, multi-scale feature pyramid, which effectively improves performance in detecting targets of different scales. It is shown by the blue structure in Figure 5.
The multi-scale feature levels generated by the FPN are denoted by $P_2, P_3, P_4, P_5$. As shown in Figure 5, the feature maps are progressively downsampled by a factor of 2 from $P_2$ to $P_5$. The new feature mappings corresponding to $P_2, P_3, P_4, P_5$ are denoted by $N_2, N_3, N_4, N_5$. Lateral connections fuse the features obtained from both paths at the same resolution, preserving the high-resolution information of the original image while integrating high-level semantic information from the deeper features. This process iterates until the lowest-level feature map is generated. Additionally, 1 × 1 convolutions are used to adjust the dimensionality of the feature maps.
  • Bottom-up path augmentation
Bottom-up path augmentation primarily enhances performance by introducing path augmentation and information aggregation, specifically by establishing efficient lateral connections from lower to higher layers, facilitating the smoother upward transmission of low-level information, as shown by the red and green dashed arrows in Figure 5. A minimal sketch of the PA-FPN computation is given below.
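To make the PA-FPN data flow concrete, the following is a minimal PyTorch sketch of the top-down FPN pass followed by bottom-up path augmentation. It is a simplified stand-in for the neck used in ST-PA_RCNN; the channel sizes (typical Swin-T stage widths) and the class name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPAFPN(nn.Module):
    """Minimal PA-FPN-style neck: FPN top-down pass (P2..P5) followed by
    bottom-up path augmentation (N2..N5)."""
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
        self.down = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
                                  for _ in in_channels[:-1])

    def forward(self, feats):                      # feats: C2..C5, high to low resolution
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(lat) - 1, 0, -1):       # top-down: upsample and add laterally
            lat[i - 1] = lat[i - 1] + F.interpolate(lat[i], size=lat[i - 1].shape[-2:],
                                                    mode="nearest")
        p = [s(x) for s, x in zip(self.smooth, lat)]
        n = [p[0]]                                 # bottom-up path augmentation
        for i in range(1, len(p)):
            n.append(p[i] + self.down[i - 1](n[-1]))
        return n                                   # N2..N5
```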
c. Oriented RCNN
Oriented RCNN is a two-stage detector that mainly consists of an oriented Region Proposal Network (RPN) and an oriented RCNN head. The first stage produces high-quality oriented proposals at minimal computational cost, while the second stage uses the oriented RCNN head for proposal classification and regression. Compared with the standard RPN, the oriented RPN adds two parameters to the regression branch for better adaptation to rotated detection and introduces the midpoint offset representation for proposals, further improving detection accuracy. The oriented RCNN head refines and recognizes the RoIs; Rotated RoIAlign accurately extracts features from the RoIs, supporting the subsequent regression and classification. The network's loss function is as follows:
$L_{ORCNN} = \frac{1}{N} \sum_{i=1}^{N} L_{cls}\big(p_i, p_i^{*}\big) + \frac{1}{N} \sum_{i=1}^{N} L_{reg}\big(t_i, t_i^{*}\big)$,
where $i$ denotes the index of anchors and $N$ refers to the number of samples in a mini-batch. $p_i^{*}$ denotes the ground-truth label of the $i$-th anchor, and $p_i$ denotes the output of the classification branch of the oriented RPN. $t_i^{*} = (t_x^{*}, t_y^{*}, t_w^{*}, t_h^{*}, t_\alpha^{*}, t_\beta^{*})$ denotes the ground-truth box of the $i$-th anchor, and $t_i = (t_x, t_y, t_w, t_h, t_\alpha, t_\beta)$ denotes the output of the regression branch of the oriented RPN. $L_{cls}$ denotes the cross-entropy loss, defined as
$L_{cls}\big(p_i, p_i^{*}\big) = -\big[p_i^{*} \log p_i + (1 - p_i^{*}) \log(1 - p_i)\big]$,
$L_{reg}$ denotes the smooth $L_1$ loss, defined as
$L_{reg}\big(t_i, t_i^{*}\big) = \mathrm{smooth}_{L_1}\big(t_i - t_i^{*}\big)$,
where
$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$,
A simplified code sketch of these two loss terms is given below.
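The following is a minimal PyTorch sketch of the per-anchor classification and box-regression terms (binary cross-entropy plus smooth $L_1$); it is a simplified stand-in for the full Oriented RCNN loss, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def oriented_rcnn_like_loss(cls_logits, cls_targets, box_preds, box_targets):
    """Simplified two-term detection loss.
    cls_logits: (N,) raw scores; cls_targets: (N,) in {0, 1}
    box_preds / box_targets: (N, 6) offsets (x, y, w, h, alpha, beta)."""
    # Cross-entropy term: -[p* log p + (1 - p*) log(1 - p)]
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets.float())

    # Smooth L1 term, averaged over positive anchors only
    pos = cls_targets > 0
    if pos.any():
        l_reg = F.smooth_l1_loss(box_preds[pos], box_targets[pos], beta=1.0)
    else:
        l_reg = box_preds.sum() * 0.0    # keep the graph connected when no positives exist
    return l_cls + l_reg

# Example with random tensors (shapes are illustrative only)
loss = oriented_rcnn_like_loss(torch.randn(8), torch.randint(0, 2, (8,)),
                               torch.randn(8, 6), torch.randn(8, 6))
```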

2.2.5. Pixel-Level Fusion

ST-PA_RCNN based on channel fusion is proposed in this subsection, where channel fusion is one of the primary methods of pixel-level fusion [58]. Channel fusion enriches detailed information by combining different feature channels or image channels to form a comprehensive feature representation. The structure of the ST-PA_RCNN based on pixel-level fusion is shown in Figure 6. The scattering feature enhancement region obtained from the original images after scattering feature enhancement is input into the backbone network through channel fusion.
  • Scattering Feature Enhancement
Scattering feature enhancement consists of three main parts: a scattering feature extractor, a SAR-Harris detector, and OPTICS clustering. The scattering feature extractor is used to extract the four texture features described in Section 2.2.2. The scattering features are then fed into the SAR-Harris detector for key point extraction. Finally, the processed features are clustered by OPTICS to obtain the scattering feature enhancement region. The pseudocode of the SAR-Harris detector and OPTICS are presented in Algorithm 5 and Algorithm 6, respectively.
Algorithm 5: SAR-Harris
  • Input: Image $I(x,y)$
  • Output: Feature SAR-Harris $S_{Harris}$
  • $M_{1,\alpha}^{(i=1)} = \sum_{x=1}^{R} \sum_{y=-R}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $M_{2,\alpha}^{(i=1)} = \sum_{x=-R}^{-1} \sum_{y=-R}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $R_{1,\alpha} = M_{1,\alpha}^{(1)} / M_{2,\alpha}^{(1)}$;
  • $M_{1,\alpha}^{(i=3)} = \sum_{x=-R}^{R} \sum_{y=1}^{R} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $M_{2,\alpha}^{(i=3)} = \sum_{x=-R}^{R} \sum_{y=-R}^{-1} I(a+x, b+y)\, e^{-\frac{|x|+|y|}{\alpha}}$;
  • $R_{3,\alpha} = M_{1,\alpha}^{(3)} / M_{2,\alpha}^{(3)}$;
  • $G_{x,\alpha} = \log(R_{1,\alpha})$;
  • $G_{y,\alpha} = \log(R_{3,\alpha})$;
  • $C_{SH}(x,y,\alpha) = \mathcal{G}_{\sqrt{2}\alpha} \ast \begin{bmatrix} (G_{x,\alpha})^2 & G_{x,\alpha} G_{y,\alpha} \\ G_{x,\alpha} G_{y,\alpha} & (G_{y,\alpha})^2 \end{bmatrix}$;
  • $R_{SH}(x,y,\alpha) = \det\big(C_{SH}(x,y,\alpha)\big) - d \cdot \operatorname{tr}\big(C_{SH}(x,y,\alpha)\big)^2$;
  • $S_{Harris} \leftarrow$ Local extrema detection $\big(R_{SH}(x,y,\sigma)\big)$;
Algorithm 6: OPTICS
  • Input: DB, eps, MinPts
  • Output: ordered list
  • OPTICS(DB, eps, MinPts)
  •   // Extract all core points from the dataset
  •   core_points ← getCorePoints(DB, eps, MinPts);
  •   // Calculate the core distance for each point
  •   core_dists ← computeCoreDists(DB, eps, MinPts);
  •   for each unprocessed point p in core_points do
  •     N ← getNeighbors(p, eps);
  •     mark p as processed;
  •     output p to the ordered list;
  •     if |N| ≥ MinPts then
  •       Seeds ← empty priority queue;
  •       update(N, p, Seeds, core_dists);
  •       for each next point q in Seeds do
  •         N′ ← getNeighbors(q, eps);
  •         mark q as processed;
  •         output q to the ordered list;
  •         if |N′| ≥ MinPts then
  •           update(N′, q, Seeds, core_dists);
  •         end
  •       end
  •     end
  •   end
  •   return the ordered list
b. Channel Fusion
In this study, channel fusion was conducted by concatenating the original image and feature image after scattering enhancement, where the third channel of the original image is replaced with the feature image. The process of channel fusion is shown in Figure 7.
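As a concrete illustration of this channel fusion (our assumption of how the replacement may look in code, not the authors' implementation), the sketch below keeps the first two channels of the replicated SAR image and writes the scattering-enhanced feature map into the third channel.

```python
import numpy as np

def channel_fuse(sar_img: np.ndarray, feature_img: np.ndarray) -> np.ndarray:
    """Pixel-level channel fusion: replicate a single-channel SAR image to three
    channels if necessary, then replace the third channel with the feature map."""
    if sar_img.ndim == 2:
        sar_img = np.stack([sar_img] * 3, axis=-1)
    fused = sar_img.astype(np.float32).copy()
    fused[..., 2] = feature_img.astype(np.float32)   # third channel <- enhanced feature
    return fused
```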

2.2.6. Feature-Level Fusion

Two feature-level fusion methods based on different fusion blocks were used in this study: DBAM ST-PA_RCNN and FDRM ST-PA_RCNN. The following discussion centers on the two fusion blocks, the Dual-Branch FPN and Attention Mechanism (DBAM) and the Feature Decomposition and Reweighting Model (FDRM). The structure of the DBAM follows [38], while the FDRM is the main contribution and innovation of this subsection.
  • Dual-Branch FPN and Attention Mechanism
The structure of DBAM ST-PA_RCNN is shown in Figure 8, which consists of two main parts: dual-branch PA-FPN and attention mechanism fusion block.
  • Dual-Branch PA-FPN
The structure of DB-PA-FPN is shown in Figure 9, with a PA-FPN displayed on the left and right. The obtained dual-branch features P 2 , P 3 , P 4 , P 5 from PA-FPN are then input into the feature fusion block to obtain a new feature mapping M 2 , M 3 , M 4 , M 5 . DB-PA-FPN inherits the superior performance of PA-FPN in enhancing the flow of information among different scale features by effective feature mining, which is crucial for handling complex or multi-scale target detection scenarios. The path augmentation is shown by the orange and purple dashed arrows in Figure 9.
  • Attention Mechanism Fusion Block
The structure of the attention mechanism is shown in Figure 10. The features extracted by DB-PA-FPN from the original image and the feature image after scattering enhancement are concatenated by channel fusion. The obtained features are then input into residual-like modules. The attention mechanism [59], known for its efficiency and practicality, has become a popular method for enhancing the performance of deep neural networks. It can adaptively focus on regions of interest, effectively reducing background noise and confusion between classes, thereby enabling a more concentrated capture of key information related to the target.
b. Feature Decomposition and Reweighting Model
FDRM ST-PA_RCNN is proposed as the other feature-level fusion method, and its structure is shown in Figure 11. FDRM ST-PA_RCNN builds on DBAM ST-PA_RCNN, using an improved feature fusion block named the FDRM. Compared to the attention mechanism, the FDRM can further reduce feature redundancy. Additionally, the reweighting of the two branch features is achieved through an accumulative learning strategy, which enhances the interaction between the dual feature spaces and allows the network to learn and utilize scattering characteristics more effectively.
The structure of the FDRM fusion block is shown in Figure 12. The features extracted by the DB-PA-FPN from the original image and the feature image after scattering enhancement are input into the FDRM for feature decomposition and reweighting. The obtained features are then channel-concatenated and finally downsampled for output.
As shown on the right part of Figure 12, the FDRM is mainly composed of two parts: feature decomposition and feature reweighting.
  • Feature Decomposition
In order to reconstruct the dual feature spaces of original features and scattering features, reduce feature redundancy, and enhance information entropy, feature decomposition leverages two loss functions: orthogonal loss and compactness loss.
Orthogonal Loss. To amplify the divergence between the original feature space and the scattering feature space, the orthogonal loss is defined as
$L_{opl} = \frac{1}{N} \sum_{n=1}^{N} f_o^{T} \times f_s$,
where $N$ denotes the number of images in a mini-batch, $f_o$ and $f_s$ denote the two input feature spaces, and $T$ represents the transpose operation. As a result, the divergence between inter-features is amplified.
Compactness Loss. Inspired by center loss [60], the compactness loss is designed to learn a center for each group of features and to penalize the distances between these features and their respective centers. Specifically,
$L_{cmpt} = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{M} \left\| I_{i,j} - c_j \right\|_2^2$,
where $\|\cdot\|_2$ indicates the $L_2$ norm, $I_{i,j}$ refers to the input features for $j = 1, \ldots, M$ ($M = 2$ in our method), and $c_j$ indicates the center of the $j$-th features, which is updated with every mini-batch. Consequently, the compactness and cohesion among intra-features are enhanced.
Together with the detection loss $L_{ORCNN}$ of the multi-scale detection module, the final loss function $L_{FDRM}$ is
$L_{FDRM} = L_{ORCNN} + \lambda_{opl} L_{opl} + \lambda_{cmpt} L_{cmpt}$,
where $\lambda_{opl}$ and $\lambda_{cmpt}$ are two balancing hyperparameters, set to 1 and 0.1, respectively. A minimal code sketch of the two feature-decomposition losses follows.
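Below is a minimal PyTorch sketch of the two feature-decomposition losses, assuming the two feature spaces are flattened to (N, D) tensors; the per-image inner product and the squaring used as a non-negative penalty are our assumptions.

```python
import torch

def orthogonal_loss(f_o: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Push the original and scattering feature spaces apart: penalize the
    per-image inner product f_o^T f_s (shapes: (N, D))."""
    inner = (f_o * f_s).sum(dim=1)
    return inner.pow(2).mean()

def compactness_loss(feats, centers):
    """Center-loss-style term: pull each feature space toward its own center.
    feats[j]: (N, D); centers[j]: (D,), updated every mini-batch elsewhere."""
    loss = 0.0
    for f, c in zip(feats, centers):
        loss = loss + (f - c).pow(2).sum(dim=1).mean()
    return loss

# L_FDRM = l_orcnn + 1.0 * orthogonal_loss(f_o, f_s) \
#          + 0.1 * compactness_loss([f_o, f_s], [c_o, c_s])
```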
  • Feature Reweighting
After feature decomposition, the two feature spaces are reconstructed to maximize inter-feature divergence and ensure intra-feature compactness. These features then need to be fused prior to further processing. The FRCLM is proposed to strike a balance between the influence of the original features and the scattering features during model training. The fusion strategy can be described as a weighted summation with a trade-off parameter $\mu$:
$F_r = \mu F_o + (1 - \mu) F_s$,
where $F_r$ denotes the feature fused from the two branches. Given the current training epoch $T$ and the total number of training epochs $T_{max}$, $\mu$ follows the parabolic strategy
$\mu = (T / T_{max})^2$,
Throughout the training, the feature reweighting model incrementally redirects the model’s focus between the primary and scattering feature pathways. This strategy ensures comprehensive learning from both sets of features, ultimately improving detection and recognition capabilities.
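The parabolic schedule takes only a few lines to implement; the sketch below (with illustrative tensor shapes) shows how the fusion weight shifts from the scattering branch toward the original branch as training progresses.

```python
import torch

def reweight_features(f_o: torch.Tensor, f_s: torch.Tensor,
                      epoch: int, max_epoch: int) -> torch.Tensor:
    """Parabolic reweighting: mu = (T / T_max)^2, F_r = mu * F_o + (1 - mu) * F_s.
    Early epochs emphasize the scattering branch, later epochs the original one."""
    mu = (epoch / max_epoch) ** 2
    return mu * f_o + (1.0 - mu) * f_s

# e.g., at epoch 6 of 36, mu = (6/36)^2 ≈ 0.028, so the scattering branch still dominates
f_o, f_s = torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64)
fused = reweight_features(f_o, f_s, epoch=6, max_epoch=36)
```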

2.2.7. Decision-Level Fusion

In this subsection, we propose a multi-feature fusion method based on Dempster–Shafer Theory (DST) for object detection; its structure is shown in Figure 13. The detection results obtained by the pixel-level and feature-level fusion methods (also shown in Figure 2) are used to calculate the global/local confidence of the n-th model, which integrates one feature using one fusion method, together with prior information (category weight settings in this paper) for evidence collection. All models, covering all features fused by the pixel-level and feature-level methods, are leveraged in our decision-level fusion. These pre-processed data are then fed into the decision-level fusion module, which includes discernment framework building, basic probability assignment, evidence combination, and a decision rule. Finally, the final object detection results are obtained.
  • Dempster–Shafer Theory
The discernment framework $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$ is the set of all possible outcomes, where $\theta_i\ (i = 1, 2, \ldots, N)$ denotes an element and the elements are mutually exclusive.
The basic probability assignment (BPA) is denoted as $m: 2^{\Theta} \rightarrow [0, 1]$; it assigns a probability mass to each subset of $\Theta$, and the total mass must equal 1.
The belief function (Bel) quantifies the support for a hypothesis based on the available evidence; that is, $Bel(A)$ denotes the sum of the BPAs of all subsets contained in proposition $A$:
$Bel(A) = \sum_{B \subseteq A} m(B)$,
The plausibility function (Pl) represents the degree of belief in a hypothesis considering the uncertainty, calculated as
$Pl(A) = \sum_{B \cap A \neq \varnothing} m(B)$,
Dempster's rule of combination is used to merge BPAs from different sources and is defined as
$m(A) = \frac{1}{1-K} \sum_{\cap A_i = A} \prod_{j=1}^{N} m_j(A_i)$,
where $K$ is a normalization factor that accounts for conflicting evidence, calculated as
$K = \sum_{\cap A_i = \varnothing} \prod_{j=1}^{N} m_j(A_i)$,
Decision rule: based on the combined evidence, the most plausible hypothesis is selected, typically the one with the highest plausibility or belief, i.e., $m(A_{opt}) = \max m(A_i),\ A_i \subseteq \Theta$. A minimal code sketch of this combination step follows.
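The following is a minimal Python sketch of Dempster's rule for two sources; the class names ("ship"/"clutter") and the masses are hypothetical and serve only to illustrate the combination and the normalization by 1/(1 − K).

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two BPAs given as {frozenset_of_hypotheses: mass}; conflicting
    mass K goes to the empty set and the remainder is renormalized by 1/(1-K)."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two models expressing belief over {"ship"}, {"clutter"}, and ignorance (Theta)
theta = frozenset({"ship", "clutter"})
m_a = {frozenset({"ship"}): 0.7, frozenset({"clutter"}): 0.1, theta: 0.2}
m_b = {frozenset({"ship"}): 0.6, frozenset({"clutter"}): 0.2, theta: 0.2}
fused = dempster_combine(m_a, m_b)
best = max((h for h in fused if h != theta), key=lambda h: fused[h])
```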
2. The processing of decision-level fusion
  • Category Weight Configuration
Since different categories receive different levels of attention, we assign weights to the categories to help the model make more valuable decisions. For example, the detection of fighters has higher intelligence value for surveillance and battlefield assessment than that of commercial airliners. We therefore define the weight of the $i$-th category as $w_i^{c}$, with $\sum_i w_i^{c} = 1$.
  • Confusion Matrix
Due to the different performance of different fusion methods integrating different scattering features, and in order to effectively fuse the detection results of each deep learning model, we adopt a confusion matrix [61] as one of the assignment components in the BPA. A confusion matrix describes and evaluates the relationship between the ground truth and the recognition results. For an $N$-class task, the confusion matrix can be defined as
$CM = \begin{bmatrix} cm_{11} & \cdots & cm_{1N} \\ \vdots & \ddots & \vdots \\ cm_{N1} & \cdots & cm_{NN} \end{bmatrix}$,
  • Global/Local Confidence
Based on the confusion matrix, the global and local confidence of a model in identifying the $i$-th category are defined as
$w_i^{g} = \frac{cm_{ii}}{\sum_{i=1}^{N} cm_{ii}}$,
$w_i^{l} = \frac{cm_{ii}}{\sum_{j=1}^{N} cm_{ij}}$,
  • BPA
Based on the definitions of local and global confidence, combined with the category weight assignment, the basic probability assignment in DST is calculated as
$m_k(S_i) = \frac{w_i^{l}\, S_i\, w_i^{c}}{\sum_{i=1}^{N} w_i^{l}\, S_i\, w_i^{c}}$,
where $S_i$ denotes the prediction score for the $i$-th category, and $k$ denotes the $k$-th model.
For the discernment framework $\Theta = \{S_1, S_2, \ldots, S_N, \Theta, \varnothing\}$, we have
$m_k(S_1) + m_k(S_2) + \cdots + m_k(S_N) + m_k(\Theta) = 1, \quad m_k(\varnothing) = 0$,
That is, $m_k(\Theta) = 1 - m_k(S_1) - m_k(S_2) - \cdots - m_k(S_N)$.
  • Combination and Decision Rule
The combination and decision rule described in the Dempster–Shafer Theory part above are then followed to obtain the final detection results. A small sketch of the confidence and BPA computation is given below.
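For completeness, the sketch below builds one model's BPA from its confusion matrix, prediction scores, and category weights; reserving a fixed fraction of mass for the ignorance set Θ is our own simplifying assumption for illustration, not a setting reported in the paper.

```python
import numpy as np

def bpa_from_confusion(cm: np.ndarray, scores: np.ndarray, w_c: np.ndarray) -> dict:
    """Build one model's BPA over N categories plus the ignorance set Theta.
    cm: (N, N) confusion matrix; scores: (N,) prediction scores S_i for one
    detection; w_c: (N,) category weights summing to 1."""
    w_l = np.diag(cm) / cm.sum(axis=1)            # local confidence w_i^l
    w_g = np.diag(cm) / np.diag(cm).sum()         # global confidence w_i^g (for reference)

    raw = w_l * scores * w_c                      # m_k(S_i) proportional to w_i^l * S_i * w_i^c
    m = 0.9 * raw / raw.sum()                     # keep 10% of the mass for Theta (assumption)
    bpa = {i: float(m[i]) for i in range(len(m))}
    bpa["Theta"] = float(1.0 - m.sum())
    return bpa
```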

3. Results

In this section, comprehensive experiments are presented to analyze the model's efficiency and potential. Multiple feature fusion strategies were investigated in our experiments, demonstrating the effectiveness and significance of the proposed method. Extensive experiments on multiple typical ground and maritime target datasets were conducted to validate the method's applicability.

3.1. Dataset Description

3.1.1. SRSDD-v1.0

SRSDD-v1.0 [62] is a high-resolution SAR ship detection dataset built from GF3 spotlight (SL) mode imagery with a resolution of 1 m and an image size of 1024 × 1024 pixels. It contains six categories of ships, with 63.1% inshore scenes and 36.9% offshore scenes, making detection more challenging. Detailed category information is given in Table 1, and different scenarios are illustrated in Figure 14.

3.1.2. GF3-ADD

The Gaofen-3 Aircraft Detection Dataset (GF3-ADD) is an in-lab dataset built from GF3 imagery with a resolution of 1 m. It covers five types of airports and three categories of aircraft. The ground truths were annotated by professionally trained personnel with reference to optical images of the same regions. Images are cropped to 512 × 512 pixels to preserve details of target scattering features while reducing background clutter in feature extraction. Detailed category information is given in Table 2, and examples of different scene slices are shown in Figure 15.

3.1.3. MSTAR-VDD

Mix MSTAR [42] is a synthetic benchmark dataset for multi-class rotated vehicle detection, which mixes target chips and clutter backgrounds from the original MSTAR [63] data at the pixel level. Mix MSTAR contains 20 fine-grained categories in 100 high-resolution images, predominantly 1478 × 1784 pixels; the fine granularity is achieved partly by refining T72 into 11 variants (T72 A04, T72 A05, T72 A07, T72 A10, T72 A32, T72 A62, T72 A63, T72 A64, T72 SN132, T72 SN812, T72 SNS7). The dataset includes various landscapes such as woods, grasslands, urban buildings, and tightly arranged vehicles.
Since targets with similar structures have similar scattering mechanisms and close scattering characteristics, and in order to better explore the potential of scattering features, we grouped the 20 categories in Mix MSTAR into 5 broad categories, referred to as MSTAR-VDD in this paper: tank, self-propelled artillery, amphibious, dozer, and truck. The fine-to-broad categorization is shown in Figure 16, the category information is given in Table 3, and different scenarios are illustrated in Figure 17.

3.2. Evaluation Metrics

The precision $P$ and recall $R$ are defined using True Positives ($TP$), False Positives ($FP$), and False Negatives ($FN$) as
$P = \frac{TP}{TP + FP}$,
$R = \frac{TP}{TP + FN}$,
As the most common evaluation metric in object detection, Average Precision ($AP$) reflects the comprehensive performance of a detector, since a trade-off exists between precision and recall. $AP$ and $mAP$ are calculated as
$AP = \int_{0}^{1} P(R)\, dR$,
$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$,
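A minimal sketch of how $AP$ can be computed from ranked detections (all-point interpolation over the precision-recall curve) is shown below; the matching rule that marks a detection as a true positive is assumed.

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP = area under the precision-recall curve (all-point interpolation).
    scores: (M,) detection confidences; is_tp: (M,) 1 if the detection matches a
    ground-truth box (e.g., rotated IoU >= 0.5), else 0; num_gt: #ground truths."""
    order = np.argsort(-scores)                      # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)

    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]   # monotone precision envelope
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# mAP is the mean of per-category APs:
# map_value = np.mean([average_precision(s_c, tp_c, n_c) for s_c, tp_c, n_c in per_class])
```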

3.3. Parameter Setting

All experiments were conducted on four NVIDIA TITAN Xp GPUs for a total of 36 epochs on each dataset with a batch size of 8. The initial learning rate was 0.0001, increased linearly by 0.3333 every 500 iterations; the momentum was 0.9 with a weight decay of 0.05 using the AdamW optimizer. All datasets were cropped into 512 × 512 slices with an overlap of 256 to preserve details of target scattering features while reducing background clutter.

3.4. Experiments and Analyses

3.4.1. Experiments on SRSDD-v1.0

For the ablation experiments of the backbone network, shown in Table 4, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For experiments of pixel-level fusion methods integrating different features, shown in Table 5, HOG achieved the best improvement of 2.79% in mAP compared to the baseline. Notably, HOG significantly boosted the accuracy for law enforce and dredger, especially for law enforce, with a twofold increase of 8.33%. Furthermore, NSLP improved the category accuracy of bulk cargo by 6.62% and fishing by 7.59%.
For experiments of feature-level fusion (DBAM), shown in Table 5, HOG achieved the highest improvement of 4.08% in mAP compared to the baseline. And it is worth noting that HOG enhanced the category accuracy of law enforce by 20.81%, which is highly significant. Additionally, SIFT achieved an accuracy improvement of 1.03% for bulk cargo, and NSLP realized a 3.69% increase for ore-oil.
For experiments of feature-level fusion (FDRM), shown in Table 5, SIFT achieved an improvement of 10.36% in average accuracy compared to the baseline, representing the highest and most significant increase. Notably, this resulted in an exceptional improvement of 60.02% in category accuracy for law enforce, making it the most substantial enhancement across all feature fusion methods. This indicates that the SAR-SIFT feature, combined with feature fusion levels based on the FDRM, is particularly suitable for detecting law enforce. HOG achieved improvements of 1.52% and 1.61% for dredgers and bulk cargo, respectively. Additionally, NSLP realized a 10.49% increase for ore-oil, while LC improved the category accuracy for containers by 3.2%.
For the experiments of decision-level fusion based on DST, the category weights were set as in Table 6, representing the degree of attention or interest among categories. The confusion matrix and the global and local confidence of the models integrating SAR-HOG are shown in Table 7 and Table 8 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level methods integrating different features, DST fusion maximized the detection accuracy on SRSDD-v1.0, with an increase of 16.52% compared to ST-PA_RCNN, thanks to the significant improvement of category accuracy for the few-shot categories.

3.4.2. Experiments on GF3-ADD

For the ablation experiments of the backbone network, shown in Table 9, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For the experiments of pixel-level fusion methods integrating different features, shown in Table 10, SIFT achieved the best improvement of 6.46% in mAP compared to the baseline. Specifically, SIFT significantly increased the accuracy for carriers, with an increase of 7.18%. NSLP improved the category accuracy of fighters by 5.69% and airliners by 7.51%.
For experiments of feature-level fusion (DBAM), shown in Table 10, HOG achieved the highest improvement of 2.28% in mAP compared to the baseline, which enhanced the category accuracy of fighters by 5.68% and carriers by 1.01%. Additionally, HOG achieved accuracy improvements of 7.77% for fighters and 1.00% for airliners, but decreased in accuracy for carriers.
For experiments of feature-level fusion (FDRM), shown in Table 10, SIFT achieved an improvement of 3.64% in average accuracy compared to the baseline, representing the highest increase. Notably, it resulted in a significant improvement of 7.91% in category accuracy for carriers. Additionally, LC realized a 4.31% increase for fighters.
For the experiments of decision-level fusion based on DST, the category weights were set as in Table 11, representing the degree of attention or interest among categories. The confusion matrix and the global and local confidence of the models integrating SAR-HOG are shown in Table 12 and Table 13 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level methods integrating different features, DST fusion maximized the detection accuracy on GF3-ADD, with an increase of 7.1% compared to ST-PA_RCNN.

3.4.3. Experiments on MSTAR-VDD

For the ablation experiments of the backbone network, shown in Table 14, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For experiments of pixel-level fusion methods integrating different features, shown in Table 15, HOG achieved the best improvement of 2.11% in mAP compared to the baseline. Specifically, HOG realized the highest accuracy for the amphibious, dozer, and truck categories, especially for amphibious, with an increase of 5.98%. Furthermore, NSLP improved the category accuracy of self-propelled artillery by 2.67%.
For the experiments of feature-level fusion (DBAM), shown in Table 15, HOG achieved the highest improvement of 1.06% in mAP compared to the baseline. Specifically, HOG realized the highest accuracy for tanks, self-propelled artillery, and dozers, with self-propelled artillery and dozers increasing by 1.83% and 1.72%, respectively. Additionally, SIFT improved the category accuracy of trucks by 2.82%.
For experiments of feature-level fusion (FDRM), shown in Table 15, SIFT achieved the highest improvement of 0.53% in average accuracy compared to the baseline. This resulted in the highest accuracy for the tank and self-propelled artillery categories and increased the category accuracy of the self-propelled artillery, amphibious, and dozer categories by 0.92%, 0.71%, and 0.60%. Additionally, NSLP improved the category accuracy of self-propelled artillery by 1.00% and dozer by 0.96%, while LC improved the category accuracy for amphibious by 0.72%. It is worth noting that the mAP of the baseline by feature-level fusion (FDRM) has reached 97.75%, which is already the highest among other baselines from other fusion methods with an increase of 2.03%, leaving relatively limited room for accuracy improvement from fusing features.
For the experiments of decision-level fusion based on DST, the category weights were set as in Table 16, representing the degree of attention or interest among categories. The confusion matrix and the global and local confidence of the models integrating SAR-HOG are shown in Table 17 and Table 18 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level methods integrating different features, DST fusion maximized the detection accuracy on MSTAR-VDD, with an increase of 3.19% compared to ST-PA_RCNN, thanks to the significant improvement of category accuracy for the few-shot categories.

4. Discussion

Due to the unique imaging mechanism of SAR, targets in SAR images present complex scattering characteristics, especially against diversified backgrounds. There are three main challenges for intelligent SAR target detection: insufficient exploitation of target characteristics, inefficient characterization of scattering features by data-driven methods, and inadequate reliability of model decisions.
In this paper, we propose an intelligent SAR target detection method based on multi-level fusion for better performance in complex backgrounds, which consists of pixel-level, feature-level, and decision-level fusion.
For the exploitation of target characteristics, four texture features were selected for feature fusion in our experiments, which are SAR-SIFT, SAR-HOG, NSLP, and LC. Since texture features are in image form, they can provide richer detail information and exhibit great adaptability to various fusion methods, as compared to point and vector features. Furthermore, for extracting deeper abstract features, ST-PA_RCNN is designed as the backbone network based on Oriented RCNN by replacing the feature extractor with a Swin Transformer and improving the FPN with path augmentation. All ablation experiments of ST-PA_RCNN on SRSDD-v1.0, GF3-ADD, and MSTAR-VDD validate its effectiveness and robustness, where ST-PA_RCNN achieve the highest mAP among other backbones.
To strengthen feature characterization, the pixel-level fusion method provides an initial exploration of image fusion: ST-PA_RCNN, equipped with the designed scattering feature enhancement module, integrates the scattering-enhanced feature images with the original images through channel fusion. In this setting, HOG achieves the highest mAP increase of 2.79% in ship detection and 2.11% in vehicle detection, while SIFT achieves the highest mAP increase of 6.46% in aircraft detection.
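A minimal sketch of this channel-fusion idea is given below: a gradient-orientation texture image (a HOG visualization from scikit-image, standing in for the SAR-HOG feature) is stacked with the original amplitude image along the channel dimension before being fed to the detector. The HOG parameters and the two-channel layout are illustrative assumptions; the designed scattering feature enhancement module itself is not reproduced.

```python
import numpy as np
from skimage.feature import hog

def channel_fuse(sar_amplitude: np.ndarray) -> np.ndarray:
    """Stack the original SAR amplitude image with a gradient-orientation
    texture image (HOG visualization) as an extra input channel."""
    img = sar_amplitude.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # normalize to [0, 1]

    # HOG visualization used as a stand-in texture feature image.
    _, hog_image = hog(img, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2), visualize=True)
    hog_image = hog_image / (hog_image.max() + 1e-8)

    # (H, W, 2) fused input; further channels (e.g., NSLP or LC) could be appended.
    return np.stack([img, hog_image], axis=-1)

fused = channel_fuse(np.random.rand(256, 256))
print(fused.shape)        # (256, 256, 2)
```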
To further enhance feature mining and characterization, two feature-level fusion methods are applied through their respective migratable fusion blocks, the DBAM and the FDRM, representing a higher level of fusion than the pixel level. The DBAM performs attention-based fusion between the features extracted from the original images and those extracted from the enhanced feature images, while the FDRM reduces feature redundancy and enables more complete feature learning through reweighting. In the feature-level fusion experiments, HOG (DBAM) achieves the highest mAP increase of 4.36% in ship detection and 2.04% in vehicle detection; SIFT (FDRM) achieves the highest mAP increase of 12.27% in ship detection, 6.47% in aircraft detection, and 2.56% in vehicle detection; and LC (DBAM) achieves the highest mAP increase of 4.49% in aircraft detection. It is also worth noting that the FDRM baseline outperforms the pixel-level and DBAM baselines on all three experimental datasets, validating the effectiveness and robustness of the proposed design.
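The following PyTorch sketch shows a CBAM-style attention fusion block in the spirit of the DBAM: features from the original-image branch and the enhanced-feature branch are concatenated, reweighted by channel and spatial attention, and projected back to the original width. The layer choices and tensor sizes are assumptions for illustration and do not reproduce the exact DBAM or FDRM designs.

```python
import torch
import torch.nn as nn

class AttentionFusionBlock(nn.Module):
    """Fuse two same-shaped feature maps with channel and spatial attention."""

    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_img, x_feat):
        x = torch.cat([x_img, x_feat], dim=1)            # (B, 2C, H, W)

        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[..., None, None]

        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(s))

        return self.project(x)                           # back to (B, C, H, W)

a, b = torch.randn(2, 1, 256, 32, 32)                    # two branch feature maps
print(AttentionFusionBlock()(a, b).shape)                # torch.Size([1, 256, 32, 32])
```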
The pixel-level and feature-level experiments indicate that the highest accuracy for different categories requires a specific combination of fused feature and fusion method, which suggests that the best overall performance can be obtained by combining effective models through decision-level fusion. To improve the reliability of model decisions, the decision-level fusion method based on DST for multi-model integration represents the highest level of fusion through proposition setting and statistical analysis. It can not only consolidate the complementary strengths of different models but also incorporate human or expert knowledge into the propositions to guide effective decision making. In the experiments on typical target detection datasets, the proposed method increases the mAP by 16.52%, 7.1%, and 3.19% in ship, aircraft, and vehicle target detection, respectively, demonstrating high effectiveness and robustness.
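For completeness, the sketch below shows Dempster's rule of combination over a frame consisting of target classes plus an ignorance hypothesis, fusing the outputs of three models for a single detection. The class names follow SRSDD-v1.0, while the mass values are hypothetical; the paper's proposition setting is reduced to singleton hypotheses for brevity.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions given as dicts over singleton class names
    plus 'Theta' (the whole frame); conflicting mass is renormalized away."""
    combined, conflict = {}, 0.0
    for h1, v1 in m1.items():
        for h2, v2 in m2.items():
            if h1 == "Theta":
                h = h2
            elif h2 == "Theta" or h1 == h2:
                h = h1
            else:                       # incompatible singleton hypotheses
                conflict += v1 * v2
                continue
            combined[h] = combined.get(h, 0.0) + v1 * v2
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

# Hypothetical masses for one detection from the three fused models
# (pixel-level, DBAM, and FDRM), each keeping some mass on 'Theta'.
models = [
    {"Fishing": 0.55, "Bulk Cargo": 0.30, "Theta": 0.15},
    {"Fishing": 0.60, "Bulk Cargo": 0.25, "Theta": 0.15},
    {"Fishing": 0.50, "Bulk Cargo": 0.35, "Theta": 0.15},
]

fused = models[0]
for m in models[1:]:
    fused = dempster_combine(fused, m)

decision = max((h for h in fused if h != "Theta"), key=fused.get)
print(decision, {h: round(v, 3) for h, v in fused.items()})
```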

5. Conclusions

In this paper, an intelligent SAR target detection method based on multi-level fusion is proposed to address the insufficient exploitation of target characteristics and the inadequate reliability of model decision making. Four texture features (SAR-SIFT, SAR-HOG, NSLP, and LC) are employed to enhance the scattering feature representation, while ST-PA_RCNN, which integrates a Swin Transformer-based feature extractor and a path-augmented FPN, serves as the backbone network to extract deeper abstract features. Experimental evaluations on SRSDD-v1.0, GF3-ADD, and MSTAR-VDD confirm the robust performance and high accuracy of ST-PA_RCNN, which surpasses the other backbones in terms of mAP. In the multi-level fusion process, pixel-level fusion provides an initial exploration for integrating scattering-enhanced features through channel fusion; the two feature-level fusion methods, DBAM and FDRM, facilitate higher-level feature aggregation by emphasizing attention mechanisms and reducing redundancy; and decision-level fusion based on DST effectively integrates complementary model outputs and accommodates expert involvement, thereby improving the mAP by 16.52%, 7.1%, and 3.19% in ship, aircraft, and vehicle target detection, respectively.
Overall, these findings underscore the synergistic potential of combining deep learning with carefully designed fusion strategies across multiple levels. Future research may explore more advanced pixel-level fusion methods, expand the repertoire of feature-level fusion blocks, and refine decision-level fusion propositions to accommodate diverse scattering features and application scenarios, ultimately laying a stronger theoretical and practical foundation for robust SAR target detection.

Author Contributions

Conceptualization, Q.L.; methodology, Q.L.; software, Q.L.; validation, Q.L. and C.Z.; formal analysis, Q.L. and Z.Y.; investigation, Q.L. and Z.Y.; resources, Q.L. and C.Z.; data curation, Q.L., C.Z. and D.O.; writing—original draft preparation, Q.L.; writing—review and editing, H.W. and D.G.; visualization, Q.L. and C.Z.; supervision, H.W.; funding acquisition, H.W. and D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62271153), the Natural Science Foundation of Shanghai (Grant No. 22ZR1406700), the Open Fund of the National Key Laboratory of Scattering and Radiation, Shanghai Radio Equipment Research Institute (Grant No. 802NKL2023-002), and the Shanghai Science and Technology Commission Project (Grant No. XTCX-KJ-2023-2-04).

Data Availability Statement

The publicly available dataset SRSDD-V1.0 can be accessed at https://github.com/HeuristicLU/SRSDD-V1.0 (accessed on 12 August 2021). The Mix-MSTAR dataset can be obtained by contacting Zhigang Liu ([email protected]). For other experimental data, please contact Qiaoyu Liu ([email protected]).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43. [Google Scholar] [CrossRef]
  2. Xu, F.; Jin, Y.-Q. Microwave vision and intelligent perception of radar imagery. J. Radars 2024, 13, 285–306. [Google Scholar]
  3. Kreithen, D.E.; Halversen, S.D.; Owirka, G.J. Discriminating targets from clutter. Linc. Lab. J. 1993, 6, 25–52. [Google Scholar]
  4. Novak, L.M.; Halversen, S.D.; Owirka, G.J.; Hiett, M. Effects of polarization and resolution on the performance of a SAR automatic target recognition system. Linc. Lab. J. 1995, 8, 49–68. [Google Scholar]
  5. Novak, L.M.; Owirka, G.J.; Netishen, C.M. Performance of a high-resolution polarimetric SAR automatic target recognition system. Linc. Lab. J. 1993, 6, 11–24. [Google Scholar]
  6. Luo, R.; Zhao, L.; He, Q.; Ji, K.; Kuang, G. Intelligent technology for aircraft detection and recognition through SAR imagery: Advancements and prospects. J. Radars 2024, 13, 307–330. [Google Scholar]
  7. Chen, S.; Wang, H.; Xu, F.; Jin, Y.-Q. Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4806–4817. [Google Scholar] [CrossRef]
  8. Wang, Y.; Liu, H. A hierarchical ship detection scheme for high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2012, 50, 4173–4184. [Google Scholar] [CrossRef]
  9. Cui, J.; Jia, H.; Wang, H.; Xu, F. A Fast Threshold Neural Network for Ship Detection in Large-Scene SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6016–6032. [Google Scholar] [CrossRef]
  10. Li, W.; Ma, P.; Wang, H.; Fang, C. SAR-TSCC: A novel approach for long time series SAR image change detection and pattern analysis. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5203016. [Google Scholar] [CrossRef]
  11. Kahar, S.; Hu, F.; Xu, F. A Novel Background Removal Method for High-Cluttered Environments Using SAR Time Series. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3696–3699. [Google Scholar]
  12. Huang, Z.; Yao, X.; Han, J. Progress and Perspective on Physically Explainable Deep Learning for Synthetic Aperture Radar Image Interpretation. J. Radars 2022, 11, 107–125. [Google Scholar]
  13. Lin, S.; Chen, T.; Huang, X.; Chen, S. Synthetic aperture radar image aircraft detection based on target spatial imaging characteristics. J. Electron. Imaging 2023, 32, 021608. [Google Scholar] [CrossRef]
  14. Fu, K.; Dou, F.-Z.; Li, H.-C.; Diao, W.-H.; Sun, X.; Xu, G.-L. Aircraft recognition in SAR images based on scattering structure feature and template matching. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4206–4217. [Google Scholar] [CrossRef]
  15. Gao, G. Statistical modeling of SAR images: A survey. Sensors 2010, 10, 775–795. [Google Scholar] [CrossRef]
  16. Steenson, B.O. Detection performance of a mean-level threshold. IEEE Trans. Aerosp. Electron. Syst. 1968, AES-4, 529–534. [Google Scholar] [CrossRef]
  17. Novak, L.M.; Owirka, G.J.; Brower, W.S.; Weaver, A.L. The automatic target-recognition system in SAIP. Linc. Lab. J. 1997, 10, 187–202. [Google Scholar]
  18. Finn, H.M. Adaptive detection mode with threshold control as a function of spatially sampled clutter-level estimates. RCA Rev. 1968, 29, 414–465. [Google Scholar]
  19. Hansen, V.G.; Sawyers, J.H. Detectability loss due to “greatest of” selection in a cell-averaging CFAR. IEEE Trans. Aerosp. Electron. Syst. 1980, AES-16, 115–118. [Google Scholar] [CrossRef]
  20. Trunk, G.V. Range resolution of targets using automatic detectors. IEEE Trans. Aerosp. Electron. Syst. 1978, AES-14, 750–755. [Google Scholar] [CrossRef]
  21. Du, L.; Wang, Z.; Wang, Y.; Wei, D.; Li, L. Survey of research progress on target detection and discrimination of single-channel SAR images for complex scenes. J. Radars 2020, 9, 34–54. [Google Scholar]
  22. Jianxiong, Z.; Zhiguang, S.; Xiao, C.; Qiang, F. Automatic target recognition of SAR images based on global scattering center model. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3713–3729. [Google Scholar] [CrossRef]
  23. Ross, T.D.; Worrell, S.W.; Velten, V.J.; Mossing, J.C.; Bryant, M.L. Standard SAR ATR evaluation experiments using the MSTAR public release data set. In Proceedings of the Algorithms for Synthetic Aperture Radar Imagery V, Orlando, FL, USA, 13–17 April 1998; pp. 566–573. [Google Scholar]
  24. Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
  25. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  26. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282. [Google Scholar]
  27. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386. [Google Scholar] [CrossRef]
  28. Rennie, J.D.; Shih, L.; Teevan, J.; Karger, D.R. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 616–623. [Google Scholar]
  29. Zhang, L.; Li, C.; Zhao, L.; Xiong, B.; Quan, S.; Kuang, G. A cascaded three-look network for aircraft detection in SAR images. Remote Sens. Lett. 2020, 11, 57–65. [Google Scholar] [CrossRef]
  30. An, Q.; Pan, Z.; Liu, L.; You, H. DRBox-v2: An improved detector with rotatable boxes for target detection in SAR images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8333–8349. [Google Scholar] [CrossRef]
  31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. Liu, L.; Chen, G.; Pan, Z.; Lei, B.; An, Q. Inshore ship detection in SAR images based on deep neural networks. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 25–28. [Google Scholar]
  33. Li, J.; Qu, C.; Peng, S.; Jiang, Y. Ship Detection in SAR images Based on Generative Adversarial Network and Online Hard Examples Mining. J. Electron. Inf. Technol. 2019, 41, 143–149. [Google Scholar]
  34. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  35. Sun, X.; Wang, R.; Sun, Y.; Diao, W.; Zhang, Y.; Fu, K. AIR-SARShip-1.0: High-resolution SAR ship detection dataset. J. Radars 2019, 8, 852–862. [Google Scholar]
  36. Du, L.; Liu, B.; Wang, Y.; Liu, H.; Dai, H. Target Detection Method Based on Convolutional Neural Network for SAR Image. J. Electron. Inf. Technol. 2016, 38, 3018–3025. [Google Scholar]
  37. Li, Y.; Du, L.; Du, Y. Convolutional neural network based on feature decomposition for target detection in SAR images. J. Radars 2023, 12, 1069–1080. [Google Scholar]
  38. Wang, S.; Cai, Z.; Yuan, J. Automatic SAR Ship Detection Based on Multifeature Fusion Network in Spatial and Frequency Domains. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4102111. [Google Scholar] [CrossRef]
  39. He, C.; Tu, M.; Xiong, D.; Tu, F.; Liao, M. A component-based multi-layer parallel network for airplane detection in SAR imagery. Remote Sens. 2018, 10, 1016. [Google Scholar] [CrossRef]
  40. Li, M.; Wen, G.; Huang, X.; Li, K.; Lin, S. A lightweight detection model for sar aircraft in a complex environment. Remote Sens. 2021, 13, 5020. [Google Scholar] [CrossRef]
  41. Ginner, L.; Gesperger, J.; Wöhrer, A.; Drexler, W.; Baumann, B.; Leitgeb, R.; Salas, M.; Lichtenegger, A.; Niederleithner, M. Ex-vivo Alzheimer's disease brain tissue investigation: A multiscale approach using 1060-nm swept source optical coherence tomography for a direct correlation to histology. Neurophotonics 2020, 7, 035004. [Google Scholar]
  42. Liu, Z.; Luo, S.; Wang, Y. Mix MSTAR: A synthetic benchmark dataset for multi-class rotation vehicle detection in large-scale SAR images. Remote Sens. 2023, 15, 4558. [Google Scholar] [CrossRef]
  43. Guo, L. SAR image classification based on multi-feature fusion decision convolutional neural network. IET Image Process. 2022, 16, 1–10. [Google Scholar] [CrossRef]
  44. Tang, Y.; Chen, J. A multi-view SAR target recognition method using feature fusion and joint classification. Remote Sens. Lett. 2022, 13, 631–642. [Google Scholar] [CrossRef]
  45. Liu, J.; Liu, Z.; Zhang, Z.; Wang, L.; Liu, M. A New Causal Inference Framework for SAR Target Recognition. IEEE Trans. Artif. Intell. 2024, 5, 4042–4057. [Google Scholar] [CrossRef]
  46. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  47. Dellinger, F.; Delon, J.; Gousseau, Y.; Michel, J.; Tupin, F. SAR-SIFT: A SIFT-Like Algorithm for SAR Images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 453–466. [Google Scholar] [CrossRef]
  48. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  49. Song, S.; Xu, B.; Yang, J. SAR Target Recognition via Supervised Discriminative Dictionary Learning and Sparse Representation of the SAR-HOG Feature. Remote Sens. 2016, 8, 683. [Google Scholar] [CrossRef]
  50. Cunha, A.L.D.; Zhou, J.; Do, M.N. The Nonsubsampled Contourlet Transform: Theory, Design, and Applications. IEEE Trans. Image Process. 2006, 15, 3089–3101. [Google Scholar] [CrossRef]
  51. Lu, Z.; Yang, G.; Yang, J.; Wang, Y. An Adaptive Arbitrary Multiresolution Decomposition for Multiscale Geometric Analysis. IEEE Trans. Multimed. 2021, 23, 2883–2893. [Google Scholar] [CrossRef]
  52. Zhang, C.; Liu, P.; Wang, H.; Jin, Y. Saliency-Based Centernet for Ship Detection in SAR Images. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 1552–1555. [Google Scholar]
  53. Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. (CSUR) 2009, 41, 1–41. [Google Scholar] [CrossRef]
  54. Zhang, Y. Understanding image fusion. Photogramm. Eng. Remote Sens. 2004, 70, 657–661. [Google Scholar]
  55. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  56. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  57. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  58. Ghassemian, H. A review of remote sensing image fusion methods. Inf. Fusion 2016, 32, 75–89. [Google Scholar] [CrossRef]
  59. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  60. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume VII 14, pp. 499–515. [Google Scholar]
  61. Stehman, S.V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 1997, 62, 77–89. [Google Scholar] [CrossRef]
  62. Lei, S.; Lu, D.; Qiu, X.; Ding, C. SRSDD-v1.0: A high-resolution SAR rotation ship detection dataset. Remote Sens. 2021, 13, 5104. [Google Scholar] [CrossRef]
  63. MSTAR Public Dataset. Available online: https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 10 March 2011).
Figure 1. Relationship of methods driven by model, data, and target characteristics.
Figure 2. Structure of intelligent method based on multi-level image fusion for target detection in SAR images.
Figure 3. Processing levels of image fusion.
Figure 4. Structure of backbone network.
Figure 5. Structure of PA-FPN.
Figure 6. Structure of ST-PA_RCNN based on pixel-level fusion.
Figure 7. Channel fusion.
Figure 8. Structure of DBAM ST-PA_RCNN.
Figure 9. Structure of DB-PA-FPN.
Figure 10. Structure of attention mechanism fusion block.
Figure 11. Structure of FDRM ST-PA_RCNN.
Figure 12. Structure of FDRM fusion block.
Figure 13. Structure of decision-level fusion based on DST.
Figure 14. Examples of different scene slices in SRSDD-v1.0.
Figure 15. Examples of different scene slices in GF3-ADD.
Figure 16. Details in fine-to-broad categorization.
Figure 17. Examples of different scene slices in MSTAR-VDD.
Table 1. Category information of SRSDD-v1.0.
Category | Ore-Oil | Bulk Cargo | Fishing | Law Enforce | Dredger | Container
Number | 166 | 2053 | 288 | 25 | 263 | 89
Table 2. Category information of GF3-ADD.
Category | Fighter | Carrier | Airliner
Number | 319 | 377 | 969
Table 3. Category information of MSTAR-VDD.
Category | Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck
Number30445481252274274
Table 4. Ablation experiments of backbone network on SRSDD-v1.0.
Algorithm | Ore-Oil | Container | Fishing | Law Enforce | Dredger | Bulk Cargo | mAP (%)
Rotated Retinanet | 31.41 | 56.60 | 24.72 | 8.10 | 66.10 | 4.90 | 39.32
R3Det | 54.10 | 34.84 | 21.03 | 1.09 | 82.21 | 78.51 | 45.30
Oriented RCNN | 43.41 | 52.80 | 34.63 | 4.29 | 71.27 | 79.63 | 47.67
ST-PA_RCNN | 54.15 | 66.72 | 37.63 | 6.71 | 85.84 | 68.03 | 53.18
Table 5. Detection results of different fusion levels integrating different features on SRSDD-v1.0.
Fusion Level | Model | Ore-Oil | Container | Fishing | Law Enforce | Dredger | Bulk Cargo | mAP (%)
Pixel Level | Baseline * | 54.12 | 66.72 | 37.63 | 6.71 | 85.84 | 68.03 | 53.18
Pixel Level | HOG ** | 59.92 | 65.13 | 39.12 | 15.04 (+8.33) | 87.20 (+1.36) | 69.40 | 55.97 (+2.79)
Pixel Level | SIFT ** | 59.01 | 65.80 | 38.40 | 6.73 | 85.63 | 64.03 | 53.27
Pixel Level | NSLP ** | 60.74 (+6.62) | 62.61 | 45.22 (+7.59) | 5.02 | 86.21 | 70.31 (+2.28) | 55.02
Pixel Level | LC ** | 60.23 | 65.72 | 39.20 | 13.60 | 82.01 | 64.30 | 54.18
Feature-Level DBAM | Baseline ⁺ | 54.71 (+0.59) | 63.60 (−3.12) | 39.60 (+1.97) | 6.71 (+0.00) | 88.81 (+2.97) | 67.30 (−0.73) | 53.46 (+0.28)
Feature-Level DBAM | HOG | 57.10 | 65.21 (−3.12) (+1.61) | 39.10 | 27.52 | 88.51 | 67.82 | 57.54 (+0.28) (+4.08)
Feature-Level DBAM | SIFT | 53.32 | 62.20 | 39.91 (+1.97) (+0.31) | 15.02 | 84.72 | 68.33 (−0.73) (+1.03) | 53.92
Feature-Level DBAM | NSLP ⁺⁺ | 58.40 (+0.59) (+3.69) | 62.93 | 36.54 | 30.21 (+0.00) (+23.5) | 86.03 | 66.30 | 56.74
Feature-Level DBAM | LC | 57.72 | 64.52 | 38.20 | 15.04 | 84.33 | 68.20 | 54.67
Feature-Level FDRM | Baseline | 47.91 (−6.21) | 65.00 (−1.72) | 42.92 (+5.29) | 20.02 (+13.31) | 86.10 (+0.26) | 68.61 (+0.58) | 55.09 (+1.91)
Feature-Level FDRM | HOG | 54.80 | 65.44 | 39.30 | 60.03 | 87.62 (+0.26) (+1.52) | 70.22 (+0.58) (+1.61) | 62.90
Feature-Level FDRM | SIFT | 52.92 | 66.40 | 38.22 | 80.04 (+13.31) (+60.02) | 86.20 | 68.94 | 65.45 (+1.91) (+10.36)
Feature-Level FDRM | NSLP | 58.40 (−6.21) (+10.49) | 62.93 | 36.50 | 30.21 | 86.03 | 66.31 | 56.73
Feature-Level FDRM | LC | 55.33 | 68.20 (−1.72) (+3.20) | 40.22 | 38.71 | 87.50 | 67.12 | 59.51
Decision Level | DS *** ⁺⁺⁺ | 62.11 (+7.99) | 68.85 (+2.13) | 43.67 (+6.04) | 81.23 (+74.52) | 89.03 (+3.19) | 73.31 (+5.28) | 69.70 (+16.52)
* Baseline refers to ST-PA_RCNN based on its affiliated fusion level without feature fusion. ** HOG, SIFT, NSLP, and LC refer to the models that integrate SAR-HOG, SAR-SIFT, NSLP, and LC, respectively. *** DS refers to the proposed decision-level fusion method based on DST. ⁺ A single value in parentheses indicates the increase (+) or decrease (−) compared to the pixel-level baseline. ⁺⁺ Two values in parentheses indicate the changes compared to the pixel-level baseline (first value) and to the baseline of the affiliated fusion level (second value), respectively. ⁺⁺⁺ The values in parentheses indicate the changes compared to the pixel-level baseline.
Table 6. Category weight setting for SRSDD-v1.0.
Weight | Ore-Oil | Container | Fishing | Law Enforce | Dredger | Bulk Cargo
w_i^c | 0.18 | 0.18 | 0.1 | 0.18 | 0.18 | 0.18
Table 7. Confusion matrix of models integrating SAR-HOG feature on SRSDD-v1.0.
Fusion Level | Category | Bulk Cargo | Container | Dredger | Fishing | Law Enforce | Ore-Oil
Pixel Level | Bulk Cargo | 0.4286 | 0.2429 | 0 | 0.0286 | 0 | 0.18
Pixel Level | Container | 0.0009 | 0.5990 | 0.0085 | 0 | 0 | 0.0085
Pixel Level | Dredger | 0 | 0.0806 | 0.7258 | 0 | 0 | 0.0242
Pixel Level | Fishing | 0 | 0.0989 | 0 | 0.2857 | 0 | 0
Pixel Level | Law Enforce | 0 | 0.0909 | 0 | 0 | 0.6364 | 0
Pixel Level | Ore-oil | 0.0404 | 0.0606 | 0.0707 | 0 | 0 | 0.4242
Feature-Level DBAM | Bulk Cargo | 0.4478 | 0.2239 | 0 | 0.0149 | 0 | 0
Feature-Level DBAM | Container | 0.0053 | 0.5471 | 0.0009 | 0 | 0 | 0.0123
Feature-Level DBAM | Dredger | 0 | 0.088 | 0.704 | 0 | 0 | 0.024
Feature-Level DBAM | Fishing | 0 | 0.1056 | 0 | 0.2389 | 0 | 0
Feature-Level DBAM | Law Enforce | 0 | 0 | 0 | 0 | 0.6 | 0
Feature-Level DBAM | Ore-oil | 0.0404 | 0.0505 | 0.0707 | 0 | 0 | 0.4545
Feature-Level FDRM | Bulk Cargo | 0.4225 | 0.2535 | 0 | 0.0282 | 0 | 0
Feature-Level FDRM | Container | 0.0009 | 0.5552 | 0.0026 | 0 | 0 | 0.0096
Feature-Level FDRM | Dredger | 0 | 0.1094 | 0.6641 | 0 | 0 | 0.0234
Feature-Level FDRM | Fishing | 0 | 0.1198 | 0 | 0.2812 | 0 | 0
Feature-Level FDRM | Law Enforce | 0 | 0 | 0 | 0 | 0.9 | 0
Feature-Level FDRM | Ore-oil | 0.0385 | 0.0962 | 0.0769 | 0 | 0 | 0.4423
Table 8. Global and local confidence of models integrating SAR-HOG feature on SRSDD-v1.0.
Fusion Level | Global Confidence | Local Confidence: Container | Bulk Cargo | Dredger | Fishing | Law Enforce | Ore-Oil
Pixel Level | 0.4987 | 0.6522 | 0.9673 | 0.8627 | 0.6935 | 1.0000 | 0.7377
Feature-Level DBAM | 0.5166 | 0.6122 | 0.9710 | 0.8738 | 0.7428 | 0.8750 | 0.7119
Feature-Level FDRM | 0.5442 | 0.6000 | 0.9769 | 0.8334 | 0.7012 | 1.0000 | 0.6764
Table 9. Ablation experiments of backbone network on GF3-ADD.
Algorithm | Fighter | Carrier | Airliner | mAP (%)
Rotated Retinanet | 72.10 | 85.12 | 88.20 | 81.47
R3Det | 70.42 | 84.33 | 90.71 | 81.82
Oriented RCNN | 82.31 | 77.03 | 90.31 | 83.22
ST-PA_RCNN | 89.62 | 90.13 | 85.60 | 88.45
Table 10. Detection results of different fusion levels integrating different features on GF3-ADD.
Fusion Level | Model | Fighter | Carrier | Airliner | mAP (%)
Pixel Level | Baseline | 89.62 | 90.13 | 85.60 | 88.45
Pixel Level | HOG | 92.71 | 93.12 | 92.22 | 92.68
Pixel Level | SIFT | 95.22 | 97.31 (+7.18) | 92.21 | 94.91 (+6.46)
Pixel Level | NSLP | 95.31 (+5.69) | 81.43 | 93.11 (+7.51) | 89.95
Pixel Level | LC | 87.54 | 94.33 | 92.11 | 91.33
Feature-Level DBAM | Baseline | 87.54 (−2.08) | 92.33 (+2.20) | 92.11 (+6.51) | 90.66 (+2.21)
Feature-Level DBAM | HOG | 95.31 (−2.08) (+7.77) | 89.43 | 93.11 (+6.51) (+1.00) | 92.62
Feature-Level DBAM | SIFT | 94.62 | 90.15 | 85.60 | 90.12
Feature-Level DBAM | NSLP | 92.71 | 93.12 | 92.22 | 92.68
Feature-Level DBAM | LC | 93.22 | 93.34 (+2.20) (+1.01) | 92.27 | 92.94 (+2.21) (+2.28)
Feature-Level FDRM | Baseline | 91.31 (+1.69) | 89.43 (−0.70) | 93.11 (+7.51) | 91.28 (+2.83)
Feature-Level FDRM | HOG | 92.57 | 94.37 | 92.12 | 93.02
Feature-Level FDRM | SIFT | 95.22 | 97.34 (−0.70) (+7.91) | 92.21 | 94.92 (+2.83) (+3.64)
Feature-Level FDRM | NSLP | 92.78 | 93.12 | 91.22 | 92.37
Feature-Level FDRM | LC | 95.62 (+1.69) (+4.31) | 90.13 | 85.60 | 90.45
Decision Level | DS | 96.12 (+6.50) | 96.52 (+6.39) | 94.01 (+8.41) | 95.55 (+7.10)
Table 11. Category weight setting for GF3-ADD.
Weight | Fighter | Carrier | Airliner
w_i^c | 0.4 | 0.4 | 0.2
Table 12. Confusion matrix of models integrating SAR-HOG features on GF3-ADD.
Fusion Level | Category | Fighter | Carrier | Airliner
Pixel Level | Fighter | 0.7043 | 0 | 0
Pixel Level | Carrier | 0 | 0.9713 | 0
Pixel Level | Airliner | 0.0065 | 0 | 0.6753
Feature-Level DBAM | Fighter | 0.6121 | 0 | 0
Feature-Level DBAM | Carrier | 0 | 0.9786 | 0
Feature-Level DBAM | Airliner | 0 | 0 | 0.6519
Feature-Level FDRM | Fighter | 0.6944 | 0 | 0.0093
Feature-Level FDRM | Carrier | 0 | 0.9785 | 0
Feature-Level FDRM | Airliner | 0 | 0 | 0.7902
Table 13. Global and local confidence of models integrating SAR-HOG features on GF3-ADD.
Fusion Level | Global Confidence | Local Confidence: Fighter | Airliner | Carrier
Pixel Level | 0.7836 | 1.0000 | 1.0000 | 0.9905
Feature-Level DBAM | 0.7838 | 1.0000 | 1.0000 | 1.0000
Feature-Level FDRM | 0.8210 | 0.9868 | 1.0000 | 1.0000
Table 14. Ablation experiments of backbone network on MSTAR-VDD.
Algorithm | Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck | mAP (%)
Rotated Retinanet | 88.34 | 82.17 | 85.22 | 86.21 | 69.71 | 82.33
R3Det | 90.40 | 85.50 | 89.60 | 91.70 | 78.30 | 87.10
Oriented RCNN | 99.60 | 90.71 | 92.33 | 95.21 | 89.32 | 93.43
ST-PA_RCNN | 99.61 | 95.43 | 87.72 | 97.92 | 97.90 | 95.72
Table 15. Detection results of different fusion levels integrating different features on MSTAR-VDD.
Fusion Level | Model | Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck | mAP (%)
Pixel Level | Baseline | 99.61 | 95.43 | 87.72 | 97.92 | 97.91 | 95.72
Pixel Level | HOG | 99.63 | 97.11 | 93.70 (+5.98) | 99.90 (+1.98) | 98.80 (+0.89) | 97.83 (+2.11)
Pixel Level | SIFT | 99.54 | 95.80 | 87.34 | 98.40 | 98.42 | 95.90
Pixel Level | NSLP | 99.72 (+0.11) | 98.10 (+2.67) | 92.10 | 96.82 | 98.80 (+0.89) | 97.11
Pixel Level | LC | 99.50 | 94.72 | 91.83 | 98.01 | 97.90 | 96.39
Feature-Level DBAM | Baseline | 99.61 (+0.00) | 95.30 (−0.13) | 96.61 (+8.89) | 96.90 (−1.02) | 95.10 (−2.81) | 96.70 (+0.98)
Feature-Level DBAM | HOG | 99.82 (+0.00) (+0.21) | 97.13 (−0.13) (+1.83) | 96.42 | 98.62 (−1.02) (+1.72) | 96.81 | 97.76 (+0.98) (+1.06)
Feature-Level DBAM | SIFT | 99.73 | 94.40 | 96.04 | 98.22 | 97.92 (−2.81) (+2.82) | 97.26
Feature-Level DBAM | NSLP | 99.64 | 95.21 | 96.62 (+8.89) (+0.01) | 98.30 | 95.71 | 97.10
Feature-Level DBAM | LC | 99.81 | 95.11 | 96.41 | 97.81 | 95.83 | 96.99
Feature-Level FDRM | Baseline | 99.40 (−0.21) | 96.90 (+1.47) | 95.30 (+7.58) | 98.74 (+0.82) | 98.41 (+0.50) | 97.75 (+2.03)
Feature-Level FDRM | HOG | 99.51 | 97.21 | 95.43 | 98.40 | 99.03 (+0.50) (+0.62) | 97.92
Feature-Level FDRM | SIFT | 99.64 (−0.21) (+0.24) | 97.82 (+1.47) (+0.92) | 96.01 | 99.34 | 98.61 | 98.28 (+2.03) (+0.53)
Feature-Level FDRM | NSLP | 99.61 | 97.90 | 94.80 | 99.70 (+0.82) (+0.96) | 98.40 | 98.08
Feature-Level FDRM | LC | 99.62 | 97.61 | 96.02 (+7.58) (+0.72) | 98.40 | 98.21 | 97.97
Decision Level | DS | 99.83 (+0.22) | 98.21 (+2.78) | 97.34 (+9.62) | 99.92 (+2.00) | 99.23 (+1.32) | 98.91 (+3.19)
Table 16. Category weight setting for MSTAR-VDD.
Weight | Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck
w_i^c | 0.25 | 0.25 | 0.25 | 0.1 | 0.15
Table 17. Confusion matrix of models integrating SAR-HOG feature on MSTAR-VDD.
Fusion Level | Category | Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck
Pixel Level | Tank | 0.9881 | 0.0043 | 0.0030 | 0 | 0
Pixel Level | Self-propelled Artillery | 0.0396 | 0.8581 | 0.0264 | 0.0248 | 0.0116
Pixel Level | Amphibious | 0.0790 | 0.0321 | 0.8178 | 0 | 0.0028
Pixel Level | Dozer | 0 | 0.0576 | 0 | 0.8814 | 0.0169
Pixel Level | Truck | 0.0133 | 0.0067 | 0.0133 | 0.0300 | 0.9100
Feature-Level DBAM | Tank | 0.9903 | 0.0043 | 0.0007 | 0.0003 | 0.001
Feature-Level DBAM | Self-propelled Artillery | 0.0407 | 0.8879 | 0.0136 | 0.0034 | 0.0187
Feature-Level DBAM | Amphibious | 0.1109 | 0.0317 | 0.746 | 0.0031 | 0.0145
Feature-Level DBAM | Dozer | 0 | 0.035 | 0.0035 | 0.951 | 0.007
Feature-Level DBAM | Truck | 0.0137 | 0.0068 | 0.0034 | 0.0137 | 0.9486
Feature-Level FDRM | Tank | 0.987 | 0.005 | 0.002 | 0.0003 | 0.0007
Feature-Level FDRM | Self-propelled Artillery | 0.0307 | 0.8891 | 0.0256 | 0.0068 | 0.0085
Feature-Level FDRM | Amphibious | 0.0819 | 0.0251 | 0.8153 | 0.0014 | 0.0047
Feature-Level FDRM | Dozer | 0 | 0.0212 | 0.0071 | 0.9364 | 0.0071
Feature-Level FDRM | Truck | 0.0101 | 0.0169 | 0.0034 | 0.0203 | 0.9291
Table 18. Global and local confidence of models integrating SAR-HOG feature on MSTAR-VDD.
Fusion Level | Global Confidence | Local Confidence: Tank | Self-Propelled Artillery | Amphibious | Dozer | Truck
Pixel Level | 0.8911 | 0.9927 | 0.8934 | 0.8778 | 0.9221 | 0.9350
Feature-Level DBAM | 0.9048 | 0.9937 | 0.9208 | 0.8232 | 0.9543 | 0.9619
Feature-Level FDRM | 0.9114 | 0.9920 | 0.9255 | 0.8782 | 0.9636 | 0.9483
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
