1. Introduction
Synthetic aperture radar (SAR) is an active microwave sensor for high-resolution earth observation that operates in all weather conditions and at all times of day. It forms images by emitting coherent electromagnetic waves toward the surface and receiving the scattered echoes [1]. SAR is critically important in both military and civil applications such as reconnaissance, terrain mapping, and disaster assessment [2].
Before the application of deep learning (DL) to SAR automatic target recognition (ATR), most research focused on the detection, discrimination, and classification processes proposed by Lincoln Laboratory [3,4,5]. These model-driven methods, which rely on mathematical theories and expert knowledge to design shallow models, can offer strong interpretability. However, due to their heavy dependence on manual features and complex parameter adjustments, these models exhibit limited robustness and generalization capabilities across different scenarios [6]. Consequently, developing reliable and efficient predictive models for complex and diverse scenarios has become a crucial research need.
In the era of big data and the rapid advancement of artificial intelligence, deep neural networks (DNNs) have demonstrated exceptional information extraction and processing capabilities in computer vision, achieving remarkable performance in object classification [7], object detection [8,9], change detection [10,11], etc. These data-driven methods, leveraging their powerful feature representation capabilities, can extract higher-level abstract semantic features, resulting in superior performance and robustness. However, these capabilities require the support of high-quality, large-scale datasets. Low-quality image data or an insufficient volume of effective data can exacerbate decision-making uncertainty, leading to more false alarms and missed detections, which does not meet current practical needs for the high-precision detection of SAR image targets. Moreover, because the theoretical foundations for explaining deep learning in SAR image interpretation have not been thoroughly established, frontier developments are still guided by empirical and experimental results rather than theoretical insights [12], resulting in insufficient model interpretability. In response, many target-characteristic-driven methods have been proposed to improve the reliability of the decision model; these methods are specifically designed for SAR images and integrate imaging mechanisms [13], scattering characteristics [14], etc.
Our earlier study explored how to integrate traditional manual features and their scattering information into deep learning algorithms by leveraging the powerful automatic feature extraction and representation capabilities while allowing mathematically interpretable manual features to guide the training and learning process of the model. At first, we carried out feature fusion experiments by image channel concatenation, which is a pixel-level fusion method. Channel fusion is one of the primary approaches for pixel-level image fusion, where different feature channels or image channels are combined to form a comprehensive feature representation. Although this method can enrich the detailed information in the input, it also means that the utilization of scattering information remains at the level of low-level visual details, leading to insufficient exploitation of enhanced scattering features. Therefore, to more thoroughly extract and utilize the information contained in these manual features, we designed and improved our network to first extract abstract information from manual feature images and original images separately and then fuse them at the feature level. This process constitutes a higher level of image fusion: feature-level fusion.
After completing experiments based on pixel-level and feature-level fusion, we found that different manual features fused using different fusion methods have varying impacts on different categories. This means that we cannot rely solely on structural improvements in deep learning algorithms, nor can we depend entirely on traditional scattering features and pattern recognition methods that are fully mathematically interpretable. Instead, we should combine both approaches as much as possible to achieve complementary advantages. In this way, given that decision-level fusion has a strong capacity for integrating complementary, redundant or competing, and synergistic data as well as for different fusion approaches, we proposed a decision-level fusion method. This method maximizes the advantages of both features and fusion strategies across different categories, thereby enabling high-performance and robust target detection in SAR images.
Based on the above discussion, we conducted research on SAR target detection based on multi-level fusion, where pixel-level and feature-level fusion are leveraged for better mining and exploiting scattering features, and decision-level fusion is leveraged for result integration and the improvement of the model’s decision reliability.
The contributions of our work can be summarized as follows:
To address the insufficient exploitation of scattering features, a pixel-level fusion method with an improved backbone network, ST-PA_RCNN, and a scattering feature enhancement module is proposed, representing an initial exploration of image fusion through the channel fusion of original images and their features.
To further enhance feature mining and characterization, two feature-level fusion methods are developed based on two transferable fusion blocks, namely DBAM and FDRM. These represent a higher fusion level than pixel-level fusion, integrating abstract features through network design.
To address the inadequate reliability of decision making in DL-based methods, a decision-level fusion method based on DST is proposed for multi-model integration. It not only consolidates the complementary strengths of different models but also incorporates human or expert involvement in proposition formulation to guide effective decision making.
The proposed method was validated on typical ground and maritime surface target detection datasets, achieving mAP increases of 16.52%, 7.1%, and 3.19% for ships, aircraft, and vehicles, respectively, demonstrating the method's effectiveness and robustness.
2. Materials and Methods
2.1. Related Work
With the gradual maturation of SAR imaging hardware and algorithms, achieving high-precision SAR automatic target recognition (SAR ATR) of typical maritime surface targets is of significant importance. The related work is primarily divided into three parts: model-driven methods, data-driven methods, and target-characteristic-driven methods.
2.1.1. Model-Driven Methods
Traditional model-driven SAR ATR methods rely on mathematical theories and expert knowledge for background clutter modeling, manual feature extraction, and classifier design, and are divided into three stages: detection, discrimination, and classification. The detection stage, usually implemented based on background clutter modeling including Gaussian, Rayleigh, exponential, and K-distribution [15], is conducted using a Constant False Alarm Rate (CFAR) detector [16] and its numerous improved forms, such as Two-Parameter CFAR (TP-CFAR) [17], CA-CFAR [18], GO-CFAR [19], and SO-CFAR [20]. In the discrimination stage, feature information is extracted to distinguish between targets and clutter, which is essentially a binary classification problem aiming to obtain as many regions of interest (RoIs) as possible. Finally, the classification stage processes the RoIs through feature extraction and classifier design to further eliminate false alarms [21], ultimately obtaining class prediction information.
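As a rough illustration of the detection stage, the sketch below implements a basic cell-averaging CFAR on an intensity image. It assumes exponentially distributed clutter intensity (one of the clutter models listed above), for which the threshold multiplier has a closed form. The function name `ca_cfar_2d` and the `guard`, `train`, and `pfa` parameters are illustrative choices and are not taken from the cited works.

```python
import numpy as np

def ca_cfar_2d(img, guard=2, train=4, pfa=1e-3):
    """Cell-averaging CFAR on a 2-D intensity image (illustrative sketch).

    guard: half-width of the guard window around the cell under test.
    train: thickness of the training ring outside the guard window.
    pfa:   desired false-alarm probability (exponential clutter assumed).
    """
    img = np.asarray(img, dtype=float)
    half = guard + train
    # Training cells = full window minus the guard window (which holds the test cell).
    n_train = (2 * half + 1) ** 2 - (2 * guard + 1) ** 2
    # Closed-form multiplier for exponentially distributed clutter intensity.
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)
    padded = np.pad(img, half, mode="reflect")
    detections = np.zeros(img.shape, dtype=bool)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            window = padded[r:r + 2 * half + 1, c:c + 2 * half + 1]
            guard_win = window[train:train + 2 * guard + 1,
                               train:train + 2 * guard + 1]
            clutter_mean = (window.sum() - guard_win.sum()) / n_train
            detections[r, c] = img[r, c] > alpha * clutter_mean
    return detections
```

GO-CFAR and SO-CFAR differ mainly in how the clutter level is estimated from the training cells (greatest or smallest of the sub-window means instead of the overall mean), so the same loop structure would apply.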
In short, the main parts of traditional SAR ATR are manual feature extraction and classifier design. Manual features, such as geometric features, texture features, grayscale statistical features, and angular features, are input into model-driven methods based on template matching or Machine Learning (ML) [22]. Among template-matching classifiers, statistical pattern recognition proposed by Ross et al. [23] is the most typical, predicting by comparison to a standard template base. ML-based classifiers perform predictions using support vector machines (SVMs) [24], Decision Trees (DTs) [25], Random Forests (RFs) [26], Perceptrons [27], and Naive Bayes [28].
In model-driven methods, the applicability of traditional template-based or ML-based classifiers has become increasingly limited, especially in the face of increasingly complex and diverse detection and recognition scenarios. However, manual features retain application potential due to their relatively strong interpretability and reliability.
2.1.2. Data-Driven Methods
Nowadays, the advancement of deep learning has led to its extensive application in the field of SAR ATR, owing to its superior capabilities in extracting and mining features. Chen et al. [7] were the first to apply deep neural networks to the target classification field of SAR images in 2014, and the proposed A-ConvNets achieved an average accuracy of 99% on the MSTAR ten-class dataset. Zhang et al. [29] proposed a cascaded three-view network that combines the advantages of Faster R-CNN and residual units in feature extraction. An et al. [30] alleviated the issue of positive and negative sample imbalance by applying Focal Loss (FL) [31] and a hard negative mining module to their designed rotated bounding box SAR image target detection method. Liu et al. [32] constructed a multi-scale fully convolutional deep neural network for rotated bounding box ship target detection in port areas, significantly enhancing ship target detection performance in complex nearshore scenarios. Li et al. [33] addressed the problem of limited effective samples and the difficulty in distinguishing negative samples by employing Generative Adversarial Networks (GANs) [34] for the robust detection of ship targets in SAR images. Sun et al. [35] conducted extensive experiments on the AIR-SARShip-1.0 dataset, demonstrating the superiority and robustness of densely connected networks. For ground vehicle detection tasks, Du et al. [36] combined deep learning with transfer learning theory to achieve superior performance on the miniSAR dataset compared to CFAR.
Numerous research efforts have validated the superior feature characterization capabilities of data-driven methods compared to traditional SAR ATR. However, some issues have also emerged: on one hand, the limited number of effective SAR image samples or imbalanced samples can directly affect model performance; on the other hand, these data-driven methods are primarily inherited from computer vision methods based on optical image processing, and their feature mining capabilities in SAR images still require guidance and enhancement.
2.1.3. Target-Characteristic-Driven Methods
Model-driven methods rely on manual feature extraction, offering strong algorithm interpretability but exhibiting poor robustness in complex backgrounds. Purely data-driven methods or those that simply incorporate SAR image characteristics have stronger feature characterization, but their feature fusion effectiveness and model interpretability remain inadequate [6]. As detection scenarios become increasingly complex and diversified, and as the requirements for model performance and decision credibility in target detection rise, more DL-based methods that combine model-driven and data-driven approaches based on target characteristics are gradually being proposed. The relationship of methods driven by model, data, and target characteristics is shown in Figure 1.
Du et al. [37] proposed a feature decomposition-based SSD for ship target detection. Deep abstract features extracted by the backbone network are decomposed into discriminative and interference features, constraining the network to focus on learning discriminative features to achieve low false alarm and missed detection rates. Wang et al. [38] utilized Haar transforms to obtain texture information in the frequency domain to describe subtle differences between objects and their surrounding backgrounds. They designed and constructed a multi-feature fusion network for ship target detection. He et al. [39] addressed the sparsity and diversity issues brought by the scattering mechanism of synthetic aperture radar by proposing a method that leverages deep features and prior component structures. Li et al. [40] extracted grayscale features and enhanced spatial information, combined with scattering mechanism features, to suppress interference and redundant information in complex environments, achieving superior detection performance while assisting aircraft in rapid and efficient detection. Ginner et al. [41] designed adaptive noise suppression and scattering center feature extraction modules to accurately characterize the scattering features of aircraft targets. Mix MSTAR [42] is the first publicly available SAR vehicle target detection dataset, specifically designed for detecting rotating objects in large-scale scenes with complex backgrounds. Due to its recent release, related research based on this dataset has not yet commenced. Prior to the release of the Mix MSTAR dataset, research on vehicle target detection was very limited, with most target-characteristic-driven methods focusing primarily on vehicle target classification within the MSTAR dataset. Guo [43] proposed a multi-feature fusion decision CNN that achieves higher recognition rates than other models without increasing the number of model training parameters. Tang and Chen [44] employed a multi-set canonical correlation analysis method to fuse multi-view SAR images into a single feature vector, and then used joint sparse representation to characterize and classify feature vectors from different view sets. This approach achieved optimal classification performance even under extended working conditions, including noise interference and target occlusion. Liu et al. [45] proposed a causal reasoning framework for vehicle target recognition in complex scenes by modeling the SAR ATR task as a causal graph, representing the sources of background-related bias, which reduces false correlations between the background and prediction results, achieving robust recognition performance.
The aforementioned research on target-characteristic-driven methods for typical ground and maritime surface targets is highly commendable. By integrating target characteristics and designing proper DL-based methods, interpretable characteristic information has, to some extent, guided and provided insight on DNNs that rely on only abstract features. However, existing target-characteristic-driven methods still exhibit limited human interaction in decision making, particularly in reconnaissance, where environmental conditions are variable, and more human involvement in proposition formulation is needed.
2.2. The Proposed Method
2.2.1. Overview
In this paper, we propose an intelligent method based on multi-level fusion for target detection in SAR images; its structure is shown in Figure 2. First, feature extraction is conducted to capture four features. Then, the original images and their features are fed into a deep neural network based on one pixel-level fusion method and two feature-level fusion methods to mine and exploit scattering features as fully as possible. Lastly, the detection results and prior information settings are used by the decision-level fusion method to obtain the final results.
2.2.2. Scattering Feature Extraction
Scattering features of single-channel SAR images are composed of three main parts: geometric features, grayscale statistical features, and texture features. Geometric features include the target’s outline, component shapes, structural dimensions, etc. Grayscale statistical features present the differences in structural and material properties between targets and clutters, reflected by the gray amplitude. As for texture features, grayscale changes focus more on local differences, making categories more distinguishable. Therefore, targets are often recognized by defining and designing local texture features. Moreover, compared to point and vector features, texture features in image form can provide richer, detailed information and exhibit great adaptability to various fusion methods, which caters to the research topic in our work. Thus, four texture features were selected for feature fusion in the experiments, which are SAR-SIFT, SAR-HOG, NSLP, and LC. Corresponding feature extraction methods are introduced below.
Scale-Invariant Feature Transform (SIFT) [46] is an important local feature description tool proposed by Lowe in 1999 for image processing. SIFT was designed to identify and describe local features in an image that remain invariant to rotation, scale, and brightness. However, due to the multiplicative speckle noise specific to SAR images, SIFT, which was primarily designed for optical images, does not work as effectively on SAR images. This is because, on the one hand, multiplicative noise results in stronger gradient magnitudes in homogeneous regions with high reflectivity than in areas with lower reflectivity; on the other hand, the computation of orientation and local feature descriptors in SIFT relies on the classical differential gradient, which is not robust to multiplicative noise. Considering the potential of the SIFT local feature and its properties for SAR image feature extraction, SAR-SIFT [47], which has been proven to adapt well to multiplicative noise, was utilized in our study. The pseudocode is presented in Algorithm 1.
Algorithm 1: SAR-SIFT |
; Local extrema detection for do if then end end for do if then gather end end
|
Since SAR images are more sensitive to changes in the target's angle and attitude than optical images, it is necessary to pay more attention to stable pixels during scattering feature extraction, which are dominated by strong backscattered echoes from structures that are insensitive to such changes. To better capture the scattering features of such stable structures, a modified algorithm based on the histogram of oriented gradients (HOG) [48], called SAR-HOG [49], was employed in our study. This algorithm can extract stable structures in SAR images by ratio-based gradient computation. The pseudocode is presented in Algorithm 2.
Algorithm 2: SAR-HOG
Input: image, odd size of the average region, orientations o, pixels per cell p, cells per block c, block norm
Output: feature SAR-HOG
    divide the image into cells by p;
    for each cell do
        build hog histogram(cell, o, p, c);
    end
    combine cells into blocks by c;
    for each block do
        apply the block norm;
    end
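The ratio-based gradient that SAR-HOG and SAR-SIFT rely on can be sketched as follows. This minimal version takes the log-ratio of local means on opposite sides of each pixel, which is one simple way to make the gradient insensitive to multiplicative scaling. The rectangular window, the function name `ratio_gradient`, and the `size` and `eps` parameters are assumptions made for illustration and may differ from the cited algorithms.

```python
import numpy as np

def ratio_gradient(img, size=5, eps=1e-6):
    """Log-ratio gradient from means of opposite half-windows (sketch).

    size: odd side length of the averaging region centred on each pixel.
    Returns gradient magnitude and orientation arrays with the image shape.
    """
    if size % 2 == 0:
        raise ValueError("size is assumed to be odd")
    img = np.asarray(img, dtype=float)
    h = size // 2
    padded = np.pad(img, h, mode="edge")
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            win = padded[r:r + size, c:c + size]
            left, right = win[:, :h].mean(), win[:, h + 1:].mean()
            up, down = win[:h, :].mean(), win[h + 1:, :].mean()
            # A ratio of means (not a difference) cancels a common gain factor.
            gx[r, c] = np.log((right + eps) / (left + eps))
            gy[r, c] = np.log((down + eps) / (up + eps))
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```

The magnitude and orientation returned here would then feed the per-cell orientation histograms in the same way as in a conventional HOG pipeline.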
The Non-Subsampled Laplacian Pyramid (NSLP) [50,51] is an algorithm used in image processing and computer vision for multi-scale spatial representation. Unlike traditional Laplacian or Gaussian pyramids, NSLP does not perform subsampling, thereby maintaining the spatial resolution of the image across different scales. This approach is particularly effective for capturing texture features in images, as it preserves detailed information at various scales. The pseudocode is presented in Algorithm 3.
Algorithm 3: NSLP
Input: image, filter
Output: feature image
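A minimal sketch of a non-subsampled pyramid is given below. It assumes an "à trous" scheme in which the low-pass filter is dilated at each level instead of the image being downsampled; the `[1, 2, 1]/4` kernel, the circular boundary handling via `np.roll`, and the function names are illustrative assumptions rather than the filters used in the cited works.

```python
import numpy as np

def _smooth(img, step):
    """Separable [1, 2, 1]/4 low-pass with taps spaced `step` pixels apart."""
    out = img
    for axis in (0, 1):
        out = (0.25 * np.roll(out, step, axis=axis)
               + 0.5 * out
               + 0.25 * np.roll(out, -step, axis=axis))
    return out

def nslp(img, levels=3):
    """Return (detail layers, low-pass residual), all at full resolution."""
    current = np.asarray(img, dtype=float)
    details = []
    for k in range(levels):
        low = _smooth(current, 2 ** k)   # dilate the filter, keep the image size
        details.append(current - low)    # band-pass (Laplacian-like) layer
        current = low
    return details, current
```

Because each detail layer is the difference between two consecutive low-pass images, summing all detail layers and the residual recovers the input exactly, and every layer keeps the spatial resolution of the original image.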
Saliency features are derived from models that mimic the characteristics of human visual attention, enabling computers to selectively ignore non-essential background information in a scene during image processing and to focus attention on regions of interest. For example, in SAR ATR of ship targets, the ship's hull is primarily made of metal, resulting in strong backscattering signals, which can create a strong contrast with the sea surface, forming a simple scene for detection utilizing saliency features. However, in practical scenarios, backgrounds are often complex and diverse, with significant electromagnetic interference, making it challenging to achieve optimal SAR ATR performance using saliency features. In this study, we used the Linear Color (LC) algorithm [52] to extract saliency features, which has been shown to improve the performance of ship target detection with complex backgrounds in SAR images. The pseudocode is presented in Algorithm 4.
Algorithm 4: LC
    hist = the histogram of the image;
    for each gray level v do
        for n in range(256) do
            f = frequency of pixel level(n);
            dist(v) += f · |v − n|;
        end
        sal(v) = norm(dist(v))
    end
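A histogram-based global-contrast saliency of this kind can be sketched in a few lines. The version below assumes an 8-bit grayscale input and scores each gray level by its frequency-weighted absolute difference to all other levels, then rescales the map to [0, 1]; the function name `lc_saliency` and the min-max normalization are assumptions made for illustration.

```python
import numpy as np

def lc_saliency(gray):
    """Global-contrast saliency from the gray-level histogram (sketch)."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    levels = np.arange(256, dtype=float)
    # contrast[v] = sum over n of hist[n] * |v - n|
    contrast = np.abs(levels[:, None] - levels[None, :]) @ hist
    sal = contrast[gray]                 # look up each pixel's level
    span = sal.max() - sal.min()
    if span == 0:
        return np.zeros_like(sal)
    return (sal - sal.min()) / span
```

Since the contrast is computed once per gray level rather than once per pixel, the cost is dominated by the 256 × 256 table and is independent of how the bright pixels are distributed in the image.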
2.2.3. Image Fusion Theory
Data fusion is a strategy employed to process information from multiple sources to optimize decision making, commonly found in multimodal data processing and often manifested as image fusion [53]. Through data fusion, more accurate and unified information can be extracted to improve the efficiency of decision making. In this study, we use the term image fusion for precision, as data fusion is conducted in image form.
Image fusion can be categorized into three main levels according to the stage of the process at which it occurs [54]: pixel-level, feature-level, and decision-level fusion, as shown in Figure 3.
- a.
Pixel-level image fusion
Pixel-level fusion is at a low level in image fusion; it is conducted by combining pixel information from two or more images. Pixel-level fusion allows data pre-processing operations to be performed prior to formal fusion, which then results in a comprehensive representation that contains as much detailed information as possible.
- b.
Feature-level image fusion
Feature-level fusion is at a middle level in image fusion, where abstract features are extracted first, then followed by the further screening of redundant features, forming new features that can be effectively utilized. Feature-level fusion is often used to obtain and enhance more abstract and comprehensive feature information.
- c.
Decision-level image fusion
Decision-level fusion is at a high level in image fusion. In its process, extracted features are fed into a feature identification module to obtain initial results. These initial results are consolidated by decision fusion for final results. Decision-level fusion presents a strong ability to integrate results and to develop complementarity of strengths among models to the extent possible with better fault tolerance. However, the requirements for pre-processing, feature extraction, and the determination of results are relatively high.
2.2.4. Backbone Network
In this paper, the backbone network, an Oriented RCNN [55] built on the Swin Transformer and PA-FPN, is named ST-PA_RCNN. Its structure is shown in Figure 4. Oriented RCNN is a simple and efficient generic two-stage network for rotated detection with good accuracy and timeliness, whose feature extraction is composed of CNN kernels. However, due to their limited receptive field, CNN kernels only capture local features within the region they cover, which cannot effectively meet the requirements of higher-quality feature extraction, especially in complex scenes. In this regard, the Swin Transformer can capture long-range dependencies through a self-attention mechanism without using CNN kernels, providing a more global perspective. The Path Aggregation Feature Pyramid Network (PA-FPN) [56], an improved Feature Pyramid Network (FPN) [57], is used to enhance feature extraction through path augmentation.
- a.
Swin Transformer
As shown in Figure 4, the Swin Transformer optimizes feature extraction through the following processing. First, the input image is divided into multiple non-overlapping patches by patch partition and then mapped to the specified dimension through linear embedding. Next, the primary-stage feature maps are generated by two successive Swin Transformer blocks. Then, deeper features are obtained through stages 2, 3, and 4.
The details of two successive Swin Transformer blocks are shown on the right of Figure 4. The multi-head self-attention (MSA) of the standard Transformer is replaced with window multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA). W-MSA computes self-attention within local, non-overlapping windows, which reduces the computational complexity to a linear level with respect to the image size. At the same time, to overcome the potential limitation of W-MSA in building long-range dependencies, specifically the lack of cross-window connections, SW-MSA is employed.
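The window partitioning behind W-MSA and the cyclic shift behind SW-MSA can be illustrated with plain array operations. The sketch below assumes a channels-last `(H, W, C)` feature map whose height and width are divisible by the window size and a shift of half a window; the function names are illustrative, and the attention computation itself is omitted.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win, win, C); self-attention
    would then be computed independently inside each window.
    """
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def shifted_window_partition(x, win):
    """Cyclically shift the map by half a window, then partition it.

    Each shifted window straddles several of the original windows, which
    is what provides cross-window connections.
    """
    shift = win // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, win)
```

With a fixed window size, the number of tokens attended to by each token is constant, so the total attention cost grows linearly with the number of windows rather than quadratically with the number of patches.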
- b.
PA-FPN
The structure of PA-FPN is shown in Figure 5. It enhances the information flow between features of different scales by adding bottom-up path augmentation to the FPN, which particularly benefits multi-scale target detection in complex backgrounds. PA-FPN mainly consists of an FPN and bottom-up path augmentation.
The FPN is designed to enhance feature mining at different scales by constructing a multi-level and multi-scale feature pyramid, which can effectively improve performance in multi-scale target detection. It is shown by the blue structure in Figure 5.
Multi-scale feature levels generated by the FPN can be represented by
, correspondingly. As shown in
Figure 5, feature dimensions are gradually downsampled by a factor of 2 from
to
. New feature mappings corresponding to
are represented by
. Lateral connections fuse the features obtained from both paths at the same resolution, preserving the high-resolution information of the original image while integrating high-level semantic information from the deeper features. This process iterates continuously until the lowest-level feature map is generated. Additionally, a 1 × 1 convolution is used to adjust the channel dimensionality of the feature maps.
Bottom-up path augmentation primarily enhances system performance by introducing path augmentation and information aggregation, specifically by establishing efficient lateral connections from lower to higher layers, facilitating a smoother upward transmission of low-level information, as shown by the red and green dashed arrows in Figure 5.
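The two aggregation passes can be sketched on plain arrays as follows. This is a structural illustration only: it assumes the input levels already share the same channel count (so the 1 × 1 lateral convolutions are omitted), uses nearest-neighbor upsampling for the top-down pass, and replaces the learned strided convolutions of the bottom-up pass with 2 × 2 average pooling. All function names are hypothetical.

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbor upsampling of an (H, W, C) map by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2(x):
    """2 x 2 average pooling of an (H, W, C) map (stand-in for a strided conv)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def pa_fpn(feats):
    """feats: list of (H, W, C) maps ordered fine to coarse, each half the previous size."""
    # Top-down pathway (FPN): push coarse semantics into finer levels.
    tops = [feats[-1]]
    for f in reversed(feats[:-1]):
        tops.insert(0, f + upsample2(tops[0]))
    # Bottom-up path augmentation: push fine localization cues back up.
    outs = [tops[0]]
    for p in tops[1:]:
        outs.append(p + downsample2(outs[-1]))
    return outs
```

Each output level keeps the resolution of its input level, so the detection heads that follow can be attached exactly as they would be to a plain FPN.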
- c.
Oriented RCNN
Oriented RCNN is a two-stage detector that mainly consists of an oriented Region Proposal Network (RPN) and an oriented RCNN head. The first stage produces high-quality oriented proposals (RoIs) at almost zero additional cost, while the second stage uses the oriented RCNN head for proposal classification and regression. Compared to the standard RPN, the oriented RPN adds two parameters in the regression branch for better adaptation to rotated detection. Additionally, the oriented RPN introduces the midpoint offset to represent proposals, further improving detection accuracy. The oriented RCNN head refines and recognizes the RoIs. Rotated RoIAlign accurately extracts features from the RoIs, supporting subsequent regression and classification. The network's loss function is as follows:
$$L_1 = \frac{1}{N}\sum_{i=1}^{N} F_{cls}\left(p_i, p_i^{*}\right) + \frac{1}{N}\sum_{i=1}^{N} p_i^{*}\, F_{reg}\left(\delta_i, t_i^{*}\right)$$
where $i$ denotes the index of anchors, $N$ refers to the number of samples in a mini-batch, $p_i^{*}$ represents the ground-truth label of the i-th anchor, $p_i$ denotes the output of the classification branch of the oriented RPN, $t_i^{*}$ represents the ground-truth box of the i-th anchor, and $\delta_i$ represents the output of the regression branch of the oriented RPN. $F_{cls}$ represents the Cross Entropy loss, which is defined as
$$F_{cls}\left(p_i, p_i^{*}\right) = -\left[p_i^{*}\log p_i + \left(1 - p_i^{*}\right)\log\left(1 - p_i\right)\right]$$
$F_{reg}$ represents the Smooth $L_1$ loss, which is defined as
$$F_{reg}\left(\delta_i, t_i^{*}\right) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
where $x = \delta_i - t_i^{*}$, applied element-wise and summed over the regression parameters.
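As a small numerical sketch of the loss described above, the following code combines a binary cross-entropy term with a Smooth L1 term that is only counted for positive anchors. It assumes the standard definitions of both losses; the argument names (`p`, `p_star`, `delta`, `t_star`) and the six-parameter regression vector are illustrative assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Element-wise Smooth L1: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, delta, t_star, eps=1e-12):
    """Classification + regression loss over a mini-batch of anchors.

    p:      (N,) predicted foreground probabilities
    p_star: (N,) ground-truth labels in {0, 1}
    delta:  (N, K) predicted regression offsets
    t_star: (N, K) target regression offsets
    """
    p = np.clip(p, eps, 1.0 - eps)
    n = len(p)
    cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    reg = smooth_l1(delta - t_star).sum(axis=1)
    # Regression only contributes for positive anchors (p_star == 1).
    return cls.sum() / n + (p_star * reg).sum() / n
```

Smooth L1 behaves like a squared error for small residuals and like an absolute error for large ones, which limits the influence of badly matched anchors on the gradient.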
2.2.5. Pixel-Level Fusion
ST-PA_RCNN based on channel fusion is proposed in this subsection, where channel fusion is one of the primary methods of pixel-level fusion [58]. Channel fusion enriches detailed information by combining different feature channels or image channels to form a comprehensive feature representation. The structure of the ST-PA_RCNN based on pixel-level fusion is shown in Figure 6. The scattering feature enhancement region, obtained from the original images through scattering feature enhancement, is input into the backbone network through channel fusion.
- a.
Scattering Feature Enhancement
Scattering feature enhancement consists of three main parts: a scattering feature extractor, a SAR-Harris detector, and OPTICS clustering. The scattering feature extractor extracts the four texture features described in Section 2.2.2. The scattering features are then fed into the SAR-Harris detector for key point extraction. Finally, the extracted key points are clustered by OPTICS to obtain a scattering feature enhancement region. The pseudocode of the SAR-Harris detector and OPTICS is presented in Algorithm 5 and Algorithm 6, respectively.
Algorithm 5: SAR-Harris |
|
Algorithm 6: OPTICS
Input: DB, eps, MinPts
Output: orderedlist
OPTICS(DB, eps, MinPts)
    // Extract all core points from the dataset
    core_points = all core points in DB;
    // Calculate the core distance for each point
    compute the core distance of each point in core_points;
    for each unprocessed point p in core_points do
        mark p as processed; output p to the ordered list;
        if the core distance of p is defined then update(p, Seeds);
        for each next q in Seeds do
            mark q as processed; output q to the ordered list;
            if the core distance of q is defined then update(q, Seeds);
    return orderedlist
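The ordering step of OPTICS can be sketched compactly for small point sets such as detected key points. The version below assumes Euclidean distance, counts a point as its own neighbor when evaluating `min_pts`, and uses a heap with lazy deletion in place of an updatable seed list; the function name `optics_order` is illustrative, and cluster extraction from the reachability values is not shown.

```python
import heapq
import numpy as np

def optics_order(X, eps, min_pts):
    """Return the OPTICS visiting order and reachability distances (sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Core distance: distance to the min_pts-th nearest point (self included),
    # undefined (inf) when fewer than min_pts points lie within eps.
    kth = np.sort(dist, axis=1)[:, min_pts - 1]
    core = np.where(kth <= eps, kth, np.inf)
    reach = np.full(n, np.inf)
    processed = np.zeros(n, dtype=bool)
    order = []
    for start in range(n):
        if processed[start]:
            continue
        seeds = [(np.inf, start)]
        while seeds:
            _, p = heapq.heappop(seeds)
            if processed[p]:
                continue  # stale heap entry
            processed[p] = True
            order.append(int(p))
            if not np.isfinite(core[p]):
                continue  # p is not a core point, so it cannot expand
            for q in np.flatnonzero((dist[p] <= eps) & ~processed):
                new_reach = max(core[p], dist[p, q])
                if new_reach < reach[q]:
                    reach[q] = new_reach
                    heapq.heappush(seeds, (new_reach, q))
    return order, reach
```

Points that belong to the same dense region appear consecutively in the returned order with small reachability values, so a region can be delimited by thresholding the reachability sequence.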
- b.
Channel Fusion
In this study, channel fusion was conducted by concatenating the original image and the feature image after scattering enhancement, where the third channel of the original image is replaced with the feature image. The process of channel fusion is shown in Figure 7.
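The channel replacement described above amounts to a single array assignment. The sketch below assumes the original SAR image is stored as a three-channel `(H, W, 3)` array and the enhanced feature image as a single-channel `(H, W)` array of the same size; the function name `channel_fusion` is illustrative.

```python
import numpy as np

def channel_fusion(original, feature):
    """Replace the third channel of `original` with `feature` (sketch).

    original: (H, W, 3) array; feature: (H, W) array.
    Returns a new array and leaves both inputs unchanged.
    """
    original = np.asarray(original)
    feature = np.asarray(feature)
    if original.shape[:2] != feature.shape or original.shape[2] != 3:
        raise ValueError("expected (H, W, 3) image and (H, W) feature map")
    fused = original.copy()
    fused[..., 2] = feature
    return fused
```

Because the fused array keeps the three-channel layout, it can be passed to a backbone that expects ordinary three-channel input without changing the first layer.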
2.2.6. Feature-Level Fusion
Two feature-level fusion methods based on different fusion blocks were used in this study: DBAM ST-PA_RCNN and FDRM ST-PA_RCNN. The following discussion centers on these two fusion blocks: Dual-Branch FPN and Attention Mechanism (DBAM) and Feature Decomposition and Reweighting Model (FDRM). The structure of the DBAM follows [38], and the FDRM is the main contribution of this subsection.
- a.
Dual-Branch FPN and Attention Mechanism
The structure of DBAM ST-PA_RCNN is shown in Figure 8. It consists of two main parts: a dual-branch PA-FPN (DB-PA-FPN) and an attention mechanism fusion block.
The structure of DB-PA-FPN is shown in Figure 9, with one PA-FPN displayed on the left and one on the right. The dual-branch features obtained from the two PA-FPNs are then input into the feature fusion block to obtain a new feature mapping. DB-PA-FPN inherits the superior performance of PA-FPN in enhancing the flow of information among different scale features by effective feature mining, which is crucial for handling complex or multi-scale target detection scenarios. The path augmentation is shown by the orange and purple dashed arrows in Figure 9.
The structure of the attention mechanism is shown in Figure 10. The features extracted by DB-PA-FPN from the original image and the feature image after scattering enhancement are concatenated by channel fusion. The obtained features are then input into residual-like modules. The attention mechanism [59], known for its efficiency and practicality, has become a popular method for enhancing the performance of deep neural networks. It can adaptively focus on regions of interest, effectively reducing background noise and confusion between classes, thereby enabling a more concentrated capture of key information related to the target.
- b.
Feature Decomposition and Reweighting Model
Like DBAM ST-PA_RCNN, FDRM ST-PA_RCNN is a feature-level fusion method, and its structure is shown in Figure 11. It is built on DBAM ST-PA_RCNN and uses an improved feature fusion block named the FDRM. Compared to the attention mechanism, the FDRM can reduce more feature redundancy. Additionally, the two branch features are reweighted through an accumulative learning strategy, which enhances the interaction between the dual feature spaces and allows the network to learn and utilize scattering characteristics more effectively.
The structure of the FDRM fusion block is shown in Figure 12. The features extracted by the DB-PA-FPN from the original image and the feature image after scattering enhancement are input into the FDRM for feature decomposition and reweighting. The obtained features are then channel-concatenated and finally downsampled for output.
As shown in the right part of Figure 12, the FDRM is mainly composed of two parts: feature decomposition and feature reweighting.
In order to reconstruct the dual feature spaces of original features and scattering features, reduce feature redundancy, and enhance information entropy, feature decomposition leverages two loss functions: orthogonal loss and compactness loss.
Orthogonal Loss. To amplify the divergence between original feature space and scattering feature space, the orthogonal loss is defined as
where
denotes the number of images in a mini-batch.
and
denote the two input feature spaces.
T represents the transpose operation. As a result, the divergence between inter-features is amplified.
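To make the idea of an orthogonality penalty concrete, here is a minimal sketch that averages the squared inner product of paired feature vectors over a mini-batch. This particular form and the name `orthogonal_loss` are assumptions chosen for illustration; the loss used in the paper is the one defined in the text.

```python
def orthogonal_loss(feats_a, feats_b):
    """Mean squared inner product between paired feature vectors.

    feats_a, feats_b: one vector per image from each feature space.
    The loss is 0 when every pair is orthogonal and grows as pairs align.
    """
    n = len(feats_a)  # number of images in the mini-batch
    total = 0.0
    for a, b in zip(feats_a, feats_b):
        dot = sum(x * y for x, y in zip(a, b))  # a^T b
        total += dot * dot
    return total / n

orthogonal = orthogonal_loss([[1.0, 0.0]], [[0.0, 1.0]])
aligned = orthogonal_loss([[1.0, 0.0]], [[1.0, 0.0]])
print(orthogonal, aligned)
```

Minimizing such a term pushes the two feature spaces apart, which is the divergence-amplifying effect described above.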
Compactness Loss. Inspired by center loss [
60], compactness loss is designed to learn a center from inter-features and penalize the distances among these features and their respective centers. Specifically,
where
indicates the
norm.
refers to the input features for
.
indicates the center of the
j-th features, which is updated with every mini-batch. Consequently, the compactness and cohesion among intra-features are enhanced.
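A center-loss-style compactness term can be sketched as follows. The half mean squared distance, the update rule, and the `rate` parameter are assumptions borrowed from the general center loss idea, not the exact formulation used here.

```python
def compactness_loss(features, center):
    """Half the mean squared L2 distance between features and their center."""
    n = len(features)
    squared = sum(sum((x - c) ** 2 for x, c in zip(f, center)) for f in features)
    return 0.5 * squared / n

def update_center(features, center, rate=0.5):
    """Move the center toward the mini-batch mean (hypothetical update rule)."""
    n = len(features)
    mean = [sum(f[d] for f in features) / n for d in range(len(center))]
    return [c + rate * (m - c) for c, m in zip(center, mean)]

batch = [[1.0, 1.0], [3.0, 1.0]]
center = [0.0, 0.0]
before = compactness_loss(batch, center)
center = update_center(batch, center, rate=1.0)
after = compactness_loss(batch, center)
print(before, after)
```

The loss drops once the center tracks the batch statistics, and gradient steps on the features would then pull them toward that center.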
Together with two loss functions
and
in the multi-scale detection module, the final loss function
is
where
and
are two balancing hyperparameters, which are assigned values of 1 and 0.1, respectively.
After feature decomposition, the two feature spaces are reconstructed to maximize inter-feature divergence and to ensure intra-feature compactness. These features must then be fused prior to further processing. The FRCLM is proposed to strike a balance between the influence of original features and scattering features in model training. The fusion strategy can be described as a summation, with a trade-off parameter μ introduced as follows:
where
denotes the fusion feature from the two features. Given the current training epoch
and total training epoch
, a
μ corresponding to the parabolic strategy is shown below:
Throughout the training, the feature reweighting model incrementally redirects the model’s focus between the primary and scattering feature pathways. This strategy ensures comprehensive learning from both sets of features, ultimately improving detection and recognition capabilities.
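The reweighting schedule can be illustrated with a short sketch. It assumes a parabolic decay of the form μ = 1 − (t/T)², as used in cumulative learning strategies, and assumes that μ weights the original-feature branch while 1 − μ weights the scattering branch; both the formula and the branch assignment are assumptions for illustration.

```python
def parabolic_mu(epoch, total_epochs):
    """Assumed parabolic trade-off: 1 at the start of training, 0 at the end."""
    return 1.0 - (epoch / total_epochs) ** 2

def reweighted_sum(f_orig, f_scat, mu):
    """Weighted summation of the two branch features (element-wise)."""
    return [mu * a + (1.0 - mu) * b for a, b in zip(f_orig, f_scat)]

total = 36  # training epochs reported in the parameter settings
for epoch in (0, 18, 36):
    mu = parabolic_mu(epoch, total)
    print(epoch, mu, reweighted_sum([1.0, 2.0], [3.0, 4.0], mu))
```

Under these assumptions the fused feature starts as the original branch alone and ends as the scattering branch alone, with the shift accelerating late in training.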
2.2.7. Decision-Level Fusion
In this subsection, we propose a multi-feature fusion method based on Dempster–Shafer Theory (DST) for object detection, and its structure is shown in
Figure 13. The detection results obtained by pixel-level and feature-level fusion methods, which are also shown in
Figure 2, are utilized to calculate global/local confidence of the
nth model, which integrates one feature using one fusion method, together with prior information (category weight settings in this paper) for evidence collection. All models obtained by fusing each feature at the pixel level and feature level are used in our decision-level fusion. These pre-processed data are then fed into a decision-level fusion module, which comprises discernment framework building, basic probability assignment, evidence combination, and a decision rule. Finally, the object detection results are obtained.
The discernment framework $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$ is the set of all possible outcomes, where $\theta_i$ indicates the $i$-th element, and the elements are mutually exclusive.
Basic probability assignment (BPA) is denoted as $m: 2^{\Theta} \rightarrow [0, 1]$; it assigns a probability mass to each subset of $\Theta$, and the total mass must equal 1, i.e., $\sum_{A \subseteq \Theta} m(A) = 1$.
The belief function (Bel) quantifies the support for a hypothesis based on the available evidence; that is, $Bel(A)$ denotes the sum of the BPAs of all subsets of proposition $A$. It is as follows:
$$Bel(A) = \sum_{B \subseteq A} m(B)$$
The plausibility function (Pl) represents the degree of belief in a hypothesis considering the uncertainty, which is calculated as
$$Pl(A) = \sum_{B \cap A \neq \emptyset} m(B)$$
Dempster’s Rule of Combination is used to merge BPAs from different sources. The rule is defined as
$$m(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad A \neq \emptyset$$
where $K$ is the normalization term that accounts for conflicting evidence, calculated as
$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$$
Decision rule: based on the combined evidence, the most plausible hypothesis is selected, typically the one with the highest plausibility or belief.
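Dempster's rule of combination, together with belief and plausibility, can be sketched in a few lines. The two-class frame, the mass values, and the helper names below are made up for illustration and are not taken from the experiments.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two BPAs keyed by frozenset hypotheses.

    Returns the combined BPA and the conflict mass K.
    """
    combined, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {h: v / (1.0 - conflict) for h, v in combined.items()}, conflict

def belief(m, hyp):
    """Sum of masses of all focal sets contained in hyp."""
    return sum(v for h, v in m.items() if h <= hyp)

def plausibility(m, hyp):
    """Sum of masses of all focal sets that intersect hyp."""
    return sum(v for h, v in m.items() if h & hyp)

FIGHTER, AIRLINER = frozenset({"fighter"}), frozenset({"airliner"})
THETA = FIGHTER | AIRLINER               # whole frame = remaining uncertainty
m1 = {FIGHTER: 0.6, AIRLINER: 0.1, THETA: 0.3}   # hypothetical model 1
m2 = {FIGHTER: 0.5, AIRLINER: 0.2, THETA: 0.3}   # hypothetical model 2
fused, k = combine(m1, m2)
decision = max((FIGHTER, AIRLINER), key=lambda h: belief(fused, h))
print(sorted(decision), round(k, 2))
```

For any hypothesis, belief never exceeds plausibility; the gap between them is the uncertainty that the evidence leaves unassigned.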
- 2. The processing of decision-level fusion
Since different categories receive different attention, we assign weights to the categories in this paper to assist the model in making more valuable decisions. For example, the detection of fighters has a higher intelligence value for surveillance and battlefield assessment than commercial airliners. Then, we define the weight of the th category as and .
Due to the different performances of different fusion methods integrating different scattering features, in order to achieve the effective fusion of the detection results of each deep learning model, we adopted a confusion matrix [
61] as one of the assignment components in BPA. A confusion matrix can be used to describe and evaluate the relationship between ground truth and the recognition results. For the
-classification task, the output confusion matrix can be defined as
Based on the results of the confusion matrix, global and local confidence for the model to identify the
th category are defined as
Based on the definition of local and global confidence, combined with the category weight assignment, the basic probability assignment in DST is completed, calculated as
where
denotes the prediction score for the
th category, and
presents the
th model.
For discernment framework
, there is
That is, .
The combination and decision rules described in the subsection on Dempster–Shafer Theory above are then applied.
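The evidence-collection step can be illustrated as follows. This sketch assumes that global confidence is the overall accuracy of the confusion matrix, that local confidence is the per-class precision (rows are ground truth, columns are predictions), and that the mass left over after weighting goes to the whole frame. These definitions and the function names are assumptions for illustration and may differ from the ones used in this paper.

```python
def confidences(cm):
    """Return (global_conf, local_conf) from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    global_conf = sum(cm[i][i] for i in range(n)) / total   # overall accuracy
    local_conf = []
    for j in range(n):
        predicted_j = sum(cm[i][j] for i in range(n))        # column sum
        local_conf.append(cm[j][j] / predicted_j if predicted_j else 0.0)
    return global_conf, local_conf

def basic_probability_assignment(scores, weights, global_conf, local_conf):
    """Hypothetical BPA: weighted scores on singletons, remainder on the frame."""
    raw = [w * s * global_conf * l
           for w, s, l in zip(weights, scores, local_conf)]
    total = sum(raw)
    scale = 1.0 if total <= 1.0 else 1.0 / total              # keep total mass <= 1
    masses = [r * scale for r in raw]
    return masses, 1.0 - sum(masses)                          # (singletons, frame)

cm = [[8, 2],
      [1, 9]]                     # toy two-class confusion matrix
g, local = confidences(cm)
masses, theta_mass = basic_probability_assignment(
    scores=[0.9, 0.1], weights=[0.6, 0.4], global_conf=g, local_conf=local)
print(g, local, masses, theta_mass)
```

The resulting masses for each model can then be passed to Dempster's rule, so that a model with a weak track record on a category contributes less evidence for it.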
3. Results
In this section, comprehensive experiments are presented to analyze the model’s efficiency and potential. Multiple feature fusion strategies were investigated in our experiments, and the results demonstrate the effectiveness and significance of the proposed method. Experiments on multiple typical ground and sea target datasets were conducted to validate the method’s applicability.
3.1. Dataset Description
3.1.1. SRSDD-v1.0
SRSDD-v1.0 [
62] is a high-resolution SAR ship detection dataset from GF3 in spotlight (SL) mode with a resolution of 1 m and an image size of 1024 × 1024 pixels. It contains six categories of ships, with 63.1% inshore scenes and 36.9% offshore scenes, which makes detection more challenging. Detailed information on the categories in the dataset is shown in
Table 1. Different scenarios are illustrated in
Figure 14.
3.1.2. GF3-ADD
The Gaofen-3 Aircraft Detection Dataset (GF3-ADD) is an in-lab dataset from GF3 with a resolution of 1 m. There are five types of airports and three categories of aircraft. The ground truths are annotated by professionally trained personnel with reference to optical images of the same regions. Images are cropped to 512 × 512 pixels to preserve details of target scattering features while reducing background clutter in feature extraction. Detailed information on the categories in the dataset is shown in
Table 2. Examples of different scene slices are shown in
Figure 15.
3.1.3. MSTAR-VDD
Mix MSTAR [
42] is a synthetic benchmark dataset for multi-class rotation vehicle detection, which mixes target chips and clutter backgrounds with original MSTAR [
63] data at the pixel level. Mix MSTAR contains 20 fine-grained categories in 100 high-resolution images, predominantly 1478 × 1784 pixels, achieved by refining T72 into 11 categories (T72 A04, T72 A05, T72 A07, T72 A10, T72 A32, T72 A62, T72 A63, T72 A64, T72 SN132, T72 SN812, T72 SNS7). The dataset includes various landscapes such as woods, grasslands, urban buildings, and tightly arranged vehicles.
Since targets with a similar structure have similar scattering mechanisms and close scattering characteristics, we grouped the 20 categories in Mix MSTAR into 5 broad categories to better explore the potential application of scattering features: tank, self-propelled artillery, amphibious, dozer, and truck. The regrouped dataset is called MSTAR-VDD for convenience in this paper. The details of the fine-to-broad categorization are shown in
Figure 16. The category information in the dataset is shown in
Table 3. Different scenarios are illustrated in
Figure 17.
3.2. Evaluation Metrics
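The precision $P$ and recall $R$ are defined with True Positive ($TP$), False Positive ($FP$), and False Negative ($FN$), respectively:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
As the most common evaluation metric in object detection, Average Precision ($AP$) represents the comprehensive performance of the detector since a trade-off exists between precision and recall. $AP$ and $mAP$ are calculated as
$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of categories.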
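For readers who want to reproduce the metrics, the sketch below computes precision and recall from counts and approximates AP as the area under the precision-recall curve using all-point interpolation. The interpolation scheme and the function names are assumptions; the evaluation toolkit used in the experiments may differ in details such as matching rules and interpolation.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one category.

    detections: (score, is_true_positive) pairs; num_gt: ground-truth count.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []                                   # (recall, precision) per rank
    for _, hit in ordered:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        best_precision = max(p for _, p in points[i:])   # interpolated precision
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=8, fp=2, fn=4)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(p, r, ap, mean_average_precision([ap, 1.0]))
```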
3.3. Parameter Setting
All these experiments were conducted on four NVIDIA TITAN Xp GPUs for a total of 36 epochs on each dataset with a batch size of 8. The initial learning rate was 0.0001, which increased linearly by 0.3333 every 500 iterations, and the momentum was 0.9 with a weight decay of 0.05 using the AdamW optimizer. All images were cut into 512 × 512 pixel patches with an overlap of 256 pixels to preserve details of target scattering features while reducing background clutter.
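The cropping scheme (512-pixel tiles with a 256-pixel overlap) can be sketched as below. How the image border is handled is not stated, so the final tile being shifted back to sit flush with the border is an assumption, as are the function names.

```python
def tile_origins(length, tile=512, overlap=256):
    """Start offsets of tiles along one image axis."""
    stride = tile - overlap
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)   # assumed: last tile flush with the border
    return origins

def tile_boxes(width, height, tile=512, overlap=256):
    """(x_min, y_min, x_max, y_max) boxes covering the whole image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

print(tile_origins(1024))
print(len(tile_boxes(1024, 1024)), len(tile_boxes(1784, 1478)))
```

With a stride of half the tile size, every interior target appears whole in at least one tile as long as it is smaller than the overlap.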
3.4. Experiments and Analyses
3.4.1. Experiments on SRSDD-v1.0
For the ablation experiments of the backbone network, shown in
Table 4, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For experiments of pixel-level fusion methods integrating different features, shown in
Table 5, HOG achieved the best improvement of 2.79% in mAP compared to the baseline. Notably, HOG significantly boosted the accuracy for law enforce and dredger, especially for law enforce, with a twofold increase of 8.33%. Furthermore, NSLP improved the category accuracy of bulk cargo by 6.62% and fishing by 7.59%.
For experiments of feature-level fusion (DBAM), shown in
Table 5, HOG achieved the highest improvement of 4.08% in mAP compared to the baseline. It is worth noting that HOG enhanced the category accuracy of law enforce by 20.81%, which is highly significant. Additionally, SIFT achieved an accuracy improvement of 1.03% for bulk cargo, and NSLP realized a 3.69% increase for ore-oil.
For experiments of feature-level fusion (FDRM), shown in
Table 5, SIFT achieved an improvement of 10.36% in average accuracy compared to the baseline, representing the highest and most significant increase. Notably, this resulted in an exceptional improvement of 60.02% in category accuracy for law enforce, making it the most substantial enhancement across all feature fusion methods. This indicates that the SAR-SIFT feature, combined with feature fusion levels based on the FDRM, is particularly suitable for detecting law enforce. HOG achieved improvements of 1.52% and 1.61% for dredgers and bulk cargo, respectively. Additionally, NSLP realized a 10.49% increase for ore-oil, while LC improved the category accuracy for containers by 3.2%.
For experiments of decision-level fusion based on DST, category weight setting was set as in
Table 6, which represented the degree of attention or interest among categories. The confusion matrix and global and local confidence of models integrating SAR-HOG are shown in
Table 7 and
Table 8 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level fusion models integrating different features, DS maximized the detection accuracy on SRSDD-v1.0, with an increase of 16.52% compared to ST-PA_RCNN, thanks to the significant improvement in category accuracy for the few-shot category.
3.4.2. Experiments on GF3-ADD
For the ablation experiments of the backbone network, shown in
Table 9, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For experiments of pixel-level fusion methods integrating different features, shown in
Table 10, SIFT achieved the best improvement of 6.46% in mAP compared to the baseline. Specifically, SIFT significantly increased the accuracy for carriers, with an increase of 7.18%. NSLP improved the category accuracy of fighters by 5.69% and airliners by 7.51%.
For experiments of feature-level fusion (DBAM), shown in
Table 10, HOG achieved the highest improvement of 2.28% in mAP compared to the baseline, which enhanced the category accuracy of fighters by 5.68% and carriers by 1.01%. Additionally, HOG achieved accuracy improvements of 7.77% for fighters and 1.00% for airliners, but decreased in accuracy for carriers.
For experiments of feature-level fusion (FDRM), shown in
Table 10, SIFT achieved an improvement of 3.64% in average accuracy compared to the baseline, representing the highest increase. Notably, it resulted in a significant improvement of 7.91% in category accuracy for carriers. Additionally, LC realized a 4.31% increase for fighters.
For experiments of decision-level fusion based on DST, category weight setting was set as in
Table 11, which represented the degree of attention or interest among categories. The confusion matrix and global and local confidence of models integrating SAR-HOG are shown in
Table 12 and
Table 13 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level fusion models integrating different features, DS maximized the detection accuracy on GF3-ADD, with an increase of 7.1% compared to ST-PA_RCNN.
3.4.3. Experiments on MSTAR-VDD
For the ablation experiments of the backbone network, shown in
Table 14, ST-PA_RCNN presents the best performance in mAP, which demonstrates the effectiveness of the backbone network improved based on Oriented RCNN.
For experiments of pixel-level fusion methods integrating different features, shown in
Table 15, HOG achieved the best improvement of 2.11% in mAP compared to the baseline. Specifically, HOG realized the highest accuracy for the amphibious, dozer, and truck categories, especially for amphibious, with an increase of 5.98%. Furthermore, NSLP improved the category accuracy of self-propelled artillery by 2.67%.
For experiments of feature-level fusion (DBAM), shown in
Table 15, HOG achieved the highest improvement of 1.06% in mAP compared to the baseline. Specifically, HOG realized the highest accuracy for tanks, self-propelled artillery, and dozers, with increases of 1.83% and 1.72% for self-propelled artillery and dozers, respectively. Additionally, SIFT improved the category accuracy of trucks by 2.82%.
For experiments of feature-level fusion (FDRM), shown in
Table 15, SIFT achieved the highest improvement of 0.53% in average accuracy compared to the baseline. This resulted in the highest accuracy for the tank and self-propelled artillery categories and increased the category accuracy of the self-propelled artillery, amphibious, and dozer categories by 0.92%, 0.71%, and 0.60%, respectively. Additionally, NSLP improved the category accuracy of self-propelled artillery by 1.00% and dozer by 0.96%, while LC improved the category accuracy for amphibious by 0.72%. It is worth noting that the mAP of the feature-level fusion (FDRM) baseline has already reached 97.75%, which is the highest among the baselines of all fusion methods, with an increase of 2.03%, leaving relatively limited room for accuracy improvement from fusing features.
For experiments of decision-level fusion based on DST, category weight setting was set as in
Table 16. This represented the degree of attention or interest among categories. The confusion matrix and global and local confidence of models integrating SAR-HOG are shown in
Table 17 and
Table 18 as an example of the intermediate results. After consolidating the detection results of the pixel-level and feature-level fusion models integrating different features, DS maximized the detection accuracy on MSTAR-VDD, with an increase of 3.19% compared to ST-PA_RCNN, thanks to the significant improvement in category accuracy for the few-shot category.
4. Discussion
Due to the unique imaging mechanism of SAR, targets in SAR images present complex scattering characteristics, especially against diversified backgrounds. There are three main challenges for intelligent SAR target detection: insufficient exploitation of target characteristics, inefficient characterization of scattering features by data-driven methods, and inadequate reliability of model decisions.
In this paper, we propose an intelligent SAR target detection method based on multi-level fusion for better performance in complex backgrounds, which consists of pixel-level, feature-level, and decision-level fusion.
For the exploitation of target characteristics, four texture features were selected for feature fusion in our experiments: SAR-SIFT, SAR-HOG, NSLP, and LC. Since texture features are in image form, they can provide richer detail information and exhibit greater adaptability to various fusion methods than point and vector features. Furthermore, to extract deeper abstract features, ST-PA_RCNN is designed as the backbone network based on Oriented RCNN by replacing the feature extractor with a Swin Transformer and improving the FPN with path augmentation. All ablation experiments of ST-PA_RCNN on SRSDD-v1.0, GF3-ADD, and MSTAR-VDD validate its effectiveness and robustness, with ST-PA_RCNN achieving the highest mAP among all compared backbones.
To improve feature characterization, ST-PA_RCNN with pixel-level fusion and a designed scattering feature enhancement module provides an initial exploration of image fusion through the channel fusion of original images and their feature images after scattering feature enhancement. With this method, HOG achieves the highest mAP increase of 2.79% in ship detection and 2.11% in vehicle detection, and SIFT achieves the highest mAP increase of 6.46% in aircraft detection.
To further enhance the ability of feature mining and characterization, two feature-level fusion methods are built on respective transferable fusion blocks, namely the DBAM and FDRM, which perform higher-level fusion than the pixel-level method. The DBAM is used for better attention fusion between the features from original images and those from their scattering-enhanced feature images. The FDRM is designed to reduce feature redundancy and to learn both sets of features more fully through reweighting. In the experiments of feature-level fusion, HOG (DBAM) achieves the highest mAP increase of 4.36% in ship detection and 2.04% in vehicle detection; SIFT (FDRM) achieves the highest mAP increase of 12.27% in ship detection, 6.47% in aircraft detection, and 2.56% in vehicle detection; and LC (DBAM) achieves the highest mAP increase of 4.49% in aircraft detection. Finally, it is worth noting that the feature-level FDRM (baseline) shows the best performance on all three experimental datasets in comparison to the pixel-level (baseline) and feature-level DBAM (baseline), validating the effectiveness and robustness of our originally designed method.
In the experiments of pixel-level and feature-level fusion, the results indicate that the highest accuracy for each category requires a specific combination of fused feature and fusion method, which means that the highest overall performance can be obtained by combining these models effectively through decision-level fusion. To improve the reliability of model decisions, a decision-level fusion method based on DST for multi-model integration represents the highest-level fusion through proposition setting and statistical analysis. It can not only consolidate the complementary strengths of different models but also incorporate human or expert involvement in the propositions to guide effective decision making. In the experiments on typical target detection datasets, the proposed method increases the mAP by 16.52%, 7.1%, and 3.19% in ship, aircraft, and vehicle target detection, respectively, demonstrating high effectiveness and robustness.
5. Conclusions
In this paper, an intelligent SAR target detection method based on multi-level fusion is proposed to improve the insufficient exploitation of target characteristics and the inadequate reliability of the model’s decision making. Four texture features (SAR-SIFT, SAR-HOG, NSLP, and LC) are employed to enhance the scattering feature representation, while ST-PA_RCNN integrates a Swin Transformer-based feature extractor and path-augmented FPN, serving as the backbone network to extract deeper abstract features. Experimental evaluations on SRSDD-v1.0, GF3-ADD, and MSTAR-VDD confirmed the robust performance and high accuracy of ST-PA_RCNN, surpassing other backbones in terms of mAP. In the multi-level fusion process, pixel-level fusion provides an initial exploration for integrating scattering-enhanced features through channel fusion; two feature-level fusion methods, DBAM and FDRM, facilitated higher-level feature aggregation by emphasizing attention mechanisms and reducing redundancy; decision-level fusion based on DST effectively integrated complementary model outputs and accommodated expert involvement, thereby improving the mAP by 16.52%, 7.1%, and 3.19% in ship, aircraft, and vehicle target detection, respectively.
Overall, these findings underscore the synergistic potential of combining deep learning with carefully designed fusion strategies across multiple levels. Future research may explore more advanced pixel-level fusion methods, expand the repertoire of feature-level fusion blocks, and refine decision-level fusion propositions to accommodate diverse scattering features and application scenarios, ultimately laying a stronger theoretical and practical foundation for robust SAR target detection.