Article

A Multiscale Local–Global Feature Fusion Method for SAR Image Classification with Bayesian Hyperparameter Optimization Algorithm

Xiaoqin Lian, Xue Huang, Chao Gao, Guochun Ma, Yelan Wu, Yonggang Gong, Wenyang Guan and Jin Li
1 School of Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
2 Key Laboratory of Industrial Internet and Big Data, China National Light Industry, Beijing Technology and Business University, Beijing 100048, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(11), 6806; https://doi.org/10.3390/app13116806
Submission received: 10 April 2023 / Revised: 26 May 2023 / Accepted: 1 June 2023 / Published: 3 June 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
In recent years, the advancement of deep learning technology has led to excellent performance in synthetic aperture radar (SAR) automatic target recognition (ATR). However, due to the interference of speckle noise, the task of classifying SAR images remains challenging. To address this issue, a multi-scale local–global feature fusion network (MFN) integrating a convolutional neural network (CNN) and a transformer network was proposed in this study. The proposed network comprises three branches: a ConvNeXt-SimAM branch, a Swin Transformer branch, and a multi-scale feature fusion branch. The ConvNeXt-SimAM branch extracts local texture detail features of the SAR images at different scales. By incorporating the SimAM attention mechanism into the CNN block, the feature extraction capability of the model was enhanced from the perspective of spatial and channel attention. Additionally, the Swin Transformer branch was employed to extract SAR image global semantic information at different scales. Finally, the multi-scale feature fusion branch was used to fuse the local features and the global semantic information. Moreover, to overcome the poor accuracy and inefficiency caused by empirically determined model hyperparameters, the Bayesian hyperparameter optimization algorithm was used to determine the optimal model hyperparameters. The model proposed in this study achieved average recognition accuracies of 99.26% and 94.27% for SAR vehicle targets under standard operating conditions (SOCs) and extended operating conditions (EOCs), respectively, on the MSTAR dataset. Compared with the baseline model, the recognition accuracy was improved by 12.74 and 25.26 percentage points, respectively. The results demonstrated that Bayes-MFN enlarges the inter-class distance and reduces the intra-class distance of the SAR images, resulting in more compact classification features and less interference from speckle noise. Compared with other mainstream models, the Bayes-MFN model exhibited the best classification performance.

1. Introduction

As an active microwave remote sensing platform, synthetic aperture radar (SAR) offers all-day, all-weather imaging and a certain degree of penetration capability; hence, it is widely used in military and civilian fields [1]. Unlike optical images, which can be interpreted directly by the human eye, SAR images contain rich target information but are difficult to interpret because of SAR's unique imaging mechanism [2]. In addition, SAR images are difficult to acquire and label, which requires considerable human and material resources. Therefore, SAR automatic target recognition (SAR ATR) has become a research focus in recent years. Due to the presence of speckle noise, SAR images have specific fine-grained characteristics, which are manifested as large intra-class differences and small inter-class differences. As a result, the task of classifying SAR images remains challenging.
With the development of deep learning theory, deep neural networks have been widely applied to various fields [3]. In the field of SAR ATR, compared with methods based on template matching [4] and model matching [5], methods based on deep neural networks have the advantage of automatic feature extraction. Among them, the most representative methods are based on convolutional neural networks (CNNs). CNNs collect local features in a hierarchical way to obtain powerful image representations and can fully extract the local texture details of SAR images. In 2022, Liu et al. [6] proposed a pure convolutional neural network named ConvNeXt, which has become one of the most representative CNNs, with a more concise design and fewer parameters than traditional CNNs. However, limited by the size of the convolution kernel, CNNs are weak at capturing the global features of images. As a result, it is difficult to obtain long-distance relationships between SAR image elements, and the semantic relationships between fine-grained image parts are ignored [7].
With the emergence of transformers [8], the monopoly of CNNs in the field of computer vision is gradually being broken. Unlike CNN architectures, the transformer architecture uses a self-attention mechanism for global relationship modeling. In 2020, Dosovitskiy et al. [9] proposed the vision transformer (ViT) model for image classification. The ViT model is the first pure transformer model applied in the field of computer vision; it solves the input problem of transformers in the image domain by dividing the image into multiple patches and encoding them into sequence vectors, and it has achieved excellent performance in image classification. However, the ViT model has a large number of parameters, a high computational cost, and certain restrictions on the size of the input image. To address the shortcomings of ViT, Liu et al. [10] proposed the Swin Transformer model with a sliding window operation and hierarchical design, which is more efficient. However, because the differences among SAR image sub-categories are slight, the Swin Transformer's ability to extract local detail features of SAR images is insufficient. Its reliance on global semantic information limits its representation ability, while distinguishable detail features are crucial for SAR image classification tasks.
SAR images have prominent fine-grained characteristics, so it is necessary to conduct global modeling to obtain the long-distance dependence relationships within SAR images. However, SAR images have low resolution and blurred edge texture details, which makes it difficult to distinguish SAR targets from the background when modeling them only globally. Therefore, it is also necessary to acquire local features to enhance the ability to capture subtle differences. As a result, to make full use of the detail diversity of SAR images, to suppress speckle noise, and to alleviate the problem of small inter-class differences between SAR images, a three-branch multi-scale feature fusion network called MFN was proposed in this paper for SAR image classification. The MFN model combines the advantages of both the CNN and Swin Transformer architectures. Inspired by the ConvNeXt and Swin Transformer models, MFN comprises three branches: ConvNeXt-SimAM, Swin Transformer, and multi-scale feature fusion. The multi-scale feature fusion branch thoroughly mines shallow detailed features and advanced semantic information from different levels and fuses the local features and global features of SAR images through the feature fusion unit (FFU) to enhance the representation ability of the SAR target classification model. Finally, the Bayesian hyperparameter optimization algorithm was used to optimize the model's hyperparameters to improve its classification performance.
The main contributions of this paper are as follows.
  • Parallel local and global feature branches were designed to fully explore the local and global features at different levels. In addition, the SimAM attention mechanism was added to the local feature module to enhance the learning ability of network features. This enhancement allowed the network to focus more on the texture details of SAR images within the limited feature information and to effectively suppress speckle noise.
  • The multi-scale feature fusion branch was constructed, comprising an FFU and a down-sampling module. This branch adaptively fuses local detailed features and global semantic information of SAR images from different scales.
  • A three-branch multi-scale feature fusion model based on a Bayesian hyperparameter optimization algorithm was proposed, which fully combined the advantages of ConvNeXt and Swin Transformer and achieved excellent results on the MSTAR dataset with fewer parameters and higher recognition accuracy.
The rest of this article is arranged as follows. Section 2 introduces the related work. Section 3 introduces the design idea and network structure of Bayes-MFN. In Section 4, we present the experimental results of Bayes-MFN from different angles and compare them with the traditional models. Finally, Section 5 summarizes the study and provides an outlook for future work.

2. Related Work

Our work focuses on three main directions: CNNs, vision transformers, and Bayesian hyperparameter optimization. In this section, we present a few representative approaches that are closely related to our work.

2.1. CNN

CNNs have excellent learning abilities for hierarchical and local features, leading to many improved networks that mine more abstract and deeper features in SAR images. For instance, Babu et al. [11] introduced One-Vs-All (OVA) technology into CNN and achieved high recognition accuracy on the MSTAR dataset.
Yu et al. [12] fused the Gabor features of SAR images with CNN features at different scales to enhance the richness of SAR features. Similarly, Ayodeji et al. [13] proposed a fast vision decoder that uses a residual-based deep CNN as the backbone of the model together with a stacked denoising autoencoder (SDAE). Zhang et al. [14] proposed the separation measure CNN (SM-CNN) model, which achieved excellent performance on the MSTAR and OpenSARShip datasets by introducing the maximum coding rate reduction principle into the backbone network. Lang et al. [15] embedded a cascaded multi-domain attention module based on the discrete cosine transform and the discrete wavelet transform into a CNN, enhancing its feature extraction ability under limited samples and achieving good performance on the MSTAR dataset.
Due to the influence of speckle noise in SAR images, they exhibit specific fine-grained characteristics, and the importance of global contextual relationships cannot be ignored. Although the above methods have achieved good results, they only consider the importance of local features and overlook the importance of global dependencies in SAR ATR.

2.2. Visual Transformer

In recent years, models based on the transformer architecture have become popular in the field of computer vision. However, transformer models tend to have weak local perception. To address this problem, T2T-ViT [16] models local features better by recursively aggregating adjacent tokens into a single token. Similarly, ConViT [17] adjusts ViT by introducing gated positional self-attention (GPSA), which imitates convolution operations to obtain an inductive bias. Both CMT [18] and ResT [19] mix convolution and transformer modules in a serial manner, using transformers and CNNs to capture long-distance dependencies and local features, respectively. However, such a serial approach interferes with both local and global features. Peng et al. [20] proposed the parallel Conformer model, which retains both local and global features to the greatest extent, and verified the feasibility of the parallelized design.
In the field of remote sensing image classification, the combination of CNNs and transformers is relatively rare. Wang et al. [21] proposed a convolutional transformer (ConvT) for SAR ATR few-shot learning (FSL). This model integrates the local features of a CNN and the global dependency of a transformer with a novel loss for limited SAR images, and its effectiveness was verified on the MSTAR dataset. Li et al. [22] used the FESwin module as the backbone network, in which a CNN aggregates the contextual information perceived before and after the Swin Transformer model, and carried out scale fusion based on the visual dependencies captured by self-attention, achieving good detection accuracy on the SSDD and SARShip datasets. The above multi-feature fusion methods mainly use concatenation to combine features, which may cause interference when merging features with different attributes, resulting in poor performance and weak fusion generalization.
Liu et al. [23] designed an end-to-end local–global network for high-resolution SAR image classification that uses CNN and vision transformer models to extract local and global features, respectively, mines complementary information through a feature fusion module, and achieves superior performance. Unlike these existing studies, a three-branch parallel multi-scale feature fusion SAR classification model was proposed here so that the local and global branches minimize mutual interference. Moreover, the Bayesian optimization algorithm was used for the first time to search for the hyperparameters of the proposed feature fusion model.

2.3. Bayesian Hyperparameter Optimization Algorithm

In machine learning and deep learning tasks, adjusting model hyperparameters is an essential process that has a crucial impact on model performance [24]. Hyperparameters are usually tuned manually by researchers according to personal experience. To make the selection process more efficient, several hyperparameter optimization algorithms have been proposed, such as the grid search algorithm and the Bayesian optimization algorithm [25]. The Bayesian optimization algorithm builds a probabilistic surrogate model of the objective function based on historical evaluation results, which makes full use of previous evaluation information to select the next set of hyperparameters and significantly reduces the number of evaluations [26]. Therefore, compared with traditional approaches such as manual tuning and grid search, it occupies fewer computing resources and is more efficient.
Rizaev et al. [27] used the Bayesian optimization algorithm to determine the hyperparameters of the AlexNet model and achieved good classification performance on their synthetic SAR dataset of ship wakes, which verified the feasibility of the Bayesian optimization algorithm. For the MFN model proposed in this paper, the Bayesian hyperparameter optimization algorithm was used to search for the optimal set of model hyperparameters in order to enhance classification performance, improve training efficiency, and save human and computing resources. To the best of our knowledge, there are few studies applying the Bayesian hyperparameter optimization algorithm in the field of SAR ATR, so the work in this paper is an early exploration in this direction.

3. Proposed Method

3.1. Overview

To effectively obtain the local detail features and the global semantic representation of SAR vehicle targets at different scales, we combined the advantages of the ConvNeXt and Swin Transformer models to design a classification model called MFN, which integrates local–global features at different scales. The structure of the MFN model is shown in Figure 1. The proposed MFN comprises three branches: ConvNeXt-SimAM, Swin Transformer, and multi-scale feature fusion. The ConvNeXt-SimAM and Swin Transformer branches are each divided into three stages to build feature maps at different levels; the three stages correspond to 4, 8, and 16 times down-sampling. As the stages progress, the height and width of the feature map decrease and the number of channels increases, while the feature dimension within a stage remains unchanged. Finally, local and global features from different scales are fused and classified by the multi-scale feature fusion branch. The specific parameters of the model are detailed in Section 4.2.

3.2. Network Structure

3.2.1. ConvNeXt-SimAM Branch

To fully extract the local detail features of SAR vehicle targets, we constructed the ConvNeXt-SimAM branch by stacking CNN blocks and down-sampling operations, and the whole branch is divided into three stages, as shown in Figure 1. To align with the size of the feature map of each stage in the Swin Transformer branch, the height and width of the feature map of each stage in the ConvNeXt-SimAM branch are reduced by half of the previous stage through a down-sampling convolution operation, while doubling the number of channels. Moreover, the shallow local fine-grained and deep coarse-grained information from stage 1 and stage 3, respectively, are extracted and sent to the multi-scale feature fusion branch, which can fuse local features of different scales to enhance the feature representation ability of the model.
The CNN block architecture is shown in Figure 2a. Inspired by ConvNeXt, the CNN block comprises a sequence of group convolutions with a convolution kernel size of 7 × 7, layer normalization (LN), convolutions with a kernel size of 1 × 1, GELU activation functions, SimAM attention modules, and convolutions with a kernel size of 1 × 1. Without introducing additional learnable parameters, we inserted the SimAM [28] attention module after the second convolutional layer (conv k1, 4 × dim) of the CNN block, enhancing the local feature extraction ability of the CNN block so as to fully extract SAR vehicle detail features and to suppress speckle noise. Unlike existing channel attention or spatial attention mechanisms, the SimAM module derives three-dimensional attention weights from the feature maps over the channel and spatial dimensions without adding extra parameters, which flexibly and effectively enhances the representation ability of convolutional modules. Yang et al. [28] defined an energy function for each neuron on the feature map based on findings in neuroscience; for an input feature $X \in \mathbb{R}^{C \times H \times W}$, each channel therefore has $H \times W$ energy functions. The importance of each neuron is evaluated by minimizing the energy function, and a higher weight is given to neurons with a greater influence, thus enhancing the ability to extract local features and effectively suppressing noise interference. The minimized energy function is given in Equation (1):
$$ e_t^* = \frac{4(\sigma^2 + \lambda)}{(t - \mu)^2 + 2\sigma^2 + 2\lambda} \qquad (1) $$
where $t$ is the target neuron in a single channel of $X$, that is, the pixel value; $\mu$ and $\sigma^2$ are the mean and variance of the pixel values of all neurons on that channel's feature map, respectively; and $\lambda$ is the regularization coefficient. The lower the energy of neuron $t$, the higher its importance.
The final output of input feature X after the SimAM attention mechanism is as follows:
$$ \tilde{X} = \mathrm{sigmoid}\!\left(\frac{1}{E}\right) \otimes X \qquad (2) $$
where $E$ groups all $e_t^*$ across the channel and spatial dimensions, and $\otimes$ denotes element-wise multiplication [30].
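For concreteness, the following PyTorch sketch shows how the parameter-free SimAM weighting of Equations (1) and (2) can be implemented. It is a minimal sketch following the formulation above; the default regularization coefficient of 1e-4 is an assumption rather than a value stated in this paper.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: each activation is weighted by sigmoid(1/E),
    where E is the minimal energy of Equation (1)."""
    def __init__(self, lambda_: float = 1e-4):  # regularization coefficient (assumed value)
        super().__init__()
        self.lambda_ = lambda_

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); mean and variance are computed per channel over the H*W positions
        _, _, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)        # (t - mu)^2
        var = d.sum(dim=(2, 3), keepdim=True) / n                # sigma^2
        inv_energy = d / (4 * (var + self.lambda_)) + 0.5        # equals 1/E from Eq. (1)
        return x * torch.sigmoid(inv_energy)                     # Eq. (2)
```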
In the CNN block, a group convolution $f_{d7\times7}$ with a convolution kernel size of 7 × 7 and dim channels is performed first, where the number of groups equals the number of channels; using group convolution reduces the computation of the model. Subsequently, layer normalization is applied, and a convolution $f_{1\times1}$ with a kernel size of 1 × 1 and 4 × dim channels is used to expand the dimension. Finally, the GELU activation function, the SimAM module, and a convolution $f_{1\times1}$ with a kernel size of 1 × 1 and dim channels perform the dimensionality reduction. The CNN block is described by Equation (3):
$$ C_i = f_{1\times1}\big(\mathrm{SimAM}\big(\mathrm{GELU}\big(f_{1\times1}\big(\mathrm{LN}\big(f_{d7\times7}(C_{i-1})\big)\big)\big)\big)\big) + C_{i-1} \qquad (3) $$
where $C_i$ represents the output feature of the CNN block, and LN is a layer normalization layer.
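A minimal PyTorch sketch of this block, assuming the SimAM module shown earlier, might look as follows; the padding choice and the exact normalization placement are implementation details not fixed by the text.

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """ConvNeXt-SimAM block following Eq. (3): depthwise 7x7 conv -> LN -> 1x1 conv (dim -> 4*dim)
    -> GELU -> SimAM -> 1x1 conv (4*dim -> dim), wrapped in a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # f_d7x7
        self.norm = nn.LayerNorm(dim)
        self.expand = nn.Conv2d(dim, 4 * dim, kernel_size=1)      # f_1x1, dimension expansion
        self.act = nn.GELU()
        self.simam = SimAM()                                      # module sketched above
        self.reduce = nn.Conv2d(4 * dim, dim, kernel_size=1)      # f_1x1, back to dim channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x                                              # C_{i-1}
        x = self.dwconv(x)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LayerNorm over channels
        x = self.reduce(self.simam(self.act(self.expand(x))))
        return x + shortcut
```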

3.2.2. Swin Transformer Branch

The workflow of the Swin Transformer branch is shown in Figure 1. Feature maps of different sizes are constructed by the Swin Transformer branch through three stages. In stage 1, the feature map is first segmented by the patch partition and linear embedding layers and flattened along the channel direction. In the subsequent stages, down-sampling is performed by a patch merging layer and the channel dimension is increased to build a multi-stage hierarchical architecture. Each stage stacks N Trans blocks, whose structure is shown in Figure 2b. Because the Trans block contains two structures, W-MSA and SW-MSA, which are always used in pairs, N takes an even value; to avoid a large increase in the number of model parameters, N was set to 2. The introduction of the Swin Transformer branch compensates for the ConvNeXt-SimAM branch by enhancing the model's ability to capture global semantic information.
As shown in Figure 1, the feature map size becomes 32 × 32 × 96 after the input image passes through stage 1. Subsequently, the feature map is down-sampled by the patch merging operation: the height and width of the feature map are halved, and the number of channels is doubled. The structure of the Trans block is shown in Figure 2b. First, the feature map $S_{i-1}$ passes through a Layer Norm layer and enters the window multi-head self-attention (W-MSA) module, which dramatically reduces the computational complexity by calculating self-attention within each window. Then, it passes through a convolution layer with a kernel size of 1 × 1 and a GELU activation function, and a residual connection is applied after each module. The shifted window multi-head self-attention (SW-MSA) module is introduced in the next module, and the other layers remain unchanged; SW-MSA enables interaction between different windows. The calculation process of the Trans block is as follows:
$$ w_i = \mathrm{GELU}\big(f_{1\times1}\big(\mathrm{W\text{-}MSA}\big(\mathrm{LN}(S_{i-1})\big)\big)\big) + S_{i-1} \qquad (4) $$
$$ S_i = \mathrm{GELU}\big(f_{1\times1}\big(\mathrm{SW\text{-}MSA}\big(\mathrm{LN}(w_i)\big)\big)\big) + w_i \qquad (5) $$
where $w_i$ is the output feature of the W-MSA layer and $S_i$ is the output feature of the Trans block.
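The sketch below illustrates this two-unit composition (Equations (4) and (5)) in PyTorch. It is a simplified stand-in rather than the authors' implementation: the window attention uses a plain multi-head attention computed per window, and Swin's relative position bias and shifted-window attention mask are omitted for brevity.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention computed independently inside non-overlapping windows
    (a simplified stand-in for Swin's W-MSA; relative position bias omitted)."""
    def __init__(self, dim: int, window_size: int = 7, num_heads: int = 3):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C), H and W divisible by ws
        b, h, w, c = x.shape
        ws = self.ws
        # partition into (B * num_windows, ws*ws, C) token sequences
        x = x.view(b, h // ws, ws, w // ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)
        # merge the windows back to (B, H, W, C)
        x = x.view(b, h // ws, w // ws, ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, h, w, c)

class TransBlock(nn.Module):
    """Two consecutive units implementing Eqs. (4)-(5): LN -> (S)W-MSA -> 1x1 projection -> GELU,
    each wrapped in a residual connection. The window shift uses torch.roll."""
    def __init__(self, dim: int, window_size: int = 7, num_heads: int = 3):
        super().__init__()
        self.ws = window_size
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.wmsa = WindowAttention(dim, window_size, num_heads)
        self.swmsa = WindowAttention(dim, window_size, num_heads)
        self.proj1, self.proj2 = nn.Linear(dim, dim), nn.Linear(dim, dim)  # f_1x1 on tokens
        self.act = nn.GELU()

    def forward(self, s: torch.Tensor) -> torch.Tensor:   # s: (B, H, W, C)
        w = self.act(self.proj1(self.wmsa(self.norm1(s)))) + s             # Eq. (4)
        shift = self.ws // 2
        x = torch.roll(self.norm2(w), shifts=(-shift, -shift), dims=(1, 2))
        x = self.swmsa(x)
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
        return self.act(self.proj2(x)) + w                                  # Eq. (5)
```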

3.2.3. Multi-Scale Feature Fusion Branch

To improve the recognition accuracy of the SAR image classification model, it is necessary to fuse local and global features from different scales. Meanwhile, to preserve both local and global features as much as possible, a multi-scale feature fusion branch was implemented. The multi-scale feature fusion branch can adaptively fuse local features, global representations from different scales, and semantic information from the previous level fusion based on the input features. The workflow of the multi-scale feature fusion branch is shown in Figure 1. We capture the shallow local and global features of stage 1 and the deep local and global semantic information of stage 3 through a multi-scale feature fusion branch. Finally, the feature information of each fusion path is fused to gradually fill in the semantic gaps, and the final output is obtained by connecting the global average pooling layer and the fully connected layer.
As shown in Figure 3, the multi-scale feature fusion branch consists of the FFU and a down-sampling operation. The FFU comprises an LN layer, a convolution layer with a kernel size of 1 × 1, and a hard swish activation function. The specific process for stage 3 is shown in Figure 2c. First, the FFU output feature $F_{i-2}$ of the previous fusion stage is down-sampled through a convolution layer with a kernel size of 1 × 1 and an average pooling layer (Avgpool), which also aligns the channel dimensions for the subsequent fusion operation. The down-sampled feature $\hat{F}_{i-2}$ is fused with the local feature $C_i$ and the global feature $S_i$ from the ConvNeXt-SimAM and Swin Transformer branches, respectively, and normalized by the LN layer. Subsequently, a convolutional layer with a kernel size of 1 × 1 and a hard swish activation function are applied. The resulting fused feature $F_i$ effectively captures both the local features and the global semantic information of the SAR image and is finally classified through the global average pooling and fully connected layers. This stage is described by the following equations.
$$ \hat{F}_{i-2} = \mathrm{Avgpool}\big(f_{1\times1}(F_{i-2})\big) \qquad (6) $$
$$ F_i = \mathrm{hardswish}\big(f_{1\times1}\big(\mathrm{LN}\big(\mathrm{Concat}(\hat{F}_{i-2}, S_i, C_i)\big)\big)\big) \qquad (7) $$
$$ O = \mathrm{FC}\big(\mathrm{glo\_avg\_pool}(F_i)\big) \qquad (8) $$
where $O$ represents the final output of the model, FC represents the fully connected layer, and glo_avg_pool represents global average pooling.
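As an illustration, the following PyTorch sketch wires up the stage-3 fusion of Equations (6)-(8). The channel widths follow Table 1 (a 96-channel stage-1 FFU output and 384-channel stage-3 features), but the exact wiring is our reading of the figure and table, not code released by the authors.

```python
import torch
import torch.nn as nn

class FusionStage3(nn.Module):
    """Stage-3 FFU: align the stage-1 fusion output (Eq. (6)), concatenate it with the local
    feature C_i and global feature S_i, apply LN -> 1x1 conv -> hard swish (Eq. (7)),
    then classify via global average pooling and a fully connected layer (Eq. (8))."""
    def __init__(self, in_prev=96, in_local=384, in_global=384, out_dim=384, num_classes=10):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_prev, out_dim, kernel_size=1),        # f_1x1 channel alignment
            nn.AvgPool2d(kernel_size=2, stride=4),             # Avgpool k2, s4: 32x32 -> 8x8
        )
        fused = out_dim + in_local + in_global
        self.norm = nn.LayerNorm(fused)                        # LN over concatenated channels
        self.proj = nn.Conv2d(fused, out_dim, kernel_size=1)   # f_1x1 after LN
        self.act = nn.Hardswish()
        self.head = nn.Linear(out_dim, num_classes)            # FC classifier, Eq. (8)

    def forward(self, f_prev, c_i, s_i):
        f_hat = self.down(f_prev)                                      # Eq. (6)
        x = torch.cat([f_hat, s_i, c_i], dim=1)                        # Concat along channels
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)       # channels-last LN
        f_i = self.act(self.proj(x))                                   # Eq. (7)
        return self.head(f_i.mean(dim=(2, 3)))                         # global average pool + FC
```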

3.3. MFN Based on Bayesian Hyperparameter Optimization Algorithm (Bayes-MFN)

Manual hyperparameter tuning is time-consuming and labor-intensive and may not find the optimal combination of hyperparameters because a deep neural network is a complex model that is often treated as a black box. To establish the optimal neural network model, it is therefore crucial to optimize the hyperparameters.
Compared with traditional grid search and random search methods, the Bayesian optimization framework is sequential: it uses prior knowledge to approximate the posterior distribution of the unknown objective function and then selects the next hyperparameter combination accordingly, thereby avoiding a large number of unnecessary evaluations. The Bayesian optimization method has gained popularity in many fields due to its ability to find the optimum of a complex objective function with a limited number of function evaluations.
Hyperparameters can be divided into two groups: those for model training and those for model structure. Since the design of the proposed model aimed to achieve better classification performance with fewer learnable parameters, the hyperparameters of model structure were fixed. Proper selection of the hyperparameters related to model training can enable the neural network to learn faster and to perform better. In the training process, the batch size affects the generalization performance of the model, and the learning rate (lr) determines the convergence state of the model [31]. In addition, the warmup and decay (wd) hyperparameters are closely related to the stability of the deep neural network model. An appropriate epoch size can help to avoid overfitting or underfitting of the model. Therefore, it is necessary to perform Bayesian optimization for those four hyperparameters, with the loss function as the objective function.
This study adopted the Bayesian optimization algorithm provided by the HyperOpt package [32] in Python to achieve hyperparameter optimization in fewer steps [33]. The tree-structured Parzen estimator (TPE) search algorithm [34] was chosen for efficient hyperparameter optimization. The hyperparameter optimization process is as follows: (1) define an objective function to minimize, (2) define the hyperparameters and the search space to be explored, and (3) select the search algorithm [35]. The TPE algorithm was then used to find the optimal combination of hyperparameters by minimizing the loss function.
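A minimal sketch of this workflow with the HyperOpt library is shown below. The search ranges and the `max_evals` budget are illustrative assumptions (they are not reported in the paper), and `train_and_evaluate` is a hypothetical placeholder for the MFN training-and-validation routine whose loss is minimized.

```python
from hyperopt import fmin, tpe, hp, Trials

# Search space over the four training hyperparameters tuned in this paper.
# The bounds below are assumptions chosen for illustration only.
space = {
    "epoch": hp.choice("epoch", [100, 150, 180, 200]),
    "batch_size": hp.choice("batch_size", [8, 16, 32]),
    "lr": hp.loguniform("lr", -10, -4),   # roughly 4.5e-5 .. 1.8e-2
    "wd": hp.uniform("wd", 0.0, 0.1),
}

def objective(params):
    """Train MFN with the sampled hyperparameters and return the validation loss to minimize.
    `train_and_evaluate` is a hypothetical stand-in for the user's own training routine."""
    return train_and_evaluate(epochs=params["epoch"],
                              batch_size=params["batch_size"],
                              lr=params["lr"],
                              weight_decay=params["wd"])

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print("Best hyperparameter setting found:", best)
```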

4. Experiments and Results

4.1. MSTAR Datasets

To evaluate the performance of the proposed model, we used the moving and stationary target acquisition and recognition (MSTAR) dataset [36] published by the Defense Advanced Research Projects Agency and the Air Force Laboratory of the United States in the mid-1990s.
The SAR military vehicles in the MSTAR dataset were collected by a high-resolution spotlight synthetic aperture radar with a 0.3 m resolution, operating in the X-band and HH polarization mode. The acquisition conditions can be divided into standard operating conditions (SOCs) and extended operating conditions (EOCs) [37]. SOC refers to situations in which the depression and aspect angles of the training and test samples differ slightly, but the serial numbers and target configurations of the samples are the same. Conversely, EOC refers to situations in which the training and test sets differ significantly in pitch angle, target configuration, and other factors, making it a more complex classification task than SOC. The MSTAR dataset contains SAR images of various sizes, including 128 × 128, 158 × 158, and 177 × 178 pixels. To maintain consistency in target size, the original images were uniformly cropped to 128 × 128 pixels without affecting the targets.
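A simple torchvision-based preprocessing sketch for this cropping step is given below; the single-channel conversion and the absence of any normalization are assumptions, since the paper does not specify these details.

```python
from torchvision import transforms

# Crop MSTAR chips of varying sizes to a common 128 x 128 input, as described above.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # SAR magnitude chips treated as single-channel (assumed)
    transforms.CenterCrop(128),                   # uniform 128 x 128 crop around the centered target
    transforms.ToTensor(),
])
```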
In SOC classification, there are 10 types of vehicle targets, including BMP2 (infantry combat vehicle), BRDM2 (armored reconnaissance vehicle), BTR60 (armored transport vehicle), BTR70 (armored transport vehicle), D7 (bulldozer), T62 (tank), T72 (tank), ZIL131 (cargo truck), ZSU234 (self-propelled anti-aircraft gun), and 2S1 (self-propelled howitzer). The training and test samples were images taken at 17° and 15° pitch angles, respectively. Figure 4 shows the optical and SAR images of different vehicle targets in the MSTAR dataset. It is evident from the figure that SAR images have low resolution and blurred target edge information, making it difficult to distinguish each vehicle target visually.
In EOC classification, there are large differences between the training and test sets, such as different pitch angles; however, the target categories in the dataset are more similar than in SOC. Therefore, the classification task is more complex, and the model’s robustness is more critical than under SOC. Based on previous research [38], this study focused on three types of ground articulated vehicle targets: T72, BRDM2, and ZSU234. The training and test samples were taken at 17° and 30° pitch angles, respectively.

4.2. Experimental Setting

Table 1 shows the architecture details of the MFN model on the SAR military vehicle dataset. Input images with a resolution of 128 × 128 were simultaneously processed by the ConvNeXt-SimAM and Swin Transformer branches. Each branch was divided into three stages to obtain local and global features at different scales. Subsequently, fusion features containing rich SAR feature information were obtained by the multi-scale feature fusion branch. After global average pooling, the SAR features were classified by a fully connected layer.
The MFN model parameters are detailed in Table 1. In the convolution layer parameters, d7 × 7, 96 is shorthand in which d denotes group convolution, 7 × 7 the convolution kernel size, and 96 the number of channels. Avgpool k2, s4 indicates an average pooling operation with a pooling window size of 2 × 2 and a stride of 4. In the Swin Transformer branch, window size = 7 × 7 and head = 3 indicate a multi-head self-attention mechanism with a window size of 7 × 7 and three heads.
To evaluate the performance of the proposed model, it was tested on a computer with an AMD Ryzen 7-5800 CPU, an NVIDIA GeForce RTX 3060 GPU, and 16 GB of memory. The experiments were implemented in Python 3.6.6 with the PyTorch 1.7.1 deep learning framework. In this study, we used a cross-entropy loss with label smoothing and optimized the model with the AdamW gradient descent algorithm.
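The following sketch shows one way to set up this training configuration in PyTorch. The smoothing factor of 0.1 is an assumption (not stated in the paper), the built-in `label_smoothing` argument requires PyTorch >= 1.10 (newer than the 1.7.1 used by the authors), and `BayesMFN` and `train_loader` are hypothetical names for the proposed network and the MSTAR data loader.

```python
import torch
import torch.nn as nn

model = BayesMFN(num_classes=10)                     # hypothetical constructor for the proposed network
criterion = nn.CrossEntropyLoss(label_smoothing=0.1) # label-smoothing cross-entropy (factor assumed)
optimizer = torch.optim.AdamW(model.parameters(), lr=3.9e-3, weight_decay=0.0335)  # SOC values from Table 2

for images, labels in train_loader:                  # train_loader yields preprocessed MSTAR batches
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```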

4.3. Results and Analysis of MSTAR Experiments

4.3.1. Bayesian Hyperparameter Optimization Results

The Bayesian method was used to optimize the four hyperparameters of MFN: epoch, learning rate (lr), warmup and decay (wd) hyperparameters, and batch size. The results of the Bayesian optimization in SOC and EOC experiments are shown in Table 2. The following experiment utilized the optimal hyperparameter set to classify SAR images.

4.3.2. Results from SOC Experiments

In the SOC experiment, detailed information on the 10 types of SAR vehicle targets is shown in Table 3. There were 2747 training samples of SAR vehicle data at a 17° pitch angle and 2425 test samples at a 15° pitch angle.
The confusion matrix of the recognition results of the proposed model for the 10 types of vehicle targets in the SOC experiments is shown in Figure 5. The test data were concentrated mainly on the diagonal, indicating that most targets were correctly classified into their categories, and the overall recognition accuracy reached 99.26%. In the SOC experiments, the average precision, recall, and F1-score of the proposed model reached 99.18%, 99.29%, and 99.23%, respectively. The precision for the 2S1, BRDM2, and BTR70 classes reached 100%. The confusion matrix shows that ZIL131 was recognized least reliably, with some ZIL131 samples wrongly identified as BRDM2 and BTR60, potentially due to the high similarity of ZIL131, BRDM2, and BTR60 as wheeled vehicles. In general, the proposed model demonstrated good recognition performance in the SOC experiments, validating its effectiveness.
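The per-class evaluation reported here can be reproduced with scikit-learn as sketched below; `y_true` and `y_pred` stand for the test-set labels and the model's predicted classes (integers 0-9 for the ten SOC vehicle types) and are assumed to be available.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Confusion matrix plus macro-averaged precision, recall, and F1-score.
cm = confusion_matrix(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```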

4.3.3. Results from EOC Experiments

For the EOC experiments, detailed information on the three types of articulated SAR vehicle targets is shown in Table 4. There were 896 training samples of SAR vehicle data at a 17° pitch angle and 384 test samples at a 30° pitch angle. With fewer SAR data samples, higher similarity among target categories, and a large difference in pitch angles, the recognition task was more challenging than in the SOC experiments, requiring higher model performance.
The confusion matrix of the proposed model for the three types of vehicle targets in the EOC experiments is shown in Figure 6. The recognition accuracy was slightly lower than the classification accuracy for the 10 types of targets; however, it still reached 94.27%. In the EOC experiments, the average precision, recall, and F1-score of the proposed model reached 94.39%, 94.40%, and 94.24%, respectively. These results show that the proposed model has good generalization and robustness.

4.4. Data Visualization

To visually assess the processing effect of the model, the t-distributed stochastic neighbor embedding algorithm (t-SNE) [39] was used in this study to visualize the distribution of SAR data before and after the classification of models under SOCs and EOCs on the MSTAR dataset, as shown in Figure 7 and Figure 8.
As shown in Figure 7a and Figure 8a, the distribution of the SAR image data before model processing is scattered and difficult to distinguish under both SOCs and EOCs. After processing by the proposed model, the inter-class distance was effectively increased, the intra-class distribution became more compact, and the SAR image data were easier to distinguish, as shown in Figure 7b and Figure 8b, which further demonstrates the effectiveness of the proposed model.
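A sketch of this visualization with scikit-learn and matplotlib is given below; `features` stands for an (N, D) array of either flattened raw images (before) or penultimate-layer activations of Bayes-MFN (after), and `labels` holds the class index of each sample. Both are assumed inputs, and the perplexity value is a common default rather than a setting reported in the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# 2-D t-SNE embedding of SAR features, as used for Figures 7 and 8.
embedding = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of SAR features")
plt.show()
```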

4.5. Ablation Experiments

Table 5 presents the results of the ablation experiments conducted to evaluate the individual influence of each component on the proposed model’s performance on the MSTAR dataset. The evaluation process involved starting from the local path (ConvNeXt branch); adding the Swin Transformer branch, SimAM attention module, and Bayesian optimization hyperparameters; and finally forming our model.
The accuracies of the ConvNeXt branch under SOCs and EOCs were 86.52% and 69.01%, respectively. The accuracy increased by 8.61 and 23.7 percentage points after the Swin Transformer branch was added, indicating that considering only local features is insufficient: fusing local detail features with global semantic features significantly improved the performance of the model, validating the proposed approach. After adding the SimAM attention module to the ConvNeXt branch, the accuracy improved by a further 1.08 and 0.26 percentage points under SOCs and EOCs, respectively, demonstrating the attention mechanism's ability to enhance the model's feature extraction and representation. Finally, using the Bayesian optimization algorithm to optimize the model's hyperparameters increased the accuracy in the SOC and EOC experiments by 3.05 and 1.3 percentage points, respectively.
In addition, this study compared the influence of the Bayesian hyperparameter optimization algorithm and empirical selection hyperparameters on the classification performance of the model. Under SOCs and EOCs, the results of the hyperparameter set optimized by the Bayesian algorithm and those selected by experience are shown in Table 6. Furthermore, the comparison results of the MFN model using the hyperparameter set selected by experience and those optimized by the Bayesian hyperparameter optimization algorithm are shown in Table 7.
The results demonstrated that using the Bayesian optimization algorithm effectively improved the model’s classification performance under SOC and EOC across all four evaluation indexes. Moreover, the Bayesian hyperparameter optimization algorithm significantly saved time and resources during the model optimization process.

4.6. Experimental Comparison with Existing Methods

To further demonstrate the advantages of the proposed model, it was compared with several mainstream image classification algorithms, namely ConvNeXt, ResNet50, ResNet101, Vision Transformer-tiny (ViT-tiny), Swin Transformer-tiny, and T2T-ViT-14, whose backbone networks are CNNs and transformers. The classification performances of these mainstream models and the proposed model under SOCs and EOCs on the MSTAR dataset are compared in Table 8 and Table 9. To present the results more clearly and intuitively, the data in Table 8 and Table 9 are visualized as bubble diagrams in Figure 9 and Figure 10, respectively. The smaller a bubble and the closer it lies to the upper left corner of the plot, the better the overall performance of the corresponding model.
Table 8 reveals that, under SOCs, the T2T-ViT-14 network had the lowest number of floating-point operations (FLOPs) at 0.61 G, followed by the proposed model at 1.19 G. Table 9 shows that the classification task is more challenging under EOCs, where Bayes-MFN achieved the highest recognition accuracy by a significant margin. Among the compared models, the proposed model had the smallest number of parameters and the highest recognition accuracy. In the SOC experiments, the model had 6.68 M parameters and achieved 99.26% accuracy; in the EOC experiments, it had 6.67 M parameters and achieved 94.27% accuracy. In conclusion, the structural design of the proposed model was more effective: it can extract the local features and global dependency information of SAR images efficiently and simultaneously, with fewer parameters and lower computational requirements, to obtain more accurate results. Overall, the comprehensive performance of the proposed model was better than that of the mainstream algorithms.

5. Conclusions

In SAR automatic target recognition, speckle noise can negatively impact the performance of classification models. To address this issue, an MFN was proposed that preserves SAR local detail features and global semantic information through a parallel design, enhancing SAR texture detail features and alleviating the problems caused by speckle noise, such as large intra-class differences and small inter-class distances. In addition, a Bayesian hyperparameter optimization algorithm was used to optimize MFN's hyperparameters. The experimental results showed that the proposed model has good generalization ability and validity on the MSTAR dataset. Meanwhile, Bayes-MFN achieved excellent performance compared with other mainstream models under both SOCs and EOCs.
However, the FLOPs of the proposed model are still relatively high, and the model needs to be made more lightweight. Moreover, we only tested the performance of Bayes-MFN on the MSTAR dataset, so the evaluation may not be fully convincing. In future studies, we aim to improve the Bayes-MFN model and to reduce its computational complexity without degrading the classification performance. Furthermore, we plan to verify the robustness and generalization of the proposed model on SAR image datasets with more complex background conditions.

Author Contributions

Conceptualization, X.L., X.H. and C.G.; methodology, X.L., X.H. and C.G.; software, X.L. and X.H.; validation, X.L. and X.H.; formal analysis, C.G.; investigation, G.M. and Y.W.; resources, X.L. and C.G.; data curation, X.H., G.M. and Y.W.; writing—original draft preparation, X.L., X.H. and C.G.; writing—review and editing, X.L., X.H. and C.G.; visualization, X.H.; supervision, C.G., Y.W. and Y.G.; project administration, X.L. and C.G.; funding acquisition, X.L., W.G. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Beijing Natural Science Foundation (No. 6214034).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, K.; Zhang, G.; Leng, Y.; Leung, H. Synthetic aperture radar image generation with deep generative models. IEEE Geosci. Remote Sens. Lett. 2018, 16, 912–916. [Google Scholar] [CrossRef]
  2. Gao, F.; Yang, Y.; Wang, J.; Sun, J.; Yang, E.; Zhou, H. A deep convolutional generative adversarial networks (DCGANs)-based semi-supervised method for object recognition in synthetic aperture radar (SAR) images. Remote Sens. 2018, 10, 846. [Google Scholar] [CrossRef] [Green Version]
  3. Wang, L.; Bai, X.; Xue, R.; Zhou, F. Few-shot SAR automatic target recognition based on Conv-BiLSTM prototypical network. Neurocomputing 2021, 443, 235–246. [Google Scholar] [CrossRef]
  4. Novak, L.M.; Owirka, G.J.; Brower, W.S. The automatic target-recognition system in SAIP. Linc. Lab. J. 1997, 10, 187–202. [Google Scholar]
  5. Hummel, R. Model-based ATR using synthetic aperture radar. In Proceedings of the IEEE International Radar Conference, Arlington, VA, USA, 7–12 May 2000; pp. 856–861. [Google Scholar]
  6. Liu, Z.; Mao, H.; Wu, C.Y. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar]
  7. Dong, H.; Zhang, L.; Zou, B. Exploring vision transformers for polarimetric SAR image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5219715. [Google Scholar] [CrossRef]
  8. Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  9. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16 x 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  10. Liu, Z.; Lin, Y.; Cao, Y. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  11. Babu, B.P.; Narayanan, S.J. One-vs-All Convolutional Neural Networks for Synthetic Aperture Radar Target Recognition. Cybern. Inf. Technol. 2022, 22, 179–197. [Google Scholar] [CrossRef]
  12. Yu, Q.; Hu, H.; Geng, X.; Jiang, Y.; An, J. High-performance SAR automatic target recognition under limited data condition based on a deep feature fusion network. IEEE Access 2019, 7, 165646–165658. [Google Scholar] [CrossRef]
  13. Ayodeji, A.; Wang, W.; Su, J. Fast Vision Decoder: A robust Automatic Target Recognition Model for Sar Images. SSRN 4057945. Available online: http://dx.doi.org/10.2139/ssrn.4057945 (accessed on 15 March 2022).
  14. Zhang, Y.; Xia, J.; Gao, X. SM-CNN: Separability Measure based CNN for SAR Target Recognition. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4005605. [Google Scholar] [CrossRef]
  15. Lang, P.; Fu, X.; Feng, C. LW-CMDANet: A Novel Attention Network for SAR Automatic Target Recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6615–6630. [Google Scholar] [CrossRef]
  16. Yuan, L.; Chen, Y.; Wang, T. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 558–567. [Google Scholar]
  17. d’Ascoli, S.; Touvron, H.; Leavitt, M.L. Convit: Improving vision transformers with soft convolutional inductive biases. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2286–2296. [Google Scholar]
  18. Guo, J.; Han, K.; Wu, H. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12175–12185. [Google Scholar]
  19. Zhang, Q.; Yang, Y.B. ResT: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  20. Peng, Z.; Guo, Z.; Huang, W. Conformer: Local features coupling global representations for recognition and detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023. [Google Scholar] [CrossRef] [PubMed]
  21. Wang, C.; Huang, Y.; Liu, X. Global in local: A convolutional transformer for SAR ATR FSL. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4509605. [Google Scholar] [CrossRef]
  22. Li, K.; Zhang, M.; Xu, M. Ship detection in SAR images based on feature enhancement Swin transformer and adjacent feature fusion. Remote Sens. 2022, 14, 3186. [Google Scholar] [CrossRef]
  23. Liu, X.; Wu, Y.; Liang, W.; Cao, Y.; Li, M. High resolution SAR image classification using global-local network structure based on vision transformer and CNN. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4505405. [Google Scholar] [CrossRef]
  24. Dabboor, M.; Atteia, G.; Meshoul, S. Deep Learning-Based Framework for Soil Moisture Content Retrieval of Bare Soil from Satellite Data. Remote Sens. 2023, 15, 1916. [Google Scholar] [CrossRef]
  25. Lacerda, P.; Barros, B.; Albuquerque, C. Hyperparameter optimization for COVID-19 pneumonia diagnosis based on chest CT. Sensors 2021, 21, 2174. [Google Scholar] [CrossRef]
  26. Xu, T.; Chen, Y.; Wang, Y. EMI Threat Assessment of UAV Data Link Based on Multi-Task CNN. Electronics 2023, 12, 1631. [Google Scholar] [CrossRef]
  27. Rizaev, I.G.; Achim, A. SynthWakeSAR: A Synthetic SAR Dataset for Deep Learning Classification of Ships at Sea. Remote Sens. 2022, 14, 3999. [Google Scholar] [CrossRef]
  28. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  29. Li, S.; Wang, S.; Dong, Z.; Li, A.; Qi, L.; Yan, C. PSBCNN: Fine-grained image classification based on pyramid convolution networks and SimAM. In Proceedings of the IEEE international Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Falerna, Italy, 12–15 September 2022; Volume 2022, pp. 1–4. [Google Scholar]
  30. You, H.; Lu, Y.; Tang, H. Plant disease classification and adversarial attack using SimAM-EfficientNet and GP-MI-FGSM. Sustainability 2023, 15, 1233. [Google Scholar] [CrossRef]
  31. Yu, T.; Zhu, H. Hyper-parameter optimization: A review of algorithms and applications. arXiv 2020, arXiv:2003.05689. [Google Scholar]
  32. Bergstra, J.; Komer, B.; Eliasmith, C.; Yamins, D.; Cox, D.D. Hyperopt: A python library for model selection and hyperparameter optimization. Comput. Sci. Discov. 2015, 8, 014008. [Google Scholar] [CrossRef]
  33. Zhang, J.; Wang, Q.; Shen, W. Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library. Chin. J. Chem. Eng. 2022, 52, 115–125. [Google Scholar] [CrossRef]
  34. Bergstra, J.; Bardenet, R.; Bengio, Y. Algorithms for hyper-parameter optimization. In Proceedings of the Advances in Neural Information Processing Systems 24, Granada, Spain, 12–15 December 2011; Volume 24. [Google Scholar]
  35. Kang, K.; Ryu, H. Predicting types of occupational accidents at construction sites in Korea using random forest model. Saf. Sci. 2019, 120, 226–236. [Google Scholar] [CrossRef]
  36. Ross, T.D.; Worrell, S.W.; Velten, V.J.; Mossing, J.C.; Bryant, M.L. Standard SAR ATR evaluation experiments using the MSTAR public release dataset. In Proceedings of the Algorithms for Synthetic Aperture Radar Imagery V—SPIE, Orlando, FL, USA, 15 September 1998; Volume 3370, pp. 566–573. [Google Scholar]
  37. Shi, B.; Zhang, Q.; Wang, D.; Li, Y. Synthetic aperture radar SAR image target recognition algorithm based on attention mechanism. IEEE Access 2021, 9, 140512–140524. [Google Scholar] [CrossRef]
  38. Gao, F.; Huang, T.; Sun, J.; Wang, J.; Hussain, A.; Yang, E. A new algorithm for SAR image target recognition based on an improved deep convolutional neural network. Cogn. Comput. 2019, 11, 809–824. [Google Scholar] [CrossRef] [Green Version]
  39. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
Figure 1. The structure diagram of the MFN model. (The green section represents the Swin Transformer branch. The blue section represents the ConvNeXt-SimAM branch. The yellow section represents the multi-scale feature fusion branch.)
Figure 2. Diagrams of different processes: (a) CNN block architecture; (b) two consecutive Trans block architectures; (c) the feature fusion process in stage 3.
Figure 3. Implementation details of the CNN block, Trans block, and FFU block in stage 3. (The green part represents the Trans block. The blue part represents the CNN block structure. The yellow part is the process from the output feature of the previous FFU to the next FFU through the down-sample operation).
Figure 4. Optical images (top) and SAR images (bottom) of the ten classes of vehicle targets: (a) 2S1; (b) BMP2; (c) BRDM2; (d) BTR60; (e) BTR70; (f) D7; (g) T62; (h) T72; (i) ZIL131; (j) ZSU_23_4.
Figure 5. Ten categories of SAR vehicle target confusion matrix under SOC.
Figure 6. Three categories of SAR vehicle target confusion matrix in EOC experiments.
Figure 7. Data distribution before and after model processing under SOCs. (a) Distribution of raw data in SOC experiments; (b) data distribution after Bayes-MFN model processing in SOC experiments.
Figure 8. Data distribution before and after model processing under EOCs. (a) Raw data distribution in EOC experiments; (b) data distribution after Bayes-MFN model processing in EOC experiments.
Figure 9. Performance comparison of each model under SOCs.
Figure 10. Performance comparison of various models under EOCs.
Table 1. Specific parameters of the MFN model (in the multi-scale feature fusion branch, the arrow indicates the direction of the feature).

Stage | Output Size | ConvNeXt-SimAM Branch | Multi-Scale Feature Fusion Branch | Swin Transformer Branch
- | 32 × 32, 96 | 4 × 4, 96 | - | 4 × 4, 96
Stage 1 | 32 × 32, 96 | [d7 × 7, 96; 1 × 1, 384; 1 × 1, 96] × 2 | 1 × 1, 96 | [window size = 7 × 7, head = 3; 1 × 1, 96] × 2
Stage 2 | 16 × 16, 192 | 2 × 2, 192; [d7 × 7, 192; 1 × 1, 768; 1 × 1, 192] × 2 | 1 × 1, 384 → Avgpool k2, s4 | Patch merging; [window size = 7 × 7, head = 6; 1 × 1, 192] × 2
Stage 3 | 8 × 8, 384 | 2 × 2, 384; [d7 × 7, 384; 1 × 1, 1536; 1 × 1, 384] × 2 | 1 × 1, 384 | Patch merging; [window size = 7 × 7, head = 12; 1 × 1, 384] × 2
Classifier | 1 × 1 | - | global average pooling → 1 × 1, numclass | -

The format of d7 × 7, 96 and similar expressions is the short form of the output, where d represents the group convolution, 7 × 7 represents the convolution kernel size, and 96 represents the number of channels. Avgpool k2, s4 indicates an average pooling operation with a pooling window size of 2 × 2 and a stride of 4.
Table 2. Optimization results of Bayesian hyperparameters in SOC and EOC experiments using the MSTAR dataset.

Hyperparameter | SOC | EOC
epoch | 180 | 180
batch size | 16 | 8
lr | 0.00390 | 0.00016
wd | 0.0335 | 0.0356
Table 3. Detailed information on vehicle targets under SOC in the MSTAR dataset.

Class | Train Depression | Train Number | Test Depression | Test Number
2S1 | 17° | 299 | 15° | 274
BMP2 | 17° | 233 | 15° | 195
BRDM2 | 17° | 298 | 15° | 274
BTR60 | 17° | 256 | 15° | 195
BTR70 | 17° | 233 | 15° | 196
D7 | 17° | 299 | 15° | 274
T62 | 17° | 299 | 15° | 273
T72 | 17° | 232 | 15° | 196
ZIL131 | 17° | 299 | 15° | 274
ZSU_23_4 | 17° | 299 | 15° | 274
Table 4. Detailed information on vehicle targets under EOC in the MSTAR dataset.

Class | Train Depression | Train Number | Test Depression | Test Number
T72 | 17° | 299 | 30° | 133
ZSU_23_4 | 17° | 299 | 30° | 118
BRDM2 | 17° | 298 | 30° | 133
Table 5. Experimental results of MSTAR dataset ablation.

Methods | Accuracy (SOC) | Accuracy (EOC)
ConvNeXt branch | 86.52% | 69.01%
+Swin Transformer branch | 95.13% | 92.71%
+SimAM | 96.21% | 92.97%
+Bayesian optimization | 99.26% | 94.27%
Table 6. Comparison between Bayesian optimization hyperparameters and empirical selection.

Hyperparameters | Bayesian Optimization (SOC) | Bayesian Optimization (EOC) | Empirical Selection (SOC) | Empirical Selection (EOC)
epoch | 180 | 180 | 150 | 150
batch size | 16 | 8 | 16 | 16
lr | 0.00390 | 0.00016 | 1 × 10−4 | 1 × 10−4
wd | 0.0335 | 0.0356 | 0.0500 | 0.0500
Table 7. Comparison of model performance before and after using the Bayesian optimization algorithm in the MSTAR dataset.

Model | Accuracy (SOC) | Precision (SOC) | Recall (SOC) | F1-Score (SOC) | Accuracy (EOC) | Precision (EOC) | Recall (EOC) | F1-Score (EOC)
MFN | 96.21% | 95.95% | 96.18% | 95.99% | 92.97% | 93.01% | 93.19% | 92.93%
Bayes-MFN | 99.26% | 99.18% | 99.29% | 99.23% | 94.27% | 94.39% | 94.40% | 94.24%
Table 8. Performance comparison results of various models under SOCs in the MSTAR dataset.

Model | FLOPs (G) | Parameters (M) | Accuracy
ConvNeXt [6] | 1.46 | 27.81 | 92.37%
ResNet50 [40] | 1.34 | 23.53 | 88.37%
ResNet101 [40] | 2.56 | 42.52 | 93.07%
ViT-tiny [9] | 1.88 | 31.33 | 84.16%
Swin Transformer-tiny [10] | 1.21 | 10.21 | 95.01%
T2T-ViT-14 [16] | 0.61 | 8.67 | 89.24%
Proposed model | 1.19 | 6.68 | 99.26%

FLOPs denotes the number of floating-point operations.
Table 9. Performance comparison results of models in the MSTAR dataset under EOC.

Model | FLOPs (G) | Parameters (M) | Accuracy
ConvNeXt [6] | 1.46 | 27.80 | 76.31%
ResNet50 [40] | 1.34 | 23.51 | 57.55%
ResNet101 [40] | 2.56 | 42.51 | 63.80%
ViT-tiny [9] | 1.88 | 31.31 | 71.61%
Swin Transformer-tiny [10] | 1.22 | 10.20 | 71.61%
T2T-ViT-14 [16] | 0.61 | 8.63 | 79.43%
Proposed model | 1.19 | 6.67 | 94.27%

FLOPs denotes the number of floating-point operations.