1. Introduction
Since Synthetic Aperture Radar (SAR) images outperform optical images in their adaptability to different weather conditions, durability over time, and extent of detection range, SAR has long been of vital interest to numerous researchers in space-based Earth observation. Automatic Target Recognition (ATR) in SAR images, as a paramount means of assisting the manual interpretation of images and providing early warnings for national homeland security, plays an essential role in civil and military target perception in this field [
1,
2], primarily including the reporting of disaster information and prevention of natural disasters, the identification and localization of military targets, etc. Although a great number of approaches have been developed for SAR ATR and have attracted considerable attention [
3,
4] in the field, a multitude of limitations remain to be solved. The primary limitations of SAR ATR are weak image texture information, serious speckle noise, severe geometric distortion, critical structural defects, and high sensitivity to the aspect angle.
Nowadays, mainstream SAR automatic target recognition methods can generally be classified into template-based and model-based methods. The core of the template-based approach is feature extraction and selection, which depends on a foundation of deep domain knowledge. The template-based SAR ATR method focuses on the construction of a feature library; the traditional feature extraction methods, built mainly on handcrafted features, draw on spatial, spectral, texture, shape, and other information. Typical features include texture descriptors [
5], gist [
6], scale-invariant feature transform [
7], gradient histogram [
8], local binary pattern [
9], etc. The template-based approach, with more outline features, significantly reduces computational complexity. However, its limitation lies in the fact that some implicit key features cannot be effectively utilized, which considerably limits or degrades performance, because the feature-based approach is essentially a sparse approach to the feature space. The model-based approach, with model design at its core, mainly works on constructing a physical model of the object, including its shape, geometry, texture, material, and so on. This representation of the physical structure of the object of interest is close to its physical essence, which yields better robustness and generality. Nevertheless, domain-related knowledge is a requisite for reducing the influence of the feature extraction and template-matching technology on SAR images, and the optimal classification features of SAR images cannot be learned by the ATR system independently [
10]. Many works have been concerned with SAR image classifications [
11]. Clemente et al. [
12] presented a method of military vehicles with Krawtchouk Moments to solve the SAR tasks and obtained a good performance in recognition. Sun et al. [
13] proposed a dictionary learning and joint dynamic sparse representation method which is an effective way to recognize SAR images. Kim et al. [
14] used Adaboost-based feature selection for robust ground target detection with SAR and IR sensor fusion, addressing the background clutter noise problem and obtaining excellent results. Nevertheless, traditional manual image interpretation has inferior efficiency, a higher error rate, and greater resource consumption, and thus cannot meet the needs of time-sensitive applications.
The deep learning method based on big data provides a new technical route without manual feature engineering and target modeling. Lin et al. [
15] designed a unique architecture called the convolutional highway unit to extract deep feature representations for classification. Sharifzadeh et al. [
16] applied a new hybrid Convolutional Neural Networks (CNN)-Multilayer Perceptron (MLP) classifier for ship classification in SAR images. Tian et al. [
17] integrated a weighted kernel module (WKM) into a commonplace CNN to improve its feature extraction capability and achieved superior performance. Ma et al. [
18] utilized a CNN-based method for ship classification and detection using GF-3 SAR images. Recent advances in CNNs have been widely borne out on various SAR targets, but SAR ATR still suffers from insufficient training samples and suboptimal model design.
To the best of our knowledge, SAR target recognition depends on substantial labeled training images to train a robust classification model, but is widely limited by spatial scope, region accessibility, and the high cost of data acquisition. In this regard, this paper specifically serves to mitigate the dependence on large training sets by designing a multi-level feature fusion attention network that combines a feature fusion method with an attention-based classification mechanism. For a better measure of quality of the model predictions, a novel optimization method with batch normalization is also designed in this paper. Meanwhile, transfer learning is introduced to validate the optimized model by using pre-trained weights for a new classification task.
The contributions are as follows:
1. MFFA-SARNET: A deep learning architecture exploiting a multi-level feature fusion scheme is utilized to refine the extracted features and subsequently discard background features learned from the SAR targets, considerably facilitating the function of weight distribution and task focus;
2. A dual optimized loss for training optimization: A dual optimized loss is composed of two losses, with one to encourage the interclass dissimilarity, and the other to serve as the constraint to balance network optimization, the combination of which has considerably ameliorated the discriminative power to accomplish the SAR classification task;
3. Transfer learning adaptation: the theory of transfer learning is utilized to reinforce the feature representation in the case of small samples, which indicates that the performance of the proposed method surpasses those of other advanced works;
4. Small-sample classification task: the proposed network validates its superiority in working with small samples under three different configurations, significantly reducing the data dependence and enabling insight into the raw images.
The remainder of the paper is organized as follows.
Section 2 presents a brief introduction to the basic related work from previous researchers, while
Section 3 expounds the notions of the proposed methods. After the analysis and results of the proposed methods are unveiled in
Section 4,
Section 5 draws a short conclusion of the whole paper.
2. Related Work
In recent years, the SAR ATR task has obtained quite a few preliminary results. Researchers often focus on extracting robust features; methods based on mathematical transformations, comprising linear and nonlinear feature extraction, are widely applied in automatic target recognition for SAR images. The data are analyzed and transformed by mathematical methods so that they can be better represented in the feature space by more discriminative features. Orthogonal transforms, such as the K-L transform, Hough transform, wavelet transform, Radon transform, and Mellin transform [
19,
20], can be recruited to extract the orthogonal component of the target and reduce the correlation between the image pixels and the feature dimension of the feature space. In addition, in SAR ATR tasks, the main linear feature extraction methods include Principal Component Analysis (PCA) [
21] and linear discriminant analysis (LDA) [
22] based on the Fisher criterion. The results of both methods on the Moving and Stationary Target Acquisition and Recognition (MSTAR) database have verified the effectiveness of PCA and LDA for SAR image feature extraction.
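As an illustration of such linear feature extraction, the following sketch projects flattened image chips onto their top principal components. The data here are randomly generated stand-ins, not MSTAR samples, so the block only illustrates the K-L/PCA mechanics:

```python
import numpy as np

def pca_features(chips, k):
    """Project flattened image chips onto the top-k principal components.

    chips: (n_samples, n_pixels) array of flattened image chips.
    Returns the (n_samples, k) features and the component matrix.
    """
    mean = chips.mean(axis=0)
    centered = chips - mean
    # Eigen-decomposition of the covariance matrix (the K-L transform).
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]   # indices of the top-k components
    components = eigvecs[:, order]
    return centered @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))              # 100 chips of 8x8 pixels, flattened
feats, comps = pca_features(X, k=10)
print(feats.shape)                          # (100, 10)
```

The reduced features can then be fed to any classifier; LDA differs mainly in that its projection maximizes class separability rather than variance.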
Apart from the elements mentioned above, sparse representation theory has also attracted the interest of a myriad of researchers and has been deployed in numerous fields of image processing, such as dictionary learning, image denoising, and so forth. For instance, Yang et al. initiated an efficient and reliable classification method called Sparse Representation Classification (SRC), which constructs an over-complete library used for the linear representation of testing samples. In [
23], sparse representation is engaged in SAR target classification with 2D canonical correlation analysis, which gives satisfying results. Moreover, Yu et al. [
24] propounded a method by a joint sparse and dense representation of the monogenic signal, greatly decreasing the complexity of the algorithm and enhancing the performance.
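The classify-by-reconstruction-residual idea behind SRC can be sketched as follows. This minimal version replaces the sparse solver of full SRC with a per-class least-squares fit and uses synthetic data, so it only illustrates the principle:

```python
import numpy as np

def src_classify(x, dictionaries):
    """Assign x to the class whose training atoms reconstruct it best.

    dictionaries: {label: (n_pixels, n_atoms) matrix of training samples}.
    A least-squares fit stands in for the sparse coding step of full SRC.
    """
    residuals = {}
    for label, D in dictionaries.items():
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)
        residuals[label] = np.linalg.norm(x - D @ coef)
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(2)
D0 = rng.normal(size=(16, 5))
D1 = rng.normal(size=(16, 5))
x = D0 @ rng.normal(size=5)                  # sample built from class-0 atoms
print(src_classify(x, {0: D0, 1: D1}))       # 0
```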
Thanks to the accessibility of adequate training samples, Deep Neural Networks (DNNs) have become much more popular in the field of machine learning. Furthermore, it has been noticed that multi-hidden-layer artificial neural networks (ANNs) possess an excellent feature learning ability beneficial to visual classification [
25]. Consequently, the training limitation of DNN can be solved by adopting the policy of layer-wise pre-training [
26]. CNN, proposed by LeCun [
27], was the first learning algorithm to train a multi-layer network successfully. It is capable of reducing the storage of learned parameters and improving the efficiency of the network by using local connections, weight sharing, and backpropagation. With these outstanding advantages, CNNs have been flexibly applied in various works. Hinton et al. [
28] applied CNN to ImageNet, the largest database for image recognition, with stunning results obtained, surpassing all the previous ones, while Zhang et al. [
29] suggested an approach based on CNN cascaded features and an AdaBoost rotation forest to mitigate the problems arising from the lack of samples. In Liu et al. [
30], the sparse manifold regularized networks were presented for polarimetric SAR terrain classification, in which the number of training samples was reduced by fine-tuning a few parameters.
Furthermore, as an important research direction in SAR ATR, multi-feature fusion cannot be ignored. In this area, Amrani et al. [
31] deployed the traditional cascade and discriminant correlation analysis algorithm to fuse the extracted deep features while Wang et al. [
32] proposed a two-channel feature fusion method for intensity features and gradient amplitude features. This representation method can effectively maintain the spatial relationship between the two features and achieve a better feature fusion effect. Zheng et al. [
33] offered an improved form of CNN with higher generalization ability and less fitting probability, combining the convolution (conv) layer of CNN with a two-dimensional PCA algorithm to further improve its efficiency and robustness. Yu et al. [
34] presented a deep feature fusion network that acquired prominent results under limited data conditions. On this basis, a structure containing a multi-input parallel network topology was created, in which the SAR image features of different perspectives were extracted layer by layer and the features of different viewpoints were merged step by step, making the network robust to changes in the visual angle.
Transfer learning also plays an indispensable role in deep learning. In [
35], transfer learning was introduced to transfer the prior experience learned from enough unmarked SAR images to marked SAR targets. Rostami et al. [
36] trained a DNN for SAR targets by deep transferring the weights to the target task, successfully eliminating the need for sufficient samples. Xu et al. [
37] employed a framework-oriented transfer learning method with discriminative adaptation regularization for ship classification. In short, transfer learning can contribute to boosting performance in the case of a lack of training samples.
3. Proposed MFFA-SARNET
The proposed MFFA-SARNET scheme is explained in
Figure 1 in meticulous detail; we present the newly developed framework in this section. In our work, given the characteristics of SAR targets, SAR images are fed into the proposed network, in which features of different levels from multiple layers are fused into the attention module to complete the weight distribution and task focus. After the framework has learned the attention area for the class to be identified, a novel loss function with batch normalization is applied to recognize the target, after which the SAR targets fed into the network are trained through the backpropagation algorithm. The data analysis displayed in
Table 1 aims to augment data intuitiveness for an easier and better understanding from readers.
As displayed in
Table 1, we can observe that the parameters learned in each layer are influenced by the network settings: the number of parameters increases through the conv layers and decreases after the deconvolution (deconv) layers and the fc1/fc2 layers. The final parameter count is slightly smaller, and the resulting features are more effective for classification.
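This trend follows from the standard parameter-count formulas for conv and fully connected (fc) layers. The layer widths below are hypothetical, chosen only to illustrate the bookkeeping rather than to reproduce the exact settings of Table 1:

```python
def conv_params(k, c_in, c_out):
    # k*k*c_in weights per output channel, plus one bias per output channel.
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output unit.
    return (n_in + 1) * n_out

# Hypothetical layer widths, for illustration only.
print(conv_params(5, 1, 16))    # first conv:  416
print(conv_params(5, 16, 32))   # deeper conv: 12832
print(fc_params(128, 10))       # small fc head: 1290
```

Parameter counts grow with channel depth through the conv stack and shrink once the spatial maps are collapsed into small fc layers, matching the trend observed in Table 1.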
3.1. Multi-Level Feature Attention Network
3.1.1. Multi-Level Feature Extraction and Fusion
Feature fusion refers to the extraction of different types of features by utilizing a plurality of feature extraction methods. Owing to its superiority in extracting abundant texture information and its promising robustness to various changes in images, it is well suited to further mining image information. In this section, we improve the accuracy of image recognition by adopting a method of multi-feature fusion, as shown in
Figure 2. Specifically, SAR target classification is carried out by utilizing a method of multi-feature fusion, where SAR image classification is operated and assisted by fusing low-level feature maps with high-level semantic features of strong representation ability.
Suppose that the size of the feature map in the convolution layers is $(m \times m) \times n$, with $(m \times m)$ as the feature map dimension and $n$ as the network depth. The pixel region where the $i$-th mapping feature interacts with a $u \times u \times 1$ convolution kernel is $R_{(a,b)}$, so the output $x_i^l$ is formulated as follows:

$$x_i^l(a, b) = \sum_{(a', b') \in R_{(a,b)}} k_i(a', b')\, x_{i-1}^l(a', b') \qquad (1)$$

where $k_i(a, b)$ refers to the specific value of the $i$-th convolution kernel in the region $(a, b)$, $l$ denotes the $l$-th branch of the network, and $x_i^l$ denotes the output feature map of the $i$-th layer from branch $l$ in the network. By using the weight $w$, the offset value $b$, and the mapping $f$, the fusion feature mapping of the region $R_{(a,b)}$ is obtained:

$$F_i^l = f\left(w \cdot x_i^l + b\right) \qquad (2)$$

where $f$ is the activation function, which could be the Rectified Linear Unit (ReLU), sigmoid, softmax, Exponential Linear Unit (ELU), and so on. The output feature maps can be described in another form, $F^j \in \mathbb{R}^{N \times C \times W \times H}$, among which $j$ is one of the fusion branches, $C$ denotes the channel number, $N$ stands for the number of training targets, and $[W, H]$ denote the width and the height of the feature map, respectively. Assuming that the feature map is calculated by Formula (2) and that the tensor is a vector list containing the parameters $N, C, W, H$, the fusion procedure can be worked out as the process in
Figure 2. In this paper, there are three branches, named the $o$-th, $p$-th, and $q$-th branch, and the fusion feature of the branches $o$, $p$, and $q$ can be expressed as Formula (3):

$$F^{fuse} = \mathrm{concat}\left(F^o, F^p, F^q\right) \qquad (3)$$
The function of the multi-feature fusion module intends to obtain different feature graph information, to provide ample feature information for feature discrimination.
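A minimal sketch of one such fusion step follows, assuming nearest-neighbour upsampling and channel-wise concatenation; this is one common realization of multi-level fusion, not necessarily the exact operator used in the network:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(branches):
    """Concatenate (C, H, W) maps from several branches along the channel axis."""
    return np.concatenate(branches, axis=0)

low_level  = np.ones((8, 16, 16))           # fine spatial detail, few channels
high_level = np.ones((32, 8, 8))            # coarse but semantically rich
fused = fuse([low_level, upsample2x(high_level)])
print(fused.shape)                          # (40, 16, 16)
```

Upsampling brings the coarse high-level map to the resolution of the low-level map, so that both low-level texture and high-level semantics are present in the fused tensor handed to the attention module.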
3.1.2. Attention Module
We investigated the attention mechanism, whose essence is to locate the information of interest and suppress useless information so that the SAR target's features can be well focused. The results obtained from the former step are usually presented in the form of a probability graph or a probability characteristic vector channel.
Figure 3 demonstrates the attention module containing the specific approach to channel attention, which can be illustrated as follows. Firstly, the feature tensor is transformed into $U = [u_1, u_2, \ldots, u_D]$, in which $u_i \in \mathbb{R}^k$ represents the $i$-th dimension of the features, and $D$ is the dimension of $v_i$, i.e., the total number of channels in each domain. Then, we pool each channel to generate a channel vector as shown in Formula (4):

$$v_i = \frac{1}{k} \sum_{j=1}^{k} u_{i,j} \qquad (4)$$

where $v_i$ is the mean vector of $u_i$, which denotes the feature of the $i$-th channel. The process of the channel attention model is expressed as follows:

$$h_c = \tanh\left( (W_{vc} \otimes v + b_{vc}) \oplus (W_{qc}\, q + b_{qc}) \right) \qquad (5)$$

$$\alpha_c = \mathrm{softmax}\left( W_c\, h_c + b_c \right) \qquad (6)$$

where $W_{vc}$, $W_{qc}$, and $W_c$ are the embedding matrices, while $b_{vc}$, $b_{qc}$, and $b_c$ are the bias terms. $Q$ expresses the input vector of the validation images, $\otimes$ represents the outer product of a vector, and the channel attention vector $\alpha_c$ will be obtained through the channel attention mechanism $A_c$, which can be simplified as:

$$\alpha_c = A_c(v, q) \qquad (7)$$

Through the above steps, Formulas (4)–(7), we can obtain the channel attention weight $\alpha_c$, which is then fed back to the channel attention function $f_c$ to calculate a feature map $V_c$, where $f_c$ denotes the channel-level product of each channel of the region feature mapping and the corresponding channel weight, and $V$ is the input feature map fed into the channel attention mechanism. Thus, $V_c$ can be represented as:

$$V_c = f_c(\alpha_c, V) = \alpha_c \odot V \qquad (8)$$

Given the calculated feature map $V_c$, new features are generated by inputting $V_c$ and $Q$ into the network, and then the softmax function is employed to calculate the spatial attention weight based on the region. The spatial attention mechanism is defined as follows:

$$h_s = \tanh\left( (W_{vs}\, V_c + b_{vs}) \oplus (W_{qs}\, q + b_{qs}) \right) \qquad (9)$$

$$\alpha_s = \mathrm{softmax}\left( W_o\, h_s + b_o \right) \qquad (10)$$

where $W_{vs}$ and $W_{qs}$ are the embedding matrices, mapping the visual and problem features to the shared latent space. Additionally, $W_o$ is a set of parameters that needs to be relearned, $b_o$ is a model bias term, and $\oplus$ is a matrix-and-vector addition operation. Simply, the attention weight can be optimized as:

$$\alpha_s = A_s(V_c, q) \qquad (11)$$
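The channel attention step (mean-pool each channel, embed, softmax, rescale) can be sketched as follows. A single embedding matrix `W` and bias `b` stand in for the learned parameters, and the query branch is omitted, so this is an illustrative simplification rather than the exact module:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_attention(V, W, b):
    """Weight each channel of V (C, H, W) by a softmax over pooled descriptors.

    W: (C, C) embedding matrix, b: (C,) bias -- illustrative stand-ins for
    the learned parameters of the channel attention mechanism.
    """
    v = V.reshape(V.shape[0], -1).mean(axis=1)   # mean-pool each channel
    alpha = softmax(W @ v + b)                   # channel attention weights
    return V * alpha[:, None, None]              # rescale channels by weight

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8, 8))
out = channel_attention(V, np.eye(4), np.zeros(4))
print(out.shape)                                 # (4, 8, 8)
```

Because the weights come from a softmax, informative channels are amplified relative to background channels while the map's shape is preserved for the subsequent spatial attention step.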
3.2. The Dual Optimized Loss for Training Optimization
It is known that weaker classifiers should be used to improve the discriminative performance of the learned representations, because massive parameters may make the network prone to overfitting, especially with small samples. Besides, the cost function is also a perfect choice for improving performance by optimizing the network. To avoid overfitting and excessive computation, network optimization has also become one of the research hotspots. Chen et al. [
38] proposed a new low-degree-of-freedom sparsely connected convolution structure to replace the traditional full connection, which reduced the number of free parameters, alleviated the serious overfitting problem triggered by the limited number of training images, and adopted dropout technology to enhance the generalization ability. The mini-batch stochastic gradient descent method with momentum was used to optimize the network and quickly find the global optimum. Wilmanski et al. [
39] were committed to the improvement of the learning algorithm, using AdaGrad and AdaDelta technology to avoid manually adjusting the learning rate and other parameters, engendering better robustness to parameter selection.
To optimize the classification of SAR images with noise-free labels, we designed a novel dual optimized loss with a batch normalization algorithm to gain an agreeable classification performance in this section. The loss function can be divided into two parts: Lossm and the constraint SSIM. The former is a modified softmax loss function with batch normalization. The deep features trained with softmax, which divides the entire hyperspace or hypersphere into categories based on the number of categories, ensure that the categories are separable, which proves ideal for multi-category tasks; however, softmax by itself requires neither intra-class compactness nor inter-class separation. With batch normalization, each batch is normalized so that the original data are mapped to a distribution with a mean of zero and a variance of one. The benefit brought by BN lies in the regularized input distribution, which promotes the smoothness of the solution space of the optimization problem and the predictability and stability of the gradient. Therefore, we modified softmax with batch normalization, not only ensuring separability but also guaranteeing the best compactness within feature classes and the greatest separation between classes.
If the input of the optimization part is $x_i$, the batch normalization (BN) can be described as:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2 \qquad (12)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta_{\mathrm{BN}} \qquad (13)$$

where $\mu$ refers to the mean value, $\sigma^2$ the variance, $\hat{x}_i$ the normalized value, $\gamma$ and $\beta_{\mathrm{BN}}$ the learnable scale and shift parameters, and $y_i$ the batch normalization value, which is a posterior form of the Gaussian model with a pooled covariance matrix, serving to determine the prediction result.
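The BN computation described above can be sketched directly (scalar scale and shift for simplicity):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean / unit variance, then scale and shift."""
    mu = x.mean()                       # batch mean
    var = x.var()                       # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized values
    return gamma * x_hat + beta         # learnable affine transform

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x)
# y has mean ~0 and variance ~1, whatever the scale of x.
```

At training time the per-batch statistics are used as above; at inference, running averages of `mu` and `var` collected during training replace them.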
The SSIM loss is a measure of the similarity between two images; it ensures and further improves the optimization, while also serving as a constraint to balance the network optimization. SSIM defines the structure information independently of brightness and contrast, illustrating the object structure properties from the perspective of image composition.
The dual optimized loss is defined as follows:

$$Loss = Loss_m + \beta \left( 1 - \mathrm{SSIM}(y_s, y) \right) \qquad (14)$$

where $\beta$ is the balance parameter and

$$\mathrm{SSIM}(y_s, y) = \frac{(2 \mu_{y_s} \mu_{y} + C_1)(2 \sigma_{y_s y} + C_2)}{(\mu_{y_s}^2 + \mu_{y}^2 + C_1)(\sigma_{y_s}^2 + \sigma_{y}^2 + C_2)} \qquad (15)$$

where $y_s$ is the one-hot label, $y$ is the output value, $\mu_{y_s}$ and $\mu_{y}$ are the corresponding mean values, $\sigma_{y_s y}$ is the covariance between $y_s$ and $y$, and $C_1$ and $C_2$ are constants. To ensure a clearer understanding, Algorithm 1 demonstrates the training optimization in meticulous detail below.
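One plausible reading of the dual optimized loss can be sketched as follows, assuming Lossm reduces to softmax cross-entropy and the SSIM term enters as a balanced constraint β(1 − SSIM); the constants `c1`, `c2`, and `beta` are illustrative values, not the paper's settings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ssim_score(ys, y, c1=1e-4, c2=9e-4):
    """Global SSIM between the one-hot label ys and the prediction y."""
    mu_s, mu_y = ys.mean(), y.mean()
    var_s, var_y = ys.var(), y.var()
    cov = ((ys - mu_s) * (y - mu_y)).mean()   # covariance of ys and y
    return ((2 * mu_s * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_s**2 + mu_y**2 + c1) * (var_s + var_y + c2))

def dual_loss(logits, ys, beta=0.5):
    """Softmax cross-entropy term plus an SSIM-based constraint term."""
    y = softmax(logits)
    ce = -np.sum(ys * np.log(y + 1e-12))
    return ce + beta * (1.0 - ssim_score(ys, y))

ys = np.array([0.0, 1.0, 0.0])
# A sharper, more confident correct prediction yields a lower dual loss.
print(dual_loss(np.array([0.1, 2.0, -1.0]), ys) >
      dual_loss(np.array([0.0, 8.0, -8.0]), ys))   # True
```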
Algorithm 1: Dual Optimized Loss for Training Optimization
Require: Constants C1, C2
Require: The balance parameter β
Require: Stepsize α
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: θ0: Initial parameter vector
m0 ← 0
v0 ← 0
Given that the training set includes m samples in small batches, X = {x1, x2, …, xm}, the corresponding ground truth of the target is yi, and the corresponding output is ys
while θt is not converged, do:
Step 1: t ← t + 1
Step 2: Compute the mean of ys and the mean of y
Step 3: Compute the covariance of y and ys
Step 4: Compute the values of Lossm and the SSIM by Equation (15)
Step 5: Compute the whole Loss by Equation (14)
Step 6: Update θt by descending the gradient of the Loss with stepsize α and the moment estimates mt, vt
end while
return θt
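Since Algorithm 1's requirements (stepsize α, decay rates β1 and β2, moment estimates m0 and v0) match an Adam-style optimizer, the parameter update of Step 6 can be sketched as follows; this is an assumed realization of the update rule, not the authors' exact implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update, using the moment estimates m_t and v_t."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)               # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy loss f(theta) = theta^2 with gradient 2*theta, in place of the dual loss.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.05)
# theta has moved from 5.0 to near the minimum at 0.
```

In the actual training loop, `grad` would be the backpropagated gradient of the dual optimized loss with respect to the network parameters.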
3.3. Transfer Learning
Transfer learning devotes itself to figuring out the shared characteristics between several tasks and transferring the weights at the level of general features. By training on other image datasets such as ImageNet, or learning from other images similar to SAR images, the shallow, middle, and high-level features that can be used to deal with classification tasks can be obtained; leveraging data from related tasks can effectively improve generalization and reduce the runtime of evaluating a set of classifiers. The domain is described as D = {F, P(X)}, where F = {f1, f2, …, fn} is a feature space with n dimensions, X = {x1, x2, …, xn} denotes the learning samples, and P(X) represents the marginal probability distribution of X. To our knowledge, different domains have different Fs and P(X)s. The task is a pair T = {y, f(·)}, where y is the label space and f(·) is a prediction function. In this paper, the domain feature space F = {f1, …, fn} remains the same, while P(X) varies according to the classification task.
In this work, based on the proposed network, we utilized a source dataset, whose classes differ from those of the target dataset, to train an optimized model in advance, followed by introducing transfer learning to copy the pre-trained weights into the network and fine-tuning it by training on the raw samples of the target dataset. Concretely, three feature extractors are considered for weight transfer: the parameter updates of the layers up to Feature Extractor 2 are preserved in the same way as in the pre-trained model, while the weights of Feature Extractor 3 are trained from scratch on the target dataset. Expository details of the framework are given in
Figure 4.
Specifically, in this paper, the procedure of transfer learning can be described as follows. Firstly, we use the source dataset to start our training and obtain a pre-trained model Mpre, which contains the weight values and other feature information learned from the source data. Note that all the learned weights are regarded as the initial settings of the network. Then, we input the limited target samples for training by setting the learned parameters in each feature extractor. For example, when transferring the parameters learned up to Feature Extractor 2, we configure the learning rates in Extractors 1 and 2 (before Extractor 3) to zero or smaller values, while the parameters in Extractor 3 keep the same initialization, yielding the model Mtest. Finally, the model is used for the SAR ATR tasks. In our work, we explored the performance of transferring different feature extractors, and the results of the experiments are shown in Section 4.
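The freezing scheme can be sketched with per-layer learning rates; the layer names, shapes, and gradient values below are hypothetical, and a plain SGD step stands in for the actual optimizer:

```python
import numpy as np

def make_lr_schedule(layers, frozen, base_lr=1e-3):
    """Per-layer learning rates: 0 for transferred (frozen) extractors,
    base_lr for the extractor trained from scratch on the target set."""
    return {name: (0.0 if name in frozen else base_lr) for name in layers}

def sgd_update(weights, grads, lrs):
    """Plain SGD step; frozen layers keep their pre-trained values."""
    return {n: weights[n] - lrs[n] * grads[n] for n in weights}

layers = ["extractor1", "extractor2", "extractor3"]
weights = {n: np.ones(3) for n in layers}        # stand-in pre-trained weights
grads = {n: np.full(3, 0.5) for n in layers}     # stand-in gradients

lrs = make_lr_schedule(layers, frozen={"extractor1", "extractor2"}, base_lr=0.1)
new_w = sgd_update(weights, grads, lrs)
print(new_w["extractor1"])   # unchanged: [1. 1. 1.]
print(new_w["extractor3"])   # updated:   [0.95 0.95 0.95]
```

Setting the frozen extractors' learning rates to a small nonzero value instead of zero would correspond to the gentler fine-tuning variant mentioned above.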