1. Introduction
Synthetic Aperture Radar (SAR) achieves high-resolution microwave imaging through the synthetic aperture principle. Owing to its high resolution, its independence of weather and illumination, and its ability to discriminate camouflage and penetrate coverings, SAR outperforms other remote sensing modalities in many military and civil applications and has therefore attracted increasing attention in recent years. However, compared with optical images, the speckle noise and the different imaging mechanism of SAR images make their interpretation more difficult [1]. Automatic target recognition (ATR) is one of the pivotal steps of SAR image interpretation and is of great significance in both civil and military fields [2].
With the emergence of large data sets and the growth of computing power, deep learning has been widely used in many fields. To imitate the cognitive mechanism of the human brain, deep learning builds a multi-level model structure that performs multi-layer nonlinear transformations, through which the original data can be mapped into a feature space more suitable for recognition. The convolutional neural network (CNN) is a typical network for image classification; it performs feature extraction and classification in a unified framework and can automatically learn features better suited to classification. Several existing studies [3] show that the deep features learned through convolution operations tend to be more discriminative for different types of targets. Many CNN-based SAR ATR methods have been proposed. Chen et al. [1] constructed a network called A-ConvNet for SAR ATR, which contains only convolutional layers and no fully connected layers; it alleviates, to a certain extent, the over-fitting of CNNs caused by insufficient training data. Leonan et al. [4] recognized oil rigs in Sentinel-1 SAR images using VGG-16 and VGG-19. Huang et al. [5] proposed a lightweight CNN for SAR ATR with a global stream and a local stream; the two streams extract multi-level features that are combined to classify the target.
Although CNN-based methods for SAR ATR have achieved good results, most of them mainly use the image information of SAR images and make little use of their unique electromagnetic scattering characteristics. For SAR images, attributed scattering centers (ASCs) use several physically relevant parameters to accurately describe the electromagnetic scattering characteristics and the local structures of the target, which are notably effective for SAR ATR [6]. Several ASC-based SAR ATR methods have been proposed. Based on Bayesian theory, Chiang et al. [7] proposed an ASC matching method that evaluates the similarity between two ASC sets by the posterior probability. Dungan et al. [8] calculated the distance between the attributed point set of the test image and those of the training samples using the least trimmed square Hausdorff distance (LTS-HD); the category of the test image is determined by the shortest distance. Tian et al. [9] also proposed an ASC matching method, which computes the correspondence between the test ASC set and the template ASC sets to recognize the test image.
The abovementioned ASC-based methods for SAR ATR rely on ASC matching. Considering the strong performance of CNNs on SAR ATR, there have been some attempts to combine the ASCs of SAR targets with CNNs. Lv et al. [10] first extracted ASCs from the SAR image and then used different numbers of ASCs to reconstruct SAR images, with which the training set is augmented. This method only uses the ASC-reconstructed images to augment the training samples and does not genuinely combine ASCs with the CNN. Jiang et al. [11] fused a CNN and ASC matching hierarchically to achieve SAR ATR. In this hierarchical fusion method, the CNN first classifies the test sample, and its output is used to compute a reliability level, which then decides whether ASC matching needs to be applied to further classify the sample. This method divides the whole recognition process into two separate stages; it is not an end-to-end network architecture and cannot be jointly optimized.
To combine the ASCs of SAR targets and a CNN in an end-to-end network structure, our method adopts the ASC schematic map, obtained from the physically relevant ASC parameters, as one of the inputs of the CNN. As shown in Figure 1b, the ASC schematic map contains the geometric shape of each ASC extracted from the SAR image, such as dihedral and trihedral. The ASC schematic map mainly describes the scattering centers of the target in the SAR image: there is no background clutter, and the target is composed of scattered geometric structures. Since the ASC schematic map reflects the local structure of the target corresponding to each ASC, it is also meaningful for SAR ATR.
In this paper, we propose a CNN combined with ASCs for SAR ATR that comprehensively utilizes both image-related and ASC-related features, which improves the accuracy of SAR ATR. The proposed network has two branches. One branch takes the SAR image as input and extracts discriminative image features. The other branch takes as input the ASC schematic map generated via the ASC model, which reflects the local structure of the target corresponding to each ASC, and extracts features with physical meaning. Since the two branches complement each other, we fuse the high-level features obtained by the two branches to recognize the target. The whole network is jointly optimized.
2. Proposed Target Recognition Method
Figure 2 shows the overall framework of our method. The proposed method contains two separate feature extraction networks, a feature fusion part and a classification part. Both feature extraction networks are composed of several convolutional layers and a fully-connected layer. One branch takes as input the SAR image, obtained by taking the modulus of the original complex SAR data, and extracts discriminative image features; the SAR image contains only the amplitude information of the complex SAR data. The other branch takes as input the corresponding ASC schematic map, obtained via the ASC model from the complex SAR data, and acquires features related to the target's local structures; the ASC schematic map is derived from both the amplitude and phase information of the complex SAR data. These two types of features are complementary, so the features obtained from the two branches are fed into the feature fusion part, which contains a feature concatenation layer and a fully-connected layer with 1024 units. The fused feature contains richer information about the SAR target and is more descriptive and more discriminative. Finally, the fused feature is delivered to the classification part, which contains a fully-connected layer and a softmax function. The fully-connected layer in the classification part reduces the dimensionality of the fused feature, and its number of units equals the number of categories. The softmax function determines the target label: its output is a vector whose entries represent the probability that the input image belongs to each category, and the target label is the class with the maximum probability. During training, the entire network is trained with the cross-entropy loss function, and the two branches are jointly optimized.
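As a rough illustration of this two-branch architecture, the following PyTorch-style sketch wires two feature extraction branches into the fusion and classification parts. The class and argument names are illustrative, the branches are assumed to output 1024-dimensional feature vectors (as specified in Section 2.3), and the default of ten target categories is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Sketch of the feature fusion and classification parts.

    sar_branch and asc_branch are the two feature extraction networks (each
    ending in a 1024-unit fully-connected layer); their outputs are
    concatenated, passed through a 1024-unit fully-connected layer, and
    classified over the target categories.
    """
    def __init__(self, sar_branch: nn.Module, asc_branch: nn.Module,
                 feat_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        self.sar_branch = sar_branch
        self.asc_branch = asc_branch
        self.fusion_fc = nn.Linear(2 * feat_dim, 1024)   # feature fusion part
        self.cls_fc = nn.Linear(1024, num_classes)       # classification part

    def forward(self, sar_image, asc_map):
        f_sar = self.sar_branch(sar_image)               # image features
        f_asc = self.asc_branch(asc_map)                 # ASC-related features
        fused = torch.relu(self.fusion_fc(torch.cat([f_sar, f_asc], dim=1)))
        return self.cls_fc(fused)                        # logits; softmax is folded into the loss

# Joint optimization of both branches with the cross-entropy loss:
# loss = nn.CrossEntropyLoss()(model(sar_batch, asc_batch), labels)
```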
2.1. Attributed Scattering Center Model
For a distributed target in the high-frequency region, the target can be regarded as a composition of several independent scattering centers. Therefore, the radar backscattering of the distributed target can be approximated by summing the responses of these scattering centers as follows [12]:

$$E(f,\varphi;\Theta)=\sum_{i=1}^{P}E_{i}(f,\varphi;\theta_{i}),$$

where $f$ indicates the frequency, $\varphi$ the aspect angle, $P$ the number of ASCs of the target, and $\Theta=\{\theta_{1},\ldots,\theta_{P}\}$ the ASC parameter set. For a single ASC, the ASC model describes its backscattered field as follows [12]:

$$E_{i}(f,\varphi;\theta_{i})=A_{i}\left(j\frac{f}{f_{c}}\right)^{\alpha_{i}}\exp\!\left(-j\frac{4\pi f}{c}\left(x_{i}\cos\varphi+y_{i}\sin\varphi\right)\right)\operatorname{sinc}\!\left(\frac{2\pi f}{c}L_{i}\sin\left(\varphi-\bar{\varphi}_{i}\right)\right)\exp\!\left(-2\pi f\gamma_{i}\sin\varphi\right),$$

where $c$ indicates the propagation velocity of the electromagnetic wave and $f_{c}$ the radar center frequency; for the $i$-th ASC, $A_{i}$ is the complex amplitude, $\alpha_{i}$ the frequency dependence, $x_{i}$ and $y_{i}$ the position coordinates of the scattering center in the range and azimuth dimensions, $L_{i}$ and $\bar{\varphi}_{i}$ the length and orientation of a distributed ASC, and $\gamma_{i}$ the aspect dependence of a localized ASC, so that $\theta_{i}=[A_{i},\alpha_{i},x_{i},y_{i},L_{i},\bar{\varphi}_{i},\gamma_{i}]$.
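To make the single-ASC model concrete, the following NumPy sketch evaluates the backscattered field of one scattering center on a frequency–aspect grid; the function name and the default center frequency are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def asc_response(f, phi, A, alpha, x, y, L, phi_bar, gamma, fc=1e10, c=3e8):
    """Backscattered field of a single attributed scattering center.

    f, phi : frequency (Hz) and aspect angle (rad), broadcastable arrays.
    A, alpha : complex amplitude and frequency dependence.
    x, y : range/azimuth position (m); L, phi_bar : length/orientation of a
    distributed ASC; gamma : aspect dependence of a localized ASC.
    """
    freq_term = A * (1j * f / fc) ** alpha
    position_term = np.exp(-1j * 4 * np.pi * f / c * (x * np.cos(phi) + y * np.sin(phi)))
    # np.sinc(x) = sin(pi*x)/(pi*x), so passing 2*f*L/c*sin(.) reproduces the
    # unnormalized sinc(2*pi*f*L/c*sin(.)) used in the equation above.
    distributed_term = np.sinc(2 * f * L / c * np.sin(phi - phi_bar))
    localized_term = np.exp(-2 * np.pi * f * gamma * np.sin(phi))
    return freq_term * position_term * distributed_term * localized_term

# Total response of a target: sum the contributions of its P scattering centers, e.g.
# F, PHI = np.meshgrid(np.linspace(9.1e9, 9.9e9, 64), np.deg2rad(np.linspace(-1.5, 1.5, 64)))
# E = sum(asc_response(F, PHI, **theta_i) for theta_i in thetas)
```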
2.2. The ASC Schematic Map
For a distributed target in the high-frequency region, the dependence of its backscattering response on azimuth and frequency can be described by a set of ASC model parameters. These parameters describe the physical characteristics of the scattering centers of the target, including relative amplitude, shape, position and orientation (pose) [12]. Among these parameters, two are selected to differentiate the eight iconic shapes of ASCs listed in Table 1, and different colors are used to illustrate these eight shapes. As can be seen in Table 1, edge broadside and edge diffraction cannot be distinguished by the length ($L$) alone, but they can be distinguished by combining the frequency dependence ($\alpha$) with the length ($L$); therefore, they are denoted in different colors in Table 1.
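As a rough illustration of how these two parameters separate the eight shapes, the sketch below uses the canonical (frequency dependence, length) combinations commonly reported in the ASC literature; the exact values and the color coding of Table 1 are not reproduced here, so this mapping should be read as an assumption.

```python
# Hypothetical mapping from (frequency dependence alpha, distributed or not) to the
# canonical ASC shapes; the values follow the ASC literature and are assumptions,
# not a reproduction of Table 1.
CANONICAL_SHAPES = {
    (1.0,  True):  "dihedral",
    (1.0,  False): "trihedral",
    (0.5,  True):  "cylinder",
    (0.5,  False): "top hat",
    (0.0,  True):  "edge broadside",
    (0.0,  False): "sphere",
    (-0.5, True):  "edge diffraction",
    (-1.0, False): "corner diffraction",
}

def classify_asc_shape(alpha: float, length: float) -> str:
    """Pick the canonical shape whose frequency dependence is closest to alpha,
    distinguishing distributed (L > 0) from localized (L ~ 0) scatterers."""
    has_length = length > 1e-3
    candidates = {a: name for (a, d), name in CANONICAL_SHAPES.items() if d == has_length}
    best_alpha = min(candidates, key=lambda a: abs(a - alpha))
    return candidates[best_alpha]
```

Note that edge broadside and edge diffraction both have a non-zero length in this mapping, so only the frequency dependence separates them, consistent with the discussion of Table 1 above.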
Estimating the ASC parameters is a high-dimensional, non-linear and non-convex problem. Currently, many estimation methods based on image-domain or frequency-domain processing exist. Most existing image-domain methods rely on image segmentation to estimate the ASC parameters [14], so their estimation results highly depend on the accuracy of the segmentation. Compared with image-domain methods, frequency-domain methods do not require image segmentation; however, their high computational complexity and storage demand limit their applications.
Since the ASC schematic map is the input of one of the network branches, an accurate ASC schematic map is of great significance for the final recognition result. To obtain an accurate ASC schematic map, we adopt the method in [13] to extract the ASCs of the input image. The method in [13] is a recent image-domain ASC extraction algorithm with good accuracy and fast calculation speed. In this method, the SAR measurements are first converted to sparse representations in the image domain; then, the ASC model parameters are estimated through the Newtonized orthogonal matching pursuit (NOMP) algorithm. Specifically, the ASC extraction algorithm in [13] consists of four iterative steps, namely atom selection, atom refinement, projection, and residue evaluation, so that a signal can be sparsely approximated by a set of refined atoms. The detailed operations of this algorithm are summarized in Algorithm 1.
After obtaining the parameters of all ASCs extracted from the input SAR image, the geometric shape of each ASC is determined according to its frequency dependence and length parameters; then, according to the position parameters of all the ASCs, the ASC schematic map of the input SAR image is obtained. The obtained ASC schematic map reflects the local structure of the target corresponding to each ASC and is used as the input of the ASC schematic map branch. Figure 1 gives an example of a SAR image and its corresponding ASC schematic map, and a rasterization sketch is given below.
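The sketch below illustrates one way such a schematic map could be rasterized from the estimated parameters, placing a shape label at each ASC position; the image size, scene extent, and marker size are assumptions, and classify_asc_shape refers to the hypothetical mapping sketched after Table 1 above.

```python
import numpy as np

def render_asc_schematic(ascs, img_size=128, extent=(-6.0, 6.0), marker=2):
    """Rasterize an ASC schematic map from a list of parameter dicts.

    ascs    : list of dicts with keys 'x', 'y', 'alpha', 'L'.
    img_size: output map is img_size x img_size pixels (assumption).
    extent  : scene extent in meters mapped onto the image (assumption).
    Returns an integer label map; 0 is background, 1..8 code the eight shapes.
    """
    shape_ids = {"dihedral": 1, "trihedral": 2, "cylinder": 3, "top hat": 4,
                 "edge broadside": 5, "sphere": 6, "edge diffraction": 7,
                 "corner diffraction": 8}
    schematic = np.zeros((img_size, img_size), dtype=np.uint8)
    scale = (img_size - 1) / (extent[1] - extent[0])
    for asc in ascs:
        row = int(round((asc['y'] - extent[0]) * scale))   # azimuth -> row (assumed convention)
        col = int(round((asc['x'] - extent[0]) * scale))   # range   -> column
        label = shape_ids[classify_asc_shape(asc['alpha'], asc['L'])]
        schematic[max(row - marker, 0):row + marker + 1,
                  max(col - marker, 0):col + marker + 1] = label
    return schematic
```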
Algorithm 1 The ASC extraction algorithm proposed in [13]

Input: SAR image $s$; dictionary $\Phi=\{\phi(\theta_{q})\}$, where $\phi(\theta_{q})$ is the normalized ASC image, $\theta_{q}$ is the parameter set of the $q$-th ASC, and $\Theta$ contains all possible values of the ASC parameters.
Output: the estimated ASC parameters $\{\hat{\theta}_{i}\}$ and the corresponding coefficients.
Initialization: residual image $r_{0}=s$, iteration index $i=0$, selected atom collection $\Phi_{0}=\varnothing$.
while the stop criterion is not met do (set $i=i+1$)
1. Atom selection: coarsely estimate the parameters of the $i$-th ASC, denoted by $\theta_{i}$, by selecting the atom most matched to $r_{i-1}$, which is done by evaluating the inner product of the residual image with every atom in $\Phi$: $\theta_{i}=\arg\max_{\theta_{q}\in\Theta}\left|\left\langle r_{i-1},\phi(\theta_{q})\right\rangle\right|$.
2. Atom refinement: refine the estimates of the continuous parameters of the $i$-th ASC (i.e., $x$, $y$, $L$, and $\gamma$) by taking $\theta_{i}$ as the starting point and running Newton's method, yielding $\hat{\theta}_{i}$; generate the new atom $\phi(\hat{\theta}_{i})$ and put it into the collection of generated atoms: $\Phi_{i}=\Phi_{i-1}\cup\{\phi(\hat{\theta}_{i})\}$.
3. Projection: use the atoms in $\Phi_{i}$ to approximate the original image; the least-squares estimation of the coefficients of the $i$ selected atoms is $\hat{a}=\arg\min_{a}\left\|s-\Phi_{i}a\right\|_{2}$.
4. Residue evaluation: update the residual image by canceling the $i$ selected atoms, i.e., $r_{i}=s-\Phi_{i}\hat{a}$.
end while
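The following simplified Python sketch mirrors the four iterative steps of Algorithm 1. The dictionary is assumed to hold vectorized, normalized ASC images, and the Newton refinement of the continuous parameters is replaced by a pluggable placeholder, so this is an OMP-style approximation under stated assumptions rather than an exact implementation of [13].

```python
import numpy as np

def extract_ascs(s, param_grid, atom_image, refine=lambda theta, r: theta,
                 max_ascs=20, tol=1e-2):
    """Greedy ASC extraction in the spirit of Algorithm 1 (simplified sketch).

    s          : vectorized complex SAR image (length N).
    param_grid : list of candidate parameter sets theta_q (discretized grid Theta).
    atom_image : function theta -> normalized ASC image (length-N complex vector).
    refine     : stands in for the Newton refinement of the continuous
                 parameters (x, y, L, gamma); the default performs no refinement.
    """
    residual = s.copy()                                   # initialization: r_0 = s
    atoms, params = [], []
    coeffs = np.zeros(0, dtype=complex)
    while len(params) < max_ascs and np.linalg.norm(residual) > tol * np.linalg.norm(s):
        # 1. Atom selection: coarse estimate via maximal correlation with the residual.
        scores = [abs(np.vdot(atom_image(theta), residual)) for theta in param_grid]
        theta_i = param_grid[int(np.argmax(scores))]
        # 2. Atom refinement: Newton's method over the continuous parameters in [13];
        #    here replaced by the user-supplied (or identity) refine step.
        theta_i = refine(theta_i, residual)
        params.append(theta_i)
        atoms.append(atom_image(theta_i))
        # 3. Projection: least-squares coefficients of all i selected atoms.
        Phi_i = np.stack(atoms, axis=1)                   # N x i matrix of selected atoms
        coeffs, *_ = np.linalg.lstsq(Phi_i, s, rcond=None)
        # 4. Residue evaluation: cancel the selected atoms from the original image.
        residual = s - Phi_i @ coeffs
    return params, coeffs
```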
2.3. Feature Extraction Networks
CNN is one of the representative algorithms of deep learning and has been widely used in image interpretation. Owing to its deep architecture, a CNN can automatically integrate the extracted features into abstract features layer by layer, and the obtained features are more applicable for classification. Our network contains two feature extraction networks. The inputs of these two branches are different: one is the SAR image, and the other is the ASC schematic map. Therefore, it is reasonable to design different architectures for the two feature extraction networks. For the SAR branch, we consider the following aspects when designing the feature extraction network:
- (1)
In CNN, the extracted features of higher layer contain richer semantic information and tend to have better ability to distinguish different types of targets. However, the deeper the network, the larger the number of parameters, and larger amounts of labeled data are needed to estimate the parameters. In real scenarios, it is very difficult and costly to collect large amounts of labeled SAR images. Therefore, considering the feature extraction ability and parameter quantity of the network, we built a simple and efficient CNN architecture as the feature extraction network, which has 5 convolutional layers and one fully-connected layer.
- (2)
Following common practice, the number of convolution kernels generally increases layer by layer. For the SAR branch, we hope to extract features that are more descriptive and more discriminative from the input SAR image; therefore, we set a relatively large number of convolution kernels in each layer to learn better features, namely 96, 96, 512, 512, 1000, and 1000, respectively.
- (3)
In a CNN, the larger the convolution kernel, the larger the receptive field, and in SAR images large receptive fields can alleviate the effect of speckle noise to a degree. In addition, the receptive fields of the shallower layers are relatively small while those of the deeper layers are relatively large. Considering that large convolution kernels introduce many parameters, we reduce the sizes of the convolution kernels in this feature extraction network layer by layer.
For the ASC branch, we consider the following aspects to design the feature extraction network:
- (1)
Compared with the SAR image, the ASC schematic map is simpler, so it is reasonable for the architecture of the ASC branch to be simpler than that of the SAR branch. Therefore, we decrease the number of convolution kernels per layer in the ASC branch, namely 16, 32, 64, 128, and 256, respectively.
- (2)
There is almost no noise in the ASC schematic map. Therefore, the ASC branch does not require large convolution kernels like the SAR branch, and we only select convolution kernels of size 5 × 5 and 3 × 3.
- (3)
In a CNN, the receptive fields of the shallower layers are relatively small while those of the deeper layers are relatively large. In addition, considering that large convolution kernels introduce many parameters, 5 × 5 convolution kernels are applied in the first two layers to enlarge the receptive field and rapidly reduce the size of the feature maps, while 3 × 3 convolution kernels are used in the remaining layers.
The two feature extraction networks in our proposed method have different architectures, but each is composed of convolutional layers, pooling layers and a fully-connected layer. Each convolutional layer is followed by a batch normalization (BN) layer, which can speed up the convergence of the network, prevent gradient explosion and gradient vanishing, and reduce overfitting [15]. After each convolution and BN layer, max pooling is performed with a kernel size of 3 × 3 and a stride of 2 pixels. After the five convolutional layers, a fully-connected layer with 1024 units transforms the feature maps into a feature vector. All activation functions in the feature extraction networks are rectified linear units (ReLU).
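For reference, here is a PyTorch-style sketch of the ASC-branch feature extraction network described above (16/32/64/128/256 kernels, 5 × 5 kernels in the first two layers and 3 × 3 in the rest, each convolution followed by batch normalization, ReLU, and 3 × 3 max pooling with stride 2, then a 1024-unit fully-connected layer). The paddings and the single-channel input are assumptions, and the SAR branch would follow the same pattern with its own kernel numbers and sizes.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # convolution -> batch normalization -> ReLU -> 3x3 max pooling with stride 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
    )

class ASCBranch(nn.Module):
    """Feature extraction network of the ASC schematic map branch (sketch)."""
    def __init__(self, in_ch: int = 1, feat_dim: int = 1024):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_ch, 16, 5), conv_block(16, 32, 5),
            conv_block(32, 64, 3), conv_block(64, 128, 3), conv_block(128, 256, 3),
        )
        # LazyLinear infers the flattened feature size from the first forward pass.
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU(inplace=True))

    def forward(self, x):            # x: (batch, in_ch, H, W) ASC schematic map
        return self.fc(self.features(x))
```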
During the training of the designed network, the weights are initialized from Gaussian distributions with zero mean and a standard deviation of 0.01, and the biases are initialized to a constant value of 1. In many existing networks, the initial learning rate is set to 0.001, 0.01 or 0.1. Since a learning rate that is too large may prevent the model from converging, while one that is too small makes convergence slow, the learning rate in our network is initially 0.01. The learning rate is reduced once the loss of the network becomes stable; in our paper, it decreases by a factor of 0.1 after 15,000 iterations. Considering the computer's memory and computing power, the batch size is set to 16, and the maximum number of iterations is set to 80,000 to ensure complete convergence of the network.
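For concreteness, this training schedule can be summarized in the following hedged PyTorch-style sketch; `model` stands for the whole two-branch network, `train_iter` is a hypothetical iterator yielding batches of 16 (SAR image, ASC map, label) triples, and the use of SGD is an assumption since the optimizer is not specified here.

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Gaussian initialization (zero mean, std 0.01) for weights; biases set to 1.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.constant_(module.bias, 1.0)

def train(model, train_iter, max_iterations=80000):
    """Cross-entropy training with initial lr 0.01, reduced by 0.1 after 15,000 iterations."""
    model.apply(init_weights)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer choice is an assumption
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15000], gamma=0.1)
    for _ in range(max_iterations):
        sar_batch, asc_batch, labels = next(train_iter)        # batches of 16 samples (assumed loader)
        loss = criterion(model(sar_batch, asc_batch), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model
```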
Through the two feature extraction networks, the information contained in the SAR image and the ASC schematic map can be learned, respectively. The deep network structure maps the shallow features into high-level abstract features layer by layer, and the final feature vectors obtained by the feature extraction networks are more suitable for recognition.
5. Conclusions
CNNs have excellent feature self-learning ability and have been widely used in many fields. However, many existing CNNs have a large number of parameters, which must be learned from large amounts of labeled data, and it is difficult and costly to collect large amounts of labeled SAR images in real scenarios. Therefore, we construct a simple and efficient CNN architecture as the feature extraction network, which has fewer parameters to learn while retaining better feature learning ability than some existing networks, such as A-ConvNet and VGG-16. In addition, although several kinds of CNNs have been applied to SAR ATR with good results, most of them mainly use the image information of SAR targets and make little use of their unique electromagnetic scattering characteristics. For SAR targets, ASCs use several physically relevant parameters to accurately describe the electromagnetic scattering characteristics and the local structures of the target, which are notably effective for SAR ATR. Therefore, we propose a network that comprehensively uses the image information contained in the SAR image and the local structure information contained in the ASCs to improve the performance of SAR ATR. Experiments on a real SAR dataset show that comprehensively using the two types of information achieves better performance than all compared methods, which proves the effectiveness of combining these two types of information.