In this part, we present CBLS-SARNET for small-sample SAR ATR and analyze it theoretically. Section 3.1 introduces the architecture of CBLS-SARNET. Section 3.2 introduces SAR image pre-processing. Section 3.3 proves the equivalence of our method. Section 3.4 presents the UDC framework. Finally, Section 3.5 introduces how to achieve fast SAR ATR through the PS network.
3.1. CBLS-SARNET Architecture
This paper introduces an effective and robust method, CBLS-SARNET, for addressing the SAR ATR problem with limited samples.
Figure 2 illustrates the overall network architecture. The design objective is to enhance the network's feature extraction capability by screening for significant features and eliminating irrelevant and redundant ones, thereby achieving efficient small-sample target recognition in SAR images.
The method is divided into the UDC framework and the PS network. The UDC framework is designed similarly to BLS to achieve weight and bias sharing; in BLS, the output coefficients can be obtained through pseudo-inversion. However, the UDC framework differs in that all of its components are FC layers that are co-trained. In an FC layer, each node is connected to all nodes in the preceding layer, facilitating the integration of features extracted earlier in the network. Moreover, the learning process of these two layers adheres to the back-propagation theorem. This design enhances the network's feature extraction capability, as demonstrated by improvements in training speed and testing accuracy [30].
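For reference, standard BLS solves its output coefficients in closed form rather than by gradient descent. A minimal sketch of that pseudo-inverse step (the function name and the ridge regularizer `lam` are our assumptions for illustration):

```python
import numpy as np

def bls_output_weights(A, Y, lam=1e-3):
    """Closed-form BLS output weights W solving A @ W ~= Y.

    A   : (N, F) concatenated feature/enhancement node outputs
    Y   : (N, C) one-hot training labels
    lam : small ridge regularizer; as lam -> 0 this approaches pinv(A) @ Y
    """
    F = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(F), A.T @ Y)
```

In CBLS-SARNET, by contrast, this pseudo-inverse step is replaced by FC layers updated through back-propagation.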
The PS network is adapted from the BLS structure. The distinction lies in the nodes' pre-trained parameters: the network shares its pre-training weights with the UDC framework. The goal is to deeply extract spatial feature details, effectively guiding the network toward vital image semantic information and robust features. Additionally, the PS network introduces a novel node activation function tailored to the characteristics of SAR images. Experimental results demonstrate that this proposed activation function enhances network performance compared with traditional activation functions.
3.3. From Invariance to Equivalence
To accurately identify SAR targets with various poses, it is crucial to preserve the hierarchical pose relationship between the whole object and its constituent parts. While traditional CNN methods excel at extracting texture, shape, and other object features, they often struggle to discern spatial transformations of objects, offering only limited translation invariance. CapsNet addresses this limitation by encoding the relative spatial relationships between objects as pose matrices, a property known as "equivalence".
To achieve equivalence, a precise representation of the individual parts within the image is necessary. A manifold is associated with the presented parts, and perspective transformations result in slight changes in the poses of these parts. Therefore, the image first undergoes processing by the CNN, and the feature vector is reshaped into the form (number of capsules) $\times$ (capsule dimension). This step encapsulates image component information into the learned manifold. A vector that stores a part's pose and presentation probability is called a capsule. The pose matrix employs a weight matrix to depict the relationship between the part pose and the overall pose. Given the pose vector $\mathbf{u}$ of a component, the corresponding overall pose $\mathbf{v}$ of the target can be expressed as
$$\mathbf{v} = \mathbf{W}\mathbf{u}.$$
The object is transformed as a whole. Due to the equivalence of the viewing angle, for a viewpoint transformation $\mathbf{T}$ there is
$$\mathbf{T}\mathbf{v} = \mathbf{W}\left(\mathbf{T}\mathbf{u}\right).$$
Further transformation yields
$$\mathbf{v}' = \mathbf{W}\mathbf{u}' = C\,\mathbf{W}\mathbf{u},$$
where $C$ is a constant coefficient. By accurately representing the components and leveraging the perspective equivalence of the whole-component relationship, the perspective-equivalent transformation method is derived.
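As a toy numerical check of this equivalence (our simplification: scaled 2-D rotations stand in for both the part-to-whole weight matrix and the viewpoint transform, which is not the paper's exact setup):

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

u = np.array([1.0, 2.0])      # part pose vector
W = rot(0.7)                  # toy part-to-whole weight matrix
T = 1.5 * rot(0.3)            # viewpoint transform; the scale 1.5 plays the role of C
v = W @ u                     # whole pose v = W u

# Transforming the whole equals transforming the part first,
# because W and T commute here (both are scaled 2-D rotations).
print(np.allclose(T @ v, W @ (T @ u)))   # True
```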
3.4. Feature Screening via UDC Framework
Redundant features can lead to risks such as model overfitting. To mitigate the issues stemming from an excessive number of features, feature filtering becomes imperative. Hence, the UDC framework is designed as a feature filter incorporating various perspectives. When confronted with high-dimensional features, the UDC framework exhibits superior robustness and can implicitly select features to eliminate irrelevant and redundant ones. Feature screening can be conceptualized through a trained neural network model $M$. A given sample $w$ can be interpreted as a sequence of several features $w = \{f_1, f_2, \ldots, f_n\}$. Utilizing the feature sensitivity analysis method based on the receptive field, the contribution of each feature to a specific neural network category, such as the predicted probability $p(y \mid w)$ of the true sample category $y$, can be obtained from $M$, and then the average contribution of each feature when it appears is
$$\bar{\phi}_i = \frac{1}{|S_i|} \sum_{j \in S_i} \phi_{ij},$$
where $\phi_{ij}$ represents the contribution of feature $i$ in sample $j$ and $S_i$ denotes the set of samples in which feature $i$ appears.
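A short sketch of this averaging step (the array layout and names are ours; the per-sample contributions $\phi_{ij}$ are assumed to come from the receptive-field sensitivity analysis):

```python
import numpy as np

def average_contribution(phi, present):
    """Mean contribution of each feature over the samples where it appears.

    phi     : (S, F) array; phi[j, i] is the contribution of feature i in sample j
    present : (S, F) boolean mask; True where feature i appears in sample j
    Returns a length-F vector (NaN for features that never appear).
    """
    counts = present.sum(axis=0)
    sums = np.where(present, phi, 0.0).sum(axis=0)
    return np.divide(sums, counts,
                     out=np.full(phi.shape[1], np.nan), where=counts > 0)
```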
The feature screening mechanism of the UDC framework is implemented through network co-training. Co-training is a divergence-based approach that assumes each sample can be feature-extracted from different perspectives. Multiple feature filters are trained from distinct views, and the features they deem credible are selected to join the network. Because these feature filters are trained from diverse perspectives, they complement each other and enhance the recognition rate. This helps sift through information and reduce redundant features, much as viewing things from different perspectives yields a better understanding. The UDC framework is illustrated in Figure 4.
Supposing there are $T$ training samples $\{(x_t, y_t)\}_{t=1}^{T}$ from $C$ classes, the cross-entropy can be calculated as
$$\mathcal{L}_{CE} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} y_{t,c} \log \hat{y}_{t,c},$$
where $y_{t,c}$ refers to the true label and $\hat{y}_{t,c}$ stands for the softmax-normalized probability generated by the training model.
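A minimal implementation of this loss (assuming integer class labels and raw logits as inputs):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Softmax cross-entropy averaged over T samples.

    logits : (T, C) raw model outputs
    labels : (T,) integer indices of the true classes
    """
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```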
Assume that the feature mapping layer $M$ is divided into $m$ smaller layers. Due to its fully connected nature, an FC layer contains numerous parameters. In this process, the entire FC layer is partitioned into several smaller FC layers to facilitate co-training across the different layers, resulting in a significant reduction in network computation [31]. Some splitting methods [32] use an implicit "divide and ensemble" strategy; by adhering to the principle of maintaining metrics, the number of parameters remains roughly unchanged whether or not division is employed.
According to the definitions in PyTorch, $k$ represents the kernel size, and $c_{in}$ and $c_{out}$ represent the numbers of input and output feature maps. With $d$ as the group quantity, each group of $c_{in}/d$ input channels is convolved with its own set of filters of size $k \times k \times (c_{in}/d)$. Under this situation, the parameter quantity is
$$k^2 \cdot \frac{c_{in}}{d} \cdot c_{out},$$
and the floating-point operations are
$$h \cdot w \cdot \left(2 k^2 \cdot \frac{c_{in}}{d} - 1\right) \cdot c_{out},$$
where $h \times w$ is the output feature map size and the $-1$ arises because summing $n$ numbers requires only $n-1$ addition operations. The bias is omitted for brevity. In depthwise convolution, we have $d = c_{in}$. Commonly, $c_{out} = t_1 \cdot c_{in}$, in which $t_1$ serves as a constant. Therefore, an FC layer can be divided by a factor $S$, dividing $c_{in}$ into $S$ groups of $c_{in}/S$ channels, which fully achieves the goal:
$$S \cdot \left(\frac{c_{in}}{S} \cdot \frac{c_{out}}{S}\right) = \frac{c_{in}\, c_{out}}{S}.$$
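This counting can be sanity-checked in a few lines (the helper names and example sizes are ours):

```python
def conv_params(k, c_in, c_out, d=1):
    """Parameters of a k x k grouped convolution, bias omitted."""
    return k * k * (c_in // d) * c_out

def conv_flops(k, c_in, c_out, d, h, w):
    """FLOPs for an h x w output map; the -1 counts the additions."""
    return h * w * (2 * k * k * (c_in // d) - 1) * c_out

def split_fc_params(c_in, c_out, S):
    """An FC layer split into S small layers: S * (c_in/S) * (c_out/S)."""
    return S * (c_in // S) * (c_out // S)

# Splitting a 512 -> 512 FC layer by S = 4 cuts its parameters by a factor of S.
assert split_fc_params(512, 512, 4) == 512 * 512 // 4
```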
After division, different angles of the same samples are utilized for co-training [33], which trains the small connection layers to enhance feature diversity and achieve better ensemble properties. Simultaneously, the small connection layers can learn from one another to further enhance individual performance.
Suppose a given SAR feature vector $X$ contains $N$ samples produced by CapsNet, each with $M$ dimensions. The feature mapping layer is divided into $n$ layers, and $l$ enhancement layers are given. The input feature vector $X$ is projected, and the $i$-th mapping feature vector can be expressed as
$$Z_i = \phi\left(X W_{e_i} + \beta_{e_i}\right), \quad i = 1, 2, \ldots, n,$$
where $W_{e_i}$ and $\beta_{e_i}$ represent the weight and bias of feature mapping layer $i$, and $\phi(\cdot)$ is an optional nonlinear activation function. All feature mapping layers can be denoted as
$$Z^n \triangleq \left[Z_1, Z_2, \ldots, Z_n\right],$$
which are co-trained at the same time.
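A sketch of the divided feature mapping layers under these definitions (the shapes and the `tanh` activation are illustrative assumptions):

```python
import numpy as np

def feature_mapping(X, weights, biases, phi=np.tanh):
    """Compute Z_i = phi(X @ W_ei + beta_ei) for i = 1..n and stack into Z^n.

    X       : (N, M) post-CapsNet feature vectors
    weights : list of n matrices W_ei, each (M, q)
    biases  : list of n bias rows beta_ei, each (q,)
    """
    Z = [phi(X @ W + b) for W, b in zip(weights, biases)]
    return np.hstack(Z)   # Z^n = [Z_1, ..., Z_n]
```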
After division, the weights and biases of the feature mapping layers can be written as $W_e = \{W_{e_1}, \ldots, W_{e_n}\}$ and $\beta_e = \{\beta_{e_1}, \ldots, \beta_{e_n}\}$. Similarly, the $j$-th group of enhancement nodes can be written as
$$H_j = \xi\left(Z^n W_{h_j} + \beta_{h_j}\right), \quad j = 1, 2, \ldots, l,$$
where $W_{h_j}$ and $\beta_{h_j}$, respectively, represent the weight and bias of enhancement layer $j$, and $\xi(\cdot)$ refers to a nonlinear activation function. With the same calculation, all enhancement layers can be represented as
$$H^l \triangleq \left[H_1, H_2, \ldots, H_l\right],$$
which are trained together to reduce the occupation of computing resources.
Furthermore, the weights and biases of the enhancement layers can be written as $W_h = \{W_{h_1}, \ldots, W_{h_l}\}$ and $\beta_h = \{\beta_{h_1}, \ldots, \beta_{h_l}\}$.
Lastly, the feature mapping and enhancement layers are concatenated and fed into the output layer, inspired by [34], which shows that this weighting efficiently improves performance during training while maintaining the efficiency of inference. The output layer can be described as
$$Y = \sigma\left(\left[Z^n \mid H^l\right] W\right),$$
where $\sigma(\cdot)$ denotes the nonlinear function.
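Putting the enhancement layers and the output head together, a minimal forward pass consistent with the formulas above (the activations `xi` and `sigma` are placeholders; `sigma` may be the identity or a softmax):

```python
import numpy as np

def enhancement_and_output(Zn, Wh, bh, W_out, xi=np.tanh,
                           sigma=lambda x: x):
    """Y = sigma([Z^n | H^l] @ W), with H_j = xi(Z^n @ W_hj + beta_hj).

    Zn    : (N, n*q) concatenated feature mapping nodes Z^n
    Wh,bh : lists of l enhancement weights W_hj / biases beta_hj
    W_out : (n*q + l*r, C) output weight
    """
    H = [xi(Zn @ W + b) for W, b in zip(Wh, bh)]   # enhancement layers H_1..H_l
    A = np.hstack([Zn] + H)                        # [Z^n | H^l]
    return sigma(A @ W_out)
```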