*2.1. Local Feature Extraction and Descriptor Quantization*

Since the SIFT feature is invariant to scale, orientation, and affine distortion, dense-SIFT descriptors were extracted in this study to represent each SAR image [14]. An image $I$ can be denoted by the set of its local feature descriptors, $I = [d\_1, d\_2, \dots, d\_N] \in \mathbb{R}^{D \times N}$, where $d\_i$ is a $D$-dimensional SIFT description vector. To build a codebook $B = [b\_1, b\_2, \dots, b\_M] \in \mathbb{R}^{D \times M}$, the K-means clustering algorithm [13] with the Euclidean distance was adopted to cluster the local features into groups, and the resulting cluster centers were taken as the codewords.
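As a concrete illustration, the sketch below builds such a codebook with OpenCV and scikit-learn. The grid step, patch size, codebook size $M = 1024$, and the `training_images` list are illustrative assumptions, not values taken from this study.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def dense_sift(img, step=8, size=16.0):
    """Compute SIFT descriptors on a regular grid (dense-SIFT)."""
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step, img.shape[0] - step, step)
                 for x in range(step, img.shape[1] - step, step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(img, keypoints)
    return descriptors  # shape (N, 128): one 128-D vector per grid point

# training_images is a hypothetical list of 8-bit grayscale arrays.
# Stack the descriptors of all training images and cluster into M groups;
# the M cluster centers form the codebook B.
all_desc = np.vstack([dense_sift(img) for img in training_images])
kmeans = KMeans(n_clusters=1024, n_init=10).fit(all_desc)
B = kmeans.cluster_centers_.T  # D x M codebook, matching the notation above
```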

Sparse coding [10] was applied to encode the local feature vectors, as follows:

$$\arg\min\_{V} \sum\_{i=1}^{N} ||d\_i - Bv\_i||\_2^2 + \lambda ||v\_i||\_1. \tag{1}$$

By solving Equation (1) with the feature-sign search algorithm [20], each descriptor $d\_i$ is encoded as an $M$-dimensional sparse vector $v\_i$.
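The feature-sign search solver of [20] is not available in common libraries; as a stand-in, scikit-learn's `sparse_encode` with its LARS-based lasso solver minimizes the same $\ell\_2$/$\ell\_1$ objective (up to a scaling of $\lambda$). A minimal sketch, with `alpha=0.15` as an illustrative value:

```python
from sklearn.decomposition import sparse_encode

# descriptors: N x D dense-SIFT matrix; B: D x M codebook from K-means.
# sparse_encode expects the dictionary atoms as rows, hence B.T (M x D);
# alpha plays the role of lambda in Equation (1) up to solver scaling.
V = sparse_encode(descriptors, B.T, algorithm='lasso_lars', alpha=0.15)
# Row i of V is the M-dimensional sparse code v_i for descriptor d_i.
```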

The next step is spatial pooling, which aims to obtain a more discriminative image representation from each sub-region. A three-level spatial pyramid was constructed: at each resolution $e$, $e = 0, 1, 2$, a grid with $2^e$ cells along each dimension was built, i.e., the 1 × 1, 2 × 2, and 4 × 4 grid structures, giving $K = 21$ sub-regions in total. Let $V$ be the collection of the $T$ local feature codes falling in a sub-region. The max-pooling strategy was used to aggregate these codes, since it yields features that are robust to the spatial variations of SAR images. The expression of the max-pooling is as follows:

$$t\_k = \max(v\_1, v\_2, \dots, v\_T) \tag{2}$$

where max denotes the element-wise maximum over the involved vectors. The pooled features of all the sub-regions were concatenated to form the image representation $f = [t\_1; t\_2; \cdots; t\_K]$.
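A possible numpy implementation of the three-level pyramid with max-pooling is sketched below; it assumes the sparse codes and the pixel coordinates of their descriptors are available as arrays.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, width, height, levels=(0, 1, 2)):
    """Max-pool sparse codes over a 1x1, 2x2, 4x4 spatial pyramid.

    codes: N x M sparse codes; xy: N x 2 descriptor coordinates.
    Returns the concatenated representation f of length 21 * M.
    """
    pooled = []
    for e in levels:
        cells = 2 ** e
        # Map each descriptor to its grid cell at this resolution.
        col = np.minimum((xy[:, 0] * cells / width).astype(int), cells - 1)
        row = np.minimum((xy[:, 1] * cells / height).astype(int), cells - 1)
        for r in range(cells):
            for c in range(cells):
                mask = (row == r) & (col == c)
                if mask.any():
                    pooled.append(codes[mask].max(axis=0))  # Equation (2)
                else:
                    pooled.append(np.zeros(codes.shape[1]))  # empty cell
    return np.concatenate(pooled)  # f = [t_1; t_2; ...; t_21]
```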

Consider a set of $G$ training images from $C$ classes, $F = [f\_1, f\_2, \cdots, f\_G]$, where $f\_g$ is the pooled feature vector of the $g$th image. Correspondingly, $F$ is partitioned by sub-region as $F = [F\_1^T, F\_2^T, \cdots, F\_K^T]^T$ with $F\_k \in \mathbb{R}^{d \times G}$ and $d < G$. Specifically, the $g$th column of $F\_k$ is the feature vector of the $k$th sub-region of the $g$th image, and $G$ is the number of training images. Analogously, the representation of a test image $y$ is partitioned as $y = [y\_1; y\_2; \cdots; y\_K]$.
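In matrix terms, this partitioning is a block split along the feature dimension; a short sketch, assuming every sub-region feature has the same length $d$ and using placeholder arrays `F` and `y`:

```python
import numpy as np

K = 21  # number of pyramid sub-regions
# F: (K*d) x G training matrix; y: length-(K*d) test vector.
F_blocks = np.split(F, K, axis=0)  # F_1, ..., F_K, each of shape d x G
y_blocks = np.split(y, K)          # y_1, ..., y_K, each of length d
```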
