#### *2.1. Superpixel Segmentation*

The multi-scale segmentation algorithm [37] is an image segmentation method whose outputs are called patches. The essence of segmentation is to partition the image into many non-overlapping sub-regions; these patches or sub-regions are what we call superpixels. In this paper, a multi-scale segmentation algorithm is used to generate superpixels. The method uses a bottom-up region-growing strategy to group pixels with similar spectral values into the same superpixel. The key idea of the method is that the heterogeneity of the grouped region under the constraint term is minimal. The multi-scale segmentation method consists of the following steps:

(1) We define a termination condition *T*, also called the scale parameter, to control when region merging stops. If *T* is smaller, the number of regions will be greater and each region will contain fewer pixels, and vice versa.

(2) Calculation of the spectral heterogeneity *h*1 and the spatial heterogeneity *h*2:

$$h\_1 = \sum\_{i=1}^{n} \omega\_i \sigma\_i \tag{1}$$

$$h\_2 = \omega\_u u + (1 - \omega\_u)v \tag{2}$$

where *σi* is the standard deviation of the spectral values of the *i*-th band in the region, *ωi* is the weight of the *i*-th band, and *n* is the number of bands. *u* and *v* represent the smoothness and compactness of the region, respectively, and *ωu* is the weight of the smoothness.

(3) The regional heterogeneity *f* can be obtained by combining *h*1 and *h*2:

$$f = \omega h\_1 + (1 - \omega)h\_2\tag{3}$$

Here, *ω* is the weight of the spectral heterogeneity, and its value ranges from 0 to 1.

(4) Observation of the heterogeneity *f* of the regions: if *f* < *T*, the region with the smallest heterogeneity is merged with its adjacent region.

(5) Repeat Step (4) until there are no regions that need to be merged.
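The heterogeneity test in Equations (1)–(3) can be sketched as follows. This is a minimal illustration of the merge criterion only, not the full region-growing algorithm; the toy regions, band weights, and the fixed smoothness/compactness values are hypothetical:

```python
import numpy as np

def spectral_heterogeneity(region, band_weights):
    # Eq. (1): h1 = sum_i w_i * sigma_i over the n spectral bands
    sigmas = region.std(axis=0)              # per-band standard deviation
    return float(np.dot(band_weights, sigmas))

def spatial_heterogeneity(u, v, w_u):
    # Eq. (2): h2 = w_u * u + (1 - w_u) * v  (u: smoothness, v: compactness)
    return w_u * u + (1 - w_u) * v

def regional_heterogeneity(region, band_weights, u, v, w_u, omega):
    # Eq. (3): f = omega * h1 + (1 - omega) * h2
    h1 = spectral_heterogeneity(region, band_weights)
    h2 = spatial_heterogeneity(u, v, w_u)
    return omega * h1 + (1 - omega) * h2

# Two hypothetical 2-pixel, 3-band regions; test whether merging them passes T
a = np.array([[0.2, 0.4, 0.1], [0.3, 0.5, 0.2]])
b = np.array([[0.8, 0.9, 0.7], [0.7, 0.8, 0.6]])
merged = np.vstack([a, b])
f = regional_heterogeneity(merged, band_weights=np.ones(3) / 3,
                           u=1.0, v=0.5, w_u=0.5, omega=0.9)
T = 0.5
merge_allowed = f < T        # Step (4): merge only if f < T
```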

#### *2.2. Spatial-Spectral Graph-Based Label Propagation*

The label propagation algorithm [24,40] is a graph-based classification method in which class labels are assigned to unlabeled samples by building a graph to propagate the labels. This algorithm models the input image *X* = {*x*1, *x*2, ..., *xn*} ∈ *R*<sup>*d*×*n*</sup> as a weighted graph *G* = (*V*, *E*), in which the vertices *v* ∈ *V* correspond to the pixels and the edges *e* ∈ *E* ⊆ *V* × *V* correspond to the links that connect two adjacent pixels. The label propagation algorithm consists of the following steps.

(1) A set of labeled pixels *VM* is provided, where each pixel *vi* ∈ *VM* has been assigned a label *c* ∈ *L* = {1, ..., *K*}.

(2) The unlabeled sample set *VU*, composed of the neighbors of the labeled samples, and the labeled sample set *VM* are taken as the nodes of the weighted graph. Then, the weight matrix of the spectral graph *Ww* and that of the spatial graph *Ws* are calculated as follows:

$$\mathcal{W}\_{ij}^{w} = e^{-\left(\frac{||x\_{i} - x\_{j}||^{2}}{2\sigma^{2}}\right)} \quad \text{if} \quad x\_{j} \in NB\_{k}^{w}(x\_{i}) \tag{4}$$

$$\mathcal{W}\_{ij}^{s} = e^{-\left(\frac{||x\_{i} - x\_{j}||^{2}}{2\sigma^{2}}\right)} \quad \text{if} \quad x\_{j} \in NB\_{d}^{s}(x\_{i}) \tag{5}$$

where *NBwk*(*xi*) is the set of *k* nearest neighbors of *xi* obtained by the spectral Euclidean distance, and *σ* is a free parameter. *NBsd*(*xi*) is the set of spatial neighbors of *xi* in a spatial neighborhood system whose width is *d*.

(3) Construction of the combined weight graph *W* as follows:

$$\mathcal{W} = \mu \mathcal{W}^{w} + (1 - \mu) \mathcal{W}^{s} \tag{6}$$

where *μ* balances the contributions of the spectral and spatial graphs.
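The graph construction of Equations (4)–(6) can be sketched as follows; this is a minimal illustration on a few hypothetical pixels, and the spatial adjacency matrix is a hand-written stand-in for the neighborhood system of Equation (5):

```python
import numpy as np

def spectral_graph(X, k, sigma):
    # Eq. (4): Gaussian weights over the k spectral nearest neighbors of each pixel
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]                     # k nearest, excluding self
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2 * sigma ** 2))
    return W

# Four hypothetical pixels with 2 spectral bands each
X = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [0.85, 0.75]])
W_w = spectral_graph(X, k=1, sigma=0.5)

# Hypothetical spatial adjacency standing in for Eq. (5)'s neighborhood graph
W_s = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

mu = 0.6
W = mu * W_w + (1 - mu) * W_s    # Eq. (6): combined spatial-spectral graph
```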

(4) According to the weight matrix, the propagation probability from the *i*-th node to the *j*-th node in the graph is calculated as follows:

$$H\_{ij} = \frac{\mathcal{W}\_{ij}}{\sum\_{k=1}^{n} \mathcal{W}\_{ik}} \tag{7}$$
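Equation (7) is a row normalization of the weight matrix, which can be sketched on a hypothetical three-node graph:

```python
import numpy as np

# Hypothetical combined weight matrix W (Eq. 6) for three nodes
W = np.array([[0.0, 0.8, 0.2],
              [0.8, 0.0, 0.5],
              [0.2, 0.5, 0.0]])

# Eq. (7): divide each row by its sum so every row of H is a probability distribution
H = W / W.sum(axis=1, keepdims=True)
```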

(5) The labeled matrix *M* and the probability distribution matrix *P* are initialized:

$$M\_{ik} = \begin{cases} 1 & c\_i = k, i \le m \\ 0 & c\_i \ne k, i \le m \\ 1/K & m < i \le n \end{cases} \tag{8}$$

$$P\_{ik} = M\_{ik} \qquad 1 \le i \le n, 1 \le k \le K \tag{9}$$

where the value of the labeled matrix is the initial probability of each node. If node *i* is a labeled sample of class *k*, then the probability that the *i*-th node belongs to the *k*-th class is one, while the probability of belonging to the other classes is zero. If the node is an unlabeled sample, then the probability that it belongs to each class is initialized as 1/*K*.
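The initialization of Equations (8) and (9) can be sketched with hypothetical toy sizes:

```python
import numpy as np

n, m, K = 5, 2, 3        # hypothetical sizes: n nodes, m labeled, K classes
labels = [0, 2]          # class indices of the first m (labeled) nodes

# Eq. (8): one-hot rows for labeled nodes, uniform 1/K rows for unlabeled nodes
M = np.full((n, K), 1.0 / K)
M[:m] = 0.0
for i, c in enumerate(labels):
    M[i, c] = 1.0

# Eq. (9): the probability distribution starts out equal to the labeled matrix
P = M.copy()
```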

(6) Propagation process: each node accumulates the weighted label information transmitted from its adjacent nodes according to the propagation probability **H** and updates its probability distribution **P**, which indicates the degree to which the node belongs to each class. The update formula is as follows:

$$P\_{\rm ij} = \sum\_{k=1}^{n} H\_{\rm ik} P\_{\rm kj} \qquad 1 \le i \le n, 1 \le j \le K \tag{10}$$

(7) After the propagation probability **P** is obtained, labels are assigned to the unlabeled samples based on the maximum probability.

$$c\_i = \arg\max\_{j \le K} P\_{ij} \qquad 1 \le i \le n \tag{11}$$

All the nodes in the graph update their probability distributions based on the probability distributions of adjacent nodes. The label propagation algorithm is executed iteratively until the probability distributions of the nodes converge, and the class with the highest propagation probability is then selected as the class label for each node. The propagation procedure is shown in Figure 1, where the light gray and dark gray nodes are labeled samples from different classes, while hollow nodes represent unlabeled samples. The values on the arrows are the propagation probabilities from the labeled samples to the unlabeled samples.
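Steps (4)–(7) can be sketched as a single loop on a hypothetical four-node chain graph. One assumption here: the labeled samples are re-clamped to their one-hot rows after every update, a common label propagation convention implied by the fixed labels of Step (5):

```python
import numpy as np

def label_propagation(W, seeds, K, tol=1e-6, max_iter=1000):
    # seeds: {node_index: class_label} for the labeled samples
    n = len(W)
    H = W / W.sum(axis=1, keepdims=True)      # Eq. (7)
    P = np.full((n, K), 1.0 / K)              # Eqs. (8)-(9)
    for i, c in seeds.items():
        P[i] = 0.0
        P[i, c] = 1.0
    for _ in range(max_iter):
        P_new = H @ P                         # Eq. (10)
        for i, c in seeds.items():            # re-clamp labeled samples
            P_new[i] = 0.0
            P_new[i, c] = 1.0
        if np.abs(P_new - P).max() < tol:     # converged
            return P_new.argmax(axis=1)       # Eq. (11)
        P = P_new
    return P.argmax(axis=1)

# Hypothetical 4-node chain: node 0 labeled class 0, node 3 labeled class 1
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pred = label_propagation(W, seeds={0: 0, 3: 1}, K=2)
```

The two interior nodes converge to the class of their nearer seed, so `pred` is `[0, 0, 1, 1]`.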

**Figure 1.** Procedure of label propagation.

#### *2.3. Rolling Guidance Filtering*

Filtering is an important step that removes weak edges while preserving strong ones when performing classification. In order to capture the different objects and structures in an image, rolling guidance filtering (RGF) is used to remove small-scale structures while preserving the original appearance of large-scale structures. The results processed by RGF are then used as the input features of the SVM classifier, which improves the classification accuracy. Rolling guidance filtering [39] consists of two steps:

(1) Small structure removal:

In this step, a Gaussian filter is applied to blur the image, and the output is expressed as:

$$f^0(p) = \frac{\sum\_{q \in R(p)} \exp\left(-\frac{|p-q|^2}{2\sigma\_s^2}\right) I(q)}{\sum\_{q \in R(p)} \exp\left(-\frac{|p-q|^2}{2\sigma\_s^2}\right)}\tag{12}$$

where *I* is the input image, *p* and *q* index the pixel coordinates in the image, *R*(*p*) is the set of neighborhood pixels of *p*, and *σs* is the standard deviation of the Gaussian filter (variance *σs*<sup>2</sup>). This means that when the scale of an image structure is smaller than *σs*, the structure will be completely removed.
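Equation (12) can be sketched as a direct, unoptimized implementation; the 3*σs* truncation radius is an assumption, since Equation (12) does not fix the size of *R*(*p*):

```python
import numpy as np

def gaussian_blur(I, sigma_s):
    # Eq. (12): normalized Gaussian average over the neighborhood R(p)
    r = int(3 * sigma_s)                      # truncation radius (assumption)
    h, w = I.shape
    out = np.zeros_like(I, dtype=float)
    for py in range(h):
        for px in range(w):
            num = den = 0.0
            for qy in range(max(0, py - r), min(h, py + r + 1)):
                for qx in range(max(0, px - r), min(w, px + r + 1)):
                    g = np.exp(-((py - qy) ** 2 + (px - qx) ** 2) / (2 * sigma_s ** 2))
                    num += g * I[qy, qx]
                    den += g
            out[py, px] = num / den
    return out

# A single bright pixel (a small-scale structure) is spread out and suppressed
I = np.zeros((7, 7))
I[3, 3] = 1.0
blurred = gaussian_blur(I, sigma_s=1.0)
```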

(2) Large-scale edge recovery:

Large-scale edge recovery is implemented in two steps. In the first step, the image processed by the Gaussian filter is treated as the guidance image *f*<sup>0</sup>, and the joint bilateral filter is applied to the guidance image *f*<sup>0</sup> and the initial image *I* to obtain the output image *f*<sup>1</sup>. In the second step, the guidance image is updated continuously by feeding the output of the previous iteration as the guidance for the next iteration. When the large-scale edges are recovered, the iteration terminates. This procedure can be described as follows:

$$f^{t+1}(p) = \frac{1}{K\_p} \sum\_{q \in N(p)} \exp(\frac{-||p-q||^2}{2\sigma\_s^2} - \frac{||f^t(p) - f^t(q)||^2}{2\sigma\_r^2}) I(q) \tag{13}$$

$$K\_p = \sum\_{q \in N(p)} \exp(\frac{-||p-q||^2}{2\sigma\_s^2} - \frac{||f^t(p) - f^t(q)||^2}{2\sigma\_r^2})\tag{14}$$

where Equation (14) is used for normalization, *σs* and *σr* control the spatial and range weights, respectively, *t* is the iteration number, and *f*<sup>*t*+1</sup> is the result of the (*t* + 1)-th iteration.

Through the above two steps, RGF performs well on hyperspectral images. Thus, rolling guidance filtering is used to extract the information and features of the initial images, and the filtered image *Ĩ* is expressed as follows:

$$\tilde{I} = \mathrm{RGF}(I) \tag{15}$$

where RGF(·) is the rolling guidance filtering operator and *I* is the initial input image.
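The two RGF steps can be sketched together as follows. One useful observation: with a constant zero guidance, the range term of Equation (13) vanishes, so the first pass reduces exactly to the Gaussian blur of Equation (12). The window radius, parameter values, and iteration count are illustrative assumptions:

```python
import numpy as np

def joint_bilateral_step(I, f_t, sigma_s, sigma_r, r):
    # One pass of Eqs. (13)-(14): the guidance f_t controls the range weights
    h, w = I.shape
    out = np.zeros_like(I, dtype=float)
    for py in range(h):
        for px in range(w):
            num = K_p = 0.0
            for qy in range(max(0, py - r), min(h, py + r + 1)):
                for qx in range(max(0, px - r), min(w, px + r + 1)):
                    spatial = ((py - qy) ** 2 + (px - qx) ** 2) / (2 * sigma_s ** 2)
                    rng = (f_t[py, px] - f_t[qy, qx]) ** 2 / (2 * sigma_r ** 2)
                    g = np.exp(-spatial - rng)        # Eq. (14) summand
                    num += g * I[qy, qx]
                    K_p += g
            out[py, px] = num / K_p                   # Eq. (13)
    return out

def rolling_guidance_filter(I, sigma_s=1.0, sigma_r=0.1, iters=4):
    # f^0 comes from a zero guidance (pure Gaussian blur, Eq. 12);
    # later passes recover the large-scale edges of I.
    f = np.zeros_like(I, dtype=float)
    for _ in range(iters):
        f = joint_bilateral_step(I, f, sigma_s, sigma_r, r=int(3 * sigma_s))
    return f

# A step edge (large-scale structure) survives the filtering
I = np.zeros((6, 6))
I[:, 3:] = 1.0
out = rolling_guidance_filter(I)
```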
