#### 2.2.3. Step 3

Use a sliding window of size $w \times 1$ to scan $\hat{\mathbf{s}}_i$ and collect all the patches. Compute the histogram features of all the patches (with $2^N$ bins each) and concatenate them into a single vector $\mathbf{f}_i^m$, where $m \in \{1, 2, \cdots, M\}$ indexes the $m$th sub-feature set, $M$ is the number of sub-feature sets, and $\mathbf{f}_i^m \in \mathbb{R}^{2^N(L-m+1)\times 1}$ if the step size of the sliding window is set to 1. Finally, the hierarchical feature for pixel $i$ is given by

$$\mathbf{f}_{i} = [\mathbf{f}_{i}^{1}, \mathbf{f}_{i}^{2}, \dots, \mathbf{f}_{i}^{M}] \in \mathbb{R}^{2^{N}\sum_{m=1}^{M}(L-m+1)\times 1}.\tag{9}$$

Obviously, increasing the step size of the sliding window reduces the dimensionality of the obtained features; in practice, 50% overlap between adjacent patches is appropriate. The window size also influences the results: theoretically, a smaller $w$ enhances the sparsity of the obtained features but leads to very high dimensionality. To balance sparsity against computational and memory cost, we set the window size to $7 \times 1$.
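As a concrete illustration, the following Python sketch computes the patch histograms for one sub-feature set and concatenates the $M$ sub-feature vectors as in Equation (9). It is a minimal sketch, not the authors' implementation: the function names, the assumption that the previous step yields integer $N$-bit codes, and the default parameters ($N = 8$, $w = 7$, a step of 3 for roughly 50% overlap) are our own choices for illustration.

```python
import numpy as np

def patch_histogram_features(s_hat, n_bits=8, w=7, step=3):
    """Sliding-window patch histograms for one sub-feature set (a sketch, assumed shapes).

    s_hat  : 1-D array of integer codes in [0, 2**n_bits), one pixel's code sequence.
    n_bits : N, so each patch histogram uses 2**N bins.
    w      : sliding-window size (w x 1); 7 in the paper.
    step   : window step; step = 3 gives roughly 50% overlap for w = 7.
    """
    n_bins = 2 ** n_bits
    feats = []
    # Slide the w x 1 window along the code sequence and take a histogram per patch.
    for start in range(0, len(s_hat) - w + 1, step):
        patch = s_hat[start:start + w]
        hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins))
        feats.append(hist)
    # Concatenate all patch histograms into the sub-feature vector f_i^m.
    return np.concatenate(feats)

def hierarchical_feature(sub_codes, n_bits=8, w=7, step=3):
    """Stack the M sub-feature vectors into the final hierarchical feature f_i (Eq. (9))."""
    return np.concatenate([patch_histogram_features(s, n_bits, w, step)
                           for s in sub_codes])
```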

H2F can be regarded as a hierarchical representation of the original HSI data. Based on our empirical experience, we do not recommend applying dimension reduction to H2F, because it may cause a loss of distinctive information. Instead, to reduce the computational cost and avoid overfitting, we use a very simple classifier, ELM, to produce the final classification results.

#### *2.3. ELM Based Classification*

ELM [51] is a simple neural network with only three layers (input, hidden, and output) that performs well on small-scale data sets. ELM has two defining characteristics: (1) the weights between the input and hidden layers are assigned randomly; and (2) the weights between the hidden and output layers are learned by a least-squares algorithm.

Let $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \cdots, \mathbf{f}_{n_t}] \in \mathbb{R}^{d \times n_t}$ denote the training sample matrix, where $d$ is the feature dimension and $n_t$ is the number of training samples. In ELM, the weights between the input and hidden layers are generated randomly and denoted by $\mathbf{W} \in \mathbb{R}^{n_h \times d}$, where $n_h$ is the number of nodes in the hidden layer. The objective function of ELM can then be written as

$$\mathbf{B}\, g(\mathbf{W} \cdot \mathbf{f}_i + \mathbf{b}) = \mathbf{y}_i,\tag{10}$$

where $\mathbf{B} \in \mathbb{R}^{C \times n_h}$ is the weight matrix connecting the hidden and output layers, $\mathbf{b} \in \mathbb{R}^{n_h \times 1}$ is the bias vector of the hidden layer, $g(\cdot)$ is an activation function such as the sigmoid function, $C$ is the number of classes, $\mathbf{Y} \in \mathbb{R}^{C \times n_t}$ is the label matrix of all the training samples, and $\mathbf{y}_i$ is its $i$th column. Note that $g(\mathbf{W} \cdot \mathbf{f}_i + \mathbf{b})$ is the output of the hidden layer for sample $\mathbf{f}_i$. Because $\mathbf{W}$ and $\mathbf{b}$ are randomly assigned, the outputs of the hidden layer are fully determined. Equation (10) is therefore equivalent to the following expression:

$$\mathbf{H} \cdot \mathbf{B} = \mathbf{Y},\tag{11}$$

where $\mathbf{H}$ collects the hidden-layer outputs of all the training samples. Obviously, Equation (11) can be solved by a simple least-squares method, i.e., $\mathbf{B} = \mathbf{H}^{+}\mathbf{Y}$, where $\mathbf{H}^{+}$ denotes the Moore–Penrose pseudoinverse of $\mathbf{H}$.
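To make the training and prediction steps concrete, the following Python sketch implements the ELM described above with NumPy. It is a minimal sketch under our own conventions, not the authors' code: samples are stored as columns (as in the definition of $\mathbf{F}$ above), so the least-squares solution appears as $\mathbf{B} = \mathbf{Y}\mathbf{H}^{+}$, the transposed form of Equation (11); the hidden-layer size and the random seed are assumptions.

```python
import numpy as np

def elm_train(F, Y, n_hidden=500, seed=0):
    """Train an ELM (sketch). F: d x n_t training features, Y: C x n_t one-hot labels."""
    rng = np.random.default_rng(seed)
    d = F.shape[0]
    # Randomly assigned input-to-hidden weights W (n_h x d) and bias b (n_h x 1).
    W = rng.standard_normal((n_hidden, d))
    b = rng.standard_normal((n_hidden, 1))
    # Hidden-layer outputs H = g(W F + b), using the sigmoid activation g.
    H = 1.0 / (1.0 + np.exp(-(W @ F + b)))   # n_h x n_t
    # Least-squares solution of Eq. (11), written with samples as columns: B H = Y.
    B = Y @ np.linalg.pinv(H)                # C x n_h
    return W, b, B

def elm_predict(F, W, b, B):
    """Class scores for test features; predicted labels are the column-wise argmax."""
    H = 1.0 / (1.0 + np.exp(-(W @ F + b)))
    scores = B @ H                           # C x n_test
    return np.argmax(scores, axis=0)
```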

In H2F, the final features are classified by ELM. Because generating the random matrix takes little time, the main computational cost lies in solving Equation (11). As long as the number of hidden nodes is restricted, training can be very fast. In Algorithm 1, we provide pseudocode for the H2F-based HSI classification method.
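For completeness, the snippet below shows one hypothetical way to tie the two sketches above together; it is not Algorithm 1 itself, and the variable names and shapes are assumptions. Here `X_train` and `X_test` would hold, for every pixel, the $M$ code sequences produced by the earlier steps, and `Y_train` is the $C \times n_t$ one-hot label matrix.

```python
import numpy as np

# Hypothetical glue code reusing hierarchical_feature, elm_train and elm_predict from above.
F_train = np.stack([hierarchical_feature(codes) for codes in X_train], axis=1)  # d x n_t
F_test = np.stack([hierarchical_feature(codes) for codes in X_test], axis=1)
W, b, B = elm_train(F_train, Y_train, n_hidden=500)
predicted_labels = elm_predict(F_test, W, b, B)
```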


#### **3. Experiments and Discussion**
