**1. Introduction**

With the rapid development of hyperspectral sensors, present-day hyperspectral images (HSIs) contain rich spectral and spatial information. Consequently, different objects can be accurately recognized from HSIs using various classification algorithms for applications such as geological survey [1], mineral mapping [2], precision agriculture [3,4], and environmental monitoring [5].

HSI classification is one of the most popular problems in the field of remote sensing and has attracted considerable attention, but it faces the following challenges [6–8]. First, it is very difficult to acquire sufficient labeled samples. Second, information redundancy and the Hughes phenomenon are inevitable because of the high-dimensional features represented by hundreds of spectral bands. Finally, HSIs are often corrupted by different types of noise and dominated by mixed pixels. To address these problems, many researchers resorted to pixel-wise methods that classify each pixel in an HSI into a certain class using its spectral information alone [9–14]. Among them, the support vector machine (SVM) [10,15] and multinomial logistic regression (MLR) [16–18] are the two most commonly used techniques. However, because these methods do not consider spatial neighborhoods, they often produce considerable "salt-and-pepper" noise in classification maps, and their classification performance cannot be improved further.

This difficulty has been largely overcome by the appearance of spectral-spatial classification methods [19]. Generally, these methods can be divided into three categories. In the first category, spatial information is integrated with spectral information by using composite kernels [20–23]. Many methods exist for spatial feature extraction, such as mean filtering [20], area filtering [24], Gabor filtering [25], the gray-level co-occurrence matrix [26], edge-preserving filtering (EPF) [27], and extended morphological profiles (EMPs) [28]. In the second category, the spectral and spatial information are first integrated by image segmentation algorithms, such as mean-shift [29], watershed [30], hierarchical segmentation [31,32], minimum spanning forest [33], graph cut [34,35], and superpixel [36] approaches. Then, the final classification map is produced by combining the pixel-wise classification map and the unsupervised segmentation map through a majority voting algorithm. In the third category, the two types of information are jointly included in the classification process using Markov random field (MRF) models. By applying the maximum a posteriori (MAP) decision rule, HSI classification can be effectively solved by minimizing a MAP-MRF energy function; the combination of the SVM and an MRF-based model is a typical scheme [37–43].

Kernel-based classification methods have been very popular for HSI classification because they can effectively deal with the intractable issues of the curse of dimensionality, limited labeled samples, and noise corruption. The SVM algorithm using a single (e.g., linear, polynomial, or Gaussian radial basis function (RBF)) kernel has been widely used for image classification. To perform HSI classification, several SVM techniques using spectral-spatial kernels have been presented. For instance, Camps-Valls et al. [20] formulated a general framework of multiple kernels exploiting both spectral and spatial information, where the spatial information is defined using basic statistical measures within a fixed-size window in the image. The selection of a suitable window size is challenging because spatial structures cannot be accurately represented by a region of fixed size. To solve this problem, adaptive neighborhood systems based on morphological filtering and area filtering have been considered. On the one hand, Fauvel et al. [44] applied feature extraction to the original HSI and its EMPs, respectively, and performed SVM classification using the RBF kernel with spectral-spatial stacked vectors. Li et al. [45] developed an MLR framework using generalized composite kernels (GCK), where the spatial information is again represented by EMPs. The spatial information obtained by such methods is highly dependent on the structuring element (SE) of the morphological operators. On the other hand, Fauvel et al. [22] proposed an improved SVM using a customized spectral-spatial kernel, where the spatial information is modeled as the median value over the adaptive neighborhood of each pixel defined by morphological area filtering. The result achieved by such a method is very sensitive to the predefined number of areas. Recently, superpixel-based techniques have been applied to HSI classification by Shutao Li's research group. Fang et al. [46] presented an effective SVM classifier characterized by a superpixel-based composite kernel, where spectral, intra-superpixel, and inter-superpixel information are combined and the superpixel map is obtained using the entropy rate superpixel (ERS) algorithm. Meanwhile, texture features are crucial for object classification in HSIs. Later, we introduced an alternative SVM classifier featuring a spectral-texture kernel [23], where the textural information of each superpixel is modeled by its local spectral histogram. The number of superpixels is data-dependent and greatly influences classification results. Lu et al. [47] developed an effective HSI classification framework by integrating multiple feature-induced kernels into an SVM classifier, combining subpixel, pixel, and superpixel features. More recently, Peng et al. [48] improved the spectral-spatial composite kernel by embedding label information with an ideal regularization technique. However, the information extracted from the label domain cannot describe spatial structures well.

In this paper, we develop a novel SVM classification framework with spectral, spatial, and hierarchical kernels (SVM-SSHK), in which the spectral, spatial, and hierarchical structure information in HSIs is integrated into the SVM classifier by means of multiple kernels. Specifically, the spectral kernel is constructed from each pixel's vector value in the original HSI, and the spatial kernel is modeled using the EMP method because of its simplicity and effectiveness. To accurately characterize hierarchical structure features, the techniques of the Fisher-Markov selector (FMS), marker-based hierarchical segmentation (MHSEG), and algebraic multigrid (AMG) are combined. First, the FMS algorithm is applied to the original image for feature selection to produce a spectral subset. Then, the multigrid structure of this subset is constructed using the AMG method. Subsequently, the MHSEG algorithm is exploited to obtain a hierarchy consisting of a series of segmentation maps. Finally, the hierarchical structure information is modeled using these segmentation maps. The main contribution of this work is an effective composite-kernel framework for HSI classification that utilizes spatial structure information at multiple scales. The previously mentioned kernel-based approaches, which rely on a predefined number of regions, cannot simultaneously capture salient and fine structures in the image, whereas the proposed framework obtains a hierarchical representation of the spatial structure information in HSIs. Furthermore, this hierarchical structure depends only on the original HSI, avoiding the problem of choosing a neighborhood system or the size of a region (e.g., an area or a superpixel).
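To make the multiple-kernel idea concrete, the following Python sketch combines precomputed spectral, spatial, and hierarchical kernels as a weighted sum and feeds the composite kernel to an SVM. This is a minimal illustration only: the toy feature matrices, the kernel weights, and the RBF bandwidth are hypothetical placeholders, not the settings of the proposed SVM-SSHK framework.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def composite_kernel(feats_a, feats_b, weights, gamma=1.0):
    """Weighted sum of RBF kernels, one per feature type.

    feats_a, feats_b: lists of (n_a, d_i) / (n_b, d_i) arrays, e.g.,
    [spectral, spatial (EMP), hierarchical] features of two sample sets.
    """
    K = np.zeros((feats_a[0].shape[0], feats_b[0].shape[0]))
    for Fa, Fb, w in zip(feats_a, feats_b, weights):
        K += w * rbf_kernel(Fa, Fb, gamma=gamma)
    return K

# Hypothetical toy data: 100 training pixels with three feature types.
rng = np.random.default_rng(0)
spec = rng.normal(size=(100, 50))   # spectral vectors
spat = rng.normal(size=(100, 36))   # EMP features
hier = rng.normal(size=(100, 8))    # hierarchical features
y = rng.integers(0, 3, size=100)    # toy class labels

weights = [0.4, 0.3, 0.3]           # convex combination of the three kernels
K_train = composite_kernel([spec, spat, hier], [spec, spat, hier], weights)
clf = SVC(kernel="precomputed").fit(K_train, y)
```

Because a convex combination of positive semi-definite kernels is itself a valid kernel, the weighted sum remains admissible for the SVM.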

The remainder of the paper is organized as follows. Section 2 reviews the related techniques. Section 3 introduces the proposed classification framework, which is characterized by a spectral-spatial-hierarchical kernel. Section 4 reports experimental results in comparison with popular HSI classification methods and discusses several issues. The last section presents concluding remarks and future work.

## **2. Related Techniques**

Let $\mathbf{x} \equiv \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\} \in \mathbb{R}^B$ represent an HSI containing $N$ $B$-band pixel vectors, $\mathbf{y} \equiv \{y_1, y_2, \dots, y_N\} \in L^N$ the final classification result with $y_k \in \{L_1, L_2, \dots, L_Z\}$ $(k = 1, 2, \dots, N)$, $\{(\mathbf{x}_k, y_k)\}_{k=1}^{n}$ the training samples, and $n_i$ the number of training samples of class $L_i$ $(i = 1, 2, \dots, Z)$.

#### *2.1. Spatial Information with EMPs*

In spectral-spatial classification methods, the first step is to extract a few feature bands that model the spectral information of the HSI by dimensionality reduction, which minimizes redundant information and improves computational efficiency. To this end, several popular approaches have been used, such as principal component analysis (PCA) [28], independent component analysis (ICA) [49], kernel PCA [50], decision boundary feature extraction, nonparametric weighted feature extraction, and Bhattacharyya-distance feature selection [51]. In this work, the widely used PCA transform was adopted to produce the EMP, for three reasons. First, almost all of the spectral information in an HSI can be represented by the first three or four principal components (PCs). Second, object boundaries in the HSI are better preserved in the resultant PCs [27]. Finally, the EMP was originally constructed using PCA [7].

The main idea of the EMP is to capture the spatial information through morphological (opening/closing) operators while preserving the boundaries of the image. Let $k$ and $n$ be the total numbers of selected principal components (PCs) and morphological operators, respectively, $\psi$ and $\eta$ the opening and closing operations, and $I$ a gray-level image. The morphological profile (MP) of each PC is then built as follows:

$$MP(I) = \{\psi_n(I), \dots, \psi_2(I), \psi_1(I), I, \eta_1(I), \eta_2(I), \dots, \eta_n(I)\} \tag{1}$$

For each PC, the MP is a (2*n* + 1)-band image. Then, the MPs are stacked to obtain the EMP as follows:

$$\mathrm{EMP}_k = \{MP(\mathrm{PC}_1), MP(\mathrm{PC}_2), \dots, MP(\mathrm{PC}_k)\} \tag{2}$$

where the EMP is a stacked vector of dimensionality $k(2n+1)$ that includes both the spectral and spatial information of the HSI. In fact, EMPs can also be extracted for all spectral bands, or for some selected bands, without PCA, but this causes the following limitations. First, considerable redundancy can be observed in the resulting $B(2n+1)$-band image, where $B$ is the number of spectral bands of the HSI, which may decrease the classification accuracy. Second, the classification process must then handle such high-dimensional data at a much higher computational cost.
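As an illustration of Equations (1) and (2), the following Python sketch builds an EMP from a hypothetical HSI cube of shape (rows, cols, B). For simplicity it applies plain openings and closings with disk-shaped structuring elements; reconstruction-based operators, which are common in the EMP literature, could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.morphology import opening, closing, disk

def build_emp(hsi, k=3, n=4):
    """Extended morphological profile, Eqs. (1)-(2).

    hsi: (rows, cols, B) array; k: number of PCs; n: number of
    opening/closing scales. Returns a (rows, cols, k*(2n+1)) stack.
    """
    rows, cols, B = hsi.shape
    pcs = PCA(n_components=k).fit_transform(hsi.reshape(-1, B))
    pcs = pcs.reshape(rows, cols, k)
    profiles = []
    for c in range(k):
        pc = pcs[..., c]
        # MP(I) = {psi_n, ..., psi_1, I, eta_1, ..., eta_n}  (Eq. 1)
        mp = [opening(pc, disk(r)) for r in range(n, 0, -1)]
        mp += [pc]
        mp += [closing(pc, disk(r)) for r in range(1, n + 1)]
        profiles.extend(mp)
    # Eq. (2): stack the k MPs into one k*(2n+1)-band image
    return np.stack(profiles, axis=-1)
```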

#### *2.2. Band Selection with FMS*

In many research fields, feature selection is necessary for supervised classification. Given a set of test samples, the selected features are used to assign a class label to each sample. Feature selection and subspace methods are widely used for dimensionality reduction [52–55]. For instance, Cheng et al. [56] presented the FMS algorithm for feature selection on high-dimensional data, whose basic idea is to find the optimal subset of features that maximizes inter-class separability and minimizes intra-class variation in a higher-dimensional kernel space. By employing certain kernel functions, such as the polynomial kernel, the feature selection problem can be solved efficiently using MRF optimization techniques. In the original space, denote the within-class, between-class, and total scatter matrices by $S_w$, $S_b$, and $S_t$:

$$S_w = \frac{1}{n} \sum_{j=1}^{Z} \sum_{i=1}^{n_j} \left( \mathbf{x}_i^{(j)} - \mathbf{m}_j \right) \left( \mathbf{x}_i^{(j)} - \mathbf{m}_j \right)^T \tag{3}$$

$$S_b = \frac{1}{n} \sum_{j=1}^{Z} n_j \left( \mathbf{m}_j - \mathbf{m} \right) \left( \mathbf{m}_j - \mathbf{m} \right)^T \tag{4}$$

$$S_t = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{x}_i - \mathbf{m} \right) \left( \mathbf{x}_i - \mathbf{m} \right)^T = S_w + S_b \tag{5}$$

where $\mathbf{x}_i^{(j)}$ is the $i$th training sample in class $L_j$, and $\mathbf{m}_j$ and $\mathbf{m}$ represent the sample means of class $L_j$ and of the whole training set, respectively. The corresponding scatter matrices in the kernel space are denoted by $\tilde{S}_w$, $\tilde{S}_b$, and $\tilde{S}_t$, and their traces can be calculated as follows:

$$\operatorname{Tr}(\tilde{S}_w) = \frac{1}{n}\operatorname{Tr}(K) - \frac{1}{n}\sum_{i=1}^{Z} \frac{1}{n_i} \operatorname{Sum}\left(K^{(i)}\right) \tag{6}$$

$$\operatorname{Tr}(\tilde{S}_b) = \frac{1}{n} \sum_{i=1}^{Z} \frac{1}{n_i} \operatorname{Sum}\left(K^{(i)}\right) - \frac{1}{n^2} \operatorname{Sum}(K) \tag{7}$$

$$\operatorname{Tr}(\tilde{S}_t) = \frac{1}{n} \operatorname{Tr}(K) - \frac{1}{n^2} \operatorname{Sum}(K) \tag{8}$$

where $\operatorname{Tr}(\cdot)$ and $\operatorname{Sum}(\cdot)$ denote the trace of a matrix and the sum of all its entries, respectively, and $K$ and $K^{(i)}$ are matrices of sizes $n \times n$ and $n_i \times n_i$, respectively, with the following forms:

$$\{K\}_{k,l} = k(\mathbf{x}_k, \mathbf{x}_l), \quad k, l \in \{1, 2, \dots, n\} \tag{9}$$

$$\left\{ K^{(i)} \right\}_{u,v} = k\left( \mathbf{x}_u^{(i)}, \mathbf{x}_v^{(i)} \right), \quad u, v \in \{1, 2, \dots, n_i\}, \ i = 1, 2, \dots, Z \tag{10}$$

The feature selector is represented by $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \dots, \alpha_B]^T \in \{0,1\}^B$, where $\alpha_k = 1$ indicates that the $k$th feature is selected and $\alpha_k = 0$ that it is not. The features selected from a vector $\mathbf{x}$ are defined as follows:

$$\mathbf{x}(\boldsymbol{\alpha}) = \mathbf{x} \odot \boldsymbol{\alpha} \tag{11}$$

where $\odot$ denotes the Hadamard (element-wise) product. Substituting (11) into (9) and (10), $K$ and $K^{(i)}$ can be expressed as functions of $\boldsymbol{\alpha}$:

$$\{K(\boldsymbol{\alpha})\}_{k,l} = k(\mathbf{x}_k \odot \boldsymbol{\alpha}, \mathbf{x}_l \odot \boldsymbol{\alpha}) \tag{12}$$

$$\left\{ K^{(i)}(\boldsymbol{\alpha}) \right\}_{u,v} = k\left( \mathbf{x}_u^{(i)} \odot \boldsymbol{\alpha}, \mathbf{x}_v^{(i)} \odot \boldsymbol{\alpha} \right) \tag{13}$$

In this way, the previously mentioned scatter traces become functions of $\boldsymbol{\alpha}$, written $\operatorname{Tr}(\tilde{S}_w)(\boldsymbol{\alpha})$, $\operatorname{Tr}(\tilde{S}_b)(\boldsymbol{\alpha})$, and $\operatorname{Tr}(\tilde{S}_t)(\boldsymbol{\alpha})$.

The aim of feature selection is to maximize the class separation so that the selected variables have the greatest discriminative capability. Following Fisher's criterion, the following optimization problem is obtained:

$$\underset{\boldsymbol{\alpha} \in \{0,1\}^B}{\operatorname{argmax}} \left\{ \operatorname{Tr}\left(\tilde{S}_b\right)(\boldsymbol{\alpha}) - \lambda \operatorname{Tr}\left(\tilde{S}_t\right)(\boldsymbol{\alpha}) \right\} \tag{14}$$

where $\lambda$ is a parameter that balances the two terms.

In fact, Equation (14) is a special case of an MRF energy without pairwise interaction terms, so its globally optimal solution can be computed efficiently. In this work, the FMS algorithm is used to compute a coefficient for each band of the HSI that indicates its significance: the higher the coefficient, the more significant the corresponding band. In this way, the most relevant spectral bands are obtained.
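As a concrete illustration, consider the linear-kernel case, for which the objective of Equation (14) decomposes into one coefficient per band, so each band can be scored independently and a band is kept when its coefficient is positive. The following Python sketch computes these per-band coefficients; the function name and the value of the balance parameter `lam` are hypothetical.

```python
import numpy as np

def fms_band_scores(X, y, lam=0.5):
    """Per-band Fisher-Markov coefficients for the linear kernel.

    With a linear kernel, Eq. (14) has no pairwise term and reduces to
    sum_k alpha_k * (s_b[k] - lam * s_t[k]), so each band is scored
    independently. X: (n, B) training spectra; y: (n,) class labels.
    """
    n, B = X.shape
    m = X.mean(axis=0)
    s_t = ((X - m) ** 2).sum(axis=0) / n            # per-band total scatter
    s_b = np.zeros(B)
    for c in np.unique(y):
        Xc = X[y == c]                              # samples of class c
        s_b += len(Xc) * (Xc.mean(axis=0) - m) ** 2 / n
    return s_b - lam * s_t

# Bands with the largest (positive) coefficients are the most relevant:
# scores = fms_band_scores(train_spectra, train_labels)
# selected_bands = np.flatnonzero(scores > 0)
```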

#### *2.3. Hierarchical Representation of HSIs*

To construct a scale-space representation of an HSI $\mathbf{u} = [u_1, u_2, \dots, u_N]$, a vector-valued anisotropic diffusion PDE can be used [57,58]:

$$\frac{\partial u\_i}{\partial t} = \text{div}(g(\theta(\nabla \mathbf{u}\_\sigma)) \nabla u\_i), i = 1, 2, \dots, N \tag{15}$$

where $\mathbf{u}_\sigma$ is obtained by convolving $\mathbf{u}$ with a Gaussian kernel of standard deviation $\sigma$, and $g(\cdot)$ is the diffusivity, a decreasing function of the edge strength $|\nabla \mathbf{u}_\sigma|$. Recently, AMG has been used for the multiscale representation of HSIs because it constructs a hierarchical representation of the problem from a fine grid to a coarse grid, on whose coarsest level the associated linear system can be solved efficiently [59]. In this work, the proposed framework exploits all of the vertices in the multigrid structure as markers.

Following the work of [60], we can construct a "pyramid" multigrid structure of HSIs, as shown in Figure 1, where each grid $s = 1, 2, \dots, S$ is described by a weighted graph $(V^s, E^s)$, in which $V^s$ and $E^s$ are the sets of vertices and edges, respectively, and the weight $g_{ij}^s$ of an edge $(i,j) \in E^s$ expresses the similarity between the pixels $u_i^s$ and $u_j^s$ in $V^s$. Initially, the first graph $(V^0, E^0)$ is built from the original HSI: $V^0$ denotes the set of vertices, whose size equals the number of pixels in the HSI, while $E^0$ represents the set of weighted edges connecting each vertex to its four neighbors. In our method, the initial weights $g_{ij}^0$ of $(i,j) \in E^0$ are computed using the diffusivity of the anisotropic diffusion partial differential equation:

$$g(\theta) = \begin{cases} 1, & \theta = 0 \\ 1 - e^{-\frac{3.31488}{(\theta/\beta)^8}}, & \theta > 0 \end{cases} \tag{16}$$

where $\theta$ is an indicator of the image edge strength, computed using the Euclidean distance (ED) or the spectral angle mapper (SAM) between two pixel vectors, and $\beta$ denotes a gradient threshold.
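A minimal Python sketch of Equation (16) follows; `theta` is an array of edge strengths (e.g., Euclidean distances between four-neighbor pixel vectors) and `beta` the gradient threshold, both hypothetical inputs.

```python
import numpy as np

def diffusivity(theta, beta):
    """Weickert-type diffusivity of Eq. (16)."""
    theta = np.asarray(theta, dtype=float)
    g = np.ones_like(theta)                      # g(0) = 1
    pos = theta > 0
    g[pos] = 1.0 - np.exp(-3.31488 / (theta[pos] / beta) ** 8)
    return g

# Example: initial horizontal edge weights of the finest grid (V^0, E^0),
# assuming `img` is a (rows, cols, B) HSI cube:
# theta_h = np.linalg.norm(img[:, 1:] - img[:, :-1], axis=-1)
# g0_h = diffusivity(theta_h, beta=0.1)
```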

**Figure 1.** The multigrid structure of hyperspectral images (HSIs).

The main steps of building the multigrid structure are summarized as follows [60].

**Step 1:** Select a new vertex set $V^{l+1}$ from $V^l$. To build the AMG multigrid structure, the authors of [60] introduced a mass $m_i$ for each vertex, a measure of the number of pixels assigned to a vertex selected into the next grid, initialized as $m_i^0 = 1$. The first vertex of $V^{l+1}$ is the vertex in $V^l$ with the greatest mass. The remaining vertices in $V^l$ are sorted in decreasing order of mass; a new vertex is then iteratively selected if it satisfies the following condition [60]:

$$\frac{\sum_{j \in V^{l+1}} g_{ij}^l}{\sum_{(i,j) \in E^l} g_{ij}^l} \le \nu \Rightarrow V^{l+1} = V^{l+1} \cup \{i\}, \text{ for each } i \in V^l \backslash V^{l+1} \tag{17}$$

where $\nu$ is a threshold value with $0 \le \nu \le 1$, and $V^l \backslash V^{l+1}$ indicates the set difference between $V^l$ and $V^{l+1}$. In the multigrid structure, each vertex has a mass value, and the masses in the $(l+1)$th grid are calculated as follows:

$$\forall i \in V^{l+1}: m_i^{l+1} = m_i^l + \sum_{j \in V^l \backslash V^{l+1}} w_{ji}^l\, m_j^l \tag{18}$$

where $m_i^l$ and $m_i^{l+1}$ are the masses in the $l$th and $(l+1)$th grids, respectively, and $w_{ij}^l$ weights how much vertex $i \in V^l \backslash V^{l+1}$ depends on vertex $j \in V^{l+1}$:

$$\forall i \in V^l \backslash V^{l+1},\ j \in V^{l+1}: w_{ij}^l = w_{ji}^l = \frac{g_{ij}^l}{\sum_{k \in V^{l+1}} g_{ik}^l} \tag{19}$$

**Step 2:** Connect the vertices in $V^{l+1}$ to obtain $E^{l+1}$. The matrix of diffusivities is obtained using the Galerkin operator $\mathbf{G}^{l+1} = I_f^c \mathbf{G}^l I_c^f$ [61], where $I_f^c$ and $I_c^f$ denote the restriction and interpolation operators:

$$\left[I_f^c\right]_{ij} = \frac{w_{ij}^l}{1 + \sum_{k \in V^l \backslash V^{l+1}} w_{ik}^l} \tag{20}$$

$$\left[I_c^f\right]_{ij} = w_{ij}^l \tag{21}$$

According to the Galerkin operator, the weight in the $(l+1)$th grid is computed as follows:

$$g_{ij}^{l+1} = \frac{1}{1 + \sum_{k \in V^l \backslash V^{l+1}} w_{ik}^l} \sum_{p,q \in V^l} w_{ip}^l\, g_{pq}^l\, w_{qj}^l \tag{22}$$

$E^{l+1}$ is then obtained by connecting the vertices as follows:

$$E^{l+1} = \left\{ (i,j) \; : \; i, j \in V^{l+1} \wedge g_{ij}^{l+1} > 0 \right\} \tag{23}$$

By iteratively performing the previous two steps, an $S$-level multigrid structure of the HSI is constructed. The markers correspond to pixels of the smoothed image $\mathbf{u}_t$ determined by the positions of the vertices in the coarse grid. Each smoothed spectrum can be regarded as the average of spectrally similar and spatially adjacent pixels, which decreases noise and improves the representation of the different objects in the HSI. In this work, the vertices in each grid level are used as markers for the subsequent region-growing algorithm, as sketched below.
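The following Python sketch condenses Steps 1 and 2 into a single dense-matrix coarsening routine, under the interpretations given above: Equation (17) for seed selection, Equation (19) for the dependence weights, a mass-conserving reading of Equation (18), and a Galerkin-style product standing in for Equations (20)–(22). Variable names and the threshold value `nu` are illustrative assumptions, not the exact settings of [60].

```python
import numpy as np

def coarsen(G, mass, nu=0.2):
    """One AMG coarsening step: (V^l, E^l) -> (V^{l+1}, E^{l+1}).

    G: (n, n) symmetric weight matrix g^l of the current grid;
    mass: (n,) vertex masses m^l; nu: coupling threshold of Eq. (17).
    Returns the coarse vertex indices, coarse weights, coarse masses.
    """
    n = G.shape[0]
    order = np.argsort(-mass)                 # vertices in decreasing mass
    coarse = [order[0]]                       # seed: the heaviest vertex
    for i in order[1:]:
        # Eq. (17): i joins V^{l+1} if only weakly coupled to it
        if G[i, coarse].sum() <= nu * max(G[i].sum(), 1e-12):
            coarse.append(i)
    coarse = np.sort(np.array(coarse))
    fine = np.setdiff1d(np.arange(n), coarse)

    # Eq. (19): dependence of each fine vertex on the coarse vertices
    W = np.zeros((n, len(coarse)))
    W[fine] = G[np.ix_(fine, coarse)]
    W[fine] /= np.maximum(W[fine].sum(axis=1, keepdims=True), 1e-12)
    W[coarse, np.arange(len(coarse))] = 1.0   # coarse vertices keep themselves

    # Eq. (18) (mass-conserving reading): fine masses flow to coarse parents
    new_mass = mass[coarse] + W[fine].T @ mass[fine]

    # Eqs. (20)-(23): Galerkin-style coarse weights; edges where Gc > 0
    Gc = W.T @ G @ W
    np.fill_diagonal(Gc, 0.0)
    return coarse, Gc, new_mass
```

Iterating `coarsen` $S$ times yields the $S$-level pyramid of Figure 1, and the surviving vertex indices at each level serve as the markers for the subsequent region-growing step.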
