**1. Introduction**

Hyperspectral imagery with hundreds of narrow spectral channels provides rich spectral information. Owing to its very high spectral resolution, hyperspectral data have attracted great interest in many practical applications, such as agriculture, environmental monitoring, surveillance, and medicine [1–4]. Hyperspectral classification is a key technique in these applications. Numerous classification methods have been proposed over the last several decades to distinguish physical objects and assign each pixel a unique land-cover label, such as maximum likelihood [5], minimum distance [6], K-nearest neighbors [7,8], random forests [9], Bayesian models [10,11], neural networks, and their improvements [12–15]. Among these supervised classifiers, one of the most important is the kernel-based support vector machine (SVM), which can also be viewed as a kind of neural network. It achieves superior hyperspectral classification accuracy by constructing an optimal hyperplane that best separates the training samples.

In addition, sparse representation based on an over-complete signal dictionary has gained great attention in the literature. Sparse representation-based classification (SRC) [16–18] and collaborative representation-based classification (CRC) [19,20] approach the problem from a different angle: they do not adopt the traditional training–testing fashion and require no prior knowledge of the probability density distribution of the data. To further enhance the performance of SRC and CRC, Du and Li [21] utilized a diagonal weight matrix to adaptively adjust the regularization parameter. To address the Hughes phenomenon in hyperspectral classification, many feature extraction and selection algorithms are used to remove redundant features from the original data. To further improve classification performance, multiple features can be extracted and combined. For instance, Kang et al. combined spectral and spatial features through a guided filter to process the pixel-wise classification map of each class [22]. Several studies [23–25] focused on integrating spatial and spectral information in hyperspectral imagery. In addition, texture features have been used to assist hyperspectral classification [26], and modeling hyperspectral image textures is important for classification and material identification.

Recent research has highlighted deep learning with deep neural networks, which can learn high-level features hierarchically. Deep models have demonstrated their potential in image classification, which has also motivated their successful application to hyperspectral image classification. The classic deep learning method is the convolutional neural network (CNN), which plays a dominant role in vision-based problems. The local receptive fields of a CNN can extract spatially related features at high levels. Fukushima [27] introduced the motivations behind CNNs. Ciresan et al. and Lee et al. [28,29] described the invariance properties of CNNs. Chen et al. proposed 2-D and 3-D CNNs [30] to capture deep, abstract, and robust features, yielding superior hyperspectral classification performance. Although CNNs are typical supervised models, a massive training dataset is needed to unlock their power. Unfortunately, only a limited number of labeled samples are usually available in hyperspectral imagery. Deep belief networks (DBNs) [31] and stacked autoencoders (SAEs) [32] are also very promising deep learning methods for hyperspectral classification with limited training samples.

In this paper, we mainly investigate the suitability and practicality of the DBN for hyperspectral classification. A novel hyperspectral classification framework is proposed based on an optimized DBN. To achieve the desired performance, we also propose an algorithm to enhance the texture features of hyperspectral imagery. The main contributions of this paper are summarized below.


The rest of the paper is organized as follows. Section 2 briefly describes related work. In Section 3, we detail our proposed DBN model. Datasets and parameter settings are described in Section 4. Experimental results and discussion are presented in Section 5. Section 6 concludes the paper.

#### **2. The Related Work**

A deep belief network (DBN) is a model that is first pre-trained in an unsupervised way; the available labeled training samples are then used to fine-tune the pre-trained model by optimizing a cost function defined over the labels of the training samples and their predictions.

The original DBN, published in Science [33], uses a generative model in the pre-training procedure and back-propagation in the fine-tuning stage. This is very useful when the number of training samples is limited, as in hyperspectral remote sensing. A DBN can be efficiently trained in an unsupervised, layer-by-layer manner, where the layers are typically restricted Boltzmann machines (RBMs). Thus, to explain the structure and theory of the DBN, we first describe its main component, the RBM.

#### *2.1. Restricted Boltzmann Machines (RBM)*

An RBM is generally trained with unsupervised learning and can be interpreted as a stochastic neural network. It was originally developed to form distributed representations. It is a two-layer network composed of visible and hidden units. An RBM allows full connections between the visible and hidden layers, but no connections between two visible units or between two hidden units. Given the visible units, the hidden units can be obtained via a mapping of the visible units, and the activations of the hidden neurons are mutually independent. Likewise, given the hidden units, the visible units behave in the same way. A typical RBM structure is depicted in Figure 1.

**Figure 1.** Architecture of Restricted Boltzmann Machines.

The visible units can be represented as **v**, and the hidden units can be expressed as **h**. The RBM is a kind of energy-based model in which the joint distribution of the layers follows a Boltzmann distribution. Energy-based probabilistic models define a probability distribution through an energy function as:

$$p(\mathbf{v}, \mathbf{h} | \theta) = \frac{\exp\left(-E(\mathbf{v}, \mathbf{h} | \theta)\right)}{Z(\theta)},\tag{1}$$

where the normalization constant *Z*(*θ*) is called the partition function by analogy with physical systems:

$$Z(\theta) = \sum\_{\mathbf{v}} \sum\_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right) \tag{2}$$

A joint configuration of the units has an energy given by:

$$\begin{array}{rcl} E(\mathbf{v}, \mathbf{h}; \theta) &=& -\sum\_{i=1}^{n} a\_i v\_i - \sum\_{j=1}^{m} b\_j h\_j - \sum\_{i=1}^{n} \sum\_{j=1}^{m} v\_i w\_{ij} h\_j \\ &=& -\mathbf{a}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T \mathbf{w} \mathbf{h} \end{array} \tag{3}$$

where *θ* = {*ai*, *bj*, *wij*}; *wij* represents the weight connecting the visible unit *i* and the hidden unit *j*; *ai* and *bj* denote the bias terms of visible and hidden layers, respectively; *n* and *m* are the total visible and hidden unit numbers; and *vi* and *hj* represent the states of visible unit *i* and hidden unit *j*.
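
For illustration, Equations (1)–(3) can be checked numerically on a toy RBM small enough to enumerate every joint state exactly; the parameter values below are our own illustrative choices, not learned weights:

```python
import numpy as np
from itertools import product

# Toy RBM with n = 2 visible and m = 2 hidden binary units.
# Parameter values are illustrative assumptions, not learned.
a = np.array([0.1, -0.2])          # visible biases a_i
b = np.array([0.3, 0.0])           # hidden biases b_j
W = np.array([[0.5, -0.1],
              [0.2, 0.4]])         # weights w_ij

def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T w h, as in Equation (3)."""
    return -a @ v - b @ h - v @ W @ h

# Partition function Z: sum of exp(-E) over all joint states, Equation (2).
states = [np.array(s, dtype=float) for s in product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h)) for v in states for h in states)

def p_joint(v, h):
    """Boltzmann distribution p(v, h) = exp(-E(v, h)) / Z, Equation (1)."""
    return np.exp(-energy(v, h)) / Z

# The probabilities over all 16 joint configurations sum to one.
total = sum(p_joint(v, h) for v in states for h in states)
print(round(total, 10))  # 1.0
```

The brute-force enumeration is only feasible for toy sizes; for realistic layer widths, *Z*(*θ*) is intractable, which is why contrastive divergence (Section 2.1) avoids computing it.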

Due to the specific structure of RBMs, visible and hidden units are conditionally independent, as given by:

$$\begin{array}{l} P(v\_i = 1 | \mathbf{h}, \theta) = \sigma\left(a\_i + \sum\_{j} w\_{ij} h\_j\right) \\ P(h\_j = 1 | \mathbf{v}, \theta) = \sigma\left(b\_j + \sum\_{i} w\_{ij} v\_i\right) \end{array} \tag{4}$$

where *σ*(•) is the logistic function defined as

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \tag{5}$$
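
The conditional independence of Equations (4) and (5) means each layer is computed with one matrix product and an element-wise logistic function; a minimal sketch with illustrative (non-learned) parameter values:

```python
import numpy as np

def sigmoid(x):
    # Logistic function of Equation (5)
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigma(b_j + sum_i w_ij v_i), Equation (4)."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) = sigma(a_i + sum_j w_ij h_j), Equation (4)."""
    return sigmoid(a + W @ h)

# Toy values (illustrative assumptions)
W = np.array([[0.5, -0.1],
              [0.2,  0.4]])
a = np.zeros(2)
b = np.zeros(2)
v = np.array([1.0, 0.0])
print(p_h_given_v(v, W, b))  # independent Bernoulli means for each h_j
```

With zero biases and a zero input, every conditional probability reduces to σ(0) = 0.5, which is a convenient sanity check on the mapping.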

Overall, an RBM involves five quantities: **h**, **v**, **w**, **a**, and **b**, where **w**, **a**, and **b** are obtained via learning, **v** is the input, and **h** is the output. **w**, **a**, and **b** can be learned and updated via the contrastive divergence (CD) method as

$$w\_{ij} \gets w\_{ij} + \lambda \left( P(h\_j|v\_i)v\_i - P(h\_j^r|v\_i^r)v\_i^r \right) \tag{6}$$

$$a\_i \gets a\_i + \lambda (\upsilon\_i - \upsilon\_i^r) \tag{7}$$

$$b\_j \gets b\_j + \lambda (h\_j - h\_j^r) \tag{8}$$

where *λ* denotes the learning rate, *P*(*h<sup>r</sup><sub>j</sub>* | *v<sup>r</sup><sub>i</sub>*) represents the reconstructed probability distribution, and *v<sup>r</sup><sub>i</sub>* and *h<sup>r</sup><sub>j</sub>* are the reconstructions of visible unit *i* and hidden unit *j*, respectively. Once the states of the hidden units are chosen, the visible units are reconstructed from hidden units sampled via Gibbs sampling. Then, the states of the hidden units are updated from the reconstructed visible units, so that the hidden units capture the features of the reconstruction. The distribution of the visible units thereby approximates the distribution of the real data. The learning ability of an RBM depends on whether the hidden units contain enough information about the input data.
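
A single CD-1 step (one Gibbs half-step, the most common special case of CD) applying Equations (6)–(8) to one binary sample can be sketched as follows; the learning rate and toy dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, lam=0.1):
    """One CD-1 update on a single binary sample v (Equations (6)-(8)).
    A minimal sketch: one Gibbs half-step produces v^r, h^r."""
    ph = sigmoid(b + v @ W)                 # P(h_j = 1 | v)
    h = (rng.random(ph.shape) < ph) * 1.0   # sample hidden states
    vr = sigmoid(a + W @ h)                 # reconstructed visible units v^r
    phr = sigmoid(b + vr @ W)               # P(h_j^r = 1 | v^r)
    W += lam * (np.outer(v, ph) - np.outer(vr, phr))   # Eq. (6)
    a += lam * (v - vr)                                # Eq. (7)
    b += lam * (ph - phr)                              # Eq. (8)
    return W, a, b

W = np.zeros((3, 2)); a = np.zeros(3); b = np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
W, a, b = cd1_update(v, W, a, b)
print(W.shape, a.shape, b.shape)  # (3, 2) (3,) (2,)
```

Using the hidden probabilities rather than their binary samples in the weight update (as above) is a common variance-reduction choice; the paper's Equations (6)–(8) permit either convention.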

#### *2.2. Deep Belief Learning*

The learning ability of a single hidden layer is limited. To capture more comprehensive information from the data, the hidden units of one RBM can be fed as the input (visible units) of another RBM. This layer-by-layer structure, trained in a greedy manner, forms the so-called deep belief network. In this way, a DBN can extract deep features of image data. The structure of a three-layer DBN is depicted in Figure 2.

**Figure 2.** An illustration of three-layer DBN with logistic regression.

Training a DBN consists of two stages: pre-training and fine-tuning. Pre-training is an unsupervised stage that initializes the model so as to improve the efficiency of the subsequent supervised training. Fine-tuning is a supervised stage that adjusts the classifier's predictions to match the ground truth of the data.

#### **3. The Proposed Framework**

To extract more powerful and invariant features, we propose a novel DBN-based hyperspectral classification algorithm built on texture feature enhancement (TFE). A DBN is composed of several layers of latent factors, which can be regarded as the neurons of a neural network. However, the limited training samples in real hyperspectral image classification tasks usually lead to many "dead" (never responding) or "potentially over-tolerant" (always responding) latent factors (neurons) in the trained DBN. Our proposed framework consists of three steps: band grouping and sample band selection, TFE, and DBN-based classification.

#### *3.1. Band Grouping and Sample Band Selection*

Compared to multispectral imagery, hyperspectral imagery has hundreds of spectral bands with relatively narrow bandwidths, so the correlation between spectral bands must be considered. In our framework, we calculate all pairwise correlation coefficients between bands and then utilize the correlations between adjacent bands. The spectral correlation coefficients of the different datasets are depicted in Figure 3.

**Figure 3.** The maps of correlation coefficients of spectral bands in different datasets: (**a**) Indian Pines; (**b**) University of Pavia; and (**c**) Salinas.

We can obtain the correlation coefficient between adjacent bands as:

$$\rho\_{i,j} = \text{corr}(\mathbf{B}\_i, \mathbf{B}\_j) = \text{cov}(\mathbf{B}\_i, \mathbf{B}\_j) / \sqrt{\text{var}(\mathbf{B}\_i)\text{var}(\mathbf{B}\_j)} \tag{9}$$

where cov denotes covariance and var denotes variance; **B**<sub>*i*</sub> and **B**<sub>*j*</sub> represent the *i*-th and *j*-th bands, respectively, with *i* = 1, 2, ... , *L* − 1, where *L* denotes the number of bands of the hyperspectral dataset. Based on Equation (9), the correlation coefficients between adjacent bands in the different datasets are calculated, as shown in Figure 4. The highest correlation coefficient in Indian Pines is 0.9997, and the lowest is 0.0686. The spectral bands of the University of Pavia dataset have strong correlations overall: the highest correlation coefficient is 0.9998 and the lowest is 0.9294. The highest correlation coefficient in Salinas is 0.9999, and the lowest is 0.5856.
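
Equation (9) for adjacent bands amounts to flattening each band and computing a Pearson correlation between neighbors; a minimal sketch on a synthetic cube (the array shapes are illustrative assumptions):

```python
import numpy as np

def adjacent_band_correlations(cube):
    """Correlation coefficients between adjacent bands, Equation (9).
    `cube` is an H x W x L hyperspectral array with L bands."""
    L = cube.shape[-1]
    bands = cube.reshape(-1, L)          # flatten each band to a vector
    rho = np.empty(L - 1)
    for i in range(L - 1):
        rho[i] = np.corrcoef(bands[:, i], bands[:, i + 1])[0, 1]
    return rho

# Synthetic cube: band 1 is a noisy copy of band 0, band 2 is independent.
rng = np.random.default_rng(0)
b0 = rng.random((20, 20))
cube = np.stack([b0,
                 b0 + 0.01 * rng.random((20, 20)),
                 rng.random((20, 20))], axis=-1)
rho = adjacent_band_correlations(cube)
print(rho.shape)  # (2,)
```

As expected, the near-duplicate pair yields a correlation close to 1, while the independent pair yields a value near 0, mirroring the spread observed between the Indian Pines and University of Pavia datasets.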

**Figure 4.** The correlation coefficients of adjacent spectral bands in different datasets: (**a**) Indian Pines; (**b**) University of Pavia; and (**c**) Salinas.

Here, we design an algorithm for grouping bands rationally.

Firstly, calculate the average correlation coefficient of the adjacent bands, denoted as *C̄*, which is utilized as the threshold in the following steps. It can be calculated through:

$$\overline{C} = \frac{1}{L-1} \sum\_{i=1}^{L-1} \rho\_{i,j} \tag{10}$$

where *j* = *i* + 1. If the correlation coefficient between two adjacent bands is greater than *C̄*, the two bands are considered strongly correlated.

*Remote Sens.* **2018**, *10*, 396

Second, search for local minima among the correlation coefficients of adjacent bands, denoted as ρ<sub>min</sub>, where ρ<sub>min</sub> = {*ρ<sub>i,j</sub>* | *ρ<sub>i,j</sub>* ≤ *ρ<sub>i+1,j+1</sub>* and *ρ<sub>i,j</sub>* ≤ *ρ<sub>i−1,j−1</sub>*}. Each element of ρ<sub>min</sub> is then compared with *C̄*. If *ρ<sub>i,j</sub>* ∈ ρ<sub>min</sub> satisfies *ρ<sub>i,j</sub>* < *C̄*, the correlation between the *i*-th and *j*-th bands is lower than the average and is considered weak. The corresponding index pair {*i*, *j*} is then recorded and added to the set ρ<sub>Loc</sub>.

Third, band grouping depends on the stored index pairs in ρ<sub>Loc</sub>. For instance, for the index pair {*i*, *j*}, the *i*-th band is set as the last band of the current band group and the *j*-th band as the first band of the next band group. Based on these rules, all the bands are divided into band groups {**G**<sub>1</sub>, **G**<sub>2</sub>, ··· , **G**<sub>*K*</sub>}.
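
The three steps above (threshold *C̄*, local-minimum search, and splitting at weak boundaries) can be sketched as follows; the 0-indexed convention and the example correlation values are our own illustrative assumptions:

```python
import numpy as np

def group_bands(rho):
    """Split bands into groups at weak-correlation boundaries.
    rho[i] is the correlation between band i and band i+1 (0-indexed).
    A split is made after band i when rho[i] is a local minimum of the
    adjacent-band correlations and falls below the mean threshold C-bar."""
    C_bar = rho.mean()                       # Equation (10)
    L = len(rho) + 1                         # number of bands
    padded = np.concatenate([[np.inf], rho, [np.inf]])
    splits = [i for i in range(len(rho))
              if padded[i + 1] <= padded[i]      # local minimum vs. left
              and padded[i + 1] <= padded[i + 2] # local minimum vs. right
              and rho[i] < C_bar]                # below average threshold
    groups, start = [], 0
    for i in splits:
        groups.append(list(range(start, i + 1)))  # band i ends its group
        start = i + 1                             # band i+1 starts the next
    groups.append(list(range(start, L)))
    return groups

# Example: strong correlations with one clear dip between bands 2 and 3.
rho = np.array([0.95, 0.93, 0.40, 0.94, 0.96])
print(group_bands(rho))  # [[0, 1, 2], [3, 4, 5]]
```

Padding the ends with +∞ ensures the first and last correlation values can also qualify as local minima, matching the boundary behavior implied by the text.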

After dividing all the bands of the hyperspectral dataset into band groups, the sample band with the strongest and clearest texture features is selected from each group.

To extract texture features, the gray-level co-occurrence matrix (GLCM) has been employed successfully. The GLCM [34] is a matrix of frequencies from which second-order statistics of a hyperspectral image can be extracted. The distribution in the matrix depends on the angular and distance relationships between pixels. After the GLCM is created, it can be used to compute various features. We choose the five most commonly used features, listed in Table 1, to select a sample band from each band group. The texture feature score of each band is calculated by Equation (11):

$$T = \sum\_{i=1}^{5} F\_i \tag{11}$$


**Table 1.** Features calculated from the normalized co-occurrence matrix *P*(*i*, *j*).

The sample band in each band group can be selected through:

$$\mathbf{g}\_k = \underset{\mathbf{B}\_{l^k}}{\text{argmax}} \left\{ T\_{\mathbf{B}\_{l^k}} \,\Big|\, \mathbf{B}\_{l^k} \in \mathbf{G}\_k \right\},\tag{12}$$

where **G**<sub>*k*</sub> represents the *k*-th band group of the dataset, *l<sup>k</sup>* ∈ {1, 2, ··· , *N<sup>k</sup>*}, *N<sup>k</sup>* is the number of bands in the *k*-th band group, and **B**<sub>*l<sup>k</sup>*</sub> represents the *l<sup>k</sup>*-th band in the *k*-th band group. Finally, the sample band set is comprised of {**g**<sub>1</sub>, **g**<sub>2</sub>, ··· , **g**<sub>*K*</sub>}.
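
Equations (11) and (12) can be sketched as follows. Since Table 1's exact feature list is not reproduced here, the five summed features below (contrast, energy, homogeneity, correlation, entropy) are an assumed stand-in, and the single-offset GLCM is a simplification:

```python
import numpy as np

def glcm(band, levels=8):
    """Normalized gray-level co-occurrence matrix P(i, j) for horizontal
    neighbors (distance 1, angle 0). `band` is a 2-D array in [0, 1]."""
    q = np.minimum((band * levels).astype(int), levels - 1)  # quantize
    P = np.zeros((levels, levels))
    for r in range(q.shape[0]):
        for c in range(q.shape[1] - 1):
            P[q[r, c], q[r, c + 1]] += 1
    return P / P.sum()

def texture_score(band):
    """T = sum of five GLCM features, in the spirit of Equation (11)."""
    P = glcm(band)
    i, j = np.indices(P.shape)
    mu_i, mu_j = (i * P).sum(), (j * P).sum()
    s_i = np.sqrt(((i - mu_i) ** 2 * P).sum())
    s_j = np.sqrt(((j - mu_j) ** 2 * P).sum())
    contrast = ((i - j) ** 2 * P).sum()
    energy = (P ** 2).sum()
    homogeneity = (P / (1.0 + (i - j) ** 2)).sum()
    correlation = ((i - mu_i) * (j - mu_j) * P).sum() / (s_i * s_j + 1e-12)
    entropy = -(P[P > 0] * np.log(P[P > 0])).sum()
    return contrast + energy + homogeneity + correlation + entropy

def select_sample_band(group):
    """Equation (12): pick the band with the highest texture score."""
    return int(np.argmax([texture_score(b) for b in group]))

rng = np.random.default_rng(0)
group = [np.full((16, 16), 0.5), rng.random((16, 16))]  # flat vs. textured
print(select_sample_band(group))  # 1
```

A perfectly flat band concentrates all co-occurrence mass in one cell and scores low, so the textured band is selected, as intended by the sample-band criterion.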

#### *3.2. Texture Feature Enhancement*

As an effective edge-preserving filter, the guided filter (GF) was proposed by He et al. in 2012 and can enhance image detail. Texture is an important spatial characteristic with a long history in image processing. In this paper, we apply the GF within each band group to enhance the texture features of the image.

The original guided image filtering was designed for gray-scale or color guidance images, and it extends easily to multi-channel images. In our framework, the guidance image is a multi-channel image, denoted as **I**<sup>*M*</sup>, comprised of copies of the band with the strongest texture features in each band group. We assume *q<sup>M</sup>* is a linear transform of **I**<sup>*M*</sup> in a window *ω<sub>k</sub>* centered at pixel *k*, and the multi-channel guided filter model can be expressed as

$$q\_i^M = \left(\mathbf{a}\_k^M\right)^T \mathbf{I}\_i^M + b\_k^M, \quad \forall i \in \omega\_k \tag{13}$$

where **I**<sup>*M*</sup><sub>*i*</sub> is a *C* × 1 vector, *C* is the number of channels of the guidance image, **a**<sup>*M*</sup><sub>*k*</sub> is a *C* × 1 coefficient vector, and *q<sup>M</sup><sub>i</sub>* and *b<sup>M</sup><sub>k</sub>* are scalars. The guided filter for a multi-channel guidance image becomes

$$\begin{array}{ll} \mathbf{a}\_{k}^{M} = \left(\boldsymbol{\Sigma}\_{k} + \boldsymbol{\varepsilon}\mathbf{U}\right)^{-1} \left(\frac{1}{|\boldsymbol{\omega}|} \sum\_{i \in \omega\_{k}} \mathbf{I}\_{i}^{M} p\_{i}^{M} - \mu\_{k} \overline{p\_{k}^{M}}\right) \\ \boldsymbol{b}\_{k}^{M} = \overline{p\_{k}^{M}} - \left(\mathbf{a}\_{k}^{M}\right)^{T} \boldsymbol{\mu}\_{k} \\ \boldsymbol{q}\_{i}^{M} = \left(\overline{\mathbf{a}\_{i}^{M}}\right)^{T} \mathbf{I}\_{i}^{M} + \overline{b\_{i}^{M}} \end{array} \tag{14}$$

where Σ<sub>*k*</sub> is the *C* × *C* covariance matrix of **I**<sup>*M*</sup> in *ω<sub>k</sub>*, **U** is a *C* × *C* identity matrix, *p<sup>M</sup>* denotes the filtering input image, which is given beforehand according to the application, *μ<sub>k</sub>* is the mean of **I**<sup>*M*</sup> in *ω<sub>k</sub>*, *p̄<sup>M</sup><sub>k</sub>* is the mean of *p<sup>M</sup>* in *ω<sub>k</sub>*, and |*ω*| is the number of pixels in *ω<sub>k</sub>*.
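
A minimal numpy sketch of Equations (13) and (14): box means implement the window averages, the per-pixel *C* × *C* systems are solved in a batch, and the window radius and ε are illustrative assumptions:

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window via an integral image,
    with edge replication at the borders."""
    xp = np.pad(x, r, mode='edge')
    c = np.pad(np.cumsum(np.cumsum(xp, axis=0), axis=1), ((1, 0), (1, 0)))
    H, W = x.shape
    n = 2 * r + 1
    return (c[n:n+H, n:n+W] - c[:H, n:n+W] - c[n:n+H, :W] + c[:H, :W]) / n ** 2

def guided_filter_mc(I, p, r=2, eps=1e-4):
    """Guided filter with C-channel guidance I (H x W x C) and scalar
    input p (H x W), following Equations (13) and (14)."""
    H, W, C = I.shape
    mu = np.stack([box_mean(I[..., c], r) for c in range(C)], axis=-1)  # mu_k
    p_bar = box_mean(p, r)                                              # p-bar_k
    cov = np.empty((H, W, C, C))                                        # Sigma_k
    for c1 in range(C):
        for c2 in range(C):
            cov[..., c1, c2] = (box_mean(I[..., c1] * I[..., c2], r)
                                - mu[..., c1] * mu[..., c2])
    cross = np.stack([box_mean(I[..., c] * p, r) - mu[..., c] * p_bar
                      for c in range(C)], axis=-1)
    a = np.linalg.solve(cov + eps * np.eye(C), cross[..., None])[..., 0]  # a_k
    b = p_bar - (a * mu).sum(-1)                                          # b_k
    a_bar = np.stack([box_mean(a[..., c], r) for c in range(C)], axis=-1)
    b_bar = box_mean(b, r)
    return (a_bar * I).sum(-1) + b_bar                                    # q_i

rng = np.random.default_rng(0)
I = rng.random((12, 12, 3))
q = guided_filter_mc(I, I[..., 0], r=2, eps=1e-4)
print(q.shape)  # (12, 12)
```

When the input *p* is itself one channel of the guidance image and ε is small, the filter is nearly an identity, which is a quick way to validate an implementation.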

Then, the extended guided image filtering for multi-channel guidance images is applied to each band group. For instance, each channel of the guidance image **I**<sup>*M*</sup> in Equation (14) for the *k*-th band group **G**<sub>*k*</sub> is a copy of the sample band **g**<sub>*k*</sub> selected previously.

After guided filtering of all groups is completed, the output bands are reassembled into a hyperspectral image cube according to their band numbers. The reconstructed image data with enhanced texture features are thus obtained. Figure 5 illustrates the procedure of band grouping and TFE: once the sample bands with the strongest textures are obtained, the texture-enhanced image data are produced by the GF process.

**Figure 5.** The procedure of band grouping and texture feature enhancement.

#### *3.3. DBN Classification Model*

In this section, a DBN-based framework for hyperspectral classification with feature enhanced data is developed.

Spectral information is the most significant and direct feature and can be used directly for classification. Shallow architectures such as SVM and KNN can extract spectral features, but not deep ones; only a deep architecture can make full use of the texture-enhanced hyperspectral image characteristics. However, because training samples are limited, overfitting often occurs if the network is too deep, so we advocate a compact DBN framework with only two hidden layers (Figure 6).

**Figure 6.** Our proposed DBN network for classification.

The input data consist of training samples that are one-dimensional (1-D) vectors, where each training sample is a pixel collected from the texture-enhanced HSI data. For ease of description, the first hidden layer is denoted as **h**1 and the second as **h**2. The first layer is learned to extract features from the input data, and the learned features are preserved in **h**1. Then, to pursue more refined and abstract features, the features contained in **h**1 serve as the visible data of the second layer, and **h**2 keeps the refined features. This procedure is generally called recursive greedy learning for pre-training a DBN.

In practice, learning each layer is often performed through the *n*-step CD, and the weights are updated using Equations (6)–(8).
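The greedy layer-wise pre-training described above, with CD-1 as the per-layer learner, can be sketched as follows; the layer sizes, learning rate, and epoch count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lam=0.1, epochs=5):
    """Minimal CD-1 training of one RBM layer (a sketch, not the exact
    training schedule of the paper)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible); b = np.zeros(n_hidden)
    for _ in range(epochs):
        for v in data:
            ph = sigmoid(b + v @ W)
            h = (rng.random(n_hidden) < ph) * 1.0
            vr = sigmoid(a + W @ h)
            phr = sigmoid(b + vr @ W)
            W += lam * (np.outer(v, ph) - np.outer(vr, phr))  # Eq. (6)
            a += lam * (v - vr)                               # Eq. (7)
            b += lam * (ph - phr)                             # Eq. (8)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: the hidden activations of each
    trained RBM become the visible data of the next (Section 2.2)."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        layers.append((W, a, b))
        x = sigmoid(b + x @ W)   # h1 features feed the next RBM
    return layers, x

data = (rng.random((30, 12)) > 0.5).astype(float)
layers, features = pretrain_dbn(data, [8, 4])   # two hidden layers h1, h2
print(features.shape)  # (30, 4)
```

The returned `features` are the **h**2 activations that the Softmax layer operates on during fine-tuning.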

To fine-tune the DBN and accomplish classification, a Softmax layer is added to the end of the network.

Now, let **X** = {**x**<sub>1</sub>, **x**<sub>2</sub>, ... , **x**<sub>*K*</sub>} be a set of training samples and **Y** = {*y*<sub>1</sub>, *y*<sub>2</sub>, ... , *y<sub>K</sub>*} be the corresponding labels, where **x**<sub>*k*</sub> = [*x*<sub>*k*1</sub>, *x*<sub>*k*2</sub>, ... , *x<sub>kL</sub>*]<sup>*T*</sup> is the spectral signature of the *k*-th sample with *L* bands. Utilizing the maximum likelihood method, the objective function can be written as

$$\mathbb{C}(\boldsymbol{\Theta}) = -\sum\_{k=1}^{K} \log P(y\_k|\mathbf{x}\_k, \boldsymbol{\Theta}) = -\sum\_{k=1}^{K} \log \left(S\_{y\_k}(\mathbf{x}\_k, \boldsymbol{\Theta})\right) \tag{15}$$

where *K* is the number of training samples, *P*(*y<sub>k</sub>*|**x**<sub>*k*</sub>, Θ) is the distribution of *y<sub>k</sub>* given **x**<sub>*k*</sub> with the parameters Θ of the Softmax layer, and *S<sub>y<sub>k</sub></sub>*(**x**<sub>*k*</sub>, Θ) denotes the output of the Softmax layer for the *k*-th training sample, that is

$$S\_{y\_k}(\mathbf{x}\_k, \boldsymbol{\Theta}) = \frac{\exp\{-\sum\_{m=1}^{M} \delta(y\_k = m) \boldsymbol{\Theta}\_m^T \mathbf{h}^{H\_L}\}}{\sum\_{n=1}^{M} \exp\{-\boldsymbol{\Theta}\_n^T \mathbf{h}^{H\_L}\}},\tag{16}$$

where *H<sub>L</sub>* is the number of hidden layers, set to 2 in our framework, and *M* is the number of classes. Θ<sub>*m*</sub> and Θ<sub>*n*</sub> are the parameter vectors of the *m*-th and *n*-th units of the Softmax layer, respectively. **h**<sup>*H<sub>L</sub>*</sup> is the output of the *H<sub>L</sub>*-th hidden layer, computed from the input data and the weights and biases from the first layer to the *H<sub>L</sub>*-th hidden layer. The objective function is optimized with the stochastic gradient descent (SGD) algorithm. Finally, the label of each test pixel is determined from the weights and biases obtained in the aforementioned steps.
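
The Softmax output and the negative log-likelihood of Equations (15) and (16) can be sketched as follows; we use the standard positive-sign convention (equivalent to Equation (16) up to the sign of Θ), and the dimensions and parameter values are illustrative assumptions:

```python
import numpy as np

def softmax_outputs(h, Theta):
    """Class probabilities of the Softmax layer given the last hidden
    activations h (K x D) and parameters Theta (M x D), cf. Equation (16)."""
    scores = h @ Theta.T                         # Theta_m^T h for each class m
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def nll(h, y, Theta):
    """Objective of Equation (15): negative log-likelihood of the labels."""
    S = softmax_outputs(h, Theta)
    return -np.log(S[np.arange(len(y)), y]).sum()

rng = np.random.default_rng(0)
h = rng.random((5, 4))      # outputs of the H_L-th hidden layer, K = 5 samples
y = np.array([0, 1, 2, 0, 1])
Theta = np.zeros((3, 4))    # M = 3 classes; uniform predictions initially
print(round(nll(h, y, Theta), 4))  # 5.4931, i.e. 5 * ln(3)
```

With zero parameters, every class receives probability 1/*M*, so the objective equals *K* ln *M*; SGD then decreases this value by following the gradient of the log-likelihood.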
