1. Introduction
Hyperspectral images (HSIs) record the reflectance of electromagnetic waves over a wide spectral range, from visible light to near-infrared or even far-infrared [1,2]. This rich spectral information allows various ground objects to be discriminated from HSIs. As a result, HSIs are widely used in astronomy [3], agriculture [4], biomedical imaging [5], geosciences [6] and military surveillance [7]. However, the abundant features in HSIs also introduce significant redundancy. When traditional classification algorithms are used to determine the class of each pixel in HSIs, the curse of dimensionality, the so-called "Hughes Phenomenon", can occur [8]. Chang et al. found that up to 94% of the spectral bands can be discarded without affecting the classification accuracy [9]. Therefore, Dimensionality Reduction (DR), a pre-processing procedure that tries to discover low-dimensional latent features from high-dimensional HSIs, plays a vital role in HSI data analysis and classification.
In general, DR methods can be divided into two categories: feature selection and feature extraction. The former attempts to select a small subset of bands from the original bands based on some criteria, while the latter tries to find a low-dimensional subspace embedded in the high-dimensional observations [10]. As reviewed in [11], searching for optimal bands among the large number of possible band combinations makes feature selection prone to suboptimal results, so in this paper we focus on feature extraction based DR methods for HSIs rather than feature selection.
A variety of feature extraction based DR models have been introduced for HSI data analysis over the past decades. They can be roughly divided into two categories: unsupervised and supervised DR techniques. Unsupervised DR methods try to find low-dimensional representations that preserve the intrinsic structure of the high-dimensional observations without using labels, while supervised methods make use of the available labels to find low-dimensional and discriminative features. The most representative unsupervised DR algorithm is Principal Component Analysis (PCA), which uses a linear model to project the observed data into a low-dimensional space with maximal variance [12,13]. Based on PCA, various extensions have been proposed, such as Probabilistic PCA (PPCA), Robust PCA (RPCA), Sparse PCA and Tensor PCA [11,14,15,16,17,18]. However, all of the aforementioned algorithms are linear DR methods. When dealing with nonlinear structures embedded in high-dimensional HSI data, PCA and its linear extensions may fail to provide satisfactory performance. Therefore, many nonlinear DR methods have been introduced, among which manifold learning based DR models have been widely employed in HSI data analysis [19,20,21].
Representative manifold learning based DR algorithms include Isometric Mapping (ISOMAP) [20], Locally Linear Embedding (LLE) [19], Laplacian Eigenmaps (LE) [22], Local Tangent Space Alignment (LTSA) [23], etc. The idea behind these algorithms is to assume that the data lie along a low-dimensional manifold embedded in a high-dimensional Euclidean space, and to uncover this manifold structure [24] with different criteria. For example, ISOMAP, an extension of Multi-dimensional Scaling (MDS) [25], seeks a low-dimensional embedding that preserves the geodesic distances between all pairs of points. In LLE, each sample is reconstructed by a linear combination of its neighbors, and low-dimensional representations that preserve these linear reconstruction relationships are then solved for. LE utilizes a similarity graph to preserve the neighbor relationships of pairwise points in the low-dimensional space. LTSA models the local geometry via the tangent space to learn the low-dimensional embedding. However, most manifold learning models suffer from the so-called out-of-sample problem [26]: they cannot directly produce the low-dimensional representation of a new testing sample. An effective solution to this problem is to add a linear mapping that projects observed samples onto the low-dimensional subspace. For instance, Locality Preserving Projections (LPP) [27], Neighborhood Preserving Embedding (NPE) [28] and Linear Local Tangent Space Alignment (LLTSA) [29] are the linear extensions of LE, LLE and LTSA, respectively. In [30], a Graph Embedding (GE) framework was proposed to unify these manifold learning methods on the basis of geometry theory. Recently, representation-based algorithms have also been introduced into the GE framework to construct various similarity graphs [31]. For example, Sparse Representation (SR), Collaborative Representation (CR) and Low Rank Representation (LRR) [32] are utilized to construct the sparse graph ($\ell_1$ graph), collaborative graph ($\ell_2$ graph) and low-rank graph, leading to Sparsity Preserving Projection (SPP) [33], Collaborative Representation based Projection (CRP) [34] and Low Rank Preserving Projections (LRPP) [35], respectively.
Nevertheless, the aforementioned algorithms are all unsupervised DR models, which means that the extra labels available in HSI data are not utilized. To take advantage of this label information, unsupervised DR models can be extended to supervised versions, which can improve the discriminative power of DR models [24]. In this line, Linear Discriminant Analysis (LDA), the most well-known supervised DR model, attempts to improve class separability by maximizing the distance between heterogeneous samples while minimizing the distance between homogeneous samples [36]. However, LDA can extract at most $c-1$ features, with $c$ being the number of label classes. Thus, Nonparametric Weighted Feature Extraction (NWFE) was proposed to tackle this problem by using weighted means to calculate nonparametric scatter matrices, which can yield more than $c-1$ features [37]. Other related works, including Regularized LDA (RLDA) [38], Modified Fisher's LDA (MFLDA) [39] and Supervised PPCA (SPPCA) [11,40], have been introduced for supervised HSI feature extraction.
Apparently, the above supervised linear DR models may fail to discover the nonlinear geometric structure in HSI data, resulting in unsatisfactory performance. Therefore, many supervised nonlinear DR methods have been introduced to uncover the complex geometric structure embedded in high-dimensional data. For example, Local Fisher Discriminant Analysis (LFDA) effectively combines the advantages of Fisher Discriminant Analysis (FDA) [41] and LPP by simultaneously maximizing the between-class separability and minimizing the within-class distance [42,43]. Local Discriminant Embedding (LDE) extends the concept of LDA to perform local discrimination [44,45]. Low-rank Discriminant Embedding (LRDE) learns the latent embedding space by maximizing the empirical likelihood while preserving the geometric structure [46]. Other related techniques include the Discriminative Gaussian Process Latent Variable Model (DGPLVM) [47], Locally Weighted Discriminant Analysis (LWDA) [48], Multi-Feature Manifold Discriminant Analysis (MFMDA) [49], etc. Similarly, representation based algorithms have also been introduced into the supervised DR framework, such as Sparse Graph-based Discriminant Analysis (SGDA) [50], Weighted Sparse Graph-based Discriminant Analysis (WSGDA) [51], Collaborative Graph-based Discriminant Analysis (CGDA) [52], Laplacian regularized CGDA (LapCGDA) [53], Discriminant Analysis with Graph Learning (DAGL) [54], Graph-based Discriminant Analysis with Spectral Similarity (GDA-SS) [55], Local Geometric Structure Fisher Analysis (LGSFA) [56], Sparse and Low-Rank Graph-based Discriminant Analysis (SLGDA) [57], Kernel CGDA (KCGDA) [53], Laplacian Regularized Spatial-aware CGDA (LapSaCGDA) [58], etc. A good survey of these discriminant analysis models can be found in [31].
Although graph embedding based DR methods are effective for extracting discriminative spectral features from HSI data, these models are significantly affected by two factors: the similarity graphs and the model parameters. The similarity graph is the key to all graph embedding models, while their performance largely relies on the manual setting of model parameters, which is time-consuming and inefficient. Motivated by the nonparametric Gaussian Process (GP) model [59], in this paper we construct the similarity graphs with a GP. A Gaussian process is a continuous stochastic process that defines a probability distribution over functions. With various covariance/kernel functions, a GP can nonparametrically model complex and nonlinear mappings. Furthermore, all parameters of the covariance function, typically termed hyperparameters, can be learned automatically in a GP. Inspired by these benefits, we learn the similarity matrix in the graph embedding framework with a GP. Specifically, the learned covariance matrix in the GP is taken as the similarity graph, giving rise to Gaussian Process Graph based Discriminant Analysis (GPGDA), which learns more effective similarity graphs and avoids manual parameter tuning compared to existing algorithms. Experimental results on three HSI datasets demonstrate that the proposed GPGDA can effectively improve the classification accuracy without time-consuming tuning of model parameters.
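To make these two properties concrete, the following minimal sketch (illustrative only, not our implementation) uses scikit-learn's GP regressor on toy data: the kernel hyperparameters are fitted automatically by maximizing the log marginal likelihood, and the learned covariance matrix can be read off directly as a pairwise similarity graph.

```python
# Toy illustration: a GP learns its kernel hyperparameters automatically,
# and the fitted covariance matrix measures pairwise sample similarity.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                    # 50 "pixels", 10 spectral bands
y = rng.integers(0, 2, size=50).astype(float)    # toy binary targets

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

print(gpr.kernel_)        # optimized hyperparameters -- no manual tuning
K = gpr.kernel_(X)        # learned covariance matrix = similarity graph, (50, 50)
```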
The rest of the paper is organized as follows. In Section 2, we briefly review the related work, including the Gaussian Process (GP) and the Graph-Embedding Discriminant Analysis framework. The proposed Gaussian Process Graph-based Discriminant Analysis (GPGDA) method is introduced in Section 3. Then, three HSI datasets are used to evaluate the effectiveness of the proposed GPGDA in Section 4. Finally, a brief summary is given in Section 5.
3. The Proposed Method
To learn the similarity graphs in the Graph-Embedding Discriminant Analysis (GEDA) framework effectively and without time-consuming parameter tuning, we propose the Gaussian Process Graph based Discriminant Analysis (GPGDA) method. GPGDA makes use of the nonparametric, nonlinear GP model to learn the similarity/affinity matrix adaptively.
The flowchart of the proposed GPGDA method is shown in Figure 1. First, the HSI data are randomly divided into training and testing sets. Then, we construct the block-diagonal similarity matrix. Inspired by CGDA, we use GPR to model the training samples of each class. Specifically, when handling the training data of the $l$th class, their labels are set to 1 while the rest of the training data are labeled 0, which implicitly enforces interclass separability when learning the similarity graph of each class. Thus, the learned similarity matrix should be more effective than those from CGDA, KCGDA and other related algorithms. Subsequently, the similarity matrices of all classes are assembled into a block-diagonal matrix, yielding the complete similarity matrix for the GEDA framework. Finally, the projection matrix is obtained by solving a generalized eigenvalue decomposition problem, and the dimension-reduced testing data are computed accordingly. To further evaluate the proposed method, the dimension-reduced training and testing data are fed to different classifiers to obtain the prediction results.
From Section 2.1, we know that the covariance/kernel function is vital to a GP, since the corresponding kernel matrix measures the similarities between all pairs of samples. In view of this, the kernel matrix in a GP can be used to represent the similarity graph in the GEDA framework. Because we want the intrinsic graph to reflect class-label information, it is eventually expressed in the block-diagonal form of Equation (8). Here, we straightforwardly use GPR rather than GPC to model the high-dimensional training data with discrete labels, because the time-consuming approximation methods required by GPC would increase the model complexity; moreover, GPR is sufficient to model the class-specific training data.
Given a dataset $\{\mathbf{X}, \mathbf{y}\} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, in which $\mathbf{x}_i \in \mathbb{R}^{D}$ is a high-dimensional HSI sample and $y_i \in \{1, \dots, C\}$ is the corresponding label, we first set the labels of the training samples from the $l$th class to 1 and the rest to 0, in order to learn the similarity matrix efficiently. The new binary labels are denoted by $\tilde{\mathbf{y}}^{(l)} = [\tilde{y}_1^{(l)}, \dots, \tilde{y}_N^{(l)}]^\top$. Then, we model the mapping from $\mathbf{X}$ to $\tilde{\mathbf{y}}^{(l)}$ with nonlinear GPR,

$$\tilde{y}_i^{(l)} = f(\mathbf{x}_i) + \varepsilon, \quad f \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

The optimal hyperparameters $\boldsymbol{\theta}$ of the chosen kernel function can be estimated automatically by optimizing the GPR objective function in Equation (1) with gradient-based optimization algorithms. With the optimal hyperparameters, we obtain the corresponding kernel matrix as follows,
$$\mathbf{K} = \begin{bmatrix} k(\mathbf{X}_l, \mathbf{X}_l) & k(\mathbf{X}_l, \mathbf{X}_{\bar{l}}) \\ k(\mathbf{X}_{\bar{l}}, \mathbf{X}_l) & k(\mathbf{X}_{\bar{l}}, \mathbf{X}_{\bar{l}}) \end{bmatrix},$$

where $\mathbf{X}_l$ denotes the $n_l$ training samples from the $l$th class, and $\mathbf{X}_{\bar{l}}$ denotes the training data from the other categories.
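For illustration, this per-class construction can be sketched as follows, with scikit-learn's GPR standing in for our implementation; the RBF-plus-noise kernel and the helper name class_similarity are illustrative choices, not fixed by the method.

```python
# Sketch: for class l, fit a GPR with binary targets (1 for class l, 0 otherwise)
# on all training samples, then keep the learned kernel block over class l.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def class_similarity(X_train, y_train, l):
    """Return W^(l): the learned kernel block over the samples of class l."""
    y_bin = (y_train == l).astype(float)               # binary relabeling
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_bin)
    idx = np.flatnonzero(y_train == l)                 # class-l sample indices
    return gpr.kernel_(X_train[idx])                   # n_l x n_l block K(X_l, X_l)
```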
At this point, we only care about the training samples from the $l$th class, which have been labeled 1, so we take the $n_l \times n_l$ block of the kernel matrix corresponding to these samples to form the class-specific similarity matrix $\mathbf{W}^{(l)} = k(\mathbf{X}_l, \mathbf{X}_l)$. Once all the class-specific similarity matrices $\mathbf{W}^{(1)}, \dots, \mathbf{W}^{(C)}$ have been obtained by repeating this procedure, the block-diagonal matrix $\mathbf{W}$ in Equation (8) can be simply constructed as

$$\mathbf{W} = \begin{bmatrix} \mathbf{W}^{(1)} & & \\ & \ddots & \\ & & \mathbf{W}^{(C)} \end{bmatrix}.$$
Finally, based on the GEDA framework, the optimal projection matrix $\mathbf{P}$ is easily obtained by solving the eigenvalue decomposition in Equation (6).
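Since Equation (6) appears in Section 2, only a generic graph-embedding form is assumed in the sketch below: the smallest generalized eigenvectors of $(\mathbf{X}\mathbf{L}\mathbf{X}^\top, \mathbf{X}\mathbf{D}\mathbf{X}^\top)$ with Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$, an LPP-style constraint that may differ in detail from the exact form of Equation (6).

```python
# Sketch: solve the generalized eigenproblem (X L X^T) p = lambda (X D X^T) p
# and keep the d eigenvectors with the smallest eigenvalues as the projection P.
import numpy as np
from scipy.linalg import eigh

def projection_matrix(X, W, d):
    """X: (D, N) data matrix with samples as columns; W: (N, N) graph; d: target dim."""
    D_mat = np.diag(W.sum(axis=1))                     # degree matrix
    L = D_mat - W                                      # graph Laplacian
    A = X @ L @ X.T
    B = X @ D_mat @ X.T + 1e-6 * np.eye(X.shape[0])    # small ridge for stability
    eigvals, eigvecs = eigh(A, B)                      # ascending eigenvalues
    return eigvecs[:, :d]                              # projection matrix P
```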
The complete GPGDA algorithm is outlined in Algorithm 1. To boost performance, we first preprocess the HSI data with simple average filtering.

Algorithm 1 GPGDA for HSIs dimensionality reduction and classification.
Input: High-dimensional training samples $\mathbf{X}$, training ground truth $\mathbf{Y}$, pre-fixed latent dimensionality $d$, testing pixels $\mathbf{X}_{test}$ and testing ground truth $\mathbf{Y}_{test}$.
Output: $s = \{acc, \mathbf{P}\}$.
1: Preprocess all the training and testing data by average filtering;
2: Estimate the hyperparameter set $\boldsymbol{\theta}$ of the kernel function by Equation (1);
3: Evaluate the similarity matrix $\mathbf{W}$ by Equations (11) and (12);
4: Evaluate the optimal projection matrix $\mathbf{P}$ by solving the eigenvalue decomposition in Equation (6);
5: Evaluate the low-dimensional features of all the training and testing data by $\mathbf{Z} = \mathbf{P}^{\top}\mathbf{X}$;
6: Perform KNN/SVM in the low-dimensional feature space and obtain the classification accuracy $acc$;
7: return $s$
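For illustration, a toy end-to-end run of Algorithm 1 could look as follows, reusing the build_W and projection_matrix helpers sketched above; the average filtering step is omitted because it needs the spatial layout of the HSI cube, and the random data here are placeholders rather than the datasets of Section 4.

```python
# Toy walk-through of Algorithm 1: graph construction, projection, classification.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 20)); y_train = np.repeat([0, 1, 2], 20)
X_test  = rng.normal(size=(30, 20)); y_test  = np.repeat([0, 1, 2], 10)

W = build_W(X_train, y_train, classes=[0, 1, 2])   # steps 2-3 (samples sorted by class)
P = projection_matrix(X_train.T, W, d=5)           # step 4
Z_train, Z_test = X_train @ P, X_test @ P          # step 5: z = P^T x per sample
acc = KNeighborsClassifier(n_neighbors=3).fit(Z_train, y_train).score(Z_test, y_test)
print(f"KNN accuracy in the reduced space: {acc:.3f}")   # steps 6-7
```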
As for the model complexity, since only small-scale training data are considered, we do not make use of approximation methods such as the Fully Independent Training Conditional (FITC) model [59] for GPR. Therefore, the time complexity of the proposed GPGDA is $\mathcal{O}(C\,n_{max}^{3})$, where $C$ is the number of discrete classes and $n_{max}$ is the maximum number of samples in each class. By comparison, other discriminant analysis based methods such as CGDA and LapCGDA also have $\mathcal{O}(C\,n_{max}^{3})$ complexity because of the matrix inversion operations they involve. Thus, the proposed GPGDA does not increase the model complexity theoretically.