#### **1. Introduction**

Hyperspectral (HS) image classification suffers from a variety of difficulties, such as high dimensionality, limited or unbalanced training samples, spectral variability, and mixed pixels. It is well known that increasing data dimensionality and high redundancy between features can cause problems during data analysis, for example, in the context of supervised classification. A considerable amount of literature has been published on overcoming these challenges and performing hyperspectral image classification effectively [1]. Machine learning techniques such as artificial neural networks (ANNs) [2], support vector machines (SVMs) [3], multinomial logistic regression [4], active learning, and semi-supervised learning [5], as well as other methods such as hyperspectral unmixing [6], object-oriented classification [7], and multiple classifier systems [8], have been widely investigated in recent years.

Multiple classifier system (MCS), also known as classifier ensemble or ensemble learning (EL) in the machine learning field, is a popular strategy for improving the classification performance of hyperspectral images by combining the predictions of multiple classifiers, thereby reducing the dependence on the performance of a single classifier [8–11]. The concept of MCS does not refer to a specific algorithm but to the idea of combining outputs from more than one classifier to enhance classification accuracy [12]. These outputs may come from different variants of the same classifier or from different classifiers trained on the same or different training samples. Previous studies have demonstrated both theoretically and experimentally that one of the main reasons for the success of ensembles is the diversity among the individual learners (namely, the base classifiers) [13], because combining similar classification results would not further improve the accuracy.

MCSs have been widely applied to HS remote sensing image classification. Two approaches for constructing classifier ensembles, bagging and boosting [14,15], are perceived as "classic", and numerous algorithms have since been derived from them. Bagging creates many classifiers, with each base learner trained on a new bootstrapped training data set [16]. Boosting retrains iteratively and concentrates on the difficult samples, with the goal of correctly classifying these samples in the next iteration [17,18]. Ho [19] proposed random subspace ensembles, which use random subsets of features instead of the entire feature set for each individual classifier. The rationale of the random subspace method is to break a complex high-dimensional problem down into several lower-dimensional problems, thereby alleviating the curse of dimensionality. By integrating the bagging and random subspace approaches, Breiman [20] proposed the well-known random forest (RF) algorithm [21,22]. The characteristics of RF, including reasonable computational cost, inherent support for parallelism, highly accurate predictions, and the ability to handle a very large number of input variables without overfitting, make it a popular and promising classification algorithm for remote sensing data [23–25]. Generally, the decision tree (DT) is used as the base classifier in ensemble learning because of its high computational efficiency, easy implementation, and sensitivity to slight changes in the data (which promotes diversity among the ensemble members). Recently, some researchers have incorporated several prevalent machine learning algorithms into ensemble learning. Gurram and Kwon [26] proposed a sparse kernel-based support vector machine (SVM) ensemble algorithm that yields better performance than an SVM trained by cross-validation. Samat et al. [27] proposed Bagging-based and Adaboost-based extreme learning machines to overcome the drawbacks of the input parameter randomness of traditional extreme learning machines. For a more detailed description of EL, refer to [28,29].
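As an illustration of these ensemble strategies, the following minimal scikit-learn sketch contrasts bagging, the random subspace method, and random forest; the synthetic dataset and all parameter values are illustrative assumptions, not the settings used in this paper.

```python
# Illustrative sketch: bagging, random subspace, and random forest on
# synthetic data (dataset and parameters are assumptions for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=10)

# Bagging: each tree is trained on a bootstrap sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True).fit(X, y)

# Random subspace: each tree sees a random subset of the features instead
subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                             bootstrap=False, max_features=0.5).fit(X, y)

# Random forest: bootstrap samples plus random feature selection per split
rf = RandomForestClassifier(n_estimators=50).fit(X, y)
```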

Rodriguez and Kuncheva [30] proposed a new ensemble classifier called rotation forest (RoF). By applying feature extraction (i.e., principal component analysis, PCA) to random feature subspaces, RoF greatly promotes the diversity and accuracy of the classifiers. Thereafter, several improved algorithms were proposed based on the idea of RoF, for example, the Anticipative Hybrid Extreme Rotation Forest [31] and rotation random forest with kernel PCA (RoRF-KPCA) [32]. Chen et al. [33] proposed combining rotation forest with multi-scale segmentation for hyperspectral data classification, incorporating spatial information to generate classification maps with homogeneous regions.

Numerous studies show that RoF surpasses conventional RF due to the high diversity in training samples and features. Nevertheless, it is well documented in the literature that PCA is not particularly suitable for feature extraction (FE) in classification because it does not include discriminative information when calculating the optimal rotation of the axes [30,34,35]. Although the authors explain that PCA is also valuable as a diversifying heuristic, better classification results can be expected if we try to find good class-discriminative directions. Therefore, in this paper, we present an improved ensemble learning method that uses a semi-supervised feature extraction technique instead of PCA during the "rotation" process of the classical RoF approach. The proposed algorithm, named semi-supervised rotation forest (SSRoF), applies the semi-supervised local discriminant analysis (SLDA) FE method, proposed in our previous work [36], to take full advantage of both class separability and local neighbor information, with the aim of finding better rotation directions. In addition, to further enhance the diversity of features, we propose a weighted form of SLDA, which can balance the contributions of labeled and unlabeled samples. The main contributions of this paper are as follows: (1) an exploration of the benefit of unlabeled samples in conventional ensemble learning methods; (2) an adjustment of the previous SLDA technique to a weighted generalized eigenvalue problem; (3) the construction of an ensemble of classifiers in which the weights can be randomly selected, thereby reducing the human effort required to determine the optimal parameters.

The remainder of this paper is organized as follows. Section 2 describes the study data sets and elaborates on the proposed semi-supervised rotation forest algorithm. For better understanding, the SLDA feature extraction method is also briefly introduced. Section 3 reports the experiments and results. Finally, conclusions are drawn in Section 4.

#### **2. Materials and Methodology**

In this section, we first introduce the experimental data sets and then elaborate on the proposed ensemble learning algorithm.

#### *2.1. Study Data Sets*

The experimental data sets comprise four HS images acquired by different sensors at different resolutions. Each HS image is accompanied by a co-registered ground-truth image.


#### *2.2. Weighted Semi-Supervised Local Discriminant Analysis*

Semi-supervised local discriminant analysis is a semi-supervised feature extraction method that has been applied to hyperspectral image classification. It combines a supervised FE method, local Fisher discriminant analysis, with an unsupervised FE method, neighborhood preserving embedding, and thus attempts to discover the local discriminative information of the data while preserving the local neighbor information [36]. Compared with other typical semi-supervised FE methods, SLDA focuses more on the exploration of local information and gives a more accurate description of the distribution of samples. For better illustration, we first briefly review the underlying feature extraction methods.

**Figure 1.** Experimental hyperspectral and corresponding ground truth images.

Let $x_i \in \mathbb{R}^d$ be a $d$-dimensional sample vector, and let $X = \{x_i\}_{i=1}^{n}$ be the matrix of $n$ samples. $Z = T^{\top} X$ ($Z \in \mathbb{R}^{r \times n}$) is the low-dimensional representation of the sample matrix, where $T \in \mathbb{R}^{d \times r}$ is the transformation matrix and $\top$ denotes the transpose.

Many dimensionality reduction techniques developed so far involve an optimization problem of the following form [40]:

$$T = \operatorname*{argmax}_{T} \left[ \frac{\left| T^{\top} S^{b} T \right|}{\left| T^{\top} S^{w} T \right|} \right] \tag{1}$$

Generally speaking, $S^b$ (respectively $S^w$) corresponds to the quantity that we want to increase (respectively decrease), for example, the between-class scatter (respectively the within-class scatter). Equation (1) is equivalent to solving the following generalized eigenvalue problem:

$$\mathbf{S}^{b}\boldsymbol{\varphi} = \lambda \mathbf{S}^{w}\boldsymbol{\varphi} \tag{2}$$

where $\{\varphi_k\}_{k=1}^{d}$ are the generalized eigenvectors associated with the generalized eigenvalues $\{\lambda_k\}_{k=1}^{d}$ ($\lambda_1 > \lambda_2 > \dots > \lambda_d$), and $T = \{\varphi_k\}_{k=1}^{r}$ is composed of the first $r$ eigenvectors, corresponding to the largest eigenvalues $\{\lambda_k\}_{k=1}^{r}$. In particular, when $S^b$ is the total scatter matrix of all samples and $S^w = I_{d \times d}$, where $I$ denotes the identity matrix, Equation (1) reduces to the PCA method.
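As a concrete illustration, the following NumPy/SciPy sketch recovers PCA as this special case of Equations (1) and (2); the toy data and the choice of $r$ are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import eigh

# Toy data: n = 200 samples in d = 5 dimensions (columns are samples)
X = np.random.randn(5, 200)

# PCA as the special case of Eq. (1): S^b = total scatter, S^w = identity
Xc = X - X.mean(axis=1, keepdims=True)
Sb = Xc @ Xc.T                   # total scatter matrix (d x d)
Sw = np.eye(5)

# Generalized eigenvalue problem S^b phi = lambda S^w phi (Eq. 2);
# eigh returns eigenvalues in ascending order, so reverse to take the top r
vals, vecs = eigh(Sb, Sw)
r = 2
T = vecs[:, ::-1][:, :r]         # transformation matrix T (d x r)
Z = T.T @ X                      # low-dimensional representation Z (r x n)
```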

#### 2.2.1. Local Fisher Discriminant Analysis (LFDA)

Suppose $y_i = c$, $c \in \{1, 2, \dots, C\}$, is the class label associated with the sample vector $x_i$, where $C$ is the number of classes and $n_c$ is the number of samples in class $c$, so that $\sum_{c=1}^{C} n_c = n$. Let $S^b$ and $S^w$ be the local between-class and within-class scatter matrices, respectively, defined by [41],

$$\begin{aligned} \mathbf{S}^{b} &= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{W}_{i,j}^{b} \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right) \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right)^{\top} \\ \mathbf{S}^{w} &= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{W}_{i,j}^{w} \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right) \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right)^{\top} \end{aligned} \tag{3}$$

then Equation (2) becomes a local Fisher discriminant analysis problem, where $W^b$ and $W^w$ are $n \times n$ matrices,

$$\mathbf{W}_{i,j}^{b} = \begin{cases} A_{i,j}\left(1/n - 1/n_c\right), & \text{if } y_i = y_j = c \\ 1/n, & \text{if } y_i \neq y_j \end{cases} \qquad \mathbf{W}_{i,j}^{w} = \begin{cases} A_{i,j}/n_c, & \text{if } y_i = y_j = c \\ 0, & \text{if } y_i \neq y_j \end{cases} \tag{4}$$

$A_{i,j}$ is the affinity value between $x_i$ and $x_j$: $A_{i,j}$ is large if the two samples are close, and small if they are far apart. The definition of $A_{i,j}$ can be found in [42]. Note that we do not weight the values for sample pairs belonging to different classes. If $A_{i,j} = 1$ for all $i, j$, then LFDA degenerates into classical Fisher discriminant analysis (FDA, also called linear discriminant analysis, LDA) [43]. Thus, LFDA can be regarded as a localized variant of FDA, which overcomes the weakness of LDA with respect to within-class multimodality and outliers.
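The following NumPy sketch constructs $S^b$ and $S^w$ directly from the pairwise sums of Equations (3) and (4); the function name and the convention that samples are stored as columns are our own assumptions (an efficient matrix form is given in Section 2.2.3).

```python
import numpy as np

def lfda_scatter_pairwise(X, y, A):
    """Direct pairwise construction of S^b and S^w following Eqs. (3)-(4).
    X: d x n sample matrix, y: length-n label array, A: n x n affinity matrix."""
    d, n = X.shape
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    counts = {c: int(np.sum(y == c)) for c in np.unique(y)}
    for i in range(n):
        for j in range(n):
            diff = (X[:, i] - X[:, j])[:, None]
            if y[i] == y[j]:
                wb = A[i, j] * (1.0 / n - 1.0 / counts[y[i]])
                ww = A[i, j] / counts[y[i]]
            else:                       # pairs from different classes are unweighted
                wb, ww = 1.0 / n, 0.0
            outer = diff @ diff.T
            Sb += 0.5 * wb * outer
            Sw += 0.5 * ww * outer
    return Sb, Sw
```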

#### 2.2.2. Neighborhood Preserving Embedding (NPE)

NPE is an unsupervised feature extraction method that seeks a projection preserving the local neighborhood structure of the data in the low-dimensional feature space [44]. It can characterize the local structural information of massive unlabeled samples. The first step of NPE is likewise to construct an adjacency graph and then compute the weight matrix $Q$ by solving the following objective function,

$$\begin{aligned} \min\ & \sum_{i} \left\| x_i - \sum_{j} \mathbf{Q}_{ij} x_j \right\|^2 \\ \text{s.t.}\ & \sum_{j} \mathbf{Q}_{ij} = 1 \end{aligned} \tag{5}$$

In other words, for each sample, we use its K-nearest neighbors (KNN) to reconstruct it. Thus, the goal of NPE is to preserve this neighbor relationship in the projected low-dimensional space,

$$\begin{aligned} \min\ & \sum_{i} \left\| z_i - \sum_{j} \mathbf{Q}_{ij} z_j \right\|^2 \\ \text{s.t.}\ & \sum_{j} \mathbf{Q}_{ij} = 1 \end{aligned} \tag{6}$$

where $z_i = T^{\top} x_i$. Then we have

$$\min\ \operatorname{trace}\left[\mathbf{Z}(\mathbf{I}-\mathbf{Q})^{\top}(\mathbf{I}-\mathbf{Q})\mathbf{Z}^{\top}\right] \tag{7}$$

By imposing the following constraint,

$$\sum_{i} z_{i} z_{i}^{\top} = \mathbf{I} \Longrightarrow \mathbf{Z}\mathbf{Z}^{\top} = \mathbf{I} \tag{8}$$

the transformation matrix can be optimized by solving the following generalized eigenvalue problem,

$$\mathbf{X}\mathbf{X}^{\top}\boldsymbol{\varphi} = \lambda\, \mathbf{X}\mathbf{M}\mathbf{X}^{\top}\boldsymbol{\varphi} \tag{9}$$

where $\varphi$ denotes the generalized eigenvectors and $M = (I - Q)^{\top}(I - Q)$.
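A minimal NumPy sketch of this step is given below, assuming Euclidean $k$-nearest neighbors and a small regularizer for the local Gram matrix (both standard choices, but not specified in [44]).

```python
import numpy as np

def npe_weights(X, k=10, reg=1e-3):
    """Reconstruction weights Q of Eq. (5) (LLE-style, Euclidean k-NN assumed)
    and the matrix M = (I - Q)^T (I - Q) of Eq. (9). X: d x n, requires n > k."""
    d, n = X.shape
    Q = np.zeros((n, n))
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]            # k nearest neighbors (skip self)
        Zi = X[:, idx] - X[:, [i]]                  # neighbors centered on x_i, d x k
        C = Zi.T @ Zi                               # local Gram matrix, k x k
        C += reg * np.trace(C) * np.eye(k)          # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        Q[i, idx] = w / w.sum()                     # enforce the sum-to-one constraint
    I = np.eye(n)
    M = (I - Q).T @ (I - Q)
    return Q, M
```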

#### 2.2.3. Weighted SLDA

It has been demonstrated that the performance of LFDA (like that of other supervised dimensionality reduction methods) tends to degrade if only a small number of labeled samples are available [40], while PCA, NPE, and other unsupervised feature extraction (FE) methods generally lose the discriminative information carried by the labeled samples. Thus, combining supervised and unsupervised FE methods [45] is believed to compensate for each other's weaknesses. In this paper, we consider the combination of the aforementioned LFDA and NPE methods. As mentioned above, feature extraction techniques can be transformed into eigenvalue problems; thus, a possible way to combine LFDA and NPE is to merge the above generalized eigenvalue problems as follows [40],

$$\begin{gathered} \beta \mathbf{S}^{b} \boldsymbol{\varphi} = \lambda\, \beta \mathbf{S}^{w} \boldsymbol{\varphi} \\ (1 - \beta)\, \mathbf{X}\mathbf{X}^{\top} \boldsymbol{\varphi} = \lambda\, (1 - \beta)\, \mathbf{X}\mathbf{M}\mathbf{X}^{\top} \boldsymbol{\varphi} \\ \Downarrow \\ \left[ \beta \mathbf{S}^{b} + (1 - \beta)\, \mathbf{X}\mathbf{X}^{\top} \right] \boldsymbol{\varphi} = \lambda \left[ \beta \mathbf{S}^{w} + (1 - \beta)\, \mathbf{X}\mathbf{M}\mathbf{X}^{\top} \right] \boldsymbol{\varphi} \end{gathered} \tag{10}$$

where $\beta \in [0, 1]$ is a trade-off parameter. Calculating the $S^b$ and $S^w$ of LFDA via the pairwise sums in Equation (3) is time-consuming; an efficient implementation can be used according to [41]. Let $S^m$ denote the local mixture scatter matrix,

$$\mathbf{S}^{m} = \mathbf{S}^{b} + \mathbf{S}^{w} = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{W}_{i,j}^{m} \left(\mathbf{x}_{i} - \mathbf{x}_{j}\right) \left(\mathbf{x}_{i} - \mathbf{x}_{j}\right)^{\top} \tag{11}$$

where

$$\mathbf{W}_{i,j}^{m} = \mathbf{W}_{i,j}^{b} + \mathbf{W}_{i,j}^{w} = \begin{cases} A_{i,j}/n, & \text{if } y_i = y_j \\ 1/n, & \text{if } y_i \neq y_j \end{cases} \tag{12}$$
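To see that Equation (12) is consistent with Equation (4), note that for a same-class pair the two weights sum as

$$A_{i,j}\left(\frac{1}{n} - \frac{1}{n_c}\right) + \frac{A_{i,j}}{n_c} = \frac{A_{i,j}}{n},$$

while for a different-class pair the sum is simply $1/n + 0 = 1/n$.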

Since Equation (3) can be expressed as

$$\mathbf{S}^{w} = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{W}_{i,j}^{w}\, \mathbf{x}_{i} \mathbf{x}_{i}^{\top} - \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{W}_{i,j}^{w}\, \mathbf{x}_{i} \mathbf{x}_{j}^{\top} = \mathbf{X}(\mathbf{D}^{w} - \mathbf{W}^{w})\mathbf{X}^{\top} \tag{13}$$

where $D^w$ is the $n \times n$ diagonal matrix with $D_{i,i}^{w} = \sum_{j=1}^{n} W_{i,j}^{w}$. Similarly, $S^m$ can be expressed as

$$\mathbf{S}^{m} = \mathbf{X}(\mathbf{D}^{m} - \mathbf{W}^{m})\mathbf{X}^{\top} \tag{14}$$

where $D^m$ is the $n \times n$ diagonal matrix with $D_{i,i}^{m} = \sum_{j=1}^{n} W_{i,j}^{m}$. Therefore, the generalized eigenvalue problem of LFDA, namely Equation (2), can be rewritten as

$$\mathbf{X}\mathbf{L}^{b}\mathbf{X}^{\top}\boldsymbol{\varphi} = \lambda\, \mathbf{X}\mathbf{L}^{w}\mathbf{X}^{\top}\boldsymbol{\varphi} \tag{15}$$

where $L^w = D^w - W^w$ and $L^b = (D^m - W^m) - (D^w - W^w)$, from which we can see that the eigenvalue problem of LFDA has a form similar to that of NPE, i.e., Equation (9).
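Exploiting this matrix form, the local scatter matrices can be obtained without the double loop over sample pairs. The following hypothetical helper (our own sketch, not code from [41]) returns $L^b$ and $L^w$ given the labels and the affinity matrix $A$:

```python
import numpy as np

def lfda_laplacians(y, A):
    """Matrix-form LFDA (Eqs. 12-15): returns L^b and L^w such that
    S^b = X L^b X^T and S^w = X L^w X^T. y: length-n label array, A: n x n affinity."""
    n = len(y)
    same = (y[:, None] == y[None, :])                  # same-class indicator
    n_c = np.array([np.sum(y == c) for c in y])        # size of each sample's class
    Ww = np.where(same, A / n_c[:, None], 0.0)         # W^w, Eq. (4)
    Wm = np.where(same, A / n, 1.0 / n)                # W^m, Eq. (12)
    Lw = np.diag(Ww.sum(axis=1)) - Ww                  # D^w - W^w
    Lm = np.diag(Wm.sum(axis=1)) - Wm                  # D^m - W^m
    return Lm - Lw, Lw                                 # L^b, L^w
```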

Suppose the available sample vectors are arranged as $X = \left[X^L, X^U\right]$, where $X^L = \{x_i^L\}_{i=1}^{n_l}$ denotes the labeled samples, $X^U = \{x_i^U\}_{i=1}^{n_u}$ denotes the unlabeled samples, and $n = n_l + n_u$ is the total number of available samples. We can define the following matrices

$$\begin{aligned} \mathbf{P}_{1} &= \begin{bmatrix} \mathbf{L}^{b} & \mathbf{0}_{n_l \times n_u} \\ \mathbf{0}_{n_u \times n_l} & \mathbf{0}_{n_u \times n_u} \end{bmatrix}, & \mathbf{P}_{2} &= \begin{bmatrix} \mathbf{L}^{w} & \mathbf{0}_{n_l \times n_u} \\ \mathbf{0}_{n_u \times n_l} & \mathbf{0}_{n_u \times n_u} \end{bmatrix} \\ \mathbf{P}_{3} &= \begin{bmatrix} \mathbf{0}_{n_l \times n_l} & \mathbf{0}_{n_l \times n_u} \\ \mathbf{0}_{n_u \times n_l} & \mathbf{I}_{n_u \times n_u} \end{bmatrix}, & \mathbf{P}_{4} &= \begin{bmatrix} \mathbf{0}_{n_l \times n_l} & \mathbf{0}_{n_l \times n_u} \\ \mathbf{0}_{n_u \times n_l} & \mathbf{M} \end{bmatrix} \end{aligned} \tag{16}$$

Therefore, the weighted SLDA is equivalent to solving the following generalized eigenvalue problem

$$\begin{gathered} \beta\, \mathbf{X}\mathbf{P}_{1}\mathbf{X}^{\top} \boldsymbol{\varphi} = \lambda\, \beta\, \mathbf{X}\mathbf{P}_{2}\mathbf{X}^{\top} \boldsymbol{\varphi} \\ (1 - \beta)\, \mathbf{X}\mathbf{P}_{3}\mathbf{X}^{\top} \boldsymbol{\varphi} = \lambda\, (1 - \beta)\, \mathbf{X}\mathbf{P}_{4}\mathbf{X}^{\top} \boldsymbol{\varphi} \\ \Downarrow \\ \mathbf{S}^{rb} = \mathbf{X}\left[ \beta \mathbf{P}_{1} + (1 - \beta) \mathbf{P}_{3} \right] \mathbf{X}^{\top}, \quad \mathbf{S}^{rw} = \mathbf{X}\left[ \beta \mathbf{P}_{2} + (1 - \beta) \mathbf{P}_{4} \right] \mathbf{X}^{\top} \\ \Downarrow \\ \mathbf{S}^{rb} \boldsymbol{\varphi} = \lambda\, \mathbf{S}^{rw} \boldsymbol{\varphi} \end{gathered} \tag{17}$$

and $\beta$ is the trade-off parameter. In general, $0 < \beta < 1$ inherits the characteristics of both LFDA and NPE, and thus makes full use of both the class-discriminative and local neighbor information. In practice, searching for the optimal $\beta$ is time-consuming and sometimes impractical if insufficient labeled samples are available for validation. Several studies suggest that ensemble learning methods can be employed to avoid the huge effort of searching for optimal parameters [46,47]. On the other hand, different parameters also lead to diversity among features or classifiers, which benefits the generalization performance of the ensembles. Hence, we present an EL method based on the idea of RoF and the weighted SLDA algorithm.
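A sketch of the resulting eigenproblem solver is given below; the function signature, the ridge regularizer added to $S^{rw}$, and the use of scipy.linalg.eigh are our own implementation assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def weighted_slda(X, Lb, Lw, M, n_l, beta, r):
    """Solve the weighted SLDA eigenproblem of Eq. (17). X: d x n with the
    n_l labeled columns first; Lb, Lw from LFDA; M from NPE; returns d x r T."""
    d, n = X.shape
    n_u = n - n_l
    P1 = np.zeros((n, n)); P1[:n_l, :n_l] = Lb
    P2 = np.zeros((n, n)); P2[:n_l, :n_l] = Lw
    P3 = np.zeros((n, n)); P3[n_l:, n_l:] = np.eye(n_u)
    P4 = np.zeros((n, n)); P4[n_l:, n_l:] = M
    Srb = X @ (beta * P1 + (1 - beta) * P3) @ X.T      # S^{rb} of Eq. (17)
    Srw = X @ (beta * P2 + (1 - beta) * P4) @ X.T      # S^{rw} of Eq. (17)
    Srw += 1e-6 * np.trace(Srw) / d * np.eye(d)        # regularize to keep Srw PD
    vals, vecs = eigh(Srb, Srw)                        # ascending eigenvalues
    return vecs[:, ::-1][:, :r]                        # top-r eigenvectors
```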

#### *2.3. Proposed Semi-Supervised Rotation Forest*

Rotation forest was developed from conventional random forest to build independent decision trees on different sets of features. It consists of splitting the feature set into several random disjoint subsets, running PCA separately on each subset, and reassembling the extracted features [30,48]. By applying different splits of the features, diverse classifiers are obtained. The main steps of RoF are briefly presented as follows:

1. Randomly split the feature set into $K$ disjoint subsets;
2. For each subset, draw a bootstrap sample of the training data and apply PCA to obtain the component coefficients;
3. Assemble the coefficients of all subsets into a sparse rotation matrix and rearrange its columns to match the order of the original features;
4. Train a base classifier on the rotated training data; repeat the procedure for each of the $L$ ensemble members, and combine the outputs of all base classifiers to obtain the final label.

By substituting SLDA for the PCA method, we propose the SSRoF ensemble algorithm below (Algorithm 1). Apart from the different FE method, the difference between Algorithm 1 and RoF is that we use different weights ($\beta$) to balance the discriminative information and the structure information, thereby enhancing the diversity of features. Although the computation of the eigenvector matrix is repeated ten times (corresponding to the different values of $\beta$) for each feature subset, the within-class and between-class scatter matrices are invariant under different weights, so the computation cost is greatly reduced. Of course, the discrete values of $\beta$ could be set with a different step size; we recommend the values used in Algorithm 1 as a compromise between diversity and computation time.

**Algorithm 1:** Procedures of SSRoF

Input: training samples $X^L = \{x_i^L\}_{i=1}^{n_l}$, testing samples $X^T = \{x_i^T\}_{i=1}^{n_t}$, unlabeled samples $X^U = \{x_i^U\}_{i=1}^{n_u}$, number of ensemble classifiers $L$, number of feature subsets $K$, ensemble $\mathcal{L} = \varnothing$.
Output: class labels of $X^T$.

For $i = 1 : L$
  1. Randomly split the features into $K$ subsets;
  For $j = 1 : K$
   2. Randomly select a subset of samples from $X^L$ and from $X^U$, respectively (typically 75% of the samples), using the bootstrap approach;
   3. Perform the weighted SLDA algorithm on the subsets of $X^L$ and $X^U$ to obtain the between-class and within-class scatter matrices of Equation (17);
   For $\beta = 0.1 : 0.1 : 1$
    4. Obtain the eigenvector matrix $T_{j,\beta}$ by solving Equation (17);
   End For
  End For
  For $\beta = 0.1 : 0.1 : 1$
   5. Construct the transformation matrix $T_{\beta} = \left[T_{1,\beta}, T_{2,\beta}, \dots, T_{K,\beta}\right]$ by merging the eigenvector matrices, and rearrange the columns of $T_{\beta}$ to match the order of the original features;
   6. Build a DT sub-classifier using $T_{\beta}^{\top} X^L$;
   7. Perform classification of $T_{\beta}^{\top} X^T$ using the sub-classifier;
  End For
End For
8. Apply a majority voting rule over the $L \times 10$ sub-classifiers to compute the confidence for $X^T$ and assign a class label to each testing sample.
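For concreteness, the following Python skeleton strings the previous sketches together into the structure of Algorithm 1; the helpers (lfda_laplacians, npe_weights, weighted_slda) are the hypothetical sketches above, and details such as the heat-kernel affinity and the 75% sampling rate are illustrative assumptions rather than the exact settings of the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ssrof(XL, yL, XU, XT, L=10, K=4, betas=np.arange(0.1, 1.01, 0.1)):
    """Skeleton of Algorithm 1. XL, XU, XT are d x n matrices (columns are
    samples); yL holds non-negative integer class labels."""
    d = XL.shape[0]
    votes = []
    for _ in range(L):
        feat_subsets = np.array_split(np.random.permutation(d), K)      # step 1
        T = {b: np.zeros((d, d)) for b in betas}
        for feats in feat_subsets:
            li = np.random.choice(XL.shape[1], int(0.75 * XL.shape[1]))  # step 2
            ui = np.random.choice(XU.shape[1], int(0.75 * XU.shape[1]))
            Xl_s, Xu_s = XL[feats][:, li], XU[feats][:, ui]
            D2 = ((Xl_s[:, :, None] - Xl_s[:, None, :]) ** 2).sum(axis=0)
            A = np.exp(-D2 / (D2.mean() + 1e-12))   # simple heat-kernel affinity
            Lb, Lw = lfda_laplacians(yL[li], A)                          # step 3
            _, M = npe_weights(Xu_s)
            Xs = np.hstack([Xl_s, Xu_s])
            for b in betas:                                              # step 4
                Tj = weighted_slda(Xs, Lb, Lw, M, len(li), b, r=len(feats))
                T[b][np.ix_(feats, feats)] = Tj                          # step 5
        for b in betas:
            clf = DecisionTreeClassifier().fit((T[b].T @ XL).T, yL)      # step 6
            votes.append(clf.predict((T[b].T @ XT).T))                   # step 7
    votes = np.array(votes, dtype=int)                                   # step 8
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Note that, consistent with the remark above, the scatter matrices of each feature subset are computed once and only the weighted combination is re-solved for each value of $\beta$.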

#### **3. Experimental Results and Discussion**

In this section, we report the experiments on the four groups of hyperspectral images. First, the presented method is compared with several other EL algorithms to demonstrate its advantages. Then, we evaluate the performance of our method under different parameter settings.
