1. Introduction
Hyperspectral image classification is a widely studied topic in remote sensing, one that relies heavily on machine learning theory and related techniques [
1]. The long-lasting research interest in hyperspectral image classification is mainly due to the extremely high spectral resolution of hyperspectral images (HSIs), a unique characteristic compared to natural images and other kinds of optical remote sensing images. HSIs with high spectral resolution can be used to measure informative and distinctive spectral signatures of the ground surface. Therefore, HSIs are valuable to various applications, including resource management, environment monitoring and disaster analysis [
2]. However, the high dimensionality of HSIs also makes HSI classification a challenging task. One critical issue is the Hughes phenomenon [
3], which arises from the combination of the high dimensionality of HSIs and the scarcity of training samples. In practical HSI classification tasks, the number of labeled pixels available as training samples is generally quite limited. These training samples are therefore very sparse in the original high-dimensional spectral feature space. Accordingly, they are often insufficient to support the training of a complex machine learning model, and dimensionality reduction operations are usually included in HSI classification pipelines. Band selection is the most direct way to reduce the dimensionality of HSIs [
4]. The obvious flaw of band selection techniques is that, in reducing information redundancy, useful information is also discarded when most of the spectral bands are simply removed. A more sophisticated choice is to adopt projection-based dimensionality reduction methods. Dimensionality reduction based on linear projection techniques, such as principal component analysis (PCA), independent component analysis (ICA) and factor analysis (FA), has long been a standard preprocessing step in HSI classification tasks [
5]. On the other hand, nonlinear dimensionality reduction methods, such as those based on graphs and manifolds [
6,
7], are becoming more and more popular.
In recent years, convolutional neural networks (CNNs) and other deep learning models have become increasingly popular in HSI classification tasks [
8,
9,
10]. In order to make the CNN models more compatible with the unique characteristics of HSIs, many novel designs and exquisite structures have been proposed. As a pioneer work, the contextual deep CNN (CDCNN) [
11] uses a multi-scale convolutional filter bank to jointly exploit the spatial and spectral information in HSIs. In the diverse region-based CNN (DR-CNN) [
12], six differently shaped neighboring regions of a target pixel are extracted to provide richer spatial features for classifying the pixel. Besides extracting features from multiple scales and diverse regions, multi-model combination is also a widely adopted strategy in many HSI classification approaches. The two-stream model proposed in [
13] uses a stacked denoising autoencoder to encode pixel-wise spectral values and a deep CNN to extract spatial features. In the spectral-spatial unified network (SSUN) [
14], the pixel-wise analysis is implemented by a long short-term memory (LSTM) module in parallel with a typical CNN established for patch-wise analysis and spatial feature extraction. In the double-branch multi-attention mechanism network (DBMA) [
15], a CNN equipped with a channel attention module and another CNN equipped with a spatial attention module are combined to implement parallel spectral–spatial feature extraction. The usage of other attention modules has also been reported, such as the efficient channel attention (ECA) module embedded in the attention-based adaptive spectral–spatial kernel improved residual network (A2S2K-ResNet) proposed in [
16]. There are also some very recent works based on pure attention models, such as the model named SpectralFormer proposed in [
17]. As for single-stream models, such as the spectral–spatial residual network (SSRN) [
18] and the hybrid spectral CNN (HybridSN) [
19], 3D convolutions are usually adopted to achieve the simultaneous extraction of both the spectral and the spatial features in HSIs. In our previous work [
20], we also used 3D convolutions together with dilated convolutions to achieve state-of-the-art performance on benchmark datasets. Compared to basic 2D CNNs, 3D CNNs are usually more complex and require a larger number of training samples.
In order to support the training processes of large CNN models using only a limited number of labeled samples, preprocessing steps, such as data augmentation and data generation, are usually implemented [
21]. In [
22], a ‘virtual sample enhanced’ method is presented to improve the training of the proposed 3D CNN by creating virtual training samples based on the mixture of real samples. In [
23], data augmentation techniques based on image rotation and image flipping are adopted to increase the number of training samples up to six times. In [
24], the idea of adversarial training is introduced into HSI classification tasks, and a multi-class spatial–spectral generative adversarial network (MSGAN), which contains two generator components and one discriminator component, is proposed. During the adversarial training procedure of MSGAN, one generator imitates the original training samples and generates synthetic samples containing only spectral information; the other one generates synthetic samples containing spatial information. These synthetic samples are given to the discriminator to improve its ability to classify real HSI samples.
Besides data augmentation and data generation, ensemble learning has also been verified as an effective technique to address the contradiction between large models and small training sets. The band-adaptive spectral–spatial feature learning neural network (Bass Net) proposed in [
25] is an early deep neural network ensemble for HSI classification, based on an equal partition of the HSI spectral channels. State-of-the-art performance on benchmark data sets was achieved by Bass Net without involving any kind of data augmentation. Compared to the spectral feature partitioning used in Bass Net, random feature selection (RFS) is a more convenient and widely adopted way to construct CNN ensembles for HSI classification [
4]. As reported in [
26], individual CNN classifiers with very simple structures are defined based on randomly selected spectral features extracted from the original HSIs. The resulting ensemble can produce highly accurate classifications after being trained with only a small number of training samples. This work was improved later in [
27] by introducing transfer learning [
28] and employing pre-trained ImageNet models [
29] as the base classifiers. The inspiration here is quite straightforward, i.e., the ensemble can be improved by enhancing the base classifiers. In [
30], a model augmentation technique is proposed to synthesize new deep networks from the original one by injecting Gaussian noise into the model’s weights, and this technique notably boosts the ensemble’s generalization ability on unseen test data. In [
31], random oversampling of the training samples is performed to enhance the training of the base classifiers, thereby improving the performance of the ensemble. Following a similar strategy, semi-supervised learning [
32] and self-supervised learning [
33] have also been introduced into the training processes of classification ensembles for HSIs.
In our study, we focus on the idea of using ensemble learning to solve the problems caused by the scarcity of training samples and the high dimensionality of HSIs. Instead of the RFS process, we propose a trainable spectral feature refining module as a very effective and convenient technique to construct ensembles of improved CNN classifiers. This spectral feature refining module consists of a channel attention computation and a 1 × 1 convolution layer, and it can be embedded into the CNN classifiers to support an end-to-end processing procedure. Unlike the independent RFS process, the spectral feature refining module can be trained along with the other layers within the base CNN classifier. Therefore, an optimized lower dimensional feature subspace can be produced by the module to support better classifications. The diversity among base classifiers in the ensemble is guaranteed by the inherent randomness of the training processes of the modules and the CNN models. The end-to-end fashion for training the base classifiers makes the proposed strategy more convenient than the RFS-based ensembles.
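The forward pass of such a refining stage can be sketched as follows. This is a minimal NumPy illustration, assuming a squeeze-and-excitation-style channel attention (global average pooling, a two-layer bottleneck with reduction ratio r, sigmoid gating) followed by a 1 × 1 convolution that projects the reweighted bands into a lower-dimensional subspace; the actual layer configuration of the module is specified in Section 3, and all weight shapes here are illustrative.

```python
import numpy as np

def sfr_forward(x, w1, w2, w_proj):
    """Sketch of a spectral feature refining (SFR) forward pass.

    x      : (H, W, C)  HSI patch with C spectral bands
    w1, w2 : channel-attention bottleneck weights, (C, C//r) and (C//r, C)
    w_proj : (C, D) weights of the 1x1 convolution, D << C
    """
    # Channel attention: squeeze the spatial dims to a per-band descriptor.
    z = x.mean(axis=(0, 1))                       # (C,)
    a = np.maximum(z @ w1, 0.0)                   # ReLU bottleneck
    a = 1.0 / (1.0 + np.exp(-(a @ w2)))           # sigmoid gate, (C,)
    x = x * a                                     # reweight spectral bands
    # A 1x1 convolution is a per-pixel linear projection across bands.
    return x @ w_proj                             # (H, W, D)

rng = np.random.default_rng(0)
H, W, C, D, r = 7, 7, 200, 3, 8
patch = rng.normal(size=(H, W, C))
y = sfr_forward(patch,
                rng.normal(size=(C, C // r)),
                rng.normal(size=(C // r, C)),
                rng.normal(size=(C, D)))
print(y.shape)  # (7, 7, 3)
```

In a trained network the gate and projection weights are learned jointly with the classifier, which is what makes the resulting subspace task-oriented rather than purely data-driven.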
The main contributions of our study are twofold:
We propose a trainable spectral feature refining module that is an effective dimensionality reduction technique for HSI classification. While the widely used projection-based dimensionality reduction techniques are usually implemented independently in the preprocessing stages of HSI classification tasks, the proposed spectral feature refining is more like an internal process of the classifier and can be optimized directly for improving the classification results.
A new ensemble learning strategy for HSI classification is established based on the proposed spectral feature refining module and the inherent randomness of CNN models. Using such a simple strategy, it is quite convenient to produce diversity among base classifiers. Without explicitly splitting the original spectral feature space, the base classifiers are automatically trained on different low dimensional spectral feature subspaces produced by the embedded spectral feature refining modules.
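The combination of the base classifiers' outputs can be illustrated with a simple soft-voting sketch; this assumes the ensemble averages the softmax probability vectors of its members (the exact fusion rule is described in Section 3), with toy probability values used for illustration.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Soft-voting combination: average the class-probability maps
    produced by the base classifiers, then take the arg-max class.

    prob_list : list of (N, K) arrays, one per base classifier,
                each row a softmax distribution over K classes.
    """
    avg = np.mean(prob_list, axis=0)      # (N, K)
    return np.argmax(avg, axis=1)         # (N,) predicted labels

# Toy example: three base classifiers, four samples, three classes.
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.4, 0.4, 0.2]])
p2 = np.array([[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.1, 0.7], [0.3, 0.5, 0.2]])
p3 = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6], [0.2, 0.6, 0.2]])
print(ensemble_predict([p1, p2, p3]))  # [0 1 2 1]
```

Because each member is trained from a different random initialization on its own refined subspace, the averaged prediction benefits from the diversity without any explicit feature-space splitting.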
The rest of this paper is organized as follows. As the two pillars of our proposal, the idea of ensemble learning and the mechanism of channel attention operations are discussed in
Section 2. In
Section 3, we describe the proposed ensemble model from its core mechanism to the overall architecture. Experimental comparisons between our proposal and the state-of-the-art approaches are reported in
Section 4, followed by the conclusion of our study in
Section 5.
4. Experimental Results
4.1. Data Set Description and Experimental Setup
In our study, the performance of the proposed SFRN ensemble is evaluated on four classical HSI benchmark data sets, including the Indian Pines (IP) data set, the Salinas (SA) data set, the Pavia University (PU) data set and the Kennedy Space Center (KSC) data set [
10]. Brief introductions about these data sets are as follows:
The IP image was captured in 1992 by the 224-band airborne visible/infrared imaging spectrometer (AVIRIS) [
44] over the Indian Pines test site in Northwestern Indiana, USA. The image contains
145 × 145 pixels with a spatial resolution of 20 mpp. Here, 200 bands in the spectral range of 0.4–2.5 μm are selected from the original image, and 10,249 pixels in the image are labeled and divided into 16 classes to form the data set.
The SA data set is another data set gathered by the AVIRIS sensor. The campaign was conducted in 1998 over the agricultural area of Salinas Valley, California. The image contains 512 × 217 pixels with a spatial resolution of 3.7 mpp. Here, 20 bands in the original image are discarded due to water absorption and noise. The data set contains 54,129 labeled pixels belonging to 16 different classes.
The PU image was captured by the reflective optics system imaging spectrometer (ROSIS) [
45] over the campus of the University of Pavia, in the north of Italy. The image contains
610 × 340 pixels with a spatial resolution of 1.3 mpp. After discarding the noisy bands, 103 of the 115 original spectral bands, covering the spectral range from 0.43 to 0.86 μm, are kept to form the data set. It contains 42,776 labeled pixels categorized into nine classes belonging to an urban environment with multiple solid structures, natural objects and shadows.
The KSC image was also captured by the AVIRIS sensor. The campaign was conducted in 1996 over the neighborhood of the Kennedy Space Center in Florida, USA. The image contains 512 × 614 pixels with a spatial resolution of 18 mpp. Only 176 spectral bands ranging from 0.4 to 2.5 μm are kept in the data set, which contains 5211 labeled pixels belonging to 13 different land cover classes.
Four sets of comparative experiments were conducted. The first experiment is a comparison between the SFR module and the PCA-based dimensionality reduction. This is to study the saturation phenomenon in the band selection process for HSI classification, and it is also to demonstrate the superiority of the SFR module as an optimized dimensionality reduction technique. In the second experiment, the SFRN ensemble is compared with some state-of-the-art (SOTA) HSI classification models. This experiment is to verify that the proposed ensemble is capable of improving the performance of very simple CNN models to the level of those of SOTA CNN models with very complex structures. In the third experiment, the SFRN ensemble is compared with other ensembles for HSI classification. This is to verify the effectiveness of the proposed convenient strategy to construct reliable ensembles which can make accurate predictions based on small amounts of training samples. Ablation analysis is also performed by comparing ensembles of SFRNs, CDRNs and basic CNNs. This is the fourth set of experiments conducted in our study. The overall accuracy (OA), the average accuracy (AA) and the kappa coefficient are the metrics involved in our experiments to evaluate the classification results of different models.
The experiments are conducted on an Intel Xeon E5 platform equipped with 64 GB memory and an NVIDIA GeForce GTX 1080Ti graphics processing unit. The proposed SFRN ensemble is implemented with the TensorFlow framework. The source code is available on GitHub (
https://github.com/modestyao/SFRN-ensemble, accessed on 7 October 2022). More details about the programming environment can be found on our source code page. In the first two sets of experiments, all the results are obtained with our own programs, while in the third experiment, the accuracy metrics of the existing ensembles are cited from the original paper. This is because we have no access to the source code of these ensemble approaches and therefore cannot fairly reproduce their experiments.
4.2. Spectral Redundancy and Dimensionality Reduction
The purpose of the first experiment is to evaluate the SFR module as an effective dimensionality reduction technique for HSI classification. Four different dimensionality reduction processes, including the one based on the SFR module, are implemented and compared in this experiment. The PCA-based approach is the classic one, which has been widely used and is still quite popular in recent research. The FA-based approach is also very effective for reducing a large number of variables into a smaller number of factors. The convolution-based dimensionality reduction (CDR) can be considered the prototype of the SFR module. As discussed in
Section 3.1, the dimensionality of HSIs can be reduced using only a 1 × 1
convolution layer, and the channel attention operation in the SFR module can further improve the spectral features with reduced dimensions.
A very simple CNN model with only two convolution layers is employed as the objective model working on the spectral feature subspaces created by different dimensionality reduction approaches. The structure of this objective model is illustrated in
Figure 6. The objective model is trained using 10% of the labeled samples in each of the four benchmark data sets, and the overall accuracies of its predictions are estimated on the remaining samples. Patches corresponding to the
neighborhood of the samples are cropped from the images after the dimensionality reduction operations. These patches constitute the inputs to the classification model. The number of epochs for training the model is set to 50. We use the Adam optimizer, with the learning rate set to 0.001. We repeat the experiments five times, and the means and standard deviations (STDs) of the OA values are illustrated in
Figure 7. We denote the classification results achieved by the model on different spectral subspaces created by the four dimensionality reduction approaches as PCA, FA, CDR and SFR, respectively.
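The patch construction step described above can be sketched as follows. This is an illustrative NumPy version that leaves the neighborhood size as a parameter (the SFRN ensemble in Section 4.3 uses 11 × 11 patches) and assumes reflect-padding as one common way to handle pixels near the image border.

```python
import numpy as np

def extract_patches(image, coords, size):
    """Crop size x size neighborhoods centered on labeled pixels.

    image  : (H, W, D) cube after dimensionality reduction
    coords : list of (row, col) positions of labeled pixels
    size   : odd patch width (e.g. 11 for an 11 x 11 neighborhood)
    """
    pad = size // 2
    # Reflect-pad so border pixels also get full-size neighborhoods.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    return np.stack([padded[r:r + size, c:c + size, :] for r, c in coords])

cube = np.random.default_rng(1).normal(size=(145, 145, 3))
patches = extract_patches(cube, [(0, 0), (72, 72), (144, 144)], size=11)
print(patches.shape)  # (3, 11, 11, 3)
```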
Since we are using quite a small model, most of the accuracy curves tend to saturate rapidly as the dimensionality of the inputs increases. Both the PCA-based and the FA-based dimensionality reduction processes are driven by the internal structure of the data sets, while CDR and SFR are optimized for the classification tasks. Therefore, it is not surprising that the CNN model produces less accurate classification results on the feature spaces created by the two projection-based approaches, especially when the dimensionality of the feature space is lower than five. When the dimensionality is higher than 10, the accuracies corresponding to the four dimensionality reduction approaches are very close to each other on the SA and PU data sets. On the IP data set, the advantages of the proposed SFR approach over the projection-based ones are more obvious, although the accuracy curves also become quite similar to each other when the dimensionality is higher than 30. The situation is quite different on the KSC data set. The PCA approach leads to very poor classification results even when the dimensionality is only reduced to 50. The FA curve looks better, but the accuracies are also quite low when the dimensionality is lower than 10. On the contrary, SFR and CDR lead to much better classification results. In particular, the accuracies corresponding to SFR are consistently higher than 95% when the dimensionality is higher than two. The reason for the poor performance of the PCA and FA approaches is that the labeled samples account for only a very small percentage of the pixels in the whole image. The projection-based approaches are optimized for the whole image, while CDR and SFR are optimized only for the labeled samples. CDR and SFR are more “task-oriented” and hence can support better classifications.
In general, for any of the four data sets, the SFR module can help the CNN model to produce accurate classifications even when the spectral dimensionality of the data set is largely reduced. The improvements from CDR to SFR are also very obvious on all the data sets involved in our experiments.
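As a reference point for this comparison, the PCA baseline can be sketched in a few lines of NumPy. Note that, as discussed above, the principal components are fitted on all pixels of the image rather than only the labeled samples, which is precisely why the projection is data-oriented instead of task-oriented.

```python
import numpy as np

def pca_reduce(image, n_components):
    """Project an HSI cube onto its leading principal components.

    The components are fitted on ALL pixels of the image, not only
    the labeled ones: the projection is optimized for the data
    distribution, not for the classification task.
    """
    h, w, c = image.shape
    flat = image.reshape(-1, c)
    flat = flat - flat.mean(axis=0)                 # center each band
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T
    return reduced.reshape(h, w, n_components)

cube = np.random.default_rng(2).normal(size=(10, 10, 50))
print(pca_reduce(cube, 5).shape)  # (10, 10, 5)
```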
4.3. Classification Performance
The second part of our experimental analysis is related to the comparisons between the proposed SFRN ensemble and the SOTA HSI classification approaches. In these comparisons, the implementation of the SFRN ensemble (SFRN-E) consists of 10 SFRNs as the base classifiers, and the structures of these SFRNs in the ensemble are exactly the same as illustrated in
Figure 4, except that they take 11 × 11 patches as inputs. SFRN-E is compared with five SOTA models, namely CDCNN [
11], SSRN [
18], DBMA [
15], HybridSN [
19] and SpectralNet [
46]. In each of the 10 base SFRN classifiers in our ensemble, we use the SFR module to reduce the dimensionality of the original HSIs to three. The randomness of the SFR-based dimensionality reduction process guarantees the diversity of the obtained three-dimensional feature spaces, which means that the ensemble as a whole exploits 30 different feature dimensions. For the other, single-model approaches, we reduce the dimensionality of the HSIs to 30 before constructing the training samples. Therefore, both our ensemble and the SOTA models are trained on 30-dimensional feature spaces, which gives our comparative experiment a certain level of fairness. All the models are trained using a total of 200 samples from each of the four benchmark data sets. These samples are randomly selected from all the categories according to their proportions within each data set. The performance of the models is estimated on the remaining samples, as reported in
Table 1,
Table 2,
Table 3 and
Table 4. The numbers of samples in the training set and the test set are also reported. The per-class classification accuracies are measured using the F1-score; OA, AA and the kappa coefficients are reported to demonstrate the overall performance of the different models.
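The three overall metrics can be computed directly from a confusion matrix. The following sketch uses the standard definitions of OA, AA and Cohen's kappa, where AA is the mean of the per-class recalls; the toy labels are for illustration only.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA and the kappa coefficient from label vectors."""
    cm = np.zeros((n_classes, n_classes), dtype=float)   # confusion matrix
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)             # per-class recall
    aa = per_class.mean()                                # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.75 0.778 0.628
```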
SFRN-E outperforms all the compared approaches on the four data sets in terms of overall accuracy. Regarding class-wise accuracies, SFRN-E produced the best classification results on seven of the 16 classes in the IP data set, and this proportion is 12/13 for the KSC data set. The class-wise accuracies achieved by SFRN-E are above 90% on 15 of the 16 classes in the SA data set. This is a noticeably more stable and balanced performance compared to the other models. On the PU data set, SFRN-E also achieved the best class-wise results on three of the nine classes. A very important advantage of SFRN-E is the consistency of its performance across different data sets. In contrast, the performance of DBMA is quite close to that of SFRN-E on the SA and PU data sets, but it drops considerably on the KSC data set. In general, the advantage of SFRN-E is more obvious on the IP and KSC data sets, which contain fewer labeled samples than the other two. This can be considered a verification of the ability of SFRN to deal with the scarcity of training samples.
Visual comparisons are also included here, as illustrated in
Figure 8,
Figure 9,
Figure 10 and
Figure 11. Classification maps produced by different approaches are compared with the ground truths. In general, the classification maps produced by the SFRN ensemble show fewer mislabeled areas as compared to the maps produced by the other approaches.
4.4. Comparisons with Other Ensembles
The third experiment is to compare the proposed SFRN ensemble with other ensembles. The comparisons are partially based on the experimental results reported in [
27]. As discussed in
Section 1, ensemble learning is introduced into HSI classification tasks as an alternative to data augmentation techniques when the number of labeled samples is not large enough to fully support the training of a complex CNN model. In our experiment, we select only 200 samples from each data set to train the SFRN ensemble, and then we evaluate the ensemble using the remaining samples. As in the second experiment, we use 10 SFRNs as the base classifiers to construct our ensemble. The Adam optimizer is adopted, and the learning rate is uniformly set to 0.001 for the training processes of all the base classifiers. The number of training epochs is set to 50. For the sake of fair comparison, we follow the settings in [
27] by repeating the training and evaluation processes of our ensemble 10 times, and the averages and standard deviations of accuracies achieved on the test sets are recorded and compared with the recordings reported in [
27]. As shown in
Table 5, four ensembles are considered for comparisons, including an ensemble of support vector machine (SVM-E), a CNN ensemble (CNN-E), a CNN ensemble with transfer learning (TCNN-E) and a CNN ensemble with transfer learning and improved label smoothing (TCNN-E-ILS). As in
Section 4.3, SFRN-E in
Table 5 represents the implementation of the proposed SFRN ensemble. Inspired by the label smoothing process applied in [
27], we also include label smoothing into the training processes of our base SFRN classifiers.
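The label smoothing applied to the training targets of the base classifiers can be sketched as follows. This shows the standard formulation that spreads a small mass eps uniformly over all classes (the improved variant used in [27] may differ in detail); the class count and eps value here are illustrative.

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Standard label smoothing: replace a one-hot target with
    (1 - eps) on the true class plus eps / K spread over all classes."""
    one_hot = np.eye(n_classes)[labels]
    return one_hot * (1.0 - eps) + eps / n_classes

targets = smooth_labels(np.array([0, 2]), n_classes=4, eps=0.1)
print(targets)
# [[0.925 0.025 0.025 0.025]
#  [0.025 0.025 0.925 0.025]]
```

Training against these softened targets discourages the base classifiers from becoming over-confident, which is helpful when only 200 labeled samples are available.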
These ensembles are evaluated on three data sets, namely IP, PU and KSC. Since the SA data set is not included in the experiments reported in [
27], we cannot compare the performance of SFRN-E with the other ensembles on this data set. The overall classification performances are reported in
Table 5, in terms of OAs, AAs and kappa coefficients. SVM-E, CNN-E and TCNN-E are all built on the randomness of the random feature selection process. This preprocessing is abandoned in SFRN-E by converting it into an internal module of the base CNN classifier. SFRN-E outperforms SVM-E and CNN-E on all three data sets. The overall performance of SFRN-E and TCNN-E is close on the IP and KSC data sets, but SFRN-E is much more reliable on the PU data set. Considering the fact that TCNN-E is an ensemble of pretrained CNNs, SFRN-E is much easier to construct. The experimental results confirm the effectiveness of the more convenient strategy adopted in our study to construct CNN ensembles.
Another interesting phenomenon that can be observed in the experimental results is that the performance of the ensemble is largely correlated with that of the base classifiers. Since the ability of CNN models to extract spatial features from images is much stronger than that of SVMs, all the CNN ensembles outperform SVM-E. Furthermore, the ensemble of CNNs can be improved when the individual CNN models are enhanced. This kind of improvement can be achieved by adopting techniques such as transfer learning and label smoothing, as shown by the comparison between CNN-E, TCNN-E and TCNN-E-ILS. The SFR module proposed in our study can also be considered a very effective technique to improve the individual CNN models in the ensemble. On the IP and KSC data sets, the performance improvements brought by the SFR module are roughly equivalent to those of transfer learning, while on the PU data set, SFR is obviously a much more effective boosting technique. Meanwhile, when the SFR module is used together with label smoothing, the performance of the ensemble is further improved.
The total numbers of the parameters in different models, including the proposed SFRN ensemble, TCNN-E, HybridSN and the others involved in our study, are reported in
Table 6. The size of the proposed SFRN ensemble is much smaller than that of TCNN-E, and it is even smaller than HybridSN. This indicates the advantage of the proposed ensemble as a low-complexity model that can provide comparable or even better classification results than those very complex models. It should be pointed out that these parameter counts only correspond to the models established for the KSC data set. For the other data sets, which have different numbers of classes, the sizes of these models will also differ, but their ordering will remain consistent.
4.5. Ablation Analysis
The ablation study is conducted by modifying and removing the SFR modules from the SFRNs in the proposed ensemble. Specifically, four ensembles composed of different types of base classifiers are constructed and compared with each other. There are 10 base classifiers in each of these ensembles, and the differences between the structures of their base classifiers are illustrated in
Figure 12. In the figure, “CNN” denotes the very simple CNN structure as explained in
Section 4.2. When we replace the SFR module in SFRN with a CDR module, we obtain the CDR network (CDRN) as our base classifier. The proposed SFR module is based on the SE block, and in the original study, SE blocks are used in the middle part of CNNs. Therefore, we remove the SFR module from the top of SFRN and insert it between the two convolutional layers in the network. We denote this variant of the base classifier in our ensemble as “SENet”. The SFRN ensemble and its variants are compared on the data sets of IP, SA, PU and KSC. A total of 200 samples from each data set are selected for training the base classifiers in the different ensembles. We still use the Adam optimizer, and the learning rate is still 0.001 during all the training processes. Each training process still consists of 50 epochs.
As reported in
Table 7, the improvements from the “CNN” ensemble to the CDRN ensemble and then to the SFRN ensemble are quite obvious. A performance increase of at least five percentage points in terms of OA can be observed when comparing the “CNN” ensemble with the SFRN ensemble. This demonstrates the effectiveness of the SFR module as a task-oriented dimensionality reduction technique. The comparison between the CDRN ensemble and the SFRN ensemble reveals the necessity of including the spectral-attention-based soft feature selection operation in our dimensionality reduction approach. The results produced by the “SENet” ensemble are comparable to those of the SFRN ensemble, but the overall advantages of the SFRN ensemble are still notable.
5. Conclusions
This paper presents ensemble learning for HSI classification as an alternative solution to the training sample scarcity problem. As a common phenomenon in machine learning research, the training processes of simpler models are less demanding in terms of the number of required training samples, while ensemble learning is an effective technique for promoting the performance of simple models. Therefore, when the training samples are insufficient to support the training of a complex CNN model, ensembles of simpler models can be exploited. Following this idea, we propose a quite convenient approach to construct a very effective CNN ensemble for HSI classification, based on a novel spectral feature refining module and the inherent randomness in the initialization of CNNs.
Besides the proposed approach for HSI classification, a very important theoretical contribution of our study is the combination of a solution to the training sample scarcity problem with a solution to the problems caused by the high dimensionality of HSIs. An implicit dimensionality reduction is included in the spectral feature refining module, which is the basis of the ensemble. Experimental results demonstrate that the proposed ensemble is a reliable choice for HSI classification tasks when training samples are scarce, and the proposed module is also an effective technique for dimensionality reduction.
As the base classifier in the proposed ensemble, SFRN is a model featuring a very simple structure. However, the SFRN ensemble as a whole is still a rather big model. Compared to single-model approaches, the training process for any type of classification ensemble can be more time consuming, especially when the ensemble contains a large number of base classifiers. We suppose that this is probably the main reason why ensemble learning is less popular for small-data-set problems such as HSI classification. Therefore, improving the efficiency of the classification model will be the main goal of our future study. In fact, we have already started researching knowledge distillation techniques for HSI classification tasks.