1. Introduction
A hyperspectral image (HSI) is an image of targets of interest captured by an optical sensor at specific wavelengths, consisting of hundreds of continuous bands with fine spectral resolution [
1,
2]. These bands are rich in spectral and spatial information, allowing the identification of subtle differences between features. In recent years, owing to the ever-increasing spectral resolution of HSI, it has been successfully applied in numerous fields, including military monitoring [
3], food safety [
4], medical diagnosis [
5], etc. However, the strong inter-band correlation and high dimensionality of hyperspectral data lead to a large amount of information redundancy, heavy computational and storage burdens, and curse-of-dimensionality problems, all of which pose great difficulties and challenges for the development of HSI processing methods. Hyperspectral dimensionality reduction is therefore of great significance as a form of preprocessing to address these challenges.
The most commonly used methods for HSI dimensionality reduction to reduce spectral band redundancy include band selection (BS) [
6,
7] and feature extraction (FE) [
8,
9], where BS-based methods directly select the most representative band subset from the original HSI data without any transformation, so that the obtained sub-bands are informative, distinguishable, and beneficial for subsequent tasks. Compared with FE, BS reduces data dimensionality while preserving the physical meaning, inherent properties, and spectral characteristics of the original data, which facilitates the interpretation of the selected band subset in subsequent analysis and has been widely used in practical applications [
10,
11]. According to whether labeled information is used or not, the existing BS methods include three types as follows: supervised [
12,
13,
14], semi-supervised [
15,
16,
17], and unsupervised [
18,
19,
20,
21]. Because labeled HSI data are difficult to obtain, this paper focuses on unsupervised BS methods, which are more flexible, feasible, and effective in practice.
As a popular technique for unsupervised band selection, self-representation is implemented using the self-expression properties of data and different regularization constraints, where the most representative algorithms include fast and robust self-representation (FRSR) [
22], robust dual graph self-representation (RDGSR) [
23], and marginalized graph self-representation (MGSR) [
24]. In addition, self-representation-based subspace clustering (SSC) has achieved a large number of successful results in unsupervised BS [
25,
26,
27]. Under the representation-based framework, the clustering structure of the spectral bands can be learned in a low-dimensional subspace that is robust to noise and outliers, which makes it possible to cluster high-dimensional data effectively. However, in practical applications, the HSI data to be processed typically exhibit great spectral variability and lie in nonlinear subspaces, where traditional SSC-based BS methods, which are inherently linear, cannot capture the nonlinear relationships of HSI and fail to achieve satisfactory performance.
Recently, deep neural networks have demonstrated superiority in handling high-dimensional data because of their remarkable ability to extract complex nonlinear relationships of features in an end-to-end learnable way. Deep learning models for HSI band selection have achieved tremendous success [
28,
29,
30,
31]. Although these methods are able to learn the complex nonlinear structure information of HSI, they ignore the correlation between bands, resulting in a large number of band redundancies. In view of this issue, deep learning-based clustering methods are introduced into band selection [
32,
33,
34,
35], considering the spatial information inherent in band images. However, these methods also have certain limitations. On the one hand, the representation learning of these models is embedded in a deep convolutional autoencoder, which lacks an effective self-supervised representation ability; on the other hand, without considering the subspace clustering representation of both the low-level and high-level information of the input HSI, these models ignore the meaningful multi-scale information embedded in the different layers of deep convolutional autoencoders, wasting information that is conducive to clustering. In addition, because they ignore the connectivity within each subspace, the existing models cannot further improve the clustering performance.
To solve the above problems, a self-supervised deep multi-level representation learning fusion maximum entropy subspace clustering (MLRLFMESC) algorithm is proposed for BS in this paper, with the following main contributions:
- (1)
Considering the multi-level spectral–spatial information of hyperspectral data, self-representation-based subspace clustering, comprising multiple fully connected layers, is inserted between each encoder layer of the deep stacked convolutional autoencoder and its corresponding decoder layer to realize multi-level representation learning (MLRL), which can fully extract low-level and high-level information and obtain more informative and discriminative multi-level representations.
- (2)
Self-supervised information is provided to further enhance the representation capability of the MLRL, and a new auxiliary task is constructed for MLRL to perform multi-level self-supervised learning (MLSL). Furthermore, a fusion module is designed to fuse the multi-level spectral–spatial information extracted by the proposed MLRL to obtain a more informative subspace representation matrix.
- (3)
To enhance the connectivity within the same subspace, maximum entropy regularization (MER) is applied to ensure that the elements within the same subspace are uniformly and densely distributed, which is beneficial for subsequent spectral clustering.
The remainder of this paper is organized as follows.
Section 2 gives a detailed description of the proposed MLRLFMESC algorithm.
Section 3 presents the experiments and corresponding analysis of the proposed algorithm with other state-of-the-art BS methods. Finally, the conclusions and discussions are drawn in
Section 4.
2. Proposed Method
This section describes in detail the proposed MLRLFMESC method for BS. The flowchart of this method is shown in
Figure 1. The main steps are as follows: Firstly, considering the low-level and high-level spatial information of hyperspectral data, the proposed method inserts multiple fully connected layers between the encoder layers and their corresponding decoder layers to achieve multi-level representation learning (MLRL), thus generating multiple sets of self-expression coefficient matrices at different levels of the encoder and obtaining more informative and discriminative subspace clustering representations. Secondly, a new auxiliary task is constructed based on the MLRL, which provides multi-level self-supervised information to further enhance the representation ability of the model; this is termed multi-level self-supervised learning (MLSL). Finally, a fusion module is designed to integrate the multi-scale information extracted from the different layers of representation learning to obtain a more discriminative self-expression coefficient matrix. Maximum entropy regularization (MER) is introduced to ensure that elements of the same subspace are evenly and densely distributed, thereby enhancing the connectivity within each subspace and facilitating the subsequent spectral clustering that determines the most representative band subset.
2.1. Multi-Level Representation Learning (MLRL)
The proposed MLRLFMESC method exploits a deep stacked convolutional autoencoder (SCAE) constructed by the structure of a symmetrical encoder–decoder as the core network for feature extraction to sufficiently extract the spectral–spatial information of HSI data. The original HSI data cube
contains
W ×
H spatial dimensions and
B spectral band dimensions. To ensure that the input HSI samples X can be reconstructed by the deep SCAE, the reconstruction loss function is defined as follows:
\( \mathcal{L}_{rec} = \| X - \hat{X} \|_{F}^{2} \),
where the Frobenius norm is expressed as \( \| \cdot \|_{F} \), and the reconstructed HSI samples are denoted as \( \hat{X} \).
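The reconstruction loss above can be sketched numerically; the following is a minimal NumPy illustration (the function name is ours, not from the paper):

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Squared Frobenius norm between the input HSI samples X and
    their reconstruction X_hat produced by the deep SCAE."""
    X, X_hat = np.asarray(X, dtype=float), np.asarray(X_hat, dtype=float)
    return float(np.sum((X - X_hat) ** 2))
```

In practice this loss is minimized end-to-end over the autoencoder weights; the sketch only evaluates it for given inputs.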
Self-representation-based subspace clustering assumes that all data points lie in a union of linear or affine subspaces, which is typically expressed as a self-representative model. Let the spectral bands come from the union of n different subspaces. In order to extract spatial information while accounting for the nonlinear relationships of HSI, the self-representative model is embedded in the latent space of the deep SCAE to exploit the self-representation property and obtain subspace clustering representations; the underlying clusters are then recovered with spectral clustering.
Inspired by the fact that different encoder layers learn increasingly complicated feature representations of the input HSI data, a self-representation model, comprising multiple fully connected layers, is inserted between each encoder layer of the deep stacked convolutional autoencoder and its corresponding decoder layer to achieve MLRL; thus, multi-level spectral–spatial information can be extracted. In order to capture the information shared between encoder layers and to generate information unique to each layer, the consistency matrix
and the discrimination matrix
are defined, respectively. In view of the above-mentioned consideration, MLRL can be performed by the following loss function:
where
represents the latent representation matrix and
m is the dimension of the deep spatial feature.
The self-expression loss
is used to promote the learning of self-expression feature representations at different encoder levels. As for the discrimination matrix
, the Frobenius norm is employed; thus, the connectivity of subspace representations related to each fully connected layer can be ensured. Meanwhile, to generate the sparse representation of the HSI, the
l1-norm is used in the consistency matrix
. Accordingly, the regular terms added to the model are shown as follows:
The multi-level spectral–spatial information of HSI is obtained via MLRL to facilitate the feature learning process, thereby obtaining multiple sets of information representations accordingly.
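A single level of the MLRL objective can be sketched as follows; this is an illustrative NumPy version under our own naming, combining the self-expression residual with the l1 penalty on the consistency matrix and the Frobenius penalty on the discrimination matrix described above (the exact weighting in the paper's loss is not shown here):

```python
import numpy as np

def mlrl_level_loss(Z, C, D, lam_c=1.0, lam_d=1.0):
    """Illustrative per-level MLRL objective: the latent band
    representations Z (features x bands) should be re-expressed as
    their own linear combination Z @ C, with an l1 penalty on the
    consistency matrix C and a Frobenius penalty on the
    discrimination matrix D."""
    self_expr = np.sum((Z - Z @ C) ** 2)                  # self-expression residual
    reg = lam_c * np.abs(C).sum() + lam_d * np.sum(D ** 2)  # sparsity + connectivity
    return float(self_expr + reg)
```

In the full model, one such term is evaluated at every encoder level and the coefficient matrices are learned jointly with the network weights.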
2.2. Multi-Level Self-Supervised Learning (MLSL)
To further improve the representation ability of the proposed model, MLSL is used as a self-supervised method to better learn self-expression feature representations by constructing auxiliary tasks for MLRL.
To perform MLSL, the auxiliary tasks are constructed as follows: Firstly, positive and negative sample pairs are formulated for the inputs and outputs of the MLRL. For a given input
and its corresponding set of outputs
at layer
l of MLRL,
and
are matched as a positive pair, while
and
are treated as a negative pair. Subsequently, the MLSL is implemented by formulating a self-supervised loss function, expressed as follows:
where
σ is a temperature parameter that controls the concentration level of the distribution.
and
are normalizations of
and
, respectively.
An l2-normalization layer is applied to the representations. A B-way softmax classifier is then exploited for the classification, as in the loss function of MLRL.
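The MLSL loss described above can be sketched as a temperature-scaled softmax over cosine similarities; the following NumPy version is a sketch under our assumptions (function and argument names are ours), treating each input/output pair at the same index as positive and all other pairings as negative:

```python
import numpy as np

def mlsl_loss(h_in, h_out, sigma=0.5):
    """Sketch of the multi-level self-supervised loss: after l2
    normalization, each input representation h_in[i] should match its
    own output h_out[i] (positive pair) rather than any other band
    (negative pairs), via a temperature-scaled softmax over cosine
    similarities. sigma is the temperature parameter."""
    h_in = h_in / np.linalg.norm(h_in, axis=1, keepdims=True)
    h_out = h_out / np.linalg.norm(h_out, axis=1, keepdims=True)
    logits = (h_in @ h_out.T) / sigma
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy on positives
```

A smaller σ sharpens the distribution, penalizing mismatched pairs more strongly.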
2.3. Fusion Module with Maximum Entropy Regularization (MER)
Considering that the coefficient matrices of the information representations obtained from MLRL carry multiple kinds of information about the input HSI data, it is preferable to fuse these matrices into a more discriminative and informative subspace representation matrix that serves as the input of the subspace clustering.
The coefficient matrices learned using MLRL are fused through the fusion module. Stacking them along the channel dimension yields a stacked matrix, whose channels are then merged using a convolutional kernel
k to realize the fusion. Finally, a more informative subspace representation matrix
is obtained via channel fusion learning, expressed as follows:
where
is the convolution operation.
By using an appropriate kernel size k, the fusion module is able to capture more local information from the stacked coefficient matrices, each of which has a block-diagonal structure.
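The channel-stacking-and-merging step can be sketched as follows. This is a plain NumPy illustration (names are ours; in the paper the kernel is learned, whereas here fixed kernels are passed in for clarity):

```python
import numpy as np

def fuse_coefficients(mats, kernels):
    """Sketch of the fusion module: stack L coefficient matrices
    (each B x B) along a channel axis and merge the channels with a
    k x k convolution ('same' zero padding, one kernel per channel),
    summing the per-channel responses into one fused matrix."""
    B = mats[0].shape[0]
    k = kernels[0].shape[0]
    pad = k // 2
    fused = np.zeros((B, B))
    for M, K in zip(mats, kernels):
        P = np.pad(M, pad)                      # zero padding keeps the B x B shape
        for i in range(B):
            for j in range(B):
                fused[i, j] += np.sum(P[i:i + k, j:j + k] * K)
    return fused
```

With a 3 × 3 kernel, as used in the experiments, each fused entry aggregates a local neighborhood of every level's coefficient matrix.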
Entropy is a measure of the uncertain information contained within a random variable; for a discrete random variable X, the entropy can be calculated as \( H(X) = -\sum_{x} p(x) \log p(x) \), with
p(X) as the probability distribution function of X. The similarity between hyperspectral data samples
i and
j in the subspace representation matrix
can be expressed by the corresponding matrix entry, and the MER method is applied to the subspace representation matrix, from which the following maximum entropy loss function is obtained:
where
satisfies
. The MER enforces connections of equal strength between elements from the same subspace and, simultaneously, ensures a uniformly dense distribution of the elements belonging to the same subspace.
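The effect of the MER term can be sketched numerically; the following NumPy function (names and the row-wise normalization are our assumptions for illustration) returns the negative entropy of the normalized similarities, so minimizing it pushes connections toward the uniform, dense distribution described above:

```python
import numpy as np

def max_entropy_regularizer(S, eps=1e-12):
    """Sketch of the MER term: normalize the (absolute) similarities so
    each row sums to one, then return the negative entropy of the
    resulting distribution. Lower values correspond to more uniform,
    denser connections within a subspace."""
    P = np.abs(S)
    P = P / (P.sum(axis=1, keepdims=True) + eps)
    return float(np.sum(P * np.log(P + eps)))   # = -H(P); lower = more uniform
```

A matrix with uniform similarities yields a lower value than one with a few dominant entries, matching the intent of maximizing entropy.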
2.4. Implementation Details
The final loss function of the proposed MLRLFMESC method is expressed as follows:
where
are parameters that balance the trade-off between the earlier-mentioned different losses.
and
are the parameters updated by the standard backpropagation of the network trained with the Adam gradient.
Once the network has been trained to obtain the matrix
, a symmetric affinity matrix A can be constructed for spectral clustering.
Matrix A encodes the pairwise relationships between the bands. Given this matrix, spectral clustering algorithms can be utilized to recover the underlying subspaces and group the bands into their respective subspaces to obtain the clustering results. The clustering centers are obtained by averaging the spectra within each cluster. Then, the distances between each cluster center and the bands in the same cluster are calculated to find the closest band, which is taken as the representative band of the cluster; the final band subset is thus obtained.
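The band-picking step described above can be sketched as follows; a minimal NumPy illustration in which the function name and input layout are our assumptions, and the per-band cluster labels are assumed to come from spectral clustering on the affinity matrix A:

```python
import numpy as np

def select_bands(X, labels):
    """Given pixels-by-bands data X (num_pixels x B) and a cluster
    label per band, take the average spectrum of each cluster as its
    center and select the band closest to that center (Euclidean
    distance) as the cluster's representative."""
    X = np.asarray(X, dtype=float)
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = X[:, idx].mean(axis=1)
        dists = np.linalg.norm(X[:, idx] - center[:, None], axis=0)
        selected.append(int(idx[np.argmin(dists)]))
    return sorted(selected)
```

The returned indices form the final band subset passed to the downstream classifier.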
3. Experiments and Results
3.1. Hyperspectral Datasets
Comparative experiments are conducted on three publicly available HSIs with different scenarios to prove the effectiveness of the proposed algorithm, including the Indian Pines (IP) dataset, the Pavia University (PU) dataset, and the Salinas (SA) dataset. Detailed descriptions of the three hyperspectral datasets are given in
Table 1.
3.1.1. Indian Pines (IP) Dataset
The IP dataset was captured using the AVIRIS sensor in Northwestern Indiana on 12 June 1992, containing 220 spectral bands with wavelengths of 0.4 to 2.5 μm and containing 145 × 145 pixels with a spatial resolution of 20 m. After the removal of 20 water absorption bands, 200 bands with 16 classes of crops are left for experiments.
Figure 2 shows the IP dataset’s pseudo-color map as well as the true image feature class distribution.
3.1.2. Pavia University (PU) Dataset
The PU dataset was collected via the ROSIS sensor for the city of Pavia in North Italy during a flying activity in 2003, consisting of 115 bands with wavelengths of 0.43 to 0.86 μm and containing 610 × 340 pixels with a spatial resolution of 1.3 m. After the removal of 12 noise bands, there are 9 types of objects available in the remaining 103 bands (the types of objects are shown in the label of
Figure 3). The pseudo-color map and real image feature class distribution of the PU dataset are shown in
Figure 3.
3.1.3. Salinas (SA) Dataset
The SA dataset was gathered using the AVIRIS sensor in Salinas Valley, California, in 1998, which comprises 224 bands with wavelengths of 0.36 to 2.5 µm and contains 512 × 217 pixels with spatial resolution of 3.7 m. After deleting 20 water absorption bands, the remaining 204 bands containing 16 classes are utilized for experiments. The pseudo-color map and real image feature class distribution of the SA dataset are shown in
Figure 4.
3.2. Experimental Setup
In order to prove the effectiveness of the proposed BS algorithm, five existing BS methods are used for comparison. Considering search-based, clustering-based, and ranking-based categories, as well as the comparison between traditional and deep learning methods, the selected algorithms are UBS [
18], E-FDPC [
19], ISSC [
25], ASPS_MN [
5] and DSC [
32], as these algorithms have open-source implementations provided by their respective authors. Performance analysis is conducted on the band subsets obtained from the various BS approaches using the same SVM classifier as in [
36] with an open-source code.
To quantitatively assess the quality of the selected band subsets, indicators of classification accuracy are used in this section, including overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). For a fair comparative purpose, identical training and testing data subsets within each round are utilized when evaluating different BS algorithms. Specifically, 10% of samples from each class are randomly chosen as the training set, with the remaining samples allocated to the testing set. The experimental results are averaged via ten independent runs to reduce the randomness.
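The three indicators used throughout the experiments can be computed from a confusion matrix as follows; this is a standard sketch (function name is ours), not code from the paper:

```python
import numpy as np

def classification_scores(y_true, y_pred, num_classes):
    """Compute OA (overall fraction correct), AA (mean per-class
    recall), and Kappa (agreement corrected for chance) from
    predicted labels."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                               # rows: true, cols: predicted
    n = cm.sum()
    oa = np.trace(cm) / n
    aa = float(np.mean(np.diag(cm) / cm.sum(axis=1)))
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return float(oa), aa, float(kappa)
```

Averaging these scores over the ten independent runs gives the figures reported in the tables.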
The deep SCAE network of the proposed method is composed of three symmetric layers of stacked encoder and decoder with the following parameter settings: the stacked encoder consists of 10, 20, and 30 filters with corresponding kernel sizes of 5 × 5, 3 × 3, and 3 × 3, respectively. The network learning rate and the four trade-off parameters are fixed across all experiments, and the kernel size k is 3 × 3.
3.3. Randomness Validation by Random Selection of Training and Testing Sets
As for classification results, the training and testing sets have a significant influence on the classification performance. Therefore, before the comparison of various algorithms, this section gives the average and variance of OA across 10 distinct runs, allowing the assessment of disparities between experiment runs where each run involves alterations to the training and testing sets.
Figure 5a–c illustrates the box plots depicting the OA results for the six algorithms utilizing 35 bands across the three HSI datasets in 10 separate runs. The experiments entail the repeated random selection of training and testing datasets. To ensure an equitable comparison, the training and testing sets for all six algorithms within the same round remain identical. The sub-figures demonstrate that the proposed MLRLFMESC obtains the best mean OA with a small deviation across runs, especially for the IP and PU datasets, as shown in
Figure 5a–c.
3.4. Ablation Study of the Proposed MLRLFMESC Method
To separately verify the effectiveness of maximum entropy regularization and self-supervised learning in the proposed MLRLFMESC method, ablation studies are conducted in this section. As shown in Equation (8) in
Section 2, the proposed method adopts partial loss functions from the DSC method, namely
,
,
, and
, collectively referred to as
in this paper. Building upon this foundation, the proposed method introduces the following two innovative techniques: the multi-level self-supervised learning model (MLSL) and maximum entropy regularization (MER). In MLSL, the
is introduced to obtain improved self-supervised features. In MER, the
is introduced to promote connectivity within the same subspace while ensuring a uniform and dense distribution of elements within the subspace, which is beneficial for subsequent spectral clustering. As a result, the ablation study should be implemented with/without the
and the
loss in
Table 2.
The following conclusions are given from
Table 2: Firstly, when either the MER (second row of the table) or the multi-level self-supervised learning model (third row) is added into the model, the OA performance is better than that of the DSC method. The proposed MLRLFMESC with both MER and MLSL (fourth row) achieves the best OA performance (in bold) across the different numbers of selected bands. These experiments demonstrate the effectiveness of maximum entropy regularization and self-supervised learning in the proposed MLRLFMESC approach.
3.5. Classification Results Analysis for Different BS Algorithms
In this section, comparative experiments on the three HSI datasets described above are conducted to prove the effectiveness of the proposed algorithm. However, limited by the number of pages and considering the reproducibility of the conclusions, only the results on the IP dataset are shown in this section, with the results for the PU and SA datasets given in
Appendix A and
Appendix B.
3.5.1. BS Results with Different Number of Selected Bands
To evaluate the classification accuracy of the proposed MLRLFMESC algorithm while comparing it with the existing BS methods, the quantity
n of selected bands in different BS methods is varied in the region of [
5,
35] with a step of five. The maximum of 35 was chosen because the virtual dimension (VD), a widely used technique for estimating the number of bands to select, is typically less than 35.
In
Figure 6, the suggested MLRLFMESC technique clearly outperforms the other five BS approaches in terms of classification accuracy for OA, AA, and Kappa. In
Figure 6a of the IP dataset, the proposed MLRLFMESC has the highest and most stable OA with a significant improvement over the other five comparable BS methods, especially when
n = 5, 30, 35, and the proposed MLRLFMESC has better accuracies of 3.42%, 3.56%, and 2.83% compared to the suboptimal approach. The AA accuracy is shown in
Figure 6b, where the MLRLFMESC technique exhibits the “Hughes” phenomenon, in which accuracy first grows and then drops as the number of bands increases. In terms of the Kappa given in
Figure 6c, the MLRLFMESC approach has a growing advantage over the suboptimal approach as the number of chosen bands grows, notably when
n = 30, 35, MLRLFMESC has 3.56% and 2.83% greater accuracy than the suboptimal approach, respectively. The other five BS approaches exhibit some “Hughes” phenomena in the OA, AA, and Kappa curves, indicating the necessity for band selection.
3.5.2. Classification Performance Analysis by Band Subsets Using Various BS Algorithms
To further assess the efficacy of the proposed MLRLFMESC approach for the analysis,
Figure 7 shows the index distribution of the 30 bands selected from the IP dataset via various band selection approaches. It is usually regarded as a poor strategy if the selected bands are concentrated within a relatively short range.
As can be seen in
Figure 7, most band selection approaches pick bands that span a broad range of all bands.
Figure 8 presents the labeled image and the classification result maps for the various band selection methods as well as for all bands, which may be used to visually analyze the classification performance of the various band selection methods on the IP dataset. Some visual differences can be observed; in particular, the proposed algorithm has the best performance on the corn-min class (orange, in the lower left corner) and better performance on the woods class (brown, in the lower right corner). Further quantitative analysis is conducted for a fair comparison.
For a quantitative comparison,
Table 3 displays the average classification results of 10 runs in terms of OA, AA, and Kappa, using various band selection methods (UBS, E-FDPC, ISSC, ASPS_MN, DSC, and the proposed MLRLFMESC) and all bands, where the best classification results are shown in bold and the second-best results are underlined.
It can be easily seen from
Table 3 that the proposed method outperforms all the other band selection methods in terms of OA, AA, and Kappa (second only to using all bands), as well as in classification accuracy for most classes. The proposed MLRLFMESC approach provides the highest classification accuracy in classes 8, 12, and 15 compared with the existing approaches.
3.6. Time Consumption of Different BS Algorithms
In this section, a comparison of computing times across different band selection algorithms is conducted to discern their respective computational complexity.
Table 4 lists the average computation time over ten runs of the various BS approaches on the IP dataset, where the ISSC has the lowest running time (in bold). Since MLRLFMESC and DSC use deep neural networks, their running times are longer than those of the traditional BS methods. However, the runtime of MLRLFMESC is significantly less than that of DSC, efficiently obtaining the desired bands within an acceptable timeframe while achieving superior classification performance.
4. Conclusions and Discussion
This paper presents a novel MLRLFMESC framework for unsupervised hyperspectral band selection. Self-representation subspace clustering is applied within a deep SCAE to enable the learning of the nonlinear spectral–spatial relationships of hyperspectral data in a trainable deep network.
- (1)
From the results in
Section 3, it can be seen that the proposed MLSL model retains good band subsets with multi-level spectral–spatial information and multi-level discriminative information representations.
- (2)
A fusion module is employed to fuse the multi-level discriminative information representations, where the MER method is applied to enhance the connectivity of the bands in each subspace while ensuring a uniform and dense distribution of bands in the same subspace, which was shown to be effective in the ablation study.
- (3)
Comparable experiments indicate that the proposed MLRLFMESC approach performs better than the other five state-of-the-art BS methods on three real HSI datasets for classification performance.
Although this work, like other existing BS methods, achieves excellent performance in classification tasks, there has been limited research on integrating BS algorithms into tasks such as hyperspectral image target detection, target tracking, unmixing, etc. An important future research direction is to apply band selection algorithms to different task requirements while maintaining high performance. Furthermore, this paper does not address the problem of unevenly distributed samples in classification-oriented tasks, which results in better performance on large categories and poorer performance on small categories. Therefore, the augmentation of small-category samples and the design or improvement of targeted models are directions of focus for further research.