1. Introduction
The increase in data volume, velocity, and diversity has lately given rise to the term “Big Data”, which denotes the multifaceted challenges faced by many scientific and applied domains. In the context of remote sensing, current data acquisition sources for Earth observation generate vast amounts of data, typically images acquired at various scales (high/low) and resolutions (i.e., spatial, spectral/temporal) [
1]. Over the past decade, machine learning and deep learning methodologies have gained wide recognition for hyperspectral data analysis in remote sensing applications [
2,
3]. Deep-learning-based feature extraction and classification methodologies in hyperspectral remote sensing applications using convolutional neural networks (CNNs) [
4], recurrent neural networks (RNNs) [
5], and their variations foster the automation of processes due to their potential for progressively learning the attributes and information present in the high-dimensional hyperspectral data [
6]. Needless to say, the more complex a deep-learning-based classification/object detection framework is, the higher its computational overhead is expected to be. Such high computational cost is undesirable as we gravitate towards more automated, real-time hyperspectral data analysis applications [
7].
The Earth’s land cover is a dynamic canvas on which human beings and natural systems continually interact. Land use/land cover (LULC) classification and its dynamics, which partially result from land surface processes, have considerable effects on biotic diversity, soil degradation, terrestrial ecosystems, and the ability of biological systems to support human needs [
8]. Thus, land cover classification, and its dynamics with remote sensing data, is an important field in environmental change research at different scales. The efficient assessment and monitoring of land cover changes are indispensable to advance our understanding of the mechanisms of change and model the effects of these changes on the environment and associated ecosystems at different scales [
9].
Remote sensing techniques represent some of the most effective tools to obtain information on LULC classification and dynamics (i.e., temporal–spatial changes and the transformation of landscapes). Many methods can detect land cover changes based on optical and radar imagery with different spatial and spectral resolutions. Existing techniques for accomplishing land cover classification can be broadly grouped into three general types, namely supervised classification algorithms, unsupervised classification algorithms, and a mixture of supervised and unsupervised classification techniques [
8]. A large amount of high-dimensional, high-spatial–spectral resolution hyperspectral remote sensing data is becoming available due to the fast development of satellite and sensor technology, and the above-mentioned supervised and unsupervised classification methods could swiftly obtain cardinal information from the remote sensing data, thus playing an important role in hyperspectral imagery applications [
10]. That being said, classification frameworks that apply machine learning algorithms such as neural networks to high spatial–spectral resolution hyperspectral remote sensing data have, over time, made a great impact in the field of remote sensing, and our work relates directly to this line of research.
Conventional machine-learning-based hyperspectral imagery classification and object detection frameworks are heavily inclined towards operating on spectral information as features [
11]. Most of the spectral-information-reliant frameworks suggested in the literature include some form of similarity- or dissimilarity-distance-measure-based band grouping [
12], and traditional supervised classification paradigms using k-nearest neighbors [
13], maximum likelihood criterion [
14], logistic regression [
15], random forest classification [
16], bagging and boosting techniques like AdaBoost [
17], etc., have proved to be effective in classifying HSI data. However, most of these spectral-information-based methodologies lack the ability to effectively capture and exploit the spectral variability and information that is readily available in high-dimensional hyperspectral data. The problem of extracting information from and processing such high-dimensional data is not new; however, it has gained importance lately due to the surge in data volume (big data) and the growth of acquisition methodologies. Big data attributes directly imply high dimensionality and data redundancy, which in turn exacerbate the curse of dimensionality [
18].
As a consequence, dimensionality reduction (DR) plays a prominent role in hyperspectral data analysis [
19]. In general, DR techniques help combat the intensive data learning overhead by projecting high-dimensional data from their original feature space to a lower dimensional subspace while preserving the vital information present in the data. DR also brings down the computational requirements by a considerable factor. In the literature, various DR techniques such as principal component analysis (PCA) [
20], linear discriminant analysis (LDA) [
21], random projections (RPs) [
4,
6,
22], etc., have gained increased attention due to their demonstrated computational efficacy and ability to preserve the vital information present in hyperspectral data. In addition, recent literature has shown that integrating additional information, such as spatial or contextual cues, with the available spectral information, alongside an efficient DR technique, can improve the efficacy of hyperspectral data analysis [
23,
24].
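To make the projection concrete, the following is a minimal sketch (not the implementation used in this work) of PCA-based dimensionality reduction applied to a hyperspectral cube using scikit-learn; the cube contents and the component count are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_hsi_dims(cube, n_components):
    """Project an HSI cube (H x W x B) onto its top principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b)            # each pixel is a B-dimensional spectrum
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(pixels)     # shape: (H*W, n_components)
    return reduced.reshape(h, w, n_components), pca

# Illustrative cube with Indian-Pines-like dimensions: 145 x 145 pixels, 200 bands
cube = np.random.rand(145, 145, 200)
reduced, pca = reduce_hsi_dims(cube, n_components=100)
```

Each pixel spectrum is treated as one sample, so the spatial arrangement is untouched while the band dimension shrinks from 200 to 100.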
In hyperspectral imaging, the relationship between the acquired spectral information and the underlying land cover material is inherently nonlinear. To address this issue, deep learning and machine learning algorithms have, in the past few years, been widely adopted as fundamental feature extraction tools for effectively modeling data with nonlinear intrinsic relationships. As a result, such deep learning techniques have shown promising results in the realm of hyperspectral data learning and representation for classification [
2], object recognition [
25], and other remote sensing applications. However, one of the major shortcomings of these techniques is how the data are presented to the deep learning framework for generalization. Generally, the information extracted from each spectral band is assigned equal emphasis or importance, without any consideration of the significance of the spectral information/features for the final data analysis outcome [
26,
27]. Moreover, this indiscriminate assignment of equal importance to all the spectral bands can lead to the inclusion of inherent noise or redundant spectral information, which is not only detrimental to hyperspectral data analysis but also degrades the generalization capability of the underlying deep-learning-based HSI analysis framework, thereby severely inhibiting the automation capabilities and efficacy of the methodology [
7,
23,
28].
Therefore, this work leverages the benefits of DR techniques in conjunction with the spatial–spectral representation ability of deep learning techniques to formulate an adaptive spectral attention framework for hyperspectral data analysis. In this framework, the input high-dimensional hyperspectral cube is first reduced to a lower dimensional subspace using principal component analysis (PCA). PCA is an unsupervised linear feature extraction method that uses an orthogonal transformation to explore the correlation between HSI spectral bands and extract their intrinsic properties. To function efficiently, it relies on the notion that contiguous bands of HSI data are highly correlated and typically convey the same information about the ground objects [
29]. Following the PCA-based DR, the reduced-dimensional data are input to the proposed 3D-convolution [
30] and bidirectional LSTM [
31] based spectral attention and classification mechanism. The proposed spectral attention model delivers enhanced hyperspectral data learning, which prioritizes spectral information that is significant for hyperspectral data analysis and suppresses the redundant spectral bands. In addition, an FNN-based supervised classification [
32] is incorporated to analyze the performance of this automated hyperspectral data analysis model.
Therefore, the novel contributions of the proposed work are summarized as follows:
A lightweight spectral feature extraction methodology for hyperspectral data analysis is proposed using 3D-convolutions in conjunction with an effective PCA-based dimensionality reduction technique.
The acquired spectral features, which now provide a better representation of the sequential spectral information in a lower dimensional subspace, are fed into a bidirectional LSTM-based attention framework, followed by an FNN-based supervised classification.
Hence, the proposed spectral-attention-driven classification framework is driven towards improved automated hyperspectral data analysis, while also addressing big data challenges such as high computational and memory overhead.
This work also presents variations of the proposed deep-learning-based feature extraction and classification frameworks, covering spectral-only, spatial-only, and spectral–spatial information extraction models, and conducts a comprehensive performance study of these spatial–spectral-information-based hyperspectral data analysis frameworks.
The rest of the paper is organized as follows: the proposed spectral attention-based classification methodology is discussed in
Section 2 followed by several deep-learning-based classification techniques used for comparison briefed in
Section 3. In
Section 4, we experimentally demonstrate and validate the efficacy of the proposed spectral attention model, which offers enhanced hyperspectral data analysis through the automated extraction of significant spectral information and the suppression of redundant spectral bands. Finally, we summarize the effectiveness of our proposed automated hyperspectral data analysis model in
Section 5.
4. Experimental Results
In this section, the datasets used for experimentation are briefly discussed, alongside a detailed report on the experimental setup used for all the experiments conducted in this research work. Additionally, the efficiency of the proposed spectral attention and classification architecture BI-DI-SPEC-ATTN is validated and compared against four other models, namely PCA-3D-CNN, SPEC-3D-CNN, SPAT-2D-CNN, and SVM-CK, as described in
Section 3.
4.1. Datasets
All experiments were conducted on two airborne visible/infrared imaging spectrometer (AVIRIS) datasets—Salinas and Indian Pines—and a reflective optics system imaging spectrometer (ROSIS) dataset—University of Pavia [
34]. The Salinas dataset is composed of 224 spectral bands, of which 20 water absorption bands were discarded, and has a spatial dimension of 512 × 217 pixels. It comprises 16 classes related to vegetables, vineyard fields, and bare soils. The Indian Pines dataset was acquired by an AVIRIS sensor over the Indian Pines test site in northwestern Indiana; it has a spatial dimension of 145 × 145 pixels and 224 spectral bands (200 after removal of the water absorption bands) with a spatial resolution of 20 m, spanning 16 land cover classes. The Pavia University dataset has 103 spectral bands, each with a spatial dimension of 610 × 340 pixels, and a spatial resolution of 1.3 m, spanning nine land cover classes. For each dataset, the training set was randomly chosen, with training fractions spanning from 5% through 50%.
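The per-class random sampling of training pixels described above can be sketched as follows; this is a generic stratified-sampling illustration with a synthetic label array, not the code used in this work:

```python
import numpy as np

def stratified_split(labels, train_fraction, seed=0):
    """Randomly pick `train_fraction` of the labeled pixels from each class."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_fraction * idx.size)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Illustrative: 1000 labeled pixels over 16 classes, 10% used for training
labels = np.random.randint(0, 16, size=1000)
tr, te = stratified_split(labels, train_fraction=0.10)
```

Sampling each class in proportion keeps the class balance of the training set close to that of the full ground truth at every train/test ratio.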
4.2. Parameter Tuning and Experimental Setup
For our proposed methodology to function optimally, several parameters need to be adjusted: the size of the reduced dimensional space after DR (D), the learning rate, the optimizer, etc. The reduced dimension D for the PCA computation was empirically found to be 100 for the Salinas and Indian Pines datasets and 50 for the Pavia University dataset. The length of the LSTM input vector K was empirically set to 256 for the Salinas and Indian Pines datasets and 128 for the Pavia University dataset. All parameters, both in the proposed methodology and in the frameworks used for comparison, were carefully tuned to their experimentally optimal values to produce the best classification results on all three datasets.
The objective function used in all our experimentation was the categorical cross-entropy with a learning rate of and a decay of . The choice to pick categorical cross-entropy as the objective function was straightforward, as the nature of the problem we address in this work is multiclass classification. However, this was not the case when choosing an optimal learning rate during experimentation. Numerous values of learning rates, such as , , , , , and were investigated. Upon rigorous experimentation, it was determined that a learning rate of with a decay of produced optimal results on all three datasets, and fluctuated the least when the results were averaged over three trials. Additionally, choosing a suitable batch size can effectively improve the memory utilization while training the classification model and improve the convergence accuracy of the architecture. We experimented by setting the batch size to multiple values, namely, 16, 32, 64, and 128, with a batch size of 32 producing the optimal results on all three datasets.
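For reference, categorical cross-entropy over one-hot targets, as used here for multiclass classification, can be written as a short NumPy function (a generic sketch, not tied to this work's implementation):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)      # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Two samples, three classes: the loss only depends on the probability
# assigned to each sample's true class.
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)  # -(ln 0.7 + ln 0.8) / 2
```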
All experiments used the Adam optimizer, as it produced the best results on all the datasets discussed in this work. In a plain gradient descent optimizer, the weights are adjusted based on the gradient computed at the current step alone; with Adam, the weights are adjusted based on moving averages (moments) of the gradients computed at the current and previous update steps, which are then used to update the weights. Gradient descent, RMSprop, and Adam, all well known in the literature, were compared during experimentation, and the Adam optimizer produced the best classification results on the Salinas, Indian Pines, and Pavia University datasets.
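The moment-based update described above can be sketched as follows; the default hyperparameters shown (beta1 = 0.9, beta2 = 0.999) are the commonly used Adam defaults, not values tuned in this work:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Three illustrative steps minimizing ||w||^2 (gradient is 2w)
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 4):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Because the step size is normalized by the square root of the second moment, the effective per-weight step is roughly the learning rate regardless of gradient scale, which is part of why Adam tends to converge reliably across datasets.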
To avoid any bias induced by the random sampling of pixels, the classification results were averaged over three trials, and the average accuracies along with the execution times of the models are presented. All experiments were implemented in Python on an Intel(R) Core(TM) i7-7700HQ processor with 16 GB of RAM, and no GPU training was involved. For training on all three datasets, samples were picked randomly from each class label in equal proportion, and experimental results across different train/test ratios spanning from 5% through 50% were documented.
4.3. Discussion
Table 1,
Table 2 and
Table 3 denote the specific number of training and testing samples used for experimentation with 10% of training data across all three datasets discussed in this paper.
Figure 2,
Figure 3 and
Figure 4 illustrate the classification maps for
10% of training data for the proposed bidirectional LSTM-based spectral attention and classification methodology BI-DI-SPEC-ATTN for the Salinas, Indian Pines, and Pavia University datasets, along with the frameworks used for comparison. It can be further inferred from
Table 4,
Table 5 and
Table 6 that BI-DI-SPEC-ATTN gave superior classification performance over other frameworks that are discussed for both the Indian Pines and Pavia University datasets.
Table 7 shows the overall execution time of all the models in this work for 10% of training data. It can be clearly noted from
Figure 2,
Figure 3 and
Figure 4 that our proposed BI-DI-SPEC-ATTN methodology yields more coherent classification regions and fewer misclassifications with competitive computational efficiency, offering a reasonable trade-off between computational time and classification performance compared to the other spatial-only, spectral-only, and spatial–spectral-information-based feature extraction models.
Our proposed framework produces the best classification results with an overall accuracy of 97.78%, 94.07%, and 97.80% on the Salinas, Indian Pines, and Pavia University datasets, respectively, for just 10% of training samples selected, which can be reaffirmed from the figures and tables documented in
Section 4. Even though the other classification methodologies discussed in this work are efficient, with many of them being state-of-the-art techniques, they lack the ability to capture the distinctive features and information between different classes across the three datasets, and thus produce less effective classification results than the proposed BI-DI-SPEC-ATTN methodology. While the state-of-the-art composite-kernel SVM-based classification technique (SVM-CK) produced good classification results on the Salinas and Pavia University datasets, thanks to its ability to incorporate both the spatial and spectral features present in the hyperspectral data through a 3 × 3 window-based average spatial kernel coupled with an RBF spectral kernel, it underperforms on the Indian Pines dataset, producing larger misclassification regions than all the other methodologies. Additionally, the 3D-CNN-based classification methodologies discussed in our work, namely PCA-3D-CNN and SPEC-3D-CNN, produced superior classification results overall against their counterparts, the 2D-CNN-architecture-based classification techniques SPAT-2D-CNN and SVM-CK, owing to their ability to effectively incorporate both the spatial and spectral features that are critical for effective classification of hyperspectral data. Finally, the proposed bidirectional LSTM-based attention and classification framework outperformed all the methodologies discussed in this work, demonstrating the importance and feasibility of constructing relationships between features and weighting them with the aid of an effective attention methodology, followed by an FNN-based network for classifying the constructed features, whose results bolstered the efficacy of the proposed technique.
The efficacy of our proposed methodology BI-DI-SPEC-ATTN can be further affirmed from the overall classification accuracy plots as depicted in
Figure 5,
Figure 6 and
Figure 7 for the Salinas, Indian Pines, and Pavia University datasets, respectively. Our proposed approach BI-DI-SPEC-ATTN significantly outperformed all the other methods compared, especially the conventional principal-component-analysis-based spectral feature analysis model (PCA-3D-CNN), the 2D-convolutional-neural-network-based hyperspectral data classification model (SPAT-2D-CNN), and the conventionally used SVM-based spatial–spectral information inclusion model (SVM-CK). BI-DI-SPEC-ATTN presents a pragmatic and efficient attention-based classification framework that automates the feature selection process through varied levels of importance/weighting assigned to the spectral bands of a dataset, based on the quality of their information. Thus, the BI-DI-SPEC-ATTN model provides superior classification performance not only with just
10% of training samples but also at various training-testing ratios, as demonstrated in
Figure 5,
Figure 6 and
Figure 7. Therefore, our BI-DI-SPEC-ATTN model serves as an effective framework for automated decision making with excellent classification performance for hyperspectral data analysis applications.
Given the wide range of experiments and analyses conducted, it is worth highlighting the importance of PCA both as a dimensionality reduction technique and as a principal feature extraction component in our work. It not only reduced the computational complexity of our spectral attention and classification methodology (BI-DI-SPEC-ATTN), but also acted as an efficient, lightweight spectral feature extraction technique and a noise reduction component.
The importance of a technique such as PCA for DR, for information retrieval, and as a linear orthogonal transformation that maps the data to a new coordinate system has long been established in the HSI literature. PCA is not explicitly designed for noise removal; rather, it reduces the dimensionality of the feature space over which the underlying deep learning regression/classification model operates. PCA can be thought of as a tuning knob for smoothly deciding how much information to retain, which is impossible when working directly with the original features: since the original features have no inherent order of priority or usability, we cannot directly decide which to retain and which to eliminate, and this is where PCA comes into play.
As a result, eliminating some of the PCs with lower variances, i.e., lower eigenvalues, usually helps the model generalize better. PCs with higher eigenvalues capture the principal information about the dataset, whereas adding more and more PCs mainly appends low-variance, largely redundant information to the reduced-dimensional data space. Thus, removing some PCs with lower eigenvalues acts as a regularization that minimizes the redundancy of the information present in the data. Hence, in this work, we aimed to alleviate the inherent process noise and data redundancy present in the hyperspectral data using PCA, so as to enhance the data learning outcomes of the deep learning methods [
4,
6].
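The variance-based truncation argument can be illustrated with scikit-learn's explained variance ratios on synthetic correlated spectra; the pixel counts, band counts, and latent-factor structure below are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic spectra: 2000 pixels x 200 bands driven by 10 underlying factors
# plus a small amount of noise, mimicking highly correlated HSI bands.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 10))
mixing = rng.normal(size=(10, 200))
spectra = latent @ mixing + 0.05 * rng.normal(size=(2000, 200))

pca = PCA().fit(spectra)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.99)) + 1  # PCs covering 99% of variance
```

Because the bands are driven by only a few latent factors, a small number of leading PCs already accounts for 99% of the variance; the remaining PCs carry mostly noise, which is what discarding them regularizes away.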
Alongside PCA, our proposed attention and classification framework BI-DI-SPEC-ATTN incorporates a bidirectional LSTM-based feature importance weighting/attention module, which operates by selectively emphasizing feature values and correlating the output sequence of feature vectors of the high-dimensional hyperspectral data with the results of this selective learning.
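The general idea behind such softmax-based band weighting can be sketched in NumPy; the scoring vector, dimensions, and random features below are hypothetical and do not reproduce the exact BI-DI-SPEC-ATTN architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # shift for numerical stability
    return e / e.sum()

# Hypothetical per-band feature vectors from a (bidirectional) sequence encoder:
# K band positions, each summarized by a d-dimensional hidden state.
rng = np.random.default_rng(1)
K, d = 8, 4
hidden = rng.normal(size=(K, d))
score_w = rng.normal(size=d)       # illustrative learned scoring vector

scores = hidden @ score_w          # one relevance score per band position
weights = softmax(scores)          # attention weights, nonnegative, sum to 1
context = weights @ hidden         # weighted sum emphasizing informative bands
```

Informative bands receive large weights and dominate the context vector passed to the classifier, while redundant or noisy bands are suppressed toward zero weight.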
Additionally,
Table 7 denotes the overall execution time (including training, validation, and testing) for all the methodologies discussed in this work on the Salinas, Indian Pines, and Pavia University datasets with 10% of training data. Overall, the experimental results presented in this paper demonstrate that the proposed bidirectional, 3D-CNN-oriented spectral-attention-based classification architecture (BI-DI-SPEC-ATTN) requires only a small number of training samples for effective classification, while providing robust performance on all the datasets used in the experimentation phase.