Article

Multi-Scale Spatial–Spectral Residual Attention Network for Hyperspectral Image Classification

College of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(2), 262; https://doi.org/10.3390/electronics13020262
Submission received: 2 December 2023 / Revised: 20 December 2023 / Accepted: 27 December 2023 / Published: 5 January 2024

Abstract

Hyperspectral images (HSIs) encapsulate a vast amount of information due to their expansive size and high number of channel dimensions. However, this information is often underutilized because of ineffective feature extraction, particularly in regions with few samples and predominant edges. To fully leverage the spatial–spectral features of HSIs, a dual-branch multi-scale spatial–spectral residual attention network (MSRAN) that integrates multi-scale feature extraction with residual attention mechanisms is proposed. MSRAN independently extracts spatial and spectral features through dual branches, minimizing the interference between these features and enhancing the focus on feature extraction in different dimensions. Specifically, in the spectral feature extraction branch, diverse-scale 3D convolution kernels capture extended spectral sequence characteristics and neighborhood spectral features. The convolution fusion emphasizes the weight of the central pixel to be classified, followed by the use of spectral residual attention mechanisms to extract enhanced central-pixel spectral features. In the spatial feature extraction branch, multi-level receptive fields are utilized to extract various fine-grained spatial contours, edges, and local detailed features, which are further processed through spatial residual attention to effectively extract spatial composite features. Finally, the convolution fusion module adaptively integrates the center-enhanced spectral features with multi-level fine-grained spatial features for classification. Extensive comparative experiments and ablation studies demonstrate that MSRAN achieves highly competitive results on two classic datasets from Pavia University and Salinas as well as on a novel dataset of WHU-Hi-LongKou.

1. Introduction

Remote sensing images with high resolution and rich spectral information play a pivotal role in diverse fields, such as plant pest and disease detection [1,2,3], mineral exploration [4,5], crop yield estimation [6,7], and environmental monitoring [8,9]. Fang et al. [10] estimated the soil moisture using satellite remote sensing data by integrating Bayesian deep image prior (BDIP) downscaling with a deep fully convolutional neural network. Pang et al. [11] developed an advanced platform combining LiDAR, CCD cameras, and hyperspectral sensors for detailed forest monitoring and analysis. Guo et al. [12] utilized vegetation indices (VIs) and texture features (TFs) from unmanned aerial vehicle (UAV) hyperspectral images to create a disease monitoring model based on partial least squares regression (PLSR) for detecting wheat yellow rust. Hyperspectral images (HSIs) [13], composed of hundreds of spectral bands, offer finer spectral divisions than other images, which provides powerful material discrimination capabilities. As a critical technology in these applications, HSI classification employs traditional methods like the support vector machine (SVM) [14], decision tree (DT) [15], and maximum likelihood classifier (MLC) [16]. However, these conventional methods primarily extract shallow features and often overlook deeper characteristics, and the manual design of feature selectors significantly impacts the classification accuracy.
In contrast to traditional HSI classification techniques, deep learning methods have the advantages of automatic feature learning and superior classification capabilities. They have achieved significant progress in various domains. Consequently, an increasing number of scholars are exploring HSI classification through deep learning approaches. Boulch et al. [17] proposed an autoencoder (AE)-based network to perform HSI classification by combining multi-layer AE and pooling layers. Chen et al. [18] proposed a Deep Belief Network (DBN)-based method to extract both shallow and deep features from HSI images for classification. The advent of Convolutional Neural Networks (CNNs) significantly enhanced the image classification accuracy. Given the abundance of sequential information in the spectral channels of HSIs, Mou et al. [19] applied recurrent neural networks to HSI classification for the first time and achieved commendable results. Hu et al. [20] effectively extracted spectral features from HSIs using a stack of five 1D-CNNs. Sharma et al. [21] utilized 2D-CNNs to extract contextual spatial feature information from dimensionality-reduced HSIs, improving the classification performance with limited HSI training samples. However, utilizing the information from spectral or spatial features alone cannot fully take advantage of the rich features of HSIs. The fusion of spectral and spatial features is a beneficial complement to HSI classification. Yang et al. [22] combined the features of 1D-CNN and 2D-CNN frameworks to extract spatial and spectral features, respectively, and then fused these features through a fully connected layer for classification. Joint spatial–spectral features can significantly improve the HSI classification accuracy. Nevertheless, dual-branch networks that extract features from spectral and spatial dimensions separately can lead to the loss of some original information, resulting in inadequate spatial–spectral joint features. The introduction of three-dimensional Convolutional Neural Networks has effectively mitigated this issue. Li et al. [23] proposed an end-to-end five-layer 3D-CNN network. Chen et al. [24] combined the 3D-CNN with regularization techniques to effectively extract original three-dimensional features and reduce overfitting in neural networks. Zhong et al. [25] introduced an end-to-end Spectral–Spatial Residual Network for HSI classification, which takes the original three-dimensional data cube as the input and extracts both spectral and spatial features using the 3D-CNN. Qi et al. [26] combined multi-scale and residual concepts to extract deeper features from different receptive fields and classify HSIs. Li et al. [27] integrated attention mechanisms into the Dual-Branch Multi-Attention Network to refine hyperspectral features and achieve a better classification performance. Meng et al. [28] proposed a new deep residual involution network (DRIN) for HSI classification. By using an enlarged involution kernel, the long-distance spatial interactions can be well-modeled.
Although CNNs have shown good performance in the field of HSI classification, there are still challenges to be addressed. For example, labeled HSIs require substantial human and material resources, so maximizing the utilization of existing data when HSI samples are limited is crucial. Furthermore, distinguishing the varying contributions of deep features to each node of the neural network is a challenging task. Lastly, the extraction of joint spatial–spectral features can lead to interference, and determining the correct combination of extracted features in multi-branch networks poses a challenge.
To mitigate these issues, a novel model combining 3D-CNN, 2D-CNN, multi-scale mechanisms [29,30], and residual attention mechanisms [31,32], namely MSRAN, is proposed. This method independently extracts spatial and spectral features through dual branches to minimize the interference between features of different dimensions. It then uses multi-scale mechanisms to enrich the feature diversity and employs spatial residual attention and spectral residual attention to identify effective features, thereby improving feature utilization. Finally, convolutional adaptive fusion integrates spatial texture features and spectral sequence features for classification. The contributions of this paper can be summarized as follows.
  • We present a Dual-Branch Multi-Scale Residual Spatial–Spectral Attention Network to classify the hyperspectral remote sensing images. This network independently extracts spatial and spectral features, minimizes the interference between these two types of information, and enables the model to focus on multi-dimensional features.
  • We extract the sequential and neighboring spectral features using different-sized convolution kernels. Meanwhile, multi-scale 2D convolution is employed to capture spatial features by superimposing various receptive fields. This approach improves the ability to classify the multi-boundary samples by highlighting the central pixel weight and capturing the ground contours and local details.
  • The proposed MSRAN method employs dual residual spectral and spatial attention mechanisms to identify the important features for hyperspectral image classification, which eliminates the disruptive features and enhances the utilization of spatial and spectral features.
  • A comprehensive assessment of the proposed MSRAN method is conducted through extensive ablation studies and comparative experiments on three datasets from Pavia University (PU) [33], Salinas Valley (SV) [33], and WHU-Hi-LongKou (WHLK) [34]. The results demonstrate a great improvement in the classification accuracy over advanced methods.
The subsequent parts of this paper are organized as follows: Section 2 discusses the proposed MSRAN method. In Section 3, ablation studies and comparative experiments are conducted to verify the effectiveness and competitiveness of the proposed model. Section 4 summarizes our work.

2. Proposed MSRAN for HSI Classification

2.1. Structure of the Multi-Scale Spatial–Spectral Residual Attention Network

In HSI classification methods, spatial and spectral features are directly extracted and fused to improve the accuracy, potentially leading to interactions between the two branches [35]. Although some improved approaches extract spectral and spatial features in parallel, further optimization of these features is necessary. To enhance the HSI classification performance, this paper extracts joint spatial–spectral, spatial texture, and spectral sequence features from HSIs at different receptive fields using the multi-scale 3D-CNN and 2D-CNN, followed by residual attention optimization. The structure of the proposed MSRAN method is illustrated in Figure 1. Initially, the original HSI is cropped into overlapping blocks to reduce the input data volume. These cropped HSI blocks are then fed into two branches for spectral and spatial feature extraction, respectively, followed by feature fusion classification. Specifically, in the spectral branch, various-scale 3D convolutions are utilized to extract joint spatial–spectral and spectral sequence features. To enhance the spectral distinguishability of different materials, an improved residual spectral attention block reallocates the weights of features, thus augmenting the salient spectral characteristics while filtering out noise and less relevant features. In the spatial feature extraction branch, a combination of multi-scale 2D-CNNs extracts spatial texture features from different receptive fields. For instance, small convolution kernels focus on spatial texture features, while larger ones are more adept at capturing object contour characteristics. However, due to parameter issues associated with large convolution kernels, we expand the receptive field by stacking sequential 2D-CNNs, thus attaining more spatial context features while reducing the number of parameters. Subsequently, to improve the spatial distinguishability of different materials, residual spatial attention reallocates the contributions of different depth features to the neural network nodes, enhancing significant spatial characteristics and diminishing the weights of irrelevant spatial features to the target task. Finally, the convolution fusion module combines the spectral and spatial optimized features through the attention mechanism and uses the Softmax function to classify the fused spatial spectral features.
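To make the patch-cropping step concrete, the following is a minimal sketch (not the authors' code) of how an HSI cube can be cropped into overlapping blocks centered on each pixel; the reflect padding and the default 9 × 9 window (the patch size reported in Section 3.2) are assumptions.

```python
import numpy as np

def extract_patches(hsi, window=9):
    """Crop an HSI cube of shape (C, H, W) into overlapping window x window
    blocks, one centered on every pixel, to be fed to the two branches."""
    c, h, w = hsi.shape
    m = window // 2
    # Reflect-pad the spatial dimensions so border pixels also get full patches.
    padded = np.pad(hsi, ((0, 0), (m, m), (m, m)), mode="reflect")
    patches = np.empty((h * w, c, window, window), dtype=hsi.dtype)
    for i in range(h):
        for j in range(w):
            patches[i * w + j] = padded[:, i:i + window, j:j + window]
    return patches  # shape: (H * W, C, window, window)

# Example on a small toy cube (103 bands, 64 x 64 pixels):
# patches = extract_patches(np.random.rand(103, 64, 64).astype(np.float32))
```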

2.2. Spectral Feature Extraction Branch

In the spectral feature extraction branch, the multi-scale 3D-CNN is used to design the spectral feature extraction module, as shown in Figure 2. The joint spatial–spectral cubic features are extracted by using a three-dimensional convolution with a kernel of $3 \times 3 \times 3$, and the spectral sequence features are extracted by using a three-dimensional convolution with a kernel of $5 \times 1 \times 1$. Subsequently, the spatial–spectral features and spectral features are summed and then fused through convolution. Specifically, the input HSI is denoted as $x \in \mathbb{R}^{c \times w \times h}$, where $c$ represents the spectral dimension, and $w$ and $h$ respectively indicate the width and height of the input HSI. The spectral features extracted by convolution at different scales are activated by ReLU and summed to obtain the output feature $y$. The calculation formulas are shown in Equations (1)–(3):

$x_i = \mathrm{ReLU}\left(\mathrm{Conv}_{3 \times 3 \times 3}(x)\right), \quad i = 1$  (1)

$x_i = \mathrm{ReLU}\left(\mathrm{Conv}_{5 \times 1 \times 1}(x)\right), \quad i = 2$  (2)

$y = x_1 + x_2$  (3)

In the spectral feature extraction module, $x_i \in \mathbb{R}^{c \times w \times h}$ ($i = 1, 2$) represents the ReLU-activated convolutional features at different spectral scales, where $i$ denotes the serial number of the convolution. The convolution operations $\mathrm{Conv}_{3 \times 3 \times 3}(\cdot)$ and $\mathrm{Conv}_{5 \times 1 \times 1}(\cdot)$ are similar. Taking the latter as an example, it is a 3D convolution with a kernel size of $5 \times 1 \times 1$, where the spectral dimension is 5 and the spatial dimensions are $1 \times 1$.
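The following PyTorch sketch illustrates Equations (1)–(3); the channel counts and the "same" padding that keeps the two branches summable are assumptions, since they are not specified in the text.

```python
import torch
import torch.nn as nn

class SpectralMultiScale(nn.Module):
    """Multi-scale 3D convolutions of Eqs. (1)-(3): a 3x3x3 joint
    spatial-spectral branch plus a 5x1x1 spectral-sequence branch, summed."""
    def __init__(self, in_ch=1, out_ch=8):
        super().__init__()
        self.conv_3x3x3 = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv_5x1x1 = nn.Conv3d(in_ch, out_ch, kernel_size=(5, 1, 1), padding=(2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (N, in_ch, B, H, W) HSI patch cube
        x1 = self.relu(self.conv_3x3x3(x))      # Eq. (1): joint spatial-spectral features
        x2 = self.relu(self.conv_5x1x1(x))      # Eq. (2): spectral-sequence features
        return x1 + x2                          # Eq. (3)
```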
To refine the spectral features and redistribute the feature contributions to nodes, we introduce the spectral residual attention module built on the residual structure, as shown in Figure 3. This module applies maximum pooling and average pooling to the feature matrix to respectively capture decisive features within the target neighborhood and smooth shared neighborhood features. Subsequently, these features are processed through a parameter-shared Multi-Layer Perceptron (MLP), thereby increasing the weights that are favorable for target feature classification. The module then utilizes a dual residual structure to mitigate network degradation issues and expedite network updates. Specifically, the extracted spectral features $y$ undergo convolution with a $B \times 1 \times 1$ kernel, mapping the information from the spectral dimension to the channel dimension and yielding the output spectral feature $\hat{y}$. This feature is then fed into the spectral channel attention module for further processing. The spectral feature $\hat{y} \in \mathbb{R}^{c \times w \times h}$ undergoes maximum pooling and average pooling, followed by a parameter-shared MLP to obtain preassigned weights. The MLP consists of two layers. The first layer produces a feature map of size $\frac{c}{r} \times 1 \times 1$ (where $r$ is the feature compression ratio), while the second layer yields a feature map with the dimensions $c \times 1 \times 1$. The two outputs of the second MLP layer are added together, and an activation function then generates the redistributed weights $M_c \in \mathbb{R}^{c \times 1 \times 1}$. These weights are multiplied with the spectral feature $\hat{y}$ to enhance the feature representation, and $\hat{y}$ is added back to this product to obtain the output $z$ of the spectral feature extraction branch. The formulas are shown in Equations (4)–(6):

$\hat{y} = \mathrm{ReLU}\left(\mathrm{Conv}_{B \times 1 \times 1}(y)\right)$  (4)

$M_C(\hat{y}) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(\hat{y})\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(\hat{y})\right)\right) = \sigma\left(W_1\left(W_0\left(\hat{y}^{C}_{avg}\right)\right) + W_1\left(W_0\left(\hat{y}^{C}_{max}\right)\right)\right)$  (5)

$z = M_c(\hat{y}) \times \hat{y} + \hat{y}$  (6)

where $\mathrm{Conv}_{B \times 1 \times 1}(\cdot)$ represents a 3D convolution with a kernel size of $B \times 1 \times 1$, $B$ is the spectral dimension size of the input $y$, and $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ represent the average pooling and maximum pooling operations, respectively. $\sigma$ is the Sigmoid activation function, and $M_c(\cdot)$ denotes the weights generated by the MLP. After the simplification in Equation (5), $W_0$ and $W_1$ are the weights of the first and second layers of the MLP, while $\hat{y}^{C}_{avg}$ and $\hat{y}^{C}_{max}$ represent the feature matrices after average pooling and maximum pooling.
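A minimal PyTorch sketch of Equations (4)–(6) is given below; the output channel count, the reduction ratio r, and the use of 1 × 1 convolutions to implement the shared MLP are assumptions. In the full branch, this module would take the output of the multi-scale module above (a tensor of shape (N, in_ch, B, H, W)).

```python
import torch
import torch.nn as nn

class SpectralResidualAttention(nn.Module):
    """Sketch of Eqs. (4)-(6): a B x 1 x 1 conv maps the spectral dimension to
    the channel dimension, then a shared-MLP channel attention (average and
    maximum pooling) reweights the features, followed by a residual add."""
    def __init__(self, in_ch, bands, out_ch=24, reduction=4):
        super().__init__()
        # Eq. (4): collapse the remaining spectral dimension of size `bands`.
        self.squeeze = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(bands, 1, 1)),
            nn.ReLU(inplace=True),
        )
        # Shared two-layer MLP of Eq. (5), implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(out_ch, out_ch // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, y):                              # y: (N, in_ch, B, H, W)
        y_hat = self.squeeze(y).squeeze(2)             # Eq. (4) -> (N, out_ch, H, W)
        avg = self.mlp(torch.mean(y_hat, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(y_hat, dim=(2, 3), keepdim=True))
        m_c = self.sigmoid(avg + mx)                   # Eq. (5): channel weights
        return m_c * y_hat + y_hat                     # Eq. (6): residual reweighting
```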

2.3. Spatial Feature Extraction Branch

The spatial dimension of the HSI is significantly smaller than the spectral dimension, which can easily lead to the Hughes phenomenon [36] in HSI classification. To minimize the disparity between the spectral and spatial dimensions, spectral dimension compression is performed while extracting features using the 2D-CNN. This not only reduces the loss of spectral information but also decreases the number of model parameters. As shown in Figure 4, in order to fully extract and utilize the spatial neighborhood features, this paper combines different receptive fields and a multi-level 2D-CNN to extract features in four branches and then applies spatial attention to screen the effective features and obtain the output of each branch. Specifically, the first branch uses a single 2DConv with a convolution kernel size of $3 \times 3$ to extract the spatial details and texture features under a small-scale receptive field. The second branch employs a 2DConv with a convolution kernel size of $5 \times 5$ to extract spatial object contour features under a larger receptive field. The third branch initially utilizes a 2DConv with a convolution kernel size of $5 \times 5$ for a large-scale receptive field, followed by a 2DConv with a convolution kernel size of $3 \times 3$ to further expand the receptive field and extract environmental features surrounding spatial objects. In branch four, average pooling is used to obtain smooth neighborhood features to assist with the fusion of the various levels of spatial features. Each branch is denoted $M_i$ ($i = 1, 2, 3, 4$), where $i$ represents the branch index. The calculation formula is shown in Equation (7):

$K_i = \begin{cases} \mathrm{ReLU}\left(\mathrm{Conv2d}_{3 \times 3}(\hat{x})\right), & i = 1 \\ M_s\left(\mathrm{ReLU}\left(\mathrm{Conv2d}_{5 \times 5}(\hat{x})\right)\right), & i = 2 \\ M_s\left(\mathrm{ReLU}\left(\mathrm{Conv2d}_{5 \times 5}\left(\mathrm{Conv2d}_{3 \times 3}(\hat{x})\right)\right)\right), & i = 3 \\ \mathrm{AvgPooling}(\hat{x}), & i = 4 \end{cases}$  (7)

where $\mathrm{Conv2d}_{3 \times 3}(\cdot)$ represents a 2D convolution with a kernel size of $3 \times 3$, $\mathrm{ReLU}(\cdot)$ is the ReLU activation function, $M_s(\cdot)$ denotes the spatial residual attention, $\mathrm{AvgPooling}(\cdot)$ signifies average pooling, and $\sigma$ is the Sigmoid activation function.

Based on the spatial features extracted in two dimensions, within the spatial residual attention illustrated in Figure 5, decisive features within the target neighborhood and smoothed neighborhood features are obtained through maximum pooling and average pooling. After these features are concatenated, convolution is used for feature discrimination, focusing on characteristics that are beneficial for target classification. Finally, a dual-path residual is employed to accelerate network training and prevent network degradation. Specifically, the spatial feature $\hat{x} \in \mathbb{R}^{8 \times w \times h}$ is compressed through maximum and average pooling along the spectral dimension. Then, the redistributed weights $M_s \in \mathbb{R}^{8 \times w \times h}$ are generated through convolution mapping and the Sigmoid activation function and are multiplied with the spatial feature $\hat{x}$. Finally, the result of this multiplication is added directly to $\hat{x}$, yielding the output $K$ of the spatial feature extraction module.
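The sketch below illustrates the four-branch structure of Equation (7) together with a spatial residual attention block in the spirit of Figure 5; the channel count of 8, the padding, and the 3 × 3 attention kernel and stride-1 average pooling are assumptions, and the attention placement follows Equation (7) (branches 2 and 3 only).

```python
import torch
import torch.nn as nn

class SpatialResidualAttention(nn.Module):
    """Spatial residual attention (Figure 5): channel-wise average and maximum
    pooling are concatenated, convolved and passed through a Sigmoid; the
    resulting weights rescale the input before a residual addition."""
    def __init__(self, channels=8, kernel_size=3):
        super().__init__()
        # The text gives M_s the same shape as the input feature, so the conv
        # maps the 2 pooled maps back to `channels` feature maps.
        self.conv = nn.Conv2d(2, channels, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                               # x: (N, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)        # smoothed neighborhood features
        mx, _ = torch.max(x, dim=1, keepdim=True)       # decisive neighborhood features
        m_s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return m_s * x + x                              # dual-path residual

class SpatialMultiScale(nn.Module):
    """Four-branch spatial feature extraction of Eq. (7)."""
    def __init__(self, channels=8):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True))
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.att = SpatialResidualAttention(channels)

    def forward(self, x_hat):                           # x_hat: (N, C, H, W)
        k1 = self.b1(x_hat)                             # i = 1: small receptive field
        k2 = self.att(self.b2(x_hat))                   # i = 2: larger receptive field
        k3 = self.att(self.b3(x_hat))                   # i = 3: stacked receptive fields
        k4 = self.pool(x_hat)                           # i = 4: smoothed neighborhood
        return [k1, k2, k3, k4]
```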

2.4. Feature Fusion and Classification Module

The spectral and spatial features optimized by the attention mechanisms are fused through a cascading (concatenation) operation, as shown in Figure 6. The fused features are then mapped through convolution to $N$ output dimensions, where $N$ is the number of ground-object classes in the training samples. Finally, the classification result is obtained through a Softmax operation, as calculated in Equation (8):

$S = \delta\left(\mathrm{Conv}_{1 \times 1}\left(\left[z, K_i\right]\right)\right), \quad i = 1, 2, 3, 4$  (8)

where $\delta(\cdot)$ represents the Softmax function, $[z, K_i]$ denotes the concatenation of the spectral feature $z$ from Equation (6) with the spatial features $K_i$, and $i$ is the index of the branches in the spatial feature extraction module. In the spatial–spectral feature fusion module, the spectral and spatial features are fused through the cascading operation. This is followed by a convolution operation to obtain spatial–spectral joint features with consistent dimensions, which are then further integrated through feature fusion.
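A minimal sketch of the fusion and classification step of Equation (8) follows; the global average pooling used to reduce the fused map to a class vector before the Softmax is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Sketch of Eq. (8): concatenate the attention-refined spectral feature z
    with the four spatial features K_i, map to N classes with a 1x1 conv,
    pool spatially and apply Softmax."""
    def __init__(self, spec_ch, spat_ch, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(spec_ch + 4 * spat_ch, num_classes, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)   # assumption: global pooling before Softmax

    def forward(self, z, ks):                 # z: (N, spec_ch, H, W); ks: list of 4 spatial maps
        fused = torch.cat([z] + ks, dim=1)    # cascading (concatenation) operation
        logits = self.pool(self.fuse(fused)).flatten(1)
        # Eq. (8); during training one would typically feed the logits to
        # CrossEntropyLoss instead of applying Softmax explicitly.
        return torch.softmax(logits, dim=1)
```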

3. Experimental Results and Analysis

To validate the effectiveness of the proposed MSRAN classification method, a series of ablation and comparative experiments were conducted using two classic datasets, the PU dataset and the SV dataset, along with a novel dataset, the WHLK dataset. The contributions of different modules to the network and the classification performance of various algorithms were assessed using multiple indicators (Overall Accuracy—OA, Average Accuracy—AA, the Kappa × 100 (K) coefficient [37], and the category-wise classification accuracy for each dataset).

3.1. Experimental Datasets

3.1.1. Pavia University Dataset

This dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) in the university town of Pavia in northern Italy in 2001. It consists of 115 spectral bands. Twelve spectral bands were removed due to noise, leaving 103 spectral bands for the experimental study. The image size is 610 × 340 pixels, the spectral resolution is 4 nm, and the spatial resolution is 1.3 m/pixel. This dataset includes real-life scenarios consisting of nine different types of land cover, with 42,776 labeled samples. In this study, 5% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 7 shows the pseudo-color image and ground truth map. Table 1 lists the numbers of training, validation, and testing samples for each category.

3.1.2. Salinas Valley Dataset

This dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in the Salinas Valley agricultural region of California in 1998. It originally contained 224 spectral bands, of which 204 were retained for experimental studies after removing 20 water absorption bands. The image size is 512 × 217 pixels, the wavelength range is 400–2500 nm, and the spatial resolution is 3.7 m/pixel. This dataset contains 16 different types of land cover, with 54,129 labeled samples. In this study, 5% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 8 shows the pseudo-color image and ground truth map. Table 2 lists the numbers of training, validation, and testing samples for each category.

3.1.3. WHU-Hi-LongKou Dataset

This dataset was captured by the Headwall Nano-Hyperspec imaging sensor from 13:49 to 14:37 on 17 July 2018 over Longkou Town, Hubei Province, China. It consists of 270 spectral bands that are used for experimental studies. The image is 550 × 400 pixels, the wavelength range is 400–1000 nm, and the spatial resolution is 0.463 m/pixel. This dataset contains nine land cover types with 204,542 pixels labeled. In this study, 1% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 9 shows the pseudo-color image and ground truth map. Table 3 lists the numbers of training, validation, and testing samples for each category.

3.2. Experimental Settings

In the experiments, the Adam optimizer was utilized, with the default setting of 64 samples per batch for training. The learning rate was set to 1 × 10−3 by default, and the patch size was set to nine by default. For training periods, the PU and SV datasets were subjected to 300 epochs, while the WHLK dataset was set to 200 epochs. The parameter settings for other comparative methods were consistent with those reported in the corresponding literature. All training samples underwent the same data preprocessing methods, and the results were derived from the average of multiple experiments. In order to ensure fairness in the experimental process, all HSI classification methods were implemented on the same computing workstation. The workstation was equipped with 40 GB of memory, an Intel 8255C CPU, and an RTX 3080 GPU. The experiments were conducted on the Ubuntu 18.04 platform using the PyTorch 1.8.1 framework.
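For reference, the training configuration described above can be reproduced roughly as follows; the stand-in model and the random tensors are placeholders, not the actual MSRAN network or datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; in practice these are MSRAN and the cropped HSI patches.
model = nn.Sequential(nn.Flatten(), nn.Linear(103 * 9 * 9, 9))    # stand-in for MSRAN
patches = torch.randn(256, 103, 9, 9)                             # hypothetical PU-style 9x9 patches
labels = torch.randint(0, 9, (256,))
loader = DataLoader(TensorDataset(patches, labels), batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # settings from Section 3.2
criterion = nn.CrossEntropyLoss()

for epoch in range(300):                                           # 300 epochs for PU/SV, 200 for WHLK
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```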
In order to quantitatively compare the effectiveness of the method presented in this paper with multiple other approaches, four quantitative evaluation metrics were adopted: OA, AA, the K coefficient, and the classification accuracy for individual categories in each dataset. OA represents the percentage of samples correctly classified by the model in the entire dataset and is used as a measure of the overall performance; the higher the value, the better the performance. AA is the mean of the per-category classification accuracies, providing a comprehensive assessment of the performance across categories. The Kappa coefficient measures the agreement between the predicted and reference classifications while accounting for the agreement expected by chance, thereby offering more reliability than the OA alone. The Kappa coefficient ranges between −1 and 1, where 0 indicates agreement no better than random classification and 1 denotes complete agreement. The category classification accuracy of each dataset refers to the proportion of correctly classified samples in each category.
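The three global metrics can be computed from a confusion matrix as in the following sketch (a hypothetical helper, not from the paper):

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, the Kappa coefficient and per-category accuracy."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)       # per-category accuracy
    oa = np.diag(cm).sum() / cm.sum()                             # overall accuracy
    aa = per_class.mean()                                         # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa, per_class

# Example: oa, aa, kappa, acc = classification_metrics(y_test, y_hat, num_classes=9)
```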

3.3. Experimental Results and Analysis

3.3.1. Ablation Studies

To demonstrate the contribution of each module, ablation experiments were performed using a controlled module variable approach. The MSRAN is mainly composed of the multi-scale spectral residual network (spectral branch), the multi-scale spatial residual network (spatial branch), and the feature fusion module. Different combinations of these modules were tested to assess their contributions. Since the feature fusion module integrates feature fusion and classification tasks, it was included by default in our experiments. As shown in Table 4, a comparison between NET-1 and NET-4 revealed that the inclusion of multi-scale spatial residual attention in NET-4 led to increases in the OA of 1.02% and 3.40% for the PU and SV datasets, respectively. This improvement is attributed to the fact that HSIs encompass both spatial and spectral information, making the joint spatial–spectral features more suitable for HSI classification tasks. The incorporation of spatial residual attention mechanisms enables the network to focus more on features that are effective for classification. Notably, for the WHLK dataset, the introduction of multi-scale spatial–spectral residual attention mechanisms resulted in a significant 7.91% increase in the OA. This can be attributed to the higher spatial resolution of the WHLK, which provides richer details for land cover spatial features. Multi-scale feature extraction can extract more levels of land cover contours, edges, and local details through different receptive fields, thereby enhancing the precision of the classification features. By comparing it with the NET-2, the introduction of multi-scale spectral residual attention in the NET-4 resulted in notable OA improvements of 10.75%, 7.33%, and 19.68% on three HSI datasets, respectively. This improvement is partly due to the advantages brought by the joint spatial–spectral features and partly because of the use of 3 × 3 × 3 three-dimensional convolution kernels and 5 × 1 × 1 one-dimensional convolution kernels. These kernels not only extract longer one-dimensional spectral features but also accommodate three-dimensional neighboring spectral features. Subsequently, spectral residual attention was applied to enhance the utilization of spectral features. In NET-3, the spatial and spectral multi-scale mechanisms are modified to a single scale. Specifically, in the spectral branch, the multi-scale 3DConv is replaced with a single-scale 3DConv with a 3 × 3 × 3 convolution kernel, while in the spatial branch, the multi-level multi-scale convolution is substituted with a single-scale 2DConv with a 5 × 5 kernel. All other experimental settings are retained, as they are in our final NET-4 model. By comparison, our final NET-4 model incorporates a multi-scale mechanism and improves by 6.31%, 8.54%, and 9.63% in terms of the OA on three distinct datasets, respectively. This demonstrates that the multi-scale mechanism is adept at capturing object features of various sizes. The reason for this mainly lies in the fact that the various volumes of different-sized objects lead to diverse scales of edge contours and local detail information. This diversity necessitates the use of different receptive fields for effective feature extraction. The combination of spectral and spatial multi-scale mechanisms enhances our proposed MSRAN method to achieve a superior classification performance.

3.3.2. Comparative Experiments

To validate the effectiveness of the proposed MSRAN method, comparative experiments were conducted against a range of both classical and contemporary networks, including the SVM [14], 1D-RNN [18], 1D-CNN [20], 2D-CNN [21], 3D-CNN [23], DBDA [27], DRIN [28], and LS2CM [38]. Table 5, Table 6 and Table 7 present the OA, AA, K coefficient, and classification accuracy for each land cover type across the PU, SV, and WHLK datasets. Bolded results in these tables signify superior classification performances. Remarkably, the MSRAN outperformed the others, achieving the highest OA, AA, and Kappa coefficient values on two traditional and one novel dataset. In particular, the proposed MSRAN method achieved OA improvements of 3.68%, 4.26%, and 3.34% compared with the classical 3D-CNN method across these datasets, respectively. Even compared with the advanced DRIN and LS2CM methods, MSRAN maintained its superiority due to its efficient integration of spatial and spectral features. The classification results across the datasets showed notable imbalances. For instance, in Table 5 (Bitumen), Table 6 (Lettuce R4), and Table 7 (Mixed weed), classification methods like the SVM, 1D-CNN, and 1D-RNN utilizing one-dimensional features exhibited subpar performances due to the complex sample distribution and the predominance of boundary samples. Conversely, MSRAN along with DBDA and 3D-CNN demonstrated a consistently balanced classification accuracy across various datasets and sample types, where the proposed MSRAN showed the most uniform performance. This consistency demonstrates the ability of our MSRAN method to handle diverse and complex sample data, which can be mainly attributed to the multi-scale mechanism, feature extraction at various granularities, feature enhancement of the central pixel, as well as the residual attention mechanism. The integration of these modules enhances the utilization of spatial and spectral features, which validates the robustness of the proposed MSRAN method.
In visual comparisons, the classification maps generated by these methods on the PU, SV, and WHLK datasets are depicted in Figure 10, Figure 11 and Figure 12. As observed, the maps from the DBDA, DRIN, LS2CM, and MSRAN align more closely with the actual ground conditions, and the classification map of the proposed MSRAN method in particular exhibits superior visual accuracy. Conversely, the classification maps from the other methods are less precise. While some methods perform well with larger sample sizes, they struggle with smaller, more dispersed samples, such as the scattered Asphalt and Trees in the PU dataset, as shown in Figure 10. They also perform poorly on classes whose samples lie close to, or even coincide with, edge pixels, such as Lettuce R4 in the SV dataset of Figure 11 and Roads and Houses in the WHLK dataset of Figure 12. In contrast, the proposed MSRAN method excels in classes with a high percentage of boundary samples, as shown in the high-resolution WHLK dataset, where it effectively managed complex samples like Roads and Houses and Mixed Weed along Riverbanks with intricate boundary challenges.
Overall, the proposed MSRAN method enhances the spectral features of the central pixel and extracts multi-level fine-grained spatial details, which cooperates with the dual-path residual attention mechanism to screen effective features. By removing the redundant feature interference and minimizing the interference between surrounding pixels, the proposed MSRAN method enables the accurate classification of land cover types in areas with complex spatial distribution, particularly for the areas with scattered and boundary samples.

3.3.3. Impacts of Different Training Ratios

Figure 13 displays the OA values of the various models with different proportions of training samples. Considering the different total numbers of samples and the stability of the models under different proportions, 3%, 5%, 7%, and 10% of the samples were randomly selected for training on the PU and SV datasets, while 0.5%, 1%, 3%, and 7% were chosen for the WHLK dataset. Overall, even with a limited number of training samples, the proposed MSRAN method still achieves satisfactory classification results. As the number of training samples increases, the performance of all methods improves on these three datasets. Notably, the proposed MSRAN method demonstrates a stable growth trend, further affirming its robustness.
These findings suggest that MSRAN is particularly effective in scenarios with the common challenge of limited training data for HSI classification. Its ability to maintain stable performance improvements with increasing amounts of training data highlights its potential for use in various applications, particularly in areas where collecting extensive training samples is challenging or impractical. The robustness of MSRAN in these contexts underscores its suitability for real-world applications and its effectiveness for harnessing limited data for accurate classification.

4. Conclusions

Traditional convolutional neural network methods often face challenges with feature extraction, particularly the mutual interference between spatial and spectral features and the poor extraction of features when samples are scarce and contain many edges. This paper introduced a novel method for HSI classification, the Dual-Branch Multi-Scale Spectral–Spatial Residual Attention Network. This approach employs two branches to extract spatial features at different granularities and to enhance central-pixel spectral features. It utilizes spatial residual attention and spectral residual attention to refine the extracted spectral–spatial features. Finally, these features are fused and classified with a Softmax classifier. Extensive experimental evaluations demonstrated that this method is highly competitive. The dual-branch structure allows for more effective and independent extraction of spatial and spectral features, addressing the limitations of traditional CNN methods. The use of residual attention mechanisms further enhances feature extraction, particularly for complex samples with multiple edges or limited quantity. The final fusion of these features ensures a comprehensive representation of the data, leading to an improved classification performance. Overall, the MSRAN method provides higher accuracy and robustness for hyperspectral image classification under different scenarios and conditions.

Author Contributions

Q.W. and M.H. proposed the research ideas and methods; M.H. conducted the experiments and data analysis and wrote the first draft of the paper; Z.L. and Y.L. wrote the paper and constructed the drawings. Finally, Q.W. revised the draft. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation (No. 61502435) of P. R. China, in part by the Young Backbone Teacher Training Program of Henan Province (No. 2023GGJS090), and in part by the Scientific and Technological Research Project of Henan Provincial Department of Science and Technology (No. 232102211048).

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, Q.; Wu, Y.; Li, Y.; Huang, W. Improved YOLOv5s With Coordinate Attention for Small and Dense Object Detection From Optical Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023. [Google Scholar] [CrossRef]
  2. Mahlein, A.; Oerke, E.; Steiner, U.; Dehne, H. Recent Advances in Sensing Plant Diseases for Precision Crop Protection. Eur. J. Plant Pathol. 2012, 133, 197–209. [Google Scholar] [CrossRef]
  3. Lacar, F.M.; Lewis, M.M.; Grierson, I.T. Use of Hyperspectral Imagery for Mapping Grape Varieties in the Barossa Valley. In Proceedings of the International Geoscience and Remote Sensing Symposium, Sydney, NSW, Australia, 9–13 July 2001; Volume 6, pp. 2875–2877. [Google Scholar]
  4. Murphy, R.; Schneider, S.; Monteiro, S. Consistency of Measurements of Wavelength Position from Hyperspectral Imagery: Use of the Ferric Iron Crystal Field Absorption at ∼900 nm as an Indicator of Mineralogy. IEEE Trans. Geosci. Remote Sens. 2014, 52, 2843–2857. [Google Scholar] [CrossRef]
  5. Van Der Meer, F. Analysis of Spectral Absorption Features in Hyperspectral Imagery. Int. J. Appl. Earth Obs. Geoinf. 2004, 5, 55–68. [Google Scholar] [CrossRef]
  6. Datt, B.; McVicar, T.; Van Niel, T.; Jupp, D.; Pearlman, J. Preprocessing EO-1 Hyperion Hyperspectral Data to Support the Application of Agricultural Indexes. IEEE Trans. Geosci. Remote Sens. 2003, 41, 1246–1259. [Google Scholar] [CrossRef]
  7. Zhuang, L.; Wang, J.; Bai, L.; Jiang, G.; Sun, S.; Yang, P.; Wang, S. Cotton Yield Estimation Based on Hyperspectral Remote Sensing in Arid Region of China. Trans. Chin. Soc. Agric. Eng. 2011, 27, 176–181. [Google Scholar]
  8. Hestir, E.; Brando, V.; Bresciani, M.; Giardino, C.; Matta, E.; Villa, P.; Dekker, A. Measuring Freshwater Aquatic Ecosystems: The Need for a Hyperspectral Global Mapping Satellite Mission. Remote Sens. Environ. 2015, 167, 181–195. [Google Scholar] [CrossRef]
  9. Zhang, P.; Lu, Q.; Hu, X.; Gu, S.; Yang, L.; Min, M.; Chen, L.; Xu, N.; Sun, L.; Bai, W.; et al. Latest Progress of the Chinese Meteorological Satellite Program and Core Data Processing Technologies. Adv. Atmos. Sci. 2019, 36, 1027–1045. [Google Scholar] [CrossRef]
  10. Fang, Y.; Xu, L.; Chen, Y.; Zhou, W.; Wong, A.; Clausi, D.A. A Bayesian Deep Image Prior Downscaling Approach for High-Resolution Soil Moisture Estimation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 4571–4582. [Google Scholar] [CrossRef]
  11. Pang, Y.; Li, Z.; Ju, H.; Lu, H.; Jia, W.; Si, L.; Guo, Y.; Liu, Q.; Li, S.; Liu, L.; et al. LiCHy: The CAF’s LiDAR, CCD and Hyperspectral Integrated Airborne Observation System. Remote Sens. 2016, 8, 398. [Google Scholar] [CrossRef]
  12. Guo, A.; Huang, W.; Dong, Y.; Ye, H.; Ma, H.; Liu, B.; Wu, W.; Ren, Y.; Ruan, C.; Geng, Y. Wheat Yellow Rust Detection Using UAV-Based Hyperspectral Technology. Remote Sens. 2021, 13, 123. [Google Scholar] [CrossRef]
  13. Della, C.; Bekit, A.; Lampe, B.; Chang, C.-I. Hyperspectral Image Classification Via Compressive Sensing. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8290–8303. [Google Scholar] [CrossRef]
  14. Zhao, P.; Tang, Y.; Li, Z. Wood Species Recognition with Microscopic Hyper-spectral Imaging and Composite Kernel SVM. Spectrosc. Spect. Anal. 2019, 39, 3776–3782. [Google Scholar]
  15. Xia, J.; Yokoya, N.; Iwasaki, A. Hyperspectral Image Classification with Canonical Correlation Forests. IEEE Trans. Geosci. Remote Sens. 2017, 55, 421–431. [Google Scholar] [CrossRef]
  16. Peng, J.; Li, L.; Tang, Y. Maximum Likelihood Estimation-based Joint Sparse Representation for the Classification of Hyperspectral Remote Sensing Images. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1790–1802. [Google Scholar] [CrossRef] [PubMed]
  17. Boulch, A.; Audebert, N.; Dubucq, D. Autoencodeurs Pour La Visualisation D’images Hyperspectrales. In Proceedings of the 25th Colloque Gretsi, Juan-les-Pins, France, 5–8 September 2017. [Google Scholar]
  18. Chen, Y.; Zhao, X.; Jia, X. Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [Google Scholar] [CrossRef]
  19. Mou, L.; Ghamisi, P.; Zhu, X.X. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3639–3655. [Google Scholar] [CrossRef]
  20. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  21. Sharma, V.; Diba, A.; Tuytelaars, T.; Van Gool, L. Hyperspectral CNN for Image Classification and Band Selection, with Application to Face Recognition; Technical Report KUL/ESAT/PSI/1604; KU Leuven, ESAT: Leuven, Belgium, 2016. [Google Scholar]
  22. Yang, J.; Zhao, Y.; Chan, J. Learning and Transferring Deep Joint Spectral–spatial Features for Hyperspectral Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4729–4742. [Google Scholar] [CrossRef]
  23. Li, Y.; Zhang, H.; Shen, Q. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef]
  24. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  25. Zhong, Z.; Jonathan, L. Spectral-spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [Google Scholar] [CrossRef]
  26. Qi, Y.F.; Chen, J.; Huo, Y.L. Hyperspectral Image Classification Algorithm Based on multi-scale Convolutional Neural Network. Infrared Tech. 2020, 42, 855–862. [Google Scholar]
  27. Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of Hyperspectral Image Based on Double-Branch Dual-Attention Mechanism Network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
  28. Meng, Z.; Zhao, F.; Liang, M.; Xie, W. Deep Residual Involution Network for Hyperspectral Image Classification. Remote Sens. 2021, 13, 3055. [Google Scholar] [CrossRef]
  29. Zhang, C.; Li, G.D.; Du, S.H. Multi-Scale Dense Networks for Hyperspectral Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9201–9222. [Google Scholar] [CrossRef]
  30. He, N.; Paoletti, M.E.; Haut, J.M.; Fang, L.; Li, S.; Plaza, A.; Plaza, J. Feature Extraction with multi-scale Covariance Maps for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 755–769. [Google Scholar] [CrossRef]
  31. Woo, S.; Park, J.; Lee, J.Y. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  32. Cheng, R.J.; Yang, Y.; Li, L.W. Lightweight Residual Network Based on Depthwise Separable Convolution for Hyperspectral Image Classification. Acta Opt. Sin. 2023, 43, 1228010. [Google Scholar] [CrossRef]
  33. Hyperspectral Remote Sensing Scenes. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral%20Remote%20Sensing%20Scenes (accessed on 11 January 2023).
  34. Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-borne Hyperspectral with High Spatial Resolution (H2) Benchmark Datasets and Classifier for Precise Crop Identification based on Deep Convolutional Neural Network with CRF. Remote Sens. Environ. 2020, 250, 112012. [Google Scholar] [CrossRef]
  35. Wu, Q.G.; Liu, Z.C.; He, M.K. Fusion of MS3D-CNN and Attention Mechanism for Hyperspectral Image Classification. J. Chongqing Inst. Technol. 2023, 37, 173–182. [Google Scholar]
  36. Yang, Z.H.; Li, Z.X.; Han, J.F. The Hughes Phenomenon in Hyperspectral Analysis and the Application of the Lowpass Filter. Acta Geod. Cartogr. Sin. 2004, 4, 253–255+258. [Google Scholar]
  37. Thompson, W.D.; Walter, S.D. A Reappraisal of the Kappa Coefficient. J. Clin. Epidemiol. 1988, 41, 949–958. [Google Scholar] [CrossRef] [PubMed]
  38. Meng, Z.; Jiao, L.; Liang, M.; Zhao, F. A Lightweight Spectral-Spatial Convolution Module for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5505105. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed MSRAN for HSI classification.
Figure 2. Spectral feature extraction and optimization branch.
Figure 3. Spectral residual attention structure.
Figure 4. Structure of the spatial feature extraction and optimization module.
Figure 5. Structure of the spatial residual attention.
Figure 6. Feature fusion and classification module.
Figure 7. Pseudo-color image and ground truth image of the PU dataset.
Figure 8. Pseudo-color image and ground truth image of the SV dataset.
Figure 9. Pseudo-color image and ground truth image of the WHLK dataset.
Figure 10. Classification maps of the PU dataset with different methods.
Figure 11. Classification maps of the SV dataset with different methods.
Figure 12. Classification maps of the WHLK dataset with different methods.
Figure 13. Classification results versus different percentages of training samples for the three datasets. (a) PU. (b) SA. (c) WHLK.
Table 1. Categories and corresponding training, validation, and test sample numbers for the PU dataset.

NO. | Category | Sample Size | Train/Val | Test
1 | Asphalt | 6631 | 332 | 5967
2 | Meadows | 18,649 | 932 | 16,785
3 | Gravel | 2099 | 105 | 1889
4 | Trees | 3064 | 153 | 2758
5 | Painted metal sheets | 1345 | 67 | 1211
6 | Bare soil | 5029 | 251 | 4527
7 | Bitumen | 1330 | 67 | 1196
8 | Self-blocking bricks | 3682 | 184 | 3314
9 | Shadows | 947 | 47 | 853
Total | | 42,776 | 2138 | 38,500
Table 2. Categories and corresponding training, validation, and test sample numbers for the SV dataset.

NO. | Category | Sample Size | Train/Val | Test
1 | Broccoli Gl | 2009 | 100 | 1809
2 | Broccoli G2 | 3726 | 186 | 3354
3 | Fallow | 1976 | 99 | 1778
4 | Fallow R | 1394 | 70 | 1254
5 | Fallow_S | 2678 | 134 | 2410
6 | Stubble | 3959 | 198 | 3563
7 | Celery | 3579 | 179 | 3221
8 | GrapesU | 11,271 | 564 | 10,143
9 | Soil_V | 6203 | 310 | 5583
10 | Comn_S | 3278 | 164 | 2950
11 | Lettuce R4 | 1068 | 53 | 962
12 | Lettuce R5 | 1927 | 96 | 1735
13 | Lettuce R6 | 916 | 46 | 824
14 | Lettuce R7 | 1070 | 54 | 962
15 | Vineyard U | 7268 | 363 | 6542
16 | Vineyard V | 1807 | 90 | 1627
Total | | 54,129 | 2706 | 48,717
Table 3. Categories and corresponding training, validation, and test sample numbers for the WHLK dataset.

NO. | Category | Sample Size | Train/Val | Test
1 | Corn | 34,511 | 345 | 33,821
2 | Cotton | 8374 | 84 | 8206
3 | Sesame | 3031 | 30 | 2971
4 | Broad-leaf soybean | 63,212 | 632 | 61,948
5 | Narrow-leaf soybean | 4151 | 42 | 4067
6 | Rice | 11,854 | 119 | 11,616
7 | Water | 67,056 | 671 | 65,714
8 | Roads and houses | 7124 | 71 | 6982
9 | Mixed weed | 5229 | 52 | 5125
Total | | 204,542 | 2046 | 200,450
Table 4. Contributions of different components to the proposed MSRAN method.

Network | Multi-Scale Mechanism | Spectral Branch | Spatial Branch | Metric | PU | SV | WHLK
NET-1 | - | √ | × | OA% | 97.80 | 93.63 | 90.95
 | | | | AA% | 96.87 | 94.19 | 72.62
 | | | | K% | 97.08 | 92.92 | 88.52
NET-2 | - | × | √ | OA% | 88.07 | 89.70 | 79.18
 | | | | AA% | 79.88 | 90.75 | 45.63
 | | | | K% | 83.84 | 88.49 | 73.39
NET-3 | × | - | - | OA% | 92.51 | 88.58 | 89.23
 | | | | AA% | 88.98 | 90.35 | 61.98
 | | | | K% | 90.00 | 87.21 | 85.68
NET-4 | √ | √ | √ | OA% | 98.82 | 97.03 | 98.86
 | | | | AA% | 97.37 | 97.82 | 95.80
 | | | | K% | 98.44 | 96.68 | 98.50
The bold entities indicate the highest value. ×: does not exist, √: exists, -: not relevant.
Table 5. Classification results of PU datasets with different methods.

NO. | SVM | 1D-RNN | 1D-CNN | 2D-CNN | 3D-CNN | DBDA | DRIN | LS2CM | MSRAN
1 | 83.73 | 88.74 | 81.62 | 79.05 | 95.76 | 98.04 | 98.53 | 98.63 | 97.84
2 | 88.71 | 95.76 | 87.32 | 71.83 | 97.11 | 95.34 | 97.61 | 97.69 | 99.78
3 | 4.74 | 67.38 | 1.12 | 59.43 | 91.26 | 96.10 | 90.26 | 94.42 | 96.38
4 | 89.99 | 93.54 | 78.89 | 82.67 | 96.19 | 96.23 | 97.78 | 98.85 | 99.34
5 | 99.17 | 99.83 | 98.48 | 97.58 | 99.29 | 99.67 | 99.96 | 99.96 | 100.0
6 | 33.21 | 87.52 | 38.89 | 99.74 | 97.32 | 98.38 | 99.48 | 99.40 | 99.30
7 | 0.00 | 66.75 | 0.30 | 98.35 | 90.98 | 99.92 | 98.56 | 98.56 | 92.30
8 | 75.98 | 75.20 | 75.22 | 94.62 | 93.95 | 98.92 | 95.20 | 97.28 | 97.57
9 | 99.94 | 99.18 | 99.41 | 82.77 | 96.85 | 98.51 | 99.88 | 99.82 | 99.82
OA% | 79.68 | 89.67 | 77.54 | 69.53 | 95.14 | 94.74 | 96.55 | 97.01 | 98.82
AA% | 65.98 | 85.99 | 63.87 | 77.75 | 93.92 | 96.08 | 96.51 | 97.28 | 97.37
K% | 71.62 | 86.29 | 68.72 | 64.07 | 93.61 | 93.22 | 95.47 | 96.09 | 98.44
The bold entities indicate the highest value.
Table 6. Classification results of SV datasets with different methods.

NO. | SVM | 1D-RNN | 1D-CNN | 2D-CNN | 3D-CNN | DBDA | DRIN | LS2CM | MSRAN
1 | 96.97 | 99.86 | 93.96 | 70.44 | 98.26 | 99.39 | 99.16 | 99.75 | 99.30
2 | 97.82 | 99.85 | 96.65 | 99.97 | 99.81 | 99.58 | 99.42 | 99.84 | 99.05
3 | 70.10 | 95.61 | 83.30 | 78.57 | 97.67 | 98.42 | 99.18 | 94.94 | 99.75
4 | 98.15 | 98.69 | 95.14 | 88.64 | 98.62 | 99.09 | 99.24 | 99.33 | 99.29
5 | 90.35 | 96.79 | 90.98 | 89.34 | 97.60 | 98.62 | 98.80 | 97.29 | 99.32
6 | 99.55 | 99.96 | 99.63 | 88.08 | 99.42 | 99.86 | 99.73 | 99.75 | 99.90
7 | 95.16 | 99.81 | 95.64 | 71.52 | 99.19 | 99.38 | 99.81 | 99.60 | 98.45
8 | 72.80 | 80.59 | 72.48 | 89.03 | 87.57 | 87.72 | 92.62 | 93.11 | 94.53
9 | 89.67 | 99.60 | 92.55 | 89.52 | 99.58 | 99.76 | 99.06 | 98.67 | 99.89
10 | 79.85 | 95.56 | 77.35 | 84.04 | 95.44 | 96.87 | 97.77 | 94.47 | 98.02
11 | 0.00 | 95.61 | 46.58 | 81.87 | 96.41 | 92.87 | 91.64 | 88.28 | 96.31
12 | 87.12 | 98.97 | 94.65 | 82.36 | 98.17 | 99.63 | 99.36 | 99.71 | 99.57
13 | 95.02 | 99.15 | 87.56 | 84.13 | 98.71 | 98.00 | 98.84 | 98.41 | 98.61
14 | 91.87 | 97.11 | 91.51 | 83.63 | 97.36 | 97.29 | 98.34 | 97.36 | 98.08
15 | 0.00 | 70.54 | 45.39 | 62.96 | 77.60 | 84.69 | 88.23 | 89.85 | 90.70
16 | 81.47 | 98.95 | 78.47 | 0.00 | 94.50 | 99.17 | 99.20 | 98.98 | 97.75
OA% | 78.79 | 91.08 | 81.28 | 69.81 | 92.77 | 94.60 | 96.17 | 95.95 | 97.03
AA% | 79.71 | 95.51 | 82.32 | 67.33 | 95.23 | 97.18 | 97.57 | 96.70 | 97.82
K% | 76.01 | 90.08 | 79.01 | 67.41 | 91.95 | 94.00 | 95.74 | 95.49 | 96.68
The bold entities indicate the highest value.
Table 7. Classification results of WHLK datasets with different methods.

NO. | SVM | 1D-RNN | 1D-CNN | 2D-CNN | 3D-CNN | DBDA | DRIN | LS2CM | MSRAN
1 | 98.48 | 91.81 | 86.49 | 73.12 | 98.18 | 96.97 | 97.01 | 96.75 | 99.50
2 | 75.12 | 55.98 | 47.06 | 77.00 | 88.62 | 96.42 | 96.41 | 96.87 | 94.10
3 | 76.27 | 0.00 | 0.00 | 74.46 | 92.24 | 97.37 | 97.02 | 97.25 | 93.75
4 | 95.66 | 88.71 | 87.77 | 76.83 | 96.85 | 97.05 | 97.41 | 97.16 | 98.94
5 | 75.16 | 0.00 | 0.00 | 85.79 | 75.14 | 93.60 | 95.06 | 92.87 | 88.86
6 | 99.16 | 93.45 | 82.48 | 94.08 | 98.67 | 98.87 | 98.86 | 98.71 | 99.82
7 | 99.95 | 99.95 | 99.92 | 93.66 | 99.59 | 99.30 | 99.32 | 99.23 | 99.99
8 | 88.07 | 78.49 | 73.71 | 89.48 | 93.47 | 93.58 | 96.51 | 94.60 | 96.43
9 | 88.14 | 0.00 | 0.00 | 88.57 | 92.89 | 94.79 | 96.80 | 93.54 | 98.30
OA% | 95.83 | 88.80 | 86.27 | 73.36 | 96.29 | 96.03 | 96.32 | 96.00 | 98.86
AA% | 87.05 | 58.03 | 55.30 | 74.13 | 92.73 | 94.50 | 95.52 | 94.73 | 95.80
K% | 94.50 | 84.99 | 81.67 | 67.59 | 95.16 | 94.83 | 95.21 | 94.79 | 98.50
The bold entities indicate the highest value.

