1. Introduction
The accelerated progress in satellite remote sensing technology has made hyperspectral images (HSIs) a compelling area of research [1]. Unlike traditional natural images, HSIs record spectral responses continuously across numerous narrow, contiguous bands, enabling the accurate identification and differentiation of materials with subtle spectral differences [2]. HSI classification assigns each pixel in the HSI data to a distinct land cover type, facilitating the precise categorization and identification of surface features such as farmland, forest, and water. This classification technique has extensive applications in diverse fields, including agriculture [3], environmental monitoring [4], urban planning [5], and geological exploration [6].
HSI classification draws on a variety of traditional machine learning methods, including random forest [7], the minimum distance classifier [8], the support vector machine (SVM) [9], the K-nearest neighbor (KNN) algorithm [10], and the Bayesian classifier [11]. In addition, because of the high dimensionality of HSI data, a variety of dimensionality reduction methods have also been widely used, including principal component analysis (PCA) [12], isometric mapping (Isomap) [13], and locally linear embedding (LLE) [14]. However, traditional methods often rely on manually extracted spatial and spectral features and struggle to capture the complex non-linear relationships and high-order correlations in the data, which degrades their performance on complex datasets.
With the advancement of deep learning, its widespread application to various computer vision (CV) tasks has become increasingly evident in recent years [15,16]. These tasks include, but are not limited to, denoising [17,18], target detection [19,20], change detection [21,22], and classification [23,24,25]. Numerous studies consistently demonstrate that deep learning methods outperform traditional approaches in extracting high-level features. The features extracted by deep networks capture intricate and abstract information more effectively, contributing to a substantial improvement in HSI classification accuracy. Consequently, a range of deep learning methods have been proposed, encompassing recurrent neural networks (RNNs) [26], deep belief networks (DBNs) [27], stacked autoencoders (SAEs) [28], and more. However, these HSI classification methods often overlook spatial information and focus solely on spectral details. This oversight can cause confusion when the same material exhibits different spectra, or when distinct materials share the same spectrum.
To tackle the aforementioned problems, various deep learning methods based on convolutional neural networks (CNNs) have been proposed to effectively extract spectral–spatial features for HSI classification. Yu et al. introduced a dedicated one-dimensional CNN (1D-CNN) to capture correlations between spectral bands [29]. Gao et al. proposed a two-dimensional CNN (2D-CNN) to efficiently capture the spatial structure and texture information in images, thereby enhancing spatial feature extraction [30]. Meanwhile, Xu et al. introduced a three-dimensional CNN (3D-CNN) to capture both the spectral and spatial characteristics of HSI, as well as their interactions [31]. Roy et al. proposed the HybridSN network, a combination of a 2D-CNN and a 3D-CNN, for the simultaneous extraction of spatial and spectral features [32]. Given the varied sizes of targets in HSI, researchers commonly explore feature extraction across multiple scales. He et al. introduced a multi-scale 3D-CNN capable of extracting spectral–spatial information from images at four different scales [33]. Hu et al. put forward a hybrid convolutional network that combines multi-scale 2D and 3D depthwise separable convolutions [34]. Recognizing the importance of information at different scales for specific tasks, many researchers have incorporated attention mechanisms into deep learning models. Mou et al. integrated an attention module along the spectral dimension, using a gating mechanism to selectively emphasize information-rich bands and adaptively recalibrate the spectral bands [35]. Cui et al. devised a network that combines multi-scale feature aggregation with a dual-channel spectral–spatial attention mechanism to adeptly capture local contextual information [36]. To enhance the extraction of long-range features, researchers have increasingly applied Transformers to HSI classification. Hong et al. introduced the SpectralFormer model, whose Transformer module merges contextual information from neighboring bands, capturing both local and spectral sequence information [37]. Sun et al. proposed the Spectral–Spatial Feature Tokenization Transformer (SSFTT), designed to capture high-level semantic features as well as spectral–spatial features [38].
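The multi-scale idea behind [33,34] can be sketched as follows; this is an illustrative toy example (random filters, a single band, and assumed kernel sizes), not the exact architecture of any cited network:

```python
import numpy as np

def conv2d_same(img, kernel):
    """2D cross-correlation with zero padding ('same' output size)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return np.einsum('ijkl,kl->ij', windows, kernel)

rng = np.random.default_rng(0)
patch = rng.normal(size=(11, 11))                       # spatial patch, one band
kernels = [rng.normal(size=(k, k)) for k in (3, 5, 7)]  # one filter per scale

# One feature map per kernel size, stacked along a channel axis, so the
# model sees both fine (3x3) and coarse (7x7) spatial context at once.
features = np.stack([conv2d_same(patch, k) for k in kernels])
print(features.shape)  # (3, 11, 11)
```

Concatenating responses from several kernel sizes is what lets such networks handle targets of varied spatial extent within a single forward pass.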
While the deep learning methods mentioned above have found extensive application in HSI classification, there are still some key obstacles, which can be summarized as follows.
(1) These approaches fail to effectively leverage the multi-scale features present in HSI and neglect to establish strong dependencies among spectral bands. Consequently, their capacity to distinguish long-range spectral disparities within HSI is limited.
(2) Existing Transformer-based HSI classification methods capture the contextual relationships among all input embeddings via the multi-head self-attention (MHSA) mechanism. However, MHSA computes pairwise correlations among all tokens, so its cost grows quadratically with the number of tokens, which substantially increases the computational complexity of the network.
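The quadratic cost of self-attention can be seen in a minimal single-head sketch (shapes are illustrative assumptions):

```python
import numpy as np

# Minimal single-head scaled dot-product attention, to show where the
# quadratic cost arises; dimensions here are illustrative only.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # The N x N score matrix is the O(N^2) bottleneck for N tokens.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
N, d = 64, 16                         # tokens x embedding dimension
X = rng.normal(size=(N, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # (64, 16)
```

Doubling the number of tokens N quadruples the size of the score matrix, which is the motivation for the lightweight frequency-domain alternative proposed below.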
To tackle the above difficulties, we propose a multi-scale spectral–spatial attention network with a frequency-domain lightweight Transformer. Specifically, we use a spectral–spatial feature extraction module to effectively extract spectral–spatial features and capture the long-range spectral dependencies of HSI. In addition, the frequency-domain lightweight Transformer applies the fast Fourier transform (FFT) to convert features from the spatial domain to the frequency domain, effectively extracting global information while significantly reducing the time complexity of the network. Our main contributions are as follows.
(1) We design a spectral–spatial feature extraction module for MSA-LWFormer, aimed at extracting shallow spectral–spatial features and capturing long-range spectral dependencies. This module emphasizes cross-channel and multi-scale features by integrating the multi-scale 2D-CNN and MS-SA techniques. These designs enhance the model’s ability to accurately capture and interpret complex spectral information.
(2) Applying the FFT to the query, key, and value matrices within the frequency-domain lightweight Transformer converts features from the spatial domain to the frequency domain, enhancing the extraction of comprehensive global information while reducing the time complexity of the network.
(3) Our proposed network demonstrates good classification results across three classic HSI datasets, providing compelling evidence for the effectiveness of our approach.
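To convey the general idea only, and not the paper's exact frequency-domain attention formulation, the sketch below mixes query, key, and value features by element-wise products of their FFT spectra, an O(N log N) per-channel operation that replaces the O(N^2) score matrix of standard self-attention:

```python
import numpy as np

# Hedged illustration of FFT-based token mixing; the actual
# MSA-LWFormer module is described in Section 2, not reproduced here.
def frequency_domain_mixing(Q, K, V):
    Fq = np.fft.fft(Q, axis=0)   # spectra along the token axis
    Fk = np.fft.fft(K, axis=0)
    Fv = np.fft.fft(V, axis=0)
    # Element-wise products in the frequency domain correspond to
    # circular (cross-)correlations of token sequences in the spatial
    # domain, mixing global information without an N x N matrix.
    mixed = Fq * np.conj(Fk) * Fv
    return np.fft.ifft(mixed, axis=0).real

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = frequency_domain_mixing(Q, K, V)
print(out.shape)  # (64, 16)
```

Because the FFT and its inverse each cost O(N log N) per channel, the overall mixing step scales near-linearly in the number of tokens.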
The remainder of this paper is organized as follows.
Section 2 presents the overall structure of MSA-LWFormer and the design details of each sub-module.
Section 3 presents the details of the experiments conducted and the corresponding results.
Section 4 discusses the ablation experiments, time complexity analysis, and hyperparameter analysis of MSA-LWFormer. Lastly,
Section 5 summarizes the study’s conclusions and outlines prospects for future work.