1. Introduction
Hyperspectral images (HSIs) are acquired by imaging spectrometers operating across different regions of the electromagnetic spectrum, where each spectral band records the response of the same spatial scene within a narrow wavelength interval. Unlike conventional RGB images, which capture limited spectral information, HSIs provide near-continuous spectral data alongside rich spatial detail, enabling the detection of subtle differences among land cover types [1]. In HSI classification, each pixel is labeled according to its land cover type, yielding an accurate representation of geographic attributes [2]. This process is essentially a form of pixel-level semantic segmentation, as it requires assigning a semantic label to every pixel. Consequently, HSI classification has found extensive application in disease diagnosis [3], mineral resource discovery [4], environmental management [5], military intelligence [6], and other fields [7,8,9]. Moreover, it provides an important research direction for semantic segmentation methods, since both tasks aim to extract fine-grained spatial and spectral information from high-dimensional data to achieve precise target identification.
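For intuition, the minimal sketch below shows the data structures involved: a hyperspectral cube with one spectral vector per pixel and a label map of the same spatial extent. The 145 × 145 × 200 dimensions are illustrative placeholders, not a specific dataset.

```python
import numpy as np

# A hyperspectral cube: H x W spatial pixels, each with B spectral bands.
# The dimensions below are illustrative placeholders, not a specific dataset.
H, W, B = 145, 145, 200
cube = np.random.rand(H, W, B)             # per-band reflectance values

# HSI classification assigns one land-cover label to every pixel,
# i.e., it produces a label map with the same spatial extent as the cube.
labels = np.zeros((H, W), dtype=np.int64)  # one class index per pixel
print(cube.shape, labels.shape)            # (145, 145, 200) (145, 145)
```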
Traditional approaches to HSI classification typically follow a two-step process: manual feature extraction followed by classification. In early studies, techniques such as principal component analysis (PCA) [10], independent component analysis (ICA) [11], and nonlinear principal component analysis (NLPCA) [12] were employed to reduce data dimensionality by mapping high-dimensional features into a more compact space. After this preprocessing, traditional machine learning methods such as k-nearest neighbors (k-NN) [13,14], support vector machines (SVMs) [15,16], multilayer perceptrons (MLPs) [17], and random forests (RFs) [18] were extensively employed for classification. While these methods produced promising results in early HSI classification tasks, they often struggled to fully exploit the complex spectral and spatial information inherent in HSIs. Consequently, some researchers have focused on integrating spatial and spectral information. He et al. [19] decoupled a 3D Gabor filter, which captures both spectral and spatial attributes, into multiple sub-filters and introduced a novel Discriminative Low-Rank Gabor Filtering (DLRGF) method for HSI classification. Dalla et al. [20] sequentially applied morphological attribute filters to create a multi-level representation of HSIs, which was used to model various types of structural information. Although these methods harness spatial and spectral cues to enhance HSI classification, their reliance on hand-crafted extractors restricts their ability to learn robust representations.
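To make the two-step pipeline concrete, the sketch below reduces each pixel’s spectral vector with PCA and classifies it with an SVM. The random data, component count, and kernel settings are illustrative assumptions, not the configurations of the cited works.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Hypothetical labeled HSI data: each pixel is a B-dimensional spectral vector.
rng = np.random.default_rng(0)
H, W, B = 145, 145, 200
cube = rng.random((H, W, B))
labels = rng.integers(0, 16, size=(H, W))              # 16 land-cover classes

# Step 1: manual feature extraction -- project spectra onto a compact subspace.
pixels = cube.reshape(-1, B)                           # (H*W, B)
features = PCA(n_components=30).fit_transform(pixels)  # component count is illustrative

# Step 2: classification -- train a conventional classifier on a labeled subset
# and predict a label for every pixel in the scene.
train_idx = rng.choice(len(features), size=2000, replace=False)
clf = SVC(kernel="rbf", C=10.0)
clf.fit(features[train_idx], labels.reshape(-1)[train_idx])
pred_map = clf.predict(features).reshape(H, W)         # per-pixel classification map
```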
Deep learning, with its powerful feature representation capabilities, offers significant advantages over traditional algorithms in image classification and has been widely adopted in practical applications [21]. Unlike traditional techniques that rely on manual feature engineering, deep learning models enable end-to-end training and inherently learn highly discriminative representations, eliminating the need for manual intervention [22]. Methods ranging from stacked autoencoders (SAEs) [23] and deep belief networks (DBNs) [24] to various neural network models [25,26,27,28] have achieved promising classification results on HSIs. However, SAEs and DBNs still flatten pixels into one-dimensional vectors before feeding them into the network, which fails to preserve spatial information. Recurrent neural networks (RNNs) are inherently designed for sequential data, and their step-by-step processing makes it difficult to capture the complex spatial dependencies and high-dimensional feature representations in HSIs. Owing to their proficiency in extracting spatial patterns, convolutional neural networks (CNNs) have become an essential component of current HSI classification methods [29]. Earlier CNN approaches primarily concentrated on spectral data, representing each pixel’s spectral information as a 1D vector to capture intensity variations across wavelengths. Gao et al. [30] employed cascaded 1 × 1 and 3 × 3 convolutional layers to convert 1D spectral representations into 2D feature maps. While this design facilitated the reuse of features across spectral bands, it failed to incorporate spatial context. Chen et al. [31] proposed an enhanced 2D-CNN incorporating Gabor filtering to mitigate overfitting and preserve spatial details, although it underutilized spectral information. To harness the complementary strengths of spatial and spectral information, Zhang et al. [32] introduced a dual-branch architecture that uses a 1D-CNN to extract hierarchical spectral features and a 2D-CNN to capture spatial features, with the results fused for final classification. Chen et al. [33] designed a 3D-CNN that approaches the problem from a three-dimensional perspective, simultaneously capturing both spectral and spatial features in HSIs. Zhong et al. [34] introduced a supervised spectral–spatial residual network (SSRN) that arranges a series of 3D-CNN layers into dedicated spatial and spectral residual blocks, effectively capturing integrated features from both dimensions. To address the inability of 2D-CNNs to capture inter-spectral correlations while avoiding the complexity of 3D-CNN architectures, Roy et al. [35] introduced a 3D-2D CNN architecture, in which a 3D-CNN first extracts integrated spectral–spatial features and a 2D-CNN then refines these representations to capture higher-level spatial context. Wang et al. [36] proposed an end-to-end fast dense convolutional network for hyperspectral data classification, employing convolutional kernels of different sizes in a densely connected structure that separately extracts spectral and spatial features. Subsequently, Wang et al. [37] enhanced the dense convolutional network architecture with Cubic-CNN, which takes both the original image patches and dimensionally reduced 1D convolution features as inputs, effectively reducing feature redundancy. CNN-based methods directly capture integrated representations that combine spectral and spatial features through convolution operations [38,39], thereby exploiting the inherent spectral properties and spatial cues of HSIs. However, because they depend on local receptive fields and fixed convolutional kernels, these methods often struggle to capture long-range dependencies and global features [40,41,42].
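As a concrete illustration of the 3D-2D idea, the following PyTorch sketch applies 3D convolutions over a spectral–spatial patch and then refines the stacked feature maps with a 2D convolution. All layer sizes are assumptions chosen for illustration, not the exact architecture of [35].

```python
import torch
import torch.nn as nn

class Spectral2DSpatialCNN(nn.Module):
    """Minimal 3D-then-2D CNN sketch for patch-wise HSI classification."""
    def __init__(self, bands=30, patch=11, n_classes=16):
        super().__init__()
        # 3D convolutions slide over (band, height, width), extracting
        # joint spectral-spatial features from the input patch.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3)), nn.ReLU(),
        )
        out_bands = bands - 6 - 4          # spectral size after the two 3D convs
        out_sp = patch - 2 - 2             # spatial size after the two 3D convs
        # Fold the spectral axis into channels, then refine spatially in 2D.
        self.conv2d = nn.Sequential(
            nn.Conv2d(16 * out_bands, 64, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * (out_sp - 2) ** 2, n_classes)

    def forward(self, x):                  # x: (N, 1, bands, patch, patch)
        x = self.conv3d(x)
        n, c, b, h, w = x.shape
        x = x.reshape(n, c * b, h, w)      # merge channel and spectral axes
        x = self.conv2d(x)
        return self.fc(x.flatten(1))

logits = Spectral2DSpatialCNN()(torch.randn(2, 1, 30, 11, 11))  # (2, 16)
```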
The transformer is effective at capturing long-range interactions [43], prompting many researchers to integrate transformer modules into HSI classification frameworks. A transformer can establish global relationships between different locations and automatically learn important spatial and spectral features through its self-attention mechanism [44,45]. Furthermore, it can handle multi-dimensional features and flexibly adjust the weights assigned to different spectral bands through self-attention, helping to extract more useful information from complex hyperspectral data and thereby improving classification accuracy [46,47]. Panboonyuen et al. [48] used the transformer’s self-attention mechanism to capture long-range dependencies and contextual information in remote sensing images. He et al. [49] introduced an HSI-BERT model that leverages a global receptive field, allowing for dynamic input regions and capturing global pixel dependencies regardless of spatial distance. Similarly, Qing et al. [50] developed SAT-Net based on the transformer architecture, which employs both spectral and self-attention mechanisms to extract spatial and spectral features, effectively capturing long-range continuous spectral relationships. To better combine the strengths of CNNs in extracting local features and transformers in capturing global features, many studies have employed fusion models that integrate both. He et al. [51] extracted spatial features using a CNN-based backbone network while leveraging a transformer to capture the relationships between adjacent spectral bands. Similarly, Sun et al. [52] combined 3D and 2D convolutions to obtain shallow spectral–spatial features and then introduced the Spectral–Spatial Feature Tokenization Transformer (SSFTT) to capture more abstract semantic features. To address insufficient feature extraction and inadequate spatial–spectral representation in complex scenarios, researchers often fuse multi-scale or multi-feature approaches to enhance feature representation. Tan et al. [53] designed a context-driven feature aggregation network that fuses feature maps of multi-scale ground objects at different resolutions to improve the network’s ability to recognize small targets. Roy et al. [54] integrated the conventional transformer with morphological operations by processing input image patches through two parallel morphological modules. This dual-path approach extracts multi-scale, complementary spatial structural information; the outputs are then concatenated and augmented with a CLS token, enabling the transformer to capture spectral and spatial features from diverse perspectives. Sun et al. [55] employed multi-scale convolution to extract features, which were subsequently fused with local binary pattern spatial information to enrich the representation. After further refinement with various attention mechanisms, these enriched features were used for classification. These methods integrate shallow and deep features, explore the correlations between spatial and spectral feature sequences, and effectively improve classification performance. The advantages and limitations of several typical deep learning-based HSI classification methods are summarized in Table 1.
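The common ingredient in these transformer-based methods is multi-head self-attention over token sequences. The minimal sketch below, with hypothetical token counts and embedding size rather than the module of any specific cited method, shows how every token attends to every other token, which is what enables long-range spectral–spatial modeling.

```python
import torch
import torch.nn as nn

# A hypothetical token sequence: each hyperspectral patch is flattened into
# tokens (e.g., one per band group or spatial position) plus a CLS token.
tokens = torch.randn(2, 65, 64)            # (batch, 64 tokens + CLS, embed dim)

# Multi-head self-attention computes pairwise affinities between all tokens,
# so each spectral/spatial token can attend to every other one -- the
# long-range modeling ability the transformer brings to HSI classification.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)            # (2, 65, 64) and (2, 65, 65)
```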
Existing models based on CNNs and transformers typically extract features from the spatial and spectral domains, enabling efficient capture of local textures and global dependencies. However, hyperspectral data are characterized by high dimensionality, intrinsic complexity, and redundant information, making it difficult for conventional spatial and spectral feature extraction methods to fully reveal the underlying deep information. Frequency-domain techniques can decompose signals, reveal hidden structural patterns, and, to a certain extent, separate noise from informative signal, offering a fresh perspective for feature extraction that is difficult to achieve with traditional spatial and spectral methods. For instance, Tang et al. [56] employed convolution to extract high- and low-frequency components, while Qiao et al. [57] designed a multi-head neighborhood attention block and a global filter block to capture high- and low-frequency features, respectively. By decomposing the signal into multiple frequency components, these methods capture both fine-grained local variations and overall trends, enhancing classification robustness and accuracy. However, the method of [56] is limited in capturing global information, while that of [57] suffers from information compression and the high computational cost introduced by tokenization. In natural image analysis, Chen et al. [58] introduced the concept of Implicit Neural Representations (INRs) to map pixels and their spatial positions into a continuous space, further enhancing the representation of natural images. However, the INR approach falls short in capturing low-frequency information. To address this issue, Song et al. [59] proposed Orthogonal Position Encoding (OPE) to map image coordinates into a higher-dimensional space, thereby abstracting image features and achieving efficient arbitrary-scale super-resolution.
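To illustrate the general idea behind such coordinate encodings, the sketch below maps normalized pixel coordinates onto sinusoids at exponentially spaced frequencies, a generic Fourier-feature-style construction; the exact basis and normalization of OPE in [59] may differ.

```python
import math
import torch

def multi_frequency_encode(coords, n_freqs=4):
    """Map normalized coordinates in [-1, 1] to multi-frequency features.

    Low-frequency terms vary slowly across the image (global structure);
    high-frequency terms oscillate rapidly (local detail). The sin/cos pairs
    at each frequency form a mutually orthogonal basis over the domain.
    """
    feats = []
    for k in range(n_freqs):
        freq = (2.0 ** k) * math.pi           # frequencies: pi, 2*pi, 4*pi, ...
        feats.append(torch.sin(freq * coords))
        feats.append(torch.cos(freq * coords))
    return torch.cat(feats, dim=-1)

# Pixel coordinates of a hypothetical H x W scene, normalized to [-1, 1].
H, W = 32, 32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1)        # (H, W, 2)
encoding = multi_frequency_encode(coords)     # (H, W, 16): 2 coords x 4 freqs x sin/cos
```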
Inspired by these works, and to address the inadequate feature extraction and low utilization of spatial and spectral features in existing methods, this paper proposes a dual-branch multi-frequency feature enhancement network (DB-MFENet) for HSI classification, which maps image coordinates into a high-dimensional space through OPE and constructs a rich multi-frequency feature representation. The orthogonality of the encoding helps decouple different frequency components, preserving absolute positional information while establishing complementary relationships among frequencies, thereby fusing spatial and spectral information more efficiently. Compared with conventional positional encoding, OPE better matches the complex distribution of hyperspectral data in both the spatial and spectral dimensions, enhancing the discriminability and robustness of the overall feature representation. Specifically, the hyperspectral data are first decomposed into frequency components under multiple orthogonal bases by a multi-frequency feature extraction module; the low-frequency components represent the overall structure, while the high-frequency components capture local details and subtle spectral variations. Next, a dual-branch feature enhancement module extracts and enhances global spectral–spatial information and local fine-grained features separately, thereby avoiding the shallow information compression that can arise from traditional tokenization. Subsequently, a transformer encoder module employs multi-head self-attention to model the global dependencies of the fused features, enabling detailed integration of multi-scale information. Finally, category labels are generated by a linear classification layer; a schematic sketch of this pipeline is given after the contribution list below. The main contributions of this paper are as follows:
DB-MFENet approaches HSI classification from a frequency-domain perspective by mapping HSI coordinates into a high-dimensional space, thereby extracting multi-frequency features that seamlessly combine spatial and spectral information. The feature extraction process is completely parameter-free and requires no training.
Multi-frequency features are divided according to the frequency range in the positional encoding, and a dual-branch structure is proposed to separately enhance features of different frequencies. The low-frequency branch employs a global spectral–spatial attention (GSSA) mechanism to assign appropriate spatial weights to each spectral band, effectively emphasizing regions in the image that are crucial for low-frequency information. The high-frequency branch employs a Local Enhanced Spectral Attention (LESA) module to capture spectral information while simultaneously enhancing the perception of local spatial details.
By effectively combining CNNs and transformers, the spatial–spectral information in HSIs is comprehensively leveraged, resulting in a substantial enhancement in classification performance. Experiments conducted on four publicly available datasets further confirmed the efficacy of DB-MFENet.
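The following PyTorch sketch traces the data flow described above. The branch internals are placeholder blocks (the actual GSSA and LESA modules are defined later in the paper), so only the overall structure, not the implementation, should be read from it.

```python
import torch
import torch.nn as nn

class DBMFENetSketch(nn.Module):
    """Schematic of the DB-MFENet pipeline described in the text.

    The branch bodies below are stand-ins for the GSSA and LESA modules;
    only the data flow -- multi-frequency split, dual-branch enhancement,
    transformer encoding, linear classification -- mirrors the description.
    """
    def __init__(self, feat_dim=64, n_classes=16):
        super().__init__()
        self.low_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())   # placeholder for GSSA
        self.high_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())  # placeholder for LESA
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, low_freq, high_freq):      # per-token multi-frequency features
        low = self.low_branch(low_freq)          # enhance global structure
        high = self.high_branch(high_freq)       # enhance local detail
        fused = self.encoder(torch.cat([low, high], dim=1))  # model global dependencies
        return self.classifier(fused.mean(dim=1))            # pooled tokens -> class logits

logits = DBMFENetSketch()(torch.randn(2, 8, 64), torch.randn(2, 8, 64))  # (2, 16)
```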