1. Introduction
Hyperspectral images (HSIs) are acquired by imaging spectrometers operating across different regions of the electromagnetic spectrum, where each spectral band records the response of the same spatial scene within a narrow wavelength interval. Unlike conventional RGB images, which capture limited spectral information, HSIs provide near-continuous spectral data alongside rich spatial detail, enabling the detection of subtle differences among land cover types [1]. In HSI classification, each pixel is labeled according to its land cover type, yielding an accurate representation of geographic attributes [2]. This process is essentially a form of pixel-level semantic segmentation, as it requires assigning a semantic label to every pixel. Consequently, HSI classification has found extensive application in disease diagnosis [3], mineral resource discovery [4], environmental management [5], military intelligence [6], and other fields [7,8,9]. Moreover, it provides an important research direction for semantic segmentation methods, since both tasks aim to extract fine-grained spatial and spectral information from high-dimensional data to achieve precise target identification.
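For intuition, the minimal sketch below shows the data structures involved: a hyperspectral cube with one spectral vector per pixel and a label map of the same spatial extent. The 145 × 145 × 200 dimensions are illustrative placeholders, not a specific dataset.

```python
import numpy as np

# A hyperspectral cube: H x W spatial pixels, each with B spectral bands.
# The dimensions below are illustrative placeholders, not a specific dataset.
H, W, B = 145, 145, 200
cube = np.random.rand(H, W, B)             # per-band reflectance values

# HSI classification assigns one land-cover label to every pixel,
# i.e., it produces a label map with the same spatial extent as the cube.
labels = np.zeros((H, W), dtype=np.int64)  # one class index per pixel
print(cube.shape, labels.shape)            # (145, 145, 200) (145, 145)
```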
Traditional approaches to HSI classification typically follow a two-step process: manual feature extraction followed by classification. In early studies, techniques such as principal component analysis (PCA) [10], independent component analysis (ICA) [11], and nonlinear principal component analysis (NLPCA) [12] were employed to reduce data dimensionality by mapping high-dimensional features into a more compact space. After this preprocessing, traditional machine learning methods such as k-nearest neighbors (k-NN) [13,14], support vector machines (SVMs) [15,16], multilayer perceptrons (MLPs) [17], and random forests (RFs) [18] were extensively employed for classification. While these methods produced promising results in early HSI classification tasks, they often struggled to fully exploit the complex spectral and spatial information inherent in HSIs. Consequently, some researchers have focused on integrating spatial and spectral information. He et al. [19] decoupled a 3D Gabor filter, which captures both spectral and spatial attributes, into multiple sub-filters and introduced a novel Discriminative Low-Rank Gabor Filtering (DLRGF) method for HSI classification. Dalla et al. [20] sequentially applied morphological attribute filters to create a multi-level representation of HSIs, which was used to model various types of structural information. Although these methods harness spatial and spectral cues to enhance HSI classification, their reliance on hand-crafted extractors restricts their ability to learn robust representations.
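To make the two-step pipeline concrete, the sketch below reduces each pixel’s spectral vector with PCA and classifies it with an SVM. The random data, component count, and kernel settings are illustrative assumptions, not the configurations of the cited works.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Hypothetical labeled HSI data: each pixel is a B-dimensional spectral vector.
rng = np.random.default_rng(0)
H, W, B = 145, 145, 200
cube = rng.random((H, W, B))
labels = rng.integers(0, 16, size=(H, W))              # 16 land-cover classes

# Step 1: manual feature extraction -- project spectra onto a compact subspace.
pixels = cube.reshape(-1, B)                           # (H*W, B)
features = PCA(n_components=30).fit_transform(pixels)  # component count is illustrative

# Step 2: classification -- train a conventional classifier on a labeled subset
# and predict a label for every pixel in the scene.
train_idx = rng.choice(len(features), size=2000, replace=False)
clf = SVC(kernel="rbf", C=10.0)
clf.fit(features[train_idx], labels.reshape(-1)[train_idx])
pred_map = clf.predict(features).reshape(H, W)         # per-pixel classification map
```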
Deep learning, with its powerful feature representation capabilities, offers significant advantages over traditional algorithms in image classification and has been widely adopted in practical applications [21]. Unlike traditional techniques that rely on manual feature engineering, deep learning models enable end-to-end training and inherently learn highly discriminative representations, eliminating the need for manual intervention [22]. Methods ranging from stacked autoencoders (SAEs) [23] and deep belief networks (DBNs) [24] to various neural network models [25,26,27,28] have achieved promising classification results on HSIs. However, SAEs and DBNs still flatten pixels into one-dimensional vectors before feeding them into the network, which fails to preserve spatial information. Recurrent neural networks (RNNs) are inherently designed for sequential data, and their step-by-step processing makes it difficult to capture the complex spatial dependencies and high-dimensional feature representations in HSIs. Owing to their proficiency in extracting spatial patterns, convolutional neural networks (CNNs) have become an essential component of current HSI classification methods [29]. Earlier CNN approaches primarily concentrated on spectral data, representing each pixel’s spectral information as a 1D vector to capture intensity variations across wavelengths. Gao et al. [30] employed cascaded 1 × 1 and 3 × 3 convolutional layers to convert 1D spectral representations into 2D feature maps. While this design facilitated the reuse of features across spectral bands, it failed to incorporate spatial context. Chen et al. [31] proposed an enhanced 2D-CNN incorporating Gabor filtering to mitigate overfitting and preserve spatial details, although it underutilized spectral information. To harness the complementary strengths of spatial and spectral information, Zhang et al. [32] introduced a dual-branch architecture that uses a 1D-CNN to extract hierarchical spectral features and a 2D-CNN to capture spatial features, with the results fused for final classification. Chen et al. [33] designed a 3D-CNN that approaches the problem from a three-dimensional perspective, simultaneously capturing both spectral and spatial features in HSIs. Zhong et al. [34] introduced a supervised spectral–spatial residual network (SSRN) that arranges a series of 3D-CNN layers into dedicated spatial and spectral residual blocks, effectively capturing integrated features from both dimensions. To address the inability of 2D-CNNs to capture inter-spectral correlations while avoiding the complexity of 3D-CNN architectures, Roy et al. [35] introduced a 3D-2D CNN architecture, in which a 3D-CNN first extracts integrated spectral–spatial features and a 2D-CNN then refines these representations to capture higher-level spatial context. Wang et al. [36] proposed an end-to-end fast dense convolutional network for hyperspectral data classification, employing convolutional kernels of different sizes in a densely connected structure that separately extracts spectral and spatial features. Subsequently, Wang et al. [37] enhanced the dense convolutional network architecture with Cubic-CNN, which takes both the original image patches and dimensionally reduced 1D convolution features as inputs, effectively reducing feature redundancy. CNN-based methods directly capture integrated representations that combine spectral and spatial features through convolution operations [38,39], thereby exploiting the inherent spectral properties and spatial cues of HSIs. However, because they depend on local receptive fields and fixed convolutional kernels, these methods often struggle to capture long-range dependencies and global features [40,41,42].
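As a concrete illustration of the 3D-2D idea, the following PyTorch sketch applies 3D convolutions over a spectral–spatial patch and then refines the stacked feature maps with a 2D convolution. All layer sizes are assumptions chosen for illustration, not the exact architecture of [35].

```python
import torch
import torch.nn as nn

class Spectral2DSpatialCNN(nn.Module):
    """Minimal 3D-then-2D CNN sketch for patch-wise HSI classification."""
    def __init__(self, bands=30, patch=11, n_classes=16):
        super().__init__()
        # 3D convolutions slide over (band, height, width), extracting
        # joint spectral-spatial features from the input patch.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3)), nn.ReLU(),
        )
        out_bands = bands - 6 - 4          # spectral size after the two 3D convs
        out_sp = patch - 2 - 2             # spatial size after the two 3D convs
        # Fold the spectral axis into channels, then refine spatially in 2D.
        self.conv2d = nn.Sequential(
            nn.Conv2d(16 * out_bands, 64, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * (out_sp - 2) ** 2, n_classes)

    def forward(self, x):                  # x: (N, 1, bands, patch, patch)
        x = self.conv3d(x)
        n, c, b, h, w = x.shape
        x = x.reshape(n, c * b, h, w)      # merge channel and spectral axes
        x = self.conv2d(x)
        return self.fc(x.flatten(1))

logits = Spectral2DSpatialCNN()(torch.randn(2, 1, 30, 11, 11))  # (2, 16)
```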
The transformer is effective at capturing long-range interactions [43], prompting many researchers to integrate transformer modules into HSI classification frameworks. A transformer can establish global relationships between different locations and automatically learn important spatial and spectral features through its self-attention mechanism [44,45]. Furthermore, it can handle multi-dimensional features and flexibly adjust the weights assigned to different spectral bands through self-attention, helping to extract more useful information from complex hyperspectral data and thereby improving classification accuracy [46,47]. Panboonyuen et al. [48] used the transformer’s self-attention mechanism to capture long-range dependencies and contextual information in remote sensing images. He et al. [49] introduced an HSI-BERT model that leverages a global receptive field, allowing for dynamic input regions and capturing global pixel dependencies regardless of spatial distance. Similarly, Qing et al. [50] developed SAT-Net based on the transformer architecture, which employs both spectral and self-attention mechanisms to extract spatial and spectral features, effectively capturing long-range continuous spectral relationships. To better combine the strengths of CNNs in extracting local features and transformers in capturing global features, many studies have employed fusion models that integrate both. He et al. [51] extracted spatial features using a CNN-based backbone network while leveraging a transformer to capture the relationships between adjacent spectral bands. Similarly, Sun et al. [52] combined 3D and 2D convolutions to obtain shallow spectral–spatial features and then introduced the Spectral–Spatial Feature Tokenization Transformer (SSFTT) to capture more abstract semantic features. To address insufficient feature extraction and inadequate spatial–spectral representation in complex scenarios, researchers often fuse multi-scale or multi-feature approaches to enhance feature representation. Tan et al. [53] designed a context-driven feature aggregation network that fuses feature maps of multi-scale ground objects at different resolutions to improve the network’s ability to recognize small targets. Roy et al. [54] integrated the conventional transformer with morphological operations by processing input image patches through two parallel morphological modules. This dual-path approach extracts multi-scale, complementary spatial structural information; the outputs are then concatenated and augmented with a CLS token, enabling the transformer to capture spectral and spatial features from diverse perspectives. Sun et al. [55] employed multi-scale convolution to extract features, which were subsequently fused with local binary pattern spatial information to enrich the representation. After further refinement with various attention mechanisms, these enriched features were used for classification. These methods integrate shallow and deep features, explore the correlations between spatial and spectral feature sequences, and effectively improve classification performance. The advantages and limitations of several typical deep learning-based HSI classification methods are summarized in Table 1.
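The common ingredient in these transformer-based methods is multi-head self-attention over token sequences. The minimal sketch below, with hypothetical token counts and embedding size rather than the module of any specific cited method, shows how every token attends to every other token, which is what enables long-range spectral–spatial modeling.

```python
import torch
import torch.nn as nn

# A hypothetical token sequence: each hyperspectral patch is flattened into
# tokens (e.g., one per band group or spatial position) plus a CLS token.
tokens = torch.randn(2, 65, 64)            # (batch, 64 tokens + CLS, embed dim)

# Multi-head self-attention computes pairwise affinities between all tokens,
# so each spectral/spatial token can attend to every other one -- the
# long-range modeling ability the transformer brings to HSI classification.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)            # (2, 65, 64) and (2, 65, 65)
```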
Existing models based on CNNs and transformers typically extract features from the spatial and spectral domains, enabling efficient capture of local textures and global dependencies. However, hyperspectral data are characterized by high dimensionality, intrinsic complexity, and redundant information, making it difficult for conventional spatial and spectral feature extraction methods to fully reveal the underlying deep information. Frequency-domain techniques can decompose signals, reveal hidden structural patterns, and, to a certain extent, separate noise from informative signal, offering a fresh perspective for feature extraction that is difficult to achieve with traditional spatial and spectral methods. For instance, Tang et al. [56] employed convolution to extract high- and low-frequency components, while Qiao et al. [57] designed a multi-head neighborhood attention block and a global filter block to capture high- and low-frequency features, respectively. By decomposing the signal into multiple frequency components, these methods capture both fine-grained local variations and overall trends, enhancing classification robustness and accuracy. However, the method of [56] is limited in capturing global information, while that of [57] suffers from information compression and the high computational cost introduced by tokenization. In natural image analysis, Chen et al. [58] introduced the concept of Implicit Neural Representations (INRs) to map pixels and their spatial positions into a continuous space, further enhancing the representation of natural images. However, the INR approach falls short in capturing low-frequency information. To address this issue, Song et al. [59] proposed Orthogonal Position Encoding (OPE) to map image coordinates into a higher-dimensional space, thereby abstracting image features and achieving efficient arbitrary-scale super-resolution.
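To illustrate the general idea behind such coordinate encodings, the sketch below maps normalized pixel coordinates onto sinusoids at exponentially spaced frequencies, a generic Fourier-feature-style construction; the exact basis and normalization of OPE in [59] may differ.

```python
import math
import torch

def multi_frequency_encode(coords, n_freqs=4):
    """Map normalized coordinates in [-1, 1] to multi-frequency features.

    Low-frequency terms vary slowly across the image (global structure);
    high-frequency terms oscillate rapidly (local detail). The sin/cos pairs
    at each frequency form a mutually orthogonal basis over the domain.
    """
    feats = []
    for k in range(n_freqs):
        freq = (2.0 ** k) * math.pi           # frequencies: pi, 2*pi, 4*pi, ...
        feats.append(torch.sin(freq * coords))
        feats.append(torch.cos(freq * coords))
    return torch.cat(feats, dim=-1)

# Pixel coordinates of a hypothetical H x W scene, normalized to [-1, 1].
H, W = 32, 32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1)        # (H, W, 2)
encoding = multi_frequency_encode(coords)     # (H, W, 16): 2 coords x 4 freqs x sin/cos
```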
Inspired by these works, and to address the inadequate feature extraction and low utilization of spatial and spectral features in existing methods, this paper proposes a dual-branch multi-frequency feature enhancement network (DB-MFENet) for HSI classification, which maps image coordinates into a high-dimensional space through OPE and constructs a rich multi-frequency feature representation. The orthogonality of the encoding helps decouple different frequency components, preserving absolute positional information while establishing complementary relationships among frequencies, thereby fusing spatial and spectral information more efficiently. Compared with conventional positional encoding, OPE better matches the complex distribution of hyperspectral data in both the spatial and spectral dimensions, enhancing the discriminability and robustness of the overall feature representation. Specifically, the hyperspectral data are first decomposed into frequency components under multiple orthogonal bases by a multi-frequency feature extraction module; the low-frequency components represent the overall structure, while the high-frequency components capture local details and subtle spectral variations. Next, a dual-branch feature enhancement module extracts and enhances global spectral–spatial information and local fine-grained features separately, thereby avoiding the shallow information compression that can arise from traditional tokenization. Subsequently, a transformer encoder module employs multi-head self-attention to model the global dependencies of the fused features, enabling detailed integration of multi-scale information. Finally, category labels are generated by a linear classification layer; a schematic sketch of this pipeline is given after the contribution list below. The main contributions of this paper are as follows:
DB-MFENet approaches HSI classification from a frequency-domain perspective by mapping HSI coordinates into a high-dimensional space, thereby extracting multi-frequency features that seamlessly combine spatial and spectral information. The feature extraction process is completely parameter-free and requires no training.
Multi-frequency features are divided according to the frequency range in the positional encoding, and a dual-branch structure is proposed to separately enhance features of different frequencies. The low-frequency branch employs a global spectral–spatial attention (GSSA) mechanism to assign appropriate spatial weights to each spectral band, effectively emphasizing regions in the image that are crucial for low-frequency information. The high-frequency branch employs a Local Enhanced Spectral Attention (LESA) module to capture spectral information while simultaneously enhancing the perception of local spatial details.
By effectively combining CNNs and transformers, the spatial–spectral information in HSIs is comprehensively leveraged, resulting in a substantial enhancement in classification performance. Experiments conducted on four publicly available datasets further confirmed the efficacy of DB-MFENet.
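The following PyTorch sketch traces the data flow described above. The branch internals are placeholder blocks (the actual GSSA and LESA modules are defined later in the paper), so only the overall structure, not the implementation, should be read from it.

```python
import torch
import torch.nn as nn

class DBMFENetSketch(nn.Module):
    """Schematic of the DB-MFENet pipeline described in the text.

    The branch bodies below are stand-ins for the GSSA and LESA modules;
    only the data flow -- multi-frequency split, dual-branch enhancement,
    transformer encoding, linear classification -- mirrors the description.
    """
    def __init__(self, feat_dim=64, n_classes=16):
        super().__init__()
        self.low_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())   # placeholder for GSSA
        self.high_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())  # placeholder for LESA
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, low_freq, high_freq):      # per-token multi-frequency features
        low = self.low_branch(low_freq)          # enhance global structure
        high = self.high_branch(high_freq)       # enhance local detail
        fused = self.encoder(torch.cat([low, high], dim=1))  # model global dependencies
        return self.classifier(fused.mean(dim=1))            # pooled tokens -> class logits

logits = DBMFENetSketch()(torch.randn(2, 8, 64), torch.randn(2, 8, 64))  # (2, 16)
```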