1. Introduction
Hyperspectral imagery (HSI) contains a wealth of spectral information and comprises many bands, in some cases hundreds. This spectral information can be leveraged to classify important ground objects based on the characteristics exhibited across different bands. Feature extraction plays a pivotal role in HSI classification and has garnered growing interest among researchers. Hyperspectral remote sensing has made significant contributions in various domains, such as military applications [
1], medical research [
2], water quality monitoring [
3], and agricultural research [
4].
However, the presence of numerous spectral bands in hyperspectral data results in strong correlations between adjacent bands [
5]. This correlation leads to a significant amount of redundant information for classification tasks [
6]. Consequently, early approaches in hyperspectral classification primarily focused on data reduction techniques and feature engineering [
7,
8].
In recent years, with the advancements in deep learning, this technology has been increasingly adopted in various domains [
9,
10,
11], including hyperspectral remote sensing, and has achieved remarkable success [
6]. Deep learning models have the capability to extract meaningful knowledge from vast amounts of redundant data [
12]. The multi-layer structure of these models enables the acquisition of higher-level semantic information from the samples [
13].
Various deep learning models have been developed for hyperspectral data analysis, with convolutional neural network (CNN)-based models standing out due to their remarkable performance. Yu et al. [
14] introduced a CNN architecture that takes a single pixel as input, enabling the network to directly learn the relationships between different spectral bands. Chen et al. [
15] proposed a 3D-CNN model with sparse constraints that can directly extract spectral–spatial features from HSI. Ghaderizadeh et al. [
16] presented a hybrid 3D-2D CNN architecture. This hybrid CNN approach offers advantages over standalone 3D-CNN by reducing the model’s complexity and mitigating the impacts of noise and limited training samples.
In addition to CNNs, several other network architectures have demonstrated strong performance in HSI classification. Recurrent neural networks (RNNs) are capable of capturing both long-term and short-term spectral dependencies and have found widespread application in HSI classification [
17]. Fully convolutional networks (FCNs), a popular model in image segmentation, have been extensively employed in hyperspectral remote sensing tasks [
18]. Transformers, which have shown significant advancements in recent years, have also been successfully applied to HSI classification [
19,
20,
21,
22,
23]. Furthermore, graph convolutional networks (GCNs) have gained attention in HSI classification and have achieved notable performance [
24,
25].
However, the majority of these models for HSI analysis are primarily patch-based, necessitating laborious preprocessing steps and resulting in substantial storage requirements. Consequently, several studies [
20,
22,
26,
27] have attempted to address these challenges by directly performing semantic segmentation on HSI. In these approaches, HSIs are treated as multi-channel images, akin to conventional RGB images, and external ground object labels are employed for annotation. This process can be seen as manually marking and selecting regions of interest with ROI tools [
28]. During the loss calculation, only the known ground object types are considered for gradient computation using masks. Experimental verification has demonstrated the simplicity and effectiveness of this approach. Nevertheless, the spectral–spatial characteristics of hyperspectral images are often not fully taken into account by most existing methods. Yu et al. [
26] integrated Transformer features directly within the decoder part, overlooking the intrinsic global relationship between distinct patches [
25]. In a similar vein, Chen et al. [
20] employed a combination of convolution and Transformer in the encoder part to extract hyperspectral image (HSI) features. However, their approach models the spectral sequence in the upper layers of the model and the spatial characteristics in the lower layers, thereby neglecting consistent joint modeling of spectral–spatial characteristics.
Spatial–spectral fusion methods have been extensively employed in hyperspectral classification tasks for over a decade. Early research focused on analyzing the size, orientation, and contrast characteristics of spatial structures in images, followed by the utilization of support vector machines (SVMs) for classification purposes [
29]. Subsequent studies explored supervised classification of hyperspectral images through segmentation and spectral features extracted from partition clustering [
30]. Li et al. [
31] investigated the use of 3D convolutional neural networks (3DCNN) for direct spatial–spectral fusion in classification tasks. More recently, a two-stage method inspired by image denoising and segmentation was proposed in [
32] to merge spatial and spectral information. Moreover, Qiao et al. [
33] introduced a novel approach that captures information by concurrently considering the interactions between channels, spectral bands, spatial depth and width. However, it should be noted that these methods primarily operate at the patch level and may not be directly applicable to semantic segmentation tasks.
Some recent works [
34,
35] have focused on enhancing convolutional modules to better capture spatial and channel details, yielding impressive performance across various tasks. However, when applied to HSIs, extracting both spatial and spectral features comprehensively becomes crucial. Conventional 2D convolutions are insufficient for effective hyperspectral feature extraction, while 3D convolutions exhibit high complexity and parameter redundancy. Thus, to address these limitations holistically, there is a need to employ modules that can extract both spectral and spatial features in hyperspectral tasks, thereby replacing traditional 2D and 3D convolutions. Several studies in the field of HSI [
36,
37], use new modules with attention mechanisms and multi-scale features to replace traditional convolutions and have achieved good results in patch-based HSI classification tasks. However, these modules need to be used in conjunction with various other modules, and they carry a high parameter count and complexity, making them difficult to apply to semantic segmentation tasks.
To simplify the modeling of spectral–spatial relationships in hyperspectral imaging sequences and to establish a unified hyperspectral image semantic segmentation architecture, this paper proposes a novel image-based global spectral–spatial feature learning framework called MSSFF. In contrast to conventional classification methods, MSSFF utilizes the SSFM module to hierarchically model features in spectral–spatial sequences, resulting in outstanding classification performance even with a limited number of labeled samples (refer to
Figure 1). Firstly, in the encoder component, effective extraction of hyperspectral features is achieved by incorporating a spectral feature fusion module and a spatial feature fusion module. Secondly, an efficient Transformer is introduced between the encoder and decoder to capture global dependencies among deep feature nodes. Lastly, a spatial attention mechanism is employed in the upper layers of the model to model region-level features.
The contributions of this paper can be summarized as follows:
- (1)
The paper introduces the MSSFF framework, a new method for hyperspectral semantic segmentation. It reevaluates the importance of spectral and spatial features and incorporates them effectively into the encoder. The framework also includes a Transformer in the skip connection section to capture global spectral–spatial information from the feature map. This demonstrates the potential of Transformers in modeling spectral–spatial feature maps for hyperspectral remote sensing.
- (2)
We conducted a series of ablation experiments and module selection experiments to investigate the optimal depth of the hyperspectral semantic segmentation model. The results of these experiments confirmed that increasing the depth of the model beyond a certain point does not necessarily yield improved performance. Additionally, we explored the order of feature fusion and found that performing spectral feature fusion before spatial feature fusion yields better results. These findings suggest that considering spectral information before spatial information enhances the performance of the hyperspectral semantic segmentation model.
- (3)
We performed comparative experiments involving the patch-based method and the semantic segmentation method to assess the feasibility of our proposed approach in the field of hyperspectral semantic segmentation. The results of these experiments confirmed the effectiveness and viability of our method for hyperspectral semantic segmentation.
2. Method
As shown in
Figure 1, we find that shallow models can effectively classify HSIs, so we propose an end-to-end shallow semantic segmentation model. HSIs are rich in spatial and spectral information, and spectral and spatial correlations should be fully exploited for modeling. Therefore, in this work, we first propose a backbone that simultaneously extracts spatial and spectral features, using SSFM to replace the traditional convolution module, and at the end of the encoder we use a pyramid pooling strategy to capture context at multiple scales. In the decoder part, we follow the standard UNet architecture. However, we introduce the efficient Transformer in the skip connections to model the deep feature maps globally, and for the shallow (topmost) feature map, we use the spatial attention module for shallow feature extraction. Through the above modules, the accuracy of HSI classification is significantly improved. The following sections describe the core components of the framework.
The framework adopts an encoder–decoder architecture, and the encoder part is similar to ResNet18 [
38], but we use SSFM to replace the standard Conv module in ResNet. In general, we need to pad the boundaries of the input HSI so that its height and width become multiples of 16; for example, an Indian Pines image of size 145 × 145 × 200 is padded to 160 × 160 × 200. The padded HSI is input directly for forward computation. In the encoder part, we set the input channel number of the backbone's first convolutional layer to the number of HSI spectral channels. A pyramid pooling module (PPM) is introduced at the end of the encoder; the multi-scale features extracted by this aggregation module are highly effective for the framework's modeling. Residual connections between the PPM and the underlying feature maps further facilitate gradient backpropagation. In the decoder part, one upsampling layer and two convolutional layers form a group, and there are three such groups in total. Before the features of adjacent levels are fused, the encoder features are enhanced by the ET or SA module and then concatenated with the upsampled output of the lower-level features. The same operation is applied to each feature map of the encoder, and the final feature map is upsampled to the input size. To compute the loss, a small number of samples from the scene are used to construct a mask; for each batch output, gradients are computed only for the known (masked) samples, while unknown samples are ignored.
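As a concrete illustration of this masked loss, the following PyTorch sketch restricts the gradient to labeled pixels only. The tensor shapes, the `ignore_index` convention, and the function name are our own assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, labels, ignore_index=255):
    """Cross-entropy computed over labeled pixels only.

    logits: [B, num_classes, H, W] network output for the full image.
    labels: [B, H, W] ground-truth map; pixels without annotation (or excluded
            by the training mask) carry `ignore_index` and contribute no gradient.
    """
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

# Toy usage: a 2-class map where only a handful of pixels are labeled.
logits = torch.randn(1, 2, 8, 8, requires_grad=True)
labels = torch.full((1, 8, 8), 255, dtype=torch.long)  # all pixels unknown
labels[0, 2, 3] = 0
labels[0, 5, 6] = 1
loss = masked_cross_entropy(logits, labels)
loss.backward()  # gradients flow only through the two labeled pixels
```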
2.1. Spectral–Spatial Fusion Module (SSFM)
To enhance the feature extraction capabilities of traditional 2D convolutions in both the spectral and spatial domains, we introduce SSFM. Our approach extracts and fuses features from both the spectral and spatial dimensions. Specifically, SSFM applies the spectral feature fusion module first, followed by the spatial feature fusion module; the ordering of these modules is discussed in the experimental results section.
2.1.1. Spectral Fusion Module
In order to fully leverage the potential of spectral features, we propose the integration of a spectral feature fusion module, as depicted in
Figure 2. This module employs a split-extract-fusion strategy, which aims to address the challenges associated with extracting effective feature maps along the spectral dimension. In computer vision [
38,
39,
40], particularly in the context of HSIs, the use of repeated convolutions for feature extraction can pose difficulties in capturing informative spectral-specific features, which has been identified as a critical issue [
20,
21,
22]. Therefore, our proposed spectral feature fusion module provides a solution to overcome this flaw and improve the ability to extract meaningful spectral features in HSI analysis.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we first divide the features into two parts, $X_1$ and $X_2$, along the spectral dimension $C$. Both feature sets then undergo a $1\times1$ convolution that compresses their spectral dimension by half, resulting in $\hat{X}_1$ and $\hat{X}_2$. Next, the features of the upper branch are extracted using both $1\times1$ and $3\times3$ convolution modules, and the results are concatenated to obtain $Y_1$. Similarly, the features of the lower branch pass through a $1\times1$ convolution module while their original features are preserved, and concatenation is performed again to obtain $Y_2$.
To obtain the combined feature representation, $Y_1$ and $Y_2$ are concatenated, resulting in the total feature representation $Y$. Subsequently, an average pooling (Avg-Pooling) operation is applied to $Y$, and the resulting weights are divided into two parts corresponding to $Y_1$ and $Y_2$. These weights are used to reweight the respective feature sets, and the two weighted features are finally superimposed at the end of the module.
The following formulas summarize this process:
$$X_1, X_2 = \mathrm{Split}(X), \qquad \hat{X}_1 = W_{s1} X_1, \qquad \hat{X}_2 = W_{s2} X_2,$$
where the operation denoted by $\mathrm{Split}(\cdot)$ signifies the splitting of the input along the spectral dimension. Specifically, $W_{s1}$ and $W_{s2}$ are learnable weight matrices. These matrices are employed to facilitate the spectral-wise splitting and manipulation of the input features.
$$Y_1 = \mathrm{Concat}\big(W_{1\times1}\, \hat{X}_1,\ W_{3\times3}\, \hat{X}_1\big), \qquad Y_2 = \mathrm{Concat}\big(W'_{1\times1}\, \hat{X}_2,\ \hat{X}_2\big),$$
where $W_{1\times1}$ represents the weight matrix of the $1\times1$ convolution in the upper branch, $W_{3\times3}$ denotes the weight matrix of the $3\times3$ convolution, and $W'_{1\times1}$ corresponds to the weight matrix of the $1\times1$ convolution in the lower branch. These weight matrices are learnable parameters that are utilized within the given formulation for the various processing steps and transformations. The function Concat refers to concatenation along the spectral dimension.
After performing feature extraction, instead of directly concatenating or adding the two types of features, we adopt the approach proposed in [41,42] to selectively merge the output features of the feature extraction stage, denoted as $Y_1$ and $Y_2$. Subsequently, we apply global Avg-Pooling to aggregate the global spatial information and obtain $S$, which contains spectral statistics. Next, we normalize $S$ and multiply it element-wise with the feature map $Y$, generating the feature importance vector $U$. To further refine the feature representation, we split $U$ into two equal parts, yielding $U_1$ and $U_2$. Finally, we superimpose $U_1$ and $U_2$ to obtain the spectral refinement feature $Y_{\mathrm{spe}}$.
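The split-extract-fuse procedure above can be condensed into a short PyTorch sketch. This is a minimal reading of the description, not the authors' code: the exact channel widths, the sigmoid normalization of the pooled statistics, and the final 1×1 projection used to restore the input width are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralFusionModule(nn.Module):
    """Sketch of the split-extract-fuse spectral branch described above.

    Channel widths, the sigmoid normalization, and the output projection are
    illustrative assumptions rather than the authors' exact configuration.
    """
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        half, quarter = channels // 2, channels // 4
        # 1x1 compression of each split (halves the channel count)
        self.squeeze1 = nn.Conv2d(half, quarter, 1)
        self.squeeze2 = nn.Conv2d(half, quarter, 1)
        # upper branch: 1x1 and 3x3 extraction
        self.up_1x1 = nn.Conv2d(quarter, quarter, 1)
        self.up_3x3 = nn.Conv2d(quarter, quarter, 3, padding=1)
        # lower branch: 1x1 extraction (identity path keeps the original features)
        self.low_1x1 = nn.Conv2d(quarter, quarter, 1)
        self.proj = nn.Conv2d(half, channels, 1)  # restore the input width

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)            # split along the spectral dim
        x1, x2 = self.squeeze1(x1), self.squeeze2(x2)
        y1 = torch.cat([self.up_1x1(x1), self.up_3x3(x1)], dim=1)
        y2 = torch.cat([self.low_1x1(x2), x2], dim=1)
        y = torch.cat([y1, y2], dim=1)               # combined representation Y
        w = torch.sigmoid(F.adaptive_avg_pool2d(y, 1))  # normalized spectral statistics
        u1, u2 = torch.chunk(w * y, 2, dim=1)        # importance-weighted halves
        return self.proj(u1 + u2)                    # superimpose and project

x = torch.randn(2, 64, 32, 32)
print(SpectralFusionModule(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```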
2.1.2. Spatial Fusion Module
To ensure the encoder effectively captures spatial features, we propose the integration of a spatial feature fusion module, as illustrated in
Figure 3. This module employs separation and fusion operations to enhance its functionality. The primary objective of the separation operation is to distinguish informative feature maps from those containing comparatively less relevant spatial content. By subsequently fusing feature maps that possess rich information with those exhibiting lesser information, we can extract more comprehensive feature information than what can be achieved through convolution operations alone.
Specifically, we apply group normalization (GN) to a given feature $X \in \mathbb{R}^{C \times H \times W}$. GN partitions the spectral dimension of the input into 16 groups and independently computes the mean $\mu$ and variance $\sigma^2$ of each group. The mean is obtained by averaging the values within a group, while the variance is determined by computing the squared differences between each value and the mean and then averaging them. Subsequently, the activations within each group are normalized by subtracting the group mean and dividing by the square root of the group variance. This normalization ensures consistent and efficient feature scaling within each group. GN introduces learnable parameters, namely a scaling factor and a shifting factor, which allow the network to learn an optimal scaling and shifting of the normalized activations. The scaling factor $\gamma$ adjusts the normalized values, allowing fine-grained control of the feature representation, while the shift factor $\beta$ introduces a bias to the normalized values, aiding in capturing higher-order feature interactions.
Simultaneously, the scaling factor $\gamma$ of the GN layer serves as an indicator of the variance of spatial pixels within each spectral channel: richer spatial information results in a larger $\gamma$ value. To obtain the weights of the different feature maps, the features are multiplied by the $\gamma$ weights of the GN layer, and a sigmoid function then maps the resulting values to the interval [0, 1]. This process enables effective modulation and normalization of the feature representations.
Subsequently, a mask is constructed from the feature weights based on a threshold of 0.5: values greater than or equal to 0.5 are assigned to $W_1$, while values less than 0.5 are assigned to $W_2$. These divisions result in two weighted features: $X^{w_1}$, representing the information-rich feature, and $X^{w_2}$, representing the less informative feature. To enhance the spatial feature fusion capability of the module and reduce spatial redundancy, the information-rich feature is added to the less informative feature. This is followed by a cross-reconstruction operation that facilitates comprehensive integration of the two weighted features, allowing effective information exchange and generating more informative features. The resulting cross-reconstructed features are then concatenated to obtain spatial detail features that capture fine-grained spatial information.
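A minimal PyTorch sketch of this separate-and-fuse step is given below. It follows our reading of the description: the normalized GN scale $\gamma$ is used as the per-channel information measure, the 0.5 threshold comes from the text, and the half-and-half cross-reconstruction layout is an assumption.

```python
import torch
import torch.nn as nn

class SpatialFusionModule(nn.Module):
    """Sketch of the GN-based separate-and-fuse spatial branch described above."""
    def __init__(self, channels, groups=16, threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x):
        xn = self.gn(x)
        # per-channel importance derived from the learnable GN scale (gamma)
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        weights = torch.sigmoid(xn * w_gamma)
        info_mask = (weights >= self.threshold).float()   # information-rich part
        non_mask = (weights < self.threshold).float()     # less informative part
        x_info, x_non = x * info_mask, x * non_mask
        # cross-reconstruction: exchange halves of the two weighted features
        x11, x12 = torch.chunk(x_info, 2, dim=1)
        x21, x22 = torch.chunk(x_non, 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)

x = torch.randn(2, 64, 32, 32)
print(SpatialFusionModule(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```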
2.2. Efficient Transformer (ET)
The standard Transformer model exhibits limitations in terms of high computational complexity and a lack of explicit spatial structure modeling. To address these shortcomings, researchers have proposed various enhanced Transformer models aimed at improving their performance in computer vision tasks. For instance, attention mechanism improvements [
43], locality-based attention [
44], and hybrid models [
45] have been developed. Consequently, it is valuable to explore the integration of Transformer with convolutional models.
Recent research endeavors [
46,
47] have focused on replacing positional embedding in the Transformer model with convolution operations. By incorporating convolution operations into the Transformer, it becomes possible to effectively combine local and global features. Building upon the aforementioned concept, we present the ET that utilizes convolutional operations to effectively reduce the dimensionality of the feature space while capturing positional information. The architecture of ET is depicted in
Figure 4. Furthermore, we introduce convolutional layers at both the input and output of the module to enhance the extraction of spatial features.
Space-reduced Efficient Multi-head Self-Attention (SEMSA) operates in a similar manner to Multi-head Self-Attention (MSA): it takes $Q$ (query), $K$ (key), and $V$ (value) as input and produces features of the original size as output. A key distinction, however, is that SEMSA reduces the spatial scale of $K$ and $V$ before the attention operation, which significantly diminishes the computational and memory overhead.
Specifically, in our study, we employ SEMSA as a replacement for the traditional MSA in the encoder module. Each instance of the ET comprises an attention layer and a feed-forward network (FFN). Considering the high-resolution feature maps involved in hyperspectral semantic segmentation, we utilize a convolutional spatial-reduction operation (SR) to reduce the spatial dimension of these feature maps while simultaneously learning spatial information. The SEMSA of stage $i$ can be expressed as follows:
$$\mathrm{SEMSA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_N\big)\, W^{O}.$$
Then, the $i$-th head $\mathrm{head}_i$ can be expressed by the following formula:
$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\ \mathrm{SR}(K)\, W_i^{K},\ \mathrm{SR}(V)\, W_i^{V}\big),$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent linear projection matrices, $W^{O}$ is the output projection matrix, and the size $d_{\mathrm{head}}$ of each head is equal to $C/N$. Here, $N$ represents the number of attention heads. The function $\mathrm{SR}(\cdot)$ denotes the utilization of convolution to reduce the dimensionality of the input feature space based on the reduction rate $R$:
$$\mathrm{SR}(x) = \mathrm{Norm}\big(\mathrm{Reshape}(x, R)\, W^{S}\big),$$
where $x \in \mathbb{R}^{(H \times W) \times C}$, in which $H \times W$ represents the spatial dimensions of the input and $C$ denotes the number of spectral channels. The reduction rate is denoted as $R$. The operation $\mathrm{Reshape}(x, R)$ refers to transforming $x$ into a new shape of $\frac{H \times W}{R^{2}} \times (R^{2} C)$. Here, $W^{S} \in \mathbb{R}^{(R^{2} C) \times C}$ corresponds to a linear projection matrix.
The attention calculation is defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V,$$
where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, and the variable $d$ represents the dimension of the sequence.
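The following PyTorch sketch mirrors the SEMSA formulation above, using a strided convolution as the SR operation; the head count, reduction rate, and normalization placement are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class SEMSA(nn.Module):
    """Sketch of spatial-reduction multi-head self-attention as formulated above."""
    def __init__(self, dim, num_heads=4, reduction=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # SR: strided convolution reduces the spatial scale of K and V by R
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        b, n, c = x.shape                        # n = h * w tokens
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # reduce the spatial resolution of the tokens used for K and V
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)         # each: [b, heads, n/R^2, head_dim]
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

tokens = torch.randn(2, 16 * 16, 64)
print(SEMSA(64)(tokens, 16, 16).shape)  # torch.Size([2, 256, 64])
```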
2.3. Pyramid Pooling Module (PPM)
The PPM is shown in
Figure 5. For the hyperspectral semantic segmentation task, it is crucial to consider spatial features at different scales. Utilizing pooling modules with varying sizes allows for the extraction of spatial feature information at different scales, thereby enhancing the model’s robustness. To further address the loss of context information between different subregions, approaches such as [
48,
49] have introduced a hierarchical global prior structure. By incorporating language information from various scales and subregions, a global scene prior can be constructed based on the final layer feature map of the deep neural network, leading to significant improvements in region segmentation accuracy.
To implement this, the input feature map is transformed into four feature maps with different spatial sizes. Subsequently, 1×1 convolutions are applied to reduce the dimensionality of the four feature maps. Next, the four feature maps are resized to match the size of the input feature map using linear interpolation. Finally, the input feature map is concatenated with the four interpolated feature maps.
The above process can be expressed by the formula
$$Y = \mathrm{ConvModule}\Big(\mathrm{Concat}\big(X,\ \mathrm{Up}(P_1(X)),\ \ldots,\ \mathrm{Up}(P_n(X))\big)\Big),$$
where $X$ denotes the input feature map, $P_i(X)$ represents the outcome of the $i$-th pooling operation applied to the input feature map (followed by a $1\times1$ convolution), $\mathrm{Up}(\cdot)$ denotes interpolation back to the input size, and the variable $n$ signifies the number of pooling operations employed within the PPM. The function Concat refers to the concatenation of all pooling results along the spectral dimension. Lastly, ConvModule represents a module encompassing convolution, batch normalization, and ReLU activation.
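A compact PyTorch sketch of the PPM is given below; the bin sizes (1, 2, 3, 6) and the branch width follow common PSPNet practice and are assumptions, since the text only specifies four pooling scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Sketch of the pyramid pooling module summarized by the formula above."""
    def __init__(self, in_channels, branch_channels=64, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                                 # pool to b x b
                nn.Conv2d(in_channels, branch_channels, 1, bias=False),  # 1x1 reduction
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])
        # ConvModule applied to the concatenation of the input and all branches
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + branch_channels * len(bins), in_channels,
                      3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        size = x.shape[-2:]
        feats = [x] + [
            F.interpolate(branch(x), size=size, mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(2, 256, 16, 16)
print(PPM(256)(x).shape)  # torch.Size([2, 256, 16, 16])
```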
2.4. Spatial Attention (SA)
The spatial attention in our work is modified from that in [
39]. To apply SA, we first reduce the dimensionality of the channel features. Then, we perform average pooling and maximum pooling operations on the features to obtain corresponding results using the “avg” and “max” operations, respectively. These pooled features are concatenated together to form a single feature map.
Next, we utilize a two-dimensional convolutional layer with a kernel size of (7, 7) to process the concatenated feature map. This operation can be represented by the following formula:
$$Y = \sigma\Big(W_{7\times7}\big(\mathrm{Concat}[\mathrm{AvgPool}(X),\ \mathrm{MaxPool}(X)]\big)\Big),$$
where $W_{7\times7}$ represents a learnable weight matrix, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent the average-pooling and max-pooling operations, respectively, $\sigma$ represents the sigmoid activation function, and $Y$ represents the module output features.
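A minimal PyTorch sketch of this spatial attention is shown below. It follows the CBAM-style formulation above; multiplying the attention map back onto the input feature is our assumption about how the module output is used.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the CBAM-style spatial attention described above."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)         # "avg" pooling over channels
        max_map = x.max(dim=1, keepdim=True).values   # "max" pooling over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                               # re-weight spatial positions (assumed)

x = torch.randn(2, 64, 32, 32)
print(SpatialAttention()(x).shape)  # torch.Size([2, 64, 32, 32])
```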
3. Experiments
3.1. Experimental Platform Parameter Settings
All experiments were conducted on a Windows 11 system equipped with an Intel (R) Core (TM) i5 10400 CPU @ 2.90 GHz processor and Nvidia GeForce RTX 3060 graphics card. To minimize experimental variability, the model adopts a controlled sampling approach by selecting a limited number of samples from the dataset for training. The experiment is conducted over 150 epochs, and all reported results are averaged over 5 independent experiments to ensure statistical significance. The model employs the AdamW optimizer with default parameters and initializes the learning rate to
. The loss function uses the standard cross-entropy, and the training process is the same as that in the literature [
20,
26]. We employ the hierarchical mask sampling method for calculating the loss function in our model. Specifically, we utilize masks to isolate relevant regions and compute the cross-entropy loss between the masked vectors and the corresponding ground truth objects. However, the presence of imbalanced class distributions and significant inter-class variations pose challenges. To address this, we adopt a strategy of random pixel sampling for known ground object categories. In this approach, we randomly select five pixels from each ground object category during multiple sampling iterations. This ensures comprehensive coverage of all known feature categories.
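The per-class random pixel sampling described above can be sketched as follows; the label conventions (0 as unlabeled background, 255 as the ignore value consumed by the masked loss) and the function name are assumptions made for illustration.

```python
import numpy as np

def sample_training_mask(gt, per_class=5, ignore_value=255, seed=0):
    """Build a sparse training label map by randomly drawing a few pixels per class.

    gt: [H, W] ground-truth map; 0 marks unlabeled background (an assumed
        convention). Pixels not drawn are set to `ignore_value` so they are
        skipped by the masked cross-entropy loss.
    """
    rng = np.random.default_rng(seed)
    train = np.full_like(gt, ignore_value)
    for cls in np.unique(gt):
        if cls == 0:                      # skip unlabeled background
            continue
        rows, cols = np.nonzero(gt == cls)
        take = min(per_class, rows.size)  # at most `per_class` pixels per class
        idx = rng.choice(rows.size, size=take, replace=False)
        train[rows[idx], cols[idx]] = gt[rows[idx], cols[idx]]
    return train

gt = np.random.randint(0, 5, size=(16, 16))
mask = sample_training_mask(gt)
print((mask != 255).sum())  # at most 5 labeled pixels per ground-object class
```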
To verify the validity of the proposed method, the segmentation performance of our proposed method (MSSFF) is compared with several alternative methods, encompassing both patch-based approaches and semantic segmentation methods. The experiments are conducted on four publicly available datasets, namely Indian Pines (IA), Pavia University (PU), Salinas (SA), and Houston (HU). To evaluate the performance of the various models for HSI classification, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (K) are utilized as evaluation metrics.
3.2. Datasets
3.2.1. Indian Pines (IA)
The Indian Pines dataset was captured over a farm test site in northwest Indiana and was collected using the airborne AVIRIS sensor. In this paper, 200 bands are used for classification after the water-absorption and low signal-to-noise-ratio bands are removed. During the experiments, 10% of the samples of each ground object class were selected for training, and the remaining samples were used for testing. When the number of selected samples of a class was fewer than five, five samples were used. The specific training and test samples are shown in
Table 1.
3.2.2. Pavia University (PU)
The Pavia University dataset was acquired over the University of Pavia, northern Italy, and was collected by the airborne ROSIS sensor. In this paper, 103 bands are used for classification after the noise-affected bands are removed. During the experiments, 1% of the samples of each ground object class were selected for training, and the remaining samples were used for testing. The specific training and test samples are shown in
Table 2.
3.2.3. Salinas (SA)
The Salinas dataset was acquired over the Salinas Valley, California, USA, and, like the Indian Pines dataset, it was collected using the airborne AVIRIS sensor. Unlike Indian Pines, however, it has a spatial resolution of 3.7 m. During the experiments, 1% of the samples of each ground object class were selected for training, and the remaining samples were used for testing. The specific training and test samples are shown in
Table 3.
3.2.4. Houston (HU)
The Houston dataset was acquired using the ITRES CASI-1500 sensor in the vicinity of the University of Houston, Texas, USA, including nearby rural areas. This dataset serves as a benchmark and is commonly utilized to evaluate the performance of land cover classification models. The hyperspectral dataset consists of 349 × 1905 pixels with 144 wavelength bands spanning from 364 to 1046 nm at 10 nm intervals. During the experiment, 5% of each type of ground object was selected for training, and the remaining samples were used for testing. The specific training samples and test samples are shown in
Table 4.
3.3. Comparative Experiment
Table 5,
Table 6,
Table 7 and
Table 8 present a comparative analysis of our proposed model alongside several patch-based frameworks, such as M3DCNN [
50], HyBridSN [
51], A2S2K [
52], ViT [
53], and SSFTT [
54]. Additionally, the experimental results of Unet [
55], PSPnet [
48], Swin [
44], and SegFormer [
47], which are based on semantic segmentation frameworks, are also included for comparison. It is worth noting that semantic segmentation-based methods demonstrate superior performance in capturing global spatial information and exhibit significant advantages, particularly in scenarios with imbalanced training samples.
The experimental findings demonstrate the significant advantages of MSSFF over both patch-based models and various semantic segmentation models. Specifically, M3DCNN, as a conventional 3DCNN model, suffers from parameter redundancy and inadequate extraction of spectral and spatial features, resulting in the poorest performance. ViT overlooks the unique characteristics of hyperspectral data by solely modeling the spectral sequence without considering the spectral similarity of ground objects, leading to subpar results. In contrast, HyBridSN leverages the strengths of both 3DCNN and 2DCNN, yielding certain improvements and highlighting the importance of reducing feature redundancy in hyperspectral analysis. A2S2K adopts a residual-based 3DCNN approach in which residual blocks are introduced into the hyperspectral domain. This design choice enables the model to effectively capture and exploit residual information, enhancing its ability to learn complex spatial and spectral features from hyperspectral data. Consequently, better results are achieved, although the computational complexity and parameter count of 3DCNN remain high. SSFTT employs a combination of 3DCNN and 2DCNN for feature extraction and incorporates a Transformer to globally model the feature map. Notably, SSFTT outperforms the other patch-based methods, underscoring the effectiveness of Transformers in modeling deep feature maps.
However, the encoder component of Unet fails to fully consider the spatial and spectral characteristics of HSIs, resulting in poor performance, particularly in the AA metric, which indicates significant misclassification by the Unet model. PSPNet shares the same encoder as Unet but introduces the PPM in the decoder to effectively capture semantic information at multiple scales, leading to improved performance. The Swin Transformer incorporates a Transformer in the encoder to globally model spectral and spatial features and uses UperNet in the decoder, enabling the capture of semantic information at various scales. Consequently, the Swin Transformer achieves favorable results; however, Transformers still exhibit feature redundancy compared to convolutional methods.
In contrast, SegFormer leverages an efficient Transformer as the encoder while designing a simple and lightweight MLP decoder to reduce feature redundancy, resulting in outstanding performance across multiple tasks. Nevertheless, using a pure Transformer as the encoder for hyperspectral tasks may introduce invalid modeling, leading to poor model stability. To address this concern, MSSFF introduces SSFM, which considers both spectral and spatial features, as a replacement for the standard 2DCNN. The modification enhances stability and reduces model complexity. Additionally, MSSFF incorporates an efficient Transformer in the deep feature map, aligning with the findings of previous literature [
54]. By considering feature extraction ability and model complexity, MSSFF achieves the best performance across the evaluated datasets.
The classification results of different methods are presented in
Figure 6,
Figure 7,
Figure 8 and
Figure 9. It can be observed from the figures that M3DCNN and ViT produce a significant number of misclassifications, particularly for ground objects that exhibit similar spectral characteristics. HyBridSN, A2S2K, and SSFTT show some improvements, although instances of misclassification remain. Unet and PSPNet, which take spatial characteristics into account, notably reduce misclassification in the central areas of ground objects, but misclassification still occurs where the edges of different ground objects meet. Swin and SegFormer employ a hierarchical Transformer as the encoder, providing a global receptive field; nevertheless, misclassifications remain for ground objects with similar spectral and spatial characteristics. MSSFF shows significant improvements in mitigating misclassifications for such ground objects, with only very few misclassifications occurring at the edges of different ground objects. Overall, MSSFF exhibits excellent classification performance for diverse ground objects, fully considering their spectral and spatial characteristics.
3.4. Model Analysis
To verify the effectiveness of each component in the proposed MSSFF framework, this section focuses on conducting ablation experiments. Additionally, we also explore the selection of the number of layers in the encoder and the sequencing of the spectral feature fusion module and the spatial feature fusion module in SSFM.
3.4.1. Ablation Experiments
We conducted a series of ablation experiments to assess the individual contributions of the modules in the MSSFF method. The results of the ablation experiments are shown in
Table 9. The MSSFF method comprises four modules: SSFM, PPM, ET, and SA. During the ablation experiments, we systematically removed these modules and evaluated the resulting changes in the classification metrics, namely OA, AA, and K.
When all modules were removed, the classification metric scores were relatively low, indicating the significant role of these modules in improving classification performance. Specifically, when only the PPM was used, there was a significant improvement in the classification metrics, demonstrating its favorable impact on classification performance. Building upon the PPM, the addition of the ET module further improved the metrics, highlighting its positive influence. The inclusion of the SA module resulted in slight improvements in the classification metrics; although the observed gains were small, they still indicate the contribution of the SA module. Finally, when all modules (SSFM, PPM, ET, and SA) were utilized, the classification metrics (OA, AA, and K) achieved their highest levels. This observation underscores the effectiveness of combining these modules in improving the hyperspectral classification performance of the MSSFF method.
Figure 10 illustrates the visualization of feature maps obtained from the MSSFF framework using SSFM and ET modules. A careful selection of representative feature maps was made for visual comparison, revealing that the visualization results obtained with the SSFM module exhibit enhanced refinement, capturing finer details such as object edges, contours, and texture structures. On the other hand, the visualization results obtained with the ET module demonstrate a wider receptive field and a greater emphasis on the overall context compared to those without ET. This visual analysis provides compelling evidence for the effectiveness and superiority of the designed SSFM and ET modules in the MSSFF framework.
3.4.2. Comparative Analysis of Attention Modules in MSSFF
We consider the impact of various types of attention modules on MSSFF. Specifically, we study and compare multiple existing attention mechanisms, including self-attention, channel attention, and spatial attention. Each attention module provides unique capabilities to capture different types of dependencies and enhances feature representation. Through comprehensive experiments, we identify the most effective attention module based on the characteristics of the dataset and the task goals. This systematic approach improves the performance of our deep learning models and enhances model interpretability. As shown in
Table 10, the ET module achieved the best results on all three datasets.
3.4.3. Fusion Module Order Selection
The results of the sequential selection experiments conducted on the spectral feature fusion module and spatial feature fusion module in SSFM are presented in
Table 11. The feature fusion module employed in SSFM shares similarities with CBAM [
39], as both require careful consideration of the order in which spectral and spatial dimensions are modeled. To comprehensively evaluate the impact of feature fusion, we divided the experiments into two parts: Space-Spectral and Spectral-Space.
Interestingly, our findings indicate that fusing the spectral-dimension features of hyperspectral data before fusing the spatial dimensions yields better results. We speculate that fusing the spatial dimensions first disrupts the spectral features, reducing the effectiveness of the subsequent spectral feature fusion.
3.4.4. Explore the Layers of Encoder
Regarding the impact of different layers in the encoder on the model, the corresponding results are presented in
Table 12. Recent literature [
20,
54,
57,
58] has demonstrated the effectiveness of shallower models in hyperspectral object classification tasks. Therefore, we conducted an exploration by varying the number of layers in the encoder to assess their influence on model performance.
Table 12 clearly indicates that the number of layers in the encoder does not necessarily follow a “deeper is better” trend. Specifically, the model’s performance does not consistently improve as the number of layers increases. On the contrary, there is a downward trend in model performance with an increasing number of layers. This phenomenon can be attributed to the introduction of excessive redundant information by overly deep encoders when processing hyperspectral data, which subsequently hampers model performance.
Based on these observations, we can conclude that for hyperspectral object classification tasks, a shallower encoder may be more suitable, and an excessively deep encoder does not necessarily lead to performance improvements. Thus, when designing the model, the number of layers in the encoder should be considered in a comprehensive manner, and an appropriate number of layers should be selected to achieve the optimal performance.
3.4.5. Mean Squared Error (MSE) Discussion on Different Methods
Although the confusion matrix reflects the significant differences between categories, we observed that the patch-based methods (HyBridSN, A2S2K, and SSFTT) exhibit similar Kappa coefficients, OA, and AA. Comparing these significance differences alone is insufficient to fully explain the relative merits of these methods. Therefore, we conducted further testing using the MSE metric on the different datasets. The experimental results are shown in
Table 13.
Through the analysis of the MSE metric, we have found that the SSFTT method demonstrated a distinct advantage over A2S2K and HyBridSN across all datasets. Particularly, on the lower-resolution IA and SA datasets, A2S2K showed relatively better performance compared to HyBridSN. However, on the higher-resolution PU dataset, A2S2K exhibited relatively poorer performance.
4. Conclusions
In this paper, we propose an architecture called MSSFF that effectively combines spectral and spatial features for accurate hyperspectral semantic segmentation. MSSFF incorporates spectral and spatial feature aggregation modules within the encoder, allowing for the fusion of features and the generation of hierarchical representations. Additionally, in the deep layers of the encoder, we introduce a PPM for aggregating multi-scale semantic information. In the skip connection part, we employ an efficient Transformer to perform global modeling on deep feature maps, while utilizing a spatial attention mechanism for local feature extraction on shallow feature maps. Consequently, MSSFF exhibits strong capabilities in feature extraction as well as local–global modeling.
The performance of MSSFF was evaluated on four benchmark datasets, and it consistently outperformed other methods in terms of key evaluation metrics, including OA, AA, and Kappa. These results highlight the remarkable potential of MSSFF for hyperspectral semantic segmentation tasks, confirming its superiority over existing approaches.
Furthermore, we conducted an investigation into the impact of the number of encoder layers on the model's performance. Our analysis revealed that deeper models do not necessarily yield better results, with the optimal performance achieved when the number of layers is set to four. In future research, we plan to explore the feasibility of even shallower models for hyperspectral semantic segmentation and investigate the deployment of lightweight hyperspectral semantic segmentation models on resource-constrained devices.