1. Introduction
Crop classification is an important application of hyperspectral remote sensing technology [1]. Thanks to its numerous narrow spectral bands, hyperspectral imaging has been widely applied in fields such as environmental monitoring, agriculture, mineral exploration, and urban planning [2,3,4,5]. Hyperspectral images capture continuous spectral information of target objects across a wide spectral range, enabling the detection of subtle differences in surface spectral reflectance [6,7,8]. However, the high morphological and color similarity among crops, together with complex backgrounds (e.g., mixtures of weeds and crops), makes fine-grained crop classification a significant challenge.
In recent years, advances in computational power and machine learning algorithms [9,10], particularly deep learning methods, have driven rapid progress in crop hyperspectral image classification [11,12,13]. These advances have not only improved classification accuracy but have also made significant strides in handling high-dimensional data and reducing computational complexity [14,15]. Deep learning-based methods can automatically extract deep abstract features of ground objects from hyperspectral images, which has led to widespread applications in the remote sensing field [16,17,18]. Among these, convolutional neural networks (CNNs) can effectively capture local details and deep semantic features in images through convolution operations [19]. Zhang et al. [20] used 2D CNNs to effectively extract complex features of desert grasslands. Falaschetti et al. [21] applied CNNs to pest and disease recognition in plant leaf images. Although CNNs have achieved remarkable results in extracting image features, they face limitations in handling long-range dependencies and capturing global features, which restricts their application in hyperspectral image classification [22]. This is especially true for crop hyperspectral images, where the high color similarity between crops and weeds makes it difficult to accurately distinguish different crop types.
Recently, image classification methods based on the transformer model have garnered widespread attention from researchers [23]. Through its self-attention mechanism, the transformer can directly model global dependencies and capture richer feature representations, making it particularly suitable for handling the spatial complexity and high-dimensional spectral characteristics of hyperspectral data [24]. Unlike CNNs, which rely on convolution operations, transformers aggregate features from different scales and spatial locations through self-attention and multi-head attention, demonstrating stronger expressiveness and flexibility in hyperspectral image classification [25]. This approach has provided new ideas and tools for hyperspectral image classification, driving further advancements in the field. Yang et al. [26] applied the transformer model to hyperspectral images and demonstrated superior classification performance compared to CNNs. Additionally, some studies have shown that combining CNNs’ local feature extraction capability with the transformer’s ability to capture long-range dependencies can further improve hyperspectral image classification performance. For example, Zhang et al. [27] introduced convolution operations within the multi-head self-attention mechanism to better capture local–global spectral features of hyperspectral images. Wang et al. [28] improved the ResNet-34 network to extract features from hyperspectral images and then passed these features into a transformer to capture global dependencies, thus enhancing classification performance. Although the combination of CNNs and transformers effectively improves hyperspectral image classification, these methods overlook the spatial information loss caused by the transformer’s serialization of the input.
To address the transformer’s shortcomings in learning spatial information, researchers have proposed several effective solutions that combine CNNs and transformers to improve model performance. For example, Zhao et al. [29] achieved feature fusion across different layers by interactively combining CNNs and transformers. This approach effectively combines CNNs’ advantages in local feature extraction with transformers’ capabilities in global feature modeling, thereby enhancing the learning of spatial information. However, this fusion strategy still faces challenges such as increased model complexity and reduced training efficiency. Li et al. [30] proposed a hierarchical feature fusion strategy based on 3D-CNNs, aimed at joint learning of spatial–spectral features. This method captures spatial and spectral information in hyperspectral images more comprehensively through 3D convolution operations. However, in practical applications, the fixed-size image sampling strategy still introduces a large amount of heterogeneous background information, which may interfere with the model’s ability to predict crop labels. Moreover, because the background and crops are spectrally similar, this background interference becomes especially pronounced in various scenarios, leading to a decline in classification performance.
To better address these challenges, this paper proposes a semantic-guided transformer network (SGTN) for the fine-grained classification of crop hyperspectral images. SGTN introduces a multi-scale spatial–spectral information extraction (MSIE) module, specifically designed to effectively model crop variations at different scales, thereby reducing the impact of background information. This module not only captures the changing characteristics of crops at multiple scales but also enhances the richness and accuracy of features, laying a solid foundation for subsequent classification tasks. Furthermore, SGTN includes a semantic-guided attention (SGA) module, which improves the model’s sensitivity to crop semantic information. By precisely focusing on crop regions, this module effectively reduces background interference and improves classification accuracy. Through the integrated use of the MSIE and SGA modules, SGTN excels in feature extraction and fusion, significantly addressing the limitations encountered by existing methods in hyperspectral image classification tasks. This paper aims to enhance the fine-grained classification accuracy of crops, providing technical support for the development of precision agriculture. The main contributions of this article are as follows:
- (1) The SGTN model is proposed for hyperspectral image crop classification, achieving a balance between accuracy and speed.
- (2) The MSIE and SGA modules are designed, where MSIE extracts multi-scale information of crops, while SGA enhances the semantic representation of the crops.
- (3) Extensive experiments are conducted on three benchmark datasets, and the proposed model achieves state-of-the-art performance compared with advanced existing methods.
The remainder of this paper is organized as follows: Section 2 details the proposed SGTN model; Section 3 presents the experimental results on the three datasets; Section 4 concludes the paper.
2. Methodology
The SGTN structure proposed in this study is shown in Figure 1. The network consists of two stages of feature extraction layers. Each stage includes the MSIE and SGA modules, where the MSIE module is primarily used to extract spatial information of crops at different scales, and the SGA module guides the model to learn more semantically relevant features. After feature extraction, a global average pooling function is applied to aggregate the spatial information. This is followed by two fully connected (FC) layers that output the crop categories. It is important to note that 64 convolution filters of size 1 × 1 are used to reduce the spectral dimensionality. The two FC layers contain 32 neurons and a number of neurons equal to the number of crop categories, respectively. Overall, the feature extraction layer of the model can be divided into two branches. The main branch is the MSIE module, which is primarily responsible for learning crop features, while the other branch is the SGA module, which focuses on semantic guidance to improve the feature learning of the main branch. Notably, the model’s input patches are sampled using a sliding window approach with a fixed spatial size, aimed at better capturing spatial information. In general, the label of the central pixel of a patch is taken as the classification label of the entire patch, since it is the most representative pixel within the patch. Specifically, if the spatial size of a patch is s × s, the label of the central pixel is assigned as the label for the entire patch, ensuring an effective representation of the region’s features and its category.
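To make the sampling procedure concrete, the following minimal sketch (in Python/NumPy, with illustrative names not taken from the SGTN implementation) extracts fixed-size patches with a sliding window and assigns each patch the label of its central pixel; zero-padding of the image borders is an assumption.

```python
import numpy as np

def extract_patches(image, labels, patch_size=13):
    """Sample fixed-size patches and assign each patch the label of its central pixel.

    image:  hyperspectral cube of shape (H, W, C)
    labels: ground-truth map of shape (H, W); 0 is assumed to denote unlabeled pixels
    """
    pad = patch_size // 2
    # Zero-pad the borders so every labeled pixel can serve as a patch center.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")

    patches, patch_labels = [], []
    for r in range(labels.shape[0]):
        for c in range(labels.shape[1]):
            if labels[r, c] == 0:          # skip unlabeled (background) pixels
                continue
            patch = padded[r:r + patch_size, c:c + patch_size, :]
            patches.append(patch)
            patch_labels.append(labels[r, c] - 1)  # label of the central pixel

    return np.stack(patches), np.array(patch_labels)
```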
2.1. Multi-Scale Spatial–Spectral Information Extraction
The MSIE module, as the primary feature extraction structure of the model, effectively captures the spatial information of crops, as shown in Figure 2. This module extracts features through multiple parallel branches and then aggregates the corresponding feature maps from each branch into its output. In the first branch, a 1 × 1 convolution kernel is first used to extract spectral details from the image, followed by a 3 × 3 2D average pooling function to aggregate the spatial information; zero padding is applied to the pooling function so that the spatial scale is preserved. In the second branch, a 1 × 1 convolution kernel is employed to better capture the nonlinear spectral variation details. The third branch follows a similar structure to the first branch, where a 1 × 1 convolution kernel is used for spectral feature learning, followed by 1 × 3 and 3 × 1 convolution kernels that capture edge detail features while reducing the number of model parameters. In contrast to the third branch, the fourth branch considers the impact of series and parallel connections on model learning. It therefore uses cascaded 1 × 3 and 3 × 1 convolution kernels to further enhance the model’s learning capability: a 1 × 1 convolution kernel is applied first, followed by the cascaded 1 × 3 and 3 × 1 convolution kernels, and then parallel feature extraction with both convolution kernels. It is important to note that, to ensure the feature maps from different branches can be correctly aggregated, the number of filters in all convolution kernels is set to 64.
Overall, the MSIE module is designed to achieve efficient feature representation through multi-scale feature extraction and computational optimization. This module utilizes multiple parallel branches to extract features at different scales, combining 1 × 1 convolutions to capture local details and 3 × 3 convolutions to extract local spatial information, thereby realizing the fusion of multi-scale features. Notably, the use of parallel branches allows for the processing of different types of features, adapting to the diverse patterns and scale variations within the image. Additionally, the traditional 3 × 3 convolution is decomposed into two unidirectional convolutions (i.e., 1 × 3 and 3 × 1 convolutions) to achieve the same effect while reducing the number of parameters to improve efficiency. Furthermore, performing convolution operations in both horizontal and vertical directions enhances the boundary feature information of the crops.
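A minimal PyTorch sketch of one possible reading of this module is given below. The BatchNorm/ReLU placed after each convolution, the series-versus-parallel arrangement of the 1 × 3 and 3 × 1 kernels in the third and fourth branches, and the element-wise summation used to aggregate the branches are assumptions inferred from the description above rather than details confirmed by Figure 2.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, padding):
    # 2D convolution followed by BatchNorm and ReLU (an assumed, common pattern).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=padding),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MSIE(nn.Module):
    """Illustrative multi-scale spatial-spectral extraction block (four parallel branches)."""

    def __init__(self, in_ch, ch=64):
        super().__init__()
        # Branch 1: 1x1 conv (spectral details) + padded 3x3 average pooling.
        self.b1 = nn.Sequential(conv_bn_relu(in_ch, ch, 1, 0),
                                nn.AvgPool2d(3, stride=1, padding=1))
        # Branch 2: 1x1 conv for nonlinear spectral variations.
        self.b2 = conv_bn_relu(in_ch, ch, 1, 0)
        # Branch 3: 1x1 conv, then parallel 1x3 and 3x1 convs (edge details).
        self.b3_reduce = conv_bn_relu(in_ch, ch, 1, 0)
        self.b3_h = conv_bn_relu(ch, ch, (1, 3), (0, 1))
        self.b3_v = conv_bn_relu(ch, ch, (3, 1), (1, 0))
        # Branch 4: 1x1 conv, cascaded 1x3 -> 3x1, then parallel 1x3 and 3x1 convs.
        self.b4_reduce = nn.Sequential(conv_bn_relu(in_ch, ch, 1, 0),
                                       conv_bn_relu(ch, ch, (1, 3), (0, 1)),
                                       conv_bn_relu(ch, ch, (3, 1), (1, 0)))
        self.b4_h = conv_bn_relu(ch, ch, (1, 3), (0, 1))
        self.b4_v = conv_bn_relu(ch, ch, (3, 1), (1, 0))

    def forward(self, x):
        y1 = self.b1(x)
        y2 = self.b2(x)
        t3 = self.b3_reduce(x)
        y3 = self.b3_h(t3) + self.b3_v(t3)
        t4 = self.b4_reduce(x)
        y4 = self.b4_h(t4) + self.b4_v(t4)
        # All branches output 64 channels, so they can be aggregated element-wise.
        return y1 + y2 + y3 + y4
```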
2.2. Semantically Guided Attention Module
The SGA module is designed by improving the transformer model, incorporating multi-head self-attention, a multilayer perceptron (MLP), and layer normalization (LN), as shown in Figure 3. This module aims to guide the model in strengthening the semantic output for crops. In the SGA module, the original positional encoding of the transformer is discarded, and a learnable pixel weight parameter w (a single s × s map) is embedded along the channel dimension, where s denotes the spatial size. The mathematical representation is as follows:
x′ = Concat(x, w),
where Concat(·) represents the concatenation of data along the specified dimension, x denotes the original input data to the model, and x′ represents the concatenated result, with dimensions (c + 1) × s × s, where c is the number of channels in the data. Additionally, considering the feature redundancy caused by the large number of channels, this paper uses a learnable parameter α to filter important spectral bands. The mathematical expression is as follows:
x″ = σ(α) ⊙ x′,
where σ(·) represents the Sigmoid activation function and ⊙ denotes channel-wise multiplication.
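The input preparation described above can be sketched as follows; the tensor shapes, the zero initialization, and the per-channel form of α are assumptions consistent with the reconstructed equations rather than details taken from the original implementation.

```python
import torch
import torch.nn as nn

class SGAInput(nn.Module):
    """Illustrative input preparation for the SGA module:
    concatenate a learnable pixel-weight map along the channel dimension,
    then gate the channels with a learnable, Sigmoid-normalized parameter."""

    def __init__(self, channels, spatial_size):
        super().__init__()
        # Learnable pixel-weight map w of shape (1, s, s), replacing positional encoding.
        self.w = nn.Parameter(torch.zeros(1, 1, spatial_size, spatial_size))
        # Learnable channel-gating parameter alpha, one value per channel (c + 1 after concat).
        self.alpha = nn.Parameter(torch.zeros(1, channels + 1, 1, 1))

    def forward(self, x):
        # x: (B, c, s, s)
        b = x.size(0)
        x_cat = torch.cat([x, self.w.expand(b, -1, -1, -1)], dim=1)   # (B, c+1, s, s)
        return torch.sigmoid(self.alpha) * x_cat                      # channel-wise gating
```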
In the transformer model, before the input is passed into the multi-head self-attention mechanism, it undergoes LN to reduce covariate shift during training and accelerate convergence. In the self-attention layer, three different linear transformations are used to generate three new vector representations: the query vector (Q), the key vector (K), and the value vector (V). For each query vector, its dot product with all key vectors is calculated to measure the similarity between the query and each key. The attention weights are then computed from these dot products, and the weights are used in a matrix multiplication with the value vectors to obtain the final output. The calculation formula for the attention weights is as follows:
Attention(Q, K, V) = softmax(QK^T/√d_k)V,
where d_k represents the dimensionality of the key vector, while Q, K, and V are the query matrix, key matrix, and value matrix, respectively. The softmax function is applied to the scaled dot product of Q and K to produce a set of attention weights, which are then used to compute the weighted sum of the value vectors. This process allows the model to focus on different parts of the input sequence when making predictions.
In multi-head self-attention, the attention mechanism is executed independently multiple times, with each attention head using different linear transformation parameters to generate distinct queries, keys, and values. These independent attention heads allow the model to focus on different parts or features of the input sequence simultaneously, capturing various relationships and patterns from multiple perspectives. The formula for multi-head self-attention is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O,
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), h is the number of heads, and W^O is the linear transformation matrix used to map the concatenated outputs of the heads back to the original dimensions. The function Concat(·) represents concatenating the results of each self-attention computation along the feature dimension c to obtain the final output of the multi-head self-attention mechanism.
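For reference, the two formulas above correspond to the following from-scratch sketch; the class and function names are illustrative, and PyTorch’s built-in nn.MultiheadAttention offers a production implementation of the same idea.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=1):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d_k = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)   # W^O: maps the concatenated heads back to dim

    def forward(self, x):
        # x: (B, N, dim) sequence of N tokens
        B, N, _ = x.shape
        def split(t):  # (B, N, dim) -> (B, h, N, d_k)
            return t.view(B, N, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        out = scaled_dot_product_attention(Q, K, V)                  # (B, h, N, d_k)
        out = out.transpose(1, 2).reshape(B, N, self.h * self.d_k)   # concatenate heads
        return self.out_proj(out)
```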
After obtaining the output from the self-attention layer, the model’s nonlinear expressiveness is enhanced through the MLP. It is worth noting that a residual connection is used in the transformer encoder structure to facilitate feature reuse within the network, allowing the model to learn more efficiently. After the transformer encoding, the pixel weight parameter w is extracted and reshaped to a spatial size of s × s, resulting in weights with the same spatial dimensions as the input image. Notably, this weight is normalized using a Sigmoid activation function to obtain a weight for each spatial pixel. Guided by this weight, the network continuously learns to focus on the true semantic categories of the image, thereby minimizing background interference. Finally, the weight is element-wise multiplied with the main branch to enhance the model’s semantic output, which in turn improves crop recognition accuracy in complex background scenarios.
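The weight extraction and gating step might look like the sketch below. Treating each spectral channel as one token of length s × s and taking the pixel-weight token as the last token of the encoder output are assumptions about the layout, not details stated in the paper.

```python
import torch

def semantic_gate(encoder_tokens, msie_features, spatial_size):
    """Illustrative SGA output step (layout assumptions noted in the text).

    encoder_tokens: (B, c + 1, s*s) transformer-encoder output, where each token is
                    one channel flattened over space and the last token corresponds
                    to the learnable pixel-weight parameter w.
    msie_features:  (B, 64, s, s) output of the main MSIE branch.
    """
    B = encoder_tokens.size(0)
    # Extract the pixel-weight token and reshape it back to the spatial grid.
    w = encoder_tokens[:, -1, :].reshape(B, 1, spatial_size, spatial_size)
    # Normalize to per-pixel weights in (0, 1) and gate the main branch.
    w = torch.sigmoid(w)
    return msie_features * w   # element-wise, broadcast over the 64 channels
```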
3. Results and Analysis
All experiments in this study were conducted on hardware consisting of an Intel(R) Xeon(R) CPU E5-2682 V4, an NVIDIA GeForce RTX 3060-12G GPU, and 30 GB of memory. The programming language used was Python 3.8, and the deep learning framework was PyTorch 1.12.1. To effectively evaluate the classification performance of the model, three evaluation metrics were selected: overall accuracy (OA), average accuracy (AA), and Kappa coefficient.
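For reference, the three metrics can be computed from the predicted and true labels as in the sketch below; scikit-learn is used here only for convenience and is not mentioned in the paper.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def evaluate(y_true, y_pred):
    """Overall accuracy (OA), average per-class accuracy (AA), and Cohen's Kappa."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                 # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)     # accuracy of each class
    aa = per_class.mean()                        # average accuracy
    kappa = cohen_kappa_score(y_true, y_pred)
    return oa, aa, kappa
```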
In the experiments, cross-entropy was selected as the loss function, the batch size was set to 64, Adam was used as the optimizer, the learning rate was 0.001, and the number of training iterations was 100. The SGA module in the model was set to a depth of one layer, and the number of attention heads in the self-attention mechanism was set to one. To reduce the effect of experimental variability, the results were averaged over five repeated runs.
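A minimal training loop matching these settings (cross-entropy loss, Adam with a learning rate of 0.001, batch size 64, and 100 iterations interpreted here as epochs) might look as follows; the model and the in-memory tensors are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_x, train_y, epochs=100, batch_size=64, lr=1e-3):
    """Minimal sketch of the training configuration reported above."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = DataLoader(TensorDataset(train_x, train_y),
                        batch_size=batch_size, shuffle=True)
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```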
3.1. Datasets
In the experiments, two crop hyperspectral image benchmark datasets were used: Indian Pines (IP) and Salinas (SA) [31]. Additionally, the Pavia University (PU) dataset was chosen to assess the model’s generalization capability. The IP dataset is a typical crop dataset collected over the Indian Pines test site in Northwest Indiana, USA, using the AVIRIS sensor (Jet Propulsion Laboratory, Pasadena, CA, USA). It has a spatial size of 145 × 145 and includes 220 spectral bands; after removing the water absorption bands, 200 bands were retained for analysis. This dataset contains 16 crop categories and a total of 10,249 labeled samples. The SA dataset, also acquired by the AVIRIS sensor, was collected near Salinas, CA, USA. The image has a spatial size of 512 × 217 with 224 original spectral bands; after removing noise-affected bands, 204 valid bands were retained for analysis. This dataset includes 16 crop categories and a total of 54,129 labeled samples. The PU dataset was acquired by the ROSIS sensor near the University of Pavia, Italy. It has a spatial size of 610 × 340 and contains 103 spectral bands. This dataset includes nine categories and a total of 42,776 labeled samples. In the experiments, for the IP dataset, 10%, 10%, and 80% of the samples were randomly selected for the training, validation, and test sets, respectively. For the PU and SA datasets, 1%, 1%, and 98% of the samples were randomly selected for the training, validation, and test sets, respectively. The sample divisions for the three datasets are shown in Table 1, Table 2 and Table 3, and the false color images and ground truth are shown in Figure 4, Figure 5 and Figure 6.
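One way to reproduce such splits is a per-class (stratified) random selection, as sketched below; whether the percentages were applied per class or over the whole labeled set is not stated explicitly, so the stratified form shown here is an assumption.

```python
import numpy as np

def split_indices(labels, train_ratio=0.10, val_ratio=0.10, seed=0):
    """Per-class (stratified) random split into train/validation/test index sets.

    labels: 1-D array of class labels for all labeled samples.
    The remaining samples (1 - train_ratio - val_ratio) form the test set.
    """
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_ratio * idx.size)))
        n_val = max(1, int(round(val_ratio * idx.size)))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

# Example: 10%/10%/80% for Indian Pines; 1%/1%/98% for Salinas and Pavia University.
# tr, va, te = split_indices(ip_labels, 0.10, 0.10)
```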
3.2. Impact of Spatial Size
Different spatial sizes have a significant impact on the model’s classification performance. Generally, as the spatial size increases, the image contains richer spatial information; however, it also introduces more redundant information, which may negatively affect classification performance. Therefore, to evaluate the impact of different spatial sizes on the proposed SGTN model, five spatial sizes (7, 9, 11, 13, and 15) were tested. The results are shown in Figure 7. For all three datasets, classification accuracy tends to increase as the spatial size grows. For the IP dataset, the impact of the spatial size on model accuracy is relatively small, with classification accuracy reaching its maximum at a spatial size of 13. The PU and SA datasets are more sensitive to the input spatial size, with smaller spatial sizes resulting in lower classification accuracy; their classification accuracy peaks at a spatial size of 15.
3.3. Ablation Experiment
To evaluate the impact of the SGA module on crop classification performance, ablation experiments were conducted, as shown in Table 4. The table shows that classification accuracy improved on all three datasets after the introduction of the SGA module, demonstrating that the module can strengthen the semantic relationships of crops in the image and thereby improve the model’s ability to capture the semantic features of each crop. On the IP dataset, AA increased by 4.04% after introducing the SGA module, a significant improvement that further validates the module’s effectiveness. In addition, to verify the effectiveness of the MSIE module, it was replaced with a two-layer standard convolution for feature extraction. When the MSIE module was replaced with standard convolution, accuracy decreased significantly on all three datasets, further demonstrating the effectiveness of the MSIE module. Overall, the SGTN model improves classification performance with only a slight increase in training time, effectively balancing the classification accuracy across all categories.
3.4. Classification Results
To validate the effectiveness of the proposed algorithm, six state-of-the-art hyperspectral image classification models were selected for comparison: DFFN [32], HSST [33], SSAN [34], SSFTTnet [24], MASSFormer [35], and MSSTT [36].
DFFN (Deep Feature Fusion Network): This model optimizes convolutional layers, multilayer feature fusion, and PCA dimensionality reduction through residual learning, effectively enhancing classification accuracy, particularly excelling in small sample scenarios.
HSST (Hierarchical Spatial–Spectral Transformer): An end-to-end hierarchical spatial–spectral transformer model that effectively extracts spatial–spectral features from hyperspectral data using multi-head self-attention (MHSA), while employing a hierarchical architecture to reduce the number of parameters.
SSAN (Spectral–Spatial Attention Network): This model combines spectral and spatial modules and introduces an attention mechanism to suppress the influence of noisy pixels, effectively extracting discriminative spectral-spatial features from hyperspectral data.
SSFTTnet (Spectral–Spatial Feature Tokenization Transformer): This model combines spectral-spatial feature extraction modules with a transformer architecture, utilizing a Gaussian-weighted feature tokenization module to convert shallow features into semantic features.
MASSFormer (Memory-Augmented Spectral–Spatial Transformer): This model introduces a memory tokenizer (MT) and a memory-augmented transformer encoder (MATE) module to convert spectral–spatial features into memory tokens that store prior knowledge while extending multi-head self-attention (MHSA) operations to achieve more comprehensive information fusion.
MSSTT (Multi-Scale Super-Token Transformer): This model includes a multi-scale convolution (MSConv) branch and a multi-scale super-token attention (MSSTA) branch to achieve both local and global feature extraction.
All models used the same structural parameters as those in the original papers, and experiments were conducted in the same experimental environment.
3.4.1. Classification Results for the IP Dataset
As shown in Table 5, the proposed SGTN achieves the highest classification accuracy among all models, with OA, AA, and Kappa values of 98.24%, 94.64%, and 0.9799, respectively. Specifically, compared to HSST, the comparison model with the lowest accuracy, SGTN improves OA by 25.58%, a substantial gain. Compared to DFFN, the best-performing comparison model, SGTN improves OA, AA, and Kappa by 1.02%, 2.27%, and 1.16%, respectively, further demonstrating its excellent performance in crop classification. In terms of AA, SGTN shows improvements of up to 41.86%, with the smallest gain being 2.27%, indicating that SGTN effectively balances the classification accuracy across different crop categories. This performance is mainly attributable to the SGA module’s ability to extract semantic information, thereby enhancing the model’s semantic representation. Except for the classes “Alfalfa” and “Grass-pasture-mowed”, SGTN achieves over 90% classification accuracy for all crop types. Furthermore, as observed in Figure 8, the classification maps of SGTN are relatively smooth, and the model accurately identifies the spatial distribution of crops.
3.4.2. Classification Results for the SA Dataset
As shown in Figure 9, “Grapes_untrained” and “Vinyard_untrained” exhibit significant salt-and-pepper noise, particularly in the classification results of DFFN and HSST, as seen in Figure 9d,h. This misclassification is primarily due to the similarity in spectral and visual color information between these two grape-related categories, which makes them prone to confusion. In contrast, SGTN exhibits fewer misclassifications, attributable to the SGA module’s semantic guidance, which enhances the model’s semantic outputs. As shown in Table 6, with only 1% training samples, SGTN achieved the highest classification performance, with OA, AA, and Kappa reaching 97.89%, 98.44%, and 0.9765, respectively. Compared to the other models, SGTN achieved improvements in OA, AA, and Kappa of 2.24–22.82%, 1.73–28.64%, and 2.49–25.80%, respectively. Among the comparison models on the SA dataset, HSST performs the worst, while SSAN shows better results; however, the overall classification performance of SSAN still falls short of SGTN.
3.4.3. Classification Results for the PU Dataset
The results in Table 7 show that, with only 1% of the training samples, the proposed SGTN achieves the best classification accuracy, with OA, AA, and Kappa values of 98.34%, 96.82%, and 0.9780, respectively. The classification accuracy for all categories except “Trees” and “Shadows” exceeds 95%. Furthermore, as observed in Figure 10, SGTN recognizes “Bare Soil” with fewer misclassifications; in particular, Figure 10d shows that the HSST method performs poorly on this class. Compared to the other models, SGTN achieves significant improvements in OA, AA, and Kappa, with gains ranging from 1.99% to 16.26%, 1.99% to 21.43%, and 2.64% to 22.23%, respectively. Overall, the classification maps of the proposed SGTN are closer to the ground truth of the dataset. These results further demonstrate the strong generalization ability of SGTN.
3.5. Runtime Analysis
To assess the computational efficiency of the model, this section analyzes the number of parameters and the prediction time on the three datasets, as shown in Table 8. The results indicate that SSAN has significantly more parameters than the other models, while SSFTTnet has the fewest. In terms of prediction time, except for MSSTT and HSST, the differences across the three datasets for the other methods are small, especially on the IP dataset. The proposed SGTN model falls in the middle of the compared methods in both parameter count and prediction time. The model utilizes a self-attention mechanism to capture the semantic relationships in the image, which increases computational complexity but also leads to a significant improvement in classification accuracy. Overall, SGTN not only achieves the best classification results but also maintains a reasonable level of computational complexity, demonstrating balanced overall performance.