Article

Land Cover Classification of Remote Sensing Imagery with Hybrid Two-Layer Attention Network Architecture

1 School of Automation, Guangxi University of Science and Technology, Liuzhou 545006, China
2 Guangxi Collaborative Innovation Centre for Earthmoving Machinery, Guangxi University of Science and Technology, Liuzhou 545006, China
3 National Satellite Meteorological Center, China Meteorological Administration, Beijing 100081, China
* Author to whom correspondence should be addressed.
Forests 2024, 15(9), 1504; https://doi.org/10.3390/f15091504
Submission received: 16 July 2024 / Revised: 25 August 2024 / Accepted: 27 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Artificial Intelligence and Machine Learning Applications in Forestry)

Abstract

In remote sensing image processing, when categorizing images from multiple remote sensing data sources, deepening the network hierarchy is prone to feature dispersion and the loss of semantic information. To solve this problem, this paper proposes HDAM-Net, a parallel network architecture with a hybrid dual attention mechanism, for monitoring forest land cover change. Firstly, we propose a fused MCA + SAM (MS) attention mechanism to improve the ViT network, which can capture the correlation information between features; secondly, we propose a Multiscale Cascaded Residual Convolution (MSCRC) network model using a Double Cross-Attention Module (DCAM), which can efficiently exploit the spatial dependency between multiscale encoder features. Finally, a dual-channel parallel architecture is utilized to resolve structural differences, sharpening the discrimination of forestry image classes and enabling the effective monitoring of forest cover changes. To compare the performance of HDAM-Net, mountain urban forest types were classified using multiple remote sensing data sources, and the performance of the model was evaluated. The experimental results show that the overall accuracy of the proposed algorithm is 99.42%, versus 96.92% for the Vision Transformer (ViT), indicating that the proposed classifier can accurately determine the cover type. The HDAM-Net model demonstrates its effectiveness in accurately classifying land and forest types from multiple remote sensing data sources. In addition, land utilization rates and land cover change clearly reveal forest cover change and supply the data needed to predict future trends of the forest ecosystem, so that forest resource surveys can effectively monitor deforestation and evaluate forest restoration projects.

1. Introduction

Remote Sensing (RS) plays a vital role in forest ecosystem research, providing unparalleled insight and operational efficiency in forest resource management. Through remote sensing, we can accurately monitor forest cover, vegetation types, and biomass, which is essential for developing effective forest conservation strategies and biodiversity maintenance programs. Although sensor errors and uncertainty in data retrieval algorithms make applying satellite data to the understanding of climate change challenging, remote sensing technology has demonstrated its unique value in climate change adaptation by helping us identify and quantify forest ecosystems’ responses to climate change, thus providing critical data to support global climate change research. In summary, remote sensing technology not only enhances our understanding of forest ecosystems but also provides the scientific basis and technical means to realize sustainable forest management and address the challenges of climate change. Remote sensing technology can also provide technical means for large-area forest crop detection [1,2,3,4]. RS is a powerful tool that allows us to obtain information about the Earth’s surface from a distance. By analyzing remote sensing images, we can accurately identify and classify different types of land cover [5,6,7,8], which is crucial for decision making in areas such as natural resource management and forest environment construction. Remote sensing image classification is a key step in this process, involving the identification and segmentation of different features or feature types in remote sensing images [9]. This process classifies the image content into categories such as forests, grasslands, cities, and waters by extracting and analyzing the features in the image. The Geographic Information System (GIS), in turn, is a tool for capturing, storing, analyzing, and managing geospatial data. It not only helps us visualize the data but also performs complex spatial analysis to reveal the patterns and relationships behind land cover changes [10]. Machine learning plays an important role in this process: its algorithms can analyze and learn from large amounts of remote sensing data to improve the accuracy and efficiency of classification. Machine learning models learn features from historical data and then classify new data. With the combined use of remote sensing, GIS, and machine learning, we can systematically observe and analyze the land surface to achieve land cover classification. This method can not only identify and record the natural and anthropogenic cover characteristics of a specific area but also provide key support for scientific research and decision making in the fields of environmental monitoring, natural resource management, and urban planning [11,12,13,14].
RS technology, with its robust data collection and analysis capabilities, has revolutionized the tools available for monitoring forest cover change, assessing forest health conditions, and predicting future trends in forest ecosystems. For instance, high-resolution satellite imagery allows researchers to track the dynamic processes of forest deforestation and regeneration. By comparing images from different time points, the rate of forest cover change can be accurately quantified. Additionally, through the analysis of vegetation indices such as the Normalized Difference Vegetation Index (NDVI), the health status of forests can be evaluated, and areas affected by stress factors like pests, diseases, or drought can be identified. Remote Sensing also enables the prediction of future development trends in forest ecosystems, such as the potential for forest expansion or degradation, providing a forward-looking scientific basis for the formulation of forest management and conservation strategies. These applications not only enhance our understanding of the complexity of forest ecosystems but also offer critical technical support for the sustainable utilization and protection of forest resources.
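As a concrete illustration of the index mentioned above, NDVI is computed per pixel from the red and near-infrared reflectances. The minimal NumPy sketch below uses synthetic reflectance values and a purely illustrative 0.3 threshold for flagging potentially stressed vegetation:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against division by zero

# Synthetic reflectances standing in for, e.g., Sentinel-2 Band 8 (NIR) and Band 4 (Red).
nir = np.array([[0.45, 0.50], [0.30, 0.05]])
red = np.array([[0.10, 0.08], [0.20, 0.04]])
print(ndvi(nir, red))           # healthy, dense vegetation tends toward values near 1
print(ndvi(nir, red) < 0.3)     # hypothetical mask of potentially stressed pixels
```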
Semantic segmentation techniques [15] play a pivotal role in the realm of image processing. Traditional manual methods predominantly rely on image processing techniques that heavily depend on manually crafted feature extraction and image algorithms [16]. Common methodologies encompass supervised classification techniques, which utilize labeled training samples to train classifiers such as Support Vector Machines (SVMs) and others [17,18,19]. Conversely, unsupervised classification methods do not require prelabeled training samples; instead, the system automates the learning process for classification, with examples including k-means clustering and Principal Component Analysis (PCA) [20,21,22]. Algorithms for extracting relevant image features include methods such as the Gray Level Co-occurrence Matrix (GLCM) for texture analysis, as well as color and shape descriptors; Gaussian Filtering; and edge detection techniques [23,24]. Image segmentation involves dividing the image into discrete elements, with each element being individually classified. Spectral information, such as multispectral or hyperspectral data, is frequently employed for element-based classification [25,26]. Object-level classification takes into account the contextual relationships between pixels to form objects with semantic meaning. This process incorporates algorithms for segmentation and merging, region growing, and others. These methods often rely on predefined rules and prior knowledge, utilizing various filters, edge detection, and segmentation algorithms to achieve the semantic segmentation of images. While these traditional methods have seen success in certain contexts, they tend to be less effective in handling complex and variable image scenarios and are not as efficient for processing large-scale datasets.
These traditional methods necessitate further enhancement to effectively tackle more complex tasks. In contrast, deep learning-based approaches provide a more automated and end-to-end solution for semantic segmentation by learning image feature representations through neural network models. Convolutional Neural Networks (CNNs) [27] are particularly adept at learning abstract feature representations from extensive labeled datasets, enabling the model to achieve superior performance across a variety of scenarios. Transfer learning is a technique that leverages pretrained deep learning models for new tasks; pretrained image classification models can be fine-tuned to enhance classification accuracy and efficiency. These methodologies have led to significant advancements in semantic segmentation tasks and have propelled the field of image processing forward. It is important to note that for remote sensing image classification, additional techniques such as image preprocessing, feature extraction, and data augmentation are typically considered to bolster the performance and generalization capabilities of the model.
Moreover, current remote sensing image classification techniques exhibit certain limitations [28]. Remote sensing data inherently encompass multiple dimensions, such as spectral, textural, and spatial attributes, which can interfere with one another, potentially resulting in suboptimal classification outcomes [29]. Topographic and environmental factors, such as vegetation cover, water bodies, soil types, and others, can influence the textural and spectral characteristics of remote sensing imagery, thus impacting classification accuracy. Therefore, there is an urgent requirement for additional research and methodological improvements to enhance classification precision and dependability. Furthermore, current remote sensing image classification techniques often neglect the significance of background classification. Background context offers crucial information that can greatly enrich the comprehensive understanding of scenes and enhance interpretation performance [30].
To cope with the high resolution and complexity of remote sensing images, researchers worldwide have tried many different algorithms and techniques. The Support Vector Machine (SVM) is a classical machine learning algorithm commonly used for the classification of remote sensing images; Random Forest (RF) is an ensemble learning method that performs well on high-dimensional remote sensing data; clustering algorithms such as K-means and DBSCAN are used for the unsupervised classification and anomaly detection of remote sensing images; and the Recurrent Neural Network (RNN) [31] is widely used for remote sensing image classification, especially for time series data such as multitemporal remote sensing images. The RNN can capture the time-varying information in the imagery and improve classification accuracy. However, the RNN struggles to capture long-distance dependencies, and its inability to perform parallel computation is another drawback. To solve these problems, researchers have proposed many improvements, such as using the attention mechanism to enhance the RNN’s ability to capture long-distance dependencies. Depthwise separable convolutional networks [32] offer a computationally efficient approach for the analysis of high-resolution images with large sizes and rich textural information. These networks effectively fuse features across various levels of abstraction, from low to high, and their feature extraction capabilities can be further augmented by increasing network depth. Tailoring the parameter settings of depthwise separable convolution, such as convolution kernel size and stride, to the unique characteristics of remote sensing images optimizes the accuracy and efficiency of the convolution operation.
The approach rooted in depthwise separable convolutional networks has significantly advanced the semantic segmentation of RS images. Rather than discarding depthwise separable convolutional networks, our aim is to propose a novel framework that builds upon their strengths and leverages their advantages.
In recent years, Transformer [33] has emerged as a powerful neural network model with impressive achievements in the field of computer vision. Its ability to process continuous data has opened up new avenues for image classification. The Transformer model excels at capturing spatial and contextual information within images while also accommodating various scales and intricate feature characteristics, thereby enhancing the accuracy and generalization capabilities of remote sensing image classification. As a result, the Vision Transformer (ViT) model [34] has emerged as a popular backbone architecture for a multitude of computer vision tasks.
The ViT model consists of two core components: the token mixer and the channel mixer. This method of global information aggregation demonstrates remarkable potential in feature extraction without relying on inductive biases such as convolution, enabling the development of powerful data-driven models. Hong et al. introduced a novel model named SpectralFormer [35] that utilizes the ViT and has shown promising results. Additionally, certain advancements within the ViT, such as those proposed in the literature [36,37], facilitate the integration and fusion of the ViT with larger-scale models while sharing weights. This ensures consistent processing across different image sizes, thereby maximizing data utilization.
However, the Vision Transformer’s (ViT) encoder, originally designed for linguistic modeling, faces inherent limitations when applied to downstream computer vision tasks. Notably, the computation of global affinity matrices within the self-attention mechanism is computationally intensive due to its quadratic complexity and high memory requirements, which constrains its utility for high-resolution image features. Consequently, capturing spectrally localized sequence information in high-resolution images poses a significant challenge.
Therefore, there is an urgent need for a more detailed approach to overcome these limitations. In the face of these challenges, it is crucial to maintain the richness of information during feature extraction and ensure the consistency of spatial and semantic information during feature fusion. This paper introduces a parallel network, HDAM-Net, designed for the classification of land use images using multispectral remote sensing data techniques.
A new remote sensing image semantic segmentation network framework, HDAM-Net, is proposed. It consists of a ViT network branch utilizing the MCA + SAM (MS) attention mechanism and a Multiscale Cascaded Residual Convolution (MSCRC) network branch, forming a parallel dual-encoder architecture. Specifically, we enhance the segmentation capability by designing the MS attention mechanism, which is a crucial component of our ViT branch. Subsequently, the features obtained from the MSCRC network on the other branch are fused to better capture diverse image features and achieve pixel-level remote sensing image classification.
The main contributions of this paper are summarized as follows:
(1) An enhanced feature extraction method, HDAM-Net, for remote sensing images is proposed. This method effectively preserves the richness of information and ensures the consistency of spatial and semantic information during the feature fusion process.
(2) We propose the MSCRC network to enhance the model’s ability to perceive information at various scales. The MSCRC can more effectively capture the features in the original image and minimize the loss of detailed information.
(3) We construct the DCAM attention mechanism to introduce more powerful context modeling and feature association when processing images as sequence data. We efficiently extract the channel and spatial dependencies between multiscale encoder features to address the semantic gap problem.
(4) Addressing structural differences involves utilizing a dual-channel parallel architecture to prevent information loss and excessive feature fusion. This architecture can effectively distinguish between feature types that have high similarity.

2. Research Materials and Data

2.1. Regional Overview

Xiangyin County in Yueyang City, located in the northeast of Hunan Province (data available from https://www.usgs.gov/downloadforuse (accessed on 6 October 2021)), was selected as the study area. Xiangyin County, with a long history and the ancient name of Luocheng, is a county under the jurisdiction of Yueyang City. Geographically, Xiangyin County is uniquely situated at the confluence of the Xiangjiang and Zijiang rivers, adjacent to the southern shore of Dongting Lake. The county spans 112°30′20″–113°1′50″ E and 28°30′13″–29°3′2″ N. It borders Miluo City in the east, Yiyang City in the west, Wangcheng District of Changsha City in the south, and Yuanjiang City and the Quyuan Management District in the north. The total population of Xiangyin County is about 680,900. The topography of the area is divided into two parts, east and west, bounded by the Xiangjiang River: the eastern part consists of hills and mounds, while the western part is a vast lakeside plain. This topographical diversity makes the area ideal for aquaculture and crop cultivation, similar to neighboring Nanxian County. Xiangyin County’s climate is humid subtropical, with four distinct seasons and abundant rainfall. Summers are hot and humid, and winters are mild with little rain. The annual precipitation is about 1383 mm, and the average temperature is about 15 to 22 degrees Celsius. In terms of vegetation, the eastern hilly areas of Xiangyin County are dominated by coniferous and deciduous broad-leaved forests, such as Masson pine and fir, while agricultural vegetation such as rice paddies and lotus ponds is widely distributed in the western plains. In addition, the wetland ecosystem around Dongting Lake is a major feature of the area, with abundant aquatic plants and wetland vegetation that provide habitat for a wide variety of waterfowl and wildlife.
The following experiments were conducted in some of the most productive areas of Nanxian and Xiangyin counties. The study area is illustrated in Figure 1.

2.2. Preprocessing

2.2.1. Multispectral Data Sources

To understand the distribution of feature types in the study area, a field sampling campaign was conducted on 6 October 2021, and sample points of different feature types were obtained by collecting field data. The data source used in this study was the 10 m resolution multispectral imagery acquired by the Sentinel-2 satellite in October 2021 covering the study area. A sample library of seven major land types in Xiangyin County, including urban areas, water bodies, sugarcane fields, ponds, wetlands, vegetable fields, and oilseed rape crops, was constructed through field collection and labeling with ENVI 5.6 software. To ensure sample quality, the research team selected the collection areas with the assistance of GPS cameras and combined the data with expert knowledge to accurately select the samples. In particular, when collecting sugarcane and rice samples, large areas were prioritized to support subsequent training and validation. These refined data collection methods deepen the understanding of regional feature distribution and provide a solid foundation for feature classification studies. The Sentinel-2 satellite band information is given in Table 1.
In this paper, we utilized multispectral imagery captured by the Sentinel-2 satellite on 6 October 2021, with a spatial resolution of 10 m, as our primary data source. To preprocess the data, we began with atmospheric correction using the Sen2Cor V2.8 software to mitigate atmospheric impacts on the imagery. Subsequent steps included image enhancement and mask extraction. To ensure consistent resolution across all bands, we resampled them using SNAP7.0 software, harmonizing them to a common 10 m resolution. In line with our study objectives, we selected seven bands (Band 2, Band 3, Band 4, Band 8, Band 9, Band 11, and Band 12) and combined these bands using ENVI5.6 software. These selected bands cover the visible and near-infrared spectral domains, making them particularly valuable for feature classification and identification. During the sample data acquisition phase, we identified regions of interest (ROIs) and created a corresponding sample library in the Heshan District of Yiyang City, leveraging prior knowledge and field surveys, as shown in Figure 2.
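The resampling and band combination above were carried out in SNAP 7.0 and ENVI 5.6; purely to illustrate the harmonization step, the NumPy sketch below upsamples synthetic stand-ins for the 20 m bands (B11, B12) and the 60 m band (B9) to the 10 m grid and stacks the seven selected bands. The shapes and values are synthetic assumptions:

```python
import numpy as np

def upsample_nn(band: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling: 20 m -> 10 m uses factor=2, 60 m -> 10 m uses factor=6."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

# Synthetic stand-ins for Sentinel-2 bands on a 12 x 12 pixel 10 m grid:
# B2, B3, B4, B8 are native 10 m; B11 and B12 are 20 m; B9 is 60 m.
h, w = 12, 12
rng = np.random.default_rng(0)
b2, b3, b4, b8 = (rng.random((h, w)) for _ in range(4))
b11, b12 = (rng.random((h // 2, w // 2)) for _ in range(2))
b9 = rng.random((h // 6, w // 6))

# Harmonize all bands to the 10 m grid and stack the seven selected bands.
stack = np.stack(
    [b2, b3, b4, b8, upsample_nn(b9, 6), upsample_nn(b11, 2), upsample_nn(b12, 2)],
    axis=0,
)
print(stack.shape)  # (7, 12, 12): the seven-band cube used for classification
```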
Through the implementation of these remote sensing data processing and analysis techniques, we obtained high-quality training samples and accuracy validation samples, which are crucial for both feature classification and verifying feature type accuracy within the study area.

2.2.2. Hyperspectral Data Sources

To validate the accuracy and wide applicability of our proposed model for classification on various types of satellite remote sensing images, we selected three publicly available hyperspectral datasets, Houston, Indian Pines, and the University of Pavia, for our experiments.
The Houston dataset, captured by the ITRES CASI-1500 sensor, depicts the University of Houston and its surrounding areas in Houston, TX, USA. This dataset, funded by the NSF and managed by the Center for Airborne Laser Mapping (NCALM) at the University of Houston, comprises hyperspectral images that are 349 × 1905 pixels in size and include 144 spectral bands. It spans a spectral range of 364 nm to 1046 nm and offers a spatial resolution of 2.5 m.
Furthermore, the Houston dataset is characterized by its high color fidelity and abundant detail, with its image accuracy rigorously validated to ensure precise representation of the target object’s actual properties. Concurrently, this dataset offers a wealth of geographic information, which holds significant value for a multitude of research disciplines. Table 2 lists the names of the various object classes in this dataset, along with the specific sample values for training and testing.
The Indian Pines dataset, captured in 1992 with the AVIRIS sensor, provides remote sensing data of the Indian Pines area in the USA. It includes a 145 × 145 pixel image with a 20 m resolution, spanning 400 to 2500 nm across 220 spectral bands. Due to noise, bands 104–108, 150–163, and band 220 were excluded, leaving 200 bands for analysis. Table 3 presents the land cover categories and sample distributions for training and testing.
The Pavia University dataset, obtained in 2003 using the ROSIS-03 sensor over Pavia, Italy, features a 610 × 340 pixel image at 1.3 m resolution, covering 430 to 860 nm and 115 spectral bands. For this study, 103 bands were chosen for analysis, omitting 12 noisy bands to enhance data quality. Table 4 lists the land cover categories and sample distributions for training and testing.

2.3. HDAM-Net Model

In this study, to address the challenge of extracting semantic content from high-resolution remote sensing images for classification, which often contain features at various levels with different resolutions and semantic information, the HDAM-Net parallel network structure is introduced, as illustrated in Figure 3. This network primarily consists of the HSGSE module, the MSCRC, the LeakyReLU activation function, a 1 × 1 convolution, and the MS Transformer. The HSGSE module generates grouped spectral embeddings, which are flattened into 1D sequences and processed through a fully connected layer to adjust their dimensions for MSCRC input. These sequences are reshaped into 3D feature matrices. The resulting 3D spectral features are fed into the MS Transformer, and the features from the Transformer Encoder are merged with those from the MSCRC. This combined feature set undergoes dimensionality reduction via a fully connected layer, and the final pixel-level multispectral classification is produced through a LeakyReLU activation followed by a 1 × 1 convolution.
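To make the overall data flow concrete, the toy PyTorch skeleton below mirrors the two-branch pattern just described: a spectral embedding feeds a Transformer branch and a convolutional branch in parallel, the features are fused and reduced, and a LeakyReLU plus 1 × 1 convolution produces pixel-level logits. All layer shapes are illustrative assumptions, not the actual HDAM-Net configuration:

```python
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    """Toy two-branch skeleton mirroring the HDAM-Net data flow described
    above (assumed shapes; the real MS Transformer and MSCRC are far richer)."""

    def __init__(self, bands: int = 7, dim: int = 64, classes: int = 7):
        super().__init__()
        self.embed = nn.Linear(bands, dim)       # HSGSE-like spectral embedding
        # Stand-in for the MS Transformer branch (sequence encoder).
        self.vit_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        # Stand-in for the MSCRC branch (convolutional encoder).
        self.conv_branch = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.LeakyReLU(),
        )
        self.reduce = nn.Linear(2 * dim, dim)    # fuse the two branches
        self.head = nn.Sequential(nn.LeakyReLU(), nn.Conv1d(dim, classes, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, pixels, bands) -- each pixel carries a short spectral sequence.
        e = self.embed(x)                                         # (B, P, dim)
        t = self.vit_branch(e)                                    # Transformer features
        c = self.conv_branch(e.transpose(1, 2)).transpose(1, 2)   # conv features
        fused = self.reduce(torch.cat([t, c], dim=-1))            # (B, P, dim)
        return self.head(fused.transpose(1, 2))                   # (B, classes, P) logits

logits = DualBranchSketch()(torch.randn(2, 16, 7))
print(logits.shape)  # torch.Size([2, 7, 16])
```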

2.3.1. Half-Sequence Grouping Spectral Embedding (HSGSE)

The strategic and effective utilization of spectral information is essential in the field of feature recognition and classification to differentiate between different feature types. Advanced feature extraction techniques, like spectral blending analysis or dimensionality reduction, are crucial in minimizing data redundancies while preserving important spectral characteristics. Moreover, selecting optimal band combinations that correspond to the distinct attributes of diverse geographical areas can improve the adaptability of feature classification across different environmental conditions. Hence, the precise collection of spectral data and understanding band correlations are vital for analyzing remote sensing imagery and accurately classifying features.
Considering the input of a sequence of 1D pixels x, the initial input A for the Transformer is determined through the computation outlined in Equation (1):

$$A = wx$$

where $x \in \mathbb{R}^{1 \times m}$ is the input pixel sequence (in 1D format), $w \in \mathbb{R}^{d \times 1}$ denotes a linear transformation applied to all bands of this spectral sequence, and $A \in \mathbb{R}^{d \times m}$ is the resulting output feature. The GSE is given by Equation (2):

$$\dot{A} = WX$$

where $W \in \mathbb{R}^{d \times n}$ represents the linear transformation, $X \in \mathbb{R}^{n \times m}$ represents the spectral features, and n represents the number of neighboring bands. Please refer to Figure 4 for a detailed schematic diagram.
This method uncovers periodic patterns within time series data by organizing the data into discrete groups, each comprising samples that exhibit analogous temporal fluctuations. This approach is particularly advantageous for the analysis of time series that exhibit periodic variations. By employing embedding algorithms, such as spectral embedding, these discrete groups are transformed into representations in a reduced dimensional space. These representations not only retain the salient features of the original data but also reveal intricate patterns and structures, effectively enhancing the model’s performance and interpretability.
The notable benefit of this technique lies in its ability to synergize the advantages of periodic structure detection with embedding learning algorithms. Through the integration of half-sequence grouping and spectral embedding, the method can efficiently reduce data dimensionality while preserving essential data features, thereby enhancing the model’s efficiency and accuracy.
In essence, the half-sequence grouping spectral embedding technique is a powerful feature engineering tool for reducing the dimensionality of time series data. It combines half-sequence grouping with embedding principles to acquire and learn low-dimensional representations of the intricate patterns and structures within data. This process enhances model performance and interpretability while reducing data complexity.
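As a worked illustration of Equations (1) and (2), the NumPy sketch below embeds a single pixel's spectral sequence, first band by band and then with grouped neighboring bands; the grouping construction (edge padding and a sliding window of n bands) is our illustrative assumption rather than the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 12, 3, 8          # m bands, n neighbouring bands per group, embedding dim

x = rng.random((1, m))      # one pixel's spectral sequence; Eq. (1): A = w x
w = rng.random((d, 1))
A = w @ x                   # (d, m): per-band embedding

# Grouped spectral embedding, Eq. (2): each column of X stacks n neighbouring
# bands, so the projection W mixes local spectral context into the embedding.
pad = np.pad(x[0], (n // 2, n // 2), mode="edge")        # assumed edge padding
X = np.stack([pad[i:i + m] for i in range(n)], axis=0)   # (n, m)
W = rng.random((d, n))
A_dot = W @ X               # (d, m): embedding with neighbouring-band context
print(A.shape, A_dot.shape)
```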

2.3.2. The MS (MCA + SAM) Transformer

In the field of image categorization, feature extraction has been significantly improved by incorporating the Transformer architecture with channel attention mechanisms such as MCA (Multi-Head Channel Attention) and spatial attention modules such as SAM (Spatial Attention Module). This integration enhances the performance of semantic segmentation by leveraging the Transformer’s capability to capture complex correlations and spatial relationships in the data.
MCA extends the attention mechanism in the Transformer framework with a multihead approach, which enhances the model’s ability to discern multichannel correlations. This multihead mechanism in MCA fosters more intricate interactions between features, leading to a more detailed representation of the data. The integration of MCA into the Transformer model involved training and fine-tuning to optimize its impact on the segmentation task. MCA’s capacity to capture cross-channel correlations boosts the feature representation capabilities, thereby enhancing the accuracy and robustness of semantic segmentation.
SAM introduces a spatial attention mechanism at each layer of the Transformer, enabling the modeling of spatial relationships between pixels. To adapt this mechanism for image classification, we developed specific spatial attention modules, including the integration of spatial location coding in the self-attention mechanism. Through experiments and hyperparameter tuning, the effectiveness of the SAM was optimized to improve the model’s perception of spatial information, facilitating the accurate modeling of spatial relationships between pixels.
The combination of the Transformer framework with the MCA and SAM significantly elevates the performance in semantic segmentation images. These attention mechanisms facilitate the capture of correlation information between features and spatial relationships between pixels, thereby enhancing the model’s accuracy, robustness, and spatial information perception.
Furthermore, the quality and efficiency of the semantic segmentation of remote sensing images can be substantially enhanced by the integration of the MCA with SAM. The MCA facilitates improved feature interaction and selection within the channel dimensions, while the SAM aids in capturing spatial information to enhance the accuracy of semantic segmentation. To implement this approach, the MCA and SAM are integrated into the Transformer model at different layers, ensuring their collaborative contribution to semantic segmentation tasks.
The MCA is integrated into the Transformer’s encoder and decoder to enhance the model’s semantic understanding by leveraging its feature interaction and selection capabilities. In the encoder, the MCA aids in capturing cross-channel correlations and highlighting salient features. In the decoder, the MCA facilitates the capture of features from the encoder and the inference of the final semantic segmentation result. The SAM is integrated into the Transformer model to enhance spatial information grasping. In the encoder, the SAM captures detailed information about specific regions and models them accurately. In the decoder, the SAM aids in the recovery of spatial details from the features acquired by the encoder, thereby further enhancing the accuracy of semantic segmentation.
By synergizing the MCA and SAM, the model gains a more comprehensive understanding of semantic information and improves its focus on specific regions. This integration not only enhances the accuracy of image classification but also improves the efficiency of the model when handling large-scale data. The method’s implementation involves model structural design, the integration of attention mechanisms, and model training, with the need to fully consider the role of each component and their interplay to achieve optimal results.
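For intuition about this channel-then-spatial pattern, the PyTorch sketch below uses a squeeze-and-excitation-style channel gate and a CBAM-style spatial gate as lightweight stand-ins for the MCA and SAM; they illustrate the attention ordering, not the exact modules integrated into HDAM-Net:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel gating, a stand-in for MCA."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (B, C, H, W)
        gate = self.mlp(x.mean(dim=(2, 3)))   # global average pool over space
        return x * gate[:, :, None, None]     # reweight each channel

class SpatialAttention(nn.Module):
    """CBAM-style spatial gating, a stand-in for SAM."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                     # x: (B, C, H, W)
        pooled = torch.cat(
            [x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1
        )
        return x * torch.sigmoid(self.conv(pooled))  # reweight each location

x = torch.randn(2, 16, 8, 8)
y = SpatialAttention()(ChannelAttention(16)(x))  # channel gating, then spatial
print(y.shape)  # torch.Size([2, 16, 8, 8])
```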

2.3.3. Double Cross-Attention Module (DCAM)

The Double Cross-Attention Module (DCAM) represents an advanced attention mechanism designed for incorporation into deep learning architectures. It is primarily employed to enhance the robustness of context modeling and the association of features when dealing with sequential data.
The goal of this module is to enhance the U-Net structure and its variants by improving the effectiveness of skip connections in a straightforward and efficient manner, with a minimal increase in parameters and complexity. Conventional approaches to improving the U-Net structure have several drawbacks, including the inability of localized convolutions to capture long-range dependencies between different features. The DCA addresses the semantic gap by explicitly capturing the spatial dependencies between features. This is achieved by generating encoder feature patch embedding modules. Subsequently, the DCA extracts the information interactions among global channel features using channel cross-attention (CCA) and captures the information interactions among global spatial features using spatial cross-attention (SCA). All of these extracted features are then upsampled and concatenated with the corresponding feature portion of the decoder to create the skip-connection scheme.
The DCA module can be characterized by two primary phases and three distinct steps. The initial phase involves the extraction of encoder tokens through a multiscale patch embedding module. The subsequent phase implements DCA on these tokens using CCA and SCA modules to capture long-range dependencies. Finally, these tokens are serialized and upsampled using layer normalization and the GeLU activation function to connect with the corresponding parts of the decoder, as depicted in Figure 5.
The semantic gap is effectively addressed by the introduction of the DCA module, which adeptly leverages the channel and spatial dependencies between multiscale encoder features. This innovation allows deep learning models to more accurately comprehend contextual information and the associations between features when processing sequential data, thereby significantly enhancing their performance and accuracy.
Patches are extracted from the n multiscale encoder stages and mapped using a 1 × 1 depthwise separable convolution over the flattened 2D patches. Equation (3) is as follows:

$$T_i = \mathrm{DConv1D}_{E_i}\left(\mathrm{Reshape}\left(\mathrm{AvgPool2D}_{E_i}(E_i)\right)\right)$$

where $T_i$ denotes the flattened patch tokens after the i-th encoder stage. Note that P represents the number of patches, which is the same for each $T_i$, so that the cross-attention between these tokens can be utilized.
With Q, K, and V denoting the mapped queries, keys, and values, respectively, the CCA formula is expressed as follows:

$$\mathrm{CCA}(Q_i, K, V) = \mathrm{Softmax}\left(\frac{Q_i^{T} K}{\sqrt{C_c}}\right) V^{T}$$

where $C_c$ is the scale factor. Finally, the output of the cross-attention is processed using depthwise separable convolution and fed into the SCA module.
The SCA equation can then be expressed as follows:

$$\mathrm{SCA}(Q, K, V_i) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V_i$$

where $d_k$ is the scale factor. The output of the SCA is then processed using depthwise separable convolution to obtain the final DCA output, which is further processed with layer normalization and GeLU. Subsequently, the n outputs of the DCA are connected to their corresponding decoder sections by upsampling. Lastly, all DCA outputs are linked to the same characteristic locations of the decoder.
The introduction of the dual cross-attention module improves the ability to model sequence data in the model, allowing the model to better capture global information and complex relationships between different locations.
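The sketch below distills the CCA-then-SCA flow into bare scaled dot-product attention over the channel axis and then the spatial axis; the depthwise separable convolutions, layer normalization, and GeLU of the full DCA are omitted, and the scaling constants are standard assumptions:

```python
import torch

def softmax_attn(q, k, v, scale):
    """Scaled dot-product attention shared by the CCA and SCA formulas above."""
    return torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1) @ v

# Tokens from two encoder stages, flattened to (batch, patches P, channels C).
B, P, C = 2, 16, 32
t1, t2 = torch.randn(B, P, C), torch.randn(B, P, C)

# CCA: attend across *channels*, so transpose to (B, C, P) before attention.
cca = softmax_attn(
    t1.transpose(1, 2), t2.transpose(1, 2), t2.transpose(1, 2), scale=P ** 0.5
).transpose(1, 2)

# SCA: attend across *spatial* patch positions, operating on (B, P, C) directly.
sca = softmax_attn(cca, t2, t2, scale=C ** 0.5)
print(sca.shape)  # torch.Size([2, 16, 32])
```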

2.3.4. Multiscale Cascaded Residual Convolution (MSCRC)

The Multiscale Cascaded Residual Convolution (MSCRC) network is an image classification module based on the Residual Network (ResNet). Its primary objective is to enhance the model’s capability to recognize and combine multiscale features. While the ResNet effectively addresses the challenges of gradient vanishing and explosion through the incorporation of residual units, it lacks proficiency in managing multiscale features.
The essence of the MSCRC lies in the integration of convolutional filters of varying scales within the network architecture, enabling the simultaneous extraction of both local details and global information from the input image. This is achieved by establishing multiple parallel convolutional branches, with each optimized for processing features at a specific scale. The outputs from these branches are then aggregated to create a comprehensive feature representation that encapsulates multiscale information. Each branch employs a unique combination of convolutional kernel sizes and channel counts to adaptively extract features across a spectrum of scales. Consequently, the network is capable of comprehending image content at diverse spatial scales.
To preserve ResNet’s residual learning advantage, the MSCRC applies residual connections to the merged feature maps, thereby preventing information loss and streamlining the network’s training process.
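A minimal PyTorch sketch of this parallel-branches-plus-residual pattern follows; the kernel sizes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiscaleResidualBlock(nn.Module):
    """Parallel conv branches at several kernel sizes, aggregated and wrapped
    in a residual connection -- the core MSCRC idea sketched minimally."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(len(kernel_sizes) * channels, channels, 1)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        # Each branch sees the same input at a different receptive-field scale.
        multiscale = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(x + self.fuse(multiscale))   # residual connection

x = torch.randn(2, 16, 32, 32)
print(MultiscaleResidualBlock(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```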
Deploying the MSCRC significantly improves the accuracy and robustness of the segmentation results in the field of image classification, particularly in relation to forests, providing a more efficient and dependable tool for agricultural remote sensing analysis. As remote sensing technology continues to advance, it is anticipated that the MSCRC will assume an increasingly significant role in agricultural remote sensing, offering novel perspectives and methodologies for related research endeavors.

2.4. Evaluation Indicators

To evaluate the classification performance of multispectral pixels, three metrics are utilized, all based on the confusion matrix: the overall accuracy ($OA$), average accuracy ($AA$), and Kappa coefficient. These are defined as follows:
(1) $OA$: This metric indicates the percentage of correctly classified predictions relative to the total predictions.

$$OA = \frac{T_q + T_p}{T_q + F_q + T_p + F_p}$$

where $T_q$ is the number of true positives, $F_q$ is the number of false positives, $T_p$ is the number of true negatives, and $F_p$ is the number of false negatives.
(2) $AA$ denotes the average accuracy, a more refined evaluation index in agricultural classification that averages the per-class accuracy:

$$AA = \frac{1}{L} \sum_{i=1}^{L} \frac{T_{q,i} + T_{p,i}}{T_{q,i} + F_{q,i} + T_{p,i} + F_{p,i}}$$

where $L$ denotes the number of categories.
(3) Kappa coefficient: the Kappa value assesses the consistency of an agricultural classification model with its real-world test predictions. It is calculated as follows:

$$Kappa = \frac{P_x - P_y}{1 - P_y}$$

where $P_x$ indicates agreement with the factual data, and $P_y$ is the probability of chance agreement between the classification and the actual situation.
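All three metrics can be computed directly from a confusion matrix; the NumPy sketch below does so for a hypothetical three-class matrix (rows are true classes, columns are predictions):

```python
import numpy as np

def metrics(conf: np.ndarray):
    """OA, AA, and kappa from a square confusion matrix (rows = true class)."""
    total = conf.sum()
    oa = np.trace(conf) / total                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # per-class accuracy (recall)
    aa = per_class.mean()                         # average accuracy
    # kappa: observed agreement OA versus chance agreement P_y from the marginals
    p_y = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - p_y) / (1 - p_y)
    return oa, aa, kappa

# Hypothetical 3-class confusion matrix purely for illustration.
conf = np.array([[50,  2,  1],
                 [ 3, 45,  2],
                 [ 0,  4, 43]])
print(metrics(conf))  # approximately (0.92, 0.92, 0.88)
```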

3. Results

3.1. Ablation Study

To evaluate the efficacy of the network architecture proposed in this paper, ablation experiments were conducted on the dataset of the study area, with the results detailed in Table 5. The baseline Overall Accuracy (OA) with the addition of the HSGSE module is 97.35%, suggesting that the Vision Transformer (ViT) is particularly adept at handling multispectral image classification. The OA increased to 98.29% with the inclusion of the MSCRC module, underscoring the MSCRC’s contribution to enhancing channel connectivity. The implementation of the DCAM attention mechanism further refined the accuracy. The OA climbed to 99.28% upon the integration of the MCA-SAM module, confirming its advantage in multispectral image pixel classification. When all the modules were combined, the OA reached 99.42%, highlighting the collective effectiveness of the proposed modules for agricultural land cover classification. This comprehensive approach significantly elevated the overall classification accuracy.
In our research, the selection of the sample pool was a deliberate process of analysis and evaluation, rather than a mere random sampling. Considering the unique circumstances of Xiangyin County, we employed different proportions of samples for training and testing to comprehensively assess the impact of sample proportions on the experimental results. The experimental outcomes, as presented in Table 6, are a critical reference, highlighting that higher sample proportions do not invariably lead to greater precision. This revelation is significant for data classification in practical applications, where the pursuit of higher precision often overlooks the influence of noise. Our findings emphasize the need to carefully balance various factors, including sample proportion, precision, and noise, to achieve the optimal overall classification result. Therefore, we opted for a 70% sample proportion in our experiment to attain the best overall classification effect.

3.2. Multimethod Comparison

3.2.1. Comparative Analysis of Multispectral Data

To highlight the efficacy of our model, this paper compares the HDAM-Net with the SVM, KNN, RF, CNN, RNN, ViT, VITCNN, and SF. The parameter settings and background of each model are given below; a hedged configuration sketch for the traditional baselines follows the list:
(1) For the SVM, the radial basis function (RBF) kernel was used, with its kernel parameter set to 1 × 10−2. This kernel is suitable for small sample datasets and is sensitive to feature selection.
(2) For the KNN, we chose five for n_neighbors, uniform for weights, and auto for algorithm. It is simple and easy to implement, but these settings make it sensitive to high-dimensional data and noise.
(3) For the RF, the n_estimators parameter is crucial; it was set to 300 for optimal performance, and the tree depth was set to 30. This makes it suitable for large-scale datasets and insensitive to feature selection.
(4) For the CNN, the architecture includes convolutional kernels of a given size and stride, a batch normalization layer, a ReLU activation function, a max pooling layer, a fully connected layer, and an output layer. It is applicable to image data and is capable of automatic feature extraction.
(5) For the RNN, we chose 128 units for the LSTM/GRU layer and 64 for the fully connected layer, suiting sequential data such as time series remote sensing images.
(6) For the ViT, the architecture comprises five encoder blocks, with a patch size set to 64, four attention heads, eight-layer MLPs, and a dropout layer that suppresses 10% of neurons. It is a Transformer-based model designed for large-scale image data.
(7) For the VITCNN, we used the parameter settings outlined in reference [36], combining the strengths of the CNN and ViT for processing complex image data.
(8) For SF (SpectralFormer), the multiscale convolutional kernel size was set to four, with four attention heads, eight-layer MLPs, and four spectral convolutional layers. This configuration is suitable for hyperspectral data.
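For reproducibility, the traditional baselines above translate naturally into scikit-learn estimators; the sketch below is our hedged reading of the listed settings (in particular, mapping the 1 × 10−2 value to the RBF gamma parameter is our assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hedged translation of the listed settings into scikit-learn objects.
baselines = {
    "SVM": SVC(kernel="rbf", gamma=1e-2),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto"),
    "RF": RandomForestClassifier(n_estimators=300, max_depth=30),
}

# Typical usage on pixel spectra X (N samples x 7 bands) with labels y:
#     for name, model in baselines.items():
#         model.fit(X_train, y_train)
#         print(name, model.score(X_test, y_test))
```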
Our proposed algorithm, HDAM-Net, builds on the Vision Transformer (ViT) by incorporating the MS attention mechanism and adopts the Multiscale Cascaded Residual Convolution (MSCRC) network model, which effectively exploits the channel and spatial dependencies between multiscale encoder features, thereby closing the semantic gap. The HDAM-Net model is scientific and promising, emphasizing its effectiveness in accurately categorizing land and forest types using multiple remote sensing data sources, and it can accurately determine the cover type and species. It can efficiently process complex image data, whether large-scale imagery, multispectral data, or hyperspectral data.
To assess our model’s performance, we compared it with the SVM, KNN, RF, CNN, RNN, ViT, VITCNN, and our proposed HDAM-Net using the Xiangyin County dataset. Table 7 details the classification results, with the CNN performing the poorest (OA: 82.26%, AA: 79.66%, and Kappa: 0.7883). Traditional methods like the SVM, KNN, and RF showed moderate results. Deep learning methods, including our HDAM-Net, excelled, with the HDAM-Net achieving the top scores (OA: 99.42%, AA: 99.25%, and Kappa: 0.9931), demonstrating its superior capability to extract time series and spatial features from spectral data.
Figure 6 presents the land cover map of Xiangyin County classified according to various methods. We employed the VIT, CNN, RNN, VITCNN, and HDAM-Net deep learning models, which demonstrated strong overall performance, effectively delineating feature class boundaries and distinguishing between closely related classes. Notably, the HDAM-Net outperformed others locally, particularly in accurately identifying Lotus and Rapeseed, and it exhibited a superior ability to discriminate the detection area.

3.2.2. Comparative Analysis of Hyperspectral Data

In this paper, we have performed image classification experiments on three publicly available datasets, as detailed in Table 8.
To evaluate the accuracy, qualitative results, and generalizability of our models, we compared our classification method with existing state-of-the-art models on three hyperspectral datasets. Table 8 details the performance of the various models, measured according to their overall accuracy (OA), average accuracy (AA), and Kappa coefficient values, with the best results for each category highlighted in bold. Overall, the ViT showed the poorest performance, with lower OA, AA, and Kappa values compared to the other models. Figure 7, Figure 8 and Figure 9 present the classification plots generated by different models on the hyperspectral Houston, Indian Pines, and Pavia University datasets, respectively.
Traditional classifiers such as RF, KNN, and SVM exhibited moderate performance on the three hyperspectral datasets, with their OA, AA, and Kappa values ranking in the middle to lower range among all classification methods. In contrast, deep learning models, including RNN and SpectralFormer, demonstrated superior performance. The HDAM-Net algorithm proposed in this paper excelled in extracting time series and spatial features from spectral data, achieving higher classification accuracy and showing enhanced performance for categories with fewer training samples, with accuracies of 86.69%, 77.54%, and 83.37%. Figure 10 displays the training loss curves of the HDAM-Net on the four datasets.

3.3. Analysis of Land Use Change in Forests and Trees in the Study Area

This study utilized Sentinel-2 series imagery obtained from the European Space Agency’s official website to analyze features in the Heshan District of Yiyang City, Hunan Province. Imagery from the years 2019, 2021, and 2023 was used for classification. Following preprocessing of the image data, field-collected sample data were annotated with regions of interest (ROIs) using ENVI software, and the ROI coordinate point data were exported to a text file. The training and test sets were divided in a 7:3 ratio. Detailed information about the sample pool within the study area for the years 2019, 2021, and 2023 is presented in Table 9.
The algorithm presented in this study was utilized for land cover classification, land use change analysis, and comparison with existing classification knowledge and models, with a focus on forests and trees in the Heshan District. By examining changes in these specific areas, a more comprehensive understanding of forest distribution and agricultural production in the region can be achieved. According to the data in Table 10, Table 11 and Table 12, the HDAM-Net algorithm demonstrated superior classification performance when applied to Sentinel-2 images of the Heshan District for the years 2019, 2021, and 2023. Consequently, this chapter will investigate land use changes in the Heshan District using the classification methodology proposed in this research.
The application of various classification methods to datasets from 2019, 2021, and 2023 in the Heshan District of Yiyang City, Hunan Province, revealed that the HDAM-Net model consistently delivered the best qualitative results across all three phases of classification in the region. Specifically, the HDAM-Net model achieved OA values of 98.49%, 98.10%, and 99.20% for the respective datasets of 2019, 2021, and 2023. Notably, the HDAM-Net demonstrated superior performance over the three years in both forested areas, a key area of interest, and agricultural cultivation areas, making it an effective choice for analyzing land use changes in the Heshan District of Yiyang City, Hunan Province.
Based on the data presented in Table 13, it can be inferred that the forest and tree areas in Heshan District, Yiyang City, Hunan Province are experiencing gradual growth. This trend is likely a response to the national initiative of converting farmland back to forests. The forest area has shown a consistent, albeit slight, increase across the years 2019, 2021, and 2023. In contrast, the tree area has exhibited more significant fluctuations over the same period. While an overall upward trend is evident, a temporary decline between 2019 and 2021 may be attributed to the impact of the epidemic during that period, resulting in slower expansion of the tree area. Subsequently, a notable increase in this region occurred post-epidemic. Figure 11 offers a detailed representation of the land cover distribution within the Heshan District of Yiyang City, Hunan Province.
The primary vegetation cover types observed in Heshan District consist of forests and tree areas. These areas have been identified as focal points for research to facilitate the enhanced monitoring of vegetation alterations and the comprehension of ecological advancements within the region. By conducting a detailed examination of the changes within these key zones, a more comprehensive understanding of land resource utilization and transformation patterns can be attained. This, in turn, can facilitate the optimization of resource distribution, foster balanced ecological progress, and bolster agricultural and planting productivity. The analysis of land use within forests and tree areas holds significant importance for the Heshan District in its pursuit of sustainable development and ecological preservation. By integrating the data presented in Table 13 to evaluate alterations in land use categories across Heshan District over a three-year period, it is evident that each feature category exhibited either an upward or downward trajectory, potentially influenced by factors such as the pandemic and the implementation of national policies promoting the conversion of farmland to forested areas. The primary focus of this study revolves around the dynamic transformations occurring within forests and tree areas within the study region, as illustrated in Figure 12 depicting changes observed over the course of three years in Heshan District.
Land cover classification confirms that our proposed method is indeed suitable for detecting changes in shrub and tree cover. This is crucial for determining forest fire risk. By analyzing and classifying land cover types, flammable areas, potential sources of ignition, and fire spread paths can be identified to assess and predict forest fire risk. Our method identifies flammable areas, assesses combustible materials, analyzes topographic and climatic factors, identifies potential fire sources, and predicts fire spread paths. By comprehensively analyzing these factors, a scientific basis is provided for fire prevention, monitoring, and emergency response. Therefore, the proposed method plays a vital role in land cover classification and fire risk assessment, enhancing the accuracy and efficiency of the assessment.
From the analysis of tree species distribution and their dynamic changes, it is evident that remote sensing image processing technology, leveraging advanced algorithms and data analysis capabilities, can effectively identify and quantify key indicators of forest ecosystems. By utilizing multispectral and hyperspectral remote sensing data in conjunction with proposed algorithms, different tree species can be classified, and the extent of species changes can be analyzed quantitatively. In this study, we conducted a quantitative analysis of the degree of tree species change. Through time series remote sensing imagery, the composition of tree species over time can be monitored. Change detection techniques can identify shifts in forest species, such as natural regeneration or the introduction of new species by humans. This dynamic analysis is crucial for understanding the succession process of forest ecosystems and evaluating the effectiveness of forest management measures. In the event of a forest experiencing fire, pest infestations, or other disturbances, remote sensing technology can be used to monitor the recovery process. By comparing remote sensing images before and after the disturbance, changes in tree species can be quantified, and the speed and effectiveness of forest recovery can be assessed. In summary, the algorithm technology we propose provides a powerful tool for the monitoring and management of forest ecosystems, enhancing our understanding of forest dynamic changes and providing a scientific basis for the sustainable utilization and protection of forest resources.
In conclusion, the data provided by remote sensing image processing technology regarding tree species distribution, dynamic changes, biomass, and the monitoring of disturbances and recovery, holds profound practical significance for forest management and conservation decisions. The data not only enhance our understanding of forest ecosystems but also provide decision support for the effective management and protection of forest resources, playing an irreplaceable role in achieving sustainable forest development and the health maintenance of ecosystems.

4. Discussion

In the highly complex field of remote sensing image processing, challenges arise from the differences in resolution and semantic information at various feature levels, as well as the diverse topographic conditions that favor specific crop varieties. The distribution of land categories is closely linked to local agricultural practices, often exhibiting characteristic clustering within certain spatial ranges. To address these complexities, we have introduced an efficient method for land cover classification in high-resolution hyperspectral and multispectral remote sensing imagery.
Our proposed HDAM-Net algorithm effectively models contextual information in a token-based, space–time domain, using context-rich tokens to enhance the original feature representations. It can be regarded as an attention-based approach that adaptively adjusts attention weights between each pixel and its neighboring pixels, facilitating feature extraction within channels and across spatial dimensions.
To validate the effectiveness of the HDAM-Net, our work was compared with other studies. These experiments evaluate the performance of traditional machine learning algorithms, such as the SVM, KNN, and RF, and deep learning algorithms, including the CNN, RNN, ViT, and SF, on multispectral and hyperspectral images. On the Xiangyin County dataset, the traditional algorithms achieved accuracies above 90 percent. The CNN, with its stronger handling of fine detail, performed well on the hyperspectral data but less so on the multispectral data, whereas the ViT-based models performed very well, and our algorithm likewise showed superiority on the multispectral data. On the hyperspectral data, the CNN's advantage was clear and the traditional algorithms differed little in classification performance, yet our algorithm delivered even better classification accuracy. On the multispectral prediction maps, the delineation of forest areas demonstrated that the deep learning models performed well, and under local-scale observation our algorithm was clearly superior to the other deep learning algorithms. On the hyperspectral prediction maps, the CNN was somewhat clearer and more accurate than the other baselines, but our algorithm still presented the best classification ability.
Moreover, the HDAM-Net exhibited high efficiency and broad applicability. It can efficiently process high-resolution remote sensing imagery, enhancing both the efficiency and accuracy of classification as well as its adaptability to various terrain conditions and crop varieties. Remote sensing technology can cover large forest regions, providing macro-level information on forest resources, which is extremely beneficial for large-scale forest surveys and management. Through platforms such as satellites and drones, it can achieve real-time or near-real-time monitoring of forest ecosystems, promptly detecting changes and anomalies. Remote sensing offers multispectral, hyperspectral, and LiDAR data sources that can be used complementarily to improve the accuracy of forest feature identification. Compared to traditional ground surveys, it offers significant advantages in cost and time, enabling the acquisition of vast amounts of forest information at lower cost. Additionally, the HDAM-Net aligns well with the agricultural cultivation patterns observed in different regions, making it a valuable tool for precision agriculture operations.
While the HDAM-Net demonstrates numerous advantages in forestry applications, it also has certain limitations. Despite increasing sensor resolution, some remote sensing data still cannot resolve small-scale forest features, such as individual tree species or small vegetation communities. Future work should therefore integrate different types of remote sensing data (such as optical, radar, and LiDAR data) to enhance spatial and spectral resolution and improve the identification of forest features. Through these improvements, the accuracy and efficiency of remote sensing image processing in forestry applications will be further enhanced, providing stronger support for the sustainable management and protection of forest resources.

5. Conclusions

In this study, we focused on enhancing feature recognition by fusing channel information and spatially enriched contexts from multisource datasets, especially for the fine-grained analysis of forest ecosystems. Our proposed HDAM-Net algorithm combines the Vision Transformer (ViT) and Multilayer Residual Cascade Convolution (MSCRC) techniques to form a semantic segmentation framework with a dual-encoder architecture designed for fine-grained classification of forest land cover. By introducing the Double Cross-Attention Module (DCAM), our approach is able to exploit feature correlation and channel importance to extract more discriminative forest features, leading to significant progress in tree species identification and forest structure analysis. The dual-channel parallel architecture effectively facilitates the exchange of feature information and improves the recognition accuracy of similar forest feature types at different scales. Our results show that the HDAM-Net has significant advantages in forest land cover classification, especially in improving classification efficiency and accuracy. In addition, it has great potential for application in forestry, providing important support for forest resource surveys, health monitoring, biodiversity assessment, and the development of climate change adaptation strategies.
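The skeleton below illustrates the general shape of such a dual-encoder design: a transformer branch over pixel tokens and a residual convolutional branch whose outputs are concatenated before a per-pixel classification head. It is a deliberately simplified sketch under our own assumptions (layer sizes, fusion by concatenation) and omits the MS and DCAM attention modules of the actual HDAM-Net.

```python
import torch
import torch.nn as nn

class DualBranchSegmenter(nn.Module):
    """Simplified two-branch (transformer + residual CNN) segmenter."""

    def __init__(self, in_ch: int, num_classes: int, dim: int = 64):
        super().__init__()
        # Branch 1: transformer encoder over flattened pixel tokens.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=1)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Branch 2: small residual convolutional encoder for local detail.
        self.conv1 = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        # Fuse both feature streams and predict per-pixel class logits.
        self.head = nn.Conv2d(2 * dim, num_classes, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2)          # (B, HW, dim)
        t_feat = self.transformer(tokens).transpose(1, 2).reshape(b, -1, h, w)
        c = torch.relu(self.conv1(x))
        c_feat = c + self.conv2(c)                                 # residual link
        return self.head(torch.cat([t_feat, c_feat], dim=1))

# Example: a 13-band (Sentinel-2-like) patch with 7 land cover classes
logits = DualBranchSegmenter(in_ch=13, num_classes=7)(torch.randn(1, 13, 32, 32))
print(logits.shape)  # torch.Size([1, 7, 32, 32])
```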
However, our HDAM-Net has limitations in accurately extracting feature boundaries, where some weak features may be missed, and its higher classification accuracy is not matched by low parameter counts or short runtimes. To address these issues, we intend to further explore boundary feature algorithms and to increase classification accuracy while ensuring time efficiency. In addition, to make full use of high-resolution remote sensing images with complex features, we plan to extend our dataset by creating a multisource remote sensing image dataset focused on agricultural features. This extension aims to comprehensively evaluate and improve the performance of our proposed method.

Author Contributions

Conceptualization, X.F., X.L. and J.F.; methodology, X.F., X.L. and J.F.; software, X.F., X.L. and J.F.; validation, X.F., X.L. and J.F.; formal analysis, X.F., X.L. and J.F.; investigation, X.F., X.L. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62261004) and the Guangxi Key Research and Development Program (AB24010312).

Data Availability Statement

Data are contained within the article; further data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

MSCRC: Multilayer Residual Cascade Convolution
DCAM: Double Cross-Attention Module
MS: MCA + SAM
SF: SpectralFormer
GIS: Geographic Information System
SVMs: Support Vector Machines
RNNs: Recurrent Neural Networks
MSI: Multispectral Instrument
ViT: Vision Transformer
CNNs: Convolutional Neural Networks
VNIR: Visible and Near-Infrared
SWIRs: Shortwave Infrared Bands
ROI: Region of Interest
NCALM: NSF-Funded Center for Airborne Laser Mapping
AVIRIS: Airborne Visible/Infrared Imaging Spectrometer
HSGSE: Half-Sequence Grouping Spectral Embedding
MCA: Multihead Channel Attention
OA: Overall Accuracy
SAM: Spatial Attention Module
AA: Average Accuracy

References

1. Li, J.; Pei, Y.; Zhao, S.; Xiao, R.; Sang, X.; Zhang, C. A review of remote sensing for environmental monitoring in China. Remote Sens. 2020, 12, 1130.
2. Macarringue, L.S.; Bolfe, É.L.; Pereira, P.R.M. Developments in land use and land cover classification techniques in remote sensing: A review. J. Geogr. Inf. Syst. 2022, 14, 1–28.
3. Afaq, Y.; Manocha, A. Analysis on change detection techniques for remote sensing applications: A review. Ecol. Inform. 2021, 63, 101310.
4. Van Westen, C. Remote sensing for natural disaster management. Int. Arch. Photogramm. Remote Sens. 2000, 33, 1609–1617.
5. Sishodia, R.P.; Ray, R.L.; Singh, S.K. Applications of remote sensing in precision agriculture: A review. Remote Sens. 2020, 12, 3136.
6. Zhao, Q.; Yu, L.; Du, Z.; Peng, D.; Hao, P.; Zhang, Y.; Gong, P. An overview of the applications of earth observation satellite data: Impacts and future trends. Remote Sens. 2022, 14, 1863.
7. Zhao, S.; Wang, Q.; Li, Y.; Liu, S.; Wang, Z.; Zhu, L.; Wang, Z. An overview of satellite remote sensing technology used in China’s environmental protection. Earth Sci. Inform. 2017, 10, 137–148.
8. Jhawar, M.; Tyagi, N.; Dasgupta, V. Urban planning using remote sensing. Int. J. Innov. Res. Sci. Eng. Technol. 2013, 1, 42–57.
9. Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and applications. Math. Probl. Eng. 2022, 2022, 5880959.
10. Foody, G.M. Status of land cover classification accuracy assessment. Remote Sens. Environ. 2002, 80, 185–201.
11. Jaiswal, R.K.; Saxena, R.; Mukherjee, S. Application of remote sensing technology for land use/land cover change analysis. J. Indian Soc. Remote Sens. 1999, 27, 123–128.
12. Zeferino, L.B.; de Souza, L.F.T.; do Amaral, C.H.; Fernandes Filho, E.I.; de Oliveira, T.S. Does environmental data increase the accuracy of land use and land cover classification? Int. J. Appl. Earth Obs. Geoinf. 2020, 91, 102128.
13. Jansen, L.J.; Di Gregorio, A. Land-use data collection using the “land cover classification system”: Results from a case study in Kenya. Land Use Policy 2003, 20, 131–148.
14. Hiscock, O.H.; Back, Y.; Kleidorfer, M.; Urich, C. A GIS-based land cover classification approach suitable for fine-scale urban water management. Water Resour. Manag. 2021, 35, 1339–1352.
15. Lateef, F.; Ruichek, Y. Survey on semantic segmentation using deep learning techniques. Neurocomputing 2019, 338, 321–348.
16. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Martinez-Gonzalez, P.; Garcia-Rodriguez, J. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65.
17. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
18. Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283.
19. Peterson, L.E. K-nearest neighbor. Scholarpedia 2009, 4, 1883.
20. Kodinariya, T.M.; Makwana, P.R. Review on determining number of cluster in K-means clustering. Int. J. 2013, 1, 90–95.
21. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
22. Jolliffe, I.T. Principal Component Analysis: A Beginner’s Guide—I. Introduction and Application; Blackwell Publishing Ltd.: Oxford, UK, 1990; Volume 45, pp. 375–382.
23. Shin, D.H.; Park, R.H.; Yang, S.; Jung, J.H. Block-based noise estimation using adaptive Gaussian filtering. IEEE Trans. Consum. Electron. 2005, 51, 218–226.
24. Torre, V.; Poggio, T.A. On edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 147–163.
25. Yokoya, N.; Grohnfeldt, C.; Chanussot, J. Hyperspectral and multispectral data fusion: A comparative review of the recent literature. IEEE Geosci. Remote Sens. Mag. 2017, 5, 29–56.
26. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107.
27. Li, J.; Cui, R.; Li, B.; Li, Y.; Mei, S.; Du, Q. Dual 1D-2D spatial-spectral CNN for hyperspectral image super-resolution. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3113–3116.
28. Maulik, U.; Chakraborty, D. Remote sensing image classification: A survey of support-vector-machine-based advanced techniques. IEEE Geosci. Remote Sens. Mag. 2017, 5, 33–52.
29. Guo, B.; Damper, R.I.; Gunn, S.R.; Nelson, J.D. A fast separability-based feature-selection method for high-dimensional remotely sensed image classification. Pattern Recognit. 2008, 41, 1653–1662.
30. Li, W.; Guo, Q.; Elkan, C. One-class remote sensing classification from positive and unlabeled background data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 730–746.
31. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394.
32. Wang, B.; Lei, Y.; Li, N.; Yan, T. Deep separable convolutional network for remaining useful life prediction of machinery. Mech. Syst. Signal Process. 2019, 134, 106330.
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
35. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15.
36. Fan, X.; Li, X.; Yan, C.; Fan, J.; Yu, L.; Wang, N.; Chen, L. MARC-Net: Terrain classification in parallel network architectures containing multiple attention mechanisms and multi-scale residual cascades. Forests 2023, 14, 1060.
37. Fan, X.; Li, X.; Yan, C.; Fan, J.; Chen, L.; Wang, N. Converging channel attention mechanisms with multilayer perceptron parallel networks for land cover classification. Remote Sens. 2023, 15, 3924.
Figure 1. Overview of the research data used in this paper.
Figure 2. Example of class labels in the sample library of Heshan District.
Figure 3. Schematic structure of the HDAM algorithm proposed in this paper.
Figure 4. Half-sequence grouping spectral embedding schematic.
Figure 5. DCA block.
Figure 6. Testing accuracy of various algorithms in the Xiangyin County study area: (a) image, (b) SVM, (c) KNN, (d) RF, (e) CNN, (f) RNN, (g) ViT, (h) HDAM-Net, (i) ViTCNN.
Figure 7. Results of different algorithms on the Houston dataset: (a) SVM, (b) KNN, (c) RF, (d) CNN, (e) RNN, (f) ViT, (g) SF, (h) HDAM-Net.
Figure 8. Results of different algorithms on the Indian Pines dataset, as well as the spatial distribution of the Indian Pines training and test sets: (a) SVM, (b) KNN, (c) RF, (d) CNN, (e) RNN, (f) ViT, (g) SF, (h) HDAM-Net.
Figure 9. Results of different algorithms on the Pavia University dataset: (a) SVM, (b) KNN, (c) RF, (d) CNN, (e) RNN, (f) ViT, (g) SF, (h) HDAM-Net.
Figure 10. Loss curves of the proposed HDAM-Net algorithm during training.
Figure 11. Spatial feature distribution of Sentinel-2 imagery in Heshan District for three years: (a) 2019, (b) 2021, (c) 2023.
Figure 12. Three-year tree migration in Heshan District: (a) 2019–2021, (b) 2021–2023, (c) 2019–2023. Three-year dynamics of forests in Heshan District: (d) 2019–2021, (e) 2021–2023, (f) 2019–2023.
Table 1. Information on the Sentinel-2 bands used.

Wave Band | Resolution | Center Wavelength | Description
B1 | 60 m | 443 nm | ultramarine
B2 | 10 m | 490 nm | blue
B3 | 10 m | 560 nm | green
B4 | 10 m | 665 nm | red
B5 | 20 m | 705 nm | VNIR
B6 | 20 m | 740 nm | VNIR
B7 | 20 m | 783 nm | VNIR
B8 | 10 m | 842 nm | VNIR
B8A | 20 m | 865 nm | VNIR
B9 | 60 m | 940 nm | SWIR
B10 | 60 m | 1375 nm | SWIR
B11 | 20 m | 1610 nm | SWIR
B12 | 20 m | 2190 nm | SWIR
Table 2. Object class names and numbers of training and testing samples in the Houston dataset.

C N. | Class Name | Training | Testing
1 | Healthy Grass | 198 | 1053
2 | Stressed Grass | 190 | 1064
3 | Synthetic Grass | 192 | 505
4 | Tree | 188 | 1056
5 | Soil | 186 | 1056
6 | Water | 182 | 143
7 | Residential | 196 | 1072
8 | Commercial | 191 | 1053
9 | Road | 193 | 1059
10 | Highway | 191 | 1036
11 | Railway | 181 | 1054
12 | Parking Lot 1 | 192 | 1041
13 | Parking Lot 2 | 184 | 285
14 | Tennis Court | 181 | 247
15 | Running Track | 187 | 473
| Total | 2832 | 12,197
Table 3. Object class names and numbers of training and testing samples in the Indian Pines dataset.

C N. | Class Name | Training | Testing
1 | Corn Notill | 50 | 1384
2 | Corn Mintill | 50 | 784
3 | Corn | 50 | 184
4 | Grass Pasture | 50 | 447
5 | Grass Trees | 50 | 697
6 | Hay Windrowed | 50 | 439
7 | Soybean Notill | 50 | 918
8 | Soybean Mintill | 50 | 2418
9 | Soybean Clean | 50 | 564
10 | Wheat | 50 | 162
11 | Woods | 50 | 1244
12 | Buildings Grass Trees Drives | 50 | 330
13 | Stones Steel Towers | 50 | 45
14 | Alfalfa | 15 | 39
15 | Grass Pasture Mowed | 15 | 11
16 | Oats | 15 | 5
| Total | 695 | 9671
Table 4. Object class names and numbers of training and testing samples in the Pavia University dataset.

C N. | Class Name | Training | Testing
1 | Asphalt | 548 | 6304
2 | Meadows | 540 | 18,146
3 | Gravel | 392 | 1815
4 | Trees | 524 | 2912
5 | Metal Sheets | 265 | 1113
6 | Bare Soil | 532 | 4572
7 | Bitumen | 375 | 981
8 | Bricks | 514 | 3364
9 | Shadows | 231 | 795
| Total | 3921 | 40,002
Table 5. Ablation experiments of the HDAM network on the main study area dataset.

Different Methods | HGSE | MSCC | DSC | MS + leakyRelu | OA (%) ↑ | AA (%) ↑ | Kappa ↑ | Time (s) ↓
ViT | × | × | × | × | 97.35 | 97.24 | 0.9684 | 1276.98
HDAM-Net | ✓ | × | × | × | 98.29 | 98.28 | 0.9796 | 1447.75
HDAM-Net | ✓ | ✓ | × | × | 98.99 | 98.67 | 0.9879 | 1551.63
HDAM-Net | ✓ | ✓ | ✓ | × | 99.28 | 99.15 | 0.9914 | 1693.33
HDAM-Net | ✓ | ✓ | ✓ | ✓ | 99.42 | 99.25 | 0.9931 | 1706.91
Table 6. Test results of HDAM-Net using different training sample ratios in Xiangyin County data.

Train | C1 | C2 | C3 | C4 | C5 | C6 | C7 | OA (%) ↑ | AA (%) ↑ | Kappa ↑ | Time (s) ↓
10% | 99.49 | 99.62 | 100.00 | 96.64 | 93.53 | 99.60 | 99.66 | 98.56 | 98.37 | 0.9829 | 1599.21
20% | 99.36 | 98.68 | 100.00 | 99.52 | 97.26 | 99.20 | 99.83 | 99.15 | 99.13 | 0.9899 | 1462.91
30% | 99.66 | 99.06 | 100.00 | 97.12 | 96.51 | 99.20 | 100.00 | 98.81 | 98.80 | 0.9558 | 1415.57
40% | 99.04 | 99.95 | 100.00 | 97.42 | 94.77 | 99.20 | 99.75 | 98.78 | 98.59 | 0.9855 | 1620.73
50% | 98.98 | 98.91 | 100.00 | 98.99 | 97.81 | 99.28 | 99.80 | 99.09 | 99.11 | 0.9892 | 1489.16
60% | 99.66 | 99.96 | 99.91 | 99.32 | 99.25 | 99.80 | 99.83 | 99.70 | 99.68 | 0.9964 | 2091.09
70% | 99.81 | 99.67 | 100.00 | 99.55 | 96.01 | 99.88 | 99.76 | 99.42 | 99.25 | 0.9931 | 1433.95
80% | 99.30 | 99.85 | 100.00 | 98.89 | 97.82 | 99.95 | 99.87 | 99.43 | 99.39 | 0.9933 | 1952.07
90% | 99.74 | 99.87 | 99.94 | 97.92 | 98.17 | 99.78 | 99.88 | 99.35 | 99.33 | 0.9922 | 1706.91
Table 7. Testing accuracy of various algorithms in Xiangyin County study area.

C N. | SVM | KNN | RF | CNN | RNN | ViT | ViTCNN | HDAM-Net
1 | 95.18 | 90.12 | 95.52 | 81.13 | 94.60 | 97.24 | 98.37 | 99.81
2 | 97.69 | 99.62 | 99.56 | 95.31 | 99.27 | 98.74 | 99.78 | 99.67
3 | 96.26 | 96.77 | 97.45 | 86.17 | 98.03 | 99.63 | 99.92 | 100.00
4 | 84.66 | 89.93 | 90.33 | 77.39 | 92.94 | 93.18 | 96.84 | 99.55
5 | 76.11 | 91.87 | 88.88 | 49.18 | 97.75 | 91.32 | 93.17 | 96.01
6 | 94.33 | 98.02 | 96.31 | 78.05 | 96.38 | 98.35 | 98.64 | 99.88
7 | 97.34 | 99.55 | 98.00 | 90.36 | 99.47 | 99.19 | 99.81 | 99.76
OA (%) ↑ | 92.47 | 95.13 | 95.52 | 82.26 | 96.27 | 96.92 | 98.32 | 99.42
AA (%) ↑ | 91.66 | 95.13 | 95.16 | 79.66 | 96.07 | 96.81 | 98.08 | 99.25
Kappa ↑ | 0.9102 | 0.9419 | 0.9466 | 0.7883 | 0.9555 | 0.9633 | 0.9800 | 0.9931
Table 8. Accuracy of test results for various algorithms in hyperspectral datasets.

Dataset | Metric | SVM | KNN | RF | CNN | RNN | ViT | SF | Our
Houston | OA (%) ↑ | 73.63 | 79.42 | 77.59 | 84.15 | 78.07 | 75.82 | 77.31 | 86.69
Houston | AA (%) ↑ | 74.42 | 80.76 | 80.41 | 85.53 | 80.19 | 78.15 | 79.56 | 83.19
Houston | Kappa ↑ | 0.7141 | 0.7769 | 0.7625 | 0.8280 | 0.7625 | 0.7383 | 0.7541 | 0.8505
Indian | OA (%) ↑ | 55.32 | 60.56 | 69.66 | 71.74 | 53.27 | 50.64 | 75.38 | 77.54
Indian | AA (%) ↑ | 49.08 | 71.40 | 76.77 | 78.03 | 53.10 | 56.12 | 81.20 | 75.19
Indian | Kappa ↑ | 0.4916 | 0.5564 | 0.6576 | 0.6787 | 0.4673 | 0.4486 | 0.7192 | 0.7281
Pavia | OA (%) ↑ | 71.97 | 70.83 | 69.28 | 81.93 | 78.35 | 68.83 | 74.95 | 83.37
Pavia | AA (%) ↑ | 76.65 | 79.92 | 80.01 | 86.21 | 84.05 | 77.69 | 83.30 | 81.31
Pavia | Kappa ↑ | 0.6320 | 0.6323 | 0.6196 | 0.7628 | 0.7223 | 0.6018 | 0.6797 | 0.8000
Table 9. Sample sizes of the dataset for the study area of Heshan District, Yiyang City, Hunan Province.

C N. | Class Name | Training | Testing
1 | Buildup | 1396 | 599
2 | Water | 1663 | 713
3 | Tree | 599 | 257
4 | Pond | 256 | 111
5 | WetLand | 1609 | 690
6 | Vegetable | 791 | 340
7 | Forests | 1102 | 473
| Total | 7416 | 3183
Table 10. Testing accuracy of different algorithms on Heshan District 2019 dataset.

C N. | SVM | RF | CNN | RNN | ViT | SF | ViTCNN | HDAM-Net
1 | 75.95 | 86.97 | 61.03 | 70.55 | 96.06 | 95.72 | 96.91 | 98.85
2 | 97.19 | 98.03 | 90.55 | 92.54 | 92.06 | 95.42 | 97.41 | 96.09
3 | 69.64 | 73.92 | 60.76 | 64.27 | 95.65 | 98.83 | 99.33 | 99.33
4 | 61.26 | 69.36 | 42.18 | 59.37 | 94.92 | 98.82 | 99.60 | 100.00
5 | 83.47 | 85.94 | 61.28 | 76.88 | 95.64 | 98.25 | 98.44 | 99.62
6 | 49.41 | 80.58 | 37.67 | 68.14 | 87.48 | 88.24 | 92.16 | 98.35
7 | 83.72 | 89.00 | 68.96 | 83.66 | 95.73 | 97.09 | 97.73 | 99.27
OA (%) ↑ | 79.64 | 87.18 | 65.72 | 77.66 | 94.04 | 95.97 | 97.26 | 98.49
AA (%) ↑ | 74.38 | 83.41 | 60.35 | 73.63 | 93.94 | 96.06 | 97.37 | 98.79
Kappa ↑ | 75.12 | 84.41 | 58.25 | 72.88 | 92.79 | 95.12 | 96.69 | 98.17
Table 11. Testing accuracy of different algorithms on Heshan District 2021 dataset.

C N. | SVM | RF | CNN | RNN | ViT | SF | ViTCNN | HDAM-Net
1 | 81.96 | 88.14 | 52.14 | 85.53 | 96.20 | 98.20 | 97.34 | 99.14
2 | 97.61 | 98.73 | 93.86 | 95.00 | 95.73 | 96.15 | 98.61 | 98.37
3 | 69.64 | 78.21 | 67.44 | 82.47 | 96.99 | 99.16 | 99.33 | 99.33
4 | 71.17 | 72.07 | 58.98 | 79.29 | 98.43 | 99.60 | 100.00 | 100.00
5 | 79.42 | 87.39 | 76.38 | 83.34 | 95.52 | 96.89 | 98.69 | 98.57
6 | 61.76 | 88.82 | 64.22 | 79.51 | 83.43 | 90.01 | 92.66 | 92.41
7 | 71.88 | 86.89 | 66.69 | 89.92 | 93.46 | 93.28 | 97.36 | 98.63
OA (%) ↑ | 79.89 | 88.88 | 71.68 | 86.73 | 94.32 | 95.98 | 97.68 | 98.10
AA (%) ↑ | 76.21 | 85.75 | 68.54 | 85.01 | 94.26 | 96.19 | 97.72 | 98.07
Kappa ↑ | 75.42 | 86.48 | 65.48 | 83.91 | 93.12 | 95.13 | 97.19 | 97.70
Table 12. Testing accuracy of different algorithms on Heshan District 2023 dataset.

C N. | SVM | RF | CNN | RNN | ViT | SF | ViTCNN | HDAM-Net
1 | 81.13 | 86.97 | 54.87 | 87.24 | 97.27 | 96.48 | 98.92 | 99.42
2 | 97.89 | 97.19 | 95.55 | 89.77 | 93.74 | 96.03 | 97.83 | 98.37
3 | 61.86 | 75.48 | 44.90 | 74.62 | 97.82 | 98.49 | 96.49 | 100.00
4 | 79.27 | 83.78 | 64.06 | 85.93 | 100.00 | 100.00 | 100.00 | 100.00
5 | 76.23 | 87.97 | 77.31 | 86.01 | 97.57 | 98.44 | 98.69 | 99.50
6 | 65.00 | 87.64 | 55.62 | 76.35 | 91.78 | 95.82 | 98.73 | 98.98
7 | 77.80 | 90.27 | 63.33 | 78.31 | 93.46 | 97.64 | 98.54 | 99.27
OA (%) ↑ | 79.99 | 89.00 | 69.71 | 83.99 | 95.54 | 97.20 | 98.40 | 99.20
AA (%) ↑ | 77.03 | 87.05 | 65.10 | 82.61 | 95.95 | 97.56 | 98.46 | 99.37
Kappa ↑ | 75.61 | 86.64 | 62.90 | 80.58 | 94.60 | 96.61 | 98.06 | 99.04
Table 13. Trend direction of land use types in Heshan District for the three years: 2019, 2021, and 2023.

C N. | Area 2019 (km²) | Area 2021 (km²) | Area 2023 (km²) | Change 2019–2021 (%) | Change 2021–2023 (%) | Change 2019–2023 (%)
Building | 1096.88 | 802.98 | 929.29 | −26.79 | 15.73 | −15.28
Water | 1260.18 | 1416.03 | 1245.96 | 12.37 | −12.01 | −1.13
Tree | 304.67 | 280.44 | 434.35 | −7.95 | 54.88 | 42.56
Pond | 265.42 | 116.57 | 148.78 | −56.08 | 27.63 | −43.95
Wetland | 731.10 | 923.62 | 754.51 | 26.33 | −18.31 | 3.20
Vegetable | 396.17 | 480.24 | 358.44 | 21.22 | −25.36 | −9.52
Forests | 859.59 | 894.13 | 1042.68 | 4.02 | 16.61 | 21.30