Article

MARC-Net: Terrain Classification in Parallel Network Architectures Containing Multiple Attention Mechanisms and Multi-Scale Residual Cascades

1 School of Automation, Guangxi University of Science and Technology, Liuzhou 545006, China
2 Guangxi Collaborative Innovation Centre for Earthmoving Machinery, Guangxi University of Science and Technology, Liuzhou 545006, China
3 National Satellite Meteorological Center, China Meteorological Administration, Beijing 100081, China
* Author to whom correspondence should be addressed.
Forests 2023, 14(5), 1060; https://doi.org/10.3390/f14051060
Submission received: 22 April 2023 / Revised: 13 May 2023 / Accepted: 16 May 2023 / Published: 22 May 2023

Abstract
To address the problem that traditional deep learning algorithms cannot fully utilize the correlations within spectral sequence information or the feature differences between different spectra, this paper proposes a parallel network architecture for land-use classification, called MARC-Net, that combines a multi-head attention mechanism with a multiscale residual cascade. The parallel framework first deepens the mining of inter-spectral information in the features generated by groupwise spectral embedding by adding a multi-head attention mechanism, which allows semantic features to be expressed from more subspaces while fully considering the interrelationships among all spatial locations. Secondly, a multiscale residual cascaded CNN (convolutional neural network) is designed to fully utilize the fused feature information at different scales, thereby improving the network's ability to represent different levels of information. Lastly, the features obtained by the multi-head attention mechanism are fused with those obtained by the CNN, and the merged features are reduced in dimensionality through a fully connected layer to obtain the classification results, achieving pixel-level multispectral image classification. The findings show that the proposed algorithm achieves an overall accuracy of 97.22%, compared with 95.08% for the Vision Transformer (ViT), a substantial improvement on the Sentinel-2 dataset. Moreover, this article focuses on the rate of change of forest land in the study area. The forest land area was 125.1143 km² in 2017, 105.6089 km² in 2019, and 76.3699 km² in 2021, corresponding to decreases of 15.59%, 27.69%, and 38.96% in 2017–2019, 2019–2021, and 2017–2021, respectively.

1. Introduction

Although remote sensing (RS) technology is achieving increasingly remarkable results in practical areas such as crop monitoring, weather forecasting, marine research, and geological surveys [1,2,3,4], as well as land-cover classification, more research is needed because the complexity of feature types in some study areas easily leads to confusion between samples. Land-cover classification plays an extremely important role in tasks such as refined agriculture, land resource exploration, regional geological change, and integrated urban planning [5,6,7,8]. Therefore, accurate, real-time access to remote sensing data to improve the accuracy of land-cover classification is an inevitable need for practical applications. Traditionally, remote sensing data have mainly been used to monitor environmental parameters based on physical models before applying algorithms for predictive classification [9]. Although physical models can generate remote sensing data well from environmental parameters, they are highly dependent on a priori knowledge. Accordingly, various data-mining-based machine learning (ML) methods have non-negligible value in the field of remote sensing image processing. ML algorithms used for RS image recognition include support vector machines (SVMs) [10], random forests (RFs) [11], k-means [12], and k-nearest neighbors (KNNs) [13]. For linearly separable problems, SVMs can be transformed into a constrained optimization problem to perform hyperplane classification. K-means is an iterative, partition-based cluster analysis algorithm that uses distance as a similarity metric, but it is sensitive to anomalous samples and can neither handle discrete features nor guarantee global optimality. KNN classifies using the distances between feature values but copes poorly with large numbers of samples. Therefore, ML algorithms for RS image classification still need improvement. Moreover, the generalization capability of traditional neural networks is unsatisfactory, and their approximation of nonlinear models needs to be improved, making end-to-end image segmentation tasks difficult to implement.
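For readers who want a concrete baseline, the following is a minimal sketch (not from the paper) of pixel-wise classification with the SVM and KNN classifiers mentioned above, using scikit-learn; the arrays, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: pixel-wise classification of multispectral data with
# classical ML baselines. `pixels` is an (N, bands) array of spectra and
# `labels` an (N,) array of class IDs; both are random placeholders here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
pixels = rng.random((1000, 8))           # placeholder spectra: 8 bands
labels = rng.integers(0, 12, size=1000)  # placeholder: 12 land-cover classes

X_train, X_test, y_train, y_test = train_test_split(
    pixels, labels, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # spectra are usually normalized first
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(scaler.transform(X_train), y_train)
    acc = clf.score(scaler.transform(X_test), y_test)
    print(f"{name} overall accuracy: {acc:.4f}")
```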
Deep learning (DL) is a promising research direction based on large-scale deep neural networks. DL models can accurately approximate nonlinear relationships between environmental parameters thanks to their multilevel learning properties [14], which helps to achieve sensing, retrieval, fusion, and downscaling across remote sensing environmental variables. In image processing, deep learning algorithms can extract multiscale and multilevel features, and their unique ability to combine features from low level to high level [15] confers better performance in image processing and classification. Therefore, DL models outperform traditional models in processing remote sensing imagery [16]; as a result, more scholars are applying deep learning to remote sensing research. For example, concerning the security of deep learning for remote sensing image classification, Cheng et al. proposed an effective defense framework for Remotely Sensed Imagery (RSI) scene classification, the perturbation-seeking generative adversarial network, which uses unknown attacks of random type during training to eliminate the blind spots of the classifier [17]. In recent years, deep learning has made significant breakthroughs in fine-grained agricultural classification [18] and has shown good performance in land-cover classification tasks for RS images. Semantic segmentation has also been applied to land-cover classification; it works by associating each pixel with a category label. A semantic segmentation network extracts the semantic features of each pixel from the pixel grayscale information and uses them to accurately classify pixels into different categories. U-Net [19], a classical fully convolutional network, represents one of the early algorithms using multiscale features for semantic segmentation and is widely used on remote sensing images for its good predictive ability. Zhang et al. achieved remarkable performance in building extraction using Inception-U-Net [20], Mustafa et al. used Res-U-Net to segment high-resolution remote sensing images of iron ore [21], Han et al. used the U-Net model for convective precipitation forecasting [22], and Asma et al. demonstrated the ability of a pretrained U-Net to classify satellite images [23], which contributes to satellite pinpointing capability. These studies show that RS image segmentation is a task fraught with complexity.
The application of DL to remote sensing image processing differs from its application to natural images: remote sensing images are usually more complex and diverse and carry richer spatial-spectral information, which places higher demands on processing algorithms. Due to its strong feature representation capability, deep learning has been applied in remote sensing for land-cover classification, environmental parameter retrieval, data fusion, downscaling, information construction, and prediction. Recurrent neural networks (RNNs) [24], whose iterative structures work well for temporal data, and convolutional neural networks (CNNs) [25] currently dominate image processing. Khusni et al. combined CNN and RNN frameworks using a simple mechanism that independently encodes dual-time images to obtain their representation vectors [26]. Although CNNs and RNNs can obtain better classification results than traditional methods, their network structures impose limitations: CNNs mainly focus on local spatial features and ignore the importance of overall connectivity, so they cannot adequately analyze sequences of spectral features [27], whereas RNNs require input-output sequence alignment and cannot perform parallel computation. Since the emergence of the Transformer [28] in natural language processing (NLP), it has received increasing attention. The Transformer architecture relies entirely on the attention mechanism to model the global dependencies between inputs and outputs, and on positional encoding to represent relative or absolute positional relationships. In contrast to CNNs, the number of operations required to compute the association between two positions does not increase with distance. The Google team proposed the Vision Transformer (ViT) [29] as an image classification model, a landmark study in the application of transformers to computer vision. ViT divides the image into patches, flattens them into one-dimensional vectors, and concatenates the vectors to form a tensor; this sequence is input to the Transformer, which uses the pixel sequence to effectively classify the pixels. Hong et al. [27] proposed a new backbone network, SpectralFormer, which learns spectral local sequence information from adjacent bands of images to generate groupwise spectral embeddings. This algorithm also applies to block-level input, enriches feature information through cross-layer skip connections, and generates grouped spectral embeddings over adjacent bands to learn spectral local sequence information.
The classification of multispectral image information, a key issue in remote sensing applications, involves tradeoffs between spatial resolution and temporal coverage. There is an irreconcilable conflict between the spatial and spectral resolution of multispectral images; thus, it is difficult to obtain remote sensing data with both rich spatial and rich temporal resolution. To mitigate this problem, most researchers use spatial and spectral features jointly [30,31]. Yang et al. [32] used a dual-channel CNN to jointly learn spectral and spatial features, whereas Yang et al. [33] proposed using the spectral properties and spatial homogeneity of the spectral-spatial neighborhood map for robust manifold learning before performing the classification task. Lee et al. [34] designed a contextual deep fully convolutional DL network that jointly utilizes spatial and hue-saturation-intensity (HSI) spectral features for learning and classification; variable-size convolutional features were used to create a spectral-spatial feature map.
While multispectral data can be used for land-cover mapping, land-cover mapping based on remote sensing images relies on image classification. Most traditional classification methods classify images on the basis of different spatial units, such as pixels, sliding windows, moving objects, and scenes [35,36,37]. However, traditional methods consider only low-level features in the spectral and spatial domains, and with such limited feature information it is difficult to distinguish complex classes of land structures or terrain objects. Therefore, classification methods using high-level features are preferable. Owing to its advantages in multiscale and multilevel feature extraction, DL has recently been applied to land-cover classification with good results [38,39,40].
CNN algorithms are data-demanding; hence, the widely used Transformer is applied as a more efficient way to process pixel sequence information. Although many ViT studies have aimed to surpass and eventually replace CNNs, stronger performance is obtained when the two are combined [41]. ViT has the advantage of being able to attend to all pixels in an image regardless of scale, but the obvious disadvantage of requiring extensive training data. For approaches incorporated into a CNN architecture, the local context window of the CNN considers only local pixels; although different image locations can be processed with shared weights, combining the CNN with a transformer allows more information to be obtained from less data. To address these problems, a new architecture called MARC-Net is proposed in this paper. It differs substantially from other work incorporating ViT into CNN architectures; the main innovative differences are explained below.
Firstly, the form of the CNN cascade convolution is completely different. Although both designs use CNNs for feature extraction, our CNN branch uses three skip connections across two cascaded convolutional layers, forming a DenseNet-like structure with a 2 × 2 average pooling layer, which is more conducive to multiscale feature fusion. In contrast, most scholars use two skip connections across two-layer convolutions with an SE structure to extend the receptive field. Secondly, regarding the network framework structure, we merge the features obtained by the Transformer encoder with those obtained by the CNN in a parallel manner, thereby shortening the processing time without sacrificing accuracy. In contrast, most scholars feed the features obtained by the CNN into the Transformer for further processing, maximizing the use of the Transformer's capacity to handle large data with excellent results. Thirdly, the structure of the Transformer input differs because of the different network framework and band choices. The inputs of the Transformer in our framework are: 0, the position encoding; 1–4, the four common bands; and 5–8, the VNIR bands. The GSE spectral feature module is then added to the initial input, and only half of each of every two sequences is selected for fusion before the encoder, to maximize the mutual information between bands. In contrast, some scholars proposed a framework in which Transformer inputs 5–6 separately represent the features extracted by two CNN layers, and input 7 represents the features extracted by the CNN module using band fusion as the input. Lastly, regarding the classification module, we merge the generated features and use fully connected layers to reduce the dimensionality, then obtain the classification results using the activation function and a conv1×1 layer, whereas most researchers use MLP heads for classification in vision transformers.
In this paper, we propose MARC-Net, which uses a parallel network structure integrating a multi-head attention mechanism and a multiscale residual cascade to reduce processing time while maintaining the superiority of the algorithm. Firstly, the visible near-infrared (VNIR) bands are added to make full use of the band information [42]; secondly, the GSE module [27] is invoked to generate the groupwise spectral embedding. The first branch uses a multi-head attention mechanism to process the spectral sequence information. The second branch is a multiscale residual cascaded CNN, which converts the one-dimensional pixel sequence into three dimensions through a fully connected layer and then performs multiscale feature extraction. Lastly, the features obtained by the multi-head attention mechanism are fused with those obtained by the CNN, and the pixel classification results are obtained by a conv1×1 layer. The main contributions of this paper are outlined below.
On the one hand, the features generated by groupwise embedding are added to deepen the information among spectral features; on the other hand, the fusion of feature information at different scales is fully utilized to improve the network's ability to characterize information at different levels.
(1) We propose a parallel network architecture combining a multi-head attention mechanism and a multiscale residual cascade that effectively mines global association information and feature information at different scales, allowing richer expression of semantic features while fully incorporating spatial location associations, thus achieving better pixel-level image classification with an overall classification accuracy of 97.22%.
(2) Adding the VNIR bands to the commonly used RGB + NIR quad-bands to form an eight-band input allows deeper a priori mining of potentially useful information for the classification of specific land cover in precise study areas.
(3) The features newly generated by grouping spectral data sequences are used directly as input to the Transformer, and the features obtained by the Transformer encoder are combined in parallel with those obtained by the CNN, which simplifies the algorithm structure and shortens processing time while achieving the desired results.
(4) Using the GSE spectral feature module to increase the correlation between different bands, and selecting only half of every two sequences in the Transformer branch for fusion, yields better analysis results.
(5) The dynamic changes in land use in the study area were analyzed, focusing on the changes in the distribution areas of double-cropping rice (Double crop) and forest land during 2017–2021. Changes in land use and land cover visually reflect the interaction between local economic development and biodiversity conservation, and provide strong scientific data support for the construction of high-standard farmland, land remediation projects, and the decision making of relevant departments.

The remainder of this paper is structured as follows: Section 2 describes the study data and the algorithm structure; Section 3 presents the results; Section 4 discusses potential research directions and prospects; Section 5 concludes the study.

2. Materials and Methods

In this section, we first briefly describe the details of the study area and the data processing operations. Then, the optimization of the network structure and framework is described in detail.

2.1. Study Area Overview

Huarong County belongs to Yueyang City, Hunan Province, China (29°10′18″–29°48′27″ N, 112°18′31″–113°1′32″ E), located on the northern border of Hunan Province and the western border of Yueyang City, with obvious geomorphological zoning. A low mountainous hilly area lies to the northeast, interspersed with valley plains; hilly areas also make up the south-central region, while the rest of Huarong is characterized by lake-dotted plains with extremely well-developed water systems. The abundant water resources provide favorable conditions for agriculture and aquaculture in the study area, with crayfish being the main aquatic species. Ponds are a prominent feature of the study area (shown in Figure 1), and the precise mapping of pond area, together with the analysis of area changes in recent years, plays a significant role in promoting the development of local aquaculture.
Anxiang County is part of Changde City, Hunan Province, in the northwestern part of Dongting Lake, featuring similar types of landforms to Huarong County, as well as a rich aquatic and crop environment.

2.2. Study Area Image Preprocessing

Field surface-cover sample data are an important source of remote sensing image samples for training and accuracy verification, and their quality directly determines classification quality. To obtain the feature classes of the study region, we visited the site on 6 June 2021 and sampled field data to obtain an accurate picture of the actual feature classes. Multispectral images covering the study area were acquired using the Sentinel-2 toolbox at 10 m resolution; the satellite carries a multispectral instrument (MSI) with a total of 13 spectral bands, making it a suitable data source. Band 4, Band 5, Band 6, Band 7, Band 8, and Band 8A were selected for band synthesis, and regions of interest (ROIs) were then marked according to a priori knowledge and field data collection to build the sample database of Huarong County, with a total of 19,059 samples in 12 categories (80% for training and 20% for testing). In addition, data for Anxiang County, Changde, were obtained, with 15,920 samples in 12 categories (80% for training and 20% for testing).
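A minimal sketch of this sampling protocol might look as follows, assuming the selected bands have already been stacked into an (H, W, B) array and the ROI labels rasterized; the file names and array layouts are hypothetical, not the authors' released data.

```python
# Hedged sketch: stack the selected Sentinel-2 bands into per-pixel spectra,
# extract the ROI-labeled pixels, and split them 80/20 for training/testing.
import numpy as np
from sklearn.model_selection import train_test_split

bands = np.load("huarong_bands.npy")    # hypothetical (H, W, B) band stack
roi_mask = np.load("huarong_roi.npy")   # hypothetical (H, W) labels, 0 = unlabeled

rows, cols = np.nonzero(roi_mask)       # keep only labeled pixels
X = bands[rows, cols, :].astype(np.float32)
y = roi_mask[rows, cols] - 1            # class IDs 0..11

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(f"{len(y)} samples -> {len(y_train)} train / {len(y_test)} test")
```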

2.3. The Architecture Proposed in this Paper MARC-Net

The proposed parallel network architecture, MARC-Net, which fuses the multi-head attention mechanism with a multiscale residual cascade, is shown in Figure 2. The structure mainly consists of a CNN, the GSE module, an activation function, a conv1×1 layer, and a Transformer. Firstly, the GSE module is used to generate the grouped spectral embedding. The grouped spectral embeddings are flattened into 1D spectral sequences before being fed into a fully connected layer; according to the input requirements of the CNN, the linear transformation of the fully connected layer converts the 1D sequences into 3D spectral feature matrices. The features newly generated by grouped spectral embedding are used directly as the Transformer input, and the features obtained by the Transformer encoder are merged with those obtained by the CNN. The merged features are reduced in dimensionality by the fully connected layer; finally, the classification results are obtained using the activation function and the conv1×1 layer, realizing pixel-level multispectral image classification.
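The following PyTorch sketch illustrates the parallel structure just described. It is a structural illustration under stated assumptions (group size 2, five encoder blocks, 12 classes), not the authors' released implementation, and the class and variable names are our own.

```python
# Hedged structural sketch of the parallel design: a transformer branch and a
# CNN branch process the spectral input in parallel, and their features are
# fused before a fully connected reduction and a 1x1-convolution classifier.
import torch
import torch.nn as nn

class MARCNetSketch(nn.Module):
    def __init__(self, n_bands=8, dim=64, n_classes=12):
        super().__init__()
        self.gse = nn.Linear(2, dim)                   # groups of 2 neighboring bands
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=5)
        self.fc_in = nn.Linear(n_bands, 256)           # 1D sequence -> reshaped 3D
        self.cnn = nn.Sequential(                      # stand-in for the cascade
            nn.Conv2d(16, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.reduce = nn.Linear(dim + 64, dim)         # FC dimensionality reduction
        self.classifier = nn.Conv2d(dim, n_classes, kernel_size=1)

    def forward(self, x):                              # x: (B, n_bands) pixel spectra
        tokens = self.gse(x.view(x.size(0), -1, 2))    # (B, 4 groups, dim)
        t_feat = self.encoder(tokens).mean(dim=1)      # (B, dim)
        c = self.fc_in(x).view(x.size(0), 16, 4, 4)    # reshape to (16, 4, 4)
        c_feat = self.cnn(c).flatten(1)                # (B, 64)
        fused = self.reduce(torch.cat([t_feat, c_feat], dim=1))
        return self.classifier(fused[:, :, None, None]).squeeze(-1).squeeze(-1)

logits = MARCNetSketch()(torch.randn(32, 8))           # (32, 12) class scores
```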

2.3.1. Groupwise Spectral Embedding

To explore the impact of using different bands, the GSE module was introduced from SpectralFormer. Hong et al. [27] concluded that spectral information at different locations reflects absorption properties at different wavelengths, and capturing the local detail changes of these spectral features is key to classification. Although multispectral images have relatively few bands compared to hyperspectral images, enhancing the correlation between bands is still necessary. For the input 1D pixel sequence $x = [x_1, x_2, x_3, \ldots, x_m] \in \mathbb{R}^{1 \times m}$, the Transformer input $A$ is calculated from Equation (1):

$$A = wx \qquad (1)$$

where $w \in \mathbb{R}^{d \times 1}$ denotes a linear transformation applied identically to all bands of the spectral sequence, and $A \in \mathbb{R}^{d \times m}$ collects the output features. GSE is given by Equation (2):

$$\dot{A} = WX \qquad (2)$$

where $W \in \mathbb{R}^{d \times n}$ denotes a linear transformation, $X \in \mathbb{R}^{n \times m}$ represents the spectral features, and $n$ is the number of adjacent bands. Without GSE, the 1D pixel sequence is split into eight 1 × 1 sequences; with GSE, eight $d \times 1$ sequences are generated in the CNN branch depending on the number of neighboring bands. The best value, $d = 2$, is chosen according to the experimental accuracy and the accuracy of the prediction map. A further improvement is made by using only half of every two sequences in the second (Transformer) branch for fusion, which achieves a better analysis, as shown in Figure 3.
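A minimal sketch of the groupwise embedding of Equation (2) is shown below; the right-padding used to keep one token per band is our assumption, and `groupwise_spectral_embedding` is an illustrative name.

```python
# Hedged sketch of groupwise spectral embedding (Eq. (2)): each band is
# embedded together with its n - 1 neighbors through a shared linear map,
# in contrast to the bandwise projection of Eq. (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

def groupwise_spectral_embedding(x, weight, n=2):
    """x: (B, m) spectra; weight: nn.Linear(n, d); returns (B, m, d) tokens."""
    pad = n - 1
    xp = F.pad(x, (0, pad))                          # pad right so every band has n neighbors
    groups = xp.unfold(dimension=1, size=n, step=1)  # (B, m, n) neighboring-band groups
    return weight(groups)                            # shared W in R^{d x n}

x = torch.randn(4, 8)                                # 4 pixels, 8 bands
tokens = groupwise_spectral_embedding(x, nn.Linear(2, 64), n=2)
print(tokens.shape)                                  # torch.Size([4, 8, 64])
```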

2.3.2. Multi-Scale Residual Cascaded Convolutional Networks

In image processing, 2D CNNs have a very wide range of uses; however, for unprocessed 1D sequences, only a 1D CNN can be used for classification. To effectively apply a 2D CNN, it is common practice to reshape 1D sequences into multidimensional feature matrices or to take patches around central pixels as samples. In this paper, we use a fully connected (FC) layer to transform the dimensions and reshape the sequence into the required input:

$$y_i = w x_i + b \qquad (3)$$

where $w$ and $b$ represent the weight matrix and bias vector of the fully connected layer, respectively, $x_i$ denotes the input 1D pixel sequence, and $y_i \in \mathbb{R}^{1 \times 1 \times m}$, where $m$ is the output dimension of the fully connected layer. $y_i$ is reshaped into $y_i^{\prime} \in \mathbb{R}^{n \times n \times m/n^2}$, with $m = 256$ and $n = 8$. The structure of the multiscale residual cascaded convolutional network is shown in Figure 4. The first convolutional layer of BaseNet, comprising 64 convolutional kernels of size 1 × 1, raises the dimensionality of the input samples, while the second convolutional layer comprises 64 convolutional kernels of size 3 × 3. The first convolutional layer is connected to the next layer by a residual connection to form a cascaded convolutional network, and a 2 × 2 average pooling layer is inserted after the residual connection to compress the feature vectors output by the convolutional layers, minimizing feature loss and redundancy. Pooling the shallow feature output and combining it with deeper features forms a DenseNet-like structure, which reduces the redundancy of feature information, mitigates vanishing gradients, and achieves multiscale feature fusion. The average-pooled features are compressed into vector form and fused with the Transformer features. $y_i^{\prime} \in \mathbb{R}^{n \times n \times m/n^2}$ is the input to BaseNet, with $n = 8$ and $m = 64$; the BaseNet output is 1 × 1 × 64.
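The following is a hedged PyTorch sketch of this branch, following the (4, 4, 16) reshape and the 64-kernel convolutions reported in Section 3; the activation choices and the exact skip pattern are assumptions.

```python
# Hedged sketch of the multiscale residual cascaded branch: an FC layer
# reshapes the 1D spectrum into a 3D feature map, a 1x1 and a 3x3 convolution
# are cascaded with a residual connection, and 2x2 average pooling compresses
# the result into vector form for fusion with the Transformer features.
import torch
import torch.nn as nn

class ResidualCascade(nn.Module):
    def __init__(self, n_bands=8, fc_dim=256, n=4, channels=64):
        super().__init__()
        self.n, self.c_in = n, fc_dim // (n * n)                   # reshape target
        self.fc = nn.Linear(n_bands, fc_dim)                       # Eq. (3): y = wx + b
        self.conv1 = nn.Conv2d(self.c_in, channels, 1)             # 64 1x1 kernels
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # 64 3x3 kernels
        self.pool = nn.AvgPool2d(2)
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (B, n_bands)
        f = self.fc(x).view(-1, self.c_in, self.n, self.n)
        h1 = self.relu(self.conv1(f))
        h2 = self.relu(self.conv2(h1))
        fused = self.pool(h1 + h2)                       # residual skip, then 2x2 AVG pool
        return fused.flatten(1)                          # vector form for feature fusion

out = ResidualCascade()(torch.randn(32, 8))
print(out.shape)                                         # torch.Size([32, 256])
```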

2.3.3. Multi-Head Attention

Transformers have also recently shown good performance in image classification and semantic segmentation [29,43]. The core of the Transformer is its attention mechanism and positional encoding. Owing to the self-attention mechanism and positional encoding, the Transformer is superior to RNNs in sequence-to-sequence conversion, a major advantage when working with sequence data. The Transformer structure not only captures the interactions between different spatial positions for learning but also solves the long-term dependency problem between input and output. Moreover, its capacity for parallel computing greatly reduces the consumption of computational resources. In this paper, the Transformer encoder (shown in Figure 3) was used, with each encoder layer consisting of a multi-head attention module (norm, multi-head attention, and dropout) and a forward propagation layer (norm and feed-forward). The Transformer uses fixed positional encoding to represent the absolute position information of each token. The core multi-head attention structure parallelizes attention: rather than computing attention once, it computes attention many times in parallel using scaled dot-product attention, which fully considers the interrelationships among all spatial locations and can simultaneously attend to information in multiple subspaces, as shown in Figure 5.
First, the sequence data $x = [x_1, x_2, \ldots, x_m]$, $x_i \in \mathbb{R}^{1 \times 1}$, $i = 1, 2, \ldots, m$, are entered, where $m$ is the number of bands. The input embedding maps each element to a new $x_i \in \mathbb{R}^{1 \times dim}$, which is then passed through three trainable transformation matrices to obtain the corresponding query ($Q = [q_0, q_1, \ldots, q_m]$, $q_i \in \mathbb{R}^{1 \times dim}$), key ($K = [k_0, k_1, \ldots, k_m]$, $k_i \in \mathbb{R}^{1 \times dim}$), and value ($V = [v_0, v_1, \ldots, v_m]$, $v_i \in \mathbb{R}^{1 \times dim}$). Next, $Q$ is matched with each $K$ via a dot product; because large inner-product values make the gradients after Softmax small, the dot products are divided by $\sqrt{d_k}$ to compute the correlation, where $d_k$ is the length of the vector $K$. Softmax is applied to each row separately to obtain the weights for each $V$; Softmax can be considered a special case of the nonlinear function defined by the exponential function and the Gaussian projection. The weighted sum is then computed to obtain the final attention value. The self-attention mechanism can be expressed as Equation (4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (4)$$

In practice, the multi-head attention module is used similarly to self-attention: the same steps obtain $Q$, $K$, and $V$ through the weight matrices, after which $Q$, $K$, and $V$ are evenly divided according to the number of heads $h$. The results of the heads are then concatenated, and the fusion of the multi-head attention mechanism can be expressed as Equations (5) and (6):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \qquad (5)$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}) \qquad (6)$$
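Equations (4)–(6) can be realized as follows; this is the generic multi-head attention formulation, not the paper's exact code, and all names are illustrative.

```python
# Hedged sketch of Eqs. (4)-(6): scaled dot-product attention computed h
# times in parallel on evenly split Q, K, V, then concatenated and projected.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v            # row-wise softmax weights V

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.wo = nn.Linear(dim, dim)                    # output projection W^O

    def forward(self, x):                                # x: (B, tokens, dim)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.h, self.d).transpose(1, 2)
        heads = scaled_dot_product_attention(split(self.wq(x)),
                                             split(self.wk(x)),
                                             split(self.wv(x)))
        concat = heads.transpose(1, 2).reshape(B, T, -1)  # Concat(head_1..head_h)
        return self.wo(concat)

y = MultiHeadAttention()(torch.randn(2, 8, 64))
print(y.shape)                                            # torch.Size([2, 8, 64])
```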

2.3.4. Evaluation Metrics and Loss Functions

The following pixel-level metrics are commonly used to evaluate pixel classification in remote sensing imagery and are used here to assess the ability of the proposed model to classify the data accurately: $OA$, $AA$, and $Kappa$.
(1) $OA$: the number of correct predictions as a percentage of all predictions:

$$OA = \frac{T_0 + T_1}{T_0 + T_1 + F_1 + F_2}$$

where $T_0$ denotes true positives, $F_1$ false positives, $T_1$ true negatives, and $F_2$ false negatives.
(2) $AA$: the proportion of correctly predicted positives among all samples predicted as positive:

$$AA = \frac{T_0}{T_0 + F_1}$$
(3) $Kappa$ coefficient: a measure of classification accuracy:

$$Kappa = \frac{T_A - T_B}{1 - T_B}$$

where $T_A$ is the overall accuracy and $T_B$ is the chance agreement. Even for two completely independent variables, the agreement will not be 0; a chance component still makes the two variables agree, so this chance agreement must be factored out.
The output of a multiclass task is the probability of the target belonging to each category, and the probabilities of all categories sum to 1; the category with the highest probability is selected. Therefore, for multi-target classification, the extracted features are passed through the softmax function to output per-class probabilities, and cross-entropy is chosen as the loss function. The cross-entropy loss compares the predicted class of each pixel with the target class:

$$L = -\frac{1}{N}\sum_{l}\sum_{x=1}^{M} Y_{lx}\,\log P_{lx}$$

where $M$ is the total number of classes; $Y_{lx}$ is an indicator that equals 1 if the true category of sample $l$ is $x$ and 0 otherwise; and $P_{lx}$ is the predicted probability that sample $l$ belongs to category $x$.
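A small sketch of these metrics computed from a confusion matrix is given below; AA is implemented as the mean per-class accuracy, which is the conventional multiclass generalization of the binary formula above.

```python
# Hedged sketch of the evaluation metrics: OA, AA, and Kappa computed from a
# confusion matrix built over true and predicted class labels.
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                          # correct / all predictions
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))            # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / cm.sum()**2  # chance agreement T_B
    kappa = (oa - pe) / (1 - pe)                          # (T_A - T_B) / (1 - T_B)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([0, 1, 1, 2], [0, 1, 2, 2], n_classes=3)
print(f"OA={oa:.3f}  AA={aa:.3f}  Kappa={kappa:.3f}")
```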

2.3.5. Experimental Environment

In the experiments, ENVI software was used to mark the sample regions of interest according to prior knowledge and field sampling, and the data were normalized. For training, the Adam optimizer was chosen, the batch size was set to 32, the maximum number of iterations was 300, and the initial learning rate was 0.0005. All code was written in Python 3.9 with PyTorch 1.12.1, and the model was trained on Windows 11 with a 12th Gen Intel(R) Core(TM) i5-12400F CPU and an NVIDIA GeForce RTX 3060 Laptop GPU.
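The reported configuration corresponds to a training loop along the following lines; the placeholder model and data stand in for MARC-Net and the sample database, and everything except the stated hyperparameters is an assumption.

```python
# Hedged sketch of the training configuration reported above (Adam optimizer,
# batch size 32, 300 epochs, initial learning rate 0.0005).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 12))   # placeholder
X_train, y_train = torch.randn(1024, 8), torch.randint(0, 12, (1024,))  # placeholder

loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss()                   # softmax + cross-entropy loss

for epoch in range(300):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)             # compare predictions to targets
        loss.backward()
        optimizer.step()
```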

3. Results

The data for the research area are shown in Table 1, of which 80% were used for training and 20% were used for testing. Experiments were conducted in 2021 in the study areas of Huarong and Anxiang using the data in Table 1.
To highlight the performance improvement achieved by this model, this paper compares MARC-Net with SVM [10], KNN [13], CNN [26], RNN [25], ViT [30], and CAF [28]. The following parameters were set for MARC-Net: two adjacent bands, with five encoder blocks in the ViT module. The multiscale residual cascaded convolutional network module contained a fully connected layer with an input dimension of 8 and an output dimension of 256, reshaped into (4, 4, 16) features; the residual connections contained 64 1 × 1 convolutional kernels in the first convolution layer and 64 3 × 3 convolutional kernels in the second, followed by a 2 × 2 average pooling layer and a skip connection from the first-layer features to the second layer. The DenseNet cascaded convolutional network was connected to a separate fully connected layer with an input dimension of 8 and an output dimension of 64, in parallel with ViT, to achieve feature fusion.

3.1. Ablation Study

To assess the properties of the proposed algorithm, ablation experiments were conducted; the results are displayed in Table 2. The OA of ViT was 95.08%, already a very good classification result, showing that ViT is well suited to remote sensing image classification. The OA of FC + CNN improved to 95.23%, indicating the feasibility of the FC + CNN module, which reshapes the pixel spectral sequence to the required input feature size through the fully connected layer to achieve accurate image classification. The OA of ViT with the GSE module inserted improved by 1.19% over the original ViT, and AA and Kappa also improved slightly, indicating that GSE alone can improve ViT's classification of some categories of multispectral data. Fusing the 1D spectral sequences, passed through the fully connected layer, with ViT's features achieved an AA of 96.37%, a 1.29% improvement over ViT, while Kappa improved by 1.52%, verifying that the parallel network (FC + ViT) is superior to ViT for the multispectral pixel classification task. When ViT and CNN were processed in parallel before fusing features, the OA improved to 96.84%, and adding the GSE module to the fully connected layer further improved the OA over FC + CNN. With all modules combined, the accuracy of MARC-Net reached 97.22%, a 2.14% improvement in OA over the original ViT, with good improvements in AA and Kappa as well.
The classification results of the different modules are shown in Figure 6. In Figure 6a, the enlarged box contains water, rapeseed, wetland, etc. Comparing Figure 6b,c, we can see that FC + CNN classified water better than ViT. In Figure 6b,d,e,g, where the network structure was not connected in parallel with FC + CNN, the classification of pond was not as good as in Figure 6f,h. The complete MARC-Net was more discriminative than any combination of its modules. For the most difficult classification task in this region, i.e., the mixing of pond and water, the complete MARC-Net had the best performance and a good denoising effect.
To explore the effect of the number of neighboring spectral bands on MARC-Net and ViT, experiments were conducted with 1–4 neighboring bands; the results are shown in Table 3. For ViT, AA, OA, and Kappa were highest with three neighboring bands; for MARC-Net, the overall results were best with two. These results show that the ideal number of neighboring spectral bands depends on the model; therefore, we chose two neighboring bands in subsequent experiments.
To evaluate the influence of different sample proportions on classification accuracy, we selected 10%–80% of the Huarong County samples for training tests (since the previous experimental results were obtained at an 80% sample proportion, that test is not repeated here). The results are listed in Table 4. Increasing the number of training samples did not necessarily improve AA, and noise was also randomly introduced. Because of the randomness of sample selection, there was a certain risk of obvious misclassification, such as misclassifying water as pond. Therefore, to obtain better experimental accuracy, we chose an 8:2 sample ratio for the experiments.
The results of training samples using different scales are shown in Figure 7.
Since Sentinel-2 provides a total of 13 spectral bands but other satellites do not necessarily carry as many, researchers typically choose the four universal bands, RGB + NIR, as the pixel sequence information for experimental studies to increase the generality of the dataset. To determine whether it is worthwhile to choose only these four bands from the fast-processing and widely applicable Sentinel-2, discarding the a priori potentially useful information it records, and whether using more Sentinel-2 bands helps to classify specific land cover in a precise study area, we added the VNIR bands. The bands used for the experiments and the test results are listed in Table 5. When all VNIR bands were added, a significant improvement was obtained, with very good results for all classes except wetland.
The results of the two algorithms using different bands are shown in Figure 8.
To evaluate whether the parallel multiscale residual cascade reduces processing time while maintaining algorithmic superiority, we compared it against ViT and CNN connected in series. Table 6 shows the time efficiency, as well as the overall OA, AA, and Kappa results, for both network structures in the Huarong County study area.
To evaluate the impact of adding NDVI alone, NDWI alone, or both NDVI and NDWI to our proposed algorithm, we conducted further experiments; Table 7 compares the results with different vegetation indices. The experimental data show that accuracy decreases when these indices are added as features, probably because NDVI and NDWI can discriminate only specific ground-cover types. In some scenarios they may not be optimal features, and other features need to be considered to improve model accuracy; the size, structure, and distribution of the dataset, as well as the correlations between features, must also be considered.
The algorithm results using different vegetation indices are shown in Figure 9. The figure shows that background pixels are misclassified as lotus, and the classification in Figure 9a, with only the NDVI band added, is better than with NDWI or with both vegetation indices added, which again verifies that NDVI and NDWI may not be the best features for all scenes.

3.2. Multi-Method Comparison

Table 8 and Table 9 show the per-category accuracies and the overall OA, AA, and Kappa results for the different methods in the study areas of Huarong and Anxiang counties. CNN was the least effective method, with the lowest results for all three evaluation metrics; the accuracies for building and pond were only 64.49% and 64.45%, and that for single did not reach 60%. This may be because multispectral images have few bands (only four here), making it difficult for a CNN to extract features well, whereas on hyperspectral images with around 200 bands of information CNN outperforms SVM and KNN. The OA of the traditional classifiers SVM and KNN was 89.09% and 92.41%, respectively. RNN, ViT, CAF, and MARC-Net are all deep-learning-based spectral sequence classification methods, showing the advantages of deep learning in sequence data processing. The classification performance of ViT was good, and CAF had the best classification ability for vegetables. The evaluation indices of MARC-Net were higher than those of the other algorithmic models, performing better in various categories such as building, greenhouse, and lotus.
The classification maps of Huarong County obtained using the various methods are shown in Figure 10. The red box in Figure 11 contains the six main categories of building, tree, lotus, rapeseed, water, and single (single cropping of rice). SVM treated water entirely as pond, producing confusion. CNN had poor classification results for all regions of the study area and did not easily distinguish between categories with small inter-class differences. MARC-Net could better distinguish between pond and water despite their small inter-class differences and did not misclassify greenhouse, providing clearer results than the other algorithms.
As shown in Figure 11, the classification results of SVM and CNN were very poor. MARC-Net achieved relatively clearer edge extraction of pond and water, with fewer misclassifications. In general, RNN, CAF, and ViT performed well in the classification of multispectral images. The proposed MARC-Net, despite a more complicated model structure than ViT, showed clear improvements in OA, AA, and Kappa, and its prediction maps were clearer than those of the other algorithms, which is of great value.

3.3. Analysis of Land Use Change

In this work, Sentinel-2 images from 17 July 2017 and 4 November 2019 were downloaded from the USGS website for feature classification of the study area. The image data were preprocessed, and the ROIs were then retagged in ENVI using the a priori knowledge accumulated from the 2021 field collection, exporting a txt file containing the ROI coordinates. The training and test sets were divided according to the coordinates in the txt file at a ratio of 8:2. The sample data for 2017 and 2019 are shown in Table 10. To analyze each model more visually, several models from Table 4 were also used to compare and analyze the three images. The MARC-Net proposed in this paper was then used to analyze land-use changes in the Huarong County area.
The classification results of the different algorithms for 2017 and 2019 in Huarong County are shown in Table 11 and Table 12. Across the three years, the OA of MARC-Net was 93.41%, 95.92%, and 97.22%, respectively, higher than the other algorithmic models, with improvements of 3.88%, 1.46%, and 2.14% over the original ViT in 2017, 2019, and 2021. The classification accuracy of the MARC-Net algorithm was very good, successfully meeting the classification needs of field recognition research.
The prediction results of MARC-Net were the best, as shown in the generated maps for 2017, 2019, and 2021; therefore, MARC-Net was used to analyze the dynamics of feature changes in the study area in recent years. The feature identification of the study area in 2017, 2019, and 2021 is shown in Figure 12. Double crop was mainly distributed in the northeast and southwest, at various locations suitable for planting, whereas pond was more concentrated in the north-central region. Trees were mainly distributed in the northeastern part of the study area, surrounded by buildings. Greenhouses were scattered across different areas because much of the study area is covered by lake water. According to the analysis of the experimental results, land use in Huarong County in recent years has been very reasonable.
The land use and the rate of land-use change in Huarong County in recent years are shown in Table 13. The cover type with the largest change across the three images was Double crop, followed by rapeseed and single; the increase in ponds and greenhouses reflects a relative increase in aquaculture and crops. The area of buildings also increased from 2017 to 2021, possibly due to the rapid development of China, whereby even small cities have started to erect tall buildings. The increase in buildings also affected the number of trees in the study area to some extent; however, due to the afforestation policy, tree cover rebounded in the last two years. Crayfish farming grew consistently, as crayfish is valued by the people of Hunan as a local specialty food, helped by the area's natural geographical advantages. Rapeseed (canola) and sunchokes also grew rapidly over these four years; the data show that the study area is in line with the sustainable development strategy.
Figure 13 shows the spatial distribution of forest land and Double crop. The proportion of forest land in the region is relatively high, but it has declined in recent years, which may be related to the sharp rise in housing construction and in planting and aquaculture, or may indicate that forestry planting needs to be further strengthened to fulfill the national policy of returning farmland to forest and protecting nature. Double crop is mostly distributed near rivers. The emergence of Double crop has played a very important role in making full use of natural and labor resources to increase food production; since single-season late rice is difficult to grow owing to its strict short-day sunshine requirements, the local government has expanded Double crop planting alongside agricultural products and aquaculture.
The dynamics of Double crop and forest land from 2017 to 2021 are shown in Figure 14. The arable area of Double crop was 49.9226 km² in 2017, 117.6979 km² in 2019, and 199.4702 km² in 2021, corresponding to increases of 135.7608%, 69.4764%, and 299.5591% in 2017–2019, 2019–2021, and 2017–2021, respectively. The overall increase from 2017 to 2021 was huge, as the advantages of harvesting Double crop twice a year slowly manifested, allowing farmers to gain more revenue, while the climate of the Hunan region is also well suited to its cultivation, leading to increased farming. The forest land area was 125.1143 km² in 2017, 105.6089 km² in 2019, and 76.3699 km² in 2021, corresponding to decreases of 15.59%, 27.69%, and 38.96% in 2017–2019, 2019–2021, and 2017–2021, respectively.
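The change rates in this section follow the standard relative-change formula, as the short check below illustrates for the 2017–2019 Double crop figures.

```python
# Worked example of the change-rate computation used throughout this section,
# rate = (A_end - A_start) / A_start * 100, applied to the Double crop areas
# reported above for 2017 and 2019.
a_2017, a_2019 = 49.9226, 117.6979           # km^2
rate = (a_2019 - a_2017) / a_2017 * 100.0
print(f"2017-2019 change: {rate:+.4f}%")     # +135.7608%, as reported
```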

4. Discussion

Sampled bands should be drawn from a large spectral range so as to make full use of spectral information. The visible near-infrared region satisfies the requirements of providing a sufficient number of spectral bands, with bands as adjacent to each other as possible, and the VNIR bands are very narrow with good resolution; in contrast to the traditional four bands of RGB + NIR, this choice helped to classify the specific land cover of the study area.
In the experiments, a total of seven methods were compared (SVM and KNN, and the DL algorithms CNN, RNN, ViT, CAF, and MARC-Net), showing the advantages of each. The traditional methods were faster to train but less sensitive to sequence information than the RNN and ViT models, which are good at handling sequences. Because the multispectral images taken by the Sentinel-2 satellite are not hyperspectral, their lower information content may explain the poor classification accuracy of the 1D CNN, which did not work as well as it should have. RNNs are good at processing sequential data, unlike CNNs; hence, they perform better than CNNs in multispectral image classification, but they find it difficult to learn the long-term dependence between input and output, as demonstrated by the Transformer. CAF showed good performance on all datasets, outperforming all other models in the classification of multispectral images except MARC-Net. Interestingly, 2D CNNs also seemed to fail to capture spatial information well, probably because of the low dimensionality of the spatial kernel.
Using hyperspectral images instead of multispectral images may yield unexpected results. Although parallel processing can reduce processing time, improving the self-attention mechanism or simplifying the complex algorithmic network could further accelerate data processing. Using recurrence plots, Gramian angular fields (GAFs), the short-time Fourier transform (STFT), Markov transition fields, etc. to convert one-dimensional sequences into two dimensions for processing may also work better.
Graph neural network approaches could be used in later work, representing a sophisticated method of extracting features from graph data; these features could be used for node classification, graph classification, and edge prediction, and to obtain graph embedding representations, which are indeed very versatile. One example is a scalable graph neural network framework capable of learning adaptive receptive paths. Its adaptive path layer has two complementary units: one learns the weights of first-order neighborhood nodes for breadth exploration, and the other extracts and filters the information aggregated in higher-order neighborhoods for depth exploration. It has achieved unexpected results in experiments on both transductive and inductive learning tasks, making it a worthwhile approach for future research.
The proposed MARC-Net, incorporating GSE, a multiscale residual cascaded convolutional network, and the Transformer, performed best on all four datasets. This is because the features extracted by the Transformer and the CNN could be fused by the parallel network without losing feature information, yielding good classification results. A land-use dynamic change analysis was then performed to obtain a clearer picture of land-cover changes in the study area in recent years. The reduction in the area of forest land from 2017 to 2021, especially from 2017 to 2019, was predictable for Hunan, a land of fish and rice, where the local government allocated more land to construction and aquaculture to accelerate economic income. After the decline in the first two years, tree cover rebounded in 2019–2021, probably because the national government aims to achieve a reasonable, balanced distribution of resources as part of common development, with the local government correspondingly implementing the national policy of returning farmland to forest.

5. Conclusions

In the method proposed in this study, the parallel fusion of ViT with a CNN allows the feature information obtained by both methods to be better utilized for accurate image classification, largely compensating for the multispectral band information we discarded. We proposed a parallel network architecture fusing a multi-head attention mechanism with a multiscale residual cascade for land-use classification. The parallel architecture first introduces the GSE module to generate grouped spectral embeddings, increasing the connections linking local information. A multiscale residual cascaded CNN is designed to fully utilize the fused feature information at different scales to achieve pixel-level classification of remote sensing images. Lastly, the feature information obtained from both branches is fused, the merged features are reduced in dimensionality through fully connected layers, and the classification results are obtained using the activation function and a conv1×1 layer to achieve pixel-level classification. In the future, we will further process the data input to improve the informational diversity of its features. Another aspect worth investigating is the design of new, enhanced network architectures that can be used in any scenario to improve performance while reducing computational complexity, thus increasing efficiency and applicability. As for forest land, although tree cover has rebounded in the last two years, the overall trend in the study area since 2017 is still downward, and the situation requires further attention.

Author Contributions

Conceptualization, X.F. and X.L.; methodology, X.L. and X.F.; software, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; validation, X.L., X.F. and J.F.; formal analysis, X.L., X.F. and C.Y.; investigation, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; resources, J.F. and X.F.; data curation, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; writing—original draft preparation, X.L.; writing—review and editing, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grants 62261004 and 62001129).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code will be available at https://github.com/lixuyaaaaa/MARC-Net accessed on 27 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MARC-Net   multiple attention mechanisms and multi-scale residual cascades network
RS         Remote Sensing
GAF        Gramian Angular Field
Single     single cropping of rice
STFT       Short-Time Fourier Transform
HSI        Hue-Saturation-Intensity
GSE        Groupwise Spectral Embedding
CNN        Convolutional Neural Network
RSI        Remotely Sensed Imagery
RNN        Recurrent Neural Network
CAF        SpectralFormer
FC         Fully Connected layer
ML         Machine Learning
DL         Deep Learning
ViT        Vision Transformer
SVM        Support Vector Machine
RF         Random Forest
CN         Class Number
KNN        K-Nearest Neighbor
VNIR       Visible Near-Infrared
NLP        Natural Language Processing
Double     double cropping of rice
MSI        MultiSpectral Instrument
ROI        Region Of Interest
RBF        Radial Basis Function

References

  1. Sawant, S.; Mohite, J.; Sakkan, M.; Pappula, S. Near real time crop loss estimation using remote sensing observations. In Proceedings of the 2019 8th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Istanbul, Turkey, 16–19 July 2019; pp. 1–5.
  2. Figa, J.; Stoffelen, A. On the assimilation of Ku-band scatterometer winds for weather analysis and forecasting. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1893–1902.
  3. Xu, J.; Wang, X.; Zhu, X.; Cui, C.; Liu, P.; Li, B. Research on marine radar oil spill network monitoring technology. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 1868–1871.
  4. Haihui, H.; Yilin, W.; Zhuan, Z.; Guangli, R.; Min, Y. Extraction of Altered Mineral from Remote Sensing Data in Gold Exploration Based on the Nonlinear Analysis Technology. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 19–20 August 2018; pp. 1–8.
  5. Jiang, J.; Zhang, Q.; Yao, X.; Tian, Y.; Zhu, Y.; Cao, W.; Cheng, T. HISTIF: A new spatiotemporal image fusion method for high-resolution monitoring of crops at the subfield level. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4607–4626.
  6. Calvin, W.M.; Pace, E.L.; Davies, G.E.; Pearson, N.C. HyspIRI for energy and mineral resource exploration, applications, and impacts. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 80–82.
  7. Labbassi, K.; Tajdi, A.; Er-Raji, A. Remote sensing and geological mapping for a groundwater recharge model in the arid area of Sebt Rbrykine: Doukkala, western Morocco. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; Volume 1, pp. 1–112.
  8. Luo, H.; Ye, B.; Zhang, Y. Study of Urban Landuse Evaluation of the comprehensive planning—Nanjing city as a case. In Proceedings of the 2011 International Conference on Remote Sensing, Environment and Transportation Engineering, Nanjing, China, 24–26 June 2011; pp. 4456–4459.
  9. Liang, S.; Li, X.; Wang, J. Atmospheric correction of optical imagery. Adv. Remote Sens. 2012, 117.
  10. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790.
  11. Ayerdi, B.; Romay, M.G. Hyperspectral image analysis by spectral–spatial processing and anticipative hybrid extreme rotation forest classification. IEEE Trans. Geosci. Remote Sens. 2015, 54, 2627–2639.
  12. Lin, T.H.; Li, H.T.; Tsai, K.C. Implementing the Fisher’s Discriminant Ratio in a k-Means Clustering Algorithm for Feature Selection and Data Set Trimming. J. Chem. Inf. Comput. Sci. 2004, 44, 76–87.
  13. Alimjan, G.; Sun, T.; Liang, Y.; Jumahun, H.; Guan, Y. A new technique for remote sensing image classification based on combinatorial algorithm of SVM and KNN. Int. J. Pattern Recognit. Artif. Intell. 2018, 32, 1859012.
  14. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  15. Zhang, L.; Zhang, L.; Du, B. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
  16. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
  17. Cheng, G.; Sun, X.; Li, K.; Guo, L.; Han, J. Perturbation-seeking generative adversarial networks: A defense framework for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11.
  18. Lanthier, Y.; Bannari, A.; Haboudane, D.; Miller, J.R.; Tremblay, N. Hyperspectral data segmentation and classification in precision agriculture: A multi-scale analysis. In Proceedings of the IGARSS 2008—2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA, 6–11 July 2008; Volume 2, pp. II-585–II-588.
  19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  20. Zhang, W.; Tang, P.; Zhao, L.; Huang, Q. A comparative study of U-nets with various convolution components for building extraction. In Proceedings of the 2019 Joint Urban Remote Sensing Event (JURSE), Vannes, France, 22–24 May 2019; pp. 1–4.
  21. Mustafa, N.; Zhao, J.; Liu, Z.; Zhang, Z.; Yu, W. Iron ORE region segmentation using high-resolution remote sensing images based on Res-U-Net. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2563–2566.
  22. Han, L.; Liang, H.; Chen, H.; Zhang, W.; Ge, Y. Convective precipitation nowcasting using U-Net Model. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–8.
  23. Asma, S.B.; Abdelhamid, D.; Youyou, L. U-Net Based Classification For Urban Areas in Algeria. In Proceedings of the 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Tunis, Tunisia, 9–11 March 2020; pp. 101–104.
  24. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394.
  25. Li, J.; Cui, R.; Li, B.; Li, Y.; Mei, S.; Du, Q. Dual 1D-2D spatial-spectral cnn for hyperspectral image super-resolution. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3113–3116.
  26. Khusni, U.; Dewangkoro, H.I.; Arymurthy, A.M. Urban area change detection with combining CNN and RNN from sentinel-2 multispectral remote sensing data. In Proceedings of the 2020 3rd International Conference on Computer and Informatics Engineering (IC2IE), Yogyakarta, Indonesia, 15–16 September 2020; pp. 171–175.
  27. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15.
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
  30. Lan, R.; Li, Z.; Liu, Z.; Gu, T.; Luo, X. Hyperspectral image classification using k-sparse denoising autoencoder and spectral–restricted spatial characteristics. Appl. Soft Comput. 2019, 74, 693–708.
  31. Li, J.; Yuan, Q.; Shen, H.; Meng, X.; Zhang, L. Hyperspectral image super-resolution by spectral mixture analysis and spatial–spectral group sparsity. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1250–1254.
  32. Yang, J.; Zhao, Y.; Chan, J.C.W.; Yi, C. Hyperspectral image classification using two-channel deep convolutional neural network. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 5079–5082.
  33. Yang, H.L.; Crawford, M.M. Exploiting spectral-spatial proximity for classification of hyperspectral data on manifolds. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 4174–4177.
  34. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 3322–3325.
  35. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16.
  36. Ma, L.; Li, M.; Ma, X.; Cheng, L.; Du, P.; Liu, Y. A Review of Supervised Object-Based Land-Cover Image Classification; Elsevier: Amsterdam, The Netherlands, 2017; Volume 130, pp. 277–293.
  37. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ. 2018, 216, 57–70.
  38. Huang, B.; Zhao, B.; Song, Y. Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery. Remote Sens. Environ. 2018, 214, 73–86.
  39. Scott, G.J.; Marcum, R.A.; Davis, C.H.; Nivin, T.W. Fusion of Deep Convolutional Neural Networks for Land Cover Classification of High-Resolution Imagery; IEEE: Manhattan, NY, USA, 2017; Volume 14, pp. 1638–1642.
  40. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. Joint Deep Learning for Land Cover and Land Use Classification; Elsevier: Amsterdam, The Netherlands, 2019; Volume 221, pp. 173–187.
  41. Khaki, S.; Pham, H.; Han, Y.; Kuhl, A.; Kent, W.; Wang, L. Deepcorn: A Semi-Supervised Deep Learning Method for High-Throughput Image-Based Corn Kernel Counting and Yield Estimation; Elsevier: Amsterdam, The Netherlands, 2021; Volume 218, p. 106874.
  42. Cui, Z.; Kerekes, J. Potential of Red Edge Spectral Bands in Future Landsat Satellites on Agroecosystem Canopy Chlorophyll Content Retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 7168–7171.
  43. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
Figure 1. Location of the study area and Sentinel-2 remote sensing image (6 June 2021).
Figure 2. Schematic of the parallel MARC-Net network architecture.
Figure 3. Schematic of groupwise spectral embedding, showing the change of feature embedding. (a) d = 2. (b) d = 1/2.
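As a rough sketch of the groupwise spectral embedding idea in Figure 3, each token embeds a small window of neighboring bands rather than a single band; the number of neighboring bands is also the quantity varied later in Table 3. The `grouped_spectral_embedding` helper, the group size of 3, the edge padding, and all dimensions below are assumptions for illustration, not the exact MARC-Net implementation.

```python
# Illustrative sketch of groupwise spectral embedding (sizes are assumptions).
import numpy as np

def grouped_spectral_embedding(pixel, group_size, W):
    """pixel: (bands,) spectrum of one pixel; W: (group_size, d_model) embedding."""
    bands = len(pixel)
    half = group_size // 2
    # Pad the spectrum at both ends so every band gets a full neighborhood.
    padded = np.pad(pixel, (half, group_size - 1 - half), mode="edge")
    # One group of neighboring bands per token position.
    groups = np.stack([padded[i:i + group_size] for i in range(bands)])
    return groups @ W  # (bands, d_model) token sequence

rng = np.random.default_rng(2)
pixel = rng.random(10)                  # e.g., a 10-band multispectral pixel
W = rng.standard_normal((3, 16)) * 0.1  # embed groups of 3 neighboring bands
print(grouped_spectral_embedding(pixel, 3, W).shape)  # (10, 16)
```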
Figure 4. Schematic diagram of multi-scale residual cascaded convolutional network.
Figure 5. Multi-head attention consists of several attention layers running in parallel.
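To make Figure 5 concrete, the sketch below implements multi-head scaled dot-product attention in plain NumPy: the model width is split into several subspaces ("heads"), each head attends over all token positions in parallel, and the head outputs are concatenated and projected. All dimensions, the random weights, and the `multi_head_attention` helper are illustrative assumptions, not MARC-Net's trained parameters.

```python
# Minimal NumPy sketch of multi-head scaled dot-product attention (Figure 5).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape                      # n tokens, model width
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split the model dimension into n_heads parallel subspaces.
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    out = softmax(scores) @ V                            # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return out @ Wo                                      # final projection

rng = np.random.default_rng(0)
d_model, n_tokens, n_heads = 64, 10, 4
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (10, 64)
```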
Figure 6. Experimental results of MARC-Net algorithm sub-modules in Huarong County. (a) Image. (b) ViT. (c) only FC + CNN. (d) MARC-Net (GSE). (e) MARC-Net (FC). (f) MARC-Net (FC + CNN). (g) MARC-Net (GSE + FC). (h) MARC-Net (GSE + FC + CNN).
Figure 7. Classification results for different proportions of training samples. (a) Image. (b) 10%. (c) 20%. (d) 30%. (e) 40%. (f) 50%. (g) 60%. (h) 70%.
Figure 8. ViT results in different bands: (a) 4 bands; (b) 4 bands + VNIR. MARC-Net results in different bands: (c) 4 bands; (d) 4 bands + VNIR.
Figure 9. Algorithm results using different vegetation indices: (a) MARC (NDVI); (b) MARC (NDWI); (c) MARC (NDVI + NDWI).
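Figure 9 compares MARC variants that append vegetation/water index channels to the spectral input. The index math below is the standard definition (NDVI from red and near-infrared; McFeeters' NDWI from green and near-infrared); the Sentinel-2 band choices (B3 green, B4 red, B8 NIR) and the array handling are illustrative assumptions rather than the authors' exact preprocessing.

```python
# Hedged sketch: deriving NDVI and NDWI channels from Sentinel-2 reflectance.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    # NDVI = (NIR - Red) / (NIR + Red)
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    # NDWI = (Green - NIR) / (Green + NIR), McFeeters' water index
    return (green - nir) / (green + nir + eps)

# Toy reflectance patches standing in for Sentinel-2 B3/B4/B8 rasters.
rng = np.random.default_rng(1)
b3, b4, b8 = (rng.uniform(0.0, 0.5, size=(4, 4)) for _ in range(3))
extra_channels = np.stack([ndvi(b8, b4), ndwi(b3, b8)], axis=0)
print(extra_channels.shape)  # (2, 4, 4), appended to the spectral input
```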
Figure 10. Classification results of different algorithms in Huarong County. (a) Image. (b) SVM. (c) KNN. (d) CNN. (e) RNN. (f) ViT. (g) CAF. (h) MARC-Net.
Figure 11. Classification results of different algorithms in Anxiang County. (a) Image. (b) SVM. (c) KNN. (d) CNN. (e) RNN. (f) ViT. (g) CAF. (h) MARC-Net.
Figure 12. Original images of the Huarong area for three years: (a) 2017, (b) 2019, and (c) 2021; classification distribution maps of the Huarong area for three years: (d) 2017, (e) 2019, and (f) 2021.
Figure 13. Distribution of Forest land in Huarong County in (a) 2017, (b) 2019, and (c) 2021; distribution of Double crop in Huarong County in (d) 2017, (e) 2019, and (f) 2021.
Figure 14. Dynamics of Forest land in 2017–2019 (a), 2019–2021 (b), and 2017–2021 (c); dynamics of Double crop in 2017–2019 (d), 2019–2021 (e), and 2017–2021 (f).
Table 1. Sample size of the study area 2021 data set.

No. | Class | Huarong (Training) | Huarong (Testing) | Anxiang (Training) | Anxiang (Testing)
1 | Building | 873 | 219 | 808 | 203
2 | Tree | 1107 | 277 | 1010 | 253
3 | Water | 3160 | 790 | 2445 | 612
4 | Greenhouse | 1035 | 259 | 526 | 132
5 | Lotus | 1000 | 251 | 754 | 189
6 | Pond | 825 | 207 | 1563 | 391
7 | Wetland | 1386 | 347 | 1540 | 385
8 | Vegetable | 743 | 186 | 798 | 200
9 | Rapeseed | 1925 | 482 | 811 | 203
10 | Crayfish | 1537 | 385 | 854 | 214
11 | Single | 746 | 187 | 1096 | 274
12 | Double crop | 812 | 320 | 527 | 132
Total | | 15,149 | 3910 | 12,732 | 3188
Table 2. Experimental ablation results of MARC-Net on the study area dataset using different combinations of modules.

Method | GSE | FC | CNN | OA (%) | AA (%) | Kappa
ViT | × | × | × | 95.08 | 94.31 | 94.49
only FC + CNN | × | × | × | 95.23 | 94.22 | 94.66
MARC-Net | ✓ | × | × | 96.27 | 95.91 | 95.83
MARC-Net | × | ✓ | × | 96.37 | 95.83 | 95.94
MARC-Net | × | ✓ | ✓ | 96.84 | 96.47 | 96.47
MARC-Net | ✓ | ✓ | × | 96.90 | 96.66 | 96.53
MARC-Net | ✓ | ✓ | ✓ | 97.22 | 96.82 | 96.89
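The OA, AA, and Kappa columns reported in Tables 2–12 follow the standard confusion-matrix definitions (the tables show all three scaled by 100). Below is a minimal sketch of these metrics with toy labels; the `classification_metrics` helper and the example arrays are illustrative, not tied to the paper's data.

```python
# Overall accuracy (OA), average accuracy (AA), and Cohen's kappa
# computed from a confusion matrix built out of toy labels.
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                  # rows: truth, cols: prediction
    oa = np.trace(cm) / cm.sum()                       # fraction of correct pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean per-class recall
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])
print(classification_metrics(y_true, y_pred, n_classes=3))
```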
Table 3. Effect of the number of neighboring bands on ViT and MARC-Net.

Method | Metric | 1 band | 2 bands | 3 bands | 4 bands
Transformer (ViT) | OA (%) | 95.08 | 96.27 | 96.55 | 95.62
Transformer (ViT) | AA (%) | 94.31 | 95.91 | 96.16 | 95.03
Transformer (ViT) | Kappa | 94.49 | 95.83 | 96.14 | 95.10
MARC-Net | OA (%) | 96.84 | 97.22 | 95.81 | 96.93
MARC-Net | AA (%) | 96.47 | 96.82 | 95.57 | 96.00
MARC-Net | Kappa | 96.47 | 96.89 | 95.32 | 95.56
Table 4. MARC-Net test results for various training-sample proportions in Huarong County, 2021.

Class No. | 10% | 20% | 30% | 40% | 50% | 60% | 70%
1 | 96.33 | 99.08 | 98.77 | 97.47 | 99.81 | 99.38 | 99.47
2 | 87.68 | 93.84 | 91.32 | 95.11 | 94.36 | 97.71 | 95.35
3 | 99.74 | 99.49 | 99.66 | 99.81 | 99.79 | 99.91 | 99.89
4 | 98.44 | 99.61 | 100.00 | 100.00 | 99.84 | 100.00 | 100.00
5 | 100.00 | 100.00 | 100.00 | 100.00 | 99.84 | 100.00 | 100.00
6 | 82.52 | 86.89 | 94.49 | 96.60 | 94.18 | 97.25 | 95.01
7 | 93.06 | 90.17 | 90.94 | 91.05 | 92.03 | 90.18 | 92.16
8 | 94.56 | 97.83 | 98.56 | 96.76 | 98.70 | 99.46 | 98.92
9 | 90.83 | 92.72 | 94.59 | 95.53 | 94.34 | 92.93 | 97.03
10 | 90.10 | 91.14 | 93.05 | 93.75 | 96.98 | 95.14 | 96.57
11 | 75.26 | 73.65 | 78.85 | 77.47 | 82.40 | 87.47 | 80.39
12 | 97.02 | 99.01 | 98.02 | 98.27 | 99.21 | 98.19 | 97.04
OA (%) | 93.33 | 94.48 | 95.49 | 95.89 | 96.47 | 96.60 | 96.72
AA (%) | 92.13 | 93.62 | 94.86 | 95.16 | 95.96 | 96.47 | 95.99
Kappa | 92.53 | 93.81 | 94.95 | 95.40 | 96.05 | 96.20 | 96.32
Table 5. Test results in the 2021 Huarong County data set using different methods in different wavebands.

Class No. | ViT (4 bands) | MARC-Net (4 bands) | ViT (4 bands + VNIR) | MARC-Net (4 bands + VNIR)
1 | 89.92 | 97.25 | 97.02 | 99.77
2 | 88.53 | 90.33 | 95.93 | 97.65
3 | 99.45 | 99.43 | 99.71 | 99.90
4 | 98.78 | 99.42 | 99.90 | 100.00
5 | 96.45 | 98.60 | 99.60 | 99.90
6 | 77.83 | 79.03 | 89.93 | 97.09
7 | 92.91 | 94.30 | 88.16 | 91.63
8 | 89.53 | 93.53 | 95.55 | 96.63
9 | 85.80 | 90.75 | 93.29 | 95.37
10 | 83.79 | 94.59 | 94.33 | 98.17
11 | 74.27 | 74.12 | 80.96 | 86.86
12 | 93.52 | 93.71 | 97.29 | 98.89
OA (%) | 90.72 | 93.57 | 95.08 | 97.22
AA (%) | 89.24 | 92.09 | 94.31 | 96.82
Kappa | 89.60 | 92.80 | 94.49 | 96.89
Table 6. Comparison of the two network structures.

C N. | Series Connection | MARC-Net
1 | 99.54 | 99.77
2 | 94.67 | 97.65
3 | 99.96 | 99.90
4 | 100.00 | 100.00
5 | 99.90 | 99.90
6 | 96.12 | 97.09
7 | 93.14 | 91.63
8 | 99.32 | 96.63
9 | 93.71 | 95.37
10 | 98.95 | 98.17
11 | 88.73 | 86.86
12 | 99.26 | 98.89
OA (%) | 97.20 | 97.22
AA (%) | 96.95 | 96.82
Kappa | 96.87 | 96.89
Time (s) | 65,115.24 | 61,606.42
Table 7. Comparison of adding vegetation indices.

C N. | MARC (NDVI) | MARC (NDWI) | MARC (NDVI + NDWI) | MARC-Net
1 | 99.47 | 99.60 | 99.21 | 99.77
2 | 95.66 | 97.21 | 93.90 | 97.65
3 | 99.92 | 99.92 | 99.81 | 99.90
4 | 100.00 | 99.66 | 99.66 | 100.00
5 | 99.88 | 99.77 | 100.00 | 99.90
6 | 95.01 | 95.70 | 91.68 | 97.09
7 | 90.60 | 91.17 | 92.08 | 91.63
8 | 98.92 | 99.07 | 99.23 | 96.63
9 | 95.48 | 95.60 | 92.04 | 95.37
10 | 97.24 | 96.95 | 96.80 | 98.17
11 | 84.99 | 77.02 | 88.05 | 86.86
12 | 98.87 | 98.73 | 98.45 | 98.89
OA (%) | 96.79 | 96.57 | 96.23 | 97.22
AA (%) | 96.34 | 95.87 | 95.91 | 96.82
Kappa | 96.41 | 96.16 | 95.78 | 96.89
Table 8. Classification results of various methods in Huarong County, 2021.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 89.49 | 84.93 | 64.49 | 94.84 | 97.02 | 99.08 | 99.77
2 | 93.50 | 92.77 | 91.23 | 94.94 | 95.93 | 95.30 | 97.65
3 | 98.60 | 99.36 | 97.68 | 99.58 | 99.71 | 99.81 | 99.90
4 | 97.29 | 98.84 | 92.46 | 99.71 | 99.90 | 99.71 | 100.00
5 | 99.60 | 97.21 | 89.70 | 99.50 | 99.60 | 99.80 | 99.90
6 | 64.73 | 77.29 | 65.45 | 82.30 | 89.93 | 93.69 | 97.09
7 | 88.18 | 94.52 | 85.64 | 89.39 | 88.16 | 92.92 | 91.63
8 | 92.47 | 94.62 | 76.58 | 95.96 | 95.55 | 97.84 | 96.63
9 | 90.66 | 88.79 | 75.94 | 92.00 | 93.29 | 93.87 | 95.37
10 | 88.83 | 86.23 | 75.92 | 91.60 | 94.33 | 97.26 | 98.17
11 | 44.38 | 86.63 | 50.26 | 75.87 | 80.96 | 82.43 | 86.86
12 | 83.25 | 94.08 | 89.03 | 98.52 | 97.29 | 97.78 | 98.89
OA (%) | 89.09 | 92.41 | 82.76 | 93.93 | 95.08 | 96.40 | 97.22
AA (%) | 85.92 | 91.28 | 79.54 | 92.85 | 94.31 | 95.79 | 96.82
Kappa | 87.74 | 91.49 | 80.65 | 93.20 | 94.49 | 95.96 | 96.89
Table 9. Classification results of various methods in Anxiang County, 2021.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 95.79 | 93.59 | 91.21 | 95.79 | 96.16 | 98.51 | 98.76
2 | 95.94 | 96.44 | 79.20 | 95.94 | 96.13 | 98.31 | 97.72
3 | 98.28 | 98.85 | 90.55 | 98.28 | 99.26 | 99.01 | 99.55
4 | 94.86 | 94.69 | 80.60 | 94.86 | 96.19 | 98.09 | 97.90
5 | 96.02 | 89.41 | 75.99 | 96.02 | 97.74 | 98.67 | 94.69
6 | 94.11 | 88.25 | 71.27 | 94.11 | 93.92 | 96.09 | 97.31
7 | 96.62 | 92.20 | 86.62 | 96.62 | 95.38 | 95.71 | 97.27
8 | 83.95 | 78.50 | 66.04 | 83.95 | 82.58 | 87.09 | 90.72
9 | 69.29 | 70.93 | 13.07 | 69.29 | 81.62 | 77.68 | 85.32
10 | 76.22 | 66.35 | 37.11 | 76.22 | 82.20 | 81.96 | 86.53
11 | 85.49 | 76.64 | 83.39 | 85.49 | 85.85 | 89.14 | 88.04
12 | 81.78 | 83.33 | 24.66 | 81.78 | 85.76 | 86.90 | 88.80
OA (%) | 90.94 | 87.70 | 72.18 | 90.94 | 92.45 | 93.51 | 94.68
AA (%) | 89.03 | 85.77 | 66.65 | 89.03 | 91.07 | 92.27 | 93.56
Kappa | 89.89 | 86.28 | 68.94 | 89.89 | 91.57 | 92.76 | 94.06
Table 10. Sample size of Huarong County data set for 2017 and 2019.

No. | Class | 2017 (Training) | 2017 (Testing) | 2019 (Training) | 2019 (Testing)
1 | Building | 1083 | 271 | 1489 | 373
2 | Tree | 1047 | 262 | 1657 | 415
3 | Water | 4640 | 1161 | 4776 | 1195
4 | Greenhouse | 1620 | 405 | 1472 | 369
5 | Lotus | 1325 | 332 | 1446 | 362
6 | Pond | 674 | 169 | 869 | 218
7 | Wetland | 852 | 214 | 2294 | 574
8 | Vegetable | 477 | 120 | 660 | 166
9 | Rapeseed | 500 | 125 | 1404 | 351
10 | Crayfish | 1161 | 291 | 1526 | 382
11 | Single | 546 | 137 | 708 | 178
12 | Double crop | 432 | 109 | 777 | 195
Total | | 14,357 | 3596 | 19,078 | 4778
All samples (training + testing) | | 17,953 (2017) | | 23,856 (2019) |
Table 11. Classification results of various methods in Huarong County, 2017.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 97.80 | 94.63 | 89.20 | 95.97 | 97.86 | 98.53 | 99.02
2 | 97.11 | 96.38 | 98.19 | 95.48 | 99.63 | 98.46 | 97.74
3 | 96.74 | 97.56 | 96.81 | 98.23 | 99.39 | 99.25 | 99.83
4 | 46.66 | 72.38 | 36.66 | 68.80 | 77.02 | 89.04 | 87.61
5 | 71.42 | 80.71 | 46.91 | 86.50 | 89.54 | 96.24 | 95.35
6 | 56.58 | 62.40 | 40.75 | 71.88 | 81.61 | 86.08 | 91.34
7 | 82.32 | 92.34 | 76.89 | 92.93 | 92.01 | 94.65 | 94.45
8 | 75.73 | 87.57 | 74.73 | 90.49 | 87.36 | 92.57 | 92.12
9 | 81.12 | 75.32 | 59.65 | 81.76 | 87.84 | 88.26 | 92.84
10 | 39.15 | 68.93 | 15.62 | 65.99 | 76.59 | 84.21 | 84.29
11 | 44.91 | 60.96 | 43.29 | 69.97 | 83.24 | 84.18 | 82.03
12 | 50.92 | 68.51 | 34.14 | 62.38 | 73.95 | 84.14 | 84.25
OA (%) | 76.03 | 82.91 | 65.67 | 84.86 | 89.53 | 92.61 | 93.41
AA (%) | 70.04 | 79.81 | 59.41 | 81.70 | 87.18 | 91.31 | 91.74
Kappa | 73.13 | 80.97 | 61.71 | 83.10 | 88.31 | 91.76 | 92.65
Table 12. Classification results of various methods in Huarong County, 2019.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 75.22 | 75.66 | 76.44 | 86.55 | 87.88 | 92.88 | 89.33
2 | 90.40 | 92.80 | 85.87 | 92.18 | 93.68 | 93.38 | 97.79
3 | 99.64 | 99.76 | 99.22 | 99.88 | 100.00 | 100.00 | 100.00
4 | 63.67 | 86.89 | 66.29 | 89.41 | 93.07 | 92.32 | 92.97
5 | 83.33 | 89.79 | 77.55 | 91.92 | 96.59 | 98.29 | 97.61
6 | 50.00 | 66.40 | 21.06 | 77.75 | 69.68 | 77.75 | 83.46
7 | 95.17 | 96.19 | 87.55 | 97.01 | 98.41 | 99.49 | 97.71
8 | 76.77 | 85.80 | 66.88 | 89.12 | 93.34 | 96.10 | 94.15
9 | 94.60 | 93.56 | 90.85 | 94.23 | 96.72 | 97.35 | 96.67
10 | 66.40 | 77.99 | 44.57 | 87.11 | 84.59 | 86.82 | 90.11
11 | 80.95 | 90.04 | 78.36 | 92.17 | 92.82 | 96.63 | 95.21
12 | 72.29 | 90.90 | 55.48 | 93.37 | 95.87 | 96.41 | 97.17
OA (%) | 84.87 | 90.64 | 78.94 | 93.24 | 94.46 | 95.82 | 95.92
AA (%) | 79.04 | 87.15 | 70.85 | 90.90 | 91.89 | 93.96 | 94.35
Kappa | 82.93 | 89.45 | 76.20 | 92.38 | 93.75 | 95.28 | 95.40
Table 13. Changes in land-use types in Huarong County, 2017–2021.

Class | Area 2017 (km²) | Area 2019 (km²) | Area 2021 (km²) | Change 2017–2019 (%) | Change 2019–2021 (%) | Change 2017–2021 (%)
Building | 114.9272 | 124.5547 | 182.1598 | 8.37 | 51.24 | 63.91
Tree | 125.1143 | 105.6089 | 76.3699 | −15.59 | 0.97 | −14.76
Water | 157.3317 | 97.8937 | 132.7025 | −37.77 | 19.41 | −25.69
Greenhouse | 45.4248 | 71.6923 | 16.1887 | 57.82 | −82.29 | −72.05
Lotus | 56.4382 | 32.1702 | 24.1319 | −42.99 | 16.56 | −33.55
Pond | 55.4054 | 99.0612 | 84.5474 | 78.79 | 1.11 | 80.79
Wetland | 137.4385 | 135.4457 | 133.4817 | −1.44 | −1.45 | −2.87
Vegetable | 105.0842 | 138.79 | 120.0094 | 32.07 | −22.54 | 2.29
Rapeseed | 139.8075 | 293.9893 | 262.4486 | 110.28 | −0.49 | 109.23
Crayfish | 179.1558 | 226.0001 | 365.0964 | 26.14 | 47.63 | 86.23
Single | 57.9711 | 115.7221 | 55.1289 | 99.62 | −41.39 | 16.98
Double crop | 49.9226 | 117.6979 | 199.4702 | 135.76 | 16.64 | 175.01
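For reference, a minimal sketch of a relative area change rate, (A_end − A_start)/A_start × 100. With the Tree areas above it reproduces the published 2017–2019 entry (−15.59%); the later columns of Table 13 appear to follow a different convention and are reproduced as published, so only that first entry is checked here. The `change_rate` helper is an illustrative assumption, not the authors' code.

```python
# Relative change of an area between two dates, checked against the
# 2017-2019 Tree entry of Table 13.
def change_rate(a_start: float, a_end: float) -> float:
    """Percentage change of an area between two dates."""
    return (a_end - a_start) / a_start * 100.0

tree_2017, tree_2019 = 125.1143, 105.6089  # km^2, from Table 13
print(f"{change_rate(tree_2017, tree_2019):.2f}%")  # -15.59%
```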
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
