Article

MARC-Net: Terrain Classification in Parallel Network Architectures Containing Multiple Attention Mechanisms and Multi-Scale Residual Cascades

1 School of Automation, Guangxi University of Science and Technology, Liuzhou 545006, China
2 Guangxi Collaborative Innovation Centre for Earthmoving Machinery, Guangxi University of Science and Technology, Liuzhou 545006, China
3 National Satellite Meteorological Center, China Meteorological Administration, Beijing 100081, China
* Author to whom correspondence should be addressed.
Forests 2023, 14(5), 1060; https://doi.org/10.3390/f14051060
Submission received: 22 April 2023 / Revised: 13 May 2023 / Accepted: 16 May 2023 / Published: 22 May 2023

Abstract
To address the problem that traditional deep learning algorithms cannot fully utilize the correlations within spectral sequence information or the feature differences between different spectra, this paper proposes a parallel network architecture for land-use classification, called MARC-Net, that combines a multi-head attention mechanism with a multiscale residual cascade. The parallel framework first deepens the mining of inter-spectral information in the features generated by groupwise spectral embedding by adding a multi-head attention mechanism, which allows semantic features to be expressed from more subspaces while fully considering the interrelationships among all spatial locations. Secondly, a multiscale residual cascaded CNN (convolutional neural network) is designed to fully utilize the fused feature information at different scales, thereby improving the network's ability to represent different levels of information. Lastly, the features obtained by the multi-head attention mechanism are fused with those obtained by the CNN, and the merged features are reduced in dimensionality through a fully connected layer to obtain the classification results, achieving pixel-level multispectral image classification. The findings show that the proposed algorithm achieves an overall accuracy of 97.22%, compared with 95.08% for the Vision Transformer (ViT), a substantial improvement on the Sentinel-2 dataset. Moreover, this article focuses on the rate of change of forest land in the study area. The forest land area was 125.1143 km² in 2017, 105.6089 km² in 2019, and 76.3699 km² in 2021, corresponding to decreases of 15.59%, 27.69%, and 38.96% in 2017–2019, 2019–2021, and 2017–2021, respectively.

1. Introduction

Although remote sensing (RS) technology is achieving increasingly remarkable results in practical areas such as crop monitoring, weather forecasting, marine research, and geological surveys [1,2,3,4], as well as land-cover classification, more research is needed because the complexity of feature types in some study areas easily leads to confusion between samples. Land-cover classification plays an extremely important role in tasks such as refined agriculture, land resource exploration, regional geological change, and integrated urban planning [5,6,7,8]. Therefore, accurate, real-time access to remote sensing data to improve the accuracy of land-cover classification is an inevitable need for practical applications. Traditionally, remote sensing data have mainly been used to monitor environmental parameters based on physical models before applying algorithms for predictive classification [9]. Although physical models can generate remote sensing data well from environmental parameters, they are highly dependent on a priori knowledge. Accordingly, various data-mining-based machine learning (ML) methods have non-negligible value in the field of remote sensing image processing. ML algorithms used for RS image recognition include support vector machines (SVMs) [10], random forests (RFs) [11], k-means [12], and k-nearest neighbors (KNNs) [13]. For linearly separable problems, SVMs can be transformed into a constrained optimization problem to perform hyperplane classification. K-means is an iterative, partition-based cluster analysis algorithm that uses distance as a similarity metric, but it is sensitive to anomalous samples and can neither handle discrete features nor guarantee global optimality. KNN classifies using the distances between feature values but copes poorly with large numbers of samples. Therefore, ML algorithms for RS image classification still need improvement. Moreover, the generalization capability of traditional neural networks is unsatisfactory, and their approximation of nonlinear models needs to be improved, making end-to-end image segmentation tasks difficult to implement.
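For readers who want a concrete baseline, the following is a minimal sketch (not from the paper) of pixel-wise classification with the SVM and KNN classifiers mentioned above, using scikit-learn; the arrays, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: pixel-wise classification of multispectral data with
# classical ML baselines. `pixels` is an (N, bands) array of spectra and
# `labels` an (N,) array of class IDs; both are random placeholders here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
pixels = rng.random((1000, 8))           # placeholder spectra: 8 bands
labels = rng.integers(0, 12, size=1000)  # placeholder: 12 land-cover classes

X_train, X_test, y_train, y_test = train_test_split(
    pixels, labels, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # spectra are usually normalized first
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(scaler.transform(X_train), y_train)
    acc = clf.score(scaler.transform(X_test), y_test)
    print(f"{name} overall accuracy: {acc:.4f}")
```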
Deep learning (DL) is a promising research direction based on large-scale deep neural networks. DL models can accurately approximate nonlinear relationships between environmental parameters thanks to their multilevel learning properties [14], which helps to achieve sensing, retrieval, fusion, and downscaling across remote sensing environmental variables. In image processing, deep learning algorithms can extract multiscale and multilevel features, and their unique ability to combine features from low level to high level [15] confers better performance in image processing and classification. Therefore, DL models outperform traditional models in processing remote sensing imagery [16]; as a result, more scholars are applying deep learning to remote sensing research. For example, concerning the security of deep learning for remote sensing image classification, Cheng et al. proposed an effective defense framework for Remotely Sensed Imagery (RSI) scene classification, the perturbation-seeking generative adversarial network, which uses unknown attacks of random type during training to eliminate the blind spots of the classifier [17]. In recent years, deep learning has made significant breakthroughs in fine-grained agricultural classification [18] and has shown good performance in land-cover classification tasks for RS images. Semantic segmentation has also been applied to land-cover classification; it works by associating each pixel with a category label. A semantic segmentation network extracts the semantic features of each pixel from the pixel grayscale information and uses them to accurately classify pixels into different categories. U-Net [19], a classical fully convolutional network, represents one of the early algorithms using multiscale features for semantic segmentation and is widely used on remote sensing images for its good predictive ability. Zhang et al. achieved remarkable performance in building extraction using Inception-U-Net [20], Mustafa et al. used Res-U-Net to segment high-resolution remote sensing images of iron ore [21], Han et al. used the U-Net model for convective precipitation forecasting [22], and Asma et al. demonstrated the ability of a pretrained U-Net to classify satellite images [23], which contributes to satellite pinpointing capability. These studies show that RS image segmentation is a task fraught with complexity.
The application of DL to remote sensing image processing differs from its application to natural images: remote sensing images are usually more complex and diverse and carry richer spatial-spectral information, which places higher demands on processing algorithms. Due to its strong feature representation capability, deep learning has been applied in remote sensing for land-cover classification, environmental parameter retrieval, data fusion, downscaling, information construction, and prediction. Recurrent neural networks (RNNs) [24], whose iterative structures work well for temporal data, and convolutional neural networks (CNNs) [25] currently dominate image processing. Khusni et al. combined CNN and RNN frameworks using a simple mechanism that independently encodes dual-time images to obtain their representation vectors [26]. Although CNNs and RNNs can obtain better classification results than traditional methods, their network structures impose limitations: CNNs mainly focus on local spatial features and ignore the importance of overall connectivity, so they cannot adequately analyze sequences of spectral features [27], whereas RNNs require input-output sequence alignment and cannot perform parallel computation. Since the emergence of the Transformer [28] in natural language processing (NLP), it has received increasing attention. The Transformer architecture relies entirely on the attention mechanism to model the global dependencies between inputs and outputs, and on positional encoding to represent relative or absolute positional relationships. In contrast to CNNs, the number of operations required to compute the association between two positions does not increase with distance. The Google team proposed the Vision Transformer (ViT) [29] as an image classification model, a landmark study in the application of transformers to computer vision. ViT divides the image into patches, flattens them into one-dimensional vectors, and concatenates the vectors to form a tensor; this sequence is input to the Transformer, which uses the pixel sequence to effectively classify the pixels. Hong et al. [27] proposed a new backbone network, SpectralFormer, which learns spectral local sequence information from adjacent bands of images to generate groupwise spectral embeddings. This algorithm also applies to block-level input, enriches feature information through cross-layer skip connections, and generates grouped spectral embeddings over adjacent bands to learn spectral local sequence information.
The classification of multispectral image information, a key issue in remote sensing applications, involves tradeoffs between spatial resolution and temporal coverage. There is an irreconcilable conflict between the spatial and spectral resolution of multispectral images; thus, it is difficult to obtain remote sensing data with both rich spatial and rich temporal resolution. To mitigate this problem, most researchers use spatial and spectral features jointly [30,31]. Yang et al. [32] used a dual-channel CNN to jointly learn spectral and spatial features, whereas Yang et al. [33] proposed using the spectral properties and spatial homogeneity of the spectral-spatial neighborhood map for robust manifold learning before performing the classification task. Lee et al. [34] designed a contextual deep fully convolutional DL network that jointly utilizes spatial and hue-saturation-intensity (HSI) spectral features for learning and classification; variable-size convolutional features were used to create a spectral-spatial feature map.
While multispectral data can be used for land-cover mapping, land-cover mapping based on remote sensing images relies on image classification. Most traditional classification methods classify images on the basis of different spatial units, such as pixels, sliding windows, moving objects, and scenes [35,36,37]. However, traditional methods consider only low-level features in the spectral and spatial domains, and with such limited feature information it is difficult to distinguish complex classes of land structures or terrain objects. Therefore, classification methods using high-level features are preferable. Owing to its advantages in multiscale and multilevel feature extraction, DL has recently been applied to land-cover classification with good results [38,39,40].
CNN algorithms are data-demanding; hence, the widely used Transformer is applied as a more efficient way to process pixel sequence information. Although many ViT studies have aimed to surpass and eventually replace CNNs, stronger performance is obtained when the two are combined [41]. ViT has the advantage of being able to attend to all pixels in an image regardless of scale, but the obvious disadvantage of requiring extensive training data. For approaches incorporated into a CNN architecture, the local context window of the CNN considers only local pixels; although different image locations can be processed with shared weights, combining the CNN with a transformer allows more information to be obtained from less data. To address these problems, a new architecture called MARC-Net is proposed in this paper. It differs substantially from other work incorporating ViT into CNN architectures; the main innovative differences are explained below.
Firstly, the form of the CNN cascade convolution is completely different. Although both designs use CNNs for feature extraction, our CNN branch uses three skip connections across two cascaded convolutional layers, forming a DenseNet-like structure with a 2 × 2 average pooling layer, which is more conducive to multiscale feature fusion. In contrast, most scholars use two skip connections across two-layer convolutions with an SE structure to extend the receptive field. Secondly, regarding the network framework structure, we merge the features obtained by the Transformer encoder with those obtained by the CNN in a parallel manner, thereby shortening the processing time without sacrificing accuracy. In contrast, most scholars feed the features obtained by the CNN into the Transformer for further processing, maximizing the use of the Transformer's capacity to handle large data with excellent results. Thirdly, the structure of the Transformer input differs because of the different network framework and band choices. The inputs of the Transformer in our framework are: 0, the position encoding; 1–4, the four common bands; and 5–8, the VNIR bands. The GSE spectral feature module is then added to the initial input, and only half of each of every two sequences is selected for fusion before the encoder, to maximize the mutual information between bands. In contrast, some scholars proposed a framework in which Transformer inputs 5–6 separately represent the features extracted by two CNN layers, and input 7 represents the features extracted by the CNN module using band fusion as the input. Lastly, regarding the classification module, we merge the generated features and use fully connected layers to reduce the dimensionality, then obtain the classification results using the activation function and a conv1×1 layer, whereas most researchers use MLP heads for classification in vision transformers.
In this paper, we propose MARC-Net, which uses a parallel network structure integrating a multi-head attention mechanism and a multiscale residual cascade to reduce processing time while maintaining the superiority of the algorithm. Firstly, the visible near-infrared (VNIR) bands are added to make full use of the band information [42]; secondly, the GSE module [27] is invoked to generate the groupwise spectral embedding. The first branch uses a multi-head attention mechanism to process the spectral sequence information. The second branch is a multiscale residual cascaded CNN, which converts the one-dimensional pixel sequence into three dimensions through a fully connected layer and then performs multiscale feature extraction. Lastly, the features obtained by the multi-head attention mechanism are fused with those obtained by the CNN, and the pixel classification results are obtained by a conv1×1 layer. The main contributions of this paper are outlined below.
On the one hand, the features generated by groupwise embedding are added to deepen the information among spectral features; on the other hand, the fusion of feature information at different scales is fully utilized to improve the network's ability to characterize information at different levels.
(1) We propose a parallel network architecture combining a multi-head attention mechanism and a multiscale residual cascade that effectively mines global association information and feature information at different scales, allowing richer expression of semantic features while fully incorporating spatial location associations, thus achieving better pixel-level image classification with an overall classification accuracy of 97.22%.
(2) Adding the VNIR bands to the commonly used RGB + NIR quad-bands to form an eight-band input allows deeper a priori mining of potentially useful information for the classification of specific land cover in precise study areas.
(3) The features newly generated by grouping spectral data sequences are used directly as input to the Transformer, and the features obtained by the Transformer encoder are combined in parallel with those obtained by the CNN, which simplifies the algorithm structure and shortens processing time while achieving the desired results.
(4) Using the GSE spectral feature module to increase the correlation between different bands, and selecting only half of every two sequences in the Transformer branch for fusion, yields better analysis results.
(5) The dynamic changes in land use in the study area were analyzed, focusing on the changes in the distribution areas of double-cropping rice (Double crop) and forest land during 2017–2021. Changes in land use and land cover visually reflect the interaction between local economic development and biodiversity conservation, and provide strong scientific data support for the construction of high-standard farmland, land remediation projects, and the decision making of relevant departments.

The remainder of this paper is structured as follows: Section 2 describes the study data and the algorithm structure; Section 3 presents the results; Section 4 discusses potential research directions and prospects; Section 5 concludes the study.

2. Materials and Methods

In this section, we first briefly describe the details of the study area and the data processing operations. Then, the optimization of the network structure and framework is described in detail.

2.1. Study Area Overview

Huarong County belongs to Yueyang City, Hunan Province, China (29°10′18″–29°48′27″ N, 112°18′31″–113°1′32″ E), located on the northern border of Hunan Province and the western border of Yueyang City, with obvious geomorphological zoning. A low mountainous hilly area lies to the northeast, interspersed with valley plains; hilly areas also make up the south-central region, while the rest of Huarong is characterized by lake-dotted plains with extremely well-developed water systems. The abundant water resources provide favorable conditions for agriculture and aquaculture in the study area, with crayfish being the main aquatic species. Ponds are a prominent feature of the study area (shown in Figure 1), and the precise mapping of pond area, together with the analysis of area changes in recent years, plays a significant role in promoting the development of local aquaculture.
Anxiang County is part of Changde City, Hunan Province, in the northwestern part of Dongting Lake, featuring similar types of landforms to Huarong County, as well as a rich aquatic and crop environment.

2.2. Study Area Image Preprocessing

Field surface-cover sample data are an important source of remote sensing image samples for training and accuracy verification, and their quality directly determines classification quality. To obtain the feature classes of the study region, we visited the site on 6 June 2021 and sampled field data to obtain an accurate picture of the actual feature classes. Multispectral images covering the study area were acquired using the Sentinel-2 toolbox at 10 m resolution; the satellite carries a multispectral instrument (MSI) with a total of 13 spectral bands, making it a suitable data source. Band 4, Band 5, Band 6, Band 7, Band 8, and Band 8A were selected for band synthesis, and regions of interest (ROIs) were then marked according to a priori knowledge and field data collection to build the sample database of Huarong County, with a total of 19,059 samples in 12 categories (80% for training and 20% for testing). In addition, data for Anxiang County, Changde, were obtained, with 15,920 samples in 12 categories (80% for training and 20% for testing).
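A minimal sketch of this sampling protocol might look as follows, assuming the selected bands have already been stacked into an (H, W, B) array and the ROI labels rasterized; the file names and array layouts are hypothetical, not the authors' released data.

```python
# Hedged sketch: stack the selected Sentinel-2 bands into per-pixel spectra,
# extract the ROI-labeled pixels, and split them 80/20 for training/testing.
import numpy as np
from sklearn.model_selection import train_test_split

bands = np.load("huarong_bands.npy")    # hypothetical (H, W, B) band stack
roi_mask = np.load("huarong_roi.npy")   # hypothetical (H, W) labels, 0 = unlabeled

rows, cols = np.nonzero(roi_mask)       # keep only labeled pixels
X = bands[rows, cols, :].astype(np.float32)
y = roi_mask[rows, cols] - 1            # class IDs 0..11

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(f"{len(y)} samples -> {len(y_train)} train / {len(y_test)} test")
```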

2.3. The Architecture Proposed in this Paper MARC-Net

The proposed parallel network architecture, MARC-Net, which fuses the multi-head attention mechanism with a multiscale residual cascade, is shown in Figure 2. The structure mainly consists of a CNN, the GSE module, an activation function, a conv1×1 layer, and a Transformer. Firstly, the GSE module is used to generate the grouped spectral embedding. The grouped spectral embeddings are flattened into 1D spectral sequences before being fed into a fully connected layer; according to the input requirements of the CNN, the linear transformation of the fully connected layer converts the 1D sequences into 3D spectral feature matrices. The features newly generated by grouped spectral embedding are used directly as the Transformer input, and the features obtained by the Transformer encoder are merged with those obtained by the CNN. The merged features are reduced in dimensionality by the fully connected layer; finally, the classification results are obtained using the activation function and the conv1×1 layer, realizing pixel-level multispectral image classification.
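The following PyTorch sketch illustrates the parallel structure just described. It is a structural illustration under stated assumptions (group size 2, five encoder blocks, 12 classes), not the authors' released implementation, and the class and variable names are our own.

```python
# Hedged structural sketch of the parallel design: a transformer branch and a
# CNN branch process the spectral input in parallel, and their features are
# fused before a fully connected reduction and a 1x1-convolution classifier.
import torch
import torch.nn as nn

class MARCNetSketch(nn.Module):
    def __init__(self, n_bands=8, dim=64, n_classes=12):
        super().__init__()
        self.gse = nn.Linear(2, dim)                   # groups of 2 neighboring bands
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=5)
        self.fc_in = nn.Linear(n_bands, 256)           # 1D sequence -> reshaped 3D
        self.cnn = nn.Sequential(                      # stand-in for the cascade
            nn.Conv2d(16, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.reduce = nn.Linear(dim + 64, dim)         # FC dimensionality reduction
        self.classifier = nn.Conv2d(dim, n_classes, kernel_size=1)

    def forward(self, x):                              # x: (B, n_bands) pixel spectra
        tokens = self.gse(x.view(x.size(0), -1, 2))    # (B, 4 groups, dim)
        t_feat = self.encoder(tokens).mean(dim=1)      # (B, dim)
        c = self.fc_in(x).view(x.size(0), 16, 4, 4)    # reshape to (16, 4, 4)
        c_feat = self.cnn(c).flatten(1)                # (B, 64)
        fused = self.reduce(torch.cat([t_feat, c_feat], dim=1))
        return self.classifier(fused[:, :, None, None]).squeeze(-1).squeeze(-1)

logits = MARCNetSketch()(torch.randn(32, 8))           # (32, 12) class scores
```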

2.3.1. Groupwise Spectral Embedding

To explore the impact of using different bands, the GSE module was introduced from SpectralFormer. Hong et al. [27] concluded that spectral information at different locations reflects absorption properties at different wavelengths, and capturing the local detail changes of these spectral features is key to classification. Although multispectral images have relatively few bands compared to hyperspectral images, enhancing the correlation between bands is still necessary. For the input 1D pixel sequence $x = [x_1, x_2, x_3, \ldots, x_m] \in \mathbb{R}^{1 \times m}$, the Transformer input $A$ is calculated from Equation (1):

$$A = wx \qquad (1)$$

where $w \in \mathbb{R}^{d \times 1}$ denotes a linear transformation applied identically to all bands of the spectral sequence, and $A \in \mathbb{R}^{d \times m}$ collects the output features. GSE is given by Equation (2):

$$\dot{A} = WX \qquad (2)$$

where $W \in \mathbb{R}^{d \times n}$ denotes a linear transformation, $X \in \mathbb{R}^{n \times m}$ represents the spectral features, and $n$ is the number of adjacent bands. Without GSE, the 1D pixel sequence is split into eight 1 × 1 sequences; with GSE, eight $d \times 1$ sequences are generated in the CNN branch depending on the number of neighboring bands. The best value, $d = 2$, is chosen according to the experimental accuracy and the accuracy of the prediction map. A further improvement is made by using only half of every two sequences in the second (Transformer) branch for fusion, which achieves a better analysis, as shown in Figure 3.
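A minimal sketch of the groupwise embedding of Equation (2) is shown below; the right-padding used to keep one token per band is our assumption, and `groupwise_spectral_embedding` is an illustrative name.

```python
# Hedged sketch of groupwise spectral embedding (Eq. (2)): each band is
# embedded together with its n - 1 neighbors through a shared linear map,
# in contrast to the bandwise projection of Eq. (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

def groupwise_spectral_embedding(x, weight, n=2):
    """x: (B, m) spectra; weight: nn.Linear(n, d); returns (B, m, d) tokens."""
    pad = n - 1
    xp = F.pad(x, (0, pad))                          # pad right so every band has n neighbors
    groups = xp.unfold(dimension=1, size=n, step=1)  # (B, m, n) neighboring-band groups
    return weight(groups)                            # shared W in R^{d x n}

x = torch.randn(4, 8)                                # 4 pixels, 8 bands
tokens = groupwise_spectral_embedding(x, nn.Linear(2, 64), n=2)
print(tokens.shape)                                  # torch.Size([4, 8, 64])
```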

2.3.2. Multi-Scale Residual Cascaded Convolutional Networks

In image processing, 2D CNNs have a very wide range of uses; however, for unprocessed 1D sequences, only a 1D CNN can be used for classification. To effectively apply a 2D CNN, it is common practice to reshape 1D sequences into multidimensional feature matrices or to take patches around central pixels as samples. In this paper, we use a fully connected (FC) layer to transform the dimensions and reshape the sequence into the required input:

$$y_i = w x_i + b \qquad (3)$$

where $w$ and $b$ represent the weight matrix and bias vector of the fully connected layer, respectively, $x_i$ denotes the input 1D pixel sequence, and $y_i \in \mathbb{R}^{1 \times 1 \times m}$, where $m$ is the output dimension of the fully connected layer. $y_i$ is reshaped into $y_i^{\prime} \in \mathbb{R}^{n \times n \times m/n^2}$, with $m = 256$ and $n = 8$. The structure of the multiscale residual cascaded convolutional network is shown in Figure 4. The first convolutional layer of BaseNet, comprising 64 convolutional kernels of size 1 × 1, raises the dimensionality of the input samples, while the second convolutional layer comprises 64 convolutional kernels of size 3 × 3. The first convolutional layer is connected to the next layer by a residual connection to form a cascaded convolutional network, and a 2 × 2 average pooling layer is inserted after the residual connection to compress the feature vectors output by the convolutional layers, minimizing feature loss and redundancy. Pooling the shallow feature output and combining it with deeper features forms a DenseNet-like structure, which reduces the redundancy of feature information, mitigates vanishing gradients, and achieves multiscale feature fusion. The average-pooled features are compressed into vector form and fused with the Transformer features. $y_i^{\prime} \in \mathbb{R}^{n \times n \times m/n^2}$ is the input to BaseNet, with $n = 8$ and $m = 64$; the BaseNet output is 1 × 1 × 64.
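The following is a hedged PyTorch sketch of this branch, following the (4, 4, 16) reshape and the 64-kernel convolutions reported in Section 3; the activation choices and the exact skip pattern are assumptions.

```python
# Hedged sketch of the multiscale residual cascaded branch: an FC layer
# reshapes the 1D spectrum into a 3D feature map, a 1x1 and a 3x3 convolution
# are cascaded with a residual connection, and 2x2 average pooling compresses
# the result into vector form for fusion with the Transformer features.
import torch
import torch.nn as nn

class ResidualCascade(nn.Module):
    def __init__(self, n_bands=8, fc_dim=256, n=4, channels=64):
        super().__init__()
        self.n, self.c_in = n, fc_dim // (n * n)                   # reshape target
        self.fc = nn.Linear(n_bands, fc_dim)                       # Eq. (3): y = wx + b
        self.conv1 = nn.Conv2d(self.c_in, channels, 1)             # 64 1x1 kernels
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # 64 3x3 kernels
        self.pool = nn.AvgPool2d(2)
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (B, n_bands)
        f = self.fc(x).view(-1, self.c_in, self.n, self.n)
        h1 = self.relu(self.conv1(f))
        h2 = self.relu(self.conv2(h1))
        fused = self.pool(h1 + h2)                       # residual skip, then 2x2 AVG pool
        return fused.flatten(1)                          # vector form for feature fusion

out = ResidualCascade()(torch.randn(32, 8))
print(out.shape)                                         # torch.Size([32, 256])
```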

2.3.3. Multi-Head Attention

Transformers have also recently shown good performance in image classification and semantic segmentation [29,43]. The core of the Transformer is its attention mechanism and positional encoding. Owing to the self-attention mechanism and positional encoding, the Transformer is superior to RNNs in sequence-to-sequence conversion, a major advantage when working with sequence data. The Transformer structure not only captures the interactions between different spatial positions for learning but also solves the long-term dependency problem between input and output. Moreover, its capacity for parallel computing greatly reduces the consumption of computational resources. In this paper, the Transformer encoder (shown in Figure 3) was used, with each encoder layer consisting of a multi-head attention module (norm, multi-head attention, and dropout) and a forward propagation layer (norm and feed-forward). The Transformer uses fixed positional encoding to represent the absolute position information of each token. The core multi-head attention structure parallelizes attention: rather than computing attention once, it computes attention many times in parallel using scaled dot-product attention, which fully considers the interrelationships among all spatial locations and can simultaneously attend to information in multiple subspaces, as shown in Figure 5.
First, the sequence data $x = [x_1, x_2, \ldots, x_m]$, $x_i \in \mathbb{R}^{1 \times 1}$, $i = 1, 2, \ldots, m$, are entered, where $m$ is the number of bands. The input embedding maps each element to a new $x_i \in \mathbb{R}^{1 \times dim}$, which is then passed through three trainable transformation matrices to obtain the corresponding query ($Q = [q_0, q_1, \ldots, q_m]$, $q_i \in \mathbb{R}^{1 \times dim}$), key ($K = [k_0, k_1, \ldots, k_m]$, $k_i \in \mathbb{R}^{1 \times dim}$), and value ($V = [v_0, v_1, \ldots, v_m]$, $v_i \in \mathbb{R}^{1 \times dim}$). Next, $Q$ is matched with each $K$ via a dot product; because large inner-product values make the gradients after Softmax small, the dot products are divided by $\sqrt{d_k}$ to compute the correlation, where $d_k$ is the length of the vector $K$. Softmax is applied to each row separately to obtain the weights for each $V$; Softmax can be considered a special case of the nonlinear function defined by the exponential function and the Gaussian projection. The weighted sum is then computed to obtain the final attention value. The self-attention mechanism can be expressed as Equation (4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (4)$$

In practice, the multi-head attention module is used similarly to self-attention: the same steps obtain $Q$, $K$, and $V$ through the weight matrices, after which $Q$, $K$, and $V$ are evenly divided according to the number of heads $h$. The results of the heads are then concatenated, and the fusion of the multi-head attention mechanism can be expressed as Equations (5) and (6):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \qquad (5)$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}) \qquad (6)$$
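Equations (4)–(6) can be realized as follows; this is the generic multi-head attention formulation, not the paper's exact code, and all names are illustrative.

```python
# Hedged sketch of Eqs. (4)-(6): scaled dot-product attention computed h
# times in parallel on evenly split Q, K, V, then concatenated and projected.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v            # row-wise softmax weights V

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.wo = nn.Linear(dim, dim)                    # output projection W^O

    def forward(self, x):                                # x: (B, tokens, dim)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.h, self.d).transpose(1, 2)
        heads = scaled_dot_product_attention(split(self.wq(x)),
                                             split(self.wk(x)),
                                             split(self.wv(x)))
        concat = heads.transpose(1, 2).reshape(B, T, -1)  # Concat(head_1..head_h)
        return self.wo(concat)

y = MultiHeadAttention()(torch.randn(2, 8, 64))
print(y.shape)                                            # torch.Size([2, 8, 64])
```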

2.3.4. Evaluation Metrics and Loss Functions

The following pixel-level metrics are commonly used to evaluate pixel classification in remote sensing imagery and are used here to assess the ability of the proposed model to classify the data accurately: $OA$, $AA$, and $Kappa$.
(1) $OA$: the number of correct predictions as a percentage of all predictions:

$$OA = \frac{T_0 + T_1}{T_0 + T_1 + F_1 + F_2}$$

where $T_0$ denotes true positives, $F_1$ false positives, $T_1$ true negatives, and $F_2$ false negatives.
(2) $AA$: the proportion of correctly predicted positives among all samples predicted as positive:

$$AA = \frac{T_0}{T_0 + F_1}$$
(3) $Kappa$ coefficient: a measure of classification accuracy:

$$Kappa = \frac{T_A - T_B}{1 - T_B}$$

where $T_A$ is the overall accuracy and $T_B$ is the chance agreement. Even for two completely independent variables, the agreement will not be 0; a chance component still makes the two variables agree, so this chance agreement must be factored out.
The output of a multiclass task is the probability of the target belonging to each category, and the probabilities of all categories sum to 1; the category with the highest probability is selected. Therefore, for multi-target classification, the extracted features are passed through the softmax function to output per-class probabilities, and cross-entropy is chosen as the loss function. The cross-entropy loss compares the predicted class of each pixel with the target class:

$$L = -\frac{1}{N}\sum_{l}\sum_{x=1}^{M} Y_{lx}\,\log P_{lx}$$

where $M$ is the total number of classes; $Y_{lx}$ is an indicator that equals 1 if the true category of sample $l$ is $x$ and 0 otherwise; and $P_{lx}$ is the predicted probability that sample $l$ belongs to category $x$.
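A small sketch of these metrics computed from a confusion matrix is given below; AA is implemented as the mean per-class accuracy, which is the conventional multiclass generalization of the binary formula above.

```python
# Hedged sketch of the evaluation metrics: OA, AA, and Kappa computed from a
# confusion matrix built over true and predicted class labels.
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                          # correct / all predictions
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))            # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / cm.sum()**2  # chance agreement T_B
    kappa = (oa - pe) / (1 - pe)                          # (T_A - T_B) / (1 - T_B)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([0, 1, 1, 2], [0, 1, 2, 2], n_classes=3)
print(f"OA={oa:.3f}  AA={aa:.3f}  Kappa={kappa:.3f}")
```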

2.3.5. Experimental Environment

In the experiments, ENVI software was used to mark the sample regions of interest according to prior knowledge and field sampling, and the data were normalized. For training, the Adam optimizer was chosen, the batch size was set to 32, the maximum number of iterations was 300, and the initial learning rate was 0.0005. All code was written in Python 3.9 with PyTorch 1.12.1, and the model was trained on Windows 11 with a 12th Gen Intel(R) Core(TM) i5-12400F CPU and an NVIDIA GeForce RTX 3060 Laptop GPU.
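The reported configuration corresponds to a training loop along the following lines; the placeholder model and data stand in for MARC-Net and the sample database, and everything except the stated hyperparameters is an assumption.

```python
# Hedged sketch of the training configuration reported above (Adam optimizer,
# batch size 32, 300 epochs, initial learning rate 0.0005).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 12))   # placeholder
X_train, y_train = torch.randn(1024, 8), torch.randint(0, 12, (1024,))  # placeholder

loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss()                   # softmax + cross-entropy loss

for epoch in range(300):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)             # compare predictions to targets
        loss.backward()
        optimizer.step()
```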

3. Results

The data for the research area are shown in Table 1, of which 80% were used for training and 20% were used for testing. Experiments were conducted in 2021 in the study areas of Huarong and Anxiang using the data in Table 1.
To highlight the performance improvement achieved by this model, this paper compares MARC-Net with SVM [10], KNN [13], CNN [26], RNN [25], ViT [30], and CAF [28]. The following parameters were set for MARC-Net: two adjacent bands, with five encoder blocks in the ViT module. The multiscale residual cascaded convolutional network module contained a fully connected layer with an input dimension of 8 and an output dimension of 256, reshaped into (4, 4, 16) features; the residual connections contained 64 1 × 1 convolutional kernels in the first convolution layer and 64 3 × 3 convolutional kernels in the second, followed by a 2 × 2 average pooling layer and a skip connection from the first-layer features to the second layer. The DenseNet cascaded convolutional network was connected to a separate fully connected layer with an input dimension of 8 and an output dimension of 64, in parallel with ViT, to achieve feature fusion.

3.1. Ablation Study

To assess the properties of the proposed algorithm, ablation experiments were conducted; the results are displayed in Table 2. The OA of ViT was 95.08%, already a very good classification result, showing that ViT is well suited to remote sensing image classification. The OA of FC + CNN improved to 95.23%, indicating the feasibility of the FC + CNN module, which reshapes the pixel spectral sequence to the required input feature size through the fully connected layer to achieve accurate image classification. The OA of ViT with the GSE module inserted improved by 1.19% over the original ViT, and AA and Kappa also improved slightly, indicating that GSE alone can improve ViT's classification of some categories of multispectral data. Fusing the 1D spectral sequences, passed through the fully connected layer, with ViT's features achieved an AA of 96.37%, a 1.29% improvement over ViT, while Kappa improved by 1.52%, verifying that the parallel network (FC + ViT) is superior to ViT for the multispectral pixel classification task. When ViT and CNN were processed in parallel before fusing features, the OA improved to 96.84%, and adding the GSE module to the fully connected layer further improved the OA over FC + CNN. With all modules combined, the accuracy of MARC-Net reached 97.22%, a 2.14% improvement in OA over the original ViT, with good improvements in AA and Kappa as well.
The classification results of the different modules are shown in Figure 6. In Figure 6a, the enlarged box contains water, rapeseed, wetland, etc. Comparing Figure 6b,c, we can see that FC + CNN classified water better than ViT. In Figure 6b,d,e,g, where the network structure was not connected in parallel with FC + CNN, the classification of pond was not as good as in Figure 6f,h. The complete MARC-Net was more discriminative than any combination of its modules. For the most difficult classification task in this region, i.e., the mixing of pond and water, the complete MARC-Net had the best performance and a good denoising effect.
To explore the effect of the number of neighboring spectral bands on MARC-Net and ViT, experiments were conducted with 1–4 neighboring bands; the results are shown in Table 3. For ViT, AA, OA, and Kappa were highest with three neighboring bands; for MARC-Net, the overall results were best with two. These results show that the ideal number of neighboring spectral bands depends on the model; therefore, we chose two neighboring bands in subsequent experiments.
To evaluate the influence of different sample proportions on classification accuracy, we selected 10%–80% of the Huarong County samples for training tests (since the previous experimental results were obtained at an 80% sample proportion, that test is not repeated here). The results are listed in Table 4. Increasing the number of training samples did not necessarily improve AA, and noise was also randomly introduced. Because of the randomness of sample selection, there was a certain risk of obvious misclassification, such as misclassifying water as pond. Therefore, to obtain better experimental accuracy, we chose an 8:2 sample ratio for the experiments.
The results of training samples using different scales are shown in Figure 7.
Since Sentinel-2 provides a total of 13 spectral bands but other satellites do not necessarily carry as many, researchers typically choose the four universal bands, RGB + NIR, as the pixel sequence information for experimental studies to increase the generality of the dataset. To determine whether it is worthwhile to choose only these four bands from the fast-processing and widely applicable Sentinel-2, discarding the a priori potentially useful information it records, and whether using more Sentinel-2 bands helps to classify specific land cover in a precise study area, we added the VNIR bands. The bands used for the experiments and the test results are listed in Table 5. When all VNIR bands were added, a significant improvement was obtained, with very good results for all classes except wetland.
The results of the two algorithms using different bands are shown in Figure 8.
To evaluate whether the parallel multiscale residual cascade reduces processing time while maintaining algorithmic superiority, we compared it against ViT and CNN connected in series. Table 6 shows the time efficiency, as well as the overall OA, AA, and Kappa results, for both network structures in the Huarong County study area.
To evaluate the impact of adding NDVI alone, NDWI alone, or both NDVI and NDWI to our proposed algorithm, we conducted further experiments; Table 7 compares the results with different vegetation indices. The experimental data show that accuracy decreases when these indices are added as features, probably because NDVI and NDWI can discriminate only specific ground-cover types. In some scenarios they may not be optimal features, and other features need to be considered to improve model accuracy; the size, structure, and distribution of the dataset, as well as the correlations between features, must also be considered.
The algorithm results using different vegetation indices are shown in Figure 9. The figure shows that background pixels are misclassified as lotus, and the classification in Figure 9a, with only the NDVI band added, is better than with NDWI or with both vegetation indices added, which again verifies that NDVI and NDWI may not be the best features for all scenes.

3.2. Multi-Method Comparison

Table 8 and Table 9 show the per-category accuracies and the overall OA, AA, and Kappa results for the different methods in the study areas of Huarong and Anxiang counties. CNN was the least effective method, with the lowest results for all three evaluation metrics; the accuracies for building and pond were only 64.49% and 64.45%, and that for single did not reach 60%. This may be because multispectral images have few bands (only four here), making it difficult for a CNN to extract features well, whereas on hyperspectral images with around 200 bands of information CNN outperforms SVM and KNN. The OA of the traditional classifiers SVM and KNN was 89.09% and 92.41%, respectively. RNN, ViT, CAF, and MARC-Net are all deep-learning-based spectral sequence classification methods, showing the advantages of deep learning in sequence data processing. The classification performance of ViT was good, and CAF had the best classification ability for vegetables. The evaluation indices of MARC-Net were higher than those of the other algorithmic models, performing better in various categories such as building, greenhouse, and lotus.
The classification maps of Huarong County obtained using the various methods are shown in Figure 10. The red box in Figure 11 contains the six main categories of building, tree, lotus, rapeseed, water, and single (single cropping of rice). SVM treated water entirely as pond, producing confusion. CNN had poor classification results for all regions of the study area and did not easily distinguish between categories with small inter-class differences. MARC-Net could better distinguish between pond and water despite their small inter-class differences and did not misclassify greenhouse, providing clearer results than the other algorithms.
As shown in Figure 11, the classification results of SVM and CNN were very poor. MARC-Net achieved relatively clearer edge extraction of pond and water, with fewer misclassifications. In general, RNN, CAF, and ViT performed well in the classification of multispectral images. The proposed MARC-Net, despite a more complicated model structure than ViT, showed clear improvements in OA, AA, and Kappa, and its prediction maps were clearer than those of the other algorithms, which is of great value.

3.3. Analysis of Land Use Change

In this work, Sentinel-2 images from 17 July 2017 and 4 November 2019 were downloaded from the USGS website for feature classification of the study area. The image data were preprocessed, and the ROIs were then retagged in ENVI using the a priori knowledge accumulated from the 2021 field collection, exporting a txt file containing the ROI coordinates. The training and test sets were divided according to the coordinates in the txt file at a ratio of 8:2. The sample data for 2017 and 2019 are shown in Table 10. To analyze each model more visually, several models from Table 4 were also used to compare and analyze the three images. The MARC-Net proposed in this paper was then used to analyze land-use changes in the Huarong County area.
The classification results of the different algorithms for 2017 and 2019 in Huarong County are shown in Table 11 and Table 12. Across the three years, the OA of MARC-Net was 93.41%, 95.92%, and 97.22%, respectively, higher than the other algorithmic models, with improvements of 3.88%, 1.46%, and 2.14% over the original ViT in 2017, 2019, and 2021. The classification accuracy of the MARC-Net algorithm was very good, successfully meeting the classification needs of field recognition research.
The prediction results of MARC-Net were the best, as shown in the generated maps for 2017, 2019, and 2021; therefore, MARC-Net was used to analyze the dynamics of feature changes in the study area in recent years. The feature identification of the study area in 2017, 2019, and 2021 is shown in Figure 12. Double crop was mainly distributed in the northeast and southwest, at various locations suitable for planting, whereas pond was more concentrated in the north-central region. Trees were mainly distributed in the northeastern part of the study area, surrounded by buildings. Greenhouses were scattered across different areas because much of the study area is covered by lake water. According to the analysis of the experimental results, land use in Huarong County in recent years has been very reasonable.
The land use and the rate of land-use change in Huarong County in recent years are shown in Table 13. The cover type with the largest change across the three images was Double crop, followed by rapeseed and single; the increase in ponds and greenhouses reflects a relative increase in aquaculture and crops. The area of buildings also increased from 2017 to 2021, possibly due to the rapid development of China, whereby even small cities have started to erect tall buildings. The increase in buildings also affected the number of trees in the study area to some extent; however, due to the afforestation policy, tree cover rebounded in the last two years. Crayfish farming grew consistently, as crayfish is valued by the people of Hunan as a local specialty food, helped by the area's natural geographical advantages. Rapeseed (canola) and sunchokes also grew rapidly over these four years; the data show that the study area is in line with the sustainable development strategy.
Figure 13 shows the spatial distribution of forest land and Double crop. The proportion of forest land in the region is relatively high, but it has declined in recent years, which may be related to the sharp rise in housing construction and in planting and aquaculture, or may indicate that forestry planting needs to be further strengthened to fulfill the national policy of returning farmland to forest and protecting nature. Double crop is mostly distributed near rivers. The emergence of Double crop has played a very important role in making full use of natural and labor resources to increase food production; since single-season late rice is difficult to grow owing to its strict short-day sunshine requirements, the local government has expanded Double crop planting alongside agricultural products and aquaculture.
The dynamics of Double crop and forest land from 2017 to 2021 are shown in Figure 14. The arable area of Double crop was 49.9226 km² in 2017, 117.6979 km² in 2019, and 199.4702 km² in 2021, corresponding to increases of 135.7608%, 69.4764%, and 299.5591% in 2017–2019, 2019–2021, and 2017–2021, respectively. The overall increase from 2017 to 2021 was huge, as the advantages of harvesting Double crop twice a year slowly manifested, allowing farmers to gain more revenue, while the climate of the Hunan region is also well suited to its cultivation, leading to increased farming. The forest land area was 125.1143 km² in 2017, 105.6089 km² in 2019, and 76.3699 km² in 2021, corresponding to decreases of 15.59%, 27.69%, and 38.96% in 2017–2019, 2019–2021, and 2017–2021, respectively.
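The change rates in this section follow the standard relative-change formula, as the short check below illustrates for the 2017–2019 Double crop figures.

```python
# Worked example of the change-rate computation used throughout this section,
# rate = (A_end - A_start) / A_start * 100, applied to the Double crop areas
# reported above for 2017 and 2019.
a_2017, a_2019 = 49.9226, 117.6979           # km^2
rate = (a_2019 - a_2017) / a_2017 * 100.0
print(f"2017-2019 change: {rate:+.4f}%")     # +135.7608%, as reported
```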

4. Discussion

Sampled bands should be drawn from a large spectral range so as to make full use of spectral information. The visible near-infrared region satisfies the requirements of providing a sufficient number of spectral bands, with bands as adjacent to each other as possible, and the VNIR bands are very narrow with good resolution; in contrast to the traditional four bands of RGB + NIR, this choice helped to classify the specific land cover of the study area.
In the experiments, a total of seven methods were compared (SVM and KNN, and the DL algorithms CNN, RNN, ViT, CAF, and MARC-Net), showing the advantages of each. The traditional methods were faster to train but less sensitive to sequence information than the RNN and ViT models, which are good at handling sequences. Because the multispectral images taken by the Sentinel-2 satellite are not hyperspectral, their lower information content may explain the poor classification accuracy of the 1D CNN, which did not work as well as it should have. RNNs are good at processing sequential data, unlike CNNs; hence, they perform better than CNNs in multispectral image classification, but they find it difficult to learn the long-term dependence between input and output, as demonstrated by the Transformer. CAF showed good performance on all datasets, outperforming all other models in the classification of multispectral images except MARC-Net. Interestingly, 2D CNNs also seemed to fail to capture spatial information well, probably because of the low dimensionality of the spatial kernel.
Using hyperspectral images instead of multispectral images may yield unexpected results. Although parallel processing can reduce processing time, improving the self-attention mechanism or simplifying the complex algorithmic network could further accelerate data processing. Using recurrence plots, Gramian angular fields (GAFs), the short-time Fourier transform (STFT), Markov transition fields, etc. to convert one-dimensional sequences into two dimensions for processing may also work better.
Graph neural network approaches could be used in later work, representing a sophisticated method of extracting features from graph data; these features could be used for node classification, graph classification, and edge prediction, and to obtain graph embedding representations, which are indeed very versatile. One example is a scalable graph neural network framework capable of learning adaptive receptive paths. Its adaptive path layer has two complementary units: one learns the weights of first-order neighborhood nodes for breadth exploration, and the other extracts and filters the information aggregated in higher-order neighborhoods for depth exploration. It has achieved unexpected results in experiments on both transductive and inductive learning tasks, making it a worthwhile approach for future research.
The proposed MARC-Net, incorporating GSE, a multiscale residual cascaded convolutional network, and the Transformer, performed best on all four datasets. This is because the features extracted by the Transformer and the CNN could be fused by the parallel network without losing feature information, yielding good classification results. A land-use dynamic change analysis was then performed to obtain a clearer picture of land-cover changes in the study area in recent years. The reduction in the area of forest land from 2017 to 2021, especially from 2017 to 2019, was predictable for Hunan, a land of fish and rice, where the local government allocated more land to construction and aquaculture to accelerate economic income. After the decline in the first two years, tree cover rebounded in 2019–2021, probably because the national government aims to achieve a reasonable, balanced distribution of resources as part of common development, with the local government correspondingly implementing the national policy of returning farmland to forest.

5. Conclusions

In the method proposed in this study, the parallel fusion of ViT with a CNN allows the feature information obtained by both methods to be better utilized for accurate image classification, largely compensating for the multispectral band information we discarded. We proposed a parallel network architecture fusing a multi-head attention mechanism with a multiscale residual cascade for land-use classification. The parallel architecture first introduces the GSE module to generate grouped spectral embeddings, increasing the connections linking local information. A multiscale residual cascaded CNN is designed to fully utilize the fused feature information at different scales to achieve pixel-level classification of remote sensing images. Lastly, the feature information obtained from both branches is fused, the merged features are reduced in dimensionality through fully connected layers, and the classification results are obtained using the activation function and a conv1×1 layer to achieve pixel-level classification. In the future, we will further process the data input to improve the informational diversity of its features. Another aspect worth investigating is the design of new, enhanced network architectures that can be used in any scenario to improve performance while reducing computational complexity, thus increasing efficiency and applicability. As for forest land, although tree cover has rebounded in the last two years, the overall trend in the study area since 2017 is still downward, and the situation requires further attention.

Author Contributions

Conceptualization, X.F. and X.L.; methodology, X.L. and X.F.; software, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; validation, X.L., X.F. and J.F.; formal analysis, X.L., X.F. and C.Y.; investigation, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; resources, J.F. and X.F.; data curation, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C.; writing—original draft preparation, X.L.; writing—review and editing, X.L., X.F., C.Y., J.F., L.Y., N.W. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grants 62261004 and 62001129).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code will be available at https://github.com/lixuyaaaaa/MARC-Net accessed on 27 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MARC-Net   multiple attention mechanisms and multi-scale residual cascades network
RS         Remote Sensing
GAF        Gramian Angular Field
Single     single cropping of rice
STFT       Short-Time Fourier Transform
HSI        Hue-Saturation-Intensity
GSE        Groupwise Spectral Embedding
CNN        Convolutional Neural Network
RSI        Remotely Sensed Imagery
RNN        Recurrent Neural Network
CAF        SpectralFormer
FC         Fully Connected layer
ML         Machine Learning
DL         Deep Learning
ViT        Vision Transformer
SVM        Support Vector Machine
RF         Random Forest
CN         Class Number
KNN        K-Nearest Neighbor
VNIR       Visible Near-Infrared
NLP        Natural Language Processing
Double     double cropping of rice
MSI        MultiSpectral Instrument
ROI        Region Of Interest
RBF        Radial Basis Function

References

  1. Sawant, S.; Mohite, J.; Sakkan, M.; Pappula, S. Near real time crop loss estimation using remote sensing observations. In Proceedings of the 2019 8th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Istanbul, Turkey, 16–19 July 2019; pp. 1–5.
  2. Figa, J.; Stoffelen, A. On the assimilation of Ku-band scatterometer winds for weather analysis and forecasting. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1893–1902.
  3. Xu, J.; Wang, X.; Zhu, X.; Cui, C.; Liu, P.; Li, B. Research on marine radar oil spill network monitoring technology. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 1868–1871.
  4. Haihui, H.; Yilin, W.; Zhuan, Z.; Guangli, R.; Min, Y. Extraction of Altered Mineral from Remote Sensing Data in Gold Exploration Based on the Nonlinear Analysis Technology. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 19–20 August 2018; pp. 1–8.
  5. Jiang, J.; Zhang, Q.; Yao, X.; Tian, Y.; Zhu, Y.; Cao, W.; Cheng, T. HISTIF: A new spatiotemporal image fusion method for high-resolution monitoring of crops at the subfield level. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4607–4626.
  6. Calvin, W.M.; Pace, E.L.; Davies, G.E.; Pearson, N.C. HyspIRI for energy and mineral resource exploration, applications, and impacts. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 80–82.
  7. Labbassi, K.; Tajdi, A.; Er-Raji, A. Remote sensing and geological mapping for a groundwater recharge model in the arid area of Sebt Rbrykine: Doukkala, western Morocco. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; Volume 1, pp. 1–112.
  8. Luo, H.; Ye, B.; Zhang, Y. Study of Urban Landuse Evaluation of the comprehensive planning—Nanjing city as a case. In Proceedings of the 2011 International Conference on Remote Sensing, Environment and Transportation Engineering, Nanjing, China, 24–26 June 2011; pp. 4456–4459.
  9. Liang, S.; Li, X.; Wang, J. Atmospheric correction of optical imagery. Adv. Remote Sens. 2012, 117.
  10. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790.
  11. Ayerdi, B.; Romay, M.G. Hyperspectral image analysis by spectral–spatial processing and anticipative hybrid extreme rotation forest classification. IEEE Trans. Geosci. Remote Sens. 2015, 54, 2627–2639.
  12. Lin, T.H.; Li, H.T.; Tsai, K.C. Implementing the Fisher’s Discriminant Ratio in a k-Means Clustering Algorithm for Feature Selection and Data Set Trimming. J. Chem. Inf. Comput. Sci. 2004, 44, 76–87.
  13. Alimjan, G.; Sun, T.; Liang, Y.; Jumahun, H.; Guan, Y. A new technique for remote sensing image classification based on combinatorial algorithm of SVM and KNN. Int. J. Pattern Recognit. Artif. Intell. 2018, 32, 1859012.
  14. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  15. Zhang, L.; Zhang, L.; Du, B. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
  16. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
  17. Cheng, G.; Sun, X.; Li, K.; Guo, L.; Han, J. Perturbation-seeking generative adversarial networks: A defense framework for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11.
  18. Lanthier, Y.; Bannari, A.; Haboudane, D.; Miller, J.R.; Tremblay, N. Hyperspectral data segmentation and classification in precision agriculture: A multi-scale analysis. In Proceedings of the IGARSS 2008—2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA, 6–11 July 2008; Volume 2, pp. II-585–II-588.
  19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  20. Zhang, W.; Tang, P.; Zhao, L.; Huang, Q. A comparative study of U-nets with various convolution components for building extraction. In Proceedings of the 2019 Joint Urban Remote Sensing Event (JURSE), Vannes, France, 22–24 May 2019; pp. 1–4.
  21. Mustafa, N.; Zhao, J.; Liu, Z.; Zhang, Z.; Yu, W. Iron ORE region segmentation using high-resolution remote sensing images based on Res-U-Net. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2563–2566.
  22. Han, L.; Liang, H.; Chen, H.; Zhang, W.; Ge, Y. Convective precipitation nowcasting using U-Net Model. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–8.
  23. Asma, S.B.; Abdelhamid, D.; Youyou, L. U-Net Based Classification For Urban Areas in Algeria. In Proceedings of the 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Tunis, Tunisia, 9–11 March 2020; pp. 101–104.
  24. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394.
  25. Li, J.; Cui, R.; Li, B.; Li, Y.; Mei, S.; Du, Q. Dual 1D-2D spatial-spectral cnn for hyperspectral image super-resolution. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3113–3116.
  26. Khusni, U.; Dewangkoro, H.I.; Arymurthy, A.M. Urban area change detection with combining CNN and RNN from sentinel-2 multispectral remote sensing data. In Proceedings of the 2020 3rd International Conference on Computer and Informatics Engineering (IC2IE), Yogyakarta, Indonesia, 15–16 September 2020; pp. 171–175.
  27. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15.
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
  30. Lan, R.; Li, Z.; Liu, Z.; Gu, T.; Luo, X. Hyperspectral image classification using k-sparse denoising autoencoder and spectral–restricted spatial characteristics. Appl. Soft Comput. 2019, 74, 693–708.
  31. Li, J.; Yuan, Q.; Shen, H.; Meng, X.; Zhang, L. Hyperspectral image super-resolution by spectral mixture analysis and spatial–spectral group sparsity. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1250–1254.
  32. Yang, J.; Zhao, Y.; Chan, J.C.W.; Yi, C. Hyperspectral image classification using two-channel deep convolutional neural network. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 5079–5082.
  33. Yang, H.L.; Crawford, M.M. Exploiting spectral-spatial proximity for classification of hyperspectral data on manifolds. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 4174–4177.
  34. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 3322–3325.
  35. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16.
  36. Ma, L.; Li, M.; Ma, X.; Cheng, L.; Du, P.; Liu, Y. A Review of Supervised Object-Based Land-Cover Image Classification; Elsevier: Amsterdam, The Netherlands, 2017; Volume 130, pp. 277–293.
  37. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ. 2018, 216, 57–70.
  38. Huang, B.; Zhao, B.; Song, Y. Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery. Remote Sens. Environ. 2018, 214, 73–86.
  39. Scott, G.J.; Marcum, R.A.; Davis, C.H.; Nivin, T.W. Fusion of Deep Convolutional Neural Networks for Land Cover Classification of High-Resolution Imagery; IEEE: Manhattan, NY, USA, 2017; Volume 14, pp. 1638–1642.
  40. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. Joint Deep Learning for Land Cover and Land Use Classification; Elsevier: Amsterdam, The Netherlands, 2019; Volume 221, pp. 173–187.
  41. Khaki, S.; Pham, H.; Han, Y.; Kuhl, A.; Kent, W.; Wang, L. Deepcorn: A Semi-Supervised Deep Learning Method for High-Throughput Image-Based Corn Kernel Counting and Yield Estimation; Elsevier: Amsterdam, The Netherlands, 2021; Volume 218, p. 106874.
  42. Cui, Z.; Kerekes, J. Potential of Red Edge Spectral Bands in Future Landsat Satellites on Agroecosystem Canopy Chlorophyll Content Retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 7168–7171.
  43. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
Figure 1. Location of the study area and Sentinel-2 remote sensing image (6 June 2021).
Figure 2. Schematic of the parallel MARC-Net network architecture.
Figure 3. Schematic of groupwise spectral embedding, showing the change of feature embedding. (a) d = 2. (b) d = 1/2.
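As a rough sketch of the groupwise spectral embedding idea in Figure 3, each token embeds a small window of neighboring bands rather than a single band; the number of neighboring bands is also the quantity varied later in Table 3. The `grouped_spectral_embedding` helper, the group size of 3, the edge padding, and all dimensions below are assumptions for illustration, not the exact MARC-Net implementation.

```python
# Illustrative sketch of groupwise spectral embedding (sizes are assumptions).
import numpy as np

def grouped_spectral_embedding(pixel, group_size, W):
    """pixel: (bands,) spectrum of one pixel; W: (group_size, d_model) embedding."""
    bands = len(pixel)
    half = group_size // 2
    # Pad the spectrum at both ends so every band gets a full neighborhood.
    padded = np.pad(pixel, (half, group_size - 1 - half), mode="edge")
    # One group of neighboring bands per token position.
    groups = np.stack([padded[i:i + group_size] for i in range(bands)])
    return groups @ W  # (bands, d_model) token sequence

rng = np.random.default_rng(2)
pixel = rng.random(10)                  # e.g., a 10-band multispectral pixel
W = rng.standard_normal((3, 16)) * 0.1  # embed groups of 3 neighboring bands
print(grouped_spectral_embedding(pixel, 3, W).shape)  # (10, 16)
```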
Figure 4. Schematic diagram of multi-scale residual cascaded convolutional network.
Figure 5. Multi-head attention consists of several attention layers running in parallel.
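To make Figure 5 concrete, the sketch below implements multi-head scaled dot-product attention in plain NumPy: the model width is split into several subspaces ("heads"), each head attends over all token positions in parallel, and the head outputs are concatenated and projected. All dimensions, the random weights, and the `multi_head_attention` helper are illustrative assumptions, not MARC-Net's trained parameters.

```python
# Minimal NumPy sketch of multi-head scaled dot-product attention (Figure 5).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape                      # n tokens, model width
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split the model dimension into n_heads parallel subspaces.
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    out = softmax(scores) @ V                            # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return out @ Wo                                      # final projection

rng = np.random.default_rng(0)
d_model, n_tokens, n_heads = 64, 10, 4
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (10, 64)
```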
Figure 6. Experimental results of MARC-Net algorithm sub-modules in Huarong County. (a) Image. (b) ViT. (c) only FC + CNN. (d) MARC-Net (GSE). (e) MARC-Net (FC). (f) MARC-Net (FC + CNN). (g) MARC-Net (GSE + FC). (h) MARC-Net (GSE + FC + CNN).
Figure 7. Classification results for different proportions of training samples. (a) Image. (b) 10%. (c) 20%. (d) 30%. (e) 40%. (f) 50%. (g) 60%. (h) 70%.
Figure 8. ViT results in different bands: (a) 4 bands; (b) 4 bands + VNIR. MARC-Net results in different bands: (c) 4 bands; (d) 4 bands + VNIR.
Figure 9. Algorithm results using different vegetation indices: (a) MARC (NDVI); (b) MARC (NDWI); (c) MARC (NDVI + NDWI).
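Figure 9 compares MARC variants that append vegetation/water index channels to the spectral input. The index math below is the standard definition (NDVI from red and near-infrared; McFeeters' NDWI from green and near-infrared); the Sentinel-2 band choices (B3 green, B4 red, B8 NIR) and the array handling are illustrative assumptions rather than the authors' exact preprocessing.

```python
# Hedged sketch: deriving NDVI and NDWI channels from Sentinel-2 reflectance.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    # NDVI = (NIR - Red) / (NIR + Red)
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    # NDWI = (Green - NIR) / (Green + NIR), McFeeters' water index
    return (green - nir) / (green + nir + eps)

# Toy reflectance patches standing in for Sentinel-2 B3/B4/B8 rasters.
rng = np.random.default_rng(1)
b3, b4, b8 = (rng.uniform(0.0, 0.5, size=(4, 4)) for _ in range(3))
extra_channels = np.stack([ndvi(b8, b4), ndwi(b3, b8)], axis=0)
print(extra_channels.shape)  # (2, 4, 4), appended to the spectral input
```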
Figure 10. Classification results of different algorithms in Huarong County. (a) Image. (b) SVM. (c) KNN. (d) CNN. (e) RNN. (f) ViT. (g) CAF. (h) MARC-Net.
Figure 11. Classification results of different algorithms in Anxiang County. (a) Image. (b) SVM. (c) KNN. (d) CNN. (e) RNN. (f) ViT. (g) CAF. (h) MARC-Net.
Figure 12. Original images of the Huarong area for three years: (a) 2017, (b) 2019, and (c) 2021; classification distribution maps of the Huarong area for three years: (d) 2017, (e) 2019, and (f) 2021.
Figure 13. Distribution of Forest land in Huarong County in (a) 2017, (b) 2019, and (c) 2021; distribution of Double crop in Huarong County in (d) 2017, (e) 2019, and (f) 2021.
Figure 14. Dynamics of Forest land in 2017–2019 (a), 2019–2021 (b), and 2017–2021 (c); dynamics of Double crop in 2017–2019 (d), 2019–2021 (e), and 2017–2021 (f).
Table 1. Sample size of the study area 2021 data set.

No. | Class | Huarong (Training) | Huarong (Testing) | Anxiang (Training) | Anxiang (Testing)
1 | Building | 873 | 219 | 808 | 203
2 | Tree | 1107 | 277 | 1010 | 253
3 | Water | 3160 | 790 | 2445 | 612
4 | Greenhouse | 1035 | 259 | 526 | 132
5 | Lotus | 1000 | 251 | 754 | 189
6 | Pond | 825 | 207 | 1563 | 391
7 | Wetland | 1386 | 347 | 1540 | 385
8 | Vegetable | 743 | 186 | 798 | 200
9 | Rapeseed | 1925 | 482 | 811 | 203
10 | Crayfish | 1537 | 385 | 854 | 214
11 | Single | 746 | 187 | 1096 | 274
12 | Double crop | 812 | 320 | 527 | 132
Total | | 15,149 | 3910 | 12,732 | 3188
Table 2. Experimental ablation results of MARC-Net on the study area dataset using different combinations of modules.

Method | GSE | FC | CNN | OA (%) | AA (%) | Kappa
ViT | × | × | × | 95.08 | 94.31 | 94.49
only FC + CNN | × | × | × | 95.23 | 94.22 | 94.66
MARC-Net | ✓ | × | × | 96.27 | 95.91 | 95.83
MARC-Net | × | ✓ | × | 96.37 | 95.83 | 95.94
MARC-Net | × | ✓ | ✓ | 96.84 | 96.47 | 96.47
MARC-Net | ✓ | ✓ | × | 96.90 | 96.66 | 96.53
MARC-Net | ✓ | ✓ | ✓ | 97.22 | 96.82 | 96.89
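The OA, AA, and Kappa columns reported in Tables 2–12 follow the standard confusion-matrix definitions (the tables show all three scaled by 100). Below is a minimal sketch of these metrics with toy labels; the `classification_metrics` helper and the example arrays are illustrative, not tied to the paper's data.

```python
# Overall accuracy (OA), average accuracy (AA), and Cohen's kappa
# computed from a confusion matrix built out of toy labels.
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                  # rows: truth, cols: prediction
    oa = np.trace(cm) / cm.sum()                       # fraction of correct pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean per-class recall
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])
print(classification_metrics(y_true, y_pred, n_classes=3))
```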
Table 3. Effect of the number of neighboring bands on ViT and MARC-Net.

Method | Metric | 1 band | 2 bands | 3 bands | 4 bands
Transformer (ViT) | OA (%) | 95.08 | 96.27 | 96.55 | 95.62
Transformer (ViT) | AA (%) | 94.31 | 95.91 | 96.16 | 95.03
Transformer (ViT) | Kappa | 94.49 | 95.83 | 96.14 | 95.10
MARC-Net | OA (%) | 96.84 | 97.22 | 95.81 | 96.93
MARC-Net | AA (%) | 96.47 | 96.82 | 95.57 | 96.00
MARC-Net | Kappa | 96.47 | 96.89 | 95.32 | 95.56
Table 4. MARC-Net test results for various training-sample proportions in Huarong County, 2021.

Class No. | 10% | 20% | 30% | 40% | 50% | 60% | 70%
1 | 96.33 | 99.08 | 98.77 | 97.47 | 99.81 | 99.38 | 99.47
2 | 87.68 | 93.84 | 91.32 | 95.11 | 94.36 | 97.71 | 95.35
3 | 99.74 | 99.49 | 99.66 | 99.81 | 99.79 | 99.91 | 99.89
4 | 98.44 | 99.61 | 100.00 | 100.00 | 99.84 | 100.00 | 100.00
5 | 100.00 | 100.00 | 100.00 | 100.00 | 99.84 | 100.00 | 100.00
6 | 82.52 | 86.89 | 94.49 | 96.60 | 94.18 | 97.25 | 95.01
7 | 93.06 | 90.17 | 90.94 | 91.05 | 92.03 | 90.18 | 92.16
8 | 94.56 | 97.83 | 98.56 | 96.76 | 98.70 | 99.46 | 98.92
9 | 90.83 | 92.72 | 94.59 | 95.53 | 94.34 | 92.93 | 97.03
10 | 90.10 | 91.14 | 93.05 | 93.75 | 96.98 | 95.14 | 96.57
11 | 75.26 | 73.65 | 78.85 | 77.47 | 82.40 | 87.47 | 80.39
12 | 97.02 | 99.01 | 98.02 | 98.27 | 99.21 | 98.19 | 97.04
OA (%) | 93.33 | 94.48 | 95.49 | 95.89 | 96.47 | 96.60 | 96.72
AA (%) | 92.13 | 93.62 | 94.86 | 95.16 | 95.96 | 96.47 | 95.99
Kappa | 92.53 | 93.81 | 94.95 | 95.40 | 96.05 | 96.20 | 96.32
Table 5. Test results in the 2021 Huarong County data set using different methods in different wavebands.

Class No. | ViT (4 bands) | MARC-Net (4 bands) | ViT (4 bands + VNIR) | MARC-Net (4 bands + VNIR)
1 | 89.92 | 97.25 | 97.02 | 99.77
2 | 88.53 | 90.33 | 95.93 | 97.65
3 | 99.45 | 99.43 | 99.71 | 99.90
4 | 98.78 | 99.42 | 99.90 | 100.00
5 | 96.45 | 98.60 | 99.60 | 99.90
6 | 77.83 | 79.03 | 89.93 | 97.09
7 | 92.91 | 94.30 | 88.16 | 91.63
8 | 89.53 | 93.53 | 95.55 | 96.63
9 | 85.80 | 90.75 | 93.29 | 95.37
10 | 83.79 | 94.59 | 94.33 | 98.17
11 | 74.27 | 74.12 | 80.96 | 86.86
12 | 93.52 | 93.71 | 97.29 | 98.89
OA (%) | 90.72 | 93.57 | 95.08 | 97.22
AA (%) | 89.24 | 92.09 | 94.31 | 96.82
Kappa | 89.60 | 92.80 | 94.49 | 96.89
Table 6. Comparison of the two network structures.

C N. | Series Connection | MARC-Net
1 | 99.54 | 99.77
2 | 94.67 | 97.65
3 | 99.96 | 99.90
4 | 100.00 | 100.00
5 | 99.90 | 99.90
6 | 96.12 | 97.09
7 | 93.14 | 91.63
8 | 99.32 | 96.63
9 | 93.71 | 95.37
10 | 98.95 | 98.17
11 | 88.73 | 86.86
12 | 99.26 | 98.89
OA (%) | 97.20 | 97.22
AA (%) | 96.95 | 96.82
Kappa | 96.87 | 96.89
Time (s) | 65,115.24 | 61,606.42
Table 7. Comparison of adding vegetation indices.

C N. | MARC (NDVI) | MARC (NDWI) | MARC (NDVI + NDWI) | MARC-Net
1 | 99.47 | 99.60 | 99.21 | 99.77
2 | 95.66 | 97.21 | 93.90 | 97.65
3 | 99.92 | 99.92 | 99.81 | 99.90
4 | 100.00 | 99.66 | 99.66 | 100.00
5 | 99.88 | 99.77 | 100.00 | 99.90
6 | 95.01 | 95.70 | 91.68 | 97.09
7 | 90.60 | 91.17 | 92.08 | 91.63
8 | 98.92 | 99.07 | 99.23 | 96.63
9 | 95.48 | 95.60 | 92.04 | 95.37
10 | 97.24 | 96.95 | 96.80 | 98.17
11 | 84.99 | 77.02 | 88.05 | 86.86
12 | 98.87 | 98.73 | 98.45 | 98.89
OA (%) | 96.79 | 96.57 | 96.23 | 97.22
AA (%) | 96.34 | 95.87 | 95.91 | 96.82
Kappa | 96.41 | 96.16 | 95.78 | 96.89
Table 8. Classification results of various methods in Huarong County, 2021.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 89.49 | 84.93 | 64.49 | 94.84 | 97.02 | 99.08 | 99.77
2 | 93.50 | 92.77 | 91.23 | 94.94 | 95.93 | 95.30 | 97.65
3 | 98.60 | 99.36 | 97.68 | 99.58 | 99.71 | 99.81 | 99.90
4 | 97.29 | 98.84 | 92.46 | 99.71 | 99.90 | 99.71 | 100.00
5 | 99.60 | 97.21 | 89.70 | 99.50 | 99.60 | 99.80 | 99.90
6 | 64.73 | 77.29 | 65.45 | 82.30 | 89.93 | 93.69 | 97.09
7 | 88.18 | 94.52 | 85.64 | 89.39 | 88.16 | 92.92 | 91.63
8 | 92.47 | 94.62 | 76.58 | 95.96 | 95.55 | 97.84 | 96.63
9 | 90.66 | 88.79 | 75.94 | 92.00 | 93.29 | 93.87 | 95.37
10 | 88.83 | 86.23 | 75.92 | 91.60 | 94.33 | 97.26 | 98.17
11 | 44.38 | 86.63 | 50.26 | 75.87 | 80.96 | 82.43 | 86.86
12 | 83.25 | 94.08 | 89.03 | 98.52 | 97.29 | 97.78 | 98.89
OA (%) | 89.09 | 92.41 | 82.76 | 93.93 | 95.08 | 96.40 | 97.22
AA (%) | 85.92 | 91.28 | 79.54 | 92.85 | 94.31 | 95.79 | 96.82
Kappa | 87.74 | 91.49 | 80.65 | 93.20 | 94.49 | 95.96 | 96.89
Table 9. Classification results of various methods in Anxiang County, 2021.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 95.79 | 93.59 | 91.21 | 95.79 | 96.16 | 98.51 | 98.76
2 | 95.94 | 96.44 | 79.20 | 95.94 | 96.13 | 98.31 | 97.72
3 | 98.28 | 98.85 | 90.55 | 98.28 | 99.26 | 99.01 | 99.55
4 | 94.86 | 94.69 | 80.60 | 94.86 | 96.19 | 98.09 | 97.90
5 | 96.02 | 89.41 | 75.99 | 96.02 | 97.74 | 98.67 | 94.69
6 | 94.11 | 88.25 | 71.27 | 94.11 | 93.92 | 96.09 | 97.31
7 | 96.62 | 92.20 | 86.62 | 96.62 | 95.38 | 95.71 | 97.27
8 | 83.95 | 78.50 | 66.04 | 83.95 | 82.58 | 87.09 | 90.72
9 | 69.29 | 70.93 | 13.07 | 69.29 | 81.62 | 77.68 | 85.32
10 | 76.22 | 66.35 | 37.11 | 76.22 | 82.20 | 81.96 | 86.53
11 | 85.49 | 76.64 | 83.39 | 85.49 | 85.85 | 89.14 | 88.04
12 | 81.78 | 83.33 | 24.66 | 81.78 | 85.76 | 86.90 | 88.80
OA (%) | 90.94 | 87.70 | 72.18 | 90.94 | 92.45 | 93.51 | 94.68
AA (%) | 89.03 | 85.77 | 66.65 | 89.03 | 91.07 | 92.27 | 93.56
Kappa | 89.89 | 86.28 | 68.94 | 89.89 | 91.57 | 92.76 | 94.06
Table 10. Sample size of Huarong County data set for 2017 and 2019.

No. | Class | 2017 (Training) | 2017 (Testing) | 2019 (Training) | 2019 (Testing)
1 | Building | 1083 | 271 | 1489 | 373
2 | Tree | 1047 | 262 | 1657 | 415
3 | Water | 4640 | 1161 | 4776 | 1195
4 | Greenhouse | 1620 | 405 | 1472 | 369
5 | Lotus | 1325 | 332 | 1446 | 362
6 | Pond | 674 | 169 | 869 | 218
7 | Wetland | 852 | 214 | 2294 | 574
8 | Vegetable | 477 | 120 | 660 | 166
9 | Rapeseed | 500 | 125 | 1404 | 351
10 | Crayfish | 1161 | 291 | 1526 | 382
11 | Single | 546 | 137 | 708 | 178
12 | Double crop | 432 | 109 | 777 | 195
Total | | 14,357 | 3596 | 19,078 | 4778
All samples (training + testing) | | 17,953 (2017) | | 23,856 (2019) |
Table 11. Classification results of various methods in Huarong County, 2017.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 97.80 | 94.63 | 89.20 | 95.97 | 97.86 | 98.53 | 99.02
2 | 97.11 | 96.38 | 98.19 | 95.48 | 99.63 | 98.46 | 97.74
3 | 96.74 | 97.56 | 96.81 | 98.23 | 99.39 | 99.25 | 99.83
4 | 46.66 | 72.38 | 36.66 | 68.80 | 77.02 | 89.04 | 87.61
5 | 71.42 | 80.71 | 46.91 | 86.50 | 89.54 | 96.24 | 95.35
6 | 56.58 | 62.40 | 40.75 | 71.88 | 81.61 | 86.08 | 91.34
7 | 82.32 | 92.34 | 76.89 | 92.93 | 92.01 | 94.65 | 94.45
8 | 75.73 | 87.57 | 74.73 | 90.49 | 87.36 | 92.57 | 92.12
9 | 81.12 | 75.32 | 59.65 | 81.76 | 87.84 | 88.26 | 92.84
10 | 39.15 | 68.93 | 15.62 | 65.99 | 76.59 | 84.21 | 84.29
11 | 44.91 | 60.96 | 43.29 | 69.97 | 83.24 | 84.18 | 82.03
12 | 50.92 | 68.51 | 34.14 | 62.38 | 73.95 | 84.14 | 84.25
OA (%) | 76.03 | 82.91 | 65.67 | 84.86 | 89.53 | 92.61 | 93.41
AA (%) | 70.04 | 79.81 | 59.41 | 81.70 | 87.18 | 91.31 | 91.74
Kappa | 73.13 | 80.97 | 61.71 | 83.10 | 88.31 | 91.76 | 92.65
Table 12. Classification results of various methods in Huarong County, 2019.

C N. | SVM | KNN | CNN | RNN | ViT | CAF | MARC-Net
1 | 75.22 | 75.66 | 76.44 | 86.55 | 87.88 | 92.88 | 89.33
2 | 90.40 | 92.80 | 85.87 | 92.18 | 93.68 | 93.38 | 97.79
3 | 99.64 | 99.76 | 99.22 | 99.88 | 100.00 | 100.00 | 100.00
4 | 63.67 | 86.89 | 66.29 | 89.41 | 93.07 | 92.32 | 92.97
5 | 83.33 | 89.79 | 77.55 | 91.92 | 96.59 | 98.29 | 97.61
6 | 50.00 | 66.40 | 21.06 | 77.75 | 69.68 | 77.75 | 83.46
7 | 95.17 | 96.19 | 87.55 | 97.01 | 98.41 | 99.49 | 97.71
8 | 76.77 | 85.80 | 66.88 | 89.12 | 93.34 | 96.10 | 94.15
9 | 94.60 | 93.56 | 90.85 | 94.23 | 96.72 | 97.35 | 96.67
10 | 66.40 | 77.99 | 44.57 | 87.11 | 84.59 | 86.82 | 90.11
11 | 80.95 | 90.04 | 78.36 | 92.17 | 92.82 | 96.63 | 95.21
12 | 72.29 | 90.90 | 55.48 | 93.37 | 95.87 | 96.41 | 97.17
OA (%) | 84.87 | 90.64 | 78.94 | 93.24 | 94.46 | 95.82 | 95.92
AA (%) | 79.04 | 87.15 | 70.85 | 90.90 | 91.89 | 93.96 | 94.35
Kappa | 82.93 | 89.45 | 76.20 | 92.38 | 93.75 | 95.28 | 95.40
Table 13. Changes in land-use types in Huarong County, 2017–2021.

Class | Area 2017 (km²) | Area 2019 (km²) | Area 2021 (km²) | Change 2017–2019 (%) | Change 2019–2021 (%) | Change 2017–2021 (%)
Building | 114.9272 | 124.5547 | 182.1598 | 8.37 | 51.24 | 63.91
Tree | 125.1143 | 105.6089 | 76.3699 | −15.59 | 0.97 | −14.76
Water | 157.3317 | 97.8937 | 132.7025 | −37.77 | 19.41 | −25.69
Greenhouse | 45.4248 | 71.6923 | 16.1887 | 57.82 | −82.29 | −72.05
Lotus | 56.4382 | 32.1702 | 24.1319 | −42.99 | 16.56 | −33.55
Pond | 55.4054 | 99.0612 | 84.5474 | 78.79 | 1.11 | 80.79
Wetland | 137.4385 | 135.4457 | 133.4817 | −1.44 | −1.45 | −2.87
Vegetable | 105.0842 | 138.79 | 120.0094 | 32.07 | −22.54 | 2.29
Rapeseed | 139.8075 | 293.9893 | 262.4486 | 110.28 | −0.49 | 109.23
Crayfish | 179.1558 | 226.0001 | 365.0964 | 26.14 | 47.63 | 86.23
Single | 57.9711 | 115.7221 | 55.1289 | 99.62 | −41.39 | 16.98
Double crop | 49.9226 | 117.6979 | 199.4702 | 135.76 | 16.64 | 175.01
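For reference, a minimal sketch of a relative area change rate, (A_end − A_start)/A_start × 100. With the Tree areas above it reproduces the published 2017–2019 entry (−15.59%); the later columns of Table 13 appear to follow a different convention and are reproduced as published, so only that first entry is checked here. The `change_rate` helper is an illustrative assumption, not the authors' code.

```python
# Relative change of an area between two dates, checked against the
# 2017-2019 Tree entry of Table 13.
def change_rate(a_start: float, a_end: float) -> float:
    """Percentage change of an area between two dates."""
    return (a_end - a_start) / a_start * 100.0

tree_2017, tree_2019 = 125.1143, 105.6089  # km^2, from Table 13
print(f"{change_rate(tree_2017, tree_2019):.2f}%")  # -15.59%
```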
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
