Remote Sensing Image Scene Classification Based on Global Self-Attention Module

Li, Qingwen; Yan, Dongmei; Wu, Wanrong

doi:10.3390/rs13224542

by Qingwen Li ^1,2,3, Dongmei Yan ^1,2,3,* and Wanrong Wu ^1,2

¹ Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China

² International Research Center of Big Data for Sustainable Development Goals, Beijing 100094, China

³ University of Chinese Academy of Sciences, Beijing 100049, China

* Author to whom correspondence should be addressed.

Remote Sens. 2021, 13(22), 4542; https://doi.org/10.3390/rs13224542

Submission received: 12 October 2021 / Revised: 29 October 2021 / Accepted: 8 November 2021 / Published: 12 November 2021

Abstract: The complexity of scene images makes remote-sensing image scene classification challenging. With the wide application of deep learning in recent years, many remote-sensing scene classification methods based on convolutional neural networks (CNNs) have emerged. Current CNNs usually output global information by integrating the deep features extracted by the convolutional layers through a fully connected layer; however, the global information obtained in this way is not comprehensive. To address this problem, this paper proposes an improved remote-sensing image scene classification method based on a global self-attention module. The global information is derived from the deep features extracted by the CNN. To better express the semantic information of the remote-sensing image, a multi-head self-attention module is introduced for global information augmentation. Meanwhile, a local perception unit is used to improve the self-attention module’s ability to represent local objects. The effectiveness of the proposed method is validated through comparative experiments with various training ratios and different scales on public datasets (UC Merced, AID, and NWPU-RESISC45). The classification accuracy of the proposed model is significantly higher than that of other methods for remote-sensing image scene classification.

Keywords: remote-sensing image; scene classification; convolutional neural network (CNN); global self-attention module

1. Introduction

Remote sensing is a comprehensive detection technology that uses sensors to record a target’s electromagnetic wave properties and derives the features of the target and its patterns of change through data processing and comprehensive analysis. In recent years, with advances in computer science and the growing number of remote-sensing satellites, the images obtained by remote-sensing technology have reached higher spatial, temporal and spectral resolutions [1]. Traditional “pixel-oriented” and “object-oriented” remote-sensing image classification focuses only on low-level features. However, with the gradual increase in remote-sensing image resolution and data volume, it has become increasingly important to analyze the higher-level contextual information of images. Based on the high-level semantic information contained in the images, remote-sensing image scene classification maps images to predefined scene category labels and is widely used in urban planning [2], environmental monitoring [3], object detection [4], change detection [5,6] and other fields.

In remote-sensing image scene classification, scenes in the same semantic category often look very different because of differences in the optical attributes of ground objects, such as geometry, scale, orientation, and texture. Different imaging conditions can also lead to large differences in the radiation intensity and color exhibited by scenes of the same type. A scene can comprise various local objects: different scenes may contain the same local object, and the same scene type may contain different local objects. Thus, scene classification must cope with large intra-class variation and small inter-class variation. This is illustrated in Figure 1, in which airport scenes of the same class differ markedly. In contrast, scenes such as highway, railway and runway have similar features, making inter-class differences small and hard to distinguish. This increases the complexity and difficulty of scene classification and makes remote-sensing scene classification an important and active research topic [7].

Earlier scene classification methods extracted low-level and mid-level image features, usually color, texture, and shape features, and performed operations such as coding based on these features. Commonly used descriptors include bag of visual words (BOVW) [8], local feature aggregation descriptor [9], scale-invariant feature transform (SIFT) [10], texture descriptor (TD) [11], histogram of oriented gradients (HOG) [12], etc. However, these features rely heavily on manual design and do not capture semantic information well enough to achieve excellent classification results. As artificial intelligence has evolved, it has been widely used in many fields, such as automatic ship classification based on a variety of artificial intelligence technologies [13] and remote-sensing image analysis tasks [14]. Better results have been achieved for remote-sensing image scene classification using convolutional neural network (CNN) methods [15,16,17,18]. AlexNet [19] uses a large number of convolutional kernels for feature extraction. VGGNet [20] uses deeper and wider networks to strengthen the feature learning ability of the network. GoogLeNet [21] can extract feature maps with different receptive fields, increasing the depth and width of the network compared to AlexNet and VGGNet while reducing the number of network parameters. ResNet [22] uses a residual structure to solve the network degradation problem that arises when the model is deepened. DenseNet [23] uses a densely connected structure so that features are reused, which decreases the number of network parameters.

Given the large scale and complicated content of remote-sensing scene images and their large intra-class variation and small inter-class variation, directly applying basic network structures to scene classification has limited effectiveness. Chen et al. [24] generated a large number of feature maps with fewer parameters and lower computational cost by introducing group convolution, which groups the feature maps and convolutional kernels accordingly and performs convolution within the corresponding groups. Shi et al. [25] used depthwise separable convolution, which reduces the computational load and improves computational speed compared with standard convolution. To expand the receptive field, Zhang et al. [26] adopted dilated convolution and resolved the problem of information loss that occurs during convolution. Huang and Xu [27] regarded the fully connected layers as global features, composed multiple convolutional layers in the convolutional network into multilayer features by iBOVW coding, and extracted texture features that can represent the scene image from global to local and from shallow to deep using local binary patterns [28]. Fang et al. [29] not only extracted global and local features in the spatial domain using a CNN but also extracted rotation-invariant information of the image in the frequency domain based on a band-pass filter network.

In remote-sensing scene classification, the feature maps obtained by convolution reflect the image content, but they do not focus on significant regions or features, and the redundant parts hinder further classification. To solve this problem, researchers have used various feature enhancement approaches, such as obtaining more effective features by selecting feature maps [30,31], finding more effective ways of feature fusion [32,33,34], and introducing attention mechanisms into CNNs [35,36,37]. Chen et al. [24] and Kim et al. [38] introduced a channel attention module and a spatial attention module into their network models to augment features. Channel attention considers the information from all regions and learns weighting coefficients for the different channels; the higher the weight, the more relevant the channel and the more it should be taken into account. Spatial attention considers the information from all channels and then learns coefficients for each region of the image to find the salient regions. Shen et al. [39] divided the feature map into low-, medium- and high-level features along the channels and used the attention mechanism to enhance the features at each level separately in order to obtain salient regions with different receptive fields. Li et al. [40] applied the attention mechanism to a feature fusion framework to generate an attention map, using the gradient-weighted class activation mapping (Grad-CAM) algorithm to highlight vital regions of the image. However, these attention mechanisms only enhance the local features obtained by convolution and neglect the enhancement of global semantic information. In recent years, influenced by the transformer model [41], which is commonly applied in natural language processing (NLP) research, some works have combined CNNs and the transformer model for image classification, such as ViT [42], BERT [43], PVT [44], etc. Wu et al. [45] used a CNN to extract features and fed the feature maps as embedding sequences into a transformer to obtain the output results. Self-attention, an important structure in the transformer, has a global receptive field and is better suited to capturing the internal relevance of information. Bello et al. [46] computed the self-attention layer in parallel with standard convolution to enhance the network. Nevertheless, this method is time-consuming because of the large size of the image. Srinivas et al. [47] enhanced the global information of the image by replacing spatial convolution with self-attention. In the field of scene classification, Wu et al. [48] improved feature discrimination by adding a self-attention layer between the convolutional layers of the original CNN; however, the network failed to combine self-attention and convolution adequately.

To improve the ability of features to express the semantic content of remote-sensing scene images, this paper replaces part of the convolutional layers in ResNet50 with a global self-attention module, encodes the relative position relationships of features so that the module can perceive relative distances, and uses local perception units to aggregate local information within the module.

The three main contributions of this paper are as follows.

(1): We propose a network structure based on the global self-attention module and a CNN. To address the inadequacy of CNNs in extracting global information, the self-attention module gives the model a global receptive field by augmenting deep features on a global scale.
(2): The proposed global self-attention module can perceive the spatial location and local features of the image while enhancing global information.
(3): To optimize classification performance and mitigate overfitting, we use a data augmentation strategy. To validate the proposed model, extensive experiments and comparative analyses were carried out.

The remainder of the paper is organized as follows. Section 2 describes the structure of the proposed network based on the global self-attention module. Section 3 presents the experiments on three commonly used datasets and reports the results for each. Section 4 discusses the experimental results. The final section summarizes the paper and provides an outlook.

2. Materials and Methods

2.1. Overall Framework

To derive salient scene-related features from remote-sensing images, this paper introduces a global self-attention module that globally enhances the deep features extracted by the CNN. The overall framework is shown in Figure 2b; ResNet50 is used as the backbone network, and the original ResNet50 structure is shown in Figure 2a. ResNet50 contains five stages of convolutional layers, which can effectively extract image features at different levels. A CNN can better represent the global content of a scene image by deepening the network and extracting features with a larger receptive field. We replace the 3 × 3 convolution of the fifth stage in ResNet50 with the self-attention module to globally enhance the extracted deep features while retaining the residual learning capability of ResNet50, so that the output features contain more global semantic information and are more discriminative. In the global self-attention module, the embedded relative position encoding makes the self-attention module aware of spatial position. The different heads in the self-attention module learn information from different subspaces, and the local perception units introduced into the module integrate local features from each subspace.

In Algorithm 1 below, we provide the main steps for training our model:

Algorithm 1: Our Method

Input: Training images

Output: Predicted labels of testing images

1: Initialize parameters: initial learning rate = 0.01, batch size = 16, number of iterations = 80

2: Load part of the parameters of the pre-trained ResNet50 model

3: For iteration = 1 to 80

    For each batch of 16 training images

        Preprocess the data;

        Calculate the cross-entropy loss;

        Backpropagate the loss;

        Update the model parameters

4: Predict the label of each testing image
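
The loop structure of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: a toy linear softmax classifier stands in for the ResNet50-based network, plain SGD is used without momentum or weight decay, and the 80 iterations are assumed to be passes over the training set. The names `train` and `predict` are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(samples, num_classes, epochs=80, batch_size=16, lr=0.01, seed=0):
    """Mini-batch SGD on a toy linear classifier (stand-in for the real network).

    samples: list of (feature_vector, label) pairs.
    Returns the weight matrix and the mean cross-entropy loss per epoch.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    history = []
    for _ in range(epochs):                      # step 3: outer loop
        order = list(range(len(samples)))
        rng.shuffle(order)
        epoch_loss = 0.0
        for start in range(0, len(order), batch_size):   # inner loop over batches
            batch = [samples[i] for i in order[start:start + batch_size]]
            grad = [[0.0] * dim for _ in range(num_classes)]
            for x, y in batch:
                p = softmax([sum(w * v for w, v in zip(row, x)) for row in W])
                epoch_loss += -math.log(p[y])    # cross-entropy loss
                for c in range(num_classes):     # gradient of the loss w.r.t. W
                    err = p[c] - (1.0 if c == y else 0.0)
                    for d in range(dim):
                        grad[c][d] += err * x[d] / len(batch)
            for c in range(num_classes):         # parameter update
                for d in range(dim):
                    W[c][d] -= lr * grad[c][d]
        history.append(epoch_loss / len(samples))
    return W, history

def predict(W, x):
    """Step 4: the predicted label is the class with the highest score."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    return scores.index(max(scores))
```

In a real experiment, the inner body would be replaced by a forward pass of the network, the framework's cross-entropy loss, and its optimizer step.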

2.2. Basic Network

As a CNN deepens, its feature representation ability grows, but its convergence becomes slower. To solve this problem, this paper adopts ResNet, proposed in 2015, as the basic network; its core module is the residual structure. As illustrated in Figure 3, the residual learning unit comprises stacked convolutional layers and a skip connection that passes the input features to the output through an identity mapping. It is defined as follows.

Y = σ(F(X) + X)  (1)

F(X) = W_{2} σ(W_{1} X)  (2)

where Y is the output, X is the input, F(X) is the result of applying two weight layers to X, σ is the rectified linear unit (ReLU), W_{1} and W_{2} are the weights of the two layers, and F(X) + X is the output of the residual module before the final activation. The residual structure maps the input directly to the output through an identity mapping, which realizes the skip connection and reduces the loss of information during transfer. In scene classification, when the difference between scene classes is small, deep features of the images need to be learned in order to classify them correctly. When the difference between scene categories is large, however, the images can be classified using shallow features, and deeper features are not needed. In that case, the residual module can retain the valuable low-level features, which solves the problem of network degradation as the model deepens.
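
Equations (1) and (2) can be illustrated with a small sketch. It assumes, for brevity, that the two weight layers are dense matrices acting on a feature vector; the actual ResNet50 block uses convolutions and normalization layers, which are omitted here. The function names are hypothetical.

```python
def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_unit(X, W1, W2):
    """Y = relu(F(X) + X) with F(X) = W2 relu(W1 X), as in Equations (1)-(2)."""
    F = matvec(W2, relu(matvec(W1, X)))
    return relu([f + x for f, x in zip(F, X)])
```

When both weight matrices are zero, F(X) vanishes and the unit simply passes relu(X) through the skip connection, which is the behaviour that lets a deep network fall back on shallow features.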

2.3. Global Self-Attention Module

2.3.1. Multi-Head Self-Attention Layer

The transformer has achieved strong results in several NLP tasks. As an important component of the transformer network, the multi-head self-attention layer can help improve the classification capability of a network. In remote-sensing images, scene classification is based on the high-level semantic content of the image, that is, on the overall information of the image. In a CNN, the convolutional layers describe the local object information of the image well but describe its global information poorly. We use a multi-head self-attention layer to address this problem: because it models long-range dependencies, it can augment the extracted deep features at the global scale.

As shown in Figure 4, the self-attention layer uses three 1 × 1 convolutions to transform the input feature map into three matrices with different roles: Q (query), K (key) and V (value). The degree of correlation between Q and each K is calculated by multiplying Q and K, and the raw correlation coefficients are normalized by the softmax function. The operations are given in Equations (3)–(6), where * represents convolution and W_{q}, W_{k} and W_{v} represent different convolution kernels. In essence, whereas convolution captures only local neighborhoods, self-attention learns the dependency between any two positions within an image through the product of Q and K, and can therefore learn dependencies across the whole image. The computed weight coefficients are then multiplied by the corresponding V. Each weight coefficient measures the importance of a region: the larger the weight coefficient, the more important the region is for classification.

Q = W_{q} * X  (3)

K = W_{k} * X  (4)

V = W_{v} * X  (5)

A = softmax(Q K^{T}) V  (6)
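
Equations (3)–(6) can be sketched for a single head as follows. The sketch assumes the feature map has been flattened so that each row of X is one spatial position; under that layout a 1 × 1 convolution is a per-position linear map, written here as a matrix product. As in Equation (6), no scaling factor is applied to Q K^{T}. The helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        t = sum(e)
        out.append([v / t for v in e])
    return out

def self_attention(X, Wq, Wk, Wv):
    """X: N positions x C channels; Wq, Wk, Wv: C x D projection matrices.

    Returns the attended features (N x D) and the N x N attention weights.
    """
    Q = matmul(X, Wq)   # Eq. (3)
    K = matmul(X, Wk)   # Eq. (4)
    V = matmul(X, Wv)   # Eq. (5)
    weights = softmax_rows(matmul(Q, transpose(K)))
    return matmul(weights, V), weights   # Eq. (6)
```

Each row of the returned weight matrix sums to one and spans every position, which is what gives the layer its image-wide view.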

The multi-head self-attention layer, in turn, maps the input feature maps into multiple subspaces whose parameters are independent, and then merges the outputs of all subspaces. Since the parameters that generate the matrices Q, K, and V differ between subspaces, the weight coefficients learned in each subspace also differ. The layer can therefore focus on different aspects of the information in different subspaces and finally combine the feature information of all aspects, as shown in Equation (7), where n is the number of heads.

MSA = A_{1} + A_{2} + \dots + A_{n}  (7)

2.3.2. Relative Position Encoding

In the multi-head self-attention layer, Q is multiplied by K to learn the weight of the current position. However, this operation cannot perceive position information in the image, so position coding needs to be introduced. Commonly used position codings are absolute position coding and relative position coding, and earlier experiments show that relative position coding better characterizes the location information of the image [49]. Relative position coding represents the row offset and column offset by embedding a vector in each of the horizontal and vertical dimensions. The vectors are learnable, so they can learn the relative distance between different positions in the scene image. The output of the module with embedded relative position coding is:

A = softmax(Q K^{T} + Q R^{T}) V  (8)

R = R_{h} + R_{w}  (9)

The global self-attention module with the introduction of relative position coding is able to consider not only the global information but also the relative distances of features at different locations with translation invariance.
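
One plausible reading of Equations (8) and (9) is sketched below. It assumes that R_{h} holds one learnable vector per row offset and R_{w} one per column offset, so that the encoding for a query/key pair is the sum of the two vectors selected by their offsets. The indexing scheme and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    t = sum(e)
    return [v / t for v in e]

def relative_logits(Q, H, W, R_h, R_w):
    """Q R^T for an H x W grid flattened row by row.

    R_h holds 2H-1 vectors (row offsets -(H-1)..H-1) and R_w holds 2W-1
    vectors (column offsets); a pair's encoding is R_h[drow] + R_w[dcol].
    """
    n = H * W
    logits = []
    for i in range(n):
        q_row, q_col = divmod(i, W)
        row = []
        for j in range(n):
            k_row, k_col = divmod(j, W)
            r_h = R_h[k_row - q_row + H - 1]
            r_w = R_w[k_col - q_col + W - 1]
            row.append(dot(Q[i], [a + b for a, b in zip(r_h, r_w)]))  # Eq. (9)
        logits.append(row)
    return logits

def attention_with_relative_position(Q, K, V, H, W, R_h, R_w):
    """softmax(Q K^T + Q R^T) V, as in Equation (8)."""
    rel = relative_logits(Q, H, W, R_h, R_w)
    out = []
    for i, q in enumerate(Q):
        weights = softmax([dot(q, k) + rel[i][j] for j, k in enumerate(K)])
        out.append([dot(weights, col) for col in zip(*V)])
    return out
```

Because the position term depends only on the offset between query and key, two pairs separated by the same offset share the same encoding, which is the sense in which the scheme is translation invariant.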

2.3.3. Local Perception Unit

The global self-attention module performs well in extracting global information but has limited local perception capability. We therefore exploit the ability of convolution to extract local information and introduce a local perception unit into the multi-head self-attention layer, so that the global self-attention module combines global context aggregation with local feature extraction.

Since different heads in the global self-attention module represent different subspaces, and each head learns different global information, we reinforce local features in each subspace separately. Inspired by depthwise separable convolution, we use a different convolutional kernel for each head, and each head is convolved with only one kernel, which aggregates local information while saving computational cost. As shown in Figure 5, we use 16 convolution kernels to convolve 16 self-attention heads and obtain features that fuse global and local information. After the local perception unit and multiplication by V, the output is:

Z = \sum_{i=1}^{n} (softmax(Q_{i} K_{i}^{T} + Q_{i} R^{T}) * w_{i}) V_{i}  (10)

where n is the number of heads, i denotes the i-th head, w_{i} denotes a 3 × 3 convolution kernel, and * denotes convolution.
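
The structure of Equation (10) can be sketched as follows. The sketch assumes that each head's softmax output is treated as a single-channel 2D map, that the 3 × 3 kernel is applied with zero padding so the map keeps its size, and that "convolution" means the cross-correlation used in deep learning frameworks. These choices and the function names are assumptions for illustration.

```python
def conv3x3_same(A, w):
    """Apply a 3 x 3 kernel to a 2D map with zero padding (output size = input size)."""
    n_rows, n_cols = len(A), len(A[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        acc += w[dr + 1][dc + 1] * A[rr][cc]
            out[r][c] = acc
    return out

def local_perception_output(attn_per_head, V_per_head, kernels):
    """Z = sum_i (A_i * w_i) V_i, where A_i is head i's softmax attention map.

    One kernel per head, in the spirit of depthwise convolution.
    """
    n = len(V_per_head[0])
    d = len(V_per_head[0][0])
    Z = [[0.0] * d for _ in range(n)]
    for A, V, w in zip(attn_per_head, V_per_head, kernels):
        A_local = conv3x3_same(A, w)
        for r in range(n):
            for k in range(d):
                Z[r][k] += sum(A_local[r][j] * V[j][k] for j in range(n))
    return Z
```

With a kernel whose only non-zero entry is a 1 at the centre, the unit reduces to the plain sum of per-head attention outputs, so the kernel weights control how much neighbouring attention values are mixed in.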

2.4. Dataset Description

To assess the performance of the model, we use the publicly available UC Merced land-use dataset (UCM) [50], the NWPU-RESISC45 dataset (NWPU) [51] and the Aerial Image Dataset (AID) [52] as benchmark datasets for our experiments. The details of the three scene classification datasets are summarized in Table 1.

The UCM dataset was released by the University of California, Merced in 2010. The dataset includes 21 scene categories, such as agricultural, airplane, and dense residential; each category consists of 100 images, for a total of 2100 images, all acquired from urban areas in several regions of the United States. Each image is 256 × 256 pixels with a spatial resolution of 0.3 m. Figure 6 shows images of the different categories in the UCM dataset.

The NWPU-RESISC45 dataset was released by Northwestern Polytechnical University. Its images cover more than 100 countries and regions worldwide, including developing countries, transition countries and highly developed economies. The dataset has 45 scene categories, each consisting of 700 images, for a total of 31,500 images. Each image is 256 × 256 pixels, with a spatial resolution of 0.2 to 30 m. Figure 7 shows images of each category in the NWPU dataset.

The AID dataset was jointly released by Huazhong University of Science and Technology and Wuhan University. The sample images of each category are selected from diverse regions of many countries, and images in the same category have different resolutions. The dataset has 30 scene categories, each consisting of 220 to 420 images, for a total of 10,000 images. Each image is 600 × 600 pixels with a spatial resolution of 0.5 to 8 m. Figure 8 shows images of each category in the AID dataset.

3. Experimental Results

3.1. Data Preprocessing

The amount of data in the public datasets is limited for network training, so data augmentation is commonly applied to enrich the training data. Data augmentation generates additional, more varied samples from the existing data, which diversifies the data and helps to reduce overfitting. The data augmentation methods used in this experiment include random rotation, random flipping, and random cropping, which use simple geometric transformations to generate new training samples and can significantly improve the classification results of the network.
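
The three geometric transformations can be sketched on an image stored as a list of pixel rows. The paper does not state the rotation angles or flip directions, so the sketch assumes rotations by multiples of 90° and both horizontal and vertical flips; the function names are hypothetical.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def random_flip(img, rng):
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]   # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1]                    # vertical flip
    return img

def random_rotate(img, rng):
    for _ in range(rng.randrange(4)):      # 0, 90, 180 or 270 degrees
        img = rotate90(img)
    return img

def random_crop(img, size, rng):
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, size, seed=None):
    """Flip, rotate and crop one image; `size` is the side of the square crop."""
    rng = random.Random(seed)
    return random_crop(random_rotate(random_flip(img, rng), rng), size, rng)
```

Each call with a different seed yields a different view of the same scene, so the label is preserved while the pixel layout changes.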

3.2. Experimental Setup and Evaluation Metrics

3.2.1. Experimental Setup

Our experiments are based on the PyTorch framework [53] and run on a server with an NVIDIA GeForce RTX 2080Ti GPU. Data augmentation is employed in each experiment, and the training data are cropped to 224 × 224 pixels. We train the network with an SGD optimizer, with weight decay set to 0.005 and momentum set to 0.9. The initial learning rate is set to 0.01, the batch size is 16, and the number of iterations is 80.
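
A single SGD step with momentum and weight decay can be sketched as follows, using the hyperparameters stated above. The sketch assumes the common convention in which weight decay is added to the gradient as an L2 term and the velocity accumulates the resulting gradient without dampening; the function name `sgd_step` is hypothetical.

```python
def sgd_step(params, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.005):
    """One SGD update for a flat list of scalar parameters.

    Returns the updated parameters and the updated velocity.
    """
    new_params, new_velocity = [], []
    for w, g, v in zip(params, grads, velocity):
        g = g + weight_decay * w      # L2 weight decay folded into the gradient
        v = momentum * v + g          # momentum buffer
        new_params.append(w - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity
```

For example, starting from a parameter of 1.0 with gradient 0.5 and zero velocity, the effective gradient is 0.505 and the parameter moves to 0.99495.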

To facilitate comparison with existing scene classification methods, the training and testing ratios of the datasets in this experiment follow previous studies [54,55]. For the UCM dataset, 50% or 80% of the images were randomly selected as the training set and the remainder as the testing set. For the NWPU dataset, 10% or 20% were randomly selected for training and the remaining 90% or 80% for testing, respectively. For the AID dataset, the training percentage was set to 20% or 50%, and the remainder was used as the test set.
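
A random split at a given training ratio can be sketched as below. The text only states that the data were divided randomly; drawing the same fraction from every class (a stratified split) is an assumption, and the function name is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio, seed=0):
    """Randomly split sample indices per class.

    labels: class label of each sample. Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * train_ratio)
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return train, test
```

Applied to a dataset shaped like UCM (21 classes of 100 images), a ratio of 0.5 gives 1050 training and 1050 test images, and a ratio of 0.8 gives 1680 and 420.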

3.2.2. Evaluation Metrics

To verify the algorithm’s effectiveness, we use the overall accuracy (OA), kappa coefficient, F1 score and confusion matrix as the evaluation metrics of the algorithm.

OA is the ratio of the number of correctly classified test images to the total number of test images:

OA = T / (T + F)  (11)

where T is the number of correctly classified images and F is the number of misclassified images.

The confusion matrix expresses the classification results more intuitively and is widely used to measure the ability of a network. In the confusion matrix, each row stands for the true labels of the remote-sensing images, and each column indicates the labels predicted by the model. Hence, the diagonal values of the confusion matrix show the classification accuracy of each category, and the off-diagonal values reveal misclassification among easily confused categories.
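
These metrics can be computed from predicted and true labels as sketched below. Rows index true labels and columns index predicted labels, as described above. The kappa formula is not given in the text, so the standard Cohen's kappa is assumed; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts test images with true label t and predicted label p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def overall_accuracy(cm):
    """OA = T / (T + F): diagonal sum over the total count, as in Equation (11)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def kappa(cm):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    total = sum(sum(row) for row in cm)
    p_observed = overall_accuracy(cm)
    p_expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

def per_class_accuracy(cm):
    """Diagonal of the row-normalized confusion matrix."""
    return [cm[i][i] / sum(cm[i]) if sum(cm[i]) else 0.0 for i in range(len(cm))]
```

For true labels [0, 0, 1, 1] and predictions [0, 1, 1, 1], the matrix is [[1, 1], [0, 2]], OA is 0.75, kappa is 0.5, and the per-class accuracies are 0.5 and 1.0.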

3.3. Accuracy Evaluation

The proposed method in this paper considers the global information and local characteristics of remote-sensing scene images. It embeds the self-attention module with global enhancement and local aggregation capability into the convolution, which leads to a large improvement of the classification effect. Table 2 lists OA, kappa and F1 score of our method for different datasets with various training ratios, where the contents of the brackets after the datasets represent the training ratio.

To highlight the advantages of the method in this paper, we contrast the proposed model with the current advanced methods, mainly comparing CNN feature-based methods, including ResNet-50 + EAM, ACNet, GLANet and DDRL-AM, which use the attention mechanism for feature enhancement, and SCCov and AMB-CNN, which perform fusion of features.

Table 3 compares the proposed model with the latest methods and lists the overall accuracy of each method on the UCM dataset with 50% and 80% training ratios. Our method clearly outperforms most other CNN-based methods. However, on the UCM dataset with an 80% training ratio, the overall accuracies differ little because the accuracy of most methods approaches 100%.

Table 4 compares the overall accuracy of NWPU with different training scales. Our method outperforms other ResNet-based methods for feature enhancement because our method considers global information while focusing on local object features. Our accuracy is 1.3% higher than that of ResNet-50+EAM using channel attention and spatial attention and 2.77% higher than that of SDAResNet under the condition that the training ratio is 10%, and 0.49% higher than that of ResNet-50+EAM and 2.85% higher than that of SDAResNet under the condition that the training ratio is 20%. The GLANet model considers global information and learns local information using an attention mechanism. However, the method only uses a squeeze-excitation module for global information extraction and fails to link the information at different locations on the image. In contrast, our global information extraction considers the linkage of various locations on the image, and the accuracy is 1.14% higher than GLANet when the training ratio is 10% and 0.55% higher than GLANet when the training ratio is 20%.

In Table 5, it can be seen that the experimental outcomes of our proposed method on the AID dataset outperform those of other CNN-based methods. This is mainly because our method has the ability of residual learning, global information enhancement and local feature perception. Compared with other methods, it can describe the content in the scene more effectively and has better accuracy.

3.4. Confusion Matrix for Result Analysis

The confusion matrices for different training ratios on the different datasets are shown in Figure 9, Figure 10 and Figure 12. In each confusion matrix, a blank cell represents 0 at that position, and the diagonal entries give the percentage of correctly classified images. The confusion matrix visually represents the classification performance for each category of the dataset.

Figure 9 illustrates the classification outcomes of our proposed method for the UCM dataset with a 50% training ratio. Among the 21 classes in the UCM dataset, most of them have an accuracy of 95% or more, 15 of them have an accuracy of 100%, and only “buildings” and “medium residential” have lower accuracy. The accuracy of the two categories is as low as 94% and 92%, respectively. This is because the main objects in the “buildings”, “medium residential”, and “dense residential” scenes are buildings, which have similar local characteristics, causing some “buildings” and “medium residential” to be misclassified as “dense residential”.

Figure 10 shows the confusion matrix of our method for the NWPU dataset with a training-to-test ratio of 2:8. Among the 45 categories in the NWPU dataset, 39 scene categories have an accuracy greater than 90%, and some categories, such as “airplane”, “golf_course”, and “sea_ice”, reach 98% and above, indicating excellent classification performance. However, Figure 10 also shows that the most confused categories are “palace” and “church”. As shown in Figure 11, these two categories have similar buildings and surroundings, which are hard to distinguish even manually. The “dense_residential” and “medium_residential” scenes have different spatial distributions, but they are composed of the same objects and are easily confused.

For the AID dataset, the confusion matrix with a training ratio of 20% is displayed in Figure 12. Among the 30 categories in the AID dataset, 26 scene categories have an accuracy higher than 90%, and 22 of them are higher than 95%. Among the misclassified images, “Park”, “Resort”, and “School” accounted for the largest percentage. As shown in Figure 13, both “Park” and “Resort” images have water, trees and sparse buildings, making them difficult to identify from one another and resulting in poor classification.

4. Discussion

4.1. Analysis on the Improvement of Algorithm Accuracy by the Global Self-Attention Module

To demonstrate the effectiveness of the global self-attention module, we compared ResNet50 and our method on the UCM, NWPU, and AID datasets, respectively. The experimental results of both methods after data augmentation are shown in Figure 14, and the classification results of the proposed method are greatly improved compared to ResNet50. Specifically, the accuracy is improved by 1.05% and 0.96%, in the UCM dataset with a training proportion of 50% and 80%, respectively. In the NWPU dataset of 10% and 20% training proportion, the accuracy improved by 2.98% and 1.71%, respectively. In the AID dataset for training proportions of 20% and 50%, the accuracy improved by 2.67% and 1.82%, respectively. Depending on the difficulty of different dataset classification, the precision is improved to varying degrees for different datasets. In the NWPU dataset, the characteristics of enormous intra-class variety and little inter-class variety are more prominent, and the classification difficulty is increased. The more difficult the dataset is to classify, the more significant the improvement of the classification effect is after adding the global self-attention module. This is because our method is able to obtain global features of complex scene images compared to the original network.

To compare ResNet with the proposed method clearly, we use Grad-CAM [65] for feature visualization. Grad-CAM uses the gradients of the last feature-map layer to generate an attention map that displays the important regions of the image. The original images and the attention maps of ResNet50 and our method are shown in Figure 15. The original images are from the dense_residential, forest, golf_course, harbor, and palace categories of the NWPU dataset. ResNet’s attention is mainly focused on one contiguous area, while our method’s attention starts from the overall content of the scene and focuses on multiple local objects related to the scene category at the global scale.
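
The core Grad-CAM computation can be sketched as follows, assuming the feature maps of the last convolutional layer and the gradients of the class score with respect to them are already available. Rescaling the map to [0, 1] is a display convention added here, and the function name is hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM map from K feature maps and their gradients (each H x W).

    Each channel is weighted by the spatial mean of its gradient; the
    weighted sum is passed through ReLU and rescaled to [0, 1].
    """
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    weights = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    cam = [
        [max(0.0, sum(w * fm[r][c] for w, fm in zip(weights, feature_maps)))
         for c in range(W)]
        for r in range(H)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

Channels whose gradients are negative on average are suppressed by the ReLU, so the map keeps only the regions that raise the score of the chosen class.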

4.2. Analysis on the Improvement of Algorithm Accuracy by Data Augmentation

Data augmentation plays an essential role in network training: it makes fuller use of limited data and reduces overfitting. This paper mainly uses random cropping, rotation, and random horizontal flipping. To evaluate its impact, we compare classification accuracy with and without data augmentation at several training ratios on the UCM, AID, and NWPU datasets. As shown in Figure 16, data augmentation yields a modest improvement for our method. On the UCM dataset, accuracy improves by 0.84% and 0.5% at training proportions of 50% and 80%, respectively. On the NWPU dataset, accuracy improves by 1.71% and 1.1% at training proportions of 10% and 20%, respectively. On the AID dataset, accuracy improves by 1.12% and 0.56% at training proportions of 20% and 50%, respectively.
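
The three augmentations named above can be sketched in a few lines of plain Python. This is only an illustration on a small nested-list image: the function names, the crop size, the restriction to right-angle rotations, and the flip probability of 0.5 are assumptions, since the text does not state the exact settings.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a uniformly random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def rotate90(image, turns):
    """Rotate clockwise by 90 degrees, `turns` times."""
    for _ in range(turns % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, crop_h, crop_w, rng):
    """Random crop, random right-angle rotation, random horizontal flip."""
    out = random_crop(image, crop_h, crop_w, rng)
    out = rotate90(out, rng.randint(0, 3))  # assumed: multiples of 90 degrees
    if rng.random() < 0.5:                  # assumed flip probability
        out = hflip(out)
    return out


# Toy 4x4 single-channel "image" with distinct pixel values.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
sample = augment(toy_image, 3, 3, random.Random(0))
```

Because every call draws a new crop position, rotation, and flip, the network sees a different variant of each training image in every epoch, which is how augmentation stretches a limited training set.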

4.3. Analysis of the Improvement of Algorithm Accuracy by Relative Position Coding

To evaluate the role of relative position coding in the multi-head self-attention module, we compared classification results with and without relative position coding at different training ratios on the UCM, AID, and NWPU datasets. As shown in Figure 17, embedding relative position coding in the multi-head self-attention module gives a small improvement over the variant without it. At a training ratio of 80% on the UCM dataset, relative position coding has little effect on the result, mainly because the accuracy of the original network is already close to 100%, which leaves little room for further improvement. In the other experiments, relative position coding improves classification accuracy by 0.17% to 1.09%, with larger gains at lower training ratios.
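
To make the ablation concrete, the following sketch shows one common way relative position information enters self-attention, in the style of the relative position representations of [50]: a learned embedding indexed by the offset between two positions is added to the key before the dot product with the query. It is a single-head, 1-D simplification written for illustration; the module evaluated here works on 2-D feature maps with multiple heads, and the function and variable names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_relative_position(q, k, v, rel):
    """Single-head self-attention over a 1-D token sequence.

    q, k, v: lists of n vectors (q and k have length d).
    rel: dict mapping a relative offset (j - i) to a length-d embedding.
    Logit(i, j) = q_i . (k_j + rel[j - i]) / sqrt(d).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        logits = [
            dot(q[i], [kc + rc for kc, rc in zip(k[j], rel[j - i])]) / math.sqrt(d)
            for j in range(n)
        ]
        weights = softmax(logits)
        out.append([sum(weights[j] * v[j][c] for j in range(n))
                    for c in range(len(v[0]))])
    return out


# Toy check: three tokens, d = 2 (values are illustrative only).
q = [[1.0, 0.0]] * 3
k = [[0.0, 0.0]] * 3
v = [[1.0], [2.0], [3.0]]
no_bias = {offset: [0.0, 0.0] for offset in range(-2, 3)}
self_bias = dict(no_bias)
self_bias[0] = [50.0, 0.0]  # strongly favor offset 0 (the token itself)

uniform = attention_with_relative_position(q, k, v, no_bias)
peaked = attention_with_relative_position(q, k, v, self_bias)
```

With identical keys and no positional term, every token attends uniformly and all outputs collapse to the mean of the values; once the positional term is added, the attention pattern depends on where tokens sit relative to each other, which is the information the ablation removes.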

4.4. Analysis of the Improvement of Algorithm Accuracy by Local Perception Unit (LPU)

We use the local perception unit to improve the local information extraction ability of the global self-attention module, so that the module performs well in both global and local information extraction. To assess the usefulness of the local perception unit, we carried out experiments on the three benchmark datasets. As shown in Figure 18, adding the local perception unit increases classification accuracy by about 0.2%. Although this improvement is small, it supports the effectiveness of the local perception unit and provides a basis for future research on combining global and local enhancement.
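
For readers who want a concrete picture of such a unit, the sketch below assumes the widely used form of a local perception unit: a depth-wise 3 × 3 convolution whose output is added back to its input. This form is an assumption made for illustration; the unit actually used is the one defined in Section 2.3.3 and Figure 5, and the function name and kernels here are hypothetical.

```python
def local_perception_unit(x, kernels):
    """Depth-wise 3x3 convolution (zero padding) plus a residual connection.

    x: feature maps as nested lists [channels][height][width].
    kernels: one 3x3 kernel per channel, [channels][3][3].
    As in deep learning frameworks, the "convolution" is a cross-correlation.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w = len(channel), len(channel[0])
        result = []
        for i in range(h):
            row = []
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding
                            acc += kernel[di + 1][dj + 1] * channel[ii][jj]
                row.append(acc + channel[i][j])  # residual connection
            result.append(row)
        out.append(result)
    return out


# Toy example: one 2x2 channel and an all-ones kernel (illustrative values).
toy_x = [[[1.0, 2.0], [3.0, 4.0]]]
ones_kernel = [[[1.0] * 3 for _ in range(3)]]
zero_kernel = [[[0.0] * 3 for _ in range(3)]]
toy_out = local_perception_unit(toy_x, ones_kernel)
identity_out = local_perception_unit(toy_x, zero_kernel)
```

Each channel is filtered independently, so the unit adds only a small number of parameters, and the residual path means a zero kernel leaves the features unchanged.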

5. Conclusions

We propose a network model based on global self-attention for remote-sensing image scene classification. First, data augmentation is applied to the scene images, and their deep features are extracted using ResNet50 with residual connections. Then, the deep features are globally enhanced by the global self-attention module. The proposed network focuses on the global information of the image and enhances its important regions, which better represents the semantic information of scene images. We evaluate the model on three publicly available benchmark datasets, achieving 99.50%, 96.98%, and 94.13% accuracy on the UCM, AID, and NWPU datasets, respectively. The results show that the network matches or outperforms most existing methods. However, this study has some limitations. Training and testing were carried out on each of the three datasets separately, without cross-dataset experiments. In addition, the added module increases the computational cost relative to the base network. In future research, we will consider combining global and local information enhancement to improve feature representation, and pruning the network to obtain a lightweight model.

Author Contributions

Conceptualization, Q.L. and D.Y.; methodology, Q.L.; validation, Q.L. and D.Y.; formal analysis, Q.L.; investigation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, D.Y., Q.L. and W.W.; visualization, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chinese Academy of Sciences Strategic Priority Research Program of the Big Earth Data Science Engineering Program (CASEarth) (XDA19090200) and the Capacity Building Project of Big Earth Science Data Center of the Chinese Academy of Sciences (WX145XQ07-13).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	convolutional neural network
Grad-CAM	gradient-weighted class activation mapping
NLP	natural language processing

References

1. Gómez-Chova, L.; Tuia, D.; Moser, G.; Camps-Valls, G. Multimodal Classification of Remote Sensing Images: A Review and Future Directions. Proc. IEEE 2015, 103, 1560–1584.
2. Longbotham, N.; Chaapel, C.; Bleiler, L.; Padwick, C.; Emery, W.; Pacifici, F. Very High Resolution Multiangle Urban Classification Analysis. IEEE Trans. Geosci. Remote Sens. 2011, 50, 1155–1170.
3. Zhang, T.; Huang, X. Monitoring of Urban Impervious Surfaces Using Time Series of High-Resolution Remote Sensing Images in Rapidly Urbanized Areas: A Case Study of Shenzhen. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2692–2708.
4. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
5. Ghaderpour, E.; Vujadinovic, T. Change Detection within Remotely Sensed Satellite Image Time Series via Spectral Analysis. Remote Sens. 2020, 12, 4001.
6. Panuju, D.R.; Paull, D.J.; Griffin, A.L. Change Detection Techniques Based on Multispectral Images for Investigating Land Cover Dynamics. Remote Sens. 2020, 12, 1781.
7. Fan, H. Feature Learning Based High Resolution Remote Sensing Image Scene Classification. Ph.D. Thesis, Wuhan University, Wuhan, China, 2017.
8. Zhao, L.; Tang, P.; Huo, L.-Z. Land-Use Scene Classification Using a Concentric Circle-Structured Multiscale Bag-of-Visual-Words Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4620–4631.
9. Daniilidis, K.; Maragos, P.; Paragios, N. Improving the Fisher Kernel for Large-Scale Image Classification; Computer Vision—ECCV 2010; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314.
10. Li, Q.; Qi, S.; Shen, Y.; Ni, D.; Zhang, H.; Wang, T. Multispectral Image Alignment with Nonlinear Scale-Invariant Keypoint and Enhanced Local Feature Matrix. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1551–1555.
11. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
12. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005.
13. Połap, D.; Włodarczyk-Sielicka, M.; Wawrzyniak, N. Automatic ship classification for a riverside monitoring system using a cascade of artificial intelligence techniques including penalties and rewards. ISA Trans. 2021.
14. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
15. Shin, H.-C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans. Med Imaging 2016, 35, 1285–1298.
16. Hu, F.; Xia, G.-S.; Hu, J.; Zhang, L. Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery. Remote Sens. 2015, 7, 14680–14707.
17. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
18. Dong, R.; Xu, D.; Jiao, L.; Zhao, J.; An, J. A Fast Deep Perception Network for Remote Sensing Scene Classification. Remote Sens. 2020, 12, 729.
19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
20. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
21. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
23. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
24. Chen, Z.; Wang, Y.; Han, W.; Feng, R.; Chen, J. An Improved Pretraining Strategy-Based Scene Classification With Deep Learning. IEEE Geosci. Remote Sens. Lett. 2019, 17, 844–848.
25. Shi, C.; Wang, T.; Wang, L. Branch Feature Fusion Convolution Network for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5194–5210.
26. Zhang, B.; Zhang, Y.; Wang, S. A Lightweight and Discriminative Model for Remote Sensing Scene Classification With Multidilation Pooling Module. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2636–2653.
27. Huang, H.; Xu, K. Combing Triple-Part Features of Convolutional Neural Networks for Scene Classification in Remote Sensing. Remote Sens. 2019, 11, 1687.
28. Ahonen, T.; Hadid, A.; Pietikäinen, M. Face Description with Local Binary Patterns: Application to Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2037–2041.
29. Fang, J.; Yuan, Y.; Lu, X.; Feng, Y. Robust Space–Frequency Joint Representation for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7492–7502.
30. Zhang, J.; Zhang, M.; Shi, L.; Yan, W.; Pan, B. A Multi-Scale Approach for Remote Sensing Scene Classification Based on Feature Maps Selection and Region Representation. Remote Sens. 2019, 11, 2504.
31. Yuan, Y.; Fang, J.; Lu, X.; Feng, Y. Remote Sensing Image Scene Classification Using Rearranged Local Features. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1779–1792.
32. Li, E.; Xia, J.; Du, P.; Lin, C.; Samat, A. Integrating Multilayer Features of Convolutional Neural Networks for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5653–5665.
33. Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep Feature Fusion for VHR Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784.
34. Liang, J.; Deng, Y.; Zeng, D. A Deep Neural Network Combined CNN and GCN for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4325–4338.
35. Bi, Q.; Qin, K.; Zhang, H.; Li, Z.; Xu, K. RADC-Net: A residual attention based convolution network for aerial scene classification. Neurocomputing 2020, 377, 345–359.
36. Xu, R.; Tao, Y.; Lu, Z.; Zhong, Y. Attention-Mechanism-Containing Neural Networks for High-Resolution Remote Sensing Image Classification. Remote Sens. 2018, 10, 1602.
37. Chen, J.; Wang, C.; Ma, Z.; Chen, J.; He, D.; Ackland, S. Remote Sensing Scene Classification Based on Convolutional Neural Networks Pre-Trained Using Attention-Guided Sparse Filters. Remote Sens. 2018, 10, 290.
38. Kim, I.; Baek, W.; Kim, S. Spatially attentive output layer for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9533–9542.
39. Shen, J.; Zhang, T.; Wang, Y.; Wang, R.; Wang, Q.; Qi, M. A Dual-Model Architecture with Grouping-Attention-Fusion for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 433.
40. Li, J.; Lin, D.; Wang, Y.; Xu, G.; Zhang, Y.; Ding, C.; Zhou, Y. Deep Discriminative Representation Learning with Attention Map for Scene Classification. Remote Sens. 2020, 12, 1366.
41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
42. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
43. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Long and Short Papers. Volume 1, pp. 4171–4186.
44. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv 2021, arXiv:2102.12122.
45. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Tomizuka, M.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677.
46. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3285–3294.
47. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. arXiv 2021, arXiv:2101.11605.
48. Wu, H.; Zhao, S.; Li, L.; Lu, C.; Chen, W. Self-Attention Network With Joint Loss for Remote Sensing Image Scene Classification. IEEE Access 2020, 8, 210347–210359.
49. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. arXiv 2018, arXiv:1803.02155.
50. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; p. 270.
51. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
52. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
53. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems; Curran Associates: Vancouver, BC, Canada, 2019; pp. 8024–8035.
54. Xiong, W.; Lv, Y.; Cui, Y.; Zhang, X.; Gu, X. A Discriminative Feature Learning Approach for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 281.
55. Lv, Y.; Zhang, X.; Xiong, W.; Cui, Y.; Cai, M. An end-to-end local-global-fusion feature extraction network for remote sensing image scene classification. Remote Sens. 2019, 11, 3006.
56. Liu, B.-D.; Meng, J.; Xie, W.-Y.; Shao, S.; Li, Y.; Wang, Y. Weighted Spatial Pyramid Matching Collaborative Representation for Remote-Sensing-Image Scene Classification. Remote Sens. 2019, 11, 518.
57. Liu, X.; Zhou, Y.; Zhao, J.; Yao, R.; Liu, B.; Zheng, Y. Siamese Convolutional Neural Networks for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1200–1204.
58. He, N.; Fang, L.; Li, S.; Plaza, J.; Plaza, A. Skip-Connected Covariance Network for Remote Sensing Scene Classification. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 1461–1474.
59. Shi, C.; Zhao, X.; Wang, L. A Multi-Branch Feature Fusion Strategy Based on an Attention Mechanism for Remote Sensing Image Scene Classification. Remote Sens. 2021, 13, 1950.
60. Zhao, Z.; Li, J.; Luo, Z.; Li, J.; Chen, C. Remote Sensing Image Scene Classification Based on an Enhanced Attention Module. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1926–1930.
61. Fan, R.; Wang, L.; Feng, R.; Zhu, Y. Attention based Residual Network for High-Resolution Remote Sensing Imagery Scene Classification. In Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019.
62. Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention Consistent Network for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045.
63. Guo, D.; Xia, Y.; Luo, X. Scene Classification of Remote Sensing Images Based on Saliency Dual Attention Residual Network. IEEE Access 2020, 8, 6344–6357.
64. Guo, Y.; Ji, J.; Lu, X.; Huo, H.; Fang, T.; Li, D. Global-Local Attention Network for Aerial Scene Classification. IEEE Access 2019, 7, 67200–67212.
65. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Venice, Italy, 2017; pp. 618–626.

Figure 1. Scene classification images. (a) Intra-class differences: airport. (b) Inter-class differences: freeway, railway, runway.

Figure 2. Overall framework. (a) The original ResNet50 structure. (b) The framework of our method. The numbers in the green box represent input dimension, convolution kernel size and output dimension respectively.

Figure 3. Residual structure.

Figure 4. Global self-attention module.

Figure 5. Local perception unit.

Figure 6. Images for each type in the UCM dataset.

Figure 7. Images for each type in the NWPU dataset.

Figure 8. Images for each type in the AID dataset.

Figure 9. Confusion matrices of the UC Merced dataset under the training ratio of 50%.

Figure 10. Confusion matrices of the NWPU dataset under the training ratio of 20%.

Figure 11. Scene comparison.

Figure 12. Confusion matrices of the AID dataset under the training ratio of 20%.

Figure 13. Scene comparison.

Figure 14. Comparison with and without global self-attention module.

Figure 15. Feature visualization.

Figure 16. Comparison with and without data augmentation.

Figure 17. Comparison with and without relative position coding.

Figure 18. Comparison with and without local perception unit (LPU).

Table 1. Dataset characteristics.

Dataset	Number of Classes	Number of Images Per Class	Image Size (pixels)	Resolution (m)	Year
UC Merced land-use dataset (UCM)	21	100	256 × 256	0.3	2010
NWPU-RESISC45 dataset (NWPU)	45	700	256 × 256	0.2–30	2016
Aerial Image Dataset (AID)	30	220–420	600 × 600	0.5–8	2017

Table 2. Overall accuracy (OA), kappa, and F1 score on the different datasets.

Dataset	OA (%)	Kappa (%)	F1 (%)
UCM (50%)	98.80	98.75	98.82
UCM (80%)	99.50	99.21	99.25
NWPU (10%)	92.11	91.99	92.15
NWPU (20%)	94.00	93.87	94.04
AID (20%)	95.30	95.27	95.22
AID (50%)	97.10	97.06	97.05
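
As a reminder of how the three metrics in Table 2 relate to a confusion matrix, the sketch below computes overall accuracy, Cohen's kappa, and a class-averaged F1 score in plain Python. The macro averaging of F1 is an assumption (the definitions used for the table are those of Section 3.2.2), and the 2 × 2 matrix is a made-up toy example, not data from the experiments.

```python
def classification_metrics(cm):
    """Return (overall accuracy, Cohen's kappa, macro-averaged F1).

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
    correct = sum(cm[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of correctly classified samples.
    oa = correct / total

    # Kappa corrects OA for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected)

    # Per-class F1, then an unweighted mean over classes (assumed macro average).
    f1_scores = []
    for i in range(n_classes):
        precision = cm[i][i] / col_sums[i] if col_sums[i] else 0.0
        recall = cm[i][i] / row_sums[i] if row_sums[i] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n_classes
    return oa, kappa, macro_f1


# Toy two-class confusion matrix (illustrative values only).
toy_cm = [[8, 2],
          [1, 9]]
oa, kappa, macro_f1 = classification_metrics(toy_cm)
```

Kappa is lower than OA in the toy case because half of the agreement would already be expected by chance with two balanced classes; with 21 to 45 classes, as in the datasets here, chance agreement is small and kappa stays close to OA.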

Table 3. Comparison of the OA of each method on the UCM dataset.

Methods	OA (50/50) (%)	OA (80/20) (%)
ResNet + weighted spatial pyramid matching collaborative representation based classification (WSPM-CRC) [56]	--	97.95
Siamese ResNet_50 [57]	90.95	94.29
Skip-connected covariance (SCCov) [58]	--	99.05 ± 0.25
Deep discriminative representation learning with attention map (DDRL-AM) [40]	--	99.05 ± 0.08
Attention-oriented multi-branch CNN (AMB-CNN) [59]	--	99.52 ± 0.11
ResNet-50 + enhanced attention module (EAM) [60]	--	98.98 ± 0.37
Attention based Residual Network [61]	--	98.81 ± 0.30
Attention consistent network (ACNet) [62]	--	99.76 ± 0.10
Our Method	98.80 ± 0.13	99.50 ± 0.08

Table 4. Comparison of the OA of each method on the NWPU dataset.

Methods	OA (10/90) (%)	OA (20/80) (%)
Saliency dual attention residual network (SDAResNet) [63]	89.40	91.15
Siamese ResNet_50 [57]	--	92.28
Global-local attention network (GLANet) [64]	91.03 ± 0.18	93.45 ± 0.17
SCCov [58]	89.30 ± 0.35	92.10 ± 0.25
DDRL-AM [40]	92.17 ± 0.08	92.46 ± 0.09
AMB-CNN [59]	88.99 ± 0.14	92.42 ± 0.14
ResNet-50+EAM [60]	90.87 ± 0.15	93.51 ± 0.12
Attention based Residual Network [61]	--	92.10 ± 0.30
ACNet [62]	91.09 ± 0.13	92.42 ± 0.16
Our Method	92.11 ± 0.06	94.00 ± 0.13

Table 5. Comparison of the OA of each method on the AID dataset.

Methods	OA (20/80) (%)	OA (50/50) (%)
GLANet [64]	95.02 ± 0.28	96.66 ± 0.19
SCCov [58]	93.12 ± 0.25	96.10 ± 0.16
DDRL-AM [40]	92.36 ± 0.10	--
AMB-CNN [59]	93.27 ± 0.22	95.54 ± 0.13
ResNet-50+EAM [60]	93.64 ± 0.25	96.62 ± 0.13
ACNet [62]	93.33 ± 0.29	95.38 ± 0.29
Our Method	95.30 ± 0.19	97.10 ± 0.23

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

