TRS: Transformers for Remote Sensing Scene Classification

Zhang, Jianrong; Zhao, Hongwei; Li, Jiao

doi:10.3390/rs13204143

Open AccessArticle

TRS: Transformers for Remote Sensing Scene Classification

by

Jianrong Zhang

^1,2,

Hongwei Zhao

^1,2

and

Jiao Li

^3,*

¹

College of Computer Science and Technology, Jilin University, Changchun 130012, China

²

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

³

Department of Jilin University Library, Jilin University, Changchun 130012, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2021, 13(20), 4143; https://doi.org/10.3390/rs13204143

Submission received: 11 September 2021 / Revised: 11 October 2021 / Accepted: 13 October 2021 / Published: 16 October 2021

(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Classification)

Download

Browse Figures

Versions Notes

Abstract

Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important method of the attention-based model, the Transformer has achieved great success in the field of natural language processing. Recently, the Transformer has been used for computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model’s ability to learn the overall features of the image. In this paper, we propose a new remote sensing scene classification method, Remote Sensing Transformer (TRS), a powerful “pure CNNs → Convolution + Transformer → pure Transformers” structure. First, we integrate self-attention into ResNet in a novel way, using our proposed Multi-Head Self-Attention layer instead of 3 × 3 spatial revolutions in the bottleneck. Then we connect multiple pure Transformer encoders to further improve the representation learning performance completely depending on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS exceeds the state-of-the-art methods and achieves higher accuracy.

Keywords: transformers; deep convolutional neural networks; multi-head self-attention; remote sensing scene classification

Share and Cite

MDPI and ACS Style

Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143. https://doi.org/10.3390/rs13204143

AMA Style

Zhang J, Zhao H, Li J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sensing. 2021; 13(20):4143. https://doi.org/10.3390/rs13204143

Chicago/Turabian Style

Zhang, Jianrong, Hongwei Zhao, and Jiao Li. 2021. "TRS: Transformers for Remote Sensing Scene Classification" Remote Sensing 13, no. 20: 4143. https://doi.org/10.3390/rs13204143

APA Style

Zhang, J., Zhao, H., & Li, J. (2021). TRS: Transformers for Remote Sensing Scene Classification. Remote Sensing, 13(20), 4143. https://doi.org/10.3390/rs13204143

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

TRS: Transformers for Remote Sensing Scene Classification

Abstract

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI