MSCAC: A Multi-Scale Swin–CNN Framework for Progressive Remote Sensing Scene Classification
Abstract
1. Introduction
2. Materials and Methods
2.1. Overview of the Proposed Model
2.2. Materials
2.3. Research Challenges in Aerial Image Classification
2.4. Multi-Scale Swin–CNN Aerial Classifier (MSCAC)
2.4.1. Local Feature Extraction
2.4.2. Multilevel Feature Fusion
2.4.3. Global Feature Extraction
2.4.4. Feature Fusion and Classification
3. Results and Discussion
3.1. Experimental Setup
3.2. Evaluation Metrics
3.3. Comparison with State-of-the-Art Models
3.3.1. UC-Merced Dataset
3.3.2. WHU-RS19 Dataset
3.3.3. RSSCN7 Dataset
3.3.4. Aerial Image Dataset (AID)
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 20 May 2024).
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Al Dayil, R.; Al Ajlan, N. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Swain, M.J.; Ballard, D.H. Color indexing. Int. J. Comput. Vis. 1991, 7, 11–32.
- Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
- Jain, A.K.; Ratha, N.K.; Lakshmanan, S. Object detection using Gabor filters. Pattern Recognit. 1997, 30, 295–309.
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
- Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554.
- Hinton, G.; Salakhutdinov, R. An efficient learning procedure for deep Boltzmann machines. Neural Comput. 2012, 24, 1967–2006.
- Zhao, W.; Guo, Z.; Yue, J.; Zhang, X.; Luo, L. On combining multiscale deep learning features for the classification of hyperspectral remote sensing imagery. Int. J. Remote Sens. 2015, 36, 3368–3379.
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105. Available online: https://proceedings.neurips.cc/paper/2012 (accessed on 29 May 2024).
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. arXiv 2014, arXiv:1409.4842.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1735–1739.
- Sheppard, C.; Rahnemoonfar, M. Real-time scene understanding for UAV imagery based on deep convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 2243–2246.
- Yu, Y.; Liu, F. Dense connectivity based two-stream deep feature fusion framework for aerial scene classification. Remote Sens. 2018, 10, 1158.
- Ye, L.; Wang, L.; Sun, Y.; Zhao, L.; Wei, Y. Parallel multi-stage features fusion of deep convolutional neural networks for aerial scene classification. Remote Sens. Lett. 2018, 9, 294–303.
- Sen, O.; Keles, H.Y. Scene recognition with deep learning methods using aerial images. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; pp. 1–4.
- Anwer, R.M.; Khan, F.S.; Laaksonen, J. Compact deep color features for remote sensing scene classification. Neural Process. Lett. 2021, 53, 1523–1544.
- Huang, W.; Yuan, Z.; Yang, A.; Tang, C.; Luo, X. TAE-Net: Task-adaptive embedding network for few-shot remote sensing scene classification. Remote Sens. 2022, 14, 111.
- Wang, X.; Yuan, L.; Xu, H.; Wen, X. CSDS: End-to-end aerial scenes classification with depthwise separable convolution and an attention mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10484–10499.
- El-Khamy, S.E.; Al-Kabbany, A.; El-bana, S. MLRS-CNN-DWTPL: A new enhanced multi-label remote sensing scene classification using deep neural networks with wavelet pooling layers. In Proceedings of the 2021 International Telecommunications Conference (ITC-Egypt), Alexandria, Egypt, 13–15 July 2021; pp. 1–5.
- Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for remote sensing scene classification. Remote Sens. 2021, 13, 4143.
- Alhichri, H.; Alswayed, A.S.; Bazi, Y.; Ammour, N.; Alajlan, N. Classification of remote sensing images using EfficientNet-B3 CNN model with attention. IEEE Access 2021, 9, 14078–14094.
- Guo, D.; Xia, Y.; Luo, X. GAN-based semisupervised scene classification of remote sensing image. IEEE Geosci. Remote Sens. Lett. 2021, 18, 2067–2071.
- Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-stream Swin Transformer with differentiable Sobel operator for remote sensing image classification. Remote Sens. 2022, 14, 1507.
- Wang, H.; Gao, K.; Min, L.; Mao, Y.; Zhang, X.; Wang, J.; Hu, Z.; Liu, Y. Triplet-metric-guided multi-scale attention for remote sensing image scene classification with a convolutional neural network. Remote Sens. 2022, 14, 2794.
- Zheng, F.; Lin, S.; Zhou, W.; Huang, H. A lightweight dual-branch Swin Transformer for remote sensing scene classification. Remote Sens. 2023, 15, 2865.
- Thapa, A.; Horanont, T.; Neupane, B.; Aryal, J. Deep learning for remote sensing image scene classification: A review and meta-analysis. Remote Sens. 2023, 15, 4804.
- Chen, Z.; Yang, J.; Feng, Z.; Chen, L.; Li, L. BiShuffleNeXt: A lightweight bi-path network for remote sensing scene classification. Measurement 2023, 209, 112537.
- Wang, W.; Sun, Y.; Li, J.; Wang, X. Frequency and spatial based multi-layer context network (FSCNet) for remote sensing scene classification. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103781.
- Sivasubramanian, A.; Prashanth, V.R.; Hari, T.; Sowmya, V.; Gopalakrishnan, E.A.; Ravi, V. Transformer-based convolutional neural network approach for remote sensing natural scene classification. Remote Sens. Appl. Soc. Environ. 2024, 33, 101126.
- Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An improved Swin Transformer-based model for remote sensing object detection and instance segmentation. Remote Sens. 2021, 13, 4779.
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 270–279.
- Dai, D.; Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176.
- Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
- Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
- Yu, Y.; Liu, F. Aerial scene classification via multilevel fusion based on deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 287–291.
Authors | Methodology | Dataset Used |
---|---|---|
Sheppard and Rahnemoonfar [22] | Deep CNN for UAV imagery | Custom UAV dataset |
Yu and Liu [23] | Two-stream deep feature fusion model | UCM, AID, and NWPU-RESISC45 |
Ye et al. [24] | Parallel multi-stage (PMS) architecture | UCM and AID
Sen and Keles [25] | Hierarchically designed CNN | NWPU-RESISC45 |
Anwer et al. [26] | Deep color model fusion | UCM, WHU-RS19, RSSCN7, AID, and NWPU-RESISC45 |
Huang et al. [27] | Task-Adaptive Embedding Network (TAE-Net) | UCM, WHU-RS19, and NWPU-RESISC45 |
Wang et al. [28] | Channel–Spatial Depthwise Separable (CSDS) | AID and NWPU-RESISC45 |
El-Khamy, Al-Kabbany, and El-bana [29] | CNN with wavelet transform pooling | UCM and AID |
Zhang et al. [30] | Remote Sensing Transformer (TRS) with self-attention and ResNet | UCM, AID, NWPU-RESISC45, and OPTIMAL-31
Alhichri et al. [31] | EfficientNet-B3 CNN with attention | UC-Merced, KSA, OPTIMAL-31, RSSCN7, WHU-RS19, and AID
Guo et al. [32] | GAN-based semisupervised scene classification | UC-Merced and EuroSAT
Hao et al. [33] | Two-stream Swin Transformer network (original and edge stream features) | UCM, AID, and NWPU-RESISC45
Wang et al. [34] | Triplet-metric-guided multi-scale attention (TMGMA) | UCM, AID, and NWPU-RESISC45
Zheng et al. [35] | Lightweight dual-branch Swin Transformer (ViT and CNN branches) | UCM, AID, and NWPU-RESISC45
Thapa et al. [36] | Review of CNN-based, Vision Transformer (ViT)-based, and Generative Adversarial Network (GAN)-based architectures | AID and NWPU-RESISC45
Chen et al. [37] | BiShuffleNeXt | UCM, AID, and NWPU-RESISC45
Wang et al. [38] | Frequency- and spatial-based multi-layer context network (FSCNet) | UCM, AID, and NWPU-RESISC45
Sivasubramanian et al. [39] | Transformer-based convolutional neural network | UCM, WHU-RS19, OPTIMAL-31, RSI-CB256, and MLRSNet |
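The dual-branch CNN–Transformer pattern that recurs in the table above (e.g., [33,35]) and that MSCAC builds on can be summarized in a minimal PyTorch sketch. This is an illustrative composition, not the authors' MSCAC implementation: the torchvision ResNet-18 trunk, the timm Swin-Tiny backbone, the feature dimensions, and concatenation-plus-linear-head fusion are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import timm                      # assumed available; supplies Swin backbones
from torchvision import models

class DualBranchClassifier(nn.Module):
    """Illustrative CNN + Swin dual-branch classifier (not the authors' exact MSCAC)."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Local branch: ResNet-18 trunk up to global average pooling -> (B, 512, 1, 1).
        cnn = models.resnet18(weights=None)
        self.local_branch = nn.Sequential(*list(cnn.children())[:-1])
        # Global branch: Swin-Tiny; num_classes=0 makes timm return pooled 768-d features.
        self.global_branch = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)
        # Fuse by concatenation, then classify with a single linear head.
        self.head = nn.Linear(512 + 768, num_classes)

    def forward(self, x):
        f_local = self.local_branch(x).flatten(1)   # (B, 512) local texture/detail cues
        f_global = self.global_branch(x)            # (B, 768) window-attention context
        return self.head(torch.cat([f_local, f_global], dim=1))

model = DualBranchClassifier(num_classes=21)        # e.g., the 21 UC-Merced classes
logits = model(torch.randn(2, 3, 224, 224))         # Swin-Tiny expects 224 x 224 input
print(logits.shape)                                 # torch.Size([2, 21])
```

Concatenating pooled local and global descriptors before a single head is the simplest of the fusion strategies surveyed above; the multilevel fusion of Section 2.4.2 would instead tap intermediate feature maps from both branches.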
Dataset | Scene Classes | Samples/Class | Image Size | Spatial Resolution | Challenging Factor |
---|---|---|---|---|---|
UC-Merced [41] | 21 | 100 | 256 × 256 | 0.3 m | Overlapping classes with different structure densities. |
WHU-RS19 [42] | 19 | 50 | 600 × 600 | Up to 0.5 m | Resolution, scale, orientation, and illumination variations. |
RSSCN7 [43] | 7 | 400 | 400 × 400 | - | Google Earth images with scale variations, cropped at four scales. |
AID [44] | 30 | 200–400 | 600 × 600 | 0.5–8 m | Multi-source images from various countries, seasons, and conditions, increasing intra-class diversity. |
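The comparisons that follow use fixed training ratios per dataset (50%/80% for UC-Merced, 40%/60% for WHU-RS19). A minimal sketch of one way to realize such a split is shown below; the dataset path is a placeholder, and `random_split` is a plain random split, whereas published protocols often sample per class (stratified), so this is an approximation rather than the authors' stated procedure.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Placeholder path; the UC-Merced archive (one folder per class) fits ImageFolder.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ucm = datasets.ImageFolder("data/UCMerced_LandUse/Images", transform=tfm)

ratio = 0.5                               # e.g., the 50% training setting
n_train = int(ratio * len(ucm))
train_set, test_set = random_split(
    ucm, [n_train, len(ucm) - n_train],
    generator=torch.Generator().manual_seed(0))  # vary the seed across repeated runs
```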
Classification accuracy (%) on the UC-Merced dataset under 50% and 80% training ratios.

Method | Accuracy (50% Training) | Accuracy (80% Training)
---|---|---
SIFT [44] | 28.92 ± 0.95 | 32.10 ± 1.95
BoVW (SIFT) [44] | 71.90 ± 0.79 | 74.12 ± 3.30
CaffeNet [44] | 93.98 ± 0.67 | 95.02 ± 0.81
GoogLeNet [44] | 92.70 ± 0.60 | 94.31 ± 0.89
VGG-VD-16 [44] | 94.14 ± 0.69 | 95.21 ± 1.20
MSCAC (proposed) | 94.01 ± 0.93 | 94.67 ± 0.89
Classification accuracy (%) on the WHU-RS19 dataset under 40% and 60% training ratios.

Method | Accuracy (40% Training) | Accuracy (60% Training)
---|---|---
SIFT [44] | 25.37 ± 1.32 | 27.21 ± 1.77
BoVW (SIFT) [44] | 75.26 ± 1.39 | 80.13 ± 2.01
CaffeNet [44] | 95.11 ± 1.20 | 96.24 ± 0.56
GoogLeNet [44] | 93.12 ± 0.82 | 96.05 ± 0.91
VGG-VD-16 [44] | 95.44 ± 0.60 | 96.05 ± 0.91
MSCAC (proposed) | 94.99 ± 0.89 | 96.57 ± 1.20
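Both tables report overall accuracy as mean ± standard deviation over repeated runs with different random splits. A minimal sketch of that computation follows; the per-run values are illustrative placeholders, not measured results.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Overall accuracy (%): share of correctly classified test scenes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * float((y_true == y_pred).mean())

# Per-run accuracies from repeated random splits (illustrative values only).
runs = [94.8, 95.3, 94.6, 95.1, 94.9]
print(f"{np.mean(runs):.2f} ± {np.std(runs):.2f}")  # the "mean ± std" format of the tables
```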
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).