1. Introduction
Scene classification, which aims to derive contextual details and assign a categorical label to a given image, has garnered considerable interest in the intelligent interpretation of remote sensing (RS) imagery. However, the considerable variation within classes and the minimal distinction between classes in RS images make precise scene classification difficult. Current methods for scene classification in RS can be broadly categorized into two groups: (1) handcrafted feature-based methods and (2) deep-learning-based methods. Handcrafted feature-based methods rely on manually designed descriptors such as Gabor filters, local binary patterns (LBPs), and the bag of visual words (BoVW); these approaches often yield suboptimal classification results owing to their limited representational ability. Deep-learning-based methods, on the other hand, have demonstrated a superior ability to automatically learn discriminative features from large image datasets.
Despite their promise, deep-learning-based methods face several challenges. One significant issue is the considerable variation within classes, where scenes of the same category can look very different due to changes in lighting, weather, and seasonal effects. Another challenge is the minimal distinction between classes; different scene categories can appear visually similar, making it difficult for models to differentiate between them. Additionally, the scale variability of objects and the complexity of spatial arrangements in aerial images further complicate the classification task.
The shift from traditional handcrafted feature-based methods to CNNs and Transformer-based approaches has significantly advanced models' ability to extract complex spatial patterns and interpret large-scale aerial images. The Transformer, initially developed for sequence modeling and transduction tasks and built on attention mechanisms, has shown outstanding capabilities in computer vision [1]. Dosovitskiy et al. [2] applied the Transformer architecture to distinct, non-overlapping image patches for image classification, and the resulting Vision Transformer (ViT) outperformed CNNs in image classification. Bazi et al. [3] adapted the Transformer architecture to improve the accuracy of scene classification in RS. Unlike CNNs, the Transformer is better able to capture long-range relationships among local features in RS imagery.
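To make the patch-based tokenization behind ViT concrete, the minimal PyTorch sketch below shows how non-overlapping patches become a token sequence for classification. The 224 × 224 input, 16 × 16 patches, and 768-dimensional embedding follow the common ViT-Base configuration and are illustrative assumptions, not details of the cited works.

```python
# A minimal sketch of ViT-style patch embedding, assuming PyTorch;
# the sizes mirror the common ViT-Base configuration for illustration only.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing non-overlapping
        # patches and applying a shared linear projection to each of them.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend classification token
        return x + self.pos_embed             # add learned positions

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is what the Transformer encoder layers then process with self-attention, which is how long-range relationships among local features are captured.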
Recent research has highlighted the effectiveness of Transformer models, particularly the Swin Transformer, in image classification [4]. This paradigm shift centers on deep-learning architectures capable of capturing the global semantic information essential for accurately categorizing aerial scenes. The Swin Transformer's hierarchical, shifted-window architecture enables a more effective and detailed interpretation of aerial images by addressing the scale variability of objects and the complexity of spatial arrangements found in such imagery.
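The core of that architecture is sketched below, assuming PyTorch: self-attention is restricted to non-overlapping local windows, and alternating blocks shift the windows so information can cross window boundaries. The window size of 7 and the stage-1 feature-map shape mirror the published Swin-T defaults and are used here purely for illustration.

```python
# A minimal sketch of (shifted) window partitioning as used in the
# Swin Transformer, assuming PyTorch; shapes follow Swin-T defaults.
import torch

def window_partition(x, window_size=7):
    """Split a feature map (B, H, W, C) into non-overlapping windows of
    shape (num_windows*B, window_size*window_size, C); self-attention is
    then computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return (x.permute(0, 1, 3, 2, 4, 5)
             .reshape(-1, window_size * window_size, C))

feat = torch.randn(2, 56, 56, 96)      # stage-1 feature map of Swin-T
local = window_partition(feat)         # (128, 49, 96): 64 windows x batch 2
# Shifting the map by half a window before partitioning lets successive
# blocks exchange information across window boundaries.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
print(local.shape, window_partition(shifted).shape)
```

Because attention cost grows with the square of the number of tokens attended to, confining it to fixed-size windows keeps the computation linear in image size, which is what makes the design practical for large aerial images.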
Early works in aerial scene classification primarily focused on handcrafted features to represent the content of scene images. These features were designed with engineering skill and domain expertise, capturing characteristics such as color, texture, shape, and spatial and spectral information. Some of the most representative handcrafted features used in early works include color histograms [5], texture descriptors [6,7,8], the global image similarity transformation (GIST) [9], the scale-invariant feature transform (SIFT) [10], and the histogram of oriented gradients (HOG), as illustrated in the sketch below.
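The following sketch, assuming NumPy and scikit-image, concatenates a color histogram, an LBP histogram, and HOG features into the kind of fixed-length vector that early pipelines fed to a classical classifier. The bin counts, LBP radius, and HOG cell size are illustrative choices, not parameters from the cited works.

```python
# A hedged sketch of classical handcrafted descriptors for a scene image,
# assuming NumPy and scikit-image; all parameter choices are illustrative.
import numpy as np
from skimage import color, feature

def handcrafted_descriptor(rgb):
    """Concatenate a color histogram, an LBP histogram, and HOG features."""
    gray = color.rgb2gray(rgb)
    # Color histogram: 8 bins per channel, normalized over all pixels.
    col = np.concatenate([
        np.histogram(rgb[..., c], bins=8, range=(0, 1))[0] for c in range(3)
    ]).astype(float)
    col /= col.sum()
    # Local binary patterns: uniform patterns, 8 neighbors, radius 1.
    lbp = feature.local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # Histogram of oriented gradients over 8x8-pixel cells.
    hog = feature.hog(gray, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2))
    return np.concatenate([col, lbp_hist, hog])

vec = handcrafted_descriptor(np.random.rand(64, 64, 3))
print(vec.shape)  # fixed-length vector for a classical classifier, e.g., an SVM
```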
As the field progressed, the focus shifted towards developing more generalizable and automated feature extraction methods, leading to the adoption of deep-learning techniques. Deep-learning models, with their ability to learn hierarchical feature representations directly from data, have significantly advanced the state of the art in aerial scene classification, overcoming many limitations of early handcrafted features. Notably, in 2006, Hinton and Salakhutdinov achieved a significant advancement in deep feature learning [11]. Since then, researchers have sought to replace handcrafted features with trainable multi-layer networks, which have demonstrated a remarkable capability for feature representation in various applications, including scene classification in remote sensing images. These deep-learning features, automatically extracted from data by deep neural networks, offer a significant advantage over traditional handcrafted features, which require extensive engineering skill and domain knowledge. With multiple processing layers, deep-learning models capture robust and diverse abstractions of the data, proving highly effective in uncovering complex patterns and distinguishing characteristics in high-dimensional data. Currently, various deep-learning models are available, including deep belief networks (DBNs) [12], deep Boltzmann machines (DBMs) [13], the multiscale convolutional auto-encoder [14], the stacked autoencoder (SAE) [15], CNNs [16,17,18,19,20], the bag of convolutional features (BoCF) [21], and so on.
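To make the idea of hierarchical feature learning concrete, the minimal PyTorch sketch below stacks convolutional blocks whose early layers respond to edges and textures and whose deeper layers encode more abstract scene structure, ending in a classification head. The layer widths and the 10-class output are arbitrary, illustrative choices rather than any cited architecture.

```python
# A minimal sketch of hierarchical feature learning with a small CNN,
# assuming PyTorch; layer widths and the 10-class head are illustrative.
import torch
import torch.nn as nn

class TinySceneCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Stacked conv blocks: early layers capture edges and textures,
        # deeper layers increasingly abstract scene structures.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = TinySceneCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```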
The field of aerial scene classification has evolved significantly, transitioning from handcrafted feature-based methods to advanced deep-learning techniques. Table 1 summarizes notable deep-learning-based studies in aerial scene classification, highlighting the authors, methodologies, and datasets used in their research.
Sheppard and Rahnemoonfar [22] focused on the real-time interpretation of UAV imagery with deep CNNs, achieving high accuracy in real-time applications. Yu and Liu [23] proposed texture- and saliency-coded two-stream deep architectures that enhance classification accuracy. Ye et al. [24] proposed hybrid CNN features for aerial scene classification, combined with an ensemble extreme-learning-machine classifier, achieving remarkable performance. Sen and Keles [25] developed a hierarchically designed CNN model, achieving high accuracy on the NWPU-RESISC45 dataset. Anwer et al. [26] explored the significance of color within deep-learning frameworks for aerial scene classification, demonstrating that fusing several deep color models significantly improves recognition performance. Huang et al. [27] introduced the Task-Adaptive Embedding Network (TAE-Net) for few-shot remote sensing scene classification, designed to adapt to different tasks with limited labeled samples. Wang et al. [28] proposed the Channel–Spatial Depthwise Separable (CSDS) network, incorporating a channel–spatial attention mechanism, and El-Khamy et al. [29] developed a CNN model using wavelet-transform pooling for multi-label RS scene classification.
Recent advancements in remote sensing scene classification have leveraged Transformer-based architectures and multi-scale feature integration for enhanced performance. Zhao and Li [30] introduced the Remote Sensing Transformer (TRS), which integrates self-attention into ResNet and employs pure Transformer encoders for improved classification performance. Alhichri et al. [31] utilized an EfficientNet-B3 CNN with attention, showing strong capabilities in classifying RS scenes. Guo et al. [32] proposed a GAN-based semisupervised scene classification method. Wang et al. [33] developed a two-stream Swin Transformer network that uses both original and edge-stream features to enhance classification accuracy. Hu and Liu [34] proposed the triplet-metric-guided multi-scale attention (TMGMA) method, enhancing salient features while suppressing redundant ones. Zhou and Huang [35] proposed a lightweight dual-branch Swin Transformer combining ViT and CNN branches to improve scene-feature discrimination and reduce computational cost. Thapa et al. [36] reviewed CNN-based, Vision Transformer (ViT)-based, and generative adversarial network (GAN)-based architectures. Chen et al. [37] developed the BiShuffleNeXt model for RS scene classification. Wang et al. [38] introduced the frequency- and spatial-based multi-layer attention network (FSCNet). Sivasubramanian et al. [39] proposed a Transformer-based convolutional neural network, evaluated on multiple datasets. Shang and Ye [40] improved Swin-Transformer-based models for object detection and segmentation, showing significant improvements in small-object detection and edge-detail segmentation.
This literature survey highlights significant advancements in deep-learning-based aerial scene classification, showcasing diverse techniques and their continuous evolution. Approaches like deep color model fusion, texture- and saliency-coded two-stream architectures, and CNN-based models have notably improved accuracy. Hybrid CNN features and few-shot classification networks like TAE-Net have shown remarkable results. Real-time UAV image interpretations with deep CNNs have also achieved high accuracy. Techniques incorporating channel–spatial attention mechanisms and wavelet transform pooling layers have enhanced multi-label classification. Transformer-based architectures and multi-scale feature integration have further advanced the field. Recent contributions include EfficientNet-B3 with attention, GAN-based semisupervised classification, two-stream Swin Transformer networks, and triplet-metric-guided multi-scale attention methods. Lightweight dual-branch Swin Transformers combining ViT and CNN branches have also been introduced to improve feature discrimination and reduce computational consumption. These advancements pave the way for our proposed methodology, which combines CNNs and Swin Transformers to address existing classification challenges effectively.