MSCAC: A Multi-Scale Swin–CNN Framework for Progressive Remote Sensing Scene Classification
Abstract
1. Introduction
2. Materials and Methods
2.1. Overview of the Proposed Model
2.2. Materials
2.3. Research Challenges in Aerial Image Classification
2.4. Multi-Scale Swin–CNN Aerial Classifier (MSCAC)
2.4.1. Local Feature Extraction
2.4.2. Multilevel Feature Fusion
2.4.3. Global Feature Extraction
2.4.4. Feature Fusion and Classification
3. Results and Discussion
3.1. Experimental Setup
3.2. Evaluation Metrics
3.3. Comparison with State-of-the-Art Models
3.3.1. UC-Merced Dataset
3.3.2. WHU-RS19 Dataset
3.3.3. RSSCN7 Dataset
3.3.4. Aerial Image Dataset (AID)
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 20 May 2024).
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Al Dayil, R.; Al Ajlan, N. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Swain, M.J.; Ballard, D.H. Color indexing. Int. J. Comput. Vis. 1991, 7, 11–32.
- Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
- Jain, A.K.; Ratha, N.K.; Lakshmanan, S. Object detection using Gabor filters. Pattern Recognit. 1997, 30, 295–309.
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
- Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554.
- Hinton, G.; Salakhutdinov, R. An efficient learning procedure for deep Boltzmann machines. Neural Comput. 2012, 24, 1967–2006.
- Zhao, W.; Guo, Z.; Yue, J.; Zhang, X.; Luo, L. On combining multiscale deep learning features for the classification of hyperspectral remote sensing imagery. Int. J. Remote Sens. 2015, 36, 3368–3379.
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105. Available online: https://proceedings.neurips.cc/paper/2012 (accessed on 29 May 2024).
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. arXiv 2014, arXiv:1409.4842.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1735–1739.
- Sheppard, C.; Rahnemoonfar, M. Real-time scene understanding for UAV imagery based on deep convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 2243–2246.
- Yu, Y.; Liu, F. Dense connectivity based two-stream deep feature fusion framework for aerial scene classification. Remote Sens. 2018, 10, 1158.
- Ye, L.; Wang, L.; Sun, Y.; Zhao, L.; Wei, Y. Parallel multi-stage features fusion of deep convolutional neural networks for aerial scene classification. Remote Sens. Lett. 2018, 9, 294–303.
- Sen, O.; Keles, H.Y. Scene recognition with deep learning methods using aerial images. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; pp. 1–4.
- Anwer, R.M.; Khan, F.S.; Laaksonen, J. Compact deep color features for remote sensing scene classification. Neural Process. Lett. 2021, 53, 1523–1544.
- Huang, W.; Yuan, Z.; Yang, A.; Tang, C.; Luo, X. TAE-Net: Task-adaptive embedding network for few-shot remote sensing scene classification. Remote Sens. 2022, 14, 111.
- Wang, X.; Yuan, L.; Xu, H.; Wen, X. CSDS: End-to-end aerial scenes classification with depthwise separable convolution and an attention mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10484–10499.
- El-Khamy, S.E.; Al-Kabbany, A.; El-bana, S. MLRS-CNN-DWTPL: A new enhanced multi-label remote sensing scene classification using deep neural networks with wavelet pooling layers. In Proceedings of the 2021 International Telecommunications Conference (ITC-Egypt), Alexandria, Egypt, 13–15 July 2021; pp. 1–5.
- Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for remote sensing scene classification. Remote Sens. 2021, 13, 4143.
- Alhichri, H.; Alswayed, A.S.; Bazi, Y.; Ammour, N.; Alajlan, N. Classification of remote sensing images using EfficientNet-B3 CNN model with attention. IEEE Access 2021, 9, 14078–14094.
- Guo, D.; Xia, Y.; Luo, X. GAN-based semisupervised scene classification of remote sensing image. IEEE Geosci. Remote Sens. Lett. 2021, 18, 2067–2071.
- Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-stream Swin Transformer with differentiable Sobel operator for remote sensing image classification. Remote Sens. 2022, 14, 1507.
- Wang, H.; Gao, K.; Min, L.; Mao, Y.; Zhang, X.; Wang, J.; Hu, Z.; Liu, Y. Triplet-metric-guided multi-scale attention for remote sensing image scene classification with a convolutional neural network. Remote Sens. 2022, 14, 2794.
- Zheng, F.; Lin, S.; Zhou, W.; Huang, H. A lightweight dual-branch Swin Transformer for remote sensing scene classification. Remote Sens. 2023, 15, 2865.
- Thapa, A.; Horanont, T.; Neupane, B.; Aryal, J. Deep learning for remote sensing image scene classification: A review and meta-analysis. Remote Sens. 2023, 15, 4804.
- Chen, Z.; Yang, J.; Feng, Z.; Chen, L.; Li, L. BiShuffleNeXt: A lightweight bi-path network for remote sensing scene classification. Measurement 2023, 209, 112537.
- Wang, W.; Sun, Y.; Li, J.; Wang, X. Frequency and spatial based multi-layer context network (FSCNet) for remote sensing scene classification. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103781.
- Sivasubramanian, A.; Prashanth, V.R.; Hari, T.; Sowmya, V.; Gopalakrishnan, E.A.; Ravi, V. Transformer-based convolutional neural network approach for remote sensing natural scene classification. Remote Sens. Appl. Soc. Environ. 2024, 33, 101126.
- Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An improved Swin Transformer-based model for remote sensing object detection and instance segmentation. Remote Sens. 2021, 13, 4779.
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 270–279.
- Dai, D.; Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176.
- Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
- Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
- Yu, Y.; Liu, F. Aerial scene classification via multilevel fusion based on deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 287–291.
Authors | Methodology | Dataset Used |
---|---|---|
Sheppard and Rahnemoonfar [22] | Deep CNN for UAV imagery | Custom UAV dataset |
Yu and Liu [23] | Two-stream deep feature fusion model | UCM, AID, and NWPU-RESISC45 |
Ye et al. [24] | Parallel multi-stage (PMS) architecture | UCM and AID
Sen and Keles [25] | Hierarchically designed CNN | NWPU-RESISC45 |
Anwer et al. [26] | Deep color model fusion | UCM, WHU-RS19, RSSCN7, AID, and NWPU-RESISC45 |
Huang et al. [27] | Task-Adaptive Embedding Network (TAE-Net) | UCM, WHU-RS19, and NWPU-RESISC45 |
Wang et al. [28] | Channel–Spatial Depthwise Separable (CSDS) | AID and NWPU-RESISC45 |
El-Khamy, Al-Kabbany, and El-bana [29] | CNN with wavelet transform pooling | UCM and AID |
Zhang et al. [30] | Remote Sensing Transformer (TRS) with self-attention and ResNet | UCM, AID, NWPU-RESISC45, and OPTIMAL-31
Alhichri et al. [31] | EfficientNet-B3 CNN with attention | UC-Merced, KSA, OPTIMAL-31, RSSCN7, WHU-RS19, and AID
Guo et al. [32] | GAN-based semisupervised scene classification | UC-Merced and EuroSAT
Hao et al. [33] | Two-stream Swin Transformer network (original and edge stream features) | UCM, AID, and NWPU-RESISC45
Wang et al. [34] | Triplet-metric-guided multi-scale attention (TMGMA) | UCM, AID, and NWPU-RESISC45
Zheng et al. [35] | Lightweight dual-branch Swin Transformer (ViT and CNN branches) | UCM, AID, and NWPU-RESISC45
Thapa et al. [36] | Review of CNN-based, Vision Transformer (ViT)-based, and Generative Adversarial Network (GAN)-based architectures | AID and NWPU-RESISC45
Chen et al. [37] | BiShuffleNeXt | UCM, AID, and NWPU-RESISC45
Wang et al. [38] | Frequency- and spatial-based multi-layer context network (FSCNet) | UCM, AID, and NWPU-RESISC45
Sivasubramanian et al. [39] | Transformer-based convolutional neural network | UCM, WHU-RS19, OPTIMAL-31, RSI-CB256, and MLRSNet |
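The dual-branch CNN–Transformer pattern that recurs in the table above (e.g., [33,35]) and that MSCAC builds on can be summarized in a minimal PyTorch sketch. This is an illustrative composition, not the authors' MSCAC implementation: the torchvision ResNet-18 trunk, the timm Swin-Tiny backbone, the feature dimensions, and concatenation-plus-linear-head fusion are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import timm                      # assumed available; supplies Swin backbones
from torchvision import models

class DualBranchClassifier(nn.Module):
    """Illustrative CNN + Swin dual-branch classifier (not the authors' exact MSCAC)."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Local branch: ResNet-18 trunk up to global average pooling -> (B, 512, 1, 1).
        cnn = models.resnet18(weights=None)
        self.local_branch = nn.Sequential(*list(cnn.children())[:-1])
        # Global branch: Swin-Tiny; num_classes=0 makes timm return pooled 768-d features.
        self.global_branch = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)
        # Fuse by concatenation, then classify with a single linear head.
        self.head = nn.Linear(512 + 768, num_classes)

    def forward(self, x):
        f_local = self.local_branch(x).flatten(1)   # (B, 512) local texture/detail cues
        f_global = self.global_branch(x)            # (B, 768) window-attention context
        return self.head(torch.cat([f_local, f_global], dim=1))

model = DualBranchClassifier(num_classes=21)        # e.g., the 21 UC-Merced classes
logits = model(torch.randn(2, 3, 224, 224))         # Swin-Tiny expects 224 x 224 input
print(logits.shape)                                 # torch.Size([2, 21])
```

Concatenating pooled local and global descriptors before a single head is the simplest of the fusion strategies surveyed above; the multilevel fusion of Section 2.4.2 would instead tap intermediate feature maps from both branches.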
Dataset | Scene Classes | Samples/Class | Image Size | Spatial Resolution | Challenging Factor |
---|---|---|---|---|---|
UC-Merced [41] | 21 | 100 | 256 × 256 | 0.3 m | Overlapping classes with different structure densities. |
WHU-RS19 [42] | 19 | 50 | 600 × 600 | Up to 0.5 m | Resolution, scale, orientation, and illumination variations. |
RSSCN7 [43] | 7 | 400 | 400 × 400 | - | Google Earth images with scale variations, cropped at four scales. |
AID [44] | 30 | 200–400 | 600 × 600 | 0.5–8 m | Multi-source images from various countries, seasons, and conditions, increasing intra-class diversity. |
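The comparisons that follow use fixed training ratios per dataset (50%/80% for UC-Merced, 40%/60% for WHU-RS19). A minimal sketch of one way to realize such a split is shown below; the dataset path is a placeholder, and `random_split` is a plain random split, whereas published protocols often sample per class (stratified), so this is an approximation rather than the authors' stated procedure.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Placeholder path; the UC-Merced archive (one folder per class) fits ImageFolder.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ucm = datasets.ImageFolder("data/UCMerced_LandUse/Images", transform=tfm)

ratio = 0.5                               # e.g., the 50% training setting
n_train = int(ratio * len(ucm))
train_set, test_set = random_split(
    ucm, [n_train, len(ucm) - n_train],
    generator=torch.Generator().manual_seed(0))  # vary the seed across repeated runs
```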
Classification accuracy (%) on the UC-Merced dataset under 50% and 80% training ratios.

Method | Accuracy (50% Training) | Accuracy (80% Training)
---|---|---
SIFT [44] | 28.92 ± 0.95 | 32.10 ± 1.95
BoVW (SIFT) [44] | 71.90 ± 0.79 | 74.12 ± 3.30
CaffeNet [44] | 93.98 ± 0.67 | 95.02 ± 0.81
GoogLeNet [44] | 92.70 ± 0.60 | 94.31 ± 0.89
VGG-VD-16 [44] | 94.14 ± 0.69 | 95.21 ± 1.20
MSCAC (proposed) | 94.01 ± 0.93 | 94.67 ± 0.89
Classification accuracy (%) on the WHU-RS19 dataset under 40% and 60% training ratios.

Method | Accuracy (40% Training) | Accuracy (60% Training)
---|---|---
SIFT [44] | 25.37 ± 1.32 | 27.21 ± 1.77
BoVW (SIFT) [44] | 75.26 ± 1.39 | 80.13 ± 2.01
CaffeNet [44] | 95.11 ± 1.20 | 96.24 ± 0.56
GoogLeNet [44] | 93.12 ± 0.82 | 96.05 ± 0.91
VGG-VD-16 [44] | 95.44 ± 0.60 | 96.05 ± 0.91
MSCAC (proposed) | 94.99 ± 0.89 | 96.57 ± 1.20
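Both tables report overall accuracy as mean ± standard deviation over repeated runs with different random splits. A minimal sketch of that computation follows; the per-run values are illustrative placeholders, not measured results.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Overall accuracy (%): share of correctly classified test scenes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * float((y_true == y_pred).mean())

# Per-run accuracies from repeated random splits (illustrative values only).
runs = [94.8, 95.3, 94.6, 95.1, 94.9]
print(f"{np.mean(runs):.2f} ± {np.std(runs):.2f}")  # the "mean ± std" format of the tables
```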
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).