Article

Enhancing Cross-Domain Remote Sensing Scene Classification by Multi-Source Subdomain Distribution Alignment Network

1 School of Electronic Engineering, Xidian University, Xi’an 710071, China
2 The 27th Research Institute of CETC, Zhengzhou 450047, China
3 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
4 School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2025, 17(7), 1302; https://doi.org/10.3390/rs17071302
Submission received: 18 February 2025 / Revised: 30 March 2025 / Accepted: 2 April 2025 / Published: 5 April 2025

Abstract

Multi-source domain adaptation (MSDA) in remote sensing (RS) scene classification has recently gained significant attention in the visual recognition community. It leverages multiple well-labeled source domains to train a model capable of strong generalization on a target domain with little or no labeled data. However, the distribution shifts among multiple source domains make it more challenging to align the distributions between the target domain and all source domains concurrently. Moreover, relying solely on global alignment risks losing fine-grained information for each class, especially in RS scene classification. To alleviate these issues, we present a Multi-Source Subdomain Distribution Alignment Network (MSSDANet), which introduces novel network structures and loss functions for subdomain-oriented MSDA. By adopting a two-level feature extraction strategy, the model achieves better global alignment between the target domain and multiple source domains, as well as alignment at the subdomain level. First, it includes a pre-trained convolutional neural network (CNN) as a common feature extractor to fully exploit the invariant features shared across the target and multiple source domains. Second, a dual-domain feature extractor follows the common feature extractor, mapping the data from each pair of target and source domains to a specific dual-domain feature space and performing subdomain alignment. Finally, a dual-domain feature classifier makes predictions by averaging the outputs of multiple classifiers. To complement this network, two novel loss functions are proposed to boost classification performance. The Discriminant Semantic Transfer (DST) loss forces the model to effectively extract semantic information among target and source domain samples, while the Class Correlation (CC) loss reduces the confusion between features of different classes within the target domain. Notably, MSSDANet is developed in an unsupervised manner for domain adaptation, meaning that no label information from the target domain is required during training. Extensive experiments on four common RS image datasets demonstrate that the proposed method achieves state-of-the-art performance for cross-domain RS scene classification. Specifically, in the dual-source and three-source settings, MSSDANet outperforms the second-best algorithm in terms of overall accuracy (OA) by 2.2% and 1.6%, respectively.

1. Introduction

Remote sensing (RS) technology, as an effective means of obtaining information about the Earth’s surface, has found extensive applications in diverse fields, such as ecology and environmental geology [1,2,3]. Meanwhile, various types of sensors such as hyperspectral [4], multispectral, and SAR [5] provide rich remote sensing images for multi-source observations, supporting extensive and sustained applications. Building upon this, RS scene classification has emerged as a fundamental task, aiming to categorize remote sensing images into predefined semantic categories by utilizing the spatial and spectral information contained within the images. This has played a crucial role in various RS applications, including military operations [6,7], agricultural monitoring [8], and urban planning [9].
In the field of RS cross-domain scene classification, with the ever-increasing speed at which Earth imagery is acquired [10,11,12], deep learning-based methods have gained significant attention from researchers, and network models have been continuously improved. However, two major difficulties still need to be solved for RS scene classification methods. First, training high-performance deep networks usually requires a large quantity of labeled data; in real-world scenarios, however, collecting sufficient labeled data can be cost-prohibitive and labor-intensive. Second, despite the existence of multiple types of remote sensing images, collaborative processing is highly challenging due to significant differences in their distribution and semantics (such as differences in sensor types and acquisition conditions), which severely tests the generalization performance of the model. Therefore, how to train a high-performance learner with strong generalization ability that effectively leverages label information from the source domain has become a focal point of research.
To overcome these obstacles, domain adaptation (DA) techniques have been applied to RS cross-domain scene classification [13]. In the machine learning community, DA is designed to overcome data distribution discrepancies between the target domain (a domain with scarce or nonexistent labeled data) and the source domain (a domain with labeled data) [14]. By employing DA, the knowledge learned in the source domain can be effectively transferred to the target domain, thereby reducing the need for a large quantity of labeled data in the target domain. In this sense, DA is particularly suitable for cross-domain tasks, especially RS scene classification, where it can significantly improve the model’s generalization performance in the target domain. Depending on whether labeled samples exist in the target domain, DA can be categorized into unsupervised DA (UDA) and semi-supervised DA. UDA allows us to train a network with good generalization performance using unlabeled data from the target domain and labeled data from the source domain. Generally, prevailing methods for solving UDA problems fall into two branches. The first is based on statistical moment matching loss, including second-order statistics matching [15], Maximum Mean Discrepancy (MMD) [16,17], and Central Moment Discrepancy (CMD) [18]. The second prevalent technique imposes an adversarial loss on samples from different domains so that they become indistinguishable in terms of domain labels [19,20]. Within RS scene classification, MIM [21] enhances cross-scene classification by improving pseudo-label propagation with contrastive learning, masked image modeling, and clustering rectification. Lie [22] proposed a novel group spatial attention to enhance feature learning and domain alignment. FCAN [23] improves cross-domain scene classification by separately refining and aligning low-frequency and high-frequency features.
Traditional DA research has primarily focused on single-source domain adaptation, which utilizes data from only one source domain for training. However, the limitations of single-source domain adaptation (SDA) have become apparent, especially when applied to remote sensing, where the diversity of sensor types, complex environmental conditions, and geographical disparities further complicate cross-domain adaptation challenges. Multi-source domain adaptation (MSDA), on the other hand, has attracted considerable attention owing to its effectiveness in tackling the diverse and complex nature of real-world data distributions [24]. Compared to SDA, MSDA is more desirable for real-world applications, as it can leverage data from multiple source domains simultaneously, facilitating a more comprehensive representation of complex environments. Moreover, the model’s generalization ability in the target domain can be strengthened by MSDA through the incorporation of information from multiple source domains, capturing greater feature variability and distributional diversity. This diversity aids in learning more robust and generalized feature representations, reducing the risk of model overfitting to a single source domain.
To better align the target domain with multiple source domains, recent methods have introduced modifications to both the network architecture and the loss constraints. MFSAN [25] addresses the alignment problem by constructing a one-to-one domain alignment network architecture. PTMDA [26] utilizes structured source domain information to improve target domain generalization and increases the transferability of intermediate layers in networks to reduce domain shift. IMIS [27] enhances domain adaptation capabilities and improves the model’s generalization ability by aligning the features of multiple incomplete datasets. However, two major challenges remain to be addressed in MSDA. First, the domain shifts among multiple source domains make it more difficult to align the distributions between all source domains and the target domain; we term this the global alignment problem. Second, performing only global alignment may lose the fine-grained information contained in each class, leading to a decline in the performance of DA algorithms; correspondingly, we call this the local alignment problem.
In this work, to overcome these obstacles, we present a novel method coined Multi-Source Subdomain Distribution Alignment Network (MSSDANet) for cross-domain remote sensing scene classification. Our method seeks to improve classification performance by learning fine-grained information for each category while ensuring better alignment of both global features and local subdomain features. This is intuitively illustrated in Figure 1: solely performing global alignment (left) may result in a loss of fine-grained category information, leading to a degraded performance. In contrast, the proposed method (right) learns fine-grained information for each category and performs subdomain alignment, thereby enhancing domain adaptation performance. Global alignment alone risks mixing subdomains of different categories, while subdomain alignment ensures both global and category-specific distribution alignments. These advantages will be more obvious when Figure 1 is extended to multi-source scenarios.
MSSDANet is designed and tailored for subdomain-oriented network structures. First, it extracts common features of the data from the target and multi-source domains through the backbone network. It then pairs target domain features with the ones from each source domain separately to perform dual-domain feature extraction. This aims to effectively reduce the feature distribution discrepancy between corresponding subdomains in the source and target domains while adequately extracting domain-invariant features. Lastly, through the dual-domain feature recognition layer, the algorithm integrates and enhances feature discriminability across the target and multiple source domains.
Simultaneously, Discriminant Semantic Transfer (DST) loss, as an improvement over the semantic transfer loss, is utilized to effectively align the class-level semantic features between the two domains. Additionally, a Class Correlation (CC) loss is proposed, which aims to fully leverage the class correlation information within the class autocorrelation matrix. Through the integration of network architecture design and loss function constraints, the proposed method achieves competitive performance in the task of cross-domain RS scene classification. The main contributions of this article are as follows:
  • A novel network coined MSSDANet for RS cross-domain scene classification is proposed, which focuses on subdomain-level alignment. By capturing local similarities within the framework of MSDA, the method effectively mitigates global and local domain shifts across the target and multiple source domains.
  • With a two-level feature extraction strategy, our network provides two specific components: common feature extraction and dual-domain feature extraction. The first ensures robust global alignment while the latter specializes in refining category-specific representations, which allows our method to effectively extract both global and fine-grained features.
  • We propose the discriminant semantic transfer (DST) loss to force the model to extract semantic information across the target and multiple-source domain samples. Moreover, we develop a Class Correlation (CC) loss to tackle the confusion of different class features within the target domain.
  • Empirically, we carry out extensive experiments on four benchmark RS scene datasets. The results show the feasibility of our method of solving MSDA problems. Note that MSSDANet is developed under an unsupervised setting, which is more suitable for the challenge of label scarcity in remote sensing data.
The structure of this paper is as follows: Section 2 reviews related studies. Section 3 provides a detailed discussion of the proposed method. Section 4 presents experimental results to validate the method’s effectiveness. Finally, Section 5 concludes with a summary and future research directions.

2. Related Work

This section organizes related research into two key directions: (1) unsupervised domain adaptation techniques, and (2) cross-domain RS scene classification methods.
Unsupervised Domain Adaptation: Generally, deep learning-based unsupervised domain adaptation (DA) methods can be broadly categorized into two approaches: metric-based DA and domain adversarial adaptation.
Metric-based DA methods concentrate on measuring the distribution discrepancy between the target and source domains. Considering the inaccessibility of target labels, this type of method typically analyzes distribution statistics to quantify the cross-domain difference. MMD [28,29,30] measures the distance between the feature distributions of the target and source domains by calculating their mean discrepancy, thereby reducing the domain gap. DSAN [31] accurately aligns the distributions of the associated subdomains by introducing local MMD. ETD [32] adjusts the optimal transport constraint by using the cross-domain sample correlations as weights. SE-CC [33] enhances domain adaptation by leveraging category-agnostic clusters in the target domain, enabling effective representation learning for both closed-set and open-set scenarios.
Domain adversarial adaptation methods utilize adversarial loss to make samples from different domains indistinguishable in terms of their domain labels [34,35]. Inspired by the advantages of conditional GANs [36], CDAN [2] employs a linear combination approach to associate features with the predicted category distribution, treating this combination as input for domain alignment. SDAT [37] analyzes the smoothness issue in domain adversarial training and proposes improvement strategies to enhance the model’s generalization ability in the target domain. Multi-adversarial domain adaptation (MADA) [38] utilizes multiple domain discriminators to characterize multi-mode patterns, allowing for more precise alignment of various data distributions.
Cross-Domain RS Scene Classification: With the advancement of RS image processing and deep learning technologies, cross-domain scene classification has evolved into a key research area. Beyond feature representation and traditional classification, DA methods have been applied to RS scene classification to mitigate the distribution gap between the target and source domains, ensuring robust generalization performance across different data distributions.
The gradient reversal (RevGrad) method proposed by Ganin et al. [39] has been applied to RS scene classification [40]. The Multi-branch Neural Network (MB-Net) [41] aims to minimize the distribution discrepancies across multiple source–target pairs at both the feature representation and decision-making levels. Laila et al. [42] proposed a cross-domain network for aerial vehicle classification built on generative adversarial networks (GANs), called the Siamese-GAN model. This model learns invariant image feature representations from different domains by combining a Siamese encoder–decoder architecture with adversarial training. To address the issues of uneven distribution and incomplete categories, Lu et al. [40] proposed the Multi-Source Compensation Network (MSCN), which leverages a cross-domain alignment block to mitigate domain shifts between the target and source domains. SSDAN [43] employs a secondary minimax loss in an alternating minimization and maximization manner to force the model to simultaneously learn domain-invariant and class-discriminative features. Generally, the RS community has witnessed significant progress in the development of DA methods for cross-domain scene classification.

3. Proposed Method

In a multi-source unsupervised DA setting, there are $N$ different labeled source domains $\{D_{s_j}\}_{j=1}^{N} = \{(X_{s_j}, Y_{s_j})\}_{j=1}^{N}$, whose distributions are represented as $\{p_{s_j}(x, y)\}_{j=1}^{N}$, where $X_{s_j} = \{x_i^{s_j}\}_{i=1}^{n_{s_j}}$ represents the samples from source domain $j$ and $Y_{s_j} = \{y_i^{s_j}\}_{i=1}^{n_{s_j}}$ contains the corresponding ground-truth labels. At the same time, we have an unlabeled target domain $D_t = \{X_t\}$, whose distribution is represented as $p_t(x, y)$, where $X_t = \{x_i^t\}_{i=1}^{n_t}$ represents the samples from the target domain. In addition, each source domain and the target domain share the same label space $Y = \{1, 2, 3, \dots, K\}$.
To tackle the issue of domain shift between multiple source domains and the target domain in cross-domain RS scene classification, we develop MSSDANet, a novel framework that integrates a subdomain alignment network with innovative loss functions. Through learning local domain transfer, MSSDANet aligns the global distributions along with the local distributions at the subdomain level. The network is designed to extract features common to all domains as well as dual-domain shared features. Building on this two-level feature extraction strategy, we also present the Discriminant Semantic Transfer (DST) loss and the Class Correlation (CC) loss, which help to capture fine-grained category details and perform subdomain alignment for cross-domain RS scene data. By utilizing multi-source domain information and subdomain alignment, our method ensures better cross-domain generalization and improved scene classification performance. The workflow of MSSDANet is illustrated in Figure 2.
Next, in this section, we will detail the network structure and the loss functions of MSSDANet, respectively.

3.1. Network Structure of MSSDANet

MSSDANet employs a two-level feature extraction strategy, which is reflected in its structural design. The network is composed of three main components: a common feature extractor, a dual-domain feature extractor, and a dual-domain feature classifier. First, the common feature extractor captures invariant features shared across the target and source domains. Then, the dual-domain feature extractor refines and aligns domain-specific features. Finally, the dual-domain feature classifier performs classification by averaging the predictions of the dual-domain branches, ensuring optimal adaptation and performance.
1. Common feature extractor
A pre-trained convolutional neural network (CNN) serves as the common feature extractor of MSSDANet. In this paper, we choose EfficientNet-B3 as the pre-trained CNN; of course, other CNN architectures such as ResNet or VGG can also be selected as the common feature extractor.
Samples from multiple source domains and the target domain are fed into the common feature extractor. On one hand, it extracts the shared representation of the target and all source domains, mapping the RS data from the original image space to the common feature space. On the other hand, it has the potential to explore the connections and invariant features between the target domain and multiple source domains.
2. Dual-domain feature extractor
The dual-domain feature extractor comprises successive convolutional layers, batch normalization (BN) layers, rectified linear unit (ReLU) activation function layers, and average pooling layers, arranged sequentially to extract hierarchical features. Each pair of source–target domain features is fed into the dual-domain feature extractor after passing through the common feature extractor. Its main function is to map each pair of target–source domain data to a specific dual-domain feature space, fully mining the common invariant features among them. The previous common feature extractor extracts the shared representations of the target and all source domains, while the dual-domain feature extractor here can further bridge the feature distribution gap between the corresponding subdomains in each pair of target and source domains.
3. Dual-domain feature classifier
The dual-domain feature classifier, consisting of only a simple linear layer and a Softmax layer, is employed to classify the features of target and source domain data during training and to classify the features of target domain data during testing. The final predicted classification result is determined by taking the average of the predicted results obtained by different dual-domain feature classifiers, aiming to jointly consider and fuse the features and information obtained by the dual-domain feature extractor.
As shown in Figure 2, the RS image data of both the target and multiple source domains are passed into the common feature extractor, whose output features are paired one-to-one (a source domain and the target domain) and fed into dual-domain feature extractors. Specifically, the feature from source domain 1 and the feature from the target domain are paired and fed into dual-domain feature extractor 1. Similarly, features from source domain N and the target domain are input into dual-domain feature extractor N. Finally, the outputs of multiple dual-domain feature extractors are fed into a dual-domain feature classifier, and the prediction probabilities are averaged to obtain the final classification result.
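To make the data flow above concrete, a minimal PyTorch sketch of the architecture is given below. The backbone loading call, the channel sizes (e.g., the 1536-dimensional EfficientNet-B3 feature maps), and the two Conv-BN-ReLU blocks in each branch are illustrative assumptions consistent with Section 3.1 rather than the exact configuration released by the authors; only the overall structure (one shared backbone, N dual-domain branches, and prediction averaging) follows the description.

import torch
import torch.nn as nn
from torchvision import models

class DualDomainBranch(nn.Module):
    # One dual-domain feature extractor plus classifier for a (source-n, target) pair.
    # Channel sizes are illustrative assumptions; the Conv-BN-ReLU-AvgPool stack follows Section 3.1.
    def __init__(self, in_ch=1536, mid_ch=256, num_classes=12):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(mid_ch, num_classes)  # Softmax is applied when predictions are averaged

    def forward(self, x):
        feat = self.extractor(x)
        return feat, self.classifier(feat)

class MSSDANetSketch(nn.Module):
    # Shared common feature extractor + N dual-domain branches; the target prediction is the
    # average of the Softmax outputs of all branches, as described in Section 3.1.
    def __init__(self, num_sources=2, num_classes=12):
        super().__init__()
        backbone = models.efficientnet_b3(weights="IMAGENET1K_V1")  # requires a recent torchvision
        self.common = backbone.features  # the original classification head is discarded
        self.branches = nn.ModuleList(
            DualDomainBranch(in_ch=1536, num_classes=num_classes) for _ in range(num_sources)
        )

    def forward_branch(self, x, n):
        # Features and logits from branch n, used when training on the n-th source/target pair.
        return self.branches[n](self.common(x))

    def predict_target(self, x_t):
        # Average the Softmax outputs of all branches for a batch of target images.
        probs = [torch.softmax(self.forward_branch(x_t, n)[1], dim=1) for n in range(len(self.branches))]
        return torch.stack(probs, dim=0).mean(dim=0)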

3.2. Loss Functions of MSSDANet

1. Discriminant Semantic Transfer (DST) loss
While existing approaches can reduce the feature distribution discrepancy between target and source domains to some extent, they often neglect the abundant semantic information within samples. This oversight may result in misalignments, where features from the target domain could be incorrectly mapped to unrelated features in the source domain. To alleviate this issue, in this paper, we refine the semantic transfer (ST) loss in MSTN [44] and present the discriminant semantic transfer (DST) loss. Building upon the principles of Linear Discriminant Analysis (LDA) [45], the improved loss aims to better capture and utilize the semantic relationships among samples, enabling more accurate alignment of class-level semantic features between the target and source domains.
The original semantic transfer loss in [44] is formulated as:
$L_{st} = \sum_{k=1}^{K} \varphi\big(C_s^k, C_t^k\big), \qquad (1)$
where $C_s^k = \Gamma(F(x_s^k))$ and $C_t^k = \Gamma(F(x_t^k))$ represent the centroids of the k-th class from the source and target domains in the common feature space, respectively. Here, $\varphi(\cdot)$ is a distance metric function, and $\Gamma(\cdot)$ is the function for computing centroids. The semantic transfer (ST) loss helps the model to utilize the class semantic information across the target and source domains, thereby enhancing cross-domain class-semantic alignment.
However, the standard ST loss only considers semantic alignment of samples of the same class from the target and source domains, which may lead the model to somewhat overlook the semantic distinctions between samples from different categories. In this sense, we first extend the semantic transfer loss by utilizing the inter-class discriminant information, which is formulated as:
$L_{dst} = \sum_{k=1}^{K} \psi\big(C_s^k, C_t^k\big) + \sum_{k=1}^{K} \sum_{j \neq k}^{K} \Big[ \big\langle C_s^k, C_s^j \big\rangle + \big\langle C_t^k, C_t^j \big\rangle + \big\langle C_s^k, C_t^j \big\rangle \Big], \qquad (2)$
where $\psi(a, b) = \lVert a - b \rVert^2$ denotes the squared Euclidean distance between sample centroids, and $\langle C_1, C_2 \rangle = \big[ \tfrac{C_1 \cdot C_2}{\lVert C_1 \rVert \lVert C_2 \rVert} \big]^p$ computes the cosine similarity between centroids, where $p$ is a power factor.
The DST loss aims to ensure that, whether across domains or within domains, the distance between samples of the same class when mapped to the feature space should be as small as possible, while the distance between samples from different categories in the embedding feature space should be as large as possible. Specifically, in Equation (2), the first term of the loss function encourages samples of the same category in the target and source domains to gradually approach each other in this feature space. The second term promotes the separation of samples from different classes within the source domain, the target domain, and across the two domains.
Additionally, considering the significant semantic differences between the target and source domain samples, directly transferring semantic features between the two domains may lead to negative transfer. In other words, the knowledge learned from the source domain might not effectively help the target domain during transfer learning and could instead negatively impact the model’s performance in the target domain. To relieve this issue, we suggest using the integration of both source and target domain features to replace the original source-domain-only features. This modification expands the semantic feature space of the previous source domain and alleviates the negative transfer problem during semantic transfer.
Finally, the discriminant semantic transfer (DST) loss is formulated as:
$L_{dst} = \sum_{k=1}^{K} \psi\big(C^k, C_t^k\big) + \sum_{k=1}^{K} \sum_{j \neq k}^{K} \Big[ \big\langle C^k, C^j \big\rangle + \big\langle C_t^k, C_t^j \big\rangle + \big\langle C^k, C_t^j \big\rangle \Big], \qquad (3)$
where $C^k = \Gamma\big(\big[F(X_s^k), F(X_t^k)\big]\big)$ represents the class centroid of all k-th class samples from both the source and target domains in this hybrid feature space. From Equation (3), the constraint is expected to produce more compact representations within each class and increase the feature distance between different categories. For the MSDA problem, our designed DST loss enables the network to effectively capture the semantic information between samples from the target and multiple source domains, thus achieving precise transferable semantic alignment.
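A possible implementation of the DST loss of Equation (3) is sketched below. It assumes hard pseudo-labels for the target batch, uses the squared Euclidean distance and powered cosine similarity defined above, and simply skips classes that are absent from the mini-batch; the default exponent p = 2 is a placeholder, not a value reported in the paper.

import torch
import torch.nn.functional as F

def dst_loss(feat_s, labels_s, feat_t, pseudo_t, num_classes, p=2):
    # Discriminant Semantic Transfer loss, Equation (3) (sketch).
    # feat_s, feat_t: [B, D] features; labels_s: source labels; pseudo_t: target pseudo-labels.
    hybrid_c, target_c, valid = [], [], []
    for k in range(num_classes):
        s_mask, t_mask = labels_s == k, pseudo_t == k
        if s_mask.any() and t_mask.any():  # skip classes missing from this mini-batch
            hybrid_c.append(torch.cat([feat_s[s_mask], feat_t[t_mask]]).mean(0))  # C^k
            target_c.append(feat_t[t_mask].mean(0))                               # C_t^k
            valid.append(k)
    if len(valid) < 2:
        return feat_s.new_zeros(())
    C, Ct = torch.stack(hybrid_c), torch.stack(target_c)
    # first term: pull same-class hybrid and target centroids together (squared Euclidean distance)
    intra = ((C - Ct) ** 2).sum(dim=1).sum()
    # second term: powered cosine similarity between centroids of different classes
    def cos_pow(A, B):
        return (F.normalize(A, dim=1) @ F.normalize(B, dim=1).t()).pow(p)
    off_diag = ~torch.eye(len(valid), dtype=torch.bool, device=feat_s.device)
    inter = (cos_pow(C, C)[off_diag].sum()
             + cos_pow(Ct, Ct)[off_diag].sum()
             + cos_pow(C, Ct)[off_diag].sum())
    return intra + inter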
2. Class Correlation (CC) loss
In a typical domain adaptation task, the classifier is ultimately required to predict the class probability of each target domain sample, with the final prediction result from the classifier being referred to as a pseudo-label. As discussed in the literature [46], existing deep neural networks tend to produce probabilities during classification tasks that are often higher than the actual accuracy. This implies that these networks are typically overconfident in their predicted confidence levels, resulting in the probabilities not accurately representing true classification accuracy. Therefore, we introduce the Temperature Rescaling Method (TRM) [46] to smooth the pseudo-labels, generating a more softened and smoothed class probability distribution.
TRM is a post-processing technique that utilizes a temperature parameter T (greater than 0) to rescale the logits produced by the neural network. This adjustment ensures that the final probability outputs more accurately represent the model’s confidence. After scaling, the probability that the sample i belongs to class j is calculated as:
$\hat{Y}_{ij} = \dfrac{\exp(\hat{Z}_{ij} / T)}{\sum_{j=1}^{K} \exp(\hat{Z}_{ij} / T)}, \qquad (4)$
where $\hat{Z}_{ij}$ represents the unprocessed raw logits, and T is the temperature parameter. When T = 1, the probabilities remain the same as the original output, but when T > 1, the overly sharp and confident probability distribution becomes smoother and flatter. This adjustment helps reduce the issues caused by the model’s overconfidence and encourages more reliable predictions. TRM thus mitigates the adverse impact of overconfident but erroneous pseudo-labels on the learning process.
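In code, temperature rescaling of Equation (4) amounts to dividing the raw logits by T before the Softmax; the minimal snippet below illustrates this, with T = 1.8 taken from the setting reported in Section 4.2.

import torch

def temperature_rescale(logits, T=1.8):
    # Equation (4): softened class probabilities; T > 1 flattens an overconfident distribution.
    return torch.softmax(logits / T, dim=1)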
Additionally, considering the UDA case where the source domain contains labeled data while the target domain is unlabeled, traditional DA methods typically employ a global alignment strategy to mitigate the distribution discrepancy between the target and source domains. Nevertheless, this frequently results in classification confusion, where features from different classes might overlap during alignment. Here, we introduce a class autocorrelation matrix from the Minimum Class Confusion (MCC) method [47] to reduce the classification confusion problem. The class autocorrelation matrix is formulated as:
$U = \hat{Y}^{\mathrm{T}} \hat{Y}, \qquad (5)$
where $\hat{Y}$ represents the prediction results of the classifier for the target domain samples. We hypothesize that if the prediction results for a sample approximate a uniform distribution, the high uncertainty, due to the lack of a prominent peak, will make it difficult for the sample to provide useful information for the model. Conversely, when the prediction results for a sample exhibit a clearly prominent peak with higher certainty, this sample can provide highly effective supervision information for the model. Therefore, the class autocorrelation matrix is refined as:
$U = \hat{Y}^{\mathrm{T}} Q \hat{Y}, \qquad (6)$
$Q_{ii} = \dfrac{\exp\big(E(\hat{y}_i)\big)}{\sum_{m=1}^{B} \exp\big(E(\hat{y}_m)\big)}, \qquad (7)$
$E(\hat{y}_i) = \sum_{j=1}^{K} \hat{y}_{ij} \log \hat{y}_{ij}, \qquad (8)$
where $U$ represents the class autocorrelation matrix, $Q$ is a diagonal weighting matrix over the B samples in a batch, and $\hat{y}_i$ denotes the prediction result for the i-th sample. By applying a weighted treatment to the class autocorrelation matrix, we can differentiate samples with varying levels of certainty. Specifically, samples with higher prediction certainty are assigned greater weights, while those with lower prediction certainty receive smaller weights. This rule allows the model to extract more reliable supervision information from samples with higher prediction confidence.
Based on the class autocorrelation matrix, we formally present a new Class Correlation (CC) loss to make sufficient use of the class correlation information within the matrix, which is formulated as:
$L_{cc} = \dfrac{1}{K} \Big( \sum_{i=1}^{K} \sum_{j \neq i}^{K} U_{ij} - \mu \sum_{i=1}^{K} U_{ii} \Big), \qquad (9)$
where $\mu$ is a balancing parameter used to balance inter-class and intra-class correlations. The elements on the main diagonal of the class autocorrelation matrix represent intra-class correlations, while the off-diagonal elements represent inter-class correlations. Specifically, $\sum_{i=1}^{K} \sum_{j \neq i}^{K} U_{ij}$ measures the inter-class correlation, and $\sum_{i=1}^{K} U_{ii}$ measures the intra-class correlation. By minimizing inter-class correlation while increasing intra-class correlation, this loss can significantly reduce the categorical confusion of feature representations in the target domain, hence boosting the model’s adaptability and generalization capabilities.
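Putting Equations (4)–(9) together, a hedged sketch of the CC loss is given below. The entropy-based weighting and the diagonal/off-diagonal balance follow the formulas above; the value of mu is a placeholder, and the small constant added inside the logarithm is a numerical-stability assumption.

import torch

def cc_loss(target_logits, T=1.8, mu=1.0):
    # Class Correlation loss, Equations (4)-(9) (sketch).
    # target_logits: [B, K] raw classifier outputs on a target-domain batch.
    Y = torch.softmax(target_logits / T, dim=1)         # Eq. (4): temperature-rescaled probabilities
    K = Y.size(1)
    E = (Y * torch.log(Y + 1e-8)).sum(dim=1)            # Eq. (8): high for confident (low-entropy) samples
    q = torch.softmax(E, dim=0)                         # Eq. (7): per-sample certainty weights
    U = Y.t() @ (q.unsqueeze(1) * Y)                    # Eq. (6): weighted class autocorrelation matrix [K, K]
    inter = U.sum() - U.diagonal().sum()                # off-diagonal entries: inter-class correlation
    intra = U.diagonal().sum()                          # diagonal entries: intra-class correlation
    return (inter - mu * intra) / K                     # Eq. (9)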
3. Source domain classification (CLS) loss
Obviously, for MSDA algorithms, the label information in the multiple source domains plays an important role in providing supervision for source feature extraction. To guarantee precise and consistent outcomes in MSDA classification tasks, the empirical risk inherent in the source-domain distribution must be minimized. Specifically, it is necessary to optimize both the feature extractor and the dual-domain feature classifier of the model via supervised classification loss minimization on the multiple source domains. Therefore, the multi-source classification loss function of MSSDANet can be defined as:
$L_{cls} = \dfrac{1}{n_s} \sum_{i=1}^{n_s} L_{ce}\big(C(F(x_s^i)), y_s^i\big), \qquad (10)$
where $L_{ce}$ is the commonly used cross-entropy loss. For a given source domain, $n_s$ is the number of samples it contains, $x_s^i$ is the i-th input sample, and $y_s^i$ is the corresponding label. $C(F(x_s^i))$ represents the predicted label obtained after the source-domain input has passed through the feature extractor network and the classifier layer. $F$ denotes the feature extractor, which includes the common feature extractor and the dual-domain feature extractor, and $C$ denotes the dual-domain feature classifier.
4. Local Maximum Mean Discrepancy (LMMD) loss
To achieve better alignment between subdomains in the target and source domains, MSSDANet employs the Local Maximum Mean Discrepancy (LMMD) [31] loss function, which is an improved version of the Maximum Mean Discrepancy (MMD) [15] loss function. MMD primarily focuses on global distribution alignment but neglects the relationships between subdomains of the same class.
In contrast, the LMMD loss accounts for the weights of different samples and measures the kernel mean embedding discrepancy between related subdomains in the target and source domains. The specific calculation is defined as:
$L_{lmmd} = \dfrac{1}{K} \sum_{k=1}^{K} \Big[ \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} w_i^{sk} w_j^{sk}\, k\big(F(x_s^i), F(x_s^j)\big) + \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} w_i^{tk} w_j^{tk}\, k\big(F(x_t^i), F(x_t^j)\big) - 2 \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} w_i^{sk} w_j^{tk}\, k\big(F(x_s^i), F(x_t^j)\big) \Big], \qquad (11)$
where $F(x_s)$ and $F(x_t)$ denote the feature vectors of source and target domain samples derived from the feature extractor, respectively, and $w_i^{sk}$ and $w_j^{sk}$ represent the weights of samples $x_s^i$ and $x_s^j$ belonging to the k-th class. $k(\cdot, \cdot)$ refers to the kernel function and is formulated as:
$k(x_s, x_t) = \big\langle \phi(x_s), \phi(x_t) \big\rangle, \qquad (12)$
where $\langle \cdot, \cdot \rangle$ symbolizes the inner product of vectors and $\phi(\cdot)$ represents the mapping of the input data into the Reproducing Kernel Hilbert Space (RKHS).
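For completeness, a simplified version of the LMMD computation in Equation (11) is sketched below, following the class-conditional weighting idea of DSAN [31]. It uses a single-bandwidth Gaussian kernel and takes the classifier's soft predictions as target-side class weights; the multi-kernel construction used in practice is omitted, so this should be read as an approximation rather than the authors' implementation.

import torch
import torch.nn.functional as F

def gaussian_kernel(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); single-bandwidth simplification of Equation (12).
    return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

def lmmd_loss(feat_s, labels_s, feat_t, probs_t, num_classes, sigma=1.0):
    # Local MMD, Equation (11) (sketch): class-wise weighted MMD between source and target features.
    # labels_s: hard source labels; probs_t: soft target predictions used as class weights.
    w_s = F.one_hot(labels_s, num_classes).float()
    w_s = w_s / (w_s.sum(dim=0, keepdim=True) + 1e-8)          # normalize source weights within each class
    w_t = probs_t / (probs_t.sum(dim=0, keepdim=True) + 1e-8)  # normalize soft target weights within each class
    Kss = gaussian_kernel(feat_s, feat_s, sigma)
    Ktt = gaussian_kernel(feat_t, feat_t, sigma)
    Kst = gaussian_kernel(feat_s, feat_t, sigma)
    loss = feat_s.new_zeros(())
    for k in range(num_classes):
        ws, wt = w_s[:, k:k + 1], w_t[:, k:k + 1]              # [B, 1] weights for class k
        loss = loss + (ws.t() @ Kss @ ws + wt.t() @ Ktt @ wt - 2 * ws.t() @ Kst @ wt).squeeze()
    return loss / num_classes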
5. Overall loss function of MSSDANet
Based on the above descriptions, our MSSDANet will utilize four loss functions for training the network. Therefore, according to Equations (3), (9)–(11), the specific expression for the overall loss function is formulated as:
$L_{all} = L_{cls} + \alpha L_{lmmd} + \beta L_{dst} + \gamma L_{cc}, \qquad (13)$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters of the loss function, used to coordinate the weights of the different loss components. It should be noted that, as shown in Figure 2, the DST and LMMD losses are calculated at the dual-domain feature extractor, while the CC and CLS losses can only be computed after the dual-domain feature classifier. Together, they constrain the model toward reliable adaptation in terms of both feature extraction and classification.
To better illustrate the detailed procedures of our method, we summarize the workflow of the proposed MSSDANet in Algorithm 1.
Algorithm 1. MSSDANet workflow.
Input: target domain images, multiple-source domain images (from source-1 to source-N).
Output: labeled target images.
  Set parameters: Num_epoch, Batch_size, Learning_rate, Temperature, Hyperparameters
  Load the pre-trained CNN network (EfficientNet-B3), which acts as the common feature extractor.
  for epoch = 1 to Num_epoch
   D_t: the target domain data containing K categories.
   D_s^n: the n-th source domain data containing K categories.
   Randomly divide the dataset samples into Num_batches batches of size Batch_size.
   for n = 1 to N
    Both D_t and D_s^n are fed into the common feature extractor.
    Both D_t and D_s^n features are fed into their specific dual-domain feature extractor and classifier.
    Compute the losses of Equations (3), (9)–(11).
    Minimize the overall loss function of Equation (13).
  Test the model using unlabeled target data.
Here, Num_epoch represents the number of training epochs, Batch_size denotes the batch size, Learning_rate indicates the initial learning rate, Temperature refers to the temperature coefficient, and Hyperparameters refer to the hyperparameters used in the loss function of Equation (13). After training, test data from the target domain is passed through the common feature extractor, the dual-domain feature extractor, and the classification layer to obtain the final predicted labels.
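A condensed training step corresponding to the inner loop of Algorithm 1 might look as follows. The helper functions dst_loss, cc_loss, and lmmd_loss refer to the illustrative sketches given earlier in this section (not the authors' released code), MSSDANetSketch is the architecture sketch from Section 3.1, and the default loss weights are one possible choice from the candidate set {1, 0.7, 0.5} mentioned in Section 4.2.

import torch
import torch.nn.functional as F

def train_step(model, x_t, source_batches, optimizer, num_classes,
               alpha=1.0, beta=0.7, gamma=0.5, T=1.8):
    # One optimization step over all (source-n, target) pairs, minimizing Equation (13).
    # model: an MSSDANetSketch instance; source_batches: list of (x_s, y_s) tuples, one per source domain.
    optimizer.zero_grad()
    total = x_t.new_zeros(())
    for n, (x_s, y_s) in enumerate(source_batches):
        f_s, logit_s = model.forward_branch(x_s, n)
        f_t, logit_t = model.forward_branch(x_t, n)
        pseudo_t = logit_t.argmax(dim=1)                     # hard pseudo-labels for the DST centroids
        loss = (F.cross_entropy(logit_s, y_s)                                                   # L_cls, Eq. (10)
                + alpha * lmmd_loss(f_s, y_s, f_t, torch.softmax(logit_t, dim=1), num_classes)  # Eq. (11)
                + beta * dst_loss(f_s, y_s, f_t, pseudo_t, num_classes)                         # Eq. (3)
                + gamma * cc_loss(logit_t, T=T))                                                # Eq. (9)
        total = total + loss
    total.backward()
    optimizer.step()
    return total.item()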

4. Experiment

In this section, extensive experiments are carried out to evaluate the effectiveness of the proposed MSSDANet. First, we provide a detailed description of the used RS scene classification datasets, as well as the analysis of their distribution discrepancy. Then, we introduce the experimental setup and the implementation details of our method. Further, the quantitative performance of MSSDANet is evaluated through MSDA experiments under different settings of source numbers. We also report the results of ablation studies, training stability, t-SNE visualization, and the parameter sensitivity of our method. Finally, an extended experiment is conducted on a benchmark natural-image dataset to further validate the generalization capability of our method for non-RS tasks.

4.1. Datasets

To assess the effectiveness of our MSSDANet, we employed four widely used RS scene datasets: the UC Merced dataset [48], the Aerial Image Dataset (AID) [49], the Remote Sensing Image Scene Classification (RESISC45) dataset [50], and the PatternNet dataset [51]. These datasets have been processed to create a new remote sensing domain adaptation dataset [43]. The original four datasets were collected from different sensors in various regions, leading to significant differences in illumination, shape, and other characteristics, resulting in considerable distribution disparities among the datasets.
1. UC Merced dataset
The UC Merced Land Use Dataset contains 21 categories of aerial scenes, with each category consisting of 100 high-resolution RGB samples. Each image is sized at 256 × 256 pixels and features a spatial resolution of 0.3 m per pixel. The dataset is derived from the USGS National Map Urban Area Imagery collection, covering diverse geographic regions across the United States.
2. AID dataset
The Aerial Image Dataset (AID) comprises 10,000 high-resolution RGB samples spanning 30 distinct land-use categories. Each image has a uniform size of 600 × 600 pixels, with a spatial resolution varying from 0.5 to 8 m per pixel, sourced from Google Earth satellite imagery with global coverage.
3. RESISC45 dataset
The RESISC45 benchmark, constructed by researchers at Northwestern Polytechnical University, comprises 45 scene categories with 700 high-resolution RGB images per class. Collected from Google Earth’s multi-temporal satellite imagery, all samples are preprocessed to 256 × 256 pixels, capturing diverse spatial resolutions from 0.2 m to 30 m to enhance classification challenges.
4. PatternNet dataset
The PatternNet dataset comprises images collected from multi-resolution imagery on Google Earth. It contains 38 categories, with each category consisting of 800 RGB images, covering various regions around the globe. All images are sized at 256 × 256 pixels.
5. New remote sensing (RS) domain adaptation dataset
It is essential to standardize the number and names of categories among the original RS datasets, as they differ significantly [40]. For a fair comparison, we use the same RS domain adaptation dataset as presented in [43], which is constructed from the four original RS datasets following the protocol of [40]. Twelve common categories from the four public datasets are extracted to establish the category names and counts for the new DA dataset. For instance, in the RESISC45 dataset, rectangular and circular farmland are combined into the category “Farm.” It should be highlighted that constructing the new dataset involved only adjusting category names; no processing was performed on the images themselves. One may refer to [43] for more details of the dataset construction.
To illustrate the differences in image data among the four datasets in the new RS cross-domain scene classification dataset, sample images from each dataset are shown in Figure 3.
As shown in Figure 3, there are significant differences among the four datasets in the Beach category regarding color, lighting, and beach objects. In the Airfield category, variations can be observed in the size of the aircraft, the number of planes, and the area of the airfields across the datasets. Similarly, in the Game Space category, the shape, size, and color of the regions exhibit notable differences among the datasets. Therefore, the four original remote sensing datasets used in the paper indeed demonstrate considerable variations in scale and lighting conditions, indicating substantial distribution and semantic differences between the target and multi-source domain data.
To intuitively illustrate the distribution characteristics among different datasets, the t-SNE algorithm is used for dimensionality reduction and visualization of these four datasets. The originally high-dimensional data, which is difficult to visualize directly, is transformed into a two-dimensional point cloud by the t-SNE. This transformation makes the distribution, clustering tendencies, and underlying structures of the datasets in the feature space more intuitive, as shown in Figure 4.
Particularly, we select the “Farm” category from the four datasets for t-SNE visualization. Different colored scatter points represent different datasets. After extracting features from “Farm” samples using a pre-trained EfficientNet, we applied t-SNE for dimensionality reduction. As shown in the figure, the scatter points corresponding to the four datasets exhibit distinct clusters, indicating that the same category samples from different datasets display significant distributional differences.
To further illustrate this issue, we also include representative image samples of the “Farm” class from each dataset in the figure, displayed at the four corners. The visual differences between these datasets are evident—some images feature large-scale, well-structured agricultural fields, while others depict scattered and irregular farmland patterns. These visual differences intensify the challenges of cross-domain scene classification.
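The visualization procedure described above is straightforward to reproduce; a minimal sketch using scikit-learn is shown below. Here, features is assumed to be the matrix of backbone embeddings of the “Farm” samples, and dataset_ids marks which of the four datasets each sample comes from; both names are hypothetical.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_farm_tsne(features, dataset_ids, names=("UC Merced", "AID", "RESISC45", "PatternNet")):
    # Project high-dimensional "Farm" features to 2-D and color the points by their source dataset.
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    for i, name in enumerate(names):
        mask = np.asarray(dataset_ids) == i
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=name)
    plt.legend()
    plt.title("t-SNE of 'Farm' features from the four datasets")
    plt.show()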
In this sense, given these pronounced distributional discrepancies, traditional supervised learning approaches, which rely heavily on large amounts of labeled data, may fail to generalize well when applied to new datasets. A model trained on one dataset may struggle to generalize effectively to another due to substantial variations in spatial structures, textures, and spectral characteristics across remote sensing images. Hence, it is highly likely to suffer from domain shift, leading to performance degradation. This is precisely the challenge that Domain Adaptation (DA) techniques seek to address. By introducing DA, we can bridge the domain gap, improve classification performance, and ensure better model generalization, making it a crucial solution for cross-domain remote sensing scene classification tasks.

4.2. Experimental Setup and Implementation Details

Experimental Setup: We used four RS domain adaptation datasets: UC Merced, AID, RESISC45, and PatternNet. For simplicity, these four datasets are abbreviated as M, A, N, and P in the subsequent sections. All dataset images were resized to 224 × 224.
The overall accuracy (OA) across all categories was adopted as the primary metric to quantify the model’s performance. This metric is derived by calculating the proportion of correctly predicted instances relative to the total dataset size:
$\mathrm{OA} = \dfrac{\sum_{i=1}^{C} n_{ii}}{|T|}, \qquad (14)$
where $C$ represents the number of categories, $n_{ii}$ is the number of correctly classified test samples for category $i$, and $|T|$ is the cardinality of the test set.
Implementation Details: MSSDANet is implemented with the PyTorch 1.9.1 framework on an RTX 3090 GPU. EfficientNet-B3 is used as the common feature extractor, with its parameters initialized from ImageNet pre-training; fine-tuning is performed during training. The terminal classification layer of the EfficientNet-B3 network is discarded, and the feature embeddings are derived directly from the preceding layer’s outputs through global spatial pooling. Given that both the dual-domain feature extractor and the task-specific classifier are initialized without pre-trained weights, we employ a differential learning rate strategy in which the classifier’s rate (0.01) is an order of magnitude higher than the feature extractor’s (0.001) to balance their optimization dynamics during training.
The optimization process employed minibatch SGD with a momentum coefficient of 0.9 to accelerate convergence. Following the adaptive learning rate scheduling outlined in RevGrad [39], we dynamically adjusted the learning rate throughout the training iterations rather than conducting an exhaustive grid search for the initial rate, which would be computationally prohibitive. Specifically, $\eta_{\theta} = \frac{\eta_0}{(1 + \epsilon \theta)^{\omega}}$, with $\eta_0 \in \{0.01, 0.001\}$, $\epsilon = 5$, and $\omega = 0.75$. Here, $\eta_0$ denotes the initial learning rate, $\eta_{\theta}$ is the dynamically adjusted learning rate, and $\theta$ varies linearly from 0 to 1 over the training epochs. The number of training epochs is set to 2000. For the hyperparameters α, β, and γ in the loss function, considering computational cost and simplicity, we selected their values from the set {1, 0.7, 0.5}. In the CC loss function, the temperature coefficient T is set to 1.8.
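The annealing schedule above can be written in a few lines; the snippet below is a direct transcription of the formula, where theta is taken as the fraction of completed training (an assumption consistent with the text).

def annealed_lr(eta0, theta, eps=5.0, omega=0.75):
    # RevGrad-style schedule: eta_theta = eta0 / (1 + eps * theta) ** omega, with theta in [0, 1].
    return eta0 / (1.0 + eps * theta) ** omega

# example: the feature extractor rate (eta0 = 0.001) at 40% of training is roughly 0.00044
print(annealed_lr(0.001, 0.4))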
Competitive Methods: We compare the performance of our MSSDANet against alternative methods including ADDA [35], RevGrad [39], MB-Net [41], MSCN [40], Siamese-GAN [42], and SSDAN [43]. It is worth noting that although SSDAN has demonstrated superior performance over existing methods, as presented in [43], it is a semi-supervised DA approach. In contrast, MSSDANet is an unsupervised DA method, meaning that no label information from the target domain is required. To ensure a fair evaluation, we use the baseline experimental results as provided in [43].

4.3. MSDA Experiments in the Two-Source Case

In the case of two-source datasets experiments, three datasets, A, M, and N, are selected. This setting constitutes three different experiments, where one dataset is designated as the target domain, and the remaining two datasets are used as source domains. We labeled these three sets of experiments as: {A, N→M}, {A, M→N}, and {M, N→A}. The respective OA and their average for these methods are shown in Table 1, where the optimal result has been bolded.
The quantitative results presented in Table 1 indicate that the MSSDANet algorithm achieves significantly better results compared to existing methods in two out of three experiments on the two-source datasets. Moreover, our MSSDANet surpasses the next best SSDAN algorithm by 2.2% in terms of average OA. This demonstrates that the network architecture and loss functions of our MSSDANet are better suited to fully extract domain-invariant common features, leading to superior generalization and applicability.

4.4. MSDA Experiments in the Three-Source Case

In the three-source datasets experiments, four datasets, A, M, N, and P, were selected. This setup allows for four distinct experiments, where one of the datasets serves as the target domain while the other three will constitute the source domains. We labeled these four sets of experiments as: {A, M, N → P}, {A, P, N → M}, {P, M, N → A}, and {A, M, P → N}. The respective OA and their average for these methods are shown in Table 2, where the optimal result has been bolded.
As shown in Table 2, the proposed MSSDANet algorithm attains the best results across all four experiments on the three-source datasets. In terms of OA, MSSDANet outperforms the next-best SSDAN algorithm by 1.6%. The results demonstrate the capability of our network architecture and loss functions in extracting domain-invariant common features. Note that in the previous two-source experiments, MSSDANet shows a lower OA in one case compared to the baseline algorithms, whereas in the three-source experiments it consistently outperforms all comparison methods. It can be inferred that with more source domains, achieving effective global alignment across the target domain and all source domains becomes increasingly challenging. In contrast, by means of local subdomain alignment, our method shows a clear advantage over existing baselines in addressing the complex and varied distributions inherent in MSDA. By aligning the features of subdomains instead of treating each domain as a single entity, local subdomain alignment is readily achieved by our method in the multi-source setting. Moreover, with carefully designed loss functions, MSSDANet effectively captures fine-grained invariant features and reduces the negative impact of domain shifts in MSDA tasks.

4.5. Ablation Study

We carried out ablation experiments to evaluate the individual contributions of respective loss functions in our model. Specifically, the case of {A, M, N → P} is selected, where the corresponding loss functions are gradually removed. The OA results of different combinations of loss functions are shown in Table 3.
From the ablation results, we observe the following: (1) When the Class Correlation (CC) loss is removed, the model’s performance decreases, indicating that our proposed CC loss effectively improves the model’s ability to capture intra-class features and differentiate inter-class features. This, in turn, greatly improves pseudo-label reliability and further promotes the learning efficacy of other modules. (2) When the Discriminant Semantic Transfer (DST) loss is removed, the model’s performance drops clearly from 86.81 to 82.32, which indicates that the DST loss contributes more to the adaptation performance than the CC loss. This proves that the DST loss can effectively exploit the semantic correlations between samples for accurate semantic alignment, which greatly enhances the model’s robustness. (3) Similar findings also apply to the Local Maximum Mean Discrepancy (LMMD) loss, which effectively aligns the joint distributions between related subdomains, alleviating the domain shift problem. (4) When the CC, DST, and LMMD losses are all removed, the OA value drops to 74.36, meaning that the traditional classification loss alone cannot handle MSDA tasks. In this sense, we expect our proposed losses to benefit other deep networks in their specific RS tasks.
Table 3 presents the results obtained by selectively adding or removing individual components to examine the overall impact of the loss function on the model. To gain a more detailed understanding of how the loss function affects classification performance, we further enhance our analysis by incorporating confusion matrix visualizations. Specifically, by presenting the confusion matrices for each ablation scenario, we provide a more intuitive view of the classification performance for each category, as illustrated in Figure 5.
By analyzing the numerical values and color intensity in the confusion matrix, we can intuitively observe the classification performance for each category. The diagonal elements represent correctly classified samples, while the off-diagonal elements indicate misclassifications. Therefore, a deeper color along the diagonal and lighter colors off the diagonal suggest better classification accuracy. By examining Figure 5, we can observe that compared to (e), which retains all loss functions, the results of (a), (b), and (c) exhibit lighter colors along the diagonal. Furthermore, the confusion matrix in (d), where only the classification loss is preserved, shows the lightest diagonal. These results highlight the contributions of each loss function to the model’s performance and demonstrate their effectiveness when combined as a whole.

4.6. The t-SNE Visualization of Domain Adaptation Results

To intuitively illustrate the adaptation performance of the MSSDANet, we present the t-SNE [52] visualization results of our method under the scenario of {A, N → M}, with the first two datasets serving as source domains and M as the target domain. Specifically, we project the data features onto a two-dimensional representation for visualization.
Figure 6a shows the visualized feature vectors of the target domain data obtained using a feature extractor with pre-trained weights, representing the clustering results without domain adaptation. For comparison, Figure 6b presents the visualization of feature vectors for the target domain data derived from the trained feature extractor, reflecting the clustering results after domain adaptation. The twelve colors represent the twelve classes of the target domain. By comparing the two images, it is evident that the proposed MSSDANet effectively captures the information of each category in the target domain.

4.7. Training Stability

To verify the stability of MSSDANet, we report the detailed OA values at every 10 iterations of model training. Specifically, two experimental groups, {A, M, N → P} and {A, P, M → N}, were selected for the case study. Figure 7 illustrates the experimental outcomes, where the vertical axis denotes the evaluation metric on the test set, and the horizontal axis corresponds to the iteration count.
From the curves of OA along the iterations, we observe that the proposed MSSDANet reaches convergence after 100 iterations. After convergence, the model’s test accuracy remains stable with no significant changes, suggesting the robustness of MSSDANet in the training process. This consistency also proves that our method effectively avoids overfitting and exhibits strong generalization capability. The smooth OA curves confirm that our MSSDANet not only achieves good convergence but also maintains high accuracy throughout the training process, ensuring reliable adaptation when deployed on unseen data.
To provide a more comprehensive analysis of MSSDANet’s training stability from multiple perspectives, we further selected the A, N→M task and plotted the loss variation curves over training epochs. Specifically, we focus on the first 600 epochs. Since our method requires training on each source–target domain pair in every epoch, we present the training stability curves for both domain pairs in this task. The results are illustrated in Figure 8.
Figure 8a presents the training loss curve for the A-M domain pair, while Figure 8b displays the training loss curve for the N-M domain pair. From the results, we observe that in this A, N→M task, the training process begins to converge around 100 epochs. Furthermore, the loss values remain relatively stable throughout the subsequent training, demonstrating the training stability of the proposed MSSDANet algorithm.

4.8. Parameter Sensitivity

For the choices of hyperparameters α, β, and γ in the loss function of Equation (13), considering computational cost and simplicity, we select all values from the set {1, 0.7, 0.5}. To study the sensitivity of these hyperparameters, we report the OA results against their different values on two tasks, which are shown in Figure 9.
From Figure 9, it is evident that the MSSDANet algorithm demonstrates low sensitivity to the hyperparameters α, β, and γ, maintaining strong stability across various hyperparameter settings. Note that only three candidate values are predefined for each hyperparameter, which makes them easy to tune. In this sense, our MSSDANet is highly robust to variations in the loss function hyperparameters.

4.9. Extended Experiment

It should be noted that multi-source domain adaptation techniques have been less frequently applied in RS scene classification. To further substantiate the efficacy of our MSSDANet, we have extended our experiments to include the Office-31 dataset [53], a well-established benchmark in natural image DA research. In this extended evaluation, we systematically compared our method against several state-of-the-art algorithms, including ResNet50 [54], MDAN [55], M3SDA [56], MFSAN [25], DARN [57], Ltc-MSDA [58], MIAN [59], T-SVDNet [60], DSFE [61], and TFFN [62]. We also report the parameter count (Params) and FLOPs of some advanced methods documented in the literature. Note that all compared methods use ResNet50 as the backbone. To ensure a fair comparison, we also replaced the backbone of MSSDANet with ResNet50. The experimental settings in this section are configured in accordance with those described in Section 4.2. The quantitative results are presented in Table 4. The first three columns of the results represent the OA values for three different tasks, while the fourth column shows the average OA values across these tasks. Notably, all algorithms in Table 4 employ MSDA techniques except ResNet50, which serves as a baseline model without DA.
Several conclusions can be drawn from Table 4: (1) For datasets with significant distribution differences, ResNet50 struggles to achieve satisfactory generalization performance. In contrast, the methods that incorporate MSDA consistently achieve classification accuracies exceeding 85%, significantly higher than the 73.95% obtained by this baseline. (2) The proposed MSSDANet yields the highest average OA on the Office-31 dataset. Notably, on the most challenging task, W, D → A, our method outperforms the ResNet50 baseline by 13.18% and exceeds the second-best algorithm by 1.68%. These results demonstrate that the subdomain alignment strategy employed by MSSDANet effectively captures local fine-grained features, thereby improving feature alignment across multi-source domain data. (3) MSSDANet achieves this generalization and classification performance without significantly increasing the parameter count or FLOPs: compared to the ResNet50 baseline, it adds only 2.37 M parameters and 4.35 G FLOPs, and relative to the other high-performing algorithms it has the most favorable computational cost.
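For reference, the Params and FLOPs columns of Table 4 can be reproduced for the ResNet-50 backbone with an off-the-shelf profiler. The sketch below assumes a torchvision ResNet-50 and the third-party thop package; the additional adaptation branches of MSSDANet would be profiled in the same way.

```python
import torch
from torchvision.models import resnet50
from thop import profile                          # third-party profiler, assumed installed

backbone = resnet50(weights="IMAGENET1K_V1")      # same backbone as the compared methods
dummy = torch.randn(1, 3, 224, 224)               # a single 224x224 RGB input image

macs, params = profile(backbone, inputs=(dummy,), verbose=False)
print(f"Params: {params / 1e6:.2f} M, FLOPs: {macs / 1e9:.2f} G")
# For a plain ResNet-50 this reports roughly 25.56 M parameters and about 4.09 G
# multiply-accumulate operations, matching the baseline row of Table 4.
```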
To sum up, MSSDANet demonstrates excellent cross-domain classification performance on both remote sensing images and natural images, indicating that our method possesses strong generalization capability. The experimental results in this paper highlight its ability to effectively capture and align invariant features from different data distributions, which is a key factor for the success of cross-domain learning.

5. Conclusions

Multi-source domain adaptation (MSDA) for RS cross-domain scene classification assumes access to well-labeled source domain data and target domain data that may be unlabeled or only partially labeled. To address the difficulty of aligning multiple source domains and the performance degradation caused by ignoring fine-grained category information in RS imagery, we proposed an unsupervised MSDA method, termed the Multi-Source Subdomain Distribution Alignment Network (MSSDANet). Our method involves two key changes relative to existing approaches. First, it adopts a two-level feature extraction strategy to achieve both global alignment and local subdomain alignment between the target domain and multiple source domains. Second, to capture semantic relationships between source and target domain samples and to reduce the confusion among features of different classes within the target domain, the DST and CC loss functions are used to guide model training. Extensive experiments on four common remote sensing datasets show that MSSDANet yields encouraging adaptation performance in MSDA tasks.
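As a rough illustration of the two-level design summarized above, the following sketch (not the released implementation) shows a shared common feature extractor, one dual-domain branch per source domain, and target predictions obtained by averaging the outputs of the branch classifiers; the layer sizes and branch structure are placeholders.

```python
import torch
import torch.nn as nn

class MSSDANetSketch(nn.Module):
    """Shared extractor + per-source dual-domain branches with averaged predictions."""

    def __init__(self, common_extractor: nn.Module, feat_dim: int,
                 num_sources: int, num_classes: int):
        super().__init__()
        self.common = common_extractor                           # e.g., a pre-trained CNN backbone
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())   # placeholder dual-domain extractor
            for _ in range(num_sources))
        self.classifiers = nn.ModuleList(
            nn.Linear(256, num_classes) for _ in range(num_sources))

    def forward(self, x):
        shared = self.common(x)                                  # globally aligned, shared features
        probs = [torch.softmax(clf(branch(shared)), dim=1)       # per-branch class probabilities
                 for branch, clf in zip(self.branches, self.classifiers)]
        return torch.stack(probs).mean(dim=0)                    # average over the branch classifiers
```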
In future work, we could consider incorporating specific vision-language models like RemoteCLIP [63] to enhance the model’s generalization ability. Another possible direction is to extend our framework to more specific scenarios of RS scene classification, such as source-free domain adaptation [64,65] and domain generalization [66].

Author Contributions

Conceptualization, D.L. and L.W.; methodology, Y.W.; software, Y.F.; validation, Z.S.; formal analysis, R.L.; writing—original draft preparation, Y.W. and Z.S.; investigation, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Key Research and Development Program of Shaanxi (Program No. 2024GX-YBXM-144) and the Natural Science Basic Research Program of Shaanxi (Program No. 2024JC-YBMS-499).

Data Availability Statement

The original data presented in the study are openly available in Ref. [43].

Acknowledgments

The authors are grateful to the editors and reviewers for their constructive feedback and valuable suggestions, which significantly improved the quality of this manuscript. We extend our gratitude to the authors of [43] for making the datasets publicly available, without which the achievements presented in this article would not have been possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big Data for Remote Sensing: Challenges and Opportunities. Proc. IEEE 2016, 104, 2207–2219. [Google Scholar] [CrossRef]
  2. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31, pp. 1647–1657. [Google Scholar]
  3. Shen, J.; Qu, Y.; Zhang, W.; Yu, Y. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA, 2–7 February 2018; pp. 4058–4065. [Google Scholar]
  4. Zhang, M.; Zhao, X.; Li, W.; Zhang, Y.; Tao, R.; Du, Q. Cross-scene joint classification of multisource data with multilevel domain adaption network. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11514–11526. [Google Scholar] [CrossRef] [PubMed]
  5. Ren, Z.; Du, Z.; Zhang, Y.; Sha, F.; Li, W.; Hou, B. Multi-Step Unsupervised Domain Adaptation in Image and Feature Space for Synthetic Aperture Radar Image Terrain Classification. Remote Sens. 2024, 16, 1901. [Google Scholar] [CrossRef]
  6. Wang, B.; Lu, X.; Zheng, X.; Li, X. Semantic descriptions of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1274–1278. [Google Scholar] [CrossRef]
  7. Zheng, X.; Yuan, Y.; Lu, X. A deep scene representation for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4799–4809. [Google Scholar] [CrossRef]
  8. Yuan, Y.; Fang, J.; Lu, X.; Feng, Y. Remote sensing image scene classification using rearranged local features. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1779–1792. [Google Scholar] [CrossRef]
  9. Longbotham, N.; Chaapel, C.; Bleiler, L.; Padwick, C.; Emery, W.J.; Pacifici, F. Very high resolution multiangle urban classification analysis. IEEE Trans. Geosci. Remote Sens. 2011, 50, 1155–1170. [Google Scholar] [CrossRef]
  10. Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep learning for remote sensing image classification: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1264. [Google Scholar] [CrossRef]
  11. Song, J.; Gao, S.; Zhu, Y.; Ma, C. A survey of remote sensing image classification based on CNNs. Big Earth Data 2019, 3, 232–254. [Google Scholar] [CrossRef]
  12. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
  13. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
  14. Othman, E.; Bazi, Y.; Melgani, F.; Alhichri, H.; Alajlan, N.; Zuair, M. Domain adaptation network for cross-scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4441–4456. [Google Scholar]
  15. Sun, B.; Saenko, K. Deep CORAL: Correlation alignment for deep domain adaptation. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Part III; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 443–450. [Google Scholar]
  16. Long, M.; Cao, Y.; Wang, J.; Jordan, M. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 97–105. [Google Scholar]
  17. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2208–2217. [Google Scholar]
  18. Zellinger, W.; Grubinger, T.; Lughofer, E.; Natschläger, T.; Saminger-Platz, S. Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv 2017, arXiv:1702.08811. [Google Scholar]
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the NIPS’14: Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  20. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  21. Zhang, X.; Zhuang, Y.; Zhang, T.; Li, C.; Chen, H. Masked Image Modeling Auxiliary Pseudo-Label Propagation with a Clustering Central Rectification Strategy for Cross-Scene Classification. Remote Sens. 2024, 16, 1983. [Google Scholar] [CrossRef]
  22. Xu, C.; Shu, J.; Zhu, G. Multi-Feature Dynamic Fusion Cross-Domain Scene Classification Model Based on Lie Group Space. Remote Sens. 2023, 15, 4790. [Google Scholar] [CrossRef]
  23. Zhu, P.; Zhang, X.; Han, X.; Cheng, X.; Gu, J.; Chen, P.; Jiao, L. Cross-domain classification based on frequency component adaptation for remote sensing images. Remote Sens. 2024, 16, 2134. [Google Scholar] [CrossRef]
  24. Zhao, S.; Li, B.; Xu, P.; Keutzer, K. Multi-source domain adaptation in the deep learning era: A systematic survey. arXiv 2020, arXiv:2002.12169. [Google Scholar]
  25. Zhu, Y.; Zhuang, F.; Wang, D. Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5989–5996. [Google Scholar]
  26. Ren, C.X.; Liu, Y.H.; Zhang, X.W.; Huang, K.K. Multi-source unsupervised domain adaptation via pseudo target domain. IEEE Trans. Image Process. 2022, 31, 2122–2135. [Google Scholar]
  27. Gong, T.; Zheng, X.; Lu, X. Cross-domain scene classification by integrating multiple incomplete sources. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10035–10046. [Google Scholar]
  28. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  29. Yan, H.; Ding, Y.; Li, P.; Wang, Q.; Xu, Y.; Zuo, W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 945–954. [Google Scholar]
  30. Long, M.; Wang, J.; Ding, G.; Sun, J.; Yu, P.S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision 2013, Sydney, Australia, 1–8 December 2013; pp. 2200–2207. [Google Scholar]
  31. Zhu, Y.; Zhuang, F.; Wang, J.; Ke, G.; Chen, J.; Bian, J.; Xiong, H.; He, Q. Deep subdomain adaptation network for image classification. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 1713–1722. [Google Scholar]
  32. Li, M.; Zhai, Y.M.; Luo, Y.W.; Ge, P.F.; Ren, C.X. Enhanced transport distance for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 14–19 June 2020; pp. 13933–13941. [Google Scholar]
  33. Pan, Y.; Yao, T.; Li, Y.; Ngo, C.W.; Mei, T. Exploring category-agnostic clusters for open-set domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 14–19 June 2020; pp. 13864–13872. [Google Scholar]
  34. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar]
  35. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2962–2971. [Google Scholar]
  36. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  37. Rangwani, H.; Aithal, S.K.; Mishra, M.; Jain, A.; Radhakrishnan, V. A closer look at smoothness in domain adversarial training. In Proceedings of the International Conference on Machine Learning 2022, Baltimore, MD, USA, 17–23 July 2022; pp. 18378–18399. [Google Scholar]
  38. Pei, Z.; Cao, Z.; Long, M.; Wang, J. Multi-adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3934–3941. [Google Scholar]
  39. Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015; Volume 37, pp. 1180–1189. [Google Scholar]
  40. Lu, X.; Gong, T.; Zheng, X. Multisource compensation network for remote sensing cross-domain scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2504–2515. [Google Scholar]
  41. Al Rahhal, M.M.; Bazi, Y.; Abdullah, T.; Mekhalfi, M.L.; AlHichri, H.; Zuair, M. Learning a multi-branch neural network from multiple sources for knowledge adaptation in remote sensing imagery. Remote Sens. 2018, 10, 1890. [Google Scholar] [CrossRef]
  42. Bashmal, L.; Bazi, Y.; AlHichri, H.; AlRahhal, M.M.; Ammour, N.; Alajlan, N. Siamese-GAN: Learning Invariant Representations for Aerial Vehicle Image Categorization. Remote Sens. 2018, 10, 351. [Google Scholar] [CrossRef]
  43. Lasloum, T.; Alhichri, H.; Bazi, Y.; Alajlan, N. SSDAN: Multi-Source Semi-Supervised Domain Adaptation Network for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 3861. [Google Scholar] [CrossRef]
  44. Xie, S.; Zheng, Z.; Chen, L.; Chen, C. Learning semantic representations for unsupervised domain adaptation. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 5423–5432. [Google Scholar]
  45. Xanthopoulos, P.; Pardalos, P.M.; Trafalis, T.B. Linear discriminant analysis. In Robust Data Mining; Springer: Berlin/Heidelberg, Germany, 2013; pp. 27–33. [Google Scholar]
  46. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning 2017, Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  47. Jin, Y.; Wang, X.; Long, M.; Wang, J. Minimum Class Confusion for Versatile Domain Adaptation. In Proceedings of the European Conference on Computer Vision 2020, Glasgow, UK, 23–28 August 2020; pp. 464–480. [Google Scholar]
  48. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 270–279. [Google Scholar]
  49. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar]
  50. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  51. Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar]
  52. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  53. Saenko, K.; Kulis, B.; Fritz, M.; Darrell, T. Adapting Visual Category Models to New Domains. In Proceedings of the 11th European Conference on Computer Vision (ECCV 2010), Heraklion, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 213–226. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June 2016; pp. 770–778. [Google Scholar]
  55. Zhao, H.; Zhang, S.; Wu, G.; Moura, J.M.F.; Costeira, J.P.; Gordon, G.J. Adversarial multiple source domain adaptation. In Proceedings of the NIPS’18: Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  56. Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; Wang, B. Moment Matching for Multi-Source Domain Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019; pp. 1406–1415. [Google Scholar]
  57. Wen, J.; Greiner, R.; Schuurmans, D. Domain Aggregation Networks for Multi-Source Domain Adaptation. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 12–18 July 2020; pp. 10214–10224. [Google Scholar]
  58. Wang, H.; Xu, M.; Ni, B.; Zhang, W. Learning to Combine: Knowledge Aggregation for Multi-Source Domain Adaptation. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Part VIII. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 727–744. [Google Scholar]
  59. Park, G.Y.; Lee, S.W. Information-Theoretic Regularization for Multi-Source Domain Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9214–9223. [Google Scholar]
  60. Li, R.; Jia, X.; He, J.; Chen, S.; Hu, Q. T-SVDNet: Exploring High-Order Prototypical Correlations for Multi-Source Domain Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 9991–10000. [Google Scholar]
  61. Wu, K.; Jia, F.; Han, Y. Domain-specific feature elimination: Multi-source domain adaptation for image classification. Front. Comput. Sci. 2023, 17, 174705. [Google Scholar]
  62. Li, Y.; Wang, S.; Wang, B.; Hao, Z.; Chai, H. Transferable feature filtration network for multi-source domain adaptation. Knowl. Based Syst. 2023, 260, 110113. [Google Scholar]
  63. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar]
  64. Fang, Y.; Yap, P.T.; Lin, W.; Zhu, H.; Liu, M. Source-free unsupervised domain adaptation: A survey. Neural Netw. 2024, 174, 106230. [Google Scholar]
  65. Li, J.; Yu, Z.; Du, Z.; Zhu, L.; Shen, H.T. A comprehensive survey on source-free domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5743–5762. [Google Scholar]
  66. Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; Loy, C.C. Domain generalization: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4396–4415. [Google Scholar]
Figure 1. Illustration of subdomain alignment. The left image shows that solely performing global alignment between source and target domains may overlook fine-grained category information, leading to reduced classification accuracy. In contrast, the right image demonstrates that subdomain alignment captures fine-grained category details, hence improving the performance. Different colors represent different domains, and shapes indicate categories. This comparison reveals that with the help of subdomain alignment, the distributions of related subdomains belonging to the same class in both the source and target domains could be more precisely matched.
Figure 2. Overview of MSSDANet architecture. To fully align the target and source domains, a two-level feature extraction strategy is adopted in the network: the common feature extractor is used to extract invariant features of the data from all source domains and the target domain, mapping them into the same feature space; the dual-domain feature extractor is deployed to further minimize the feature distribution discrepancy between corresponding subdomains of each source–target domain pair. We use four types of loss functions to constrain each pair of source–target domain features. The arrows in the figure with different colors represent the processing directions of different data. As shown in the figure, the red, green, and yellow arrows represent source 1, source 2, and the target data, respectively. After passing through the common feature extractor, they are input into their respective dual-domain feature extractors. The loss functions are shown in Section 3.2.
Figure 3. Illustrative examples from four benchmark datasets adopted in our experiments.
Figure 4. Visualization of the distribution differences between these four datasets.
Figure 5. Confusion matrix visualization of the ablation experiment results: (a) MSSDANet without the LMMD loss, (b) without the DST loss, (c) without the CC loss, (d) with only the CLS loss retained, (e) the full proposed MSSDANet.
Figure 6. The t-SNE visualization of MSSDANet before and after DA: (a) shows the results before DA, while (b) displays the results after DA. t-SNE is performed on the feature extractor’s output. Twelve different colors represent the 12 classes in the dataset, respectively.
Figure 7. OA convergence of our method on two different tasks.
Figure 8. Loss convergence for (a) Source A and (b) Source N, respectively.
Figure 9. OA with respect to hyperparameters (a) α, (b) β, and (c) γ, respectively.
Table 1. Comparison of OA results under the two-source datasets.
Methods | A, N → M | A, M → N | M, N → A | Average
ADDA [35] | 0.6752 | 0.7303 | 0.8131 | 0.7395
RevGrad [39] | 0.6406 | 0.6859 | 0.7879 | 0.7048
MB-Net [41] | 0.7235 | 0.7525 | 0.7758 | 0.7503
MSCN [40] | 0.8401 | 0.7955 | 0.9159 | 0.8505
Siamese-GAN [42] | 0.7833 | 0.7936 | 0.8728 | 0.8166
SSDAN [43] | 0.9715 | 0.9186 | 0.9165 | 0.9355
MSSDANet (ours) | 0.9958 | 0.8929 | 0.9839 | 0.9575
Table 2. Comparison of OA results under the three-source dataset.
Methods | A, M, N → P | A, P, N → M | P, M, N → A | A, M, P → N | Average
ADDA [35] | 0.6899 | 0.6039 | 0.6239 | 0.6508 | 0.6421
RevGrad [39] | 0.7875 | 0.6637 | 0.6283 | 0.6735 | 0.6883
MB-Net [41] | 0.7838 | 0.7309 | 0.6872 | 0.7196 | 0.7304
MSCN [40] | 0.8391 | 0.8383 | 0.7908 | 0.8150 | 0.8197
Siamese-GAN [42] | 0.8450 | 0.8142 | 0.7717 | 0.8085 | 0.8092
SSDAN [43] | 0.9754 | 0.9856 | 0.9309 | 0.8527 | 0.9389
MSSDANet (ours) | 0.9808 | 0.9975 | 0.9756 | 0.8681 | 0.9555
Table 3. Ablation experiment results of loss function.
L_cls | L_lmmd | L_dst | L_cc | A, M, P → N
74.36%
82.79%
82.32%
85.76%
86.81%
Table 4. Performance Comparison of related methods on Office-31 Dataset.
Methods | A, W → D | W, D → A | A, D → W | Average | Params (M) | FLOPs (G)
ResNet50 [54] | 0.9930 | 0.6250 | 0.9670 | 0.7395 | 25.56 | 4.09
MDAN [55] | 0.9960 | 0.6602 | 0.9783 | 0.8781 | – | –
M3SDA [56] | 1.0000 | 0.6856 | 0.9906 | 0.8921 | 28.54 | 13.52
MFSAN [25] | 0.9950 | 0.7089 | 0.9850 | 0.9023 | – | –
DARN [57] | 0.9987 | 0.6631 | 0.9864 | 0.8827 | – | –
Ltc-MSDA [58] | 0.9962 | 0.6895 | 0.9953 | 0.8937 | – | –
MIAN [59] | 0.9948 | 0.7465 | 0.9849 | 0.9087 | – | –
T-SVDNet [60] | 0.9940 | 0.7410 | 0.9960 | 0.9100 | 28.82 | 13.38
DSFE [61] | 0.9940 | 0.7320 | 0.9880 | 0.9050 | 28.04 | 13.20
TFFN [62] | 1.0000 | 0.7400 | 0.9900 | 0.9100 | 28.10 | 13.08
MSSDANet (Ours) | 0.9980 | 0.7568 | 0.9836 | 0.9128 | 27.93 | 8.44