1. Introduction
A fundamental assumption of conventional machine learning is that test samples are drawn from the same distribution as training samples [1]. However, this assumption does not always hold in real-world applications. For multi-sensor systems, distribution mismatch can be caused by many factors, e.g., varying sensor parameters and background noise. Such a mismatch is easily found in remote sensing images (different areas and weather conditions) and person re-identification (different shooting angles). The violation of the assumption results in severe performance degradation, and labeling data from all sources is laborious; as Figure 1a shows, the decision boundary induced by source samples performs poorly in the target domain. To tackle this problem, domain adaptation (DA) [2] has attracted much attention. DA aims at leveraging rich knowledge from training data (also referred to as the source domain) to make decisions about different, but related, testing data (also referred to as the target domain), and it has been successfully applied in many areas, such as image classification [3,4,5], person re-identification [6,7], and activity recognition [8,9,10].
In order to address the issue of distribution mismatch, a series of research works focused on discovering domain-shared feature representations [11,12], on which a classifier can then be trained. A graphical illustration of the idea is given in Figure 1b; intuitively, the decision boundary learned with source samples also works well in the target domain since the two domains are aligned. Instance re-weighting aims to narrow the distribution distance by assigning weights to samples; methods differ mainly in the criterion used to calculate the weights, e.g., kernel mean matching (KMM). Chen et al. re-weighted source samples for subspace alignment, where the weight of a source sample increases if it is distributed similarly to the target samples [13]. Chu et al. attempted to match distributions with KMM and minimize the empirical risk of the source domain simultaneously [14]. However, studies show that instance re-weighting works well when the source and target domains differ only slightly, and the performance degrades as the domain gap becomes larger [15]. Feature mapping seeks an appropriate feature space/subspace to reduce domain discrepancy. Compared to simple re-weighting, it is capable of learning more powerful representations by means of complex non-linear mappings, such as kernel methods and deep neural networks, thus yielding remarkable performance. Pan et al. proposed a PCA-like framework, named transfer component analysis (TCA), which adopts the maximum mean discrepancy (MMD) as the loss function and maps data to the feature space [16]. Extending TCA, Long et al. introduced conditional MMD by computing the pseudo labels of target samples, and an iterative training scheme was applied to obtain more accurate labels [17]. Inspired by the success of deep neural networks (DNN), the combination of domain adaptation and DNN has also achieved remarkable success. Tzeng et al. proposed a composite network that minimizes the distribution distance by MMD and the cross-entropy of source samples [18]. Furthermore, Long et al. employed multi-kernel MMD for matching features [19]. Both instance re-weighting and feature mapping rely greatly on the evaluation of the distribution distance; however, statistical metrics, e.g., MMD, have been proven to be sensitive to outliers and class weight bias [20]. Another popular method is adversarial-training-based domain adaptation, which constructs a source classifier and a domain discriminator simultaneously. The source classifier aims to recognize the objects of multiple classes, and the domain discriminator is designed to learn consistent features for two domains.
Despite the success of distribution matching, there still remain two major issues: (1) Is distribution matching necessary? Deep convolutional neural networks (CNN) can learn fairly unbiased representations for image data, as shown by the high accuracy reported on complex vision tasks using CNN representations and a linear classifier [21]. (2) Existing works make predictions for target samples independently, i.e., the label of a target sample is obtained according to the relation between that sample and the source samples, while the relations among target samples are ignored. In this paper, we propose a novel framework based on the nearest class prototype for unsupervised domain adaptation. Similar to instance re-weighting, the proposed method aims to study the sample difference in cross-domain scenarios, but it does so in both the source and target domain, whereas instance re-weighting focuses on source samples. As Figure 1c shows, instead of aligning domains, we aim at learning an adaptive decision boundary for the two domains. Specifically, we explore the diversity within samples belonging to the same class and find a balance between single-sample discrimination and class-wise discrimination. Furthermore, a multi-stage training scheme is presented for better exploiting the discriminative structures in the target domain. The contributions of this paper are summarized as follows.
Corresponding to Issue (1), there is no distribution matching strategy in our method. Experimental results show that the proposed classifier adaptation can achieve comparable performance when compared to popular distribution matching methods.
In response to Issue (2), we propose an easy-to-hard testing scheme. The underlying idea is that the difficulties in recognizing target samples vary from each other, and easy samples along with their labels can assist the prediction for hard samples.
We propose the modified nearest class prototype, which allows more diversity within the same class. Ideally, clusters with less domain discrepancy would yield correct predictions for target samples.
The rest of this paper is organized as follows. Section 2 gives the background knowledge on related DA works and then discusses the nearest neighbors and nearest class prototype. In Section 3, we describe our method in detail. Section 4 and Section 5 present the experiments and some empirical analysis. Finally, Section 6 concludes the paper and provides some ideas for future research.
2. Related Works
In this section, we first give a formal definition for unsupervised/homogeneous domain adaptation. Then, a brief introduction of the nearest neighbors and nearest class prototype is presented.
2.1. Domain Adaptation
Domain adaptation deals with the scenario where training and testing data have different distributions. Formally, we first introduce the concepts of domain and task.
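Domain: A domain consists of data and a distribution, $\mathcal{D} = \{X, P(X)\}$. For standard DA problems, we have $\mathcal{D}_s = \{X_s, P(X_s)\}$ and $\mathcal{D}_t = \{X_t, P(X_t)\}$ for the source and target domain, respectively. Besides, the two domains have different distributions, $P(X_s) \neq P(X_t)$.
Task: A task includes labels and the mapping function, $\mathcal{T} = \{Y, f(\cdot)\}$. It is worth noting that we can learn multiple source mappings with different models since source data are well labeled. Correspondingly, we have $\mathcal{T}_s = \{Y_s, f_s(\cdot)\}$ and $\mathcal{T}_t = \{Y_t, f_t(\cdot)\}$, and the goal is to learn the target mapping $f_t$, i.e., $f_t: X_t \rightarrow Y_t$.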
In this paper, we study unsupervised/homogeneous domain adaptation problems, which means that (1) there are no labeled samples for training in the target domain, and target labels are only available for evaluating methods; and (2) source and target data have the same dimensions. Previous works focused on reducing the distribution mismatch, based on either nonparametric (MMD, CORAL) or parametric (A-distance) metrics, but gave very limited consideration to the relation between the learned representation and the decision boundary. A similar idea to ours is pseudo-label based domain adaptation (PLDA), which alternates between feature learning and pseudo-label learning. To the best of our knowledge, Joint Distribution Adaptation (JDA) [17] is the first method that adopts a classifier to obtain pseudo labels, and the objective is to estimate the conditional probability of target samples. Consequently, it allows us to match both the conditional and marginal distributions. Besides, it has an iterative updating strategy, and the classifier is expected to become more powerful as the training proceeds. Wang et al. further pointed out that pseudo labels are not always reliable because of the domain shift, and then proposed confidence-aware pseudo label selection (CAPLS) [22] and selective pseudo labeling (SPL) [23], which select more credible pseudo labels.
Such methods combine feature learning and classifier training, so that the desired representations and classifiers can be obtained simultaneously. In this paper, we aim to learn a domain adaptive classifier based on the nearest class prototype, which differs from PLDA in the following aspects: (1) PLDA does not consider the characteristics of the classifier; in other words, it does not matter which type of classifier one chooses: the nearest neighbor is fine, and a support vector machine is also feasible. The classifier is only used to make predictions; the emphasis is on learning domain-invariant representations. In our method, however, we analyze how the domain discrepancy affects the standard nearest class prototype and propose several strategies to alleviate the negative effects. Most importantly, there are no distribution matching procedures in our method, which is its main difference from other works. (2) Our easy-to-hard testing scheme may seem similar to sample selection, since both involve a selection process. In fact, they are completely different. Selective pseudo labeling aims to reduce the negative impact of wrong labels, whereas our goal is to better exploit the discrimination power, i.e., we can obtain more precise predictions for hard samples by considering both source (all) and target (easy) samples. Hence, sample selection needs to predict all the target samples during each iteration, but our easy-to-hard testing works on a decreasing number of target samples: once a sample is deemed to have a credible label, it is removed from the test set.
2.2. Nearest Neighbor and Nearest Class Prototype
The nearest neighbor (NN) and nearest class prototype (NCP) are two pattern recognition methods, which are widely used due to their simplicity [24,25,26]. Given a query sample and a training set, under the clustering assumption and manifold assumption, NN and NCP search for the nearest sample/class prototype and take the corresponding label as the prediction for the query sample. It is worth emphasizing that there are many ways to compute the class prototype [27], e.g., learning vector quantization (LVQ) [28] and mean vector (MV). In this paper, we focus on MV due to its simplicity and stationarity.
In Figure 2, we give an intuitive description of the NN and NCP. These methods work well when the training and testing sets have the same distribution, but when applied to cross-domain tasks, the clustering assumption does not always hold due to distribution mismatch, thus leading to severe performance degradation. In this paper, we follow the assumption of sample re-weighting based methods, i.e., samples have different importance. A modified NCP method is proposed, which allows more diversity within one class than the original NCP; a detailed description can be found in the next section.
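To make the difference between the two rules concrete, the following is a minimal sketch in plain Python. It assumes the Euclidean distance and mean-vector (MV) prototypes; the function names and the toy data are hypothetical and are not taken from the paper's implementation.

```python
import math


def mean_vector(rows):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nn_predict(train_x, train_y, query):
    """NN: label of the single closest training sample."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def ncp_predict(train_x, train_y, query):
    """NCP with MV prototypes: label of the closest class mean."""
    prototypes = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        prototypes[label] = mean_vector(members)
    return min(prototypes, key=lambda c: math.dist(prototypes[c], query))


# Hypothetical toy data: class 0 contains one outlier close to class 1.
train_x = [[0, 0], [0, 1], [1, 0], [3, 3], [4, 4], [5, 4], [4, 5]]
train_y = [0, 0, 0, 0, 1, 1, 1]
query = [3.2, 3.2]

print("NN :", nn_predict(train_x, train_y, query))   # follows the outlier
print("NCP:", ncp_predict(train_x, train_y, query))  # follows the class means
```

On this toy set, the two rules disagree: NN is pulled by the single outlier, whereas NCP averages it away. This is only an illustration of the two decision rules, not a statement about their accuracy on real DA tasks.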
3. Methodology
In this section, we first introduce the general framework of the proposed method and the necessary notations, then two components, the modified NCP and easy-to-hard testing scheme, are described in detail.
3.1. Framework and Notations
Suppose that we have source and target data, which are drawn from two related, but different distributions. Besides, the source data are well labeled, while the target data have no labels. The goal is to obtain labels for target samples with high precision.
Table 1 introduces the necessary notations and descriptions.
In this paper, we propose a novel unsupervised DA solution based on the NCP. Firstly, we present the modified NCP by finding a balance between single-sample discrimination and class-wise discrimination. It allows more diversity within the same class and is thus capable of preserving more local structures. Besides, an easy-to-hard testing scheme is introduced. Instead of predicting target samples independently, it selects so-called easy samples, whose predicted labels are considered more confident, and these easy samples along with their labels assist the prediction of hard samples. By doing this, we hope to utilize the discriminative information existing in target samples. A graphical illustration of the proposed method is given in Figure 3.
As the title states, we aim to explore the intrinsic relation among samples and learn more robust decision boundaries, both in the source and target domain. To be more specific, we consider the domain discrepancy revealed in each source sample. Recall that NCP exploits the class center to make the prediction for query samples, but for each class, there are both less biased and more biased samples. Intuitively, we hope to use the less biased samples for prediction. By grouping source samples drawn from the same class into several clusters (class sub-centers), the query samples are able to select the closest class sub-centers, thus alleviating the negative effects brought by very biased samples. When it comes to the target domain, we hope to make precise predictions by utilizing local discriminative structures existing in the target domain. Previous works focused on the discriminative information in the source domain, which is easy to quantify and use since source samples are well labeled, while ignoring the discrimination power among unlabeled target samples. Similar to the source, we assume that target samples differ in prediction difficulty. A hierarchical structure then arises naturally: easy samples are predicted in the earlier steps, so we can label hard samples by considering both labeled source samples and previously predicted easy target samples.
3.2. Modified Nearest Class Prototype
The nearest neighbor selects the nearest samples to a query sample by means of a certain metric, e.g., the Euclidean distance. It explores the sample-to-sample relation, so we call it single-sample discrimination, which can be considered a specific form of the manifold assumption. However, when applied to DA tasks, the manifold assumption may not always hold due to the domain shift [29], thus causing performance degradation. On the other hand, the nearest class prototype follows the clustering assumption and makes the prediction by means of the distance between the class center and the query sample, so we call it class-wise discrimination. Similarly, the clustering assumption is also broken by the domain shift [29]. In this paper, we propose the modified nearest class prototype. The original NCP considers all the training samples to be equal, while ignoring the influence of the distribution mismatch, i.e., samples have different importance since they hold various degrees of bias. Naturally, we hope to predict the target sample using less biased source samples.
We can quantify the bias degree of two sets of data, i.e., the source set and target set, using some parametric or nonparametric metrics. However, evaluating the degree of bias for a single sample is not feasible. Therefore, we further assume that the less biased and very biased samples exhibit disparate properties, so that they can be represented by several clusters. It is worth emphasizing that the distribution mismatch would ruin the cross-domain manifold assumption and clustering assumption, but in within-domain cases, the manifold assumption and clustering assumption always hold since all samples drawn from the same domain are considered to be independent and identically distributed (i.i.d.).
Given training data and labels and testing data, the pseudocode is shown in Algorithm 1. Here, k is the number of clusters for a certain category. Firstly, for samples with the same label in the training set, we divide them into k clusters and record each cluster center. Here, we employ K-means clustering due to its simplicity. When testing, we calculate the distance between the query sample and each cluster center (there are k × C clusters in total, where C is the number of classes). Then, the label is given by the cluster with the shortest distance. Notice that we set a fixed value k as the number of clusters for each class, which may seem unreasonable: the number should change as the distribution changes. If the samples are close to each other, we need a small k to keep their affinity, and vice versa. We admit that a fixed k is not an ideal choice, but it has the following advantages. (1) There is no need to search for k, which keeps the method efficient. According to the literature, searching for an optimal k is somewhat time-consuming, e.g., the elbow method and average silhouette width demand multiple runs of K-means [30]. Even worse, there may be no such thing as an optimal k [31]. (2) We observed sufficient performance improvements with a fixed k, with which our method achieves state-of-the-art performance when compared with existing works. Finally, we compute the confidence score for each target sample by normalizing the distance, which can be used as the criterion for sample selection.
Algorithm 1: mNCP: modified nearest class prototype.
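The procedure described above can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' MATLAB implementation, and it rests on several assumptions: the k-means routine uses a deterministic farthest-point initialization (the paper only states that K-means is used), and the confidence is computed as one minus the normalized minimum distance (the paper only states that the confidence is obtained by normalizing the distance). The names `kmeans`, `mncp_fit`, and `mncp_predict` are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    k = min(k, len(points))  # never more clusters than samples
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def mncp_fit(train_x, train_y, k):
    """Cluster each class separately; return (sub-center, label) pairs."""
    prototypes = []
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        prototypes += [(c, label) for c in kmeans(members, k)]
    return prototypes


def mncp_predict(prototypes, query):
    """Return (label, confidence) from the closest class sub-center."""
    dists = [math.dist(center, query) for center, _ in prototypes]
    best = min(range(len(dists)), key=dists.__getitem__)
    total = sum(dists)
    # Assumed normalization: 1 - d_min / sum of all distances.
    confidence = 1.0 - dists[best] / total if total > 0 else 1.0
    return prototypes[best][1], confidence


# Hypothetical toy data: class 0 has two modes lying on either side of class 1.
train_x = [[0, 0], [0, 1], [10, 0], [10, 1], [5, -0.5], [5, 0.5]]
train_y = [0, 0, 0, 0, 1, 1]
query = [9.5, 0]

for k in (1, 2):
    label, conf = mncp_predict(mncp_fit(train_x, train_y, k), query)
    print(f"k={k}: label={label}, confidence={conf:.3f}")
```

With k = 1 the sketch reduces to the standard NCP, whose single class-0 prototype falls between the two modes; with k = 2 each mode gets its own sub-center, which is the extra within-class diversity the modified NCP is meant to allow.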
3.3. Easy-To-Hard Testing Scheme
Existing works make predictions independently in the target domain, which only investigates the cross-domain correlation, i.e., for a certain target sample x, the prediction is obtained based on the relation between x and the source samples, but the relation between x and other target samples is ignored. Recall that the manifold and clustering assumptions still hold within a domain, so the idea of preserving local structures in the target domain arises naturally. In this paper, we propose an easy-to-hard testing scheme; specifically, it has a hierarchical prediction strategy. Samples in the target domain are considered to differ in prediction difficulty; we select the so-called easy samples (those predicted with higher confidence) in the early stages, and these samples along with their more confident labels are taken into consideration for predicting hard samples. The underlying philosophy is that we can obtain precise labels for easy samples based on the cross-domain relation, while for hard samples, we need to combine the cross-domain and within-domain relations. It is worth emphasizing that determining which sample is easy (or hard) is itself difficult to some extent. Here, we first construct a basic classifier with labeled source samples, and then select the samples predicted with high confidence as easy samples.
The detailed calculation can be found in Algorithm 2. Here, we separate the concepts of the target/source domain and the training/testing set. For initialization, the source data are set to be the training set, and the target data are used for testing. Then, in each iteration, we select the N samples with the highest confidence. It is worth noting that there are two ways to determine N, i.e., a hard threshold or a fixed number. We chose the fixed number, because selecting a proper threshold is laborious. In the worst case, if we employ a relatively large threshold, no sample may pass it and the model would stop. However, if we utilize a fixed number (>0), the model can always converge. To be more specific, we set a hyper-parameter T as the number of total iterations, then N can be determined by dividing the number of target samples by T. Finally, the selected samples along with their labels are integrated into the training set. It is worth noting that once a sample is selected as an easy sample, its label is accepted, which means that we do not need to predict it again. Therefore, the size of the testing set decreases, and that of the training set increases (corresponding to Algorithm 2 Lines 6–8). After several iterations, all target samples are considered to be well labeled.
Algorithm 2: Easy-to-hard testing.
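The loop described above can be sketched as follows. This is a minimal, self-contained illustration and not the authors' implementation: for brevity the base classifier is a plain mean-vector NCP rather than the modified NCP, the per-iteration count is assumed to be the number of target samples divided by the number of iterations and rounded up, and the confidence formula is an assumed normalization. The function names and toy data are hypothetical.

```python
import math


def class_means(xs, ys):
    """One mean-vector prototype per class."""
    means = {}
    for label in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = [sum(col) / len(members) for col in zip(*members)]
    return means


def predict_with_confidence(means, x):
    """Closest prototype and an assumed normalized-distance confidence."""
    dists = {c: math.dist(m, x) for c, m in means.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 - dists[label] / total if total > 0 else 1.0
    return label, confidence


def easy_to_hard(source_x, source_y, target_x, n_iters):
    """Label target samples in stages, most confident samples first."""
    train_x, train_y = list(source_x), list(source_y)
    remaining = list(range(len(target_x)))
    labels = [None] * len(target_x)
    per_iter = math.ceil(len(target_x) / n_iters)  # assumed rounding
    while remaining:
        means = class_means(train_x, train_y)
        scored = [(i, *predict_with_confidence(means, target_x[i]))
                  for i in remaining]
        scored.sort(key=lambda item: item[2], reverse=True)
        for i, label, _ in scored[:per_iter]:
            labels[i] = label            # accept the label once and for all
            train_x.append(target_x[i])  # easy sample joins the training set
            train_y.append(label)
            remaining.remove(i)          # and leaves the testing set
    return labels


# Hypothetical toy data: the target domain is the source shifted to the right.
source_x = [[0, 0], [1, 0], [4, 0], [5, 0]]
source_y = [0, 0, 1, 1]
target_x = [[2, 0], [3, 0], [6, 0], [7, 0]]

print("one shot    :", easy_to_hard(source_x, source_y, target_x, n_iters=1))
print("easy-to-hard:", easy_to_hard(source_x, source_y, target_x, n_iters=4))
```

In this toy case, the sample at (3, 0) is assigned to class 1 when all target samples are predicted at once, but to class 0 when easier target samples are absorbed first and shift the prototypes. The example is constructed to show the mechanism and should not be read as evidence about real datasets.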
4. Experiments
In this section, we give the detailed description of the experiments. Firstly, we introduce two widely-used datasets for DA problems, i.e., Office-Caltech10 and ImageCLEF (Cross Language Evaluation Forum (CLEF)), followed by competing methods and the parameter setting. Then, the numerical results are presented with some analysis. Besides, we conduct experiments for parameter sensitivity.
4.1. Data Preparation
ImageCLEF (https://www.imageclef.org/2014/adaptation) is a visual competition held by the Cross Language Evaluation Forum (CLEF). It consists of three domains, Caltech (C) [32], ImageNet (I) [33], and Pascal (P) [34]. There are twelve classes of objects for each domain, namely airplane, bike, bird, boat, bottle, bus, car, dog, horse, monitor, motorbike, and people. Besides, the number of images per category is 50.
Office-Caltech10 [35] includes four domains, Amazon (A), Caltech (C) [32], webcam (W), and DSLR (D). There are ten classes of objects and 8–151 images per class. The exact number of images for each category can be found in Table 2.
Given any two domains, e.g., Amazon (A) and webcam (W), we can construct two DA tasks, A→W and W→A. Consequently, we can construct 3 × 2 = 6 tasks for ImageCLEF and 4 × 3 = 12 tasks for Office-Caltech10. A graphical description of the cross-domain tasks is shown in Figure 4.
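The task construction above amounts to enumerating ordered pairs of distinct domains. A small sketch (the helper name `da_tasks` is hypothetical, and `->` stands in for the arrow used in the text):

```python
from itertools import permutations


def da_tasks(domains):
    """All ordered (source, target) pairs of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]


imageclef_tasks = da_tasks(["C", "I", "P"])
office_tasks = da_tasks(["A", "C", "W", "D"])

print(len(imageclef_tasks), imageclef_tasks)
print(len(office_tasks), office_tasks)
```

For n domains this yields n × (n − 1) tasks, since each domain can serve as the source for every other domain.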
4.2. Experimental Setting
We compare the proposed method with two baseline models and five state-of-the-art DA methods, which are listed below.
Nearest neighbor (NN): NN is selected as a baseline for examining the effectiveness of the proposed method.
Nearest class prototype (NCP): Similar to NN, NCP is also a baseline method, since the proposed method is closely related to both.
Confidence-aware pseudo label selection (CAPLS): CAPLS (proposed in IJCNN 2019 [22]) selects reliable labels by confidence and learns transferable representations across domains.
Modified A-distance sparse filtering (MASF): MASF (proposed in Pattern Recognit. 2020 [36]) presents an l2 constraint as the metric of domain discrepancy.
Generalized soft-max (GSMAX): GSMAX (proposed in Inf. Sci. 2020 [37]) aims at learning smooth representations and decision boundaries simultaneously.
Selective pseudo labeling (SPL): SPL (proposed in AAAI 2020 [23]) is also a selective pseudo labeling strategy based on structured prediction.
Discriminative sparse filtering (DSF): DSF (proposed in Sensors 2020 [38]) combines discriminative feature learning and distribution matching based on sparse filtering.
For CAPLS, we set the number of iterations and the feature dimension . For MASF, we set the balance factor and the feature dimension . For GSMAX, we set the balance factor , and the number of nodes on ImageCLEF and on Office-Caltech10. For SPL, we set the number of iterations and the feature dimension . For our method, we set the number of source clusters and the number of iterations . Besides, for all methods, we adopted the Resnet50 features for ImageCLEF and Decaf6 features for Office-Caltech10. The deep models were pre-trained on ImageNet without fine-tuning, and no pre-processing strategy was applied.
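Following the setting of [23,38], we report the classification accuracy on the target data as the evaluation metric:
$$\mathrm{Accuracy} = \frac{|\{x : x \in X_t \wedge \hat{y}(x) = y(x)\}|}{|\{x : x \in X_t\}|},$$
where $X_t$ is the set of target samples, $\hat{y}(x)$ denotes the predicted label, and $y(x)$ is the true label of target sample $x$.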
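As a quick illustration of the metric, the following sketch computes the fraction of target samples whose predicted label equals the true label. The function name and the example labels are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the true label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one sample is required")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical labels: three of the four predictions are correct.
print(accuracy([0, 0, 1, 1], [0, 0, 1, 0]))
```

Target labels are used here only for evaluation, in line with the unsupervised setting described in Section 2.1.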
4.3. Implementation Details
For the reproducibility of the paper, here we report the experimental details.
(1) All datasets (original images and extracted features) and part of the code (CAPLS, SPL) can be found in public GitHub repositories, and the links are given in the Acknowledgments.
(2) We applied no pre-processing to the extracted features; as mentioned earlier, they are good enough for recognition.
(3) All methods were implemented with MATLAB 2017a. To eliminate the effect of random numbers, we fixed the random seed to zero.
(4) We are pleased to share our code if anyone is interested; please contact Chao Han.
4.4. Results
We report the numerical results in Table 3, where boldface denotes the highest accuracy. To sum up, the proposed method achieves state-of-the-art performance with respect to the average accuracy. Furthermore, we have the following observations:
OURS vs. NN, NCP: Compared to these two baseline models, our method is significantly better. NN and NCP have no adaptation measures; thus, they are heavily affected by distribution mismatch. According to the results, our method yields better recognition accuracies on almost every sub-task, and the improvements can be higher than 15% on some tasks, e.g., D→A and D→C. These findings confirm that the proposed modified NCP and easy-to-hard testing help make more robust predictions on cross-domain tasks.
OURS vs. MASF: OURS is superior to MASF. MASF proposes the modified A-distance for marginal distribution matching; however, it gives limited consideration to the relation between the learned representations and the decision boundary. In contrast, our method adjusts the decision boundary adaptively and consequently achieves superior performance.
OURS vs. CAPLS, SPL: These two methods assign pseudo labels to target samples and select highly confident ones; then, an iterative feature aligning strategy is applied to learn the transferable representations. Pseudo labels for target samples allow them to match the conditional probability distribution across domains, so that the learned representations are more discriminative than those of MASF. However, they still fail to explicitly model the relation between features and classifiers. Besides, they are easily influenced by the quality of pseudo labels. From the results, we can see that MASF < CAPLS, SPL < OURS (with respect to average accuracy).
OURS vs. GSMAX: Our method works better than GSMAX, which can be considered the closest method to ours. GSMAX learns a dynamic decision boundary by considering both labeled source samples and unlabeled target samples, the underlying idea being that all samples (including source and target samples) should be far away from the decision boundary. Compared to our method, it does not take the difficulties of target samples into account and integrates them all into training; naturally, wrongly labeled samples have negative effects on the final recognition.
OURS vs. DSF: DSF performs slightly worse than the proposed method. DSF explores feature separability and distribution matching simultaneously, but it only adopts a linear regression-like constraint for computational efficiency. Such a constraint cannot handle complex feature distributions, especially for high-dimensional features. Our method aims to find the optimal classifier, rather than a feature transformation, thus obtaining higher accuracies. Besides, we also report the running time of these methods; our method also runs faster than DSF.
It is somewhat surprising that the baseline methods (NN, NCP) achieve comparable or even superior performance when compared to state-of-the-art DA works on several subtasks, e.g., I→P. Does this mean that the study of domain adaptation is meaningless? We think the answer is absolutely no. Firstly, strong evidence of the improvements brought by DA works can be found in the average accuracy, which reveals the greater robustness of DA works from a statistical point of view. Secondly, the no free lunch theorem [39,40] indicates that no algorithm is the best choice for all problems. Naturally, it is reasonable that baseline methods perform better than state-of-the-art ones in a few cases.
Recall that instance-based methods are heavily affected by increasing domain gaps; however, the results show that the proposed method can alleviate the negative effects to some extent. Here, we use the accuracy obtained by non-adaptation methods to measure the domain gap. These methods are expected to have better results in the scenario where testing and training data have the same distribution. Since the proposed method is a variant of the nearest class prototype (NCP), we use the NCP as the non-adaptation method. If the NCP achieves high accuracy on a task, this means that there is little domain gap for this task, and vice versa. We can say that task C→W has a larger domain gap than task C→D since the NCP achieves higher accuracy on the task C→D (82.17% vs. 76.95%). However, the proposed method has larger improvements on C→W (76.95%→89.83%, 13%↑) than on C→D (82.17%→87.26%, 5%↑).
Another thing to notice is that when we have very few training samples, i.e., D→C, D→W, and D→A, constructing too many clusters would degrade the proposed method to the nearest neighbor. For example, domain DSLR has only eight images of mug, but we still build five clusters for this class, so each cluster holds only 1–4 images. Therefore, the performance of mNCP is expected to be close to that of NN. However, according to Table 3, our method still gains more than 15% improvement on average accuracy for D→C and D→A, and these results further support the idea of easy-to-hard testing. Although we may not have enough samples in the earlier stages, as the hierarchical testing proceeds, more samples are integrated into the training set, and we are then capable of exploring local discriminative structures.
4.5. Parameter Sensitivity Analysis
To better understand the proposed method, we investigate how each hyper-parameter affects the performance by setting it to a series of different values while fixing the other one. Our method has two hyper-parameters, the number of source clusters k and the number of iterations T.
Sensitivity analysis of k: k indicates how many clusters we need for training. When k = 1, this means that we think all the training samples are equally important, and the method degrades to NCP. As k becomes larger, more diversity is allowed within the same class, and our method is able to explore the different importance of training samples. However, if it is too large, especially in the extreme case where k equals the number of training samples in a class, the method degrades to NN, which is sensitive to outliers and easily affected by distribution mismatch. As shown in Figure 5a, the mean accuracy first rises and then falls, which is consistent with our analysis.
Sensitivity analysis of T: T is the total number of iterations for the target data. When T = 1, our method does not adopt the easy-to-hard testing scheme. We can see that the mean accuracy increases by 4% immediately when T is set to 2; this finding verifies that the discriminative structure within the target domain does help cross-domain recognition. As T becomes larger, the mean accuracy rises and tends to be stable at a certain point. An interesting phenomenon is that when T gets too large, the performance drops slightly. This result may be explained by the fact that the so-called high-confidence testing samples introduced into the training set could still be wrongly labeled. The proportion of samples that are predicted based on only well labeled samples is 1/T, and the proportion of samples that would be affected by noisy labels is 1 − 1/T. As T gets bigger, noisy labels would affect more samples. A more comprehensive discussion about how the performance changes during testing is given in the following section.
6. Conclusions and Future Works
In this paper, we propose a sample-guided adaptive class prototype method for visual domain adaptation. Unlike previous methods focusing on distribution matching, the proposed method aims to adjust the decision boundary according to the domain discrepancy existing in different samples. Extending the NCP, we present the modified NCP to explore a balance between single-sample discrimination and class-wise discrimination. For better exploiting the discriminative structures existing in the target domain, an easy-to-hard testing scheme is proposed. Target samples are considered to differ in how difficult they are to recognize; the scheme selects the easy samples and uses them to make predictions for hard samples. Experimental results show that the proposed method is both effective and efficient. However, despite these promising results, questions remain. Previous works proved that class weight bias gives rise to performance degradation of statistics-based methods, so our method (which does not rely on the statistics of the sample distribution) is expected to gain a larger improvement on Office-Caltech10 than on ImageCLEF. However, the experimental results did not show a significant difference between them. We think this can be explained by the fact that varying class sizes demand an adaptive cluster number more than equal class sizes do, while we adopted a fixed number of clusters for all the experiments.
Existing domain adaptation works study how to transfer knowledge between data from different sources; however, there is abundant room for further progress in transferring between different tasks. For example, if a person is good at poker, he/she may grasp chess quickly. We call that learning from experience, but currently, our models can only learn from data. Consequently, we believe investigating learning from experience is the key step to achieving artificial general intelligence.