1. Introduction
With the introduction of large-scale datasets with sufficient label information, Convolutional Neural Networks (CNNs) have achieved great success in machine learning and computer vision tasks [1,2,3]. CNNs can learn highly non-linear feature representations, thereby improving the discrimination ability of a model. However, when there are obvious differences between the training set and the test set, such as multi-modal data, a CNN model trained on the training set may degrade on the test set. At the same time, since the test set contains no label information, we cannot directly fine-tune the trained model to fit the test data. Therefore, learning a model that generalizes across domains while maintaining sufficient discrimination ability is an important and challenging task.
To solve this problem, a more specific task, Unsupervised Domain Adaptation (UDA), is widely studied. The UDA task aims to transfer the features or models learned on a labeled training set (source domain) to an unlabeled test set (target domain) [4,5]. In the UDA task, the source and target domains share the same learning task but have different data distributions. A diagram of UDA is shown in Figure 1a, which illustrates the obvious distribution discrepancy between the source and target domains.
Previous UDA methods have mainly been based on shallow models, which achieve domain knowledge transfer through sample reweighting strategies [5] or by learning a shared feature space [6,7]. However, limited by the representation capacity of shallow models, the performance of these methods does not match that of deep UDA approaches. In recent years, with the development of deep learning, many deep UDA models [8] have been proposed. Mainstream deep UDA methods learn a domain-invariant feature space by aligning the distributions of deep features [9,10,11,12,13]. According to the alignment strategy, these methods can be roughly divided into two categories, i.e., marginal-distribution-based methods [9,10] and conditional-distribution-based methods [12,13]. Marginal-distribution-based methods use two-sample tests [14] or adversarial training [15] to reduce the difference between the marginal feature distributions of the source and target domains. Such methods often suffer from negative transfer because they ignore the dependencies between features and labels [13]. Conditional-distribution-based methods instead reduce the difference between the conditional feature distributions of the two domains; common strategies include conditional generative adversarial networks [16] and the conditional maximum mean discrepancy [13]. However, these methods neglect discrimination learning of the features, which makes it difficult for them to fully exploit the discriminative information in the data and thus limits the performance of the model. In particular, excessive alignment learning often leads to a loss of feature discriminability. As shown in Figure 1b, excessive alignment learning causes target-domain samples belonging to different categories to incorrectly cluster around a single source-domain category.
Recently, there has been a trend to improve the discrimination ability of UDA models by adding extra discrimination learning modules [17,18,19,20]. The most common strategy is to introduce a metric learning loss, e.g., a center-based or triplet-based loss, to learn a compact feature space shared by the source and target domains [17,18]. Most existing methods use separate losses to control discrimination learning and alignment learning, and therefore need appropriate weight hyperparameters to balance the two. However, the optimal weights keep changing during training, which causes the model to over-emphasize either discrimination learning or alignment learning. Over-emphasizing alignment learning may lead to a loss of feature discriminability, while over-emphasizing discrimination learning may reduce feature transferability. As shown in Figure 1c, excessive discrimination learning leads to feature misalignment between the source and target domains. Some recent works [20,21] introduce a dynamic weight to balance discrimination learning and alignment learning in real time. However, updating dynamic weights in real time introduces additional computation and increases training time. Moreover, these methods only consider the discriminative information within the source and target domains, while ignoring the discriminative information between them. Since the similarity between source-domain and target-domain samples is not effectively exploited, the performance of the model may be limited.
In this paper, we propose a novel deep UDA method, Gaussian Process-based Transfer Kernel Learning (GPTKL), to simultaneously improve the discriminability and transferability of the model. First, GPTKL uses the kernel similarity between all samples in the source and target domains as prior information to establish a cross-domain Gaussian process. The likelihood function of this cross-domain Gaussian process reflects the domain discrepancy between the source and target domains in a Reproducing Kernel Hilbert Space. GPTKL maximizes the likelihood function of the cross-domain Gaussian process to learn a transfer kernel function, which effectively improves the transferability of the model and better measures the similarity between samples from multi-domain data. In particular, GPTKL introduces the deep kernel learning strategy to convert the learning of the transfer kernel function into the learning of a deep feature space suitable for measuring sample similarity. Through deep transfer kernel learning in the cross-domain Gaussian process, GPTKL learns a deep feature space with inter-domain alignment, intra-class compactness and inter-class separation. As shown in Figure 1d, GPTKL learns a deep feature space with both discriminability and transferability. Second, to further extract discriminative information, GPTKL uses a cross-entropy loss and mutual information to learn a classification model shared by the source and target domains. The contributions of this work are summarized as follows:
We propose a new deep UDA method, GPTKL, which introduces a cross-domain Gaussian process and shared classification model to achieve domain knowledge transfer and improve the discrimination ability of the model.
We introduce the deep kernel learning strategy into the cross-domain Gaussian process to learn a deep feature space with both discriminability and transferability.
We conduct experiments to verify the effectiveness of GPTKL and the transfer kernel function.
The rest of this article is organized as follows.
Section 2 briefly reviews the related work on deep UDA and the Gaussian process.
Section 3 introduces the proposed method in detail.
Section 4 reports the experiment settings and experiment results, and gives an analysis.
Section 5 summarizes the conclusions and proposes future research directions.
2. Related Work
In this section, we briefly review the related work from two directions, i.e., deep UDA and the Gaussian process.
2.1. Deep UDA Methods
Deep networks have proven able to learn generalized feature representations, which has promoted the development of deep UDA methods [22,23]. Most deep UDA methods are based on aligning the marginal or conditional distributions of the source and target domains. A common strategy is to use specific statistics to measure domain discrepancy. Deep Adaptation Network (DAN) [9] uses the maximum mean discrepancy (MMD) [14] to measure the difference between the source and target domains, while Deep CORAL [24] exploits second-order statistics (covariance matrices) to align the two domains. Deep Subdomain Adaptation Network (DSAN) [25] effectively aligns the conditional distributions of the source and target domains by reducing the MMD between corresponding subdomains. Deep Conditional Adaptation Network (DCAN) [13] introduces an effective metric, the conditional maximum mean discrepancy (CMMD), which measures the distance between conditional distributions in a Reproducing Kernel Hilbert Space; by reducing the CMMD between the conditional distributions of the source and target domains, DCAN learns a conditionally domain-invariant deep feature space. Other methods introduce an additional discriminator to distinguish source-domain features from target-domain features [10,26], aiming to obtain feature representations that fool the discriminator. Specifically, Domain Adversarial Neural Network (DANN) [10] uses the discriminator network to align the marginal distributions, while Conditional Domain Adversarial Network (CDAN) [12] uses the category information in the classifier to align the conditional distributions. Maximum Classifier Discrepancy (MCD) [27] uses the disagreement between the predictions of multiple classifiers to measure domain discrepancy.
Recent UDA methods suggest that the discriminability of features plays an important role in the transfer of category information [19,20]. Joint domain alignment and Discriminative feature learning Domain Adaptation (JDDA) [17] introduces instance-based and center-based discriminative losses on the source domain to learn a feature space with better discriminability. Batch Spectral Penalization (BSP) [19] discusses the relationship between transferability and discriminability, and boosts the discriminability of deep features by penalizing the largest singular values of the feature matrices. Stepwise Adaptive Feature Norm (SAFN) [28] improves the discrimination ability of the model by enlarging the feature norms. Enhanced Transport Distance (ETD) [29] uses optimal transport theory to reduce the domain discrepancy between the source and target domains, and introduces an attention mechanism to enhance the discrimination ability of the model. Discriminative Manifold Propagation (DMP) [30] introduces manifold learning losses on the source and target domains to learn discriminative low-dimensional manifold embeddings. These methods all consider both the transferability and the discriminability of features. However, they require a careful balance between the domain alignment loss and the discrimination learning loss, because they ignore the interaction between alignment learning and discrimination learning. To address this problem, Dynamic Weighted Learning (DWL) [20] introduces a dynamic weight to adjust the importance of the domain alignment loss and the discrimination learning loss in real time. Similarly, Dynamically Aligning both the Feature and Label (DAFL) [21] introduces a dynamic balancing weight in adversarial training to balance the generative and discriminative abilities of the model. However, updating dynamic weights in real time introduces additional computation and increases training time. Different from the above methods, GPTKL introduces the deep kernel learning strategy into a cross-domain Gaussian process framework to learn deep features with both discriminability and transferability, without introducing any weight parameter to balance alignment learning and discrimination learning.
2.2. Gaussian Process
The Gaussian process is an important Bayesian non-parametric model, which directly defines a prior distribution on the space of functions over samples $X = \{x_i\}_{i=1}^{N}$. Consider a general prediction model $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ denotes random noise with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Let $F = [f(x_1), \ldots, f(x_N)]^{\top}$. The Gaussian process generally assumes that the prior distribution of $F$ satisfies $F \sim \mathcal{N}(0, K)$, where $K$ is the kernel matrix of the samples with $K_{ij} = k(x_i, x_j)$. Based on the total probability theorem, the marginal distribution of $Y = [y_1, \ldots, y_N]^{\top}$ satisfies $Y \sim \mathcal{N}(0, K + \sigma^2 I_N)$, where $I_N$ is the $N$-dimensional identity matrix. The Gaussian process directly models the sample distribution through the kernel matrix, which can effectively describe the structural information between samples.
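To make this concrete, the following sketch (our own illustration, not code from the paper) evaluates the marginal log-likelihood $\log \mathcal{N}(Y \mid 0, K + \sigma^2 I_N)$ for a toy regression dataset with an RBF kernel; the bandwidth `gamma` and noise level `sigma` are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Kernel matrix K with K_ij = exp(-||x_i - x_j||^2 / (2 * gamma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * gamma**2))

def gp_log_marginal_likelihood(X, Y, gamma=1.0, sigma=0.1):
    """log N(Y | 0, K + sigma^2 I) for a zero-mean Gaussian process prior."""
    N = X.shape[0]
    C = rbf_kernel(X, gamma) + sigma**2 * np.eye(N)        # K + sigma^2 I_N
    L = np.linalg.cholesky(C)                              # stable inverse / log-det
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))    # C^{-1} Y
    return (-0.5 * Y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 * log|C|
            - 0.5 * N * np.log(2 * np.pi))

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
print(gp_log_marginal_likelihood(X, Y))
```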
The Gaussian process can learn models with good generalization, and is thus widely used in multi-task learning and transfer learning. For instance, Kai Yu et al. propose a hierarchical Bayesian framework based on a non-parametric Gaussian process to handle multi-label text classification problems [31]. Bin Cao et al. propose a Gaussian process-based adaptive transfer learning algorithm for transfer learning problems [32]. Pengfei Wei et al. design a learnable kernel in a transfer Gaussian process model to measure the similarity between domains, thereby effectively solving the multi-source transfer regression problem [33]. Transfer Gaussian Process Regression (TrGP) [34] gives a formal definition of the transfer kernel and two advanced general forms of it; these transfer kernels effectively capture domain relatedness and are used for transfer regression problems. There are three key differences between TrGP and our GPTKL: (1) TrGP addresses supervised cross-domain regression problems, while GPTKL addresses unsupervised cross-domain classification problems. (2) TrGP learns a transfer kernel to capture domain relatedness, while GPTKL learns a general kernel function to measure the similarity between all samples in the source and target domains. (3) TrGP learns the kernel function in the original data space, whereas GPTKL introduces the deep kernel learning strategy to convert the learning of the transfer kernel function into learning a deep feature space, and thus benefits from the powerful representation capacity of deep networks. Aiadi Oussama et al. propose a multi-view Gaussian-based Bayesian learning scheme that can efficiently address text-based image retrieval problems [35]. Benefiting from assigning a weight to each view, it can better handle intra-class variation. This idea can also be applied to UDA tasks: multi-view Gaussian learning can represent the data distribution accurately, thereby promoting the distribution alignment of the source and target domains. Gaussian Process Domain Adaptation (GPDA) [36] proposes a systematic UDA method that achieves hypothesis consistency by learning a Gaussian process on the source domain. Both GPTKL and GPDA use the Gaussian process to deal with UDA problems, but there are two key differences between them: (1) GPDA lacks an explicit domain alignment strategy, while GPTKL learns a domain-invariant deep feature space by introducing the deep kernel learning strategy into the cross-domain Gaussian process. (2) GPDA uses variational inference to solve for the parameters of the Gaussian process; this requires iteratively updating the variational distribution to approximate the posterior, so multiple network evaluations are needed at each iteration. GPTKL instead introduces the kernel learning strategy to directly learn the parameters of the Gaussian process, which avoids multiple network evaluations and reduces the training time of the model.
3. Our Method
In this section, we introduce the proposed method in detail. First, we introduce the motivation and notations in
Section 3.1. Second, we introduce how to use the cross-domain Gaussian process and deep kernel learning technique to learn a deep feature space with both discriminability and transferability in
Section 3.2. Then, classification model learning on the source and target domains is introduced in
Section 3.3. Finally, the overall loss function and training procedure of GPTKL are introduced in
Section 3.4.
3.1. Motivation and Notations
The goal of UDA tasks is to reduce the classification error on the target domain. Reference [37] proposes that the classification error on the target domain is controlled by the classification error on the source domain, the distance between the source and target domains and the error of an ideal joint hypothesis. The upper bound of the classification error on the target domain $\epsilon_t(h)$ is defined as follows:

$\epsilon_t(h) \leq \epsilon_s(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda, \quad (1)$

where $h$ denotes the classification model and $\epsilon_s(h)$ represents the classification error on the source domain. The $\mathcal{H}\Delta\mathcal{H}$ distance between the source and target domains is defined as follows:

$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2 \sup_{h, h' \in \mathcal{H}} \Big| \Pr_{x \sim \mathcal{D}_s}[h(x) \neq h'(x)] - \Pr_{x \sim \mathcal{D}_t}[h(x) \neq h'(x)] \Big|, \quad (2)$

and the error of the ideal joint hypothesis is

$\lambda = \epsilon_s(h^{*}) + \epsilon_t(h^{*}), \quad (3)$

where $h^{*} = \arg\min_{h \in \mathcal{H}} \epsilon_s(h) + \epsilon_t(h)$ is the ideal joint hypothesis.
Based on this theory, an effective UDA method should reduce all three terms in (1) at the same time. Since the label information of the source domain is known, the classification error on the source domain can be kept very small. For the second term, existing UDA methods propose a variety of effective feature alignment strategies to reduce the domain discrepancy. The third term in (1) is the classification error of the ideal joint hypothesis, which is closely related to the discriminability of the features. Existing UDA methods usually ignore this term because they assume that $\lambda$ is small. However, recent research indicates that there is an interaction between alignment learning and discrimination learning in UDA tasks [19,20]: an excessive pursuit of domain alignment weakens the discriminability of the features, which increases $\lambda$ and degrades the final classification performance. Therefore, $\lambda$ cannot simply be ignored, and a UDA method that performs alignment learning and discrimination learning simultaneously is needed.

In this work, GPTKL first establishes a cross-domain Gaussian process to learn a transfer kernel function that can measure the similarity between samples from multi-domain data, and then introduces the deep kernel learning strategy to convert the learning of the transfer kernel function into learning deep features with both discriminability and transferability. This process reduces the second and third terms in Equation (1) simultaneously; thus, it effectively handles the interaction between alignment learning and discrimination learning.
Figure 2 shows the network architecture of GPTKL. The shared deep CNN and classifier are used to obtain deep features and prediction categories, respectively. Transfer kernel learning (TKL) based on Gaussian processes is used to learn deep features with both discriminability and transferability. The cross-entropy loss and mutual information are used to learn the discriminative classification network in the source and target domains, respectively.
The formal description of the UDA task is as follows. The UDA task contains two similar but different datasets: the source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ contains $n_s$ labeled samples, and the target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ contains $n_t$ samples without label information. The source and target domains share the same category space but have different data distributions. Let $Z = G(X)$ denote the deep features extracted by the deep network $G$, $k(\cdot, \cdot)$ represent the kernel function used in GPTKL and $c$ represent the number of categories.
3.2. Gaussian Process-Based Transfer Kernel Learning
Let $X = X_s \cup X_t = \{x_i\}_{i=1}^{N}$ denote the dataset consisting of all samples in the source and target domains, and $N = n_s + n_t$. To realize the domain fusion of the source and target domains, GPTKL learns a cross-domain Gaussian process on the joint dataset $X$, i.e., $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ denotes random noise, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. The prior distribution of $F = [f(x_1), \ldots, f(x_N)]^{\top}$ satisfies $F \sim \mathcal{N}(0, K)$, where $K$ denotes the kernel matrix between the samples of $X_s$ and $X_t$, with $K_{ij} = k(x_i, x_j)$. Noting that this cross-domain Gaussian process is established on the joint dataset $X$ of the source and target domains, the dimension of $K$ is thus $N \times N$. Based on the total probability theorem, the probability distribution of the $j$-th class label vector $Y_j$ can be calculated as follows:

$p(Y_j) = \mathcal{N}\big(Y_j \mid 0, K + \sigma^2 I_N\big), \quad (4)$

where $Y_j \in \mathbb{R}^{N}$ denotes the $j$-th column of the one-hot label matrix $Y \in \mathbb{R}^{N \times c}$ (for the target domain, pseudo-labels are used, as described in Section 3.3). Assuming that different categories are independent of each other, the marginal distribution of $Y$ can be calculated as $p(Y) = \prod_{j=1}^{c} p(Y_j)$. To learn a transfer kernel function adapted to both the source and target domains, GPTKL maximizes the log-likelihood function of $Y$, which can be calculated as

$\log p(Y) = -\frac{1}{2} \mathrm{tr}\big((K + \sigma^2 I_N)^{-1} Y Y^{\top}\big) - \frac{c}{2} \log\big|K + \sigma^2 I_N\big| - \frac{Nc}{2} \log 2\pi, \quad (5)$

where the last item is a constant, which will be omitted for simplicity. In Equation (5), the kernel function $k(\cdot, \cdot)$ is the parameter to be trained. However, due to limited computing power, we cannot traverse all kernel functions to find the maximum of Equation (5). An effective solution is the deep kernel learning strategy [38], which transforms the selection of the kernel function into learning the deep network mapping that constructs the kernel, and thereby learns a deep feature space suitable for measuring sample similarity.
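To illustrate the deep kernel learning step described in the next paragraph, here is a minimal sketch (our own; the feature dimension, mini-batch size and bandwidth `gamma` are placeholder assumptions) of how a cross-domain kernel matrix can be built from CNN features so that it remains differentiable with respect to the network:

```python
import torch

def deep_gaussian_kernel(z: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Kernel matrix K with K_ij = exp(-||z_i - z_j||^2 / (2 * gamma^2)),
    computed on deep features z of shape (N, d), N = n_s + n_t."""
    sq_dists = torch.cdist(z, z) ** 2
    return torch.exp(-sq_dists / (2.0 * gamma ** 2))

# toy usage: features of a joint source/target mini-batch from some backbone G
z = torch.randn(64, 256, requires_grad=True)   # stand-in for G(x), x in X_s and X_t
K = deep_gaussian_kernel(z, gamma=1.0)         # (64, 64), differentiable w.r.t. z
```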
Based on the deep kernel learning strategy, GPTKL inputs the deep features $Z = G(X)$ obtained from the deep CNN into a Gaussian kernel function to calculate the deep kernel matrix, i.e., $K_{ij} = \exp\big(-\|z_i - z_j\|^2 / (2\gamma^2)\big)$, where $z_i = G(x_i)$ and $\gamma$ is the kernel bandwidth. Furthermore, we can learn a better transfer kernel function by training the deep CNN, and the transfer kernel learning loss $\mathcal{L}_{tkl}$ is

$\mathcal{L}_{tkl} = \log\big|K + \sigma^2 I_N\big| + \mathrm{tr}\big((K + \sigma^2 I_N)^{-1} K_Y\big), \quad (6)$

where $K_Y = Y Y^{\top}$ is the linear kernel matrix of $Y$. $\mathcal{L}_{tkl}$ contains two items: the first item minimizes the log determinant of the kernel matrix, which is a penalty term for the model complexity; the second item minimizes the trace of the matrix obtained by multiplying $(K + \sigma^2 I_N)^{-1}$ and $K_Y$, which connects the sample similarity and the category similarity. We call it the similar consistency loss $\mathcal{L}_{sc}$. Next, we illustrate the role of $\mathcal{L}_{sc}$ in feature learning through an example.
Consider a 2-tuple $(x_1, x_2)$, whose kernel matrix is $K = \begin{pmatrix} 1 & k_{12} \\ k_{12} & 1 \end{pmatrix}$, where $k_{12} = k(x_1, x_2)$. Ignoring the effects of random noise, the inverse matrix of $K$ is calculated as $K^{-1} = \frac{1}{1 - k_{12}^2}\begin{pmatrix} 1 & -k_{12} \\ -k_{12} & 1 \end{pmatrix}$. When $x_1$ and $x_2$ belong to the same category or to different categories, $\mathcal{L}_{sc}$ has different effects. When $x_1$ and $x_2$ belong to the same category, the category kernel matrix $K_Y = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, and the similar consistency loss $\mathcal{L}_{sc} = \mathrm{tr}(K^{-1} K_Y) = \frac{2}{1 + k_{12}}$. In this case, minimizing $\mathcal{L}_{sc}$ will increase the kernel similarity $k_{12}$ between $x_1$ and $x_2$. Conversely, when $x_1$ and $x_2$ belong to two different categories, $K_Y = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $\mathcal{L}_{sc} = \mathrm{tr}(K^{-1} K_Y) = \frac{2}{1 - k_{12}^2}$. At this time, minimizing $\mathcal{L}_{sc}$ will reduce the kernel similarity between $x_1$ and $x_2$. GPTKL thus maximizes the similarity of samples with the same label while minimizing the similarity of samples from different classes, and therefore plays the role of discriminant learning in the feature space. Since the cross-domain Gaussian process is built on all data in the source and target domains, GPTKL effectively increases the similarity between source-domain features and target-domain features from the same category. This process effectively reduces the domain discrepancy between the source and target domains in the deep feature space.
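The effect described above can be checked numerically; the short script below (our own sanity check, not from the paper) evaluates $\mathrm{tr}(K^{-1} K_Y)$ for a few kernel similarities:

```python
import torch

# numeric check of the 2-tuple example: L_sc = tr(K^{-1} K_Y)
for k12, same_class in [(0.2, True), (0.8, True), (0.2, False), (0.8, False)]:
    K = torch.tensor([[1.0, k12], [k12, 1.0]])
    K_Y = torch.ones(2, 2) if same_class else torch.eye(2)
    l_sc = torch.trace(torch.linalg.solve(K, K_Y)).item()
    print(f"k12={k12}, same_class={same_class}: L_sc={l_sc:.3f}")

# same class:      L_sc = 2/(1+k12),   so the loss falls as k12 grows (pull together);
# different class: L_sc = 2/(1-k12^2), so the loss falls as |k12| shrinks (push apart).
```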
3.3. Classification Model Learning on the Source and Target Domains
In the previous section, through the cross-domain Gaussian process and the deep kernel learning technique, GPTKL learns a deep feature space with both discriminability and transferability. As with most deep UDA algorithms, GPTKL uses a fully connected network with a softmax activation function after the deep CNN to obtain the class prediction distribution $p = h(G(x))$. In order to obtain better prediction performance, we further train the classification model to extract discriminative information from the source and target domains simultaneously.
In the source domain, GPTKL extracts discriminative information by minimizing the cross-entropy loss $\mathcal{L}_{ce}$ between the predicted distribution $p_i^s = h(G(x_i^s))$ and the true label distribution $y_i^s$, i.e.,

$\mathcal{L}_{ce} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{j=1}^{c} y_{ij}^{s} \log p_{ij}^{s}. \quad (7)$

GPTKL uses a shared classification model to process the source and target domains. If only $\mathcal{L}_{ce}$ is considered, the learned model may overfit the source domain, thereby increasing the classification error on the target domain. In addition, when calculating $\mathcal{L}_{tkl}$, GPTKL needs the category information of the target domain, which does not exist in UDA tasks. GPTKL addresses this problem by assigning the predicted category distributions as pseudo-labels to the target domain samples. In this process, we need to extract as much discriminative information from the target domain as possible to improve the accuracy of the pseudo-labels. To this end, GPTKL follows the idea of unsupervised clustering methods [39,40] and maximizes the mutual information $I(X_t; \hat{Y}_t)$ between the target domain samples $X_t$ and their predicted categorical variables $\hat{Y}_t$. The mutual information loss $\mathcal{L}_{mi}$ is defined as the negative of $I(X_t; \hat{Y}_t)$, i.e.,

$\mathcal{L}_{mi} = -I(X_t; \hat{Y}_t) = -H(\hat{Y}_t) + H(\hat{Y}_t \mid X_t), \quad (8)$

where $H(\hat{Y}_t)$ is the information entropy of the marginal prediction distribution and $H(\hat{Y}_t \mid X_t)$ is the conditional entropy of the predictions. Note that maximizing $H(\hat{Y}_t)$ constrains the model to maintain the balance of predicted categories, while minimizing $H(\hat{Y}_t \mid X_t)$ pushes the classification boundary through the low-density regions of the feature space and reduces the uncertainty of the classification.
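A minimal sketch of Equation (8) on a mini-batch of target-domain softmax outputs (our own illustration; the marginal entropy is estimated from the batch-mean prediction):

```python
import torch

def mutual_information_loss(p_t: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """L_mi = -H(Y_hat) + H(Y_hat | X_t), Equation (8).
    p_t: (n_t, c) softmax predictions for target-domain samples."""
    p_mean = p_t.mean(dim=0)                                          # marginal class distribution
    h_marginal = -(p_mean * torch.log(p_mean + eps)).sum()            # H(Y_hat): favors balanced classes
    h_conditional = -(p_t * torch.log(p_t + eps)).sum(dim=1).mean()   # H(Y_hat|X_t): favors confident predictions
    return -h_marginal + h_conditional

# toy usage
logits = torch.randn(64, 10, requires_grad=True)
loss = mutual_information_loss(torch.softmax(logits, dim=1))
loss.backward()
```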
3.4. Loss Function and Training Procedure
The optimization objective of GPTKL includes three items, i.e., the TKL loss $\mathcal{L}_{tkl}$, the cross-entropy loss $\mathcal{L}_{ce}$ and the mutual information loss $\mathcal{L}_{mi}$. Combining Equations (6)–(8), the overall loss function of GPTKL is

$\min_{\theta} \; \mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{tkl} + \lambda_2 \mathcal{L}_{mi}, \quad (9)$

where $\theta$ represents all network parameters of the deep network, and $\lambda_1$ and $\lambda_2$ represent the weight hyper-parameters of $\mathcal{L}_{tkl}$ and $\mathcal{L}_{mi}$, respectively.
The training of GPTKL consists of two sequential phases. First, we pretrain the deep network using the labeled data from the source domain. Then, samples from the source and target domains are used together to calculate the loss function in Equation (9), and the deep network parameters are updated with the gradient descent algorithm. The overall procedure of GPTKL is summarized in Algorithm 1.
Algorithm 1 The algorithm of GPTKL
Input: the labeled source domain $\mathcal{D}_s$ and the unlabeled target domain $\mathcal{D}_t$. Output: the network parameters $\theta$.
1: Pre-train the deep network using the cross-entropy loss (7) on $\mathcal{D}_s$.
2: while not converged do
3:   Randomly sample mini-batches $\mathcal{B}_s$ and $\mathcal{B}_t$ from $\mathcal{D}_s$ and $\mathcal{D}_t$, respectively;
4:   Calculate the class prediction distribution for each sample with the current deep network, and assign pseudo-labels to the target domain samples;
5:   Calculate the TKL loss $\mathcal{L}_{tkl}$, the cross-entropy loss $\mathcal{L}_{ce}$ and the mutual information loss $\mathcal{L}_{mi}$ by Equations (6)–(8), respectively;
6:   Compute the gradients of $\mathcal{L}_{tkl}$, $\mathcal{L}_{ce}$ and $\mathcal{L}_{mi}$ w.r.t. $\theta$;
7:   Update the network parameters $\theta$ by the gradient descent algorithm to minimize Equation (9);
8: end while
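For reference, a condensed PyTorch-style sketch of one adaptation step of Algorithm 1 (our own illustration rather than the authors' released code; the backbone, classifier, loss weights and the simplified loss implementations, which condense the per-loss sketches above, are placeholder assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tkl_loss(z, y, gamma=1.0, sigma=0.1):
    """Equation (6): log|K + sigma^2 I| + tr((K + sigma^2 I)^{-1} Y Y^T)."""
    K = torch.exp(-torch.cdist(z, z) ** 2 / (2 * gamma ** 2))
    C = K + sigma ** 2 * torch.eye(z.size(0), device=z.device)
    return torch.logdet(C) + torch.trace(torch.linalg.solve(C, y @ y.t()))

def mi_loss(p, eps=1e-8):
    """Equation (8): -H(Y_hat) + H(Y_hat | X_t) on target predictions p (n_t, c)."""
    p_mean = p.mean(dim=0)
    return (p_mean * torch.log(p_mean + eps)).sum() - (p * torch.log(p + eps)).sum(dim=1).mean()

G = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())   # stand-in backbone
h = nn.Linear(256, 10)                                                    # shared classifier
opt = torch.optim.SGD(list(G.parameters()) + list(h.parameters()), lr=1e-3, momentum=0.9)
lam_tkl, lam_mi = 0.1, 0.1                                                # illustrative weights

def gptkl_step(xs, ys, xt):
    """One iteration of the while-loop in Algorithm 1 (pretraining on D_s assumed done)."""
    zs, zt = G(xs), G(xt)
    pt = F.softmax(h(zt), dim=1)                               # target predictions
    y = torch.cat([F.one_hot(ys, 10).float(), pt.detach()])    # labels + pseudo-labels
    loss = (F.cross_entropy(h(zs), ys)                         # Equation (7)
            + lam_tkl * tkl_loss(torch.cat([zs, zt]), y)       # Equation (6)
            + lam_mi * mi_loss(pt))                            # Equation (8); total = Equation (9)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# one step on random stand-in data
xs, ys, xt = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,)), torch.randn(32, 3, 32, 32)
gptkl_step(xs, ys, xt)
```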
We further discuss the computational complexity of GPTKL, which can be analyzed from the three loss functions in Equation (9). Among them, the cross-entropy loss $\mathcal{L}_{ce}$ and the mutual information loss $\mathcal{L}_{mi}$ have linear computational complexity. For the TKL loss, the computation is dominated by the matrix inversion in Equation (6); therefore, the computational complexity of the TKL loss is close to $O(n_b^3)$, where $n_b$ denotes the batch size of the mini-batch data. Since $n_b$ is usually very small in the training of deep networks (as in our experiments), the actual cost of GPTKL is lower than that of most existing deep UDA methods.
5. Conclusions
This paper proposes a novel UDA method, GPTKL, which achieves domain knowledge transfer by learning a cross-domain Gaussian process and a shared classification model. GPTKL establishes a cross-domain Gaussian process to learn a transfer kernel function, which can better measure the similarity between samples from multi-domain data. To learn the transfer kernel efficiently, GPTKL introduces the deep kernel learning strategy, which converts the learning of the kernel function into learning a deep feature space suitable for measuring sample similarity. Based on the cross-domain Gaussian process and the deep kernel learning strategy, GPTKL learns a deep feature space with both discriminability and transferability. In addition, GPTKL uses a cross-entropy loss and mutual information to learn a shared classification model on the source and target domains, respectively. Experimental results show that GPTKL achieves state-of-the-art classification performance on UDA tasks and learns a deep feature space with inter-domain alignment, intra-class compactness and inter-class separation. In the future, we will explore how to extend GPTKL to more machine learning tasks, such as partial UDA, multi-modal learning and few-shot learning.