Article

Transfer Learning for Logistic Regression with Differential Privacy

College of Science, China University of Petroleum, Qingdao 266580, China
* Author to whom correspondence should be addressed.
Axioms 2024, 13(8), 517; https://doi.org/10.3390/axioms13080517
Submission received: 30 May 2024 / Revised: 25 July 2024 / Accepted: 27 July 2024 / Published: 30 July 2024
(This article belongs to the Special Issue Probability, Statistics and Estimation)

Abstract

Transfer learning, a machine learning approach that enhances model generalization across domains, has extensive applications in many fields. However, the risk of privacy leakage remains a crucial consideration during the transfer learning process. Differential privacy, with its rigorous mathematical foundation, has been proven to offer consistent and robust privacy protection. This study delves into the logistic regression transfer learning problem supported by differential privacy. In cases where transferable sources are known, we propose a two-step transfer learning algorithm. For scenarios with unknown transferable sources, a cross-validation-based transferable source detection method is introduced to mitigate adverse effects from non-informative sources. The effectiveness of the proposed algorithm is validated through simulations and experiments with real-world data.
MSC:
62F12; 62G08; 62G20; 62J07

1. Introduction

Machine learning, as a critical tool for data utilization, has become the engine driving transformation across various industries. Through data modeling, it empowers computers with the ability to learn autonomously and to predict, finding wide applications in fields like financial risk assessment [1], autonomous driving [2], AI-driven healthcare [3], smart manufacturing [4], and more. In today’s digital era, safeguarding personal privacy has emerged as a crucial challenge in the information technology domain. Despite the rapid advancements in big data analysis and machine learning technologies, it is still a significant challenge to safeguard sensitive user information. It is vital to prioritize the protection of personal information while extracting valuable insights and knowledge from massive datasets.
Traditional methods of data privacy protection, such as anonymization [5] and de-identification, have become increasingly vulnerable and prone to exploitation by advanced data analysis techniques. In this context, differential privacy [6] has garnered extensive attention as a rigorous privacy protection framework. Differential privacy offers a mathematically rigorous approach that ensures that no specific individual’s information can be inferred during data analysis, even with knowledge of all other people’s data. This robust privacy-preserving property has facilitated the widespread application of differential privacy in sensitive environments.
In the realm of regression analysis, numerous studies have been conducted under the differential privacy framework. Chaudhuri et al. [7] proposed several differentially private regression analyses, but these are limited by the requirement that the objective function be convex and twice differentiable. Lei [8] introduced differentially private M-estimators based on a maximum likelihood estimation framework, generating noisy multidimensional histograms with the Laplace mechanism and computing regression results from synthetic data; however, this method is only applicable to low-dimensional data. To handle high-dimensional data, Zhang et al. [9] proposed a noise-addition method based on the functional mechanism, which enforces $\epsilon$-differential privacy by perturbing the objective function of the optimization problem rather than its results. Kifer et al. [10] focused on high-dimensional sparse regression problems, presenting a differentially private convex empirical risk minimization (ERM) method. Smith [11] combined the Laplace mechanism and the exponential mechanism into a general differential privacy framework for statistical inference and studied the asymptotic properties of differentially private algorithms, although the framework is limited to bounded output spaces. Barrientos et al. [12] designed differentially private algorithms for assessing the significance and signs of regression coefficients. Cai et al. [13] studied the trade-off between statistical accuracy and privacy in mean estimation and linear regression, in both classical low-dimensional and modern high-dimensional settings, proposing a novel private iterative hard-thresholding algorithm for high-dimensional linear regression. In the field of logistic regression, there is also existing literature on differential privacy. Chaudhuri and Monteleoni [14] discussed the trade-off between privacy and learnability when designing algorithms for learning from private databases. They proposed a privacy-preserving logistic regression algorithm that addresses the sensitivity problem of regularized logistic regression by adding noise to the learned classifier in proportion to its sensitivity, together with a second method that is independent of the loss function's sensitivity and applies to a class of convex loss functions. Khanna et al. [15] introduced a differentially private method for sparse logistic regression that maintains hard zero coefficients, first training a non-private LASSO logistic regression model to determine the appropriate number of non-zero coefficients for the final model. Xu et al. [16] combined the functional mechanism with decision-boundary fairness to develop a differentially private and fair logistic regression model that ensures both privacy preservation and fairness while maintaining good utility. Fan et al. [17] proposed a privacy-preserving logistic regression algorithm (PPLRA) that uses homomorphic encryption to prevent data privacy leakage, shifting the majority of computational tasks to the cloud to enhance efficiency while safeguarding data privacy. Ji et al. [18] modified the update steps of the Newton–Raphson method and presented a differentially private distributed logistic regression model based on both public and private data.
This model improves practicality by leveraging a public dataset while simultaneously protecting the private dataset, all without compromising stringent privacy guarantees.
Simultaneously, many practical applications face a common challenge: data in the target domain (the domain of the target task) are scarce or expensive, while ample labeled data may be available in the source domain (often a related but not identical task domain). Transfer learning emerged to overcome this data scarcity. Its core idea is to transfer knowledge learned from the source domain to enhance learning performance in the target domain.
The concept of transfer learning originated in the field of machine learning. In 1993, Pratt [19] introduced a neural network-based transfer learning method, using weights obtained from a network trained on related source tasks to expedite learning on the target problem, exploring how knowledge can be transferred across tasks to improve learning effectiveness. Subsequently, in 1995, Thrun [20] first proposed the concept of "knowledge transfer" and investigated methods for sharing knowledge between tasks; this work is considered one of the pioneering contributions to transfer learning, laying the foundation for subsequent research. Building upon these pioneering papers, many researchers have conducted in-depth studies. Currently, however, transfer learning is mostly applied to classification, including image classification [21], text classification [22], and time series classification [23]. In the regression analysis domain, there is also related literature. Li et al. [24] proposed Trans-Lasso, a two-step method for estimation and prediction in transfer learning on high-dimensional linear regression models, comprising a detection algorithm that consistently identifies transferable but unknown source datasets and a contrastive regularized estimator anchored to the target task. They demonstrated the robustness of Trans-Lasso against information-free auxiliary samples and its efficiency in knowledge transfer even when the set of informative auxiliary samples is unknown; this work marked an important step for transfer learning in the regression domain. Bastani [25] proposed a two-step procedure for transfer learning in high-dimensional linear regression models using a single source dataset, without needing to detect transferable source datasets from candidate datasets. Tian and Feng [26] extended this method to generalized linear models, introducing an algorithm to construct confidence intervals for each coefficient component, with accompanying theoretical results. Yang et al. [27] proposed a method that combines source and target datasets using a two-layer linear neural network, accurately calculating the asymptotic limits of the transfer learning prediction risk for high-dimensional linear models. Zhou et al. [28] proposed doubly robust transfer learning to adapt to label scarcity and covariate shift in the target task. Additionally, Lin and Reimherr [29] studied transfer learning for functional linear regression models and established the optimal convergence rate for the excess risk. Takada and Fujisawa [30] proposed a method that transfers knowledge from the source domain to the target domain through high-dimensional $\ell_1$ regularization: in addition to the ordinary $\ell_1$ penalty, it penalizes the $\ell_1$ norm of the difference between the source and target parameters. This approach induces sparsity both in the estimate itself and in its change from the source estimate, enjoys tight estimation error bounds in stationary environments, and leaves the estimate invariant to the source estimate under small residuals; even when the source estimate is wrong due to non-stationarity, the estimates remain consistent.
While differential privacy and transfer learning have each demonstrated effectiveness in real-world applications, combining these two paradigms and applying them to logistic regression models is a challenging and compelling area of research. Combining transfer learning and differential privacy solves the fundamental problem of how to leverage knowledge in one domain to enhance learning capabilities in another domain while ensuring the privacy of sensitive information within the model.
This paper proposes a method that combines differential privacy with transfer learning and applies it to logistic regression models. Our approach not only protects individual privacy but also facilitates knowledge transfer between the source and target domains. Specifically, (1) we protect individual privacy by introducing differential privacy noise into input data, and (2) we design a transfer learning strategy to achieve knowledge transfer by sharing feature representations between the source and target domains. Our experimental results demonstrate that our method achieves good performance in the target domain while preserving individual privacy, proving its effectiveness and feasibility in practical applications.
Figure 1 provides a general overview of the proposed algorithm’s conceptual framework in this paper.
The structure of this paper is as follows. In Section 2, we present a setup for transfer learning using logistic regression with a focus on differential privacy. We also propose a transferable source domain detection algorithm based on cross-validation. In Section 3, we conduct various numerical simulations and empirical analyses. Finally, in Section 4, we provide a summary of our findings and suggest future research directions.

2. Method

2.1. Functional Mechanism Differential Privacy Method Based on the Logistic Regression Model

We first consider the functional mechanism (FM) applied to the logistic regression model, an extension of the Laplace mechanism in differential privacy. This privacy-preserving method does not inject noise directly into the regression results; instead, it ensures privacy by perturbing the optimization objective of the regression analysis.
The Laplace mechanism is a randomization technique in differential privacy that adds noise to query results using samples from the Laplace distribution. The probability density function of the Laplace distribution is $f(x \mid b) = \frac{1}{2b} e^{-|x|/b}$, where $b$ is the scale parameter. The Laplace distribution, being centered around zero with heavy tails, is suitable for introducing random noise. The fundamental idea of the Laplace mechanism is to maintain the usability of data analysis results while introducing moderate noise to make individual data changes difficult to trace, balancing privacy protection and data utility. Its mathematical representation is $q(D) + \mathrm{Laplace}(\Delta q / \varepsilon)$, where $q(D)$ is the query result, $\Delta q$ is the sensitivity of the query, and $\varepsilon$ is the privacy parameter.
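To make the mechanism concrete, the following minimal Python sketch (our illustration, not code from the paper) releases a query result with Laplace noise calibrated to a given sensitivity and privacy budget:

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng=None):
    """Release q(D) + Laplace(Delta_q / epsilon), as described above."""
    rng = np.random.default_rng() if rng is None else rng
    return query_result + rng.laplace(scale=sensitivity / epsilon,
                                      size=np.shape(query_result))

# Example: a counting query changes by at most 1 when one tuple changes,
# so its sensitivity is 1.
private_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```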
Let $D$ be a dataset containing $n$ tuples $t_1, t_2, \ldots, t_n$ with $d+1$ attributes $X_1, X_2, \ldots, X_d, Y$. For each tuple $t_i = (x_{i1}, x_{i2}, \ldots, x_{id}, y_i)$, we assume that $\sum_{j=1}^{d} x_{ij}^2 \le 1$. Our objective is to construct a regression model from $D$ that allows us to predict the value of any tuple on $Y$ based on the values of $X_1, X_2, \ldots, X_d$. In other words, our goal is to obtain a function $\rho$ that takes $(x_{i1}, x_{i2}, \ldots, x_{id})$ as input and outputs a prediction for $y_i$ as accurately as possible.
For logistic regression, assuming that the attribute $Y$ in $D$ takes values in the Boolean domain $\{0, 1\}$, logistic regression on $D$ returns a predictive function with probability

$$\rho(x_i) = \frac{\exp\left(x_i^T \omega^*\right)}{1 + \exp\left(x_i^T \omega^*\right)}$$

of predicting $y_i = 1$, where $\omega^*$ is a vector of $d$ real numbers. This can be achieved by minimizing the cost function

$$f(t_i, \omega) = \log\left(1 + \exp\left(x_i^T \omega\right)\right) - y_i x_i^T \omega.$$

Namely,

$$\omega^* = \arg\min_{\omega} \sum_{i=1}^{n} \left[ \log\left(1 + \exp\left(x_i^T \omega\right)\right) - y_i x_i^T \omega \right].$$
To ensure privacy, we require that the regression analysis should be performed using an algorithm that satisfies ε -differential privacy.
A randomized algorithm $M$ satisfies $\varepsilon$-differential privacy if and only if, for any output $O$ of $M$ and any two neighboring databases $D_1$ and $D_2$, we have

$$\Pr\left[M(D_1) = O\right] \le e^{\varepsilon} \Pr\left[M(D_2) = O\right].$$
If M satisfies ε -differential privacy, then the probability distribution of its output remains almost the same for any two input databases that differ by only one tuple.
The FM method we use does not inject noise directly into $\omega^*$. Instead, it perturbs the objective function $f_D(\omega) = \sum_{t_i \in D} f(t_i, \omega)$ and then releases the model parameters $\bar{\omega}$ that minimize the perturbed objective function $\bar{f}_D(\omega)$. Here, we utilize a polynomial representation of $f_D(\omega)$. As $\omega$ is a vector containing values $\omega_1, \omega_2, \ldots, \omega_d$, let $\phi(\omega)$ denote a monomial in $\omega_1, \omega_2, \ldots, \omega_d$, defined as $\phi(\omega) = \omega_1^{c_1} \omega_2^{c_2} \cdots \omega_d^{c_d}$ for some $c_1, \ldots, c_d \in \mathbb{N}$. Let $\Phi_j$ ($j \in \mathbb{N}$) denote the set of monomials of degree $j$, i.e., $\Phi_j = \left\{ \omega_1^{c_1} \omega_2^{c_2} \cdots \omega_d^{c_d} \mid \sum_{l=1}^{d} c_l = j \right\}$. By the Stone–Weierstrass theorem, any continuous differentiable function $f(t_i, \omega)$ can always be written as a (potentially infinite) polynomial in $\omega_1, \omega_2, \ldots, \omega_d$, i.e., for some $J \in [0, +\infty]$ we have $f(t_i, \omega) = \sum_{j=0}^{J} \sum_{\phi \in \Phi_j} \lambda_{\phi}^{t_i} \phi(\omega)$, where $\lambda_{\phi}^{t_i} \in \mathbb{R}$ is the coefficient of $\phi(\omega)$ in the polynomial.
Let $D$ and $D'$ be any two neighboring databases, and let $f_D(\omega)$ and $f_{D'}(\omega)$ denote the objective functions of regression analysis on $D$ and $D'$, respectively, with polynomial representations

$$f_D(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i \in D} \lambda_{\phi}^{t_i} \phi(\omega),$$

$$f_{D'}(\omega) = \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \sum_{t_i' \in D'} \lambda_{\phi}^{t_i'} \phi(\omega).$$
Thus,

$$\sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \sum_{t_i \in D} \lambda_{\phi}^{t_i} - \sum_{t_i' \in D'} \lambda_{\phi}^{t_i'} \right\|_1 \le 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \lambda_{\phi}^{t} \right\|_1.$$
Let

$$\Delta = 2 \max_{t} \sum_{j=1}^{J} \sum_{\phi \in \Phi_j} \left\| \lambda_{\phi}^{t} \right\|_1.$$
For each $\phi \in \Phi_j$, let

$$\bar{\lambda}_{\phi} = \sum_{t_i \in D} \lambda_{\phi}^{t_i} + \mathrm{Lap}\!\left(\frac{\Delta}{\varepsilon}\right).$$
This yields the perturbed objective function, and the minimum solution can be computed from it.
Differentially private methods based on the functional mechanism require the polynomial form of the objective function to contain only terms of bounded degree, a condition the logistic regression objective does not satisfy. Hence, Zhang et al. [9] proposed an approach based on Taylor expansion to derive an approximate polynomial form of the objective function, demonstrating its effectiveness in achieving differential privacy. In this paper, we adopt this validated approach. The approximate objective function is as follows:
$$\hat{f}_D(\omega) = \sum_{i=1}^{n} \left[ 0.5\, x_i^T \omega + \frac{1}{8} \left( x_i^T \omega \right)^2 - y_i x_i^T \omega + \log 2 \right].$$
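As an illustration of how the functional mechanism operates on this approximate objective, the sketch below (our own, assuming the sensitivity $\Delta$ is supplied, e.g., from the bound above) perturbs the aggregated degree-1 and degree-2 coefficients and then minimizes the resulting quadratic. Clipping the perturbed matrix to the positive semidefinite cone is a pragmatic fix we add so the minimizer exists; it is one of several possible remedies.

```python
import numpy as np

def fm_logistic_regression(X, y, delta_sens, epsilon, rng=None):
    """Functional-mechanism sketch for the Taylor-approximated logistic loss.

    X: (n, d) design matrix with ||x_i||_2 <= 1; y: labels in {0, 1};
    delta_sens: the sensitivity Delta of the polynomial coefficients.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    scale = delta_sens / epsilon

    # Aggregated polynomial coefficients of
    # f_D(w) = sum_i [(0.5 - y_i) x_i^T w + (x_i^T w)^2 / 8 + log 2]
    lam1 = X.T @ (0.5 - y) + rng.laplace(scale=scale, size=d)
    lam2 = X.T @ X / 8.0 + rng.laplace(scale=scale, size=(d, d))
    lam2 = (lam2 + lam2.T) / 2.0          # symmetrize the quadratic form

    # The noise can destroy positive semidefiniteness; clip the spectrum so
    # the perturbed objective remains convex (a pragmatic fix, see lead-in).
    vals, vecs = np.linalg.eigh(lam2)
    lam2 = vecs @ np.diag(np.clip(vals, 1e-8, None)) @ vecs.T

    # Closed-form minimizer of lam1^T w + w^T lam2 w.
    return -0.5 * np.linalg.solve(lam2, lam1)
```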

2.2. Regression Transfer Learning Based on Differential Privacy

In this paper, we address the issue of multi-source transfer learning. Consider a target dataset $(X^{(0)}, y^{(0)})$ and $K$ source datasets $(X^{(k)}, y^{(k)})$, $k = 1, \ldots, K$, where $X^{(k)} \in \mathbb{R}^{n_k \times p}$ and $y^{(k)} \in \mathbb{R}^{n_k \times 1}$. Here, $x_i^{(k)}$ and $y_i^{(k)}$ denote the $i$-th row of $X^{(k)}$ and the $i$-th element of $y^{(k)}$, respectively. The objective is to transfer useful information from the source data to enhance the model's performance on the target data. We assume that the relationship between independent and dependent variables in both the target and source data follows the logistic model
$$P\left(y_i^{(k)} = 1 \mid x_i^{(k)}\right) = \frac{1}{1 + e^{-x_i^{(k)} \omega^{(k)}}}, \quad i = 1, \ldots, n_k.$$
For $k = 0, \ldots, K$, the coefficients $\omega^{(k)}$ are distinct. We call $\beta = \omega^{(0)}$ the target parameter and assume that it is sparse in the $\ell_0$ norm: among the $p$ variables, only $s < p$ contribute to predicting the response. The closer the coefficients $\omega^{(k)}$ of the $k$-th source are to $\beta$, the more helpful that source is for predicting the target. We measure the difference between the $k$-th source domain and the target domain as

$$\delta^{(k)} = \beta - \omega^{(k)}.$$

Consequently, we can define the informative set $\mathcal{A} = \left\{ 1 \le k \le K : \left\| \beta - \omega^{(k)} \right\|_1 \le h \right\}$. In terms of the $\ell_1$ norm, if $\|\delta^{(k)}\|_1 \le h$ we call the $k$-th source $h$-transferable; if $\|\delta^{(k)}\|_1 > h$, we call it $h$-non-transferable. Evidently, a smaller $h$ implies greater benefits from these source domains in transfer learning.
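For illustration, computing the informative set from known coefficients is immediate; the sketch below (hypothetical inputs, our own helper) collects the indices of $h$-transferable sources:

```python
import numpy as np

def informative_set(beta, omegas, h):
    """Indices k with ||beta - omega^(k)||_1 <= h, i.e., h-transferable sources."""
    return [k for k, w in enumerate(omegas, start=1)
            if np.linalg.norm(beta - w, ord=1) <= h]
```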
In differential privacy-based logistic regression transfer learning, the procedure used when $\mathcal{A}$ is already known is referred to as an Oracle algorithm. Our proposed algorithm follows the principles of Bastani [25], Li et al. [24], and Zhang et al. [9], and we call it the differentially private transfer learning algorithm. The main idea is to transfer information from the transferable sources to obtain a rough estimator in the first step and then, in the second step, use the target data to correct its bias. To ensure privacy protection, Laplace noise is added to the data in both steps.
In the first step of transfer learning, we estimate

$$\hat{\omega}^{\mathcal{A}} = \arg\min_{\beta} \frac{1}{n_{\mathcal{A}} + n_0} \sum_{k \in \{0\} \cup \mathcal{A}} \sum_{i=1}^{n_k} \left[ 0.5\, x_i^{(k)} \beta + \frac{1}{8} \left( x_i^{(k)} \beta \right)^2 - y_i^{(k)} x_i^{(k)} \beta + \log 2 \right] + \lambda_{\omega} \|\beta\|_1.$$
We simplify $0.5\, x_i^{(k)} \beta + \frac{1}{8} (x_i^{(k)} \beta)^2 - y_i^{(k)} x_i^{(k)} \beta$, adding noise to the coefficients multiplying $\beta$:

$$0.5\, x_i^{(k)} \beta + \frac{1}{8} \left( x_i^{(k)} \beta \right)^2 - y_i^{(k)} x_i^{(k)} \beta = x_i^{(k)} \left( 0.5 - y_i^{(k)} \right) \beta + \frac{1}{8} \left( x_i^{(k)} \right)^2 \beta^2,$$
adding Laplace noise to $x_i^{(k)} (0.5 - y_i^{(k)})$ and $(x_i^{(k)})^2$ to obtain their privacy-protected counterparts. As the value of $\hat{\omega}^{\mathcal{A}}$ depends solely on $\beta$, the expression becomes

$$\hat{\omega}^{\mathcal{A}} = \arg\min_{\beta} \frac{1}{n_{\mathcal{A}} + n_0} \sum_{k \in \{0\} \cup \mathcal{A}} \sum_{i=1}^{n_k} \left[ x_i^{(k)} \left( 0.5 - y_i^{(k)} \right) \beta + \frac{1}{8} \left( x_i^{(k)} \right)^2 \beta^2 \right] + \lambda_{\omega} \|\beta\|_1.$$
We thus perform differential privacy and feature selection simultaneously; the value of $\lambda_{\omega}$ is obtained through cross-validation.
Likewise, in the second step of transfer learning, the parameter is corrected:

$$\hat{\beta} = \hat{\omega}^{\mathcal{A}} + \hat{\delta}^{\mathcal{A}},$$

$$\hat{\delta}^{\mathcal{A}} = \arg\min_{\delta} \frac{1}{n_0} \sum_{i=1}^{n_0} \left[ 0.5\, x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) + \frac{1}{8} \left( x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) \right)^2 - y_i^{(0)} x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) + \log 2 \right] + \lambda_{\delta} \|\delta\|_1.$$
We simplify $0.5\, x_i^{(0)} (\hat{\omega}^{\mathcal{A}} + \delta) + \frac{1}{8} (x_i^{(0)} (\hat{\omega}^{\mathcal{A}} + \delta))^2 - y_i^{(0)} x_i^{(0)} (\hat{\omega}^{\mathcal{A}} + \delta)$, adding noise to the coefficients multiplying $\delta$. Dropping the terms that do not depend on $\delta$ (they do not affect the minimizer),

$$0.5\, x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) + \frac{1}{8} \left( x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) \right)^2 - y_i^{(0)} x_i^{(0)} \left( \hat{\omega}^{\mathcal{A}} + \delta \right) = \left[ x_i^{(0)} \left( 0.5 - y_i^{(0)} \right) + 0.25 \left( x_i^{(0)} \hat{\omega}^{\mathcal{A}} \right) x_i^{(0)} \right] \delta + \frac{1}{8} \left( x_i^{(0)} \right)^2 \delta^2 + \mathrm{const}.$$
We add Laplace noise to $x_i^{(0)} (0.5 - y_i^{(0)}) + 0.25 (x_i^{(0)} \hat{\omega}^{\mathcal{A}}) x_i^{(0)}$ and $(x_i^{(0)})^2$ to obtain their privacy-protected counterparts. As the value of $\hat{\delta}^{\mathcal{A}}$ depends solely on $\delta$, the expression becomes

$$\hat{\delta}^{\mathcal{A}} = \arg\min_{\delta} \frac{1}{n_0} \sum_{i=1}^{n_0} \left[ \left( x_i^{(0)} \left( 0.5 - y_i^{(0)} \right) + 0.25 \left( x_i^{(0)} \hat{\omega}^{\mathcal{A}} \right) x_i^{(0)} \right) \delta + \frac{1}{8} \left( x_i^{(0)} \right)^2 \delta^2 \right] + \lambda_{\delta} \|\delta\|_1.$$
We simultaneously conduct differential privacy and feature selection, where λ δ is obtained through cross-validation.
The aforementioned approach is denoted as the Oracle trans DPLR algorithm.

2.3. Transferable Source Detection

In our previous discussion, we assumed that the transferable set $\mathcal{A}$ was known. In practical applications, however, this assumption is hard to satisfy. Naively pooling all sources with the target data might not improve model performance; instead, it can cause negative transfer and degrade learning performance on the target task. To avoid negative transfer, this paper adopts a simple, data-driven method proposed by Tian and Feng [26] to detect transferable sources for information transfer. Initially, the target data are split into three folds $\{x^{(0)[r]}, y^{(0)[r]}\}_{r=1}^{3}$. The average loss $\hat{L}_0^{(0)}$ of the target-only fit is computed through cross-validation. Next, Algorithm 1 is applied to the data of each source domain combined with the target training folds, and estimated coefficients are obtained; the average loss $\hat{L}_0^{(k)}$ is then computed on the held-out fold. Finally, the difference between these two losses is compared against a predefined threshold: if the difference falls below the threshold, the corresponding source domain is added to $\mathcal{A}$.
Algorithm 1: Oracle Trans DPLR
Input: target data $(X^{(0)}, y^{(0)})$; all $h$-transferable source data $\{(X^{(k)}, y^{(k)})\}_{k \in \mathcal{A}}$; penalty parameters $\lambda_{\omega}$ and $\lambda_{\delta}$; Laplace noise $\mathrm{Lap}(\Delta/\varepsilon)$
Output: the estimated coefficient vector $\hat{\beta}$
Transferring Step: Compute
$$\hat{\omega}^{\mathcal{A}} = \arg\min_{\beta} \frac{1}{n_{\mathcal{A}} + n_0} \sum_{k \in \{0\} \cup \mathcal{A}} \sum_{i=1}^{n_k} \left[ \left( x_i^{(k)} \left( 0.5 - y_i^{(k)} \right) + \mathrm{Lap}\!\left(\tfrac{\Delta}{\varepsilon}\right) \right) \beta + \left( \tfrac{1}{8} \left( x_i^{(k)} \right)^2 + \mathrm{Lap}\!\left(\tfrac{\Delta}{\varepsilon}\right) \right) \beta^2 \right] + \lambda_{\omega} \|\beta\|_1$$
Debiasing Step: Compute
$$\hat{\delta}^{\mathcal{A}} = \arg\min_{\delta} \frac{1}{n_0} \sum_{i=1}^{n_0} \left[ \left( x_i^{(0)} \left( 0.5 - y_i^{(0)} \right) + 0.25 \left( x_i^{(0)} \hat{\omega}^{\mathcal{A}} \right) x_i^{(0)} + \mathrm{Lap}\!\left(\tfrac{\Delta}{\varepsilon}\right) \right) \delta + \left( \tfrac{1}{8} \left( x_i^{(0)} \right)^2 + \mathrm{Lap}\!\left(\tfrac{\Delta}{\varepsilon}\right) \right) \delta^2 \right] + \lambda_{\delta} \|\delta\|_1$$
Step 3: Let $\hat{\beta} \leftarrow \hat{\omega}^{\mathcal{A}} + \hat{\delta}^{\mathcal{A}}$
Step 4: Output $\hat{\beta}$
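The following Python sketch mirrors the two steps of Algorithm 1 using cvxpy, the library named in Section 3. It is our own minimal reconstruction, not the authors' released code: noise is added once to the aggregated coefficients rather than per summand, and the perturbed quadratic form is projected onto the PSD cone so the problem stays convex; both are pragmatic simplifications.

```python
import numpy as np
import cvxpy as cp

def _dp_quadratic_lasso(lin, quad, n, lam, scale, rng):
    """Minimize (lin^T b + b^T quad b)/n + lam * ||b||_1 after Laplace perturbation."""
    p = lin.shape[0]
    lin = lin + rng.laplace(scale=scale, size=p)
    noise = rng.laplace(scale=scale, size=(p, p))
    quad = quad + (noise + noise.T) / 2.0
    # Project the perturbed matrix onto the PSD cone and factor it, so that
    # b^T quad b can be written as a convex sum of squares.
    vals, vecs = np.linalg.eigh(quad)
    root = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None)))  # quad = root @ root.T
    b = cp.Variable(p)
    obj = (lin @ b + cp.sum_squares(root.T @ b)) / n + lam * cp.norm1(b)
    cp.Problem(cp.Minimize(obj)).solve()
    return b.value

def oracle_trans_dplr(target, sources, lam_w, lam_d, scale, seed=0):
    """Two-step sketch of Algorithm 1: pooled transferring step, target debiasing."""
    rng = np.random.default_rng(seed)
    X0, y0 = target
    pooled = [target] + list(sources)
    n_pool = sum(X.shape[0] for X, _ in pooled)

    # Transferring step: rough estimator from target + h-transferable sources.
    lin_w = sum(X.T @ (0.5 - y) for X, y in pooled)
    quad_w = sum(X.T @ X for X, _ in pooled) / 8.0
    w_A = _dp_quadratic_lasso(lin_w, quad_w, n_pool, lam_w, scale, rng)

    # Debiasing step on the target only; the linear term absorbs w_A.
    lin_d = X0.T @ (0.5 - y0) + 0.25 * X0.T @ (X0 @ w_A)
    quad_d = X0.T @ X0 / 8.0
    delta = _dp_quadratic_lasso(lin_d, quad_d, X0.shape[0], lam_d, scale, rng)
    return w_A + delta
```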
To facilitate notation, assuming $n_0$ is divisible by 3, the average loss over fold $r$ of the target data $(x^{(0)[r]}, y^{(0)[r]})$, for any estimated parameter $\omega$, is defined as

$$\hat{L}_0^{[r]}(\omega) = \frac{1}{n_0/3} \sum_{i=1}^{n_0/3} \left[ 0.5\, x_i^{(0)[r]} \omega + \frac{1}{8} \left( x_i^{(0)[r]} \omega \right)^2 - y_i^{(0)[r]} x_i^{(0)[r]} \omega + \log 2 \right].$$
Algorithm 2 illustrates the specific details.
Algorithm 2: Transferable Source Detection
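As a complement to the pseudocode, the sketch below follows the description above; the `fit` interface and the `threshold` rule are our assumptions (for instance, `fit` could be the `oracle_trans_dplr` sketch with fixed penalty parameters).

```python
import numpy as np

def surrogate_loss(X, y, w):
    """Average value of the quadratic surrogate loss used throughout Section 2."""
    z = X @ w
    return np.mean(0.5 * z + z ** 2 / 8.0 - y * z + np.log(2.0))

def detect_transferable_sources(target, sources, fit, threshold, seed=0):
    """Cross-validated transferable source detection sketch.

    `fit(target_fold, source_list)` returns estimated coefficients.
    """
    rng = np.random.default_rng(seed)
    X0, y0 = target
    folds = np.array_split(rng.permutation(len(y0)), 3)
    loss0 = []
    loss_k = {k: [] for k in range(len(sources))}
    for r in range(3):
        test = folds[r]
        train = np.concatenate([folds[j] for j in range(3) if j != r])
        tgt_train = (X0[train], y0[train])
        # Baseline: target-only fit evaluated on the held-out fold.
        w0 = fit(tgt_train, [])
        loss0.append(surrogate_loss(X0[test], y0[test], w0))
        # Candidate fits: target training folds plus one source at a time.
        for k, src in enumerate(sources):
            wk = fit(tgt_train, [src])
            loss_k[k].append(surrogate_loss(X0[test], y0[test], wk))
    base = np.mean(loss0)
    # A source is declared transferable if it does not raise the target
    # loss by more than the threshold.
    return [k for k in range(len(sources))
            if np.mean(loss_k[k]) - base <= threshold]
```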

3. Simulation Study

In this section, we validate the effectiveness of the proposed algorithm through multiple simulation experiments and real-data verification. Some parameter and formula configurations in the data-generation phase follow Tian and Feng [26]. In the simulation part, we first compare, under three error settings, the fits of naive DPLR (using only target data) and the Oracle trans DPLR of Algorithm 1 for different values of $h$ (the maximum deviation between the source and target coefficients in $\mathcal{A}$) and $|\mathcal{A}|$ (the cardinality of $\mathcal{A}$). Subsequently, introducing $h$-non-transferable sources, we compare naive DPLR, the Oracle trans DPLR of Algorithm 1, all_trans DPLR (transfer learning on all source and target data), and trans DPLR (with transferable source detection). In the empirical study, we explore naive DPLR, trans DPLR, and all_trans DPLR. All experiments were implemented in Python 3.12, with the optimization problems solved using the cvxpy library.

3.1. Known Transferable Source Domain

This section considers the following sparse logistic regression model, where the target and source datasets are independently generated from Equation (19):

$$Y = X\beta + \epsilon. \quad (19)$$

Here, $X \sim N(0, \Sigma)$ with $\Sigma = \left( 0.5^{|i-j|} \right)_{i,j=1}^{p}$, and we transform the target variable $Y$ into a binary attribute by mapping values above a predefined threshold to 1 and values below or equal to the threshold to 0. Therefore, when employing the logistic regression model for the classification of $(X, Y)$, we predict $Y = 1$ if $\frac{\exp(x_i^T \omega)}{1 + \exp(x_i^T \omega)} > 0.5$ and $Y = 0$ otherwise. The parameters are set as follows: $n_0 = 300$, $n_k = 400$, $K \in \{3, 6, 9, 12, 15, 18\}$, $p = 50$, $s = 10$. The coefficient vector for the target domain is $\beta_0 = (\mathbf{1}_s, \mathbf{0}_{p-s})^T$. For simplicity, let $d_k$ be a $p$-dimensional vector whose components are independently 1 or $-1$ with probability 0.5 each. For the transferable source set $\mathcal{A}_h$, we consider $h \in \{2, 4, 6, 8, 10, 12, 14, 16, 18, 20\}$, and the source coefficients are $\beta_k = \beta_0 + \frac{h}{p} d_k$, ensuring $\|\beta_k - \beta_0\|_1 \le h$. Three error distributions are considered (a data-generation sketch follows the list):
  • Standard normal distribution: $\epsilon \sim N(0, 1)$
  • $t$ distribution: $\epsilon \sim t(3)$
  • Mixture of normal distributions: $\epsilon \sim 0.9\, N(0, 1) + 0.1\, N(0, 4^2)$
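A sketch of this data-generation process is given below; since the binarization threshold is not specified in the text, we use the median score as an assumption.

```python
import numpy as np

def simulate(n0=300, nk=400, K=3, p=50, s=10, h=2, noise="normal", seed=0):
    """Generate one target and K h-transferable sources per Section 3.1."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    beta0 = np.concatenate([np.ones(s), np.zeros(p - s)])

    def eps(n):
        if noise == "t":
            return rng.standard_t(3, size=n)
        if noise == "mix":
            comp = rng.random(n) < 0.9
            return np.where(comp, rng.normal(0, 1, n), rng.normal(0, 4, n))
        return rng.normal(0, 1, n)

    def draw(n, beta):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        score = X @ beta + eps(n)
        y = (score > np.median(score)).astype(float)  # assumed threshold: median
        return X, y

    target = draw(n0, beta0)
    sources = []
    for _ in range(K):
        d = rng.choice([-1.0, 1.0], size=p)
        sources.append(draw(nk, beta0 + (h / p) * d))  # ||beta_k - beta_0||_1 <= h
    return target, sources
```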
To balance classification accuracy and privacy protection, we compared the classification accuracy of the logistic regression models before and after transfer learning under two noise scales, $\Delta/\varepsilon = 0.1$ and $\Delta/\varepsilon = 0.01$, in the scenario $h = 3$. The results are shown in Table 1.
As can be seen from Table 1, as the noise scale $\Delta/\varepsilon$ decreases, the accuracy of the models before and after transfer learning gradually improves. When $\Delta/\varepsilon = 0.1$, the accuracy before and after transfer learning decreases compared with $\Delta/\varepsilon = 0.01$, but this decrease remains within an acceptable range, and the model's performance can still meet the needs of practical applications. Considering the need for privacy protection, we set $\Delta/\varepsilon = 0.1$ in subsequent experiments.
For each scenario, the experiment was repeated 100 times, and the average $\ell_2$ error of the estimated parameters $\hat{\beta}$ was used as the model evaluation metric. Each $\lambda$ was chosen by cross-validation. The average relative estimation error of the Oracle algorithm under the three error settings is illustrated in Figure 2.
Here we define the relative error of the Oracle algorithm as

$$\text{relative error} = \log\left( \frac{\text{error of Oracle trans DPLR}}{\text{error of naive DPLR}} \right).$$
Therefore, negative values indicate that the Oracle trans DPLR algorithm outperformed the naive DPLR algorithm, with smaller values (represented by lighter colors in Figure 2) indicating better performance.
Based on the detailed analysis of Figure 2, the Oracle trans DPLR algorithm demonstrated significant performance advantages over the naive DPLR algorithm across a wide range of combinations of h and dataset sizes | A | . This finding profoundly reveals the efficiency of the Oracle trans DPLR algorithm in handling complex data scenarios. In particular, as the number of data sources increased, the performance of the Oracle trans DPLR algorithm showed a marked improvement. Simultaneously, as the parameter h increased, the complexity of the problem also rose, leading to an increase in the estimation error of the Oracle trans DPLR algorithm. However, it is noteworthy that the algorithm maintained stable prediction performance even when faced with different types of error distributions, which fully validates the robustness and generalization capability of its design.

3.2. Unknown Transferable Source Domain

First, we set the total number of source domains to $K = 20$. We then construct transferable and non-transferable source domains in scenarios where $|\mathcal{A}_h| \in \{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20\}$. For transferable source domains, we keep the logistic regression parameters as before. For non-transferable source domains, we randomly select a subset $I_k$ of size $s$ from the set $\{2s+1, \ldots, p\}$ and set the $j$-th component of the logistic regression parameter $\beta_k$ as follows (a sketch follows the display):

$$\beta_{kj} = \begin{cases} 1 + \frac{h}{p} d_{kj}, & j \in \{s+1, \ldots, 2s\} \cup I_k, \\ \frac{h}{p} d_{kj}, & \text{otherwise.} \end{cases}$$
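A sketch of this construction (0-based indexing; `rng` as in the generator above):

```python
import numpy as np

def nontransferable_beta(p, s, h, rng):
    """Coefficients of an h-non-transferable source per the display above."""
    d = rng.choice([-1.0, 1.0], size=p)
    beta = (h / p) * d
    # Signal sits on {s+1,...,2s} union I_k (0-based: s..2s-1 and I_k).
    I_k = rng.choice(np.arange(2 * s, p), size=s, replace=False)
    shifted = np.zeros(p, dtype=bool)
    shifted[s:2 * s] = True
    shifted[I_k] = True
    beta[shifted] += 1.0
    return beta
```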
All the other parameters remain the same as in the previous section. This paper compares the estimated values of the following algorithms:
  • Naive DPLR: Conducting differentially private logistic regression using only target domain data.
  • Trans DPLR: Performing differentially private transfer learning using Algorithm 1 and Algorithm 2.
  • All_trans DPLR: Conducting differentially private transfer learning using all available source domains.
  • Oracle trans DPLR: Conducting differentially private transfer learning based on Algorithm 1, given known transferable source domains.
  • Naive LR: Conducting logistic regression using only target domain data.
  • Trans LR: Performing transfer learning using the transfer learning algorithm and the transferable source detection algorithm, without differential privacy.
For the six algorithms, we repeated the experiment 100 times and calculated the average error. Simultaneously, we treated the transferable source domain detection as a binary classification problem, and we computed model evaluation metrics for various scenarios.
Based on Figure 3, it is evident that when the differential privacy mechanism was not introduced into the transfer learning framework, the prediction errors of the two algorithms were significantly lower than those of their differential privacy-enhanced versions. This stark difference highlights the accuracy loss caused by data perturbation (i.e., adding noise) as a means of privacy protection. However, this trade-off is a necessary sacrifice to ensure privacy, aligning closely with the focus of our study. Within the differential privacy framework, the Oracle trans DPLR algorithm consistently demonstrated optimal performance: it not only fully leveraged all transferable source data but also excluded potential noise sources, i.e., non-transferable source data, thereby ensuring efficient and accurate knowledge transfer. The trans DPLR algorithm also performed excellently, almost perfectly replicating the performance of Oracle trans DPLR. This outcome reflects the high effectiveness and accuracy of our proposed transferable source detection algorithm. Further analysis revealed that when the transferable set $|\mathcal{A}|$ was small, all_trans DPLR struggled, even performing worse than naive DPLR. This phenomenon demonstrates the existence and significant negative impact of negative transfer, further validating the importance and effectiveness of our transferable source detection strategy. As $|\mathcal{A}|$ increased, the performance of all_trans DPLR gradually improved, eventually reaching a level comparable to Oracle trans DPLR and trans DPLR at $|\mathcal{A}| = K = 20$. Additionally, as the problem complexity $h$ increased, the errors of all four differentially private algorithms showed a significant upward trend.

3.3. A Real-Data Study

Our study focuses on the second-hand car market, aiming to accurately predict used car prices through an in-depth analysis of a real dataset containing over 50,000 records and encompassing 31 initial variables. The data preprocessing steps included handling missing values, addressing outliers, performing feature engineering, and conducting feature selection, which expanded the original dataset’s feature variables to 70, significantly enhancing the data foundation for the model.
To optimize the efficiency of applying differential privacy techniques, we normalized the predictor variables and transformed the response variable into a binary form, mapping values exceeding a predefined threshold to 1 and other values to 0, thereby simplifying the analysis. Descriptive statistical analysis showed a class imbalance in the "bodyType" feature, with luxury sedans being the most numerous (32,291 samples) and concrete mixers the least numerous (799 samples). Consequently, the concrete mixer samples were designated as target domain data, while samples of the other body types were treated as source domain data.
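A preprocessing sketch under assumed file and column names (the dataset's actual schema is not given in the paper):

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("used_cars.csv")
features = df.drop(columns=["price", "bodyType"])
features = (features - features.mean()) / features.std()    # normalize predictors
labels = (df["price"] > df["price"].median()).astype(int)   # assumed threshold: median

# Concrete mixers form the target domain; all other body types are sources.
is_target = df["bodyType"] == "concrete mixer"
X0, y0 = features[is_target].to_numpy(), labels[is_target].to_numpy()
```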
To evaluate the performance of the transfer learning algorithms, 5-fold cross-validation was used in the experiments. After applying Algorithm 2 to source domain detection across all source domains, it was observed that vehicles with body types 2, 3, 4, and 6 exhibited the least error. Therefore, these body types were selected as transferable source domains for further analysis in subsequent experiments. The evaluation metric used was the misclassification rate of logistic regression, providing an objective measure of model performance (Figure 4).
We used the following algorithms for the subsequent experiments:
  • Naive DPLR: Conducting differentially private logistic regression using only target domain data.
  • Trans DPLR: Performing differentially private transfer learning using Algorithms 1 and 2.
  • All_trans DPLR: Conducting differentially private transfer learning using all available source domains.
  • Naive LR: Conducting logistic regression using only target domain data.
  • Trans LR: Performing transfer learning using a transfer learning algorithm and a transferable source detection algorithm.
The experimental results indicate that, although the introduction of differential privacy had some impact on the accuracy of transfer learning, it effectively safeguarded data privacy, and the accuracy loss remained within an acceptable range. Notably, the trans DPLR algorithm significantly outperformed the other algorithms in predictive capability for specific body types, highlighting its unique advantage in integrating diverse data sources. However, the overall performance of transfer learning using all source domains was suboptimal, primarily due to the negative effects of non-transferable sources. Nonetheless, integrating data from different vehicle types with the trans DPLR algorithm significantly enhanced predictive accuracy for specific body types, validating the effectiveness of our transfer learning algorithm in cross-domain data utilization. From a practical perspective, the findings of this study have positive implications for the second-hand car market. Accurate prediction of used car prices boosts market participants' confidence in transactions, fostering healthy and orderly market development; moreover, the application of differential privacy technology provides robust protection for user data, meeting international data protection standards such as the GDPR. Additionally, this study emphasizes the importance of identifying and eliminating negative transfer, further enhancing the practical effectiveness of the model by precisely selecting transferable source domains. In summary, our experimental results demonstrate the superiority of our transfer learning algorithm. Rooted in the functional mechanism for differential privacy and utilizing transferable source domains, this algorithm not only mitigates privacy risks during the transfer process but also improves predictive capability within the target domain.

4. Conclusions

We have integrated differential privacy and transfer learning to propose an innovative logistic regression transfer learning method based on differential privacy. In comparison with traditional linear regression, our approach extends further into the realm of classification. To mitigate potential negative transfer effects, we have introduced a cross-validation-based transferable source detection method that requires no prior knowledge of the transferable set. Our extensive simulation experiments and validation with real-world data robustly confirmed the effectiveness of our algorithm. In summary, our method effectively addresses key challenges in transfer learning, yielding satisfactory results. Future research could delve more deeply into the theoretical foundations, extend this method to other models, explore alternative differential privacy methods combined with transfer learning, and further investigate practical application areas.

Author Contributions

Methodology, Y.S.; Writing—original draft, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

Our research was supported by the Fundamental Research Funds for the Central Universities (No. 23CX03012A) and the National Key Research and Development Program of China (2021YFA1000102).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gyorfi, L.; Ottucsak, G.; Walk, H. Machine Learning for Financial Engineering; World Scientific: Singapore, 2012.
  2. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316.
  3. Zhang, A.; Xing, L.; Zou, J.; Wu, J.C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng. 2022, 6, 1330–1345.
  4. Rai, R.; Tiwari, M.K.; Ivanov, D.; Dolgui, A. Machine learning in manufacturing and industry 4.0 applications. Int. J. Prod. Res. 2021, 59, 4773–4778.
  5. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570.
  6. Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407.
  7. Chaudhuri, K.; Monteleoni, C.; Sarwate, A.D. Differentially private empirical risk minimization. J. Mach. Learn. Res. 2011, 12, 1069–1109.
  8. Lei, J. Differentially private M-estimators. Adv. Neural Inf. Process. Syst. 2011, 24.
  9. Zhang, J.; Zhang, Z.; Xiao, X.; Yang, Y.; Winslett, M. Functional mechanism: Regression analysis under differential privacy. arXiv 2012, arXiv:1208.0219.
  10. Kifer, D.; Smith, A.; Thakurta, A. Private convex empirical risk minimization and high-dimensional regression. In Proceedings of the Conference on Learning Theory, JMLR Workshop and Conference Proceedings, Edinburgh, Scotland, 25–27 June 2012; pp. 25.1–25.40.
  11. Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 813–822.
  12. Barrientos, A.F.; Reiter, J.P.; Machanavajjhala, A.; Chen, Y. Differentially private significance tests for regression coefficients. J. Comput. Graph. Stat. 2019, 28, 440–453.
  13. Cai, T.T.; Wang, Y.; Zhang, L. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. Ann. Stat. 2021, 49, 2825–2850.
  14. Chaudhuri, K.; Monteleoni, C. Privacy-preserving logistic regression. Adv. Neural Inf. Process. Syst. 2008, 21.
  15. Khanna, A.; Lu, F.; Raff, E.; Testa, B. Differentially private logistic regression with sparse solutions. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, Copenhagen, Denmark, 30 November 2023; pp. 1–9.
  16. Xu, D.; Yuan, S.; Wu, X. Achieving differential privacy and fairness in logistic regression. In Companion Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 594–599.
  17. Fan, Y.; Bai, J.; Lei, X.; Zhang, Y.; Zhang, B.; Li, K.C.; Tan, G. Privacy preserving based logistic regression on big data. J. Netw. Comput. Appl. 2020, 171, 102769.
  18. Ji, Z.; Jiang, X.; Wang, S.; Xiong, L.; Ohno-Machado, L. Differentially private distributed logistic regression using private and public data. BMC Med. Genom. 2014, 7, S14.
  19. Pratt, L.Y. Discriminability-based transfer between neural networks. Adv. Neural Inf. Process. Syst. 1992, 5.
  20. Thrun, S. Is learning the n-th thing any easier than learning the first? Adv. Neural Inf. Process. Syst. 1995, 8.
  21. Bianchi, A.; Vendra, M.R.; Protopapas, P.; Brambilla, M. Improving image classification robustness through selective CNN-filters fine-tuning. arXiv 2019, arXiv:1904.03949.
  22. Zhu, Z.; Li, Y.; Li, R.; Gu, X. Distant domain adaptation for text classification. In Proceedings of Knowledge Science, Engineering and Management: 11th International Conference, KSEM 2018, Changchun, China, 17–19 August 2018; Part I; Springer: Berlin/Heidelberg, Germany, 2018; pp. 55–66.
  23. Fawaz, H.I.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P.A. Transfer learning for time series classification. arXiv 2018, arXiv:1811.01533.
  24. Li, S.; Cai, T.T.; Li, H. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Stat. Soc. Ser. B Stat. Methodol. 2022, 84, 149–173.
  25. Bastani, H. Predicting with proxies: Transfer learning in high dimension. Manag. Sci. 2021, 67, 2964–2984.
  26. Tian, Y.; Feng, Y. Transfer learning under high-dimensional generalized linear models. J. Am. Stat. Assoc. 2022, 118, 2684–2697.
  27. Yang, F.; Zhang, H.R.; Wu, S.; Su, W.J.; Ré, C. Analysis of information transfer from heterogeneous sources via precise high-dimensional asymptotics. arXiv 2020, arXiv:2010.11750.
  28. Zhou, D.; Liu, M.; Li, M.; Cai, T. Doubly robust augmented model accuracy transfer inference with high dimensional features. arXiv 2022, arXiv:2208.05134.
  29. Lin, H.; Reimherr, M. On transfer learning in functional linear regression. arXiv 2022, arXiv:2206.04277.
  30. Takada, M.; Fujisawa, H. Transfer learning via ℓ1 regularization. Adv. Neural Inf. Process. Syst. 2020, 33, 14266–14277.
Figure 1. Algorithm demonstration.
Figure 2. The average $\ell_2$-estimation relative error with known $\mathcal{A}$.
Figure 3. The average $\ell_2$-estimation error with unknown $\mathcal{A}$.
Figure 4. Misclassification rate.
Table 1. Classification accuracy under different $\Delta/\varepsilon$ conditions.

Error setting   Method              $\Delta/\varepsilon = 0.1$   $\Delta/\varepsilon = 0.01$
Normal (N)      naive DPLR          87.67%                       90.00%
Normal (N)      Oracle trans DPLR   92.33%                       96.67%
Mixture (Mix)   naive DPLR          82.33%                       83.67%
Mixture (Mix)   Oracle trans DPLR   84.33%                       89.00%
t (T)           naive DPLR          80.00%                       82.33%
t (T)           Oracle trans DPLR   82.67%                       85.67%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
