Article

A Differentially Private (Random) Decision Tree without Noise from k-Anonymity †

1
Department of Informatics, Faculty of Informatics, Tokyo University of Information Sciences, 4-1 Onaridai, Wakaba-ku, Chiba 265-8501, Japan
2
College of Information Science and Engineering, Ritsumeikan University, 2-150 Iwakura-cho, Ibaraki 567-8570, Japan
3
Cybersecurity Research Institute, National Institute of Information and Communications Technology (NICT), 4-2-1 Nukui-Kitamachi, Koganei 184-8795, Japan
*
Authors to whom correspondence should be addressed.
This paper is an extension of the conference paper: Nojima, R.; Wang, L. Differential Private (Random) Decision Tree Without Adding Noise. In Neural Information Processing, Proceedings of the Neural Information Processing—30th International Conference, ICONIP 2023, Changsha, China, 20–23 November 2023, Proceedings, Part IX; Luo, B., Cheng, L., Wu, Z., Li, H., Li, C., Eds.; Communications in Computer and Information Science; Springer: Berlin/Heidelberg, Germany, 2023; Volume 1963, pp. 162–174.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(17), 7625; https://doi.org/10.3390/app14177625
Submission received: 20 July 2024 / Revised: 17 August 2024 / Accepted: 26 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Data Privacy and Security for Information Engineering)

Abstract

This paper focuses on the relationship between decision trees, a typical machine learning method, and data anonymization. It is known that information leaked from trained decision trees can be evaluated using well-studied data anonymization techniques and that decision trees can be strengthened using k-anonymity and ℓ-diversity; however, this does not seem sufficient for differential privacy. In this paper, we show how one might apply k-anonymity to a (random) decision tree, which is a variant of the decision tree. Surprisingly, this results in differential privacy, which means that security is amplified from k-anonymity to differential privacy without the addition of noise.

1. Introduction

Recently, with the rapid evolution of machine learning technology and the expansion of data due to developments in information technology, it has become increasingly important that companies determine how they might utilize big data effectively and efficiently. However, big data often include personal and private information; thus, careless utilization of such sensitive information may lead to unexpected sanctions.
To overcome this problem, many privacy-preserving technologies have been proposed that allow data to be utilized while maintaining privacy. Typical privacy-preserving technologies include data anonymization (e.g., [1,2,3]) and secure computation (e.g., [4]). This paper focuses on the relationship between data anonymization and decision trees, a typical machine learning method. Historically, data anonymization research has progressed from pseudonymization to k-anonymity [1], ℓ-diversity [2,5], and t-closeness [3] and is continually growing. Currently, many researchers are focused on membership privacy and differential privacy [6].
In [7,8], the authors pointed out that the decision tree is not robust to homogeneity attacks and background knowledge attacks; they then demonstrated how k-anonymity and ℓ-diversity can be applied to amplify security. However, their proposals could not satisfy the requirements of differential privacy. In this paper, we discuss how the leakage of private information from a learned decision tree can be prevented, in the sense of differential privacy, using data anonymization techniques such as k-anonymity and ℓ-diversity.
To prevent leakage of private information, we propose the application of k-anonymity and sampling to a random decision tree, a variant of the decision tree proposed by Fan et al. [9]. Interestingly, we show in this paper that this modification results in differential privacy. The essential idea is that, instead of adding Laplace noise as in [10,11] (please see [12] for a survey of differentially private (random) decision trees), we enhance the security of a random decision tree by sampling the data and then removing every leaf that contains fewer data points than some threshold k, while leaving the other leaves untouched. The basic concept is outlined in [13]. Our proposed model, in which k-anonymity is achieved after sampling, provides differential privacy, as in [13].
As mentioned above, researchers have shifted their attention to differential privacy rather than k-anonymity and ℓ-diversity. In fact, building upon the work outlined in [14], decision trees that satisfy differential privacy use techniques that are typical of differential privacy, such as the exponential, Gaussian, and Laplace mechanisms [10,11,15]. That is, all of these algorithms achieve differential privacy by adding some kind of noise. Our approach is very different from theirs. That said, the basic technique involves applying k-anonymity to each leaf in the random decision tree; this is similar to pruning, which is a widely accepted technique used to avoid overfitting.
The remainder of this paper is organized as follows. Section 2 introduces relevant preliminary information, e.g., anonymization methods and decision trees, and demonstrates how strategies for attacking data anonymization can be converted into attacks targeting decision trees. In Section 3, we demonstrate how much security and accuracy can be achieved in practice when the random decision tree is strengthened using a method that is similar to k-anonymity. In Section 4, the potential advantages of our proposal are discussed. Finally, the paper is concluded in Section 5, which includes a brief discussion of potential future research topics.

2. Preliminaries

2.1. Data Anonymization

When providing personal data to a third party, it is necessary to modify data to preserve user privacy. Here, modifying the user’s data (i.e., a particular record) such that an adversary cannot re-identify a specific individual is referred to as data anonymization. As a basic technique and to prevent re-identification, an identifier, e.g., a person’s name or employee number, is deleted or replaced with a pseudonym ID by the data holder. This process is referred to as pseudonymization. However, simply modifying identifiers does not ensure the preservation of privacy. In some cases, individuals can be re-identified by a combination of features (i.e., a quasi-identifier); thus, it is necessary to modify both the identifier and the quasi-identifier to reduce the risk of re-identification. In most cases, the identifiers themselves are not used for data analysis; thus, removing identifiers does not significantly sacrifice the quality of the dataset. However, if we modify quasi-identifiers in the same manner, although the data may become anonymous, they will also become useless. A typical anonymization technique for quasi-identifiers is “roughening” the numerical values.

2.1.1. Attacks Targeting Pseudonymization

A simple attack is possible against pseudonymized data from which identifiers, e.g., names, have been removed. In this attack, the attacker uses the quasi-identifier of a user u. If this attacker obtains the pseudonymized data, by searching for user u’s quasi-identifier in pseudonymized data, the attacker can obtain sensitive information about u. For example, if the attacker obtains the dataset shown in Table 1 and knows friend u’s zip code is 13068, their age is 29, and their nationality is American, then, by searching the dataset, the attacker can identify that user u is suffering from some heart-related disease. This attack is referred to as the uniqueness attack.

2.1.2. k-Anonymity

k-anonymity is a countermeasure used to prevent uniqueness attacks.
In k-anonymity, features are divided into quasi-identifiers and sensitive attributes, and the quasi-identifiers are generalized so that every combination of quasi-identifier values is shared by at least k users. Table 2 shows anonymized data that have been k-anonymized ( k = 4 ) using the quasi-identifiers zip code, age, and nationality.

2.1.3. Homogeneity Attack

At a cursory glance, k-anonymity appears to be secure; however, even if k-anonymity is employed, a homogeneity attack is still feasible. This attack becomes possible when all the sensitive values within a group are the same. Taking the k-anonymized dataset shown on the right side of Figure 1 as an example and assuming an attacker with the background knowledge shown on the left side of Figure 1, we can make the following observation: the attacker knows Bob's quasi-identifier (zip, age) = (13053, 37), and all records matching this quasi-identifier carry the same sensitive value, cancer. The attacker can therefore deduce that Bob has cancer.

2.1.4. Background Knowledge Attack

Homogeneity attacks highlight a problem when all records sharing a quasi-identifier have the same sensitive value; however, a previous study [3] argued that a problem remains even when the sensitive values are not all the same. The k-anonymized dataset on the right side of Figure 2 contains four records with the quasi-identifier (130**, <30, *) and two types of sensitive information, i.e., (heart disease, flu). Here, we can assume that the attacker has background knowledge similar to that shown on the left side of Figure 2. In this case, both heart disease and flu are certainly possible; however, if the probability of Japanese individuals experiencing heart disease is extremely low, Umeko is estimated to instead have the flu. Thus, it must be acknowledged that k-anonymity does not provide a high degree of security.
ℓ-diversity: ℓ-diversity is a measure used to counteract homogeneity attacks. A k-anonymized table is denoted ℓ-diverse if each equivalence class of quasi-identifiers has at least ℓ "well-represented" values for the sensitive attribute. There are several different interpretations of the term "well-represented" [2,3]. In this paper, we adopt distinct ℓ-diversity, which means that there are at least ℓ distinct values for the sensitive attribute in each equivalence class of quasi-identifiers. Table 3 shows anonymized data that are two-diverse (ℓ = 2).

2.2. Decision Trees

2.2.1. Basic Decision Trees

Decision trees are supervised learning methods that are primarily used for classification tasks; a tree structure is created while learning from the data (Figure 3). When predicting the label $y$ of a feature vector $\mathbf{x}$, the process begins at the root of the tree, and the corresponding leaf is searched for while referring to each feature of $\mathbf{x}$. Finally, through this traversal, $y$ is predicted.
The label assigned by a leaf is derived from the dataset $D$ used to generate the tree structure. In other words, after the tree structure is created, for each element $(\mathbf{x}_i, y_i)$ in dataset $D$, the corresponding leaf $\mathrm{leaf}$ is found, and the value of $y_i$ is recorded there. If $y_i \in Y = \{0, 1\}$, then each leaf $\mathrm{leaf}$ preserves the number of $y_i$ labeled 0 and the number labeled 1. More precisely, $[n(\mathrm{leaf}(1)), n(\mathrm{leaf}(0))]$ is preserved for each leaf $\mathrm{leaf}$, where $n(\mathrm{leaf}(0))$ and $n(\mathrm{leaf}(1))$ represent the numbers of data points with label $y$ equal to 0 and 1, respectively. Table 4 shows the notations used in this paper.
For a given prediction input $\mathbf{x}$, we first search for the corresponding leaf, and the prediction is 1 if $\frac{n(\mathrm{leaf}(1))}{n(\mathrm{leaf}(0)) + n(\mathrm{leaf}(1))} > \frac{1}{2}$, and 0 otherwise. Here, the threshold $1/2$ can be set flexibly depending on where the decision tree is applied, and when providing the learned decision tree to a third party, it is possible to pass $n(\mathrm{leaf}(0))$ and $n(\mathrm{leaf}(1))$ together for each leaf $\mathrm{leaf}$. In this paper, we consider the security of decision trees in such situations.
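As a small worked example, suppose a leaf stores $[n(\mathrm{leaf}(1)), n(\mathrm{leaf}(0))] = [7, 3]$; since $7/(7+3) = 0.7 > 1/2$, any query reaching this leaf is predicted as 1. A leaf storing $[1, 0]$ yields the same prediction while revealing that exactly one training record reached it, which is precisely the kind of leakage examined in Section 2.4.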
Generally, the deeper the tree structure, the more likely it is to overfit; thus, trees are frequently pruned. In this paper, a similar technique is employed to preserve data privacy.

2.2.2. Random Decision Tree

The random decision tree was proposed by Fan et al. [9]. Notably, in their approach, the tree is randomly generated without depending on the data. Furthermore, sufficient performance can be ensured through the appropriate selection of parameters.
The shape of a normal (not random) decision tree depends on the data used; the eventual shape may therefore cause private information to be leaked from the tree. However, the random decision tree avoids this leakage because it is generated randomly, and its performance is expected to match that of other proposed security methods.
A random decision tree is described in Algorithms 1 and 2. Algorithm 1 shows that the generated tree does not depend on dataset $D$, except for the counts $n_i(\mathrm{leaf}(y))$ created by UpdateStatistics. Here, $n_i(\mathrm{leaf}(y))$ denotes a two-dimensional array recording, for each leaf $\mathrm{leaf}$ and label $y$, the number of data points reaching that leaf in tree $T_i$. Because $n_i(\mathrm{leaf}(y))$ depends on $D$, its privacy still needs to be preserved.
As a general rule for setting the parameters, the depth of each tree is set to half the dimension of the feature vectors, and the number of trees is set to 10.

2.3. Security Definitions

Part of this work adopts differential privacy to evaluate the security and efficiency of the model.
Definition 1.
A randomized algorithm $A$ satisfies $(\epsilon, \delta)$-DP if, for any pair of neighboring datasets $D$ and $D'$ and any $O \subseteq \mathcal{O}$,
$$\Pr[A(D) \in O] \le e^{\epsilon} \cdot \Pr[A(D') \in O] + \delta, \qquad (1)$$
where $\mathcal{O}$ is the range of $A$.
When studying differential privacy, it is assumed that the attacker knows all the elements in D. However, such an assumption may not be realistic. This is taken into consideration, and the following definition is given in [13]:
Definition 2
(DP under sampling, $(\beta, \epsilon, \delta)$-DPS). An algorithm $A$ satisfies $(\beta, \epsilon, \delta)$-DPS if and only if the algorithm $A^{\beta}$ satisfies $(\epsilon, \delta)$-DP, where $A^{\beta}$ denotes the algorithm that first samples each tuple in the input dataset with probability $\beta$ and then runs $A$.
In other words, the definition describes the output of $A$ on input $D'$, where $D'$ is the result of sampling dataset $D$. Hence, the attacker may know $D$ but not $D'$.
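As a small illustration of Definition 2 (our sketch in Python, not code from the paper), the wrapped algorithm $A^{\beta}$ simply keeps each input record independently with probability $\beta$ before running $A$:

import random

def sampled_algorithm(algo, dataset, beta, rng=None):
    # A^beta from Definition 2: keep each record independently with probability beta, then run A.
    rng = rng or random.Random()
    sampled = [record for record in dataset if rng.random() < beta]
    return algo(sampled)

Theorem 1 below bounds the $(\epsilon, \delta)$ achieved by such a wrapper when the wrapped algorithm is a strongly safe k-anonymization algorithm.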
Algorithm 1 Training the random decision tree [9]
Input: Training data $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, the set of features $X = \{F_1, \ldots, F_m\}$, number of random decision trees to be generated $N_t$
Output: Random decision trees $T_1, \ldots, T_{N_t}$
for $i \in \{1, \ldots, N_t\}$ do
    $T_i$ = BuildTreeStructure(root, $X$)
end for
for $i \in \{1, \ldots, N_t\}$ do
    $T_i$ = UpdateStatistics($T_i$, $D$)
end for
return $T_1, \ldots, T_{N_t}$

BuildTreeStructure(node, $X$)
if $X = \emptyset$ then
    Set the node as a leaf
else
    Pick $F \in X$ uniformly at random
    /* Set $F$ as the feature of the node. Also, assume $|F| = c$ */
    for $i \in \{1, \ldots, c\}$ do
        $T$ = BuildTreeStructure($\mathrm{node}_i$, $X \setminus \{F\}$)
    end for
end if

UpdateStatistics($T_i$, $D$)
Set $n_i(\mathrm{leaf}(y)) = 0$ for all leaves $\mathrm{leaf}$ and all labels $y$.
for $(\mathbf{x}, y) \in D$ do
    Find the leaf $\mathrm{leaf}$ corresponding to $\mathbf{x}$, and set $n_i(\mathrm{leaf}(y)) = n_i(\mathrm{leaf}(y)) + 1$.
end for

Algorithm 2 Classifying with the random decision trees [9]
Require: $\{T_1, \ldots, T_{N_t}\}$, $\mathbf{x}$
return $\sum_{i=1}^{N_t} n_i(\mathrm{leaf}(y)) \,\Big/\, \sum_{y' \in Y} \sum_{i=1}^{N_t} n_i(\mathrm{leaf}(y'))$ for each $y \in Y$, where $\mathrm{leaf}$ denotes the leaf of $T_i$ corresponding to $\mathbf{x}$.
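To make the pseudocode concrete, here is a minimal Python sketch of Algorithms 1 and 2 for categorical features; the class and function names are ours (not from the authors' implementation), and depth control and numerical attributes are omitted for brevity.

import random
from collections import defaultdict

class Node:
    def __init__(self, feature=None):
        self.feature = feature          # index of the feature tested at this node; None for a leaf
        self.children = {}              # feature value -> child Node
        self.counts = defaultdict(int)  # label -> count n_i(leaf(y)); used only at leaves

def build_tree_structure(features, domains, rng):
    # BuildTreeStructure: choose an unused feature at random and recurse until none remain.
    if not features:
        return Node()  # leaf
    feature = rng.choice(features)
    node = Node(feature)
    rest = [f for f in features if f != feature]
    for value in domains[feature]:
        node.children[value] = build_tree_structure(rest, domains, rng)
    return node

def find_leaf(node, x):
    while node.feature is not None:
        node = node.children[x[node.feature]]
    return node

def update_statistics(root, data):
    # UpdateStatistics: count the labels of the records that reach each leaf.
    for x, y in data:
        find_leaf(root, x).counts[y] += 1

def train(data, domains, n_trees, seed=0):
    # Algorithm 1: the tree shapes depend only on the feature domains, never on the data.
    rng = random.Random(seed)
    features = list(domains)
    trees = [build_tree_structure(features, domains, rng) for _ in range(n_trees)]
    for tree in trees:
        update_statistics(tree, data)
    return trees

def classify(trees, x, labels=(0, 1)):
    # Algorithm 2: normalized label counts summed over all trees.
    totals = {y: sum(find_leaf(t, x).counts[y] for t in trees) for y in labels}
    denom = sum(totals.values()) or 1
    return {y: c / denom for y, c in totals.items()}

For example, train([((0, 1), 1), ((1, 0), 0)], {0: [0, 1], 1: [0, 1]}, n_trees=10) builds ten random trees over two binary features, and classify(trees, x) returns the normalized label counts used for prediction.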

2.4. Attacks Targeting the Decision Tree

Generally, a decision tree is constructed from a given dataset; however, it is also possible to partially reconstruct the dataset using the decision tree. This kind of attack is considered in [7]. In this section, we explain the workings of such attacks.
Figure 4 shows the reconstruction of a dataset from the decision tree shown in Figure 3. As shown, it is impossible to reconstruct the original data completely from a binary tree model; however, it is possible to extract some of the data. By exploiting this essential property, it is possible to mount some attacks against reconstructed data, as discussed in Section 2.1. In the following, taking Figure 4 as an example, we present a uniqueness attack, a homogeneous attack, and a background knowledge attack.
  • Uniqueness attack: in the dataset (Figure 4) recovered from the model, there is one user whose height is greater than 170 and who is under 15 years of age (as shown in the sixth row (see the 2nd red box in the figure), converted from the relevant leaf via $[n(\mathrm{leaf}_3(1)), n(\mathrm{leaf}_3(0))] = [1, 0]$); thus, it is possible to target this user with a uniqueness attack.
  • Homogeneity attack: similarly, the fourth and fifth rows (see the 1st red box in the figure), converted from the relevant leaf via $[n(\mathrm{leaf}_2(1)), n(\mathrm{leaf}_2(0))] = [0, 2]$, share the same height (< 170), the same weight range (≥ 60), and the same health status (i.e., "unhealthy"); therefore, a homogeneity attack is possible.
  • Background knowledge attack: Similarly, in the seventh, eighth, and ninth rows, there are three users whose data match in both height (≥ 170) and age (≥ 15). Among these users, one is healthy (yes) and two are unhealthy (no). As an attacker, we can consider the following:
    • (Background knowledge of user A) height: 173, age: 33, healthy;
    • (Background knowledge of target user B) height: 171, age: 19.
    In this case, if the adversary knows that user A is healthy, he/she can identify that user B is unhealthy.

3. Proposal: Applying k-Anonymity to a (Random) Decision Tree

3.1. Construction of k-AnonToRdt

In this section, we demonstrate how one might achieve differential privacy from k-anonymity. More specifically, we present a proposal based on a random decision tree, which is a variant of the decision tree outlined in Section 2.2.2. The proposal is shown as Algorithm 3. It differs from the original random decision tree in the following ways:
  • (Pruning): for some threshold $k$, if there exists a tree $T_i$, a leaf $\mathrm{leaf}$, and a label $y$ satisfying $n_i(\mathrm{leaf}(y)) < k$, then $n_i(\mathrm{leaf}(y))$ is set to 0.
  • (Sampling): each tree $T_i$ is trained on $D_i$, which is obtained by sampling every record of dataset $D$ independently with probability $\beta$.

3.2. Security: Strongly Safe k-Anonymization Ensures Differential Privacy

In the field of data anonymization, the authors of [13] demonstrated that performing k-anonymization after sampling achieves differential privacy; our proposal builds upon this core principle. Below, we outline the items necessary to evaluate the security of the data.
Definition 3
(Strongly safe k-anonymization algorithm [13]). Suppose that a function $g$ has domain $\mathcal{D}$ and range $T$, and that $g$ does not depend on the input dataset $D$, i.e., $g$ is fixed in advance. The strongly safe k-anonymization algorithm $A$ with input $D$ is defined as follows:
  • Compute $Y_1 = \{ g(x) \mid x \in D \}$.
  • Compute $Y_2 = \{ (y, |\{x \in D : g(x) = y\}|) : y \in Y_1 \}$.
  • For each element $(y, c) \in Y_2$, if $c < k$, then the element is set to $(y, 0)$; the result is output as $Y_3$.
Algorithm 3 Proposed training process
Input: Training data $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, the set of features $X = \{F_1, \ldots, F_m\}$, number of random decision trees to be generated $N_t$
Output: Random decision trees $T_1, \ldots, T_{N_t}$
for $i \in \{1, \ldots, N_t\}$ do
    $T_i$ = BuildTreeStructure(root, $X$)
end for
for $i \in \{1, \ldots, N_t\}$ do
    $D_i \leftarrow \emptyset$
    For each $(\mathbf{x}, y) \in D$, set $D_i \leftarrow D_i \cup \{(\mathbf{x}, y)\}$ with probability $\beta$
    $T_i$ = UpdateStatistics-k-anon($T_i$, $D_i$)
end for
return $T_1, \ldots, T_{N_t}$

UpdateStatistics-k-anon($T_i$, $D_i$)
Set $n_i(\mathrm{leaf}(y)) = 0$ for all leaves $\mathrm{leaf}$ and labels $y$.
for $(\mathbf{x}, y) \in D_i$ do
    Find the leaf $\mathrm{leaf}$ corresponding to $\mathbf{x}$, and set $n_i(\mathrm{leaf}(y)) = n_i(\mathrm{leaf}(y)) + 1$.
end for
for all pairs of leaf and label $(\mathrm{leaf}, y)$ do
    if $n_i(\mathrm{leaf}(y)) < k$ then
        $n_i(\mathrm{leaf}(y)) \leftarrow 0$ /* Remove leaves containing fewer data than k */
    end if
end for
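Under the same assumptions as the sketch following Algorithm 2 (our illustrative Python code, reusing build_tree_structure and find_leaf from there), the proposed k-AnonToRdt training reduces to a per-tree Bernoulli sample of D followed by zeroing every leaf count below k:

import random

def update_statistics_k_anon(root, data, k):
    # UpdateStatistics-k-anon: count labels as before, then zero every count smaller than k.
    for x, y in data:
        find_leaf(root, x).counts[y] += 1
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is None:  # leaf
            for y in list(node.counts):
                if node.counts[y] < k:
                    node.counts[y] = 0  # "remove" a leaf count holding fewer than k records
        else:
            stack.extend(node.children.values())

def train_k_anon_to_rdt(data, domains, n_trees, k, beta, seed=0):
    # Algorithm 3: random structure, a fresh Bernoulli(beta) sample of D per tree, then pruning.
    rng = random.Random(seed)
    features = list(domains)
    trees = [build_tree_structure(features, domains, rng) for _ in range(n_trees)]
    for tree in trees:
        sampled = [(x, y) for (x, y) in data if rng.random() < beta]
        update_statistics_k_anon(tree, sampled, k)
    return trees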
Assume that $f(j; n, \beta)$ denotes the binomial probability mass function, i.e., the probability of succeeding exactly $j$ times in $n$ attempts when each attempt succeeds with probability $\beta$. Furthermore, the cumulative distribution function is expressed as follows in Equation (2):
$$F(j; n, \beta) = \sum_{i=0}^{j} f(i; n, \beta). \qquad (2)$$
Theorem 1
(Theorem 5 in [13]). Any strongly safe k-anonymization algorithm satisfies $(\beta, \epsilon, \delta)$-DPS for any $0 < \beta < 1$, $\epsilon \ge \ln\!\left(\frac{1}{1-\beta}\right)$, and
$$\delta = d(k, \beta, \epsilon) = \max_{n :\, n \ge \lceil k/\gamma \rceil - 1} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \qquad (3)$$
where $\gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}$.
Equation (3) shows the relationship between β and δ in determining the value of ϵ when k is fixed.
Let us consider the case where a record $(\mathbf{x}, y) \in D$ is applied to a random decision tree $T_i$, and the leaf reached is denoted by $\mathrm{leaf}$. If a function $g_i$ is defined as
$$g_i((\mathbf{x}, y)) = (\mathrm{leaf}, y), \qquad (4)$$
then $g_i$ in Equation (4) is clearly constant, that is, it does not depend on $D$, because the structure of $T_i$ is generated without looking at the data. Therefore, $n_i(\mathrm{leaf}(y))$, which is generated using $g_i$, can be regarded as the output of a strongly safe k-anonymization algorithm; consequently, Theorem 1 can be applied.
However, Theorem 1 can be applied in its original form only when there is a single tree $T_i$, i.e., when the number of trees $N_t = 1$. When $N_t > 1$, Theorem 2 (composition) is applied in addition.
Theorem 2
(Theorem 3.16 in [17]). Assume that $A_i$ is an $(\epsilon_i, \delta_i)$-DP algorithm for $1 \le i \le N_t$. Then, the algorithm
$$A(D) := (A_1(D), A_2(D), \ldots, A_{N_t}(D)) \qquad (5)$$
satisfies $\left(\sum_{i=1}^{N_t} \epsilon_i, \sum_{i=1}^{N_t} \delta_i\right)$-DP.
In Algorithm 3, each $T_i$ is generated randomly, and sampling is performed independently for each tree. Hence, the following conclusion can be reached.
Corollary 1.
The proposed algorithm satisfies $(\beta, N_t\epsilon, N_t\delta)$-DPS for any $0 < \beta < 1$, $\epsilon \ge \ln\!\left(\frac{1}{1-\beta}\right)$, and
$$\delta = d(k, \beta, \epsilon) = \max_{n :\, n \ge \lceil k/\gamma \rceil - 1} \; \sum_{j > \gamma n}^{n} f(j; n, \beta), \qquad (6)$$
where $\gamma = \frac{e^{\epsilon} - 1 + \beta}{e^{\epsilon}}$.
Table 5 shows the relationship, derived from Equation (6), between $\beta$ and $N_t\delta$ in determining the value of $N_t\epsilon$ when $k$ and $N_t$ are fixed. The cells in the table represent the approximate value of $N_t\delta$. For $k$ and $N_t$, we chose $(k, N_t) = (5, 10)$, $(10, 10)$, and $(20, 10)$, as shown in Table 5.
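As an illustration of how entries of this kind can be computed from Equation (6), the following sketch evaluates $d(k, \beta, \epsilon)$ numerically (our code, using SciPy; the scan bound n_max is a heuristic cutoff we chose, justified by the fact that the binomial tail term vanishes for large n once $\epsilon \ge \ln\frac{1}{1-\beta}$):

from math import ceil, exp, log
from scipy.stats import binom

def delta_dps(k, beta, eps, n_max=20000):
    # d(k, beta, eps) from Theorem 1 / Corollary 1:
    # max over n >= ceil(k / gamma) - 1 of Pr[Binomial(n, beta) > gamma * n].
    if eps < log(1.0 / (1.0 - beta)):
        raise ValueError("requires eps >= ln(1 / (1 - beta))")
    gamma = (exp(eps) - 1.0 + beta) / exp(eps)
    n_min = max(1, ceil(k / gamma) - 1)
    # binom.sf(t, n, beta) = Pr[Binomial(n, beta) > t]; the tail shrinks as n grows,
    # so scanning a finite window above n_min is enough in practice.
    return max(binom.sf(gamma * n, n, beta) for n in range(n_min, n_min + n_max))

# With N_t = 10 trees and a total budget N_t * eps = 2.0, the per-tree eps is 0.2, so
# 10 * delta_dps(k=10, beta=0.01, eps=0.2) should roughly match the k = 10, beta = 0.01
# entry in the N_t*eps = 2.0 row of Table 5.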

3.3. Experiments on k-AnonToRdt

The efficiency of the proposal was verified using the Nursery dataset [18], the Adult dataset [19], the Mushroom dataset [20], and the Loan dataset [21]. The characteristics of each dataset are as follows.
  • The Nursery dataset contains 12,960 records with eight features, with a maximum of five values for each feature;
  • The Adult dataset contains 48,842 records with 14 features. Here, each feature has more possible values, and there are more records than in the Nursery dataset;
  • The Mushroom dataset contains 8124 records with 22 features. Compared to the above two datasets, there are more features, but the number of records is small. In general, applying k-anonymity to this kind of dataset is challenging.
  • The Loan dataset [21] contains 9578 records with 13 features. We used the “not.fully.paid” feature to label classes. There are four binary attributes and seven numerical attributes in this dataset.
Appendix A contains the evaluation of the basic decision tree on each dataset. First, $\delta$, $N_t$, and $k$ were set to $\delta = 0.4$, $N_t \in \{8, \ldots, 11\}$, and $k \in \{5, 10\}$.
The results from the Nursery dataset obtained via k-AnonToRdt with these parameters are as follows:
  • The accuracy of the original decision tree was 0.933, as shown in Table A1.
  • As shown in Table 6 (a), for a tree depth equal to four, the accuracy obtained was 0.84, which was inferior to that of the original decision tree.
  • As shown in Table 6 (a), for a tree depth equal to five, the accuracy decreased drastically as k increased.
The results from the Adult dataset obtained via k-AnonToRdt with the same $\delta$, $N_t$, and $k$ were as follows (this dataset contains numerical attributes; to handle them, a threshold $t$ was chosen randomly from the attribute's domain, and the node produced two children corresponding to $\le t$ and $> t$):
  • The accuracy of the original decision tree was 0.855, as shown in Table A1.
  • As shown in Table 6 (b), the achieved accuracy when k = 5 was 0.817.
The results from the Mushroom dataset obtained via k-AnonToRdt with the same $\delta$, $N_t$, and $k$ were as follows:
  • The accuracy of the original decision tree was 0.995, as shown in Table A1.
  • As shown in Table 6 (c), the achieved accuracy when k = 5 was 0.98.
The results from the Loan dataset obtained via k-AnonToRdt with the same $\delta$, $N_t$, and $k$ were as follows:
  • The accuracy of the original decision tree was 0.738 in our experiment.
  • As shown in Table 6 (d), the achieved accuracy was around 0.84.
In summary, the accuracy achieved by k-AnonToRdt was slightly inferior to that of the original decision tree.
Changing sampling rate β: To achieve secure differential privacy, the sampling rate should remain small. With $N_t = 10$ and $N_t\epsilon = 2.0$ fixed (values small enough for our practical application), Table 7 shows how the values of $N_t\delta$ change with the sampling rate $\beta$. As shown, for some parameters, the accuracy of the proposed method remained relatively good even when $N_t\epsilon = 2.0$ and $N_t\delta$ were small.

4. Discussion

In another highly relevant study [10], Jagannathan et al. proposed a variant of the random decision tree that achieves differential privacy. The accuracy of their proposal is reported in Figures 1 and 2 of [10] for the same datasets with the same class labels (in [10], three class labels instead of five were used for the Nursery dataset, i.e., some similar labels were merged); their method resulted in similar accuracy. Because our proposal employs sampling, it is limited by the size of the dataset being utilized; the smaller the dataset (e.g., the Mushroom dataset), the more the accuracy is affected. However, it must be noted that their approach is very different from ours: Laplace noise is added instead of pruning and sampling. Notably, within their proposal,
$$n_i(\mathrm{leaf}(y)) + \text{Laplace noise}$$
is released for all trees $T_i$, all leaves $\mathrm{leaf}$, and all labels $y$. Even in this context, if $n_i(\mathrm{leaf}(y))$ is small for a certain $T_i$, $\mathrm{leaf}$, and $y$, it may be regarded almost as a personal record. A good general approach to handling such cases is to remove the rare records, i.e., to "remove the leaves containing fewer records". This is a broadly accepted data anonymization technique [22] that is commonly used to avoid legal difficulties. Our proposal shows that pruning and sampling can be combined to ensure differential privacy. If rare sensitive records need to be removed, our method may therefore represent an excellent option.
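To make the contrast concrete, the following sketch (our illustration; neither the code of [10] nor our implementation) shows the two ways a single leaf count can be released: perturbed with Laplace noise, as in [10], or published exactly but only when it reaches the threshold k, as proposed here (with the sampling step already applied when the counts were collected):

import math
import random

def release_with_laplace(count, epsilon, sensitivity=1.0, rng=None):
    # Laplace-mechanism style release, as in [10]: count + Lap(sensitivity / epsilon).
    rng = rng or random.Random()
    u = rng.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return count + noise

def release_with_pruning(count, k):
    # Release used in this paper (after sampling): the exact count, or 0 if it is below k.
    return count if count >= k else 0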

5. Conclusions

In this paper, we aimed to show the close relationship between the security of data anonymity and decision trees. Specifically, we showed how to obtain a differentially private (random) decision tree from k-anonymity. Our proposal consists of applying sampling and k-anonymity to the original random decision tree method, which results in differential privacy. Compared to existing schemes, the advantage of our proposal is its ability to achieve differential privacy without adding Laplace or Gaussian noise, which provides trained decision trees with a new route to differential privacy. We believe that, in addition to random decision trees, other similar algorithms can be augmented to achieve differential privacy for general decision trees. In addition, in future studies, we will explore the differential privacy of federated learning decision trees by extending the proposed method.

Author Contributions

Conceptualization, A.W. and R.N.; methodology, A.W., R.N. and L.W.; software, A.W. and R.N.; formal analysis, A.W., R.N. and L.W.; investigation, A.W., R.N. and L.W.; resources, R.N.; data curation, A.W.; writing—original draft preparation, A.W., R.N. and L.W.; writing—review and editing, L.W.; supervision, R.N.; project administration, R.N. and L.W.; funding acquisition, R.N. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JST CREST Grant Number JPMJCR21M1, Japan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author. The source code used in this study will be available upon request from the corresponding author from January 2025.

Acknowledgments

We would like to thank Ryousuke Wakabayashi and Yusaku Ito for their technical assistance with the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Experiments on Attacks against the Decision Tree

We used three datasets to evaluate the vulnerability of decision trees to uniqueness and homogeneous attacks: the Nursery dataset [18], the Adult dataset [19], and the Mushroom dataset [20]. In these experiments, we used Python 3 and the scikit-learn (sklearn) library to train the decision trees.
In the experiments, the tree depths were set to four, five, six, seven, and eight. We divided each dataset into a training set and an evaluation set. The training set, which was used to train the decision tree, contained 80% of the records in the dataset. Here, the decision tree was trained 10 times, and averages of the following numbers were computed.
  • $H_{\mathrm{leaf}}$: the number of leaves that can be identified by a homogeneous attack, that is, the number of leaves $\mathrm{leaf}$ for which there exists $y^* \in Y$ with $n(\mathrm{leaf}(y^*)) \ge 2$ and $n(\mathrm{leaf}(y)) = 0$ for all $y \in Y \setminus \{y^*\}$;
  • $H_u$: the number of users who can be identified by a homogeneous attack, i.e.,
    $$H_u = \sum_{\mathrm{leaf}} n(\mathrm{leaf}(y^*)),$$
    where the sum is taken over the leaves that suffer the homogeneous attack.
Note that in a uniqueness attack, the number of leaves that can be identified is equal to the number of identifiable users. Table A1 shows the experimental results of the homogeneous attack. Regarding the homogeneous attack, even if the tree depth is small, information can be leaked in all datasets. In addition, susceptibility to homogeneous attacks increases as the tree depth increases, as shown in Figure A1.
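For reference, the following sketch (our code; dataset loading and categorical encoding omitted) counts $H_{\mathrm{leaf}}$ and $H_u$ for a scikit-learn DecisionTreeClassifier by grouping the training records by the leaf they reach:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def homogeneity_counts(clf, X_train, y_train):
    # H_leaf: leaves holding >= 2 training records that all share one label; H_u: records in them.
    leaf_ids = clf.apply(X_train)          # leaf reached by each training record
    y_train = np.asarray(y_train)
    h_leaf = h_u = 0
    for leaf in np.unique(leaf_ids):
        labels = y_train[leaf_ids == leaf]
        if len(labels) >= 2 and len(np.unique(labels)) == 1:
            h_leaf += 1
            h_u += len(labels)
    return h_leaf, h_u

# Example usage:
# clf = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
# print(homogeneity_counts(clf, X_train, y_train))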
Table A1. Number of users ($H_u$) and leaves ($H_{\mathrm{leaf}}$) that can be identified by homogeneity attacks.
Tree Depth | Nursery (ACC / H_u / H_leaf) | Adult (ACC / H_u / H_leaf) | Mushroom (ACC / H_u / H_leaf)
4 | 0.863 / 3965.5 / 2 | 0.843 / 215.2 / 2.4 | 0.979 / 3293.6 / 9
5 | 0.880 / 5217 / 4.9 | 0.851 / 829.5 / 7.1 | 0.980 / 3568.4 / 12
6 | 0.888 / 5747.1 / 9.9 | 0.853 / 1621.7 / 18.3 | 0.995 / 6122.6 / 16
7 | 0.921 / 6837.4 / 19.5 | 0.855 / 1957.4 / 35.2 | 1.000 / 6499 / 20
8 | 0.933 / 7912.7 / 35.7 | 0.855 / 2316.1 / 60.8 | 1.000 / 6499 / 20
Figure A1. Tree depth and the proportion of users who can be identified by a homogeneous attack.

References

1. Sweeney, L. k-Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 557–570.
2. Machanavajjhala, A.; Gehrke, J.; Kifer, D.; Venkitasubramaniam, M. ℓ-Diversity: Privacy Beyond k-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, Atlanta, GA, USA, 3–8 April 2006; IEEE Computer Society: Washington, DC, USA, 2006; p. 24.
3. Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, Istanbul, Turkey, 15–20 April 2007; IEEE Computer Society: Washington, DC, USA, 2007; pp. 106–115.
4. Yao, A.C. How to Generate and Exchange Secrets (Extended Abstract). In Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Toronto, ON, Canada, 27–29 October 1986; IEEE Computer Society: Washington, DC, USA, 1986; pp. 162–167.
5. Praveena Priyadarsini, R.; Sivakumari, S.; Amudha, P. Enhanced ℓ-Diversity Algorithm for Privacy Preserving Data Mining. In Digital Connectivity—Social Impact, Proceedings of the 51st Annual Convention of the Computer Society of India, CSI 2016, Coimbatore, India, 8–9 December 2016; Subramanian, S., Nadarajan, R., Rao, S., Sheen, S., Eds.; Springer: Singapore, 2016; pp. 14–23.
6. Stadler, T.; Oprisanu, B.; Troncoso, C. Synthetic Data – Anonymisation Groundhog Day. In Proceedings of the 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, 10–12 August 2022; USENIX Association: Berkeley, CA, USA, 2022; pp. 1451–1468.
7. Friedman, A.; Wolff, R.; Schuster, A. Providing k-anonymity in data mining. VLDB J. 2008, 17, 789–804.
8. Ciriani, V.; di Vimercati, S.D.C.; Foresti, S.; Samarati, P. k-Anonymous Data Mining: A Survey. In Privacy-Preserving Data Mining—Models and Algorithms; Aggarwal, C.C., Yu, P.S., Eds.; Advances in Database Systems; Springer: Berlin/Heidelberg, Germany, 2008; Volume 34, pp. 105–136.
9. Fan, W.; Wang, H.; Yu, P.S.; Ma, S. Is random model better? On its accuracy and efficiency. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, FL, USA, 19–22 December 2003; IEEE Computer Society: Washington, DC, USA, 2003; pp. 51–58.
10. Jagannathan, G.; Pillaipakkamnatt, K.; Wright, R.N. A Practical Differentially Private Random Decision Tree Classifier. Trans. Data Priv. 2012, 5, 273–295.
11. Fletcher, S.; Islam, M.Z. A Differentially Private Random Decision Forest Using Reliable Signal-to-Noise Ratios. In AI 2015: Advances in Artificial Intelligence, Proceedings of the 28th Australasian Joint Conference, Canberra, ACT, Australia, 30 November–4 December 2015; Pfahringer, B., Renz, J., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9457, pp. 192–203.
12. Fletcher, S.; Islam, M.Z. Decision Tree Classification with Differential Privacy: A Survey. ACM Comput. Surv. 2019, 52, 83:1–83:33.
13. Li, N.; Qardaji, W.H.; Su, D. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS '12, Seoul, Republic of Korea, 2–4 May 2012; ACM: New York, NY, USA, 2012; pp. 32–33.
14. Blum, A.; Dwork, C.; McSherry, F.; Nissim, K. Practical privacy: The SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2005, Baltimore, MD, USA, 13–15 June 2005; Li, C., Ed.; ACM: New York, NY, USA, 2005; pp. 128–138.
15. Friedman, A.; Schuster, A. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, Washington, DC, USA, 25–28 July 2010; Rao, B., Krishnapuram, B., Tomkins, A., Yang, Q., Eds.; ACM: New York, NY, USA, 2010; pp. 493–502.
16. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3.
17. Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407.
18. Rajkovic, V. Nursery. UCI Machine Learning Repository. 1997. Available online: https://archive.ics.uci.edu/dataset/76/nursery (accessed on 24 August 2024).
19. Becker, B.; Kohavi, R. Adult. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/2/adult (accessed on 24 August 2024).
20. Mushroom. UCI Machine Learning Repository. 1987. Available online: https://archive.ics.uci.edu/dataset/73/mushroom (accessed on 24 August 2024).
21. Mahdi, N. Bank_Personal_Loan_Modelling. Available online: https://www.kaggle.com/datasets/ngnnguynthkim/bank-personal-loan-modellingcsv (accessed on 24 August 2024).
22. Mobile Kukan Toukei (Guidelines). Available online: https://www.intage.co.jp/english/service/platform/mobile-kukan-toukei/ (accessed on 24 August 2024).
Figure 1. Homogeneity attack. The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned (i.e., it is fully generalized).
Figure 2. Background knowledge attack. The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned (i.e., it is fully generalized).
Figure 3. An example of a decision tree for a dataset with two labels, 1 and 0. The vectors in the leaves show the numbers of classified data points, where the two entries of each vector correspond to the numbers of data points labeled 1 and 0, respectively.
Figure 4. Example of conversion from the decision tree (Figure 3) to anonymized data, where the * in the Weight and Age columns means that the corresponding attribute is not partitioned.
Table 1. Example of a dataset *.
Zip | Age | Nationality | Disease
13053 | 28 | Russian | Heart
13068 | 29 | American | Heart
13068 | 21 | Japanese | Flu
13053 | 23 | American | Flu
14853 | 50 | Indian | Cancer
14853 | 55 | Russian | Heart
14850 | 47 | American | Flu
14850 | 59 | American | Flu
13053 | 31 | American | Cancer
13053 | 37 | Indian | Cancer
13068 | 36 | Japanese | Cancer
13068 | 32 | American | Cancer
* Excerpt from [16].
Table 2. k-anonymity (k = 4) of the dataset shown in Table 1.
Zip | Age | Nationality | Disease
130** | <30 | * | Heart
130** | <30 | * | Heart
130** | <30 | * | Flu
130** | <30 | * | Flu
1485* | >40 | * | Cancer
1485* | >40 | * | Heart
1485* | >40 | * | Flu
1485* | >40 | * | Flu
130** | 30–40 | * | Cancer
130** | 30–40 | * | Cancer
130** | 30–40 | * | Cancer
130** | 30–40 | * | Cancer
The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
Table 3. ℓ-diversity (ℓ = 2) after k-anonymity of the dataset shown in Table 1.
Zip | Age | Nationality | Disease
130** | <30 | * | Heart
130** | <30 | * | Heart
130** | <30 | * | Flu
130** | <30 | * | Flu
1485* | >40 | * | Cancer
1485* | >40 | * | Heart
1485* | >40 | * | Flu
1485* | >40 | * | Flu
130** | 30–40 | * | Cancer
130** | 30–40 | * | Cancer
130** | 30–40 | * | Cancer
130** | 30–40 | * | Flu
The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
Table 4. The notations used in this paper.
Notation | Description
k | Parameter for k-anonymity
ℓ | Parameter for ℓ-diversity
D | Dataset {(x_i, y_i)}, for i ∈ {1, …, n}
𝒳 | Data space with m features F_1, …, F_m
x | Data (x_1, …, x_m) ∈ 𝒳 = (F_1, …, F_m)
y | Label of data x
Y | Label space
leaf, leaf_i | Leaf or leaves
n(leaf_i(y)) | The number of data points classed as leaf_i with label y ∈ Y
n(leaf_i) | The number of data points classed as leaf_i, i.e., n(leaf_i) = Σ_{y∈Y} n(leaf_i(y))
n_j(leaf_i(y)) | The number of data points with label y ∈ Y classed as leaf_i of the j-th tree (used for the random decision tree)
H_u | Total number of users who can be identified by a homogeneous attack
H_leaf | Number of leaves that can be identified by a homogeneity attack
N_t | Number of random decision trees
ACC | Accuracy
Table 5. Approximate value of N_t δ for k = 5, 10, 20 and N_t = 10.
N_t ε | k = 5 (β = 0.01 / 0.1 / 0.4) | k = 10 (β = 0.01 / 0.1 / 0.4) | k = 20 (β = 0.01 / 0.1 / 0.4)
1.0 | 0.001 / - / - | 4.66×10⁻⁷ / 0.355 / - | 1.20×10⁻¹³ / 0.045 / -
2.0 | 5.52×10⁻⁵ / 0.352 / - | 1.08×10⁻⁹ / 0.034 / - | 7.00×10⁻¹⁹ / 0.000 / -
3.0 | 7.68×10⁻⁶ / 0.127 / - | 2.72×10⁻¹¹ / 0.005 / - | 4.75×10⁻²² / 7.82×10⁻⁶ / 0.426
4.0 | 1.86×10⁻⁶ / 0.043 / - | 1.68×10⁻¹² / 0.001 / 0.583 | 1.92×10⁻²⁴ / 2.37×10⁻⁷ / 0.134
5.0 | 7.47×10⁻⁷ / 0.028 / 0.994 | 2.85×10⁻¹³ / 0.000 / 0.348 | 3.54×10⁻²⁶ / 1.61×10⁻⁸ / 0.051
6.0 | 2.417×10⁻⁷ / 0.009 / 0.963 | 3.19×10⁻¹⁴ / 3.93×10⁻⁵ / 0.191 | 7.71×10⁻²⁸ / 1.03×10⁻⁹ / 0.016
7.0 | 1.22×10⁻⁷ / 0.009 / 0.963 | 8.51×10⁻¹⁵ / 2.05×10⁻⁵ / 0.175 | 5.75×10⁻²⁹ / 1.48×10⁻¹⁰ / 0.008
8.0 | 1.22×10⁻⁷ / 0.004 / 0.498 | 4.07×10⁻¹⁵ / 4.53×10⁻⁶ / 0.093 | 6.27×10⁻³⁰ / 1.56×10⁻¹¹ / 0.003
9.0 | 5.47×10⁻⁸ / 0.002 / 0.410 | 7.58×10⁻¹⁶ / 1.87×10⁻⁶ / 0.078 | 5.06×10⁻³¹ / 2.82×10⁻¹² / 0.001
Table 6. Accuracy of k-AnonToRdt.
(a) Nursery dataset (with 5 class labels); (b) Adult dataset
k | #Trees | Nursery d = 3 | d = 4 | d = 5 | Adult d = 7 | d = 8 | d = 9
0 * | 8 | 0.748 | 0.832 | 0.854 | 0.781 | 0.806 | 0.814
5 | 8 | 0.788 | 0.818 | 0.783 | 0.781 | 0.807 | 0.811
10 | 8 | 0.749 | 0.782 | 0.598 | 0.794 | 0.801 | 0.814
0 | 9 | 0.807 | 0.819 | 0.867 | 0.789 | 0.813 | 0.817
5 | 9 | 0.798 | 0.843 | 0.816 | 0.789 | 0.813 | 0.817
10 | 9 | 0.808 | 0.775 | 0.626 | 0.804 | 0.812 | 0.813
0 | 10 | 0.815 | 0.841 | 0.860 | 0.795 | 0.791 | 0.816
5 | 10 | 0.808 | 0.841 | 0.818 | 0.795 | 0.791 | 0.816
10 | 10 | 0.816 | 0.801 | 0.673 | 0.786 | 0.803 | 0.813
0 | 11 | 0.832 | 0.847 | 0.868 | 0.786 | 0.807 | 0.818
5 | 11 | 0.836 | 0.847 | 0.823 | 0.786 | 0.807 | 0.817
10 | 11 | 0.830 | 0.818 | 0.663 | 0.793 | 0.797 | 0.813
(c) Mushroom dataset; (d) Loan dataset
k | #Trees | Mushroom d = 3 | d = 4 | d = 5 | Loan d = 4 | d = 5 | d = 6
0 * | 8 | 0.935 | 0.964 | 0.979 | 0.838 | 0.843 | 0.838
5 | 8 | 0.935 | 0.964 | 0.979 | 0.843 | 0.843 | 0.840
10 | 8 | 0.903 | 0.964 | 0.974 | 0.838 | 0.845 | 0.838
0 | 9 | 0.940 | 0.967 | 0.980 | 0.839 | 0.839 | 0.842
5 | 9 | 0.940 | 0.967 | 0.980 | 0.839 | 0.839 | 0.842
10 | 9 | 0.854 | 0.963 | 0.978 | 0.839 | 0.839 | 0.84
0 | 10 | 0.957 | 0.972 | 0.978 | 0.841 | 0.846 | 0.835
5 | 10 | 0.957 | 0.972 | 0.978 | 0.836 | 0.841 | 0.839
10 | 10 | 0.957 | 0.949 | 0.975 | 0.841 | 0.846 | 0.835
0 | 11 | 0.944 | 0.973 | 0.978 | 0.844 | 0.841 | 0.840
5 | 11 | 0.944 | 0.973 | 0.978 | 0.839 | 0.841 | 0.840
10 | 11 | 0.921 | 0.968 | 0.977 | 0.844 | 0.841 | 0.840
* k = 0 means the original decision tree without pruning. #Trees denotes the number of trees.
Table 7. Accuracy and approximate N_t δ of k-AnonToRdt when N_t ε = 2.0.
k | Nursery dataset (3 class labels): β = 0.01 (N_t δ / ACC) | β = 0.1 (N_t δ / ACC) | β = 0.4 (N_t δ / ACC) | Mushroom dataset (2 class labels): β = 0.01 (N_t δ / ACC) | β = 0.1 (N_t δ / ACC) | β = 0.4 (N_t δ / ACC)
0 | - / 0.971 | - / 0.971 | - / 0.971 | - / 0.945 | - / 0.945 | - / 0.945
5 | 5.52×10⁻⁵ / 0.942 | 0.352 / 0.958 | - / 0.972 | 5.52×10⁻⁵ / 0.900 | 0.352 / 0.942 | - / 0.914
10 | 1.08×10⁻⁹ / 0.774 | 0.034 / 0.969 | - / 0.971 | 1.08×10⁻⁹ / 0.833 | 0.034 / 0.930 | - / 0.933
20 | 7.00×10⁻¹⁹ / 0.383 | 0 / 0.965 | - / 0.963 | 7.00×10⁻¹⁹ / 0.631 | 0 / 0.913 | - / 0.932
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Waseda, A.; Nojima, R.; Wang, L. A Differentially Private (Random) Decision Tree without Noise from k-Anonymity. Appl. Sci. 2024, 14, 7625. https://doi.org/10.3390/app14177625

