Article

Differential Privacy High-Dimensional Data Publishing Based on Feature Selection and Clustering

1 School of Software Engineering, Beijing University of Technology, Beijing 100124, China
2 Key Laboratory of Security for Network and Data in Industrial Internet of Liaoning Province, Jinzhou 121000, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(9), 1959; https://doi.org/10.3390/electronics12091959
Submission received: 13 March 2023 / Revised: 17 April 2023 / Accepted: 19 April 2023 / Published: 23 April 2023

Abstract

As a social information product, high-dimensional data make privacy and usability the core issues in the field of privacy protection. Feature selection is a commonly used dimensionality-reduction technique for high-dimensional data. Some feature selection methods process only part of the features selected by the algorithm and do not take into account the information associated with the selected features, so the usability of the final experimental results is not high. This paper proposes a hybrid method based on feature selection and cluster analysis to address the data utility and privacy problems of high-dimensional data in the actual publishing process. The proposed method is divided into three stages: (1) feature screening; (2) feature clustering; and (3) adaptive noise. This paper uses the Wisconsin Diagnostic Breast Cancer (WDBC) database from the UCI Machine Learning Repository and evaluates the performance of the proposed method by classification accuracy. The experiments show that the algorithm in this paper protects sensitive information in the original data while retaining the contribution of the data to the diagnostic results.

1. Introduction

As one of the products of the information age, high-dimensional data are of great benefit and value, but they also raise many difficulties in practical applications, chief among them how to handle the high dimensionality of the data. At present, dimensionality reduction is carried out in two directions: feature extraction and feature selection [1,2]. Feature selection selects from the original data a subset of features that can represent the data, whereas feature extraction converts the original data from a high-dimensional space to a low-dimensional space and merges the original features into new feature types [3,4,5,6]. Compared with feature extraction, feature selection retains the physical meaning of the original data, which is often more convenient for a subsequent data analysis. Feature selection has been applied to many practical problems, such as bioinformatics and image processing. For noisy datasets, irrelevant or redundant features often slow down the learning algorithm and can even reduce its accuracy. By eliminating irrelevant and redundant features, an FS algorithm can reduce the number of features, shorten the learning time, and improve classification performance [7,8]. Existing FS algorithms can be divided into three categories: (1) filter; (2) wrapper; and (3) embedded methods [9]. The difference between a filter and a wrapper is whether a classifier is used as the evaluation criterion for a feature subset. Compared with the wrapper method, the filter method is independent of the learning algorithm and has the advantage of low computational complexity [10,11], which makes it more suitable for high-dimensional FS problems. However, because it lacks a subsequent learning algorithm, its performance tends to be worse than that of wrapper and embedded methods. The wrapper uses a learning algorithm as a black box to score feature subsets. Because it can use a powerful search strategy to find the best feature subset, this method is often more effective than filters.
Random forest is one of the ensemble learning methods widely used in disease detection, and scholars have applied it to the diagnosis of various diseases. In order to identify Alzheimer’s disease from early asymptomatic behavior, J. Gómez-Ramírez et al. [12] needed to confirm the presence of mild cognitive impairment (MCI) from multiple aspects; machine learning (random forest) and permutation-based methods were therefore used to determine the most important self-reported characteristics for future conversion to MCI. The Clinical Decision Support Systems (CDSS) framework proposed by V. R. Elgin Christo [13] uses a cooperative coevolutionary approach combining coevolution with random forest classifiers: the reduced dataset is used to train a random forest classifier, and the trained model helps doctors as a second opinion for diagnosis and treatment. Desbordes Paul [14] proposed a feature selection strategy, the random forest-based genetic algorithm (GARF), which is able to extract from positron emission tomography (PET) images and clinical data effective information that is helpful for the personalized treatment of a patient’s cancer. Sutong Wang et al. [15] developed a rule extraction method (IRFRE) based on an improved random forest (RF) to derive accurate and interpretable breast cancer diagnosis classification rules from decision tree collections, overcoming the problem that some black-box algorithms cannot explain their diagnoses.
A cluster analysis is an important process in the field of data mining [16,17], which is the process of classifying objects into groups based on their similarity [18]. Various clustering methods have been proposed, mainly including k-means [19], hierarchical techniques [20,21], density-based techniques [22,23], and grid-based algorithms [24].
This paper adopts a combination of the filter method and the embedded method to deal with the dimensionality problem of the data. The filter method scores each feature according to divergence or correlation and filters by setting a threshold or the number of features to select. The embedded method first trains a machine learning model, obtains the weight coefficient of each feature, and selects features in descending order of these coefficients.
Based on the above description of feature selection and the related cluster analysis, the existing high-dimensional data publishing algorithms still have the following problems:
1. Too many initial features will interfere with feature selection. First of all, it will reduce time efficiency, and secondly, it will have a certain impact on the accuracy of the selected features.
2. It is difficult to determine the appropriate threshold to judge whether the feature is useless, and the fixed threshold division will lead to the wrong deletion of effective but weak correlation features.
3. In the allocation of a privacy budget for characteristics, there is a problem that the allocation is not reasonable enough, which finally leads to an imbalance between the utility and privacy of the released data.
In view of the current problems, this paper proposes a new hybrid FS algorithm. After quickly removing irrelevant and weakly correlated features, the remaining features are screened a second time; a fast feature clustering method is then applied, and finally adaptive noise is added according to the generated feature clusters. The contributions of this article are as follows:
1. A hybrid FS-FC algorithm is proposed, which can solve some of the shortcomings of the high-dimensional data processing mentioned in this paper. Compared with the existing data publishing methods, the feature removal of the algorithm can effectively improve the efficiency and performance of the subsequent feature selection, so that the random forest can quickly and effectively screen the required core features.
2. The second stage uses a fast correlation feature clustering strategy. Traditional methods require comparing similarities between all features one by one; this article only compares the difference in the information between the relevant feature and the known cluster center. It not only reduces the computational cost of the clustering features but also eliminates the need to specify the number of clusters in advance.
3. The privacy budget allocation in the third stage adopts an adaptive strategy, which measures the symmetric uncertainty of cluster center characteristics and attribute clusters to carry out the corresponding privacy budget allocation.
The rest of this article is organized as follows. Section 2 covers some of the basics, and Section 3 describes some of the related work. Section 4 presents the algorithm for this paper. Section 5 verifies the performance of the proposed algorithm.

2. Materials and Methods

2.1. Differential Privacy

2.1.1. Differential Privacy Definition

The differential privacy model proposed by Dwork et al. in 2006 ensures that the exposed output will not change significantly depending on whether an individual is in the dataset by adding random noise [25,26,27,28,29,30,31] and quantifies the degree of privacy leakage. Differential privacy does not require the privacy of the dataset as a whole but provides protection for the privacy of each individual in the dataset. Its concept requires that each single element in the dataset has a limited impact on the output. Therefore, after observing the query results, the attacker cannot infer which individual is affected in the dataset to make the query return such results, so it is impossible to infer information about the privacy of individuals from the query results. In other words, an attacker has no way of knowing whether an individual exists in such a dataset.
Definition 1.
(ε-differential privacy). For any two datasets D1 and D2 that differ in at most one data record, a random function M satisfies ε-differential privacy if, for all S ⊆ Range(M), where Range(M) denotes the set of possible outputs of M,
\Pr[M(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[M(D_2) \in S]
where Pr(·) represents the probability (disclosure risk) of the event, and ε is the privacy budget parameter that controls the level of differential privacy protection: the smaller its value, the harder it is to distinguish the two datasets and the higher the privacy protection level.

2.1.2. Differential Privacy Mechanisms

Differential privacy is implemented by adding noise that satisfies differential privacy to the query results of different query functions. Different types of differential privacy noise mechanisms are used for different types of query results, and the two most commonly used noise mechanisms are the Laplace mechanism, which is mainly for numerical results, and the exponential mechanism, which is mainly for non-numerical results.
1. Laplace mechanism
The Laplace mechanism [32] is mainly used to add noise that obeys the Laplace distribution to the numerical query results to meet the definition requirements of differential privacy. The Laplace distribution is also a double exponential distribution, and the mathematical expression of the probability density function is as follows:
f(x \mid \mu, b) = \frac{1}{2b}\, e^{-\frac{|x - \mu|}{b}}
where μ is the location parameter and b > 0 is the scale parameter; x then follows a Laplace distribution with parameters μ and b, denoted x ~ Lap(μ, b).
When dataset D is queried with the query function f, the Laplace mechanism protects the query result by adding a noise disturbance that satisfies the Laplace distribution to f(D) and then outputs the perturbed result f*(D). In the concrete implementation, the sensitivity of the given query function f is calculated and the corresponding differential privacy budget is set, from which the amount of noise to add is computed.
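To make the calibration concrete, the following minimal Python sketch (our own illustration, not code from the paper) adds Laplace noise with scale Δf/ε to a numeric query result; the counting-query example and the function name are assumptions for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Noise scale b = sensitivity / epsilon gives epsilon-differential privacy
    # for a query whose global sensitivity is `sensitivity`.
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1; with epsilon = 1 the published
# count is perturbed by Lap(0, 1) noise.
noisy_count = laplace_mechanism(true_value=357, sensitivity=1.0, epsilon=1.0)
print(noisy_count)
```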
2. Exponential mechanism
Because the Laplace mechanism only applies to numerical query results, it cannot be used to protect non-numeric query results in practice, so the exponential mechanism [33] is chosen for the privacy protection of non-numeric data.
Let A be a random algorithm acting on dataset D whose output is a result r ∈ Range(A). Define an availability (utility) function w(D, r) → ℝ and let Δw be the sensitivity of w(D, r). If algorithm A selects its output from Range(A) with probability proportional to $e^{\varepsilon w(D,r) / (2\Delta w)}$, then the random algorithm A satisfies ε-differential privacy protection.
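As an illustration only (the utility function, candidate set, and function name are assumptions, not code from the paper), a sketch of the exponential mechanism that samples an output with probability proportional to exp(ε·w(D, r)/(2Δw)) might look as follows:

```python
import numpy as np

def exponential_mechanism(dataset, candidates, utility, sensitivity, epsilon, rng=None):
    # Score every candidate output r with the availability function w(D, r).
    rng = rng or np.random.default_rng()
    scores = np.array([utility(dataset, r) for r in candidates], dtype=float)
    # Subtract the maximum score before exponentiating for numerical stability;
    # this does not change the sampling probabilities.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: privately pick the most frequent category; utility = its count (sensitivity 1).
data = ["B", "B", "M", "B", "M"]
pick = exponential_mechanism(data, ["B", "M"],
                             utility=lambda d, r: d.count(r),
                             sensitivity=1.0, epsilon=1.0)
print(pick)
```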

2.1.3. Differential Privacy Properties

In the process of privacy protection, most data publishing and query problems need to consume the differential privacy budget multiple times. In this case, to ensure that all differential privacy processing stays within the given privacy budget, the total budget must be allocated among the individual steps, which can take advantage of the two composition properties of differential privacy protection [34].
Property 1.
(Sequential composition). Given a dataset D and n independent random algorithms A1, A2, …, An, where each algorithm Ai satisfies εi-differential privacy, the combined algorithm A(A1(D), A2(D), …, An(D)) composed of these random algorithms satisfies (Σi εi)-differential privacy protection.
Property 2.
(Parallel composition). For a dataset D that can be divided into n disjoint subsets D1, D2, …, Dn, if n mutually independent random algorithms A1, A2, …, An provide εi-differential privacy protection on the corresponding subsets D1, D2, …, Dn, then the joint sequence A1(D1), A2(D2), …, An(Dn) satisfies (max_i εi)-differential privacy protection.
The above properties provide a guarantee for the following algorithm to meet the differential privacy protection proof and privacy budget optimization allocation strategy.

2.2. SU

In many FS algorithms [35,36], symmetric uncertainty (SU) is employed to describe the correlation between features and class labels or between features, because it can well-describe the nonlinear relationship between random variables. This paper uses SU to assess the correlation between feature and class labels.
First, we give the SU measure and three related definitions, namely (1) C correlation; (2) S correlation; and (3) F correlation, which provide the theoretical basis for the algorithm in this paper. SU is defined as follows:
SU(X, Y) = \frac{2\,[H(X) - H(X \mid Y)]}{H(X) + H(Y)}
where H(X) and H(Y) are the entropies of variables X and Y, respectively, and H(X | Y) is the conditional entropy of X when Y is known. Suppose p(x) and p(y) are the prior probabilities of x and y, and p(x | y) is the posterior probability of x given y:
H(X) = -\sum_{x \in X} p(x)\log_2 p(x)
H(Y) = -\sum_{y \in Y} p(y)\log_2 p(y)
H(X \mid Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y)\log_2 p(x \mid y)
The three key definitions in this article are described below.
1. C correlation: the SU value between the ith feature fi and the classification label C.
2. S correlation: if the C-correlation value of fi is greater than the threshold, there is a strong correlation between fi and C.
3. F correlation: the SU value between two features; the greater the F correlation, the closer the relationship between the two features.
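The SU, entropy, and conditional-entropy definitions above can be computed directly for discrete variables. The following sketch is our own illustration and assumes the inputs are NumPy arrays of discretized (e.g., binned) feature values and class labels:

```python
import numpy as np

def entropy(x):
    # Shannon entropy H(X) in bits for a discrete variable.
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    # H(X|Y) = sum_y p(y) * H(X | Y = y).
    y_vals, y_counts = np.unique(y, return_counts=True)
    p_y = y_counts / y_counts.sum()
    return sum(p * entropy(x[y == v]) for v, p in zip(y_vals, p_y))

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)].
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)
```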

2.3. Random Forest

Random forest is an extension of the decision tree approach developed by Breiman and Cutler, based on the idea of bagging [37]. Each decision tree in the random forest is a classifier; for an input sample, each tree produces a classification result, the random forest aggregates all the classification votes, and the category with the most votes is taken as the final output.
Suppose there is a training set X with M samples, with each sample consisting of N input features and a classification label Y. The specific RF construction process is as follows (Figure 1):
Step 1: Draw a bootstrap sample of M instances (with replacement) from the training set X.
Step 2: Randomly select n features (n < N) and select the feature with the smallest Gini value to split the nodes of the decision tree.
Step 3: Repeat steps 1 and 2 K times to obtain K decision tree models.
Step 4: Combine the decision trees into a random forest and determine the classification results by voting.
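For reference, steps 1 to 4 are exactly what a standard random forest implementation performs internally. Below is a minimal scikit-learn sketch; the data split and parameter values are illustrative assumptions rather than the paper's configuration, and scikit-learn's load_breast_cancer is the WDBC data used later in the experiments.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# WDBC data: 569 samples, 30 features, benign/malignant labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Bootstrap sampling (step 1), per-node random feature subsets with Gini
# splitting (step 2), and K trees (step 3) are handled internally;
# majority voting (step 4) happens inside predict()/score().
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```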

3. Related Work

The random forest method has been used by many scholars. For example, Yunming’s research proposes a stratified sampling method for selecting feature subspaces for random forests on high-dimensional data containing both strongly and weakly informative features [38]. Maria uses information gain to determine thresholds for random forests to improve their speed and prediction accuracy and uses the FFT and IFFT to transform datasets so that they are easier for random forests to classify [39]. Xianlei Fu uses random forests to assess the performance of tunnel boring machines (TBMs) in soft soil with limited geological information and to predict key TBM performance metrics [40]. Marie uses random forests to combine variable clustering and feature selection: hierarchical clustering of variables establishes groups of related variables, each group is summarized by a synthetic variable, and the most relevant synthetic variables are then selected using random forests [41].
At present, most of the correlation analysis of attributes is based on information theory, including information entropy, mutual information, the Pearson correlation coefficient, and the maximum information coefficient.
Xianxian Li uses rough set theory to measure the degree of correlation between attributes [42] and information entropy to quantify the sensitivity of attributes; information entropy differential privacy is achieved through correlation-based clustering and by adding personalized noise to each cluster while maintaining the correlation between the data. Wu Ningbo uses mutual information to construct an adjacency matrix between attributes and then uses the attribute matrix to filter attributes, which can effectively measure the amount of privacy leakage caused by associated attributes [43]. Peng Changgen uses the Pearson correlation coefficient to consider the impact of noise on sensitive information with linear relationships in high-dimensional datasets and reduces the amount of added noise while keeping the linear relationships of the data highly available, but this approach cannot maintain the nonlinear relationships of the data in the published dataset. The maximum information coefficient (MIC) can capture nonlinear relationships and non-functional dependencies between data to maximize the utility of the original data [44].
Song et al. propose an FS algorithm based on two-stage clustering [11]. The first phase uses a minimum spanning tree to cluster all the features, while the second phase selects the most representative features from each cluster to form a subset of features. Liu et al. propose a minimum spanning tree-based clustering method, in which measures with information variation are used to assess the correlation and redundancy between features [45]. First, a minimum spanning tree is created based on the correlation between features, and then those long edges are removed to form clusters. After that, the feature with the greatest correlation from each cluster is selected. Because the correlation between all the features needs to be calculated in advance to form a complete graph, the above two algorithms still have high computational costs when working with high-dimensional data. Jie et al. propose density peak clustering, where the information distance between features is used as a clustering metric. The algorithm can cluster features efficiently but requires the number of clusters to be set in advance [46]. Chatterjee et al. develop a filtering method based on K-means. In this method, features are clustered using the K-means technique, and then the features in each cluster are sorted according to their importance [47].

4. Algorithm

4.1. Algorithmic Framework

Figure 2 shows the overall framework of the proposed algorithm. The framework consists of three phases: (1) removal of irrelevant features; (2) clustering of related features; and (3) adaptive noise. The first stage removes features that are irrelevant or only weakly relevant to the current classification task and then filters out the core attributes. The second stage takes the core attributes as centers and divides the remaining related features into multiple attribute clusters according to their correlation. In the third stage, noise is added to these feature clusters using the noise formula, and the noisy data are finally published.
The three phases of the framework correspond to the following three processes:
1. Identify and retain the S-correlated features;
2. Divide the F-correlated features into multiple feature clusters;
3. Add noise to the data through the noise formula.

4.2. First Phase

For the dataset S with n features, SU(fi, C) (the C correlation) is calculated first. Then, the features whose C-correlation values are above a predetermined threshold ρ are selected and saved to the strongly correlated feature set F. The threshold ρ is determined by the dataset itself, which effectively reduces the impact of manual settings. The detailed process is given in Algorithm 1.
\rho = \min\left(0.1 \times SU_{\max},\; SU_{\lfloor D/\log D \rfloor\text{-th}}\right)
Algorithm 1: Method for removing irrelevant features
Input: the feature set, F = {f1, f2, …, fn}; the class set, C
Output: the set of strong relevance features, F′
1: Calculate the C-relevance value of each feature, SU(fi, C), i = 1, 2, …, n;
2: Determine ρ = min(0.1 × SUmax, SU⌊D/log D⌋-th);
3: for i = 1 : D do
4:   % from the first feature to the last one in F
5:   if SU(fi, C) ≥ ρ then
6:     save the ith feature to the set F′;
7:   end if
8: end for
9: Output the set of strong relevance features, F′
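A possible Python reading of Algorithm 1 is sketched below. It is our own illustration: the logarithm base in the threshold and the handling of the ⌊D/log D⌋-th ranked SU value are assumptions, and su_fn stands for the symmetric-uncertainty helper sketched in Section 2.2.

```python
import numpy as np

def remove_irrelevant_features(X, C, su_fn):
    # X: (n_samples, D) array of discretized features; C: class labels.
    D = X.shape[1]
    su_values = np.array([su_fn(X[:, i], C) for i in range(D)])   # C-relevance (line 1)
    ranked = np.sort(su_values)[::-1]                              # descending SU
    k = max(1, int(np.floor(D / np.log(D))))                       # assumed natural log
    rho = min(0.1 * ranked[0], ranked[min(k, D) - 1])              # threshold (line 2)
    keep = np.where(su_values >= rho)[0]                           # lines 3-8
    return keep, su_values
```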
The random forest algorithm is then applied to the obtained strongly correlated feature set F for a second round of screening. The size of the strongly correlated feature set is greatly reduced compared with the original feature set, lowering the order of magnitude of the number of features. When the random forest performs this second feature selection, there is more room to build more decision trees, which yields more accurate results and effectively improves the overall efficiency of the algorithm.
Before the second feature screening, the parameters of the random forest model also need to be tuned; a suitable set of model parameters helps obtain better algorithm results. The main parameters are as follows (an illustrative tuning sketch follows this list):
1. n_estimators: in theory, more trees are better, but the computation time increases accordingly, so the best prediction effect appears at a reasonable number of trees rather than simply the largest one.
2. max_depth: generally set heuristically according to the size of the data; small datasets usually use depths between 1 and 20, while large datasets can try depths of 30 to 50.
3. max_features: the number of features randomly selected for each decision tree; for classification problems, sqrt(n_features) is generally used, where n_features is the number of input features.
4. min_samples_leaf: a suitable value is generally obtained by gradually increasing from the minimum value.
5. min_samples_split: the default initial value is 2, which can be adjusted according to the actual dataset.
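The heuristics above translate naturally into a grid search. The ranges below are illustrative assumptions; the paper does not report its exact grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 200, 400],     # more trees, diminishing returns
    "max_depth": [None, 10, 20],         # small dataset: shallow-to-moderate depths
    "max_features": ["sqrt"],            # sqrt(n_features) for classification
    "min_samples_leaf": [1, 2, 4],       # grow from the minimum value
    "min_samples_split": [2, 4, 8],      # start from the default of 2
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```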

4.3. Second Phase

A good clustering method should be able to group similar features into the same cluster at a low computational cost. However, most clustering methods have the disadvantage of a high computational cost because the correlations between all pairs of features must be calculated. To overcome this problem, this paper uses the two types of attributes selected in the first stage as inputs for the second-stage clustering; the filtered attributes have already proved their importance. The task of this stage is therefore to aggregate the remaining strongly correlated attributes around the features screened by the random forest, which serve as the cluster centers.
First, the filtered core attributes are removed from the set of strongly correlated features, and then the core attributes are sorted according to the C correlation. The F correlation between the sorted core attributes and the remaining strongly correlated attributes is calculated, and the part of the attributes that are most closely related to the current core attributes are found. After completing the clustering of the first core attribute, the same operation is performed on the core attributes in turn until all the core attributes are clustered or all the strongly correlated attributes are assigned to different attribute clusters, so that the second stage of clustering is completed.
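One simplified reading of this clustering step is sketched below (our own illustration, not the paper's code): the core attributes screened by the random forest act as fixed cluster centers, and each remaining strongly correlated feature is attached to the center with which its F correlation is largest, so only |core| × |remaining| SU values are computed instead of all pairwise correlations. The iteration over sorted core attributes described above is simplified here, and su_fn is assumed to be the SU helper from Section 2.2.

```python
import numpy as np

def cluster_relevant_features(X, core_idx, remaining_idx, su_fn):
    # core_idx: indices of core attributes (cluster centers), sorted by C correlation;
    # remaining_idx: indices of the other strongly correlated attributes.
    clusters = {c: [] for c in core_idx}
    for j in remaining_idx:
        # F correlation between feature j and every cluster center.
        f_corr = [su_fn(X[:, j], X[:, c]) for c in core_idx]
        best_center = core_idx[int(np.argmax(f_corr))]
        clusters[best_center].append(j)
    return clusters
```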

4.4. Third Phase

After the first two stages of data processing, noise needs to be added to the obtained attribute cluster.
The noise addition discussed in this paper is divided into two parts. First, separate noise is added to the cluster center of each attribute cluster to protect the core sensitive attributes, with the privacy budget allocated in proportion to the SU values of the core attributes. Second, for the remaining attributes of each attribute cluster, noise is added in proportion to the SU value of each attribute cluster.
After the first two stages, although the number of core attributes obtained is small, their overall value is very high, and more attention needs to be paid to their security. Each attribute cluster, by contrast, contains a large number of attributes; the average value of a single attribute is not high, but the cluster as a whole carries high data value. Because the characteristics of the core attributes and the attribute clusters differ, the noise must be divided between them when it is added, and adding it separately in this way balances data utility and privacy well.
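A rough sketch of this adaptive allocation is given below. It is our own reading: the paper does not spell out the exact split of the budget between centers and members, so here every retained feature simply receives a share of the total budget proportional to its SU value, and the per-feature sensitivity of 1 assumes the features have been scaled to [0, 1].

```python
import numpy as np

def adaptive_noise(X, clusters, su_values, epsilon_total, rng=None):
    # clusters: {center_index: [member indices]}; su_values: SU per feature index.
    rng = rng or np.random.default_rng()
    X_noisy = X.astype(float).copy()
    all_features = [f for c, members in clusters.items() for f in [c] + members]
    total_su = sum(su_values[f] for f in all_features)
    for f in all_features:
        eps_f = epsilon_total * su_values[f] / total_su      # SU-proportional budget
        X_noisy[:, f] += rng.laplace(0.0, 1.0 / eps_f, size=X.shape[0])
    return X_noisy
```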

5. Experiments

This paper uses the Wisconsin Diagnostic Breast Cancer (WDBC) database from the UCI (University of California, Irvine) Machine Learning Repository. The WDBC database contains 569 instances, each consisting of 30 attributes and a class tag. The data in the WDBC dataset are computed from digitized images obtained by fine needle aspiration (FNA) of human breast masses, and the attribute values describe the morphological features of the cell nuclei in the sample images. Table 1 shows the database details.
There are 569 instances in the WDBC dataset, of which 357 (62.7%) samples are confirmed to be benign and 212 (37.3%) are confirmed to be malignant. There are 32 features in total: one is the ID and another is the diagnosis (B stands for benign, M for malignant). The remaining 30 features are the mean, standard error, and worst (largest) value of 10 real-valued features that describe each cell nucleus, namely radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
This paper uses random forest not only for feature selection but also to test the experimental results, so the overall performance of the random forest is very important to the experimental process. Before the experiments, this paper trains the required random forest model on the WDBC dataset. To verify the usability of the model, the dataset is divided into a training set and a test set; for a small-sample dataset, a single split cannot select the best model and parameters, so k-fold cross-validation is used. The training set is divided into k subsets; in each round, k−1 subsets are used as training data and the remaining one for validation, and this is repeated k times. In most cases, k takes a value of 5 or 10. The computational cost of cross-validation is relatively high, but it makes full use of the data to ensure the utility of the model. Then, by tuning the parameters of the random forest, its classification accuracy reached 96.49%, which was the best result obtained for the random forest without any further method.
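For illustration, a 10-fold cross-validation of the random forest on the WDBC data might be run as follows; the parameter values are assumptions, not the tuned configuration that reached 96.49%.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 10 folds is held out once while the other 9 folds train the model.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))
```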
In addition to the random forest model, this paper also uses SVM as a classification trainer. SVM is a common and very reliable machine learning classification model and generally performs well on binary classification problems. After tuning the parameters with k-fold cross-validation, the accuracy of the SVM model on the original dataset reaches 97%, about 1% higher than the random forest. However, SVM does not itself include a feature selection function, so it can only be used as an evaluation classifier for the experimental results.
In the first stage, the symmetric uncertainty (SU) measure is used to obtain the correlation of the dataset features with the classification results; a larger SU value indicates that the feature is more important for the classification result. Figure 3 shows the importance of the features. The features shown in Figure 3 are the sequence numbers of the strongly correlated features obtained after the first stage of filtering; a total of 19 features are retained, significantly fewer than the 30 in the original dataset, which greatly relieves the computational pressure on the subsequent algorithm. The sub-dataset composed of these 19 features is fed into the random forest algorithm to obtain the attribute importance of the strongly correlated feature set.
To demonstrate the reliability of the proposed algorithm, the PCA method from traditional feature extraction is used as a comparison algorithm. PCA identifies high-value features through the magnitude of the eigenvalues: the eigenvectors are sorted according to their eigenvalues, the required eigenvectors are selected to form a projection matrix, and Laplace noise is added to the projection matrix. To verify the reliability of the adaptive noise method in this paper, both adaptive noise based on the eigenvalue magnitude (PCA) and average noise (MPCA) are used in the PCA comparison.
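For context, a sketch of the PCA-based baseline is shown below. It is our own reading of the description above: the uniform per-component budget split (as in the MPCA variant) and the [0, 1] scaling used to bound the sensitivity are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def pca_laplace_baseline(X, n_components, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    X01 = MinMaxScaler().fit_transform(X)               # scale features to [0, 1]
    Z = PCA(n_components=n_components).fit_transform(X01)
    eps_per_dim = epsilon / n_components                # uniform split (MPCA-style)
    noise = rng.laplace(0.0, 1.0 / eps_per_dim, size=Z.shape)
    return Z + noise
```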
Figure 4 shows the classification accuracy of the three methods when the privacy budget is 1. It can be clearly seen that the method proposed in this paper has an obvious advantage, while the other two methods not only perform poorly but also fluctuate strongly. In Figure 5, when the privacy budget is 5, the overall accuracy increases as the added noise decreases, but the proposed method still has an obvious advantage over the other two methods, whose classification results continue to fluctuate. In Figure 6, when the privacy budget is 10, the results of the three methods tend to flatten; only in a few cases are the results of the three methods similar, and in the other cases the advantage of the proposed method still shows.
To test the robustness of the algorithm, the classification results of each algorithm under different privacy budgets in Figure 4, Figure 5 and Figure 6 are compared. Some results of the proposed algorithm do not meet expectations when the budget is 1, but from the general trend in Figure 7, the results with a budget of 1 are still within an acceptable range, and in some cases they are close to the results with budgets of 5 and 10. Looking at the comparison results of the other two algorithms in Figure 8 and Figure 9, the classification curves of MPCA under the three budgets are relatively scattered, especially when the privacy budget is 1, where the experimental results can hardly be used. Although the three classification curves of PCA are not greatly dispersed, when too much noise is added, the classification accuracy decreases rapidly, and the classification results still cannot meet expectations.
SVM, as an additional experimental evaluator, is compared on model precision, recall, and F1_score, where the F1_score balances precision and recall. In the medical field, more attention is paid to recall, that is, predicting actually sick people as accurately as possible: we do not want to miss anyone who is really sick, so that more lives may be saved, whereas a lower precision (people who are not sick being predicted as sick) does not lead to particularly serious consequences beyond some excessive medical treatment.
For binary classification, the model divides samples into two categories: positive and negative. If both the prediction and the actual class are positive, the sample is a true positive (TP). If the prediction is negative but the actual class is positive, it is a false negative (FN). If the prediction is positive but the actual class is negative, it is a false positive (FP). If both the prediction and the actual class are negative, it is a true negative (TN). Based on these four cases, a confusion matrix can be obtained (as shown in Table 2).
recall = \frac{TP}{TP + FN}
precision = \frac{TP}{TP + FP}
F1 = \frac{2 \times precision \times recall}{precision + recall}
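These metrics follow directly from the confusion matrix; a small illustrative check (the labels below are made up, with 1 = malignant and 0 = benign) could be:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP=%d FP=%d FN=%d TN=%d" % (tp, fp, fn, tn))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1_score: ", f1_score(y_true, y_pred))
```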
In order to evaluate the performance of the proposed method more comprehensively, the evaluation indicators obtained under different privacy budgets using the random forest and SVM classifiers are shown below.
From the data in Table 3, we can see that under the same data conditions, the classification results of the RF are much better than those of the SVM. The conclusions of the SVM do not help the actual data needs: although its performance on the original dataset is very good, the processed data lose their class-separation ability for it, and even after the amount of added noise is reduced, its evaluation indices do not improve much. In addition, its indicators fluctuate too much, so the stability of the model is not good enough. In contrast, even when the amount of noise added by the RF pipeline is large, the overall model indices reach the basic expectation, and the performance is stable and improves sufficiently once the added noise is reduced.

6. Conclusions

In conclusion, this paper proposes a breast cancer protection model, which uses SU values combined with random forests for feature selection and finally uses the trained random forests for classification. The SU value combined with the random forest method can not only find the key attributes more accurately but also reduce the calculation cost and running time of the model. Finally, our proposed method achieves the expected accuracy on the WDBC dataset. Compared with the traditional protection methods, the accuracy of the proposed FS-FC model on the WDBC dataset is greatly improved. This is crucial in real-world scenarios, which means that more cancers can be detected in a timely manner while maintaining data security.
Our proposed method can also be used to protect information on other types of cancer, ensuring the privacy of the patient’s condition while providing early diagnosis guidance by doctors. For patients with a history of breast cancer, the model presented in this article can also be used to quickly obtain diagnostic conclusions while entering protected information.
As future work, the algorithm in this paper is affected to a certain extent by the random forest algorithm itself, so further improving the random forest algorithm is necessary to further improve the performance of the proposed algorithm.

Author Contributions

Conceptualization, Z.C.; Methodology, Z.C.; Software, J.H.; Validation, J.H. and X.Z. (Xiaolei Zhang); Formal analysis, X.Z. (Xiaolei Zhang); Investigation, X.Z. (Xing Zhang); Resources, X.Z. (Xing Zhang); Data curation, X.Z. (Xing Zhang); Writing—original draft preparation, Z.C.; Writing—review and editing, J.H.; Visualization, X.Z. (Xiaolei Zhang); Supervision, N.Z.; Project administration, N.Z.; Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Applied Basic Research Project of Liaoning Province under Grant 2022JH2/101300280 and the Scientific Research Fund Project of the Education Department of Liaoning Province under Grant LJKZ0625.

Data Availability Statement

The dataset used in this paper can be found on the website provided: https://archive.ics.uci.edu/ml/datasets.php, accessed on 16 April 2023.

Acknowledgments

All authors of this article have been reviewed and confirmed.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the Science and Information Conference (SAI), London, UK, 27–29 August 2014; pp. 372–378. [Google Scholar]
  2. Hira, Z.M.; Gillies, D.F. A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data. Adv. Bioinform. 2015, 2015, 1–13. [Google Scholar] [CrossRef]
  3. Corizzo, R.; Ceci, M.; Japkowicz, N. Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data. Big Data Res. 2019, 16, 18–35. [Google Scholar] [CrossRef]
  4. Corizzo, R.; Ceci, M.; Zdravevski, E.; Japkowicz, N. Scalable auto-encoders for gravitational waves detection from time series data. Expert. Syst. Appl. 2020, 151, 113378. [Google Scholar] [CrossRef]
  5. Zheng, K.; Li, T.; Zhang, B.; Zhang, Y.; Luo, J.; Zhou, X. Incipient Fault Feature Extraction of Rolling Bearings Using Autocorrelation Function Impulse Harmonic to Noise Ratio Index Based SVD and Teager Energy Operator. Appl. Sci. 2017, 7, 1117. [Google Scholar] [CrossRef]
  6. Gu, Y.; Yang, X.; Peng, M.; Lin, G. Robust weighted SVD-type latent factor models for rating prediction. Expert. Syst. Appl. 2020, 141, 112885. [Google Scholar] [CrossRef]
  7. Mistry, K.; Zhang, L.; Neoh, S.C.; Lim, C.P.; Fielding, B. A Micro-GA Embedded PSO Feature Selection Approach to Intelligent Facial Emotion Recognition. IEEE Trans. Cybern. 2016, 47, 1496–1509. [Google Scholar] [CrossRef]
  8. Xu, J.; Tang, B.; He, H.; Man, H. Semisupervised Feature Selection Based on Relevance and Redundancy Criteria. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 1974–1984. [Google Scholar] [CrossRef] [PubMed]
  9. Liu, H.; Motoda, H.; Yu, L. A selective sampling approach to active feature selection. Artif. Intell. 2004, 159, 49–74. [Google Scholar] [CrossRef]
  10. Kundu, P.P.; Mitra, S. Feature Selection Through Message Passing. IEEE Trans. Cybern. 2016, 47, 4356–4366. [Google Scholar] [CrossRef]
  11. Lazar, C.; Taminau, J.; Meganck, S.; Steenhoff, D.; Coletta, A.; Molter, C.; de Schaetzen, V.; Duque, R.; Bersini, H.; Nowe, A. A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1106–1119. [Google Scholar] [CrossRef]
  12. Gómez-Ramírez, J.; Ávila-Villanueva, M.; Fernández-Blázquez, M.Á. Selecting the most important self-assessed features for predicting conversion to mild cognitive impairment with random forest and permutationbased methods. Sci. Rep. 2020, 10, 20630. [Google Scholar] [CrossRef]
  13. Christo, V.R.E.; Nehemiah, H.K.; Brighty, J.; Kannan, A. Feature Selection and Instance Selection from Clinical Datasets Using Co-operative Co-evolution and Classification Using Random Forest. IETE J. Res. 2020, 68, 2508–2521. [Google Scholar] [CrossRef]
  14. Paul, D.; Su, R.; Romain, M.; Sébastien, V.; Pierre, V.; Isabelle, G. Feature selection for outcome prediction in oesophageal cancer using genetic algorithm and random forest classifier. Comput. Med. Imaging Graph. 2017, 60, 42–49. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, S.; Wang, Y.; Wang, D.; Yin, Y.; Wang, Y.; Jin, Y. An improved random forest-based rule extraction method for breast cancer diagnosis. Appl. Soft Comput. 2019, 86, 105941. [Google Scholar] [CrossRef]
  16. Amaricai, A. Design Trade-offs in Configurable FPGA Architectures for K-Means Clustering. Stud. Inform. Control. 2017, 26, 43–48. [Google Scholar] [CrossRef]
  17. Xiangxiao, L.; Honglin, O.; Lijuan, X. Kernel-Distance-Based Intuitionistic Fuzzy c-Means Clustering Algorithm and Its Application. Pattern Recognit. Image Anal. 2019, 29, 592–597. [Google Scholar] [CrossRef]
  18. Mining, W.I.D. Data mining: Concepts and techniques. Morgan Kaufinann 2006, 10, 559–569. [Google Scholar] [CrossRef]
  19. Jasmine, M.; Kesavaraj, G. Implementation of K-means clustering algorithm in the crime data set. Program. Device Circuits Syst. 2020, 12, 13–18. [Google Scholar]
  20. Billard, L.; Kim, J. Hierarchical clustering for histogram data. Wiley Interdiscip. Rev. Comput. Stat. 2017, 9, e1405. [Google Scholar] [CrossRef]
  21. Lee, S.; Jung, J.; Park, I.; Park, K.; Kim, D.-S. A deep learning and similarity-based hierarchical clustering approach for pathological stage prediction of papillary renal cell carcinoma. Comput. Struct. Biotechnol. J. 2020, 18, 2639–2646. [Google Scholar] [CrossRef]
  22. Malzer, C.; Baum, M. A hybrid approach to hierarchical density-based cluster selection. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
  23. Thrun, M.C.; Ultsch, A. Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data. J. Classif. 2020, 38, 280–312. [Google Scholar] [CrossRef]
  24. Chiang, Y.-H.; Hsu, C.-M.; Tsai, A. Fast multi-resolution spatial clustering for 3D point cloud data. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar] [CrossRef]
  25. Dwork, C. Differential privacy. In Automata, Languages and Programming, Proceedings of the 33rd International Colloquium, ICALP 2006, Part. II 33, Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar] [CrossRef]
  26. Dwork, C. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, Proceedings of the 5th International Conference, TAMC 2008, Proceedings 5, Xi’an, China, 25–29 April 2008; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar] [CrossRef]
  27. Dwork, C. The Differential Privacy Frontier (Extended Abstract). In Theory of Cryptography Conference; Springer: Berlin/Heidelberg, Germany, 2009; pp. 496–502. [Google Scholar] [CrossRef]
  28. Dwork, C. Differential privacy in new settings. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, 17–19 January 2010; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2010. [Google Scholar] [CrossRef]
  29. Dwork, C. A firm foundation for private data analysis. Commun. ACM 2011, 54, 86–95. [Google Scholar] [CrossRef]
  30. Dwork, C. The promise of differential privacy: A tutorial on algorithmic techniques. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, Palm Springs, CA, USA, 22–25 October 2011. [Google Scholar] [CrossRef]
  31. Dwork, C.; Jing, L. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009. [Google Scholar] [CrossRef]
  32. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
  33. McSherry, F.; Talwar, K. Mechanism Design via Differential Privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), Providence, RI, USA, 21–23 October 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 94–103. [Google Scholar]
  34. McSherry, F.D. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, RI, USA, 29 June–2 July 2009. [Google Scholar] [CrossRef]
  35. Tran, B.; Xue, B.; Zhang, M. Variable-Length Particle Swarm Optimization for Feature Selection on High-Dimensional Classification. IEEE Trans. Evol. Comput. 2018, 23, 473–487. [Google Scholar] [CrossRef]
  36. Song, X.-F.; Zhang, Y.; Guo, Y.-N.; Sun, X.-Y.; Wang, Y.-L. Variable-Size Cooperative Coevolutionary Particle Swarm Optimization for Feature Selection on High-Dimensional Data. IEEE Trans. Evol. Comput. 2020, 24, 882–895. [Google Scholar] [CrossRef]
  37. Breiman, L. Random Forest. Mach. Learn. 2001, 45, 1. [Google Scholar] [CrossRef]
  38. Ansari, F.; Edla, D.R.; Dodia, S.; Kuppili, V. Brain-Computer Interface for wheelchair control operations: An approach based on Fast Fourier Transform and On-Line Sequential Extreme Learning Machine. Clin. Epidemiol. Glob. Heal. 2018, 7, 274–278. [Google Scholar] [CrossRef]
  39. Prasetiyowati, M.I.; Maulidevi, N.U.; Surendro, K. Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest. J. Big Data 2021, 8, 84. [Google Scholar] [CrossRef]
  40. Fu, X.; Feng, L.; Zhang, L. Data-driven estimation of TBM performance in soft soils using density-based spatial clustering and random forest. Appl. Soft Comput. 2022, 120, 108686. [Google Scholar] [CrossRef]
  41. Chavent, M.; Genuer, R.; Saracco, J. Combining clustering of variables and feature selection using random forests. Commun. Stat. Simul. Comput. 2019, 50, 426–445. [Google Scholar] [CrossRef]
  42. Li, X.; Luo, C.; Liu, P.; Wang, L.-E. Information entropy differential privacy: A differential privacy protection data method based on rough set theory. In Proceedings of the 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Fukuoka, Japan, 5–8 August 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar] [CrossRef]
  43. Wu, N.B.; Peng, C.G.; Mou, Q.L. Information Entropy Metric Methods of Association Attributes for Differential Privacy. Acta Electonica Sin. 2019, 47, 2337. Available online: https://www.ejournal.org.cn/EN/Y2019/V47/I11/2337 (accessed on 1 January 2023).
  44. Peng, C.G.; Zhao, Y.Y.; Fan, M.-M. Principal Component Analysis Differential Privacy Data Publishing Algorithm Based on Maximum Information Coefficient. Netinfo Secur. 2020, 2, 37–48. [Google Scholar]
  45. Liu, Q.; Zhang, J.; Xiao, J.; Zhu, H.; Zhao, Q. A supervised feature selection algorithm through minimum spanning tree clustering. In Proceedings of the 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, Limassol, Cyprus, 10–12 November 2014; IEEE: Piscataway, NJ, USA, 2014. [Google Scholar] [CrossRef]
  46. Cai, J.; Chao, S.; Yang, S.; Wang, S.; Luo, J. Feature selection based on density peak clustering using information distance measure. In Intelligent Computing Theories and Application, Proceedings of the 13th International Conference, ICIC 2017, Part. II 13, Liverpool, UK, 7–10 August 2017; Springer International Publishing: Berlin/Heidelberg, Germany, 2017. [Google Scholar] [CrossRef]
  47. Chatterjee, I.; Ghosh, M.; Singh, P.K.; Sarkar, R.; Nasipuri, M. A clustering-based feature selection framework for handwritten Indic script classification. Expert. Syst. 2019, 36, e12459. [Google Scholar] [CrossRef]
Figure 1. Random forest model.
Figure 2. Framework of the proposed hybrid FS algorithm.
Figure 3. Importance of the feature.
Figure 4. Three algorithms’ classification accuracy (epsilon = 1).
Figure 5. Three algorithms’ classification accuracy (epsilon = 5).
Figure 6. Three algorithms’ classification accuracy (epsilon = 10).
Figure 7. FS-FC different budget classification accuracy.
Figure 8. MPCA different budget classification accuracy.
Figure 9. PCA different budget classification accuracy.
Table 1. Details of the WDBC database.
No | Feature                    | No | Feature
1  | Radius mean                | 16 | Compactness standard error
2  | Texture mean               | 17 | Concavity standard error
3  | Perimeter mean             | 18 | Concave points standard error
4  | Area mean                  | 19 | Symmetry standard error
5  | Smoothness mean            | 20 | Fractal dimension standard error
6  | Compactness mean           | 21 | Radius worst
7  | Concavity mean             | 22 | Texture worst
8  | Concave points mean        | 23 | Perimeter worst
9  | Symmetry mean              | 24 | Area worst
10 | Fractal dimension mean     | 25 | Smoothness worst
11 | Radius standard error      | 26 | Compactness worst
12 | Texture standard error     | 27 | Concavity worst
13 | Perimeter standard error   | 28 | Concave points worst
14 | Area standard error        | 29 | Symmetry worst
15 | Smoothness standard error  | 30 | Fractal dimension worst
Table 2. Confusion matrix.
Actual Class | Predicted Positive | Predicted Negative
Positive     | TP                 | FN
Negative     | FP                 | TN
Table 3. Model metrics.
Conditions | Class | Precision | Recall | F1-Score | Support
RF_1       | B     | 0.98      | 0.87   | 0.92     | 357
RF_1       | M     | 0.82      | 0.97   | 0.89     | 212
SVM_1      | B     | 0.63      | 1.00   | 0.77     | 357
SVM_1      | M     | 0.00      | 0.00   | 0.00     | 212
RF_5       | B     | 0.99      | 0.88   | 0.93     | 357
RF_5       | M     | 0.83      | 0.98   | 0.90     | 212
SVM_5      | B     | 0.64      | 1.00   | 0.78     | 357
SVM_5      | M     | 1.00      | 0.06   | 0.11     | 212
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
