1. Introduction
Nowadays, we live in an interconnected world where large amounts of data are generated from sensors, social networks, internet activity, etc., and stored in various data repositories. These data may contain sensitive information that can be revealed when they are analyzed. To address this problem, many data sanitization mechanisms have been proposed to provide privacy guarantees. At the same time, from an organizational perspective, data also hide patterns that support the decision-making process. In this context, the challenge of a sanitization algorithm is twofold: sharing data that remain useful for analysis while being respectful of privacy.
Various algorithms compete to provide the highest privacy without penalizing data utility for mining tasks. Therefore, data curators need to test several algorithms to find a solution that satisfies the trade-off between privacy and utility. In the literature, there are few benchmarks comparing the performance of privacy algorithms and, to the best of our knowledge, none that include recent privacy algorithms based on Deep Learning and Knowledge Distillation. Accordingly, to fill this gap, in the present study we performed a benchmark of classical mechanisms based on Statistical Disclosure Control, including the Noise Addition, Microaggregation, and Rank swapping filters. In addition, we included Differential Privacy through the Laplacian and Exponential mechanisms. Finally, two privacy mechanisms based on Deep Learning were also compared: a mechanism based on Generative Adversarial Networks and Machine Learning Copies.
To compare the algorithms cited above, two measures widely used in the literature [1,2,3,4,5,6] were used, namely, Disclosure Risk and Information Loss. The former quantifies the danger of finding the same distribution for the output variable after a prediction task when the input dataset is sanitized. The latter measures the amount of useful information lost after applying a sanitization algorithm.
Concerning our results, each sanitization mechanism was tuned to find the hyperparameters that best meet the trade-off between Information Loss and Disclosure Risk. Our findings show that Noise Addition, Rank Swapping, and Machine Learning Copies obtain the best Disclosure Risk values, and that Machine Learning Copies, Noise Addition, and Rank Swapping also have the smallest Information Loss values.
The following list summarizes the major contributions of our paper:
Seven sanitization filters were formally defined and compared on a real dataset.
Hyperparameter fine-tuning was performed for each mechanism.
Two well-known measures were used to select the best mechanism.
The remainder of this paper is organized as follows. Section 2 presents the state-of-the-art, while Section 3 introduces basic concepts and methods. Section 4 and Section 5 describe the results and the discussion of our proposal, respectively. Finally, Section 6 concludes the paper and presents new research avenues.
3. Materials and Methods
In the present section, we introduce the concepts of the Statistical Disclosure Control filters, Differential Privacy, Generative Adversarial Networks, Knowledge Distillation, as well as the Information Loss and Disclosure Risk functions.
3.1. Statistical Disclosure Control
The Statistical Disclosure Control (SDC) aims to protect the users’ sensitive information by applying methods called filters while maintaining the data’s statistical significance. It is important to indicate that only perturbative filters were selected, since re-identification is more complex for perturbed values than for unperturbed ones. Furthermore, the Noise Addition, Microaggregation, and Rank swapping filters were chosen for their wide use in the literature [1,24,26].
First, the Noise Addition filter [27] adds uncorrelated noise drawn from a Gaussian distribution to a given variable. This filter takes a noise parameter a in the range [0,1]. The i-th value of the attribute x is denoted as x_i, while x'_i indicates its sanitized counterpart. Thus, the obfuscated values are calculated as shown below:

x'_i = x_i + a · σ_x · c,

where σ_x is the standard deviation of the attribute to be obfuscated, and c is a Gaussian random variable such that c ∼ N(0, 1).
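For illustration, the following minimal Python sketch implements the Noise Addition filter described above; the function name and the use of NumPy are our own, and the formula follows the reconstruction x'_i = x_i + a·σ_x·c given earlier.

```python
import numpy as np

def noise_addition(x, a, rng=None):
    """Noise Addition sketch: x'_i = x_i + a * sigma_x * c, with c ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    sigma = x.std()                      # standard deviation of the attribute
    c = rng.standard_normal(len(x))      # one Gaussian draw per record
    return x + a * sigma * c             # sanitized values

# Example: sanitize a salinity-like column with noise parameter a = 0.5
print(noise_addition([34.9, 35.1, 35.0, 34.8, 35.2], a=0.5))
```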
Second, the Microaggregation filter [28] groups registers into small sets that must have a minimum number of k elements. Furthermore, this filter complies with the property of k-anonymity, which means that each released register cannot be distinguished from at least k − 1 other registers belonging to the same dataset. The Microaggregation filter is divided into two steps: partition and aggregation. In the former, registers are placed in sets of at least k records based on their similarity; these sets can be obtained from a clustering algorithm. The latter, the aggregation stage, computes the centroid of each group and replaces each group’s elements with their respective centroid value.
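As a simple illustration of the partition and aggregation steps, the sketch below performs univariate microaggregation on a single numeric attribute. Real implementations (including the DBSCAN-based one used later in this paper) operate on multivariate records, so this is only an assumed toy version with illustrative names.

```python
import numpy as np

def microaggregation_1d(x, k):
    """Univariate microaggregation: groups of at least k records,
    each record replaced by its group centroid (mean)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)                 # partition: sort records by value
    n_groups = max(len(x) // k, 1)        # every group keeps at least k records
    groups = np.array_split(order, n_groups)
    sanitized = np.empty_like(x)
    for idx in groups:
        sanitized[idx] = x[idx].mean()    # aggregation: centroid replaces members
    return sanitized

# Example: k = 3 guarantees 3-anonymity on this toy attribute
print(microaggregation_1d([14.2, 14.9, 15.1, 15.3, 16.0, 16.4, 17.8], k=3))
```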
Third, the Rank swapping filter [10] transforms a dataset by exchanging the values of confidential variables. First, the values of the target variable are sorted in ascending order. Then, for each ordered value, another ordered value is selected within a range p, which is the parameter that indicates the maximum exchange range. A particular value is then exchanged only within this window of p positions.
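The sketch below illustrates the Rank swapping idea on a single attribute: values are ordered and each one is swapped with a partner chosen at random within the next p ranks. The helper name and the exact windowing details are our own assumptions, not the reference implementation of [10].

```python
import numpy as np

def rank_swapping(x, p, rng=None):
    """Rank swapping sketch: order the values, then swap each one with a
    not-yet-swapped partner chosen at random within the next p ranks."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)              # record indices in ascending value order
    swapped = x.copy()
    used = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        if used[i]:
            continue
        # candidate partners: the next ranks within the p-window, still unswapped
        window = [j for j in range(i + 1, min(i + p + 1, len(x))) if not used[j]]
        if not window:
            continue
        j = rng.choice(window)
        a, b = order[i], order[j]
        swapped[a], swapped[b] = swapped[b], swapped[a]
        used[i] = used[j] = True
    return swapped

# Example: swap within a window of p = 2 ranks
print(rank_swapping([10.0, 50.0, 20.0, 40.0, 30.0], p=2))
```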
3.2. Differential Privacy
Intuitively, Differential Privacy [29] tries to reduce the privacy risk of having one’s data in a dataset to the same risk as not contributing data at all. Thus, an algorithm is said to be differentially private when the result of a query is hardly affected by the presence or absence of a set of records. Formally, an algorithm A is said to be ε-differentially private if, for any two datasets D1 and D2 that differ in a single record and for all S ⊆ Range(A):

Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S].

The larger the value of the ε parameter, the weaker the algorithm’s privacy guarantee. Therefore, ε usually takes a small value, since it is related to the probability of obtaining the same output from two datasets, one sanitized and one original [30]. Hence, a small value of ε means a low probability of obtaining the original values from the sanitized dataset (i.e., a low Disclosure Risk). Later work added the δ parameter, a non-zero additive term that allows ignoring events with a very low probability of occurrence. Therefore, an algorithm A is (ε, δ)-differentially private if, for any two datasets D1 and D2 that differ in a single record and for all S ⊆ Range(A):

Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S] + δ.
This technique provides privacy to numeric data through the Laplacian Differential Privacy mechanism [31,32]. Thus, given a dataset D, a mechanism (filter) M reports the result of a function f and reaches ε-Differential Privacy if M(D) = f(D) + L, where L is a vector of random variables drawn from a Laplace distribution and f is, in our case, the Microaggregation filter function. Accordingly, to implement Differential Privacy, either the Laplacian or the Exponential mechanism can be used.
On the one hand, the Laplacian mechanism [29] adds random noise to the answers of a query calculated on the available data. The noise is calibrated through a function called sensitivity, Δf, which measures the maximum possible change in the query result due to the addition or removal of a single data record. We also write Lap(b) for a Laplace distribution with scale parameter b and location parameter 0. If the value of b is increased, the Laplace density becomes flatter (platykurtic), allowing higher noise values and, consequently, stronger privacy guarantees. Therefore, a value sanitized by the Laplacian mechanism satisfies ε-Differential Privacy if M(D) = f(D) + Lap(Δf/ε), where f(D) is a query on the dataset D and Lap(Δf/ε) represents noise drawn from a Laplace distribution with scale Δf/ε and location 0.
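A minimal sketch of the Laplacian mechanism follows, assuming a scalar numeric query with known sensitivity; the function name and the counting-query example are illustrative.

```python
import numpy as np

def laplace_mechanism(query_value, sensitivity, epsilon, rng=None):
    """epsilon-DP Laplace mechanism: add Lap(0, sensitivity/epsilon) noise
    to a numeric query answer f(D)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon          # b = Delta f / epsilon
    return query_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1
true_count = 128
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1))
```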
On the other hand, the Exponential mechanism [33] provides privacy guarantees for queries with non-numerical responses, for which it is not possible to add random noise from any distribution. The intuition is to randomly select an answer to a query from among all possible answers. Each answer has an assigned probability, which is higher for those answers closer to the correct one. Let R be the range of all possible responses to a query function f, and let u(D, r) be a utility function that measures how good a response r ∈ R is for the query f on the dataset D, where higher values of u indicate more trustworthy answers. The sensitivity Δu is then defined as the maximum possible change in the utility function u given the addition or removal of a single data record. Given a dataset D, a mechanism satisfies ε-Differential Privacy if it chooses an answer r with probability proportional to exp(ε · u(D, r) / (2Δu)). In the present effort, we used the Microaggregation filter combined with the Laplacian and Exponential mechanisms, respectively, to implement ε-differentially private methods.
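The following sketch illustrates the Exponential mechanism for a categorical query, sampling an answer with probability proportional to exp(ε·u(D, r)/(2Δu)); the candidate answers and utilities are toy values of our own.

```python
import numpy as np

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng=None):
    """epsilon-DP Exponential mechanism: pick a candidate answer r with
    probability proportional to exp(epsilon * u(D, r) / (2 * sensitivity))."""
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    scores = epsilon * utilities / (2.0 * sensitivity)
    scores -= scores.max()                      # numerical stability before exp
    probs = np.exp(scores)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: choose a categorical answer, higher utility = closer to the truth
answers = ["north", "center", "south"]
print(exponential_mechanism(answers, utilities=[2.0, 9.0, 5.0],
                            sensitivity=1.0, epsilon=0.5))
```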
3.3. Generative Adversarial Networks
Generative Adversarial Networks (GAN) [34] comprise a generative model G and a discriminative model D. The former captures the distribution of the input dataset. The latter estimates the probability that a sample comes from the real dataset rather than being synthetic data generated by G. The training procedure for G is to maximize the probability that D is unable to discriminate whether a sample comes from the real dataset. Both models can be defined as Multilayer Perceptrons (MLP), so that the entire system can be trained with the backpropagation algorithm. The following equation defines the cost function:

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))].

The discriminator D seeks to maximize the probability that each piece of data entered into the model is classified correctly, returning one if the data come from the real distribution and zero if they come from the generator G. The generator G minimizes the function log(1 − D(G(z))). Thus, the idea is to train the generator until the discriminator D is unable to differentiate whether an example comes from the real or the synthetic dataset distribution. Hence, the goal is to generate a synthetic dataset X′ that mimics the original dataset X. In this context, the generator’s error in building a replica of the original dataset provides the privacy guarantee, and the input of the mining task becomes the synthetic dataset X′.
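For concreteness, the compact PyTorch sketch below trains a small GAN on toy tabular records and then samples a synthetic dataset X′ from the generator. The layer sizes, optimizer settings, and stand-in data are illustrative and do not correspond to the architectures evaluated in Section 4.

```python
import torch
import torch.nn as nn

# Toy tabular data standing in for the four oceanographic features
real_data = torch.randn(1024, 4)
latent_dim, batch = 16, 64

# Generator G: latent noise z -> synthetic record; Discriminator D: record -> P(real)
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                  nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 4))
D = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                  nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(500):
    real = real_data[torch.randint(0, real_data.size(0), (batch,))]
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D into labeling generated records as real
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The sanitized dataset X' is a sample drawn from the trained generator
with torch.no_grad():
    synthetic = G(torch.randn(len(real_data), latent_dim))
print(synthetic.shape)
```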
3.4. Knowledge Distillation
Knowledge Distillation [18] allows building Machine Learning Copies that replicate the behavior of learned decisions (e.g., Decision Tree rules) in the absence of sensitive attributes. The idea behind Knowledge Distillation is the compression of an already trained model: the technique transfers the knowledge of a model trained on a specific population to a smaller model without observing the training dataset’s sensitive variables. The methodology first trains a binary classification model. Subsequently, a synthetic dataset is generated using different sampling strategies for the numerical and categorical attributes, maintaining the relationship between the independent variables and the dependent variable; in this way, new values are obtained for the variables in a balanced data group. Finally, the lower-dimensional synthetic dataset is used to train a new model for the same classification task with the same architecture and training protocol as the original model. The idea behind this algorithm is to create synthetic data that form a new privacy-aware dataset. Hence, we build a new dataset from a sampling process using uniform or normal distributions, and the samples are validated by a classifier trained with the original dataset X. This technique allows building a representation of the dataset in another space, which becomes our sanitized dataset X′.
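A minimal sketch of the Machine Learning Copy workflow is shown below, assuming scikit-learn and a stand-in dataset: a model trained on the private data labels records sampled from a normal distribution, and the copy is trained on that synthetic, privacy-aware dataset only. All names and sizes are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-ins for the original private dataset X and its binary labels y
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

# 1. Train the original model on the private data
original_model = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# 2. Sample synthetic records (here a normal-distribution strategy); no private rows are reused
synthetic_X = rng.normal(loc=X.mean(axis=0), scale=X.std(axis=0), size=(2000, X.shape[1]))

# 3. Label the synthetic records with the original model (the distilled knowledge)
synthetic_y = original_model.predict(synthetic_X)

# 4. Train the copy on the synthetic, privacy-aware dataset X' only
copy_model = DecisionTreeClassifier(max_depth=6, random_state=0).fit(synthetic_X, synthetic_y)
print("agreement with the original labels on the synthetic data:",
      (copy_model.predict(synthetic_X) == synthetic_y).mean())
```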
3.5. Evaluation Metrics for Privacy Filters
To assess the quality of the sanitization algorithms in terms of information utility and privacy risk, we use two standard metrics from the literature, namely Information Loss and Disclosure Risk [1,2,3,4,5,6]. In the following paragraphs, we define how both functions are implemented.
Information Loss (IL) is a metric that quantifies the impact of a sanitization method on the dataset utility, i.e., the amount of useful information lost after applying a sanitization algorithm, and there are several ways to compute it. In the present paper, we rely on the Cosine similarity between each original record of salinity, chlorophyll, temperature, and degrees under the sea in X and its sanitized counterpart in X′, as defined in Equation (6):

cos(x_i, x′_i) = (x_i · x′_i) / (‖x_i‖ ‖x′_i‖).

Thus, to compute the IL, we sum the cosine-based distances between the original X and the sanitized X′ vectors of points using Equation (7).
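The sketch below computes an IL value in the spirit of Equations (6) and (7); the per-record distance is assumed here to be 1 minus the cosine similarity, which is one possible reading of the aggregation described above, and the toy records are our own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Equation (6): cosine similarity between an original record and its sanitized version."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def information_loss(X, X_prime):
    """Equation (7) sketch: sum of per-record cosine-based distances,
    assumed here to be 1 - cosine similarity."""
    return sum(1.0 - cosine_similarity(x, xp) for x, xp in zip(X, X_prime))

# Toy 4-feature records (salinity, chlorophyll, TSM, TC) and a noisy sanitized copy
X = np.array([[35.0, 0.8, 18.1, 15.0], [34.7, 1.1, 17.6, 14.2]])
X_prime = X + np.random.default_rng(1).normal(scale=0.3, size=X.shape)
print(information_loss(X, X_prime))
```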
Disclosure Risk (DR) quantifies the danger of finding the same distribution for the output variable after a prediction task when the input dataset is sanitized. For the sake of example, let X be the original dataset, containing salinity, chlorophyll, temperature, and degrees under the sea, and X′ its sanitized version. Both datasets are used as input to a Logistic Regression that predicts the volume of fish stocks. Thus, the model outputs the prediction Y using the original dataset and Y′ for the sanitized input. We then use the Jensen-Shannon distance to measure the closeness between the two vectors Y and Y′:

JS(Y, Y′) = sqrt( (D(Y ‖ m) + D(Y′ ‖ m)) / 2 ),

where m is the average point of the Y and Y′ vectors, and D is the Kullback-Leibler divergence. In the experiments, Y and Y′ are the predicted vectors of a given model on the real and sanitized data, respectively.
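The following sketch computes a DR value with SciPy’s Jensen-Shannon distance; discretizing the two prediction vectors into histograms before comparing them is our own assumption about how the distributions are built.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def disclosure_risk(Y, Y_prime, bins=10):
    """DR sketch: Jensen-Shannon distance between the distributions of the
    predictions on the original data (Y) and on the sanitized data (Y')."""
    edges = np.histogram_bin_edges(np.concatenate([Y, Y_prime]), bins=bins)
    p, _ = np.histogram(Y, bins=edges)        # discretized distribution of Y
    q, _ = np.histogram(Y_prime, bins=edges)  # discretized distribution of Y'
    return jensenshannon(p, q)                # uses KL divergence D and the midpoint m internally

rng = np.random.default_rng(0)
Y = rng.normal(size=500)                        # predictions of the model on original data
Y_prime = Y + rng.normal(scale=0.5, size=500)   # predictions on sanitized data
print(disclosure_risk(Y, Y_prime))
```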
Based on the aforementioned concepts, we performed some experiments whose results are reported in the next section.
4. Results
Inspired by a benchmark previously described in [35], we compare four groups of sanitization techniques: Statistical Disclosure Control filters, Differential Privacy filters, Generative Adversarial Networks, and the Knowledge Distillation technique (the implementation of the privacy algorithms is available at: https://github.com/bitmapup/privacyAlgorithms accessed on 4 April 2021). These methods are applied to the dataset described below.
4.1. Dataset Description
We live in an interconnected world where much data is generated from sensors, social networks, internet activity, etc. Therefore, many companies hold important datasets with both economic and scientific value. Thus, it is necessary to analyze and understand sanitization techniques for curating commercial datasets so that they can be shared publicly with the scientific community owing to their informative value. In this sense, we take the case of the fishing industry in Peru, which is one of the most important economic activities [36] for the Gross Domestic Product (GDP). In this activity, the cartographic charts represent a high economic investment to understand where fish stocks are located in the sea and to maximize each ship’s daily catch. Simultaneously, this information is helpful to predict the El Niño phenomenon and to study the fish ecosystem.
The oceanographic charts provide geo-referenced data on water characteristics along the Peruvian coast, as depicted in Figure 1. The overall dataset contains 9529 time-stamped records and 29 features, which are detailed in Table 1.
From the variables presented above, variables 19 to 22 in Table 1 were discarded due to their high correlation with degrees under the sea (TC), as depicted in Figure 2. Then, variables 1, 2, and 9 to 13 were not taken into account because they belong to an in-house model. Another variable highly correlated with Chlorophyll is Chlorophyll per Day (Clorof.Day), as shown in Figure 2. Finally, Dist.Coast, Bathymetry, North-South, and Season have a poor predictive power for the mining task.
Therefore, four main features are used to find the location of fish stocks: salinity, chlorophyll, temperature (TSM), and degrees under the sea (TC), which are described in Table 2 (dataset available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IFZRTK accessed on 4 April 2021). Thus, in the present work, we limit the study to these four features.
4.2. Data Sanitization through Statistical Disclosure Control Filters
This subsection shows the sanitization process using the Statistical Disclosure Control (SDC) filters. The SDC filters are applied using different settings to find the most suitable configuration (c.f.,
Table 3) for a good trade-off between Information Loss and Disclosure Risk metrics. Thus, we use different parameter settings to minimize privacy risks and maximize data utility.
The Noise Addition filter needs the noise parameter, the standard deviation σ of the variable, and c, a scaling factor used to add noise to each row in the dataset. In these experiments, c takes the values 0.1, 0.25, 0.5, 0.75, and 1. Figure 3a illustrates that the Information Loss increases as c grows. Analogously, Figure 3b indicates that the Disclosure Risk follows the opposite behavior, decreasing while c increases. This monotonic decrease makes it more difficult to obtain the original data from the sanitized dataset.
In conclusion, high values of c provide strong privacy guarantees at the cost of data utility. Besides, this filter requires little computational time to process the data.
The Microaggregation filter uses the density-based DBSCAN [37] clustering algorithm. After the clustering step, each point in the dataset belonging to a cluster is replaced by the cluster’s mean value. Accordingly, DBSCAN uses the number of kilometers km that each cluster encompasses and the minimum number of geospatial points m belonging to a cluster. In this effort, the km value was empirically set to 1, 2, 3, and 4, while m was set to 50, 100, 150, 200, 250, and 300. All combinations of both parameters were tested to obtain the best results, as depicted in Table 4. It is worth noting that the number of formed clusters directly depends on both hyperparameters.
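A possible implementation of this geo-spatial microaggregation step is sketched below with scikit-learn’s DBSCAN and the haversine metric, where eps is expressed as km divided by the Earth’s radius. This is an assumed reconstruction, not the exact code used in our experiments, and the handling of noise points is illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

def geo_microaggregation(lat_lon, values, km, m):
    """Cluster points within `km` kilometers with at least `m` members,
    then replace each member's value by its cluster mean."""
    coords_rad = np.radians(lat_lon)                       # haversine expects radians
    labels = DBSCAN(eps=km / EARTH_RADIUS_KM, min_samples=m,
                    metric="haversine").fit_predict(coords_rad)
    sanitized = np.array(values, dtype=float)
    for label in set(labels):
        if label == -1:                                    # noise points left unchanged here
            continue
        mask = labels == label
        sanitized[mask] = sanitized[mask].mean()           # centroid value per cluster
    return sanitized

# Example with a handful of points off the Peruvian coast (illustrative only)
points = np.array([[-12.05, -77.15], [-12.06, -77.16], [-12.04, -77.14],
                   [-13.50, -76.80], [-13.51, -76.81], [-13.49, -76.79]])
chlorophyll = [0.9, 1.1, 1.0, 2.2, 2.0, 2.1]
print(geo_microaggregation(points, chlorophyll, km=2, m=3))
```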
Concerning the results, when km and m increase, the Information Loss and the Disclosure Risk behave in opposite ways: the Information Loss increases (see Figure 4a) and the Disclosure Risk decreases (see Figure 4b). In detail, Figure 4a shows that the higher the values of km and m, the higher the loss of data utility, since fewer clusters are formed; conversely, the more clusters there are, the lower the Information Loss. Furthermore, in the case of the Disclosure Risk (Figure 4b), increasing km decreases the Disclosure Risk since there are fewer clusters. Likewise, if km remains fixed and m increases, the Disclosure Risk decreases.
In general terms, as km increases, the IL increases and the DR decreases; and as m increases, there is a greater guarantee of privacy while the loss of utility also decreases. The main disadvantage of this filter is the high computational time it requires.
The Rank swapping filter takes as input the maximum exchange range p. The experiments were performed for p values varying from 10 to 80 (c.f., Table 3).
Concerning the results, Figure 5 shows that the IL remains stable for p values from 25 to 80. In contrast, the highest and lowest DR values are obtained for the smallest and the largest p values, respectively. This means that as p increases, the Disclosure Risk decreases, making it more challenging to obtain the original data from the sanitized version.
To summarize, Figure 5a,b display that the Disclosure Risk reaches its lowest point for the largest p value, meaning that the data are better protected at the expense of their usefulness. On the other hand, if the goal is to lose as little information utility as possible, the p value that minimizes the IL (see Figure 5a) is the best option. However, this filter is the one that requires the most computational time to sanitize the data, and it offers a better Disclosure Risk than the other SDC filters.
4.3. Data Sanitization through Differential Privacy Filters
In this section, two techniques based on Differential Privacy mechanisms were applied to the data at our disposal. Experiments were performed in three parts: first, the Microaggregation filter was applied for different values of km and m; second, once the clusters were obtained, the data were replaced by the cluster mean value; finally, the Exponential and Laplacian Differential Privacy mechanisms were applied, each one with the hyperparameters described in Table 5.
In addition to the km and m variables, the Laplacian mechanism uses the hyperparameter ε, which was set to 0.01, 0.1, 1, 10, and 100. The results for this filter can be summarized as follows. On the one hand, Figure 6a shows that the hyperparameters km and m seem not to impact the Information Loss value. However, this metric decreases drastically for certain values of ε; we also see that the Information Loss progressively grows, reaching a maximum peak at a specific ε value. This trend holds for all combinations of km and m.
On the other hand, Figure 6b indicates that the Disclosure Risk decreases when m increases, whereas as km increases, the Disclosure Risk also increases. Concerning the ε hyperparameter, there is a trend similar to that of the Information Loss metric, i.e., the Disclosure Risk reaches its minimum point at a specific ε value, for constant values of km and m.
To summarize, concerning the Information Loss (see Figure 7a), a quadratic trend is observed, and the highest and lowest Information Loss peaks were obtained for km = 4 and m = 50, respectively. In the case of the Disclosure Risk (see Figure 7b), a quadratic trend is also observed, with a minimum point at a specific ε value. The maximum value of the Disclosure Risk is obtained when km = 1 and m = 100, and the minimum value when km = 1 and m = 300, each at its corresponding ε value. In conclusion, for constant values of km and m, only the value of ε determines whether we obtain a high privacy guarantee (when ε takes its smallest values) or a low one (when ε is equal to 10). Please note that all values for IL and DR are summarized in Table 6.
Like the Laplacian mechanism, the Exponential mechanism takes three hyperparameters: km, m, and ε. Regarding the Information Loss (c.f., Figure 7a), km and m seem to have no significant impact on this metric. Conversely, the Information Loss reaches its maximum peak and its minimum value at specific ε values. Regarding the Disclosure Risk, Figure 7b shows a behavior for km and m similar to the one described in the previous section. We can also notice that the Disclosure Risk has its highest peak at a particular value of ε.
Regarding the Information Loss (c.f., Figure 7a), a quadratic trend is also observed: from a given ε value the IL starts to grow, reaching a maximum point at a larger ε. Similarly, the highest and lowest IL peaks are obtained for km = 4 and m = 50, respectively. It is important to notice that the IL reaches its maximum or minimum values depending only on ε.
The Disclosure Risk (c.f., Figure 7b) also reveals a quadratic trend, with the minimum DR reached at a specific ε value. Furthermore, the km and m hyperparameters show the same trend as in the previous mechanism. The maximum value over all combinations is given when km = 1 and m = 100, and the minimum DR when km = 1 and m = 300, each at its corresponding ε value. Please note that all IL and DR values for the Exponential mechanism are summarized in Table 7.
4.4. Data Sanitization through Generative Adversarial Networks
In this section, a Generative Adversarial Network (GAN) is applied to the data at our disposal, and the algorithm returns a dataset artificially generated through an Artificial Neural Network (ANN) mechanism. The obtained results were evaluated by measuring the Disclosure Risk and the Information Loss.
During the training phase, the synthetic data generator
G and the discriminator
D models need parametrization. Different hyperparameter values generate completely different models with different results. Thus, we took the settings recommended in [38,39], which are summarized in
Table 8. Concerning the number of hidden layers in the architecture of the ANN, [38,39] recommend three hidden layers for each neural network (discriminator and generator). The authors also propose using the ReLU activation function and Adam’s optimizer with a fixed learning rate. Concerning the epochs, we were inspired by [38], which obtains good results using 300 and 500 epochs. Both settings were empirically tested, obtaining better results with 500 epochs.
In the same spirit, the authors in [40] recommend training a GAN using the batch technique, where the dataset is divided into n blocks of data that are trained separately, which reduces training time. The value of n recommended in the literature is 64. Finally, in [39,41], the authors use 100 neurons and 100 input dimensions.
Concerning the results, in the case of the Information Loss (see Figure 8a), the highest peaks are found using Architecture 3 and Architecture 4, while Architecture 7 has the lowest Information Loss. Architectures 3 and 7 have the same number of hidden layers, with 256, 512, and 2024 neurons; nevertheless, they differ significantly in where these layers are positioned in the GAN, and this difference has a significant impact on the IL.
4.5. Data Sanitization through Knowledge Distillation
To generate a synthetic dataset using Knowledge Distillation, we rely on Machine Learning Copies. To meet this aim, the CART Decision Tree algorithm [45] was trained with the original normalized data, using Entropy and Gini to measure the quality of the splits. For the maximum depth of the tree, we tested values ranging from 2 to 50, and for the minimum number of samples required to split an internal node, we tried the values 0.01, 0.05, 0.1, 0.15, 0.16, 0.18, and 0.2. Table 9 summarizes the best values found for both the Entropy- and Gini-based Decision Trees.
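This hyperparameter search can be reproduced with a grid search such as the sketch below; the synthetic stand-in data and the use of GridSearchCV with 5-fold cross-validation are our own assumptions, not the exact protocol of the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the normalized oceanographic features and fish-stock labels
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)

param_grid = {
    "criterion": ["entropy", "gini"],
    "max_depth": list(range(2, 51)),
    "min_samples_split": [0.01, 0.05, 0.1, 0.15, 0.16, 0.18, 0.2],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # best split criterion, depth, and min_samples_split
```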
Once the model was trained to output the presence or absence of fish stocks given the salinity, chlorophyll, temperature, and degrees under the sea values, a synthetic dataset was generated using random values sampled from normal and uniform distributions with the parameters specified in Table 10. The obtained synthetic datasets were evaluated using the IL and the DR metrics. Figure 9 depicts that the datasets generated from the normal distribution have less Information Loss, while the Disclosure Risk is similar for both sampling strategies.
In this section, we have presented the results of the different techniques to benchmark them in terms of Information Loss and Disclosure Risk. In the next section, we discuss our findings.
5. Discussion
A vast amount of data is generated and collected daily. These datasets contain sensitive information about individuals, which needs to be protected before public sharing or mining tasks to avoid privacy breaches. As a consequence, data curators have to choose a suitable technique that guarantees a certain privacy level while keeping good utility for the mining task after sanitization. There are several privacy-enhancing mechanisms based on SDC, Differential Privacy, GANs, or Knowledge Distillation to protect the data, and there is a need to compare such methods for data sanitization. Therefore, a question arises about the best algorithm to protect privacy. To try to answer this question, we extend the benchmarks of [24,35], which compare classical Statistical Disclosure Control and Differential Privacy approaches, with recent techniques such as Generative Adversarial Networks and Knowledge Distillation, using a commercial database.
Regarding the SDC filters, the highest Information Loss was obtained with the Microaggregation filter and the lowest with the Rank swapping filter. Besides, the highest Disclosure Risk values were obtained with Rank swapping and Noise Addition, while the lowest Disclosure Risk was achieved through the Microaggregation filter.
Regarding Differential Privacy, the Laplacian and Exponential mechanisms differ only slightly for both Disclosure Risk and Information Loss. Specific ε values yield the lowest DR and the highest IL, respectively; depending on the primary purpose of the data sanitization, it is recommended to alternate these ε values while keeping km and m constant. Since both the Exponential and the Laplacian mechanisms present almost the same values, it is recommended to use the Laplacian mechanism, as it takes the least computational time to execute. Concerning the choice of ε, we suggest small values close to zero to avoid privacy breaches, since ε can be seen as related to the probability of receiving the same outcome on two different datasets [30], which in our case are the original dataset X and its private counterpart X′.
The GAN technique shows that Architecture 3 should be used when a high privacy guarantee, i.e., a very low Disclosure Risk, is required. However, to incur the least utility loss, it is recommended to opt for Architecture 7 or Architecture 5, since they have the lowest Information Loss. To decrease the Disclosure Risk further, it is possible to couple the GAN with a Differential Privacy mechanism, as mentioned in [15,16,17].
Concerning the Knowledge Distillation technique, despite the fact that the distillation process could change the class balance depending on the sampling strategy, it shows interesting results in terms of Information Loss and Disclosure Risk. It is worth noting that the sampling process itself could be challenging depending on how it is carried out [46].
To summarize,
Table 11 indicates the best trade-offs between the Information Loss and Disclosure Risk measures for the compared methods. We observe that Machine Learning Copies present the best trade-off between Information Loss and Disclosure Risk, while the GAN provides the second-best privacy guarantee. The strategy of these methods differs from that of the classical SDC filters and Differential Privacy: the former build a dataset that mimics the original counterpart, while the latter add controlled noise to the original data. We also notice that Noise Addition and Rank Swapping have the smallest Information Loss values. Finally, we remark that Microaggregation and Differential Privacy have similar behaviors. Based on the results mentioned above, a data curator should first try a Machine Learning Copy to reduce the privacy risk while keeping a small Information Loss for the mining task. The second option would be Differential Privacy, since it presents the second-best trade-off.
Regarding computational time, the fastest sanitization algorithm on our dataset is Noise Addition, which takes on average 30 min to execute. Rank Swapping, Microaggregation, Differential Privacy, and the GANs take about 2 h to execute, while Machine Learning Copies could take more than two hours depending on the sampling strategy and on prior knowledge of the probability distributions of the variables in the dataset to be sanitized.
It is worth noting that the latitude and longitude variables were not considered in the sanitization process, since SDC methods would change them in an arbitrary way when treated as ordinary variables. This could degrade the dataset significantly when working with geo-referenced data, and an adversary could notice that the dataset has been previously processed. The risks of dealing with geolocation data are detailed in [47]. Also, to the best of our knowledge, there are no studies about the privacy preservation of geolocated records using GANs or Machine Learning Copies. Concerning IL and DR, there is no consensus about the definition of such functions; thus, there is an opportunity to implement different functions to capture the impact of the privacy mechanism. Besides, it is possible to extend this study by testing the sanitization techniques used here on other datasets, such as medical datasets like the one presented in [48]. The limitation is that the authors do not share the analyzed dataset and, in general, publicly available medical datasets are scarce. Another angle of analysis is the mining task performed after the sanitization. Consequently, one could test different data mining tasks, namely classification, clustering, or sequential pattern mining, to evaluate the sanitization method’s impact on the result of the mining task and on the information loss.
Concerning the context of our work, on the one hand, the benchmarks of de-identification techniques found in the literature [21,22] are limited to record anonymity techniques, which constitute the first step of the sanitization process. On the other hand, other benchmarks compare SDC and Differential Privacy techniques [23,24,25,35] but exclude deep learning-based approaches. To the best of our knowledge, this benchmark is the first one to compare classical SDC and Differential Privacy methods with Generative Adversarial Networks and Knowledge Distillation based privacy techniques. Therefore, this benchmark could be a first reference to guide data curators in choosing a suitable algorithm for their sanitization task. Regarding the limitations of our work, even though a limited number of datasets was used for the experiments, the results are quite convincing about the privacy gain obtained by reducing the disclosure risk with a controlled information loss, depending on the hyperparameters. Besides, our results are similar to those presented in [24,25,35], which took different datasets into account for their experiments. In conclusion, we have developed an extensive comparison of different privacy techniques regarding Information Loss and Disclosure Risk to guide the choice of a suitable strategy for data sanitization. There are several privacy techniques to sanitize datasets for public sharing; thus, our contribution aims to fill the absence of a privacy algorithm benchmark by providing a first approach to finding a suitable sanitization technique. Therefore, our study could help to reduce the time needed to select a privacy algorithm for data sanitization.
Building on this paper’s results, we are now able to evaluate GANs and Machine Learning Copies for handling geolocated data and to assess the impact of the privacy techniques when dealing with location data together with other variables.
6. Conclusions
In the present effort, we have evaluated SDC (Statistical Disclosure Control) filters, namely Noise Addition, Microaggregation, and Rank swapping, as well as Laplacian and Exponential Differential Privacy, Generative Adversarial Networks (GAN), and Knowledge Distillation sanitization techniques on data from oceanographic charts. The idea was to use the sanitized dataset for a fish stock prediction task. To calibrate the sanitization algorithms, different settings were tested for each technique, and the techniques were evaluated in terms of Information Loss and Disclosure Risk. In this way, we found the hyperparameter configurations that achieve a trade-off between the Information Loss and the Disclosure Risk for each filter studied in this paper. However, there is room for improvement in testing the different techniques on other datasets and in monitoring the computational time and memory usage for different hyperparameter values. This benchmark could be a good starting point for a data curator to target the most suitable privacy algorithm to sanitize their datasets. Finally, new research avenues will be to perform the benchmark on publicly available datasets, to monitor computational performance indicators such as computational time and memory usage for all the filters with different configurations in order to analyze the hyperparameters’ impact on performance, to add the records’ geolocation, and to couple Differential Privacy with the GANs and the Machine Learning Copies.