1. Introduction
Data mining is a successful tool for extracting knowledge from large amounts of data. It has been applied effectively in many fields, such as weather forecasting [1], biomedicine [2], medical diagnosis [3], marketing [4], security [5], and fraud detection [6]. On the other hand, sensitive data used in data mining applications, or sensitive knowledge gained from these applications, may cause privacy breaches directly or through linkable private data. Privacy-preserving data mining (PPDM) arises from the need to continue performing data mining efficiently while preserving private data or sensitive knowledge. PPDM can be divided into two categories, input privacy and output privacy, also known as data hiding and knowledge hiding, respectively. Data hiding techniques aim to keep individuals' sensitive data private and modify data mining algorithms in such a way that sensitive data cannot be inferred from the results of the data mining algorithm. Achieving this requires special techniques, including anonymization, distortion, randomization, and encryption [7]. Knowledge hiding techniques aim to keep sensitive rules or patterns private and modify the original data in such a way that all sensitive patterns or rules stay unrevealed while the remaining ones can still be discovered [8,9].
Finding frequently co-occurring items using data mining is popular among companies for discovering valuable knowledge such as customer habits. Although this knowledge is valuable on its own, companies may be willing to share data for collaboration. In this way, a better understanding of the discovered knowledge can be gained, which helps to build better strategies. However, the risk of disclosing sensitive relationships may increase. For example, consider a scenario in which a supermarket sells products of two rival companies. To collaborate and increase profits, one company offers lower prices to the supermarket. Through data mining on the shared data, the collaborating company uncovers relationships involving its rival's products. Using this knowledge and targeted campaigns, the collaborating company may monopolize certain products, which can negatively affect the rival company and the supermarket [10]. In such situations, the stakeholders should sanitize the databases before sharing.
Approaches for privacy preservation in frequent itemset mining can be heuristic or exact in their algorithmic nature [11]. These approaches aim to modify the database so that sensitive itemsets or association rules are hidden while non-sensitive ones are affected minimally. Heuristic approaches are faster, while exact ones have fewer side effects on non-sensitive itemsets.
In this study, we propose an exact approach for frequent itemset hiding. Our motivations are (i) to hide all sensitive itemsets, (ii) to minimize side effects on non-sensitive itemsets, (iii) to use fewer constraints to lessen runtime, (iv) to bypass the need for prior mining of all frequent itemsets, and (v) to use relaxation techniques where an exact solution is not feasible. We evaluated performance for different hiding scenarios on different datasets, observing hiding effectiveness, the number of lost itemsets as a side effect, and runtime efficiency.
This paper is organized as follows. Section 2 gives related work on frequent itemset hiding research. Section 3 provides preliminary information for itemset mining and hiding. Section 4 presents our itemset hiding approach with an example. In Section 5, the results of experiments for our approach are given. Lastly, Section 6 concludes the paper.
2. Related Work
Optimal itemset hiding is NP-hard [12]. Therefore, researchers have focused on approaches that rely on certain assumptions, namely heuristic approaches. Based on heuristics, the database is modified for sanitization [10,13,14]. These techniques are efficient, scalable, and fast but may produce too many lost itemsets as a side effect. Some other studies extend heuristics and use border theory [15]. The notion behind these approaches is that the itemsets on the border represent a boundary between the frequent and the infrequent itemsets. Instead of considering all non-sensitive frequent itemsets during the hiding process, these approaches focus on preserving the border's quality [16,17]. Another heuristic approach uses intersection lattice theory and distance concepts to lessen the side effects of the hiding process [18].
Heuristic approaches are fast but may have side effects, and the number of non-sensitive itemsets accidentally hidden may increase. To cope with this, exact approaches treat the problem as a Constraint Satisfaction Problem (CSP). These approaches produce better solutions in terms of the number of lost itemsets but are more complex and may have a longer runtime.
The first itemset hiding approach based on constraint programming was given in [19]. In this approach, constraints for integer programming are defined first. Solving the problem identifies the selection of transactions to be modified. Following this, items are heuristically selected from the transactions and altered. This process continues until the selected transaction no longer supports any sensitive itemset.
In [20], the authors defined distance measures for the sanitized database. Instead of the number of transactions, they considered the number of modified items. Minimization of this distance is accomplished by maximizing the occurrences of items of sensitive itemsets. Using the positive and negative borders and the Apriori property, constraints are defined to maximize itemset occurrences and minimize item modifications. The authors also proposed an approach for the degree reduction of constraints. When the constructed CSP is not solvable, this approach removes one constraint and constructs the CSP again, iterating until the CSP is solvable.
In [21], the authors revised the previous approach and gave a two-phase iterative approach. First, sensitive itemsets are hidden using the revised positive border of itemsets. Second, transactions are modified to support accidentally hidden itemsets. A CSP is used in both phases.
In [22], the authors defined new constraints and relaxation procedures to provide an exact solution. This approach observes all frequent itemsets to ensure they remain frequent after sanitization. On the other hand, the number of constraints and variables is extremely large, resulting in increased runtime.
There have also been itemset hiding techniques based on evolutionary algorithms in recent years. Since the problem is NP-hard, treating it as an optimization problem is feasible. In the algorithm in [23], specified transactions are deleted for sanitization. In [24], the authors proposed particle swarm optimization-based algorithms, which need fewer parameters to be set compared to previous algorithms. In [25], an algorithm was proposed that formulates an objective function estimating the effect on non-sensitive rules with recursive computation.
3. Preliminaries
Frequent itemset mining has been one of the most essential and popular data mining techniques since it was first introduced in [26]. Some of the most popular algorithms are Apriori [27], Eclat [28], and FP-Growth [29]. Originating from market basket data analysis, it can be applied to many fields [6,30,31,32].
The basic concepts can be defined as follows; refer to Table 1 for the notation used. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, called items. Let $D = \{T_1, T_2, \ldots, T_m\}$ be a dataset of transactions where each transaction $T_i$ is a set of items in $I$ such that $T_i \subseteq I$. Each transaction can be defined in binary form, where $t_{ij} = 1$ if the $j$-th item of $I$ appears in transaction $T_i$ and $t_{ij} = 0$ otherwise. Considering all transactions, for ease of calculation, we have a binary form of $D$ as a matrix, which is called bitmap notation; it is given in Equation (1). All items have the same level of importance; therefore, they are symmetric.
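To make the bitmap notation concrete, the following minimal Python sketch builds the binary matrix of Equation (1) for a toy dataset (the dataset and variable names are assumed for illustration, not taken from the paper):

```python
# Toy transactional dataset (assumed for illustration).
D = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]

items = sorted({i for t in D for i in t})  # fixed column order for I
# Bitmap notation: one row per transaction, bit 1 iff the item is in T_i.
bitmap = [[1 if item in t else 0 for item in items] for t in D]

print(items)   # ['a', 'b', 'c', 'd']
print(bitmap)  # [[1, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
```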
Let $X = \{x_1, x_2, \ldots, x_k\}$ be a set of items where $X \subseteq I$, which we call an itemset. If $X \subseteq T_i$, then itemset $X$ is said to be supported by transaction $T_i$; in other words, all items of the itemset appear in the transaction. The number of transactions in $D$ supporting itemset $X$ is defined as the support count of $X$, denoted $\sigma(X)$. Benefiting from the symmetric nature of the items, the support count of itemset $X$ in bitmap notation can be calculated as given in Equation (2).
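As a sketch of this calculation, reusing the toy bitmap above: a transaction supports an itemset exactly when the product of its bits over the itemset's columns is 1, so summing that product over all rows gives the support count (our implementation of the idea behind Equation (2), not its verbatim form):

```python
from math import prod

def support_count(bitmap, items, itemset):
    """Count transactions whose bits are 1 in every column of the itemset:
    sum over rows of the product of the itemset's bits."""
    cols = [items.index(x) for x in itemset]
    return sum(prod(row[j] for j in cols) for row in bitmap)

# With the toy data above: support_count(bitmap, items, {"a", "c"}) == 2
```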
If the support count of itemset $X$ is at least equal to the minimum support count $\sigma_{\min}$, i.e., $\sigma(X) \ge \sigma_{\min}$, then itemset $X$ is called frequent, or large. The frequent itemset mining problem is to find all frequent itemsets in the database for a predefined minimum support threshold. We can define the set of all frequent itemsets $F$ as stated in Equation (3).
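A brute-force sketch of this definition of $F$ follows; it is exponential in the number of items, so it is only suitable for toy data, while real mining would use Apriori, Eclat, or FP-Growth:

```python
from itertools import combinations

def frequent_itemsets(D, min_count):
    """Enumerate F = {X subset of I : support(X) >= min_count} exhaustively."""
    items = sorted({i for t in D for i in t})
    F = []
    for k in range(1, len(items) + 1):
        for X in map(frozenset, combinations(items, k)):
            if sum(1 for t in D if X <= t) >= min_count:
                F.append(X)
    return F

# frequent_itemsets([{"a","b","c"}, {"a","c"}, {"b","d"}], 2)
# -> [{'a'}, {'b'}, {'c'}, {'a','c'}]
```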
Some itemsets in $F$ may contain sensitive information. Denoting these as $S$, referring to sensitive itemsets, we need to modify database $D$ into $D'$ in such a way that the frequent itemsets of the sanitized database $D'$ exclude the sensitive itemsets. As known from the Apriori property, if an itemset is frequent, all of its subsets are also frequent. Rephrasing this conversely for the itemset hiding concept, we can say that when an itemset is sensitive, its supersets are also sensitive. The sensitive supersets $S^{*}$ should also be hidden, as defined in Equation (4). The remaining frequent itemsets are the non-sensitive frequent itemsets, denoted by $\widetilde{F}$, as given in Equation (5).
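A small sketch of this split (the helper name and the returned pair are ours, following the roles of Equations (4) and (5)):

```python
def split_sensitive(F, S):
    """Given all frequent itemsets F and sensitive itemsets S, return
    the sensitive supersets (Eq. (4)) and the non-sensitive rest (Eq. (5))."""
    S = set(S)
    S_sup = {X for X in F if any(s < X for s in S)}  # proper supersets
    F_ns = set(F) - S - S_sup                        # remaining frequent sets
    return S_sup, F_ns
```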
Then, we can define the frequent itemset hiding problem as modifying dataset $D$ into $D'$ in such a way that $F'$, the frequent itemsets of the sanitized dataset $D'$, excludes the sensitive frequent itemsets $S$, whereas the non-sensitive frequent itemsets $\widetilde{F}$ can still be mined from $D'$ with the same minimum support threshold. $F$ and $F'$ have an asymmetric relationship since we only delete items to sanitize the dataset; hence $F' \subseteq F$.
For an ideal sensitive itemset hiding methodology, as many of the following goals as possible should be accomplished on the sanitized database with the same minimum support threshold:
1. Modification of the database is minimized, so that the originality of the database is kept as much as possible.
2. All sensitive itemsets are hidden and do not appear in the sanitized database.
3. Supersets of sensitive itemsets are also hidden and do not appear in the sanitized database. We know from the Apriori property that this goal is also accomplished if the second goal is achieved.
4. All non-sensitive frequent itemsets appear in the sanitized database. An itemset that does not appear in the new database is called a lost itemset.
5. No new itemset appears in the sanitized database; such itemsets are called ghost itemsets. Approaches that delete items from the dataset naturally accomplish this goal, since no new itemset can be mined.
Goal 1 can be rewritten as minimizing the difference between $D$ and $D'$. For approaches that use item deletion, minimization of modification means that the number of deleted items should be minimized. Using the bitmap notation given in Equation (1), let us define the items in the new dataset as $t'_{ij}$. Then, minimizing the number of 1s converted to 0s is equivalent to maximizing the 1s in $D'$, which can be defined as follows.
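The maximization can plausibly be written as follows (our reconstruction in the notation above; only 1-to-0 changes are allowed, so $t'_{ij} \le t_{ij}$):

```latex
\max \sum_{i=1}^{m} \sum_{j=1}^{n} t'_{ij}
\qquad \text{subject to} \qquad
t'_{ij} \le t_{ij}, \quad t'_{ij} \in \{0, 1\}
```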
Goal 2 can be accomplished by keeping the support count of all sensitive itemsets below the minimum support count in the new dataset.
Goal 3 is accomplished if goal 2 is already satisfied.
Goal 4 can be accomplished by keeping the support count of all non-sensitive itemsets at the same or above the minimum support count in the new dataset.
Goal 5 is satisfied if the approach uses item deletion for the sanitization method and does not add any item to the new dataset. Some approaches use reconstruction methods and may also add new transactions to the sanitized dataset. Such approaches may be exposed to this side effect.
The majority of sensitive itemset hiding approaches aim to hide sensitive itemsets while minimizing modified items in the dataset and the number of lost itemsets. They are focused on goals 1, 2, 3, and 4.
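Putting goals 1, 2, and 4 together, the CSP solved by such exact approaches can be sketched as follows (a reconstruction in our notation, not the paper's verbatim formulation):

```latex
\begin{aligned}
\max \quad & \textstyle\sum_{i,j} t'_{ij} \\
\text{s.t.} \quad & \sigma_{D'}(s) \le \sigma_{\min} - 1 && \forall s \in S \\
                  & \sigma_{D'}(x) \ge \sigma_{\min}     && \forall x \in \widetilde{F} \\
                  & t'_{ij} \in \{0, 1\}
\end{aligned}
```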
4. Itemset Hiding Using Sibling Itemset Constraints
Frequent itemset mining and CSP formulation preliminaries were given in the previous section. Considering that including all non-sensitive frequent itemsets would increase the number of constraints, we introduce the sibling itemset concept to lessen them. The sibling itemsets $\mathrm{Sib}(X)$ of a frequent $k$-itemset $X$ are the generating itemsets of its $(k+1)$-candidate itemsets. The idea behind this concept is that hiding a $k$-itemset also hides its $(k+1)$-supersets while keeping the non-sensitive subsets of those $(k+1)$-supersets discoverable. This represents a local border.
Using sibling itemsets instead of all non-sensitive frequent itemsets, the CSP defined in (12) can be restated as follows.
Generating the sibling itemsets of a sensitive itemset and determining their support are conducted during the hiding process. In this way, the time consumption of prior itemset mining is eliminated.
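The sketch below shows one way this on-the-fly generation could work, under our reading of the definition: candidate siblings share $k-1$ items with the sensitive $k$-itemset, so each pair generates a $(k+1)$-candidate, and their support is checked directly against $D$ (the paper's exact Equation (13) may differ):

```python
from itertools import combinations

def sibling_itemsets(sensitive, D, min_count):
    """Candidate siblings of a sensitive k-itemset: swap one of its items
    for another item and keep the result if it is frequent. Support is
    computed directly on D, so no prior mining of F is required."""
    items = {i for t in D for i in t}
    k = len(sensitive)
    siblings = set()
    for shared in combinations(sensitive, k - 1):   # common (k-1)-part
        for extra in items - set(sensitive):
            cand = frozenset(shared) | {extra}
            if sum(1 for t in D if cand <= t) >= min_count:
                siblings.add(cand)
    return siblings
```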
There are two types of constraints in our CSP: sensitive itemset constraints and sibling itemset constraints. The first type ensures that sensitive itemsets fall below the defined minimum support threshold; thus, all of these constraints must be satisfied. The second type is satisfied to lessen information loss. There are situations when not all of them can be satisfied and the CSP is not solvable; then, we need to sacrifice some of them. In our approach, information loss is preferred to a privacy breach. Therefore, some constraints for sibling itemsets can be sacrificed. Instead of removing any of those constraints, we add binary relaxation variables. By adding these, we do not need to reformulate the CSP and run the solver more than once. We add a unique binary relaxation variable $r$ to the inequality of each sibling constraint.
Relaxation of constraints should be minimized to ensure that information loss is minimized. Equation (15) then gives our final CSP formulation.
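Our reading of this final formulation is sketched below: each sibling constraint gets a binary relaxation variable $r_x$, the coefficient $\sigma_{\min}$ makes a relaxed constraint vacuous, and a large weight $M$ in the objective keeps relaxation a last resort (symbols assumed, not the paper's verbatim Equation (15)):

```latex
\begin{aligned}
\max \quad & \textstyle\sum_{i,j} t'_{ij} \;-\; M \sum_{x \in \mathrm{Sib}(S)} r_x \\
\text{s.t.} \quad & \sigma_{D'}(s) \le \sigma_{\min} - 1 && \forall s \in S \\
                  & \sigma_{D'}(x) + \sigma_{\min}\, r_x \ge \sigma_{\min} && \forall x \in \mathrm{Sib}(S) \\
                  & t'_{ij},\, r_x \in \{0, 1\}
\end{aligned}
```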
Illustrative Example
In the following, an illustrative example of our hiding approach is given. Let $D$ be the dataset of 10 transactions shown in Table 2, defined over a set of five items $I$. Using the bitmap notation given in (1), we have a 10 × 5 binary matrix representation of $D$, given in Table 3.
Using the formulation in (2), we can calculate the support count of any itemset. Given minimum support count $\sigma_{\min} = 2$ and Equation (3), we can find 16 frequent itemsets.
Suppose that one itemset is given as sensitive and needs to be hidden. Then, Equations (4) and (5) give the supersets of the sensitive itemset ($S^{*}$) and the non-sensitive itemsets ($\widetilde{F}$), respectively. All itemsets in $S$ and $S^{*}$ must be hidden to achieve privacy, whereas as many itemsets in $\widetilde{F}$ as possible should remain frequent after sanitization.
Using the formulation given in (11), the transactions supporting the sensitive itemset are modified with binary variables, whose values will be determined after the CSP is solved. The intermediate form of the dataset is given in Table 4.
Using (13), we can find the sibling itemsets of the sensitive itemset. Now, we can define the CSP formulation given in (15).
Solving this CSP yields the values of the binary variables. When the results are applied to the intermediate form of the dataset, we obtain the sanitized dataset $D'$ given in Table 5. From this sanitized dataset, we can mine the frequent itemsets for minimum support count $\sigma_{\min} = 2$. The sensitive itemset is no longer frequent for support count 2. Compared to the initial dataset, the number of frequent itemsets decreased from 16 to 10. Two itemsets were accidentally lost, and three itemsets are supersets of the sensitive itemset and are therefore also missing.
5. Experimental Analysis
In this section, we give a performance evaluation of our approach. The reference algorithm for comparison is IPA [20]. We implemented the algorithms in Python. Constraints were solved using MiniZinc [33]. The implementations used the PyMzn [34] library to invoke and run the constraint solver and gather its results. All computational experiments were conducted on a PC running MS Windows 10 with an Intel i5-4200U CPU and 8 GB of RAM.
5.1. Evaluation Metrics of Itemset Hiding
Itemset hiding aims to transform the dataset in such a way that sensitive itemsets are concealed, non-sensitive frequent itemsets are preserved, ghost itemsets are not generated, and dataset distortion is minimal. These goals are discussed in Section 3, and the related metrics are given below.
5.1.1. Hiding Failure
This metric concerns sensitive itemsets remaining frequent after the sanitization process. It is defined as the ratio of sensitive itemsets that appear in the sanitized dataset to those that appeared in the original dataset.
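With $F(\cdot)$ denoting the frequent itemsets of a dataset, a common formulation from the PPDM literature (which we assume matches the paper's) is:

```latex
\mathrm{HF} = \frac{\lvert S \cap F(D') \rvert}{\lvert S \cap F(D) \rvert}
```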
Our approach ensures that all sensitive itemsets are hidden; therefore, HF = 0 for all scenarios. As far as we have surveyed, all proposed approaches focus on HF and ensure that it is 0. Our reference algorithm IPA also ensures that all sensitive itemsets are hidden.
5.1.2. Artifactual Patterns
This metric concerns side effects of the sanitization process that arise because some approaches insert items or transactions into the dataset during or after sanitization. It is calculated as the ratio of itemsets that did not appear in the original dataset but appear in the sanitized dataset to the itemsets that appear in both the original and the sanitized datasets.
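A common formulation (assumed, in the same notation as above) is:

```latex
\mathrm{AP} = \frac{\lvert F(D') \rvert - \lvert F(D) \cap F(D') \rvert}{\lvert F(D') \rvert}
```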
Since our approach does not insert items into the original dataset, it cannot produce new itemsets in the sanitized dataset. The same holds for the IPA algorithm.
5.1.3. Dissimilarity
This metric measures the difference between the original and the sanitized dataset, quantified by comparing the total number of items or the item frequencies.
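One common frequency-based formulation (assumed), with $f_D(j)$ the frequency of the $j$-th item in $D$, is:

```latex
\mathrm{diss}(D, D') = \frac{1}{\sum_{j=1}^{n} f_D(j)} \sum_{j=1}^{n} \bigl\lvert f_D(j) - f_{D'}(j) \bigr\rvert
```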
For our approach, the number of deleted items gives the dissimilarity between the original and sanitized datasets. The number of deleted items is identical to that of the IPA algorithm.
5.1.4. Misses Cost
This metric concerns the side effects of the sanitization process. It is measured as the ratio of non-sensitive patterns that disappeared from the sanitized dataset to those that appeared in the original dataset.
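With $\widetilde{F}(\cdot)$ denoting the non-sensitive frequent itemsets of a dataset, a common formulation (assumed) is:

```latex
\mathrm{MC} = \frac{\lvert \widetilde{F}(D) \rvert - \lvert \widetilde{F}(D') \rvert}{\lvert \widetilde{F}(D) \rvert}
```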
We report this measure as the number of lost itemsets. This is the only metric in which our approach differs from the IPA algorithm; a further comparison is given in the next subsection.
5.2. Comparison
We evaluated the algorithms on six different datasets obtained from [35]. The characteristics of these datasets are given in Table 6. Since the IPA algorithm uses frequent itemsets discovered before the hiding process, we also provide the time consumption of this mining for the tested values on each dataset. A Python implementation [36] of the Eclat algorithm was used for frequent itemset mining.
We tested the algorithms using different hiding scenarios: hiding one 2-itemset (HS_2.1), hiding two 2-itemsets (HS_2.2), hiding three 2-itemsets (HS_2.3), hiding one 3-itemset (HS_3.1), hiding two 3-itemsets (HS_3.2), and hiding one 4-itemset (HS_4.1). The sensitive itemsets chosen had support counts close to the minimum support count, since such itemsets are more logically hidden and less distinguishable than the rest. For ease of reference in the tables, our approach is named HISB (Hiding Itemsets with SiBlings) throughout this section.
The results of the evaluation for the T10I4D100K dataset are given in Table 7. Columns represent hiding scenarios, side effects as the number of lost itemsets, and running time in seconds for the IPA and HISB algorithms. Both algorithms performed well in terms of the number of lost itemsets: in the defined scenarios, no itemset was lost. Our approach performed better in terms of runtime in four scenarios. It should be noted that IPA needs prior itemset mining, which costs an additional 9.15 s.
The results of the evaluation for the T40I10D100K dataset are given in Table 8. IPA performed better in terms of the number of lost itemsets. On the other hand, our approach performed better in terms of runtime, even though the prior itemset mining time was not included for IPA.
The results of the evaluation for the Mushroom dataset are given in Table 9. Both algorithms performed well regarding the number of lost itemsets, with no itemset lost. On the other hand, our approach performed better in terms of runtime, even though the prior itemset mining time was not included for IPA.
The results of the evaluation for the BMS1 dataset are given in Table 10. IPA performed better in terms of the number of lost itemsets. The runtime performance of the hiding processes was close in five of the six scenarios.
The results of the evaluation for the BMS2 dataset are given in Table 11. Both algorithms performed well in terms of the number of lost itemsets. The runtime performance of the hiding processes was similar in four scenarios.
The results of the evaluation for the Retail dataset are given in Table 12. Both algorithms performed well in terms of the number of lost itemsets. The runtime performance of the hiding processes was close.
5.3. Discussion
First of all, we can say that using sibling itemset constraints to lessen the runtime of the hiding process is effective. Even though the comparison tables do not include the itemset mining time consumed before the hiding process, our approach performed faster in most cases. In addition, in some cases, the number or the length of the border itemsets constructed by the IPA algorithm caused distinctive runtime differences in the hiding process. Experiments on the Mushroom dataset revealed that eliminating prior mining is advantageous when the dataset is dense: although this dataset has fewer items and transactions, the number of frequent itemsets for the given support threshold was over 3.5 million. We also observed that unsolvable constraints caused another disadvantage; however, this was not common. Secondly, the number of lost itemsets caused by our approach is tolerable considering the total number of frequent itemsets.
At this juncture, we would like to mention that we also implemented the algorithm given in [22]. It promises optimal results since it is a fully exact approach and uses relaxation techniques for the CSP. However, we could not finish the experiments because of prohibitive runtimes and hardware limitations that caused crashes. The reason for this problem is that the proposed approach generates constraints for all non-sensitive frequent itemsets. Considering our experiments, it would have to generate constraints for over 1 million and over 3 million frequent itemsets for the T40I10D100K and Mushroom datasets, respectively, which is not feasible.
6. Conclusions
This paper presents a methodology for hiding sensitive itemsets in transactional datasets. We focused on reducing the number of constraints for exact itemset hiding. Using the sibling itemset notion and defining relaxation variables for constraints, we benefited from the exact nature of such algorithms to obtain an ideal solution or a minimally affected dataset. We showed that sibling itemsets are an efficient means of reducing constraints for exact approaches. Through a comparison with a reference algorithm, we also discussed the need for prior computation of frequent itemsets. Experiments revealed that eliminating prior mining of frequent itemsets, combined with the sibling itemset approach, is time-efficient, while side effects such as lost itemsets remain tolerable. We also noted some symmetric and asymmetric properties appearing in itemset hiding.
To sum up, our approach is especially applicable where prior mining of frequent itemsets is costly, which also holds for frequently updated databases. Therefore, we can say that skipping prior mining while using constraints is one of the most important contributions of our approach. Additionally, using fewer constraints makes our approach even better in terms of runtime. Moreover, we added relaxation variables to keep our approach efficient when the initial constraints yield a CSP that is not feasible; although this is not common, handling it without relaxation variables would result in additional runtime, since the constraint solver would need to be run more than once. It can be concluded that our methodology provides an exact approach with fewer constraints so that the hiding process consumes less time.