Article

Anonymous Methods Based on Multi-Attribute Clustering and Generalization Constraints

School of Science, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(8), 1897; https://doi.org/10.3390/electronics12081897
Submission received: 14 March 2023 / Revised: 14 April 2023 / Accepted: 14 April 2023 / Published: 17 April 2023
(This article belongs to the Section Networks)

Abstract

The dissemination and sharing of data tables in IoT applications presents privacy and security challenges that can be addressed using the k-anonymization algorithm. However, this method still suffers from overgeneralization and from insufficient attribute diversity constraints during the anonymization process. To address these issues, this study proposes a multi-attribute clustering and generalization constraints (k,l)-anonymization method that can be applied to multidimensional data tables. The algorithm first used a greedy strategy to sort the attributes width-first and derive the division dimensions, constructing a multidimensional generalization hierarchy, and then selected the attribute with the largest width value as the priority generalization attribute. Next, the k-nearest neighbor (KNN) clustering method was introduced: the initial clustering centers were determined from the width-first results, the quasi-identifier attributes were divided into KNN clusters according to a distance metric, and the quasi-identifier attributes in each equivalence class were generalized using a hierarchical generalization structure. The proposed method also re-evaluated the attributes to be generalized before each generalization operation. Finally, the algorithm employed an improved frequency–diversity constraint to generalize the sensitive attributes, ensuring that each equivalence class contained at least l records that were mutually dissimilar in their sensitive values and closest to one another. While limiting the frequency threshold for the occurrence of sensitive attribute values, the values remained similar within each group, thus achieving anonymity protection for all the attributes.

1. Introduction

In the era of big data, the widespread adoption of IoT technology has enabled its deployment in various fields including healthcare, power grids, and education, providing unprecedented convenience to people’s daily lives [1,2,3,4]. However, as dataset sharing and the dissemination of IoT applications become more frequent, the private information belonging to and referring to a significant number of users flows through the network [5,6], making users vulnerable to cyber-attacks by commercial organizations and unscrupulous individuals who engage in extensive data mining [7,8,9,10]. The release and sharing of data tables has raised concerns about the privacy of IoT user data, and improperly published data sheets have led to data privacy breaches [11,12,13,14,15]. Therefore, the protection of data privacy in IoT applications has become a primary concern in their development and application [16,17,18,19]. The most direct approach to safeguard user information is to conceal the identification attributes in collected data.
To address these challenges, the k-anonymity model [20] was proposed to ensure that each record in a data table was indistinguishable from at least (k − 1) other tuples on its quasi-identifier attributes. To address the drawback that the k-anonymity model was vulnerable to homogeneity attacks, an l-diversity model was proposed in [21]. The main advantage of this anonymity model was that each equivalence group contained at least two qualified sensitive attribute values, which further enhanced the protection of published data through attribute constraints. Pu et al. [22] proposed a method to protect users' sensitive attributes by defining attribute levels and applying different anonymous generalization techniques based on the level of the attribute. However, this method needed a clear and justifiable basis for determining the attribute levels and struggled to generate suitably varied equivalence classes. Yan et al. [23] applied clustering to the k-anonymity method, aiming to merge equivalence classes while minimizing information loss. However, this method ignored the challenges presented by the random selection of generalization attributes and the presence of outliers, which could lead to increased information loss; further research was needed to improve it. A technique for anonymization based on sensitive attributes was proposed in [24] to enhance model performance by controlling the range of quasi-identifier features; this method aimed to group records with similar attributes into the same equivalence group. Byun et al. [25] utilized clustering in k-anonymization by selecting a random record as the initial centroid. The anonymization process involved the iterative merging of equivalence classes based on minimal information loss until the entire dataset was partitioned into distinct clusters, each comprising a minimum of k records. Liu et al. [26] proposed an anonymization algorithm that utilized sensitive attribute clustering to group records with similar attributes into equivalence groups. This approach effectively achieved dataset anonymization while preserving the utility of the data; however, it selected the attributes for generalization in a single, fixed manner, which could result in overgeneralization of the attributes. Cheng et al. [27] proposed a (θ,k)-anonymization model that leveraged grouping to divide the generated equivalence classes into θ groups. This approach ensured that sensitive attributes within the same group had similar values while different groups were as distinct as possible, in order to resist background knowledge and similarity attacks. Domingo-Ferrer et al. [28] introduced a microaggregation algorithm that used heuristics to split the original data into several groups, replacing quasi-identifier attribute values with group centroids, which has proven effective in preserving privacy. Mao et al. [29] proposed a clustering anonymization technique that incorporated the degree of privacy preservation of the sensitive attributes. This method divided large-scale microdata into k clusters using clustering operations and then employed the S-KACA algorithm to carry out the anonymization process. In these works, the attributes to be generalized were chosen largely at random during anonymization and tended to be overgeneralized, leaving the diversity constraints insufficient; the sensitive attributes remained relatively concentrated, even at a leakage rate of 1/k.
As a result, these methods could not effectively resist background knowledge and homogeneity attacks, which degraded data quality and, thus, usability. In a related line of work on resource allocation, the allocation problem in real-time systems is NP-hard, especially when these systems are deployed in cloud computing environments where task execution involves deadline constraints. Scholars proposed a hybrid of the cuckoo search and genetic algorithms, called HGCS (hybrid genetic and cuckoo search), which added genetic operators to the cuckoo search algorithm, enabling a rigorous exploration of the solution space to find the best feasible plan that could execute the tasks in the shortest possible time, thereby reducing the total resource usage cost [30]. As technology advances to enable collaboration and resource sharing through software while the size and energy consumption of computational grids continue to increase, scholars have proposed heuristics for scheduling the tasks of computational grids, with greedy heuristics used to find energy-conscious solutions [31].
Motivated by the analysis given above, this study proposed a multi-attribute clustering and generalization constraint (k,l)-anonymity (MCKL) algorithm. In this method, the quasi-identifier attributes were clustered and generalized to satisfy the anonymity condition by constructing a multidimensional generalization hierarchy and introducing improved KNN clustering. In addition, an enhanced frequency–diversity constraint was used to restrict the sensitive attributes within the equivalence classes. The experimental results showed that the algorithm could effectively reduce information loss and improve resistance to background knowledge and homogeneity attacks. This approach is suitable for protecting the privacy of data tables in IoT applications.

2. Related Concepts

Table 1 shows the column attributes and row tuples. The attributes of a data table were divided into explicit identifier (EI), quasi-identifier (QI), and sensitive (S) attributes [32,33,34]. Among these, EI attributes can directly identify individuals through information such as their ID number and name; QI attributes (e.g., age, sex, and postcode) cannot identify individuals on their own but can be linked with external data to re-identify them; and S attributes include user privacy information that needs to be protected, such as health and income information [35,36].
Equivalence class: The set of records sharing the same QI values in an anonymous table is called an equivalence class.
k-anonymity: A table T satisfies k-anonymity if each equivalence class in T contains at least k records with identical quasi-identifier attribute values.
l-diversity: The l-diversity anonymity model requires each equivalence class to contain at least l different values of the S attribute. l-diversity is a privacy-preserving technique designed to increase the level of privacy protection of data, particularly when sensitive data are involved.
Generalization: Generalization is an anonymization operation on a quasi-identifier attribute that replaces a specific attribute value with a value-domain interval or a more general value.
Background knowledge attack: A background knowledge attack is one in which an attacker uses known auxiliary information and background knowledge to infer sensitive details from anonymized data. Even if the S attribute values within an anonymization group differ, an attacker can use background knowledge to infer private information about the individuals corresponding to specific records. Background knowledge attacks are common against k-anonymization, as the anonymization process still reveals some information which, when combined with the attacker's background knowledge, can lead to the disclosure of sensitive information.
Homogeneity attack: A homogeneity attack is one in which an attacker breaks k-anonymity-protected privacy by combining the same attributes across multiple equivalence classes. The attacker uses identical features and attribute values in multiple equivalence classes to determine an individual's identity and sensitive information. Homogeneity attacks have been among the most common attacks on the k-anonymity model. Suppose two published datasets contain equivalence classes with the same QI attributes. Although both datasets are k-anonymized, an attacker can combine the shared attributes (e.g., the user's age, sex, and postcode) to locate the equivalence class containing the user and determine the user's identity and S information, thus undermining the privacy protection. Homogeneity attacks emphasize the attacker's exploitation of contextual knowledge of the dataset and of external data that may come from different sources but contain the same attributes and attribute values.
Clustering technology: In k-anonymity privacy protection, clustering methods can be applied to achieve better anonymity protection. The basic idea of clustering is to place data tuples (or records) with similar characteristics into the same equivalence class and then assign the same attribute values to each equivalence class. The clustering algorithm therefore divides the original dataset into several clusters, ensuring that data tuples within the same cluster are highly similar while tuples in different clusters are less similar, thus protecting the privacy of the data. Introducing clustering into the formation of equivalence classes can better reduce information loss.
After anonymous generalization and suppression, taking Table 1 as an example, the result is the information shown in Table 2, where each equivalence class contains at least 3 tuple records and at least 2 distinct S attribute (disease) values. We call this (k,l)-anonymity, here (3,2)-anonymity, which provides good protection. In Table 2, “*” represents data after suppression and generalization.
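To make the (k,l) condition concrete, the following is a minimal sketch (not taken from the paper) that checks whether a set of records satisfies (k,l)-anonymity for given quasi-identifier and sensitive columns; the function name and the choice of Age and Code as the grouping quasi-identifiers are illustrative assumptions.

```python
from collections import defaultdict

def satisfies_kl_anonymity(records, qi_cols, s_col, k, l):
    """Check (k,l)-anonymity: every equivalence class (records sharing all
    quasi-identifier values) must contain at least k records and at least
    l distinct sensitive values."""
    groups = defaultdict(list)
    for row in records:
        key = tuple(row[c] for c in qi_cols)      # equivalence-class key
        groups[key].append(row[s_col])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in groups.values())

# Rows mirroring Table 2, treating Age and Code as the quasi-identifiers.
table2 = [
    {"Gender": "F", "Age": "[20, 30]", "Code": "899 ***", "Disease": "cold"},
    {"Gender": "F", "Age": "[20, 30]", "Code": "899 ***", "Disease": "HIV"},
    {"Gender": "F", "Age": "[20, 30]", "Code": "899 ***", "Disease": "fever"},
    {"Gender": "M", "Age": "[40, 50]", "Code": "899 ***", "Disease": "flu"},
    {"Gender": "F", "Age": "[40, 50]", "Code": "899 ***", "Disease": "cancer"},
    {"Gender": "M", "Age": "[40, 50]", "Code": "899 ***", "Disease": "HIV"},
]
print(satisfies_kl_anonymity(table2, ["Age", "Code"], "Disease", 3, 2))   # True
```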

3. The Concept and Process of the Algorithm

To protect data privacy, this study proposed an anonymization method based on multi-attribute clustering and generalization constraints for IoT privacy protection (MCKL) to solve the problems of overgeneralization and insufficient diversity constraints in the anonymization process.

3.1. Multigeneralization Hierarchy

The quality of the generalization hierarchy of the attributes in the original data directly affected the information loss incurred during anonymization. To reduce information loss and improve data availability, this study constructed a multidimensional generalization hierarchy. A greedy strategy was adopted over the multidimensional QI attribute space to repeatedly partition the attributes of the data table until all sub-regions were indivisible. In this process, the initial space formed the root node; each internal node held left and right child nodes pointing to its two subspaces, and the indivisible subspaces formed the external (leaf) nodes of the multidimensional data space. Starting from the entire multidimensional attribute space, the attributes were sorted width-first, and the attribute with the largest width value was selected as the division dimension, so the division always began with the attribute of largest width. These steps were repeated recursively for the remaining subspaces until all subspaces were indivisible, yielding the generalization hierarchy of the attributes. During the generalization operation, the attribute with the largest width value was selected, by width preference, as the priority attribute for generalization. The tuple containing this attribute was marked as the clustering center of the initial equivalence class in preparation for the subsequent clustering step, so that further generalization constraints could be applied to the S attribute information to provide additional protection and reduce overgeneralization. The scheme constructed the generalization hierarchy of the attributes in the dataset based on the characteristics of the different attribute values. Generalizing the data according to this hierarchy prepared the information for the subsequent clustering algorithm, reduced information loss during anonymization, preserved more of the data's integrity, and reduced the risk of S information leakage.
As shown in Figure 1, the generalization hierarchy replaces the data in the quasi-identifier columns with semantically consistent but more general values. Depending on the level of the hierarchical generalization structure, each attribute value can be generalized to a value more abstract than the original one. For example, an initial disease attribute value of influenza in the data table can be replaced with the more general value of respiratory disease.
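The recursive width-first partitioning described above can be sketched as follows. This is a simplified, Mondrian-style illustration under the assumption of purely numeric quasi-identifiers and median splits; the function names are hypothetical and this is not the authors' implementation.

```python
def widest_attribute(partition, qi_attrs):
    """Return the QI attribute with the largest value range (width) in the
    current partition; this attribute becomes the division dimension."""
    return max(qi_attrs,
               key=lambda a: max(r[a] for r in partition) - min(r[a] for r in partition))

def partition_space(partition, qi_attrs, k):
    """Recursively split the record set along the widest attribute until a
    subspace can no longer be divided into two parts of at least k records;
    the resulting leaf partitions are the indivisible external nodes."""
    if len(partition) < 2 * k:
        return [partition]                       # indivisible subspace
    dim = widest_attribute(partition, qi_attrs)
    ordered = sorted(partition, key=lambda r: r[dim])
    mid = len(ordered) // 2                      # median split along the chosen dimension
    left, right = ordered[:mid], ordered[mid:]
    if len(left) < k or len(right) < k:
        return [partition]
    return partition_space(left, qi_attrs, k) + partition_space(right, qi_attrs, k)
```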

3.2. KNN Clustering Concept Introduced

In the clustering process of the (k,l)-anonymity model, the problem of overgeneralization is prominent. The concept of KNN clustering was therefore introduced and applied to the (k,l)-anonymity model so that the tuples in each cluster remained similar in terms of their QI values. To ensure that the tuples in the same equivalence class had good similarity, the clustering process used an inter-tuple distance metric. The QI attributes were partitioned into KNN clusters according to this distance metric by finding the (k − 1) tuple records closest to the initial clustering center. The tuple marked by the width-first result was used as the initial clustering center, avoiding a random choice when constructing the equivalence classes, so that different partition clusters were grouped into the same equivalence class, ensuring that the size of each equivalence class was not less than k and its diversity value was not less than l. Finally, the QI attributes were generalized through the hierarchical generalization structure. Before each generalization operation, the algorithm re-determined the generalization attributes, reducing the overgeneralization of attributes and improving data availability. The KNN clustering-based generalization technique made the anonymized data more accurate and usable, avoiding reliance on generalization alone to achieve privacy protection. In contrast to traditional clustering anonymization methods, this study used an improved KNN clustering method that filtered outliers through pre-processing in the initial stage of the algorithm, relied on the width-first results to determine the cluster centers, selected the (k − 1) records closest to each center, avoided the information loss caused by random selection, and stored the generated records and distance values to reduce unnecessary calculations, which improved the correlation among records in the same cluster.
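A minimal sketch of how an equivalence class could be seeded and filled with the (k − 1) nearest records is shown below, assuming a dis(x, y) function such as the one defined in Section 3.3; the helper is illustrative rather than the authors' code.

```python
def build_equivalence_class(seed, candidates, k, dis):
    """Form an equivalence class around a marked seed tuple by taking the
    k - 1 unassigned records nearest to it under the distance metric dis.
    Returns the class and the records that remain unassigned."""
    nearest = sorted(candidates, key=lambda r: dis(seed, r))[:k - 1]
    remaining = [r for r in candidates if r not in nearest]
    return [seed] + nearest, remaining
```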

3.3. Distance Metric and Information Loss

In this study, we used a distance metric to measure the similarity between QIs, which provided good similarity in the same equivalence class and improved the data availability.
Let x and y be any two tuples in the data table T, let Nxi and Nyi denote their values of the i-th numerical attribute, and let MaxT and MinT denote the maximum and minimum values that the numerical attribute can take in T. The numerical data attribute distance formula was the following:
$$dis(N_{xi}, N_{yi}) = \frac{|N_{xi} - N_{yi}|}{Max_T - Min_T}$$
where L(Cxj, Cyj) denotes the number of leaf categories under the lowest common parent node of the values Cxj and Cyj in the generalization tree of the j-th categorical attribute, and Tc is the total number of leaf nodes in that attribute tree. The categorical data attribute distance formula was, then, the following:
$$dis(C_{xj}, C_{yj}) = \frac{L(C_{xj}, C_{yj})}{T_c}$$
Combining the two equations above, suppose that there are M quasi-identifier attributes, of which M1 are numerical and M2 are categorical. The distance between two tuples, x and y, could then be defined as follows:
$$dis(x, y) = \sum_{i=1}^{M_1} dis(N_{xi}, N_{yi}) + \sum_{j=1}^{M_2} dis(C_{xj}, C_{yj})$$
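The three distance formulas above can be transcribed directly into a small helper. In the sketch below, ranges maps each numeric attribute to its (MaxT, MinT) pair, lca_fns maps each categorical attribute to a hypothetical function returning the number of leaves under the lowest common parent of two values, and leaf_counts maps it to the leaf count of its attribute tree; these names are assumptions about how the hierarchy would be represented.

```python
def numeric_distance(nx, ny, max_t, min_t):
    """Numeric attribute distance: |Nxi - Nyi| / (MaxT - MinT)."""
    return abs(nx - ny) / (max_t - min_t)

def categorical_distance(cx, cy, lca_fn, total_leaves):
    """Categorical attribute distance: leaves under the lowest common
    parent of cx and cy, divided by the leaf count of the attribute tree."""
    return lca_fn(cx, cy) / total_leaves

def tuple_distance(x, y, numeric_attrs, categorical_attrs,
                   ranges, lca_fns, leaf_counts):
    """Combined tuple distance: sum of numeric and categorical distances."""
    d = sum(numeric_distance(x[a], y[a], *ranges[a]) for a in numeric_attrs)
    d += sum(categorical_distance(x[a], y[a], lca_fns[a], leaf_counts[a])
             for a in categorical_attrs)
    return d
```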
In this study, the numerical and categorical attributes were generalized and anonymized through the generalization hierarchy, which inevitably led to a loss of information, defined below.
Let a numerical attribute Ni in the dataset T = {N1, N2, …, Nm} have the value domain [L, U], and let [MinNi, MaxNi] denote the interval to which the attribute is generalized. The information loss (ILN) after generalizing this attribute could then be expressed as the following:
$$IL_N = \frac{Max_{N_i} - Min_{N_i}}{U - L}$$
The degree of information loss for categorical attributes was defined as follows: for a categorical attribute value in a given record generalized to node B according to the generalization hierarchy, the information loss (ILC) could be expressed as:
$$IL_C = \frac{L_B}{C}$$
where C and LB represent the total number of categorical attribute values and the number of leaf nodes in the subtree of node B, respectively.
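As a minimal sketch, the two information-loss measures above translate into the following helpers; the parameter names are mine, and [lower, upper] stands for the attribute's full domain [L, U].

```python
def numeric_info_loss(gen_min, gen_max, lower, upper):
    """IL_N: width of the generalized interval [gen_min, gen_max] relative
    to the width of the attribute's full domain [lower, upper]."""
    return (gen_max - gen_min) / (upper - lower)

def categorical_info_loss(leaves_under_b, total_values):
    """IL_C = L_B / C for a categorical value generalized to node B."""
    return leaves_under_b / total_values
```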

3.4. Improved Frequency Diversity Constraint

The lack of practical constraints on the S attributes, with only the number of distinct attribute values being controlled, led to the problem of insufficient diversity constraints in the (k,l)-anonymity model. The S attributes were relatively concentrated and could not effectively resist background knowledge and homogeneity attacks, even at a leakage rate of 1/k, which reduced the quality and usability of the anonymized data. This paper used an improved frequency–diversity constraint to address this shortcoming. Compared with the basic l-diversity constraint, which only controls the number of distinct S attribute values, the improved frequency–diversity constraint ensured that each equivalence class contained at least l distinct S attribute values while also restricting the S attributes according to a frequency criterion: l records with different S attribute values and the minimum distance to each other were selected to construct an equivalence class. This ensured that the selected records carried different sensitive values and that the frequency of any S attribute value was not greater than f (f = s/l, where s is the number of S attribute values in the equivalence class); therefore, the S attribute values remained similar within groups and different between groups. This method was able to resist background knowledge and homogeneity attacks, reduce the risk of S attribute leakage, and improve privacy protection and data availability.
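One possible reading of the improved frequency–diversity constraint, expressed as a check over the sensitive values of a candidate equivalence class, is sketched below; the threshold f = s/l follows the definition given above, and the function is illustrative rather than the authors' implementation.

```python
from collections import Counter

def satisfies_frequency_diversity(sensitive_values, l):
    """Check one equivalence class: at least l distinct sensitive values,
    and no single value occurring more than f = s / l times, where s is
    the number of sensitive values in the class."""
    s = len(sensitive_values)
    counts = Counter(sensitive_values)
    if len(counts) < l:
        return False
    f = s / l
    return all(c <= f for c in counts.values())
```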

3.5. A Multi-Attribute Clustering and Generalization Constraint (k,l)-Anonymity Algorithm

Based on the data privacy protection strategies explained above, this study proposed a multi-attribute clustering and generalization constraint (k,l)-anonymity (MCKL) algorithm. The algorithm aimed to address the problems of overgeneralization and insufficient diversity constraints in the anonymization process.
The algorithm performed width-first sorting of the attributes in the multidimensional data table using a greedy strategy, selected the attribute with the largest width value as the division dimension, began the division from the attribute with the largest width value, and repeated these steps recursively for the remaining subspaces until no subspace was divisible, in order to obtain the generalization hierarchy of the attributes. When dividing the equivalence classes, the distance metric between attributes was combined with the improved KNN clustering: the cluster centers were determined from the width-first results, and the (k − 1) tuple records closest to each initial cluster center were found, so that records from different partition clusters were grouped into the same equivalence class. Each cluster group was then generalized hierarchically using the generalization hierarchy given above to avoid significant information loss and improve data usability. At the same time, to solve the problem of insufficient diversity in the anonymity model, frequency–diversity constraints were applied to the S attributes according to the frequency criterion when dividing the equivalence classes, and a bound was set on the sensitive values: the frequency of any S attribute value was not greater than f (f = s/l, where s was the number of S attribute values in the equivalence class), so the S attribute values were similar within a group and differed between groups. Algorithm 1 is described as follows:
Algorithm 1: Anonymity Algorithm
Input: raw table data T and parameters k, l; Output: anonymous table data T’
(1) Pre-process the data to screen out outliers;
(2) Initialize the initial tuple cluster and the anonymous set;
(3) Loop until there are no equivalence classes with fewer than k records;
(4) Calculate the width of each attribute in the table;
(5) Use a queue for width-first sorting;
(6) Determine the division dimensions;
(7) Divide recursively, starting from the value with the largest attribute width;
(8) Construct the generalization hierarchy;
(9) Select the attributes for priority generalization;
(10) Let r be the tuple in which the marked attribute is located;
(11) Take r as the initial equivalence class;
(12) Calculate the distances;
(13) Find the k − 1 nearest records whose S attribute values satisfy the constraints;
(14) Generate the equivalence class;
(15) Apply hierarchical generalization;
(16) Assign each remaining tuple to its closest equivalence class;
(17) Perform the anonymity check; if the anonymity condition is not met, go back to (1);
(18) Output the anonymous data table T’.
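For readability, the following is a compressed Python outline of the main loop of Algorithm 1. It assumes a dis(x, y) function implementing the Section 3.3 metric, a generalize(cluster) routine applying the Section 3.1 hierarchy, and seeds taken in width-first order (here simply the current list order); outlier screening, re-selection of generalization attributes, and the final anonymity re-check are omitted, so this is a sketch rather than the authors' implementation.

```python
def mckl_anonymize(records, s_attr, k, l, dis, generalize):
    """Outline of Algorithm 1: build equivalence classes of size k around
    seed tuples, steering record selection toward l distinct sensitive
    values, then generalize each class hierarchically."""
    remaining = list(records)
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)                          # steps 10-11: initial equivalence class
        remaining.sort(key=lambda r: dis(seed, r))       # step 12: distances to the seed
        cluster = [seed]
        for r in list(remaining):                        # step 13: k - 1 nearest records whose
            if len(cluster) == k:                        # S values keep the class l-diverse
                break
            seen = {t[s_attr] for t in cluster}
            if r[s_attr] in seen and (k - len(cluster)) <= (l - len(seen)):
                continue                                 # reserve remaining slots for unseen S values
            cluster.append(r)
            remaining.remove(r)
        clusters.append(cluster)                         # step 14: equivalence class formed
    if clusters:
        for r in remaining:                              # step 16: leftovers join the closest class
            min(clusters, key=lambda c: dis(c[0], r)).append(r)
    return [generalize(c) for c in clusters]             # step 15: hierarchical generalization
```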

4. Results and Analysis

The experiments compared the multi-attribute clustering and generalization constraint (k,l)-anonymity (MCKL) algorithm with the cluster-based k-anonymity algorithm (CKA) and the cluster-l-diversity-based k-anonymity (CKL) algorithm. Relative to CKA, the proposed model adopted an improved frequency–diversity constraint to restrict the attribute values, and relative to CKL, it adopted an enhanced clustering strategy to reduce information loss. The impact of different quantities of QI attributes and dataset sizes on the experimental results was analyzed, comparing the degree of information loss and the execution time of the algorithms in order to verify the quality of the algorithm.

4.1. Experimental Conditions

The experimental dataset was obtained from the adult dataset in the UCI machine learning database. Six thousand data records were selected as the initial dataset, and the selected QI attributes were age, education, marital status, country of birth, race, sex, employment status, and occupation, where occupation was an S attribute.
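For reproducibility, a minimal loading sketch is shown below. The column names and download URL follow the standard UCI Adult layout, and the mapping of “country of birth” to native-country and “employment status” to workclass is an assumption about how the 6000-record subset was prepared.

```python
import pandas as pd

# Standard UCI Adult column layout (assumed).
COLS = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
QI_ATTRS = ["age", "education", "marital-status", "native-country",
            "race", "sex", "workclass"]   # quasi-identifiers used in the experiments
S_ATTR = "occupation"                     # sensitive attribute

adult = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=COLS, na_values="?", skipinitialspace=True)
# Drop rows with missing values and take the first 6000 records as the initial dataset.
subset = adult.dropna(subset=QI_ATTRS + [S_ATTR]).head(6000)[QI_ATTRS + [S_ATTR]]
```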

4.2. Information Loss Analysis

This evaluation set the parameters to k = 10 and l = 2 and analyzed the information loss for different quantities of QIs and dataset sizes.
As shown in Figure 2, the information loss of the three algorithms increased as the number of QIs in the data table increased. This was because a larger number of QIs meant more attributes had to be generalized in the tuples of each equivalence class. For the same quantity of QIs, MCKL had a lower information loss than CKL and CKA.
The datasets selected for the experiments contained 2000, 3000, 4000, 5000, and 6000 data records, respectively. As shown in Figure 3, with a fixed number of QIs, the information loss of the algorithms decreased as the number of tuples in the dataset increased. This was because, as the amount of data in the dataset increased, there were more similar attributes among the tuples, so fewer attributes needed to be clustered and generalized, and the algorithm naturally lost less information. Moreover, the information loss of MCKL was lower than that of CKL and CKA on the same dataset.
As shown in Figure 2 and Figure 3, the MCKL algorithm in this study divided records with similar attributes using KNN clustering to reduce the quantity of generalized data. MCKL applied width-first hierarchical generalization, which effectively reduced the overgeneralization of the attributes, reduced the information loss in the data, and improved the usability of the data. In addition, as compared to CKL and CKA, MCKL used effective frequency–diversity constraints on the attributes, ensuring that the data was more robust to attacks.

4.3. Runtime Analysis

This section presents the execution time analysis for different quantities of QIs and dataset sizes.
As shown in Figure 4, the execution time of the three algorithms and the quantity of QIs were positively correlated. Moreover, when the number of QIs was small, the algorithms had similar execution times. As the number of QIs increased, the execution time curve of the three algorithms rose. As the number of attribute values in the equivalence classes increased, along with the number of parameters, the time taken for clustering and diversity discrimination also increased. MCKL had a slightly higher runtime than CKL, but the time difference was not significant.
Figure 5 illustrates the relationship between the algorithm execution time and the dataset. Specifically, for a fixed number of QIs, the execution time of the algorithm gradually increased as the number of tuples in the dataset increased.
Figure 4 and Figure 5 show that CKL and MCKL had lower runtimes than CKA because CKA lacked generalization constraints and required more data sifting and swapping when generalizing anonymously. Therefore, CKA had higher execution times when the data tuples were larger. Before the clustering generalization on the equivalence classes, a distance formula calculation was required to cluster the attributes, which then increased the computational demand as the tuple attributes increased, as well as increasing the time spent on diversity constraint discriminations on the S attributes. While MCKL was slightly more time-consuming than CKL, MCKL had lower information loss while satisfying the requirement for better diversity among the S attributes. This advantage protected against background knowledge and homogeneity attacks, provided good privacy, increased the availability of the data information, and achieved good anonymity and privacy protection.

5. Conclusions

In this study, we addressed the problem of data privacy protection in IoT applications by proposing a multi-attribute clustering generalization constraint (k,l)-anonymity (MCKL) algorithm based on the requirements of IoT privacy protection. Our experiments showed that the proposed algorithm reduced information loss in the anonymization process, enhanced the restriction constraints on the data attributes, and better protected the privacy of published and shared data tables in IoT applications.

Author Contributions

Conceptualization, Y.F., X.S., S.Z., and Y.T.; Supervision, Y.T.; Writing—original draft, Y.F., X.S., and S.Z.; Writing—review and editing, Y.F. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Natural Science Foundation of Hubei Province, China (grant 2021CFB584).

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. She, W.; Chen, J.S.; Gu, Z.H. Location information protection model for iot nodes based on blockchain. J. Appl. Sci. 2020, 38, 13. [Google Scholar]
  2. Liu, H.; Li, X.H.; Luo, B. Distributed k-anonymous location privacy protection scheme based on blockchain. Chin. J. Comput. 2019, 42, 19. [Google Scholar]
  3. Gu, C.; Zhao, X. Security Analysis of Internet of Things. Sci. Technol. Innov. Appl. 2022, 12, 4. [Google Scholar]
  4. Gu, Y.H.; Guo, Z.Y.; Liu, W.X. Research on performance evaluation method of anonymized privacy protection technologies. Inf. Secur. Res. 2019, 5, 5. [Google Scholar]
  5. Deebak, B.D.; Al-Turjman, F.; Aloqaily, M.; Alfandi, O. An authentic-based privacy preservation protocol for smart e-healthcare systems in IoT. IEEE Access 2019, 7, 135632–135649. [Google Scholar] [CrossRef]
  6. Guo, M.; Zhang, S.B.; Li, X.D. Research on location privacy protection technology in iot. J. Chin. Comput. Syst. 2017, 38, 5. [Google Scholar]
  7. Luo, F.; Xin, Y.L. Privacy-preserving security framework for IoT data based on blockchain and LSTM. Foreign Electron. Meas. Technol. 2022, 41, 145–151. [Google Scholar]
  8. Gui, Q.; Lv, Y.J.; Cheng, X.H. Anonymization method based on proximity resistance to sensitive information. Comput. Eng. 2020, 46. [Google Scholar]
  9. Zhang, J.L.; Zhong, B.C.; Fang, B.G. An improvement of track privacy protection method based on K-anonymity technology. Intell. Comput. Appl. 2019, 9, 4. [Google Scholar]
  10. Wang, Z.H.; Jian, X.W.; Wang, W.; Bai, L.S. A clustering-based approach for data anonymization. J. Softw. 2010, 21, 680–693. [Google Scholar] [CrossRef]
  11. Fu, J.J.; Xu, X.D. A (p, θ) k-anonymity for resisting peer-to-peer attacks. Comput. Digit. Eng. 2021, 49, 1619–1624. [Google Scholar]
  12. Gu, Q.Z.; Dong, H.B. Mi loss evaluation model for k-anonymity in ppdm. Comput. Eng. 2022, 48, 143–147. [Google Scholar]
  13. Song, F.; Ma, T.; Tian, Y.; Al-Rodhaan, M. A new method of privacy protection: Random k-anonymous. IEEE Access 2019, 7, 75434–75445. [Google Scholar] [CrossRef]
  14. He, J.S.; Du, J.H.; Zhu, N.F. Research on k-anonymity Algorithm for Personalized Quasi-identifier Attributes. Netinfo Secur. 2020, 8, 19–26. [Google Scholar]
  15. Zhang, Q.; Zhang, X.; Wang, M.; Li, X. DPLQ: Location-based service privacy protection scheme based on differential privacy. IET Inf. Secur. 2021, 15, 442–456. [Google Scholar] [CrossRef]
  16. Jia, J.; Huang, H. A trajectory (k,e)-anonymity algorithm against trajectory similarity attacks. Comput. Eng. Sci. 2019, 41, 7. [Google Scholar]
  17. Li, W.; Huang, L.S.; Luo, E.T. Anonymous Privacy Protection Model with Individual l-Diversity in Mobile Health. Comput. Sci. Explor. 2018, 12, 8. [Google Scholar]
  18. Cao, M.Z.; Zhang, L.L.; Bi, X.H. Personalized (α,l)-diversity k-anonymity model for privacy preservation. Comput. Sci. Explor. 2018, 45, 7. [Google Scholar]
  19. Han, J.M.; Cen, T.T.; Yu, H.Q. Research in microaggregation algorithm for k-anonymization of data table. Acta Electron. Sin. 2008, 36, 2021. [Google Scholar]
  20. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  21. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data TKDD 2007, 1, 3-es. [Google Scholar] [CrossRef]
  22. Pu, D.; Fang, R. Personalization(p,α,k)-anonymous privacy protection algorithm. Comput. Appl. Softw. 2020, 37, 7. [Google Scholar]
  23. Yan, Y.; Herman, E.A.; Mahmood, A.; Feng, T.; Xie, P.J.C. A weighted K-member clustering algorithm for K-anonymization. Computing 2021, 103, 1–23. [Google Scholar] [CrossRef]
  24. Arafat, N.; Pramanik, M.I.; Muzahid, A.J.M.; Lu, B.; Jahan, S.; Murad, S.A. A conceptual anonymity model to ensure privacy for sensitive network data. In Proceedings of the 2021 Emerging Technology in Computing, Communication and Electronics (ETCCE), Dhaka, Bangladesh, 21–23 December 2021; pp. 1–7. [Google Scholar]
  25. Byun, J.W.; Kamra, A.; Bertino, E.; Li, N. Efficient k-Anonymization Using Clustering Techniques. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, Bangkok, Thailand, 9–12 April 2007; pp. 188–200. [Google Scholar]
  26. Liu, G.L.; Xiao, H. Privacy protection algorithm for electronic medical records based on sensitive attribute clustering. Chin. J. Digit. Med. 2019, 14, 3. [Google Scholar]
  27. Cheng, N.N.; Liu, S.B.; Xiong, X.X. A (θ,k)-anonymity model for sensitive attributes protection. J. Zhengzhou Univ. Sci. Ed. 2019, 51, 6. [Google Scholar]
  28. Domingo-Ferrer, J.; Torra, V. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov. 2005, 11, 195–212. [Google Scholar] [CrossRef]
  29. Mao, Q.Y.; Hu, Y. S-kaca anonymous privacy protection based on clustering algorithm. Geomat. Eng. Sci. Wuhan Univ. 2018, 51, 7. [Google Scholar]
  30. Min-Allah, N.; Qureshi, M.B.; Alrashed, S.; Rana, O.F. Cost efficient resource allocation for real-time tasks in embedded systems. Sustain. Cities Soc. 2019, 48, 101523. [Google Scholar] [CrossRef]
  31. Lindberg, P.; Leingang, J.; Lysaker, D.; Bilal, K.; Khan, S.U.; Bouvry, P.; Ghani, N.; Min-Allah, N.; Li, J. Comparison and analysis of greedy energy-efficient scheduling algorithms for computational grids. Energy-Effic. Distrib. Comput. Syst. 2012, 1, 189–214. [Google Scholar]
  32. Zhang, Q.; Ye, A.Y.; Ye, S.H. K-anonymous data privacy protection mechanism based on optimal clustering. J. Comput. Res. Dev. 2022, 59, 11. [Google Scholar]
  33. Yang, L.; Li, Y. Hybrid k-anonymous feature selection algorithm. Comput. Appl. 2021, 41, 3521. [Google Scholar]
  34. Kang, H.; Deng, J. Mapping generalization (k,l)-anonymity algorithm for security sharing of medical data. J. Beijing Inf. Sci. Technol. Univ. Nat. Sci. Ed. 2021, 36, 1–8. [Google Scholar]
  35. Zhang, Y.B.; Zhang, Q.Y.; Yan, Y. A k-anonymous location privacy protection method of dummy based on approximate matching. Int. J. Netw. Secur. 2020, 35, 65–73. [Google Scholar]
  36. Khan, R.; Tao, X.; Anjum, A.; Kanwal, T.; Maple, C. θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records. Electronics 2020, 9, 716. [Google Scholar] [CrossRef]
Figure 1. Attribute hierarchy generalization structure of the disease attribute.
Figure 2. Information loss with a larger quantity of QI attributes.
Figure 3. Information loss with larger dataset sizes.
Figure 4. Effect of QI attributes on runtime.
Figure 5. Effect of data size on runtime.
Table 1. Raw data table.

Name  | Sex | Age | Code   | Disease
Biber | M   | 50  | 899356 | flu
Coco  | F   | 44  | 899654 | cancer
Bob   | M   | 42  | 899232 | HIV
Lucy  | F   | 25  | 899898 | cold
Mick  | F   | 30  | 899245 | HIV
Marry | F   | 28  | 899575 | fever

Table 2. The (3,2)-anonymous data sheet.

Gender | Age      | Code    | Disease
F      | [20, 30] | 899 *** | cold
F      | [20, 30] | 899 *** | HIV
F      | [20, 30] | 899 *** | fever
M      | [40, 50] | 899 *** | flu
F      | [40, 50] | 899 *** | cancer
M      | [40, 50] | 899 *** | HIV
