Security and Privacy in Big Data Life Cycle: A Survey and Open Challenges
Abstract
1. Introduction
- Analysis of the development status of standardization organizations and studies related to big data security and privacy-preserving
- Description of the security techniques for each phase according to the threats in the big data life cycle, and writing the taxonomy of security and privacy issues based on related studies
- Evaluation comparing our proposal with existing big data security and privacy-preserving survey studies
2. Background and Standards
2.1. Big Data Life Cycle
2.1.1. Collection
2.1.2. Storage
2.1.3. Analytics
2.1.4. Utilization
2.1.5. Destruction
2.2. Application Scenario Based on Big Data Life Cycle
2.3. Standards
2.3.1. De Jure Standards
International Organization for Standardization (ISO)
International Telecommunication Union Telecommunication Standardization Sector (ITU-T)
ISO/IEC JTC1
National Institute of Standards and Technology (NIST)
Standardization Administration of China (SAC)
British Standards Institution (BSI)
2.3.2. De Facto Standards
Telecommunications Technology Association (TTA)
Tele Management Forum (TM Forum)
Institute of Electrical and Electronics Engineers Standards Association (IEEE-SA)
Apache
2.3.3. Outlook and Drawback of Current Standards
3. Security and Privacy in Big Data Life Cycle
3.1. Collection
3.1.1. Privacy Policy
3.1.2. Privacy-Preserving Data Collection
3.2. Storage
3.2.1. Encryption
3.2.2. Access Control
3.2.3. Audit Trail
3.3. Analytics
3.3.1. Privacy-Preserving Data Mining
- Pseudonymization processes data so that a specific individual cannot be recognized without additional information, by deleting or replacing part of the personal information. When applying pseudonymization, it is necessary to consider whether a specific individual can be recognized from the pseudonymized information and whether that information could be combined with additional information. Typical pseudonymization techniques include encryption, hashing, and tokenization.
- Aggregation is a de-identification technique that replaces the values of a sensitive data set with averages or totals so that individual sensitive values cannot be identified. When used in the analytics phase, it reduces the usefulness of the data and makes detailed analytics difficult. Therefore, it is necessary to collect a lot of data to ensure the accuracy of mining. Generally, aggregation uses methods such as micro-aggregation, rearrangement, and rounding. Most existing data aggregation methods rely on a trusted third party (TTP) and have security issues such as denial-of-service attacks. Liu et al. [55] proposed a privacy-preserving data aggregation method that does not rely on a TTP. They described the data aggregation model in the smart grid domain and configured the data collection units to form a virtual aggregation area. The aggregate result is masked and used for data analysis. In addition, by reducing the aggregation area, some defects of the aggregation operator become negligible. They focused on developing solutions that balance data utility and privacy, ensuring that the aggregate results have little impact on data utility.
- Data reduction is a direct method of erasing sensitive data. Generally, values that can directly identify the data provider, such as zip code, e-mail address, and social security number, are temporarily or completely deleted to make the data unidentifiable. Data reduction is rarely applied to anything other than directly identifying information such as PII. Gahar et al. [56] focused on the performance degradation that missing data causes in existing statistical algorithms. To solve the missing data problem, they proposed a reduction algorithm based on the MapReduce paradigm of the RHadoop framework. They took a distributed statistical approach and used a random forest imputation method. The proposed algorithm is based on PCA and MCA: the PCA method processes quantitative variables, and the MCA method processes categorical variables. In addition, it facilitates data search by reducing the search space in the process of extracting useful information.
- Data suppression is the conversion of data values into grouped values. For example, a value of 35 is converted to the range 30–40. The larger the grouping range, the harder it is to ensure accurate mining results. However, data suppression generally makes it difficult to infer the original values of the data set without a large impact on data usability.
- Data masking is the most actively used method of de-identification. It usually de-identifies data by combining the data provider's sensitive data with other data or by replacing parts of the data. There are various techniques, such as substitution, shuffling, and nulling. There are two main types of data masking: static data masking and dynamic data masking. Motiwalla et al. [57] proposed a system that protects privacy without removing the attributes of the data, by applying a masking technology to healthcare data before delivering it to the third parties that need it. Cui et al. [58] proposed a method based on FPE that masks data while maintaining the format of sensitive data (e.g., dates and e-mail addresses) in a big data environment. This method can be applied in both single and distributed environments, and it achieves particularly high efficiency in a distributed environment. However, it was significantly slower than a symmetric-key algorithm, and it cannot preserve the associations between data.
- Differential privacy is a mathematical model for preventing privacy inference from the results of queries on a statistical database, and various related studies are being conducted to protect the privacy of statistical data. It is one of the PPDM techniques: it adds noise while maintaining the distribution of the data, without harming the original statistical meaning. Information exposure is limited by keeping the change in query results caused by the insertion, deletion, or transformation of data below a certain level. If a query result changes significantly when the information of a specific individual changes, an attacker can observe the difference in the query results and learn both that a specific user's data exists and what its value is. Differential privacy applies to online inquiry systems, and it can also be used to generate machine learning statistical classifiers and synthetic information. Gao et al. [59] proposed a differentially private hybrid k-means that protects privacy during the k-means clustering process in the Apache Spark environment. It improves k-means clustering by combining it with a swarm intelligence optimization model, and it additionally protects privacy through the Laplace mechanism, which adds noise. Ni et al. [60] proposed a scheme that uses differential privacy in the process of PPDM through DBSCAN. Unlike DBSCAN, in which the initial core objects are selected randomly, this method performs mining by determining several core objects, and it realizes privacy protection by adding noise. Mo et al. [61] proposed a data preprocessing method based on differential privacy for distance-based clustering. The adaptive parameter mechanism used here is a preprocessing method that maintains a balance between privacy protection and clustering results. In this mechanism, a higher value of the security function gives stronger privacy protection, and a higher value of the data availability function gives higher data availability. Zhao et al. [62] proposed a privacy protection method that can be used in distributed collaborative mining. It allows individual data owners to protect privacy by using differential privacy in the regression and classification learning process, and to ensemble the information of various trees through a gradient boosting decision tree while protecting privacy without third parties. Lin et al. [63] focused on the exposure of sensitive information in existing big data collection methods and proposed a differential privacy protection system for a body sensor network environment. The proposed model introduces the concept of a dynamic noise threshold to analyze the relationship between the noise size of electrocardiogram data and the size of the data set. Because the proposed model perturbs the data sufficiently, an attacker cannot match the data to a specific victim even with full background knowledge.
- k-anonymity is a privacy-preserving model that prevents linkage attacks, in which de-identified data is linked with public information, and it is used to prevent the re-identification of de-identified personal information. k-anonymity is a measure that, during de-identification, ensures that each combination of values appears in at least k records of a given data set, so that the records cannot easily be combined with other information. It is particularly effective in protecting the privacy of data with limited properties, and it hides sensitive data with generalization, containment, analysis, and permutation techniques. However, the diversity of information is not considered during de-identification. When records with the same information are de-identified and composed into a single set, the model is defenseless against homogeneity attacks. Zhang et al. [64] proposed a two-phase top-down specialization approach that anonymizes big data using MapReduce in a cloud environment. This method divides big data into small data sets using MapReduce parallel processing to perform a first anonymization, and then anonymizes the result once more through k-anonymity. However, processing efficiency may suffer because a large amount of data is anonymized twice. Mehta et al. [65] proposed a MapReduce-based scalable k-anonymization algorithm. The major purpose of the proposed algorithm is to simplify the approach using Apache Pig and to protect big data publishing without specific mapper and reducer programs. In addition, it divides the data set into smaller partitions than existing algorithms, based on all the attributes of the data set, and uses sorting and shuffling for data distribution and merging in Hadoop. Therefore, it reduces the number of iterations compared to the existing algorithm, shortens running time, and provides the same level of privacy preservation.
- l-diversity is a model for defending against homogeneity attacks on k-anonymity. Even if k-anonymity is satisfied, a small number of categories increases the likelihood that a record is identifiable. l-diversity means that the de-identified records in a given data set must contain at least l different values of the sensitive information. Even if data is de-identified by the l-diversity model, t-closeness is required to prevent skewness attacks and similarity attacks. Machanavajjhala et al. [66] showed that if the sensitive data lacks diversity, or if the attacker has background knowledge, k-anonymity cannot prevent the disclosure of sensitive information to the attacker. Therefore, they described the possibility of an attack in both cases and proposed l-diversity to address the problem.
- t-closeness is a model to overcome the weakness of l-diversity. If the information in the records is skewed or similar to each other, privacy can be exposed through the difference in the distribution of sensitive information. Therefore, t-closeness protects privacy by keeping the difference between the distribution of sensitive information in each homogeneous set of de-identified records and the distribution of sensitive information in the entire data set below t. The closer the t value is to 0, the more similar the distribution of a specific data section is to the distribution of the entire data. That is, the distribution of specific information in each homogeneous set is not too specific compared to the distribution of the entire data set. Li et al. [67] described the limitations of l-diversity and proposed t-closeness as a new privacy concept to overcome them: l-diversity can allow an attacker to learn private information when the information is skewed toward a specific value or when the de-identified values are similar to each other.
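As a rough illustration of the models listed above, the following Python sketch applies range grouping to a toy table, checks k-anonymity and l-diversity on the result, measures how far each homogeneous set drifts from the overall distribution, and answers a counting query with Laplace noise. It is a minimal sketch under stated assumptions, not the method of any cited study: the records, field names, and function names are hypothetical; l-diversity is taken in its simplest "distinct values" form; the distance is the total variation distance, used here as a simplified stand-in for the distance measure of t-closeness; and the Laplace noise is drawn as the difference of two exponential samples.

```python
import random
from collections import Counter, defaultdict


def generalize_age(age, width=10):
    """Replace an exact age with a range label, e.g. 35 -> "30-40"."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def equivalence_classes(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_ids)].append(rec)
    return groups


def is_k_anonymous(records, quasi_ids, k):
    """True if every equivalence class holds at least k records."""
    return all(len(g) >= k for g in equivalence_classes(records, quasi_ids).values())


def is_l_diverse(records, quasi_ids, sensitive, l):
    """Distinct l-diversity: every class has at least l different sensitive values."""
    return all(
        len({r[sensitive] for r in g}) >= l
        for g in equivalence_classes(records, quasi_ids).values()
    )


def max_class_distance(records, quasi_ids, sensitive):
    """Largest total variation distance between a class and the whole data set."""
    def distribution(rows):
        counts = Counter(r[sensitive] for r in rows)
        return {value: n / len(rows) for value, n in counts.items()}

    overall = distribution(records)
    worst = 0.0
    for group in equivalence_classes(records, quasi_ids).values():
        local = distribution(group)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


def dp_count(records, predicate, epsilon, rng):
    """Counting query with Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(epsilon) samples follows Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical toy records; "age" and "zip" act as quasi-identifiers.
raw = [
    {"age": 35, "zip": "13053", "disease": "flu"},
    {"age": 31, "zip": "13068", "disease": "cancer"},
    {"age": 38, "zip": "13053", "disease": "flu"},
    {"age": 42, "zip": "14850", "disease": "flu"},
    {"age": 47, "zip": "14853", "disease": "flu"},
]
released = [
    {"age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**", "disease": r["disease"]}
    for r in raw
]
QI = ("age", "zip")

print(is_k_anonymous(released, QI, k=2))
print(is_l_diverse(released, QI, "disease", l=2))
print(max_class_distance(released, QI, "disease"))
print(dp_count(released, lambda r: r["disease"] == "flu", epsilon=1.0, rng=random.Random(0)))
```

On this toy table the released data is 2-anonymous but not 2-diverse: the 40-50 group contains only one disease value, which is exactly the homogeneity weakness that l-diversity is meant to catch. A release would satisfy t-closeness in this simplified sense only if the value returned by `max_class_distance` stays at or below the chosen t.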
3.3.2. Access Control
3.4. Utilization
3.4.1. Audit Trail
3.4.2. Privacy-Preserving Data Publishing
3.5. Destruction
4. Evaluation
5. Discussion and Open Challenge
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- ISO—International Organization for Standardization. Available online: https://www.iso.org/about-us.html (accessed on 27 October 2020).
- ITU Telecommunication Standardization Sector. Available online: https://www.itu.int/en/ITU-T/about/Pages/default.aspx (accessed on 27 October 2020).
- ISO/IEC JTC1—Information Technology—ISO. Available online: https://www.iso.org/isoiec-jtc-1.html (accessed on 27 October 2020).
- NIST: National Institute of Standards and Technology. Available online: https://www.nist.gov/about-nist (accessed on 27 October 2020).
- SAC—Standardization Administration of China—ISO. Available online: https://www.iso.org/member/1635.html (accessed on 27 October 2020).
- BSI—British Standards Institution—ISO. Available online: https://www.iso.org/member/2064.html (accessed on 27 October 2020).
- TTA—Telecommunications Technology Association. Available online: https://www.tta.or.kr/eng/index.jsp (accessed on 27 October 2020).
- TM Forum—How to manage Digital Transformation, Agile Business Operations & Connected Digital Ecosystems. Available online: https://www.tmforum.org/ (accessed on 27 October 2020).
- IEEE SA—The IEEE Standards Association—Home. Available online: https://standards.ieee.org/ (accessed on 27 October 2020).
- Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 27 October 2020).
- Greene, T.; Shmueli, G.; Ray, S.; Fell, J. Adjusting to the GDPR: The impact on data scientists and behavioral researchers. Big Data 2019, 7, 140–162.
- Stallings, W. Handling of Personal Information and Deidentified, Aggregated, and Pseudonymized Information under the California Consumer Privacy Act. IEEE Secur. Priv. 2020, 18, 61–64.
- Kanika, A.; Khan, R.A. An Improved Security Threat Model for Big Data Life Cycle. Asian J. Comput. Sci. Technol. 2018, 7, 33–39.
- Hornyack, P.; Han, S.; Jung, J.; Schechter, S.; Wetherall, D. These aren’t the droids you’re looking for: Retrofitting android to protect data from imperious applications. In Proceedings of the 18th ACM Conference on Computer and Communications Security, Chicago, IL, USA, 17–21 October 2011.
- Zhao, Y.; Wang, Z.; Zou, L.; Wang, J.; Hao, Y. A Linked Data Based Personal Service Data Collection and Semantics Unification Method. In Proceedings of the 2014 International Conference on Service Sciences, Wuxi, China, 22–23 May 2014.
- Gao, W.; Yu, W.; Liang, F.; Hatcher, W.G.; Lu, C. Privacy-preserving auction for big data trading using homomorphic encryption. IEEE Trans. Netw. Sci. Eng. 2020, 7, 776–791.
- Mittal, D.; Kaur, D.; Aggarwal, A. Secure data mining in cloud using homomorphic encryption. In Proceedings of the 2014 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), Bangalore, India, 15–17 October 2014.
- Balebako, R.; Jung, J.; Lu, W.; Cranor, L.F.; Nguyen, C. “Little brothers watching you”: Raising awareness of data leaks on smartphones. In Proceedings of the Ninth Symposium on Usable Privacy and Security, Newcastle, UK, 24–26 July 2013.
- Liu, S.; Qu, Q.; Chen, L.; Ni, L.M. SMC: A practical schema for privacy-preserved data sharing over distributed data streams. IEEE Trans. Big Data 2015, 1, 68–81.
- Gupta, A.; Verma, A.; Kalra, P.; Kumar, L. Big Data: A security compliance model. In Proceedings of the 2014 Conference on IT in Business, Industry and Government (CSIBIG), Indore, India, 8–9 March 2014.
- Al-Shomrani, A.; Fathy, F.; Jambi, K. Policy enforcement for big data security. In Proceedings of the 2017 2nd International Conference on Anti-Cyber Crimes (ICACC), Abha, Saudi Arabia, 26–27 March 2017.
- Consolvo, S.; Jung, J.; Greenstein, B.; Powledge, P.; Maganis, G.; Avrahami, D. The Wi-Fi privacy ticker: Improving awareness & control of privacy exposure on Wi-Fi. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing, Copenhagen, Denmark, 26–29 September 2010.
- Zhou, Y.; Zhang, X.; Jiang, X.; Freeh, V.W. Taming information-stealing smartphone applications (on android). In Proceedings of the International Conference on Trust and Trustworthy Computing, Berlin/Heidelberg, Germany, 21–23 June 2011.
- Tiwari, P.K.; Velayutham, T. Detection and deterrence from data collecting applications in Android. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016.
- Xu, S.; Yang, G.; Mu, Y.; Liu, X. A secure IoT cloud storage system with fine-grained access control and decryption key exposure resistance. Future Gener. Comput. Syst. 2019, 97, 284–294.
- Li, J.; Lin, X.; Zhang, Y.; Han, J. KSF-OABE: Outsourced attribute-based encryption with keyword search function for cloud storage. IEEE Trans. Serv. Comput. 2016, 10, 715–725.
- Xue, L.; Yu, Y.; Li, Y.; Au, M.H.; Du, X.; Yang, B. Efficient attribute-based encryption with attribute revocation for assured data deletion. Inf. Sci. 2019, 479, 640–650.
- Yang, P.; Xiong, N.; Ren, J. Data Security and Privacy Protection for Cloud Storage: A Survey. IEEE Access 2020, 8, 131723–131740.
- Baek, J.; Vu, Q.H.; Liu, J.K.; Huang, X.; Xiang, Y. A secure cloud computing based framework for big data information management of smart grid. IEEE Trans. Cloud Comput. 2014, 3, 233–244.
- Zhang, Y.; Yang, M.; Zheng, D.; Lang, P.; Wu, A.; Chen, C. Efficient and secure big data storage system with leakage resilience in cloud computing. Soft Comput. 2018, 22, 7763–7772.
- Azougaghe, A.; Kartit, Z.; Hedabou, M.; Belkasmi, M.; El Marraki, M. An efficient algorithm for data security in cloud storage. In Proceedings of the 2015 15th International Conference on Intelligent Systems Design and Applications (ISDA), Marrakech, Morocco, 14–16 December 2015.
- Hussien, Z.A.; Jin, H.; Abduljabbar, Z.A.; Hussain, M.A.; Abbdal, S.H.; Zou, D. Scheme for ensuring data security on cloud data storage in a semi-trusted third party auditor. In Proceedings of the 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), Harbin, China, 19–20 December 2015.
- Li, Y.; Gai, K.; Qiu, L.; Qiu, M.; Zhao, H. Intelligent cryptography approach for secure distributed big data storage in cloud computing. Inf. Sci. 2017, 387, 103–115.
- Al-Odat, Z.; Al-Qtiemat, E.; Khan, S. A big data storage scheme based on distributed storage locations and multiple authorizations. In Proceedings of the 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), Washington, DC, USA, 27–29 May 2019.
- Arora, A.; Khanna, A.; Rastogi, A.; Agarwal, A. Cloud security ecosystem for data security and privacy. In Proceedings of the 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence, Noida, India, 12–13 January 2017.
- Saroj, S.K.; Chauhan, S.K.; Sharma, A.K.; Vats, S. Threshold cryptography based data security in cloud computing. In Proceedings of the 2015 IEEE International Conference on Computational Intelligence & Communication Technology, Ghaziabad, India, 13–14 February 2015.
- Bajwa, M.S.; Kang, S.S. An Enhanced Data Owner Centric Model for Ensuring Data Security in Cloud. In Proceedings of the 2015 second International Conference on Advances in Computing and Communication Engineering, Dehradun, India, 1–2 May 2015.
- Sanka, S.; Hota, C.; Rajarajan, M. Secure data access in cloud computing. In Proceedings of the 2010 IEEE 4th International Conference on Internet Multimedia Services Architecture and Application, Bangalore, India, 15–17 December 2010.
- Cheng, H.; Rong, C.; Hwang, K.; Wang, W.; Li, Y. Secure big data storage and sharing scheme for cloud tenants. China Commun. 2015, 12, 106–115.
- Al Hamid, H.A.; Rahman, S.M.M.; Hossain, M.S.; Almogren, A.; Alamri, A. A security model for preserving the privacy of medical big data in a healthcare cloud using a fog computing facility with pairing-based cryptography. IEEE Access 2017, 5, 22313–22328.
- Saraiva, D.A.; Leithardt, V.R.Q.; de Paula, D.; Sales Mendes, A.; González, G.V.; Crocker, P. Prisec: Comparison of symmetric key algorithms for IOT devices. Sensors 2019, 19, 4312.
- Ko, S.Y.; Jeon, K.; Morales, R. The HybrEx Model for Confidentiality and Privacy in Cloud Computing. HotCloud 2011, 11, 1–5.
- Ngo, C.; Membrey, P.; Demchenko, Y.; de Laat, C. Policy and context management in dynamically provisioned access control service for virtualized cloud infrastructures. In Proceedings of the 2012 Seventh International Conference on Availability, Reliability and Security, Prague, Czech Republic, 20–24 August 2012.
- Yu, S.; Wang, C.; Ren, K.; Lou, W. Achieving secure, scalable, and fine-grained data access control in cloud computing. In Proceedings of the 2010 Proceedings IEEE INFOCOM, San Diego, CA, USA, 15–19 March 2010.
- Younis, Y.A.; Kifayat, K.; Merabti, M. An access control model for cloud computing. J. Inf. Secur. Appl. 2014, 19, 45–60.
- Liu, Q.; Wang, G.; Wu, J. Secure and privacy preserving keyword searching for cloud storage services. J. Netw. Comput. Appl. 2012, 35, 927–933.
- Adrienne, F.; David, E. Privacy protection for social networking APIs. In Proceedings of the Web 2.0 Security and Privacy 2008 (In Conjunction with 2008 IEEE Symposium on Security and Privacy), Oakland, CA, USA, 22 May 2008.
- Sundareswaran, S.; Squicciarini, A.; Lin, D. Ensuring distributed accountability for data sharing in the cloud. IEEE Trans. Dependable Secur. Comput. 2012, 9, 556–568.
- Yang, K.; Jia, X. Data storage auditing service in cloud computing: Challenges, methods and opportunities. World Wide Web 2012, 15, 409–428.
- Ferdous, M.S.; Margheri, A.; Paci, F.; Yang, M.; Sassone, V. Decentralised runtime monitoring for access control systems in cloud federations. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017.
- Wang, C.; Wang, Q.; Ren, K.; Lou, W. Ensuring data storage security in Cloud Computing. In Proceedings of the 2009 17th International Workshop on Quality of Service, Charleston, SC, USA, 13–15 July 2009.
- Mohan, S.V.; Angamuthu, T. Association Rule Hiding in Privacy Preserving Data Mining. Int. J. Inf. Secur. Priv. 2018, 12, 141–163.
- Gopalan, N.P.; Murthy, T.S. Association Rule Hiding Using Chemical Reaction Optimization. In Proceedings of the Soft Computing for Problem Solving, Bhubaneswar, India, 23–24 December 2018.
- Menaga, D.; Revathi, S. Least lion optimisation algorithm (LLOA) based secret key generation for privacy preserving association rule hiding. IET Inf. Secur. 2018, 12, 332–340.
- Liu, Y.; Guo, W.; Fan, C.I.; Chang, L.; Cheng, C. A practical privacy-preserving data aggregation (3PDA) scheme for smart grid. IEEE Trans. Ind. Inform. 2018, 15, 1767–1774.
- Gahar, R.M.; Arfaoui, O.; Hidri, M.S.; Hadj-Alouane, N.B. A Distributed Approach for High-Dimensionality Heterogeneous Data Reduction. IEEE Access 2019, 7, 151006–151022.
- Motiwalla, L.; Li, X. Value added privacy services for healthcare data. In Proceedings of the 2010 6th World Congress on Services, Miami, FL, USA, 5–10 July 2010.
- Cui, B.; Zhang, B.; Wang, K. A data masking scheme for sensitive big data based on format-preserving encryption. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017.
- Gao, Z.Q.; Zhang, L.J. DPHKMS: An efficient hybrid clustering preserving differential privacy in spark. In Proceedings of the International Conference on Emerging Internetworking, Data & Web Technologies, Wuhan, China, 10–11 June 2017.
- Ni, L.; Li, C.; Wang, X.; Jiang, H.; Yu, J. DP-MCDBSCAN: Differential privacy preserving multi-core DBSCAN clustering for network user data. IEEE Access 2018, 6, 21053–21063.
- Mo, R.; Liu, J.; Yu, W.; Jiang, F.; Gu, X.; Zhao, X.; Liu, W.; Peng, J. A Differential Privacy-Based Protecting Data Preprocessing Method for Big Data Mining. In Proceedings of the 2019 18th IEEE International Conference On Trust, Security And Privacy in Computing and Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019.
- Zhao, L.; Ni, L.; Hu, S.; Chen, Y.; Zhou, P.; Xiao, F.; Wu, L. Inprivate digging: Enabling tree-based distributed data mining with differential privacy. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications, Honolulu, HI, USA, 15–19 April 2018.
- Lin, C.; Song, Z.; Song, H.; Zhou, Y.; Wang, Y.; Wu, G. Differential privacy preserving in big data analytics for connected health. J. Med. Syst. 2016, 40, 97.
- Zhang, X.; Yang, L.T.; Liu, C.; Chen, J. A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans. Parallel Distrib. Syst. 2013, 25, 363–373.
- Mehta, B.B.; Rao, U.P. Privacy preserving big data publishing: A scalable k-anonymization approach using MapReduce. IET Softw. 2017, 11, 271–276.
- Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 1–12.
- Li, N.; Li, T.; Venkatasubramanian, S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 11–15 April 2007.
- Vatsalan, D.; Christen, P.; Verykios, V.S. A taxonomy of privacy-preserving record linkage techniques. Inf. Syst. 2013, 38, 946–969.
- Scannapieco, M.; Figotin, I.; Bertino, E.; Elmagarmid, A.K. Privacy preserving schema and data matching. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 11–14 June 2007.
- Gkoulalas-Divanis, A.; Verykios, V.S. Exact knowledge hiding through database extension. IEEE Trans. Knowl. Data Eng. 2008, 21, 699–713.
- Verykios, V.S.; Elmagarmid, A.K.; Bertino, E.; Saygin, Y.; Dasseni, E. Association rule hiding. IEEE Trans. Knowl. Data Eng. 2004, 16, 434–447.
- Verykios, V.S. Association rule hiding methods. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2013, 3, 28–36.
- Dasgupta, A.; Kosara, R. Adaptive privacy-preserving visualization using parallel coordinates. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2241–2248.
- Dasgupta, A.; Maguire, E.; Abdul-Rahman, A.; Chen, M. Opportunities and challenges for privacy-preserving visualization of electronic health record data. In Proceedings of the IEEE VIS 2014 Workshop on Visualization of Electronic Health Records, Paris, France, 9–14 November 2014.
- Chou, J.K.; Wang, Y.; Ma, K.L. Privacy preserving event sequence data visualization using a Sankey diagram-like representation. In Proceedings of the SIGGRAPH ASIA 2016 Symposium on Visualization, Macao, China, 5–8 December 2016.
- Toysmart.com, LLC, and Toysmart.com, Inc. | Federal Trade Commission. Available online: https://www.ftc.gov/enforcement/cases-proceedings/x000075/toysmartcom-llc-toysmartcom-inc (accessed on 27 October 2020).
- PIPEDA Case Summary 2003-189: Bank Removed Customer’s SIN from some, but not all, of Its Records—Office of the Privacy Commissioner of Canada. Available online: https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2003/pipeda-2003-189/ (accessed on 27 October 2020).
- Abouelmehdi, K.; Beni-Hessane, A.; Khaloufi, H. Big healthcare data: Preserving security and privacy. J. Big Data 2018, 5, 1.
- Yu, S. Big privacy: Challenges and opportunities of privacy study in the age of big data. IEEE Access 2016, 4, 2751–2763.
- Ye, H.; Cheng, X.; Yuan, M.; Xu, L.; Gao, J.; Cheng, C. A survey of security and privacy in big data. In Proceedings of the 2016 16th International Symposium on Communications and Information Technologies (ISCIT), Qingdao, China, 26–28 September 2016.
- Wang, T.; Zheng, Z.; Rehmani, M.H.; Yao, S.; Huo, Z. Privacy preservation in big data from the communication perspective—A survey. IEEE Commun. Surv. Tutor. 2018, 21, 753–778.
- Sangeetha, S.; Sadasivam, G.S. Privacy of big data: A review. In Handbook of Big Data and IoT Security; Springer: Cham, Switzerland, 2019; pp. 5–23.
- Alwan, H.B.; Ku-Mahamud, K.R. Big data: Definition, characteristics, life cycle, applications, and challenges. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Pahang, Malaysia, 25–27 September 2019.
- Xu, L.; Jiang, C.; Wang, J.; Yuan, J.; Ren, Y. Information security in big data: Privacy and data mining. IEEE Access 2014, 2, 1149–1176.
- Jain, P.; Gyanchandani, M.; Khare, N. Big data privacy: A technological perspective and review. J. Big Data 2016, 3, 25.
- Lv, D.; Zhu, S.; Xu, H.; Liu, R. A Review of Big Data Security and Privacy Protection Technology. In Proceedings of the 2018 18th International Conference on Communication Technology (ICCT), Chongqing, China, 8–11 October 2018.
- Goswami, P.; Madan, S. A survey on big data & privacy preserving publishing techniques. Adv. Comput. Sci. Technol. 2017, 10, 395–408.
- Bertino, E.; Ferrari, E. Big data security and privacy. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years; Springer: Cham, Switzerland, 2018; pp. 425–439.
- Fang, W.; Wen, X.Z.; Zheng, Y.; Zhou, M. A survey of big data security and privacy preserving. IETE Tech. Rev. 2017, 34, 544–560.
- Jiang, H.; Wang, K.; Wang, Y.; Gao, M.; Zhang, Y. Energy big data: A survey. IEEE Access 2016, 4, 3844–3861.
- Mehmood, A.; Natgunanathan, I.; Xiang, Y.; Hua, G.; Guo, S. Protection of big data privacy. IEEE Access 2016, 4, 1821–1834.
- Alshboul, Y.; Nepali, R.; Wang, Y. Big data lifecycle: Threats and security model. In Proceedings of the Twenty-first Americas Conference on Information Systems (AMCIS), Fajardo, Puerto Rico, 13–15 August 2015; ISBN 978-0-9966831-0-4.
- Moreno, J.; Serrano, M.A.; Fernández-Medina, E. Main issues in big data security. Future Internet 2016, 8, 44.
- Verykios, V.S.; Bertino, E.; Fovino, I.N.; Provenza, L.P.; Saygin, Y.; Theodoridis, Y. State-of-the-art in privacy preserving data mining. ACM SIGMOD Rec. 2004, 33, 50–57.
- Agrawal, D.; Chawla, S.; Elmagarmid, A.K.; Kaoudi, Z.; Ouzzani, M.; Papotti, P.; Quiané-Ruiz, J.A.; Tang, N.; Zaki, M.J. Road to Freedom in Big Data Analytics. In Proceedings of the 19th International Conference on Extending Database Technology (EDBT), Bordeaux, France, 15–18 March 2016.
Abbreviation | Description |
---|---|
ABE | Attribute-based encryption |
APK | Android application package |
BSI | British standards institution |
CCPA | California consumer privacy act |
DBSCAN | Density-based spatial clustering of applications with noise |
EEA | European economic area |
EU | European union |
FPE | Format-preserving encryption |
GDPR | General data protection regulation |
Hadoop | High-availability distributed object-oriented platform |
HMAC | Hash-based message authentication code |
IBE | Identity-based encryption |
IEEE-SA | Institute of electrical and electronics engineers standards association |
IoT | Internet of things |
ISO | International organization for standardization |
ISO/IEC JTC1 | International organization for standardization/international electrotechnical commission joint technical committee 1 |
ITU-T | International telecommunication union telecommunication standardization sector |
JAR | Java archive |
LLOA | Least lion optimization algorithm |
MAC | Message authentication code |
MCA | Multiple correspondence analysis |
NIST | National institute of standards and technology |
OAuth 2.0 | Open Authorization 2.0 |
Open API | Open application programming interface |
OTP | One-time password |
PCA | Principal component analysis |
PG | Project group |
PII | Personally identifiable information |
PPDM | Privacy-preserving data mining |
PPDP | Privacy-preserving data publishing |
PRE | Proxy re-encryption |
SAC | Standardization administration of China |
SG | Study group |
SHA | Secure hash algorithm |
SMC | Secure multiparty computation |
STC | Special technical committee |
TC | Technical committee |
TM Forum | Tele management forum |
TTA | Telecommunications technology association |
TTP | Trusted third-party |
WD | Working draft |
WG | Working group |
Zip-code | Zone improvement plan-code |
Affiliation | Number | Title | Limitation | Status |
---|---|---|---|---|
ITU-T/ SG 13 | X.1147 | Security requirements and framework for big data analytics in mobile Internet services | A brief explanation of security requirements | Published |
ITU-T/ SG 13 | X.1750 | Guidelines on security of big data as a service for Big Data Service Providers | It cannot be viewed | Pre-published |
ITU-T/ SG 13 | X.1751 | Security guidelines on big data lifecycle management for telecommunication operators | It cannot be viewed | Pre-published |
ISO/IEC JTC1 SC 27 | 20547-4:2020 | Information technology—Big data reference architecture—Part 4: Security and privacy | Guideline for functional components not including detailed techniques | Published |
ISO/IEC JTC1 SC 27 | WD 27045.5 | Information technology—Big data security and privacy—Processes | It cannot be viewed | Preparatory |
ISO/IEC JTC1 SC 27 | 27046.2 | Information technology—Big data security and privacy—Implementation guidelines | It cannot be viewed | Preparatory |
NIST | SP 1500-4r2 | NIST Big data interoperability framework: Volume 4, Big data security and privacy | Architectural security and privacy issues not including the big data life cycle | Published |
SAC | GB/T 35274-2017 | Information security technology—Security capability requirements for big data services | A rough description of the requirements and insufficient description of the techniques | Published |
SAC | GB/T 37973-2019 | Information security technology—Big data security management guide | A rough description of the requirements and insufficient description of the techniques | Published |
Type | Papers | Approaches | Descriptions |
---|---|---|---|
3.1.1 Privacy Policy | Greene et al. [11] | GDPR | Identifies the GDPR concepts and principles and how they can impact the work of data scientists and researchers. |
3.1.1 Privacy Policy | Stallings et al. [12] | CCPA | Describes how the CCPA deals with obfuscation algorithms that can protect privacy. |
3.1.1 Privacy Policy | Kanika et al. [13] | GDPR and CCPA | Describes the laws dealing with privacy protection when the information provider withdraws consent. |
3.1.2 Privacy-Preserving Data Collection | Hornyack et al. [14] | Access control | Blocks unnecessary access and preserves privacy using shadow data. |
3.1.2 Privacy-Preserving Data Collection | Zhao et al. [15] | Authentication/Authorization | Proposes a personal data cloud to store collected personal data and control access. |
3.1.2 Privacy-Preserving Data Collection | Gao et al. [16] | Homomorphic encryption | A PPAS that enables data providers to sell data securely through one-time pad and homomorphic encryption. |
3.1.2 Privacy-Preserving Data Collection | Mittal et al. [17] | Homomorphic encryption | Maintains accuracy and preserves the privacy of k-means clustering using a homomorphic cryptosystem. |
3.1.2 Privacy-Preserving Data Collection | Balebako et al. [18] | Detecting through filtering | Detects privacy leakage through filtering based on TaintDroid in the Android environment. |
3.1.2 Privacy-Preserving Data Collection | Liu et al. [19] | Shadow coding schema | Data privacy through shadow matrix computation. |
3.1.2 Privacy-Preserving Data Collection | Gupta et al. [20] | Abnormal detection | User classification through abnormal behavior detection and monitoring. |
3.1.2 Privacy-Preserving Data Collection | Al-Shomrani et al. [21] | Sensitive data identification | Individual storage security module through sensitive data identification policy. |
3.1.2 Privacy-Preserving Data Collection | Consolvo et al. [22] | Web traffic control | User access control by monitoring sensitive information related to keywords. |
3.1.2 Privacy-Preserving Data Collection | Zhou et al. [23] | Access control | Privacy preservation related to background applications in Android environments. |
3.1.2 Privacy-Preserving Data Collection | Tiwari et al. [24] | Detection data collecting | Data leakage detection and blocking through decompilation of APK files. |
Type | Papers | Approaches | Descriptions |
---|---|---|---|
3.2.1 Encryption | Xu et al. [25] | ABE | Addresses continued valid access after user revocation and exposure of the temporary decryption key. |
3.2.1 Encryption | Li et al. [26] | ABE | Creates an encrypted trapdoor for each keyword and decrypts without knowing the keyword. |
3.2.1 Encryption | Xue et al. [27] | ABE | Complete deletion by using proxy re-encryption and a Merkle hash tree. |
3.2.1 Encryption | Yang et al. [28] | ABE | Enables cross-domain data sharing and ensures that the same data can be safely deduplicated. |
3.2.1 Encryption | Baek et al. [29] | IBE and proxy re-encryption | Replaces digital certificates with the identifier. |
3.2.1 Encryption | Zhang et al. [30] | IBE | Prevents unauthorized access and periodically updates the secret key. |
3.2.1 Encryption | Azougaghe et al. [31] | AES and ElGamal | Data encryption with AES and key encryption using the ElGamal algorithm. |
3.2.1 Encryption | Hussien et al. [32] | AES and ECC | Integrity verification through hash and data protection through ECC and AES. |
3.2.1 Encryption | Li et al. [33] | SED2 algorithm | Divide data through intelligent encryption and make only it visible to the cloud provider. |
3.2.1 Encryption | Al-Odat et al. [34] | Multi-authentication and SHA | Integrity guaranteed through SHA and multi-authentication method encrypted. |
3.2.1 Encryption | Arora et al. [35] | Hybrid encryption | Combination of HMAC, OTP, SHA, symmetric key, and asymmetric key in the cloud environment. |
3.2.1 Encryption | Saroj et al. [36] | Threshold encryption | Ensures confidentiality through threshold encryption and guarantees integrity. |
3.2.1 Encryption | Bajwa et al. [37] | Obfuscation and encryption | Protects data by obfuscation and encryption according to the type of data. |
3.2.1 Encryption | Sanka et al. [38] | Access control and encryption | Ensures only data owners and users can view the data through a symmetric key. |
3.2.1 Encryption | Cheng et al. [39] | Paths encryption | Storage that encrypts the paths and protects data mapping through the trapdoor function. |
3.2.1 Encryption | Al Hamid et al. [40] | Bilinear pairing cryptography | Secure communication and data protection using bilinear pairing cryptography. |
3.2.1 Encryption | Saraiva et al. [41] | Evaluation of encryption algorithm | Presented an encryption benchmark to protect data among heterogeneous resources. |
3.2.2 Access Control | Ko et al. [42] | Private cloud | Proposed a model to ensure confidentiality and privacy using a private cloud. |
3.2.2 Access Control | Ngo et al. [43] | Distributed clouds | Role-based policy management using a policy profile in XACML and sharing security context. |
3.2.2 Access Control | Yu et al. [44] | Attribute-based access control | Allows the data owner to hand over the work related to access control without providing data. |
3.2.2 Access Control | Younis et al. [45] | Role-based access control | Allows secure data sharing and efficient access control through security tags and risk engines. |
3.2.2 Access Control | Liu et al. [46] | Data access | Allows data users to access data containing specific keywords in the cloud through search. |
3.2.2 Access Control | Adrienne et al. [47] | Data access | Uses a privacy-by-proxy approach to achieve privacy. |
3.2.2 Access Control | Sundareswaran et al. [48] | Data logging | Method to create a JAR for the access policy. |
3.2.3 Audit Trail | Yang et al. [49] | Various audit methods | Explained various audit methods and analyzed security and performance in detail. |
3.2.3 Audit Trail | Ferdous et al. [50] | Blockchain | Proposed an architecture that evaluates whether access control has been properly performed. |
3.2.3 Audit Trail | Wang et al. [51] | Token-based method | Decentralized system method that allows data owners to detect data corruption through tokens. |
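Several rows in the storage table rely on keyed or unkeyed hashes for integrity (for example, the HMAC and SHA combination listed for Arora et al. [35]). The general idea can be sketched with Python's standard library as follows. This is only an illustration under stated assumptions: the function names, key, and record are hypothetical, and the sketch does not reproduce any of the surveyed schemes.

```python
import hashlib
import hmac


def make_tag(key: bytes, data: bytes) -> str:
    """Return a hex-encoded HMAC-SHA256 tag for data under key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, data: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, data), tag)


# Hypothetical key and record, for illustration only.
key = b"demo-key-not-for-production"
record = b"patient_id=123;diagnosis=flu"
tag = make_tag(key, record)
```

A stored record whose tag no longer verifies has been modified or was tagged under a different key, which is the property an integrity check in the storage phase needs.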
Type | Papers | Approaches | Descriptions |
---|---|---|---|
3.3.1 Privacy-Preserving Data Mining | Mohan et al. [52] | Association rule hiding | Proposed hiding techniques based on a genetic algorithm and a dummy item creation technique. |
3.3.1 Privacy-Preserving Data Mining | Gopalan et al. [53] | Association rule hiding | Developed an efficient meta-heuristic algorithm based on the chemical reaction optimization algorithm. |
3.3.1 Privacy-Preserving Data Mining | Menga et al. [54] | Association rule hiding | Proposed a secret key generation method using the least lion optimization algorithm. |
3.3.1 Privacy-Preserving Data Mining | Liu et al. [55] | Aggregation | Proposed a practical privacy-preserving data aggregation scheme without TTP. |
3.3.1 Privacy-Preserving Data Mining | Gahar et al. [56] | Reduction | Reduction algorithm based on the MapReduce paradigm. |
3.3.1 Privacy-Preserving Data Mining | Motiwalla et al. [57] | Data masking | Protects privacy without removing the attributes of the data and delivers it to necessary third parties. |
3.3.1 Privacy-Preserving Data Mining | Cui et al. [58] | Data masking | Masking method based on format-preserving encryption in a distributed environment. |
3.3.1 Privacy-Preserving Data Mining | Geo et al. [59] | Differential privacy | Protects privacy through differential privacy when performing the k-means clustering process. |
3.3.1 Privacy-Preserving Data Mining | Ni et al. [60] | Differential privacy | Initial objects are randomly selected, and privacy protection can be realized using Laplace noise. |
3.3.1 Privacy-Preserving Data Mining | Mo et al. [61] | Differential privacy | A data preprocessing method based on differential privacy for distance-based clustering. |
3.3.1 Privacy-Preserving Data Mining | Zhao et al. [62] | Differential privacy | A privacy protection method that can be used in distributed collaborative mining. |
3.3.1 Privacy-Preserving Data Mining | Lin et al. [63] | Differential privacy | Concept of a dynamic noise threshold to analyze the relationship between the noise size and the data set. |
3.3.1 Privacy-Preserving Data Mining | Zhang et al. [64] | k-anonymity | MapReduce parallel processing to perform anonymization through k-anonymity. |
3.3.1 Privacy-Preserving Data Mining | Mehta et al. [65] | k-anonymity | Protects big data publishing without a specific mapper and reducer. |
3.3.1 Privacy-Preserving Data Mining | Machanavajjhala et al. [66] | l-diversity | Proposed a novel and powerful privacy definition called l-diversity. |
3.3.1 Privacy-Preserving Data Mining | Li et al. [67] | t-closeness | Protects privacy by bounding the difference between the distribution of sensitive information in each group and in the whole table. |
3.3.1 Privacy-Preserving Data Mining | Vatsalan et al. [68] | Privacy-preserving record linkage | Presented a survey of privacy-preserving record linkage technologies in the past and present. |
3.3.1 Privacy-Preserving Data Mining | Scannapieco et al. [69] | Record matching protocol | Presented a protocol that allows each party to hide records and schema attribute details that are not shared. |
3.3.1 Privacy-Preserving Data Mining | Gkoulalas-Divanis et al. [70] | Border-based approach | Applied a border-based approach to hide sensitive data sets and introduced minimal extensions to the original database. |
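The differential privacy rows above mention Laplace noise (for example, Ni et al. [60]). As a rough illustration of the underlying mechanism, and not of any surveyed clustering method, the sketch below adds Laplace noise to a counting query. It assumes a sensitivity of 1 for the count, so the noise scale is 1/epsilon; the function names and the sample data are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count; a counting query has sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the demo repeatable
ages = [23, 35, 41, 29, 52, 61, 38]  # hypothetical data
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon gives a larger noise scale, so stronger privacy is bought with a less accurate answer; the true count here is 3, and repeated noisy answers average out around it.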
Type | Papers | Approaches | Descriptions |
---|---|---|---|
3.3.1 Privacy-preserving data publishing | Dasgupta et al. [73] | k-anonymity and l-diversity | Protect privacy in visualizations using parallel coordinates. The method provided an interactive interface that could prevent direct data access. |
3.3.1 Privacy-preserving data publishing | Dasgupta et al. [74] | Privacy-preserving visualization | Discussed privacy-preserving visualization in health data, pointing out the limitations of several studies. |
3.3.1 Privacy-preserving data publishing | Chou et al. [75] | k-anonymity, l-diversity, and t-closeness | Interface for privacy-preserving visualization that can detect potential privacy issues and increase data utility. |
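The publishing approaches above build on k-anonymity and l-diversity. A minimal sketch of checking both properties on an already generalized table is given below. It assumes the simplest "distinct" variant of l-diversity (at least l different sensitive values per group); the column names and rows are hypothetical and are not taken from any surveyed paper.

```python
from collections import Counter, defaultdict


def is_k_anonymous(rows, quasi_identifiers, k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def is_l_diverse(rows, quasi_identifiers, sensitive: str, l: int) -> bool:
    """True if every group holds at least l distinct sensitive values."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return all(len(v) >= l for v in values.values())


# Hypothetical generalized records, for illustration only.
rows = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]
qi = ["zip", "age"]
```

On this sample, the table is 2-anonymous but not 2-diverse, because the second group contains only one disease value; this is the kind of gap that l-diversity was introduced to close.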
Life Cycle | Type | Fang et al. [89] | Ye et al. [80] | Xu et al. [84] | Jain et al. [85] | Abouelmehdi et al. [78] | Yu et al. [79] | Jiang et al. [90] | Mehmood et al. [91] | Alshboul et al. [92] | Moreno et al. [93] | Wang et al. [81] | Lv et al. [86] | Goswami et al. [87] | Sangeetha et al. [82] | Bertino et al. [88] | Alwan et al. [83] | Proposal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Collection | Privacy Policy | ✓ | ✓ | ✓ | ||||||||||||||
Privacy-Preserving Data Collection | ✓ | |||||||||||||||||
Storage | Encryption | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Access Control | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Audit Trail | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||||||||
Analytics | Privacy-Preserving Data Mining | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Access Control | ✓ | ✓ | ||||||||||||||||
Utilization | Audit Trail | ✓ | ✓ | ✓ | ||||||||||||||
Privacy-Preserving Data Publishing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||||||||
Destruction | Degaussing | ✓ | ||||||||||||||||
Overwriting | ✓ |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Koo, J.; Kang, G.; Kim, Y.-G. Security and Privacy in Big Data Life Cycle: A Survey and Open Challenges. Sustainability 2020, 12, 10571. https://doi.org/10.3390/su122410571