Data Re-Identification—A Case of Retrieving Masked Data from Electronic Toll Collection
Abstract
1. Introduction
2. Background
2.1. Data Release and Open Data
2.2. Data De-Identification and Privacy Issues
2.3. De-Identification Methods and Standards
3. Materials and Methods
3.1. Research Materials: ETC Data
3.2. Research Design and Procedure
3.3. Model Explanation and Evaluation
3.4. Background Analysis of the Dataset
4. Results and Discussion
4.1. Identification of the Regularly Driven Buses
4.2. De-Identification Methods
4.2.1. Deletion of Privacy Fields
4.2.2. Cryptographic Salting
4.2.3. Modifying Privacy Field Data
4.2.4. Averaging Privacy Field Data
4.3. Discussions
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
Field Name | Description |
---|---|
UniCode | Vehicle identification code. The eTagID or license plate number of the vehicle, converted into a unique code using an unpublished encryption method. |
VehicleType | Vehicle type code, with values such as “31” (small passenger vehicle), “32” (small truck), “41” (bus), “42” (truck), or “5” (trailer). |
DetectionTime_O | Time at which the vehicle passes the first detection station of this trip. |
GantryID_O | Code number of the first detection station passed by the vehicle during this trip. |
DetectionTime_D | Time at which the vehicle passes the final detection station of this trip. |
GantryID_D | Code number of the last detection station passed by the vehicle during this trip. |
TripLength | Total distance travelled during this trip. |
TripEnd | Trip note: “Y” denotes a trip with a normal ending; “N” denotes a trip with an abnormal ending. |
TripInformation | Code numbers of the detection stations passed by the vehicle during this trip, with the time of each pass. |
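The field list above can be read as a record type. The following is a minimal sketch in Python, assuming a nine-column row in the order of the table; the timestamp format, the sample values, and the names `Trip`, `parse_trip`, and `TIME_FORMAT` are hypothetical and are not taken from the dataset. The internal layout of `TripInformation` is not specified here, so it is kept as a raw string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence

# Vehicle type codes as listed in the field table.
VEHICLE_TYPES = {
    "31": "small passenger vehicle",
    "32": "small truck",
    "41": "bus",
    "42": "truck",
    "5": "trailer",
}

# Assumed timestamp layout; the real files may use a different one.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class Trip:
    uni_code: str            # masked vehicle identification code
    vehicle_type: str        # one of the keys of VEHICLE_TYPES
    detection_time_o: datetime
    gantry_id_o: str
    detection_time_d: datetime
    gantry_id_d: str
    trip_length: float
    trip_end: str            # "Y" = normal ending, "N" = abnormal ending
    trip_information: str    # kept raw; format not specified in the table

    @property
    def is_bus(self) -> bool:
        return self.vehicle_type == "41"

    @property
    def ended_normally(self) -> bool:
        return self.trip_end == "Y"


def parse_trip(fields: Sequence[str]) -> Trip:
    """Build a Trip from nine string fields in the table's order."""
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    uni, vtype, t_o, g_o, t_d, g_d, length, end, info = fields
    return Trip(
        uni_code=uni,
        vehicle_type=vtype,
        detection_time_o=datetime.strptime(t_o, TIME_FORMAT),
        gantry_id_o=g_o,
        detection_time_d=datetime.strptime(t_d, TIME_FORMAT),
        gantry_id_d=g_d,
        trip_length=float(length),
        trip_end=end,
        trip_information=info,
    )


# Hypothetical sample row, for illustration only.
sample = parse_trip([
    "a1b2c3", "41",
    "2019-01-01 08:00:00", "GANTRY_A",
    "2019-01-01 08:30:00", "GANTRY_B",
    "25.4", "Y", "",
])
print(VEHICLE_TYPES[sample.vehicle_type], sample.ended_normally)
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: trips are observations and should not be altered after parsing.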
Set | Day | 31 Cars | 32 Small Trucks | 41 Buses | 42 Trucks | 5 Trailers | Total |
---|---|---|---|---|---|---|---|
Training set | Day 1 | 2,322,209 (66.16%) | 794,931 (22.65%) | 53,517 (1.52%) | 187,276 (5.34%) | 152,250 (4.34%) | 3,510,183 |
Training set | Day 2 | 2,412,406 (70.95%) | 708,370 (20.83%) | 52,673 (1.55%) | 123,066 (3.62%) | 103,598 (3.05%) | 3,400,113 |
Training set | Day 3 | 2,385,249 (76.03%) | 595,603 (18.99%) | 51,576 (1.64%) | 54,770 (1.75%) | 49,965 (1.59%) | 3,137,163 |
Training set | Day 4 | 2,181,427 (66.29%) | 741,507 (22.53%) | 49,563 (1.51%) | 175,877 (5.34%) | 142,274 (4.32%) | 3,290,648 |
Testing set | Day 5 | 2,097,515 (65.02%) | 736,207 (22.82%) | 48,963 (1.52%) | 190,739 (5.91%) | 152,290 (4.72%) | 3,225,714 |
Testing set | Day 6 | 2,072,792 (65.75%) | 711,646 (22.57%) | 46,599 (1.48%) | 176,445 (5.60%) | 145,156 (4.60%) | 3,152,638 |
Testing set | Day 7 | 2,116,849 (65.57%) | 734,594 (22.75%) | 47,797 (1.48%) | 184,132 (5.70%) | 144,969 (4.49%) | 3,228,341 |
Testing set | Day 8 | 1,784,414 (68.20%) | 558,757 (21.35%) | 35,916 (1.37%) | 138,092 (5.28%) | 99,427 (3.80%) | 2,616,606 |
Summary | - | 17,372,861 (67.97%) | 5,581,615 (21.84%) | 386,604 (1.51%) | 1,230,397 (4.81%) | 989,929 (3.87%) | 25,561,407 |
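The percentages in the table above are each vehicle type's share of that day's total. As a hedged illustration, the short Python sketch below recomputes the Day 1 shares from the Day 1 counts; the names `DAY1_COUNTS` and `shares` are introduced here for illustration only.

```python
# Day 1 trip counts per vehicle type, copied from the table.
DAY1_COUNTS = {
    "31 Cars": 2_322_209,
    "32 Small Trucks": 794_931,
    "41 Buses": 53_517,
    "42 Trucks": 187_276,
    "5 Trailers": 152_250,
}


def shares(counts):
    """Return the row total and each category's share in percent (2 decimals)."""
    total = sum(counts.values())
    return total, {name: round(100.0 * n / total, 2) for name, n in counts.items()}


day1_total, day1_shares = shares(DAY1_COUNTS)
print(day1_total)
for name, pct in day1_shares.items():
    print(f"{name}: {pct:.2f}%")
```

The same check can be run on any other row to confirm that the counts and percentages agree.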
Set | Period | Low Frequency (re-identified) | High Frequency (re-identified) | Total (re-identified) | Hitting Rate (validation) |
---|---|---|---|---|---|
Training set | Days 1–4 | 1067 | 1011 | 2078 | - |
Testing set | Day 5 | 970 | 934 | 1904 | 91.63% |
Testing set | Day 6 | 917 | 925 | 1842 | 88.64% |
Testing set | Day 7 | 926 | 918 | 1844 | 88.74% |
Testing set | Day 8 | 926 | 932 | 1858 | 89.41% |
Testing set | Continuous hit in all 4 days | 702 | 778 | 1480 | 71.22% |
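The hitting rates in the table above are consistent with dividing each testing-period total by the training-set total of 2078 vehicles. The sketch below reproduces that arithmetic in Python; this reading of the hitting rate is inferred from the tabulated figures rather than from a stated definition, and the names `hitting_rate`, `TRAINING_TOTAL`, and `TEST_TOTALS` are hypothetical.

```python
def hitting_rate(hits: int, training_total: int) -> float:
    """Share (in percent, 2 decimals) of training-set vehicles found again."""
    if training_total <= 0:
        raise ValueError("training_total must be positive")
    return round(100.0 * hits / training_total, 2)


# Totals copied from the table: vehicles re-identified in Days 1-4,
# and how many of them were hit again in each testing period.
TRAINING_TOTAL = 2078
TEST_TOTALS = {
    "Day 5": 1904,
    "Day 6": 1842,
    "Day 7": 1844,
    "Day 8": 1858,
    "All 4 days": 1480,
}

rates = {period: hitting_rate(n, TRAINING_TOTAL) for period, n in TEST_TOTALS.items()}
for period, rate in rates.items():
    print(f"{period}: {rate:.2f}%")
```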
Set | Period | Low Frequency (re-identified) | High Frequency (re-identified) | Total (re-identified) | Hitting Rate (validation) |
---|---|---|---|---|---|
Training set | Days 1–4 | 1011 | 1067 | 2078 | - |
Testing set | Day 5 | 934 | 970 | 1904 | 91.63% |
Testing set | Day 6 | 925 | 917 | 1842 | 88.64% |
Testing set | Day 7 | 918 | 926 | 1844 | 88.74% |
Testing set | Day 8 | 932 | 926 | 1858 | 89.41% |
Testing set | Continuous hit in all 4 days | 778 | 702 | 1480 | 71.22% |
Set | Period | Low Frequency (re-identified) | High Frequency (re-identified) | Total (re-identified) | Hitting Rate (validation) |
---|---|---|---|---|---|
Training set | Days 1–4 | 560 | 384 | 944 | - |
Testing set | Day 5 | 555 | 380 | 935 | 99.05% |
Testing set | Day 6 | 556 | 382 | 938 | 99.36% |
Testing set | Day 7 | 556 | 379 | 935 | 99.05% |
Testing set | Day 8 | 555 | 376 | 931 | 98.62% |
Testing set | Continuous hit in all 4 days | 549 | 369 | 918 | 97.25% |
Set | Period | Low Frequency (re-identified) | High Frequency (re-identified) | Total (re-identified) | Hitting Rate (validation) |
---|---|---|---|---|---|
Training set | Days 1–4 | 1648 | 1054 | 2702 | - |
Testing set | Day 5 | 1460 | 1241 | 2701 | 99.96% |
Testing set | Day 6 | 1426 | 1163 | 2589 | 95.82% |
Testing set | Day 7 | 1419 | 1151 | 2570 | 95.11% |
Testing set | Day 8 | 1422 | 1154 | 2576 | 95.34% |
Testing set | Continuous hit in all 4 days | 1130 | 757 | 1887 | 73.25% |
Set | Period | Low Frequency (re-identified) | High Frequency (re-identified) | Total (re-identified) | Hitting Rate (validation) |
---|---|---|---|---|---|
Training set | Days 1–4 | 1642 | 1768 | 3410 | - |
Testing set | Day 5 | 1395 | 1417 | 2812 | 82.46% |
Testing set | Day 6 | 1384 | 1329 | 2713 | 79.56% |
Testing set | Day 7 | 1338 | 1304 | 2642 | 77.48% |
Testing set | Day 8 | 1360 | 1295 | 2655 | 77.86% |
Testing set | Continuous hit in all 4 days | 1064 | 846 | 1910 | 56.01% |
Group | Method | Number of Re-Identified Vehicles (training set) | Re-Identified Ratio (training set) | Hitting Rate (testing set) | Validated Accuracy (testing set) |
---|---|---|---|---|---|
- | Original data | 2078 | Medium | Low | High |
De-identification methods | Deletion of privacy fields | 0 | None | High | High |
De-identification methods | Cryptographic salting | 2078 | Medium | Low | High |
De-identification methods | Modifying privacy fields | 944 | Low | High | Low |
De-identification methods | Averaging privacy fields | 2702 | Medium | Medium | Medium |
De-identification methods | Division-averaging privacy fields | 3410 | High | Medium | High |
De-Identification Methods | Processing Time | Code Length | Coding Difficulty | Data Usability | Privacy Security | Data Credibility |
---|---|---|---|---|---|---|
Original data | - | - | - | High | Low | High |
Deletion of privacy fields | Short | Short | Easy | Low | High | High |
Cryptographic salting | Long | Short | Easy | High | Low | High |
Modifying privacy fields | Short | Short | Easy | Low | High | Low |
Averaging privacy fields | Medium | Medium | Medium | Medium | Medium | Medium |
Division-averaging privacy fields | Medium | Medium | Hard | Medium | Medium | High |
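The comparison above rates cryptographic salting as high in data usability but low in privacy security. One possible reading is that a salted hash replaces each vehicle code with a pseudonym but still maps the same vehicle to the same pseudonym, so per-vehicle travel patterns stay linkable. The Python sketch below illustrates this under stated assumptions: the exact salting procedure is not shown in this section, so the use of SHA-256 with a single dataset-wide salt, the function `salted_pseudonym`, and the sample identifiers are all hypothetical.

```python
import hashlib
import secrets


def salted_pseudonym(vehicle_id: str, salt: bytes) -> str:
    """Replace an identifier with the hex SHA-256 digest of salt + identifier."""
    return hashlib.sha256(salt + vehicle_id.encode("utf-8")).hexdigest()


# One random salt for the whole release (an assumption of this sketch).
salt = secrets.token_bytes(16)

# Hypothetical (vehicle id, gantry) observations, for illustration only.
trips = [
    ("ABC-1234", "GANTRY_A"),
    ("XYZ-5678", "GANTRY_B"),
    ("ABC-1234", "GANTRY_C"),
]

masked = [(salted_pseudonym(vehicle, salt), gantry) for vehicle, gantry in trips]

# The raw identifier is gone, but trips by the same vehicle still share
# one pseudonym, so its route and timing pattern can still be grouped.
same_vehicle = masked[0][0] == masked[2][0]
print(same_vehicle)
```

Under these assumptions the masked release keeps every linkage the original had, which is consistent with the identical re-identification count (2078) reported for the original data and for cryptographic salting.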
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Huang, H.-H.; Lin, J.-W.; Lin, C.-H. Data Re-Identification—A Case of Retrieving Masked Data from Electronic Toll Collection. Symmetry 2019, 11, 550. https://doi.org/10.3390/sym11040550
Chicago/Turabian Style: Huang, Hsieh-Hong, Jian-Wei Lin, and Chia-Hsuan Lin. 2019. "Data Re-Identification—A Case of Retrieving Masked Data from Electronic Toll Collection" Symmetry 11, no. 4: 550. https://doi.org/10.3390/sym11040550