Using UMAP for Partially Synthetic Healthcare Tabular Data Generation and Validation
Abstract
1. Introduction
2. Related Work
2.1. Fully and Partially Synthetic Data
2.2. Data Imputation
2.3. Data Visualization and Dimensionality Reduction Tools
2.4. Privacy Preservation
3. Materials and Methods
3.1. Datasets
3.1.1. Prostate Cancer
3.1.2. Breast Cancer
3.2. Comparison of Visualization Tools for Database Representation
3.3. Proposed Methodology
3.4. Formal Framework
3.4.1. Real-World Data Sets
3.4.2. Synthetic Data Sets
3.4.3. Problem Definition
- Fully synthetic generation. The proposed methodology aims to generate a fully synthetic data set from a real-world reference data set whose instances each have d features.
- Partially synthetic generation. This approach also relies on a reference data set whose instances each have d features, along with an incomplete data set that lacks some of those features. For simplicity, this study assumes that only one feature (x) is missing from the incomplete data set and that it is the only feature to be synthetically generated. Every instance of the incomplete data set therefore has d − 1 known features and the same single unknown feature. The goal is to generate a partially synthetic data set that combines the d − 1 known features with synthetic values for the unknown feature.
- Imputation. Imputation methodologies also operate on an incomplete data set, but the missing features may vary across entries. For instance, each entry may lack one feature, not necessarily the same one. The imputation can rely either on the non-missing values of the incomplete data set itself (option A) or on a reference data set with no missing values (option B).
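The partially synthetic and imputation settings differ only in where the missing values sit. The following is a minimal sketch in plain Python, assuming a made-up toy table: the names `reference`, `incomplete_partial`, `incomplete_imputation` and every value in them are illustrative and are not taken from the study. `None` marks a missing value.

```python
# Toy illustration of the problem settings; all values are invented.
# Each row is one instance; None marks a missing value.

# Reference data set: complete, d = 3 features per instance.
reference = [
    [61.0, 7.2, 1.0],
    [58.0, 5.1, 0.0],
    [70.0, 9.8, 1.0],
]

# Partially synthetic setting: the same single feature (index 1 here)
# is missing for every instance of the incomplete data set.
incomplete_partial = [
    [64.0, None, 1.0],
    [55.0, None, 0.0],
]

# Imputation setting: one feature is missing per entry, but which one
# may differ from entry to entry.
incomplete_imputation = [
    [64.0, None, 1.0],
    [None, 6.3, 0.0],
]


def missing_columns(rows):
    """Return, for each row, the set of column indices that are missing."""
    return [{j for j, v in enumerate(row) if v is None} for row in rows]


def has_common_missing_feature(rows):
    """True when every row is missing exactly the same single feature."""
    patterns = missing_columns(rows)
    one_each = all(len(p) == 1 for p in patterns)
    same_everywhere = len({frozenset(p) for p in patterns}) == 1
    return one_each and same_everywhere
```

Under these assumptions, `has_common_missing_feature` separates the two incomplete settings: it holds for `incomplete_partial` and fails for `incomplete_imputation`.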
3.4.4. Problem Equivalence
3.4.5. Visualization Tools
3.5. Partially Synthetic Methodology Proposal
Algorithm 1. Synthetic Data Generation with UMAP-based Dimensionality Reduction
3.6. Data Privacy and Workflow Strategies
- Case 0: The output is unknown, but all features are known. When outputs or classes are missing, the validation process involves cluster validation, which requires the cluster centroid coordinates and a threshold radius.
- Case 1: The output is known, but one feature is missing. When the aim is to assign generated values to a missing feature, the validation process involves assigning a reliability score. This requires, for each instance i, the mean value of feature x over its k-nearest neighbors.
- Case 2: Both the output and a feature are unknown. When the unknown feature must be generated and the samples must also be assigned to clusters, the validation process includes both steps.
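The two validation steps named in the cases above can be sketched as follows. This is only an illustration under stated assumptions: distances are plain Euclidean distances in whatever space the points live in, feature disparity is taken to be the absolute difference between the generated value and the neighbor mean, and the `SCORE_STEPS` thresholds are hypothetical placeholders. None of these choices, nor the function names, are confirmed by the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def inside_cluster(point, centroid, radius):
    """Case 0 check: accept a sample lying within `radius` of a centroid."""
    return euclidean(point, centroid) <= radius


def knn_mean_feature(query, ref_points, ref_x, k):
    """Case 1 input: mean of feature x over the k reference points
    closest to `query`."""
    order = sorted(range(len(ref_points)),
                   key=lambda i: euclidean(query, ref_points[i]))
    nearest = order[:k]
    return sum(ref_x[i] for i in nearest) / len(nearest)


def feature_disparity(generated_x, neighbor_mean):
    """Assumed definition: absolute difference (not confirmed by the text)."""
    return abs(generated_x - neighbor_mean)


# Hypothetical (upper bound on disparity, reliability score) steps.
SCORE_STEPS = [(0.5, 1.0), (1.0, 0.9), (2.0, 0.8)]


def reliability_score(disparity, steps=SCORE_STEPS, floor=0.0):
    """Return the score of the first step whose bound covers `disparity`."""
    for bound, score in steps:
        if disparity <= bound:
            return score
    return floor


# Invented example: four reference points with known values of feature x.
ref_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
ref_x = [2.0, 4.0, 6.0, 100.0]

mean_x = knn_mean_feature([0.1, 0.1], ref_points, ref_x, k=3)
score = reliability_score(feature_disparity(4.3, mean_x))
```

Case 2 would simply run `inside_cluster` and `reliability_score` on the same sample.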
- Option 1: Hospital 2 shares the generated value for each sample with Hospital 1. Hospital 1 calculates the feature disparity and assigns a reliability score to each generated sample. Hospital 1 then sends the reliability scores back to Hospital 2, which keeps the desired samples.
- Option 2: Hospital 1 shares the mean neighbor value of feature x for each sample i with Hospital 2. Hospital 2 calculates the feature disparity and derives the reliability scores itself.
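The two options differ only in which quantity crosses the institutional boundary. The sketch below makes that explicit. It assumes that the neighbor means are already available at Hospital 1, that disparity is an absolute difference, and that `score_from_disparity` is a hypothetical stand-in for the actual disparity-to-score mapping; all function names are invented for illustration.

```python
def score_from_disparity(disparity):
    """Hypothetical mapping from disparity to a score in [0, 1]."""
    return max(0.0, 1.0 - disparity)


def option_1(generated_by_h2, neighbor_means_at_h1):
    """Hospital 2 sends generated values; Hospital 1 scores and replies."""
    sent_to_h1 = list(generated_by_h2)        # generated values cross over
    scores = [
        score_from_disparity(abs(g - m))
        for g, m in zip(sent_to_h1, neighbor_means_at_h1)
    ]
    return scores                             # scores are sent back


def option_2(generated_by_h2, neighbor_means_at_h1):
    """Hospital 1 sends neighbor means; Hospital 2 scores locally."""
    sent_to_h2 = list(neighbor_means_at_h1)   # neighbor means cross over
    return [
        score_from_disparity(abs(g - m))
        for g, m in zip(generated_by_h2, sent_to_h2)
    ]


def keep_reliable(samples, scores, threshold):
    """Hospital 2 keeps the samples whose score reaches `threshold`."""
    return [s for s, r in zip(samples, scores) if r >= threshold]


# Invented example values.
generated = [4.5, 7.0]
neighbor_means = [4.0, 9.0]
kept = keep_reliable(generated, option_1(generated, neighbor_means), 0.5)
```

Both functions return the same scores by construction; the choice between them is about which values each institution is willing to disclose, not about the result.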
4. Results
4.1. CIA Synthetic Data Generation from PI-CAI
4.2. BCNB Synthetic Data Generation from BC-MLR
4.3. PI-CAI Validation
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
VAEs | Variational Autoencoders |
GANs | Generative Adversarial Networks |
DM | Diffusion Models |
BGM | Bayesian Gaussian Mixture |
PATE | Private Aggregation of Teacher Ensembles |
MCMC | Markov Chain Monte Carlo |
LSTM | Long Short-Term Memory |
MI | Multiple Imputation |
MICE | Multivariate Imputation by Chained Equations |
UMAP | Uniform Manifold Approximation and Projection |
t-SNE | t-distributed stochastic neighbor embedding |
PCA | Principal Component Analysis |
EHR | Electronic Health Record |
IoMT | Internet of Medical Things |
PI-CAI | Prostate Imaging and Cancer Analysis Information |
CIA | Cancer Imaging Archive |
GS | Gleason Score |
PSDG | Partially Synthetic Data Generation |
PSA | Prostate-Specific Antigen |
Feature Disparity | Reliability Score |
---|---|
1 | |
0.9 | |
0.8 | |
0.7 | |
0.6 | |
0.5 |
Feature | Correlation |
---|---|
age | −0.072 |
menopause | 0.052 |
tumor-size | 0.13 |
inv-nodes | 0.27 |
node-caps | 0.24 |
deg-malig | 0.3 |
breast | −0.059 |
breast-quad | 0.037 |
irradiat | 0.19 |
Age Range | 35–44 | 45–54 | 55–64 | 65–74 | 75–84 | 85–92 | Mean % |
---|---|---|---|---|---|---|---|
UMAP PSDG | 16.67 | 83.93 | 91.73 | 88.09 | 66.26 | 0.00 | 57.78 |
 | 11.11 | 78.57 | 89.03 | 85.11 | 59.09 | 0.00 | 53.82 |
 | 5.56 | 70.24 | 84.99 | 80.19 | 52.97 | 0.00 | 48.99 |
 | 5.56 | 70.24 | 84.99 | 80.19 | 52.97 | 0.00 | 48.99 |
 | 0.00 | 54.46 | 76.25 | 71.30 | 41.61 | 0.00 | 40.60 |
 | 0.00 | 33.04 | 53.39 | 48.68 | 22.73 | 0.00 | 26.31 |
Mean | 0 | 0 | 4.93 | 7.17 | 0 | 0 | 2.02 |
k-NN | 0 | 0 | 1.30 | 0.69 | 0.70 | 0 | 0.45 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lázaro, C.; Angulo, C. Using UMAP for Partially Synthetic Healthcare Tabular Data Generation and Validation. Sensors 2024, 24, 7843. https://doi.org/10.3390/s24237843