This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Article

Iterative Application of UMAP-Based Algorithms for Fully Synthetic Healthcare Tabular Data Generation

1
Intelligent Data Science and Artificial Intelligence Research Center, Technical University of Catalonia, Nexus II Building, Jordi Girona 29, 08034 Barcelona, Spain
2
Institute of Robotics and Industrial Informatics (CSIC-UPC), Llorens i Artigas 4, 08028 Barcelona, Spain
*
Authors to whom correspondence should be addressed.
Algorithms 2024, 17(12), 591; https://doi.org/10.3390/a17120591 (registering DOI)
Submission received: 31 October 2024 / Revised: 18 December 2024 / Accepted: 19 December 2024 / Published: 21 December 2024

Abstract

Building on a previously developed partially synthetic data generation algorithm utilizing data visualization techniques, this study extends the novel algorithm to generate fully synthetic tabular healthcare data. In this enhanced form, the algorithm serves as an alternative to conventional methods based on Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). By iteratively applying the original methodology, the adapted algorithm employs UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction technique, to validate generated samples through low-dimensional clustering. This approach has been successfully applied to three healthcare domains: prostate cancer, breast cancer, and cardiovascular disease. The generated synthetic data have been rigorously evaluated for fidelity and utility. Results show that the UMAP-based algorithm outperforms GAN- and VAE-based generation methods across different scenarios. In fidelity assessments, it achieved smaller maximum distances between the cumulative distribution functions of real and synthetic data for different attributes. In utility evaluations, the UMAP-based synthetic datasets enhanced machine learning model performance, particularly in classification tasks. In conclusion, this method represents a robust solution for generating secure, high-quality synthetic healthcare data, effectively addressing data scarcity challenges.
Keywords: fully synthetic data; UMAP; healthcare tabular data; data augmentation
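
The fidelity criterion named in the abstract, the maximum distance between the cumulative distribution functions of real and synthetic data for an attribute, corresponds to the two-sample Kolmogorov–Smirnov statistic. The following is a minimal sketch of how such a per-attribute comparison could be computed, not the authors' evaluation code. The function names `max_cdf_distance` and `fidelity_report`, the use of NumPy, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def max_cdf_distance(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))
    # The gap can only change at observed values, so evaluate both CDFs there.
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synth = np.searchsorted(synthetic, grid, side="right") / synthetic.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def fidelity_report(real_table, synthetic_table, names):
    """Map each attribute name (hypothetical) to its maximum CDF distance."""
    real_table = np.asarray(real_table, dtype=float)
    synthetic_table = np.asarray(synthetic_table, dtype=float)
    return {
        name: max_cdf_distance(real_table[:, j], synthetic_table[:, j])
        for j, name in enumerate(names)
    }


if __name__ == "__main__":
    # Toy data standing in for a real table and a synthetic table.
    rng = np.random.default_rng(0)
    real = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
    synthetic = rng.normal(loc=0.1, scale=1.0, size=(500, 2))
    for name, dist in fidelity_report(real, synthetic, ["attr_a", "attr_b"]).items():
        print(f"{name}: {dist:.3f}")
```

A value of 0 means the two empirical distributions coincide for that attribute, and a value of 1 means they do not overlap at all, so smaller values indicate higher fidelity under this criterion.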

Share and Cite

MDPI and ACS Style

Lázaro, C.; Angulo, C. Iterative Application of UMAP-Based Algorithms for Fully Synthetic Healthcare Tabular Data Generation. Algorithms 2024, 17, 591. https://doi.org/10.3390/a17120591

AMA Style

Lázaro C, Angulo C. Iterative Application of UMAP-Based Algorithms for Fully Synthetic Healthcare Tabular Data Generation. Algorithms. 2024; 17(12):591. https://doi.org/10.3390/a17120591

Chicago/Turabian Style

Lázaro, Carla, and Cecilio Angulo. 2024. "Iterative Application of UMAP-Based Algorithms for Fully Synthetic Healthcare Tabular Data Generation" Algorithms 17, no. 12: 591. https://doi.org/10.3390/a17120591

APA Style

Lázaro, C., & Angulo, C. (2024). Iterative Application of UMAP-Based Algorithms for Fully Synthetic Healthcare Tabular Data Generation. Algorithms, 17(12), 591. https://doi.org/10.3390/a17120591
