1. Introduction
In the rapidly evolving field of machine learning and neural networks, the demand for data is greater than ever [1]. The effectiveness of these advanced computational models depends crucially on large volumes of high-quality data to uncover patterns, make accurate predictions, and drive further innovation across a variety of domains. However, acquiring such large datasets, particularly in areas such as healthcare and cybersecurity, is often constrained by diverse challenges [2].
The application of synthetic data to various machine learning tasks, with a particular focus on tabular data, has garnered significant attention in the recent literature. A previous review delivers a thorough examination of this topic [3]. Whereas traditional reviews are often based on keyword searches, that study distinguishes itself by presenting a detailed classification of 70 generation algorithms, explaining six main types of generation mechanisms, and discussing metrics designed to assess the quality of synthetic data. By addressing existing gaps in the fragmented literature, it provides valuable insights to aid researchers and practitioners in the effective use of synthetic data.
Federated Learning (FL) has emerged as a decentralized approach to training statistical models by leveraging data from multiple clients without exposing their raw data, thus ensuring privacy and introducing potential security advantages [4]. Within this context, federated synthesis, which employs FL for synthetic data generation, enables the amalgamation of data without compromising privacy or requiring access to raw data. A scoping review of 69 articles spanning from 2018 to 2023 underscores the prevalent use of deep learning methods, notably generative adversarial networks, in federated synthesis. Although promising, federated synthesis requires further research to deepen the understanding of privacy risks and to develop reliable methodologies for measuring them.
The pursuit of improved machine learning algorithms for personalized decision support in palliative care diagnostics underlines the need for more relevant patient data. Synthetic data generation emerges as a potential solution to this demand, although challenges such as bias and interpretability persist. An extensive review examines the potential consequences of applying synthetic data to machine learning-based palliative care diagnostics [5]. Furthermore, its authors provide valuable insights and practical considerations for integrating this approach into clinical settings.
The increasing demand for large datasets confronts the persistent obstacle of imbalanced data [6]. The nature of certain phenomena results in skewed distributions, with some categories vastly outnumbering others. This imbalance not only creates challenges in model training but also carries the risk of propagating biases, potentially undermining the reliability of predictive analyses. Achieving a balanced representation of the different classes is therefore crucial for the robust performance of machine learning models.
Alongside these challenges, privacy remains a vital concern [7], especially for patients’ personal data in real-time healthcare applications. Preserving sensitive information is not just a legal and ethical necessity but also a fundamental requirement for building trust and ensuring the ethical use of data. As we navigate the complex terrain of machine learning and data-intensive applications, it becomes urgent to explore innovative solutions that reconcile the demand for large, balanced datasets with the protection of privacy.
This review undertakes a thorough exploration of contemporary research efforts that address these challenges. Delving into the literature, we uncover a range of methodologies, with a particular focus on the innovative use of generative adversarial networks (GANs) for generating synthetic data. The necessity of large datasets, the intricacies of handling imbalanced ones, and the urgency of privacy protection are recurring themes across a wide range of domains—from cybersecurity and attack detection to healthcare and patient-oriented applications.
We employed BERT (Bidirectional Encoder Representations from Transformers) as a pivotal tool to conduct topic modeling on the collected studies. Preprocessing steps, including tokenization, lowercasing, stop-word removal, and lemmatization, were applied to ensure the cleanliness of the text data. BERTopic, a topic modeling library specifically designed to harness the power of BERT embeddings, facilitated the extraction of topics from the corpus. BERT embeddings, renowned for their ability to capture semantic nuances in text data, were generated for each document in the corpus. These embeddings served as rich representations of the textual content, enabling us to delve deeper into the underlying themes present in the literature. Clustering techniques, particularly Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), were then applied to identify cohesive clusters representing distinct topics within the corpus. After clustering, representative keywords were assigned to each topic cluster based on the most prominent terms within the documents, enhancing the interpretability of the results. The utilization of BERT embeddings in conjunction with clustering algorithms allowed for a nuanced analysis of the literature, providing insights into the key themes and implications for synthetic data generation, as presented in Figure 1.
Furthermore, our methodology leveraged the adaptability of BERTopic, which dynamically adjusts to the complexity of the data without requiring a predefined number of topics. This flexibility ensured that significant words were preserved in the topic descriptions, maintaining the semantic integrity and coherence of the extracted topics. Overall, our approach showcases the potential of deep learning-based algorithms, such as BERTopic, in the domain of text mining and processing, particularly in facilitating topic extraction and analysis from large collections of scientific articles.
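To make this pipeline concrete, the sketch below shows how such an analysis can be assembled from the BERTopic library, a sentence-transformer embedding model, and HDBSCAN clustering. It is a minimal illustration rather than our exact configuration: the embedding model name, the hyperparameters, and the load_corpus helper (returning a list of document strings) are assumptions.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

# abstracts: a list of preprocessed document strings.
abstracts = load_corpus()  # hypothetical helper returning list[str]

# BERT-based sentence embeddings capture semantic similarity between documents.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# HDBSCAN finds dense clusters of embeddings; no fixed topic count is required.
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        prediction_data=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(abstracts)

# Inspect the discovered topics and their most representative terms.
print(topic_model.get_topic_info().head())
fig = topic_model.visualize_barchart(top_n_topics=4, n_words=5)  # cf. Figure 1
```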
Figure 1 displays the results obtained using BERTopic. Specifically, the bar charts represent four distinct topics generated by BERT, showcasing the top five associated words or terms for each topic. The x-axis of these bar charts represents the c-TF-IDF score, which quantifies the relevance of terms by evaluating their frequency within a particular document and their distinctiveness across the entire document corpus.
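For reference, the class-based TF-IDF (c-TF-IDF) score used by BERTopic is commonly formulated as follows (our rendering of the standard definition, in which each topic cluster is treated as one aggregate document):

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{\mathrm{tf}_{t}}\right),$$

where $\mathrm{tf}_{t,c}$ is the frequency of term $t$ within topic class $c$, $\mathrm{tf}_{t}$ is the frequency of $t$ across all classes, and $A$ is the average number of words per class. Higher scores thus flag terms that are frequent in one topic but rare elsewhere.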
As we navigate the varied terrain of synthetic data generation, we aim to distill knowledge, draw connections, and provide a comprehensive overview of the evolving methodologies that address these sophisticated challenges. Through a rigorous inspection of the reviewed papers, we seek to unravel the potential confluences between the need for extensive, balanced datasets, the intricacies of handling imbalances, and privacy requirements—paving the way for a more informed and ethical approach to data-intensive applications in the field of machine learning and neural networks.
3. Discussion
The review presented above highlights the significant advancements and innovative approaches in data generation methodologies, particularly focusing on statistical-based, machine learning-based, and privacy-preserving techniques across various application domains. These methodologies have demonstrated their potential to overcome challenges such as data scarcity, privacy concerns, and class imbalance, thereby opening new avenues for research and applications in diverse fields. Unlike typical reviews that are based solely on an exploration of the literature, this study incorporates a synthesis of existing methodologies with a critical analysis and broadens the discussion by offering new insights and perspectives.
A significant aspect of these studies is the growing importance of synthetic data generation techniques in addressing the limitations of real-world datasets. Statistical-based approaches, such as GenerativeMTD and the divide-and-conquer (DC) strategy, have shown promise in accurately representing complex data relationships while ensuring data balance. These methods offer practical solutions for generating synthetic datasets that closely resemble real-world scenarios, particularly in domains like healthcare and software engineering.
Another aspect worth discussing is the transferability of synthetic data generation techniques from one domain to another. While certain methodologies may demonstrate effectiveness in specific application domains, their adaptability to diverse datasets and contexts remains a topic of interest. For example, although these statistical-based approaches have shown promise in generating synthetic datasets that closely resemble real-world scenarios, further research is warranted to explore their applicability beyond the domains of healthcare and software engineering into sectors like finance, manufacturing, and telecommunications.
A systematic review addresses the challenges and potential of synthetic data generation for tabular health records [33]. Promising outcomes have been observed in previous studies on the synthesis of tabular data, particularly in fields such as healthcare and energy consumption. Despite these advancements, the study highlights the need for a reliable and privacy-preserving solution for handling valuable healthcare data. Focusing on methodologies developed within the past five years, the review elaborates on the role of generative adversarial networks (GANs) in healthcare applications. The evaluation of GAN-based approaches reveals progress, yet it underscores the necessity for further research to pinpoint the most effective model for generating tabular health data.
Similarly, machine learning-based generation techniques, including conditional generative adversarial networks (cGANs), variational autoencoders, and deep learning models, have demonstrated remarkable capabilities in generating synthetic data with high fidelity. Yet, their generalizability across different datasets and domains requires careful consideration. Understanding the nuances of dataset characteristics and domain-specific challenges is crucial for assessing the transferability of these models and methodologies. These approaches not only improve prediction accuracy in critical healthcare scenarios but also extend their applicability to diverse domains such as agriculture and energy consumption forecasting.
Furthermore, privacy preservation emerges as a critical concern in data generation, particularly in domains where sensitive information is involved. Innovative frameworks like Duo-GAN and hybrid GAN-based methodologies offer promising solutions for generating synthetic data while preserving user privacy. These techniques enable the release of private research data to the public while diminishing the risk of identification disclosure. Examining the robustness of these privacy-preserving techniques across diverse datasets and application domains can provide valuable insights into their generalizability and practical utility.
A thorough review delved into the realm of creating synthetic data for Intrusion Detection Systems (IDSs) using generative adversarial networks (GANs) [42]. The study examined several GAN architectures, notably VanillaGAN, WGAN, and WGAN-GP, alongside specific models such as CTGAN, CopulaGAN, and TableGAN. The evaluation focused on their performance using the NSL-KDD dataset. The findings of the study demonstrated the effectiveness of GANs in generating realistic network data for IDS applications. However, the study underscored a crucial point: the choice of GAN architecture and model significantly influenced the quality of the synthetic data. This review provides valuable insights for researchers and practitioners in the field of cybersecurity, emphasizing the necessary considerations when employing GANs for synthetic data generation in the context of IDS datasets.
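As a hedged illustration of how such tabular GANs are typically applied, the sketch below fits the SDV library's CTGAN implementation to a network-traffic table and samples synthetic records. The file path, epoch count, and sample size are placeholders, and this is a generic usage pattern rather than the exact setup evaluated in [42].

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load a preprocessed tabular IDS dataset (path is a placeholder).
real = pd.read_csv("nsl_kdd_preprocessed.csv")

# Infer column types (categorical protocol fields, numeric counters, etc.).
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

# CTGAN handles the mix of discrete and continuous columns in tabular data.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real)

# Generate synthetic network records for augmentation or sharing.
synthetic = synthesizer.sample(num_rows=10_000)
```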
Evaluation metrics play a crucial role in assessing the fidelity and utility of synthetic datasets. Standardized metrics such as TabSynDex enable comprehensive evaluations, measuring the similarity between real and synthetic data across various dimensions. By establishing universal measures for evaluating synthetic data quality, researchers and practitioners can make informed decisions about the effectiveness of different generative models.
While existing evaluation metrics provide valuable insights into the quality of synthetic tabular data, it is essential to acknowledge their limitations and strengths. One limitation of existing evaluation metrics is their reliance on specific statistical measures, which may not comprehensively capture the complexity of real-world data distributions. Metrics such as Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) focus primarily on quantifying the discrepancy between synthetic and real data distributions but may overlook nuances in data relationships and patterns. Additionally, these metrics may not adequately account for the diversity and variability present in real-world datasets, leading to potential inaccuracies in assessing synthetic data quality.
Moreover, existing evaluation metrics may exhibit limited suitability for evaluating different types of synthetic data, particularly across diverse application domains. While certain metrics like the Kolmogorov–Smirnov (KS) test and Wasserstein distance (WD) are commonly used to assess the similarity between distributions, their effectiveness may vary depending on the characteristics of the data and the underlying generation methodologies. For instance, metrics tailored to assess the fidelity of continuous data may not be directly applicable to categorical or mixed-type data, necessitating the development of domain-specific evaluation metrics.
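As a small, self-contained example of the distributional checks mentioned above, the following sketch compares one continuous column of real and synthetic data using the KS test and the Wasserstein distance from SciPy; the toy data are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def column_similarity(real_col, synth_col):
    """Per-column distributional comparison for continuous data."""
    ks_stat, ks_pvalue = ks_2samp(real_col, synth_col)  # max CDF gap
    wd = wasserstein_distance(real_col, synth_col)      # earth-mover cost
    return {"ks_statistic": ks_stat, "ks_pvalue": ks_pvalue,
            "wasserstein": wd}

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5_000)
synth = rng.normal(loc=0.1, scale=1.1, size=5_000)  # slightly off on purpose
print(column_similarity(real, synth))
```

Note that such marginal tests say nothing about cross-column relationships, which is precisely why composite or domain-specific metrics are also needed.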
Despite these limitations, existing evaluation metrics offer valuable insights into the quality and performance of synthetic tabular data generation techniques. Standardized measures such as TabSynDex, introduced above, remain useful precisely because they assess the similarity between real and synthetic data across several dimensions at once, helping to offset the blind spots of any single statistical test.
The application domains discussed in this study underscore the broad impact of synthetic data generation techniques. In healthcare, synthetic data facilitate medical research and diagnostic processes by addressing challenges related to data scarcity and privacy. Similarly, in retail, synthetic data offer a solution to data scarcity issues, enabling the development and testing of innovative strategies without relying on limited real-world data.
In a comprehensive survey, the complex domain of synthetic data generation was examined, with a specific emphasis on generative adversarial networks (GANs) [43]. Synthetic data become invaluable in scenarios where original data are scarce or of degraded quality, helping to improve the performance of machine learning models. The study covers various aspects, including GAN architectures, challenges and breakthroughs in their training, algorithms for synthesizing data, diverse applications, and methodologies for evaluating the quality of synthetic data. A distinctive feature of this research is its combined treatment of synthetic data generation and GANs, offering a clear entry point for researchers new to the field. The survey also explores the main techniques for evaluating the quality of synthetic data, with a particular focus on tabular data, making it a comprehensive and insightful resource for digging deeper into the complex field of synthetic data creation and GANs.
The reviewed methodologies and models offer diverse approaches to handling complex data relationships and interactions within tabular datasets, each with its strengths and limitations. Understanding how these techniques operate in various scenarios is crucial for assessing their effectiveness and applicability.
Statistical-based approaches, such as GenerativeMTD and the divide-and-conquer (DC) strategy, excel in capturing complex data relationships by leveraging mathematical principles and statistical inference. These methodologies analyze the underlying structures of tabular datasets and generate synthetic data that closely resemble real-world distributions. For example, GenerativeMTD effectively models temporal dependencies in time-series data, making it suitable for scenarios where sequential patterns are prevalent, such as financial forecasting and sensor data analysis. Similarly, the DC strategy partitions the dataset into smaller subsets, allowing for the localized modeling of complex relationships and interactions. However, these methodologies may struggle with high-dimensional datasets or nonlinear relationships, requiring careful parameter tuning and preprocessing steps to achieve satisfactory results.
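To illustrate the statistical-based family in the simplest possible terms, the sketch below implements a generic Gaussian-copula synthesizer for continuous columns: it preserves each column's marginal distribution and the pairwise correlation structure. It is a minimal stand-in for this class of methods, not an implementation of GenerativeMTD or the DC strategy.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(real, n_rows, seed=0):
    """Generic Gaussian-copula synthesizer for continuous columns:
    keep each marginal, keep pairwise correlations, sample new rows."""
    rng = np.random.default_rng(seed)
    n, d = real.shape

    # 1. Rank-transform each column to uniforms, then to normal scores.
    u = stats.rankdata(real, axis=0) / (n + 1)
    z = stats.norm.ppf(u)

    # 2. Estimate the dependence structure in the latent Gaussian space.
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample correlated normals and map back through each column's
    #    empirical quantile function to restore the original marginals.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_rows)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )

# Example: synthesize 500 rows resembling a small 3-column dataset.
rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 1, 2],
                               [[1, .6, .2], [.6, 1, .3], [.2, .3, 1]],
                               size=200)
synthetic = gaussian_copula_sample(data, n_rows=500)
```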
Machine learning-based generation techniques, including conditional generative adversarial networks (cGANs), variational autoencoders (VAEs), and deep learning models, offer powerful tools for capturing intricate data relationships and generating synthetic data with high fidelity. cGANs, for instance, excel in generating data samples conditioned on specific attributes or features, enabling fine-grained control over the synthesized output. VAEs, on the other hand, learn latent representations of the data distribution, allowing for continuous interpolation between data points and the exploration of the latent space. Deep learning models leverage hierarchical representations to capture spatial and temporal dependencies in tabular datasets. These techniques demonstrate remarkable capabilities in scenarios where data relationships are nonlinear or high-dimensional, such as image recognition, natural language processing, and time-series forecasting. However, they may require large amounts of training data and computational resources, and their interpretability can be limited compared to statistical-based approaches.
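As a concrete sketch of the conditioning idea behind cGANs, the PyTorch snippet below defines a minimal generator that concatenates a noise vector with a one-hot class label, so that sampling can be steered toward a chosen class. The dimensions are arbitrary, and the discriminator and training loop are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal cGAN generator: noise + one-hot class label -> tabular row."""
    def __init__(self, noise_dim: int, n_classes: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, z: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
        # Conditioning: concatenate the label with the noise vector so the
        # generator learns class-specific output distributions.
        return self.net(torch.cat([z, y_onehot], dim=1))

# Sample 64 synthetic rows conditioned on class 2 (dimensions are illustrative).
gen = ConditionalGenerator(noise_dim=32, n_classes=5, out_dim=10)
z = torch.randn(64, 32)
y = torch.nn.functional.one_hot(torch.full((64,), 2), num_classes=5).float()
rows = gen(z, y)
```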
Despite their strengths, both statistical-based and machine learning-based techniques may struggle with certain challenges, such as data sparsity, imbalanced class distributions, and outliers. For instance, generating synthetic data that accurately represent rare events or minority classes can be challenging, leading to biased or unrealistic outcomes. Moreover, ensuring the diversity and generalizability of synthetic datasets across different application domains remains an ongoing research area.
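For the class-imbalance problem in particular, interpolation-based oversamplers are a common baseline. The sketch below applies SMOTE from the imbalanced-learn library to a toy 95/5 dataset; note that, as argued above, interpolating between existing minority samples balances the counts but cannot invent genuinely new rare-event behavior.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A toy 95/5 imbalanced dataset standing in for a rare-event problem.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05],
                           random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes minority samples by interpolating between nearest
# neighbors of the minority class.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```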
In addition, considering the growing concerns around data privacy and fairness, it is imperative to thoroughly explore the ethical implications of synthetic data generation techniques. The development of algorithms based on machine learning techniques must take into account concepts such as data bias and fairness [44]. While the scientific literature proposes numerous techniques to detect and evaluate these problems in real datasets, less attention has been dedicated to methods for generating intentionally biased datasets, which could be used by data scientists to develop and validate unbiased and fair decision-making algorithms [45]. Synthetic data, emerging as a rich source of exposure to variability for algorithms, present unique ethical challenges. For instance, the deliberate modeling of bias in synthetic datasets using probabilistic networks raises questions about fairness and transparency in algorithmic decision-making processes. Moreover, the incorporation of synthetic data into machine learning algorithms reconfigures the conditions of possibility for learning and decision-making, warranting careful consideration of the ethicopolitical implications of synthetic training data. In light of these considerations, it is essential to assess the ethical implications of synthetic data generation techniques and develop potential mitigation strategies to ensure the responsible and equitable use of synthetic data in algorithmic decision-making processes.
In addition to privacy concerns and biases, addressing the ethicopolitical implications of synthetic data is crucial for fostering transparency and accountability in algorithmic decision-making. Synthetic data promise to place algorithms beyond the realm of risk by providing a controlled environment for training and testing, yet their usage raises questions about the societal impact of algorithmic decision-making. As machine learning algorithms become deeply embedded in contemporary society, understanding the role played by synthetic data in shaping algorithmic models and decision-making processes is paramount [46]. Moreover, developing guidelines and best practices for the ethical use of synthetic data can help mitigate potential risks and ensure that algorithmic decision-making processes uphold principles of fairness, transparency, and accountability in diverse societal domains.
Taking into account all of these factors, deploying synthetic data generation techniques in different sectors or industries presents various practical challenges and implementation barriers that warrant careful consideration. Understanding these challenges is essential for effectively leveraging synthetic data generation methods in real-world applications.
One practical challenge is the availability of high-quality training data representative of the target domain. While synthetic data generation techniques offer a means to augment limited or unavailable real-world datasets, ensuring the fidelity and diversity of synthetic data remains a key concern. In many sectors, obtaining labeled training data that accurately reflect the underlying data distributions and capture domain-specific nuances can be challenging. Moreover, maintaining the balance between data diversity and privacy preservation introduces additional complexities, especially in highly regulated industries such as healthcare and finance.
Implementation barriers also arise from the computational and resource-intensive nature of certain synthetic data generation techniques. For instance, machine learning-based approaches, such as generative adversarial networks (GANs) and deep learning models, often require significant computational resources and expertise to train and deploy effectively. In sectors with limited access to computational infrastructure or data science expertise, deploying and maintaining such techniques can be prohibitively challenging. Additionally, ensuring the scalability and efficiency of synthetic data generation pipelines to accommodate large-scale datasets and real-time data generation further complicates implementation efforts.
Furthermore, the interpretability and explainability of synthetic data generation techniques pose challenges in sectors where transparency and accountability are paramount. Understanding how synthetic data are generated and their implications for downstream decision-making processes is crucial for building trust and confidence among end-users and stakeholders. Providing transparent documentation of the synthetic data generation process and validation methodologies is essential for fostering trust and facilitating adoption in diverse sectors and industries.
Despite all of these challenges, the future evolution of synthetic tabular data generation techniques holds great promise, driven by advancements in machine learning, artificial intelligence, and data generation methodologies. Envisioning the trajectory of this field involves anticipating key trends and developments that are likely to shape the landscape of synthetic data generation in the coming years.
One key direction for future research is the development of more sophisticated generative models capable of capturing complex data relationships and distributions with greater fidelity. Machine learning techniques such as deep generative models, including deep neural networks and variational autoencoders, are poised to play a central role in this evolution. By leveraging hierarchical representations and advanced optimization algorithms, these models offer the potential to generate synthetic data that closely resemble real-world datasets across diverse domains and application scenarios.
Moreover, the integration of domain knowledge and expert insights into the synthetic data generation process is expected to enhance the realism and utility of generated datasets. Hybrid approaches that combine statistical modeling techniques with machine learning algorithms enable the incorporation of domain-specific constraints and priors into the generative process. For instance, incorporating structural equation modeling or Bayesian networks to encode domain knowledge can improve the interpretability and fidelity of synthetic data, making them more suitable for downstream applications such as predictive modeling and decision support systems.
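To give a flavor of how domain knowledge can be encoded this way, the sketch below performs ancestral sampling from a tiny hand-specified Bayesian network (age influences risk, which influences outcome). The structure and all probabilities are invented for illustration; a real application would elicit them from domain experts or fit them to data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patients(n):
    """Ancestral sampling from a toy Bayesian network: age -> risk -> outcome.
    Every probability below is a made-up illustrative prior."""
    age = rng.integers(20, 90, size=n)

    # Expert prior: risk of the condition rises roughly linearly with age.
    p_risk = np.clip((age - 20) / 100.0, 0.05, 0.60)
    at_risk = rng.random(n) < p_risk

    # Outcome depends on risk status (a two-row conditional probability table).
    p_outcome = np.where(at_risk, 0.40, 0.10)
    outcome = rng.random(n) < p_outcome

    return {"age": age, "at_risk": at_risk, "outcome": outcome}

synthetic_cohort = sample_patients(1_000)
```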
Another important direction for future research is the development of robust evaluation methodologies and benchmarking frameworks for assessing the quality and utility of synthetic tabular data. As synthetic data generation techniques continue to evolve, it becomes increasingly important to establish standardized evaluation metrics and datasets that enable fair comparisons across different methodologies. By promoting transparency and reproducibility in the evaluation process, researchers can facilitate the adoption and validation of novel techniques and accelerate innovation in the field.
Furthermore, addressing ethical and societal implications remains a critical aspect of the future evolution of synthetic data generation techniques. As synthetic data become more prevalent in various sectors and industries, ensuring fairness, transparency, and accountability in their generation and usage is paramount. Interdisciplinary collaborations between researchers from data science, ethics, law, and social sciences can help navigate the complex ethical landscape of synthetic data generation and develop guidelines for responsible data usage.
Overall, this review has sought to highlight the growing potential of synthetic data generation methodologies in overcoming data-related challenges across various domains. Continued research and development in this field are essential for advancing data science and unlocking new opportunities for innovation and breakthroughs.