Synthetic Data as a Proxy for Real-World Electronic Health Records in the Patient Length of Stay Prediction

Bietsch, Dominik; Stahlbock, Robert; Voß, Stefan

doi:10.3390/su151813690


by Dominik Bietsch ¹, Robert Stahlbock ¹,² and Stefan Voß ¹,*

¹ Institute of Information Systems, University of Hamburg, Von-Melle-Park 5, 20146 Hamburg, Germany

² FOM University of Applied Sciences, Leimkugelstr. 6, 45141 Essen, Germany

* Author to whom correspondence should be addressed.

Sustainability 2023, 15(18), 13690; https://doi.org/10.3390/su151813690

Submission received: 2 June 2023 / Revised: 9 July 2023 / Accepted: 5 September 2023 / Published: 13 September 2023

(This article belongs to the Section Health, Well-Being and Sustainability)


Abstract:

While generative artificial intelligence has gained popularity, e.g., for the creation of images, it can also be used for the creation of synthetic tabular data. This bears great potential, especially for the healthcare industry, where data are often scarce and subject to privacy restrictions. For instance, the creation of synthetic electronic health records (EHR) promises to improve the usage of machine learning algorithms, which usually work with large amounts of data. This also applies to the prediction of the patient length of stay (LOS), a key measure for hospitals. The LOS is one of the core tools for decision makers to plan the allocation of resources. Thus, this paper aims to add to the still-young research concerning the application of generative adversarial nets (GAN) to tabular EHR. It does so with the intention of leveraging the advantages of synthetic data for the prediction of the LOS in order to contribute to the efficiency-enhancing and cost-saving aspirations of hospitals and insurance companies. Therefore, the applicability of synthetic data that are generated using GANs as a proxy for scarce real-world EHR for the patient LOS multi-class classification task is examined. In this context, the Conditional Tabular GAN (CTGAN) and the Copula GAN are selected as the underlying models as they are state-of-the-art GAN architectures designed for generating synthetic tabular data. The CTGAN is found to be the superior model for the underlying use case. Nevertheless, the paper shows that there is still room for improvement when applying state-of-the-art GAN architectures to clinical healthcare data.

Keywords:

generative artificial intelligence; synthetic tabular data; healthcare industry; synthetic electronic health records (EHR); patient length of stay (LOS)

1. Introduction

As a useful tool for planning the allocation of resources (e.g., health workers and bed capacity) and for measuring the quality of service or efficiency, the length of stay (LOS) of a patient is of great interest for hospital management [1]. Especially because of the costs connected to each day a patient stays in a hospital, the LOS is among the key measures within the inpatient hospital setting and an area of focus for hospitals and insurance companies [2]. According to the American Medical Association, US health spending alone grew 10.3% in 2020 and another 2.7% in 2021, reaching 4255.1 billion USD, of which 31.1% is attributable to hospital care [3]. Thus, hospitals face increasing financial pressure to reduce LOS through changes in policy [1], e.g., through the Bundled Payments for Care Improvement Initiative by Medicare, with which treatments are compensated at a flat rate (the initiative was introduced by the Center for Medicare and Medicaid Innovation; link to the initiative: https://innovation.cms.gov/innovation-models/bundled-payments, last access date 8 July 2023) [2]. Therefore, hospitals are motivated to reduce LOS and, thus, boost the rate of turnover in order to increase their margin of profit [4,5]. Considering this, an enhanced comprehension of the factors that influence the patient LOS as well as a further increase in the prediction capabilities for the patient LOS as an indicator for innovation are needed [2]. Nevertheless, hospitals still heavily rely on rather inaccurate averaging methods or expert knowledge when predicting the inpatient LOS [6]. At the same time, working with electronic health records (EHR) in a machine learning context is notoriously hard. While many algorithms work best with large amounts of data, the collection and labeling of EHR is very costly and time-consuming. This holds especially for patient data that often need to be collected over larger time spans in order to obtain a critical mass of data. Most importantly, this type of data almost always contains sensitive information that is subject to data protection regulations such as the GDPR (General Data Protection Regulation) in Europe or the CCPA/CPRA (California Consumer Privacy Act/California Privacy Rights Act) in the USA. Hence, it is often challenging to build appropriate machine learning algorithms as data availability remains an issue [7]. The creation of synthetic data through generative adversarial networks (GAN) could be one solution to tackle this field of problems connected to EHR, as it not only implies privacy [8] but is also able to remedy data scarcity or an imbalance in data sets [9]. Since GANs were introduced by Goodfellow et al. [10] in 2014, the related research has surged. Nevertheless, there are still few publications regarding the use of GANs for the creation of tabular data and EHR [9].

This paper aims to add to the still-young research concerning the application of GANs to tabular EHR. It does so with the intention of further researching and leveraging the advantages of synthetic data for the prediction of the LOS in order to contribute to the efficiency-enhancing and cost-saving aspirations of hospitals and insurance companies. It specifically aims to examine the creation of synthetic data as a proxy for scarce real-world data sets, as this bears the potential for huge time and cost savings for the clinic and its personnel. Although there is some related research on this topic, it is mainly limited to specific patient groups, e.g., cystic fibrosis patients [11], or uses large data sets and, thus, does not address the issue of data scarcity for the creation of synthetic data [11,12]. Therefore, this work aims to close this gap by answering open research questions through the application of generator models to smaller real-world data sets to test their performance on scarce data. As state-of-the-art GAN architectures [13] for the creation of synthetic tabular data, the Conditional Tabular GAN (CTGAN) and the Copula GAN [14] are chosen as the underlying generator models. The CTGAN approach seems to be the most promising as it allows for accurate replication of real-world data sets through the ability of conditional sampling. In addition, this study employs a multi-class classification approach instead of a binary classification task to test for machine learning efficacy in the context of the patient LOS prediction, aiming to ensure a more rigorous evaluation compared to previously published research.

The paper is structured as follows. Related works are introduced and compared to our work in Section 2. Next, the methods used are presented in Section 3. Subsequently, the results of this research are presented in Section 4 and discussed in Section 5. Lastly, the conclusions are given in Section 6.

2. Related Work

As mentioned before, the research on GANs for tabular data and, in particular, tabular health data is still relatively new. A first step towards the development of a GAN purposed to create tabular EHR is represented by the paper, “Generating Multi-label Discrete Patient Records using Generative Adversarial Networks” by Choi et al. [15].

2.1. Generating Multi-Label Discrete Patient Records Using Generative Adversarial Networks

The motivation of this paper can be found in the restrictions that concern health data, such as lengthy review processes by legal departments that are required before access to medical patient data is granted. It is acknowledged that these processes restrict the timely use of such data and, thus, have the potential to slow down progress in patient care and research. Through the creation of synthetic data, the authors aim to remedy the risk connected to the sharing of real-world EHR. Therefore, Choi et al. [15] introduce medical GAN (medGAN) as, prior to this, GANs were not utilized to learn discrete feature distributions. The authors address this issue by supporting the learning process of the distribution of discrete multi-label features, e.g., medication and diagnosis codes, by incorporating an autoencoder into the generator. Furthermore, several improvements to the GAN framework are introduced to support the convergence of the algorithm, which is found to be especially useful in the context of data sets containing EHR. In order to evaluate the applicability of medGAN, the authors compare it with several other GAN frameworks as well as a Variational Autoencoder (VAE) model. Each model is applied to three data sets, of which two consist of binary and one of count values. The privacy-preserving capabilities of the models are estimated by conducting a membership inference attack, with which an attacker can determine whether a data record stems from the training data set with which an underlying algorithm was trained. The authors find that the introduced medGAN shows very good results for both count and binary variables and has compelling data privacy-preserving capabilities. However, the evaluation of the count data in particular indicates one of the model's limitations as, e.g., male patients were assigned diagnosis codes that are usually exclusively related to female patients and vice versa. Furthermore, only data sets containing either binary or count data are utilized in the paper. This stands in contrast to EHR data, which normally consist of mixed data types.

2.2. Generation of Differentially Private Heterogeneous Electronic Health Records

In response to the paper introducing the medGAN approach, further scientific work aiming to enhance the medGAN model as well as introducing new GAN frameworks for the creation of synthetic EHR has been published. Nevertheless, Chin-Cheong et al. [16] are among the first to examine the creation of synthetic EHR from data sets containing mixed data types including demographic patient data such as the age and the weight of the patient. As the underlying data set, the New Zealand National Minimal data set [17] is used, containing both dense administrative as well as sparse diagnostic data. To determine the best model for this task, the medGAN, the WGAN model, as well as the WGAN with gradient penalty (WGAN-GP) are compared. Interestingly, the CTGAN architecture is also tested and ultimately rejected due to an extended training time. For the performance evaluation of the different models, the distribution of each artificially created column is compared with that of the corresponding original column of the source data. To compare the overall data sets, the Frobenius norms (a matrix's Frobenius norm can be calculated as the square root of the sum of each element's squares) of the synthetic and the real data set are compared. As a result of these trials, the WGAN-GP is found to be superior to the other models. The selected model is used for a downstream experiment in which the authors apply differential privacy (DP) to the WGAN-GP architecture. By using DP, noise is added when a synthetic data point is created, aiming to enforce enhanced and rigorous privacy-preserving capabilities of the model. To evaluate the applicability of the generated data sets as a proxy for the original data set, their machine learning efficacy is tested. Therefore, a binary classification task predicting the patient's readmission to a hospital is conducted on real data as well as synthetic data. The authors find that the classifiers only show a small reduction of prediction power when trained on synthetic data. Nevertheless, the classifiers show significantly worse performance on data created by the WGAN-GP model using DP, which is still accepted due to the anticipated increase in data privacy. Similar to this paper, we aim to create synthetic EHR under the presence of mixed data types. While the results seem promising, this paper also leaves room for further research. For example, the used data set with over 1.4 million rows does not reflect the problem of data scarcity and might be a reason for the difficulties during the application of CTGAN, which is currently considered state of the art for the creation of synthetic tabular data [18]. Furthermore, the problem of the potential creation of invalid diagnosis and procedure codes (e.g., male patients being connected with diagnosis codes reserved for female patients), which was also present in the already discussed paper by Choi et al. [15] (see Section 2.1), cannot be fully ruled out using the architectures examined here. Although a first step towards the generation of synthetic heterogeneous data is made by Chin-Cheong et al. [16], there is still considerable room for improvement in this field of research. An example of this is the utilization of the state-of-the-art GAN model, the CTGAN. Unlike comparable GAN architectures, it overcomes most of the problems that are related to the creation of tabular synthetic data, such as imbalance or quantity [19].
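For reference, the Frobenius norm that Chin-Cheong et al. [16] use to compare the data sets can be written as follows. The symbols are notation assumed here for illustration only: A is an m × n data matrix with entries a_ij.

```latex
\[
\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^{2}}
\]
```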

2.3. Synthetic Data and Re-Identification Risks

Aiming to create synthetic tabular health data inter alia through the application of the CTGAN architecture, the work of Fernandes [12] is one of the papers most closely related to this paper. Therein, the author compares multiple generator models in their capabilities to create synthetic data using the MIMIC-III clinical data set. Besides general demographic information of the patient, the latter also holds ICD-9-CM diagnosis and procedure codes and, hence, contains mixed data types with some columns showing a very high cardinality. The utilized generator models are the CTGAN, Copula GAN, TVAE, and the Synthpop model, which leverages regression and classification trees for the synthetic data generation [20]. Only quantitative and visual similarity and privacy measures are presented for the comparison of the synthetic and original data. The experiments conducted therein show poor results as the models suffer from problems such as a long training time. It is assumed that this is caused by the size of the used data set, with the largest table used containing more than 4 million rows, which coincides with the findings of Chin-Cheong et al. [16]. The paper ends by raising new research questions, e.g., to test the performance of the utilized generator models on smaller real-world healthcare data sets.

2.4. Synthesizing Electronic Health Records: Cystic Fibrosis Patient Group

In a recent paper by Muller et al. [11], the authors take a research approach similar to this paper, examining the applicability of synthetic EHR created by GANs for the prediction of the patient outcome. However, they mainly focus on the value of the artificially created data through GANs for solving the problem of class imbalance. They do this by using a data set containing 3184 cystic fibrosis patient group records extracted from the IBM Explorys database. Two GANs with different DP budgets as well as a VAE and the CTGAN are compared in the paper. The synthetic data sets created by the models are evaluated under the aspects of similarity, uniqueness, and utility. Including visual representations of the data distributions in their considerations, the authors find that the presented measures of similarity are not robust. However, the visual representations of the generated data sets show that the CTGAN model generated the highest level of similarity while providing the highest level of diversity at the same time. Since the authors use an already anonymized data set, the uniqueness of the data rows is used as a measure of privacy. Thereby, the uniqueness of the data rows is estimated by measuring the count of duplicates the synthetic data has with the original data. The utility of the data sets is measured by utilizing the generated and original data in a binary classification task. Overall, CTGAN shows a superior performance, offering a high level of uniqueness as well as a high predictive performance. Furthermore, they find that the balanced synthetic data sets generated through GANs can overcome the problem of imbalanced data sets, observing an increase in prediction performance.

2.5. Comparison to the Underlying Work

The above-presented related work gives an overview of previous approaches in this line of research. This paper builds upon these papers as some of the procedures are taken over, e.g., the statistical evaluation metrics shown in Section 2.3 as well as the widely used visual evaluation, which is largely based on the table-evaluator framework of Brenninkmeijer et al. [21]. The final aim of the paper, to create synthetic data serving as a proxy for real-world data for the prediction of patient LOS, can be seen as the evaluation of machine learning efficacy on a multi-class classification task. While some of the previous works focus on the privacy-related component of synthetic data creation through GANs, they do not seem to do justice to the scarcity-related issues. As the collection of EHR is very time-consuming and/or requires trained personnel, this paper aims more strongly toward the lifting of the data scarcity burden. Thus, similar to Section 2.4, a smaller data set is used here to address the problem of data scarcity in the healthcare domain and, thus, also to answer the research question raised by Fernandes [12] (see Section 2.3). Furthermore, to the authors' knowledge, only the very recent work of Muller et al. [11] examines how to overcome the problem of imbalance through the application of GANs in the context of a binary classification task. This is extended in this paper by examining this issue regarding a multi-class classification problem. Lastly, only data that are available free of charge and without an admission process are utilized. Since data in the healthcare domain are often subject to privacy protection concerns, this is very rare. Herewith, it is aimed to contribute to more accessible and free research as most of the work on GANs in the healthcare domain is conducted on data sets that are either associated with costs (e.g., the New Zealand National Minimal data set) or only accessible through a lengthy approval process that does not guarantee admission (e.g., MIMIC-III data set).

3. Methods

This section introduces the theoretical background and the origins of the underlying data of this paper as well as the utilized evaluation metrics to evaluate the quality and applicability of the artificially created data. Furthermore, the implementation framework concerning the practical implementation of this paper is presented. To implement the evaluation metrics as well as the generator models, the Synthetic Data Vault (SDV), an ecosystem of libraries dedicated to the creation and evaluation of synthetic data (see [22,23]), is used.

3.1. Length of Stay Prediction

This subsection provides a brief summary of the relevant research on LOS prediction and outlines the theoretical framework of the models used in this paper for generating synthetic data. Besides the CTGAN and the Copula GAN model, the Table Variational Auto Encoder (TVAE) model is introduced as it is used to benchmark the synthetic data creation capabilities of the GAN architectures to a non-GAN generator model.

Although the research concerning the prediction of the patient LOS is far from new, it still remains a topical issue. By enhancing the prediction capabilities of a patient’s time spent in a stationary healthcare institution, it is aimed to enhance the allocation of resources. Furthermore, knowing the patient LOS early on can help understand the factors that influence the duration and enable hospitals to take appropriate actions in order to cut costs and even boost patient satisfaction. Therefore, recent research includes the LOS prediction using International Classification of Diseases (ICD) codes and demographic data in order to improve the allocation of resources [24]. Moreover, LOS is utilized as a metric to assess the effectiveness of hospital management. This measure not only reduces costs for patients and insurance companies, but also enhances hospital profitability by facilitating a higher turnover rate [4]. Thereby, the research on patient LOS does not only concern generic hospitals and their admission but also the improvement of the treatment in specialized medical fields such as surgical care units [6] or the prediction of the LOS for patients in intensive care units [25].

3.1.1. Variational Auto Encoder

The basis of the VAE is the conventional autoencoder model. It is composed of an encoder and a decoder. Input samples of a data set are encoded, resulting in a latent variable. This hidden representation is then utilized by the decoder to recreate the input data. The loss function of the VAE comprises the input-to-output reconstruction error and the Kullback–Leibler (KL) divergence, which measures the divergence of two probability distributions. For the best possible representation of the real data distribution, the KL divergence is minimized.
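The two parts of this loss can be written in the commonly used form below. The notation is an assumption made here for illustration: q(z | x) denotes the encoder distribution over the latent variable z, p(x | z) the decoder, and p(z) a prior over z (typically a standard normal distribution).

```latex
\[
\mathcal{L}(x) =
\underbrace{\mathbb{E}_{q(z \mid x)}\left[-\log p(x \mid z)\right]}_{\text{reconstruction error}}
+ \underbrace{D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right)}_{\text{KL divergence}}
\]
```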

3.1.2. Generative Adversarial Networks

Introduced by Goodfellow et al. [10] in 2014, GANs and their variants represent the state of the art in the field of synthetic data creation. A GAN consists of two neural networks, a generative model G and a discriminative model D, which are trained simultaneously in an adversarial manner, competing in a two-player min–max game. While G captures the distribution of the underlying data and creates additional new data points, D determines the probability that the latter stem from the original training data instead of G. Thereby, G is trained with the goal of maximizing the classification error of D. A commonly used example for this is the contest between a forger (representing G) producing fake money and the police (representing D) trying to distinguish the counterfeit from real money. As the police try to maximize the detection accuracy by learning to detect features of forgeries, the forger aims to minimize this rate by improving the forgeries. Thus, both parties improve their respective skills over time. As the generator does not have access to the training data and, thus, creates entirely new data points, this model architecture implies privacy [8].
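The two-player min–max game is usually stated as the following value function, following Goodfellow et al. [10]. Here, p_data denotes the distribution of the real data and p_z the distribution of the noise z that is fed into G.

```latex
\[
\min_{G} \max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
\]
```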

3.1.3. Conditional Tabular GAN

Xu et al. [14] acknowledge the unique challenges that occur when applying GANs to tabular data and present the CTGAN as a novel GAN architecture. The obstacles for creating synthetic data with GANs include the existence of mixed data types in a data set. Moreover, continuous columns can have non-Gaussian and multimodal distributions. While the former can lead to vanishing gradients, previous GAN architectures struggle with the latter as distributions having multiple modes are difficult to model and, thus, call for new ways of pre-processing. Furthermore, learning from sparse one-hot encoded vectors as well as from highly imbalanced categorical columns also proves to be hard. As dropping a minor category leads to only a marginal divergence in the distribution, such categories are oftentimes missed by the discriminator. This can lead to unequal training opportunities for the minority classes. To overcome these challenges, the authors suggest several changes for the architectural implementation. For instance, an individual pre-processing for continuous and discrete columns is proposed. Furthermore, a conditional generator is introduced, e.g., to tackle the problem of class imbalance. The objective of the conditional GAN is to adapt the input distribution to the real distribution under consideration of a condition connected to a specific categorical attribute. For the CTGAN network architecture, two fully-connected networks are used both in the generator and the discriminator in order to capture all correlations between the columns.
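To illustrate how a condition can give minority categories more training opportunities, the following is a minimal sketch and not the CTGAN implementation. It assumes that the category used as the condition is drawn with probability proportional to the logarithm of its frequency; this sampling rule, the added 1 inside the logarithm, and the function names are assumptions made here for illustration and are not stated in the text above.

```python
import math
import random
from collections import Counter


def log_frequency_pmf(column):
    """Return category probabilities proportional to log(count + 1).

    The +1 is an assumption of this sketch so that categories that occur
    only once still receive a positive weight.
    """
    counts = Counter(column)
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


def sample_condition(column, rng):
    """Draw one category to condition the generator on."""
    pmf = log_frequency_pmf(column)
    categories = list(pmf)
    return rng.choices(categories, weights=[pmf[c] for c in categories], k=1)[0]


# Hypothetical imbalanced categorical column (illustrative only).
column = ["common"] * 90 + ["rare"] * 10
pmf = log_frequency_pmf(column)
condition = sample_condition(column, random.Random(0))
```

In this hypothetical column, the rare category makes up 10% of the rows but is selected as a condition noticeably more often than that, while still being selected less often than the common category.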

3.1.4. Copula GAN

The Copula GAN differs from the CTGAN as it uses Gaussian copulas. These enable it to transform any given population into a Gaussian-like distribution by encoding the original input data points in the new distribution [21]. For instance, given the distribution of income in the USA, the value of the corresponding distribution function at USD 100,000 per year corresponds to the probability that any person in the USA has an income of USD 100,000 or less. This transformation of the input using copulas enables the Copula GAN to learn the original distribution of the data columns more easily.
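The transformation described above can be sketched as follows. This is a simplified illustration and not the Copula GAN implementation: it assumes that the distribution function is approximated by the empirical ranks of the sample, and the function name and income values are hypothetical.

```python
from statistics import NormalDist


def gaussian_copula_transform(values):
    """Map a sample to approximately standard-normal scores.

    Each value is first replaced by its empirical CDF value (its rank,
    rescaled to lie strictly between 0 and 1) and then passed through the
    inverse CDF of the standard normal distribution. Tied values receive
    consecutive ranks in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u = rank / (n + 1)  # empirical CDF value, kept away from 0 and 1
        scores[idx] = std_normal.inv_cdf(u)
    return scores


# Hypothetical, strongly skewed yearly incomes in USD (illustrative only).
incomes = [18_000, 25_000, 32_000, 41_000, 55_000, 100_000, 750_000]
scores = gaussian_copula_transform(incomes)
```

Although the incomes are heavily right-skewed, the resulting scores are symmetric around zero, which is the Gaussian-like shape the generator then has to learn.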

3.2. Origin of the Data Set

As the underlying data basis of this paper, the “NY Hospital Inpatient Discharges in 2015” data set [26] is used. The data set, which was last updated on 13 September 2019, is publicly accessible and available free of charge. From it, three data sets taken from university hospitals that are located in the state of New York are extracted at random. Data sets containing fewer than 15,000 patient records are excluded to guarantee a sufficient size of the hold-out test data sets. The source healthcare institutions are the Nassau University Medical Center (NUMC), the State University of New York Health Science Center (SUNY), and the New York University Hospitals Center (NYU) (see Table 1). These data sets contain patient discharge information such as the age group, gender, and the codes for the diagnosis as well as the medical procedure that is carried out on the patient. Most importantly, the data sets also contain the patient LOS, which is used as the target variable for the subsequent multi-class classification task. Therefore, the data sets can serve as representative proxies for other patient admission data sets as they contain all relevant demographic and clinical data columns that are normally filled at the point of admission.

Initially, each data set contains 35 columns holding information about the patient discharge (a complete glossary holding all of the provided columns can be found on the website of the data set issuer: [26]). After a first evaluation, it becomes evident that not all columns can be used for the subsequent classification task as they contain irrelevant information or information that is unknown at the time of admission. To prevent information leakage, these columns are excluded from the classification and, thus, from the creation of synthetic data. Ultimately, 13 features remain in each data set (see Table A1). The most interesting features for the underlying multi-class classification task are the Clinical Classifications Software (CCS) diagnosis description as well as the CCS procedure description. These columns hold the diagnosis and the procedure the patient receives when being admitted to the hospital, respectively. The diagnoses and procedures adhere to the CCS logic set up by the Healthcare Cost and Utilization Project (HCUP). This logic aggregates individual ICD-9-CM codes (ICD-9-CM are standardized diagnosis and procedure code descriptions issued by the WHO. They are widely adopted by healthcare facilities to standardize data collection. The most current version is the ICD 11th revision [27]) into broader diagnosis and procedure groups in order to enable statistical analysis [28]. Despite the utilization of the more general CCS logic, the diagnosis and procedure features in each of the data sets show high cardinality. The diagnosis column of the respective data sets each holds over 230 unique diagnosis descriptions, whereas the corresponding procedure column holds over 190 unique procedure descriptions.

3.3. Binning of the Target Variable

Given that the objective of this paper is to evaluate the potential of artificially generated data as a proxy for real-world data in predicting patient LOS at admission, it becomes necessary to reevaluate the nature of the target variable in its original form. The prediction of the patient LOS through machine learning is no trivial task, making it difficult to obtain accurate results. Hence, the current literature on the prediction of LOS normally does not provide an exact estimation of the patient LOS but rather an approximate categorization of the latter, e.g., in the form of a binary split into the categories “short-stay” or “long-stay”. In fact, attempting to achieve an overly accurate prediction of LOS, e.g., through a regression analysis, tends to lead to mistrust of the obtained results among clinical staff in the event of potential inaccuracies in this prediction [29]. In alignment with this, the target variable in this paper is binned into more general categories that still leave accurate enough prediction capabilities for the patient LOS in the clinical context.

The LOS variable of the original data set comes in the form of numerical data. In order to obtain reasonable binning categories, relevant literature concerning the prediction of patient LOS as well as the distributions of the target variable of the different data sets are considered. In previous work concerning the binary classification of the patient LOS [29,30], the target variable is split at the mean, differentiating between patients that have a lower or higher LOS. However, the LOS classes only cover stays of up to 14 days. In contrast to this, within the work of Harerimana et al. [24], who conduct a multi-class LOS classification using the MIMIC-III data set [31], the target variable is split into three logical bins: short stays (LOS <= 10 days), medium stays (10 < LOS <= 30 days), and long stays (LOS > 30 days). They apparently do this without taking the distribution of the classes into consideration.

Within this paper, the LOS feature is split into three classes: short, medium, and long stays. For all data sets, patients with a LOS of two days or less are assigned to the short-stay class. The long-stay class includes all registered admissions with a LOS exceeding one week, as these are seen as potential long-term patients. Patients with a LOS of more than two and at most seven days are allocated to the medium-stay category.
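The binning rule can be expressed as a short function. This is a minimal sketch: the function name, the class labels as strings, and the example values are chosen here for illustration, and the LOS is assumed to be available as a number of days.

```python
def bin_length_of_stay(los_days):
    """Assign a length of stay in days to one of three classes."""
    if los_days <= 2:
        return "short"   # two days or less
    if los_days <= 7:
        return "medium"  # more than two days, at most one week
    return "long"        # exceeding one week


# Hypothetical LOS values in days (illustrative only).
example_los = [1, 2, 3, 7, 8, 30]
classes = [bin_length_of_stay(days) for days in example_los]
```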

3.4. Evaluation Metrics

This subsection presents the evaluation methods employed to assess the quality and utility of the generated synthetic data.

3.4.1. Goodness of Fit

The goodness of fit estimates how well a set of observations is fitted by a statistical model, which helps determine discrepancies between observed and expected results.

Kolmogorov–Smirnov Test— The Kolmogorov–Smirnov Test uses the empirical cumulative distribution function to determine if a continuous sample stems from a population showing a specific distribution. A random variable X with unknown distribution is inspected. Two hypotheses are put forward. The first hypothesis expresses that the random variable has the same probability distribution F₀ as the examined population and, thus, is part of the latter. The alternative hypothesis assumes that the random variable is not part of the population [12].
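When a synthetic column is compared with the corresponding real column, both are samples, so a two-sample variant of the test statistic is a natural choice. The sketch below computes only this statistic (the largest gap between the two empirical cumulative distribution functions) and no p-value; the two-sample setting and the function name are assumptions made here for illustration.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 for identical samples, 1.0 for non-overlapping ones.
    """
    real_sorted = sorted(real)
    syn_sorted = sorted(synthetic)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it suffices to evaluate the gap at those values.
    for x in set(real_sorted) | set(syn_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect_right(syn_sorted, x) / len(syn_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_syn))
    return largest_gap
```

A value close to 0 indicates that the synthetic column follows the real distribution closely, whereas a value close to 1 indicates that the two samples barely overlap.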

Chi-Squared Test— The Chi-Squared Test is used for the comparison of discrete features. Analogous to the Kolmogorov–Smirnov Test, it shows if a given sample stems from a population showing a specific distribution [12].
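For a discrete column, the underlying test statistic can be sketched as follows. The sketch assumes that the category proportions of the real data serve as the expected distribution for the synthetic sample; it returns only the statistic, not a p-value, and the function name is hypothetical.

```python
from collections import Counter


def chi_squared_statistic(real, synthetic):
    """Pearson chi-squared statistic of synthetic counts vs. real proportions.

    Expected counts are the real category proportions scaled to the size
    of the synthetic sample. Categories that occur only in the synthetic
    data are not handled in this sketch.
    """
    real_counts = Counter(real)
    syn_counts = Counter(synthetic)
    n_real, n_syn = len(real), len(synthetic)
    statistic = 0.0
    for category, count in real_counts.items():
        expected = count / n_real * n_syn
        observed = syn_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical gender columns (illustrative only).
real_gender = ["F"] * 50 + ["M"] * 50
synthetic_gender = ["F"] * 60 + ["M"] * 40
statistic = chi_squared_statistic(real_gender, synthetic_gender)
```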

3.4.2. Likelihood Metrics

For the utilized likelihood metrics, the real data are fitted to a probabilistic model. Then the likelihood that the artificially created data are part of the learned distribution is estimated. Thereby, the underlying probabilistic models are Gaussian Mixture Models (i.e., finite normal mixture models) [32], which combine multiple Gaussian distributions in order to represent more complex data patterns. These are fitted to the original data. Then the mean log-likelihood of the synthetic data belonging to the original data distribution is estimated [23].
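Written out, the mean log-likelihood of N synthetic rows under a Gaussian Mixture Model with K components takes the form below. The notation is assumed here for illustration: the mixture weights π_k, means μ_k, and covariances Σ_k are fitted to the real data, and the x̃_i denote the synthetic rows.

```latex
\[
\bar{\ell} = \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K}
\pi_k \, \mathcal{N}\left(\tilde{x}_i \mid \mu_k, \Sigma_k\right)
\]
```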

3.4.3. Detection Metrics

Besides the aforementioned statistical metrics, detection metrics are used to evaluate the quality of the synthetic data. Here, the synthetic data set is mixed with the original data set. The data rows are then labeled to indicate whether a record is artificially created or stems from the original data set. Finally, a classifier (in this case, Support Vector Machine (SVM) and logistic regression classifiers) is applied to distinguish between the two target classes using cross-validation. The average classification score of the cross-validation is then subtracted from 1 and returned as the output score, so that a higher score indicates synthetic data that are harder to tell apart from the original data.

3.4.4. Aggregated Result

The aggregated result combines all the above-mentioned measures into a single score. To this end, the scores are normalized to the range from 0 (the worst possible score) to 1 (the best possible score). The aggregated score is calculated as the mean of these normalized scores.
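The detection score from Section 3.4.3 and the aggregation can be sketched in a few lines. The metric names and all numbers below are hypothetical placeholders, and the sketch assumes that every metric has already been normalized to the range from 0 to 1. How the individual metrics are normalized is not shown here.

```python
from statistics import mean


def detection_score(fold_scores):
    """1 minus the average cross-validation score of the real-vs-synthetic classifier."""
    return 1 - mean(fold_scores)


def aggregated_score(normalized_scores):
    """Mean of metric scores that are already normalized to [0, 1]."""
    values = list(normalized_scores.values())
    if not all(0 <= v <= 1 for v in values):
        raise ValueError("all scores must lie in [0, 1]")
    return mean(values)


# Hypothetical scores for one generator model.
scores = {
    "ks_test": 0.90,
    "chi_squared_test": 0.80,
    "gm_log_likelihood": 0.60,
    "svm_detection": detection_score([0.70, 0.80, 0.75]),
    "logistic_detection": detection_score([0.60, 0.65, 0.55]),
}
total = aggregated_score(scores)
print(round(total, 3))
```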

3.4.5. Privacy Metrics

Considering the rigid data privacy constraints within the medical field, the privacy-preserving capabilities of the reviewed generator models are also evaluated. Since the privacy measurement of synthetic data is a topic that continues to be actively discussed in the literature, the duplicate count utilized in previous studies that are closely related to this paper (see [11]) is used as a measurement of privacy. To replicate the full variability and diversity of the original data, the generator models could simply copy the original data. As mentioned before, the underlying data set is already de-identified. Therefore, the overlap of the synthetic and the real data is reviewed by counting the rows that are identical across the data sets. If the synthetic data set has many more duplicates than the real data set, e.g., the real data set has 25 duplicates while the synthetic data set has over 100 duplicates, the generator model most likely suffers from mode collapse and simply copies rows from the original data set instead of creating new unique rows. Ideally, the number of unique rows in the synthetic data set should not be lower than that in the original data set.
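A minimal sketch of such a duplicate count is given below. It distinguishes duplicates within one data set from synthetic rows that are identical to a real row. The function names and the sample rows are hypothetical, and rows are assumed to be comparable as tuples of their column values.

```python
from collections import Counter


def internal_duplicates(rows):
    """Number of rows that repeat an earlier row of the same data set."""
    counts = Counter(map(tuple, rows))
    return sum(c - 1 for c in counts.values())


def shared_rows(real_rows, synthetic_rows):
    """Number of synthetic rows that are identical to some real row."""
    real_set = set(map(tuple, real_rows))
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)


# Hypothetical records with three categorical columns.
real = [
    ("F", "30-49", "Elective"),
    ("M", "50-69", "Emergency"),
    ("F", "30-49", "Elective"),
]
synthetic = [
    ("F", "30-49", "Elective"),
    ("M", "18-29", "Urgent"),
    ("M", "18-29", "Urgent"),
    ("M", "50-69", "Emergency"),
]
print(internal_duplicates(real), internal_duplicates(synthetic))
print(shared_rows(real, synthetic))
```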

3.4.6. Visual Evaluation

A visual comparison of the original and synthetic data is conducted to complement the quantitative comparison. For the visual evaluation, four visual comparison techniques introduced and implemented in the table-evaluator library (https://pypi.org/project/table-evaluator/, accessed on 1 June 2023 [33]), which is based on the work of [21], are used. The applied plots are the comparison of the column-wise mean and standard deviation, the cumulative sums, the distribution per column, and the depiction of the correlation between the columns.

Standard Deviation and Column-Wise Mean

This simple method plots the means and standard deviations of the columns on a logarithmic scale. The synthetic and original data columns have similar means and standard deviations if the points fall on the diagonal of the chart.

Cumulative Sum

The cumulative sum is calculated for each column. Afterwards, the cumulative sums of the corresponding columns of the original and synthetic data set are shown in the same chart. This enables a visual examination of the similarity of the distributions of the respective synthetic columns to their original counterparts.

Correlation Analysis

During the correlation analysis, the correlation matrices of the original and the newly created artificial data set are plotted. Furthermore, a third matrix is output, showing the differences between the first two. This plot enables the viewer to quickly see whether the inter-correlations within the original data set could be preserved in the synthetic data set.

Distribution per Feature

Lastly, the distributions of the respective original data set columns are plotted against their synthetic counterparts. This proves especially helpful for categorical columns as it enables a fast inspection of the value counts per category.

3.4.7. Utility

Finally, the main method to evaluate the quality of the synthetic data is the testing of the machine learning efficacy. This test also addresses the research question of whether synthetic data created by the CTGAN can serve as a proxy for real-world EHR data in the multi-class classification task. To this end, multiple classifier models are applied to the real-world data as well as the synthetic data to predict the patient LOS. Afterwards, the performances of the classifiers on the different data sets are compared. Since the prediction quality of all three target classes is valued equally and the underlying data sets are imbalanced, the macro F1-score is selected as the main metric. The macro F1-score is computed as the unweighted mean of the per-class F1-scores.
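The following sketch computes the macro F1-score from scratch to make the definition concrete. The labels and predictions are hypothetical. A class without any true or predicted instances is assigned an F1-score of 0, which is a simplifying assumption of this sketch.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical predictions in which the minority class "long" is never predicted.
y_true = ["short", "short", "medium", "medium", "medium", "long"]
y_pred = ["short", "medium", "medium", "medium", "medium", "medium"]
score = macro_f1(y_true, y_pred, labels=["short", "medium", "long"])
print(round(score, 3))
```

In this example, four of six predictions are correct, yet the macro F1-score stays below 0.5 because the missed minority class contributes a per-class score of 0 with the same weight as the other classes.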

3.5. Implementation

For the practical implementation of this paper, the Cross-Industry Standard Process for Data Mining (CRISP-DM) approach is used. This is a commonly used methodology for machine learning projects. Thus, the individual steps of this paper are oriented according to the CRISP-DM definition [34]. Here, the process starts with the Data Understanding and Data Preparation phases, which include the closer examination and preparation of the data to serve as an input for the generator models. Thereafter, the Modeling phase is entered. Its first part contains the creation of synthetic data. This is conducted through the application of the CTGAN, Copula GAN, and TVAE models to the hospital admission data sets. Thereby, the original data sets are partitioned into training data sets, consisting of 5000 data rows, and hold-out test data sets, which comprise the remaining EHR from the respective data sets. The above-introduced measurements (see Section 3.4) are utilized to evaluate the performances of the generator models trained with different hyperparameter combinations. The hyperparameters include the number of epochs, the batch size, the log frequency, as well as the learning rates of the generator and discriminator models. After narrowing the search range of the individual hyperparameters by means of a random grid search, an exhaustive grid search is performed over all combinations of the remaining hyperparameter values. This procedure is run iteratively to improve the performances of the models. The application of the generator models concludes with the selection of the best-performing parameter combination for the CTGAN, the Copula GAN, and the TVAE model, respectively. Each of these models is then used to sample synthetic data sets, whose machine learning efficacy is comparatively evaluated. More specifically, three different classification models are fitted to the real-world training and the artificially created data sets. 
Each of the fitted models is then utilized for the LOS classification of a hold-out test data set. The classification results are used to determine the applicability of the synthetic data created by the GAN models as a proxy for the real-world EHR. Beyond that, this experimental setup is also used to conduct a downstream investigation on the utility of the GAN models as an over-sampling technique in comparison to conventional techniques like SMOTE (see, e.g., Ref. [35]) or random over-sampling. Here, the minority classes of the real data set are over-sampled by applying each of these techniques separately. Analogous to the previous setup, the classification models are fitted to the data sets and the performances of these are comparatively evaluated on the model level. Hence, the application of the generator models and the testing of the machine learning efficacy can be seen as the core of this paper. The individual steps of this main part are laid out graphically in Figure 1.
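The exhaustive grid search over the narrowed hyperparameter ranges can be sketched as follows. The hyperparameter names mirror those listed above, whereas the candidate values and the scoring function are purely hypothetical stand-ins. In the actual setup, each combination would require training a generator model and computing the aggregated result from Section 3.4.

```python
from itertools import product

# Hypothetical candidate values left after narrowing down the search ranges.
search_space = {
    "epochs": [100, 300],
    "batch_size": [100, 500],
    "log_frequency": [True, False],
    "generator_lr": [2e-4, 2e-5],
    "discriminator_lr": [2e-4, 2e-5],
}


def exhaustive_grid(space):
    """Yield every combination of the candidate hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def select_best(space, evaluate):
    """Return the combination with the highest evaluation score."""
    return max(exhaustive_grid(space), key=evaluate)


def toy_score(params):
    # Placeholder for "train the generator, then compute the aggregated score".
    return -abs(params["epochs"] - 300) - abs(params["batch_size"] - 500) / 1000


combinations = list(exhaustive_grid(search_space))
best = select_best(search_space, toy_score)
print(len(combinations), best["epochs"], best["batch_size"])
```

Because the number of combinations grows multiplicatively with every additional hyperparameter value, narrowing the ranges beforehand keeps the exhaustive step tractable.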

4. Results

In this section, the obtained results are presented. Firstly, the results of the conducted hyperparameter search are discussed. The best-performing hyperparameter combination of each model is used for comparison with the respective best hyperparameter combination of the other models. Subsequently, the outcome of the machine learning efficacy testing is presented (note that an Appendix A with a glossary of features and detailed results is provided below).

4.1. Evaluation of Generator Model Capabilities

Since each data set is used as a basis for corresponding synthetic data sets produced by the respective generator models, the hyperparameter search is discussed on the basis of the individual original data sets. Only the aggregated result measure is shown in the result tables, summarizing the results of the other quantitative measures introduced in Section 3.4.

4.1.1. NUMC Data Set Hyperparameter Search

Comparing the respective best models on the aggregated quantitative performance result, the CTGAN architecture shows a superior performance (see Table 2). Despite the big difference in the number of duplicates, the best Copula GAN and TVAE models show very similar results. However, it is suspected that the result of the latter also stems from the high similarity between the synthetic and original data rows, as the quantitative measures do not sufficiently penalize the existence of copied values in the synthetic data set. The superiority of the GAN approaches is also visible in the visual evaluation plots. (Please note that, for some data sets, the distribution of the ccs_diagnosis_description and ccs_procedure_description columns cannot be properly displayed in the visual evaluation.) Here, the CTGAN and Copula GAN show quite good representations of the core CCS diagnosis and procedure columns, while the TVAE is not able to replicate these properly. This is visible in the respective plot (a) for the CTGAN (see Figure A1), the Copula GAN (see Figure A2), and the TVAE model (see Figure A3). However, the correlation between the columns is only adequately preserved, as shown in the respective plot (c) of the above-referenced figures.

4.1.2. SUNY Data Set Hyperparameter Search

Similar to the NUMC data set, the CTGAN architecture once again seems to be superior in the creation of synthetic data when compared to the Copula GAN and TVAE (see Table 3). While the Copula GAN reaches a similar result as on the NUMC data set, the TVAE approach shows an even weaker result with regard to both the aggregated result and the number of duplicates shared with the original data set. The visual evaluation of the models shows that they do not properly reproduce the inter-correlations of the columns. In particular, they struggle to replicate the birth_weight column. Here, the value of the visual evaluation becomes evident, as it dampens the expectations regarding the value of the synthetic data that are raised when solely looking at the quantitative distribution measures. The final assessment of this value can only be delivered by the subsequent testing of the machine learning efficacy.

4.1.3. NYU Data Set Hyperparameter Search

Overall, the CTGAN architecture shows the best performance while having a moderate number of duplicates (see Table 4). The visual evaluation confirms the finding of the quantitative analysis, showing that the CTGAN and Copula GAN replicate the distributions of the original data columns quite well. However, the inability of the GAN models to retain the correlations between the columns remains.

4.1.4. Summary of the Hyperparameter Search

In summary, the quantitative measures indicate that the GAN models deliver synthetic data of sufficient quality. The CTGAN model is expected to perform best in the subsequent testing for machine learning efficacy. This is concluded not only on the basis of the quantitative distribution measures but also on the basis of the number of duplicates shared with the original data set, the number of missed categories of the discrete columns, as well as the visual evaluation of the distribution plots.

4.2. Testing the Machine Learning Efficacy

The following section corresponds to the fourth step of the visual process flow (compare Figure 1). As described before, three classifier models are used to test the machine learning efficacy of the artificially created data and, hence, its utility in the real-world multi-class prediction of the patient LOS. These are Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN) classifiers. In order to ensure the best possible comparability of the results, the algorithms are used with their default settings without any hyperparameter tuning. The training data used for the classifier models are the original data frame as well as the synthetic training data frames that are sampled at the beginning of this section by the best generator models selected in Section 4.1. In the course of this, the utility of the generator models as over-sampling techniques is also tested. However, the discussion of the latter is left for the last part of this section. The macro F1-scores of the machine learning efficacy test are shown in Table 5.

The results show that the classifiers trained on the data generated by the GAN models are not able to match the performance of the ones trained on the real data set. This is the case for all three of the examined data frames of the NUMC, SUNY, and NYU healthcare facilities. The Copula GAN model shows the worst performance of all three generator models. The CTGAN model shows slightly better results. However, on average, its macro F1-score differs from that of the models trained on the real data set by 0.1 points. While the superior performance of the CTGAN model in comparison to the Copula GAN architecture was expected when looking at the previous quantitative distribution evaluation, the good performance of the TVAE model seems somewhat surprising. Nevertheless, considering the high number of duplicates produced by the TVAE model, this performance lead should be viewed with caution. In particular, the TVAE results on the NUMC and NYU data sets (see Table 5) can be rejected in view of the immense number of duplicates between the synthetic and original data frames reported in Table 2 and Table 4. Thus, the SUNY data set can be seen as the only data set on which the TVAE model shows superior performance compared to the GAN architectures.

Upon closer inspection, the moderate performance of the classifiers trained on the synthetic training data sampled using the GAN architectures can be traced back to their poorer prediction capability for the long-stay minority target class compared to the other target classes. An example of this is shown in Figure 2, which depicts the RF classification results on the real-world NUMC data set as well as the classification results on the synthetic data sets generated by the CTGAN and Copula GAN. The RF classifier results for the synthetic NUMC data set generated by the TVAE model are not included in this figure. This is due to the presence of a significant number of duplicates, as discussed earlier, which renders the results invalid. The other classifiers show similar results across all data sets. Thus, the respective minority class is assumed to be too small compared to the other target classes, as the GAN models seem to concentrate on the larger short- and medium-stay categories.

4.3. Using GAN for Over-Sampling

Lastly, the applicability of the GAN approaches as a substitute for conventional over-sampling methods in the context of multi-class patient LOS prediction is examined. The results depicted in Table 6 show that none of the over-sampling techniques is able to surpass the performance of the classifier models trained on the real untreated data frames. Although the classifiers show similar performance on training data that are over-sampled using the CTGAN and Copula GAN architectures, these results ought to be viewed in light of the results and the conclusion presented in Section 4.2. Therefore, a clear inferiority or superiority of the over-sampling by the GAN approach cannot be concluded for now.
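For orientation, random over-sampling, one of the conventional baselines in this comparison, can be sketched as follows. The sketch assumes that every minority class is filled up to the size of the largest class by drawing existing rows with replacement. This balancing target, the function name, and the sample data are assumptions for illustration and are not taken from the experimental setup.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly drawn minority rows until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))  # draw with replacement
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical feature rows with an imbalanced LOS class distribution.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["medium", "medium", "medium", "short", "short", "long"]
new_rows, new_labels = random_over_sample(rows, labels)
print(Counter(new_labels))
```

A generator-based alternative would replace the draw from the existing rows with newly sampled synthetic rows of the respective class.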

4.4. Results Summary

The evaluation metrics of the hyperparameter search presented in Section 4.1 show that the utilized GAN architectures can sufficiently replicate the distribution of the columns of the original EHR. The CTGAN architecture is found to be superior both in achieving the highest aggregated similarity scores and in the visual comparison of the synthetic and original data distributions. Interestingly, the numbers of created duplicates differ considerably between the models, with the TVAE model showing the largest number of duplicate rows. This is also the case for the number of rows that are shared one-to-one with the original data set, indicating a breakdown of the model during the learning process. The number of duplicate rows within the respective synthetic data sets as well as the number of rows that are shared with the original data set are significantly lower for the EHR produced by the GAN models. Considering the differences in the architectures of the GAN and the TVAE, this makes sense, as only the latter has an explicit cost function and thus directly uses the original training data in the training process. Although the other quantitative evaluation methods do not fully capture the issue of duplicates, it turned out to be central for the creation of synthetic data, especially when dealing with such a small data set. Within this study, recording the created duplicates proved to indicate the correct direction for the hyperparameter search early on. In particular, models trained with high epoch numbers were more prone to produce higher numbers of duplicates. However, while categorical columns with a rather low cardinality were reproduced very well, the ccs_diagnosis_description and ccs_procedure_description columns in particular posed major challenges for the models. This holds especially for diagnoses and procedures that were connected to only a small number of patients. 
The problem went so far that some categories were not learned at all and were thus not included in the newly created artificial data set. Overall, the CTGAN model showed the best results when looking at the quantitative measurement results presented in Section 4.1.

Nevertheless, the utility tests of the artificially created data as a proxy for real-world EHR in the context of solving the multi-class classification task of the patient LOS prediction show mixed results (see Table 5). While the models trained on the CTGAN-sampled training data performed very well on the medium-stay and short-stay target classes, their performance on the minority class is significantly worse. In spite of this, the CTGAN approach is found to be the superior model for creating synthetic EHR for the patient LOS prediction, as the TVAE models produced many duplicates that violate the privacy constraints that are essential for the underlying purpose.

5. Discussion

Similar to the previous work conducted in this field of research (see Section 2), this paper shows the large potential of GAN-created synthetic EHR. Although the approach brings along many advantages for the creation of synthetic data (e.g., conditional sampling and its privacy-preserving capabilities), the underlying work also indicates that the training process of GANs for tabular EHR still faces many obstacles and can lead to unsatisfying results. First of all, the learning process of the CTGAN model in particular took an extensive amount of time, even though the hyperparameters were first explored on a trial basis, taking bigger steps in both directions (e.g., very small and very large numbers of epochs) in order to narrow down the hyperparameter search window early on. The final testing of a total of ten different hyperparameter combinations took more than 24 h, during which no further action could be conducted on the device. While an extensive amount of training time was expected, the initial hyperparameter search with more than 30 hyperparameter combinations lasted over one full week. Considering the size of the utilized training data set, this seemed to be very long. Hence, the timely creation and sharing of synthetic data with the best possible quality produced by the investigated GAN architectures remain rather unrealistic (see also [12]). In that work, however, the data set used is disproportionately larger, even leading the authors to split the data into smaller chunks. Nevertheless, it can be stated that the Copula GAN indeed learned much faster and, thus, can substitute the CTGAN approach when a timely synthetic data creation is needed while accepting losses in performance. Furthermore, the results of this paper show that the utilized GAN models seem to struggle with data columns showing a high imbalance. This adds to the ongoing question in the literature of whether current GAN models are capable of handling a high imbalance in categorical columns [13]. 
This work shows rather poor results in this regard, at least in the underlying context, as some of the minority categories in the columns showing a high cardinality could not be reproduced. The reason for this might be the problem of mode collapse, meaning that the model is not able to capture the full distribution of the input column. While the architectures of CTGAN and Copula GAN attempt to address the issue of columns with high cardinality, it appears that this problem is not fully resolved, which impairs the performance of the models. Moreover, the models also struggled with the replication of the minority target class. It is assumed that the investigated GAN architectures are only able to deliver mediocre results when trained on highly imbalanced and scarce data sets. In this regard, the capabilities of the CTGAN and Copula GAN architectures as a tool to over-sample the minority classes also remain restricted, and their application should be decided on a case-by-case basis. Nevertheless, one of the biggest advantages of the CTGAN architecture proved to be its conditional sampling. This enabled the re-sampling of data rows that did not comply with constraints while fixing the target category to which the affected record belonged. In summary, it can be noted that the general applicability of synthetic data created through state-of-the-art GANs as a proxy for real-world EHR in patient LOS prediction remains difficult. On the one hand, the results show that, unlike the TVAE architecture, the GAN architectures could comply with the imposed privacy constraints by producing only low numbers of duplicates. Moreover, the models prove to be valuable in reproducing the distributions of the individual columns of the original data fairly well. 
On the other hand, this was especially the case for columns with low cardinality and continuous columns with few missing values, which eventually led to mixed results when testing the machine learning efficacy in the context of the LOS prediction. It can be assumed that the investigated GAN architectures in their current form are only applicable for the creation of synthetic admission EHR if the cardinality and imbalance of the columns contained in the data set are reduced before the data are passed to the models. However, whether and how the concerned categories are binned needs to be decided on a case-by-case basis. Considering the task of this paper to investigate the general applicability of the GAN approaches to replicate admission EHR, this approach was not pursued further but is expected to deliver enhanced results.

6. Conclusions

The present paper is concerned with the evaluation and examination of the value and applicability of synthetic tabular data created by GANs as a proxy for real-world tabular EHR. This is tested in the context of solving multi-class classification tasks for the prediction of patient LOS within hospital facilities. Therefore, state-of-the-art GAN models (CTGAN and Copula GAN) as well as a benchmark TVAE model are trained on real-world hospital admission data sets taken from the “NY Hospital Inpatient Discharges in 2015” data set provided by the New York State Department of Health [26].

In summary, it can be noted that the general applicability of synthetic data created through state-of-the-art GANs as a proxy for real-world EHR in patient LOS prediction remains difficult. On the one hand, the results show that, unlike the TVAE architecture, the GAN architectures could comply with the imposed privacy constraints by producing only low numbers of duplicates. Moreover, the models proved to be valuable in reproducing the distributions of the individual columns of the original data fairly well. On the other hand, this was especially the case for columns with low cardinality and continuous columns with few missing values, which eventually led to mixed results when testing the machine learning efficacy in the context of the LOS prediction. It can be assumed that the investigated GAN architectures in their current form are only applicable for the creation of synthetic admission EHR if the cardinality and imbalance of the columns contained in the data set are reduced before the data are passed to the models. However, whether and how the concerned categories are binned needs to be decided on a case-by-case basis. Considering the task of this paper to investigate the general applicability of the GAN approaches to replicate admission EHR, this approach was not pursued further but is expected to deliver enhanced results.

The paper at hand shows that there is still room for improvement when applying state-of-the-art GAN architectures to clinical healthcare data, especially when the data set contains highly imbalanced columns that show a high cardinality. Therefore, state-of-the-art architectures need to be further improved to enable them to cope with this problem. More specifically, it should be investigated how the improvement in the data quality can be accompanied by a faster training process. Only then is the use of synthetic data generated by GANs economically reasonable. Thus, future work could repeat this study with newly created enhanced GAN architectures. Additionally, this research could be applied to various other industry sectors, such as the financial industry or public transportation. Moreover, future research could also test the machine learning efficacy of synthetic data created by GANs in a patient LOS regression task. It is recommended to test this with enhanced generator models as well when dealing with a regression target variable that shows such a pronounced imbalance as in the underlying data set used within this paper.

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

No funding was received for this study. Open Access is provided through the Open Access Fund of the University of Hamburg under 1685946786-UHH-OAF.

Informed Consent Statement

Informed consent is not required for the use of the “NY Hospital Inpatient Discharges in 2015” dataset in this study. This dataset has been carefully anonymized and complies with the HIPAA Privacy Rule data security regulations. It contains no personally identifiable information, ensuring the confidentiality and privacy of patient data.

Data Availability Statement

As the underlying data basis of this paper, the “NY Hospital Inpatient Discharges in 2015” data set [26] is used. The data set, which was last updated on 13 September 2019, is publicly accessible and available free of charge at https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/82xm-y6g8 (accessed on 1 June 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Glossary of all columns used in the paper. The row marked with an asterisk (*) holds the target variable of the multi-class classification problem, the patient LOS. The CCS diagnosis and procedure codes catalog can be found in the CCS users guide [28] (pp. 18–28).

Feature Glossary
Column Name	Description	Value Range	Data Type
abortion_edit_indicator	indicates whether an abortion was performed or not	(0, 1)	binary
age_group	age in years at time of discharge grouped into age groups	(0–17, 18–29, 30–49, 50–69, 70-older)	ordinal
birth_weight	birth weight of a newborn	positive whole number	int
ccs_diagnosis_description	short description of the diagnosis	diagnosis adhering to the CCS catalog	str
ccs_procedure_description	short description of the procedure	procedure adhering to the CCS catalog	str
emergency_department_indicator	Y: patient was admitted through the emergency unit; N: patient was not admitted through the emergency unit	(N, Y)	binary
ethnicity	ethnicity of the patient	e.g., Not Span/Hispanic	nominal
gender	gender of the patient	(F, M)	binary
length_of_stay *	length of stay of the patient in the hospital	positive whole number	int
payment_typology_one	payment method of patients	e.g., Medicare	nominal
race	race of the patient	e.g., Black/African American	nominal
type_of_admission	admission type	e.g., Elective	nominal
zip_code	zip code of residence of the patient	3-digit numerical code	nominal

Figure A1. NUMC data set—CTGAN: Visual comparison of synthetic training data created by the best performing CTGAN to the original data. Measures shown in the subsequent subplots: (a) cumulative sum per column, (b) distribution of the columns, (c) difference of correlation plots (the darker the color, the greater the difference), (d) standard deviation and column-wise mean of numerical columns.

Figure A2. NUMC data set—Copula GAN: Visual comparison of synthetic training data created by the best performing Copula GAN to the original data. Measures shown in the subsequent subplots: (a) cumulative sum per column, (b) distribution of the columns, (c) correlation, (d) standard deviation and column-wise mean of numerical columns.

Figure A3. NUMC data set—TVAE: Visual comparison of synthetic training data created by the best performing TVAE to the original data. Measures shown in the subsequent subplots: (a) cumulative sum per column, (b) distribution of the columns, (c) correlation, (d) standard deviation and columnwise mean of numerical columns.


Figure 1. Diagram showing the practical implementation steps.

Figure 2. Confusion matrix and classification output of the RF model trained on data based on the NUMC data set: (a) original NUMC training data set, (b) synthetic training data created by the CTGAN, and (c) synthetic training data created by the Copula GAN architecture.

Table 1. Overview of the three data sets. Each row of a data set corresponds to a unique patient discharge record.

Hospital Name	Data Rows
Nassau University Medical Center	19,208
SUNY Health Science Center	21,370
NYU Hospitals Center	33,098

Table 2. NUMC data set: Comparison of the best-performing models per model category. The first four columns after the model name are hyperparameters; the last two are quantitative metrics. Empty cells are marked as not available (n.a.).

Model	Epochs	Batch Size	Log Frequency	Generator/Discriminator Learning Rate	Aggregated Result	Duplicates between Synthetic and Real Data Set
CTGAN	1000	500	FALSE	($2 \times 10^{-5}$, $2 \times 10^{-5}$)	0.73	272
Copula GAN	1000	500	FALSE	($2 \times 10^{-6}$, $2 \times 10^{-5}$)	0.65	27
TVAE	1000	500	n.a.	n.a.	0.62	935

Table 3. SUNY data set: Comparison of the best-performing models per model category. The first four columns after the model name are hyperparameters; the last two are quantitative metrics. Empty cells are marked as not available (n.a.).

Model	Epochs	Batch Size	Log Frequency	Generator/Discriminator Learning Rate	Aggregated Result	Duplicates between Synthetic and Real Data Set
CTGAN	1000	500	FALSE	($2 \times 10^{-5}$, $2 \times 10^{-5}$)	0.89	43
Copula GAN	500	700	FALSE	($2 \times 10^{-4}$, $2 \times 10^{-4}$)	0.67	8
TVAE	1000	500	n.a.	n.a.	0.57	336

Table 4. NYU data set: Comparison of the best-performing models per model category. The first four columns after the model name are hyperparameters; the last two are quantitative metrics. Empty cells are marked as not available (n.a.).

Model	Epochs	Batch Size	Log Frequency	Generator/Discriminator Learning Rate	Aggregated Result	Duplicates between Synthetic and Real Data Set
CTGAN	1000	500	TRUE	($2 \times 10^{-5}$, $2 \times 10^{-5}$)	0.7	158
Copula GAN	500	700	FALSE	($2 \times 10^{-4}$, $2 \times 10^{-4}$)	0.67	27
TVAE	1000	700	n.a.	n.a.	0.57	1817
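
The duplicate counts reported in Tables 2 to 4 can be illustrated with a minimal sketch. It assumes that a duplicate is a synthetic row that matches some real row in every column, and that each matching synthetic row is counted once; the exact counting rule used for the tables is not restated here, so this is only one plausible reading. The function name `count_duplicates` and the toy columns are hypothetical.

```python
def count_duplicates(real_rows, synthetic_rows):
    """Count synthetic rows that exactly match at least one real row.

    Assumption: a duplicate is an exact match on all columns, and every
    matching synthetic row counts once.
    """
    real_set = {tuple(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)


# Toy records with hypothetical columns: age group, gender, LOS class.
real = [
    ("30-49", "F", "short"),
    ("50-69", "M", "long"),
    ("70+", "F", "medium"),
]
synthetic = [
    ("30-49", "F", "short"),   # exact copy of a real record
    ("30-49", "M", "short"),
    ("70+", "F", "medium"),    # exact copy of a real record
    ("18-29", "M", "long"),
]

n_dup = count_duplicates(real, synthetic)
print(n_dup, n_dup / len(synthetic))
```

Under this reading, a lower count suggests that the generator copies fewer real patient records, which is why the metric is read alongside the aggregated result rather than in isolation.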

Table 5. Macro F1-scores of the classifier models trained on the real data frames and on their synthetic replicas.

Data Set	Data Frame Source	Random Forest	SVM	K-Nearest Neighbor
NUMC	Real Data Set	0.59	0.60	0.55
	Sampled by Copula GAN	0.31	0.30	0.32
	Sampled by CTGAN	0.33	0.32	0.35
	Sampled by TVAE	0.53	0.55	0.50
SUNY	Real Data Set	0.48	0.52	0.44
	Sampled by Copula GAN	0.33	0.33	0.34
	Sampled by CTGAN	0.34	0.34	0.36
	Sampled by TVAE	0.43	0.46	0.40
NYU	Real Data Set	0.54	0.58	0.52
	Sampled by Copula GAN	0.35	0.38	0.39
	Sampled by CTGAN	0.40	0.41	0.43
	Sampled by TVAE	0.49	0.52	0.47
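
For readers unfamiliar with the metric in Table 5, the macro F1-score is the unweighted mean of the per-class F1-scores, where the F1-score of class k is 2·TP_k / (2·TP_k + FP_k + FN_k). The sketch below computes it from a confusion matrix. The toy matrix is invented for illustration and is not taken from the paper, and scoring a class with no true or predicted samples as 0 is an assumed convention.

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        # Assumed convention: a class without any support or predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Invented three-class example (e.g., short / medium / long stay).
toy = [
    [8, 2, 0],
    [3, 5, 2],
    [1, 1, 8],
]
print(round(macro_f1(toy), 3))
```

Because every class contributes equally to the mean, the macro F1-score penalizes a classifier that neglects rare LOS classes, which matters for imbalanced multi-class targets.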

Table 6. Macro F1-scores of the classifier models trained on the over-sampled training data frames.

NUMC Oversampling Results
Data Frame Source	Random Forest	SVM	K-Nearest Neighbor
Real Data Set	0.59	0.60	0.55
SMOTE	0.57	0.60	0.54
Random Oversampling	0.58	0.60	0.52
CTGAN	0.56	0.58	0.53
Copula GAN	0.57	0.58	0.54
TVAE	0.57	0.58	0.53
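
As a point of reference for the simplest baseline in Table 6, random oversampling duplicates randomly chosen minority-class records until the classes are balanced. The following is a minimal standard-library sketch, not the authors' implementation; balancing every class up to the majority-class count, the function name `random_oversample`, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # assumed target: size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: one feature per row, imbalanced LOS classes.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["short", "short", "short", "short", "medium", "long"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Only the training data would be oversampled in such a setup; the test data keep their original class distribution so that the reported scores remain comparable with the real-data baseline.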

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
