Open Access Article

Peer-Review Record

Synthetic Data as a Proxy for Real-World Electronic Health Records in the Patient Length of Stay Prediction

Sustainability 2023, 15(18), 13690; https://doi.org/10.3390/su151813690

by Dominik Bietsch ¹, Robert Stahlbock ¹,² and Stefan Voß ¹,*

Reviewer 1: Jingyuan Zhao

Reviewer 2: Anonymous

Reviewer 3: Takeshi Emura

Reviewer 4: Reza Soleimani

Submission received: 2 June 2023 / Revised: 9 July 2023 / Accepted: 5 September 2023 / Published: 13 September 2023

(This article belongs to the Section Health, Well-Being and Sustainability)

Round 1

Reviewer 1 Report

Synthetic data generation is indeed crucial for areas where data scarcity and privacy concerns exist, such as the healthcare industry. The authors demonstrate a good understanding of GANs (Generative Adversarial Networks) and their application in creating synthetic electronic health records (EHR). The paper clearly outlines the objective and approach, particularly in leveraging GANs for predicting patient Length of Stay (LOS), a significant parameter in resource allocation within healthcare facilities. The use of a real-world dataset (NY Hospital Inpatient Discharges in 2015) adds credibility to the research and its findings. Some comments are as follows:

1. The affiliation part is terrible. Where is the first one? And what about 2 and 3? Who belongs to the third affiliation? Moreover, the "current address" shown in that section should not be written separately. Also, affiliations 2 and 3 are exactly the same, so why are they written as two separate entries?

2. The structure of the paper could be improved. Currently, the introduction contains an excessive focus on the GAN algorithm, which deviates from the central theme of your article. Your manuscript primarily provides an application of using GAN, rather than focusing on the algorithm itself. In the introduction, it would be beneficial to include more literature review related to Patient Length of Stay prediction. The discussion of the modeling could be moved to other sections, such as methods, and the detailed background of the model can be minimized in the introduction. This would prevent readers from being distracted from the main focus, which is prediction, to the specifics of the algorithm.

3. The paper reveals that the GAN architectures in their current form struggle with certain column properties (high cardinality, imbalance), leading to mixed results. This limits the general applicability of your research.

4. There is insufficient explanation regarding the underperformance of the GAN models on certain aspects of the data. More analysis on this issue would make your conclusions stronger.

5. It is unclear how the privacy aspect is handled. While GANs can generate synthetic data that can alleviate privacy concerns, the privacy-preserving aspect of these specific models (CTGAN and Copula GAN) is not well explored.

6. Consider delving deeper into the limitations of the GAN architectures used, especially why they may fail to handle high cardinality and imbalance in the columns. This would contribute to the understanding of their suitability for EHR data generation.

7. Explain how these GAN models maintain privacy and whether they meet privacy standards in healthcare. This is crucial since one of the motivations for synthetic data generation is to mitigate privacy concerns.

8. It would also be valuable to investigate ways to expedite the training process while improving data quality. If the generation of synthetic data is not time-efficient, it may not be a feasible solution for practical applications.

9. For future work, it might be beneficial to look into new or alternative GAN architectures that might better handle the issues faced in this study.

Minor editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The article uses several approaches for generating synthetic data. However, the article is not well organized and is difficult to follow. The discussion of some concepts, and of how they relate to the proposed work, is not clear. Above all, the reported model performance is too low. The comments are as follows:

1. In the abstract, the authors state that the work aims to use CTGAN and Copula GAN; however, the article also discusses TVAE extensively. It is not clear whether the purpose is to evaluate TVAE as well. Please revise accordingly.

2. Also, some results are reported only for CTGAN and Copula GAN and not for TVAE (e.g., Figure 5). The authors should clarify this throughout the text and possibly report all results for the three models.

3. The authors report descriptions of the architectures for generating the synthetic data in the introduction. However, the Introduction section should introduce the topic and the methods should be explained in the Methods section.

4. At the end of the Introduction section, it should appear what the contributions of the work are and expose how the paper is organized.

5. It is unclear whether the methods outlined in 1.5 and 1.6 are relevant to the proposed work. The authors should clarify how these methods are related to the experiments performed. If the purpose is to evaluate CTGAN and Copula GAN and TVAE it is expected that the Methods section describes only these three methodologies used.

6. Section 3.1 needs to describe in detail the dataset used and how and why features were discarded. Were feature selection methods used? Have statistical tests been carried out?

7. In 3.3.4 it is not clear how the results were aggregated. Some metrics should not be aggregated with each other, e.g., those evaluating model performance and those evaluating the system as a proxy, because they capture two very different aspects of the models.

8. Tables 2,3 and 4 should report all quantitative metrics separately in addition to the aggregated metrics. The generalization capability of the models with the aggregate metrics alone is not clear.

9. In 3.4 it is not clear what the CRISP-DM approach is and what it was used for. The authors should clarify this.

10. In 3.5.5 the authors need to be more specific about "if the number is high". What is meant by high?

11. The training protocol is unclear. Were the datasets divided into training validation and testing? It was mentioned in section 3.3.3 that cross-validation was employed. Therefore, all results should report mean and standard deviation values.

12. How was the search for hyperparameters performed (random search, greedy search, etc.)? Hyperparameter search should be done on training and validation datasets, keeping out the testing portion.

13. Why does Figure 5 show only the confusion matrices of the NUMC dataset? In addition, Tables A1 and A2 should also show the respective confusion matrices to evaluate the generalization capabilities for each class.

14. In the end, the performance is too low for a three-class classification. The authors should find a solution to this problem. The precision and recall values shown in Figure 5 suggest that the models fail to generalize, especially for the first class (Figures (b) and (c))

Minor issues:

15. One reference is missing in the introduction (line 36).
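The protocol requested in comments 11 and 12 can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the toy labels, the majority-class placeholder model, and the helper names (`k_fold_indices`, `cross_validate`) are all assumptions introduced here. It holds out a test portion first, runs k-fold cross-validation on the remainder, and reports the mean and standard deviation across folds.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_class(labels):
    """Placeholder model (assumption): predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(labels, k=5, seed=0):
    """Return one validation accuracy per fold for the placeholder model."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train = [labels[j] for f, fold in enumerate(folds) if f != i for j in fold]
        val = [labels[j] for j in val_idx]
        pred = majority_class(train)
        scores.append(accuracy(val, [pred] * len(val)))
    return scores


# Toy three-class length-of-stay labels (hypothetical, not the paper's data).
rng = random.Random(42)
labels = [rng.choice(["short", "medium", "long"]) for _ in range(300)]

# Hold out the test portion first; tuning and cross-validation use only the rest.
test_labels, dev_labels = labels[:60], labels[60:]

scores = cross_validate(dev_labels, k=5)
mean_acc = statistics.mean(scores)
std_acc = statistics.stdev(scores)
print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(scores)} folds")
```

Any hyperparameter search would loop over candidate settings inside `cross_validate`, and `test_labels` would be scored only once, after the final setting is chosen.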

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper proposes to apply GAN methods (e.g., the Copula GAN) to synthesize electronic health data. The main endpoint of interest is the patient length of stay. Prediction analysis was performed using training and validation datasets. The paper is a novel application of a machine learning method to medical data.

The paper is well written, yet some corrections and modifications are necessary before it qualifies for publication.

Major concerns:

1. All abbreviations should be defined in the text, such as LOS and GAN. Some of them, defined in the Abstract, should be defined again in the text, since the main text should be readable on its own.

2. Overall, the paper should be more carefully edited. For instance, avoid the missing reference [?] in P.2, Line 36.

3. The paper often uses unclear pronouns, making some of the sentences difficult to understand. For instance, please avoid the sentence "It does that".

4. P.6, L.210: A Gaussian mixture model needs a proper reference, such as Everitt, B. S. (1996). An introduction to finite mixture distributions. Statistical Methods in Medical Research, 5(2), 107-127. More precisely, the model is called a "finite normal mixture model".

5. The paper applies many statistical models and methods, such as the Copula GAN and the K-S tests. The references could include the statistical software packages that were used to implement the analysis.

6. P.6, L.262: A reference is necessary for the Gaussian copula, such as Nelsen, R. B. (2006). An introduction to copulas, 2nd edition, Springer. In addition, the authors could briefly explain what copulas are for readers who are not familiar with them.

7. The patient LOS has been analyzed by many authors. The paper should place more emphasis on its diverse use in medical research, including randomized clinical trials: Lin, W., Halpern, S. D., Prasad Kerlin, M., & Small, D. S. (2017). A "placement of death" approach for studies of treatment effects on ICU length of stay. Statistical Methods in Medical Research, 26(1), 292-311.

8. P.11-12: Abbreviations such as KS, CS, and GMM should be defined.

9. P.12, logarithmic likelihood --> log-likelihood; learned --> fitted
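The brief explanation of copulas requested in comment 6 could be accompanied by a sketch like the one below. It is an illustration under stated assumptions, not the authors' code and not the Copula GAN implementation they used: the correlation `rho`, the exponential marginals, and the function names are hypothetical. The idea shown is that a Gaussian copula fixes only the dependence between variables, on the uniform scale, while each marginal distribution is chosen separately.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(rho, n, seed=0):
    """Draw n pairs (u, v) in (0, 1)^2 whose dependence follows a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a uniform marginal.
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


def exponential_quantile(u, mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0) for extreme draws
    return -mean * math.log(1.0 - u)


uniforms = gaussian_copula_sample(rho=0.7, n=2000)

# Hypothetical marginals: length of stay in days and total charges in dollars.
samples = [
    (exponential_quantile(u, 5.0), exponential_quantile(v, 20000.0))
    for u, v in uniforms
]
```

Swapping `exponential_quantile` for another inverse CDF changes the marginals without touching the dependence structure, which is the property that makes copulas useful for tabular data with mixed column distributions.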

English is fine, but the paper should be more carefully edited. For instance, avoid the missing reference [?] in P.2, Line 36. The paper often uses unclear pronouns, making some of the sentences difficult to understand. For instance, please avoid the sentence "It does that".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Some lines need clearer writing to be understood: 444-445 and 467-489.

The beginning of lines 558-559 needs to be corrected.

A reference is missing at line 36.

There is no mention of the training part of the generative models and their hyperparameters that are optimized. Please elaborate.

The class labels in Fig. 5 are wrong.

What are the main contributions of this paper? Please clearly elaborate.

If this is a new methodology, a comparison with the older methods is required. How well does your evaluation compare to the others?

The paper language needs to be clearer to easily follow the sections.

Overall, the English structure is good, but the paper needs to be revised for some grammar issues and for clarity.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Accept in present form

Minor editing of English language required

Reviewer 3 Report

The authors have nicely addressed all the concerns in my report. I would suggest accepting the paper in its present form.

Reviewer 4 Report

The authors incorporated all the changes.
