Article
Peer-Review Record

International Classification of Diseases Prediction from MIMIC-III Clinical Text Using Pre-Trained ClinicalBERT and NLP Deep Learning Models Achieving State of the Art

Big Data Cogn. Comput. 2024, 8(5), 47; https://doi.org/10.3390/bdcc8050047
by Ilyas Aden *, Christopher H. T. Child and Constantino Carlos Reyes-Aldasoro
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 27 March 2024 / Revised: 27 April 2024 / Accepted: 30 April 2024 / Published: 10 May 2024
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. The paper mentions that many SOTA (State Of The Art) baselines were compared, but among them, RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory) are quite early deep learning models. How does the method proposed in this paper compare to Transformer-based models [1]?

 

2. The figures in the paper are very problematic. For example, the meaning of the red or blue underline in Figure 5 is unclear, and the lines in Figure 8 are not aligned.

 

3. The paper needs to cite more recent deep learning literature [1-2].

 

[1] Han K, Xiao A, Wu E, et al. Transformer in transformer[J]. Advances in neural information processing systems, 2021, 34: 15908-15919.

[2] Li W, Fan L, Wang Z, et al. Tackling mode collapse in multi-generator GANs with orthogonal vectors[J]. Pattern Recognition, 2021, 110: 107646.

Comments on the Quality of English Language

The quality of English in the paper needs further improvement, as many sentences lack clarity and logical flow, which is particularly evident in the abstract.

Author Response

IMPLEMENTATION OF FIRST REVIEWER FEEDBACK (Time spent: 4 days):

  1. Clarified the objective of our paper, which was to utilise "autoencoding" models such as BERT—a transformer-based framework—rather than "autoregressive" or "sequence-to-sequence" models, which demand more significant computational resources (1/2 day).
  2. Enhanced and verified all figures within the manuscript for better presentation (1/2 day).
  3. Enriched the manuscript by referencing more work on NLP and deep learning models, specifically incorporating the foundational transformer paper underlying Google's BERT model as a reference:

   [1] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan; Kaiser, Lukasz; Polosukhin, Illia. (2017). Attention Is All You Need.

4. Conducted a thorough review of the two papers suggested by the reviewers. While they appear tangential to the scope of ICD prediction, they offer general insights into NLP deep learning methodologies (2 days).

My assessments of these two papers are as follows:

    • Relevance to ICD Prediction: These works are not directly pertinent as they delve into distinct realms of deep learning such as computer vision and generative models, which are applied in contexts beyond ICD prediction and clinical text analysis.
    • Potential Connection: Despite the papers' indirect relevance, the underlying Transformer architecture has been applied across various NLP endeavours, including clinical text analysis. It is possible that subsequent studies could explore the application of Transformer models to ICD prediction tasks.

5. Rigorously reviewed and substantially revised the abstract to ensure linguistic precision and conciseness. (1 day)

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors of "ICD prediction from MIMIC-III clinical text using pre-trained clinicalBERT and NLP deep learning models achieving state-of-the-art" show with their work that deep learning models exhibit superior performance compared to conventional machine learning approaches.

Their work shows an improvement in the accuracy of disease prediction.

The F1 values are also very encouraging.

However, some considerations have emerged:

-line 57-59: How is the presence of a phenocopy discriminated using these deep learning models?

-to cite Table 3 and 4 in the main text.

-figure 6-7-8-9 can be moved in supplementary files.

-related using BERT model, the ROC curve can be evaluated.

Comments on the Quality of English Language

Minor editing of English language required

Author Response

IMPLEMENTATION OF SECOND REVIEWER FEEDBACK (Time spent: 3.5 days):

  1. To underscore the innovation and results of our study, I have included the following paragraph within the results and discussion section to offer additional clarification (1/2 day).

“To summarise, this experimentation utilised a diverse array of deep learning models, including RNN, LSTM, BiLSTM, and BERT, with a specific emphasis on the Bio-ClinicalBERT model, which is pre-trained on biomedical texts. The study takes advantage of various neural network architectures, particularly focusing on a specialised version of BERT pre-trained for biomedical contexts. This approach enhances the model's ability to interpret clinical language effectively. Furthermore, these results showcase significant advancements in the automation of ICD coding and present the most comprehensive F1 score metrics available to date. These scores are internationally recognised for evaluating the balance between precision and recall in classification tasks.”
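For readers who want to see how such a setup is typically wired together, the minimal sketch below fine-tunes the publicly available Bio_ClinicalBERT checkpoint for multi-label ICD prediction with the Hugging Face transformers library. It is an illustrative outline under our own assumptions (the number of labels and the placeholder texts and labels are invented), not the exact training code used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CODES = 10  # e.g. top-10 ICD diagnosis codes (assumption for illustration)

# Public clinical BERT checkpoint pre-trained on MIMIC-III notes
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=NUM_CODES,
    problem_type="multi_label_classification",  # BCE loss, one sigmoid per code
)

# Placeholder discharge summary and its multi-hot ICD label vector
train_texts = ["Patient admitted with chest pain and shortness of breath ..."]
train_labels = torch.zeros(len(train_texts), NUM_CODES)
train_labels[0, 3] = 1.0  # hypothetical positive code

inputs = tokenizer(train_texts, truncation=True, padding=True,
                   max_length=512, return_tensors="pt")
outputs = model(**inputs, labels=train_labels)
outputs.loss.backward()  # one gradient step; a real run would loop over batches

with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)  # per-code probabilities
```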

 

  2. Furthermore, I have conducted a comprehensive review and analysis of two pivotal papers that serve as benchmarks for comparison (2 days):

 

[1] Biswas, Biplob & Zhang, Ping. (2021). TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding.

[2] Mullenbach, James & Wiegreffe, Sarah & Duke, Jon & Sun, J. & Eisenstein, Jacob. (2018). Explainable Prediction of Medical Codes from Clinical Text. 1101-1111. 10.18653/v1/N18-1100.

This revision aims to provide clear context for the unique contributions of our study and acknowledges a thorough examination of the relevant literature for comparison.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a comprehensive study on using deep learning models, specifically clinicalBERT along with RNN, LSTM, and BiLSTM models, for predicting ICD (International Classification of Diseases) codes from clinical texts in the MIMIC-III database. The authors aim to enhance the accuracy of diagnosing patient conditions by leveraging advanced natural language processing (NLP) and machine learning techniques. Results show that the proposed model achieves new state-of-the-art prediction accuracy for the top-10 and top-50 MIMIC-III diagnosis codes.

Strength: This is a typical application of transformer models to biomedical and clinical NLP research. The authors applied the pre-trained clinicalBERT model to predict MIMIC-III codes for EHR texts, which is an appropriate way to apply pre-trained biomedical transformer models. The paper is well written and clearly presents the model structure, experimental results, and some discussion.

Weakness: The novelty in this paper is limited. The authors do not propose any new machine learning structures. They simply applied (possibly without even fine-tuning) the clinicalBERT model to predict MIMIC-III codes using EHR text as input. No novel machine learning models or deep neural networks are proposed.

Conclusion: In general, I would still recommend accepting this paper. Although its novelty is low, the proposed model does achieve state-of-the-art performance compared to other published baselines. But before accepting this paper, I hope the authors can:

1. Add more recent works as baselines for comparison. Applying transformers to mimic-iii is a well-developed research area. I believe there should be more recent models other than those mentioned in this paper.

2. Try to modify the pre-trained clinicalBERT model in order to further improve the predicting accuracy, which can improve the novelty of this paper. But this recommendation is optional.

Comments on the Quality of English Language

No major grammar errors. Writing is fine in general.

Author Response

SECOND IMPLEMENTATION OF REVIEWER 2 FEEDBACK:

-line 57-59: How is the presence of a phenocopy discriminated using these deep learning models?

Lines 57-59 refer to the paper "TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding", described in the first section, "Background and context". This paper presents a deep learning model, specifically a transformer-based architecture, used for automating the assignment of International Classification of Diseases (ICD) codes to clinical notes, a process that is traditionally manual and prone to errors. A particular challenge in the ICD coding task is identifying "phenocopies": cases where different conditions have similar clinical presentations but should be assigned different ICD codes. In that paper, phenocopies are discriminated by the model through a "code-wise attention mechanism". This mechanism allows the model to focus on different parts of the text depending on the specific ICD code being considered.
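For illustration only, the sketch below shows one common way such a code-wise (label-wise) attention layer can be written in PyTorch; the class name, dimensions, and wiring are our own assumptions rather than the TransICD authors' implementation.

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Per-label attention over token representations, in the spirit of
    code-wise attention; dimensions and names are illustrative."""
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        # One learnable query vector per ICD code
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_dim))
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from a transformer encoder
        scores = torch.einsum("ld,bsd->bls", self.label_queries, token_states)
        weights = torch.softmax(scores, dim=-1)          # attention per ICD code
        label_docs = torch.einsum("bls,bsd->bld", weights, token_states)
        logits = self.classifier(label_docs).squeeze(-1)  # one logit per ICD code
        return logits
```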

This "code-wise attention mechanism" helps in distinguishing phenocopies by ensuring that the model's predictions are based on the features of the text most relevant to each specific medical condition, even if different conditions share overlapping symptoms or descriptions. It therefore enhances the model's accuracy and interpretability, which is crucial in clinical settings where understanding the reasoning behind automated decisions is as important as the decisions themselves.

-to cite Table 3 and 4 in the main text.

Done! See lines 207 and 215 of the new PDF file created after LaTeX compilation.

-figure 6-7-8-9 can be moved in supplementary files.

Done! We created a new Appendix section, placed all the figures there, and cited them in the main text.

-related using BERT model, the ROC curve can be evaluated.

The assessment of machine learning models often requires balancing the depth of the evaluation against the resources it demands. In this context, we opted for the F1 score and classification reports as our model evaluation metrics for a particular reason: they offer greater efficiency. This efficiency is even more pronounced when evaluating BERT models, which influenced our choice of these metrics over alternatives such as the ROC curve.

Below are further explanations supporting our choice of evaluation metrics:

1. Computational Efficiency:

  • F1 Score: This is a single measure that combines precision and recall into a harmonic mean. It only requires one pass through the evaluation dataset to calculate the true positives, false positives, and false negatives needed for its computation. This single value makes it quick to calculate and easy to interpret, especially for binary classification tasks.

  • Classification Report: This report typically includes precision, recall, and F1-score for each class, and sometimes support (the number of true instances for each label). It provides a concise summary and is computed from the confusion matrix, which, like the F1 score, can be generated in a single pass through the evaluation data.

  • ROC Curve: The Receiver Operating Characteristic (ROC) curve requires the model to output probability scores for classes, which are then used to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various threshold settings. Generating a ROC curve can be computationally intensive because it may involve multiple passes over different thresholds to plot the curve accurately, especially if a smooth curve is wanted.

2. Interpretability:

  • F1 Score and Classification Report give direct insight into how well the model is performing per class, which is easy for stakeholders to understand.

  • The ROC curve, while informative, provides a more general view of the model’s performance across different thresholds. It is a bit more abstract, which can be a little harder for non-experts to grasp compared to a simple F1 score.

In summary, while ROC curves are excellent for visualizing and understanding the behaviour of a model at all classification thresholds, the F1 score and classification reports can be more resource-efficient and provide a direct and often more applicable measure of a model's performance.
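As a concrete illustration of how these metrics are obtained in a single pass over the predictions, the hedged sketch below uses scikit-learn with made-up multi-hot labels; it is not the evaluation script used in the paper.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical multi-label predictions for 4 notes and 3 ICD codes (multi-hot vectors)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]

# Micro-averaged F1 aggregates true/false positives and negatives over all codes
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Per-code precision, recall, F1, and support, computed from the same confusion counts
print(classification_report(y_true, y_pred,
                            target_names=["code_A", "code_B", "code_C"],
                            zero_division=0))
```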

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Please find the reviewer’s major concerns below.

1. While the application is novel, the use of deep learning and NLP models such as BERT in healthcare data analysis is becoming increasingly common, slightly diminishing the uniqueness of the approach. The authors need to discuss the more innovative parts of this study.

2. I only see limited discussion on the challenges faced during model training and optimization. The authors must discuss this part.

3. How do you address the interpretability of your model's predictions? Are there specific techniques or methodologies you have employed or plan to explore to make the model's decision-making process more transparent, especially for healthcare practitioners?

4. I am interested to ask the authors how the model might interface with existing electronic health record systems, and what steps are necessary to ensure seamless integration and user adoption.

5. While your model demonstrates superior performance over previous studies, how does it compare with commercially available solutions for ICD coding, if any? Are there unique advantages or limitations of your approach when considered in the context of existing solutions?

6. Another issue is that medical terminology and ICD codes evolve over time. How does your model accommodate or adapt to changes in medical language and coding standards? Are there mechanisms in place for continual learning or model updating without extensive retraining?

7. My final concern is about ethical considerations. Please explain how you address ethical considerations and patient privacy concerns, especially given the sensitive nature of the medical records used for training the model. Are there specific safeguards or protocols you recommend when deploying such models in a healthcare setting?

Comments on the Quality of English Language

Moderate English proofreading is required.

Author Response

IMPLEMENTATION OF THIRD REVIEWER FEEDBACK:

1. While the application is novel, the use of deep learning and NLP models such as BERT in healthcare data analysis is becoming increasingly common, slightly diminishing the uniqueness of the approach. The authors need to discuss the more innovative parts of this study.

Table 5 describes the comparison between our experimental results and other results from the literature review. It is clear that our study takes advantage of various neural network architectures, particularly focusing on a specialised version of BERT pre-trained for biomedical contexts. This approach enhances the model's ability to interpret clinical language effectively. Furthermore, this paper showcases significant advancements in the automation of ICD coding and presents the most comprehensive F1 score metrics available to date. These scores are internationally recognised for evaluating the balance between precision and recall in classification tasks.

2. I only see limited discussion on the challenges faced during model training and optimization. The authors must discuss this part.

To address this question, we reviewed and rewrote the limitations and future work section as follows:

“One of the main challenges we faced during our work was the lack of computational resources to execute the high-end operations necessary for training and optimizing complex models like RNN, LSTM, and BERT. Indeed, handling the extraction of 7 GB from the MIMIC-III dataset, which has an initial total size of 3 TB and consists of 26 tables, demands significant computational resources and time. Using a private cloud server with an RTX A6000 GPU helped us overcome resource-limited environments, enabling more efficient data processing and model training.

During the training of RNNs, LSTMs, and particularly BERT models using the MIMIC-III dataset, we encountered several additional challenges. Firstly, the complexity and heterogeneity of healthcare data present in MIMIC-III can lead to issues such as imbalanced classes and missing values, which significantly affect the performance of predictive models. Addressing these data quality issues required sophisticated preprocessing steps, which themselves are resource-intensive.

Moreover, the temporal dependencies and high dimensionality of the data make RNNs and LSTMs computationally expensive to train. These models also suffer from issues like vanishing and exploding gradients, making it challenging to train deep networks effectively without careful tuning of hyperparameters and the adoption of techniques like gradient clipping and batch normalization.

BERT and other transformer-based models, while powerful in capturing contextual information from clinical notes, demand even greater computational resources due to their attention mechanisms and large number of parameters. Training these models from scratch on a dataset like MIMIC-III can be prohibitively expensive, often necessitating the use of pre-trained models followed by fine-tuning on specific tasks. However, the adaptation of these models to domain-specific medical language and tasks requires careful calibration and validation to ensure that the models do not perpetuate biases or errors inherent in the training data.

Future work will focus on enhancing prediction models for ICD codes or diagnoses by using an ensemble approach rather than relying on single models. Such an approach may leverage the strengths of various model architectures to improve accuracy and robustness. Additionally, refinements are necessary to boost the performance and accuracy of models when predicting a larger number of diagnoses, such as the top 20, top 50, or even more than 100 diagnoses.

Furthermore, adopting more advanced validation techniques, such as k-fold cross-validation, will be explored to ensure the robustness and generalizability of the models. Unlike the traditional approach of splitting the dataset into a fixed training and test set, k-fold cross-validation provides a more comprehensive evaluation of model performance by partitioning the data into multiple subsets for training and validation. This helps in assessing the model's performance across different subsets of the data and provides a more accurate estimate of its true performance on unseen data.

Lastly, addressing the limitations in explainability and transparency of these complex models is crucial, especially in a high-stakes field like healthcare. Developing methods to interpret model decisions and ensure they align with clinical reasoning will be critical in future work, enabling clinicians to trust and effectively use AI-driven tools in their decision-making processes.”
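As a pointer to what the k-fold evaluation mentioned in the passage above might look like, the sketch below wraps a generic model-building function in a scikit-learn KFold loop; it is a future-work illustration under our own assumptions, not part of the reported experiments.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def evaluate_with_kfold(texts, labels, build_and_train, n_splits=5):
    """Generic k-fold loop. build_and_train is any function returning a fitted
    model with a .predict(texts) method (placeholder for an RNN/LSTM/BERT pipeline)."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=42).split(texts):
        model = build_and_train(texts[train_idx], labels[train_idx])
        preds = model.predict(texts[val_idx])
        scores.append(f1_score(labels[val_idx], preds, average="micro"))
    # Mean and spread of micro-F1 across folds
    return float(np.mean(scores)), float(np.std(scores))
```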

3. How do you address the interpretability of your model's predictions? Are there specific techniques or methodologies you have employed or plan to explore to make the model's decision-making process more transparent, especially for healthcare practitioners?

We prioritised the F1 score and classification reports to support our model's interpretability without resorting to other methodologies such as SHAP or LIME. Indeed, selecting the F1 score as a primary evaluation metric significantly enhances the interpretability and explainability of a model's performance. The F1 score, a harmonic mean of precision and recall, provides a balanced view of both false positives and false negatives, which is particularly important in scenarios where classes are imbalanced. A high F1 score indicates not only that the model is accurately identifying positive cases, but also that it is not excessively mislabelling negative cases as positive—a common concern in many real-world applications.

Furthermore, the classification report complements the F1 score by providing a detailed breakdown of precision, recall, and F1 score for each class. This level of detail affords stakeholders a granular understanding of the model's performance across different categories, allowing them to identify specific areas where the model excels or where it may require further improvement. For example, in a medical diagnosis context, it would reveal if the model were better at identifying one disease over another, which is crucial for trust and reliability.

4. I am interested to ask the authors how the model might interface with existing electronic health record systems, and what steps are necessary to ensure seamless integration and user adoption.

Integrating the model into an existing electronic health record (EHR) system involves several steps and requires careful planning, execution, and continuous monitoring to ensure seamless integration and user adoption.

Below is a quick overview of how this might be approached from a data engineering perspective:

1. Assessment and Planning:

  • Understand the EHR Ecosystem: Begin by thoroughly understanding the existing EHR system, including its architecture, data models, workflows, and APIs.

  • Identify Integration Points: Determine where and how models will interface with the EHR system. This could be for predictive analytics, decision support, or enhancing data entry.

2. Data Preparation and Processing:

  • Data Access: Set up secure methods for accessing EHR data, which may involve de-identification for model training and establishing real-time data pipelines for operational use.

  • Data Quality: Ensure data quality and consistency through cleaning, normalization, and transformation steps.

3. Model Development and Validation:

  • Iterative Development: Develop the models iteratively, with continuous feedback from clinical stakeholders to ensure clinical relevance and accuracy.

  • Validation and Testing: Rigorously validate the models against independent data sets to ensure they are generalizable and perform well across diverse patient populations.

4. Integration Architecture:

  • Microservices Architecture: Consider a microservices architecture to deploy models, allowing them to run independently from the EHR system and communicate through well-defined APIs (see the sketch after this list).

  • EHR Customisation: Customise EHR interfaces or use vendor-provided modules to present model outputs to end-users in a way that fits into their existing workflows.

5. User Interface and Experience:

  • User-Centric Design: Design user interfaces that present model outputs clearly and usefully within the EHR system, emphasizing usability to fit clinical workflows.

  • Decision Support: Implement the models in a way that supports, rather than disrupts, clinical decision-making processes.

6. Testing and Iteration:

  • Pilot Studies: Conduct pilot studies within a clinical environment to test the integration thoroughly and identify any issues in real-world settings.

  • Performance Monitoring: Continuously monitor the performance and impact of the integrated solution to ensure it remains accurate and useful.

7. Training and Adoption:

  • Stakeholder Training: Provide comprehensive training for all stakeholders, including clinicians, IT staff, and administrative personnel, to ensure they are comfortable with the new tools.

  • Support Structures: Set up support structures, including IT help desks and feedback mechanisms, to address any concerns or problems that users encounter.

8. Continuous Improvement:

  • Feedback Loops: Establish mechanisms to gather feedback from users for continuous improvement of the models and their integration into EHR systems.

  • Model Updating: Implement processes for regularly updating the models with new data to maintain and improve their accuracy and relevance.

9. Governance and Oversight:

  • Regulatory Compliance: Regularly review and ensure compliance with all relevant regulations and standards.

  • Ethical Considerations: Monitor for and address any ethical implications, including potential biases in model predictions.

Finally, throughout all these steps, maintaining transparent communication with users and stakeholders is crucial to addressing concerns, ensuring user satisfaction, and guaranteeing that the integrated system truly enhances patient care and operational efficiency.
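To illustrate the microservices idea in point 4, the sketch below exposes a hypothetical ICD-prediction model behind a small HTTP endpoint using FastAPI, with a stubbed predict_codes function; it is an architectural illustration under our own assumptions, not a production-ready or clinically validated integration.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="ICD prediction microservice (illustrative)")

class Note(BaseModel):
    text: str  # de-identified clinical note forwarded by the EHR system

def predict_codes(text: str) -> list[str]:
    # Stub: a real deployment would call the fine-tuned clinical BERT model here
    return ["I10", "E11.9"]

@app.post("/predict")
def predict(note: Note) -> dict:
    # The EHR system calls this endpoint via its integration layer (e.g. HL7/FHIR middleware)
    return {"icd_codes": predict_codes(note.text)}
```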

5. While your model demonstrates superior performance over previous studies, how does it compare with commercially available solutions for ICD coding, if any? Are there unique advantages or limitations of your approach when considered in the context of existing solutions?

Our models are currently available on GitHub; they are open-source and not intended for commercial use. On the other hand, industry leaders like 3M and Nuance are collaborating to offer advanced ICD-10-ready clinical documentation and coding systems. These commercial products typically offer more mature solutions than open-source alternatives. When comparing new models to such commercial solutions, we need to consider the following benchmarks:

  • Clinical Workflow: Determine if the new model enhances or hinders the clinical coding process.

  • Performance Metrics: Evaluate the model’s accuracy, precision, recall, F1 score, etc., and compare them with those of commercial products.

  • Usability and Integration: Assess how smoothly the new model can be adopted within current healthcare IT systems relative to commercial offerings.

  • Cost-Effectiveness: Compare the costs of deploying and operating the new model against the subscription or purchase costs of commercial coding solutions.

  • Return on Investment (ROI): Gauge potential savings from increased coding precision, such as fewer claim rejections or less corrective work.

The primary benefits or advantages of current open-source models include:

  • Enhanced performance metrics compared to similar existing models.
  • Free accessibility on GitHub.
  • Code that is straightforward to scale and maintain.

Also, the potential limitations of current models are:

  • Data/Concept Drift Monitoring: They may require manual updates to ensure ongoing accuracy and pertinence.
  • Post-Deployment Management: Deployment and ongoing maintenance might need more technical expertise. The importance of thorough validation to build user confidence and clear communication with stakeholders to support informed decision-making cannot be understated.

6. Another issue is that medical terminology and ICD codes evolve. How does your model accommodate or adapt to changes in medical language and coding standards? Are there mechanisms in place for continual learning or model updating without extensive retraining?

Indeed, ICD codes are regularly updated, and we now utilize ICD-11. During the data pre-processing stage, there is an essential step that revises and updates the coding from older ICD versions to the current one. We employ a Python script for this task, which serves as a provisional measure and can be enhanced further by developing it into a more robust ETL (Extract, Transform, Load) tool.
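A minimal sketch of such a pre-processing step is shown below, assuming a hypothetical mapping table (icd9_to_icd10.csv) and column names; the actual script in our pipeline may differ.

```python
import pandas as pd

# Hypothetical general-equivalence mapping file with columns icd9_code, icd10_code
mapping = (pd.read_csv("icd9_to_icd10.csv", dtype=str)
             .set_index("icd9_code")["icd10_code"]
             .to_dict())

def update_codes(diagnoses: pd.DataFrame) -> pd.DataFrame:
    """Rewrite legacy ICD-9 codes to their current equivalents where a mapping exists;
    codes without a mapping are kept as-is and flagged for manual review."""
    out = diagnoses.copy()
    out["icd_code_current"] = out["icd9_code"].map(mapping).fillna(out["icd9_code"])
    out["needs_review"] = ~out["icd9_code"].isin(mapping.keys())
    return out
```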

7. My final concern is about ethical considerations. Please explain how you address ethical considerations and patient privacy concerns, especially given the sensitive nature of the medical records used for training the model. Are there specific safeguards or protocols you recommend when deploying such models in a healthcare setting?

The research presented in this paper touches on several ethical considerations inherent in the field of medical data analysis. The ethical issues associated with this research stem primarily from the use of patient data, the accuracy and implications of automated diagnosis predictions, and potential biases in the data and algorithms. Below are three main issues related to Ethical questions and how they can be addressed:


  1. Patient Privacy and Data Protection: Even though the MIMIC-III dataset is de-identified, the vast amount of detailed clinical information it contains poses a risk of re-identification in some cases. Ensuring all data handling and processing procedures comply with relevant privacy laws and ethical guidelines is paramount. This involves using secure storage solutions, implementing access controls, and regularly auditing data usage.

  2. Ethical Use and Implementation: The potential for misuse or over-reliance on automated diagnostic systems raises ethical concerns about patient care and the role of technology in medicine. To address this, establishing clear guidelines for the ethical use of AI in healthcare is necessary. This includes involving healthcare professionals in the development and deployment process, ensuring transparency about how predictions are made, and maintaining human oversight in all diagnostic decisions.

  3. Bias and Fairness: Machine learning models can inherit or even amplify biases present in their training data. Given the noted disparities in the MIMIC-III dataset, such as a predominance of older adults and males, there's a risk of the models performing less effectively for underrepresented groups. Addressing this involves undertaking bias audits, ensuring diverse and representative training data, and applying fairness-aware algorithms. Additionally, continually assessing the model's performance across different demographics can help identify and mitigate biases.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This paper needs to cite the reference [1] in the related work.

[1] Li W, Gu C, Chen J, et al. DLS-GAN: generative adversarial nets for defect location sensitive data augmentation[J]. IEEE Transactions on Automation Science and Engineering, 2023.

Author Response

Upon careful consideration, here are my reasons why the DLS-GAN paper proposed above would not serve as a relevant reference in the context of my paper:

  1. Differences in data types and processing:
  • My paper deals with textual data, using tokenization, embeddings, and sequential data processing. The computational techniques for handling and extracting meaningful features from text involve linguistic models and contextual embeddings, which do not apply to image data.
  • The proposed DLS-GAN paper handles image data, focusing on the synthesis of visual features and defect specifics in images. This involves spatial data considerations, colour features, and image dimensions that are intrinsic to visual data processing.
  2. Lack of overlapping methodological frameworks and techniques:
  • My paper employs NLP and deep learning techniques to predict ICD codes from clinical text. The methodologies and challenges in processing textual data fundamentally differ from those involved in image data synthesis and augmentation.
  • The proposed paper introduces a GAN-based approach tailored for generating images with precise defect locations, which is specifically designed to improve automated defect detection in manufacturing processes.
  • The methodologies used in this DLS-GAN paper (GAN architectures for image augmentation) do not offer transferable insights or techniques that could be directly applied or adapted for use in NLP tasks described in my paper. The generative models for images and text operate on different principles and are optimized for different types of data distributions and features.
  3. Irrelevance of findings across fields:
  • The findings and advancements reported in the proposed DLS-GAN paper, while significant within the context of image processing and defect detection, do not provide actionable insights or contribute to the theoretical or practical understanding of NLP models or medical diagnosis tasks such as ICD code prediction.
  • The performance metrics, challenges, and solutions in image augmentation cannot be correlated in a meaningful way to text analysis and prediction tasks.

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors, 

Thank you for your reply.

Please check the references carefully; they must be in progressive order and not placed in a confusing way.

Related to my comment on phenocopy, the referenced paper is ok, but please provide a further and concise explanation.

Comments on the Quality of English Language

Moderate editing of English language required

Author Response

THIRD IMPLEMENTATION OF REVIEWER 2 FEEDBACK:

 

Please check the references carefully; they must be in progressive order and not placed in a confusing way.

Reviewed and checked again.

 

but please provide a further and concise explanation:

To address how deep learning models could theoretically be used to discriminate phenocopies in a similar context, we need to first understand what phenocopies are. Phenocopies are conditions that clinically resemble a genetic disorder but are not caused by the genetic mutation associated with that disorder; instead, they result from environmental factors, other non-genetic biological mechanisms, or other diseases.

While my paper does not specifically address phenocopies, the methodologies described for leveraging deep learning to analyse clinical text could be adapted to include discrimination of phenocopies with careful planning and execution, particularly by enhancing the first two stages of data preparation and feature extraction:

  1. Data Preparation: The data must include detailed clinical notes that cover symptoms, diagnoses, treatment responses, and possibly “environmental exposures” or detailed patient histories that could suggest a phenocopy.
  2. Feature Extraction:
  • Using models like Bio_ClinicalBERT, extract semantic features from unstructured clinical text that could indicate both genetic disorders and similar phenotypic presentations caused by other factors (a sketch follows this list).
  • Features could include specific symptom descriptions, mentioned environmental factors, and documented responses to treatments that are not typical of the genetic condition.
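A hedged sketch of that feature-extraction step is shown below: it pulls contextual embeddings for a clinical note from the public Bio_ClinicalBERT checkpoint, which downstream classifiers could then combine with structured features (for example, documented exposures) to flag possible phenocopies. The note text and any such downstream use are illustrative assumptions, not part of the published pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

note = ("Progressive muscle weakness; no family history; "
        "symptoms began after prolonged statin exposure.")  # made-up example text

inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)

cls_embedding = hidden[:, 0, :]        # [CLS] vector as a note-level feature
token_embeddings = hidden.squeeze(0)   # per-token features for attention/span analysis
```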

Reviewer 4 Report

Comments and Suggestions for Authors

Thanks for the response, and I have no further questions.

Comments on the Quality of English Language

This revision reads better than the original submission, but some minor grammar and comma issues are waiting to be corrected.

Author Response

Many thanks for your support. I have reviewed and checked the manuscript again.
