Article

Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report

by Yu-Hyeon Kim 1, Chulho Kim 2 and Yu-Seop Kim 1,*
1 Department of Convergence Software, Hallym University, Chuncheon-si 24252, Republic of Korea
2 Department of Neurology, Chuncheon Sacred Heart Hospital, Chuncheon-si 24253, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8652; https://doi.org/10.3390/app14198652
Submission received: 31 July 2024 / Revised: 13 September 2024 / Accepted: 16 September 2024 / Published: 25 September 2024
(This article belongs to the Topic Innovation, Communication and Engineering)

Abstract

Medical texts contain sensitive information, which poses challenges for their use in AI research. At the same time, there is increasing interest in generating synthetic text to enlarge medical text datasets for text-based medical AI research. This paper therefore proposes a text augmentation system for cerebrovascular diseases, consisting of a synthetic text generation model based on DistilGPT2 and a classification model based on BioBERT. The synthetic text generation model generates synthetic text from randomly extracted reports (5000, 10,000, 15,000, and 20,000) drawn from a corpus of 73,671 reports. The classification model, fine-tuned on the entire corpus, annotates the synthetic text to build a new dataset. Subsequently, we fine-tuned classification models while incrementally increasing the amount of augmented data added to each original dataset. Experimental results show that fine-tuning with added augmented data improves model performance by up to 20%. Furthermore, we found that generating a large amount of synthetic text is not necessarily required for better performance, and that the appropriate amount of data augmentation depends on the size of the original data. Our proposed method thus reduces the time and resources needed for dataset construction by automating the annotation task and generating meaningful synthetic text for medical AI research.

1. Introduction

AI provides great help to people in various fields, and the medical field is no exception. It is transforming tasks such as medical diagnosis, treatment planning, and patient monitoring [1]. However, textual data in the medical field, such as medical records, contain sensitive patient information; thus, their usage is restricted [2,3]. This restriction is essential to ensure patient confidentiality, yet it poses a substantial challenge to developing deep learning models, which require large amounts of data to learn meaningful patterns and achieve high performance. Consequently, limited data accessibility hinders AI research in natural language processing (NLP).
AI research has primarily relied on image data, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) scans, to overcome these limitations in the medical field. Images present fewer patient privacy issues while playing a crucial role in diagnosing diseases and planning treatments. Nevertheless, the importance of textual data remains significant. Textual data such as medical records, doctor’s notes, and patient interview transcripts are essential for understanding the context and details of diseases. Recently, there has been an increasing interest in leveraging textual data in medical research, particularly in generating synthetic text data [4,5].
Synthetic text data generation is gaining attention as a solution to the data scarcity problem, providing an effective way to obtain large amounts of training data. However, this process faces several critical issues. The generated texts lack annotated information, necessitating manual annotation. Additionally, there is a risk of hallucination, where the generated text contains incorrect information, making it unsuitable for use as training data. Therefore, human evaluation and correction of the generated text’s appropriateness are required [6,7,8]. This process consumes significant time and resources during data augmentation.
This paper proposes a cerebrovascular disease-specific text augmentation system using a DistilGPT2-based synthetic text generation model and a BioBERT-based text classification model [9,10]. For each dataset size (5000, 10,000, 15,000, and 20,000 reports), we randomly extracted a subset of the report data and split it into five training-test pairs. The training data of each pair were then input into the synthetic text generation model to generate new texts. Subsequently, we used all report data to fine-tune a classification model, which classified the generated texts as either intracerebral hemorrhage (ICH) or normal and filtered out samples likely to confuse the model. This process created an augmented dataset for each pair.
Through this approach, we identified both the amount of augmented data that maximizes model performance for a given original dataset size and the amount at which augmentation becomes counterproductive. The procedure is as follows: (1) fine-tune BioBERT, ClinicalBERT, and BiomedBERT on the original data of each pair, then fine-tune again while incrementally adding augmented data [11,12]; (2) calculate the accuracy of the fine-tuned models and evaluate whether the augmented data produced by our system improves model performance.
This paper is structured as follows: Section 2 introduces related work, Section 3 describes the models and methods used, Section 4 analyzes the experimental results, and Section 5 presents the discussion. Finally, Section 6 summarizes the study and discusses future research directions. This study proposes a novel method for effectively utilizing textual data in the medical field, contributing to expanding AI-driven medical data analysis capabilities.

2. Related Work

Text data augmentation has gained significant attention in NLP as a way to improve the performance of models trained on limited datasets. One foundational study in this area is the Easy Data Augmentation (EDA) technique introduced by Wei and Zou [13]. Numerous studies have also explored methods such as random word replacement and deletion, paraphrasing, synonym replacement, and inducing spelling errors [14,15,16,17,18,19,20]. These methods generate diverse training samples while maintaining the original text's meaning, enhancing model robustness and generalization.
Alzantot et al. [21] conducted a study on generating adversarial examples to evaluate model robustness. They used GloVe embeddings to find the nearest neighbors based on the Euclidean distance of words and employed Google’s language model to select contextually appropriate words. This research utilized a genetic algorithm to generate adversarial examples, achieving high success rates with minimal word modifications. The adversarial examples created were perceived as similar to the original by human evaluators.
Wang and Yang [22] proposed a method using lexical and frame-semantic embeddings to automatically classify annoying behaviors on Twitter. This study introduced a data augmentation technique that generates new training instances based on continuous word embeddings of tweet data. For example, the sentence “Being late is terrible” could be transformed into “Be behind are bad,” creating new training data with the same label. Additionally, they used the SEMAFOR frame-semantic parser to semantically analyze tweet data and extract frame-level semantic features for data augmentation. This approach demonstrated effective classification performance even on short and noisy tweet data.
Sugiyama and Yoshinaga [23] employed the back-translation technique, in which text is translated into another language and then back into the original language. Their study evaluated neural machine translation (NMT) models on English–Japanese and English–French datasets, using both small parallel corpora and large quasi-parallel corpora. The results demonstrated that this data augmentation improved BLEU scores and positively influenced the translation quality of context-aware NMT models.
Moreover, Guo et al. [24] investigated a technique called MixUp for text, which forms new samples by interpolating pairs of existing samples and their labels. The study experimented with MixUp at both the word and sentence levels, demonstrating that the approach is effective in addressing overfitting.
Chen et al. [25] introduced a new data augmentation technique called TMix for semi-supervised text classification in their framework, MixText. TMix interpolates text samples in the hidden space, generating infinite new training samples. This approach mitigates the overfitting problem and performs well even with minimal labeled data. MixText explicitly models the relationship between labeled and unlabeled data, overcoming the limitations of previous semi-supervised learning models. It estimates low-entropy labels for unlabeled data, making them usable like labeled data. Experimental results demonstrated that MixText outperformed state-of-the-art semi-supervised learning methods across various text classification benchmarks.
Recently, data augmentation using generative models has garnered significant attention. Anaby-Tavor et al. [26] and Bayer et al. [27] explored the use of pre-trained language models like Generative Pre-trained Transformer 2 (GPT-2) [28] to generate synthetic text. This approach involves fine-tuning the models on existing datasets to generate new, contextually appropriate text samples.
In summary, text data augmentation in NLP rapidly evolves, with researchers developing and refining various methods to enhance model performance. Key techniques include EDA, back-translation, text generation, and adversarial examples. These methods enrich and diversify training datasets, ultimately creating more accurate and robust NLP models.

3. Methodology

The structure of the proposed system is illustrated in Figure 1. Text from cerebrovascular disease-related reports is input into a generation model to create synthetic text. Since the generated text lacks labels, it is passed through a classification model to annotate it. The classification model filters the text based on a set threshold, ensuring that only meaningful text remains.

3.1. Generative Model

The generation model used in this study is a DistilGPT2-based model for generating synthetic text related to cerebrovascular diseases [29]. DistilGPT2 is a distilled version of the GPT-2 model that retains most of GPT-2's capabilities while being smaller and more efficient. GPT-2, developed by OpenAI (https://openai.com/, accessed on 2 July 2024), is a state-of-the-art transformer-based language model known for generating coherent and contextually appropriate text based on the input it receives. DistilGPT2 maintains these abilities while being lighter and faster, achieved through knowledge distillation, in which the smaller model is trained to replicate the behavior of the larger model.
The architecture of DistilGPT2 is as follows. It has a vocabulary size of 50,260, which allows it to handle a wide range of medical terms and expressions commonly found in medical reports. The model consists of 6 layers, providing sufficient depth to learn complex language patterns. With a hidden size of 768 dimensions, it can richly represent the contextual meaning of the text. Additionally, the model uses 12 attention heads, enhancing its ability to simultaneously consider various text parts and understand the context more effectively.
Specifically, DistilGPT2 was fine-tuned using a corpus of medical records to tailor it for medical text generation. Fine-tuning involves further training a pre-trained model on a specific domain’s dataset, making the model suitable for that domain. In this study, DistilGPT2 was fine-tuned with medical records related to cerebrovascular diseases to generate relevant text. As a result, the synthetic text generated by this model reflects the style, terminology, and contextual characteristics of medical texts, closely resembling actual data.
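For illustration, this fine-tuning step might be implemented as in the following minimal sketch with the HuggingFace transformers and datasets libraries; the checkpoint name, the file reports_train.txt, the output directory, and the hyperparameters are assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of fine-tuning DistilGPT2 on de-identified report text.
# File names, output paths, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Assumed input: one report sentence per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "reports_train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilgpt2-cvd", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),  # causal LM
)
trainer.train()
trainer.save_model("distilgpt2-cvd")       # reused by the generation sketch below
tokenizer.save_pretrained("distilgpt2-cvd")
```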
To assess the impact of augmented synthetic text on model performance when only small amounts of text data are available, we randomly extracted 5000, 10,000, 15,000, and 20,000 samples from the entire report dataset. This step reflects real-world scenarios in which obtaining annotated medical data is challenging and ensures the model can learn effectively from limited data. To evaluate the generalized performance of the model, we applied 5-fold cross-validation, dividing each original dataset into five distinct training-test pairs with an 8:2 training-test split ratio. This approach helps verify the model's generalization ability and reliability across different parts of the data.
Subsequently, we generated synthetic text using the training data from each pair. Equation (1) illustrates the structure of the synthetic text generation process.
$$S = G_{\mathrm{DistilGPT2}}(\mathrm{text}_{\mathrm{train}}) \tag{1}$$
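In code, Equation (1) corresponds to prompting the fine-tuned model and sampling a continuation. The sketch below assumes the distilgpt2-cvd checkpoint from the previous sketch and uses the first two words of a report sentence as the prompt (see Section 4.3); the decoding parameters are assumptions.

```python
# Sketch of Equation (1): S = G_DistilGPT2(text_train).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2-cvd")  # assumed local path
model = AutoModelForCausalLM.from_pretrained("distilgpt2-cvd")

def generate_synthetic(prompt: str) -> str:
    """Sample one synthetic report continuation for a two-word prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_length=64, do_sample=True,
                                top_k=50, top_p=0.95,
                                pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_synthetic("Cystic encephalomalatic"))  # prompt taken from Table 1
```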
We used the generated synthetic text to augment the training datasets and improve the performance of subsequent classification tasks. This approach creates richer and more diverse datasets, significantly enhancing the model's learning and generalization capabilities and thereby contributing to more accurate and reliable medical AI systems.

3.2. Annotator

The classification model used in this study is a BioBERT-based model tailored for cerebrovascular disease classification. Bidirectional Encoder Representations from Transformers (BERT) [30] is a language model that uses a masked language model based on a bidirectional transformer. BioBERT is an extension of BERT that has been further pre-trained on biomedical text. For this purpose, it was pre-trained on the corpus used for BERT as well as biomedical domain-specific texts from PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). This additional pre-training allows BioBERT to understand and process the biomedical domain’s language better.
Furthermore, BioBERT was fine-tuned using approximately 73,000 cerebrovascular disease-related report data to perform binary classification on cerebrovascular disease-related text.
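A sketch of this classification fine-tuning step is shown below; it assumes the dmis-lab/biobert-v1.1 checkpoint and substitutes a two-example toy dataset for the ~73,000-report corpus, which is not publicly available.

```python
# Minimal sketch: fine-tuning BioBERT for binary ICH/normal classification.
# The checkpoint name and toy data are assumptions, not the authors' setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=2)  # assumed: label 1 = ICH, 0 = normal

train = Dataset.from_dict({
    "text": ["right frontal hematoma with midline shift",   # toy examples
             "unremarkable finding of brain parenchyma"],
    "label": [1, 0],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biobert-ich", num_train_epochs=3),
    train_dataset=train)
trainer.train()
trainer.save_model("biobert-ich")          # reused by the filtering sketch below
tokenizer.save_pretrained("biobert-ich")
```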
To annotate the generated synthetic text, thresholds of 0.24 and 0.89, representing the top and bottom 30% of predictions, were set. These two thresholds serve as reference points for the model to classify the text accurately. Threshold selection plays a critical role in the filtering process. For instance, if a narrower threshold, such as the top and bottom 20%, were used, the number of samples classified as ICH or normal would decrease, potentially reducing misclassification by the model. However, this might lead to a shortage of data needed for performance improvement. On the other hand, setting a broader threshold, such as the top and bottom 40%, could increase the number of samples available for training, improving the model’s performance. However, this approach also risks including uncertain samples near the prediction boundary, which could degrade the model’s performance.
Subsequently, we passed the generated synthetic text through the classification model. This model outputs a prediction value indicating whether the input text is ICH or normal. The text is annotated as normal if the prediction value is less than 0.24, and as ICH if the prediction value exceeds 0.89. Texts with prediction values between these two thresholds are excluded, as they are likely to introduce ambiguity into the model's judgment; such texts, with unclear prediction values, could degrade model performance if annotated incorrectly. Equation (2) formalizes the annotation and filtering of the generated synthetic text by the classification model, where $\theta$ denotes the classifier's parameters.
$$\mathrm{Classifier}(S) = \begin{cases} \mathrm{ICH}, & \text{if } P(y \mid S;\theta) > 0.89 \\ \mathrm{Normal}, & \text{if } P(y \mid S;\theta) < 0.24 \\ \mathrm{Excluded}, & \text{if } 0.24 \le P(y \mid S;\theta) \le 0.89 \end{cases} \tag{2}$$
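The following sketch expresses Equation (2) directly: the fine-tuned classifier's softmax probability for the ICH class is compared against the 0.24 and 0.89 thresholds, and ambiguous samples are excluded. The biobert-ich path and the class-index convention are assumptions carried over from the previous sketch.

```python
# Sketch of Equation (2): threshold-based annotation and filtering.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("biobert-ich")  # assumed local path
model = AutoModelForSequenceClassification.from_pretrained("biobert-ich")
model.eval()

def annotate(synthetic_text: str) -> str:
    """Return 'ICH', 'normal', or 'excluded' per the 0.24/0.89 thresholds."""
    inputs = tokenizer(synthetic_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_ich = torch.softmax(logits, dim=-1)[0, 1].item()  # assumed class 1 = ICH
    if p_ich > 0.89:
        return "ICH"
    if p_ich < 0.24:
        return "normal"
    return "excluded"  # 0.24 <= P <= 0.89: too ambiguous to keep
```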
This process is essential for maintaining the quality and consistency of the generated synthetic text. By annotating the text through the classification model, we enhance the reliability of the dataset and refine the data for model training. This annotation process maximizes the effectiveness of data augmentation using synthetic text, ultimately playing a crucial role in improving the model’s performance. Table 1 shows examples of the augmented synthetic text.

3.3. Evaluation

First, we fine-tuned BioBERT, ClinicalBERT, and BiomedBERT to establish baseline models. These baseline models, trained solely on the original text, serve as the reference points for performance comparison.
Next, we progressively added generated synthetic text to the training data and fine-tuned each model. Synthetic text was added in ten rounds for each training dataset; each round added an amount equal to 20% of the original training data size for each of the normal and ICH classes. This allows us to observe how model performance changes with varying amounts of synthetic text, as sketched below.
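The augmentation schedule can be summarized as a small helper; this sketch assumes pre-generated pools of annotated ICH and normal synthetic texts and simply mirrors the ten 20% increments described above.

```python
# Sketch of the augmentation schedule: ten rounds of +20% per label.
def augmentation_schedule(original, ich_pool, normal_pool):
    """Yield training sets augmented with +20%, +40%, ..., +200% synthetic
    text, adding the same number of samples for each label per round."""
    step = int(0.2 * len(original))
    for k in range(1, 11):
        n = k * step  # synthetic samples per label in this round
        yield original + ich_pool[:n] + normal_pool[:n]
```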
Subsequently, we calculated the accuracy and F1 score of the baseline models and of the models fine-tuned with added synthetic text. These metrics were computed using 5-fold cross-validation and evaluated for each original dataset size used for synthetic text generation (5000, 10,000, 15,000, and 20,000). Equation (3) shows how the accuracy of the fine-tuned models is calculated. Here, $F$ is the number of folds (5), $n$ denotes the amount of text data, $N_f$ is the number of samples in the $f$-th fold, and $\mathrm{BERT}(\mathrm{text}_{n,f,i})$ is the model's prediction for sample $i$ of text set $n$ in fold $f$, taken as 1 if the prediction is correct and 0 otherwise.
$$\mathrm{AvgAcc}_{5\text{-}\mathrm{fold}}(\mathrm{text}_n) = \frac{1}{F} \sum_{f=1}^{F} \left( \frac{1}{N_f} \sum_{i=1}^{N_f} \mathrm{BERT}(\mathrm{text}_{n,f,i}) \right) \tag{3}$$
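Equation (3) translates directly into code; the sketch below assumes per-fold lists of predicted and true labels.

```python
# Direct transcription of Equation (3): accuracy averaged over F = 5 folds.
def avg_acc_5fold(fold_predictions, fold_labels):
    """fold_predictions and fold_labels are lists of per-fold label sequences."""
    F = len(fold_predictions)  # F = 5 in this study
    per_fold_acc = [
        sum(int(pred == true) for pred, true in zip(preds, labels)) / len(labels)
        for preds, labels in zip(fold_predictions, fold_labels)
    ]
    return sum(per_fold_acc) / F
```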
This process ensures that we accurately measure the impact of synthetic text augmentation on model performance across different data sizes.

4. Experimental Result

4.1. Experimental Setup

The experiments were conducted on a server equipped with two Nvidia A100 GPUs. For fine-tuning the classification models, we used the AdamWeightDecay optimizer with a learning rate of 3 × 10−5 and a weight decay rate of 0.3. The batch size was set to 16, and early stopping was configured with a patience of 2 epochs.
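The stated setup might be expressed as follows. The paper does not name the training framework, so this sketch assumes the TensorFlow/Keras path of the transformers library, where AdamWeightDecay and Keras early stopping match the stated hyperparameters; the checkpoint name and data pipelines are assumptions.

```python
# Sketch of the stated training configuration (framework is an assumption).
import tensorflow as tf
from transformers import AdamWeightDecay, TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=2,
    from_pt=True)  # assumed: checkpoint ships PyTorch weights only

model.compile(
    optimizer=AdamWeightDecay(learning_rate=3e-5, weight_decay_rate=0.3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

early_stop = tf.keras.callbacks.EarlyStopping(patience=2,
                                              restore_best_weights=True)
# Assumed tf.data pipelines already batched to 16:
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[early_stop])
```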

4.2. Cerebrovascular Disease

Cerebrovascular diseases refer to conditions that affect the blood vessels in the brain, leading to insufficient blood supply to the brain [31]. These diseases can manifest in various forms, with acute conditions like strokes being particularly prominent. A stroke occurs when the blood flow to the brain is suddenly blocked or when a brain blood vessel bursts, which can lead to dire consequences. Cerebrovascular diseases are one of the leading causes of death worldwide, making rapid diagnosis and treatment crucial [32]. Major cerebrovascular diseases include:
  • Stroke: this occurs when the blood flow is obstructed, causing brain tissue damage. It is further categorized into ischemic stroke (brain infarction) and hemorrhagic stroke (brain hemorrhage).
  • Transient Ischemic Attack (TIA): similar to a stroke, but the blood flow blockage is temporary, and symptoms fully resolve.
  • Cerebral Aneurysm: a condition where a part of a brain blood vessel weakens and bulges. If it ruptures, it can cause severe bleeding.
These conditions are more likely to occur due to risk factors such as hypertension, diabetes, smoking, and hyperlipidemia. Preventive measures include maintaining a healthy lifestyle and undergoing regular health check-ups.

4.3. Medical Reports

The dataset consists of 73,671 reports on cerebrovascular diseases collected from Hallym University Sacred Heart Hospital and Chuncheon Sacred Heart Hospital between 2012 and 2020. The dataset was created for a binary classification task, containing text and labels for 41,373 ICH (Intracerebral Hemorrhage) patients and 32,298 normal patients. To address the imbalance in the dataset, an equal amount of synthetic text for each label was gradually added during the experiments.
Following this, we extracted only text related to cerebrovascular disease diagnosis from medical reports, and each sentence was tokenized into words using NLTK. Only the first two words were selected from these tokens as input for the synthetic text generation model. Table 2 illustrates examples of medical reports. The report for an Intracerebral Hemorrhage Patient describes the symptoms and the affected area, while the report for a normal patient diagnoses a healthy brain condition. These reports allow the model to learn medical terminology and contextual information about diseases, helping the synthetic text generation model produce clinically useful text.
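The prompt-construction step might look like the following sketch, which tokenizes a diagnosis sentence with NLTK and keeps only its first two words as the generation prompt.

```python
# Sketch of prompt extraction: first two NLTK word tokens of each sentence.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize
from nltk.tokenize import word_tokenize

def make_prompt(sentence: str) -> str:
    tokens = word_tokenize(sentence)
    return " ".join(tokens[:2])

print(make_prompt("Cystic encephalomalatic changes in left temporal lobes."))
# -> "Cystic encephalomalatic"
```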

4.4. Result

4.4.1. BioBERT

Figure 2 shows the results of fine-tuning BioBERT. An accuracy improvement was observed regardless of the original dataset size. Specifically, the baseline model fine-tuned on 5000 original samples achieved an accuracy of 77.52%, while the model with 10,000 synthetic texts added per label achieved 81.7%, an improvement of approximately 4.2 percentage points. Models with more synthetic text showed larger performance improvements than those with fewer additions, producing a steeper trend-line slope. The F1 score also improved overall: the baseline model fine-tuned on 5000 original samples started at 0.66, and the models with 6000 and 10,000 synthetic texts added per label reached 0.79, a gain of 0.13.

4.4.2. ClinicalBERT

Figure 3 presents the results of fine-tuning ClinicalBERT. As with BioBERT, accuracy improvements were observed across all original dataset sizes. Specifically, for an original dataset size of 5000, the model with 10,000 synthetic texts added per label achieved an accuracy of 82.8%, approximately 2.7 percentage points above the baseline accuracy of 80.12%. Performance generally improved when synthetic text was added, with small amounts of synthetic text yielding lower performance and larger amounts yielding better results. The performance improvement also became more pronounced as the original dataset size decreased, and the trend-line slope gradually flattened as the original dataset size increased. Performance improvements were also observed in the F1 score, with gains of up to 0.04 over the baseline across original dataset sizes (see Table 4).

4.4.3. BiomedBERT

Figure 4 shows the results of fine-tuning BiomedBERT. As in the previous two cases, accuracy improvements were observed regardless of the original dataset size. Notably, for an original dataset size of 5000, the baseline model had an accuracy of 61.16%, while the model with 8000 synthetic texts added per label achieved 81.44%, an increase of approximately 20 percentage points and the largest improvement across all cases. Performance improved as the amount of added synthetic text increased, with a steep trend-line slope. The F1 score likewise showed the largest improvement, mirroring the accuracy results: the baseline model fine-tuned on 5000 original samples had an F1 score of 0.64, while the models with 6000 and 7000 synthetic texts added per label reached 0.78, a gain of 0.14.

4.4.4. Comparison

Table 3 presents each model's accuracy by original dataset size. We progressively added synthetic text in 20% increments of the original dataset size. Regardless of the model or the size of the original data, adding synthetic text generally improved performance, and the baseline models showed higher accuracy as the original dataset size increased. Although ClinicalBERT had the highest accuracy overall, its performance improvement was the smallest: 2.44 percentage points for 5000 data points, 1.97 for 10,000, 1.26 for 15,000, and 0.77 for 20,000.
On the other hand, BiomedBERT, which had the lowest baseline accuracy, showed the largest performance improvements: 20.28 percentage points for 5000 data points, 4.96 for 10,000, 1.04 for 15,000, and 2.22 for 20,000. BioBERT's gains were 4.18 percentage points for 5000 data points, 3.07 for 10,000, 0.97 for 15,000, and 0.89 for 20,000, smaller than BiomedBERT's but larger than ClinicalBERT's.
Additionally, all three models demonstrated a trend where more synthetic text resulted in relatively better performance gains. Furthermore, the smaller the original dataset, the more pronounced the performance improvement became.
Table 4 shows the F1 scores of each model based on the dataset size. Most models experienced performance improvements regardless of the size of the original dataset. BioBERT and ClinicalBERT both achieved the highest F1 scores at 0.85. ClinicalBERT showed an improvement of 0.01 for 5000 data points, 0.04 for 10,000, 0.03 for 15,000, and 0.01 for 20,000. In contrast, BioBERT demonstrated more varied enhancements, with an increase of 0.13 for 5000 data points, 0.02 for 10,000, 0.03 for 15,000, and 0.03 for 20,000, resulting in a significant difference in the magnitude of performance enhancement between the two models.
BiomedBERT, while achieving a maximum F1 score of 0.84, lower than the other two models, showed the largest performance improvements. Its F1 score increased by 0.14 for 5000 data points, 0.02 for 10,000, and 0.1 for 15,000, but no improvement was observed for the 20,000 case. Additionally, both BioBERT and BiomedBERT showed more pronounced performance improvements with smaller original dataset sizes, a trend that was less evident in ClinicalBERT.
These results demonstrate that synthetic text is crucial in enhancing model performance. Specifically, adding an appropriate amount of synthetic text significantly improved model accuracy. Conversely, adding an inappropriate amount of synthetic text could degrade model performance. This indicates that the quantity of synthetic text significantly impacts model performance, highlighting the importance of determining the optimal amount of synthetic text.

4.4.5. Comparison to Another Technique

Table 5 presents the performance of models fine-tuned with data augmented by our proposed system and by the Synonym Replacement technique from EDA. We compared performance at the original dataset size of 5000, which showed the highest performance increase. With Synonym Replacement, BioBERT's accuracy improved from 79.18% to 79.9%, an increase of approximately 0.7 percentage points; ClinicalBERT showed no increase from 82.52%; and BiomedBERT improved substantially from 69.9% to 79.12%, an increase of approximately 9.22 percentage points. For the F1 score, BiomedBERT increased from 0.78 to 0.82, a 0.04 improvement, while no improvement was observed in the other two models.
In contrast, our system showed overall performance improvements, with accuracy increases of up to 20% and F1 score gains of up to 0.14. Our study, which added the same proportion of data per label to alleviate data imbalance, showed greater accuracy improvement, suggesting that the model predicts more accurately overall and performs well across different classes. This indicates that our system is more likely to perform consistently in various situations.
The higher baseline performance observed when using EDA may have occurred by chance during the experiment. However, the key point of our study is the degree of improvement compared to the baseline. In other words, the more significant performance improvement demonstrated by our method compared to the baseline highlights its effectiveness in enhancing model performance.
Therefore, our system provides higher reliability in real-world applications, showing that filtering unnecessary text, in addition to simply augmenting text, is a crucial factor in improving model performance. This demonstrates that our method contributes relatively more to maximizing model performance.

5. Discussion

5.1. Discussion of Previous Research Results

We propose a text augmentation system for medical reports related to cerebrovascular diseases using a DistilGPT2-based synthetic text generation model and a BioBERT-based classification model. Our method demonstrated that the augmented synthetic text contributes to improving model performance. However, since this study is limited to cerebrovascular diseases, we aim to explore whether we can effectively apply it in other medical domains. Therefore, in this discussion, we investigate the potential generalizability of our methodology through case studies from different medical fields.
The scarcity of labeled data is a common challenge across various medical domains, leading to extensive research into numerous text data augmentation methods to address this issue. For instance, in drug identification and incident classification, ChatGPT has been used to generate synthetic text data, significantly improving pre-trained BERT models’ performance [33]. This study demonstrated that synthetic data generated in diverse contexts helps the model better recognize patterns in drug-related text.
Similarly, in predicting patient readmission, GPT-2 was utilized to augment text data, balancing the dataset and optimizing model performance through fine-tuning [34]. This study emphasized that the quantity and quality of augmented data critically impact model performance, a key consideration in our study.
Additionally, the EDA technique was modified to perform effective text augmentation in a named entity recognition (NER) task on Chinese electronic medical records [35]. This research highlighted how maintaining the semantic consistency of augmented text enhances the model’s ability to generalize.
These case studies suggest that the text augmentation techniques we employed in our research could also be effectively applied in other medical fields. However, they also underscore a significant difference: the absence of a filtering mechanism to ensure domain-specific data accuracy. This difference is crucial, particularly in medical domains where data accuracy and relevance are paramount, thus highlighting the value of our approach.
In conclusion, our research addresses the limitations of previous studies by introducing classification and filtering stages after text generation. This results in a more accurate and reliable medical report augmentation system. Our methodology offers a significant contribution to medical report data augmentation and suggests potential applications in various medical domains in the future.

5.2. Data Drift and Its Risks

Data drift refers to the discrepancy that can arise between the distribution of data a model was trained with and the data encountered in real-world operations. When synthetic data is used, differences between the original data and synthetic data can lead to a decline in model performance. To mitigate these issues, we implemented the following measures:
  • We filtered the generated synthetic text using a classification model. By setting the top and bottom 30% of the classification model’s prediction scores as thresholds, we reduced the differences between the original text and the synthetic text.
  • We combined actual medical report data with synthetic data to ensure the model retained the critical patterns from the original data, thereby helping to reduce the impact of data drift.

5.3. Risk of Bias Introduction

Another issue to be mindful of during the synthetic data generation process is the potential introduction of bias. Synthetic text can excessively reflect or distort specific patterns or tendencies from the original data, which could lead the model to make inaccurate predictions for specific patient groups. To address this, we generated synthetic text using two key terms from the sentences in the reports. This approach minimized the risk of bias that can be introduced by traditional text augmentation methods, such as word replacement or back-translation, as well as by generative models like ChatGPT.

5.4. Negative Impacts

We evaluated model performance by adjusting the amount of synthetic data across various dataset sizes and confirmed that an appropriate amount of synthetic data effectively enhances model performance. However, excessive synthetic data can degrade performance. To examine this effect, we progressively added synthetic data to original datasets of varying sizes and analyzed the resulting performance changes. The analysis revealed that adding too much synthetic text can yield lower performance than more moderate augmentation. For example, for a dataset of 20,000 original data points, BiomedBERT fine-tuned with an additional 40,000 synthetic texts per label achieved an accuracy of 85%, whereas the model with 24,000 added synthetic texts reached 86%. Despite the small difference, both models outperformed the baseline, and performance improvements were observed in most cases overall.
In conclusion, while synthetic data can improve model performance, excessive use may result in negative outcomes. We emphasize the importance of using a suitable amount of synthetic data and recognize the need for future research to systematically analyze these negative effects to better mitigate them.

5.5. Ethical Considerations

In medical AI research, synthetic data plays a crucial role in addressing the issue of data scarcity, but its use entails several ethical considerations. Specifically, we must pay close attention to protecting patient privacy and preventing the generation of inaccurate medical information. Medical data often contains personal information, and strict privacy protection regulations are in place to govern its handling. While synthetic data can be used to circumvent these regulations, there is still a risk that synthetic data could reproduce sensitive information from the original data. For instance, if synthetic data repeatedly reproduces specific patterns from the original data, it could provide clues to identify individual patients. In our study, we mitigated this risk by removing all personal information from the cerebrovascular disease-related texts and labels in all medical reports provided by the hospital.
Additionally, inaccurate information may be included during the generation of synthetic text. This could lead to incorrect clinical decisions, which is particularly concerning in the medical field, where patient health is directly impacted. To ensure that the generated synthetic data contains accurate medical information, we employed a BioBERT-based classification model to filter the synthetic text. During this process, we removed unreliable text from the dataset.
To address the ethical issues that can arise in medical AI research involving synthetic data, researchers and medical institutions must uphold ethical responsibility. We are fully aware of these ethical considerations and remain focused on safeguarding patient rights. Moving forward, we will continue to review the ethical challenges that may emerge during the generation and use of synthetic data and seek solutions to address them.

6. Conclusions

This paper proposes a text augmentation system for cerebrovascular disease-related reports, using a DistilGPT2-based synthetic text generation model and a BioBERT-based classification model. We extracted datasets of 5000, 10,000, 15,000, and 20,000 reports from all medical reports to generate synthetic text, which was then annotated and filtered using the classification model. The results showed that models trained with added synthetic text generally outperformed the baseline models.
For future research, we should apply this system to various medical studies and applications to find ways to secure sufficient learning data while minimizing privacy issues. Specifically, we plan to collect text data related to dementia diagnosis and replace the DistilGPT2 model with Llama as the synthetic text generation model to augment the data.

Author Contributions

Conceptualization, Y.-S.K. and C.K.; methodology, Y.-H.K.; formal analysis, Y.-H.K.; resources, Y.-S.K.; data curation, C.K.; writing—original draft preparation, Y.-H.K.; writing—review and editing, Y.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A5A8019303) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) [NO. 2710007402, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)].

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board at Chuncheon Sacred Heart Hospital (IRB No. 2021-10-012).

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Poalelungi, D.G.; Musat, C.L.; Fulga, A.; Neagu, M.; Neagu, A.I.; Piraianu, A.I.; Fulga, I. Advancing Patient Care: How Artificial Intelligence Is Transforming Healthcare. J. Pers. Med. 2023, 13, 1214.
  2. Lee, S.; Kim, H.-S. Prospect of Artificial Intelligence Based on Electronic Medical Record. J. Lipid Atheroscler. 2021, 10, 282.
  3. Jeun, Y.-J. EMR System and Patient Medical Information Protection. Korean J. Health Serv. Manag. 2013, 7, 213–224.
  4. Goncalves, A.; Ray, P.; Soper, B.; Stevens, J.; Coyle, L.; Sales, A.P. Generation and Evaluation of Synthetic Patient Data. BMC Med. Res. Methodol. 2020, 20, 108.
  5. Gonzales, A.; Guruswamy, G.; Smith, S.R. Synthetic Data in Health Care: A Narrative Review. PLoS Digit. Health 2023, 2, e0000082.
  6. Clark, E.; August, T.; Serrano, S.; Haduong, N.; Gururangan, S.; Smith, N.A. All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 7282–7296.
  7. van der Lee, C.; Gatt, A.; van Miltenburg, E.; Wubben, S.; Krahmer, E. Best Practices for the Human Evaluation of Automatically Generated Text. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October–1 November 2019; van Deemter, K., Lin, C., Takamura, H., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 355–368.
  8. Howcroft, D.M.; Belz, A.; Clinciu, M.-A.; Gkatzia, D.; Hasan, S.A.; Mahamood, S.; Mille, S.; van Miltenburg, E.; Santhanam, S.; Rieser, V. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, 8–13 December 2020; Davis, B., Graham, Y., Kelleher, J., Sripada, Y., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 169–182.
  9. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2020, arXiv:1910.01108.
  10. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240.
  11. Wang, G.; Liu, X.; Ying, Z.; Yang, G.; Chen, Z.; Liu, Z.; Zhang, M.; Yan, H.; Lu, Y.; Gao, Y.; et al. Optimized Glycemic Control of Type 2 Diabetes with Reinforcement Learning: A Proof-of-Concept Trial. Nat. Med. 2023, 29, 2633–2642.
  12. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23.
  13. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 6382–6388.
  14. Huong, T.H.; Hoang, V.T. A Data Augmentation Technique Based on Text for Vietnamese Sentiment Analysis. In Proceedings of the 11th International Conference on Advances in Information Technology, New York, NY, USA, 1–3 July 2020; IAIT ’20; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5.
  15. Qiu, S.; Xu, B.; Zhang, J.; Wang, Y.; Shen, X.; de Melo, G.; Long, C.; Li, X. EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks. In Companion Proceedings of the Web Conference 2020, New York, NY, USA, 20–24 April 2020; WWW ’20; Association for Computing Machinery: New York, NY, USA, 2020; pp. 249–252.
  16. Kumar, A.; Bhattamishra, S.; Bhandari, M.; Talukdar, P. Submodular Optimization-Based Diverse Paraphrasing and Its Effectiveness in Data Augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 3609–3619.
  17. Kolomiyets, O.; Bethard, S.; Moens, M.-F. Model-Portability Experiments for Textual Temporal Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Lin, D., Matsumoto, Y., Mihalcea, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2011; pp. 271–276.
  18. Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y. Lexical Data Augmentation for Sentiment Analysis. J. Assoc. Inf. Sci. Technol. 2021, 72, 1432–1447.
  19. Coulombe, C. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. arXiv 2018, arXiv:1812.04718.
  20. Belinkov, Y.; Bisk, Y. Synthetic and Natural Noise Both Break Neural Machine Translation. arXiv 2018, arXiv:1711.02173.
  21. Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.-J.; Srivastava, M.; Chang, K.-W. Generating Natural Language Adversarial Examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2018; pp. 2890–2896.
  22. Wang, W.Y.; Yang, D. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors Using #petpeeve Tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Màrquez, L., Callison-Burch, C., Su, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2015; pp. 2557–2563.
  23. Sugiyama, A.; Yoshinaga, N. Data Augmentation Using Back-Translation for Context-Aware Neural Machine Translation. In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), Hong Kong, China, 3 November 2019; Popescu-Belis, A., Loáiciga, S., Hardmeier, C., Xiong, D., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 35–44.
  24. Guo, H.; Mao, Y.; Zhang, R. Augmenting Data with Mixup for Sentence Classification: An Empirical Study. arXiv 2019, arXiv:1905.08941.
  25. Chen, J.; Yang, Z.; Yang, D. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 2147–2157.
  26. Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7383–7390.
  27. Bayer, M.; Kaufhold, M.-A.; Buchhold, B.; Keller, M.; Dallmeyer, J.; Reuter, C. Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers. Int. J. Mach. Learn. Cybern. 2023, 14, 135–150.
  28. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9.
  29. Oh, B.-D.; Kim, G.-Y.; Kim, C.; Kim, Y.-S. How to Use Language Models for Synthetic Text Generation in Cerebrovascular Disease-Specific Medical Reports. In Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024), St. Julians, Malta, 22 March 2024; Deshpande, A., Hwang, E., Murahari, V., Park, J.S., Yang, D., Sabharwal, A., Narasimhan, K., Kalyan, A., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 10–17.
  30. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 4171–4186.
  31. Cerebrovascular Disease: Types, Causes & Symptoms. Cleveland Clinic. Available online: https://my.clevelandclinic.org/health/diseases/24205-cerebrovascular-disease (accessed on 19 July 2024).
  32. AbuRahma, A.F. Overview of Cerebrovascular Disease. In Noninvasive Vascular Diagnosis: A Practical Textbook for Clinicians; AbuRahma, A.F., Perler, B.A., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 103–139. ISBN 978-3-030-60626-8.
  33. Sarker, S.; Qian, L.; Dong, X. Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification. arXiv 2023, arXiv:2306.07297.
  34. Lu, Q.; Dou, D.; Nguyen, T.H. Textual Data Augmentation for Patient Outcomes Prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Online, 9–12 December 2021; pp. 2817–2821.
  35. Chen, H.; Dan, L.; Lu, Y.; Chen, M.; Zhang, J. An Improved Data Augmentation Approach and Its Application in Medical Named Entity Recognition. BMC Med. Inform. Decis. Mak. 2024, 24, 221.
Figure 1. Proposed System. The text from cerebrovascular disease-related reports is input into a generation model to create synthetic text. This synthetic text is passed through a classification model, which annotates and filters it to produce annotated and filtered synthetic text.
Figure 2. Results of the BioBERT models fine-tuned with the original data and augmented with synthetic text. (a) Original dataset size of 5000 samples and (b) original dataset size of 20,000 samples.
Figure 3. Results of the ClinicalBERT models fine-tuned with the original data and augmented with synthetic text. (a) Original dataset size of 5000 samples and (b) original dataset size of 20,000 samples.
Figure 4. Results of the BiomedBERT models fine-tuned with the original data and augmented with synthetic text. (a) Original dataset size of 5000 samples and (b) original dataset size of 20,000 samples.
Table 1. Examples of the augmented synthetic text.
| Original Text | Augmented Text |
|---|---|
| Cystic encephalomalatic changes in left temporal, frontal lobes, and right inferior cerebellar hemisphere. | Cystic encephalomalatic ipsilateral ventricular effacement. Diffuse obliteration of fourth ventricle. R/O: diffuse hydrocephalus. |
Table 2. Examples of medical reports.
| Intracerebral Hemorrhage Patient | Normal Patient |
|---|---|
| downward and lateral dispalced right middle cerebral artery d/t right frontal hematoma. Otherwise, unremarkable. | Unremarkable finding of brain parenchymal and CSF space. |
Table 3. Accuracy of each model by data size.
| Model | Data Size | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 5000 | 77.52 | 72.82 | 69.04 | 75.28 | 79.88 | 76.66 | 80.78 | 80.36 | 81.02 | 80.78 | 81.7 |
| BioBERT | 10,000 | 80.46 | 77.52 | 80.58 | 81.71 | 81.94 | 82.03 | 83 | 82.83 | 83.53 | 83.25 | 83.23 |
| BioBERT | 15,000 | 82.78 | 80.67 | 81.88 | 82.76 | 83.34 | 83.65 | 83.75 | 83.71 | 83.68 | 83.73 | 83.65 |
| BioBERT | 20,000 | 83.7 | 82.3 | 81.24 | 83.8 | 83.86 | 84.06 | 83.9 | 84.03 | 84.21 | 84.28 | 84.59 |
| ClinicalBERT | 5000 | 80.12 | 80.16 | 79.42 | 81.16 | 80.86 | 81.68 | 81.46 | 81.82 | 82.56 | 82.48 | 82.8 |
| ClinicalBERT | 10,000 | 82.77 | 80.69 | 81.75 | 82.7 | 83.32 | 83.65 | 83.09 | 84.62 | 84.73 | 84.53 | 84.74 |
| ClinicalBERT | 15,000 | 84.13 | 81.8 | 82.58 | 83.81 | 84.91 | 84.43 | 85 | 84.93 | 84.75 | 85.39 | 84.75 |
| ClinicalBERT | 20,000 | 84.88 | 83.14 | 83.89 | 84.27 | 85.03 | 85.48 | 84.74 | 85.62 | 85.56 | 85.32 | 85.65 |
| BiomedBERT | 5000 | 61.16 | 68.08 | 66.38 | 71 | 79.2 | 75.58 | 80.02 | 80.92 | 81.44 | 81.4 | 79.48 |
| BiomedBERT | 10,000 | 78.08 | 81.75 | 74.83 | 81.57 | 78.16 | 81.73 | 82.33 | 82.57 | 82.77 | 82.57 | 83.04 |
| BiomedBERT | 15,000 | 82.09 | 75.25 | 80.72 | 82.35 | 82.47 | 83.03 | 83.13 | 83.06 | 83.39 | 82.81 | 83 |
| BiomedBERT | 20,000 | 83.78 | 80.43 | 79.48 | 80.35 | 83.57 | 85.01 | 86 | 85.7 | 85.99 | 85.35 | 85 |
Table 4. F1 score of each model by data size.
| Model | Data Size | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 5000 | 0.66 | 0.63 | 0.63 | 0.69 | 0.77 | 0.55 | 0.79 | 0.78 | 0.78 | 0.52 | 0.79 |
| BioBERT | 10,000 | 0.79 | 0.8 | 0.8 | 0.81 | 0.78 | 0.81 | 0.81 | 0.79 | 0.7 | 0.78 | 0.74 |
| BioBERT | 15,000 | 0.81 | 0.81 | 0.79 | 0.83 | 0.82 | 0.78 | 0.83 | 0.8 | 0.84 | 0.82 | 0.82 |
| BioBERT | 20,000 | 0.82 | 0.84 | 0.85 | 0.84 | 0.83 | 0.85 | 0.83 | 0.81 | 0.8 | 0.8 | 0.85 |
| ClinicalBERT | 5000 | 0.8 | 0.8 | 0.74 | 0.81 | 0.79 | 0.81 | 0.8 | 0.8 | 0.68 | 0.69 | 0.75 |
| ClinicalBERT | 10,000 | 0.8 | 0.81 | 0.81 | 0.81 | 0.82 | 0.82 | 0.84 | 0.83 | 0.82 | 0.8 | 0.81 |
| ClinicalBERT | 15,000 | 0.82 | 0.82 | 0.84 | 0.84 | 0.82 | 0.82 | 0.83 | 0.85 | 0.82 | 0.84 | 0.81 |
| ClinicalBERT | 20,000 | 0.84 | 0.85 | 0.83 | 0.85 | 0.82 | 0.85 | 0.84 | 0.85 | 0.82 | 0.84 | 0.85 |
| BiomedBERT | 5000 | 0.64 | 0.61 | 0.67 | 0.75 | 0.76 | 0.76 | 0.78 | 0.78 | 0.76 | 0.65 | 0.71 |
| BiomedBERT | 10,000 | 0.8 | 0.81 | 0.67 | 0.73 | 0.65 | 0.82 | 0.81 | 0.82 | 0.71 | 0.82 | 0.82 |
| BiomedBERT | 15,000 | 0.74 | 0.81 | 0.84 | 0.82 | 0.83 | 0.84 | 0.72 | 0.83 | 0.82 | 0.82 | 0.82 |
| BiomedBERT | 20,000 | 0.84 | 0.82 | 0.83 | 0.82 | 0.82 | 0.83 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 |
Table 5. Accuracy and F1 scores of each model fine-tuned with the augmented text generated by the proposed system and by EDA.
| Technique | Model | Metric | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours | BioBERT | Acc | 77.52 | 72.82 | 69.04 | 75.28 | 79.88 | 76.66 | 80.78 | 80.36 | 81.02 | 80.78 | 81.7 |
| Ours | BioBERT | F1 | 0.66 | 0.63 | 0.63 | 0.69 | 0.77 | 0.55 | 0.79 | 0.78 | 0.78 | 0.52 | 0.79 |
| Ours | ClinicalBERT | Acc | 80.12 | 80.16 | 79.42 | 81.16 | 80.86 | 81.68 | 81.46 | 81.82 | 82.56 | 82.48 | 82.8 |
| Ours | ClinicalBERT | F1 | 0.8 | 0.8 | 0.74 | 0.81 | 0.79 | 0.81 | 0.8 | 0.8 | 0.68 | 0.69 | 0.75 |
| Ours | BiomedBERT | Acc | 61.16 | 68.08 | 66.38 | 71 | 79.2 | 75.58 | 80.02 | 80.92 | 81.44 | 81.4 | 79.48 |
| Ours | BiomedBERT | F1 | 0.64 | 0.61 | 0.67 | 0.75 | 0.76 | 0.76 | 0.78 | 0.78 | 0.76 | 0.65 | 0.71 |
| EDA [13] | BioBERT | Acc | 79.18 | 79.56 | 79.28 | 79.82 | 78.78 | 75.56 | 79.5 | 79.9 | 77.64 | 78.24 | 79.32 |
| EDA [13] | BioBERT | F1 | 0.82 | 0.82 | 0.82 | 0.82 | 0.81 | 0.8 | 0.82 | 0.82 | 0.81 | 0.81 | 0.82 |
| EDA [13] | ClinicalBERT | Acc | 82.52 | 81.98 | 82.32 | 81.48 | 82.14 | 82.1 | 81.12 | 81.84 | 82.46 | 80.86 | 82.0 |
| EDA [13] | ClinicalBERT | F1 | 0.84 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.83 | 0.84 |
| EDA [13] | BiomedBERT | Acc | 69.9 | 77.42 | 78.2 | 70.84 | 79.12 | 69.56 | 74.5 | 75.24 | 77.2 | 70.82 | 64.52 |
| EDA [13] | BiomedBERT | F1 | 0.78 | 0.81 | 0.82 | 0.65 | 0.82 | 0.64 | 0.79 | 0.79 | 0.81 | 0.67 | 0.62 |