2.1. Dataset
MIMIC-III [11] is a large and freely available clinical database comprising Electronic Health Record (EHR) data gathered between 2001 and 2012 from patients admitted to the Intensive Care Unit (ICU) of Beth Israel Deaconess Medical Center. The database consists of 26 tables, including ‘ADMISSIONS’, ‘PATIENTS’, ‘NOTEEVENTS’, ‘DIAGNOSES_ICD’, and ‘D_ICD_DIAGNOSES’. The main table is ‘ADMISSIONS’, which contains 58,976 distinct admissions belonging to 46,520 patients. Each admission and each patient has a unique identifier. In addition to these identifiers, each admission record contains the admission type (elective, emergency, urgent, newborn), time of admission, time of discharge, diagnosis at time of admission, death time, etc. The ‘PATIENTS’ table provides further information on patients’ gender and age. Along with the ‘ADMISSIONS’ table, the ‘NOTEEVENTS’ table is the most important for our work, as it contains the clinical notes for each admission, including discharge summaries, ECG reports, radiology reports, etc.
In this work, we are mostly interested in data related to the type of admission, time of admission, and time of discharge, which allow the number of days between discharge and the next unplanned admission to be computed, as well as data on the diagnosis and discharge summaries, which are used by the predictive model. As we are only interested in unplanned readmissions occurring within 30 days of discharge, we create output labels, where a positive label represents a readmission occurring within 30 days of discharge and a negative label represents any other situation.
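The labeling rule described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The dictionary keys mirror the ADMISSIONS columns (SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, ADMISSION_TYPE); treating both emergency and urgent admissions as unplanned, and counting a gap of exactly 30 days as positive, are assumptions, and the records are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: emergency and urgent admissions count as "unplanned".
UNPLANNED = {"EMERGENCY", "URGENT"}


def label_30_day_readmissions(admissions, window_days=30):
    """Map each hadm_id to 1 if the patient's next unplanned admission
    starts within `window_days` of this discharge, else 0."""
    by_patient = defaultdict(list)
    for adm in admissions:
        by_patient[adm["subject_id"]].append(adm)

    labels = {}
    for stays in by_patient.values():
        stays.sort(key=lambda a: a["admittime"])
        for i, stay in enumerate(stays):
            # Next unplanned admission of the same patient, if any.
            nxt = next(
                (s for s in stays[i + 1:] if s["admission_type"] in UNPLANNED),
                None,
            )
            if nxt is None:
                labels[stay["hadm_id"]] = 0
                continue
            gap_days = (nxt["admittime"] - stay["dischtime"]).total_seconds() / 86400
            labels[stay["hadm_id"]] = int(0 <= gap_days <= window_days)
    return labels


# Invented toy records (not taken from MIMIC-III).
toy_admissions = [
    {"subject_id": 1, "hadm_id": 100, "admission_type": "EMERGENCY",
     "admittime": datetime(2101, 1, 1), "dischtime": datetime(2101, 1, 5)},
    {"subject_id": 1, "hadm_id": 101, "admission_type": "EMERGENCY",
     "admittime": datetime(2101, 1, 20), "dischtime": datetime(2101, 1, 25)},
    {"subject_id": 1, "hadm_id": 102, "admission_type": "ELECTIVE",
     "admittime": datetime(2101, 6, 1), "dischtime": datetime(2101, 6, 3)},
    {"subject_id": 2, "hadm_id": 200, "admission_type": "EMERGENCY",
     "admittime": datetime(2101, 3, 1), "dischtime": datetime(2101, 3, 4)},
]
labels = label_30_day_readmissions(toy_admissions)
```

In this toy example only admission 100 receives a positive label: it is followed by an emergency admission 15 days after discharge, whereas admission 101 is followed only by an elective (planned) stay.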
Figure 1 shows a histogram plot of the number of readmissions over 365 days; it can be seen that readmissions within 30 days are the most frequent, representing 38.76% of all readmissions within the year, for a total of 2549 cases.
2.2. Descriptive Analysis
In this section, we aim to understand and measure the impact of primary diagnoses on the readmission rate. Each patient may have multiple diagnoses at the time of admission. The primary diagnosis is the one with the highest priority level, representing the principal cause of admission. To this end, we extracted the most frequent diagnoses from the ADMISSIONS table from different perspectives. We started our descriptive analysis by extracting the ten most frequent diagnoses for all admissions, which are illustrated in Figure 2. We noticed that Pneumonia was the most frequent diagnosis, with a rate of 2.87%, followed by Sepsis and Coronary Artery Disease, with rates of 2.02% and 1.80%, respectively. The top ten diagnoses represent more than 16% of all 14,221 diagnoses reported in the database. As we are interested in predicting readmission, we extracted the most frequent primary diagnoses of readmitted patients reported before readmission (during their first admission to the hospital) and during readmission, which are illustrated in Figure 3 and Figure 4. The most frequent diagnoses are defined as those with the highest number of readmission cases compared to other diagnoses. During our analysis, we first observed that the number of diagnoses reported for readmitted patients does not exceed 1194, compared to 14,221 diagnoses reported in the entire database. We noticed from the figures that 13 of these 1194 diagnoses represented more than 25% of all diagnoses preceding readmission, while only eight diagnoses represented the top 25% of post-readmission diagnoses. At the top of the list are Pneumonia, Congestive Heart Failure, and Sepsis, the three most frequent diagnoses both prior to and after readmission. Other primary diagnoses that were highly present during readmission included Fever, Altered Mental Status, Abdominal Pain, Upper GI Bleed, and Hypotension. From this, it can be deduced that certain diseases probably have a higher impact on the readmission rate than others. For further analysis, we used the most frequent diagnoses extracted during the previous analysis to compute the ones with the highest readmission rates. The readmission rate of a diagnosis is the number of readmissions for that diagnosis divided by the number of all admissions for the same diagnosis.
Figure 5 shows that Shortness of Breath was the diagnosis with the highest readmission rate, at nearly 14%, followed by Congestive Heart Failure and Abdominal Pain. Other diagnoses with a high readmission rate included Pneumonia, Diabetic Ketoacidosis, and GI Bleed. In addition, we extracted the diagnoses with the highest death rate; Figure 6 and Figure 7 illustrate the bar plots of the diagnoses with the highest death rates before and after readmission, respectively.
Except for Pneumonia, Sepsis, and Congestive Heart Failure, the diagnoses with the highest death rates were somewhat different from the ones with the highest readmission rates. It can be observed from the figures that Intracranial Hemorrhage was the main cause of death after readmission and the third cause for all admission types, while it had a lower readmission rate compared to other diagnoses. In addition, we noticed that the diagnoses with the highest death rates, in particular those after readmission, were not the same as those with the highest readmission rates. Consequently, from this analysis we assumed that patients suffering from Pneumonia, Congestive Heart Failure, Diabetes, Chest Pain, or GI Bleed present a high risk of readmission.
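The per-diagnosis readmission rate used in this section (readmissions for a diagnosis divided by all admissions with that diagnosis) can be sketched as below. This is an illustrative computation only; the function name and the toy records are hypothetical and the numbers are not taken from MIMIC-III.

```python
from collections import Counter


def readmission_rate_by_diagnosis(records):
    """records: iterable of (diagnosis, was_readmitted) pairs, one per admission.

    Returns {diagnosis: readmissions / admissions} for every diagnosis seen.
    """
    admissions = Counter()
    readmissions = Counter()
    for diagnosis, was_readmitted in records:
        admissions[diagnosis] += 1
        if was_readmitted:
            readmissions[diagnosis] += 1
    return {d: readmissions[d] / n for d, n in admissions.items()}


# Invented toy data: four pneumonia admissions (one readmitted),
# two sepsis admissions (none readmitted).
toy_records = [
    ("PNEUMONIA", True),
    ("PNEUMONIA", False),
    ("PNEUMONIA", False),
    ("PNEUMONIA", False),
    ("SEPSIS", False),
    ("SEPSIS", False),
]
rates = readmission_rate_by_diagnosis(toy_records)
```

Note that this rate normalizes by how often a diagnosis occurs, so a frequent diagnosis does not automatically rank high; this is why the ranking by rate can differ from the ranking by raw readmission counts.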
2.3. Data Preprocessing
As mentioned above, the ADMISSIONS table contains 58,976 records belonging to 46,520 patients. Therefore, a single patient could have multiple admissions stored in different rows. The table contains relevant information about each admission, including patient ID, admission type, diagnosis, time of admission, and time of discharge; 71.33% (42,071) of admissions are described as emergencies, 2.3% (1336) as urgent, 13.3% (7863) as newborn, and 13.1% (7706) as elective. We began the preprocessing step by removing newborn admissions and admissions that ended in death from the ADMISSIONS table. As we are interested in unplanned readmissions, we kept only the first ‘elective’ admission and filtered out the others in order to retain only emergency readmissions. The remaining data contained 45,321 hospital admissions.
The NOTEEVENTS table contains notes from physicians, including discharge summaries, radiology reports, and ECG reports for each admission. We preprocessed the table by selecting only discharge summaries from all the notes in the table, resulting in 59,652 records. We removed duplicated and null discharge summaries from the selected notes and merged the remaining 43,880 notes with the preprocessed ADMISSIONS table. As noted above, a single patient could have several admissions and discharge notes; as we are interested in predicting the readmission occurrence from a single admission, we chose to process only the first admission for each patient, leading to a final dataset of 33,492 records.
In this work, we aim to predict whether a patient will experience an unplanned ICU readmission within 30 days of discharge. To this end, we labeled unplanned readmissions that occurred within 30 days of discharge as the positive class, while all others were labeled as negative. The result was an imbalanced dataset with 1900 records for the positive class and 31,592 records for the negative one. Therefore, we subsampled the negative class by randomly selecting 1900 samples, yielding a balanced dataset of 3800 samples with a 50% prevalence.
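The random undersampling step can be sketched as follows. This is a minimal standard-library illustration rather than the authors' implementation; the function name, the fixed seed, and the synthetic `(record_id, label)` pairs are assumptions, and only the class sizes (1900 positive, 31,592 negative) come from the text.

```python
import random


def undersample_majority(samples, label_of, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    positives = [s for s in samples if label_of(s) == 1]
    negatives = [s for s in samples if label_of(s) == 0]
    minority, majority = sorted((positives, negatives), key=len)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = rng.sample(majority, k=len(minority))  # sampling without replacement

    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Synthetic stand-in for the 33,492 labeled records: (record_id, label).
dataset = [(i, 1) for i in range(1900)] + [(i, 0) for i in range(1900, 33492)]
balanced = undersample_majority(dataset, label_of=lambda s: s[1])
```

Undersampling discards most of the negative records, so results can vary with the random draw; fixing the seed, as assumed here, at least makes a given split repeatable.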
Figure 8 illustrates the aforementioned preprocessing steps for both the ADMISSIONS and NOTEEVENTS tables.
Figure 9 and Figure 10 illustrate the distribution of the top ten diseases in the balanced dataset for the “readmitted” and “not readmitted” classes, respectively. The two plots support our prior assumption concerning the diseases with the highest impacts on readmission risk. We observed that Pneumonia, Congestive Heart Failure, and Sepsis were the three most frequent diseases among the readmitted class, while Coronary Artery Disease and Coronary Artery Bypass were highly dominant among the negative samples. Accordingly, it can be inferred that the risk of readmission for patients suffering from Coronary Artery Disease or having undergone a Coronary Artery Bypass is relatively low, whereas the risk is significant for those with Pneumonia and Congestive Heart Failure.
After the dataset was created, we preprocessed the clinical notes using NLP techniques before extracting the relevant features for use by the predictive models.
2.4. Feature Extraction Using NER
In this section, we aim to extract relevant features for readmission prediction from the discharge notes using NER techniques. Extracting relevant information from clinical notes is a challenging process, as they are essentially unstructured text containing a large amount of information describing patients’ medical records during their admission time, including their age, medications, medical history, diagnosis, laboratory tests, etc. Therefore, a prior preprocessing step is required before handling the named entity recognition task to ensure more accurate results.
Figure 11 illustrates the NLP techniques used for the preprocessing step. Among those techniques, we cite the following:
Lowercase Text: The first preprocessing technique was to convert all text to lowercase. NER techniques are case sensitive and treat case as an important feature for prediction; thus, case uniformity is fundamental in order to avoid bias and to avoid treating instances of the same word differently.
Stopword Removal: Removal of stopwords and punctuation is an important process, helping to avoid frequent but unnecessary words that can distract the model from more meaningful words. In addition, this process reduces the dimension of the text and simplifies its representation.
Tokenization: One of the most important preprocessing steps, tokenization breaks a sentence up into separate individual words called tokens, which is essential for analysis of the structure and meaning of the text.
Lemmatization: Lemmatization involves reducing tokenized words to their base or dictionary form. This step helps to consolidate similar words.
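The four preprocessing steps listed above can be sketched as a small pipeline. This is a toy illustration using only the standard library: the stopword list and the lemma lookup table are tiny hypothetical stand-ins for the resources a real NLP library would provide, and running tokenization before stopword removal is an assumption about the order. The stopword list deliberately omits negation words such as "no" and "not", on the assumption that a later negation-detection step needs them; the source does not specify its stopword list.

```python
import re

# Tiny illustrative stopword list (hypothetical; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "was", "is", "with", "to", "for", "in", "on"}

# Toy lookup table standing in for a real lemmatizer.
LEMMAS = {"admitted": "admit", "fevers": "fever", "complaints": "complaint"}

# A token is a run of letters/digits, optionally joined by hyphens or apostrophes.
# Punctuation never matches, so it is dropped during tokenization.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def preprocess(note):
    """Lowercase, tokenize, remove stopwords/punctuation, then lemmatize."""
    text = note.lower()                                   # 1. lowercase
    tokens = TOKEN_RE.findall(text)                       # 2. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # 3. stopword removal
    return [LEMMAS.get(t, t) for t in tokens]             # 4. lemmatize


tokens = preprocess("The patient was admitted with fevers and shortness of breath.")
```

One visible side effect in this example is that removing "of" splits the phrase "shortness of breath" into separate tokens, which is worth keeping in mind when multi-word symptoms are later matched against a feature list.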
Next, we applied NER techniques in order to extract medical entities related to the presence of diseases and symptoms from the preprocessed notes. Sometimes known as entity extraction, NER is a natural language processing technique that aims to extract relevant information, referred to as entities, from unstructured text by identifying and classifying key elements. Entities can include names of people, organizations, locations, dates, numerical values, and other specific types of information. In healthcare, NER aims to recognize medical terms such as diseases, medication, clinical measurements, etc. Existing NER techniques used for clinical notes include the following:
En_ner_bc5cdr_md: A spaCy NER model for processing clinical texts trained on the BC5CDR corpus [12] with an F1-score of 84.28%. The model is intended to recognize elements related to diseases and chemical entities in clinical texts.
En_ner_craft_md: A spaCy NER model trained on the CRAFT corpus [13] with an F1-score of 78.01%. This model recognizes entities related to biomedical ontologies.
En_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus [14] with an F1-score of 72.06%. This model is distinguished by its ability to recognize cell, DNA, and RNA entities.
En_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus [15] with an F1-score of 77.84%. This model recognizes entities related to cancer genetics.
BioMedical-NER [16]: An NER model built on a DistilBERT-based uncased model, which is a refined version of the Bidirectional Encoder Representations from Transformers (BERT) model [17,18]. This model is trained on the publicly available Maccrobat dataset [19] and achieves an F1-score of 91.89%. BioMedical-NER can recognize 107 biomedical entities, including disease disorders, symptoms, diagnostic procedures, lab values, biological structures, etc.
As we are interested in identifying biomedical entities related to diseases and symptoms, BioMedical-NER was the best choice in this case. During the first phase, we fed the preprocessed discharge texts into the pretrained BioMedical-NER model, which returned the tokens of the recognized named entities together with their labels (disease, symptom, clinical events, etc.). However, the clinical notes contained several negated entities, in which a negation denies the presence of a disease. We noticed that BioMedical-NER did not take negation into account when extracting medical entities, and labeled these instances as positive diseases. To tackle this issue, we used the NegspaCy library, which is a spaCy pipeline for negation identification based on the NegEx algorithm [20]. The model detected preceding and following negations such as “deny”, “absence of”, and “no sign of”, successfully labeling the related entities as negative ones; however, it did not treat other negative expressions as negations, including “no abnormal”, “negative for”, “neither”, “nor”, and “not have”, and so considered the related entities to be positive instances. We also noticed that the negation model considered diseases such as renal failure, heart failure, and respiratory failure as negative entities due to the presence of the word “failure”. To address this concern, we customized the NegspaCy model by adding the preceding and following negation terms that it should consider. Using the same technique, we removed patterns that contained the term “failure” so that these diseases were considered as positive entities.
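The idea behind the customized trigger lists can be illustrated with a deliberately simplified, NegEx-style check. The sketch below is a standard-library stand-in and not NegspaCy's implementation or API: it only looks for a trigger phrase within a fixed word window before or after an entity, and omits the termination terms and pseudo-negations of the full NegEx algorithm. The function name, the window size, and the exact trigger lists are assumptions; the triggers are drawn from the expressions quoted in this section, and "failure" is intentionally absent from them.

```python
import re

# Triggers that negate an entity appearing AFTER them (assumed list).
PRECEDING = ["no sign of", "absence of", "negative for", "no abnormal",
             "not have", "neither", "nor", "deny", "denies", "no"]
# Triggers that negate an entity appearing BEFORE them (assumed list).
FOLLOWING = ["absent"]

WORD_RE = re.compile(r"[a-z0-9-]+")


def _contains(triggers, context):
    """True if any trigger phrase occurs as whole words inside `context`."""
    return any(re.search(rf"\b{re.escape(t)}\b", context) for t in triggers)


def is_negated(sentence, entity, window=3):
    """Check whether `entity` is negated by a nearby trigger in `sentence`."""
    words = WORD_RE.findall(sentence.lower())
    target = WORD_RE.findall(entity.lower())
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] != target:
            continue
        before = " ".join(words[max(0, i - window):i])
        after = " ".join(words[i + n:i + n + window])
        if _contains(PRECEDING, before) or _contains(FOLLOWING, after):
            return True
    return False


examples = {
    ("Patient denies chest pain.", "chest pain"): None,
    ("Negative for COVID-19 infection.", "COVID-19 infection"): None,
    ("COVID-19 viral infection absent.", "viral infection"): None,
    ("History of congestive heart failure.", "congestive heart failure"): None,
}
results = {key: is_negated(*key) for key in examples}
```

With these lists, the first three examples are flagged as negated, while "congestive heart failure" is kept as a positive entity because "failure" is not a trigger.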
Figure 12a,b displays the results before and after customizing the NegspaCy model, where each color refers to an entity category (chemical, disease, negative entity). It can be observed that negative entities such as “neither substance abuse”, “nor alcohol”, “no abnormal rash or ulcer”, “negative for COVID-19 infection”, and “COVID-19 viral infection absent” were misidentified as positive entities prior to customization.
The extracted entities were then used as input features for the prediction model. However, as this technique generates a very large number of features, which can mislead the prediction model, we created a list of the most important features extracted from the text, represented by the 100 most frequent words related to the diagnoses with the highest readmission rates. This included the names of diseases, symptoms (bleed, pain, fever, vomit, shortness of breath, etc.), and other features that could impact the likelihood of a patient’s early readmission, such as the presence of alcohol or tobacco and the severity level of the illness (mild, severe). As discharge notes may contain acronyms instead of the full name of the disease, our list of features included acronyms for certain diseases, such as CHF for Congestive Heart Failure and chr kidney for chronic kidney.
We then excluded negative entities from the extracted diseases and symptoms, as they indicate the absence of the condition as opposed to its presence.
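One way to turn the extracted entities into model inputs is a fixed-length presence vector over the feature list, skipping negated entities. The sketch below is an assumption about the encoding, since the text does not state whether features are binary, counts, or weighted. The six-term `FEATURES` list is a hypothetical subset of the 100-term list, and `ALIASES` shows only the CHF acronym mentioned above.

```python
# Hypothetical subset of the 100-term feature list described in the text.
FEATURES = ["pneumonia", "congestive heart failure", "sepsis", "fever", "pain", "alcohol"]

# Acronyms mapped to their canonical feature name (only CHF is from the text).
ALIASES = {"chf": "congestive heart failure"}


def build_feature_vector(entities):
    """entities: list of (entity_text, is_negated) pairs for one discharge note.

    Returns a 0/1 vector aligned with FEATURES; negated entities are ignored.
    """
    present = set()
    for text, negated in entities:
        if negated:
            continue  # a negated entity signals absence, not presence
        term = ALIASES.get(text.lower(), text.lower())
        if term in FEATURES:
            present.add(term)
    return [int(feature in present) for feature in FEATURES]


# Invented entities for a single note.
note_entities = [
    ("CHF", False),
    ("fever", True),      # e.g. "no fever": excluded
    ("Pneumonia", False),
    ("rash", False),      # not in the feature list: ignored
]
vector = build_feature_vector(note_entities)
```

Restricting the vector to a fixed vocabulary keeps its length constant across notes, which is what bounds the otherwise open-ended set of extracted entities.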