**1. Introduction**

Chronic kidney disease (CKD) is a major public health concern characterized by an increasing prevalence and associated with a high level of morbidity and mortality [1,2]. Correct identification of CKD is crucial, e.g., for appropriate dosing of drugs and for early intervention, including the prevention of progression [3]. For clinical research, accurate identification of CKD or absence of kidney disease (NKD = no known kidney disease) is essential for clinical trials and epidemiological studies. In this context, a particular challenge is to store samples from hospitalized patients with known kidney status in clinical biorepositories, as part of Healthcare-Integrated Biobanking (HIB). At the time point of sample selection and storage, only a limited range of information regarding the respective patient phenotype is available.

Administrative data such as ICD-10 billing codes are often used in research trials to identify patients with CKD [4]. However, administrative databases are not maintained with the primary purpose of supporting research; mild impairment of kidney function, for example, may be underrepresented because it cannot be billed [5]. Indeed, many studies have demonstrated that ICD-10 billing codes considerably underestimate the prevalence of CKD [6]. Moreover, there is no ICD-10 billing code for NKD, as the purpose of ICD-10 billing codes is to indicate the presence of a disease.

Electronic health records (EHRs) are a promising source for the diagnosis or exclusion of CKD. EHRs contain structured data (laboratory values, epidemiological data) and unstructured data (narrative discharge summaries).

The laboratory assessment of kidney function is based on an equation to estimate the glomerular filtration rate (GFR) [3]. This equation, the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation, includes the blood creatinine level, age, sex and ethnicity [7]. According to the Kidney Disease: Improving Global Outcomes (KDIGO) definition, CKD stage III and higher can be diagnosed by an eGFR below 60 mL/min/1.73 m<sup>2</sup> for a time period of at least 90 days [3]. However, previous laboratory data on hospitalized patients are often not fully available, e.g., because they were recorded in other hospitals or in outpatient clinics.
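As an illustration, the eGFR calculation can be sketched in a few lines. The following is a hypothetical Python sketch (the study computed eGFR within its laboratory and R pipeline, not with this code); the coefficients are those of the published 2009 CKD-EPI creatinine equation, with creatinine given in mg/dL.

```python
def ckd_epi_egfr(creatinine_mg_dl, age, female, black=False):
    """Estimate GFR (mL/min/1.73 m^2) with the 2009 CKD-EPI creatinine equation."""
    kappa = 0.7 if female else 0.9       # sex-specific creatinine threshold
    alpha = -0.329 if female else -0.411  # sex-specific exponent below kappa
    ratio = creatinine_mg_dl / kappa
    egfr = (141
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr
```

A case with all eGFR values below 60 mL/min/1.73 m<sup>2</sup> for at least 90 days would meet the KDIGO laboratory criterion for CKD ≥ III.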

Unstructured data such as discharge summaries can fill the gap of missing medical information. Letters are available in digital form for every hospitalized patient and often contain complementary information, not only about the current hospital stay, but also about the clinical history of the patient, including chronic diseases. Information can be extracted from narrative discharge summaries, for example, by reusing SNOMED CT codes from EHRs [8], screening the letters for disease-specific keywords [9,10], or using ML-based natural language processing (NLP) technology for ICD-10 billing code [11] or SNOMED CT [12] coding, named entity recognition [13], or relation extraction [14].

Data analysis from EHRs can be performed in a rule-based format for example by strictly adhering to the KDIGO definition of CKD ≥ III. In recent years, various machine learning (ML) methods have been applied to improve the automated recognition of chronic kidney disease, using mainly laboratory values and demographic information [15–20]. However, to the best of our knowledge, no study specifically targeted advanced CKD ≥ III or NKD.

In this study, we hypothesize that combining structured (laboratory values, ICD-10 billing codes) and unstructured (discharge summaries) information from EHRs and applying ML for data analysis can reliably distinguish between patients with advanced CKD (stage ≥ III) and patients with no known kidney disease (NKD) in different scenarios of data availability.

#### **2. Materials and Methods**

#### *2.1. Study Population*

The dataset of this retrospective study was derived from the Jena part of the 3000 PA text corpus of the Smart Medical Information Technology for Healthcare (SMITH) consortium (part of the Medical Informatics Initiative funded by the German Federal Ministry of Education and Research) [21–23]. The dataset consisted of EHRs from 785 individuals of European descent who had an index hospital stay of at least five days on a ward for internal medicine or in an intensive care unit between 2010 and 2015. No individual died during the index hospital stay; at the time of retrospective data collection, however, all individuals were deceased. The EHRs included discharge summaries, laboratory values and ICD-10 billing codes. The study was approved by the local ethics committee (4639-12/15); data were collected retrospectively and anonymized, and individual-level informed consent of participants was waived by the ethics review board. The study was also approved by the data protection officer of Jena University Hospital.

#### *2.2. Classification of CKD and NKD by ICD-10 Billing Codes*

For the classification of CKD and NKD, ICD-10 billing codes of the index hospital stay, extracted from the hospital accounting system and from hospital discharge summaries, were used. For the extraction of kidney diseases from discharge summaries, the Health Discovery text mining tool v5.7.0 from Averbis (https://health-discovery.io/) was applied, using the discharge pipeline with default settings to extract basic medical information (detailed information can be found in the Averbis Health Discovery User Manual Version 5.7, 4 December 2018). Subsequently, a Python script was applied to extract the ICD-10 billing codes from these output files. ICD-10 billing codes for CKD classification were used according to the ICD-10 billing codes for moderate to severe kidney disease from the Charlson comorbidity index [24] (Supplementary Materials). For the definition of no kidney disease (NKD), none of these codes, and none of the additional ICD-10 billing codes for kidney disease published by the Centers for Disease Control and Prevention (CDC, http://www.cdc.gov/ckd) (Supplementary Materials), was allowed to be present.
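The resulting decision rule can be sketched as follows. This is a minimal Python sketch; the code sets below are illustrative placeholders only, as the actual Charlson and CDC code lists are given in the Supplementary Materials.

```python
# Illustrative placeholders -- NOT the actual lists used in the study
CHARLSON_RENAL = {"N18.3", "N18.4", "N18.5", "N19"}          # assumed examples
CDC_KIDNEY_PREFIXES = ("N00", "N03", "N17", "N18", "N19")    # assumed examples

def classify_by_icd(codes):
    """CKD if any Charlson renal code is billed; NKD only if no kidney-disease
    code (Charlson or CDC list) is present at all."""
    codes = set(codes)
    if codes & CHARLSON_RENAL:
        return "CKD"
    if not any(c.startswith(p) for c in codes for p in CDC_KIDNEY_PREFIXES):
        return "NKD"
    return "unclassified"
```

A case billed only with, e.g., hypertension and diabetes codes would fall into NKD, while a case with any kidney-disease code outside the Charlson set remains unclassified by this rule.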

#### *2.3. Laboratory and Demographic Data*

Laboratory values and demographics of the patients were extracted from the laboratory information system (LIS) of the University Hospital of Jena. The following values were considered in the analysis and classification of the study cohort:


Descriptive statistics were reported as the mean [SD] or median [first quartile–third quartile] for continuous variables and as absolute numbers (percentages) for categorical variables.

#### *2.4. Classification of CKD and NKD by Blood Creatinine and eGFR*

To define CKD and NKD by laboratory values from the index hospital stay, we created the following rules. If all eGFR values during the index stay were below 60 mL/min/1.73 m<sup>2</sup>, the case was assigned to CKD. If all eGFR values during the index stay were above 60 mL/min/1.73 m<sup>2</sup> and no AKI was present (definition see below), the case was assigned to NKD.
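The rule above can be sketched directly. This is a minimal Python sketch under the stated definitions; the function name and the "unclassified" fallback for mixed eGFR courses are our own illustrative choices.

```python
def classify_by_egfr(egfr_values, aki_present=False):
    """Apply the Section 2.4 rule to all eGFR values (mL/min/1.73 m^2)
    of the index hospital stay."""
    if egfr_values and all(v < 60 for v in egfr_values):
        return "CKD"
    if egfr_values and all(v > 60 for v in egfr_values) and not aki_present:
        return "NKD"
    return "unclassified"   # mixed course or AKI: no rule-based assignment
```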

#### *2.5. Classification of CKD and NKD by Manual Review*

CKD stage III or higher was defined according to the KDIGO guidelines. This included an eGFR, based on the formula CKD-EPI [7], which had to be less than 60 mL/min/1.73 m<sup>2</sup> for at least 3 months (90 days) or by an additional proof of kidney damage [3].

We defined NKD, adapted from James et al. [25], as the complete absence of an eGFR below 60 mL/min/1.73 m<sup>2</sup>, stable serum creatinine measurements (i.e., no fulfillment of acute kidney disease criteria), absence of proteinuria (based on the median when multiple measurements had been made) and the absence of AKI in the patient's laboratory history. AKI was present if serum creatinine had increased by more than 26.5 µmol/L within 48 h or more than 1.5-fold over 7 days [26]. In addition, adapted from the publication by Duff et al. [27], we included AKI recovery, defined as a decline in creatinine of more than 33% over 7 days.
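These creatinine-based criteria can be sketched over a time-stamped series. The following is an illustrative Python sketch, not the study's implementation: it assumes measurement times in hours, creatinine in µmol/L, and uses a naive pairwise scan; function names are our own.

```python
def detect_aki(series):
    """series: chronologically sorted (time_in_hours, creatinine_umol_l) pairs.
    AKI: rise > 26.5 umol/L within 48 h, or > 1.5-fold within 7 days [KDIGO]."""
    for i, (t0, c0) in enumerate(series):
        for t1, c1 in series[i + 1:]:
            if t1 - t0 <= 48 and c1 - c0 > 26.5:
                return True
            if t1 - t0 <= 7 * 24 and c1 > 1.5 * c0:
                return True
    return False

def detect_aki_recovery(series):
    """AKI recovery (adapted from Duff et al.): decline > 33% within 7 days."""
    for i, (t0, c0) in enumerate(series):
        for t1, c1 in series[i + 1:]:
            if t1 - t0 <= 7 * 24 and c1 < (1 - 0.33) * c0:
                return True
    return False
```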

All cases were reviewed by an advanced medical student and a physician to assess the underlying kidney status based on individual EHRs, including discharge summaries, ICD-10 billing codes and laboratory test results performed before, subsequent to, and during the index hospital stay. Of note, for clarification of difficult cases, the reviewers used information not available to the rule-based or statistical algorithms (e.g., laboratory values after index hospital stay). The review was used as a reference standard for comparison with automated classification.

#### *2.6. Dataset for the Machine Learning Methods*

The dataset used for logistic regression and the different ML models is composed of 11 to 19 different categorical and numerical variables. Three of them are derived variables intended to improve classification.


All of these variables were used in all ML models. Further categorical variables, listed below, were added in different combinations, as described in the Results.

CKD: eGFR at admission below 60 mL/min/1.73 m<sup>2</sup> (eGFR\_admission), eGFR at discharge below 60 mL/min/1.73 m<sup>2</sup> (eGFR\_discharge), and all eGFR measurements during index stay below 60 mL/min/1.73 m<sup>2</sup> (eGFR).

NKD: eGFR at admission above 60 mL/min/1.73 m<sup>2</sup> (eGFR\_admission), eGFR at discharge above 60 mL/min/1.73 m<sup>2</sup> (eGFR\_discharge), eGFR always above 60 mL/min/1.73 m<sup>2</sup> (eGFR\_history), all eGFR during index stay above 60 mL/min/1.73 m<sup>2</sup> (eGFR); classification by ICD-10 billing codes (ICD); classification by ICD-10 codes from discharge summaries.

#### *2.7. Classification of CKD and NKD Using Machine Learning Methods*

We applied three different ML methods: generalized linear model via penalized maximum likelihood (GLMnet) [28], random forests (RF) [29] and artificial neural networks (ANN) [30]. These are all well-established approaches that represent different types of ML methods.

GLMnet fits generalized linear models via penalized maximum likelihood; the different models share the concept of a penalty parameter but differ in their loss functions. The penalty parameter constrains the size of the model coefficients, such that a coefficient can only increase if a comparable decrease in the model's loss function is achieved. A loss function quantifies how poorly a model performs by comparing its predictions with the actual values; if both are very similar, the loss value is low. There are three common penalties (ridge regression, lasso, elastic net). We used the elastic-net penalty, which is controlled by the *alpha* parameter. It bridges the gap between ridge regression (alpha = 0), which retains all features while reducing the noise that less influential variables may create, and the lasso penalty (alpha = 1), which actually excludes features from the model.
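The analyses were implemented in R (glmnet via caret); as an illustration only, an analogous elastic-net logistic regression can be sketched with scikit-learn, where `l1_ratio` plays the role of glmnet's *alpha* and `C` is the inverse of the regularization strength *lambda*. The synthetic data and all names are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# l1_ratio = 0 ~ ridge, l1_ratio = 1 ~ lasso; here the elastic-net midpoint
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X, y)
acc = model.score(X, y)
```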

Like a simple rule-based decision tree, random forests are tree-based models and belong to a class of non-parametric algorithms that work by partitioning the feature space into a number of smaller regions; predictions are obtained by fitting a simpler model in each region. Random forests use the same principles as bagged trees, which grow many trees (*ntree*) on bootstrapped copies of the training data, and extend them with an additional random component through split-variable randomization: each time a split is to be performed, the search for the split variable is limited to a random subset (*mtry*) of the original features.
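As an illustrative sketch (again with scikit-learn rather than the R packages used in the study), `n_estimators` corresponds to *ntree* and `max_features` to *mtry*; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a single informative feature (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] > 0).astype(int)

# 100 bootstrapped trees (ntree), 2 candidate split variables per node (mtry)
rf = RandomForestClassifier(n_estimators=100, max_features=2, random_state=1)
rf.fit(X, y)
acc = rf.score(X, y)
```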

Artificial neural networks are designed to simulate the biological neural networks of animal brains. They process input examples of a given task and map them against the desired output by forming probability-weighted associations between the two, storing these in the net data structure itself. In its basic form, a neural network has three layers: an input layer consisting of all of the original input features, a hidden layer where the majority of the learning process takes place, and an output layer [31].

The dataset was randomly split into 80% training and 20% test data. The prevalence of CKD and NKD, respectively, was similar in the two datasets (Supplementary Materials).

To properly adapt the ML algorithms, we optimized the hyperparameters, which control the learning process of a model and cannot be directly estimated from the data. We used a grid search, i.e., an exhaustive search through a manually specified subset of the hyperparameter space of the learning algorithm. We specified these hyperparameters for every type of model, tried all combinations and selected the model with the best results (see Supplementary Materials for details). For the GLMnet, the regularization parameter *lambda*, which controls the overall strength of the penalty term and helps to prevent the model from overfitting to the training data, was calculated during a pre-training of the model. Subsequently, the best *alpha* parameter was determined; it ranges over [0,1] and was searched in steps of 0.1.
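The grid search over *alpha* in steps of 0.1 can be sketched as follows (an illustrative scikit-learn analogue of the R workflow, on synthetic data; for simplicity, the regularization strength is held fixed here rather than pre-trained as in the study).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Exhaustive search over alpha (l1_ratio) in [0,1] in steps of 0.1
grid = {"l1_ratio": [round(a * 0.1, 1) for a in range(11)]}
base = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)
search = GridSearchCV(base, grid, scoring="f1", cv=5)
search.fit(X, y)
best_alpha = search.best_params_["l1_ratio"]
```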

Random forest was tuned on the *mtry* parameter over the range [1,18], depending on the number of features of the model, in steps of 1. The *ntree* parameter was set to its default value of *ntree* = 100.

The artificial neural network is a fully connected feed-forward network with a single hidden layer. We used a fixed number of units between 11 and 19 in the input layer, depending on the number of features of the model, and a single unit with a sigmoid activation function for binary classification as the output layer. We optimized the number of units in the hidden layer as a hyperparameter (*size*) for every model over the range [1,10] in steps of 1 (see Supplementary Materials for details).
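A single-hidden-layer network of this shape can be sketched with scikit-learn's `MLPClassifier` (an illustrative analogue, not the study's R implementation); `hidden_layer_sizes` corresponds to the tuned *size* parameter, and the sigmoid output for binary classification is handled internally.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data with 11 input features, as in the smallest model (illustrative)
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 11))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One hidden layer with 5 units (the "size" hyperparameter would be tuned 1..10)
ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=3)
ann.fit(X, y)
acc = ann.score(X, y)
```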

In addition, all models were evaluated using three separate 10-fold cross-validations as the resampling scheme and were trained to optimize the F1 score. The final F1 score for each model was averaged over the resamples.
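This resampling scheme can be mimicked with repeated stratified k-fold cross-validation (an illustrative scikit-learn sketch of the caret setup, on synthetic data): three repeats of 10 folds yield 30 resamples, and the final F1 score is their mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data (illustrative only)
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Three separate 10-fold cross-validations, scored by F1
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=4)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)
mean_f1 = scores.mean()   # averaged over all 30 resamples
```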

Classifications were assessed using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, accuracy, area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUCPR). For AUROC and AUCPR, the 95% confidence interval was calculated (see Supplementary Materials for formulas and for the detailed classification performances of the different models).
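All of these threshold-based metrics follow directly from the confusion-matrix counts; a minimal sketch (the exact formulas used in the study are given in the Supplementary Materials):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)            # sensitivity (recall)
    spec = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)             # positive predictive value (precision)
    npv = tn / (tn + fn)             # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sens, "specificity": spec, "PPV": ppv,
            "NPV": npv, "F1": f1, "accuracy": acc}
```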

The area under the precision–recall curve is known to be more informative for class-imbalanced predictive tasks [32], as it is more sensitive to changes in the number of false-positive predictions. Comparisons between AUROCs were calculated according to DeLong et al. [33].
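The difference between the two measures can be demonstrated on a synthetic imbalanced task (an illustrative Python sketch; the study computed these metrics in R): with few positives, the AUCPR drops well below the AUROC because every false positive directly hurts precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Heavily imbalanced task: ~5% positives, noisy scores correlated with the label
rng = np.random.default_rng(5)
y = (rng.random(2000) < 0.05).astype(int)
scores = y * 1.0 + rng.normal(scale=1.0, size=2000)

auroc = roc_auc_score(y, scores)
aucpr = average_precision_score(y, scores)  # area under the precision-recall curve
```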

Analyses were implemented using RStudio (version 1.2.5001) and R (version 3.6.1) [34] with the following packages: *limma* [35] for plots; *rio* [36], *plyr* [37], *nlme* [38], the *tidyverse* bundle [39], *pROC* [40] and *ROCR* [41] for data management, data analysis and functional programming; and *caret* [42] for all ML models. Graphs were generated by GraphPad Prism (version 8.4.2).

## **3. Results**

The study cohort comprised 785 cases with an average age of 75 years; the majority of individuals were male (61%), and 95% and 49% of the patients had at least one or at least three severe disease(s) of the Charlson comorbidity index, respectively. Most patients were hospitalized due to cardiovascular disease (40%), gastrointestinal/liver diseases (15%) or oncological disorders (15%). The prevalence of CKD in this elderly, morbid cohort was comparable to other studies that probably included less morbid, non-hospitalized patients [43,44]. The prevalence of patients with no known kidney disease (NKD) was lower than that of CKD. NKD was associated with younger age, better kidney function and fewer co-morbidities compared to CKD ≥ III (Table 1).

**Table 1.** Epidemiological Characteristics from all Individuals and from Individuals with CKD ≥ III or NKD Identified by the Reference Standard, Respectively.


1 eGFR at admission could not be calculated for all individuals because the creatinine measurement at admission was subject to massive interference by bilirubin or hemoglobin.

In 128 (34%) patients, the cause of CKD ≥ III was further specified by ICD-10 billing codes. In the remaining cohort of 245 patients with CKD ≥ III, 90% suffered from type II diabetes mellitus and/or hypertension. More than 33% of the etiologies for CKD ≥ III had been documented only in discharge summaries (Supplementary Materials).

There was a high incidence of AKI (33.6%) and AKI recovery (27.4%) in the CKD ≥ III cohort (Supplementary Materials).

Most patients were assigned to CKD status by discharge summaries, followed by eGFR and ICD-10 billing codes (Figure 1a). After manual review, less than 1% of the CKD cases identified by discharge summaries, eGFR and ICD-10 billing codes did not suffer from CKD III–V (Figure 1b). Patients identified by discharge summaries seemed to have a better kidney function at admission, while patients assigned to CKD by eGFR or ICD-10 billing codes had a worse kidney function compared to the reference standard. Similarly, patients identified by eGFR and discharge summaries were less morbid than patients characterized as CKD by ICD-10 billing codes, as indicated by the Charlson morbidity categories (Table 2). Of note, 19 patients were identified by manual review only, while all three formal criteria failed.

**Figure 1.** Venn diagrams comparing identification of CKD ≥ III by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes within all patients (**a**) and within patients with CKD ≥ III according to the reference standard (**b**). (**a**) Numbers of patients from the study cohort with CKD recognized by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes. (**b**) Numbers of patients from the study cohort with CKD *correctly* recognized by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes. A total of 19 patients were recognized by none of the three formal criteria, but by manual review only.

**Table 2.** Epidemiological characteristics from patients with CKD identified by reference standard or recognized by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes.


1 eGFR could not be calculated for all individuals because the creatinine measurement at admission was subject to massive interference by bilirubin or hemoglobin.

Similar to CKD, the patient cohort was investigated for patients with no known kidney disease (NKD). Numbers of patients assigned to NKD by laboratory values, ICD-10 billing codes or discharge summaries are depicted in Figure 2a. Comparison with the reference standard (Figure 2b) confirmed 65% of the patients assigned to NKD by all three categories. Patients identified by the laboratory NKD criteria were younger, had a higher eGFR at admission and therefore corresponded better with the reference standard compared to patients assigned to NKD by discharge summaries or ICD-10 billing codes (Table 3).

**Figure 2.** Venn diagrams comparing identification of no known kidney disease (NKD) by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes within all patients (**a**) and within patients with NKD according to the reference standard (**b**). (**a**) Numbers of patients from the study cohort with NKD recognized via the eHealth sources laboratory results (eGFR values), discharge summaries or ICD-10 billing codes. (**b**) Numbers of patients from the study cohort with NKD *correctly* recognized via laboratory results (eGFR values), discharge summaries or ICD-10 billing codes.


**Table 3.** Epidemiological characteristics from patients with NKD identified by the reference standard or recognized by laboratory results (eGFR values), discharge summaries or ICD-10 billing codes.

\* eGFR could not be calculated for all individuals because the creatinine measurement at admission was subject to massive interference by bilirubin or hemoglobin. 1 *n* = 331; 2 *n* = 434.

Tables 4 and 5 depict the specificities and sensitivities of the different rules applied for the identification of CKD or NKD, respectively. While ICD-10 billing codes showed excellent specificity for the identification of CKD, the sensitivity was lower compared to discharge summaries and eGFR. Discharge summaries had a better sensitivity, but a reduced specificity, compared to ICD-10 billing codes (Table 4). Using eGFR < 60 mL/min/1.73 m<sup>2</sup> during the whole hospital stay resulted in good sensitivity and specificity. If only the first eGFR at admission or the last eGFR measurement at discharge was used, the overall performance (AUROC) changed only minimally compared to the original rule.


**Table 4.** Performance of different rules for identification of patients with CKD compared to the reference standard.

**Table 5.** Performance of different rules for identification of patients with NKD compared to the reference standard.


Regarding NKD, ICD-10 billing codes, discharge summaries and creatinine blood values at admission, at discharge and during the hospital stay all had excellent sensitivity. However, acceptable specificity (>80%) was achieved only by using eGFR > 60 mL/min/1.73 m<sup>2</sup> during the whole hospital stay; nevertheless, the PPV remained low at 0.52 (Table 5).

Combining laboratory measurements with discharge summaries and ICD-10 billing codes using logistic regression developed in a training dataset resulted in a better overall performance for the identification of CKD (AUROC: 0.96 [0.93–0.98]) or NKD (AUROC: 0.94 [0.91–0.97]) in the test dataset compared to estimated glomerular filtration rate (eGFR) values (CKD: AUROC 0.85 [0.79–0.90]; NKD: AUROC 0.91 [0.87–0.94]), discharge summaries (CKD: AUROC 0.87 [0.82–0.92]; NKD: AUROC 0.84 [0.79–0.89]) or ICD-10 billing codes (CKD: AUROC 0.85 [0.80–0.91]; NKD: AUROC 0.77 [0.72–0.83]) alone (Figure 3 and Supplementary Materials). Interestingly, the combination of all three categories did not (NKD), or only minimally (CKD ≥ III), increase the performance compared with the combination of laboratory results and discharge summaries (CKD: AUROC 0.94 [0.90–0.97]; NKD: AUROC 0.95 [0.92–0.97]).


**Figure 3.** Area under the receiver operating characteristic (AUROC) and under the precision-recall curve (AUCPR) for simple categorical classifiers based on combinations of EHR components for CKD ≥ III (**a**) and NKD (**b**) on the test dataset. eGFR values = "eGFR", discharge summaries = "DS" and ICD-10 billing codes = "ICD". For the complete list of all combinations, see Supplementary Materials. Logistic regression was calculated on the training dataset. Performance is calculated on the test dataset (n = 156). \* Indicates *p* < 0.05 for difference in AUROC compared to eGFR.

In NKD, AUROC values were quite high; however, AUCPR values, which incorporate sensitivity and PPV, were lower. It is therefore helpful to include several metrics, e.g., AUROC and AUCPR, when assessing test performance, particularly on imbalanced data [32].

To further improve the performance of correct assignment of patients to CKD ≥ III or NKD, we developed a logistic regression and three ML models using (1) all data from the index hospital stay, including laboratory values with incidence of AKI and AKI recovery (including staging), demographics, ICD-10 billing codes and ICD-10 codes from discharge summaries; (2) laboratory values and demographics from the index hospital stay; and (3) and (4) laboratory values from previous hospital stays in addition to (1) or (2), respectively (for a detailed listing of variables, see Supplementary Materials).

Figure 4 shows the AUROCs and AUCPRs of the respective best logistic regression (LR) and best ML models for the identification of CKD ≥ III and NKD, compared to the best simple categorical classifier for each scenario. In general, the AUROCs of the LR and of the different ML models differed only slightly from each other (see Supplementary Materials for more details).

**Figure 4.** AUROC (**a**,**c**) and AUCPR (**b**,**d**) of the simple categorical classifier and of the models calculated from logistic regression and the three ML methods for the identification of CKD (**a**,**b**) and NKD (**c**,**d**) in different scenarios of data availability. (**a**) AUROC and (**b**) AUCPR for identification of CKD ≥ III; (**c**) AUROC and (**d**) AUCPR for identification of NKD. SC = simple categorical classifier, LR = logistic regression, GLMnet = generalized linear model via penalized maximum likelihood, RF = random forest, NN = artificial neural network. N = 156 patients (test dataset). Scenarios: (1) all data from the index hospital stay, including laboratory values, demographics, ICD-10 billing codes and ICD-10 codes from discharge summaries; (2) laboratory values and demographics from the index hospital stay; (3) and (4) include, in addition to (1) or (2), laboratory values from previous hospital stays, respectively. \* Indicates *p* < 0.05 for the difference in AUROC between SC and all other models.

For the identification of CKD ≥ III, the AUROCs of the LR and machine learning models were not significantly better in scenario 1 (LR/ML: 0.97 [0.95–1.00]) and scenario 3 (LR/ML: 0.97 [0.94–1.00]) compared to the simple classifier in scenarios 1 and 3 (0.96 [0.94–0.99]), respectively. The AUROCs of the LR and ML models significantly (*p* < 0.05) improved in scenario 2 (LR/ML: 0.96 [0.92–0.99]) and scenario 4 (LR: 0.96 [0.93–0.99]/ML: 0.97 [0.94–0.99]) compared to the simple classifier in scenarios 2 and 4 (0.86 [0.81–0.91]), respectively. In scenarios 2 and 4, data were restricted to laboratory values alone.

For the identification of NKD, the AUROCs of the LR and ML models significantly (*p* < 0.05) improved in scenario 3 (LR: 0.98 [0.96–1.00]/ML: 1.00 [1.00–1.00]) and scenario 4 (LR: 0.98 [0.96–1.00]/ML: 0.99 [0.98–1.00]) compared to the simple classifier in scenario 3 (0.95 [0.92–0.97]) and scenario 4 (0.91 [0.87–0.94]), respectively (Figure 4c). In scenarios 3 and 4, data from previous hospital stays were included. The AUCPRs of the logistic regression and ML models for the identification of NKD also improved in scenarios 3 and 4 compared to the simple classifier (Figure 4d; see Supplementary Materials for more details). The AUROCs of the LR and ML models slightly improved in scenario 1 (LR/ML: 0.96 [0.93–0.99]) and scenario 2 (LR/ML: 0.93 [0.89–0.97]) compared to the simple classifier in scenario 1 (0.95 [0.92–0.97]) and scenario 2 (0.91 [0.87–0.94]), respectively (Figure 4c). However, the AUCPR of the LR and ML models decreased in scenarios 1 and 2 compared to the simple classifier.

In conclusion, the best LR and ML models slightly improved the AUROCs for the identification of CKD ≥ III and NKD compared to the best simple categorical classifier in each scenario. However, we observed a significant improvement of the models over the simple classifier for CKD ≥ III only in scenarios 2 and 4, and for NKD only in scenarios 3 and 4.
