1. Introduction
Given the complexity and high dimensionality of Electronic Health Records (EHR), imputation is an inevitable step in any study that uses such data for downstream analysis or for building advanced machine learning models for clinical decision support systems. The EHR, like any other administrative dataset, is not designed for research purposes, even though the breadth and depth of the information can be used to improve care at many levels [1]. Furthermore, missing values in healthcare systems are typically not missing at random. Three main categories explain the missingness in clinical settings [2,3]—incompleteness, inconsistency, and inaccuracy—and these can capture a variety of situations, including the following: the patient was cared for outside of the healthcare system where the data are collected, the patient did not seek treatment, the healthcare provider did not enter the information, the patient expired, or the value was not needed.
Given the complexity of clinical data and the advanced analytics that can be applied to such data, it is important to account for any sources of bias in the data that will be used to drive predictive models. Imputation is an example of data preprocessing that could lead to biased results. Likewise, excluding variables or patients with a high level of missingness can also introduce bias and reduce the scope of a study. In a recent review article, 85 out of 316 studies reported some form of missing data, and only 12 studies actively handled the missingness; as the authors showed, the majority of researchers exclude incomplete cases, which biases outcomes [4]. Furthermore, imputation could boost the statistical power for data-poor patients, who tend to be minorities and low-income patients with more restricted access to primary and specialty care and rehabilitation programs.
Imputation has long been an active topic in many fields, but only recently has research focused on medical applications. Twelve different imputation techniques applied to laboratory measures from the EHR were compared in [5]. In general, the authors found that Multivariate Imputation by Chained Equations (MICE) and softImpute consistently imputed missing values with low error [5]; however, in that study, the analysis was restricted to the 28 most commonly available variables. In another study, the authors assessed the different causes of missing data in the EHR and identified these causes as a source of unintentional bias [6]. A comparative analysis of three imputation methods for DNA microarrays (a Singular Value Decomposition (SVD)-based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average) showed that, in general, the KNN and SVD methods surpass the commonly accepted solutions of filling missing values with zeros or row averages [7]. However, comparing imputation for clinical data with that for a DNA microarray can be misleading: missingness in a DNA microarray is likely at random due to technical challenges, unlike missingness in the EHR. In another study, fuzzy clustering was integrated with a neural network to enhance the imputation process [8].
Research has also been done to evaluate imputation methods for non-normal data [9]. Using simulated data from a range of non-normal distributions and a missingness level of 50% (missing completely at random or missing at random), it was found that the linearity between variables could be used to determine the need for transformation of non-normal variables. In the case of a linear relationship, transformation can introduce bias, while a nonlinear relationship between variables may require adequate transformation to accurately capture the nonlinearity. Furthermore, many of the techniques are optimized for smaller levels of missingness (the most commonly available measurements), yet most clinical datasets (including EHRs) have a significant level of missingness for many of their important variables that are routinely used for diagnosis. To address this problem, machine learning methods have also been proposed [10]. There are more examples of imputation applied to simulated than to real-life EHR data, and few studies have focused on imputing laboratory values. For instance, Ford E. and colleagues [11] proposed using logistic regression models with and without Bayesian priors representing the rates of misclassification in the data; however, in that study, the authors focused on misclassified diagnoses rather than laboratory values. The challenges of imputation for EHRs are unique, and if left unaddressed, the utility of the data becomes limited [12]. For smaller targeted studies, it may be possible to integrate additional modalities or perform an analytical evaluation through a chart review to determine a likely cause of missingness, but for larger studies, this becomes infeasible. For example, the missingness level for very important variables, such as hemoglobin A1C or HbA1c (LOINC ID: 17856-6), a common biomarker for diabetes, can easily reach 50% or more in many realistic large datasets. Lastly, in a more recent study, the integration of genetic and clinical information was shown to improve the imputation of data missing from the EHR [13]; however, genetic data integrated with the EHR are still scarce.
Finally, given the complexity and the scale of the problem, in many studies, MICE [14] remains the method of choice. The MICE fully conditional specification (FCS) algorithm imputes multivariate missing data on a variable-by-variable basis [15]. An imputation model is specified for each incomplete variable, and the imputation of missingness in one variable is conducted iteratively based on the other variables. Variations of MICE have also been proposed [16]; however, imputation of EHR data poses its own challenges, especially when targeting less commonly measured variables. Nonetheless, given the high level of redundancy and the presence of highly correlated entities in the EHR, imputation by MICE still performs relatively well for large clinical datasets. A comprehensive overview of handling missing data in the EHR is presented in [12].
In this study, we created three unique cohorts from EHR data, with varying sizes and heterogeneity, and developed a hybrid imputation strategy that we applied to these cohorts. We selected the inflammatory bowel disease (IBD) cohort because of its heterogeneity and the fact that a clear understanding of IBD’s risk factors is still lacking. We selected the Clostridioides difficile cohort because understanding recurrent infection is important and the existing EHR data can help us identify clinical biomarkers. Finally, we created the osteoarthritis (OA) cohort to test the limits of this model, as an OA diagnosis is not based on any laboratory measurement known today. Our imputation model was based on using comorbidity information to cluster patients prior to the imputation of their laboratory values.
2. Methods
In the following section, we will (1) describe our cohort definition and data extraction for the laboratory values and comorbidities from our EHR data warehouse and (2) outline our imputation design.
2.1. Study Cohort
The cohort in this study consisted of 67,445 patients from the Geisinger Health System with three different phenotypes. This study was exempted by the Geisinger Institutional Review Board because it used deidentified information.
Clostridioides difficile (Cdiff) infection case and control cohort: Clostridioides difficile (C. difficile) is an anaerobic, Gram-positive, spore-forming bacterium and a major cause of intestinal infection and antibiotic-associated diarrhea. Toxins are the major virulence factors of C. difficile [17]. Toxins A (TcdA) and B (TcdB) are large, secreted glucosyltransferase proteins that target intestinal epithelial cells and disrupt the epithelial barrier, leading to secretory diarrhea. The diagnosis of C. difficile at Geisinger is captured and documented by Polymerase Chain Reaction (PCR) confirmation, which is highly sensitive; PCR is also considered the gold standard by the eMERGE algorithm for EHR mining [18]. We identified the C. difficile cohort, which includes patients tested for C. difficile, from the EHR of the Geisinger Health System. The cohort includes both cases and controls. Cases are defined by a laboratory-confirmed positive PCR test result; controls are patients tested for C. difficile with negative PCR test results. The case/control ratio is 1:8. We are interested in the combined case and control cohort, since patients tested for C. difficile, irrespective of their test results, share some of the signs and symptoms (such as diarrhea); furthermore, using a combined case and control cohort increases our sample size, an important factor for imputation, while providing a framework for building predictive models that can benefit from the integration of a large number of laboratory-based features.
Inflammatory Bowel Disease (IBD) cohort: We identified the IBD cohort from the EHR of the Geisinger Health System. Inclusion criteria were based on the diagnoses recorded for patients during visits and admissions and on currently active problem-list entries, using the ICD9 and ICD10 codes for Crohn’s disease (CD) and ulcerative colitis (UC) (see Table A1 in Appendix A). To achieve higher fidelity of the diagnosis in the EHR, qualifying criteria required either two or more outpatient encounters, one or more inpatient admissions, or an entry in the problem list with an active flag.
Osteoarthritis (OA) cohort: We identified an OA cohort from the EHR of the Geisinger Health System; the cohort includes patients with a knee or hip OA diagnosis, either primary or secondary (see Table A1 in Appendix A for the OA diagnosis ICD codes).
2.2. Data Extraction
We extracted clinical laboratory measurements for these cohorts using the Logical Observation Identifiers Names and Codes (LOINC) system. For comorbidities, we extracted all the diagnosis codes for all the patients based on the ICD9 and ICD10 codes. Comorbidity data included details from outpatient visits, inpatient admissions, and problem lists. The latter was used to capture conditions identified outside of the Geisinger Health System but discussed and assessed during the patient’s care management. We excluded laboratory codes with more than 75% missingness. To clarify, in this study, missingness is defined as the laboratory measure “not resulted”. Therefore, if an order was placed but the results were not available (or not valid), we considered that a missing value. We analyzed the data in three batches, including only laboratory measures that have, at most, (a) 25% missingness, (b) 50% missingness, and (c) 75% missingness.
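This batching step can be sketched as follows; this is a minimal illustration only, assuming a patient-by-laboratory table with NaN for values that were never resulted (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

def filter_by_missingness(labs: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep only laboratory columns whose fraction of patients without
    any resulted value is at most max_missing."""
    frac_missing = labs.isna().mean(axis=0)
    return labs.loc[:, frac_missing <= max_missing]

# Toy example: 4 patients, 3 hypothetical laboratory columns.
labs = pd.DataFrame({
    "lab_a": [1.0, 2.0, 3.0, 4.0],           # 0% missing
    "lab_b": [1.0, np.nan, np.nan, 2.0],     # 50% missing
    "lab_c": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

batch_25 = filter_by_missingness(labs, 0.25)  # keeps lab_a
batch_50 = filter_by_missingness(labs, 0.50)  # keeps lab_a, lab_b
batch_75 = filter_by_missingness(labs, 0.75)  # keeps all three
```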
2.3. Data Processing
Quality Control (QC) and outlier detection strategy: Geisinger has implemented a rigorous process to continuously extract, transform, organize, and store EHR data and to remove erroneous entries for research purposes. For example, we currently have access to quality-controlled laboratory values with reconciled units. For this study, we used the median laboratory value calculated for each patient. It is important to mention that, especially for less common laboratory values, the frequency of measurements and the window between the first and last measurements per patient are relatively narrow. We analyzed the frequency patterns and report the results in our descriptive section.
As part of the added data processing and outlier detection and removal, the distribution of each laboratory value was analyzed and fit to a tri-modal Gaussian distribution model (see Equation (1)). The rationale for using this strategy, as opposed to assuming normality, is driven by the nature of the laboratory measures. Laboratory orders, especially those with a higher level of missingness, are typically missing not at random (MNAR), and there are mainly three groups of patients for whom a measurement is recorded (those with higher or lower than average measures, as well as patients with average measurements). However, the average measurement is not necessarily associated with the largest group in all cases, especially for laboratory measures that are specific to a phenotype, such as iron-binding capacity, which is ordered only when the physician needs that information to make a diagnosis or management decision. Two cut-off values are created to filter outliers based on the three-distribution model. This automated process generates data-driven cut-off values for large-scale data mining, where limited manual curation is applied in data preparation and preprocessing.
f(x) = w1 N(x; μ1, σ1) + w2 N(x; μ2, σ2) + w3 N(x; μ3, σ3), (1)
where N(x; μ, σ) denotes a Gaussian density with mean μ and standard deviation σ, and w1, w2, and w3 are the mixture weights. The lowest boundary to filter out the outliers is set to c_low = max(min(μ1 − 3σ1, μ2 − 3σ2, μ3 − 3σ3), 0), and the highest boundary is set to c_high = max(μ1 + 3σ1, μ2 + 3σ2, μ3 + 3σ3).
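The cut-off rule can be sketched as below. The paper does not name a mixture-fitting implementation; scikit-learn's GaussianMixture is assumed here, and the input values are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def trimodal_cutoffs(values):
    """Fit a 3-component Gaussian mixture to one lab's values and derive
    data-driven outlier cut-offs following the c_low/c_high rules above."""
    gm = GaussianMixture(n_components=3, random_state=0)
    gm.fit(values.reshape(-1, 1))
    mu = gm.means_.ravel()
    sigma = np.sqrt(gm.covariances_.ravel())
    c_low = max(float(np.min(mu - 3 * sigma)), 0.0)  # floor at 0: labs are non-negative
    c_high = float(np.max(mu + 3 * sigma))
    return c_low, c_high

rng = np.random.default_rng(0)
# Synthetic tri-modal lab: low, average, and high sub-populations.
vals = np.concatenate([rng.normal(5, 1, 300),
                       rng.normal(10, 1, 400),
                       rng.normal(20, 2, 300)])
c_low, c_high = trimodal_cutoffs(vals)
kept = vals[(vals >= c_low) & (vals <= c_high)]  # outliers filtered out
```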
Data processing of the comorbidity dataset was performed to remove noise by excluding the ICD9/10 codes that were recorded only once in the patient’s chart (rule of 2). The resulting matrix was then converted to binary to represent the presence or absence of an ICD9/10 code for each patient. This is important, since the count does not necessarily correlate with the severity or duration of the condition. Therefore, a binary comorbidity matrix for each cohort was created for imputation modeling.
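The "rule of 2" and the binarization can be illustrated with a toy long-format diagnosis table (patient IDs and ICD codes here are placeholders):

```python
import pandas as pd

# One row per (patient, ICD code) occurrence in the chart.
dx = pd.DataFrame({
    "patient": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "icd":     ["K50", "K50", "E11", "K50", "K50", "E11"],
})

# Count occurrences per patient/code, apply the rule of 2 (drop codes
# recorded only once in a patient's chart), then binarize to
# presence/absence so counts do not stand in for severity or duration.
counts = dx.groupby(["patient", "icd"]).size().unstack(fill_value=0)
binary = (counts >= 2).astype(int)
```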
2.4. Data Abstraction and Imputation Strategy
The comorbidity dataset was used to compute an encoding matrix for each dataset (Cdiff, OA, and IBD) using singular value decomposition (Equation (2)).
A_PT_ICD_cohort = U S V^T, (2)
where A_PT_ICD_cohort is the matrix encoding the presence or absence of all the ICD9/10 codes for all the patients in each dataset, U is an m × m square matrix, S is an m × n diagonal matrix with m rows and n columns, and V is an n × n square matrix. The columns of V are eigenvectors of A^T A, and the columns of U are eigenvectors of A A^T. The diagonal elements of S are the square roots of the eigenvalues of A^T A or A A^T.
The encoding matrix was then used to create different levels of data abstraction by retaining only the top 100 or 1000 components of the encoding (Equation (3)) for each dataset. We used these predefined cut-off values based on our preliminary assessment [19], as well as empirical studies [20,21]. For comparison, the full rank was also used in the modeling. Note that the approximation matrix is referred to as the data abstraction, and the finalized output is referred to as the latent comorbidities.
A_PT_ICD_cohort_g = U_g S_g V_g^T, (3)
where g is the level of abstraction (100 or 1000), corresponding to the rank of the reduced matrices U_g, S_g, and V_g, and A_PT_ICD_cohort_g is an approximation of the initial matrix (A_PT_ICD_cohort).
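Equations (2) and (3) can be sketched with NumPy's SVD routine. The binary matrix below is a random toy stand-in for the patient-by-ICD matrix, and the economy-size SVD is used for compactness:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in for the binary patient-by-ICD matrix (m patients, n codes).
A = (rng.random((50, 20)) < 0.2).astype(float)

# Economy-size SVD: A = U @ diag(S) @ Vt  (Equation (2)).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-g abstraction: keep only the top g components (Equation (3));
# g would be 100 or 1000 on the real data.
g = 5
A_g = U[:, :g] @ np.diag(S[:g]) @ Vt[:g, :]

# The g-dimensional encoding per patient (the latent comorbidities).
latent = U[:, :g] * S[:g]
```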
As a final step in the data abstraction process, baseline noise reduction is performed by removing an ICD code if the sum of all the values for that code in the latent comorbidity matrix is less than 1. This strategy reduces noise due to irrelevant (very rare) comorbidities in the model. The imputation method presented in this work is a hybrid method that concurrently applies dimensionality reduction and a clustering strategy to efficiently capture relationships among the features (or variables) and reduce noise (through dimensionality reduction), while providing an adaptive mechanism to perform imputation for any complex phenotype or trait. Using the latent comorbidity data, patients are clustered using the k-means clustering technique with K set to 2, 4, 8, or 16 clusters, depending on the heterogeneity of the cohort.
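The clustering step can be sketched as follows, assuming scikit-learn's KMeans and a synthetic stand-in for the latent comorbidity matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 10))  # toy latent-comorbidity encodings

# Cluster patients at several granularities (k = 2, 4, 8, 16);
# imputation is then run within each cluster separately.
cluster_labels = {}
for k in (2, 4, 8, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(latent)
    cluster_labels[k] = km.labels_
```

In practice, k is one of the parameters tuned per cohort, as described in the Results.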
Imputation was applied using the MICE fully conditional specification (FCS) algorithm [5], which imputes multivariate missing data on a variable-by-variable basis. An imputation model is specified for each incomplete variable, and the imputation of missingness in one variable is conducted in an iterative fashion using the Markov Chain Monte Carlo (MCMC) method. More specifically, we selected the predictive mean matching (pmm) algorithm, which is the default method of mice() for imputing continuous incomplete variables. For each missing value, pmm finds a set of observed values (by default, 5) whose predicted means are closest to that of the missing value and imputes the missing value by a random draw from that set. In other words, pmm is restricted to the observed values. We also used Random Forest (rf), which imputes missingness by recursively subdividing the data based on the values of the predictor variables, using bootstrap aggregation of multiple regression trees to reduce the risk of overfitting and to improve predictions by combining the predictions of many trees [22]. The latter does not rely on distributional assumptions and can better accommodate nonlinear relations and interactions.
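The study itself uses the R mice package. Purely as a loose Python analogue, scikit-learn's IterativeImputer also performs chained, variable-by-variable imputation, and its estimator can be swapped for a random forest; note that it does not implement predictive mean matching, so this sketch illustrates the chained-equations idea rather than MICE-pmm itself:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)  # correlated "labs"
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan  # 30% missing at random

# Chained-equations imputation, one variable at a time (FCS-style).
imp_lin = IterativeImputer(random_state=0).fit_transform(X_missing)

# Random-forest flavour, loosely analogous to MICE-rf.
imp_rf = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    random_state=0,
).fit_transform(X_missing)
```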
Imputations using MICE-pmm and MICE-rf were applied to each subgroup independently to predict the missing values. The results were compared with MICE-pmm and MICE-rf applied to estimate the missing laboratory values in the three cohorts without any consideration of the comorbidity information. The reader is referred to the work [15] by S. van Buuren and K. Groothuis-Oudshoorn for more details about imputation by MICE.
2.5. Evaluation Strategy
Model evaluation was performed by randomly withholding observed values and predicting them using the hybrid strategy. A total of 100 values from each laboratory measure were randomly withheld for testing. For example, for the Cdiff cohort, where we identified 48 laboratory codes with less than 75% missingness, we held out 100 values for each of the 48 laboratory codes and estimated them 10 times. The root mean square error (RMSE) was calculated and averaged over the 10 runs. The comparison was based on calculating the difference between running imputation using the hybrid model and the standard MICE algorithm, without any consideration of the comorbidity information, using both the pmm and rf models implemented in the MICE package. The presented results are, therefore, RMSE differences, where negative values represent a reduction in the root mean square error.
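The evaluation metric can be sketched as follows; the held-out true values and the two sets of predictions below are synthetic stand-ins, with the "hybrid" predictions given less noise so the RMSE difference comes out negative:

```python
import numpy as np

def masked_rmse(true_vals, imputed_vals):
    """RMSE over the deliberately withheld entries."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    return float(np.sqrt(np.mean((true_vals - imputed_vals) ** 2)))

rng = np.random.default_rng(4)
truth = rng.normal(10, 2, size=100)              # 100 withheld values for one lab
pred_hybrid = truth + rng.normal(0, 0.5, 100)    # toy hybrid-model estimates
pred_standard = truth + rng.normal(0, 1.5, 100)  # toy standard-MICE estimates

# Negative difference = the hybrid model reduced the error.
diff = masked_rmse(truth, pred_hybrid) - masked_rmse(truth, pred_standard)
```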
3. Results
In the following section, we will (1) describe our cohorts, pattern of missingness, and frequency of available data for different levels of missingness and (2) present imputation results for the three datasets.
3.1. Description of Laboratory Values for the Three Cohorts
We identified a total of 67,445 patients in three different cohorts (Cdiff, OA, and IBD) from Geisinger’s electronic data warehouse. Further, we identified 495 LOINC codes from this cohort. We selected the LOINC codes for which we had, at most, 75% missingness (i.e., the number of patients without any measurement divided by the total number of patients is less than or equal to 75%) in each of the three cohorts.
We identified a total of 46,215 patients tested for C. difficile. We extracted comorbidity and laboratory data from the EHR for this cohort. A total of 48 laboratory codes and 8160 ICD codes for comorbidities were used. Specifically, we identified a total of 48 of the laboratory codes from the 495 codes that had at least 25% of the 46,215 patients with at least one measurement in their records. It is important to highlight that many of the LOINC codes can be very specific (<1% of the patients have such measurements) or were used for a narrow period and may not be actively in use. The dimensionality reduction was set to 100 and 1000. The Cdiff cohort had high heterogeneity, since the dataset contained both cases (tested positive for C. difficile) and controls (tested negative for C. difficile). The number of clusters tested was 4, 8, and 16.
Similarly, we identified 11,230 IBD patients with both comorbidity and laboratory data from the EHR. A total of 48 laboratory codes and 7916 ICD codes for comorbidities were identified. The dimensionality reduction was set to 100 and 1000. The number of clusters tested was 2, 4, and 8, given the smaller sample size of this cohort.
Finally, we identified 187,040 patients with a primary or secondary diagnosis of knee or hip OA, from which we randomly selected 10,000 patients for imputation modeling. A total of 44 laboratory codes and 2042 ICD codes for comorbidities were used. The OA cohort had high heterogeneity, since the initial pool was large (almost 200,000 cases) and contained both hip and knee OA. We selected a random subset of 10,000 patients because it is impractical to use an extremely large cohort for optimizing an imputation model, as the optimization alone is a computationally intensive process. The number of clusters tested was 4, 8, and 16.
The distribution of missingness in the laboratory values differed across the cohorts. Table A2 summarizes the percentage missing for the laboratory measures. Our results showed that the pattern and frequency of the laboratory measurements depended on the missingness level. Briefly, for laboratory values with high missingness, a larger percentage of patients (30–60%) had only one resulted value; therefore, the median that we calculated in our experiment was practically the only reported value for the patient (see Figure 1A). We further observed that laboratory values with a high level of missingness (when a patient had more than one value) tended to have an observation window of approximately two to six years (see Figure 1B) and a frequency below five measurements (see Figure 1C). In contrast, for more common laboratory values, we observed a window of approximately 5 to 12 years and a frequency above 10 (see Figure 1C).
The outlier detection using a multimodal Gaussian distribution function was applied to each laboratory measure for each cohort separately. Figure 2 highlights that, for laboratories with higher missingness levels, the distribution differs across the cohorts, and therefore, the accepted range is adjusted accordingly. For more common laboratory measures (such as the example presented in Figure 3), the distributions are similar, and the accepted range for these laboratory measures is within the calculated range. To further help the reader understand the pattern of laboratory data, we created distribution plots for all the laboratory values used in this study for the three cohorts (see Figure A1 and Table A2).
3.2. Imputation Applied to Laboratory Values
C. difficile (Cdiff) infection case and control cohort: Using adaptive imputation for the Cdiff cohort showed improved performance, especially for the high-missingness group (laboratory measures that have, at most, 75% missingness). The average RMSE difference (comparing the proposed imputation with the standard imputation model using MICE, without any consideration of comorbidity information) was −31.47 for a level of abstraction g = 1000 and a cluster number k = 4. The average RMSE difference was −8.75 for g = 100 and k = 4, demonstrating that, at a high missingness level, additional information from the patient comorbidities can play an important role in improving the accuracy of the imputation prediction. A total of 27 combinations (nine combinations for each missingness threshold) were tested, and for each missingness level (Table 1), the tradeoff between the sample size and the clustering approach resulted in one or two instances where clustering was associated with improved performance. Since the dataset is of fixed size, a higher number of clusters will reduce the power of the imputation method, especially when the number of clusters is increased to eight or beyond. However, as each dataset has its unique characteristics, the best set of parameters must be empirically determined prior to performing the imputation using the adaptive strategy. Using MICE and the random forest model (rf), the RMSE differences were negative for the majority of the combinations. For the <75% missingness group, seven of the nine parameter combinations favored the novel method (see Table 1 and Figure 4).
Inflammatory Bowel Disease (IBD) cohort: Using adaptive imputation for the IBD cohort showed improved performance, especially for the high-missingness group (laboratory measures that have, at most, 75% missingness). The average RMSE difference compared to the standard model using MICE alone was −8.35 with no abstraction and a cluster number k = 2. Similarly, the average RMSE difference compared to the standard model using MICE alone was −8.24 for k = 8. The results highlight that, at a high missingness level, additional information from the patient comorbidity data can play an important role in improving the accuracy of the imputation prediction, even though the sample size is significantly smaller (in this case, 11 K versus 46 K for the Cdiff cohort). A total of 27 combinations (nine combinations for each missingness threshold) were tested. The tradeoff between the sample size and the clustering approach resulted in parameter combinations that were associated with improved performance. Additional analyses were performed with the random forest model in MICE, and an RMSE difference of −2.70 was recorded for a missingness level of 25% (see Table 1 and Figure 4). Our results corroborate the value of optimizing the parameters for each dataset using various modeling frameworks. Thus, the best set of parameters should be empirically determined for each dataset.
Osteoarthritis (OA) cohort: Using adaptive imputation for the OA cohort showed that the best performance improvement was for missingness at 50% (Table 1 and Figure 4). The tradeoff between the sample size reduction, when clustering is utilized, and the use of additional information from comorbidities showed benefits even for this smaller and more heterogeneous dataset. The rf model in MICE was best suited for this dataset.
4. Discussion
This study is a first step towards improving our many layers of data analytics and quality control pipelines to help enhance the quality of data extracted from the EHR that is ingested by machine learning applications for precision medicine. The use of heterogeneous and large-scale clinical datasets, such as EHRs, provides an avenue for exploring strategies to improve care at the individual level, including developing personalized models of response to therapy and predicting disease onset, among others [1]. However, the data extracted from EHRs are noisy and have many missing values. In the majority of studies, variables suffering from missingness are excluded from models and analyses [4], even variables with high discriminative ability according to clinical knowledge. As we showed in this work, it is not advisable to rely solely on the redundancy of EHR laboratory data to conduct imputation for realistic applications, because the majority of that redundancy is associated with variables that are missing at high levels. For instance, besides the commonly ordered laboratory tests (20–30 laboratory measures), the remaining values are missing at very high rates, even in a healthcare system with a stable population (Geisinger is an integrated healthcare system with a drop-out rate <5%). However, laboratory measures are highly correlated with comorbidities and diagnoses, as the latter are based on laboratory values in realistic settings. Therefore, our modeling strategy focuses on using this redundancy to improve the imputation of laboratory values.
Furthermore, many diagnoses are based on laboratory values; however, due to the challenges associated with mining laboratory measures, many models ignore this important parameter or only include the ones that are not missing at high levels to reduce the noise and bias due to poor imputation predictions. We created three diverse datasets to test this intuitive strategy of imputation designed specifically for EHR laboratory data by including information from the comorbidities.
The IBD dataset was used because IBD is a heterogeneous disease and a clear understanding of its risk factors is still lacking. Recent advances in the knowledge of IBD’s pathogenesis have implicated a complex interplay between metabolic reprogramming and immunity [23]. Furthermore, the response to treatment in IBD varies significantly among individuals and disease subtypes based on demographic characteristics, diet, comorbidities, underlying immunological factors, and genetic polymorphisms. Thus, there is an urgent unmet need to replace the current imputation approaches with personalized strategies that consider individual variability, diversity, and more balanced patient representation. Building predictive models of treatment outcomes for IBD is therefore an important step in utilizing the available data on drug responses to provide better care for this patient population, and the integration of laboratory measures in such predictive models has clinical value.
We created the Cdiff dataset because understanding recurrent C. difficile infection is important, and the existing EHR data can help us identify clinical biomarkers and build a decision support system that helps physicians target patients at higher risk of recurrence for more targeted preventive care.
Finally, the OA dataset was added to test the limits of this model. An OA diagnosis is not based on any laboratory measure known today. An OA diagnosis is based on imaging alone. Therefore, we did not expect the OA cohort to have any special patterns in their laboratory profile, yet we observed that, even in this situation, the use of a comorbidity pattern can help in improving the imputation of laboratory values. The OA dataset was also the smallest dataset tested in this study.
Overall, our results showed that each dataset is unique, and a one-size-fits-all approach does not apply when selecting the imputation model. On simulated datasets with interactions between variables, the imputation of missing data using MICE with regression trees resulted in less biased parameter estimates than MICE with linear regression [24]. In the CALIBER study, MICE with random forest showed greater imputation efficiency, with narrower confidence intervals for the error metric [25]. In a simulation in which the partially observed variable depended on the fully observed variables in a nonlinear way, MICE-RF showed less bias in parameter estimates and better confidence interval coverage. In our study, rf also performed well; however, the best performance was observed when pmm was used in the Cdiff cohort. Nonetheless, because the RMSEs were calculated across all laboratory variables, the improvement may be driven by a few variables that were imputed better in some, but not all, cases. Further analysis will be needed to test this assumption.
The method presented here is an intuitive approach for any given complex disease where biosignatures or risk factors are only partially known and the relationship among the variables can be convoluted given the large dimensionality of the dataset. Even though the level of missingness can vary, the best results are typically obtained when the level of missingness is low or moderate. The improvement over conventional methods without the consideration of comorbidity information can be achieved when the missingness level is high. Our strategy was to ensure that (1) our experiment aligned with the current methodologies in practice and (2) others can easily adapt this modification to their work. In future directions, we will explore if advanced modeling frameworks such as the generative adversarial network [
26] (GAN) or the newly proposed generative adversarial imputation nets (GAIN) framework [
27] can be optimized for imputing laboratory values from EHRs.
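At its simplest, the comorbidity-informed strategy amounts to clustering patients by their comorbidity profile and imputing within each cluster. The sketch below is purely illustrative: it substitutes cluster-wise mean imputation for the MICE step, and the function name and toy data are hypothetical:

```python
import numpy as np

def impute_within_clusters(labs, cluster_ids):
    """Cluster-wise mean imputation: a simple stand-in for running an
    imputation model (e.g., MICE) separately within each
    comorbidity-based patient cluster."""
    imputed = labs.copy()
    global_means = np.nanmean(labs, axis=0)          # fallback for sparse clusters
    for cid in np.unique(cluster_ids):
        rows = np.where(cluster_ids == cid)[0]
        means = np.nanmean(labs[rows], axis=0)       # cluster-specific lab means
        means = np.where(np.isnan(means), global_means, means)
        for j in range(labs.shape[1]):
            missing = rows[np.isnan(labs[rows, j])]
            imputed[missing, j] = means[j]
    return imputed

# Toy example: two clusters with systematically different laboratory profiles
labs = np.array([[1.0, np.nan], [1.2, 5.0], [np.nan, 9.0], [3.0, 9.4]])
clusters = np.array([0, 0, 1, 1])
print(impute_within_clusters(labs, clusters))
```

Because each missing value is filled from patients with a similar comorbidity pattern, the imputed values reflect cluster-specific laboratory profiles rather than the population average.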
Finally, our study provided a step in what we believe is a pipeline of data quality improvements for empowering machine learning models using EHRs. The main limitation of this approach is the need for large datasets: the clustering step reduces the sample size available for imputation, thus reducing its power. Therefore, this approach is ideal for machine learning applications, where sample sizes tend to be large and comprehensive. Our smallest cohort consisted of 10,000 OA patients, and our best prediction improvement was observed for the largest dataset of 46,215 patients. Another limitation that we could not address stems from our masking strategy for the evaluation, which was done at random, even though we knew that the missingness in the EHR was not at random. However, given that we did not know a priori the reason for missingness for each patient, and given the complex nature of the data, masking at random was the most sensible strategy. As of now, we do not have a better strategy for simulating missingness not at random (MNAR) when withholding values, as the factors contributing to MNAR are multifactorial and largely unknown.
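The masking-at-random evaluation can be outlined as follows: withhold a random fraction of the observed cells, impute, and score the imputations against the withheld truth. This hypothetical Python sketch uses column-mean imputation as a stand-in for the actual imputation model; the function names and parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_at_random(data, frac=0.1):
    """Withhold a random fraction of the *observed* cells (MCAR-style masking).
    Returns the masked matrix and the boolean mask of withheld positions."""
    observed = ~np.isnan(data)
    withhold = observed & (rng.random(data.shape) < frac)
    masked = data.copy()
    masked[withhold] = np.nan
    return masked, withhold

def rmse_on_withheld(truth, imputed, withhold):
    """RMSE over only the cells that were withheld for evaluation."""
    diff = truth[withhold] - imputed[withhold]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy demonstration: column-mean imputation standing in for the real model
data = rng.normal(size=(100, 5))
masked, withhold = mask_at_random(data, frac=0.2)
col_means = np.nanmean(masked, axis=0)
imputed = np.where(np.isnan(masked), col_means, masked)
print(rmse_on_withheld(data, imputed, withhold))
```

Simulating MNAR would require replacing the uniform `rng.random(...)` draw with withholding probabilities that depend on the (unknown) drivers of missingness, which is exactly the information we lack.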
This study had several other limitations. First, by converting the comorbidity information into binary form, we may have lost important information. This study design can be enhanced further to answer a specific research question by optimizing the pattern of ICD codes recorded (both the frequency and time intervals) to capture the duration and severity of the conditions. Second, we withheld a relatively small number of values to evaluate our model. This is because we included laboratory codes with as high as 75% missingness and applied clustering prior to imputing; thus, withholding a higher proportion of laboratory values would have further increased the sparsity of the dataset and introduced further bias. As a future direction, we plan to apply the algorithm several times to random subsamples of the data of size n/2 (n = number of samples). This repeated double randomization, similar to the concept of bagging and sub-bagging [
28,
29] algorithms, could further help optimize our strategy. Third, we did not limit the observation window with respect to the diagnosis index event, as would be done in a carefully designed study [
30,
31]. However, the identification of pre- and post-index windows should be thoroughly planned based on the research question, the sparsity of the data, the healthcare system, and the variables under consideration [
30]. As this is a proof-of-concept study, however, we did not limit our observation window, in order to improve data availability so that we could experiment with different levels of missingness. Even though this is a limitation of this study, many patients had only a few laboratory values for the less commonly used laboratory codes, which motivated this choice. Fourth, as this was a pilot study, we wanted to corroborate the generalizability and scalability of the proposed strategy; therefore, we did not exhaustively vary the abstraction level or the size of the clusters, but we did apply the model on three different cohorts created specifically for this study. Finally, by combining the laboratory codes into three groups (<25% missing, <50% missing, and <75% missing), we were unable to determine whether the improvement was due to one or a few laboratory variables. Further assessments will be needed to study the improvement of imputation for each laboratory variable on a case-by-case basis for more targeted evaluations and improvements.
To conclude, the advantages of imputing missingness are manifold: imputation can increase data density and improve the representation of data-poor patients, thereby reducing implicit algorithmic bias. Patients with limited access to healthcare and specialty care are prone to being underrepresented in models because their data footprint is smaller. The inclusion of more laboratory values is important because a diagnosis prediction that is not at least partially based on laboratory information is likely to be weak. Predicting a future disease by focusing only on past diagnoses (i.e., using only information based on ICD codes) does not take full advantage of the information in electronic health records. Laboratory measurements, like imaging and imaging reports, are at the core of diagnosis and care management. The novelty of this study lies in its intuitive design and relatively simple implementation, incorporating information from a patient's comorbidities to improve the imputation of laboratory values.
As a future direction, we will investigate how best to impute longitudinal laboratory measures to better inform clinical studies. We will also explore integrating additional features, such as demographics (e.g., age and gender), medication usage, and genetic information when available, to further enhance the imputation outcome. Finally, we will evaluate various preprocessing and normalization strategies to determine whether these manipulations can improve our predictions, especially for variables with skewed distributions, and we will examine the impact of imputation on each laboratory value to identify patterns or trends that can help improve the prediction of missing values. To conclude, we optimized the level of abstraction needed to improve imputation for three cohorts of varying sizes and complexities. This study demonstrates that the use of shared latent comorbidities can facilitate improvements in imputing laboratory measures from EHRs for downstream analysis and predictive modeling.