*Article* **Longwise Cluster Analysis for the Prediction of COVID-19 Severity within 72 h of Admission: COVID-DATA-SAVE-LIFES Cohort**

**Rodrigo San-Cristobal 1,\*,†, Roberto Martín-Hernández 2,†, Omar Ramos-Lopez 3, Diego Martinez-Urbistondo 4, Víctor Micó 1, Gonzalo Colmenarejo 2, Paula Villares Fernandez 4, Lidia Daimiel 5 and Jose Alfredo Martínez 1,6**


**Abstract:** Routine laboratory biomarkers play a key role in clinical decision making for COVID-19 and allow the development of clinical screening tools for personalized treatments. This study performed a short-term longitudinal cluster analysis of patients with COVID-19 based on biochemical measurements from the first 72 h after hospitalization. Clinical and biochemical variables from 1039 confirmed COVID-19 patients from the "COVID Data Save Lives" dataset were grouped in 24-h blocks, and a longitudinal k-means clustering algorithm was applied to the trajectories. The final three-cluster solution showed a strong association with different clinical severity outcomes (OR for death: Cluster A reference, Cluster B 12.83 CI: 6.11–30.54, and Cluster C 14.29 CI: 6.66–34.43; OR for ventilation: Cluster B 2.22 CI: 1.64–3.01, and Cluster C 1.71 CI: 1.08–2.76), and including the clusters improved the AUC of the models based on age, sex, oxygen concentration, and the Charlson Comorbidities Index (0.810 vs. 0.871 with *p* < 0.001 and 0.749 vs. 0.807 with *p* < 0.001, respectively). Patient diagnoses and prognoses diverged markedly between the three clusters obtained, indicating that data-driven technologies devised for the screening, analysis, prediction, and tracking of patients play a key role in the individualized management of the COVID-19 pandemic.

**Keywords:** COVID-19; Charlson Comorbidities Index; cluster analysis; longitudinal cluster; individualized management

**Citation:** San-Cristobal, R.; Martín-Hernández, R.; Ramos-Lopez, O.; Martinez-Urbistondo, D.; Micó, V.; Colmenarejo, G.; Villares Fernandez, P.; Daimiel, L.; Martínez, J.A. Longwise Cluster Analysis for the Prediction of COVID-19 Severity within 72 h of Admission: COVID-DATA-SAVE-LIFES Cohort. *J. Clin. Med.* **2022**, *11*, 3327. https://doi.org/10.3390/jcm11123327

Academic Editor: Robert Flisiak

Received: 5 April 2022 Accepted: 7 June 2022 Published: 10 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) appeared around December 2019 in Wuhan (China) and has been spreading around the globe ever since [1,2]. The World Health Organization (WHO) declared the disease caused by SARS-CoV-2 (COVID-19) a pandemic in March 2020, based on the growth in incidence due to its high contagiousness and high lethality [3]. A major challenge for clinicians and practitioners has been the wide range of clinical presentations of the disease, which requires decisions on intensive care unit (ICU) admission and on the use of mechanical ventilation. Patients with COVID-19 may be asymptomatic or present with milder symptoms (including fever, sore throat, dry cough, dyspnea, myalgia, headache, or diarrhea) or with more severe symptoms, such as chest pain, hypoxemia, pneumonia, and other complications [4]. Since the appearance of this pandemic, several authors have tried to stratify patients according to symptoms, oxygen saturation, or chest computed tomography in order to predict disease severity, aiming to facilitate decision making in clinical practice [5].

COVID-19 infection displays a mean incubation period between 6 and 7 days from the initial infection, followed by a viremic phase from the 8th day to the 10th day. However, the delay between symptom onset and hospitalization could vary from 2.6 to 9.7 days, depending on the country and the age of patients [6]. The delay in detection and hospitalization has a large impact on the concurrent inflammatory stage, and, thus, on the prognosis and the fatality of the disease [7]. These manifestations are accompanied by microvascular damage caused by the cytokine "storm" [8], and often, the pathophysiological COVID-19 condition is also associated with bacterial infections [9] and with metabolic impairments [10,11], where prescribed anti-inflammatory medications may also play a role [12].

Given all of this, routine laboratory biomarkers are a key monitoring tool for predicting the prognosis of the disease. Several studies have focused on a limited number of these markers or have performed only cross-sectional analyses of baseline values and their relationship with the prognosis of these patients [13,14]. Thus, identifying the patients who are more likely to develop severe illness after diagnosis is a critical *checkpoint* for decreasing mortality rates and for avoiding the collapse of medical care within hospitals [15]. Therefore, taking into account the time evolution of comorbidities and potential organ injuries throughout the course of severe COVID-19 is crucial for the precise clinical management of patients, influencing treatment approaches and recovery rates [16], where inflammation has a very strong role [17], as do immunity and hematological alterations [18] and liver dysfunctions [19]. All of this emphasizes the need for a clustered clinical management of this disease that would lead to more personalized and effective interventions [20].

In this regard, understanding the short-term longitudinal variation and the specific profiles of these biomarkers according to the severity of disease progression would allow the development of stratification tools [21] to characterize distinctive phenotypes of patients with COVID-19 that predict their potential prognosis [22]. Likewise, the use of data science methods to identify underlying patterns or profiles in patients with COVID-19 could shed light on the mechanisms involved and would allow the prescription of personalized treatments by determining clusters of patients based on objectively measured variables [5]. Based on this background, the present study aimed to explore data from patients admitted to all HM private hospitals in the Madrid region during the first pandemic peak reported in Spain, in order to find clusters of patients based on the biochemical measurements from the first 72 h of attendance and to assess their implications for prognosis.

#### **2. Materials and Methods**

#### *2.1. Patients Database*

The data used for the present analysis were drawn from the "COVID Data Save Lives" (COVIDDSL) initiative carried out by *HM Hospitales.* This initiative made freely available an anonymous dataset containing information from the Electronic Health Record (EHR) system of HM Hospitales (information available at https://www.hmhospitales.com/coronavirus/covid-data-save-lives/english-version (accessed on 20 July 2020)). The anonymized information contains the records of 2310 patients who were admitted with a diagnosis of COVID-19 between 26 December 2019 and 10 June 2020. Multicenter longitudinal information from this EHR comprises different datasets corresponding to the main clinical characteristics of different domains. Each patient was identified by an anonymized unique admission code. The datasets include information about the COVID-19 treatment process, including complete information on admission and diagnoses, treatments, ICU admissions, diagnostic imaging tests, laboratory results, drug administration, and cause of discharge or death. This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the HM hospitals consortium (CEI HM Hospitales Ref No. 20.05.1627-GHM).

#### *2.2. Data Collection and Definitions*

The datasets were preprocessed considering only adult patients with confirmed COVID-19. Both clinical and biochemical variables were selected and grouped in 24-h blocks up to 72 h after the patients' admission to the hospital. For patients with more than one measurement per day, the median value was used to avoid the potential effect of extreme values for these variables. Additionally, patients were categorized according to the cause of discharge or admission to the ICU and the administration of mechanical ventilation. Reported death and mechanical ventilation variables were used to test the prognostic value of the current exploratory analysis.
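The block-wise aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the record layout `(patient_id, variable, hours_since_admission, value)`, the function name `block_medians`, and the choice of three blocks covering 0–72 h are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import median

def block_medians(measurements, block_hours=24, horizon_hours=72):
    """Collapse raw measurements into per-patient, per-variable block medians.

    `measurements` is an iterable of hypothetical records:
    (patient_id, variable, hours_since_admission, value).
    Returns {(patient_id, variable): [median per block]}, with None
    where a block contains no measurement.
    """
    n_blocks = horizon_hours // block_hours
    buckets = defaultdict(list)
    for patient_id, variable, hours, value in measurements:
        if 0 <= hours < horizon_hours:  # ignore anything outside the window
            block = int(hours // block_hours)
            buckets[(patient_id, variable, block)].append(value)
    result = {}
    for (patient_id, variable, block), values in buckets.items():
        series = result.setdefault((patient_id, variable), [None] * n_blocks)
        series[block] = median(values)  # the median damps extreme values
    return result

# Illustrative (made-up) C-reactive protein readings for one patient.
records = [
    ("P1", "crp", 2.0, 80.0),
    ("P1", "crp", 10.0, 120.0),
    ("P1", "crp", 20.0, 500.0),  # extreme value within the first block
    ("P1", "crp", 30.0, 90.0),
    ("P1", "crp", 80.0, 10.0),   # after 72 h, not used
]
trajectories = block_medians(records)
print(trajectories[("P1", "crp")])
```

In this toy input the first block holds three readings, so the single extreme reading does not drive the block value, which is the stated motivation for using the median.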

Data included for the exploratory analysis were the patients' age, sex, clinical history of previous diseases, vital signs and tests performed throughout the hospitalization, and the medications administered until discharge. The vital sign variables included for the analysis were oxygen saturation (%), body temperature (°C), heart rate (beats/min), and systolic and diastolic blood pressure (mmHg). The following parameters were selected from the different tests collected: white cell proportions including leukocytes (1000/μL), basophils (%), eosinophils (%), lymphocytes (%), monocytes (%), and neutrophils (%); red cell markers including red cell distribution width (RDW, in %), hemoglobin (g/dL), hematocrit (%), mean corpuscular hemoglobin (pg/cell), mean corpuscular hemoglobin concentration (g/dL), and mean corpuscular volume (fL); platelet and prothrombin markers such as mean platelet volume (%), platelet count (1000/μL), the international normalized ratio (INR), prothrombin activity (%), and prothrombin time (seconds); metabolic markers and electrolytes including glucose (mg/dL), gamma-glutamyl transferase (GGT, in IU/L), aspartate aminotransferase (AST, in IU/L), alanine aminotransferase (ALT, in IU/L), sodium (mmol/L), and potassium (mmol/L); and finally, inflammatory and catabolic markers such as C-reactive protein (CRP, in mg/L), D-dimer (ng/mL), lactate dehydrogenase (IU/L), creatinine (mg/dL), and urea (mg/dL).

Additionally, International Statistical Classification of Diseases and Related Health Problems (ICD-10) coding tables with clinical records of diseases and procedures, as well as medications classified by ATC5/ATC7, for each patient and time point were condensed into categories of diseases and activity of medications, respectively. The coded information was used to carry out complementary descriptive analyses. Additionally, clinical variables were encoded following the criteria of the Charlson comorbidity index (CCI) categories [23] to adjust the logistic regression models and to measure the effect of concomitant diseases as a potential confounder in the prognosis of these patients.

#### *2.3. Statistical Analysis*

Patients with less than 50% of missing values for the selected variables during the first 4 blocks of 24 h were selected to conduct the present analysis. For the descriptive analysis, patients were split at the median number of comorbidities at the baseline (by CCI) into those with 3 or fewer comorbidities and those with more than 3 comorbidities; the analysis included means and standard deviations (SD) for quantitative variables and absolute values with percentages for categorical variables. Student's *t* tests for continuous variables and chi-squared tests for categorical variables were used to assess differences between patients from both comorbidity groups.
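As an illustration of the inclusion rule, the following Python sketch keeps only patients whose fraction of missing cells is below 50%. The data layout (a dictionary of variables, each holding one value per 24-h block, with `None` for a missing value) and the function names are hypothetical; how missingness was actually counted in the study is not specified beyond the sentence above.

```python
def missing_fraction(trajectories):
    """Fraction of missing (None) cells across all variables and blocks."""
    cells = [value for series in trajectories.values() for value in series]
    return sum(value is None for value in cells) / len(cells)

def select_patients(cohort, max_missing=0.5):
    """Keep patients whose missing fraction is strictly below `max_missing`."""
    return {
        patient_id: trajectories
        for patient_id, trajectories in cohort.items()
        if missing_fraction(trajectories) < max_missing
    }

# Made-up cohort with two variables and three blocks per patient.
cohort = {
    "P1": {"crp": [120.0, 90.0, None], "urea": [40.0, 42.0, 45.0]},  # 1/6 missing
    "P2": {"crp": [None, None, None], "urea": [50.0, None, None]},   # 5/6 missing
    "P3": {"crp": [100.0, None, None], "urea": [30.0, 33.0, None]},  # 3/6 missing
}
kept = select_patients(cohort)
print(sorted(kept))
```

Because the rule is "less than 50%", a patient with exactly half of the cells missing is excluded in this sketch.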

The longitudinal unsupervised clustering was performed using the *kml3d* library, which provides a longitudinal implementation of the widely used k-means algorithm [24]. The technique used for this study was an unsupervised non-parametric cluster analysis that classifies the trajectories of the patients by simultaneously considering the 33 routine biochemical parameters from the first 72 h after the admission of the patients. This technique implements a path expectation-maximization algorithm, alternating different initialization methods to obtain the most stable solution for the clusters. Clustering approaches are not ultimately predictive, but they are descriptive and help to identify patterns in hidden data structures without demanding a formal hypothesis. Indeed, clustering analysis can feature groups of patients associated with specific disease risks, and it permits targeting patients in a cost-effective and feasible manner with relevant clinical impact. Cluster analysis has been used to characterize risk factors associated with diseases [25] and may require further regression analysis to predict other related variables [26]. This library was used specifically to cluster patients based on the joint trajectories of the selected clinical and biochemical variables throughout the 24-h time periods during the first 72 h of hospitalization. A range of 2 to 10 clusters was tested to find the most adequate solution for the model, based on the lowest Bayesian information criterion (BIC) and the clinical relevance of clustering solutions (measured by the severity of outcomes related to the cluster using logistic regression), resulting in a final best solution of 3 clusters. Principal component analysis was conducted to visualize the categorization of the patients. The relative importance of the variables and the time periods was estimated through variable/time permutation to gain a better understanding of the most important variables and times in the clusters obtained. ANOVA was carried out to compare clinical characteristics among clusters, and a Tukey *post hoc* analysis was applied to compare individual groups.
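The idea of clustering joint trajectories can be sketched in a few lines of Python. This is a simplified stand-in, not the *kml3d* implementation or its API: each patient is represented by a flattened vector of all variables at all time blocks, columns are z-scored, and a plain k-means is run from several deterministic starting points, keeping the solution with the lowest within-cluster sum of squares. The farthest-point seeding, the function names, and the toy data are assumptions; the BIC-based choice of the number of clusters and the permutation importance are not reproduced here.

```python
def standardize(rows):
    """Z-score each column so variables on large scales do not dominate."""
    n, cols = len(rows), list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [(sum((x - m) ** 2 for x in c) / n) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def maximin_init(rows, k, start):
    """Seed with rows[start], then add the row farthest from chosen seeds."""
    centroids = [list(rows[start])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(rows, k, start=0, max_iter=100):
    """Plain k-means; returns (labels, centroids, within-cluster SS)."""
    centroids = maximin_init(rows, k, start)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, wss

def best_of(rows, k):
    """Try every row as the first seed and keep the tightest solution."""
    runs = (kmeans(rows, k, start=s) for s in range(len(rows)))
    return min(runs, key=lambda run: run[2])

# Made-up joint trajectories: [crp_0h, crp_24h, crp_48h, urea_0h, urea_24h, urea_48h]
patients = [
    [10, 12, 11, 30, 31, 29],
    [12, 11, 13, 28, 30, 31],
    [11, 13, 12, 29, 29, 30],
    [150, 170, 160, 90, 95, 99],
    [140, 165, 170, 85, 92, 97],
    [155, 160, 165, 88, 96, 98],
]
labels, centroids, wss = best_of(standardize(patients), k=2)
print(labels)
```

Flattening variables and time blocks into one vector is what makes the distance "joint": two patients are close only if all of their variables evolve similarly over the whole window.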

A multivariable logistic regression model was used afterwards to estimate the gain upon inclusion of the clusters previously obtained as independent variables for the prediction of two outcome variables, namely death and administration of mechanical ventilation during hospitalization. Three different models were developed to evaluate the effect of the inclusion of the cluster assignment, in addition to the main factors that impact the COVID-19 prognosis. Model 1 used age-independent CCI, sex, and age as predictor variables; model 2 was additionally adjusted by temperature and oxygen saturation at admission; and the final model 3 was additionally adjusted by the cluster assignments. The area under the curve (AUC) from receiver operating characteristic (ROC) curves was estimated to evaluate the predictive value of each model. All the statistical analyses were performed using R statistical software version 4.0.1 (R Project for Statistical Computing) within RStudio statistical software version 1.4 (RStudio Team. RStudio: Integrated Development Environment for R. Boston, MA, USA).
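Two quantities reported from these models, the AUC and the odds ratio, can be made concrete with a short Python sketch. It does not fit the regression itself: the predicted probabilities below are invented, the function names are hypothetical, and the Wald-type confidence interval is an assumption, since the text does not state how the intervals were computed.

```python
import math

def auc(scores, outcomes):
    """AUC as the probability that a random case is scored above a random
    non-case (ties count one half); equivalent to the Mann-Whitney U."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def odds_ratio(beta, se, z=1.96):
    """Odds ratio with a Wald 95% CI from a logistic coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Invented outcomes (1 = event) and predicted probabilities from two
# hypothetical nested models, without and with the cluster assignment.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
without_cluster = [0.10, 0.30, 0.55, 0.35, 0.20, 0.60, 0.50, 0.80]
with_cluster = [0.05, 0.20, 0.30, 0.45, 0.10, 0.70, 0.65, 0.90]

print(auc(without_cluster, outcomes), auc(with_cluster, outcomes))
print(odds_ratio(beta=0.8, se=0.15))
```

Comparing the AUC of nested models on the same patients, as in this toy comparison, is what quantifies the gain attributable to the added predictor; testing whether that gain is statistically significant requires a separate procedure, which is not sketched here.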

#### **3. Results**

#### *3.1. Study Sample Description*

The cleaned dataset (Table 1) contained 1039 confirmed COVID-19 patients, 60% male and 40% female, with an overall mean age of 68.5 years. The mean length of hospitalization was 10.1 days, with 5.4% of the patients admitted to the ICU and 62.6% receiving mechanical ventilation during hospitalization. The main cause of medical discharge was home referral (78.5% of patients), while referral to other centers corresponded to 6.2% of hospitalizations, and death represented 11.5% of the patients. The patients presented an average CCI of 3.6 at hospitalization. As expected, when the patients were categorized by CCI with a cut-off of 3 points (Table 1), those above the cut-off were older, showed worse health status in terms of hospitalization features and a higher death rate, and suffered more comorbidities, including cardiovascular events, liver diseases, diabetes, and cancer; however, a significant association with sex was observed.


**Table 1.** Baseline and outcome characteristics of COVID-19 patients from DATA SAVE LIVES categorized by the Charlson comorbidity index.

*p*-value: *t*-test for continuous variables and chi-square for categorical variables. ICU: intensive care unit; CCI: Charlson comorbidity index; COPD: chronic obstructive pulmonary disease; CKD: chronic kidney disease; AIDS: acquired immune deficiency syndrome.

#### *3.2. Patient Clusterization*

The cluster analysis was developed to categorize the sample based on the longitudinal evolution of multiple vital signs and laboratory tests (see Section 2). The best clustering was obtained with three clusters. Supplementary Figure S1b displays a PCA with all these variables, colored by the three clusters obtained. We can see the good separation of the patients achieved by this longitudinal clustering. In addition, in order to interpret the clustering, we estimated the relative importance of the different variables and times by permutation-based feature/time importance analyses. The resulting ranked importance of variables and times are displayed in Supplementary Figure S1c, where it can be seen that monocytes, GGT, neutrophils, prothrombin time, and urea were the most remarkable variable contributors to the clustering, and the first 24 h is the most important time of all.

In addition, in Table 2, we analyzed the association of these clusters with different baseline and outcome variables, in order to have an idea of the clinical profiles of the three clusters. Cluster A encompassed patients with lower hospitalization, ICU stay, and clinical complication rates, displaying a death rate of only 1.6%; Cluster B showed an intermediate prevalence of chronic diseases with a fatality incidence of 14.4%; and Cluster C comprised the oldest group of patients, with a mortality rate of 37.4% and a higher clinical morbidity prevalence.


**Table 2.** Baseline and outcome characteristics of COVID-19 patients from DATA SAVE LIVES categorized by cluster.

*p*-value: ANOVA for continuous variables and chi-square for categorical variables. ICU: intensive care unit; CCI: Charlson comorbidity index; COPD: chronic obstructive pulmonary disease; CKD: chronic kidney disease; AIDS: acquired immune deficiency syndrome.

Additionally, clinical variables evolved during the initial 72 h after hospital admission according to different cluster profiles, as can be seen in Figure 1, where the time evolution of these variables is displayed for the three classes, color coded in reference to recommended values (above, within, below). In general, Cluster C presented the most altered medical variables in comparison with the other two clusters, while Cluster B showed a mildly severe inflammatory condition. More specifically, the patients in Cluster B showed the lowest eosinophil levels and the highest levels of GGT, AST, ALT, C-reactive protein, and lactate dehydrogenase during the 72 h compared to the other clusters. Meanwhile, Cluster C presented the lowest lymphocyte levels and prothrombin activity, as well as the most elevated levels for prothrombin time, INR, glucose, D-dimer, creatinine, and urea (Figure 1).

Vital signs (Supplementary Figure S2) indicated that, while Cluster A presented fewer abnormal signs, Clusters B and C displayed significantly worse clinical values that were maintained throughout all time points (0–72 h). When white blood cell counts were examined (Supplementary Figure S3), Cluster A involved fewer biological abnormalities. Specifically, low levels of eosinophils were detected in the three clusters at all time points, while only Cluster B and Cluster C had lymphocyte counts below the laboratory references. Curiously, Cluster A presented high levels of monocyte count. Those cluster differences were present across the 72-h measured course (Figure 1 and Supplementary Figure S3). Red blood cell levels, despite some significant cluster differences, did not deviate from normal values (Figure 1 and Supplementary Figure S4).

**Figure 1.** Heatmap plot of adequacy to reference values for clinical variables included in the cluster analysis. White means that the mean value for the cluster was within the recommended values; meanwhile, blue and orange intensity represent the deviation from the recommended values below and above, respectively.

Regarding clotting (prothrombin activity and time, besides the international normalized ratio) and hepatic-related enzymes (ALT, AST, and GGT), Cluster C had altered high levels of those indicators, while the other two clusters were closer to reference normality (Figure 1 and Supplementary Figures S5 and S6). Finally, inflammation (C-reactive protein) and thrombosis (D-dimer) examinations, as well as lactate dehydrogenase, were impaired in all the clusters, with a greater severity in Clusters B and C compared to Cluster A, while renal function assessed by creatinine and urea was only altered in Cluster C (Figure 1 and Supplementary Figure S7).

#### *3.3. Logistic Regression Models to Predict Severe Outcomes*

Finally, a logistic regression model was fitted to discern the capacity of the modeled clusters to predict disease fatality (Table 3, Figure 2). The first model, including the age-independent CCI, sex, and age as predictors, showed only age and sex with significant *p*-values and an AUROC of 0.801. A second model, which added oxygen saturation and temperature (the former significant but not the latter) to the previous one, had a negligible increase of AUROC to 0.81. However, the inclusion of the cluster variable in the third model (green line in Figure 2) resulted in a large boost of the AUROC, up to 0.87. The third model presented the highest value of AUC, showing the best capacity for death prediction (*p*-value obtained by parametric bootstrapping for differences between Model 1 vs. Model 3 and Model 2 vs. Model 3, <0.001 and <0.001, respectively).


**Table 3.** Logistic regression model for the risk of death.

OR: Odds Ratio; CI: Confidence interval; AUC: area under the curve; CCI: Charlson comorbidity index.

**Figure 2.** ROC curve of logistic regression for the three models.

A similar analysis was performed to predict the risk of mechanical ventilation using these models (Table 4). The obtained results were similar, with a better predictive capacity in Model 3 compared to the other two models (Figure 3), confirming the utility of patients' clusterization (*p*-value obtained by parametric bootstrapping for differences between Model 1 vs. Model 3 and Model 2 vs. Model 3 = 0.023 and <0.001, respectively).


**Table 4.** Logistic regression model for the risk of mechanical ventilation.

AUC: area under the curve; CCI: Charlson comorbidity index.

**Figure 3.** ROC curve of logistic regression for the three models.

#### **4. Discussion**

Coronavirus disease has affected all nations and territories, and several investigations are now being conducted to seek personalized clinical prescriptions and provide epidemiological surveillance to control this pandemic [15,27]. Indeed, research on early symptomatic identification and on the assessment of specific traits involving clinical manifestations, medical outcomes, and epidemiological estimates with machine learning models offers huge opportunities for precision medicine despite some limitations and challenges [28]. In this context, the COVID-19 disease presents a unique opportunity to understand whether there are distinct phenotypes of COVID-19 outcomes, knowledge of which will provide important benefits not only for the personalized management of infected patients, but also for optimizing health care systems and for devising public health policies [29] by considering phenotypical plus family and clinical history backgrounds, as well as individual lifestyle factors [30].

The multivariate statistical and bioinformatic instruments used to provide valid information for clinical purposes include hierarchical cluster analysis, principal component analysis, random forest, discriminant analysis, support vector machine algorithms, and neural network-based deep learning methods, with value for disease characterization, diagnosis, and treatment [20]. In this context, a longitudinal cluster analysis was implemented on the "COVID Data Save Lives" (COVIDDSL) dataset to uncover statistically significant clinical variables and the internal structure of the data, as performed elsewhere with COVID-19 infected patients [31].

In our clinical setting, regarding a group of Spanish public/private hospitals, applying longitudinal cluster analyses enabled three distinctive COVID-19 medical phenotypes to emerge: Cluster A, characterized by patients with mild inflammatory symptoms and low death occurrence (1.6%); Cluster B, featuring important immune-inflammatory distress and specific liver dysfunctions with a mortality rate of 14.4%; and Cluster C, which encompassed specific coagulation disorders and renal alterations, in addition to inflammatory and immunocompetence abnormalities, with a fatality prevalence of 37.4% of the patients. Thus, survival times notably differed across the three groups of patients, which is key for improving disease management and outcomes by considering individualized patient profiling, predictive personalized models, and precision cost-effective risk-alleviating procedures, as previously described in the palliative treatment of liver tumors using unsupervised artificial intelligence [32]. Moreover, age and the number of comorbidities, which are associated with an increased risk of mortality in patients with COVID-19, need to be accounted for [33], as delineated in the three A, B, and C clusters.

In this scenario, analyses concerning longitudinal COVID-19 disease trajectories were able to recognize vulnerable population clusters that would particularly benefit from specific health resources and provide insights for public health targets in order to manage the COVID-19 infectious pandemic. Thus, tuberculosis and HIV/AIDS, hepatitis, cardiomyopathies, and diabetes were consistently associated with an increased risk of being found in a more vulnerable cluster [34]. Furthermore, a comprehensive measurement of dysfunction severity of six organ systems based on the Sequential Organ Failure Assessment (SOFA) score revealed that cardiovascular, central nervous system, coagulation, liver, renal, and respiration pathobiology were able to identify distinct strata of COVID-19 patients, as defined by the baseline post-intubation SOFA. This includes findings suggestive of inflammation as a mechanism involving differential COVID-19 disease severity outcomes, as well as a heterogeneous physiopathological lung illness [29], which is in accordance with some of our findings, given that inflammatory responses, clotting, hepatic/renal alterations, and impaired immunocompetence were markers involved in cluster discrimination.

Another study, developed with machine learning tools and based on a decision tree model to anticipate COVID-19 outcomes from a list of 132,939 recovered COVID-19 subjects, found that mortality was specifically clustered among males, older cases, and patients with a hospital admission history, which acted as predictors of case fatality [35]. In addition, a database study encompassing hospitalized COVID-19 patients over 24 and 48 h in the Mount Sinai Health System predicted intubation, intensive care unit transfer, and mortality and was able to identify important features with clinical importance in the outcome, such as pulse oximetry [36]. Results from the current analyses confirm trends during the 72-h outcomes among the three clusters, with some differential responses concerning PCR, hemoglobin, and coagulation indicators, while the fitted logistic regression models for the risk of mechanical ventilation and death showed that both outcomes were independently influenced by cluster allocation.

Another analysis, devised to generate an accurate diagnosis model of COVID-19 based on routine tests and clinical symptoms by applying machine learning to COVID-19 data, found several associations between clinical variables, such as having idiosyncratic levels of circulating lymphocytes and neutrophils, suggesting that COVID-19 patients could be clustered into several phenotype subtypes based on immune cells, gender, and declared symptoms, which could overcome the influence of a low testing capacity or the concurrent impact of other bacterial or viral infections [37]. Indeed, our cluster model demonstrated discrimination abilities associated with lymphocyte, monocyte, and eosinophil counts among them during the 72 h after hospitalization.

Notably, anemia and iron deficiency may play a role in the Coronavirus disease, as shown in a systematic review and associated meta-analyses, where hemoglobin levels were lower with older age but higher in subjects with diabetes, hypertension, and overall comorbidities and those admitted to intensive care [38], which is independently categorized by Cluster C in our model.

The severe proinflammatory state commonly reported in COVID-19 patients has been associated with the activation of coagulation pathways and thrombosis [39], as well as with a characteristic coagulopathy and procoagulant endothelial phenotype [40]. The current clustered model for COVID-19 patients classified prothrombin activity and time, specifically in Cluster C, and also demonstrated some stratification capacity for D-dimer measurements, but not for increased platelet consumption. Interestingly, thrombocytopenia is relatively uncommon in COVID-19, and it is thought that dysregulated immune system responses, coordinated by inflammatory cytokines, lymphocyte cell death, and endothelial damage, are involved [41]. Thus, patients with COVID-19 may suffer coagulation and thrombotic abnormalities, stimulating a hypercoagulable condition and increasing thromboembolic incidence [42].

Associations of blood biomarkers such as the neutrophil-to-lymphocyte ratio with the severity of COVID-19 lesions have been established, as well as with other specific and unspecific proinflammatory markers, such as CRP, and with other measures commonly analyzed for COVID-19, such as hemoglobin, D-dimer, and eosinophil counts [18], which should guide clinicians in the management of infected patients, a task eased by the existence of algorithms and cluster categorization. Further statistical analyses indicated that inflammatory CRP and D-dimer levels were increased and can serve as early indicators of severe COVID-19 cerebrovascular problems [27].

In these circumstances, exacerbated innate and adaptive immune responses are crucial in foreseeing the development and progression of NAFLD in COVID-19 patients [19]. A specific implication of severe COVID-19 in NAFLD patients, putatively mediated by immunocompetence status, is highlighted in Cluster B, where transaminases and liver health markers showed abnormal values and may drive personalized medicine approaches, as prompted by the allocation to a cluster with related measurements uncovering therapeutic targets. In a previous report, patients from this COVID-DATA-SAVE-LIFES cohort were categorized following conventional criteria to explain disease severity and deaths, which verified that liver and proinflammatory features are important determinants of COVID-19 morbidity and mortality, in order to improve the understanding of morbid manifestations of COVID-19 and to help therapy decision-making protocols under a personalized medicine scope [11]. Indeed, the liver health and coagulation axis appears as a relevant surrogate for elucidating some COVID-19 outcomes linked to systemic inflammation [43], as well as thrombotic and fibrinolytic disturbances [44], which were deciphered in the three clusters that emerged here, including some markers of global health such as lactate dehydrogenase or creatinine/urea measurements [45], as particularly discriminated in Cluster C. Interestingly, hemoglobin and prothrombin values showed divergent patterns over the 72-h period, which makes them worth monitoring by cluster. Indeed, our results provide a tool for the early management of COVID-19 patients, in contrast to other related papers on COVID-19 that relied on cardiac biomarkers [46] or other more complex techniques, such as imaging-based prognosis or gene/protein expression [47,48].

This research had some limitations and strengths. Because this was a multipurpose cohort, the aims and hypotheses were defined after the database was closed; this limitation was partly offset by the large number of clinical determinations collected and the relatively large sample size. In addition, the initial uncertainties about clinical management guidelines, together with concurrent morbid conditions and medications in COVID-19 patients, may have affected data interpretation, although we provided information about pharmacological treatments (Supplementary Table S2) and several diseases at admission.

The longwise cluster analysis performed in this study identified latent profiles of COVID-19 patients, which may shed light on the most appropriate treatment. It relied on objective routine blood markers commonly used in clinical practice, unlike other articles that studied the follow-up of a single marker [27], cross-sectional analyses [14], a composite index [29], or nonobjective markers [13]. Moreover, a machine learning model was able to predict case fatality in the elderly population with a long history of hospital admissions, a factor that increases the rate of COVID-19 death [35]. Novel aspects of this analysis were the discrimination of patients by clustering routine determinations and the ability to forecast death rates and associated comorbidities within the first 72 h. Previous studies have focused on exploring the value of these bioinformatic tools for coronavirus diagnosis and treatment [20], including image processing [49]. These results have been reinforced by systematic reviews and meta-analyses that described clinical subgroups, while other researchers using result-driven technologies implemented the screening, analysis, and prediction of tracking data to confirm death cases [50]. Furthermore, the longitudinal follow-up for 72 h allowed the confirmation of trends and alignments, supporting the value of multiple clinical analytical measurements at admission. Healthcare provision requires the backing of innovative skills and strategies, including artificial intelligence (AI), Big Data, and machine learning approaches, to combat and plan actions against new diseases such as COVID-19 and other complex syndromes. Identifying the pool of cases and predicting where this viral infection and its associated comorbidities will move in the future requires collecting clinical information and bioinformatically analyzing the available preceding data [50].

#### **5. Conclusions**

In summary, applying a longwise cluster analysis to the first 72 h after admission in the current cohort identified three distinct COVID-19 clinical phenotypes: Cluster A, comprising patients with mainly mild inflammatory abnormalities and a fatality rate below 2%; Cluster B, characterized by specific immune-inflammatory alterations and marked liver dysfunction, with a mortality of around 15%; and Cluster C, exhibiting hemoglobin, prothrombin, and renal impairments together with substantially altered inflammatory and immune responses, with about 40% of deaths in this group. Patient diagnoses and prognoses diverged remarkably across the three clusters, which is relevant for predictive patient stratification, tailored precision clinical prescriptions, personalized cost-effective interventions, and the easing of epidemiological measures, as pioneering studies have reported for diverse communicable and non-communicable diseases using artificial intelligence and machine learning tools. Data-driven technologies devised for the proper screening, analysis, prediction, and tracking of SARS-CoV-2-infected patients are undergoing significant development and finding applications in the precision and individualized management of the COVID-19 pandemic.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11123327/s1, Supplementary Table S1: Reference values for clinical variables included in the cluster analysis; Supplementary Table S2: Drug use by cluster and time (drugs with overall frequency n > 100); Supplementary Figure S1: A Principal component plot of the 2 main components from the cluster analysis; Supplementary Figure S2: Vital signs within the first 72 h of patients categorized by cluster; Supplementary Figure S3: White cells proportions within the first 72 h of patients categorized by cluster; Supplementary Figure S4: Red cells markers within the first 72 h of patients categorized by cluster; Supplementary Figure S5: Platelets and prothrombin markers within the first 72 h of patients categorized by cluster; Supplementary Figure S6: Metabolic markers and electrolytes within the first 72 h of patients categorized by cluster; Supplementary Figure S7: Inflammation and catabolic markers within the first 72 h of patients categorized by cluster.

**Author Contributions:** Conceptualization, R.S.-C., O.R.-L., D.M.-U., L.D. and J.A.M.; methodology, R.S.-C., R.M.-H., G.C. and J.A.M.; investigation, R.S.-C., R.M.-H., O.R.-L., D.M.-U., G.C., V.M., P.V.F., L.D. and J.A.M.; data curation and formal analysis, R.S.-C., V.M., L.D. and R.M.-H.; writing—original draft preparation, R.S.-C., R.M.-H. and O.R.-L.; writing—review and editing, R.S.-C., R.M.-H., D.M.-U., V.M., G.C., P.V.F., L.D. and J.A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Community of Madrid and the European Union, through the European Regional Development Fund (ERDF)-REACT-EU resources of the Madrid Operational Program 2014–2020, in the action line of R + D + i projects in response to COVID 19 (REACT EU Program "FACINGLCOVID-CM"). J.A.M. acknowledges financial support from Synergic R&D Projects in New and Emerging Scientific Areas on the Frontier of Science and Interdisciplinary Nature of The Community of Madrid (METAINFLAMATION-Y2020/BIO-6600). The support from CIBERobn is also credited.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the HM hospitals consortium (CEI HM Hospitales Ref No. 20.05.1627-GHM).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** HM Hospitales makes this clinical dataset available to researchers from academic, university and healthcare institutions who request it and whose project is approved. The content is expected to be expanded and updated periodically, and its update will not be completed until the pandemic ends. To obtain the data, a request must be sent to coviddatasavelives@hmhospitales.com or data\_science@hmhospitales.com, where it will be evaluated by the Data Science Commission and, where appropriate, by the Research Ethics Committee of HM Hospitales or any other accredited research ethics committee.

**Acknowledgments:** R.S.-C. acknowledges financial support from the Juan de la Cierva Programme Training Grants of the Spanish State Research Agency of the Spanish Ministerio de Ciencia e Innovación y Ministerio de Universidades (FJC2018-038168-I). The authors thank HM Hospitales for access to the COVID-DATA-SAVE-LIFES database.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

