1. Introduction and Background
Type 2 diabetes mellitus (T2DM) is a chronic disease that occurs when the body becomes resistant to insulin and/or the pancreas cannot produce enough insulin [1]. Patients with T2DM are at greater risk of developing cardiovascular disease (CVD). CVD, which includes congestive heart failure (CHF), cardiac arrhythmias, valvular disease, pulmonary circulation disorders, and peripheral vascular disorders, is one of the leading causes of death in people with T2DM in most countries worldwide and can account for 50% or more of deaths due to T2DM [2]. In 2011, CVD, diabetes and chronic kidney disease (CKD) were the major cause of death in Australia, accounting for 14% of all deaths, where about 7% of all deaths were due to CVD and diabetes together [3]. The clinical literature shows that CVD and T2DM often occur together. Patients with T2DM have more than twice the risk of developing CVD compared with patients without T2DM [4,5]. This is partly because CVD and T2DM share various risk factors, such as obesity, old age, hypertension and chronic kidney disease. There are also complex relationships between CVD and T2DM, and each of them may be caused by other diseases. As a result, they are more likely to occur together in an individual. The co-occurrence of these conditions is known as comorbidity [6]. The clinical management of people with both CVD and T2DM is more expensive, complex and time-consuming than that of people with a single disease. Alongside the projected increase in the prevalence of chronic diseases, the presence of CVD and T2DM as comorbidities exerts a significant social and health burden, often resulting in higher healthcare costs [7]. Although the comorbidity of chronic diseases is receiving more attention, most studies to date focus on understanding the progression of a single chronic disease, and fewer studies investigate the relationships among multiple chronic diseases [8]. Thus, the increased prevalence of comorbidity and its impact on the health of the population and the healthcare system are not clear. In particular, greater research attention is required for the chronic disease cohort with both CVD and T2DM, as the data show an increased risk of hospitalization and death for patients with both CVD and T2DM compared to patients with only T2DM [9]. Therefore, if diabetic patients at risk of CVD can be identified from their past medical data, preventive measures can be taken to increase the quality of care and reduce treatment costs. A potential source of patients' health information is hospital admission and discharge data, which contain standardized ICD (International Classification of Diseases) codes [10]. Analysis of these administrative data using data mining and social network analysis can help us to understand the progression of chronic disease comorbidities.
The development of effective disease progression models depends on understanding the disease progression pathway. A considerable amount of work has been done in the related field of understanding the comorbidity of chronic disease progression. Three main types of approaches (i.e., statistical methods, machine learning and data mining approaches, and network-based approaches) have been applied to understand disease progression and develop risk prediction models [11]. Rule-based scoring is a widely used statistical method for understanding disease progression as well as for risk prediction. It focuses on the clinical and empirical understanding of symptoms, prevalence and disease comorbidities [12,13]. In these models, scores are assigned to various physiologically observable symptoms, demographic risk factors and comorbid conditions to assess the severity of a patient's condition. Various rule-based scoring methods have been developed over the years to understand disease progression [12,13,14,15]. In 1987, Charlson et al. proposed the Charlson Comorbidity Index to predict the 10-year mortality for a patient by ranking a range of demographic information (e.g., age and sex) and comorbid conditions (e.g., cancer, heart disease and AIDS) [12]. The Elixhauser Index [16] shows slightly better performance for predicting mortality beyond 30 days [17,18]. Some other rule-based models such as APACHE-II (Acute Physiology and Chronic Health Evaluation-II) [13] and SAPS (Simplified Acute Physiology Score) [19] were proposed to assess the health condition of intensive care unit (ICU) patients in the first 24 hours of admission. The results of diagnostic tests can also be treated as scores and used for assessment or prognosis. For example, Ewing and Clarke proposed five tests (known as Ewing's battery test) to assess the risk of cardiovascular disease in patients with diabetes [14]. In 2008, a diabetes-specific equation was proposed to understand the disease progression and estimate the 5-year risk of cardiovascular disease in T2DM patients using A1C (i.e., glycated hemoglobin) test results [15]. Although these rule-based scoring models work well in specific healthcare settings, they are derived from clinical and empirical observation and have not been tested on many population cohorts with multiple comorbidities. However, chronic diseases do not occur in isolation [20]. They often share common risk factors that can be environmental, genetic and behavioral. These risk factors have a synergistic effect [21,22] on patients' health outcomes and thus should not be considered in isolation.
Administrative data are generated during different stages of healthcare delivery and health insurance claims. These include important information about patient and population health, such as demographic characteristics, health behaviors, clinical diagnoses and codes for procedures, laboratory results and care utilization [23]. Recently, these data have gained popularity and are used in clinical decision-making and healthcare research, such as treatment, diagnosis, understanding disease progression and disease risk prediction [16,24,25,26]. In 2001, Nichols et al. developed a research framework to estimate the prevalence and incidence of CVD (specifically congestive heart failure) in patients with T2DM using electronic health data [27]. They also identified risk factors for diabetes-associated congestive heart failure using multiple logistic regression models. Later, they updated their study to estimate the CHF incidence rate in T2DM and identify risk factors for developing CHF in patients with T2DM over 6 years of follow-up [28]. Recently, a research methodology was developed to identify the prevalence and incidence of CVD in patients with T2DM using electronic health data [29]. The study used ICD-9 (International Classification of Diseases, Ninth Revision) codes to identify the prevalence and incidence of CVD events. Several data mining and machine learning-based methods using administrative data have been proposed in various areas of healthcare research [30,31]. For example, collaborative filtering methods were proposed to understand disease progression and predict disease risk using healthcare data [30,32]. A deep learning algorithm was used for risk prediction for multiple comorbid diseases [33]. The Bayesian network [34], a combination of graph theory and probability theory, has been used to understand the comorbidity of multiple chronic diseases [35]. A risk prediction model based on a Bayesian network was developed to predict the risk of progression to chronic obstructive pulmonary disease in asthma patients using electronic health data [36].
In this study, we used a network-based approach on administrative data (i.e., hospital admission and discharge data) to understand the progression of CVD in patients with T2DM by considering the comorbidities. In the biomedical field, the network-based approach has been used to understand the pathogenesis of diseases using gene expression and related proteins [37]. Social network analysis (SNA) is another related approach introduced in healthcare informatics [38,39]. SNA studies a set of entities, such as physicians, diseases and hospitals, together with the relationships between them. More recently, administrative data have been widely used to demonstrate SNA-based approaches [40]. These approaches are used to understand relations between healthcare entities [38,41] and improve the collaboration efficiency among physicians [42]. SNA techniques were applied to administrative and electronic healthcare data of CHF patients to explore the patterns of service delivery for care coordination [43]. Khan et al. [24] proposed a framework to understand the progression of T2DM using graph theory and SNA. Their proposed framework was applied to administrative healthcare data in the Australian healthcare context. The study mainly focused on understanding a single chronic disease (i.e., T2DM) rather than the progression of multiple chronic diseases. To our knowledge, there is very little research on using a network-based approach and administrative data to understand the progression of CVD in patients with T2DM.
4. Discussions
This study presented a network-based framework to understand chronic disease comorbidities. Administrative data are used in this study as they represent a unique source of information on patients' medical conditions and are arguably among the best available sources for understanding disease progression. Our study focused on understanding the progression of CVD in patients with T2DM. The final disease network generated from two baseline disease networks represents the health trajectory of T2DM patients who develop CVD, where the node frequency (Figure 4) and edge weight (Table 6) of the network show unique characteristics of progression towards CVD in patients with T2DM in terms of comorbidities. For instance, this study found some risk factors (e.g., renal failure, chronic pulmonary disease, hypertension and fluid and electrolyte disorders) that may be responsible for developing CVD in patients with T2DM [27,33,53]. Although these are well-known risk factors for developing CVD among diabetic patients, this study provided further in-depth information about the transitions between these comorbid conditions. For example, as shown in Table 6, we found that the transition between "fluid and electrolyte disorders" and "renal failure" has the highest weight, meaning that this transition may be a potential risk factor for the progression towards CVD in patients with T2DM. To our knowledge, no other study in the literature provides such in-depth information regarding the edge-level transitions of comorbid disease conditions. The proposed framework for learning disease progression is flexible and can accommodate new sources of data for understanding the progression of multiple (more than two) chronic diseases.
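To make the notions of node frequency and edge weight concrete, the following is a minimal sketch of how such quantities could be counted from chronologically ordered admissions. It is an illustration under stated assumptions, not the construction used in this study: the function name `build_disease_network`, the input layout (one list of admissions per patient, each admission a set of comorbidity labels) and the rule that an edge links every condition in one admission to every different condition in the next admission are all hypothetical.

```python
from collections import Counter


def build_disease_network(patient_histories):
    """Count node frequencies and directed edge weights.

    patient_histories: list of patients; each patient is a list of
    admissions in chronological order; each admission is a set of
    comorbidity labels (assumed layout, for illustration only).
    """
    node_freq = Counter()    # how often each condition is recorded
    edge_weight = Counter()  # how often condition A is followed by B

    for admissions in patient_histories:
        for conditions in admissions:
            node_freq.update(conditions)
        # Link each admission to the one that immediately follows it.
        for earlier, later in zip(admissions, admissions[1:]):
            for src in earlier:
                for dst in later:
                    if src != dst:  # ignore self-transitions
                        edge_weight[(src, dst)] += 1
    return node_freq, edge_weight


# Toy data, invented purely for illustration.
histories = [
    [{"fluid and electrolyte disorders"}, {"renal failure"}, {"hypertension"}],
    [{"fluid and electrolyte disorders", "hypertension"}, {"renal failure"}],
]
node_freq, edge_weight = build_disease_network(histories)
strongest_edge, weight = edge_weight.most_common(1)[0]
print(strongest_edge, weight)
```

In this toy input the transition from "fluid and electrolyte disorders" to "renal failure" occurs in both patients and therefore receives the highest weight, which mirrors the kind of edge-level reading discussed for Table 6.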
4.1. Age and Sex Distribution of the Patients of Cohorts
Table 8 presents the age and sex distribution for both CohortT2DM&CVD and CohortT2DM. The number of elderly patients (≥60 years) is higher than that of other age groups in both cohorts. It is well known that the risk of developing T2DM and CVD in elderly patients is very high [58,59]. In Table 8, the number of female patients with both T2DM and CVD is higher than the number of female patients with only T2DM. On the other hand, the ratio of developing CVD in diabetic male patients is significantly lower. Among patients with T2DM, female patients have been shown to have a higher risk of developing CVD than male patients [60]. In addition, the age for the first group (CohortT2DM&CVD) is higher than that for the second group (CohortT2DM) for each age range. This is expected, as the first group comprises patients whose T2DM led to CVD, whereas the second group comprises patients with T2DM only. Thus, the two selected cohorts can be considered comparable for analysis in terms of age and sex. Additionally, male patients have a higher prevalence than female patients in both cohorts. This is consistent with the data released by the Australian Institute of Health and Welfare (AIHW) [53]. Thus, the research dataset is consistent with Australian government statistics.
4.2. Limitations of the Proposed Framework and Potential Future Works
This study has several limitations. The main limitation stems from the use of real-world healthcare data. The quality of coding is the main constraint of such datasets because coding criteria differ across hospitals. Policy changes over time also lead to changes in the coding system. In addition, the expertise of clinical coders, funding and time constraints can affect the quality of coding. The data used in this study come from hospital admission and discharge summaries; thus, they do not include general practitioner (GP) records and subsequent diagnoses. This may lead to underestimation of the patients' comorbid conditions. Additionally, the dataset, which was collected from a private health insurer, does not contain information about patients who are admitted to a public hospital as public patients. However, most of these limitations are common to administrative datasets. Another limitation of this study is that it does not include ischemic heart disease, acute myocardial infarction and stroke as CVDs, since this study used the cardiovascular diseases listed in the Elixhauser comorbidity index. Also, this study cannot determine the time span between the first and last hospitalization, as the selection criteria for the cohorts limit the amount of data. Thus, the study considers patients with at least one hospitalization. Despite these limitations, the proposed framework could be useful for healthcare providers to obtain a better understanding of the progression of CVD in patients with T2DM.
As future work, the features extracted from the final disease network of chronic disease comorbidities can be utilized to develop a predictive model for future chronic disease. This can be implemented by comparing the final disease network with the individual disease network of a test patient. If the features of the test patient's network closely match the features of the final disease network, the patient might be progressing along that chronic disease pathway.
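The comparison outlined above could be sketched as follows. This is only one possible way to quantify a match and is not a measure defined in this study: the function name `match_score`, the representation of a network as a mapping from directed edges to weights, and the idea of scoring by the share of reference edge weight covered by the patient's edges are all assumptions made for illustration. Any decision threshold would have to be chosen and validated separately.

```python
def match_score(patient_edges, reference_edges):
    """Share of the reference network's edge weight that is covered
    by edges also present in the test patient's network.

    patient_edges: set of (source, target) condition pairs.
    reference_edges: dict mapping (source, target) -> weight.
    Returns a value between 0.0 and 1.0.
    """
    total = sum(reference_edges.values())
    if total == 0:
        return 0.0  # empty reference network: nothing to match
    shared = sum(
        weight
        for edge, weight in reference_edges.items()
        if edge in patient_edges
    )
    return shared / total


# Toy networks, invented purely for illustration.
reference = {
    ("fluid and electrolyte disorders", "renal failure"): 6,
    ("renal failure", "hypertension"): 3,
    ("hypertension", "chronic pulmonary disease"): 1,
}
patient = {
    ("fluid and electrolyte disorders", "renal failure"),
    ("hypertension", "chronic pulmonary disease"),
}
print(match_score(patient, reference))
```

Weighting by the reference edge weights means that matching a frequent transition contributes more to the score than matching a rare one; an unweighted overlap of edge sets would be a simpler alternative.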