1. Introduction
Type 2 diabetes (T2D) has become a growing global health problem in recent decades. According to the 2021 Atlas of the International Diabetes Federation, an estimated 537 million people worldwide have diabetes, and this number is projected to rise to 783 million by 2045 [
1]. Not surprisingly, a similar trend was noted in Taiwan. According to the data bank of the National Health Insurance Company, the total number of diabetic patients increased from 1.32 million to 2.2 million within 10 years (2005 to 2014), an increase of approximately 66% [
2]. It is now the fifth leading cause of death. In 2020, expenditure on T2D exceeded 10 billion USD, which is approximately 4.66% of the annual budget of the National Health Insurance Company. The accompanying complications, such as micro- and macrovascular diseases, impose heavy burdens on individuals and their families, as well as on health providers and society [
3,
4]. It is important to note that this trend is particularly prominent among people aged <40 and ≥80 years [
5].
Among all the complications, diabetic nephropathy is the leading cause of chronic kidney disease and end-stage renal disease (ESRD) [
6], which are associated with high morbidity and mortality rates. According to the annual report of the US Renal Data System, Taiwan has the highest incidence (523 per million population) and prevalence of treated ESRD requiring renal replacement therapy [
7]. In 2019, there were 84,615 dialysis patients and the National Health Insurance spent 1.54 billion, which is approximately 8.7–9.3% of the annual budget [
8,
9]. Therefore, early detection and prevention of diabetic nephropathy are urgently required.
It is well known that the urine albumin–creatinine ratio (uACR) is a strong predictor of the subsequent decline in glomerular filtration rate in T2D, with an average decline of 0.93 mL per minute per month in approximately 35% of subjects [
10]. The underlying pathophysiology is attributed to increased glomerular pressure, which is independent of hyperfiltration or hyperglycemia [
11,
12,
13].
Traditionally, most studies have used multiple linear regression (MLR) to explore the relationships between risk factors and outcomes (complications) in medical research. Nevertheless, artificial intelligence using machine learning (ML), which enables machines to learn from past data or experiences without being explicitly programmed, has now become a new modality for data analysis that is competitive with MLR [
14,
15,
16]. Because ML can capture nonlinear relationships in data and complex interactions among multiple predictors, it has the potential to outperform conventional MLR in disease prediction [
17].
To our knowledge, only one study has attempted to predict the uACR in a T2D cohort. Thus, in the present study, we applied four different ML methods to a diabetic cohort that was followed up for four years, with the following two aims:
(1) To compare the prediction accuracy of ML and traditional MLR.
(2) To rank the importance of risk factors, such as demographic and biochemistry data.
3. Results
A total of 1147 participants were enrolled in the study (men: 539, women: 608). The demographic data are shown in
Table 3 (mean ± standard deviation). The results of the comparison between the traditional MLR and the four ML methods (i.e., RF, SGB, CART, and XGBoost) in predicting diabetic uACR in a 4-year follow-up cohort are shown in
Table 4. All four ML methods yielded lower prediction errors than the MLR method. To determine whether the four ML methods significantly outperformed the MLR method, the Wilcoxon signed-rank test was used. The Wilcoxon signed-rank test is one of the most popular distribution-free, non-parametric statistical tests for comparing the performance of two prediction models [
43].
Table 5 shows the test results for the four ML methods against the MLR method. The prediction errors of all ML methods were significantly different from those of the MLR method. Therefore, the ML methods used in this study significantly outperformed traditional MLR in predicting uACR at the end of the follow-up in terms of prediction error.
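The paired comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the error values are invented, the pairing (one error per repeated evaluation run for each model) is an assumption, and the function name `wilcoxon_signed_rank` is hypothetical. The p-value uses the normal approximation without tie or continuity correction; in practice an established routine such as `scipy.stats.wilcoxon` would typically be preferred, especially for small samples.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped; tied absolute differences share the
    average rank. Returns (W, p), where W is the smaller of the positive
    and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p

# Hypothetical prediction errors from 10 paired evaluation runs (invented values).
mlr_err = [0.52, 0.49, 0.55, 0.51, 0.53, 0.50, 0.54, 0.56, 0.48, 0.57]
ml_err = [0.45, 0.47, 0.44, 0.46, 0.43, 0.49, 0.42, 0.41, 0.51, 0.40]

w, p = wilcoxon_signed_rank(mlr_err, ml_err)
print(f"W = {w}, two-sided p = {p:.4f}")
```

With these invented values, the ML model has the lower error in nine of the ten runs, so the smaller rank sum is small and the test rejects the hypothesis of no difference at the 5% level.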
Table 6 presents the average importance ranking of each factor generated by the RF, SGB, CART, and XGBoost methods. The table shows that the different ML methods generated different relative importance rankings for each factor. The shade of blue indicates the importance of a risk factor: the darker the blue, the more important the factor. For instance, in the RF method, the three most important factors were baseline Cr, age, and baseline SBP. In the SGB method, the most important factor was baseline Cr, followed by baseline HDL-C and baseline DBP. To integrate the importance rankings from all four ML methods, the average importance ranking of each risk factor was obtained by averaging its ranking values across the methods.
Figure 3 depicts the risk factors in increasing order of the averaged ranking values. The six most important risk factors for predicting diabetic uACR in the 4-year follow-up cohort were baseline Cr, baseline SBP, baseline DBP, baseline HDL-C, baseline glycated hemoglobin, and baseline FPG.
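The rank-averaging step described above can be illustrated with a short sketch. The rank values below are invented for illustration and are not the values in Table 6; only four factors are shown, and the helper name `average_ranks` is hypothetical.

```python
# Hypothetical rank positions (1 = most important) per method; invented values,
# NOT the study's results.
ranks = {
    "RF":      {"Cr": 1, "SBP": 3, "HDL-C": 4, "Age": 2},
    "SGB":     {"Cr": 1, "SBP": 4, "HDL-C": 2, "Age": 3},
    "CART":    {"Cr": 2, "SBP": 1, "HDL-C": 3, "Age": 4},
    "XGBoost": {"Cr": 1, "SBP": 2, "HDL-C": 3, "Age": 4},
}

def average_ranks(ranks_by_method):
    """Average each factor's rank across all methods."""
    factors = list(next(iter(ranks_by_method.values())))
    n_methods = len(ranks_by_method)
    return {
        f: sum(method_ranks[f] for method_ranks in ranks_by_method.values()) / n_methods
        for f in factors
    }

avg = average_ranks(ranks)
# Sort factors by increasing average rank (smaller = more important overall).
ordered = sorted(avg, key=avg.get)
for factor in ordered:
    print(f"{factor}: {avg[factor]:.2f}")
```

Averaging ranks rather than raw importance scores puts the four methods on a common scale, since each method reports importance in its own units.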
4. Discussion
As mentioned in the Introduction, the present study has two goals. The first was to compare the accuracy between ML methods and MLR, and the second was to identify the rank of different risk factors for predicting uACR. Our study showed that all four ML methods outperformed the MLR. We also found that baseline Cr, blood pressure, HDL-C, glycated hemoglobin, and FPG were the most important factors.
Traditionally, MLR has been widely used in medical research to analyze continuous variables. However, MLR has difficulty describing nonlinear data patterns, and its effective use requires that its strong assumptions be satisfied during modeling. Unlike MLR, ML does not require strong model assumptions and can capture the delicate underlying nonlinear relationships contained in empirical data [
19]. Our present data showed that all four ML methods were superior to MLR because their MAPE and RAE values were all lower (
Table 4). Our results suggest that ML might have great potential for medical studies and applications.
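For readers unfamiliar with the two error metrics, the sketch below computes them under their commonly used definitions, which are assumed here because this section does not restate the formulas: MAPE as the mean absolute percentage error and RAE as the relative absolute error (total absolute error divided by that of a predictor that always outputs the mean of the observed values). All numbers are invented for illustration.

```python
from statistics import mean

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))

def rae(actual, predicted):
    """Relative absolute error: total absolute error relative to a mean-only predictor."""
    baseline = mean(actual)
    numerator = sum(abs(a - p) for a, p in zip(actual, predicted))
    denominator = sum(abs(a - baseline) for a in actual)
    return numerator / denominator

# Hypothetical observed uACR values and predictions from two hypothetical models.
actual = [10.0, 20.0, 40.0, 50.0]
pred_a = [12.0, 18.0, 44.0, 45.0]
pred_b = [15.0, 30.0, 30.0, 40.0]

print(f"Model A: MAPE = {mape(actual, pred_a):.2f}%, RAE = {rae(actual, pred_a):.3f}")
print(f"Model B: MAPE = {mape(actual, pred_b):.2f}%, RAE = {rae(actual, pred_b):.3f}")
```

For both metrics a lower value indicates a better fit, and an RAE below 1 means the model beats the mean-only baseline.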
Because diabetic nephropathy causes a serious burden on individuals and consumes a large portion of the government health budget, extensive studies have focused on this topic [
6,
44,
45,
46,
47]. From these previous studies, it could be concluded that sex, high blood glucose and blood pressure, smoking, dyslipidemia, decreased glomerular filtration rate, BMI, and uACR are common risk factors for a future increase in uACR. However, in the present study, our data showed that baseline Cr, DBP, SBP, HDL-C, glycated hemoglobin, and FPG were the most important risk factors. By contrast, diabetes duration, BMI, triglyceride, sex, smoking, and alcohol use were less important.
Our data suggest that the most important predictor of albuminuria is baseline Cr. This is not surprising because albuminuria occurs early in the course of diabetic nephropathy [
48]. Most previous studies have described this relationship as follows: diabetic patients with albuminuria are at a higher risk of end-stage renal and cardiovascular diseases [
49,
50]. This implies that albuminuria precedes end-stage renal disease, which differs from the findings of the present study. Our results show that an increase in serum Cr level could predict albuminuria four years later, which is the opposite of the cause–effect direction reported in most other studies. However, our finding is supported by the landmark study conducted by Gansevoort et al. [
51]. This meta-analysis clearly showed that there are independent, continuous, and negative associations between serum Cr and albuminuria. Thus, it could be postulated that each of these factors could affect the other at the same time. Further research is required to explore this area.
Diastolic and systolic blood pressures were identified as the second and third most important factors for predicting albuminuria. Their relationships are well known and have been extensively studied [
52]. Similar to the role of increased serum Cr levels, kidney disease causes an increase in BP, which could further deteriorate renal function. More specifically, the change in BP is in concordance with and even precedes albuminuria [
53]. By controlling BP, progression to end-stage renal disease can be slowed [
54].
Interestingly, HDL cholesterol level was the only lipid found to be correlated with albuminuria. However, few studies have focused on this topic. Most previous studies have demonstrated that different stages of diabetic kidney disease (DKD) have different influences on blood lipid levels [
55,
56]. Other studies measured apolipoproteins and the size of LDL-cholesterol, which all showed positive correlations with DKD, including albuminuria [
57]. To our knowledge, only two studies have reported findings relatively close to ours. The first was performed by Sacks et al. In a group of 2535 T2D patients, they evaluated the impact of HDL-C levels on uACR, with kidney disease defined as albuminuria, proteinuria, or decreased eGFR. The data showed that the odds ratio of having kidney disease was 0.86 (0.82–0.91) for every 0.2 mmol/L (approximately 1 quintile) increase in HDL-C [
58]. The second study was conducted on a cohort of 524 Chinese patients. Using multiple logistic regression, after adjusting for the available confounding factors, the authors suggested that subjects in the highest HDL-C quartile had a lower odds ratio (OR = 0.17, 95% confidence interval 0.15–0.52) of having albuminuria than those in the lowest quartile. However, a limitation of that study was its cross-sectional design. Thus, it was unable to infer the causation or directionality of this relationship [
59]. The present study addresses this limitation through its longitudinal design. The causative influence of HDL-C level can be explained by several hypotheses. First, the glomeruli and renal tubules could be injured by impaired HDL-C function, which hinders reverse cholesterol transport [
60]. Second, the antioxidative ability of the HDL-C is reduced and oxidative stress is increased, which further influences the immune-mediated diabetic nephropathy [
61]. Finally, it is well known that low HDL-C levels are associated with insulin resistance, hyperinsulinemia, and hyperglycemia. All these untoward derangements can damage endothelial cells in the glomerulus [
62,
63].
The last two factors affecting albuminuria are glycated hemoglobin and FPG levels. This finding is compatible with the results of the Diabetes Control and Complications Trial (DCCT) [
64]. The data showed positive relationships between glucose control and albuminuria. Moreover, once blood glucose levels were brought under control, albuminuria also improved [
65]. Because the DCCT enrolled patients with type 1 diabetes, the underlying pathophysiology differs from that of the T2D cohort in the present study. Regarding T2D, few studies have been conducted in this area. A comprehensive meta-analysis conducted by Lo et al. [
66] showed that for intensive control (glycated hemoglobin < 7% and FPG < 6.6 mmol/L), the relative risk of having albuminuria was 0.59 (confidence interval: 0.38–0.93). As this meta-analysis included 11 studies (29,141 subjects) with an average follow-up of 56.7 months, its conclusion is convincing. The underlying pathophysiology supporting this result is that high blood glucose concentrations could damage mesangial cells in nephrons [
67]. However, it is worth noting that both A1c and FPG were classified as important predictors. FPG is a single blood glucose measurement, whereas A1c reflects glycemia over approximately 90 days, so FPG would be expected to be less accurate than A1c. Nevertheless, our results suggest that they are ‘independent’ of each other.
Interestingly, in the present study, the duration of diabetes, body mass index, sex, smoking, and alcohol use were less important. This finding could be attributed to the nature of ML. ML methods are data-driven, non-parametric models. They can map any nonlinear function without a priori assumptions about the properties of the data and can capture subtle functional relationships among the empirical data, even when the underlying relationships are unknown or difficult to describe [
68,
69,
70]. These factors may contain richer linear pattern information and less important nonlinear information than baseline creatinine, blood pressure, albuminuria level, and age. Thus, they were ranked as less important risk factors using ML methods.
This study had some limitations. First, smoking and alcohol use need to be defined in more detail because other reports have shown that they have an important impact on the occurrence of diabetic nephropathy. Second, we did not collect information on the use of angiotensin-converting enzyme inhibitors, angiotensin receptor blockers, sodium-glucose cotransporter 2 inhibitors, and glucagon-like peptide-1 receptor agonists, all of which have beneficial effects on DKD. Third, some of the data, such as uACR and blood pressure, were collected only once. Repeated measurements were available for some participants; however, because this subgroup was smaller than the present sample, we chose to enroll subjects with only one value. Despite these drawbacks, our large sample size and the characteristics of ML (alleviating the effects of extremes) could at least partially compensate for them.