Article

Enhancing Diabetes Prediction and Prevention through Mahalanobis Distance and Machine Learning Integration

by Khongorzul Dashdondov 1, Suehyun Lee 1,* and Munkh-Uchral Erdenebat 2,*

1 Department of Computer Engineering, College of IT Convergence, Gachon University, Seongnam-si 13120, Republic of Korea
2 Department of Computer Engineering, School of Information and Communication Engineering, Chungbuk National University, 1 Chungdae-ro, Seowon-gu, Cheongju-si 28644, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7480; https://doi.org/10.3390/app14177480
Submission received: 17 July 2024 / Revised: 19 August 2024 / Accepted: 20 August 2024 / Published: 23 August 2024

Abstract:
Diabetes mellitus (DM) is a global health challenge that requires advanced strategies for its early detection and prevention. This study evaluates the South Korean population using the Korea National Health and Nutrition Examination Survey (KNHANES) dataset from 2015 to 2021, provided by the Korea Disease Control and Prevention Agency (KDCA), focusing on improving diabetes prediction models. Outlier removal was implemented using Mahalanobis distance (MAH), and feature selection was based on multicollinearity (MC) and reliability analysis (RA). The proposed Extreme Gradient Boosting (XGBoost) model demonstrated exceptional performance, achieving an accuracy of 98.04% (95% CI: 97.89~98.59), an F1-score of 98.24%, and an Area Under the Curve (AUC) of 98.71%, outperforming other state-of-the-art models. The study highlights the significance of rigorous outlier detection and feature selection in enhancing the predictive power of diabetes risk models. Notably, a significant increase in diabetes cases was observed during the COVID-19 pandemic, particularly linked to male sex, older age, rural location, hypertension, and obesity, underscoring the need for enhanced public health strategies for early intervention and targeted prevention.

1. Introduction

Diabetes mellitus (DM) is a significant and growing global health challenge, and its prevalence is increasing at an alarming rate. Recent estimates suggest that the number of adults with diabetes will continue to increase, necessitating the development of advanced strategies for its effective prevention and early detection [1]. Diabetes is associated with a range of serious complications, including cardiovascular disease, neuropathy, retinopathy, and kidney failure, making early identification of at-risk individuals critical for reducing these adverse health outcomes [2]. Recent research has underscored the importance of leveraging machine learning techniques to enhance the accuracy and reliability of diabetes-risk-prediction models. Sonia et al. (2023) demonstrated the potential of a Multilayer Neural Network no-prop algorithm for diabetes risk prediction, highlighting the need for innovative approaches in this domain [3].
Traditional diagnostic methods, including fasting plasma glucose (FPG) tests, oral glucose tolerance tests (OGTTs), and glycated hemoglobin (HbA1c) measurements, often fail to identify early-stage diabetes or pre-diabetes due to their limited sensitivity. These methods also fail to fully capture the complex interplay between various risk factors such as demographic, genetic, biochemical, and lifestyle factors [4]. Consequently, there is growing interest in developing predictive models that integrate these diverse risk factors to improve the early detection of diabetes. Some authors have introduced a hybrid diabetes prediction framework that utilizes Random Forest (RF), LightGBM, Glmnet, XGBoost [5,6,7,8], and the LS-SVM classifier to address prevalent data quality issues in medical datasets, such as outliers. The framework achieved a classification accuracy of 96.57% [9]. Furthermore, [10] presented a selective data preprocessing approach using the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data and manage outliers.
Advanced predictive models that integrate these diverse risk factors can enhance early detection. However, their performance is frequently compromised by the presence of outliers. Outliers, data points that deviate markedly from the rest of the dataset, can result from measurement errors, data entry mistakes, or inherent population variability. Their presence can distort statistical analyses and predictive models, leading to unreliable outcomes and reduced predictive accuracy [11,12]. Effective outlier detection and removal are therefore crucial for improving data quality and predictive model performance. Mahalanobis distance (MAH), a multivariate distance measure, offers a robust method for outlier detection by accounting for correlations among variables and the overall covariance structure of the data [13]. The Mahalanobis distance was used, together with feature selection, to quantify homeostatic dysregulation in patients with type 2 diabetes from NHANES 1999–2018, as described in [14].
In this study, we leveraged Mahalanobis-distance-based outlier removal to enhance diabetes prediction models using the Korea National Health and Nutrition Examination Survey (KNHANES) dataset from 2015 to 2021 provided by the Korea Disease Control and Prevention Agency (KDCA) [15,16,17]. To address the complex interplay between the risk factors, we implemented feature selection based on multicollinearity (MC) [18,19] and reliability analysis (RA) [20] to refine the predictive variables. We applied this enhanced dataset to several machine-learning classifiers, including Extreme Gradient Boosting (XGBoost), Naive Bayes (NB), K-Nearest Neighbors (KNN), Random Forest (RF), and Decision Trees (DTs) [21]. The objective was to develop a comprehensive risk assessment model that integrates demographic, genetic, biochemical, and lifestyle variables, while effectively addressing the issue of outliers. Our empirical analysis demonstrates that the proposed approach significantly improves the model accuracy and robustness across all classifiers, with XGBoost showing superior performance.
The feature importance table summarizes the contribution of each feature to the prediction accuracy of our proposed MC- and RA-based method and of the XGBoost and Random Forest models. Feature importance is a crucial aspect of machine learning models because it identifies the variables with the greatest impact on a model's predictions.
The remainder of this paper is organized as follows. Section 2 provides a detailed survey of related work. The proposed methodology is described in Section 3. Section 4 presents the experimental study, including the dataset preparation, procedures used for comparison, evaluation metrics, and comparative results. Section 5 highlights discussions about the significance of the results and suggests potential future research directions. Finally, Section 6 concludes the paper. Appendix A presents the cross-tabulation results.

2. Related Works

Extensive research has been conducted on the use of machine learning and statistical methods to predict and prevent the occurrence of DM. This study highlights the urgent need to enhance early detection and intervention strategies for this common chronic disease. This section provides an overview of important contributions and methodologies in this field. It specifically examines the integration of different techniques for preparing data, machine learning algorithms, and the role of outlier detection using the MAH.
MAH has been widely used for outlier detection in medical datasets owing to its ability to account for variable correlations. For instance, ref. [11] utilized the Mahalanobis distance to effectively detect and handle outliers, which improved the stability and performance of their hypertension prediction models. Ref. [22] stress-tests neural networks for medical imaging by applying the Mahalanobis distance to detect out-of-distribution (OOD) patterns introduced via synthetic artifacts. The authors of [23] proposed a privacy-preserving disease diagnosis scheme based on the Mahalanobis distance test, involving query users, aided cloud servers, and classification cloud servers; the system jointly computes the diagnosis while protecting sensitive medical data and outcomes. An innovative heart-sound-based system for diagnosing heart diseases was proposed using automatic adaptive feature extraction and a Mahalanobis-distance classification criterion [24]. Similarly, various studies have employed outlier detection methods to predict diabetes; removing outliers refines the dataset and enhances the performance of machine learning models.
Recent advancements in diabetes prediction models have demonstrated the effectiveness of various machine learning algorithms, including LR, DT, CatBoost, XGBoost, RF, Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), and Neural Networks (NNs). Notably, a 2024 study using the SECNN model achieved 94.12% accuracy on the NHANES dataset and 89.47% accuracy on the PIMA Indian dataset [25]. Another study found that the CatBoost Classifier was the most accurate (85%) for predicting gestational diabetes [26]. Additionally, models such as the Deep Dense Layer Neural Network (DDLNN) [27] and RF [28] have been effectively applied to the PIMA Indian dataset. Hasan et al. improved predictions through a weighted ensemble of models [29], and the RFWBP model achieved an accuracy of 95.83% [30]. Practical applications, such as Sharma et al.'s automated eHealth cloud system using an Extreme Learning Machine (ELM), have also shown promising results, with an accuracy of 90% [31].
Integrating outlier removal with machine learning models significantly advances diabetes prediction and prevention. By effectively identifying and handling outliers, these integrated approaches enhance the robustness and accuracy of predictive models. The continued exploration and development of such hybrid models hold great potential for improving diabetes management and ultimately reducing the burden of this chronic disease. The relevant studies are summarized in Table 1 and organized in chronological order.

3. Methodology

Figure 1 shows the proposed framework based on the MAH-based outlier-removal method. Our methodology is divided into three distinct, interlinked modules. First, data preparation organizes and structures the raw data. Next, data preprocessing performs cleaning and feature selection to refine the dataset. Finally, predictive analysis applies advanced statistical models to make informed predictions. Each module is crucial for the integrity and accuracy of the diabetes prediction model. A detailed description of Figure 1 is provided in the subsequent subsections, where we break down its components, methodologies, and insights for a thorough understanding.

3.1. Preparing Experimental Dataset

The KDCA collected the KNHANES data. These included health examinations for various diseases, health interviews, and nutritional surveys conducted in the Korean population. The KNHANES datasets were released for public use within one year from the end of each survey year. We generated a target value for individuals aged ≥ 19 years with diabetes. The target diabetes group included individuals with a history of hypertension, diabetes, heart disease, heart attack, or stroke. Our comprehensive study meticulously processed and analyzed the datasets derived from KNHANES. For the KNHANES (2015–2021), we initially obtained 39,768 records and 879 features. Through a rigorous screening process that excluded rows with missing values and features not pertinent to diabetes, coupled with MC and reliability analyses, we identified 24 features of critical relevance to diabetes. This careful curation resulted in a refined experimental dataset containing 27,282 records, each of which contained 24 diabetes-related features. After excluding the outliers using the MAH method, we obtained a target dataset with 23,611 records and 24 features. Figure 2 shows the procedure for creating the experimental target datasets for KNHANES (2015–2021).
The data preprocessing module consists of three main parts: feature selection based on MC analysis, feature selection based on RA, and outlier removal. This study aimed to predict diabetes using data from the KNHANES. Table 2 provides the feature descriptions, means, and standard deviations of the target dataset.

3.2. Feature Selection Based on MC Analysis

The feature selection module was executed using MC analysis. This process identifies the essential features needed to propose a simpler yet accurate model. We assessed collinearity among the selected features from health examinations, nutrition, and basic information concerning diabetes using MC in regression analysis [32]. MC is a statistical term used when two or more input attributes exhibit strong correlation [33]; when highly correlated variables were present, certain attributes were removed. We evaluated the results in terms of tolerance and the variance inflation factor (VIF). An MC problem arises when the VIF exceeds 10 and the tolerance falls below 0.10. After this analysis, we retained 36 features that were used as inputs for the subsequent analysis of the two datasets. The comprehensive results of the correlation and MC analyses of the target dataset are presented in Table 2. The highest VIF was observed for "Attempted suicide for 1 year" at 5.171, followed by "Counseling for mental issues for 1 year" (4.261), which is notable because these features diverge from the symptoms commonly associated with diabetes. Among the clinical predictors, "glycated hemoglobin" and "fasting blood sugar" had the highest VIF scores, at 3.546 and 3.173, respectively. A VIF between 1 and 5 indicates that a predictor is not highly correlated with the others and can be retained when building a diabetes prediction model.
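As an illustration, the tolerance and VIF screening can be reproduced in Python (the study itself used SPSS); the helper name, the cutoff argument, and the DataFrame `X` of candidate predictors are our assumptions:

```python
# A minimal sketch of VIF/tolerance screening for multicollinearity, assuming
# the candidate predictors are loaded into a pandas DataFrame `X`.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_report(X: pd.DataFrame, vif_cutoff: float = 10.0) -> pd.DataFrame:
    """Compute VIF and tolerance per feature and flag MC problems."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    report = pd.DataFrame({"feature": X.columns, "VIF": vifs})
    report["tolerance"] = 1.0 / report["VIF"]
    # An MC problem arises when VIF > 10, i.e., tolerance < 0.10.
    report["mc_problem"] = report["VIF"] > vif_cutoff
    return report.sort_values("VIF", ascending=False)
```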

3.3. Feature Selection Based on Reliability Analysis

Feature selection techniques were employed to reduce the dimensionality of the feature space by eliminating unnecessary features. In this study, we used two distinct feature selection methods: MC and reliability analysis, based on two different datasets. Both techniques were performed using the SPSS software package v.23.
A reliability analysis was conducted using Cronbach’s alpha to assess the reliability of all features in the two datasets. Furthermore, it is crucial to determine whether the combined features contribute to the assessment of the same construct. While there is no specified lower bound, a higher Cronbach’s alpha coefficient (approaching 1) indicates greater internal consistency of the factors [34]. An α value of 0.7 or higher is indicative of strong internal consistency in assessing the construct’s reliability.
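For reference, Cronbach's alpha can be computed directly from the item variances; this minimal sketch assumes a pandas DataFrame `items` with one column per feature and one row per case, and is not the SPSS routine used in the study:

```python
# A minimal sketch of Cronbach's alpha: alpha = k/(k-1) * (1 - sum(var_i)/var_total).
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]                               # number of items (features)
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)
```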

3.4. Outlier Detection Based on MAH

Multivariate outliers can be identified using the MAH distance, which is the distance of a data point from the centroid of the remaining cases, where the centroid is the intersection of the means of the assessed variables. Each point is recognized as a combination of variable values, and multivariate outliers lie at a large distance from the other cases. The distances were interpreted using p < 0.001 and the corresponding χ² value, with degrees of freedom equal to the number of variables.
Multivariate outliers can also be identified using leverage, discrepancy, and influence. Leverage is related to the MAH distance but is measured on a different scale, so the χ² distribution does not apply. Large leverage scores indicate cases that are farther from the centroid, although they may still lie along the same line. Discrepancy assesses the extent to which a case is inconsistent with the others. Influence is determined by leverage and discrepancy together and assesses how the coefficients change when a case is removed; cases with influence values > 1.00 are likely outliers. The formula for computing the MAH distance is [11,12]:
$$D^2 = (x - m)^{T}\, C^{-1}\, (x - m) \tag{1}$$
where $D^2$ is the square of the MAH distance; $x$ is the vector of the observation (a row in the dataset); $m$ is the vector of the mean values of the independent variables (the mean of each column); $C^{-1}$ is the inverse covariance matrix of the independent variables; $T$ denotes the transpose of the vector; and $(x - m)$ is the deviation of the observation from the mean, which is weighted by the inverse of the covariance matrix. The p-value probability is given by the following equation:
$$P = 1 - \chi^{2}(\mathrm{MAH},\, df) \tag{2}$$
In this study, outliers were removed based on the MAH distance to detect multivariate outliers. We selected 37 diabetes-related features; their descriptions are presented in Table 2. Figure 3 shows the data with and without the outliers identified by the MAH method.
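A minimal sketch of the MAH-based outlier filter implied by Equations (1) and (2) follows; the array `X`, the helper name, and the use of SciPy's chi-square CDF are our assumptions, while the p < 0.001 cutoff follows the text:

```python
# Drop cases whose squared MAH distance is improbably large under chi-square.
import numpy as np
from scipy.stats import chi2

def remove_mah_outliers(X: np.ndarray, p_cutoff: float = 0.001) -> np.ndarray:
    m = X.mean(axis=0)                                 # centroid (column means)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))     # inverse covariance matrix
    diff = X - m
    d2 = np.einsum("ij,jk,ik->i", diff, C_inv, diff)   # squared MAH distances, Eq. (1)
    p = 1.0 - chi2.cdf(d2, df=X.shape[1])              # p-values, Eq. (2)
    return X[p >= p_cutoff]                            # keep cases with p >= 0.001
```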
Default training and testing splits of 70% and 30% were applied to the target dataset, respectively. Descriptive statistics for the target variables with and without FS and MAH are presented in Table 3. In this study, the diagnostic criteria typically involved the assessment of fasting blood glucose levels. Diabetes is usually diagnosed with a fasting blood glucose level of 126 milligrams per deciliter (mg/dL) or higher, or when the hemoglobin A1c (HbA1c) level reaches 6.5% or higher. Pre-diabetes is often identified when fasting blood glucose levels fall between 100 and 125 mg/dL or when HbA1c levels range between 5.7% and 6.4%. Normal ranges were defined as fasting blood glucose levels of < 100 mg/dL and HbA1c levels of < 5.7%. These values are used to diagnose diabetes and pre-diabetes and are based on guidelines from organizations such as the American Diabetes Association (ADA).
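The three-class labeling rule described above can be sketched as follows; the column names follow the KNHANES features in Table 2 (HE_glu: fasting blood sugar in mg/dL, HE_HbA1c: glycated hemoglobin in %), but the helper itself is illustrative rather than the study's exact preprocessing code:

```python
# ADA-style cutoffs: diabetes (glucose >= 126 or HbA1c >= 6.5), pre-diabetes
# (glucose 100-125 or HbA1c 5.7-6.4), otherwise normal.
import numpy as np
import pandas as pd

def label_glycemic_status(df: pd.DataFrame) -> pd.Series:
    diabetes = (df["HE_glu"] >= 126) | (df["HE_HbA1c"] >= 6.5)
    pre_diabetes = df["HE_glu"].between(100, 125) | df["HE_HbA1c"].between(5.7, 6.4)
    # np.select checks conditions in order, so the diabetes label takes precedence.
    labels = np.select([diabetes, pre_diabetes], ["Diabetes", "Pre-diabetes"],
                       default="Normal")
    return pd.Series(labels, index=df.index, name="glycemic_status")
```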

3.5. Classifiers

To improve the performance of the predictive analysis, we focused on the training dataset. In other words, instead of directly training the classifiers, outliers were removed from the training dataset by using the MAH method. During the analysis stage, the curated experimental dataset underwent a rigorous testing phase utilizing advanced algorithms, such as Random Forest (RF), k-Nearest Neighbor (KNN), XGBoost (XGB), Decision Tree (DT), and Naïve Bayes (NB) [35,36,37,38,39]. These algorithms were selected based on their proven efficacy in pattern recognition and predictive accuracy in health-data analytics. This approach ensures a comprehensive and resilient analysis that leverages multiple facets of data to enhance decision-making.
First, we evaluated the performance of the baseline models as a comparison point for the proposed method. We trained these baseline models directly on both experimental datasets using the ML algorithms shown in Figure 1. Furthermore, on the feature-selected dataset, we trained additional models after removing outliers.
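A minimal sketch of this baseline evaluation is given below; it assumes integer-encoded class labels and library-default hyperparameters, which the paper does not specify:

```python
# Fit the five classifiers and report held-out accuracy; for the proposed
# variant, X_train/y_train come from the MAH-cleaned training split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def evaluate_baselines(X_train, y_train, X_test, y_test) -> dict:
    models = {
        "XGB": XGBClassifier(),
        "KNN": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(),
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)   # held-out accuracy
    return scores
```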

3.6. Evaluation Metrics

To evaluate the performance of our predictive models, we adopted a comprehensive set of metrics: accuracy, AUC, and F1-score [11,12]. Each metric offers a unique lens to assess the effectiveness of our models, from accuracy in predictions to the balance between precision and recall, providing a holistic view of our model’s performance, as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad \mathrm{and} \quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
Precision and recall are important metrics for evaluating the classification models. Precision measures the accuracy of positive predictions, whereas recall measures the model’s ability to identify all positive instances. These metrics are useful for imbalanced datasets or when the cost of false positives and negatives varies.
The F1-score is the harmonic mean of the precision and recall, as follows:
$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$
We studied the multiclass case, so the F1-score was averaged over the class labels, with the weighting controlled by the averaging parameter, as shown in Equation (4).
Accuracy is a measure of the degree of closeness between calculated and actual values. Accuracy is the sum of the true positive and true negative fractions among all test data, as shown in Equation (5).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{5}$$
The AUC (Area Under the Curve) is a crucial metric for multiclass models, as shown in Equation (6). It is computed from the receiver operating characteristic (ROC) curve. A higher AUC indicates better model performance, i.e., an easier separation of positive and negative instances. This is beneficial for imbalanced datasets or when the costs of false positives and false negatives differ.
$$\mathrm{AUC} = \sum_{i=1}^{n} \frac{(FPR_{i} + FPR_{i+1}) \cdot (TPR_{i+1} - TPR_{i})}{2} \tag{6}$$
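These metrics map directly onto scikit-learn calls; the sketch below assumes a held-out test split and an (n_samples × n_classes) probability array `y_prob`, and the weighted one-vs-rest averaging is our assumption for the multiclass AUC:

```python
# Accuracy, weighted F1, and multiclass AUC corresponding to Equations (3)-(6).
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_prob) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),            # Eq. (5)
        "f1": f1_score(y_true, y_pred, average="weighted"),    # Eq. (4)
        # One-vs-rest ROC curves per class, averaged with class weights.
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr",
                             average="weighted"),
    }
```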

4. Experimental Study

During data transformation, the structure and quality of the data can change depending on the selected columns, the presence of null values, and the identification of outliers. Therefore, we assessed the reliability of the data using Cronbach's alpha coefficient. The comparative results are presented in Table 4. Comparing the integrated dataset with the non-integrated data, integration improved the reliability of the data by 0.042.

4.1. Classifier Results

Data preprocessing and predictive analysis modules were implemented in Python using the Sklearn library [39]. Data preprocessing was performed using SPSS version 23.0. Table 5 and Table 6 present the comparative results based on the following feature-importance methods: baseline (with all 165 features), MC (with 24 selected features), XGB (with 66 features), and RF (with 22 features). This differentiation of feature sets enables a comparison of model performance across varying levels of feature complexity and relevance.
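How the XGB (66) and RF (22) subsets were sized is not spelled out in the text; one plausible reading, sketched below, is to rank features by the fitted models' importances and keep the top k:

```python
# Illustrative helper (our assumption, not the paper's code): select the k
# most important features from any fitted model exposing feature_importances_.
import numpy as np

def top_k_features(fitted_model, feature_names, k: int) -> list:
    order = np.argsort(fitted_model.feature_importances_)[::-1]  # descending
    return [feature_names[i] for i in order[:k]]
```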
Table 5 provides a comprehensive performance comparison between the baseline models and the proposed method. The FS-based baseline models were XGB, KNN, DT, NB, and RF. The findings indicate that our proposed method, which incorporates MAH-based outlier removal, outperforms the baseline models across all metrics. The accuracy, AUC, and F1-score results are presented in Table 5. The XGB model exhibited the highest accuracy of 96.78%, which improved to 97.98%, and the RF model achieved 95.42%, which improved to 97.62%, when MAH-based outlier removal was applied to the baseline model. XGB and RF performed well across multiple metrics, indicating that they are strong candidates for this task. NB had lower accuracy and F1-scores than the other algorithms, suggesting that it may not be the best choice for this specific task.
Figure 4 presents the multiclass ROC curves for each model and compares them with the average ROC curves obtained from the target dataset. Our analysis suggests that the XGB model demonstrates superior performance in predicting outcomes in the target dataset.
The accuracy, AUC, and F1-score results based on k = 10 cross-validation are presented in Table 6. The XGB model exhibited the highest accuracy of 97.02%, which improved to 98.04% when MAH-based outlier removal was applied to the baseline model. XGB and RF perform well across multiple metrics, indicating that they are strong candidates for this task. NB has lower accuracy and F1-scores than the other algorithms, suggesting that it may not be the best choice for this specific task.
The ROC curves for each comparative method on the integrated dataset with k = 10-fold cross-validation are shown in Figure 5. When the training set is divided into several subsets, it is feasible to compute the mean area under the curve and view the variance of the curve for 10-fold cross-validation.
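A minimal sketch of this cross-validated evaluation, assuming a feature matrix `X` and integer-encoded labels `y` prepared as in Figure 2:

```python
# Weighted one-vs-rest AUC per fold; mean and std summarize the 10 folds.
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def mean_cv_auc(X, y):
    scores = cross_val_score(XGBClassifier(), X, y, cv=10,
                             scoring="roc_auc_ovr_weighted")
    return scores.mean(), scores.std()
```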
The diabetes risk prediction accuracy of the dataset was compared across machine-learning algorithms, as shown in Table 7. The MAH technique enhances the performance of the individual algorithms, and MAH_XGB emerged as the most effective, reaching an accuracy of 98.04% (95% confidence interval (95% CI): 97.89~98.59). Our results indicate that XGB, followed closely by RF and DT, significantly outperforms algorithms such as KNN and NB, which exhibit minimal or even slightly negative changes in accuracy when using these features.

4.2. Hyperparameter Results

To improve the outcomes, we adjusted some XGBoost hyperparameters on a specific dataset by utilizing the grid search framework in scikit-learn [38,39]. Figure 6a displays a graph of the F1-weighted performance for each learning rate with variation in the number of trees. In addition, the data indicated that an optimal outcome was achieved when using a learning rate of 0.1 and 100 trees. The anticipated overall pattern remained consistent, with the performance improving as the number of trees increased.
Figure 6b illustrates the correlation between the number of trees in the model and the individual depth of each tree. We created a grid consisting of six distinct values for the number of estimators (50, 100, 150, 200, 300, and 400) and six distinct values for the maximum tree depth (2, 4, 6, 8, 10, and 12). Each combination was evaluated using 10-fold cross-validation, resulting in the training and evaluation of 360 models (6 × 6 × 10). The optimal outcome was obtained by utilizing 150 estimators and setting the maximum depth to 6, resulting in a superior F1-weighted score.
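This search can be expressed with scikit-learn's GridSearchCV; the grid, the 10 folds, and the F1-weighted scoring follow the text, and everything else is a library default:

```python
# Grid search over tree count and depth; 6 x 6 combinations x 10 folds = 360 fits.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [50, 100, 150, 200, 300, 400],
    "max_depth": [2, 4, 6, 8, 10, 12],
}
search = GridSearchCV(
    XGBClassifier(learning_rate=0.1),   # best learning rate from Figure 6a
    param_grid,
    scoring="f1_weighted",
    cv=10,
)
# search.fit(X_train, y_train)
# search.best_params_  # reported optimum: {'max_depth': 6, 'n_estimators': 150}
```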
Compared with other advanced ML-based techniques for diabetes detection, our MAH_XGB method demonstrated superior performance. As detailed in Table 8, this method outperforms various state-of-the-art ML-based algorithms in terms of the accuracy and AUC metrics. For example, MAH_XGB achieved a remarkable accuracy of 98.04% and an AUC of 98.71% on the target dataset, exceeding those of other models such as CatBoost CBC [26] and DDLNN [27]. This indicated the exceptional efficacy of MAH_XGB in predicting diabetes risk by setting a new benchmark in the field.

5. Discussions

This study significantly advances the field of diabetes risk prediction and prevention by integrating advanced machine-learning techniques with robust outlier detection methods. Specifically, we employed Mahalanobis-distance-based outlier removal to enhance diabetes prediction models using data from the Korea National Health and Nutrition Examination Survey (KNHANES) dataset (2015–2021) provided by the Korea Disease Control and Prevention Agency (KDCA) [15,16,17]. We implemented feature selection based on multicollinearity (MC) and reliability analysis (RA) to address the complex interplay between various risk factors effectively. These refined datasets were then applied to several machine-learning classifiers, including Extreme Gradient Boosting (XGBoost), Naive Bayes (NB), K-Nearest Neighbors (KNNs), Random Forest (RF), and Decision Trees (DTs). The primary objective was to develop a comprehensive risk assessment model that integrates demographic, genetic, biochemical, and lifestyle variables, while effectively managing outliers. The empirical results demonstrate that the proposed approach significantly enhances model accuracy and robustness across all classifiers, with XGBoost demonstrating superior performance.
The application of Mahalanobis distance (MAH) for outlier removal yielded substantial improvements in the accuracy and reliability of the predictive models. Outliers in clinical datasets often undermine the model performance by distorting the statistical relationships among variables. By effectively identifying and removing these outliers, our study enhanced the precision and dependability of diabetes risk predictions. The MAH-based outlier removal method was particularly effective in refining the dataset, leading to a superior model performance. Notably, the XGBoost model achieved remarkable results, with an accuracy of 98.04%, an F1-score of 98.24%, and an AUC of 98.71%. These metrics underscore the superiority of the MAH-based approach over other state-of-the-art models, highlighting its potential for broader applications in predictive healthcare modeling.
The study also involved a comparative analysis of the importance of the features across different models. By comparing the importance scores between XGBoost, Random Forest, and our MC- and RA-based models, we observed similarities and differences in how these models prioritize various features. Some features, such as fasting blood glucose levels, have been consistently identified as critical predictors of diabetes, indicating their strong influence on the risk of diabetes. In contrast, the other features showed varying degrees of importance, depending on the model, reflecting the distinct data-processing mechanisms of each algorithm. These insights into feature importance have practical implications for healthcare providers, who can prioritize the monitoring and management of the most influential features identified by the models.
An analysis of the KNHANES data revealed a significant increase in diabetes cases during the COVID-19 pandemic, particularly among males, older individuals, rural residents, and those with hypertension or obesity. This increase in cases is likely attributable to pandemic-induced lifestyle changes such as reduced physical activity, poor diet, and increased psychological stress. These findings underscore the need for targeted public health strategies to mitigate the impact of global health crises on chronic diseases such as diabetes [40]. Traditional diagnostic methods for diabetes, including fasting plasma glucose (FPG) tests, oral glucose tolerance tests (OGTTs), and glycated hemoglobin (HbA1c) measurements, often fall short of early detection due to their limited sensitivity. In contrast, our integrated approach, which combines machine learning models with MAH-based outlier removal, offers a more comprehensive and accurate assessment of diabetes risk by incorporating demographic, genetic, biochemical, and lifestyle factors.

Limitations and Future Work

Although this study presented a robust methodology that yielded promising results, certain limitations must be acknowledged. The research relied primarily on the KNHANES dataset, which may not fully represent the global population. Future research should validate the proposed model across diverse datasets and populations to ensure its generalizability, and exploring other advanced feature selection techniques and machine learning algorithms could further enhance model performance. Overall, this work represents a significant advancement in diabetes prediction and prevention by demonstrating the efficacy of combining the Mahalanobis distance for outlier detection with machine learning techniques. The proposed approach improved the accuracy and reliability of predictive models and provided valuable insights into the factors influencing diabetes risk during the COVID-19 pandemic. These findings pave the way for the development of more effective public health strategies and predictive models for chronic disease management, ultimately contributing to better health outcomes and a reduced disease burden.
The practical implications of this study are as follows: Healthcare providers can leverage enhanced predictive models to more accurately identify at-risk individuals and implement early intervention strategies. Public health authorities can use these insights to design targeted prevention programs, particularly when considering the increased risk of diabetes observed during the pandemic. Moreover, this methodology can be adapted and applied to other chronic diseases, offering a versatile tool for improving health outcomes through advanced data analytics and machine learning. Future research should refine the model by integrating additional data sources, such as genetic information and real-time health monitoring data, to improve the prediction accuracy. Exploring deep learning techniques and ensemble methods can provide further enhancements. The development of user-friendly tools and applications incorporating these advanced predictive models could facilitate their adoption in clinical practice and public health policies. By addressing these areas, future studies can build on the foundation of this research, advancing the field of predictive modeling in healthcare and contributing to the overall goal of reducing the global burden of diabetes and other chronic diseases.

6. Conclusions

This study advanced diabetes prediction by employing sophisticated feature selection, including multicollinearity and reliability analyses, along with the Mahalanobis distance for outlier detection, significantly enhancing prediction accuracy. Our findings revealed critical links between diabetes and comorbid conditions such as hypertension and obesity, underscored by the effectiveness of the machine learning techniques. Although tested on limited datasets and classifiers, our model outperformed traditional approaches, demonstrating the potential of advanced data processing and machine learning for disease prediction. This approach not only refines diabetes risk assessment during the pandemic but also serves as a foundational strategy for predictive modeling in healthcare, including in future pandemics. The XGB model exhibited the highest accuracy of 96.78%, which improved to 98.04% when MAH-based outlier removal was applied to the baseline model. The analysis of KNHANES data from 2015 to 2021 revealed a significant increase in diabetes cases during the COVID-19 pandemic, with a higher prevalence particularly noted among males, older age groups, rural residents, and individuals with hypertension or obesity.

Author Contributions

Conceptualization, K.D.; data curation, M.-U.E.; formal analysis, K.D.; funding acquisition, S.L.; methodology, K.D. and S.L.; project administration, S.L.; software, K.D. and M.-U.E.; supervision, S.L. and M.-U.E.; validation, K.D. and M.-U.E.; visualization, M.-U.E.; writing—original draft, K.D.; writing—review and editing, K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI22C0452).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

A cross-tabulation of the categorical features was performed on the experimental datasets. Table A1 presents the diabetes cases from the KNHANES dataset (2015–2021) across various demographics and health conditions. Rural areas consistently reported more diabetes cases than urban areas, with 1217 cases (10.7%) in rural regions versus 1039 cases (9.1%) in urban areas. Males showed a higher prevalence of diabetes, with 1122 cases (12.2%), compared to 1134 cases (8.4%) in females. Age significantly influenced the diabetes rates, particularly among individuals aged ≥ 50 years, who accounted for the majority of cases. The results highlighted a strong correlation between diabetes and both hypertension and obesity, with those in the highest stages of these conditions showing the greatest prevalence. Overall, the analysis underscores that older age, rural residency, male sex, hypertension, and higher obesity stages are key factors associated with increased diabetes incidence in South Korea during this period.
Table A1. Cross-tabulation results for KNHANES (2015–2021).
Values are diabetes cases, n (%). 2015–2019: before the COVID-19 pandemic; 2020–2021: during the pandemic.

| Category | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Total |
|---|---|---|---|---|---|---|---|---|
| Residential area | | | | | | | | |
| Rural | 76 (0.7) | 164 (1.4) | 185 (1.6) | 57 (0.5) | 247 (2.2) | 244 (2.1) | 244 (2.1) | 1217 (10.7) |
| Urban | 158 (1.4) | 144 (1.3) | 154 (1.4) | 51 (0.4) | 171 (1.5) | 181 (1.6) | 180 (1.6) | 1039 (9.1) |
| Sex | | | | | | | | |
| Male | 123 (1.3) | 159 (1.7) | 184 (2.0) | 0 (0) | 222 (2.4) | 218 (2.4) | 216 (2.3) | 1122 (12.2) |
| Female | 111 (0.8) | 149 (1.1) | 155 (1.1) | 108 (0.8) | 196 (1.4) | 207 (1.5) | 208 (1.5) | 1134 (8.4) |
| Age | | | | | | | | |
| 19–29 years old | 0 (0.0) | 1 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 2 (0.1) | 1 (0.0) | 4 (0.2) |
| 30–39 years old | 2 (0.1) | 6 (0.2) | 7 (0.2) | 3 (0.1) | 11 (0.3) | 10 (0.3) | 3 (0.1) | 42 (1.3) |
| 40–49 years old | 19 (0.4) | 20 (0.5) | 28 (0.7) | 8 (0.2) | 37 (0.9) | 34 (0.8) | 30 (0.7) | 176 (4.1) |
| 50–59 years old | 43 (0.9) | 60 (1.3) | 79 (1.7) | 24 (0.5) | 71 (1.5) | 107 (2.3) | 63 (1.4) | 447 (9.7) |
| 60–69 years old | 96 (2.2) | 100 (2.3) | 106 (2.4) | 32 (0.7) | 127 (2.9) | 137 (3.2) | 154 (3.6) | 752 (17.4) |
| 70–79 years old | 67 (2.3) | 109 (3.8) | 97 (3.4) | 33 (1.2) | 142 (5.0) | 107 (3.7) | 138 (4.8) | 693 (24.2) |
| Over 80 years old | 7 (1.0) | 12 (1.7) | 22 (3.0) | 8 (1.1) | 30 (4.1) | 28 (3.9) | 35 (4.8) | 142 (19.5) |
| Hypertension | | | | | | | | |
| Normal | 49 (0.5) | 45 (0.4) | 60 (0.6) | 17 (0.2) | 64 (0.6) | 85 (0.8) | 77 (0.7) | 397 (3.8) |
| Pre-hypertension | 49 (0.9) | 69 (1.2) | 57 (1.0) | 15 (0.3) | 91 (1.6) | 85 (1.5) | 79 (1.4) | 445 (7.9) |
| Hypertension | 136 (2.0) | 194 (2.9) | 222 (3.3) | 76 (1.1) | 263 (3.9) | 255 (3.8) | 268 (3.9) | 1414 (20.8) |
| Obesity | | | | | | | | |
| Underweight | 1 (0.1) | 2 (0.2) | 4 (0.4) | 0 (0.0) | 3 (0.3) | 2 (0.2) | 7 (0.8) | 19 (2.1) |
| Normal | 127 (1.2) | 152 (1.5) | 88 (0.8) | 32 (0.3) | 92 (0.9) | 87 (0.8) | 112 (1.1) | 690 (6.6) |
| Pre-obesity stage | 106 (1.8) | 154 (2.6) | 80 (1.4) | 30 (0.5) | 102 (1.7) | 111 (1.9) | 93 (1.6) | 676 (11.4) |
| 1st stage obesity | - | - | 144 (3.1) | 38 (0.8) | 182 (3.9) | 171 (3.6) | 181 (3.8) | 716 (15.2) |
| 2nd stage obesity | - | - | 22 (3.2) | 7 (1.0) | 35 (5.1) | 52 (7.6) | 21 (3.1) | 137 (20.0) |
| 3rd stage obesity | - | - | 1 (1.1) | 1 (1.1) | 4 (4.3) | 2 (2.2) | 10 (10.8) | 18 (19.4) |

References

  1. Saeedi, P.; Petersohn, I.; Salpea, P.; Malanda, B.; Karuranga, S.; Unwin, N.; Colagiuri, S.; Guariguata, L.; Motala, A.A.; Ogurtsova, K.; et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Res. Clin. Pract. 2019, 157, 107843. [Google Scholar] [CrossRef]
  2. Zheng, Y.; Ley, S.H.; Hu, F.B. Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nat. Rev. Endocrinol. 2018, 14, 88–98. [Google Scholar] [CrossRef] [PubMed]
  3. Sonia, J.J.; Jayachandran, P.; Md, A.Q.; Mohan, S.; Sivaraman, A.K.; Tee, K.F. Machine-learning-based diabetes mellitus risk prediction using multilayer neural network no-prop algorithm. Diagnostics 2023, 13, 723. [Google Scholar] [PubMed]
  4. Care, D. Classification and diagnosis of diabetes. Diabetes Care 2017, 40, S11–S24. [Google Scholar]
  5. Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981. [Google Scholar] [CrossRef]
  6. Adua, E.; Kolog, E.A.; Afrifa-Yamoah, E.; Amankwah, B.; Obirikorang, C.; Anto, E.O.; Acheampong, E.; Wang, W.; Tetteh, A.Y. Predictive model and feature importance for early detection of type II diabetes mellitus. Transl. Med. Commun. 2021, 6, 17. [Google Scholar]
  7. Sadeghi, S.; Khalili, D.; Ramezankhani, A.; Mansournia, M.A.; Parsaeian, M. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med. Inform. Decis. Mak. 2022, 22, 36. [Google Scholar] [CrossRef]
  8. Dritsas, E.; Trigka, M. Data-driven machine-learning methods for diabetes risk prediction. Sensors 2022, 22, 5304. [Google Scholar] [CrossRef]
  9. Srivastava, A.K.; Kumar, Y.; Singh, P.K. Hybrid diabetes disease prediction framework based on data imputation and outlier detection techniques. Expert Syst. 2022, 39, e12785. [Google Scholar] [CrossRef]
  10. Nnamoko, N.; Korkontzelos, I. Efficient treatment of outliers and class imbalance for diabetes prediction. Artif. Intell. Med. 2020, 104, 101815. [Google Scholar]
  11. Dashdondov, K.; Kim, M.H. Mahalanobis distance based multivariate outlier detection to improve performance of hypertension prediction. Neural Process. Lett. 2023, 55, 265–277. [Google Scholar] [CrossRef]
  12. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013. [Google Scholar]
  13. Flores-Guerrero, J.L.; Grzegorczyk, M.A.; Connelly, M.A.; Garcia, E.; Navis, G.; Dullaart, R.P.; Bakker, S.J. Mahalanobis distance, a novel statistical proxy of homeostasis loss is longitudinally associated with risk of type 2 diabetes. eBioMedicine 2021, 71, 103550. [Google Scholar] [CrossRef] [PubMed]
  14. Li, W.; Lai, Z.; Tang, N.; Tang, F.; Huang, G.; Lu, P.; Jiang, L.; Lei, D.; Xu, F. Diabetic retinopathy related homeostatic dysregulation and its association with mortality among diabetes patients: A cohort study from NHANES. Diabetes Res. Clin. Pract. 2024, 207, 111081. [Google Scholar] [CrossRef] [PubMed]
  15. Korea Centers for Disease Control & Prevention. Available online: http://knhanes.cdc.go.kr (accessed on 4 February 2014).
  16. Kwan, B.S.; Cho, I.A.; Park, J.E. Effect of breastfeeding and its duration on impaired fasting glucose and diabetes in perimenopausal and postmenopausal women: Korea National Health and Nutrition Examination Survey (KNHANES) 2010–2019. Medicines 2021, 8, 71. [Google Scholar] [CrossRef]
  17. Bae, J.H.; Han, K.D.; Ko, S.H.; Yang, Y.S.; Choi, J.H.; Choi, K.M.; Kwon, H.-S.; Won, K.C.; on Behalf of the Committee of Media-Public Relation of the Korean Diabetes Association. Diabetes fact sheet in Korea 2021. Diabetes Metab. J. 2022, 46, 417–426. [Google Scholar] [CrossRef]
  18. Dashdondov, K.; Kim, M.H.; Song, M.H. Deep autoencoders and multivariate analysis for enhanced hypertension detection during the COVID-19 era. Electron. Res. Arch. 2024, 32, 3202–3229. [Google Scholar] [CrossRef]
  19. Montesinos, L.; Osval, A.; Crossa, J. Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  20. Taber, K.S. The use of Cronbach’s alpha when developing and reporting research instruments in science education. Res. Sci. Educ. 2018, 48, 1273–1296. [Google Scholar] [CrossRef]
  21. Khongorzul, D.; Mi-Hye, K.; Kyuri, J. NDAMA: A Novel Deep Autoencoder and Multivariate Analysis Approach for IoT-Based Methane Gas Leakage Detection. IEEE Access 2023, 11, 140740–140751. [Google Scholar]
  22. Anthony, H.; Kamnitsas, K. On the use of Mahalanobis distance for out-of-distribution detection with neural networks for medical imaging. In Proceedings of the International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, Vancouver, BC, Canada, 12 October 2023; Springer Nature: Cham, Switzerland, 2023; pp. 136–146. [Google Scholar]
  23. Zhang, M.; Zhang, Y.; Shen, G. PPDDS: A privacy-preserving disease diagnosis scheme based on the secure Mahalanobis distance evaluation model. IEEE Syst. J. 2021, 16, 4552–4562. [Google Scholar] [CrossRef]
  24. Sun, S. Segmentation-based adaptive feature extraction combined with mahalanobis distance classification criterion for heart sound diagnostic system. IEEE Sens. J. 2021, 21, 11009–11022. [Google Scholar] [CrossRef]
  25. Zhao, J.; Gao, H.; Yang, C.; An, T.; Kuang, Z.; Shi, L. Attention-Oriented CNN Method for Type 2 Diabetes Prediction. Appl. Sci. 2024, 14, 3989. [Google Scholar] [CrossRef]
  26. Belsti, Y.; Moran, L.; Du, L.; Mousa, A.; De Silva, K.; Enticott, J.; Teede, H. Comparison of machine learning and conventional logistic regression-based prediction models for gestational diabetes in an ethnically diverse population the Monash GDM Machine learning model. Int. J. Med. Inform. 2023, 179, 105228. [Google Scholar] [CrossRef]
  27. Gupta, N.; Kaushik, B.; Imam Rahmani, M.K.; Lashari, S.A. Performance Evaluation of Deep Dense Layer Neural Network for Diabetes Prediction. Comput. Mater. Contin. 2023, 76, 347–366. [Google Scholar] [CrossRef]
  28. Al Sadi, K.; Balachandran, W. Prediction model of Type 2 diabetes mellitus for omanpre-diabetess patients using artificial neural network and six machine learning classifiers. Appl. Sci. 2023, 13, 2344. [Google Scholar] [CrossRef]
  29. Hasan, M.K.; Alam, M.A.; Das, D.; Hossain, E.; Hasan, M. Diabetes prediction using ensembling of different machine learning classifiers. IEEE Access 2020, 8, 76516–76531. [Google Scholar] [CrossRef]
  30. Ali, M.S.; Islam, M.K.; Das, A.A.; Duranta, D.U.; Haque, M.F.; Rahman, M.H. A novel approach for best parameters selection and feature engineering to analyze and detect diabetes: Machine learning insights. BioMed Res. Int. 2023, 1, 8583210. [Google Scholar]
  31. Sharma, S.K.; Zamani, A.T.; Abdelsalam, A.; Muduli, D.; Alabrah, A.A.; Parveen, N.; Alanazi, S.M. A Diabetes Monitoring System and Health-Medical Service Composition Model in Cloud Environment. IEEE Access 2023, 11, 32804–32819. [Google Scholar] [CrossRef]
  32. Aminizadeh, S.; Heidari, A.; Toumaj, S.; Darbandi, M.; Navimipour, N.J.; Rezaei, M.; Talebi, S.; Azad, P.; Unal, M. The applications of machine learning techniques in medical data processing based on distributed computing and the Internet of Things. Comput. Methods Programs Biomed. 2023, 241, 107745. [Google Scholar] [CrossRef]
  33. Xu, J.; Chen, T.; Fang, X.; Xia, L.; Pan, X. Prediction model of pressure injury occurrence in diabetic patients during ICU hospitalization—XGBoost machine learning model can be interpreted based on SHAP. Intensiv. Crit. Care Nurs. 2024, 83, 103715. [Google Scholar] [CrossRef]
  34. Uddin, M.J.; Ahamad, M.M.; Hoque, M.N.; Walid, M.A.; Aktar, S.; Alotaibi, N.; Alyami, S.A.; Kabir, M.A.; Moni, M.A. A comparison of machine learning techniques for the detection of type-2 diabetes mellitus: Experiences from Bangladesh. Information 2023, 14, 376. [Google Scholar] [CrossRef]
  35. Pina, A.F.; Meneses, M.J.; Sousa-Lima, I.; Henriques, R.; Raposo, J.F.; Macedo, M.P. Big data and machine learning to tackle diabetes management. Eur. J. Clin. Investig. 2023, 53, e13890. [Google Scholar] [CrossRef] [PubMed]
  36. Wee, B.F.; Sivakumar, S.; Lim, K.H.; Wong, W.K.; Juwono, F.H. Diabetes detection based on machine learning and deep learning approaches. Multimed. Tools Appl. 2024, 83, 24153–24185. [Google Scholar] [CrossRef]
  37. Dashdondov, K.; Song, M.H. Factorial Analysis for Gas Leakage Risk Predictions from a Vehicle-Based Methane Survey. Appl. Sci. 2021, 12, 115. [Google Scholar] [CrossRef]
  38. Brownlee, J. Machine Learning Algorithms from Scratch with Python; Machine Learning Mastery, 2016. Available online: https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/ (accessed on 1 August 2024).
  39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  40. WHO. Diabetes. Available online: https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 1 August 2024).
Figure 1. Architecture of the proposed method for diabetes prediction. Clean dataset: removal of missing values and features unrelated to diabetes [31]. MCD: Minimum Covariance Determinant method.
Figure 2. Workflow for data preprocessing to predict diabetes risk using the KNHANES dataset from 2015 to 2021.
Figure 3. Plots of diabetes features: (a) with outliers and (b) without outliers by the MAH distance for the target dataset.
Figure 4. ROC curves for algorithm performance on the target dataset based on Table 5: (a) baseline models tested using the dataset in Figure 2; (b) models enhanced by the proposed MAH-based outlier removal method.
Figure 5. ROC curves for algorithm performance on the target dataset with k = 10-fold cross-validation based on Table 6: (a) baseline models tested using the dataset in Figure 2; (b) models enhanced by the proposed MAH-based outlier removal method using the target dataset in Figure 2.
Figure 6. Influence of learning rate, number of estimators, and depth of trees on performance of the XGBoost model. (a) Learning rate and n_estimators, (b) depth and n_estimators.
Table 1. Summary of existing studies integrating the MAH and ML methods for predicting diabetes risk 1.
| Authors | Year | Algorithms | Comments |
|---|---|---|---|
| [25] | 2024 | SECNN | The SECNN model has an accuracy of 94.12% on the NHANES dataset and 89.47% on the PIMA Indian dataset. |
| [26] | 2023 | CBC | CatBoost classifier gave the best performance, with an accuracy of 85% and an AUC of 93%. |
| [27] | 2023 | DDLNN | The best model, with an accuracy of 84.42%. |
| [31] | 2023 | PCA + ELM | PCA gave the best performance, with an accuracy of 90.57%. |
| [29] | 2020 | AB + XB | The best model reaches an AUC of 95%. |
| [30] | 2023 | RFWBP | The best performance, with an accuracy of 95.83%. |
1 MAH: Mahalanobis distance. ML: machine learning. SECNN: Attention-Oriented Convolutional Neural Network. PIMA Indian dataset: Pima Indians Diabetes dataset. NHANES: National Health and Nutrition Examination Survey. CBC: CatBoost Classifier. DDLNN: Deep Dense Layer Neural Network. PCA: Principal Component Analysis. ELM: Extreme Learning Machine. RFWBP: Random forest algorithm with the best parameters. AB + XB: The ensembling of two boosting (adaptive (AB) and gradient (XB))-type classifiers.
Table 2. Feature descriptions, mean and standard deviation, correlation, and MC analyses for the KNHANES (2015–2021) target dataset.
| Feature | Description | Mean | Std. Dev | p-Value | Tolerance | VIF 1 |
|---|---|---|---|---|---|---|
| BP6_31 | Attempted suicide for 1 year | 2.0 | 0.04 | 0.189 | 0.193 | 5.171 |
| BP7 | Counseling for mental issues for 1 year | 1.98 | 0.143 | 0.149 | 0.235 | 4.261 |
| HE_HbA1c | Glycated hemoglobin | 5.69 | 0.623 | <0.001 | 0.282 | 3.546 |
| HE_glu | Fasting blood sugar | 99.12 | 16.855 | <0.001 | 0.315 | 3.173 |
| HE_sbp | End systolic blood pressure | 118.67 | 16.223 | 0.290 | 0.350 | 2.857 |
| HE_alt | Serum glutamic pyruvic transaminase | 21.18 | 13.468 | <0.001 | 0.380 | 2.632 |
| HE_HP | Hypertension | 1.86 | 0.859 | <0.001 | 0.392 | 2.554 |
| HE_ast | Serum glutamic oxaloacetic transaminase | 22.8 | 8.248 | 0.670 | 0.428 | 2.335 |
| Age | Age | 4.73 | 1.686 | <0.001 | 0.441 | 2.270 |
| Sex | Sex | 1.6 | 0.489 | <0.001 | 0.442 | 2.260 |
| HE_obe | Obesity (over 19 years old) | 2.75 | 0.964 | <0.001 | 0.464 | 2.155 |
| HE_dbp | End diastolic blood pressure | 75.03 | 9.623 | <0.001 | 0.484 | 2.065 |
| BO1 | Subjective body shape recognition | 3.37 | 0.906 | <0.001 | 0.487 | 2.053 |
| HE_HCT | Hematocrit | 42.33 | 4.201 | <0.001 | 0.520 | 1.924 |
| BS13 | Exposure to secondhand smoke indoors in public | 1.85 | 0.36 | 0.101 | 0.526 | 1.900 |
| HE_crea | Blood creatinine | 0.8 | 0.183 | 0.686 | 0.592 | 1.688 |
| HE_BUN | Blood urea nitrogen | 14.6 | 4.285 | <0.001 | 0.640 | 1.563 |
| BS9_2 | Exposure to secondhand smoke indoors at home | 2.8 | 0.498 | 0.294 | 0.712 | 1.405 |
| BP6_2 | Planned suicide for a year | 1.99 | 0.101 | <0.004 | 0.759 | 1.317 |
| BP1 | Perceived level of stress on a daily basis | 2.88 | 0.72 | 0.227 | 0.789 | 1.268 |
| BO2_1 | Weight control for 1 year | 2.28 | 1.3 | 0.213 | 0.805 | 1.242 |
| Year | Survey year | 2018.12 | 2.032 | <0.001 | 0.827 | 1.209 |
| HE_HCHOL | Hypercholesterolemia | 0.24 | 0.428 | <0.001 | 0.851 | 1.175 |
| HE_WBC | Leukocyte | 6.09 | 1.628 | <0.001 | 0.857 | 1.166 |
| HE_HTG | Hypertriglyceridemia | 0.12 | 0.328 | <0.005 | 0.888 | 1.126 |
| Graduate | Education level—graduation status | 1.56 | 1.363 | 0.319 | 0.896 | 1.116 |
| DJ8_pt | Allergic rhinitis treatment | 6.91 | 2.699 | 0.362 | 0.907 | 1.102 |
| DJ6_pt | Sinusitis treatment | 7.54 | 1.847 | 0.261 | 0.948 | 1.055 |
| BD1_11 | Frequency of drinking per year | 3.69 | 2.122 | <0.019 | 0.951 | 1.051 |
| DL1_pt | Atopic dermatitis treatment | 7.88 | 0.968 | 0.176 | 0.958 | 1.044 |
| HE_fh | Family history of chronic dis. diagnosed | 0.67 | 0.793 | 0.060 | 0.972 | 1.028 |
| DH4_pt | Otitis media treatment | 7.63 | 1.676 | 0.245 | 0.975 | 1.026 |
| Region | Region | 7.39 | 4.918 | 0.520 | 0.976 | 1.024 |
| HE_hepaB | Hepatitis B surface antigen positive | 0.02 | 0.129 | <0.043 | 0.991 | 1.009 |
1 Dependent variable: diabetes. Predictors: features. VIF: variance inflation factor. Statistical significance (p < 0.05).
Table 3. Descriptive statistics of target variables for target datasets with and without FS and MAH 1.
| Class | FS: Total | FS: Train 70% | FS: Test 30% | FS + MAH: Total | FS + MAH: Train 70% | FS + MAH: Test 30% |
|---|---|---|---|---|---|---|
| Normal | 14,628 | 10,186 | 4,442 | 13,318 | 9,324 | 3,994 |
| Pre-diabetes | 8,876 | 6,238 | 2,638 | 7,775 | 5,426 | 2,349 |
| Diabetes | 3,778 | 2,673 | 1,105 | 2,518 | 1,777 | 741 |
| Total | 27,282 | 19,097 | 8,185 | 23,611 | 16,527 | 7,084 |
1 Dataset with FS: experimental dataset (case number = 27,443, feature = 37). Dataset with FS and MAH: target dataset for outliers removed based on MAH (case number 23,612, feature = 37).
Table 4. Comparison of Cronbach's alpha for KNHANES 2020–2021.
| Dataset | Total Cases | Cronbach's Alpha | Number of Features |
|---|---|---|---|
| Dataset without MAH * | 27,282 | 0.599 | 24 |
| Dataset with MAH ** | 23,611 | 0.641 | 24 |
* Experimental dataset based on Figure 2. ** Target dataset based on Figure 2.
Table 5. Evaluation comparison of the proposed methods on the target dataset based on training 70% and testing 30%.
Columns give the selected feature sets: Base (all 165 features), MC (24), XGB (66), and RF (22).

| Model | Algorithm | Acc: Base | Acc: MC | Acc: XGB | Acc: RF | AUC: Base | AUC: MC | AUC: XGB | AUC: RF | F1: Base | F1: MC | F1: XGB | F1: RF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline model * | XGB | 96.78 | 97.75 | 97.72 | 97.74 | 97.83 | 98.48 | 98.47 | 98.46 | 97.24 | 97.95 | 97.96 | 97.97 |
| | KNN | 40.09 | 77.47 | 66.13 | 66.38 | 58.50 | 83.56 | 75.74 | 75.92 | 44.44 | 79.71 | 70.68 | 70.83 |
| | DT | 93.81 | 94.07 | 93.96 | 93.93 | 96.92 | 97.05 | 97.04 | 96.94 | 95.87 | 96.17 | 95.92 | 96.05 |
| | NB | 48.21 | 53.33 | 48.21 | 53.14 | 78.23 | 84.29 | 79.37 | 83.43 | 63.75 | 70.40 | 64.29 | 69.26 |
| | RF | 95.42 | 97.26 | 96.91 | 97.25 | 97.29 | 97.26 | 98.04 | 98.21 | 97.49 | 97.74 | 97.68 | 97.91 |
| Proposed model ** | MAH_XGB | 97.79 | 97.98 | 97.92 | 97.76 | 98.52 | 98.64 | 98.61 | 98.47 | 97.99 | 98.21 | 98.17 | 97.94 |
| | MAH_KNN | 41.93 | 78.71 | 67.69 | 67.48 | 59.53 | 84.39 | 76.76 | 76.59 | 44.12 | 79.71 | 70.81 | 70.98 |
| | MAH_DT | 94.51 | 95.31 | 95.05 | 94.13 | 97.17 | 97.58 | 97.46 | 97.03 | 96.34 | 96.86 | 96.63 | 96.04 |
| | MAH_NB | 54.94 | 60.92 | 55.71 | 64.60 | 79.45 | 85.27 | 80.47 | 83.81 | 66.12 | 71.96 | 67.68 | 72.65 |
| | MAH_RF | 96.67 | 97.62 | 97.29 | 97.25 | 97.96 | 98.43 | 98.28 | 98.22 | 97.51 | 97.93 | 97.94 | 97.74 |
* The baseline model was tested using the dataset in Figure 2 and without outlier removal. ** The proposed model was tested using the target dataset in Figure 2 with outlier removal.
Table 6. Comparison of the proposed methods on the target dataset based on k = 10 cross-validation and Cronbach's alpha = 0.641 1.
Columns give the selected feature sets: Base (all 165 features), MC (24), XGB (66), and RF (22).

| Model | Algorithm | Acc: Base | Acc: MC | Acc: XGB | Acc: RF | AUC: Base | AUC: MC | AUC: XGB | AUC: RF | F1: Base | F1: MC | F1: XGB | F1: RF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline model * | XGB | 97.02 | 97.96 | 97.83 | 97.77 | 97.51 | 98.58 | 98.55 | 98.5 | 97.04 | 98.08 | 98.03 | 97.96 |
| | KNN | 33.68 | 77.6 | 65.66 | 66.01 | 54.00 | 83.62 | 75.35 | 75.64 | 33.93 | 77.93 | 66.76 | 67.12 |
| | DT | 94.50 | 94.44 | 94.39 | 94.63 | 97.5 | 97.2 | 97.15 | 97.3 | 96.69 | 96.30 | 96.24 | 96.41 |
| | NB | 42.08 | 54.00 | 48.56 | 53.9 | 77.77 | 84.66 | 79.8 | 83.78 | 69.65 | 78.54 | 72.73 | 77.35 |
| | RF | 96.08 | 97.55 | 97.17 | 97.47 | 96.68 | 98.36 | 98.21 | 98.35 | 96.72 | 97.80 | 97.63 | 97.80 |
| Proposed model ** | MAH_XGB | 97.22 | 98.04 | 97.96 | 97.69 | 97.44 | 98.71 | 98.72 | 98.39 | 97.22 | 98.24 | 98.27 | 97.81 |
| | MAH_KNN | 48.52 | 78.80 | 67.36 | 67.4 | 64.69 | 84.38 | 76.46 | 76.47 | 52.67 | 78.79 | 68.02 | 67.97 |
| | MAH_DT | 94.14 | 94.97 | 94.94 | 94.16 | 95.07 | 97.47 | 97.49 | 97.08 | 94.75 | 94.68 | 96.61 | 96.16 |
| | MAH_NB | 53.60 | 61.01 | 57.03 | 65.61 | 82.81 | 85.23 | 81.51 | 84.44 | 77.4 | 79.16 | 74.94 | 78.44 |
| | MAH_RF | 96.85 | 97.71 | 97.28 | 97.17 | 96.73 | 98.44 | 98.34 | 98.16 | 96.74 | 97.89 | 97.82 | 97.52 |
1 The Cronbach’s alpha value was 0.641, which indicates the reliability of the scale. This can be used in our study dataset to remove outliers and predict the diabetes risk using MAH. * The baseline model was tested using the dataset in Figure 2 and without outlier removal. ** The proposed model was tested using the target dataset in Figure 2 with outlier removal.
Table 7. Statistical significance of the overall mean accuracy, p-values, and CI values for diabetes risk prediction using ML algorithms on target dataset 1.
| Algorithm | Accuracy (%) | p-Value | 95% CI |
|---|---|---|---|
| MAH_XGB | 98.04 | 3.52 × 10⁻¹⁵ | 97.89~98.59 |
| MAH_KNN | 78.80 | 3.89 × 10⁻¹⁵ | 77.66~79.89 |
| MAH_DT | 94.97 | 3.33 × 10⁻¹⁵ | 93.56~97.52 |
| MAH_RF | 97.71 | 3.98 × 10⁻¹⁵ | 97.62~98.08 |
| MAH_NB | 61.01 | 3.93 × 10⁻¹⁵ | 79.88~81.96 |
1 95% confidence interval (CI): a 95% confidence interval is a statistical measure that provides a range of values based on sample data, within which there is a 95% probability that the correct population parameter will lie. p-value: the p-value represents the likelihood of obtaining results that are as extreme as or more extreme than the observed results under the assumption that the null hypothesis is true. Statistical significance was set at p < 0.05.
Table 8. Comparison of classification applications of ML models using other methods for diabetes risk prediction 1.
| Algorithm | Accuracy (%) | AUC (%) |
|---|---|---|
| CatBoost CBC [26] | 85.0 | 93.0 |
| DDLNN [27] | 84.42 | - |
| AB + XB [29] | - | 95.0 |
| RFWBP [30] | 83.0 | 86.0 |
| PCA + ELM [31] | 90.57 | - |
| Proposed method (MAH_XGB) | 98.04 | 98.71 |
1 MAH_XGB: XGB within Mahalanobis distance. CatBoost CBC: CatBoost Classifier. DDLNN: Deep Dense Layer Neural Network. PCA: Principal Component Analysis. ELM: Extreme Learning Machine. RFWBP: Random forest algorithm with the best parameters. AB + XB: The ensembling of two boosting (adaptive (AB) and gradient (XB))-type classifiers.

