1. Introduction
Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder that affects millions of people worldwide, posing significant health and economic burdens. Early detection and prevention of T2DM are crucial to reducing its complications and improving the quality of life of patients [1]. Predicting diabetes allows for earlier detection and intervention, potentially delaying or preventing disease progression. This aligns with personalized medicine’s emphasis on proactive healthcare. However, the current diagnostic methods for T2DM, such as the oral glucose tolerance test (OGTT) and the glycated hemoglobin (A1C) test, can be invasive, costly, and time-consuming.
Phenotypic data can provide valuable insights into the risk factors and pathophysiology of T2DM. Phenotypic data include anthropometric measurements, biochemical markers, lifestyle habits, medical history, and family history. Machine learning (ML) techniques, which are computational methods that learn from data and make predictions, can leverage phenotypic data to build predictive models for T2DM [2]. ML techniques have several advantages over conventional statistical methods, such as the ability to handle high-dimensional and nonlinear data, discover complex patterns and interactions, and improve accuracy and generalization [3,4,5].
Several studies have applied different ML techniques to predict T2DM by using phenotypic data from various populations. For example, Yu et al. (2010) used a support vector machine (SVM) to classify instances of diabetes by using data from the National Health and Nutrition Examination Survey (NHANES) [6]. Deberneh and Kim utilized five ML algorithms for the prediction of T2DM by using both laboratory results and phenotypic variables [2]. Anderson et al. used a reverse engineering and forward simulation (REFS) analytical platform that relies on a Bayesian scoring algorithm to create prediction-model ensembles for progression to prediabetes or T2DM in a large population [7]. Cahn et al. used electronic medical record (EMR) data from The Health Improvement Network (THIN) database, which represents the UK population, to identify prediabetic individuals [8]. Shin et al. used Logistic Regression (LR), Decision Tree, Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Cox regression, and XGBoost Survival Embedding (XGBSE) for the prediction of diabetes [9]. Gul et al. investigated the predictive value of phenotypic variables (body mass index, cholesterol, familial diabetes history, and high blood pressure) with LR [10]. Dinh et al. achieved high area under the curve (AUC) scores with and without laboratory data with LR, SVM, and three ensemble models (RF, Gradient Boosting-XGBoost, and a weighted ensemble model) [11]. Viloria et al. used SVM to predict T2DM by using only body mass index (BMI) and blood glucose concentration [12]. Wang et al. compared the XGBoost, SVM, RF, and K-Nearest Neighbor (K-NN) algorithms to predict the risk of T2DM [13].
Previous studies have shown that ML techniques can predict T2DM from phenotypic data. However, these studies have several limitations. First, they focus on a few ML algorithms or compare them in isolation, without considering the full range of existing methods or novel algorithms. Second, some of these studies use laboratory results that are directly related to glucose metabolism and may not reflect other aspects of phenotypic variation. Third, they generally do not consider gender disparities in risk factors, disease outcomes, and model performance for T2DM. Additionally, some studies relied on datasets that were either small or of questionable quality [13]. Despite these limitations, the previous studies provide promising evidence that ML techniques can be used to develop accurate and robust models for predicting T2DM risk.
This study aimed to address the limitations of previous studies by using PyCaret, an open-source, low-code ML library in Python that automates ML workflows [14]. PyCaret allows for the simultaneous evaluation of multiple ML algorithms, including newly developed ones such as XGBoost, LightGBM, and CatBoost, for predicting T2DM. PyCaret reduces the need for extensive coding expertise, shortens the analysis time, and yields more comprehensive evaluation metrics.
Another aim of this research was to explore the differences between male and female populations for predicting T2DM. By analyzing how phenotype and gender interact, we can identify risk factors that are more prominent in one sex compared with the other. This knowledge can allow healthcare providers to tailor personalized screening and prevention strategies for men and women.
2. Materials and Methods
2.1. Dataset Overview
Raw data from controls and patients were obtained from the “Nurses’ Health Study” (NHS), an all-female cohort, and the “Health Professionals’ Follow-up Study” (HPFS), an all-male cohort. Data are available at the database of Genotypes and Phenotypes (dbGaP) under accession phs000091.v2.p1 and were obtained with permission [15].
The NHS and HPFS are well-established cohorts that are part of the Gene Environment Association Studies (GENEVA) initiative. In addition to investigating the genetic factors contributing to the development of T2DM, these cohorts also aim to explore the role of environmental exposures, offering a resource for studying the genetic and environmental factors associated with T2DM.
The NHS, initiated in 1976 with 121,700 female nurses aged 30–55 years, and the HPFS, established in 1986 with 51,529 male US health professionals aged 40–75 years, both began by collecting data on medical history and lifestyle through mailed questionnaires [16]. Every two years since, participants in both cohorts have completed self-administered questionnaires to update information on medical history, lifestyle factors (diet, exercise, and smoking), demographics (age and ethnicity), and family history of diabetes. Both cohorts contribute to the Gene Environment Association Studies (GENEVA) initiative, which aims to identify novel genetic contributors to T2DM through genome-wide association analysis. Blood samples were collected for genotyping in 1989–1990 for the NHS and in 1993–1995 for the HPFS. While both studies initially enrolled a large number of participants, a subset of approximately 6000 individuals with T2DM (cases) and healthy controls was selected for the GENEVA project [15]. Our analysis focuses on the phenotypic data collected from this selected group.
2.2. Variables
The variables used for analysis in this study are given in Table 1. Categorical variables were used as is, with no numeric conversion applied for analysis.
2.3. Data Preprocessing
The data contain information about the disease status of a total of 6033 individuals: 3429 females (NHS) and 2604 males (HPFS). The analysis focused on white individuals. Participants of other races (158), Hispanic participants (37), those with other types of diabetes (133), individuals without a genotype ID (25), and first-degree relatives (8) of the participants were excluded. The characteristics of the remaining 5672 individuals, comprising both controls and individuals with T2DM, are presented in Table 2.
2.4. PyCaret Analysis
PyCaret is a low-code framework that enables the simultaneous execution of machine learning algorithms. When the main code is run, many embedded routines execute in a specified order, automating the machine learning analysis process. Instead of calculating accuracy in one script and precision in another, all routines are embedded and executed sequentially, automatically creating comparison tables and graphics in a short time. This enables researchers without software expertise to perform machine learning analysis with minimal coding, provided they prepare the dataset properly.
PyCaret was used to analyze a dataset of 5672 individuals with 14 phenotypic variables for the male and female datasets and 15 variables, including gender, in the total dataset. The dataset was divided into male and female subsets, and the performance and features of 16 machine learning (ML) classification algorithms were compared for each subset.
PyCaret analysis was performed in the following order: import the necessary libraries and the data (as a .csv file), preprocess the data by using PyCaret, display the data features (numeric or categorical), and handle missing data. Since rows with missing values accounted for 3.5% of the dataset, with a maximum rate of 1.5% in any individual feature, no data points were dropped. PyCaret used simple imputation to address the missing data, with the mean for numeric variables and the mode for categorical variables.
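PyCaret performs this imputation automatically during its setup step. As a minimal sketch of the underlying strategy, the equivalent step with scikit-learn (on which PyCaret builds) is shown below; the toy values are invented for illustration and are not from the study dataset.

```python
# Mean/mode imputation as applied by PyCaret, illustrated with scikit-learn.
# The toy data below are invented stand-ins, not the study dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "bmi": [24.1, 31.5, np.nan, 27.8],                # numeric feature
    "smoker": ["never", np.nan, "current", "never"],  # categorical feature
})

# Mean imputation for the numeric variable.
df["bmi"] = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]]).ravel()
# Mode (most frequent) imputation for the categorical variable.
df["smoker"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["smoker"]]).ravel()

print(df["bmi"].tolist())     # the NaN is replaced by the column mean
print(df["smoker"].tolist())  # the NaN is replaced by the most frequent category
```

Because the missing rate here was low (at most 1.5% per feature), simple univariate imputation of this kind is a reasonable default, avoiding the loss of the 3.5% of rows that contained at least one missing value.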
Next, the classification command was used to compare the available classification algorithms. Once the results were available, the best algorithm was selected, and 10-fold cross-validation was performed to tune the model’s hyperparameters. The model’s performance and robustness were evaluated by using stratified 10-fold cross-validation with a train–test split ratio of 70:30. The process flow diagram is presented in Figure 1. The analysis was implemented by using the Anaconda Navigator IDE with Python in the Jupyter Notebook (version 6.5.4) editor, along with PyCaret (version 3.2.0), on a 64-bit Windows 11 computer.
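The comparison step above can be illustrated with the scikit-learn machinery that PyCaret automates: a stratified 70:30 split, stratified 10-fold cross-validation over several candidate classifiers, and selection of the best performer. The synthetic data, model list, and variable names below are illustrative, not those of the study.

```python
# Sketch of a compare_models-style evaluation: stratified 10-fold CV over
# several classifiers on a 70:30 split. Synthetic data stand in for the
# study dataset; the three candidate models are illustrative.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=600, n_features=14, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
models = {
    "lr": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
    "lda": LinearDiscriminantAnalysis(),
}
# Mean cross-validated accuracy per model on the training split.
scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv).mean()
    for name, model in models.items()
}
best_name = max(scores, key=scores.get)

# Refit the winner and evaluate it on the held-out 30% test split.
best = models[best_name].fit(X_train, y_train)
holdout_acc = best.score(X_test, y_test)
print(best_name, round(holdout_acc, 3))
```

In PyCaret itself, the `compare_models` and `tune_model` commands encapsulate this loop (plus hyperparameter search) behind a single call.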
The analysis provided the following evaluation metrics: accuracy, area under the curve (AUC), recall, precision, F1-score, kappa score, Matthews Correlation Coefficient (MCC), and analysis time (TT). A variable importance graph was also produced, and a SHAP value graph was generated by using the “interpret_model” command.
The explanation of the performance metrics is given below:
Accuracy is the percentage of correct predictions out of all predictions.
Precision is the percentage of correct positive predictions out of all positive predictions.
Recall is the percentage of correct positive predictions out of all actual positives.
F1-score is a harmonic mean of precision and recall that balances both metrics.
Area under the curve is a measure of how well a model can rank positive and negative examples correctly.
Kappa score is a measure of agreement between the model’s predictions and the actual labels.
Matthews Correlation Coefficient is used as a measure of the quality of binary classifications; it remains informative even when the classes are imbalanced.
The kappa score and MCC are valuable metrics for evaluating the performance of a classifier in identifying true positives and negatives. This information can aid clinical decision making by minimizing the risk of misclassifying patients.
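All of these metrics can be computed directly with scikit-learn, which PyCaret uses internally; the labels and predicted probabilities below are a small invented example, not study results.

```python
# The evaluation metrics reported by the analysis, computed with
# scikit-learn on a small illustrative set of labels and predictions.
from sklearn.metrics import (
    accuracy_score, cohen_kappa_score, f1_score,
    matthews_corrcoef, precision_score, recall_score, roc_auc_score,
)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # actual labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]                    # predicted labels
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]   # predicted P(class 1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))   # AUC needs probabilities
print("Kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```

Note that AUC is computed from the predicted class probabilities (how well the model ranks cases above controls), whereas the other metrics use the hard class predictions.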
The formulas for these metrics are given in terms of the following quantities:
TP: true positive
FP: false positive
TN: true negative
FN: false negative
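For reference, the standard definitions of these metrics in terms of the confusion-matrix counts above are:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Kappa}     &= \frac{p_o - p_e}{1 - p_e}
                    \quad (p_o\text{: observed agreement, } p_e\text{: chance agreement}) \\
\text{MCC}       &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align*}
```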
2.5. Statistical Analysis
SPSS software (version 29.0.0.0; SPSS Inc., Chicago, IL, USA) was used to analyze the numerical data. The Kolmogorov–Smirnov test was used to evaluate normality, and either Student’s t-test or the Mann–Whitney U test was used for the statistical analysis of numeric variables, as appropriate. A chi-square test, performed with a Chi-Square Test Calculator [17], was used to examine the associations between categorical variables. Numerical variables were presented as means ± standard deviation (SD) and categorical variables as frequencies and percentages. The significance level was set at p < 0.05.
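The same decision logic (normality check, then parametric or nonparametric group comparison, plus a chi-square test of independence) can be sketched in Python with SciPy; the study itself used SPSS, and the two samples and the contingency table below are synthetic stand-ins.

```python
# Sketch of the statistical workflow: normality check, then t-test or
# Mann-Whitney U, plus a chi-square test. All values below are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
controls = rng.normal(loc=25.0, scale=3.0, size=200)  # e.g., BMI, controls
cases = rng.normal(loc=28.0, scale=3.0, size=200)     # e.g., BMI, T2DM cases

# Normality check (here: KS test against a normal fitted to the sample).
_, p_norm = stats.kstest(controls, "norm",
                         args=(controls.mean(), controls.std(ddof=1)))

if p_norm > 0.05:
    _, p_group = stats.ttest_ind(controls, cases)      # parametric
else:
    _, p_group = stats.mannwhitneyu(controls, cases)   # nonparametric

# Chi-square test for a 2x2 table of a categorical variable vs. disease status.
table = np.array([[120, 80],    # e.g., no family history: controls, cases
                  [60, 140]])   # e.g., family history:    controls, cases
chi2, p_chi, dof, _ = stats.chi2_contingency(table)
print(round(p_group, 6), round(p_chi, 6), dof)
```

With means three units apart and a clearly skewed contingency table, both comparisons come out significant at the 0.05 level in this synthetic example.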
4. Discussion
The use of artificial intelligence in healthcare has grown in recent years, alongside the automation of ML, to improve the quality, efficiency, and effectiveness of healthcare [19]. ML algorithms, which are sets of rules that do not require explicit programming, allow computers to learn and make predictions from data. They are used for a variety of applications, such as natural language processing, image recognition, fraud detection, and disease prediction [20]. In the past, using the most prevalent ML algorithms simultaneously was infeasible due to the need for programming expertise and time-consuming testing. However, developments in information technology have made it possible to run ML algorithms with little code and to analyze many ML algorithms simultaneously.
Machine learning and statistical analysis were applied in the current study to investigate the factors associated with T2DM in the NHS and HPFS datasets. PyCaret classification analysis was used as a machine learning tool to run 16 classification algorithms simultaneously with minimal coding to predict T2DM. The performance of the models was evaluated by using various metrics, such as accuracy, AUC, recall, precision, F1-score, kappa, MCC, and TT (Sec). Feature importance plots and SHAP values were used to interpret the models and identify the most relevant features for prediction.
Ridge Classifier, LDA, and LR exhibited the best performance among models for the male-only data subset, all achieving similar scores. In contrast, for the female-only data subset, LR, Ridge Classifier, and LDA were the top-performing models, also with similar scores. In the total data subset, LR, GBC, and the CatBoost Classifier emerged as the best-performing models, demonstrating comparable scores.
The feature importance plot, one of the most commonly used explanation tools in machine learning analysis, was also utilized [21]. This tool shows how much each feature contributes to the prediction model, based on the change in accuracy or error when the feature is removed or shuffled [22]. The higher the variable importance, the more important the feature is for the model. However, this tool does not imply any causal relationship between the features and the outcome, as there may be other factors or interactions that affect the model. A feature may have high statistical significance but low variable importance if it does not substantially improve the prediction model.
The feature importance plot aids in understanding relevant features for the prediction model and identifying irrelevant or redundant ones. Additionally, it assists in selecting or eliminating features to enhance or simplify the model. However, it should be noted that the feature importance plot may vary depending on the type of machine learning technique, the dataset, and the evaluation metric used for measuring the model’s performance. Therefore, it should be interpreted with caution and in conjunction with other methods of machine learning analysis. The feature importance plot showed that features had different importance values for the prediction model in different data subsets. The most important features were “famdb”, “smoker_never”, and “hbp” in the female-only data subset, and “famdb”, “hbp”, and “smoker_current” in the male-only data subset. Furthermore, the order of the variables and their values differ across genders.
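Importance based on shuffling a feature and measuring the resulting loss of accuracy, as described above, is known as permutation importance and can be sketched with scikit-learn; the model and synthetic data below are illustrative stand-ins, not the study's.

```python
# Permutation feature importance: shuffle each feature in turn and measure
# the drop in accuracy. Synthetic data stand in for the study dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# n_repeats shuffles each feature several times to average out noise.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```

Features the model does not rely on score near zero, which illustrates the point above: importance reflects the model's use of a feature, not statistical significance or causality.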
Dinh et al. used 123 variables for 1999–2014 and 168 variables for 2003–2014, including survey questionnaire and laboratory results [11]. They found that the AUCs were 73.7% and 84.4% without and with the laboratory data for prediabetic individuals, respectively. In another study, Lai et al. found that the GBM and Logistic Regression models performed better than the Random Forest and Decision Tree models. However, they used several laboratory measures, such as fasting blood glucose, in their models, with BMI, high-density lipoprotein (HDL), and triglycerides among the most important predictors [23]. The well-known Framingham Diabetes Risk Scoring Model (FDRSM) is a simple clinical model that uses eight factors, i.e., gender, age, fasting blood glucose, blood pressure, triglycerides, HDL, BMI, and parental history of diabetes, to predict the 8-year risk of developing diabetes by Logistic Regression models [24,25]. The 2 h post-OGTT glucose level was also used in complex clinical models [24]. While the AUC was 0.72 in the simple model, it increased to 0.85 in the complex clinical model. The use of either blood glucose levels or the OGTT in model creation clearly has a significant impact on model performance. However, it is crucial to establish the predictive effectiveness of variables prior to the onset of elevated blood glucose levels. An AUC of 0.79 was obtained in the ML analysis conducted on the total dataset in the current study. This moderate performance can be attributed to the use of fewer variables, especially phenotypic variables instead of direct glucose or OGTT measurements, and fewer laboratory data in the current models. Moreover, the current approach employed a broader range of ML algorithms compared with previous studies, enabling a comprehensive comparison and selection of the most effective methodologies. Utilizing PyCaret for the ML algorithms streamlined the process by automating the data processing and model evaluation steps. Therefore, differences between this and other studies primarily stem from variations in the nature of the data.
An in-depth investigation into the predictive potential of phenotypic variables for T2DM was conducted. To complement the current ML approach, a rigorous statistical analysis was undertaken to assess the inferential strength of each individual variable. Statistical analysis tests hypotheses and infers causal relationships between individual variables and diabetes risk, while ML builds models and makes predictions based on the data, without necessarily explaining how the data are related or what causes the outcome [26]. However, statistical methods do not capture the nonlinear and interactive effects of multiple variables on diabetes risk. In contrast, ML algorithms can uncover intricate patterns and interactions within the data, which are not evident through conventional statistical measures alone. Therefore, the complementary strengths of both statistics and ML were leveraged to provide a more comprehensive understanding of the predictive factors for T2DM, allowing for the prioritization of variables that may have been overlooked by traditional statistical analysis. As Bennett et al. suggest, ML and statistical analysis are different but complementary methods that can provide different insights into the data, depending on the research question and the data available [26].
Furthermore, gender differences in diabetes risk and outcomes have been extensively reviewed in previous studies [27]. However, most of these investigations have primarily focused on the role of sex hormones, sex chromosomes, or sex-specific environmental factors in explaining these disparities. This study aimed to investigate whether phenotypic variables, including BMI, blood pressure, and lipid levels, demonstrate distinct predictive patterns for diabetes risk among males and females. By leveraging ML techniques, a substantial dataset comprising phenotypic variables was analyzed from individuals with and without T2DM. The current findings reveal that certain phenotypic variables displayed varying degrees of predictive power for diabetes risk across genders. Notably, variables like famdb, hbp, and chol exhibited higher feature importance scores for females than for males. In the SHAP analysis, the impact of heme iron intake appears more significant in males than in females when comparing Figure 6 and Figure 7. Additionally, the effect of lower activity is more pronounced in favor of diabetes in males. While higher glycemic load is more pronounced in females, it has an almost neutral effect in males. Interestingly, lower polyunsaturated fat and higher trans fat intake favor diabetes in females, but the opposite holds true for males. Furthermore, the feature importance analysis revealed gender as one of the significant factors in the overall dataset. These results suggest that phenotypic variables can capture certain facets of the sex-specific pathophysiology of diabetes and have the potential to enhance the accuracy and personalization of diabetes risk prediction models. Further studies are needed to validate the current findings and delve into the underlying mechanisms contributing to the sex differences observed in phenotype-based diabetes prediction.
Several recent studies have utilized PyCaret for diabetes prediction, but key differences exist between those studies and the current work. Whig et al. employed PyCaret for gestational diabetes prediction on a relatively small open-source dataset. While they reported a 90% accuracy after hyperparameter tuning, their manuscript indicated an actual accuracy of 78% [28]. This discrepancy highlights the importance of dataset size, quality, and careful interpretation of results.
The study by Jose et al. shares some context with our research, but there are significant variations in datasets and outcomes. Their open-source dataset includes features beyond our scope, suggesting a focus on cardiovascular health alongside diabetes. However, it lacks crucial diabetes-related variables such as family history [29]. Additionally, we leverage the more recent CatBoost algorithm from Yandex, which was absent in their work. Our study population, derived from the well-established NHS and HPFS datasets, is expected to be of higher quality.
These comparisons demonstrate key distinctions from previous research: study population, variables, algorithms, and analysis of gender differences. In previous research, the authors did not mention gender as a contributing factor for the outcome, nor did they find it to contribute to predicting readmission for patients with diabetes [29]. However, our findings, supported by the statistical analysis, the feature importance plot, and the SHAP analysis, suggest that gender is a significant phenotypic factor in diabetes prediction.