1. Introduction
Diabetes is a global health problem that both reduces individual quality of life and places a significant economic burden on healthcare systems. According to data from the International Diabetes Federation, approximately 537 million people aged 20–79 worldwide have diabetes, and this number is expected to reach 700 million by 2045 [1,2]. This rapid increase is creating a serious public health problem, particularly in low- and middle-income countries, and necessitates the development of effective tools for early diagnosis and management of the disease [1,3].
Early diagnosis of diabetes is of great importance in terms of preventing complications, improving patient management, and reducing healthcare costs [4,5]. In addition to complications, diabetes also significantly affects quality of life and has negative consequences on occupational productivity, mental health, and family life [6,7], highlighting the importance of early diagnosis. However, the multidimensional and complex nature of diabetes makes it difficult to reliably predict this disease in its early stages. At this point, machine learning (ML) and data mining methods offer significant potential in the healthcare sector [8,9]. Nevertheless, the success of ML models largely depends on the quality of the dataset used. High-dimensional medical datasets, often filled with redundant or irrelevant variables, increase the computational cost of the model, lead to overfitting, and reduce interpretability [8,10,11,12].
Feature selection emerges as a critical preprocessing step to overcome these issues [13,14]. Feature selection shortens training time and increases model accuracy by identifying meaningful attributes that enhance predictive power and eliminating unnecessary variables, while also facilitating interpretability [4,8,10,15,16]. Numerous feature selection methods have been proposed in the literature [3,5,11,17,18,19,20,21], such as filter-based, wrapper-based, and embedded methods. However, most of these methods are effective only under specific data structures and show limited success in complex diseases such as diabetes [11,13]. Many of them also have limitations such as capturing only linear relationships, ignoring co-occurrence patterns, and incurring high computational costs [3,16]. In such studies, the effect of each parameter is evaluated individually, and the combined effect of the parameters is ignored. In some cases, symptoms may not be meaningful indicators on their own, but when they occur together, they can trigger the development of the disease or increase the risk.
In this context, the use of methods based on association rule mining, such as the Apriori algorithm, for feature selection in diabetes prediction is considered a noteworthy alternative in this study. Although Apriori is traditionally an algorithm used in market basket analysis, its ability to discover relationship patterns between signs and symptoms makes it suitable for use as a feature prioritization tool in health data [22]. Indeed, in classification tasks, identifying feature sets that occur together with high frequency can improve both the accuracy and interpretability of the model [22,23].
Although the Apriori algorithm has been successfully applied in various fields such as sentiment analysis or text mining in the literature, no studies have been found regarding its use for direct feature selection in the context of diabetes prediction [16,23,24]. Specifically, Ref. [3] clearly emphasizes that redundant features in electronic health records reduce model performance and that feature selection therefore remains an active research problem. Similarly, Ref. [1] states that more research is needed on predicting diabetes-related risk factors. Reference [16] states that feature selection itself has not been sufficiently explored in the literature, that existing prediction models are inadequate in terms of predictive accuracy, and that feature selection therefore represents a significant gap. It also reports that investigating the integration of relationship analysis with machine learning methods would provide an important foundation for future research [22].
The approach proposed in this study aims to contribute to this gap. Using the Apriori algorithm, the goal is to discover the most effective features in the classification of diabetes, thereby improving both the performance and interpretability of the model. Thus, unlike traditional methods, the goal is to create a more accurate subset of variables for decision-making systems by considering not only statistical dependencies but also meaningful patterns between symptoms.
In this study, unlike existing machine learning-based approaches to diabetes prediction, the feature selection problem is addressed from an association rule mining perspective. Many studies in the literature generally evaluate features individually, using correlation-based or filter, wrapper, and embedded method-based feature selection approaches. However, such methods often fail to adequately consider the patterns of co-occurrence of symptoms and may overlook clinically significant symptom combinations. In this study, the co-occurrence patterns of diabetes-related symptoms were analyzed using the Apriori algorithm, and symptom clusters with high support, confidence, and F1 values were included in the feature selection process. This approach not only increases classification success but also strengthens the interpretability of the model. The findings show that high accuracy, sensitivity, and F1 scores can be obtained using a smaller number of clinically significant symptoms. In this respect, the study contributes to the development of explainable and low-cost symptom-based early diabetes screening and decision support systems and aims to fill the existing gap in the field.
In conclusion, this study presents an original methodology that extends various feature selection approaches proposed in the literature and integrates association rule mining into the feature selection process. In this regard, it is expected to contribute to the field of diabetes prediction and offer a new perspective on the creation of high-quality prediction models. Thus, it has the potential to fill an important gap in the academic literature and contribute to the decision support process in healthcare systems. Furthermore, this study is related to the United Nations Sustainable Development Goal of “good health and well-being”.
2. Materials and Methods
In this study, a dataset consisting of samples from 520 individuals from Sylhet Diabetes Hospital in Sylhet, Bangladesh, was used. The dataset is based on a real-world diabetes dataset available in the UCI Machine Learning Repository [25]. It contains a total of 16 symptoms that may be associated with diabetes.
Various preprocessing steps were applied to the dataset prior to analysis. All variables were coded categorically: the absence of a symptom was marked as “0” and its presence as “1”. First, missing data checks were performed, and a clean dataset without missing values was used in the analysis. A ROC analysis performed for the age variable showed that age had significant discriminatory ability for the diagnosis of diabetes, and 47.5 years was determined as the cut-off point. Accordingly, individuals aged 47.5 and above were coded as “1”, while younger individuals were coded as “0”. The mean age of individuals with diabetes was 49.07 ± 12.1 years, while that of individuals without diabetes was 46.36 ± 12.1 years; the difference between the groups was statistically significant (p < 0.05). In the evaluation of the gender variable, the incidence of diabetes in women was statistically significantly higher than in men (p < 0.05). In line with this finding, female individuals were coded as “1” (positive) and male individuals as “0”.
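The categorical coding described above can be sketched as follows (a minimal Python illustration of the coding rules; the study itself performed these steps in R):

```python
# Binarize age at the ROC-derived cut-off of 47.5 years:
# 1 = aged 47.5 or above, 0 = younger.
AGE_CUTOFF = 47.5

def code_age(age):
    return 1 if age >= AGE_CUTOFF else 0

# Gender coding following the study: female = 1 (positive), male = 0.
def code_gender(gender):
    return 1 if gender == "female" else 0

print([code_age(a) for a in (35, 47, 48, 60)])    # [0, 0, 1, 1]
print(code_gender("female"), code_gender("male"))  # 1 0
```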
In the dataset used in this study, variables based on easily observable symptoms were used instead of parameters based on blood tests. All parameters in the dataset and the distributions of these variables are presented in Table 1.
Feature selection was first applied to the variables presented in Table 1. For this purpose, the Apriori algorithm, one of the association analysis methods, was used to determine the relationships between variables and identify the features most closely related to diabetes. The Apriori algorithm is a simple and widely used data mining algorithm that extracts association rules from datasets. Based on the results obtained, meaningful features were selected, and classification models were trained on these features.
For diabetes prediction, four different machine learning algorithms were used: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Random Forests (RF). During the modeling process, classification was performed using both the full dataset containing all variables and the reduced dataset obtained after feature selection. Thus, the effect of feature selection on classification performance was evaluated. For the machine learning prediction phase, the data was split into a 75% training and 25% test set using the createDataPartition function (caret package) in R.
The following criteria were considered in evaluating model performance: accuracy, kappa coefficient, precision, recall, specificity, and F1 score. All analyses were performed in R version 4.3.2 using RStudio. The flow diagram of the study is shown in Figure 1.
Feature selection with the Apriori algorithm:
The Apriori algorithm is one of the most frequently used methods in association rule mining. It has a simple and straightforward structure, and this simplicity facilitates both its application and the interpretation of the results obtained. It provides effective and reliable results, especially for small- and medium-sized datasets. The algorithm was developed to identify groups of items that tend to occur together in large datasets and is widely used, particularly in applications such as market basket analysis, to identify products or items that are frequently purchased together. The Apriori algorithm first identifies frequently occurring item sets in the dataset, then generates rules reflecting the relationships that may exist between these sets [26,27]. These rules are typically expressed as “If A occurs, then B is also likely to occur.” The rules obtained are evaluated using performance metrics such as support, confidence, and lift. The basic working principle of the algorithm is based on considering item groups whose support level is above a predetermined threshold value [28,29].
The support metric indicates the frequency with which a set of items appears within a dataset. This ratio reflects how common the rule is within the dataset. A high support ratio means that the relevant items are frequently observed together, which generally indicates that these rules involve more common or fundamental items. The support value is calculated using Equation (1) [27]:

Support(A → B) = frequency(A ∪ B) / N  (1)

where frequency(A ∪ B) is the number of transactions containing both A and B, and N is the total number of transactions.
Another fundamental evaluation criterion is the confidence value. Confidence is an important metric used to determine the accuracy level of an established association rule. This criterion reflects the probability of another element (e.g., B) appearing when a set of elements (e.g., A) is observed. A high confidence level indicates that when symptom A is present, symptom B is also likely to be present, demonstrating that the relationship is strong. The confidence value is calculated using Equation (2), obtained by dividing the joint frequency of occurrence of A and B by the frequency of occurrence of A alone [26]:

Confidence(A → B) = Support(A ∪ B) / Support(A)  (2)
Lift is a metric used to evaluate the strength of the rules created and reveals the extent to which these relationships deviate from a random association. This measure reveals how different the relationship between symptoms (or items) is from a purely coincidental occurrence. In other words, the lift value indicates the extent to which the joint occurrence of symptoms A and B increases or decreases compared to the probability expected under the assumption that A and B are independent of each other. The lift value is calculated using Equation (3) [30]:

Lift(A → B) = Confidence(A → B) / Support(B)  (3)
Lift > 1: There is a positive relationship between elements A and B (e.g., symptoms). In this case, the probability of these two elements occurring together is higher than the probability of them occurring independently. Therefore, the relationship between them is meaningful and strong.
Lift = 1: Elements A and B are independent of each other. The probability of them occurring together is equal to the probability of them occurring separately. This indicates that there is no statistically significant relationship between the elements.
Lift < 1: There is a negative relationship between elements A and B. The presence of one element reduces the probability of the other occurring. This indicates the existence of a negative relationship between the elements.
To summarize these three fundamental metrics: support indicates the prevalence of a rule within a dataset; confidence reflects the accuracy level of the rule; and lift reveals the strength and significance of the relationship between two elements (e.g., symptoms) [26]. In this study, the association rules with the highest values according to these three metrics were determined, and feature selection was performed based on these rules. In the algorithm application, the minimum support and confidence thresholds were set at 3% and 60%, respectively; rules below these values were not considered. Feature selection was performed based on the rules with the highest support, confidence, and lift values. The “arules” package in R was used for the Apriori algorithm. The dataset was first converted to transaction format. Subsequently, the rules most positively associated with diabetes were generated and ranked using the apriori function with support = 0.03 and confidence = 0.60 [22].
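The rule metrics and thresholds described above can be illustrated with a small self-contained Python sketch (toy transactions, not the study data; the study itself used the arules package in R):

```python
from itertools import combinations

# Toy transactions: each set lists the symptoms present in one patient record.
transactions = [
    {"polyuria", "polydipsia", "diabetes"},
    {"polyuria", "weakness", "diabetes"},
    {"polydipsia", "diabetes"},
    {"weakness"},
    {"polyuria", "polydipsia", "weakness", "diabetes"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

MIN_SUPPORT, MIN_CONFIDENCE = 0.03, 0.60  # thresholds used in the study

items = sorted({i for t in transactions for i in t if i != "diabetes"})
rules = []
for r in (1, 2):  # antecedents of size 1 and 2
    for combo in combinations(items, r):
        a = set(combo)
        if support(a | {"diabetes"}) >= MIN_SUPPORT:
            c = confidence(a, {"diabetes"})
            if c >= MIN_CONFIDENCE:
                rules.append((combo, round(support(a | {"diabetes"}), 2),
                              round(c, 2), round(lift(a, {"diabetes"}), 2)))

# Rank rules by confidence, then lift, as in the feature selection step.
rules.sort(key=lambda x: (-x[2], -x[3]))
for rule in rules:
    print(rule)
```

Symptoms appearing in the top-ranked rules would then form the selected feature subset.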
Machine Learning Algorithms:
The K-Nearest Neighbors (KNN) algorithm is a widely used classification method in the field of data mining [31]. This algorithm classifies a new object based on its distance to the objects in the training data, assigning it to a class according to its K nearest neighbors, where K is a positive integer specified by the user before running the algorithm. The distance between objects is usually calculated using the Euclidean distance measure, a standard approach to similarity measurement that forms the basis of the algorithm. The “class” package and its knn function were used in R for the KNN algorithm, with the parameter k = 1.
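The distance-based voting described above can be sketched in a few lines (a Python illustration on made-up binary symptom vectors; the study used the knn() function of the R “class” package):

```python
import math

# Minimal K-nearest-neighbours classifier; the study used k = 1.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=1):
    # Sort training points by distance to the query and vote among the k nearest.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda p: euclidean(p[0], query))[:k]
    labels = [y for _, y in neighbours]
    return max(set(labels), key=labels.count)

# Illustrative binary symptom vectors (not the study data).
train_X = [(1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0)]
train_y = ["positive", "positive", "negative", "negative"]
print(knn_predict(train_X, train_y, (1, 1, 1)))  # "positive"
```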
Support Vector Machine (SVM) is a supervised machine learning algorithm [32]. Its primary goal is to determine the optimal hyperplane (decision boundary) that separates data points belonging to different classes. Multiple hyperplanes can separate two classes; SVM seeks the one that maximizes the margin between them. The margin is defined as the distance between the hyperplane and the data points closest to it; these closest points are the support vectors that give the algorithm its name. A hyperplane is a linear boundary that divides the data space into two distinct regions and has one less dimension than the space it resides in. Thanks to this structure, SVM stands out as an effective classifier that can separate data into different classes with high accuracy. The svm() function from the “e1071” package was used in R for the SVM model, with a radial kernel, the cost parameter set to 15, and the gamma parameter set to 0.1. These parameter settings were determined experimentally to provide the best performance.
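The margin concept can be made concrete numerically: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin is twice the distance to the closest point. A Python sketch with hand-picked illustrative values (not the study’s fitted model):

```python
import math

# Distance from a point x to the hyperplane w·x + b = 0.
def hyperplane_distance(w, b, x):
    return (abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
            / math.sqrt(sum(wi ** 2 for wi in w)))

w, b = (1.0, 1.0), -3.0                 # illustrative hyperplane x1 + x2 = 3
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]

# The margin is twice the distance to the nearest point (a support vector).
margin = 2 * min(hyperplane_distance(w, b, x) for x in points)
print(round(margin, 3))  # 1.414
```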
Random Forest (RF) is an ensemble method among supervised learning algorithms, based on creating multiple decision trees and combining their outputs. Each tree is trained on a random subset of the training data (bootstrap sampling), and at each node a randomly selected subset of features is used instead of all features. This diversity reduces the correlation between trees, thereby lowering the overall error. In classification problems, each tree produces a prediction, and the result is determined by majority vote; in regression problems, the arithmetic mean of the trees’ predictions is taken. RF is resistant to overfitting and generally provides high accuracy. Thanks to its ability to quantify feature importance, it is also used for variable selection, and it is a reliable algorithm that provides fast and stable results on large datasets [33]. For the RF method, the “randomForest” package and its randomForest function were used in R, with ntree = 500 and importance = TRUE.
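The two sources of randomness described above (bootstrap sampling of rows and random feature choice, combined by majority vote) can be sketched with one-feature decision stumps standing in for full trees (an illustrative Python sketch, not the randomForest implementation used in the study):

```python
import random
from collections import Counter

random.seed(42)

# Illustrative binary symptom vectors with class labels (not the study data).
data = [((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 0, 0), 0),
        ((0, 1, 0), 0), ((1, 1, 1), 1)]

def train_stump(sample):
    # Random feature subset of size 1: pick one feature at random, and
    # predict the majority label among sampled rows where it equals 1.
    feature = random.randrange(3)
    votes = Counter(label for x, label in sample if x[feature] == 1)
    majority = votes.most_common(1)[0][0] if votes else 0
    return feature, majority

def forest_predict(stumps, x):
    # Each stump votes; the ensemble prediction is the majority vote.
    preds = [maj if x[f] == 1 else 1 - maj for f, maj in stumps]
    return Counter(preds).most_common(1)[0][0]

stumps = []
for _ in range(101):                                   # ntree-style parameter
    bootstrap = [random.choice(data) for _ in data]    # rows with replacement
    stumps.append(train_stump(bootstrap))

print(forest_predict(stumps, (1, 1, 0)))
```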
Artificial Neural Networks (ANNs) are supervised learning algorithms that can model non-linear relationships with their multi-layered structure. They consist of an input layer, one or more hidden layers, and an output layer. Each neuron multiplies its inputs by specific weights, sums them, applies an activation function, and passes the output to the next layer. During training, the error between the model output and the actual value is typically minimized using backpropagation and gradient descent. Nonlinear functions such as sigmoid, tanh, and ReLU are used as activation functions, and updating the weights appropriately enables the model to learn. ANNs produce powerful results, especially in classification, regression, and pattern recognition problems, and multilayer structures form the basis of the field of deep learning. With sufficient data and processing power, ANNs can generalize with very high accuracy [34]. The ANN model was created using the nnet() function of the “nnet” package in R, with 5 neurons in the hidden layer (size = 5), a weight decay parameter of 0.1 (decay = 0.1), and a maximum of 200 iterations (maxit = 200).
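The forward pass described above can be written out directly (a Python sketch with illustrative hand-set weights; in practice the weights are learned by backpropagation, and the study fitted the network with nnet in R):

```python
import math

# One-hidden-layer network with sigmoid activations:
# inputs -> weighted sums -> hidden layer -> weighted sum -> output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

# Illustrative hand-set weights (not learned from any data).
x = [1.0, 0.0, 1.0]
hw = [[0.5, -0.2, 0.8], [-0.4, 0.9, 0.1]]  # two hidden neurons
hb = [0.1, -0.3]
ow = [1.2, -0.7]
ob = 0.05
print(round(forward(x, hw, hb, ow, ob), 3))
```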
Performance Evaluation:
To evaluate the classification performance of machine learning methods, the dataset was divided into a training set comprising 75% of the data and a test set comprising 25% of the data. This process was repeated 10 times. In each iteration, the training and test sets were selected randomly. The mean and standard deviation of the calculated metrics were obtained.
A confusion matrix was used to compare the performance of the classifications. Table 2 shows the four sections of the matrix, referred to as True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
Performance metrics obtained using the confusion matrix are calculated using the following equations (Equations (4)–(9)). Accuracy measures the overall correctness of the model. Kappa indicates how well the classification performs compared to random success. Precision indicates how many of the model’s positive predictions are correct. Recall indicates how many true positives are correctly predicted. Specificity is the rate at which true negatives are correctly classified. The F1 score is the harmonic mean of precision and recall. In addition to accuracy, precision, recall, specificity, and F1 score, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) was employed as an evaluation metric. AUC measures the ability of a classifier to distinguish between positive and negative classes across all possible decision thresholds, providing a single-value summary of model performance that is particularly useful for imbalanced datasets or when sensitivity and specificity need to be jointly considered. A higher AUC value indicates better overall discriminative performance of the model, complementing the other metrics to give a more comprehensive assessment of classification quality. To assess the statistical significance of performance differences between the full dataset and the Apriori-selected feature dataset, Welch’s t-test was employed. All metrics were reported as mean ± standard deviation over 10 repeated runs, and p-values obtained from Welch’s t-test were used to determine significance.
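The confusion-matrix metrics (Equations (4)–(9)) can be computed as follows (a Python sketch on an illustrative confusion matrix, not values from the study):

```python
# Metrics derived from a confusion matrix (TP, FP, TN, FN).
def metrics(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, kappa, precision, recall, specificity, f1

# Illustrative confusion matrix.
acc, kappa, prec, rec, spec, f1 = metrics(tp=75, fp=5, tn=40, fn=10)
print(round(acc, 3), round(rec, 3), round(f1, 3))  # 0.885 0.882 0.909
```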
3. Results
In this study, variables that may be associated with diabetes were analyzed in detail. The dataset, obtained from the UCI Machine Learning Repository and based on real observational data, consists of a total of 17 variables from 520 individuals. These are clinical or subjective symptoms that can be obtained without biochemical tests such as blood tests. Descriptive statistics for the variables are presented in Table 1.
Table 1 shows the distribution of individuals in the study according to their symptoms and certain demographic characteristics. The number of individuals with diabetes is 320, which constitutes 61.5% of the sample. This rate indicates that diabetes is quite prevalent in the study sample.
Looking at the frequency distribution of symptoms, the most common symptom was “weakness,” observed in 305 individuals (58.7%). This symptom is followed by age (47 years and older) at 50%, excessive urination (polyuria) at 49.6%, itching at 48.7%, and delayed healing at 46%. These findings reveal that some classic symptoms associated with diabetes were observed at quite high rates in the sample.
Defining symptoms of diabetes, such as excessive eating (polyphagia) and excessive drinking (polydipsia), were also observed at rates of 45.6% and 44.8%, respectively. Polydipsia and polyuria, in particular, are symptoms that provide important clues for the diagnosis of diabetes, and the high prevalence of these symptoms indicates that the study was conducted with a diabetes-sensitive data set.
Symptoms such as visual blurring and partial paresis are also noteworthy, occurring at rates of 44.8% and 43.1%, respectively. Symptoms such as sudden weight loss, muscle stiffness, and hair loss (alopecia) occur at moderate rates, ranging from 35% to 42%. In contrast, obesity (16.9%), genital yeast infection (22.3%), and nervous irritability (24.2%) were identified as the least frequently observed symptoms in the sample. Female individuals constituted 36.9% of the sample.
The most common symptoms largely overlap with the clinical signs of diabetes. These data provide important clues as to which symptoms require greater attention in the early diagnosis of diabetes.
In recent years, data mining and machine learning techniques have been widely used in the medical field as reliable and complementary decision support systems [35,36]. In this study, the association analysis approach was adopted for the symptom-based early prediction of diabetes, and the Apriori algorithm was applied for feature selection [22]. Thus, symptoms that jointly affect diabetes were identified, and these variables were selected as attributes to be used in the classification process.
There are numerous machine learning-based studies in the literature on the prevention and early diagnosis of diabetes [37]. These studies generally aimed to select diabetes-related features through various algorithms and used these attributes in the modeling processes [7,38,39,40,41]. In contrast to these studies, the association rule mining method, which identifies symptoms frequently observed together, was preferred to determine the relevant attributes in this research.
In the analysis performed using the Apriori algorithm, the most common pairs of symptoms associated with diabetes were listed first, followed by triplets and higher-order associations. The algorithm was run with a minimum support threshold of 3% and a confidence threshold of 60%; rules with the highest support, confidence, and lift values were prioritized. As a result of this analysis, eight prominent symptoms that should be evaluated together in predicting diabetes were identified: Gender (female), Polyuria, Polydipsia, Sudden Weight Loss, Weakness, Visual Blurring, Partial Paresis, and Obesity. The simultaneous positive observation of these eight symptoms increases the likelihood of diabetes by 1.63 times (support: 0.03, confidence: 1.00, lift: 1.63) [22].
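As a quick consistency check, the reported lift follows directly from the rule’s confidence and the prevalence of diabetes in the sample (320 of 520 individuals):

```python
# Lift = confidence / support(consequent); the rule has confidence 1.00
# and the diabetes-positive class covers 320 of the 520 individuals.
confidence = 1.00
support_diabetes = 320 / 520           # ≈ 0.615 prevalence
lift = confidence / support_diabetes   # = 1.625, reported as 1.63
print(f"{lift:.3f}")
```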
As a result of the association analysis, eight attributes were selected from the 16 symptoms, and classification was performed with four different machine learning algorithms (KNN, SVM, ANN, RF) on two different datasets (full and selected attributes). In all analyses, the dataset was randomly split into 75% training and 25% testing; the training–testing–prediction process was repeated ten times, each with different sample contents.
The mean and standard deviations of the performance metrics (Accuracy, Kappa, Precision, Recall, Specificity, F1 Score) obtained after each iteration were calculated. The classification performance results obtained are presented in Table 3 and Figure 2.
This table compares the performance of four different machine learning algorithms (KNN, SVM, ANN, RF) according to two different data usage scenarios (feature-selected dataset and full dataset). Performance evaluation was conducted using six different metrics: Accuracy, Kappa, Precision, Recall, Specificity, and F1 Score.
When Apriori-based feature selection was applied, all methods achieved higher Accuracy, Kappa, Precision, Recall, Specificity, and F1 Score values compared to the full dataset. This indicates that the classification algorithms work more effectively with noise-free, more meaningful features. Feature selection increased the models’ generalization power, ensuring more balanced results in both positive and negative classes.
KNN is the algorithm with the highest Recall value (0.975 ± 0.02) when feature selection is applied. This indicates a very high success rate in capturing positive cases when it comes to a disease such as diabetes.
SVM was the most balanced and highest performing algorithm across all metrics. After feature selection, it achieved the highest levels in critical values such as Accuracy (0.970), F1 Score (0.961), and Specificity (0.975). The balance between Precision and Recall is quite strong, making SVM both a sensitive and selective classifier.
Although the ANN’s F1 score of 0.950 after feature selection indicates a successful model, it performed slightly lower than SVM and RF.
RF, on the other hand, achieved high values, particularly in the Precision (0.959) and Specificity (0.974) metrics. Its overall success level is quite high, with an F1 Score of 0.958. This model, based on a collection of decision trees, can successfully learn complex relationships and be resilient to the risk of overfitting.
The highest overall performance was demonstrated by the SVM model with feature selection applied. In terms of not missing positive diabetes cases (high Recall), the KNN algorithm stands out. Especially in critical areas such as healthcare, the overall balanced performance of a model is as important as its sensitivity and specificity; from this perspective, SVM is a strong candidate for clinical decision support systems. Feature selection significantly improved model performance in the data mining process, emphasizing the importance of the data preprocessing stage. In other words, feature selection performed using the Apriori algorithm significantly improved the classification performance of the machine learning models, showing that the association rules determined by the algorithm improve generalization ability and optimize classification success. The highest overall performance was observed particularly in the SVM algorithm; this finding reveals that pre-analysis using association rule mining can make a strong contribution to classification processes. In conclusion, Apriori-based feature selection offers an effective and explainable approach for classification models working with health data.
4. Discussion
The increasing prevalence of diabetes necessitates the development of accurate and reliable predictive models for the early diagnosis and effective management of the disease. Such models can enable the identification of individuals at risk, thereby providing an opportunity for timely intervention to prevent or delay diabetes-related complications [11]. Continuous developments in the field of machine learning (ML) stand out as an important trend that is closely followed in the healthcare sector [11,42,43].
In the study by Aglarci and Karakurt (2025) [22], the Apriori algorithm was used only to reveal the associations between symptoms, and it was suggested that it be evaluated in the context of feature selection in the future. The present study takes that suggestion a methodological step further by integrating Apriori-based feature selection with machine learning [22].
In this study, diabetes prediction was performed using only a symptom-based dataset. The attributes in the dataset were analyzed using the Apriori algorithm to select features, and classification performance was then evaluated by applying common machine learning algorithms (KNN, SVM, ANN, and RF) to the selected features. Validation was repeated 10 times with a 75–25% random data split, and results were reported as mean ± standard deviation; this represents repeated random validation rather than formal cross-validation. The complex nature of diabetes makes it difficult to obtain accurate predictions. The large number of features in high-dimensional datasets can cause a model to overfit and reduce its interpretability, and the presence of irrelevant or unnecessary features can prevent the model from learning meaningful patterns. Therefore, effective feature selection techniques aimed at identifying the most relevant features are critical for improving model accuracy and generalizability [10]. Indeed, studies have reported significant improvements in model accuracy and interpretability by selecting a subset of relevant features [5,44].
The results obtained in our study show significant performance improvements across all algorithms used. In [4], which adopted a different feature selection approach on the same dataset, 13 features were identified from 16 symptoms (polydipsia, polyuria, gender, sudden weight loss, partial paresis, irritability, polyphagia, age, alopecia, visual blurring, weakness, genital thrush, and muscle stiffness), and a classification accuracy of 98% was achieved using the KNN and RF methods with these features. While the “obesity” attribute was excluded in that study, it was considered in ours. Furthermore, the variables irritability, polyphagia, age, alopecia, genital thrush, and muscle stiffness, which were not included in our attribute set, were selected as attributes in the aforementioned study.
Studies aimed at identifying risk factors associated with diabetes yield different results depending on the methodological approaches used. In this context, it is emphasized that more research is needed on predicting diabetes risk factors [1]. In [1], Principal Component Analysis (PCA) was used for feature extraction and Information Gain (IG) techniques for feature selection; accuracy rates above 82.2% were obtained in analyses performed with SVM, RF, and KNN methods, and it was stated that feature selection improved model performance.
Similarly, in [15], the number of features was reduced through metaheuristic methods, thereby increasing the success rate. Using the same dataset, 12 out of 16 features were identified, and 99% accuracy was achieved using the KNN method. In the same study, the accuracy rate remained between 90% and 95% in the classification performed with the selected 8 features. When these results are compared, our study shows that higher performance was achieved using fewer features.
Ref. [
18] used F-Score-based feature selection and Fuzzy Support Vector Machines (FSVMs) for the classification and detection of Diabetes Mellitus, achieving an accuracy of 89.02%. In Ref. [
13], the k-best feature selection technique and extra trees method were used to achieve an accuracy rate of 92.5%.
In another study using the same dataset [
21], feature selection using the greedy hill climbing method resulted in the identification of 11 out of 16 features: age, gender, polyuria, polydipsia, sudden weight loss, polyphagia, itching, irritability, delayed healing, muscle stiffness, and alopecia. Predictions made with these features yielded accuracy rates of 92.3%, 93.6%, and 97.3% for the SVM, KNN, and MLP methods, respectively. The features age, polyphagia, itching, irritability, delayed healing, muscle stiffness, and alopecia were not selected in our study; nevertheless, high accuracy was achieved with fewer features.
In [
14], which used spiral-based feature selection, Grey Wolf Optimization (GWO) and Adaptive Particle Swarm Optimization (APGWO) methods were applied, selecting 13 and 10 features, respectively. Accuracies of 96% and 97%, respectively, were achieved with the MLP classifier. The GWO method excluded the weakness and obesity features, while APGWO excluded sudden weight loss, weakness, polyphagia, genital thrush, irritability, and partial paresis from the analysis.
The analysis conducted using Shapley values in [
45] demonstrated that even identifying only the three most effective features can significantly improve model performance. The literature [
16] indicates that selecting the best feature set is as important as finding the best classification model. A comprehensive review of previous studies identified the following research gaps: the predictive accuracy of diabetes diagnosis remains a challenging issue and therefore requires further research, and feature selection has been addressed in only a few studies. In this regard, Ref. [
16] compared different feature selection techniques such as the non-dominated sorting genetic algorithm II (NSGA-II), Genetic Algorithms (GA), Principal Component Analysis (PCA), and Particle Swarm Optimization (PSO). With the feature set obtained using NSGA-II, SVM achieved 86.8% accuracy and MLP achieved 98% accuracy. It was also noted that feature selection not only increased classification accuracy but also reduced computational cost.
The literature [
9] shows that the model accuracy for diabetes prediction increased from 83% to 93% using a random forest-based weighted feature selection algorithm. In a similar study [
46], features were extracted using the partial least squares method, and 74.40% accuracy was achieved with LDA. Reference [
47] combined SVM and Feedforward Neural Network (FFNN) models with features selected by LDA and achieved 75.65% accuracy.
In Ref. [
48], mutual information (MI) and the linear correlation coefficient (LCC) were used in the proposed feature selection algorithm; an accuracy of 86% was achieved with the six selected features. In Ref. [
49], Chi-Square and PCA-based feature selection was compared, and the Chi-Square method achieved the highest performance with 85% accuracy.
These studies clearly demonstrate that feature selection plays a critical role in improving the accuracy, generalizability, interpretability, and efficiency of classification models. Feature selection not only reduces model complexity but also increases the clinical applicability of prediction systems developed in the healthcare field.
In conclusion, this study demonstrates that a symptom-based, explainable feature selection approach built on the Apriori algorithm can yield consistent classification performance on a dataset widely used in the literature. Compared with several previous studies employing the same or similar datasets, comparable or improved accuracy and F1 scores were obtained with fewer attributes, highlighting the potential utility of the proposed approach in an exploratory setting. These results suggest that the association analysis-based approach enhances interpretability while also being associated with improved classification performance. Within the scope of this study, the findings indicate the potential of symptom-based, fast, and cost-effective screening models.
However, most feature selection methods in the literature are correlation- or filter-based. The Apriori algorithm used in this study, by contrast, considers the co-occurrence structure of symptoms, so that the selected feature sets are not only statistically but also clinically meaningful. For example, the Apriori algorithm highlighted not only classic diabetes symptoms such as polyuria and polydipsia but also weakness, partial paresis, and visual blurring, drawing attention to attributes that are often overlooked in the literature.
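The co-occurrence-based selection idea can be illustrated with a minimal level-wise Apriori sketch. The symptom records below are hypothetical examples, not the study's data, and the support threshold is arbitrary; the point is how frequent symptom sets are grown one item at a time and pruned by minimum support.

```python
def support(itemset, transactions):
    """Fraction of records containing every symptom in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori_frequent_sets(transactions, min_support):
    """Level-wise Apriori: grow candidate symptom sets one item at a time,
    pruning any set whose support falls below min_support."""
    items = {s for t in transactions for s in t}
    level = [frozenset([s]) for s in items
             if support(frozenset([s]), transactions) >= min_support]
    all_frequent = {fs: support(fs, transactions) for fs in level}
    k = 2
    while level:
        # Candidate generation: join frequent sets from the previous level
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if support(c, transactions) >= min_support]
        all_frequent.update({fs: support(fs, transactions) for fs in level})
        k += 1
    return all_frequent

# Hypothetical binary symptom records, each a set of symptoms present
records = [
    frozenset({"polyuria", "polydipsia", "weakness"}),
    frozenset({"polyuria", "polydipsia", "visual_blurring"}),
    frozenset({"polyuria", "polydipsia"}),
    frozenset({"weakness"}),
]
freq = apriori_frequent_sets(records, min_support=0.5)
print(freq[frozenset({"polyuria", "polydipsia"})])  # co-occur in 3/4 records -> 0.75
```

Symptoms that survive this pruning together, rather than individually, are exactly the co-occurring clusters the discussion above refers to.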
One limitation of this study is that the dataset does not cover all age groups. Given the increasing prevalence of diabetes driven by unhealthy diets and sedentary lifestyles, uncontrolled diabetes across age groups, including young people, may lead to complications at an early age. Another limitation is that the data are drawn from a single country. Applying the proposed technique to larger samples from different countries and hospitals would help validate its accuracy and reliability. Future studies will test the proposed method on different datasets.
5. Conclusions
This study has demonstrated that feature selection using the Apriori algorithm in the analysis of symptom-based data for diabetes prediction improves the success of machine learning models. In particular, eight symptoms with high support, confidence, and lift values play a critical role in diabetes prediction. Performance comparisons using four different classification algorithms revealed significant increases in metrics such as accuracy, precision, sensitivity, and specificity in datasets where feature selection was applied.
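The three rule-quality metrics named above have standard definitions, sketched below on illustrative data. The records and the example rule are hypothetical and are used only to show how each value is computed, not to reproduce the study's results.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Standard association-rule metrics for the rule antecedent -> consequent."""
    freq = lambda items: sum(items <= t for t in transactions) / len(transactions)
    sup = freq(antecedent | consequent)   # support:    P(A and B)
    conf = sup / freq(antecedent)         # confidence: P(B | A)
    lift = conf / freq(consequent)        # lift:       P(B | A) / P(B)
    return sup, conf, lift

# Illustrative records: symptom sets tagged with the class label "diabetes"
records = [
    frozenset({"polyuria", "polydipsia", "diabetes"}),
    frozenset({"polyuria", "diabetes"}),
    frozenset({"polydipsia"}),
    frozenset({"weakness"}),
]
sup, conf, lift = rule_metrics(frozenset({"polyuria"}), frozenset({"diabetes"}), records)
print(sup, conf, lift)  # 0.5 1.0 2.0
```

A lift above 1 indicates that the antecedent symptoms make the consequent more likely than its baseline rate, which is why high-lift symptom sets are useful predictors.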
While the SVM algorithm stood out in terms of overall performance, the KNN algorithm showed the highest sensitivity in identifying positive cases. These results demonstrate that both model selection and appropriate feature determination are of great importance in diabetes prediction.
The symptom-based dataset used in this study does not include biochemical parameters such as blood test results; instead, it relies solely on variables that can be obtained through clinical observation and patient reports. In this respect, the model offers a practical and effective structure for developing decision support systems for preliminary field screening.
Future studies should perform comparative analyses on similar datasets using different association algorithms and deep learning-based architectures. Furthermore, analyses conducted on different demographic groups and larger samples will increase the model’s generalizability.