#### *2.2. Model Framework*

Figure 1 shows the assessment framework used in this study for diabetes patient screening. The assessment was performed in six steps. First, the real-life diabetes mellitus data were acquired and preprocessed to select appropriate attributes; these data were then used for evaluation and assessment. Second, updated plugins of two machine learning rule classifiers (PART and Decision table) were run on the Weka data mining platform (version 3.9.2) for classification measurements and rule assessment [33]. In addition, logistic regression was applied to the results of the machine learning classifiers to forecast the rule assessment.

**Figure 1.** Assessment framework used in this study for clinical implication screening.

#### *2.3. Data Collection and Explanation*

The real-life diabetes mellitus data of 1257 patients, recorded from December 2017 to February 2019, were acquired from four main hospitals in the northwestern region of Nigeria and carefully examined: Abdullahi Wase Specialist Hospital (22.75%), Ajingi General Hospital (22.04%), Federal Medical Center Birnin-Kudu (26.81%), and Gaya General Hospital (28.40%). The data were collected through questionnaires, verbal interviews, and consultation with medical specialists, after the ethics committee of the institute where the research was carried out approved the study protocols. The data collection flow is shown in Figure 2, and the number of patients in each hospital is shown in Figure 3.

**Figure 2.** The data collection flow of diabetes patients from the four hospitals.

**Figure 3.** Total number of diabetes patients recorded in the four hospitals.

#### *2.4. Attributes Selection*

In our prediction assessment of diabetes mellitus prevalence, we used data on 10 easily available attributes/variables: age, gender, GLU (glucose level of the patient), BMI (body mass index of the patient), HYP (hypertension status), HCD (history of cardiovascular disease), FDH (family history of diabetes), PEX (physical exercise), STW (work stress status), and DIT (diet of the patient, healthy or unhealthy). Of the 1257 records, 587 were missing values for body mass index, glucose level, hypertension, cardiovascular disease, work stress status, family history of diabetes, physical exercise, or diet. A further 389 records were removed from the assessment dataset because of missing values in pre-diabetes status. Therefore, 281 records (1257 − 587 − 389) with 10 variables were used in the prediction analysis.
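
The record counts above can be checked with a short sketch. It is illustrative only: the study performed preprocessing in Weka, and the dictionary-based record layout, the use of `None` for a missing value, and the helper names `is_complete` and `filter_complete` are assumptions introduced here, not part of the original workflow.

```python
# Attribute abbreviations as listed in the text.
INPUT_ATTRIBUTES = ["age", "gender", "GLU", "BMI", "HYP",
                    "HCD", "FDH", "PEX", "STW", "DIT"]


def is_complete(record, fields=INPUT_ATTRIBUTES):
    """Return True when every listed field is present and not None.

    Assumption: a missing value is stored as None (or the key is absent).
    """
    return all(record.get(field) is not None for field in fields)


def filter_complete(records, fields=INPUT_ATTRIBUTES):
    """Keep only the records that have a value for every listed field."""
    return [r for r in records if is_complete(r, fields)]


# Arithmetic check of the counts reported in the text.
total_records = 1257
missing_input_values = 587       # missing BMI, glucose, hypertension, etc.
missing_prediabetes_status = 389
remaining = total_records - missing_input_values - missing_prediabetes_status
print(remaining)  # 281

# Tiny hypothetical example: the second record lacks a BMI value.
full = {name: 1 for name in INPUT_ATTRIBUTES}
partial = dict(full, BMI=None)
print(len(filter_complete([full, partial])))  # 1
```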

#### *2.5. Attribute Parameters*

The 10 features selected in this study were characterized as follows. Age and gender represented demographic characteristics. Glucose level (GLU) was recorded in mmol/L and is related to age and diet. Family history of diabetes was defined as any family member previously diagnosed by a physician as diabetic or pre-diabetic (Yes = 1, No = 0). BMI was calculated as body weight (kg) divided by the square of height (m), and BMI ≥ 25 was defined as overweight. History of cardiovascular disease or stroke was defined as the patient having been previously diagnosed with coronary heart disease or stroke by a surgeon (Yes = 1, No = 0). Physical exercise indicated whether the patient engaged in exercise (Yes = 1, No = 0). Work stress was measured according to the patient's subjective impression (Yes = 1, No = 0). Diet was recorded as balanced or unbalanced (Yes = 1, No = 0). HYP was defined in three ways: a systolic BP (blood pressure) ≥ 140 mmHg, a diastolic BP ≥ 90 mmHg, and use of medication for BP control.
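
As a minimal sketch of how two of these definitions could be coded, the functions below compute BMI and a binary hypertension flag. The function names are hypothetical, and treating the three HYP criteria as alternatives (any one is sufficient) is an assumption, since the text does not say how they are combined.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def is_overweight(bmi_value):
    """Overweight threshold used in the text: BMI >= 25 (Yes = 1, No = 0)."""
    return int(bmi_value >= 25)


def has_hypertension(systolic_mmhg, diastolic_mmhg, on_bp_medication):
    """Binary HYP flag (Yes = 1, No = 0).

    Assumption: any one of the three criteria is sufficient.
    """
    return int(
        systolic_mmhg >= 140
        or diastolic_mmhg >= 90
        or bool(on_bp_medication)
    )


# Hypothetical patient: 80 kg, 1.75 m, BP 135/92 mmHg, no BP medication.
value = bmi(80, 1.75)
print(round(value, 1))                   # 26.1
print(is_overweight(value))              # 1
print(has_hypertension(135, 92, False))  # 1
```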

#### *2.6. Data Mining Platform*

The Waikato Environment for Knowledge Analysis (Weka, version 3.9.2) was used for the preprocessing and classification assessment of diabetes mellitus. Updated plugins of the Kmean clustering algorithm were used to assign a class to each record in the 10-variable dataset, labeling it as positive or negative for testing (positive means diabetes and negative means normal status) [34]. After assessment, positive patients were declared as high in diabetes status and negative patients as normal for the initial screening by proper forecast assessment. The advantage of using Weka is the avoidance of overfitting and unnecessary complexity.

In addition, the rule algorithms (PART and Decision table) were adopted for accurate measurements, and logistic regression was applied to the classification assessment to forecast diabetes prevalence for clinical implications.

After data preprocessing, the final dataset included 281 patient records, covering both males and females, with 11 attributes. The population sample included patients with Type 1 (insulin-dependent) diabetes mellitus, Type 2 (non-insulin-dependent) diabetes mellitus, and gestational diabetes. Of the 11 attributes, 10 were input attributes and one was the target attribute. The target attribute consisted of two classes, tested positive for diabetes and tested negative, which were assigned by Kmean by finding the clusters whose members are more related to each other, at a significance level of 0.05 [35].

Kmean is a typical distance-based clustering algorithm in which similarity is measured by distance. Its process steps measure the distance between each object and the cluster centers using Equations (1)–(3), as follows:

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \ \forall j,\ 1 \le j \le k \right\}, \tag{1}$$

$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j, \tag{2}$$

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2, \tag{3}$$

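where $S_i^{(t)}$ is the set of data points assigned to cluster $i$ at iteration $t$, $m_i^{(t)}$ is the center (mean) of that cluster, $n$ is the number of data points in the clusters, $k$ is the number of cluster centers, and $\| x_i^{(j)} - c_j \|$ represents the Euclidean distance between data point $x_i^{(j)}$ and cluster center $c_j$. In addition, the Kmean clustering algorithm is composed of the following steps.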


$$
\operatorname*{arg\,min}_{\mathbf{c}_j \in \mathbb{C}} \operatorname{dist}(\mathbf{c}_j, \mathbf{x})^2. \tag{4}
$$
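
To make Equations (1)–(4) concrete, the following is a minimal pure-Python sketch of the assignment step, the center update, and the objective $J$. It is not the Weka implementation used in the study. The function names, the toy two-dimensional points, and the choice to pass the initial centers explicitly are assumptions made for illustration. The sketch also does not decide which cluster corresponds to the positive or the negative class.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Assignment step (Eqs. 1 and 4): index of the nearest center per point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Update step (Eq. 2): each center becomes the mean of its members.

    An empty cluster keeps its previous center (an assumption of this sketch).
    """
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            mean = tuple(sum(coord) / len(members) for coord in zip(*members))
            new_centers.append(mean)
        else:
            new_centers.append(old_center)
    return new_centers


def objective(points, labels, centers):
    """Objective J (Eq. 3): sum of squared distances to the assigned centers."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Toy data: two well-separated groups, one initial center taken from each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial centers: a poor starting choice can leave the algorithm in a local minimum of $J$, which is why implementations commonly try several initializations.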
