1. Introduction
Encapsulation is widely used in numerous industries, including food, cosmetics, biology, agriculture, and pharmacy [1,2,3]. The technique stabilizes active compounds by building protective systems around them to preserve their physical, chemical, and biological properties under various conditions. There are two main kinds of encapsulation techniques: physical and chemical. The first relies on manipulating intensive properties, such as temperature and pressure [3], whereas the second, chemical encapsulation, leverages the active compounds’ chemical properties [4]. Despite the technique’s extensive range of applications, its complexity raises significant obstacles [5]. One of these obstacles is the compatibility between the active compound and the shell material (encapsulant compound) [6,7]. Among the many compounds to which the technique is applied, antioxidant compounds (ACs) stand out for their value to the abovementioned industries.
Antioxidants include polyphenols, carotenoids, anthocyanins, catechins, vitamins, and polyunsaturated fatty acids, and they can be found in fruits, vegetables, cereals, and plants [8,9]. These compounds serve multiple functions. In plants, for example, they provide structural support and defense against environmental factors such as biotic and abiotic stress, ultraviolet radiation, and pathogens. Consumers, on the other hand, benefit from their protection against noncommunicable diseases (NCDs), which these compounds exert by regulating cellular processes such as enzyme inhibition, gene expression, and protein phosphorylation. They also affect the sensory qualities of fruits and vegetables, such as bitterness, color, and flavor, giving each product a unique sensory profile depending on the compounds present [10]. Owing to these capabilities, the bioactive compounds mentioned above can be utilized in various industries. Nevertheless, their potential is constrained by obstacles that must be overcome: the bioavailability, bioaccessibility, and bioactivity of antioxidant compounds are currently limited by negative extrinsic factors such as poor stability in various media [11], the lack of transport vehicles, and challenging absorption [12]. Encapsulation could address these concerns, promote stability, and maintain or even increase the stated capabilities [13].
Compatibility between the active and encapsulant compounds (WMs) is crucial in encapsulation. This relevance stems from the affinity between the compounds, which is influenced by various chemical properties. Finding suitable matches is usually complex, requiring costly and time-consuming trial-and-error experiments [6]. To bridge this gap, the design and implementation of a computational tool could support decision-making and reduce the time and resources required for experimental development [14]. Various tools could be proposed in this regard; however, there have been remarkable advancements in machine learning (ML) for classification problems in recent years, and several studies have applied ML to determine compatibility between substances. Qi et al. suggested utilizing artificial intelligence (AI) and ML to facilitate drug development and identify protein–protein and drug–target interactions [15]. D’Souza et al. examined the application of deep learning (DL) in several cheminformatics methodologies for predicting the binding affinity between chemicals and proteins, focusing on drug–target interactions. Liu et al. used a multiple linear regression (MLR) algorithm to build QSAR models of estrogen receptor binding affinity [16]. Rege et al. used Support Vector Machine (SVM) regression with bootstrapping to create and validate QSAR models examining the DNA-binding characteristics of a library of aminoglycoside–polyamine compounds [17]. Rosas-Jiménez et al. employed ML algorithms (MLR, KNN, and RF) to construct quantitative structure–activity relationship (QSAR) models for cruzain inhibitors, using molecular descriptors to accurately forecast the inhibitors’ biological activity [18]. Krenn et al. investigated the application of generative models, specifically variational autoencoders (VAEs), to create new molecules and predict their properties with QSAR models [19]. Using computational methods to anticipate the binding affinity between compounds and targets significantly increases the likelihood of identifying lead compounds by minimizing the need for wet-lab studies. ML and DL methods employing ligand-based and target-based strategies have been used to forecast binding affinities, resulting in time and cost savings in drug discovery.
Nevertheless, ML is not limited to the pharmaceutical industry when predicting compatibility [20]. Piotrowsky found that scientific research on the corrosion of aerosol canisters is extensive but insufficient, and that only some of the physical and chemical principles involved can be used to predict corrosion accurately; moreover, the outcome is influenced by many parameters and formulation components. Piotrowsky’s paper adopted a data-driven methodology to address these limitations and forecast the compatibility between a novel product’s formulation and its packaging. The model exemplifies a classification method, a subset of supervised machine learning, which uses the input values provided for training to derive a conclusion and produce an output that classifies a dataset into distinct categories [21]. Periwal et al. proposed another use case for ML in predicting compound compatibility: they assessed the functional similarity between natural compounds and approved drugs by combining various chemical similarity metrics and physicochemical properties in a machine learning approach [22]. This study parallels the present work, since both employ machine learning to examine chemical data and generate predictions. By incorporating a range of chemical similarity measures and physical features, the machine learning approach enables a thorough evaluation of compound compatibility.
Therefore, a computational tool incorporating ML capabilities could be advantageous [23]. Since ML automates the construction of analytic models from data, this work seeks a model with self-learning capabilities and adaptability of the prototype. The K-nearest neighbors (KNN) algorithm is used to construct a classification model that determines the compatibility between ACs and WMs, given its recognized capability in similar classification problems. To ascertain the compounds’ eligibility for encapsulation, the model assesses their compatibility by analyzing the relationships between their molecular descriptors.
The paper is structured as follows. Section 2 involves gathering and organizing data: two databases are assessed to analyze the connections between the encapsulation process and various antioxidant and encapsulant compounds; the model parameters are then determined using the RDKit cheminformatics toolkit to extract the molecular descriptors of each compound; a review of the existing research considers the reported compatibility and incompatibility between substances; and, to increase the variety of encapsulant compounds and the number of possible combinations, the computational tool LIDEB’s Useful Decoys (LUDe) is used to build decoys. Section 3 establishes the connections between ACs and WMs by identifying comparison and difference relations, which are determined throughout the cleaning process. Section 4 involves statistical analysis to examine the interrelationships between variables using tools such as histograms, scatter plots, and Pearson and Kendall correlations. In Section 5, a Principal Component Analysis (PCA) is performed to decrease the complexity of the dataset and assess the influence of individuals and variables in each principal component. Section 6 involves the implementation of K-nearest neighbors (KNN) to predict the compatibility of combinations of ACs and WMs. Finally, Section 7 presents some concluding remarks and insights for further developments.
4. Exploratory Analysis
Before the statistical analysis, the dataset containing 288 observations and 22 variables was separated into success and failure cases; the Case variable facilitates this separation into two distinct subsets. The first dataset describes the compatibility of the compounds, while the second describes their incompatibility. A pairs panel graph was then created to illustrate the descriptive statistics (Figure 1 and Figure 2). It is a scatter plot matrix (SPLOM) with bivariate scatter plots below the diagonal, histograms on the diagonal, and Pearson correlations above the diagonal [42].
Note that a correlation cannot be visualized for a parameter whose variance is zero, i.e., when all observations have the same value for that parameter. Parameters with zero variance were therefore removed from their respective datasets: C-MW, C-HBA, C-PSA, C-HAC, and Case were eliminated from the success dataset, while C-HBA and Case were removed from the failure dataset. The behavior of the remaining parameters within their respective datasets indicates the significance of each in establishing the compounds’ compatibility or incompatibility.
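For illustration, the following R sketch reproduces this preprocessing step under a few assumptions: a data frame named encap_data holds the 288 observations with numeric columns and a binary Case variable (the name and encoding are hypothetical, not taken from the original implementation). It splits the data by Case, drops zero-variance columns, and draws the pairs panel with the psych package.

library(psych)

# Illustrative data frame: 288 observations, numeric descriptor columns,
# and a binary Case column (assumed 1 = success, 0 = failure).
success <- subset(encap_data, Case == 1)
failure <- subset(encap_data, Case == 0)

# Drop columns whose variance is zero within a subset (all values identical).
drop_zero_var <- function(df) {
  keep <- vapply(df, function(col) var(as.numeric(col)) > 0, logical(1))
  df[, keep, drop = FALSE]
}
success <- drop_zero_var(success)   # removes C-MW, C-HBA, C-PSA, C-HAC, Case
failure <- drop_zero_var(failure)   # removes C-HBA, Case

# Pairs panel: scatter plots below the diagonal, histograms on the diagonal,
# Pearson correlations above the diagonal (as in Figures 1 and 2).
pairs.panels(success, method = "pearson", hist.col = "grey80")
pairs.panels(failure, method = "pearson", hist.col = "grey80")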
4.1. Histograms
This type of graph represents the distribution of the data, allowing its pattern, shape, and characteristics to be observed. The Gaussian (normal) distribution, represented by the curve superimposed on each histogram, is a continuous probability distribution that reveals the underlying characteristics of the data.
The shape of the Gaussian curves allowed the data distribution to be characterized. If the curve is bell-shaped, the data are evenly distributed around the mean, with an equal number of values on both sides, as determined by the mean (μ) and standard deviation (σ). The distribution may also be bimodal or skewed in either direction. When the distribution has a longer tail on one side, the skewness indicates a greater frequency of values in the direction of the trend: positive skewness indicates that the tail extends toward higher values, whereas negative skewness indicates that the tail extends toward lower values.
Bimodal distributions, on the other hand, have two distinct peaks, indicating the existence of two separate groups or phenomena. Regarding the spread of a histogram, the standard deviation describes the range of values it covers: the greater the spread, the wider the range of values and the more dispersed the data, whereas a narrower spread indicates a more concentrated distribution, with values clustered closely around the mean. The standard deviation thus quantifies the dispersion or variability of the data [43,44].
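As a minimal sketch of how such a histogram with a superimposed Gaussian curve can be produced in R, assuming the success subset from the previous sketch and an illustrative column name dMW standing in for the ΔMW parameter:

# Histogram of one parameter with a superimposed Gaussian curve built from
# the sample mean and standard deviation.
x     <- success$dMW            # illustrative column name
mu    <- mean(x)
sigma <- sd(x)
hist(x, freq = FALSE, breaks = 20,
     main = "Delta MW (success)", xlab = "Delta MW")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)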
Most parameters show similar behavior in both datasets. The following parameters exhibited the same behavior in the success and failure datasets, judged by their Gaussian distribution and data spread: ΔMW shows a bimodal distribution, referring to two distinct and separate clusters of data, and no spread; C-logP does not exhibit a Gaussian distribution and has highly spread bars, indicating widely dispersed data with significant variability; ΔHAC shows a bimodal distribution with low spread; and C-AGC does not exhibit a Gaussian distribution and has widely dispersed data. ΔHBA and ΔHBD, on the other hand, show the same Gaussian distribution in both datasets but different data spreads: ΔHBA shows a bimodal distribution with low data spread, whereas ΔHBD exhibits skewness, indicating that the distribution is not symmetric, with most values located on the left side and fewer values extending to the right. ΔlogP, C-HBD, ΔPSA, C-CGC, ΔCGC, C-CnGC, ΔCnGC, C-HGC, and ΔAGC exhibit different Gaussian distributions, summarized in Table 4, as well as low, medium, or high spread, where low means the data are barely spread, medium means the data are moderately spread, and high means the data cover a wide range of values. Lastly, ΔHGC shares neither a similar Gaussian distribution nor a similar spread between the two datasets.
It is important to note that the highly spread data pertain to the comparison parameters, which is attributable to their binomial data type. The Δ parameters, in contrast, never exhibit high spread, only no, low, or medium spread; this corresponds to their data type and indicates that the differences between ACs and WMs are not widely dispersed and take more consistent values.
4.2. Scatter Plots
This type of graph displays relationships between variables, whose behavior can be characterized by direction, strength, and shape. The direction determines whether there is a positive, negative, or no apparent relationship between variables: in a positive relationship, as one variable increases the other tends to increase as well; in a negative relationship, as one variable increases the other tends to decrease; and no apparent relationship indicates that the variables are unrelated or independent. The strength indicates how tightly clustered the data of both variables are: a tighter cluster indicates a stronger relationship, whereas looser clustering indicates a moderate or weak relationship. The shape, shown by the trend line, identifies whether the relationship is linear (a straight trend line) or nonlinear (a curved trend line) [44,45].
Thus, the scatter plots of the pairs panel graphs for both datasets demonstrate the same behavior for most of the data. The lack of apparent direction, the weak strength, and the linear relationships between the comparison parameters can be attributed to the nature of this variable type, and outliers are typically responsible for the poor clustering of the data. The Δ variables, in contrast, usually show no apparent relationships, weak strength, and nonlinear form. Regarding the relationships between the comparison and Δ variables, there was no discernible direction, weak strength, and a nonlinear or linear shape, depending on the variables involved.
4.3. Pearson and Kendall Correlations
The Pearson correlation coefficient, frequently denoted by the symbol r, measures the linear relationship between two continuous variables. It quantifies the degree to which the variables move linearly together. The Pearson correlation coefficient ranges from −1 to +1, where −1 represents a perfect negative linear relationship, +1 represents a perfect positive linear relationship, and 0 represents the absence of a linear relationship [46,47]. In contrast, the Kendall correlation coefficient, denoted by τ (tau), is a rank-based correlation measure. It evaluates the strength and direction of the relationship between variables based on their ranks or ordinal positions. Kendall correlation applies to continuous and discrete variables, making it useful for analyzing nonlinear and nonparametric relationships [46,47,48].
Pearson and Kendall correlations serve the same purpose of evaluating the strength and direction of the relationship between variables, but they are not interchangeable: they differ in their underlying assumptions and in how they process the data, which adds a complementary layer of interpretation.
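A short R sketch of how both correlation matrices can be computed and displayed, assuming the numeric success subset from the earlier sketches; the corrplot package is shown only as one possible choice for the heat maps, since the paper does not state which plotting tool was used:

library(corrplot)

# Pearson (linear) and Kendall (rank-based) correlation matrices for the
# success subset; the same calls apply to the failure subset.
pearson_mat <- cor(success, method = "pearson")
kendall_mat <- cor(success, method = "kendall")

# Heat maps analogous to Figures 3 and 4.
corrplot(pearson_mat, method = "color", type = "upper", tl.cex = 0.7)
corrplot(kendall_mat, method = "color", type = "upper", tl.cex = 0.7)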
Figure 3 displays the strength and direction of the linear relationship between variables for the Pearson correlation plots. Positive correlations are indicated by ascending values and more vibrant red hues, negative correlations by descending values and more vibrant purple hues, and no correlation by middle-ground values and white hues. For the Kendall correlation plots, Figure 4 depicts the strength and direction of the monotonic relationship between variables. Similarly, positive and negative correlations are presented, emphasizing monotonicity rather than linearity.
Figure 3a depicts robust positive linear correlations between the Δ variables. On the other hand, except for ΔHBD and ΔlogP, the C-logP parameter exhibits a negative or no linear correlation with the remaining parameters. In contrast, in Figure 3b, the variables in the upper portion of the graph have positive linear relationships, whereas the variables in the left portion, from C-CGC to C-AGC, and in the lower right portion have negative linear relationships. Comparing the Pearson correlation results of the success and failure datasets reveals contradictory behaviors, indicating that the correlations involving the comparison values vary with the case: the comparison variables show no correlation when determining compatibility but a negative linear correlation when determining incompatibility.
The Kendall correlations in Figure 4a, in turn, reveal positive monotonic correlations between most Δ variables and C-CGC and C-HGC, whereas the remaining comparison variables and ΔlogP exhibit negative monotonic or no monotonic correlations. Regarding
Figure 4b, the correlations in the upper right portion of the graph are positive monotonic, with a small proportion of negative monotonic correlations dispersed throughout the graph. This graph displays more positive monotonic correlations, and fewer variables with no monotonic correlation, than Figure 4a.
In conclusion, the Pearson and Kendall correlations for the analysis of the success dataset (Figure 3a and Figure 4a) exhibit a similar distribution of correlations around the variables. In contrast, the correlations between the variables in the graphs for the failure dataset (Figure 3b and Figure 4b) differ significantly, with Figure 3b exhibiting stronger positive correlations than Figure 4b.
5. Principal Component Analysis (PCA)
PCA is one of a family of approaches for representing high-dimensional data in a lower-dimensional, more manageable form without sacrificing too much information. It is one of the most straightforward and reliable dimension-reduction techniques and also one of the oldest, having been repeatedly rediscovered in other domains; it is therefore also known as the Karhunen–Loève transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition [49]. In this work, a PCA was performed on each dataset (success and failure) through a singular value decomposition of the centered and scaled data matrix (mean 0, variance 1), rather than through an eigendecomposition of the covariance matrix [50]; the results are presented in Table 5.
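A minimal R sketch of this step, assuming the success and failure data frames defined earlier; prcomp() uses exactly the SVD-based computation described above:

# prcomp() centers and scales the data (mean 0, variance 1) and computes the
# PCs via a singular value decomposition of the data matrix, not via eigen()
# on the covariance matrix.
pca_success <- prcomp(success, center = TRUE, scale. = TRUE)
pca_failure <- prcomp(failure, center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion per
# component, as reported in Table 5.
summary(pca_success)
summary(pca_failure)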
PCA transforms the original variables into a set of principal components (PCs), reducing the dimensionality of the data and identifying patterns and structure in multivariate datasets [42]. The standard deviation quantifies the dispersion or spread of the data along each PC; larger standard deviation values indicate a greater spread along the respective PC. The proportion of variance represents the share of the total variability in the data explained by each principal component, with larger values indicating that the respective PC captures a greater amount of variability. Lastly, the cumulative proportion represents the cumulative amount of total variation explained by a set of PCs; higher cumulative proportion values indicate that those principal components together explain a greater proportion of the total variation in the data [42,50,51].
PC1 has the largest standard deviation in both datasets, indicating that it captures the most data variability. PC1 explains 32% of the total variation in the success data and 31% in the failure data, according to its relatively high proportion of variance. The cumulative proportion for PC1 demonstrates that it alone accounts for a considerable portion of the total variation, indicating its importance in describing the overall patterns and structure of the data. PC2 has a lower standard deviation than PC1 in both datasets, indicating that it captures less variability but is still meaningful. Although smaller than that of PC1, the proportion of variance of PC2 is still substantial, explaining an additional 21% of the variation in both datasets; its cumulative proportion demonstrates its contribution to capturing additional variation, enhancing the representation provided by PC1. Together, PC1 and PC2 cover 53% of the variation in the success dataset and 51% in the failure dataset. PC3 has the lowest standard deviation of the principal components considered in both datasets, indicating that it captures the least variance; its proportion of variance is the lowest, so it contributes to capturing additional variation, albeit to a lesser extent than PC1 and PC2.
Table 6 displays the contribution of each variable to PC1 and PC2 for the success and failure datasets. For PC1 of the success dataset, ΔMW is the variable that contributes the most, followed by ΔHAC and ΔHBA, whereas C-HBD, C-CnGC, and ΔlogP contribute the least. For PC2, ΔHBD contributes the most, followed by ΔlogP and ΔPSA, while C-AGC, C-CGC, and C-HGC contribute the least. For PC1 of the failure dataset, the variables that contribute the most are C-MW, C-HAC, and ΔPSA, while those that contribute the least are ΔCnGC, C-AGC, and C-HBD. For PC2, the variables that contribute the most are ΔHAC, ΔMW, and C-AGC, while those that contribute the least are ΔAGC, ΔCGC, and ΔCnGC.
The data presented in Table 6 can also be represented as bar graphs indicating the percentage contribution of each variable to the PCs; Figure 5 and Figure 6 depict these contributions as bar graphs.
The contributions of each variable can also be represented as vectors, whose magnitude, direction, and angle provide insight into the variable’s contribution to the PCs; Figure 7 depicts the contributions of the variables as vectors.
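The following R sketch shows how such contribution bar graphs and variable vectors can be generated with the factoextra package, assuming the pca_success object from the previous sketch; the package choice is illustrative, since the paper does not state which plotting tool was used:

library(factoextra)

# Percentage contribution of each variable to PC1 and PC2 (cf. Figures 5 and 6)
fviz_contrib(pca_success, choice = "var", axes = 1)
fviz_contrib(pca_success, choice = "var", axes = 2)

# Variables drawn as vectors in the PC1-PC2 plane (cf. Figure 7)
fviz_pca_var(pca_success, repel = TRUE)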
In terms of the relationships between variables and their contributions to the PCA, variables with the longest vectors contribute the most to the PCs. The angle between two vectors represents the correlation between the corresponding variables: a small angle indicates a strong positive correlation, meaning the variables tend to vary together, whereas a large angle indicates a low or negative correlation. The direction of a vector also matters: a vector pointing in the opposite direction indicates that the variable is inversely related to the others and may have an opposing effect on the PC [50,51].
Individual contributions are another way of visualizing the PCs, representing how much each observation (individual) contributes to a given PC. This representation is not optimal when a variable’s values are highly diverse; here, however, the individual contributions were adequately captured because the comparison variables take only two binomial values.
Upon comparing the contributions of individuals in the success and failure datasets for the same variables, the individuals in the success dataset were found to be more clustered than those in the failure dataset, although the success dataset also exhibits more outliers. The clustering of individuals indicates the existence of relationships between the observations [52], and these relationships are determined by the combination of ACs and WMs. Because the combinations are evaluated for compatibility or incompatibility, another significant observation is that individuals with a zero value are less prevalent in the success dataset than in the failure dataset. The individuals in the success dataset also show a more consistent shape across the graph for all variables than those in the failure dataset, and the success dataset contains more individuals per unit area than the failure dataset.
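A corresponding sketch for the individuals, again with factoextra and under the same assumptions, showing the contribution of each AC–WM combination to PC1 and the individuals in the PC1–PC2 plane:

# Contribution of each observation (AC-WM combination) to PC1, and the
# observations plotted in the PC1-PC2 plane, for one subset.
fviz_contrib(pca_success, choice = "ind", axes = 1)
fviz_pca_ind(pca_success, geom = "point")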
6. Machine Learning Approach
ML is a subfield of artificial intelligence (AI) concerned with implementing computational algorithms that improve their performance based on historical data [53]. An ML strategy consists of three essential components: data, representation, and model. The data refer to the previously curated and compiled dataset. The representation is the numerical translation of the input information for use in the model, where the selection of the variables or features that comprise the model input can have a significant impact on the model’s performance. Finally, the model is the mathematical representation of the process and can be categorized in different ways (unsupervised, supervised, active, or transfer learning). In general, the term ML can be applied to any method that implicitly models correlations within datasets [54,55].
Based on the compiled data, a supervised learning strategy was chosen. Supervised learning is the ML task of learning a function that maps an input to an output based on example input–output pairs; the function is inferred from a set of labeled training instances, so supervised ML algorithms require external supervision. Training and test datasets are created from the input dataset; the training dataset contains the output variable to be predicted or classified, and the algorithm extracts patterns from the training dataset and applies them to the test dataset for prediction or classification.
There are numerous techniques within the supervised learning methodology, including linear regression, logistic regression, K-nearest neighbors (KNN), Naïve Bayes, decision trees, and more. Selecting a technique depends on the desired features and the data type, so the technique best suited to the model’s purpose was determined by compiling the crucial characteristics of each candidate [56]. KNN was chosen because it matches the purpose of the model, which is determining the compatibility or incompatibility of AC and WM compounds, and the nature of the available information, which is labeled, noncontinuous, and noise-free. The KNN technique was implemented in R, with the data split into 70% training, 10% validation, and 20% test sets (Figure 8). The model was trained using k-fold cross-validation with hyperparameter tuning to choose the best-performing configuration; the training and validation sets were then combined to retrain (refit) the model, and the test set was used to evaluate the classification model for compatibility prediction.
Initially, the binary values of the variable “case” were replaced with the strings “yes” and “no” to signify compatibility or incompatibility, and the variable was transformed from a vector into a factor (category). The raw data, consisting of 288 observations and 21 variables, were divided into training, validation, and test datasets in a 70:10:20 ratio. To enhance the robustness and reliability of the model, k-fold cross-validation and hyperparameter optimization were conducted simultaneously on the training dataset.
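A minimal R sketch of this preparation and split, assuming the encap_data frame introduced earlier; the seed, the assumption that Case == 1 encodes compatibility, and the exact splitting code are illustrative rather than taken from the original implementation:

library(caret)

set.seed(42)   # illustrative seed; the original seed is not reported

# Recode the binary Case variable as a two-level factor ("no", "yes"),
# assuming 1 encodes a compatible (success) combination.
encap_data$Case <- factor(ifelse(encap_data$Case == 1, "yes", "no"),
                          levels = c("no", "yes"))

# 70/10/20 split into training, validation, and test sets.
n   <- nrow(encap_data)
idx <- sample(seq_len(n))
n_train <- round(0.70 * n)
n_valid <- round(0.10 * n)

train_set <- encap_data[idx[1:n_train], ]
valid_set <- encap_data[idx[(n_train + 1):(n_train + n_valid)], ]
test_set  <- encap_data[idx[(n_train + n_valid + 1):n], ]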
Figure 9 illustrates the implementation of the k-fold cross-validation, specifically repeated cross-validation with 10 folds. The data were divided into 10 subsets (folds), and the cross-validation was repeated three times. The model was trained on k − 1 folds and assessed on the remaining fold [57]; within each repetition, every fold was used as the validation fold once, and the entire procedure was repeated three times. A single estimate was obtained by averaging the results of all iterations. This method reduces variability and offers a more comprehensive measurement of the model’s performance [58]. The Receiver Operating Characteristic (ROC) curve was employed [59] to assess the effectiveness of the various hyperparameter configurations, specifically the number of neighbors used. The ROC curve is a visual depiction of the diagnostic capability of a binary classifier as the discrimination threshold changes [60]. By graphing the sensitivity (true positive rate) against 1 − specificity (the false positive rate), the ideal number of neighbors that maximized the area under the curve (AUC) could be determined, achieving a balance between sensitivity and specificity. Figure 10 displays the ROC curve with repeated cross-validation, indicating the optimal number of neighbors as seven; the AUC value for this configuration is 0.993.
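The tuning step can be sketched with the caret package as follows; the centering/scaling preprocessing and the tuneLength value are assumptions not stated in the paper, and with the factor levels ordered as ("no", "yes") the first level, "no", is treated as the event class, matching the positive class reported later:

# Repeated 10-fold cross-validation (3 repeats) with ROC as the tuning
# metric; class probabilities are required by twoClassSummary.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

predictors <- setdiff(names(train_set), "Case")
knn_fit <- train(x = train_set[, predictors], y = train_set$Case,
                 method = "knn",
                 preProcess = c("center", "scale"),  # assumption
                 tuneLength = 15,                    # candidate values of k
                 metric = "ROC",
                 trControl = ctrl)

knn_fit$bestTune   # reported optimum in the paper: k = 7
plot(knn_fit)      # ROC versus number of neighbors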
Therefore, to obtain the best-performing model, the optimized model was retrained on the train–validation set, which combines the training and validation sets. Finally, the model was applied to the unseen test set, and its performance was evaluated using the confusion matrix and statistical metrics. Table 7 presents the statistical metrics used to assess the model’s performance.
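A sketch of the refit and test-set evaluation, under the same assumptions as above; confusionMatrix() produces the counts and statistics discussed next:

# Refit the tuned model (k = 7) on the combined training + validation data
# and evaluate it once on the held-out test set.
train_valid <- rbind(train_set, valid_set)

final_fit <- train(x = train_valid[, predictors], y = train_valid$Case,
                   method = "knn",
                   preProcess = c("center", "scale"),   # assumption
                   tuneGrid = data.frame(k = 7),
                   trControl = trainControl(method = "none"))

pred <- predict(final_fit, newdata = test_set[, predictors])
confusionMatrix(pred, test_set$Case, positive = "no")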
The confusion matrix is a tool utilized to assess the effectiveness of a classification model by comparing the predicted labels with the true labels [61]. The decomposition of the confusion matrix is as follows:
True Negatives (TN): There were 21 instances in which the model accurately predicted “No”.
False Positives (FP): There were 4 occurrences where the model made an incorrect prediction of “Yes” when the true label was “No”.
False Negatives (FN): These refer to the instances where the model incorrectly predicted a negative outcome (“No”) when the actual label was a positive outcome (“Yes”). In this case, there were no instances of false negatives.
True Positives (TP): There were 27 instances where the model accurately predicted “Yes”.
The confusion matrix assesses the classification model’s performance by comparing predicted and actual labels. The model accurately classified 21 instances as incompatible (No) and 27 instances as compatible (Yes), leading to a high overall accuracy. The model produced four false positives, erroneously classifying instances as “Yes” instead of “No”, and no false negatives, indicating perfect recall (100%) for the compatible (Yes) class. Taking “Yes” as the reference class, the model correctly identifies most negative cases (21 of 25, 84%) and has a precision of 87% for compatible (Yes) predictions, since 27 of the 31 “Yes” predictions were correct. Overall, the model demonstrates robust performance, particularly in identifying positive cases, albeit with a slight inclination to predict compatible (Yes) cases incorrectly.
The model’s accuracy, i.e., the proportion of correct predictions relative to the total number of samples, was 0.923. The confidence interval (CI) of the accuracy shows the range of values within which the accuracy is expected to fall, in this case (0.815, 0.979), indicating a high confidence level in the estimated accuracy. Kappa denotes the Kappa statistic, which quantifies the agreement between the predicted and actual class labels while accounting for the probability of chance agreement; a Kappa score of 0.845 indicates substantial agreement beyond chance [62]. McNemar’s test p-value reports the result of McNemar’s test, which evaluates whether the two types of misclassification (false positives and false negatives) occur at significantly different rates; in this instance, the p-value is 0.134, showing no statistically significant imbalance between them [63]. Sensitivity represents the true positive rate, i.e., the proportion of positive samples accurately identified as positive [64].
In this instance, the sensitivity is 0.84, indicating that the model accurately recognizes 84% of positive samples. Specificity denotes the true negative rate, i.e., the proportion of actual negative samples accurately categorized as negative; a specificity of one indicates that the model correctly detects 100% of negative samples [64]. The positive predictive value is the proportion of samples predicted to be positive that are, in fact, positive; in this instance it is one, indicating that all of the predicted positive samples are positive. A negative predictive value of 0.87 indicates that around 87.1% of the predicted negative samples are negative [65]. Prevalence denotes the frequency or proportion of positive samples within the dataset [66]; in this instance, the prevalence is 0.481, meaning that around 48.1% of the samples belong to the positive class. The detection rate is the proportion of the whole sample made up of correctly identified positive cases; a detection rate of 0.403 means that correctly detected positives account for 40.3% of the samples. The detection prevalence is the proportion of the sample predicted to be positive, reported here as 0.404 [64]. Balanced accuracy is the average of sensitivity and specificity and measures classification performance as a whole [67]; a balanced accuracy of 0.92 means that, on average, 92% of positive and negative samples are correctly classified. The positive class column identifies the label regarded as positive; in this instance, “No” is the positive class. The Root Mean Square Error (RMSE) is a metric used to assess the precision of predicted probabilities by computing the average discrepancy between predicted probabilities and actual outcomes [61]; it is especially useful for assessing models that generate probabilities rather than class labels alone. The classification model obtained an RMSE of 0.228, suggesting that its predicted probabilities are accurate and closely match the actual class labels, demonstrating the model’s effectiveness in making dependable predictions.
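For reference, these statistics follow directly from the confusion-matrix counts (TP, TN, FP, and FN, with N the total number of test samples and “No” taken as the positive class), using the standard definitions:
Sensitivity = TP/(TP + FN); Specificity = TN/(TN + FP); PPV = TP/(TP + FP); NPV = TN/(TN + FN); Prevalence = (TP + FN)/N; Detection rate = TP/N; Detection prevalence = (TP + FP)/N; Balanced accuracy = (Sensitivity + Specificity)/2; and RMSE = sqrt((1/N) Σ (p̂ᵢ − yᵢ)²), where p̂ᵢ is the predicted probability of the positive class for sample i and yᵢ ∈ {0, 1} is the corresponding true label.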
Overall, the results indicate that the KNN model performs effectively, exhibiting a high and balanced accuracy. As evidenced by the sensitivity and specificity scores, it performs well in correctly categorizing positive and negative combinations. Positive and negative predictive values reflect the model’s ability to reliably predict positive and negative cases.