This section discusses the employed dataset and proposed methodology. First, a detailed analysis of the curated data is provided. After that, a discussion on the various data preparation and preprocessing steps is presented. Then, our proposed methodology for using machine learning to predict the impact of the 21 essential oils on Alzheimer’s Disease progression is detailed. The overall process entails dataset feature selection, dataset preprocessing, machine learning model evaluation, and model selection.
3.1. Dataset
The collected data (made publicly available at https://github.com/researchrepo1/EssentialOilsDataset, accessed on 30 May 2024) used in this research work comprise 21 essential oil samples and 428 chemical compounds, which represent the chemical composition percentages of these essential oils. These oil samples, available commercially in the United Arab Emirates market, were chosen based on reported literature indicating their potential activity against Alzheimer’s Disease progression [20–44,46–50]. The chemical composition data were gathered from the literature. Each essential oil sample was classified as “HIGH” or “LOW” in activity, with eight samples classified as “HIGH” and thirteen as “LOW”. This classification was based directly on the literature available about each oil. Oils classified as “HIGH” reported significant activity relevant to Alzheimer’s Disease, such as neuroprotection or acetylcholinesterase inhibition. Oils classified as “LOW” either showed low activity, or there was insufficient literature to conclusively determine their efficacy.
The 21 essential oils examined in this study contain 428 chemical compounds in total, as reported in the literature. Of these, 12 compounds were present in four or more essential oils. These compounds, along with their concentrations, are shown in Figure 1, Figure 2 and Figure 3, which provide a visual representation of how each compound is concentrated in each oil.
As per Figure 1, the compound 1,8-Cineole is concentrated the highest in Rosemary oil, followed at much lower concentrations by Sweet Violet, Costus, and Hazelnut (1%) oils, and is present in negligible concentrations in the remaining oils. The compound Camphene is concentrated the highest in Ginger oil, followed by Rosemary and Costus oils, and is found at less than 1% concentration in the remaining oils. The compound Camphor is mostly concentrated in Rosemary oil, followed by Costus, Sweet Violet, and Henna oils. It is noticeably clear that Limonene is most highly concentrated in Grapefruit oil, followed by Juniper oil, and appears at much lower concentrations in the remaining oils. The compound Linalool is concentrated the highest in Apricot oil, followed by Sweet Violet (3.06%), Mint, and Henna oils, and is found at less than 1% concentration in the remaining oils. The compound Spathulenol is concentrated the highest in Sweet Violet and Hazelnut oils, and is also found in Chamomile and Anjeer oils.
In Figure 2, it is noticed that the compound α-Pinene is concentrated the highest in Juniper oil, followed closely by Rosemary oil and then Ginger oil, and is found at less than 2% concentration in the remaining oils. The compound α-Terpineol is concentrated the highest in Hazelnut oil, followed by Rosemary, Mint, and Costus oils, while the compound α-Thujene is concentrated the highest in Juniper oil, with smaller concentrations in Rosemary, Grapefruit, and Chamomile oils. The compound β-Elemene is concentrated the highest in Gum Myrrh oil, followed by Ginseng and Chamomile oils, and appears at a negligible concentration in Costus oil. The compound β-Pinene is primarily concentrated in Juniper oil, at a much lower concentration in Rosemary oil, and at less than 3% concentration in several other oils. Finally, the compound δ-Cadinene is concentrated the highest in Gum Myrrh oil, followed by Apricot oil and then by Chamomile and Juniper oils at equal concentrations.
By examining the heatmap in Figure 3, it is evident that the concentration of the compound Limonene in Grapefruit oil is the highest among the concentrations of all the chemical compounds in the 21 oils, followed by the concentration of the compound Camphene in Ginger oil. Also among the highest are the concentration of the compound 1,8-Cineole in Rosemary oil and the concentrations of the compound α-Pinene in Juniper, Rosemary, and Ginger oils. Furthermore, Chamomile, Costus, and Rosemary oils each contain nine of the twelve compounds examined in the study. On the other hand, these 12 compounds are absent from the following oils: Amla, Mustard, Onion, Sadab, Sandal, and Turmeric.
Since our dataset contains 428 chemical compounds in total, Table 1 provides a detailed overview of the concentrations of only a subset of the chemical compounds found in the 21 essential oils at hand, along with each oil’s activity label (“HIGH” or “LOW”) against Alzheimer’s Disease progression. Each row represents an essential oil sample, and the columns list the percentage concentrations of various compounds, including Camphene, Limonene, Linalool, α-Pinene, β-Pinene, and 1,8-Cineole. The last column labels the activity of each essential oil as either “HIGH” or “LOW”, where “HIGH” indicates reported significant activity of the oil and “LOW” indicates low activity or insufficient literature about the essential oil.
3.2. Feature Selection
As reported in Table 1, our dataset has more than 400 attributes. This phenomenon is known as the curse of dimensionality, whereby data points become sparse in the high-dimensional space, negatively impacting the predictive accuracy of machine learning models [51].
Due to the extremely high dimensionality of our dataset, a rigorous feature selection process is implemented to enhance the predictive accuracy of the machine learning models. This process entails creating various versions of the dataset by systematically excluding chemical compound features that exhibit relatively low cumulative percentages across all essential oil samples.
Each chemical compound could theoretically reach a maximum percentage of 100% within a single essential oil, yielding a total possible sum of 2100% across all 21 essential oils. In practice, the highest observed cumulative percentage was that of Limonene, at 114.97%. To determine which features to retain, percentage thresholds were established based on this maximum value. Eleven distinct threshold values, namely 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 30, 40, and 50, were enforced, resulting in 11 dataset variations (subsets). Additionally, the original dataset, incorporating all chemical compound features (those with a sum greater than 0), is included for comparison. This approach yields twelve unique datasets, each progressively refining the feature set to improve, from a computational perspective, the models’ efficacy in predicting the activity of essential oils against Alzheimer’s Disease progression.
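For illustration, the thresholding step can be sketched as follows. This is a minimal sketch under two assumptions: the data sit in a pandas DataFrame `df` with one row per oil, one column per chemical compound (percentage concentrations), and an `Activity` label column; and each threshold is interpreted as a percentage of the maximum cumulative value (the 114.97% of Limonene). The authors’ exact retention rule and the resulting feature counts are those reported in Table 2.

```python
import pandas as pd

def select_features(df, threshold_pct, label_col="Activity"):
    """Keep compounds whose cumulative percentage across all 21 oils
    exceeds the given threshold (expressed relative to the maximum)."""
    compounds = df.drop(columns=[label_col])
    totals = compounds.sum(axis=0)                   # cumulative % of each compound
    cutoff = (threshold_pct / 100.0) * totals.max()  # fraction of the 114.97% maximum
    kept = totals[totals > cutoff].index.tolist()
    return df[kept + [label_col]]

# 11 reduced datasets plus the original one (threshold 0 keeps every
# compound whose cumulative percentage is greater than 0).
thresholds = [0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 30, 40, 50]
datasets = {t: select_features(df, t) for t in thresholds}
```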
Table 2 specifies the 11 threshold values enforced in the feature selection process, the corresponding numbers of resulting chemical compound features, and the designated name of each dataset. The experiments in Section 4 are carried out on all 12 datasets (the 11 reduced variants plus the original).
3.3. Preprocessing
Machine learning algorithms may require numerical data in specific formats for optimal performance. Initially, our dataset contains textual labels for activity and unformatted numerical data for features. To facilitate data processing, the binary classes “HIGH” and “LOW” are converted to numerical values using Label Encoding. Label Encoding is a technique that transforms categorical text data into numerical values. In this case, the labels “HIGH” and “LOW” are encoded as 1 and 0, respectively.
Furthermore, machine learning algorithms generally perform better when numerical values are normalized within a specific range, typically between 0 and 1. However, the values in our datasets range from 0 to 100, representing percentage compositions. To address this, the Min-Max Scaling technique is employed. Min-Max Scaling rescales each feature to a specified range, usually 0 to 1, by subtracting the minimum value of the feature from each data point and then dividing by the range of the feature (the difference between the maximum and minimum values), as per Equation (1):

$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (1)$$

In Equation (1), $X$ is the original value, $X_{\min}$ is the minimum value of the feature, $X_{\max}$ is the maximum value of the feature, and $X_{\text{scaled}}$ is the scaled value. This normalization ensures that the selected machine learning models are trained effectively, as the dataset now contains feature values standardized to a common scale.
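As a minimal sketch of these two preprocessing steps, assuming one of the twelve datasets is held in a pandas DataFrame `data` whose “Activity” column contains the “HIGH”/“LOW” labels and whose remaining columns contain percentages in the range 0 to 100:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Label encoding: "HIGH" -> 1, "LOW" -> 0, as described above.
X = data.drop(columns=["Activity"]).to_numpy()
y = np.where(data["Activity"] == "HIGH", 1, 0)

# Min-Max Scaling, per Equation (1): each compound column is rescaled to [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```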
Further preprocessing beyond these steps was not necessary, as the dataset does not contain any outliers. All numerical values are between 0 and 100. Moreover, the dataset does not contain any missing values as all chemical compounds were accounted for during the process of collecting their data from the literature.
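These properties can be verified with a couple of quick checks, again assuming the hypothetical DataFrame `data` from the sketch above:

```python
# Sanity checks for the statements above: no missing values, and all
# concentrations lie within the 0-100 percentage range.
compounds = data.drop(columns=["Activity"])
assert not compounds.isna().any().any(), "no missing values expected"
assert ((compounds >= 0) & (compounds <= 100)).all().all(), "values must lie in [0, 100]"
```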
3.4. Model Evaluation
The effectiveness of a machine learning model is evaluated based on its accuracy in predicting the correct label of an essential oil sample in the underlying dataset (see Table 1 and Table 2). In particular, two accuracy metrics are employed: testing accuracy and leave-one-out cross-validation (LOOCV) accuracy.
Cross-validation (CV) is a model evaluation technique in which the sample set is split into k subsets, and model fitting and prediction are performed k times. In each iteration, one of the k subsets forms the testing set and the remaining subsets form the training set. This method helps avoid model overfitting and enables an effective evaluation of model performance on a small dataset.
Leave-one-out cross-validation (LOOCV) is a form of k-fold cross-validation where k = n, and n is the number of samples in the dataset. In each of the n iterations, a single sample forms the testing set and the remaining n − 1 samples form the training set. To evaluate model performance using LOOCV, the accuracies of the model in predicting the activity of the held-out testing sample across the n iterations are averaged to give the overall LOOCV accuracy of the model.
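To make the procedure concrete, the following sketch runs LOOCV manually with scikit-learn; it assumes the preprocessed arrays `X_scaled` and `y` from the preprocessing step, and the kNN classifier used here is only a placeholder (any classifier could be plugged in).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

correct = []
for train_idx, test_idx in LeaveOneOut().split(X_scaled):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_scaled[train_idx], y[train_idx])   # train on n-1 oils
    pred = model.predict(X_scaled[test_idx])       # predict the single held-out oil
    correct.append(pred[0] == y[test_idx][0])

loocv_accuracy = np.mean(correct)                  # average over the n iterations
```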
Due to the extremely small number of samples in our dataset, several other evaluation methods were experimented with to identify the most suitable ones. These methods include 50:25:25 and 70:15:15 training–validation–testing splits and mean cross-validation with k = 3, 5, and 7, in addition to LOOCV. Out of these methods, evaluating a model using 50% training and 50% testing sets and using LOOCV reported the highest accuracies. Table 3 summarizes these experiments. Henceforth, all our experiments are evaluated using the 50% testing set and LOOCV, and the corresponding accuracies of a machine learning model in predicting the activity of essential oils are referred to as testing accuracy and LOOCV accuracy, respectively.
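The adopted protocol can be sketched as follows, again assuming `X_scaled` and `y` from the preprocessing step; the classifier, random seed, and stratification choice are illustrative assumptions, not settings reported in the paper.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder classifier; in practice the kNN, LR, and RF models of the
# next section are evaluated this way.
model = KNeighborsClassifier(n_neighbors=3)

# Testing accuracy: train on 50% of the oils, test on the other 50%.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.5, stratify=y, random_state=0
)
testing_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# LOOCV accuracy: mean accuracy over the leave-one-out iterations.
loocv_accuracy = cross_val_score(model, X_scaled, y, cv=LeaveOneOut()).mean()
```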
3.5. Model Selection
The problem of predicting the class label of an essential oil sample is a classification problem. This section discusses building three widely used classification models: k-Nearest Neighbours, Logistic Regression, and Random Forest. For each of these models, the selection of the various model parameters is discussed. All algorithms were implemented using Python’s scikit-learn library [45,53,54].
3.5.1. k-Nearest Neighbours
k-Nearest Neighbors (kNN) is a supervised learning classification algorithm that operates by plotting samples in a multidimensional space, where the number of features corresponds to the number of dimensions. The class label of a sample is predicted by identifying its k nearest neighbors in this multidimensional space and taking a majority vote of these neighbors’ labels [55].
It is crucial to select an appropriate value of k to avoid ties in the voting process. For binary classification tasks, k should be an odd number to prevent ties. In cases where the classification involves three classes, k should be chosen such that it is not a multiple of three, minimizing the likelihood of tie votes. This careful selection of k helps ensure unambiguous classification outcomes.
To explore the impact of different parameters on the performance of the kNN algorithm, the scikit-learn implementation was utilized [53]. Specifically, an odd number of neighbors (the n_neighbors parameter) is chosen between 1 and 9, and two distinct settings of the weights parameter, uniform and distance, are used. The uniform setting assigns equal weight to each neighbor, while the distance setting assigns larger weights to closer samples [53]. This configuration yields 10 distinct kNN models, each characterized by a combination of these parameters. These models are listed and named in Table 4.
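A sketch of how these 10 configurations might be instantiated with scikit-learn is shown below; the dictionary keys are illustrative and do not correspond to the model names in Table 4.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_models = {
    f"kNN_k{k}_{w}": KNeighborsClassifier(n_neighbors=k, weights=w)
    for k in (1, 3, 5, 7, 9)          # odd numbers of neighbors between 1 and 9
    for w in ("uniform", "distance")  # equal vs. distance-based vote weights
}
# 5 values of k x 2 weighting schemes = 10 candidate kNN models.
```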
3.5.2. Logistic Regression
Logistic Regression (LR) is a supervised learning binary classification algorithm that predicts the probability of a sample belonging to a certain class. Equation (2) formulates this process: the input features enter the equation as weighted variables, and the output tends towards either 0 or 1, with each extremity representing one of the two classes [56]:

$$y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)}} \quad (2)$$

In Equation (2), $y$ is the output, which is mapped to either 0 or 1, the input features are $x_1, x_2, \ldots, x_n$, the weights of the features are $w_1, w_2, \ldots, w_n$ (with $w_0$ the bias term), and $n$ is the number of features.
The output can be plotted as an S-shaped curve, which is also called the Sigmoid function. The curve helps visualize the predicted probabilities and how outputs are assigned. The model is called “regression” because its prediction is a continuous value between 0 and 1, yet it performs binary classification because the final output can take only two possible values, 0 or 1.
During the model training process, the weights are calculated from the training samples and their labels. Then, during model evaluation, each testing sample is assigned an output value by substituting its input feature values into the equation, multiplying each by its respective learned weight, summing the products, and passing the result through the sigmoid; the resulting output is assigned 0 or 1, whichever value it is closest to (i.e., thresholded at 0.5).
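The following worked example illustrates this prediction step with made-up weights and feature values (not values learned in this study):

```python
import numpy as np

w = np.array([0.8, -1.2, 0.3])    # hypothetical learned weights w_1, w_2, w_3
w0 = 0.1                          # hypothetical bias term
x = np.array([0.45, 0.10, 0.70])  # scaled concentrations of three compounds for one oil

z = w0 + np.dot(w, x)             # weighted sum of the input features (0.55 here)
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid output, a probability in (0, 1) (~0.63)
label = int(p >= 0.5)             # assign 1 ("HIGH") or 0 ("LOW"), whichever is closer
```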
An optimization algorithm is used to find the optimal weights and bias during training; conventionally, gradient descent is used for this purpose. An LR model trained on the same dataset with different optimization algorithms may produce different outcomes.
To prevent the S-shaped curve from overfitting to the training data, regularization can be applied through L1 or L2 penalty (loss) terms. An LR model can be trained with either of these penalties or with none. Adding a regularization penalty can improve model performance by discouraging overly large feature weights.
Using Python’s scikit-learn library, we experimented with several parameters of Logistic Regression. The penalty parameter trains the LR model with the l1 or l2 penalty (or none) to regularize the model and avoid overfitting to the training set. The class_weight parameter is None by default but can be set to balanced for an imbalanced training set; this assigns samples of the minority class a higher weight and samples of the majority class a lower weight, which may improve model performance. The solver parameter selects among different optimization algorithms. The C parameter specifies the strength of regularization, where a smaller C corresponds to stronger regularization [54].

We experimented with the penalty, class_weight, and C parameters and found that a single setting of the penalty and class_weight parameters worked well for all datasets. We then varied C and the solver, and evaluated the performance of the resulting logistic regression models.
Table 5 summarizes the different LR models resulting from parameter variations.
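For illustration, a grid of candidate LR models could be built as sketched below; the penalty, class_weight, solver, and C values shown are placeholders rather than the settings reported in Table 5.

```python
from sklearn.linear_model import LogisticRegression

lr_models = {}
for solver in ("lbfgs", "liblinear"):    # two example optimization algorithms
    for C in (0.1, 1.0, 10.0):           # example regularization strengths
        lr_models[f"LR_{solver}_C{C}"] = LogisticRegression(
            penalty="l2",                # example regularization penalty
            class_weight="balanced",     # up-weight the minority class
            solver=solver,
            C=C,                         # smaller C => stronger regularization
            max_iter=1000,
        )
```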
3.5.3. Random Forest
Random Forest (RF) is an ensemble supervised learning technique that can be used for both classification and regression applications [57]. To understand how RF works, one must first understand how Decision Trees (DT) work. A decision tree is another machine learning technique used for classification and regression; it works like a flowchart, splitting the dataset on different features at different levels to ultimately predict the label of a target sample [58].
Random Forest implements multiple decision trees, where a random subset of features is dropped in (i.e., withheld from) each decision tree, thus reducing the chance of overfitting to the training data. The class label of a sample is predicted by feeding that sample’s feature values into each decision tree, obtaining each tree’s predicted class label, and then taking a majority vote to assign the output class label [59].
An RF model’s accuracy depends on multiple parameters, namely the number of decision trees, the best-split criterion, and whether the trees have been pre-pruned. Pre-pruning refers to limiting the growth of the trees, which further ensures that the model does not overfit; one way to do this is by limiting the maximum depth of the trees or by constraining certain parameters during training.
The n_estimators parameter refers to the number of decision trees used in the random forest. The criterion parameter specifies the method of choosing the best feature for splitting the data at each node; we experimented with the gini, entropy, and log_loss criteria and found that the log_loss criterion results in better performance. The max_depth and ccp_alpha parameters are used to pre-prune the trees: max_depth limits the maximum depth of each decision tree in the RF model, whereas ccp_alpha is a constant that determines which sub-trees are allowed within a single tree based on their cost complexity. Limiting the depth and complexity of the trees in this way helps the model generalize better.
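A corresponding sketch for the RF candidates is given below; the n_estimators, max_depth, and ccp_alpha grids are placeholders (the values actually evaluated appear in Table 6), and criterion="log_loss" requires scikit-learn 1.1 or later.

```python
from sklearn.ensemble import RandomForestClassifier

rf_models = {}
for n_trees in (50, 100, 200):            # example numbers of decision trees
    for max_depth in (None, 3, 5):        # None = no depth limit (no pre-pruning)
        for ccp_alpha in (0.0, 0.01):     # cost-complexity pruning constant
            name = f"RF_t{n_trees}_d{max_depth}_a{ccp_alpha}"
            rf_models[name] = RandomForestClassifier(
                n_estimators=n_trees,
                criterion="log_loss",     # split criterion reported to work best
                max_depth=max_depth,
                ccp_alpha=ccp_alpha,
                random_state=0,
            )
```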
Table 6 summarizes the different RF models resulting from parameter variations.