Optimized Machine Learning Classifiers for Symptom-Based Disease Screening

Fuster-Palà, Auba; Luna-Perejón, Francisco; Miró-Amarante, Lourdes; Domínguez-Morales, Manuel

doi:10.3390/computers13090233


by Auba Fuster-Palà ¹, Francisco Luna-Perejón ²,³,⁴, Lourdes Miró-Amarante ²,³,⁴ and Manuel Domínguez-Morales ²,³,⁴,*

¹ E.T.S. Ingeniería Informática, Universidad de Sevilla, Avda. Reina Mercedes s/n, 41012 Seville, Spain

² Robotics and Technology of Computers Research Group (TEP-108), Architecture and Computer Technology Department, E.T.S. Ingeniería Informática, Universidad de Sevilla, Avda. Reina Mercedes s/n, 41012 Seville, Spain

³ Computer Engineering Research Institute (I3US), E.T.S. Ingeniería Informática, Universidad de Sevilla, Avda. Reina Mercedes s/n, 41012 Seville, Spain

⁴ Escuela Politécnica Superior (EPS), Universidad de Sevilla, 41011 Seville, Spain

* Author to whom correspondence should be addressed.

Computers 2024, 13(9), 233; https://doi.org/10.3390/computers13090233

Submission received: 14 June 2024 / Revised: 9 September 2024 / Accepted: 11 September 2024 / Published: 14 September 2024

(This article belongs to the Special Issue Future Systems Based on Healthcare 5.0 for Pandemic Preparedness 2024)


Abstract:

This work presents a disease detection classifier based on symptoms encoded by their severity. This model is presented as part of the solution to the saturation of the healthcare system, aiding in the initial screening stage. An open-source dataset is used, which undergoes pre-processing and serves as the data source to train and test various machine learning models, including SVM, RFs, KNN, and ANNs. A three-phase optimization process is developed to obtain the best classifier: first, the dataset is pre-processed; second, a grid search is performed with several hyperparameter variations for each classifier; and, finally, the best models obtained are subjected to additional filtering processes. The best-performing model, selected based on performance and execution time, is a KNN with 2 neighbors, which achieves an accuracy and F1 score of over 98%. These results demonstrate the effectiveness and improvement of the evaluated models compared to previous studies, particularly in terms of accuracy. Although the ANN model has a longer execution time than KNN, it is retained in this work due to its potential to handle more complex datasets in a real clinical context.

Keywords:

disease screening; hospital overcrowding; machine learning; artificial intelligence

1. Introduction

There are multiple causes that directly affect the turnaround time of a medical diagnosis. Among them are the aging of the population, the lack of medical professionals, the difficulty of access to healthcare, or its misuse. In the following, we will briefly detail each of these aspects.

Population aging is currently a global phenomenon that is accelerating in many countries of the world. According to projections of the United Nations (UN), the global population over 60 years of age will exceed 2.1 billion by 2050, representing 21% of the total world population [1]. This demographic trend has important implications for healthcare systems worldwide [2].

On the other hand, another challenge facing health systems is the lack of health professionals. WHO estimates that by 2030, there will be a global shortage of about 14 million health professionals [3]. This lack of health personnel can lead to an overload of health systems, leading to longer waiting times for care, a reduced quality of health services, and ultimately, the collapse of the health system [4].

In the third place, lack of access to healthcare is a serious problem that affects many people around the world, especially those who live in countries in the Global South and rural areas or have disabilities or reduced mobility. According to the WHO, more than half of the world’s population does not have access to essential health services [5]. In low-income countries, lack of access to healthcare is one of the leading causes of mortality. According to a WHO report, 45% of child deaths worldwide occur in low-income countries due to lack of access to basic health services [5]. In addition, many people living in rural areas do not have access to adequate healthcare services due to the lack of infrastructure and trained medical personnel. Patients in rural areas must travel to areas far from their homes to assess their health status, making access to health services difficult and often impossible [6].

Moreover, people with disabilities or reduced mobility also face many challenges in accessing adequate health services. According to WHO, more than one billion people worldwide have some form of disability and are more likely to experience barriers to accessing healthcare than people without disabilities [7]. A lack of physical accessibility and lack of trained personnel to care for people with disabilities are some of the main challenges faced by people with disabilities.

Finally, the misuse of healthcare is a problem that affects many health systems around the world. Requesting emergency healthcare for minor cases is one of the main problems associated with it: many people go to emergency rooms for minor problems that could be treated in primary care or even with self-care at home. This not only saturates emergency departments but also contributes to increased waiting times and reduced quality of care provided to patients who really need urgent care [8,9].

According to the work presented in [10], which analyzes multiple sources of health information, one of the main reasons for the overcrowding in hospital emergency services comes from those patients who go directly to these services without first passing through their primary care center; as a result, a large proportion of these cases are not urgent and could have been treated directly in primary care centers: more than 35% according to the Official College of Doctors of Madrid (Spain).

The healthcare system faces the multiple challenges described previously. To solve these problems, which will increase over the years, one of the stages of the healthcare cycle where some improvements can be made is the initial stage (triage), which consists of directing people to the appropriate level of care and the first pre-diagnosis of initial symptomatology.

The importance of pre-diagnosis lies in the early detection of disease and the prevention of long-term health problems. Early diagnosis allows healthcare professionals to take preventive measures to stop disease progression and avoid serious complications [11]. In addition, early identification of the disease can improve the quality of life of patients and reduce the financial burden of long-term treatment [11].

On the other hand, it is equally important to direct patients to the appropriate level of care based on the urgency of their case. Currently, many patients go directly to emergency rooms for medical care, which can increase costs and burden on the healthcare system. Directing patients to the appropriate level of care, for example, a primary care center, can help avoid unnecessary emergency room visits and improve the efficiency of the healthcare system [12].

In summary, early pre-diagnosis and proper direction of patients to the appropriate level of healthcare are key elements in improving the efficiency and effectiveness of the healthcare system.

To achieve this purpose, recent advances in telemedicine may be useful tools in healthcare, especially in the initial pre-diagnostic stage and in chronic treatments. On the one hand, telemedicine can be used for the evaluation and follow-up of patients with chronic conditions, such as diabetes, arterial hypertension, and chronic obstructive pulmonary disease [13,14]. Telemedicine has also been shown to be effective in the early detection of mental illnesses such as depression, anxiety, and bipolar disorder [15,16].

However, telemedicine cannot be considered a one-size-fits-all solution to all problems in healthcare. Although telemedicine can reduce the need for face-to-face visits, it still requires trained healthcare personnel for its implementation and effective use [17]. Therefore, it is important to keep these limitations in mind when considering the implementation of telemedicine in healthcare and to look for alternatives or complementary tools, such as artificial intelligence (AI). AI is currently used in many areas of healthcare, including drug discovery, genomics, radiology, pathology and prevention, and diagnostic imaging, among others [18,19,20].

Despite its extensive use in healthcare, the direct application of artificial intelligence for the triage and pre-diagnosis of initial patient symptoms is not widespread. There are several examples of AI tools that are being developed to improve care in a particular specialty or disease; however, to improve the global healthcare system, a tool capable of helping during the pre-diagnosis stage is necessary.

The primary aim of this work is to develop an AI classifier that can assist healthcare professionals during the initial screening phase, providing classification insights to healthcare professionals, enabling them to make a final diagnosis more quickly. Specifically, the developed system will be focused on a symptom-based disease classifier.

This work is structured as follows. In Section 2, the findings of related works are described. In Section 3, the classifiers tested are presented, as well as the evaluation metrics, the three-stage process developed to obtain the best classifier, and the dataset used for that purpose. In Section 4, the results of the process detailed in the previous section are shown, the final classifier developed is compared with previous works, and the results obtained are discussed in detail. In Section 5, the conclusions obtained from this work are presented. Finally in Section 6, the limitations and future works are presented.

2. Related Works

In the study of Gomathy et al. [21], the primary objective was to create a system that can predict diseases based on user-provided symptoms. The system, named Disease Predictor, processes these symptoms and generates a prediction using the Grails framework, offers a user-friendly interface, and is accessible via a web application, allowing users to utilize the system from anywhere and at any time. The system utilizes machine learning algorithms such as Decision Tree, Random Forest, and Naïve Bayes to analyze and predict diseases like diabetes, Malaria, Jaundice, Dengue, and Tuberculosis. The system has demonstrated an accuracy of 98.3%, showcasing its capability to effectively predict disease outbreaks.

In another study conducted by Grampurohit et al. [22], machine learning techniques were applied to enhance disease prediction, supporting physicians in early diagnosis and patient care. The research analyzed a dataset of 4920 patient records across 41 diseases and 95 related symptoms, utilizing Decision Tree, Random Forest, and Naïve Bayes classifiers. The study demonstrated that these algorithms could achieve up to 95% accuracy, providing a comparative analysis of their effectiveness. The findings underscore the growing importance of machine learning in the medical field, highlighting its potential to significantly improve data analysis and disease prediction as artificial intelligence continues to advance.

Lastly, the work performed by Nesterov et al. [23] introduces a novel model for symptom and diagnosis prediction based on supervised learning, addressing the limitations of traditional symptom checking systems and recent reinforcement learning (RL) models. While basic systems such as those based on Bayesian methods or Decision Trees are easy to train, they often suffer from low relevance and diagnostic quality. In contrast, Nesterov et al. propose a neural model with logic regularization, which combines the strengths of different approaches. Their experiments on real and synthetic data demonstrate that this model outperforms existing methods, particularly in scenarios involving large and sparse symptom and diagnosis spaces. The model utilizes asymmetric loss and logic regularization to enhance prediction accuracy and overcome challenges associated with RL-based methods, such as training complexity and the limitations of the Markov process. The practical significance of this work lies in its ease of implementation, training stability, and low computational demands, making it well suited for real-world medical applications. The authors also emphasize their future work direction, which involves applying the model in practical medical systems.

3. Methodology

This section will present the tools and methodologies applied in this work to meet the proposed objective. Initially, the dataset used for this purpose is presented, followed by a detailed explanation of the types of classifiers used, the evaluation mechanisms to determine the best classifier, and finally, the process followed to obtain the best classifier.

3.1. Dataset

The dataset used for this work is “Disease Symptom Prediction” [24]. It consists of almost 5000 entries, each with a set of associated symptoms and labeled with one of 41 different diseases. The symptoms are listed verbatim in the original file, and an additional file details the severity level of each symptom on a scale of 1 to 7.

A summary of the dataset can be seen in Table 1.

3.2. Classifiers

There are multiple machine learning (ML)-based classifiers that can be utilized in this work, each offering distinct advantages depending on the specific application and data characteristics. To ensure a comprehensive evaluation and to maximize the potential for accurate and reliable predictions, we have deliberately chosen a diverse set of classifiers for this study. This selection ranges from more traditional statistical classifiers, such as Decision Trees, which are known for their simplicity and interpretability, to much more complex classifiers, such as Neural Networks, which are capable of capturing intricate patterns in data due to their deep learning architecture. This approach not only broadens the applicability of our findings but also allows for a better understanding of how different ML techniques perform in the context of healthcare screening. The models proposed are introduced below.

Random Forest (RF) is a supervised learning algorithm that uses ensemble learning to combine multiple Decision Trees, improving prediction accuracy and robustness, as supported by studies [25]. Each tree is built from a random subset of training data and features, reducing overfitting through “bagging”. For classification, the final prediction $\hat{y}$ is determined by a majority vote among the M Decision Trees:

$\hat{y} = mode \{h_{1} (x), h_{2} (x), \dots, h_{M} (x)\}$

(1)

This method is known to improve generalization by averaging out biases and reducing variance [26]. Additionally, RFs can assess feature importance by analyzing the impact of permuted values on classification accuracy. Figure 1a illustrates the workflow of the RF model.
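
The majority vote in Equation (1) can be illustrated with a few lines of Python. This is only a minimal sketch: the helper name `majority_vote` and the five tree outputs are hypothetical and are not taken from the experiments in this work.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs h_1(x), ..., h_5(x) of M = 5 trees for one sample x.
tree_predictions = ["Malaria", "Dengue", "Malaria", "Malaria", "Jaundice"]
y_hat = majority_vote(tree_predictions)
print(y_hat)
```
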
K-Nearest Neighbors (KNNs) is a supervised machine learning algorithm commonly used for both classification and regression tasks. It operates on the principle that data points with similar features are likely to belong to the same class or have similar outputs, a concept well supported by studies in the field [27]. The KNN model classifies a new data point by analyzing its proximity to the nearest neighbors in the feature space, using distance metrics such as Euclidean or Manhattan distance.
For classification, the KNN algorithm follows these steps: calculate the distances between the new data point and all points in the training set, select the closest K neighbors, and use majority voting among these neighbors to determine the class label. The classification decision $\hat{y}$ can be mathematically represented as follows:

$\hat{y} = mode \{y_{(1)}, y_{(2)}, \dots, y_{(K)}\}$

(2)

where $y_{(i)}$ is the class label of the i-th nearest neighbor. The final classification is based on the most frequent class among the K nearest neighbors. Figure 1b illustrates an example of KNN classification, showing how the majority class among the nearest neighbors determines the assigned class.
This method is recognized for its simplicity and effectiveness in various applications, as evidenced by its widespread use in the literature.
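
The three steps above (compute distances, keep the K closest points, take a majority vote) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `euclidean` and `knn_predict`, the toy severity vectors, and the class labels are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    # Step 1: distance from x to every training point.
    distances = [(euclidean(p, x), label) for p, label in zip(train_X, train_y)]
    # Step 2: keep the labels of the k closest points (stable sort, nearest first).
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    # Step 3: majority vote, as in Equation (2).
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical severity-coded symptom vectors and their disease labels.
train_X = [[3, 5, 0], [3, 4, 0], [0, 1, 6], [0, 2, 6]]
train_y = ["A", "A", "B", "B"]
prediction = knn_predict(train_X, train_y, [3, 5, 1], k=2)
print(prediction)
```

With K = 2 the two neighbors can disagree. In this sketch such a tie goes to the closer neighbor, because the labels are counted in order of increasing distance; this tie-breaking rule is a choice of the sketch.
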
Support Vector Machine (SVM) is a supervised learning algorithm widely used for both classification and regression tasks, with its effectiveness supported by numerous studies [28]. The primary goal of SVM is to identify an optimal hyperplane that separates samples of different classes in a high-dimensional space. This hyperplane serves as a decision boundary, with support vectors—data points closest to the hyperplane—being crucial in determining its position and orientation.
The SVM algorithm employs kernel functions to transform the input data into a higher-dimensional space where linear separation is more feasible. The algorithm then searches for the hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest support vectors. This margin maximization is key to achieving the best possible separation between classes.
Mathematically, the optimal hyperplane can be represented as in Equation (3):

$w \cdot x - b = 0$

(3)

where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias term. The margin $M$ is defined in Equation (4):

$M = \frac{2}{∥ w ∥}$

(4)

The objective of SVM is to maximize M, ensuring that the classes are well separated. Figure 1c illustrates this concept, showing two classes, the optimal hyperplane, and the support vectors.
This approach has been shown to be particularly effective in high-dimensional spaces, making SVM a powerful tool for various classification tasks.
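
As a small numerical illustration of Equations (3) and (4), the sketch below evaluates the decision value $w \cdot x - b$ and the margin $M = 2 / ∥w∥$ for a given hyperplane. The weight vector, bias, and sample points are hypothetical values chosen for easy arithmetic; they do not come from a trained model.

```python
import math


def decision_value(w, x, b):
    """Signed value of w . x - b; its sign indicates the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def margin(w):
    """Margin M = 2 / ||w|| from Equation (4)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane parameters.
w = [3.0, 4.0]
b = 5.0

print(decision_value(w, [3.0, 4.0], b))  # positive side of the hyperplane
print(decision_value(w, [0.0, 0.0], b))  # negative side of the hyperplane
print(margin(w))
```
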
Artificial Neural Networks (ANNs), specifically Feed-Forward Neural Networks (FNNs), are the final classifiers employed in this study. FNNs transmit information in a single direction, from the input layer through one or more hidden layers, to the output layer. The basic structure of an FNN includes an input layer, hidden layers, and an output layer, with each layer composed of interconnected neurons. These connections are governed by synaptic weights, which are adjusted during training to minimize a loss function, thereby improving the model’s accuracy. This adjustment process is typically achieved using optimization techniques such as gradient descent, which iteratively updates the weights to reduce the difference between the predicted and actual outputs.
Mathematically, the output of a neuron in the network can be expressed as follows:

$y = ϕ (\sum_{i = 1}^{n} w_{i} x_{i} + b)$

(5)

where y is the output of the neuron, $ϕ$ is the activation function, $w_{i}$ represents the synaptic weights, $x_{i}$ are the input features, and b is the bias term. The activation function $ϕ$ introduces non-linearity into the model, enabling it to learn complex patterns.
Figure 1d illustrates the structure of an FNN, highlighting how data are processed from the input layer, through the hidden layers, to the output layer to generate predictions.
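
Equation (5) can be checked numerically with a short sketch of a single neuron. The weights, inputs, and bias below are hypothetical, and the helper names are introduced only for this illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias, activation):
    """Compute phi(sum_i w_i * x_i + b) for one neuron, as in Equation (5)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical neuron parameters: the weighted sum is 1.0 - 1.0 + 1.0 - 0.5 = 0.5.
weights = [0.5, -1.0, 2.0]
inputs = [2.0, 1.0, 0.5]
bias = -0.5

for name, phi in [("relu", relu), ("logistic", logistic), ("tanh", math.tanh)]:
    print(name, neuron_output(weights, inputs, bias, phi))
```

The same weighted sum gives a different output under each activation function, which is why the choice of $ϕ$ is treated as a hyperparameter later in the text.
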
In summary, FNNs utilize a layered structure and supervised learning to perform classification or regression tasks on complex datasets by adjusting neuron connections. This study explores the effectiveness of different models, namely Random Forests, KNNs, SVMs, and Neural Networks, to determine the most efficient approach for building a disease classifier based on coded symptoms.

As discussed, each model employs a distinct approach to tackle classification problems, ranging from Decision Trees, which rely on simple rule-based logic, to SVMs that use hyperplane geometry for separation, to KNNs that classify based on distances between neighboring samples, and finally to Neural Networks, which utilize complex, layered structures to capture non-linear patterns in the data. These differences in methodology highlight the unique strengths and weaknesses of each algorithm, making it essential to study them all to determine which is most effective for our specific task. By comparing these diverse models, we can identify the one that best balances accuracy, computational efficiency, and generalization ability, ultimately leading to a more robust and reliable disease classifier based on coded symptoms. This comprehensive approach ensures that we do not overlook any potential advantages of one algorithm over another, thereby maximizing the chances of developing an efficient and effective model.

The mechanisms (or metrics) used to determine the goodness of a classifier are presented below.

3.3. Evaluation Metrics

To evaluate the effectiveness of the classification results of a classifier, the most common metrics are used: accuracy (most-used metric), sensitivity (known as recall in other works), specificity, precision, and $F1_{\mathrm{score}}$ [29]. To this end, the classification results obtained for each class are tagged as “True Positive” (TP), “True Negative” (TN), “False Positive” (FP), or “False Negative” (FN). According to them, the high-level metrics are presented in the next equations:

$\mathrm{Accuracy} = \sum_{c} \frac{TP_{c} + TN_{c}}{TP_{c} + FP_{c} + TN_{c} + FN_{c}}, \quad c \in \mathrm{classes}$

(6)

$\mathrm{Precision} = \sum_{c} \frac{TP_{c}}{TP_{c} + FP_{c}}, \quad c \in \mathrm{classes}$

(7)

$\mathrm{Recall} = \sum_{c} \frac{TP_{c}}{TP_{c} + FN_{c}}, \quad c \in \mathrm{classes}$

(8)

$F1_{\mathrm{score}} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

(9)

About those metrics,

Accuracy: proportion of samples classified correctly out of all samples (see Equation (6))
Precision: proportion of the samples predicted as belonging to a class that actually belong to it (see Equation (7))
Recall (or Sensitivity): proportion of the samples actually belonging to a class that are correctly predicted as such (see Equation (8))
$F1_{\mathrm{score}}$: it combines two of the main metrics (precision and sensitivity) by calculating their harmonic mean (see Equation (9))

The above metrics are common to all ML systems. Therefore, the classifier systems developed in this work will be evaluated according to all the metrics detailed in this subsection. Moreover, the results obtained by the classification system will be compared with the results obtained in previous works.
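
To make the definitions above concrete, the following sketch computes the per-class TP, FP, FN, and TN counts from a list of true and predicted labels and derives the metrics from them. Two points are assumptions of this sketch rather than statements about the evaluation code of this work: the per-class values are combined with an unweighted (macro) average, and accuracy is computed as the fraction of correctly classified samples, following its verbal definition. The labels in the example are made up.

```python
def per_class_counts(y_true, y_pred, cls):
    """Return (TP, FP, FN, TN) for one class in a one-vs-rest view."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def macro_metrics(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn, _tn = per_class_counts(y_true, y_pred, cls)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        precisions.append(precision)
        recalls.append(recall)
        # Harmonic mean of precision and recall, as in Equation (9).
        f1s.append(safe_div(2 * precision * recall, precision + recall))
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for three classes.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```
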

3.4. Optimization Procedure

In order to obtain the best classifier for the system, an optimization model based on three stages is presented: a first stage of pre-processing the information coming from the dataset; a second stage in which a global search is performed including multiple trainings with different variations of the hyperparameters in each type of classifier; and a last stage in which additional filtering rules are applied with the best candidates obtained from the previous stage.

Figure 2 presents the full processing chain performed for this work. Each of these stages will be described below.

3.4.1. Pre-Processing

Dataset pre-processing is the first step required before the dataset can be used as an input data source for a machine learning algorithm. Because good pre-processing is vital to make the data readable by the machine learning models we intend to apply, a summary diagram of the steps followed has been prepared. The diagram is available in Figure 3.

Once the general pre-processing strategy has been visualized, the steps followed to carry out this process are described in detail:

1st.: Removing blank spaces from texts: for the symptoms that are composed of more than one word, the repeated blanks located between words are reduced to one.
2nd.: Removing attributes that do not provide useful information: those attributes whose value is the same for all entries in the dataset or which are empty for more than half of the entries are deleted.
3rd.: Replacing null values: those symptoms with unfilled values for certain entries are filled with the value ‘0’.
4th.: Coding the symptoms: Transform textual values of symptoms to numerical values. For certain datasets, these symptoms have a severity scale (as is the case in this work); therefore, the numerical coding will follow an incremental scale based on the degree of severity.
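
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: each entry is represented as a list of symptom strings with `None` for missing values, the function name `preprocess` is hypothetical, symptoms absent from the severity table are coded as 0, leading and trailing blanks are also trimmed, and the symptom names and severity values in the example are invented (within the 1 to 7 scale).

```python
import re


def preprocess(rows, severity):
    """Apply the four pre-processing steps to a table of symptom strings."""
    # Step 1: collapse repeated blanks inside multi-word symptoms
    # (trimming the ends as well is an assumption of this sketch).
    cleaned = [
        [re.sub(r"\s+", " ", cell).strip() if cell else None for cell in row]
        for row in rows
    ]
    n_rows, n_cols = len(cleaned), len(cleaned[0])

    # Step 2: drop columns that are constant or empty in more than half of the rows.
    keep = []
    for j in range(n_cols):
        column = [row[j] for row in cleaned]
        n_empty = sum(1 for cell in column if cell is None)
        is_constant = len(set(column)) == 1
        if not is_constant and n_empty <= n_rows / 2:
            keep.append(j)

    # Steps 3 and 4: missing values become 0, symptoms become their severity code.
    return [
        [severity.get(row[j], 0) if row[j] is not None else 0 for j in keep]
        for row in cleaned
    ]


# Invented example: three entries, four symptom columns.
rows = [
    ["itching", "skin  rash", None, "x"],
    ["high fever", "chills", None, "x"],
    ["itching", None, "cough", "x"],
]
severity = {"itching": 1, "skin rash": 3, "high fever": 7, "chills": 3, "cough": 4}
encoded = preprocess(rows, severity)
print(encoded)
```

In this example the third column is dropped because it is empty in two of the three entries, and the fourth because it holds the same value everywhere.
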

Once the dataset pre-processing is finished, we perform the global search for the best classifiers by training multiple classifiers based on the previously described ones and varying their hyperparameters.

3.4.2. Grid Search

The first step before starting to train machine learning models is to divide the dataset randomly into three subsets: train, validation, and test. To do this, code has been developed to prepare and split the dataset for use in machine learning models. The split is performed so that the result is balanced across classes, with each subset containing a similar number of entries for each class.

Finally, the training subset represents 70% of the original dataset. Next, the remaining data are split into validation and test sets, using 50% for each one: this means that both the validation set and the test set represent 15% of the original dataset. Figure 4 graphically represents the partitioning of the dataset that has been performed using the hold-out technique.
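
A class-balanced 70/15/15 hold-out split of this kind can be sketched as follows. The sketch assumes that balance is obtained by splitting each class separately (stratification); the function name `stratified_split`, the seed, and the toy labels are hypothetical and do not reproduce the code developed for this work.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.70, seed=0):
    """Return index lists (train, validation, test), split class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        rest = indices[n_train:]
        n_val = len(rest) // 2  # the remainder is halved: validation and test
        train += indices[:n_train]
        val += rest[:n_val]
        test += rest[n_val:]
    return train, val, test


# Toy labels: two classes with 20 entries each.
labels = ["A"] * 20 + ["B"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```
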

After this division of the dataset, the process followed to train and validate the models has been the same for the four models presented previously. During this process, different combinations of hyperparameters have been tested for each model.

For RFs, various combinations of the hyperparameter n_estimators, which specifies the number of Decision Trees to be used in the ensemble, have been tested. Increasing the value of n_estimators increases the diversity and robustness of the model, which can improve the overall accuracy but can also increase training time and model complexity. The values we have tested for the parameter n_estimators (number of trees) are 10, 50, 100, and 200, widely recommended in the literature [25]. For the rest of parameters, default values have been utilized, that is, Gini as the criterion function, one as the minimum number of samples required to create a leaf node, and two as the minimum number of samples required to split an internal node.

Referring to the KNN model, the hyperparameter that has been tuned is n_neighbors, which specifies the number of nearest neighbours to be used to make classification decisions. A higher value of n_neighbors smooths the decision by considering a larger number of nearest neighbours, while a lower value can lead to a classification that is more sensitive to individual data points and potentially noisier. The different values that have been tested for the n_neighbors parameter are 2, 3, 4, and 5. Neighbours were weighted uniformly, and the Euclidean distance was used as the Minkowski metric.

For SVM, a linear kernel has been selected, and different combinations of the parameter C have been tested. This parameter indicates how much it is desired to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller margin hyperplane if that hyperplane succeeds in correctly classifying all training points. On the other hand, a very small value of C will cause the optimizer to look for a larger margin separator hyperplane, even if that hyperplane incorrectly classifies more points. For very small values of C, some examples are likely to be classified incorrectly, even if the training data are linearly separable. The values we have tested for the C parameter have been 1, 2, 10, and 20. Kernel type RBF is utilized, and the degree of the polynomial kernel function is 3.

Finally, for the FNN model, it was decided to tune two of its hyperparameters: activation and neurons of the hidden layers. There were other parameters in this model to be tuned intended to be included in this study; however, early results showed that the system converged easily to an optimal solution, so the combinations were reduced.

The activation parameter specifies the activation function to be used in the hidden layers of the Neural Network and determines how the output of each neuron in the hidden layer is propagated and transformed. In this study, we have tested all possible values of the activation function, which are “relu” (Rectified Linear Unit), “identity” (Identity Function), “tanh” (Hyperbolic Tangent Function), and “logistic” (Logistic Function). The choice of activation function can affect the model’s ability to learn and adapt to patterns in the data.

On the other hand, the number of neurons in the hidden layer influences how the Neural Network processes and learns from the input data. Three combinations with two hidden layers have been tested, namely (10, 10), (50, 50), and (50, 10), as well as one combination with a single hidden layer of 100 neurons. Adam was used as the optimizer with a constant learning rate of 0.001, and ReLU was used as the activation function for the hidden layers.

Table 2 summarizes the different combinations used in this study for each model.

Finally, the maximum number of iterations for all the trainings was set to 1000, although the training stops when there is no improvement after several epochs.
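
The hyperparameter combinations described in this subsection can be enumerated with a short sketch. The values are those listed in the text; the dictionary layout, the key name `hidden_layer_sizes`, the assumption that every activation function is crossed with every hidden-layer shape for the FNN, and the caller-supplied function `fit_and_score` are hypothetical choices of this sketch.

```python
from itertools import product

# Hyperparameter values taken from the text; the structure is illustrative.
GRIDS = {
    "RF": {"n_estimators": [10, 50, 100, 200]},
    "KNN": {"n_neighbors": [2, 3, 4, 5]},
    "SVM": {"C": [1, 2, 10, 20]},
    "FNN": {
        "activation": ["relu", "identity", "tanh", "logistic"],
        "hidden_layer_sizes": [(10, 10), (50, 50), (50, 10), (100,)],
    },
}


def enumerate_configs(grids):
    """List every (model, parameters) pair as a full cross product per model."""
    configs = []
    for model, grid in grids.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            configs.append((model, dict(zip(names, values))))
    return configs


def run_grid(configs, fit_and_score):
    """Train and score each configuration with a caller-supplied function.

    fit_and_score(model, params) is hypothetical: it is expected to train on
    the training subset and return the metrics measured on the validation subset.
    """
    return [(model, params, fit_and_score(model, params)) for model, params in configs]


configs = enumerate_configs(GRIDS)
print(len(configs))
```

Under the cross-product assumption, the grid contains 4 + 4 + 4 + 16 configurations.
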

3.4.3. Filtering and Final Evaluation

With the results of the metrics from the previous step, a comparison of all the models trained with different algorithms and hyperparameters is performed to select the model that best fits the characteristics of the dataset used.

First, only the models whose “Accuracy” value is greater than or equal to the 85th percentile are retained. Next, among these, only the models whose $F1_{\mathrm{score}}$ value is greater than or equal to the 85th percentile are retained.

Finally, the model runtime is used in the evaluation process to determine the final selected model.
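
The filtering stage can be sketched as two percentile cuts followed by a runtime comparison. Several details are assumptions of this sketch: the percentile is computed by linear interpolation between sorted values, the second cut uses the 85th percentile of the models that survived the first cut, and the fastest survivor is selected. The model names and numbers in the example are made up and are not the results of this work.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_model(results, q=85):
    """Keep models at or above the q-th percentile of accuracy, then of F1,
    and return the survivor with the shortest execution time."""
    acc_cut = percentile([r["accuracy"] for r in results], q)
    survivors = [r for r in results if r["accuracy"] >= acc_cut]
    f1_cut = percentile([r["f1"] for r in survivors], q)
    survivors = [r for r in survivors if r["f1"] >= f1_cut]
    return min(survivors, key=lambda r: r["time_s"])


# Made-up results for five hypothetical models.
results = [
    {"name": "a", "accuracy": 0.90, "f1": 0.89, "time_s": 0.5},
    {"name": "b", "accuracy": 0.99, "f1": 0.99, "time_s": 0.1},
    {"name": "c", "accuracy": 0.99, "f1": 0.99, "time_s": 20.0},
    {"name": "d", "accuracy": 0.95, "f1": 0.94, "time_s": 0.2},
    {"name": "e", "accuracy": 0.80, "f1": 0.78, "time_s": 0.05},
]
best = select_model(results)
print(best["name"])
```
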

4. Results and Discussion

Table 3 shows the results of all models after pre-processing with the corresponding parameters. In order to obtain the execution time, the classifications of all models are performed on the same platform (Google Colab with the same resources). The time shown is the arithmetic mean of the classification time of all samples in the test subset.

Analyzing Table 3 and filtering by the 85th percentile of accuracy, the models retained are RFs with 50, 100, and 200 trees; KNN with 2 and 3 neighbours; and FNN with the logistic, tanh, and relu activation functions and specific shapes of the internal layers of the network: (50, 50), (100), and (50, 10) in some cases.

Next, filtering by the 85th percentile of $F1_{\mathrm{score}}$, the best models are reduced to KNN with 2 neighbours and FNN with a logistic function and (50, 50) architecture. To select the final model between the two, the execution time is taken into account: the average time is 0.087762 s for KNN and 22.473805 s for FNN. Therefore, the selected model is KNN with 2 neighbours.

It is worth noting that all models tested with the FNN model have a high runtime compared to the other models, and this has not been reflected in an improvement of the model performance. Even so, we have decided to keep FNN in our work since, being a more complex model, we believe that this model would obtain better convergence results with more complex datasets.

The final results obtained for the selected model, KNN with 2 neighbours, are 99.19% accuracy and 99.03% $F1_{\mathrm{score}}$. The confusion matrix for the final model is shown in Figure 5.

The confusion matrix reinforces the results previously presented: as can be observed in the main diagonal, almost all the samples are correctly classified. In addition, the space outside the diagonal is practically white and has values of 0 (except in four cases), indicating the low presence of false predictions, i.e., the near absence of both false positives and false negatives for each class. In summary, this matrix, represented as a heat map, shows once again the successful results of the KNN model with 2 neighbours for the prediction of diseases from symptoms coded by their severity.

In conclusion, the outstanding performance of the K-Nearest Neighbors (KNN) model with 2 neighbors, compared to the other algorithms evaluated in this study, such as Random Forest (RF), Support Vector Machine (SVM), and Feed-Forward Neural Network (FNN), can be attributed to several intrinsic characteristics of both the dataset and the KNN model itself.

First, the dataset, which consists of symptoms encoded by severity and associated with specific diseases, appears to have a structure where instances of the same class are clearly defined and clustered. This structure facilitates KNN, a proximity-based algorithm, in classifying effectively by using the distance between points in the feature space to accurately identify the disease corresponding to a given set of symptoms.

Moreover, the choice of using 2 neighbors allows the KNN model to optimally capture local relationships between instances in the dataset. This approach strikes a balance between sensitivity and robustness by considering only the two closest points, avoiding both excessive smoothing of decision boundaries and susceptibility to noise that might occur with fewer neighbors.
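The proximity-based decision described above can be illustrated with a minimal pure-Python KNN. This is a sketch, not the authors' implementation: the feature vectors, the labels `disease_A` and `disease_B`, the Euclidean distance, and the rule that a tie between the two neighbours goes to the closest one are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=2):
    """Label the query by majority vote among its k nearest training samples."""
    by_distance = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_x, train_y)
    )
    nearest = [label for _, label in by_distance[:k]]
    votes = Counter(nearest)
    top = max(votes.values())
    # nearest is ordered by distance, so a tie is won by the closest neighbour
    for label in nearest:
        if votes[label] == top:
            return label


# Hypothetical severity-encoded symptom vectors (one column per symptom).
train_x = [(3, 5, 0, 0), (3, 4, 0, 0), (0, 0, 6, 2), (0, 0, 5, 3)]
train_y = ["disease_A", "disease_A", "disease_B", "disease_B"]

prediction = knn_predict(train_x, train_y, query=(3, 5, 1, 0))
```

When samples of the same class sit close together, as the text suggests for this dataset, the two nearest neighbours usually agree and the tie rule rarely matters.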

On the other hand, although RF, SVM, and ANN are powerful algorithms with advanced capabilities for handling complex datasets, in this particular case, the simplicity and non-parametric nature of KNN have proved to be more suitable. While RF and SVM might have introduced unnecessary complexity and Neural Networks could have faced challenges related to overfitting and high computational demands, KNN provides a straightforward and effective solution that maximizes accuracy without the need for complex adjustments.

In summary, the efficacy of KNN with two neighbors in this context is due to the strong alignment between the dataset’s characteristics and the strengths of the algorithm, underscoring the importance of selecting the most appropriate model based on the structure and nature of the available data.

Finally, the results obtained in this study are compared with the ones published by previous works. This comparison is summarized in Table 4.

In the study developed by Gomathy and Naidu [21], a dataset with 25 classes is used to develop two different models: an RF (obtaining 98.95% accuracy) and an SVM (obtaining 96.49% accuracy). Compared with the results obtained in this work, the accuracy of the RF model is very similar (although this study uses a dataset with 41 classes instead of 25). For the SVM, the results obtained in this work are better, with accuracy values between 98.24% and 98.51%.

Another relevant study is the one performed by Grampurohit and Sagarnal [22], in which an RF model obtains 95.12% accuracy. The results presented here improve significantly on those obtained by Grampurohit and Sagarnal [22], by about 4 percentage points.

It should be noted that both of the works compared above also implement more complex models based on Neural Networks, but the results they obtain with them are worse than the ones compared here.

In this work, Neural Networks were also tested, and one of the configurations obtained the same accuracy as the final selected model (KNN). However, it was not selected because its execution time was much longer than that of KNN.

The final work compared is the one developed by Nesterov et al. [23]. Their results may seem stronger than those obtained in this study, since the datasets used in that work are between 5 and 10 times the size of the dataset used here. More specifically, for the dataset of 200 illnesses, it obtains 96.5% accuracy (while our work obtains 99.19% for a 41-illness dataset).

However, the network presented by Nesterov et al. [23] uses a first hidden layer of more than 6000 neurons and a second layer of 3000 neurons. In our work, 99.19% is obtained with a first hidden layer of 50 neurons (less than 1% of theirs) and a second hidden layer of 50 neurons (less than 2% of theirs). In comparison, the network used by Nesterov et al. [23] has roughly 90 times more hidden neurons than ours (over 9000 versus 100) to classify about 5 times as many illnesses (200 versus 41).
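The size comparison can be checked with a few lines of arithmetic. The sketch below uses the layer sizes quoted above, taking 6000 as a lower bound for the first layer of Nesterov et al. [23], so the resulting ratio of about 90 is a lower bound of the same order of magnitude as the figure quoted in the text; the class ratio 200/41 comes out at about 4.9.

```python
# Hidden-layer sizes quoted in the text; 6000 is a lower bound ("more than 6000").
nesterov_layers = (6000, 3000)
our_layers = (50, 50)

# Size of each of our layers relative to the corresponding layer in [23].
layer_ratios = [ours / theirs for ours, theirs in zip(our_layers, nesterov_layers)]

# Total hidden neurons in [23] relative to ours, and ratio of classes handled.
total_ratio = sum(nesterov_layers) / sum(our_layers)
class_ratio = 200 / 41

print(f"layer 1: {layer_ratios[0]:.2%}, layer 2: {layer_ratios[1]:.2%}")
print(f"hidden neurons: {total_ratio:.0f}x, illnesses: {class_ratio:.1f}x")
```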

Therefore, based on the previous comparative study, the results obtained in this work are promising and seem to improve on previous studies.

5. Conclusions

This work details the importance of the initial screening process in hospital emergency services.

In order to solve the problems derived from the lack of healthcare professionals and misuse of emergency services, the use of automatic disease classification tools based on the patient’s symptomatology, using artificial intelligence techniques, is proposed. These tools would help reduce the time that healthcare professionals spend in the screening process, supporting their decision making.

To demonstrate the feasibility of using this type of tool, multiple classifiers are developed using various machine learning techniques: Random Forest, K-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks.

Using a public dataset containing 41 diseases with their symptomatology, a training process consisting of three phases is carried out: pre-processing, grid search, and final filtering. After this process and evaluating both the accuracy of the classifiers and the execution time, it is determined that the most suitable classifier is KNN with two neighbors. This classifier obtains 99.19% accuracy and a 99.03% $F1_{score}$.

If we compare the results obtained here with previous similar works, it can be observed that this work presents substantial improvements in both accuracy and computational cost, since the works with which it is compared use more complex classifiers.

6. Limitations and Future Areas of Research

Despite the promising results obtained in this study, some limitations must be addressed in future research to improve the efficacy and scientific validation of the machine learning-based disease detection algorithm for practical clinical application.

One of the primary limitations is the reliance on synthetic or non-clinical datasets for training the algorithm. Although a representative dataset comprising symptoms encoded by severity was utilized, it is imperative to incorporate a more extensive and diverse set of real clinical data that reflect the wide variability present in actual medical cases. The inclusion of real-world data would significantly improve the algorithm’s performance and predictive capabilities, ensuring its robustness and generalizability across different patient populations and clinical settings.

Another critical limitation pertains to the scientific validation of the algorithm. Although the current study produces encouraging results, further validation through controlled clinical studies and collaborations with medical experts is essential. Such rigorous validation processes are necessary to confirm the reliability, accuracy, and applicability of the algorithm in various clinical environments, thereby facilitating its acceptance and integration into standard medical practices.

For future work, it is recommended to explore the application of the algorithm using more complex and heterogeneous datasets derived from real clinical scenarios. This approach would enable the model to handle a wider spectrum of diseases and clinical conditions, providing a more comprehensive and precise diagnostic tool. Furthermore, conducting comparative analyses with other existing models by experimenting with different hyperparameters and advanced machine learning techniques could lead to the development of an enhanced version of the algorithm with superior performance metrics. In this context, investigating the efficacy of Graph Neural Networks (GNNs) could be particularly beneficial. For instance, Islam et al. [30] reported a 95% accuracy rate in disease detection using GNNs based on symptom data, indicating a promising avenue for improvement.

Furthermore, integrating the algorithm developed with Natural Language Processing (NLP) techniques presents a valuable opportunity for advancement. Employing conventional NLP methods or modern transformer-based language models could enhance the algorithm’s ability to comprehend and analyze textual information related to symptoms, facilitating a complete diagnostic cycle—from symptom expression in natural language to the provision of potential diagnoses. This integration would not only improve the user experience but also increase the accuracy and efficiency of the diagnostic process.

In addition to NLP integration, incorporating the algorithm into existing symptom checker platforms that combine symptom recognition, questionnaires, and preliminary diagnoses could significantly enhance its practical utility. Such an integration would offer a more comprehensive and user-friendly tool for disease detection, providing a validated and certified graphical interface as a medical device. This approach would endow existing symptom checkers with new capabilities, enabling more precise and rapid differential diagnoses.

Moreover, future research could explore the deployment of the algorithm within telemedicine platforms and electronic health record (EHR) systems to facilitate remote diagnostics and continuous patient monitoring. This integration could improve the accessibility and efficiency of health care, particularly in underserved or remote areas. It is also crucial to consider the ethical, privacy, and security implications of handling sensitive medical data, ensuring compliance with relevant regulations and standards.

In summary, this study has identified significant limitations and outlined various avenues for future research. By addressing these limitations and exploring novel techniques and integrations, substantial progress can be made in developing robust machine learning-based disease detection algorithms. Such advances have the potential to improve early diagnosis, enhance patient outcomes, and contribute to alleviating current burdens on healthcare systems.

Author Contributions

Conceptualization: F.L.-P. and M.D.-M.; methodology: L.M.-A. and M.D.-M.; software: A.F.-P. and F.L.-P.; validation: L.M.-A. and M.D.-M.; formal analysis: F.L.-P. and M.D.-M.; investigation: A.F.-P. and F.L.-P.; resources: L.M.-A. and M.D.-M.; data curation: A.F.-P.; writing—original draft preparation: A.F.-P. and M.D.-M.; writing—review and editing: F.L.-P. and L.M.-A.; visualization: L.M.-A.; supervision: F.L.-P. and M.D.-M.; project administration: L.M.-A.; funding acquisition: M.D.-M. All authors have read and agreed to the published version of the manuscript.

Funding

Ministerio de Ciencia, Innovación y Universidades (Spanish Government): Spanish AEI (Agencia Estatal de Investigación) project ADICVIDEO (PID2022-141172OA-I00).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used for this work, “Disease Symptom Prediction”, is publicly available in the Kaggle dataset repository at https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset (accessed on 8 September 2024). The data and code associated with this study are available upon request by contacting the corresponding author.

Acknowledgments

We want to thank the research group “TEP108—Robotics and Computer Technology” from University of Seville (Spain).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. NU. CEPAL. Perspectivas de la Población Mundial 2019: Metodología de las Naciones Unidas para las Estimaciones y Proyecciones de Población. 2020. Available online: https://hdl.handle.net/11362/45989 (accessed on 13 September 2024).
2. Cristea, M.; Noja, G.G.; Stefea, P.; Sala, A.L. The impact of population aging and public health support on EU labor markets. Int. J. Environ. Res. Public Health 2020, 17, 1439.
3. World Health Organization. The World Health Report 2006: Working Together for Health; WHO: Geneva, Switzerland, 2006.
4. Golz, C.; Oulevey Bachmann, A.; Defilippis, T.S.; Kobleder, A.; Peter, K.A.; Schaffert, R.; Schwarzenbach, X.; Kampel, T.; Hahn, S. Preparing students to deal with the consequences of the workforce shortage among health professionals: A qualitative approach. BMC Med. Educ. 2022, 22, 756.
5. World Health Organization. Universal Health Coverage Partnership Annual Report 2019: In Practice: Bridging Global Commitments with Country Action to Achieve Universal Health Coverage; WHO: Geneva, Switzerland, 2020.
6. Tolosana, S.; Serrano, O. Local perception of access to health services in rural areas. The case of the Navarrese Pyrenees. An. Sist. Sanit. Navar. 2021, 44, 185–194.
7. World Health Organization. WHO Policy on Disability; WHO: Geneva, Switzerland, 2021.
8. Alnasser, S.; Alharbi, M.; AAlibrahim, A.; Aal ibrahim, A.; Kentab, O.; Alassaf, W.; Aljahany, M. Analysis of Emergency Department Use by Non-Urgent Patients and Their Visit Characteristics at an Academic Center. Int. J. Gen. Med. 2023, 16, 221–232.
9. Instituto Nacional de Estadística. Encuesta de Morbilidad Hospitalaria. 2021. Available online: https://www.ine.es/prensa/emh_2021.pdf (accessed on 13 September 2024).
10. Gracia, R.A.; Ramos, I.J.; Palacio, P.C.; Arcelus, M.G.; Albors, C.P. Saturación en los servicios de urgencias, causas y consecuencias. Rev. Sanit. Investig. 2021, 2, 138.
11. Schweitzer, S.O. Cost effectiveness of early detection of disease. Health Serv. Res. 1974, 9, 22.
12. Washington, D.L.; Stevens, C.D.; Shekelle, P.G.; Baker, D.W.; Fink, A.; Brook, R.H. Safely directing patients to appropriate levels of care: Guideline-driven triage in the emergency service. Ann. Emerg. Med. 2000, 36, 15–22.
13. Tang, M.; Reddy, A. Telemedicine and its past, present, and future roles in providing palliative care to advanced cancer patients. Cancers 2022, 14, 1884.
14. Luna-Perejón, F.; Muñoz-Saavedra, L.; Castellano-Domínguez, J.M.; Domínguez-Morales, M. IoT garment for remote elderly care network. Biomed. Signal Process. Control 2021, 69, 102848.
15. Shore, J.H.; Schneck, C.D.; Mishkind, M.C. Telepsychiatry and the coronavirus disease 2019 pandemic—Current and future outcomes of the rapid virtualization of psychiatric care. JAMA Psychiatry 2020, 77, 1211–1212.
16. Muñoz-Saavedra, L.; Escobar-Linero, E.; Miró-Amarante, L.; Bohórquez, M.R.; Domínguez-Morales, M. Designing and evaluating a wearable device for affective state level classification using machine learning techniques. Expert Syst. Appl. 2023, 219, 119577.
17. Chang, J.E.; Lai, A.Y.; Gupta, A.; Nguyen, A.M.; Berry, C.A.; Shelley, D.R. Rapid transition to telehealth and the digital divide: Implications for primary care access and equity in a post-COVID era. Milbank Q. 2021, 99, 340–368.
18. Domínguez-Morales, M.J.; Luna-Perejón, F.; Miró-Amarante, L.; Hernández-Velázquez, M.; Sevillano-Ramos, J.L. Smart footwear insole for recognition of foot pronation and supination using neural networks. Appl. Sci. 2019, 9, 3970.
19. Escobar-Linero, E.; Domínguez-Morales, M.; Sevillano, J.L. Worker’s physical fatigue classification using neural networks. Expert Syst. Appl. 2022, 198, 116784.
20. Piñero-Fuentes, E.; Canas-Moreno, S.; Rios-Navarro, A.; Domínguez-Morales, M.; Sevillano, J.L.; Linares-Barranco, A. A deep-learning based posture detection system for preventing telework-related musculoskeletal disorders. Sensors 2021, 21, 5236.
21. Gomathy, C.; Naidu, M.A.R. The prediction of disease using machine learning. Int. J. Sci. Res. Eng. Manag. (IJSREM) 2021, 5, 1–7.
22. Grampurohit, S.; Sagarnal, C. Disease prediction using machine learning algorithms. In Proceedings of the 2020 International Conference for Emerging Technology (INCET), Belgaum, India, 5–7 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7.
23. Nesterov, A.; Ibragimov, B.; Umerenkov, D.; Shelmanov, A.; Zubkova, G.; Kokh, V. NeuralSympCheck: A Symptom Checking and Disease Diagnostic Neural Model with Logic Regularization. In Proceedings of the Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, 14–17 June 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 76–87.
24. Chen, M.; Hao, Y.; Hwang, K.; Wang, L.; Wang, L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access 2017, 5, 8869–8879.
25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
26. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; IEEE: Piscataway, NJ, USA, 1995; Volume 1, pp. 278–282.
27. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
28. Cortes, C.; Vapnik, V. Support vector machine. Mach. Learn. 1995, 20, 273–297.
29. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
30. Islam, S.R.; Sinha, R.; Maity, S.P.; Ray, A.K. Deep learning on symptoms in disease prediction. In Machine Learning for Healthcare Applications; Wiley: Hoboken, NJ, USA, 2021; pp. 77–87.

Figure 1. Graphical diagram representing the machine learning algorithms considered in the study for screening system analysis, with colors highlighting key components. (a) Random Forest: green and blue nodes represent data points used in different decision trees, and the final result is determined by majority voting or averaging. (b) K-Nearest Neighbors (KNN): red triangles and blue squares represent different classes, with the green circle being the query point. (c) Support Vector Machine (SVM): yellow circles and blue squares indicate different classes, while the black line is the optimal hyperplane separating them, and the red circles are support vectors. (d) Neural Network (NN): green nodes represent input layers, yellow nodes represent hidden layers, and red nodes represent the output layer.

Figure 2. Graphical abstract of the full processing chain, showing the dataset split into training (green, 70%), evaluation (yellow, 15%), and testing (red, 15%) phases. Multiple algorithms are trained and evaluated using the training and evaluation sets, while the testing set is used for final model performance assessment.

Figure 3. Schematic representation of the steps followed for the pre-processing of the datasets.

Figure 4. Graphical representation and numerical data of the dataset split using hold-out in train, validation, and test subsets.

Figure 5. Confusion matrix for the final selected model (KNN with 2 neighbours).

Table 1. Dataset samples and split.

Set	Samples	Percentage
Train	3445	70%
Validation	738	15%
Test	738	15%
TOTAL	4921
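The sample counts in Table 1 follow from applying a 70/15/15 hold-out split to 4921 samples. The sketch below reproduces those counts on shuffled indices; the function name `holdout_split`, the rounding rule, and the fixed seed are assumptions, since the source does not describe how the split was implemented.

```python
import random


def holdout_split(n_samples, train=0.70, validation=0.15, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train * n_samples)
    n_validation = round(validation * n_samples)
    train_idx = indices[:n_train]
    validation_idx = indices[n_train:n_train + n_validation]
    test_idx = indices[n_train + n_validation:]  # remainder goes to test
    return train_idx, validation_idx, test_idx


train_idx, validation_idx, test_idx = holdout_split(4921)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Assigning the remainder to the test subset guarantees that the three parts always add up to the full dataset, whatever the rounding.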

Table 2. Models’ parameter tuning.

Model	Hyperparameters
Random Forest	Number of Trees: 10, 50, 100, 200
K-Nearest Neighbors	Number of Neighbors: 2, 3, 4, 5
Support Vector Machine	C: 1, 2, 10, 20
Feed-Forward Neural Network	Activation: relu, identity, tanh, logistic Neurons: (100), (10, 10), (50, 10), (50, 50)
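The search space in Table 2 can be enumerated programmatically. The sketch below expands it into one configuration per model and hyperparameter combination; the dictionary layout and the helper `expand` are hypothetical, and training or scoring of the models is deliberately left out.

```python
from itertools import product

# Hyperparameter values as listed in Table 2.
GRID = {
    "RF": {"n_estimators": [10, 50, 100, 200]},
    "KNN": {"n_neighbors": [2, 3, 4, 5]},
    "SVM": {"C": [1, 2, 10, 20]},
    "FNN": {
        "activation": ["relu", "identity", "tanh", "logistic"],
        "neurons": [(100,), (10, 10), (50, 10), (50, 50)],
    },
}


def expand(grid):
    """Yield (model, parameters) for every combination in the grid."""
    for model, params in grid.items():
        names = list(params)
        for values in product(*(params[name] for name in names)):
            yield model, dict(zip(names, values))


configurations = list(expand(GRID))
print(len(configurations), "configurations")
```

The grid yields 4 + 4 + 4 + 16 = 28 configurations, which matches the number of rows reported in Table 3.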

Table 3. Grid Search results.

Model	Execution Time (s)	n_Estimators	n_Neighbors	C	Activation	Neurons	$F1_{score}$	Accuracy
RF	0.047557116	10	NA	NA	NA	NA	98.5669%	98.7805%
	0.185845137	50	NA	NA	NA	NA	98.8983%	99.0515%
	0.327396631	100	NA	NA	NA	NA	98.8983%	99.0515%
	0.659683228	200	NA	NA	NA	NA	98.8983%	99.0515%
KNN	0.087762117	NA	2	NA	NA	NA	99.0290%	99.1870%
	0.043197870	NA	3	NA	NA	NA	98.8983%	99.0515%
	0.043943644	NA	4	NA	NA	NA	98.6976%	98.9160%
	0.087762117	NA	5	NA	NA	NA	98.3714%	98.6450%
SVM	0.104895115	NA	NA	1	NA	NA	97.7190%	98.2385%
	0.084115028	NA	NA	5	NA	NA	98.1293%	98.5095%
	0.080511093	NA	NA	10	NA	NA	98.1293%	98.5095%
	0.088666439	NA	NA	20	NA	NA	98.1293%	98.5095%
FNN	7.424005985	NA	NA	NA	identity	(10, 10)	90.5192%	91.4634%
	5.624384403	NA	NA	NA	identity	(50, 50)	91.0987%	91.8699%
	5.281115770	NA	NA	NA	identity	(50, 10)	91.3083%	92.0054%
	10.32017255	NA	NA	NA	identity	(100)	91.4521%	92.4119%
	14.49916768	NA	NA	NA	logistic	(10, 10)	85.5094%	86.9918%
	22.47380471	NA	NA	NA	logistic	(50, 50)	99.0290%	99.1869%
	18.69183969	NA	NA	NA	logistic	(50, 10)	94.5426%	94.4444%
	21.43575096	NA	NA	NA	logistic	(100)	98.8983%	99.0514%
	13.71196008	NA	NA	NA	tanh	(10, 10)	97.2997%	97.5609%
	9.751794100	NA	NA	NA	tanh	(50, 50)	98.8983%	99.0514%
	11.14694428	NA	NA	NA	tanh	(50, 10)	98.8983%	99.0514%
	12.34619427	NA	NA	NA	tanh	(100)	98.8983%	99.0514%
	15.97383976	NA	NA	NA	relu	(10, 10)	98.0560%	98.2384%
	7.647896767	NA	NA	NA	relu	(50, 50)	98.8983%	99.0514%
	8.573491573	NA	NA	NA	relu	(50, 10)	98.5669%	98.7804%
	11.89987421	NA	NA	NA	relu	(100)	98.5669%	98.7804%

Table 4. Comparative summary with previous works available in the literature.

Work	Year	Classifier	Classes	Accuracy
Gomathy and Naidu [21]	2021	RF	25	98.95%
Gomathy and Naidu [21]	2021	SVM	25	96.49%
Grampurohit and Sagarnal [22]	2020	RF	-	95.12%
Nesterov et al. [23]	2022	FNN	200	96.5%
Nesterov et al. [23]	2022	FNN	400	49%
This work		RF	41	99.05%
		KNN		99.19%
		SVM		98.51%
		FNN		99.19%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
