Article

Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning—Random Forest and Decision Tree Techniques

by Wilson Arrubla-Hoyos 1,*, Jorge Gómez Gómez 2,* and Emiro De-La-Hoz-Franco 3

1 Faculty of Engineering, Universidad Nacional Abierta y a Distancia, Sincelejo 700002, Colombia
2 SOCRATES Group, Department of Systems Engineering and Telecommunications, Faculty of Engineering, University of Cordoba, Montería 230001, Colombia
3 Department of Computer Science and Electronics, Faculty of Engineering, Universidad de la Costa, Barranquilla 080002, Colombia
* Authors to whom correspondence should be addressed.
Informatics 2024, 11(3), 69; https://doi.org/10.3390/informatics11030069
Submission received: 15 July 2024 / Revised: 5 September 2024 / Accepted: 12 September 2024 / Published: 20 September 2024

Abstract

Dengue, Zika, and chikungunya viruses pose a serious threat globally and circulate widely in the Americas. These diseases share similar symptoms in their early stages, which can make early diagnosis difficult. In this study, two predictive models based on Decision Trees and Random Forests were developed to classify dengue, Zika, and chikungunya, with the aim of providing support that is easily interpretable by the medical community. To achieve this, a dataset was collected from a clinic in Sincelejo, Colombia, including the signs, symptoms, and laboratory results of these diseases. The Pan American Health Organization (PAHO) Diagnostic Guide 2022 methodology for the differential classification of dengue and chikungunya was applied by assigning evaluative weights to symptoms in the dataset. In addition, a bootstrap resampling technique based on the central limit theorem was used to balance the target variable, and cross-validation was used to train the models. The best results were obtained with the Random Forest technique, achieving an accuracy of 99.7% for classifying chikungunya, 99.1% for dengue, and 98.8% for Zika. This study represents a significant advance in the differential prediction of these diseases through the use of machine learning techniques and the integration of clinical and laboratory information.

1. Introduction

The tropical diseases dengue, Zika, and chikungunya are transmitted by mosquitoes of the species Aedes aegypti and Aedes albopictus [1,2,3,4,5] and represent a global public health problem [6,7,8]. Timely diagnosis can be challenging because these diseases often share a similar clinical picture at an early stage [9,10]. In addition, their co-circulation in some parts of the world makes it even harder to distinguish between them [9,11,12]. Early differentiation between these diseases is crucial for proper treatment and for the implementation of effective control measures against mosquito vectors.
Specific tests, such as real-time reverse transcriptase polymerase chain reaction (RT-PCR) and high-sensitivity enzyme-linked immunosorbent assay (ELISA), are available for classifying these diseases [13]. However, these tests often require specialised equipment, which is not always available in remote areas. Faced with this limitation, solutions using more advanced technologies, such as machine learning algorithms [14,15,16,17,18] and deep learning [19], have been proposed to support early diagnosis based on disease signs and symptoms.
One of the main challenges lies in the interpretation of the predictions made by these algorithms, as many of these techniques operate as black boxes [20]; that is, it is not fully understood how they arrive at their conclusions. However, there are exceptions, such as decision trees (DTs) and more complex techniques, such as Random Forest (RF), which offer greater transparency in their decision-making processes.
According to recent systematic literature reviews [20,21], predictive models that attempt to effectively differentiate between dengue, Zika, and chikungunya have not yet been developed. It is therefore important to contribute to this area to support medical decision-making in settings where specific, highly sensitive tests are not readily available early on.
This study aimed to develop predictive models that can be easily interpreted by the medical community. It proposes the use of decision trees and random forests to predict dengue, Zika, and chikungunya from clinical data, including signs, symptoms, and laboratory results. In addition, bootstrapping, a resampling technique based on the central limit theorem, is used to balance the classes of the target variable, together with a weight assignment methodology based on PAHO 2022 [22] and cross-validation to obtain more balanced model performance. The rest of the article is organised as follows: Section 2 presents the context of this study, Section 3 describes the methodology used, Section 4 presents and discusses the results obtained, and, finally, Section 5 presents the conclusions.

2. Background

2.1. Differential Classification of Dengue, Zika and Chikungunya

Dengue, Zika, and chikungunya viruses are transmitted by mosquitoes, often presenting similar clinical symptoms, and can sometimes coexist in the same individual [12]. These three diseases represent a serious threat to public health worldwide [23], and, throughout the Americas, their circulation is epidemic [24], making early diagnosis difficult for healthcare professionals. On the other hand, laboratory tests such as RT-PCR and ELISA, which are highly sensitive and specific, are used to confirm or rule out these diseases. However, in remote areas, these specialised tests are difficult to access for rapid diagnosis. To address this situation, artificial intelligence-based alternatives have been proposed [23,24,25,26,27,28,29] that aim to predict dengue, Zika, or chikungunya early and aid medical decisions.
Some researchers have proposed models to differentially predict tropical diseases, such as those mentioned above. For example, in [30], an algorithm was developed to classify dengue, chikungunya, and malaria using neural networks. Similarly, other studies [19,22,31,32] propose models based on classical techniques, ensembles, and convolutional networks to classify dengue, chikungunya, and other diseases. Although differential classifications have been attempted, no specific proposals for classifying dengue, Zika, and chikungunya have been demonstrated thus far [20,21]. This may be because of the complexity of compiling a dataset with records of the signs and symptoms of these three diseases.

2.2. Methodology for the Differential Classification of Dengue and Chikungunya According to the PAHO Diagnostic Guide 2022

The proposal by [22] is based on Evidence Synthesis: Guidelines for the diagnosis and treatment of dengue, chikungunya, and Zika in the Americas [30]. That study presents a methodological approach that converts qualitative information in a dataset into quantitative information, assigning differential weights to symptoms according to medical evidence and the GRADE scale, based on recommendation 1 of the guideline. To achieve this transformation, the variables that the dataset has in common with the Pan American Health Organization (PAHO) guidelines were first identified, and quality rules were established to parameterise the assignment. Subsequently, a linear interpolation function was used to assign weights to the symptoms according to the evidence. In addition, different machine learning techniques were used to compare the models, achieving an accuracy of 99% compared with 79% without the methodology.

2.3. Bootstrap Resampling Technique

Bootstrapping is a statistical technique used to estimate the properties of the sampling distribution of a statistic [31,32]. It generates multiple samples, usually by resampling the original sample, to simulate the sampling distribution of the statistic [33]. This technique is well known and widely used because it reduces variability [34,35], as supported by the law of total variance, $\operatorname{Var}[H(U)] \geq \mathbb{E}[\operatorname{Var}[H(U) \mid S]]$ [34]. It differs from other data-balancing techniques, which increase the number of samples by oversampling and tend to introduce bias when counteracting class imbalance.
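As a minimal sketch of the general idea (not the authors' code), the following resamples a small sample with replacement many times and uses the spread of the resampled means as an estimate of the standard error. The function name `bootstrap_means` and the platelet values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the mean of each of n_resamples bootstrap resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values from the sample, with replacement.
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


# Hypothetical platelet counts (x10^3 per microlitre); illustrative values only.
platelets = [150, 98, 210, 175, 132, 88, 240, 160, 115, 190]

means = bootstrap_means(platelets)
std_error = statistics.stdev(means)  # spread of the bootstrap distribution

print(f"sample mean: {statistics.fmean(platelets):.1f}")
print(f"bootstrap standard error of the mean: {std_error:.1f}")
```

The fixed seed only makes the sketch reproducible; any number of resamples large enough to stabilise the estimate would serve.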

3. Materials and Methods

This article presents an experiment aimed at differentially predicting dengue, Zika, and chikungunya. It compares predictive models built with and without the weighting methodology based on scientific evidence from the PAHO 2022 guidelines proposed by Arrubla et al. [22], using different machine learning techniques. The process began with the creation of a fully anonymised dataset in collaboration with Clínica Las Peñitas in the city of Sincelejo, Colombia. This dataset relates signs, symptoms, and clinical laboratory data recorded in 2015 for chikungunya, 2016 for Zika, and 2020 for dengue. Subsequently, bootstrap resampling with replacement was applied to address the class imbalance and the small size of the dataset. This technique was chosen because of its advantages in this context: it balances the classes of the target variable based on the central limit theorem without generating synthetic data. Finally, two machine learning models based on decision trees (DTs) and random forests (RFs) were proposed to compare the results obtained by applying the weight assignment methodology with those obtained without it. Figure 1 presents the proposed algorithm for the development of this experiment in more detail.

3.1. Data Processing

This phase consists of three steps that allow preprocessing of the data before applying the training. Python 3.10.12 libraries Pandas 2.1.4, Numpy 1.26.4, and Matplotlib 3.7.1 were used for the development of this phase. Each of these is described below:

3.1.1. Dataset Creation

The creation of a dataset for these three diseases became necessary because of the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20,21] show limitations in studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymised records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia, in compliance with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic’s ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases in 2015, Zika in 2016, and dengue in 2020. The resulting dataset consisted of 150 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
The amount of data is limited because of the difficulty in finding historical records for chikungunya and Zika, whose epidemic cycles occurred in 2015 and 2016 [36,37], respectively. During these epidemic peaks, data collection was incomplete, and it was not possible to use records published by the Colombian National Institute of Health. The distribution of the dataset is as follows: 89 cases of Zika (59%), 52 of dengue (35%), and 9 of chikungunya (6%).

3.1.2. Data Cleaning

This phase begins with an exploratory statistical analysis of the data, which includes the selection of variables and the treatment of outliers in the dataset. During this stage, the variables fever, hepatomegaly, hypothermia, and increased haematocrit were removed because they carried no discriminating information: fever was present in every record in the dataset, and the other three variables were labelled “NO” in every record.

3.1.3. Target Balance through Bootstrapping

One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 dengue, and 9 chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for chikungunya, which represents only 6% of the data. Faced with this challenge, and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed the creation of new samples that balanced the dengue and chikungunya labels with the Zika label in the target variable. Consequently, the dataset increased from 150 to 267 samples in total, with 89 samples per disease; the 89 original Zika records were left unchanged. Bootstrapping was chosen because it is a statistical technique that allows the accuracy of a statistic (such as the mean or median) to be estimated by creating multiple samples from the original data. This is particularly useful when the exact properties of the underlying population are unknown. Unlike synthetic oversampling methods such as ADASYN [38], SMOTE [39,40], or data augmentation [35,41], which create new synthetic samples to augment the data of the minority class, bootstrapping does not generate new data. Instead, it creates “dummy samples” by repeatedly drawing with replacement from the original data. This means that the data are randomly selected from the original sample, so the same record can be selected more than once.
This process is repeated many times, calculating the desired estimate (such as the mean) for each sample and examining how these estimates vary. By repeating this process multiple times, a distribution of possible values of the estimate is obtained, and the variability and uncertainty can be assessed without making strong assumptions about the underlying distribution of the data. Figure 2 presents the data balancing process.
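The balancing step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the text does not say whether the original minority records were kept and only the shortfall was drawn, so this sketch assumes that they were. The helper `balance_by_bootstrap` and the placeholder records are hypothetical; only the class counts (89, 52, 9) come from the text.

```python
import random
from collections import Counter


def balance_by_bootstrap(records, label_of, seed=0):
    """Resample each smaller class with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)

    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)  # assumption: every original record is kept
        # Draw only the shortfall, with replacement, from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Placeholder records (label, id) reproducing the reported class counts.
records = (
    [("zika", i) for i in range(89)]
    + [("dengue", i) for i in range(52)]
    + [("chikungunya", i) for i in range(9)]
)

balanced = balance_by_bootstrap(records, label_of=lambda rec: rec[0])
counts = Counter(label for label, _ in balanced)
print(len(balanced), dict(counts))
```

Because every drawn record is a copy of an existing one, the balanced set contains duplicates but no synthetic records, which is the distinction the text draws against SMOTE and ADASYN.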

3.2. Training Model Based on the PAHO 2022 Methodology

In this phase, the methodology proposed in [22] was implemented using the cross-validation technique with k = 10. This technique provides significant advantages in obtaining quality metrics compared with the standard 70/30 split, especially when working with limited datasets. To carry out the training, we used Jupyter Notebook, Google Colab (Python 3.10.12), scikit-learn 1.3.2, RStudio 4.3.3, and the predictoR package 3.0.10 developed by Promidat of the Autonomous University of Central America in Costa Rica, which offers a wide range of machine learning methods for the creation of predictive models.

3.2.1. Data Transformation According to the Methodology Based on the PAHO Guidelines (2022)

In this phase, data transformation was performed following a methodology based on the PAHO Guidelines (2022) proposed by Arrubla et al. [22]. Quantitative values were assigned to each categorical variable in the dataset in accordance with the guidelines of the Pan American Health Organization (PAHO), which allowed assigning a differential value based on medical evidence to the variables that coincided with the proposals of these guidelines. Table 2 shows the variables to which interpolation was applied and the ranges of the evaluative weights, allowing the categorical variables to be transformed into numerical variables according to the methodological proposal of [22].
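To make the interpolation step concrete, here is a minimal sketch of mapping a categorical symptom level onto a weight range by linear interpolation. Everything specific in it is an assumption for illustration: the three severity levels, the weight range (0.0 to 1.0), and the function names are hypothetical, and the actual variables and ranges are those given in Table 2 and in [22].

```python
def interpolate_weight(level, level_min, level_max, weight_min, weight_max):
    """Linearly map a level in [level_min, level_max] onto [weight_min, weight_max]."""
    if level_max == level_min:
        raise ValueError("level_min and level_max must differ")
    fraction = (level - level_min) / (level_max - level_min)
    return weight_min + fraction * (weight_max - weight_min)


# Hypothetical ordinal encoding of a categorical symptom.
SEVERITY = {"NO": 0, "MILD": 1, "SEVERE": 2}

# Hypothetical weight range for one symptom (not taken from Table 2).
HEADACHE_RANGE = (0.0, 1.0)


def headache_weight(label):
    """Convert a categorical headache label into a numerical weight."""
    return interpolate_weight(SEVERITY[label], 0, 2, *HEADACHE_RANGE)


for label in ("NO", "MILD", "SEVERE"):
    print(label, headache_weight(label))
```

A different symptom would use its own range, so that stronger evidence for a symptom translates into a wider or higher span of weights.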

3.2.2. Modeling with ML Techniques

At this stage, the models were trained using Decision Tree (DT) and Random Forest (RF) machine learning techniques. These techniques were selected because of their good performance in previous experiments using similar data [22]. In addition, the decision tree offers the advantage of being more interpretable in its results, thus facilitating an understanding of the criteria used for decision-making. The training was carried out using the k = 10 cross-validation technique, with the aim of obtaining more reliable results and minimising the biases that are usually generated when using conventional techniques, such as the 70–30 division.
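A training loop of this kind might look like the sketch below, which uses scikit-learn, the library named in Section 3.2. It is not the authors' code: the data are synthetic (generated with `make_classification` to mimic 267 rows and three classes), the hyperparameters are library defaults, and the choice of stratified, shuffled folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced dataset; NOT the clinical data.
X, y = make_classification(
    n_samples=267, n_features=23, n_informative=8, n_classes=3, random_state=0
)

# k = 10 folds, each preserving the class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# One accuracy value per fold for each model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} over {len(fold_scores)} folds")
```

Reporting the ten per-fold values rather than a single split is what allows the fold-by-fold comparison of accuracy and error discussed in the results.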

3.3. Model Training

In this phase, the data processed without applying the aforementioned weight assignment methodology were used. The training and evaluation of the models were carried out under the same conditions as in the previous step using cross-validation, DT, and RF techniques.

3.4. Assessment

In the last phase of the proposed methodology, the assessment results were analysed. Because the study focuses on classification models, this analysis was based on quality metrics obtained from the confusion matrix: accuracy, precision, recall, and F1-score [22].
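For reference, the sketch below shows how these four metrics follow from a multi-class confusion matrix. The matrix values are hypothetical and are not the study's results; the function name `per_class_metrics` is likewise illustrative.

```python
def per_class_metrics(matrix, labels):
    """Compute overall accuracy and per-class precision, recall, and F1.

    matrix[i][j] is the number of cases of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}

    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp   # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


labels = ["chikungunya", "dengue", "zika"]
# Hypothetical counts for 267 cases (89 per true class); illustration only.
matrix = [
    [88, 0, 1],
    [0, 85, 4],
    [1, 3, 85],
]

report = per_class_metrics(matrix, labels)
print(f"accuracy: {report['accuracy']:.3f}")
for label in labels:
    m = report[label]
    print(f"{label}: precision {m['precision']:.3f}, recall {m['recall']:.3f}, F1 {m['f1']:.3f}")
```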

4. Results and Discussion

Table 3 presents the results obtained by the DT and RF models using the dataset transformed by the methodology proposed in [22] and balancing of the target variable using the bootstrapping technique.
The results of both models were balanced for all quality metrics, allowing for highly accurate prediction of the three diseases in the dataset. However, the RF model performs better overall. Figure 3 shows a comparison of the results obtained using the DT and RF models.
The confusion matrix in Figure 4 reveals high overall performance in classifying chikungunya, dengue, and Zika using the DT model. The model classifies chikungunya with an accuracy of 88.5%, presenting a very low false-negative rate (0.5%) and no false positives. Dengue had an accuracy of 86.7%, with false negatives (1.7%) and false positives (0.6%). Although Zika showed a slightly lower accuracy of 81.9%, it was still high, with a false negative rate of 4.1% and false positives of 3.0%. Overall, the model was efficient, although it exhibited slight confusion in classifying Zika.
The confusion matrix shown in Figure 5 reveals that the Random Forest technique offers robust performance in classifying Chikungunya, Dengue, and Zika. The model achieved an accuracy of 88.5% for chikungunya, with no false positives and a very low false-negative rate of 0.5%. For Dengue, the accuracy was 88.0%, with a false-negative rate of 1.0% and no false positives. For Zika, the accuracy was 87.3%, with a false negative rate of 0.4% and false positives of 1.4%. Overall, the model demonstrated high classification ability, although it exhibited slight confusion in identifying Zika compared to the other two classes.
Table 4 shows the results of the models obtained by working with the dataset without applying the methodology proposed by Arrubla et al. [22] but balanced using the bootstrapping technique.
The results show good performance of the DT model, with a balance in most metrics, although accuracy is the only measure that is below 90%, with 88.8%. However, Random Forest (RF) performs better in the classification of the three diseases, with a balance in all quality metrics, making it a better option to support the classification of these diseases. Figure 6 presents a comparison of the performance of the two models.
Similarly, Figure 7 summarises the behaviour of the ten models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by Arrubla et al. [22] allows superior quality metrics to be obtained in the model.
The confusion matrix in Figure 8 for the decision tree technique indicates mixed performance in classifying Chikungunya, Dengue, and Zika. The model classifies chikungunya with an accuracy of 88.2%, with no false positives and a false negative rate of 0.8%. However, for Dengue, the accuracy was 76.9%, with a remarkably high false-negative rate of 7.3% and a false-positive rate of 4.8%. Zika shows an accuracy of 72.0%, with a false-negative rate of 12.3% and a false-positive rate of 4.7%. Overall, although the decision tree model shows good accuracy for chikungunya, it faces difficulties in classifying Dengue and Zika, evidencing higher confusion between the classes.
The confusion matrix in Figure 9 for the Random Forest model showed better performance in classifying chikungunya, dengue, and Zika. The model achieved an accuracy of 88.2% for chikungunya, with a false-positive rate of 0.0% and a false-negative rate of 0.8%. For dengue, the accuracy was 83.0%, with 3.7% false negatives and 2.3% false positives. For Zika, the accuracy was 81.4%, with a false-negative rate of 5.3% and a false-positive rate of 2.4%. Overall, the model demonstrated better classification ability with a good balance between classes, although it showed slight confusion in identifying Zika and dengue.
The application of the methodology proposed by Arrubla et al. [22] significantly improved the performance of the Random Forest and Decision Tree classification models. For Random Forest, the use of the methodology results in a slight improvement in accuracy, especially in the reduction of false positives and negatives, with an accuracy of 88.5% for chikungunya, 88.0% for dengue, and 87.3% for Zika. In comparison, without the methodology, the accuracy was 88.2% for chikungunya, 83.0% for dengue, and 81.4% for Zika, showing a lower discrimination capacity between classes. For decision trees, the proposed methodology also had a positive impact, improving the overall accuracy to 88.5% for chikungunya, 86.7% for dengue, and 81.9% for Zika. Without the methodology, the accuracies were 88.2% for chikungunya, 76.9% for dengue, and 72.0% for Zika, indicating a notable reduction in classification capacity, especially for dengue and Zika.
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 10 shows the model tree, in which the rules generated by the model to perform the respective classifications are shown.
Figure 10 illustrates that headache is the most relevant variable for classifying dengue, while myalgia is key for identifying chikungunya, aligning with PAHO’s 2022 guidelines on the differential symptoms of these diseases. The decision tree classifies cases between chikungunya, dengue, and Zika using symptoms such as headache, myalgia, days of symptoms, IgM, and platelet count. The root node shows that a mild or absent headache is linked to chikungunya, whereas a severe headache strongly indicates dengue. As the tree progresses, additional symptoms, such as myalgia and symptom duration, further refine the classification, with terminal nodes offering pure and definitive predictions for each disease, underscoring the clinical utility of these symptoms.
In contrast to the tree diagram, Figure 11 shows the importance of the variables generated by the Random Forest (RF) model, where, as in the DT model, headache is the most important variable for classifying diseases, followed by myalgia and arthralgia. These results are in line with the guidelines given by PAHO, which consider these variables as differential in the three diseases.
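The two kinds of output discussed here, readable tree rules and a variable-importance ranking, can be obtained in scikit-learn roughly as sketched below. The data are synthetic and the column names are hypothetical labels attached to random features, so the printed rules and ranking carry no clinical meaning; the depth limit of 3 is also an assumption made to keep the rules short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical column names attached to synthetic features; NOT the clinical data.
feature_names = ["headache", "myalgia", "arthralgia", "symptom_days", "platelets", "igm"]
X, y = make_classification(
    n_samples=267, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# A shallow tree yields a short set of if/else rules that can be read directly.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)

# A forest has no single rule set, but it exposes impurity-based importances.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The contrast illustrates the trade-off raised in the text: the tree gives explicit thresholds a clinician can follow, whereas the forest only indicates which variables matter most.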
To validate the predictive performance of the models for each label of the target variable, Table 5 summarises the quality metrics obtained in the experiment.
From Table 5, it can be inferred that the models have a high accuracy for all diseases. Specifically, the RF model with the methodology proposed in [22] achieved the highest accuracy for chikungunya (99.7%), dengue (99.1%), and Zika (98.8%). These results indicate balanced performance across the four quality metrics evaluated, suggesting that the model can correctly predict all labels of the target variable.
Figure 12 also presents the results obtained for the 10 models tested using the DT algorithm and the cross-validation technique. In this figure, the performance measured in terms of accuracy and error is observed, showing that chikungunya is the disease that can be best predicted in all models. However, the accuracy for dengue and Zika in models that do not use the methodology proposed in [22] presents greater difficulties and errors when recognising these classes. It is highlighted that, by using this methodology, an improvement in the prediction of dengue and Zika is achieved in all the models generated.
On the other hand, when reviewing the behaviour with the RF and cross-validation techniques in Figure 13, a similar trend to that analysed above is evident, with the difference that its performance is superior in all quality metrics.
The results obtained in this study allow for a highly accurate classification of dengue, Zika, and chikungunya diseases, highlighting the relevance of certain variables in prediction. Consequently, a new experiment was carried out, in which only variables related to signs and symptoms were selected, excluding laboratory results that were not available in the early stages of the disease, as well as variables that were not significant in previous analyses. This new model proposal seeks to align with the medical reality, providing an approach that, based on data obtainable by the physician in the early stages of the disease, effectively supports decision-making in the classification of these pathologies. Table 6 presents the variables that were selected to create the new model.
The training was carried out under the same conditions as the previous models using stratified cross-validation and the methodology proposed in [22]. The results are presented in Table 7.
The results presented in Table 7 highlight the excellent performance of the RF technique, which achieved a balance of over 99% for all quality metrics. Similarly, the decision tree also showed a solid performance, with an average of 96% across all metrics. These results suggest that the developed models are highly effective and can be adapted for early disease detection, providing valuable support to the medical community for accurate triage of dengue, Zika, and chikungunya. This is especially useful in remote communities where the lack of experienced medical epidemiologists or specialists can make early disease triage difficult.
Table 8 shows the quality metrics of both models and highlights their ability to recognise chikungunya, dengue, and Zika diseases. Although the RF model has superior metrics, suggesting that it might be the preferred option in terms of pure performance, the Decision Tree offers very robust performance and clearer interpretability in the medical domain. This better interpretability makes it a potentially more useful tool in contexts where a detailed understanding of the model’s decisions is critical to support clinical decision-making.
Figure 14 illustrates the decision tree, highlighting that the most significant variable for classifying dengue was headache, followed by abdominal pain, retrocular pain, and arthralgia. These findings are in line with the PAHO guidelines, which identify these symptoms as differential signs, supported by scientific evidence. In addition, patient age emerged as a significant factor in the classification of dengue. For chikungunya, myalgia was observed as a key variable, which is in line with the PAHO indications. However, symptom duration, retroocular pain, and patient age were also identified as important factors in the classification of chikungunya. Finally, in the case of Zika, significant variables for classification include myalgia, abdominal pain, age, and arthralgia. It should be noted that, although relevant in this context, they are not mentioned as distinctive signs or symptoms of Zika in the PAHO 2022 guidelines.
The scarcity of specific research on the classification of dengue, Zika, and chikungunya limits direct comparisons of results. However, a recent study [42] addressed this challenge by developing a proposal to classify seven similar diseases, including 137 records of Zika, 127 of dengue, and 140 of chikungunya, in addition to other diseases such as malaria and yellow fever, totalling 1500 records. That proposal compares various algorithms and presents a hybrid technique called HML that combines machine learning techniques with reinforcement learning based on recurrent neural networks (RNNs). The results obtained showed high performance, with an accuracy of 98.7%, precision of 98.7%, recall of 98.4%, and an F1-score of 99.10%.
Despite these promising results, this research does not include confusion matrices that allow the evaluation of the reliability of the classification for each disease individually. When comparing these results with those of our research, it is observed that our models outperform the proposed quality metrics, particularly in terms of accuracy, precision, recall, and F1-score. Furthermore, our research provides a detailed analysis of the confusion matrix level for each class, allowing for a more accurate assessment of the classification capacity of each disease. This highlights not only the effectiveness of our models in differentiating between dengue, Zika, and chikungunya but also the advantage of having detailed metrics to assess and improve classification quality.
The results of this research support the feasibility of a model for early and differential prediction of dengue, Zika, and chikungunya based on signs and symptoms. This model showed high performance with an accuracy of 99.3%, precision of 99.8%, specificity of 99.9%, and F1-Score of 99.9%. Furthermore, its ability to accurately recognise each disease is remarkable, reaching 99.9% for chikungunya, 99.3% for dengue, and 99.3% for Zika.
The use of cross-validation in this study played a crucial role in providing a more accurate estimate of model performance. By employing multiple partitions of the dataset for training and validation, this technique reduces the risk of overfitting and improves the ability of the model to generalise to unseen data. In addition, using cross-validation, more stable and reliable metrics of model performance were obtained, allowing for a more accurate assessment of the model’s ability to predict these diseases.
Bootstrapping was used to balance the classes in model construction. This technique allowed us to work with the unbalanced dataset by generating multiple samples of equal size to the original dataset, randomly selecting observations with replacement. By applying this technique, we were able to obtain an adequate representation of the training samples, which helped improve the model’s ability to learn the characteristics of each disease in a balanced way.
Finally, this study represents a significant advance in the differential prediction of dengue, Zika, and chikungunya using machine learning techniques and the analysis of signs, symptoms, and laboratory variables. The developed model offers robust diagnostic support based on the criteria established in PAHO evidence synthesis (2022), which clearly distinguishes the signs and symptoms of each disease for diagnosis and treatment. With high performance, this model not only demonstrates remarkable accuracy but also has great potential for implementation in clinical settings. Its integration into clinical practice would provide fundamental support to health professionals, facilitate early and accurate diagnoses, and favour timely decision-making that improves patient outcomes.
Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging owing to their similar initial symptoms. This tool could empower healthcare providers to make more informed and rapid decisions regarding patient management, ultimately leading to better care and outcomes.
Although this study presents some limitations regarding the amount of data, especially for chikungunya, which was addressed by specialised computational techniques, it is recognised that the reliability of the model could be improved with a larger volume of data. Despite these limitations, this study establishes a benchmark for future research, since, according to [20,21], no comparable studies have been identified in the literature, mainly because of the scarcity of datasets that include records of these viruses.

5. Conclusions

In this study, a model for early and differential prediction of dengue, Zika, and chikungunya was developed and evaluated using machine learning techniques. The results showed that the model had high performance, with an accuracy of 99.8%, precision of 99.8%, specificity of 99.9%, and F1-Score of 99.9%. This indicates that the model is highly effective in recognising each disease accurately, achieving 99.9% accuracy for chikungunya, 99.3% for dengue, and 99.3% for Zika.
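As an illustration of how per-class figures of this kind can be derived, the sketch below computes one-vs-rest accuracy, precision, recall, specificity, and F1-Score from a confusion matrix. The one-vs-rest definitions are an assumption (the text does not spell out the formulas), and the `per_class_metrics` helper and the toy confusion matrix are hypothetical, not the study's results.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest metrics from a square confusion matrix (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    results = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted class i, truly otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[label] = {
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
        }
    return results


# Toy, made-up confusion matrix (not taken from the paper).
labels = ["chikungunya", "dengue", "zika"]
toy_confusion = [
    [50, 0, 0],
    [1, 48, 1],
    [2, 3, 45],
]
metrics = per_class_metrics(toy_confusion, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in metrics[label].items()})
```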
The use of cross-validation in this work was instrumental in providing a more accurate estimate of model performance. This technique helps reduce the risk of overfitting by improving the ability of the model to generalise to unseen data. In addition, by using cross-validation, more stable and reliable metrics of the model performance were obtained, allowing for a more accurate assessment of its ability to predict these diseases.
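The splitting logic behind cross-validation can be sketched as below. The number of folds and whether the splits were stratified are not stated in this section, so k = 5 and stratification by class are assumptions, and the `stratified_k_fold` helper and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions kept roughly equal per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


# Toy, made-up labels (not the study's data).
toy_labels = ["dengue"] * 50 + ["zika"] * 30 + ["chikungunya"] * 20
splits = list(stratified_k_fold(toy_labels, k=5))
for train_idx, test_idx in splits:
    # A model would be fitted on train_idx and scored on test_idx here;
    # the per-fold scores are then averaged.
    print(len(train_idx), len(test_idx))
```

Each observation is used for testing exactly once, which is what makes the averaged metrics more stable than a single train/test split.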
The bootstrapping technique also played an important role in balancing the classes during model construction. It made it possible to work with the unbalanced dataset by generating multiple samples of the same size as the original dataset, randomly selecting observations with replacement. Applying this technique gave an adequate representation of the training samples, which helped improve the model's ability to learn the characteristics of each disease in a balanced way.
In addition, this study represents an important advancement in the differential prediction of dengue, Zika, and chikungunya using machine learning techniques and information from signs, symptoms, and laboratory variables. The developed model could be of great use to the medical community in places where these diseases co-circulate, helping healthcare professionals make more informed and faster decisions in the management of patients with these diseases, which could result in better care and outcomes for patients.
Future studies could include external validation of the model using data from different geographical locations to assess its generalisability and applicability in different epidemiological contexts. In addition, a longitudinal study could be conducted to assess the effectiveness and long-term sustainability of the model in disease prediction and management in affected populations.

Author Contributions

Conceptualization, W.A.-H. and E.D.-L.-H.-F.; methodology, W.A.-H., J.G.G. and E.D.-L.-H.-F.; validation, W.A.-H. and J.G.G.; formal analysis, W.A.-H. and J.G.G.; investigation, J.G.G.; data curation, W.A.-H. and J.G.G.; writing—original draft preparation J.G.G. and E.D.-L.-H.-F.; writing—review and editing, W.A.-H. and E.D.-L.-H.-F.; visualisation, J.G.G., W.A.-H. and E.D.-L.-H.-F.; supervision, J.G.G. and W.A.-H.; project administration, J.G.G. and W.A.-H.; resources, J.G.G. All authors have read and agreed to the published version of the manuscript.

Funding

We thank the University of Córdoba for financing this research project according to the internal call with project code FI-05-19.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

We thank the SOCRATES research group of the Systems Engineering and Telecommunications program for supporting the development of this project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lambrechts, L.; Scott, T.W.; Gubler, D.J. Consequences of the Expanding Global Distribution of Aedes Albopictus for Dengue Virus Transmission. PLoS Neglected Trop. Dis. 2010, 4, e646.
  2. Chaw, J.K.; Chaw, S.H.; Quah, C.H.; Sahrani, S.; Ang, M.C.; Zhao, Y.; Ting, T.T. A Predictive Analytics Model Using Machine Learning Algorithms to Estimate the Risk of Shock Development among Dengue Patients. Healthc. Anal. 2024, 5, 100290.
  3. Arrubla, W.D.J.A. Conceptualización del diagnóstico del Dengue desde una perspectiva de la ingeniería y las nuevas tecnologías. Comput. Electron. Sci. Theory Appl. 2022, 3, 1–8.
  4. Codina, J.-R.; Mascini, M.; Dikici, E.; Deo, S.K.; Daunert, S. Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning. Int. J. Mol. Sci. 2023, 24, 12144.
  5. Gangula, R.; Thirupathi, L.; Parupati, R.; Sreeveda, K.; Gattoju, S. Ensemble Machine Learning Based Prediction of Dengue Disease with Performance and Accuracy Elevation Patterns. Mater. Today Proc. 2023, 80, 3458–3463.
  6. Brady, O.J.; Hay, S.I. The Global Expansion of Dengue: How Aedes Aegypti Mosquitoes Enabled the First Pandemic Arbovirus. Annu. Rev. Entomol. 2020, 65, 191–208.
  7. Sukhralia, S.; Verma, M.; Gopirajan, S.; Dhanaraj, P.S.; Lal, R.; Mehla, N.; Kant, C.R. From Dengue to Zika: The Wide Spread of Mosquito-Borne Arboviruses. Eur. J. Clin. Microbiol. Infect. Dis. 2019, 38, 3–14.
  8. Chala, B.; Hamde, F. Emerging and Re-Emerging Vector-Borne Infectious Diseases and the Challenges for Control: A Review. Front. Public Health 2021, 9, 719759.
  9. PAHO Síntesis de evidencia: Directrices para el diagnóstico y el tratamiento del dengue, el chikunguña y el zika en la Región de las Américas. Rev. Panam. Salud Pública 2022, 46, 1.
  10. Paniz-Mondolfi, A.E.; Rodriguez-Morales, A.J.; Blohm, G.; Marquez, M.; Villamil-Gomez, W.E. ChikDenMaZika Syndrome: The Challenge of Diagnosing Arboviral Infections in the Midst of Concurrent Epidemics. Ann. Clin. Microbiol. Antimicrob. 2016, 15, 42.
  11. da Silva Neto, S.R.; Tabosa de Oliveira, T.; Teixiera, I.V.; Medeiros Neto, L.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Arboviral Disease Record Data - Dengue and Chikungunya, Brazil, 2013–2020. Sci. Data 2022, 9, 198.
  12. Villamil-Gómez, W.E.; Rodríguez-Morales, A.J.; Uribe-García, A.M.; González-Arismendy, E.; Castellanos, J.E.; Calvo, E.P.; Álvarez-Mon, M.; Musso, D. Zika, Dengue, and Chikungunya Co-Infection in a Pregnant Woman from Colombia. Int. J. Infect. Dis. 2016, 51, 135–138.
  13. Caicedo, D.M.; Méndez, A.C.; Tovar, J.R.; Osorio, L. Desarrollo de algoritmos clínicos para el diagnóstico del dengue en Colombia. Biomédica 2019, 39, 170–185.
  14. Dharap, P.; Raimbault, S. Performance Evaluation of Machine Learning-Based Infectious Screening Flags on the HORIBA Medical Yumizen H550 Haematology Analyzer for Vivax Malaria and Dengue Fever. Malar. J. 2020, 19, 429.
  15. Tchapet Njafa, J.-P.; Nana Engo, S.G. Quantum Associative Memory with Linear and Non-Linear Algorithms for the Diagnosis of Some Tropical Diseases. Neural Netw. 2018, 97, 1–10.
  16. Rodriguez-Quijada, C.; Gomez-Marquez, J.; Hamad-Schifferli, K. Repurposing Old Antibodies for New Diseases by Exploiting Cross-Reactivity and Multicolored Nanoparticles. ACS Nano 2020, 14, 6626–6635.
  17. Tan, K.W.; Tan, B.; Thein, T.L.; Leo, Y.-S.; Lye, D.C.; Dickens, B.L.; Wong, J.G.X.; Cook, A.R. Dynamic Dengue Haemorrhagic Fever Calculators as Clinical Decision Support Tools in Adult Dengue. Trans. R. Soc. Trop. Med. Hyg. 2020, 114, 7–15.
  18. Veiga, R.V.; Schuler-Faccini, L.; França, G.V.; Andrade, R.F.; Teixeira, M.G.; Costa, L.C.; Paixão, E.S.; Costa, M.d.C.N.; Barreto, M.L.; Oliveira, J.F.; et al. Classification Algorithm for Congenital Zika Syndrome: Characterizations, Diagnosis and Validation. Sci. Rep. 2021, 11, 6770.
  19. Medeiros Neto, L.; Rogerio da Silva Neto, S.; Endo, P.T. A Comparative Analysis of Converters of Tabular Data into Image for the Classification of Arboviruses Using Convolutional Neural Networks. PLoS ONE 2023, 18, e0295598.
  20. da Silva Neto, S.R.; Tabosa Oliveira, T.; Teixeira, I.V.; Aguiar de Oliveira, S.B.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Machine Learning and Deep Learning Techniques to Support Clinical Diagnosis of Arboviral Diseases: A Systematic Review. PLoS Neglected Trop. Dis. 2022, 16, e0010061.
  21. Choubey, S.; Barde, S.; Badholia, A. Analysis of Deep Learning Techniques to Investigate and Support Diagnosis of Virus Borne Diseases. In Proceedings of the 3rd International Conference on Electronics and Sustainable Communication Systems, ICESC 2022 - Proceedings, Coimbatore, India, 17–19 August 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 921–928.
  22. Arrubla-Hoyos, W.; Gómez, J.G.; De-La-Hoz-Franco, E. Methodology for the Differential Classification of Dengue and Chikungunya According to the PAHO 2022 Diagnostic Guide. Viruses 2024, 16, 1088.
  23. Noorbakhsh-Sabet, N.; Zand, R.; Zhang, Y.; Abedi, V. Artificial Intelligence Transforms the Future of Health Care. Am. J. Med. 2019, 132, 795–801.
  24. Wiljer, D.; Hakim, Z. Developing an Artificial Intelligence–Enabled Health Care Practice: Rewiring Health Care Professions for Better Care. J. Med. Imaging Radiat. Sci. 2019, 50, S8–S14.
  25. Bharambe, A.; Chandorkar, A.A.; Kalbande, D. A Deep Learning Approach for Dengue Tweet Classification. In Proceedings of the 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2–4 September 2021; IEEE: Coimbatore, India, 2021; pp. 1043–1047.
  26. Khotimah, P.H.; Fachrur Rozie, A.; Nugraheni, E.; Arisal, A.; Suwarningsih, W.; Purwarianti, A. Deep Learning for Dengue Fever Event Detection Using Online News. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Tangerang, Indonesia, 18–20 November 2020; IEEE: Tangerang, Indonesia, 2020; pp. 261–266.
  27. Gambhir, S.; Malik, S.K.; Kumar, Y. The diagnosis of dengue disease: An evaluation of three machine learning approaches. Int. J. Healthc. Inf. Syst. Inform. (IJHISI) 2018, 13, 1–19.
  28. Acosta Torres, J.; Oller Meneses, L.; Sokol, N.; Balado Sardiñas, R.; Montero Díaz, D.; Balado Sansón, R.; Sardiñas Arce, M.E. Técnica Árboles de Decisión Aplicada al Método Clínico En El Diagnóstico Del Dengue. Rev. Cuba. Pediatr. 2016, 88, 441–453.
  29. Arrubla-Hoyos, W.; Seveiche-Maury, Z.; Saeed, K.; Gómez, J.E.G.; De-La-Hoz-Franco, E. Comparison of Classical Machine Learning and Ensemble Techniques in the Context of Dengue Severity Prediction. In Proceedings of the 2023 IEEE Colombian Caribbean Conference (C3), Barranquilla, Colombia, 22–25 November 2023; pp. 1–5.
  30. PAHO/WHO Epidemiological Update—Dengue, Chikungunya and Zika—10 June 2023—PAHO/WHO|Pan American Health Organization. Available online: https://www.paho.org/en/documents/epidemiological-update-dengue-chikungunya-and-zika-10-june-2023 (accessed on 13 March 2024).
  31. Zoubir, A.M.; Boashash, B. The Bootstrap and Its Application in Signal Processing. IEEE Signal Process. Mag. 1998, 15, 56–76.
  32. Zoubir, A.M.; Iskander, D.R. Bootstrap Techniques for Signal Processing; Cambridge University Press: Cambridge, UK, 2004.
  33. Smith, P.J.; Hoaglin, D.C.; Battaglia, M.P.; Barker, L. Implementation and Applications of Bootstrap Methods for the National Immunization Survey. Stat. Med. 2003, 22, 2487–2502.
  34. Wu, C.; Rao, J.N.K. Bootstrap procedures for the pseudo empirical likelihood method in sample surveys. Stat. Probab. Lett. 2010, 80, 1472–1478.
  35. Kunz, P.J.; ben Abid, S.; Zoubir, A.M. The Heterogeneity-Intensified and Heterogeneity Ratio-Stratified Bootstrap (HiS- and HeRS-Boot) Oversampling to Boost a Detector Performance. In Proceedings of the 2023 IEEE SENSORS, Vienna, Austria, 29 October–1 November 2023; pp. 1–4.
  36. Acosta-Reyes, J.; Navarro-Lechuga, E.; Martínez-Garcés, J.C. Enfermedad por el virus del Chikungunya: Historia y epidemiología. Rev. Salud Uninorte 2015, 31, 621–630.
  37. Pardo-Turriago, R. Zika. Una pandemia en progreso y un reto epidemiológico. Colomb. J. Anestesiol. 2016, 44, 86–88.
  38. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  39. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  40. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
  41. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
  42. Shaikh, S.G.; Kumar, B.S.; Narang, G.; Pachpor, N.N. Hybrid machine learning method for classification and recommendation of vector-borne disease. J. Auton. Intell. 2024, 7, 1–14.
Figure 1. Flowchart of the methodological proposal for developing a predictive model for dengue, Zika, and chikungunya.
Figure 2. (a) Representation of the data from the original dataset; (b) balanced data using the bootstrapping technique.
Figure 3. Comparison of DT and RF quality metrics using the dataset transformed by the methodology proposed in [22].
Figure 4. Average confusion matrix of the DT model using the dataset transformed by the methodology proposed in [22].
Figure 5. Average confusion matrix of the RF model using the dataset transformed by the methodology proposed in [22].
Figure 6. Comparison of DT and RF quality metrics without applying the methodology proposed by Arrubla et al. [22].
Figure 7. Comparison of the precision and error of DT models generated by cross-validation.
Figure 8. Confusion matrix of the average DT Model without applying the methodology proposed by Arrubla et al. [22].
Figure 9. Confusion matrix of the average RF Model without applying the methodology proposed by Arrubla et al. [22].
Figure 10. Tree diagram of the DT model using the dataset transformed by the methodology proposed in [22].
Figure 11. Importance of RF model variables.
Figure 12. (a) Comparison of the accuracy and error of each class of DT models generated by cross-validation applying the methodology proposed by [22]. (b) Comparison of the accuracy and error of each class of DT models generated by cross-validation without the methodology proposed by [22].
Figure 13. (a) Comparison of the accuracy and error of each class of RF models generated by cross-validation applying the methodology proposed by [22]. (b) Comparison of the accuracy and error of each class of RF models generated by cross-validation without the methodology proposed by [22].
Figure 14. Tree diagram of the DT model.
Table 1. Description of the dataset.
Variable | Description
Age | Represents the age of patients
Sex | Represents the sex of the patient
Fever | Represents the Fever of the patient (yes or no).
Symptom_days | Represents the number of days from the date of symptom onset to the day of consultation.
Hospitalized | Indicates whether the patient was hospitalised (yes or no)
headache | Indicates headache symptom (yes or no)
Retroocular_pain | Indicates symptom of retro ocular pain (yes or no)
Myalgia | Indicates symptom myalgia (yes or no)
Arthralgia | Indicates symptom Arthralgia (yes or no)
Rash | Indicates symptom Rash (yes or no)
Abdominal_pain | Indicates whether the patient has abdominal pain (yes or no).
Threw_up | Indicates whether the patient has vomited (yes or no).
Diarrhea | Indicates whether the patient has symptoms of diarrhoea (yes or no).
Drowsiness | Indicates whether the patient has symptoms of Drowsiness (yes or no).
Hepatomegaly | Indicates whether the patient has Hepatomegaly sign (yes or no).
Mucosal_hemorrhage | Indicates whether the patient has the sign of mucosal bleeding (yes or no).
platelet_drop | Indicates if the patient has the sign of falling platelets (yes or no).
Fluid_accumulation | Indicates if the patient has the sign of fluid accumulation (yes or no).
hypothermia | Indicates if the patient has the sign of hypothermia (yes or no).
Increased_haematocrit | Indicates if the patient has the sign of increased haematocrit (yes or no).
Hyperemia | Indicates if the patient has the signs of Hyperemia (yes or no).
exanthema | Indicates if the patient has signs of rash (yes or no).
IgM | Indicates Immunoglobulin M (IgM) value from laboratory test
Platelet_count | Indicates platelet count value from laboratory test
Erythrocytes | Indicates the value of the Erythrocytes laboratory test.
Leukocytes | Indicates the value of the Leukocytes laboratory test.
Hematocritos | Indicates the value of the haematocrit laboratory test.
Target | Indicates illness, dengue, Zika or chikungunya
Table 2. Transformation of categorical data applying the PAHO Guidelines (2022) proposed by Arrubla et al. [22].
Variable | Certainty in Evidence: Manifestations in Dengue | Certainty in Evidence: Manifestations in Chikungunya | Certainty in Evidence: Manifestations in Zika | Quantitative Weight Assignment
Myalgia | * | Moderate |  | 0.51–0.75
Headache | Low | * | * | 0.26–0.50
Rash | * | Moderate | Moderate | 0.51–0.75
Threw up | Moderate | * | * | 0.51–0.75
Abdominal_pain | Moderate | * | * | 0.51–0.75
Mucosal_hemorrhage | Moderate | Low | * | 0.51–0.75 (Moderate); 0.26–0.50 (Low)
Arthralgia | * | High | * | 0.76–1
Diarrhea | Low | * | * | 0.26–0.50
Hepatomegaly | Low | * | * | 0.26–0.50
Retroocular pain | Low | * | * | 0.26–0.50
platelet_drop | High | * | * | 0.76–1
Note: * A score of 0.0–0.25 was given when the label “yes” was present, signifying an extremely low level of confidence in the evidence, as all outcomes were deemed uncertain according to the GRADE system.
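The mapping in Table 2 from certainty level to weight interval can be sketched as a simple lookup. This is only an illustration under stated assumptions: the `weight_range` helper is hypothetical, only three rows of the table are encoded, the function returns the interval rather than a single weight (how a value is chosen inside the interval is not described here), and assigning zero to an absent symptom is an assumption.

```python
# Weight intervals per certainty level, as listed in Table 2 and its note.
WEIGHT_RANGES = {
    "High": (0.76, 1.0),
    "Moderate": (0.51, 0.75),
    "Low": (0.26, 0.50),
    "*": (0.0, 0.25),  # very low certainty, per the table note
}

# Certainty levels for a few Table 2 rows (illustrative subset only).
CERTAINTY = {
    "Arthralgia": {"dengue": "*", "chikungunya": "High", "zika": "*"},
    "platelet_drop": {"dengue": "High", "chikungunya": "*", "zika": "*"},
    "Headache": {"dengue": "Low", "chikungunya": "*", "zika": "*"},
}


def weight_range(symptom, disease, present):
    """Return the (low, high) weight interval for a symptom under a given disease."""
    if not present:
        # Assumption: a symptom labelled "no" contributes no weight.
        return (0.0, 0.0)
    level = CERTAINTY[symptom][disease]
    return WEIGHT_RANGES[level]


print(weight_range("Arthralgia", "chikungunya", True))
print(weight_range("Arthralgia", "dengue", True))
print(weight_range("platelet_drop", "dengue", False))
```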
Table 3. Quality metrics of the models applying the methodology based on the PAHO Guidelines (2022).
ML Technique | Accuracy | Precision | Specificity | Recall | F1-Score
DT | 96.3% | 95% | 97.4% | 99% | 97.2%
RF | 98.8% | 99.6% | 99.8% | 99.4% | 99.5%
Table 4. Quality metrics of the models without applying methodology based on PAHO Guidelines (2022).
ML Technique | Accuracy | Precision | Specificity | Recall | F1-Score
DT | 88.8% | 90.3% | 99.1% | 94.6% | 94.5%
RF | 94.6% | 95% | 99.1% | 97.4% | 97%
Table 5. Comparison of model quality metrics.
Model | Class | Accuracy | Precision | Recall | F1-Score
Decision Tree with Methodology | Chikungunya | 98.0% | 95.0% | 99.4% | 97.2%
Decision Tree with Methodology | Dengue | 98.0% | 96.7% | 97.4% | 97.0%
Decision Tree with Methodology | Zika | 96.5% | 97.4% | 92.0% | 94.6%
Decision Tree without Methodology | Chikungunya | 95.8% | 90.3% | 99.1% | 94.5%
Decision Tree without Methodology | Dengue | 90.7% | 86.2% | 86.4% | 86.3%
Decision Tree without Methodology | Zika | 90.4% | 89.9% | 80.9% | 85.2%
Random Forest with Methodology | Chikungunya | 99.7% | 99.6% | 99.4% | 99.5%
Random Forest with Methodology | Dengue | 99.1% | 98.5% | 98.9% | 98.7%
Random Forest with Methodology | Zika | 98.8% | 98.3% | 98.1% | 98.2%
Random Forest without Methodology | Chikungunya | 97.9% | 95.0% | 99.1% | 97.0%
Random Forest without Methodology | Dengue | 95.7% | 94.0% | 93.3% | 93.6%
Random Forest without Methodology | Zika | 95.4% | 94.8% | 91.4% | 93.1%
Table 6. Description of the new dataset eliminating the clinical laboratory results variables.
Variable | Description
Age | Represents the age of patients
Sex | Represents the sex of the patient
Symptom_days | Represents the number of days from the date of symptom onset to the day of consultation.
headache | Indicates headache symptom (yes or no)
Retroocular_pain | Indicates symptom of retro ocular pain (yes or no)
Myalgia | Indicates symptom myalgia (yes or no)
Arthralgia | Indicates symptom Arthralgia (yes or no)
Rash | Indicates symptom Rash (yes or no)
Abdominal_pain | Indicates whether the patient has abdominal pain (yes or no).
Threw_up | Indicates whether the patient has vomited (yes or no).
Diarrhea | Indicates whether the patient has symptoms of diarrhoea (yes or no).
Drowsiness | Indicates whether the patient has symptoms of Drowsiness (yes or no).
Hepatomegaly | Indicates whether the patient has Hepatomegaly sign (yes or no).
Mucosal_hemorrhage | Indicates whether the patient has the sign of mucosal bleeding (yes or no).
Hyperemia | Indicates if the patient has the signs of Hyperemia (yes or no).
exanthema | Indicates if the patient has signs of rash (yes or no).
Target | Indicates illness, dengue, Zika or chikungunya
Table 7. Quality metrics of the models applying methodology based on the PAHO Guidelines (2022).
ML Technique | Accuracy | Precision | Specificity | Recall | F1-Score
DT | 96% | 97% | 96% | 96% | 96%
RF | 99.3% | 99.8% | 99.9% | 99.9% | 99.9%
Table 8. Model quality metrics.
Model | Class | Accuracy | Precision | Recall | F1-Score
Decision Tree with Methodology | Chikungunya | 96.0% | 90.0% | 100.0% | 95.0%
Decision Tree with Methodology | Dengue | 96.0% | 100.0% | 100.0% | 100.0%
Decision Tree with Methodology | Zika | 96.0% | 100.0% | 88.0% | 93.0%
Random Forest with Methodology | Chikungunya | 99.9% | 99.8% | 100.0% | 99.9%
Random Forest with Methodology | Dengue | 99.3% | 98.6% | 99.4% | 99.0%
Random Forest with Methodology | Zika | 99.3% | 99.4% | 98.4% | 98.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
