1. Introduction
Process automation has been one of the main goals of research efforts since the third Industrial Revolution [1]. The advance of technological resources has impacted various sectors, including the visual [2], musical [3], chemical [4], and biological [5] sciences, besides the automation of mechanical processes [6]. One of the most significant research areas nowadays is computational science, where combining mathematics, logic, and electronics has led to the development of more powerful computational methods and techniques [7]. In recent decades, the popularity of pattern recognition techniques has significantly increased because of their wide range of applications regarding classification, data recovery, regression, and clustering problems [8,9].
Particularly, pattern recognition techniques have become incredibly reliable in classification applications due to their simplicity and well-established execution process, which helps solve problems where the correct data classification is crucial. Such an execution process of classifiers is summarized in the following steps [10]:
The first step is to create a desired information dataset. To accomplish this, it is necessary to consider the different outputs (classes) into which the information will be divided. Also, it is essential to determine the inputs and characteristics (attributes) from which the classes are to be evaluated [11].
Once the information is gathered, it must be analyzed and filtered. This step is essential for assessing the adequacy of the dataset in terms of the number of samples, the representativeness of the characteristics, and the accuracy of the element values, among other factors [12].
After determining the reliability of the dataset for analysis, a specific pattern recognition approach can be trained or directly executed, depending on its behavior, to obtain the optimal parameters (hyperparameters). These hyperparameters will categorize future data based on the chosen attributes [13,14].
In the last stage, a validation step is necessary, wherein the effectiveness of the classifier is assessed using new data (the test set) by evaluating performance metrics such as accuracy, recall, or precision [15,16].
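The four-step execution process above can be sketched in Python with scikit-learn; the synthetic dataset and the choice of a kNN classifier are illustrative assumptions, not the pipeline of any of the cited works.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Step 1: create a dataset (a synthetic stand-in with 3 classes, 6 attributes).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Step 2: analyze and filter -- here, a minimal sanity check of the samples.
assert not np.isnan(X).any() and len(np.unique(y)) == 3

# Step 3: train the classifier on a training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Step 4: validate on the held-out test set with standard metrics.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average="macro")
rec = recall_score(y_test, y_pred, average="macro")
print(acc, prec, rec)
```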
The classification techniques based on pattern recognition provide the advantage of being easily applicable to data from any field without being limited to engineering approaches. This makes them valuable in contexts where the information is complex to depict or indecipherable to other disciplines [17]. An example of these applications is presented in [18], where an Artificial Neural Network (ANN) is utilized to classify catchments and their drivers based on seasonal water quality using spatio-temporal data. Another example may be found in the study conducted in [19], which aimed to categorize tea based on its harvest season using FT-IR fingerprinting. This target was achieved using the Principal Component Analysis (PCA) technique. These works explore geographical and climatic factors as attributes that can enhance engineering and production processes despite the limited knowledge about such topics among the individuals involved in the research.
Likewise, several works belong to highly specialized domains such as health, medicine, and biology, whose intricate processes require expert analysis for accurate description and interpretation. However, pattern recognition classifiers continue to yield favorable results when applied to these circumstances. An example of this may be found in [20], where the researchers employed the k-Nearest Neighbors (kNN) classifier to determine the best control input by analyzing surface electromyogram (sEMG) signals. Furthermore, particular works employ intricate and resilient adaptations of an initial pattern recognition technique, as demonstrated in [21]. That work utilizes a Convolutional Neural Network (CNN), an ANN variant, to categorize hypospadias. A comparative analysis between the CNN and fifty-three specialists yielded comparable levels of accuracy at 90%.
Moreover, in certain cases, classifiers utilizing pattern recognition approaches have achieved such a high level of trustworthiness that they are employed to enrich medical procedures. In [22], a Naïve Bayes (NB) classifier was employed to enhance cancer patients’ management and medical care after radiotherapy treatment. The classifier categorized those patients based on their probability of relapse or progression.
However, although there are many different pattern recognition techniques, none will likely obtain perfect results on all existing datasets [23]. In light of this, it is advisable and essential to compare several pattern recognition techniques while implementing a classifier. This allows for selecting the most effective strategy, leading to notable enhancements. It is important to note that the performance of a classification approach can be significantly altered by the selection of different hyperparameters, even when using just one technique. Studies related to hyperparameter selection can be seen in [24,25,26,27]. Therefore, when aiming for classification, it is recommended to evaluate several classifiers in their basic form [28] and fine-tune their hyperparameters [29]. Considering this, the study described in [30] not only includes the implementation of classifiers such as ANN and Random Forest but also conducts comparisons to determine which classifiers are more effective for a given problem. Such work involves analyzing driver injury trends from a multi-class perspective. However, the study does not contain additional classification techniques for comparison. In [31], a study is conducted to compare several pattern recognition techniques, including kNN, ANN (with radial basis function), and Support Vector Machine (SVM). Those methodologies were utilized to categorize diseases of the vertebral column. Although the study includes many tests and validation criteria, it does not modify the hyperparameters of the classifiers to evaluate their performance, nor does it experiment with other classifiers.
In addition, a preliminary classification comparison provides a more reliable selection of classifiers for practical or real-world applications and a baseline for comparison in future research. Recently, the study conducted in [32] showed the superiority of the kNN classifier in automatic text classification compared to other machine learning techniques. As a result, the kNN classifier was chosen for further investigation. This kind of practice has become a common trend in medical applications. Examples are presented in research like [33,34,35], where many classifiers were examined for the categorization of different critical diseases, and the winners in each study were distinct ANN architectures. In those cases, selecting the most effective classifier is crucial because the application is related to difficulties associated with medical treatments and patients’ quality of life. Likewise, in [36], a comparative analysis of many studies that employ pattern recognition approaches to classify diseases is carried out. The study showed that the Support Vector Machine (SVM) was the most widely used classifier among the studies. However, the Random Forest (RF) classifier achieved the highest level of accuracy. Similarly, in [37], different classifiers were empirically tuned, resulting in the kNN classifier as the most suitable option for categorizing vertebral column disease.
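The recommended practice of first evaluating several classifiers in their basic form can be expressed as a minimal comparison loop; the synthetic data and the particular set of four baselines below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in dataset with three classes.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=1)

# Evaluate each baseline classifier with default hyperparameters (5-fold CV).
baselines = {
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=1),
    "NB": GaussianNB(),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in baselines.items()}

# Keep the most effective baseline for further tuning.
best = max(scores, key=scores.get)
print(best, scores[best])
```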
Nonetheless, it is essential to keep in mind that even if these pattern recognition techniques are based on a statistical approach, their solutions rely on different process properties (geometrical, probabilistic, logical, etc.). This means that each classifier takes advantage of different properties of the dataset, like the distribution of its elements or the correlations among the attributes. In recent years, an outstanding practice in the biology field for classification problems has been to use ensemble strategies to enhance the classification results beyond the limits of a single baseline classifier. An example of this is presented in [38], where an ensemble of three different ANN-based classifiers enhances the classification of chromosomes, phages, and plasmids, which was a complicated endeavor for the baseline classifiers. Similarly, in [39], an RF ensemble (an ensemble of decision trees) enhances COVID-19 case detection, surpassing the accuracy limit obtained by the DT and SVM baseline classifiers. However, it is important to note that in such works, the same kind of baseline classifier was used throughout each ensemble strategy, which limits the classification process to the characteristics of that baseline classifier. Nonetheless, some works implement diverse baseline classifiers to take advantage of their properties, like in [40], where the kNN, SVM, and DT classifiers were ensembled through a voting strategy to enhance the diagnosis and prediction of coronary artery disease with a reduced feature subset. However, in [40], there is no mention of other ensemble strategies or of the criteria used to limit the ensemble strategy to those classifiers. Similarly, in [41], the voting strategy is used to classify cardiovascular disease through the ensemble of stochastic gradient descent, logistic regression, and support vector machine classifiers. Nonetheless, adding other classifiers tested in the same work is not considered.
Only a few works related to ensemble strategies can be found in the literature, particularly regarding the vertebral column illness classification problem (the case of interest). In [42], different variants of SVM-based ensemble strategies are applied to different datasets, including vertebral column disease classification. Despite achieving outstanding results in other classification problems, with accuracies beyond 90%, the SVM-based ensemble strategies only reached accuracies of around 85% in the classification of vertebral column disease. This outcome shows that ensemble strategies based on only one kind of baseline classifier may not be sufficient for all kinds of problems. Another example of this single-technique-based approach is presented in [43], where variations of the original Random Forest ensemble strategy are tested in vertebral column illness classification; none of the implemented proposals could surpass an accuracy of 85%, being limited by the process of only one baseline classifier. Another similar case is presented in [44], where a voting ensemble based on ANN variants is applied to vertebral column illness classification. Even if the proposal reached an accuracy of up to 90%, the ensemble is still limited to the ANN variants and their complexity.
Considering that different classifiers are based on different processes (geometrical, probabilistic, or logical) [45,46], combining them can offer major diversity to the solution of the ensemble strategy for the classification problem, since the drawbacks of a particular baseline classifier can be compensated through the perks of another classifier. So, to the authors’ best knowledge, ensemble strategies mixing baseline classifiers whose procedures are based on different approaches have not been implemented in vertebral column disease classification. This forms the first contribution of this paper.
On the other hand, this study aims to analyze and contrast the performance of voting and stacking ensemble strategies based on auto-tuned supervised machine learning techniques in vertebral column orthopedic illnesses to provide insight into their characteristics and limitations with respect to metrics like accuracy, precision, recall, and F1-score, as well as evaluating the receiver operating characteristic curve. This study can help researchers choose appropriate pattern recognition approaches for future studies on problems with similar characteristics. Researchers and people in the area might use the presented information as guidelines to select an initial set of classifiers for vertebral column disease, determine which classifiers are the most useful in the ensemble to increase classification performance, and understand the ensemble behavior of popular ensemble strategies. The latter forms the second contribution of the paper.
The rest of the present work is distributed as follows: Section 2 describes, analyzes, and explains the preprocessing approach used in the vertebral column dataset employed in this work. Section 3 provides an operation process explanation of the implemented ensemble strategies and the baseline classifiers that will shape the ensembles. Section 4 incorporates the analysis methodology, discusses the obtained results separately, and provides a general discussion at the end. Finally, Section 5 provides a detailed account of the conclusions drawn from this study and outlines potential areas for further research.
2. Description and Analysis of the Dataset
The dataset is acquired from the Vertebral Column dataset (VCDS) available at the following link: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column [47] (accessed on 1 January 2024). The information was gathered and refined by Dr. Henrique da Mota in the Group of Applied Research in Orthopaedics (GARO) at the Centre médico-chirurgical de réadaptation des Massues in Lyon, France. There are 310 elements within the dataset, and each of them is evaluated through the following six biomechanical attributes (a brief description of each attribute is included):
- (i) Pelvic incidence (PI): The angle between a line perpendicular to the sacral plate at its center and a line connecting the same point to the center of the bicoxofemoral axis [48].
- (ii) Pelvic tilt (PT): The angle estimated between two reference lines. The first is a vertical line to the center of the femoral head. The second goes from the center of the femoral head to the midpoint of the sacral endplate [49].
- (iii) Lumbar lordosis angle (LLA): The angle measured in the sagittal plane between the two ends of the lumbar curve [50].
- (iv) Sacral slope (SS): The angle produced by a line parallel to the sacral end plate and a horizontal reference line [49].
- (v) Pelvic radius (PR): The distance from the hip axis to the posterior–superior corner of the S1 endplate [51].
- (vi) Grade of spondylolisthesis (GS): The grades are considered as follows: grade I represents 0–25%, grade II 25–50%, grade III 50–75%, and grade IV 75–100%. These percentages represent how much the cephalad vertebra has slipped anteriorly relative to the caudal vertebra. The superior endplate of the caudal vertebral body is divided into four equivalent quadrants, and the magnitude of the slip is based on the percentage of endplate uncovered as a result of the slip [52].
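As an illustration of the GS attribute, the grading rule above can be expressed as a small helper function; the function name and the decision to assign boundary values to the lower grade are assumptions made for this sketch.

```python
def spondylolisthesis_grade(slip_percent):
    """Map an anterior slip percentage to its grade (I-IV), following the
    0-25 / 25-50 / 50-75 / 75-100 ranges described in the GS attribute.
    Boundary values are assigned to the lower grade by assumption."""
    if not 0.0 <= slip_percent <= 100.0:
        raise ValueError("slip percentage must be in [0, 100]")
    if slip_percent <= 25.0:
        return "I"
    if slip_percent <= 50.0:
        return "II"
    if slip_percent <= 75.0:
        return "III"
    return "IV"

print(spondylolisthesis_grade(30.0))  # grade II
```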
The dataset can be utilized for two classification problems. The first problem splits the set into three distinct categories: Normal (consisting of 100 elements), Disk Hernia (consisting of 60 elements), and Spondylolisthesis (consisting of 150 elements). The second problem scheme is more straightforward, as it just categorizes cases into two groups: Normal, which consists of 100 elements, and Abnormal, which consists of 210 elements. This study will concentrate on the initial option, which involves three distinct categories, to demonstrate the effectiveness of classification methods in achieving a more comprehensive and detailed categorization.
Before commencing the application of the classifiers, it is crucial to verify the usability of the dataset. In this particular version, the dataset does not exclude any information in the elements. However, the data exhibit an imbalance between the categories. This can be verified by computing the Imbalance Ratio (IR) between the majority and the minority class:

IR = N_majority / N_minority. (1)

According to the theory, a dataset is considered imbalanced if IR > 1.5 [53]. Substituting the class sizes of this dataset,

IR = 150 / 60 = 2.5, (2)

it is confirmed that the dataset is imbalanced. As a matter of completeness, the relationship between the majority class and the second minority class (the Normal class) is also estimated:

IR_NO = 150 / 100 = 1.5. (3)

This last procedure is carried out only to show the great difference between the elements of the classes.
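The imbalance figures discussed here follow directly from the class sizes of the three-class problem (150, 100, and 60 elements) and can be reproduced in a few lines:

```python
# Class sizes of the three-class VCDS problem described above.
counts = {"NO": 100, "DH": 60, "SL": 150}

# Imbalance Ratio: majority class size over minority class size.
ir = max(counts.values()) / min(counts.values())   # 150 / 60
# Ratio between the majority class and the second minority class (Normal).
ir_second = max(counts.values()) / counts["NO"]    # 150 / 100
print(ir, ir_second)  # 2.5 1.5
```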
This dataset imbalance can decrease the classifiers’ potential due to insufficient elements for a representative and fair classification. So, the dataset is corrected through the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN), which randomly creates additional samples for the minority classes [54,55] (an oversampling correction strategy). The technique is employed to generate synthetic samples to achieve 150 elements for each class. This technique is used over other popular approaches, such as the Synthetic Minority Oversampling Technique (SMOTE), since ADASYN can create samples with more defined decision boundaries and is less susceptible to the noise produced by outliers [56].
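This work applies the ADASYN implementation itself; as a rough NumPy-only sketch of the core idea shared by ADASYN and SMOTE (interpolating between a minority sample and one of its nearest neighbors), the following fragment may help. It omits ADASYN's density-based weighting of hard-to-learn samples, and the stand-in data and function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_minority(X_min, n_new, k=5, rng=rng):
    """Create n_new synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest neighbors
    (the interpolation step shared by SMOTE and ADASYN; real ADASYN
    additionally weights hard-to-learn samples more heavily)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Indices of the k nearest neighbors of X_min[i] (excluding itself).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_dh = rng.normal(size=(60, 6))        # stand-in for the 60 DH samples
X_new = oversample_minority(X_dh, 90)  # grow the class to 150 samples
print(X_new.shape)  # (90, 6)
```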
Figure 1 displays the heat map illustrating the correlation among the attributes of the oversampled dataset. The depicted matrix displays the correlations between the attributes in the row labels and the attributes in the column headers. A negative cell value indicates a negative correlation (inverse relationship), while a positive value expresses a positive correlation (direct relationship). A lower magnitude in a cell represents independence between the correlated attributes, and a greater magnitude means dependency between them. It is observed that the PR attribute has the lowest correlation magnitudes (the most independent attribute), while the PI attribute has the greatest correlation magnitudes (the most dependent attribute). Typically, strategies to boost the performance of a particular classifier (depending on the attributes’ correlations) can be implemented to reduce the dataset size by eliminating an attribute, reducing the computational load [57]. Nevertheless, given the limited size of this dataset, there may not be an enhancement in the computational load, and the performance can be affected [58]. Therefore, all the attributes are employed in the classification processes.
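A correlation heat map like Figure 1 is built from the attribute correlation matrix; the sketch below computes such a matrix and picks out the most dependent attribute. The data are synthetic, with PI constructed from PT and SS (a known anatomical relation, PI = PT + SS) to mimic the dependency reported above.

```python
import numpy as np

rng = np.random.default_rng(2)
attrs = ["PI", "PT", "LLA", "SS", "PR", "GS"]  # the six biomechanical attributes

# Synthetic stand-in data; PI is built from PT and SS plus a little noise.
X = rng.normal(size=(450, 6))
X[:, 0] = X[:, 1] + X[:, 3] + 0.1 * X[:, 0]

# 6 x 6 attribute correlation matrix (the values behind the heat map).
corr = np.corrcoef(X, rowvar=False)

# Dependency of each attribute: mean |correlation| with the other attributes.
dep = (np.abs(corr).sum(axis=0) - 1.0) / (len(attrs) - 1)
print(attrs[int(np.argmax(dep))])  # most dependent attribute
```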
Looking for a complete analysis of the employed dataset, the principal component analysis (PCA) matrix in a pairwise comparison among the attributes is presented in Figure 2. The distribution of the elements according to a single attribute is offered on the diagonal of this matrix, where the distributions of the DH, SL, and NO classes are depicted in blue, green, and red, respectively. It is observed that, for elements of the SL class, the attributes PI, LLA, SS, and GS show important differences against those of the other two classes (DH and NO). On the other hand, only the attributes LLA, SS, and PR show slight differences between the distributions of the DH and NO classes. Furthermore, the rest of the matrix of pairwise comparisons provided in Figure 2 shows the distribution of scattered elements of the three classes in each plane created between the considered attributes. The DH, SL, and NO classes are depicted as blue circles, green squares, and red triangles, respectively. Within these planes, it is observed that a high correlation exists between the elements of each class, since the elements of the classes overlap. Moreover, the scattered elements of all the classes tend towards a linear behavior in those planes that use the GS or PI attribute as one of their axis components. This information could prove valuable in determining the key attributes to consider.
4. Results
The classifiers used in this work are developed using the Python 3.12.1 programming language. The classification algorithms use the scikit-learn library [64,77]. It is important to note that the classifications were performed using a computer equipped with an Intel® Core™ i7-7700HQ CPU running at a clock speed of 2.80 GHz and 16 GB of RAM.
4.1. Experimentation Methodology
Before addressing the results, explaining the methodology employed to carry out the experiments is essential. Next, a brief description of the stages used is offered:
Firstly, the hyperparameters of the baseline classifiers presented in this work are tuned through a grid-search K-Fold cross-validated strategy. This process aims to find the most suitable hyperparameters per classifier, hence the best version of the baseline classifiers. A thorough description of this process is presented in Section 4.2.
An analysis of the results of the tuned baseline classifiers is presented in Section 4.3. This analysis aims to provide insight into the baseline classifiers and find the best one for the particular problem of vertebral column disease classification.
Section 4.4 encompasses the description and analysis of the results of different ensemble strategies (with subvariant proposals). This analysis highlights the behavior and performance of the employed ensemble strategies when different baseline classifiers are employed, providing a better understanding of their advantages and limitations.
Finally, a general discussion is presented in Section 4.5. This discussion not only highlights the results of the baseline classifiers and the ensemble strategies but also addresses their trustworthiness in this particular medical problem of vertebral column disease classification.
4.2. Baseline Classifiers’ Hyperparameter Tuning
The baseline classifiers are tuned through the grid-search strategy to find the most suitable hyperparameters for the particular problem of vertebral column disease classification. The tuning process employs accuracy as the scoring metric for these tests, which is a well-recognized and dependable metric in pattern recognition techniques [78,79]. Considering the dataset’s reduced size, the K-Fold Cross-Validation (KFCV) process is employed [80] to analyze the diverse characteristics of the dataset instead of training the classifier with limited information, and also to ensure a robust evaluation. For each classifier, the grid-search strategy explores the most popular classifier hyperparameters. Next, a brief description of the hyperparameter variations used per baseline classifier in the grid-search tuning process is offered. Also, it is worth pointing out that the NB and QDA classifiers do not require a hyperparameter tuning process since their classification process is directly performed through probabilistic approaches, as presented in Section 3.1.2 and Section 3.1.4, respectively. Moreover, it is worth remarking that discrete hyperparameters such as the distance metric, activation functions, optimization algorithm solvers, or kernels are restricted in this work to the most popular, open-access options, although the literature offers other options that have achieved important results.
kNN classifier: The grid-search process varies the distance metric between Euclidean and Manhattan. Also, the number of neighbors k is evaluated from 1 to 30, i.e., k ∈ {1, 2, …, 30}, considering the maximum permissible value given the K-Fold process.
LR classifier: In this classifier, the optimization algorithm (solver) used to perform the training is selected among the liblinear, sag, saga, lbfgs, newton-cholesky, and newton-cg algorithms.
LDA classifier: In this classifier, the solver employed to compute the eigenvector and eigenvalues used for the function regression is chosen among the svd, lsqr, and eigen strategies.
SVM classifier: The grid-search strategy evaluates the kernel function employed to create the hyperplane that separates the dataset elements, varying among linear, polynomial, radial basis, and sigmoid functions. In the case of a polynomial kernel, its degree d varies between 2 and 10, i.e., d ∈ {2, 3, …, 10}. This limit is considered through an empirical observation where higher polynomial degrees increased the computational burden without enhancing the classification task. Such characteristics of polynomial degree increments have been analyzed in [81] with similar outcomes.
ANN classifier: The grid-search strategy evaluates multiple hyperparameters, such as the number of hidden layers (between 1 and 10) and the number of neurons in the hidden layers (between 1 and 10). Also, the activation function used in the neurons is chosen among the identity, logistic, tanh, and relu functions (see Table 3). Moreover, the optimization algorithm (solver) used in the grid-search strategy varies between lbfgs, sgd, and adam. Finally, the last hyperparameter tuned is the learning rate, varying between 0.001 and 0.9 with a step size of 0.005, i.e., η ∈ [0.001, 0.9]. The ranges for the learning rate, step size, number of hidden layers, and number of neurons are established considering boundaries that set computational burden limits (regarding time) within which other classifiers achieved acceptable outcomes without further endeavor.
DT classifier: The grid search varies multiple hyperparameters, such as the criterion used to split the elements of the dataset (gini, entropy, or log_loss). Also, the splitter technique can be optimization-based (best) or stochastic-based (random). Moreover, the maximum depth of the tree varies between 5 and 10, the minimum number of samples to consider a node a leaf varies between 1 and 10, and the minimum number of samples to split a node is between 1 and 10. Similarly, these ranges are selected considering boundaries where the computational burden limits are not surpassed.
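As an example of the tuning procedure, the kNN grid described above (distance metric and k from 1 to 30) can be run with scikit-learn's GridSearchCV; the synthetic dataset and the choice of a 5-fold split are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the oversampled dataset (450 elements, 6 attributes).
X, y = make_classification(n_samples=450, n_features=6, n_informative=4,
                           n_classes=3, random_state=3)

# Hyperparameter grid: distance metric and number of neighbors k in [1, 30].
param_grid = {"metric": ["euclidean", "manhattan"],
              "n_neighbors": list(range(1, 31))}

# Grid search scored by accuracy over a K-Fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="accuracy",
                      cv=KFold(n_splits=5, shuffle=True, random_state=3))
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```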
Finally, Table 4 presents the baseline classifiers and their tuned hyperparameters that yield the highest reported accuracy.
4.3. Baseline Classifiers Results
Table 5 presents the results of the tuned baseline classifiers, where the first column depicts the evaluated classifier. The accuracy, precision, recall, and F1-score obtained from the KFCV process are presented from the second to the fifth columns, respectively. Looking for a complete analysis, the classifier’s confusion matrix is presented in the sixth column, where the labels “DH”, “SL”, and “NO” stand for disk hernia, spondylolisthesis, and normal, respectively (the possible classes). The row label of the matrix indicates the actual class, and the column label represents the class assigned by the classifier. In each three-by-three matrix, the sum of each row equals the total number of elements per class, i.e., DH, SL, and NO have 150 elements each. The numbers on the diagonal of each matrix indicate the correctly classified elements (CCE) of the dataset (the CCE evaluation is presented in the seventh column of Table 5), and the rest of the cells are the misclassified elements.
Considering this is a medical problem where the misclassification of diseases might worsen the patient’s health, another crucial evaluation is related to the misclassified disease elements (MDE) valuation. The MDE valuation considers those elements of the DH and SL classes misclassified as elements of the NO class. These errors are more important because they might cause the patients to remain with their illnesses instead of starting an accurate medical treatment. Other kinds of misclassification might lead to the start of medical treatment when the patient is healthy. These last misclassifications are not evaluated in this work because they might be corrected throughout the treatment (future medical analyses) by medical experts. The MDE evaluation of the baseline classifiers is presented in the eighth column of Table 5.
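Given a confusion matrix laid out as described (rows: actual DH, SL, NO; columns: predicted), the CCE and MDE evaluations reduce to simple sums; the matrix values below are illustrative, not those of Table 5.

```python
import numpy as np

# Illustrative 3x3 confusion matrix (rows: actual, cols: predicted),
# class order DH, SL, NO, with 150 elements per row as described above.
classes = ["DH", "SL", "NO"]
cm = np.array([[140,   2,   8],
               [  1, 146,   3],
               [  7,   2, 141]])

# CCE: correctly classified elements = sum of the diagonal.
cce = int(np.trace(cm))

# MDE: diseased elements (actual DH or SL) predicted as NO.
no = classes.index("NO")
mde = int(cm[classes.index("DH"), no] + cm[classes.index("SL"), no])
print(cce, mde)  # 427 11
```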
The last column of Table 5 includes the time metric, which considers the classifiers’ hyperparameter tuning, training, and implementation. It is essential to remember that each baseline classifier relies on different properties and requires different computational capabilities and resources. Hence, this last metric could be important for future implementation in real-world healthcare devices or future research considerations.
Furthermore, the best-obtained outcomes per evaluated metric in Table 5 are in boldface. Also, the classifiers are ranked according to their achieved CCE, where the classifiers’ ranks are placed as a superscript in the classifier names of the first column. Finally, to achieve a straightforward understanding of the results, the following highlights are provided:
The kNN classifier yields the most outstanding results, being the only one whose accuracy, precision, recall, and F1-scores are above 0.9. This is reflected in its CCE evaluation, where it correctly classified 427 elements. Also, the kNN classifier is the most trustworthy approach for this particular medical application since it only misclassified four disease elements as normal elements, i.e., MDE = 4.
It is noteworthy that the SVM classifier achieves the second-best outcomes. This is interesting since both the SVM and kNN classifiers use geometrical strategies. Nonetheless, it is important to mention that the SVM classifier loses competitiveness against the kNN classifier with a lower CCE evaluation. The SVM’s MDE evaluation (MDE = 8) is twice as high (worse) as that reported by the best classifier.
The DT classifier takes third place with a lower CCE evaluation than the best classifier. However, even if its CCE evaluation is worse than the SVM classifier outcome, the DT classifier yields more reliable results since its MDE evaluation is lower (MDE = 6).
The ANN classifier reaches fourth place, followed by the LR classifier, where both obtain similar outcomes. Interestingly, even if the ANN classifier is based on a more complex process, it does not obtain better results.
The classifiers based on probabilistic strategies (NB, LDA, and QDA) report the worst outcomes. Their poor performance is related to the dataset characteristics, where there are high correlations among the attributes, as observed in Figure 1. This characteristic works against the operation principle of the probabilistic classifiers, which require statistical independence among the evaluated attributes.
Particularly, regarding the MDE, it is observed that the baseline classifiers tend to misclassify the SL elements as NO elements. Notably, the SL elements were the majority class in the original imbalanced data.
It is observed that the baseline classifiers based on probabilistic strategies require less time to be implemented, but their classification performance is lower than that of the other classifiers. On the other hand, it is interesting that the best baseline classifier (kNN) is 26.6 times faster than the second-best baseline classifier (SVM), indicating that the vertebral column disease classification task does not demand a computationally wasteful endeavor.
Aiming to provide a more robust analysis of the baseline classifiers, the Receiver Operating Characteristic curves (ROCs) and their areas under the curve (ROC-AUC) are computed. Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 display the ROC-AUC results of the kNN, NB, LR, LDA, QDA, SVM, ANN, and DT baseline classifiers, in that order. Each graphic represents the relationship between the true-positive rate (y-axis) of an evaluated class and the false-positive rate (x-axis), where the ROC is depicted by a red line whose area represents the statistical power as a function of the Type I error of the decision in a binary state (correctly classified or misclassified). Also, in these graphs, the dark blue dashed line represents the probabilistic behavior of a random classifier (useless performance); the farther the ROC curve lies from the random classifier, the better the evaluated classifier performs for the assessed class. Otherwise, the classifier has low trustworthiness. Finally, in this set of graphs, subfigures (a), (b), and (c) display the ROC-AUC of the assessed classes DH, SL, and NO against the rest of the classes, respectively.
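The one-vs-rest ROC-AUC evaluation shown in Figures 8-15 can be reproduced per class with scikit-learn; the synthetic data and kNN settings below are placeholders for the actual study setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Synthetic three-class stand-in dataset.
X, y = make_classification(n_samples=450, n_features=6, n_informative=4,
                           n_classes=3, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest ROC-AUC: each class evaluated against the rest.
y_bin = label_binarize(y_te, classes=[0, 1, 2])
aucs = [roc_auc_score(y_bin[:, c], proba[:, c]) for c in range(3)]
print([round(a, 3) for a in aucs])
```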
The ROC-AUC outcomes corroborate the kNN as the best baseline classifier since it is the only one whose ROC-AUC results surpassed 0.9 in all the evaluated classes. Also, regarding the classification problem, it is observed that the easiest class to classify is SL, where all the baseline classifiers (except for the kNN) achieved their best result in the ROC-AUC evaluation of this class. It is also observed that the DH and NO classes are more complicated to classify, confirming the information of the pairwise principal component analysis presented in Figure 2, where the distributions of each attribute per class show that the dataset’s elements of the DH and NO classes overlap in all attributes. This characteristic limits the performance of the baseline classifiers, particularly those that rely upon probabilistic approaches (NB, LDA, and QDA).
4.4. Ensembled Classifiers Results
Five different ensembled classifiers are executed to analyze the behavior of ensemble classifier strategies. The first one is the RF approach. The second and third methods are the voting and stacking classifiers, respectively, in a primary version, where the baseline classifiers used for their ensemble strategies are the top-three ranked classifiers according to Table 5 (hereinafter referred to as VCtop and SCtop, respectively). Finally, the fourth and fifth methods to compare are also the voting and stacking classifiers, respectively, with the main difference being that they use all the baseline classifiers for their ensemble strategies (hereinafter referred to as VCall and SCall, respectively). These last four methods consider no weighting among the included baseline classifiers, aiming to provide the same relevance to the individual results and to exclude unknown preferences from the final ensemble results.
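The unweighted voting and stacking constructions described above can be sketched with scikit-learn; this is an illustrative setup on synthetic data, and the three estimators chosen here merely stand in for the top-three ranking of Table 5 (the paper's auto-tuned hyperparameters are not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-ins for the top-three ranked baseline classifiers.
top3 = [("knn", KNeighborsClassifier(n_neighbors=1)),
        ("svm", SVC(kernel="poly", degree=7, probability=True)),
        ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=800,
                              random_state=0))]

# No weighting: every base estimator has the same relevance.
vc_top = VotingClassifier(estimators=top3, voting="hard").fit(X_tr, y_tr)
# Stacking feeds the base predictions to a meta-classifier
# (logistic regression by default in scikit-learn).
sc_top = StackingClassifier(estimators=top3).fit(X_tr, y_tr)

acc_vc = vc_top.score(X_te, y_te)
acc_sc = sc_top.score(X_te, y_te)
```

The VCall/SCall variants follow the same pattern with the full list of eight baseline classifiers passed as `estimators`.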
Table 6 presents the outcomes of the tested ensemble strategies, following a structure similar to that used to report the results of the baseline classifiers, with the best-obtained outcomes per evaluated metric in boldface. The following noteworthy points are offered:
Regarding the ensemble strategies that used all the baseline classifiers, it is observed that the SCall ensemble yields the best outcomes; it is the only ensemble strategy able to achieve results above 0.95 in all the evaluated metrics. These results make the SCall ensemble the most outstanding option among all the ensembles and baseline classifiers. On the other hand, it is observed that the VCall ensemble cannot boost its performance beyond that of the best baseline classifier (kNN).
Concerning the ensemble strategies that use only the three best baseline classifiers, it is observed that the VCtop and SCtop ensembles have outcomes similar to the best baseline classifier (kNN). Particularly, the VCtop ensemble matches the kNN's outcomes in all the evaluated metrics, whereas the SCtop ensemble reduces the kNN's competitiveness by increasing its MDE valuation, misclassifying six disease elements as normal elements.
The RF ensemble is the least promising ensemble strategy. Nonetheless, it is observed that the RF ensemble improves the CCE valuation of the DT classifier (its baseline classifier). These results show that an ensemble strategy that uses only one kind of baseline classifier has limited performance, reaching only small improvements.
Regarding the MDE metric, it is observed that, as with the baseline classifiers, the ensemble strategies tend to miscategorize the SL elements as NO ones. This can result from the overlap of the dataset's PT and GS attributes, as observed in Figure 2. This kind of error might be mitigated by removing the overlapped attributes from the dataset, yet the best strategy is to consider additional attributes that clarify the differences between these classes. Therefore, in clinical problems, the machine learning results must be ratified by an expert in the field [82].
Since the ensemble strategies work with already tuned baseline classifiers (except for the RF ensemble strategy), the time measurement only considers the time required to complete the classification task, seeking a fair comparison. Particularly regarding the VC and SC ensemble strategies, the variants that use all the baseline classifiers require more computational time than those that use only the top three, which is an expected result according to related works like [83,84].
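The prediction-only timing described above can be sketched as follows; the fitting step is excluded from the measurement because the base estimators are assumed to be tuned beforehand (synthetic data, illustrative classifier):

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tuning/fitting happens outside the measured window.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

t0 = time.perf_counter()
clf.predict(X_te)                 # only the classification task is timed
elapsed = time.perf_counter() - t0
```

For an ensemble, `clf` would simply be the fitted VC or SC object, so a variant with more base estimators accumulates more per-estimator prediction time inside the same window.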
Seeking a deeper analysis of the ensemble strategies, Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20 display the ROC-AUC outcomes of the RF, VCtop, SCtop, VCall, and SCall ensemble strategies, respectively, following the same structure used in the ROC-AUC evaluation of the baseline classifiers. It is observed that the RF ensemble strategy is the worst one, since it is the only one unable to surpass 0.9 in the ROC-AUC of the NO class. Concerning the rest of the ensemble strategies, all of them yield a performance similar to that of the best baseline classifier (kNN). Nonetheless, it is essential to remember that the SCall ensemble strategy is the only one able to surpass 0.95 (the kNN limit) in all the other evaluated metrics.
4.5. General Discussion
Considering the above-reported outcomes of the baseline classifiers, whose operation relies on different approaches, it is observed that the kNN yields the best performance among the baseline ones, presenting a significant difference in correctly classified elements against the second-best (SVM). Also, it is observed that most of the classifiers present similar outcomes, obtaining values of accuracy, precision, recall, and F1-score in ranges from 0.81 to 0.85. Nonetheless, it is noteworthy that the ones whose operation relies on probabilistic approaches (NB, LDA, and QDA) fall below those ranges, making them the least reliable for the problem of vertebral column disease classification. Moreover, the ROC-AUC analysis corroborates the kNN as the best baseline classifier since it is the only one that surpassed 0.9 in the curve areas related to the three possible classes.
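The two best-performing baseline configurations reported in this study (a one-nearest-neighbor classifier with Euclidean distance, and an SVM with a seventh-degree polynomial kernel) can be sketched as follows; all other settings are scikit-learn defaults, assumed for illustration rather than taken from the paper's tuning, and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Best baseline: 1-NN with Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X_tr, y_tr)
# Second-best baseline: SVM with a seventh-degree polynomial kernel.
svm = SVC(kernel="poly", degree=7).fit(X_tr, y_tr)

acc_knn = knn.score(X_te, y_te)
acc_svm = svm.score(X_te, y_te)
```

Both configurations are geometrical in nature, which is consistent with the observation that distance- and margin-based approaches handle this dataset better than the probabilistic ones.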
Nevertheless, the ensemble strategies can outperform the results of the baseline classifiers. In particular, it is observed that the SC ensemble, when all the baseline classifiers are considered (SCall), is the only one that pushes the accuracy, precision, recall, and F1-score beyond 0.95 (the best-reported results among ensembles and baseline classifiers). On the other hand, when only the top-three baseline classifiers are used in the SC ensemble (SCtop), the performance is limited to that of the best one (kNN). This shows that the more baseline classifiers are used in the SC ensemble strategy, the more representative the input created for the meta-classifier. Similarly, the VC ensembles with the top-three and all baseline classifiers (VCtop and VCall, respectively) are limited to the performance of the best baseline classifier (kNN). This is attributed to the significant difference between the kNN classifier and the rest of the baseline classifiers, which makes the remaining classifiers' votes inconsequential. It is also observed that the RF ensemble is the least promising approach, which is attributed to the fact that all its baseline classifiers operate on the same principle, making the RF ensemble unable to cope with different conditions in the variance of the dataset. This is also confirmed through the ROC-AUC analysis, where the RF ensemble strategy cannot surpass the 0.9 curve area in the classification probability related to the NO class, while the rest of the ensemble strategies reach outcomes similar to the best baseline classifier (kNN).
On the other hand, the use of ensemble strategies enhances the trustworthiness (reliability) of the classification process. This is observed in the decrease of misclassified disease elements (MDE), reflecting that fewer disease elements are misclassified as healthy ones (NO class), increasing the reliability of the ensemble strategies over the single baseline classifiers. An example of this can be observed through the second-best baseline classifier (SVM), which achieves the second-highest scores in the valuation of accuracy, recall, precision, and F1-score; yet, its MDE is 100% greater (worse) than that of the best baseline classifier (kNN), resulting in an unreliable classification. In the case of the ensemble strategies, it is observed that in the worst cases (RF and SCtop), the MDE results are only 50% greater, indicating that the ensemble strategies' classification is more reliable than the baseline results. Moreover, for some baseline classifiers (for instance, NB), the MDE valuation increases by 500% (the classification is worse), and when such a baseline classifier is included in the ensemble strategies (VCall and SCall), the proposals still provide more robust classification outcomes.
Finally, it is worth pointing out that this study is mainly limited by factors related to the available vertebral column dataset, the machine learning strategies implemented, and the resource consumption required for obtaining the ensemble results. The first limitation encompasses the imbalance ratio between the classes of the dataset and the oversampling technique used for its correction (ADASYN), which may overfit the dataset with synthetic instances that closely resemble the original data (in the case of noisy data, the number of mislabeled instances in a class may increase). Another limitation related to the dataset is the correlation between the attributes of the elements, which contributed to the misclassification of some elements, particularly those of the SL and NO classes. Also, this study is limited to some of the most popular, easily accessible baseline classifiers, taking advantage of fundamental ML technique processes (geometric, probabilistic, logic, and optimization). A further limitation concerns the resource consumption required for obtaining the ensemble results, which increases with respect to the baseline classifiers, since some ML processes imply higher computational loads.
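The attribute-correlation limitation noted above can be checked with a pairwise correlation matrix; the following sketch uses synthetic attributes (one deliberately correlated pair) to illustrate the kind of diagnostic that flags violations of the statistical independence assumed by the probabilistic classifiers (NB, LDA, and QDA):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.9 * a + 0.1 * rng.normal(size=200)   # strongly correlated with a
c = rng.normal(size=200)                   # independent attribute
X = np.column_stack([a, b, c])

# Pearson correlation between every pair of attributes (columns).
corr = np.corrcoef(X, rowvar=False)

# Flag attribute pairs whose correlation magnitude is suspiciously high.
high_pair = abs(corr[0, 1]) > 0.8
```

In practice the same matrix computed over the vertebral column attributes would reveal which pairs (e.g., the overlapping ones mentioned above) are candidates for removal or for motivating additional, more discriminative attributes.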
5. Conclusions
Pathology diagnosis is one of the medical fields in which machine learning techniques help practitioners, and biomedical data classification is regarded as a delicate and challenging endeavor. This research examines VC and SC ensemble strategies using both the three most promising auto-tuned baseline classifiers and the complete set of auto-tuned baseline classifiers, evaluated for the automatic diagnosis of vertebral column orthopedic illnesses. The comparative study with respect to the most popular pattern recognition techniques (kNN, NB, LR, LDA, QDA, SVM, ANN, and DT) indicates that using a meta-classifier in the SC ensemble strategy provides better outcomes when all the baseline classifiers are considered. In particular, the SCall ensemble achieves values above 0.95 in the evaluation of the accuracy, recall, precision, and F1-score metrics. Nonetheless, it is noteworthy that when the SC ensemble strategy employs only the top-three baseline classifiers (SCtop), its performance is restricted to that of the best baseline classifier (kNN). This difference between the SC ensemble strategy variants is attributed to the fact that the SCall variant creates more representative dataset information for the meta-classifier. Furthermore, the SCall ensemble achieves competitive MDE valuations (the fewer, the better) with only five misclassified elements, indicating that five sick elements are misclassified as healthy. However, this result can be related to the dataset characteristics, specifically the presence of overlapping attributes, where only a small number showed significant variances. In the case of the voting ensemble variants (VCall and VCtop), the VC ensemble strategy is limited to the kNN (the best baseline classifier) independently of using the complete set (all) or the three most promising baseline classifiers.
When only the baseline classifiers are evaluated, the kNN classifier is the most outstanding one (using the Euclidean distance metric and one neighbor), achieving values above 0.94 in accuracy, recall, precision, and F1-score compared to the NB, LR, LDA, QDA, SVM, ANN, and DT baseline classifiers. Interestingly, the second-best baseline classifier is the SVM (with a polynomial kernel of seventh degree), whose classification process is also based on geometrical approaches. Also, it is remarkable that the probabilistic-based classifiers (NB, LDA, and QDA) yield the worst outcomes and cannot surpass the 0.81 value in the evaluated metrics, whereas the rest of the classifiers can. This inefficiency of the probabilistic classifiers is attributed to the high correlation magnitude between the dataset's attributes, given that these classifiers require statistical independence among the attributes.
Finally, as future work, we propose a comprehensive investigation of the impact of reducing the dataset's attributes on the classifiers. Additionally, exploring alternative scoring methods beyond those employed in this work is considered, as well as developing new classifiers based on the best-performing techniques reported here and comparing them with other methods documented in the existing literature.