1. Introduction
Currently, cardiovascular diseases stand as one of the leading causes of global morbidity and mortality [1,2]. Detecting and diagnosing these conditions in a timely manner has become a critical challenge in the field of healthcare [3]. Cardiovascular diseases, also known as heart diseases, encompass a diverse group of conditions that impact the heart and blood vessels. These conditions can range from diseases affecting the coronary arteries (the arteries that supply blood to the heart) to disorders affecting heart valves, cardiac muscles, and the electrical conduction system of the heart [4,5].
Heart failure (HF) occurs when the heart cannot pump blood efficiently, affecting the delivery of oxygen and nutrients to tissues. It can be caused by coronary artery disease, hypertension, and damage to the heart muscle, presenting with extreme fatigue, shortness of breath, and edema [6,7,8]. Diagnosis is challenging due to the variety of clinical presentations and the similarity to other conditions, requiring imaging tests and biomarker monitoring for accurate diagnosis and treatment [9,10].
The significance of distinguishing individuals with heart failure from those without it lies in several crucial aspects [11,12]. First and foremost, the early detection of cardiovascular diseases enables timely medical intervention, which can be pivotal in improving patients’ prognosis. Early treatment can help prevent severe complications such as heart attacks or heart failure and reduce the disease burden associated with these conditions [13,14,15].
Furthermore, accurate identification of individuals at risk of heart failure is essential for the implementation of prevention and control strategies [16,17]. These strategies may include lifestyle changes, such as a healthy diet and regular exercise, as well as the management of risk factors, including high blood pressure and elevated cholesterol levels. Precise classification allows resources and efforts to be directed toward those who need them the most [18,19,20].
In this context, machine learning (ML) algorithms have emerged as powerful tools to address this challenge [21,22]. This article explores the use of a dataset from the Kaggle platform named “Heart Failure”, consisting of approximately 10 crucial features for the classification of cardiovascular diseases [23]. These features provide key information on the cardiovascular health of a patient and serve as fundamental variables in the application of supervised machine-learning techniques for the accurate classification of heart diseases [24,25].
This paper’s innovation resides in its implementation of machine learning algorithms for the prediction of heart failure, combining advanced preprocessing techniques, such as the elimination of outliers, with the optimization of hyperparameters to maximize model performance. Its structured design ensures adaptability to datasets for other diseases, making it a valuable educational tool. The selection of a robust dataset and the integration of metrics such as specificity, AUC, and Brier score reinforce its clinical applicability.
By obtaining robust metrics such as specificity, AUC, and Brier score, the potential role of machine learning in early detection and clinical decision making is reinforced. Beyond predicting heart failure with a certain degree of accuracy, the model can be integrated into a clinical decision support system (CDSS), helping to identify high-risk patients in real time, prioritize critical cases, and optimize resources, thereby enhancing hospital efficiency and clinical outcomes.
2. Previous Works
In this section, a comprehensive literature review is conducted, focusing on works related to the subject at hand. Various studies and methodologies are examined; the machine learning techniques employed in this context are summarized in Table 1.
In study [18], six data mining tools, namely, Orange, Weka, RapidMiner, KNIME, MATLAB, and ScikitLearn, were compared using six machine learning techniques: logistic regression, SVM, KNN, ANN, naïve Bayes, and random forest. Heart disease was classified using a dataset with 13 features, one target variable, and 303 instances (139 with cardiovascular disease and 164 healthy).
The dataset related to heart diseases in [26] comprises 70,000 patient records with 12 features, including age, height, weight, gender, systolic and diastolic blood pressure, cholesterol, glucose, smoking status, alcohol consumption, physical activity, and the presence of cardiovascular disease (“cardio”: 1 for disease; 0 for healthy).
The studies in [13,19] selected and used 14 specific features. Different preprocessing methods were implemented in these works: the 14 characteristics were combined into 4 groups of around 6 features each, and each group was evaluated with different machine learning models.
In the case of [27,28], a dataset pertaining to cardiovascular diseases from Kaggle is employed. This dataset comprises twelve attributes, with a clear target variable. These attributes span age, height, weight, gender, blood pressure, cholesterol and glucose levels, smoking and drinking habits, physical activity, and the presence or absence of cardiovascular disease.
The machine learning algorithms employed in these articles encompass decision tree (DT), naïve Bayes (NB), random forest (RF), k-nearest neighbor (KNN), SVM (linear kernel), and logistic regression (LR). Additionally, gradient-boosted trees (GBT) are utilized as part of the analysis.
3. Materials and Methods
The heart disease dataset utilized in this research, sourced from the Kaggle platform (San Francisco, CA, USA) and named “Heart Failure”, serves as a consolidated resource for studying cardiovascular conditions. This dataset consists of approximately 10 crucial features for the classification of cardiovascular diseases. The dataset analysis and model development were conducted in a collaborative environment using Google Colab (Mountain View, CA, USA), leveraging its computational resources for efficient data handling and experimentation. Python 3.13 (Python Software Foundation, Wilmington, DE, USA) was employed as the primary programming language due to its versatility and the wide range of libraries available for data analysis and machine learning.
3.1. Data Analysis and Preprocessing
The dataset used combined five independent datasets on heart disease, resulting in a database with a total of 918 records and 11 features. These features include age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, resting ECG, peak heart rate, exercise-induced angina, ST depression, and ST slope. This dataset represents the largest publicly accessible dataset on heart disease and provides a solid basis for research and analysis. This dataset can be seen in more detail in Table 2.
The preprocessing stage included a detailed analysis of the dataset. Exploratory data analysis (EDA) techniques, such as histograms, box plots, and density plots (Figure 1), were first employed to visualize the distribution of variables and detect anomalies. In data cleaning, 16 outliers were identified and eliminated using the interquartile range (IQR) method, a statistical approach chosen for its simplicity and effectiveness in detecting extreme deviations [29]. Clinical validation corroborated that these outliers corresponded to data errors or atypical conditions. Removing them allowed machine learning models to be trained on representative data, improving their reliability and clinical relevance.
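A minimal Python sketch of this IQR-based filtering is shown below; the DataFrame name `df`, the column names, and the threshold factor of 1.5 are illustrative assumptions rather than the exact code used in the study.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose values fall inside [Q1 - k*IQR, Q3 + k*IQR] for every given column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Hypothetical usage on numeric features of the heart-failure dataset
# df_clean = remove_iqr_outliers(df, ["RestingBP", "Cholesterol", "MaxHR", "Oldpeak"])
```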
After the database was cleaned, categorical variables were analyzed to identify those with low frequencies. Although a threshold of 5% was defined to flag infrequent categories, none of the categories in the analyzed columns met this criterion; therefore, all the original categories were retained. Ordinal encoding was applied to categorical variables with a natural order, and one-hot encoding was used to transform the remaining variables into numerical representations suitable for machine learning models.
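For illustration, this encoding step could be implemented in pandas as sketched below; the column names and the ordinal mapping for the ST-slope categories are assumptions based on the dataset description, not the authors' exact code.

```python
import pandas as pd

# Ordinal encoding for a variable with a natural order (mapping assumed for illustration)
st_slope_order = {"Down": 0, "Flat": 1, "Up": 2}
df["ST_Slope"] = df["ST_Slope"].map(st_slope_order)

# One-hot encoding for the remaining nominal variables (column names assumed)
df = pd.get_dummies(df, columns=["Sex", "ChestPainType", "RestingECG", "ExerciseAngina"])

# Dummy columns come back as Booleans in recent pandas; cast them to 0/1 integers
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
```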
The binary variables generated during this process were consolidated to simplify the dataset and standardize the representations. Finally, all columns with Boolean values were converted into binary representations (0 and 1), ensuring uniformity in the dataset format. Once these transformations were completed, a correlation matrix (Figure 2) was calculated to analyze the relationships between features in the processed dataset. This analysis made it possible to identify clearer patterns and more consistent relationships among the variables, validating the effectiveness of the preprocessing, since no multicollinearity issues were present, which is particularly important for models such as logistic regression.
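A sketch of how such a matrix can be computed and visualized with pandas and seaborn is shown below; the exact styling of Figure 2 may differ.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between all numeric features of the processed DataFrame
corr = df.corr(numeric_only=True)

# Heatmap visualization, roughly in the spirit of Figure 2
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```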
Each cell in the matrix represents the correlation between two variables, measured on a scale from −1 to 1. A value of 1 indicates a perfect positive correlation, −1 a perfect negative correlation, and 0 signifies no correlation.
In this matrix, a positive correlation is observed between age and heart disease, implying that the risk of heart disease increases with age. A negative correlation is noted between cholesterol levels and heart disease, meaning that, in this dataset, lower recorded cholesterol values tend to coincide with the presence of heart disease. Other variables, such as resting blood pressure, fasting blood sugar, and maximum heart rate achieved, also show positive correlations with heart disease. On the other hand, exercise-induced angina and depression of the ST segment are negatively correlated with heart disease.
Overall, this correlation matrix shows that variables associated with health risks, such as older age, high blood pressure, and diabetes, are also linked to a higher risk of heart disease.
3.2. Analyzing Our Target Feature
The target variable in this study represents the presence or absence of heart disease, where 1 indicates the presence of the condition and 0 represents its absence. Initially, the dataset contained 508 instances of heart disease and 410 cases without it, as shown in Figure 3. After removing outliers, the distribution changed slightly to 498 instances of heart disease and 407 cases without it. Despite this slight imbalance, the dataset remained reasonably balanced overall.
Before classification with the machine learning models, the dataset was balanced using oversampling techniques. The minority class (absence of heart disease) was resampled to match the size of the majority class. The result was a balanced dataset with 498 samples for each class, as shown in Figure 4. By equalizing the number of samples from both classes, the model training process can effectively reduce potential biases and improve the accuracy of predictions for both conditions.
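A minimal sketch of this random oversampling step is given below, assuming the processed DataFrame `df` with a binary `HeartDisease` target column; the column name and random seed are illustrative assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Separate the two classes (target column name assumed)
majority = df[df["HeartDisease"] == 1]
minority = df[df["HeartDisease"] == 0]

# Randomly oversample the minority class up to the size of the majority class
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

# Recombine and shuffle to obtain the balanced dataset (498 samples per class)
df_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(df_balanced["HeartDisease"].value_counts())
```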
The final balanced distribution of the target variable ensures sufficient data representation for both classes, enabling the classification model to be trained effectively. This preprocessing step strengthens the reliability of the model’s evaluation and increases its capacity to generalize well to unseen data.
3.3. Machine-Learning Methods Used in This Study
In this section, a concise description of the machine-learning classification algorithms employed in the research is presented. These algorithms include LR, DT, RF, k-NN, and MLP.
3.3.1. Logistic Regression (LR)
LR is a multivariable method widely used to model the relationship between multiple independent variables and a categorical dependent variable (Figure 5). It is the statistical method of choice when predicting the occurrence of a binary outcome, such as determining whether someone is sick or healthy or making yes-or-no decisions [30]. This approach is commonly applied in situations involving disease states and decision-making. In statistical terms, LR is used to solve binary classification problems by modeling events and classes probabilistically through logistic functions [31,32].
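A brief scikit-learn sketch of training such a classifier on the balanced dataset is given below; the train/test split, the scaling step, and the hyperparameter values are illustrative assumptions rather than the exact configuration tuned in this study. The same split is reused in the sketches for the remaining models.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features and binary target (column name assumed)
X = df_balanced.drop(columns=["HeartDisease"])
y = df_balanced["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize the features, then fit a logistic model of the class probability
lr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_model.fit(X_train, y_train)
print("LR test accuracy:", lr_model.score(X_test, y_test))
```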
3.3.2. Decision Tree (DT)
DTs are sequential models that logically combine a series of simple tests; each test compares a numeric attribute against a threshold value or a nominal attribute against a specific set of values [33]. DTs serve as versatile prediction and classification mechanisms, recursively dividing a dataset into subsets based on the values of associated input fields or predictors [34]. This subdivision creates partitions and descendant nodes, known as leaves, which contain internally similar target values; as one descends the tree, the values between leaves become increasingly different (Figure 6). This fundamental characteristic of DTs allows them to adapt to diverse situations, making them crucial tools in the fields of decision-making and predictive analysis [35].
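A corresponding scikit-learn sketch, reusing the split from the logistic-regression example and with illustrative (untuned) hyperparameters, is shown below.

```python
from sklearn.tree import DecisionTreeClassifier

# Depth and leaf-size limits constrain the recursive splitting to reduce overfitting
dt_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5, random_state=42)
dt_model.fit(X_train, y_train)
print("DT test accuracy:", dt_model.score(X_test, y_test))
```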
3.3.3. Random Forest (RF)
RF is an ensemble machine-learning technique that combines multiple individual DTs to create a more robust and accurate predictive model [36]. It operates by constructing a multitude of DTs during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees, as can be seen in its architecture in Figure 7. The randomness introduced in the tree-building process, both in terms of the data samples used for training and the features considered at each split, enhances the model’s generalization ability and reduces overfitting, yielding improved overall performance and reliability in predicting outcomes for new data points [37,38].
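The sketch below illustrates this ensemble with scikit-learn, again reusing the earlier split; the number of trees and the feature-subsampling rule are illustrative choices, not the tuned values.

```python
from sklearn.ensemble import RandomForestClassifier

# Ensemble of randomized trees: each tree sees a bootstrap sample and a random
# subset of features at every split, and the class is decided by majority vote
rf_model = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf_model.fit(X_train, y_train)
print("RF test accuracy:", rf_model.score(X_test, y_test))
```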
3.3.4. K-Nearest Neighbor (KNN)
KNN is a classification and regression algorithm used in machine learning. In classification, k-NN assigns a class to a data point based on the classes of its nearest neighbors in a feature space [39]. In regression, k-NN estimates the numerical value of a data point by taking the average of its nearest neighbors’ values (Figure 8). The “k” in k-NN represents the number of neighbors considered when making the prediction, and the choice of this value affects the model’s accuracy [40]. k-NN is a simple and effective method, especially for data in which decision boundaries are nonlinear or when the underlying data structure is complex and cannot be easily modeled using mathematical equations [41].
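A short scikit-learn sketch follows, reusing the earlier split; k = 5 and the scaling step are illustrative assumptions rather than the tuned configuration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Distance-based classifier: features are standardized so no single scale dominates
knn_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_model.fit(X_train, y_train)
print("k-NN test accuracy:", knn_model.score(X_test, y_test))
```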
3.3.5. The Multi-Layer Perceptron (MLP)
MLP is a kind of artificial neural network designed for supervised learning tasks. It consists of multiple layers of nodes: an input layer through which the network receives the initial data, one or more hidden layers that process this information, and an output layer that produces the final prediction or classification [42]. Each node in the network is a perceptron, applying a mathematical transformation to its inputs and passing the result to the next layer (see Figure 9). MLPs are able to learn complex patterns and relations in data, which makes them suitable for tasks such as regression, classification, and pattern recognition. Techniques such as backpropagation are used to adjust the weights of connections between nodes during training, allowing the network to learn and improve its performance over time [43,44,45].
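A minimal scikit-learn sketch of such a network, reusing the earlier split, is given below; the hidden-layer size and iteration budget are illustrative assumptions, not the architecture tuned in this study.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One hidden layer of 64 units trained with backpropagation
mlp_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=42),
)
mlp_model.fit(X_train, y_train)
print("MLP test accuracy:", mlp_model.score(X_test, y_test))
```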
5. Discussion
In the present research, the Heart Failure Prediction dataset was used to identify the optimal machine learning classification algorithm for efficient feature removal and diagnosis of heart failure. To this end, different machine learning algorithms were analyzed to identify their strengths and weaknesses. Logistic regression (LR) shows balanced performance, with an F1 score of 0.85 and an AUC of 0.94, indicating its ability to differentiate between classes. However, although it is robust in both classes, its metrics are not as competitive as those of the other models evaluated.
Decision tree (DT) outperforms LR in accuracy (90%) and achieves a better balance between precision and recall. Its F1 for class 0 is solid, reaching 0.90, and its high specificity (0.94) highlights its effectiveness in identifying true negatives. However, its AUC of 0.89 is slightly lower than that of LR and other models, suggesting that it can still improve in discriminating between classes.
The k-nearest neighbors (k-NN) model faces overall difficulties, with a lower F1 (0.75) and reduced specificity (0.69), reflecting problems in handling class imbalances and recognizing negative cases. Although its AUC of 0.81 is acceptable, it falls short of the other models evaluated.
The multilayer perceptron (MLP) offers competitive performance, standing out with an AUC of 0.95 and an overall accuracy of 85%. However, the disparity in recall (75% for class 0 and 95% for class 1) reveals difficulties in consistently identifying instances of both classes, resulting in a lower F1 score compared to the RF model.
Finally, the random forest (RF) model performs remarkably well, achieving a precision of 90% for class 0 and 93% for class 1, demonstrating its ability to accurately identify both positive and negative cases. Recall is also high, at 93% for class 0 and 90% for class 1, ensuring that most instances of each class are correctly detected. This results in F1 scores of 0.92 and 0.91 for classes 0 and 1, respectively, reflecting a well-balanced overall performance.
Moreover, metrics such as specificity (0.93) and AUC (0.97) highlight the model’s ability to effectively differentiate between classes while minimizing errors. The Brier score of 0.06 confirms that the predicted probabilities are highly reliable, and the MCC of 0.83 indicates strong and balanced performance, even in scenarios with potential class imbalances. These results position random forest as the best choice among the evaluated models. This model was found to be able to reliably identify cases of heart failure.
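For reference, these metrics can be computed from a fitted model’s test-set predictions as sketched below, here using the illustrative random forest sketch from Section 3.3.3.

```python
from sklearn.metrics import (brier_score_loss, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:, 1]

# Specificity = true negatives / (true negatives + false positives)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
print("AUC:", roc_auc_score(y_test, y_prob))
print("Brier score:", brier_score_loss(y_test, y_prob))
print("MCC:", matthews_corrcoef(y_test, y_pred))
```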
In the comparison of studies on machine learning models (Table 9), our approach stands out by achieving the highest precision (92%) through correct preprocessing and hyperparameter tuning. This contrasts with other studies that did not handle outliers and obtained lower precision values, such as study [18] with 85%, study [13] with 83%, study [27] with 76%, and study [28] with 73%. Our result suggests that data preprocessing, specifically outlier management, can significantly improve model performance. Study [26], which also managed outliers, achieved a high precision of 89.58%, and study [19] achieved a precision of 91.70%, supporting this observation.
A major finding of our study is that, despite using only 12 features for the model, our accuracy was higher (92%) compared to other studies that used 14 features. For example, study [19] achieved an accuracy of 91.70% with 14 features, while study [13] recorded 83% using the same number of features. This shows that adequate feature selection and effective preprocessing, such as outlier treatment, can significantly enhance model performance even with fewer features.
The McNemar test significance matrix provides a rigorous statistical basis for comparing the effectiveness of the models. The dominance of random forest (p < 0.001 versus key models) underscores its ensemble advantage, which mitigates overfitting by aggregating diverse decision trees, thus capturing complex feature interactions more effectively. In contrast, the poor performance of KNN (p < 0.01 in all comparisons) highlights its limitations in clinical datasets, where noise and class imbalance degrade its distance-based logic. The lack of significant differences between logistic regression, decision tree, and MLP (p ≥ 0.05) may reflect their shared sensitivity to shallow linear or nonlinear decision boundaries in this dataset.
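For reference, a pairwise McNemar comparison between two fitted classifiers can be computed as in the sketch below, using statsmodels; the model objects refer to the illustrative sketches above, not the exact models reported in the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Indicator vectors of correct predictions on the shared test set
correct_rf = rf_model.predict(X_test) == np.asarray(y_test)
correct_knn = knn_model.predict(X_test) == np.asarray(y_test)

# 2x2 contingency table of agreement/disagreement between the two classifiers
table = [[np.sum(correct_rf & correct_knn), np.sum(correct_rf & ~correct_knn)],
         [np.sum(~correct_rf & correct_knn), np.sum(~correct_rf & ~correct_knn)]]

result = mcnemar(table, exact=False, correction=True)
print("McNemar statistic:", result.statistic, "p-value:", result.pvalue)
```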
The algorithm can be integrated with devices that provide static data or with electronic health records (EHRs), digital systems that store patients’ medical history, laboratory results, and diagnoses. This assists healthcare professionals in making informed decisions and optimizing the treatment of cardiovascular or chronic patients. When linked to EHRs, the algorithm processes medical history and laboratory results to identify high-risk patients. In the emergency department (ED), models with high sensitivity (MLP at 94%) prioritize critical cases, while accurate algorithms (DT at 93%) reduce false positives and avoid unnecessary testing. In addition, continuous monitoring would detect patterns of early deterioration, triggering alerts for timely interventions and optimizing the allocation of hospital resources.
This study demonstrates promising results; however, certain limitations must be recognized. First, although the dataset provides a solid basis for analysis, its extension with additional variables, such as left ventricular ejection fraction (LVEF) or biomarkers such as NT-proBNP, could improve clinical relevance and diagnostic granularity. Second, although the sample size is sufficient for robust model training, incorporating data from diverse populations and comorbid conditions would improve generalizability across demographics.
Importantly, the models were validated on a preprocessed static dataset and have not yet been tested in real-world clinical workflows. Factors such as variability in real-time data collection or evolving patient conditions could affect performance.