Article

Heart Failure Prediction Based on Bootstrap Sampling and Weighted Fusion LightGBM Model

by Yuanni Wang 1,2,* and Hong Cao 1
1 School of Computer Science, China University of Geosciences, Wuhan 430078, China
2 Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430078, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4360; https://doi.org/10.3390/app15084360
Submission received: 21 February 2025 / Revised: 30 March 2025 / Accepted: 8 April 2025 / Published: 15 April 2025

Abstract:
Heart disease is a serious threat to human health, and accurate prediction is very important for disease prevention and treatment. The purpose of this study is to establish a more suitable heart disease prediction model. Based on LightGBM, we deeply integrate bootstrap sampling and weighting techniques: for multiple parameter combinations of LightGBM, we repeatedly perform bootstrap sampling on the original training set, obtaining sub-models trained on different training subsets and mining richer data features in the process. During cross-validation, the weight coefficient of each sub-model is determined by comprehensively evaluating several key performance indicators, including accuracy, precision, recall, and the F1 score, which effectively highlights the contribution of high-quality sub-models. In the test stage, each sub-model is weighted according to the weight of its parameter combination, and an accurate prediction result is finally obtained. Compared with traditional prediction models, the proposed model shows better overall performance in terms of accuracy, precision, recall, and F1 score, performs better in the paired t-test, and exhibits markedly less overfitting than the baseline model. Although the model has not yet been validated on external data sets, it improves predictive ability, generality, and stability, and provides a feasible scheme for heart disease prediction that is expected to support clinical auxiliary diagnosis and disease management. The results show that the model has clear advantages in heart disease prediction and can effectively enhance the accuracy and reliability of prediction.

1. Introduction

Heart failure, a serious cardiovascular disease, is increasing in incidence and prevalence worldwide. Because of population aging and the prevalence of cardiovascular risk factors such as hypertension, coronary heart disease, and diabetes, the number of potential patients is huge [1]. The condition develops insidiously, and the early symptoms are mild, so it is easily ignored or misdiagnosed. Once the symptoms become obvious, treatment becomes dramatically more difficult, patients' quality of life deteriorates severely, and mortality is high. This undoubtedly places a heavy burden on the medical system and society [2,3,4]. Against this background, the importance of research on heart failure prediction is self-evident. Accurate prediction helps medical workers identify high-risk groups in advance. Intervention before the disease worsens, such as optimizing lifestyle, strictly controlling risk factors, and precise drug treatment, can effectively delay or stop the disease process and reduce the hospitalization rate and mortality of patients [5].
At present, many problems in the treatment of heart failure remain to be solved. Because the symptoms of early heart failure are concealed and effective early diagnosis methods and accurate prediction models are lacking, it is difficult to accurately identify high-risk individuals in the initial stage of the disease. As a result, patients often miss the best opportunity for prevention and early intervention and can only receive passive treatment after the illness worsens, with limited treatment effect and high medical expenses. In addition, some existing diagnostic methods suffer from insufficient accuracy, invasiveness, complex operation, and high cost, so they cannot meet the needs of large-scale population screening and early warning. Traditional diagnostic methods often rely on doctors' experience and a limited set of detection indicators, which are subjective and carry the risk of missed diagnoses. At the same time, although heart disease-related data are accumulating, the sample size is still insufficient; common data sets contain only a few hundred to about a thousand samples, so it is challenging to construct an accurate prediction model. The rise of machine learning provides a new approach for this research field [6]. It demonstrates robust processing capability and can discern underlying laws from complicated and limited heart disease data. However, existing models still have limitations in accuracy and generalization. Building on the advantages of machine learning, this study deeply analyzes the factors related to heart disease and constructs a more accurate and reliable prediction model. It overcomes the limitations of traditional methods and the problem of small sample sizes, and provides strong support for the early diagnosis of and intervention in heart disease. Unlike previous machine learning models, this study innovatively adopts bootstrap sampling and weighted fusion. Through bootstrap sampling, multiple different sub-sample sets are generated from the original training data, which increases data diversity, enables the model to learn a wider range of feature patterns, and avoids excessive dependence on a specific data distribution, thereby improving the model's adaptability to different data sets and its generalization ability.
Research on heart failure prediction is now developing vigorously along multiple dimensions. In risk factor assessment [7], traditional risk factor models build on factors such as age, disease history, and living habits, and lifetime risk prediction models can effectively identify high-risk individuals for early intervention. In biomarker detection, besides the commonly used brain natriuretic peptide markers [8], new markers such as neuropeptide Y [9] have been found to predict the risk of death in patients with stable heart failure. In imaging, cardiac MRI [10] can evaluate intracardiac pressure and predict heart failure, while echocardiography [11] can help detect early asymptomatic heart failure by measuring structural and functional parameters of the heart. Gene detection and genetic research focus on the relationship of gene polymorphism and hereditary cardiomyopathy with heart failure, providing a basis for predicting risk in high-risk families [12]. Machine learning and big data analysis use algorithms to mine multi-source data, extract features, and establish optimized prediction models that outperform traditional models. These results lay a solid foundation for the accurate prediction and effective prevention and treatment of heart failure [13,14].
Although there has been considerable research on heart failure, obvious shortcomings remain. In terms of prediction, most studies focus on the influence of a single factor or a few factors on heart failure and lack in-depth analysis and integration of multi-dimensional, comprehensive factors into the prediction model. For example, some studies pay attention only to cardiac structural indexes or blood biochemical indexes and fail to comprehensively consider patients' clinical symptoms, lifestyle, family history, gene polymorphism, and other factors. At the same time, the generality and accuracy of existing forecasting models need to be further improved [15,16]. Some models perform well in a specific population or medical environment but are difficult to apply widely to people of different regions, races, and ages, and their ability to track dynamically changing conditions and support long-term follow-up is weak.
Given the obvious limitations of commonly used heart disease data samples, traditional single-model methods can easily fall into overfitting when processing such small samples, either because the complexity of the model does not match the richness of the data characteristics or because there are not enough diversified data for training, which leads to a significant decline in prediction accuracy. Commonly used machine learning models, such as LightGBM (light gradient boosting machine), SVM (support vector machine), KNN (K-nearest neighbors), decision trees, GBDT (gradient boosting decision trees), and XGBoost (eXtreme Gradient Boosting), struggle to cope with the many challenges brought by insufficient data. The purpose of this paper is to develop a new, accurate heart failure prediction method based on bootstrap sampling and a weighted fusion LightGBM model, aiming to solve the problems of traditional prediction methods in accuracy, stability, and the handling of unbalanced data, and to improve the efficiency of early diagnosis and risk assessment of heart failure, thus providing a more reliable basis for clinical decision making and improving patient prognosis management. Through bootstrap sampling, a number of different sub-sample sets are derived from the original training data, which enriches data diversity, allows the model to learn a wider range of feature patterns, prevents over-reliance on specific data distributions, and thereby improves adaptability to different data sets and generalization ability. Weighted fusion synthesizes the prediction results of multiple models trained on different bootstrap samples according to certain weights. Each model learns unique information on a different subset of the data; through reasonable weighted fusion, this scattered effective information can be fully integrated and the deviation of a single model reduced, making the final prediction more robust and accurate. Compared with existing models, our method is expected to significantly improve accuracy and generalization ability and to provide more powerful support for the accurate prediction and personalized treatment of heart failure, thus making new breakthroughs in this research field.
Among the many sampling and ensemble learning methods, bootstrap sampling offers clear advantages over bagging and boosting for our purposes. Although bagging also constructs several sub-models based on sampling, it places more emphasis on reducing model variance and does not exploit data diversity as directly and fully as our bootstrap sampling scheme. Boosting trains several weak learners in series, with the emphasis on gradually correcting the mistakes of the previous learner; its data enhancement and model diversity differ from bootstrap sampling, and in some cases it may not mine potential information in the data as effectively. Bootstrap sampling draws samples from the original data set and generates a series of different sub-data sets. This not only enhances the diversity of the data but also endows the models trained on these sub-data sets with varying characteristics and behaviors, which effectively improves the generalization ability of the ensemble. In addition, the bootstrap sampling method is relatively simple and straightforward, and its computational efficiency is higher: it does not need repeated iterative training like boosting, but only trains on different sub-data sets independently and then fuses the results. This is very beneficial for large-scale data sets or limited computing resources, as many models with certain differences can be obtained in a short time. Bagging is also computationally efficient, but when dealing with complex models, the large number of sub-models may increase overall model complexity. Bootstrap sampling can control model complexity to a certain extent and avoid overfitting. If there is noise or missing data in the original data set, bootstrap sampling enables the model to learn different noise patterns or missing-value interpolation behaviors through repeated sampling, thus enhancing robustness, whereas bagging and boosting may require additional preprocessing steps or special designs to handle these problems. Finally, to explore the various potential relationships and feature combinations in the data more comprehensively, bootstrap sampling achieves this goal better through different sub-sample sets, because it allows each sub-model to learn the data from a different angle, which is more in line with the needs of data mining and model generalization in this study.
Compared with traditional single-model prediction methods, the LightGBM model based on bootstrap sampling and weighted fusion significantly improves the accuracy of heart failure prediction. Bootstrap sampling effectively balances the distribution of different classes of samples during training and avoids prediction bias against the minority class of heart failure patients. In terms of recall, the recall rate for patients with heart failure reaches 88.8%, 3 percentage points higher than that of the model without this treatment, indicating that the model identifies actual heart failure patients more effectively and reduces the possibility of missed diagnoses. The model also has good stability and explanatory ability: thanks to the weighted fusion of multiple base models, the prediction results are less affected by fluctuations of a single data point or a single model, and the prediction performance remains stable under different data set divisions and experimental conditions.
The paper is organized as follows. In Section 2, the related work is discussed, including the issues encountered when using other algorithms. The improved algorithm is then introduced in detail in Section 3. The experimental settings are described in Section 4. Section 5 provides the experimental results and analysis. We conclude with the contributions of our work in Section 6.

2. Related Work

The research status of heart failure prediction can be summarized as follows: risk factor evaluation, including the assessment of traditional and new factors; biomarker detection, including the improvement of traditional biomarkers and the exploration of new ones; the application of imaging technology, including cardiac MRI and echocardiography; genetic research, including gene polymorphism and genes related to hereditary cardiomyopathy; and machine learning, such as data mining, model construction, and big data analysis.
At present, research on risk factor evaluation models is increasingly in-depth and comprehensive, covering not only traditional risk factor models but also lifetime risk prediction models and disease-specific correlation models. The traditional risk factor model, which includes common cardiovascular factors such as age, hypertension, diabetes, and coronary heart disease, can preliminarily assess an individual's risk of developing heart failure. The lifetime risk prediction model further extends the prediction horizon; the first long-term risk prediction model of heart failure was proposed in [17]. Based on current levels of risk factors such as body mass index, blood pressure, cholesterol, diabetes, and smoking, this model can predict the risk of heart failure over the next 30 years and help doctors take preventive measures for high-risk groups, especially young people. In [18], the researchers developed the first race- and gender-specific risk prediction models for heart failure with preserved ejection fraction (HFpEF) and heart failure with reduced ejection fraction (HFrEF). The HFpEF risk prediction model included age, diabetes, BMI, COPD, previous MI, antihypertensive treatment, SBP, smoking status, atrial fibrillation, and estimated glomerular filtration rate (eGFR), while the HFrEF model additionally included previous CAD. The risk factors included in the models showed good discrimination and calibration. In [19], the researchers provided clinicians with the one-year death risk of hospitalized patients with heart failure through a heart failure risk model to evaluate whether this risk prediction could help clinicians improve treatment decisions and thus reduce patients' readmission or mortality rates. Furthermore, the use of a heart failure risk model is one of the promising methods to identify at-risk patients with decompensated heart failure (HF) during the vulnerable period (VP). In [20], the researchers argue that a structured approach combining remote monitoring and risk stratification tools is most effective in identifying at-risk patients with decompensated heart failure during the vulnerable period.
The detection of biomarkers has always been an important means of predicting heart failure. Some studies have developed a biomarker-based predictive model for patients with stable coronary heart disease by evaluating and comparing the prognostic value of biomarkers and clinical variables [21]. Traditional plasma brain natriuretic peptide (BNP) and its precursor N-terminal pro-BNP are widely used in clinical diagnosis and severity evaluation of heart failure [22,23], but there is a problem of insufficient specificity. Moreover, in recent years, a series of novel biomarkers, including the promising neuropeptide Y, have been discovered, showing great potential in the field of heart failure prediction. These new biomarkers can be used as novel markers to predict the death risk of patients with stable heart failure [9]. In addition, the ratio of neutrophils to lymphocytes [24], small molecular RNA, and other markers [25] have also been found to be closely related to the occurrence, development, and prognosis of heart failure, providing more evidence for the early diagnosis, risk stratification, and prognosis evaluation of heart failure.
Imaging technology plays an important role in heart failure prediction and is constantly developing. Cardiac magnetic resonance imaging can clearly show the structure of the heart and accurately evaluate intracardiac pressure, thereby predicting whether a patient will develop heart failure. Various cardiac imaging tests are used to help manage patients with heart failure (HF); ref. [26] reviewed current and future HF applications of the major noninvasive imaging modalities: transthoracic echocardiography (TTE), single-photon emission computed tomography (SPECT), positron emission tomography (PET), cardiovascular magnetic resonance (CMR), and computed tomography (CT). Some researchers use MRI images to predict cardiac arrest [27]. At the same time, researchers use echocardiography [28] to help detect early asymptomatic heart failure by measuring structural and functional parameters of the heart, such as the left ventricular ejection fraction and ventricular wall thickness; this method is also valuable for evaluating the severity and prognosis of heart failure.
Gene detection and genetic research provide a new angle and direction for the prediction of heart failure. Genetic testing is an established component of care in contemporary cardiology practice [29]. On the one hand, research focuses on the relationship between gene polymorphism and heart failure, and most genetic association studies focus on the ACE I/D polymorphism [30]. On the other hand, the exploration of genes related to hereditary cardiomyopathy is also deepening [31,32,33], which helps identify high-risk groups with genetic predisposition and provides a basis for the early prevention of and intervention in heart failure.
With the development of big data and computer technology, machine learning and big data analysis are widely used in the field of heart failure prediction [34,35,36,37]. By mining and analyzing large amounts of clinical data, biomarker data, and image data, high-risk populations for heart failure can be identified more accurately, and the accuracy and practicability of prediction models can be significantly enhanced. For example, heart failure prediction models based on logistic regression, decision trees, neural networks, random forests, XGBoost, and LightGBM [38,39,40] provide strong support for the accurate prediction and individualized treatment of heart failure.

3. Mathematical Model

Based on the LightGBM model, this paper establishes a heart failure prediction framework built on bootstrap sampling and weighted fusion of LightGBM sub-models. After preprocessing, feature selection, bootstrap sampling, sub-model prediction based on LightGBM classification, and weighted fusion, the final prediction result is obtained. Data preprocessing includes two aspects: feature discretization and feature encoding. Feature discretization partitions continuous features into reasonable intervals to better meet the processing requirements of subsequent models, while feature encoding applies standardized coding conversions to different types of features so that the data are in a form the model can analyze and operate on. The feature selection part focuses on correlation analysis and feature combination exploration. Correlation analysis reveals the degree of correlation between each feature and the target variable; it filters out features that are weakly correlated and contribute least to the prediction results, thereby reducing the data dimension and eliminating redundant information. Feature combination exploration attempts to combine existing features from different angles, with the goal of finding more expressive and predictive feature combinations, further optimizing the features fed into the model, and improving prediction accuracy and efficiency. Bootstrap-sampled sub-model prediction based on LightGBM classification is one of the core steps and involves several key operations. First, the data set is divided into a training set and a validation set to ensure sound and effective model training and validation. Then, the model is trained through bootstrap sampling, and its performance under different parameter combinations is comprehensively explored by means of cross-validation and parameter combination traversal. After each sub-model is trained, the prediction results of multiple sub-models are integrated through voting. For each parameter combination, the corresponding evaluation indexes are calculated, which enables the model parameters to be determined and fine-tuned according to these indexes so that the model achieves better prediction performance. For the test set, the weighted fusion method is adopted: the weights of the sub-models under each parameter combination are determined according to the evaluation indexes obtained previously, these weights are used to carry out weighted fusion prediction on the test set, and finally a comprehensive and accurate prediction result is obtained, completing the whole prediction process. The overall framework is shown in Figure 1.

3.1. Classification Model Based on LightGBM

A heart failure prediction model is built around LightGBM. LightGBM is a fast and efficient gradient boosting framework suitable for processing large-scale data sets and various types of machine learning tasks. It is used here to model the characteristic data related to heart disease and predict the disease status.
In order to obtain better model performance, several key parameters of the LightGBM model are traversed and optimized. Different parameter ranges are defined, including the learning rate, number of leaf nodes, minimum child-node weight, bagging fraction, and number of bootstrap sampling iterations. Through training and evaluation under different combinations of these parameters, we try to find the parameter setting that yields the best model performance.
In this paper, 5-fold (KFold) cross-validation is used to evaluate the performance stability of the model on different data subsets. In each cross-validation fold, the training set is further divided into a smaller training set and a validation set. By training the model on these smaller training sets and making and evaluating predictions on the validation sets, we obtain the performance indexes of the model under diverse data partitions. In this way, the generalization ability and stability of the model can be comprehensively evaluated.
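For concreteness, the following is a minimal sketch of this 5-fold evaluation loop for a single LightGBM configuration. Here, X and y stand for the preprocessed feature matrix and labels (assumed to be NumPy arrays), and the parameter values and early stopping callback are illustrative rather than the paper's exact settings.

```python
# Minimal sketch: 5-fold cross-validation of one LightGBM configuration.
# Assumptions: X, y are preprocessed NumPy arrays; parameter values are illustrative.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, f1_score

params = {"objective": "binary", "learning_rate": 0.05,
          "num_leaves": 31, "verbose": -1}

kf = KFold(n_splits=5, shuffle=True, random_state=2)
fold_acc, fold_f1 = [], []
for train_idx, val_idx in kf.split(X):
    train_set = lgb.Dataset(X[train_idx], label=y[train_idx])
    val_set = lgb.Dataset(X[val_idx], label=y[val_idx], reference=train_set)
    # Train with early stopping monitored on the validation fold.
    booster = lgb.train(params, train_set, num_boost_round=500,
                        valid_sets=[val_set],
                        callbacks=[lgb.early_stopping(200, verbose=False)])
    pred = (booster.predict(X[val_idx]) >= 0.5).astype(int)
    fold_acc.append(accuracy_score(y[val_idx], pred))
    fold_f1.append(f1_score(y[val_idx], pred))

print("mean accuracy:", np.mean(fold_acc), "mean F1:", np.mean(fold_f1))
```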

3.2. Application Strategy of Bootstrap Sampling in Model

Bootstrap sampling is a random sampling method with replacement. In model training, bootstrap sampling is mainly used to increase the diversity of the training data and evaluate the stability of the model. The number of bootstrap sampling iterations directly determines the number of different training subsets generated, which has an important influence on the stability and accuracy of the final fused prediction results. Although more bootstrap sampling iterations may enhance the stability of the fused predictions, they also increase the computational cost and prolong the training time. In this study, we start with a small value such as 3 and then gradually increase it, trying 5, 7, 10, and so on, while closely observing the change in model performance on the validation set and the growth of the training time. We observe that as the number of bootstrap sampling iterations increases, the performance of the model on the validation set gradually stabilizes while the training time is noticeably prolonged. On this basis, we determine the optimal number of bootstrap sampling iterations. In this way, while ensuring the accuracy and stability of the results, the computing resources and time cost can be managed reasonably, achieving an optimal balance between them.
At each bootstrap sampling iteration, a number of sample indices equal to the size of the current training subset is randomly drawn from it with replacement, thus obtaining a new training data subset and its corresponding label subset. Because the sampling is performed with replacement, the training data from each iteration typically contain duplicate samples, and each sample in the original training set has a well-defined probability of being selected or omitted in each draw. Applying this probabilistic selection mechanism repeatedly over several sampling rounds produces different data distributions, which can have a great impact on model training and generalization performance.
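As a quick check on this selection mechanism (a standard property of sampling with replacement, not a result specific to this data set), the probability that a given sample is never drawn in n draws is

$$P(\text{sample } i \text{ not drawn}) = \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368,$$

so each bootstrap subset contains, on average, roughly 63.2% of the distinct original samples, with the remainder made up of duplicates.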
In the cycle of parameter optimization, for each group of parameter settings, multiple LightGBM models based on different sampled data can be trained through repeated bootstrap sampling. This helps the model better capture various patterns and feature relationships in the data, reduces the bias caused by a single training data distribution, and allows the performance of the model to be evaluated under different data distributions, providing richer and more reliable information for subsequent model fusion.
The specific implementation steps are as follows:
1. Determine the number of bootstrap sampling iterations.
2. Conduct bootstrap sampling on the training set. Bootstrap sampling is carried out on the original data set many times to obtain multiple training subsets, and the validation set is split off at the same time. In this process, the proportion reserved for the validation set is important, since it affects how sensitively the validation set reflects model performance and how effective the early stopping mechanism is. Generally, a value in the range of 0.1 to 0.3 can be tried, such as 0.15, 0.2, or 0.25. Too small a proportion may leave the validation set with too little data to accurately reflect model performance, which undermines the early stopping mechanism; conversely, too large a proportion reduces the size of the training set, which may harm model training. It is therefore necessary to try different split ratios and carefully observe the model's performance on the validation set and the triggering of the early stopping mechanism in order to find an appropriate value.
3. Further divide the training set after bootstrap sampling. To train the model better, each bootstrap-sampled training set and its labels are divided into a new training set and validation set (with the corresponding labels) at a proportion of 20%, which are used for the subsequent training and validation of the LightGBM sub-model.
4. Train the model on the bootstrap-sampled data and make predictions.
5. Synthesize the prediction results of the bootstrap-sampled sub-models by voting.
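The sketch below illustrates steps 2 to 5 for one cross-validation fold. It assumes X_tr, y_tr and X_val are NumPy arrays holding that fold's training and validation data; the helper name bootstrap_vote and its default values are hypothetical, not taken from the paper's code.

```python
# Illustrative sketch of steps 2-5 above for one cross-validation fold.
# Assumptions: X_tr, y_tr, X_val are NumPy arrays; names and defaults are hypothetical.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

def bootstrap_vote(X_tr, y_tr, X_val, params, num_bootstrap_samples=5, seed=2):
    rng = np.random.default_rng(seed)
    votes = []
    n = len(X_tr)
    for _ in range(num_bootstrap_samples):
        # Step 2: draw n indices with replacement (duplicates are expected).
        idx = rng.integers(0, n, size=n)
        X_boot, y_boot = X_tr[idx], y_tr[idx]
        # Step 3: further split the bootstrap sample, reserving 20% for early stopping.
        X_sub, X_es, y_sub, y_es = train_test_split(
            X_boot, y_boot, test_size=0.2, random_state=seed)
        train_set = lgb.Dataset(X_sub, label=y_sub)
        es_set = lgb.Dataset(X_es, label=y_es, reference=train_set)
        # Step 4: train one sub-model and predict on the validation fold.
        booster = lgb.train(params, train_set, num_boost_round=500,
                            valid_sets=[es_set],
                            callbacks=[lgb.early_stopping(200, verbose=False)])
        votes.append((booster.predict(X_val) >= 0.5).astype(int))
    # Step 5: majority vote across the bootstrap sub-models.
    return (np.mean(votes, axis=0) >= 0.5).astype(int)
```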

3.3. Model Fusion Scheme

This model constructs a new fusion system that seamlessly integrates bootstrap sampling and weighting ideas into the LightGBM model to improve the effectiveness of heart disease prediction. From the training set, multiple rounds of bootstrap sampling are carried out under different parameter configurations, and several LightGBM sub-models with differentiated data perspectives are constructed. These sub-models thoroughly explore the data characteristics during training. Then, using a strict evaluation index system, the performance of each sub-model under different data distributions is precisely measured, and reasonable fusion weights are allocated accordingly. The goal is to overcome the inherent limitations of traditional single-model methods and simple ensemble models, to enhance the model's ability to understand and predict heart disease data, and at the same time to strengthen its generalization and stability. Through these efforts, this paper aims to develop an optimized fusion model that achieves excellent performance in heart disease prediction tasks in practical application scenarios. The specific scheme is as follows:
(1)
Synthesis of forecast results based on bootstrap sampling and voting
In the cross-validation stage, for each parameter combination, bootstrap sampling is carried out many times. After each bootstrap sampling, a small training set and a validation set are further divided to train the LightGBM model, and multiple prediction results are obtained on the corresponding validation set. The prediction results of these bootstrap samples are then synthesized by voting. In this way, the comprehensive prediction results under each parameter combination in cross-validation are obtained, and the corresponding evaluation indicators, such as the accuracy, precision, recall, and F1 value, are calculated.
(2)
Model fusion based on evaluation index weight
The weight is calculated from the F1 value, accuracy, and recall rate of each parameter combination. Specifically, weight coefficients are set for the different evaluation indexes, and the weight is then calculated from the average evaluation index values of that parameter combination in cross-validation. Finally, this weight is used to weight each model's prediction results on the test set.
(3)
Weighted average fusion to obtain the prediction result
When predicting on the test set, for each parameter combination, the LightGBM model is trained again according to the previous process and its prediction results on the test set are obtained. The prediction results of all parameter combinations are then weighted and averaged according to their corresponding weights to obtain the final fused prediction. Finally, the fused prediction results are converted into binary classification results using a threshold of 0.5, and the evaluation indexes of the fused model, such as the accuracy, precision, recall, and F1 value, are calculated to evaluate its performance.
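A minimal sketch of this weighted fusion step is given below. The names cv_metrics and test_probs, as well as the 0.5/0.3/0.2 metric coefficients, are hypothetical placeholders for the paper's actual evaluation records and weight coefficients.

```python
# Sketch of the weighted-fusion step. Assumptions: cv_metrics maps each parameter
# combination to its mean CV accuracy, recall, and F1; test_probs maps it to that
# combination's predicted probabilities on the test set; coefficients are illustrative.
import numpy as np

def weighted_fusion(cv_metrics, test_probs, coef=(0.5, 0.3, 0.2)):
    scores = {}
    for combo, m in cv_metrics.items():
        # Weight each sub-model by a combination of its evaluation metrics.
        scores[combo] = coef[0] * m["f1"] + coef[1] * m["accuracy"] + coef[2] * m["recall"]
    total = sum(scores.values())
    weights = {combo: s / total for combo, s in scores.items()}
    # Weighted average of the per-combination test-set probabilities, thresholded at 0.5.
    fused = sum(weights[c] * np.asarray(test_probs[c]) for c in weights)
    return (fused >= 0.5).astype(int), weights
```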

3.4. Parameter Optimization of Model

The important parameters of the model include the learning rate, number of leaf nodes, minimum leaf-node sample weight, data sampling ratio, and number of bootstrap sampling iterations. KFold is used for 5-fold cross-validation. In each cross-validation fold, after the training set and validation set are divided, bootstrap sampling is executed for each parameter combination. Within each bootstrap sampling operation, a validation set is demarcated, LightGBM data-set objects corresponding to these subsets are instantiated, and the parameter configuration is specified for model training. Predictions are made on the validation set and synthesized by voting. The evaluation indexes, such as the accuracy, precision, recall, and F1 value, are calculated, and the results of each cross-validation fold are stored. Finally, the average evaluation indexes under the different parameter combinations are calculated, all results are compared, and the best parameter combination is selected.

3.4.1. The Optimization Thought

The parameter optimization strategy adopted in this paper is to explore the hyperparameter space and evaluate performance based on cross-validation. A multi-dimensional hyperparameter space is formed by defining value ranges for multiple hyperparameters, such as the learning rate, number of leaf nodes, minimum leaf-node sample weight, data sampling ratio, and number of bootstrap sampling iterations. A multi-layer nested loop structure is used to iterate over all possible combinations within the hyperparameter space. The purpose is to examine the influence of different hyperparameter values on model performance without leaving any potentially excellent combination unchecked, so as to determine the best parameter configuration for the given data set and task.
Five-fold KFold cross-validation is used to evaluate the performance of the model under each hyperparameter combination. The idea is that, by dividing the data set into several subsets and training and validating the model on these different subsets, the performance of the model under different data distributions can be understood more comprehensively and objectively. Doing so helps prevent overfitting to the evaluation data and thus produces more reliable performance estimates. In each cross-validation split, the complete cycle of model training, prediction, and evaluation index calculation is performed for each hyperparameter combination. The merits of each hyperparameter combination are then determined by comprehensively analyzing the results across the folds.

3.4.2. Parameter Optimization Strategy

In this study, we use a grid search strategy to tune the hyperparameters and find the optimal hyperparameter configuration of the model. In the grid search, every possible hyperparameter combination is traversed in turn, according to the value range defined for each hyperparameter, through a multi-layer nested loop. For example, there are several candidate values for the learning rate, several for the number of leaf nodes, and so on for the other hyperparameters. In this way, all possible parameter combinations are traversed and the entire hyperparameter space is searched systematically. Although this method may be time-consuming when the number of hyperparameters is large and the ranges of values are wide, it guarantees completeness and does not miss any possible optimal combination. The specific optimization strategies are as follows:
(1)
Determination and range setting of hyperparameters.
Firstly, the key hyperparameters that need to be optimized are identified, and a reasonable range of values is defined for each. The specific hyperparameters are as follows.
Learning rate: controls the step size of each iteration and has an important influence on the convergence speed and performance of the model. Its range is set to [0.1, 0.05, 0.01].
Number of leaf nodes (num_leaves): determines the complexity of each decision tree and affects the model's ability to fit the data. Its range is set to [31, 63].
Minimum child weight (min_child_weight): limits the minimum sample weight of child nodes and prevents overfitting. Its range is [1 × 10^−3, 1 × 10^−2].
Bagging fraction (bagging_fraction): the proportion of samples randomly selected during training. Its range is [1, 0.8].
Number of bootstrap samples (num_bootstrap_samples): the number of times the bootstrap sampling operation is repeated. Its range is [3, 5, 7].
(2)
Implementation process of grid search
Using the grid search strategy and a multi-layer nested loop structure, each possible hyperparameter combination is traversed in turn, strictly according to the value range defined for each hyperparameter. Specifically, the outermost loop takes values from [3, 5, 7] for the number of bootstrap samples (num_bootstrap_samples); the next loop takes values from [0.1, 0.05, 0.01] for the learning rate; the next takes values from [31, 63] for the number of leaf nodes (num_leaves); the next takes values from [1 × 10^−3, 1 × 10^−2] for the minimum child weight (min_child_weight); and the innermost loop takes values from [1, 0.8] for the bagging fraction. Through this loop structure, all hyperparameter combinations are traversed and the whole hyperparameter space is searched systematically.
(3)
Training and evaluation of the model
For each hyperparameter combination generated, perform the following.
Cross-validation setting: The training data set (X_train) is divided into five folds by 5-fold cross-validation (KFold, n_splits = 5, shuffle = True, random_state = 2).
Bootstrap sampling and model training: Bootstrap sampling is carried out within each cross-validation fold; that is, samples are randomly drawn with replacement from the current training subset (X_train_fold) to construct bootstrap data sets (X_train_bootstrap and y_train_bootstrap). The LightGBM model parameters are set according to the current hyperparameter combination, the model is trained on the bootstrap training set and monitored on the validation set, and an early stopping mechanism (early_stopping_rounds = 200) is used to prevent overfitting.
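Putting the pieces together, the following sketch shows one way the grid search described above could be organized. It assumes X_train and y_train are NumPy arrays and reuses the hypothetical bootstrap_vote helper sketched in Section 3.2; the selection criterion (mean F1) and the bagging_freq setting are illustrative choices, not the paper's exact ones.

```python
# Sketch of the grid search over the ranges listed in step (1).
# Assumptions: X_train, y_train are NumPy arrays; bootstrap_vote() is the helper
# sketched in Section 3.2; selection by mean F1 is an illustrative choice.
import itertools
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

grid = {
    "num_bootstrap_samples": [3, 5, 7],
    "learning_rate": [0.1, 0.05, 0.01],
    "num_leaves": [31, 63],
    "min_child_weight": [1e-3, 1e-2],
    "bagging_fraction": [1.0, 0.8],
}

kf = KFold(n_splits=5, shuffle=True, random_state=2)
results = {}
for values in itertools.product(*grid.values()):
    combo = dict(zip(grid.keys(), values))
    n_boot = combo.pop("num_bootstrap_samples")
    params = {"objective": "binary", "verbose": -1,
              "bagging_freq": 1,  # assumption: enables bagging_fraction to take effect
              **combo}
    fold_metrics = []
    for tr_idx, va_idx in kf.split(X_train):
        pred = bootstrap_vote(X_train[tr_idx], y_train[tr_idx],
                              X_train[va_idx], params,
                              num_bootstrap_samples=n_boot)
        y_va = y_train[va_idx]
        fold_metrics.append([accuracy_score(y_va, pred), precision_score(y_va, pred),
                             recall_score(y_va, pred), f1_score(y_va, pred)])
    # Mean accuracy, precision, recall, and F1 across the five folds.
    results[values] = np.mean(fold_metrics, axis=0)

best = max(results, key=lambda k: results[k][3])  # e.g., select by mean F1
```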
To sum up, this study uses the grid search strategy to adjust the hyperparameters. Although the computation is large and time-consuming when the number of hyperparameters and the ranges of values are large, grid search guarantees comprehensiveness and ensures that no possible optimal combination of hyperparameters is missed, which provides a solid guarantee for the optimization of model performance.

3.4.3. Parameter Optimization Process

Parameter optimization is carried out through cross-validation and parameter combination traversal. A KFold object is constructed with the number of folds set to 5 and the random state explicitly specified to ensure the repeatability of the whole process. A multi-layer nested loop is used to traverse all predefined parameter combinations. For each specific parameter combination, the following steps are performed in each cross-validation split:
(1)
Accurately obtain the corresponding training and validation set data according to the split indices.
(2)
Perform bootstrap sampling, split off the validation set, create LightGBM data set objects, set the model parameters, train the sub-model, and finally complete the prediction task on the validation set.
(3)
Calculate the accuracy, precision, recall, and F1 value on this validation set, and save these evaluation index results.
For each hyperparameter combination, after all cross-validation splits are completed, the evaluation index results are averaged to obtain the average accuracy, average precision, average recall, and average F1 value under that parameter combination.
Through the above parameter optimization, the model parameters can be adjusted in a systematic way, which improves the performance and generalization ability of the heart failure prediction model combining bootstrap sampling and LightGBM.

4. Experimental Settings

4.1. Data Set Generation

4.1.1. Data Source

This data set was created by combining different data sets that were already available independently but had not been combined before [41]. In it, five heart data sets are merged over 11 common features, making it the largest heart disease data set available so far for research purposes. The five sources comprise 303 observations from Cleveland, 294 from Hungary, 123 from Switzerland, 200 from Long Beach, VA, and 270 from the Statlog (Heart) data set, for a total of 1190 original records. There are, however, 272 duplicate records; after de-duplication, the final data set contains 918 observations, which provides rich and relatively refined data for heart disease research. Each data set used can be found under the Index of heart disease data sets in the UCI Machine Learning Repository at the following link: http://archive.ics.uci.edu/dataset/45/heart+disease (donated on 30 June 1988). The attributes involved are shown in Table 1.

4.1.2. Potential Deviation from the Data Set

The original data for this paper come from several medical institutions in different regions. Specifically, the 303 Cleveland observations come from the Cleveland Clinic Foundation in Ohio. The 294 Hungarian observations were provided by the Hungarian Institute of Cardiology; Hungary has also made notable achievements in cardiovascular research, and these data reflect the features of heart disease in Hungary, where differences in geographical environment, living habits, and genetic background may make the prevalence and influencing factors of heart disease differ from those of other areas. The 123 Swiss observations come from the university hospitals of Zurich and Basel. Long Beach, VA provided 200 observations from the Veterans Affairs Medical Center in Long Beach; these data concern the specific group of veterans, who are distinctive in life experience and health status and whose heart disease incidence may differ from that of the general population, making the data useful for studying heart disease in specific groups. Given the above, there may be regional bias in the data set. Moreover, the sample sizes differ considerably between regions: Cleveland and Hungary contribute relatively large subsets, while Switzerland contributes a relatively small one. This imbalance may lead the model to capture regional features unevenly: it is likely to be more attuned to regions with large sample sizes, while regions with small sample sizes are inadequately represented, undermining the comprehensiveness and balance of the model's ability to capture regional characteristics. There is also a demographic bias. The Long Beach Veterans Affairs data concern veterans, who differ from the general population in life experience and occupational exposure and may have unique heart disease risk factors, yet their 200 samples hold no clear numerical advantage over the other sources. When the model handles both the general population and veterans, this may lead to insufficient learning of veterans' characteristics and, consequently, biased heart disease predictions for veterans.
In addition, data collection itself may be biased: medical institutions in different regions may differ in collection methods, standards, and level of detail. Different hospitals may use different testing equipment and methods to measure heart disease-related indexes, and the specifications and detail of the recorded data can differ, causing problems of data quality and consistency. Merging these data may therefore introduce noise, which affects the model's ability to learn and understand the data features accurately. The distribution of disease types may also be biased: the disease spectrum may differ between areas, and some areas may have more cases of certain specific types of heart disease. Because of the uneven sample sizes across regions, the data set may contain an overabundance of certain types of heart disease and relatively few cases of others, leading to an imbalance in the model's ability to identify and predict different types. As a result, the model may perform better at predicting common types of heart disease but may be unable to predict rare ones accurately.

4.1.3. Initial Data Exploration

(1)
Data distribution
The target variable of the data set is HeartDisease, which indicates whether the patient has heart disease. The pie chart in Figure 2 shows the proportion of patients with heart disease in the sample data: 55.3% of the subjects are patients and 44.7% are not, so the data set is balanced and does not need rebalancing.
(2)
The influence of various indicators on the predicted value
In order to explore the relationship between these attributes and HeartDisease, a number of attributes in the data set, such as age, sex, chest pain type, cholesterol, fasting blood sugar, resting blood pressure, and maximum heart rate, were analyzed visually.
The basic patient information consists of age and gender. A kernel density estimate comparing the age distributions of people with and without heart disease (Figure 3a) shows that the proportion of middle-aged and elderly patients is high, especially around the age of 60. A bar chart of heart disease by sex (Figure 3b) shows that the probability of disease is greater for male than for female patients.
Indicators of chest pain-related symptoms include the chest pain type and exercise-induced angina. The distribution of people with and without heart disease across chest pain types, and the relationship between exercise-induced angina and heart disease, are used to explore the relationship between chest pain symptoms and heart disease. There are four chest pain types: TA (typical angina), ATA (atypical angina), NAP (non-anginal pain), and ASY (asymptomatic). The analysis in Figure 4 shows that the proportion of patients is large when the chest pain type is ASY, i.e., with no obvious chest pain, and that the probability of disease is high when exercise-induced angina is present.
Blood biochemical indexes include serum cholesterol and fasting blood sugar. Kernel density estimates compare the distributions of serum cholesterol and fasting blood glucose between individuals with and without heart disease, and the potential relationship between cholesterol level and heart disease is studied from the visualization. Figure 5 shows that the probability of disease is high when serum cholesterol is around 0 or around 250 and when fasting blood sugar equals 1.
Physiological indexes include resting blood pressure and maximum heart rate. Kernel density estimates compare the distributions of resting blood pressure (RestingBP) and maximum heart rate (MaxHR) between people with and without heart disease, and their relationship with heart disease is analyzed from the visual images. Figure 6 shows that disease is concentrated around a resting blood pressure of 150 and a maximum heart rate of around 125.
Physical examination indexes include the resting electrocardiogram, Oldpeak, and the slope of the peak exercise ST segment. We examined whether these three indexes are connected with heart disease based on the visualization results. The analysis in Figure 7 shows little difference among the three resting electrocardiogram conditions; the probability of not having the disease is high when Oldpeak is near 0, and the probability of disease is low when the slope of the peak exercise ST segment is upsloping.

4.1.4. Data Preprocessing

There are no missing values in the data, and heart failure samples account for 55% of the total, which is basically balanced, so no rebalancing is needed. Data preprocessing mainly includes two parts: feature discretization and feature encoding.
(1)
Feature discretization
Because the quantitative units of the variables in the data set differ, the scale of the Oldpeak feature differs considerably from that of the other features. To eliminate the influence of different scales, it is divided into four grades, normal, mild abnormality, moderate abnormality, and severe abnormality, according to medical knowledge. When Oldpeak is less than or equal to 0.2 mV, the ST-segment depression is at a normal or near-normal level. When Oldpeak is greater than 0.2 mV and less than or equal to 0.4 mV, it is classified as mild abnormality. When Oldpeak is greater than 0.4 mV and less than or equal to 0.6 mV, it is classified as moderate abnormality. When Oldpeak is greater than 0.6 mV, it is classified as severe abnormality.
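A minimal sketch of this discretization with pandas, assuming df is the merged data frame and Oldpeak is recorded in mV as described above (the new column name is hypothetical):

```python
# Discretize Oldpeak into the four grades described above.
# Assumptions: df is the merged DataFrame; "Oldpeak_grade" is a hypothetical column name.
import pandas as pd

bins = [-float("inf"), 0.2, 0.4, 0.6, float("inf")]   # (-inf, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, inf)
labels = ["normal", "mild", "moderate", "severe"]
df["Oldpeak_grade"] = pd.cut(df["Oldpeak"], bins=bins, labels=labels)
```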
(2)
Feature encoding
Some binary variables in the data set, such as Sex and ExerciseAngina, are transformed into numerical variables by simple mapping so that subsequent models can handle them. The purpose of this transformation is to let the model understand and process the categorical information and include it in the calculation in numerical form.
For categorical features with multiple category values, such as ChestPainType, RestingECG, and ST_Slope, there are many values that have no inherent order and cannot be ranked as better or worse. Therefore, label encoding is used to convert the ChestPainType, RestingECG, and ST_Slope features. After label encoding, all feature variables in the data set are numerical, which facilitates the processing of the input data by machine learning algorithms and enables the expansion of the sample's feature variables.
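The encoding step can be sketched as follows, assuming df is the data frame and that Sex and ExerciseAngina use the 'M'/'F' and 'Y'/'N' codes of this data set; the column names follow Table 1.

```python
# Binary columns mapped directly; multi-valued categorical columns label-encoded.
# Assumptions: df is the DataFrame; Sex uses "M"/"F" and ExerciseAngina uses "Y"/"N".
from sklearn.preprocessing import LabelEncoder

df["Sex"] = df["Sex"].map({"M": 1, "F": 0})
df["ExerciseAngina"] = df["ExerciseAngina"].map({"Y": 1, "N": 0})

for col in ["ChestPainType", "RestingECG", "ST_Slope"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```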
In the data preprocessing step, data standardization and dimension reduction are not carried out. The reason is that tree models such as LightGBM and XGBoost are insensitive to the scale of the data: when building trees, decisions are made according to feature split points rather than the absolute values of features. The decision processes of decision trees, gradient boosting decision trees, and similar models are likewise based on the sorting and comparison of features, not their absolute values. In addition, the features in the data set have different physical meanings, and standardization might destroy their original information and internal relationships. For example, heart rate and blood pressure have different orders of magnitude and physical meanings; forced standardization might obscure their distinct associations with heart disease and make it difficult for the model to capture the true patterns. The number of features is relatively small (11 features are used), so there is no obvious curse of dimensionality. The feature selection section below analyzes the basis for feature retention in detail.

4.2. Result Evaluation Metrics

In this study, multiple evaluation metrics are employed to comprehensively assess the strengths and weaknesses of the models. Instead of relying on a single metric, several common and crucial metrics, namely the accuracy, precision, recall, and F1_score, are calculated. Different metrics reflect model performance from different angles. The accuracy measures the overall proportion of correct predictions. The precision measures, among the samples predicted as positive, the proportion that are truly positive. The recall measures the proportion of actual positive cases that are correctly predicted. The F1_score is the harmonic mean of precision and recall and takes the balance between the two into account. By integrating these metrics, we can more comprehensively and precisely assess whether a specific hyperparameter combination enables the model to perform well across various aspects, which in turn helps select the parameter configuration that best matches the task requirements.
The accuracy, precision, recall, and F1_score are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\_score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$
In addition, the receiver operating characteristic (ROC) curve is usually used to show the performance of a classifier, with the true positive rate (TPR) as the ordinate and the false positive rate (FPR) as the abscissa. The closer the ROC curve is to the upper left corner, the better the classifier. If the ROC curve of one classifier completely covers that of another, the former is better. If the ROC curves overlap or cross, it is difficult to compare the classifiers directly; in that case, the areas under the ROC curves can be compared. TPR and FPR are calculated as follows:
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{TN + FP}$$
The area under the ROC curve is the AUC, and its value lies between 0 and 1. In the actual modeling process, the generalization ability of the model can be roughly judged from the range of the AUC value.
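These metrics can be computed directly with scikit-learn. In the sketch below, y_test denotes the true labels, y_pred the fused binary predictions, and y_prob the fused probabilities used for the ROC curve and AUC; the variable names are illustrative.

```python
# Computing the evaluation metrics above with scikit-learn.
# Assumptions: y_test are true labels, y_pred fused binary predictions, y_prob fused probabilities.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_prob)   # points of the ROC curve
auc = roc_auc_score(y_test, y_prob)       # area under the ROC curve
```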
Typically, when evaluating machine learning models, we employ the paired t-test to compare the performance of different models and determine whether the differences among them are statistically significant.
The formula of the paired t-test is as follows:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
where $\bar{d}$ is the mean of the differences between the paired samples, that is, $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$; $d_i$ is the difference between the $i$-th pair of observations; $s_d$ is the standard deviation of the differences, calculated as $s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$; and $n$ is the number of paired samples.
The t-value computed using this formula is compared with the critical value of the t-distribution at the corresponding degrees of freedom to ascertain whether there is a statistically significant difference between the means of the two paired samples. The p-value can also be calculated from the t-value, allowing the significance of the difference to be judged more precisely. If the p-value is less than the pre-set significance level, such as 0.05, the difference between the two groups of samples is considered significant. In a paired-sample t-test, the p-value is the probability of obtaining the observed data, or data more extreme, under the null hypothesis.
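A minimal sketch of running the paired t-test with scipy (listed in the experimental environment) is given below; the two per-fold accuracy vectors are hypothetical values used only to show the call.

```python
# Minimal sketch of a paired t-test on per-fold accuracies of two models.
from scipy import stats

model_a_acc = [0.85, 0.84, 0.86, 0.83, 0.87]   # e.g., five CV folds of model A (assumed)
model_b_acc = [0.80, 0.79, 0.83, 0.78, 0.82]   # same folds, model B (assumed)

t_stat, p_value = stats.ttest_rel(model_a_acc, model_b_acc)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")
# If p < 0.05, the performance difference is considered statistically significant.
```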

4.3. Experimental Environment

The hardware environment for the experiment is an Intel Core i7 (or higher) processor, 16 GB of memory, and a 256 GB solid-state drive (SSD). The software environment consists of the Windows 10 operating system, Anaconda 3.0, and Python 3.7. The experiment is primarily implemented in the Python programming language, leveraging several third-party libraries. Specifically, numpy and pandas are employed for data processing and analysis, seaborn and matplotlib for data visualization, scikit-learn, LightGBM, and XGBoost for model training, and scipy for statistical analysis.

5. Experimental Results and Analysis

5.1. Feature Selection

Feature selection can help to eliminate irrelevant or redundant features, thus improving the accuracy and operation efficiency of the model by reducing the number of features. When selecting features, we usually need to consider two aspects:
(1)
Whether the feature is divergent: if the values of a feature differ very little across samples, the feature has no practical effect on distinguishing data samples. Such features can generally be regarded as invalid or irrelevant and are eliminated during feature selection to improve the prediction performance of the model.
(2)
Correlation between features and targets: The correlation between features and targets is a very important factor in feature selection. When the correlation between the feature and the target is high, the prediction ability of the feature for the target is stronger.
Firstly, the correlation analysis is carried out. In this paper, the XGBoost algorithm, a prominent algorithm within the tree model family, is employed to conduct the screening of feature variables. The essence of deriving the importance scores for each feature hinges on the computation of information gain. The information gain formula is as follows:
$$IG(X, Y) = H(Y) - H(Y \mid X) = \sum_{x, y} p(x)\, p(y \mid x) \log p(y \mid x) - \sum_{y} p(y) \log p(y)$$
In the above formula, the empirical entropy H(Y) represents the uncertainty of data set Y, while the empirical conditional entropy H(Y|X) represents the uncertainty of data set Y when the feature variable X is known. The information gain indicates the degree to which the uncertainty of data set Y is reduced by the feature variable X. Since different feature variables typically yield different information gains, by computing and ranking the information gain of each feature variable, the features with the best predictive performance can be identified.
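The sketch below illustrates this screening step with XGBoost's gain-based feature importance. The file name "heart.csv", the one-hot encoding of categorical columns, and the specific estimator settings are assumptions made for illustration, not the paper's exact configuration.

```python
# Minimal sketch of ranking features by gain-based importance with XGBoost.
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("heart.csv")                         # assumed file name for the data set
X = pd.get_dummies(df.drop(columns="HeartDisease"))   # one-hot encode categorical features
y = df["HeartDisease"]

model = XGBClassifier(importance_type="gain", eval_metric="logloss")
model.fit(X, y)

# rank features by their gain-based importance scores
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```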
The correlation analysis in Figure 8 between the heart disease classification results and the feature variables reveals that the correlation between the RestingECG feature and heart disease is nearly zero when RestingECG takes the LVH (left ventricular hypertrophy) value, whereas its other values exhibit a certain degree of correlation. Therefore, it remains necessary to retain the RestingECG feature in the analysis.
Furthermore, the features were ranked by importance, and it was found that Cholesterol, MaxHR, Age, and RestingBP have a great influence on heart disease (Figure 9).
In this feature importance ranking, Cholesterol, MaxHR, Age, and RestingBP rank higher because they exert direct and crucial influences on heart health. For instance, abnormal cholesterol levels can disrupt the blood supply to the heart; an abnormal maximum heart rate may indicate underlying heart problems; aging is often accompanied by the deterioration of heart function and increased susceptibility to chronic diseases; and elevated resting blood pressure imposes additional stress on the heart, thereby increasing the risk of heart-related issues. In contrast, features such as fasting blood sugar (FastingBS) and exercise-induced angina (ExerciseAngina) rank lower, because their associations with heart disease are less direct and, within the current data set and model, their contributions to determining the presence of heart disease are relatively minor. That cholesterol, maximum heart rate, age, and resting blood pressure are the most important predictors of heart failure in this data set has inherent physiological and pathological causes. Excessive cholesterol leads to atherosclerosis, resulting in vascular stenosis and obstructed blood flow; the heart must work considerably harder to sustain an adequate blood supply, and prolonged myocardial strain gradually elevates the risk of heart failure. The maximum heart rate reflects cardiac function under extreme conditions; if it remains abnormal for an extended period, it indicates insufficient cardiac reserve or potential lesions, in which case the heart cannot effectively meet the body's stress requirements, thereby impairing its normal function. With increasing age, the structure and function of the heart naturally deteriorate, with problems such as weakening myocardial elasticity and valve calcification gradually appearing; at the same time, the elderly often suffer from various chronic diseases that further damage the heart, making age closely related to heart failure. Abnormal resting blood pressure, whether hypertension or hypotension, disturbs the normal pressure load balance of the heart: hypertension increases the ejection resistance of the heart and, over time, can lead to myocardial hypertrophy, while hypotension can result in inadequate cardiac perfusion. In the long term, both mechanisms predispose an individual to heart failure.
The paper further examines the influence of feature combinations on the prediction results, mainly from three aspects: features related to chest pain, combinations related to cardiac blood supply, and combinations related to cardiac electrophysiological activity. Table 2 shows the accuracy obtained under the different combinations.
When predicting heart failure, the overall quality of the model depends highly on the quality of the selected key feature group. Through many experiments, it was found that among the feature combinations, the combination based on chest-pain-related features performs worst in predicting heart failure. Among the combinations related to cardiac blood supply, the combination of Oldpeak, ST_Slope, RestingECG, and ExerciseAngina shows good results. The combinations related to cardiac electrophysiological activity, such as RestingECG and MaxHR combined with ExerciseAngina, and Oldpeak and ST_Slope combined with RestingECG, also perform well. However, none of these combinations matches the full feature set in overall performance. This demonstrates that the comprehensive feature set contains more crucial and extensive information for predicting heart failure, and this information plays a pivotal role in enhancing the overall performance of the model. Therefore, to ensure that the model achieves the best prediction effect, this paper retains all features in the feature selection.

5.2. Comparative Analysis of Model Effects

5.2.1. The Setting of the Experimental Sample

In this scheme, the selection and handling of the sample size mainly concern the training process of the LightGBM model, which is closely related to bootstrap sampling and cross-validation. The overall sample situation is as follows. The original data set contains 918 records, which are divided into a feature matrix X and a label vector y. The test set accounts for 30% of the data, i.e., 918 × 0.3 ≈ 275 samples, leaving about 918 − 275 = 643 samples for the training set. This partitioning ratio is a widely adopted configuration in related studies; its primary objective is to guarantee that the test set can effectively assess the model's generalization capability while supplying an adequate amount of data for the model's learning process.
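A minimal sketch of this 70/30 split is shown below. X and y reuse the names from the earlier sketch; the random seed and the use of stratification are illustrative assumptions rather than the paper's documented settings.

```python
# Minimal sketch of the 70/30 split: 918 records -> roughly 643 train / 275 test.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # roughly 643 and 275
```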
The sample number of bootstrap sampling is as follows.
In the process of optimizing the parameters of the LightGBM model, bootstrap sampling is adopted. For each bootstrap draw, samples are randomly selected with replacement from the training set (about 643 samples), and the number of samples drawn is set equal to the size of the original training set, so each bootstrap sample set also contains about 643 samples. This setting is used because the purpose of bootstrap sampling is to simulate different training data sets through repeated random sampling, thereby increasing the diversity of model training and improving the generalization ability of the model. Keeping the sample size consistent with the original training set ensures that the model encounters similar data sizes and distributions when training on different bootstrap sample sets, which enhances comparability when assessing the model's performance under different parameter combinations.
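The following sketch shows one such bootstrap draw using sklearn.utils.resample. X_train and y_train reuse the names from the split sketch above, and the random seed is an assumed value.

```python
# Minimal sketch of one bootstrap draw: sample with replacement from the
# training set, keeping the drawn set the same size as the original.
from sklearn.utils import resample

X_boot, y_boot = resample(
    X_train, y_train,
    replace=True,
    n_samples=len(X_train),   # about 643 samples, matching the original training set
    random_state=0,
)
```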
The number of samples for cross validation is as follows.
We use five-fold cross-validation to evaluate the performance of the model. In each iteration of the cross-validation process, the training set, which consists of approximately 643 samples, will be further partitioned into a training subset and a validation subset. Because it is a five-fold cross-validation, the training subset contains about 643 × (4/5) ≈ 514 samples, and the validation subset contains about 643 × (1/5) ≈ 129 samples. Through this process of multiple partitioning and validation, a more comprehensive evaluation of the model’s performance on different data subsets can be achieved. This approach helps to mitigate the biases arising from data partitioning and renders the performance assessment of the model more robust. At the same time, different parameter combinations are evaluated based on the same cross-validation method and sample size division rules, which ensures the fairness and effectiveness of model comparison.
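Putting the pieces together, the sketch below illustrates how sub-model weights might be derived from five-fold cross-validated metrics on bootstrap samples and then used in a weighted soft vote. The parameter grid, the averaging of the four metrics into a single weight, and the 0.5 decision threshold are illustrative assumptions; the paper's actual parameter combinations and weighting formula may differ. X_train, y_train, and X_test reuse the names from the split sketch.

```python
# Simplified sketch of bootstrap sampling + cross-validated weighting + weighted fusion.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
from sklearn.utils import resample

param_grid = [
    {"num_leaves": 31, "learning_rate": 0.1},    # assumed parameter combinations
    {"num_leaves": 63, "learning_rate": 0.05},
]
scoring = ["accuracy", "precision", "recall", "f1"]

models, weights = [], []
for params in param_grid:
    # one bootstrap sample per parameter combination (same size as the training set)
    X_boot, y_boot = resample(X_train, y_train, replace=True, n_samples=len(X_train))
    clf = LGBMClassifier(**params)
    cv = cross_validate(clf, X_boot, y_boot, cv=5, scoring=scoring)
    weight = np.mean([cv[f"test_{m}"].mean() for m in scoring])  # combine the four metrics
    models.append(clf.fit(X_boot, y_boot))
    weights.append(weight)

weights = np.array(weights) / np.sum(weights)   # normalize so the weights sum to one
fused_proba = sum(w * m.predict_proba(X_test)[:, 1] for w, m in zip(weights, models))
y_pred = (fused_proba >= 0.5).astype(int)       # final weighted-fusion prediction
```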

5.2.2. Selection of Basic Model

The experiments in this paper compare the proposed approach with single classification algorithms such as K-nearest neighbors (KNN), support vector machines (SVM), and decision trees, with ensemble learning algorithms such as GBDT and XGBoost, and with the basic LightGBM classifier. The experimental results are evaluated and compared by plotting the learning curve of classification accuracy on the test set.
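A minimal sketch of plotting such a learning curve for one model with scikit-learn is shown below; each compared model would be plotted in the same way. X and y reuse the names from the earlier sketch, and the training-size grid is an assumed value.

```python
# Minimal sketch of a test-accuracy learning curve for LightGBM.
import numpy as np
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, test_scores = learning_curve(
    LGBMClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="accuracy",
)
plt.plot(sizes, test_scores.mean(axis=1), "k--", label="LightGBM")  # mean CV test accuracy
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```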
Figure 10 shows how the accuracy of the different machine learning models changes as the number of training samples increases. The accuracy of all models increases with the number of training samples. The SVM and KNN models have low accuracy when the number of training samples is small; the SVM model performs worst in this range, although its accuracy also improves as the sample size grows. Overall, the performance of the SVM and KNN models is relatively inferior. In contrast, the ensemble learning models generally outperform the single models on the heart disease prediction data set. LightGBM performs best when the number of samples is less than 150. As the sample size increases, the accuracies of LightGBM, XGBoost, and GBDT are all relatively good; when the number of training samples is about 300–600, their accuracies are similar, with LightGBM slightly better.
As can be seen from the figure, the LightGBM model, that is, the black dotted line, has a relatively stable trend of improving accuracy when the number of training samples is relatively small. When the bootstrap sampling approach is employed, the LightGBM model is likely to converge to a relatively stable accuracy level more rapidly, especially when the sample size obtained through sampling is relatively small. For instance, when the number of training samples increases starting from 100, the accuracy of the LightGBM model gradually improves. Moreover, in comparison with some other models, the LightGBM model demonstrates superior performance during this stage. This indicates that the LightGBM model is capable of learning the patterns within the data more rapidly. Moreover, when the bootstrap sampling generates a relatively small sample set, it can effectively reduce the consumption of both training time and computational resources.
In the process of increasing the number of samples, the accuracy of LightGBM model is improved steadily. Compared with SVM, logistic regression, and the decision tree, when the sample size is not very large, such as in the sample size range of 300–400, the accuracy of LightGBM is already at a high level. If the bootstrap sampling can well represent the overall data distribution, the LightGBM model can use these samples to effectively construct the model, thus obtaining higher accuracy. In addition, the performance curve of LightGBM model is relatively smooth, and there is no significant fluctuation. This shows that the LightGBM model may show better stability in the case of different sample subsets generated by bootstrap sampling.
Here, the paired t-test is used to test the significance of the performance differences between the basic LightGBM model and the other models (Table 3).
In statistics, the p value is used to judge whether there is a significant difference between two samples. Generally, when the p-value is less than the significance level (commonly set at 0.05), it can be concluded that there is a significant difference in the performance between the two models. Conversely, if the p-value is greater than or equal to the significance level, it can be inferred that there is no significant difference. Based on the p-value results obtained from the comparison of the models, an analysis of the performance disparities between the models can be conducted as follows.
When compared with models such as SVM, KNN, and the decision tree, the p-values of the LightGBM model are all below 0.05. This indicates a significant performance disparity between the LightGBM model and these models, one that is unlikely to be attributable to random factors. Taken together with the metric results, the observed differences can plausibly be ascribed to the superior prediction accuracy and fitting capability of the LightGBM model.
The p-value for the comparison between LightGBM and GBDT is 0.943512, which is much higher than 0.05. This shows that there is not enough evidence of a significant performance difference between LightGBM and GBDT in the current test. Considering that the LightGBM model is an enhancement and optimization of the GBDT algorithm, it incorporates several innovative features while inheriting GBDT's merits; for instance, it employs a histogram algorithm and a leaf-wise growth strategy with depth constraints. These features give the LightGBM model distinct advantages in training speed and memory usage. Despite this, it maintains a performance level comparable to that of GBDT, thereby demonstrating the overall advantage of the LightGBM model.

5.2.3. Experiment and Comparison of Fusion Model

Then, after bootstrap sampling of the training set over many iterations, several LightGBM sub-models with differentiated data perspectives are constructed. According to the performance of each sub-model under different data distributions, a reasonable fusion weight is assigned to it. Table 4 compares the proposed fusion model with common machine learning methods in terms of performance metrics.
As can be seen from the above table, the F1 score of LightGBM with bootstrap sampling and weighted fusion has reached 0.858108, which is at a high level in all models, second only to 0.862069 of GBDT. This indicates that, when taking both accuracy and recall into comprehensive consideration, it exhibits outstanding overall classification performance. It is capable of striking a better balance between the accuracy of predicting positive cases and the coverage of positive cases, thereby achieving ideal outcomes in classification tasks.
The recall rate of LightGBM using bootstrap sampling and weighted fusion is 0.888112, which is the highest among all the compared models. This means that it has excellent ability in identifying positive samples. When compared with other models, it is less likely to overlook the actual positive samples. Instead, it can more comprehensively identify the samples that should be classified as positive. This constitutes a crucial advantage in numerous scenarios where a high level of positive sample detection is demanded. For example, in the field of disease diagnosis, the goal is to ensure that positive cases are not missed as much as possible, and this model’s ability serves this purpose effectively.
The original accuracy of LightGBM was 0.829710. After bootstrap sampling and weighted fusion, the accuracy was improved to 0.847826. This demonstrates that the fusion operation can effectively enhance the model’s overall capability of making correct predictions. As a result, the model is able to yield more accurate prediction outcomes and decrease the likelihood of making incorrect predictions during the classification of diverse samples.
When compared with the baseline LightGBM model, the accuracy of the fusion model has increased by 2.183413%, indicating that the overall classification correctness of the model has been elevated. The recall rate has increased by 4.098356%, which demonstrates that the model has significantly enhanced its capability to identify positive samples and can better avoid overlooking actual positive cases. The precision has increased by 0.696449%; although the margin is relatively modest, it still indicates that the proportion of true positives among predicted positives has risen to some degree, which in turn decreases the likelihood of misjudgment. The F1 score has increased by 2.339926%, comprehensively indicating that the model achieves a better balance between precision and recall. Compared with the baseline model, its overall performance has improved to different extents across the key indicators, and it is therefore expected to exhibit more favorable outcomes in practical application scenarios.
In general, the LightGBM model with bootstrap sampling and weighted fusion demonstrates its advantages in multiple aspects by virtue of its distinctive processing and fusion techniques. Compared with other common models, it has better performance in different degrees, especially in the recall rate.
The comparative analysis of the ROC curve is a highly reliable approach for assessing the performance of classifiers. It enables us to discern the performance disparities among multiple classifiers and make an informed decision in selecting the most appropriate classifier to address practical issues. The subsequent figure illustrates the ROC curve obtained from the test data set.
The Figure 11 depicts the ROC curves of several models, along with their corresponding AUC values. These visual and numerical indicators are employed to assess the capacity of various models to discriminate between positive cases (for example, individuals with heart disease) and negative cases (such as those without heart disease). The horizontal axis represents the FPR, which is defined as the ratio of misclassifying negative cases as positive cases. On the other hand, the vertical axis denotes the TPR, which refers to the proportion of accurately classifying positive cases. The dotted black line in the figure represents random guess. In the context of evaluating model performance, the closer the curve of the model is to the upper left corner of the chart, the better its performance.
In terms of the shape of the ROC curve, the curve of the proposed fusion model is close to the upper left corner, which is the ideal region. It maintains a high true positive rate over the whole range of false positive rates. This indicates that, in practical applications, the fusion model can effectively identify patients who actually have heart disease, i.e., it maintains a high TPR and maximizes the detection of positive cases, while keeping false positives low and thereby avoiding wrongly classifying healthy individuals as heart disease patients. The AUC of the fusion model is 0.92, which is a very high level. The AUC value represents the model's overall ability to distinguish between positive cases (with heart disease) and negative cases (without heart disease) across different classification thresholds.
The curves of KNN (AUC = 0.74) and the decision tree (AUC = 0.78) lie closest to the random guess line among the compared models, indicating relatively weak discriminative power. The lower AUC values suggest that these models have certain limitations and are more likely to make classification errors when differentiating between positive and negative cases.
In contrast, SVM (AUC = 0.91), GBDT (AUC = 0.91), XGBoost (AUC = 0.90), and LightGBM (AUC = 0.91) perform excellently.
Compared with these well-performing models, the fusion model has more prominent advantages, with a relatively high AUC. This implies that the fusion model is highly likely to effectively curtail the incidence of misclassification and decrease the false positive rate. As a consequence, it stands a greater chance of ensuring the rational and efficient allocation of medical resources.
The paired t-test is used to test the significance of the fusion model and other commonly used models. In the paired t-test, the t-statistics and p-value data of the fusion model compared with other models are shown in the following Table 5.
From the significance test, the p-values between the fusion model and the other models, except the decision tree model, are greater than 0.05. This indicates that, according to this significance test, there is no significant performance difference between the fusion model and those models. That is to say, in this test, there is insufficient evidence to demonstrate that the performance of the fusion model is significantly superior or inferior to that of the other models, excluding the decision tree. Notably, the p-values of the fusion model compared with the LightGBM, GBDT, and XGBoost models are quite close, which may imply that these models exhibit comparable stability across different data subsets and roughly comparable fitting and prediction accuracy, suggesting that they can maintain a relatively consistent performance level under varying data conditions. The p-value between the fusion model and the decision tree model is less than 0.05, which indicates that their performances differ significantly. It can be speculated that the fusion model outperforms the decision tree model, although the specific situation requires further analysis of the problem background and data characteristics. If the performance improvement is significant, the fusion model may be a better choice than the decision tree model; if the application is not sensitive to performance differences, other models can be selected according to the specific situation.
However, compared to other models, the fusion model has the highest recall rate and an excellent F1 score performance. It balances accuracy and recall well, ensuring accurate predictions and effectively identifying positive cases. Thus, it has clear advantages in tasks needing high-level comprehensive model performance. Although the precision of the fusion model is slightly lower than that of the GBDT model, it is still at a high level. This indicates that the fusion model achieves high overall prediction accuracy and can accurately judge the majority of samples. The comprehensive performance advantage of the fusion model stems from its utilization of bootstrap sampling and weighted fusion of multiple basic models. This enables it to comprehensively leverage the strengths of different models while circumventing the limitations of any single model. From various indicators, it performs well in many aspects, and there is no obvious shortcoming, which reflects the effectiveness of the fusion strategy in improving the performance of the model.
To analyze whether the fusion model is overfitted, we compare the disparities in accuracy, precision, recall, and the F1 score between the training set and the test set. Based on these differences, we then determine whether the model is overfitted. A common indicator of overfitting is that the larger the disparity between the performance metrics of the training set and those of the test set, the higher the likelihood of overfitting. The detailed data can be found in the Table 6 below.
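The sketch below illustrates this overfitting check by computing the training-test gap for each metric. It reuses the fitted models, weights, and the split data from the earlier fusion sketch; only the idea of the comparison is shown, not the paper's exact procedure.

```python
# Minimal sketch of comparing training-set and test-set metrics to gauge overfitting.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def fused_predict(X_part):
    # weighted soft vote over the fitted sub-models (from the fusion sketch above)
    proba = sum(w * m.predict_proba(X_part)[:, 1] for w, m in zip(weights, models))
    return (proba >= 0.5).astype(int)

for name, metric in [("accuracy", accuracy_score), ("precision", precision_score),
                     ("recall", recall_score), ("f1", f1_score)]:
    gap = metric(y_train, fused_predict(X_train)) - metric(y_test, fused_predict(X_test))
    print(f"{name}: train-test gap = {gap:.4f}")   # large gaps suggest overfitting
```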
The differences among the overfitting indexes of the LightGBM, decision tree, and XGBoost models are generally large. The performance difference of decision tree model between the training set and the test set is more obvious, and the problem of overfitting is more serious. This might be attributable to the fact that the decision tree model tends to generate a complex tree structure during the training process. This structure overfits the noise and specific details within the training data, consequently leading to a substantial deterioration in its performance on the test set. When compared with the LightGBM, decision trees, and XGBoost models, the disparity in the overfitting index of the fusion model is significantly smaller than that of the basic models. This indicates that the overfitting issue has been substantially mitigated. However, there still exist certain discrepancies between the performance of the training set and that of the test set. Moreover, differences can also be observed in terms of accuracy, precision, recall, and the F1 score, suggesting that the fusion model still exhibits some degree of overfitting. The reason may be that the size of the data in this data set is not large, and the original data have not been processed using data enhancement technology, which could increase the diversity of the data and expand the scale of the data set. In addition, during the model training process, the number of training epochs and the learning rate might not have been set appropriately. This could potentially cause the model to be over-optimized for the training set. As a result, the model gradually adapts to the noise and minor fluctuations present in the training data, ultimately leading to overfitting. Especially on small sample data sets, the model is more likely to converge to a local optimum quickly. This is also the deficiency of the paper model, which needs further study.
There is little difference among the overfitting indexes of the SVM model, the KNN model, and the GBDT model, which indicates that the overfitting phenomenon is not obvious. This could be attributed to the fact that the SVM model effectively circumvents the overfitting issue by employing the kernel function during data processing. This enables the model to exhibit a more consistent performance across diverse data sets. However, the SVM model does not perform well in accuracy, precision, recall rate, or the F1 score. KNN model is related to the selection of the k value and data distribution. If the value of k is not chosen properly, the model may be too sensitive to local data, leading to overfitting. If the GBDT model undergoes an excessive number of iterations, or if the learning rate is set inaccurately, overfitting may occur.
Generally speaking, the degree of overfitting differs across models. Although the overfitting problem of the fusion model is smaller than that of the selected basic model, LightGBM, an overfitting issue still exists. In practical applications, it is essential to choose an appropriate model according to the specific requirements and the characteristics of the data at hand, and to take corresponding measures to mitigate overfitting. By doing so, the generalization ability and prediction accuracy of the model can be enhanced.
Of course, there are some limitations in the fusion model. For example, it increases the computational complexity. The fusion model requires the training, sampling, and weighting of multiple basic models. This undoubtedly elevates the computational complexity of the model and prolongs its training time. This computational burden may become a limiting factor for its application in large-scale data processing or scenes that require high real-time performance. In addition, the explanatory ability of the model has also decreased. Compared with some simple basic models, such as decision trees, the structure and decision-making process of the fusion model are more complicated, and it is difficult to directly explain the prediction results. In some fields that require a high explanatory ability of the model, such as medical decision making, financial risk assessment, etc., this may affect the credibility and application acceptance of the model. It is also very difficult to adjust parameters. The fusion model involves the parameters of several basic models as well as the weight parameters during the fusion process, and its optimization procedure is relatively intricate. A greater amount of time and computational resources are required to identify the optimal parameter combination so as to attain the best possible performance. The performance of the fusion model depends largely on the basic model selected and its performance. If there is a big deviation or variance in the basic model itself, it will be difficult for the fusion model to achieve the desired effect.
The model of this paper is constructed by a complex fusion strategy, which combines the advantages of multiple sub-models and different parameter configurations of the same model. This fusion method can reduce the fluctuation of model performance to a certain extent. The fusion model can maintain a relatively stable performance as far as possible under different data distribution and sample conditions. This makes it reliable in the actual heart disease diagnosis scene, and its performance will not be greatly reduced due to a slight fluctuation in data.
Because of the integration of sub-models and various parameter configurations, the model in this paper often has better generalization ability when facing new data. It will not rely too much on a specific model structure or parameter setting, and it can better adapt to new and unprecedented data of heart patients. In practical applications, when dealing with new case data, its data distribution may deviate slightly from that of the training data. However, the model presented in this paper is more likely to make accurate classification predictions, unlike some single models that tend to experience a decline in performance when applied to new data.
Overall, the fusion model demonstrates remarkable advantages in performance metrics, particularly in terms of recall and the F1 score. However, it also has some overfitting issues and confronts challenges like high computational complexity, weak interpretability, and the difficulty of parameter adjustment. In practical applications, it is necessary to weigh the advantages and disadvantages of the fusion model according to specific needs and scenarios and then decide whether to adopt it.

5.3. Discussion on Model Deployment with Incomplete External Verification

Before deploying the model related to heart diseases based on this data set in the real world, a series of rigorous clinical validation steps are needed. These steps should comprehensively evaluate key aspects, such as the algorithm foundation, data processing, and output interpretation of the model, to ensure its accuracy, reliability, and security. For example, during internal verification, data sets should be divided into a training set, a validation set, and a test set according to a specific proportion. Generally, the training set is utilized for model training. The validation set is employed for adjusting model parameters and selecting the optimal model. The test set is then used to evaluate the final performance of the model.
Cross-validation, such as KFold cross-validation, is also of great importance. Through multiple divisions of the data set, followed by training and validation processes, the stability and generalization ability of the model can be effectively assessed, and the bias resulting from data division can be mitigated.
Multi-aspect validation is of utmost importance. In particular, external validation necessitates the use of an independent data set for validation. That is, independent heart disease data sets sourced from other medical institutions or different regions are employed to validate the model. This approach ensures that the model can sustain good performance across diverse data distributions. These external data sets should have similar characteristics to the target application scenario, but they are independent from the training data set to verify the generality of the model.
Multi-center verification is a crucial aspect that cannot be overlooked. It requires validation across numerous distinct clinical centers, which vary in geographical location, patient demographics, and medical practices, among other factors. Multi-center verification enables a more comprehensive assessment of the model’s performance across diverse clinical environments. It also helps to identify potential variations in the model’s performance attributable to factors such as regional differences and disparities in medical standards.
In addition, both performance evaluation and clinical practicality assessment are of great significance. It is essential to compare this model with existing diagnostic methods to ascertain whether it offers added clinical value. For example, there is a need to assess whether the early detection of diseases can be realized, the accuracy of diagnosis can be improved, or the cost of diagnosis can be reduced. At the same time, there is a need to investigate the influence of the prediction results of the model on clinical decision making, such as whether it can help doctors make more accurate treatment plans and judge the prognosis of patients. In practical application, the potential impact of the model on medical quality and patient outcome can also be evaluated by simulating clinical scenarios.
Finally, the importance of ethical and legal review should not be underestimated. It is imperative to adhere strictly to relevant ethical guidelines and legal regulations to ensure that the deployment and utilization of the model comply with the provisions of medical laws and regulations. Notably, given that the external verification of the model remains incomplete, extreme caution must be exercised when contemplating its real-world deployment, even after a series of processes such as internal verification and multi-center verification have been successfully carried out. In the future, only upon successfully passing a comprehensive and rigorous set of clinical verification procedures can it be ensured that the model is ready for deployment in real-world scenarios.
If the model is prematurely deployed without comprehensive verification, then in the event of issues such as misdiagnosis or data leakage, it will not only severely undermine the interests of patients but also give rise to legal disputes and exert an irreparable adverse impact on medical institutions. Therefore, it is particularly critical to develop a scientific and reasonable external verification plan. In the future, external verification can be promoted from the following two aspects. First, fully tap the value of electronic health records. Large-scale electronic health records from different regions and medical institutions have been widely integrated; with the help of advanced data mining technology, the key features closely related to heart disease can be accurately extracted, and a multi-dimensional, highly heterogeneous external verification data set can be built. In this process, it is imperative to desensitize the data in strict compliance with the laws and regulations governing data security and privacy protection, and to spare no effort in ensuring the security of patient information. Second, actively carry out multi-center prospective research. In cooperation with several representative medical institutions, a prospective research plan should be carefully designed: clearly define the inclusion and exclusion criteria, recruit a sufficient number of broadly representative patients, unify the data collection standards and operational procedures, track the development of each patient's condition and the diagnosis and treatment outcomes in real time, and use the collected data for model verification. Through collaboration among multiple centers, the characteristics of patients in diverse regions and medical environments, as well as disparities in diagnosis and treatment approaches, can be comprehensively taken into account, enabling a thorough evaluation of the model's performance in actual clinical scenarios.
Through the implementation of the external verification plan, we are able to continuously refine the model. Additionally, by closely monitoring the evolving trends in ethical and legal aspects, we can steadily enhance the model’s compliance with regulations and its safety. This comprehensive effort represents the core task at present. Only when the model has indisputably passed each and every one of the extremely rigorous tests should it be contemplated for utilization in real-world medical scenarios. Once achieved, it may contribute to some improvements in heart disease diagnosis and treatment.

6. Conclusions and Future Work

Heart disease is a major health problem in the world. Accurate heart disease prediction is of great significance for disease prevention, early intervention and medical resource allocation. With the continuous development of machine learning technology, the construction of heart disease prediction models by using data-driven methods has become a research hotspot, aiming to assist medical professionals to make more accurate diagnosis and decisions. This paper puts forward an effective and reliable heart disease classification model. By conducting an in-depth study and detailed comparison of diverse machine-learning models in the field of heart disease prediction, we explore methods to optimize the model’s performance. Our ultimate goal is to offer more powerful support tools for the clinical diagnosis of heart disease.
In this paper, a new strategy based on the LightGBM model and a weighted fusion sub-model using bootstrap sampling is proposed. The model is trained and optimized through cross-validation by traversing the multi-parameter combinations of LightGBM. Moreover, it is compared with common machine learning models such as KNN, SVM, decision trees, GBDT, and XGBoost. The bootstrap sampling operation carried out in the experimental part enables the model to better adapt to the diversity of sample distribution and data imbalance. Through targeted sample sampling, the model can learn effective patterns under different data subset characteristics, enhance its robustness, and perform better and relatively stably when facing different data sets. Weighted fusion can combine the advantages of different features or models and assign appropriate weights to each part. As a result, the final model can take advantage of LightGBM’s own strengths. Additionally, by fusing other sub-models, it can further optimize performance and overcome the possible limitations of a single model. Finally, although the proposed fusion model still has some overfitting problems that need further improvement, it demonstrates excellent performance in aspects such as accuracy, precision, recall, and the F1 score. Moreover, it exhibits a favorable balance between true positive and false positive rates in the ROC curve comparison. Additionally, it has a considerably high AUC value. This proves the effectiveness of the model in heart disease data classification and that it can provide a powerful reference for diagnosis.
Although some achievements have been made so far, some shortcomings still exist. In the next step, we will continue to conduct in-depth exploration in terms of both data and models and continuously enhance the model to boost its efficiency.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W.; formal analysis, H.C.; investigation, H.C.; resources, H.C.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W. and H.C.; visualization, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki of 1975, which was revised in 2013. In this study, the data set (https://www.kaggle.com/fedesoriano/heart-failure-prediction (accessed on September 2021)) licensed by “Open Data Commons Open Database License (ODBL) v1.0” on Kaggle is used, and the source of the data has been given in the reference part of the paper. The original data of this data set come from the heart disease data set of UCI machine learning library (http://archive.ics.uci.edu/dataset/45/heart+disease (donated on 30 June 1988)). This data set is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. At present, there is no clear information that when UCI data were collected, the data set had been reviewed by the Institutional Review Committee. In the use process, we will strictly follow the relevant research ethics and data use regulations to ensure the legitimacy and ethics of the research.

Informed Consent Statement

In this study, the data set licensed by “Open Data Commons Open Database License (ODBL) v1.0” on Kaggle is used. The original data of this data set come from the heart disease data set of UCI machine learning library which is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. At present, it is impossible to determine whether the informed consent of the participants was obtained when UCI collected the data. In the research process, we will take strict data protection measures to protect the privacy and rights of data subjects and avoid any potential harm to participants.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

In the research process of this paper, we thank the open source platform Kaggle and UCI Machine Learning Library for providing data sources. Thanks to fedesoriano, the data provider, and four original data creators Andras Janosi, William Steinbrunn, Matthias Pfisterer and Robert Detrano.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Savarese, G.; Becher, P.M.; Lund, L.H.; Seferovic, P.; Rosano, G.M.C.; Coats, A.J.S. Global burden of heart failure: A comprehensive and updated review of epidemiology. Cardiovasc. Res. 2023, 118, 3272–3287. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, H.; Liu, Y.J.; Yang, J.F. Epidemiology of heart failure. J. Clin. Cardiol. 2023, 39, 243–247. [Google Scholar]
  3. Papadimitriou, L.; Grewal, P.; Kalogeropoulos, A.P. Epidemiology of heart failure. In Heart Failure: An Essential Clinical Guide, 1st ed.; Kalogeropoulos, A.P., Skopicki, H.A., Butler, J., Eds.; CRC Press: Boca Raton, FL, USA, 2022; pp. 244–253. [Google Scholar]
  4. Yogeswaran, V.; Hidano, D.; Diaz, A.E.; Spall, H.V.V.; Mamas, M.; Roth, G.A.; Cheng, R.K. Regional variations in heart failure: A global perspective. Heart 2023, 110, 11–18. [Google Scholar] [CrossRef]
  5. Tazi, A.; Biju, S.M.; Oroumchian, F.; Kumar, M. Artificial intelligence enabled healthcare data analysis for chronic heart disease detection: An evaluation. Int. J. Grid Util. Comput. 2024, 15, 198–210. [Google Scholar] [CrossRef]
  6. DeGroat, W.; Abdelhalim, H.; Patel, K.; Mendhe, D.; Zeeshan, S.; Ahmed, Z. Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine. Sci. Rep. 2024, 14, 1. [Google Scholar] [CrossRef]
  7. Ouwerkerk, W.; Voors, A.A.; Zwinderman, A.H. Factors Influencing the Predictive Power of Models for Predicting Mortality and/or Heart-Failure Hospitalization in Patients With Heart Failure. JACC Heart Fail. 2014, 2, 429–436. [Google Scholar] [CrossRef]
  8. Bettencourt, P.; Azevedo, A.; Pimenta, J.; Friões, F.; Ferreira, S.; Ferreira, A. N-terminal-pro-brain natriuretic peptide predicts outcome after hospital discharge in heart failure patients. Circulation 2004, 110, 2168–2174. [Google Scholar] [CrossRef]
  9. McDowell, K.; Adamson, C.; Jackson, C.; Campbell, R.; Welsh, P.; Petrie, M.C.; Mcmurray, J.J.V.; Jhund, P.S.; Herring, N. Neuropeptide Y is elevated in heart failure and is an independent predictor of outcomes. Eur. J. Heart Fail. 2024, 26, 107–116. [Google Scholar] [CrossRef]
  10. Yang, W.; Zhu, L.; He, J.; Wu, W.; Zhang, Y.; Zhuang, B.; Xu, J.; Zhou, D.; Wang, Y.; Liu, G. Long-term outcomes prediction in diabetic heart failure with preserved ejection fraction by cardiac MRI. Eur. Radiol. 2024, 34, 5678–5690. [Google Scholar] [CrossRef]
  11. Matsushita, K.; Ito, J.; Isaka, A.; Higuchi, S.; Minamishima, T.; Sakata, K.; Satoh, T.; Soejima, K. Predicting readmission for heart failure patients by echocardiographic assessment of elevated left atrial pressure. Am. J. Med. Sci. 2023, 366, 360–366. [Google Scholar] [CrossRef]
  12. Judge, D.P.; Rouf, R. Use of Genetics in the Clinical Evaluation and Management of Heart Failure. Curr. Treat. Options Cardiovasc. Med. 2010, 12, 566–577. [Google Scholar] [CrossRef]
  13. Szczepanowski, R.; Uchmanowicz, I.; Pasieczna-Dixit, A.H.; Sobecki, J.; Katarzyniak, R.; Koaczek, G.; Lorkiewicz, W.; Kdras, M.; Dixit, A.; Biegus, J. Application of machine learning in predicting frailty syndrome in patients with heart failure. Adv. Clin. Exp. Med. 2024, 33, 309–315. [Google Scholar] [CrossRef]
  14. Miyashita, Y.; Hitsumoto, T.; Fukuda, H.; Kim, J.; Washio, T.; Kitakaze, M. Predicting heart failure onset in the general population using a novel data-mining artificial intelligence method. Sci. Rep. 2023, 13, 4352. [Google Scholar] [CrossRef]
  15. Smith, D.H.; Johnson, E.S.; Thorp, M.L.; Yang, X.; Petrik, A.; Platt, R.W.; Crispell, K. Predicting poor outcomes in heart failure. Perm. J. 2011, 15, 4–11. [Google Scholar] [CrossRef]
  16. Gottdiener, J.S.; Fohner, A.E. Risk Prediction in Heart Failure: New Methods, Old Problems. JACC Heart Fail. 2020, 8, 22–24. [Google Scholar] [CrossRef]
  17. Khan, S.S.; Ning, H.; Allen, N.B.; Carnethon, M.R.; Yancy, C.W.; Shah, S.J.; Wilkins, J.T.; Tian, L.; Lloyd-Jones, D.M. Development and Validation of a Long-Term Incident Heart Failure Risk Model. Circ. Res. 2022, 2, 200–209. [Google Scholar] [CrossRef]
  18. Gaziano, L.; Cho, K.; Djousse, L.; Schubert, P.; Galloway, A.; Ho, Y.-L.; Kurgansky, K.; Gagnon, D.R.; Russo, J.P.; Di Angelantonio, E.; et al. Risk factors and prediction models for incident heart failure with reduced and preserved ejection fraction. ESC Heart Fail. 2021, 8, 4893–4903. [Google Scholar] [CrossRef]
  19. Ahmad, T.; Yamamoto, Y.; Biswas, A.; Ghazi, L.; Martin, M.; Simonov, M.; Hsiao, A.; Kashyap, N.; Velazquez, E.J.; Desai, N.R.; et al. REVeAL-HF: Design and rationale of a pragmatic randomized controlled trial embedded within routine clinical practice. Heart Fail. 2021, 9, 409–419. [Google Scholar]
  20. Phan, J.; Barroca, C.; Fernandez, J. A Suggested Model for the Vulnerable Phase of Heart Failure: Assessment of Risk Factors, Multidisciplinary Monitoring, Cardiac Rehabilitation, and Addressing the Social Determinants of Health. Cureus 2023, 15, e35602. [Google Scholar] [CrossRef]
  21. Lindholm, D.; Lindbäck, J.; Armstrong, P.W.; Budaj, A.; Cannon, C.P.; Granger, C.B.; Hagström, E.; Held, C.; Koenig, W.; Östlund, O.; et al. Biomarker-Based Risk Model to Predict Cardiovascular Mortality in Patients with Stable Coronary Disease. J. Am. Coll. Cardiol. 2017, 70, 813–826. [Google Scholar] [CrossRef]
  22. Li, X.; Zhang, T.; Xing, W. Predictive value of initial Lp-PLA2, NT-proBNP, and peripheral blood-related ratios for heart failure after early onset infarction in patients with acute myocardial infarction. Am. J. Transl. Res. 2024, 16, 2940–2952. [Google Scholar] [CrossRef]
  23. Bayes-Genis, A.; Docherty, K.F.; Petrie, M.C.; Januzzi, J.L.; Mueller, C.; Anderson, L.; Bozkurt, B.; Butler, J.; Chioncel, O.; Cleland, J.G.F.; et al. Practical algorithms for early diagnosis of heart failure and heart stress using NT-proBNP: A clinical consensus statement from the Heart Failure Association of the ESC. Eur. J. Heart Fail. 2023, 25, 1891–1898. [Google Scholar] [CrossRef]
  24. da Silva, R.M.F.L.; Borges, L.E. Neutrophil-Lymphocyte Ratio and Red Blood Cell Distribution Width in Patients with Atrial Fibrillation and Rheumatic Valve Disease. Curr. Vasc. Pharmacol. 2023, 21, 367–377. [Google Scholar] [CrossRef]
  25. Huang, S.; Zhou, Y.; Zhang, Y.; Liu, N.; Liu, J.; Liu, L.; Fan, C. Advances in MicroRNA Therapy for Heart Failure: Clinical Trials, Preclinical Studies, and Controversies. Cardiovasc. Drugs Ther. 2025, 39, 221–232. [Google Scholar] [CrossRef]
  26. Paterson, I.; Mielniczuk, L.M.; O’Meara, E.; So, A.; White, J.A. Imaging Heart Failure: Current and Future Applications. Can. J. Cardiol. 2013, 29, 317–328. [Google Scholar] [CrossRef]
  27. Lee, J.-H.; Uhm, J.-S.; Suh, Y.J.; Kim, M.; Kim, I.-S.; Jin, M.-N.; Cho, M.S.; Yu, H.T.; Kim, T.-H.; Hong, Y.J.; et al. Usefulness of cardiac magnetic resonance images for prediction of sudden cardiac arrest in patients with mitral valve prolapse: A multicenter retrospective cohort study. BMC Cardiovasc. Disord. 2021, 21, 546. [Google Scholar] [CrossRef]
  28. Pinto, J.; Koshy, A.G. The Role of Echocardiography in Heart Failure Today. J. Indian Acad. Echocardiogr. Cardiovasc. Imaging 2021, 5, 16–23. [Google Scholar] [CrossRef]
  29. Yogasundaram, H.; Alhumaid, W.; Dzwiniel, T.; Christian, S.; Oudit, G.Y. Cardiomyopathies and Genetic Testing in Heart Failure: Role in Defining Phenotype-Targeted Approaches and Management. Can. J. Cardiol. 2021, 37, 547–559. [Google Scholar] [CrossRef]
  30. Bleumink, G.S.; Schut, A.F.C.; Sturkenboom, M.C.J.M.; Deckers, J.W.; van Duijn, C.M.; Stricker, B.H.C. Genetic polymorphisms and heart failure. Genet. Med. 2004, 6, 465–474. [Google Scholar] [CrossRef]
  31. Rosenbaum, A.N.; Pereira, N. Updates on the Genetic Paradigm in Heart Failure. Curr. Treat. Options Cardiovasc. Med. 2019, 21, 1–11. [Google Scholar] [CrossRef]
  32. Skrzynia, C.; Berg, J.S.; Willis, M.S.; Jensen, B.C. Genetics and Heart Failure: A Concise Guide for the Clinician. Curr. Cardiol. Rev. 2015, 11, 10–17. [Google Scholar] [CrossRef]
  33. Povysil, G.; Chazara, O.; Carss, K.J.; Deevi, S.V.V.; Wang, Q.; Armisen, J.; Paul, D.S.; Granger, C.B.; Kjekshus, J.; Aggarwal, V.; et al. Assessing the Role of Rare Genetic Variation in Patients with Heart Failure. JAMA Cardiol. 2021, 6, 379–386. [Google Scholar] [CrossRef]
  34. Cuocolo, R.; Perillo, T.; De Rosa, E.; Ugga, L.; Petretta, M. Current applications of big data and machine learning in cardiology. J. Geriatr. Cardiol. 2019, 16, 601–607. [Google Scholar]
  35. Agrawal, H.; Chandiwala, J.; Agrawal, S.; Goyal, Y. Heart Failure Prediction using Machine Learning with Exploratory Data Analysis. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021; pp. 1–6. [Google Scholar]
  36. Penny-Dimri, J.C.; Bergmeir, C.; Perry, L.; Hayes, L.; Bellomo, R.; Smith, J.A. Machine learning to predict adverse outcomes after cardiac surgery: A systematic review and meta-analysis. J. Card. Surg. 2022, 37, 3838–3845. [Google Scholar] [CrossRef]
  37. Benedetto, U.; Dimagli, A.; Sinha, S.; Cocomello, L.; Gibbison, B.; Caputo, M.; Gaunt, T.; Lyon, M.; Holmes, C.; Angelini, G.D. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. J. Thorac. Cardiovasc. Surg. 2022, 163, 2075–2087. [Google Scholar] [CrossRef]
  38. Mythili, T.; Mukherji, D.; Padalia, N.; Naidu, A. A Heart Disease Prediction Model using SVM-Decision Trees-Logistic Regression (SDL). Int. J. Comput. Appl. Technol. 2014, 68, 11–15. [Google Scholar]
  39. SK, H.K.; Praveen, A.; Kowshik, G.; Lokeshwaran, T.; Prasanna, K.M. Heart Disease Prediction using XGBoost and Random Forest Models. In Proceedings of the 2024 5th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI), Lalitpur, Nepal, 18–19 January 2024; pp. 19–23. [Google Scholar]
  40. Yang, P.; Qiu, H.; Wang, L.; Zhou, L. Early prediction of high-cost inpatients with ischemic heart disease using network analytics and machine learning. Expert Syst. Appl. 2022, 210, 118541. [Google Scholar] [CrossRef]
  41. Fedesoriano. Heart Failure Prediction Dataset. Available online: https://www.kaggle.com/fedesoriano/heart-failure-prediction (accessed on September 2021).
Figure 1. Overall framework diagram.
Figure 2. Proportion of heart failure cases.
Figure 3. The influence of basic patient information on the predicted value: (a) the influence of age and (b) the influence of sex.
Figure 4. The influence of symptoms related to chest pain on the predicted value: (a) the influence of chest pain type and (b) the influence of exercise-induced angina.
Figure 5. The influence of blood biochemical indexes on the predicted values: (a) the influence of serum cholesterol and (b) the influence of fasting blood sugar.
Figure 6. The influence of physiological indexes on the predicted values: (a) the influence of resting blood pressure and (b) the influence of maximum heart rate.
Figure 7. The influence of physical examination indexes on the predicted values: (a) the influence of the resting electrocardiogram; (b) the influence of oldpeak; and (c) the influence of the slope of the peak exercise ST segment.
Figure 8. Feature correlation analysis.
Figure 9. Feature importance ranking.
Figure 10. The learning curve of the test set.
Figure 11. The ROC curve of the test set.
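Figures 10 and 11 summarize the learning curve and the ROC curve of the test set. As a small illustration of how the ROC curve in Figure 11 can be obtained from a fitted classifier with scikit-learn, the helper below is a minimal sketch; the function and variable names are hypothetical and the plotting step is omitted.

```python
from sklearn.metrics import roc_curve, roc_auc_score

def roc_points(y_true, scores):
    """Return ROC curve points and AUC from positive-class probabilities."""
    fpr, tpr, _thresholds = roc_curve(y_true, scores)
    return fpr, tpr, roc_auc_score(y_true, scores)

# Usage with a fitted classifier `clf` and a held-out test set (names illustrative):
# fpr, tpr, auc = roc_points(y_test, clf.predict_proba(X_test)[:, 1])
# print(f"test AUC = {auc:.3f}")
```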
Table 1. Attribute Information.
Attribute | Attribute Information
Age | age of the patient [years]
Sex | sex of the patient [M: male, F: female]
ChestPainType | chest pain type [TA: typical angina, ATA: atypical angina, NAP: non-anginal pain, ASY: asymptomatic]
RestingBP | resting blood pressure [mm Hg]
Cholesterol | serum cholesterol [mg/dL]
FastingBS | fasting blood sugar 1
RestingECG | resting electrocardiogram results 2
MaxHR | maximum heart rate achieved 3
ExerciseAngina | exercise-induced angina [Y: Yes, N: No]
Oldpeak | oldpeak = ST depression [numeric value]
ST_Slope | slope of the peak exercise ST segment 4
HeartDisease | output class [1: heart disease, 0: normal]
1 [1: if FastingBS > 120 mg/dL, 0: otherwise]. 2 [Normal: normal; ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of >0.05 mV); LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]. 3 [Numeric value between 60 and 202]. 4 [Up: upsloping, Flat: flat, Down: downsloping].
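Table 1 describes how the attributes of the Kaggle heart failure dataset map onto model inputs. The snippet below is a minimal preprocessing sketch consistent with that schema: the categorical attributes are one-hot encoded and the output class is separated. The file name heart.csv and the use of pandas get_dummies are assumptions, not the authors' exact preprocessing pipeline.

```python
import pandas as pd

# Load the Kaggle heart-failure prediction dataset; the file name "heart.csv" is assumed.
df = pd.read_csv("heart.csv")

# Per Table 1: Sex, ChestPainType, RestingECG, ExerciseAngina and ST_Slope are categorical
# and are one-hot encoded here; Age, RestingBP, Cholesterol, FastingBS, MaxHR and Oldpeak
# are already numeric; HeartDisease is the output class.
categorical = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
X = pd.get_dummies(df.drop(columns=["HeartDisease"]), columns=categorical, dtype=int)
y = df["HeartDisease"]

print(X.dtypes)          # all model inputs should be numeric after encoding
print(y.value_counts())  # class balance (cf. Figure 2)
```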
Table 2. Accuracy of different feature combinations.
Combination Type | Accuracy
Age + Sex + ChestPainType + ExerciseAngina | about 79%
ChestPainType + ExerciseAngina + Oldpeak + ST_Slope | 79–81%
Oldpeak + ST_Slope + RestingECG + ExerciseAngina | about 84%
RestingECG + MaxHR + ExerciseAngina | 83–84%
Oldpeak + ST_Slope + RestingECG | 81–83%
Full feature combination | 82–85%
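Table 2 compares the accuracy obtained with different feature combinations. The sketch below shows one way such a comparison could be run, with 5-fold cross-validation on selected column subsets; the chosen subsets, the LightGBM parameters, and the file name are illustrative assumptions rather than the authors' exact protocol.

```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

df = pd.read_csv("heart.csv")  # file name assumed, as above
y = df["HeartDisease"]

# A few of the feature combinations from Table 2, using the attribute names of Table 1.
subsets = {
    "Age + Sex + ChestPainType + ExerciseAngina":
        ["Age", "Sex", "ChestPainType", "ExerciseAngina"],
    "Oldpeak + ST_Slope + RestingECG + ExerciseAngina":
        ["Oldpeak", "ST_Slope", "RestingECG", "ExerciseAngina"],
    "Full feature combination":
        [c for c in df.columns if c != "HeartDisease"],
}

for name, cols in subsets.items():
    X = pd.get_dummies(df[cols], dtype=int)                      # encode categorical columns
    clf = lgb.LGBMClassifier(n_estimators=100, random_state=42)  # illustrative parameters
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```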
Table 3. The p-value comparison between LightGBM and other models.
Model | LightGBM | SVM | KNN | Decision Tree | GBDT | XGBoost
LightGBM | 0.000000 | 0.002467 | 0.000839 | 0.017154 | 0.943512 | 0.079960
Table 4. Comparison of indicators of each model.
Model | Accuracy | Recall | Precision | F1 score
SVM | 0.735507 | 0.769230 | 0.733333 | 0.750853
KNN | 0.702899 | 0.762238 | 0.694268 | 0.726667
Decision Tree | 0.818841 | 0.853147 | 0.807947 | 0.829932
GBDT | 0.855072 | 0.874126 | 0.850340 | 0.862069
XGBoost | 0.829710 | 0.853147 | 0.824324 | 0.838488
LightGBM | 0.829710 | 0.853147 | 0.824324 | 0.838488
Bootstrap sampling and weighted fusion LightGBM | 0.847826 | 0.888112 | 0.830065 | 0.858108
Increase relative to the LightGBM baseline (%) | 2.183413 | 4.098356 | 0.696449 | 2.339926
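The last row of Table 4 reports the relative improvement of the fusion model over the LightGBM baseline; for accuracy, for example, (0.847826 − 0.829710) / 0.829710 × 100 ≈ 2.18%. The sketch below illustrates the general idea of the proposed scheme in simplified form: several LightGBM sub-models are trained on bootstrap resamples with different parameter sets, each sub-model is weighted by the mean of its accuracy, precision, recall and F1 score on held-out samples, and the test prediction is a weighted soft vote. The parameter grid, the use of out-of-bag samples in place of the paper's cross-validation procedure, and the exact weighting rule are assumptions, not the authors' implementation.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(42)

df = pd.read_csv("heart.csv")                                    # file name assumed
X = pd.get_dummies(df.drop(columns=["HeartDisease"]), dtype=int).to_numpy()
y = df["HeartDisease"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Illustrative LightGBM parameter combinations (the paper's exact grid is not given here).
param_sets = [
    {"n_estimators": 100, "learning_rate": 0.10, "num_leaves": 31},
    {"n_estimators": 200, "learning_rate": 0.05, "num_leaves": 63},
    {"n_estimators": 300, "learning_rate": 0.03, "num_leaves": 15},
]

models, weights = [], []
for params in param_sets:
    # Bootstrap resample of the training set for this sub-model.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    oob = np.setdiff1d(np.arange(len(X_train)), idx)             # held-out (out-of-bag) rows
    clf = lgb.LGBMClassifier(random_state=42, **params).fit(X_train[idx], y_train[idx])

    # Weight the sub-model by the mean of accuracy, precision, recall and F1
    # on the held-out rows (the paper's weighting formula may differ).
    pred = clf.predict(X_train[oob])
    w = np.mean([accuracy_score(y_train[oob], pred),
                 precision_score(y_train[oob], pred),
                 recall_score(y_train[oob], pred),
                 f1_score(y_train[oob], pred)])
    models.append(clf)
    weights.append(w)

weights = np.array(weights) / np.sum(weights)                    # normalize the weights

# Weighted soft vote over the sub-models' predicted probabilities on the test set.
proba = sum(w * m.predict_proba(X_test)[:, 1] for w, m in zip(weights, models))
y_pred = (proba >= 0.5).astype(int)
print("fusion test accuracy:", accuracy_score(y_test, y_pred))
```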
Table 5. t-statistics and p-values of the fusion model compared with other models.
Contrast Mode | t | p
Fusion model and LightGBM model | 1.637964 | 0.102573
Fusion model and SVM model | 0.266811 | 0.789815
Fusion model and KNN model | −0.619480 | 0.536112
Fusion model and Decision Tree model | 1.990495 | 0.047526
Fusion model and GBDT model | 1.511057 | 0.131922
Fusion model and XGBoost model | 1.416791 | 0.157675
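Tables 3 and 5 report paired t-test statistics and p-values between models. One common way to obtain such values is a paired t-test on the per-sample correctness indicators of two classifiers evaluated on the same test set, sketched below with scipy; whether the paper pairs per-sample correctness or per-fold metrics is an assumption here.

```python
import numpy as np
from scipy import stats

def paired_ttest(y_true, pred_a, pred_b):
    """Paired t-test on the per-sample correctness of two classifiers on one test set."""
    correct_a = (np.asarray(pred_a) == np.asarray(y_true)).astype(int)
    correct_b = (np.asarray(pred_b) == np.asarray(y_true)).astype(int)
    return stats.ttest_rel(correct_a, correct_b)

# Usage with predictions from two fitted models (variable names illustrative):
# t_stat, p_value = paired_ttest(y_test, fusion_pred, lightgbm_pred)
# print(f"t = {t_stat:.6f}, p = {p_value:.6f}")
```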
Table 6. Comparison of overfitting indexes of various models.
Model | Accuracy Difference | Precision Difference | Recall Difference | F1 Difference
Fusion model | 0.126202 | 0.135431 | 0.100929 | 0.119140
LightGBM | 0.170290 | 0.175676 | 0.146853 | 0.161512
SVM | −0.008093 | 0.020677 | 0.003372 | 0.012340
KNN | 0.075918 | 0.096855 | 0.067899 | 0.083494
Decision Tree | 0.242753 | 0.224638 | 0.251748 | 0.238434
GBDT | 0.098199 | 0.091612 | 0.103956 | 0.097608
XGBoost | 0.170290 | 0.175676 | 0.146853 | 0.161512
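Table 6 quantifies overfitting as the gap between training and test performance for each metric. The helper below computes such gaps for any fitted classifier; the sign convention (training score minus test score) is an assumption consistent with the mostly positive values in the table.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def overfitting_gaps(model, X_train, y_train, X_test, y_test):
    """Training-minus-test differences per metric (the overfitting index assumed in Table 6)."""
    metrics = {"accuracy": accuracy_score, "precision": precision_score,
               "recall": recall_score, "f1": f1_score}
    train_pred, test_pred = model.predict(X_train), model.predict(X_test)
    return {name: fn(y_train, train_pred) - fn(y_test, test_pred)
            for name, fn in metrics.items()}

# Usage with any fitted classifier, e.g. the baseline LightGBM or the fusion model:
# print(overfitting_gaps(clf, X_train, y_train, X_test, y_test))
```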