1. Introduction
Many different factors are often taken into account when diagnosing a disease. The complexity of the disease (such as the risk levels associated with multiple diseases) and the diagnostic knowledge available to the physician [1,2] can influence the correctness of a diagnosis [3]. These complicating factors pose many challenges for medical professionals, especially those who are young and inexperienced [4]. Machine learning is widely adopted to develop medical auxiliary diagnostic systems [5], also known as Computer-Aided Diagnosis (CAD) systems. CAD systems are important tools that provide disease diagnosis and prognosis [6,7]. They not only help doctors make quick decisions and save patients' time but also reduce patients' discomfort by replacing invasive approaches [8]. CAD systems use a wide spectrum of machine learning methods [9], ranging from single prediction models, such as Support Vector Machine (SVM) and Decision Tree (DT), to ensemble and deep learning models, such as Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Deep Neural Networks (DNN).
When CAD is used to assist diagnosis, effective feature engineering can be carried out with the help of doctors, which makes it possible for some classical, more interpretable machine learning methods to outperform deep learning models [10]. Appropriate features can be obtained through feature selection algorithms [11], selection based on physician experience [10], or other methods. On the other hand, many models based on deep neural networks may hinder efficient interaction between doctors and the system because of the opaque nature of their decision-making processes [12,13], and highly complex models are also not conducive to physicians' adjustments to reduce diagnostic bias [14,15]. Therefore, improving the performance (such as accuracy and generalization in the face of changing data) of relatively simple, highly comprehensible models remains important for CAD [16].
Ensemble learning is a class of methods that utilize more than one machine learning model to improve prediction results [17]. The performance of an ensemble model that integrates the results of individual models (i.e., base learners) is usually better than that of the individual models [18]. For instance, Tseng [19] integrated five machine learning classifiers into an ensemble model for diagnosing recurrent ovarian cancer. Ensemble learning usually selects an optimal set of base learners and then combines them using a specific fusion method. Thus, the choice of base learners and the way they are integrated are critical. To ensure optimal performance, the base learners should have both good individual performance and sufficient diversity [20]. To aggregate base learners, classifier fusion methods are typically used, including majority voting, support function fusion, and stacking [21].
The optimal set of base learners and the fusion method may change when an ensemble learning model is applied to different datasets [22]. Due to the heterogeneity of datasets and the diversity of disease types, a fixed algorithm structure is likely to limit diagnostic accuracy. Prior research has proposed different strategies to make an ensemble model generalizable to different problems. For instance, Al-Tashi [23] used wavelet transformation and singular value decomposition to reduce the dimensionality of the feature space. This method relies on projections of features instead of specific features, which improves the generalization of diagnostic performance. Yet, similar to linear models, this approach still focuses on reducing model complexity rather than making the model adaptive to different problems and datasets. Zhou [24] experimented with a deep forest ensemble architecture that consists of two kinds of random forest algorithms. However, adopting a fixed number and type of classifiers still hinders the performance of the system in the face of different problems.
Previous studies have made good progress in adapting ensemble models to heterogeneous problems. However, most of them adopt a fixed structure, which can only ensure that model performance remains relatively stable; it does not help the model achieve optimal performance across different diagnostic tasks, changing datasets, and diagnostic features. Specifically, in some real-world assisted diagnosis scenarios, training datasets will have significantly different volumes and features depending on the diagnostic task and the hospital [22]. Diagnostic data, as an important asset of hospitals, remain difficult to share, which means that small hospitals struggle to obtain the large amounts of data needed to train complex deep models. It is therefore important that auxiliary diagnostic models maintain good performance on small datasets and remain robust in the face of the different features of different diagnostic scenarios. Designing an adaptive deep ensemble learning method built on simple, highly understandable base learners can thus further improve the accuracy, reproducibility, and interpretability of deep ensemble learning models and promote their wider application in bioinformatics and CAD [25].
In this study, we propose a Deep Ensemble Model (DEM) based on a Tree-structured Parzen Estimator (TPE) to address the above problems. A DEM is a class of deep learning model based on a cascade forest structure. Different from traditional deep neural networks, each layer of a DEM is composed of base classifiers. We use TPE, a method widely used for hyperparameter optimization [26], to optimize the number of base classifiers per layer so that the model can dynamically adjust this number when applied to different datasets. We further use four advanced ensemble learners to form a base classifier pool. This ensures that the base learners have good diversity, which is critical to ensemble learning [27]. The four ensemble learners are Random Forest (RF), Extra Trees (ET), AdaBoost, and Gradient Boosting Decision Tree (GBDT). By dynamically adjusting its structure based on the data, the proposed algorithm can search for optimal solutions when applied to different problems.
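To make the structure-search idea concrete, the sketch below scores candidate structures (which classifier from the RF/ET/AdaBoost/GBDT pool, and how many copies per layer) on a synthetic dataset. This is an illustrative stand-in, not the paper's implementation: it uses exhaustive search over a tiny configuration space where the actual model uses TPE, and the scoring proxy (average cross-validated accuracy of differently seeded copies) is our simplification of building a full cascade.

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

# Placeholder data; in practice this would be the diagnostic dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Base classifier pool used by TPE-DEM (hyperparameters here are illustrative).
POOL = {
    "RF": lambda: RandomForestClassifier(n_estimators=50),
    "ET": lambda: ExtraTreesClassifier(n_estimators=50),
    "Ada": lambda: AdaBoostClassifier(n_estimators=50),
    "GBDT": lambda: GradientBoostingClassifier(n_estimators=50),
}

def score_config(kind, n_per_layer):
    """Score one structural choice: which classifier, how many copies per layer.
    As a cheap proxy we average the CV accuracy of n_per_layer differently
    seeded copies of the chosen classifier."""
    scores = []
    for seed in range(n_per_layer):
        clf = POOL[kind]().set_params(random_state=seed)
        scores.append(cross_val_score(clf, X, y, cv=3).mean())
    return sum(scores) / len(scores)

# Exhaustive search over a tiny space as a stand-in for TPE's guided sampling.
best = max(product(POOL, [1, 2]), key=lambda cfg: score_config(*cfg))
print("best structure:", best)
```

In the full model, this evaluation loop would be driven by TPE so that each new candidate structure is proposed based on the scores of the structures already tried, rather than enumerated exhaustively.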
Overall, our model uses TPE for classifier selection and a DEM for classifier fusion. The proposed model has three main advantages:
- (1) Our proposed model is based on the integration of simple and comprehensible models. It therefore needs to learn fewer parameters than deep neural network-based models, requires less training data, and is more easily accepted and understood by physicians in practical applications.
- (2) Our proposed model can dynamically adjust its structure to maintain good performance across tasks with different datasets and feature sets.
- (3) Our proposed model can be flexibly tuned for continuous optimization; e.g., future improvements to the base classifiers can enhance the overall performance of the model.
To examine the performance of the TPE-DEM model against other benchmark models, we conducted validation experiments on six significantly different datasets (the differences lie in data volume, number of features, and the proportion of negative and positive samples). We first use two datasets representing different diagnostic tasks and describe the optimal hyperparameters and performance of the proposed model on each. The first is breast cancer diagnostic data from our partner hospitals, and the second is a coronary artery disease prediction dataset from the UCI public repository. Then, to further validate the performance of the proposed model on different datasets, we used four additional UCI public datasets for evaluation. The first two of these are medical diagnosis tasks; the last two are tasks from other scenarios, with the final dataset having a significantly larger volume than the others. Our experimental results demonstrate that the proposed model performs well on small datasets and, as a deep model, performs even more strongly than the other benchmark models on large datasets.
The remainder of the paper is organized as follows. Section 2 reviews previous studies and their relevance to our study. Section 3 describes the proposed TPE-DEM model, and Section 4 introduces the six datasets and the metrics that we used to evaluate the model. In Section 5, we analyze the experimental results and discuss the theoretical and practical implications of our research. In the final section, we summarize our research and point out limitations that still need to be addressed in the future.
2. Related Works
Ensemble learning techniques combine multiple base learners and can obtain better prediction performance than single learners. Bagging, boosting, and stacking are the most common ensemble approaches. Bagging combines the predictions of individual base learners by voting. Boosting iteratively constructs new models based on the prediction errors of previous models. Stacking trains a meta learner on the predictions of the individual base learners; the meta learner determines the weights of the predictions in a supervised fashion. The construction of an ensemble model mainly involves generating (a pool of classifiers), selecting (the categories and quantities of classifiers), and integrating (the prediction results of each classifier to produce the final output) [28].
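The unsupervised and supervised fusion rules described above can be contrasted in a few lines. This is a generic illustration using scikit-learn (our choice for illustration; the paper does not specify a library), with arbitrary base learners and a synthetic placeholder dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

base = [("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("dt", DecisionTreeClassifier(random_state=1))]

# Unsupervised fusion: majority voting over the base learners' predictions.
voting = VotingClassifier(estimators=base, voting="hard").fit(X_tr, y_tr)

# Supervised fusion (stacking): a meta learner is trained on the base
# learners' out-of-fold predictions and learns how to weight them.
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression()).fit(X_tr, y_tr)

print("voting accuracy:  ", voting.score(X_te, y_te))
print("stacking accuracy:", stacking.score(X_te, y_te))
```

Which fusion rule wins depends on the dataset and the base learners, which is precisely the sensitivity that motivates dynamic, data-driven ensemble construction.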
Chandra [29] suggests that the most promising direction is to generate a pool of accurate and diverse algorithms. Therefore, the optimal ensemble model should combine base learners with good individual performance and a sufficient level of diversity. The selection stage in ensemble learning determines the type and number of base learners. The selection strategy can be static or dynamic [30]. A static strategy combines base learners regardless of the data, while dynamic selection chooses the most appropriate base learners for a given dataset. Existing research has extensively studied algorithms for finding an accurate and diverse set of base learners for ensemble learning. For instance, Brun [31] proposed a dynamic classifier selection framework and demonstrated through experiments that training different classifiers based on different problems and datasets can improve classification accuracy. Junior [32] proposed a reduced minority k-nearest neighbors method based on k-nearest neighbors, which effectively mitigates the prediction bias caused by unbalanced data in a credit score prediction task. Previous studies have shown that dynamic classifier selection and combination can improve the performance of classifiers across different data types and scenarios, and our study further supports this finding. However, unlike previous work, our proposed model turns classifier selection into an optimization problem, making the selection process faster and further improving the performance of the ensemble model by combining it with a DEM [32].
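The static-versus-dynamic distinction can be illustrated with a deliberately simplified, dataset-level selector: instead of fixing one classifier in advance, keep whichever candidate scores best under cross-validation on the dataset at hand. (Published dynamic selection methods, including the frameworks cited above, typically operate per instance or per region of the feature space; the classifiers and datasets below are placeholders.)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A small candidate pool; a static strategy would fix one of these up front.
CANDIDATES = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
}

def select_classifier(X, y):
    """Dataset-level dynamic selection: return the name of the candidate
    with the best 3-fold cross-validated accuracy on this dataset."""
    return max(CANDIDATES,
               key=lambda name: cross_val_score(CANDIDATES[name], X, y, cv=3).mean())

# Two synthetic datasets with different characteristics; they may
# favour different base learners.
X1, y1 = make_classification(n_samples=200, n_features=5, random_state=0)
X2, y2 = make_classification(n_samples=200, n_features=20,
                             n_informative=2, random_state=3)
print(select_classifier(X1, y1), select_classifier(X2, y2))
```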
Many search algorithms have been considered for optimization, such as Genetic Algorithms (GA) [33] and Evolutionary Algorithms (EA) [34]. The major limitation of these methods is that they often require a significant amount of time to evaluate hyperparameters. The Gaussian process-expected improvement [35] and Gaussian process-predictive entropy search [36] methods use a Gaussian Process (GP) to estimate the error caused by different hyperparameters, employing the Expected Improvement (EI) and predictive entropy search acquisition functions, respectively. Although GP is simple and flexible, manipulating its covariance matrix requires a large amount of computation [37]. Researchers therefore proposed TPE, which has since been widely used for hyperparameter optimization. Recent work has also used TPE to optimize the hyperparameters of convolutional neural networks to improve performance on a lung nodule recognition task [38]. In this paper, our proposed model needs to dynamically adjust its hyperparameters to perform well on different diagnostic tasks, but hyperparameter optimization entails an additional time cost. To minimize this cost, our model requires a fast optimization algorithm. Compared with other optimization algorithms, TPE can complete the optimization task in less time; we therefore choose TPE as the hyperparameter selection method in this paper.
The integration strategy of an ensemble learning model often depends on the specific situation. Each base learner can have an equal or a different weight, and the integration strategy usually affects the accuracy of the final model [39]. The rule for combining base learners can be supervised or unsupervised. Sum and majority voting are well-known unsupervised methods. Stacking is a supervised method in which the predicted results from the base learners are merged into new features used to train a meta learner [40]. Recently, researchers have introduced mechanisms that combine ensemble learning methods with various deep learning algorithms to enhance prediction performance. Zhou [24] proposed a cascade forest ensemble based on gcForest for better representation learning; in this model, the layered structure of a deep neural network is retained, but each neuron is replaced with a tree-based classifier. In general, the performance of traditional deep forest ensemble models based on static integration is greatly affected by changes in the data. Building on the traditional deep forest ensemble scheme, we use the TPE method to optimize the structure of the model, dynamically adjusting the type and number of base learners according to the dataset. Experiments show that the method we propose has better performance than popular baselines and maintains stable performance on different datasets. We further evaluate the new system on disease diagnosis.
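A minimal sketch of the cascade idea is shown below: each layer's class-probability outputs are concatenated to the original features and passed to the next layer, and the last layer's probabilities are averaged for the final prediction. This is our simplification, not the gcForest implementation; in particular, the published method generates the augmented features with k-fold cross-validation to avoid overfitting, which this sketch omits, and the dataset is a synthetic placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

def cascade_predict(X_tr, y_tr, X_te, n_layers=2):
    """Minimal cascade-forest pass: at each layer, train two forest
    classifiers and append their class probabilities to the raw features."""
    A_tr, A_te = X_tr, X_te
    for layer in range(n_layers):
        probas_tr, probas_te = [], []
        for Clf in (RandomForestClassifier, ExtraTreesClassifier):
            clf = Clf(n_estimators=50, random_state=layer).fit(A_tr, y_tr)
            probas_tr.append(clf.predict_proba(A_tr))
            probas_te.append(clf.predict_proba(A_te))
        # Augment the raw features with this layer's probability vectors.
        A_tr = np.hstack([X_tr] + probas_tr)
        A_te = np.hstack([X_te] + probas_te)
    # Final prediction: average the last layer's probability vectors.
    return np.mean(probas_te, axis=0).argmax(axis=1)

pred = cascade_predict(X_tr, y_tr, X_te)
print("cascade accuracy:", (pred == y_te).mean())
```

In TPE-DEM, the classifiers inside each layer and their number are not fixed as they are in this sketch; they are chosen by the TPE optimizer for the dataset at hand.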
6. Conclusions
In this paper, we proposed a TPE-DEM model based on the traditional DEM. Our proposed model transforms the integration of different simple base learners into an optimization problem, using the TPE optimization algorithm to obtain the optimal hyperparameters of the model for various diagnostic tasks. Because it integrates simple models, our proposed model requires less training data and is more easily understood by physicians than deep neural network-based models. When faced with different diagnostic tasks and datasets, our proposed model can change its structure by dynamically adjusting its hyperparameters to maintain good performance.
To evaluate the effectiveness of TPE-DEM, we validated its performance on six different datasets. The first and fourth datasets have good features and a more balanced data distribution. The experimental results show that TPE-DEM and the other baseline models can effectively learn from these data and achieve good performance. Even so, TPE-DEM performs on average 2% higher than the other baseline models on all four metrics on the first dataset and 1% higher on average on three metrics on the fourth dataset. When the datasets are somewhat unbalanced (the second and third datasets), the performance of all models decreases; still, TPE-DEM outperforms the baselines by more than 1.5% on average across all four metrics. In the experiment on the fifth dataset, the Precision and F-measure of all models were significantly lower because of the characteristics of the dataset, but TPE-DEM outperformed the other baseline models by more than 6% on average on these two metrics. Overall, TPE-DEM outperforms the other baseline models on all six datasets, and its advantage is more pronounced when deficiencies in the dataset degrade the performance of all models.
However, the proposed algorithm is not without limitations. For example, the algorithm requires the classifiers and their number to be the same in each layer of the deep ensemble structure. Additionally, some recent studies have proposed other types of classifier selection algorithms; we did not implement them on our testing datasets due to the lack of specific details. Thus, although we have directly compared our proposed method with some very competitive baselines, we have not obtained results for these recent algorithms on our testbed. Future research may contribute to this field through a comprehensive benchmarking of different classifier selection algorithms, identifying the state of the art. Further research may also analyze the theoretical performance of TPE-DEM.