1. Introduction
The decision tree, a method for approximating classification functions, was developed in the field of machine learning [1]. Recently, decision tree design methodology has been extended to raise accuracy by adding boosting algorithms, and numerous researchers have focused on this line of research [2,3,4]. Hunt et al. proposed the concept learning system, the earliest decision tree algorithm [5]. The decision tree then gradually developed into a series of algorithms, such as the Iterative Dichotomiser 3 (ID3), C4.5, C5.0, and Classification and Regression Tree (CART) algorithms [6]. The algorithms used in this paper are the C5.0 and CART algorithms, both of which evolved from earlier algorithms with improved overall performance [6]. The C5.0 algorithm is an intuitive and efficient classification method, but its information gain ratio is complex to calculate, and it is prone to overfitting and decision tree bias. To solve these problems, the calculation of the information gain ratio is simplified by formula transformation; in the pruning process, a combination of a loss matrix and confidence intervals is used to decide on pruning, and the weights of multiple models are adjusted, yielding a modified C5.0 algorithm with a boosting method [7]. In a previous study, a classifier ensemble was proposed to enhance diversity, and it provided a near-optimal classifying system [8,9].
In previous studies, the C5.0 and CART algorithms generally suffer from overfitting or insufficiently optimized model performance when dealing with imbalanced data. This leads to decision-making mistakes and unstable predictions when the models are applied to real problems. To overcome these issues, this paper proposes adding a cost matrix and a boosting algorithm [6] and verifies the improvement in decision results through application to actual data.
At the same time, the standard C5.0 decision tree does not distinguish between classification errors with different costs, which can make the total cost of misclassification high. In this paper, we use misclassification cost values and a cost matrix to reduce the high-cost error rate while keeping the change in the overall error rate of the model small. It is expected that the optimized model can reduce the high-cost error rate on the test data. The results show that the effect of the cost matrix is evident and the high-cost error rate can be reduced [10]. Finally, based on the C5.0 decision tree, a boosting algorithm is applied, and a cost matrix is introduced for comparison with the CART algorithm. Model performance is then comprehensively evaluated using the receiver operating characteristic curve, performance evaluation indices, and cross-validation of the decision tree algorithms.
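The cost-matrix idea described above can be sketched as follows: instead of predicting the most probable class, a cost-sensitive classifier picks the class with the lowest expected misclassification cost. The cost values below (a missed subscriber five times as costly as a false alarm) are illustrative assumptions, not the matrix used later in the paper.

```python
# cost[i][j]: cost of predicting class j when the true class is i.
# Classes: 0 = "no" (does not subscribe), 1 = "yes" (subscribes).
COST = [[0, 1],   # true "no":  a false positive costs 1
        [5, 0]]   # true "yes": a false negative costs 5 (illustrative values)

def min_cost_class(probs, cost):
    """Pick the class minimizing expected cost given class probabilities."""
    n = len(cost)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)

# With p(yes) = 0.3 an argmax rule would predict "no" (class 0), but the
# cost-sensitive rule predicts "yes" because missing a subscriber is costly.
print(min_cost_class([0.7, 0.3], COST))   # 1
print(min_cost_class([0.95, 0.05], COST)) # 0
```

This is why adding a cost matrix can lower the high-cost error rate while slightly raising the overall error rate: the decision boundary is shifted toward the expensive class.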
Applying the boosting idea, Pang described the C5.0 algorithm and the corresponding boosting technology in detail, building on the decision tree C4.5 algorithm and embedding the boosting technique [11]. A personal credit rating model was established for a bank based on the C5.0 algorithm, and the model was applied to perform credit rating with the personal credit data of a German bank. By comparing the decision tree results before and after adjustment, the model parameters were tuned; the experimental results show that the discrimination of the decision tree after parameter adjustment is better than before [11]. Furthermore, a modified k-means clustering algorithm has been studied by Ahmad and Dey for mixed numeric and categorical features, not only numeric data [12]. Wang, Jiang, and Hui analyzed existing stock prediction methods, whose accuracy is not high enough and which face challenges such as overfitting or underfitting. In their research, a CART-based decision tree with a boosting method was proposed for stock forecasting; the boosting method cascades multiple decision trees to address the fitting problem. Seven indicators were selected from the stock data, and the mean square error (MSE) and root mean square error (RMSE) were used to evaluate prediction accuracy. Experimental results show that the fitting effect and prediction accuracy of the decision tree after adding the boosting algorithm are higher than those of the original model [13]. Yao et al. researched and analyzed the decision tree C5.0 algorithm. In predictive classification, the cost of misjudgment was considered in the decision tree modeling, conditions for the misjudgment cost values were given, and a cost matrix was established to guide the modeling, minimizing the cost of prediction classification errors while the overall error rate of the model changes little. An in-depth study of the cost-matrix-based decision tree C5.0 algorithm and its application to classification was carried out for a patient classification problem in a Chinese hospital. The final patient classification model has a high classification error rate on the modeling and test data, even though it has the advantages of low risk and good stability [14]. In this paper, we add the boosting idea to the conventional decision algorithm and obtain high accuracy by generating a strong classifier for the given data. The result also overcomes the overfitting problem, and optimal decision results are obtained for the given personal banking data by using a confusion matrix.
The paper is organized as follows: Section 2 presents a preliminary study on data processing and evaluation; for the evaluation, accuracy and sensitivity are introduced with the confusion matrix, and the considered boosting algorithm is described. In Section 3, the C5.0 and CART algorithms are applied to empirical data; after data analysis, it is confirmed that principal component analysis is not needed. The decision results with the conventional C5.0/CART models and with the added boosting algorithm are presented in Section 4. The results are discussed in Section 5, where different choices of the positive class are investigated and illustrated. Finally, conclusions follow in Section 6.
4. Decision Model and Its Evaluation
In this section, the decision tree model with the boosting algorithm is implemented and applied in experiments. To obtain the optimal discrimination, the model is evaluated through a confusion matrix and cross-validation.
4.1. C5.0 with Boosting Algorithm
The C5.0 decision tree with the boosting algorithm is applied to actual data. The cases with an added cost matrix are also considered. The confusion matrices for the test data set are shown in the tables below.
From the 4521 test samples,
Table 6 shows that the C5.0 model predicts 4094 (3853 + 241) samples accurately, and 427 (288 + 139) samples are incorrectly predicted with an error rate of 9.4%.
Table 7 shows that the C5.0 model predicts 4051 (3679 + 372) samples accurately after adding the cost matrix, and 470 (157 + 313) samples are incorrectly predicted with an error rate of 10.4%.
Table 8 indicates that the C5.0 model predicts 4084 (3845 + 239) samples accurately after adding the boosting algorithm, and 437 (290 + 147) samples are incorrectly predicted with an error rate of 9.7%. The error rate of the C5.0 model with the added cost matrix is slightly higher than those of the other models, and adding the boosting algorithm does not significantly improve the performance of the C5.0 algorithm. Next, we analyze the performance of the C5.0 algorithm from the accuracy, precision, and sensitivity viewpoints.
From the calculation results in
Table 9, the C5.0 model's accuracy, precision, and sensitivity are 0.9055, 0.4556, and 0.4558, respectively. The sensitivity of 0.4558 means that 45.58% of potential customers are correctly classified, whereas the precision and sensitivity measures with the cost matrix (CM) increase by around 25 percentage points compared with the plain C5.0 tree.
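These evaluation indices follow directly from the four cells of a binary confusion matrix. The sketch below uses small illustrative counts, since the exact row/column layout of Tables 6-9 is not fully specified in the text.

```python
def binary_metrics(tp, fp, fn, tn):
    """Evaluation indices from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)      # of predicted positives, how many are real
    sensitivity = tp / (tp + fn)    # recall: share of real positives found
    specificity = tn / (tn + fp)    # share of real negatives found
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Illustrative counts only, not taken from the paper's tables.
m = binary_metrics(tp=40, fp=10, fn=20, tn=130)
print({k: round(v, 4) for k, v in m.items()})
```

Note that accuracy can stay high on imbalanced data even when sensitivity is poor, which is exactly the behavior discussed next.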
However, in actual classification problems the number of samples in each category is often unbalanced. If no adjustment is made for this kind of unbalanced data set, the model is easily biased towards the large category, and the small category is ignored [22]. Hence, an index that penalizes model bias can be used to complement the accuracy [23]. According to the calculation formula of Kappa, the more unbalanced the confusion matrix, the lower the Kappa value: it gives a low score to a strongly biased model. Therefore, the higher the Kappa value, the better the model performance [4].
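The Kappa calculation can be reproduced from the confusion matrix alone. As a check, applying it to the Table 6 counts arranged as a 2 × 2 matrix yields the 0.4793 value reported for the plain C5.0 model; since Kappa is invariant under transposing the matrix, the exact row/column orientation does not matter here.

```python
def cohen_kappa(cm):
    """Cohen's Kappa from a square confusion matrix cm[true][pred]."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n                # observed agreement
    rows = [sum(row) for row in cm]
    cols = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(rows[i] * cols[i] for i in range(k)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Table 6 counts for the plain C5.0 model: 3853 + 241 correct, 288 + 139 wrong.
print(round(cohen_kappa([[3853, 288], [139, 241]]), 4))  # 0.4793
```

Accuracy here is 0.9056 while Kappa is only 0.4793: the large chance-agreement term p_e produced by the dominant "no" class is what pulls the score down for a biased model.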
The sensitivity of the C5.0 model with the cost matrix is 0.7032, which means that 70.32% of the customers who confirm the subscription deposit are correctly classified. The Kappa value of the C5.0 model without the cost matrix, 0.4793, is lower than that of C5.0 + CM. Thus, the sensitivity and Kappa value after fitting the C5.0 model with the cost matrix are significantly improved. In brief, the C5.0 model with the cost matrix classifies potential users more accurately and is more suitable for dealing with imbalanced data sets.
From the accuracy point of view, the accuracy of the model with boosting has not changed significantly. Together with the boosting algorithm, cross-validation is used to obtain the average performance of the model. The results are as follows:
It can be seen from the results in
Table 10 that eight candidate models were tested. The results show that trials = 1 gives the best performance according to the Kappa value, while trials = 25 gives the best accuracy rate but an unsatisfactory Kappa value; therefore, choosing the model with trials = 1 not only gives better computing performance but also reduces the possibility of overfitting.
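The trials parameter sets the number of boosting rounds. As a rough illustration of the boosting idea, and not the exact C5.0 implementation, the AdaBoost-style sketch below fits one weak learner per round on reweighted samples and then increases the weights of misclassified samples; the one-feature decision stump and the toy data are simplifying assumptions.

```python
import math

def adaboost_stumps(xs, ys, trials):
    """AdaBoost with decision stumps on a single numeric feature; ys in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n                              # uniform sample weights
    ensemble = []                                  # (alpha, threshold, polarity)
    for _ in range(trials):
        best = None
        for thr in sorted(set(xs)):                # try every threshold/polarity
            for pol in (1, -1):
                pred = [pol if x >= thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, pred)
        err, thr, pol, pred = best
        err = max(err, 1e-10)                      # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this round
        ensemble.append((alpha, thr, pol))
        # up-weight misclassified samples, then renormalize
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def boosted_predict(ensemble, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# toy usage: points below 4 are -1, points at 4 or above are +1
ens = adaboost_stumps([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1], trials=3)
print(boosted_predict(ens, 2), boosted_predict(ens, 5))  # -1 1
```

When the base learner already fits the data well, as with trials = 1 here, extra rounds add little, which matches the Table 10 observation.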
4.2. CART with Boosting Algorithm
The CART decision tree with the boosting algorithm is considered, and the confusion matrices for the test data set are shown in the tables below.
Table 11 shows that the CART model predicts 4083 (3886 + 197) samples accurately, and the model prediction accuracy rate is 90.31%. This accuracy differs little from that of the C5.0 algorithm, and the classification ability of the model is satisfactory.
Table 12 shows the result after the boosting algorithm is added to the CART model: 4112 (3855 + 257) samples are predicted accurately, and the model prediction accuracy rate is 90.95%, a slight increase from 90.31%.
4.3. Cross-Validation
K-fold cross-validation is commonly used to evaluate model performance [24]. Cross-validation differs from repeated random sampling from the sample set: K-fold cross-validation divides all samples into K separate groups, each called a fold. When 10-fold cross-validation is adopted, the data set is randomly divided into 10 parts; 9 of them are used for training and the remaining 1 for testing. This process of training and testing the model is repeated 10 times, and the 10 output results are averaged into a performance index [25,26].
After the models are fitted, 10-fold cross-validation is applied to each of the five algorithms to obtain 10 accuracy values, and the average accuracy is then calculated. Compared with the accuracy of a single model prediction obtained from the confusion matrix, the average accuracy of the 10-fold cross-check is more suitable for evaluating model performance.
As can be seen from the results in
Table 13, the C5.0 model with CM sacrifices model accuracy to improve the sensitivity of the model fitting, thereby ensuring a more accurate classification of potential customers.
As shown in
Table 13, the ranking of the models according to average accuracy is given in Equation (11):
The accuracy rates of the five models are all high, above 90%, indicating that the models have good prediction performance for the sample data.
From the result, CART + boosting and C5.0 + boosting algorithms show satisfactory average accuracy; this means that the boosting algorithm can enhance the performance of the model.
5. Discussion
Depending on the choice of positive class in
Table 1, the calculated model evaluation index values can differ, and therefore different conclusions can be derived. The positive class should be the one of greater concern in practical applications. In this paper,
Positive = yes means that bank customers who subscribe to fixed deposits are treated as positive, and those who do not are treated as negative. In this setting, more attention should be paid to the model's ability to correctly classify potential users. Precision, sensitivity, and specificity are used as evaluation indicators for a particular class, while accuracy, the F1 score, and model fitting time are the criteria for judging the overall classification model.
Evaluation indices for
Positive = yes are illustrated in
Table 14; the overall prediction accuracy of the models differs only slightly. The CART + boosting model has the highest accuracy, reaching 90.95%, and the C5.0 + CM model has the lowest accuracy among the five models at 89.60%. At the same time, comparing the CART and CART + boosting models, the precision increased from 37.24% to 65.23%, which means that the model's ability to predict potential customers has improved. After the boosting algorithm was added to the CART model, the F1 score increased from 47.36% to 55.69%. Therefore, adding the boosting algorithm to the CART model improves its performance considerably, but the CART model with boosting needs a long time to classify large data sets.
From the accuracy and F1 results, the model performance did not improve after adding the boosting algorithm to C5.0. Because the C5.0 algorithm is mainly strengthened by increasing the number of iterations, and according to the data in
Table 13 the C5.0 model is optimal when
trials = 1, the improvement from the boosting algorithm is not significant.
However, after adding the CM, although the model's overall prediction accuracy decreased, the precision and F1 score improved. After adding the cost matrix to the C5.0 model, the precision and F1 values surpassed those of the other four models, at 70.32% and 61.29%, respectively. Not only is its performance in correctly classifying potential users the best, but the model fitting time is also shorter. Hence, in the case of Positive = yes, the C5.0 model with the CM shows the best performance.
In
Table 15, the evaluation indices for
Positive = no are illustrated; in this case, the bank customers who do not subscribe to the fixed deposit are treated as positive. Comparing with
Table 14, the overall prediction accuracy rate is unchanged, but precision and sensitivity are greatly increased while specificity is decreased. This is because, in the overall sample, the number of customers who will not subscribe to fixed deposits far exceeds the number who will. Observing the four indices of accuracy, precision, sensitivity, and F1 score, the overall performance differences among the five models are very small. It is worth noting that the accuracy and F1 score of the CART model with the boosting algorithm are still improved; this shows that the boosting algorithm can indeed enhance model performance, although the effect is not significant when
Positive = no.
6. Conclusions
This paper introduces the basic principles of the C5.0 and CART algorithm models and uses the personal information data of 45,211 customers of the Bank of Portugal, with seven continuous variables and nine discrete variables, to conduct an empirical study on whether they subscribe to fixed deposits. The confusion matrix and cross-validation methods are used to compare model performance. This paper fits two basic models, the C5.0 and CART algorithm models; based on each, a boosting algorithm is added, and a cost matrix (CM) is additionally added to the C5.0 algorithm for model fitting. In the final comparison of models, the accuracy, F1 score, and the average accuracy of the 10-fold cross-check are used to evaluate overall model performance, while recall, precision, and specificity are calculated per class to evaluate the class-specific performance on the given banking data. The test results show:
- (1)
The performance improvement of the C5.0 algorithm after combining it with the boosting algorithm is not significant. This is because the experimental data set is unbalanced: the number of customers who do not subscribe to the time deposit is much higher than the number who do. On such an unbalanced data set, if the model is not adjusted, it easily biases towards the large category and gives up on the small category. The experiments in
Table 10 show that the model with one iteration (trials = 1), i.e., the C5.0 algorithm itself, has the highest Kappa value, indicating that the plain C5.0 algorithm has the lowest bias and handles the imbalanced data set better than the model with the added boosting algorithm. Therefore, for unbalanced data classification, the performance improvement of the C5.0 algorithm combined with the boosting algorithm is not significant.
- (2)
Among all the fitted models, the sensitivity of the C5.0 model with the added CM is 13% and 54% higher than that of the C5.0 + boosting model and the plain C5.0 model, respectively. The results are illustrated in
Table 9. Therefore, the problem must be considered comprehensively, and the model must be chosen with accuracy, sensitivity, or other criteria in mind. After the requirements are clarified, the models are further fitted and compared. The boosting algorithm combines multiple weak classifier models; for a base model that already fits well, boosting may not significantly improve performance, so a model with lower computational complexity and a better fit should be chosen. For example, if a boosting algorithm is added to the weaker ID3 algorithm, the effect would be more significant.
- (3)
The bank customer classification problem is used as an example. In an actual decision problem, the speed of model fitting also needs to be considered. On the one hand, this paper evaluates whether bank customers will subscribe to fixed deposits. For classifying customers who subscribe to time deposits, the C5.0 model with the CM is recommended, because its higher sensitivity improves the model's ability to identify potential users; it predicts more customers who will subscribe to time deposits, facilitating the bank's business development. On the other hand, it is also necessary to make predictions for users who will not book fixed deposits, since banking covers a wide range of business and other financial services can be promoted to them. Because the data set is large, the plain C5.0 model is recommended for these predictions: its fitting time is shorter, the performance difference is small, and the accuracy rate is rather high.
Finally, the analysis of the proposed methodology can provide a more reliable basis for decision makers. How to set better indicators to measure model performance, and how to determine whether the set of compared models is comprehensive, are issues to be discussed in later work, and more in-depth research is expected.