## *4.2. Credit Default Prediction Performance*

In this section, we present a comparative analysis between our proposed CT-XGBoost model and other commonly used models: logistic regression, SVM, neural network, random forest, and XGBoost. Table 1 summarizes the average prediction results of the different models over ten repetitions of 5-fold cross-validation. Among the conventional models, XGBoost showed the best performance, with an AUC of 95.44%, and it also led on the other evaluation metrics. Similarly, Zhang and Chen [10] found that XGBoost outperforms logistic regression, SVM, random forest, and other conventional models in credit default prediction, achieving an AUC of 91.4%. Moreover, Wang et al. [35] constructed an XGBoost model for default prediction and reported an AUC of 88.07%. By comparison, the credit default prediction performance in our study is better than in these previous studies. The main reason for the difference is the dataset: we focus on credit defaults of companies in the energy industry. Overall, these results demonstrate that XGBoost is an efficient algorithm for credit default prediction, and its superior performance makes it a rational choice as the base model to be modified.

**Table 1.** Credit default prediction performance comparison analysis of different prediction models.


Note: Bold values indicate the best performance for each metric.
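As a point of reference, the benchmark protocol above can be sketched with standard tooling. The snippet below is a minimal, illustrative setup, not the paper's exact configuration: the data loader `load_energy_default_data` is a hypothetical placeholder, and the hyperparameters are library defaults rather than tuned values.

```python
# Minimal, illustrative sketch of the benchmark protocol described above:
# ten repetitions of 5-fold cross-validation, reporting mean AUC per model.
# load_energy_default_data is a hypothetical placeholder, and the
# hyperparameters are library defaults, not the paper's configuration.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = load_energy_default_data()  # placeholder: features and 0/1 default labels

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "neural network": MLPClassifier(max_iter=500),
    "random forest": RandomForestClassifier(n_estimators=500),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: mean AUC = {auc.mean():.4f}")
```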

It is notable that the chief aim of credit default prediction is to identify as many default samples as possible without misclassifying too many non-default samples. On this criterion, the performance of XGBoost did not meet our expectations: its type I accuracy (the correct classification rate on default samples) is only 71.07%, whereas its type II accuracy (the rate on non-default samples) reaches 99.02%. The cause of this gap is the class imbalance problem, which lets the majority class of non-default samples overwhelm the prediction model. Thus, this study proposes the novel CT-XGBoost model to address the class imbalance problem.
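For clarity, the three accuracy metrics can be computed from a confusion matrix as in the hedged sketch below, which assumes label 1 marks the default class and that `y_test` and `y_pred` hold true and predicted labels; this reading of type I and type II accuracy is consistent with the figures reported above.

```python
# Hedged sketch of the three metrics discussed above, computed from a
# confusion matrix. Assumes label 1 marks the default (minority) class and
# that y_test / y_pred are the true and predicted labels (not paper data).
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
type_i_accuracy = tp / (tp + fn)    # correct rate on default samples
type_ii_accuracy = tn / (tn + fp)   # correct rate on non-default samples
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"type I: {type_i_accuracy:.4f}, "
      f"type II: {type_ii_accuracy:.4f}, overall: {overall_accuracy:.4f}")
```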

First, comparing the type I, type II, and overall accuracy, we can see that the CT-XGBoost model strikes a more rational balance than the conventional models. The type I accuracy of CT-XGBoost is 91.43%, which is 20.36 percentage points higher than that of the representative XGBoost model, implying that our proposed model has a superior ability to identify default samples. Second, although the type II and overall accuracy of our proposed model are lower than those of the other models, the sacrifices are only 9.75 and 4.96 percentage points, respectively, which are smaller than the gain in type I accuracy; moreover, since the main aim of credit default prediction is to identify default samples accurately, this trade-off is acceptable. Finally, regarding the AUC, which evaluates the comprehensive performance of a prediction model, our proposed model again surpasses the benchmarks: the average AUC of CT-XGBoost is 96.38%, whereas the AUC values of the other default prediction models range from 90.35% to 95.44%. These results suggest that our proposed model, which modifies the XGBoost model with cost-sensitive and threshold methods, outperforms the benchmark models when dealing with the class imbalance problem.
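The two modifications named above can be sketched as follows, assuming standard xgboost and pre-split training data (`X_train`, `y_train`, `X_test`); the class weight and the 0.35 cutoff are illustrative stand-ins, not the paper's tuned CT-XGBoost settings.

```python
# Hedged sketch of the two imbalance remedies combined in CT-XGBoost:
# (1) cost-sensitive learning via a weight on the minority (default) class,
# and (2) threshold moving on the predicted probabilities. The weight and
# the cutoff below are illustrative, not the paper's tuned settings.
from xgboost import XGBClassifier

# Cost-sensitive step: up-weight default samples by the class ratio.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss")
model.fit(X_train, y_train)

# Threshold step: lower the cutoff below 0.5 so borderline samples are
# flagged as defaults, trading type II accuracy for type I accuracy.
threshold = 0.35  # illustrative; in practice tuned on validation data
y_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
```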

## *4.3. The Importance of Predictors*

A practical default prediction model should deliver not only good accuracy but also clear, interpretable results. For a model to be acceptable to users, transparency in the decision process is indispensable. For instance, under the Equal Credit Opportunity Act of the U.S., creditors are required to provide applicants, upon request, with the specific reasons underlying a credit denial. Previous studies [35,36] have proposed methods to identify the significant performance drivers of the XGBoost model in default prediction. In this study, we applied the "Feature Importance" function to estimate the importance of the financial features used in our proposed CT-XGBoost model.

Before introducing the "Feature Importance" function, the node-splitting procedure of CT-XGBoost needs to be explained. First, a subset of the features is selected as the candidate set. Then, the split point of a node in the tree is determined by a greedy algorithm: a gain score derived from the objective function is computed for each candidate split, and the candidate with the best score is chosen. Define $I_L$ and $I_R$ as the sample sets of the left and right child nodes after splitting, and assume $I = I_L \cup I_R$. The objective function value $\widetilde{L}^{t}_{no\text{-}split}$ before splitting and the objective function value $\widetilde{L}^{t}_{split}$ after splitting can then be obtained as follows:

$$\widetilde{L}^{t}_{no\text{-}split} = -\frac{1}{2}\,\frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} + \gamma T_{no\text{-}split} \tag{15}$$

$$\widetilde{L}^{t}_{split} = -\frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right] + \gamma T_{split} \tag{16}$$

where $G$ and $H$ denote the sums of the first and second derivatives of the loss over the samples in a node, the subscripts $L$ and $R$ indicate the left and right child nodes, $\lambda$ is the regularization coefficient, $\gamma$ is the complexity penalty per leaf, and $T$ is the number of leaves. The Gain value, i.e., the loss reduction, for each candidate split in the $t$-th tree can then be calculated as Equation (17), and the split with the highest Gain value is chosen as the splitting point.

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda}\right] - \gamma \tag{17}$$
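To make the scoring concrete, the sketch below implements Equation (17) directly; the gradient and Hessian values are fabricated for illustration only.

```python
# Sketch of Equations (15)-(17): the split gain is the drop in the structure
# score when a node's samples are partitioned into left and right children.
# The gradient/Hessian values below are fabricated for illustration.
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of a candidate split; g, h are per-sample 1st/2nd derivatives."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G * G / (H + lam)  # structure score of one node
    # Eq. (17): children's scores minus the parent's score, minus gamma.
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

g = np.array([-0.8, -0.6, 0.4, 0.5, 0.7, -0.2])     # first derivatives
h = np.array([0.16, 0.24, 0.24, 0.25, 0.21, 0.16])  # second derivatives
mask = np.array([True, True, True, False, False, False])  # candidate split
print(split_gain(g, h, mask, lam=1.0, gamma=0.1))
```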

The Gain value can be used to estimate the importance of a feature, as it measures the feature's ability to separate the default and non-default samples. Considering that CT-XGBoost is an ensemble in which a number of trees must be considered simultaneously, we calculated the "Feature Importance" function for the $r$-th feature as follows:

$$Importance_r = \frac{\sum_{k=1}^{t} Gain_r^k}{\sum_{r=1}^{m}\left(\sum_{k=1}^{t} Gain_r^k\right)} \tag{18}$$

Here, $Gain_r^k$ is the Gain value for the $r$-th feature in the $k$-th tree, $t$ is the number of trees, and $m$ is the number of features. With the "Feature Importance" function thus defined, the importance of the financial variables can be calculated with Equation (18).
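A small worked example of Equation (18), using a made-up $t \times m$ array of per-tree, per-feature Gain totals:

```python
# Worked example of Equation (18): per-feature Gain values are summed over
# all trees, then normalized so importances across the m features sum to 1.
# gain_per_tree is a made-up (t x m) array, not values from the paper.
import numpy as np

gain_per_tree = np.array([[3.2, 0.0, 1.1],
                          [2.5, 0.4, 0.0],
                          [1.8, 0.2, 0.9]])   # t = 3 trees, m = 3 features

total_gain = gain_per_tree.sum(axis=0)        # numerator: sum over k = 1..t
importance = total_gain / total_gain.sum()    # denominator: sum over features
print(importance)  # normalized importances sum to 1 by construction
```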

Table 2 presents the feature importance results for the top 20 financial variables, ranked from highest to lowest importance. The ten most important features contributing to the CT-XGBoost model's credit default prediction ability are: (1) other receivables, (2) sales expense, (3) long-term deferred, (4) non-operating income, (5) accounts receivable, (6) taxes, (7) prepaid accounts, (8) liabilities and owner's equity, (9) capital reserves, and (10) net cash flow generated from operating activities. The higher a feature's importance, the stronger the ability of that financial variable to separate the default and non-default samples. These results may be of great worth to practitioners, as they can help explain why an applicant is classified into the credit default class.
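As a practical counterpart to Table 2 (a sketch, not the paper's tooling), the per-feature total gain can be retrieved from a fitted xgboost model, such as `model` from the earlier sketch, and normalized as in Equation (18):

```python
# Practical counterpart to Table 2 (a sketch, not the paper's tooling):
# given a fitted XGBClassifier such as `model` from the earlier sketch,
# pull the total gain per feature from the booster and rank the top 20.
gain = model.get_booster().get_score(importance_type="total_gain")
total = sum(gain.values())
top20 = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:20]
for rank, (feature, g) in enumerate(top20, start=1):
    print(f"{rank:2d}. {feature}: normalized importance = {g / total:.4f}")
```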


**Table 2.** Feature importance of the top 20 important financial variables in the CT-XGBoost model.
