### 3.2. CT-XGBoost

XGBoost performs strongly across a wide range of prediction tasks. Nonetheless, its performance can be limited by the class-imbalance problem in credit default data, so it is worthwhile to modify XGBoost to adapt to the class imbalance of the credit default dataset. Assume the training dataset contains $N$ samples in total, of which $N_d$ are credit default samples and $N_n$ are non-default samples. In the real world, the number of non-default samples is far greater than the number of default samples, which causes the class-imbalance problem, with the imbalance ratio defined as $N_n / N_d$. To address this problem, we propose a novel model, CT-XGBoost, which is modified from the XGBoost model.

Specifically, we modify XGBoost in two aspects: (1) A cost-sensitive strategy assigns higher misclassification costs to default class samples than to non-default class samples. During the calculation of the loss function, a new parameter, called the penalty ratio in this paper, controls the ratio of misclassification costs between the two classes. (2) We set a more reasonable classification threshold that accounts for the class imbalance; samples are divided into two groups based on their predicted default probabilities. Corporates with default probabilities above the threshold are classified into the default group, and those with default probabilities below the threshold into the non-default group (a minimal sketch of this thresholding step appears below). The two modifications are explained in detail in the following subsections.
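To make the thresholding step in (2) concrete, the following is a minimal Python sketch. The threshold value `tau` here is hypothetical and the function name is chosen for illustration; the paper's specific imbalance-aware choice of threshold is described separately.

```python
import numpy as np

def classify_by_threshold(default_probs: np.ndarray, tau: float) -> np.ndarray:
    """Label samples with predicted default probability above tau as
    default (1) and the rest as non-default (0)."""
    return (default_probs > tau).astype(int)

# Hypothetical usage: under class imbalance, tau would typically be set
# below the conventional 0.5 rather than at it.
probs = np.array([0.02, 0.15, 0.60, 0.08])
print(classify_by_threshold(probs, tau=0.10))  # [0 1 1 0]
```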

#### 3.2.1. Cost-Sensitive Strategy

In the process of training a default prediction model, an important step is to compute the objective function. Equation (2) is the objective function of XGBoost. In Equation (2), the first term $\sum_{i=1}^{n} l\left(y_i, \hat{y}_i^t\right)$ is the loss function, which measures the disparity between the predicted results and the true results [23].
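For reference, although Equation (2) is stated in an earlier section, the XGBoost objective at boosting iteration $t$ is conventionally written in the following standard form, where $\Omega$ is the regularization term penalizing tree complexity (a standard restatement, not a new result):

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{t}\right) + \sum_{k=1}^{t} \Omega\left(f_k\right)$$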

However, in this loss function every sample carries the same importance: the misclassification costs of default class and non-default class samples are equal. Because the non-default samples form the majority under class imbalance, their total contribution to the loss may be much larger than that of the default samples, and the model may wrongly take correctly classifying the non-default samples as its chief aim. It is therefore important to modify XGBoost by assigning higher misclassification costs to default class samples in the training process.

In CT-XGBoost, to increase the misclassification cost of default class samples, we modify the loss function $\sum_{i=1}^{n} l\left(y_i, \hat{y}_i^t\right)$ as follows.

$$\sum_{i=1}^{n} \left[ y_i \ast C_d \ast l\left(y_i, \hat{y}_i^t\right) + (1 - y_i) \ast C_n \ast l\left(y_i, \hat{y}_i^t\right) \right] \tag{10}$$

where $C_d$ and $C_n$ are the misclassification-cost weights for default and non-default class samples, respectively. Since the absolute magnitudes of $C_d$ and $C_n$ do not influence the training process (only their ratio matters), we define a new parameter $p$, called the penalty ratio, which equals $C_d / C_n$. In this paper, we set the penalty ratio $p$ to the dataset imbalance ratio $N_n / N_d$, so the loss contributed by default samples becomes correspondingly larger. For example, with $N_n = 900$ non-default and $N_d = 100$ default samples, $p = 9$, and each default sample contributes nine times as much to the loss as a non-default sample.
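As an illustration, the weighted loss in Equation (10) can be reproduced with standard per-sample weights, since a sample weight multiplies that sample's loss term. The following is a minimal sketch, assuming the `xgboost` and `scikit-learn` Python packages and synthetic data standing in for the real credit default set; the variable names and hyperparameters are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic imbalanced data in place of the credit default set
# (class 1 = default, class 0 = non-default).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

N_d = int((y == 1).sum())  # number of default samples
N_n = int((y == 0).sum())  # number of non-default samples
p = N_n / N_d              # penalty ratio set to the imbalance ratio N_n / N_d

# Eq. (10) with C_n = 1 and C_d = p: each sample's loss term is scaled
# by the misclassification cost of its class.
sample_weight = np.where(y == 1, p, 1.0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X, y, sample_weight=sample_weight)
```

For binary classification, XGBoost's built-in `scale_pos_weight` parameter applies the same positive-class weighting, so passing `scale_pos_weight=p` is an equivalent shortcut to the explicit sample weights above.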
