*3.1. XGBoost*

XGBoost [19], whose full name is extreme gradient boosting, is a distributed and efficient implementation of gradient boosted trees. It is an improved model based on the gradient boosting decision tree (GBDT), which belongs to the family of boosting methods. The chief idea of XGBoost is to combine a series of weak learners into a strong learner [2]. By adding new weak learners, the probability of making mistakes is reduced continuously, and the final output value is the sum of the results of many weak learners. To better understand the mechanism of XGBoost, the prediction function, objective function, and optimization process are introduced as follows.

Consider a dataset with $n$ samples and $m$ features, $D = \{(x\_i, y\_i) \mid x\_i \in R^m, y\_i \in R\}$, where $x\_i = \{x\_{i1}, x\_{i2}, \ldots, x\_{im}\}$, $i = 1, 2, \ldots, n$. The basic idea of XGBoost is to iteratively construct $t$ weak estimators to predict the output $y\_i$ from the predictor $x\_i$.

$$\begin{aligned} \hat{y}\_i^0 &= 0\\ \hat{y}\_i^1 &= f\_1(\mathbf{x}\_i) = \hat{y}\_i^0 + f\_1(\mathbf{x}\_i) \\ \hat{y}\_i^2 &= f\_1(\mathbf{x}\_i) + f\_2(\mathbf{x}\_i) = \hat{y}\_i^1 + f\_2(\mathbf{x}\_i) \\ &\cdots \\ \hat{y}\_i^t &= \sum\_{k=1}^t f\_k(\mathbf{x}\_i) = \hat{y}\_i^{t-1} + f\_t(\mathbf{x}\_i) \end{aligned} \tag{1}$$
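For illustration, the additive prediction in Equation (1) can be sketched in a few lines of Python; the weak learners are assumed here to be arbitrary callables (e.g., fitted regression trees), an assumption made only for this sketch.

```python
import numpy as np

def additive_predict(weak_learners, X):
    """Sum the outputs of t weak learners, as in Equation (1).

    weak_learners: list of callables f_k mapping a sample matrix X to per-sample predictions.
    X: array of shape (n_samples, n_features).
    """
    y_hat = np.zeros(X.shape[0])   # y_hat^0 = 0
    for f_k in weak_learners:      # y_hat^k = y_hat^(k-1) + f_k(X)
        y_hat = y_hat + f_k(X)
    return y_hat                   # y_hat^t = sum_k f_k(X)
```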

In Equation (1), each weak estimator $f\_k(\mathbf{x}\_i)$, $k = 1, 2, \ldots, t$, is generated by an iteration of the gradient boosting algorithm, and the output value $\hat{y}\_i^t$ is the sum of the output value of the previous iteration $\hat{y}\_i^{t-1}$ and the present result $f\_t(\mathbf{x}\_i)$. To learn the set of estimators, the objective function that needs to be minimized can be expressed as:

$$L^t(y, \hat{y}^t) = \sum\_{i=1}^n l\left(y\_i, \hat{y}\_i^t\right) + \sum\_{k=1}^t \Omega(f\_k), \tag{2}$$

where $l\left(y\_i, \hat{y}\_i^t\right)$ is the loss function that measures the difference between the target value $y\_i$ and the prediction value $\hat{y}\_i^t$. The second term is the regularization of the model, which is used to penalize the complexity of the entire model, and it can be calculated as follows:

$$\Omega(f\_k) = \gamma T\_k + \frac{1}{2}\lambda \sum\_{j=1}^{T\_k} w\_{kj}^2, \tag{3}$$

Here, $T\_k$ represents the number of leaf nodes in the $k$-th base tree estimator, and $\gamma$ is the penalty parameter for the number of leaf nodes. Meanwhile, $w\_{kj}$ represents the weight of the $j$-th leaf node in the base tree estimator, and $\lambda$ is the penalty parameter for the leaf node weights.
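As a small illustration, the penalty in Equation (3) can be evaluated directly from the leaf weights of one base tree; the function and variable names below are chosen only for this sketch.

```python
import numpy as np

def tree_regularization(leaf_weights, gamma, lam):
    """Omega(f_k) = gamma * T_k + 0.5 * lambda * sum_j w_kj^2  (Equation (3)).

    leaf_weights: array of the T_k leaf weights w_kj of one base tree.
    gamma: penalty on the number of leaf nodes.
    lam: L2 penalty on the leaf weights.
    """
    T_k = len(leaf_weights)
    return gamma * T_k + 0.5 * lam * np.sum(np.asarray(leaf_weights) ** 2)
```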

Having outlined the chief goal of XGBoost [19], we next introduce how the objective is optimized. First, since training proceeds in an additive manner, as in Equation (1), $f\_t$ is greedily added to minimize the objective when predicting the output value $\hat{y}^t$ at the $t$-th iteration.

$$L^t = \sum\_{i=1}^n l\left(y\_i, \hat{y}\_i^{t-1} + f\_t(\mathbf{x}\_i)\right) + \Omega(f\_t) \tag{4}$$
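A minimal sketch of evaluating the objective in Equation (4), assuming the loss is supplied as an element-wise callable and the regularization value of the candidate tree is given (both are placeholders introduced for this illustration):

```python
import numpy as np

def additive_objective(loss, y, y_hat_prev, f_t_out, omega_f_t):
    """Objective at iteration t (Equation (4)).

    loss: callable l(y_i, prediction) evaluated element-wise.
    y: target values; y_hat_prev: predictions from the previous iteration, i.e., y_hat^(t-1).
    f_t_out: outputs f_t(x_i) of the candidate tree; omega_f_t: its regularization Omega(f_t).
    """
    return np.sum(loss(y, y_hat_prev + f_t_out)) + omega_f_t
```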

Using a second-order Taylor expansion of the loss, Equation (4) can be approximated as follows.

$$L^t \cong \sum\_{i=1}^n \left[ l\left(y\_i, \hat{y}\_i^{t-1}\right) + g\_i f\_t(\mathbf{x}\_i) + \frac{1}{2} h\_i f\_t^2(\mathbf{x}\_i) \right] + \Omega(f\_t) \tag{5}$$

where $g\_i = \partial\_{\hat{y}^{t-1}} l(y\_i, \hat{y}\_i^{t-1})$ and $h\_i = \partial^2\_{\hat{y}^{t-1}} l(y\_i, \hat{y}\_i^{t-1})$ denote the first- and second-order gradient statistics of the loss function. By removing the constant term $l(y\_i, \hat{y}\_i^{t-1})$, we obtain the simplified objective as follows.

$$\widetilde{L}^t \cong \sum\_{i=1}^n \left[ g\_i f\_t(\mathbf{x}\_i) + \frac{1}{2} h\_i f\_t^2(\mathbf{x}\_i) \right] + \Omega(f\_t) \tag{6}$$
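The concrete values of $g\_i$ and $h\_i$ in Equation (6) depend on the chosen loss. As a hedged example, if a logistic loss is assumed (a common choice for binary default prediction, although this section does not fix the loss), the statistics take a simple closed form:

```python
import numpy as np

def logistic_gradient_stats(y_true, y_hat_prev):
    """First- and second-order statistics g_i, h_i for the logistic loss.

    y_true: binary labels in {0, 1}.
    y_hat_prev: raw (pre-sigmoid) predictions from the previous iteration, i.e., y_hat^(t-1).
    """
    p = 1.0 / (1.0 + np.exp(-y_hat_prev))  # predicted probability
    g = p - y_true                          # first derivative of the loss w.r.t. the prediction
    h = p * (1.0 - p)                       # second derivative of the loss w.r.t. the prediction
    return g, h
```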

Define the set of samples in the $j$-th leaf node as $I\_j = \{i \mid q(\mathbf{x}\_i) = j\}$ and then expand the regularization term.

$$\begin{split} \widetilde{L}^{t} & \cong \sum\_{i=1}^{n} \left[ g\_{i} f\_{t}(\mathbf{x}\_{i}) + \frac{1}{2} h\_{i} f\_{t}^{2}(\mathbf{x}\_{i}) \right] + \gamma T\_{t} + \frac{1}{2} \lambda \sum\_{j=1}^{T\_{t}} w\_{tj}^{2} \\ &= \sum\_{j=1}^{T\_{t}} \left[ G\_{j} w\_{tj} + \frac{1}{2} \left( H\_{j} + \lambda \right) w\_{tj}^{2} \right] + \gamma T\_{t} \end{split} \tag{7}$$

where $G\_j = \sum\_{i \in I\_j} g\_i$ and $H\_j = \sum\_{i \in I\_j} h\_i$. Then, the optimal weight $w\_j^\*$ of leaf $j$ can be computed by

$$w\_j^\* = -\frac{G\_j}{H\_j + \lambda} \tag{8}$$

and we get the corresponding optimal objective value by substituting the optimal weight $w\_j^\* = -\frac{G\_j}{H\_j + \lambda}$ from Equation (8) into Equation (7).

$$\widetilde{L}^t(q) = -\frac{1}{2} \sum\_{j=1}^{T\_t} \frac{G\_j^2}{H\_j + \lambda} + \gamma T\_t \tag{9}$$

where $\widetilde{L}^t(q)$ is used as an assessment function to evaluate the quality of the tree structure $q(\mathbf{x})$. Specifically, the smaller the value of $\widetilde{L}^t(q)$, the higher the quality of the tree structure.
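The leaf-wise quantities in Equations (8) and (9) can be sketched as follows, assuming the tree structure $q$ has already assigned every sample to a leaf; the leaf-index array and function name below are introduced only for this illustration.

```python
import numpy as np

def leaf_weights_and_score(g, h, leaf_index, n_leaves, gamma, lam):
    """Optimal leaf weights (Equation (8)) and structure score (Equation (9)).

    g, h: per-sample first- and second-order gradient statistics.
    leaf_index: leaf_index[i] = j means sample i falls into leaf j, i.e., q(x_i) = j.
    """
    G = np.bincount(leaf_index, weights=g, minlength=n_leaves)  # G_j: sum of g_i in leaf j
    H = np.bincount(leaf_index, weights=h, minlength=n_leaves)  # H_j: sum of h_i in leaf j
    w_star = -G / (H + lam)                                     # Equation (8)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves  # Equation (9)
    return w_star, score
```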

So far, the model with $t$ base estimators has been constructed, and the prediction value of XGBoost is $\hat{y}\_i^t$, which represents the default probability of the $i$-th corporation in this paper.
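In practice, these update rules need not be implemented by hand. The sketch below shows how such a default-probability model could be fitted with the xgboost Python package; the data, parameter values, and variable names are placeholders rather than the configuration used in this paper.

```python
import numpy as np
from xgboost import XGBClassifier

# Placeholder data: X holds the m features of n corporations, y the observed defaults (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)

# gamma and reg_lambda correspond to the penalty parameters in Equation (3).
model = XGBClassifier(
    n_estimators=200,   # number of base tree estimators t
    max_depth=4,
    learning_rate=0.1,
    gamma=0.1,          # penalty on the number of leaf nodes
    reg_lambda=1.0,     # L2 penalty on the leaf weights
)
model.fit(X, y)

default_probability = model.predict_proba(X)[:, 1]  # \hat{y}_i^t interpreted as a default probability
```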
