**Preface to "Advances in Credit Risk Modeling and Management"**

Correctly assessing credit risk still represents an important challenge for both practitioners and scholars. On the one hand, credit risk measures play a central role in the banking sector's regulations, governing the profitability of financial institutions which remain at the heart of our economic system. On the other hand, effectively computing such measures in a sound and rigorous way triggers important challenges because of the lack of relevant information and/or models. It is therefore important that academics pursue efforts to improve their models. This book presents some recent advances which methodologically and/or computationally contribute to the more rigorous and reliable management of credit risk of firms. The book covers default and recovery rate models, trade credit, counterparty credit risk, and hybrid product pricing.

> **Frédéric Vrins** *Special Issue Editor*

## *Article* **Modelling Recovery Rates for Non-Performing Loans**

#### **Hui Ye \* and Anthony Bellotti \***

Department of Mathematics, Imperial College London, London SW7 2AZ, UK

**\*** Correspondence: hui.ye16@alumni.imperial.ac.uk (H.Y.); a.bellotti@imperial.ac.uk (A.B.)

Received: 12 February 2019; Accepted: 15 February 2019; Published: 20 February 2019

**Abstract:** Based on a rich dataset of recoveries donated by a debt collection business, recovery rates for non-performing loans taken from a single European country are modelled using linear regression, linear regression with Lasso, beta regression and inflated beta regression. We also propose a two-stage model: beta mixture model combined with a logistic regression model. The proposed model allowed us to model the multimodal distribution we found for these recovery rates. All models were built using loan characteristics, default data and collections data prior to purchase by the debt collection business. The intended use of the models was to estimate future recovery rates for improved risk assessment, capital requirement calculations and bad debt management. They were compared using a range of quantitative performance measures under *K*-fold cross validation. Among all the models, we found that the proposed two-stage beta mixture model performs best.

**Keywords:** recovery rates; beta regression; credit risk

#### **1. Introduction**

In Basel II, an internal ratings-based (IRB) approach was proposed by the Basel Committee in 2001 to determine capital requirements for credit risk (Bank for International Settlements 2001). This IRB approach grants banks permission to use their own risk models or assessments to calculate regulatory capital. Under the IRB approach, banks are required to estimate the following risk components: probability of default (PD), loss given default (LGD), exposure at default (EAD) and maturity (M) (Bank for International Settlements 2001). Since Basel II's capital requirement calculation depends heavily on LGD, financial institutions have put more emphasis on modelling LGD in recent years. Unlike the estimation of PD, which is well-established, LGD is not so well-understood and still subject to research. Improving LGD modelling can help financial institutions assess their risk and regulatory capital requirement more precisely, as well as improving debt management.

LGD is defined as the proportion of money financial institutions fail to collect during the collection period, given the borrower has already defaulted. Conversely, Recovery Rate (RR) is defined as the proportion of money financial institutions successfully collected minus the administration fees during the collection period, given the borrower has already defaulted. Equations (1) and (2) give formal definitions of RR and LGD, respectively:


For each individual *i*, let *R<sub>i</sub>* denote the sum of collections, *A<sub>i</sub>* the total administration fees and *EAD<sub>i</sub>* the exposure at default (the outstanding balance at default). Then,

$$\text{Recovery Rate} = \frac{R\_i - A\_i}{EAD\_i} = \frac{\sum \text{Collections} - \sum \text{Admin Fee}}{\text{Outstanding Balance at Default}} \tag{1}$$

and

$$\text{Loss Given Default} = 1 - \text{Recovery Rate} = 1 - \frac{R\_i - A\_i}{EAD\_i} \tag{2}$$

RR mainly lies in the interval [0, 1] and typically has high concentrations at the boundary points 0 and 1. It is possible for RR to be negative if recoveries are less than administration costs, *Ai* > *Ri*, and greater than 1 if recoveries exceed exposure plus administration costs, *Ri* > *EADi* + *Ai*. Typically, however, RR is truncated within the interval [0, 1] when developing LGD models.
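
Equations (1) and (2), with the truncation just described, can be sketched as follows; the function names and account figures are our own invented examples, not from the study.

```python
# Illustrative computation of Equations (1) and (2); the figures are made up.

def recovery_rate(collections, admin_fees, ead, truncate=True):
    """RR_i = (sum of collections - sum of admin fees) / EAD_i."""
    rr = (sum(collections) - sum(admin_fees)) / ead
    # Truncate to [0, 1], as is typical when developing LGD models.
    return min(max(rr, 0.0), 1.0) if truncate else rr

def loss_given_default(collections, admin_fees, ead, truncate=True):
    """LGD_i = 1 - RR_i."""
    return 1.0 - recovery_rate(collections, admin_fees, ead, truncate)

# A raw RR is negative when admin fees exceed collections (A_i > R_i):
print(recovery_rate([50.0], [80.0], 1000.0, truncate=False))  # -0.03
print(recovery_rate([50.0], [80.0], 1000.0))                  # 0.0 after truncation
```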

The main challenge in estimating LGD is the bimodal property, with high concentrations at 0 and 1, typically present in empirical LGD distributions, where people either repay in full or repay nothing. For the dataset used in this study, we found the LGD distribution is actually trimodal. Regression models have therefore been studied that deal specifically with this problem. For example, Bellotti and Crook (2012) built Tobit and decision tree models, along with beta and fractional logit transformations of the RR response variable, to forecast LGD based on a dataset of 55,000 defaulted credit cards in the UK from 1999 to 2005. They concluded that ordinary least squares regression with macroeconomic variables performed best in terms of forecast performance. Calabrese (2012) proposed a mixed continuous-discrete model, where the boundary values 0 and 1 are modelled by Bernoulli random variables and the continuous part of the RR is modelled by a beta random variable. This model was applied to predict RR for Bank of Italy loans from 1985 to 1999, and its results were compared with Papke and Wooldridge's fractional response model with log-log, logistic and complementary log-log link functions (Papke and Wooldridge 1996) and with linear regression; the mixed continuous-discrete model achieved the best performance. Qi and Zhao (2011) applied four linear models, namely ordinary least squares regression, fractional response regression, inverse Gaussian regression and inverse Gaussian regression with beta transformation, and two non-linear models, namely regression tree and neural network, to model the LGD of 3751 defaulted bank loans and bonds in the US from 1985 to 2008. They concluded that fractional response regression is slightly better than ordinary least squares regression and that the non-linear models perform best. Loterman et al. (2012) performed a benchmark study of LGD by comparing twenty-four different models on six datasets extracted from international banks. They concluded that non-linear models, such as neural networks, support vector machines and mixture models, perform better than linear models.

For this project, we modelled and predicted RR for data from a single European country, provided by a debt collection company. For reasons of commercial confidentiality and data protection, the debt collection company remains anonymous and some aspects of the data were also anonymised, including the country of origin; consequently, the data cannot be made publicly available. We applied some of the models studied previously and also extended them, proposing a new beta mixture model to improve the accuracy of RR prediction. A good prediction of RR would help the debt collection company determine collection policy for new debt portfolios. It is important to note that the RR we modelled differs from RR as usually defined, since the data only contain positive repayments and no administration fees were recorded; therefore, all the RRs in our data lie in the range (0, 1] instead of [0, 1]. Figure 1 shows a histogram of RR for the data. There are clear modes at 0 and approximately 0.55, and a high spike at the boundary value 1. Since the empirical RR distribution is trimodal, it is reasonable to treat the recovery rate as a mixed-type random variable. The multi-modality of RR is a natural consequence of different groups of bad debts being serviced using different strategies; e.g., one strategy may be that some bad debts are written off once the debtor has paid back some agreed fixed percentage of the outstanding balance. The outcome RR lying within (0, 1] motivated the use of the beta regression model, and its multi-modal nature motivated the use of a mixture model within this context.

The beta mixture model has been applied successfully within several other application domains. Ji et al. (2005) showed how to apply the beta mixture regression model in several bioinformatics applications, such as meta-analysis of gene expression data and clustering of correlation coefficients between gene expressions. Laurila et al. (2011) used a beta mixture model to describe DNA methylation patterns, helping to reduce the dimensionality of microarray data. Moustafa et al. (2018) used a beta mixture model as the basis of an anomaly detection system. Their network data are typically bounded, which suggests a beta distribution, and the use of the beta mixture allowed them to identify latent clusters in normal network use.

**Figure 1.** Histogram of recovery rates for 8237 loans after pre-processing described in Section 2. The stack of 1s shows frequency of *RR* = 1, but the stack at 0 shows frequency for small *RR* > 0.

Inspired by Calabrese's mixed continuous-discrete model (Calabrese 2012), we propose a two-stage model composed of:

1. a logistic regression model to estimate the probability that a loan is repaid in full (RR = 1); and
2. a two-component beta mixture regression model for RR in the open interval (0, 1).


The proposed model represents the trimodal feature of the data. The beta mixture component groups clients into two clusters for RR < 1, based on their personal information, debt conditions and repayment history, which may also provide useful information for other business analysis and decision-making; the logistic regression component models the third case, RR = 1. In addition, we used linear regression, linear regression with Lasso, beta regression and inflated beta regression to model RR. Model performance was measured by mean squared error, mean absolute error and mean aggregate absolute error under *K*-fold cross validation.

To our knowledge, this is the first study to estimate RR for portfolios of non-performing loans using a statistical model, and the first use of a beta mixture model for LGD. We also developed a novel procedure for predicting the expected value of the outcome from a beta mixture model, based on assigning a new observation to one of the clusters in the mixture. The remainder of the article is organised as follows: Section 2 provides a detailed data overview; Section 3 introduces the modelling methodology, with particular emphasis on the proposed beta mixture model combined with a logistic regression model; Section 4 analyses some important features of the models and reports model performance; and Section 5 concludes with key findings and future recommendations.

#### **2. Data**

Three datasets were provided by the debt collection company:

1. Dataset 1: loan characteristics and default data for each debtor;
2. Dataset 2: the bank's historical (pre-purchase) collections activity, such as calls, contacts, visits and payments;
3. Dataset 3: repayments collected after the portfolio was purchased by the debt collection company.


Figure 2 shows how the data were joined. There are 8237 data points in Dataset 3, but historical collections information is recorded in Dataset 2 for only 7161 individuals. For the remaining 1076 individuals there were no historical recoveries by the bank, i.e., no calls, contacts, visits or payments, so a value of 0 was assigned to aggregate recoveries in Dataset 2 for these individuals. The modified Dataset 2 was then joined to Datasets 1 and 3 by the unique key Loan.Ref, giving a table of 8237 data points with 61 variables.

Table A1 gives descriptive statistics for each of the variables in the joined dataset used in the statistical modelling. The predictor variable Pre-Recovery Rate is the bank's RR before the debt portfolio was purchased. Its minimum value is −0.130, which is negative because administration fees exceeded repayments during the collection period. The predictor variable Credit Bureau Score is a generic credit score provided by a credit bureau.

**Figure 2.** Joining the three datasets.

#### *Recovery Rate Calculation*

Since the repayments in Datasets 2 and 3 were recorded as monthly activity summaries, each individual may have several repayments for the same loan. We therefore defined the recovery rate as the sum of repayments minus the administration fee (where available) divided by the original balance of the loan, which is equivalent to the difference between the original and ending balances divided by the original balance. For each individual *i*, RR is calculated using:

$$\text{Recovery Rate}\_{i} = \frac{\sum \text{Repayments}\_{i} - \text{AdminFee}\_{i}}{\text{Original Balance}\_{i}} = \frac{\text{Original Balance}\_{i} - \text{Ending Balance}\_{i}}{\text{Original Balance}\_{i}} \tag{3}$$

Figure 1 is the empirical RR histogram calculated from Equation (3) for the 8237 data points after pre-processing. The remaining 112,462 data points essentially have RR = 0, but since we do not know whether they were serviced or not, they were excluded from the analysis. Essentially, the goal of our model is to estimate RR computed from Dataset 3 (post-purchase), based on pre-purchase information given in Datasets 1 and 2.

#### **3. Modelling Methodology**

We applied various models to estimate RR. In all cases, model performance was measured within a *K*-fold cross validation framework. We first tried ordinary least squares linear regression, with and without stepwise backward variable selection using the AIC criterion. In the following sub-sections, we list the other modelling approaches we explored. Let *y* denote the outcome variable, recovery rate, and *X* a corresponding vector of predictor variables.

#### *3.1. Linear Regression with Lasso*

We applied linear regression with a Lasso (Least Absolute Shrinkage and Selection Operator) penalty. The model structure is

$$y = \beta\_0 + \boldsymbol{\beta}^T \boldsymbol{X} + \epsilon$$

where *β*<sub>0</sub> and *β* are the intercept and coefficients to be estimated and *ε* is the error term. Estimation using least squares with the Lasso penalty is then given by the following optimisation problem on a training dataset of *N* observations:

$$(\hat{\beta}\_0, \hat{\boldsymbol{\beta}}) = \operatorname\*{argmin}\_{\beta\_0, \boldsymbol{\beta}} \left[ \frac{1}{N} \sum\_{i=1}^{N} \left( y\_i - \beta\_0 - \boldsymbol{\beta}^T \boldsymbol{X}\_i \right)^2 + \lambda \sum\_{j=1}^{p} |\beta\_j| \right],\tag{4}$$

where *λ* > 0 is a tuning parameter controlling the size of regularisation. Regression with Lasso will tend to shrink coefficient estimates to zero and hence is a form of variable selection (Friedman et al. 2010). The value of *λ* is chosen using *K*-fold cross validation. For this project, the R packages "lars" (Hastie and Efron 2013) and "glmnet" (Friedman et al. 2010) were used to estimate linear regression with Lasso.
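
As an illustrative sketch of Equation (4) on synthetic data (the study itself used the R packages "lars" and "glmnet"), the following Python code fits a Lasso with the tuning parameter *λ* (called `alpha` in scikit-learn) chosen by *K*-fold cross validation:

```python
# Lasso regression with cross-validated tuning parameter, on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, p = 500, 10
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])
y = 0.3 + X @ beta_true + rng.normal(scale=0.5, size=N)

model = LassoCV(cv=5, random_state=0).fit(X, y)   # 5-fold CV over a grid of lambdas
print("chosen lambda:", model.alpha_)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```

Coefficients of irrelevant predictors are shrunk towards zero, illustrating the variable-selection effect described above.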

#### *3.2. Multivariate Beta Regression*

The problem with linear regression is that it does not take account of the particular distribution of RR, which is between 0 and 1. The beta distribution, with two shape parameters *α* and *β*, allows us to model RR in the open interval (0, 1):

$$f(y\_i; \alpha\_i, \beta\_i) = \frac{\Gamma(\alpha\_i + \beta\_i)}{\Gamma(\alpha\_i)\Gamma(\beta\_i)} y\_i^{\alpha\_i - 1} (1 - y\_i)^{\beta\_i - 1}, \quad 0 < y\_i < 1,\tag{5}$$

where *α*, *β* > 0 are the shape parameters and Γ(·) is the Gamma function. Following Ferrari and Cribari-Neto (2004), the beta distribution is reparameterised by mean and precision parameters, denoted *μ* and *φ* respectively, since this parameterisation expresses the expected value and variance meaningfully:

$$\phi\_i = \alpha\_i + \beta\_i, \quad E(y\_i) = \mu\_i = \frac{\alpha\_i}{\alpha\_i + \beta\_i}, \quad Var(y\_i) = \frac{\mu\_i (1 - \mu\_i)}{\phi\_i + 1}.\tag{6}$$

The reparameterised beta distribution is then

$$f(y\_i; \mu\_i, \phi\_i) = \frac{\Gamma(\phi\_i)}{\Gamma(\mu\_i \phi\_i)\Gamma((1-\mu\_i)\phi\_i)} y\_i^{\mu\_i \phi\_i - 1} (1-y\_i)^{(1-\mu\_i)\phi\_i - 1}, \quad 0 < y\_i < 1,\tag{7}$$

with 0 < *μ<sup>i</sup>* < 1 and *φ<sup>i</sup>* > 0. Figure 3a demonstrates three examples of the beta distribution with fixed *φ* = 5 and different *μ*. The variance is maximised at *μ* = 0.5. Figure 3b demonstrates another three examples of beta distribution with fixed *μ* = 0.5 and different *φ*.
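
The reparameterisation in Equation (6) can be checked numerically; this short Python sketch maps (*μ*, *φ*) to the usual shape parameters (*α*, *β*) = (*μφ*, (1 − *μ*)*φ*) and verifies the mean and variance formulas (the helper function name is ours):

```python
# Numerical check of the (mean, precision) parameterisation in Equation (6).
from scipy.stats import beta as beta_dist

def to_shape_params(mu, phi):
    """Map mean/precision (mu, phi) to the usual beta shape parameters (alpha, beta)."""
    return mu * phi, (1.0 - mu) * phi

mu, phi = 0.3, 5.0
a, b = to_shape_params(mu, phi)
print(beta_dist.mean(a, b))   # equals mu = 0.3
print(beta_dist.var(a, b))    # equals mu*(1-mu)/(phi+1) = 0.035
```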

The precision parameter *φ* is negatively correlated with *Var*(*yi*), given a fixed *μ*. Furthermore, the variance of Y is a function of *μ*, which enables the regression to model heteroskedasticity. RR is modelled as *yi* ∼ B(*μi*,*φi*) for *i* ∈ (1, ··· , *N*) for sample size *N*. The multivariate beta regression model (Cribari-Neto and Zeileis 2010) is defined as:

$$F\_1(\mu\_i) = \eta^T X\_i = \xi\_{1i},$$

$$F\_2(\phi\_i) = \gamma^T W\_i = \xi\_{2i},$$

where *η* is a vector of parameters to be estimated, corresponding to predictor variables *X*, and *γ* is a vector of parameters to be estimated, corresponding to predictor variables *W*.

*Risks* **2019**, *7*, 19

The predictor variables in *W* may be the same as in *X*, or a subset, or contain different variables. For this study, *W* will have a subset of predictor variables determined using stepwise variable selection. The link function ensures that *μ<sup>i</sup>* ∈ (0, 1) and *φ<sup>i</sup>* > 0. We applied Logit and Log link function to *μ<sup>i</sup>* and *φi*, respectively:

$$\mu\_i = \frac{1}{1 + e^{-\eta^T X\_i}}, \quad \phi\_i = e^{\gamma^T W\_i}.$$

With this multivariate beta regression model, *η* and *γ* can be estimated by maximum likelihood estimation, where the log-likelihood function is

$$\begin{aligned} L(\eta, \gamma) = \sum\_{i=1}^{N} \Big[ &\log \Gamma(\phi\_i) - \log \Gamma(\mu\_i \phi\_i) - \log \Gamma((1 - \mu\_i)\phi\_i) \\ &+ (\mu\_i \phi\_i - 1) \log y\_i + ((1 - \mu\_i)\phi\_i - 1) \log(1 - y\_i) \Big]. \end{aligned} \tag{8}$$

By substituting *μ<sub>i</sub>* = *F*<sub>1</sub><sup>−1</sup>(*η<sup>T</sup>X<sub>i</sub>*) and *φ<sub>i</sub>* = *F*<sub>2</sub><sup>−1</sup>(*γ<sup>T</sup>W<sub>i</sub>*) into Equation (8), the log-likelihood is obtained as a function of *η* and *γ*. The parameters can be estimated using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method, which is considered the most appropriate method (Mittelhammer et al. 2000; Nocedal and Wright 1999).
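
As an illustration of this estimation principle, the log-likelihood of Equation (8) with logit and log links can be maximised directly with a BFGS optimiser; the study used the R package "betareg", so this Python sketch with synthetic data is only ours:

```python
# Maximum likelihood estimation for beta regression (Equation (8)) via BFGS.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(1)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one predictor
W = np.ones((N, 1))                                    # constant precision here
eta_true, gamma_true = np.array([0.2, 0.7]), np.array([2.0])
mu_true, phi_true = expit(X @ eta_true), np.exp(W @ gamma_true)
y = rng.beta(mu_true * phi_true, (1.0 - mu_true) * phi_true)

def neg_loglik(theta):
    eta, gamma = theta[:2], theta[2:]
    mu, phi = expit(X @ eta), np.exp(W @ gamma)        # logit and log links
    return -np.sum(gammaln(phi) - gammaln(mu * phi) - gammaln((1.0 - mu) * phi)
                   + (mu * phi - 1.0) * np.log(y)
                   + ((1.0 - mu) * phi - 1.0) * np.log1p(-y))

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(fit.x)   # close to the true (0.2, 0.7, 2.0)
```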

**Figure 3.** Beta distribution. (**a**) Beta Distribution with Fixed *φ*; (**b**) Beta Distribution with Fixed *μ*.

#### *3.3. Inflated Beta Regression*

The disadvantage of beta regression is that it does not include the boundary values 0 or 1. Therefore, a modification is required before fitting the model. To better represent RR on the boundaries 0 and 1, Calabrese (2012) suggested considering RR as a mixture of Bernoulli random variables for the boundary 0 and 1, and a Beta random variable for the open interval (0, 1). The distribution for this inflated beta regression on [0, 1] is then defined as

$$f\_Y(y) = \begin{cases} p\_0, & \text{if } y = 0 \\ (1 - p\_0 - p\_1)\, f\_B(y; \alpha, \beta), & \text{if } 0 < y < 1 \\ p\_1, & \text{if } y = 1 \end{cases} \tag{9}$$

for y ∈ [0, 1], *p*<sup>0</sup> = *P*(*y* = 0), *p*<sup>1</sup> = *P*(*y* = 1), 0 < *p*<sup>0</sup> + *p*<sup>1</sup> < 1 and *fB*(*y*) is the beta distribution defined in Section 3.2. Moreover, if RR *y* ∈ (0, 1], i.e., it only inflates at one, as our data do, then the distribution is just

$$f\_Y(y) = \begin{cases} (1 - p\_1)\, f\_B(y; \alpha, \beta), & \text{if } 0 < y < 1 \\ p\_1, & \text{if } y = 1 \end{cases} \tag{10}$$

We used maximum likelihood estimation to estimate the parameters of the Bernoulli and beta random variables, parameterising the discrete part in the following way (Calabrese 2012):

$$s\_i = \frac{p\_1}{p\_1 + p\_0}, \quad d\_i = p\_0 + p\_1.$$

The log-likelihood function is then

$$\begin{split} L(s, d, \alpha, \beta) = \sum\_{y\_i = 0} \log(1 - s\_i) + \sum\_{y\_i = 0} \log(d\_i) + \sum\_{y\_i = 1} \log(s\_i) + \sum\_{y\_i = 1} \log(d\_i) \\ + \sum\_{0 < y\_i < 1} \log(1 - d\_i) + \sum\_{0 < y\_i < 1} \log(f\_B(y\_i; \alpha\_i, \beta\_i)). \end{split} \tag{11}$$

The continuous beta random variables can be parameterised in the same way as described in Section 3.2.
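
As a minimal illustration of Equation (10), the following Python sketch fits a one-inflated beta distribution without covariates on synthetic data: *p*<sub>1</sub> is estimated by the empirical fraction of ones, and (*α*, *β*) by maximum likelihood on the observations strictly inside (0, 1).

```python
# Fitting the one-inflated density of Equation (10), no covariates.
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(2)
n = 5000
y = np.where(rng.random(n) < 0.3, 1.0, rng.beta(2.0, 5.0, size=n))

p1_hat = np.mean(y == 1.0)                                        # estimate of p1
a_hat, b_hat, _, _ = beta_dist.fit(y[y < 1.0], floc=0, fscale=1)  # MLE of (alpha, beta)
print(p1_hat, a_hat, b_hat)   # near the true values 0.3, 2.0, 5.0
```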

#### *3.4. Beta Mixture Model Combined with Logistic Regression*

Examining the distribution of RR shown in Figure 1, it can be seen that the distribution between 0 and 1 is bimodal. For this reason, we consider a beta mixture model to deal with what appears to be two different groups of recoveries. We propose a two-stage model: beta mixture model combined with logistic regression. The beta mixture model allows us to model the multimodality of RR in the interval (0, 1). This is similar to the two-stage (decision tree) model used by Bellotti and Crook (2012), but with a beta mixture used for regression.

Firstly, RR is classified into ones and non-ones using logistic regression. Secondly, within the non-ones group, a mixture of beta distributions is used to model RR in the range (0, 1). In general, a mixture of beta distributions consists of *m* components, where each component follows a parametric beta distribution. The prior probability of component *j* is denoted *π<sub>j</sub>*, where *j* ∈ (1, ··· , *m*). Let *M<sub>j</sub>* denote the *j*th component/cluster in the beta mixture model. The beta mixture model with *m* components is defined as:

$$\begin{aligned} f(y; \boldsymbol{\mu}, \boldsymbol{\phi}) &= \sum\_{j=1}^{m} \pi\_j f\_j(y; \mathbf{X}, \mu\_j, \phi\_j) \\ &= \sum\_{j=1}^{m} \pi\_j f\_j\left(y; \mathbf{X}, \mathbf{W}, F\_1^{-1}(\eta\_j^T \mathbf{X}\_i), F\_2^{-1}(\gamma\_j^T \mathbf{W}\_i)\right) \\ &= \sum\_{j=1}^{m} \pi\_j f\_j(y; \mathbf{X}, \mathbf{W}, \eta\_j, \gamma\_j), \end{aligned}$$

where *fj* is the beta distribution corresponding to the *j*th component with separate parameter vectors *η<sup>j</sup>* and *γj*. The same link functions are used as in Section 3.2. The prior probabilities, *πj*, need to satisfy the following conditions:

$$\sum\_{j=1}^{m} \pi\_j = 1, \quad \pi\_j \ge 0.$$

The iterative Expectation-Maximisation (EM) algorithm was used to estimate the parameters of the beta mixture model, as described by Leisch (2004). In particular, the R package "flexmix" (Leisch 2004; Gruen and Leisch 2007, 2008), embedded in the R package "betareg" (Cribari-Neto and Zeileis 2010; Gruen et al. 2012), was applied to estimate the model. Figure 4 illustrates the two-stage mixture model as a decision tree.
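
The EM iteration can be sketched in Python for the simplest intercept-only case, without the covariate links of Section 3.2 (the study itself used "flexmix"/"betareg" in R; the data here are synthetic). Responsibilities are computed in the E-step, and the mixture weights and shape parameters are updated by weighted maximum likelihood in the M-step:

```python
# EM algorithm for a two-component beta mixture (intercept-only sketch).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(3)
z = rng.random(4000) < 0.4                                   # latent cluster labels
y = np.where(z, rng.beta(8.0, 8.0, 4000), rng.beta(2.0, 6.0, 4000))

pi = np.array([0.5, 0.5])                  # mixture weights pi_j
log_ab = np.log([[5.0, 5.0], [1.5, 4.0]])  # log shape parameters per component

for _ in range(30):
    a, b = np.exp(log_ab[:, 0]), np.exp(log_ab[:, 1])
    # E-step: responsibilities r[j, i] proportional to pi_j * f_j(y_i)
    dens = np.stack([beta_dist.pdf(y, a[j], b[j]) for j in range(2)])
    r = pi[:, None] * dens
    r /= r.sum(axis=0)
    # M-step: update pi_j and (alpha_j, beta_j) by weighted maximum likelihood
    pi = r.mean(axis=1)
    for j in range(2):
        nll = lambda t, w=r[j]: -np.sum(w * beta_dist.logpdf(y, np.exp(t[0]), np.exp(t[1])))
        log_ab[j] = minimize(nll, log_ab[j], method="Nelder-Mead").x

print(pi)               # roughly (0.4, 0.6)
print(np.exp(log_ab))   # roughly ((8, 8), (2, 6))
```

Optimising the log shape parameters keeps the M-step updates positive without explicit constraints.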

**Figure 4.** Estimate the expected value of RR using two-stage decision tree model.

The choice of *m* in the model depends on the number of clusters expected in the data. Based on our analysis of the recoveries for the dataset we used, *m* = 2 was used since this corresponded to the two modes we see in the RR distribution for RR < 1, as shown in Figure 1. If it is not clear how many clusters may exist, approaches based on AIC can be used.

#### Predictions Using the Beta Mixture Model

Given the beta mixture model, we need to predict the RR for new clients based on their information, i.e., *Xnew* and *Wnew*. Figure 5 shows a flowchart explaining how to calculate the estimated RR from the beta mixture model. This gives an expected value of RR *y* conditional on the cluster *Mj*. Therefore, we need to first identify which cluster the new observation belongs to. Even though the R package "betareg" (Cribari-Neto and Zeileis 2010; Gruen et al. 2012) can compute the conditional expectation for us, it does not identify which cluster the new points should be assigned to. Therefore, we propose a method to do this. In general, there are two feasible approaches to assign a new observation to *Mj*:

1. **Maximum log-likelihood**: assign the new observation to the cluster under which its log-likelihood is highest (hard assignment); and
2. **Prior probability**: weight the clusters by prior probabilities *P*(*M<sub>j</sub>*) (soft clustering).


Decomposing the expected value of *y* using the Law of Total Expectation, we get

$$E(\boldsymbol{y} \mid \mathbf{x}\_i) = \sum\_{j=1}^{m} P(\mathcal{M}\_j | \mathbf{x}\_i) E(\boldsymbol{y} | \mathbf{x}\_i, \mathcal{M}\_j) \tag{12}$$

where *E*(*y*|*x<sub>i</sub>*, *M<sub>j</sub>*) is calculated from the beta mixture model prediction (refer to Figure 5). We can replace *P*(*M<sub>j</sub>*|*x<sub>i</sub>*) = *f*(*x<sub>i</sub>*|*M<sub>j</sub>*)*P*(*M<sub>j</sub>*)/*f*(*x<sub>i</sub>*), where *f*(*x<sub>i</sub>*) = ∑<sup>*m*</sup><sub>*j*=1</sub> *f*(*x<sub>i</sub>*|*M<sub>j</sub>*)*P*(*M<sub>j</sub>*), to get

$$E(\boldsymbol{y} \mid \boldsymbol{x}\_i) = \frac{\sum\_{j=1}^{m} f(\boldsymbol{x}\_i | \boldsymbol{M}\_j) P(\boldsymbol{M}\_j) E(\boldsymbol{y} | \boldsymbol{x}\_i, \boldsymbol{M}\_j)}{f(\boldsymbol{x}\_i)} \tag{13}$$

where *P*(*Mj*) is the prior probability of belonging to cluster *Mj*. The density *f*(*xi*|*Mj*) is estimated using kernel density estimation,

$$\hat{f}(\mathbf{x}^{\text{new}}) = \sum\_{i=1}^{n} \frac{1}{n \prod\_{k=1}^{d} h\_{i,k}} \prod\_{k=1}^{d} K\left(\frac{x\_{k}^{\text{new}} - x\_{i,k}}{h\_{i,k}}\right),$$

where *K*(·) is the Gaussian kernel (Azzalini and Menardi 2014) and *d* is the number of dimensions in data *x*. In addition, *xi* may be high-dimensional, which makes the kernel density estimation computationally expensive. As a remedy, we applied Principal Component Analysis (PCA) to reduce the dimension of *xi*, and then kernel density estimation was performed in the reduced dimension space.
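
This PCA-then-KDE step can be sketched as follows (scikit-learn's PCA and scipy's `gaussian_kde`; the choice of six components mirrors Section 4, but the data, cluster and new point are synthetic stand-ins):

```python
# Kernel density estimation of f(x | M_j) after PCA dimension reduction.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_cluster = rng.normal(size=(1000, 29))   # stand-in for the x_i in one cluster
pca = PCA(n_components=6).fit(X_cluster)  # reduce 29 dimensions to 6
Z = pca.transform(X_cluster)

kde = gaussian_kde(Z.T)                   # gaussian_kde expects shape (d, n)
x_new = rng.normal(size=(1, 29))
density = kde(pca.transform(x_new).T)[0]  # estimate of f(x_new | M_j)
print(density)
```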

**Figure 5.** Prediction of RR conditional on each cluster *Mj*.

#### **Approach 1: Maximum log-likelihood.**

Given a new observation *xi*, choose *j* that maximises the density:

$$\operatorname\*{arg\,max}\_{j} \; \log f(y \mid x\_i, M\_j),$$

which is computed using the log-likelihood function. If the objective function is maximised with respect to Cluster *Mj*, then set

$$P(M\_j|\mathbf{x}\_i) = 1 \text{ and } P(M\_k|\mathbf{x}\_i) = 0 \text{ for all } k \neq j$$

and hence, from Equation (12), the expected value of *y* is given by *E*(*y*|*x<sub>i</sub>*, *M<sub>j</sub>*).

#### **Approach 2: Prior Probability.**

Treat *P*(*M<sub>j</sub>*) as a prior estimated using the methods given in Table 1 and use it in Equation (13) for soft clustering. By substituting *P*(*M<sub>j</sub>*) from Table 1 into Equation (13), we can compute *E*(*y* | *x*) for *y* ∈ (0, 1).


**Table 1.** Determining *P*(*Mj*) in Approach 2.

After calculating *E*(*y* | *x*) for the interval (0, 1) using the beta mixture model, the boundary 1 needs to be taken into consideration using a logistic regression model. From the decision tree defined in Figure 4, the logistic regression can provide the estimates at the first leaf node: *P*(*y* = 1 | *X* = *x*). Then, the overall expectation of RR *y* ∈ (0, 1] is

$$\begin{aligned} E(y \mid \mathbf{x}) &= P(y=1 \mid \mathbf{x})\, E(y \mid \mathbf{x}, y=1) + P(0 < y < 1 \mid \mathbf{x})\, E(y \mid \mathbf{x}, 0 < y < 1) \\ &= P(y=1 \mid \mathbf{x}) \times 1 + (1 - P(y=1 \mid \mathbf{x}))\, E(y \mid \mathbf{x}, 0 < y < 1) \\ &= P(y=1 \mid \mathbf{x}) + (1 - P(y=1 \mid \mathbf{x}))\, E(y \mid \mathbf{x}, 0 < y < 1), \end{aligned} \tag{14}$$

where *E*(*y* | *x*, 0 < *y* < 1) is the predicted RR from the beta mixture model using Approach 1 or 2.
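
The combination of the two stages reduces to a one-line formula; the following Python sketch applies it with placeholder values for *P*(*y* = 1 | *x*) and the beta mixture prediction (the numbers are invented, not fitted values, and the function name is ours):

```python
# Two-stage combination of logistic regression and beta mixture predictions.
# p_one stands for P(y = 1 | x); e_inner for E(y | x, 0 < y < 1).

def two_stage_expectation(p_one, e_inner):
    """E(y | x) = P(y=1 | x) + (1 - P(y=1 | x)) * E(y | x, 0 < y < 1)."""
    return p_one + (1.0 - p_one) * e_inner

print(two_stage_expectation(0.25, 0.5))  # 0.25 + 0.75 * 0.5 = 0.625
```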

#### **4. Results**

The linear model had an adjusted *R*<sup>2</sup> of 0.69, considerably higher than reported for most models of RR (e.g., see Bellotti and Crook (2012); Loterman et al. (2012)), which could be explained by the richness of the data, especially the collections information. We expected the linear regression model to be misspecified, due to the range of the outcome variable, and this is confirmed by the residual vs. fitted plot for the model and by a Breusch–Pagan test for heteroskedasticity (*p* < 0.0001).

For the beta mixture model, we used all variables for *X*, but variable selection for *W* based firstly on the output of stepwise selection using AIC in linear regression and then on a series of likelihood ratio tests. The result was the selection of four variables for *W*: pre-recovery rate, post balance, customer payment frequency and credit bureau score. Table 2 shows parameter estimates for *η* and *γ* for the two clusters, along with coefficient estimates under standard beta regression in the interval (0, 1) for comparison.

In Table 2, there are "NA" values for some of the p-values in the beta mixture model because the estimation algorithm could not produce reliable standard errors in these cases. We can see that the significance of variables was *diluted* across the two clusters. For instance, credit bureau score was significant in the standard beta regression with a p-value of 0.0022, but in the beta mixture model it was not significant for either cluster at the 5% significance level. Where estimates were significant (at the 5% level), the direction of association of coefficient estimates in the beta mixture model was mostly consistent across the two clusters, although the magnitude of association differed; Pre-Recovery Rate in the *γ* component was the only exception. The model also demonstrated some interesting significant associations between variables and RR: taking insurance was associated with higher recoveries, and having a record at the bad debt bureau was associated with lower recovery rates. In addition, pre-purchase recoveries were positively correlated with future RR, although the total number of calls to the customer had a negative association, perhaps because these were difficult customers from whom to collect, hence requiring more intervention.


**Table 2.** *η* and *γ* estimated by EM algorithm. M1 and M2 represent Clusters 1 and 2.

Following the procedure in Figure 5, the expected value of RR conditional on cluster *M<sub>j</sub>* was calculated based on the parameters *η* and *γ* estimated in Table 2. Since it was too time-consuming to perform kernel density estimation on 29 variables, we reduced the dimension to six using PCA, which greatly shortened the running time for the two clusters' density estimations. Nevertheless, some information is inevitably lost during dimension reduction, which may result in weaker estimates. Figure 6 shows histograms of the expected value of RR conditional on each *j*th cluster for the test dataset. The shapes for the two clusters are similar, except Cluster 2 has more estimates in the range 0.2 to 0.6.

Figure 7 shows four histograms of predicted RR corresponding to the four different priors defined in Table 1, in contrast to the true RR. The predicted value of the beta mixture model combined with the logistic regression model was calculated by applying the formula derived in Equation (14). Models with the different priors performed in a similar way. Importantly, they were all able to model the bimodal nature of the RR. The figure shows that none of the models were good at predicting the extreme values of RR close to 0 or 1, but this naturally follows from the fact that these predictions are estimates of *expected values* of RR, through Equation (12), albeit conditional on predictor variables, and thus do not represent the extremes of the distribution well. Further detail can be seen in Figure 8, which shows predicted RR against true RR. The strong correlation between predicted and true RR is clear. However, it is noticeable that, when true RR was around 0.6, the model tended to under-estimate for some observations. This was because the model was not perfect at detecting observations in Cluster 2, which suggests future improvements to the model to enhance its capacity to predict the correct latent cluster.

**Figure 6.** *E*(*y*|*M<sub>j</sub>*, *X<sup>new</sup>*) based on the test dataset, for the two clusters (*m* = 2). (**a**) *E*(*y*|*M*<sub>1</sub>, *X<sup>new</sup>*); (**b**) *E*(*y*|*M*<sub>2</sub>, *X<sup>new</sup>*).

**Figure 7.** Predicted RR on test data (*n* = 2746) using beta mixture with four different priors, combined with logistic regression.

**Figure 8.** Predicted RR against true RR on test data (*n* = 2746) using beta mixture with the indifferent prior, combined with logistic regression.

#### *Model Performance*

Predictive performance was measured using *K*-fold cross validation with three performance measures popular in the literature on RR estimation: mean squared error (MSE), mean absolute error (MAE) and mean aggregate absolute error (MAAE). Since the sample size (8237) was relatively large and model estimation time was long, *K* = 3 was chosen. Let *n* be the sample size. Then,

$$MSE = \frac{1}{K} \sum\_{k=1}^{K} \frac{1}{n/K} \sum\_{i=1}^{n/K} (\hat{y}\_{ki} - y\_{ki})^2 = \frac{1}{n} \sum\_{k=1}^{K} \sum\_{i=1}^{n/K} (\hat{y}\_{ki} - y\_{ki})^2,$$

$$MAE = \frac{1}{K} \sum\_{k=1}^{K} \frac{1}{n/K} \sum\_{i=1}^{n/K} |\hat{y}\_{ki} - y\_{ki}| = \frac{1}{n} \sum\_{k=1}^{K} \sum\_{i=1}^{n/K} |\hat{y}\_{ki} - y\_{ki}|.$$
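As the second equalities above show, when the *K* folds have equal size *n*/*K*, averaging per-fold errors is the same as pooling all out-of-fold residuals and averaging once. A minimal sketch (with hypothetical toy values, not the paper's data):

```python
import numpy as np

def mse_mae(y_true, y_pred):
    """Pooled MSE and MAE over all out-of-fold predictions; with equal
    fold sizes n/K this equals the fold-averaged forms above."""
    err = y_pred - y_true
    return np.mean(err ** 2), np.mean(np.abs(err))

# hypothetical out-of-fold predictions collected across K = 3 folds
y = np.array([0.0, 1.0, 0.2, 0.8, 0.5, 0.5])
y_hat = np.array([0.1, 0.9, 0.2, 0.7, 0.6, 0.5])
mse, mae = mse_mae(y, y_hat)  # 0.04/6 and 0.4/6
```

In practice the same fixed fold partition would be reused for every model being compared, as the paper does, so that differences in scores reflect the models rather than the sampling.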

MAAE is the MAE at segment level (Thomas and Bijak 2015) and defined here as

$$MAAE = \frac{1}{V} \sum\_{i=1}^{V} \frac{1}{|S\_i|} \left| \sum\_{j \in S\_i} (y\_j - \hat{y}\_j) \right|$$

where *V* is the number of segments expressed as disjoint index sets *S*1, ··· , *SV*. The segments could express different characteristics, e.g., risk bands. However, for this study, each segment was a different random sample from the test data, with approximately the same sample size, and the segments were jointly exhaustive. We used *V* = 100 since this gave a reasonable balance between the number of segments and the number of observations in each segment. MSE, MAE and MAAE are all penalty measures, so the smaller the value, the better the model. Since the RR is a financial ratio between 0 and 1, the MAE reflects the size of the error in a more intuitive and direct way. If one is interested in performance at the segment portfolio level, then the MAAE should be used.
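The segment-level measure can be sketched directly from the definition: randomly partition the test indices into *V* jointly exhaustive segments, then average the absolute aggregate residual of each segment scaled by its size. This is an illustrative implementation, not the authors' code.

```python
import numpy as np

def maae(y_true, y_pred, V=100, seed=0):
    """Mean aggregate absolute error: mean over V random, disjoint,
    jointly exhaustive segments of |sum of residuals| / segment size.
    Unlike MAE, within-segment over- and under-predictions cancel."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))
    segments = np.array_split(idx, V)  # approximately equal sizes
    return float(np.mean([abs(np.sum(y_true[s] - y_pred[s])) / len(s)
                          for s in segments]))

y = np.linspace(0.0, 1.0, 400)
maae(y, y, V=10)        # perfect predictions give 0
maae(y, y + 0.1, V=10)  # a uniform +0.1 bias gives ~0.1
```

Note the contrast with MAE: a model whose errors cancel within segments can score well on MAAE while scoring poorly on MAE, which is why MAAE is the relevant measure at portfolio level.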

All models were trained on the same partitions of data into cross validation folds, to avoid bias being introduced due to different samples. Table 3 shows the results. There was little difference in predictive performance between the various linear regressions, with or without variable selection or a Lasso penalty. The last linear model, "excluding Dataset 2", was built without predictor variables from Dataset 2. It showed noticeably worse performance than the other linear models, especially for MAE, which demonstrates that including past recoveries data (i.e., Dataset 2) improved performance. The standard beta regression model, with and without zero-inflation, performed much worse than linear regression, but the beta mixture model with logistic regression gave the best performance on all three measures. The different priors gave slightly different performances, although Approach 1 for selecting cluster assignment (maximum log-likelihood) was slightly worse than the soft clustering methods of Approach 2.


**Table 3.** Predictive results using three-fold cross validation.

#### **5. Conclusions**

Linear regression, beta regression, inflated beta regression, and a beta mixture model combined with logistic regression were applied to model the recovery rate of non-performing loans. The models' predictive performances were measured using mean squared error, mean absolute error and mean aggregate absolute error under three-fold cross validation. To produce predictions from the beta mixture model, methods of hard and soft clustering were developed, and the soft clustering approaches gave marginally better predictive performance. Theoretically, the proposed model, a beta mixture model combined with a logistic regression model, should be suitable for predicting recovery rates for these data, since it allows us to model the multimodality in the dataset and gives separate treatment to the boundary values. Indeed, we found that it achieved the best results amongst the models. Stepwise linear regression also achieved relatively good performance; however, its normality and homoscedasticity assumptions did not hold. In our experiments, we also found that the inclusion of previous collections data boosted predictive performance.

We believe the beta mixture model is useful for modelling RR because it captures different servicing strategies. In the case of our study, the cluster with mode around 0.55 is likely expressing those loans for which the debt servicer has agreed with the borrower to repay just a proportion of the outstanding debt. There may be servicing strategies in other bad debt portfolios that could be discovered using a similar mixture model or clustering approach. We developed a technique to predict the correct latent cluster for new observations, and it works well. However, the results suggest that further work to refine this aspect of the model could yield improved performance.

**Author Contributions:** Conceptualisation, H.Y. and A.B.; methodology, H.Y. and A.B.; validation, A.B.; investigation and statistical modelling, H.Y.; data analysis, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, A.B.; and supervision, A.B.

**Funding:** This research received no external funding.

**Acknowledgments:** We wish to thank the anonymous debt collection company for use of their data and their expertise, which was essential to understand the meaning and context of the data. We would also like to thank Tommaso Pappagallo, who did preliminary data analysis as part of his MSci project, which was useful in taking this project work forward.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

**Table A1.** Descriptive statistics. *n* = 8237. For numeric variables: min, mean (standard deviation), max. For factors, frequency (%age) for each level. All predictive data is collected prior to servicing.


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
