1. Introduction
The granting of a loan implies a risk for the financial institution, which arises due to the possibility that the borrower does not comply with the repayment of the loan; that is, the possibility of delinquency. Financial institutions face a decision problem of whether to grant the loan to the client, which implies an assessment of the probability of each applicant presenting delinquency problems. Therefore, an institution attempts to calculate an uncertain event with data from the past. In other words, this is an estimation or prediction problem known as credit scoring. Any credit rating system that enables the automatic assessment of the risk associated with a banking operation is called credit scoring. This risk may depend on several customer and credit characteristics, such as solvency, type of credit, maturity, loan amounts and other features inherent in financial operations. It is an objective system for approving credit that does not depend on the analyst’s discretion.
Today, this area of research represents a very important challenge for financial institutions for three fundamental reasons:
Credit risk represents 60% of the total risk of a financial institution.
The requirements imposed by regulators to comply with Basel III imply, among other things, an increase in the entity’s reserves based on a higher percentage applied to the calculation of expected losses, with a consequent reduction in the benefits distributed to shareholders.
With the introduction of the Ninth International Financial Reporting Standard (IFRS 9) in January 2018, financial companies must calculate their expected losses due to default over the 12 months after these financial instruments are implemented, transferring them as losses to their income statement, with a consequent reduction in profit.
Therefore, a good automatic method to select variables and optimize risk assessment decisions would mean lower expected losses and higher profits for the company and shareholders.
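For context, the expected loss referenced in the points above is conventionally computed as the product of the probability of default (PD), the loss given default (LGD) and the exposure at default (EAD). A minimal sketch of this convention follows; the portfolio figures are purely illustrative and are not drawn from the paper or any real portfolio.

```python
# Hedged sketch: 12-month expected credit loss under the standard
# EL = PD x LGD x EAD convention. All figures are illustrative.

def expected_loss(pd_12m: float, lgd: float, ead: float) -> float:
    """12-month expected loss: default probability x loss given default x exposure."""
    return pd_12m * lgd * ead

portfolio = [
    # (12-month PD, loss given default, exposure at default in EUR)
    (0.02, 0.45, 100_000),
    (0.10, 0.60, 50_000),
]

# Portfolio-level expected loss: 900 + 3000 = 3900 EUR
total_el = sum(expected_loss(p, l, e) for p, l, e in portfolio)
```

A lower PD estimate for the same exposures translates directly into lower provisions, which is why improving the scoring model has an immediate effect on the reserves discussed above.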
Financial institutions have exhaustive datasets for their operations; this gives them an advantage (they hold huge amounts of information on creditors and debtors and can try to predict their behavior) but also presents a problem (they need to obtain the best prediction without exceeding their resources). Yu et al. [1] indicated that the credit industry requires quick decisions and that, in this sense, there should be a trade-off between predictive performance and computational efficiency. Today, although financial institutions have access to virtually unlimited computing resources and power (Amazon Web Services (AWS), Azure, etc.), organizational reality sets limitations.
The processing of a model runs in parallel to many core business processes that are performed in batches overnight, so data flows into the system continuously at all levels and departments of the financial institution. This, together with the need for financial institutions to apply dynamic credit risk optimization models combined with variable selection, leads us to consider this research topic. The models must allow the introduction of new variables at any moment in time, thus reflecting the economic, financial, social and political reality in order to anticipate both new predictions and a change in the importance of a variable. Therefore, the adjustments to which the model is subjected, together with the rest of the institution’s usual computing processes, determine the need for a computationally efficient model.
According to Pérez-Martin et al. [2], computational efficiency decreases as the number of variables increases, while the number of records (potential borrowers) hardly influences it. It is in these settings that a massive data problem appears, since the system must recalculate its credit scoring, together with a new optimization of variables, whenever its scenario changes. This leads us to consider this research topic; i.e., obtaining a method to select variables and optimize risk assessment decisions in minimum time. The contribution made by this research is important in economic terms for financial institutions and, at the same time, for society: it makes it possible to give quick responses to risk managers and to correct a credit risk model rapidly, avoiding a possible increase in expected losses, with the consequent loss of profits and increase in reserves for the financial institution, which would mean a lower profit for distribution among shareholders.
Regarding the daily business of banks, they need to evaluate many loan applications and be clear about which must be rejected and which accepted; in the past, this task relied on the personal judgment of employees. Simon [3] concluded that this classical methodology has several problems, such as the non-reproducibility of the valuations (each analyst may have a different opinion) and a low level of confidence due to the subjectivity of the valuation, which can change from day to day. This degree of subjectivity, together with its low efficiency and the difficulty of training new personnel, makes it an inefficient method.
The general approach to this task is currently a model that is updated each night in a batch, together with the other processes executed by a bank’s systems. The problem with a credit risk model batch is clear: each morning, the new model must be deployed, ready for every user, and it must be the best model obtainable from the available information. Furthermore, banking institutions have limited computational capacity. The combination of these factors is a problem that a good feature selection method combined with big data techniques can solve. In fact, many research works in the literature have sought the optimal method since Durand [4] introduced discriminant analysis. These works have employed different methodologies, as listed below.
Linear Methods: The authors of [5,6,7,8,9] considered this type of methodology, but most did not obtain good results. For example, in the comparative study carried out by Yu et al. [5], the efficacy of 10 methods was compared using three databases: one referring to loans from an English financial company (England Credit), extracted from [10]; another referring to credit cards from a German bank (German Credit); and a third referring to credit cards in the Japanese market (Japanese Credit). The results show that the linear method ranks among the last classified, only surpassing neural networks with backpropagation. However, subsequent research works (e.g., [11]) have shown the best percentages for this approach compared to other more complex methods, although limitations due to the size of the data have been evident; in contrast, Xiao et al. [12] considered logistic regression to be the best method provided the information can be managed.
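To make the linear approach concrete, the core of a logistic-regression scorer can be sketched in a few lines. The single toy feature (a debt-to-income ratio), the labels and the learning-rate settings below are illustrative assumptions, not the configuration of any cited study.

```python
# Minimal sketch of a logistic-regression credit scorer (the GLMlogit idea),
# fitted by batch gradient descent on toy one-feature data. Illustrative only.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, lr=0.1, epochs=3000):
    """Fit weight w and intercept b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient factor of the log-loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# toy data: higher debt-to-income ratio -> default (label 1)
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logit(xs, ys)
p_low, p_high = sigmoid(w * 0.1 + b), sigmoid(w * 0.9 + b)
```

In practice a scorer like this is trained on many features at once; the attraction noted by [12] is that the fitted coefficients remain directly interpretable as the contribution of each characteristic to the default probability.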
Decision Trees: The authors of [9,13,14,15,16] used this method in research comparing different methodologies; in none of these works, however, was it the best method. For example, in the results obtained in [9], this approach lags far behind the rest of the methods in predictive precision. A predictive improvement can be observed in [17], where 11 methods are compared using the German Credit and Australian Credit databases.
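The building block of the tree-based scorers compared in these studies is a single impurity-minimizing split. The stump below, with a toy credit-amount feature, is an illustrative stand-in, not the algorithm of any cited work.

```python
# Illustrative decision stump (a one-split tree) chosen by Gini impurity.
# Toy feature and labels; not the configuration of any cited study.

def gini(labels) -> float:
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Return (threshold, weighted Gini) of the best 'x <= t' split."""
    best_t, best_g = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# toy feature: credit amount (thousands); label 1 = default
xs = [1, 2, 3, 8, 9, 10]
ys = [0, 0, 0, 1, 1, 1]
t, g = best_split(xs, ys)  # a perfect split at t = 3 gives weighted Gini 0
```

A full tree recurses on each side of the chosen split; the predictive weakness reported in [9] typically stems from how such greedy splits generalize, not from the split criterion itself.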
Neural Networks: Neural networks are commonly used to solve this problem in the literature, and successful results have been reported, such as in the work of Malhotra and Malhotra [18], who compared the results with a discriminant analysis and obtained good performance on the German Credit database. In general, however, many research works have used neural networks without good results, or with gains that are insignificant relative to the complexity of the models (e.g., [19,20,21]).
Support Vector Machines: The Support Vector Machine (SVM) obtains good classification results, as can be seen in [22], where the best accuracy results are obtained in comparison with four different methods. However, as Pérez-Martin and Vaca [23] concluded from a Monte Carlo simulation study, this accuracy comes at the cost of reduced computational efficiency. Other authors have attempted to create or use alternative parametrizations of the kernel, such as Xiao et al. [12], Bellotti and Crook [24] or Wang et al. [25], but the same efficiency problem remains.
One important aspect is the method used when credit scoring must be performed, but it is also possible to address this problem by reducing the information used to model the behavior. To do this, it is necessary to select the variables with the greatest capacity to explain the model. Managing large amounts of data strains the server’s efficiency and slows down its effectiveness. The number of considered features is called the data dimension and, although high dimensionality in the feature space has advantages, it also has some serious shortcomings. In fact, as the number of features increases, more computation is required and the model accuracy and scoring interpretation efficiency are reduced [26,27]. Financial institutions need to provide a response to their potential borrowers, so prediction models must meet two requirements: computational efficiency and predictive efficacy. We study this situation as a big data problem and consider a variable selection method in order to reduce the volume of data, maintaining the model’s efficacy while increasing its computational efficiency (i.e., achieving a lower execution time).
To reduce the computational time while maintaining the effectiveness of the methods, one solution can be found in the analytical methods of data mining [28]. Data mining attempts to quantify the risk associated with a loan and facilitate credit decision-making, while anticipating and limiting the potential losses that a bank may suffer.
Therefore, variable selection is performed, and it is determined whether the prediction improves and whether this improvement corresponds to greater computational efficiency. Faster forecasts are expected, and information that does not add any gain to the model must be eliminated. This involves applying a data mining algorithm that generates and reduces classification rule systems and achieves accurate predictions, but with a substantially lower cost in time and memory consumption. The automatic selection of characteristics involves a set of analytical techniques that generate the combinations of variables with the greatest incidence on the behavior of a target variable. This selection can be carried out by applying different methodologies, such as decision trees, rule systems and techniques such as principal component analysis, with an incremental approach in which the algorithms seek to improve a certain metric of significance [29].
The importance of selecting variables is determined by whether the contribution of the attributes is significant for the model. There are different approaches to this problem. One of them is to use wrapper methods; in this popular case, one attribute-selection methodology emphasized in the literature is genetic programming ([1,30,31]), along with other models such as Linear Discriminant Analysis (LDA) ([14]) or different classes of artificial neural network ([30]), among other methodologies. However, if we consider the Basel principles regarding the model, all of these candidates breach the principle of simplicity.
In this research, the impacts of two proposed feature selection techniques, gain ratio and principal component analysis (PCA), are measured in terms of reducing dimensionality, along with the impact of discretization in order to standardize the dataset values. To test these factors, we use a credit dataset from the Spanish Institute of Fiscal Studies (IEF). We then apply three credit scoring methods to predict whether a home equity loan can be granted or not: the generalized linear model for binary data (GLMlogit), the linear mixed model (LMM) and the classification rules extraction algorithm-reduction based on significance (CREA-RBS). The LMM and GLMlogit methods were chosen because they are the most efficient and effective methods ([2,32]).
Section 2 gives an overview of the datasets used and the stages of our experimental study, with a description of the methods used. In Section 3, the obtained results are shown. In Section 4, the conclusions are presented.
3. Results
In this work, we attempted to determine the effect of different processes in order to reduce the complexity of this problem. We calculated the gain ratio and PCA and performed discretization in order to rank the variables, create dimensions and create aggregated groups, respectively.
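The gain-ratio ranking used here is the standard C4.5 measure: the information gain of a feature divided by its split information. A self-contained sketch follows; the income-bracket and account-age values are toy illustrations, not variables from the IEF or German Credit datasets.

```python
# Sketch of the gain-ratio ranking: information gain divided by split
# information (Quinlan's C4.5 measure). Toy categorical data, illustrative only.
import math
from collections import Counter

def entropy(values) -> float:
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, target) -> float:
    base = entropy(target)
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        cond += len(subset) / n * entropy(subset)
    info_gain = base - cond
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0

# income bracket perfectly predicts default; account age carries no information
income  = ["low", "low", "high", "high"]
age     = ["old", "new", "old", "new"]
default = [1, 1, 0, 0]
gr_income = gain_ratio(income, default)  # 1.0: fully informative
gr_age    = gain_ratio(age, default)     # 0.0: uninformative
```

Ranking every feature by this score and reading off the largest values is what produces orderings like those in Figures 3 and 4.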
In Figure 3 and Figure 4, we can see the most influential variables in the IEF semi-synthetic and German Credit datasets, respectively. As shown in Figure 3, the most influential variable in the IEF dataset is Correctorincome. This feature aims to discriminate between people with income above or below the minimum inter-professional salary (SMI); the lower the income relative to the SMI, the harder it is for the borrower to meet the repayments of a bank credit. The other two most relevant variables are Familyincome and the total tax deduction related to the main residence (purchase or improvements). The rest of the features have marginal importance. Moreover, economic attributes are more relevant than social variables according to the gain ratio.
In the case of the German Credit dataset (Figure 4), the most influential variable is account_check_status. This feature reflects the current balance of the checking account. The other four most relevant variables are foreign_worker, credit_history, duration and credit_amount. The rest of the features have marginal importance. Again, economic attributes are more relevant than social variables.
PCA is a technique used for feature extraction; thus, the output is a new dataset whose variables are combinations of the input variables, which makes their meaning very difficult to interpret. In the case of the IEF semi-synthetic dataset, the first 10 dimensions of PCA explain 25.34% of the total variability. This percentage is relatively low, but each subsequent dimension adds less than 5% to the previously explained variability.
In the German Credit case, the first dimension of PCA expresses 99.997% of the total dataset inertia; this situation may have been created by the small size of the dataset. We decided to select the first five dimensions, even though they add little variance, because the possible interpretation of the tested models would not otherwise be logical, except for the GLMlogit models.
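The "variance explained per dimension" figures quoted above come from the eigenvalues of the covariance matrix, each divided by their sum. A minimal two-feature sketch with a closed-form 2x2 eigendecomposition follows; the data are toy values, chosen so that the first component carries essentially all the variance, loosely mirroring the German Credit case.

```python
# Sketch of PCA explained-variance ratios for two features: eigenvalues of the
# 2x2 sample covariance matrix, normalized by their sum. Toy data only.
import math

def explained_variance_2d(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # closed-form eigenvalues of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    d = math.sqrt(tr ** 2 / 4 - det)
    l1, l2 = tr / 2 + d, tr / 2 - d        # l1 >= l2
    return l1 / (l1 + l2), l2 / (l1 + l2)  # explained-variance ratios

# perfectly correlated features: the first component explains ~100% of inertia
r1, r2 = explained_variance_2d([1, 2, 3, 4], [2, 4, 6, 8])
```

With real data the features are only partially correlated, so the ratios decay gradually, which is exactly the slow growth of inertia observed for the IEF dataset.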
Then, the dataset for the three proposed methods is adjusted. The execution time and the root mean square error (RMSE) are obtained by entering the features according to their contribution to the model. The results of these experiments are collected in Table 5 for RMSE and in Table 6 for execution times.
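For reference, the RMSE reported in Table 5 follows its standard definition; the toy labels and predicted default probabilities below are illustrative, not values from the experiments.

```python
# Standard root mean square error between observed outcomes and predictions.
import math

def rmse(y_true, y_pred) -> float:
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# observed default indicator vs. predicted default probability (toy values)
value = rmse([0, 1, 1, 0], [0.1, 0.8, 0.6, 0.2])
```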
As we can see from the results in Table 5, the version of the dataset that performs worst is the one with the selected PCA dimensions, except for RBS, which we analyze below. Moreover, the best version depends on the dataset used, although it is clear that it is a complete version. The difference between the two datasets lies in the individuals’ total variability. The IEF dataset contains a large number of cases, and the discretization process reduces this number. In contrast, the German Credit dataset contains a limited number of cases, so the discretization process does not add value but rather homogenizes users who should not be equal yet statistically cannot constitute an individual class.
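The homogenizing effect just described can be seen in a minimal discretization sketch. Equal-width binning is used here as an illustrative choice; the bin count and loan amounts are assumptions, not the scheme applied in the experiments.

```python
# Minimal equal-width discretization: continuous amounts mapped to a fixed
# number of bins. Bin count and values are illustrative assumptions.

def discretize(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        b = int((v - lo) / width)
        bins.append(min(b, n_bins - 1))  # place the maximum into the top bin
    return bins

amounts = [100, 250, 400, 900, 1000]
codes = discretize(amounts, 3)  # -> [0, 0, 1, 2, 2]
```

Note how 100 and 250 fall into the same bin: with many cases this compresses the data usefully, but with few cases it merges borrowers who should arguably remain distinct, which matches the German Credit behavior above.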
RBS is the best method in most cases (it only shows worse results using gain ratio selection for the German Credit dataset, as seen in Figure 5). However, this model is not dynamic, because it is not able to adapt to and predict new combinations of variables. For this reason, we must bear in mind that this model cannot handle groups of observations containing combinations different from the initial ones. RBS is a method that does not explain all combinations, because it focuses on those of greater value.
We can see the execution times of the training models in Table 6, but on their own they do not offer relevant information beyond the fact that a greater number of variables requires more computational effort. In fact, if we consider those cases not covered by the model, whether positive or negative, we find that the models perform worse except in two particular cases (gain ratio and PCA for German Credit when considering accepted credit).
The combination of model precision (RMSE) with computational cost (time) is particularly interesting and is illustrated in Figure 5. The combination of a low time requirement and a low RMSE identifies the best methods. Three groups can be identified based on the results:
The first group is formed by the complete dataset trained with a linear method. This group has the highest execution times and the second-lowest RMSE. The best trade-off within this group is LMM, because it saves more than 30% in execution time while losing less than 2% in precision compared to GLMlogit.
The second group is formed by the datasets reduced with the two proposed selection methods. This group has the second-lowest execution times and the worst RMSE. In general, GLMlogit behaves well, obtaining a good RMSE and shorter execution times. Of the two methodologies used, the gain ratio is more effective than PCA in view of the results.
The last group is formed by the RBS models, regardless of whether variables are selected or not. It stands out due to its low execution times, whatever the volume of data, and the best precision results. In this case, it is observed that removing information makes the model lose precision.
4. Conclusions
We attempted to find efficient methods, discarding those with the lowest efficiency, for use with a banking dataset. For this reason, we built our own design on the IEF dataset, using some synthetic variables such as the target variable and the borrowed amount.
Two feature selection methods, gain ratio and PCA, were proposed, with the elbow method used to choose the cut-off. We calculated the measures of effectiveness (RMSE) and efficiency for the models adjusted with LMM, GLMlogit and RBS.
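An elbow-style cut-off of the kind used with both rankings can be sketched as follows: keep ranked features (or dimensions) until the next one contributes less than a chosen fraction of the variability accumulated so far. The 5% threshold echoes the increments discussed in the Results, but the rule and the scores below are an illustrative assumption, not the exact procedure of this paper.

```python
# Illustrative elbow cut-off: keep ranked scores until the next one adds less
# than a chosen fraction of the total kept so far. Threshold and data are toy.

def elbow_cutoff(scores, rel_threshold=0.05):
    """scores: ranking values in descending order. Returns how many to keep."""
    kept = 1
    total = scores[0]
    for s in scores[1:]:
        if total > 0 and s / total < rel_threshold:
            break
        total += s
        kept += 1
    return kept

ranked = [0.40, 0.25, 0.15, 0.02, 0.01]
k = elbow_cutoff(ranked)  # the fourth score adds under 5%, so 3 are kept
```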
It is important to note that, in its first step, the method proposed in this variable selection investigation calculates the credit score with all the variables entered in the model. Once this is calculated, a selection is made using the proposed methods, keeping the variables that offer the most relevant information for making credit decisions. These variables are those taken into account when deciding whether or not to grant a loan to a potential borrower, without having to enter the rest of the information, since it is irrelevant. As this is a dynamic model (i.e., new risk factors are added that are not constant over time and depend on economic, financial, political and even health changes, such as COVID-19), there are changes both in the calculation of the credit score and in the selection of variables. Therefore, it is important to interpret the results in terms of the variables that form part of the model; moreover, new variables may appear and carry greater weight, while others may disappear or influence the decision less.
In the temporal analysis proposed in this research, the most important features with respect to the gain ratio in the IEF dataset used to predict the default probability were “Correctorincome” and “Familyincome”. These variables have a 51% weight with respect to the total features. In the case of German Credit, the most important variables are “account_check_status” and “foreign_worker”, but these have only a 13.55% weight with respect to the total features. When applying PCA, it is observed that, up to the 10th dimension, the inertia grows slowly in the IEF dataset. In the case of German Credit, it is observed that, up to the third dimension, practically all the inertia is represented.
RBS is the most effective method regarding accuracy, although it is not an expert system and does not produce responses for all cases. This means that, in stable environments, it can be used to estimate provisions for credit losses for customers about whom the financial institution has extensive knowledge, or to provide a first filter for access to credit.
On the other hand, LMM has proven to be very efficient, practically equaling GLMlogit, so this method could be a candidate to replace the logistic regressions (GLMlogit) used in the industry: it remains a linear method and allows a simple interpretation, while its random effect simplifies the adjustment of variables that are not particularly important but that should still be introduced in the model.
For these reasons, RBS and LMM are proposed to evaluate credit risk in the presence of big data problems that need prior feature selection to reduce the dimensionality of a dataset. This method is applicable to any database and time horizon, because it selects the most important variables within the analyzed database while taking into account all of the information.