Big Data and Cognitive Computing
  • Article
  • Open Access

6 March 2024

Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

1 School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand
2 Department of Biotechnology, Faculty of Science and Technology, Thammasat University, Khlong Luang 12120, Pathum Thani, Thailand
* Authors to whom correspondence should be addressed.
This article belongs to the Topic Big Data and Artificial Intelligence, 2nd Volume

Abstract

For the financial health of lenders and institutions, one important risk assessment, credit risk, concerns correctly deciding whether or not a borrower will fail to repay a loan. It not only helps in the approval or denial of loan applications but also aids in managing the non-performing loan (NPL) trend. In this study, a dataset provided by the LendingClub company based in San Francisco, CA, USA, covering 2007 to 2020 and consisting of 2,925,492 records and 141 attributes was experimented with. The loan status was categorized as “Good” or “Risk”. Credit risk prediction experiments were performed using three widely adopted supervised machine learning techniques: logistic regression, random forest, and gradient boosting. In addition, to address the imbalanced data problem, three sampling algorithms, including under-sampling, over-sampling, and combined sampling, were employed. The results show that the gradient boosting technique achieves nearly perfect Accuracy, Precision, Recall, and F1 score values, all better than 99.92%, while its MCC values exceed 99.77%. All three imbalanced data handling approaches can enhance the performance of models trained by the three algorithms. Moreover, the experiment of reducing the number of features based on mutual information revealed only slightly decreased performance for 50 data features, with Accuracy values greater than 99.86%. For 25 data features, the smallest size tested, the random forest supervised model still yielded 99.15% Accuracy. Both sampling strategies and feature selection help to improve the supervised model for accurately predicting credit risk, which may be beneficial in the lending business.

1. Introduction

Machine learning techniques have several benefits in various applications, especially in predicting a trend or outcome. Hence, machine learning models can accurately assess credit default probabilities and improve credit risk prediction []. In financial services such as personal loans, accurately predicting the risk of non-performing loans (NPLs) in peer-to-peer (P2P) lending is crucial for lenders such as P2P lending platforms. When borrowers fail to repay (or default on) their loans, this creates an NPL for the lender. Generally, NPLs are a major obstacle to the stability and profitability of not only financial institutions [] but also P2P platforms. Therefore, risk assessment measures, diversification strategies, and collection processes are routinely applied to minimize the NPL issue. P2P platforms, which are widely used in many countries, involve higher risk than traditional lending because they depend on individuals []. However, they offer several advantages over banking credit, i.e., direct interaction between lenders and borrowers, detailed credit scoring [], and the opportunity to gather and analyze large amounts of data that can be used to assess trustworthiness and reduce risk []. Therefore, several previous research works have studied how to build an efficient model to predict lending risk [,,]. Still, there are several challenges, including selecting important features, coping with imbalanced data, handling data quality, and carrying out in-depth model evaluations. Lending datasets often contain imbalanced data, i.e., a higher proportion of good loans than risky ones, which can bias the model’s predictions. In addition, resolving missing values in the data needs careful consideration of whether to impute, remove, or ignore them. Selecting a relevant and informative input feature set is also a substantial step for avoiding model overfitting or underfitting. Apart from that, one way to ensure a model’s real-world performance is to validate it on many lending datasets. In summary, addressing the above challenges can lead to a reliable lending risk prediction model.
In this research, to overcome the challenges associated with building a machine learning model for this lending risk prediction problem, several approaches were implemented. Firstly, exploratory data analysis (EDA) was conducted to explore and clean the data, improving data quality before initiating the model creation process. Secondly, logistic regression (LR), random forest (RF), and gradient boosting (GB), which are supervised machine learning approaches, were used for model building experiments. Thirdly, over-sampling, under-sampling, and combined sampling techniques were comparatively employed to mitigate the imbalanced data problem. Lastly, an experiment on reducing the number of features according to their importance, computed by mutual information, was also performed.
The remaining sections of this paper are organized as follows. A brief literature review of the machine learning approaches and imbalanced data handling techniques utilized in this study is provided in Section 2. The methodology, including data description, data preparation, experimental setup, and performance evaluation, is outlined in Section 3. In Section 4, the results and discussion are reported. Finally, in Section 5, the conclusions and future work are summed up.

3. Materials and Methodology

3.1. Data Description and Preprocessing

In this work, financial data provided by the LendingClub company from 2007 to 2020Q3 [] were used. The data consist of 2,925,493 records divided into various loan statuses as in Table 2. The loan statuses are categorized as “Good” or “Risk” users. The “Fully Paid” status is categorized as “Good”, whereas “Charged Off”, “In Grace Period”, “Late (16–30 days)”, “Late (31–120 days)”, and “Default” are grouped as “Risk”. The “Current” status is not explicitly categorized as “Good” or “Risk” since it represents the current state of ongoing payments. “Issued” is also not classified, as it may refer to loans that are approved but not yet active. “Does not meet the credit policy” users were excluded from this study. For the experiment, the data contain 1,497,783 samples labeled as “Good” and 391,882 samples labeled as “Risk”, totaling 1,889,665 samples. This is a two-class dataset with an imbalance ratio (IR) of 3.82, which indicates a slight class imbalance, as displayed in Figure 1. The IR is the majority class size divided by the minority class size. A high IR value may affect model performance in some machine learning algorithms, i.e., the majority class is predicted more correctly than the minority class because imbalanced training data bias the model. Although the algorithms used in this paper, such as random forest and gradient boosting, are quite robust to mild class imbalance, it is still helpful to address imbalanced data before model training because of the increased chance of improving model performance. The imbalanced data handling methods used in this work are explained in Section 2.3.
Table 2. Dataset from LendingClub company from 2007 to 2020Q3 and loan status distribution.
Figure 1. Class imbalance of our experimental dataset from LendingClub dataset.
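As a simple illustration of the IR calculation described above, the following sketch (not the authors’ code) computes the imbalance ratio from the class counts; the file name and the binary label column `loan_status_binary` are illustrative assumptions, not names from the original dataset.

```python
# Minimal sketch: computing the imbalance ratio (IR) of the two-class label.
# File and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("lending_club_2007_2020Q3.csv", low_memory=False)  # assumed file name
counts = df["loan_status_binary"].value_counts()  # e.g., Good: 1,497,783 / Risk: 391,882
ir = counts.max() / counts.min()                  # majority class size / minority class size
print(f"Imbalance ratio (IR): {ir:.2f}")          # ~3.82 for the class sizes reported above
```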
There are 141 attributes in the original data, many of which contain a high percentage of missing values, as illustrated in Figure 2. Some columns need to be dropped or transformed before training the models. The data were preprocessed through the following steps (a brief code sketch is given after the list).
Figure 2. The summary of missing values on each attribute excluding the “id”, “url”, “pymnt_plan”, and “policy_code” attributes.
(1)
Drop column “id” because it typically serves as a unique identifier for each row, and including it as a feature could lead the model to incorrectly learn patterns that are specific to certain ids rather than generalizing well to new data.
(2)
Drop “url” because it might not provide meaningful information for the model, or its content might be better represented in a different format.
(3)
Drop columns “pymnt_plan” and “policy_code” because every record in the “pymnt_plan” column has the value “n” and every record in the “policy_code” column has the value 1. These columns contain constant values, resulting in the model being unable to differentiate between different data inputs.
(4)
Drop columns that have missing values exceeding 50%. The selected dataset now comprises 101 columns, including 100 features and the loan status.
(5)
In the “int_rate” and “revol_util” columns, convert the percentage values from string format to float.
(6)
For categorical data, fill the missing values with the mode and transform them into numerical values.
(7)
For real value data, fill the missing values with the mean of the existing values.
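A minimal sketch of steps (1)–(7) above, assuming the pandas DataFrame `df` from the earlier sketch; the 50% missing-value threshold and the mode/mean imputation mirror the text, while the integer label encoding used in step (6) is an assumption about the exact transformation.

```python
# Sketch of preprocessing steps (1)-(7); not the authors' exact code.
import pandas as pd

# (1)-(3) drop the identifier, URL, and constant-valued columns
df = df.drop(columns=["id", "url", "pymnt_plan", "policy_code"])

# (4) drop columns with more than 50% missing values
df = df.loc[:, df.isna().mean() <= 0.5]

# (5) convert percentage strings such as "13.56%" to floats
for col in ["int_rate", "revol_util"]:
    df[col] = df[col].astype(str).str.rstrip("%").astype(float)

# (6) fill categorical columns with the mode, then encode them as integer codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0]).astype("category").cat.codes

# (7) fill real-valued columns with the column mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```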
Now, the dataset comprises 100 features. The relationship between each feature and the target variable (class label) was explored to rank feature importance. Mutual information (MI) can identify informative features for both linear and non-linear relationships between features and the target variable. In feature selection, a feature with a higher mutual information value is considered more important and is typically selected into the training feature set. The importance of these features can be represented by mutual information, as defined in Equation (7).
\text{MI}(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \cdot \log_2 \frac{p(x, y)}{p(x)\, p(y)}
where p(x, y) is the joint probability distribution of feature X and target variable Y, and p(x) and p(y) are their respective marginal distributions. The mutual information values for all features are presented in Figure 3. These values were used to investigate the impact of feature selection on model performance. The correlation matrix for the 25 features with the highest mutual information and the class label is depicted in Figure 4. Each cell in the matrix shows the correlation between two variables and is often used to understand the relationships between different variables in a dataset. The values range from −1 to 1, where −1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation.
Figure 3. A summary of mutual information (MI) across the 100 features used.
Figure 4. Correlation matrix on the first 25 highest mutual information features.
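The MI ranking in Equation (7) can be approximated with scikit-learn’s mutual_info_classif; the following sketch assumes the preprocessed DataFrame `df` from the earlier sketches, with the illustrative label column name `loan_status_binary`.

```python
# Sketch: ranking features by mutual information with the class label.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns=["loan_status_binary"])   # the 100 preprocessed features (assumed name)
y = df["loan_status_binary"]                  # binary target: Good vs. Risk

mi = mutual_info_classif(X, y, random_state=42)            # MI estimate for each feature
mi_ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_ranking.head(25))                                 # the 25 most informative features
```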

3.2. Model Creations and Evaluations

An overview of the processes in this work is depicted in Figure 5. The raw dataset was explored for characteristics such as data types and missing values. Subsequently, the data were preprocessed to handle missing values. The dataset was then separated into training and testing sets. Two data splitting protocols were experimented with, i.e., hold-out cross-validation with a 70:30 ratio of training to testing sets and 4-fold cross-validation. Next, the training data were prepared in four versions based on the imbalanced data handling methods: original (no sampling), over-sampling, under-sampling, and combined sampling. Each training dataset version was used to create three models using the logistic regression, random forest, and gradient boosting approaches. The testing dataset was employed to evaluate model performance by calculating Accuracy, Precision, Recall, F1 score, and the Matthews Correlation Coefficient (MCC). In the context of imbalanced data, where one class may dominate the others, using macro-averaging for Precision, Recall, F1 score, and MCC provides a more balanced evaluation across classes. The confusion matrix, a tabular representation commonly employed to assess the effectiveness of a classification algorithm, was then displayed. This matrix provides a concise overview of the model’s performance by detailing the distribution of predicted and actual class labels (Figure 6). Note that green and red cells stand for the numbers of correctly and wrongly predicted samples, respectively. TP and TN are the numbers of samples correctly classified into the positive and negative classes, respectively, while FP and FN are the numbers of samples wrongly classified into the positive and negative classes, respectively. Subsequently, the key performance metrics Accuracy, Precision, Recall, F1 score, and MCC were computed as follows:
Figure 5. Overview of proposed methodology.
Figure 6. Two-by-two confusion matrix.
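The following sketch illustrates this pipeline for the 70:30 hold-out protocol. The specific samplers (SMOTE for over-sampling, random under-sampling, and SMOTE combined with Tomek links for combined sampling) and the default hyperparameters are illustrative assumptions, since the exact configurations are described in Section 2.3 and are not reproduced here; `X` and `y` follow the earlier sketches.

```python
# Sketch of the hold-out experiment: four training-set versions, three models.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)     # 70:30 hold-out split

samplers = {
    "no sampling": None,
    "over-sampling": SMOTE(random_state=42),
    "under-sampling": RandomUnderSampler(random_state=42),
    "combined sampling": SMOTETomek(random_state=42),
}
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

for s_name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train                     # original (no sampling) version
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    for m_name, model in models.items():
        model.fit(X_res, y_res)                             # train on the resampled data
        print(s_name, m_name, model.score(X_test, y_test))  # accuracy on the untouched test set
```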
Accuracy is a fundamental metric that measures the overall correctness of a classification model by assessing the proportion of testing data that are correctly predicted out of the total testing data size.
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Precision (Macro-Averaged) is a metric used to evaluate the precision of a classification model when dealing with imbalanced datasets. In macro-averaging, Precision is calculated individually for each class and then averaged across all classes.
\text{Precision} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i}
where C is the number of classes, and TP_i and FP_i are the true positives and false positives for class i.
Recall (Macro-Averaged) is a metric used to evaluate the recall of a classification model in the context of imbalanced datasets. In macro-averaging, Recall is calculated individually for each class and then averaged across all classes.
\text{Recall} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FN_i}
where FN_i represents the false negatives for class i.
F1 score (Macro-Averaged) is a metric that combines both Precision and Recall, offering a balanced assessment of a model’s performance on imbalanced datasets. In macro-averaging, the F1 score is calculated individually for each class and then averaged across all classes.
\text{F1 score} = \frac{1}{C} \sum_{i=1}^{C} \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
where Precision_i and Recall_i are the Precision and Recall for class i.
The Matthews Correlation Coefficient (MCC) is one of the metrics suitable for evaluating binary classification models, especially models trained on imbalanced datasets, because true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are all taken into account in its formula. The MCC is defined as:
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
The MCC value ranges between −1 and 1, where 1 is the best value and −1 is the worst. An MCC of 0 means that the model performs no better than random guessing.
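The metrics above map directly onto scikit-learn functions with macro averaging; the sketch below assumes the `model`, `X_test`, and `y_test` objects from the previous sketch.

```python
# Sketch: evaluating a fitted model with the five metrics described above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

y_pred = model.predict(X_test)

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1 score :", f1_score(y_test, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
```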

4. Results and Discussion

To address the imbalanced data issue, four versions of the training dataset, i.e., no sampling, over-sampling, under-sampling, and combined sampling, were experimented with. The number of samples in each training dataset is illustrated in Figure 7. Experiments with two methods of splitting training and testing data, 70:30 hold-out cross-validation and 4-fold cross-validation, were performed.
Figure 7. Comparison of training data sizes on various sampling methods.

4.1. Hold-Out Cross-Validation with 70:30 Ratio of Training and Testing Sets

The confusion matrices of the testing data predictions were produced by the logistic regression, random forest, and gradient boosting models trained on the four versions of training data, as shown in Figure 8a–d. The five performance metrics, i.e., Accuracy, Precision, Recall, F1 score, and MCC, are shown in Table 3. The highest performance, i.e., the first and second ranks across the five metrics, was achieved by the random forest and gradient boosting models trained on over-sampled data. In detail, the gradient boosting model with the over-sampling technique showed slightly better results, i.e., values of 1 for all five measures; however, this hold-out cross-validation experiment was performed only once, for convenience with a very large dataset. Therefore, for a solid experimental conclusion, an additional 4-fold cross-validation experiment was also conducted.
Figure 8. Confusion matrices of 70:30 hold-out cross-validation results. (a) No sampling testing data (original data). (b) Over-sampling testing data. (c) Under-sampling testing data. (d) Combined sampling testing data.
Table 3. The performance of three different machine learning techniques with various sampling approaches in the 70:30 hold-out cross-validation experiment. The superscript numbers in the brackets denote the performance ranking based on the evaluation measure in each column.

4.2. Four-Fold Cross-Validation

The average confusion matrices for the 4-fold cross-validation results are illustrated in Figure 9a–d. The performance metrics for logistic regression, random forest, and gradient boosting are shown in Table 4, Table 5, and Table 6, respectively. All three imbalanced data handling approaches, including over-sampling, under-sampling, and combined sampling, can improve the performance of models trained with the logistic regression, random forest, and gradient boosting algorithms. Considering only the logistic regression models, those using the combined sampling approach outperform the others. For the random forest and gradient boosting models, the under-sampling approach yielded better performance than the other sampling approaches. In general, across all 4-fold cross-validation results, the gradient boosting models with the under-sampling method gave superior performance. Additional comparisons of MCC and F1 score are shown in Figure 10 and Figure 11, respectively.
Figure 9. Average confusion matrices of four-fold cross-validation results. (a) No sampling testing data (original data). (b) Over-sampling testing data. (c) Under-sampling testing data. (d) Combined sampling testing data.
Table 4. Performance metrics for logistic regression with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.
Table 5. Performance metrics for random forest with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.
Table 6. Performance metrics for gradient boosting with different sampling approaches in 4-fold cross-validation (4-fold cv) experiments.
Figure 10. Average MCC comparison for different sampling methods.
Figure 11. Average F1 score comparison for different sampling methods.
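A hedged sketch of the 4-fold protocol is shown below; it applies resampling only inside each training fold (an assumption consistent with common practice, since the fold-level details are not stated here) and uses the under-sampling plus gradient boosting combination highlighted above, with `X` and `y` from the earlier sketches.

```python
# Sketch: 4-fold cross-validation with resampling applied only to training folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from imblearn.under_sampling import RandomUnderSampler

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
    X_te, y_te = X.iloc[test_idx], y.iloc[test_idx]
    X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_tr, y_tr)
    clf = GradientBoostingClassifier(random_state=42).fit(X_res, y_res)
    fold_scores.append(f1_score(y_te, clf.predict(X_te), average="macro"))
print("Mean macro F1 over 4 folds:", np.mean(fold_scores))
```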
Overall, the results from both cross-validation experiments indicate that the gradient boosting algorithm, combined with an appropriate imbalanced data handling technique during supervised model training, produces models that correctly classify both “Good” and “Risk” instances with very impressive ability.
Next, the feature selection method was applied, i.e., computing and ranking the mutual information (MI) values of each feature, in order to select a smaller set of k important features for the training set. The training data with the best imbalanced data handling technique for each model were further explored by preparing a smaller number of k features according to their MI values, to assess the trade-off between feature set size and model performance. The features were ranked based on their computed mutual information values. Three feature set sizes, i.e., k = 25, 50, and 100, were experimented with. The results of the logistic regression, random forest, and gradient boosting models on both 70:30 hold-out cross-validation and 4-fold cross-validation with the three feature sizes are shown in Figure 12, Figure 13, and Figure 14, respectively. Generally, the performance of all three supervised models was reduced only slightly. With k = 25 and 50 important features as training data, the gradient boosting models showed better results than the logistic regression and random forest models. Focusing on k = 50 important features, the random forest and gradient boosting models yielded values greater than 99% on all five performance metrics, i.e., Accuracy, Precision, Recall, F1 score, and MCC, whereas the logistic regression models gave values higher than 95% on four metrics, the exception being MCC. For k = 25 important features, the gradient boosting models still yielded Accuracy, Precision, Recall, and F1 score values not less than 99%, but the MCC values dropped to around 97.5%. These results show that when the number of features was reduced by half (k = 50), the performance values decreased by less than 1%, and even when the number of features was reduced by approximately 75% (k = 25), the performance values decreased by only 1–2%. Apart from that, the gradient boosting models using k = 100 important features performed better than the others in both the 70:30 hold-out cross-validation and 4-fold cross-validation experiments.
Figure 12. Logistic regression model performance on five metrics for k different numbers of features, i.e., k = 25, 50, and 100.
Figure 13. Random forest model performance on five metrics for k different numbers of features, i.e., k = 25, 50, and 100.
Figure 14. Gradient boosting model performance on five metrics for k different numbers of features, i.e., k = 25, 50, and 100.
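A sketch of the feature-size experiment follows, reusing the `mi_ranking` series from the earlier mutual information sketch; the resampling step is omitted here for brevity, and gradient boosting stands in for all three learners.

```python
# Sketch: retraining on the top-k features ranked by mutual information.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

for k in (25, 50, 100):
    top_k_features = mi_ranking.index[:k]                  # k highest-MI features
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[top_k_features], y, test_size=0.30, stratify=y, random_state=42)
    clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
    print(f"k={k:3d}  accuracy={clf.score(X_te, y_te):.4f}")
```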
To further compare our results with previous research, the performance of the proposed methods and other existing works on various versions of the LendingClub data is shown in Figure 15. Based on Accuracy, the proposed methods outperform the others.
Figure 15. Accuracy of the proposed method compared with existing works on various versions of the LendingClub data. Note that the ^ and * symbols denote a different dataset or experiment within the same work.

5. Conclusions and Future Work

This study provided a very efficient solution to the problem of credit risk prediction. To obtain predictive models that could outperform those from previous works, three popular machine learning methods, including logistic regression, random forest, and gradient boosting, were employed. Additionally, the imbalanced data problem was addressed by experimenting with various sampling strategies: under-sampling, over-sampling, and combined sampling. Based on our best model performance outcomes, the over-sampling and under-sampling methods robustly manage class-imbalanced data, especially when the model is trained with the gradient boosting method. In addition, the number of data features was reduced by selecting only important features for the training set according to their ranks computed by mutual information. A further experiment was performed using two reduced feature sets, one half and one quarter of the original feature size. The resulting model performance decreased only slightly. Remarkably, both the random forest and gradient boosting models created with the half-size reduced feature set showed impressive Accuracy values, higher than 99%.
This comprehensive analysis enhances the understanding of credit risk prediction using supervised learning methods combined with various imbalanced data handling strategies. Furthermore, the importance of features based on mutual information was addressed in order to maintain model performance with a smaller feature set of training data. Our proposed method and results offer a simple way to select a reduced set of important features by ranking the mutual information values of each feature. Although this method may not provide the optimal feature set size with the best performance, it can be applied to other large credit risk datasets with different feature sets. This approach does not significantly decrease performance, but better methods may exist. In future work, it may be beneficial to further investigate parameter optimization, particularly in handling imbalanced data, and to explore alternative feature selection methods beyond mutual information, such as correlation and symmetrical uncertainty, to improve model performance. In addition, ensemble techniques could improve the performance of the models with small feature sets. Apart from that, real-time data streams and dynamic model updating may increase the adaptability of credit risk prediction systems.

Author Contributions

Conceptualization, N.W. and S.T.; methodology, N.W. and S.T.; validation, N.W. and S.T.; formal analysis, N.W. and P.W.; investigation, N.W., P.W. and S.T.; data curation, N.W. and P.W.; writing—original draft preparation, N.W. and S.T.; writing—review and editing, N.W., S.J., S.S. and S.T.; visualization, N.W., P.W. and S.T.; supervision, S.J. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The used data are publicly available at https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1 (accessed on 17 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Noriega, J.P.; Rivera, L.A.; Herrera, J.A. Machine Learning for Credit Risk Prediction: A Systematic Literature Review. Data 2023, 8, 169. [Google Scholar] [CrossRef]
  2. Gjeçi, A.; Marinč, M.; Rant, V. Non-performing loans and bank lending behaviour. Risk Manag. 2023, 25, 7. [Google Scholar] [CrossRef]
  3. Liu, H.; Qiao, H.; Wang, S.; Li, Y. Platform Competition in Peer-to-Peer Lending Considering Risk Control Ability. Eur. J. Oper. Res. 2019, 274, 280–290. [Google Scholar] [CrossRef]
  4. Sulastri, R.; Janssen, M. Challenges in Designing an Inclusive Peer-to-Peer (P2P) Lending System. In Proceedings of the 24th Annual International Conference on Digital Government Research, DGO ‘23, New York, NY, USA, 11–14 July 2023; pp. 55–65. [Google Scholar] [CrossRef]
  5. Ko, P.C.; Lin, P.C.; Do, H.T.; Huang, Y.F. P2P Lending Default Prediction Based on AI and Statistical Models. Entropy 2022, 24, 801. [Google Scholar] [CrossRef]
  6. Kurniawan, R. Examination of the Factors Contributing To Financial Technology Adoption in Indonesia using Technology Acceptance Model: Case Study of Peer to Peer Lending Service Platform. In Proceedings of the 2019 International Conference on Information Management and Technology (ICIMTech), Denpasar, Indonesia, 19–20 August 2019; Volume 1, pp. 432–437. [Google Scholar] [CrossRef]
  7. Wang, Q.; Xiong, X.; Zheng, Z. Platform Characteristics and Online Peer-to-Peer Lending: Evidence from China. Financ. Res. Lett. 2021, 38, 101511. [Google Scholar] [CrossRef]
  8. Ma, Z.; Hou, W.; Zhang, D. A credit risk assessment model of borrowers in P2P lending based on BP neural network. PLoS ONE 2021, 16, e0255216. [Google Scholar] [CrossRef]
  9. Moscato, V.; Picariello, A.; Sperlí, G. A benchmark of machine learning approaches for credit score prediction. Expert Syst. Appl. 2021, 165, 113986. [Google Scholar] [CrossRef]
  10. Liu, W.; Fan, H.; Xia, M. Credit scoring based on tree-enhanced gradient boosting decision trees. Expert Syst. Appl. 2022, 189, 116034. [Google Scholar] [CrossRef]
  11. Kriebel, J.; Stitz, L. Credit default prediction from user-generated text in peer-to-peer lending using deep learning. Eur. J. Oper. Res. 2022, 302, 309–323. [Google Scholar] [CrossRef]
  12. Uddin, N.; Uddin Ahamed, M.K.; Uddin, M.A.; Islam, M.M.; Talukder, M.A.; Aryal, S. An ensemble machine learning based bank loan approval predictions system with a smart application. Int. J. Cogn. Comput. Eng. 2023, 4, 327–339. [Google Scholar] [CrossRef]
  13. Yin, W.; Kirkulak-Uludag, B.; Zhu, D.; Zhou, Z. Stacking ensemble method for personal credit risk assessment in Peer-to-Peer lending. Appl. Soft Comput. 2023, 142, 110302. [Google Scholar] [CrossRef]
  14. Muslim, M.A.; Nikmah, T.L.; Pertiwi, D.A.A.; Dasril, Y. New model combination meta-learner to improve accuracy prediction P2P lending with stacking ensemble learning. Intell. Syst. Appl. 2023, 18, 200204. [Google Scholar] [CrossRef]
  15. Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134. [Google Scholar] [CrossRef]
  16. Li, X.; Ergu, D.; Zhang, D.; Qiu, D.; Cai, Y.; Ma, B. Prediction of loan default based on multi-model fusion. Procedia Comput. Sci. 2022, 199, 757–764. [Google Scholar] [CrossRef]
  17. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–6 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  18. Chen, Y.R.; Leu, J.S.; Huang, S.A.; Wang, J.T.; Takada, J.I. Predicting Default Risk on Peer-to-Peer Lending Imbalanced Datasets. IEEE Access 2021, 9, 73103–73109. [Google Scholar] [CrossRef]
  19. Kumar, V.L.; Natarajan, S.; Keerthana, S.; Chinmayi, K.M.; Lakshmi, N. Credit Risk Analysis in Peer-to-Peer Lending System. In Proceedings of the 2016 IEEE International Conference on Knowledge Engineering and Applications (ICKEA), Singapore, 28–30 September 2016; pp. 193–196. [Google Scholar] [CrossRef]
  20. Setiawan, N. A Comparison of Prediction Methods for Credit Default on Peer to Peer Lending using Machine Learning. Procedia Comput. Sci. 2019, 157, 38–45. [Google Scholar] [CrossRef]
  21. Liu, Z.; Zhang, Z.; Yang, H.; Wang, G.; Xu, Z. An innovative model fusion algorithm to improve the recall rate of peer-to-peer lending default customers. Intell. Syst. Appl. 2023, 20, 200272. [Google Scholar] [CrossRef]
  22. Ziemba, P.; Becker, J.; Becker, A.; Radomska-Zalas, A.; Pawluk, M.; Wierzba, D. Credit Decision Support Based on Real Set of Cash Loans Using Integrated Machine Learning Algorithms. Electronics 2021, 10, 2099. [Google Scholar] [CrossRef]
  23. Dong, H.; Liu, R.; Tham, A.W. Accuracy Comparison between Five Machine Learning Algorithms for Financial Risk Evaluation. J. Risk Financ. Manag. 2024, 17, 50. [Google Scholar] [CrossRef]
  24. Stoltzfus, J.C. Logistic regression: A brief primer. Acad. Emerg. Med. 2011, 18, 1099–1104. [Google Scholar] [CrossRef]
  25. Manglani, R.; Bokhare, A. Logistic Regression Model for Loan Prediction: A Machine Learning Approach. In Proceedings of the 2021 Emerging Trends in Industry 4.0 (ETI 4.0), Raigarh, India, 19–21 May 2021; pp. 1–6. [Google Scholar] [CrossRef]
  26. Kadam, E.; Gupta, A.; Jagtap, S.; Dubey, I.; Tawde, G. Loan Approval Prediction System using Logistic Regression and CIBIL Score. In Proceedings of the 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 7–9 August 2023; pp. 1317–1321. [Google Scholar] [CrossRef]
  27. Zhu, X.; Chu, Q.; Song, X.; Hu, P.; Peng, L. Explainable prediction of loan default based on machine learning models. Data Sci. Manag. 2023, 6, 123–133. [Google Scholar] [CrossRef]
  28. Lin, M.; Chen, J. Research on Credit Big Data Algorithm Based on Logistic Regression. Procedia Comput. Sci. 2023, 228, 511–518. [Google Scholar] [CrossRef]
  29. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  30. Zhu, L.; Qiu, D.; Ergu, D.; Ying, C.; Liu, K. A study on predicting loan default based on the random forest algorithm. Procedia Comput. Sci. 2019, 162, 503–513. [Google Scholar] [CrossRef]
  31. Rao, C.; Liu, M.; Goh, M.; Wen, J. 2-stage modified random forest model for credit risk assessment of P2P network lending to “Three Rurals” borrowers. Appl. Soft Comput. 2020, 95, 106570. [Google Scholar] [CrossRef]
  32. Reddy, C.S.; Siddiq, A.S.; Jayapandian, N. Machine Learning based Loan Eligibility Prediction using Random Forest Model. In Proceedings of the 2022 7th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 12–14 November 2022; pp. 1073–1079. [Google Scholar] [CrossRef]
  33. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  34. Zhou, L.; Fujita, H.; Ding, H.; Ma, R. Credit risk modeling on data with two timestamps in peer-to-peer lending by gradient boosting. Appl. Soft Comput. 2021, 110, 107672. [Google Scholar] [CrossRef]
  35. Zhu, X.; Chen, J. Risk Prediction of P2P Credit Loans Overdue Based on Gradient Boosting Machine Model. In Proceedings of the 2021 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 29–31 July 2021; pp. 212–216. [Google Scholar] [CrossRef]
  36. Bai, M.; Zheng, Y.; Shen, Y. Gradient boosting survival tree with applications in credit scoring. J. Oper. Res. Soc. 2022, 73, 39–55. [Google Scholar] [CrossRef]
  37. Qian, H.; Wang, B.; Yuan, M.; Gao, S.; Song, Y. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Syst. Appl. 2022, 190, 116202. [Google Scholar] [CrossRef]
  38. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Int. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  39. Bach, M.; Werner, A.; Palt, M. The Proposal of Undersampling Method for Learning from Imbalanced Datasets. Procedia Comput. Sci. 2019, 159, 125–134. [Google Scholar] [CrossRef]
  40. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  41. Ethon0426. Lending Club 2007–2020Q3. Available online: https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1 (accessed on 17 January 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
