1. Introduction
A credit loan is a loan issued on the strength of the borrower's reputation, and credit lending is a principal source of profit for financial institutions, especially banks. With the advent of the big data era, financial data have seen explosive growth, giving rise to a variety of loan products (such as car loans, mortgages, and credit cards) [1]. At the same time, various default events have also arisen. If credit institutions cannot accurately predict user default risk, they cannot achieve effective risk management [2]. The aggravation and spread of default events can disrupt the normal operations of banks and other financial institutions and, in serious cases, can lead to company bankruptcy or a financial crisis for the entire industry [3]. Therefore, user default prediction is a top priority in banking. In recent years, the Chinese government has responded to international calls to develop green finance and has actively expanded the scale of green credit [4], which not only helps to cope with economic downturn pressure but also promotes the development of a green economy and provides solid financial support for sustainable development. Moreover, the advent of big data and advances in machine learning have provided new opportunities for user default prediction.
Traditional methods for user default prediction include logistic regression [5,6], Decision Tree (DT) [6], and support vector machine [7]; these classification models are widely used in default prediction because of their high efficiency. Zhangjie Huang [8] used a SMOTETomek-LightGBM-LR credit default prediction model to improve the accuracy of credit default prediction, providing the credit industry with a new method to reduce default rates and improve the efficiency of capital utilization. However, that approach considers only the class imbalance of the credit dataset when processing data and is limited in its ability to handle the complex, dynamic relationships between the predictor and target variables, making it difficult to capture nonlinear relationships. Syed Nor S. H. [9] developed a personal bankruptcy prediction model using DT technology, which helps financial institutions assess their potential borrowers. Nevertheless, the model relies on a single algorithm and is therefore still deficient in prediction accuracy and robustness. M. Z. Abedin et al. [10] introduced a credit default prediction model built on two main techniques, support vector machine (SVM) and Probabilistic Neural Network (PNN), aiming to improve the accuracy of credit default prediction by fully exploiting the advantages of both algorithms.
Ensemble learning, as a research hotspot, aims to integrate data fusion, data modeling, and data mining into a unified framework [11]. Specifically, it combines multiple weak learners to build a machine learning method with strong performance [12]. To overcome the highly imbalanced default and non-default categories in small business credit risk assessment, Abedin M. Z. [13] proposed an extended ensemble approach based on the Weighted Synthetic Minority Oversampling Technique, called the WSMOTE-ensemble. F. N. Khan et al. [14] proposed a credit card fraud prediction and classification model based on a deep neural network and ensemble learning. The model integrates four algorithms, the Bayes classifier, logistic regression, DT, and the deep belief network, and performs excellently in predicting credit card customer defaults, with good accuracy and interpretability. As technology develops, more and more transactions are completed over the network, and transaction data are high-dimensional and multi-sourced, far exceeding the processing capacity of traditional solutions. The efficiency, accuracy, and strong robustness demonstrated by ensemble learning in multiple fields have brought new solutions and development opportunities to these problems. Ensemble learning can be divided into two types, homogeneous and heterogeneous, according to whether the base learners belong to the same category. In homogeneous ensemble learning, all base learners belong to the same category; this is also the most widely used type. For example, Lean Yu [15] proposed an SVM-based ensemble learning system for customer risk identification and customer relationship management; the results showed that the ensemble, with its multiple components and space selection strategies, performed best across various test samples. In heterogeneous ensemble learning, the individual learners are not all of the same kind. He H. et al. [16] proposed a default prediction model that combines deep learning classifiers with tree-based classifiers via ensemble learning, and the proposed model significantly improved the overall performance. In previous work, each model performed well in certain aspects but also had limitations. Through ensemble learning, we can combine the prediction results of these models, for example by voting or weighted averaging, to obtain more accurate and robust predictions.
In this study, to ensure the accuracy of the experiments, we first used Hot-deck imputation and Mean imputation to fill in the missing values of the data variables. To mitigate the impact of differences in magnitude among the variables on the classification results, prior to model construction we employed z-score standardization to bring the numerical ranges of different features to similar levels, eliminating the influence of scale on the data. Standardization also improved the distribution of the data, making it closer to a normal distribution and thereby enhancing the performance and interpretability of the models. Furthermore, we applied three feature selection methods, based on statistical learning, regression coefficients, and principal component analysis, to select appropriate feature variables. The resulting data were then fed into both homogeneous and heterogeneous ensemble learning models, such as CatBoost, RUSBoost, and gcForest, to predict user defaults, and the results were compared with those of traditional single learning models to determine and analyze the optimal model.
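The mean-imputation and z-score steps described above can be sketched as a small scikit-learn pipeline; the toy matrix below is illustrative, not the paper's dataset:

```python
# Preprocessing sketch: mean-impute missing entries, then z-score standardize
# each feature to zero mean and unit variance (illustrative pipeline).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # missing value, replaced by the column mean
              [3.0, 400.0],
              [np.nan, 600.0]])

prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_std = prep.fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 per feature after z-scoring
print(X_std.std(axis=0))   # approximately 1 per feature
```

Fitting imputation and scaling inside one pipeline also ensures that, in a train/test split, the statistics are learned from the training portion only.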
The experimental results of this paper show that the method based on adjusted homogeneous and heterogeneous ensemble learning can achieve good prediction of user defaults, and the accuracy of user default prediction based on the CatBoost and RUSBoost ensemble algorithms reaches 100%. This paper contributes to the development of existing user default prediction algorithms through a novel and effective approach based on homogeneous and heterogeneous ensemble learning, achieving higher performance and stronger generalization than traditional methods and state-of-the-art techniques. Our findings have important implications for practitioners and researchers in the fields of finance, banking, and insurance and lay the foundation for future research in this field.
3. Results
In this paper, the effects of different ensemble learning algorithms on the final model's predictions were compared with those of a single-classifier model. Different filling methods and feature selection algorithms also influence the prediction results. To avoid losing important features, we additionally compared the results obtained when the input data did and did not undergo feature selection. The entire default risk prediction process is shown in Figure 4.
The experiments mainly use the SVM, DT, KNN, LDA, RF, gcForest, CatBoost, RUSBoost, Imbalance-XGBoost, and Stacking models. We compare, in turn, the effects on user default prediction of the different filling methods, of whether feature selection is applied, of the different feature selection methods, and of the different models.
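Among the models listed, Stacking is the heterogeneous combiner: out-of-fold predictions of dissimilar base learners are fed to a meta-learner. A minimal sketch with scikit-learn's `StackingClassifier` follows; the base learners, meta-learner, and synthetic data are illustrative assumptions, not the paper's exact configuration:

```python
# Stacking sketch: heterogeneous base learners whose out-of-fold predictions
# train a logistic-regression meta-learner (illustrative configuration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True, random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # out-of-fold base predictions avoid leaking training labels
)
scores = cross_val_score(stack, X, y, cv=3)
print(f"stacking CV accuracy: {scores.mean():.3f}")
```

The internal cross-validation (`cv=3`) is what distinguishes stacking from naively refitting the meta-learner on in-sample base predictions, which would overfit.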
Table 4 shows the impact of the different filling methods on the predicted results. We found that Mean imputation outperforms Hot-deck imputation, especially with single classifiers. We then compared the effects of the different feature selection methods on model performance; the results are shown in Table 5. With feature selection based on regression algorithms, the overall performance of the models is best. Among the single classifiers, feature selection using the Elastic Net regression coefficients followed by the DT model achieves the highest accuracy, 98.59%. The second-best single-classifier result, 97.18%, is obtained with stepwise regression feature selection under the BIC criterion. Regression coefficients clearly indicate the relative contribution of each feature to the model, so features with little or no contribution can be removed.
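Coefficient-based selection of this kind can be sketched as follows. Since the target here is binary, the sketch uses an elastic-net-penalized logistic regression as the coefficient source rather than plain ElasticNet regression; the penalty strengths and threshold are illustrative assumptions:

```python
# Feature selection via elastic-net coefficients: features whose fitted
# coefficients shrink to (near) zero contribute little and are dropped.
# Penalty settings and the 1e-5 threshold are illustrative, not the paper's.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectFromModel(
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.1, max_iter=5000),
    threshold=1e-5,  # keep only features with non-negligible coefficients
)
X_sel = selector.fit_transform(X, y)
print(f"kept {X_sel.shape[1]} of {X.shape[1]} features")
```

The L1 component of the elastic-net penalty drives uninformative coefficients to exactly zero, which is what makes the coefficients usable as a selection criterion.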
Table 5 also shows the influence of the different feature selection methods on the results of ensemble learning. The models with the best prediction results are RF and CatBoost. Inputting the features selected by stepwise regression under the AIC criterion into the RF model yields the best prediction accuracy of 100.00%, while the highest accuracy of CatBoost is 98.59%.
Figure 5 shows, for each model, the average of the indicator values obtained under the four regression-coefficient feature selection algorithms. The DT indicator values are all above 95.00%. The RF, CatBoost, RUSBoost, and Stacking indicator values are also high, all exceeding 93.00%; in particular, the CatBoost values exceed 97%.
Under Hot-deck imputation, we fed the data without feature selection into each ensemble learning model; the results are shown in Figure 6, where the indicator values of CatBoost and RUSBoost reach 100%. Taking the optimal results of each ensemble learning model, shown in Table 6, we find that Hot-deck imputation is more suitable for the ensemble learning algorithms, whereas, combined with Table 4, the single classifiers are better suited to Mean imputation.
The experimental results of this paper show that the ensemble learning algorithms have better predictive ability for user defaults, and the accuracy of using the CatBoost algorithm to predict user defaults is as high as 100%, with the other indicator values also reaching 100%. The overall effect of ensemble learning is better than that of single classifiers.
4. Discussion
In this work, we demonstrate the effects of different adjusted ensemble learning algorithms on the final model prediction compared with a single classifier model and analyze the effects of different filling methods and feature selection algorithms on the prediction results. The results show that an ensemble learning algorithm has better predictive ability in user default prediction, especially when the CatBoost algorithm is used; its prediction accuracy is as high as 100%. Compared with a single classifier, ensemble learning has a better overall effect.
Ensemble learning algorithms combine predictions from multiple base models to reduce the bias and variance of a single model, thereby improving the overall prediction accuracy. Additionally, because they draw on multiple diverse base models, ensemble models exhibit better robustness, maintaining good predictive performance across different data distributions and feature conditions. In our research, we found that traditional single machine learning models suffer from poor interpretability, struggle to capture nonlinear relationships, and fall short in both prediction accuracy and robustness. The heterogeneous and homogeneous ensemble learning model for user default prediction efficiently handles large volumes of high-dimensional data and, by effectively leveraging predictions from multiple models, mitigates the risk of overfitting and improves generalization.
In addition, the results from Mean imputation were found to be better than those from Hot-deck imputation, especially with single classifiers. The reason for this difference is that Hot-deck imputation requires donor samples with no null values, and in practice it is difficult to find an ideal donor pool. Consequently, the overall accuracy of the models using Mean imputation exceeds that of the models using Hot-deck imputation.
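The donor-based mechanism behind Hot-deck imputation can be illustrated with a minimal nearest-donor sketch; the distance rule (Euclidean distance on the observed columns) is one common choice, assumed here for illustration:

```python
# Minimal hot-deck imputation sketch: fill each missing entry with the value
# from the nearest fully observed "donor" row, where nearness is measured
# only on the columns the incomplete row actually observed.
import numpy as np

def hot_deck_impute(X):
    X = X.astype(float).copy()
    # Donor pool: rows with no missing values (this is the hard requirement
    # discussed above -- a good donor pool may not exist in practice).
    donors = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        dist = np.linalg.norm(donors[:, ~miss] - row[~miss], axis=1)
        X[i, miss] = donors[dist.argmin(), miss]  # copy from closest donor
    return X

data = np.array([[1.0, 10.0],
                 [1.1, np.nan],   # closest donor is the similar first row
                 [5.0, 50.0]])
print(hot_deck_impute(data))  # the NaN is filled with 10.0
```

Mean imputation, by contrast, needs only per-column means, which is why it degrades more gracefully when few complete donor rows are available.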
Future research directions could include the further optimization of ensemble learning algorithms, the exploration of more efficient feature selection methods, and deeper analysis and processing methods for unbalanced data.
5. Conclusions
In this paper, an adjusted model based on homogeneous and heterogeneous ensemble learning algorithms is proposed for user default prediction. To address the missing data problem, two methods suitable for financial datasets, Hot-deck imputation and Mean imputation, are adopted. Mean imputation outperforms Hot-deck imputation in the single-classifier models. Among the ensemble learning models, inputting data filled by Hot-deck imputation into the CatBoost and RUSBoost models achieves 100% classification accuracy, with the other five indicators (specificity, sensitivity, F1-score, kappa, and MCC) also reaching 100%. The experiments show that the ensemble learning algorithms can predict user default efficiently and accurately.
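The six indicators reported above can all be derived from a model's predictions; the sketch below computes them on a toy label vector (not the paper's data), using a perfect prediction so that every indicator equals 100%, mirroring the reported CatBoost/RUSBoost result:

```python
# Computing the evaluation indicators used above: accuracy, sensitivity,
# specificity, F1-score, kappa, and MCC (toy labels for illustration).
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, matthews_corrcoef)

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # perfect prediction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true-positive rate (recall on defaulters)
specificity = tn / (tn + fp)   # true-negative rate (recall on non-defaulters)

print(accuracy_score(y_true, y_pred), sensitivity, specificity,
      f1_score(y_true, y_pred), cohen_kappa_score(y_true, y_pred),
      matthews_corrcoef(y_true, y_pred))
```

Kappa and MCC are the most informative of the six under class imbalance, since plain accuracy can be inflated by always predicting the majority (non-default) class.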
Traditional single-classifier methods cannot adapt to complex financial scenarios, whereas ensemble learning, by mining the nonlinear characteristics of financial data, can be applied to a variety of complex financial settings. At the same time, this article also addresses risk assessment and modeling methods for analyzing the new green credit business. The experimental exploration and results of this paper provide new ideas and prospects for risk assessment modeling.