1. Introduction
In today’s dynamic and interconnected global economy, credit plays a pivotal role in facilitating economic activities and fostering financial growth. Lenders and financial institutions rely heavily on credit scoring models to assess the creditworthiness of individuals, businesses, and other entities seeking access to financial products and services. A credit score is a numerical representation of an individual’s creditworthiness, which helps lenders gauge the risk associated with extending credit and making lending decisions. As such, credit scoring has become an indispensable tool in the modern financial landscape, shaping access to credit and influencing financial outcomes for millions of borrowers worldwide. The development and refinement of credit scoring models have evolved significantly over the years, driven by advancements in data analytics, statistical modeling techniques, and the availability of vast amounts of financial and nonfinancial data [1,2]. Traditionally, credit scoring was primarily based on a few key factors, such as payment history, outstanding debt, length of credit history, and new credit applications. However, contemporary credit scoring models have incorporated a more diverse set of variables and sophisticated algorithms to enhance predictive accuracy and provide a more comprehensive assessment of credit risk. The importance of credit scoring cannot be overstated, as it not only affects the availability of credit but also influences interest rates, loan terms, and overall financial inclusion. Access to affordable credit is crucial for individuals and businesses to pursue their aspirations, invest in productive ventures, and contribute to economic growth. Moreover, credit scoring also plays a vital role in mitigating risks for lenders, enabling them to make informed decisions, manage their loan portfolios effectively, and maintain the stability of the financial system [3,4].
Traditionally, credit scoring has heavily relied on manual processes, limited variables, and subjective criteria, leading to inefficiencies and potential biases in the evaluation process. However, the advent of machine learning techniques has revolutionized the credit scoring landscape, offering automated and data-driven approaches that can significantly enhance the accuracy and efficiency of credit assessments. Machine learning algorithms have demonstrated remarkable capabilities in handling complex and high-dimensional data, learning patterns and relationships, and making predictions based on historical information. By training on large-scale credit datasets, machine learning models can capture intricate credit patterns that may be overlooked by traditional methods. Furthermore, these algorithms can adapt and evolve as new data become available, ensuring their relevance in dynamic credit markets. The utilization of machine learning in credit scoring holds the promise of providing lenders with more objective, consistent, and reliable credit assessment models [5,6,7].
The objective of this research is to conduct experiments and analyses using various machine learning classifiers on different credit approval datasets. The research aims to evaluate and compare the performance of these classifiers in terms of multiple evaluation metrics, including accuracy, sensitivity, specificity, precision, F1 score, receiver operating characteristic (ROC), balanced accuracy, and weighted sum metric (WSM) performance. Through these systematic experiments, the research seeks to assess how well these classifiers can predict credit approval outcomes and identify which classifiers perform best under different dataset conditions. Additionally, the research involves analyzing the impact of various data preprocessing techniques, such as feature selection and scaling, on the classifiers’ performance. The key research questions to be addressed in this study include the following:
- How do mathematical formulations underpin the machine learning algorithms employed in credit score classification, and which algorithms demonstrate superior predictive performance?
- In the context of credit scoring models, how does the mathematical modeling of feature selection, especially using the PSO metaheuristic optimizer, influence the model’s accuracy and efficiency?
- In what ways can the mathematical optimization of hyperparameters (e.g., learning rates, regularization strengths) in machine learning models influence their performance in credit score classification, and are there specific optimization algorithms that are more effective for this domain?
- Through the incorporation of the mathematical model of the LIME explainer, what insights can be gleaned concerning the strengths, limitations, and interpretability of various machine learning approaches in the context of credit scoring?
- Based on the empirical findings and mathematical rigor introduced in this study, how can practitioners be better equipped to choose the most suitable machine learning techniques and feature selection methods for credit score classification?
To achieve these research objectives, a comprehensive experimental analysis is conducted using publicly available credit datasets. Various machine learning algorithms, including but not limited to logistic regression, decision trees, random forests, support vector machines, and neural networks, are implemented and evaluated. The performance of these algorithms is measured using standard evaluation metrics such as accuracy, precision, recall, and F1 score. Furthermore, the impact of feature selection methods is investigated to determine the most relevant variables for credit scoring. The significance of this research extends across several dimensions, providing a comprehensive understanding of credit scoring models due to the following:
- This study introduces a robust mathematical framework underpinning machine learning algorithms and preprocessing techniques in the realm of credit scoring. This framework ensures the solidity of credit scoring models, providing a sound basis for analysis and decision making.
- Through the rigorous mathematical modeling of feature selection, particularly harnessing the Particle Swarm Optimization (PSO) metaheuristic optimizer, this research provides valuable insights into the identification of the most relevant variables for credit scoring. This optimization process not only improves the accuracy of credit scoring models but also enhances their computational efficiency, making them more practical for real-world applications.
- This study places a strong emphasis on the mathematical model underlying the Local Interpretable Model-Agnostic Explanations (LIME) explainer. By delving into the intricacies of LIME’s mathematical foundations, it highlights the pivotal role of explainability techniques in enhancing the transparency and interpretability of complex machine learning models used in credit scoring. This transparency is crucial for building trust in the decision-making process.
- This research provides compelling empirical evidence by evaluating the effectiveness of various machine learning techniques for credit score classification. This empirical analysis is conducted across multiple datasets, adding depth and credibility to the findings. The benchmarking of different approaches offers practical insights into their performance under diverse conditions.
- By systematically comparing and analyzing different algorithms, feature selection methods, and the LIME explainer, this research offers valuable guidance to practitioners in the field of credit scoring. It assists them in selecting the most suitable techniques and strategies for specific credit scoring tasks, ultimately leading to better decision-making processes within financial institutions and lending companies.
- The findings of this research not only contribute to the current state of credit scoring but also pave the way for further advancements in the field. By identifying areas where mathematical rigor, feature selection, and interpretability can be enhanced, it opens doors to future research, innovation, and continuous improvement in credit scoring models and practices. This advancement is essential in keeping pace with evolving financial landscapes and data-driven technologies.
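To make the PSO-based feature selection concrete, the following is a minimal sketch of a binary PSO loop in plain NumPy. The dataset size, the toy fitness function (which rewards a hypothetical set of informative features and penalizes subset size), and all swarm parameters here are illustrative assumptions, not the configuration used in this study; in practice the fitness would be a classifier’s cross-validated score on the candidate feature subset.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 14                 # e.g., 14 attributes, as in the Australian dataset
INFORMATIVE = {3, 4, 7, 8, 9}   # hypothetical "truly useful" feature indices

def fitness(mask):
    """Toy objective: reward informative features, penalize subset size."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.05 * mask.sum()

def binary_pso(n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5):
    # Positions are 0/1 inclusion masks; velocities are real-valued.
    pos = (rng.random((n_particles, N_FEATURES)) < 0.5).astype(float)
    vel = rng.normal(0.0, 0.1, (n_particles, N_FEATURES))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, N_FEATURES))
        # Standard velocity update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))          # sigmoid transfer function
        pos = (rng.random((n_particles, N_FEATURES)) < prob).astype(float)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest

best_mask = binary_pso()
selected = np.flatnonzero(best_mask)
print("selected feature indices:", selected)
```

The sigmoid transfer function converts real-valued velocities into inclusion probabilities, which is one common way of adapting continuous PSO to a binary feature-selection search space.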
The rest of this research paper is structured as follows: Section 2 presents a review of the literature concerned with feature selection and machine learning algorithms for credit scoring. Section 3 presents the datasets utilized in this study. Section 4 provides a detailed discussion of the proposed approach. Section 5 describes the experiments conducted and discusses their outcomes. Lastly, Section 6 concludes this paper and outlines future research directions.
2. Feature Selection, Machine Learning, and Credit Scoring: A Review of Literature
Default risk is a primary concern in online lending, prompting the use of credit scoring models to assess borrower creditworthiness. Existing efforts have mainly focused on improving assessment methods, without adequately addressing the quality of credit data, often plagued by noisy, redundant, or irrelevant features that hinder model accuracy. Effective feature selection methods are crucial for enhancing credit evaluation accuracy. Current feature selection methods in online credit scoring suffer from issues like subjectivity, time consumption, and low accuracy, necessitating the introduction of innovative approaches. Zhang et al. [8] proposed a solution called the local binary social spider algorithm (LBSA), which incorporates two local optimization strategies (i.e., opposition-based learning (OBL) and improved local search algorithm (ILSA)) into BinSSA. These strategies address the aforementioned drawbacks. Comparative experiments conducted on three typical online credit datasets (i.e., Paipaidai (PPD), Renrendai (RRD) in China, and Lending Club (LC) in the United States) concluded that LBSA significantly reduces feature subset redundancy, enhances iterative stability, and improves credit scoring model accuracy and effectiveness.
Tripathi et al. [9] directed their efforts toward enhancing credit scoring models employed by financial institutions and credit industries. Their primary objective was to enhance model performance by introducing a hybrid methodology that combines feature selection with a multilayer ensemble classifier framework. This hybrid model was meticulously crafted in three distinct phases: initial preprocessing and classifier ranking, followed by ensemble feature selection, and ultimately the utilization of the selected features within a multilayer ensemble classifier framework. To further optimize ensemble performance, they introduced a classifier placement algorithm based on the Choquet integral value. Then, the researchers conducted experiments using real-world datasets, including Australian (AUS), Japanese (JPD), German-categorical (GCD), and German-numerical (GND). The findings indicated that the features chosen through their proposed approach exhibited enhanced representativeness, leading to improved classification accuracy across various classifiers such as quadratic discriminant analysis (QDA), Naïve Bayes (NB), multilayer feed-forward neural network (MLFN), time-delay neural network (TDNN), distributed time-delay neural network (DTNN), decision tree (DT), and support vector machine (SVM). Additionally, for all the credit scoring datasets considered, the proposed ensemble model consistently outperformed traditional ensemble models in terms of accuracy, sensitivity, and G-measure.
Furthermore, Zhang et al. [10] introduced a novel multistage ensemble model with enhanced outlier adaptation to enhance credit scoring predictions. To mitigate the impact of outliers in noisy credit datasets, an improved local outlier factor algorithm was employed, incorporating a bagging strategy to identify and integrate outliers into the training set, thereby enhancing base classifier adaptability. Additionally, for improved feature interpretability, a novel dimension-reduced feature transformation method was proposed to hierarchically evolve and extract salient features. To further enhance predictive power, a stacking-based ensemble learning approach with self-adaptive parameter optimization was introduced, automatically optimizing base classifier parameters and constructing a multistage ensemble model. The performance of this model was evaluated across ten datasets (e.g., Australian, Japanese, German, Taiwan, and Polish credit datasets) using six evaluation metrics, and the reported experimental results demonstrated the superior performance and effectiveness of the suggested approach.
A sequential ensemble credit scoring model based on XGBoost, a variation of the gradient boosting machine, was proposed by Xia et al. [11]. The proposed XGBoost-based credit scoring model consists of three phases (i.e., data preprocessing, data scaling, and missing value marking). The redundant features are then removed using a model-based feature selection approach, which enhances performance and lowers computing costs. The final model is trained using the acquired configuration after the hyperparameters have been tuned using the Tree-structured Parzen Estimator (TPE) method. The results show that TPE hyperparameter optimization outperforms grid search, random search, and manual search. The proposed model also provides feature importance scores and decision charts, which enhance the interpretability of the credit scoring model. Moreover, Liu et al. [12] introduced two tree-based augmented GBDTs, AugBoost-RFS and AugBoost-RFU. These methods incorporate a stepwise feature augmentation mechanism to diversify base classifiers within GBDT, and they maintain interpretability through tree-based embedding techniques. Experimental results on four large-scale credit scoring datasets demonstrated that AugBoost-RFS and AugBoost-RFU outperform standard GBDT. Moreover, their supervised tree-based feature augmentation achieved competitive results compared with neural network-based methods, while significantly improving efficiency.
Chen et al. [13] proposed a multilevel Weighted Voting classification algorithm based on the combination of classifier ranking and the AdaBoost algorithm. Four feature selection methods were used to select the features; five classifiers were then chosen from seven commonly used heterogeneous classifiers and ranked, and AdaBoost was subsequently used to boost the performance of the selected base classifiers and to recompute the F1 scores and ranks. The effects of the ensemble frameworks Majority Voting (MV), Weighted Voting (WV), Layered Majority Voting (LMV), and Layered Weighted Voting (LWV) were all evaluated in terms of accuracy, sensitivity, specificity, and G-measure. The outcome of the experiments showed that the presented method achieved significant results on the Australian credit score data and some progress on the German loan approval data. In Gicić et al. [14], stacked unidirectional and bidirectional LSTM networks were applied to solve credit scoring tasks. The proposed model exploited the full potential of the three-layer stacked LSTM and BiLSTM architecture with the treatment and modeling of public datasets. The attributes of each loan instance were transformed into a matrix sequence using a fixed sliding window with a one-time-step stride. The proposed models outperformed existing and more complex deep learning models and, thus, succeeded in preserving their simplicity.
Kazemi et al. [15] proposed an approach based on a Genetic Algorithm (GA) and neural networks (NNs) to automatically find customized cut-off values. Since credit scoring is a binary classification problem, two popular credit scoring datasets (i.e., the “Australian” and “German” credit datasets) were used to test the proposed approach. The numerical results reveal that the proposed GA-NN model could successfully find customized acceptance thresholds, considering predetermined performance criteria, including Accuracy, Estimated Misclassification Cost (EMC), and AUC for the tested datasets. Furthermore, the best-obtained results and the paired samples t-test results showed that utilizing the customized cut-off points leads to a more accurate classification than the commonly used threshold value of 0.5. Khatir and Bee [16] aimed to pinpoint the most significant predictors of credit default to construct machine learning classifiers capable of efficiently distinguishing defaulters from nondefaulters. They proposed five machine learning classifiers, and each of them was combined with different feature selection techniques and various data-balancing approaches. Given the imbalance in the used dataset (i.e., German Credit Data), three sample-modifying algorithms were used, and their impact on the performance of the classification models was evaluated. The key findings highlighted that the most effective classifier is a random forest combined with random forest recursive feature elimination and random oversampling. Moreover, it underscored the value of data-balancing algorithms, particularly in enhancing sensitivity.
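The pipeline reported by Khatir and Bee (random forest recursive feature elimination combined with random oversampling) can be sketched along the following lines. The synthetic dataset, class ratio, and parameter values are illustrative assumptions, not the authors' configuration; a library such as imbalanced-learn could replace the manual oversampling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(42)

# Synthetic stand-in for an imbalanced credit dataset (~90% good, ~10% bad payers).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)

# Random oversampling: duplicate minority-class rows until the classes balance.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Recursive feature elimination driven by random forest feature importances.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
               n_features_to_select=8)
selector.fit(X_bal, y_bal)
print("kept feature indices:", np.flatnonzero(selector.support_))
```

Note that oversampling before feature selection, as sketched here, is one design choice; in a rigorous evaluation the oversampling would be applied only inside each training fold to avoid leakage.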
Khan and Ghosh [17] introduced an improved version of the random wheel classifier. Their proposed approach was evaluated using two datasets (i.e., Australian and South German credit approval datasets). The results showed that their approach not only delivers more accurate and precise recommendations but also offers interpretable confidence levels. Additionally, it provided explanations for each credit application recommendation. This inclusion of recommendation confidence and explanations can instill greater trust in machine-provided intelligence, potentially enhancing the efficiency of the credit approval process. Haldankar [18] discussed the use of data mining techniques to identify fraud in various domains, particularly focusing on risk detection. The study proposed a cost-sensitive classifier for detecting risk using the Statlog (German Credit Data) dataset. The study demonstrated the effectiveness of proper feature selection combined with an ensemble approach and thresholding in reducing the overall cost. The study reported an ACC of 76% and a SPC of 55%.
Wang et al. [19] focused on ensemble classification. They conducted an analysis and comparison of SVM ensembles using four different ensemble constructing techniques. They reported the highest ACC of 85.35% using the Statlog (Australian credit approval) dataset and 76.41% using the Statlog (German Credit Data) dataset. Additionally, Novakovic et al. [20] presented the performance of the C4.5 decision tree algorithm with wrapper-based feature selection. They conducted tests using eighteen datasets to compare the classification ACC results with the C4.5 decision tree algorithm. The authors demonstrated that wrapper-based feature selection, when applied to the C4.5 decision tree classifier, effectively contributed to the detection and elimination of irrelevant and redundant attributes and noise in the data. They reported an ACC of 71.72% using the J48 reduced approach on the Statlog (German Credit Data) dataset.
5. Experiments and Discussions
The present study utilized the Python 3.11.4 programming language to conduct experiments, employing various packages, including scikit-learn 0.24.2 and LIME 0.2.0.1. The experiments were executed on a device equipped with 128 GB of RAM, a 4 GB NVIDIA graphics card, and the Windows 11 operating system.
5.1. Performance Report Using “Statlog (Australian Credit Approval)” Dataset
Table 3 provides a performance report for various classifiers using the “Statlog (Australian Credit Approval)” dataset. The table includes several evaluation metrics, such as accuracy (ACC), sensitivity (SNS), specificity (SPC), precision (PRC), F1 score (F1), receiver operating characteristic (ROC), balanced accuracy (BAC), and weighted sum metric (WSM) performance.
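All of the metrics in Table 3 except WSM (whose weights are not restated here) can be derived from the confusion matrix and predicted scores. The following brief sketch uses illustrative predictions, not values from the paper, and scikit-learn's metrics module:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Illustrative ground truth, hard predictions, and probability scores.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([.9, .2, .8, .4, .1, .7, .95, .3, .85, .15])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)           # (TP + TN) / total
sns = recall_score(y_true, y_pred)             # sensitivity = TP / (TP + FN)
spc = tn / (tn + fp)                           # specificity = TN / (TN + FP)
prc = precision_score(y_true, y_pred)          # TP / (TP + FP)
f1  = f1_score(y_true, y_pred)                 # harmonic mean of PRC and SNS
roc = roc_auc_score(y_true, y_score)           # area under the ROC curve
bac = balanced_accuracy_score(y_true, y_pred)  # (SNS + SPC) / 2
print(acc, sns, spc, prc, f1, roc, bac)
```

Note that sensitivity and specificity depend only on the thresholded predictions, whereas the ROC area is computed from the continuous scores.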
Among the classifiers, AdaBoost achieved an ACC of 87.54%. It demonstrated good SNS (88.27%) and SPC (86.95%), indicating its ability to correctly identify both positive and negative instances. However, its PRC (84.42%) was slightly lower compared with other classifiers. The F1 score (86.31%) and ROC (87.61%) were reasonably high, reflecting a balanced performance. The BAC was 87.61%, and the WSM performance was 86.96%. AdaBoost used the L1 scaler, and its hyperparameters included a learning rate of approximately 0.87672 and 26 estimators. The selected features were X1, X3, X5, X8, X9, X11, and X12.
The Decision Tree (DT) classifier achieved an ACC of 87.10%. It exhibited a slightly lower SNS (84.04%) and higher SPC (89.56%). The PRC (86.58%) and F1 score (85.29%) were relatively good, while the ROC (86.84%) and BAC (86.80%) were also satisfactory. The maximum scaling method was applied, and the DT classifier used entropy as the splitting criterion, a maximum depth of 5, and the “best” splitter. The selected features were X5, X6, X8, X9, and X12. The Histogram-Based Gradient Boosting (HGB) classifier achieved an ACC of 88.12%. It showed good SNS (88.93%) and SPC (87.47%). The PRC (85.05%) was slightly lower, but the F1 score (86.94%) and ROC (88.20%) were relatively high. The BAC was 88.20%, and the WSM performance was 87.56%. HGB used standard scaling, a learning rate of approximately 0.06022, and a maximum depth of 7. The selected features were X1, X2, X3, X5, X8, X9, X12, and X13.
The K-Nearest Neighbors (KNN) classifier achieved an ACC of 88.12%. It demonstrated good SNS (85.99%) and high SPC (89.82%). The PRC (87.13%) and F1 score (86.56%) were relatively good, while the ROC (87.93%) and BAC (87.91%) were also satisfactory. KNN used standard scaling, the KDTree algorithm, the Manhattan metric, 11 neighbors, and uniform weights. The selected features were X5, X8, X9, X10, X12, and X13. The LightGBM (LGBM) classifier achieved an ACC of 87.54%. It showed good SNS (88.93%) and SPC (86.42%). The PRC (84.00%) was slightly lower compared with other classifiers. The F1 score (86.39%) and ROC (87.68%) were relatively high, indicating a balanced performance. The BAC was 87.67%, and the WSM performance was 86.95%. LGBM utilized MinMax scaling, a learning rate of approximately 0.52510, a maximum depth of 1, and 61 estimators. The selected features were X1, X3, X4, X5, X8, X9, X12, and X13.
The Logistic Regression (LR) classifier achieved an ACC of 87.54%. It exhibited good SNS (90.23%) and reasonable SPC (85.38%). The PRC (83.18%) was the lowest among the classifiers, but the F1 score (86.56%) and ROC (87.84%) were relatively high. The BAC was 87.80%, and the WSM performance was 86.93%. LR used standard scaling, a regularization parameter C of approximately 0.05388, and the LBFGS solver. The selected features were X1, X4, X5, X6, X8, X10, X12, and X13. The MultiLayer Perceptron (MLP) classifier achieved an ACC of 87.97%. It demonstrated good SNS (87.62%) and SPC (88.25%). The PRC (85.67%) and F1 score (86.63%) were relatively good, while the ROC (87.94%) and BAC (87.94%) were also satisfactory. The WSM performance was 87.43%. MLP used standard scaling, the ReLU activation function, a hidden layer size of 64, an adaptive learning rate, and the Adam solver. The selected features were X3, X4, X5, X7, X8, X10, X13, and X14.
The Random Forest (RF) classifier achieved the highest ACC of 88.84%. It showed relatively low SNS (84.69%) but the highest SPC (92.17%). The PRC (89.66%) was the highest among the classifiers, and the F1 score (87.10%) and ROC (88.51%) were also high. The BAC was 88.43%, and the WSM performance was 88.48%. RF utilized the L2 scaler, the Gini criterion, a maximum depth of 2, and 90 estimators. The selected features were X4, X5, X8, X9, X10, X11, and X13. The XGBoost (XGB) classifier achieved an ACC of 87.83%. It exhibited good SNS (85.67%) and SPC (89.56%). The PRC (86.80%) and F1 score (86.23%) were relatively good, while the ROC (87.63%) and BAC (87.61%) were also satisfactory. XGB used the L2 scaler, a learning rate of approximately 0.21922, a maximum depth of 23, 31 estimators, and a subsample ratio of approximately 0.81806. The selected features were X1, X4, X7, X8, X9, X10, X11, X13, and X14.
Based on the evaluation metrics, the best classifier varies depending on the metric considered. For overall ACC, the RF classifier achieved the highest ACC of 88.84%. For SNS, the LR classifier performed the best with 90.23%. For SPC, the RF classifier achieved the highest value of 92.17%. The RF classifier also obtained the highest PRC (89.66%). The F1 score, which considers both PRC and recall, was the highest for the RF classifier at 87.10%. The ROC was the highest for the RF classifier at 88.51%. Finally, the BAC was also the highest for the RF classifier at 88.43%.
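As a rough illustration of how a configuration like the best-performing RF row above might be assembled, the sketch below combines a feature subset, a scaler, and the reported hyperparameters in a scikit-learn pipeline. The data here is synthetic, the column indices are hypothetical stand-ins for the selected features, and the “L2 scaler” is assumed to correspond to scikit-learn's row-wise Normalizer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for the 690-instance, 14-attribute Australian dataset.
X, y = make_classification(n_samples=690, n_features=14, n_informative=5,
                           shuffle=False, random_state=0)

# Hypothetical feature subset standing in for the PSO-selected columns
# (0-based indices corresponding to X4, X5, X8, X9, X10, X11, and X13).
subset = [3, 4, 7, 8, 9, 10, 12]
X = X[:, subset]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# "L2 scaler" is assumed here to mean row-wise L2 normalization.
model = make_pipeline(
    Normalizer(norm="l2"),
    RandomForestClassifier(n_estimators=90, max_depth=2,
                           criterion="gini", random_state=0),
)
model.fit(X_tr, y_tr)
print("hold-out accuracy:", model.score(X_te, y_te))
```

Wrapping the scaler and classifier in a single pipeline ensures that the normalization fitted on the training split is reused unchanged on the test split.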
5.2. Performance Report Using “South German Credit” Dataset
Table 4 provides a performance report for various classifiers using the “South German Credit” dataset. The table includes several evaluation metrics, such as ACC, SNS, SPC, PRC, F1, ROC, BAC, and WSM performance.
Among the classifiers, AdaBoost achieved an ACC of 75.60%, SNS of 84.86%, SPC of 54.00%, PRC of 81.15%, F1 score of 82.96%, ROC of 71.12%, BAC of 69.43%, and WSM performance of 74.16%. It utilized the MinMax scaler, and the selected features were laufkont, laufzeit, moral, verw, hoehe, sparkont, rate, famges, buerge, wohnzeit, wohn, and gastarb. The DT classifier achieved an ACC of 70.70%, SNS of 73.86%, SPC of 63.33%, PRC of 82.46%, F1 score of 77.92%, ROC of 68.80%, BAC of 68.60%, and WSM performance of 72.24%. It utilized the L2 scaler, and the selected features were laufkont, laufzeit, moral, sparkont, verm, wohn, and gastarb.
The HGB classifier achieved an ACC of 77.60%, SNS of 87.86%, SPC of 53.67%, PRC of 81.56%, F1 score of 84.59%, ROC of 72.80%, BAC of 70.76%, and WSM performance of 75.55%. It utilized the STD scaler, and the selected features were laufkont, laufzeit, moral, verw, hoehe, sparkont, beszeit, buerge, verm, weitkred, pers, and telef. The KNN classifier achieved an ACC of 73.00%, SNS of 77.86%, SPC of 61.67%, PRC of 82.58%, F1 score of 80.15%, ROC of 70.23%, BAC of 69.76%, and WSM performance of 73.61%. It utilized the STD scaler, and the selected features were laufkont, laufzeit, moral, hoehe, buerge, pers, telef, and gastarb. The LGBM classifier achieved an ACC of 76.60%, SNS of 87.29%, SPC of 51.67%, PRC of 80.82%, F1 score of 83.93%, ROC of 71.72%, BAC of 69.48%, and WSM performance of 74.50%. It utilized the MinMax scaler, and the selected features were laufkont, laufzeit, moral, hoehe, sparkont, beszeit, buerge, verm, alter, wohn, pers, and telef.
The LR classifier achieved an ACC of 75.70%, SNS of 88.29%, SPC of 46.33%, PRC of 79.33%, F1 score of 83.57%, ROC of 70.50%, BAC of 67.31%, and WSM performance of 73.00%. It utilized the MinMax scaler, and the selected features were laufkont, laufzeit, moral, hoehe, sparkont, beszeit, verm, alter, and gastarb. The MLP classifier achieved an ACC of 77.80%, SNS of 88.00%, SPC of 54.00%, PRC of 81.70%, F1 score of 84.73%, ROC of 73.01%, BAC of 71.00%, and WSM performance of 75.75%. It utilized the STD scaler, and the selected features were laufkont, laufzeit, moral, hoehe, beszeit, buerge, verm, pers, and telef.
The RF classifier achieved an ACC of 76.80%, SNS of 89.71%, SPC of 46.67%, PRC of 79.70%, F1 score of 84.41%, ROC of 71.51%, BAC of 68.19%, and WSM performance of 73.85%. It utilized the MaxAbs scaler, and the selected features were laufkont, laufzeit, moral, hoehe, sparkont, rate, buerge, verm, alter, bishkred, and telef. The XGB classifier achieved an ACC of 75.90%, SNS of 87.71%, SPC of 48.33%, PRC of 79.84%, F1 score of 83.59%, ROC of 70.82%, BAC of 68.02%, and WSM performance of 73.46%. It utilized the STD scaler, and the selected features were laufkont, laufzeit, moral, hoehe, sparkont, beszeit, buerge, verm, alter, beruf, pers, telef, and gastarb.
Based on the evaluation metrics, the best classifier varies depending on the metric considered. For ACC, the best classifier is MLP with 77.80%. For SNS, the best classifier is RF with 89.71%. For SPC, the best classifier is DT with 63.33%. For PRC, the best classifier is KNN with 82.58%. For the F1 score, the best classifier is MLP with 84.73%. For the ROC, the best classifier is MLP with 73.01%. For balanced ACC, the best classifier is MLP with 71.00%. Lastly, for WSM performance, the best classifier is MLP with 75.75%.
5.3. Performance Report Using “Statlog (German Credit Data)” Dataset
Table 5 provides a performance report for various classifiers using the “Statlog (German Credit Data)” dataset. The table includes several evaluation metrics, such as ACC, SNS, SPC, PRC, F1, ROC, BAC, and WSM performance.
Among the classifiers, AdaBoost achieved an ACC of 77.60%, SNS of 51.67%, SPC of 88.71%, PRC of 66.24%, F1 score of 58.05%, ROC of 72.59%, BAC of 70.19%, and a WSM performance of 69.29%. It used the maximum scaling method, and the hyperparameters were LR: ≈0.85523 and Est. #: 80. The selected features for AdaBoost were C1, C2, C3, C4, C5, C7, C8, C12, C16, C17, C18, C19, and C20. The DT classifier achieved an ACC of 75.70%, SNS of 52.33%, SPC of 85.71%, PRC of 61.09%, F1 score of 56.37%, ROC of 71.01%, BAC of 69.02%, and a WSM performance of 67.32%. It used the maximum scaling method, and the hyperparameters were Criterion: Gini, Max Depth: 6, and Splitter: Best. The selected features for DT were C1, C2, C3, C4, C6, C7, C11, C14, C15, C17, and C19.
The HGB classifier achieved an ACC of 78.30%, SNS of 53.33%, SPC of 89.00%, PRC of 67.51%, F1 score of 59.59%, ROC of 73.37%, BAC of 71.17%, and a WSM performance of 70.32%. It used the standard scaling method, and the hyperparameters were LR: ≈0.08319 and Max Depth: 7. The selected features for HGB were C1, C2, C3, C5, C6, C10, C11, C13, C14, C15, C17, and C19. The KNN classifier achieved an ACC of 77.60%, SNS of 47.67%, SPC of 90.43%, PRC of 68.10%, F1 score of 56.08%, ROC of 72.28%, BAC of 69.05%, and a WSM performance of 68.74%. It used the maximum scaling method, and the hyperparameters were Algo.: KDTree, Metric: Euclidean, Neighbors #: 7, and Weights: Distance. The selected features for KNN were C1, C2, C3, C6, C7, C8, C10, C11, C14, C15, C16, C18, and C20.
The LGBM classifier achieved an ACC of 78.30%, SNS of 55.00%, SPC of 88.29%, PRC of 66.80%, F1 score of 60.33%, ROC of 73.55%, BAC of 71.64%, and a WSM performance of 70.56%. It used the standard scaling method, and the hyperparameters were LR: ≈0.15232, Max Depth: 9, and Est. #: 81. The selected features for LGBM were C1, C2, C3, C5, C6, C7, C9, C10, C11, C12, C13, C17, and C20. The LR classifier achieved an ACC of 77.10%, SNS of 49.33%, SPC of 89.00%, PRC of 65.78%, F1 score of 56.38%, ROC of 71.95%, BAC of 69.17%, and a WSM performance of 68.39%. It used the standard scaling method, and the hyperparameters were C: ≈1.55321 and Solver: LibLinear. The selected features for LR were C1, C2, C3, C5, C6, C7, C8, C9, C10, C12, C13, C14, C15, C17, C19, and C20.
The MLP classifier achieved an ACC of 77.60%, SNS of 54.00%, SPC of 87.71%, PRC of 65.32%, F1 score of 59.12%, ROC of 72.83%, BAC of 70.86%, and a WSM performance of 69.64%. It used the MinMax scaling method, and the hyperparameters were Activation: ReLU, Hidden Layers #: 176, LR: Inv. Scaling, and Solver: Adam. The selected features for MLP were C1, C2, C3, C4, C6, C7, C8, C9, C10, C13, C15, C16, C17, C18, C19, and C20. The RF classifier achieved an ACC of 78.70%, SNS of 48.67%, SPC of 91.57%, PRC of 71.22%, F1 score of 57.82%, ROC of 73.33%, BAC of 70.12%, and a WSM performance of 70.20%. It used the standard scaling method, and the hyperparameters were Criterion: Entropy, Max Depth: 11, and Est. #: 94. The selected features for RF were C1, C2, C3, C5, C6, C7, C8, C10, C12, C13, C15, and C16.
The XGB classifier achieved an ACC of 78.50%, SNS of 51.33%, SPC of 90.14%, PRC of 69.06%, F1 score of 58.89%, ROC of 73.35%, BAC of 70.74%, and a WSM performance of 70.29%. It used the standard scaling method, and the hyperparameters LR: ≈0.10056, Max Depth: 13, Est. #: 67, and Subsample: ≈0.34291. The selected features for XGB were C1, C2, C3, C5, C6, C9, C10, C11, C12, C13, C14, C15, C17, C19, and C20. The best classifier varies depending on the metric considered: RF attains the best ACC (78.70%), SPC (91.57%), and PRC (71.22%), while LGBM attains the best SNS (55.00%), F1 (60.33%), ROC (73.55%), BAC (71.64%), and WSM performance (70.56%).
5.4. Overall Discussion and Explainability
Table 6 presents the performance of the best classifiers on the three datasets after 100 trials, reported as mean values with their corresponding standard deviations. For example, the mean accuracy on the Australian Credit dataset is 87.57, with a standard deviation of 0.45; the remaining metrics are reported in the same mean and standard deviation form for each dataset.
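The mean and standard deviation figures in Table 6 come from repeating the train/evaluate cycle many times. A minimal sketch of that protocol, using a synthetic placeholder dataset and classifier rather than the paper's actual pipelines:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a credit dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

accs = []
for trial in range(100):  # 100 trials, as in Table 6
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=trial)
    clf = RandomForestClassifier(n_estimators=50, random_state=trial)
    accs.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

accs = np.asarray(accs)
print(f"ACC = {100 * accs.mean():.2f} ± {100 * accs.std(ddof=1):.2f}")
```

Varying the random seed per trial changes both the data split and the model initialization, so the standard deviation reflects the stability of the classifier rather than a single lucky split.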
For the “Statlog (Australian Credit Approval)” dataset, the features were X4, X5, X8, X9, X10, X11, X13, utilizing the L2 scaler and RF classifier (with 90 estimators, max depth of 2, and Gini criterion). For the “South German Credit” dataset, the features were laufkont, laufzeit, moral, hoehe, beszeit, buerge, verm, pers, telef, utilizing the STD scaler and MLP classifier (with ReLU activation, Adam optimizer, 272 hidden layers, and constant learning rate). For the “Statlog (German Credit Data)” dataset, the features were C1, C2, C3, C5, C6, C7, C9, C10, C11, C12, C13, C17, and C20, utilizing the STD scaler and LGBM classifier (with 81 estimators, max depth of 9, and learning rate of ≈0.15232).
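Assuming the L2 scaler refers to per-sample L2 normalization (scikit-learn's Normalizer), the Australian Credit configuration above can be sketched as the following pipeline; the data here is a synthetic placeholder for the seven selected features (X4, X5, X8, X9, X10, X11, X13):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Placeholder for the 7 selected Australian Credit features.
X, y = make_classification(n_samples=690, n_features=7, n_informative=5,
                           random_state=0)

# L2 scaler + RF with 90 estimators, max depth of 2, and Gini criterion.
aus_model = make_pipeline(
    Normalizer(norm="l2"),
    RandomForestClassifier(n_estimators=90, max_depth=2,
                           criterion="gini", random_state=0),
)
aus_model.fit(X, y)
print(f"Training accuracy: {aus_model.score(X, y):.3f}")
```

The other two configurations follow the same pattern, swapping in a StandardScaler with the reported MLP or LGBM hyperparameters.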
In the “Statlog (Australian Credit Approval)” dataset, Figure 2 provides the LIME explanation for the model’s positive decision (i.e., Yes), made with 97% confidence, on a testing instance. The figure shows that this decision was driven primarily by the value of X8, which was 1 (within the range of −1 to 1).
In the “South German Credit” dataset, Figure 3 provides the LIME explanation for the model’s positive decision (i.e., Yes), made with 75% confidence, on a testing instance. The figure shows that this decision was driven primarily by the values of laufkont, which was 4 (within the range of 2 to 4), and buerge, which was 1 (less than or equal to 1).
In the “Statlog (German Credit Data)” dataset, Figure 4 provides the LIME explanation for the model’s negative decision (i.e., No), made with 96% confidence, on a testing instance. The figure shows that this decision was driven primarily by the value of C1, which was 3 (within the range of 1 to 3).
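LIME's underlying idea is to fit a simple, interpretable surrogate around the instance being explained: perturb the instance, query the black-box model, and fit a proximity-weighted linear model whose coefficients serve as local feature weights. A minimal sketch of that idea using scikit-learn directly (the data, model, kernel width, and sample count are illustrative placeholders; the lime package itself produces the visualizations shown in the figures):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Black-box model on placeholder data (stand-in for the credit classifiers).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def explain_instance(x, predict_proba, n_samples=2000, width=1.0):
    """LIME-style local surrogate: perturb x, weight perturbations by
    proximity, and fit a weighted linear model to the black-box output."""
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))
    p = predict_proba(Z)[:, 1]                   # P(class = Yes)
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)           # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_                       # local feature weights

weights = explain_instance(X[0], black_box.predict_proba)
top = int(np.argmax(np.abs(weights)))
print(f"Most influential feature locally: feature {top} ({weights[top]:+.3f})")
```

The sign of each weight indicates whether increasing that feature pushes the local prediction toward Yes or No, which is exactly how the bar charts in Figures 2–4 should be read.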
5.5. Related Studies Comparison
Table 7 compares the suggested approach with related studies that used the same datasets. As the literature shows, a variety of research works address credit scoring. For “Statlog (Australian Credit Approval)” [
21], the current study achieved an accuracy of 88.84%. This result is competitive with the highest accuracy reported for this dataset, which was 91.91% by Kazemi et al. [
15]. In the case of the “Statlog (German Credit Data)” [
22], the current study achieved an accuracy of 78.30%. While this accuracy is an improvement over some earlier studies, it falls short of the highest accuracy reported for this dataset, which was 88.89% by Gicić et al. [
14]. For “South German Credit” [
23], this study achieved an accuracy of 77.80%. In comparison, Khan and Ghosh [
17] reported a slightly higher accuracy of 80.50%. In summary, the current work demonstrates consistent and competitive performance across all three datasets. The credit scoring model developed in this study appears effective at predicting credit risk across varied datasets, with accuracy values ranging from 77.80% to 88.84%.
While some related works surpass our results on individual datasets, our approach performs better on others. For instance, despite Kazemi et al. [
15] achieving an impressive 91.91% accuracy in their study on the “Statlog (Australian Credit Approval)” [
21] dataset, our model achieves an accuracy of 78.30% on the “Statlog (German Credit Data)” dataset, surpassing Kazemi et al.’s result on that dataset (accuracy: 67.49%). This underscores the adaptability and competitiveness of our credit scoring model across diverse datasets. It is worth noting that the relative performance depends on the specific dataset and the benchmarks set by each work.