1. Introduction
Credit card fraud remains a persistent challenge for financial institutions, requiring advanced machine learning (ML) techniques to detect fraudulent transactions. Credit card fraud involves the unauthorized use of payment cards to conduct illegal monetary transactions, often aimed at obtaining goods, services, or transferring money without the cardholder’s permission [
1]. As online transactions continue to rise, fraudsters are constantly developing more sophisticated attack strategies, making fraud detection a critical area of research. According to Juniper Research, global losses due to online payment fraud are expected to reach USD 91 billion in 2028, demonstrating the urgent need for robust and scalable fraud detection systems [
2]. Traditional rule-based fraud detection systems struggle to adapt to evolving fraud patterns and often generate high false positive rates, causing financial and operational inefficiencies. In response to these challenges, ML techniques, with their ability to analyze large datasets and identify subtle patterns, are well suited for addressing the limitations of traditional fraud detection systems [
3]. ML models can detect anomalies in transaction data that may signify fraudulent activity and offer the ability to continuously learn and adapt to new fraudulent behaviors [
4]. This capability makes them an invaluable tool for real-time fraud detection and prevention, as they can dynamically adapt to emerging threats and evolving fraud patterns. Popular approaches include artificial neural networks (ANNs), logistic regression (LR), decision trees (DT), and support vector machines (SVMs), which are well suited for identifying patterns in large datasets.
Despite these advances, several challenges remain. One of the most significant is the extreme class imbalance in fraud datasets, where fraudulent transactions constitute only a tiny fraction of all transactions. This imbalance can degrade model performance, resulting in a trade-off between overall accuracy and the ability to detect fraudulent cases. Various strategies, such as oversampling, undersampling, and synthetic data generation, have been employed to mitigate this issue [
5,
6]. However, these methods often introduce biases or fail to generalize effectively in real-world applications.
Another critical issue in fraud detection is the lack of interpretability in predictive models. While deep learning techniques have demonstrated high accuracy in various domains, they are often criticized as “black-box” models, making it difficult for financial institutions and regulators to trust their decision-making process. Explainable AI approaches, such as Shapley additive explanations (SHAP), have gained attention for providing transparency in model predictions by explaining which features contribute most to fraud classification [
7,
8].
Several recent studies have explored ensemble learning methods to enhance fraud detection performance [
9,
10,
11]. However, most of these studies focus on balanced datasets, which may not reflect real-world fraud detection scenarios where class imbalance is inherent.
To address these challenges, our work proposes FraudX AI, an ensemble-based fraud detection framework designed to enhance performance on highly imbalanced datasets while maintaining interpretability. Unlike existing ensemble approaches that rely on oversampling techniques, FraudX AI retains the original class distribution and mitigates class imbalance by optimizing model thresholds and adjusting class weights. This approach ensures that the model’s predictions remain realistic in real-world banking scenarios, where fraudulent transactions are rare but highly impactful.
Our goal was to achieve higher recall compared to previous studies. FraudX AI accomplishes this by combining random forest (RF) and eXtreme Gradient Boosting (XGBoost), optimizing their predictions through threshold tuning to balance recall and precision. By prioritizing the area under the precision–recall curve (AUC-PR) and recall metrics over traditional accuracy-based metrics, this approach enables a more reliable evaluation of fraud detection models, as accuracy alone can be misleading in highly imbalanced datasets. Additionally, this study integrates SHAP to enhance model interpretability, demonstrating how SHAP values provide meaningful insights despite feature anonymization while ensuring transparency for financial analysts and regulatory compliance.
The key contributions of this research are as follows:
Proposing a novel fraud detection framework (FraudX AI) that enhances recall while minimizing false positives without relying on artificial data balancing techniques, ensuring real-world applicability.
Introducing an optimized threshold-tuning strategy to balance recall and precision, improving fraud detection effectiveness in highly imbalanced datasets.
Incorporating SHAP for interpretable fraud detection, providing actionable insights for financial analysts and regulatory compliance, even when working with PCA-transformed anonymized features.
Conducting an extensive comparative evaluation against multiple ML baselines and recent fraud detection studies, demonstrating FraudX AI’s superior performance in recall and AUC-PR.
Discussing the generalization capability of FraudX AI and its applicability to different datasets, addressing the challenges of fraud detection across diverse financial institutions.
The remainder of this paper is structured as follows:
Section 2 reviews related work on credit card fraud detection, focusing on existing methodologies and the challenges of imbalanced datasets.
Section 3 presents the proposed framework and dataset, beginning with an overview of the dataset used for evaluation, followed by a detailed discussion of the FraudX AI framework and its implementation.
Section 4 describes the experimental setup and results, outlining the implementation strategy, model training process, evaluation metrics, performance analysis, and SHAP-based interpretability assessment.
Section 5 provides a detailed discussion of the findings, interprets the results in the context of previous studies and their applicability to real-world settings, and outlines the main limitations and challenges. Finally,
Section 6 summarizes the study’s contributions and suggests potential directions for future research.
2. Related Work
Fraud detection in credit card transactions has become a critical area of research due to its substantial financial implications and the increasing sophistication of fraudulent activities. The specific challenges associated with fraud detection, particularly the highly imbalanced nature of datasets and the continuously evolving fraud patterns, have led to the exploration of various ML and deep learning (DL) techniques [
12,
13,
14].
A study [
15] compared LR, DT, SVM, and XGBoost for fraud detection, finding XGBoost to be the best-performing model, with an accuracy of 99.88% and an AUC-ROC score of 1.0. The study demonstrated XGBoost’s effectiveness in handling imbalanced datasets, with the synthetic minority oversampling technique (SMOTE) applied to further enhance model performance, although concerns regarding potential data skewing remained. Explainable AI tools such as SHAP and LIME enhanced XGBoost’s interpretability, highlighting key fraud-related features such as transaction type and old balance. While LR and DT models attained respectable accuracies of 98.99% and 98.96%, respectively, they lacked the precision and scalability required for large-scale fraud detection. Conversely, SVM exhibited poor performance, with a recall value of only 0.36, primarily due to its reliance on linear kernels, which are inadequate for capturing complex fraud patterns. The study underscores the importance of AUC-ROC and macro-averaged metrics for evaluating fraud detection systems while acknowledging the challenges of balancing model interpretability, computational efficiency, and scalability in real-world applications.
The authors of [
16] proposed a hybrid fraud detection framework that integrated process mining and SVM, using multi-perspective features such as control flow, resource flow, and user behavior to enhance accuracy. Tools like ProM and Colored Petri nets facilitated real-time anomaly detection in e-commerce transactions, achieving a higher F1-score and AUC metrics than single-perspective models. However, the approach faced challenges, including computational overhead, scalability constraints, and difficulties adapting to evolving fraud tactics. While the model offers interpretability through process mining, its complexity may hinder practical deployment in real-world settings.
Another study [
17] evaluated five ML models—ANN, SVM, RF, DT, and naive Bayes (NB)—for credit card fraud detection, with ANN achieving the highest accuracy (97.6%), followed by SVM (95.5%) and RF (94.5%). Preprocessing techniques, including data cleaning, feature scaling, and temporal analysis, were essential in improving model performance and addressing class imbalance. Feature engineering, such as incorporating transaction frequency and time-based metrics, further enhanced the discriminatory power of all models. While ANN demonstrated superior accuracy and an ability to model complex transaction patterns through backpropagation, its high computational demands were identified as a limitation. SVM also exhibited strong performance but struggled with scalability due to its computationally intensive kernel operations. DT and NB models, on the other hand, showed higher rates of false positives and false negatives, restricting their applicability in real-world fraud detection. The study underscores the necessity of balancing accuracy, recall, and precision to optimize fraud detection performance in practical applications.
Hashemi et al. [
18] evaluated the performance of advanced ML models, including LightGBM, XGBoost, and CatBoost, for fraud detection. When optimized using hyperparameter tuning techniques such as Bayesian optimization, these models achieved superior performance metrics, with AUC-ROC reaching 0.95 and recall attaining 0.80. Additionally, a majority voting ensemble approach combining CatBoost, XGBoost, and LightGBM demonstrated improved results on imbalanced datasets, achieving a recall value of 0.82 and an F1-score of 0.81. Despite the success of the majority voting ensemble, the study highlighted the computational overhead associated with this approach, emphasizing the need for more efficient ensemble methods for real-world applications.
Mosa et al. [
19] investigated 15 meta-heuristic optimization (MHO) algorithms for feature selection in fraud detection, identifying the sailfish optimizer (SFO) in combination with RF as particularly effective, achieving 97% accuracy while reducing the feature set size by 90%. Advanced transfer functions further improved feature selection accuracy, while undersampling effectively addressed class imbalance. Computationally efficient methods, such as particle swarm optimization (PSO) and gray wolf optimization (GWO), reduced feature dimensionality without compromising model performance. However, the study also identified key challenges, including the computational complexity of MHO techniques, the scalability limitations for large datasets, and the risks of overfitting due to overly aggressive feature selection.
The study conducted by Ming et al. [
20] proposed a hybrid fraud detection framework that combines a 12-layer CNN for deep feature extraction with classifiers such as SVM, KNN, and NB. The CNN-ensemble model demonstrated high performance, achieving an AUC-ROC of 100% for credit card fraud detection. However, the reliance on balanced datasets limits the applicability of these results to real-time scenarios. While ensemble methods enhanced classification performance, several limitations remained, including CNN’s reduced effectiveness on smaller datasets, high computational demands, and challenges adapting to real-world, imbalanced data. Key evaluation metrics, such as AUC-ROC, precision, and recall, underscore the model’s theoretical advantages but do not fully account for practical deployment constraints.
Khalid et al. [
21] assessed an ensemble model incorporating SVM, KNN, RF, bagging, and boosting classifiers for credit card fraud detection, reporting high accuracy and F1-score. While RF and boosting exhibited strong performance, the training and testing times for bagging and boosting models were significantly longer, indicating inefficiencies in computational resources compared to simpler models like LR or KNN. Moreover, SMOTE-generated balanced datasets produced results that do not accurately reflect real-world fraud detection scenarios characterized by inherent class imbalance. Although SMOTE mitigates class imbalance and enhances performance, it introduces synthetic data artifacts that may limit generalizability in operational environments. The computational inefficiencies and reliance on artificially balanced datasets highlight the challenges of deploying these models in real-time, large-scale fraud detection systems.
Another ensemble-based study [
22] similarly employed SMOTE to mitigate class imbalance, proposing a weighted-average ensemble model that combined RF, KNN, bagging, adaptive boosting (AdaBoost), and LR. This approach achieved an impressive accuracy of 99.5%, while demonstrating improvements in precision, recall, and F1-score compared to individual classifiers. However, like other SMOTE-based studies, its dependence on synthetic data raises concerns regarding applicability in real-world, highly imbalanced fraud detection scenarios. Among the individual classifiers, RF and bagging exhibited the best performance, with accuracies of 98%, followed by AdaBoost and LR at 97% and KNN at 95%. The ensemble model utilized a weighted strategy to optimize predictions and reduce misclassifications. However, scalability and computational efficiency challenges persist, particularly in real-time fraud detection applications. These findings underscore the broader limitations of SMOTE-based ensemble methods in operational environments.
Several recent studies have explored advanced ML and DL techniques for fraud detection. Notable bodies of research by Tian et al. [
23] and Xie et al. [
24,
25,
26] have contributed significantly to developing graph-based and sequence-based fraud detection models. Graph neural networks (GNNs) have gained popularity in transactional fraud detection due to their ability to capture transaction relationships. In [
23], ASA-GNN was introduced as an adaptive sampling and aggregation approach to improve fraud classification performance by filtering noisy nodes and capturing behavioral patterns through graph-based modeling.
Xie et al. [
24] have developed multiple DL-based fraud detection models, emphasizing behavioral pattern extraction from sequential transaction data. The authors introduced the time-aware interaction LSTM (TAI-LSTM) model, designed to learn a behavioral representation for each transaction record within a user’s history. This approach enhances fraud detection by learning both short-term and long-term transactional habits. Similarly, their spatial–temporal gated network (STGN) introduces spatio-temporal attention mechanisms to model credit card transactions dynamically, capturing the geographical and temporal distribution of fraudulent transactions [
25]. By incorporating spatial and temporal features, STGN improves fraud detection by identifying the transaction location and timing inconsistencies. The authors also expanded their research in [
26] by introducing the time-aware historical-attention-based LSTM (TH-LSTM), a model designed to capture transactional behavioral representations by leveraging time-aware LSTM networks. This approach effectively extracts long-term dependencies in transaction sequences, enabling the model to identify subtle spending patterns and detect fraudulent activities more accurately.
Table 1 presents a comparative analysis of recent studies, examining their methodologies, datasets, results, and limitations. This analysis highlights advancements in the field and persistent challenges, including scalability, reliance on synthetic data, and computational efficiency in real-world applications.
While the reviewed studies demonstrate significant advancements in credit card fraud detection through ML, DL, and ensemble methods, several critical challenges remain. The reliance on SMOTE and balanced datasets, although effective in improving performance metrics, limits the generalizability of these models to real-world scenarios where data are highly imbalanced. Moreover, computational inefficiencies and scalability issues pose obstacles to deploying these methods in real-time, high-volume environments. Additionally, many models’ “black-box” nature raises concerns regarding interpretability.
To address these challenges, our study proposes an alternative approach that eliminates the need for synthetic data augmentation while maintaining high recall and precision. Unlike prior works, we leverage class weight adjustments instead of SMOTE, ensuring that the model learns directly from the original data distribution. Furthermore, during training, we introduce a multi-layer perceptron (MLP) validator, enhancing feature extraction without affecting inference-time predictions. Our method achieves superior recall and AUC-PR scores and enhances model transparency through SHAP-based interpretability, making it well suited for real-world financial applications.
3. Proposed Framework and Dataset
This section introduces the dataset used to evaluate the FraudX AI framework, outlining its characteristics and challenges, and then provides a detailed discussion of the framework’s design and implementation.
3.1. Dataset
We assessed the effectiveness of FraudX AI using the European credit card fraud dataset [
27], widely recognized as a benchmark in fraud detection research. Its accessibility and extensive use in prior studies make it a valuable resource for comparative analysis and situating findings within the broader context of current research in this field. The dataset contains 284,807 transactions recorded over two days, among which only 492 transactions (0.172%) are fraudulent (
Figure 1). This pronounced class imbalance closely reflects real-world conditions, underscoring the inherent challenges of fraud detection.
Each transaction comprises 31 features, including 28 anonymized numerical variables (V1–V28) transformed using Principal Component Analysis (PCA) to preserve confidentiality. The remaining features were retained in their original form, as follows: Time, representing the elapsed time between transactions; Amount, indicating the monetary value of each transaction; and Class, which serves as the target variable, where a value of 0 denotes a genuine transaction and 1 represents fraud.
Due to the severe class imbalance, traditional ML models trained directly on this dataset tend to favor the majority class, resulting in high overall accuracy but poor recall for fraud detection. This study evaluates fraud detection models on the original imbalanced dataset to ensure real-world applicability, so that performance metrics accurately reflect operational challenges.
We divided the dataset into 80% for training and 20% for testing while maintaining the class distribution in both subsets. Additionally, we applied feature scaling to normalize transaction values and improve model convergence.
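As a concrete illustration, the sketch below shows one way this preparation could be implemented; the file name, scaler choice, and random seed are assumptions rather than details taken from the paper.

```python
# Minimal data-preparation sketch (file name, scaler, and seed are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")           # European credit card fraud dataset
X = df.drop(columns=["Class"])
y = df["Class"]                              # 0 = genuine, 1 = fraud (0.172%)

# 80/20 split; stratification preserves the original class imbalance in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Scale Time and Amount; V1-V28 are already PCA components
scaler = StandardScaler()
X_train[["Time", "Amount"]] = scaler.fit_transform(X_train[["Time", "Amount"]])
X_test[["Time", "Amount"]] = scaler.transform(X_test[["Time", "Amount"]])
```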
Given the complexity of fraud detection in highly imbalanced datasets, a specialized ensemble-based framework was required to improve fraud identification while maintaining interpretability. The following subsection introduces FraudX AI, outlining its design, components, and methodological approach.
3.2. Overview of FraudX AI Framework
FraudX AI is an ensemble-based fraud detection framework designed to maximize recall and AUC-PR while ensuring interpretability. The architecture, illustrated in
Figure 2, outlines the workflow from dataset preprocessing to model evaluation and explainability.
As shown in
Figure 2, the proposed framework comprises several key stages, as follows: data preprocessing, model training, ensemble learning, threshold optimization, and SHAP-based interpretability.
The first stage, data preprocessing, involves cleaning, scaling, and preparing raw transaction data for training. Given the highly imbalanced nature of fraud detection datasets, an MLP validator trained on an undersampled dataset is incorporated as a reference model to assess the impact of class balancing. However, this validator does not directly contribute to the final fraud detection decisions.
The core fraud detection pipeline consists of two base models, RF and XGBoost, trained on the original imbalanced dataset to preserve real-world class distributions. These models independently generate probability scores for fraudulent transactions. Their outputs are then combined using probability-weighted averaging, leveraging the complementary strengths of both classifiers to enhance predictive performance.
To refine fraud detection outcomes, threshold optimization is applied to balance precision and recall, ensuring a high fraud detection rate while minimizing false positives. This step is particularly critical in fraud detection systems, where undetected fraudulent transactions can lead to significant financial losses.
Finally, SHAP enhances interpretability by quantifying how individual features contribute to the model’s predictions. This approach improves transparency, enabling stakeholders to understand the reasoning behind fraud classifications, which is essential for regulatory compliance and decision making. However, since PCA was applied to transform the original feature space and ensure data privacy in the European credit card fraud dataset, the actual feature names are unavailable, with all features being represented as anonymized components (V1 to V28). While this may reduce interpretability, we acknowledge that individuals with access to the original feature mappings can derive more precise insights into the model’s decision-making process.
4. Experimental Setup and Results
4.1. Implementation and Training Strategy
The FraudX AI framework consists of two base models (RF and XGBoost) trained on the original imbalanced dataset. In contrast, an MLP validator was trained on an undersampled subset to assess preprocessing effectiveness. Notably, the MLP validator does not directly contribute to ensemble decision making.
We configured the models with standard hyperparameter settings, optimizing them through validation experiments. The RF and XGBoost models were trained using balanced learning strategies, incorporating class weighting to address class imbalance effectively. Specifically, a class weight ratio of {0:1.0, 1:2.0} was applied to assign higher importance to the minority fraud class during training.
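Continuing from the data-preparation sketch in Section 3.1, the snippet below illustrates how the class-weighted base learners might be configured. The tree counts and learning rate are placeholders, and because XGBoost does not accept a class-weight dictionary, scale_pos_weight = 2.0 is used here as an assumed equivalent of the reported {0:1.0, 1:2.0} ratio.

```python
# Class-weighted base learners (hyperparameter values are illustrative)
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1.0, 1: 2.0},   # ratio reported in the text
    random_state=42,
)

# XGBoost has no class_weight dict; scale_pos_weight=2.0 approximates
# the same 1:2 weighting of the minority (fraud) class.
xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    scale_pos_weight=2.0,
    eval_metric="aucpr",
    random_state=42,
)

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
```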
The MLP validator employs a fully connected neural network architecture with a single hidden layer, leveraging gradient-based optimization for training. To maximize model performance, hyperparameters such as the number of trees, learning rates, and activation functions were fine-tuned through empirical analysis.
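A minimal sketch of such a validator is given below, reusing X_train and y_train from the earlier sketch. The random undersampling scheme, hidden-layer width, optimizer, and epoch count are assumptions, since the paper does not report these details.

```python
# Illustrative single-hidden-layer MLP validator trained on an undersampled subset
import numpy as np
from tensorflow import keras

fraud_idx = np.where(y_train.values == 1)[0]
genuine_idx = np.random.choice(np.where(y_train.values == 0)[0],
                               size=len(fraud_idx), replace=False)
sub_idx = np.concatenate([fraud_idx, genuine_idx])

mlp = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),      # single hidden layer
    keras.layers.Dense(1, activation="sigmoid"),    # fraud probability
])
mlp.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[keras.metrics.Recall()])
mlp.fit(X_train.iloc[sub_idx], y_train.iloc[sub_idx],
        epochs=20, batch_size=64, verbose=0)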
FraudX AI utilizes an ensemble learning approach, aggregating RF and XGBoost predictions through a probability-weighted strategy. Model probabilities are combined using an empirically determined weighting scheme, leveraging the strengths of both classifiers.
Since fraud detection requires balancing recall (capturing fraudulent cases) and precision (reducing false positives), adaptive threshold tuning was applied instead of a fixed classification threshold. The optimal threshold was determined through precision–recall curve analysis, ensuring a high fraud detection rate while minimizing false alarms.
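The following sketch combines the two steps just described: probability-weighted averaging of the base models and threshold selection from the precision–recall curve. The equal 0.5/0.5 weights and the F1-optimal selection rule are illustrative assumptions (the paper reports an empirically determined weighting and tuning procedure), and in deployment the threshold would be chosen on a validation split rather than the test set.

```python
# Probability-weighted ensemble and precision-recall-based threshold tuning
import numpy as np
from sklearn.metrics import precision_recall_curve

w_rf, w_xgb = 0.5, 0.5                     # placeholder weights
p_rf = rf.predict_proba(X_test)[:, 1]
p_xgb = xgb.predict_proba(X_test)[:, 1]
p_ensemble = w_rf * p_rf + w_xgb * p_xgb

precision, recall, thresholds = precision_recall_curve(y_test, p_ensemble)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])                  # last precision/recall pair has no threshold
threshold = thresholds[best]

y_pred = (p_ensemble >= threshold).astype(int)
```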
All experiments were conducted on a MacBook Pro equipped with an Apple M1 Pro chip, featuring a multi-core CPU and integrated GPU acceleration, which provided sufficient computational power for data preprocessing, model training, and evaluation. The implementation was carried out in a Jupyter Notebook 7.3.2 environment, facilitating efficient development, testing, and visualization.
We implemented the framework using standard ML and DL libraries, including scikit-learn, XGBoost, TensorFlow/Keras, pandas, NumPy, and visualization tools like Matplotlib 3.9.2 and Seaborn 0.13.2. Training efficiency was optimized to ensure model convergence within a reasonable computational timeframe.
4.2. Evaluation Metrics
Selecting the appropriate metric is essential for evaluating the performance of ML models. Since fraud is a rare event, evaluation metrics must accurately reflect a model’s ability to handle class imbalance. Many studies emphasize accuracy as a primary metric; however, in highly imbalanced datasets where most transactions are genuine, accuracy can be misleading, as a model that classifies all transactions as non-fraudulent may achieve high accuracy while failing to detect any actual fraud, rendering it ineffective in real-world applications [
28,
29]. To address this challenge, more reliable evaluation metrics, such as recall (sensitivity) and AUC-PR, are preferred [
30]. Recall measures the system’s ability to identify fraudulent transactions, which is critical for minimizing the risk of undetected fraud. In contrast to traditional AUC-ROC, AUC-PR focuses on the minority class, making it more suitable for fraud detection, as it better reflects performance in imbalanced settings [
31].
Key performance metrics, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are used to assess the model’s effectiveness. These values are the foundation for computing accuracy, precision, recall, F1-score, and AUC-PR. The following equations define these metrics mathematically, comprehensively evaluating fraud detection models.
Accuracy is the proportion of correctly predicted cases (both positive and negative) out of the total number of cases [
32]. A high accuracy indicates that the model correctly predicts most cases; however, it can be misleading in imbalanced datasets. For instance, in a dataset where 99% of transactions are genuine (non-fraudulent), a model predicting all transactions as genuine will have 99% accuracy despite being ineffective at detecting fraud. Equations (1)–(4) present the mathematical formulations for accuracy, precision, recall, and F1-score.
Precision is a performance metric that quantifies the proportion of correctly predicted positive cases (fraud) out of all cases classified as positive by the model [
33]. It measures the model’s ability to avoid false positives (false alarms). In fraud detection, a model with low precision may misclassify many legitimate transactions as fraudulent, leading to unnecessary inconvenience for users. High precision reduces operational costs and maintains user trust by minimizing false alarms.
Recall, also called sensitivity or true positive rate (TPR), evaluates a model’s ability to correctly identify all actual positive cases in a dataset [
34]. In fraud detection, recall indicates how effectively the model identifies fraudulent transactions among all actual fraud cases. A high recall value signifies that the model successfully detects most fraudulent transactions, minimizing false negatives. Conversely, a low recall value suggests that the model fails to detect a significant number of fraud cases, which can have severe financial and reputational consequences.
The F1-score is a performance metric that balances precision and recall [
35]. It is particularly useful for evaluating models in scenarios with class imbalance.
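For reference, the formulations referred to above as Equations (1)–(4) are reproduced here from the standard definitions of these metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)  (1)
Precision = TP / (TP + FP)  (2)
Recall = TP / (TP + FN)  (3)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)  (4)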
Since our framework incorporates a threshold optimization step to balance recall and precision, evaluating performance using the AUC-PR metric was essential. AUC-PR measures a model’s ability to maintain an optimal trade-off between precision and recall across all classification thresholds. It is a single scalar value that summarizes the precision–recall (PR) curve, and it is computed as the precision weighted by the probability that the output score Y falls below a given threshold c [
36]. Mathematically, it is defined as follows:
AUC-PR = ∫ Prec(c) dP(Y ≤ c),
where Prec(c) is the precision at threshold c and P(Y ≤ c) is the probability distribution of the output scores.
The AUC-PR value ranges between 0 and 1, reflecting the model’s ability to balance precision and recall across varying thresholds. In imbalanced datasets, AUC-PR is particularly advantageous because it avoids the pitfalls of misleading accuracy metrics and provides a more robust measure of model performance for rare events such as fraud detection [
37].
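In practice, AUC-PR can be estimated with scikit-learn’s average precision score, a step-wise approximation of the integral above; the sketch below reuses the ensemble scores from the earlier sketch.

```python
# Average precision as a practical estimate of AUC-PR
from sklearn.metrics import average_precision_score

auc_pr = average_precision_score(y_test, p_ensemble)
print(f"AUC-PR: {auc_pr:.3f}")
```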
4.3. Results and Analysis
To evaluate the performance of FraudX AI, we compared it with the following eight baseline ML models: LR, DT, RF, gradient boosting (GB), XGBoost, LightGBM, NB, and MLP. All models were trained and tested on the imbalanced dataset, adhering to the previously defined data split. The performance was assessed using the evaluation metrics outlined earlier, as presented in
Table 2 and
Table 3.
While
Table 2 and
Table 3 provide a detailed numerical comparison of the performance metrics for all tested models,
Figure 3 visualizes the results specifically in terms of recall and AUC-PR, highlighting the models’ ability to identify fraudulent transactions (recall) and their effectiveness in distinguishing between fraudulent and legitimate transactions (AUC-PR). This graphical representation clearly illustrates the differences in performance, offering a more intuitive understanding of how FraudX AI surpasses the baseline models.
To further validate the effectiveness of FraudX AI compared to baseline models, we evaluated our proposed framework against previous studies that used the same unbalanced credit card fraud dataset [
27].
Table 4 provides a comparative analysis, demonstrating how our framework outperforms prior research in key evaluation metrics, such as achieving a balance between high recall, high precision, and AUC-PR. This comparison highlights the robustness of our framework relative to traditional individual-based models and state-of-the-art methods, ensuring its competitiveness and applicability in real-world scenarios.
For instance, the TAI-LSTM and TH-LSTM models demonstrate impressive capabilities, with a remarkable recall of 99%. However, these models also show precision scores of 51% and 50%, respectively, resulting in an F1-score of 67% for both. While these results highlight their ability to detect a high proportion of fraud cases, they also indicate a trade-off with precision, where a significant number of flagged transactions may be legitimate. Methods such as FMC and the weighted DNN achieved recall values of 91% and 89%, but their precision scores were either lower or inconsistent. Similarly, the CB-CHL-LGBM method achieved a precision score of 74%, but its recall was limited to 84%, indicating challenges in balancing these key metrics. FraudX AI effectively addresses these limitations by achieving well-balanced precision and recall of 100% and 95%, respectively, leading to a higher F1-score (97%) and AUC-PR (97%). This balance between precision and recall is critical for fraud detection systems, as it minimizes false positives while maximizing the identification of fraudulent transactions.
Additionally, FraudX AI outperforms traditional models like ANN and the multiple classifier approach, which, while effective, fail to achieve comparable results across all metrics. By integrating advanced ML techniques with threshold optimization and explainability features, FraudX AI surpasses existing approaches and ensures transparency and reliability, making it a competitive solution for deployment in real-world industry settings.
4.4. SHAP Analysis for FraudX AI Interpretability
Model interpretability is a cornerstone of trust and transparency in credit card fraud detection systems. Integrating SHAP into our proposed framework provides a comprehensive understanding of how various features influence the model’s decisions. This aspect is crucial for industry stakeholders and regulatory compliance, where explainability is as vital as the model’s ability to detect fraudulent transactions effectively. By analyzing SHAP values, we gain insight into feature contributions, ensuring that the decision-making process is interpretable and compliant with financial regulations.
In this study, we utilized SHAP summary plots to visualize the overall impact of features on fraud classification and SHAP-based importance analysis to assess the influence of specific features in fraudulent transactions.
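A sketch of how such plots can be produced with the shap library is shown below; applying TreeExplainer to the XGBoost base model and restricting the second plot to the actual fraud cases are assumptions, since the paper does not state which ensemble member or subset the figures were generated from.

```python
# SHAP analysis sketch (reuses xgb, X_test, and y_test from earlier sketches)
import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# Global feature impact across all test transactions (cf. Figure 4)
shap.summary_plot(shap_values, X_test)

# Feature importance restricted to actual fraudulent transactions (cf. Figure 5)
fraud_mask = (y_test == 1).values
shap.summary_plot(shap_values[fraud_mask], X_test[fraud_mask], plot_type="bar")
```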
Figure 4 shows the SHAP summary plot, ranking features according to their contribution to fraud classification. The results indicate that V4, V14, and V12 significantly impact the model’s decisions, suggesting that these transaction attributes play a key role in distinguishing between fraudulent and genuine transactions. Notably, features such as V4 and V11 exhibit a wide range of SHAP values, implying that both high and low feature values influence the classification outcome. In contrast, V14 and V12 display a more polarized effect, indicating a more substantial directional influence on fraud detection.
While
Figure 4 presents the overall feature importance in FraudX AI,
Figure 5 refines this analysis by illustrating feature importance specifically for fraudulent transactions. This visualization highlights the key features that most strongly influence the classification of fraudulent cases, offering a more targeted approach to fraud detection.
As shown in
Figure 5, V14, V10, and V4 emerge as the most influential features in detecting fraudulent transactions, exhibiting high SHAP values. This indicates that variations in these features significantly impact the model’s decision to classify a transaction as fraudulent. Interestingly, V12, V7, and V3 also play a crucial role, reinforcing that fraud detection relies on multiple interacting factors rather than on a single dominant variable.
The Amount feature, which represents the transaction value, exhibits lower importance than some of the transformed PCA features. This aligns with previous findings showing that fraudsters often manipulate the transaction value to evade detection, making more complex behavioral features crucial for classification. Additionally, the Time feature demonstrates moderate importance, indicating that transaction timing may contribute to anomaly detection.
By comparing
Figure 4 and
Figure 5, we observe that, while the overall feature significance remains relatively consistent, certain variables exert a more decisive impact, specifically in fraudulent cases. This difference underscores the value of SHAP in providing a granular interpretation, enabling fraud analysts to focus on the most critical attributes when investigating potential fraudulent activity.
The SHAP analysis highlights key features influencing the FraudX AI decision-making process, offering a transparent view of fraud classification. By identifying the most influential variables, this interpretability enhances confidence in the model’s predictions and supports financial institutions in making informed risk management decisions. Furthermore, this explainability strengthens the applicability of FraudX AI in real-world fraud detection scenarios, where the accuracy and validity of model decisions are essential.
5. Discussion
The FraudX AI framework demonstrates promising results for credit card fraud detection, achieving the highest recall of 95% on an unbalanced dataset while maintaining an AUC-PR of 97%. These results underscore the effectiveness of the ensemble approach combining RF and XGBoost, optimized through MLP validation, class weighting, threshold tuning, and probability averaging.
A primary challenge in credit card fraud detection is the imbalanced classification problem, where fraudulent transactions constitute a small fraction of all transactions. Models trained on imbalanced datasets often optimize for overall accuracy, leading to misleading results in which nearly all transactions are classified as legitimate and fraud goes largely undetected.
FraudX AI introduces an innovative approach to address these challenges without relying on traditional oversampling techniques like SMOTE. By retaining the original imbalanced dataset and incorporating advanced techniques such as MLP validation during training, class-weighted learning, and threshold tuning, FraudX AI enhances model robustness and ensures accurate fraud detection outcomes while preserving real-world transaction distributions.
Moreover, while SHAP has been utilized in fraud detection, previous studies primarily focused on DL models with domain-specific feature engineering. FraudX AI integrates SHAP with ensemble learning, balancing interpretability and performance. However, the interpretation of features in this study is limited due to the PCA-transformed dataset. Future research will explore datasets with non-anonymized features to better correlate SHAP results with real-world transactional attributes. Despite this limitation, the model’s explainability is not solely based on SHAP, but also on its structured threshold optimization and class-weighted learning approach, which enables financial institutions to interpret predictions without relying on opaque DL architectures.
All comparisons in this study were conducted exclusively with research utilizing the same European credit card fraud dataset and avoiding oversampling or undersampling techniques. This ensures a fair evaluation of FraudX AI’s performance against similar methodologies, maintaining the integrity of the original class distribution and validating the model’s effectiveness in practical fraud detection scenarios.
Future work will expand the evaluation of FraudX AI to include diverse financial datasets, including the one developed in previous work [
42], to validate its adaptability and effectiveness in detecting evolving fraud behaviors across various transaction environments. These efforts aim to enhance the model’s generalizability across different fraud patterns and transaction environments, improving its adaptability and effectiveness in real-world applications.
6. Conclusions
In conclusion, FraudX AI significantly advances fraud detection methodologies for credit card transactions. The framework achieves high recall and precision rates, demonstrating its effectiveness in identifying fraudulent transactions while minimizing false positives.
FraudX AI’s approach to handling imbalanced datasets and integrating advanced methodologies such as class-weighted learning, MLP validation, and SHAP sets it apart from previous fraud detection ensemble models. FraudX AI optimizes fraud detection while preserving model integrity, providing financial institutions with a scalable, interpretable solution for mitigating fraud risks. The methodological adaptability of the framework enhances fraud detection strategies in real-world financial ecosystems.
Future research will prioritize validating FraudX AI’s applicability across diverse datasets and implementing adaptive learning mechanisms to refine fraud detection strategies in response to emerging fraud trends. Additionally, we will investigate extending FraudX AI’s evaluation beyond credit card fraud detection to other domains, such as network anomaly detection, where anomalies are infrequent events. This will help rigorously validate the model’s robustness and effectiveness in identifying various types of anomalies across a wide range of operational environments.
In summary, FraudX AI sets new standards in fraud detection methodologies, promising enhanced recall, interpretability, and adaptability crucial for addressing the evolving challenges of fraud in modern financial environments.