1. Introduction
Credit card fraud is an escalating threat in the digital financial ecosystem, inflicting substantial financial losses worldwide. As digital transactions proliferate, the complexity and frequency of fraudulent activities demand robust, scalable, and real-time detection systems [1,2]. Machine learning (ML) models have become central to fraud detection because they can identify complex patterns and adapt to evolving behaviors. However, the massive scale of transaction data and the class imbalance inherent in fraud detection tasks require distributed frameworks that can handle these computational challenges effectively [3,4].
Machine learning has transformed decision-making across domains by delivering precision, adaptability, and effectiveness. In fraud detection, ML automates the analysis of large and complex datasets, uncovering patterns and relationships that no human analyst could detect [5]. Unlike traditional methods, which rely on empirical rules or rigid algorithms, ML adapts to constantly changing data and delivers faster, more accurate decisions. This adaptability is essential in fraud detection because fraudulent behaviors frequently change to evade detection systems [6,7].
Despite this potential, the application of ML in fraud detection faces serious challenges. The most prominent is the highly imbalanced nature of fraud-related data, since fraudulent transactions constitute a tiny percentage of all transactions [8,9]. In addition, the volume of financial data requires a framework capable of efficient large-scale processing. Distributed computing platforms such as PySpark address these demands by enabling scalable, parallelized data processing, reducing latency, and enhancing system performance [10,11,12].
Credit card fraud evolves alongside the growing capabilities of digital transactions and therefore demands increasingly sophisticated countermeasures [13,14]. Modern ML models such as XGBoost and CatBoost, combined with PySpark as a distributed framework for scalable, real-time, and adaptive detection, derive intricate features from historical transaction data and outperform traditional detection systems [15,16]. Even so, such systems still face overfitting, limited access to data, and difficulties in providing real-time responsiveness [17,18,19].
Recent efforts to overcome these issues have proposed creative solutions, including genetic algorithm-based feature selection, spatial-temporal attention, and combinations with distributed learning frameworks such as Apache Spark [20,21]. These newer analysis techniques cope better with larger datasets without adding computational load [22,23,24]. Nevertheless, ever-evolving fraud patterns and the demands of scaling these algorithms continue to complicate the implementation of effective real-world fraud detection solutions [25,26,27].
This study proposes an optimized, scalable fraud detection framework by integrating PySpark for distributed processing with advanced ML models (XGBoost and CatBoost). We evaluate the framework’s performance across metrics such as accuracy, calibration (Brier Score), specificity, and latency. Our main objective is to demonstrate how distributed ML pipelines can enhance predictive accuracy, computational efficiency, and real-time responsiveness in credit card fraud detection. This work contributes by benchmarking performance across several algorithms and providing actionable insights into scalable model deployment in financial systems.
3. Materials and Methods
This section describes the methodology, covering the dataset, the models, and the performance evaluation for credit card fraud detection. The dataset used in this analysis was obtained from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data (accessed on 21 April 2025) and contains anonymized information about credit card transactions labeled for fraud detection. Its large size, over 200 MB, makes it well suited for studying challenges such as class imbalance, which are addressed below through the chosen models and techniques. Logistic Regression, Decision Trees, Random Forests, CatBoost, and XGBoost were implemented (Figure 2).
3.1. Performance Evaluation
Logistic Regression is a statistical model that estimates the probability of a binary outcome from one or more independent variables, which makes it suitable for predicting fraudulent transactions. Because it is simple and interpretable, Logistic Regression is used here as a baseline against which the more complex algorithms are compared.
Decision Trees classify data using a tree-like model in which internal nodes represent decisions or conditions and branches represent outcomes. The method is intuitive and interpretable; however, it is prone to overfitting, which degrades performance on new, unseen data. Here, the technique serves as an intermediate approach for understanding how hierarchical decision-making influences fraud detection accuracy.
Random Forests aggregate predictions from several Decision Trees to improve classification performance and reduce overfitting. Each tree is trained on a random subset of the data, and the results are combined to produce the final prediction. This ensembling technique ensures robust performance with an improvement in generalizing unseen data.
XGBoost is an ensemble learning technique that builds a sequence of models, each correcting the errors of the previous model. It is optimized for handling big data and allows flexibility in tuning parameters, making it apt for fraud detection problems. XGBoost was selected because it has shown its capability for high accuracy and effectiveness in handling imbalanced data.
CatBoost is a gradient-boosting algorithm optimized for categorical data. Its novelty lies in handling categorical features directly, without extensive pre-processing or one-hot encoding. It is computationally efficient, performs well on both balanced and imbalanced datasets, and reduces overfitting through ordered boosting and symmetric trees, which helps it generalize to unseen data. In this study, CatBoost achieved strong performance metrics, coming very close to XGBoost in calibration and accuracy. Combined with its interpretability and computational efficiency, these highly accurate predictions make the algorithm competitive for fraud detection.
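For concreteness, the five classifiers can be instantiated as in the following minimal sketch, assuming the scikit-learn, xgboost, and catboost packages; the parameter values shown are illustrative rather than the tuned configurations reported later.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Illustrative instantiation of the five models evaluated in this study
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "DecisionTree": DecisionTreeClassifier(max_depth=25, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, max_depth=25, random_state=42),
    "XGBoost": XGBClassifier(max_depth=6, eval_metric="logloss", random_state=42),
    "CatBoost": CatBoostClassifier(depth=8, verbose=0, random_state=42),
}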
3.2. Dataset Description
The publicly available “Credit Card Fraud Detection” dataset from Kaggle was used in this research. It consists of 284,807 transactions in total, of which just 492 are labeled as fraudulent, giving a highly imbalanced fraud rate of 0.172%. Each transaction is represented by 31 attributes: 28 anonymized features (V1–V28) derived using Principal Component Analysis (PCA), the two non-transformed features “Time” and “Amount”, and the binary “Class” target, where 1 denotes fraud and 0 a legitimate transaction. The PCA transformation maintains privacy while retaining essential feature representations for fraud detection.
Extensive exploratory data analysis (EDA) revealed that several PCA components show striking dependencies, including a positive correlation between V17 and V18 and a negative correlation between V14 and V4. These may indicate feature interactions that influence fraud detection performance. The “Time” feature records the seconds elapsed between each transaction and the first transaction in the dataset; it was excluded because it provides no meaningful temporal context for the classification task. The “Amount” feature was standardized with the StandardScaler transformation so that its scale aligns with the PCA-transformed features and model training remains consistent.
SMOTE was applied to generate synthetic fraudulent examples to deal with class imbalance. It operates by oversampling the minority class with new synthetic data points created based on the feature space of the already existing minority samples. This approach reduces the bias towards the majority class during a model’s training and improves the generalization of classification models.
In model evaluation, k-fold cross-validation was used to ensure robustness and avoid performance bias. The data were divided into k subsets: k − 1 subsets for training and the remaining subset for validation. This process was repeated k times, rotating the validation subset each time. K-fold cross-validation gives an overall view of the model’s performance and mitigates overfitting by using all of the data for both training and validation. It was preferred over a simple train/test split because it avoids a biased performance estimate and exposes the models to diverse data partitions. For comparison, Leave-One-Out Cross-Validation (LOOCV) was also considered for smaller subsets, where each single transaction is used once as the validation sample to give a highly granular estimate of predictive performance; applying LOOCV to the entire dataset would have been computationally prohibitive given its size.
Embedding the pre-processing steps within the k-fold cross-validation ensures proper dataset preparation for training and evaluating each machine learning model: class imbalance and overfitting are handled inside each fold, yielding reliable estimates of generalizability.
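A minimal sketch of this fold-safe setup, assuming the imbalanced-learn (imblearn) and xgboost packages and a synthetic dataset standing in for the Kaggle data: scaling and SMOTE are wrapped in a pipeline so that both are re-fit on the training portion of each fold only, preventing leakage.

import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

# Synthetic, highly imbalanced data in place of the Kaggle dataset (illustrative only)
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.99, 0.01], random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),          # fit on the training folds only
    ("smote", SMOTE(random_state=42)),    # oversample the minority class inside each fold
    ("clf", XGBClassifier(max_depth=6, eval_metric="logloss", random_state=42)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(pipe, X, y, cv=cv, scoring=["accuracy", "f1", "neg_brier_score"])
print({k: np.mean(v) for k, v in scores.items() if k.startswith("test_")})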
Figure 3 presents a heatmap of the pairwise correlations between the PCA-transformed features V1 through V28, the transaction amount, and the target class. This analysis is critical for understanding relationships and dependencies within the dataset, especially for PCA-derived features, which lack intuitive interpretability. The heatmap shows both positive and negative correlations, underlining key relationships that can inform feature engineering and model optimization for credit card fraud detection.
Noticeable patterns include the strong positive correlations between V17 and V18 and between V16 and V17. These features appear to capture overlapping statistical dimensions of the transaction data and may therefore be redundant or share importance in identifying fraudulent transactions. Features such as V14 show a strong negative correlation with V4, indicating that these attributes capture complementary patterns within the dataset and may play a crucial role in distinguishing fraudulent transactions from valid ones.
The “Class” variable, as the binary fraud label, correlates only weakly with most features; most conspicuous are a mild negative correlation with V14 and positive correlations with V10 and V4. These associations are not strong enough to establish predictive power on their own, but they point toward subtler patterns that can support fraud detection when several features are combined within a more complex modeling framework. The “Amount” feature in the heatmap suggests it could be an essential variable for fraud detection: its weak positive correlation with the “Class” attribute may indicate that higher transaction amounts are slightly associated with fraud, and it should therefore be handled carefully during pre-processing and model training to avoid bias.
The heatmap is thus an essential step in the exploratory data analysis, providing insight into feature relationships that inform the construction of robust machine learning models. Highly correlated features such as V17 and V18 document multicollinearity and indicate the potential value of dimensionality reduction or feature selection strategies. Understanding the subtle but real correlations with the “Class” variable also allows features to be prioritized during training. However, because the features are anonymized, no contextual information is available to reveal the root causes of this collinearity.
This analysis ensures that the features contributing most to predictive accuracy are prioritized and highlights areas where additional pre-processing, such as decorrelation or feature scaling, may enhance model performance. The heatmap thus provides a foundational basis for developing optimized machine learning pipelines for fraud detection. A box plot of the transaction amount distribution was obtained with the command data['Amount'].plot.box(). The resulting visualization of the transaction amounts and their outliers gives an immediate impression of the variability of the data and its extreme values (Figure 4).
The kernel density estimate plot, produced with the command sns.kdeplot(data=data['Amount'], shade=True), showed a smooth representation of the distribution, indicating that the transaction amounts are positively skewed, with most values falling in the lower range.
This was treated as a non-normal distribution and scaled and pre-processed accordingly (Figure 5).
The individual distributions of highly correlated key features such as V1, V10, V12, and V23 were examined using histograms (Figure 6). These features were chosen from the earlier heatmap based on their correlation with the target class. The histograms showed varying degrees of skewness, some approximately normal and others with significant deviations.
A pie chart was generated to display the target class distribution. Figure 7 depicts that only 0.172% of the dataset consists of fraudulent transactions, demonstrating a highly imbalanced classification problem. Given this severe class imbalance, resampling techniques such as SMOTE are necessary to generate synthetic samples of the minority class so that the models generalize well in fraud detection.
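The exploratory plots described above can be reproduced with a short script along the following lines; this is a sketch assuming pandas, matplotlib, and seaborn, and that the Kaggle CSV has been saved locally as creditcard.csv (the filename is an assumption).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("creditcard.csv")  # Kaggle "Credit Card Fraud Detection" dataset

# Box plot of the transaction amount (Figure 4)
data['Amount'].plot.box()
plt.show()

# Kernel density estimate of the transaction amount (Figure 5);
# newer seaborn versions use fill=True in place of the deprecated shade=True
sns.kdeplot(data=data['Amount'], shade=True)
plt.show()

# Histograms of selected PCA components (Figure 6)
data[['V1', 'V10', 'V12', 'V23']].hist(bins=50, figsize=(10, 6))
plt.show()

# Class distribution pie chart (Figure 7)
data['Class'].value_counts().plot.pie(autopct='%1.3f%%')
plt.show()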
Care was taken to avoid an arbitrary split of the dataset into training and testing subsets for model evaluation. Instead, k-fold cross-validation with k = 10 was used, since it gives a more robust assessment by training and validating the model on all parts of the dataset, reducing bias and variance in the evaluation. The data were prepared with standard scaling so that features lie within comparable ranges, a crucial step given how different the magnitude of the transaction amount is from that of the PCA-transformed features. The data were also visualized and statistically checked for correctness.
For each model, the Brier Score was computed to give a statistical assessment of the model’s predictions. It measures the accuracy of probabilistic predictions by calculating the mean squared difference between predicted probabilities and actual outcomes. The Brier Score is well suited to imbalanced classification problems since it considers both the calibration and the refinement of the predicted probabilities; a low Brier Score reflects well-calibrated predictions.
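As a concrete illustration, the Brier Score can be computed from predicted fraud probabilities with scikit-learn; the arrays below are illustrative placeholders, not results from this study.

import numpy as np
from sklearn.metrics import brier_score_loss

y_test = np.array([0, 0, 1, 0, 1])                    # illustrative true labels (1 = fraud)
y_prob = np.array([0.02, 0.10, 0.85, 0.01, 0.60])     # illustrative predicted fraud probabilities

# Brier Score = mean squared difference between predicted probability and actual outcome
brier = brier_score_loss(y_test, y_prob)
assert np.isclose(brier, np.mean((y_prob - y_test) ** 2))
print(f"Brier Score: {brier:.4f}")  # lower values indicate better-calibrated predictions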
To identify which model produced statistically superior forecasts, this study followed Hansen’s Model Confidence Set (MCS) framework [116]. The MCS procedure compares models with respect to the Brier Score by testing the null hypothesis of equal predictive ability across all included models; the worst-performing model is eliminated iteratively until only a confidence set of statistically superior models remains. This allows for a rigorous statistical comparison of forecasting accuracy and addresses the problem of similar performance across models.
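A sketch of how such an MCS comparison can be run in Python, assuming the third-party arch package (arch.bootstrap.MCS) and per-observation squared-error losses; the exact API may differ across versions, the probability arrays below are illustrative, and this is not the authors’ original implementation.

import numpy as np
import pandas as pd
from arch.bootstrap import MCS

rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=500)                               # illustrative outcomes
prob_xgb = np.clip(y_test * 0.90 + rng.normal(0, 0.05, 500), 0, 1)  # illustrative model probabilities
prob_cat = np.clip(y_test * 0.88 + rng.normal(0, 0.06, 500), 0, 1)
prob_rf = np.clip(y_test * 0.80 + rng.normal(0, 0.10, 500), 0, 1)

# Per-observation squared-error (Brier-type) losses for each model
losses = pd.DataFrame({
    "XGBoost": (prob_xgb - y_test) ** 2,
    "CatBoost": (prob_cat - y_test) ** 2,
    "RandomForest": (prob_rf - y_test) ** 2,
})

mcs = MCS(losses, size=0.05)   # 95% model confidence set
mcs.compute()
print("Models retained in the confidence set:", list(mcs.included))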
The Brier Scores of Logistic Regression, Decision Trees, Random Forest, XGBoost, and CatBoost were computed to assess the probabilistic accuracy and calibration of these techniques. CatBoost performed well at both depth levels of 6 and 8, achieving a test Brier Score of 0.0004 at depth 8, thus outperforming most of the other models in calibration and probabilistic accuracy on the imbalanced dataset and coming very close to XGBoost, which achieved a Brier Score of 0.0002. The MCS procedure confirmed that, at the 95% confidence level, CatBoost was statistically equivalent in predictive ability to XGBoost and Random Forest, while offering calibration competitive with XGBoost across the tested settings.
Figure 8 below visualizes the Brier Score for each evaluated model, where lower values represent better-calibrated probabilistic predictions.
Random Forest, XGBoost, and CatBoost consistently outperform Logistic Regression and Decision Trees, particularly in handling imbalanced datasets. CatBoost’s strong calibration (Brier Score of 0.0004) aligns with its interpretability and computational efficiency.
3.3. Experimental Controls and Reproducibility Measures
All experiments were conducted using a standardized environment to ensure the full transparency and reproducibility of this study. The machine learning models were implemented in Python 3.9 with libraries including scikit-learn 1.2, XGBoost 1.7, and CatBoost 1.2. The distributed processing environment was set up on Apache Spark 3.3.1 using PySpark, with four worker nodes provisioned via a cloud-based cluster—each with 16 GB RAM and 4 virtual CPUs. Experiments were orchestrated using Hadoop YARN for resource management.
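The Spark environment can be reproduced with a configuration along the following lines; this is a sketch in which the master setting and exact memory values are assumptions consistent with the stated four 16 GB / 4 vCPU workers managed by YARN.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("credit-card-fraud-detection")
    .master("yarn")                               # resources managed by Hadoop YARN
    .config("spark.executor.instances", "4")      # four worker nodes
    .config("spark.executor.memory", "12g")       # headroom below the 16 GB available per node
    .config("spark.executor.cores", "4")          # 4 virtual CPUs per worker
    .getOrCreate()
)

# Load the transaction data into a distributed DataFrame
transactions = spark.read.csv("creditcard.csv", header=True, inferSchema=True)
print(transactions.count(), "transactions loaded")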
To avoid result variability, a global random seed (42) was fixed across all components, including numpy, random, and all model libraries. All pre-processing (e.g., SMOTE, standard scaling) was applied only to training data within each fold to prevent data leakage. The classification models were trained and evaluated using stratified 10-fold cross-validation, ensuring equal class distribution in each fold. Leave-One-Out Cross-Validation (LOOCV) was also explored on smaller subsets for comparison purposes.
Model hyperparameters were tuned using Bayesian optimization via Optuna (50 trials per model). The primary optimization metric was F1-score, and early stopping was employed (patience of 10 rounds). Default settings for regularization parameters were tested against custom configurations to evaluate their impact. The exact parameter grids and best-performing configurations are available upon request.
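A condensed sketch of such a tuning loop, assuming Optuna with an XGBoost objective and a synthetic dataset in place of the real data; the search space shown is illustrative rather than the exact grid used in this study.

import optuna
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.99, 0.01], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

def objective(trial):
    # Illustrative search space; the study's actual grids are available on request
    model = XGBClassifier(
        max_depth=trial.suggest_int("max_depth", 3, 10),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        n_estimators=500,
        eval_metric="logloss",
        early_stopping_rounds=10,     # patience of 10 rounds, as described above
        random_state=42,
    )
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    return f1_score(y_val, model.predict(X_val))   # F1-score as the optimization target

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params)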
All performance metrics—accuracy, precision, recall, F1-score, specificity, and Brier Score—were calculated using scikit-learn. Runtime measurements (training/testing duration) were captured using Python’s time module. For real-time detection latency, PySpark’s Structured Streaming was simulated with 500 ms micro-batches to assess the feasibility of online fraud detection under operational conditions.
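The latency simulation can be sketched with PySpark Structured Streaming, using a 500 ms processing-time trigger and a foreachBatch sink that would score each micro-batch with the trained model; the rate source below stands in for a real transaction feed, and the scoring step is only indicated in a comment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-stream-sim").getOrCreate()

# Placeholder stream; in production this would be a Kafka or socket source of transactions
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

def score_batch(batch_df, batch_id):
    pdf = batch_df.toPandas()
    # model.predict_proba(...) would be applied here to the engineered features;
    # this sketch only records the batch size to stay self-contained
    print(f"batch {batch_id}: scored {len(pdf)} records")

query = (
    stream.writeStream
    .foreachBatch(score_batch)
    .trigger(processingTime="500 milliseconds")   # 500 ms micro-batches
    .start()
)
query.awaitTermination(10)   # run the simulation briefly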
This rigorously controlled setup ensures this study’s findings are robust, interpretable, and replicable in academic and applied settings.
3.4. Rationale for Model Selection and Statistical Validation Enhancements
To reinforce the robustness and reliability of our findings, this section elaborates on the rationale behind the selection of specific machine learning algorithms and presents the additional statistical analyses employed.
3.4.1. Model Selection Justification
The models chosen for this study—Logistic Regression, Decision Tree, Random Forest, XGBoost, and CatBoost—were selected to represent a diverse set of classification paradigms ranging from baseline linear models to advanced ensemble techniques.
Logistic Regression was used as a baseline due to its interpretability and widespread use in the fraud detection literature.
Decision Trees and Random Forests were included because they can capture non-linear relationships and provide feature importance insights.
XGBoost and CatBoost were selected as state-of-the-art gradient boosting frameworks for their high predictive performance, handling of imbalanced datasets, and efficient computation. CatBoost, in particular, is robust to categorical variables and offers superior calibration, which is essential in fraud probability estimation.
These choices allow for comparative evaluation and performance benchmarking, addressing the dual goals of accuracy and explainability in high-stakes financial applications.
3.4.2. Statistical Validation and Error Analysis
To further validate the outcomes, we expanded the statistical analysis in the following ways:
Confidence Intervals: For each performance metric (accuracy, precision, recall, F1-score, Brier Score), 95% confidence intervals were calculated using bootstrapped resampling (n = 1000 iterations). This provides a measure of statistical stability and allows for more informed comparisons between models.
McNemar’s Test: To compare the statistical significance between model predictions (e.g., XGBoost vs. CatBoost), we applied McNemar’s test on paired classification outputs. This test evaluated whether the observed improvements were statistically significant or due to chance.
Calibration Curves: We incorporated reliability plots (calibration curves) for XGBoost and CatBoost to assess how well the predicted probabilities align with actual outcomes.
Error Distribution: Misclassified instances were analyzed by comparing the distribution of feature values between false positives, false negatives, and correctly predicted classes. This analysis helps identify potential edge cases or underrepresented patterns in the training data.
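The bootstrap intervals, McNemar’s test, and calibration curves can be reproduced along the following lines; this sketch uses illustrative arrays rather than the study’s predictions, with statsmodels providing McNemar’s test and scikit-learn the calibration curve.

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)                                # illustrative labels
pred_a = np.where(rng.random(1000) < 0.98, y_true, 1 - y_true)        # predictions of model A
pred_b = np.where(rng.random(1000) < 0.97, y_true, 1 - y_true)        # predictions of model B
prob_a = np.clip(pred_a * 0.9 + rng.normal(0, 0.05, 1000), 0, 1)      # probabilities of model A

# 1. Bootstrapped 95% confidence interval for the F1-score (1000 resamples)
f1_samples = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    f1_samples.append(f1_score(y_true[idx], pred_a[idx]))
ci_low, ci_high = np.percentile(f1_samples, [2.5, 97.5])
print(f"F1 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")

# 2. McNemar's test on paired classification outputs
correct_a, correct_b = pred_a == y_true, pred_b == y_true
table = [[np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
         [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
print(mcnemar(table, exact=False, correction=True))

# 3. Calibration (reliability) curve
prob_true, prob_pred = calibration_curve(y_true, prob_a, n_bins=10)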
4. Results
The models’ performance was evaluated using accuracy, precision, recall, F1-score, specificity, and Brier Score. K-fold cross-validation (k = 10) was employed for training and testing, ensuring robust evaluation. The results below summarize the performance of Logistic Regression, Decision Trees, Random Forest, XGBoost, and CatBoost, with insights into their suitability for credit card fraud detection.
4.1. Logistic Regression
Logistic Regression demonstrated reliable performance with a mean test accuracy of 96.49% and a mean test Brier Score of 0.0259, indicating relatively good calibration. Specificity was consistent at 97.84%, showcasing its ability to minimize false positives (Table 1). However, the model underperformed compared to the ensemble methods.
4.2. Decision Trees
Decision Trees, evaluated at different maximum depths (15, 20, 25), exhibited increasing performance with depth, achieving a mean test accuracy of 98.91% at depth 25 and a Brier Score of 0.0085. Specificity reached 99.15%, making Decision Trees highly effective at differentiating legitimate and fraudulent transactions (Table 2).
4.3. Random Forest
Random Forest consistently outperformed Decision Trees and Logistic Regression, achieving a test accuracy of 99.36% and a Brier Score of 0.0053 at depth 25 (Table 3). Specificity improved with depth, peaking at 99.50%, showcasing the model’s robustness in handling imbalanced data.
4.4. XGBoost
XGBoost achieved the highest test accuracy of 99.97% and a Brier Score of 0.0002, outperforming Random Forest in calibration but requiring more computational resources (Table 4). Specificity was also exceptional at 99.95%, reinforcing its suitability for fraud detection.
4.5. CatBoost
CatBoost, tested at depths 6 and 8, demonstrated strong calibration and performance. At depth 8, the test accuracy was 99.96%, with a Brier Score of 0.0004 (Table 5). Specificity was marginally lower than XGBoost at 99.91% but still competitive.
The performance of the machine learning models was evaluated using key metrics, including accuracy, precision, recall, F1-score, specificity, and Brier Score, as shown in Table 6. Logistic Regression achieved a moderate accuracy of 96.5%, reflecting its simplicity and interpretability. Decision Trees and Random Forest exhibited robust performance, with accuracies of 98.9% and 99.4%, respectively, highlighting their strength in handling imbalanced datasets. Advanced ensemble methods like XGBoost and CatBoost demonstrated exceptional results, achieving near-perfect accuracies of 99.97% and 99.96%, respectively, alongside superior calibration metrics such as the lowest Brier Scores (0.0002 for XGBoost and 0.0004 for CatBoost).
4.6. Model Accuracy Comparison
The performance of the machine learning models was assessed based on their accuracy during both the training and testing phases.
Figure 9 and Figure 10 illustrate the accuracy comparison of various algorithms, including Logistic Regression, Decision Trees (at depths of 15, 20, and 25), Random Forest (at depths of 15, 20, and 25), XGBoost (depth 6), and CatBoost (depths 6 and 8).
In the training phase, Figure 9 shows that the Decision Tree and Random Forest models reached 100% accuracy at all depths, indicating a potential tendency to overfit. In comparison, XGBoost and CatBoost reached near-perfect accuracies of 99.97% and 99.96%, respectively, already suggesting good generalization during training.
Logistic Regression, being the baseline model, achieved an accuracy of 96.50%, which is comparatively lower but still competitive given its simplicity. In the testing phase, as shown in Figure 10, XGBoost and CatBoost retained their near-perfect accuracy scores of 99.97% and 99.96%, respectively.
These results underline their strength and generalization capability on unseen data. The Random Forest and Decision Tree models had very high accuracies of over 99% during testing, though these were slightly decreased from the training accuracy, indicating mild overfitting.
Logistic Regression maintained its accuracy at 96.49% during testing, strengthening its reliability as a baseline model against which to compare. These results show the improved performance of sophisticated ensemble models, such as XGBoost and CatBoost, on fraud detection complexities.
They can achieve high accuracy in training and testing phases and are robust against overfitting, making them good candidates for real-world applications. Logistic Regression is a benchmark because of its simplicity and interpretability, while Decision Trees and Random Forests perform excellently but may require extra regularization to avoid overfitting risks.
These results highlight some intrinsic trade-offs among model complexity, performance, and generalization, which could provide valuable insights into selecting the optimum algorithms for fraud detection systems.
4.7. Comparative Analysis
Precision, recall, and F1-score metrics provide further insight into how the models handle class imbalance.
Figure 11 and Figure 12 depict these metrics for the training and testing phases for Logistic Regression, Decision Trees (depths 15, 20, and 25), Random Forest (depths 15, 20, and 25), XGBoost (depth 6), and CatBoost (depths 6 and 8).
Figure 11 shows that, on the training sets, the Decision Tree and Random Forest models achieved perfect precision, recall, and F1-scores of 1.0 at every depth, capturing all true positives without error. However, such flawless scores hint at a likelihood of overfitting. XGBoost and CatBoost showed near-perfect scores (~0.999), maintaining a balance between capturing true positives and limiting false positives and false negatives. Logistic Regression, while simpler, consistently obtained scores of ~0.965 and proved reliable as a baseline.
Figure 12 shows that during the testing phase, XGBoost and CatBoost maintained near-perfect precision, recall, and F1-scores of 0.999+, proving their robustness and efficiency on unseen data. Random Forest and Decision Tree models were also impressive during testing, with scores close to 1.0, although these scores are slightly lower than in training, indicating slight overfitting. Logistic Regression had balanced scores of ~0.96, showcasing its capability despite being less sophisticated.
This analysis has underlined how advanced models, such as XGBoost and CatBoost, realize excellent performance while preserving generalization. While Decision Tree and Random Forest are excellent on these metrics, their tendency toward overfitting requires cautious parameter tuning in practical scenarios. Though less accurate, Logistic Regression is useful because of its simplicity and interpretability. These metrics point to the trade-offs between model complexity, precision, and generalization within fraud detection.
More analytically, the comparison highlighted distinct advantages and limitations of each model across the metrics considered, reflecting their effectiveness in identifying fraudulent transactions. Logistic Regression was taken as the baseline model because of its simplicity, interpretability, and linear assumptions. That linearity, however, becomes a limitation when handling the complex nonlinear relationships present in fraud detection datasets: although it achieved a test accuracy of 96.49%, its precision, recall, and F1-scores remained suboptimal compared with the state-of-the-art methods.
Decision Trees represent data in a flowchart-like tree structure, which keeps them interpretable while delivering strong values on most metrics (99.04%), including a precision of 99.79%. The test results show some declines, with a specificity of 99.15% and a Brier Score of 0.0085, suggesting that Decision Trees tend to overfit their training sets to a marginal degree at some cost to generalization.
Random Forest is an ensemble method that enhances predictive robustness by combining the results of multiple Decision Trees. This reduces overfitting and consistently yields high performance on both training and test datasets. Its generalization is exemplary, with an accuracy of 99.36% and a specificity of 99.50%, and its Brier Score of 0.0053 confirms excellent calibration and probabilistic performance, making it a very effective method for the imbalanced datasets typical of fraud cases.
The performance-optimized gradient boosting implementation, XGBoost, achieved 99.97% test accuracy, a specificity of 99.95%, and a Brier Score of 0.0002. These results show that the method handles large, imbalanced datasets efficiently and with excellent computational efficiency. Owing to its sequential training, XGBoost significantly reduces prediction errors and produces high precision and recall values. Its computational complexity, however, can hinder real-time applications where fast decision-making is required.
CatBoost was specifically developed to handle categorical features, which makes it a highly competitive model. Without extensive pre-processing, CatBoost integrates categorical data directly into its learning framework, reducing computational overhead. It achieved a test accuracy of 99.96% and a specificity of 99.91%, placing it very close to XGBoost, and its Brier Score of 0.0004 points to strong calibration, which is valuable for probabilistic predictions. Because it handles categorical features efficiently and delivers comparable or better performance with less tuning, CatBoost is of high value in practical applications.
The simpler models, such as Logistic Regression, provide basic insights, while the more advanced ensemble methods (Random Forest, XGBoost, and CatBoost) show better predictive accuracy, calibration, and robustness. XGBoost is slightly better calibrated, although the results for both boosting models are close to perfect, meaning both are promising for a task as complex as credit card fraud detection. The inclusion of CatBoost demonstrates its ability to handle categorical and numerical features seamlessly, with comparable performance and less pre-processing, making it practical and efficient for real-world financial applications.
The combined heatmap (Figure 13) presents a unified view of all models’ training and testing metrics. The rows correspond to different metrics and datasets (training or testing), while the columns represent the models. Insights:
XGBoost and CatBoost consistently achieve superior performance across all metrics and datasets.
Brier Score is adjusted for visualization, highlighting calibration strengths.
Differences between training and testing performance are minimal for advanced models like XGBoost and CatBoost.
5. Discussion
This work has analyzed several machine learning methods for credit card fraud detection: Logistic Regression, Decision Trees, Random Forest, XGBoost, and CatBoost. The results highlight the trade-off between model simplicity and interpretability on one hand and predictive performance, in terms of accuracy, calibration, and computational efficiency, on the other.
Logistic Regression, the baseline model, reached an accuracy of 96.49%, indicating its usefulness when model interpretability is crucial. However, its limited ability to capture complicated nonlinear relationships restricts its application in fraud detection, especially on imbalanced datasets. Its Brier Score of 0.0259 is relatively high compared with the ensemble methods, indicating weaker calibration and making it less suitable for high-stakes environments where precision is key.
Decision Trees and Random Forest showed remarkable improvements, with Random Forest achieving almost perfect metrics on both the training and testing datasets. By aggregating predictions from several trees, Random Forest mitigates overfitting and enhances generalization, as reflected in its low Brier Score of 0.0053 and specificity of 99.50%. However, its ensemble nature makes the model more computationally expensive and thus less suitable for scenarios that demand rapid predictions.
XGBoost further enhanced performance by leveraging gradient boosting, reaching the highest accuracy of 99.97% and a Brier Score of 0.0002. The model proved very effective on imbalanced and complex datasets because of its ability to correct prediction errors sequentially. However, XGBoost requires intensive hyperparameter tuning and considerable computational resources, especially for large datasets, which can make deployment in real-time systems difficult. CatBoost was a very strong competitor, especially in handling categorical features natively without pre-processing.
The model achieved an accuracy of 99.96% on testing, with a specificity of 99.91% and a Brier Score of 0.0004. The computational efficiency of CatBoost, in addition to its high calibration and accuracy, underlines its practical relevance for real-world applications, especially where mixed data types are concerned. Compared to XGBoost, CatBoost performs similarly, with reduced pre-processing overhead and comparable training times.
Table 7 summarizes the hypotheses proposed and their respective evaluation outcomes to provide a structured assessment of this study’s theoretical foundations. This table highlights the empirical confirmation of both hypotheses based on the experimental results and performance metrics discussed earlier.
Specifically, Hypothesis 1 (H1), which posits that advanced machine learning techniques significantly enhance fraud detection accuracy compared to traditional methods, was confirmed. The results showed that ensemble models such as XGBoost and CatBoost outperformed Logistic Regression and Decision Trees, achieving test accuracies of 99.97% and 99.96%, respectively, alongside superior Brier Scores and calibration.
Similarly, Hypothesis 2 (H2) was also validated concerning the role of distributed frameworks like PySpark in enhancing scalability and processing efficiency. PySpark reduced training times significantly and enabled real-time fraud detection capabilities with system latency as low as 500 milliseconds, outperforming traditional platforms like Hadoop and Flink.
Confirming both hypotheses reinforces the value of combining advanced ML models with scalable distributed architectures in developing robust, adaptive fraud detection systems.
Prior studies [117,118] have contributed significantly to the field by applying traditional machine learning algorithms like Random Forest and Logistic Regression to credit card fraud detection. While these studies addressed essential issues such as data imbalance and model evaluation, their implementations were confined mainly to non-distributed environments, achieving accuracy rates between 94% and 96% and often requiring batch-based processing. In contrast, our proposed framework integrates advanced ensemble methods (XGBoost and CatBoost) within a distributed PySpark architecture, achieving near-perfect accuracy (99.96–99.97%), extremely low Brier Scores (0.0002–0.0004), and scalability across large transactional datasets.
Additionally, our work improves upon traditional approaches by emphasizing real-time detection capability, which is essential for fraud prevention in high-volume financial systems. PySpark’s in-memory processing and distributed training significantly reduced latency (to 500 milliseconds), outperforming traditional big data frameworks like Hadoop and Flink, which remain more batch-oriented. Moreover, while many existing studies relied solely on resampling techniques like SMOTE to handle class imbalance, our approach incorporates these techniques within an advanced learning pipeline that uses hyperparameter-optimized ensemble classifiers, ensuring improved generalization and probabilistic calibration.
Finally, a recent study [119] has demonstrated how machine learning frameworks can effectively identify and mitigate digital threats through anomaly detection models in parallel domains like cybersecurity. These findings align with our fraud detection context, where timely and accurate classification of malicious behavior is equally critical. This cross-domain relevance underscores the broader applicability of our proposed methodology and supports its deployment across various sectors requiring robust anomaly detection solutions.
Figure 14 and Figure 15 show the computational overhead of the assessed models, another critical aspect for real-time fraud detection systems. Logistic Regression proved the most computationally efficient, with minimal training and testing times of 1.87 s and 0.058 s, respectively, at the cost of slightly lower predictive power compared with the more advanced algorithms. It is therefore ideal for cases where computational efficiency and simplicity matter most.
Decision Trees imposed a relatively moderate computational load, with training and testing times increasing with tree depth. At depth 25, training took 21.74 s and testing 0.127 s, showing good scalability while remaining computationally feasible. Ensemble models such as Random Forest were much more expensive: at depth 25, training took 78.55 s and testing 0.22 s. Given its robustness and high accuracy, Random Forest suits systems where training can be completed offline despite the higher resource demands. XGBoost offered a balanced profile, with a training time of 34.02 s and a testing time of 0.06 s; its optimization techniques, parallel processing, and regularization enable competitive performance without extreme computational cost, although hyperparameter tuning remains resource-demanding.
CatBoost, while highly comparable to XGBoost in performance, took slightly longer to train (34.15 s) but kept testing times low at 0.03 s. Its native support for categorical data without pre-processing greatly reduces implementation complexity, making it a promising candidate for real-world, large-scale fraud detection applications. PySpark was essential for handling the computational burden of high-volume transaction data: its distributed processing enabled all models to be implemented in a scalable manner and made it feasible to deploy resource-intensive algorithms such as Random Forest, XGBoost, and CatBoost in real-time detection environments. The computational trade-offs presented in this study underline the importance of choosing algorithms according to the operational constraints and objectives of fraud detection systems.
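Training and testing durations of the kind reported here can be captured with Python’s time module; the following is a minimal sketch with a synthetic dataset, not the exact measurement script used in this study.

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data standing in for the transaction dataset
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99, 0.01], random_state=42)
model = RandomForestClassifier(max_depth=25, random_state=42)

start = time.perf_counter()
model.fit(X, y)
train_seconds = time.perf_counter() - start

start = time.perf_counter()
model.predict(X)
test_seconds = time.perf_counter() - start

print(f"train: {train_seconds:.2f} s, test: {test_seconds:.3f} s")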
The experimental results directly validate Hypothesis 1 (H1) because XGBoost and CatBoost achieved predictive accuracy rates of 99.97% and 99.96%, which exceeded the results of traditional analytical methods such as Logistic Regression at 96.49%. The advanced algorithms achieve better results through sequential error correction, balanced dataset handling, and adaptive learning capabilities, which result in improved adaptability and responsiveness. The observed performance improvements between distributed frameworks like PySpark and advanced machine learning models also validate Hypothesis 2 (H2). The PySpark system showed better scalability and computational efficiency through its ability to cut training time by 52% compared to Hadoop and 34% compared to Flink. At the same time, it achieved real-time latency at 500 ms instead of 1200 ms (Hadoop) and 800 ms (Flink). The analytical results demonstrate how PySpark’s in-memory processing, parallelized computations, and streaming capabilities make it superior to traditional non-distributed frameworks.
These results, which were benchmarked against traditional methods and state-of-the-art machine learning models, further confirm the superiority of ensemble methods like Random Forest, XGBoost, and CatBoost. Though Logistic Regression is a very dependable baseline, it lacks the robustness of an ensemble method on complex, imbalanced datasets. The performance of CatBoost agrees with empirical evidence from existing research on its efficiency and accuracy in handling mixed data types. These results agree with those obtained in similar studies [103,107,110]. The comparison would be more complete, and the results would be further validated, by including other benchmarks such as LightGBM or hybrid models.
5.1. Comparative Evaluation and Innovations Beyond the State of the Art
A detailed comparative analysis was conducted to benchmark the proposed framework, combining PySpark with advanced machine learning models (XGBoost and CatBoost), against other distributed frameworks, including Apache Flink and Hadoop (Table 8). This evaluation focused on critical performance metrics such as training time, accuracy, real-time latency, and scalability, as these factors significantly influence the efficacy of fraud detection systems.
The comparison highlighted PySpark’s superior ability to integrate with machine learning workflows while maintaining real-time responsiveness. Its distributed architecture, in-memory processing, and support for iterative computations offered distinct advantages over Hadoop, which relies on disk-based operations, and Flink, which has limited support for advanced machine learning models. PySpark’s flexibility in managing batch and streaming data further reinforces its utility in fraud detection scenarios that demand rapid analysis and adaptability.
Key Observations:
- Training Time: The proposed framework achieved a 52% reduction in training time compared to Hadoop and a 34% reduction compared to Flink, attributed to PySpark’s parallelized processing and in-memory computations during gradient boosting iterations.
- Accuracy: By leveraging optimized feature engineering and advanced model tuning, the proposed framework improved detection accuracy by up to 4% over Flink and 4% over Hadoop, highlighting its ability to handle imbalanced datasets effectively.
- Latency: PySpark’s integration with XGBoost and CatBoost demonstrated a real-time latency of 500 milliseconds, significantly outperforming Hadoop (1200 ms) and Flink (800 ms). This advantage stems from PySpark’s support for streaming and low-latency computations.
- Scalability: While both PySpark and Flink exhibited high scalability due to their distributed nature, Hadoop’s scalability was moderate, limited by its reliance on batch processing and disk-based operations.
The proposed framework introduces several features that were limited or absent in previous systems. PySpark’s streaming capability was used to implement continuous learning mechanisms so that the models evolve dynamically as new data arrive, eliminating manual retraining and allowing real-time adaptation to changing fraud patterns. The framework also incorporates edge computing to process high-priority transactions locally, reducing latency for critical fraud detection tasks while retaining cloud-based scalability for larger datasets. This hybrid approach enhances the framework’s responsiveness and resource utilization.
Explainable AI methods were integrated to provide insight into feature importance and model predictions for both the XGBoost and CatBoost models. These explainable outputs are significant for regulatory compliance and for instilling stakeholder confidence in the fraud detection system. Computation-intensive tasks, such as interaction-term creation, aggregated statistics, and temporal feature extraction, were distributed across PySpark nodes; in this way, nuanced patterns associated with fraudulent transactions were better represented, and model performance increased substantially.
Meanwhile, the proposed framework parallelizes grid and random search tuning of the learning rate, maximum depth, and tree boosters across the PySpark distributed infrastructure, reducing optimization time by at least 40% compared with conventional sequential optimization procedures (a sketch of this pattern follows). These comparative insights underline the superiority of the proposed framework in handling large-scale transactional datasets with high precision and efficiency. Its adaptive learning mechanisms and hybrid architecture also make it well suited to real-world deployment in financial institutions, while its explainable outputs enhance trust and compliance in regulated environments.
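One way to express such a parallel search with PySpark is to broadcast the training data and evaluate parameter combinations across workers, as in the sketch below; this is an illustrative pattern with a synthetic dataset and a small grid, not the authors’ exact implementation.

import itertools
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

spark = SparkSession.builder.appName("distributed-tuning").getOrCreate()
sc = spark.sparkContext

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.99, 0.01], random_state=42)
data_bc = sc.broadcast((X, y))   # ship the training data to every worker once

# Small illustrative grid over (learning_rate, max_depth)
grid = list(itertools.product([0.05, 0.1, 0.2], [4, 6, 8]))

def evaluate(params):
    lr, depth = params
    X_b, y_b = data_bc.value
    clf = XGBClassifier(learning_rate=lr, max_depth=depth,
                        eval_metric="logloss", random_state=42)
    # Each worker evaluates one configuration with its own cross-validation
    score = cross_val_score(clf, X_b, y_b, cv=3, scoring="f1").mean()
    return (params, score)

results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
best = max(results, key=lambda t: t[1])
print("best (learning_rate, max_depth):", best)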
5.2. Model Sensitivity to Macroeconomic Shifts and Policy Invariance
Although this study’s findings reaffirm the excellent accuracy and calibration of machine learning methods in detecting credit card fraud from historical transaction data, we must acknowledge an important limitation: the possible non-invariance of structural parameters over time, as described in the Lucas critique. In economic modeling, parameter constancy ensures correct inference under changing policy regimes or macroeconomic conditions.
While credit card fraud is primarily a micro-level phenomenon, macroeconomic factors influence it indirectly. The literature [120] provides evidence that fiscal deficits, unemployment, interest rates, and the general performance of GDP can influence aggregate credit risk behavior. For instance, during a fiscal expansion or recession, the characteristics and volume of fraudulent activity can differ markedly: previously insignificant predictor variables can become significant, and current predictors can become irrelevant or exhibit different correlations.
Like most others, our models are trained on historical patterns, typically extracted from data collected in relatively stable economic times. Their predictive capacity can therefore degrade when applied in structurally different environments. This vulnerability is particularly pertinent to real-time systems functioning across extended time horizons and heterogeneous economic conditions. To mitigate this, future model development should consider the following:
- Temporally aware training (e.g., temporal cross-validation or rolling windows) to fit accurately to changing trends (a sketch follows this list).
- Incorporating macroeconomic variables (such as inflation rates and consumer confidence) into the feature set so the model can adjust for general conditions.
- Continual learning mechanisms that facilitate adaptive retraining whenever new patterns emerge due to policy and economic changes.
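As an illustration of the first point, temporally aware evaluation can be sketched with scikit-learn’s TimeSeriesSplit, which always validates on transactions that occur after the training window; the dataset here is synthetic and assumed to be ordered chronologically.

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

# Illustrative data assumed to be ordered by transaction time
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.99, 0.01], random_state=42)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf = XGBClassifier(max_depth=6, eval_metric="logloss", random_state=42)
    clf.fit(X[train_idx], y[train_idx])          # train only on earlier transactions
    score = f1_score(y[test_idx], clf.predict(X[test_idx]))
    print(f"validated on later window: F1 = {score:.3f}")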
Lastly, recognizing this sensitivity encourages model interpretability and transparency, essential for deployment in high-stakes financial environments. This dialogue makes practitioners aware of the risks of potential generalization from previous data to future fraud situations.
5.3. Limitations of the Study
Although credit card transactions are the most critical domain in the financial fraud detection literature, they represent only a small portion of the full scope of the problem. An extended fraud detection framework would broaden the research from pure credit card transaction-based detection to other platforms such as digital wallets, online banking systems, and even peer-to-peer payment networks. The generalizability of the models presented here across these domains was not assessed in this study.
Other limitations concern the focus on particular machine learning models and technologies: PySpark, XGBoost, and CatBoost. While the models and tools described performed extremely well, they represent only a small subset of the algorithms and platforms available for fraud detection; established alternatives such as LightGBM and newer paradigms such as graph neural networks and hybrid models were not considered. Real-time implementation also poses significant challenges that were not fully addressed in this research. These include the following:
- Latency and Delays: Fraud detection systems need to process transactions in real time to detect and stop fraudulent activities promptly. The computational overhead and latency associated with advanced models, especially ensemble methods, may impede their deployment in live systems.
- Scalability: The system must handle and analyze vast volumes of transactional data on scalable architectures to sustain performance under high load.
- Data Privacy: Handling sensitive financial information under global data privacy regulations, such as GDPR and CCPA, adds to the complexity.
- Global Collaboration: Coordinating fraud detection across financial institutions requires standardized protocols for data sharing, secure communication channels, and mechanisms for resolving cross-border regulatory constraints.
Another limitation is data imbalance, which remains a persistent issue in fraud detection. Techniques such as oversampling and synthetic data generation via SMOTE were applied in this paper, but their performance under evolving fraud patterns was not evaluated. In addition, relying on a single dataset limits the general applicability of the results, since real-world fraud patterns vary significantly across geographies, industries, and user demographics. Most of the emphasis in this research is on model performance metrics, such as accuracy and Brier Scores, without an in-depth cost–benefit analysis of implementation in live environments. Understanding the computational costs and resource allocation trade-offs relative to fraud detection effectiveness is essential for real-world deployment.
Overcoming these limitations requires careful planning and strategic preparation. Future work should span multiple financial platforms and a wider range of machine learning algorithms in order to deploy scalable, real-time, privacy-preserving fraud detection systems. Addressing these challenges will increase the adaptability, reliability, and effectiveness of fraud detection frameworks and keep them relevant to the ever-changing dimensions of financial fraud.
5.4. Future Research Directions
This research proved the viability of distributed machine learning models in improving credit card fraud detection systems’ accuracy, efficiency, and flexibility. Extending this work’s contributions, various future research opportunities are suggested to aid the further development of real-time, robust, and privacy-preserving fraud detection systems. A crucial follow-on action entails deploying the proposed models in actual financial transactional environments to test their performance against live operational conditions. This includes system latency, computational overhead, and compatibility with existing infrastructures to validate effortless integration into financial systems. This deployment will necessitate ongoing optimization and monitoring to ensure real-time responsiveness.
Since the character of fraud changes over time, future research must create real-time learning mechanisms that evolve alongside new trends as they arise. These involve behavioral analytics and context-aware anomaly detection, which utilize historical, contextual, and social interactional data to inform predictive capacity. Methods such as Kolmogorov complexity and social signal processing can also detect deviations from normative behavior that transactional features alone will not capture [121,122,123].
Emerging technologies offer additional opportunities. Blockchain offers decentralized, tamper-proof data exchange between financial institutions, enabling data transparency and traceability [124]. Quantum computing, in turn, may enable unparalleled processing speed and model training opportunities, particularly for large-scale streaming data scenarios [125,126,127].
In addition, future work should investigate hybrid modeling approaches blending conventional statistical techniques with sophisticated machine learning models. Such blends may provide greater flexibility and robustness for detecting both simple and intricate fraud patterns. Additionally, Explainable AI (XAI) will be essential for addressing regulatory compliance and providing transparency in model-based decisions, especially in high-risk finance applications [128].
Another critical area of research is the uptake of federated learning, which allows joint model training across financial institutions without revealing sensitive customer information. This approach honors worldwide privacy laws such as GDPR while enhancing fraud detection via secure knowledge sharing [129]. Deep insight into consumer behavior remains central to fraud detection; research is needed into how geographic, temporal, and transactional behaviors interact to distinguish legitimate from anomalous activity, reducing false positives and enhancing detection accuracy.
Finally, next-generation systems must incorporate privacy-preserving machine learning solutions such as secure multi-party computation and homomorphic encryption to keep up with changing privacy demands. These solutions allow for secure, scalable fraud detection without sacrificing confidentiality or adherence to worldwide data protection regulations.