Credit Card Fraud Detection with Autoencoder and Probabilistic Random Forest
Abstract
1. Introduction
2. The Proposed Method
2.1. Autoencoder
2.2. Random Forest
- Step 1.
- Produce n sub-datasets from the original dataset of d data. Each sub-dataset is produced by drawing d’ out of the d data with replacement, where d’ ≤ d.
- Step 2.
- For each of the n sub-datasets, grow a decision tree, splitting each internal node on the attribute with the largest information gain among k’ arbitrarily chosen attributes, where k’ < k and k is the number of attributes of each datum. This yields n decision trees and hence n classifications, each of which is one of r classes (i.e., labels) c1, …, cr.
- Step 3.
- Aggregate the results of the n trees to output the dominant class $c^{*} = \arg\max_{c_i} \mathrm{freq}(c_i)$ as the final classification, where freq(ci) is the frequency with which ci appears among the n classifications. Note that the output may instead be probabilistic, i.e., the classification frequencies (or probabilities) freq(c1), …, freq(cr) for all classes c1, …, cr.
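The sampling and aggregation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`bootstrap`, `class_frequencies`, `dominant_class`) and the vote list are hypothetical, and tree growing (Step 2) is omitted, so the per-tree classifications are supplied by hand.

```python
import random
from collections import Counter


def bootstrap(dataset, d_prime, rng):
    """Step 1: draw d' items from the dataset with replacement."""
    return rng.choices(dataset, k=d_prime)


def class_frequencies(tree_votes, classes):
    """Step 3: fraction of the n tree classifications that fall on each class."""
    counts = Counter(tree_votes)
    n = len(tree_votes)
    return {c: counts[c] / n for c in classes}


def dominant_class(freq):
    """Class with the largest vote frequency (first one wins on ties)."""
    return max(freq, key=freq.get)


rng = random.Random(0)
data = list(range(10))                 # hypothetical dataset of d = 10 items
sample = bootstrap(data, d_prime=8, rng=rng)

# Hypothetical classifications produced by n = 5 trees for one datum.
votes = ["normal", "fraud", "normal", "normal", "fraud"]
freq = class_frequencies(votes, classes=["normal", "fraud"])
label = dominant_class(freq)
```

Here `freq` plays the role of the probabilistic output freq(c1), …, freq(cr), while `label` is the conventional majority-vote output.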
2.3. The Proposed AE-PRF Method
- Step 1.
- Employ the training data to train the AE model AET and obtain the set T of training data feature codes.
- Step 2.
- Train the RF model RFT with the set T of training data feature codes.
- Step 3.
- Apply AET to the validation data to extract the set V of validation data feature codes.
- Step 4.
- For each threshold θ from 0 to 1 in steps of s = 0.01, execute the following: feed every code in V into RFT to output a probability p of fraud classification. If p > θ, then the classification result is positive (fraudulent); otherwise, the classification result is negative (normal).
- Step 5.
- Employ the classification results of all codes in V to find the threshold value θ* producing the best classification performance in terms of a specific metric M.
- Step 1 (testing phase).
- Apply AET to every test datum d to extract its feature code c.
- Step 2 (testing phase).
- Feed the code c into RFT with the threshold value θ* to produce the classification result of d.
Algorithm 1 AE-PRF
Input: training data Dtrain, validation data Dvalidation, test data Dtest, and metric M
Output: the classification result of each test datum (0 for normal or 1 for fraudulent)
Train the AE model AET with Dtrain
T ← AET(Dtrain)
Train the RF model RFT with T
V ← AET(Dvalidation)
for θ ← 0 to 1 step 0.01 do
  for each v in V do
    p ← RFT(v)
    if p > θ then result[θ][v] ← 1
    else result[θ][v] ← 0
Find the best θ* by comparing all result values in terms of metric M
C ← AET(Dtest)
for each c in C do
  q ← RFT(c)
  if q > θ* then output[c] ← 1
  else output[c] ← 0
return output
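The threshold search and the final classification in Algorithm 1 can be sketched as follows. This is an illustrative sketch only: the AE and RF models are not built here, so the fraud probabilities are hypothetical stand-ins for the outputs of RFT, the function names are invented for the example, and MCC is assumed as the metric M.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion(probs, labels, theta):
    """Count TP, TN, FP, FN when 'p > theta' means fraudulent (1)."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p > theta else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def best_threshold(probs, labels, metric=mcc, steps=100):
    """Sweep theta = 0, 0.01, ..., 1 and keep the first theta with the best score."""
    best_theta, best_score = 0.0, float("-inf")
    for i in range(steps + 1):
        theta = i / steps  # integer counter avoids floating-point drift
        score = metric(*confusion(probs, labels, theta))
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Hypothetical RF fraud probabilities for the validation codes, with true labels.
val_probs = [0.02, 0.05, 0.10, 0.40, 0.30, 0.90, 0.01, 0.20]
val_labels = [0, 0, 0, 1, 1, 1, 0, 0]
theta_star, score = best_threshold(val_probs, val_labels)

# Testing phase: classify hypothetical test probabilities with theta*.
test_probs = [0.03, 0.95, 0.25]
output = [1 if q > theta_star else 0 for q in test_probs]
```

Passing a different `metric` function (for example one returning TPR) selects a different θ*, which is how a single trained model can yield several operating points.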
3. Related Work
4. Performance Evaluation and Comparisons
4.1. Dataset and Data Resampling
4.2. Performance Metrics
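For reference, the four metrics reported in the result tables (ACC, TPR, TNR, and MCC) have the following standard definitions in terms of the confusion-matrix counts TP, TN, FP, and FN, with fraudulent taken as the positive class. These are the textbook forms, given here on the assumption that the reported values follow them:

$$
\begin{aligned}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\mathrm{TPR} &= \frac{TP}{TP + FN}, \\
\mathrm{TNR} &= \frac{TN}{TN + FP}, &
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\end{aligned}
$$

On a heavily imbalanced dataset, ACC and TNR are dominated by the majority (normal) class, which is why TPR and MCC are the more informative columns when comparing thresholds.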
4.3. Performance Evaluation of AE-PRF
4.4. Performance Comparisons
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. de Best, R. Credit Card and Debit Card Number in the U.S. 2012–2018. Statista. 2020. Available online: https://www.statista.com/statistics/245385/number-of-credit-cards-by-credit-card-type-in-the-united-states/#statisticContainer (accessed on 10 October 2021).
2. Voican, O. Credit Card Fraud Detection using Deep Learning Techniques. Inform. Econ. 2021, 25, 70–85.
3. The Nilson Report. Available online: https://nilsonreport.com/mention/1313/1link/ (accessed on 20 December 2020).
4. Taha, A.A.; Sharaf, J.M. An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine. IEEE Access 2020, 8, 25579–25587.
5. Dal Pozzolo, A. Adaptive Machine Learning for Credit Card Fraud Detection. Ph.D. Thesis, Université Libre de Bruxelles, Brussels, Belgium, 2015.
6. Lucas, Y.; Portier, P.-E.; Laporte, L.; Calabretto, S.; Caelen, O.; He-Guelton, L.; Granitzer, M. Multiple perspectives HMM-based feature engineering for credit card fraud detection. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, ACM, New York, NY, USA, 8–12 April 2019; pp. 1359–1361.
7. Wiese, B.; Omlin, C. Credit Card Transactions, Fraud Detection, and Machine Learning: Modelling Time with LSTM Recurrent Neural Networks. In Studies in Computational Intelligence; Springer Science and Business Media LLC: Berlin, Germany, 2009; pp. 231–268.
8. Jurgovsky, J.; Granitzer, M.; Ziegler, K.; Calabretto, S.; Portier, P.-E.; He-Guelton, L.; Caelen, O. Sequence classification for credit-card fraud detection. Expert Syst. Appl. 2018, 100, 234–245.
9. Zhang, F.; Liu, G.; Li, Z.; Yan, C.; Jiang, C. GMM-based Undersampling and Its Application for Credit Card Fraud Detection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
10. Ahammad, J.; Hossain, N.; Alam, M.S. Credit Card Fraud Detection using Data Pre-processing on Imbalanced Data—Both Oversampling and Undersampling. In Proceedings of the International Conference on Computing Advancements, New York, NY, USA, 10–12 January 2020; ACM Press: New York, NY, USA, 2020.
11. Lee, Y.-J.; Yeh, Y.-R.; Wang, Y.-C.F. Anomaly Detection via Online Oversampling Principal Component Analysis. IEEE Trans. Knowl. Data Eng. 2012, 25, 1460–1470.
12. Awoyemi, J.O.; Adetunmbi, A.O.; Oluwadare, S.A. Credit card fraud detection using machine learning techniques: A comparative analysis. In Proceedings of the 2017 International Conference on Computing Networking and Informatics (ICCNI), Lagos, Nigeria, 29–31 October 2017; pp. 1–9.
13. Pumsirirat, A.; Yan, L. Credit Card Fraud Detection using Deep Learning based on Auto-Encoder and Restricted Boltzmann Machine. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 18–25.
14. Zamini, M.; Montazer, G. Credit Card Fraud Detection using autoencoder based clustering. In Proceedings of the 2018 9th International Symposium on Telecommunications (IST), Tehran, Iran, 17–19 December 2018; pp. 486–491.
15. Randhawa, K.; Loo, C.K.; Seera, M.; Lim, C.P.; Nandi, A.K. Credit Card Fraud Detection Using AdaBoost and Majority Voting. IEEE Access 2018, 6, 14277–14284.
16. Lucas, Y.; Jurgovsky, J. Credit card fraud detection using machine learning: A survey. arXiv 2020, arXiv:2010.06479.
17. Nikita, S.; Pratikesh, M.; Rohit, S.M.; Rahul, S.; Chaman Kumar, K.M.; Shailendra, A. Credit card fraud detection techniques—A survey. In Proceedings of the 2020 International Conference on Emerging Trends in Information Technology and Engineering, Vellore, India, 24–25 February 2020.
18. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. Calif. Univ. San Diego La Jolla Inst. Cogn. Sci. 1985, 8, 318–362.
19. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
20. Seeja, K.R.; Zareapoor, M. FraudMiner: A Novel Credit Card Fraud Detection Model Based on Frequent Itemset Mining. Sci. World J. 2014, 2014, 1–10.
21. Zhang, C.; Gao, W.; Song, J.; Jiang, J. An imbalanced data classification algorithm of improved autoencoder neural network. In Proceedings of the 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI), Chiang Mai, Thailand, 14–16 February 2016; pp. 95–99.
22. Credit Card Fraud Detection Dataset. Available online: https://www.kaggle.com/mlg-ulb/creditcardfraud (accessed on 20 August 2020).
23. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
24. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
25. Tomek, I. Two Modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772.
26. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185.
27. Sutskever, I.; Hinton, G.E.; Taylor, G.W. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems; University of Toronto: Toronto, ON, Canada, 2009.
28. Bhattacharyya, S.; Jha, S.; Tharakunnel, K.; Westland, J.C. Data mining for credit card fraud: A comparative study. Decis. Support Syst. 2011, 50, 602–613.
29. Margineantu, D.; Dietterich, T. Pruning Adaptive Boosting. In Proceedings of the 14th International Conference on Machine Learning, ICML, Guangzhou, China, 18–21 February 1997.
30. Naveen, P.; Diwan, B. Relative Analysis of ML Algorithm QDA, LR and SVM for Credit Card Fraud Detection Dataset. In Proceedings of the 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 7–9 October 2020; pp. 976–981.
31. Rish, I. An Empirical Study of the Naive Bayes Classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001; Volume 3, No. 22. Available online: https://www.cc.gatech.edu/fac/Charles.Isbell/classes/reading/papers/Rish.pdf (accessed on 20 August 2020).
32. Jeatrakul, P.; Wong, K.W. Comparing the performance of different neural networks for binary classification problems. In Proceedings of the 2009 Eighth International Symposium on Natural Language Processing, Bangkok, Thailand, 20–22 October 2009.
33. Murphy, K.P. Naive Bayes Classifiers; University of British Columbia: Vancouver, BC, Canada, 2006.
34. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 1–13.
35. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432.
36. Chawla, N.V. Data Mining for Imbalanced Datasets: An Overview. Data Min. Knowl. Discov. Handb. 2009, 875–886.
37. Lakshmi, T.J.; Prasad, C.S.R. A study on classifying imbalanced datasets. In Proceedings of the 2014 First International Conference on Networks & Soft Computing (ICNSC2014), Guntur, India, 19–20 August 2014; pp. 141–145.
38. Assaf, R.; Giurgiu, I.; Pfefferle, J.; Monney, S.; Pozidis, H.; Schumann, A. An Anomaly Detection and Explainability Framework using Convolutional Autoencoders for Data Storage Systems. IJCAI 2020, 5228–5230.
39. Antwarg, L.; Miller, R.M.; Shapira, B.; Rokach, L. Explaining anomalies detected by autoencoders using Shapley Additive Explanations. Expert Syst. Appl. 2021, 186, 115736.
40. Fernández, R.R.; de Diego, I.M.; Aceña, V.; Fernández-Isabel, A.; Moguerza, J.M. Random forest explainability using counterfactual sets. Inf. Fusion 2020, 63, 196–207.

Metrics | Average | Minimum | Q1 | Q2 | Q3 | Maximum |
---|---|---|---|---|---|---|
ACC | 0.99949 | 0.999410129 | 0.999455774 | 0.999494396 | 0.999522485 | 0.999578664 |
TPR | 0.8142 | 0.75 | 0.808333333 | 0.816666667 | 0.825 | 0.84166667 |
TNR | 0.9998 | 0.999704567 | 0.999788976 | 0.999803044 | 0.999831181 | 0.999887454 |
MCC | 0.8441 | 0.811201026 | 0.834295395 | 0.845629405 | 0.853475139 | 0.870180134 |

Models | ACC | TPR | TNR | MCC |
---|---|---|---|---|
AE-PRF (θ = 0.03) | 0.9973 | 0.8910 | 0.9975 | 0.5921 |
AE-PRF (θ = 0.25) | 0.9995 | 0.8142 | 0.9998 | 0.8441 |
ADASYN AE-PRF (θ = 0.13) | 0.9960 | 0.8613 | 0.9963 | 0.5018 |
ADASYN AE-PRF (θ = 0.57) | 0.9995 | 0.8316 | 0.9998 | 0.8665 |
SMOTE + T-Link AE-PRF (θ = 0.11) | 0.9965 | 0.8583 | 0.9967 | 0.5133 |
SMOTE + T-Link AE-PRF (θ = 0.51) | 0.9995 | 0.8333 | 0.9998 | 0.8585 |

Research | Methods | ACC | TPR | TNR | MCC | AUC |
---|---|---|---|---|---|---|
Awoyemi et al. [12] | k-NN | 0.9691 | 0.8835 | 0.9711 | 0.5903 | - |
Pumsirirat et al. [13] | AE | 0.97054 | 0.83673 | 0.97077 | 0.1942 | 0.9603 |
Zamini et al. [14] | AE-based clustering | 0.98902 | 0.81632 | 0.98932 | 0.3058 | 0.961 |
Randhawa et al. [15] | SVM with AdaBoost | 0.99927 | 0.82317 | 0.99957 | 0.796 | - |
Randhawa et al. [15] | NN+NB with MV | 0.99941 | 0.78862 | 0.99978 | 0.823 | - |
This Research | AE + PRF (θ = 0.03) | 0.99738 | 0.89109 | 0.99757 | 0.5921 | 0.962 |
This Research | AE + PRF (θ = 0.25) | 0.9995 | 0.8142 | 0.9998 | 0.8441 | 0.962 |

Methods | ACC | TPR | TNR | MCC |
---|---|---|---|---|
k-NN [12] (without resampling) | 0.9691 | 0.8835 | 0.9711 | 0.5903 |
k-NN [12] (with all data 34:66 resampling) | 0.9792 | 0.9375 | 1.0 | 0.9535 |
Re-implemented k-NN (without resampling) | 0.9977 | 0.7483 | 0.9981 | 0.5512 |
Re-implemented k-NN (with only training data resampling) | 0.9817 | 0.1881 | 0.9832 | 0.0556 |
Re-implemented k-NN (with all data 34:66 resampling) | 0.9832 | 0.9494 | 1.0 | 0.9624 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lin, T.-H.; Jiang, J.-R. Credit Card Fraud Detection with Autoencoder and Probabilistic Random Forest. Mathematics 2021, 9, 2683. https://doi.org/10.3390/math9212683