Estimating Financial Fraud through Transaction-Level Features and Machine Learning
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Collection and Further Samples Generation using CTGAN
2.2. Data Analysis and Splitting Approaches
2.3. Machine Learning Classifiers
2.4. Performance Evaluation
3. Results
3.1. Analysis of Dataset
3.2. Evaluation of Original Dataset
3.3. Data Generation through CTGAN
3.4. Evaluation of Updated Dataset
3.5. Repeated 10-Fold Cross-Validation
3.6. Final Evaluation of Original Dataset
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Model | Accuracy | AUC-ROC | F1-Score | Time Taken (s) |
---|---|---|---|---|
XGBClassifier | 0.996 | 0.990 | 0.993 | 0.102 |
NuSVC | 0.990 | 0.990 | 0.990 | 0.110 |
KNeighborsClassifier | 0.960 | 0.960 | 0.960 | 0.025 |
ExtraTreesClassifier | 0.930 | 0.930 | 0.930 | 0.271 |
LGBMClassifier | 0.900 | 0.900 | 0.900 | 1.611 |
QuadraticDiscriminantAnalysis | 0.900 | 0.900 | 0.900 | 0.034 |
SVC | 0.890 | 0.890 | 0.890 | 0.639 |
RandomForestClassifier | 0.890 | 0.890 | 0.890 | 0.706 |
LinearDiscriminantAnalysis | 0.880 | 0.880 | 0.880 | 0.070 |
RidgeClassifierCV | 0.880 | 0.880 | 0.880 | 0.065 |
RidgeClassifier | 0.880 | 0.880 | 0.880 | 0.034 |
LinearSVC | 0.880 | 0.880 | 0.880 | 0.298 |
CalibratedClassifierCV | 0.880 | 0.880 | 0.880 | 1.012 |
LogisticRegression | 0.870 | 0.870 | 0.870 | 0.035 |
AdaBoostClassifier | 0.870 | 0.870 | 0.870 | 0.703 |
GaussianNB | 0.840 | 0.840 | 0.840 | 0.019 |
SGDClassifier | 0.820 | 0.820 | 0.820 | 0.040 |
BaggingClassifier | 0.810 | 0.810 | 0.807 | 0.537 |
BernoulliNB | 0.810 | 0.810 | 0.810 | 0.019 |
PassiveAggressiveClassifier | 0.800 | 0.800 | 0.799 | 0.023 |
NearestCentroid | 0.780 | 0.780 | 0.780 | 0.024 |
Perceptron | 0.780 | 0.780 | 0.780 | 0.023 |
DecisionTreeClassifier | 0.780 | 0.780 | 0.780 | 0.118 |
ExtraTreeClassifier | 0.720 | 0.720 | 0.720 | 0.015 |
LabelSpreading | 0.500 | 0.500 | 0.333 | 0.076 |
LabelPropagation | 0.500 | 0.500 | 0.333 | 0.063 |
DummyClassifier | 0.500 | 0.500 | 0.333 | 0.016 |
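The bottom three rows of the table above share an accuracy of 0.500 and an F1-score of 0.333. These values are consistent with a predictor that always outputs the same class on a balanced binary test set, when F1 is averaged across classes with class-frequency weights. The source does not state the averaging mode or the class balance at this point, so both are assumptions here. The sketch below reproduces the arithmetic in plain Python; the 50/50 label vector is hypothetical and is not the paper's data.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 values, weighted by class frequency."""
    n = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in sorted(set(y_true))
    )


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical balanced test set: 50 legitimate (0) and 50 fraudulent (1).
y_true = [0] * 50 + [1] * 50
# A constant predictor that labels every transaction as class 0.
y_const = [0] * 100

acc = accuracy(y_true, y_const)
f1 = weighted_f1(y_true, y_const)
print(round(acc, 3), round(f1, 3))  # prints: 0.5 0.333
```

The predicted class reaches F1 = 2·50 / (2·50 + 50 + 0) ≈ 0.667 and the never-predicted class reaches 0, so the average over two equally frequent classes is about 0.333. Under these assumptions, a row reading 0.500 / 0.333 indicates a model that has learned nothing beyond a single label.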
Model | Accuracy | AUC-ROC | F1-Score | Time Taken (s) |
---|---|---|---|---|
XGBClassifier | 0.999 | 1.000 | 0.999 | 0.696 |
SVC | 0.994 | 0.994 | 0.980 | 0.058 |
KNeighborsClassifier | 0.993 | 0.993 | 0.980 | 0.021 |
RandomForestClassifier | 0.990 | 0.984 | 0.980 | 0.711 |
NuSVC | 0.980 | 0.974 | 0.980 | 0.106 |
LGBMClassifier | 0.960 | 0.968 | 0.960 | 0.664 |
ExtraTreesClassifier | 0.960 | 0.968 | 0.960 | 0.247 |
LinearDiscriminantAnalysis | 0.960 | 0.947 | 0.960 | 0.051 |
RidgeClassifierCV | 0.960 | 0.947 | 0.960 | 0.075 |
RidgeClassifier | 0.960 | 0.947 | 0.960 | 0.032 |
BaggingClassifier | 0.940 | 0.941 | 0.940 | 0.561 |
LogisticRegression | 0.940 | 0.931 | 0.940 | 0.029 |
LinearSVC | 0.940 | 0.931 | 0.940 | 0.287 |
CalibratedClassifierCV | 0.940 | 0.931 | 0.940 | 1.072 |
NearestCentroid | 0.920 | 0.915 | 0.920 | 0.023 |
GaussianNB | 0.920 | 0.915 | 0.920 | 0.020 |
BernoulliNB | 0.920 | 0.915 | 0.920 | 0.019 |
AdaBoostClassifier | 0.920 | 0.915 | 0.920 | 0.722 |
SGDClassifier | 0.900 | 0.899 | 0.900 | 0.095 |
PassiveAggressiveClassifier | 0.880 | 0.883 | 0.881 | 0.020 |
Perceptron | 0.880 | 0.883 | 0.881 | 0.015 |
QuadraticDiscriminantAnalysis | 0.880 | 0.883 | 0.881 | 0.032 |
ExtraTreeClassifier | 0.780 | 0.772 | 0.781 | 0.017 |
DecisionTreeClassifier | 0.780 | 0.772 | 0.781 | 0.122 |
LabelSpreading | 0.500 | 0.597 | 0.430 | 0.106 |
LabelPropagation | 0.500 | 0.597 | 0.430 | 0.084 |
DummyClassifier | 0.380 | 0.500 | 0.209 | 0.019 |
Repeat | Mean Accuracy | Mean AUC-ROC | Mean F1-Score |
---|---|---|---|
Repeat-1 | 0.999 | 0.998 | 0.998 |
Repeat-2 | 0.998 | 0.999 | 0.998 |
Repeat-3 | 0.999 | 0.999 | 0.998 |
Repeat-4 | 0.998 | 0.999 | 0.998 |
Repeat-5 | 0.998 | 0.998 | 0.999 |
Repeat-6 | 0.999 | 0.998 | 0.998 |
Repeat-7 | 0.999 | 0.998 | 0.998 |
Repeat-8 | 0.999 | 0.998 | 0.998 |
Repeat-9 | 0.999 | 0.998 | 0.999 |
Repeat-10 | 0.999 | 0.999 | 0.999 |
Repeat-11 | 0.998 | 0.999 | 0.998 |
Repeat-12 | 0.998 | 0.998 | 0.998 |
Repeat-13 | 0.997 | 0.998 | 0.999 |
Repeat-14 | 0.998 | 0.998 | 0.998 |
Repeat-15 | 0.998 | 0.998 | 0.998 |
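Each row in the table above is the mean of the fold-level scores within one repeat of the cross-validation. As a rough illustration of that protocol, the sketch below runs a repeated stratified 10-fold procedure in plain Python and reports one mean accuracy per repeat. The one-dimensional synthetic data and the midpoint-threshold classifier are hypothetical stand-ins chosen to keep the example self-contained; they are not the paper's dataset or its XGBoost model, and the numbers printed will not match the table.

```python
import random
from statistics import mean


def stratified_folds(labels, k, rng):
    """Split indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def repeated_cv(X, y, fit, predict, k=10, repeats=15, seed=0):
    """Return one mean accuracy per repeat of stratified k-fold CV."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(repeats):
        scores = []
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
        per_repeat.append(mean(scores))
    return per_repeat


# Hypothetical toy classifier: threshold at the midpoint of the class means.
def fit_threshold(X, y):
    m0 = mean(x for x, t in zip(X, y) if t == 0)
    m1 = mean(x for x, t in zip(X, y) if t == 1)
    return (m0 + m1) / 2


def predict_threshold(threshold, x):
    return int(x > threshold)


# Hypothetical synthetic data: two well-separated Gaussian classes.
data_rng = random.Random(42)
X = [data_rng.gauss(0, 1) for _ in range(100)]
X += [data_rng.gauss(4, 1) for _ in range(100)]
y = [0] * 100 + [1] * 100

results = repeated_cv(X, y, fit_threshold, predict_threshold)
for r, acc in enumerate(results, start=1):
    print(f"Repeat-{r}: mean accuracy = {acc:.3f}")
```

Reshuffling before each repeat changes which samples land in which fold, so the spread across repeats gives a sense of how sensitive the estimate is to the particular split. The same loop structure would extend to AUC-ROC and F1 by swapping the per-fold scoring line.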
Model | Accuracy | AUC-ROC | F1-Score |
---|---|---|---|
XGBClassifier | 0.999 | 1.000 | 0.999 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alwadain, A.; Ali, R.F.; Muneer, A. Estimating Financial Fraud through Transaction-Level Features and Machine Learning. Mathematics 2023, 11, 1184. https://doi.org/10.3390/math11051184