Detecting Web-Based Attacks with SHAP and Tree Ensemble Machine Learning Methods
Abstract
1. Introduction
- This study proposes using SHAP and tree ensemble methods for detecting web-based attacks.
- We detail the process of obtaining features using AST-JS node sets and patterns, sample JS attack codes, and association rule mining.
- We compared the performance of different classifiers in malicious JS code detection using SHAP-selected features; the tree ensemble methods achieved high detection performance (F1 scores of roughly 0.98 to 0.99).
- We compared the performance of SHAP-selected features with that of features selected by other feature selection methods: Boruta, ELI5, RandomForest, and SelectKBest.
- SHAP-based feature selection outperformed the other feature selection methods on all three evaluation metrics (recall, precision, and F1).
2. Related Work
3. Proposed Method
3.1. Preprocessing
3.1.1. AST-JS Node Sets and Patterns
3.1.2. Sample JS Attack Codes
3.1.3. Association Rule Mining
Algorithm 1. Mining frequent AST-JS node sets using the FP-growth algorithm.
Input: D, a database of benign and malicious JS codes defined as AST-JS nodes; the minimum support count threshold.
Output: Benign and malicious DataFrames of AST-JS nodes and node combinations.
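The mining step in Algorithm 1 can be illustrated with a minimal sketch. It is not the FP-growth algorithm itself: it counts node combinations by brute-force enumeration, which yields the same frequent sets on small inputs but does not scale the way an FP-tree does. The function name `frequent_node_sets`, the `max_size` cap, the toy samples, and the plain-dictionary output (instead of DataFrames) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_node_sets(transactions, min_support_count, max_size=3):
    """Return {frozenset(nodes): count} for node sets that occur in at least
    `min_support_count` transactions (brute force, not FP-growth)."""
    counts = Counter()
    for nodes in transactions:
        unique = sorted(set(nodes))  # ignore repeated nodes within one sample
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support_count}


# Hypothetical toy data: each JS sample reduced to the AST node types it contains.
malicious = [
    ["CallExpression", "Literal", "BinaryExpression"],
    ["CallExpression", "Literal", "NewExpression"],
    ["CallExpression", "BinaryExpression", "Literal"],
]
benign = [
    ["FunctionDeclaration", "ReturnStatement", "Literal"],
    ["FunctionDeclaration", "ReturnStatement"],
    ["VariableDeclaration", "Literal"],
]

# Mine each class separately, mirroring the benign/malicious split of the output.
malicious_sets = frequent_node_sets(malicious, min_support_count=3)
benign_sets = frequent_node_sets(benign, min_support_count=2)

for node_set, count in sorted(malicious_sets.items(), key=lambda kv: -kv[1]):
    print(sorted(node_set), count)
```

Raising the support count threshold shrinks the result to the node sets shared by more samples, which is the lever the minimum support count provides in the algorithm's input.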
3.2. Feature Selection
3.3. Shapley Additive Explanations’ Feature Importance
Algorithm 2. Calculating tree SHAP values for the M AST-JS features.
Input: Z, a malicious or benign JS code to explain; the AST-JS instances that tree SHAP uses as background examples; g, the tree ensemble model.
Output: features, the SHAP values for each AST-JS feature in the JS code dataset, sorted in descending order of importance.
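To make the quantity computed in Algorithm 2 concrete, the following sketch evaluates exact Shapley values by enumerating feature subsets against a set of background examples. This is the definition that tree SHAP computes efficiently, not the tree SHAP algorithm itself, and it is only feasible for a handful of features. The hand-written `model`, the feature names, and the toy inputs are hypothetical stand-ins for a fitted tree ensemble and real AST-JS data.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical stand-in for a fitted tree: a hand-written depth-2 tree."""
    if x[0] > 0.5:
        return 0.9 if x[1] > 0.5 else 0.6
    return 0.1


def value(model, x, background, subset):
    """Mean model output with features in `subset` fixed to x and the
    remaining features taken from each background example."""
    total = 0.0
    for b in background:
        mixed = [x[j] if j in subset else b[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by subset enumeration (exponential in len(x))."""
    m = len(x)
    phi = [0.0] * m
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (value(model, x, background, s | {i})
                        - value(model, x, background, s))
                phi[i] += weight * gain
    return phi


# Hypothetical binary presence features for three AST-JS node types.
names = ["CallExpression", "Literal", "ReturnStatement"]
x = [1, 1, 0]
background = [[0, 0, 0], [1, 0, 1]]

phi = shapley_values(model, x, background)
# Rank features by absolute contribution, largest first.
ranked = sorted(zip(names, phi), key=lambda pair: abs(pair[1]), reverse=True)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the prediction for `x` and the mean prediction over the background examples, and the feature the model never reads receives zero. A dataset-level ranking would typically average the absolute values over many samples rather than use a single one.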
Algorithm 3. Predicting the ith AST-JS feature.
Input: Z, a malicious or benign JS code to explain; i, the AST-JS feature to get a prediction for; g, the tree ensemble model.
Output: the value of the ith AST-JS feature.
Algorithm 4. Selecting contributing AST-JS features’ SHAP values.
Input: features, the SHAP values for each AST-JS feature in the JS code dataset, sorted in descending order of importance; Z, a malicious or benign JS code to explain; the prediction for Z.
Output: features and features.
3.4. Machine Learning Classifiers
4. Experiments
4.1. Experimental Setup
4.2. Performance Comparisons
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

Parameter | AST-JS Nodes | Association Rule | SHAP Value |
---|---|---|---|
XGBClassifier | | | |
 | 0.5 | 0.7 | 0.4 |
 | 0.0 | 0.1 | 0.0 |
 | 0.2 | 0.2 | 0.2 |
 | 10 | 8 | 13 |
 | 1 | 1 | 1 |
LGBMClassifier | | | |
 | 0 | 0 | 0 |
 | 0 | 0 | 0 |
 | 9 | 6 | 8 |
 | 520 | 520 | 520 |
 | 10 | 10 | 10 |

Model | Recall | Precision | F1 |
---|---|---|---|
XGBoost | 0.9981 ± 0.0008 | 0.9698 ± 0.0033 | 0.9838 ± 0.0018 |
LightGBM | 0.9979 ± 0.0008 | 0.9691 ± 0.0032 | 0.9833 ± 0.0019 |
RandomForest | 0.9986 ± 0.0005 | 0.9702 ± 0.0032 | 0.9842 ± 0.0018 |
DecisionTree | 0.9983 ± 0.0005 | 0.9687 ± 0.0029 | 0.9833 ± 0.0016 |
LogisticRegression | 0.9839 ± 0.0021 | 0.9414 ± 0.0039 | 0.9622 ± 0.0025 |
KNeighbors | 0.8448 ± 0.0062 | 0.9986 ± 0.0006 | 0.9153 ± 0.0037 |
GaussianNB | 0.9968 ± 0.0013 | 0.5493 ± 0.0029 | 0.7083 ± 0.0025 |

Model | Recall | Precision | F1 |
---|---|---|---|
XGBoost | 0.9763 ± 0.0025 | 0.9825 ± 0.0024 | 0.9794 ± 0.0018 |
LightGBM | 0.9762 ± 0.0023 | 0.9820 ± 0.0025 | 0.9791 ± 0.0018 |
RandomForest | 0.9757 ± 0.0024 | 0.9833 ± 0.0025 | 0.9795 ± 0.0018 |
DecisionTree | 0.9758 ± 0.0024 | 0.9812 ± 0.0025 | 0.9785 ± 0.0019 |
LogisticRegression | 0.9600 ± 0.0033 | 0.9602 ± 0.0040 | 0.9601 ± 0.0025 |
KNeighbors | 0.8516 ± 0.0044 | 0.9991 ± 0.0007 | 0.9195 ± 0.0026 |
GaussianNB | 0.8221 ± 0.0048 | 0.7845 ± 0.0044 | 0.8029 ± 0.0038 |

Model | Recall | Precision | F1 |
---|---|---|---|
XGBoost | 0.9989 ± 0.0004 | 0.9832 ± 0.0024 | 0.9909 ± 0.0014 |
LightGBM | 0.9986 ± 0.0007 | 0.9820 ± 0.0027 | 0.9902 ± 0.0015 |
RandomForest | 0.9985 ± 0.0006 | 0.9840 ± 0.0024 | 0.9912 ± 0.0014 |
DecisionTree | 0.9986 ± 0.0005 | 0.9815 ± 0.0024 | 0.9900 ± 0.0013 |
LogisticRegression | 0.9895 ± 0.0013 | 0.9632 ± 0.0037 | 0.9762 ± 0.0017 |
KNeighbors | 0.8779 ± 0.0056 | 0.9995 ± 0.0003 | 0.9347 ± 0.0031 |
GaussianNB | 0.9373 ± 0.0056 | 0.6889 ± 0.0031 | 0.7941 ± 0.0030 |

Feature Selection Method | Recall | Precision | F1 |
---|---|---|---|
SHAP | 0.9989 ± 0.0004 | 0.9832 ± 0.0024 | 0.9909 ± 0.0014 |
Boruta | 0.9983 ± 0.0008 | 0.9698 ± 0.0033 | 0.9839 ± 0.0019 |
ELI5 | 0.9982 ± 0.0008 | 0.9699 ± 0.0033 | 0.9839 ± 0.0019 |
RandomForest | 0.9984 ± 0.0007 | 0.9698 ± 0.0032 | 0.9839 ± 0.0018 |
SelectKBest | 0.9982 ± 0.0007 | 0.9687 ± 0.0034 | 0.9832 ± 0.0019 |
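
As a quick consistency check on the table above, F1 can be recomputed from the reported mean recall and precision using F1 = 2PR / (P + R). This is a sketch for the reader, not part of the original evaluation; because the reported values are means over repeated runs, the recomputed F1 is expected to match only approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) copied from the table above.
reported = {
    "SHAP": (0.9989, 0.9832, 0.9909),
    "Boruta": (0.9983, 0.9698, 0.9839),
    "ELI5": (0.9982, 0.9699, 0.9839),
    "RandomForest": (0.9984, 0.9698, 0.9839),
    "SelectKBest": (0.9982, 0.9687, 0.9832),
}

# Absolute gap between the recomputed and the reported F1 for each method.
gaps = {}
for name, (recall, precision, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    gaps[name] = abs(recomputed - f1)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported F1 = {f1:.4f}")
```

All five recomputed values fall within 0.0005 of the reported F1, well inside the stated ± ranges.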

Model | Training Time (s) | Detection Time (s) |
---|---|---|
XGBoost | 0.8685 | 1.498 × 10 |
LightGBM | 0.5716 | 2.114 × 10 |
RandomForest | 1.6350 | 1.035 × 10 |
DecisionTree | 0.0895 | 1.179 × 10 |
LogisticRegression | 0.6384 | 1.871 × 10 |
KNeighbors | 0.0037 | 0.0015 |
GaussianNB | 0.0425 | 1.334 × 10 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ndichu, S.; Kim, S.; Ozawa, S.; Ban, T.; Takahashi, T.; Inoue, D. Detecting Web-Based Attacks with SHAP and Tree Ensemble Machine Learning Methods. Appl. Sci. 2022, 12, 60. https://doi.org/10.3390/app12010060