A Malicious Webpage Detection Method Based on Graph Convolutional Network
Abstract
1. Introduction
- We design a malicious webpage detection method, GMWD, based on a GCN. The model can fully explore and exploit the syntactic and semantic correlations within and among webpages to accurately detect malicious webpages.
- We propose phrase nodes to replace the word nodes in the text graph. Phrase nodes preserve the syntactic and semantic integrity of the source code, which helps the GCN capture the features of malicious code. We slice the source code into phrases based on point-wise mutual information and left–right entropy.
- We use the URL links, which are contained in the source code and point to other websites, as auxiliary detection information and use blacklist detection technology to quickly pre-detect them, further improving the overall accuracy of the detection scheme.
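The two segmentation criteria named in the second contribution can be illustrated on a toy token stream. The following is a minimal sketch, not the authors' implementation: the function names `pmi` and `side_entropy`, the toy `tokens` list, and the assumption that the source code has already been split into tokens are all hypothetical. A high PMI suggests that two adjacent tokens belong together, and a high left or right entropy suggests that the candidate phrase appears in varied contexts and can stand as a unit.

```python
import math
from collections import Counter

def pmi(tokens, a, b):
    """Point-wise mutual information (base 2) of the adjacent pair (a, b)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_ab = bigrams[(a, b)] / (n - 1)
    if p_ab == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_a = unigrams[a] / n
    p_b = unigrams[b] / n
    return math.log2(p_ab / (p_a * p_b))

def side_entropy(tokens, phrase, side):
    """Entropy of the tokens adjacent to `phrase` on the 'left' or 'right'."""
    phrase = tuple(phrase)
    k = len(phrase)
    neighbors = Counter()
    for i in range(len(tokens) - k + 1):
        if tuple(tokens[i:i + k]) == phrase:
            j = i - 1 if side == "left" else i + k
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())

# Hypothetical toy token stream, for illustration only.
tokens = ["document", "write", "unescape", "document", "write",
          "eval", "var", "document", "write", "unescape"]

cohesion = pmi(tokens, "document", "write")
left_h = side_entropy(tokens, ("document", "write"), "left")
right_h = side_entropy(tokens, ("document", "write"), "right")
```

In this toy stream, `document` and `write` always occur together, so their PMI is positive, while the tokens on either side of the pair vary, so both entropies are non-zero. Thresholds for accepting a candidate phrase are not given here and would have to be chosen separately.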
2. Related Work
3. Methodology
3.1. Quick Detection by Blacklist
Algorithm 1. Detection By Blacklist |
Input: A set of webpage source codes |
Output: A set of malicious webpages , A set of unknown webpages |
1: for do |
2: |
3: if then |
4: Put into |
5: else |
6: Put into |
7: end if |
8: end for |
9: return , |
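The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the regular expression used to pull links out of the source code, the choice to match on host names, and the names `extract_hosts` and `detect_by_blacklist` are hypothetical and not taken from the paper. The blacklist is assumed to be an in-memory set of host names.

```python
import re
from urllib.parse import urlparse

# Assumption: links are absolute http(s) URLs embedded in the page source.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

def extract_hosts(source):
    """Return the set of host names of all URLs found in a page's source code."""
    hosts = set()
    for url in URL_PATTERN.findall(source):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            hosts.add(host.lower())
    return hosts

def detect_by_blacklist(pages, blacklist):
    """Split pages into (malicious, unknown) by blacklist lookup of their links."""
    malicious, unknown = [], []
    for source in pages:
        if extract_hosts(source) & blacklist:
            malicious.append(source)  # at least one linked host is blacklisted
        else:
            unknown.append(source)    # left for the GCN stage to classify
    return malicious, unknown
```

Using a set for the blacklist keeps each lookup at roughly constant time, which suits a quick pre-detection step over a large list.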
3.2. Detection by GCN
3.2.1. Phrase Segmentation
3.2.2. Text Graph Building
3.2.3. GCN-Based Classification
Algorithm 2. Detection By GCN |
Input: A set of webpage source codes // i.e., in the output of Algorithm 1 |
Parameter: Weight matrices , initial feature matrix , the number of training epochs |
Output: A set of labels of each webpage |
1: for do |
2: |
3: for do |
4: |
5: for do |
6: |
7: end for |
8: end for |
9: end for |
10: |
11: |
12: |
13: for in do |
14: put into GCN: |
15: |
16: |
17: Updating and using and |
18: end for |
19: return |
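For orientation, the GCN step of Algorithm 2 can be read against the standard two-layer formulation of Kipf and Welling. The equations below restate that standard formulation and are an assumption about the form of the model, not a transcription of the paper's own equations. The symbols $A$ (adjacency matrix of the text graph), $X$ (initial feature matrix), $W^{(0)}$ and $W^{(1)}$ (weight matrices), $\mathcal{Y}_L$ (set of labeled webpage nodes), $Y$ (label indicator matrix), and $F$ (number of classes) are introduced here for illustration.

$$
\tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}, \qquad \hat{A} = \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}
$$

$$
Z = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)
$$

$$
\mathcal{L} = -\sum_{d \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{df} \ln Z_{df}
$$

Under this reading, each training epoch computes $Z$, evaluates the cross-entropy loss $\mathcal{L}$ over the labeled nodes only, and updates $W^{(0)}$ and $W^{(1)}$ by gradient descent.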
4. Experiments
4.1. Dataset
4.2. Evaluation Indexes
4.3. Detection by Blacklist
4.4. Detection by GCN
4.5. Comparison of Related Work
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
GCN | Graph convolutional network |
GMWD | The GCN-based malicious webpage detection method proposed in this paper |
URL | Uniform Resource Locator |
CNN | Convolutional neural network |
LSTM | Long short-term memory |
PMI | Point-wise mutual information |
TF-IDF | Term frequency–inverse document frequency |
TP | True positive |
FN | False negative |
TN | True negative |
FP | False positive |
ACC | Accuracy rate |
TPR | True positive rate |
TNR | True negative rate |
FPR | False positive rate |
FNR | False negative rate |
 | Benign Samples (Labeled) | Benign Samples (Test) | Malicious Samples (Labeled) | Malicious Samples (Test) | Blacklist |
---|---|---|---|---|---|
Source | Competition 1 | Chinaz 2 | Malware Domain List 3 | VirusTotal 4 | |
Number | 9458 | 17,438 | 9458 | 16,094 | 300,000 |
Malicious Samples (TP) | Malicious Samples (FN) | Benign Samples (TN) | Benign Samples (FP) | ACC | Precision | TPR | TNR | F1 |
---|---|---|---|---|---|---|---|---|
15,595 | 23 | 17,416 | 22 | 0.9986 | 0.9986 | 0.9985 | 0.9987 | 0.9986 |
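The rates in the table above follow from the four confusion counts using the usual definitions of accuracy, precision, true positive rate, true negative rate, and F1. The short sketch below recomputes them as a check. The function name `confusion_metrics` is hypothetical, and the formulas are the standard ones, assumed to match the paper's evaluation indexes.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "Precision": precision,
        "TPR": tpr,
        "TNR": tnr,
        "F1": 2 * precision * tpr / (precision + tpr),
    }

# Counts taken from the table above.
metrics = confusion_metrics(tp=15595, fn=23, tn=17416, fp=22)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimal places, the results agree with the tabulated values: 0.9986, 0.9986, 0.9985, 0.9987, and 0.9986.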
Top 9 Phrases with Malicious Labels | ||
---|---|---|
1. wp-content | 2. quot | 3. menu-item |
4. elementor-element | 5. path | 6. important |
7. data-element_type | 8. not | 9. border-color |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, Y.; Xue, S.; Song, J. A Malicious Webpage Detection Method Based on Graph Convolutional Network. Mathematics 2022, 10, 3496. https://doi.org/10.3390/math10193496