Encrypted Malicious Traffic Detection Based on Word2Vec
Abstract
1. Introduction
2. Related Works
3. Dataset Description
4. Methodology
- (1) Feature Extraction: extracts features from raw PCAP files to build a corpus.
- (2) Building Vocabulary and Token Parser: tokenizes the training dataset into words, then applies word embedding to represent each word as a vector.
- (3) Training Model: TLS2Vec trains LSTM and BiLSTM models on the dataset.
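Steps (1) and (2) can be illustrated with a minimal sketch. It assumes each TLS session has already been reduced to a small set of named features; the field names (`version`, `cipher`, `sni`), the `name=value` word format, and the helper functions are hypothetical and are not taken from the paper. The sketch covers only tokenization and vocabulary building; learning the Word2Vec embedding itself would be done by a separate library and is not shown.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # reserved tokens (an assumption, not from the paper)


def session_to_sentence(session):
    """Turn one TLS session (dict of hypothetical features) into a list of words."""
    # Each feature becomes one "word" of the form name=value.
    return [f"{name}={value}" for name, value in session.items()]


def build_vocab(sentences, min_count=1):
    """Map every word seen at least min_count times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    # Most frequent words first; ties are broken alphabetically for determinism.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len):
    """Convert words to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in sentence][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Toy sessions standing in for features extracted from a PCAP file.
sessions = [
    {"version": "TLSv1.2", "cipher": "0xc02f", "sni": "example.com"},
    {"version": "TLSv1.0", "cipher": "0x0035", "sni": "none"},
]
corpus = [session_to_sentence(s) for s in sessions]
vocab = build_vocab(corpus)
encoded = [encode(sent, vocab, max_len=4) for sent in corpus]
```

The fixed-length integer sequences in `encoded` are the kind of input an embedding layer followed by an LSTM or BiLSTM would consume in step (3).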
4.1. Feature Extraction
4.2. Building Vocabulary and Token Parser
4.3. Training Model
5. Evaluations
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Label | Total Packets | Total TLS Sessions | Avg. App Data/Session |
---|---|---|---|
Zeus | 2,201,308 | 10,581 | 1.932 |
Benign | 1,601,294 | 1,461 | 1.249 |
Cobalt | 1,471,709 | 250 | 2.000 |
Trickbot | 895,172 | 33 | 1.909 |
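The session counts in the table above are heavily skewed toward Zeus. The short sketch below computes each label's share of TLS sessions and, as one common remedy that is an assumption here and not something the paper states it used, inverse-frequency class weights.

```python
# TLS session counts copied from the dataset table.
session_counts = {"Zeus": 10581, "Benign": 1461, "Cobalt": 250, "Trickbot": 33}

total = sum(session_counts.values())

# Fraction of all sessions that belongs to each label.
shares = {label: count / total for label, count in session_counts.items()}

# Inverse-frequency weights: total / (n_classes * count).
# Rare classes get larger weights; this weighting scheme is an assumption.
n_classes = len(session_counts)
class_weights = {
    label: total / (n_classes * count) for label, count in session_counts.items()
}

for label in session_counts:
    print(f"{label:<9}{shares[label]:7.2%}  weight={class_weights[label]:.2f}")
```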
Hyper-Parameter | Binary Target | Multiclass Target |
---|---|---|
Activation Function | sigmoid | softmax |
Epoch | 30 | 30 |
Optimizer | adam | adam |
Learning Rate | 0.001 | 0.001 |
Batch Size | 32 | 32 |
Loss Function | binary_crossentropy | categorical_crossentropy |
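The hyper-parameters in the table above can be captured as a small configuration object. The sketch below is illustrative only: the `TrainConfig` class, the `build_bilstm` helper, and the layer sizes (`embed_dim`, `lstm_units`) are assumptions that do not come from the paper, and the default of four classes simply matches the four labels in the dataset table. The Keras import is deferred so the configuration can be inspected without TensorFlow installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    activation: str          # output-layer activation
    loss: str                # Keras loss identifier
    epochs: int = 30
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 32


# The two columns of the hyper-parameter table.
CONFIGS = {
    "binary": TrainConfig(activation="sigmoid", loss="binary_crossentropy"),
    "multiclass": TrainConfig(activation="softmax", loss="categorical_crossentropy"),
}


def output_units(target, n_classes):
    """One sigmoid unit for binary targets, one softmax unit per class otherwise."""
    return 1 if target == "binary" else n_classes


def build_bilstm(target, vocab_size, n_classes=4, embed_dim=64, lstm_units=64):
    """Hypothetical BiLSTM classifier; layer sizes are assumptions."""
    from tensorflow import keras  # lazy import: only needed when building a model

    cfg = CONFIGS[target]
    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embed_dim),
        keras.layers.Bidirectional(keras.layers.LSTM(lstm_units)),
        keras.layers.Dense(output_units(target, n_classes), activation=cfg.activation),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=cfg.learning_rate),
        loss=cfg.loss,
        metrics=["accuracy"],
    )
    return model
```

Training would then call `model.fit(x, y, epochs=cfg.epochs, batch_size=cfg.batch_size)`. Note that `categorical_crossentropy` expects one-hot encoded labels, so multiclass targets need to be encoded that way first.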
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ferriyan, A.; Thamrin, A.H.; Takeda, K.; Murai, J. Encrypted Malicious Traffic Detection Based on Word2Vec. Electronics 2022, 11, 679. https://doi.org/10.3390/electronics11050679