A Novel NLP-Driven Dashboard for Interactive CyberAttacks Tweet Classification and Visualization
Abstract
1. Introduction
- Development of CybAttT (https://github.com/HudaLughbi/CybAttT, accessed on 16 February 2024): CybAttT is a novel dataset of recent cyberattack tweets, each labeled as high-risk news, normal news, or non-news. The dataset alleviates the scarcity of resources for classifying cyberattack tweets and serves as a foundation for future research.
- Visualizing up-to-date news about new cyberattack types, located by country, for any country in the world.
- Allowing users to obtain information about cyberattacks by clicking any country on the dashboard map worksheet. Users can then see the number of tweets posted from that country, the number of tweets in each class, statistics showing the names and counts of attacks, and a table that lists the full tweets, posting times, attack names, and tweet classes.
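The per-country drill-down described in the last contribution can be sketched as a simple filter-and-count step. This is a minimal illustration in plain Python, not the Tableau implementation used for the dashboard; the `Tweet` record, its field names, and the `country_summary` helper are hypothetical and chosen only to mirror the items listed above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record layout; field names are assumptions for illustration.
    text: str
    country: str
    posted_at: str
    attack_name: str
    label: str  # High_Risk_News, Normal_News, or Not_News


def country_summary(tweets, country):
    """Collect what a user would see after clicking one country on the map."""
    selected = [t for t in tweets if t.country == country]
    return {
        # Number of tweets posted from the selected country.
        "tweet_count": len(selected),
        # Number of tweets in each class.
        "class_counts": dict(Counter(t.label for t in selected)),
        # Names and counts of the attacks mentioned.
        "attack_counts": dict(Counter(t.attack_name for t in selected)),
        # Table rows: full tweet, posting time, attack name, tweet class.
        "rows": [(t.text, t.posted_at, t.attack_name, t.label) for t in selected],
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from CybAttT.
    sample = [
        Tweet("Ransomware hits hospital", "US", "2023-01-05", "ransomware", "High_Risk_News"),
        Tweet("Phishing campaign reported", "US", "2023-01-06", "phishing", "Normal_News"),
        Tweet("DDoS on bank website", "FR", "2023-01-07", "ddos", "Normal_News"),
    ]
    print(country_summary(sample, "US"))
```

Keeping the summary as one dictionary per country mirrors the way the map selection drives the other worksheets (tiles, bar chart, and table) from a single filter.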
2. Related Works
3. Data Collection and Machine Learning Classification Models
3.1. Data Collection
3.2. Machine Learning and Transformers-Based Models
3.2.1. Experiment 1: Using Machine Learning Models
3.2.2. Experiment 2: Using Transformer-Based Models
4. Dashboard for Data Visualization
4.1. Geographical Map Worksheet
4.2. Table Worksheet
4.3. Tiles Worksheet
4.4. Bar Chart Worksheet
4.5. Final Cyberattack Dashboard
5. Comparison with Existing Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vadapalli, S.R.; Hsieh, G.; Nauer, K.S. Twitterosint: Automated cybersecurity threat intelligence collection and analysis using twitter data. In Proceedings of the International Conference on Security and Management (SAM); The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing; WorldComp: Las Vegas, NV, USA, 2018; pp. 220–226.
- Nahar, V.; Li, X.; Zhang, H.L.; Pang, C. Detecting cyberbullying in social networks using multi-agent system. Web Intell. Agent Syst. Int. J. 2014, 12, 375–388.
- Taninpong, P.; Ngamsuriyaroj, S. Tree-based text stream clustering with application to spam mail classification. Int. J. Data Min. Model. Manag. 2018, 10, 353–370.
- Hu, X.; Wang, H.; Li, P. Online biterm topic model based short text stream classification using short text expansion and concept drifting detection. Pattern Recognit. Lett. 2018, 116, 187–194.
- Alruily, M. Issues of dialectal saudi twitter corpus. Int. Arab J. Inf. Technol. 2020, 17, 367–374.
- Jeffin Gracewell, J.; Pavalarajan, S. Fall detection based on posture classification for smart home environment. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 3581–3588.
- Zorich, L.; Pichara, K.; Protopapas, P. Streaming classification of variable stars. Mon. Not. R. Astron. Soc. 2020, 492, 2897–2909.
- Clever, L.; Pohl, J.S.; Bossek, J.; Kerschke, P.; Trautmann, H. Process-oriented stream classification pipeline: A literature review. Appl. Sci. 2022, 12, 9094.
- Sarikaya, A.; Correll, M.; Bartram, L.; Tory, M.; Fisher, D. What do we talk about when we talk about dashboards? IEEE Trans. Vis. Comput. Graph. 2018, 25, 682–692.
- Few, S. Information Dashboard Design: The Effective Visual Communication of Data; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2006.
- Cîmpan, A. Applying Design System in Cybersecurity Dashboard Development. Ph.D. Thesis, ETSI Informatica, Málaga, Spain, 2019.
- Samtani, S.; Li, W.; Benjamin, V.; Chen, H. Informing cyber threat intelligence through dark Web situational awareness: The AZSecure hacker assets portal. Digit. Threat. Res. Pract. 2021, 2, 1–10.
- Carvalho, V.S.; Polidoro, M.J.; Magalhaes, J.P. Owlsight: Platform for real-time detection and visualization of cyber threats. In Proceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), New York, NY, USA, 9–10 April 2016; pp. 61–66.
- Georgescu, T.M. Natural language processing model for automatic analysis of cybersecurity-related documents. Symmetry 2020, 12, 354.
- Hu, Z.; Baynard, C.W.; Hu, H.; Fazio, M. GIS mapping and spatial analysis of cybersecurity attacks on a florida university. In Proceedings of the 2015 23rd International Conference on Geoinformatics, Wuhan, China, 19–21 June 2015; pp. 1–5.
- McKenna, S.; Staheli, D.; Fulcher, C.; Meyer, M. Bubblenet: A cyber security dashboard for visualizing patterns. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2016; Volume 35, pp. 281–290.
- Franco, M.; Von der Assen, J.; Boillat, L.; Killer, C.; Rodrigues, B.; Scheid, E.J.; Granville, L.; Stiller, B. SecGrid: A Visual System for the Analysis and ML-based Classification of Cyberattack Traffic. In Proceedings of the 2021 IEEE 46th Conference on Local Computer Networks (LCN), Edmonton, AB, Canada, 4–7 October 2021; pp. 140–147.
- Franco, M.; von der Assen, J.; Boillat, L.; Killer, C.; Rodrigues, B.; Scheid, E.; Granville, L.; Stiller, B. Poster: DDoSGrid: A Platform for the Post-mortem Analysis and Visualization of DDoS Attacks. In Proceedings of the 2021 IFIP Networking Conference (IFIP Networking), Espoo and Helsinki, Finland, 21–24 June 2021; pp. 1–3.
- Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378.
- Hamoui, B.; Mars, M.; Almotairi, K. FloDusTA: Saudi Tweets Dataset for Flood, Dust Storm, and Traffic Accident Events. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1391–1396. Available online: https://aclanthology.org/2020.lrec-1.174 (accessed on 14 February 2024).
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
- Lughbi, H.; Mars, M.; Almotairi, K. CybAttT: A Dataset of Cyberattack News Tweets for Enhanced Threat Intelligence. Data 2024, 9, 39.
- Mars, M. From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough. Appl. Sci. 2022, 12, 8805.
- Lughbi, H.; Mars, M.; Almotairi, K. Leverage AI and NLP for Enhanced Threat Intelligence: An Interactive AI-Powered Dashboard for Cyberattack Tweet Visualization; LAP LAMBERT Academic Publishing: Saarbrücken, Germany, 2024; Volume 96.
Labels | High_Risk_News | Normal_News | Not_News |
---|---|---|---|
No. tweets | 892 | 3948 | 31,231 |
Dataset size (tweets) | 36,071 | | |
Fleiss’ Kappa | 0.99 | | |
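The Fleiss' Kappa value in the table above measures agreement among several annotators beyond what chance would produce (Fleiss, 1971); on the Landis and Koch scale, values from 0.81 to 1.00 count as almost perfect agreement. The sketch below shows how the statistic is computed from a matrix of rating counts. The number of annotators (three here) and the sample ratings are assumptions for illustration only and are not taken from the CybAttT annotation process.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of rows, one row per tweet.

    ratings[i][j] is the number of annotators who assigned tweet i to
    category j. Every tweet must be rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance (undefined result if this equals 1).
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Columns: High_Risk_News, Normal_News, Not_News (hypothetical counts,
    # three hypothetical annotators per tweet).
    toy = [
        [3, 0, 0],
        [0, 3, 0],
        [0, 0, 3],
        [2, 1, 0],  # one annotator disagrees on this tweet
    ]
    print(round(fleiss_kappa(toy), 3))
```

A single disagreement in four tweets already pulls the toy value well below 1, which gives a sense of how uniform the labels must be to reach the reported 0.99.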
Feature Representation | Algorithm | n-Gram | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|
Count Vectorizer | DT | default | 0.951 | 0.842 | 0.813 | 0.827 |
Count Vectorizer | KNN | default | 0.95 | 0.892 | 0.772 | 0.823 |
Count Vectorizer | LR | default | 0.97 | 0.901 | 0.846 | 0.871 |
Count Vectorizer | MNB | default | 0.953 | 0.865 | 0.81 | 0.825 |
Count Vectorizer | SVM | default | 0.968 | 0.916 | 0.823 | 0.862 |
Count Vectorizer | DT | (1, 2) | 0.953 | 0.858 | 0.83 | 0.843 |
Count Vectorizer | KNN | (1, 2) | 0.942 | 0.911 | 0.733 | 0.803 |
Count Vectorizer | LR | (1, 2) | 0.972 | 0.909 | 0.849 | 0.876 |
Count Vectorizer | MNB | (1, 2) | 0.965 | 0.913 | 0.815 | 0.851 |
Count Vectorizer | SVM | (1, 2) | 0.969 | 0.918 | 0.828 | 0.866 |
TF-IDF | DT | default | 0.946 | 0.834 | 0.815 | 0.823 |
TF-IDF | KNN | default | 0.943 | 0.903 | 0.735 | 0.801 |
TF-IDF | LR | default | 0.964 | 0.919 | 0.803 | 0.851 |
TF-IDF | MNB | default | 0.925 | 0.953 | 0.573 | 0.658 |
TF-IDF | SVM | default | 0.968 | 0.921 | 0.823 | 0.864 |
TF-IDF | DT | (1, 2) | 0.943 | 0.82 | 0.815 | 0.816 |
TF-IDF | KNN | (1, 2) | 0.942 | 0.911 | 0.733 | 0.803 |
TF-IDF | LR | (1, 2) | 0.963 | 0.93 | 0.802 | 0.853 |
TF-IDF | MNB | (1, 2) | 0.931 | 0.951 | 0.613 | 0.704 |
TF-IDF | SVM | (1, 2) | 0.969 | 0.93 | 0.825 | 0.868 |
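The two feature representations and the n-gram settings compared in the table above can be illustrated with a small pure-Python sketch. It assumes that "default" means unigrams only, which is the default `ngram_range=(1, 1)` of scikit-learn's `CountVectorizer`; the source does not state which library was used. The whitespace tokenizer and the textbook weighting `tf * log(N / df)` are simplifications, so the numbers differ from what a production vectorizer would emit.

```python
import math
from collections import Counter


def ngram_counts(text, ngram_range=(1, 1)):
    """Count word n-grams; (1, 2) yields unigrams and bigrams together."""
    tokens = text.lower().split()  # simplified tokenizer (assumption)
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def tfidf(docs, ngram_range=(1, 1)):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    counts = [ngram_counts(d, ngram_range) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document contributes once per term
    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


if __name__ == "__main__":
    # Made-up example tweets.
    docs = ["ransomware attack reported", "phishing attack reported"]
    print(ngram_counts(docs[0], (1, 2)))
    print(tfidf(docs)[0])
```

With the (1, 2) setting, a phrase such as "ransomware attack" becomes a feature of its own, while TF-IDF down-weights terms like "attack" that occur in every tweet.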
Model ID | Accuracy | Precision Macro | Precision Micro | Precision Weighted | Recall Macro | Recall Micro | Recall Weighted | F1 Macro | F1 Micro | F1 Weighted |
---|---|---|---|---|---|---|---|---|---|---|
DistilBERT | 0.9720 | 0.6742 | 0.9720 | 0.9718 | 0.6710 | 0.9720 | 0.9720 | 0.6725 | 0.9720 | 0.9719 |
RoBERTa | 0.9717 | 0.6720 | 0.9717 | 0.9710 | 0.6625 | 0.9717 | 0.9717 | 0.6671 | 0.9717 | 0.9713 |
DeBERTa | 0.9716 | 0.6639 | 0.9716 | 0.9715 | 0.6716 | 0.9716 | 0.9716 | 0.6676 | 0.9716 | 0.9715 |
BERT | 0.9723 | 0.6776 | 0.9723 | 0.9719 | 0.6708 | 0.9723 | 0.9723 | 0.6741 | 0.9723 | 0.9721 |
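In the table above, the micro-averaged scores equal the accuracy, while the macro-averaged scores are much lower. The sketch below shows how the three averaging schemes are computed for precision and how they can diverge when one class dominates, as Not_News does in the dataset. The toy labels and predictions are made up for illustration and do not reproduce the models' outputs; the helper names are hypothetical.

```python
from collections import Counter


def precision_by_class(y_true, y_pred, labels):
    """Per-class precision; a class that is never predicted scores 0.0."""
    result = {}
    for c in labels:
        predicted = sum(1 for p in y_pred if p == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        result[c] = correct / predicted if predicted else 0.0
    return result


def averaged_precision(y_true, y_pred, labels):
    """Return (macro, micro, weighted) precision for single-label data."""
    per_class = precision_by_class(y_true, y_pred, labels)
    support = Counter(y_true)
    # Macro: every class counts equally, regardless of its size.
    macro = sum(per_class[c] for c in labels) / len(labels)
    # Micro: pooled over all predictions, which equals accuracy here.
    micro = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Weighted: per-class scores weighted by the number of true samples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, micro, weighted


if __name__ == "__main__":
    labels = ["High_Risk_News", "Normal_News", "Not_News"]
    # Hypothetical imbalanced sample: the rare class is always missed.
    y_true = ["Not_News"] * 8 + ["Normal_News", "High_Risk_News"]
    y_pred = ["Not_News"] * 8 + ["Normal_News", "Not_News"]
    print(averaged_precision(y_true, y_pred, labels))
```

Because macro averaging gives the smallest class the same weight as the largest, it is the most sensitive of the three to errors on rare classes such as High_Risk_News.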
Studies | Platform Created | Purpose of Platform | Platform Stages | Dataset |
---|---|---|---|---|
[5] | Prototype system | Automatically collecting and analyzing cyber security data posted on the X platform | Data collection from the X platform, data processing and analysis using NLP, and data indexing and visualization | Posts on the X platform collected using the X platform streaming API |
[14] | NLP and machine learning based system | Analyzing cybersecurity-related documents published over the internet | Symmetry, machine adjustment using the NLP model, and extraction, analysis, and presentation of related data | Data collected from documents using the NLP model |
[15] | Geographic information system (GIS) mapping and analysis system | Offering real-time detection of cyberattacks at the University of North Florida | Detection of cyberattack locations, and mapping the locations using GIS | Data related to the University of North Florida |
[16] | BubbleNet | Assisting network analysts to recognize and summarize the cybersecurity data patterns | Data collection, data clustering, and visualization | Intrusion detection system that automatically flags essential network records as alerts for network analysts |
[17] | SecGrid | Analyzing, classifying, and visualizing DDoS cyberattacks | Data collection, data clustering, and visualization | Different publicly available DDoS attack datasets |
[18] | DDoSGrid | Analyzing and visualizing distributed denial-of-service (DDoS) attacks | Data collection, data clustering, and visualization | PCAP files created using a program |
[21] | Cyber threat platform | Offering real-time detection and visualization of different cyberattacks | Data collection from internal sources, data clustering, and visualization | Internal sources, such as logs related to the organization, and external sources from available data over the internet |
The proposed method | Tableau dashboard | Freely and quickly accessing a real-time visual map to see essential information about attacks, their locations, time of occurrence, and names | Data collection, data preprocessing, data labeling, feature representation, data classification, evaluation, and data visualization | Labeled 21,796 tweets collected using the X platform API |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lughbi, H.; Mars, M.; Almotairi, K. A Novel NLP-Driven Dashboard for Interactive CyberAttacks Tweet Classification and Visualization. Information 2024, 15, 137. https://doi.org/10.3390/info15030137