Article

A Novel NLP-Driven Dashboard for Interactive CyberAttacks Tweet Classification and Visualization

1 College of Computing, Umm-Alqura University, Mecca 24382, Saudi Arabia
2 Higher Institute of Computer Sciences and Mathematics, Monastir University, Monastir 5000, Tunisia
* Author to whom correspondence should be addressed.
Information 2024, 15(3), 137; https://doi.org/10.3390/info15030137
Submission received: 17 February 2024 / Accepted: 23 February 2024 / Published: 28 February 2024

Abstract

The pervasive reach of social media like the X platform, formerly known as Twitter, offers unique opportunities for real-time analysis of cyberattack developments. By parsing and classifying tweets related to cyberattacks, we can glean valuable insights into their type, location, impact, and potential mitigation strategies. However, with millions of daily tweets, manual analysis is inefficient and time-consuming. This paper proposes an interactive and automated dashboard powered by natural language processing to effectively address this challenge. First, we created the CybAttT dataset, which contains 36,071 manually labeled English cyberattack tweets. We experimented with different classification algorithms. Following that, the best model was deployed and integrated into the streaming pipeline for real-time classification. This dynamic dashboard makes use of four different visualization formats: a geographical map, a data table, informative tiles, and a bar chart. Users can readily access crucial information about attacks, including location, timing, and perpetrators, enabling a swift response and mitigation efforts. Our experimental results demonstrated the dashboard’s promising visualization capabilities, highlighting its potential as a valuable tool for organizations and individuals seeking an intuitive and comprehensive overview of cyberattack events.

1. Introduction

Generally, cyberattacks involve stealing sensitive information from organizations, targeting customers’ data, and hacking payment networks through online fraud and breaches. This causes direct financial losses, brand damage, and a loss of customer trust (www.threatstream.com; www.threatconnect.com, accessed on 20 January 2024). The most common cyberattacks are malware, Trojans, phishing, denial of service (DoS), and structured query language (SQL) injection. Hence, different automated cybersecurity tools are being developed to track the presence of attacks in surrounding regions and to find effective plans to avoid or counter them. This is primarily achieved using natural language processing (NLP), a branch of artificial intelligence that enables computers to understand, produce, and manipulate human language.
Recently, researchers have focused on social media platforms such as Facebook and the X platform, formerly known as Twitter, to collect data about cyberattacks [1,2,3,4]. This is due to their wide usage by people around the world, who post about events before or during their occurrence. A huge amount of data are added daily to social media; for example, people post around 500 million tweets per day on the X platform [5]. According to statistics reported in 2021, about 57% of the world’s population use social media platforms (https://datareportal.com/social-media-users, accessed on 16 February 2024). Among social media platforms, the X platform is the fastest and simplest way to connect. It is supported by the X platform application programming interface (API), which permits users to find and retrieve, engage with, or build different resources, including tweets, users, and spaces. The X platform is also used by almost all experts and famous personalities from all fields.
X allows the scraping of publicly available data, meaning anything a user can see without logging into the platform. For example, even if a user follows a private account and can access its profile, he/she cannot scrape, share, or utilize these data for any purpose. General scraping on the X platform is not allowed, in order to avoid excess traffic on the website; hence, the platform tries to block automated web scrapers: https://research.aimultiple.com/twitter-web-scraping/ (accessed on 16 February 2024).
Moreover, the X platform offers up-to-the-minute updates on news, trends, and events, which makes it an important source of breaking news, live events, and updates. Twitter keeps users in the loop, whereby, with a few clicks, they can access the latest happenings and stay informed about the world around them. It also provides analytics tools for users and businesses to track engagement and performance metrics: https://www.rfwireless-world.com/Terminology/Advantages-and-Disadvantages-of-Twitter.html; https://igyani.com/advantages-and-disadvantages-of-twitter/2023 (accessed on 16 February 2024).
Another important point is that the X platform offers free API access for write-only use cases. Users must register their use case on the platform’s developer website and receive an API key once the use case has been approved. Obtaining tweet data, however, requires paid API access. Such access is supported by the platform, and it will not be blocked as long as the user pulls the data and follows the API guidelines. On the other hand, the platform API limits how many tweets can be pulled per minute: https://research.aimultiple.com/twitter-web-scraping (accessed on 16 February 2024); https://www.hitechwhizz.com/2023/04/5-advantages-and-disadvantages-drawbacks-benefits-of-twitter.html (accessed on 16 February 2024).
Due to the continuous and huge amounts of data that are added daily on social media platforms and published over the internet, the analysis of streaming data has become an essential issue. Streaming data are data that are continuously generated by different sources. Their core feature is that they arrive in large quantities at high speed. In addition, they have high dimensionality due to the large number of features and observations. The main research area in streaming data is streaming data classification, which aims to classify data points in an evolving stream of observations. The fields of streaming data classification are varied, ranging from monitoring sensor data to the analysis of a broad range of social media platforms [6,7]. Streaming data classification focuses on creating methods that adapt to a changing and possibly unstable data stream [8].
Practically, the X platform offers numerous daily news stories about cyberattacks that require extensive effort and time to collect, analyze, and tabulate. Hence, dashboards are used as business tools to track data by summarizing information. A dashboard is a visual display of information utilized to monitor conditions and enhance control of data, allowing people to visually recognize trends, anomalies, and patterns, and make effective decisions [9]. Dashboards therefore decrease unnecessary reporting and save working time for relevant tasks, enabling an efficient decision-making process [10]. Moreover, they offer powerful and unique information presentation methods to visually display data using graphical and text elements, permitting superior decision-making by focusing on related parts of information in the dataset [11].
The architecture of the proposed NLP-based interactive and automated dashboard is shown in Figure 1 below. In the offline phase, tweets are initially collected from the X platform using the X platform API. The collected data are then manually annotated by three volunteers into three categories: news, not-news, and high-risk news. These labeled tweets undergo feature engineering to prepare them for model training. Different supervised machine learning algorithms are then experimented with to classify the data and identify the most accurate model. This model is subsequently deployed and integrated into the streaming pipeline for real-time classification. The classified data then feed into the interactive dashboard for data visualization, presenting insights in a user-friendly manner.
The focus of this paper was to create an efficient and automated dashboard to quickly display and visualize information about the latest cyberattacks collected from relevant tweets, including their region of occurrence, time of occurrence, and names. In this paper, the main contributions are as follows:
  • Development of CybAttT (https://github.com/HudaLughbi/CybAttT, accessed on 16 February 2024): CybAttT is a novel dataset comprising the most recent cyberattack tweets, labeled as high-risk news, normal news, and not-news. This dataset alleviates the scarcity of resources for classifying cyberattack tweets and serves as a valuable foundation for future research.
  • Visualizing all up-to-date news about any new cyberattack type located in any country around the world.
  • Allowing users to obtain information about cyberattacks by just clicking any country on the dashboard map worksheet. Users then can obtain information such as the number of tweets posted from that country, the classification of those tweets with the number of tweets in each class, statistics showing the counts and names of attacks, and a table that includes the full tweets, posting times, attack names, and tweet classes.
The rest of this paper is organized as follows: Section 2 offers a review of some cyberattack platforms proposed recently to automatically collect, analyze, and visualize cybersecurity data collected from the X platform. Section 3 gives details about the dataset used in the dashboard and the classification model to classify the dataset. Section 4 explores the created dashboard in detail, its main worksheets, and its final interface. Section 5 compares our dashboard and some of the state-of-the-art platforms proposed in the literature. Section 6 concludes the work presented in this paper and suggests some future tasks for further work enhancements.

2. Related Works

The visualization of cyberattacks has a major effect on increasing the situational awareness of users or organizations. Moreover, visualization tools and dashboards can be deployed to represent complex cyberattack datasets more intuitively. This can assist analysts in recognizing patterns or trends that may not be directly obvious from the raw data. Hence, visualization tools and dashboards can enhance cybersecurity threat intelligence by offering real-time insights into organizational data and permitting organizations to quickly detect emerging threats, while minimizing risk [12]. Therefore, researchers have focused on proposing different visualization tools and dashboards to automatically collect, analyze, and visualize cybersecurity data.
NLP and machine learning have been widely applied in the literature to create cyberthreat platforms to visualize data collected globally from different sources, such as social media and documents published over the internet. In the field of classifying data published over the internet, a cyberthreat platform was proposed in [13] to offer real-time detection and visualization of cyberattacks, to help organizations respond rapidly to attacks. It includes three stages: the first stage is the data collection from internal sources, such as organizational logs, and external sources, such as available data over the Internet. These data are then clustered to find similar events, de-duplicated to avoid duplicate copies of data, analyzed, and saved in databases. The last stage is the visualization, which provides a real-time visualization with historical analysis of data, so organizations can gain situational awareness, access event details, understand the methods of infection, and follow the suggested defense plans.
In the same field, an NLP and machine-learning-based system was created [14] to analyze cybersecurity-related documents. It included three stages. The first stage was symmetry, which finds the symmetry between the way a human represents a domain and that represented by machine learning techniques. A dictionary including over 5000 words related to attacks was created after interviewing 14 cybersecurity experts. Those words were then categorized into 29 classes. The second stage is the machine adjustment, where an NLP model was proposed, trained, and tested using a large set of keywords. In the third stage, the NLP model was used to extract related data from documents, which were then analyzed, saved in a database, and presented in a web interface. The system offered valuable information about cybersecurity and presented it through a web interface to simplify the understanding of cybersecurity data.
With a focus on collecting data from social media platforms, a cyberthreat prototype was created in [1] to automatically collect and analyze cybersecurity data posted on the X platform. Data were collected from X platform posts using the X platform streaming API. Tweets were then processed and analyzed by applying NLP with certain libraries and language models. Tweets were then indexed, managed, and visualized. The results revealed that the proposed system helped analysts collect and process cybersecurity intelligence.
The rise of social media platforms has also opened up a new application area for stream-based algorithms. These platforms attract users to publish their content as texts, photos, and videos. Various algorithms have been developed recently to analyze the evolution of data on these platforms. For example, the researchers in [2] proposed the automated detection of hate speech and cyberbullying. Moreover, compared to other applications of (stream) text data, such as the detection of spam emails, social media researchers deal with short texts. Hence, stream-based algorithms have relatively less information with which to learn the true meaning of texts for classification [3,4].
Some researchers have tended to focus on collecting data from local areas. The researchers in [15] proposed a geographic information system (GIS) mapping and analysis system for cyberattacks at the University of North Florida. The system was initially based on detecting the locations of cyberattacks using the geographic internet protocol (GEO-IP) freeware GeoLite2 version 2.1. The origin locations of these attacks were then mapped using the GIS. R version 3.0 and some advanced spatial statistical analysis functions were then adopted to discover patterns in the cyberattacks. The outcomes showed that the proposed system was capable of finding spatial patterns in cyberattacks and mapping their locations.
BubbleNet was created in [16] as a cybersecurity dashboard using D3.js to assist network analysts in recognizing and summarizing cybersecurity data patterns. Those patterns are collections of network records representing abnormal behaviors that may be malicious. The dataset visualized was obtained from an intrusion detection system that automatically flags essential network records as alerts for network analysts. The dashboard includes four main views: location, temporal, attributes, and records. The location view is a location-based map, while the temporal view encodes time using a bar chart of network records per day. The attributes view includes bar charts and bullet charts for various attributes of the data, whereas the records view includes a details-on-demand table view offering a summary of different records in any selection. Visualization results demonstrated a novel, interactive, and real-time dashboard that can be used in both research and operational environments.
On the other hand, some researchers focused on specific types of attacks to be collected and classified. The DDoSGrid dashboard was created in [18] to analyze and visualize distributed denial-of-service (DDoS) attacks. This was motivated by the need for network operators to recognize the characteristics and behaviors of attacks and hence plan cybersecurity strategies more effectively. The dashboard includes miners to decode the dataset, which consists of PCAP files created using a program. Moreover, it has a web-based interface that permits users to interact with DDoSGrid. The dashboard also includes visualizations and statistics to enable users to gain detailed insights regarding the dataset. The results proved that DDoSGrid allows real-world DDoS scenario analysis.
In [17], the researchers proposed SecGrid, a machine-learning-empowered dashboard for analyzing, classifying, and visualizing cyberattacks. It implements an extensible set of miners to analyze information from network traces, in order to offer perceptive visualizations of malicious traffic and to automatically categorize different types of cyberattacks using machine learning. Two machine learning algorithms, random forest and k-nearest neighbor (kNN), were implemented to classify single or multiple PCAP files of records extracted from different publicly available DDoS attack datasets. After the data classification, all data are visualized on the dashboard, with a clear overview of the duration and behavior of the identified attacks. Experiments revealed the high usability and scalability of the dashboard in extracting information from large files, a high accuracy in classifying cyberattacks, and a high performance in visualizing data.
In summary, cyberattack dashboards are efficient, simple, and quick tools that effectively analyze cyberattack data, to allow users to obtain information about attacks, such as the type of cyberattack, the occurrence region, time of occurrence, and names.

3. Data Collection and Machine Learning Classification Models

This section introduces the cyberattack dataset adopted in this work and explains the classification model applied to classify the collected data.

3.1. Data Collection

To ensure the success and reliability of the dashboard, relevant and accurate cyberattack data need to be collected and used. The CybAttT dataset created during this research project includes 36,071 English cyberattack tweets, collected using the Tweepy Python package (http://www.tweepy.org, accessed on 16 February 2022) by searching for tweets posted on the X platform using keywords related to cyberattacks, data breaches, and cyberbreaches. Tweets were then labeled manually by three volunteer annotators. The annotation task aimed to classify each tweet into one of three classes: not-news, normal-news, and high-risk-news, based on descriptions and instructions provided to the annotators, to ensure that the annotation stage was completed with precise labeling. All tweets in the dataset were initially cleaned and filtered by removing punctuation marks, cleaning all pronouns and emojis (https://pypi.org/project/emoji, accessed on 16 February 2024), and removing stop words (https://www.nltk.org/index.html, accessed on 16 February 2024).
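To make the collection and cleaning steps concrete, the following minimal sketch shows how tweets could be retrieved with Tweepy and pre-processed by removing emojis, punctuation, and stop words. It assumes the Tweepy 4.x client for the X API v2; the query string, credential placeholder, and helper names are illustrative and not the exact configuration used to build CybAttT.

```python
# Minimal sketch of the collection and cleaning steps described above.
# The query, credential placeholder, and helper names are illustrative,
# not the exact configuration used to build CybAttT (Tweepy 4.x / X API v2).
import string

import tweepy                      # X (Twitter) API client
import emoji                       # emoji removal (https://pypi.org/project/emoji)
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))


def clean_tweet(text: str) -> str:
    """Remove emojis, punctuation, and English stop words from a tweet."""
    text = emoji.replace_emoji(text, replace="")
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)


# Authenticate and search for recent cyberattack-related tweets.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
query = '(cyberattack OR "data breach" OR ransomware) lang:en -is:retweet'
response = client.search_recent_tweets(query=query, max_results=100)

cleaned = [clean_tweet(tweet.text) for tweet in (response.data or [])]
```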
To evaluate the reliability of the annotation process, the inter-annotator agreement was assessed using Fleiss’ kappa [19,20]. This is a measure of inter-rater agreement used to quantify the agreement level among two or more raters when the assessment, called the response variable, is measured on a categorical scale. Fleiss’ kappa can be interpreted as follows: slight agreement (<0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), substantial agreement (0.61–0.80), and almost perfect agreement (0.81–1) [21,22].
The final overall inter-annotator agreement for the dataset was 0.99 (Fleiss’ kappa), which represents almost perfect agreement. Table 1 below illustrates the statistics of the resultant labels.
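As an illustration of how this agreement score can be computed, the sketch below derives Fleiss’ kappa from a subjects-by-raters label matrix using statsmodels; the toy ratings are made up for demonstration and are not the CybAttT annotations.

```python
# Illustrative Fleiss' kappa computation for three annotators and three
# classes; the toy ratings below are made up and are not the CybAttT labels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

CLASSES = ["not-news", "normal-news", "high-risk-news"]

# One row per tweet, one column per annotator (indices into CLASSES).
ratings = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [2, 2, 2],
    [0, 0, 0],
])

# aggregate_raters turns (subjects x raters) labels into per-category counts.
counts, _ = aggregate_raters(ratings, n_cat=len(CLASSES))
print(f"Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.3f}")
```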

3.2. Machine Learning and Transformers-Based Models

To power the real-time classification of streamed tweets within the dashboard, we evaluated a diverse set of machine learning algorithms and transformer-based models, including LR, MNB, DT, KNN, SVM, BERT, DistilBERT, DeBERTa, and RoBERTa. The best-performing model was then integrated into the dashboard, enabling it to autonomously categorize incoming tweets as news, not-news, or high-risk news, based on their content. This classification process involved four primary stages: data pre-processing, feature representation, classification, and performance evaluation, all carried out using the CybAttT dataset.

3.2.1. Experiment 1: Using Machine Learning Models

In this experiment, we evaluated the effectiveness of five machine learning algorithms on the CybAttT dataset: logistic regression (LR), multinomial naive Bayes (MNB), decision tree (DT), K-nearest neighbors (KNN), and support vector machine (SVM).
The tweets in the CybAttT dataset were divided into an 80% training set and a 20% testing set using a train–test split tool. The training set was fed to the five models so that they could learn the features of the tweets. Words in the tweets were converted into numerical features to be used in the classification process. These features were then modeled in vector form and used to train each classification model. The classification outputs of the five models were then evaluated using the testing set.
The best model was selected based on the highest recorded F1-score, because the accuracy score alone could not represent the different models’ performance on the imbalanced data, where 87% of the collected tweets were labeled as not-news, 11% as normal news, and only 2% as high-risk news. Based on the results, the logistic regression algorithm with the count vectorizer feature representation was the best-performing model, with an F1-score of 87.6%. The most interesting result was the accuracy achieved in correctly classifying high-risk news.
The experiments presented in Table 2 showed that the best n-gram range for the final feature matrix was (1, 2), i.e., unigrams and bigrams.
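A minimal sketch of this best-performing configuration, count-vectorizer features over an n-gram range of (1, 2) feeding a logistic regression classifier, is given below. The file name, column names, and split parameters are assumptions for illustration, and macro averaging is used as one reasonable choice of F1 averaging for the imbalanced classes.

```python
# Sketch of the best configuration from Table 2: count-vectorizer features
# with an n-gram range of (1, 2) feeding logistic regression. File name,
# column names, and split parameters are assumptions for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("CybAttT.csv")            # expected columns: text, label

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

model = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),   # unigrams + bigrams
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
```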

3.2.2. Experiment 2: Using Transformer-Based Models

Large language models (LLMs) are powerful tools capable of processing and generating human-like language [23]. They are trained on massive amounts of text data, allowing them to recognize, translate, predict, and generate text or other content. Hence, in the second experiment, we investigated the performance of four different fine-tuned transformer models in the text classification task: DistilBERT, DeBERTa, BERT, and RoBERTa. These transformer-based models are available on the Hugging Face platform (https://huggingface.co/, accessed on 16 February 2024). They were fine-tuned and used to classify the 36,071 tweets of the dataset. The evaluation metrics for all models are presented in Table 3. The considered metric was the F1 macro score, which measures the overall performance across all classes. Based on this metric, the best-performing model was BERT, with an F1 macro score of 0.6741, indicating that BERT achieved the highest accuracy in correctly classifying the text data across all classes.
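The sketch below outlines how one of these checkpoints (bert-base-uncased is used here as an example) could be fine-tuned for the three-class task with the Hugging Face Trainer API. The hyperparameters, file name, and column names are assumptions rather than the exact reported setup.

```python
# Sketch of fine-tuning a Hugging Face checkpoint (bert-base-uncased as an
# example) for three-class tweet classification. Hyperparameters, file name,
# and column names are assumptions rather than the exact reported setup.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# CSV with a "text" column and an integer "label" column (0, 1, 2).
dataset = load_dataset("csv", data_files="CybAttT.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cybattt-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```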
After comparing and evaluating all experiments, the logistic regression classifier was selected as the optimal model for deployment within the dashboard. It achieved an F1-score of 87.6%, demonstrating strong accuracy in classifying cyberattack-related tweets, particularly when paired with the count vectorizer feature representation and an n-gram range of (1, 2).
While all transformer models achieved relatively high F1 macro scores, indicating their success in learning the classification task, DistilBERT performed surprisingly well given the minimal performance gap between the models. This suggests its potential as a good option for applications with limited resources, due to its smaller size and faster training times.
The best-trained model was seamlessly integrated into the dashboard, enabling it to autonomously categorize incoming tweets in real time and empowering users with instant insights into the evolving landscape of cyberattacks.

4. Dashboard for Data Visualization

This section presents a custom-designed dashboard that acts as a central intelligence hub for understanding and visualizing cyberattacks. This interactive tool, seamlessly integrating the CybAttT dataset with advanced machine learning classification models, empowers researchers and security professionals to unlock actionable insights from the relentless stream of cyberattack-related tweets.
By harnessing the power of Tableau: https://www.tableau.com/learn/get-started/dashboards (accessed on 16 February 2024), this custom-designed dashboard can classify and plot tweets related to cyberattacks using a dynamic and interactive approach.
In this work, the Twitter streaming API was used to stream public tweets in real time from the Twitter platform, before it became the X platform. To achieve this, a connection with the Twitter streaming API was initially made by sending a query request to the Twitter API. Next, tweets related to defined keywords (names of attacks) were received in JSON format, relevant information was extracted from them, and the results were stored in a database. This database was then queried and the streaming tweets were analyzed.
Three relevant components were installed to form the application. The first was the PostgreSQL database, which was used to save the tweets. PostgreSQL is a free and open-source relational database management system based on SQL; the pgAdmin tool was used to create multiple tables to store the data received from the Twitter streaming API [24]. The second was Tweepy, which was used to stream the tweets. Tweepy is a convenient library for accessing the Twitter streaming API; it allows streaming tweets from the user’s timeline, a specific user’s timeline, or by searching for certain keywords, as in our project. The last was psycopg, the PostgreSQL adapter for Python.
To stream data from Twitter, the Twitter developer website was used to register the app and obtain access to the API. The main difficulty here was that the user must provide a precise explanation of what he/she is going to do with the data received from Twitter. Once this had been completed, an API key and API secret key were obtained, and an access token and access token secret were then generated. To authenticate with the Twitter streaming API, an OAuthHandler instance was created to handle the authentication by passing the API key, API secret key, and access tokens.
The Twitter streaming API returns the data in JSON format. Each tweet JSON includes information concerning the tweet, the user, and entities such as hashtags. The use of the Twitter streaming API solved the problems with the basic Twitter API. One of these problems was that the Search API returns results from only the past 7 days. Moreover, the Search API has a limit of 180 tweets per 15 min window. Another problem is that the returned tweets are not exhaustive, meaning that tweets matching the keywords might be missing from those the API returns. Hence, the Twitter streaming API solved all those problems and returned tweets in real time. It retrieves up to 1% of the total number of tweets, which means that the number of tweets obtained varies per request based on the current total tweet volume on the platform. Tweets were collected over a period of six months, with an average of 144,000 tweets per request over 50 requests. Since we collected a huge amount of data from all over the world, we filtered the data based on the tweet location, for example, visualizing tweets from the United States only. The streaming capture was not sensitive to network transmission problems. However, although Twitter streaming was previously free, the X platform now offers streamed tweets only as a paid service.
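The following sketch pulls these pieces together: it authenticates with Tweepy’s OAuthHandler, streams tweets matching attack keywords, classifies each tweet with the deployed model, and stores the result in a PostgreSQL table through psycopg2. The credentials, keyword list, model file, and table schema are placeholders, and the Tweepy 3.x streaming interface is assumed (the underlying v1.1 streaming endpoint has since been retired).

```python
# Sketch of the streaming pipeline described above. Credentials, keywords,
# the model file, and the table schema are placeholders; the Tweepy 3.x
# streaming interface is assumed (the v1.1 streaming endpoint is now retired).
import joblib
import psycopg2
import tweepy

classifier = joblib.load("lr_count_vectorizer.joblib")   # deployed LR pipeline

conn = psycopg2.connect(dbname="cyberattacks", user="postgres", password="secret")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        id BIGINT PRIMARY KEY, created_at TIMESTAMP, text TEXT,
        country TEXT, attack_name TEXT, pred_class TEXT);
""")
conn.commit()

KEYWORDS = ["ransomware", "phishing", "ddos", "data breach", "malware"]


class CyberAttackListener(tweepy.StreamListener):
    def on_status(self, status):
        text = status.text               # the cleaning used in training would be applied here
        pred_class = classifier.predict([text])[0]
        country = status.place.country if status.place else None
        attack = next((k for k in KEYWORDS if k in text.lower()), "unknown")
        cur.execute(
            "INSERT INTO tweets VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            (status.id, status.created_at, text, country, attack, pred_class),
        )
        conn.commit()

    def on_error(self, status_code):
        return status_code != 420        # disconnect when rate limited


auth = tweepy.OAuthHandler("API_KEY", "API_SECRET_KEY")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream = tweepy.Stream(auth=auth, listener=CyberAttackListener())
stream.filter(track=KEYWORDS, languages=["en"])
```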
For visualization, Tableau is one of the most popular data visualization and dashboard creation platforms. Its widespread adoption is attributed to its user-friendly interface and the wide range of visualization options it offers compared with other software. Recognized as a robust business intelligence platform, Tableau is embraced for its ability to streamline data control, exploration, discovery, and sharing. One of Tableau’s distinctive strengths lies in its capacity to provide diverse and appealing visualizations, making data comprehension straightforward and making it an ideal tool for creating dashboards that convey insights in an engaging and informative manner. In our specific design, we utilized Tableau as the visualization tool in a pipeline that collects, processes, classifies, and visualizes cyberattack tweets obtained from the X platform in an interactive dashboard.
The resulting dashboard is a composite of four fundamental worksheets, each serving a unique purpose. The geographical map provides a spatial representation of cyberattack-related tweets, offering insights into their geographical distribution. The table worksheet presents a detailed tabular view of the dataset, allowing for a granular examination of individual tweet attributes. Complementing these, the three tiles contribute a concise summary of key metrics and trends, enhancing the overall accessibility of information. Lastly, the bar chart delivers a visual depiction of data patterns, facilitating a quick grasp of trends and anomalies.
In the subsequent sections, we will detail each of these worksheets, offering a comprehensive understanding of their roles and contributions to the overarching dashboard. Through Tableau’s intuitive interface and robust features, we aim to empower users to gain meaningful insights from the rich tapestry of cyberattack-related data encapsulated in our dataset.

4.1. Geographical Map Worksheet

One of the components of our dashboard is the world geographical map, a visual overview designed to encapsulate the global landscape of cyberattack-related tweets. In this map, each country is assigned a distinctive color, with shades ranging from a serene light blue to a deep, commanding dark blue within the chosen blue–teal color palette. This deliberate selection facilitates an intuitive interpretation: the lighter the hue, the fewer tweets emanate from the corresponding country.
Beyond its visual appeal, our map gives an interactive experience. The inclusion of country names ensures clarity, enabling users to effortlessly identify nations on the map. As users hover their cursor over any given country, a dynamic feature comes to life, revealing the precise number of tweets originating from that specific location. This instant feedback mechanism empowers users with granular insights, fostering a deeper understanding of the cyberattack tweet landscape.
Concurrently, we recognize the presence of tweets originating from undisclosed or unknown locations. To address this, a dedicated section positioned at the right bottom of the map aggregates and displays the cumulative number of tweets emanating from these unidentified locations. This transparent representation ensures that users are cognizant of the scope of unknown origins within the dataset, contributing to a comprehensive understanding of the overall data distribution.
Additionally, the world geographical map serves as a dynamic gateway to the global narrative of cyberattacks, leveraging color gradients, interactive tooltips, and meticulous detailing to present a nuanced and insightful perspective. Through this visualization, users embark on a journey that transcends geographical boundaries, unraveling the multifaceted tapestry of cyberthreats across the world.
Figure 2 shows the geographical map worksheet with its settings, where the columns are set to longitude and the rows to latitude. The color mark is used to indicate the number of tweets. All filters are shown and expanded in the worksheet. In addition, the tooltip has been adjusted to show a sentence giving the exact number of tweets for the country over which the mouse is hovering. To count the number of tweets in the dataset, a calculated field named Tweets was generated with a value of one for each tweet.

4.2. Table Worksheet

The table worksheet serves as a comprehensive repository of cyberattack tweet data, encapsulating key attributes within its four columns: “Text”, “Created At”, “Attack Name”, and “Pred Class”. In the pursuit of granularity, the “Attack Name” column catalogs the diverse array of attack types, while the “Pred Class” column categorizes each tweet into one of three classes: “News”, “Not-News”, or “High-Risk News”.
Navigating these details, the “Text” column preserves the essence of the tweets themselves, providing a textual snapshot of the communication. The “Created At” column chronicles the precise timestamps of tweet creation, adhering to a format that specifies the year, month, day, hour, and minute. Additionally, the “Attack Name” column shows the attack type. The “Pred Class” column displays the output of the classification model for every tweet.
The versatility of the table worksheet is augmented by its filtering capabilities. As users engage with the world geographical map and click on a specific country, filters are dynamically applied to showcase only the tweets relevant to the selected geographic location, attack names, and tweet classes. This feature ensures a tailored and focused exploration of the dataset, aligning with user preferences and research objectives.
To streamline the visual presentation, a calculated field named “Blank” was introduced, eliminating the default blank column that often accompanies tables. This enhances the aesthetic appeal and readability of the worksheet. Additionally, a “Full Text” column was incorporated to showcase tweets with cleaned text, contributing to a clutter-free and informative display.
Finally, the table worksheet is not merely a static table but is also an interactive gateway to a wealth of cyberattack tweet insights. Its thoughtful design, coupled with filtering capabilities and interactive text display, transforms data exploration into a user-friendly and informative experience.

4.3. Tiles Worksheet

The tiles worksheet serves as an invaluable visual summary, providing an instant snapshot of the distribution of cyberattack tweets across different classes: “News”, “Not-News”, and “High-Risk News”. Its primary function is to efficiently convey the exact count for each class, offering users a quick and straightforward way to discern the prevalence of specific tweet categories.
These strategically arranged tiles serve as a rapid-reference tool, enabling users to swiftly identify and compare class counts. Their utility is particularly pronounced in their ability to distill complex data into a visually digestible format, allowing for quick insights and informed decision-making.
Just as in the table worksheet, the tiles worksheet aligns with the broader interactivity of our dashboard. It is seamlessly integrated with the filtering system initiated in the world geographical map worksheet, ensuring that the displayed tile counts dynamically adjust to the specific filters applied—whether based on geographic regions, attack names, or tweet classifications. This synchronization ensures that users receive real-time insights tailored to their chosen parameters.
In the pursuit of accuracy, the calculated field “Tweets” is employed once again to display the exact number of tweets. This calculated field ensures precision in conveying tweet counts, offering users a reliable foundation for their analyses and interpretations.
In Figure 3, both the table and tiles worksheets are showcased without the application of any filters, presenting counts derived from the raw dataset. This intentional choice serves as a baseline for users, allowing them to gauge the impact of subsequent filters on the overall dataset. This transparency underscores our commitment to providing users with a clear understanding of the context in which the presented data are situated.
In other words, the tiles worksheet stands as a user-friendly compass within our dashboard, offering both efficiency and clarity in the exploration of cyberattack tweet data. Its role as a dynamic, filter-sensitive visualization ensures that users can seamlessly navigate the nuanced landscape of tweet classifications with precision and confidence.

4.4. Bar Chart Worksheet

The bar worksheet emerges as a powerful visual narrative, unraveling the intricate tapestry of the streamed cyberattack tweets. This dynamic bar chart not only shows the proportions of various attack names but also introduces a nuanced layer by aggregating these insights across distinct tweet classes—namely, “News”, “Not-News”, and “High-Risk News”. Each bar in this chart represents the count of a specific attack name, offering users a clear and insightful overview of the prevalence and distribution of cyberthreats.
What sets this chart apart is its ability to convey information at multiple levels. The visual representation of bars, each precisely colored to distinguish attack names, allows for an intuitive understanding of the relative magnitudes of different attacks within each tweet class. Furthermore, the inclusion of exact attack name counts, prominently displayed above each bar, brings quantitative precision to the visual narrative, facilitating a more granular exploration.
Crucially, the bar chart’s aggregation by tweet class divides the visual landscape into three distinct sections, each dedicated to one of the aforementioned tweet classes. This deliberate segmentation ensures that users can discern patterns and trends specific to “News”, “Not-News”, or “High-Risk News” tweets with clarity and efficiency. This is a thoughtful design choice aimed at providing a comprehensive understanding of how various attack names manifest across different types of tweets.
As showcased in Figure 4, the bar worksheet encapsulates this wealth of information, offering users a snapshot of the cyberattack landscape in its entirety. This figure serves as a visual portal into the complex relationships between attack names and tweet classes, inviting users to explore and extract actionable insights.

4.5. Final Cyberattack Dashboard

A dashboard containing all the previous worksheets was created to visualize the data interactively. This dashboard was created with a custom size of 1300 × 850 and contains two horizontal sections: the upper and lower sections. The upper section contains all filters, including the map, the attack name filter, and the tweet class filter (pred_class). The lower section, in turn, is segmented vertically into two sections. The left section contains a table displaying all the tweet information from the corresponding country. The right section is horizontally segmented into two sections; the upper section contains the tiles for tweet class counts, while the lower section contains the tweet class aggregated bar chart with the attack name counts.
The performance and efficiency of the final dashboard were evaluated by asking two volunteers to use it after applying a filter to display only tweets from the United States, since the data were collected from all over the world. The dashboard displayed to the two users after clicking the United States is shown in Figure 5. As shown in the figure, the dashboard displayed all the tweets from the United States, along with their statistics and segmentations according to tweet class and then attack name. Both users reported that the dashboard was highly intuitive, praising the clear filtering options that made it easy to focus on data relevant to their needs and noting how quickly the results updated. They also highlighted the dashboard’s speed and responsiveness, which allowed them to quickly navigate and analyze information. One volunteer commented that the compelling visualizations helped them grasp complex trends and patterns within the data. Overall, the positive feedback from user testing underscored the dashboard’s effectiveness and potential as a valuable tool for monitoring and analyzing cyberattacks in specific regions.
Figure 5 shows a snapshot of the final interactive dashboard after clicking the United States as an example. The dashboard displays all the tweets from the United States, along with their statistics and segmentations according to tweet class and then attack name.

5. Comparison with Existing Work

Different related works have been reviewed in this paper. In the field of classifying data published over the internet, both works in [13,14] proposed cyberthreat platforms to offer real-time detection and visualization of cyberattacks. However, the main difference between our work and the two works proposed in [13,14] is that our work is based on collecting data from the X platform, while the other two works are based on data published over the internet. Practically, the visualization of cybersecurity-related documents published over the internet takes more time and requires greater effort. Moreover, the published documents may not cover all affected regions and all types of new attacks, while people constantly post tweets about the presence and location of any new attack. On the other hand, the work proposed in [1] focused on collecting data from the X platform. However, its main limitation was that users could not refer to the original published tweets. This limitation was addressed in our work, where users can click and view any tweet visualized on the dashboard. Some related works focused on collecting data from only specific areas, such as the works proposed in [15,16]. The work proposed in [15] focused on collecting the locations of cyberattacks at the University of North Florida, while the work in [16] focused on collecting records of a local network. Our work, in turn, goes beyond both works by collecting data from the whole world. Both works proposed in [17,18] focused on collecting data concerning specific types of attacks, i.e., DDoS attacks, to be clustered and classified. Hence, our work goes beyond both works by collecting tweets concerning different types of cyberattacks published on the X platform.
In summary, our proposed Tableau dashboard goes beyond the works proposed in the literature by collecting tweets posted from all over the world concerning any type of cyberattack on one of the main social media platforms, the X platform, and by allowing users to refer to the original published tweets. A comparison between our proposed dashboard and those reviewed in the literature is given in Table 4 below.

6. Conclusions

This paper introduced CybAttT (the CybAttT dataset is available online for research purposes at https://github.com/HudaLughbi/CybAttT, accessed on 14 February 2024), a novel dataset rich in cyberattack-related tweets, facilitating the development of robust tweet classification models. We rigorously evaluated diverse machine learning and fine-tuned models on CybAttT, revealing valuable insights into their effectiveness. Importantly, this analysis identified promising models with the potential to accurately detect cyberattacks in real time, paving the way for enhanced online security.
Another contribution of our research was the creation of an efficient and attractive dashboard using the Tableau platform, to visualize the tweets collected using the CybAttT dataset and the best classification model. By clicking any country on the dashboard, several types of information are directly displayed, including the number of tweets posted from that country; the classification of those tweets, which shows the number of tweets in each class; statistics showing the counts and names of attacks; and a table including the full tweets, posting times, attack names, and tweet class. Therefore, this dashboard is a valuable addition for organizations and users, to help them display all types of information they want in a simple and attractive manner.
One suggestion for future work would be to further refine the data stream. We plan to expand our annotation workforce, reducing human error and enriching the dataset. Additionally, incorporating named entity recognition (NER) into the classification model could pinpoint targeted locations with even greater precision. By continuously refining and expanding CybAttT, we aim to build an indispensable tool for navigating the ever-shifting cyberterrain, ultimately safeguarding countless individuals and organizations from the evolving threats of the digital age.

Author Contributions

Conceptualization, H.L., M.M. and K.A.; Validation, H.L., M.M. and K.A.; investigation, H.L., M.M. and K.A.; methodology, H.L., M.M. and K.A.; resources, H.L., M.M. and K.A.; data curation, H.L. and M.M.; writing—original draft preparation, H.L. and M.M.; writing—review and editing, H.L., M.M. and K.A.; visualization, H.L. and M.M.; supervision, M.M. and K.A.; project administration, M.M. and K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data introduced and described within this paper are available at https://github.com/HudaLughbi/CybAttT (accessed on 14 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vadapalli, S.R.; Hsieh, G.; Nauer, K.S. Twitterosint: Automated cybersecurity threat intelligence collection and analysis using twitter data. In Proceedings of the International Conference on Security and Management (SAM); The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing; WorldComp: Las Vegas, NV, USA, 2018; pp. 220–226. [Google Scholar]
  2. Nahar, V.; Li, X.; Zhang, H.L.; Pang, C. Detecting cyberbullying in social networks using multi-agent system. Web Intell. Agent Syst. Int. J. 2014, 12, 375–388. [Google Scholar] [CrossRef]
  3. Taninpong, P.; Ngamsuriyaroj, S. Tree-based text stream clustering with application to spam mail classification. Int. J. Data Min. Model. Manag. 2018, 10, 353–370. [Google Scholar] [CrossRef]
  4. Hu, X.; Wang, H.; Li, P. Online biterm topic model based short text stream classification using short text expansion and concept drifting detection. Pattern Recognit. Lett. 2018, 116, 187–194. [Google Scholar] [CrossRef]
  5. Alruily, M. Issues of dialectal saudi twitter corpus. Int. Arab J. Inf. Technol. 2020, 17, 367–374. [Google Scholar] [CrossRef]
  6. Jeffin Gracewell, J.; Pavalarajan, S. Fall detection based on posture classification for smart home environment. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 3581–3588. [Google Scholar] [CrossRef]
  7. Zorich, L.; Pichara, K.; Protopapas, P. Streaming classification of variable stars. Mon. Not. R. Astron. Soc. 2020, 492, 2897–2909. [Google Scholar] [CrossRef]
  8. Clever, L.; Pohl, J.S.; Bossek, J.; Kerschke, P.; Trautmann, H. Process-oriented stream classification pipeline: A literature review. Appl. Sci. 2022, 12, 9094. [Google Scholar] [CrossRef]
  9. Sarikaya, A.; Correll, M.; Bartram, L.; Tory, M.; Fisher, D. What do we talk about when we talk about dashboards? IEEE Trans. Vis. Comput. Graph. 2018, 25, 682–692. [Google Scholar] [CrossRef] [PubMed]
  10. Few, S. Information Dashboard Design: The Effective Visual Communication of Data; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2006. [Google Scholar]
  11. Cîmpan, A. Applying Design System in Cybersecurity Dashboard Development. Ph.D. Thesis, ETSI Informatica, Málaga, Spain, 2019. [Google Scholar]
  12. Samtani, S.; Li, W.; Benjamin, V.; Chen, H. Informing cyber threat intelligence through dark Web situational awareness: The AZSecure hacker assets portal. Digit. Threat. Res. Pract. 2021, 2, 1–10. [Google Scholar] [CrossRef]
  13. Carvalho, V.S.; Polidoro, M.J.; Magalhaes, J.P. Owlsight: Platform for real-time detection and visualization of cyber threats. In Proceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), New York, NY, USA, 9–10 April 2016; pp. 61–66. [Google Scholar]
  14. Georgescu, T.M. Natural language processing model for automatic analysis of cybersecurity-related documents. Symmetry 2020, 12, 354. [Google Scholar] [CrossRef]
  15. Hu, Z.; Baynard, C.W.; Hu, H.; Fazio, M. GIS mapping and spatial analysis of cybersecurity attacks on a florida university. In Proceedings of the 2015 23rd International Conference on Geoinformatics, Wuhan, China, 19–21 June 2015; pp. 1–5. [Google Scholar]
  16. McKenna, S.; Staheli, D.; Fulcher, C.; Meyer, M. Bubblenet: A cyber security dashboard for visualizing patterns. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2016; Volume 35, pp. 281–290. [Google Scholar]
  17. Franco, M.; Von der Assen, J.; Boillat, L.; Killer, C.; Rodrigues, B.; Scheid, E.J.; Granville, L.; Stiller, B. SecGrid: A Visual System for the Analysis and ML-based Classification of Cyberattack Traffic. In Proceedings of the 2021 IEEE 46th Conference on Local Computer Networks (LCN), Edmonton, AB, Canada, 4–7 October 2021; pp. 140–147. [Google Scholar]
  18. Franco, M.; von der Assen, J.; Boillat, L.; Killer, C.; Rodrigues, B.; Scheid, E.; Granville, L.; Stiller, B. Poster: DDoSGrid: A Platform for the Post-mortem Analysis and Visualization of DDoS Attacks. In Proceedings of the 2021 IFIP Networking Conference (IFIP Networking), Espoo and Helsinki, Finland, 21–24 June 2021; pp. 1–3. [Google Scholar]
  19. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378. [Google Scholar] [CrossRef]
  20. Hamoui, B.; Mars, M.; Almotairi, K. FloDusTA: Saudi Tweets Dataset for Flood, Dust Storm, and Traffic Accident Events. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1391–1396. Available online: https://aclanthology.org/2020.lrec-1.174 (accessed on 14 February 2024).
  21. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
  22. Lughbi, H.; Mars, M.; Almotairi, K. CybAttT: A Dataset of Cyberattack News Tweets for Enhanced Threat Intelligence. Data 2024, 9, 39. [Google Scholar] [CrossRef]
  23. Mars, M. From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough. Appl. Sci. 2022, 12, 8805. [Google Scholar] [CrossRef]
  24. Lughbi, H.; Mars, M.; Almotairi, K. Leverage AI and NLP for Enhanced Threat Intelligence: An Interactive AI-Powered Dashboard for Cyberattack Tweet Visualization; LAP LAMBERT Academic Publishing: Saarbrücken, Germany, 2024; Volume 96. [Google Scholar]
Figure 1. The architecture of our NLP-based interactive and automated dashboard.
Figure 2. The interactive geographical map worksheet.
Figure 3. The tiles worksheet for an instant snapshot of the distribution of cyberattack tweets across different classes.
Figure 4. Worksheet with a bar chart showing the distribution of attack counts by attack type.
Figure 5. The final interactive cyberattack visualization dashboard.
Table 1. Dataset size, label distributions, and Fleiss’ Kappa annotation agreement.

Labels | High_Risk_News | Normal_News | Not_News
No. tweets | 892 | 3948 | 31,231
Dataset size (tweets) | 36,071
Fleiss’ Kappa | 0.99
Table 2. Revealed performance metrics for supervised models.

Feature Representation | Algorithm | n-Gram | Accuracy | Precision | Recall | F1-Score
Count Vectorizer | DT | default | 0.951 | 0.842 | 0.813 | 0.827
Count Vectorizer | KNN | default | 0.95 | 0.892 | 0.772 | 0.823
Count Vectorizer | LR | default | 0.97 | 0.901 | 0.846 | 0.871
Count Vectorizer | MNB | default | 0.953 | 0.865 | 0.81 | 0.825
Count Vectorizer | SVM | default | 0.968 | 0.916 | 0.823 | 0.862
Count Vectorizer | DT | (1, 2) | 0.953 | 0.858 | 0.83 | 0.843
Count Vectorizer | KNN | (1, 2) | 0.942 | 0.911 | 0.733 | 0.803
Count Vectorizer | LR | (1, 2) | 0.972 | 0.909 | 0.849 | 0.876
Count Vectorizer | MNB | (1, 2) | 0.965 | 0.913 | 0.815 | 0.851
Count Vectorizer | SVM | (1, 2) | 0.969 | 0.918 | 0.828 | 0.866
TF-IDF | DT | default | 0.946 | 0.834 | 0.815 | 0.823
TF-IDF | KNN | default | 0.943 | 0.903 | 0.735 | 0.801
TF-IDF | LR | default | 0.964 | 0.919 | 0.803 | 0.851
TF-IDF | MNB | default | 0.925 | 0.953 | 0.573 | 0.658
TF-IDF | SVM | default | 0.968 | 0.921 | 0.823 | 0.864
TF-IDF | DT | (1, 2) | 0.943 | 0.82 | 0.815 | 0.816
TF-IDF | KNN | (1, 2) | 0.942 | 0.911 | 0.733 | 0.803
TF-IDF | LR | (1, 2) | 0.963 | 0.93 | 0.802 | 0.853
TF-IDF | MNB | (1, 2) | 0.931 | 0.951 | 0.613 | 0.704
TF-IDF | SVM | (1, 2) | 0.969 | 0.93 | 0.825 | 0.868
Table 3. Results of experiments using transformer-based models.

Model ID | Accuracy | Precision Macro | Precision Micro | Precision Weighted | Recall Macro | Recall Micro | Recall Weighted | F1 Macro | F1 Micro | F1 Weighted
DistilBERT | 0.9720 | 0.6742 | 0.9720 | 0.9718 | 0.6710 | 0.9720 | 0.9720 | 0.6725 | 0.9720 | 0.9719
RoBERTa | 0.9717 | 0.6720 | 0.9717 | 0.9710 | 0.6625 | 0.9717 | 0.9717 | 0.6671 | 0.9717 | 0.9713
DeBERTa | 0.9716 | 0.6639 | 0.9716 | 0.9715 | 0.6716 | 0.9716 | 0.9716 | 0.6676 | 0.9716 | 0.9715
BERT | 0.9723 | 0.6776 | 0.9723 | 0.9719 | 0.6708 | 0.9723 | 0.9723 | 0.6741 | 0.9723 | 0.9721
Table 4. Comparison between the proposed dashboard and some related systems.

Studies | Platform Created | Purpose of Platform | Platform Stages | Dataset
[1] | Prototype system | Automatically collecting and analyzing cybersecurity data posted on the X platform | Data collection from the X platform, data processing and analysis using NLP, and data indexing and visualization | Posts on the X platform collected using the X platform streaming API
[14] | NLP and machine-learning-based system | Analyzing cybersecurity-related documents published over the internet | Symmetry, machine adjustment using the NLP model, and extraction, analysis, and presentation of related data | Data collected from documents using the NLP model
[15] | Geographic information system (GIS) mapping and analysis system | Offering real-time detection of cyberattacks at the University of North Florida | Detection of cyberattack locations, and mapping the locations using GIS | Data related to the University of North Florida
[16] | BubbleNet | Assisting network analysts to recognize and summarize cybersecurity data patterns | Data collection, data clustering, and visualization | Intrusion detection system that automatically flags essential network records as alerts for network analysts
[17] | SecGrid | Analyzing, classifying, and visualizing DDoS cyberattacks | Data collection, data clustering, and visualization | Different publicly available DDoS attack datasets
[18] | DDoSGrid | Analyzing and visualizing distributed denial-of-service (DDoS) attacks | Data collection, data clustering, and visualization | PCAP files created using a program
[13] | Cyberthreat platform | Offering real-time detection and visualization of different cyberattacks | Data collection from internal and external sources, data clustering, and visualization | Internal sources, such as logs related to the organization, and external sources from data available over the internet
The proposed method | Tableau dashboard | Freely and quickly accessing a real-time visual map to see essential information about attacks, their locations, time of occurrence, and names | Data collection, data preprocessing, data labeling, feature representation, data classification, evaluation, and data visualization | Labeled 21,796 tweets collected using the X platform API
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
