Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI

Elam, Marshall; Mink, Dustin; Bagui, Sikha S.; Plenkers, Russell; Bagui, Subhash C.

doi:10.3390/data10050059

Open Access Data Descriptor


by Marshall Elam ¹, Dustin Mink ², Sikha S. Bagui ¹,*, Russell Plenkers ¹ and Subhash C. Bagui ³

¹ Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
² Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA
³ Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA
* Author to whom correspondence should be addressed.

Data 2025, 10(5), 59; https://doi.org/10.3390/data10050059

Submission received: 19 February 2025 / Revised: 17 April 2025 / Accepted: 22 April 2025 / Published: 25 April 2025


Abstract

This paper describes the creation of a new dataset, UWF-ZeekData24, aligned with the Enterprise MITRE ATT&CK Framework, that addresses critical shortcomings in existing network security datasets. Controlling the construction of attacks and meticulously labeling the data provides a more accurate and dynamic environment for testing of IDS/IPS systems and their machine learning algorithms. The outcomes of this research will assist in the development of cybersecurity solutions as well as increase the robustness and adaptability towards modern day cybersecurity threats. This new carefully engineered dataset will enhance cyber defense mechanisms that are responsible for safeguarding critical infrastructures and digital assets. Finally, this paper discusses the differences between crowd-sourced data and data collected in a more controlled environment.

Keywords:

cybersecurity; network traffic; Enterprise MITRE ATT&CK Framework; labeled dataset; machine learning; AI; network security

1. Introduction

In the domain of cybersecurity, this paper seeks to address the urgent need for network security datasets that are effective and comprehensive. This need is attributable to the growing diversity and complexity of cyberattacks that target a wide range of entities from health care facilities to critical infrastructure. Analyzing attacks after they have happened is insufficient for advancing defenses against cybersecurity threats. The dataset created in line with this paper will seek to expand upon previously created datasets by utilizing a controlled approach to the orchestration of attacks, as well as the strict labeling of data alongside these attacks.

The central premise of this paper revolves around the utilization of the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) [1] framework in tandem with Zeek [2], an open-source network traffic analyzer. Creating datasets that reflect real-world observations is important when using the Enterprise MITRE ATT&CK framework as the cornerstone to understanding adversary behaviors and tactics. This framework is constantly kept up to date, and offers an expansive knowledge base of cyberattacks. Zeek excels in interpreting network traffic and generates detailed logs. These attributes make it an ideal tool to capture and analyze intricate network activities.

This paper combines the strengths of Zeek and the MITRE ATT&CK framework by capturing real (non-simulated) network data and labeling it according to the Enterprise MITRE ATT&CK framework, which is the main feature that distinguishes this dataset from other datasets. The dataset seeks to serve as a modern benchmark for network intrusion detection, that is, for identifying attack traffic. This paper also outlines the dataset’s potential applications, which include detecting pre-attack adversary behaviors and identifying various forms of attack traffic. To address the urgent need for a comprehensive network security dataset, this paper introduces UWF-ZeekData24, details its construction, and outlines a 4-week experiment that used automated attacks aligned with the Enterprise MITRE ATT&CK framework to produce data for machine learning training and testing. Complementing UWF-ZeekData22 [3], the outcome of this work, a precisely labeled dataset, is expected to further assist the development of cybersecurity solutions, making them more robust and adaptive to current threats. The goal is to enhance the cyber defense mechanisms that are responsible for safeguarding critical infrastructures and digital assets.

The rest of this paper is organized as follows. Section 2 presents the related works, that is, the state of the present datasets available; Section 3 presents the methodology and experimental components used to create and collect the data; Section 4 introduces the data; Section 5 presents network analysis of the data; Section 6 presents a comparison of the nature of this experimentally collected data with crowd-sourced data; Section 7 presents the conclusions; and Section 8 presents the future works.

2. Related Works

This section compares existing network intrusion detection, network security, and cybersecurity datasets. Datasets such as KDDCup99 [4] and NSL-KDD [5] suffer from unbalanced distributions, duplicate records, and a lack of coverage of more modern attack techniques. The KDDCup99 and NSL-KDD datasets cover only four attack families: DoS, user to root (U2R), remote to local (R2L), and probing. In comparison, the UWF-ZeekData24 dataset, a modern and accurately labeled dataset, includes fourteen attack families and follows the Enterprise MITRE ATT&CK Framework, which is the industry standard for identifying attack tactics and techniques. Each of the following datasets is described by some of its strengths, followed by some of its weaknesses.

The UNSW-NB15 dataset [6] contains over two million records. It comprises a wide variety of attacks and uses realistic network traffic. Its weaknesses include a lack of real-time network traffic data, network traffic captured over a short, fixed time period, and limited network topology information.

The UGR16 dataset [7] involves realistic network traffic, diverse use of network applications and services, and a large amount of botnet-related traffic, which provides ample data for training models that deal with botnets. Its weaknesses include limited features for each network connection, which may affect accuracy within intrusion detection models, and limited size compared to other network traffic models. This can impact the diversity and overall representation of normal network traffic patterns. There are also class imbalance issues, since there is a disproportionate number of normal instances compared to attack instances.

The CIC-IDS 2017 dataset [8] contains a wide variety of network attacks. There are also many records with millions of network flows and diverse network traffic, including IoT and mixed enterprise traffic. However, the CIC-IDS 2017 dataset [8] also has weaknesses. There is a lack of real-time network traffic, as well as a limited network context, since a complete network topology is not provided. Also, there is data imbalance since there is a significantly larger number of normal instances compared to attack instances.

The ToN-IoT dataset [9] comprises realistic IoT network traffic, a substantial amount of network flow records, and specific IoT data. This dataset’s strengths are also its weaknesses, as this dataset is extremely IoT specific. A network topology or user behavior is not provided for this dataset either.

The CIC-IoT23 dataset [10] has a tremendous focus on IoT-specific attacks, as well as comprehensive labeling, making it an ideal dataset for IoT security research. However, as an IoT focused dataset, its narrow view limits its application for broader network security analysis beyond IoT contexts.

The notable strength of one of the first papers related to UWF-ZeekData22 [3] is that it presents the detailed framework used for collecting and processing the UWF-ZeekData22 dataset [11]. The paper gives insights into the cyber range, which was used to generate network traffic data, and the architecture of the Hadoop-based Big Data platform for data storage. The paper also provides insight into the tools used in the data collection process. This comprehensive overview enhances the paper’s contribution by providing a solid foundation for understanding the dataset’s creation process.

While the UWF-ZeekData22 [3,11] dataset, collected and labeled by the University of West Florida (UWF), provides foundational insights into how malicious network traffic behaves, this research diverges by focusing on creating and matching mission logs for each attack event. By incorporating and addressing these elements, this research provides a clean and precise dataset for machine learners, thus expanding upon the baseline established by UWF-ZeekData22 [3].

This research paper aims to establish the framework for constructing a controlled dataset that is tailored specifically for training or testing machine learners dealing with Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS). This dataset, UWF-ZeekData24 [11], will significantly expand the field of modern cybersecurity data by addressing critical shortcomings found in popular existing datasets such as KDDCup99 [4] and NSL-KDD [5], as well as others. UWF-ZeekData24 not only reinforces but also contrasts with the UWF-ZeekData22 [3,11] dataset by meticulously controlling data acquisition, data storage, and sanitization of the data. It also aims to provide a more accurate and dynamic environment for testing and benchmarking IDS/IPS systems and the performance of their machine learning algorithms. Finally, unlike other papers, this paper seeks to fill technical gaps such as the scripting of the attacks that created the data and the demonstration of an accurate correlation of the data for labeling purposes.

3. Methodology

This section presents the experimental setup and architectural frameworks that were used, that is, the cyber range, the Hadoop cluster, and the MITRE ATT&CK Framework. The latter part of this section presents the generation of the data, that is, the labs and scripts that were used to generate the data.

3.1. Experimental Setup

The experiment takes place on UWF’s cyber range, hosted on vSphere. There are three subnets, each including Metasploitable 3 Ubuntu, Metasploitable 3 Windows, Kali Linux, Security Onion, Pfsense, and WebGoat machines. All subnets send network traffic logs from the local Security Onion to the central Security Onion, which then ships the logs to the Big Data platform, Hadoop, daily. NTP is configured across all virtual machines on the range to keep the timelines accurate and consistent. To create UWF-ZeekData24, automated nmap scans occur daily at random times using cron jobs and bash scripts. Each nmap scan script writes the source IP, destination IP, start time of the scan, and end time of the scan to a CSV file. This CSV file is filled and labeled daily by the script and then stored on Hadoop. As the experiment progresses, the attacks change. The scenario is that the attackers have used active Reconnaissance to scan the network from their machines, have successfully phished an employee, and have gained access to the network. After gaining access, they start scanning the network internally, searching for more attack vectors. The dataset is created to test the accuracy of machine learners in identifying malicious traffic and correlating the traffic to a tactic according to the Enterprise MITRE ATT&CK Framework.

3.2. Overall Architectural Framework

The framework for collecting the data uses the cyber range, whose network subnets each include a Kali Linux machine [12], Pfsense [13], Metasploitable 3 machines [14], and a Security Onion machine [15]. After the Security Onion machine captures the network attack data as PCAPs and Zeek logs, the data is shipped to UWF’s Big Data platform [3]. UWF’s Big Data platform includes Jupyter [16], Spark [17], and Hadoop [18]. Zeek logs and PCAPs, as well as mission logs, are transferred daily from the Security Onion machine [15] on the range to the Hadoop distributed file system for storage and labeling.

The network topology used in this research was designed to emulate a realistic enterprise environment, enabling the creation of a comprehensive dataset for cybersecurity testing. Figure 1 illustrates the network structure, which consists of 15 groups, with 3 of these groups detailed in the figure.

Each group is structured to mimic a segmented subnet that an organization would utilize. The key components in each group include a Pfsense firewall, multiple Linux and Windows virtual machines (VMs), and a Security Onion machine. Each of these VMs represents a role, such as a user workstation, a server, or a vulnerable machine. Each group is isolated within its own VLAN, simulating separate departments or network zones within an organization.

Figure 1. Network Topology showing 1, 2, N groups.

Figure 2 shows the instructor and student group network topologies. In each group, the Pfsense [13] firewall serves as the gateway, providing routing and security functions. It also connects to the group’s LAN and WAN through VLAN tagging, ensuring traffic segregation. This setup allows for testing scenarios involving internal and external threats, lateral movement, and different network policies.

Each group has a Security Onion machine [15], which is crucial for network monitoring and data collection. It captures and labels network traffic, generating PCAPs and Zeek logs that are essential in detecting threats. The Security Onion machine enables its group to hunt or analyze threats to that group’s subnet.

The topology has 15 group subnets but also includes an instructor network, which provides centralized control and oversight. The instructor’s machines and network infrastructure, including a primary Security Onion machine, are connected to the student groups. Security Onion [15] is a specialized Linux distribution that enables network security monitoring and offers tools for intrusion detection, network visibility, and data collection. The primary Security Onion machine centrally collects the captured PCAPs and Zeek logs from all the other groups, and these are then sent daily to the Hadoop cluster for storage and further analysis. The Hadoop cluster is a centralized big data platform where the collected data from all the groups is aggregated, processed, and analyzed using tools like Jupyter [16], Spark [17], and Hadoop [18].

Jupyter [16] is an interactive computing environment used for creating and sharing live code, which makes it ideal for data analysis. Spark [17], a powerful analytics engine, is used when processing large datasets in parallel, which allows for quick data transformations and analysis. Finally, Hadoop [18] provides a distributed storage and processing backbone, and enables efficient handling of large amounts of data for the entire network; this will be further explained in Section 3.3.

The segmented design of the network, combined with a centralized data collection and analysis approach, ensures that the dataset is not only rich in diverse attack scenarios but also is accurately labeled. This topology supports the development of more effective and adaptive cybersecurity solutions by providing a realistic and controlled environment for testing.

Figure 2. Instructor and Student group Network Topology.

3.3. Hadoop Cluster

UWF’s Big Data platform has evolved since it was used to create the UWF-ZeekData22 dataset [3]. The cluster has moved from Red Hat Enterprise Linux to Ubuntu Server Linux. There is presently one Hadoop name node and three Hadoop worker nodes. Figure 3 presents the Hadoop Cluster Interface. The current HDFS version is 3.3.1 on Ubuntu Server 22.04, as presented in Figure 3, and the cluster’s current storage capacity is 119.52 TB. The purpose of using Hadoop is to scale up from a single server to thousands of machines, so rather than relying on hardware alone to deliver high availability, the library is able to detect and handle failures at the application layer [3]. The Apache Hadoop software library allows for distributed processing of large datasets across many machines [18].

Figure 3. Hadoop Cluster Interface.

3.4. The Enterprise MITRE ATT&CK Framework

The MITRE ATT&CK framework [1] is a globally recognized knowledge base of adversary tactics, techniques, and procedures (TTPs) that outlines how attackers could potentially operate within a network. It is organized into tactics, techniques, and sub-techniques. Tactics are the adversary goals, techniques are the methods used to achieve those goals, and sub-techniques are more detailed ways a technique is executed. The framework serves as a behavioral model that can help plan, understand, detect, and mitigate cyber threats [1].

Tactics are used to represent high-level objectives or goals that an attacker aims to achieve at different stages of an attack. They are usually general strategies rather than specific actions. There are currently 14 tactics in the MITRE ATT&CK framework, each representing a phase in the adversary’s attack lifecycle [1].

Techniques describe more specific methods or ways an attacker can achieve a particular tactic. Each tactic can have multiple techniques associated with it, and this offers adversaries different options for accomplishing their goals. Techniques are much more detailed than tactics and can apply to multiple phases of an attack [1].

Sub-techniques are a further breakdown of techniques that are more granular or specific in their methodology. They provide a deeper level of detail about how a particular technique is carried out [1].
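The tactic and technique hierarchy described above can be illustrated with a small lookup structure. The following is a minimal sketch, not the authors' implementation: it uses only the technique IDs and tactic names that this paper associates with its attack scripts, and the names `TECHNIQUE_TACTICS` and `expand_labels` are hypothetical.

```python
# Technique-to-tactic pairs as stated in this paper for its attack scripts.
# The dictionary name and layout are illustrative assumptions.
TECHNIQUE_TACTICS = {
    "T1595": ["Reconnaissance"],                      # Active Scanning
    "T1078": ["Initial Access", "Defense Evasion",
              "Persistence", "Privilege Escalation"],  # Valid Accounts
    "T1110": ["Credential Access"],                   # Brute Force
    "T1190": ["Initial Access"],                      # Exploit Public-Facing Application
    "T1048": ["Exfiltration"],                        # Exfiltration Over Alternative Protocol
}


def expand_labels(technique_id):
    """Return one (technique, tactic) pair for every tactic the technique serves."""
    return [(technique_id, tactic)
            for tactic in TECHNIQUE_TACTICS.get(technique_id, [])]


valid_accounts = expand_labels("T1078")
```

A technique such as T1078 yields several pairs, which is why a single attack event can carry more than one tactic label.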

3.5. Generating and Collecting the Data

This section presents the labs and scripts used to generate the data.

3.5.1. Labs Used to Generate the Data

This data was created and collected from a private computer network that follows the template of the Cyber Wargaming course offered at the University of West Florida, Pensacola, Florida, USA. The course centers around the idea that every organization, public or private, requires IT professionals to safeguard their networks from potential threats. To impart practical knowledge, the course opted for a hands-on approach, simulating real-world scenarios that students are likely to encounter. The exercises allow students to assume distinct roles in both launching and guarding against IT infrastructure attacks. The private network used UWF’s cyber range, as presented in Figure 1.

To create this data, the same set of labs (presented in Table 1) used in UWF-ZeekData22 [3] were used. For this new dataset, UWF-ZeekData24, however, network traffic security incidents had to be created utilizing scripts to automate the attacks. Each attack originated from the Kali machine on the group’s subnet and attacked one other group’s Metasploitable 3 Ubuntu and Win2k8 [19] servers. Figure 4 is a traceroute from the Kali machine that gives an example of how an attack starts and reaches another group’s subnet. By utilizing the MITRE ATT&CK Framework [1], the attacks were orchestrated to follow some of the primary tactics that attackers would utilize. The attacks included in this dataset are Reconnaissance, Initial Access, Credential Access, and Exfiltration [1].

Table 1. Labs Used to Generate Attacks.

Lab	Description of Lab	MITRE ATT&CK Tactic	MITRE ATT&CK Technique to Be Used for Data Collection
Network Mapping	Use network mapping tools such as dig and nmap to perform footprinting (host discovery, port scanning) and OS fingerprinting; explain how network mapping can assist with security	Reconnaissance	Active Scanning
			Gather Victim Host Information
			Gather Victim Identity Information
			Gather Victim Network Information
Enumeration	DNS Enumeration; Port Scanning; SMB Enumeration; SMTP Enumeration	Reconnaissance	Active Scanning
			Gather Victim Host Information
			Gather Victim Identity Information
			Gather Victim Network Information
Attack Metasploit	Use network mapping tools such as dig and nmap to perform footprinting (host discovery, port scanning) and OS fingerprinting; explain how network mapping can assist with security; use Metasploit to exploit a system	Initial Access	External Remote Services
Password Attacks	Use mapping and exploitation tools to exploit a system and find user authentication information; use password cracking software to decrypt user passwords; use passwords to pivot to a database server and gain access; apply hacker zen to crack a user application	Credential Access	Brute Force
Password Attacks		Credential Access	OS Credential Dumping
Reconnaissance	Conduct Reconnaissance offensive cyber operations on target(s). Provide IPs, open ports, services, and possible vulnerabilities for use to identify possible exploits. Conduct detection defensive cyber operations on network. Using the MITRE ATT&CK framework, provide indicators of compromise and attribution of attacker(s)	Reconnaissance	Active Scanning
			Gather Victim Host Information
			Gather Victim Identity Information
			Gather Victim Network Information
Gaining Access	Conduct offensive cyber operations to gain access to target(s). Provide methods (i.e., step-by-step record of operations) and means (i.e., exploits) used to gain access to target(s). Conduct detection defensive cyber operations on network. Using the MITRE ATT&CK framework, provide indicators of compromise and attribution of attacker(s)	Initial Access	Exploit Public Facing Application
			External Remote Services
			Valid Accounts
		Credential Access	Brute Force
			Credentials from Password Stores
			Input Capture
			OS Credential Dumping
		Lateral Movement	Exploitation of Remote Services
			Lateral Tool Transfer
			Remote Services Session Hijacking
			Remote Services
Execution	Conduct offensive cyber operations by using established persistence into target’s network and lateral movement to their Metasploitable server to exfiltrate the msfadmin user’s ssh key pair. Provide methods (i.e., step-by-step record of operation) and means used to persist access to target(s). Conduct detection defensive cyber operations on network. Using the MITRE ATT&CK framework (i.e., https://mitre-attack.github.io/attack-navigator/ (accessed on 28 August 2024)), provide indicators of compromise and attribution of attacker(s)	Collection	Automated Exfiltration

Figure 4. Traceroute of a Subnet from Attacker Kali.

3.5.2. Scripts Used to Generate the Data

The attack scripts share the same overall architecture; they differ only in the Metasploit exploit and payload that they use. Using Metasploit’s resource script configuration, these attacks were automated by loading the resource script and running the Metasploit module. To give a better overall understanding of the attacks that were conducted, the timings and descriptions of each of the attacks are explained in the following sections.
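To make the resource-script idea concrete, the sketch below generates the text of a Metasploit resource script for one of the modules named in this paper. It is an illustration under stated assumptions rather than the authors' script: the helper name `build_resource_script`, the option values, and the placeholder addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and the actual options used in the experiment are not given in the text.

```python
def build_resource_script(module, payload, options):
    """Return the text of a Metasploit resource (.rc) script.

    Each line is a console command: select the module, set the payload,
    set every option, run the module, then leave the console.
    """
    lines = [f"use {module}", f"set PAYLOAD {payload}"]
    lines += [f"set {name} {value}" for name, value in options.items()]
    lines += ["run", "exit"]
    return "\n".join(lines) + "\n"


# Module and payload names come from the ProFTPD attack described in this
# paper; the hosts below are placeholder values, not the experiment's.
script = build_resource_script(
    "exploit/unix/ftp/proftpd_modcopy_exec",
    "cmd/unix/reverse_perl",
    {"RHOSTS": "192.0.2.10", "LHOST": "192.0.2.20"},
)
```

Saved to a file, such a script can be loaded with `msfconsole -q -r <file>`, which is how a cron-driven wrapper could run an attack without interaction.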

Nmap is a tool used in cybersecurity to explore networks and find vulnerabilities. It does this by identifying active hosts, open ports, and available services on a network, which helps in assessing a network’s security posture [20]. Pseudocode 1 is an nmap scan script that was executed using native Linux cron jobs four times a day at 00:00, 05:00, 11:00, and 17:00. After being triggered, the script slept for a random duration of up to 3400 s (about 56.7 min), meaning that the scan occurred at a random time within the hour in which it was triggered. This script leverages MITRE ATT&CK T1595 (Active Scanning) in order to simulate scanning activity. T1595 is a technique under the Reconnaissance tactic. The scan was run using the T4 timing option, the TCP option, and a specific port option, and the scan results were output to an .xml file in order to retain the metadata of the scans. The specific ports and IP addresses were defined by variables within the script, and all the variables were appended to a mission log.

Pseudocode 1: Nmap Scan

for time in [00:00, 05:00, 11:00, 17:00] do
    sleep_random(3400)
    exploit_mitre_att&ck_t1595()
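The scheduling pattern in Pseudocode 1 (a fixed cron trigger followed by a random sleep) can be sketched as follows. This is an illustrative sketch only: it assumes the sleep is drawn uniformly between 0 and the stated maximum, which the paper does not specify, and the function name `jittered_start`, the seed, and the date are placeholders.

```python
import random
from datetime import datetime, timedelta

MAX_SLEEP_SECONDS = 3400      # upper bound on the sleep, from Pseudocode 1
CRON_HOURS = [0, 5, 11, 17]   # cron trigger hours, from Pseudocode 1


def jittered_start(trigger, max_sleep_seconds, rng):
    """Return the moment the scan would begin after a random sleep.

    Assumption: the delay is uniform on [0, max_sleep_seconds].
    """
    delay = rng.uniform(0, max_sleep_seconds)
    return trigger + timedelta(seconds=delay)


rng = random.Random(24)       # fixed seed only so the sketch is repeatable
day = datetime(2024, 1, 1)    # placeholder date, not taken from the experiment
starts = [jittered_start(day.replace(hour=h), MAX_SLEEP_SECONDS, rng)
          for h in CRON_HOURS]
```

Because 3400 s is less than the 3600 s in an hour, every start time stays inside the hour in which cron triggered the script.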

PsExec is a lightweight telnet-replacement tool that allows users to execute processes on remote systems without needing to install a client [21]. Pseudocode 2 is a PsExec exploit script that was executed using native Linux cron jobs four times a day at 01:00, 07:00, 12:00, and 18:00. After being triggered, the script slept for a random duration of up to 3500 s (about 58.3 min), meaning that the attack occurred at a random time within the hour in which it was triggered. This script leverages MITRE ATT&CK T1078 (Valid Accounts) in order to simulate attack activity. T1078 is a technique under the Initial Access, Defense Evasion, Persistence, and Privilege Escalation tactics, so this attack has multiple labels, which can be cross-referenced in the MITRE ATT&CK Framework. This attack utilized the “exploit/windows/smb/psexec” exploit and the “windows/x64/meterpreter/reverse_tcp” payload within Metasploit, configured by a resource script that is defined within the algorithm.

Pseudocode 2: Psexec Exploit

for time in [01:00, 07:00, 12:00, 18:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1078()

GlassFish is an open-source Java EE application server that provides a platform for developing, deploying, and managing Java-based web applications and services [22]. Pseudocode 3 is a password brute-force GlassFish exploit script that was executed using native Linux cron jobs four times a day at 02:00, 08:00, 14:00, and 19:00. After being triggered, the script slept for a random duration of up to 3500 s (about 58.3 min), meaning that the attack occurred at a random time within the hour in which it was triggered. This script leverages MITRE ATT&CK T1110 (Brute Force) in order to simulate attack activity. T1110 is a technique under the Credential Access tactic, and this attack utilized the “auxiliary/scanner/http/glassfish_login” module and the “php/meterpreter/reverse_tcp” payload within Metasploit. The options are configured by a resource script that is defined within the algorithm. In this algorithm, a wordlist file was used to replicate a brute force attack.

Pseudocode 3: GlassFish Exploit

for time in [02:00, 08:00, 14:00, 19:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1110()

ProFTPD is open-source File Transfer Protocol (FTP) server software that is commonly used for hosting FTP services due to its flexibility, security features, and ease of configuration [23]. Pseudocode 4 is a ProFTPD exploit script that was executed using native Linux cron jobs four times a day at 03:00, 09:00, 15:00, and 21:00. After being triggered, the script slept for a random duration of up to 3500 s (about 58.3 min), meaning that the attack occurred at a random time within the hour in which it was triggered. This script leverages MITRE ATT&CK T1190 (Exploit Public-Facing Application) in order to simulate attack activity. T1190 is a technique under the Initial Access tactic, and this attack utilized the “exploit/unix/ftp/proftpd_modcopy_exec” exploit and the “cmd/unix/reverse_perl” payload within Metasploit. The options are configured by a resource script that is defined within the algorithm, and in this algorithm a site path was set where the payload would be dropped.

Pseudocode 4: ProFTPD Exploit

for time in [03:00, 09:00, 15:00, 21:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1190()

Server Message Block (SMB) is a network file-sharing protocol that is primarily used by Windows systems to allow shared access to files, printers, and other resources on a local network. By operating over TCP/IP, it facilitates communication between devices by enabling them to read and write to shared files [24]. Pseudocode 5 is an SMB exploit script that was executed using native Linux cron jobs four times a day at 04:00, 10:00, 16:00, and 22:00. After being triggered, the script slept for a random duration of up to 3500 s (about 58.3 min), meaning that the attack occurred at a random time within the hour in which it was triggered. This script leverages MITRE ATT&CK T1078 (Valid Accounts) and T1048 (Exfiltration Over Alternative Protocol) [1] in order to simulate attack activity. T1078 is a technique under the tactics of Initial Access, Defense Evasion, Persistence, and Privilege Escalation, and T1048 is a technique under Exfiltration. These attacks utilized the “exploit/windows/smb/psexec” exploit and the “windows/x64/meterpreter/reverse_tcp” payload within Metasploit. The options are configured by a resource script that is defined within the algorithm. In this algorithm, a compromised username and password were entered to replicate valid account attacks. This script differs from the others because it is a two-step attack in which initial access is gained and then information is exfiltrated from the victim machine.

Pseudocode 5: SMB Exploit

for time in [04:00, 10:00, 16:00, 22:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1078()
    exploit_mitre_att&ck_t1048()

Pseudocode 6 is an example of one of the attack scripts used to create this dataset. It is a bash script that sets all of the data fields present in a mission log at the start of an attack, runs the attack, and then records the ending timestamp immediately after the attack. This is done to make sure that the attacks are correlated precisely and that the mission log covers no time during which the attack is not occurring. After these steps, the script writes all the data fields that were created into the mission log CSV in labeled columns.

Pseudocode 6: Attack Script

# Define the CSV file path
Set CSV_FILE to "/home/kali/nmap_scan.csv"

# Get the current timestamp in "MM/DD/YYYY HH:MM:SS" format
Set TIMESTAMP to current date in "MM/DD/YYYY HH:MM:SS" format

# Set constants
Set GROUP_NUMBER to "1"
Set TACTIC_ID to "T1595"
Set SOURCE_IP to "143.88.1.18"
Set SOURCE_PORT to "" (empty)
Set TARGET_IP to "143.88.2.1-21"
Set TARGET_PORT to "445"

# Get the start time and date in UTC and split it into components
Set START_TIME_DATE to current date in "YYYY-MM-DDTHH:MM:SSZ" format (UTC)
Set START_YEAR to current year in "YYYY" format
Set START_MONTH to current month in "MM" format
Set START_DAY to current day in "DD" format
Set START_TIME to current time in "HH:MM:SS" format

# Run the nmap scan with specified target IP, port, and output file
Run nmap with options:
    - Timing template "T4"
    - Port set to TARGET_PORT
    - Target IP set to TARGET_IP
    - Output results in XML format to "nmapOut.xml"

# Get the end time and date components
Set END_YEAR to current year in "YYYY" format
Set END_MONTH to current month in "MM" format
Set END_DAY to current day in "DD" format
Set END_TIME to current time in "HH:MM:SS" format

# Write all collected data into the CSV file
Append to CSV_FILE:
    TIMESTAMP, GROUP_NUMBER, TACTIC_ID, SOURCE_IP, SOURCE_PORT, TARGET_IP, TARGET_PORT,
    START_TIME_DATE, START_YEAR, START_MONTH, START_DAY, START_TIME, END_YEAR,
    END_MONTH, END_DAY, END_TIME
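The bracketing logic of Pseudocode 6 (record the start, run the attack, record the end, append one row) can be sketched in Python as follows. This is a hedged illustration, not the authors' bash script: the function name `run_logged_attack` is hypothetical, the attack is passed in as a callable so that no scan is actually run, all timestamps are taken in UTC as a simplifying assumption, and the log is written to a temporary file rather than the path used in the experiment.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def run_logged_attack(attack, csv_path, group, technique_id,
                      source_ip, target_ip, target_port, source_port=""):
    """Run `attack` and append one mission-log row that brackets it in time."""
    start = datetime.now(timezone.utc)   # taken immediately before the attack
    attack()
    end = datetime.now(timezone.utc)     # taken immediately after the attack
    row = [
        start.strftime("%m/%d/%Y %H:%M:%S"),          # TIMESTAMP
        group, technique_id, source_ip, source_port, target_ip, target_port,
        start.strftime("%Y-%m-%dT%H:%M:%SZ"),         # START_TIME_DATE
        start.strftime("%Y"), start.strftime("%m"),
        start.strftime("%d"), start.strftime("%H:%M:%S"),
        end.strftime("%Y"), end.strftime("%m"),
        end.strftime("%d"), end.strftime("%H:%M:%S"),
    ]
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow(row)
    return row


# Usage with a no-op stand-in for the attack; constants mirror Pseudocode 6.
log_path = os.path.join(tempfile.mkdtemp(), "mission_log.csv")
row = run_logged_attack(lambda: None, log_path, "1", "T1595",
                        "143.88.1.18", "143.88.2.1-21", "445")
```

Taking the end timestamp directly after the attack returns keeps the logged interval as tight as possible, which is what the later time-window join relies on.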

3.6. Mapping and Labeling Data

This dataset was mapped and labeled by primarily targeting DNS and Conn data files, with a focus on aligning attack indicators using the MITRE ATT&CK framework [1]. Figure 5 presents a flowchart of the process. The numbers in the figure correspond to the numbered list presented next.

1. Preprocessing Mission Logs

Time Conversion: Mission log timestamps are converted to epoch time.
Array Creation: Arrays are built for specific features within the logs, such as source/destination ports, source/destination IPs, and attack indicators.

2. Preprocessing Conn Data

As with the mission logs, timestamps are converted to epoch time, and attribute names that contain “.” are renamed in order to maintain compatibility with Spark processing.

3. Joining Mission Logs with Conn Data

Mission logs and Conn data are joined based on time intervals (with a slop factor, which was 1 min in this case), IP addresses, and port numbers.
After they are joined, the Conn data inherits the attack information taken from the mission logs.

4. Merging with STIX Data

Labeled Conn data is combined with STIX data in order to enhance MITRE technique-to-tactic mappings.
Flattening the array structures in the IP and attack fields allows for cases where a single technique relates to multiple tactics.

5. Final Labeled Conn Data

Benign entries are labeled with mitre_attack == none and label_tactic == none.

6. Final Labeled DNS Data

Finally, the labeled DNS dataset is created by joining labeled Conn data with raw DNS data using unique identifiers (uid).

Figure 5. Data Creation Process.
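
Steps 1 to 3 of the list above can be illustrated with a small plain-Python sketch. It is not the authors' Spark pipeline: the record layout and the field names (`ts`, `src_ip`, `dest_ip`, `dest_port`, `dest_ips`, `technique`) are hypothetical, and only the 1 min slop factor comes from the text.

```python
from datetime import datetime, timezone

SLOP_SECONDS = 60  # the 1 min slop factor mentioned in step 3


def to_epoch(stamp):
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' UTC string to epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def matches(conn, mission, slop=SLOP_SECONDS):
    """True when a conn record lies in the mission window and shares IPs and port."""
    in_window = (to_epoch(mission["start"]) - slop
                 <= conn["ts"]
                 <= to_epoch(mission["end"]) + slop)
    same_hosts = (conn["src_ip"] == mission["src_ip"]
                  and conn["dest_ip"] in mission["dest_ips"])
    same_port = conn["dest_port"] == mission["dest_port"]
    return in_window and same_hosts and same_port


def label_conn(conn_records, mission_logs):
    """Copy the technique of the first matching mission log onto each record.

    Records without a match are treated as benign and labeled 'none'.
    """
    labeled = []
    for conn in conn_records:
        hit = next((m for m in mission_logs if matches(conn, m)), None)
        labeled.append({**conn, "mitre_attack": hit["technique"] if hit else "none"})
    return labeled
```

A nested loop like this is only practical for small samples; a distributed join, as used by the authors, is needed at the scale of the full Conn log.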

This method integrates MITRE ATT&CK labeling into the dataset while handling unclear cases and cases with multiple mappings, resulting in a MITRE ATT&CK labeled dataset that is ready to be analyzed.
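
The technique-to-tactic step (step 4 in the list above) might look like the following sketch. The two objects are hand-written, trimmed stand-ins that follow the layout of ATT&CK STIX `attack-pattern` objects (`external_references` and `kill_chain_phases`); they are not the full STIX bundle, and the function names and the `mitre_attack` / `label_tactic` row layout are assumptions for illustration.

```python
# Trimmed, hand-written stand-ins for ATT&CK STIX attack-pattern objects.
STIX_OBJECTS = [
    {
        "type": "attack-pattern",
        "external_references": [{"source_name": "mitre-attack", "external_id": "T1078"}],
        "kill_chain_phases": [
            {"kill_chain_name": "mitre-attack", "phase_name": "defense-evasion"},
            {"kill_chain_name": "mitre-attack", "phase_name": "persistence"},
            {"kill_chain_name": "mitre-attack", "phase_name": "privilege-escalation"},
            {"kill_chain_name": "mitre-attack", "phase_name": "initial-access"},
        ],
    },
    {
        "type": "attack-pattern",
        "external_references": [{"source_name": "mitre-attack", "external_id": "T1595"}],
        "kill_chain_phases": [
            {"kill_chain_name": "mitre-attack", "phase_name": "reconnaissance"},
        ],
    },
]


def technique_to_tactics(stix_objects):
    """Map each technique ID to the list of tactic names it belongs to."""
    mapping = {}
    for obj in stix_objects:
        if obj.get("type") != "attack-pattern":
            continue
        ids = [ref["external_id"] for ref in obj.get("external_references", [])
               if ref.get("source_name") == "mitre-attack"]
        phases = [phase["phase_name"] for phase in obj.get("kill_chain_phases", [])
                  if phase.get("kill_chain_name") == "mitre-attack"]
        for technique_id in ids:
            mapping[technique_id] = phases
    return mapping


def explode_tactics(labeled_rows, mapping):
    """Emit one row per (record, tactic) pair; unmapped rows get tactic 'none'."""
    flat = []
    for row in labeled_rows:
        for tactic in mapping.get(row["mitre_attack"], ["none"]):
            flat.append({**row, "label_tactic": tactic})
    return flat
```

With this layout, a single record labeled with a technique that belongs to several tactics produces several labeled rows, one per tactic.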

4. The Dataset

The dataset combines Zeek logs, the Enterprise MITRE ATT&CK framework, and mission logs to provide a comprehensive record of network attacks. To understand how this dataset was created, an overview of both Zeek and the mission logs is needed.

4.1. Zeek Logs

Zeek [2] is a highly customizable network security monitor designed to capture and analyze network traffic. It does this by creating logs that record detailed information about network activities, such as connections, DNS queries, and different protocols. Its structured and comprehensive format makes it easier to understand network behavior and identify suspicious activities. These logs are generated in real time as network traffic passes through Zeek’s monitoring system [2], which in this infrastructure is built into Security Onion [15].

The reason for collecting Zeek logs is that they provide more insight into application-layer behaviors and give comprehensive connection tracking that is not available in other simpler log types. Table 2 shows the Zeek log files that were collected in the experimentation process, the total count of records in each file, and a description for each of the files.

Table 2. Zeek Files in the UWF-ZeekData24 Dataset.

Name	Total Count	Description
mission_logs	29,550	Used for collating records.
Broker	19,818	Communication file used to enforce asynchronous distributed communication, as well as to interact with persistent data stores.
capture_loss	19,746	Shows how well Zeek’s management and analysis tools are working. A missing TCP sequence set is correlated to a “gap” of lost data. This lost data results in a capture_loss file.
Cluster	84	Zeek cluster messages.
conn-summary	4433
Conn	46,991,170	Tracks protocols and associated information such as IP addresses, durations, transferred (two way) bytes, states, packets, and tunnel information. Conn files provide all data regarding the connection between two points.
dhcp	32,113	Helps correlate IP addresses and MAC addresses and potentially hostnames. From a security standpoint, this allows for the confirmation of connected systems/services and potential intrusion detection by determining which system is assigned to which IP address.
dns	59,041,059	Provides a swath of information on how specific systems access and utilize the internet and other systems; focuses on the system that is asking a question and all elements of the question and its associated answer.
loaded_scripts	1455
Notice	7111	An event that Zeek learning has determined to be inspection-worthy; these are often higher-level alerts such as self-signed certs and are Zeek’s approximate equivalent to IDS alerts.
Reporter	4	Internal error/warning messages.
Stats	34,549	Memory/event/packet/lag statistics.
Stderr	21	Captures standard errors when Zeek is started from ZeekControl.
Stdout	32	Captures standard outputs when Zeek is started from ZeekControl.
Weird	4000	Anything that does not fall into any other category.

4.2. Mission Logs

Mission logs play an important role in this dataset. They are used to label attacks and correlate them with the Zeek logs using MITRE ATT&CK Framework [1] techniques and timestamps. Each attack is recorded in a mission log entry, which stores information such as the MITRE ATT&CK technique [1], the source port and IP address, the destination port and IP address, and the starting and ending timestamps (in UTC).

4.3. Tactics and Techniques in UWF-ZeekData24

This dataset presents 7 of the 14 total MITRE ATT&CK Framework [1] tactics. Table 3 shows the tactics found in UWF-ZeekData24 [11]. These primary tactics are easily identifiable because a network connection is required to initiate the attack, which makes the correlation between the Zeek and mission logs easier. Reconnaissance and Credential Access make up the bulk of the attacks, as these are usually the first steps in the adversary’s attack chain and enable further tactics to be deployed later.

Table 3. Tactics in UWF-ZeekData24.

Attack Type	Description
Reconnaissance	Active or passive tactics for gathering information that can be used to plan future operations.
Discovery	Tactics that may be used to gain knowledge about the system and internal network.
Credential access	Tactics for stealing credentials such as account names and passwords.
Privilege escalation	Tactics used to gain higher-level permissions on systems or networks.
Exfiltration	Tactics used to steal data from the network.
Initial access	Tactics that use various entry vectors to gain an initial foothold within the network.
Persistence	Tactics used to keep access to systems across restarts, changed credentials, and other interruptions.

Table 4 presents the distribution of malicious traffic in the dataset, labeled using the MITRE ATT&CK framework. In this dataset, Credential Access makes up approximately 90.88% of all attacks, followed by Reconnaissance at 6.06%. Table 5 lists the specific MITRE ATT&CK techniques present in UWF-ZeekData24.

Table 4. Tactic Counts in UWF-ZeekData24.

MITRE Tactic Attack Type	Count	%
Credential Access	871,188	90.88
Reconnaissance	58,095	6.06
Initial Access	10,662	1.11
Privilege Escalation	6048	0.631
Persistence	6048	0.631
Defense Evasion	6048	0.631
Exfiltration	559	0.058
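
The percentage column of Table 4 can be recomputed from the counts as each tactic's share of all labeled malicious records. The short sketch below uses only the counts listed in Table 4; the variable and function names are illustrative.

```python
# Counts copied from Table 4.
TACTIC_COUNTS = {
    "Credential Access": 871_188,
    "Reconnaissance": 58_095,
    "Initial Access": 10_662,
    "Privilege Escalation": 6_048,
    "Persistence": 6_048,
    "Defense Evasion": 6_048,
    "Exfiltration": 559,
}


def shares(counts):
    """Return each count as a percentage of the sum of all counts."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


for name, pct in shares(TACTIC_COUNTS).items():
    print(f"{name:22s}{pct:8.3f} %")
```

The counts sum to 958,648, which matches the number of malicious records reported in Table 6.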

Table 5. Technique Count in UWF-ZeekData24.

MITRE Technique	Count	%
T1110	871,188	90.87
T1595	58,095	6.06
T1078	6048	1.89
T1190	4614	0.63
T1048	559	0.48

5. Traffic Analysis

Table 6 presents the distribution of malicious and non-malicious traffic in UWF-ZeekData24.

Table 6. Distribution of malicious traffic in UWF-ZeekData24.

Traffic_Type_Relabelled	Count	%
Malicious traffic	958,648	50
Non-malicious traffic	958,561	50

Traffic Analysis of Cumulative Flows

Table 7 presents a summary traffic analysis for all the cumulative flows of network data during the period of data collection while the UWF-ZeekData24 dataset was being created. This table shows the total source bytes, destination bytes, number of source packets, number of destination packets, protocol types, number of normal and abnormal records, as well as the number of unique source/destination IP addresses in the data.

Table 7. Summary traffic analysis of UWF-ZeekData24.

Features	Sub-Features	Counts
src_bytes		1,063,460,303
dest_bytes		12,461,401,543
src_pkts		45,053,268
dest_pkts		53,723,672
Protocol Types	udp	928,896
	icmp	2691
	tcp	985,622
Unique	src_ip	55
Unique	dest_ip	166
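
As a quick consistency check, the per-protocol record counts in Table 7 can be compared with the two traffic classes in Table 6, and the byte and packet totals can be turned into average bytes per packet. The sketch below uses only numbers from the two tables; the derived averages are illustrative and are not reported in the paper.

```python
# Values copied from Table 6 and Table 7.
TRAFFIC_COUNTS = {"malicious": 958_648, "non_malicious": 958_561}
PROTOCOL_COUNTS = {"udp": 928_896, "icmp": 2_691, "tcp": 985_622}
BYTES = {"src": 1_063_460_303, "dest": 12_461_401_543}
PACKETS = {"src": 45_053_268, "dest": 53_723_672}

total_by_label = sum(TRAFFIC_COUNTS.values())
total_by_protocol = sum(PROTOCOL_COUNTS.values())

# Average payload size per packet in each direction.
bytes_per_packet = {side: BYTES[side] / PACKETS[side] for side in BYTES}

print("records by label:   ", total_by_label)
print("records by protocol:", total_by_protocol)
for side, value in bytes_per_packet.items():
    print(f"mean {side} bytes per packet: {value:.1f}")
```

Both totals come to 1,917,209 records, so the protocol breakdown in Table 7 accounts for every record counted in Table 6.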

6. Crowd-Sourced Data Versus Controlled Data

This section graphically compares crowd-sourced data with data collected in a controlled experiment. UWF-ZeekData22 [3,11] is crowd-sourced, whereas the newly created UWF-ZeekData24 [11] was collected in a controlled environment. To keep the comparison of the attack data consistent, mission logs from identical log positions were selected in both datasets, specifically positions 5, 8, 12, 20, 28, 32, 42, 43, 44, and 111. In total, 20 mission logs were compared using plots on a windowing graph. The first ten windowing graphs (Figures 6–15) are from the UWF-ZeekData22 dataset [11], and the second ten (Figures 16–25) are from the UWF-ZeekData24 dataset [11].

Snapshots of the Reconnaissance attack data from the two datasets are compared. The plots (Figures 6–25) display the timestamp on the x-axis and the volume of network traffic on the y-axis, with key attack events marked using mission log data. The start (green checkered line) and stop (red checkered line) points of the attacks are based on their recorded timestamps from the mission logs. The overall network activity (solid blue line) indicates the amount of network traffic logged at those timestamps. Finally, the dots along the traffic lines indicate the IP addresses involved in the activity, matched between the Zeek logs and the IP columns in the mission logs ((Source OR Destination IP) OR (Destination OR Source IP)).

Since UWF-ZeekData22 contained far fewer Reconnaissance mission logs (fewer than 100), these log positions were analyzed to extract the timestamps marking the start and stop of attacks, along with the associated IP addresses. UWF-ZeekData24 had around 51,000 Reconnaissance mission logs recorded. These visualizations provide a clear representation of the Reconnaissance attacks and allow a detailed comparison of how precisely and accurately each dataset correlates attack labels with Zeek data. The UWF-ZeekData22 plots were created using a larger window and a one-minute frequency count, while the UWF-ZeekData24 plots were created using a smaller window and a one-second frequency count.
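
The frequency counts behind such windowing plots can be sketched as follows. This is an illustrative reconstruction, not the authors' plotting code: the function name and parameters are hypothetical, timestamps are assumed to be epoch seconds, and `bucket` would be 60 for a one-minute count or 1 for a one-second count.

```python
from collections import Counter


def frequency_counts(timestamps, start, end, slop, bucket):
    """Count records per bucket inside the window [start - slop, end + slop].

    Returns a list of (bucket_start, count) pairs, including empty buckets,
    so the result can be plotted directly as a traffic line.
    """
    lo, hi = start - slop, end + slop
    counts = Counter(int((ts - lo) // bucket)
                     for ts in timestamps if lo <= ts <= hi)
    n_buckets = int((hi - lo) // bucket) + 1
    return [(lo + i * bucket, counts.get(i, 0)) for i in range(n_buckets)]


# Example: a 10 s attack window, 20 s of slop on each side, 10 s buckets.
series = frequency_counts([100, 100.5, 101, 130, 500],
                          start=100, end=110, slop=20, bucket=10)
print(series)
```

A smaller bucket gives a finer-grained traffic line, which is practical only when the start and stop timestamps are precise enough to keep the window short.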

6.1. UWF-ZeekData22 Plots

UWF-ZeekData22 shows the traffic of all 15 subnets and was plotted with a slop factor of ±5 min. A slop factor of ±5 min means that timestamps from the mission logs and the plotted network traffic data can differ by up to 5 min in either direction. This accounts for potential delays, inaccuracies in logging, or variations in how events are recorded. Even with a slop factor of ±5 min, Figures 6–15, from UWF-ZeekData22, reveal significant variability in how accurately Reconnaissance attacks are represented. Of the ten plots analyzed, Figures 12–14 show patterns that align somewhat with the expected characteristics of Reconnaissance activity based on the mission log data.

Because it is crowd-sourced, the data can contain human error, as demonstrated in Figures 6–11. From the high spikes in network traffic in these plots, which are used to label attacks, it is difficult to determine whether an attack is occurring within the start and stop timestamps provided. Missing timestamps, mislabeled data, and missing markers for critical IP addresses are also problems in crowd-sourced data. Figures 12–14 are the most accurate of the ten plots: they accurately mark the start and stop times of attacks and correctly associate IP addresses with the corresponding traffic spikes. Figure 6 shows a 10-hour difference between the correlated plots and the timestamps, which could be due to factors such as human error or incorrectly configured NTP servers.

Figure 6. Mission Log 5 from UWF-ZeekData22.

Figure 7 and Figure 8 show good start and stop timestamps, with correlation dots in line with the network activity line. The blue line and the dots do not coincide exactly because of the polling rate used for this dataset’s plots.

Figure 7. Mission Log 8 from UWF-ZeekData22.

Figure 8. Mission Log 12 from UWF-ZeekData22.

Figure 9. Mission Log 20 from UWF-ZeekData22.

Figure 10. Mission Log 28 from UWF-ZeekData22.

Figure 11 has accurate timestamps, except that the correlated dot at the peak of the network traffic falls just outside them.

Figure 11. Mission Log 33 from UWF-ZeekData22.

The timestamps in Figures 12–14 are accurate and precise: each plot has a clear spike in network traffic and a correlated dot inside the timestamps. The other correlation dots come from other mission log entries that match the IP address, and they correlate correctly across all three figures.

Figure 12. Mission Log 42 from UWF-ZeekData22.

Figure 13. Mission Log 43 from UWF-ZeekData22.

Figure 14. Mission Log 44 from UWF-ZeekData22.

Figure 15. Mission Log 111 from UWF-ZeekData22.

6.2. UWF-ZeekData24 Plots

Figures 16–23 were plotted with a slop factor of ±2 min. The slop factor could be reduced further while still maintaining precision and accuracy. Since UWF-ZeekData24 had a larger number of attacks occurring more frequently than UWF-ZeekData22, this set of plots (Figures 16–23) was generated using only the destination subnet traffic, rather than the traffic of all 15 subnets as for UWF-ZeekData22. A higher network traffic floor would make it harder to visualize the attacks on the graph.

Since UWF-ZeekData24 comes from a controlled environment and is not crowd-sourced, the ten plots (Figures 16–25) show evidence of the accuracy of the timestamps. This means that the slop factor could be reduced to a few seconds before and after the timestamps, since the attack scripts record the timestamps immediately when the attack starts and stops.
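
One way to quantify how far the slop factor could be reduced is to compute, for each mission log, the smallest slop that still covers every matched event. The helper below is a hypothetical sketch of that idea, with all times in epoch seconds; it is not a procedure described by the authors.

```python
def required_slop(start, end, event_times):
    """Smallest slop (in seconds) that places every event in [start - slop, end + slop].

    Events already inside [start, end] need no slop and contribute 0.
    """
    overshoot = [max(start - t, t - end, 0) for t in event_times]
    return max(overshoot, default=0)


# Events inside the window need no slop; one event 3 s late needs 3 s.
print(required_slop(100, 110, [101, 105]))
print(required_slop(100, 110, [98, 113]))
```

Taking the maximum of this value over all mission logs would give a lower bound on the slop factor that keeps every matched event labeled.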

Figures 16–18 and 20–24 show clear, distinct network floors and high peaks within the timestamps. Two plots, Figures 19 and 25, are much more compact, with many highs and lows, which indicates considerably more network noise. Figures 16–18 show distinct peaks and correlated dots within the start and stop timestamps. This shows the increased precision and accuracy compared to the earlier graphs created from the crowd-sourced data (Figures 6–15).

Figure 16. Mission Log 5 from UWF-ZeekData24.

Figure 17. Mission Log 8 from UWF-ZeekData24.

Figure 18. Mission Log 12 from UWF-ZeekData24.

Figure 19. Mission Log 20 from UWF-ZeekData24.

Figures 20–24 are all accurate and have corresponding points (green dots) at the peaks of the network activity (blue line). This shows how consistent this data is compared to crowd-sourced data such as UWF-ZeekData22.

Figure 20. Mission Log 28 from UWF-ZeekData24.

Figure 21. Mission Log 33 from UWF-ZeekData24.

Figure 22. Mission Log 42 from UWF-ZeekData24.

Figure 23. Mission Log 43 from UWF-ZeekData24.

Figure 24. Mission Log 44 from UWF-ZeekData24.

Figure 25. Mission Log 111 from UWF-ZeekData24.

7. Conclusions

This paper outlines a comprehensive framework for constructing a modern network security dataset. It provides a curated dataset covering several Enterprise MITRE ATT&CK Framework tactics that can be used as a training or testing dataset. The methodology and scripts can serve as a guide for future datasets that expand coverage of the MITRE ATT&CK Framework by including additional tactics, techniques, and procedures and by collecting host-based logs as well as network traffic logs. Finally, comparing UWF-ZeekData22 and UWF-ZeekData24 allows the pros and cons of crowd-sourced and controlled data to be weighed. The plots of UWF-ZeekData24 show clearer, more distinct network floors and high peaks within the timestamps, and much less noise. Crowd-sourced data provides more anomalous behaviors and a broader range of attacks, while controlled data provides large amounts of precisely labeled data.

8. Future Works

Since UWF-ZeekData24 was created in a controlled environment, it can be used as either a training or a testing dataset for classifying attacks with Machine Learning classifiers such as Decision Trees, Naïve Bayes, and Random Forest, to name a few. The next step will be to use UWF-ZeekData24 for Machine Learning. This dataset also forms a basis for AI-based research.

Author Contributions

Conceptualization, D.M., M.E. and S.S.B.; methodology, D.M., M.E. and S.S.B.; software, D.M., M.E. and R.P.; validation, D.M., M.E., S.S.B., R.P. and S.C.B.; formal analysis, D.M., M.E., S.S.B. and R.P.; investigation, D.M., M.E., S.S.B. and R.P.; resources, S.S.B., D.M. and S.C.B.; data curation, D.M., M.E. and R.P.; writing—original draft preparation, M.E., D.M. and S.S.B.; writing—review and editing, S.S.B., D.M. and S.C.B.; visualization, D.M., M.E., S.S.B., R.P. and S.C.B.; supervision, D.M., S.S.B. and S.C.B.; project administration, D.M., S.S.B. and S.C.B.; funding acquisition, S.S.B., D.M. and S.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by 2021 NCAE-C-002: Cyber Research Innovation Grant Program, Grant Number: H98230-21-1-0170. This research was also partially supported by the Askew Institute at the University of West Florida.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://datasets.uwf.edu/ [11].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

NTP	Network Time Protocol
IP	Internet Protocol
FTP	File Transfer Protocol
DNS	Domain Name System
DoS	Denial of Service
LAN	Local Area Network
WAN	Wide Area Network
VLAN	Virtual Local Area Network
STIX	Structured Threat Information Expression
PCAP	Packet Capture
HDFS	Hadoop Distributed File System
VM	Virtual Machine
IDS	Intrusion Detection System
IPS	Intrusion Prevention System
TTP	Tactics, Techniques, Procedures
SMB	Server Message Block

References

1. MITRE ATT&CK. Available online: https://attack.mitre.org/ (accessed on 19 September 2024).
2. About Zeek—Book of Zeek. Available online: https://docs.zeek.org/en/master/about.html (accessed on 16 September 2024).
3. Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; Plenkers, R.; McElroy, T.; Dulaney, S.; Shabanali, S. Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework. Data 2023, 8, 18.
4. KDD Cup 1999. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 3 September 2024).
5. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications, Ottawa, ON, Canada, 8–10 July 2009; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/5356528 (accessed on 9 August 2024).
6. Moustafa, N.; Slay, J. UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems. In Proceedings of the Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; IEEE: Canberra, Australia, 2015; pp. 1–6. Available online: https://ieee-dataport.org/documents/unswnb15-dataset (accessed on 9 August 2024).
7. Maciá-Fernández, G.; Camacho, J.; Magán-Carrión, R.; García-Teodoro, P.; Therón, R. UGR’16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs. Comput. Secur. 2018, 73, 411–424.
8. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A Detailed Analysis of the CICIDS2017 Data Set. In ICISSP; Revised Selected Papers; Springer: Cham, Switzerland, 2018; pp. 172–188. Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 4 August 2024).
9. Booij, T.M.; Chiscop, I.; Meeuwissen, E.; Moustafa, N.; den Hartog, F.T. ToN_IoT: The Role of Heterogeneity and the Need for Standardization of Features and Attack Types in IoT Network Intrusion Data Sets. IEEE Internet Things J. 2022, 9, 485–496.
10. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941.
11. Available online: https://datasets.uwf.edu/ (accessed on 3 August 2024).
12. Kali Linux | Penetration Testing and Ethical Hacking Linux Distribution. Available online: https://www.kali.org/ (accessed on 3 August 2023).
13. pfSense Documentation. Netgate. Available online: https://docs.netgate.com/pfsense/en/latest/ (accessed on 9 August 2024).
14. Metasploit. Available online: https://www.rapid7.com/products/metasploit/resources/ (accessed on 6 September 2024).
15. Security Onion Solutions. Available online: https://securityonionsolutions.com/ (accessed on 3 August 2024).
16. Project Jupyter | Home. Available online: https://jupyter.org/ (accessed on 9 August 2024).
17. Apache Spark—Unified Engine for Large-Scale Data Analytics. Available online: https://spark.apache.org/ (accessed on 3 August 2024).
18. Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 10 August 2024).
19. Windows Server 2008 R2 and Windows 2000. Microsoft. Available online: https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/hh831795(v=ws.11) (accessed on 9 August 2024).
20. Singh, Y.; Singh, P.; Sinha, G. Footprinting using Nmap. J. Inform. Electr. Electron. Eng. 2022, 3, 1–15.
21. “PsExec.” Microsoft Sysinternals Documentation, Microsoft. Available online: https://learn.microsoft.com/en-us/sysinternals/downloads/psexec (accessed on 9 August 2024).
22. GlassFish Documentation. Oracle. Available online: https://docs.oracle.com/cd/E26576_01/index.htm (accessed on 9 August 2024).
23. ProFTPD Documentation. ProFTPD Project. Available online: http://www.proftpd.org/ (accessed on 9 August 2024).
24. SMB Essay 71415: SMB. University of Twente. Available online: https://essay.utwente.nl/71415/ (accessed on 9 August 2024).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
