Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI
Abstract
1. Introduction
2. Related Works
3. Methodology
3.1. Experimental Setup
3.2. Overall Architectural Framework
3.3. Hadoop Cluster
3.4. The Enterprise MITRE ATT&CK Framework
3.5. Generating and Collecting the Data
3.5.1. Labs Used to Generate the Data
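Lab | Description of Lab | MITRE ATT&CK Tactic | MITRE ATT&CK Technique to Be Used for Data Collection |
---|---|---|---|
Network Mapping | | Reconnaissance | Active Scanning |
Network Mapping | | Reconnaissance | Gather Victim Host Information |
Network Mapping | | Reconnaissance | Gather Victim Identity Information |
Network Mapping | | Reconnaissance | Gather Victim Network Information |
Enumeration | | Reconnaissance | Active Scanning |
Enumeration | | Reconnaissance | Gather Victim Host Information |
Enumeration | | Reconnaissance | Gather Victim Identity Information |
Enumeration | | Reconnaissance | Gather Victim Network Information |
Attack Metasploit | | Initial Access | External Remote Services |
Password Attacks | | Credential Access | Brute Force |
Password Attacks | | Credential Access | OS Credential Dumping |
Reconnaissance | | Reconnaissance | Active Scanning |
Reconnaissance | | Reconnaissance | Gather Victim Host Information |
Reconnaissance | | Reconnaissance | Gather Victim Identity Information |
Reconnaissance | | Reconnaissance | Gather Victim Network Information |
Gaining Access | | Initial Access | Exploit Public Facing Application |
Gaining Access | | Initial Access | External Remote Services |
Gaining Access | | Initial Access | Valid Accounts |
Gaining Access | | Credential Access | Brute Force |
Gaining Access | | Credential Access | Credentials from Password Stores |
Gaining Access | | Credential Access | Input Capture |
Gaining Access | | Credential Access | OS Credential Dumping |
Gaining Access | | Lateral Movement | Exploitation of Remote Services |
Gaining Access | | Lateral Movement | Lateral Tool Transfer |
Gaining Access | | Lateral Movement | Remote Services Session Hijacking |
Gaining Access | | Lateral Movement | Remote Services |
Execution | | Collection | Automated Exfiltration |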
3.5.2. Scripts Used to Generate the Data
Pseudocode 1: Nmap Scan
for time in [00:00, 05:00, 11:00, 17:00] do
    sleep_random(3400)
    exploit_mitre_att&ck_t1595()

Pseudocode 2: Psexec Exploit
for time in [01:00, 07:00, 12:00, 18:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1078()

Pseudocode 3: GlassFish Exploit
for time in [02:00, 08:00, 14:00, 19:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1110()

Pseudocode 4: ProFTPD Exploit
for time in [03:00, 09:00, 15:00, 21:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1190()

Pseudocode 5: SMB Exploit
for time in [04:00, 10:00, 16:00, 22:00] do
    sleep_random(3500)
    exploit_mitre_att&ck_t1078()
    exploit_mitre_att&ck_t1048()
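
The scheduling pattern shared by Pseudocode 1 to 5 can be sketched in Python as follows. This is a minimal illustration, not the scripts used to build the dataset. It assumes that sleep_random(n) waits a uniformly random whole number of seconds between 0 and n, which the pseudocode does not state. The function name jittered_launch_times, the dictionary keys, and the demo date are hypothetical.

```python
import random
from datetime import datetime, timedelta

# Hourly slots and maximum random delays copied from Pseudocode 1 to 5.
# The dictionary keys are illustrative names, not identifiers from the paper.
SCHEDULE = {
    "nmap_scan": (["00:00", "05:00", "11:00", "17:00"], 3400),
    "psexec_exploit": (["01:00", "07:00", "12:00", "18:00"], 3500),
    "glassfish_exploit": (["02:00", "08:00", "14:00", "19:00"], 3500),
    "proftpd_exploit": (["03:00", "09:00", "15:00", "21:00"], 3500),
    "smb_exploit": (["04:00", "10:00", "16:00", "22:00"], 3500),
}


def jittered_launch_times(day, slots, max_delay_s, rng):
    """Return one launch datetime per slot: slot start plus a random delay.

    Assumption: the delay is a uniformly random whole number of seconds
    in [0, max_delay_s], mirroring sleep_random(max_delay_s).
    """
    launches = []
    for slot in slots:
        hour, minute = (int(part) for part in slot.split(":"))
        start = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
        launches.append(start + timedelta(seconds=rng.randint(0, max_delay_s)))
    return launches


if __name__ == "__main__":
    rng = random.Random(24)  # fixed seed only so the demo is repeatable
    day = datetime(2024, 1, 1)  # arbitrary demo date
    for name, (slots, max_delay_s) in SCHEDULE.items():
        for launch in jittered_launch_times(day, slots, max_delay_s, rng):
            print(name, launch.isoformat())
```

Because every maximum delay is below 3600 s, each launch stays inside its own hour, so the five scripts never start in the same hourly slot.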

Pseudocode 6: Attack Script
# Define the CSV file path
Set CSV_FILE to “/home/kali/nmap_scan.csv”
# Get the current timestamp in “MM/DD/YYYY HH:MM:SS” format
Set TIMESTAMP to current date in “MM/DD/YYYY HH:MM:SS” format
# Set constants
Set GROUP_NUMBER to “1”
Set TACTIC_ID to “T1595”
Set SOURCE_IP to “143.88.1.18”
Set SOURCE_PORT to “” (empty)
Set TARGET_IP to “143.88.2.1-21”
Set TARGET_PORT to “445”
# Get the start time and date in UTC and split it into components
Set START_TIME_DATE to current date in “YYYY-MM-DDTHH:MM:SSZ” format (UTC)
Set START_YEAR to current year in “YYYY” format
Set START_MONTH to current month in “MM” format
Set START_DAY to current day in “DD” format
Set START_TIME to current time in “HH:MM:SS” format
# Run the nmap scan with specified target IP, port, and output file
Run nmap with options:
    - Timing template “T4”
    - Port set to TARGET_PORT
    - Target IP set to TARGET_IP
    - Output results in XML format to “nmapOut.xml”
# Get the end time and date components
Set END_YEAR to current year in “YYYY” format
Set END_MONTH to current month in “MM” format
Set END_DAY to current day in “DD” format
Set END_TIME to current time in “HH:MM:SS” format
# Write all collected data into the CSV file
Append to CSV_FILE:
    TIMESTAMP, GROUP_NUMBER, TACTIC_ID, SOURCE_IP, SOURCE_PORT, TARGET_IP, TARGET_PORT,
    START_TIME_DATE, START_YEAR, START_MONTH, START_DAY, START_TIME, END_YEAR,
    END_MONTH, END_DAY, END_TIME
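
A Python rendering of Pseudocode 6 might look like the sketch below. It is an illustration under stated assumptions rather than the script that produced the mission logs. All timestamps are taken in UTC, whereas the pseudocode only says so for START_TIME_DATE. The function names and the injectable run and now parameters are hypothetical conveniences for testing. The nmap options used (-T4, -p, -oX) are standard nmap flags matching the options listed in the pseudocode.

```python
import csv
import subprocess
from datetime import datetime, timezone

# Constants copied from Pseudocode 6.
CSV_FILE = "/home/kali/nmap_scan.csv"
GROUP_NUMBER = "1"
TACTIC_ID = "T1595"
SOURCE_IP = "143.88.1.18"
SOURCE_PORT = ""
TARGET_IP = "143.88.2.1-21"
TARGET_PORT = "445"


def build_nmap_command(target_ip, target_port, xml_out="nmapOut.xml"):
    """Build the nmap argument list: T4 timing, one port, XML output."""
    return ["nmap", "-T4", "-p", target_port, target_ip, "-oX", xml_out]


def date_parts(moment):
    """Split a datetime into year, month, day, and HH:MM:SS strings."""
    return [
        moment.strftime("%Y"),
        moment.strftime("%m"),
        moment.strftime("%d"),
        moment.strftime("%H:%M:%S"),
    ]


def run_and_log(csv_file=CSV_FILE, run=subprocess.run,
                now=lambda: datetime.now(timezone.utc)):
    """Run the scan and append one 16-field mission-log row to csv_file.

    Assumption: every timestamp is UTC. `run` and `now` are injectable so
    the function can be exercised without nmap installed.
    """
    started = now()
    row = [
        started.strftime("%m/%d/%Y %H:%M:%S"),  # TIMESTAMP
        GROUP_NUMBER, TACTIC_ID, SOURCE_IP, SOURCE_PORT,
        TARGET_IP, TARGET_PORT,
        started.strftime("%Y-%m-%dT%H:%M:%SZ"),  # START_TIME_DATE
        *date_parts(started),
    ]
    run(build_nmap_command(TARGET_IP, TARGET_PORT), check=False)
    row += date_parts(now())  # END_YEAR, END_MONTH, END_DAY, END_TIME
    with open(csv_file, "a", newline="") as handle:
        csv.writer(handle).writerow(row)
    return row
```

Recording both the start and end components in the same row is what later allows each attack to be matched against network records by time interval.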
3.6. Mapping and Labeling Data
1. Preprocessing Mission Logs
   - Time Conversion: Mission log timestamps are converted to epoch time.
   - Array Creation: Arrays are built for specific features within the logs, such as source/destination ports, source/destination IP addresses, and attack indicators.
2. Preprocessing Conn Data
   - As with the mission logs, timestamps are converted, and attribute names that contain “.” are renamed to maintain compatibility with Spark processing.
3. Joining Mission Logs with Conn Data
   - Mission logs and Conn data are joined on time intervals (allowing for a slop factor, which was 1 min in this case), IP addresses, and port numbers.
   - After the join, the Conn data inherits the attack information from the mission logs.
4. Merging with STIX Data
   - Labeled Conn data is combined with STIX data to enhance the MITRE technique-to-tactic mappings.
   - Flattening the array structures in the IP and attack fields allows for cases where a single technique relates to multiple tactics.
5. Final Labeled Conn Data
   - Benign entries are labeled with mitre_attack == none and label_tactic == none.
6. Final Labeled DNS Data
   - Finally, the labeled DNS dataset is created by joining the labeled Conn data with the raw DNS data on unique identifiers (uid).
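
The first five steps can be sketched in plain Python as shown below. The pipeline described above runs on Spark; this stand-in uses lists of dictionaries only to make the join logic concrete, and it is not the authors' implementation. The mission record keys (technique, start, end, src_ip, dst_ips, dst_port) and the function names are hypothetical. The Conn keys ts, id.orig_h, id.resp_h, and id.resp_p are standard Zeek conn.log fields. Which addresses and ports are compared in the real join is an assumption here, and the small TECHNIQUE_TACTICS lookup stands in for the STIX data.

```python
from datetime import datetime, timezone

SLOP_SECONDS = 60  # the 1 min slop factor from step 3

# Stand-in for the STIX technique-to-tactic mapping (step 4).
TECHNIQUE_TACTICS = {
    "T1595": ["Reconnaissance"],
    "T1078": ["Defense Evasion", "Persistence",
              "Privilege Escalation", "Initial Access"],
}


def to_epoch(stamp):
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' UTC string to epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def rename_dotted(record):
    """Replace '.' in attribute names, e.g. id.orig_h becomes id_orig_h."""
    return {key.replace(".", "_"): value for key, value in record.items()}


def matches(conn, mission):
    """True when a connection falls in the mission's widened time window
    and agrees on source IP, destination IP, and destination port."""
    in_window = (mission["start"] - SLOP_SECONDS
                 <= conn["ts"]
                 <= mission["end"] + SLOP_SECONDS)
    same_hosts = (conn["id_orig_h"] == mission["src_ip"]
                  and conn["id_resp_h"] in mission["dst_ips"])
    return in_window and same_hosts and conn["id_resp_p"] == mission["dst_port"]


def label_conn(conn_records, missions, technique_tactics=TECHNIQUE_TACTICS):
    """Label each connection; emit one row per tactic of a matched technique."""
    labeled = []
    for raw in conn_records:
        conn = rename_dotted(raw)
        hit = next((m for m in missions if matches(conn, m)), None)
        if hit is None:
            # Benign traffic keeps the "none" labels from step 5.
            labeled.append({**conn, "mitre_attack": "none",
                            "label_tactic": "none"})
            continue
        for tactic in technique_tactics[hit["technique"]]:
            labeled.append({**conn, "mitre_attack": hit["technique"],
                            "label_tactic": tactic})
    return labeled
```

Emitting one row per tactic is the flattening of step 4: a connection matched to a technique that belongs to several tactics appears once for each of them.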
4. The Dataset
4.1. Zeek Logs
Name | Total Count | Description |
---|---|---|
mission_logs | 29,550 | Used for collating records. |
broker | 19,818 | Communication file used to enforce asynchronous distributed communication, as well as to interact with persistent data stores. |
capture_loss | 19,746 | Shows how well Zeek’s management and analysis tools are working. A missing TCP sequence set is correlated to a “gap” of lost data. This lost data results in a capture_loss file. |
cluster | 84 | Zeek cluster messages. |
conn-summary | 4433 | |
conn | 46,991,170 | Tracks protocols and associated information such as IP addresses, durations, transferred (two-way) bytes, states, packets, and tunnel information. Conn files provide all data regarding the connection between two points. |
dhcp | 32,113 | Helps correlate IP addresses and MAC addresses and potentially hostnames. From a security standpoint, this allows for the confirmation of connected systems/services and potential intrusion detection by determining which system is assigned to which IP address. |
dns | 59,041,059 | Provides a swath of information on how specific systems access and utilize the internet and other systems; focuses on the system that is asking a question and all elements of the question and its associated answer. |
loaded_scripts | 1455 | |
notice | 7111 | An event that Zeek has determined to be inspection-worthy; these are often higher-level alerts such as self-signed certs and are Zeek’s approximate equivalent to IDS alerts. |
reporter | 4 | Internal error/warning messages. |
stats | 34,549 | Memory/event/packet/lag statistics. |
stderr | 21 | Captures standard errors when Zeek is started from ZeekControl. |
stdout | 32 | Captures standard outputs when Zeek is started from ZeekControl. |
weird | 4000 | Anything that does not fall into any other category. |
4.2. Mission Logs
4.3. Tactics and Techniques in UWF-ZeekData24
Attack Type | Description |
---|---|
Reconnaissance | Active or passive tactics for gathering information that can be used to plan future operations. |
Discovery | Tactics that may be used to gain knowledge about the system and internal network. |
Credential access | Tactics for stealing credentials such as account names and passwords. |
Privilege escalation | Tactics used to gain higher-level permissions on systems or networks. |
Exfiltration | Tactics used to steal data from the network. |
Initial access | Tactics that use various entry vectors to gain an initial foothold within the network. |
Persistence | Tactics used to keep access to systems across restarts, changed credentials, and other interruptions. |
MITRE Tactic Attack Type | Count | % |
---|---|---|
Credential Access | 871,188 | 90.88 |
Reconnaissance | 58,095 | 6.06 |
Initial Access | 10,662 | 1.11 |
Privilege Escalation | 6048 | 0.631 |
Persistence | 6048 | 0.631 |
Defense Evasion | 6048 | 0.631 |
Exfiltration | 559 | 0.0583 |
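
The percentage column of the tactic table can be recomputed from the counts with a few lines of Python. This is a simple arithmetic check, assuming each percentage is the tactic count divided by the sum of all listed tactic counts. The names TACTIC_COUNTS and percentage_shares are illustrative.

```python
# Counts copied from the tactic table above.
TACTIC_COUNTS = {
    "Credential Access": 871_188,
    "Reconnaissance": 58_095,
    "Initial Access": 10_662,
    "Privilege Escalation": 6_048,
    "Persistence": 6_048,
    "Defense Evasion": 6_048,
    "Exfiltration": 559,
}


def percentage_shares(counts):
    """Return each count as a percentage of the sum of all counts."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


if __name__ == "__main__":
    print("total labeled rows:", sum(TACTIC_COUNTS.values()))
    for name, share in percentage_shares(TACTIC_COUNTS).items():
        print(f"{name}: {share:.4f}%")
```

The counts sum to 958,648. Exfiltration comes out at roughly 0.058%, that is, a fraction of about 5.83 × 10^-4 of the labeled rows.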
MITRE Technique | Count | % |
---|---|---|
T1110 | 871,188 | 90.87 |
T1595 | 58,095 | 6.06 |
T1078 | 6048 | 1.89 |
T1190 | 4614 | 0.63 |
T1048 | 559 | 0.48 |
5. Traffic Analysis
Traffic_Type_Relabelled | Count | % |
---|---|---|
Malicious traffic | 958,648 | 50 |
Non-malicious traffic | 958,561 | 50 |
Traffic Analysis of Cumulative Flows
Features | Sub-Features | Counts |
---|---|---|
src_bytes | | 1,063,460,303 |
dest_bytes | | 12,461,401,543 |
src_pkts | | 45,053,268 |
dest_pkts | | 53,723,672 |
Protocol Types | udp | 928,896 |
Protocol Types | icmp | 2691 |
Protocol Types | tcp | 985,622 |
Unique | src_ip | 55 |
Unique | dest_ip | 166 |
6. Crowd-Sourced Data Versus Controlled Data
6.1. UWF-ZeekData22 Plots
6.2. UWF-ZeekData24 Plots
7. Conclusions
8. Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
NTP | Network Time Protocol |
IP | Internet Protocol |
FTP | File Transfer Protocol |
DNS | Domain Name System |
DoS | Denial of Service |
LAN | Local Area Network |
WAN | Wide Area Network |
VLAN | Virtual Local Area Network |
STIX | Structured Threat Information Expression |
PCAP | Packet Capture |
HDFS | Hadoop Distributed File System |
VM | Virtual Machine |
IDS | Intrusion Detection System |
IPS | Intrusion Prevention System |
TTP | Tactics, Techniques, Procedures |
SMB | Server Message Block |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Elam, M.; Mink, D.; Bagui, S.S.; Plenkers, R.; Bagui, S.C. Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI. Data 2025, 10, 59. https://doi.org/10.3390/data10050059