Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework
Abstract
1. Introduction
- Can be used to detect adversary behavior leading up to an attack;
- Can be used to develop a profile of user or user groups intending to perform attacks;
- Can also be used to identify attack traffic and attacks.
2. Background and Related Work
3. Architectural Framework for Collecting UWF-ZeekData22
3.1. Overall Architectural Framework
3.2. The UWF Cyber Range
- VMware vCenter;
- Pfsense;
- Kali;
- WebGoat;
- Security Onion 2;
- Ubuntu and Windows Server 2008 R2 Metasploitable 3.
3.3. UWF’s Hadoop Cluster
- RedHat Enterprise Linux;
- Podman;
- Apache HDFS;
- Apache Spark;
- JupyterLab.
- (×3) 2015 Dell PowerEdge R730 (20 cores, 128 GB RAM, and 4 TB Storage);
- (×6) 2015 Dell PowerEdge R730xd (20 cores, 128 GB RAM, and 48 TB Storage).
- Use one Dell PowerEdge R730 with 40 cores, 128 GB Memory, and a minimal amount of storage as the Hadoop name node and Spark master (Table 3);
- Use five Dell PowerEdge R730xd, while maximizing the storage, as the Hadoop worker nodes and Spark workers;
- The cluster is interconnected using two bonded 10 Gbps
3.4. UWF’s Spark Cluster
4. Generating and Collecting the Data
5. The Data
5.1. Zeek
5.2. MITRE ATT&CK Framework
Tactics Available in UWF-ZeekData22
6. Mapping and Labeling the Data
6.1. Labeling the DNS Data File
- Preprocess the mission logs:
  - Convert time stamps to Unix epoch time;
  - Create arrays for port, IP, and attack features;
    - For a string such as “101, 102, 103” in a port column, create a new column port_array that contains [101, 102, 103];
  - Manually set port and IP address values where the mission log input is noisy or unclear (for example, responses such as “unknown high port” or “all ports”). Responses were interpreted as broadly as possible; for instance, the response “unknown” was replaced with all port numbers in the well-known range 1–1023;
- Preprocess the Conn data file (see Section 6.2):
  - Convert time stamps to Unix epoch time;
  - Rename attributes with “.” in the attribute name to avoid Spark syntax issues;
- Join the mission logs and the preprocessed Conn file on the following:
  - Time (see Figure 8 for specifics on the slop factor)
    - Conn datetime ≥ mission log start time (±slop factor)
    - AND Conn datetime ≤ mission log end time (±slop factor)
  - AND IP
    - Conn src ip == mission log src ip
    - AND Conn dest ip == mission log dest ip
  - AND Port
    - Conn src port == mission log src port
    - AND Conn dest port == mission log dest port
- Join the labeled Conn and STIX data;
- Mix in benign data:
  - Label with mitre_attack == none, label_tactic == none;
- Join the labeled Conn with the raw DNS to produce the labeled DNS:
  - FROM Conn SELECT uid, mitre_attack, label_tactic
  - FROM dns SELECT all
  - Join on conn.uid == dns.uid
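The labeling steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark code used to build the dataset: the `Mission` fields, the helper names, and the `SLOP_SECONDS` value are hypothetical, the slop factor is read as widening the mission window on both sides, destination IPs and ports are matched by membership in the preprocessed arrays, and unmatched Conn records are treated as benign and labeled "none".

```python
from dataclasses import dataclass

SLOP_SECONDS = 60  # hypothetical value; the paper defers the slop factor to Figure 8


def parse_ports(text):
    """Turn a mission-log string such as "101, 102, 103" into [101, 102, 103]."""
    return [int(part) for part in text.split(",") if part.strip()]


@dataclass
class Mission:
    """One preprocessed mission-log entry (field names are illustrative)."""
    start: int        # Unix epoch seconds
    end: int          # Unix epoch seconds
    src_ip: str
    dest_ips: list
    src_ports: list
    dest_ports: list
    mitre_attack: str
    label_tactic: str


def matches(conn, mission, slop=SLOP_SECONDS):
    """Apply the time AND IP AND port conditions to one Conn record."""
    return (
        mission.start - slop <= conn["ts"] <= mission.end + slop
        and conn["id_orig_h"] == mission.src_ip
        and conn["id_resp_h"] in mission.dest_ips
        and conn["id_orig_p"] in mission.src_ports
        and conn["id_resp_p"] in mission.dest_ports
    )


def label_conn(conns, missions):
    """Attach mitre_attack / label_tactic; unmatched records get "none"."""
    labeled = []
    for conn in conns:
        hit = next((m for m in missions if matches(conn, m)), None)
        labeled.append({
            **conn,
            "mitre_attack": hit.mitre_attack if hit else "none",
            "label_tactic": hit.label_tactic if hit else "none",
        })
    return labeled


def label_dns(dns_rows, labeled_conns):
    """Join DNS rows to labeled Conn rows on uid, carrying the two labels over."""
    labels = {c["uid"]: (c["mitre_attack"], c["label_tactic"]) for c in labeled_conns}
    return [
        {**row, "mitre_attack": labels[row["uid"]][0], "label_tactic": labels[row["uid"]][1]}
        for row in dns_rows
        if row["uid"] in labels
    ]
```

The dotted Zeek names appear here with underscores (`id_orig_h` rather than `id.orig_h`), mirroring the renaming step in the list. On data of this size the same conditions would be expressed as a distributed join instead of a Python loop.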
6.2. Labeling the Conn Data File
7. Traffic Analysis
Traffic Analysis of Cumulative Flows
8. Conclusions
9. Future Works
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Name | Attributes |
---|---|
mission_logs | id, sis_id, datetime_submitted, attempt, group_number, mitre_attck_technique, bcol_1-011, src_ip, src_port, dest_ip_arrays, dest_ip, dest_portdatetime_submitted, dt_start, datetime_start, dt_end, datetime_end, num_correct, num_incorrect, score |
Broker | ts, ty, ev, peer.address, peer.bound_port, message, peer |
capture_loss | ts, ts_delta, peer, gaps, acks, percent_lost |
Cluster | |
conn-summary | |
Conn | ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p, proto, service, duration, orig_bytes, resp_bytes, conn_state, local_orig, local_resp, missed_bytes, history, orig_pkts, orig_ip_bytes, resp_pkts, resp_ip_bytes, community_id, id, tunnel_parents |
dhcp | ts, uids, client_addr, server_addr, mac, host_name, domain, assigned_addr, lease_time, msg_types, duration, requested_addr, client_port, server_port, client_fqdn, client_message, server_message, client_chaddr |
dns | ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p, proto, trans_id, query, qclass, qclass_name, qtype, qtype_name, rcode, rcode_name, AA, TC, RD, RA, Z, rejected, rtt, answers, TTLs, id, total_answers, total_replies, saw_query, saw_reply |
loaded_scripts | |
Notice | ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p, fuid, proto, note, msg, sub, src, dst, p, peer_descr, actions, suppress_for, id, conn, iconn, f, file_mime_type, file_desc, n, peer_name, email_dest, email_body_sections, email_delay_tokens, identifier |
packet_filter | |
Reporter | |
Stats | ts, peer, mem, pkts_proc, bytes_recv, events_proc, events_queued, active_tcp_conns, active_udp_conns, active_icmp_conns, tcp_conns, udp_conns, icmp_conns, timers, active_timers, files, active_files, dns_requests, active_dns_requests, reassem_tcp_size, reassem_file_size, reassem_frag_size, reassem_unknown_size, pkts_dropped, pkts_link, pkt_lag |
Stderr | |
Stdout | |
Weird | ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p, name, notice, peer, addl, source, id, conn, identifier |
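Several attribute names in the table above contain a “.” (for example, id.orig_h), and Section 6.1 notes that such attributes are renamed to avoid Spark syntax issues. The following is a minimal sketch of that renaming, assuming an underscore as the replacement character; the paper does not state which character was used, and `rename_dotted` is a hypothetical helper.

```python
def rename_dotted(columns, replacement="_"):
    """Map each column name containing '.' to a dot-free name."""
    return {name: name.replace(".", replacement) for name in columns if "." in name}


# A subset of the Conn attributes listed in Appendix A.
conn_columns = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto"]

mapping = rename_dotted(conn_columns)
renamed = [mapping.get(name, name) for name in conn_columns]
print(renamed)
```

With a Spark DataFrame, each pair in `mapping` could be applied with `withColumnRenamed(old, new)`.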
Parameters | KDDCUP99 | NSL-KDD | UNSW-NB15 | UGR16 | CIC-IDS 2017 | CSE-CIC-IDS 2018 | ToN-IoT | UWF-ZeekData22 |
---|---|---|---|---|---|---|---|---|
Year | 1999 | 2009 | 2015 | 2016 | 2017 | 2018 | 2019 | 2022 |
Duration of data collected | 5 weeks | N/A, based off KDDCUP99 | 16 h and 15 h | 4 months | 5 days | 16 days (based on attack days) | 27 days | 16 weeks |
Simulated? | Yes | Yes | Yes | Mixed; real background traffic and synthetic attack traffic | Yes | Yes | Yes | No; mixed: live wargaming in a controlled environment |
Number of attack families | 4 | 4 | 9 | 3 | 8 | 7 | 9 | 14 |
Format of data collected | 3 types (tcpdump, BSM, dump files) | 2 types (ARFF and txt for CSV) | Pcap files | Flow | PCAPs, CSVs, network/labeled flows | CSV, event logs, Pcaps | Zeek logs, PCAP | Zeek logs, PCAPs |
Number of networks | 2 | 2 | 3 | 2 sub-networks (core, inner), 1 network (in core), 3 networks (inner) | 2 (attacker, victim) | 5 servers, 5 subnets, one attack-network | 3 layers; Edge: 7 IoT/IIoT Fog: 6 VMs Cloud: | 81 subnets |
Number of distinct IP addresses | 11 | 11 | 45 | Over 600 million external, 16 billion individual flows | 2 (attacker), 12 (victim) | 31 | 10 | Source_ip 254; Destination_ip 4324 |
Extraction tools | Bro-IDS | N/A, based off KDDCUP99 | Argus, Bro-IDS, and new tools | nfdump, nfanon | CICFlowMeter | CICFlowMeter-V3 | Zeek | Zeek, MITRE ATT&CK Framework |
Number of features | 41 | Based off KDDCUP99 | 49 | 7 | 80+ | 84 | 45 | Several files and several features per file |
Number of files | | | | | | | 23 processed network logs | 16 logs |
Framework | | | | | | | | MITRE ATT&CK |
Website | https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 3 September 2022). | https://www.unb.ca/cic/datasets/nsl.html (accessed on 3 September 2022). | https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 3 September 2022). | https://security.kiwi/docs/ugr16-dataset/ (accessed on 3 September 2022). | https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 3 September 2022). | https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 3 September 2022). | https://research.unsw.edu.au/projects/toniot-datasets (accessed on 14 September 2022). | https://datasets.uwf.edu (accessed on 1 November 2022). |
Server | CPU | Memory | Storage |
---|---|---|---|
Supermicro X9DRE-TF+/X9DR7-TF+ | 2 × E5-2630 v2 @2.6 GHz (24 cores) | 128 GB | 20.02 TB |
Dell PowerEdge R740 | 2 × Gold 6126 @ 2.6 GHz (48 cores) | 768 GB | 6.74 TB (SSD) |
ASUSTeK Computer INC. | 2 × Opteron 6344 (24 cores) | 192 GB | 7.27 TB |
Server | CPU | Memory | Storage |
---|---|---|---|
Dell PowerEdge R730 | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | |
Dell PowerEdge R730xd | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | 12 × 4 TB 7.2 K RPM (48 TB) |
Dell PowerEdge R730xd | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | 12 × 4 TB 7.2 K RPM (48 TB) |
Dell PowerEdge R730xd | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | 12 × 4 TB 7.2 K RPM (48 TB) |
Dell PowerEdge R730xd | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | 12 × 4 TB 7.2 K RPM (48 TB) |
Dell PowerEdge R730xd | 2 × E5-2650 v3 @2.3 GHz (40 cores) | 128 GB | 12 × 4 TB 7.2 K RPM (48 TB) |
Name | Total Count | Description |
---|---|---|
mission_logs | 377 | Used for collating the records. |
Broker | 197,985 | Communication file used to enforce asynchronous distributed communication as well as to interact with persistent data stores. |
capture_loss | 197,800 | Shows how well Zeek’s management and analysis tools are working. A missing TCP sequence set is correlated to a “gap” of lost data. This lost data results in a capture_loss file. |
Cluster | 362 | Zeek cluster messages. |
conn-summary | 318,225 | |
Conn | 140,477,116 | Tracks protocols and associated information such as IP addresses, durations, transferred (two way) bytes, states, packets, and tunnel information. Conn files provide all data regarding the connection between two points. |
dhcp | 2,356,475 | Helps correlate IP addresses and MAC addresses and potentially hostnames. From a security standpoint, this allows for the confirmation of connected systems/services and potential intrusion detection by determining which system assigned which IP address. |
dns | 191,049,652 | Provides a swath of information on how specific systems access and utilize the internet and other systems and focuses on the system that is asking a question and all elements of the question and its associated answer. |
loaded_scripts | 3880 | |
Notice | 144,946 | An event that Zeek learning has determined to be inspection-worthy; these are often higher-level alerts such as self-signed certs and are Zeek’s approximate equivalent to IDS alerts. |
packet_filter | 0 | Lists packet filters that were applied. |
Reporter | 74 | Internal error/warning messages. |
Stats | 346,088 | Memory/event/packet/lag statistics. |
Stderr | 48 | Captures standard errors when Zeek is started from ZeekControl. |
Stdout | 72 | Captures standard outputs when Zeek is started from ZeekControl. |
Weird | 47,311 | Essentially anything that does not fall into any other category. |
Attack Type | Description |
---|---|
Reconnaissance | Active or passive tactics for gathering information that can be used to plan future operations. |
Discovery | Tactics that may be used to gain knowledge about the system and internal network. |
Credential access | Tactics for stealing credentials such as account names and passwords. |
Privilege escalation | Tactics used to gain higher-level permissions on systems or networks. |
Exfiltration | Tactics that may be used to steal data from the network. |
Lateral movement | Tactics used to enter and control remote systems on networks. |
Resource Development | Tactics to try to establish resources that can be used to support operations. |
Initial access | Tactics that use various entry vectors to gain an initial foothold within a network. |
Persistence | Tactics used to keep access to systems across restarts, changed credentials, and other interruptions. |
Defense evasion | Tactics used to avoid detection throughout their compromise. |
Execution | Tactics to try to run malicious code. |
Collection | Tactics to try to gather data to reach a goal. |
Command and control | Tactics to try to communicate with compromised systems to control them. |
Impact | Tactics to try to manipulate, interrupt, or destroy systems and data. |
Attack Type | Count | Proportion |
---|---|---|
Reconnaissance | 9,278,722 | 0.999768664 |
Discovery | 2086 | 0.000224763 |
Credential access | 31 | 3.3402 × 10^−6 |
Privilege escalation | 13 | 1.40073 × 10^−6 |
Exfiltration | 7 | 7.5424 × 10^−7 |
Lateral movement | 4 | 4.30994 × 10^−7 |
Resource development | 3 | 3.23246 × 10^−7 |
Initial access | 1 | 1.07749 × 10^−7 |
Persistence | 1 | 1.07749 × 10^−7 |
Defense evasion | 1 | 1.07749 × 10^−7 |
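The last column of the table above is each tactic's count divided by the total across all tactics, so the values are fractions of 1 rather than percentages. A short sketch that recomputes them from the counts in the table (the variable names are illustrative):

```python
# Counts copied from the table above.
counts = {
    "Reconnaissance": 9_278_722,
    "Discovery": 2086,
    "Credential access": 31,
    "Privilege escalation": 13,
    "Exfiltration": 7,
    "Lateral movement": 4,
    "Resource development": 3,
    "Initial access": 1,
    "Persistence": 1,
    "Defense evasion": 1,
}

total = sum(counts.values())  # total number of labeled attack records
proportions = {tactic: n / total for tactic, n in counts.items()}

for tactic, share in proportions.items():
    print(f"{tactic:22s} {counts[tactic]:>10,d} {share:.6e}")
```

The total equals the malicious-traffic count reported for the dataset (9,280,869), and Reconnaissance alone accounts for almost all of it, which is worth keeping in mind when training per-tactic classifiers on this data.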
id | Col2 | Col3 |
---|---|---|
1 | val1 | [1] |
2 | rand2 | [2,3] |
3 | val3 | [4–6] |
id | Col2 | Col3 |
---|---|---|
1 | val1 | 1 |
2 | rand2 | 2 |
2 | rand2 | 3 |
3 | val2 | 4 |
3 | val2 | 5 |
3 | val2 | 6 |
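The two tables above appear to show an array column being expanded into one row per array element, the operation Spark calls explode. A plain-Python sketch of the same transformation follows; the `explode` function here is a local helper rather than the Spark API, the sample rows are taken from the first table, and the entry [4–6] is assumed to mean the values 4, 5, and 6.

```python
def explode(rows, column):
    """Return one output row per element of the list stored in rows[i][column].

    Rows whose list is empty produce no output rows.
    """
    out = []
    for row in rows:
        for value in row[column]:
            out.append({**row, column: value})
    return out


rows = [
    {"id": 1, "Col2": "val1", "Col3": [1]},
    {"id": 2, "Col2": "rand2", "Col3": [2, 3]},
    {"id": 3, "Col2": "val3", "Col3": [4, 5, 6]},
]

exploded = explode(rows, "Col3")
for row in exploded:
    print(row["id"], row["Col2"], row["Col3"])
```

Exploding the port and IP arrays of the mission logs in this way lets each element take part in an ordinary equality join.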
Non-malicious traffic | 9,281,599 |
Malicious traffic | 9,280,869 |
Tactics/Technique | Count |
---|---|
Command and control | 36 |
Defense evasion, privilege escalation | 27 |
Defense evasion, initial access, persistence, privilege escalation | 5 |
Impact | 26 |
Collection | 28 |
Discovery | 37 |
Defense evasion, discovery | 5 |
Persistence, privilege escalation | 42 |
Lateral movement | 14 |
Initial access, persistence | 1 |
Resource development | 38 |
Defense evasion, persistence | 7 |
Initial access, lateral movement | 1 |
Credential access, defense evasion, persistence | 6 |
Privilege escalation | 2 |
Execution | 25 |
Reconnaissance | 42 |
Credential access | 42 |
Defense evasion, persistence, privilege escalation | 13 |
Execution, lateral movement | 1 |
Collection, credential access | 9 |
Command and control, defense evasion, persistence | 2 |
Persistence | 26 |
Defense evasion, execution | 1 |
Defense evasion, lateral movement | 5 |
Execution, persistence, privilege escalation | 6 |
Credential access, discovery | 1 |
Initial access | 12 |
Exfiltration | 17 |
Defense evasion | 99 |
Statistical Features | Category | Value |
---|---|---|
Src_bytes | | 1,881,011,939,061 |
Dst_bytes | | 23,446,737,545 |
Src_pkts | | 359,379,346 |
Dst_pkts | | 243,986,486 |
Protocol types | TCP | 33,987,569 |
Protocol types | UDP | 105,098,306 |
Protocol types | ICMP | 1,391,241 |
Unique | Src_ip | 254 |
Unique | Dst_ip | 4324 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; Plenkers, R.; McElroy, T.; Dulaney, S.; Shabanali, S. Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework. Data 2023, 8, 18. https://doi.org/10.3390/data8010018