1. Introduction
With the rapid development of Internet of Things (IoT) technology, various smart devices, such as home routers, smart cameras, and smart bulbs, have become deeply integrated into daily life. However, frequent updates and widespread reuse of core components have introduced numerous security vulnerabilities into IoT device firmware, which are challenging to eliminate [1]. Once exploited, these vulnerabilities can lead to system crashes, privacy breaches, and even full device control, enabling attackers to launch large-scale distributed denial-of-service (DDoS) attacks by creating botnets [2]. For example, the notorious Mirai botnet exploited vulnerabilities in numerous home routers, posing severe threats to internet infrastructure [3].
Fuzzing, an effective technique for vulnerability detection, has been widely applied to IoT firmware security testing [4]. Typically, fuzzing requires access to either physical or emulated IoT devices as testing targets. By sending carefully crafted malformed requests to a device’s management interface (e.g., a web management interface), a fuzzer evaluates whether the device’s response aligns with the expected behavior [5,6]. Test messages must pass authorization checks on the target’s management interface to effectively trigger deep-seated vulnerabilities within the firmware [7]. Researchers designing fuzzers require a large set of emulated IoT devices and established N-day vulnerability benchmarks to validate fuzzer effectiveness [8].
However, constructing portable emulation environments for firmware presents significant challenges. Differences among firmware vendors, device architectures, and dependencies on external hardware make emulation environments highly complex and often result in low success rates [9]. This complexity makes it difficult for researchers to obtain sufficient standardized testing targets. Furthermore, there is a lack of user-friendly task scheduling and evaluation frameworks, and existing high-quality firmware emulation environments and comprehensive vulnerability benchmark datasets often fail to integrate seamlessly with these frameworks [10]. Standardized firmware and vulnerability benchmark datasets that connect seamlessly with scheduling and evaluation frameworks are essential to supporting large-scale fuzzing evaluation and unknown vulnerability discovery [4]. The absence of these critical resources and tools hinders the development of fuzzing technology, limiting the evaluation capabilities of fuzzers and the scalability of parallel firmware fuzzing [5].
Challenges. In previous research, we introduced IoTFuzzBench [11], an open-source framework for IoT fuzzing task scheduling and evaluation. IoTFuzzBench can define and execute complex parallel IoT firmware fuzzing tasks through configuration files. However, IoTFuzzBench has limitations in addressing the standardization of firmware emulation environments and the reproducibility of vulnerability datasets. These shortcomings underscore the pressing need for scalable, high-quality benchmark datasets to advance large-scale IoT fuzzing research. Specifically, current research encounters substantial challenges in the following domains:
Inefficiency in benchmark production and validation processes. The current process for IoT firmware emulation and vulnerability benchmark production involves multiple steps that require specialized expertise, which makes it highly time-consuming and prone to errors [12]. Moreover, the lack of standardized workflows further exacerbates inefficiencies, creating inconsistencies in the format, quality, and compatibility of benchmarks produced by different researchers. These issues hinder dataset sharing and reuse, thereby limiting the scalability of experiments and impeding the horizontal comparison of different methods in IoT fuzzing.
Scarcity of scalable, reproducible, and ready-to-use firmware and vulnerability benchmarks with known vulnerabilities. Publicly available datasets that integrate reproducible firmware and associated vulnerabilities are critical for enabling researchers to perform rapid studies and cross-comparisons [10]. However, existing open-source datasets often fall short in terms of scale, reproducibility, and standardization, significantly constraining their utility in advancing research. Furthermore, the lack of datasets explicitly designed to include reproducible known vulnerabilities limits their application in fuzzing evaluations.
High cost of integrating benchmarks into fuzzing frameworks. Incorporating new IoT firmware or vulnerability benchmarks into existing fuzzing scheduling frameworks typically involves complex configurations or code modifications, increasing integration costs and lowering research efficiency. This complexity limits the ability of researchers to scale their experiments, particularly in large-scale unknown vulnerability discovery tasks.
Our Approach. To address the challenges above, we designed and implemented IoTBenchSL, a modular, pipeline-based framework dedicated to the efficient production and automated validation of IoT firmware and vulnerability benchmarks. IoTBenchSL includes standardized workflows for firmware and vulnerability benchmark production and a CI/CD-based automated validation pipeline. It is designed to assist researchers in rapidly constructing benchmarks that meet established standards, addressing the complexity and inefficiency issues present in current firmware emulation and vulnerability benchmark production processes. Within IoTBenchSL, we provide step-by-step auxiliary tools that simplify complex tasks such as firmware extraction, emulation environment setup, authentication script generation, and vulnerability reproduction. Researchers can efficiently construct high-quality benchmarks with predefined templates, sample code, and automation scripts. Additionally, we have established unified benchmark production standards and automated validation mechanisms to ensure consistency in format, quality, and compatibility, facilitating benchmark integration and sharing. Furthermore, the formats and scheduling of benchmarks generated by IoTBenchSL are aligned with the IoTFuzzBench scheduling framework, enabling seamless integration with IoTFuzzBench. This alignment supports lightweight, large-scale fuzzing evaluations and unknown vulnerability discovery tasks.
IoTBenchSL has successfully generated 100 firmware benchmarks and 100 vulnerability benchmarks. Each firmware benchmark includes a complete containerized setup and supports one-click emulation startup, while each vulnerability benchmark provides an emulation environment and exploit scripts. All benchmarks have passed standardized CI/CD validation to ensure effectiveness and reproducibility. These high-quality, ready-to-use benchmark datasets offer strong support for researchers to conduct rapid studies and cross-comparisons, advancing IoT security research. This paper presents the design principles and implementation details of IoTBenchSL, along with experiments demonstrating large-scale fuzzing and 0-day vulnerability discovery using firmware and vulnerability benchmarks generated by IoTBenchSL. The experimental results demonstrate the effectiveness of IoTBenchSL.
Contributions. In summary, our contributions are as follows:
New Framework. We describe the design of IoTBenchSL, which includes standardized workflows for firmware and vulnerability benchmark production, as well as a CI/CD-based automated benchmark validation pipeline. This framework provides step-by-step tools to assist researchers in efficiently constructing benchmarks that meet established standards, significantly enhancing the efficiency of benchmark production and validation.
New Dataset. Based on IoTBenchSL, we have created a dataset containing over 100 standardized firmware benchmarks and over 100 verified vulnerability benchmarks. Each benchmark provides a fully containerized emulation environment with one-click startup tools, and all benchmarks have undergone automated pipeline validation to ensure quality and reproducibility.
Implementation and Evaluation. We developed a prototype implementation of IoTBenchSL and seamlessly integrated it with IoTFuzzBench, leveraging its task scheduling and parallel fuzzing capabilities. This integration enabled us to perform large-scale fuzzing experiments across all benchmarks produced by IoTBenchSL. These experiments completed the evaluation of five fuzzers and identified 32 previously unknown vulnerabilities, including 21 that have already been assigned CVE or CNVD identifiers. This achievement underscores the effectiveness of IoTBenchSL’s workflow and its significant contribution to advancing IoT security research through systematic and scalable vulnerability discovery.
3. Methodology
This section introduces the IoTBenchSL framework, which is designed to efficiently construct and verify IoT firmware and vulnerability benchmark datasets. The framework automatically verifies the validity and reliability of these datasets via a standardized validation pipeline.
Figure 1 illustrates the high-level architecture of IoTBenchSL.
IoTBenchSL consists of three core stages: the Firmware Benchmark Production Workflow, the Vulnerability Benchmark Production Workflow, and the Benchmark Validation Pipeline. We first define the concepts of firmware and vulnerability benchmarks, then provide an overview of the entire workflow and a detailed description of each component.
We define a firmware benchmark as comprising the following four parts: Benchmark Configuration Metadata, Emulation Environment Bundle, Automated Authentication Scripts, and Initial Fuzzing Seed Set.
Benchmark Configuration Metadata records the essential attributes and parameters of the firmware benchmark in a structured format (such as YAML or JSON), including firmware details, emulation parameters, network configuration, and other necessary references for emulation and testing.
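For concreteness, the sketch below shows the rough shape such metadata might take, expressed as a Python dictionary dumped to YAML; all field names are illustrative assumptions rather than IoTBenchSL's actual schema.

```python
# Minimal sketch of firmware benchmark configuration metadata.
# Field names are hypothetical placeholders, not IoTBenchSL's actual schema.
import yaml  # PyYAML

firmware_benchmark_metadata = {
    "name": "vendor_model_v1.0.0",
    "firmware": {
        "vendor": "ExampleVendor",
        "model": "EX-1200",
        "version": "1.0.0",
        "sha256": "<firmware image hash>",
    },
    "emulation": {
        "engine": "system-mode",          # e.g., FAT/QEMU system-mode emulation
        "ip_address": "192.168.0.1",      # default IP observed during emulation testing
        "ports": [80],                    # management interface port(s)
    },
    "auth_script": "scripts/login.py",    # automated authentication script
    "seed_dir": "seeds/",                 # initial fuzzing seed set
}

print(yaml.safe_dump(firmware_benchmark_metadata, sort_keys=False))
```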
Emulation Environment Bundle provides all required resources for firmware emulation, ensuring consistent emulation conditions across different testing platforms. This bundle includes environment configuration files and support resources: the configuration files specify system requirements, dependencies, and network settings, while the support resources include system files, configuration files, dependencies, initialization scripts, and container-based firmware emulation tools, which together facilitate stable emulation.
Automated Authentication Scripts automate the login and authentication processes for the firmware’s management interface, ensuring that generated fuzzing test cases can pass initial validation and reach the critical functional areas of the firmware.
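As an illustration, such a script could take roughly the following form for a cookie-based web login using the Python requests library; the endpoint, field names, and return convention are assumptions, not IoTBenchSL's prescribed interface.

```python
# Hypothetical cookie-based authentication helper for an emulated device's
# web management interface. Endpoint and parameter names are assumptions.
import requests


def login(base_url: str, username: str, password: str) -> dict:
    """Authenticate against the management interface and return credential material."""
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login.cgi",                     # assumed login endpoint
        data={"username": username, "password": password},
        timeout=10,
    )
    resp.raise_for_status()
    return {"cookies": session.cookies.get_dict()}


def refresh_request(raw_request: bytes, auth: dict) -> bytes:
    """Rewrite the Cookie header of a raw seed request so it passes authorization."""
    cookie_line = "; ".join(f"{k}={v}" for k, v in auth["cookies"].items())
    head, _, body = raw_request.partition(b"\r\n\r\n")
    lines = [l for l in head.split(b"\r\n") if not l.lower().startswith(b"cookie:")]
    lines.append(b"Cookie: " + cookie_line.encode())
    return b"\r\n".join(lines) + b"\r\n\r\n" + body
```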
The Initial Fuzzing Seed Set comprises a collection of request packets from the firmware’s original communication interfaces, serving as seed data for fuzzing to enhance testing efficiency and coverage.
We define a vulnerability benchmark as consisting of the following four parts: Vulnerability Description Metadata, Vulnerability Payloads, Vulnerability Trigger Template, and Associated Firmware Benchmark Reference.
Vulnerability Description Metadata contains detailed information about the vulnerability in a structured format, including the vulnerability ID (e.g., CVE number), impact scope, and vulnerability type.
Vulnerability Payloads are input data or requests that validate the presence of the vulnerability and can be used to trigger the target vulnerability within the firmware benchmark.
The Vulnerability Trigger Template describes, in a structured format, the usage of vulnerability payloads and the expected outcomes upon triggering the vulnerability, supporting automated validation and assessment.
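A minimal sketch of what a trigger template might contain is shown below, rendered as a YAML document inside a Python snippet; all keys are illustrative assumptions rather than IoTBenchSL's actual format.

```python
# Illustrative vulnerability trigger template; keys are hypothetical.
import yaml

TRIGGER_TEMPLATE = """
vulnerability_id: CVE-XXXX-XXXXX
target:
  firmware_benchmark: vendor_model_v1.0.0
  interface: web                # management interface under test
steps:
  - action: authenticate        # run the firmware benchmark's auth script first
  - action: send_payload
    payload_file: payloads/overflow_request.raw
expected_outcome:
  - type: service_unreachable   # e.g., the web service stops responding
    timeout_seconds: 30
"""

template = yaml.safe_load(TRIGGER_TEMPLATE)
print(template["expected_outcome"][0]["type"])
```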
The Associated Firmware Benchmark Reference lists all affected firmware benchmarks related to the vulnerability, providing quick access to the corresponding firmware emulation environment.
At a high level, the IoTBenchSL workflow proceeds as follows: First, we collect firmware from various sources to build an initial firmware dataset. Then, through the Firmware Benchmark Production Workflow, we generate a set of pending firmware benchmarks, which are automatically verified for completeness and usability via the Benchmark Validation Pipeline. Successfully verified firmware benchmarks are then used in the Vulnerability Benchmark Production Workflow to generate a set of pending vulnerability benchmarks, further validated through the Benchmark Validation Pipeline. Finally, the verified firmware and vulnerability benchmarks are used as inputs for our previously proposed fuzzing scheduling framework, IoTFuzzBench, enabling large-scale fuzzing evaluations.
3.1. Firmware Benchmark Production Workflow
IoTBenchSL uses the collected firmware dataset as input for the Firmware Benchmark Production Workflow. This firmware dataset is partially derived from publicly available datasets released in existing research, specifically the open-source datasets provided by FirmSec and FirmAE [37,40], while the remaining firmware samples were gathered from device manufacturers’ public releases online.
Initial Emulation Testing. In this stage, IoTBenchSL performs batch emulation tests on each firmware sample to evaluate its emulation viability in a system-mode environment. During emulation, the system automatically collects critical data, such as the emulated firmware’s default IP address and port information, providing configuration references for subsequent steps.
Benchmark Template Initialization. At this stage, IoTBenchSL generates an initial benchmark template for each firmware. This template consists of two main parts: Benchmark Configuration Metadata and the Emulation Environment Bundle. IoTBenchSL provides a set of predefined emulation environment bundles, which include all the containerized resources, emulation tools, and configuration files needed for system-mode firmware emulation. In this stage, IoTBenchSL automatically populates the Benchmark Configuration Metadata and the initial Emulation Environment Bundle with firmware-specific information, such as firmware name, hash, manufacturer, model, and the emulation IP address and port obtained during testing. This results in an initial firmware benchmark template.
Firmware Benchmark Production. In this stage, IoTBenchSL provides comprehensive tools and scripts to assist security experts in creating the final firmware benchmark. In IoT environments, the firmware’s web management interface typically requires authentication, so fuzzing requests must include valid authentication information to access deeper functionalities. During fuzzing, authentication fields must be dynamically updated when mutating original seed messages to ensure that test cases pass authentication checks and successfully reach the target functional code. Therefore, the firmware benchmark must include automated login and authentication update scripts rather than relying solely on the fuzzer. IoTBenchSL provides auxiliary tools to help security experts generate and update authentication scripts within the emulation environment. It also automatically captures high-quality request messages from the firmware, which serve as seed data for fuzzing. Finally, the system packages the complete benchmark and uploads it to the Benchmark Validation Pipeline for automated validation and storage, ensuring its completeness and reproducibility. Details on the Benchmark Validation Pipeline are provided in Section 3.3.
3.2. Vulnerability Benchmark Production Workflow
The Vulnerability Benchmark Production Workflow is a standardized process within IoTBenchSL, tailored specifically for IoT firmware security testing. This workflow provides an efficient and structured method for creating vulnerability benchmarks, supported by automated tools and auxiliary features. The process takes as input a verified firmware benchmark set and public vulnerability databases and produces a standardized vulnerability benchmark set that facilitates subsequent testing and validation. The workflow comprises four key stages: Firmware Benchmark Association, Vulnerability Reproduction, Vulnerability Payload Development, and Vulnerability Benchmark Production.
Firmware Benchmark Association. In this stage, IoTBenchSL automatically links relevant vulnerability information from public vulnerability databases with specific firmware models and versions in the verified firmware benchmark set. The system identifies whether a corresponding proof-of-concept (PoC) file exists by analyzing the reference links and exploit tags in the vulnerability information, providing security experts with valuable references for vulnerability reproduction and payload development.
Vulnerability Reproduction. Based on the vulnerabilities selected in the previous stage and their associated verified firmware benchmarks, IoTBenchSL uses the emulation environment bundle within the firmware benchmark to initiate a firmware emulation environment that meets the target vulnerability’s reproduction requirements. The system preconfigures network settings and interface parameters to mirror actual device conditions, facilitating vulnerability reproduction in a controlled environment. Security experts then use relevant vulnerability reports and CVE information to confirm the vulnerability within the emulation environment, ensuring it can be triggered on the target firmware and establishing a reliable foundation for subsequent payload development.
Vulnerability Payload Development. In this stage, IoTBenchSL assists security experts in converting the vulnerability reproduction process into executable payloads and trigger templates. After confirming a vulnerability, IoTBenchSL provides templated development tools to guide experts in developing vulnerability payloads and trigger templates in a standardized YAML format. The system supplies predefined script interfaces, input-output specifications, and sample code, enabling experts to create payloads that meet testing requirements. The trigger template describes the exploitation steps, parameter configurations, and expected outcomes, ensuring that the generated payloads can be reused effectively in subsequent tests.
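For illustration, a payload script conforming to such an interface might look like the following sketch; the class name, method signatures, and target endpoint are hypothetical rather than IoTBenchSL's prescribed API.

```python
# Hypothetical payload interface that a vulnerability payload script might implement.
# Names, signatures, endpoint, and injected parameter are illustrative assumptions.
import socket


class CommandInjectionPayload:
    """Sends a crafted request that exercises the target vulnerability."""

    name = "example-command-injection"

    def build(self, auth: dict) -> bytes:
        # Assemble a raw HTTP request carrying the injected parameter.
        body = b"hostname=;reboot;"
        cookie = "; ".join(f"{k}={v}" for k, v in auth.get("cookies", {}).items())
        return (
            b"POST /goform/set_hostname HTTP/1.1\r\n"   # assumed vulnerable endpoint
            b"Host: 192.168.0.1\r\n"
            b"Cookie: " + cookie.encode() + b"\r\n"
            b"Content-Type: application/x-www-form-urlencoded\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"\r\n" + body
        )

    def send(self, host: str, port: int, auth: dict, timeout: float = 10.0) -> bytes:
        # Deliver the payload over a plain TCP socket and return the raw response.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(self.build(auth))
            return sock.recv(4096)
```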
Vulnerability Benchmark Production. This final stage dynamically generates structured vulnerability metadata, including critical information such as vulnerability type, severity, and impact scope. The system integrates vulnerability payloads, trigger templates, and associated firmware benchmarks to form a complete vulnerability benchmark. This benchmark set is standardized and highly portable, suitable for direct use in automated validation tasks, and provides a reliable vulnerability discovery testing resource.
3.3. Benchmark Validation Pipeline
The Benchmark Validation Pipeline automatically validates the pending firmware and vulnerability benchmarks generated in the previous stages. IoTBenchSL’s validation pipeline is built on a CI/CD mechanism, automatically triggering whenever new benchmarks are submitted to the benchmark repository. This pipeline includes three primary stages: Pipeline Management and Initialization, Initial Validation, and Benchmark Dynamic Validation.
Pipeline Management and Initialization. In this stage, IoTBenchSL uses version control to manage the submission and updating of firmware and vulnerability benchmarks. When a benchmark is submitted to the repository, the version control module automatically triggers the pipeline and creates validation tasks. The system employs a validation task scheduling module to allocate resources across registered runner virtual machines, and it initiates the containerized environment required for verification through an environment setup module, ensuring the independence and cleanliness of subsequent steps.
Initial Validation. At this stage, IoTBenchSL conducts a general validation of the pending firmware and vulnerability benchmarks and generates pipeline subtasks. In the General Validation module, IoTBenchSL first parses the structured benchmark configuration and vulnerability description metadata, performing a completeness check to identify any missing or erroneous information. Next, IoTBenchSL performs static checks on the remaining components of the firmware and vulnerability benchmarks to ensure consistency among the files, components, and descriptions. In the Test Jobs Generation module, IoTBenchSL ensures task independence by creating a separate subtask for each benchmark under verification and assigning it to an individual runner. The runner’s resources are released upon task completion, and the system automatically schedules the next pending tasks. Subtasks run in parallel, enhancing efficiency and optimizing resource use.
Benchmark Dynamic Validation. In this stage, IoTBenchSL performs specific test tasks for firmware and vulnerability benchmarks across different runners. The dynamic validation process consists of Firmware Benchmark Dynamic Validation and Vulnerability Benchmark Dynamic Validation, each designed to ensure that the benchmarks meet standards for reliability and functionality.
Firmware Benchmark Dynamic Validation. IoTBenchSL utilizes firmware validation plugins to validate the target firmware benchmark dynamically. The system uses the Emulation Environment Bundle within the firmware benchmark to launch the corresponding benchmark container on the runner’s container platform. Once operational, IoTBenchSL conducts Service Availability Testing and Request Authorization Testing on the emulated firmware.
In Service Availability Testing, IoTBenchSL confirms that the firmware’s management interface is accessible within the emulation environment. After initializing emulation, port forwarding tools map the emulated firmware’s management port to a specific container port, which is further mapped to a runner port. The verification tool on the runner then sends an initial seed message via socket to this port. The system verifies service availability based on whether an appropriate response is received.
Request Authorization Testing checks the functionality of the Automated Authentication Scripts. Unauthorized requests to the firmware management interface often yield errors or redirects, resulting in notable differences in response lengths between authorized and unauthorized requests. IoTBenchSL verifies the authentication script’s effectiveness by comparing these response lengths. First, the system sends an unauthorized request and records the response length. It then calls the automated login and authentication scripts to update the request with credentials, sends the updated request, and compares the response length to the initial one. A significant change indicates successful script functionality, while no change suggests failure.
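A condensed sketch of these two checks is shown below; the host, port, auth-refresh helper, and length-difference threshold are illustrative assumptions.

```python
# Condensed sketch of Service Availability Testing and Request Authorization
# Testing. Host, port, and threshold values are illustrative assumptions.
import socket


def send_raw(host: str, port: int, request: bytes, timeout: float = 10.0) -> bytes:
    """Send a raw request to the forwarded management port and collect the response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        try:
            while chunk := sock.recv(4096):
                chunks.append(chunk)
        except socket.timeout:
            pass
        return b"".join(chunks)


def service_available(host, port, seed_request):
    """Service Availability Testing: any response at all counts as alive."""
    try:
        return len(send_raw(host, port, seed_request)) > 0
    except OSError:
        return False


def auth_script_works(host, port, seed_request, refresh_request, auth, threshold=0.2):
    """Request Authorization Testing: compare response lengths before/after auth."""
    unauth_len = len(send_raw(host, port, seed_request))
    authed_len = len(send_raw(host, port, refresh_request(seed_request, auth)))
    # A significant relative change in length suggests the credentials took effect.
    return abs(authed_len - unauth_len) > threshold * max(unauth_len, 1)
```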
Vulnerability Benchmark Dynamic Validation. IoTBenchSL uses vulnerability validation plugins to verify target vulnerabilities for vulnerability benchmarks dynamically. The system initializes the relevant benchmark container on the runner, referencing the associated firmware benchmark within the vulnerability benchmark. Once operational, IoTBenchSL performs Request Permission Update and Vulnerability Trigger Verification on the emulated firmware.
In Request Permission Update, IoTBenchSL calls the Automated Authentication Scripts to dynamically update requests in the Vulnerability Payload, ensuring they bypass authentication checks and effectively trigger the vulnerability.
Vulnerability Trigger Verification involves parsing the Vulnerability Trigger Template to retrieve relevant vulnerability information, test steps, payloads, and expected conditions. IoTBenchSL follows the template-defined steps, sending the updated vulnerability payload to the target interface on the emulated firmware. This simulates an actual attack scenario to execute the vulnerability trigger. The system continuously monitors the emulation environment’s responses and behaviors, capturing abnormal events, such as service crashes, system reboots, or unusual outputs, to verify successful vulnerability activation.
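The verification logic can be sketched as follows, reusing the keys of the illustrative trigger template shown earlier; detecting crashes via a TCP liveness probe is an assumption about one possible monitoring strategy.

```python
# Sketch of Vulnerability Trigger Verification: replay the (auth-refreshed)
# payload per the trigger template, then probe the service to detect a crash.
import socket
import time


def probe_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the emulated service still accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verify_trigger(template: dict, host: str, port: int, payload: bytes,
                   send_raw, settle_seconds: float = 5.0) -> bool:
    """Send the payload and check whether the expected outcome occurs."""
    try:
        send_raw(host, port, payload)
    except OSError:
        pass  # the connection dropping mid-send is itself a strong crash signal
    time.sleep(settle_seconds)  # give the service time to crash or reboot

    expected = template["expected_outcome"][0]
    if expected["type"] == "service_unreachable":
        deadline = time.time() + expected.get("timeout_seconds", 30)
        while time.time() < deadline:
            if not probe_alive(host, port):
                return True      # vulnerability triggered: service went down
            time.sleep(1)
        return False
    # Other outcome types (abnormal output, reboot markers, ...) would go here.
    return False
```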
After successful validation, the system outputs the Verified Vulnerability Benchmarks, which, together with the Verified Firmware Benchmarks, form the core dataset for the IoTFuzzBench testing framework.
4. Implementation
To achieve modular design and high scalability, IoTBenchSL’s implementation is organized into distinct layers, breaking down the workflow into independent steps, each managed by individual scripts or functions. Python 3.10 and Bash 5.1 scripts primarily support the functionalities of each module, integrating a range of open-source tools and technologies. We selected the Firmware Analysis Toolkit (FAT) [41] as the core firmware emulation tool for IoTBenchSL. Based on Firmadyne [37] and QEMU [42], FAT is a system-mode emulation tool that efficiently emulates diverse firmware types. We leverage Docker [43] to manage and operate the emulation environments. To support this, we have defined a standardized set of Dockerfiles for FAT, which include the required environment configurations and emulation startup scripts within the containers. The following sections detail the implementation of each core module in IoTBenchSL.
4.1. Firmware Benchmark Production Workflow
In the Initial Emulation Testing module, we developed batch scripts enabling the system to automatically iterate through the firmware dataset, creating FAT containers to run emulation tests for each firmware sample. Upon successful emulation, the system logs results, including the assigned IP addresses and open ports for each firmware, which are used in the automated generation of benchmark configurations.
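A simplified sketch of such a batch script is shown below; the container image name, entry point, and emulation log format are hypothetical placeholders, not FAT's actual interface.

```python
# Sketch of batch initial emulation testing: iterate over the firmware dataset,
# run each image in a containerized FAT environment, and record whether it
# emulates plus the IP address it reports. Image name and log format are hypothetical.
import json
import pathlib
import re
import subprocess

FIRMWARE_DIR = pathlib.Path("firmware_dataset")
IP_PATTERN = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})")  # first IPv4 address in the emulation log

results = {}
for image in sorted(FIRMWARE_DIR.glob("*.bin")):
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", "--privileged",
             "-v", f"{image.resolve()}:/firmware/{image.name}:ro",
             "iotbenchsl/fat-runner",            # hypothetical FAT container image
             f"/firmware/{image.name}"],
            capture_output=True, text=True, timeout=1800,
        )
        match = IP_PATTERN.search(proc.stdout)
        ok = proc.returncode == 0 and match is not None
    except subprocess.TimeoutExpired:
        match, ok = None, False
    results[image.name] = {"emulated": ok, "ip": match.group(1) if match else None}

pathlib.Path("emulation_results.json").write_text(json.dumps(results, indent=2))
```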
The Benchmark Template Initialization module employs Python 3.10 scripts to integrate the emulation data with predefined emulation environment bundles, generating an initial benchmark template. This template is formatted as a YAML file to ensure standardized and compatible configurations. The system automatically populates the firmware’s essential information and emulation environment resources, establishing the foundation of the benchmark template.
In the Firmware Benchmark Production module, IoTBenchSL provides automated tools to assist security experts in creating the final firmware benchmark. We designed lightweight containerized emulation environment startup scripts that support one-click deployment and service forwarding, enhancing emulation environment usability. IoTBenchSL includes sample scripts for standard authentication methods, such as HTTP Basic Auth, Cookies, and Tokens, to simplify the creation of authentication scripts. Through the integrated Tcpdump [44] and Tshark [45] tools, the system captures and filters request messages with parameters, producing high-quality seed sets for fuzzing.
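The seed-filtering step might be driven by a Tshark invocation along the following lines; the capture path, display filter, and output handling are illustrative.

```python
# Sketch of filtering captured traffic for parameterized HTTP requests to use
# as fuzzing seeds. The pcap path and output layout are illustrative.
import subprocess

PCAP = "capture.pcap"

# Keep only HTTP requests that carry parameters (query string or request body).
tshark_cmd = [
    "tshark", "-r", PCAP,
    "-Y", 'http.request and (urlencoded-form or http.request.uri contains "?")',
    "-T", "fields",
    "-e", "http.request.method",
    "-e", "http.host",
    "-e", "http.request.uri",
]
output = subprocess.run(tshark_cmd, capture_output=True, text=True, check=True).stdout

for line in filter(None, output.splitlines()):
    method, host, uri = line.split("\t")
    print(f"candidate seed: {method} http://{host}{uri}")
# A production workflow would additionally export the full raw request bytes
# so each seed preserves its original headers and body.
```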
4.2. Vulnerability Benchmark Production Workflow
In the Firmware Benchmark Association stage, IoTBenchSL parses the metadata of verified firmware benchmarks to extract essential details, such as the firmware’s model and version. We developed Python scripts to search vulnerability databases, including the National Vulnerability Database (NVD) and the China National Vulnerability Database (CNVD), for vulnerabilities that match the firmware model and version. Using fields such as vulnerability descriptions and Common Platform Enumeration (CPE), relevant vulnerabilities are identified and associated with the corresponding firmware benchmarks.
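A sketch of such a lookup against the NVD REST API is shown below; the CPE string is hypothetical, and the field handling reflects our reading of the public API rather than IoTBenchSL's exact implementation.

```python
# Sketch of querying the NVD REST API (v2.0) for CVEs matching a firmware's
# CPE name and flagging references tagged as exploits/PoCs.
import requests

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def find_candidate_cves(cpe_name: str) -> list[dict]:
    resp = requests.get(NVD_API, params={"cpeName": cpe_name}, timeout=30)
    resp.raise_for_status()
    candidates = []
    for item in resp.json().get("vulnerabilities", []):
        cve = item["cve"]
        poc_links = [
            ref["url"]
            for ref in cve.get("references", [])
            if "Exploit" in ref.get("tags", [])
        ]
        candidates.append({"id": cve["id"], "poc_links": poc_links})
    return candidates


# Example: associate CVEs with a (hypothetical) router firmware benchmark.
for entry in find_candidate_cves("cpe:2.3:o:examplevendor:ex-1200_firmware:1.0.0:*:*:*:*:*:*:*"):
    print(entry["id"], "PoC refs:", entry["poc_links"])
```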
In the Vulnerability Reproduction stage, IoTBenchSL provides scripts for a one-click emulation environment setup, allowing security experts to launch the emulation environment for the associated firmware quickly. IoTBenchSL also leverages the “Exploit” tags in the References field of each NVD entry to automatically identify links to proof-of-concept (PoC) resources, which are presented to security experts for efficient verification of vulnerability presence and reproducibility.
During the Vulnerability Payload Development stage, we define a standardized YAML template to document the vulnerability exploitation steps, parameter configurations, and expected outcomes. IoTBenchSL provides a templated development environment for security experts, including predefined script interfaces and example code to facilitate the creation of vulnerability payloads and trigger templates.
In the Vulnerability Benchmark Production stage, we developed a Vulnerability Benchmark Packaging Tool that retrieves detailed information about the target vulnerability from NVD and automatically generates vulnerability description metadata. IoTBenchSL then integrates the vulnerability payloads and trigger templates created by security experts and the associated firmware benchmark references into a complete vulnerability benchmark. This finalized benchmark is submitted to the repository within the benchmark verification pipeline for subsequent automated validation.
4.3. Benchmark Validation Pipeline
In the Benchmark Validation Pipeline, we employ GitLab Pipelines [46] as the continuous integration and continuous deployment (CI/CD) framework and have configured a private GitLab server to automate the validation process.
During the Pipeline Management and Initialization stage, when a new firmware or vulnerability benchmark is submitted to the GitLab repository, GitLab Pipelines automatically triggers the corresponding validation pipeline. The pipeline’s stages, tasks, execution order, and test scripts are defined in a YAML file. Using GitLab Runner, we registered multiple virtual machine runners in the GitLab instance and selected the Shell executor to perform tasks within the pipeline. The Shell executor was chosen to avoid the nested virtualization issues associated with Docker-in-Docker, ensuring the stability of the firmware emulation process.
In the Initial Validation stage, we developed Python scripts to parse and validate the completeness and correctness of each benchmark’s metadata. These scripts verify required fields and ensure proper formatting. The system also performs static checks on each benchmark component to ensure completeness and consistency. For instance, it verifies the presence of configuration files in the Emulation Environment Bundle and checks the executability of Automated Authentication Scripts. We designed dynamically generated sub-pipelines within the main pipeline to test each firmware and vulnerability benchmark independently without interference. In the Test Jobs Generation module, the system dynamically creates independent child pipeline configuration files according to the number of benchmarks to be tested, saving them as artifacts. These child pipelines are then triggered by a designated job in the main pipeline, ensuring seamless execution of downstream pipelines and comprehensive validation of each firmware and vulnerability benchmark.
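The child-pipeline generation step can be sketched as follows; the job names, runner tags, and script paths are illustrative, not IoTBenchSL's actual CI configuration.

```python
# Sketch of the Test Jobs Generation step: emit one child-pipeline YAML per
# pending benchmark so each is validated in an isolated job. Job names and
# script paths are illustrative placeholders.
import pathlib

PENDING = ["fw_examplevendor_ex1200_v1.0.0", "fw_examplevendor_ex1300_v2.1.0"]
OUT = pathlib.Path("generated-pipelines")
OUT.mkdir(exist_ok=True)

CHILD_TEMPLATE = """\
validate-{name}:
  stage: dynamic-validation
  tags: [shell-runner]          # run on a registered shell-executor runner
  script:
    - python3 tools/validate_benchmark.py --benchmark {name}
"""

for name in PENDING:
    (OUT / f"{name}.yml").write_text(CHILD_TEMPLATE.format(name=name))
# The main pipeline saves generated-pipelines/ as an artifact and triggers each
# file as a child pipeline via GitLab's dynamic child pipeline mechanism.
```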
We implemented dedicated validation plugins for both firmware and vulnerability benchmarks in the Benchmark Dynamic Validation stage. These plugins can initiate Docker-based emulation environments on the runner and perform specific test tasks for each benchmark type. The system executes Service Availability Testing and Request Authorization Testing within the emulation environment for firmware benchmarks. For vulnerability benchmarks, it performs Request Permission Update and Vulnerability Trigger Verification. Upon completion of testing, the system automatically updates the test results in the GitLab repository, allowing security experts to review and analyze the findings.
5. Experiments and Results
We evaluate the effectiveness of the IoTBenchSL prototype by considering the following research questions:
RQ 1: Can IoTBenchSL create firmware and vulnerability benchmarks that are larger in scale and more diverse than those in existing open-source IoT fuzzing datasets?
RQ 2: How efficient is IoTBenchSL in benchmark production and validation?
RQ 3: Can the benchmarks generated by IoTBenchSL be seamlessly integrated into fuzzing frameworks to enable low-cost and effective practical fuzzer evaluation?
RQ 4: Can the benchmarks provided by IoTBenchSL enable large-scale, cost-effective discovery of unknown vulnerabilities?
To answer these questions, we design and conduct four different experiments.
5.1. Experiment Settings
We outline our experimental setup, detailing the hardware, fuzzers, and firmware targets utilized.
Hardware Configuration. All experiments were conducted on identical hardware configurations, specifically two Intel Xeon Silver 4314 CPUs @ 2.40 GHz (64 logical cores in total) with 256 GB of RAM, running 64-bit Ubuntu 22.04 LTS. For each fuzzer, we allocated one CPU core, 2 GB of RAM, and 1 GB of swap space. In each task across experimental groups, the fuzzer operates within an isolated Docker container, performing fuzzing tasks on target benchmark firmware, which also runs in an isolated Docker container. Benchmark containers are not shared between tasks to ensure independence.
Fuzzers. Our evaluation includes the following fuzzers: Snipuzz [17], T-Reqs [47], Mutiny [48], Fuzzotron [49], MSLFuzzer [50], and Boofuzz [51]. These fuzzers all support network-based fuzzing for IoT devices. Snipuzz and T-Reqs, open-sourced and presented at the top-tier security conference ACM CCS 2021, represent typical academic fuzzing tools. Mutiny, an open-source fuzzer by Cisco, and Fuzzotron, an actively maintained community fuzzer with ongoing updates within the past six months, exemplify open-source tools from industry. Mutiny and Fuzzotron have garnered over 500 stars on GitHub, underscoring their community impact. Boofuzz is a notable representative of generative fuzzing frameworks and is frequently used as a baseline in research studies. Finally, MSLFuzzer is a black-box fuzzer developed in our prior research. For a fair and unbiased evaluation among the fuzzers, MSLFuzzer is used solely for the unknown vulnerability discovery experiments.
Targets. To ensure the benchmarks created by IoTBenchSL reflect real-world conditions, all firmware images and vulnerabilities used in our experiments were drawn from authentic sources. We compiled three sets of actual firmware samples for the initial dataset, totaling 19,018 images. The first set consisted of 842 firmware images from the dataset provided by FirmAE [37]. The second included 13,596 images from the FirmSec dataset [40], and the third contained 4580 images publicly released by various manufacturers. Each target was a complete firmware image with an entire filesystem rather than an isolated component and was provided without accompanying documentation or emulation methods. Throughout the experiments, IoTBenchSL was applied to unpack and emulate these initial samples, filtering for successfully emulated targets. Additionally, we cross-referenced each sample with associated N-day vulnerabilities from the NVD and CNVD databases, performing a secondary selection on the firmware. This screened set of firmware was then used to create and validate benchmarks and to conduct further fuzzing experiments.
5.2. Large-Scale Benchmark Production
To address RQ 1, we systematically compared the benchmark datasets generated by IoTBenchSL with existing datasets using quantitative metrics and qualitative analysis. We aimed to produce as many firmware and vulnerability benchmarks as possible from an initial set of 19,018 firmware samples and compare them with existing open-source emulated firmware and vulnerability datasets. These initial firmware samples lack emulation methods and accompanying documentation, making them unsuitable for direct use in fuzzing evaluation. IoTBenchSL addresses this challenge by enabling automated emulation attempts, which enhances reproducibility and reduces human bias. This is a crucial first step in achieving comprehensive firmware and vulnerability benchmarking.
First, we employed IoTBenchSL to perform initial emulation tests on these firmware samples, successfully emulating 300 firmware images. Constructing emulation environments traditionally demands significant time and labor; IoTBenchSL streamlines this process by automating critical steps. From the successfully emulated images, we selected 100 representative firmware samples for benchmark production. The selection criteria included the representativeness of firmware manufacturers, the popularity of the firmware models, and the number of historical vulnerabilities associated with each firmware. Next, IoTBenchSL produced firmware benchmarks from these 100 selected firmware samples. After completing the firmware benchmarks, we identified the 100 most relevant vulnerabilities based on firmware models and versions from the NVD and CNVD databases. Leveraging IoTBenchSL, we reproduced these vulnerabilities and created corresponding vulnerability benchmarks.
Appendix A Table A1 provides a comprehensive list of all firmware benchmarks successfully created and validated using IoTBenchSL, while Appendix A Table A2 presents the complete list of vulnerabilities reproduced using IoTBenchSL and subsequently transformed into vulnerability benchmarks.
We collected and analyzed data for these firmware and vulnerability benchmarks during production and compared them with existing open-source datasets.
Table 2 compares the firmware and vulnerability benchmarks produced with IoTBenchSL and other datasets. It is important to note that the firmware and vulnerability counts in Table 2 represent the initial numbers provided by the corresponding papers’ open-source datasets and exclude newly discovered 0-day vulnerabilities.
The comparison demonstrates that IoTBenchSL significantly enhances the scale of firmware and vulnerability benchmarks, achieving 100 benchmarks in each category, representing a several-fold or even an order-of-magnitude increase compared to existing datasets. Furthermore, IoTBenchSL surpasses other emulated firmware datasets in terms of diversity, providing a broader range of firmware models and vulnerability types. The dataset also features a one-click startup for the emulated environment of each firmware benchmark and automated vulnerability validation scripts. These capabilities substantially reduce the time and effort required for researchers to set up emulation and reproduction environments and enable efficient and reproducible cross-evaluation of fuzzers on a unified dataset. All benchmarks listed in Appendix A Table A1 and Table A2 will be open-sourced in the IoTVulBench repository [52].
5.3. Efficiency in Benchmark Production and Validation
To address RQ 2, we designed a controlled comparative experiment to evaluate the efficiency improvements and resource consumption associated with IoTBenchSL in producing and verifying firmware and vulnerability benchmarks. To ensure the validity of this evaluation, we conducted a structured comparison between the traditional manual process and IoTBenchSL’s automated workflow. This comparison was based on quantifiable metrics, including execution time across different experimental groups during both the benchmark production and validation phases. By adopting this approach, we aimed to provide an objective assessment of IoTBenchSL’s advantages in terms of time efficiency and labor reduction, ensuring that our conclusions are both reproducible and reliable.
In our experimental design, we selected 20 representative firmware samples from the dataset generated by IoTBenchSL. These samples encompass a variety of vendors and common vulnerability types, such as memory corruption and command injection, ensuring diversity and representativeness. The experiment included two groups: Group A, which utilized IoTBenchSL for automated benchmark production and validation, and Group B, which followed the conventional manual workflow. Both groups consisted of researchers with equivalent expertise and professional experience in cybersecurity, ensuring standardized operations and comparability of results. Detailed records of time consumption and human involvement were maintained to ensure the completeness and accuracy of the collected data.
Figure 2 compares the total time consumption for benchmark production and validation in Group A and Group B. Group A, utilizing IoTBenchSL, completed the entire process in 50.9 h, whereas Group B, following the manual workflow, required 151.67 h. This corresponds to an approximately threefold improvement in overall efficiency. In the validation phase, Group A completed the process for all benchmarks in 3.33 h, whereas Group B required 31.33 h, a 9.4-fold improvement in validation efficiency. This indicates that IoTBenchSL can significantly shorten the time required for benchmark production and validation.
Figure 3 provides a more detailed comparison of time efficiency, showing the average time consumed per benchmark in both the production and validation phases. In the production phase, Group A spent an average of 1.28 h per firmware benchmark and 1.26 h per vulnerability benchmark, whereas Group B spent 4.25 h per firmware benchmark and 3.5 h per vulnerability benchmark. These results indicate a 3.3-fold improvement in firmware benchmark production efficiency and a 2.6-fold improvement in vulnerability benchmark production efficiency.
The experimental results demonstrate the significant advantage of IoTBenchSL in enhancing the efficiency of benchmark production and validation. The substantially reduced time required for benchmark generation and validation enables researchers to more effectively generate and verify large-scale datasets. By significantly reducing manual effort and time investment, IoTBenchSL allows researchers to focus more on experimentation and analysis, ultimately accelerating the overall research process and improving the reproducibility of results.
5.4. Evaluation of Vulnerability Discovery Capabilities Based on Benchmarks
To address RQ 3, we evaluated the seamless integration of IoTBenchSL-generated benchmarks into existing fuzzing frameworks and assessed their effectiveness in practical fuzzer evaluations. To ensure the validity of our approach, we systematically measured the performance of multiple fuzzers in detecting N-day vulnerabilities after integrating these benchmarks. The evaluation focused on two key aspects: (1) the completeness and efficiency of fuzzer evaluations when using IoTBenchSL-generated datasets, and (2) the differential performance of various fuzzers in the task of detecting N-day vulnerabilities. By conducting a structured comparison, we aimed to demonstrate that IoTBenchSL not only facilitates cost-effective and efficient fuzzer assessment but also enhances the reproducibility and reliability of vulnerability detection. These findings reinforce the practical value of IoTBenchSL in real-world security testing environments and its potential to standardize and improve fuzzer evaluation methodologies.
In our previous work, IoTFuzzBench, we integrated five protocol fuzzers, namely Mutiny, Snipuzz, Boofuzz, T-Reqs, and Fuzzotron, enabling large-scale parallel fuzzing evaluations. Building upon this, we integrated the datasets produced by IoTBenchSL into IoTFuzzBench, significantly reducing the implementation cost of fuzzing evaluations. This integration allows researchers to define custom fuzzing evaluation tasks through a simple configuration file, streamlining the process and eliminating the need for manual intervention. A single configuration file was used to define all participating fuzzers and the 100 selected vulnerabilities, enabling the system to execute 500 fuzzing experiments automatically. Each fuzzer performed its tasks using raw request seeds corresponding to the target vulnerabilities, without embedded exploit payloads, to systematically assess their vulnerability discovery capabilities on the IoTBenchSL dataset.
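To convey the idea, such a task definition might look roughly like the following; the keys shown are illustrative and do not reproduce IoTFuzzBench's actual configuration schema.

```python
# Illustrative shape of a large-scale evaluation configuration; keys are
# hypothetical and do not reproduce IoTFuzzBench's actual configuration format.
import yaml

EVALUATION_CONFIG = """
fuzzers: [mutiny, snipuzz, boofuzz, t-reqs, fuzzotron]
vulnerability_benchmarks: "benchmarks/vulnerabilities/*.yml"   # 100 benchmarks
seeds: raw                # raw request seeds, no embedded exploit payloads
"""

config = yaml.safe_load(EVALUATION_CONFIG)
# 5 fuzzers x 100 vulnerability benchmarks = 500 fuzzing experiments in total.
print(len(config["fuzzers"]) * 100, "experiments")
```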
Table 3 summarizes the performance of the fuzzers on the IoTBenchSL-generated vulnerability benchmarks. Among the 100 benchmarks, Mutiny and Fuzzotron identified the highest number of vulnerabilities, detecting 42 and 41, respectively. Conversely, Boofuzz demonstrated the weakest performance, uncovering only five vulnerabilities. Despite this, Boofuzz achieved the shortest average discovery time at 316.09 s, while T-Reqs required the longest time, averaging 8693.40 s. Regarding memory consumption, Fuzzotron exhibited the lowest usage, whereas Snipuzz consumed the most. Notably, the average memory consumption for all fuzzers remained below 250 MB, which is minimal compared to the tens or hundreds of gigabytes typically available on modern servers. Thus, memory consumption is not a bottleneck for current fuzzing tools in vulnerability detection. However, in resource-constrained environments where dozens or even hundreds of fuzzers may be deployed, low-memory consumption fuzzers such as Fuzzotron and Mutiny, averaging under 20 MB, are preferable.
These results underscore IoTBenchSL’s effectiveness in supporting diverse, large-scale fuzzer evaluation scenarios. By enabling systematic comparisons of fuzzer performance on a unified benchmark dataset, IoTBenchSL facilitates more consistent, reproducible, and insightful analyses in IoT fuzzing.
5.5. 0-Day Vulnerability Discovery Using Firmware Benchmarks
To address RQ 4, we assessed the potential of IoTBenchSL as a resource for large-scale zero-day vulnerability discovery by systematically conducting experiments on its standardized firmware dataset. Specifically, we aimed to evaluate its effectiveness in facilitating the identification of previously unknown vulnerabilities and its applicability to real-world vulnerability research. To ensure a rigorous assessment, we employed MSLFuzzer, a protocol fuzzer we previously developed, to test the web management interfaces of IoT devices. By applying MSLFuzzer to the IoTBenchSL-generated benchmarks, we sought to demonstrate the dataset’s practical utility in uncovering new security vulnerabilities and further validate its applicability in real-world vulnerability research.
For this study, MSLFuzzer was selected as the primary tool for 0-day vulnerability discovery. We integrated MSLFuzzer into the IoTFuzzBench platform, enabling large-scale 0-day vulnerability discovery experiments on the firmware dataset provided by IoTBenchSL. The IoTBenchSL dataset comprises 100 firmware benchmarks and 4935 raw seeds, offering broad manufacturer coverage and diverse sample representation. Leveraging IoTFuzzBench’s task isolation mechanism, we executed independent fuzzing experiments for each seed. Each experiment was conducted in a fully isolated environment for 24 h, ensuring reliable and consistent results and enabling a systematic approach to large-scale 0-day vulnerability discovery.
The experimental results identified 968 crash samples on IoTBenchSL’s firmware benchmarks. These crash samples underwent rigorous filtering, de-duplication, and comparison with historical vulnerabilities to assess their relevance to new and existing vulnerabilities. This systematic analysis confirmed 40 known vulnerabilities and identified 32 unknown vulnerabilities. Among these newly discovered vulnerabilities, 21 have been assigned CNVD and CVE identifiers, as shown in Table 4. All discovered unknown vulnerabilities were responsibly disclosed to the corresponding vendors through the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT) to facilitate prompt remediation efforts. These efforts assist vendors in implementing timely fixes and enhancing the security posture of their devices.
Additionally, after conducting a comparative analysis of different crash samples from our experiments, we found that some vulnerabilities not only affect the targeted firmware but also impact multiple firmware versions from the same vendor. For instance, CVE-2023-33538 was found to affect multiple TP-Link devices, including TL-WR940N, TL-WR841N, and TL-WR941ND. This finding highlights two key aspects: (1) the widespread code reuse among firmware from the same vendor, which exacerbates security risks, and (2) the effectiveness of IoTBenchSL’s benchmark datasets in facilitating unknown vulnerability discovery and impact analysis.
The experimental results demonstrate that the standardized firmware dataset provided by IoTBenchSL effectively supports large-scale automated 0-day vulnerability discovery, serving as a high-quality, ready-to-use research resource. Through its integration with the IoTFuzzBench platform, the IoTBenchSL dataset offers researchers a convenient environment for firmware fuzzing, accelerating progress in IoT firmware vulnerability discovery and cross-comparison experiments. For researchers aiming to deploy IoT firmware security studies and validate their findings rapidly, IoTBenchSL provides comprehensive sample data support and, through its synergy with IoTFuzzBench, highlights its critical value in enabling scalable and standardized vulnerability discovery experiments.
6. Discussion and Future Work
While IoTBenchSL provides a practical framework for constructing and verifying IoT firmware and vulnerability benchmarks, several opportunities remain to enhance its functionality and expand its applicability. As IoTBenchSL represents an initial step toward a standardized and automated approach to IoT fuzzing benchmarks, we would like to discuss current design limitations and explore directions for future improvements.
Expanding Firmware Emulation Methods. Currently, IoTBenchSL focuses primarily on system-mode firmware emulation, with limited support for user-mode emulation. This limitation arises because current user-mode emulation technologies are generally suited for emulating individual applications rather than offering a complete operating system and interface environment, limiting their applicability in IoTBenchSL. Consequently, many firmware samples fail to run smoothly without manual intervention and code adjustments to bypass crashes or restrictions. This reliance on expert intervention makes it challenging to scale user-mode emulation across large volumes of firmware samples. However, preliminary experiments with user-mode emulation, such as on the Tenda AC9 firmware, demonstrate that IoTBenchSL is adaptable to user-mode emulation, opening up possibilities for broader implementation. In future work, we aim to expand the quantity and diversity of user-mode firmware and explore automation methods to reduce time costs, enhancing the dataset’s diversity and applicability.
Broadening Vulnerability Types. Currently, IoTBenchSL’s vulnerability benchmarks focus on buffer overflow, command injection, and denial-of-service vulnerabilities. These vulnerability types are relatively straightforward to detect and assess using protocol fuzzing tools, making them suitable samples for evaluating fuzzer performance. To increase the dataset’s broad applicability, we plan to incorporate a more comprehensive range of vulnerabilities in future work. A wider variety of vulnerabilities will place higher demands on the detection capabilities of fuzzers after a vulnerability is triggered and require greater adaptability from scheduling frameworks like IoTFuzzBench. By adding modular vulnerability monitoring components, we aim to improve the scheduling framework’s compatibility with different vulnerability types, thereby expanding vulnerability coverage and reducing development costs for fuzzers.
Expanding Firmware Attack Surfaces. Currently, the firmware benchmarks in IoTBenchSL focus primarily on the web management interface as the primary attack surface for fuzzing. However, emulated firmware may contain other protocol interfaces, such as FTP, UPnP, and HNAP, which could expand the breadth and depth of fuzzing. In future work, we will explore ways to broaden support for these alternative attack surfaces within the benchmark dataset, creating a more comprehensive fuzzing framework and extending the applicability of fuzzers.
Automating Fuzzing Seed Generation. The current process for generating fuzzing seeds within firmware benchmarks depends on security experts manually interacting with emulated firmware to capture raw requests and using IoTBenchSL’s auxiliary tools to filter suitable fuzzing seeds. This manual process is time-intensive and limits interface coverage. To improve efficiency, we are considering introducing static analysis techniques to automatically generate raw requests as seeds by parsing static resources in the firmware and using reverse engineering to identify potential communication interfaces and parameters. This automated method would significantly enhance benchmark production efficiency and improve the coverage and effectiveness of fuzzing. This will be a primary focus in our next phase of research.
Deployment Scenarios. In practical applications, we anticipate that researchers will focus on two primary scenarios: (1) how to quickly deploy and utilize all firmware and vulnerability datasets provided by IoTBenchSL and perform fuzz testing experiments using their fuzzers, and (2) how to conduct large-scale fuzzing comparison experiments between different fuzzing tools using the datasets provided by IoTBenchSL. For Scenario 1, we provide all standardized firmware and vulnerability benchmarks, along with corresponding scaffolding tools, in the open-source repository. These tools enable researchers to rapidly set up emulation environments for any firmware. Researchers can then use their fuzzers to conduct fuzzing experiments on the selected firmware. For Scenario 2, the IoTBenchSL datasets are designed to be used in conjunction with the IoTFuzzBench framework, which we have also open-sourced. During development, we accounted for integration issues, ensuring that the dataset format provided by IoTBenchSL can be directly parsed and utilized by the IoTFuzzBench framework. Researchers need only deploy the IoTFuzzBench framework and can import the IoTBenchSL datasets as resource files with a single click. Additionally, the configuration files provided by IoTFuzzBench allow researchers to define the desired fuzzers and target firmware environments, enabling them to easily conduct multi-fuzzer comparative experiments with a single command. Continuous improvement and maintenance of these open-source repositories is one of our future goals.