1. Introduction
With the rapid development of Internet of Things (IoT) technology, various smart devices, such as home routers, smart cameras, and smart bulbs, have become deeply integrated into daily life. However, frequent updates and widespread reuse of core components have introduced numerous security vulnerabilities into IoT device firmware, which are challenging to eliminate [1]. Once exploited, these vulnerabilities can lead to system crashes, privacy breaches, and even full device control, enabling attackers to launch large-scale distributed denial-of-service (DDoS) attacks by creating botnets [2]. For example, the notorious Mirai botnet exploited vulnerabilities in numerous home routers, posing severe threats to internet infrastructure [3].
Fuzzing, an effective technique for vulnerability detection, has been widely applied to IoT firmware security testing [4]. Typically, fuzzing requires access to either physical or emulated IoT devices as testing targets. By sending carefully crafted malformed requests to a device’s management interface (e.g., a web management interface), a fuzzer evaluates whether the device’s response aligns with the expected behavior [5,6]. Test messages must pass authorization checks on the target’s management interface to effectively trigger deep-seated vulnerabilities within the firmware [7]. Researchers designing fuzzers require a large set of emulated IoT devices and established N-day vulnerability benchmarks to validate fuzzer effectiveness [8].
However, constructing portable emulation environments for firmware presents significant challenges. Differences among firmware vendors, device architectures, and dependencies on external hardware make emulation environments highly complex and often result in low success rates [9]. This complexity makes it difficult for researchers to obtain sufficient standardized testing targets. Furthermore, there is a lack of user-friendly task scheduling and evaluation frameworks, and existing high-quality firmware emulation environments and comprehensive vulnerability benchmark datasets often fail to integrate seamlessly with these frameworks [10]. Standardized firmware and vulnerability benchmark datasets that connect seamlessly with scheduling and evaluation frameworks are essential to supporting large-scale fuzzing evaluation and unknown vulnerability discovery [4]. The absence of these critical resources and tools hinders the development of fuzzing technology, limiting the evaluation capabilities of fuzzers and the scalability of parallel firmware fuzzing [5].
Challenges. In previous research, we introduced IoTFuzzBench [11], an open-source framework for IoT fuzzing task scheduling and evaluation. IoTFuzzBench can define and execute complex parallel IoT firmware fuzzing tasks through configuration files. However, IoTFuzzBench has limitations in addressing the standardization of firmware emulation environments and the reproducibility of vulnerability datasets. These shortcomings underscore the pressing need for scalable, high-quality benchmark datasets to advance large-scale IoT fuzzing research. Specifically, current research encounters substantial challenges in the following domains:
Inefficiency in benchmark production and validation processes. The current process for IoT firmware emulation and vulnerability benchmark production involves multiple steps that require specialized expertise, which makes it highly time-consuming and prone to errors [12]. Moreover, the lack of standardized workflows further exacerbates inefficiencies, creating inconsistencies in the format, quality, and compatibility of benchmarks produced by different researchers. These issues hinder dataset sharing and reuse, thereby limiting the scalability of experiments and impeding the horizontal comparison of different methods in IoT fuzzing.
Scarcity of scalable, reproducible, and ready-to-use firmware and vulnerability benchmarks with known vulnerabilities. Publicly available datasets that integrate reproducible firmware and associated vulnerabilities are critical for enabling researchers to perform rapid studies and cross-comparisons [10]. However, existing open-source datasets often fall short in terms of scale, reproducibility, and standardization, significantly constraining their utility in advancing research. Furthermore, the lack of datasets explicitly designed to include reproducible known vulnerabilities limits their application in fuzzing evaluations.
High cost of integrating benchmarks into fuzzing frameworks. Incorporating new IoT firmware or vulnerability benchmarks into existing fuzzing scheduling frameworks typically involves complex configurations or code modifications, increasing integration costs and lowering research efficiency. This complexity limits the ability of researchers to scale their experiments, particularly in large-scale unknown vulnerability discovery tasks.
Our Approach. To address the challenges above, we designed and implemented IoTBenchSL, a modular, pipeline-based framework dedicated to the efficient production and automated validation of IoT firmware and vulnerability benchmarks. IoTBenchSL includes standardized workflows for firmware and vulnerability benchmark production and a CI/CD-based automated validation pipeline. It is designed to assist researchers in rapidly constructing benchmarks that meet established standards, addressing the complexity and inefficiency issues present in current firmware emulation and vulnerability benchmark production processes. Within IoTBenchSL, we provide step-by-step auxiliary tools that simplify complex tasks such as firmware extraction, emulation environment setup, authentication script generation, and vulnerability reproduction. Researchers can efficiently construct high-quality benchmarks with predefined templates, sample code, and automation scripts. Additionally, we have established unified benchmark production standards and automated validation mechanisms to ensure consistency in format, quality, and compatibility, facilitating benchmark integration and sharing. Furthermore, the formats and scheduling of benchmarks generated by IoTBenchSL are aligned with the IoTFuzzBench scheduling framework, enabling seamless integration with IoTFuzzBench. This alignment supports lightweight, large-scale fuzzing evaluations and unknown vulnerability discovery tasks.
IoTBenchSL has successfully generated 100 firmware benchmarks and 100 vulnerability benchmarks. Each firmware benchmark includes a complete containerized setup and supports one-click emulation startup, while each vulnerability benchmark provides an emulation environment and exploit scripts. All benchmarks have passed standardized CI/CD validation to ensure effectiveness and reproducibility. These high-quality, ready-to-use benchmark datasets offer strong support for researchers to conduct rapid studies and cross-comparisons, advancing IoT security research. This paper presents the design principles and implementation details of IoTBenchSL, along with experiments demonstrating large-scale fuzzing and 0-day vulnerability discovery using firmware and vulnerability benchmarks generated by IoTBenchSL. The experimental results demonstrate the effectiveness of IoTBenchSL.
Contributions. In summary, our contributions are as follows:
New Framework. We describe the design of IoTBenchSL, which includes standardized workflows for firmware and vulnerability benchmark production, as well as a CI/CD-based automated benchmark validation pipeline. This framework provides step-by-step tools to assist researchers in efficiently constructing benchmarks that meet established standards, significantly enhancing the efficiency of benchmark production and validation.
New Dataset. Based on IoTBenchSL, we have created a dataset containing over 100 standardized firmware benchmarks and over 100 verified vulnerability benchmarks. Each benchmark provides a fully containerized emulation environment with one-click startup tools, and all benchmarks have undergone automated pipeline validation to ensure quality and reproducibility.
Implementation and Evaluation. We developed a prototype implementation of IoTBenchSL and seamlessly integrated it with IoTFuzzBench, leveraging its task scheduling and parallel fuzzing capabilities. This integration enabled us to perform large-scale fuzzing experiments across all benchmarks produced by IoTBenchSL. These experiments completed the evaluation of five fuzzers and identified 32 previously unknown vulnerabilities, including 21 that have already been assigned CVE or CNVD identifiers. This achievement underscores the effectiveness of IoTBenchSL’s workflow and its significant contribution to advancing IoT security research through systematic and scalable vulnerability discovery.
3. Methodology
This section introduces the IoTBenchSL framework, which is designed to efficiently construct and verify IoT firmware and vulnerability benchmark datasets. The framework automatically verifies the validity and reliability of these datasets via a standardized validation pipeline.
Figure 1 illustrates the high-level architecture of IoTBenchSL.
IoTBenchSL consists of three core stages: the Firmware Benchmark Production Workflow, the Vulnerability Benchmark Production Workflow, and the Benchmark Validation Pipeline. We first define the concepts of firmware and vulnerability benchmarks, then provide an overview of the entire workflow and a detailed description of each component.
We define a firmware benchmark as comprising the following four parts: Benchmark Configuration Metadata, Emulation Environment Bundle, Automated Authentication Scripts, and Initial Fuzzing Seed Set.
Benchmark Configuration Metadata records the essential attributes and parameters of the firmware benchmark in a structured format (such as YAML or JSON), including firmware details, emulation parameters, network configuration, and other necessary references for emulation and testing.
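For concreteness, the sketch below shows the rough shape such metadata might take, expressed as a Python dictionary dumped to YAML; all field names are illustrative assumptions rather than IoTBenchSL's actual schema.

```python
# Minimal sketch of firmware benchmark configuration metadata.
# Field names are hypothetical placeholders, not IoTBenchSL's actual schema.
import yaml  # PyYAML

firmware_benchmark_metadata = {
    "name": "vendor_model_v1.0.0",
    "firmware": {
        "vendor": "ExampleVendor",
        "model": "EX-1200",
        "version": "1.0.0",
        "sha256": "<firmware image hash>",
    },
    "emulation": {
        "engine": "system-mode",          # e.g., FAT/QEMU system-mode emulation
        "ip_address": "192.168.0.1",      # default IP observed during emulation testing
        "ports": [80],                    # management interface port(s)
    },
    "auth_script": "scripts/login.py",    # automated authentication script
    "seed_dir": "seeds/",                 # initial fuzzing seed set
}

print(yaml.safe_dump(firmware_benchmark_metadata, sort_keys=False))
```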
Emulation Environment Bundle provides all required resources for firmware emulation, ensuring consistent emulation conditions across different testing platforms. This bundle includes environment configuration files and support resources: the configuration files specify system requirements, dependencies, and network settings, while the support resources include system files, configuration files, dependencies, initialization scripts, and container-based firmware emulation tools, which together facilitate stable emulation.
Automated Authentication Scripts automate the login and authentication processes for the firmware’s management interface, ensuring that generated fuzzing test cases can pass initial validation and reach the critical functional areas of the firmware.
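As an illustration, such a script could take roughly the following form for a cookie-based web login using the Python requests library; the endpoint, field names, and return convention are assumptions, not IoTBenchSL's prescribed interface.

```python
# Hypothetical cookie-based authentication helper for an emulated device's
# web management interface. Endpoint and parameter names are assumptions.
import requests


def login(base_url: str, username: str, password: str) -> dict:
    """Authenticate against the management interface and return credential material."""
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login.cgi",                     # assumed login endpoint
        data={"username": username, "password": password},
        timeout=10,
    )
    resp.raise_for_status()
    return {"cookies": session.cookies.get_dict()}


def refresh_request(raw_request: bytes, auth: dict) -> bytes:
    """Rewrite the Cookie header of a raw seed request so it passes authorization."""
    cookie_line = "; ".join(f"{k}={v}" for k, v in auth["cookies"].items())
    head, _, body = raw_request.partition(b"\r\n\r\n")
    lines = [l for l in head.split(b"\r\n") if not l.lower().startswith(b"cookie:")]
    lines.append(b"Cookie: " + cookie_line.encode())
    return b"\r\n".join(lines) + b"\r\n\r\n" + body
```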
The Initial Fuzzing Seed Set comprises a collection of request packets from the firmware’s original communication interfaces, serving as seed data for fuzzing to enhance testing efficiency and coverage.
We define a vulnerability benchmark as consisting of the following four parts: Vulnerability Description Metadata, Vulnerability Payloads, Vulnerability Trigger Template, and Associated Firmware Benchmark Reference.
Vulnerability Description Metadata contains detailed information about the vulnerability in a structured format, including the vulnerability ID (e.g., CVE number), impact scope, and vulnerability type.
Vulnerability Payloads are input data or requests that validate the presence of the vulnerability and can be used to trigger the target vulnerability within the firmware benchmark.
The Vulnerability Trigger Template describes, in a structured format, the usage of vulnerability payloads and the expected outcomes upon triggering the vulnerability, supporting automated validation and assessment.
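A minimal sketch of what a trigger template might contain is shown below, rendered as a YAML document inside a Python snippet; all keys are illustrative assumptions rather than IoTBenchSL's actual format.

```python
# Illustrative vulnerability trigger template; keys are hypothetical.
import yaml

TRIGGER_TEMPLATE = """
vulnerability_id: CVE-XXXX-XXXXX
target:
  firmware_benchmark: vendor_model_v1.0.0
  interface: web                # management interface under test
steps:
  - action: authenticate        # run the firmware benchmark's auth script first
  - action: send_payload
    payload_file: payloads/overflow_request.raw
expected_outcome:
  - type: service_unreachable   # e.g., the web service stops responding
    timeout_seconds: 30
"""

template = yaml.safe_load(TRIGGER_TEMPLATE)
print(template["expected_outcome"][0]["type"])
```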
The Associated Firmware Benchmark Reference lists all affected firmware benchmarks related to the vulnerability, providing quick access to the corresponding firmware emulation environment.
At a high level, the IoTBenchSL workflow proceeds as follows: First, we collect firmware from various sources to build an initial firmware dataset. Then, through the Firmware Benchmark Production Workflow, we generate a set of pending firmware benchmarks, which are automatically verified for completeness and usability via the Benchmark Validation Pipeline. Successfully verified firmware benchmarks are then used in the Vulnerability Benchmark Production Workflow to generate a set of pending vulnerability benchmarks, further validated through the Benchmark Validation Pipeline. Finally, the verified firmware and vulnerability benchmarks are used as inputs for our previously proposed fuzzing scheduling framework, IoTFuzzBench, enabling large-scale fuzzing evaluations.
3.1. Firmware Benchmark Production Workflow
IoTBenchSL uses the collected firmware dataset as input for the Firmware Benchmark Production Workflow. This firmware dataset is partially derived from publicly available datasets released in existing research, specifically the open-source datasets provided by FirmSec and FirmAE [37,40], while the remaining firmware samples were gathered from device manufacturers’ public releases online.
Initial Emulation Testing. In this stage, IoTBenchSL performs batch emulation tests on each firmware sample to evaluate its emulation viability in a system-mode environment. During emulation, the system automatically collects critical data, such as the emulated firmware’s default IP address and port information, providing configuration references for subsequent steps.
Benchmark Template Initialization. At this stage, IoTBenchSL generates an initial benchmark template for each firmware. This template consists of two main parts: Benchmark Configuration Metadata and the Emulation Environment Bundle. IoTBenchSL provides a set of predefined emulation environment bundles, which include all the containerized resources, emulation tools, and configuration files needed for system-mode firmware emulation. In this stage, IoTBenchSL automatically populates the Benchmark Configuration Metadata and the initial Emulation Environment Bundle with firmware-specific information, such as firmware name, hash, manufacturer, model, and the emulation IP address and port obtained during testing. This results in an initial firmware benchmark template.
Firmware Benchmark Production. In this stage, IoTBenchSL provides comprehensive tools and scripts to assist security experts in creating the final firmware benchmark. In IoT environments, the firmware’s web management interface typically requires authentication, so fuzzing requests must include valid authentication information to access deeper functionalities. During fuzzing, authentication fields must be dynamically updated when mutating original seed messages to ensure that test cases pass authentication checks and successfully reach the target functional code. Therefore, the firmware benchmark must include automated login and authentication update scripts rather than relying solely on the fuzzer. IoTBenchSL provides auxiliary tools to help security experts generate and update authentication scripts within the emulation environment. It also automatically captures high-quality request messages from the firmware, which serve as seed data for fuzzing. Finally, the system packages the complete benchmark and uploads it to the Benchmark Validation Pipeline for automated validation and storage, ensuring its completeness and reproducibility. Details on the Benchmark Validation Pipeline are provided in Section 3.3.
3.2. Vulnerability Benchmark Production Workflow
The Vulnerability Benchmark Production Workflow is a standardized process within IoTBenchSL, tailored specifically for IoT firmware security testing. This workflow provides an efficient and structured method for creating vulnerability benchmarks, supported by automated tools and auxiliary features. The process takes as input a verified firmware benchmark set and public vulnerability databases and produces a standardized vulnerability benchmark set that facilitates subsequent testing and validation. The workflow comprises four key stages: Firmware Benchmark Association, Vulnerability Reproduction, Vulnerability Payload Development, and Vulnerability Benchmark Production.
Firmware Benchmark Association. In this stage, IoTBenchSL automatically links relevant vulnerability information from public vulnerability databases with specific firmware models and versions in the verified firmware benchmark set. The system identifies whether a corresponding proof-of-concept (PoC) file exists by analyzing the reference links and exploit tags in the vulnerability information, providing security experts with valuable references for vulnerability reproduction and payload development.
Vulnerability Reproduction. Based on the vulnerabilities selected in the previous stage and their associated verified firmware benchmarks, IoTBenchSL uses the emulation environment bundle within the firmware benchmark to initiate a firmware emulation environment that meets the target vulnerability’s reproduction requirements. The system preconfigures network settings and interface parameters to mirror actual device conditions, facilitating vulnerability reproduction in a controlled environment. Security experts then use relevant vulnerability reports and CVE information to confirm the vulnerability within the emulation environment, ensuring it can be triggered on the target firmware and establishing a reliable foundation for subsequent payload development.
Vulnerability Payload Development. In this stage, IoTBenchSL assists security experts in converting the vulnerability reproduction process into executable payloads and trigger templates. After confirming a vulnerability, IoTBenchSL provides templated development tools to guide experts in developing vulnerability payloads and trigger templates in a standardized YAML format. The system supplies predefined script interfaces, input-output specifications, and sample code, enabling experts to create payloads that meet testing requirements. The trigger template describes the exploitation steps, parameter configurations, and expected outcomes, ensuring that the generated payloads can be reused effectively in subsequent tests.
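For illustration, a payload script conforming to such an interface might look like the following sketch; the class name, method signatures, and target endpoint are hypothetical rather than IoTBenchSL's prescribed API.

```python
# Hypothetical payload interface that a vulnerability payload script might implement.
# Names, signatures, endpoint, and injected parameter are illustrative assumptions.
import socket


class CommandInjectionPayload:
    """Sends a crafted request that exercises the target vulnerability."""

    name = "example-command-injection"

    def build(self, auth: dict) -> bytes:
        # Assemble a raw HTTP request carrying the injected parameter.
        body = b"hostname=;reboot;"
        cookie = "; ".join(f"{k}={v}" for k, v in auth.get("cookies", {}).items())
        return (
            b"POST /goform/set_hostname HTTP/1.1\r\n"   # assumed vulnerable endpoint
            b"Host: 192.168.0.1\r\n"
            b"Cookie: " + cookie.encode() + b"\r\n"
            b"Content-Type: application/x-www-form-urlencoded\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"\r\n" + body
        )

    def send(self, host: str, port: int, auth: dict, timeout: float = 10.0) -> bytes:
        # Deliver the payload over a plain TCP socket and return the raw response.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(self.build(auth))
            return sock.recv(4096)
```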
Vulnerability Benchmark Production. This final stage dynamically generates structured vulnerability metadata, including critical information such as vulnerability type, severity, and impact scope. The system integrates vulnerability payloads, trigger templates, and associated firmware benchmarks to form a complete vulnerability benchmark. This benchmark set is standardized and highly portable, suitable for direct use in automated validation tasks, and provides a reliable vulnerability discovery testing resource.
3.3. Benchmark Validation Pipeline
The Benchmark Validation Pipeline automatically validates the pending firmware and vulnerability benchmarks generated in the previous stages. IoTBenchSL’s validation pipeline is built on a CI/CD mechanism, automatically triggering whenever new benchmarks are submitted to the benchmark repository. This pipeline includes three primary stages: Pipeline Management and Initialization, Initial Validation, and Benchmark Dynamic Validation.
Pipeline Management and Initialization. In this stage, IoTBenchSL uses version control to manage the submission and updating of firmware and vulnerability benchmarks. When a benchmark is submitted to the repository, the version control module automatically triggers the pipeline and creates validation tasks. The system employs a validation task scheduling module to allocate resources across registered runner virtual machines, and it initiates the containerized environment required for verification through an environment setup module, ensuring the independence and cleanliness of subsequent steps.
Initial Validation. At this stage, IoTBenchSL conducts a general validation of the pending firmware and vulnerability benchmarks and generates pipeline subtasks. In the General Validation module, IoTBenchSL first parses the structured benchmark configuration and vulnerability description metadata, performing a completeness check to identify any missing or erroneous information. Next, IoTBenchSL performs static checks on the remaining components of the firmware and vulnerability benchmarks to ensure consistency among the files, components, and descriptions. In the Test Jobs Generation module, IoTBenchSL ensures task independence by creating a separate subtask for each benchmark under verification and assigning it to an individual runner. The runner’s resources are released upon task completion, and the system automatically schedules the next pending tasks. Subtasks run in parallel, enhancing efficiency and optimizing resource use.
Benchmark Dynamic Validation. In this stage, IoTBenchSL performs specific test tasks for firmware and vulnerability benchmarks across different runners. The dynamic validation process consists of Firmware Benchmark Dynamic Validation and Vulnerability Benchmark Dynamic Validation, each designed to ensure that the benchmarks meet standards for reliability and functionality.
Firmware Benchmark Dynamic Validation. IoTBenchSL utilizes firmware validation plugins to validate the target firmware benchmark dynamically. The system uses the Emulation Environment Bundle within the firmware benchmark to launch the corresponding benchmark container on the runner’s container platform. Once operational, IoTBenchSL conducts Service Availability Testing and Request Authorization Testing on the emulated firmware.
In Service Availability Testing, IoTBenchSL confirms that the firmware’s management interface is accessible within the emulation environment. After initializing emulation, port forwarding tools map the emulated firmware’s management port to a specific container port, which is further mapped to a runner port. The verification tool on the runner then sends an initial seed message via socket to this port. The system verifies service availability based on whether an appropriate response is received.
Request Authorization Testing checks the functionality of the Automated Authentication Scripts. Unauthorized requests to the firmware management interface often yield errors or redirects, resulting in notable differences in response lengths between authorized and unauthorized requests. IoTBenchSL verifies the authentication script’s effectiveness by comparing these response lengths. First, the system sends an unauthorized request and records the response length. It then calls the automated login and authentication scripts to update the request with credentials, sends the updated request, and compares the response length to the initial one. A significant change indicates successful script functionality, while no change suggests failure.
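A condensed sketch of these two checks is shown below; the host, port, auth-refresh helper, and length-difference threshold are illustrative assumptions.

```python
# Condensed sketch of Service Availability Testing and Request Authorization
# Testing. Host, port, and threshold values are illustrative assumptions.
import socket


def send_raw(host: str, port: int, request: bytes, timeout: float = 10.0) -> bytes:
    """Send a raw request to the forwarded management port and collect the response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        try:
            while chunk := sock.recv(4096):
                chunks.append(chunk)
        except socket.timeout:
            pass
        return b"".join(chunks)


def service_available(host, port, seed_request):
    """Service Availability Testing: any response at all counts as alive."""
    try:
        return len(send_raw(host, port, seed_request)) > 0
    except OSError:
        return False


def auth_script_works(host, port, seed_request, refresh_request, auth, threshold=0.2):
    """Request Authorization Testing: compare response lengths before/after auth."""
    unauth_len = len(send_raw(host, port, seed_request))
    authed_len = len(send_raw(host, port, refresh_request(seed_request, auth)))
    # A significant relative change in length suggests the credentials took effect.
    return abs(authed_len - unauth_len) > threshold * max(unauth_len, 1)
```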
Vulnerability Benchmark Dynamic Validation. IoTBenchSL uses vulnerability validation plugins to verify target vulnerabilities for vulnerability benchmarks dynamically. The system initializes the relevant benchmark container on the runner, referencing the associated firmware benchmark within the vulnerability benchmark. Once operational, IoTBenchSL performs Request Permission Update and Vulnerability Trigger Verification on the emulated firmware.
In Request Permission Update, IoTBenchSL calls the Automated Authentication Scripts to dynamically update requests in the Vulnerability Payload, ensuring they bypass authentication checks and effectively trigger the vulnerability.
Vulnerability Trigger Verification involves parsing the Vulnerability Trigger Template to retrieve relevant vulnerability information, test steps, payloads, and expected conditions. IoTBenchSL follows the template-defined steps, sending the updated vulnerability payload to the target interface on the emulated firmware. This simulates an actual attack scenario to execute the vulnerability trigger. The system continuously monitors the emulation environment’s responses and behaviors, capturing abnormal events, such as service crashes, system reboots, or unusual outputs, to verify successful vulnerability activation.
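The verification logic can be sketched as follows, reusing the keys of the illustrative trigger template shown earlier; detecting crashes via a TCP liveness probe is an assumption about one possible monitoring strategy.

```python
# Sketch of Vulnerability Trigger Verification: replay the (auth-refreshed)
# payload per the trigger template, then probe the service to detect a crash.
import socket
import time


def probe_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the emulated service still accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verify_trigger(template: dict, host: str, port: int, payload: bytes,
                   send_raw, settle_seconds: float = 5.0) -> bool:
    """Send the payload and check whether the expected outcome occurs."""
    try:
        send_raw(host, port, payload)
    except OSError:
        pass  # the connection dropping mid-send is itself a strong crash signal
    time.sleep(settle_seconds)  # give the service time to crash or reboot

    expected = template["expected_outcome"][0]
    if expected["type"] == "service_unreachable":
        deadline = time.time() + expected.get("timeout_seconds", 30)
        while time.time() < deadline:
            if not probe_alive(host, port):
                return True      # vulnerability triggered: service went down
            time.sleep(1)
        return False
    # Other outcome types (abnormal output, reboot markers, ...) would go here.
    return False
```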
After successful validation, the system outputs the Verified Vulnerability Benchmarks, which, together with the Verified Firmware Benchmarks, form the core dataset for the IoTFuzzBench testing framework.
4. Implementation
To achieve modular design and high scalability, IoTBenchSL’s implementation is organized into distinct layers, breaking down the workflow into independent steps, each managed by individual scripts or functions. Python 3.10 and Bash 5.1 scripts primarily support the functionalities of each module, integrating a range of open-source tools and technologies. We selected the Firmware Analysis Toolkit (FAT) [41] as the core firmware emulation tool for IoTBenchSL. Based on Firmadyne [37] and QEMU [42], FAT is a system-mode emulation tool that efficiently emulates diverse firmware types. We leverage Docker [43] to manage and operate the emulation environments. To support this, we have defined a standardized set of Dockerfiles for FAT, which include the required environment configurations and emulation startup scripts within the containers. The following sections detail the implementation of each core module in IoTBenchSL.
4.1. Firmware Benchmark Production Workflow
In the Initial Emulation Testing module, we developed batch scripts enabling the system to automatically iterate through the firmware dataset, creating FAT containers to run emulation tests for each firmware sample. Upon successful emulation, the system logs results, including the assigned IP addresses and open ports for each firmware, which are used in the automated generation of benchmark configurations.
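A simplified sketch of such a batch script is shown below; the container image name, entry point, and emulation log format are hypothetical placeholders, not FAT's actual interface.

```python
# Sketch of batch initial emulation testing: iterate over the firmware dataset,
# run each image in a containerized FAT environment, and record whether it
# emulates plus the IP address it reports. Image name and log format are hypothetical.
import json
import pathlib
import re
import subprocess

FIRMWARE_DIR = pathlib.Path("firmware_dataset")
IP_PATTERN = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})")  # first IPv4 address in the emulation log

results = {}
for image in sorted(FIRMWARE_DIR.glob("*.bin")):
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", "--privileged",
             "-v", f"{image.resolve()}:/firmware/{image.name}:ro",
             "iotbenchsl/fat-runner",            # hypothetical FAT container image
             f"/firmware/{image.name}"],
            capture_output=True, text=True, timeout=1800,
        )
        match = IP_PATTERN.search(proc.stdout)
        ok = proc.returncode == 0 and match is not None
    except subprocess.TimeoutExpired:
        match, ok = None, False
    results[image.name] = {"emulated": ok, "ip": match.group(1) if match else None}

pathlib.Path("emulation_results.json").write_text(json.dumps(results, indent=2))
```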
The Benchmark Template Initialization module employs Python 3.10 scripts to integrate the emulation data with predefined emulation environment bundles, generating an initial benchmark template. This template is formatted as a YAML file to ensure standardized and compatible configurations. The system automatically populates the firmware’s essential information and emulation environment resources, establishing the foundation of the benchmark template.
In the Firmware Benchmark Production module, IoTBenchSL provides automated tools to assist security experts in creating the final firmware benchmark. We designed lightweight containerized emulation environment startup scripts that support one-click deployment and service forwarding, enhancing emulation environment usability. IoTBenchSL includes sample scripts for standard authentication methods, such as HTTP Basic Auth, Cookies, and Tokens, to simplify the creation of authentication scripts. Through the integrated Tcpdump [44] and Tshark [45] tools, the system captures and filters request messages with parameters, producing high-quality seed sets for fuzzing.
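The seed-filtering step might be driven by a Tshark invocation along the following lines; the capture path, display filter, and output handling are illustrative.

```python
# Sketch of filtering captured traffic for parameterized HTTP requests to use
# as fuzzing seeds. The pcap path and output layout are illustrative.
import subprocess

PCAP = "capture.pcap"

# Keep only HTTP requests that carry parameters (query string or request body).
tshark_cmd = [
    "tshark", "-r", PCAP,
    "-Y", 'http.request and (urlencoded-form or http.request.uri contains "?")',
    "-T", "fields",
    "-e", "http.request.method",
    "-e", "http.host",
    "-e", "http.request.uri",
]
output = subprocess.run(tshark_cmd, capture_output=True, text=True, check=True).stdout

for line in filter(None, output.splitlines()):
    method, host, uri = line.split("\t")
    print(f"candidate seed: {method} http://{host}{uri}")
# A production workflow would additionally export the full raw request bytes
# so each seed preserves its original headers and body.
```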
4.2. Vulnerability Benchmark Production Workflow
In the Firmware Benchmark Association stage, IoTBenchSL parses the metadata of verified firmware benchmarks to extract essential details, such as the firmware’s model and version. We developed Python scripts to search vulnerability databases, including the National Vulnerability Database (NVD) and the China National Vulnerability Database (CNVD), for vulnerabilities that match the firmware model and version. Using fields such as vulnerability descriptions and Common Platform Enumeration (CPE), relevant vulnerabilities are identified and associated with the corresponding firmware benchmarks.
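A sketch of such a lookup against the NVD REST API is shown below; the CPE string is hypothetical, and the field handling reflects our reading of the public API rather than IoTBenchSL's exact implementation.

```python
# Sketch of querying the NVD REST API (v2.0) for CVEs matching a firmware's
# CPE name and flagging references tagged as exploits/PoCs.
import requests

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def find_candidate_cves(cpe_name: str) -> list[dict]:
    resp = requests.get(NVD_API, params={"cpeName": cpe_name}, timeout=30)
    resp.raise_for_status()
    candidates = []
    for item in resp.json().get("vulnerabilities", []):
        cve = item["cve"]
        poc_links = [
            ref["url"]
            for ref in cve.get("references", [])
            if "Exploit" in ref.get("tags", [])
        ]
        candidates.append({"id": cve["id"], "poc_links": poc_links})
    return candidates


# Example: associate CVEs with a (hypothetical) router firmware benchmark.
for entry in find_candidate_cves("cpe:2.3:o:examplevendor:ex-1200_firmware:1.0.0:*:*:*:*:*:*:*"):
    print(entry["id"], "PoC refs:", entry["poc_links"])
```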
In the Vulnerability Reproduction stage, IoTBenchSL provides scripts for a one-click emulation environment setup, allowing security experts to launch the emulation environment for the associated firmware quickly. IoTBenchSL also leverages the “Exploit” tags in the References field of each NVD entry to automatically identify links to proof-of-concept (PoC) resources, which are presented to security experts for efficient verification of vulnerability presence and reproducibility.
During the Vulnerability Payload Development stage, we define a standardized YAML template to document the vulnerability exploitation steps, parameter configurations, and expected outcomes. IoTBenchSL provides a templated development environment for security experts, including predefined script interfaces and example code to facilitate the creation of vulnerability payloads and trigger templates.
In the Vulnerability Benchmark Production stage, we developed a Vulnerability Benchmark Packaging Tool that retrieves detailed information about the target vulnerability from NVD and automatically generates vulnerability description metadata. IoTBenchSL then integrates the vulnerability payloads and trigger templates created by security experts and the associated firmware benchmark references into a complete vulnerability benchmark. This finalized benchmark is submitted to the repository within the benchmark verification pipeline for subsequent automated validation.
4.3. Benchmark Validation Pipeline
In the Benchmark Validation Pipeline, we employ GitLab Pipelines [46] as the continuous integration and continuous deployment (CI/CD) framework and have configured a private GitLab server to automate the validation process.
During the Pipeline Management and Initialization stage, when a new firmware or vulnerability benchmark is submitted to the GitLab repository, GitLab Pipelines automatically triggers the corresponding validation pipeline. The pipeline’s stages, tasks, execution order, and test scripts are defined in a YAML file. Using GitLab Runner, we registered multiple virtual machine runners in the GitLab instance and selected the Shell executor to perform tasks within the pipeline. The Shell executor was chosen to avoid the nested virtualization issues associated with Docker-in-Docker, ensuring the stability of the firmware emulation process.
In the Initial Validation stage, we developed Python scripts to parse and validate the completeness and correctness of each benchmark’s metadata. These scripts verify required fields and ensure proper formatting. The system also performs static checks on each benchmark component to ensure completeness and consistency. For instance, it verifies the presence of configuration files in the Emulation Environment Bundle and checks the executability of Automated Authentication Scripts. We designed dynamically generated sub-pipelines within the main pipeline to test each firmware and vulnerability benchmark independently without interference. In the Test Jobs Generation module, the system dynamically creates independent child pipeline configuration files according to the number of benchmarks to be tested, saving them as artifacts. These child pipelines are then triggered by a designated job in the main pipeline, ensuring seamless execution of downstream pipelines and comprehensive validation of each firmware and vulnerability benchmark.
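The child-pipeline generation step can be sketched as follows; the job names, runner tags, and script paths are illustrative, not IoTBenchSL's actual CI configuration.

```python
# Sketch of the Test Jobs Generation step: emit one child-pipeline YAML per
# pending benchmark so each is validated in an isolated job. Job names and
# script paths are illustrative placeholders.
import pathlib

PENDING = ["fw_examplevendor_ex1200_v1.0.0", "fw_examplevendor_ex1300_v2.1.0"]
OUT = pathlib.Path("generated-pipelines")
OUT.mkdir(exist_ok=True)

CHILD_TEMPLATE = """\
validate-{name}:
  stage: dynamic-validation
  tags: [shell-runner]          # run on a registered shell-executor runner
  script:
    - python3 tools/validate_benchmark.py --benchmark {name}
"""

for name in PENDING:
    (OUT / f"{name}.yml").write_text(CHILD_TEMPLATE.format(name=name))
# The main pipeline saves generated-pipelines/ as an artifact and triggers each
# file as a child pipeline via GitLab's dynamic child pipeline mechanism.
```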
We implemented dedicated validation plugins for both firmware and vulnerability benchmarks in the Benchmark Dynamic Validation stage. These plugins can initiate Docker-based emulation environments on the runner and perform specific test tasks for each benchmark type. The system executes Service Availability Testing and Request Authorization Testing within the emulation environment for firmware benchmarks. For vulnerability benchmarks, it performs Request Permission Update and Vulnerability Trigger Verification. Upon completion of testing, the system automatically updates the test results in the GitLab repository, allowing security experts to review and analyze the findings.
5. Experiments and Results
We evaluate the effectiveness of the IoTBenchSL prototype by considering the following research questions:
RQ 1: Can IoTBenchSL create firmware and vulnerability benchmarks that are larger in scale and more diverse than those in existing open-source IoT fuzzing datasets?
RQ 2: How efficient is IoTBenchSL in benchmark production and validation?
RQ 3: Can the benchmarks generated by IoTBenchSL be seamlessly integrated into fuzzing frameworks to enable low-cost and effective practical fuzzer evaluation?
RQ 4: Can the benchmarks provided by IoTBenchSL enable large-scale, cost-effective discovery of unknown vulnerabilities?
To answer these questions, we design and conduct four different experiments.
5.1. Experiment Settings
We outline our experimental setup, detailing the hardware, fuzzers, and firmware targets utilized.
Hardware Configuration. All experiments were conducted on identical hardware configurations, specifically two Intel Xeon Silver 4314 CPUs @ 2.40 GHz (64 logical cores in total) with 256 GB of RAM, running 64-bit Ubuntu 22.04 LTS. For each fuzzer, we allocated one CPU core, 2 GB of RAM, and 1 GB of swap space. In each task across experimental groups, the fuzzer operates within an isolated Docker container, performing fuzzing tasks on target benchmark firmware, which also runs in an isolated Docker container. Benchmark containers are not shared between tasks to ensure independence.
Fuzzers. Our evaluation includes the following fuzzers: Snipuzz [17], T-Reqs [47], Mutiny [48], Fuzzotron [49], MSLFuzzer [50], and Boofuzz [51]. These fuzzers all support network-based fuzzing for IoT devices. Snipuzz and T-Reqs, open-sourced and presented at the top-tier security conference ACM CCS 2021, represent typical academic fuzzing tools. Mutiny, an open-source fuzzer by Cisco, and Fuzzotron, an actively maintained community fuzzer with ongoing updates within the past six months, exemplify open-source tools from industry. Mutiny and Fuzzotron have garnered over 500 stars on GitHub, underscoring their community impact. Boofuzz is a notable representative of generative fuzzing frameworks and is frequently used as a baseline in research studies. Finally, MSLFuzzer is a black-box fuzzer developed in our prior research. For a fair and unbiased evaluation among the fuzzers, MSLFuzzer is used solely for the unknown vulnerability discovery experiments.
Targets. To ensure the benchmarks created by IoTBenchSL reflect real-world conditions, all firmware images and vulnerabilities used in our experiments were drawn from authentic sources. We compiled three sets of actual firmware samples for the initial dataset, totaling 19,018 images. The first set consisted of 842 firmware images from the dataset provided by FirmAE [37]. The second included 13,596 images from the FirmSec dataset [40], and the third contained 4580 images publicly released by various manufacturers. Each target was a complete firmware image with an entire filesystem rather than an isolated component and was provided without accompanying documentation or emulation methods. Throughout the experiments, IoTBenchSL was applied to unpack and emulate these initial samples, filtering for successfully emulated targets. Additionally, we cross-referenced each sample with associated N-day vulnerabilities from the NVD and CNVD databases, performing a secondary selection on the firmware. This screened set of firmware was then used to create and validate benchmarks and to conduct further fuzzing experiments.
5.2. Large-Scale Benchmark Production
To address RQ 1, we systematically compared the benchmark datasets generated by IoTBenchSL with existing datasets using quantitative metrics and qualitative analysis. We aimed to produce as many firmware and vulnerability benchmarks as possible from an initial set of 19,018 firmware samples and compare them with existing open-source emulated firmware and vulnerability datasets. These initial firmware samples lack emulation methods and accompanying documentation, making them unsuitable for direct use in fuzzing evaluation. IoTBenchSL addresses this challenge by enabling automated emulation attempts, which enhances reproducibility and reduces human bias. This is a crucial first step in achieving comprehensive firmware and vulnerability benchmarking.
First, we employed IoTBenchSL to perform initial emulation tests on these firmware samples, successfully emulating 300 firmware images. Constructing emulation environments traditionally demands significant time and labor; IoTBenchSL streamlines this process by automating critical steps. From the successfully emulated images, we selected 100 representative firmware samples for benchmark production. The selection criteria included the representativeness of firmware manufacturers, the popularity of the firmware models, and the number of historical vulnerabilities associated with each firmware. Next, IoTBenchSL produced firmware benchmarks from these 100 selected firmware samples. After completing the firmware benchmarks, we identified the 100 most relevant vulnerabilities based on firmware models and versions from the NVD and CNVD databases. Leveraging IoTBenchSL, we reproduced these vulnerabilities and created corresponding vulnerability benchmarks.
Appendix A Table A1 provides a comprehensive list of all firmware benchmarks successfully created and validated using IoTBenchSL, while Appendix A Table A2 presents the complete list of vulnerabilities reproduced using IoTBenchSL and subsequently transformed into vulnerability benchmarks.
We collected and analyzed data for these firmware and vulnerability benchmarks during production and compared them with existing open-source datasets.
Table 2 compares the firmware and vulnerability benchmarks produced with IoTBenchSL and other datasets. It is important to note that the firmware and vulnerability counts in Table 2 represent the initial numbers provided by the corresponding papers’ open-source datasets and exclude newly discovered 0-day vulnerabilities.
The comparison demonstrates that IoTBenchSL significantly enhances the scale of firmware and vulnerability benchmarks, achieving 100 benchmarks in each category, representing a several-fold or even an order-of-magnitude increase compared to existing datasets. Furthermore, IoTBenchSL surpasses other emulated firmware datasets in terms of diversity, providing a broader range of firmware models and vulnerability types. The dataset also features a one-click startup for the emulated environment of each firmware benchmark and automated vulnerability validation scripts. These capabilities substantially reduce the time and effort required for researchers to set up emulation and reproduction environments and enable efficient and reproducible cross-evaluation of fuzzers on a unified dataset. All benchmarks listed in Appendix A Table A1 and Table A2 will be open-sourced in the IoTVulBench repository [52].
5.3. Efficiency in Benchmark Production and Validation
To address RQ 2, we designed a controlled comparative experiment to evaluate the efficiency improvements and resource consumption associated with IoTBenchSL in producing and verifying firmware and vulnerability benchmarks. To ensure the validity of this evaluation, we conducted a structured comparison between the traditional manual process and IoTBenchSL’s automated workflow. This comparison was based on quantifiable metrics, including execution time across different experimental groups during both the benchmark production and validation phases. By adopting this approach, we aimed to provide an objective assessment of IoTBenchSL’s advantages in terms of time efficiency and labor reduction, ensuring that our conclusions are both reproducible and reliable.
In our experimental design, we selected 20 representative firmware samples from the dataset generated by IoTBenchSL. These samples encompass a variety of vendors and common vulnerability types, such as memory corruption and command injection, ensuring diversity and representativeness. The experiment included two groups: Group A, which utilized IoTBenchSL for automated benchmark production and validation, and Group B, which followed the conventional manual workflow. Both groups consisted of researchers with equivalent expertise and professional experience in cybersecurity, ensuring standardized operations and comparability of results. Detailed records of time consumption and human involvement were maintained to ensure the completeness and accuracy of the collected data.
Figure 2 compares the total time consumption for benchmark production and validation in Group A and Group B. Group A, utilizing IoTBenchSL, completed the entire process in 50.9 h, whereas Group B, following the manual workflow, required 151.67 h. This corresponds to an approximately threefold improvement in overall efficiency. In the validation phase, Group A completed the process for all benchmarks in 3.33 h, whereas Group B required 31.33 h, a 9.4-fold improvement in validation efficiency. This indicates that IoTBenchSL can significantly shorten the time required for benchmark production and validation.
Figure 3 provides a more detailed comparison of time efficiency, showing the average time consumed per benchmark in both the production and validation phases. In the production phase, Group A spent an average of 1.28 h per firmware benchmark and 1.26 h per vulnerability benchmark, whereas Group B spent 4.25 h per firmware benchmark and 3.5 h per vulnerability benchmark. These results indicate a 3.3-fold improvement in firmware benchmark production efficiency and a 2.6-fold improvement in vulnerability benchmark production efficiency.
The experimental results demonstrate the significant advantage of IoTBenchSL in enhancing the efficiency of benchmark production and validation. The substantially reduced time required for benchmark generation and validation enables researchers to more effectively generate and verify large-scale datasets. By significantly reducing manual effort and time investment, IoTBenchSL allows researchers to focus more on experimentation and analysis, ultimately accelerating the overall research process and improving the reproducibility of results.
5.4. Evaluation of Vulnerability Discovery Capabilities Based on Benchmarks
To address RQ 3, we evaluated the seamless integration of IoTBenchSL-generated benchmarks into existing fuzzing frameworks and assessed their effectiveness in practical fuzzer evaluations. To ensure the validity of our approach, we systematically measured the performance of multiple fuzzers in detecting N-day vulnerabilities after integrating these benchmarks. The evaluation focused on two key aspects: (1) the completeness and efficiency of fuzzer evaluations when using IoTBenchSL-generated datasets, and (2) the differential performance of various fuzzers in the task of detecting N-day vulnerabilities. By conducting a structured comparison, we aimed to demonstrate that IoTBenchSL not only facilitates cost-effective and efficient fuzzer assessment but also enhances the reproducibility and reliability of vulnerability detection. These findings reinforce the practical value of IoTBenchSL in real-world security testing environments and its potential to standardize and improve fuzzer evaluation methodologies.
In our previous work, IoTFuzzBench, we integrated five protocol fuzzers, namely Mutiny, Snipuzz, Boofuzz, T-Reqs, and Fuzzotron, enabling large-scale parallel fuzzing evaluations. Building upon this, we integrated the datasets produced by IoTBenchSL into IoTFuzzBench, significantly reducing the implementation cost of fuzzing evaluations. This integration allows researchers to define custom fuzzing evaluation tasks through a simple configuration file, streamlining the process and eliminating the need for manual intervention. A single configuration file was used to define all participating fuzzers and the 100 selected vulnerabilities, enabling the system to execute 500 fuzzing experiments automatically. Each fuzzer performed its tasks using raw request seeds corresponding to the target vulnerabilities, without embedded exploit payloads, to systematically assess their vulnerability discovery capabilities on the IoTBenchSL dataset.
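To convey the idea, such a task definition might look roughly like the following; the keys shown are illustrative and do not reproduce IoTFuzzBench's actual configuration schema.

```python
# Illustrative shape of a large-scale evaluation configuration; keys are
# hypothetical and do not reproduce IoTFuzzBench's actual configuration format.
import yaml

EVALUATION_CONFIG = """
fuzzers: [mutiny, snipuzz, boofuzz, t-reqs, fuzzotron]
vulnerability_benchmarks: "benchmarks/vulnerabilities/*.yml"   # 100 benchmarks
seeds: raw                # raw request seeds, no embedded exploit payloads
"""

config = yaml.safe_load(EVALUATION_CONFIG)
# 5 fuzzers x 100 vulnerability benchmarks = 500 fuzzing experiments in total.
print(len(config["fuzzers"]) * 100, "experiments")
```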
Table 3 summarizes the performance of the fuzzers on the IoTBenchSL-generated vulnerability benchmarks. Among the 100 benchmarks, Mutiny and Fuzzotron identified the highest number of vulnerabilities, detecting 42 and 41, respectively. Conversely, Boofuzz demonstrated the weakest performance, uncovering only five vulnerabilities. Despite this, Boofuzz achieved the shortest average discovery time at 316.09 s, while T-Reqs required the longest time, averaging 8693.40 s. Regarding memory consumption, Fuzzotron exhibited the lowest usage, whereas Snipuzz consumed the most. Notably, the average memory consumption for all fuzzers remained below 250 MB, which is minimal compared to the tens or hundreds of gigabytes typically available on modern servers. Thus, memory consumption is not a bottleneck for current fuzzing tools in vulnerability detection. However, in resource-constrained environments where dozens or even hundreds of fuzzers may be deployed, low-memory consumption fuzzers such as Fuzzotron and Mutiny, averaging under 20 MB, are preferable.
These results underscore IoTBenchSL’s effectiveness in supporting diverse, large-scale fuzzer evaluation scenarios. By enabling systematic comparisons of fuzzer performance on a unified benchmark dataset, IoTBenchSL facilitates more consistent, reproducible, and insightful analyses in IoT fuzzing.
5.5. 0-Day Vulnerability Discovery Using Firmware Benchmarks
To address RQ 4, we assessed the potential of IoTBenchSL as a resource for large-scale zero-day vulnerability discovery by systematically conducting experiments on its standardized firmware dataset. Specifically, we aimed to evaluate its effectiveness in facilitating the identification of previously unknown vulnerabilities and its applicability to real-world vulnerability research. To ensure a rigorous assessment, we employed MSLFuzzer, a protocol fuzzer we previously developed, to test the web management interfaces of IoT devices. By applying MSLFuzzer to the IoTBenchSL-generated benchmarks, we sought to demonstrate the dataset’s practical utility in uncovering new security vulnerabilities and further validate its applicability in real-world vulnerability research.
For this study, MSLFuzzer was selected as the primary tool for 0-day vulnerability discovery. We integrated MSLFuzzer into the IoTFuzzBench platform, enabling large-scale 0-day vulnerability discovery experiments on the firmware dataset provided by IoTBenchSL. The IoTBenchSL dataset comprises 100 firmware benchmarks and 4935 raw seeds, offering broad manufacturer coverage and diverse sample representation. Leveraging IoTFuzzBench’s task isolation mechanism, we executed independent fuzzing experiments for each seed. Each experiment was conducted in a fully isolated environment for 24 h, ensuring reliable and consistent results and enabling a systematic approach to large-scale 0-day vulnerability discovery.
The experimental results identified 968 crash samples on IoTBenchSL’s firmware benchmarks. These crash samples underwent rigorous filtering, de-duplication, and comparison with historical vulnerabilities to assess their relevance to new and existing vulnerabilities. This systematic analysis confirmed 40 known vulnerabilities and identified 32 unknown vulnerabilities. Among these newly discovered vulnerabilities, 21 have been assigned CNVD and CVE identifiers, as shown in Table 4. All discovered unknown vulnerabilities were responsibly disclosed to the corresponding vendors through the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT) to facilitate prompt remediation efforts. These efforts assist vendors in implementing timely fixes and enhancing the security posture of their devices.
Additionally, after conducting a comparative analysis of different crash samples from our experiments, we found that some vulnerabilities not only affect the targeted firmware but also impact multiple firmware versions from the same vendor. For instance, CVE-2023-33538 was found to affect multiple TP-Link devices, including TL-WR940N, TL-WR841N, and TL-WR941ND. This finding highlights two key aspects: (1) the widespread code reuse among firmware from the same vendor, which exacerbates security risks, and (2) the effectiveness of IoTBenchSL’s benchmark datasets in facilitating unknown vulnerability discovery and impact analysis.
The experimental results demonstrate that the standardized firmware dataset provided by IoTBenchSL effectively supports large-scale automated 0-day vulnerability discovery, serving as a high-quality, ready-to-use research resource. Through its integration with the IoTFuzzBench platform, the IoTBenchSL dataset offers researchers a convenient environment for firmware fuzzing, accelerating progress in IoT firmware vulnerability discovery and cross-comparison experiments. For researchers aiming to deploy IoT firmware security studies and validate their findings rapidly, IoTBenchSL provides comprehensive sample data support and, through its synergy with IoTFuzzBench, highlights its critical value in enabling scalable and standardized vulnerability discovery experiments.
6. Discussion and Future Work
While IoTBenchSL provides a practical framework for constructing and verifying IoT firmware and vulnerability benchmarks, several opportunities remain to enhance its functionality and expand its applicability. As IoTBenchSL represents an initial step toward a standardized and automated approach to IoT fuzzing benchmarks, we would like to discuss current design limitations and explore directions for future improvements.
Expanding Firmware Emulation Methods. Currently, IoTBenchSL focuses primarily on system-mode firmware emulation, with limited support for user-mode emulation. This limitation arises because current user-mode emulation technologies are generally suited for emulating individual applications rather than offering a complete operating system and interface environment, limiting their applicability in IoTBenchSL. Consequently, many firmware samples fail to run smoothly without manual intervention and code adjustments to bypass crashes or restrictions. This reliance on expert intervention makes it challenging to scale user-mode emulation across large volumes of firmware samples. However, preliminary experiments with user-mode emulation, such as on the Tenda AC9 firmware, demonstrate that IoTBenchSL is adaptable to user-mode emulation, opening up possibilities for broader implementation. In future work, we aim to expand the quantity and diversity of user-mode firmware and explore automation methods to reduce time costs, enhancing the dataset’s diversity and applicability.
Broadening Vulnerability Types. Currently, IoTBenchSL’s vulnerability benchmarks focus on buffer overflow, command injection, and denial-of-service vulnerabilities. These vulnerability types are relatively straightforward to detect and assess using protocol fuzzing tools, making them suitable samples for evaluating fuzzer performance. To increase the dataset’s broad applicability, we plan to incorporate a more comprehensive range of vulnerabilities in future work. A wider variety of vulnerabilities will place higher demands on the detection capabilities of fuzzers after a vulnerability is triggered and require greater adaptability from scheduling frameworks like IoTFuzzBench. By adding modular vulnerability monitoring components, we aim to improve the scheduling framework’s compatibility with different vulnerability types, thereby expanding vulnerability coverage and reducing development costs for fuzzers.
Expanding Firmware Attack Surfaces. Currently, the firmware benchmarks in IoTBenchSL focus primarily on the web management interface as the primary attack surface for fuzzing. However, emulated firmware may contain other protocol interfaces, such as FTP, UPnP, and HNAP, which could expand the breadth and depth of fuzzing. In future work, we will explore ways to broaden support for these alternative attack surfaces within the benchmark dataset, creating a more comprehensive fuzzing framework and extending the applicability of fuzzers.
Automating Fuzzing Seed Generation. The current process for generating fuzzing seeds within firmware benchmarks depends on security experts manually interacting with emulated firmware to capture raw requests and using IoTBenchSL’s auxiliary tools to filter suitable fuzzing seeds. This manual process is time-intensive and limits interface coverage. To improve efficiency, we are considering introducing static analysis techniques to automatically generate raw requests as seeds by parsing static resources in the firmware and using reverse engineering to identify potential communication interfaces and parameters. This automated method would significantly enhance benchmark production efficiency and improve the coverage and effectiveness of fuzzing. This will be a primary focus in our next phase of research.
Deployment Scenarios. In practical applications, we anticipate that researchers will focus on two primary scenarios: (1) how to quickly deploy and utilize all firmware and vulnerability datasets provided by IoTBenchSL and perform fuzz testing experiments using their fuzzers, and (2) how to conduct large-scale fuzzing comparison experiments between different fuzzing tools using the datasets provided by IoTBenchSL. For Scenario 1, we provide all standardized firmware and vulnerability benchmarks, along with corresponding scaffolding tools, in the open-source repository. These tools enable researchers to rapidly set up emulation environments for any firmware. Researchers can then use their fuzzers to conduct fuzzing experiments on the selected firmware. For Scenario 2, the IoTBenchSL datasets are designed to be used in conjunction with the IoTFuzzBench framework, which we have also open-sourced. During development, we accounted for integration issues, ensuring that the dataset format provided by IoTBenchSL can be directly parsed and utilized by the IoTFuzzBench framework. Researchers need only deploy the IoTFuzzBench framework and can import the IoTBenchSL datasets as resource files with a single click. Additionally, the configuration files provided by IoTFuzzBench allow researchers to define the desired fuzzers and target firmware environments, enabling them to easily conduct multi-fuzzer comparative experiments with a single command. Continuous improvement and maintenance of these open-source repositories is one of our future goals.