1. Introduction
With the rapid development of the Internet of Things (IoT) [1,2], software defects not only threaten financial stability and reputation but also pose severe risks to human safety. This is especially true in sensor-driven systems, where real-time decision-making is paramount. Ensuring software correctness and identifying vulnerabilities are critical in modern software development [3]. Symbolic execution is a key technique for software security verification, capable of detecting potential vulnerabilities and providing mathematical proof of correctness in formal verification. This is particularly essential in safety-critical and mission-critical systems, such as industrial automation [4] and autonomous vehicles [5], where software failures can lead to catastrophic consequences [6]. As sensor networks become more integral to intelligent systems, the role of symbolic execution in software validation and security assurance [7,8] continues to expand, ensuring that software interacting with real-world data remains robust, reliable, and secure.
However, symbolic execution still faces many challenges [9], with unknown function modeling being one of the most crucial problems. We have identified the following primary scenarios for this issue: (1) the absence of an executable environment, where certain device APIs used in production cannot be simulated in the testing environment; (2) the difficulty of symbolic tracing under memory isolation, caused by network operations (e.g., web services and database connections) that disrupt symbol tracking; and (3) reliance on third-party and system libraries whose source code is unavailable, leading to unpredictable outcomes from library functions, which hinders accurate modeling. We define “unknown functions” as APIs without accessible source code or those with indeterminable outcomes.
Predictably, symbolic execution is becoming increasingly constrained in the context of modern software. The development of IoT and edge computing has shifted the focus of traditional software security detection. Modern software systems are now integrated across diverse environments, such as autonomous vehicles, medical systems, and Web3.0 applications. For instance, Jodogne [10] used WebAssembly to render medical images, bridging the gap between web-based and desktop-based medical applications. Traditional security verification faces three primary challenges: the emergence of new assembly-language programs, new runtime system environments, and server communication interfaces in embedded systems. Addressing these issues requires response data from the system environment or server interface. However, rebuilding a system in a real environment for testing is highly inefficient because certain inputs are difficult to reproduce. Therefore, an emulator serves as a practical alternative.
In vulnerability verification, two general methods are employed for emulation: system simulation, which reproduces the exact interface and behaviors of the original system, and lightweight function simulation, which simulates interface behavior without replicating the entire system, offering a more streamlined solution. In general, function simulation is widely favored for its efficiency and simplicity. Given the complexity of application programming interfaces (APIs) in the real world, traditional algorithms categorize them, build frameworks, and manually develop functional code to simulate various interface behaviors as needed. This highlights the significant manual effort required for unknown function modeling, which depends heavily on the skill and efficiency of the developer.
To reduce this dependency, this paper proposes an automated approach based on machine learning to generate models for unknown functions. Our method minimizes human intervention by fine-tuning an autoregressive language model with 20 billion parameters while preserving the core principles of symbolic execution. The contributions of this paper are as follows:
Automated coding for vulnerability verification: Traditional unknown function modeling frameworks require extensive manual coding. Our approach leverages the automated coding capabilities of artificial intelligence to replace human effort, significantly improving the efficiency of vulnerability verification.
Reasoning with LLMs for formal verification: Unknown functions span various categories, including external system functions, network interfaces, and hardware interfaces. Traditional frameworks rely on manual information collection to model APIs. This paper is the first to utilize the reasoning capabilities of LLMs in formal verification tasks. By enhancing automation in software security verification, this work offers a novel perspective on integrating machine learning to address the challenges of modeling frameworks. It effectively resolves numerous issues previously encountered in real-world symbolic execution scenarios involving unknown functions.
Multi-language program support: Multilingual program analysis is a critical challenge, as traditional symbolic execution engines typically support only a single programming language. Our research explores the combination of WebAssembly with symbolic execution, providing a new reference point for multi-language program vulnerability verification. This includes support for languages such as C/C++, Rust, and Golang, improving the feasibility of symbolic execution in industrial applications of modern software.
The rest of this paper is arranged as follows. Section 2 reviews the related work. Section 3 introduces the preliminaries, while Section 4 provides an extensive discussion of the solutions developed to address the challenges encountered during the experimental phase. Section 5 documents the experimental process and presents the outcomes of automating the modeling of unknown functions within symbolic execution. Section 6 summarizes the significance of this work, providing a valuable reference for integrating LLMs into symbolic execution to improve the automation of vulnerability verification.
4. Methodology
In this section, we present the entire implementation process, highlighting its key steps. First, consider the framework as a whole: the process is divided into data preparation, model training, and model application. In the data preparation phase, our goal is to obtain a Q&A corpus; the simplest input is a function definition serving as the human part of a human–bot Q&A pair, while the ‘bot’ part consists of the Python implementation code for the function. To generate the corpus for the ‘human’ part, we first collect the API categories in Table 1. We use prompt text to have LLMs play the role of an interviewed programmer, following the prompt to implement the function; the LLM answers with the code implementation. In practice, this preparation is a complex process, and most of the data are synthesized by machines. For example, the official WebAssembly organization provides interface specifications and definitions for clarity, from which we synthesized a corpus of human–bot question–answer pairs. In this preparation process, we also make use of common LLMs to assist our work. In the model training phase, GPT-NeoX is modified to fit our hardware. Finally, the trained model is embedded in one of our previous works to verify the feasibility of the method. A detailed description is given in later sections.
We mainly use a pre-training approach to achieve the automatic modeling of unknown functions based on LLMs. To realize this within symbolic execution, our research involves several key components: the training and fine-tuning of a 20B LLM, the optimization of GPT-NeoX, the construction of an abstract layer for WebAssembly functions, and the application to blockchain smart contracts. The symbolic execution framework, including the unknown function API, the abstract API layer, path exploration, vulnerabilities, etc., is based on our previous research work FVPS [16]; automated function modeling now replaces the earlier manual approach. Our processing flow is shown in Figure 1.
Build training datasets: The first step in automating the modeling of unknown functions is to collect as many common APIs as possible from runtime environments. These APIs span various categories, including programming languages, standard libraries, third-party libraries, and business logic code. Programming languages typically have language-specific grammatical features. For instance, the ‘print’ function outputs formatted text to the console, but the exact name varies across languages, such as the ‘Println’ function of the ‘fmt’ package in Golang, ‘std::cout’ in C++, and the ‘println!’ macro in Rust. Although these functions serve similar purposes, they differ in their implementation for symbolic execution.
When compiling WebAssembly using different compilers, external runtime standard libraries are introduced, such as LLVM 8, Golang 1.11, Rust 1.x, Emscripten 1.37, Clang 9.0, and Node.js 12.x. Each library has a unique API, and while these APIs may appear to be syntactic sugar, symbolic execution must carefully consider the implementation of each one. Another example is the WebAssembly system interface (WASI) standard library, a set of APIs developed by a subgroup of the WebAssembly Community Group. While these APIs serve similar purposes across different runtime environments (e.g., Linux, browsers), they may have different names. In real-world program development, developers often depend on base frameworks, such as those found in blockchain smart contracts (e.g., Hyperledger). The Hyperledger libraries contain framework-specific APIs like ‘stub.GetFunctionAndParameters’, which are unknown to symbolic execution. System APIs such as ‘gettime’, ‘getdate’, ‘I/O stream’, ‘thread’, ‘getip’, ‘ioctl’, and ‘fork’ are also common in symbolic execution. We collect common APIs across a wide range of use cases, including text and file handling, configuration files, image reading and writing, threading, operating systems, sockets, HTTP, RESTful, web services, logs, databases, encryption, and user input, as summarized in Table 1.
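To make the collection step concrete, the sketch below shows one possible way such APIs could be organized by category. The category names follow Table 1, but the data format and entries are illustrative assumptions, not the paper's actual corpus.

```python
# Hypothetical registry for the API collection step: each Table 1 category
# maps to (namespace, function name, signature) entries gathered from runtimes.
API_CATEGORIES = {
    "file": [("wasi_unstable", "fd_write", "(i32, i32, i32, i32) -> i32")],
    "time": [("env", "gettime", "() -> i64")],
    "blockchain": [("hyperledger", "stub.GetFunctionAndParameters",
                    "() -> (string, []string)")],
}
```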
To model unknown functions effectively, we design corresponding prompt templates to generate Python function code using LLMs. The main goal of this process is to create comprehensive function models that account for exceptions, special values, random values, etc., rather than reproducing the actual implementation of a function. For example, when modeling an encryption function, it is critical to consider special return values that might be used in concrete conditions. For coding tasks, collecting data manually is too costly; instead, our primary approach is to obtain datasets through feedback mechanisms executed by LLMs. Firstly, Figure 2 shows one of our prompt templates for generating coding tasks according to a program topic from Table 1. Secondly, part of the coding task problems, such as those for third-party libraries, are supplemented manually. Building upon this foundation, we utilize LLMs to rephrase synonyms, thereby expanding the pool of questions.
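As an illustration (paraphrased, not the exact template of Figure 2), such a task-generation prompt might look like the following; the field names are assumptions:

```python
# Illustrative task-generation prompt; 'topic' comes from a Table 1 category.
TASK_PROMPT = """You are an interviewed programmer.
Topic: {topic}
Propose a coding task that requires implementing one Python function for this
topic. Specify the function name, parameters, return values, and the error
and edge cases the implementation must cover."""

task_prompt = TASK_PROMPT.format(topic="HTTP and RESTful APIs")
```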
Unknown function modeling prioritizes capturing the full range of potential return values, distinguishing it from standard function implementation. We employ LLMs to simulate the role of an interviewer using template questions, extending these functions to include exception handling, error resolution, performance optimization, coding parameters, and more. This approach yields a more complete model of function returns. The same logical requirement produces many coding tasks with variations in parameter forms or return forms. A portion of these coding tasks is illustrated in Figure 3.
In order to build a rich and flexible training dataset, the next step involves generating concrete Python code from these coding tasks. Since code implementations tend to be fixed, we parameterize the generated function code into templates. This parameterization enables the GPT-NeoX model to learn the underlying patterns across a diverse set of codes. For instance, consider the HTTP status response from a server via a POST API. Under normal conditions, the server returns a 200 status code, as shown in Equation (1):

$$f_{\mathrm{POST}}(x) = 200, \quad x \in S, \tag{1}$$

where $S$ denotes the set of possible situations. However, status codes such as 404, 503, and others must also be returned under specific conditions, each with a defined probability. Therefore, the ‘POST’ function generates additional function codes to model these status codes based on different odds. Equation (2) illustrates the parameter generalization formula:

$$f_{\mathrm{POST}}(x) = s_k \ \text{with probability}\ p_k, \quad s_k \in \{200, 404, 503, \ldots\}, \quad \sum_k p_k = 1. \tag{2}$$
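A minimal Python sketch of such a parameterized model follows; the status codes and weights here are illustrative placeholders, not values from our templates:

```python
import random

# Sketch of the generalized 'POST' model of Equation (2).
STATUS_ODDS = {200: 0.90, 404: 0.05, 503: 0.03, 500: 0.02}

def post(url, payload):
    """Model of an unknown 'POST' API: instead of performing real network
    I/O, return a status code drawn from the defined distribution."""
    codes, weights = zip(*STATUS_ODDS.items())
    return random.choices(codes, weights=weights, k=1)[0]
```

In a symbolic setting, the same template can instead return a symbolic integer constrained to the set {200, 404, 503, ...} so that every status branch remains explorable.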
The coding tasks in Figure 3 generate code implementations. After parameter generalization, as shown in Equation (2), we create a variety of training corpora. After removing duplicates, we compile a total of 50,000 original questions. These questions are then processed using the coding LLMs, which generate answers, yielding approximately 45,000 data items after filtering. To enhance the dataset, we employ synonyms to expand the variety of function names, enabling the model to adapt to various environmental contexts during function modeling. Examples include variations such as ‘fd_write’ in LLVM v8, ‘shim.Error/shim.Success’ in Hyperledger 1.4, and ‘console.log/console.debug/console.info’ in JavaScript ES5. The correspondence between function declarations and their respective implementations forms the basis of our training corpus. In addition, the WebAssembly official organization provides interface specifications and definitions, clarifying that developers are responsible for the specific implementations. Hence, coding LLMs are used to generate these implementations as a training corpus, as shown in Figure 4.
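For illustration, one resulting human–bot pair might be stored as follows; the schema is an assumption, not the paper's exact corpus format:

```python
# Illustrative human-bot training pair: the 'human' side carries the function
# declaration, the 'bot' side the generated Python model wrapped in '<code>'.
training_pair = {
    "human": "Implement: fd_write(fd: i32, iovs: i32, iovs_len: i32, "
             "nwritten: i32) -> i32",
    "bot": "<code>def fd_write(fd, iovs, iovs_len, nwritten):\n"
           "    # model: report success without real I/O\n"
           "    return 0</code>",
}
```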
In conclusion, we prepared four different levels of training instruction templates to generate the training datasets. In general, the function declarations of system APIs and common SDKs already contain rich information for building functional logic code, although there are differences across operating systems and programming languages; our experiments show that the GPT-NeoX model can fully accommodate such differences. Therefore, our experiment prepares the end-to-end training corpus according to the declaration of the functional model, reducing the workload of manual intervention as much as possible. The second training instruction template is based on the function description and function declaration. Most unknown functions are difficult to process uniformly, mainly because the data exchange and communication interfaces between information systems are self-defined; such self-defined APIs usually have interface documents describing their inputs and outputs. The third training instruction template is based on the user’s special requirements and function declarations. Taking system time as an example, if there are no special requirements, the model outputs the current system time. If the software logic depends on a past or future time, the current system time cannot trigger the corresponding logical conditions, resulting in incomplete coverage of software vulnerability detection. Specific requirements are therefore added to the function declaration, such as requiring the system time to return values within a certain range. The fourth training instruction template combines requirements, comments, and declarations. The training instruction templates are shown in Table 2.
Pretrained model: GPT-NeoX-20B serves as the base model, and a simple question–answer prompt template for constructing the unknown function model is designed to train the LLM. The original training architecture of GPT-NeoX-20B requires 96 GPUs, each equipped with 80 gigabytes of memory. Due to hardware constraints, our existing equipment cannot support the continuous training of this model, so our study reduces the training architecture to 9 GPUs with 48 gigabytes of memory each. Through pipeline parallelism, the 44 Transformer block layers of the pre-trained model are split across the 9 GPUs. Since 44 is not evenly divisible by 9, each of the first eight GPUs loads five block layers, while the ninth GPU (GPU8) loads only the last four; the input layer and output layer are loaded onto the first GPU (GPU0) and the last GPU (GPU8), respectively, to construct the topology of the entire trained model, as shown in Figure 5.
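A minimal sketch of this partitioning scheme (the mapping logic is ours, inferred from the description above):

```python
# 44 Transformer blocks over 9 pipeline stages: five per GPU, four on the last.
NUM_LAYERS, NUM_GPUS, PER_GPU = 44, 9, 5

def gpu_for_layer(layer):
    """Map a Transformer block index (0-43) to its pipeline stage (0-8)."""
    return min(layer // PER_GPU, NUM_GPUS - 1)

assignment = {g: [l for l in range(NUM_LAYERS) if gpu_for_layer(l) == g]
              for g in range(NUM_GPUS)}
# GPU0 additionally hosts the input embedding, GPU8 the output layer.
assert len(assignment[0]) == 5 and len(assignment[8]) == 4
```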
The prompt template consists of the bootstrap, modeling requirements, function comments, function declarations, and the question and answer, as shown in Figure 6. In the prompt template, the API declaration section is the most important and is mandatory; it is the minimum requirement for automation. The requirements section comes from the content reference enumeration or the API interface description document. The other sections consist of fixed words to bootstrap NLP inference. To allow Python code to be extracted from LLM responses, the generated source code is enclosed within the special token ‘<code>’.
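Extraction can then be a simple delimiter match; the sketch below assumes the code is closed with a matching ‘</code>’ token:

```python
import re

def extract_code(response: str) -> str:
    """Pull the generated Python body out of an LLM response, assuming the
    source is delimited by '<code>' ... '</code>' tokens."""
    match = re.search(r"<code>(.*?)</code>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else ""
```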
Transformer: To implicitly learn the positional information of sequences, the Transformer network introduces a positional encoding (PE) methodology, the 2D PE tensor, as shown in Equation (3):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{3}$$

This positional method combines the position index and the dimension index, making each tensor element's position independent. Here, ‘pos’ denotes the position of the word in the sentence, ‘d’ denotes the dimension of the PE, ‘2i’ denotes the even dimensions, and ‘2i+1’ denotes the odd dimensions (0 ≤ i < d/2). This position encoding method can adapt to sentences that are longer than all the sentences in the training sets. Additionally, for computational convenience, absolute positional encoding is employed, meaning that each position in the sequence is assigned a fixed position vector. The final input representation of each word is derived by combining its word vector and position vector, while the dense matrix is obtained through the compression of the embedding matrix.
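For reference, a small NumPy sketch of Equation (3):

```python
import numpy as np

def positional_encoding(max_len: int, d: int) -> np.ndarray:
    """Sinusoidal PE of Equation (3): rows are positions, columns dimensions."""
    pos = np.arange(max_len)[:, None]          # word position in the sentence
    i = np.arange(0, d, 2)[None, :]            # paired dimension indices
    angle = pos / np.power(10000.0, i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe
```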
The attention score calculation in the Transformer network is shown in Equation (4):

$$A_{i,j} = \big((x_i + p_i)\,W_Q\big)\,\big((x_j + p_j)\,W_K\big)^{\top} \tag{4}$$

Here, $x_i$ and $x_j$ are the word embeddings for the words at positions i and j, and $p_i$ and $p_j$ are the position vectors for positions i and j. When $p_i$ and $p_j$ are involved in computing the attention scores for the words at positions i and j, the cross terms between $p_i$ and $p_j$ essentially carry the relative positional information between words, so the final positional information becomes unpredictable under the training of the projection matrices $W_Q$ and $W_K$.
Nevertheless, programming languages are highly structured and governed by strict logical rules, yet their expressive diversity allows the same program logic to be shaped by different coding methods. When writing a program, developers can adopt various coding styles and methodologies based on their requirements and personal preferences. For instance, some programming languages permit multiple syntax rules and idiomatic expressions to convey the same logic, giving developers the flexibility to choose the most suitable approach for a given context and set of requirements. Therefore, during model training, the ‘RoPE’ positional encoding, which is better suited to processing programming languages, is applied as shown in Equation (6).
‘RoPE’ encodes relative positions through absolute position encoding, preserving both positional information and the relationships between relative positions. For given positions i and j with relative distance d = j − i, the relative positional encoding can be calculated using Equation (5):

$$R_{i,j}[k] = R_{d}[k], \quad d = j - i \tag{5}$$

where $R_{i,j}[k]$ represents the value of the k-th dimension of the relative positional encoding between positions i and j, and $R_{d}[k]$ represents the value of the k-th dimension corresponding to the pair of positions with a distance of d in the relative positional encoding matrix. ‘RoPE’ integrates relative positional encoding with absolute positional encoding to generate a comprehensive positional representation for each position. Specifically, the relative positional encoding is incorporated into the word embeddings, resulting in the final positional encoding, as shown in Equation (6):

$$\tilde{x}_m = R_{\Theta,m}\,x_m, \qquad \langle R_{\Theta,m}\,q,\; R_{\Theta,n}\,k\rangle = \langle R_{\Theta,n-m}\,q,\; k\rangle \tag{6}$$

where $R_{\Theta,m}$ is the rotation matrix encoding absolute position m. When computing attention scores, the Transformer network thereby incorporates both absolute and relative positional information, enabling it to effectively capture the relative relationships between different positions in a sequence. This enhances performance and efficiency, particularly when processing long sequences in programming languages.
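A compact NumPy sketch of the rotary encoding and its distance-only property:

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate each dimension pair of a query/key vector by an angle
    proportional to its absolute position (Equation (6))."""
    d = x.shape[-1]
    theta = 1.0 / np.power(10000.0, np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# The attention score depends only on the relative distance between positions:
q, k = np.random.randn(8), np.random.randn(8)
assert np.allclose(rope(q, 5) @ rope(k, 3), rope(q, 7) @ rope(k, 5))
```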
Symbolic execution engine: Manticore is a symbolic execution tool written in Python, based on the SMT2 specification and supporting the Z3, Yices, and CVC4 solvers. To date, Manticore is one of the few WebAssembly symbolic execution engines that implements basic symbolic execution of the WebAssembly module, the minimum unit of operation. A common WebAssembly program is illustrated in Figure 7. It is clear that different compilations import various external functions, even if these functions have the same name or functionality and belong to the same environment namespaces. Take Golang as an example: when compiling to WebAssembly programs, ‘syscall/js’ is automatically included, which forces the runtime either to be a browser (Firefox, Chrome, etc.) or at least to pretend to be one. The APIs in packages such as go, env, wasi_unstable, and wasi_snapshot_preview1 are unknown functions. Therefore, the APIs in the import section of WebAssembly reveal that vulnerability verification has to set up a corresponding runtime environment; otherwise, vulnerability verification cannot be performed.
When WebAssembly programs are loaded by the symbolic execution engine, our approach reads the unknown functions from the function import section. Based on the namespace, function names, and contextual function call requirements, our model generates Python code. For instance, the imported function in Figure 8 is well known as a system call, presenting a typical challenge in symbolic execution. The symbolic execution engine then invokes the abstract function layer backed by our Python function instead of the real interface. This approach allows us to support verification across different compilers, which was previously impossible without writing a verification system interface for each specific runtime environment.
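The loading step can be pictured as follows; the module, LLM, and helper names are hypothetical stand-ins for the engine's actual interfaces (extract_code is the helper sketched earlier):

```python
def model_imports(wasm_module, llm):
    """For each unknown import, ask the trained model for a Python stub and
    bind it so the engine calls the stub instead of the real interface."""
    stubs = {}
    for imp in wasm_module.imports:              # e.g., ("wasi_unstable", "fd_write")
        prompt = build_prompt(imp.namespace, imp.name, imp.signature)
        source = extract_code(llm.generate(prompt))   # strip '<code>' delimiters
        scope = {}
        exec(source, scope)                      # materialize the generated function
        stubs[(imp.namespace, imp.name)] = scope[imp.name]
    return stubs
```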
Memory: In our experimental task, our approach must deal with the memory swap problem of symbolic execution. WebAssembly defines a linear, contiguous memory model, allowing WebAssembly modules to utilize a contiguous block of memory. This memory space can be allocated by the host environment and accessed through the memory object of the WebAssembly instance (WebAssembly.Memory). WebAssembly modules can share memory with the host environment; for example, in web browsers, WebAssembly modules can share memory space with JavaScript code, enabling efficient data exchange. This shared memory mechanism is facilitated through the memory object of the WebAssembly instance. Memory access in WebAssembly is subject to strict bounds and type checking: accesses must fall within the linear memory range, and out-of-bounds accesses result in runtime errors. This design helps to ensure that memory accesses in WebAssembly modules are safe, thereby enhancing the system’s security and stability. The architecture of the symbolic execution engine reveals the internal relationship between the WebAssembly and Python environments, as shown in Figure 9.
In other words, WebAssembly’s memory isolation prevents external environments, including our Python programs, from directly accessing the memory of an unknown function model. WebAssembly instead exposes the mem-related methods of the symbolic state, through which it interacts with the outside world. For example, when our model implements the ‘wasi_unstable’ function illustrated in Figure 8, the parameter ‘text’ of the ‘fd_write’ function actually resides in the ‘mem’ object of the symbolic state. Our approach therefore implements a simple symbolic execution memory swap between the WebAssembly and Python environments, as illustrated in Figure 10.
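A sketch of such a model is shown below; the `state.mem` accessor names are assumptions standing in for the engine's symbolic memory API, while the iovec layout (pointer and length pairs, 8 bytes each) follows the WASI convention:

```python
def fd_write(state, fd, iovs, iovs_len, nwritten):
    """Model of 'wasi_unstable.fd_write': the buffers live in WebAssembly
    linear memory, so they are fetched through the symbolic state."""
    written = 0
    for n in range(iovs_len):
        buf_ptr = state.mem.read_i32(iovs + 8 * n)        # iovec.buf
        buf_len = state.mem.read_i32(iovs + 8 * n + 4)    # iovec.buf_len
        data = state.mem.read_bytes(buf_ptr, buf_len)     # possibly symbolic
        written += buf_len
    state.mem.write_i32(nwritten, written)  # swap the result back into memory
    return 0                                # WASI errno 0: success
```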
5. Experiment and Results
As is well known, blockchain is a decentralized distributed technology, involving remote procedure calls (RPCs), distributed features, blockchain frameworks, runtime environments, etc. This is a very challenging vulnerability verification environment for symbolic execution. To justify the effectiveness of the LLM-based automated modeling methodology, this section applies the proposed methodology to smart contracts on the Hyperledger Fabric blockchain, implementing an automated WebAssembly vulnerability verification platform.
Next, we briefly introduce our model training experiment. The training script is executed on a system with Nvidia RTX 8000 GPUs with 48 GB of memory each (NVIDIA, Santa Clara, CA, USA), dual Intel E2640 CPUs (Intel, Santa Clara, CA, USA), Ubuntu 22, and a 7 TB SSD. The hyperparameter configuration includes a sequence input length of 2048, a dictionary token size of 50,432, a pipeline parallelization size of 9, an embedding size of 6144, 44 Transformer block layers, a micro-batch size of 18, a gradient accumulation step of 4, and a warm-up step count of 128. Our model training involves three stages. In the first stage, to enhance the generalization of the model, we use a unified language learning recovery training method that substantially improves existing language models and their scaling curves with a relatively small amount of extra computation. When, as demonstrated in Figure 11a, the loss value reaches around 1.78 and the accuracy 61.92%, the model moves on to the next training stage. The second stage involves training one sample at a time with left padding. The loss calculation for each step employs a full-context approach, allowing the model to gain a comprehensive understanding of semantics. As illustrated in Figure 11b, the training accuracy stabilizes at approximately 80%. During the final stage of training, we exclusively compute the loss for the response context to enhance the final accuracy of the generated outcomes. The validation accuracy is shown in Figure 12. The whole training process is monitored with the W&B tool; the stages show multiple training sessions in different colors, corresponding to checkpoints saved to test the actual effect of the model.
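Collected as a configuration sketch (GPT-NeoX is configured via YAML files; this Python dict merely restates the values above for reference):

```python
# Hyperparameters from the training run, restated for reference.
train_config = {
    "seq_length": 2048,
    "vocab_size": 50432,
    "pipe_parallel_size": 9,
    "hidden_size": 6144,
    "num_layers": 44,
    "micro_batch_size": 18,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 128,
}
```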
Ultimately, the trained model is saved in PyTorch (2.6) format with FP16 precision, occupying approximately 44 gigabytes of disk space. We harness a 48-gigabyte GPU card to carry out the inference task, applied in a railway–port–aviation blockchain transportation system. In this system, our verification work faces diverse scenarios of unknown functions. The smart contract is developed in Golang, based on the Hyperledger Fabric blockchain framework. The verification of smart contracts requires a mass of unknown function modeling, including the WebAssembly standard interface, the Hyperledger blockchain interface, the peer-to-peer communication interface, etc. Listing 1 shows a smart contract for a purchase order in railway transportation based on the Hyperledger blockchain.
The security verification of smart contracts is complex due to the decentralized autonomous organization features of blockchain techniques. As the ‘buyOrder’ source code shows, the remaining blockchain components are all interfaces, involving data storage like ‘PutState’, communication protocols like ‘Response’, framework APIs like ‘shim.Error’, data protocols like ‘marshal’, etc. In our previous research, we proposed an optimization methodology based on an abstract proxy layer for smart contract formal verification in the Golang engine framework. In this experiment, our automated modeling addresses the gap in that research, which required human intervention in the abstract layer code. For instance, the previous security verification work necessitated manually implementing a ‘JSONProxy’ class in place of the ‘json.Marshal’ method. This abstract-layer programming work is now finished by the automated modeling method; the code in Listing 2 shows the modeling function generated in Python.
In addition, our formal verification platform based on the automated modeling methodology successfully generates the remaining interfaces, such as logging, peer-to-peer responses, stub interfaces, etc. To justify the feasibility of our approach, we prepare program source code in various languages, including C/C++, Rust, and Golang; user submissions are temporarily stored in the upload directory via the web page illustrated in Figure 13 and Figure 14a.
Listing 1. A smart contract of a railway transportation purchase order based on Hyperledger blockchain frameworks.
Listing 2. The JSON marshal function implementation generated by automated modeling with Python.
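As an indication of what such generated code looks like, the following is a sketch mirroring Golang's (value, error) return convention, not the verbatim content of Listing 2:

```python
import json

def json_marshal(obj):
    """Model of Golang's 'json.Marshal' for the abstract proxy layer."""
    try:
        return json.dumps(obj).encode("utf-8"), None   # success: (bytes, nil)
    except (TypeError, ValueError) as err:
        return None, str(err)   # error branch kept so paths remain explorable
```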
Our previous research compiles them directly into WebAssembly programs, as shown in Figure 14b. Next, the symbolic execution engine loads the WebAssembly instructions byte by byte. The import section of the WebAssembly program is recognized by the LLM-based automated modeling, and the corresponding import functions are implemented in Python. Subsequently, the symbolic execution engine starts its security verification explorations; when it encounters unknown functions from the import section of WebAssembly, generated Python function code such as ‘json_marshal’ supports the upper-level calls. The experimental results in Figure 15 demonstrate the efficiency and automation of our approach.