### *3.2. Formal Definition*

To facilitate the description of the algorithm that follows, we first give some related concepts and formal definitions of the evaluation indices.

• Definition 1 PUT (Program Under Test)

We define the program under test as the PUT. For CVDF DYNAMIC, the PUT is a binary executable program; the corresponding test cases are described in Section 4.1.

• Definition 2 Set Covering Problem (SCP)

In practice, the number of execution paths of a PUT grows exponentially with the number of its branch conditions, so test cases cannot completely cover all execution paths. Therefore, in fuzzy testing, the problem of sample set coverage is transformed into the minimum set covering problem [36], which is NP-hard [37]. The simplest approach is a greedy algorithm that finds an approximate optimal solution. The SCP is formally defined as follows:

Let *A* = [*aij*] be a 0–1 matrix with *m* rows and *n* columns, and let *C* = (*Cj*) be an *n*-dimensional cost vector. Let *p* = {1, 2, ..., *m*} and *q* = {1, 2, ..., *n*} index the rows and columns of *A*, and let *Cj*, *j* ∈ *q*, denote the cost of column *j*. Without loss of generality, we assume *Cj* > 0 for all *j* ∈ *q*. Here, *aij* = 1 means that column *j* ∈ *q* covers row *i* ∈ *p*. The essence of the SCP is therefore to find a minimum-cost subset *S* ⊆ *q* such that every row *i* ∈ *p* is covered by at least one column *j* ∈ *S*. A natural mathematical model of the SCP is

$$v(SCP) = \min \sum_{j \in q} C_j x_j$$

subject to

$$\sum_{j \in q} a_{ij} x_j \geq 1, \quad i \in p, \qquad x_j \in \{0, 1\}, \quad j \in q,$$

where *xj* = 1 if *j* ∈ *S* and *xj* = 0 otherwise.
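The greedy idea mentioned above can be sketched as follows. This is our own illustrative code, not part of CVDF DYNAMIC: at each step it picks the column with the lowest cost per newly covered row until every row is covered.

```python
# Greedy approximation for the weighted minimum set covering problem.
def greedy_set_cover(rows, columns, cost):
    """rows: set of row indices; columns: dict j -> set of rows covered
    by column j; cost: dict j -> positive cost C_j. Returns the chosen
    subset S of column indices."""
    uncovered = set(rows)
    S = set()
    while uncovered:
        # Pick the column minimizing cost per newly covered row.
        best = min(
            (j for j in columns if j not in S and columns[j] & uncovered),
            key=lambda j: cost[j] / len(columns[j] & uncovered),
        )
        S.add(best)
        uncovered -= columns[best]
    return S
```

The greedy choice gives the classic ln(*m*)-factor approximation guarantee for the SCP.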

• Definition 3 Path Depth Detection Ability

In fuzzy testing, the PUT has many program execution paths that may contain vulnerabilities, so the generated fuzzy testing samples should cover as many of these paths as possible. For a given execution path, more than one vulnerability may be detected, and different execution paths can expose different numbers of vulnerabilities. We define the total number of vulnerabilities detected by the fuzzy testing samples on the current path as *DNUM*, the total number of vulnerabilities contained in the current path as *ANUM*, and the weight of the total number of vulnerabilities contained in the current path as *W*. The detection capability *DC* is a weighted result, computed as shown in Equation (1):

$$DC = \frac{D_{NUM}}{A_{NUM}} \times W \tag{1}$$

Here, *W* increases with the number of vulnerabilities in the current path, because different paths contain different numbers of vulnerabilities. For the same mutation method applied to the same fuzzy testing seed, a path containing more vulnerabilities yields a smaller ratio *DNUM*/*ANUM*. If the weight *W* were a constant, the *DC* value would then decrease, and the path depth detection ability of a test case generation method could not be measured objectively.

Suppose that a program under test has *n* execution paths. We define the average path depth detection ability as

$$WDC = \frac{1}{n} \sum_{i=1}^{n} DC_i$$

It measures the ability of a fuzzy testing tool to detect the overall path depth.
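The two indices can be computed directly from their definitions; the short sketch below (illustrative only, not the authors' code) makes the arithmetic explicit.

```python
def detection_capability(d_num, a_num, w):
    """DC = (D_NUM / A_NUM) * W, as in Equation (1)."""
    return d_num / a_num * w

def average_dc(dc_values):
    """WDC: the mean of the per-path DC values over n paths."""
    return sum(dc_values) / len(dc_values)
```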

### *3.3. CVDF DYNAMIC Fuzzy Testing Sample Generation*

The complete process of fuzzy testing sample generation of CVDF DYNAMIC is shown in Figure 1.


**Figure 1.** Complete flow chart of CVDF DYNAMIC fuzzy testing sample generation.

In the fuzzy testing part, we borrow the idea of ensemble learning from artificial intelligence. The seeds are first mutated by a genetic algorithm to generate one set of test cases and then mutated by the bi-LSTM neural network to generate another set. Finally, the two sets are integrated to obtain the final set of test cases.

Because the sample set obtained by integrating the two methods is too large, which reduces the efficiency of fuzzy testing, we use a heuristic genetic algorithm to reduce the sample set. The reduced sample set is then used for fuzzy testing, and the parameters of the bi-LSTM neural network are optimized according to the feedback of the results.

### 3.3.1. Theoretical Model and Training Process of the bi-LSTM Neural Network

The bi-LSTM neural network training process of CVDF DYNAMIC is shown in Figure 2.

**Figure 2.** Training of neural network.

(a) Preprocessing and Vectorization

We preprocess the training dataset, including unifying the input format of the test cases and changing the format of some binary executable programs, so that they can adapt to the input of the neural network without changing the logic function of the original program.

Then, we use the PTFuzz tool, which obtains the program execution path by using the Intel Processor Trace (Intel PT) module. PTFuzz improves on AFL by removing the dependence on program instrumentation: it uses PT to collect and filter packet information and finally obtains the execution path of the current seed from that packet information. To achieve this, the hardware environment must be based on an Intel CPU platform running an appropriate version of Linux. Since PTFuzz stores the program execution path information in data packets, in order to obtain execution path information that can be used to train the neural network, we need to decode the data packets in the corresponding memory and recover the complete program execution path according to the entry, exit and other relevant information of each packet. Algorithm 1, which extracts the program execution path, is described in pseudocode as follows:
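The original listing of Algorithm 1 was not recoverable here; the following Python sketch is a plausible reconstruction based only on the functions described in the accompanying text (JumpNextInstrument(), EndOfMemspace(), PackagePath, ExecutionPath and the +|= concatenation). The `memspace` object and its methods are stand-ins for PTFuzz internals, not a real API.

```python
def extract_execution_path(memspace):
    """Hedged reconstruction of Algorithm 1: walk the memory region
    holding the PT packets, decode each packet into a partial path and
    concatenate the pieces into the complete execution path."""
    execution_path = []                          # the ExecutionPath variable
    while not memspace.end_of_memspace():        # EndOfMemspace()
        package = memspace.read_package()        # next PT data packet
        package_path = package.decode()          # entry/exit info -> partial path
        execution_path += package_path           # the '+|=' concatenation
        if not memspace.jump_next_instrument():  # JumpNextInstrument()
            break
    return execution_path
```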


In the pseudocode, JumpNextInstrument() and EndOfMemspace() are two judgment functions, used to decide whether to jump to the next instruction address and whether the end of the memory space holding the PTFuzz packets has been reached, respectively. The ExecutionPath variable forms a complete program execution path by continuously appending the PackagePath variable after decoding. The +|= symbol denotes a concatenation operation.

After extracting the program execution path, we need to convert the path, which consists of instruction bytecodes, into vector form while preserving the semantic information of the original execution path as much as possible.

We use the word2vec tool, regarding a complete program execution path as a sentence and an instruction as a word. Specifically, we regard the hexadecimal code of an instruction as a token and then use word2vec to train on the corresponding bytecode sequences. To preserve as much context information of the program execution path as possible, we choose the Skip-Gram model in word2vec, because it often performs better on large corpora. The Skip-Gram model structure is shown in Figure 3.
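To make the Skip-Gram setup concrete, the sketch below (illustrative only; real training would use a word2vec implementation such as gensim) generates the (center, context) training pairs that Skip-Gram learns from, with instruction bytecodes as tokens.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within `window` positions, as the
    Skip-Gram model samples them from one execution path."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

For example, `skipgram_pairs(["55", "4889e5", "c3"], window=1)` pairs each instruction token with its immediate neighbors.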

Finally, we need to transform the output of word2vec into a fixed-length encoding that can serve as the input vector of the neural network. We set a maximum length, MaxLen. When the output length of word2vec is less than MaxLen, we pad the back end with zeros up to MaxLen. When the output length is greater than MaxLen, we truncate from the front end to limit the length to MaxLen.
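The pad-or-truncate rule above can be sketched directly (our own illustrative code):

```python
def to_fixed_length(seq, max_len, pad=0):
    """Pad with zeros at the back end, or truncate from the front end,
    so the encoded sequence has exactly max_len elements."""
    if len(seq) < max_len:
        return seq + [pad] * (max_len - len(seq))
    return seq[len(seq) - max_len:]  # keep the last max_len entries
```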

(b) BI-LSTM neural network structure and parameter optimization

The neural network structure we choose is bi-LSTM.

Bi-LSTM performs well on long-term dependency problems such as statement prediction and named entity recognition [38]. The statements associated with vulnerability characteristics may be far apart in the whole program execution path, so we need the bi-LSTM structure to retain long-term memory of information related to vulnerability characteristics. To make the bi-LSTM neural network suitable for fuzzy testing, we modify the corresponding rules of its input gate, output gate and forget gate. The specific structure of a single LSTM neuron and the specific rules of the input gate, output gate and forget gate are shown in Figure 4.
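For reference, the standard LSTM gate rules on which the modified gates in Figure 4 are based are (standard notation; the paper's modified rules may differ):

$$\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where *ft*, *it* and *ot* are the forget, input and output gates, *ct* is the cell state and ⊙ denotes element-wise multiplication.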

The number of hidden layers, the number of epochs, the batch size and other parameters of the bi-LSTM neural network affect its final performance. According to the experiments in Section 4.2, we set the number of hidden layers to 5, the batch size to 64 and the dropout rate to 0.4; we use the back-propagation through time (BPTT) algorithm to adjust the network weights and stochastic gradient descent (SGD) to prevent the model from falling into a local optimum. For the hyperparameters of the bi-LSTM neural network, we use a dichotomy (bisection) method to accelerate the selection of the corresponding values. Figure 5 shows the complete structure of the bi-LSTM neural network.
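One way to read the dichotomy-based hyperparameter selection is as a bisection-style search over a scalar hyperparameter, assuming the validation score is roughly unimodal in it. The sketch below is our own illustration under that assumption; `score` is a hypothetical validation metric, not part of CVDF DYNAMIC.

```python
def dichotomy_search(score, lo, hi, iters=20):
    """Ternary-style bisection: shrink [lo, hi] toward the maximizer of
    a unimodal score function, avoiding an exhaustive grid search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if score(m1) < score(m2):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return (lo + hi) / 2
```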

**Figure 3.** The basic structure diagram of the Skip-Gram model.

**Figure 4.** The specific structure of an LSTM neuron.

**Figure 5.** The complete structure of the bi-LSTM neural network.

As shown in Figure 5, the encoded input of length MaxLen passes through several bi-LSTM hidden layers to extract clearer context dependencies. The output of the last bi-LSTM hidden layer then passes through a feed-forward neural network layer and a sigmoid activation function. The sigmoid activation also normalizes the final output vector, which is the vector form of the fuzzy testing sample generated by the bi-LSTM neural network.

### 3.3.2. Genetic Algorithm for Constructing Test Cases

The core of the genetic algorithm used to construct samples can be divided into several parts: population initialization, tracking and executing the program under test, fitness calculation and individual selection, and crossover and mutation. The overall structure is shown in Figure 6.

**Figure 6.** General flow chart of generating test cases by the genetic algorithm.

(a) Population initialization

In a genetic algorithm, the population is composed of several individuals, each of which we abstract as a chromosome. Let the length of the chromosome be *Dlen*, the number of bytes of the test data. Then, the *i*th individual in the population can be expressed as *Xi* = (*xi*,1, *xi*,2, *xi*,3, ..., *xi*,*Dlen*). Population initialization assigns a value to each gene *xi*,*k* (1 ≤ *k* ≤ *Dlen*) in *Xi*. When initial test data are available, each byte of the initial test data is used to assign a value to *xi*,*k*; otherwise, the whole population is initialized by random assignment.
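The initialization step can be sketched as follows. This is illustrative code, not the authors' implementation; in particular, cycling the seed bytes when *Dlen* exceeds the seed length is our own assumption.

```python
import random

def init_population(pop_size, dlen, seed_bytes=None, rng=random):
    """Build pop_size chromosomes of dlen byte-valued genes. If seed
    bytes (initial test data) are given, they assign the genes
    (cycled if shorter than dlen); otherwise genes are random."""
    population = []
    for _ in range(pop_size):
        if seed_bytes is not None:
            chrom = [seed_bytes[k % len(seed_bytes)] for k in range(dlen)]
        else:
            chrom = [rng.randrange(256) for _ in range(dlen)]
        population.append(chrom)
    return population
```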

(b) Tracking and executing the program under test

Tracking is divided into two aspects:


Because each program can be divided into many basic blocks during execution, program execution is essentially a process of executing basic blocks and jumping between them.

Each basic block has exactly one entry and one exit, so within a basic block the program enters at the entry and leaves at the exit. Therefore, we can use the entry address *Inaddr* of a basic block to represent it. The program execution process can then be expressed as a sequence of basic blocks (*Inaddr*1, *Inaddr*2, ..., *Inaddrn*). We define the jump between basic blocks as *e* = (*Inaddrk*, *Inaddrk*+1), where 1 ≤ *k* ≤ *n* − 1.

Obviously, if every basic block is regarded as a vertex in a graph, then each *e* is an edge in the graph. Since a basic block may be executed multiple times in the execution sequence, the graph is directed. The execution path of the program can then be expressed as a sequence of edges *Ee* = (*e*1, *e*2, *e*3, ..., *en*−1).
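The edge-sequence construction, and the merging of repeated edges described next, can be sketched as follows (our own illustrative code):

```python
from collections import Counter

def edge_sequence(inaddrs):
    """Ee: consecutive pairs e_k = (Inaddr_k, Inaddr_{k+1}) taken from
    the basic-block entry-address sequence."""
    return [(inaddrs[k], inaddrs[k + 1]) for k in range(len(inaddrs) - 1)]

def edge_counts(inaddrs):
    """Merge identical edges, keeping their occurrence counts."""
    return Counter(edge_sequence(inaddrs))
```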

Because some basic blocks may be executed many times during program execution, some edges may appear many times. We merge identical edges to obtain a set of edges annotated with their occurrence counts, analyze the frequency statistics of this set, and divide it further into groups according to the occurrence counts: 1, 2–3, 4–7, 8–15, 16–31, 32–63, 64–127 and 128.

The significance of this classification is that different bits of a single byte can be used to represent the occurrence-count information, which improves the processing speed of the program. Finally, we obtain a new set of occurrence information *Fe* = (*f*1, *f*2, *f*3, ..., *fn*−1).

We apply the above processing to each basic block to get the final program execution path information.
