(f) Elite ratio

The elite ratio determines the fraction of individuals with the highest fitness in the current population that do not participate in crossover and mutation operations; instead, after those operations are completed, they directly replace the individuals with the lowest fitness in the resulting population.

After experimentation and model optimization, the final elite ratio is set to 0.06.

(g) Stopping Criteria

The genetic algorithm iterates through several rounds of evolution until it reaches the desired result or hits the iteration threshold. For the heuristic genetic algorithm, the iteration threshold is set to 25.
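The elitism scheme and stopping criterion described above can be sketched as follows. This is only an illustrative skeleton: the `fitness` and `crossover_and_mutate` callables are hypothetical stand-ins for the paper's actual fitness evaluation and variation operators, which are not reproduced here.

```python
ELITE_RATIO = 0.06     # elite ratio reported in the text
MAX_ITERATIONS = 25    # iteration threshold reported in the text

def evolve(population, fitness, crossover_and_mutate):
    """Run the elitist GA loop: elites skip variation and replace the
    weakest offspring each generation."""
    for _ in range(MAX_ITERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        n_elite = max(1, int(len(ranked) * ELITE_RATIO))
        elites = ranked[:n_elite]                 # preserved unchanged
        offspring = crossover_and_mutate(ranked)  # variation pass
        offspring.sort(key=fitness, reverse=True)
        # drop the lowest-fitness offspring and reinsert the elites
        population = offspring[:len(ranked) - n_elite] + elites
    return max(population, key=fitness)
```
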

### **4. Experiment and Evaluation**

### *4.1. Data Sources*

In the training phase of the neural network, a large number of training samples are needed so that the time series neural network can effectively capture the corresponding kinds of vulnerability characteristics from the training set. Therefore, we first collect a large amount of vulnerability information from the CVE and CNNVD national security vulnerability databases and then screen out the vulnerability information that is clearly suitable for neural network training. Next, we select the corresponding binary executable programs and test cases from GitHub [40] and the SARD [41] dataset and obtain a small number of training samples from Symantec Security Company. The screened dataset contains a variety of CWE vulnerability types, such as buffer overflow vulnerabilities (CWE-119, CWE-120, CWE-131) and format string vulnerabilities (CWE-134). For the binary executable program corresponding to each piece of vulnerability information, we filter out two versions: a vulnerable version (without the patch) and a clean version (with the patch). Training the neural network on two different versions serves two purposes. First, it verifies whether the corresponding test cases can successfully trigger the vulnerability. Second, this comparative training further enhances the neural network's learning of vulnerability features, achieving a better training effect. The construction of this training dataset is inspired by the special training dataset constructed for generator G in the GAN neural network, which contains both labeled and unlabeled samples. Finally, all the datasets we obtained are shown in Table 1.

**Table 1.** Dataset Information of CVDF DYNAMIC.


We randomly select 80% of the data as the training set for the bi-LSTM neural network; the remaining 20% serves as the test set for the CVDF DYNAMIC framework and the subsequent comparative experiments.
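The 80/20 split described above can be sketched as follows. The paper does not specify the sampling implementation or seed handling, so this is only an illustrative version using a fixed seed for reproducibility.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=42):
    """Randomly partition samples into (training set, test set)."""
    rng = random.Random(seed)   # fixed seed: illustrative assumption
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```
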

In the experiment, we mainly answer the following three questions:

Q1: Is the theoretical model of CVDF DYNAMIC valid?

Q2: Does CVDF DYNAMIC have a performance advantage in test case generation compared with the existing fuzzy testing tools?

Q3: What is the performance overhead of CVDF DYNAMIC? Does the reduction of sample sets improve the efficiency of CVDF DYNAMIC sample generation?

#### *4.2. Evaluate the Validity of CVDF DYNAMIC's Theoretical Model*

For Q1, our bi-LSTM neural network optimizes its parameters according to the method mentioned above. After seven training epochs, the accuracy and loss of the model are shown in Figure 7.

**Figure 7.** The Relationship Between the Accuracy and Loss Of bi-LSTM And Epochs.

It can be seen from Figure 7 that after seven training epochs, the accuracy of the bi-LSTM neural network exceeds 90%, approaching 93%, and stabilizes, while the loss drops below 20% and also tends to be stable.

Figure 8 shows a specific example of parameter optimization for the number of hidden layers of the bi-LSTM neural network. As can be seen from Figure 8, when the number of hidden layers is five, the bi-LSTM neural network performs best on the three evaluation indices of precision, recall and accuracy. Other parameters, such as the drop rate and batch size, are optimized in a similar way.

**Figure 8.** Relationship Between Evaluation Indices And Layer Numbers.
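The layer-count search described above can be sketched as a simple sweep. The `train_and_eval` callable is a hypothetical stand-in for training the bi-LSTM at a given depth and returning (precision, recall, accuracy); the ranking-by-mean rule is an assumption for illustration, since the paper does not state how the three indices are combined.

```python
def best_layer_count(candidates, train_and_eval):
    """Sweep hidden-layer counts and return the best-scoring one."""
    scores = {}
    for n_layers in candidates:
        precision, recall, accuracy = train_and_eval(n_layers)
        # assumption: rank configurations by the mean of the three indices
        scores[n_layers] = (precision + recall + accuracy) / 3
    return max(scores, key=scores.get)
```
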

In the part that uses the genetic algorithm to generate test cases, we compare the genetic algorithm with the existing fuzzy testing tool AFLFast under two evaluation indices: code coverage and the number of generated edge sequences (EdgeNum). The genetic algorithm was run for 25 rounds of iterations, and the test program is the media processing program FFmpeg [42] from the test set constructed above. The final experimental results are shown in Figure 9.

**Figure 9.** Comparative Test Results Of Genetic Algorithm And AFLFast.

In Figure 9, the ordinate dimension of code coverage is a percentage, and the dimension of the number of edge sequences is *value* × 10². As can be seen from Figure 9, compared with AFLFast, the genetic algorithm has significant performance advantages in both code coverage and the number of edge sequences. The genetic algorithm finds 9246 edge sequences for FFmpeg, while AFLFast finds only 8137. Because the number of edges is positively correlated with code coverage, the code coverage of the genetic algorithm is also better than that of AFLFast.

So far, we have effectively solved the first problem, that is, the CVDF DYNAMIC theoretical model is effective. For the bi-LSTM neural network part of CVDF DYNAMIC, Figure 7 shows that our model achieves ideal training results. For the part of genetic algorithm generating test cases in CVDF DYNAMIC, our test cases have performance advantages over AFLFast in terms of code coverage and number of edges.

#### *4.3. Performance Comparison between CVDF DYNAMIC and Existing Fuzzy Testing Tools*

For Q2, we conduct comparative testing against NeuFuzz, which likewise uses a neural network to guide the generation of fuzzy testing samples, and AFLFast. To facilitate testing and comparison, we use evaluation metrics widely adopted in vulnerability mining and neural networks: the false positive rate (FPR), the true positive rate (TPR) and the accuracy (ACC).

Firstly, the common definitions of the vulnerability evaluation indices are given.

*TP* (true positive): samples that contain vulnerabilities and are correctly identified as vulnerable.

*FP* (false positive): samples that do not contain vulnerabilities but are incorrectly identified as vulnerable.

*FN* (false negative): samples that contain vulnerabilities but are incorrectly identified as not vulnerable.

*TN* (true negative): samples that do not contain vulnerabilities and are correctly identified as not vulnerable.

The specific forms of *FPR*, *TPR* and *ACC* are as follows:

$$TPR = \frac{TP}{TP + FN}$$

$$FPR = \frac{FP}{FP + TN}$$

$$ACC = \frac{TP + TN}{TP + FP + TN + FN}$$
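The three metrics defined above translate directly into code; the following minimal helpers compute them from the confusion-matrix counts.

```python
def tpr(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def acc(tp, fp, tn, fn):
    """Accuracy: (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)
```
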

On the other hand, in order to intuitively show the performance advantages of the bi-LSTM neural network and genetic algorithm integration, we also add two evaluation indices, which are code coverage and path depth detection ability, and use the dataset constructed in this paper to test it. The experimental results are shown in Table 2.


**Table 2.** Comparison Test Results Of CVDF DYNAMIC With Other Tools.

It can be seen from Table 2 that CVDF DYNAMIC has performance advantages over the other fuzzy testing tools, because it combines the advantages of the neural network and the genetic algorithm and is therefore superior in comprehensive performance. The other tools are also very advanced fuzzy testing tools, so they likewise perform well in the comparative test. CVDF DYNAMIC and NeuFuzz are very close on all evaluation indices except code coverage, where CVDF DYNAMIC has an obvious advantage over NeuFuzz because it combines the bi-LSTM neural network with the genetic algorithm. It should also be pointed out that the authors of NeuFuzz explain that NeuFuzz focuses on seed mutation and test case generation along critical execution paths rather than on code coverage. Nevertheless, CVDF DYNAMIC remains in the leading position in comprehensive performance.

#### *4.4. Performance Overhead of CVDF DYNAMIC and Effectiveness of Sample Set Reduction*

For Q3, we consider the performance overhead of CVDF DYNAMIC and the effectiveness of sample set reduction in terms of the number of samples before and after reduction, the fuzzy testing time before and after reduction, the compression ratio and other evaluation indicators.

From Table 3, it can be seen that the compression algorithm greatly reduces the number of samples, with the compression rate reaching 54.6%. Because the compressed sample set essentially retains the key paths, the execution time decreases to some extent, although not as markedly as the compression rate. The code coverage and WDC evaluation indices of the compressed sample set are identical to those of the original sample set, which shows that compressing the test case sample set incurs no performance loss and thus demonstrates the significance and necessity of sample set compression.

**Table 3.** Index Comparison Of Sample Set Before And After Compression.
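The compression rate cited above can be computed as follows, assuming (as the wording suggests) that the rate denotes the fraction of samples removed from the original set; the concrete counts below are illustrative, not taken from Table 3.

```python
def compression_rate(original_count, compressed_count):
    """Fraction of samples removed by compression (assumed definition)."""
    return 1 - compressed_count / original_count

# Illustrative example: at the reported 54.6% rate, a 10,000-sample
# set would be reduced to 4540 samples.
rate = compression_rate(10000, 4540)
```
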


We use random sampling to form six initial sample sets with sizes of 1000, 2000, 3000, 4000, 5000 and 6000. The execution efficiency and time of the initial sample sets and the compressed sample sets are compared, and the results are shown in Figure 10.

As can be seen from Figure 10, as the size of the initial sample set increases, the execution time after compression improves progressively relative to that before compression.

Finally, this paper compares the compression ratio and test time of the sample set between the CVDF DYNAMIC heuristic genetic algorithm and the greedy-based approximation algorithm. The experimental results are shown in Figures 11 and 12.

**Figure 10.** Execution Time Comparison.

**Figure 11.** Comparison Of Compression Ratio Between Heuristic Genetic Algorithm And Approximation Algorithm.

**Figure 12.** Comparison of Test Time Between Heuristic Genetic Algorithm And Approximation Algorithm.

It can be seen that the compression ratio of the heuristic genetic algorithm has obvious advantages over the approximation algorithm on sample sets of different sizes. As the sample size increases, the test time advantage of the heuristic genetic algorithm becomes increasingly pronounced.

#### **5. Discussion on Security and Privacy of CVDF DYNAMIC Model**

Because CVDF DYNAMIC combines the bi-LSTM neural network and the genetic algorithm to generate fuzzy testing samples, the final sample set is a mixed sample set with no classification labels. Therefore, it is very difficult to deduce the sensitive training data of CVDF DYNAMIC from the final sample set it generates. On the other hand, as described in Section 4.1, the training data of CVDF DYNAMIC comes from the vulnerability databases of many different countries and companies. Some of these databases are open access and some are private, but CVDF DYNAMIC mixes the datasets from different sources during training and randomly selects 80% of the mixed data as the training set and 20% as the test set. Therefore, even if an attacker obtains the CVDF DYNAMIC datasets through reverse derivation, it remains very difficult to further distinguish the private data among them. However, the bi-LSTM neural network adopted by CVDF DYNAMIC is a mature neural network structure, and there are corresponding scientific studies on attacking this structure. The security of the bi-LSTM neural network structure still needs to be strengthened in the future.
