**Table 1.** Summary of related works.

Based on the review of related works, it is evident that challenges remain in detecting cyber attacks within the control systems of CPSs. On the one hand, methods are needed that decrease both the false positive and false negative rates and increase the true positive and true negative rates, thereby improving the overall performance of these detection systems. It is also evident that the phenomenon of simultaneous attacks has not been addressed in the design of cyber attack detection systems, which is worrying because such situations can occur very often in the real world. It is important to clarify that, within a CPS, there are many points where a cyber attack can occur, each causing different consequences in the system. This work focuses on designing an architecture for detecting and locating attacks that occur between the elements of the physical layer and the controller of a CPS, specifically attacks that modify or interrupt the transmission of data from one element to another. Accordingly, this paper presents the design of an architecture that exploits the ability of convolutional neural networks to extract features and thus determine whether an event related to a possible cyber attack is occurring. This approach may be closer to implementation in real cases with a high degree of uncertainty in the process models, since anomaly detection is often performed by comparing estimated values against the real values of the process and then evaluating the difference with a threshold. In our proposal, this evaluation is carried out intrinsically by the architecture based on convolutional neural networks, achieving better performance than current works and showing promising results in the detection and isolation of simultaneous attacks.

#### **3. Problem Statement**

Several control applications supported by these systems can be labeled as safety critical with respect to meeting strict real-time deadlines, associated with the generation of actions from the interaction between the computational systems and the physical systems of the application, because failing to meet these requirements can cause irreparable damage to the physical system being controlled, as well as to the people depending on it [70]. Additionally, measurements and control actions can be altered while being transmitted through communication networks, thus requiring new control algorithms or design architectures that, in the presence of adverse situations, can bring the system to safe and stable states [71,72]. The proposal presented in this work focuses on the detection and isolation of DoS and integrity cyber attacks on CPSs, specifically on the exchange of information between sensors, actuators, and controllers. The approach is based on fault detection and isolation systems, in which anomalies are represented as a variation of the system parameters [58]. Then, any control system whose control signals and/or measured variables are susceptible to attack can be modeled as a combination of the two models defined in (1) and (2).

$$x(k+1) = Ax(k) + Bu(k) + F_a f_a(k),\tag{1}$$

$$y(k) = Cx(k) + F_s f_s(k),\tag{2}$$

where *x*(*k*) ∈ ℝ^(n×1) represents the state vector, *y*(*k*) ∈ ℝ^(p×1) is the output vector, *u*(*k*) ∈ ℝ^(m×1) is the control action, *A* ∈ ℝ^(n×n) is the state matrix, *B* ∈ ℝ^(n×m) is the input matrix, *C* ∈ ℝ^(p×n) is the output matrix, *D* ∈ ℝ^(p×m) is the feedthrough matrix, *F_a* = *B*, and *f_a* = (Γ − *I*)*U* + *U_{f0}*. Γ*U* and *U_{f0}* represent the effect of a multiplicative anomaly and an additive effect in the control action, respectively. DoS and integrity attacks appear as anomalies in the control action. If the *i*-th control action is attacked, then the matrix *F_a* corresponds to the *i*-th column of the matrix *B*, and *f_a* corresponds to the magnitude of the attack that directly affects the controller.

Similarly, if the *i*-th sensor is attacked, the matrix *F_s* is the *i*-th row of the matrix *C*, and the attack vector is *f_s*, which represents the magnitude of the effect produced in the *i*-th sensor.
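To make the model concrete, the combined effect of (1) and (2) under an actuator attack can be simulated directly. The matrices below are illustrative placeholders, not the parameters of any system discussed in this paper.

```python
import numpy as np

# Hypothetical stable plant; F_a = B, as stated for attacks on the control action.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[1.0],
              [0.5]])
C = np.eye(2)

def step(x, u, f_a=0.0):
    """One step of (1): x(k+1) = A x(k) + B u(k) + F_a f_a(k), with F_a = B."""
    return A @ x + B @ np.atleast_1d(u) + B @ np.atleast_1d(f_a)

x_nom = np.zeros(2)
x_att = np.zeros(2)
for k in range(50):
    f_a = 0.2 if k >= 25 else 0.0        # additive anomaly on the control action
    x_nom = step(x_nom, 0.1)
    x_att = step(x_att, 0.1, f_a)

y_nom, y_att = C @ x_nom, C @ x_att      # outputs per (2), with f_s = 0
```

The attacked trajectory diverges from the nominal one once the anomaly term becomes active, which is exactly the deviation a detector must pick up.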

The problem with traditional methods based on mathematical models of the system's behavior is that these models depend on complete knowledge of the system parameters, and their adaptation to real conditions can degrade overall performance. Because of this, we address the problem with models based on artificial neural networks, specifically one-dimensional convolutional neural networks, which have shown very promising results in fields where patterns must be identified to recognize a class.

#### *Modeling of the Cyber Attack*

Measurements of process signals and control action values are critical to the proper functioning of a control system, and their modification by cyber attacks can produce instability in the control system [73,74]. A cyber attack by data manipulation is called an integrity attack, modeled by (3), while an attack that results in a prolonged loss of these signals is called a DoS attack, modeled by (4).

$$\overline{y}_i(k) = y_i(k) + y_i(k)^a,\tag{3}$$

$$\overline{y}_i(k) = y_i(k)_{t_s-1},\tag{4}$$

where *ȳ_i*(*k*) corresponds to the sensor measurement that reaches the controller at time *k*, *y_i*(*k*) corresponds to the sensor measurement before being transmitted to the controller at time *k*, and *y_i*(*k*)^*a* is a vector injected by the attackers which changes the measurement *y_i*(*k*) at time *k*. *y_i*(*k*)_{t_s−1} corresponds to the measurement before the start of the DoS attack. The time interval of the attack is defined by *τ_a* = [*t_s*, *t_e*].

For the development of the proposal, it is assumed that any sensor can be affected by either type of attack, integrity or DoS. Additionally, attacks may occur at any time in various parts of the system. The last premise is significant because simultaneous attacks are rarely discussed in previous works; thus, depending on the type of attack carried out on the system, output (2) may take the form of (3) and/or (4).
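As a minimal sketch with a hypothetical sensor trace, the two attack models (3) and (4) can be applied to recorded data as follows:

```python
import numpy as np

def integrity_attack(y, y_a, ts, te):
    """Eq. (3): add the attacker's injection y^a during tau_a = [ts, te)."""
    y_bar = y.copy()
    y_bar[ts:te] += y_a
    return y_bar

def dos_attack(y, ts, te):
    """Eq. (4): the controller keeps receiving the last pre-attack value y(ts-1)."""
    y_bar = y.copy()
    y_bar[ts:te] = y[ts - 1]
    return y_bar

y = np.linspace(0.0, 1.0, 11)          # hypothetical rising level measurement
y_int = integrity_attack(y, 0.3, 4, 8) # biased readings during the attack window
y_dos = dos_attack(y, 4, 8)            # frozen readings during the attack window
```

Note that a DoS attack is not a missing value but a stale one, which is why it can be treated as an anomaly on the transmitted signal.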

#### **4. Attack Detection and Isolation Method**

In the context of this work, most cyber attack detection methods use the available data to develop a model that describes the usual behavior of the system. Then, by comparing the estimated outputs of the model with the actual process outputs, it is determined whether the behavior of the system is normal or a cyber attack is taking place. To isolate the attack, that is, to locate the part of the system directly affected by the cyber attack, decoupled models of the system are developed that are sensitive only to cyber attacks occurring in specific parts of the system.

The procedure to perform this task can be grouped into three steps. First, a residual signal is generated by comparing the measured output with an estimated output. This signal, denoted *res*(*k*), is described in (5).

$$res(k) = y(k) - \hat{y}(k),\tag{5}$$

where *y*(*k*) is the set of measured outputs of the actual process, and *ŷ*(*k*) is the set of estimated outputs. The second step corresponds to the evaluation of the residual; in this case, the residuals are compared with a predefined threshold, as shown in (6).

$$|\text{res}(k)| > \tau\_{\text{threshold}}.\tag{6}$$

The thresholds are obtained from data in which attacks are present, thus allowing their detection and isolation. Finally, a decision-making process is carried out through indicators.

These steps rely on residuals that should take values close to 0 when the system is not under attack and values clearly different from 0 when an attack is present.
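The residual generation and evaluation steps can be sketched as follows (hypothetical signals and threshold; in the proposed architecture, the evaluation step is learned rather than hand-tuned):

```python
import numpy as np

def residual(y, y_hat):
    """Eq. (5): difference between measured and estimated outputs."""
    return y - y_hat

def evaluate(res, tau):
    """Eq. (6): flag samples whose residual magnitude exceeds the threshold."""
    return np.abs(res) > tau

y     = np.array([0.40, 0.41, 0.40, 0.55, 0.56])   # measured (attack from k = 3)
y_hat = np.array([0.40, 0.40, 0.41, 0.40, 0.41])   # estimated by the model
alarms = evaluate(residual(y, y_hat), tau=0.05)    # alarms only where attacked
```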

Although a single residual signal can alert to or detect a cyber attack, a set of residuals is required to isolate it. To locate the origin of the cyber attack, some residuals must be sensitive only to a particular part of the system, which implies that each residual must be independent of the other defined cyber attacks. In this way, to isolate a cyber attack, a structured set of residuals is considered, where each residual vector can be used to detect a cyber attack in a specific place in the system.
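A structured residual set is typically evaluated against a signature (incidence) table; the mapping below is a hypothetical example for two decoupled residuals, not the actual fault signatures of the systems studied here.

```python
def isolate(res, tau):
    """Map the binary firing pattern of the residuals to an attack location."""
    signatures = {                     # hypothetical incidence table
        (0, 0): "normal operation",
        (1, 0): "attack on sensor 1",
        (0, 1): "attack on sensor 2",
        (1, 1): "simultaneous attack",
    }
    pattern = tuple(int(abs(r) > tau) for r in res)
    return signatures[pattern]
```

Because each residual fires only for attacks on its own part of the system, the joint pattern also covers the simultaneous case.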

In the architecture proposed in this work, the second step is implicit: the architecture based on artificial neural networks interprets the input data and generates intrinsic features that allow the evaluation needed to detect and isolate the attacks.

#### *Architecture Proposed for the Detection and Isolation of Cyber Attacks in CPS*

The proposed architecture is presented in Figure 2. It includes a prediction model that uses an input dataset *x*0, *x*1, . . . , *x*_{k−1} to estimate the outputs *ŷ*1, *ŷ*2, . . . , *ŷ_k* (these datasets depend on the type of data available from the process), and these values are used to obtain the residual signal *res*(*k*), as shown in (5). These signals are used by a classifier to detect anomalies present in the process.

**Figure 2.** General architecture model to detect and isolate cyber attack.

Since the characteristics of the signals in a specific process differ, values of different magnitudes could affect the classifier training procedure; therefore, all input data to the classifier are normalized using their mean and standard deviation to obtain a z-score for each one, as shown in (7).

$$z = (\mathbf{x} - \mu)/\sigma,\tag{7}$$

where *x* are the input data, *µ* is the mean, and *σ* is the standard deviation.
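A sketch of the normalization in (7), applied per feature, with hypothetical level and flow measurements:

```python
import numpy as np

def z_score(x):
    """Standardize each column (feature) with its own mean and standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

data = np.array([[0.40, 3.1e-5],
                 [0.42, 3.3e-5],
                 [0.38, 2.9e-5]])   # hypothetical level (m) and flow (m^3/s)
z = z_score(data)                   # each feature now has mean 0 and std 1
```

Per-feature standardization prevents large-magnitude signals from dominating the classifier's training.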

Although the architecture presented is general, it serves as a basis for selecting different types of machine learning models for the prediction and classification stages. The idea is to use deep neural networks, such as LSTMs or one-dimensional CNNs, to extract patterns that allow the detection of cyber attacks. Since no explicit method was included to find the spatial-temporal correlations that reveal cyber attacks, it is expected that the neural networks will carry out this task implicitly.

The architecture for a specific CPS can be detailed as shown in Figure 3. A model of the dynamics of the process generates the output signals *x*(*k*)_s, which correspond to the reconstruction of all the states (it is assumed that the outputs are the process states or some linear combination of them, although this can be extended to non-linear cases). In order to isolate the attack, there is a set of neural network models that relate the process states to their respective control actions to generate states that are decoupled from each other (*x*(*k*)_{d1}, *x*(*k*)_{d2}, . . .); in this way, it is possible to isolate the attack in a manner equivalent to the use of UIOs, but with the advantage that neural networks can handle uncertainty in the representations. With this set of neural networks, *res*(*k*) is generated.

Detection and isolation functions are implemented using artificial neural networks, which use the process states *x*(*k*), the control actions *u*(*k*), the reference signals *r*(*k*), the residual signals *res*(*k*), and the signals generated by the predicting model.

**Figure 3.** Architecture based on neural networks for the detection and isolation of the cyber attack.

Mean squared error (MSE) [75] is adopted as the model's loss function to train the predicting model.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2,\tag{8}$$

where *n* is the amount of data, *x<sup>i</sup>* is the real state, and *x*ˆ*<sup>i</sup>* is the estimated state. For the classifier, the cost function categorical crossentropy (CCE) is used [76] because it is a single-label multi-class classification problem.

$$J_{CCE} = -\sum_{q=1}^{l} \sum_{k=1}^{p} d_{qk} \log(y_{qk}).\tag{9}$$

Here, there are *p* classes and *l* training samples; for input *x_q*, with *q* = 1, 2, . . . , *l*, the value *y_{qk}* (0 ≤ *y_{qk}* ≤ 1), *k* = 1, 2, . . . , *p*, is the estimated probability that *x_q* belongs to class *k*, and *d_{qk}* (0 or 1) is the given label in (9).
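Both cost functions are straightforward to compute directly; a minimal sketch with hypothetical labels and predictions:

```python
import numpy as np

def mse(x, x_hat):
    """Eq. (8): mean squared error between real and estimated states."""
    return np.mean((x - x_hat) ** 2)

def cce(d, y):
    """Eq. (9): categorical cross-entropy; d is one-hot (l x p), y are probabilities."""
    return -np.sum(d * np.log(y))

d = np.array([[1, 0, 0], [0, 1, 0]])              # true labels for two samples
y = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])  # predicted class probabilities
loss = cce(d, y)                                  # -(log 0.8 + log 0.7)
```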

#### **5. Case Studies and Results Analysis**

Two test benches were used to evaluate the performance of the proposed architecture, the SWaT dataset [77,78] and an interconnected tank [58].

#### *5.1. Secure Water Treatment Dataset-SWaT*

This dataset was created by the Singapore University of Technology and Design to provide researchers with data collected from a complex and realistic ICS environment. The testbed is a fully operational scaled-down water treatment plant that produces purified water. SWaT is composed of six main processes corresponding to the physical and control components of the water treatment plant; each stage (from P1 to P6) is equipped with a certain number of sensors and actuators. The sensors include flow meters, water level meters, and conductivity and pH analyzers, among others, while the actuators consist of pumps that transfer water from one stage to another, pumps that dose chemicals, and valves that control inlet flow. The process is not circular, and the water from P6 is removed. The sensors and actuators in each stage are connected to the corresponding PLC (programmable logic controller). This process is shown in Figure 4.

**Figure 4.** SWaT testbed processes overview [57].

Stage P1 controls the flow of raw water by opening or closing a motorized valve connected to the inlet of tank T101. By means of pump P101, water flows from T101 through the chemical dosing station in stage P2, followed by the ultrafiltration (UF) process located in stage P3, which eliminates unwanted materials. Similarly, the feed pump in stage P3 is responsible for supplying water to the ultrafiltration unit. In stage P5, inorganic impurities are separated by a reverse osmosis process. The output of the reverse osmosis process is stored in the permeate tank of stage P6 for distribution. Stage P6 also controls the cleaning of the ultrafiltration membranes in P3 through the backwashing process. At regular intervals, the backwash process is triggered by turning on the backwash pump and is accomplished in under one minute. The backwash process can alternatively be started by a PLC when the differential pressure sensor value rises above 0.4, which means that the UF membranes are clogged [57,78].

#### 5.1.1. Dataset Description

Training Dataset 1 and Training Dataset 2 were used. The first corresponds to data collected under normal operating conditions; it was released on 20 November 2016 and was generated from a one-year-long simulation. The second dataset corresponds to situations in which attack scenarios are generated. This partially labeled dataset was released on 28 November 2016; it spans around six months and contains several attacks, as shown in Table 2.




**Table 2.** *Cont*.

#### 5.1.2. Data Preparation and Model Training

The data from the first dataset is used to generate a model corresponding to the "Predicting model" block shown in Figure 2. The architecture proposed in this case is based on a 1D CNN model, as shown in Figure 5.

**Figure 5.** Prediction model for SWaT dataset.

The input data is composed of 43 features, consisting mainly of sensor measurements, pump states, and valve positions. The first convolution layer has 2 filters with a kernel size of 3. The 1D average pooling layer has a stride of 2 and "same" padding; the second convolution layer has 20 filters and a kernel size of 20; the last convolution layer has 10 filters and a kernel size of 5. Finally, a fully connected layer with 43 neurons is used, followed by a single neuron in the output layer, all with linear activation functions. Additionally, batch normalization layers with ReLU activation are added in various parts of the network. The loss function used was MSE, and the optimizer was stochastic gradient descent with momentum. For training, a maximum of 40 epochs was allowed with an initial learning rate of 0.001. In this case, 30% of the data was used for validation and 70% for training.

The layer parameters of this network were chosen to achieve the lowest possible MSE. Increasing the number of layers, neurons, filter size, or number of filters did not yield a significant performance improvement.
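The core building block of this model is the 1D convolution over the 43-feature input. A minimal numpy sketch of a "valid" 1D convolution with the first layer's dimensions (2 filters, kernel size 3) illustrates the operation; it is a didactic stand-in, not the MATLAB implementation used by the authors.

```python
import numpy as np

def conv1d(x, kernels):
    """'Valid' 1D convolution: x is (length, channels), kernels is (k, channels, filters)."""
    k, _, n_filters = kernels.shape
    length = x.shape[0] - k + 1
    out = np.zeros((length, n_filters))
    for i in range(length):
        # each output step is a dot product of a k-wide window with every filter
        out[i] = np.tensordot(x[i:i + k], kernels, axes=([0, 1], [0, 1]))
    return out

x = np.random.randn(100, 43)      # 100 time steps, 43 features
w = np.random.randn(3, 43, 2)     # kernel size 3, 43 input channels, 2 filters
y = conv1d(x, w)                  # shape (98, 2)
```

Sliding the same small kernel along time is what lets the network pick up local temporal patterns regardless of where in the window an attack begins.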

The second dataset was used for the classification process; it is composed of 4177 samples, of which 3685 correspond to normal operating conditions, 50 belong to the first attack scenario, 24 to the second, 60 to the third and fifth attacks, 94 to the fourth and sixth attacks, and 110 to the seventh scenario. As can be seen in Figure 6a, this dataset is unbalanced, which would cause problems for the classifier. The bar centered at 0 corresponds to normal operating conditions, while the others correspond to the different attack scenarios shown in the ID column of Table 2. The imbalance could affect the algorithms with respect to the minority classes. To address this situation, methods such as Random Oversampling and Undersampling were initially used for imbalanced classification without obtaining satisfactory results. For this reason, the approach shown in Reference [79] was followed. This proposal modifies the temporal data using optimal sequences that are aligned with the original data, thus generating new time-synthesized data for the training dataset. The distribution of the different classes in the new dataset is shown in Figure 6b. Although the dataset remains unbalanced, the amount of data generated for the attack scenarios was increased, and the performance was improved.
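For reference, the random oversampling baseline that was tried first can be sketched as follows; the time-series synthesis of Reference [79] is more involved and is not reproduced here.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    X_out, y_out = [X], [y]
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, counts.max() - n, replace=True)
        X_out.append(X[extra])
        y_out.append(y[extra])
    return np.concatenate(X_out), np.concatenate(y_out)

X = np.random.randn(100, 4)
y = np.array([0] * 90 + [1] * 6 + [2] * 4)       # heavily imbalanced labels
X_bal, y_bal = random_oversample(X, y)           # every class now has 90 samples
```

Plain duplication balances the class counts but adds no new temporal information, which is consistent with it performing poorly here compared with synthesizing new sequences.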

**Figure 6.** SWaT dataset distribution. (**a**) Original SWaT dataset distribution. (**b**) New SWaT dataset distribution.

This new dataset was used to estimate the outputs with the architecture shown in Figure 5, and the estimates were compared with the actual process variables to obtain the residual signal.

The input data for the classifier, whose architecture is shown in Figure 7, are the estimated outputs, the process variables, and the corresponding residuals. This classifier corresponds to the "Classification model" block shown in Figure 2 and was implemented as a group of cascaded convolutional layers with a batch normalization layer and ReLU activation function between them. Five convolutional layers were selected, achieving an accuracy higher than 90%. The numbers of filters from the input to the fully connected layer were 128, 64, 32, 16, and 8, respectively, each with a kernel size of 10. The fully connected layer has eight neurons in its input layer with linear activation functions, while the last layer has eight neurons with softmax activation functions corresponding to the 7 attack scenarios and the normal operation scenario.

**Figure 7.** Classification model for SWaT dataset.

The loss function used was CCE, and the optimizer was stochastic gradient descent with momentum. For training, a maximum of 4 epochs was allowed, with an initial learning rate of 0.0001; 30% of the dataset was used for validation and 70% for training.

#### 5.1.3. Evaluation Metrics

The metrics considered in this work were true positives (*TP*), false positives (*FP*), true negatives (*TN*), and false negatives (*FN*). In order to evaluate the performance of the proposed architecture, the following metrics were used: precision, accuracy, recall (sensitivity or TPR), F1 score, and true negative rate or specificity (*TNR*). These metrics were calculated as follows:

$$Precision = \frac{TP}{TP + FP},\tag{10}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

$$Recall = \frac{TP}{TP + FN} \tag{12}$$

$$F1\text{ Score} = \frac{2TP}{2TP + FP + FN} = 2\frac{Precision \times Recall}{Precision + Recall} \tag{13}$$

$$TNR = \frac{TN}{FP + TN}.\tag{14}$$

Additionally, the ROC (Receiver Operating Characteristics) and Precision-Recall Curves were considered.
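These metrics follow directly from the confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
def metrics(tp, fp, tn, fn):
    """Eqs. (10)-(14) computed from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "tnr": tn / (fp + tn),
    }

m = metrics(tp=90, fp=10, tn=80, fn=20)   # hypothetical counts, not Table 3's
```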

#### 5.1.4. Analysis of Results of SWaT Case

The results obtained for this dataset are shown in this section. The training and evaluation were carried out in MATLAB. Figure 8 shows the confusion matrix for each of the available classes. From these results, the metrics defined in the previous section are obtained; they are presented in Table 3.


**Figure 8.** Confusion matrix for SWaT dataset.

**Table 3.** Summary of metrics.


Class 0 corresponds to normal operation, while classes 1 to 7 are the different attack scenarios shown in Table 2. Accuracy is high in all cases, showing a high percentage of samples correctly classified by our model. For precision, all attack scenarios present a score above 0.94, which means that most of the data assigned to the different attack scenarios was classified correctly. Similarly, the recall scores are above 0.91 in the majority of classes, indicating that few attacks go undetected. Finally, the F1 scores are above 0.92. The high TNR in each of the classes is also highlighted, which means that the FPR is low.

The ROC and Precision-Recall Curves shown in Figure 9a,b present an appropriate performance, indicating that the model has a good capability to distinguish different classes.

Table 4 presents a comparison of the proposal presented in this paper with other methods. In the recall and F1 score metrics, the proposed method outperforms the other methods. For precision and accuracy, the proposed method is above the others in almost all cases, except for the last two methods, which exceed it by a margin of 0.04. However, the F1 score remains high, indicating that satisfactory and reliable class detection was obtained.


**Table 4.** Summary of the results and performance comparison on the SWaT dataset.

#### *5.2. Interconnected Tank Testbed*

This testbed has been used extensively to test anomaly detection proposals [37,65–69]. The hydraulic system consists of three identical cylindrical tanks with equal cross-sectional area *S*, as shown in Figure 10. These tanks are connected by two cylindrical pipes of the same cross-sectional area *S_p*, with outflow coefficients *µ*13 and *µ*32. The nominal outflow located at tank 2 has the same cross-sectional area as the coupling pipes but a different outflow coefficient, *µ*20. The inlet flow of the tanks comes from two pumps, with flow rates *q*1 and *q*2. A digital/analog converter is used to control each pump. Piezo-resistive differential pressure sensors carry out the necessary level measurements. The objective of the system is to maintain the height of the fluid stored in tanks 1 and 2 at a particular operating point.

**Figure 10.** Schematic diagram of the three-tank system.

The parameters are shown in Table 5, and the mathematical model is presented in (15)–(17) [58].

$$\begin{aligned} \frac{dl_1(t)}{dt} &= (q_1(t) - q_{13}(t))/S, \\ \frac{dl_2(t)}{dt} &= (q_2(t) + q_{32}(t) - q_{20}(t))/S, \\ \frac{dl_3(t)}{dt} &= (q_{13}(t) - q_{32}(t))/S, \end{aligned}\tag{15}$$

$$q_{mn}(t) = \mu_{mn} S_p \operatorname{sign}(l_m(t) - l_n(t)) \sqrt{2g|l_m(t) - l_n(t)|}\quad (m, n = 1, 2, 3\ \forall\ m \neq n),\tag{16}$$

$$q_{20}(t) = \mu_{20} S_p \sqrt{2gl_2(t)}.\tag{17}$$
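The nonlinear model (15)-(17) can be simulated with a forward-Euler step. The parameter values below are illustrative stand-ins, not the values listed in Table 5.

```python
import numpy as np

S, Sp, g = 0.0154, 5e-5, 9.81            # hypothetical areas (m^2) and gravity
mu13, mu32, mu20 = 0.5, 0.5, 0.6         # hypothetical outflow coefficients

def coupling_flow(mu, lm, ln):
    """Eq. (16): flow through a connecting pipe between tanks m and n."""
    return mu * Sp * np.sign(lm - ln) * np.sqrt(2 * g * abs(lm - ln))

def level_derivatives(l, q1, q2):
    """Eq. (15), with the nominal outflow of eq. (17)."""
    l1, l2, l3 = l
    q13 = coupling_flow(mu13, l1, l3)
    q32 = coupling_flow(mu32, l3, l2)
    q20 = mu20 * Sp * np.sqrt(2 * g * l2)
    return np.array([(q1 - q13) / S, (q2 + q32 - q20) / S, (q13 - q32) / S])

l = np.array([0.4, 0.2, 0.3])            # initial levels (m)
for _ in range(100):                      # Euler integration with Ts = 1 s
    l = l + 1.0 * level_derivatives(l, 0.35e-4, 0.375e-4)
```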

**Table 5.** Parameters value of the three-tank system.


#### 5.2.1. Dataset Generation

Assuming that *l*1 > *l*3 > *l*2, a linear approximation was established around an equilibrium point (*U*0, *Y*0) using a Taylor series expansion. The linearized system is described by a discrete state-space representation with a sampling period of *T_s* = 1 s, as shown in (18).

$$\begin{aligned} \mathbf{x}(k+1) &= A\mathbf{x}(k) + Bu(k) \\ \mathbf{y}(k) &= \mathbf{C}\mathbf{x}(k) \end{aligned} \tag{18}$$

The states *x*(*k*) correspond to the fluid level of the tanks.

The purpose of this case study is to control the system around the operating point (*U*0, *Y*0), as shown in (19).

$$\begin{aligned} Y_0 &= [0.4\ 0.2\ 0.3]^T\ (m), \\ U_0 &= [0.35 \times 10^{-4}\ 0.375 \times 10^{-4}]^T\ (m^3/s). \end{aligned}\tag{19}$$

A tracking control problem was considered in this case study, where the desired outputs *y* = [*l*1 *l*2]^T are required to track references. The state feedback pole assignment technique was used. Thus, a feedback gain matrix *K* was designed so that the closed-loop eigenvalues of the augmented system are equal to [0.92 0.97 0.9 0.95 0.94]. MATLAB was used to find the matrices *A* and *B*, as well as the controller gains. The values can be observed in (20)–(22).

$$A = \begin{bmatrix} 0.9888 & 0.0001 & 0.0112 \\ 0.0001 & 0.9781 & 0.0111 \\ 0.0112 & 0.0111 & 0.9776 \end{bmatrix} \tag{20}$$

$$B = \begin{bmatrix} 64.5687 & 0.0014\\ 0.0014 & 64.2202\\ 0.3650 & 0.3637 \end{bmatrix},\tag{21}$$

$$K = \begin{bmatrix} K_1 \mid K_2 \end{bmatrix} = 10^{-4} \begin{bmatrix} 21.6 & 3 & -5 & -0.95 & -0.32 \\ 2.9 & 19 & -4 & -0.3 & -0.91 \end{bmatrix}.\tag{22}$$

In order to construct the dataset for detecting attacks, the scheme shown in Figure 11 was implemented, which has modules to obtain measurements of the process variables, as well as the control actions applied by the actuators. An Ethernet network was used as the control network. This representation is equivalent to the "Process" and "Controller" boxes in the architecture presented in Figure 3.

**Figure 11.** Interconnected tank testbed.

Two datasets were generated. The first one corresponds to normal operation and is used to determine a model that estimates the system outputs. The second one includes cyber attacks on sensors 1 and 2; these can be integrity or DoS attacks.

In both cases, 499,000 samples were generated. The system references range between 0.35 m and 0.45 m for *l*1, and between 0.185 m and 0.25 m for *l*2. The time intervals were defined randomly with a uniform distribution and reference changes every 500 s to 850 s.

The cases are shown in Table 6. Case 0 corresponds to operation without attacks. The following cases correspond to situations in which integrity or DoS cyber attacks are generated on any sensor, following the models described by Equations (3) and (4). In cases 1 to 4, only one cyber attack is generated at a time, while cases 5 to 8 correspond to simultaneous attacks.



The time intervals in which cyber attacks occur were defined randomly with a uniform distribution so that the dataset would be balanced. The integrity attacks were implemented by changing the attacked variable within a range of 5% to 8% of its measured value. This range depends on the sensitivity of the system, since there are processes where the effect of varying the measurements within a given range does not have as much impact as in others. All the cases presented correspond to the classes that the classifier will identify. The distribution of these data is shown in Figure 12.

**Figure 12.** Dataset for cyber attack classification.

#### 5.2.2. Model Training

Figure 3 presents the implemented architecture. The first model generates the estimate of the process states, while two more models were obtained to reconstruct the independent states *x*1 and *x*2, according to the states susceptible to cyber attack.

The first network has the architecture shown in Figure 13. Its input data is composed of five features, corresponding to the sensor measurements and the control actions in vector (23):

$$\begin{aligned} \text{input data} = [&x_1(1), \dots, x_1(k-1), x_2(1), \dots, x_2(k-1), x_3(1), \dots, x_3(k-1), \\ &u_1(1), \dots, u_1(k-1), u_2(1), \dots, u_2(k-1)]^T. \end{aligned}\tag{23}$$

**Figure 13.** Model to estimate all states.

The model has three outputs corresponding to the states of the process. The vector to be reconstructed is (24):

$$\begin{aligned} \text{output data 1} &= \hat{x}_1 = [x_1(2), \dots, x_1(k)]^T, \\ \text{output data 2} &= \hat{x}_2 = [x_2(2), \dots, x_2(k)]^T, \\ \text{output data 3} &= \hat{x}_3 = [x_3(2), \dots, x_3(k)]^T, \end{aligned}\tag{24}$$

where *k* is the number of samples. This model has two convolutional layers, one 1D average pooling layer between them, and one fully connected layer. The first convolutional layer has a kernel size of 5 and 8 filters, while the second has a kernel size of 3 and 16 filters. Each of these layers uses the hyperbolic tangent activation function. Between them, there is a 1D average pooling layer with a pool size of 2, a stride of 2, and "same" padding. Between the convolutional layers and the fully connected layer, there is a batch normalization layer with a Leaky ReLU activation function. The fully connected layer has an input layer of 48 neurons and an output layer of 3 neurons with linear activation functions to estimate the corresponding states. The loss function used was MSE, and the optimizer was Adam. For training, a maximum of 4 epochs and a batch size of 10 were used, with an initial learning rate of 0.01; 30% of the data was used for validation and 70% for training. The layer parameters of this network were chosen to achieve the lowest possible MSE, which was 0.000067. Increasing the number of layers, neurons, filter size, or number of filters did not yield a significant improvement.
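The windowing in (23)-(24) amounts to pairing samples 1..*k*−1 of the states and control actions with the next-step states; a sketch with hypothetical trajectories:

```python
import numpy as np

def make_training_pairs(x, u):
    """Inputs: states and controls at samples 1..k-1; targets: states at 2..k."""
    X = np.hstack([x[:-1], u[:-1]])   # (k-1, 5): x1, x2, x3, u1, u2
    Y = x[1:]                         # (k-1, 3): next-step states to predict
    return X, Y

x = np.random.randn(1000, 3)          # hypothetical state trajectories
u = np.random.randn(1000, 2)          # hypothetical control actions
X, Y = make_training_pairs(x, u)
```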

The second and third networks have the architecture shown in the Figure 14.

**Figure 14.** Model to estimate the decoupled states.

The input data for the second architecture is composed of four features corresponding to the sensor measurements and the control actions, as presented in vector (25):

$$\begin{aligned} \text{input data} = [&x_2(1), \dots, x_2(k-1), x_3(1), \dots, x_3(k-1), \\ &u_1(1), \dots, u_1(k-1), u_2(1), \dots, u_2(k-1)]^T. \end{aligned}\tag{25}$$

The model generates an estimated uncoupled output for the first state as (26):

$$\text{output data} = \hat{\mathbf{x}}_{1d} = \left[ \hat{x}_1(2), \dots, \hat{x}_1(k) \right]^T, \tag{26}$$

where *k* is the number of samples. This model has two convolutional layers and one fully connected layer. The first convolutional layer has a kernel size of 4 with 8 filters, while the second has a kernel size of 2 with 16 filters. Each of these layers uses the hyperbolic tangent activation function. Between them, there is a 1D average pooling layer with a pool size of 2 and a stride of 2 with "same" padding. Between the convolutional layers and the fully connected layer, there is a batch normalization layer with a Leaky ReLU activation function. Before the fully connected layer, a dropout layer (0.15) was added. The fully connected part has an input layer of 32 neurons and an output layer of 1 neuron with a linear activation function to estimate the corresponding state. The loss function was MSE, and the optimizer was Adam. Training used a maximum of 4 epochs and a batch size of 10, with an initial learning rate of 0.01; 70% of the data was used for training and 30% for validation. The parameters of the layers were tuned to achieve the lowest possible MSE, which was 0.00047. Increasing the number of layers, neurons, kernel sizes, or number of filters did not yield a significant improvement in the proposed architecture.

Finally, the structure used to estimate the second uncoupled state of the process is shown in (27) and (28). The respective MSE for this case was 0.000031.

$$\begin{aligned} \text{input data} = \big[ & x_1(1), \dots, x_1(k-1), x_3(1), \dots, x_3(k-1),\\ & u_1(1), \dots, u_1(k-1), u_2(1), \dots, u_2(k-1) \big]^T \end{aligned} \tag{27}$$

$$\text{output data} = \hat{\mathbf{x}}_{2d} = \left[ \hat{x}_2(2), \dots, \hat{x}_2(k) \right]^T \tag{28}$$

The architecture proposed for the cyber attack classifier is similar to that shown in Figure 7. It is composed of three convolutional layers whose activation function is the hyperbolic tangent. The first convolutional layer has a kernel size of 15 with 80 filters. The second and third convolutional layers have the same kernel size, but with 60 and 30 filters, respectively. There is also a batch normalization layer with a Leaky ReLU activation function. Finally, a fully connected layer is used, with an input layer of 25 neurons and an output layer of nine neurons corresponding to the classes established above. The last layer uses the softmax function. The loss function was CCE, and the optimizer was stochastic gradient descent with momentum. For training, a maximum of 1000 epochs was established, with a batch size of 10 and an initial learning rate of 0.0001. For model training, 70% of the data was used to train and 30% to validate. The input data is (29):

$$\text{input data} = \left[ \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \hat{\mathbf{x}}_3, \hat{\mathbf{x}}_{1d}, \hat{\mathbf{x}}_{2d}, q_1, q_2, \text{res}, \text{res}_1, \text{res}_2 \right]^T, \tag{29}$$

where $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ correspond to the real process variables; $\hat{\mathbf{x}}_1$, $\hat{\mathbf{x}}_2$, $\hat{\mathbf{x}}_3$ are the outputs estimated by the architecture shown in Figure 13; $\hat{\mathbf{x}}_{1d}$ and $\hat{\mathbf{x}}_{2d}$ correspond to the decoupled states estimated by the architecture of Figure 14; $q_1$ and $q_2$ are the process references; and $\text{res}$, $\text{res}_1$, and $\text{res}_2$ are the residual signals obtained by comparing the real process states with the estimated states, and by individually comparing the first two real process states with the estimated decoupled states, respectively.
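The assembly of the residual signals and the 13-feature classifier input of (29) can be sketched as follows. The exact definition of the global residual (here a per-sample Euclidean norm across the three states) is an assumption for illustration; the paper does not specify it at this granularity:

```python
import numpy as np

def build_classifier_features(x, x_hat, x1d_hat, x2d_hat, q1, q2):
    """Assemble the classifier input features of Eq. (29).

    x       : (k, 3) measured states
    x_hat   : (k, 3) states estimated by the coupled model (Figure 13)
    x1d_hat : (k,)   first decoupled estimate (Figure 14)
    x2d_hat : (k,)   second decoupled estimate (Figure 14)
    q1, q2  : (k,)   process references
    """
    res = np.linalg.norm(x - x_hat, axis=1)   # global residual (assumed norm)
    res1 = x[:, 0] - x1d_hat                  # residual of first decoupled state
    res2 = x[:, 1] - x2d_hat                  # residual of second decoupled state
    return np.column_stack([x, x_hat, x1d_hat, x2d_hat, q1, q2, res, res1, res2])

# Demo with synthetic data
k = 50
x = np.random.rand(k, 3)
x_hat = x + 0.01                 # estimates close to the measurements
x1d_hat = x[:, 0].copy()         # decoupled estimates match exactly here
x2d_hat = x[:, 1].copy()
q1 = np.ones(k)
q2 = np.ones(k)
F = build_classifier_features(x, x_hat, x1d_hat, x2d_hat, q1, q2)
```

In the attack-free demo above, the decoupled residuals are zero while the global residual reflects only the small estimation error.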

Figure 15a,b present the evolution of the cost function and the accuracy metric obtained during the classifier training procedure.

Python programming language and the Keras library were used for training and obtaining the results. With the purpose of evaluating the performance of this architecture, the same metrics of the previous case were used.

#### 5.2.3. Performance Analysis of the Three-Tank System Testbed

Index values obtained for this case are presented in Figures 16 and 17a,b and Table 7. For accuracy, the best scores, with values above 0.97, were obtained when simultaneous attacks occurred. This is an important result because this situation has been little explored. In terms of recall, class 7 has a somewhat lower score, while the other classes score above 0.83. The F1 scores are also high. These scores show that the proposed architecture achieves both high specificity and high sensitivity.

**Figure 16.** Confusion matrix for the three-tank system.

**Figure 17.** ROC and Precision-Recall Curves for the interconnected tank. (**a**) ROC Curve. (**b**) Precision-Recall Curve.



The alarm indicator was implemented on top of the classifier in order to report the process state. Since the classifier provides the probability that an input belongs to a particular class, the alarm signal is generated by taking the class with the maximum probability output by the classifier. In Figure 18, the alarm indicator is 1 when sensor 1 or 2 is under attack, and 0 when it is not. Additionally, the indicator discriminates whether the attack is of the DoS or integrity type. The response of the process when it is attacked is shown in Figure 19. Boxes indicate the time instants when the attack occurs in both sensors, according to the generated alarm signals. Red boxes correspond to DoS attacks, and black boxes correspond to integrity attacks.
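A minimal sketch of this alarm logic follows. The mapping from class indices to attack types is hypothetical, chosen only to illustrate the argmax-then-label step; the paper's actual class numbering may differ:

```python
# Hypothetical class layout, assumed for illustration only:
# class 0 = normal operation; the remaining classes mark attacks.
DOS_CLASSES = {1, 2}        # assumed: DoS on sensor 1 / sensor 2
INTEGRITY_CLASSES = {3, 4}  # assumed: integrity on sensor 1 / sensor 2

def alarm_from_probs(probs):
    """Turn the classifier's softmax output into an alarm signal.

    probs : sequence of class probabilities (one per class).
    Returns (alarm, attack_type): alarm is 1 when the most probable
    class is an attack class, and attack_type labels the attack.
    """
    cls = max(range(len(probs)), key=lambda i: probs[i])  # argmax class
    if cls in DOS_CLASSES:
        return 1, "DoS"
    if cls in INTEGRITY_CLASSES:
        return 1, "integrity"
    return 0, "normal"
```

With this scheme, the temporal alarm trace of Figure 18 is simply `alarm_from_probs` evaluated on the classifier output at each sample.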

Additionally, the effect differs depending on whether it is a DoS or an integrity attack. The system proposed in this work performed appropriately in detecting the occurrence of a cyber attack, as well as its location and type. As the results obtained using convolutional networks were better than those employing RNN or LSTM networks, convolutional networks were chosen for this proposal.

In summary, the key steps for using the proposed architecture are as follows:


**Figure 19.** Temporal response.

#### **6. Conclusions**

New applications of industrial automation demand great flexibility in systems, which is supported by increasing interconnection between their components. At the same time, this interconnection opens a large gap that affects the security of control systems. Current solutions are mainly oriented toward preventing attacks, but problems appear regardless; hence, interest in developing new proposals that help detect attacks has grown recently.

In this work, a new architecture for DoS and integrity cyber attack detection and isolation in Cyber Physical Systems using one-dimensional Convolutional Neural Networks was presented, thereby outperforming other models based on machine learning and on model-based methods, such as Unknown Input Observers. This architecture involves a series of steps to achieve its purpose. The first step was to generate an estimated output of the process using a regression model. The next step was to generate a residual signal by comparing the measured process outputs with the estimated outputs. Then, a classification model was added whose input data comprise different characteristics, such as control actions, estimated outputs, measured process outputs, and residual signals. This model allowed for the detection and isolation of the different eventualities that were defined as classes. Finally, from the detected class, alarm signals were generated to report the occurrence of a cyber attack, defining the type of attack and the part of the system being affected by it.

The proposed architecture does not use threshold information to detect and isolate attacks, as is the case with model-based methods, such as Unknown Input Observers, which often rely on it. Those methods require an exhaustive selection of thresholds, which can cause both false detections and anomalous situations that go undetected; the proposed architecture shows advantages in this respect.

The performance of the proposed architecture was validated on two testbeds, obtaining satisfactory results compared to other methods. The results on the SWaT dataset showed that, in terms of precision and accuracy, the indexes, at a score of 0.95, are very close to the highest scores of other works. In terms of the recall and F1 score metrics, it also scored 0.95, which outperforms the previously proposed methods by a good margin. Overall, the proposed system has a high true positive rate and a low false positive rate. On the other hand, the ability of the system to detect and isolate cyber attacks that occur simultaneously is highlighted, as presented in the three-tank system testbed. Over the defined classes, the accuracy presents scores above 0.96; the precision is above 0.83 in cases where attacks occur in a single part of the system, and higher than 0.91 in cases where simultaneous attacks occur. In terms of the F1 score metric, the scores are above 0.81, which is a very promising result. Finally, with respect to the recall metric, the scores are above 0.83 in most cases. With the cases presented in this testbed, it was possible to demonstrate the ability of the proposed architecture to detect and locate attacks that occur simultaneously. This is interesting because these types of experiments are rarely performed, and few systems have provided evidence of detecting such situations, which may well occur in reality. In both testbeds, there was a high TNR in each of the classes, ranging between 0.98 and 0.99.

**Author Contributions:** Conceptualization, C.M.P.; methodology, C.M.P. and D.M.-C.; software, C.M.P.; validation, C.M.P. and D.M.-C.; formal analysis, C.M.P.; investigation, C.M.P.; resources, D.M.-C.; data curation, C.M.P.; writing—original draft preparation, C.M.P.; writing—review and editing, C.M.P., D.M.-C., V.I.-J. and A.G.-P.; visualization, D.M.-C. and A.G.-P.; supervision, D.M.-C.; project administration, D.M.-C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Interested parties can contact the first author about the availability of the datasets.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations have been used in this manuscript:



#### **References**


### *Article* **REFUZZ: A Remedy for Saturation in Coverage-Guided Fuzzing**

**Qian Lyu <sup>1</sup> , Dalin Zhang 1,\* , Rihan Da <sup>1</sup> and Hailong Zhang 2,\***


**Abstract:** Coverage-guided greybox fuzzing aims at generating random test inputs to trigger vulnerabilities in target programs while achieving high code coverage. In the process, the scale of testing gradually becomes larger and more complex, and eventually the fuzzer runs into a saturation state where new vulnerabilities are hard to find. In this paper, we propose a fuzzer, REFUZZ, that acts as a complement to existing coverage-guided fuzzers and a remedy for saturation. This approach generates only inputs that lead to already covered paths, omitting all other inputs, which is exactly the opposite of what existing fuzzers do. REFUZZ takes the test inputs generated by the regular, saturated fuzzing process and continues to explore the target program with the goal of *preserving* the code coverage. The insight is that coverage-guided fuzzers tend to underplay already covered execution paths when seeking to reach new paths, causing covered paths to be examined insufficiently. In our experiments, REFUZZ discovered tens of new unique crashes that AFL failed to find, of which nine vulnerabilities were submitted to and accepted by the CVE database.

**Keywords:** remedial testing; greybox fuzzing; vulnerability detection; enhanced security

#### **1. Introduction**

Software vulnerabilities are regarded as a significant threat to information security. Programming languages without a memory reclamation mechanism (such as C/C++) carry the risk of memory errors, which may expose irreparable vulnerabilities [1]. With the increase in software complexity, it is impractical to reveal all abnormal software behaviors manually. Fuzz testing, or *fuzzing*, is a (semi-)automated technology to facilitate software testing. A fuzzing tool, or *fuzzer*, feeds random inputs to a target program and, meanwhile, monitors for unexpected behaviors during execution to detect vulnerabilities [2]. Among all fuzzers, coverage-guided greybox fuzzers (CGF) have become some of the most popular due to their high deployability and scalability, e.g., AFL [3] and LibFuzzer [4]. They have been successfully applied in practice to detect thousands of security vulnerabilities in open-source projects [5].

Coverage-guided greybox fuzzing relies on the assumption that *more run-time bugs could be revealed if more program code is executed*. To find bugs as quickly as possible, AFL and other CGFs try to maximize the code coverage. This is because a bug at a specific program location can only be triggered if that location is covered by some test input. A CGF utilizes lightweight program transformation and dynamic program profiling to collect run-time coverage information. For example, AFL instruments the target program to record transitions at the basic block level. The actual fuzzing process starts with an initial corpus of seed inputs provided by users. AFL generates a new set of test inputs by randomly mutating the seeds (such as by bit flipping). It then executes the program using the mutated inputs and records those that cover new execution paths. AFL continually repeats this process, but starts with the mutated inputs instead of the user-provided seed inputs. If there are program crashes or hangs, for example caused by memory errors, AFL also reports the corresponding inputs for further analysis.

**Citation:** Lyu, Q.; Zhang, D.; Da, R.; Zhang, H. REFUZZ: A Remedy for Saturation in Coverage-Guided Fuzzing. *Electronics* **2021**, *10*, 1921. https://doi.org/10.3390/ electronics10161921

Academic Editor: Arman Sargolzaei

Received: 21 June 2021 Accepted: 8 August 2021 Published: 10 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

When a CGF is applied, the fuzzing process does not terminate automatically. Practically, users need to decide when to end this process. In a typical scenario, a user sets a timer for each CGF run and the CGF stops right away when the timer expires. However, researchers have discovered empirically that, within a fixed time budget, exponentially more machines are needed to discover each new vulnerability [6]. With a limited number of machines, the CGF could rapidly reach a *saturation* state in which, by continuing the fuzzing, it is difficult to find new unique crashes (where exponentially more time is needed). Then, *what can we do to improve the capability of CGF to find bugs with constraints on time and CPU power*? In this work, we try to provide one solution to this question.

Existing CGFs are biased toward test inputs that can explore new program execution paths. These inputs are prioritized in subsequent mutations. Inputs that do not discover new coverage are considered unimportant and are not selected for mutation. However, in practice, this extensive coverage-guided path exploration may *hinder the discovery of, or even overlook, potential vulnerabilities on specific paths*. The rationale is that an execution path in one successful run may not be bug-free in all runs. Simply discarding "bad" inputs may cause insufficient testing of their corresponding execution paths. Rather, special attention should be paid to such inputs and paths. Intuitively, an input covering a path is more likely to cover the same path after mutation than any other arbitrary input. Although an input may not trigger the bug on its path in one execution, it may do so after a few fine-grained mutations. In short, by focusing on new execution paths, CGFs can discover a number of vulnerabilities in a fixed time, but they also miss some vulnerabilities that require the specific execution path to be tested repeatedly before being found.

Based on this, we propose a lightweight extension of CGF, REFUZZ, that can effectively find tens of new crashes within a fixed amount of time on the same machines. The goal of REFUZZ is not to achieve as high code coverage as possible. Instead, it aims to *detect new unique crashes on already-covered execution paths in a limited time*. In REFUZZ, test inputs that do not explore new paths are regarded as favored. They are prioritized and mutated often to examine the same set of paths repeatedly. All other mutated inputs are omitted from execution. As a prototype, we implement REFUZZ on top of AFL. In our experiments, it successfully triggered 37, 59, and 54 new crashes in our benchmarks that were not found by AFL, using three different experimental settings, respectively. Finally, we discovered nine vulnerabilities accepted to the CVE database.

In particular, REFUZZ incorporates two stages. Firstly, in the *initial* stage, AFL is applied as usual to test the target program. The output of this stage is a set of crash reports and a corpus of mutated inputs used during fuzzing. In addition, we record the code coverage of this corpus. Secondly, in the *exploration* stage, we use the corpus and coverage from the previous stage as seed inputs and initial coverage, respectively. During the testing process, instead of rewarding inputs that cover new paths, REFUZZ only records and mutates those that converge to the initial coverage, i.e., they contribute no new coverage. To further improve the performance, we also review the validity of each mutated input before execution and promote non-deterministic mutations, if necessary. In practice, the second stage may last until the fuzzing process becomes saturated.

Note that REFUZZ is not designed to replace CGF but as a *complement* and a *remedy* for saturation during fuzzing. In fact, the original unmodified AFL is used in the initial stage. The objective of the exploration stage is to verify whether new crashes can be found on execution paths that have already been covered by AFL and whether AFL and CGFs, in general, miss potential vulnerabilities on these paths while seeking to maximize code coverage.

We make the following contributions.

• We propose an innovative idea: even though an input does not trigger a bug in one execution, it may do so after a few fine-grained mutations.


The rest of the paper is organized as follows. Section 2 introduces fuzzing and AFL, as well as a motivating example that illustrates the limitations of CGFs' mutation strategy. Section 3 describes the design details of REFUZZ. We report the experimental results and discussion in Sections 4 and 5. Section 6 discusses the related work, and, finally, Section 7 concludes our work.

#### **2. Background**

#### *2.1. Fuzzing and AFL*

Fuzzing is a process of automatic test generation and execution with the goal of finding bugs. Over the past two decades, security researchers and engineers have proposed a variety of fuzzing techniques and developed a rich set of tools that have helped to find thousands of vulnerabilities (or more) [8]. Blackbox fuzzing randomly mutates test inputs and examines target programs with these inputs. Whitebox fuzzing, on the other hand, utilizes advanced, sophisticated program analyses, e.g., symbolic execution [9], to systematically exercise all possible program execution paths. Greybox fuzzing sits between the former two techniques: testing is guided by run-time information gathered from program execution. Due to its high scalability and ease of deployment, coverage-guided greybox fuzzing has gained popularity in both the research community and industry. Specifically, AFL [3] and its derivatives [10–14] have received plenty of attention.

Algorithm 1 shows the skeleton of the original AFL algorithm. (For simplicity, the algorithm does not distinguish between deterministic and non-deterministic, i.e., totally random, mutations.) Given a program under test and a set of initial test inputs (i.e., the seeds), AFL instruments each basic block of the program to collect *block transitions* during execution and runs the program with mutated inputs derived from the seeds. The generation of new test inputs is guided by the collected run-time information. More specifically, if an input contributes no crash or new coverage, it is regarded as useless and is discarded. On the other hand, if it covers new state transitions, it is added as a new entry in the queue to produce further inputs, since the likelihood of these resulting inputs achieving new coverage is heuristically higher compared to other arbitrary inputs. However, this coverage-based exploration strategy leads to a strong bias toward such inputs, leaving already explored paths probabilistically less inspected. In our experiments, we found that these paths actually contained a substantial number of vulnerabilities, causing programs to crash.

AFL mutates an input at both a coarse-grained level, which changes bulks of bytes, and a fine-grained level, which involves byte-level modifications, insertions, and deletions [15]. In addition, AFL applies two mutation strategies, i.e., deterministic mutation and random mutation. During fuzzing, AFL maintains a seed queue that stores the *initial* test seeds provided by users and the new test cases screened by the fuzzer. Once deterministic mutations have been applied to an input in the seed queue, it is not mutated deterministically again in subsequent fuzzing. In a deterministic mutation, which includes the bitflip, arithmetic, interest, and dictionary methods, a new input is obtained by modifying the content of the input at a specific byte position, and every input is mutated in the same way. In particular, during the interest and dictionary mutation stages, special contents and tokens, automatically generated or provided by users, are replaced or inserted into the original input. On the contrary, the random mutations, called havoc and splice, are applied until the fuzzing stops. In the havoc stage, a random number is generated as the mutation combination number; according to this number, one random mutation method is selected each time and applied to the file in turn. In the next stage, called splice, a new input is produced by splicing two seed inputs, and havoc mutation continues on the resulting file.
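The deterministic and random stages described above can be sketched as follows. This is a simplified Python illustration of a single bitflip and a havoc-style stacked mutation, not AFL's actual C implementation; the operator set inside `havoc` is a small assumed subset:

```python
import random

def bitflip(data: bytearray, pos: int) -> bytearray:
    """Deterministic mutation: flip exactly one bit at a fixed position.
    Applying the same flip twice restores the original input."""
    out = bytearray(data)
    out[pos // 8] ^= 1 << (pos % 8)
    return out

def havoc(data: bytearray, rng: random.Random) -> bytearray:
    """Random (havoc-style) mutation: apply a random stack of byte edits."""
    out = bytearray(data)
    for _ in range(rng.randint(1, 8)):          # random combination number
        op = rng.choice(["flip", "overwrite", "insert"])
        i = rng.randrange(len(out))
        if op == "flip":
            out[i] ^= 1 << rng.randrange(8)     # flip one bit in place
        elif op == "overwrite":
            out[i] = rng.randrange(256)         # overwrite one byte
        else:
            out.insert(i, rng.randrange(256))   # insert a random byte
    return out
```

The deterministic operator is position-exhaustive and reversible, while havoc stacks destructive edits, which matches the "slight" versus "significant" destructiveness distinction made in the text.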

#### **Algorithm 1:** ORIGINALAFL



Note that AFL is unaware of the structure of inputs. For example, it is possible that an MP3 file is generated from a PDF file because the magic number is changed by AFL. It is inefficient to test a PDF reader with an MP3 file, since the execution will presumably terminate early: the PDF parser does not accept non-PDF files, so the major components are not tested. Our implementation of REFUZZ tackles this problem by adding an extra validity check on newly generated test inputs, as discussed in Section 3.

#### *2.2. Motivating Example*

Figure 1a shows a code snippet derived from the pdffonts program, which analyzes and lists the fonts used in a Portable Document Format (PDF) file. Class Dict, defined at lines 1–10, stores an array of entries. Developers can call the find function, defined at line 12, to retrieve the corresponding entry by its key. In the experiments, we test this program by running both AFL and REFUZZ with AddressSanitizer [16] to detect memory errors. Figure 1b shows the crashing trace caused by a heap buffer overflow error found only by REFUZZ. The crash is caused by accessing the entries array during the iteration at lines 14–17 in Figure 1a. The root cause of this error is inappropriate destruction of the dictionary in the XRef and Object classes when pdffonts attempts to reconstruct the *cross-reference table* (xref for short, which internally uses a dictionary) for locating objects in the PDF file, e.g., bookmarks and annotations. The crash is triggered when the xref table of the test input is mostly valid (including the most important entries, such as "Root", "size", "Info", and "ID") but cannot pass the extra check that investigates whether the PDF file is encrypted. When the program issues a search for key "Encrypt", the dictionary has already been destructed by a previous query that checks the validity of the xref table. A correct implementation should make a copy of the dictionary after the initial check.

```
1 class Dict {
2 public :
3 ...
4 private :
5 XRef * xref ; // the xref table for this PDF file
6 DictEntry * entries ; // array of entries
7 int length ; // number of entries in dictionary
8 ...
9 DictEntry * find ( char * key );
10 };
11 ...
12 inline DictEntry * Dict :: find ( char * key ){
13 int i;
14 for (i = 0; i < length ; ++ i) {
15 if (! strcmp ( key , entries [i ]. key ))
16 return & entries [i ];
17 }
18 return NULL ;
19 }
20 ...
```
(**a**) Code derived from pdffonts


(**b**) The crashing trace caused by a heap buffer overflow

**Figure 1.** The motivating example.

It is relatively expensive to find this vulnerability using AFL, compared to REFUZZ. In our experiments, running AFL for 80 h failed to trigger this bug, even with the help of the AddressSanitizer tool. The major reason is that the check for validity of the xref table and the check for encryption of the PDF file are the first steps when pdffonts parses an arbitrary file; that is, they are presumably regarded as "old" paths in most cases. When using AFL, if a test input does not cover a new execution path, the chance of mutating this input is low. In other words, the execution path covered by the input is less likely to be covered again (or is covered, but by less "interesting" inputs), and the examination of the two checks might not be enough to reveal subtle bugs, such as the one in Figure 1b.

To tackle this problem, REFUZZ does not aim at high code coverage. On the contrary, we want to detect new vulnerabilities residing in covered paths and to verify that AFL ignores possible crashes in such paths while paying attention to coverage. REFUZZ utilizes the corpus obtained in the initial stage (which runs the original AFL) as the seeds for the exploration stage. It only generates test inputs that linger on the execution paths that are covered in the first stage but not investigated sufficiently. In the next section, we provide more details about the design of REFUZZ.

#### **3. Design of REFUZZ**

#### *3.1. Overview*

We propose REFUZZ to further test the program under test with inputs generated by AFL, in order to trigger unique crashes that AFL missed. REFUZZ consists of two stages, i.e., the *initial* stage and the *exploration* stage. In the initial stage, the original AFL is applied, with the initial seed inputs provided by the user. The output is an updated seed queue containing both the seed inputs and the test inputs that covered new execution paths during fuzzing. In the exploration stage, REFUZZ uses this queue as the initial seed input and applies a novel mutation strategy, designed for investigating previously executed paths, to generate new test inputs. Moreover, only inputs that pass the extra format check are added to the seed queue and participate in subsequent mutations and testing. Figure 2 shows the workflow of REFUZZ.

**Figure 2.** REFUZZ overview.

Algorithm 2 depicts the algorithmic sketch of REFUZZ. (Our implementation skips duplicate deterministic mutations of inputs in the MUTATE function.) The highlighted lines are new compared to the original AFL algorithm. The REFUZZ algorithm takes two additional parameters besides *P* and *initSeeds*: *et*, the time allowed for the initial stage, and *ct*, the time limit for performing deterministic mutations. We discuss *ct* in the next subsection. At line 6 in Algorithm 2, when the elapsed time is less than *et*, REFUZZ is in the initial stage, and the original AFL algorithm is applied. When the elapsed time is greater than or equal to *et* (lines 8–24), the testing enters the exploration stage. In this stage, REFUZZ uses the input corpus queue obtained in the initial stage and applies a novel mutation strategy to generate new test inputs. If a new input passes the format check, it is fed to the target program. Inputs that preserve the code coverage (i.e., do not trigger new paths) are added to the queue. In the experiments, we set *et* to various values to evaluate the effectiveness of REFUZZ under different settings.

#### *3.2. Mutation Strategy in Exploration Stage*

REFUZZ adopts the same set of mutation operators as AFL, including bitflip, arithmetic, value overwrite, injection of dictionary terms, havoc, and splice. The first four methods are *deterministic* because of their slight destructiveness to the seed inputs. The latter two significantly damage the structure of an input and are *totally random*. To facilitate the discovery of crashes, as shown in Algorithm 2, we introduce a parameter *ct* that limits the time since the last crash for deterministic mutations during the fuzzing process. If an input is undergoing deterministic mutation operations and no new crash has been found for a long time (>*ct*), REFUZZ skips the current mutation operation and performs the next random mutation (line 11 of Algorithm 2). In the experiments, we initialize *ct* to 60 min and increase it for each deterministic mutation. Specifically, the *n*-th deterministic mutation is skipped if no crash has been triggered in the past *n* hours by mutating an input. REFUZZ then tries other, more destructive mutations to improve the efficiency of fuzzing.
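The incremental *ct*-based skipping rule can be sketched as follows; the function name and parameterization are illustrative assumptions, with the 60-min base interval taken from the text:

```python
import time

def should_skip_deterministic(n, last_crash_time, now=None, base_hours=1.0):
    """Decide whether to skip the n-th deterministic mutation stage.

    The stage is skipped when no crash has been triggered in the past
    n * base_hours hours, i.e., ct grows with each deterministic stage
    (ct starts at 60 min per the paper's setup).
    """
    now = time.time() if now is None else now
    return (now - last_crash_time) > n * base_hours * 3600
```

With `base_hours=1.0`, the second deterministic stage is skipped once two crash-free hours have elapsed, the third after three hours, and so on, letting REFUZZ fall through to the more destructive random mutations.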

As introduced in Section 1, REFUZZ does not aim at high code coverage. Instead, it generates inputs that converge to the existing execution paths. During the initial stage, AFL saves the test inputs that have explored new execution paths in the input queue. An execution path consists of a series of *tuples*, where each tuple records the run-time transition between two basic blocks in the program code. A path is new when the input results in (1) the generation of new tuples or (2) a change in the *hit count* (i.e., the frequency) of an existing tuple. By contrast, the PRESERVECOVERAGE function in Algorithm 2 checks whether new tuples are covered and returns false if that is the case; it returns true if any hit count along a path is updated. We add test inputs that preserve the coverage to the queue so that they participate in the next round of mutation as seeds. With this mutation strategy, REFUZZ can effectively target code areas that have been covered but are not well tested and find vulnerabilities there.
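A minimal Python sketch of this coverage-preservation check, with an execution represented as a dictionary from block-transition tuples to hit counts (a simplification of AFL's shared-memory bitmap):

```python
def preserve_coverage(run_tuples, initial_tuples):
    """Return True iff a run covers no new tuples but updates a hit count.

    run_tuples, initial_tuples : dict mapping a block transition
    (prev_block, cur_block) to its hit count.
    """
    for t in run_tuples:
        if t not in initial_tuples:
            return False  # new tuple: rejected (the opposite of AFL's reward)
    # No new tuples; interesting only if some hit count changed.
    return any(run_tuples[t] != initial_tuples[t] for t in run_tuples)
```

Inputs for which this returns true are re-queued as seeds, concentrating mutation effort on already-covered paths.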


#### *3.3. Input Format Checking*

Blindly feeding random test inputs to the target program leads to low fuzzer performance, since such inputs are likely to fail the initial input validation [8]. For instance, it is better to run an audio processing program with an MP3 file instead of an arbitrary file. Since AFL is unaware of the expected input format of each program under test, the structure of an input is often changed by random mutation operations. We propose to add an extra, lightweight format check before each program run to reduce the unnecessary overhead caused by invalid test inputs. As an example, in the experiments, we check whether each input is a PDF file when testing a PDF reader and discard those that do not conform to the PDF format. Specifically, in our implementation, REFUZZ takes an extra command-line argument indicating the expected format of inputs. For each mutated input, REFUZZ checks the magic number of the input file and only adds it to the queue for further mutation if it passes the check.
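A sketch of such a magic-number check follows. The format table is illustrative rather than REFUZZ's actual implementation: PDF files do begin with `%PDF-`, while the MP3 entry assumes a file carrying an ID3v2 tag.

```python
# Magic-number prefixes for a couple of formats (illustrative subset).
MAGIC = {
    "pdf": b"%PDF-",  # PDF files start with "%PDF-"
    "mp3": b"ID3",    # assumes an MP3 file with an ID3v2 tag
}

def passes_format_check(data: bytes, fmt: str) -> bool:
    """Lightweight format check: compare the input's leading bytes
    against the magic number expected for the target program."""
    return data.startswith(MAGIC[fmt])
```

Mutated inputs failing this check are dropped before execution, avoiding runs that would terminate early in the target's input parser.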

#### **4. Evaluation**

#### *4.1. Experimental Setup*

To empirically evaluate REFUZZ and its performance in finding vulnerabilities, we implement REFUZZ on top of AFL and conduct experiments on an Ubuntu 16.04.6 LTS machine with a 16-core Xeon E7 2.10 GHz CPU and 32 GB RAM, using four programs that were also used by prior related work [7]. Table 1 shows the details of the subjects used in our experiments. Columns "Program" and "Version" show the program names and versions. Columns "#Files" and "#LOC" list the number of files and lines of code in each program, respectively.

**Table 1.** Experimental subjects.


#### *4.2. Vulnerability Discovery*

A crucial factor in evaluating a fuzzer's performance is its ability to detect vulnerabilities. We configure REFUZZ to run three different experiments for 80 h with an identical initial corpus, varying the initial-stage time *et*. Table 2 lists the time allotted to the initial stage and the exploration stage. In the first stage, the original AFL is applied without the additional test input format checking. Then, REFUZZ takes the input queue as the initial corpus for the second stage and uses an extra parameter to pass the expected input type of the target program, e.g., PDF.



During the fuzzing process, the fuzzer records information about each program crash along with the input that caused it. To avoid duplicates in the results, we use the *afl-cmin* [17] tool from the AFL toolset to minimize the final reports by eliminating redundant crashes and inputs. Tables 3–5 show the statistics of unique crashes triggered by REFUZZ. Note that the numbers in column "Init+Expl" are not exactly the sum of the numbers in columns "Init" and "Expl", because REFUZZ discovers some duplicate crashes in both the initial stage and the exploration stage. Additionally, the crashes in column "New" are discovered by REFUZZ but not by AFL. After applying afl-cmin, only the unique crashes are reported.

We also run AFL alone for 80 h and report the number of crashes in Table 6. The total in its "Init" column is smaller than the "Init+Expl" totals in Tables 3 and 5, which indicates that REFUZZ finds more unique crashes within the same 80 h. In Table 4, the numbers in column "Init" are much lower than in the other two experimental configurations, so their total also falls below the total number of crashes in Table 6. As described in Table 7, we compare the average and variance of the unique crashes obtained for the four programs under the three experimental configurations. The large values in column "Variance" reflect the inherent randomness of fuzzing.

From Tables 3–5, we can see that new unique crashes are detected during the exploration stage in all three experimental settings, except for pdftopbm, which has 0 new crashes in Table 4. By applying the novel mutation strategy in the exploration stage together with input format checking, REFUZZ discovers 37, 59, and 54 new unique crashes that are not discovered by AFL. These crashes are hard to find when simply aiming for high code coverage, since they reside on already covered paths that have not been exercised with a sufficient variety of inputs; some vulnerabilities are only triggered after feeding many inputs of a specific type.


**Table 3.** Number of unique crashes (60 + 20).

**Table 4.** Number of unique crashes (50 + 30).


**Table 5.** Number of unique crashes (40 + 40).


**Table 6.** Number of unique crashes (80 + 0).


**Table 7.** Average and variance of unique crashes.


Figure 3 shows the proportion of newly discovered unique crashes among all crashes triggered by REFUZZ in the exploration stage. For example, for pdftotext, the number of new unique crashes is greater than half of the total number of unique crashes (in the "40 + 40" setting). By preserving the code coverage and examining covered execution paths further, we can discover a relatively large number of new vulnerabilities that might be neglected by regular CGF such as AFL. Note that this does not mean that AFL and others cannot find such vulnerabilities; it implies that they have a lower chance of finding them within a fixed amount of time, while REFUZZ is more likely to trigger them in the same time budget.

In addition, we set up 12 extra experiments. The corpus obtained by running AFL for 80 h is used as the initial input of the exploration stage; then, the target programs are tested with REFUZZ for 16 h. The purpose is to verify whether REFUZZ can still find new unique crashes once AFL is saturated. The experimental data are recorded in Table 8; the column "Number of experiments" records the number of new unique crashes found by REFUZZ in each of the 12 experiments. The results show that, given the same initial inputs, REFUZZ consistently finds new crashes that AFL does not, even though fuzzing is random.

**Figure 3.** Proportion of newly discovered unique crashes in the exploration stage of REFUZZ.

**Table 8.** Number of new unique crashes (80 + 16).


We have submitted our findings in the target programs to the CVE database. Table 9 summarizes nine new vulnerabilities found by REFUZZ in our experiments. We are analyzing the remaining crashes and will release more details in the future.

#### *4.3. Code Coverage*

As described earlier, the goal of REFUZZ is to test whether new unique crashes can be discovered on covered paths after regular fuzzing within a limited time, instead of aiming at high code coverage. We collected code coverage information during the execution of REFUZZ and found that the coverage of each target program remained the same during the exploration stage, as expected. The results also show that AFL achieved only slightly higher coverage than REFUZZ during the exploration stage, implying that AFL had run into a saturation state; this signals a demand for new strategies to circumvent such scenarios. REFUZZ is one such remedy, and our experimental results show its effectiveness in finding new crashes.

**Table 9.** Submitted vulnerabilities.


#### **5. Discussion**

REFUZZ is effective at finding new unique crashes that are hard to discover with AFL. This is because some execution paths need to be examined multiple times with different inputs before hidden vulnerabilities surface. The coverage-first strategy in AFL and other CGFs tends to abandon already-executed paths, which hinders further investigation of such paths. However, questions such as "when should we stop the initial stage in REFUZZ and enter the exploration stage to start examining these paths" and "how long should we spend in the exploration stage of REFUZZ" remain open.

**How long should the initial stage take?** As described in Section 4, we performed three different experiments with *et* set to 60, 50, and 40 h to gather empirical results. The intuition is that using the original AFL to find bugs should work best when *et* is 60 h, since a longer initial stage is expected to cover more paths and trigger more unique crashes. However, our experimental results in Tables 3–5 show that the fuzzing process is unpredictable: the total number of unique crashes triggered in a 60-h initial stage is close to that of 40 h (308 vs. 303), while the number obtained in 50 h is less than that of 40 h (246 vs. 303). In Algorithm 2, as well as in our implementation of the algorithm, we let the user decide when to stop the initial stage and set *et* based on their experience and experiments. Regarding the appropriate length of the initial stage, we generally suggest that users watch the dynamic data in the fuzzer dashboard. When the code coverage remains stable, the color of the cycle counter (*cycles done*) changes from purple to green, or a long time has passed since the last unique crash (*last uniq crash time*), continuing to test is unlikely to bring new discoveries. The most rigorous method is to combine these pieces of reference information to decide whether the initial stage should be paused.
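Combining these dashboard signals into a stopping decision might look like the following sketch. All names and thresholds (`stable_window`, `crash_timeout`) are hypothetical illustrations of the heuristic, not part of REFUZZ.

```python
# Hypothetical stopping heuristic for the initial stage: end it only when
# coverage is flat, AFL has finished at least one full queue cycle, and no
# unique crash has appeared for a while. Thresholds are illustrative.

def should_stop_initial_stage(coverage_history, cycles_done,
                              last_crash_time, now,
                              stable_window=5, crash_timeout=3600):
    # Signal 1: coverage unchanged over the last few observations.
    recent = coverage_history[-stable_window:]
    coverage_stable = len(recent) == stable_window and len(set(recent)) == 1
    # Signal 2: at least one full queue cycle completed
    # (the dashboard's "cycles done" counter turning green).
    cycled = cycles_done >= 1
    # Signal 3: no unique crash for longer than crash_timeout seconds.
    crash_quiet = (now - last_crash_time) > crash_timeout
    return coverage_stable and cycled and crash_quiet

# Example: coverage flat for 5 samples, 2 cycles done, last crash 2 h ago.
print(should_stop_initial_stage([812] * 5, 2, 0, now=7200))  # True
```

The design point is that no single indicator suffices: coverage can plateau while crashes still trickle in, so the conjunction of all three signals is what triggers the stage switch.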

**How long should the exploration stage take?** We conducted an extra experiment using REFUZZ with the corpus obtained from the 80-h run of AFL. We ran REFUZZ for 16 h and recorded the number of unique crashes per hour; each program was executed with REFUZZ for 12 trials. The raw results are shown in Figure 4, and the means over the 12 trials are shown in Figure 5. In both figures, the *x*-axes show the number of bugs (i.e., unique crashes) and the *y*-axes show the execution time in hours. Given a fixed corpus of seed inputs, the performance of REFUZZ in the exploration stage varies considerably across the 12 trials, owing to the nature of random mutation. Overall, the figures show that, in the exploration stage, REFUZZ follows the empirical rule that finding each new vulnerability requires exponentially more time [6]. However, this does not negate the effectiveness of REFUZZ in finding new crashes. We suggest terminating the remedial testing when the exploration reaches saturation; the guidelines given above for the initial stage also apply here.

**Is REFUZZ effective as a remedy for CGF?** Many researchers have proposed remedial measures for CGFs. Driller [18] combines fuzzing and symbolic execution: when the fuzzer becomes stuck, symbolic execution computes valid inputs to explore deeper bugs. T-Fuzz [19] detects when a baseline mutational fuzzer becomes stuck and no longer produces inputs that extend the coverage; it then produces inputs that trigger deep program paths and, therefore, finds hidden vulnerabilities in the program. Saturation arises mainly because AFL and other CGFs rely strongly on random mutation to generate new inputs that reach more execution paths. Our experimental results suggest that new unique crashes can indeed be discovered if we set code coverage aside and continue to examine already covered execution paths with further mutations (as shown in Tables 3–5). They also show that it is feasible and effective to use our approach as a remedy and an extension to AFL, and it can easily be applied to other existing CGFs. While this conclusion may not hold for programs we did not use in the experiments, our evaluation shows the potential of remedial testing based on the re-evaluation of covered paths.

**Figure 4.** Number of bugs and execution time in exploration stage.

**Figure 5.** Average number of bugs and execution time in exploration stage.

#### **6. Related Work**

Mutation-based fuzzers continuously mutate the test cases in the corpus during the fuzzing process, using actual inputs, and continuously feed the target program; code coverage is the key metric for measuring fuzzer performance. AFL [3] uses compile-time instrumentation and genetic algorithms to find interesting test cases that reach new edge coverage. VUzzer [20] uses an "intelligent" mutation strategy based on data flow and control flow to generate high-quality inputs, feeding results back to optimize the input generation process; experiments show that it effectively speeds up bug mining and increases its depth. FairFuzz [21] increases the coverage of AFL by identifying branches exercised by only a small number of AFL-generated inputs (rare branches) and by using a mutation-mask creation algorithm to bias mutations toward inputs that hit those branches. AFLFast [12] proposes a strategy that gears AFL toward low-frequency paths, giving them more opportunities, which effectively increases the coverage of AFL. LibFuzzer [4] uses SanitizerCoverage [22] to track basic-block coverage information in order to generate more test cases that cover new basic blocks. Sun et al. [23] proposed using the ant colony algorithm to control seed-input screening in greybox fuzzing; by estimating the transition rate between basic blocks, it determines which seed inputs are more worth mutating. PerfFuzz [24] generates inputs through feedback-oriented mutational fuzzing, finds inputs with different hot spots in the program, and escapes local maxima to obtain inputs with longer execution paths. SPFuzz [25] implements three mutation strategies, namely head, content, and sequence mutation.
These strategies cover more paths by driving the fuzzing process and provide a method of randomly assigning weights through messages and strategies. By continuously updating and improving their mutation strategies, the above works effectively improve the efficiency of fuzzing. In our implementation, if no new crashes appear for a long time (>*ct*) while deterministic mutation operations are in progress, REFUZZ either performs the next deterministic mutation or enters the random mutation stage directly, which reduces unnecessary time consumption to a certain extent.
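The time-based skip just described can be sketched as a small stage scheduler. The deterministic stage names follow AFL's terminology (bit flips, arithmetic, interesting values, dictionary, then the random "havoc" stage), but the scheduler itself and the default `ct` value are illustrative assumptions.

```python
# Illustrative sketch of the ct-based skip: when deterministic mutation has
# produced no new crash for longer than ct seconds, jump ahead to the next
# deterministic operator, or straight to the random (havoc) stage.
# Stage list and default ct are assumptions, not REFUZZ's exact values.

DETERMINISTIC_STAGES = ["bitflip", "arith", "interest", "dictionary"]

def next_stage(current, seconds_since_new_crash, ct=1800):
    """Return the mutation stage to run next for the current seed."""
    if current not in DETERMINISTIC_STAGES:
        return "havoc"                        # already in random mutation
    if seconds_since_new_crash <= ct:
        return current                        # stage is still productive
    idx = DETERMINISTIC_STAGES.index(current)
    if idx + 1 < len(DETERMINISTIC_STAGES):
        return DETERMINISTIC_STAGES[idx + 1]  # skip to next deterministic op
    return "havoc"                            # exhausted: go random

print(next_stage("bitflip", 100))      # bitflip: keep going
print(next_stage("bitflip", 3600))     # arith: stuck, skip ahead
print(next_stage("dictionary", 3600))  # havoc: enter random mutation
```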

Generation-based fuzzers require a good understanding of the file format and interface specification of the target program: by modeling the file format and interface specification, the fuzzer generates test cases according to the model. Dam et al. [26] established a Long Short-Term Memory model based on deep learning that automatically learns semantic and grammatical features in code, and showed that its predictive ability is better than that of state-of-the-art vulnerability prediction models. Reddy et al. [27] proposed a reinforcement learning method to solve the diversification guidance problem and used state-of-the-art testing tools to evaluate the ability of RLCheck. Godefroid et al. [28] proposed a neural-network-based machine learning technique to automatically generate grammatically valid test cases. AFL++ [29] provides a variety of novel features that extend the fuzzing process over multiple stages; experienced security testers can also use it to write variants for specific targets. Fioraldi et al. [30] proposed a new technique that automatically generates and mutates inputs for binary formats with unknown basic blocks, enabling inputs to satisfy the characteristics of certain formats during the initial analysis phase and to reach deeper paths. You et al. [31] proposed a new fuzzing technique that, building on AFL, generates effective seed inputs by detecting input validity checks and recording the inputs corresponding to each type of check. PMFuzz [32] automatically generates high-value test cases to detect crash consistency bugs in persistent memory (PM) programs. These efforts use syntactic or semantic learning techniques to generate legitimate inputs.
In contrast, our work is not limited to using input format checking to screen legitimate inputs during testing; by using the corpus obtained from the AFL run as the initial corpus of the exploration phase, we obtain high coverage in a short time. Symbolic execution is an extremely effective software testing method for generating inputs [33–35]: it can analyze a program to obtain inputs that execute a specific code area. In other words, a symbolic executor runs the program on symbolic values as input instead of the concrete values used in normal execution. Symbolic execution is a heavyweight testing method, because analyzing the possible inputs of a program requires access to the target source code. SAFL [36] is augmented with qualified seed generation and efficient coverage-directed mutation: symbolic execution is used in a lightweight fashion to generate qualified initial seeds, and valuable exploration directions are learned from the seeds so that deep paths in the program state space are reached earlier and more easily. However, for large software projects, analyzing the target source code takes a lot of time. Since REFUZZ is a lightweight extension of AFL, in order to repeatedly reach existing execution paths, we choose to add test inputs that fail to generate a new path to the execution corpus so that they participate in subsequent mutations.

#### **7. Conclusions**

This paper designs and implements REFUZZ, a remedy for saturation during greybox fuzzing. Using the corpus of the initial stage as the seed test inputs of the exploration stage, REFUZZ can extensively explore the same set of execution paths to find new unique crashes along those paths within a limited time. AFL feeds mutated inputs directly into the target program, so many non-compliant seeds are unable to explore deeper paths. In this paper, we proposed an input format checking algorithm that filters out files that do not conform to the expected input format, which helps increase the coverage depth along execution paths. At the same time, the mutation strategy we proposed transitions to the random mutation stage when the deterministic mutation stage is stuck, which significantly accelerates the testing efficiency of fuzzing. We evaluated REFUZZ using programs from prior related work. The experimental results show that REFUZZ finds new unique crashes that account for a large portion of the total unique crashes. Specifically, we discovered nine new vulnerabilities in the experimental subjects and submitted them to the CVE database. We are in the process of analyzing and reporting more bugs to the developers.

In the future, in order to make our prototype tool more useful in the real world, we will study how to combine machine learning to improve the efficiency of input format checking and design more sophisticated automatic saturation-detection strategies to strengthen the tool's applicability. We will continue to improve REFUZZ to increase the efficiency of fuzzers in the saturation state, using parallel mode and deep reinforcement learning. We also plan to develop more interfaces and drivers to explore vulnerabilities of IoT terminals, for the enhanced security of critical infrastructures.

**Author Contributions:** Conceptualization, D.Z. and H.Z.; methodology, D.Z.; software, Q.L.; validation, R.D.; writing—original draft preparation, Q.L. and H.Z.; writing—review and editing, Q.L.; supervision, D.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Fundamental Research Funds for the Central Universities through project No. 2021QY010.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

