**3. Proposed Model**

In this section, we discuss the proposed hybrid DL model in terms of its structure, the selected DL algorithms, and relevant theoretical concepts. The selected DL models (CNN and QRNN) can classify threat types in real time while maintaining a low FPR. The architecture of the proposed model is presented in Figure 1.

**Figure 1.** The architecture of the proposed hybrid model.

A CNN is an extension of a neural network [39] and it is effective at extracting features at a low level from the source data, especially spatial features [40].

CNNs are widely used in image processing due to their ability to automate feature extraction [41]. Additionally, CNNs have demonstrated their effectiveness in many fields, such as biomedical text analysis and malware classification [30]. Based on the shape of the input data, CNNs can be classified into different types, including the two-dimensional (2D) CNN, which operates on data such as images, and the one-dimensional (1D) CNN, which operates on data such as text. A CNN consists of a convolution layer, a pooling layer, a fully connected (FC) layer, and an activation function [42]. The convolution layer is the fundamental building block of a CNN; it takes two sets of information as inputs and performs a mathematical operation on them. The two sets of information are the data and a filter, which is also referred to as a kernel. The filter is applied across the entire dataset to produce a feature map [41]. Each CNN filter extracts a set of features that are aggregated into a new feature map as output [30]. The pooling layer reduces the dimensions of the feature map and removes irrelevant data to improve learning [20]. The output of the pooling layer is fed into the FC layer to classify the data [43].
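The 1D convolution and max-pooling operations described above can be sketched in a few lines. This is a minimal NumPy illustration with a toy input sequence and filter, not the implementation used in the proposed model:

```python
import numpy as np

def conv1d(x, kernel):
    """Slide the filter (kernel) over the 1-D input and sum the
    element-wise products at each position (valid padding, stride 1),
    producing a feature map."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    """Keep only the maximum value in each non-overlapping window,
    reducing the feature-map dimension."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

x = np.array([1., 2., 3., 4., 5., 6.])   # toy input sequence
kernel = np.array([1., 0., -1.])         # toy filter (kernel)
fmap = conv1d(x, kernel)                 # feature map: [-2, -2, -2, -2]
pooled = max_pool1d(fmap, 2)             # reduced map:  [-2, -2]
```

In a trained CNN the kernel values are learned parameters; here they are fixed only to make the arithmetic easy to follow.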

The LSTM-RNN is one of the most powerful neural network models used in cyber security due to its ability to accurately model temporal sequences and their long-term dependencies [44]. However, the LSTM usually requires a long training time and has a high computation cost [45]. The QRNN model [23] was designed to overcome the RNN's limitation that the computation at each timestep depends on the previous timestep, which limits parallelism. The QRNN combines the benefits of the CNN and the RNN by applying convolutional filters to the input data while allowing long-term sequence dependencies to store information from previous timesteps [23]. The computation structure of the QRNN is presented in Figure 2. The QRNN consists of convolutional layers and a recurrent pooling function, which allow it to run up to 16 times faster than the LSTM while achieving the same accuracy [46]. The convolutional and pooling layers allow for parallel computation across the batch and feature dimensions [23]. The QRNN has been used in different applications such as video classification [45], speech synthesis [46], and natural language processing [47].
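The only sequential part of the QRNN is a lightweight element-wise pooling recurrence (the fo-pooling of [23]); the gates themselves are produced by convolutions and can therefore be computed in parallel across timesteps. A minimal NumPy sketch, with toy gate activations standing in for the convolution outputs:

```python
import numpy as np

def fo_pool(z, f, o):
    """QRNN fo-pooling. The candidate (z), forget (f), and output (o)
    gates are precomputed in parallel by convolutions; only this
    element-wise recurrence over timesteps runs sequentially:
        c_t = f_t * c_{t-1} + (1 - f_t) * z_t
        h_t = o_t * c_t
    """
    c = np.zeros_like(z[0])
    hs = []
    for z_t, f_t, o_t in zip(z, f, o):
        c = f_t * c + (1.0 - f_t) * z_t
        hs.append(o_t * c)
    return np.stack(hs)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy pre-activations for 3 timesteps, hidden size 2
z = np.tanh(np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]))
f = sigmoid(np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]))
o = sigmoid(np.array([[0.2, 0.2], [0.2, 0.2], [0.2, 0.2]]))
h = fo_pool(z, f, o)   # hidden states, shape (3, 2)
```

Because the recurrence involves only element-wise multiplications and additions (no matrix products per timestep), it is far cheaper than the LSTM's recurrent step, which is the source of the speedup cited above.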

**Figure 2.** The computation structure of the QRNN.

Our hybrid DL model consists of a 1D convolutional layer, a 1D max-pooling layer, a QRNN, and FC layers. The first 1D convolutional layer extracts the spatial features and produces a feature map that is processed by the activation function. The Rectified Linear Unit (ReLU) activation function is used in the convolutional layers because of its rapid convergence under gradient descent, which makes it a good choice for our proposed model [41]. The feature map is then processed by the second layer, which applies the max-pooling operation. Max-pooling selects the maximum value within each pooling window [41]; the pooling layer thus reduces dimensionality and removes irrelevant features. The output of the CNN layers retains the temporal features, which are then extracted by the QRNN model. Figure 3 provides details of our proposed model and shows that we used two QRNN layers to extract the temporal features. In the two QRNN layers, the hidden size represents both the number of hidden units and the output dimension; the number of hidden units can be selected based on the number of features [45]. One of the problems of a neural network is overfitting, which means that the model learns the training data too well and consequently cannot identify variants in new data [22]. We added a dropout layer to prevent overfitting.
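To illustrate how a dropout layer counters overfitting, here is a minimal NumPy sketch of inverted dropout; the rate of 0.5 is an arbitrary example, not the value used in the proposed model:

```python
import numpy as np

def dropout(x, rate, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: randomly zero a fraction `rate` of activations
    during training and rescale the survivors by 1/(1 - rate), so the
    expected activation is unchanged and no rescaling is needed at
    inference time (training=False returns x unchanged)."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((2, 8))
y = dropout(x, rate=0.5)   # roughly half the units zeroed, the rest scaled to 2.0
```

Randomly silencing units each step prevents the network from co-adapting to specific training examples, which is why dropout reduces overfitting.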

**Figure 3.** Illustration of the details of the proposed model.

Then, a 1D convolutional layer and a max-pooling layer are used to extract further spatial-temporal features. The output of the CNN model is passed to the Flatten layer, which transforms the output of the pooling layer into a single vector that serves as the input for the next layer [48]. Finally, the dense layer, which is also a fully connected layer, with the SoftMax activation function is used to classify the threats by calculating the probability of each class [34].
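The final Flatten and SoftMax steps can be sketched as follows; the feature-map shape, the weights, and the five-class output are hypothetical placeholders, not values from the proposed model:

```python
import numpy as np

def softmax(logits):
    """Convert dense-layer outputs into class probabilities."""
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

# hypothetical pooled feature map (4 channels x 3 positions) from the CNN
pooled = np.arange(12.0).reshape(4, 3)
flat = pooled.flatten()                 # Flatten layer: one vector of length 12

# hypothetical dense (fully connected) layer for 5 threat classes
W = np.full((5, 12), 0.01)              # learned weights in a real model
b = np.zeros(5)                         # learned biases in a real model
probs = softmax(W @ flat + b)           # class probabilities; they sum to 1
pred = int(np.argmax(probs))            # predicted threat class index
```

The class with the highest probability is reported as the detected threat type.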
