1. Introduction
With the rapid development of Internet technology, the number of Internet pages has grown almost exponentially in recent years. Although the Internet brings us convenience, it also provides opportunities for criminals, who can maliciously install computer viruses and junk software and carry out other attacks in order to steal user identity information or commit online fraud. To effectively reduce such illegal behavior on the Internet, many researchers have conducted in-depth research on malicious URL detection technology.
In the past, the common method for diagnosing and defending against malicious URL attacks was the blacklisting technique [1], which compiles the key information of known malicious URLs into a list. By consulting this list, confirmed malicious URLs can be identified accurately. For example, Prakash et al. [1] proposed identifying and discovering new phishing pages through URL decomposition and similarity calculations, extending the scope of blacklisting and helping to identify some malicious pages that do not appear in the blacklist. The blacklist mechanisms of browsers, such as Google Safe Browsing, are similar methods [2]. Lin Hai-lun et al. [3] presented an efficient method for detecting malicious URLs based on segment patterns, which showed good performance and scalability. Because blacklists are constructed manually, they require considerable human resources for subsequent maintenance. Moreover, as malicious URL attacks are continuously updated, detection limited to the blacklisting technique can no longer meet the needs of network attack defense in today's society. Therefore, researchers began to seek more efficient defense technologies. Later, honeypot technology emerged and, for different types of attacks, gradually gave rise to honeynets, distributed honeypots, honeyfarms, and other technologies [4,5,6]. For example, Ref. [7] studied a network intrusion detection system (NIDS) that used snooping agents on the Web together with honeypot technology to prevent the activities of intruders. However, according to the research of Zhuge Jianwei et al. [8], the honeypot environment cannot resolve the contradiction between simulation and controllability, and honeypot technology is mainly suited to large-scale common security threats; its defense capability against some special threats remains far from sufficient.
With the wave of research based on machine learning, the application of machine learning to malicious URL detection has attracted considerable attention from researchers. For example, Vanhoenshoven et al. [9] used a multilayer perceptron (MLP) to detect malicious URLs and found that, for the same data set, different feature sets may yield different detection results. Arivazhagi et al. [10] presented an efficient unsupervised feature construction method based on the linear support vector machine model, and their results proved its effectiveness. Abed Sa'ed et al. [11] proposed a machine learning model combining an autoencoder with a one-class support vector machine: the dimensionality of the input data was reduced by the autoencoder, and the network events were classified by the one-class support vector machine. Azeez et al. [12] used a naive Bayes algorithm to detect malicious URLs based on the syntax, vocabulary, host, and other content of URLs embedded in emails. Laughter et al. [13] integrated the HTTP request features generated while visiting a website into the detection feature set; by extracting the content of each field in the request header and request body, classification methods such as decision trees and SVM were used to complete the study. Because traditional machine learning methods require complex feature selection, they usually suffer from poor scalability: a feature or class of features extracted for one type of malicious web page recognition problem may perform well there but degrade on other types of malicious web page recognition problems.
Deep learning is an important advance over traditional machine learning. In recent years, the unique advantages of deep learning in natural language processing (NLP), speech recognition, image recognition, and other fields have also brought new developments to the detection of malicious web pages. For example, Zhang et al. [14] presented an automatic URL feature extraction method based on a convolutional neural network and verified the extracted features by combining random forests, support vector machines (SVM), and other classification methods, which proved the effectiveness of the deep learning method for URL feature extraction. However, convolutional neural networks can only learn the local features of URLs and cannot learn their context information. J.J. Christy Eunaicy et al. [15] detected malicious URLs with artificial neural network (ANN), convolutional neural network (CNN), and recurrent neural network (RNN) models; in their results, the RNN model performed best, with an accuracy of 94%. However, RNNs easily suffer from vanishing or exploding gradients, which affects the accuracy of malicious URL detection and classification. Afzal et al. [16] proposed a hybrid deep learning method called URLdeepDetect for click-time URL analysis and classification to detect malicious URLs. Das et al. [17] used a character-level embedding to represent URLs and designed CNN and LSTM models to detect malicious URLs; however, the accuracy of the model was only 93.59%, which is not sufficient for detecting malicious URLs well. Cui et al. [18] combined the HTTP parameters generated by user requests with the corresponding URLs and compared six vectorization methods in the data preprocessing stage.
Deep neural networks imitate the mechanism of the human brain to interpret data; the brain does not simply process incoming information all at once, but rather builds its perception of a scene through attention, expectation, and prior knowledge, extracting the characteristics of interest to understand the information. Therefore, researchers proposed the attention mechanism [19], which obtains more effective information by giving higher weights to the parts of interest in deep learning and has achieved good results in machine translation, fake news detection [20], and other tasks. The use of attention here is motivated by the fact that deep neural networks [15] do not distinguish between the pieces of information in a URL when identifying malicious URLs, so key information is not fully utilized. The attention mechanism can effectively weight the informative features and improve the efficiency of deep learning so as to obtain better recognition performance for malicious URLs.
Based on the above analysis, this paper proposes a malicious URL detection method based on a bidirectional GRU (BiGRU) and an attention mechanism. The method applies Word2Vec to train the word vectors of URLs, uses BiGRU to extract the full sequence information of URLs and learn the relationships between sequences, and introduces an attention mechanism to strengthen the model's learning of useful information. In addition, to prevent overfitting during model training, a dropout mechanism is added to the input layer, improving the accuracy of malicious URL detection and classification.
3. Model Structure
URLs themselves have a strong sequential order, and the GRU is a model that can process sequence data. At the same time, BiGRU can deeply mine and make full use of the relevant information in the URL data. Therefore, this paper uses BiGRU as the basic model structure. To prevent the neural network model from overfitting during training, that is, performing well with high accuracy on the training set but poorly with low accuracy on the test set, a dropout layer is added before the URL data enter the model. This avoids, to a certain extent, the excessive influence of some weights on the network model and reduces the model's deviation.
URLs are not complicated, but they do impose certain requirements on sequence relations. Therefore, this paper first uses one BiGRU layer with 128 neural nodes to learn the characteristics of URLs. The learned information is then output to the attention layer so that key information is used more fully. Finally, a fully connected layer with the tanh activation function is added, and the softmax function is used for the final classification. Thus, a dropout–attention bidirectional gated recurrent unit (DA-BiGRU) model is formed. The DA-BiGRU model structure is shown in Figure 7.
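For concreteness, the following is a minimal Keras sketch of an architecture of this kind. It follows the description above (dropout on the input, one BiGRU layer with 128 units, an attention layer, a tanh fully connected layer, and a softmax output), but the dropout rate, the attention formulation, and the number of output classes are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_da_bigru(seq_len=30, embed_dim=50, num_classes=2, dropout_rate=0.2):
    """Sketch of a dropout-attention BiGRU (DA-BiGRU) classifier."""
    inputs = layers.Input(shape=(seq_len, embed_dim))           # [batch, 30, 50] word vectors
    x = layers.Dropout(dropout_rate)(inputs)                    # dropout applied to the input
    h = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)  # BiGRU, 128 units

    # Simple additive attention: score each time step, normalize, take the weighted sum.
    scores = layers.Dense(1, activation="tanh")(h)              # [batch, 30, 1]
    weights = layers.Softmax(axis=1)(scores)                    # attention weights over time steps
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])

    fc = layers.Dense(128, activation="tanh")(context)          # fully connected layer, tanh
    outputs = layers.Dense(num_classes, activation="softmax")(fc)
    return models.Model(inputs, outputs)

model = build_da_bigru()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```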
This method first preprocesses the URL data set: the URLs are segmented, and Word2Vec is used to construct word vectors, converting the data set into numerical vectors. Secondly, the preprocessed data set is converted into the data format required by the model through the embedding layer. Then, the proposed DA-BiGRU model is trained. Finally, the best-performing model is used to perform the malicious URL detection task and obtain the classification results. The calculation equations are shown below.
The input $x_t$ at time $t$ becomes $x_t'$ through the dropout mechanism:
$$x_t' = \mathrm{Dropout}(x_t)$$
The hidden state outputs of the forward GRU and the backward GRU at time $t$ can be defined as:
$$\overrightarrow{h_t} = \mathrm{GRU}\left(x_t', \overrightarrow{h_{t-1}}\right), \qquad \overleftarrow{h_t} = \mathrm{GRU}\left(x_t', \overleftarrow{h_{t-1}}\right)$$
where $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ denote the hidden state outputs of the forward GRU and the backward GRU at time $t-1$, respectively.
The hidden state output of the BiGRU at time $t$ is the weighted sum of the forward and backward states:
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$$
where $w_t$ is the weight of $\overrightarrow{h_t}$, $v_t$ is the weight of $\overleftarrow{h_t}$, and $b_t$ is the bias.
The output of the hidden states through the attention mechanism, $s$, can be expressed as:
$$u_t = \tanh\left(W_u h_t + b_u\right), \qquad \alpha_t = \frac{\exp\left(u_t^{\top} u_w\right)}{\sum_{k} \exp\left(u_k^{\top} u_w\right)}, \qquad s = \sum_{t} \alpha_t h_t$$
After adding the fully connected layer, the output $o$ is calculated from $s$ and the parameters $W_c$ and $b_c$:
$$o = \tanh\left(W_c s + b_c\right)$$
Finally, entering the softmax layer, the classification result $\hat{y}$ is obtained by:
$$\hat{y} = \mathrm{softmax}\left(W_s o + b_s\right)$$
where $W_s$ and $b_s$ represent the weight and bias of the softmax layer, respectively.
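As a concrete numerical illustration of these equations (not the authors' code), the following NumPy sketch computes the weighted BiGRU combination, the attention weighting, the fully connected layer, and the softmax output for randomly initialized toy parameters; the hidden size, the scalar combination weights, and the attention formulation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 30, 256                                     # assumed sequence length and hidden size
h_fwd = rng.standard_normal((T, d))                # forward GRU hidden states, one row per step
h_bwd = rng.standard_normal((T, d))                # backward GRU hidden states

# Weighted sum of forward and backward states: h_t = w_t * h_fwd_t + v_t * h_bwd_t + b_t
w, v, b = 0.5, 0.5, 0.0
h = w * h_fwd + v * h_bwd + b                      # [T, d]

# Additive attention: score each time step, normalize with softmax, take the weighted sum.
W_u, b_u = rng.standard_normal((d, d)) * 0.01, np.zeros(d)
u_w = rng.standard_normal(d) * 0.01
u = np.tanh(h @ W_u + b_u)                         # [T, d]
alpha = np.exp(u @ u_w - (u @ u_w).max())
alpha /= alpha.sum()                               # attention weights sum to 1
s = (alpha[:, None] * h).sum(axis=0)               # context vector, [d]

# Fully connected layer with tanh, then softmax over the classes.
W_c, b_c = rng.standard_normal((d, d)) * 0.01, np.zeros(d)
o = np.tanh(s @ W_c + b_c)
W_s, b_s = rng.standard_normal((d, 2)) * 0.01, np.zeros(2)
logits = o @ W_s + b_s
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                               # predicted class probabilities
print(y_hat)
```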
4. Experimental Results and Analysis
The flowchart of this experiment is shown in Figure 8. First, the URL data are preprocessed; that is, the data are segmented, and Word2Vec is then used to obtain the word vectors. Secondly, the data output by the embedding layer are input into the DA-BiGRU model for learning. Finally, the URLs are classified with the trained DA-BiGRU model.
4.1. Data Set
The experimental data set used in this paper is a collection of malicious and phishing URLs (https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset) from the Kaggle community, collected by Manu Siddhartha across multiple communities. In this experiment, 65,536 benign URLs and 65,536 malicious URLs were randomly selected, for a total of 131,072 URLs. After randomly shuffling the order, this paper divides the URL data into a training set, a validation set, and a test set. The specific sample sizes are shown in Table 2.
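A minimal sketch of the balancing, shuffling, and splitting step is shown below, using pandas and scikit-learn. The file name, column names, and the split ratio are assumptions (Table 2 gives only the sample counts), so they should be checked against the downloaded Kaggle file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file/column names for the Kaggle "malicious URLs" CSV ("url", "type").
df = pd.read_csv("malicious_phish.csv")
df["label"] = (df["type"] != "benign").astype(int)   # 1 = malicious, 0 = benign

# Balance the classes as in the paper: 65,536 benign and 65,536 malicious URLs.
benign = df[df["label"] == 0].sample(65536, random_state=42)
malicious = df[df["label"] == 1].sample(65536, random_state=42)
data = pd.concat([benign, malicious]).sample(frac=1, random_state=42)  # shuffle

# Assumed 8:1:1 split into training, validation, and test sets.
train, rest = train_test_split(data, test_size=0.2, random_state=42, stratify=data["label"])
val, test = train_test_split(rest, test_size=0.5, random_state=42, stratify=rest["label"])
print(len(train), len(val), len(test))
```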
4.2. Data Preprocessing
The data preprocessing is divided into four parts: data cleaning, word segmentation, vocabulary creation, and length truncation. The data preprocessing flowchart is shown in Figure 9.
Data cleaning: because the URL protocol is generally http:// or https://, this part of the content has little effect on the identification of malicious URLs. Therefore, this paper removes the http:// and https:// prefixes from the URLs in the experimental data set to eliminate redundant information and avoid wasting features.
Word segmentation: the cleaned data are segmented with regular expressions. Because special symbols in URLs carry important feature information, the special symbols that frequently appear in URLs were counted. The statistics show that the following 14 special symbols appear most frequently: '-', '_', '.', '=', '/', '?', '&', '#', '<', '>', '(', ')', '+', and '@'. Accordingly, this paper defines the word segmentation rules according to these 14 symbols. A sketch of such a tokenizer is given below.
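The following is a minimal sketch of a regular-expression tokenizer consistent with the description above; keeping the 14 symbols as separate tokens (rather than discarding them) is an assumption based on the statement that the special symbols carry feature information.

```python
import re

# Split on the 14 delimiter symbols while keeping them as tokens,
# since the special symbols themselves carry feature information.
DELIMS = r"([\-_.=/?&#+<>()@])"

def tokenize_url(url: str) -> list[str]:
    url = re.sub(r"^https?://", "", url)        # data cleaning: drop the protocol prefix
    tokens = re.split(DELIMS, url)
    return [t for t in tokens if t]             # drop empty strings between adjacent symbols

print(tokenize_url("https://example.com/login.php?user=admin&id=1"))
# ['example', '.', 'com', '/', 'login', '.', 'php', '?', 'user', '=', 'admin', '&', 'id', '=', '1']
```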
Creating a vocabulary: although one-hot encoding is easy to construct, it cannot accurately express the relationships between words, and when the number of words is large, one-hot encoding is prone to the curse of dimensionality. Therefore, this paper uses the skip-gram algorithm of Word2Vec to train the vocabulary and obtain 50-dimensional word vectors. A vocabulary example can be seen in Table 3.
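A minimal gensim sketch of this step is shown below; the vector size (50) and the skip-gram setting follow the description above, while the toy corpus and the window, min_count, and epochs values are assumptions.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized URLs; in practice, use the token lists produced by
# the word segmentation step described above.
tokenized_urls = [
    ["example", ".", "com", "/", "login", ".", "php", "?", "user", "=", "admin"],
    ["example", ".", "com", "/", "index", ".", "html"],
    ["phish", ".", "site", "/", "login", ".", "php"],
]

# Skip-gram (sg=1) Word2Vec producing 50-dimensional word vectors.
w2v = Word2Vec(sentences=tokenized_urls, vector_size=50, sg=1,
               window=5, min_count=1, epochs=50, seed=42)

print(w2v.wv["login"].shape)                 # (50,)
print(w2v.wv.most_similar("login", topn=3))  # nearest tokens in the toy vocabulary
```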
Length truncation: Figure 10 shows the length statistics of the URLs in the experimental data set. Figure 10a shows that the lengths of the malicious URLs are concentrated below 25 words, and Figure 10b shows that the lengths of the benign URLs are concentrated below 20 words. Therefore, to ensure the effective use of the data, the cutoff length of the URLs is set to 30 words: URLs shorter than 30 words are padded with 0s, and URLs longer than 30 words are truncated to 30 words. The final format of the input data is [64, 30, 50] (batch size, sequence length, word-vector dimension).
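A minimal sketch of the padding and truncation step is given below using Keras utilities; the example token indices are hypothetical, and padding/truncating at the end of the sequence is an assumption.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical integer-encoded URLs (token indices from the vocabulary).
encoded_urls = [
    [12, 4, 7, 3, 9],                       # short URL -> padded with 0s
    list(range(1, 41)),                     # long URL  -> truncated to 30 tokens
]

# Pad/truncate every URL to 30 tokens.
X = pad_sequences(encoded_urls, maxlen=30, padding="post", truncating="post", value=0)
print(X.shape)   # (2, 30); after the 50-dimensional embedding layer: (batch, 30, 50)
```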
4.3. Hyperparameters
During the experiment, the hyperparameters are set as shown in Table 4.
4.4. Experimental Comparison
During training, the changes in malicious URL recognition accuracy and loss value with the number of training iterations are shown in Figure 11, where orange represents the training curve and blue represents the validation curve. It can be seen from Figure 11a that the accuracy increases rapidly at the beginning of training; an inflection point appears after the fifth iteration, and the training curve begins to flatten after the 15th iteration. Figure 11b shows that the training loss curve and the validation loss curve begin to converge after the 15th iteration.
Figure 12 shows the ROC curve of the experimental model. The area under the curve (AUC) is close to 1, indicating that the DA-BiGRU model has a good classification effect.
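A minimal sketch of how such an ROC curve and AUC can be computed with scikit-learn is shown below; the variable names (y_test, y_score) and the plotting details are assumptions, not the authors' evaluation code.

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

def plot_roc(y_test, y_score):
    """y_test: true labels (0 = benign, 1 = malicious); y_score: predicted
    probability of the malicious class, e.g. model.predict(X_test)[:, 1]."""
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)
    plt.plot(fpr, tpr, label=f"DA-BiGRU (AUC = {auc:.4f})")
    plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```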
The MLP is the starting point for studying more complex deep learning methods. The GRU is a variant of the LSTM with a simplified internal structure that often achieves comparable or better accuracy. Therefore, to verify the classification effect of the proposed model, this paper selects four models for comparative testing: MLP, Att-BiLSTM, Att-BiGRU, and Dro-Att-BiLSTM. Among them, the Att-BiLSTM model adds an attention mechanism to BiLSTM, with one BiLSTM layer and 128 neural nodes; the Att-BiGRU model adds an attention mechanism to BiGRU, with one BiGRU layer and 100 neural nodes; and the Dro-Att-BiLSTM model adds a dropout mechanism to the Att-BiLSTM model, with a dropout rate of 0.2. The data used for the comparison models are consistent with those used for the proposed model. The tanh function is used as the activation function, and the softmax function is used for the final classification.
In this paper, the DA-BiGRU model is compared with the other four models; the experimental comparison results are shown in Figure 13. Figure 13a shows the loss curves on the training set: the DA-BiGRU curve is smoother, and its loss is the smallest among the compared models. Figure 13b shows the loss curves on the validation set, where the Dro-Att-BiLSTM model oscillates with the largest amplitude. Figure 13c shows the precision curves on the validation set: the Att-BiGRU and DA-BiGRU models achieve higher precision. Although the Att-BiGRU curve is above the DA-BiGRU curve in the middle of validation, DA-BiGRU surpasses Att-BiGRU later in validation, which shows that the fitting effect of the model is better after the dropout mechanism is added. Figure 13d shows the accuracy curves on the validation set: the DA-BiGRU model achieves a good classification effect during validation, with an accuracy of 0.9792. Overall, the curves of the BiLSTM-based models are lower than those of the BiGRU-based models, indicating that the GRU model is more suitable for classification tasks with strong sequential dependence. Figure 13e shows the recall curves on the validation set, from which it can be seen that DA-BiGRU has a good malicious URL detection ability.
The experimental results of the different models on the test set are shown in Table 5. From Table 5, we can see that the DA-BiGRU model proposed in this paper outperforms the other models on every metric on the test set, with an accuracy of 0.9792. The MLP model is lower than the other models on all metrics; this may be related to the MLP model itself, which does not take into account the sequential nature of URLs or some deeper key information, so its classification effect is poor. A longitudinal comparison of the Att-BiLSTM, Att-BiGRU, Dro-Att-BiLSTM, and DA-BiGRU models shows that the BiGRU-based models perform better than the BiLSTM-based ones, which indicates that the GRU is more suitable for URLs, whose sequential structure is strong. The comparison between Att-BiGRU and DA-BiGRU shows that adding the dropout mechanism to Att-BiGRU improves the fitting ability of the model, extracts useful features more accurately, and captures the sequence information of the whole URL, thereby improving the classification accuracy.