In the first phase, the dataset was collected based on Google Trends keywords in order to find the maximum number of phishing and suspicious websites. Google Trends [19] is a service provided by Google that gives public insight into trends in people's search behavior within Google Search. After collecting the keywords, the Helium Scraper tool [20] was utilized to collect the URLs. The Helium Scraper tool is developed by Helium 10 in Los Angeles, California, United States. The dataset was then cleaned in the pre-processing step by handling the missing values and the noisy data.
The dataset was preprocessed in three steps. First, the data were cleaned by filling in the missing values. Second, noisy data were handled, for example by removing non-Arabic web pages, not-found pages, and pages returning an internal server error. Third, alphabetic values were represented as numerical values in the dataset; for instance, we store the count of popular keywords rather than the keywords themselves.
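The following is a minimal sketch of these three pre-processing steps, assuming a pandas DataFrame with illustrative column names ('lang', 'status_code', 'body_text') and an assumed keyword list; it is not the exact cleaning code used in this work.

```python
import pandas as pd

df = pd.read_csv("collected_urls.csv")  # assumed export from Helium Scraper

# Step 1: fill in missing values.
df = df.fillna({"body_text": "", "status_code": 200})

# Step 2: remove noisy records (non-Arabic pages, not-found pages, server errors).
df = df[(df["lang"] == "ar") & (~df["status_code"].isin([404, 500]))]

# Step 3: represent text numerically, e.g., count popular keywords instead of
# storing the keywords themselves.
popular_keywords = ["تسجيل", "حساب", "بنك"]  # illustrative keywords only
df["keyword_count"] = df["body_text"].apply(
    lambda text: sum(text.count(kw) for kw in popular_keywords)
)
```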
The URLs were then classified and labeled as benign, suspicious, or phishing using VirusTotal [21]. This was followed by the feature engineering step, which includes feature extraction and selection. The URLs' lexical, content-based, and network-based features were extracted from the dataset. The features that most enhance the accuracy were selected using the following methods: correlation, chi-square, and ANOVA. Then, we applied four ML classification algorithms: RF, XGBoost, SVM, and DT. These algorithms were selected based on our previous survey on applying ML techniques to detect malicious URLs [8]. We evaluated the models using four evaluation metrics: accuracy, recall, F1 score, and precision.
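As a rough sketch of this train-and-evaluate step, the four classifiers can be fitted on an 80–20 split and scored with the four metrics as follows; the feature matrix X and label vector y are assumed to come from the feature-engineering step, and default hyperparameters are shown for brevity:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {"RF": RandomForestClassifier(), "XGBoost": XGBClassifier(),
          "SVM": SVC(), "DT": DecisionTreeClassifier()}
for name, clf in models.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(name,
          accuracy_score(y_test, y_pred),
          recall_score(y_test, y_pred, average="macro"),    # macro-averaged over 3 classes
          f1_score(y_test, y_pred, average="macro"),
          precision_score(y_test, y_pred, average="macro"))
```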
The best-performing model was then used to create the browser extension. Lastly, we tested the functionality by loading the extension in the Chrome browser and determining whether a given URL would be correctly classified in a short time.
3.1. Dataset Description
In the beginning, the dataset contained a total of 15,000 URLs, of which 11,906 URLs were collected using the Helium Scraper tool and 3094 URLs were taken from the ArabicWeb16 dataset [22]. Helium Scraper is a tool that extracts content from websites by identifying target elements, specifying the desired content through keywords typed in the required language, and then applying extraction rules. In addition, it can export the extracted data in different formats [20]. Based on the VirusTotal API, the URLs were labeled as 12,235 benign, 881 malicious, 220 malware, 761 phishing, 304 spam, and 569 suspicious URLs. The VirusTotal API allows programmatic interaction with VirusTotal through an API key, which any user can obtain by creating an account with VirusTotal [21]. Furthermore, we noticed a large imbalance between the benign class and the others. As a result, we merged the malware and malicious records into the phishing class and the spam records into the suspicious class. Phishing is a method of acquiring information that can involve malware, and malware is an umbrella term for an entire range of malicious software [23]. Moreover, suspicious activities can be defined as activities that are out of the ordinary, and spam is any unwanted, unsolicited digital communication sent out in bulk. Because spam can come from either good or bad sources, this behavior is suspicious (e.g., spam emails). After the merging, the benign class was under-sampled to 1313 records, while the other classes consist of 1862 phishing and 873 suspicious records, totaling 4048 URLs.
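A brief sketch of this re-labeling and under-sampling, assuming a pandas DataFrame df with a 'label' column holding the original VirusTotal categories:

```python
import pandas as pd

# Merge malware and malicious into phishing, and spam into suspicious.
df["label"] = df["label"].replace({"malicious": "phishing",
                                   "malware": "phishing",
                                   "spam": "suspicious"})

# Under-sample the benign class to 1313 records to reduce the imbalance.
benign = df[df["label"] == "benign"].sample(n=1313, random_state=42)
df = pd.concat([benign, df[df["label"] != "benign"]], ignore_index=True)
```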
Figure 2 below shows the number of records in each class: 0 for benign, 1 for suspicious, and 2 for phishing. Afterward, 17 lexical features, 13 network features, and 9 content features were extracted.
3.3. Feature Selection
Feature selection is a strategy for selecting the subset of features that contributes most to the prediction variable (the URL label) in the dataset. Feature selection aids the effectiveness and efficiency of AI models by reducing time complexity and data dimensionality, whereas irrelevant features can mislead AI models and reduce accuracy [25]. We selected ANOVA, correlation, and chi-square based on their high accuracies in previous studies, such as [26,27]. The three feature selection methods are discussed below:
Correlation is a well-known statistical measure of the similarity between two features. The correlation coefficient between two features equals one if they are completely linearly dependent and zero if they are uncorrelated. The correlation approach is used to determine the relationship between the features. There are two basic groups of methods for determining the correlation between two random variables: the first is based on linear correlation, whereas the second is based on information theory. The linear correlation coefficient r for a pair of variables (X, Y) is given by the following formula [28]:

r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the means of X and Y, respectively. Moreover, the heatmap of the features in Figure 4 shows a strong positive correlation among url_len, special_char, count_digits, and path_len. In addition, count_com and com_presence are strongly positively correlated with each other.
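A heatmap of this kind can be reproduced with a short snippet such as the following, assuming the extracted numeric features are held in a pandas DataFrame named features (the name is illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = features.corr()              # pairwise Pearson correlations between features
sns.heatmap(corr, cmap="coolwarm")  # visualize as a heatmap (cf. Figure 4)
plt.title("Feature correlation heatmap")
plt.tight_layout()
plt.show()
```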
Chi-square is a statistical test used to determine whether two categorical variables are independent, or how closely a sample fits the distribution of a known population. In other words, it measures the difference between the observed and expected outcomes. Let O denote the observed value and E the expected value; the test statistic is then given by the following formula, and the larger the chi-square value, the more dependent the feature is on the target and the more useful it is to the model [29]:

\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
Moreover, Figure 5 shows that lifetime, active time, the number of words in the body of the page, and the number of remaining days before the domain expires are the top features contributing to the output, while the other features have almost no impact.
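This ranking can be reproduced with scikit-learn's chi2 scorer, a sketch of which is shown below; chi2 requires non-negative inputs, which holds for the count, length, and presence features used here (X is assumed to be a pandas DataFrame of features and y the labels):

```python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=10)   # k = 10 is an illustrative choice
selector.fit(X, y)

# Higher scores indicate features more dependent on the class label.
for name, score in sorted(zip(X.columns, selector.scores_), key=lambda t: -t[1]):
    print(f"{name}: {score:.2f}")
```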
ANOVA is a statistical approach that determines whether the means of two or more groups differ significantly, i.e., whether there is a significant difference between the means of multiple datasets [30]. The test determines the impact that the independent variables have on the dependent variable. Moreover, Figure 6 presents the ANOVA feature ranking, which shows that www presence is the feature with the highest p-value and is therefore the least relevant for predicting the classification of the Arabic websites in our dataset.
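The ANOVA scores can be obtained with scikit-learn's f_classif, which returns an F-statistic and a p-value per feature; features with high p-values (such as www presence in Figure 6) contribute little to the classification. A minimal sketch, again assuming a DataFrame X and labels y:

```python
from sklearn.feature_selection import f_classif

f_scores, p_values = f_classif(X, y)
for name, p_value in zip(X.columns, p_values):
    print(f"{name}: p-value = {p_value:.4f}")   # high p-value = weak relevance
```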
Table 4 below shows the feature set selected by each method.
ANOVA highlighted features that exhibited distinct mean differences across different URL categories, suggesting their potential importance in classification. Chi-square pinpointed features that demonstrated strong associations with the target variable, indicating a significant correlation between feature values and the target variable. Correlation analysis revealed features that exhibited linear relationships with the class labels, implying a direct influence on classification outcomes.
To determine the most effective feature set for our classification task, we applied a variety of machine-learning models to the datasets resulting from each feature selection technique. By comparing the performance of these models, as measured by accuracy, we identified the feature selection method that consistently yielded the highest classification accuracy, thus revealing the most essential features for precise URL classification.
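A sketch of this comparison, using RF as a representative model and an assumed dictionary feature_sets that maps each selection method to its list of selected column names (as in Table 4):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# feature_sets is assumed, e.g., {"ANOVA": [...], "Chi-square": [...], "Correlation": [...]}
for method, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X[cols], y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print(method, accuracy_score(y_te, clf.predict(X_te)))
```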
3.4. Machine Learning Classifiers
This section details the ML classifiers, namely RF, XGBoost, DT, and SVM. For all classifiers, we employed an 80–20 split for training and testing the dataset. Additionally, we utilized the grid search algorithm to optimize the hyperparameters and achieve the highest possible accuracy.
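The tuning procedure can be sketched as follows with scikit-learn's GridSearchCV; the parameter grid shown is an illustrative subset rather than the full grid searched in this work:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [600, 1000, 1400], "max_depth": [14, 18, 22]}  # assumed subset
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```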
RF is a supervised ML method based on the ensemble approach: it is built from DTs and can handle classification and regression problems.
An ensemble method means that an RF model comprises many small DTs, called estimators, each of which makes its own predictions. The RF method combines the estimators' predictions into a more precise prediction. For classification problems, the RF output is the class selected by majority voting; for regression problems, the output is the mean or average prediction of the individual trees [31].
When using the RF method to solve regression problems, the mean squared error (MSE) can be used [32]. This measure captures the distance of each node from the predicted actual value, allowing the choice of the branch that suits the forest the most, and it is given in the following equation:

\mathrm{MSE} = \frac{1}{X}\sum_{i=1}^{X}(Y_i - y_i)^2

where X is the number of data points, Y_i is the value returned by the DT, and y_i is the actual value of the data point tested at a particular node. When the RF method is used to solve classification problems, the Gini index or information gain (IG) is used. The Gini index of each branch on a node is calculated from the class probabilities, indicating which branch is more likely to occur. The formula that calculates the Gini index is given in the following equation [32]:

\mathrm{Gini} = 1 - \sum_{i=1}^{N} (p_i)^2

where p_i represents the relative frequency of the class observed in the dataset, and N represents the number of classes.
IG is another measure used to choose the appropriate data split based on each feature's gain. It is computed from the entropy of the parent node and the weighted entropies of the child nodes, as given in the following equation [32]:

\mathrm{IG} = E(\text{parent}) - \sum_{j} w_j\, E(\text{child}_j), \quad \text{where } E = -\sum_{i=1}^{N} p_i \log_2 p_i

Here, w_j is the fraction of samples reaching child node j.
The hyperparameters and their optimal values for RF are presented in Table 5. The random_state parameter sets a random seed (42) for the random number generation process in the RF, so that building the model on the same data yields the same model each time. Moreover, n_estimators represents the number of trees in the forest, which is 1400. The max_features parameter represents the number of features to be considered when determining the optimum split for the tree; if max_features is set to auto, the tree uses the square root of the number of features. In addition, the max_depth parameter represents the maximum number of levels in each DT, which is 18. The criterion parameter 'entropy' specifies the function used to measure the quality of a split.
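Expressed as a scikit-learn call, the tuned configuration in Table 5 corresponds roughly to the following sketch (max_features='auto' is written as 'sqrt', its equivalent for classification in current scikit-learn versions):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1400, max_depth=18, max_features="sqrt",
                            criterion="entropy", random_state=42)
rf.fit(X_train, y_train)
```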
XGBoost is an ensemble technique based on DTs and a form of gradient boosting. It is a supervised learning method based on function approximation through the optimization of specific loss functions and the application of various regularization methods; it supports parallel processing, handles missing values, offers cache optimization, deals with outliers to some extent, and has built-in cross-validation. XGBoost can be utilized to solve regression and classification problems. The algorithm combines the estimates of a group of smaller, weaker models to accurately predict a target variable. The general objective of the algorithm consists of two parts, a training loss term and a regularization term, as follows [33]:

\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta)

where L represents the training loss function and \Omega represents the regularization term. The training loss measures how well the model predicts the training data, and MSE is a common choice of L. Regularization helps prevent the problem of overfitting by controlling the model's complexity [33].
The hyperparameters and their optimal values for XGBoost are presented in Table 6, where the n_estimators value is 600, the max_depth value is 22, and the gamma parameter, which specifies the minimum loss reduction required to make a split, is 0.5. Furthermore, the learning_rate parameter value is 0.09, which scales the weights assigned to the trees in the next iteration. The colsample_bytree parameter value is 0.7, representing the fraction of columns to be randomly sampled for each tree. Moreover, the booster parameter selects the type of model to run at each iteration, and the value 'gbtree' means that the model is tree-based. The cv parameter, with a value of 5, determines the cross-validation splitting strategy.
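A sketch of the corresponding classifier (the cv=5 setting belongs to the surrounding grid search rather than to the classifier itself):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=600, max_depth=22, gamma=0.5,
                    learning_rate=0.09, colsample_bytree=0.7, booster="gbtree")
xgb.fit(X_train, y_train)
```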
DT is a supervised learning technique that can be used for regression and classification problems [34]. Moreover, the DT classifier is constructed as a tree-like structure that represents all possible results of a decision based on defined conditions. Furthermore, a DT comprises three essential elements: decision nodes (internal nodes), branches, and leaf nodes. At each internal node, which represents an attribute, the data are branched into two distinct categories; this is repeated until a class label, represented by a leaf, is reached.
Table 7 presents the hyperparameters and their optimal values for DT. The max_depth is 30, the max_features is auto, and the criterion value is entropy. The ccp_alpha parameter, with a value of 0.001, refers to the cost complexity pruning parameter, which provides another option for controlling the tree size: the greater the value of ccp_alpha, the more nodes are pruned.
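A sketch of the tuned DT (with 'auto' again written as 'sqrt'):

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=30, max_features="sqrt",
                            criterion="entropy", ccp_alpha=0.001)
dt.fit(X_train, y_train)
```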
SVM is a supervised learning algorithm that can solve classification and regression problems. It utilizes a dataset in which the input samples are separated into two classes with labels 0 or 1. The algorithm aims to find a line or plane, known as a hyperplane, that divides the two classes most efficiently [35]. For two input features, the hyperplane is given by the following equation:

B_0 + B_1 X_1 + B_2 X_2 = 0

This hyperplane equation can be used to determine whether a new example falls on the class 0 or the class 1 side. The coefficients (B_1 and B_2) give the slope of the line, the algorithm calculates the intercept (B_0), and X_1 and X_2 are the two input data points [35].
The hyperparameters and their optimal values for SVM are presented in Table 8. The first parameter is gamma, with a value of 0.01, which sets the distance of influence of a single training point. If the gamma value is high, only nearby points affect the classification; in other words, data points must be close to each other to be considered in the same class. If the gamma value is low, distant data points also influence the classification, which results in more data points being grouped together. The second parameter is the kernel function parameter, with the value rbf, which transforms non-linearly separable data into linearly separable data using the radial basis function (RBF). The third parameter is the C parameter, which defines how much misclassification of the training data is permitted in the model. If the C value is small, a decision boundary with a large margin is chosen; however, if the C value is large, the SVM classifier attempts to reduce the number of misclassified samples, resulting in a decision boundary with a narrower margin [36].
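A sketch of the corresponding classifier; the C value shown is a placeholder, since its tuned value appears only in Table 8 and not in the text:

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf", gamma=0.01, C=1.0)   # C=1.0 is an assumed placeholder
svm.fit(X_train, y_train)
```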
3.5. Building Extension
Google Chrome is the most commonly used browser in the world, with a user-friendly interface that sets the standard for modern browser design. To enable proper functioning, we divided the work into two parts: the frontend and the backend. The frontend is the part with which a user interacts, whereas the backend is the framework that enables this interaction. The frontend includes two interfaces: a scanning interface that obtains the URL of the currently opened page, processes it, and returns its label; and a search interface that enables users to enter any URL and returns the result.
The backend is the backbone of the system and is based on the Django framework, the Django REST API, a PostgreSQL database [37], and the Heroku server [38]. Moreover, it includes the feature extraction function (whose result is the input of the trained model) and the trained ML model that predicts the URL label. Django is a full-stack Python web framework that allows safe and maintainable websites to be built rapidly; it is free and open-source [39]. The REST API defines what requests can be made to a component, how to make them (e.g., via GET or POST), and their expected responses.
Figure 7 presents the entire process of Kashif. The user can search for a specific URL, or the URL of the current page is sent to the Django framework (backend). The Kashif extension then needs access to the database, either to retrieve the result of an existing URL together with its label or to save a new result consisting of a URL and its corresponding label. This access is provided through the REST API, which receives the user's input from the frontend (the URL from the current page or the search page) via the GET method.
Therefore, the URL is looked up in the database and, if the same URL is found, the stored result is displayed to the user. If not, the features of the URL are extracted and passed to the ML model to predict the label of the URL; the result is then saved in the database and, at the same time, returned to the user.
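The lookup-then-predict flow can be sketched as a single Django REST endpoint, assuming a hypothetical ScanResult model (url, label), a hypothetical extract_features helper, and a serialized classifier saved as model.pkl; these names are illustrative, not the actual Kashif code:

```python
import joblib
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import ScanResult          # hypothetical Django model: url, label
from .features import extract_features  # hypothetical feature-extraction helper

model = joblib.load("model.pkl")         # best-performing trained classifier
LABELS = {0: "benign", 1: "suspicious", 2: "phishing"}

@api_view(["GET"])
def classify_url(request):
    url = request.query_params.get("url", "")
    cached = ScanResult.objects.filter(url=url).first()
    if cached:                                        # known URL: return stored label
        return Response({"url": url, "label": cached.label})
    features = extract_features(url)                  # lexical, network, content features
    label = LABELS[int(model.predict([features])[0])]
    ScanResult.objects.create(url=url, label=label)   # cache the new result
    return Response({"url": url, "label": label})
```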