**3. Methodology**

The structure of this section is as follows: Section 3.1 outlines the characteristics and methods of collection of the dataset. Section 3.2 presents our evaluation metrics. Section 3.3 defines each of the well-known features from the literature. Section 3.4 covers the evaluation of their robustness, and Section 3.5 presents novel features and evaluates their robustness.

### *3.1. Data Collection*

The main ingredient of ML models is the data on which they are trained. The collected data should be as heterogeneous as possible to model reality. The data collected for this work include both malicious and benign URLs: the benign URLs are based on the Alexa top 1 million [62], and the malicious domains were crawled from multiple sources [63,64] to ensure diversity and because malicious URLs are fairly rare.

According to [65], 25% of all URLs in 2020 were malicious, suspicious, or moderately risky. Therefore, to build a realistic dataset, all the evaluations include all 1356 active, unique malicious URLs and, consequently, 5345 active, unique benign URLs as well. For each instance, the URL and domain information properties were crawled from *Whois* and the corresponding DNS records. *Whois* is a widely used Internet record listing that identifies who owns a domain, how to get in contact with them, and the creation, update, and expiration dates of the domain. *Whois* records have proven extremely useful and have developed into an essential resource for maintaining the integrity of domain name registration and website ownership. Note that, according to a study by ICANN (Internet Corporation for Assigned Names and Numbers) [66], many malicious attackers abuse the *Whois* system. Hence, only the information that could not be manipulated was used. A graphical representation of the data collection framework is illustrated in Figure 2.

Finally, based on these resources (*Whois* and DNS records), the following features were generated: the length of the domain, the number of consecutive characters, and the entropy of the domain, all derived from the URL datasets. Next, the lifetime of the domain and the active time of the domain were calculated from the *Whois* data. Based on the DNS response dataset (a total of 263,223 DNS records), the number of IP addresses, the number of distinct geo-locations of the IP addresses, the average Time to Live (TTL) value, and the standard deviation of the TTL were extracted. For extracting the novel features (Section 3.5), *VirusTotal* (*VT*) [67] and *Urlscan* [68] were used, where *Urlscan* was used to extract parameters such as the IP address of the page element of the URL.
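The lexical features above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function and feature names are ours, and "number of consecutive characters" is interpreted here as the longest run of identical characters, which is an assumption.

```python
import math
from collections import Counter
from itertools import groupby

def lexical_features(domain: str) -> dict:
    """Illustrative lexical features for a single domain string."""
    n = len(domain)
    counts = Counter(domain)
    # Shannon entropy of the character distribution (bits per character).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Longest run of consecutive identical characters (assumed meaning of
    # "number of consecutive characters").
    max_run = max(len(list(group)) for _, group in groupby(domain))
    return {
        "length": n,
        "entropy": round(entropy, 4),
        "max_consecutive_chars": max_run,
    }

print(lexical_features("example.com"))
```

High-entropy, long domains with unusual character runs are common in algorithmically generated (DGA) names, which is why such simple lexical statistics carry signal.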

**Figure 2.** Data collection framework.

### *3.2. Evaluation Metrics*

Machine Learning (ML) is a subfield of computer science aimed at enabling computers to act, and to improve over time, autonomously by feeding them data in the form of observations and real-world interactions. In contrast to traditional programming, where an input and an algorithm are provided to produce an output, with ML, a list of inputs and their associated outputs is provided in order to extract the algorithm that maps the two.

ML algorithms are often categorized as either supervised or unsupervised. In supervised learning, each example is a pair consisting of an input vector (also called data point) and the desired output value (class/label). Unsupervised learning learns from data that have not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.

In order to evaluate how a supervised model is adapted to a problem, the dataset needs to be split into two, namely, a training set and testing set. The training set is used to train the model, and the testing set is used to evaluate how well the model "learned" (i.e., by comparing the model predictions with the known labels). Usually, the train/test distribution is around 75%/25% (depending on the problem and the amount of data). Standard evaluation criteria are as follows: recall, precision, accuracy, F1-score, and loss. All of these criteria can easily be extracted from the evaluation's confusion matrix.
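The 75%/25% split described above can be illustrated with a small self-contained sketch. This is a stand-in for standard library helpers (e.g., scikit-learn's `train_test_split`); the function and variable names are ours, not the paper's.

```python
import random

def split_train_test(data, labels, test_ratio=0.25, seed=42):
    """Shuffle paired data/labels and split them into train and test sets.
    test_ratio=0.25 matches the 75%/25% distribution described above."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([data[i] for i in train_idx], [labels[i] for i in train_idx],
            [data[i] for i in test_idx], [labels[i] for i in test_idx])

# Toy dataset: 100 feature values with binary labels.
X = list(range(100))
y = [x % 2 for x in X]
X_tr, y_tr, X_te, y_te = split_train_test(X, y)
print(len(X_tr), len(X_te))  # 75 25
```

Shuffling before splitting matters: without it, any ordering in the collected data (e.g., all malicious URLs listed first) would leak into the split.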

A confusion matrix (Table 1) is commonly used to describe the performance of a classification model. Recall (Equation (2)) is defined as the number of correctly classified malicious examples out of all malicious examples. Similarly, precision (Equation (3)) is the number of correctly classified malicious examples out of all examples classified as malicious (whether correctly or wrongly). Accuracy (Equation (1)) is a statistical measure of how well a classification test correctly identifies or excludes a condition; that is, accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined. Finally, the F1-score (Equation (4)) considers both the precision and the recall of the test: it is their harmonic mean, reaching its best value at 1 (perfect precision and recall) and its worst at 0. These criteria serve as the main evaluation metrics.

The problem of identifying malicious web domains is a supervised classification problem, as the correct label (i.e., malicious or benign) can be extracted using a blacklist-based method, as we describe in the next section.

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} = \frac{TP + TN}{P + N} \tag{1}$$

$$Recall = \frac{TP}{TP + FN} \tag{2}$$

$$Precision = \frac{TP}{TP + FP} \tag{3}$$

$$F_1\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \tag{4}$$
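As a quick check of Equations (1)–(4), the metrics can be computed directly from confusion-matrix counts. The counts below are hypothetical and for illustration only; they are not results from the paper's dataset.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Equations (1)-(4) from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)      # Equation (1)
    recall = tp / (tp + fn)                         # Equation (2)
    precision = tp / (tp + fp)                      # Equation (3)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (4)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical counts chosen only to exercise the formulas.
m = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(m)
```

Note that with an imbalanced dataset such as the one above (roughly 4:1 benign to malicious), accuracy alone can be misleading, which is why precision, recall, and F1 are reported alongside it.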

**Table 1.** Confusion matrix.

|                  | Predicted Malicious | Predicted Benign    |
|------------------|---------------------|---------------------|
| Actual Malicious | True Positive (TP)  | False Negative (FN) |
| Actual Benign    | False Positive (FP) | True Negative (TN)  |