*4.1. Experimental Design*

In addition to choosing the model parameters carefully, one should verify that the data used for the learning phase accurately represent the real-world distribution of malicious domains. Hence, the dataset was constructed such that approximately 75% of the samples were benign domains and the remaining 25% were malicious domains (~5000 benign and ~1350 malicious domains, respectively) [65].

There are many ways to define the efficiency of a model. To account for most of them, a broad set of metrics was extracted, including accuracy, recall, F1-score, and training time. For each model, the dataset was split into train and test sets: 75% of the data (both benign and malicious) were assigned to the train set, and the remaining 25% were assigned to the test set used for evaluation. Note that this 75/25 train/test split is distinct from the class composition of the dataset, in which 75% of all samples are benign.
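As an illustration of the split and the metrics described above, the following is a minimal sketch rather than the procedure actually used in the experiments. It assumes that the split is stratified by class, which the text suggests but does not state explicitly, and that malicious domains are the positive class. The function names `stratified_split` and `binary_metrics`, the label encoding (0 = benign, 1 = malicious), and the random seed are hypothetical choices made only for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Assumption: the split is stratified, so both sets keep roughly the
    same benign/malicious ratio as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)  # truncate toward zero
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall, precision, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Toy labels mirroring the approximate class counts reported in the text.
labels = [0] * 5000 + [1] * 1350
train_idx, test_idx = stratified_split(labels)
```

Training time, the fourth metric, could be measured by wrapping each model's fitting call with `time.perf_counter()`; it is omitted here because it depends on the specific model implementation.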

The evaluation measured the efficiency of the different models while varying the robustness of the features included in the model. Specifically, four classical models (i.e., Logistic Regression, SVM, ELM, and ANN) were trained using the following feature sets:

