**1. Introduction**

Cybersecurity attacks have become a significant issue for governments and civilians [1]. Many of these attacks rely on malicious web domains or URLs (see Figure 1 for an example of a URL structure). These domains are used for phishing [2–6] (e.g., spear phishing), Command and Control (C&C) [7], and a wide range of virus and malware attacks [8]. The ability to identify a malicious domain in advance is therefore highly valuable to defenders [9–26].

**Figure 1.** The URL structure.

**Citation:** Hajaj, C.; Hason, N.; Dvir, A. Less Is More: Robust and Novel Features for Malicious Domain Detection. *Electronics* **2022**, *11*, 969. https://doi.org/10.3390/electronics11060969

Academic Editors: Leandros Maglaras, Helge Janicke and Mohamed Amine Ferrag

Received: 17 February 2022; Accepted: 15 March 2022; Published: 21 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

A common way of identifying malicious or compromised domains is to collect information about the domain names (alphanumeric characters) and network information (such as DNS and passive DNS data). A set of features is extracted from this information, and machine learning (ML) algorithms are trained on these features using a large amount of data [11–15,17–22,24,26–28]. A mathematical approach can also be used in various ways [16,26], such as measuring the distance between a known malicious domain name and the analyzed domain (benign or malicious) [26]. Nonetheless, while ML-based solutions are widely used, many of them are not robust: an attacker can easily bypass these models with minimal feature perturbations (e.g., changing the domain's length or modifying network parameters such as Time To Live (TTL)) [29,30]. In this context, one of the main problems is how to train a robust malicious domain classifier, that is, one that withstands an intelligent adversary who manipulates domain properties so that malicious domains are classified as benign.
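As an illustration of the distance-based idea mentioned above, the following minimal sketch computes an edit distance between an analyzed domain name and a list of known malicious names. The choice of Levenshtein distance is an assumption made here for illustration; the cited works [16,26] may use a different measure. The function names and the example domain names are hypothetical.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string as b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def nearest_known_malicious(
    domain: str, known_malicious: Iterable[str]
) -> Tuple[int, str]:
    """Return (distance, name) for the closest known malicious name.

    Assumption for illustration only: a small distance is treated as a
    warning sign, not as proof that the analyzed domain is malicious.
    """
    return min((levenshtein(domain, name), name) for name in known_malicious)


if __name__ == "__main__":
    # Hypothetical names under the reserved .test top-level domain.
    known = ["example-login.test", "other.test"]
    print(nearest_known_malicious("examp1e-login.test", known))
```

A distance like this could serve as one input feature among many. On its own it is exactly the kind of signal an attacker can shift by renaming a domain, which is the robustness concern raised above.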

For this purpose, a feature selection process is executed to differentiate between robust and non-robust features. Given the robust feature set, the defender can still provide an efficient classifier that is harder to manipulate: even if the attacker has black-box access to the model, tampering with the domain properties or network parameters will have a negligible effect on the classifier's accuracy. To achieve this goal, we collected a broad set of both malicious and benign URLs. In addition, we reviewed related work and identified a set of features commonly used for the classification task. We then artificially manipulated these features to show that some, although widely used, are not robust in the face of adversarial perturbations. In a complementary manner, we engineered an original set of novel and robust features. We then created a hybrid feature set that combines the robust well-known features with our novel features. Finally, the different feature sets (e.g., common, robust common, and novel) were evaluated using common machine learning algorithms, with emphasis on the importance of the feature selection and feature engineering processes.
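The notion of a non-robust feature can be sketched as follows. This is a minimal illustration, not the procedure used in this paper: the rule-based classifier, the feature names (`domain_length`, `ttl`), the thresholds, and the sample values are all hypothetical. The sketch assumes an attacker who can set one feature to a value of their choice and measures how many detected malicious samples then pass as benign.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]


def evasion_rate(
    classify: Callable[[Sample], bool],
    malicious: List[Sample],
    feature: str,
    new_value: float,
) -> float:
    """Fraction of detected malicious samples that are classified as
    benign once the attacker overwrites a single feature."""
    detected = [sample for sample in malicious if classify(sample)]
    if not detected:
        return 0.0
    evaded = sum(
        1 for sample in detected
        if not classify({**sample, feature: new_value})
    )
    return evaded / len(detected)


def toy_classifier(sample: Sample) -> bool:
    """Hypothetical rule: flag short TTLs or very long domain names."""
    return sample["ttl"] < 300 or sample["domain_length"] > 25


# Hypothetical malicious samples, invented for this sketch.
MALICIOUS_SAMPLES: List[Sample] = [
    {"domain_length": 31, "ttl": 60},
    {"domain_length": 12, "ttl": 120},
    {"domain_length": 28, "ttl": 3600},
    {"domain_length": 9, "ttl": 200},
]

if __name__ == "__main__":
    # The attacker raises the TTL, or shortens the domain name.
    print("ttl:", evasion_rate(toy_classifier, MALICIOUS_SAMPLES, "ttl", 3600))
    print(
        "domain_length:",
        evasion_rate(toy_classifier, MALICIOUS_SAMPLES, "domain_length", 10),
    )
```

Under these assumptions, a feature with a high evasion rate is one the classifier leans on but the attacker controls cheaply, which makes it a candidate for removal during feature selection.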

The rest of the paper is organized as follows: Section 2 summarizes related work. Section 3 describes the methodology and the novel features. Section 4 presents the empirical analysis and evaluation. Finally, Section 5 concludes and summarizes this work.
