AGCN-Domain: Detecting Malicious Domains with Graph Convolutional Network and Attention Mechanism

Xi Luo; Yixin Li; Hongyuan Cheng; Lihua Yin

doi:10.3390/math12050640

Abstract

The Domain Name System (DNS) is core Internet infrastructure that provides the directory service for mapping domains to IPs. Given the foundational role and openness of DNS, it is not surprising that adversaries register massive numbers of domains to enable malicious activities such as spam, command and control (C&C), malware distribution, and click fraud. Detecting malicious domains is therefore a significant topic in security research. Although a substantial amount of research has been conducted, previous work has failed to fuse multiple relationship features to uncover the deep underlying relationships between domains, which largely limits its performance. In this paper, we propose AGCN-Domain to detect malicious domains by combining various relations. The core concept behind our work is to analyze relations between domains according to their behaviors from multiple perspectives and fuse them intelligently. The AGCN-Domain model uses three relationships (client relation, resolution relation, and cname relation) to construct three relationship feature graphs, extracts features from them, and intelligently fuses these features through an attention mechanism. The fused relationship features of each domain are then passed to the trained classifier. Our experiments demonstrate the performance of the proposed AGCN-Domain model: with 10% initialized labels in the dataset, it achieved an accuracy of 94.27% and an F1 score of 87.93%, significantly outperforming other methods in the comparative experiments.

Keywords: malicious domain; graph convolutional network; attention mechanism; domain relations

MSC: 68T07

1. Introduction

The Domain Name System (DNS) is one of the key infrastructures of today’s Internet: it provides domain name resolution services for Internet users and maps domain names to IPs. However, due to the openness of DNS services, cyber attackers also rely on DNS to carry out attacks by registering a large number of domain names. These malicious domains can lead to web pages that the attacker has carefully designed to commit crimes such as stealing accounts, fraud, and infecting computers with malware (ransomware, Trojans, etc.). Such pages are almost identical to normal web pages, which lowers the vigilance of ordinary users and gives attackers an opportunity. In addition, some malicious software, such as Trojans and backdoor viruses, connects to C&C servers through domain names. C&C servers act as bridges for communication between attackers and infected hosts: attackers send instructions to compromised hosts through C&C servers, enabling lateral movement to infect other internal hosts or steal sensitive data. This malicious software uses fast-flux technology to evade detection, making it increasingly difficult to detect malicious domains. Therefore, the detection of malicious domains has attracted the attention of many researchers.

Researchers have proposed a large number of approaches to discover malicious domains. Traditional approaches can be classified as feature-based methods [1,2,3,4,5,6,7,8,9,10,11,12]. They usually extract features defined by security experts for each domain from DNS data [1,2,3,4,5,6,7]. Typical features include lexical features, whois features, resolution features, etc. Malicious domains are then recognized with trained classifiers. However, such methods can be evaded easily since most of them are based on features of an individual domain. In addition, they are limited by expert observation and knowledge and are subject to manual bias.

In the face of these problems, some researchers have turned to relational features between domain names for malicious domain detection. They built various graphs to describe relations such as client relations [13,14], resolution relations [15], and whois relations [16], and then discovered malicious domains based on the intuition that domains strongly related to known malicious domains are more likely to be malicious. However, some studies focused on only one type of relationship between domains, which left a great deal of information unexplored, failed to reveal the richer potential relationships between domains, and allowed experienced attackers to evade detection. Other studies utilized multiple relationships [17,18,19], either to calculate domain name similarity [17] or to discover domain names strongly associated with malicious domains through multiple meta-paths [18]. But these works ignored the fine-grained distinctions between relationship features and failed to address the different effects of different relationship features on the judgement of a domain, which affects the accuracy of the system.

To address the problems with the above two main ways of detecting malicious domain names, it is necessary to develop a more comprehensive detection system. Graph convolutional networks (GCNs) are well suited to graph-based correlation problems and have also performed well in tasks such as collaborative classification and machine account detection.

In this paper, we make a basic observation about the relational characteristics of domain names: domain names with strong associations are more likely to belong to the same category [15]. This is because attackers’ resources are limited by cost, so adversaries reuse them, which leads to strong correlations among malicious domains [18]. Based on this observation, we propose AGCN-Domain, a system that detects malicious domains by leveraging various relations and balancing their influence with an attention mechanism to address the above limitations. In the AGCN-Domain model, we first use three homogeneous subgraphs (client relation graph, resolution relation graph, and cname relation graph) to represent the relationship features between domains and pre-process the subgraphs to optimize their structure. Graph convolutional networks are applied to aggregate features in each of the three subgraphs. Second, to balance the impact of different relationship features on domain features, we introduce an attention mechanism to make the detection results more accurate. We use a soft attention mechanism to assign weight coefficients to the relational features of a domain aggregated on the three subgraphs, thus obtaining a comprehensive relational feature vector for the domain. Finally, a fully connected layer outputs the malicious/benign probability for the domain.

The contributions can be summarized as follows:

We proposed AGCN-Domain, a system for detecting malicious domain names using multiple relationship patterns (client relation, resolution relation, and cname relation), which extracts and fuses the relationship features of domain names. The malicious domain detection problem is transformed into a binary node classification problem.
We designed a mechanism to detect malicious domains with high accuracy. We introduced a graph convolutional neural network, which performs well in graph correlation tasks, to detect malicious domains, and integrated it with an attention mechanism that weighs the effects of different relationship features on the classification result of each domain.
We made a comprehensive evaluation of our work with real-world data collected from an educational network. The results demonstrate that it has good performance in detecting malicious domains.

Organization. The rest of this paper is organized as follows. Section 2 reviews related work and its limitations. Section 3 illustrates preliminaries of our system, and Section 4 introduces the system structure and major components. Then, we make a comprehensive evaluation in Section 5. Finally, Section 6 concludes this paper.

3. Preliminaries

In this section, we first introduce the relations we used in this work and the reason why they are effective in detecting malicious domains. Then, we summarize the techniques used in our work.

3.1. Domain Relations

We investigated previous works detecting malicious domains utilizing domain relations and summarized three typical relations that are widely used: client relation, resolution relation, and cname relation.

Client Relation [13,14]. The client relation indicates that a domain queried by a client that has requested malicious domains is more likely to be malicious. There are two basic observations of the client relation. (1) Compromised clients are unlikely to query only one malicious domain since they commonly query a set of domains to achieve one malicious activity, and they are usually involved in more than one attack. (2) Malicious domains have no reason to be accessed by benign clients since they are dedicated to malicious services. As shown in Figure 1, domain d6 can be inferred to be malicious because it is requested by suspicious clients c1, c3, c5 and clearly far away from the benign network: no benign clients have queried it.

Figure 1. Client Relations.

Resolution Relation [15]. The resolution relation means that domains with the same resolutions (hosted on the same IP) tend to have the same labels. This relation is strongly based on the rarity and expensiveness of malicious servers (IPs) where attackers can host domains. In fact, attackers widely reuse resources (especially malicious IPs) in malicious activities, such as domain-flux and fast-flux. As shown in Figure 2, domain d5 is considered to be malicious, because it is hosted on suspicious servers IP1, IP2, which host three known malicious domains d1, d2, d3 and no benign domains.

Figure 2. Resolution Relations.

Cname Relation [24]. The cname relation in this paper indicates that domains that appear in one cname record tend to share the same label (homophily). In the DNS system, the cname record is used to set the alias name of a domain. We illustrate an example of a cname record in Figure 3. When a cname record is set, the domain and its alias name finally point to the same destination: the client first queries www.example.com and obtains its alias name www.example-alias.com, then the client queries www.example-alias.com and obtains the final destination. Domains connected to malicious domains through a cname relation are therefore more likely to be malicious, since they map to the same resource.

Figure 3. Cname Relations.

3.2. Graph Convolutional Network (GCN)

Graph convolutional networks (GCN) have achieved strong results in multiple areas, such as document classification, recommender systems, and traffic prediction. GCN was first proposed by Kipf and Welling [29] in 2017. The GCN model takes both node attributes and graph structure as input and combines them to obtain the deep semantics of nodes. The propagation rule of a multi-layer GCN can be expressed as:

H^{(l + 1)} = σ ({\tilde{D}}^{- \frac{1}{2}} (A + I) {\tilde{D}}^{- \frac{1}{2}} H^{(l)} W^{(l)})

(1)

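where $A$ is the adjacency matrix of the graph, $I$ is the identity matrix, $\tilde{D}$ is the degree matrix of $A + I$, $σ$ is a nonlinear activation function, $W^{(l)}$ is the trainable weight matrix of the $l$th layer, and $H^{(l)}$ is the activation matrix of the $l$th hidden layer.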
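As a rough illustration of Formula (1), the following NumPy sketch applies one propagation step to a toy three-node path graph. The function names `normalize_adjacency` and `gcn_layer`, the toy graph, and the random weights are assumptions made for illustration only, and ReLU stands in for σ; this is not the implementation used in the paper.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, where D~ is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(axis=1)             # every degree is >= 1 thanks to the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One propagation step: H^{(l+1)} = ReLU(A_hat @ H^{(l)} @ W^{(l)})."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy path graph 0 - 1 - 2 (illustrative data only).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalize_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))   # 3 nodes, 4 input features
W0 = rng.normal(size=(4, 2))   # project to 2 hidden features
H1 = gcn_layer(A_hat, H0, W0)
print(H1.shape)                # (3, 2)
```

Adding the identity matrix before normalizing means each node keeps part of its own features when it aggregates from its neighbors.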

4. Method

4.1. Overview

As shown in Figure 4, AGCN-Domain is an intelligent malicious domain detection system with three main components: data preprocessor, relation graph constructor, and attention-GCN classifier. First, in the data preprocessor, we extracted four basic fields from raw DNS logs (domain, IP, client, and timestamp) to represent domain behaviors in subsequent steps. Then, in the relation graph constructor, we constructed three types of relation graphs over the domains: client relation graph, resolution relation graph, and cname relation graph. The client relation graph and the resolution relation graph are projected from two bipartite graphs, while the cname relation graph is generated from cname records. Finally, we designed an attention-GCN classifier that mines deep features of domains by fusing information from the multiple relation graphs in order to recognize malicious domains.

Figure 4. Architecture of AGCN-Domain.

It should be noted that AGCN-Domain is a universal framework that is not limited to the three types of relations mentioned in this paper and can be extended to far more relations among domains.

4.2. Data Structure

The DNS record dataset we collected includes the following fields: timestamp string, source IP, domain name server IP, protocol name, requested domain name, request record type, answers, and TTLs rejected.

4.3. Data Preprocessor

The main purpose of this step is to formalize dirty raw data and improve system efficacy. We took DNS logs as input and focused only on cname and A records. We filtered out records that meet filtering rules and extracted four fundamental fields (requested domain name, domain-resolve-IP, source IP, timestamp string) from the remaining data. Specific filtering rules are as follows:

Corrupted records. There are some corrupted records from transmission errors in collected raw data, such as an incomplete record missing some fields.
Irregular domains. There are some irregular domains in the original data, which fall into two categories. The first consists of domains that do not comply with domain naming rules, probably due to mistyping or misconfiguration, for example, strings containing commas such as youtube,com. The second consists of domains whose TLD (Top Level Domain) is not registered with IANA, which means that they are invalid Internet domains.
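The two filtering rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (a dict with the keys `domain`, `ip`, `client`, `timestamp`), the function names, the simplified label rule, and the tiny `VALID_TLDS` set are all hypothetical; a real deployment would load the full IANA TLD list and may need to tolerate names (such as those with underscores) that this simplified rule rejects.

```python
import re

# Hypothetical, tiny TLD set for illustration; a real run would load the full IANA list.
VALID_TLDS = {"com", "net", "org", "edu", "cn"}
# One DNS label: letters, digits, hyphens, no leading or trailing hyphen (simplified rule).
LABEL_RE = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")
REQUIRED_FIELDS = ("domain", "ip", "client", "timestamp")

def is_regular_domain(domain):
    """True if every label is well formed and the TLD is in VALID_TLDS."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return False
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return False
    return labels[-1] in VALID_TLDS

def keep_record(record):
    """Drop corrupted records (missing or empty fields) and irregular domains."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    return is_regular_domain(record["domain"])

# Illustrative records, not taken from the paper's dataset.
records = [
    {"domain": "www.example.com", "ip": "192.0.2.1", "client": "10.0.0.1", "timestamp": "t1"},
    {"domain": "youtube,com", "ip": "192.0.2.2", "client": "10.0.0.2", "timestamp": "t2"},
    {"domain": "www.example.com", "ip": "", "client": "10.0.0.3", "timestamp": "t3"},
    {"domain": "host.notatld", "ip": "192.0.2.3", "client": "10.0.0.4", "timestamp": "t4"},
]
clean = [r for r in records if keep_record(r)]
print(len(clean))
```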

4.4. Relation Graph Constructor

In this step, we constructed three relation graphs, client relation graph, resolution relation graph, and cname relation graph, to describe relationships among domains.

4.4.1. Client Relation Graph

Client relation actually indicates that two domains share the same clients within a time window. In order to model such a relation, we first built a bipartite graph among clients and domains, and further generated the client relation graph.

The client–query–domain bipartite graph can be expressed as $G_{cd} = (C, D_{t}, E)$, where $C$ and $D_{t}$ represent all clients in the network and all domains (with timestamps) they queried, respectively, and $E$ is the set of edges between domains and clients. For instance, if client $c_{i}$ requests domain $d_{j}$ at time $t$, then an edge $\langle c_{i}, d_{j}(t) \rangle \in E$ exists to describe this behavior in the client–query–domain graph.

To depict the client relation among domains, we transformed the above client–query–domain graph into a client relation graph $G_{cr} = (D, E, W)$, where $D$ and $E$ are the sets of domains and edges, and $W$ represents the weights of edges. If domains $d_{i}$ and $d_{j}$ share at least one client within one time window, there is an edge $\langle d_{i}, d_{j} \rangle \in E$. The weight of $\langle d_{i}, d_{j} \rangle$ is calculated based on Jaccard similarity:

W(d_{i}, d_{j}) = \frac{| \bigcup_{t = 1}^{n} (C_{i}^{t} \cap C_{j}^{t}) |}{| C_{i} \cup C_{j} |}

(2)

In the above formula, $C_{i}^{t}$ and $C_{j}^{t}$ are the sets of clients that queried domains $d_{i}$ and $d_{j}$ within time window $t$, respectively, $n$ is the number of time windows, and $C_{i}$ and $C_{j}$ are the sets of clients that queried $d_{i}$ and $d_{j}$ anywhere in the network data, respectively.
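A small worked sketch of Formula (2) may help. It assumes the queries have already been bucketed into time windows and are given as `(client, domain, window_index)` tuples; that input layout, the function name, and the sample data are illustrative assumptions rather than the paper's code.

```python
from collections import defaultdict
from itertools import combinations

def client_relation_weights(queries):
    """queries: iterable of (client, domain, window_index) tuples."""
    clients_by_window = defaultdict(lambda: defaultdict(set))  # window -> domain -> clients
    clients_all = defaultdict(set)                              # domain -> all its clients
    for client, domain, window in queries:
        clients_by_window[window][domain].add(client)
        clients_all[domain].add(client)

    # Numerator of Formula (2): clients shared inside at least one window.
    shared = defaultdict(set)
    for per_domain in clients_by_window.values():
        for d_i, d_j in combinations(sorted(per_domain), 2):
            common = per_domain[d_i] & per_domain[d_j]
            if common:
                shared[(d_i, d_j)] |= common

    # Denominator: union of all clients of the two domains over the whole data.
    return {
        (d_i, d_j): len(common) / len(clients_all[d_i] | clients_all[d_j])
        for (d_i, d_j), common in shared.items()
    }

# Illustrative data: c2 queries both domains, but in different windows.
queries = [
    ("c1", "a.com", 0), ("c1", "b.com", 0),
    ("c2", "a.com", 0), ("c2", "b.com", 1),
    ("c3", "b.com", 1),
]
weights = client_relation_weights(queries)
print(weights)
```

In this toy data only c1 queries both domains inside the same window, so the numerator is 1 while the denominator counts all three clients, giving a weight of 1/3.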

4.4.2. Resolution Relation Graph

Resolution relation here indicates that two domains share the same IPs in the whole network data. To model such a relation, we first built a bipartite graph among domains and resolved IPs and further generated the resolution relation.

The domain–resolve–IP bipartite graph can be expressed as $G_{di} = (D, P, E)$, where $D$ and $P$ represent all domains in the network and all IPs on which they were once hosted, respectively, and $E$ is the set of edges between domains and IPs. For instance, if domain $d_{i}$ was once hosted on IP $p_{j}$, then there is an edge $\langle d_{i}, p_{j} \rangle \in E$ to describe this behavior in the domain–resolve–IP graph.

To depict the resolution relation among domains, we transformed the above domain–resolve–IP graph into a resolution relation graph $G_{rr} = (D, E, W)$, where $D$ and $E$ are the sets of domains and edges, and $W$ represents the weights of edges. If domains $d_{i}$ and $d_{j}$ share at least one resolved IP, then there is an edge $\langle d_{i}, d_{j} \rangle \in E$. The weight of $\langle d_{i}, d_{j} \rangle$ is calculated based on Jaccard similarity as follows:

W (d_{i}, d_{j}) = \frac{| P_{i} \cap P_{j} |}{| P_{i} \cup P_{j} |}

(3)

In the above formula, $P_{i}$ and $P_{j}$ are the sets of IPs hosting domains $d_{i}$ and $d_{j}$, respectively.
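The projection from the domain–resolve–IP bipartite graph to the weighted resolution relation graph can be sketched as below. The function name, the `(domain, ip)` input layout, and the sample data are assumptions for illustration. Candidate pairs are enumerated per IP through an inverted index, a design choice of this sketch that avoids comparing domains that share no IP.

```python
from collections import defaultdict
from itertools import combinations

def resolution_relation_graph(resolutions):
    """resolutions: iterable of (domain, ip) pairs taken from A records."""
    ips_of = defaultdict(set)       # domain -> P_i, the IPs that hosted it
    domains_on = defaultdict(set)   # ip -> domains hosted on it
    for domain, ip in resolutions:
        ips_of[domain].add(ip)
        domains_on[ip].add(domain)

    weights = {}
    for hosted in domains_on.values():
        for d_i, d_j in combinations(sorted(hosted), 2):
            if (d_i, d_j) not in weights:
                inter = ips_of[d_i] & ips_of[d_j]
                union = ips_of[d_i] | ips_of[d_j]
                weights[(d_i, d_j)] = len(inter) / len(union)   # Jaccard similarity
    return weights

# Illustrative data only: d9 shares no IP with any other domain.
resolutions = [
    ("d1", "IP1"), ("d1", "IP2"),
    ("d5", "IP1"), ("d5", "IP2"),
    ("d2", "IP2"),
    ("d9", "IP3"),
]
edges = resolution_relation_graph(resolutions)
print(edges)
```

Here d1 and d5 are hosted on exactly the same two IPs and receive weight 1.0, d2 shares one of two IPs with each of them and receives 0.5, and d9 has no edge at all.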

4.4.3. Cname Relation Graph

The cname relation indicates that two domains appear in the same cname record. Unlike the above two relation graphs, the cname relation graph can be generated from raw DNS data directly.

The cname relation graph $G_{nr} = (D, E)$ is an unweighted graph, where $D$ and $E$ are the sets of domains and edges. An edge $\langle d_{i}, d_{j} \rangle$ indicates that domains $d_{i}$ and $d_{j}$ appeared together in at least one cname record.

4.4.4. Graph Pruner

Since local network data are noisy with massive domains, IPs, and clients, we deleted some nodes that cannot provide useful information from client–query–domain graph, domain–resolve–ip graph, and cname relation graph to increase AGCN-Domain’s performance and efficacy. We investigated former works [13,18] and set pruning rules as follows:

Popular domains. The basic intuition is that domains that have been queried by more clients are more likely to be legitimate. The typical example is that famous domains, such as google.com, can be queried by nearly all clients in the monitored local network. Processing such famous popular domains will take a lot of resources; thus, we pruned them to increase the efficacy of the system. A popular domain was defined as requested by more than 25% of clients.
Hyperactive clients. In our data, there are some very active clients that query domains as many as 1,000,000 times in one day. We analyzed them and found that they are proxies or forwarders: there may be hundreds of clients behind one source IP. Such clients cannot provide a valid client relation for domains, so we deleted them. We treated the top 0.1% of clients as hyperactive clients and removed them.
Inactive clients. There are some clients that query only a few domains. Such clients also cannot offer much information; thus, we set a threshold of $N_{i c}$ and removed clients querying fewer domains than this. The $N_{i c}$ was set to 2 in our experiment.
Inactive IPs. As with inactive clients, we removed IPs that host only one domain in our network data.
Exceptions. Similar to previous work [18], we kept malicious domains and their related information even when they matched the above rules, considering that malicious domains are usually inactive in order to avoid detection.
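The client-side pruning rules above can be sketched as follows. Several points are assumptions of this sketch rather than statements from the paper: hyperactivity is ranked by raw query volume, the exception rule is read as "never drop a query to a known malicious domain", and the function name, the `(client, domain)` input layout, and the sample data are hypothetical. The inactive-IP rule would follow the same pattern on the domain–resolve–IP side and is omitted.

```python
from collections import defaultdict

def prune_queries(queries, malicious, popular_ratio=0.25,
                  hyperactive_ratio=0.001, min_domains=2):
    """queries: list of (client, domain) pairs; malicious: set of known-bad domains."""
    domains_of = defaultdict(set)   # client -> distinct domains it queried
    clients_of = defaultdict(set)   # domain -> distinct clients that queried it
    volume = defaultdict(int)       # client -> number of queries
    for client, domain in queries:
        domains_of[client].add(domain)
        clients_of[domain].add(client)
        volume[client] += 1

    n_clients = len(domains_of)
    # Popular domains: queried by more than popular_ratio of all clients.
    popular = {d for d, cs in clients_of.items() if len(cs) > popular_ratio * n_clients}
    # Hyperactive clients: top fraction of clients by query volume (assumption).
    n_hyper = int(hyperactive_ratio * n_clients)
    hyperactive = set(sorted(volume, key=volume.get, reverse=True)[:n_hyper])
    # Inactive clients: fewer than min_domains distinct domains.
    inactive = {c for c, ds in domains_of.items() if len(ds) < min_domains}

    def keep(client, domain):
        if domain in malicious:     # exception rule: always keep known malicious domains
            return True
        return (domain not in popular
                and client not in hyperactive
                and client not in inactive)

    return [(c, d) for c, d in queries if keep(c, d)]

# Illustrative data: 8 clients, google.com is popular, c7 is hyperactive.
queries = ([(f"c{i}", "google.com") for i in range(6)]
           + [("c0", "a.com"), ("c0", "b.com"), ("c1", "a.com"),
              ("c6", "evil.com"), ("c7", "y.com")]
           + [("c7", "x.com")] * 50)
kept = prune_queries(queries, malicious={"evil.com"}, hyperactive_ratio=0.2)
print(kept)
```

With these inputs the query from c6 survives even though c6 is an inactive client, because it targets a known malicious domain.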

4.5. Attention-GCN Classifier

Model. In this step, we took the three graphs that describe relations between domains from different views as input and output deep features for each domain. Considering that diverse relations have different influences on detecting malicious domains, we applied GCN with an attention mechanism in our model. The hidden layers are designed as follows:

H_{cr}^{l + 1} = σ(L_{cr} H_{cr}^{l} W_{cr})

(4)

H_{rr}^{l + 1} = σ(L_{rr} H_{rr}^{l} W_{rr})

(5)

H_{nr}^{l + 1} = σ(L_{nr} H_{nr}^{l} W_{nr})

(6)

H^{L} = \sum_{i \in \{cr, rr, nr\}} \mathrm{softmax}(α_{i}) ⊙ H_{i}^{L}

(7)

L = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}}

(8)

Formulas (4)–(6) extract features of the relation graphs, and the final hidden features are calculated by combining them with the attention mechanism as shown in Formula (7). Here $H_{i}^{l} \in R^{n_{v}^{i} \times d_{v}}$ denotes the features aggregated by the $l$th convolutional layer on relation graph $i$, $i \in \{cr, rr, nr\}$, and $n_{v}^{i}$ denotes the number of nodes in relation graph $i$. $H_{i}^{0} = X$, where $X$ denotes the initial embedding vectors of the nodes, each of dimension $d_{v}$. $L_{cr}$, $L_{rr}$, and $L_{nr}$, each in $R^{n_{v}^{i} \times n_{v}^{i}}$, denote the normalized Laplacian matrices of the three graphs, computed by Formula (8), where $A$ and $D$ represent the adjacency matrix and degree matrix of the corresponding subgraph and $I$ is the identity matrix. $W_{cr}$, $W_{rr}$, $W_{nr} \in R^{d_{v} \times d_{v}}$ denote the trainable weight matrices on the three relation graphs, and $σ(\cdot)$ represents the ReLU activation function.

$α_{i}$ is the attention score of $H_{i}^{L}$, $i \in \{cr, rr, nr\}$. It is determined by the similarity between the hidden features of the domain nodes and the vector $\bar{h}_{i}$, which is the mean of $H_{i}^{L}$ over the domains. $α_{i}$ is calculated as follows:

α_{i} = [α_{i1}, α_{i2}, \ldots, α_{in}]^{\top}

(9)

α_{ik} = -\mathrm{Similarity}(h_{ik}^{L}, \bar{h}_{i}) = -\frac{h_{ik}^{L} \cdot \bar{h}_{i}^{\top}}{\| \bar{h}_{i} \| \cdot \| h_{ik}^{L} \|}

(10)

\bar{h}_{i} = \frac{\sum_{v \in S_{i}} h_{i}^{L}(v)}{|S_{i}|}

(11)

where $S_{i}$ is the set of nodes of relation graph $i$.
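The attention fusion of Formulas (7) and (9)–(11) can be sketched in NumPy as follows. The text does not state over which axis the softmax is taken; this sketch assumes it is taken across the three relations for each node, so that every node receives three weights summing to one. The function names and the random inputs are hypothetical.

```python
import numpy as np

def attention_scores(H):
    """alpha_k = -cos(h_k, h_bar) for every row h_k of H (Formulas (10) and (11))."""
    h_bar = H.mean(axis=0)                                   # mean feature vector
    norms = np.linalg.norm(H, axis=1) * np.linalg.norm(h_bar)
    return -(H @ h_bar) / np.maximum(norms, 1e-12)           # guard against zero vectors

def fuse(H_list):
    """Formula (7), with the softmax taken across relations per node (assumption)."""
    alpha = np.stack([attention_scores(H) for H in H_list])  # shape: (relations, nodes)
    alpha = alpha - alpha.max(axis=0, keepdims=True)         # numerical stability
    weights = np.exp(alpha) / np.exp(alpha).sum(axis=0, keepdims=True)
    fused = sum(w[:, None] * H for w, H in zip(weights, H_list))
    return fused, weights

# Illustrative inputs: 4 nodes, 3 hidden features, one matrix per relation graph.
rng = np.random.default_rng(0)
H_cr, H_rr, H_nr = (rng.normal(size=(4, 3)) for _ in range(3))
fused, weights = fuse([H_cr, H_rr, H_nr])
print(fused.shape, weights.sum(axis=0))
```

Because the score is the negative cosine similarity to the mean, a node whose features in one relation graph deviate from that graph's average receives a larger weight for that relation.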

Isolated points. Each relation graph inevitably contains isolated nodes that have no edges to any other node, because not every domain has all three relations with other domains in the network data. Such nodes degrade our model. Thus, if a node was an isolated point in a graph, we set its hidden vectors for that graph to zero to minimize the negative effect. For instance, the hidden vector $h_{cr}^{l}(d)$ of domain $d$ in layer $l$, where $d$ was an isolated node in the client relation graph, was set to zero (Algorithm 1).
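One simple way to realize this zeroing is to mask the rows of the hidden matrix by node degree. The sketch below assumes a non-negative adjacency matrix without self-loops; the function name and the toy data are illustrative.

```python
import numpy as np

def mask_isolated(A, H):
    """Zero the rows of H for nodes that have no edges in adjacency matrix A."""
    has_edge = (A.sum(axis=1) > 0).astype(H.dtype)   # 1.0 for connected nodes, 0.0 otherwise
    return H * has_edge[:, None]

# Toy graph: nodes 0 and 1 are connected, node 2 is isolated.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
H = np.ones((3, 2))
H_masked = mask_isolated(A, H)
print(H_masked)
```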

Algorithm 1 Attention-GCN Classifier

Input: client relation graph $G_{cr}$; resolution relation graph $G_{rr}$; cname relation graph $G_{nr}$
Parameters: embedding size k; depth of layer L
Output: feature vectors V of nodes (domains)
1: Initialize $H_{cr}^{0} = H_{rr}^{0} = H_{nr}^{0} = X$
2: $L_{cr} \leftarrow \mathrm{NormalizedAdjacencyMatrix}(G_{cr})$
3: $L_{rr} \leftarrow \mathrm{NormalizedAdjacencyMatrix}(G_{rr})$
4: $L_{nr} \leftarrow \mathrm{NormalizedAdjacencyMatrix}(G_{nr})$
5: for $l = 0$ to $L - 1$ do
6: $H_{cr}^{l + 1} = σ(L_{cr} H_{cr}^{l} W_{cr})$
7: $H_{rr}^{l + 1} = σ(L_{rr} H_{rr}^{l} W_{rr})$
8: $H_{nr}^{l + 1} = σ(L_{nr} H_{nr}^{l} W_{nr})$
9: end for
10: $H^{L} = \sum_{i \in \{cr, rr, nr\}} \mathrm{softmax}(α_{i}) ⊙ H_{i}^{L}$
11: $V = FC(H^{L})$
12: return V

Loss Function. For each domain, we obtained a $k$-length vector, which represented its deep features after the above processes. Distinguishing malicious domains could then be regarded as a binary classification problem, so a fully connected layer was used to judge whether a domain was malicious or not. Naturally, the training objective was to minimize the gap between our predicted results and the labels of known nodes. Therefore, the loss could be measured by cross-entropy:

\mathrm{loss} = \sum_{d \in DL} Q(y_{d}', y_{d}) = -\sum_{d \in DL} (y_{d} \log y_{d}' + (1 - y_{d}) \log(1 - y_{d}'))

(12)

where $DL$ is the set of all labeled domains, $Q$ is the cross-entropy function, $y_{d}'$ is the predicted result of domain $d$, and $y_{d}$ is its real label.

\frac{\partial \mathrm{loss}}{\partial y_{d}'} = -(\frac{y_{d}}{y_{d}'} - \frac{1 - y_{d}}{1 - y_{d}'})

(13)

y_{d}' = \mathrm{softmax}[(H^{L} \cdot W_{f} + b_{f}) W_{b} + b_{b}] = \mathrm{softmax}(V \cdot W_{b} + b_{b})

(14)

\mathrm{params} = \mathrm{params} - η \cdot \frac{\partial \mathrm{loss}}{\partial \mathrm{params}}

(15)

where $W_{b}$ and $W_{f}$ are the trainable weight matrices for the output layer and the fully connected layer, respectively, $b_{b}$ and $b_{f}$ are the corresponding bias terms, $\mathrm{params}$ is the set of trainable parameters, and $η$ is the learning rate.
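As a quick sanity check of Formula (13), the analytic derivative of the per-domain cross-entropy can be compared against a central finite difference. The function names and the sample values (label 1, prediction 0.8) are illustrative assumptions.

```python
import math

def bce(y_true, y_pred):
    """Per-domain cross-entropy term of Formula (12)."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

def bce_grad(y_true, y_pred):
    """Derivative of the loss with respect to the prediction, Formula (13)."""
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred))

y, p, eps = 1.0, 0.8, 1e-6
numeric = (bce(y, p + eps) - bce(y, p - eps)) / (2 * eps)   # central difference
analytic = bce_grad(y, p)                                   # -1 / 0.8 = -1.25
print(numeric, analytic)
```

The negative gradient for a malicious label (y = 1) means that the update in Formula (15) moves the parameters in the direction that increases the predicted probability.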

5. Evaluation

In this section, we implement a prototype of our system and evaluate it with multiple experiments from various perspectives.

5.1. Setup

To evaluate our system, we captured one week of DNS logs from a local education network in 2018. In our experiment, we only concentrated on successful A and cname records to obtain information. As described in Section 4, we collected four fields (domain, resolve IP, timestamp, and client) for A records to depict client relation and resolution relation, and two fields (domain and cname result) for cname records to obtain cname relation.

For labeling DNS data, we took ground truth from various sources: (1) Private blacklists/whitelists. The first and foremost ground truth we obtained was the blacklist and whitelist from a large Internet company. (2) Public blacklists/whitelists. We collected Alexa Top 10 K sites [30] for two months in 2018. The domain whose sTLD appeared every day in Alexa was marked as benign. Also, we collected public domain blacklists like 360 DGAs [31], malwaredomainslist.com [32], and Malc0de.com [33]. (3) Security Engine. We further leveraged VirusTotal [34], a popular security engine that has been used in many previous works as ground truth, to check our labels. Finally, we obtained over 10,000 labeled domains, including 2.6 K malicious domains and 8 K benign domains.

We implemented AGCN-Domain in Python 3.6 with PyTorch [35] to build the classifier and Networkx [36] to process the graphs. In the following experiments, unless otherwise stated, we used 5-fold cross-validation to obtain the final results, and the time window for capturing the client relation was set to one hour. Since it is impossible to infer the state of a domain that has no relation to any labeled node, we labeled at least one node in each connected component. In addition, we present the metrics used in the following experiments in Table 1.

Table 1. Confusion Matrix Calculation Table.
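The evaluation reports accuracy, precision, and F1 score. Assuming the standard confusion-matrix definitions (the contents of Table 1 are not reproduced in this text), these metrics can be computed as in the sketch below. The function name and the counts in the example are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (malicious = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the paper's experiments.
metrics = classification_metrics(tp=80, fp=10, tn=900, fn=10)
print(metrics)
```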

5.2. Features

In the following experiments, we evaluated the effectiveness of our features by comparing fused domain relation features with individual relation features.

The results are shown in Table 2, where C-R represents features extracted from only the client relation graph, R-R is for the resolution relation, and N-R is for the cname relation. In these experiments, the embedding size $k$ was set to 10, the number of layers $L$ was set to 1, and the initial label fraction was 70%. The coverage ratio (CovRatio) in the table represents the fraction of domains that a feature can cover. In other words, some domains cannot be predicted from one feature alone because they are isolated in the corresponding graph. Such isolated domains were not counted in the results for the individual feature. It can be seen that the cname relation achieves the best result but covers only a few domains, while our fused features obtain good results and cover far more domains than the cname relation.

Table 2. Effectiveness evaluation of malicious domain classification under different relationship models and the comprehensive assessment of the three relationship models.

5.3. Initial Label Fraction

To further estimate the effectiveness of our system, we varied the initial label fraction of domains from 10% to 90% and observed the results, which are shown in Table 3. As the number of initially labeled domains increases, our system obtains better performance: the F1 score goes up from 0.87 to 0.94. It should be noted that our work can detect malicious domains effectively even with limited labels: it obtains 94% accuracy and 98% precision with only 10% initially labeled domains.

Table 3. Result with different fraction of labels.

5.4. Sensitive Parameters

In this set of experiments, we analyzed the influence of parameters: embedding size k and hidden layer L.

We first set the initial label fraction to 10% and $L = 2$, and experimented with $k$ from 10 to 200. The results are shown in Figure 5. It can be seen from the figure that the embedding size $k$ has little effect on the result: the accuracy and F1 score are almost unchanged as $k$ increases from 10 to 200.

Figure 5. Impact of embedding size k.

Then we set $k = 10$ and the initial label fraction to 10%, and experimented with $L$ from 1 to 7. The results are shown in Figure 6. Unlike the embedding size $k$, $L$ has more impact on the result: as $L$ increases, the performance slightly increases at first, but as $L$ continues to increase, the performance decreases rapidly.

Figure 6. Impact of layer L.

The reason behind this phenomenon is that, at first, additional layers let nodes obtain information from more neighbors, so their features carry deeper information. Once $L$ exceeds a certain value, nodes aggregate too much information from distant nodes, which dilutes their features with irrelevant information.

5.4.1. Comparison with Other Models

We first compare our work with the other two models that are usually used for node classification tasks in the graph, DeepWalk and Node2Vec.

Then, we compare the AGCN-Domain model with the Basic GCN model to highlight the effectiveness of incorporating attention mechanisms in combining three relationship patterns for the detection of malicious domains.

DeepWalk [37]. DeepWalk learns representations of graph nodes from truncated random walks. For this experiment, we applied the DeepWalk model to each relation graph and summed the resulting vectors to obtain the final embedding vector of each node. Then, we used a fully connected layer to distinguish malicious domains.
Node2Vec [38]. Node2Vec aims to learn scalable features of nodes in a graph. Similar to DeepWalk, we applied Node2Vec to each relation graph to obtain a representation of each node, then summed the representations to obtain the final node features. Then, a fully connected layer was applied to predict malicious domains.
Basic GCN [29]. GCN is a well-known model that has shown strong performance in many areas. In this experiment, we applied a basic GCN model to each graph and combined the different relation features without an attention mechanism.

Table 4 shows the result. It can be seen that our work AGCN-Domain can obtain better performance than other models. The attention mechanism can effectively and intelligently combine different relation features with the consideration of their various significance.

Table 4. Comparison with other models with 10% initial labels.

5.4.2. Comparison with Other Malicious Domain Detection Systems

To further illustrate the performance of our system, we compared AGCN-Domain with three similar works [14,15,17] that detect malicious domains based on relations.

Manadhata, et al. [14] constructed a client-querying-domain bipartite graph to depict who is querying what. Then, the researchers labeled domains with ground truth and leveraged belief propagation algorithm to predict unknown domains’ states.
Khalil, et al. [15] generated a domain resolution relation graph to represent whether two domains are sharing common resolutions. Then malicious scores for nodes can be calculated based on their distance from all known malicious domains.
Lei, et al. [17] modeled domain behavior with three domain similarity graphs. The researchers derived domain features with graph embedding techniques from three graphs and detected malicious domains by concatenating these features.

To make the comparison, we implemented the above systems locally according to our understanding of their papers and ran them on the same local DNS data and labels. Table 5 shows the result. It can be seen that AGCN-Domain covers more domains than the other works and also performs better (their limited relations leave some domains uncovered). This is probably because our system leverages more relations and our model can obtain deeper and more essential features.

Table 5. Comparison with other detection systems with 10% initial labels.

6. Conclusions

In this paper, we proposed a novel system named AGCN-Domain, which detects malicious domains by fusing multiple domain relations weighted by their influence. Specifically, AGCN-Domain first analyzes domain behaviors with client-query-domain and domain-resolve-IP graphs. It then describes three relations (client relation, resolution relation, and cname relation), each with a relation graph of domains. Finally, a model composed of a GCN and an attention mechanism extracts deep domain features from the relation graphs and combines them. We conducted multiple experiments from different angles to evaluate AGCN-Domain using 7 days of real-world data.

We compared AGCN-Domain with classical graph representation algorithms, including DeepWalk, Node2Vec, and BasicGCN. These baselines simply sum the node vector representations learned from the three relation graphs, without emphasizing the representation of any particular relation. Consequently, their performance in the experiments is inferior to that of AGCN-Domain, which indicates the effectiveness of the proposed attention mechanism.

We also compared AGCN-Domain with the malicious domain detection systems proposed by Manadhata et al. [14], Khalil et al. [15], and Lei et al. [17], which rely on a single type of relation or combine relation features without weighting them. In contrast, AGCN-Domain integrates the representations of three types of relations through an attention mechanism, thereby highlighting the feature representations of malicious domains and improving the accuracy and F1 score of detection. With a 10% label initialization rate, AGCN-Domain achieves an accuracy of 94.27% and an F1 score of 87.93%, the highest among the compared systems.

Table 5 shows that although AGCN-Domain performs well in terms of accuracy and precision, its recall (0.7967) is lower than that of the other malicious domain detection systems (0.8107 to 0.8736). We speculate that this is because the relationship patterns of some malicious domains are similar to those of benign domains, so that deeper relationship features need to be extracted. In summary, AGCN-Domain integrates domain representation features from three relationship patterns through an attention mechanism. Its overall performance surpasses that of the compared malicious domain detection systems: it balances precision and recall, reduces the number of false alarms, and decreases the workload of security personnel. In future work, we plan to propose a method that can process massive data to improve the efficiency of the detection system and to explore a real-time method of detecting malicious domains.

Author Contributions

Conceptualization and supervision, X.L.; methodology, Y.L. and H.C.; formal analysis, X.L. and Y.L.; investigation and original draft preparation, Y.L. and H.C.; review and editing, X.L. and H.C.; project administration, X.L. and L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is in part supported by the National Key R&D Program of China (No. 2022YFB3104100), National Science Foundation of China (No. 62102109), and the Major Key Project of PCL (No. PCL2021A09, PCL2021A02, PCL2022A03).

Data Availability Statement

The data supporting all the findings in the article are accessible, and no additional source data are required.

Conflicts of Interest

Yixin Li was employed by Big Data Center of State Grid Corporation of China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Big Data Center of State Grid Corporation of China had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AGCN	Graph convolutional network with attention mechanism
GCN	Graph convolutional network
CR	Client relation
RR	Resolution relation
NR	Cname relation

References

Antonakakis, M.; Perdisci, R.; Dagon, D.; Lee, W.; Feamster, N. Building a dynamic reputation system for dns. In Proceedings of the 19th USENIX Security Symposium (USENIX Security 10), Washington, DC, USA, 11–13 August 2010; USENIX Association: Berkeley, CA, USA, 2010; p. 18.
Bilge, L.; Sen, S.; Balzarotti, D.; Kirda, E.; Kruegel, C. Exposure: A passive dns analysis service to detect and report malicious domains. ACM Trans. Inf. Syst. Secur. (TISSEC) 2014, 16, 1–28.
Bilge, L.; Kirda, E.; Kruegel, C.; Balduzzi, M. Exposure: Finding malicious domains using passive dns analysis. In Proceedings of the 18th Annual Network and Distributed System Security Symposium (NDSS2011), San Diego, CA, USA, 6–9 February 2011; pp. 1–17.
Antonakakis, M.; Perdisci, R.; Lee, W.; Vasiloglou, N.; Dagon, D. Detecting malware domains at the upper dns hierarchy. In Proceedings of the 20th USENIX Conference on Security (USENIX Security 11), San Francisco, CA, USA, 8–12 August 2011; USENIX Association: Berkeley, CA, USA, 2011; p. 27.
Chiba, D.; Yagi, T.; Akiyama, M.; Shibahara, T.; Yada, T.; Mori, T.; Goto, S. Domainprofiler: Discovering domain names abused in future. In Proceedings of the 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France, 28 June–1 July 2016; pp. 491–502.
Hao, S.; Kantchelian, A.; Miller, B.; Paxson, V.; Feamster, N. Predator: Proactive recognition and elimination of domain abuse at time-of-registration. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1568–1579.
Schüppen, S.; Teubert, D.; Herrmann, P.; Meyer, U. FANCI: Feature-based automated nxdomain classification and intelligence. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, USA, 15–17 August 2018; pp. 1165–1181.
Yadav, S.; Reddy, A.K.K.; Reddy, A.; Ranjan, S. Detecting algorithmically generated malicious domain names. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, Melbourne, Australia, 1–3 November 2010; pp. 48–61.
Schiavoni, S.; Maggi, F.; Cavallaro, L.; Zanero, S. Phoenix: Dga-based botnet tracking and intelligence. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment; Springer: Berlin/Heidelberg, Germany, 2014; pp. 192–211.
Woodbridge, J.; Anderson, H.S.; Ahuja, A.; Grant, D. Predicting domain generation algorithms with long short-term memory networks. arXiv 2016, arXiv:1611.00791.
Tran, D.; Mac, H.; Tong, V.; Tran, H.A.; Nguyen, L.G. A lstm based framework for handling multiclass imbalance in dga botnet detection. Neurocomputing 2018, 275, 2401–2413.
Xu, C.; Shen, J.; Du, X. Detection method of domain names generated by dgas based on semantic representation and deep neural network. Comput. Secur. 2019, 85, 77–88.
Rahbarinia, B.; Perdisci, R.; Antonakakis, M. Segugio: Efficient behavior-based tracking of malware-control domains in large isp networks. In Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Rio de Janeiro, Brazil, 22–25 June 2015; pp. 403–414.
Manadhata, P.K.; Yadav, S.; Rao, P.; Horne, W. Detecting malicious domains via graph inference. In European Symposium on Research in Computer Security; Springer International Publishing: Cham, Switzerland, 2014; pp. 1–18.
Khalil, I.; Yu, T.; Guan, B. Discovering malicious domains through passive dns data graph analysis. In Proceedings of the ASIA CCS ’16: 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 30 May–3 June 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 663–674.
Sun, X.; Wang, Z.; Yang, J.; Liu, X. Deepdom: Malicious domain detection with scalable and heterogeneous graph convolutional networks. Comput. Secur. 2020, 99, 102057.
Lei, K.; Fu, Q.; Ni, J.; Wang, F.; Yang, M.; Xu, K. Detecting malicious domains with behavioral modeling and graph embedding. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 601–611.
Sun, X.; Tong, M.; Yang, J.; Xinran, L.; Heng, L. Hindom: A robust malicious domain detection system based on heterogeneous information network with transductive classification. In Proceedings of the 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), Beijing, China, 23–25 September 2019; USENIX Association: Berkeley, CA, USA, 2019; pp. 399–412.
Zou, F.; Zhang, S.; Rao, W.; Yi, P. Detecting malware based on dns graph mining. Int. J. Distrib. Sens. Netw. 2015, 11, 102687.
Jia, Y.; Gu, Z.; Jiang, Z.; Gao, C.; Yang, J. Persistent graph stream summarization for real-time graph analytics. World Wide Web 2023, 26, 2647–2667.
Jia, Y.; Gu, Z.; Du, L.; Long, Y.; Wang, Y.; Li, J.; Zhang, Y. Artificial intelligence enabled cyber security defense for smart cities: A novel attack detection framework based on the MDATA model. Knowl.-Based Syst. 2023, 276, 110781.
Jia, Y.; Gu, Z.; Li, A. MDATA: A New Knowledge Representation Model: Theory, Methods and Applications; Springer Nature: New York, NY, USA, 2021; Volume 12647, pp. 1–255.
Lee, J.; Lee, H. Gmad: Graph-based malware activity detection by dns traffic analysis. Comput. Commun. 2014, 49, 33–47.
Peng, C.; Yun, X.; Zhang, Y.; Li, S.; Xiao, J. Discovering malicious domains through alias-canonical graph. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, Australia, 1–4 August 2017; pp. 225–232.
Najafi, P.; Mühle, A.; Pünter, W.; Cheng, F.; Meinel, C. Malrank: A measure of maliciousness in siem-based knowledge graphs. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019; pp. 417–429.
Anderson, H.S.; Woodbridge, J.; Filar, B. Deepdga: Adversarially-tuned domain generation and detection. In Proceedings of the AISec ’16: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, Vienna, Austria, 28 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 13–21.
Fu, Y.; Yu, L.; Hambolu, O.; Ozcelik, I.; Husain, B.; Sun, J.; Sapra, K.; Du, D.; Beasley, C.T.; Brooks, R.R. Stealthy domain generation algorithms. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1430–1443.
Yun, X.; Huang, J.; Wang, Y.; Zang, T.; Zhou, Y.; Zhang, Y. Khaos: An adversarial neural network dga with high anti-detection ability. IEEE Trans. Inf. Forensics Secur. 2019, 15, 2225–2240.
Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. Conference Track Proceedings OpenReview.net 2017.
Alexa. Available online: https://www.alexa.com (accessed on 15 September 2022).
360DGAs. Available online: https://data.netlab.360.com/dga/ (accessed on 15 September 2022).
MalwareDomainList. Available online: https://www.malwaredomainlist.com (accessed on 15 September 2022).
Malc0de.com. Available online: https://malc0de.com/bl/ZONES (accessed on 15 September 2022).
VirusTotal. Available online: https://www.virustotal.com (accessed on 15 September 2022).
Pytorch. Available online: https://pytorch.org (accessed on 10 September 2022).
Networkx. Available online: https://networkx.org (accessed on 10 September 2022).
Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710.
Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.

Figure 1. Client Relations.

Figure 2. Resolution Relations.

Figure 3. Cname Relations.

Figure 4. Architecture of AGCN-Domain.

Figure 5. Impact of embedding size k.

Figure 6. Impact of layer L.

Table 1. Confusion Matrix Calculation Table.

Metric	Definition
TP	The number of malicious domains predicted as malicious
FP	The number of benign domains predicted as malicious
TN	The number of benign domains predicted as benign
FN	The number of malicious domains predicted as benign
Accuracy	(TP + TN)/(TP + FP + TN + FN)
Precision	TP/(TP + FP)
Recall	TP/(TP + FN)
F1	2 × (Precision × Recall)/(Precision + Recall)
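
The definitions in Table 1 translate directly into code. The sketch below is illustrative: the function names and the confusion-matrix counts are hypothetical, while the precision and recall passed to `f1_from` are the AGCN-Domain values reported in Table 3 for 10% initial labels.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts (Table 1)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=8, fp=2, tn=85, fn=5)

# Consistency check against the reported AGCN-Domain row (precision 0.9811, recall 0.7967).
agcn_f1 = round(f1_from(0.9811, 0.7967), 4)
```

With the reported precision and recall, `agcn_f1` evaluates to 0.8793, which matches the F1 score listed for AGCN-Domain.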

Table 2. Effectiveness evaluation of malicious domain classification under different relationship models and the comprehensive assessment of the three relationship models.

Relation Pattern	Accuracy	Precision	Recall	F1	CovRatio
C-R	0.9421	0.9857	0.7991	0.8827	95.04%
R-R	0.9744	0.9650	0.8932	0.9277	66.96%
N-R	0.9955	1.0000	0.8333	0.9091	11.48%
Fused	0.9693	0.9814	0.8915	0.9343	100.0%

Table 3. Result with different fraction of labels.

Initial Labels	Accuracy	Precision	Recall	F1
10%	0.9427	0.9811	0.7967	0.8793
30%	0.9564	0.9770	0.8487	0.9083
50%	0.9643	0.9823	0.8731	0.9245
70%	0.9693	0.9814	0.8915	0.9343
90%	0.9720	0.9867	0.9024	0.9427

Table 4. Comparison with other models with 10% initial labels.

Model	Accuracy	Precision	Recall	F1
DeepWalk	0.9171	0.9332	0.7365	0.8233
Node2vec	0.9212	0.9323	0.7539	0.8337
BasicGCN	0.9341	0.9786	0.7640	0.8581
AGCN-Domain	0.9427	0.9811	0.7967	0.8793

Table 5. Comparison with other detection systems with 10% initial labels.

Method	Accuracy	Precision	Recall	F1
Manadhata [14]	0.8994	0.7998	0.8736	0.8351
Khalil [15]	0.9404	0.8869	0.8107	0.8471
Lei [17]	0.9217	0.7810	0.8676	0.8221
AGCN-Domain	0.9427	0.9811	0.7967	0.8793

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
