Article

FF-MR: A DoH-Encrypted DNS Covert Channel Detection Method Based on Feature Fusion

1 College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2 Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(24), 12644; https://doi.org/10.3390/app122412644
Submission received: 25 October 2022 / Revised: 29 November 2022 / Accepted: 6 December 2022 / Published: 9 December 2022
(This article belongs to the Special Issue Network Traffic Security Analysis)

Abstract:
In this paper, in order to accurately detect Domain Name System (DNS) covert channels based on DNS over HTTPS (DoH) encryption and to solve the problems of weak single-feature differentiation and poor performance in existing detection methods, we design a DoH-encrypted DNS covert channel detection method based on feature fusion, called FF-MR. FF-MR is built on a Multi-Head Attention mechanism and a Residual Neural Network. It fuses session statistical features with multi-channel session byte sequence features, and important features that play a key role in the detection task are screened out of the fused features through the Multi-Head Attention computation. Finally, a Multi-Layer Perceptron (MLP) is used to detect encrypted DNS covert channels. By considering both global and focused features, the main idea of FF-MR is that the degree of correlation between each feature and all other features is expressed as an attention weight, so that features are re-represented as the result of the weighted fusion of all features using the Multi-Head Attention mechanism. Focusing on certain important features according to the distribution of attention weights improves detection performance. While detecting encrypted DNS covert channel traffic, FF-MR can also accurately identify the encrypted traffic generated by three DNS covert channel tools. Experiments on the CIRA-CIC-DoHBrw-2020 dataset show that the macro-averaging recall and precision of FF-MR reach 99.73% and 99.72%, respectively, and the macro-averaging F1-Score reaches 0.9978, which is up to 4.56% higher than the existing methods compared in this paper. FF-MR achieves at most an 11.32% improvement in macro-averaging F1-Score in identifying the three encrypted DNS covert channels, indicating that FF-MR has a strong ability to detect and identify DoH-encrypted DNS covert channels.

1. Introduction

As critical Internet infrastructure, the DNS protocol plays an important role in the translation between domain names and IP addresses. However, DNS requests and responses are transmitted in plaintext, which means that anyone can intercept and view the network access behavior of users between the host and the local DNS server; this is detrimental to the protection of user privacy and network security [1]. Technologies to protect the security of DNS requests have therefore emerged. At present, the three protocols listed in Internet standardization documents are Domain Name System Security Extensions (DNSSEC), DNS over TLS (DoT), and DNS over HTTPS (DoH). DNSSEC mainly uses digital signature technology to protect the integrity and authenticity of the DNS response, but the communication process is still transparent to attackers. DoT and DoH both use TLS encryption, the difference being that the former uses the dedicated port 853, while the latter uses the standard HTTPS port 443, i.e., DoH transfers DNS messages over HTTPS streams. Today, companies such as Google, Cloudflare, and Alibaba offer DoH nodes, and Google's Chrome browser natively supports DoH. In February 2020, Mozilla Firefox began enabling DoH by default for US users, and DNS requests from Firefox were encrypted by DoH and forwarded to resolvers such as Cloudflare [2]. DoH has played an effective role in protecting users' privacy and network security and has thus developed rapidly and seen increasingly wide practical adoption.
With the development of 5G technology, Internet of Things (IoT) systems are gaining momentum. While they bring greater convenience, their use presents many security challenges, e.g., data privacy, unstable network connections, and possible botnets [3]. The DNS system, as the underlying network service, affects the reliability and security of the IoT [4]. DNS technology meets the availability and transparency requirements for deploying the IoT, a large number of heterogeneous devices can use DNS for network access, and DNS encryption better protects the DNS data privacy of users. However, risk accompanies opportunity: the encryption of DoH and the end-to-end device interactivity of the IoT (without human involvement) may jointly lead to attacks that are less likely to be detected, not to mention the emergence of large-scale botnets and DDoS attacks on public IoT services [5].
Therefore, while DoH enhances security, it also provides new opportunities for attackers. In July 2019, Netlab, the cyber threat research division of Qihoo 360, released a report that malware named Godlua used DoH to obtain domains and use them as communication channels for Command and Control (C&C) [6]. In May 2020, Kaspersky also discovered that Iran's APT group OilRig had weaponized DoH and applied it to actual network data theft activities [7]. Conventional, non-encrypted DNS covert channels transmit in plaintext via port 53 in the C&C stage of Advanced Persistent Threats (APTs), whereas DoH uses encrypted transmission via port 443, which is indistinguishable from general HTTPS traffic for network administrators. An attacker can thus hide and encrypt a DNS covert channel via DoH to conduct malicious cyber attacks.
In this paper, we propose a DoH-encrypted DNS covert channel detection method called FF-MR that aims to improve detection performance and solve the problems of weak single-feature differentiation and poor performance in existing research. In summary, the contributions of our paper are three-fold:
  • We summarize and analyze the threat scenario of DoH-encrypted DNS covert channels in the C&C stage, clarify its communication principle, and provide support for the research of detection methods.
  • We propose a DoH-encrypted DNS covert channel detection method (FF-MR) based on feature fusion. FF-MR takes the session as a representation of encrypted DNS covert channel traffic, fuses statistical features with byte sequence features extracted by Residual Neural Networks, and focuses on important features through a Multi-Head Attention mechanism to detect and identify three encrypted DNS covert channels.
  • We conduct comprehensive experiments to evaluate the performance of FF-MR by comparing it with other encrypted DNS covert channel detection methods. We establish four baselines to measure the improvements achieved by the detection model in FF-MR and verify the validity of the model. Finally, recommended values for the hyperparameters are identified using a parameter sensitivity experiment.
The rest of the paper is organized as follows: Section 2 introduces related work on DoH-encrypted DNS covert channel detection. Section 3 summarizes and analyzes the command and control process of the DoH-encrypted DNS covert channel and introduces the research and application of Multi-Head Attention. In Section 4, we present the design details of FF-MR. In Section 5, we experimentally evaluate the performance of FF-MR on the publicly available CIRA-CIC-DoHBrw-2020 dataset. Section 6 concludes this paper.

2. Related Work

In this section, we mainly present the existing research on DoH-encrypted DNS covert channel detection. Most previous studies have used statistical features and the CIRA-CIC-DoHBrw-2020 dataset in their experiments. Detection is generally performed in two layers: the first layer separates non_DoH (HTTPS) traffic from DoH traffic, and the second layer classifies DoH traffic into normal DoH and malicious DoH traffic, i.e., DoH-encrypted DNS covert channel traffic, as shown in Table 1.
Banadaki et al. [8] performed a statistical analysis of DoH traffic, extracted a total of 34 classes of statistical features, including IPs and ports, and used machine learning algorithms such as LGBM and XGBoost to perform two-level classification. However, the source and destination IPs, which are directly tied to the data itself, were used as a basis for classification, so the resulting experimental results were not objective. MontazeriShatoori et al. [9] proposed arranging the captured packets in temporal order; a set of consecutive packets in the same direction within a certain time threshold is called a packet cluster. In that study, 28 classes of statistical features were extracted, and traditional machine learning algorithms, including Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), and Long Short-Term Memory (LSTM) Neural Networks, were used to distinguish non_DoH from DoH and normal DoH from malicious DoH traffic. Nguyen et al. [11] proposed a two-layer classification of DoH based on a Transformer containing a four-layer encoder and a six-layer decoder, using statistical features as input. They also used an ELK stack architecture comprising four modules: Elasticsearch, Logstash, Kibana, and Beats. Finally, a Security Operation Center (SOC) system enabled the monitoring and detection of malicious DoH traffic in enterprise-level networks.
For binary classification, Al-Fawa'reh [10] combined statistical feature analysis with a Bidirectional Recurrent Neural Network (Bi-RNN) to detect DoH-encrypted DNS covert channels. Zhan et al. [12] established a TLS fingerprint whitelist based on information from the TLS handshake stage, where TLS fingerprints not on the whitelist are considered suspicious DoH traffic; in addition, normal DoH and encrypted DNS covert channels are classified according to the statistical features of the DoH flow. What distinguishes Al-Fawa'reh's work from other studies is that attack scenarios are simulated; data on the location (latency), number, sending interval (rate), and packet (domain name) length of different DoH servers are generated; and a large number of adversarial and evaluation experiments are conducted to verify the effectiveness of the method using three machine learning models.
The above studies only investigate the detection of encrypted DNS covert channels; there is less work in the literature on the identification of DoH-encrypted DNS covert channels. Zebin et al. [14] proposed an interpretable machine learning approach using ten-fold cross-validation that performs three-class classification of HTTPS, normal DoH, and DoH-encrypted DNS covert channel traffic by stacking RF-based classifiers and, finally, tested the performance of the model in identifying encrypted DNS covert channels; the model could only achieve an accuracy of about 92%. Mitsuhashi et al. [13] chose three machine learning algorithms, XGBoost, LightGBM, and CatBoost, to implement a three-stage detection of HTTPS and DoH, normal DoH, and DoH-encrypted DNS covert channel traffic, respectively, followed by the classification and identification of the encrypted DNS covert channels.
By summarizing the existing research on DoH-encrypted DNS covert channel detection, we identify the following limitations:
  • There are few works related to DoH-encrypted DNS covert channel detection and identification in existing studies, and the performance in this area still needs to be improved;
  • Most existing studies use statistical features as the sole basis for detection, which makes it easy for attackers to evade detection methods that rely on single features;
  • The role of byte sequence features and of combining multiple features is ignored, failing to meet the requirements of encrypted DNS covert channel detection.
In summary, existing studies lack work on DoH-encrypted DNS covert channel identification, and detection performance still needs to be improved. Therefore, in this paper, we focus on the detection and identification of DoH-encrypted DNS covert channels and use Multi-Head Attention to compute the correlation between statistical features and session byte sequence features globally, obtaining weighted fusion features as the basis for detection and identification. Because the correlations between global features are extracted and the key features are highlighted, detection and identification performance are further improved.

3. Background

We elaborate and analyze the mechanism of the DoH-encrypted DNS covert channel in Section 3.1 and formally describe the principle of the Multi-Head Attention mechanism in Section 3.2.

3.1. DoH-Encrypted DNS Covert Channel

In this paper, we focus on DoH-encrypted DNS covert channels. Domain name resolution using DoH occurs in two cases. The first is to use browsers that support the DoH protocol, such as Google Chrome and Mozilla Firefox, where all DNS traffic is directly encapsulated into TLS-encrypted HTTP messages and sent to DoH servers, which then forward the queries to domain name servers on the Internet for resolution. The second is to use hosts that do not support the DoH protocol by building a local DoH proxy for forwarding (available proxy tools include QuantumultX, Surge, Loon, etc.): the host forwards all network requests to the local DoH proxy, the proxy forwards the DoH traffic to an Internet DoH server, and, finally, the DoH server performs domain name resolution.
As shown in Figure 1, in the C&C stage of an APT attack, the data carrier in the DoH-encrypted DNS covert channel does not use the DNS covert channel directly; rather, the attack is implemented by encapsulating it into a TLS-encrypted HTTP message. Firstly, the victim host sends a DoH request containing the DNS covert channel domain name, updata.tunnel.com, through a local DoH proxy or directly to the DoH server. Here, updata refers to sensitive information leaked from the victim host or command requests sent to the attacker. Secondly, the DoH server parses the DNS request and performs an iterative query, which is eventually forwarded to a disguised authoritative domain name server controlled by the attacker, i.e., the C&C server. Finally, the attacker obtains the updata sent by the victim host through the C&C server. The attacker also issues commands, i.e., downdata, through the disguised authoritative domain name server and delivers the downdata to the victim host via DNS and DoH responses.
In general, the DoH-encrypted DNS covert communication between the attacker and the victim host is similar to the scenario of non-encrypted DNS covert communication: the principles of building DNS covert communication are the same. As shown in Figure 1, both updata and downdata are iteratively queried, and the disguised authoritative domain name server is used as the C&C server to relay between the attacker and the victim host.
The difference between a non-encrypted DNS covert channel and a DoH-encrypted DNS covert channel is reflected in two aspects. First, as a data carrier for leakage and command and control, the DNS covert channel in DoH traffic is encrypted, making it impossible to apply existing deep packet inspection techniques. Second, the DNS covert channel is hidden in HTTPS traffic, and the DoH server acts as the local DNS server that forwards DNS traffic, which also prevents the local network administrator from discovering the victim's malicious activities through DNS. At the same time, the victim host reduces the frequency of DNS requests, lowering the suspicion of malicious activity. These two characteristics pose a greater challenge for DoH-encrypted DNS covert channel detection.

3.2. Multi-Head Attention Mechanism

Attention mechanisms were first proposed in the field of image processing [15]. In 2014, the Google DeepMind team combined an RNN with an attention mechanism and applied it to an image classification task [16]; the idea was then further developed and expanded by researchers. Many different attention mechanisms have evolved in different fields, including basic attention mechanisms, such as Soft Attention, Hard Attention, and Self-Attention, and combined attention mechanisms, such as Co-Attention, Attention-over-Attention, and Multi-Head Attention [17]. Although these attention mechanisms differ, their basic implementation principles are similar. This section provides a brief overview of the Multi-Head Attention mechanism by summarizing the general implementation principles of the attention mechanism.
In essence, the attention mechanism can be summarized as filtering out important and noteworthy information by computing the weight distribution of attention within or among data. It usually involves three variables, Query, Key, and Value (Q, K, V), which represent the target data encoding, the source data encoding, and the content data encoding, respectively. The calculation can be divided into two steps: first, the similarity between the target data Q and the source data K is computed; second, the new data representation V' is computed from the similarity and V:
e = g(f(Q, K))
V' = m(e, V)
where the similarity between Q and K is calculated by the energy function f [18] and the distribution function g to obtain the attention weight distribution e. Then, using the transformation function m, the new data representation V' is obtained by combining e with V. Usually, the distribution function g is softmax normalization, while the transformation function m is a weighted summation. Common energy functions f include additive and dot-product functions [19], which are calculated as follows:
f(Q, K) = v^T act(W_1 K + W_2 Q + b)
f(Q, K) = Q^T K
where act is a nonlinear activation function, such as tanh or ReLU; v^T is a parameter vector; b is the neuron bias; and W_1 and W_2 are weight matrices.
The Self-Attention mechanism was proposed in 2017 by Vaswani et al. [20] for computing the correlation between words within a sentence to extract syntactic and semantic features. The difference compared to the general attention mechanism is that Q = K = V ; that is, it only focuses on the interdependencies between elements within the data.
The Multi-Head Attention mechanism is a kind of combined attention mechanism; it can greatly improve data fitting ability and enrich the feature representation by combining multiple attention heads head_i and jointly extracting information from different representation subspaces:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = m(g(f(Q, K)), V)
It should be noted that the Multi-Head Attention mechanism used in our paper combines multiple Self-Attention mechanisms.
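To make the computation above concrete, the following is a minimal PyTorch sketch of scaled dot-product self-attention and its multi-head combination; the projection matrices W_q, W_k, W_v, W_o and the head count are caller-supplied illustrative parameters, not values taken from our model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # f(Q, K) = QK^T / sqrt(d_k), g = softmax, m = weighted sum with V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    e = F.softmax(scores, dim=-1)        # attention weight distribution e
    return e @ V                         # new data representation V'

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    # Self-Attention: Q = K = V = X; each head i has its own projections.
    heads = [scaled_dot_product_attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(len(W_q))]
    return torch.cat(heads, dim=-1) @ W_o   # Concat(head_1, ..., head_h) W^O
```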
Nowadays, attention mechanisms are widely used in natural language processing [21], as well as in autonomous driving [22] and human–computer interaction [23], among other areas. After the major breakthrough by the Google DeepMind team using attention mechanisms in image processing, researchers also adopted them in natural language processing. Bahdanau et al. [24] first used the attention mechanism to solve the word alignment problem for indeterminate-length sentences in machine translation. Attention mechanisms are also gradually being taken up in the field of cyberspace security [25,26]. In their study of abnormal traffic and encrypted malicious traffic detection, Jiang et al. [27] used LSTM with CNN to extract spatio-temporal features of packets on the CICAndMal2017 dataset and further used a Multi-Head Attention mechanism to extract sequence features of multiple packets in a session. Wang et al. [28] deployed a single-layer Self-Attention mechanism on the CIC-IDS-2017 dataset to learn the correlations and dependencies within statistical features in order to detect abnormal and attack traffic. Dong et al. [29] added convolutional operations between multi-layer Self-Attention mechanisms to improve performance over models such as GoogLeNet and ResNet-50 on the NSL-KDD dataset. For encrypted traffic classification, Lin et al. [30] proposed the ET-BERT (Encrypted Traffic Bidirectional Encoder Representations from Transformer) model. Based on the BERT model [31], traffic is converted to tokens for pre-training. They proposed two fine-tuning strategies, packet-level fine-tuning for single-packet classification and stream-level fine-tuning for single-stream classification, and verified the robustness and generalization ability of the model on five encrypted traffic datasets and a TLS 1.3 dataset.

4. Method Design

In this section, we design a DoH-encrypted DNS covert channel detection method named FF-MR based on feature fusion. FF-MR, which includes a Multi-Head Attention mechanism and a Residual Neural Network, fuses statistical features and byte sequence features. Its framework is shown in Figure 2.
FF-MR is mainly divided into three parts: data preprocessing; statistical feature and session representation extraction; and a DoH-encrypted DNS covert channel detection model based on Multi-Head Attention and a Residual Neural Network. The proposed method performs five-category classification of HTTPS (i.e., non_DoH), normal DoH (i.e., benign_DoH), and three kinds of malicious DoH traffic (i.e., iodine, dnscat2, and dns2tcp). Iodine, dnscat2, and dns2tcp are the three kinds of malicious DoH traffic generated by encrypted DNS covert channel tools, that is, DoH-encrypted DNS covert channel traffic.
Firstly, the data preprocessing module splits and reorganizes the raw pcap file into sessions; then, we clean these sessions by filtering and anonymizing them.
Secondly, session representation and statistical features are extracted, and after standardization and normalization, they are used as the input of the DoH-encrypted DNS covert channel detection model.
Finally, in the DoH-encrypted DNS covert channel detection model based on Multi-Head Attention and a Residual Neural Network (MHA-Resnet), byte sequence features are extracted by the Residual Neural Network, and the Multi-Head Attention mechanism computes the weighted fusion of session statistical features and byte sequence features, making the distinction between features from different traffic sources more pronounced. The classification of the four kinds of DoH traffic and HTTPS traffic is then performed by a Multi-Layer Perceptron (MLP) to detect and identify DoH-encrypted DNS covert channels.

4.1. Data Preprocessing

Data preprocessing is divided into two steps—traffic splitting and traffic cleaning—in order to obtain traffic representation suitable for the detection model and to remove any invalid data mixed in with the original traffic that could reduce the classification performance of the model.

4.1.1. Traffic Splitting

We split the original pcap file into multiple temporally contiguous sets of packets according to certain rules. The five-tuple (tuple) contained in each packet comprises the source IP address srcIP, destination IP address dstIP, source port srcPort, destination port dstPort, and transport layer protocol type protocol. The ith packet p_i can be defined by its start transmission time time_i, five-tuple tuple_i, and payload payload_i as follows:
tuple = (srcIP, dstIP, srcPort, dstPort, protocol)
p_i = (time_i, tuple_i, payload_i).
A flow is the set of packets with the same tuple: all packets in a flow have the same origin and destination, independent of time and payload. A session is a bidirectional flow. Packets in a session have either the same tuple or the reversed tuple', in which the source and destination IP addresses and ports of tuple are swapped. Therefore, even though the packets do not all have exactly the same five-tuple, they are still considered to belong to the same session. Flow and session are expressed as:
flow = {p_1, p_2, ..., p_N}, where tuple_1 = tuple_2 = ... = tuple_N
session = {p_1, p'_1, p_2, p'_2, ..., p_N, p'_M}, where tuple_1 = tuple_2 = ... = tuple_N and tuple'_1 = tuple'_2 = ... = tuple'_M.
The raw pcap file is split into sessions using the SplitCap tool, which can optionally split by flow instead. The tool can also either keep the data of all protocol layers or keep only the data above the transport layer. Since we need to extract session statistical features, we retain all of the session's protocol layers when splitting the traffic.
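The bidirectional grouping rule itself can be sketched as follows; this is only an illustration of the five-tuple matching described above (assuming the packets are already parsed into (time, tuple, payload) triples), not a replacement for SplitCap.

```python
from collections import defaultdict

def session_key(tuple5):
    # (srcIP, srcPort) and (dstIP, dstPort) are interchangeable in a session,
    # so the two endpoints are sorted into a canonical order.
    src_ip, dst_ip, src_port, dst_port, protocol = tuple5
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (protocol,) + (a + b if a <= b else b + a)

def split_into_sessions(packets):
    # packets: iterable of (time, tuple5, payload) triples as defined above;
    # each returned value is one session containing both directions.
    sessions = defaultdict(list)
    for time, tuple5, payload in packets:
        sessions[session_key(tuple5)].append((time, tuple5, payload))
    return sessions
```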

4.1.2. Traffic Cleaning

We sort, filter, and anonymize the sessions. Firstly, sessions obtained from traffic splitting are classified and sorted into five categories according to the detection results in Figure 2, namely, iodine [32], dnscat2 [33], dns2tcp [34], benign_DoH, and non_DoH, where iodine, dnscat2, and dns2tcp represent malicious_DoH sessions from different types of DoH-encrypted DNS covert channels. Benign_DoH refers to normal DoH sessions, whose packets are encrypted DNS packets without DNS covert channels, and non_DoH refers to HTTPS sessions.
Secondly, because session sizes are uneven, it is necessary to filter out sessions with too little data. The main principle of filtering is to remove sessions with fewer packets than min_window_size: the raw pcap files corresponding to such sessions may be incomplete, and the TLS handshake information needed for model classification may be missing, which would greatly reduce classification performance. Furthermore, since the input of the detection model, MHA-Resnet, has a fixed length, the session length in bytes needs to be unified. If a session contains few packets, the number of session bytes will be small, the extracted byte sequence will be too short, and a large amount of zero-padded byte data will be generated when the length is unified, which may also hurt classification performance. Since the TCP connection and TLS handshake are generally completed before the sixth packet of a session, min_window_size is set to six in this paper.
Finally, the tuple of packets in a session is either identical or reversed, and classification would be directly influenced by the tuple, resulting in classification based exclusively on the tuple rather than on the features of the session, which would greatly degrade the detection and identification performance of MHA-Resnet. Therefore, sessions need to be anonymized. Specifically, the port, IP, and MAC address of each packet in a session are overwritten with all zeros. In this way, the impact of these specific fields on classification is minimized.
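As an illustration, the filtering and anonymization of a single session file might look like the following Scapy-based sketch; the file paths are hypothetical, the per-session pcap files are assumed to have been produced by SplitCap, and checksum recomputation is omitted for brevity.

```python
from scapy.all import rdpcap, wrpcap, Ether, IP, TCP

MIN_WINDOW_SIZE = 6  # TCP connection + TLS handshake usually finish by packet 6

def clean_session(pcap_in, pcap_out):
    packets = rdpcap(pcap_in)
    if len(packets) < MIN_WINDOW_SIZE:
        return False  # drop sessions too short to contain handshake information
    for pkt in packets:
        if Ether in pkt:   # zero the MAC addresses
            pkt[Ether].src = pkt[Ether].dst = "00:00:00:00:00:00"
        if IP in pkt:      # zero the IP addresses
            pkt[IP].src = pkt[IP].dst = "0.0.0.0"
        if TCP in pkt:     # zero the ports
            pkt[TCP].sport = pkt[TCP].dport = 0
    wrpcap(pcap_out, packets)
    return True
```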

4.2. Statistical Features and Session Representation Extraction

After preprocessing, we extract the statistical features and session representation as input to the detection model in two steps, as described in Section 4.2.1 and Section 4.2.2.

4.2.1. Session Representation Extraction

Because statistical features alone are insufficient to detect DoH-encrypted DNS covert channels, we intercept a string of bytes from each session after traffic cleaning to use as input to the detection model. Our analysis of HTTPS and DoH traffic in the CIRA-CIC-DoHBrw-2020 dataset shows a certain difference in packet size between the two in the TCP connection stage, mainly reflected in the optional fields of the TCP headers. For example, to avoid sequence number wraparound, most DoH traffic contains a TSval field for reliable transmission. In addition, because DoH traffic needs to query the domain name, the response time is longer, and time-related fields in the TCP header, such as TSval and Timestamps, reflect this communication delay; these can also be used as TCP transmission features to distinguish HTTPS from DoH traffic.
The distinction between normal DoH and malicious DoH traffic, i.e., the DoH-encrypted DNS covert channel, is mainly reflected in the TLS handshake stage, where the communicating parties negotiate plaintext information such as the TLS version, extensions, cipher suite, certificate, and elliptic curve type used for encryption and decryption. To a certain extent, this plaintext information reflects the trustworthiness of the encrypted session. Due to the lack of security and formality guarantees for malicious DoH traffic, its plaintext information differs from that of normal DoH traffic; for example, malicious traffic is more likely to use a lower version of the encryption algorithm, whereas normal encrypted traffic mostly uses certificates with the highest trust levels, such as Extended Validation SSL Certificates (EV SSL) [35]. In related studies, the certificate information and Client Hello message have also been verified to provide a good degree of differentiation [36,37]. For different forms of malicious DoH traffic, the TLS handshake information negotiated by the encrypted DNS covert channel, such as the cipher suite and elliptic curve type, is not consistent, so this information can be used as an effective feature to detect and identify encrypted malicious DNS covert channels.
In summary, the packet sizes during the TCP connection stage, the timestamp fields of TCP packets, and the non-encrypted messages in the TLS handshake stage all reflect the communication behavior of different types of traffic. Therefore, instead of focusing on data below the network layer, we concatenate the TCP-layer and TLS-layer data of each packet and extract the first n bytes as the session representation. The number of bytes n is a hyperparameter of the detection model; 512, 1024, 2048, 4096, and 8192 bytes are evaluated in the experiment in Section 5.3. Based on the experimental results, we choose n = 1024 bytes as the session representation input to the DoH-encrypted DNS covert channel detection model. The byte vector X^i of the ith session after normalization can be expressed as:
X^i = (x_1^i, x_2^i, ..., x_k^i, ..., x_n^i)
where x_k^i is the kth byte of the ith session.
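A minimal sketch of this extraction step, assuming the TCP-layer-and-above bytes of each packet have already been parsed out of the session:

```python
import numpy as np

N_BYTES = 1024  # n, chosen via the sensitivity experiment in Section 5.3

def session_representation(tcp_payloads, n=N_BYTES):
    # tcp_payloads: per-packet bytes from the TCP layer upward (TCP header +
    # TLS data), in packet order; data below the network layer is excluded.
    stream = b"".join(tcp_payloads)[:n]
    stream += b"\x00" * (n - len(stream))  # zero-pad sessions shorter than n
    return np.frombuffer(stream, dtype=np.uint8).astype(np.float32) / 255.0
```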

4.2.2. Session Statistical Features Extraction

We extract a total of 29 session statistical features in five categories, as shown in Table 2. The session duration, number of bytes, packet length, packet time, and request/response time difference are counted. For each of the three features other than session duration and number of bytes (i.e., packet length, packet time, and request/response time difference), we calculate the mean, median, mode, variance, standard deviation, coefficient of variation, skew from median, and skew from mode. The rates at which session bytes are sent and received are also calculated. These five categories of features characterize DoH-encrypted DNS covert channel traffic. For example, DoH-encrypted DNS covert channels carry covertly transmitted TCP traffic, which requires more data to be sent, so the session duration, number of bytes, and packet lengths are larger. Compared with normal DoH and HTTPS traffic, DoH-encrypted DNS covert channel traffic usually has a lower cache hit rate, which leads to higher latency, a higher packet sending frequency, and a larger time difference between request and response.
Experimental verification shows that these features represent the difference between DoH-encrypted DNS covert channels and normal DoH and HTTPS traffic well. Because the statistical features have different magnitudes and their numerical values vary greatly, they must be standardized to maintain numerical sensitivity. The standardized statistical features can be expressed as:
S^i = (s_1^i, s_2^i, ..., s_k^i, ..., s_{29}^i)
where S^i is the statistical feature vector of the ith session, and s_k^i is the kth statistical feature value of the ith session.
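As an illustration, one statistic group from Table 2 (e.g., over per-packet lengths) could be computed as in the sketch below; it assumes SciPy >= 1.9 and uses Pearson's skewness formulas for skew from median and skew from mode, which is one plausible reading of those features.

```python
import numpy as np
from scipy import stats

def feature_group(values):
    # mean, median, mode, variance, standard deviation, coefficient of
    # variation, skew from median, and skew from mode, as listed in Table 2
    v = np.asarray(values, dtype=np.float64)
    mean, median = v.mean(), np.median(v)
    mode = stats.mode(v, keepdims=False).mode
    std = v.std()
    return [mean, median, mode, v.var(), std,
            std / mean if mean else 0.0,                # coefficient of variation
            3 * (mean - median) / std if std else 0.0,  # skew from median
            (mean - mode) / std if std else 0.0]        # skew from mode
```

Standardizing the resulting 29-dimensional vector is then a routine z-score transform fit on the training set.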
We improved the statistical feature extraction tool DoHMeter [38]. The original DoHMeter extracts statistical features from time-divided unidirectional flows, whereas the improved DoHMeter extracts statistical features in session (bidirectional flow) units. Compared with the standard tool, the features extracted by the improved DoHMeter are more complete and sufficient, and experimental verification shows they are more effective for detection.

4.3. Model Development and Architecture

The MHA-Resnet architecture includes three parts: the extraction of session byte sequence features, the weighted fusion of session statistical features and byte sequence features, and session classification. The model is based on the Residual Neural Network. The Multi-Head Attention mechanism is used to globally weight and fuse the features to improve the detection performance of the model. The structure of MHA-Resnet is shown in Figure 3. In order to learn the different communication behaviors of five types of traffic and the mode of TLS encrypted connection, session statistical features and byte data are taken as the input of the model. Byte sequence features, including TCP transmission features, TLS handshake features, and local patterns of DoH-encrypted DNS covert channels, are extracted by the multiple one-dimensional convolutional layers (Conv1D) of the Residual Neural Network, which are then concatenated with session statistical features. The attention weight distribution between all features is calculated by the Multi-Head Attention mechanism, which re-represents the features as the result of a weighted fusion. The output vector of the model is the probability that a session is judged as each of the five types of traffic (i.e., iodine, dnscat2, dns2tcp, benign_DoH, non_DoH), and the label of the maximum probability in the output vector is taken as the classification result.
The model incorporates residual connections in both of the above neural networks and adaptively adjusts the effective number of network layers to the task by letting redundant layers learn an identity mapping when they are unnecessary. This mitigates, to some extent, the negative impact of the degradation problem in deep neural networks on model performance [39].

4.3.1. Session Byte Sequence Features Extraction

We adopt a Residual Neural Network to extract the session byte sequence features, treating bytes as words in Natural Language Processing (NLP) tasks and session byte sequences as sentences, obtaining the contextual associations of the bytes in a session through Conv1D and then extracting combined byte information at the field and message levels. Shallow convolutions capture the contextual association of bytes inside the TCP header and TLS handshake messages, where the combined byte information is at the field level, corresponding to the packet size in the TCP connection stage, the timestamp, and the TLS certificate mentioned in Section 4.2, i.e., the TCP transmission features and TLS handshake features. Deep convolutions capture the contextual association of the TCP and TLS messages in a session, where the combined byte information is at the message level, corresponding to the extraction of correlations between adjacent TCP and TLS messages, i.e., the local pattern of the DoH-encrypted DNS covert channel during transmission.
As shown in Figure 3, in order to process network traffic data with a one-dimensional sequence structure, the Residual Neural Network is based on one-dimensional convolution, and its main body consists of four residual layers (ResLayer). Each residual layer consists of two residual blocks (ResBlock), and each basic residual block uses two sets of one-dimensional convolutional layers (Conv1D) with batch normalization layers (BatchNorm). The residual layers differ in that the first residual block in each of the last three residual layers adds a downsampling operation, whose structure is shown in Figure 4.
Firstly, the dimensionality of the first n bytes X^i of the ith session is raised by a one-dimensional convolutional layer followed by batch normalization. Specifically, the multi-channel feature matrix X'^i is obtained by multiple convolution kernels, where each kernel acts as a feature extractor, extracting different convolutional features over adjacent bytes. Batch normalization is mainly used to counter gradient vanishing or explosion in deep neural networks.
Secondly, the four residual layers extract convolutional features of adjacent fields or messages. A preliminary convolution of X'^i is performed in the first residual layer, producing an output with the same dimension as the input, and the next three residual layers extract sequence features at different step lengths using the downsampling operation. Downsampling is performed by decreasing the size of the convolution kernel and increasing the step length, extracting sequence features spanning multiple fields while ensuring the tensor dimensions match for the residual connection. The multi-channel session byte sequence features X''^i are calculated as follows:
X'^i = BatchNorm(Conv1D(X^i))
X''^i = ResLayer4(ResLayer3(ResLayer2(ResLayer1(X'^i))))
Finally, the multi-channel session byte sequence features X''^i are input into the neural network composed of the Multi-Head Attention mechanism for weighted feature fusion. The main significance of the multi-channel features is to fully characterize the different TCP transmission features, TLS handshake features, and local transmission patterns of DoH-encrypted DNS covert channels extracted by the multiple convolution kernels and step lengths, and then correlate them with the statistical features, thus filtering out unnecessary features. In parallel, a global one-dimensional average pooling operation (AvgPool1D) averages the features in each channel to simplify computation, yielding the session byte sequence feature vector Res_X^i extracted by the Residual Neural Network:
Res_X^i = AvgPool1D(X''^i)
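A condensed PyTorch sketch of this backbone follows. It uses the standard ResNet downsampling pattern (a strided convolution plus a 1x1 shortcut); the channel widths and kernel sizes are illustrative assumptions, since the exact structural parameters are given in Table 4.

```python
import torch.nn as nn

class ResBlock1D(nn.Module):
    # Two Conv1D + BatchNorm pairs with a residual connection; the first block
    # of a downsampling layer uses stride > 1, matched by a 1x1 shortcut conv.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm1d(out_ch), nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch))
        self.down = (nn.Sequential(nn.Conv1d(in_ch, out_ch, 1, stride=stride),
                                   nn.BatchNorm1d(out_ch))
                     if stride != 1 or in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + self.down(x))

def res_layer(in_ch, out_ch, stride):
    return nn.Sequential(ResBlock1D(in_ch, out_ch, stride),
                         ResBlock1D(out_ch, out_ch))

backbone = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=7, padding=3), nn.BatchNorm1d(64),  # X -> X'
    res_layer(64, 64, 1),      # ResLayer1: no downsampling
    res_layer(64, 128, 2),     # ResLayer2-4: first block downsamples
    res_layer(128, 256, 2),
    res_layer(256, 512, 2),    # output X''
    nn.AdaptiveAvgPool1d(1))   # AvgPool1D; squeezing the last dim gives Res_X
```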

4.3.2. Weighted Fusion of Session Statistical Features and Byte Sequence Features

In existing studies, the byte sequence features or statistical features alone are not enough to detect and identify encrypted DNS covert channels. Multifaceted features, such as session statistical features, TCP transmission features, TLS handshake features, and encrypted DNS covert channel transmission patterns of the same category of traffic, are not independent but have some correlation and are uniformly related to the behavior patterns of normal or malicious DoH traffic. Therefore, on the basis of the Residual Neural Network extracting byte sequence features, we also adopt the Multi-Head Attention mechanism in MHA-Resnet. The Self-Attention mechanism in Multi-Head Attention treats the distance between any two features as one and can obtain the global correlation between features, with the purpose of focusing on important features and ignoring redundant and useless features by assigning weights. Specifically, the Self-Attention mechanism expresses the global correlation and dependency between the above multi-faceted features as an attention weight matrix: the stronger the correlation between two features, the larger the weight. The more important a feature is, the more strongly correlated it is with multiple other features. The fusion features with greater distinction for detection are obtained by the weighted summation of all features. This method, which considers both global and focused features, solves the problem of long-distance dependency in RNN, highlights important features and their mapping relationships in the overall features using the correlations between multi-faceted features, and further improves detection performance.
The interpretability of the global feature correlations extracted by the Self-Attention mechanism can be illustrated by the attention distribution in machine translation shown in Figure 5a. The solid lines indicate the referential and correlation relationships between words. For example, the words related to the word “it” through the learning of the Self-Attention mechanism include “The”, “cat”, “street”, “it”, and the punctuation “.”; the strongest correlation is with the word “cat”, which is consistent with the semantics of the sentence. In this case, the correlation is a semantic-grammatical feature. Similarly, applying the Self-Attention mechanism to DoH traffic, the global correlation of features can be viewed as the connection between the multi-faceted features embodied in the activities and behaviors of encrypted DNS covert channels, i.e., the correlation relationships within and between the session statistical features and byte sequence features.
The Multi-Head Attention mechanism integrates multiple Self-Attention mechanisms to improve the robustness and generalization of MHA-Resnet by learning the features of different representation subspaces. As shown in Figure 5b, the distribution of attention learned by another Self-Attention mechanism is different from Figure 5a. Here, the word “it” has a strong correlation with “street”.
As shown in Figure 6a,b, the TCP transmission features extracted from session byte sequence are correlated with Mean Packet Length, Mean Packet Time, and Mean Request/response time differences, while different attention mechanisms will produce different degrees of correlation. The TCP transmission features in Figure 6a are strongly correlated with the Mean Request/response time difference, while the TCP transmission features in Figure 6b are strongly correlated with the Mean Packet Length. Therefore, the Multi-Head Attention mechanism can represent the relationship between traffic features in multiple dimensions, thereby preventing overfitting.
The computation of the Multi-Head Attention mechanism is divided into three steps. Firstly, to improve the nonlinear expressiveness of the network, the statistical features S^i are input to a fully connected layer (Linear) with a Sigmoid activation function and transformed into a two-dimensional word vector matrix through word embedding. This matrix is concatenated with the multi-channel byte sequence feature matrix X''^i extracted by the Residual Neural Network. Meanwhile, we apply LayerNorm normalization for faster convergence and consistent data distribution, obtaining the input U^i of the Multi-Head Attention layer:
U^i = LayerNorm(Concat(Embedding(Linear(S^i)), X''^i)).
Secondly, Figure 7a [20] shows the weighted fusion of the session statistical features and byte sequence features using scaled dot-product self-attention, which is more efficient than other Self-Attention mechanisms [40]. The scaled dot-product self-attention mechanism is implemented through Q (Query), K (Key), and V (Value), which are linear transformations of U^i. Essentially, the attention weight distribution over V is determined by computing the similarity (dot product) between Q and K, where a larger weight indicates that a feature is more relevant to another feature and vice versa. Thus, the statistical features S^i and the multi-channel byte sequence features X''^i are processed by a single scaled dot-product self-attention mechanism to obtain the weighted fusion feature matrix attention_i:
Q = U^i W_i^Q
K = U^i W_i^K
V = U^i W_i^V
attention_i = Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where d_k is the dimension of K; its main function is to prevent the inner product QK^T from becoming too large, and softmax normalizes the weighted fusion feature matrix attention_i.
The single Self-Attention heads are concatenated to form the Multi-Head Attention mechanism, as shown in Figure 7b [20]. When calculating Q, K, and V in the h-head Self-Attention mechanism, the initialized W_i^Q, W_i^K, and W_i^V differ across heads, as does the similarity of Q to K; therefore, the weighted fusion feature matrices attention_i also differ. MHA-Resnet combines the features of different representation subspaces by Concat. After the linear transformation W^O, the concatenated attention_i are reduced to MultiHead(U^i) with the same dimension as the input U^i. After the residual connection and LayerNorm normalization, we obtain U'^i:
MultiHead(U^i) = Concat(attention_1, ..., attention_h) W^O, where attention_i = Attention(U^i W_i^Q, U^i W_i^K, U^i W_i^V)
U'^i = LayerNorm(U^i + MultiHead(U^i)).
Finally, we obtain the weighted fusion MHA_U^i of session statistical features and multi-channel byte sequence features through the nonlinear transformation of the feed-forward layer (FeedForward), a residual connection, and smoothing (Flatten):
MHA_U^i = Flatten(U'^i + FeedForward(U'^i)).
The main reason for using smoothing instead of maximum or average pooling is to reduce information loss so that the different features in the multiple channels can all be used for classification.
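The whole fusion step can be sketched in PyTorch using nn.MultiheadAttention for the h heads; d_model, n_heads, and d_ff are illustrative hyperparameters, and the byte sequence features X''^i are assumed to have been reshaped to (batch, tokens, d_model) beforehand.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, n_stats=29, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(n_stats, n_stats), nn.Sigmoid())
        self.embed = nn.Linear(1, d_model)  # lift each statistic to a d_model token
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, S, X):                 # S: (B, 29); X: (B, T, d_model)
        tokens = self.embed(self.fc(S).unsqueeze(-1))   # (B, 29, d_model)
        U = self.norm1(torch.cat([tokens, X], dim=1))   # U^i
        attn, _ = self.mha(U, U, U)                     # self-attention: Q = K = V
        U2 = self.norm2(U + attn)                       # U'^i
        return (U2 + self.ffn(U2)).flatten(1)           # MHA_U^i
```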

4.3.3. Session Classification

We combine the weighted fusion features MHA_U^i with the byte sequence features Res_X^i and classify them using an MLP with softmax. Specifically, MHA_U^i and Res_X^i are concatenated and fed into three fully connected layers with ReLU activation; softmax then produces the probability vector over the traffic classes, and the label corresponding to the maximum probability is taken as the predicted label y_predict^i for the ith session:
y_predict^i = softmax(MLP(Concat(MHA_U^i, Res_X^i)))
where dropout is added between the fully connected layers of the MLP to prevent overfitting.
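A sketch of this classification head; the layer widths and dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, in_dim, n_classes=5, p=0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(p),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, n_classes))

    def forward(self, mha_u, res_x):
        logits = self.mlp(torch.cat([mha_u, res_x], dim=1))
        # softmax yields per-class probabilities; during training with
        # nn.CrossEntropyLoss, the raw logits would be used instead.
        return logits.softmax(dim=1)
```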

5. Experimental Evaluation

This section is divided into five parts. Section 5.1 describes the dataset and performance metrics. Section 5.2 shows the hyperparameter settings in MHA-Resnet. We evaluate the performance of FF-MR in Section 5.3. We verify the effectiveness of the model MHA-Resnet in Section 5.4, and in Section 5.5, we implement the parameter sensitivity experiments.

5.1. Dataset and Performance Metrics

The CIRA-CIC-DoHBrw-2020 dataset [9] comes from the Canadian Institute for Cybersecurity, and the data preprocessing results are shown in Table 3. DoH traffic is generated using two browsers, Google Chrome and Mozilla Firefox, and three DNS covert channel tools (iodine, dnscat2, and dns2tcp) through four DoH servers: AdGuard, Cloudflare, Google DNS, and Quad9. The dataset contains three categories, non_DoH, benign_DoH, and malicious_DoH, representing HTTPS traffic, normal DoH traffic, and malicious DoH traffic, i.e., DoH-encrypted DNS covert channel traffic, respectively. The first two are generated by browsers using the HTTPS and DoH protocols, respectively, to access the top 10,000 domains on the Alexa website, while the encrypted DNS covert channel traffic is generated by the DNS covert channel tools, which send DNS requests as TLS-encrypted HTTPS requests to special DoH servers.
FF-MR not only detects encrypted DNS covert channels, i.e., malicious_DoH from HTTPS and normal DoH traffic, but it also identifies traffic generated by three DNS covert channel tools. As shown in Table 3, the magnitude of the preprocessed data is still at the level of hundreds of thousands, indicating that the dataset is sufficient. The division ratio of the training set, validation set, and test set is 6:2:2.
The CIRA-CIC-DoHBrw-2020 dataset is imbalanced, so commonly used performance metrics such as Accuracy are not applicable. We adopt three performance metrics, Precision, Recall, and F1-Score, evaluated over the five categories. The comprehensive performance metrics are macro-averaged:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 * Precision * Recall / (Precision + Recall)
Macro_P = (1/n) * sum_{i=1}^{n} Precision_i
Macro_R = (1/n) * sum_{i=1}^{n} Recall_i
Macro_F1 = (1/n) * sum_{i=1}^{n} F1-Score_i
where n = 5 , true positive ( T P ) means the model predicts the target traffic and the actual case is the target traffic, true negative ( T N ) means the model predicts the non-target traffic and the actual case is also the non-target traffic, false positive ( F P ) means the model predicts the target traffic and the actual case is the non-target traffic, and false negative ( F N ), in contrast, means the model predicts the non-target traffic, and the actual case is the target traffic.
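These macro-averaged metrics can be computed directly from the predictions, e.g. as in the sketch below (equivalently, scikit-learn's precision_recall_fscore_support with average='macro').

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=5):
    # Per-class Precision/Recall/F1 from TP, FP, FN, then macro-averaged.
    P, R, F1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        P.append(p); R.append(r)
        F1.append(2 * p * r / (p + r) if p + r else 0.0)
    return np.mean(P), np.mean(R), np.mean(F1)
```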
The experimental environment is a 12th Gen Intel(R) Core(TM) i7-12700K @ 4.70 GHz with 64 GB RAM and 2x NVIDIA RTX 3090 GPUs. The proposed architecture is developed on Ubuntu 20.04 LTS with Python 3.9.7, PyTorch 1.11.0, CUDA Toolkit 11.3, and cuDNN 8.2.0, and the code runs with GPU acceleration.

5.2. Hyperparameter Settings

To ensure the objectivity and validity of the method, we performed ten experiments with MHA-Resnet on the CIRA-CIC-DoHBrw-2020 dataset and averaged the final experimental results in Section 5.3.
In terms of hyperparameter settings, MHA-Resnet was trained with a cross-entropy loss function and optimized with the Adam optimizer. The number of training epochs was set to 100. We adopted a dynamic learning rate, where the initial learning rate was set to 0.0001 and decayed by a factor of 0.1 every 20 epochs. The structural parameters of MHA-Resnet are shown in Table 4.
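This configuration corresponds to the following PyTorch sketch; the model and data loader are assumed to exist and are not part of the published description.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100):
    # Cross-entropy loss, Adam with initial lr 1e-4, decayed by 0.1 every 20 epochs.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    for epoch in range(epochs):
        for S, X, y in train_loader:  # statistical features, byte sequences, labels
            optimizer.zero_grad()
            loss = criterion(model(S, X), y)  # model outputs logits over 5 classes
            loss.backward()
            optimizer.step()
        scheduler.step()
```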

5.3. Performance Evaluation

We visualized and analyzed the features by using t-SNE feature dimensionality reduction in Section 5.3.1. The detection performance of FF-MR was evaluated by comparing it with state-of-the-art methods in Section 5.3.2.

5.3.1. t-SNE Feature Dimensionality Reduction and Visual Analysis

Figure 8 shows the normalized confusion matrix. We mainly focus on the detection results of encrypted DNS covert channels. As shown in the matrix, there is a small amount of confusion in the classification of encrypted DNS covert channel traffic, i.e., the malicious_DoH traffic generated by iodine, dnscat2, and dns2tcp. To visually analyze the classification results, the feature vectors learned by MHA-Resnet before the softmax layer were saved for the test data, and we randomly selected 500 samples of each traffic type, which were then reduced to two dimensions by t-SNE [41], as shown in Figure 9.
In Figure 9, each category of traffic is aggregated into its own cluster, and the distinction between categories is obvious; there is a small amount of confusion in the identification of the three encrypted DNS covert channels due to data imbalance, which corroborates the results in Figure 8. As shown in Table 3, after data preprocessing, the encrypted DNS covert channel traffic generated by the iodine and dnscat2 tools is much smaller in volume than the other traffic types, making their distinguishing features harder to learn fully. In general, this finding verifies that FF-MR has good feature extraction ability and performs well in detection and identification.
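The visualization step can be sketched with scikit-learn, assuming feats holds the saved pre-softmax feature vectors (500 per class) and labels the corresponding class indices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

CLASSES = ["iodine", "dnscat2", "dns2tcp", "benign_DoH", "non_DoH"]

def plot_tsne(feats, labels):
    # Reduce the learned feature vectors to two dimensions and plot per class.
    emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
    for c, name in enumerate(CLASSES):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
    plt.legend()
    plt.show()
```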

5.3.2. Results and Evaluation

Baselines. To measure the improvements achieved by FF-MR, we reproduce four baselines:
  • LightGBM [13] is a framework for implementing the Gradient Boosting Decision Tree (GBDT) algorithm, which supports efficient parallel training and has a faster training speed, lower memory consumption, and better accuracy. The method takes the statistical features of flow as input and outputs one of the five labels as a prediction;
  • RF [8] is based on the Random Forest classifier, which adopts the same input and output as LightGBM [13]. The above two methods use the same statistical features as in Table 2;
  • HAST-II [42] takes the first 4096 bytes of the session as input and combines CNN with LSTM to learn the spatial features and temporal features of the session bytes, respectively; it then uses a softmax classifier to perform five classifications on the spatio-temporal features;
  • The input of CENTIME [43] is the same as ours: both use the statistical features and the first n bytes of the session. The difference is that CENTIME uses an autoencoder to reconstruct the statistical features and a residual neural network with the same structure as ours to extract the byte sequence features, and then concatenates the two as input to a fully connected network for classification.
Results. As shown in Table 5, FF-MR achieved scores of 99.72%, 99.73%, and 0.9978 on Macro_P, Macro_R, and Macro_F1, respectively, demonstrating that its detection performance is better than the other four methods, both in macro-averaged metrics and in per-category identification metrics. Next, we present a detailed comparative analysis of the experimental results through Table 5 and Figure 10.
Evaluation. As shown in Figure 10, LightGBM and RF, which use statistical features alone, have higher recall and similar overall performance, while HAST-II, which uses session bytes as input, has higher precision. Although the results of these three methods reflect the different advantages of the two kinds of features, their macro-averaged metrics are poor, with F1-Scores of only about 0.96. In contrast, the detection performance of FF-MR and CENTIME, which combine the two kinds of features, is better than that of the other three methods.
As shown in Table 5, in terms of Macro_F1, FF-MR improves over LightGBM and RF by 4.56% and 4.35%, respectively, and over HAST-II by 3.62%. The results of FF-MR in the five-category classification are also significantly higher than those of the other three methods, especially in the identification of the encrypted DNS covert channel traffic generated by the iodine and dnscat2 tools. FF-MR improves the F1-Score for iodine and dnscat2 traffic by 6.06% and 8.22% over LightGBM, by 5.81% and 8.12% over RF, and by 5.67% and 11.32% over HAST-II, respectively, indicating the important role of the combined use of features for encrypted DNS covert channel identification.
Our analysis is that LightGBM and RF are traditional machine learning algorithms, both variations of decision tree algorithms. We can therefore infer that decision tree algorithms do not achieve accurate classification here, and that decision tree ensemble algorithms cannot improve detection performance simply by optimizing the node-splitting algorithm (LightGBM) or by increasing the number of decision trees (RF). The reason is that non_DoH, benign_DoH, and the three kinds of malicious DoH traffic lie close together in the feature space, and multiple decision trees or the shallow neural network in HAST-II cannot separate them nonlinearly. The deep neural network used in FF-MR, however, achieves this separation by training multiple layers of weights, thus greatly improving detection and identification performance.
Comparing FF-MR with CENTIME: although both use statistical features and byte sequence features and have similar overall performance, FF-MR is better than CENTIME at identifying the encrypted DNS covert channel traffic generated by the iodine and dnscat2 tools due to the difference in model structure, improving the F1-Score from 0.977 and 0.9869 to 0.9951 and 0.9954, respectively. The main reason for this improvement is that FF-MR performs a weighted fusion of session statistical features and multi-channel byte sequence features through a Multi-Head Attention mechanism, instead of simply concatenating the two features as CENTIME does. The drawback of CENTIME is that it does not mine the correlations between the two kinds of features or give weighted attention to important features; thus, its performance in identifying specific encrypted DNS covert channels is poorer. These results show that the weighted fusion of features by the Multi-Head Attention mechanism plays an important role in accurately identifying encrypted DNS covert channels with smaller sample sizes.

5.4. Validation of Effectiveness

We verify the effectiveness of MHA-Resnet from three aspects: first, we compare and validate the effect of one-dimensional and two-dimensional convolution on the model’s classification performance; second, we assess the improvement of the model’s classification performance using statistical features; third, the role of the Multi-Head Attention mechanism is comparatively verified on the CIRA-CIC-DoHBrw-2020 dataset.
The baseline models selected in this section therefore include 1D-CNN, 2D-CNN, 1D-Resnet, and 2D-Resnet. The 1D-CNN and 2D-CNN both contain two convolutional layers in series followed by fully connected layers for classification; the only difference is that the former uses one-dimensional convolution and the latter two-dimensional convolution. The structures of 1D-Resnet and 2D-Resnet are the same as that of the Residual Neural Network in MHA-Resnet, again differing only in the dimension of convolution. The hyperparameters and other settings of the five models are identical. The training losses are shown in Figure 11: all models converge within 100 epochs without overfitting. MHA-Resnet converges around the 20th epoch, while the other four models converge around the 30th epoch, indicating that MHA-Resnet trains more efficiently.
As shown in Figure 12, the classification performance of the baseline models is compared with that of MHA-Resnet on the macro-averaging metrics. MHA-Resnet achieves the best detection and identification performance, as shown in Table 6, improving Macro_F1 by 9.03%, 8.52%, 1.42%, and 4.62% over the four baseline models, respectively. The next best model is 1D-Resnet, with all three macro metrics around 0.98, whereas the Macro_F1 of 2D-Resnet, which has the same structure, only reaches about 0.95, verifying that one-dimensional convolution is better suited to processing network traffic. The classification performance of 1D-CNN and 2D-CNN is similar, with Macro_F1 around 0.9 for both, which illustrates the effectiveness of the Residual Neural Network in MHA-Resnet.
The detection and identification results are shown in Table 6. Iodine, dnscat2, and dns2tcp belong to malicious_DoH. The classification performance on non_DoH and benign_DoH is more satisfactory, first because of the different protocols: non_DoH is HTTPS, while the other categories of traffic are DoH. Second, because benign_DoH is generated by browsers while malicious_DoH is generated by the three DNS covert channel tools, large differences appear in the plaintext information (e.g., TLS certificates and TLS cipher suites). However, the identification results for the three types of encrypted DNS covert channels within malicious_DoH vary greatly among the models; in particular, the traffic generated by the iodine and dnscat2 tools is harder to identify. MHA-Resnet reaches F1-Scores of 0.9951 and 0.9954 on these two types of encrypted DNS covert channels, respectively, much higher than the baseline models: improvements of 14.88% and 28.34% over 1D-CNN, 14.16% and 26.58% over 2D-CNN, 2.67% and 4.05% over 1D-Resnet, and 7.26% and 14.61% over 2D-Resnet.
Comparing 1D-Resnet and MHA-Resnet, both contain one-dimensional Residual Neural Networks with the same structure; the difference is that MHA-Resnet additionally takes statistical features as input and highlights important features using the Multi-Head Attention mechanism, which not only enriches the training information but also enhances the representation ability of the model. This is why MHA-Resnet classifies better. The above comparative analysis verifies that, on the CIRA-CIC-DoHBrw-2020 dataset, statistical features are an important factor in improving a model's classification performance, and that the Multi-Head Attention mechanism further improves detection performance by fusing the two kinds of features.

5.5. Parameter Sensitivity Experiments

FF-MR extracts the first n bytes above the TCP layer as the session representation, aiming to capture the TCP transmission features, the TLS handshake features, and the local patterns of the encrypted DNS covert channel during transmission, as shown in Figure 13. Since the TLS messages after the Server Hello are encrypted, the principle for selecting n is that the byte sequence of length n should include at least the Client Hello and Server Hello, the plaintext messages of the TLS handshake stage, and beyond that as many TCP messages as possible.
We count the TCP- and TLS-layer bytes in the messages before the Server Hello in each session. The distribution is shown in Figure 14; the byte counts mainly fall within 5000. Following previous research on byte selection in deep-neural-network traffic detection and identification, the number of bytes is usually chosen as a power of two. Therefore, within 5000 bytes, n is selected as 512, 1024, 2048, or 4096 bytes. We also test 8192 bytes in order to minimize information loss. The results are shown in Figure 15.
As shown in Figure 15a,b, the results for byte sizes up to 4096 are close, because even 512 bytes already contain the Client Hello message and thus achieve high F1-Score, precision, and recall. When n > 4096, the overall performance decreases significantly due to the excessive zero padding required at a uniform length of 8192 bytes. Figure 15b focuses on identifying the encrypted DNS covert channel traffic generated by iodine and dnscat2, where n = 1024 gives the best results; therefore, 1024 is chosen as the value of n.
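As a minimal sketch of this preprocessing step (assuming the payloads above the TCP layer have already been reassembled in order for each session; function and variable names are ours for illustration, not the authors' code):

```python
import numpy as np

N_BYTES = 1024  # chosen by the sensitivity experiments above

def session_to_input(tcp_payloads: list[bytes], n: int = N_BYTES) -> np.ndarray:
    """Concatenate the payloads above the TCP layer (TLS records, including
    the plaintext Client Hello / Server Hello), then truncate or zero-pad
    to a fixed length of n bytes."""
    stream = b"".join(tcp_payloads)[:n]            # keep the first n bytes
    padded = stream + b"\x00" * (n - len(stream))  # zero-pad short sessions
    # Normalize byte values to [0, 1] and add a channel axis (1, n)
    # so the sequence can feed a one-dimensional convolution.
    return np.frombuffer(padded, dtype=np.uint8).reshape(1, n) / 255.0

# Example: a toy session consisting of two TLS records.
x = session_to_input([b"\x16\x03\x01" + b"A" * 500, b"\x16\x03\x03" + b"B" * 200])
print(x.shape)  # (1, 1024)
```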

6. Conclusions

In this paper, we propose a DoH-encrypted DNS covert channel detection method based on feature fusion, called FF-MR, to address the weak discriminative power of single features in existing research. FF-MR extracts TCP transmission features, TLS handshake features, and the local transmission patterns of DoH-encrypted DNS covert channels from session byte sequences using a Residual Neural Network, computes global correlations with the session statistical features using a Multi-Head Attention mechanism, and finally performs a weighted fusion. Over multiple training iterations of the neural network, important features receive higher weights, which plays a key role in classification. On the CIRA-CIC-DoHBrw-2020 dataset, the macro-averaging precision and recall of the proposed method reach 99.72% and 99.73%, respectively, and its macro-averaging F1-Score reaches 0.9978, at most a 4.56% improvement over the existing methods discussed in this paper. Moreover, when identifying the two encrypted DNS covert channels iodine and dnscat2, FF-MR improves the F1-Score from the highest scores achieved by the other methods, 0.977 and 0.9869, to 0.9951 and 0.9954, respectively. The effectiveness of the MHA-Resnet model used in FF-MR was verified from three aspects by comparison with baseline models, and parameter sensitivity experiments were implemented to determine the byte sequence length n. However, due to the complex structure of the model, its real-time performance is poor; we will therefore balance accuracy and real-time performance in future research.

Author Contributions

Conceptualization, C.S., Y.W. and D.H.; methodology, C.S. and Y.W.; validation, C.S., Y.W. and D.H.; formal analysis, C.S., Y.W., X.X. and Y.L.; investigation, D.H.; resources, C.S. and Y.W.; data curation, C.S., Y.W. and X.X.; writing—original draft preparation, C.S.; writing—review and editing, C.S., Y.W., X.X. and Y.L.; visualization, C.S. and D.H.; supervision, C.S., D.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CIRA-CIC-DoHBrw-2020 dataset that supports the findings of this study is openly available in DNS over HTTPS (CIRA-CIC-DoHBrw2020) at https://www.unb.ca/cic/datasets/index.html (accessed on 10 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DNS	Domain Name System
DoH	DNS over HTTPS
FF-MR	Feature Fusion based on Multi-Head Attention and Residual Neural Network
MLP	Multilayer Perceptron
DNSSEC	Domain Name System Security Extensions
DoT	DNS over TLS
IoT	Internet of Things
C&C	Command and Control
APT	Advanced Persistent Threat
RF	Random Forest
NB	Naive Bayes
SVM	Support Vector Machines
LSTM	Long Short-Term Memory
ELK	Elasticsearch, Logstash, Kibana
SOC	Security Operation Center
Bi-RNN	Bidirectional Recurrent Neural Network
ET-BERT	Encrypted Traffic Bidirectional Encoder Representations from Transformer
EV SSL	Extended Validation SSL Certificate
Conv1D	One-dimensional Convolutional layer
NLP	Natural Language Processing
ResLayer	Residual Layer
ResBlock	Residual Block
BatchNorm	Batch Normalization layer
AvgPooling1D	One-dimensional Average Pooling
TP	True Positives
TN	True Negatives
FP	False Positives
FN	False Negatives
GBDT	Gradient Boosting Decision Tree

References

  1. Meng, D.; Zou, F. DNS Privacy Protection Security Analysis. Commun. Technol. 2020, 53, 5. [Google Scholar]
  2. Cloudflare. DNS over TLS vs. DNS over HTTPS | Secure DNS. Technical Report. 2021. Available online: https://www.cloudflare-cn.com/learning/dns/dns-over-tls/ (accessed on 10 June 2022).
  3. Bures, M.; Klima, M.; Rechtberger, V.; Ahmed, B.S.; Hindy, H.; Bellekens, X. Review of specific features and challenges in the current internet of things systems impacting their security and reliability. In World Conference on Information Systems and Technologies; Springer: Berlin/Heidelberg, Germany, 2021; pp. 546–556. [Google Scholar]
  4. Mahmoud, R.; Yousuf, T.; Aloul, F.; Zualkernan, I. Internet of things (iot) security: Current status, challenges and prospective measures. In Proceedings of the 2015 10th International Conference for Internet Technology and Secured Transactions (ICITST), London, UK, 14–16 December 2015; pp. 336–341. [Google Scholar]
  5. Hesselman, C.; Kaeo, M.; Chapin, L.; Claffy, K.; Seiden, M.; McPherson, D.; Piscitello, D.; McConachie, A.; April, T.; Latour, J.; et al. The dns in iot: Opportunities, risks, and challenges. IEEE Internet Comput. 2020, 24, 23–32. [Google Scholar] [CrossRef]
  6. Network Security Research Lab at 360. An Analysis of Godlua Backdoor. Technical Report. 2019. Available online: https://blog.netlab.360.com/an-analysis-of-godlua-backdoor-en/ (accessed on 10 June 2022).
  7. Cyber Security Review. Iranian Hacker Group Becomes First Known APT to Weaponize DNS-over-HTTPS (DoH). Technical Report. 2020. Available online: https://www.cybersecurity-review.com/news-august-2020/iranian-hacker-group-becomes-first-known-apt-to-weaponize-dns-over-https-doh/ (accessed on 10 June 2022).
  8. Banadaki, Y.M.; Robert, S. Detecting malicious dns over https traffic in domain name system using machine learning classifiers. J. Comput. Sci. Appl. 2020, 8, 46–55. [Google Scholar]
  9. Montazerishatoori, M.; Davidson, L.; Kaur, G.; Lashkari, A.H. Detection of doh tunnels using time-series classification of encrypted traffic. In Proceedings of the 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 63–70. [Google Scholar]
  10. Al-Fawa’reh, M.; Ashi, Z.; Jafar, M.T. Detecting malicious dns queries over encrypted tunnels using statistical analysis and bi-directional recurrent neural networks. Karbala Int. J. Mod. Sci. 2021, 7, 4. [Google Scholar] [CrossRef]
  11. Nguyen, T.A.; Park, M. Doh tunneling detection system for enterprise network using deep learning technique. Appl. Sci. 2022, 12, 2416. [Google Scholar] [CrossRef]
  12. Zhan, M.; Li, Y.; Yu, G.; Li, B.; Wang, W. Detecting dns over https based data exfiltration. Comput. Netw. 2022, 209, 108919. [Google Scholar] [CrossRef]
  13. Mitsuhashi, R.; Satoh, A.; Jin, Y.; Iida, K.; Takahiro, S.; Takai, Y. Identifying malicious dns tunnel tools from doh traffic using hierarchical machine learning classification. In International Conference on Information Security; Springer: Berlin/Heidelberg, Germany, 2021; pp. 238–256. [Google Scholar]
  14. Zebin, T.; Rezvy, S.; Luo, Y. An explainable ai-based intrusion detection system for dns over https (doh) attacks. IEEE Trans. Inf. Forensics Secur. 2022, 17, 2339–2349. [Google Scholar] [CrossRef]
  15. Ren, H.; Wang, X. Review of attention mechanism. J. Comput. Appl. 2021, 41, 6. [Google Scholar]
  16. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 3, 2204–2212. [Google Scholar]
  17. Zhu, Z.; Rao, Y.; Wu, Y.; Qi, J.; Zhang, Y. Research Progress of Attention Mechanism in Deep Learning. J. Chin. Inf. Process. 2019, 33, 11. [Google Scholar]
  18. Zhao, S.; Zhang, Z. Attention-via-attention neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  19. Britz, D.; Goldie, A.; Luong, M.T.; Le, Q. Massive exploration of neural machine translation architectures. arXiv 2017, arXiv:1703.03906. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  21. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Amodei, D. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  22. Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33. [Google Scholar]
  23. Wang, Y.; Dong, X.; Li, G.; Dong, J.; Yu, H. Cascade regression-based face frontalization for dynamic facial expression analysis. Cogn. Comput. 2022, 14, 1571–1584. [Google Scholar] [CrossRef]
  24. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  25. Liu, S.; Zhang, X. Intrusion Detection System Based on Dual Attention. Netinfo Secur. 2022, in press. [Google Scholar]
  26. Zhang, G.; Yan, F.; Zhang, D.; Liu, X. Insider Threat Detection Model Based on LSTM-Attention. Netinfo Secur. 2022, in press. [Google Scholar]
  27. Jiang, T.; Yin, W.; Cai, B.; Zhang, K. Encrypted malicious traffic identification based on hierarchical spatiotemporal feature and Multi-Head attention. Comput. Eng. 2021, 47, 101–108. [Google Scholar]
  28. Wang, H.; Wei, T.; Huangfu, Y.; Li, L.; Shen, F. Enabling Self-Attention based multi-feature anomaly detection and classification of network traffic. J. East China Norm. Univ. (Nat. Sci.) 2021, in press. [Google Scholar]
  29. Wang, R.; Ren, H.; Dong, W.; Li, H.; Sun, X. Network traffic anomaly detection model based on stacked convolution attention. Comput. Eng. 2022, in press. [Google Scholar]
  30. Lin, X.; Xiong, G.; Gou, G.; Li, Z.; Shi, J.; Yu, J. Et-bert: A contextualized datagram representation with pre-training transformers for encrypted traffic classification. Proc. ACM Web Conf. 2022, 2022, 633–642. [Google Scholar]
  31. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  32. Ekman, E. iodine; Technical Report; lwIP Developers: New York, NY, USA, 2014. [Google Scholar]
  33. Ron. dnscat2; Technical Report; SkullSecurity: New York, NY, USA, 2014. [Google Scholar]
  34. Dembour, O. dns2tcp; Technical Report; SkullSecurity: New York, NY, USA, 2017. [Google Scholar]
  35. Huo, Y.; Zhao, F. Analysis of Encrypted Malicious Traffic Detection Based on Stacking and Multi-feature Fusion. Comput. Eng. 2022, 142–148. [Google Scholar]
  36. Torroledo, I.; Camacho, L.D.; Bahnsen, A.C. Hunting malicious tls certificates with deep neural networks. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, Toronto, Canada, 15–19 October 2018; Association for Computing Machinery: New York, NY, USA. [Google Scholar]
  37. Pai, K.C.; Mitra, S.; Madhusoodhana, C.S. Novel tls signature extraction for malware detection. In Proceedings of the 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 2–4 July 2020. [Google Scholar]
  38. Lashkari, A.H. DoHLyzer; Technical Report; York University: Toronto, ON, Canada, 2020. [Google Scholar]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  40. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  41. Van Der Maaten, L.; Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  42. Wang, W.; Sheng, Y.; Wang, J.; Zeng, X.; Ye, X.; Huang, Y.; Zhu, M. Hast-ids: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access 2018, 6, 1792–1806. [Google Scholar] [CrossRef]
  43. Wang, M.; Zheng, K.; Ning, X.; Yang, Y.; Wang, X. Centime: A direct comprehensive traffic features extraction for encrypted traffic classification. In Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), Chengdu, China, 23–26 April 2021; pp. 490–498. [Google Scholar]
Figure 1. Data leakage and command control process based on DoH-encrypted DNS covert channels.
Figure 2. Framework of FF-MR.
Figure 3. Structure of MHA-Resnet.
Figure 4. Downsample.
Figure 5. Interpretability of distribution of attention in machine translation: (a) a distribution of attention; (b) another distribution of attention.
Figure 6. Interpretability of distribution of attention for features of traffic: (a) a distribution of attention; (b) another distribution of attention.
Figure 7. Attention mechanism in this paper: (a) weighted fusion of features calculated by scaled dot-product self-attention; (b) weighted fusion of features in different representation subspaces calculated by Multi-Head Attention.
Figure 8. Normalized confusion matrix.
Figure 9. Visualization of t-SNE dimension reduction in traffic features.
Figure 10. FF-MR vs. other methods on macro-averaging metrics.
Figure 11. Training loss.
Figure 12. MHA-Resnet vs. baseline models on macro-averaging metrics.
Figure 13. TLS handshake and encrypted message transmission.
Figure 14. Distribution of TCP and TLS layer bytes.
Figure 15. Comparison of results under different byte size n: (a) comparison of macro-averaging metrics under different byte size n; (b) comparison of F1-Score under different byte size n in the identification of iodine and dnscat2.
Table 1. Research on DoH-encrypted DNS covert channel detection and identification.

Research Category | Publication Year | Author | Features/Neural Network Input | Method
Detection | 2020 | Banadaki et al. [8] | Statistical Features | LGBM, Random Forest
Detection | 2020 | MontazeriShatoori et al. [9] | Statistical Features | Random Forest, Naive Bayes, SVM, LSTM
Detection | 2021 | Al-Fawa’reh [10] | Statistical Features | Bi-RNN
Detection | 2022 | Nguyen et al. [11] | Statistical Features | Transformer
Detection | 2022 | Zhan et al. [12] | Statistical Features + TLS fingerprint | Decision Tree, Random Forest, Logistic Regression
Detection and Identification | 2021 | Mitsuhashi et al. [13] | Statistical Features | LGBM, XGBoost
Detection and Identification | 2022 | Zebin et al. [14] | Statistical Features | Stacked Random Forest
Table 2. Session statistical features.

Category | Number | Feature
Duration | 1 | Session duration
Number of bytes | 2 | Number of session bytes sent
Number of bytes | 3 | Rate of session bytes sent
Number of bytes | 4 | Number of session bytes received
Number of bytes | 5 | Rate of session bytes received
Packet length | 6 | Mean Packet Length
Packet length | 7 | Median Packet Length
Packet length | 8 | Mode Packet Length
Packet length | 9 | Variance of Packet Length
Packet length | 10 | Standard Deviation of Packet Length
Packet length | 11 | Coefficient of Variation of Packet Length
Packet length | 12 | Skew from median Packet Length
Packet length | 13 | Skew from mode Packet Length
Packet time | 14 | Mean Packet Time
Packet time | 15 | Median Packet Time
Packet time | 16 | Mode Packet Time
Packet time | 17 | Variance of Packet Time
Packet time | 18 | Standard Deviation of Packet Time
Packet time | 19 | Coefficient of Variation of Packet Time
Packet time | 20 | Skew from median Packet Time
Packet time | 21 | Skew from mode Packet Time
Request/response time difference | 22 | Mean Request/response time difference
Request/response time difference | 23 | Median Request/response time difference
Request/response time difference | 24 | Mode Request/response time difference
Request/response time difference | 25 | Variance of Request/response time difference
Request/response time difference | 26 | Standard Deviation of Request/response time difference
Request/response time difference | 27 | Coefficient of Variation of Request/response time difference
Request/response time difference | 28 | Skew from median Request/response time difference
Request/response time difference | 29 | Skew from mode Request/response time difference
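To make Table 2 concrete, the sketch below computes the packet length statistics (features 6 to 13) for one session. The skewness estimators are our assumption, following DoHLyzer-style feature extraction [38] (Pearson's median and mode skewness); the paper does not spell out the exact formulas.

```python
import statistics as st

def packet_length_features(lengths: list[int]) -> dict:
    """Features 6-13 of Table 2 for one session, computed from the
    per-packet lengths. Skewness estimators are assumptions following
    DoHLyzer-style feature extraction [38]."""
    mean, median, mode = st.mean(lengths), st.median(lengths), st.mode(lengths)
    var = st.pvariance(lengths)
    std = var ** 0.5
    return {
        "mean": mean,
        "median": median,
        "mode": mode,
        "variance": var,
        "std_dev": std,
        "coef_variation": std / mean if mean else 0.0,
        "skew_from_median": 3 * (mean - median) / std if std else 0.0,
        "skew_from_mode": (mean - mode) / std if std else 0.0,
    }

# Example: packet lengths (bytes) observed in a toy session.
print(packet_length_features([120, 583, 1500, 1500, 310, 978]))
```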
Table 3. CIRA-CIC-DoHBrw-2020 dataset and preprocessing results.

Category | Browsers/Tools | Number of Flows | Number of Sessions | Number of Sessions after Preprocessing
malicious_DoH | iodine | 46,613 | 12,368 | 12,367
malicious_DoH | dnscat2 | 35,622 | 10,298 | 10,298
malicious_DoH | dns2tcp | 167,515 | 121,897 | 121,738
benign_DoH | Google Chrome, Mozilla Firefox | 19,807 | 27,940 | 26,238
non_DoH | Google Chrome, Mozilla Firefox | 897,493 | 492,171 | 485,654
Table 4. Structural parameters in MHA-Resnet.

Substructure | Layer | Operation | Input | Output
Residual Neural Network | Conv1D | One-dimensional convolution | 1*1024 | 32*1024
Residual Neural Network | ResLayer1 | One-dimensional convolution *4 | 32*1024 | 32*1024
Residual Neural Network | ResLayer2 | One-dimensional convolution *4 | 32*1024 | 64*512
Residual Neural Network | ResLayer3 | One-dimensional convolution *4 | 64*512 | 128*256
Residual Neural Network | ResLayer4 | One-dimensional convolution *4 | 128*256 | 256*128
Residual Neural Network | AvgPooling1D | Global average pooling | 256*128 | 256*1
Multi-Head Attention mechanism | Linear | Linear transformation + Sigmoid | 29*1 | 14*1
Multi-Head Attention mechanism | Embedding | Word embedding | 14*1 | 14*128
Multi-Head Attention mechanism | Multi-Head Attention | Calculation of the attention weight matrix | (256 + 14)*128 | (256 + 14)*128
Multi-Head Attention mechanism | Feed Forward | Linear transformation + ReLU + linear transformation | (256 + 14)*128 | (256 + 14)*128
Multi-Head Attention mechanism | Flatten | Flattening of the weighted fusion feature matrix | (256 + 14)*128 | 34,560*1
MLP + softmax | Linear | Linear transformation + ReLU | 34,560 + 256 | 200
MLP + softmax | Linear | Linear transformation | 200 | 30
MLP + softmax | Linear | Linear transformation + softmax | 30 | 5
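As one reading of Table 4, the following PyTorch sketch assembles these components end to end with the listed dimensions. It is not the authors' released implementation: the kernel sizes, stride placement, and the lifting of the 14 statistical values into 128-dimensional tokens are plausible assumptions where the table is silent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: two 1-D convolutions with BatchNorm, plus a
    projection shortcut when the shape changes (cf. the ResLayer rows)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, 3, stride, 1), nn.BatchNorm1d(c_out), nn.ReLU(),
            nn.Conv1d(c_out, c_out, 3, 1, 1), nn.BatchNorm1d(c_out))
        self.skip = (nn.Identity() if c_in == c_out and stride == 1
                     else nn.Conv1d(c_in, c_out, 1, stride))
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class MHAResnet(nn.Module):
    """Sketch of MHA-Resnet following the dimensions in Table 4."""
    def __init__(self, n_stats=29, dim=128, n_classes=5):
        super().__init__()
        self.stem = nn.Conv1d(1, 32, 3, 1, 1)              # 1*1024 -> 32*1024
        def layer(ci, co, s):                              # "*4": four blocks per ResLayer
            return nn.Sequential(ResBlock(ci, co, s), *[ResBlock(co, co) for _ in range(3)])
        self.res = nn.Sequential(layer(32, 32, 1),         # 32*1024
                                 layer(32, 64, 2),         # 64*512
                                 layer(64, 128, 2),        # 128*256
                                 layer(128, 256, 2))       # 256*128
        self.squeeze = nn.Sequential(nn.Linear(n_stats, 14), nn.Sigmoid())  # 29 -> 14
        self.embed = nn.Linear(1, dim)   # assumed: lift each statistic to a 128-d token
        self.mha = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mlp = nn.Sequential(nn.Linear((256 + 14) * dim + 256, 200), nn.ReLU(),
                                 nn.Linear(200, 30), nn.Linear(30, n_classes))

    def forward(self, bytes_x, stats_x):
        fmap = self.res(self.stem(bytes_x))                # (B, 256, 128)
        pooled = fmap.mean(dim=2)                          # global average pooling -> (B, 256)
        stat_tokens = self.embed(self.squeeze(stats_x).unsqueeze(-1))  # (B, 14, 128)
        tokens = torch.cat([fmap, stat_tokens], dim=1)     # (B, 270, 128)
        fused, _ = self.mha(tokens, tokens, tokens)        # weighted feature fusion
        flat = self.ffn(fused).flatten(1)                  # (B, 34560)
        return self.mlp(torch.cat([flat, pooled], dim=1))  # (B, 5) class logits

model = MHAResnet()
logits = model(torch.randn(2, 1, 1024), torch.randn(2, 29))
print(logits.shape)  # torch.Size([2, 5])
```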
Table 5. FF-MR vs. other methods.

Class | Metric | LightGBM [13] | RF [8] | HAST-II [42] | CENTIME [43] | FF-MR
Macro-averaging | Macro_P | 0.9558 | 0.9609 | 0.9892 | 0.9913 | 0.9972
Macro-averaging | Macro_R | 0.9489 | 0.9482 | 0.9391 | 0.992 | 0.9973
Macro-averaging | Macro_F1 | 0.9522 | 0.9543 | 0.9616 | 0.9916 | 0.9978
iodine | Precision | 0.9234 | 0.9319 | 0.9743 | 0.9773 | 0.994
iodine | Recall | 0.9458 | 0.942 | 0.905 | 0.9766 | 0.9935
iodine | F1-Score | 0.9345 | 0.937 | 0.9384 | 0.977 | 0.9951
dnscat2 | Precision | 0.9157 | 0.913 | 0.9957 | 0.9864 | 0.9939
dnscat2 | Recall | 0.9107 | 0.9154 | 0.7919 | 0.9874 | 0.9942
dnscat2 | F1-Score | 0.9132 | 0.9142 | 0.8822 | 0.9869 | 0.9954
dns2tcp | Precision | 0.9911 | 0.9926 | 0.9761 | 0.9931 | 0.9993
dns2tcp | Recall | 0.986 | 0.9881 | 0.9998 | 0.9982 | 0.9995
dns2tcp | F1-Score | 0.9885 | 0.9903 | 0.9878 | 0.9956 | 0.9995
benign_DoH | Precision | 0.9513 | 0.9697 | 0.9999 | 0.9994 | 0.999
benign_DoH | Recall | 0.9033 | 0.896 | 0.999 | 0.9992 | 0.9995
benign_DoH | F1-Score | 0.9267 | 0.9314 | 0.9994 | 0.9993 | 0.9992
non_DoH | Precision | 0.9977 | 0.9975 | 0.9999 | 1.0000 | 0.9999
non_DoH | Recall | 0.9987 | 0.9994 | 1.0000 | 0.9987 | 0.9999
non_DoH | F1-Score | 0.9982 | 0.9985 | 0.9999 | 0.9993 | 0.9999
Table 6. MHA-Resnet vs. baseline models.

Class | Metric | 1D-CNN | 2D-CNN | 1D-Resnet | 2D-Resnet | MHA-Resnet
Macro-averaging | Macro_P | 0.9061 | 0.9098 | 0.9838 | 0.9514 | 0.9972
Macro-averaging | Macro_R | 0.9089 | 0.9157 | 0.9835 | 0.9519 | 0.9973
Macro-averaging | Macro_F1 | 0.9075 | 0.9126 | 0.9836 | 0.9516 | 0.9978
iodine | Precision | 0.847 | 0.8557 | 0.9696 | 0.9262 | 0.994
iodine | Recall | 0.8456 | 0.8513 | 0.9673 | 0.9188 | 0.9935
iodine | F1-Score | 0.8463 | 0.8535 | 0.9684 | 0.9225 | 0.9951
dnscat2 | Precision | 0.7007 | 0.7081 | 0.954 | 0.8424 | 0.9939
dnscat2 | Recall | 0.7238 | 0.7524 | 0.9558 | 0.8563 | 0.9942
dnscat2 | F1-Score | 0.712 | 0.7296 | 0.9549 | 0.8493 | 0.9954
dns2tcp | Precision | 0.9851 | 0.9874 | 0.9959 | 0.9899 | 0.9993
dns2tcp | Recall | 0.9825 | 0.9825 | 0.9959 | 0.9892 | 0.9995
dns2tcp | F1-Score | 0.9838 | 0.985 | 0.9959 | 0.9896 | 0.9995
benign_DoH | Precision | 0.9983 | 0.9981 | 0.9994 | 0.9989 | 0.999
benign_DoH | Recall | 0.9929 | 0.9926 | 0.9987 | 0.9952 | 0.9995
benign_DoH | F1-Score | 0.9956 | 0.9953 | 0.999 | 0.997 | 0.9992
non_DoH | Precision | 0.9996 | 0.9996 | 0.9999 | 0.9997 | 0.9999
non_DoH | Recall | 0.9999 | 0.9999 | 1.0000 | 0.9999 | 0.9999
non_DoH | F1-Score | 0.9997 | 0.9997 | 0.9999 | 0.9998 | 0.9999