Peer-Review Record

Real-Time Detection System for Data Exfiltration over DNS Tunneling Using Machine Learning

Electronics 2023, 12(6), 1467; https://doi.org/10.3390/electronics12061467

by Orieb Abualghanam ¹,*, Hadeel Alazzam ²,*, Basima Elshqeirat ¹, Mohammad Qatawneh ¹ and Mohammed Amin Almaiah ³,⁴,*

Reviewer 1:

Reviewer 2:

Reviewer 3:

Reviewer 4:

Reviewer 5: Manuel Fernandez-Veiga

Submission received: 18 January 2023 / Revised: 10 March 2023 / Accepted: 14 March 2023 / Published: 20 March 2023

(This article belongs to the Section Computer Science & Engineering)

Round 1

Reviewer 1 Report

1. Please adjust the structure of the manuscript as it may help reviewers understand the situation.

2. The character font is a little bit small to see.

3. The function of Figures 8-11 in the manuscript? Any explanation?

4. Lack of discussion of the results

5. Actually, the authors need to put more details in section 5, such as the PIO algorithms, or other algorithms and the consideration of variation.

6. If possible, the authors may show one set of data in the appendix or main content.

Author Response

Thank you for giving us an opportunity to revise our manuscript (ID: electronics-2199752), entitled “Real-Time Detection System for Data Exfiltration Over DNS Tunneling Using Machine Learning”. We have carefully read the entire editorial decision letter and have taken all requested actions seriously. We have responded to the reviewers' comments and the compliance requirements point by point.

Reviewer #1

Please adjust the structure of the manuscript as it may help reviewers understand the situation.

The manuscript has been modified based on the reviewer's suggestions: Section 5 has been reorganized, and a new subsection that discusses the modified version of the PIO has been added.

The character font is a little bit small to see.

The manuscript was prepared with the MDPI LaTeX template, and the font size used is the default size specified in the template. In addition, many figures have been redrawn with a clearer font size.

The function of Figures 8-11 in the manuscript? Any explanation?

This comment has been taken into consideration, and detailed discussions of the figures in the appendices have been added.

Lack of discussion of the results

A detailed discussion has been added to section 6.4.

Actually, the authors need to put more details in section 5, such as the PIO algorithms, or other algorithms and the consideration of variation.

An additional Section 5.2, which presents the modified PIO, has been added. Moreover, Section 5 has been rewritten more clearly.

If possible, the authors may show one set of data in the appendix or main content.

A snapshot of the UNSW-NB15 data has been added to the appendix.

Reviewer 2 Report

In this study, a hybrid DNS tunneling detection system based on Tabu-PIO and the packet length range is presented. Moreover, a testbed was built using virtual machines to generate a DNS tunneling dataset with different classes. Our generated dataset summarizes different ranges of the packet length, which helps us to modify the Tabu-PIO. The evaluation was conducted on three datasets: the DNS-UNSW-NB15 dataset, the DNS tunneling dataset [31], and our testbed dataset. The results show that using Tabu-PIO reduces the number of features in all datasets, i.e., from 42 to 13 features in the DNS-UNSW-NB15 dataset and from 17 to 5 in the DNS tunneling dataset [31]. Moreover, the results show that the hybrid approach (M-PIO + packet length) improves the runtime significantly as the size of the data increases.

The topic of the paper is interesting; however, the quality of the paper should be improved. Detailed comments are given below:

1. The motivation should be improved. Moreover, please explain the importance and main contributions of the proposed approach.

2. Figures are not clear, high-quality figures are suggested.

3. The conclusion part does not provide future work, it is suggested to add some future research work. In addition, authors should improve abstract and conclusion.

4. Language of the paper should be improved. There are many typos and grammar errors. Authors should check the whole manuscript for typos and grammar errors.

5. The simulation results are weak, it is suggested to compare the proposed approach with some other recent approaches. In addition, it is suggested to highlight (make bold) the better performance of the proposed approach.

6. Figures 8-11 are out of the main body of the manuscript, are they in appendix? In addition, both of them must be reflected in the main manuscript. Authors should check all figures and tables carefully.

Author Response

Reviewer #2

The topic of the paper is interesting; however, the quality of the paper should be improved. Detailed comments are given below:

The motivation should be improved. Moreover, please explain the importance and main contributions of the proposed approach.

This comment has been addressed, and the motivation has been revised.

Figures are not clear, high-quality figures are suggested.

Figures 4, 5, and 6 have been redrawn in high quality.

The conclusion part does not provide future work, it is suggested to add some future research work. In addition, the authors should improve the abstract and conclusion.

This comment has been addressed: future work has been added, and the abstract has also been revised.

Language of the paper should be improved. There are many typos and grammar errors. Authors should check the whole manuscript for typos and grammar errors.

The language of the paper has been extensively improved, and all typos have been corrected.

The simulation results are weak, it is suggested to compare the proposed approach with some other recent approaches. In addition, it is suggested to highlight (make bold) the better performance of the proposed approach.

The simulation results for the generated data are hard to compare with other approaches. However, we compare the proposed approach with other recent approaches using the Labeled DNS exfiltration dataset. The better performance has also been highlighted.

Figures 8-11 are out of the main body of the manuscript, are they in appendix? In addition, both of them must be reflected in the main manuscript. Authors should check all figures and tables carefully.

Figures 8-11 are part of the Appendix; more discussion of them has been added.

Reviewer 3 Report

The manuscript investigates real-time detection of data exfiltration over DNS tunneling using machine learning. I would give the following comments:

1. The abstract is not enough, please give more highlights on your contribution.

2. The literature review part is good but can be improved; more review of machine learning should be done. Besides, for the machine learning method, the authors can compare with the method in: Multi-node load forecasting based on multi-task learning with modal feature extraction.

3. The modelling of the problem is not clear in (1)-(10), please give more details.

4. The case study is not enough; please give more comparison cases to show your contribution.

Author Response

The manuscript investigates real-time detection of data exfiltration over DNS tunneling using machine learning. I would give the following comments:

The abstract is not enough, please give more highlights on your contribution.

This comment has been addressed, and the abstract has been revised.

The literature review part is good but can be improved; more review of machine learning should be done. Besides, for the machine learning method, the authors can compare with the method in: Multi-node load forecasting based on multi-task learning with modal feature extraction.

There is no research on multi-node load forecasting based on multi-task learning with modal feature extraction for DNS tunneling detection.

The modeling of the problem is not clear in (1)-(10), please give more details.

The comment is not clear: what is the problem in (1)-(10)? However, the motivation and contributions sections have been modified.

The case study is not enough; please give more comparison cases to show your contribution.

No case study is presented in this paper. However, more details about the contribution and the proposed model have been added. The challenge of this research has also been added.

Reviewer 4 Report

1. “system based on M-PIO”. What is M-PIO?

2. “strategy. [5,6].” – typo.

3. “The authors in [29]” – do not use the keyword “authors”. Name authors instead.

4. “[30] proposed” – never write like this. Always name authors.

5. The review of the related work must be critical. Now, the review just presented the obtained results of various research papers.

6. “Table 1 presents the related works mentioned in this section”

a. This is not true. Not all the related works mentioned in this section are replicated in Table 1.

b. Authors must be named next to the reference number.

7. “Kail Linux” – typo.

8. “Table .2” – typo. There are many typos.

9. “Algorithm 17” – what is it?

10. “comparison between different DNS tunneling for the same dataset [31]”

a. What dataset? Name it?

b. There is no the same dataset, because “The training data for the experiment has been collected within an isolated private network.” [31].

c. “same dataset introduced in [31]”. The term “dataset” in [31] is not used, at all.

d. The comparison provided in Table 11 is not possible.

11. “DNS tunneling dataset [31],” – false.

12. The figures 8 and 9 are not needed.

13. There are no such data in [31]:

Dns2tcp          6298    2772
dnscapy         10043    4375
Iodine           8565    3663
Tuns            40392   17378
Attack          65298   28188
Normal          12051    5054
Total Records   77350   33242

14. Use of the keyword “DNS-UNSW-NB15” is not advised.

15. There are too many typos, e.g., in the titles of the methods “PIO-Hill-Climbing”.

Author Response

“system based on M-PIO”. What is M-PIO?

Section 5.2 has been added for the modified PIO.

“strategy. [5,6].” – typo.

This has been corrected.

“The authors in [29]” – do not use the keyword “authors”. Name authors instead.

This comment has been taken into consideration

“[30] proposed” – never write like this. Always name authors.

This comment has been taken into consideration

The review of the related work must be critical. Now, the review just presented the obtained results of various research papers.

The related works section has been modified as mentioned in point 6. Moreover, a paragraph at the end of this section has been added, which discusses the research gap, and how our work addresses it.

“Table 1 presents the related works mentioned in this section”

This is not true. Not all the related works mentioned in this section are replicated in Table 1.

All related works that appear in this section have been replicated in Table 1.

Authors must be named next to the reference number.

Author names have been added next to the reference numbers.

“Kail Linux” – typo.

This has been corrected.

“Table .2” – typo. There are many typos.

This has been corrected.

“Algorithm 17” – what is it?

This was a typo; it is actually Algorithm 1, and it has been corrected.

“comparison between different DNS tunneling for the same dataset [31]”

What dataset? Name it?

The Labeled DNS exfiltration dataset. This name is now reflected in the manuscript.

There is no the same dataset, because “The training data for the experiment has been collected within an isolated private network.” [31].

We use the data available on GitHub at https://github.com/netrack/learn

“same dataset introduced in [31]”. The term “dataset” in [31] is not used, at all.

The dataset used in reference [31] has also been used in [33], and we used it to compare against the results of our approach. The data is available on GitHub [https://github.com/netrack/learn].

The comparison provided in Table 11 is not possible.

Referring to the above response, comparison Table 11 is reasonable, because all the mentioned studies use the DNS Exfiltration Dataset available on GitHub.

“DNS tunneling dataset [31],” – false.

The dataset name has been changed to the Labeled DNS exfiltration dataset, as it is named in their study.

The figures 8 and 9 are not needed.

Figures 8-11 are part of the Appendix; more discussion of them has been added.

There are no such data in [31]:

Dns2tcp          6298    2772
dnscapy         10043    4375
Iodine           8565    3663
Tuns            40392   17378
Attack          65298   28188
Normal          12051    5054
Total Records   77350   33242

We performed this statistical analysis ourselves on the dataset generated by [31]. The dataset used in reference [31] is available on GitHub [https://github.com/netrack/learn].

Use of the keyword “DNS-UNSW-NB15” is not advised.

This comment has been taken into consideration

There are too many typos, e.g., in the titles of the methods “PIO-Hill-Climbing”

All abbreviations have been unified to PIO-Hill_Climbing

Reviewer 5 Report

Overall, the paper is clearly organized and written, and the methodology applied is sound and is well described. There are however a few issues that should be better addressed by the authors in order to improve the paper:

1) The split between training samples and testing samples is different in the different datasets. This should be explained: what is the reason?

2) The results of classification are extremely good. The authors should explain how they avoided the risk of overfitting.

3) The two heuristics proposed for feature selection perform similarly and are only variations of a known method. Is it possible to assess how much of the improvement is due to these methods?

Author Response

Reviewer #5

1) The split between training samples and testing samples is different in the different datasets. This should be explained: what is the reason?

The UNSW-NB15 and the labeled DNS tunneling datasets already have separate training and testing files.

2) The results of classification are extremely good. The authors should explain how they avoided the risk of overfitting.

We trained one-class classifiers on the data: One-Class Support Vector Machine (OC-SVM), Local Outlier Factor (LOF), and Isolation Forest. All of these classifiers are trained on a single type of data.
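The idea of fitting a detector on a single class only can be illustrated with a minimal sketch. Note that this is a deliberately simple z-score detector written with the Python standard library, not OC-SVM, LOF, or Isolation Forest, and the class name `ZScoreOneClass`, the threshold, and the two example features (packet length, entropy) are assumptions made for illustration.

```python
from statistics import mean, pstdev


class ZScoreOneClass:
    """Toy one-class detector: learns per-feature mean/std from normal data only.

    This is a stand-in for the one-class classifiers named in the response,
    not an implementation of any of them.
    """

    def __init__(self, threshold=3.0):
        # Hypothetical cut-off: rows more than `threshold` standard
        # deviations from the normal mean (on any feature) are outliers.
        self.threshold = threshold

    def fit(self, normal_rows):
        # Only "Normal" samples are seen during training.
        columns = list(zip(*normal_rows))
        self.means = [mean(c) for c in columns]
        # Fall back to 1.0 for constant columns to avoid division by zero.
        self.stds = [pstdev(c) or 1.0 for c in columns]
        return self

    def score(self, row):
        # Largest absolute z-score across all features.
        return max(abs(x - m) / s for x, m, s in zip(row, self.means, self.stds))

    def is_outlier(self, row):
        return self.score(row) > self.threshold


# Hypothetical (packet length, entropy) rows for normal DNS traffic.
normal = [(60, 3.1), (64, 3.0), (58, 3.3), (62, 2.9), (61, 3.2)]
detector = ZScoreOneClass().fit(normal)
print(detector.is_outlier((61, 3.1)), detector.is_outlier((240, 4.8)))
```

Because no attack samples are used during fitting, the detector cannot memorize attack-specific patterns, which is the property the response appeals to.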

3) The two heuristics proposed for feature selection perform similarly and are only variations of a known method. Is it possible to assess how much of the improvement is due to these methods?

Figure 7 has been added, together with a paragraph that discusses it in detail. The figure presents the convergence curves obtained when hill climbing and Tabu search are used.
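To make the difference between the two local searches concrete, the following minimal sketch runs hill climbing and Tabu search over feature bitmasks and records a convergence history for each. The cost function `toy_cost`, the set of "useful" feature indices, and the tabu tenure are all assumptions for illustration; they are not the manuscript's fitness function or settings.

```python
from collections import deque


def toy_cost(mask):
    """Toy stand-in for a fitness value (lower is better)."""
    useful = {0, 3, 5}  # hypothetical indices of informative features
    missed = sum(1 for i in useful if not mask[i])
    return missed + 0.05 * sum(mask)


def neighbors(mask):
    """Yield every mask that differs from `mask` by one flipped bit."""
    for i in range(len(mask)):
        flipped = list(mask)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)


def hill_climb(start, cost, steps=50):
    """Move to the best neighbor until no neighbor improves the cost."""
    current, history = start, [cost(start)]
    for _ in range(steps):
        best = min(neighbors(current), key=cost)
        if cost(best) >= cost(current):
            break  # local optimum reached
        current = best
        history.append(cost(current))
    return current, history


def tabu_search(start, cost, steps=50, tenure=5):
    """Always move to the best non-tabu neighbor, even if it is worse."""
    current = best = start
    tabu = deque([start], maxlen=tenure)  # recently visited solutions
    history = [cost(start)]
    for _ in range(steps):
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
        history.append(cost(best))  # best-so-far cost per iteration
    return best, history


start = (1,) * 8
hc_best, hc_history = hill_climb(start, toy_cost)
ts_best, ts_history = tabu_search(start, toy_cost)
print(hc_best, ts_best)
```

Plotting `hc_history` and `ts_history` against the iteration index gives convergence curves of the kind the response describes. On this toy cost both searches reach the same solution; Tabu search differs in that it can keep moving after a local optimum.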

Round 2

Reviewer 1 Report

Author Response

Thank you

Reviewer 4 Report

Thank you for the revision.

Author Response

You are very welcome

Reviewer 5 Report

This revised version of the manuscript contains further details and discussion on the proposed technique, and makes a number of necessary clarifications on the methodology. However, there remain some issues which are not fully explained:

1) The results presented are slightly inconsistent, e.g., Tables 8, 9 and 11 do not match in respect of the values for OC-SVM.

2) Several hyperparameters in the proposed approach are not explained or discussed. For instance, in Table 5, the values chosen for alpha, beta, and delta seem arbitrary (and they do not sum to one).

3) It is not clear how much of the excellent accuracy is due to the optimized feature selection and to the packet length range. This is important since, for the latter, it would imply that the PIO optimization could play a minor role in the whole system. I am not suggesting an ablation study, but at least a discussion of an apparent contradiction: the features are selected and optimized, but one extra feature is introduced blindly, out of the optimization process.

4) The results are very good with the tested datasets. The authors should briefly describe how they checked that no overfitting is taking place.

Author Response

Second Round

1) The results presented are slightly inconsistent, e.g., Tables 8, 9 and 11 do not match in respect of the values for OC-SVM.

Thank you for your comments. Each table presents the results for OC-SVM on a different dataset: Table 8 presents the evaluation results for the UNSW-NB15 dataset, while Table 9 shows the results achieved using the Labeled DNS exfiltration dataset in [32]. Finally, Table 11 compares different DNS tunneling techniques on the Labeled DNS exfiltration dataset with the results of our proposed PIO-Tabu_search.

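2) Several hyper parameters in the proposed approach are not explained or discussed. For instance, in Table 5, the values chosen for alpha, beta, and delta seem arbitrary (and they do not sum to one).

This comment has been taken into consideration, and the paragraph discussing these parameters has been revised. Moreover, the typo in the alpha value 0.4 has been corrected.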

“The fitness function is used to evaluate the pigeons (solutions) in each iteration. In the LS_PIO, the fitness function used is presented in Equation 8. Based on Equation 8, the best pigeon is the one with the minimum fitness value [44]. In Equation 8, N is the number of features in the dataset, and F is the number of selected features in the solution. α, β, and δ are weights that reflect the importance of each corresponding measure.

The weights sum to 1, with α = 0.04, β = 0.48, and δ = 0.48. The weights of TPR and FPR are equal since they have the same importance for the DNS tunneling detection system, while the weight of the number of selected features is smaller than the other weights; this small fraction is used to choose between solutions that have the same TPR and FPR but a different number of selected features.”
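Equation 8 itself is not reproduced in this record, so the sketch below only illustrates how a weighted fitness of this kind could behave. The weights are the ones quoted above; the functional form (a weighted sum of the selected-feature fraction F/N, the miss rate 1 − TPR, and the FPR, minimized) is an assumption, not the manuscript's equation.

```python
def fitness(tpr, fpr, n_selected, n_total, alpha=0.04, beta=0.48, delta=0.48):
    """Assumed weighted fitness; lower is better.

    alpha weights the fraction of selected features (F / N),
    beta weights the miss rate (1 - TPR), and delta weights the FPR.
    The exact form of Equation 8 may differ from this sketch.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return alpha * (n_selected / n_total) + beta * (1.0 - tpr) + delta * fpr


# Two hypothetical solutions with identical TPR and FPR but different sizes.
small = fitness(tpr=0.95, fpr=0.02, n_selected=13, n_total=42)
large = fitness(tpr=0.95, fpr=0.02, n_selected=42, n_total=42)
print(small < large)
```

Under this assumed form, the small α term acts as a tie-breaker: it only decides between solutions whose TPR and FPR are close, and then it favors the one with fewer selected features.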

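3) It is not clear how much of the excellent accuracy is due to the optimized feature selection and to the packet length range. This is important since, for the latter, it would imply that the PIO optimization could play a minor role in the whole system. I am not suggesting an ablation study, but at least a discussion of an apparent contradiction: the features are selected and optimized, but one extra feature is introduced blindly, out of the optimization process.

Section 5 has been reorganized, and a new subsection has been added. A paragraph has been added to Section 5 to discuss in detail the importance of this hybrid system. Figure 5 has also been redrawn to clarify the proposed hybrid model.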
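The record does not spell out how the packet length range is combined with the selected features, so the following is only one possible reading, sketched under stated assumptions: the optimized feature mask is applied first, and a binary "packet length within range" indicator is then appended outside the optimization. The function name `build_input` and the range bounds are hypothetical.

```python
def build_input(row, selected_mask, packet_len, length_range):
    """Keep the optimizer-selected features, then append a length-range flag.

    `length_range` is a hypothetical (low, high) pair in bytes; the flag is
    added regardless of what the feature-selection step chose.
    """
    selected = [x for x, keep in zip(row, selected_mask) if keep]
    low, high = length_range
    in_range = 1.0 if low <= packet_len <= high else 0.0
    return selected + [in_range]


# Hypothetical example: 5 raw features, 2 kept by the selection step.
vector = build_input(
    row=[0.2, 7.0, 1.5, 0.0, 3.3],
    selected_mask=[1, 0, 0, 0, 1],
    packet_len=180,
    length_range=(40, 120),
)
print(vector)
```

In this reading, the final input has one more dimension than the optimized subset, which is the situation the reviewer's comment refers to.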

4) The results are very good with the tested datasets. The authors should briefly describe how they checked that no overfitting is taking place.

In this paper, we evaluate the proposed model on three datasets. For two of them, UNSW-NB15 and the Labeled DNS exfiltration dataset in [32], the training and testing splits are provided by the source. For our generated dataset, we compute AUC measures, which are a good indicator of whether overfitting is present. In any case, the proposed model uses one-class classifiers such as LOF, iForest, and OC-SVM, which are trained on only one kind of labeled data, “Normal”. Moreover, we use cross-validation for the generated dataset (testbed).
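As a small illustration of the AUC check mentioned above, the sketch below computes AUC directly from its rank-based definition (the probability that a randomly chosen attack sample receives a higher anomaly score than a randomly chosen normal sample). The score lists are made-up numbers, and comparing training AUC with held-out AUC is a common heuristic rather than a description of the authors' exact procedure.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: P(score_pos > score_neg), counting ties as 0.5."""
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical anomaly scores (higher means "more likely tunneling").
train_auc = auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
test_auc = auc([0.9, 0.4], [0.6, 0.1])
print(train_auc, test_auc, train_auc - test_auc)
```

A held-out AUC that is much lower than the training AUC is a typical warning sign of overfitting, whereas similar values on both splits suggest that the model generalizes.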

Round 3

Reviewer 5 Report

The authors have adequately addressed my previous concerns on some technical issues in the paper. Therefore, I consider that this manuscript could be accepted for publication.
