Peer-Review Record

N-Trans: Parallel Detection Algorithm for DGA Domain Names

Future Internet 2022, 14(7), 209; https://doi.org/10.3390/fi14070209
by Cheng Yang, Tianliang Lu *, Shangyi Yan, Jianling Zhang and Xingzhan Yu
Submission received: 28 May 2022 / Revised: 8 July 2022 / Accepted: 12 July 2022 / Published: 13 July 2022
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

Round 1

Reviewer 1 Report

The English definitely needs to be improved. I strongly advise you to have the manuscript revised by a professional editor or a native speaker. (In its current state, the paper is unacceptable!)

There are several issues with the clarity and quality of the explanations in the paper, and also with its scientific soundness (which is a serious issue). Please see the following comments for details:

1. Introduction (p.1, l.41-42): You make a bold statement that existing malicious domain name detection algorithms are not good enough. Please substantiate this claim and add references to support it.

2. Related work (p.2, l.70-71): Here too you make a bold statement, namely that traditional approaches are laborious and take too much time. Please substantiate this claim and add references to support it.

3. In the second part of the Related work section, you give an overview of deep-learning methods. However, a conclusion is missing on how well or poorly these models perform at detecting DGA-generated domain names. Please expand on this.

4. Data pre-processing (p.3, l.123-125): The description of which parts of a domain name are extracted is rather inaccurate. (You state that you remove the top-level domain and the subdomains, which implies that you remove the entire domain name.)
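
To illustrate what a precise description would look like: only the second-level label should remain. A minimal Python sketch (using the public tldextract library; a generic illustration, not necessarily the authors' tooling):

```python
import tldextract  # handles multi-part public suffixes such as 'co.uk'

def second_level_label(fqdn: str) -> str:
    """Keep only the second-level label; drop subdomains and the public suffix."""
    parts = tldextract.extract(fqdn)
    # e.g. for 'www.example.co.uk': subdomain='www', domain='example', suffix='co.uk'
    return parts.domain

print(second_level_label("www.example.co.uk"))  # -> example
```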

5. Data pre-processing (p.4, l.135): What do you mean when referring to 'memory loss'?

6. Parallel training (p.4, l.138-148): This is largely a repetition of what you already mentioned before.

7. Parallel training (p.4, l.149): You mention N=2,3,4. However, in the experiments later on you use other ranges for N. Please be more accurate.
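
For reference, the character N-grams of a label for N=2,3,4 are simply all contiguous substrings of those lengths; a generic sketch (not the authors' code):

```python
def char_ngrams(label: str, n_values=(2, 3, 4)):
    """All contiguous character n-grams of `label`, for each n in n_values."""
    return {n: [label[i:i + n] for i in range(len(label) - n + 1)]
            for n in n_values}

print(char_ngrams("google"))
# {2: ['go', 'oo', 'og', 'gl', 'le'],
#  3: ['goo', 'oog', 'ogl', 'gle'],
#  4: ['goog', 'oogl', 'ogle']}
```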

8. Transformer model (p.6-7): This text is rather unclear. On l.196-201 and in Figure 5: what is meant by a token, and what does multi-head attention do? On l.208-209: what is the letter vector? On l.213-214: what do Q, K and V actually represent in your model? On l.215: which matrix are you referring to?
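
For reference, in the standard Transformer formulation, Q (queries), K (keys) and V (values) are linear projections of the input token embeddings, and attention is computed as softmax(QK^T/sqrt(d_k))V; multi-head attention simply runs several such computations in parallel with different learned projections and concatenates the results. A generic NumPy sketch (not the authors' implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V                                     # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```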

9. Data set and test environment (p.7): You should provide more details on the dataset with malicious domain names that you used in your experiments. In particular, you should mention which DGA families and types are actually included. E.g., the paper by Vranken & Alizadeh (Detection of DGA-Generated Domain Names with TF-IDF, Electronics 2022, 11(3), 414, MDPI) clearly shows that the mix and type of DGA families included in the dataset has a big impact: hash-based DGAs are rather easy to detect, arithmetic-based DGAs are somewhat more difficult, but wordlist-based DGAs are far more difficult to detect with machine learning or deep learning models. Hence, it is possible to steer experimental results by using a DGA dataset that includes hash-based DGAs only. You therefore should be very explicit about which DGA families are included in your dataset (and preferably you should split the experimental results by DGA type).

10. Data analysis (p.8, l.247): Why did you select only 10,000 domain names? This is a rather small number, also given that the Alexa dataset alone already contains 1 million domain names. You should also make clear how you derived the training set and the evaluation/test set for your experiments.
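
For instance, the paper could state the split explicitly along these lines (a generic scikit-learn sketch with toy stand-in data; the actual ratio and seed are for the authors to report):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical); the paper should report how its
# 10,000 legitimate + 10,000 malicious samples were actually divided.
X = ["google", "baidu", "amazon", "wikipedia", "github",
     "xjqzpt", "kwpvrh", "qzmvxk", "rtplwq", "zzqjxm"]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = legitimate, 1 = DGA

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # e.g. an 80/20 train/test split
    random_state=42,  # fixed seed for reproducibility
    stratify=y,       # preserve the class balance in both sets
)
print(len(X_train), len(X_test))  # 8 2
```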

11. Data analysis (p.8, l.263-264): Please explain in more detail what is meant by 'due to the increase in the number of phrase element bits'.

12. Data analysis (p.9, l.282 and l.287): You state that the variance and skewness of legitimate domain names are smaller, but the data in Table 2 show that they are actually larger.
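
Both statistics are directly checkable, e.g. with SciPy (illustrative values only, not the paper's data):

```python
import numpy as np
from scipy.stats import skew

lengths = np.array([6, 5, 6, 9, 6, 7])   # hypothetical per-domain feature values
print(np.var(lengths), skew(lengths))     # variance and Fisher skewness
```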

13. Evaluation indicators (p.10): Why do you only use accuracy and recall as metrics? When reporting recall, you should also show the precision (as you may be trading off recall against precision). In addition, you could use the F1-score (the harmonic mean of recall and precision) and preferably also show precision-recall curves.
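
For completeness, precision, recall, and F1 are all one-liners with scikit-learn; a generic sketch with hypothetical labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical ground truth (1 = DGA)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical predictions

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f = f1_score(y_true, y_pred)         # 2*p*r / (p + r), the harmonic mean
print(p, r, f)                       # 0.75 0.75 0.75
```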

14. Table 3 (p.10-11): The experimental results vary a lot when applying different ranges of N. You should discuss in more detail what this means for the trustworthiness of your results. E.g., accuracy is 0.9697 for N=2,3,4, but 0.7630 for N=3,4,5 and 0.9368 for N=2,3,4,5. There does not seem to be a simple relation between the range of N and the accuracy (or recall).

15. Network model comparison experiment (p.12): It is not very clear how you performed these experiments, and it seems that you are comparing apples and oranges. It seems that you only used bigrams with the traditional models: why did you not also consider other N-grams, as you do with your own model? Also, the strength of the deep-learning models you used, such as RNN and LSTM, is that they may be able to identify similar relationships between n-grams by themselves, as your model does. Hence, a much fairer comparison would be to feed the complete second-level domain to these deep-learning models using word embedding.
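
A minimal sketch of such a fairer baseline (generic PyTorch; the vocabulary size, dimensions and padding scheme are hypothetical, not the authors' architecture):

```python
import torch
import torch.nn as nn

class CharLSTMClassifier(nn.Module):
    """Classify raw second-level domains from a learned character embedding."""
    def __init__(self, vocab_size=40, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, char_ids):                # (batch, seq_len) integer ids
        x = self.embed(char_ids)                # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)              # final hidden state
        return self.head(h_n[-1]).squeeze(-1)   # one logit per domain

model = CharLSTMClassifier()
dummy = torch.randint(1, 40, (8, 20))  # 8 domains, padded/truncated to 20 chars
print(model(dummy).shape)              # torch.Size([8])
```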

16. Network model comparison experiment (p.12, l.401): You mention that training speed is an issue, but you don't show experimental results to support this.

17. Conclusions (p.14, l.443): Please provide more details on what aspects of the framework still need improvement.

18. Conclusions (p.14, l.447): To what extent are, e.g., adjacent consonants already considered in your current approach with n-grams?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This research proposes a parallel detection model named N-Trans, based on the N-gram algorithm and the Transformer framework, to extract external features.


The following should be noted and corrected accordingly:

1. How practicable is your proposed model in real time?

2. Is it cost-efficient?

3. Some diagrams and terms are not properly explained.

4. Grammar is not up to standard and requires extensive re-editing.

5. Are the formulas and numbers here generic, or were they generated by you?

6. The Introduction section should contain the organization of the paper.


Study and consider the following related paper to strengthen your paper:

• https://doi.org/10.1007/s00530-020-00701-5

Major revisions are required.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

- Please emphasize and highlight the real application of the proposed method.

- Be elaborate on the advantages or limitations of the proposed method.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The English spelling and grammar have been improved considerably. However, the writing style still needs considerable improvement. I have comments on many sentences. To give just one example, below is a suggestion of how the first part of the abstract could be improved:

Domain Generation Algorithms are widely used in malware, such as botnet binaries, to generate large sequences of domain names, some of which are registered by cybercriminals. Detection of such malicious domain names by means of traditional machine learning algorithms has been explored by many researchers, but is still not perfect. To further improve on this, we propose a novel detection model named N-Trans, which combines the N-gram algorithm with the Transformer model.

The comments below are replies to the points mentioned in your cover letter.

Point 1 and 2: The updates in the paper are indeed improvements. Still, the troubles with feature engineering could be described more clearly. You refer to reference [3] to support your claim. I tried to look up this paper (e.g., on Google Scholar) but could not find it (which gives me the impression that it might be fake!).

Point 3: Your update on p.3, l.112-117, is an improvement. You identify two shortcomings of prior work: a lack of location features (but you don't give evidence to support this claim), and dealing with low randomness in domain names. You address the first issue in your work, but not the second one.

Point 7: The paper still refers to different ranges for N in different places, e.g. p.4, l.115 (N=2,3,4).

Point 8: The explanation of the Transformer model is still not very clear.

Point 9: I am surprised that you now also mention that your dataset is based on DGArchive (which was not the case in the previous version of the paper!), while your experimental results are still the same. If you indeed used DGArchive, why did you select only the 9 DGA families listed in Table 2? (DGArchive includes many more!)

In Table 2 you provide the DGA families that you included, but details are lacking. You should have mentioned that all of these are in fact arithmetic-based DGAs, except for suppobox, which is a wordlist-based DGA. You also completely ignore that the domain names generated by these DGAs (as well as your legitimate domain names?) do not contain digits, except for gameover and rovnix.

As mentioned in my previous review comments, it is very important to be precise about your dataset, since the composition of the dataset highly influences the outcomes of the experiments (whether intended or not...).

The motivation for your work is partly to address malicious domain names with low randomness, but there is no evidence in your experimental results that you actually targeted these (and your dataset is focused on arithmetic-based DGAs, which generate domain names with rather high randomness).
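
If low-randomness domains are truly the target, the paper should quantify randomness; character-level Shannon entropy is one common proxy (a generic sketch, not a claim about the authors' method):

```python
import math
from collections import Counter

def char_entropy(label: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in Counter(label).values())

print(round(char_entropy("google"), 2))    # 1.92 (repeated characters)
print(round(char_entropy("xq9k2vzp"), 2))  # 3.0  (all characters distinct)
```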

Point 10: Please define in the text what you mean by a data point!

Point 13: You are right that recall and accuracy are sufficient in principle. Still, I would prefer to also see results for the precision. (You now leave it up to the reader to work out that the number of false positives should be low.)

Point 15 and 16: You did not address these comments!


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

1. The Abstract doesn't clearly explain the problem your proposed system is trying to solve.

2. The Literature Review is too brief and should be made more comprehensive.

3. Grammar still requires a little re-editing.


Study and consider the following related paper to strengthen your paper:

• https://doi.org/10.1142/S0219622021500619

Minor revisions are required.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Thanks for further improvement of the paper.

Point 1 and 2: Thanks for pointing me to the location of reference [3]. The reason that I did not find it on Google Scholar is that the paper is in Chinese. Since Future Internet is an English-language journal, I strongly recommend using references that are written in English. So, please replace reference [3] with one or more other references (written in English) that support your claims.

Point 3: I still think that your introduction is misleading. Lines 43-47 suggest that your work improves on the specific shortcomings of RNN and LSTM, which however is not the case. Please be more precise!

Point 7: Ah, I see that N is used for multiple purposes in the paper. This is very confusing. I strongly recommend using N only for the size of the N-grams, and another identifier for the number of parallel models.

Point 8: You did indeed explain this in the cover letter, but please add these explanations to the paper as well.

Point 9: There is no such thing as 'malicious domain names as a whole'. Let me give an analogy: suppose you create a model that can identify the colour of an object, and the model is better at detecting dark colours than light colours. If you evaluate the model on a test set with mostly dark-coloured objects, you will surely get better results than with a test set containing both dark and light colours. Hence, the dataset is crucial.
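
Transposed to DGA detection, in entirely hypothetical numbers:

```python
# A model that is 99% accurate on 'easy' families (e.g. hash-based) and
# 70% accurate on 'hard' ones (e.g. wordlist-based): the overall score
# depends heavily on how the test set is mixed. All figures hypothetical.
acc_easy, acc_hard = 0.99, 0.70

for share_easy in (1.0, 0.8, 0.5):
    overall = share_easy * acc_easy + (1 - share_easy) * acc_hard
    print(f"{share_easy:.0%} easy samples -> overall accuracy {overall:.3f}")
# 100% easy samples -> overall accuracy 0.990
# 80% easy samples -> overall accuracy 0.932
# 50% easy samples -> overall accuracy 0.845
```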

Hence, I repeat my previous comment: you should be more precise about the dataset used!

Point 10: It is still not clear. You now write on lines 255-256: "Ten thousand data of each legitimate and malicious domain names were randomly selected as the dataset for the experiment." What do you mean by data (third word)? Please be more precise!

Point 15: This comment is still not addressed!


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 4

Reviewer 1 Report

The paper has improved further. My additional comments on the quality of the paper have been expressed to the editors of the journal.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
