Peer-Review Record

Malicious PDF Detection Model against Adversarial Attack Built from Benign PDF Containing JavaScript

Appl. Sci. 2019, 9(22), 4764; https://doi.org/10.3390/app9224764

by Ah Reum Kang¹, Young-Seob Jeong¹, Se Lyeong Kim² and Jiyoung Woo¹,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 4 October 2019 / Revised: 6 November 2019 / Accepted: 6 November 2019 / Published: 8 November 2019

(This article belongs to the Special Issue Machine Learning for Cybersecurity Threats, Challenges, and Opportunities)

Round 1

Reviewer 1 Report

The authors of this paper work on a very popular subject: the creation of a malicious PDF detector that is robust against adversarial attacks. To achieve their goal, they take advantage of the structural characteristics of PDF files, extracting features from them along with others found in the literature. Their adversarial examples are files that mimic the structure and content of benign files while containing malicious code. Using a variety of machine learning algorithms, they show that Random Forest offers the best results, with precision and recall reaching 0.998.
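As a rough illustration of what extracting structural features from a PDF can look like, the sketch below counts a few structural markers in the raw bytes of a file. It is a minimal sketch under stated assumptions, not the authors' feature extractor: the function name `structural_features` and the keyword set `KEYWORDS` are hypothetical, and a real extractor would also need to handle compressed object streams and name obfuscation.

```python
import re

# Hypothetical keyword set chosen for illustration; not the authors' feature list.
KEYWORDS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA")


def structural_features(pdf_bytes: bytes) -> dict:
    """Count a few structural markers in raw (uncompressed) PDF bytes."""
    features = {
        # "<number> <generation> obj" opens an indirect object.
        "obj": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", pdf_bytes)),
        # "stream" followed by an end-of-line opens a stream;
        # the lookbehind skips the closing "endstream" keyword.
        "stream": len(re.findall(rb"(?<!end)stream\r?\n", pdf_bytes)),
    }
    for keyword in KEYWORDS:
        # The lookahead keeps a short name such as /JS from matching
        # the prefix of a longer name.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        features[keyword.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return features
```

The resulting dictionary could be turned into one row of a feature matrix for a classifier; which markers are actually informative is an empirical question that the sketch does not answer.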

This is a well-written paper with a good structure. However, it has some issues that should be amended before publishing.

The contributions of their work should be listed and clarified in the introduction. No details are given about the testbed environment used by the authors. Regarding the adversarial attacks, there is no mention of related literature, especially taxonomies (e.g. Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., & Tygar, J. D. (2011, October). Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence (pp. 43-58). ACM; Shumailov, I., Zhao, Y., Mullins, R., & Anderson, R. (2018). The Taboo Trap: Behavioural Detection of Adversarial Samples. arXiv preprint arXiv:1811.07375; and Pitropakis, N., Panaousis, E., Giannetsos, T., Anastasiadis, E., & Loukas, G. (2019). A taxonomy and survey of attacks against machine learning. Computer Science Review). The authors should check the taxonomies and communicate to the reader the characteristics of the attack that they orchestrate (e.g. attack type, attack specificity, attack influence), as they only mention the attacker's knowledge, and it can merely be inferred from the text that the attacked algorithm is meant for classification. The authors do not clarify how they inject malicious code into the benign PDF file, or whether this is done manually or automatically. Clarifying this would greatly help the replicability of their experiments. The authors should also consider the transferability of adversarial examples from other application domains (e.g. Papernot, N., McDaniel, P., & Goodfellow, I. (2016). Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277).

Author Response

Re: Manuscript ID: applsci-621355 - Major Revisions

Please find attached a revised version of our manuscript “Malicious PDF detection model against adversarial attack built from benign PDF containing JavaScript”, which we would like to resubmit for publication as a research article in Applied Sciences.

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

Revisions in the text are shown using yellow highlight for additions. We hope that the revisions in the manuscript and our accompanying responses will be sufficient to make our manuscript suitable for publication in Applied Sciences.

We shall look forward to hearing from you at your earliest convenience.

Author Response File: Author Response.docx

Reviewer 2 Report

This article presents very interesting research with highly significant content, dissecting the PDF structure and its weaknesses. However, the main problem of this article is its poor quality of presentation. The following weaknesses must be addressed before publication:

1) The abstract should be revised. The first sentence is too abrupt, as it does not mention the background. The background from the Introduction should be summarised here.

2) The introduction is rich in background information but also contains details of the subsequent research procedure, which should be placed in a new section called "Research Methodology". The same applies to the Conclusions, which contain more details than Sections 3 and 4. It is advised to add a section called "Discussion" containing the authors' reflections on the results obtained.

3) The presentation of "Experiment and Results" for the machine learning algorithms (sections 4.3, 4.3.1 and 4.3.2) is too brief and weak.

4) There are several typos in section numbers, e.g. section number 4.3.1 on page 14 and section number 4 on page 15.

5) There are some minor technical issues. In line 226, creating a keyword list based only on the samples observed is weak, and leaves the approach open to challenge by other researchers. Should the authors use related work or a statistical measure to justify that the samples represent the whole population? Besides, why choose only Bayes, Random Forest and SVM as representative algorithms? There are many other, newer ML algorithms.

Author Response

Re: Manuscript ID: applsci-621355 - Major Revisions

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

Revisions in the text are shown using cyan highlight. We hope that the revisions in the manuscript and our accompanying responses will be sufficient to make our manuscript suitable for publication in Applied Sciences.

We shall look forward to hearing from you at your earliest convenience.

Author Response File: Author Response.docx

Reviewer 3 Report

The paper discusses the problem of detecting PDF documents with malicious code produced through JavaScript. The problem has been debated many times and documentation abounds.

However, the authors claim to have developed a novel detection technique.

Comments.

Chapter 3.

- Figure 1 is useless; there is no information in it that cannot simply be described in the text.

- Figure 4 cannot be read, too tiny.

Line 179: “In order to use the machine learning algorithm”, unclear. Which algorithm?

Line 184: “The big difference is that benign PDF files contain a much larger number of objects and streams than malicious PDF files”

Much larger is not an acceptable characterization. How much is much larger?

Lines 186-197: The authors say that PDF documents with JavaScript are often malicious, and then that in malicious documents the JavaScript is often obfuscated. Does this mean that finding obfuscated JavaScript in a PDF document is a sufficient condition to classify it as malicious? Is it also a necessary condition? A better explanation is needed.

Figure 7: the meaning of this figure is unclear. What should it communicate to a reader?

Line 230: Section 3.3 is all very confusing. The English is very difficult to read. Then there is “These 251 samples are much different from benign documents collected by Vaccine companies and usually used in the previous works.” What are “Vaccine companies”? Where did the authors take this expression from?

Figure 11: Completely useless

Line 261 This section is titled “Machine learning algorithm”, but actually it generically lists some machine learning approaches.

Line 283 Section 4.1 “The PDF files used in this study consist of 11,097 malware document files, 9,000 benign document files collected by major vaccine companies such as Fireye, Fsecure, Kaspersky, and so on, from November 2009 to June 2018.”

So, “vaccine companies” are cybersecurity companies. Bizarre terminology. Most important, what are those “malware document files”? What does it mean that the authors collected them from those companies? Collected what? How? Explanations about the dataset are needed.

- The results and experiments section is so confusing that it is hard to say whether the authors found something interesting or whether it is just testing of some algorithms on test data.

This paper needs a full reorganization and to be rewritten from scratch, before being considered for publication, because in this form the original contribution, assuming it exists, is unrecognizable.
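One way to answer the request above for a quantified comparison of object and stream counts is to report the median count per class and the ratio between the two. The sketch below is a hypothetical illustration, not the authors' analysis: the function name `median_ratio` is an assumption, and the per-file counts would have to come from the actual dataset.

```python
from statistics import median


def median_ratio(benign_counts, malicious_counts):
    """Return (median_benign, median_malicious, ratio) for per-file counts.

    The ratio states how many times larger the typical benign count is
    than the typical malicious count.
    """
    median_benign = median(benign_counts)
    median_malicious = median(malicious_counts)
    # Guard against division by zero when the malicious median is 0.
    if median_malicious:
        ratio = median_benign / median_malicious
    else:
        ratio = float("inf")
    return median_benign, median_malicious, ratio
```

Medians are used rather than means because a handful of very large documents can dominate an average; this is a design choice of the sketch, not something the source specifies.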

Author Response

Re: Manuscript ID: applsci-621355 - Major Revisions

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

Revisions in the text are shown using green highlight for additions. We hope that the revisions in the manuscript and our accompanying responses will be sufficient to make our manuscript suitable for publication in Applied Sciences.

We shall look forward to hearing from you at your earliest convenience.

Author Response File: Author Response.docx

Reviewer 4 Report

Good paper, well written, interesting topic.

Some suggested edits:

Lines 37 – 39: Reword as: “Previous studies have derived structural and meta features, while previous research has focused on extracting features from malware and constructing algorithms based on the malware-centric feature set (excluding benign features)”

Line 44: change “a few benign PDFs contain JavaScript although they are very few” to “a few benign PDFs contain JavaScript”

Line 58: Section 2 should be called “Related Research”

Line 133: Figure 1 needs to be referenced in the text

Line 351: Typo – should be “Structure feature model”

Author Response

Re: Manuscript ID: applsci-621355 - Major Revisions

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

Revisions in the text are shown using pink highlight for additions. We hope that the revisions in the manuscript and our accompanying responses will be sufficient to make our manuscript suitable for publication in Applied Sciences.

We shall look forward to hearing from you at your earliest convenience.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have addressed all the comments as requested. The quality of their work has improved, and it is now ready to be published.

Author Response

Re: Manuscript ID: applsci-621355 - Minor Revisions

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

We appreciate your kind comments.

Author Response File: Author Response.docx

Reviewer 3 Report

The paper clearly improved with this revision.

All the critical parts have been reviewed and the authors demonstrated that their work was more solid than the presentation of the first version. The manuscript is now evidently more carefully redacted and has reached a good quality.

It remains to do a thorough and careful check of the English and fix typos. There are several, as well as sentences and paragraphs that need to be rewritten in a better way.

However, despite these minor comments, the work has shown its merits.

Author Response

Re: Manuscript ID: applsci-621355 - Minor Revisions

Your comments were highly insightful and enabled us to greatly improve the quality of our manuscript. In the following pages are our point-by-point responses to your comments.

We appreciate your kind comments.

Author Response File: Author Response.docx
