Peer-Review Record

Dynamic Malware Analysis Based on API Sequence Semantic Fusion

Appl. Sci. 2023, 13(11), 6526; https://doi.org/10.3390/app13116526

by Sanfeng Zhang¹,², Jiahao Wu¹, Mengzhe Zhang¹ and Wang Yang¹,²,*

Reviewer 1: Naeem Jan

Reviewer 2: Abdelkader Laouid

Reviewer 3: Paul Tavolato

Submission received: 15 March 2023 / Revised: 20 May 2023 / Accepted: 23 May 2023 / Published: 26 May 2023

(This article belongs to the Special Issue Data-Driven Cybersecurity and Privacy Analysis)

Round 1

Reviewer 1 Report

· Incorporate the novelty of your work into the abstract.

· In the introduction section, it would be beneficial to outline the contributions and motivations of your work in a point-wise manner.

· Please provide some discussion with each new section to aid the reader's comprehension.

· Kindly rephrase the conclusion section in a clear and concise manner and include the limitations and future directions of your work.

· The advantages of the proposed work should be added.

· The strategy of the proposed work should be explained in the main results section.

· Please correctly cite all the figures.

· Please add more detail for Figures 1 and 2; this would help readers.

· How is your proposed work superior to other existing methods? Please add detail in the main results section.

· For papers related to GANs, malware, and similar topics, authors should include the latest references, such as (1) DOI: 10.1016/j.asoc.2023.110088 and (2) DOI: 10.1109/TCE.2022.3226819.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The Mal-ASSF model, proposed by the authors, is an approach to malware detection that combines semantic and sequence features. API2Vec embedding is used to reduce the dimensionality of API call sequences, which are then analyzed using multiple convolution kernels and TextCNN to capture behavioral features of segments of different lengths. To uncover implicit semantic information, the API function is represented by a triplet of operation, type, and category. The model is built on BiLSTM and a self-attention mechanism, fusing the sequence and semantic features of the API to focus on suspicious, malicious segments in the sequences. However, the manuscript should be improved by considering the following points:

1. The English writing is poor. Scientific papers need short sentences. In line 12 of the abstract, "too many" is not scientific wording. In line 17, "The triplet of operation, type, and category is adopted": I think "are" is correct here.

2. In the Introduction section, the authors fail to introduce their work compared to the related works correctly. The authors should show the limitation of the existing works compared with the proposed work. Also, the Introduction section must be rewritten and reorganized to improve its readability.

3. The related work section is too short; several more related works should be cited and described. It is not necessary to compare your proposal with all cited related works, but the most closely related works must be cited in this manuscript.

4. The contribution is not evident in the Section "Materials and Methods". The authors explain the proposed model's running mode with technical descriptions.

My opinion: The current version cannot be published as it is, and a Major Revision of this manuscript is requested.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper "Dynamic Malware Analysis Based on API-Sequence-Semantic Fusion" deals with an important topic: API sequences obtained from executing a suspicious sample are analyzed with respect to their maliciousness. The main problem of the paper lies in the subsections "Goals and Approaches" and "Contributions". For the reader, it is not at all clear what the general setting of the project is. With machine learning methods, one usually must distinguish clearly between the following steps: (a) feature selection, (b) the learning phase, and (c) the analysis of an unknown sample based on the results of the learning phase. These steps are not made clear in the paper. I could never work out which of the three phases the paper addresses. Which of these phases does your work address? And if it addresses more than one, it should be made clear where you describe which phase. The project behind the paper seems to be sound and interesting, but the description is not adequate for a reader not involved in the project. To put it metaphorically: you perfectly describe the third branch of the second tree in the last row; but before one can understand that, one must know which wood one is in.

Additionally, some details could be improved:
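The three steps distinguished in this comment can be sketched in a few lines. The Python below only illustrates that separation and is not the Mal-ASSF pipeline; the function names, the toy traces, and the deliberately simple overlap-based scoring are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Trace = Sequence[str]


def extract_features(trace: Trace) -> Counter:
    """(a) Feature selection: here, simply bigrams of API names."""
    return Counter(zip(trace, trace[1:]))


def learn(labelled: List[Tuple[Trace, str]]) -> Dict[str, Counter]:
    """(b) Learning phase: aggregate feature counts per class label."""
    model: Dict[str, Counter] = {}
    for trace, label in labelled:
        model.setdefault(label, Counter()).update(extract_features(trace))
    return model


def analyze(model: Dict[str, Counter], trace: Trace) -> str:
    """(c) Analysis: label an unknown trace by feature overlap with each class."""
    feats = extract_features(trace)
    scores = {label: sum((feats & counts).values()) for label, counts in model.items()}
    return max(scores, key=scores.get)


# Hypothetical toy traces; the API names are placeholders, not real dataset entries.
training = [
    (["open", "read", "close"], "benign"),
    (["open", "write", "connect", "send"], "malicious"),
]
model = learn(training)
verdict = analyze(model, ["open", "write", "connect"])
```

Keeping the three functions separate makes it explicit which part of a description refers to feature construction, which to training, and which to classifying a new sample.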

Page 3, line 115: "... to treat all API calls as a unified sequence" is incomprehensible; which API calls are put together into a sequence? What does unified mean here?

In section 3.1, the overview should clearly state what the target of Mal-ASSF is: is it feature selection, machine learning, or analysis of an unknown sample (see above)? Moreover, it is not clear whether the tasks "API Embedding" and "Implicit Information" are executed in parallel (as suggested by Figure 1) or sequentially (as suggested by the description in lines 220-226, where the first task is defined as the construction of a triple semantic chain and the second task then uses these triple semantic chains).

In 3.2.1 you write about the "sufficient length" of the API call sequences. What is this in numbers?

In section 3.2.2, it again remains unclear throughout what the goal really is and which phase of the process is being addressed (see above).

Line 352: how does the tanh come into play here?

Line 440 ff: What is the dataset addressed here? Each record of the dataset contains 5 fields, but what does each record correspond to? Apparently to one API call. But it remains unclear what the sequence is. Maybe all API calls issued during one execution of a thread form a sequence?
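One possible reading of this question, that a sequence consists of all API calls issued by one thread, can be sketched as follows. This is an assumption made for illustration and not the dataset's documented structure; the field names `thread_id`, `index`, and `api`, as well as the API names, are hypothetical and do not come from the manuscript.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def sequences_by_thread(records: Iterable[Mapping]) -> Dict[int, List[str]]:
    """Group per-call records into one ordered API sequence per thread."""
    grouped = defaultdict(list)
    for rec in records:
        # Keep the call position so each thread's calls can be ordered later.
        grouped[rec["thread_id"]].append((rec["index"], rec["api"]))
    return {tid: [api for _, api in sorted(calls)] for tid, calls in grouped.items()}


# Hypothetical records, interleaved across two threads.
records = [
    {"thread_id": 1, "index": 0, "api": "OpenFile"},
    {"thread_id": 2, "index": 0, "api": "Connect"},
    {"thread_id": 1, "index": 1, "api": "ReadFile"},
    {"thread_id": 2, "index": 1, "api": "Send"},
]
sequences = sequences_by_thread(records)
```

Under this reading, each record is one API call and each thread yields one sequence; other groupings (for example, per process or per sample) would need a different key.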

In lines 468-472, the numbers (1000-6000) are defined as sequence lengths. The conclusion that the optimal sequence length is 5000, based on the fact that the accuracy and F1-score gain between 5000 and 6000 is minimal, is not straightforward (there could be an additional significant gain between 7000 and 8000); though this is not very probable, it needs an explanation.

In section 4.3.3 there are some things left unclear: for the compared methods what software did you use? Did you program that yourself or did you use some available software?

In line 519 you talk about the "API2Vec module" in table 8; but there is no API2Vec in table 8.

In the conclusions (line 542 ff) you state that "...traditional machine learning methods ... only rely on API frequency information". I think this is not true for all approaches described in the literature; there are some that analyze n-grams (see for example https://doi.org/10.1049/iet-ifs.2017.0430 as a start).
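The distinction drawn in this comment can be illustrated with a short sketch. The Python below is not taken from the manuscript or from the cited work; the function names and the toy API traces are assumptions chosen for illustration. It shows that two call sequences with identical API frequencies can still be told apart by their n-grams.

```python
from collections import Counter
from typing import Iterable, Sequence


def api_frequencies(calls: Iterable[str]) -> Counter:
    """Count how often each API name occurs, ignoring order."""
    return Counter(calls)


def api_ngrams(calls: Sequence[str], n: int = 2) -> Counter:
    """Count contiguous n-grams of API names, which preserves local order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


# Hypothetical toy traces: the same calls in a different order.
trace_a = ["OpenFile", "ReadFile", "WriteFile", "CloseFile"]
trace_b = ["OpenFile", "WriteFile", "ReadFile", "CloseFile"]

same_frequencies = api_frequencies(trace_a) == api_frequencies(trace_b)
same_bigrams = api_ngrams(trace_a, 2) == api_ngrams(trace_b, 2)
```

Here the frequency view cannot separate the two traces, while the bigram view can, which is why n-gram approaches are not adequately described as relying only on frequency information.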

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Now the paper is ok from my side.

Author Response

Thank you for your review.

Reviewer 3 Report

The paper has improved through the changes. I still think that it is hard to understand the overall goal and method.
