Article
Peer-Review Record

Platform-Independent Malware Analysis Applicable to Windows and Linux Environments

Electronics 2020, 9(5), 793; https://doi.org/10.3390/electronics9050793
by Chanwoong Hwang 1, Junho Hwang 1, Jin Kwak 2 and Taejin Lee 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 13 April 2020 / Revised: 1 May 2020 / Accepted: 6 May 2020 / Published: 12 May 2020
(This article belongs to the Special Issue Security and Privacy for IoT and Multimedia Services)

Round 1

Reviewer 1 Report

# Summary

The paper addresses the problem of malware analysis by using binary data rather than features based on the structured format of executable files. Lack of standardization due to a variety of devices, vendors, and architectures in existing Linux systems is the main motivation behind the research.

The authors propose an approach for analyzing malicious code that targets Windows, Linux, or IoT platforms with a single code base that supports various architectures.

In particular, Binary-based Strings Analysis and Count-based Strings Vectorization Technology have been used in conjunction with DNN for the purpose of malware detection.

Besides presenting the design of a platform-independent system for malware detection, the paper analyses the effectiveness of the system through an experimental evaluation performed on thousands of examples taken from a public repository and self-collected dataset. Principal component analysis (PCA) is used for feature selection.

The experiments show that the system works in practice and also helps to identify malware in the wild by analyzing binary strings extracted from a binary file of the malicious samples.
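As a rough illustration of the string-extraction step summarized above, the following minimal sketch pulls printable ASCII runs out of raw bytes, similar in spirit to the Unix `strings` utility. The function name `extract_strings` and the sample bytes are hypothetical, and the default minimum length of 5 is taken from the threshold the authors mention later in this record; this is not the authors' implementation.

```python
import re


def extract_strings(data: bytes, min_len: int = 5) -> list[str]:
    """Return runs of printable ASCII (0x20-0x7e) at least min_len bytes long.

    Hypothetical helper: it ignores file format entirely, so the same code
    applies to PE, ELF, or any other binary blob.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Made-up bytes resembling the start of an ELF file followed by a few strings.
sample = b"\x7fELF\x02\x01\x01\x00/bin/busybox\x00ab\x00GET /index.html\x00"
print(extract_strings(sample))  # ['/bin/busybox', 'GET /index.html']
```

Short runs such as `ELF` and `ab` are dropped by the length filter, which is the noise-reduction effect the threshold is meant to have. Strings hidden by packing or encryption would not appear at all, which is the limitation raised in the evaluation that follows.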

# Evaluation

This is a very relevant problem since malware analysis is usually hampered by a variety of vendor-specific data formats.

The paper makes a tall claim of addressing the problem of analyzing malware based on Windows and Linux environment with one code.

The first part of the paper is reasonably well written and the problem statement is clear. The approach isn't novel for the specific application field (malware analysis). However, it has extensive room to grow.

The authors haven’t made the tool open-source. From the presentation point of view, the paper depicts the problem and its solution.

Figures, tables and architectural diagrams portray the intended information backing the evaluation procedure. Despite this, the paper isn't in perfect shape.

In particular, I identified the following weaknesses:

  1. As mentioned in section 3.2, the paper solely relies on string extraction from binary files of the samples. It fails to take into account the malware riddled with obfuscation and encryption techniques. These techniques are often adopted and the authors should discuss their impact on the proposed approach.
  2. During the experimental evaluation, it turns out that it uses a plain vanilla form of neural network. No optimization is used to improve the results. Also, it fails to impart the reasoning behind the selection of the configuration of the neural network used. In addition, there is no mention of the activation function used.
  3. In section 4.2, a fixed vector size of 1000 was taken. However, I would have expected a discussion on how this fixed size was selected to improve malware identification and analysis.

# Minor Comments

- page 2, line -59 :: Spacing problem.

- page 2, line -73,75,77,79,81,83 :: Full stop is required on sentence completion.

- page 3, line 100,103,106,109,112 :: Full stop is required on sentence completion.

- page 4, line 139 :: “Chapter 1.2 ” -> Section 1.2 .

- page 6, lines 198 to 201 :: These lines are same as lines (202 to 206).

- page 7, line 246 :: “imporove”-> improve.

- page 8, line 255 :: “However,Most”-> However, most .

- page 9, line 268 :: “unmber”-> number.

- page 9, line 295 :: No mention of PCA full form in the paper.

- page 10 , line 307  :: “This chapter”->This section.

- page 11 , line 325  :: “self-collection datasets”->self-collected datasets.

Author Response

Dear Mr. Peng.

I received your kind comments.

Below, we describe the revisions made in response to your comments.

We used the "Track Changes" function to make it easy to show changes to editors. The default value of the “Track Changes” function written on the comments response cover is based on the revised manuscript. Corrected captions for pictures, tables and references. Also, the titles of Sections 1.2, 1.3.1, and 1.3.2 were corrected.

I modified the name :: "Chan-Woong Hwang"-> "Chanwoong Hwang"

This cover letter explains, step by step, the answers to the reviewer's comments.

 

# Reviewer 1

# Evaluation:

  1. As mentioned in section 3.2, the paper solely relies on string extraction from binary files of the samples. It fails to consider the malware riddled with obfuscation and encryption techniques. These techniques are often adopted, and the authors should discuss their impact on the proposed approach.
  • An additional explanation was added to Section 3.2. To operate independently of the platform, in line with the subject of this paper, it is necessary to take static analysis and extract functions from binary data, so the approach cannot be free from obfuscation issues. However, through various experiments, we tried to reduce noise in the process of extracting strings.
  2. During the experimental evaluation, it turns out that it uses a plain vanilla form of neural network. No optimization is used to improve the results. Also, it fails to impart the reasoning behind the selection of the configuration of the neural network used. In addition, there is no mention of the activation function used.
  • Added information about the neural network used in Section 3.2.4. To improve model performance, we propose optimization algorithms and activation functions considering verification time.
  3. In section 4.2, a fixed vector size of 1000 was taken. However, I would have expected a discussion on how this fixed size was selected to improve malware identification and analysis.
  • Section 3.2.3 shows how the fixed size was chosen. The average number of strings extracted from Linux binaries is 1,800. We conducted experiments in a variety of ways to find the appropriate fixed size. As a result, the model performed well when the fixed size was set to 1,000.
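The idea of a fixed-size, count-based vector discussed in the last response can be illustrated with a minimal sketch. It assumes a hashing-trick mapping from each extracted string to one of 1,000 buckets; this record does not state how the authors assign strings to vector positions, so the function `vectorize_strings` and the MD5-based bucket choice are hypothetical.

```python
import hashlib


def vectorize_strings(strings: list[str], size: int = 1000) -> list[int]:
    """Map a variable-length list of strings to a fixed-length count vector.

    Assumption: each string is hashed to a bucket index and the bucket count
    is incremented. A stable hash (MD5 here) is used instead of Python's
    built-in hash(), which is salted per process and would not be reproducible.
    """
    vector = [0] * size
    for text in strings:
        digest = hashlib.md5(text.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % size
        vector[index] += 1
    return vector


# Made-up example strings; a real sample would yield many more.
features = vectorize_strings(["/bin/busybox", "GET /index.html", "/bin/busybox"])
print(len(features), sum(features))  # 1000 3
```

With roughly 1,800 strings per Linux binary on average, a 1,000-slot vector of this kind necessarily merges some strings into shared buckets, which is one reason the choice of size is worth justifying experimentally.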

 

# Minor Comments:

- page 2, line -59 :: Spacing problem.

  • The spacing issue was resolved by the changes to Section 1.1.

- page 2, line -73,75,77,79,81,83 :: Full stop is required on sentence completion.

  • Added a full stop to lines 94, 96, 98, 100, 102, 104 on page 2.

- page 3, line 100,103,106,109,112 :: Full stop is required on sentence completion.

  • Added a full stop to lines 121, 124, 127, 130, 133 on page 2.

- page 4, line 139 :: “Chapter 1.2 ” -> Section 1.2 .

  • We changed the content on line 187 on page 5.

- page 6, lines 198 to 201 :: These lines are same as lines (202 to 206).

  • Deleted lines 202 to 206 of the existing document.

- page 7, line 246 :: “imporove”-> improve.

  • Corrected to be "improve" in line 294 on page 9.

- page 8, line 255 :: “However, Most”-> However, most .

  • Corrected to be "However, most" in line 304 on page 9.

- page 9, line 268 :: “unmber”-> number.

  • Corrected to be "number" in line 322 on page 10.

- page 9, line 295 :: No mention of PCA full form in the paper.

  • In this paper, PCA, a dimensional reduction algorithm, is used to represent the distribution of data. References to the PCA are lines 378-382 on page 11.

- page 10 , line 307  :: “This chapter”->This section.

  • Corrected to be "This section" in line 396 on page 13.

- page 11 , line 325  :: “self-collection datasets”->self-collected datasets.

  • Corrected to be "self-collected datasets" in line 413 on page 13.

Reviewer 2 Report

This report reviews the manuscript submitted by Chan-Woong Hwang et al., "Platform-independent Malware Analysis applicable to Windows and Linux environments". The manuscript is very precise and is composed of 14 pages, 18 figures, 9 tables and 38 references.

Authors analyzed a very interesting topic that is related to Linux-based malware analysis in the context of IoT/Embedded environment. I guess this manuscript will be interesting for the informatics and cyber security scientific community. However, prior to accepting this manuscript, I suggest to authors to review the document thoroughly considering the below comments:

  1. Lines 10-13 can be rewritten; the information provided does not convincingly present the problem statement.
  2. In line 15, the IoT/Embedded environment can be further explained.
  3. In the introduction, a brief account of Windows malware attacks and their consequences should be provided in much more detail.
  4. Sections 1.3.1 and 1.3.2 must be improved further. The contributions are not clear.
  5. In the introduction, the authors did not draw on recent literature or on developments in Linux-based systems and technology evolution.
  6. In lines 170 to 175, the classifier information is not clear. The author(s) should explain how the classification techniques are applied to the considered data.
  7. Fig. 1 (a) and (b) are not clear. The embedded text and font are not visible. These figures must be replaced with high-resolution ones.
  8. Figs 2 to 6 should be replaced with high-resolution pictures.
  9. How are AI and IoT integrated with the malware platform? This needs a much more detailed explanation.
  10. How are strings with a length of 5 or more considered to reduce noise? This should be made clear in line no. 263.
  11. Fig. 7 must be replaced with a high-resolution picture.
  12. Figs 8 to 18 should be rearranged and redrawn, improving the resolution, text visibility, etc. The axes are also not clear.
  13. The conclusions seem to contain trivial points. Try splitting the conclusion section into bullet points.

Author Response

I received your kind comments.

Below, we describe the revisions made in response to your comments.

We used the "Track Changes" function to make it easy to show changes to editors. The default value of the “Track Changes” function written on the comments response cover is based on the revised manuscript. Corrected captions for pictures, tables and references. Also, the titles of Sections 1.2, 1.3.1, and 1.3.2 were corrected.

I modified the name :: "Chan-Woong Hwang"-> "Chanwoong Hwang"

This cover letter explains, step by step, the answers to the reviewer's comments.

 

  1. Lines 10-13 can be rewritten; the information provided does not convincingly present the problem statement.
  • The contents were rewritten to fit the titles of lines 10 to 15.
  2. In line 15, the IoT/Embedded environment can be further explained.
  • Additional explanation of the IoT/embedded environment was added to Section 1.1 (lines 39 to 51).
  3. In the introduction, a brief account of Windows malware attacks and their consequences should be provided in much more detail.
  • In Section 1.1, we added content on the latest malware and ransomware and their consequences (lines 29 to 38).
  4. Sections 1.3.1 and 1.3.2 must be improved further. The contributions are not clear.
  • Added clear titles and content for Sections 1.3.1 and 1.3.2.
  5. In the introduction, the authors did not draw on recent literature or on developments in Linux-based systems and technology evolution.
  • We revised the title of Section 1.2 and added the association of IoT with Linux-based systems and the major challenges of IoT.
  6. In lines 170 to 175, the classifier information is not clear. The author(s) should explain how the classification techniques are applied to the considered data.
  • Lines 221 to 231 on page 6 describe the classifier information and how it was applied to the data. The author proposes a function length frequency and a printable string information method.
  7. Fig. 1 (a) and (b) are not clear. The embedded text and font are not visible. These figures must be replaced with high-resolution ones.
  • Fig. 1 (a) and (b) were reinserted in high resolution.
  8. Figs 2 to 6 should be replaced with high-resolution pictures.
  • Adjusted the size and spacing of Figs 2 to 6.
  9. How are AI and IoT integrated with the malware platform? This needs a much more detailed explanation.
  • As mentioned in the introduction, we think the difficulties can be overcome if AI operates independently of the platform in automation and IoT environments. IoT devices mostly run Linux and span various architectures. However, in Linux malware analysis it is difficult to identify the architecture on which the malware operates. Identifying the architecture expands the direction of malware research. As a step preceding architecture research, this paper introduces malware analysis technology that is independent of the operating system or platform.
  10. How are strings with a length of 5 or more considered to reduce noise? This should be made clear in line no. 263.
  • Only strings with a length of 5 or more are used in this paper. The reason is given on page 9, lines 315 to 319.
  11. Fig. 7 must be replaced with a high-resolution picture.
  • Fig. 7 was reinserted in high resolution.
  12. Figs 8 to 18 should be rearranged and redrawn, improving the resolution, text visibility, etc. The axes are also not clear.
  • The text in Figures 8 to 18 has been improved and the figures rearranged, taking the axes into account.
  13. The conclusions seem to contain trivial points. Try splitting the conclusion section into bullet points.
  • In the conclusion, the purpose of the study and the proposed approach are described, and the text was then revised based on the experimental results. This study recognizes the need to identify the various architectures on which Linux malware runs, and proposes a platform-independent malware analysis technology as a step before researching the architecture.

Round 2

Reviewer 2 Report

Author(s) have addressed the given comments. Manuscript can be published in its current form.
