Peer-Review Record

Malware Detection Using Deep Learning and Correlation-Based Feature Selection

Symmetry 2023, 15(1), 123; https://doi.org/10.3390/sym15010123
by Esraa Saleh Alomari 1,*, Riyadh Rahef Nuiaa 1, Zaid Abdi Alkareem Alyasseri 2,3,4,*, Husam Jasim Mohammed 5, Nor Samsiah Sani 6,*, Mohd Isrul Esa 6 and Bashaer Abbuod Musawi 7
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Submission received: 23 November 2022 / Revised: 14 December 2022 / Accepted: 26 December 2022 / Published: 1 January 2023
(This article belongs to the Special Issue Symmetry Applied in Privacy and Security for Big Data Analytics)

Round 1

Reviewer 1 Report

The abstract does not clearly state the main goal of the paper: what do the authors want to obtain by using those datasets? There is also no comparison with other methods. Only in the conclusion is it finally said: "The main purpose of this difference was to evaluate the effect of feature selection on the performance of low-dimension and high-dimension based datasets.". And the comparison is only a table, with no discussion.

It is necessary to discuss your results in comparison with the others presented in Table 8, and to establish the differences between your study and the other studies.

Some other remarks:

When datasets are presented, nothing is said about the existence of equal or similar attributes.

Figures 3 and 4, legend missing "epochs".

Table 6 refers to the second dataset, while the text before refers to the first one.

Author Response

Thank you for your comments. Please find the list of responses in the attached file.

Author Response File: Author Response.docx

Reviewer 2 Report

Dear Authors,

This work presents a very interesting approach to malware detection using neural networks, together with a comparison of results between the current research and previous studies. However, to raise the value of the article, please pay attention to the following issues and make every effort to improve and supplement the article.

  1. Page 2 – “Many ML algorithms have been used, like support vector machines (SVM) [16][17][18], K-nearest neighbor (KNN) [19][20], Bayesian estimation [21][22], genetic algorithms [23], etc., in order to build malware.”

As written, the sentence implies that ML algorithms are used to build malware, which does not reflect current research. In this context, ML algorithms are intended for malware detection. Please correct the sentence so that it states that the purpose of the ML algorithms is the detection of malware.

  2. Page 3 – “Some of the previous studies used ML approaches while others applied DL techniques.”

What are DL techniques? Please explain DL techniques or supply an appropriate reference.

  3. Section 2.2.1. Dense Layers model

Please explain in detail how you determined that 50 neurons for the first dataset scenarios and 100 neurons for the second dataset scenarios are the right numbers for the initial layer. Why didn't you decide to work with 49 neurons in the initial layer, for example? Also, describe the decision to choose 5 hidden layers with 50 neurons for the rest of the neural network architecture.

  4. Page 9 – “In our study, for the first malware dataset, we will apply experiments based on four different scenarios:

1- First selected group (Dropping the following columns): 'vm_truncate_count', 'shared_vm', 'exec_vm', 'nvcsw', 'maj_flt', and 'utime', getting in only 27 attributes (since 'classification' and 'Hash' columns are already removed.

2- Second selected group (Dropping the following columns): 'vm_truncate_count', 'shared_vm', 'exec_vm', 'nvcsw', 'maj_flt', 'utime', 'static_prio', 'map_count', and 'end_data', getting in only 24 attributes.

3- Third selected group (Dropping the following columns): 'vm_truncate_count', 'shared_vm', 'exec_vm', 'nvcsw', 'maj_flt', 'utime', 'static_prio', 'map_count', 'end_data', 'nivcsw', 'fs_excl_counter', and 'reserved_vm', getting in only 21 attributes.

4- Forth selected group (Dropping the following columns): 'vm_truncate_count', 'shared_vm', 'exec_vm', 'nvcsw', 'maj_flt', 'utime', 'static_prio', 'map_count', 'end_data', 'nivcsw', 'fs_excl_counter', 'reserved_vm', 'mm_users', 'state', 'total_vm', 'free_area_cache', 'stime', 'gtime', and 'millisecond' getting in only 14 attributes.”

Please include an appropriate explanation in the text to justify which columns are dropped in each scenario.

  5. Page 9 – "…, we followed a feature selection technique based on various correlation thresholds."

Please add an appropriate reference for this claim.

  6. Figure 3 – Please add the name of the X-axis to each of the 12 charts.

Figure 4 – Please add the name of the X-axis to each of the 10 charts.

  7. Tables 6 and 7 – It is not clear what the focus of each table is: the training dataset or the test dataset.

Please state in each table title which dataset the table refers to.

Kind regards
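
The column-dropping scenarios quoted above from page 9 could be made reproducible with a short sketch like the following. It is only an illustration under stated assumptions: the record does not give the authors' actual rule, so the sketch assumes that a feature is kept when the absolute Pearson correlation between it and the target reaches a chosen threshold. The function names `pearson` and `select_features`, the feature names, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column carries no signal, so report 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, threshold):
    """Keep the columns whose |correlation with the target| >= threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy data, invented for illustration only (not from the paper's datasets).
columns = {
    "feat_a": [1, 2, 3, 4, 5],  # rises with the label
    "feat_b": [5, 1, 4, 2, 3],  # unrelated to the label
    "feat_c": [3, 3, 3, 3, 3],  # constant column
}
target = [0, 0, 1, 1, 1]  # binary label, e.g. 0 = benign, 1 = malware

kept = select_features(columns, target, threshold=0.5)
print(kept)
```

Raising the threshold keeps fewer attributes, which is one plausible way to obtain a sequence of progressively smaller feature groups such as the four scenarios listed in the quotation.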

Author Response

Thank you for your comments. Please find the list of responses in the attached file.

Author Response File: Author Response.docx

Reviewer 3 Report

In this paper, the authors applied correlation-based feature selection and dense and LSTM-based deep learning models for malware detection.

Several important issues need to be addressed to improve the quality of the paper:

1-  English proofreading is required, since many phrases contain language mistakes.

2-  In Section Abstract: the abstract is neither well written nor well organized. The authors should describe the problem faced in the existing works, the proposed methodology, the dataset, and then the experimental results.

3-  In Section Introduction:

- The sentences do not follow a logical sequence; each sentence is unrelated to the previous one.

- The sentences describe the importance of using machine learning techniques in malware detection. However, the motivations for using deep learning and correlation-based feature selection are missing.

- The paper's contributions are missing.

4-  In Section 2.2. Proposed Methodology: the authors failed to present the details of the proposed method.

- A subsection on feature selection based on correlations is missing, although it is the main contribution of the study.

- The LSTM model and the Dense Layers model were used in many previous works [3][26][27][36]. It is important to explain the differences between the proposed work and the previous works.

- In subsection 2.2.3. Evaluation Criteria: TP, TN, FP and FN are not ratios but numbers. Please revise the definition of these terms.

5-  In Section Results and Discussion:

- It is recommended that the authors use 10-fold or 5-fold cross-validation instead of hold-out validation.

- In subsection 3.5. Experimental results, the authors are expected to compare correlation-based feature selection with other feature selection methods used in the literature. Furthermore, the authors should also compare the LSTM model and the Dense Layers model with other common machine learning techniques on the same datasets used in this article.

- Table 8 is mistakenly captioned as Table 7.
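
Two of the points above, that TP, TN, FP and FN are counts rather than ratios and that k-fold cross-validation can replace a single hold-out split, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper names (`confusion_counts`, `metrics`, `k_fold_indices`), the unshuffled contiguous folds, and the toy labels are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) as integer counts for binary labels 0/1."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Derive ratios (accuracy, precision, recall) from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy labels and predictions, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts, metrics(*counts))

# Each sample lands in the test fold exactly once across the k folds.
for train_idx, test_idx in k_fold_indices(len(y_true), k=5):
    print(test_idx)
```

In a real experiment the model would be retrained on each `train_idx` and scored on the matching `test_idx`, with the per-fold metrics averaged; shuffling or stratifying the folds would usually be advisable for an ordered or imbalanced dataset.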

Author Response

Thank you for your comments. Please find the list of responses in the attached file.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

The paper has improved after addressing most of the comments.

I have a few suggestions to further improve the quality of the paper:

- The contribution of the paper (written at the end of the related work section) should be moved to the end of the Introduction Section. At the end of the related work section, the authors should discuss the advantages of the proposed method over the existing works.

- The title of Section “2.2.1. Correlation-based feature selection proposed methodology” can be revised to “2.2.1. Correlation-based feature selection”.

- In Section 2.2.1. Correlation-based feature selection: “For the first dataset, the correlations between all columns and the target column are computed”. It is important to explain how the correlations are computed, using an equation or an algorithm.

- In Section 4. Conclusion, the authors are expected to add other limitations related to the slow performance of deep learning, which makes it difficult to apply in Android and mobile environments.
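
As an illustration of the kind of equation requested for Section 2.2.1, one common choice is the Pearson correlation coefficient between a feature column and the target column. The record does not say which coefficient the authors use, so this formula is an assumption; the symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ and $n$ are introduced here and do not come from the paper.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here $x_i$ is the value of the feature for sample $i$, $y_i$ is the target label of that sample, $\bar{x}$ and $\bar{y}$ are the respective means over the $n$ samples, and $r_{xy}$ lies in $[-1, 1]$.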

Author Response

Thank you for your comments, please find in the attached file the list of responses.

Author Response File: Author Response.docx
