Article
Peer-Review Record

Multi-Language Spam/Phishing Classification by Email Body Text: Toward Automated Security Incident Investigation

Electronics 2021, 10(6), 668; https://doi.org/10.3390/electronics10060668
by Justinas Rastenis 1,*, Simona Ramanauskaitė 2, Ivan Suzdalev 3, Kornelija Tunaitytė 3, Justinas Janulevičius 1 and Antanas Čenys 1
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 15 January 2021 / Revised: 5 March 2021 / Accepted: 7 March 2021 / Published: 12 March 2021
(This article belongs to the Special Issue Cybersecurity and Data Science)

Round 1

Reviewer 1 Report

The authors present a study of classification performance on the problem of sorting unsolicited/disturbing e-mails into the categories of spam and phishing. No previous studies have addressed this particular problem, although the problem of separating legitimate emails from spam and phishing emails has been addressed. The study is clearly written and easy to understand. However, I feel the scientific contribution is quite weak and some issues should be fixed before considering its publication.

+ In my opinion, there is no advantage in solving this problem. Obviously, these personal opinions are not decisive, but I think the authors should include some text that makes the reader aware of the usefulness of solving this problem. This text should be included in the introduction, and the justification should indicate where a process, people's digital lives, etc. can be improved. It is essential that the relevance of the research be justified in any study. In other words: this problem has not been studied before. Is this fact connected to its having little or no usefulness?

+ In my opinion, there is no proposed method. This work applies the classical approach based on a bag-of-words representation combined with popular machine learning classification schemes. Section 3 contains an experimental protocol based on the usual text-classification workflow (text cleaning, tokenisation, attribute selection and classification). The authors also fix some parameters (3100 attributes maximum) and the dataset. But this is an experimental protocol; there is no new method. Of course, this does not mean that the study is not valid, but the current section titles and contents may lead to confusion.
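
For illustration, the classical workflow described here could be reproduced with a minimal scikit-learn sketch along these lines; the file name, column names, and classifier choice are hypothetical placeholders, not taken from the manuscript:

```python
# Minimal sketch of the classical text-classification protocol the review
# describes: cleaning/tokenisation, bag-of-words attributes, classification.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

df = pd.read_csv("emails.csv")  # assumed columns: "body", "label" in {"spam", "phishing"}

pipeline = Pipeline([
    # Tokenisation plus bag-of-words representation, capped at 3100 attributes
    # to mirror the limit mentioned in the review.
    ("bow", CountVectorizer(lowercase=True, max_features=3100)),
    ("clf", MultinomialNB()),
])

X_train, X_test, y_train, y_test = train_test_split(
    df["body"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)
pipeline.fit(X_train, y_train)
print("accuracy:", pipeline.score(X_test, y_test))
```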

+ Spam and phishing should be easier to distinguish. Spam messages usually contain textual tricks and images to avoid spam filters, whereas phishing methods involve imitating the corporate image of companies, and their texts are usually free of typos. I believe feature engineering for the problem addressed should contain features focused on these ideas; the use of raw tokens does not seem to be a good idea. The results in Table 4 also support this conclusion (best accuracy around 85%, best F-score around 82%), and the areas under the ROC curves do not seem very good either. I believe the study could keep the current performance evaluations but incorporate ideas such as those in this paragraph as future work.
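
To make the suggestion concrete, hand-crafted features of this kind might look like the sketch below; the feature definitions and the brand list are purely hypothetical illustrations, not the authors' features:

```python
import re

BRANDS = {"paypal", "amazon", "bank", "microsoft"}  # illustrative list only

def engineered_features(body_text: str, body_html: str = "") -> dict:
    """Crude spam-vs-phishing indicators: spam tends to use textual tricks
    and embedded images, phishing tends to imitate corporate mail."""
    words = re.findall(r"[a-z]+", body_text.lower())
    return {
        # obfuscated tokens mixing letters and digits, e.g. "v1agra"
        "obfuscated_tokens": len(
            re.findall(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b", body_text.lower())
        ),
        "image_count": body_html.lower().count("<img"),
        "link_count": body_html.lower().count("<a "),
        "brand_mentions": sum(w in BRANDS for w in words),
        "uppercase_ratio": sum(c.isupper() for c in body_text) / max(len(body_text), 1),
    }
```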

+ A reference to the tools used for executing the experiments would be necessary. Which classifier implementations did the authors use: Weka, R caret, Python scikit-learn, ...?

+ How were the parameters of the classifiers optimised? Which software was used for the optimisation? Did you use manual optimisation?
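
For instance, a non-manual optimisation could be performed with a grid search; the sketch below assumes scikit-learn, reuses the X_train/y_train split from the earlier sketch, and uses a purely illustrative parameter grid:

```python
# Grid-search sketch for classifier parameter optimisation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

search = GridSearchCV(
    Pipeline([("bow", CountVectorizer(max_features=3100)), ("clf", SVC())]),
    param_grid={"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)  # raw email bodies; the pipeline vectorises them
print(search.best_params_, search.best_score_)
```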

+ Which is the positive class for the ROC analysis? Please detail how this analysis was carried out. The areas under the curves would be interesting for a better understanding of the results.
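
As a sketch of how such an analysis could be reported, assuming scikit-learn and the fitted pipeline from the first example, with "phishing" arbitrarily chosen as the positive class:

```python
# ROC analysis with an explicit positive class and the corresponding AUC.
from sklearn.metrics import roc_curve, roc_auc_score

pos = list(pipeline.classes_).index("phishing")   # declare the positive class
probs = pipeline.predict_proba(X_test)[:, pos]
fpr, tpr, thresholds = roc_curve(y_test, probs, pos_label="phishing")
print("AUC:", roc_auc_score(y_test == "phishing", probs))
```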

+ I could not download the Nazario repository. It seems that the provided link is broken. Could you provide another one?

+ The dataset (included in the data availability statement) is an 870 KB CSV file provided via an external link. I would recommend the authors provide it as supplementary material so that it is published as part of the study. The use of external links may lead to broken-link problems.

+ Complement the conclusions section with future work.

+ The text near Table 1 is not clear. Table 1 appears broken into three parts; I believe parts of this table overlap the text, which makes some parts of the work difficult to understand. This is a minor observation but should also be considered.

 

Author Response

Thank you very much for the reviews. They provided recommendations for improving the paper and advice on how it could be done. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors present the manuscript titled "Multi-language SPAM/Phishing Classification by Email Body Text", which addresses the problem of classifying unsolicited emails into spam and phishing categories. The study is easy to understand; however, the manuscript does not clarify the need for solving this problem. In my humble opinion, I do not see the need (or advantage) of identifying the type of spam messages. From a practical perspective, the main purpose of a common e-mail user lies in the proper and precise classification of unsolicited emails regardless of their type (spam/phishing). I only see an application of this study from a statistical perspective, concretely to analyse the evolution of each type of unsolicited e-mail (such as identifying increasing or decreasing trends). Additionally, I found several typos and formatting errors (Table 1 is included three times in the generated PDF). To this end, I think the manuscript is too weak to be suitable for publication in the journal. Following are additional questions that should be addressed before considering its publication.

Question 1. The “introduction” is quite weak. I think this section should motivate the real importance of the work. As I commented above, after reading this section I do not feel the need to address the described problem.  

Question 2. The "related work" section summarises the state of the art in spam filtering. To this end, the authors divide the section into two well-identified subsections: (i) e-mail preprocessing and (ii) e-mail classification solutions. Although the division is appropriate, the first one focuses on particular use cases instead of summarising the most-used feature selection/extraction techniques in the spam filtering domain (doi: 10.1007/978-3-030-24051-6_86) or even the latest approaches using semantic-based knowledge (doi: 10.1016/j.asoc.2018.12.008, 10.1007/978-3-319-48308-5_6).

On the other hand, the "email classification solutions" subsection contains some confusing issues. In particular, the first sentence refers to rule-based systems, but reference [28] is linked to an article focused on scheduling algorithms to improve the performance of rule-based systems. Browsing through the article, I found Wirebrush4SPAM (doi: 10.1002/spe.2135), the framework used in this manuscript; I suggest changing the references. Additionally, I miss any citation of SpamAssassin, since it is the reference in the rule-based spam filtering domain and the inspiration for other, improved alternatives (such as Wirebrush4SPAM).

Another confusing issue is the assumption that "rule-based methods require continuous support and update". One of the main problems of spam filtering is concept drift. This fact forces the continuous update of spam filtering models to avoid a reduction in filtering performance, and several techniques have been developed to mitigate it (such as the use of multi-objective optimisation algorithms to optimise the score of each rule). Additionally, rule-based filtering systems such as Wirebrush4SPAM or SpamAssassin contain rules that trigger a Naive Bayes (ML) model. Therefore, the sentence "machine learning models take over" is quite false or incorrect.

Question 3. In Section 3, the authors define the experimental protocol carried out in the manuscript. In my opinion, corpora from different dates should not be joined, in order to mitigate concept drift. Some topics that were spam in 2017 (Nazario dataset) were not spam in 2004 (SpamAssassin corpus). A clear example of this can be seen in the emergence of spam e-mails selling vaccines to immunise against COVID; this type of message did not exist in 2017 or 2004. So, joining information from widely separated dates is, in my opinion, not a good way to perform realistic experiments.

Question 4. In Subsection 3.1, the authors mention that they obtained 31000 attributes after performing the preprocessing stage. I doubt all of these attributes are useful. Why do the authors not assess the relevance of the attributes to see whether all of them are necessary? Particularly when they mention in Section 2 the existence of scientific contributions in the spam filtering domain using a limited number of features (such as 22 or 9).
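
One simple way to assess attribute relevance, as a hypothetical illustration with scikit-learn, is a chi-squared ranking over the bag-of-words matrix (the data frame names are reused from the earlier sketches, and k is illustrative):

```python
# Rank bag-of-words attributes by chi-squared score and keep only the top k.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectoriser = CountVectorizer()
X_vec = vectoriser.fit_transform(df["body"])      # on the order of 31000 token attributes
selector = SelectKBest(chi2, k=500)               # keep the 500 highest-scoring tokens
X_reduced = selector.fit_transform(X_vec, df["label"])
print(X_vec.shape, "->", X_reduced.shape)
```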

Question 5. In Subsection 3.2, the authors include a figure (Figure 1) explaining the phases of the research methodology carried out. I think figures are very important, since they help to better understand the concepts being explained. However, in this case, I do not think the figure contributes to clarifying the experimental protocol. I suggest changing the structure of the figure to facilitate the identification of each phase (such as order, the dependence between stages, if any, etc.).

Question 6. The "research results" subsection contains Table 4, which includes the training and classification times. In my opinion, training time is not the main problem, since (i) in a static scenario the models are trained only once, and (ii) in a dynamic scenario, model retraining should be executed in the background and at off-peak times. However, if the authors include times, I think it is important to add the hardware specifications to have a baseline.

Question 7. Continuing with the "research and results" subsection, which tools did the authors use for executing the experiments?

Question 8. The conclusions are a summarised version of the experimental protocol. I suggest including a future work paragraph to identify the a-posteriori potential of the scientific contribution.

Question 9. Typos found:

line 105: text-repocessing -> text-preprocessing

line 114: role-based -> rule-based

line 121: Table 1 appears three times in PDF version of the manuscript.

line 207: investigate is the -> investigate if the

line 220: error: reference not found

Author Response

Thank you very much for the reviews. They provided recommendations for improving the paper and advice on how it could be done. Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have addressed some of the questions raised in the previous revision round.

+ They successfully included a motivation for solving the target problem (which is absolutely necessary to understand the value of their contribution).

+ They argue in the responses that they are introducing a method. I would like the authors to read some reviews about text classification and spam filtering. A quick glance at a research paper such as the one with DOI 10.3390/info10040150 (of course, I am not an author of that work) clearly shows that there is no contribution with regard to the method they use: it is the same method as the ones reviewed in the mentioned DOI. For me, the authors are contributing an experimental review of common text classification methods applied to the specific problem of discerning commercial spam from phishing. The augmentation of datasets with new texts generated by translation (or other mechanisms, such as those designed to solve the unbalanced-data problem) is, for me, part of the experimental protocol; in fact, these mechanisms are introduced only to increase the significance of the results. This question has not been addressed in the right form.

+ The authors have also successfully attended to my recommendation of providing new and fresh ideas for future work.

+ I also value the inclusion of specific details about the framework used for testing, the computer, and the way in which the models were executed. However, these should be included before presenting the results (at the beginning of Subsection 3.3, not after Table 4).

+ The positive class for the ROC analysis should be explained a little earlier (before Table 4), because I assume it is the same one used to compute recall/precision. The AUC (Area Under the Curve) is not provided, which complicates understanding Figure 4. Why not provide it in Table 4?

+ Thank you for providing the right link for the Nazario dataset, to upload the dataset used as additional file, and to solve visualisation issues of table 1.

ADDITIONAL ISSUES FOUND:

+ The training/scoring time results seem confusing. Training is an especially hard task: an SVM needs to find a transformation of the input space that places phishing and commercial messages in separate geometric regions. This is intuitively more complex than applying the transformation (testing). However, the reported classification stage takes more time than training. I would encourage the authors to examine these results carefully.
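
A quick way to double-check such numbers is to time the fit and predict stages separately; the sketch below assumes the pipeline and data splits from the earlier examples:

```python
# Measure training and classification times independently.
import time

t0 = time.perf_counter()
pipeline.fit(X_train, y_train)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
pipeline.predict(X_test)
classify_time = time.perf_counter() - t0

print(f"train: {train_time:.3f} s, classify: {classify_time:.3f} s")
```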

+ In Table 4, Decision Trees seem not to work. In the context of binary classification, when accuracy is near 0.5, the method is only as effective as flipping a coin without examining the message. I would consider eliminating this result or explaining why it occurs.

+ The title of section 4 should be "Conclusions and future work" (without s).

+ Please avoid the usage of numbered/bulleted lists. Their use makes the manuscript look like an outline and not really a journal article.

+ Please note that the corpus name SpamAssassin is sometimes misspelled (SpanAssassin).

Author Response

Thank you once again for the reviews and all the recommendations and advice on how to improve our paper. You will find all the comments in the attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors present the manuscript titled "Multi-language SPAM/Phishing Classification by Email Body Text", which presents the challenge of classifying unsolicited emails into spam and phishing categories. The authors did not address the majority of my previous suggestions, so my opinion concerning the paper is the same as in the first revision. Moreover, they omitted my suggestion of checking the full manuscript to fix the typo errors (they only fixed the ones I highlighted in the previous revision). Additionally, I found new typos in this new version of the manuscript and some inconsistent sentences due to the use of Word track changes. To this end, my opinion about the quality of the paper is the same as for the previous version. Following are the questions that should be addressed before considering its publication.

Question 1. In the Introduction section, the authors added a new paragraph (lines 66-69) outlining their proposal. I am particularly sceptical concerning the appropriateness of the sentence "it presents a combination of existing methods, ..., adapted and combined for Spam/Phishing email classification". In my opinion, this sentence is quite "optimistic", since the authors apply a common and well-known workflow to classify textual contents. In fact, data preparation and text classification stages are mandatory to address any text-mining problem.

Question 2. In the Email Dataset Preparation subsection, the authors explain the methodology used to prepare the dataset. In my opinion, this subsection could be improved by replacing numbered lists with interconnected paragraphs. This will give continuity to the explanation and keep the reader's attention. 

Question 3. Table 2 shows the number of spam/phishing emails in each dataset. However, I see an inconsistency in the information included in the table. In particular, Table 2 indicates that SpamAssassin contains 874 spam emails, while the README file included in the SpamAssassin corpus indicates the existence of 1874 spam emails. Why did the authors decide to use only 874 emails? The data preparation subsection does not include any information concerning this decision.

Question 4. I appreciate the change made to Figure 1. I think this figure facilitates comprehension of the experimental protocol carried out. However, I miss some connections between the explanation and the figure. The figure should support the text (and facilitate its comprehension); however, I found only one reference to the figure in the whole explanation of the experimental protocol. I encourage the authors to link the most important tasks in the text with their respective stages in the figure.

Question 5. Regarding the "research results" subsection, I found a striking result in lines 249-250. In particular, for SVM the time required for training is expected to be high relative to classification. However, in the experiments carried out in the paper, classification is performed more than 15 times slower than training.

Question 6. I am not sure that the ROC curve is relevant in the manuscript. Taking into account the results included in Table 4, it is easy to see the behaviour of the models.

Question 7. In lines 273-274, the authors outline the use of a 5-fold cross-validation methodology. I think it is very important to clearly motivate and justify the use of 5-fold cross-validation instead of the standard one (10-fold cross-validation). In my opinion, the authors should consider using 10-fold cross-validation or a leave-one-out methodology.
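
For reference, switching to the standard protocol is a one-line change in scikit-learn; this sketch assumes the pipeline and data frame from the earlier examples:

```python
# 10-fold cross-validation over the full corpus instead of 5-fold.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, df["body"], df["label"], cv=10, scoring="f1_macro")
print(f"10-fold macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```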

Question 8. Some typos found (please revise the full manuscript):

line 46: automat-ed -> automated

line 49: ad-vertising-> advertising

line 51: tha -> that

line 52: email -> emails

line 62: classifi-cation -> classification

line 89: format -> formats,

line 158: Table 1: SpanAssassin -> SpamAssassin (in the last two rows)

......

Author Response

Thank you once again for the reviews and all the recommendations and advice on how to improve our paper. You will find all the comments in the attached file.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

The authors have successfully addressed all my comments. Nice work! Therefore, my recommendation is to accept the current version of the manuscript for publication in the Electronics journal. Congratulations!

Author Response

Thank you once again for the reviews and all the recommendations.

Reviewer 2 Report

The article has been significantly improved in several areas. However, I still have some suggestions/considerations that should be taken into account before considering the publication of the manuscript in the Electronics Journal.

Question 1. In my opinion, two minor issues should be improved in the introduction section. The first refers to a formatting issue where the numbered list included in lines 58-61 should be converted to an inline list. For example: ".... To achieve this goal two main research questions are raised: (i) are existing ...... and (ii) do spam .....". The second relates (again) to the sentence highlighted in yellow in lines 67-69. In my opinion, one correct option is "... however it presents research for SPAM/Phishing email following the steps comprising a common classification workflow (data preparation, text augmentation, text classification)". This suggestion aims to avoid using the term "methods" in the sentence, since data preparation, text augmentation, and text classification are not methods per se but different stages used to address a typical classification problem.

Question 2. Regarding Subsection 3.1 (Email Dataset Preparation), I encourage the authors to change the numbered list (located in lines 170-177) to an inline list (following the same example as explained in the previous question). Additionally, Table 2 and Table 3 are very similar, since they are used to represent the status of the corpus before and after applying the preprocessing techniques. To this end, I suggest the authors join both tables into a single one. This would (i) reduce the size of the manuscript and (ii) display a compact view that facilitates the visualisation of the dataset size before and after applying the preprocessing techniques explained in the section.

Question 3. I have some suggestions regarding Subsection 3.2. First of all, it is recommended to refine the paragraph belonging to lines 229-232. I suggest something like "In the first stage Naïve Bayes, Generalized Linear Model, ...., and Support Vector Machines methods were selected for the automatic identification of SPAM/Phishing emails". Moreover, line 239 includes a reference error message with Table 4. Please be careful: this is the second time you have made this mistake (also in V1). I always recommend performing a preliminary review after generating the PDF file to avoid these issues.

Author Response

Thank you once again for the reviews and all the recommendations and advice on how to improve our paper.

Question 1. In my opinion, two minor issues should be improved in the introduction section. The first refers to a formatting issue where the numbered list included in lines 58-61 should be converted to an inline list. For example: ".... To achieve this goal two main research questions are raised: (i) are existing ...... and (ii) do spam .....".

It was changed.

The second relates (again) to the sentence highlighted in yellow in lines 67-69. In my opinion, one correct option is "... however it presents research for SPAM/Phishing email following the steps comprising a common classification workflow (data preparation, text augmentation, text classification)". This suggestion aims to avoid using the term "methods" in the sentence, since data preparation, text augmentation, and text classification are not methods per se but different stages used to address a typical classification problem.

Thank you for the suggestion. We agree with the observation and have changed it.

Question 2. Regarding Subsection 3.1 (Email Dataset Preparation), I encourage the authors to change the numbered list (located in lines 170-177) to an inline list (following the same example as explained in the previous question).

The list was changed to an inline one, with additional explanation after the inline list.

Additionally, Table 2 and Table 3 are very similar, since they are used to represent the status of the corpus before and after applying the preprocessing techniques. To this end, I suggest the authors join both tables into a single one. This would (i) reduce the size of the manuscript and (ii) display a compact view that facilitates the visualisation of the dataset size before and after applying the preprocessing techniques explained in the section.

It is a good point; therefore, we combined both tables into one.

Question 3. I have some suggestions regarding subsection 3.2. First of all, it is recommended to refine the paragraph belonging to lines 229-232. I suggest something like “In the first stage Naïve Bayes, Generalized Linear Model, ...., and Support Vector Machines methods were selected for the automatic identification of SPAM/Phishing emails”.

Thank you, the text was changed as you suggested.

Moreover, line 239 includes a reference error message with Table 4. Please be careful: this is the second time you have made this mistake (also in V1). I always recommend performing a preliminary review after generating the PDF file to avoid these issues.

Reference problems were corrected.

Author Response File: Author Response.pdf
