Open Access Article

Peer-Review Record

A Character String-Based Stemming for Morphologically Derivative Languages

Information 2022, 13(4), 170; https://doi.org/10.3390/info13040170

by Gvzelnur Imin, Mijit Ablimit *, Hankiz Yilahun and Askar Hamdulla

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 7 March 2022 / Revised: 17 March 2022 / Accepted: 25 March 2022 / Published: 28 March 2022

Round 1

Reviewer 1 Report

The article proposes an interesting approach to stemming. However, I recommend that the authors make some changes and improvements before publication.

Content and Structure Changes

In the Abstract, state specifically which languages your work addresses (Uyghur, Kazakh and Kirghiz). I realize that the authors refer to these languages as derivative languages, but this is not enough. Line 19 of the Abstract should be amended to state "... two different data sets of three derivative languages: Uyghur, Kazakh and Kirghiz". In this way, it is immediately clear which languages the paper relates to.

Although the Introduction is good and provides sufficient background, some of the material on Page 2 should be moved to Section 3 (Methods). In fact, all the text starting in Line 45 ("Table 1 below is a comparison...") and finishing at the end of the Introduction must be moved to Section 3 (Methods), where it should form a new Subsection 3.1. This new Subsection 3.1 should be called "Stemming in Derivative Languages", or something similar. Subsection 3.1, which is currently titled "BiLSTM Layer", should be renumbered as 3.2. Similarly, 3.2, which is currently titled "Attention Layer", should be renumbered as 3.3, and so on.

At the end of the new Introduction, add a paragraph stating what the following sections are about. For example, "The rest of this paper is organized as follows: Section 2 presents the related work. Section 3 discusses the Methods...".

Section 2 (Related work) must be simplified. The first three paragraphs are great in terms of learning about the history of stemming, but they are not directly related to the subject of this paper (they refer to stemming for the English language). This material can be part of any textbook defining the concept of stemming, but it is unnecessary in this paper. I recommend that the authors should start Section 2 by defining stemming and then jump immediately to paragraph 4 ("In 2007, Majumder [12] proposed a clustering-based stemming algorithm...").

The current Subsections 3.1 and 3.2 are important. However, they need to be contextualised. At present, they could be part of any textbook on neural networks, but they do not address the subject of this paper. Instead of describing the BiLSTM and Attention Layer generally in Subsections 3.1 and 3.2, the authors should add to these subsections an example directly relevant to the processing of the languages that they are studying (Uyghur, Kazakh and Kirghiz). How exactly do the BiLSTM and Attention Layer process these languages?

In Subsection 4.1, specify the dates and times when you crawled the People's Daily website. Explain whether you crawled all the pages or a particular subset of the People's Daily website. This is to ensure that other academics who read your paper can understand your experiment better and replicate it for comparison purposes.

The references should be formatted uniformly. Unfortunately, references 19, 21 and 25 list the names of the authors in capital letters. This should be corrected. Only the initial letter of each name should be capitalized.

Other amendments

Lines 154-155: State, explicitly, the three languages that Chrupala et al. approached for stem extraction in their 2008 paper. Saying that they used the "maximum entropy classification model for stem extraction in three languages" is not enough. It is important to know which three languages they used.
Line 29: Replace "nonpopular" with "unpopular". Although it is understandable what "nonpopular" means, the correct word is "unpopular".
Line 37: Replace "previous researches" with "previous research". The word research should not be used in plural in this case. You can also write "previous work", but not "previous researches".
Line 70 should be replaced with: "In previous work, most of the researchers...". Once again, the word research should not be used in plural in this case.
Line 168: Replace "derived" with "derivative".
Line 186: Replace "Reference [21] used..." with "Wumaierjiang, et al., used...". If you know the names of the authors, you should credit them, rather than referring to them as Reference [21].
Line 187: Replace "... recurrent unit network (BiGRU) network respectively" with "... recurrent unit (BiGRU) network respectively"; otherwise, the word "network" appears twice.
Line 204: Remove the word "first". The sentence should read "... three languages as input, manually segment the stem and affix parts...".
Line 189: Replace "...different two data sets" with "...two different data sets".
Line 191: Replace "... the independence between words in the data brings inconvenience to the CRF model to learn more information" with "... the independence between words in the data makes it difficult to learn more information".
Line 281: Replace "dev set" with "development (dev) set". In the rest of the paper, you can refer to the "development set" as the "dev set", but it is better to use the whole name the first time you mention the term.
Line 327: I recommend that the authors should avoid using "In order to" so often. In this line, they could simply start by saying "To further verify the validity of the multilingual...".
Line 346: Replace "formulas" with "formulae". The plural of "formula" is "formulae".
Line 361: I recommend that the authors should avoid using "In order to" so often. In this line, they could simply start by saying "To verify the effect of different...".

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The paper is very good. It proposes a multilingual (3 languages) stemming method based on embedding and sequential modeling (bidirectional LSTM).

The related work is well presented. The research method is described. The dataset is large enough to support the results. The results are presented and explained.

I have only two suggestions for improving the paper:
1) Add the authors' contribution to the Introduction section.
2) Add future plans to the Conclusions.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report
