Peer-Review Record

Named Entity Recognition Using Conditional Random Fields

Appl. Sci. 2022, 12(13), 6391; https://doi.org/10.3390/app12136391

by Wahab Khan ¹,²,*, Ali Daud ³, Khurram Shahzad ⁴, Tehmina Amjad ², Ameen Banjar ³ and Heba Fasihuddin ³

Reviewer 1: Gil-Jin Jang

Reviewer 2: Jianjun Huang

Reviewer 3: Anke Berns

Submission received: 7 May 2022 / Revised: 14 June 2022 / Accepted: 20 June 2022 / Published: 23 June 2022

(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

This paper proposes a named entity recognition (NER) method and a new database. There are two major questions that should be answered.

1. The authors propose an NER method using conditional random fields (CRFs). However, CRFs have already been proposed and used in many other research papers. For example,

- https://medium.com/data-science-in-your-pocket/named-entity-recognition-ner-using-conditional-random-fields-in-nlp-3660df22e95c

- Named Entity Recognition using Conditional Random Fields. Nita Patil, Ajay Patil, and B. V. Pawar. International Conference on Computational Intelligence and Data Science (ICCIDS 2019).

The above paper has an almost identical title and proposes a language-independent method for NER. In Section "4. Urdu NER using Conditional Random Field", there is no aspect specific to Urdu. Only the basics of CRFs are given, with no detailed algorithm. There is no novelty in the method.

2. The authors claim that the second contribution is UNER-I. However, the reviewer could not find the database on the web. If the authors want to claim the database as a contribution, it should be publicly available.

3. page 12, Table 12

If the authors want to claim that the new dataset, UNER-I, is as good as the baseline database, the reviewer suggests mixed experiments.

That is, train the model on IJCNLP-Urdu and evaluate it on UNER-I. If the metric values are similar, the authors can make a convincing case that the new dataset is well designed and collected.

4. Minor comments, suggestions, and error corrections.

- page 1, abstract, 5th line from bottom: "preformed" -> "performed"

- page 7, equations 10 and 11: the equations are overlapped.

- page 7, 3rd line from bottom of the page, Section 5.2: what is kappa result? Please explain.

- page 10: Some of the Table numbers in the main text are missing.

- page 11, Section 6.2: the text is not well written. Not clear.

Author Response

Editor,

Applied Science Journal

Subject: Submission of the Revised version of Manuscript for Publication in Applied Science Journal

I am enclosing herewith the revised Manuscript ID applsci-1738585, entitled “Urdu Named Entity Recognition using Conditional Random Field”, for publication in Applied Science Journal.

We are very excited to have been allowed to publish our work in Applied Science Journal. Following the reviewers’ suggestions, we have revised the manuscript. We carefully considered the reviewers’ comments. We wish to express our appreciation for the reviewers’ in-depth comments, suggestions, and corrections, which have greatly improved the manuscript.

We hope our revision meets your approval.

We hope the revised version is now suitable for publication and look forward to hearing from you in due course.

Next, we offer detailed responses to the comments of the reviewers.

WAHAB KHAN

PhD

Comments and Solutions

Reviewer 1

This paper proposes a named entity recognition (NER) method and a new database. There are two major questions that should be answered.

The authors propose an NER method using conditional random fields (CRFs). However, CRFs have already been proposed and used in many other research papers. For example,

- https://medium.com/data-science-in-your-pocket/named-entity-recognition-ner-using-conditional-random-fields-in-nlp-3660df22e95c

- Named Entity Recognition using Conditional Random Fields. Nita Patil, Ajay Patil, and B. V. Pawar. International Conference on Computational Intelligence and Data Science (ICCIDS 2019).

Solution: CRF is the most widely used model for sequential labeling tasks. Its strong performance on NER for Western languages motivated us to adopt it for Urdu NER.

Solution: Yes, we agree that the title of our study matches that of the mentioned article. However, the main objective of our study is to propose a generic Urdu NER system based on conditional random fields. Therefore, in the revised MS we have changed the title to “Urdu Named Entity Recognition using Conditional Random Field”.

The novelty does not lie in the CRF model itself; the current work is novel from the following three perspectives. First, the proposed feature set includes language-dependent features, namely part-of-speech (POS) tags, for which we used a third dataset, the CLE POS dataset. To assign POS tags to the UNER-I and IJCNLP datasets, we used the longest maximum matching approach. Second, we used eleven template features consisting of both unigrams and bigrams. Third, we created a new dataset, which we term the UNER-I dataset.
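The longest maximum matching step mentioned above is not spelled out in this response, so the following is only a minimal sketch of one common reading: scan the token sequence left to right and, at each position, take the longest phrase found in a POS lexicon. The function name, the `max_len` window, the `UNK` fallback tag, the `ADJ` tag, and the English placeholder tokens are all hypothetical; only the tags `PNN` and `NN` appear in the response itself.

```python
from typing import Dict, List, Tuple


def assign_pos_longest_match(
    tokens: List[str],
    lexicon: Dict[str, str],
    max_len: int = 4,          # hypothetical window size, not from the paper
    default_tag: str = "UNK",  # hypothetical fallback tag
) -> List[Tuple[str, str]]:
    """Tag tokens by greedy longest-match lookup in a phrase -> POS lexicon."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink the window.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in lexicon:
                tag = lexicon[phrase]
                tagged.extend((tok, tag) for tok in tokens[i:i + span])
                i += span
                break
        else:
            # No lexicon entry starts at this position.
            tagged.append((tokens[i], default_tag))
            i += 1
    return tagged


# Placeholder lexicon; a real run would load entries from a POS-tagged corpus.
toy_lexicon = {"new york": "PNN", "new": "ADJ", "city": "NN"}
result = assign_pos_longest_match(["new", "york", "city", "xyz"], toy_lexicon)
```

In this toy run the two-token entry wins over the single-token entry for the first word, which is the behaviour that distinguishes longest matching from plain word-by-word lookup.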

The authors claim that the second contribution is UNER-I. However, the reviewer could not find the database on the web. If the authors want to claim the database as a contribution, it should be publicly available.

Solution: The UNER-I dataset can be accessed at this URL: “https://github.com/My-Khan/UNER-Dataset”. The URL is also given in a footnote in the main text.

page 12, Table 12

Solution: The blank space has been removed.

If the authors want to claim that the new dataset, UNER-I, is as good as the baseline database, the reviewer suggests mixed experiments.

That is, train the model on IJCNLP-Urdu and evaluate it on UNER-I. If the metric values are similar, the authors can make a convincing case that the new dataset is well designed and collected.

Solution: The suggestion of a mixed experiment is very good. However, there are limitations arising from mismatched class labels in the two datasets. For example, the IJCNLP dataset lacks a “Time” entity. Similarly, the total number of identified entities in the IJCNLP dataset is much smaller (1,115), while the UNER-I dataset contains 5,283 entities. Therefore, there is

Minor comments, suggestions, and error corrections.

- page 1, abstract, 5th line from bottom: "preformed" -> "performed"

Solution: Thanks, the spelling has been corrected.

- page 7, equations 10 and 11: the equations are overlapped.

Solution: Many thanks for pointing out the overlapped equations. In the revised version we have removed the overlap.

- page 7, 3rd line from bottom of the page, Section 5.2: what is kappa result? Please explain.

Solution: In the revised MS we have defined kappa as suggested by the reviewer and added a relevant reference.
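The response does not say which kappa statistic the manuscript uses. Assuming it is Cohen's kappa for agreement between two annotators, it compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch under that assumption follows; the function name and the toy annotations are hypothetical, and the label names are borrowed from elsewhere in this record.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy annotations for four tokens (hypothetical data).
annotator_1 = ["Person", "Location", "NOR", "NOR"]
annotator_2 = ["Person", "NOR", "NOR", "NOR"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four tokens (p_o = 0.75) and chance agreement is 7/16, so kappa works out to 5/9, noticeably lower than the raw agreement rate.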

- page 10: Some of the Table numbers in the main text are missing.

Solution: Many thanks for pointing out the missing table numbers. In the revised MS the table numbers have been added.

- page 11, Section 6.2: the text is not well written. Not clear.

Solution: The mentioned text has been rephrased in the revised MS. We hope that it now meets the reviewer’s requirement.

Author Response File: Author Response.docx

Reviewer 2 Report

This paper proposes a CRF-based Urdu NER method. Compared with the baseline method, the improvement in F1-score is 1.5% to 3%. Another contribution is a new Urdu NER dataset.

Problems: 1. The proposed method uses a handcrafted feature template (see Eq. (3)). How do these features reflect the three differences between Urdu and Western languages?

2. In Eq. (1), what is the relationship between y, y_t, and y_{t-1}?

3. How will the methods for Western-language NER perform on the Urdu dataset? Such experimental results should be presented.

4. A comparison between the proposed method and other methods for Urdu (e.g. [36]) should also be given.

Author Response

Reviewer 2

This paper proposes a CRF-based Urdu NER method. Compared with the baseline method, the improvement in F1-score is 1.5% to 3%. Another contribution is a new Urdu NER dataset.

Problems: 1. The proposed method uses a handcrafted feature template (see Eq. (3)). How do these features reflect the three differences between Urdu and Western languages?

Solution: If the training data of any language is modeled in the format of Table 4, then these features can be applied. So, we can say that the template features are dynamic and can be used for any language, as they are not language-dependent.

In Eq. (1), what is the relationship between y, y_t, and y_{t-1}?

Solution: In Equation 1, the portion represents our feature function. In CRFs, the input data is sequential, and we have to take the previous context into account when making a prediction for a data point. To model this behavior, we use feature functions, which take the following inputs:

The set of input vectors, X
The position t of the data point we are predicting
The label of data point t-1 in X
The label of data point t in X

The purpose of the feature function is to express some kind of characteristic of the sequence that the data point represents. For instance, if we are using CRFs for named entity tagging, then we define the feature functions as:

func1 = if (output class = Person and feature = “U04: PNN”) return 1 else return 0

func2 = if (output class = Location and feature = “U04: PNN”) return 1 else return 0

....

funcN = if (output = NOR and feature= “U04:NN”) return 1 else return 0
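The indicator functions above can be sketched in Python as follows. This is an illustration rather than the authors' implementation: each feature function takes the four inputs listed earlier (the input sequence X, the position t, the label at t-1, and the label at t) and returns 0 or 1. The transition feature, the weights, the `START` marker, and the placeholder words are hypothetical. The weighted sum computed here is only the unnormalized score of one label sequence; a linear-chain CRF would exponentiate it and normalize over all candidate label sequences.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag), mirroring the two input columns
FeatureFn = Callable[[str, str, Sequence[Token], int], int]


def make_unigram_pos_feature(label: str, pos_tag: str) -> FeatureFn:
    """Fires when the current label is `label` and the current POS tag is `pos_tag`."""
    def feature(y_prev: str, y_t: str, x: Sequence[Token], t: int) -> int:
        return 1 if y_t == label and x[t][1] == pos_tag else 0
    return feature


def make_transition_feature(prev_label: str, label: str) -> FeatureFn:
    """Fires on a specific pair of consecutive labels (hypothetical bigram feature)."""
    def feature(y_prev: str, y_t: str, x: Sequence[Token], t: int) -> int:
        return 1 if y_prev == prev_label and y_t == label else 0
    return feature


def sequence_score(
    x: Sequence[Token],
    y: Sequence[str],
    features: List[FeatureFn],
    weights: List[float],
    start_label: str = "START",  # hypothetical marker for the position before t = 0
) -> float:
    """Unnormalized score: weighted sum of all feature values over all positions."""
    score = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else start_label
        for weight, feature in zip(weights, features):
            score += weight * feature(y_prev, y[t], x, t)
    return score


features = [
    make_unigram_pos_feature("Person", "PNN"),    # func1 above
    make_unigram_pos_feature("Location", "PNN"),  # func2 above
    make_unigram_pos_feature("NOR", "NN"),        # funcN above
    make_transition_feature("Person", "NOR"),     # hypothetical
]
weights = [1.5, 0.5, 1.0, 0.25]  # made-up values; real weights are learned in training

sentence = [("word1", "PNN"), ("word2", "NN")]  # placeholder tokens
score = sequence_score(sentence, ["Person", "NOR"], features, weights)
```

Because each function sees both the previous and the current label, the same mechanism covers features that depend only on the current label and features that depend on the label transition.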

How will the methods for Western language NER perform on the Urdu dataset? Such experimental results should be presented.

Solution: Most of the world's related languages share their advanced vocabulary, although they vary in their basics. For example, Hindi and Urdu share a common phonological, morphological, and grammatical structure, but their scripts differ. In addition, their vocabularies have diverged significantly, especially in the written form (Visweswariah et al. 2010; Riaz 2012). In previous studies (Riaz 2008, 2009, 2012), the author argued for independent, innovative work on the Urdu language rather than dependence on tools and resources developed for Hindi and Western languages. Therefore, resource sharing between Urdu and Western languages is not possible.

References:

Riaz K (2008) Concept search in Urdu. In: Proceedings of the 2nd PhD workshop on information and knowledge management, pp 33–40
Riaz K (2009) Urdu is not Hindi for information access. SIGIR workshop on information access in a multilingual World, pp 53–57
Riaz K (2010) Rule-based named entity recognition in Urdu. In: Proceedings of the 2010 named entities workshop, pp 12–35
Riaz K (2012) Comparison of Hindi and Urdu in computational context. Int J Comput Linguist Nat Lang Process 1(3):92–97
Visweswariah K, et al. (2010) Urdu and Hindi: translation and sharing of linguistic resources. In: Proceedings of the 23rd international conference on computational linguistics (COLING), pp 1283–1291

Comparison between the proposed method and other methods for Urdu (e.g. [36]) should also be given.

Solution: In our recent study titled “Urdu Named Entity Recognition System Using Deep Learning Approaches”, we also compared deep learning (DL) models with a CRF output layer against other methods such as LSTM, GRU, and RNN with character embeddings and word embeddings.

https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxac047/6572657?login=false

Author Response File: Author Response.docx

Reviewer 3 Report

Dear authors,

The topic of your research is quite interesting and I enjoyed reading your work. However, before publishing you must thoroughly revise the paper with a special focus on the following issues and comments (see document attached).

Moreover, I would definitely recommend that the authors have the paper revised by a professional proofreader. The paper needs to be revised in terms of language as well as format. There are plenty of formal mistakes that need to be corrected before the paper can be accepted for publication.

With kind regards, and I hope that my comments will help you to make the most of your research.

Comments for author File: Comments.pdf

Author Response

Reviewer 3

COMMENT 1:

Statements such as the following require more evidence (i.e. references) from the literature:

1 INTRODUCTION

“(…) For decades, NER systems have been researched and widely developed.”

Solution: A valid reference has been added. Thanks.

I would indicate the URL of the BBC Urdu website. “(…) Provision of a named entity annotated dataset consisting of 2161 news having 5283 entities. The news articles are obtained from the BBC Urdu website which is a valuable resource of Urdu text in the digital format. The newly developed Urdu NER dataset is named as UNER-I dataset”

Solution: The required URL of the BBC Urdu home page has been added as a footnote.

COMMENT 3:

Please revise the English and writing of the following paragraph with a special focus on those parts in red color.

The rest of the paper is organized as follows: Section 2 presents an overview of the existing studies on Urdu NER. Section 3 discusses the key characteristics of the Urdu language as well as it challenges in the context of NER. An overview of our proposed approach is presented in Section 4. The details of the dataset that we have developed is presented in Section 5. Also, the specification of our dataset is compared with it counterparts. The details of the experiments performed to evaluate the effectiveness of the proposed approach are presented in Section 6. The results of the experiments are discussed in Section 7. Section 8 concludes the paper.

You could say something like: Finally, in Section 8 we present the conclusions drawn from our analysis and study depicting some future line of research. However, it would be good to revise the entire paragraph and improve its writing.

Solution: Yes, we agree that, in the initial draft of the MS, the writing and English were not of good quality. We are very thankful for the suggestion and for highlighting the grammatical mistakes. We have completely revised the paragraph in the revised version and hope that it now meets the reviewer's requirement.

2 RELATED WORK

COMMENT 4:

Please revise this sentence and its reference which looks incomplete. “(…) In this study, we choose the works of [15] as baseline model. Mukund, Srihari [15] first tackled Urdu NER task (…)”

Solution: The mentioned sentence was revised as suggested.

CHARACTERISTICS OF URDU LANGUAGE

COMMENT 5:

The authors say “(…) For a further understanding, consider the example sentence presented in Table 2 (…)”, but I cannot see any table in the respective section. There is only a Table 2 (Examples of changes in shape) in the next section (Section 4) of the paper, but this table does not seem to be related to the comment. Please revise this!

Solution: Table 2 demonstrates the shape-change characteristics of Urdu characters/alphabets. In the initial draft it was placed in the wrong section, so we have moved it from Section 4 to Section 3.

I would rewrite the following sentence:

“(…) Therefore, it is desirable to develop techniques and computational resources for named entity recognition in Urdu.”

You could say: Therefore, the current study aims to ……

Solution: Yes, we agree. We have changed the mentioned sentence accordingly.

URDU NER USING CONDITIONAL RANDOM FIELD

COMMENT 6: The authors write: “(…) This study has proposed a CRF based approach for the NER problem in Urdu. The proposed approach's overall architecture is shown in Figure 1. (..)” Again, the authors should move the mentioned figure (i.e., Figure 1), which appears only at the end of the section but needs to be integrated within the respective description in the text. Please revise this and try to be consistent about this in the whole paper. Please also fix inconsistencies in spelling: use Figure 1 instead of FIGURE 1.

Solution: The mentioned figure has been moved as suggested.

COMMENT 7:

4.2 Feature Templates

Once again, you should revise the integration of tables in your text. The idea behind figures and tables is to illustrate what you have previously explained in the text. To improve the readability of the text, figures and tables must follow their description.

Solution: The tables have been moved to their correct places in the revised MS.

COMMENT 8:

I think this paragraph needs to be revised with a special focus on the following aspects: (1) the four functions need to be explained, (2) the formatting of Equations 8 and 11 needs to be revised.

“(…) For a formal specification, consider a given Word Token (WT) which is to be used as feature template. |WT| represents the length of the word, four functions d, ß, . and . can be defined, which determine the value of each element in the feature template, respectively. The definition of the four functions is presented in Equation 8 to Equation 11 (…)”.

Solution: The four functions are defined on the basis of the details provided in Table 4. In the revised MS we have tried to present them simply, so that the authors' intention is clear.

Equation 10 manipulates the first column (i.e., the Token column) of the training data given in Table 5 during the encoding phase, while Equation 10 manipulates the second column (POS Tag) of the training data given in Table 5. Additionally, the overlapped equations have been reformatted.

COMMENT 9

4.4 ENCODING

Tables 6 and 7 are not mentioned in the description.

Solution: Descriptions of both Table 6 and Table 7 have been added to the revised MS.

COMMENT 10

4.5 Decoding

Please revise this paragraph in terms of language as well as content. Once again, you forgot to mention Table 7 in your description.

“(…) The encrypted model file produced during in the encoding stage is decrypted employing template features to anticipate NE labels in the test dataset during the decoding stage. Training data as well as test data have the same data format. The absence of the class label column in the test dataset is the most significant difference between it and the training file (…)”.

Solution: Yes, we agree with the observation. In the revised MS the mentioned paragraph has been rephrased and a description of Table 7 has been added.

COMMENT 11

5 Urdu UNER Dataset

Please revise the format and type of letter of Tables 8-11. They do not coincide with those used before. Please also revise the following paragraphs whose information is incomplete:

“(…) A notable characteristic is that the UNER-I is freely available for research purposes. Therefore, we contend that it will be a valuable resource for promoting NER research in Urdu, which is a low-resource language. The summary of the characteristics of our UNER-I is presented in Table 10, whereas, Table 11 contains the detailed statistics of the UNER-I dataset. In particular, Table…… contains the distribution of NE and their respective genres (…).

“(…) Table ….that our UNER-I dataset is substantially larger than IJCNLP-Urdu dataset in terms of number of words, number of sentences, as well as number of named entities. Furthermore, it can be observed from the comparison of Table 9 and Table 11 that the number of entities for each entity type in UNER-I is manifolds higher than IJCNLP-Urdu dataset. Hence, the dataset that we have contributed is manifolds greater than its existing counterparts (…)”

Solution: All the tables are revised and reformatted as suggested. Each related table is now cited in its corresponding section with the correct table number. All tables are inserted into the main text close to their first citation and numbered following their number of appearance.

COMMENT 12

Table 13 is not mentioned in your description.

With regard to the REFERENCE section, I have observed several mistakes and inconsistencies, so I would ask you to thoroughly revise this part. Here are only some examples. However, you should revise all references.

Solution: Table 13 is now mentioned in subsection “7.1 Results of News Genres” of the revised MS.

COMMENT 13

Results and Discussion

This sentence is incomplete. Please indicate the table you are referring to.

“(…) For a further illustration consider an excerpt text presented in Table ……. The table contains the actual excerpt from the 70 IJCNLP-Urdu datasets where NEs are annotated

Solution: Many thanks for pointing out the missing table. In the Revised MS we have added the missing table.

COMMENT 14

Please thoroughly revise all your references in line with the MDPI reference style.

REFERENCES

Nadeau, D. and S. Sekine, A survey of named entity recognition and classification. Lingvisticae Investigationes, 2007. 30(1): p. 3-26.

Khan, I.U., et al., A Review of Urdu Sentiment Analysis with Multilingual Perspective: A Case of Urdu and Roman Urdu. Language. Computers, 2022. 11(1): p. 3.

Kazama, J.i. and K. Torisawa. Exploiting Wikipedia as external knowledge for named entity recognition. in 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 2007.

To facilitate the revision, please have a look at the following examples from other MDPI publications:

Journal Article:

Palomo-Duarte, M.; Berns, A.; Balderas, A.; Dodero, J.M.; Camacho, D. Evidence-Based Assessment of Student Performance in Virtual Worlds. Sustainability 2020, 13, 244, doi:10.3390/su13010244.

Book:

Burns, A. Collaborative action research for language teachers; Cambridge University Press, 1999.

Book Chapter:

Oskoz, A.; Elola, I. Promoting foreign language collaborative writing through the use of Web 2.0 tools and tasks. In Technology-mediated TBLT. Researching Technology and Tasks; Amsterdam: John Benjamins Publishing Company, 2014; pp. 115–148, doi:10.1075/tblt.6.05osk.

Kessler, G.; Hubbard, P. Language Teacher Education and Technology. In The Handbook of Technology and Second Language Teaching and Learning; Wiley, 2017; pp. 278–292.

Solution: In the revised version of the MS, we have formatted all references as per the MDPI reference style. For this purpose, we used the MDPI EndNote style provided on the homepage of the MDPI: “https://endnote.com/style_download/mdpi/”
