Article
Peer-Review Record

Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

by Svitlana Petrasova 1,*, Nina Khairova 1,*, Włodzimierz Lewoniewski 2, Orken Mamyrbayev 3 and Kuralay Mukhsina 4
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 4 November 2018 / Revised: 9 December 2018 / Accepted: 10 December 2018 / Published: 13 December 2018
(This article belongs to the Special Issue Data Stream Mining and Processing)

Round 1

Reviewer 1 Report

The paper presents a methodology aimed at identifying similar text fragments across Wikipedia portals. Starting from the linguistic analysis of sentences performed with the Stanford POS tagger and dependency parser, the authors devise a set of possible grammatical and semantic characteristics of collocation words occurring in Wikipedia sentences.

The topic addressed by the authors is quite interesting and has been investigated in the context of methodologies for detecting text fragments that share similar content, with a number of applications such as paraphrase generation and the alignment of sentences to be simplified. For this reason, I would suggest taking into account in the Related Work section other methodologies devised in the computational linguistics community, such as the following:

-- Regina Barzilay and Noemie Elhadad. 2003. "Sentence alignment for monolingual comparable corpora". In: Proceedings of the Conference on Empirical Methods in Natural Language Processing.

-- Rani Nelken and Stuart M. Shieber. 2006. "Towards robust context-sensitive sentence alignment for monolingual corpora". In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 3–7 April.

-- William Coster and David Kauchak. 2011. "Simple English Wikipedia: a new text simplification task". In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

-- Stefan Bott and Horacio Saggion. 2011. "An unsupervised alignment algorithm for text simplification corpus construction". In: Proceedings of the Workshop on Monolingual Text-To-Text Generation, co-located with ACL 2011, Portland, Oregon.

In the Result Analysis subsection, I would elaborate on the claim that the low results achieved "might be due to mistakes of the POS tagging and UD-parser". The type of text analyzed in the experiments, i.e., Wikipedia pages, has linguistic characteristics not very different from those of the newspaper articles on which the Stanford linguistic tools are typically trained; accordingly, they should not suffer from a domain-adaptation problem. Could the authors show the kinds of errors made by the tagger and parser? If the precision of the linguistic analysis is too low for their task, the authors should try UDPipe (http://ufal.mff.cuni.cz/udpipe), which currently represents the state-of-the-art linguistic annotation pipeline.


Author Response

Point 1: The topic addressed by the authors is quite interesting and has been investigated in the context of methodologies for detecting text fragments that share similar content, with a number of applications such as paraphrase generation and the alignment of sentences to be simplified. For this reason, I would suggest taking into account in the Related Work section other methodologies devised in the computational linguistics community...

Response 1: In the Related Work section, we have added: "Similar studies on semantic proximity are monolingual sentence alignment algorithms [14,15]. In [16,17], the authors applied this method to study unsimplified and simplified texts in the English and Spanish languages."

In the References section, we have made reference to:

“14. Barzilay, R.; Elhadad, N. Sentence alignment for monolingual comparable corpora; In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2003; pp. 25–32, DOI: 10.3115/1119355.1119359.

15. Nelken, R.; Shieber, S.M. Towards robust context-sensitive sentence alignment for monolingual corpora; In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006; pp. 161–168.

16. Coster, W.; Kauchak, D. Simple English Wikipedia: a new text simplification task; In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011; pp. 665–669.

17. Bott, S.; Saggion, H. An unsupervised alignment algorithm for text simplification corpus construction; In Proceedings of the Workshop on Monolingual Text-To-Text Generation, 2011; pp. 20–26.”
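For readers unfamiliar with monolingual sentence alignment, the approach of [14,15] can be sketched in a much simplified form: each source sentence is paired with its most similar target sentence under some similarity measure. The bag-of-words cosine measure and the 0.5 threshold below are illustrative assumptions, not details taken from those papers or from the manuscript.

```python
# Minimal sketch of monolingual sentence alignment via cosine similarity
# of bag-of-words vectors. The 0.5 threshold is an illustrative choice.
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source: list[str], target: list[str], threshold: float = 0.5):
    """Pair each source sentence with its most similar target sentence,
    keeping only pairs whose similarity reaches the threshold."""
    pairs = []
    for s in source:
        vs = Counter(s.lower().split())
        best = max(target, key=lambda t: cosine(vs, Counter(t.lower().split())))
        score = cosine(vs, Counter(best.lower().split()))
        if score >= threshold:
            pairs.append((s, best, round(score, 2)))
    return pairs
```

Real alignment algorithms refine this with context-sensitive scoring and sequence-level constraints, but the pairing-by-similarity core is the same.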

Point 2: In the Result Analysis subsection, I would elaborate on the claim that the low results achieved "might be due to mistakes of the POS tagging and UD-parser". The type of text analyzed in the experiments, i.e., Wikipedia pages, has linguistic characteristics not very different from those of the newspaper articles on which the Stanford linguistic tools are typically trained; accordingly, they should not suffer from a domain-adaptation problem. Could the authors show the kinds of errors made by the tagger and parser? If the precision of the linguistic analysis is too low for their task, the authors should try UDPipe (http://ufal.mff.cuni.cz/udpipe), which currently represents the state-of-the-art linguistic annotation pipeline.

Response 2: In the Result Analysis subsection, we have added the following detail: "As our model identifies a set of possible grammatical and semantic characteristics of collocation words, it depends considerably on the result of parsing. Consequently, these mistakes are not determined by the chosen parser but stem from morphological and/or syntactic ambiguity, which is unavoidable and affects the precision of the final result."


Author Response File: Author Response.pdf

Reviewer 2 Report

The paper addresses a good topic: finding communities in Wikipedia. However, the results have not been explained, especially regarding the communities. I suggest that the authors investigate the results further and give clearer conclusions.

Table 2 and Table 3 can be combined. 

Some English expressions are not good. For example, "web-based" should be "Web-based".

The related works should be described in the past tense.

Author Response 

Point 1: The paper addresses a good topic: finding communities in Wikipedia. However, the results have not been explained, especially regarding the communities. I suggest that the authors investigate the results further and give clearer conclusions.

Response 1: In the Introduction section, we have added: "It should be noted that, in general, the Wikipedia community is defined as "the community of contributors to the online encyclopedia Wikipedia", who can create and edit articles of Wikipedia projects in different languages and on different topics. However, in this study, by the term "Wikipedia community" we refer to the unity of information contained in short text fragments of dynamic Wikipedia resources of varying research directions."

In the Results Analysis subsection, we have added: "The articles of Wikipedia cover the various subject areas represented in Wikipedia projects. We have confirmed the hypothesis that many synonymous collocations from texts, especially those related to similar topics, can form common information spaces of Wikipedia communities."
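As a toy illustration of the idea quoted above (not the authors' implementation), one can treat text fragments as nodes, link two fragments whenever a predicate decides they share material (e.g. a common collocation), and read off "communities" as the connected components of that graph. The word-overlap predicate in the usage example is a deliberately crude stand-in for the paper's grammatical and semantic similarity measure.

```python
# Hypothetical sketch: fragments sharing material form a graph whose
# connected components play the role of common information spaces.
from collections import defaultdict


def communities(fragments: list[str], shared) -> list[set[int]]:
    """Group fragment indices into connected components; `shared(a, b)`
    decides whether two fragments are linked."""
    adj = defaultdict(set)
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if shared(fragments[i], fragments[j]):
                adj[i].add(j)
                adj[j].add(i)
    seen, comps = set(), []
    for i in range(len(fragments)):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:  # iterative DFS over the similarity graph
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

With a simple word-overlap predicate, `share_word = lambda a, b: bool(set(a.lower().split()) & set(b.lower().split()))`, fragments about the same topic fall into one component while unrelated ones stay apart; any stronger collocation-based similarity test can be dropped in as the predicate.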

Point 2: Table 2 and Table 3 can be combined.

Response 2: In the Data Description section, we have combined Table 2 and Table 3 into one table, "Table 2. Statistics of Wikipedia portals: Art and Biography".

Point 3: Some English expressions are not good. For example, "web-based" should be "Web-based".

Response 3: We have revised the manuscript and slightly improved English expressions.

Point 4: The related works should be described in the past tense.

Response 4: In the Related Work section, we have rewritten the statements in the past tense.  


Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Most of my comments from the last round have been addressed in the report, and the manuscript has been improved accordingly.

It would be better to give a typical example of a community found in the experiments, given the authors' definition: "the community of contributors to the online encyclopedia Wikipedia".

Author Response

Point 1: It would be better to give a typical example of a community found in the experiments, given the authors' definition: "the community of contributors to the online encyclopedia Wikipedia".

Response 1: In the References section, we have added reference [1] to the Wikipedia article with a definition of the Wikipedia community, mentioned in the Introduction section: "It should be noted that, in general, the Wikipedia community is defined as "the community of contributors to the online encyclopedia Wikipedia" [1], who can create and edit articles of Wikipedia projects in different languages and on different topics. However, in this study, by the term "Wikipedia community" we refer to the unity of information contained in short text fragments of dynamic Wikipedia resources of varying research directions."

In the Conclusions and Further Work section, we have added: "Our model is one of the linguistic tools that, together with other approaches, can be helpful in the formation of electronic catalogues of semantically connected texts in scientometric, library, and abstract systems.

Our further work will be directed at the integration of our technology into systems for the automatic generation of Wikipedia communities."


Author Response File: Author Response.pdf
