Article
Peer-Review Record

A Methodology for Open Information Extraction and Representation from Large Scientific Corpora: The CORD-19 Data Exploration Use Case

Appl. Sci. 2020, 10(16), 5630; https://doi.org/10.3390/app10165630
by Dimitris Papadopoulos 1,2,*, Nikolaos Papadakis 3 and Antonis Litke 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 26 July 2020 / Revised: 7 August 2020 / Accepted: 8 August 2020 / Published: 13 August 2020
(This article belongs to the Special Issue Medical Artificial Intelligence)

Round 1

Reviewer 1 Report

This manuscript discusses a pipeline for open information extraction from scientific corpora. This was applied to the full-text subset of the COVID-19 Open Research Dataset (CORD-19).

The paper is well written and clear. The approach makes use of existing tools in all pipeline steps, and therefore the main contributions are the pipeline and analysis.


Major comments:

1) The impact / usefulness of the proposed pipeline needs some supporting results. 

These could be obtained by removing some steps of the pipeline (coreference resolution; summarization) and analysing the results, possibly on a subset of the corpus.

Another informative analysis would be comparing the results of the pipeline against each of the OIE engines used separately.


Minor comments:

1) Biomedical terminology is known for its ambiguity, but the authors do not address this in the paper. How are ambiguous names resolved when performing entity linking?
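One common way to resolve such ambiguity is to score each candidate knowledge-base concept against the mention's sentence context. The toy sketch below is purely illustrative: the miniature candidate table, the CUIs, and the overlap scoring are invented for this example and are not the paper's actual entity-linking method.

```python
# Toy illustration of context-based disambiguation during entity linking:
# each candidate concept is scored by lexical overlap between its
# definition and the sentence containing the mention.
# The candidate table and CUIs below are invented for this example.
CANDIDATES = {
    "cold": [
        ("C0009443", "common cold a viral upper respiratory infection"),
        ("C0009264", "cold temperature absence of heat"),
    ],
}

def link(mention: str, context: str) -> str:
    context_words = set(context.lower().split())
    best_cui, _definition = max(
        CANDIDATES[mention],
        key=lambda cand: len(context_words & set(cand[1].split())),
    )
    return best_cui

# The context overlaps far more with the common-cold definition,
# so the medical sense wins under this toy scoring.
print(link("cold", "viral infection of the upper respiratory tract"))
# -> C0009443
```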


Also, there are a few passages for which a revision is suggested:

p3, p10: "close information extraction" -> "closed"

p3: BIO is only introduced on page 11

p5: "underwent through"

p6: "more condense corpus" -> dense / condensed ?

p6/p10: is Figure 2 necessary? Maybe replace it with summary statistics

p13: "The UMLS knowledge base contains approximately 3 million concepts (e.g. definitions, hierarchies)" - As expressed, this sentence is not totally correct; the UMLS Metathesaurus contains over 4 million concepts; some of these are associated with definitions and with concept-concept relations.

p14: "Due to the rich internal structure that characterizes labelled property graphs, allowing each node or relationship to store more than one properties (thus reducing the graph’s size) we used the Neo4J graph database " - the sentence is not clear unless the reader is aware that Neo4J offers labelled property graphs.

p14 (sentence above): "more than one properties" -> property

p14/15: In the description of Subject nodes, it is mentioned that these contain article_id, article_title, sentence_text, sent_num, triple_num, and engine. Shouldn't these be associated with the complete triple rather than the Subject node? (See the sketch below for one possible arrangement.)
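To make the suggestion concrete, here is a minimal sketch of how such triple-level metadata could sit on the relationship of a labelled property graph rather than on the Subject node, written against the official neo4j Python driver. The connection details and the sample triple are placeholders, and this is one possible arrangement, not the authors' actual schema.

```python
# Hypothetical sketch (not the authors' actual schema): keeping triple-level
# metadata on the relationship of a labelled property graph, so that Subject
# nodes can be shared across triples while each extraction keeps its context.
# Requires: pip install neo4j; connection details below are placeholders.
from neo4j import GraphDatabase

STORE_TRIPLE = """
MERGE (s:Subject {text: $subject})
MERGE (o:Object  {text: $object})
CREATE (s)-[:PREDICATE {
    text: $predicate,
    article_id: $article_id, article_title: $article_title,
    sentence_text: $sentence_text, sent_num: $sent_num,
    triple_num: $triple_num, engine: $engine
}]->(o)
"""

def store_triple(tx, **params):
    tx.run(STORE_TRIPLE, **params).consume()

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    session.execute_write(
        store_triple,
        subject="SARS-CoV-2", predicate="binds to", object="ACE2 receptor",
        article_id="doc_001", article_title="(placeholder title)",
        sentence_text="SARS-CoV-2 binds to the ACE2 receptor.",
        sent_num=3, triple_num=1, engine="OpenIE")
driver.close()
```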


Author Response

Thank you for your review. Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Overall, this is a well-written paper. I only have some minor suggested changes:

Open information extraction - all first letters should be capitalised, as the abbreviation is used in brackets (OIE). The same applies to ontology-based information extraction (OBIE). NLP requires defining in the text (I assume Natural Language Processing).

Some references are out of date, for example refs. 47-50.

Figure 2 is mentioned in the text, but needs further clarification to explain the meaning depicted in the graph.


Author Response

Thank you for your review. Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

This article is very interesting and will potentially guide future research in this area. What is highly impressive is the use of multiple techniques in an attempt to provide a robust outcome. I suggest some minor improvements:

- In Section 1.1, related works and concepts could be covered in more detail. Compared to the extensive reflection on the experimentation and research gaps, this section seems a bit short on what the existing works did and how they addressed similar problems. Mainly as a minor suggestion, the authors may consider elaborating on this, especially for readers who may not be familiar with the domain.

- With respect to detailing steps, such as coreference resolution, a symbolic form (e.g., set expressions) might make the discussion even clearer. Alternatively, while the experiment discussion later in the paper outlines what the steps do, an example of the process could be given early on, for instance when discussing Section 1.1.1. Again, this suggestion is made with readers from other disciplines in mind.

- Minor issue: in line 116, the ";" can possibly be replaced by ",".

- Even though F1-score, precision, etc. are well known to readers from the machine learning and data science communities, for the sake of completeness the measures could be clarified somewhere in the text (see the definitions below).

- The last section could be named "Discussion and Conclusion".

- In the discussion, while the article suggests promising results for the new strategy compared to existing strategies, it also mentions that a direct comparison is not possible (lines 473-476). Perhaps this sentence could be rewritten to state that the new strategy shows promising results and that a direct comparison with existing strategies is not possible for the reasons indicated.

- A question may also arise as to the point of going through all the steps, since each step possibly adds to the computational cost; a standalone approach would possibly require less time or fewer resources. If possible, the authors may consider involving a human expert evaluation of the usefulness of the produced graphs, which could then strengthen the position of the procedure; this may come as future follow-up work.
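For reference, the standard definitions of the measures mentioned above, with TP, FP, and FN denoting true positives, false positives, and false negatives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```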

Author Response

Thank you for your review. Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

The paper presents a method for open information extraction and representation from large corpora. The technique is applied in the case of the CORD-19 data.

The study is interesting and focuses on a timely and important topic. The review of the literature covers 77 works and seems very complete. The discussion is clear, and in general readers can easily follow the text. In the reviewer's opinion, the application to SARS-CoV-2 is particularly relevant. However, the reader is left with the final impression that something is missing and that the study contains only general concepts and ideas. Also, the so-called "representation" seems very qualitative, since we observe only some kind of hand-plotted charts. The reviewer recommends including a more assertive and quantitative discussion in this part. Finally, a new section with conclusions also seems necessary.

Author Response

Thank you for your review. Please see the attachment.

Author Response File: Author Response.docx

Reviewer 5 Report

The article tackles a very interesting problem, which is the triplification of knowledge hidden in unstructured or semi-structured texts. One of the challenges of this task is the proper identification of the subject, object, and predicate, and NLP techniques such as POS tagging and Named Entity Recognition are usually employed.
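As a concrete illustration of these building blocks, here is a minimal spaCy sketch showing POS tags (verbs suggest predicate candidates) and named entities (noun chunks and entities suggest subject/object candidates); the model name and example sentence are illustrative, not taken from the paper.

```python
# Illustrative sketch of POS tagging and NER as building blocks for
# triple extraction: verbs suggest predicate candidates, noun chunks and
# named entities suggest subject/object candidates.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SARS-CoV-2 binds to the ACE2 receptor in human cells.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # POS tags and dependencies

for chunk in doc.noun_chunks:
    print("noun chunk:", chunk.text)            # subject/object candidates

for ent in doc.ents:
    print("entity:", ent.text, ent.label_)      # NER, later linked to a KB
```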

I suggest moving the related work subsections to a separate section.

The survey is quite comprehensive, although a few works that combine various OIE techniques (such as named entities, POS tags, and synonyms) in summarization are missing:
- Hassel, M. (2003). Exploitation of named entities in automatic text summarization for Swedish. In NODALIDA'03 - 14th Nordic Conference on Computational Linguistics, Reykjavik, Iceland, May 30-31 2003 (p. 9).
- Pal, A. R., & Saha, D. (2014, February). An approach to automatic text summarization using WordNet. In 2014 IEEE International Advance Computing Conference (IACC) (pp. 1169-1173). IEEE.
- Kouris, P., Alexandridis, G., & Stafylopatis, A. (2019, July). Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5082-5092).

Also missing are a few works that focus on knowledge extraction from unstructured or semi-structured documents:
- Makrynioti, N., Grivas, A., Sardianos, C., Tsirakis, N., Varlamis, I., Vassalos, V., ... & Tsantilas, P. (2017). PaloPro: a platform for knowledge extraction from big social data and the news. International Journal of Big Data Intelligence, 4(1), 3-22.
- Holzinger, A., Kieseberg, P., Weippl, E., & Tjoa, A. M. (2018, August). Current advances, trends and challenges of machine learning and knowledge extraction: from machine learning to explainable AI. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction (pp. 1-8). Springer, Cham.
- Wu, H., Lei, Q., Zhang, X., & Luo, Z. (2020, May). Creating A Large-Scale Financial News Corpus for Relation Extraction. In 2020 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD) (pp. 259-263). IEEE.

Finally, in terms of entity linkage and resolution, I suggest reading the recent article on JedAI:
- Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., & Koubarakis, M. (2020). Domain- and structure-agnostic end-to-end entity resolution with JedAI. ACM SIGMOD Record, 48(4), 30-36.

It would be a nice idea to enhance Figure 1 with the solutions/algorithms/models that you used in each step.

Table 2 is huge. Better to give it in an appendix, share it online, or compress the text that is not selected for the summary. Use the space you will gain to explain how and why sentences are selected or not.

It is true that there is no straightforward comparison with similar works on the same dataset. However, it would be nice to have a comparison with other state-of-the-art methods on a benchmark dataset, such as those of the BioNLP shared tasks (e.g. http://2011.bionlp-st.org/home/protein-gene-coreference-task, http://2011.bionlp-st.org/home/entity-relations) or BioNLP-ST GENIA (http://bionlp-st.dbcls.jp/GE/2011/downloads/).

The subjects in Table 6 are quite lengthy. Basic post-processing (stopword removal, POS tagging to keep nouns only) and a lexicon-based (or WordNet- or Wikipedia-based) NER would help to focus on the actual subject text.
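A minimal sketch of the kind of post-processing suggested here, using spaCy for stopword removal and noun filtering; the model name and example subject phrase are illustrative.

```python
# Sketch of the suggested post-processing for lengthy subject phrases:
# drop stopwords and keep noun-like tokens to isolate the head entity.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def condense_subject(text: str) -> str:
    """Keep only non-stopword nouns/proper nouns from a subject phrase."""
    doc = nlp(text)
    kept = [t.text for t in doc
            if not t.is_stop and t.pos_ in ("NOUN", "PROPN")]
    return " ".join(kept)

print(condense_subject(
    "the novel coronavirus that was first identified in patients"))
# e.g. -> "coronavirus patients" with this model
```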

Author Response

Thank you for your review. Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Authors have addressed the issues from the previous review.

Please adjust the following points:

lines 314-315: "(e.g. definitions, hierarchies, concept-concept relations)" - these are not types of concepts, as suggested by "e.g.", but rather additional information related to the >4 million concepts.

line 322: "papameter"

Author Response

Thank you very much for your review!

The following points were addressed (answers in bold): 


lines 314-315: "(e.g. definitions, hierarchies, concept-concept relations)" - these are not types of concepts, as suggested by "e.g.", but rather additional information related to the >4 million concepts.
Rephrased (lines 305-307).

line 322: "papameter"
Typo fixed (line 313).


Reviewer 5 Report

Authors have addressed my comments and have improved the presentation of the manuscript.


Author Response

Thank you for your review!

Please let us know if there are any additional points that you would like us to address before you sign off on your review report.
