Article
Peer-Review Record

Knowledge-Based Intelligent Text Simplification for Biological Relation Extraction

Informatics 2023, 10(4), 89; https://doi.org/10.3390/informatics10040089
by Jaskaran Gill 1,*, Madhu Chetty 1,*, Suryani Lim 1 and Jennifer Hallinan 1,2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 16 October 2023 / Revised: 27 November 2023 / Accepted: 5 December 2023 / Published: 11 December 2023

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript introduces a sentence simplification-based approach to biomedical relation extraction, specifically protein-protein interactions. While the results superficially appear to support the value of the method, the presentation lacks clarity on important points, and it is therefore difficult to fully accept these results in the current manuscript. I find the approach to text simplification to be somewhat ad hoc rather than fully grounded in linguistic structure, but this concern could be ameliorated with a better presentation of the methods. I will also note that the datasets used for these experiments are quite old by now, and it is not clear how the results would generalize to newer and larger biomedical relation extraction datasets. Further analysis of the results is needed to more deeply understand the key value of the proposed method.

Many statements are not appropriately referenced.


41: "manual RE of the relationships complemented by automated approaches still remains the gold standard" -- this HAS to be the gold standard for RE; we must always evaluate with respect to human performance. I don't think you mean "gold standard" here. This shouldn't be a criticism of NLP methods. In fact, your own evaluation results hinge on human annotations.

48: References needed for "the best-performing method in the BioCreative VI ChemProt challenge" (both for the challenge and for the best method -- was it this? https://academic.oup.com/database/article/doi/10.1093/database/bay073/5055578 ). What year was that (the above paper was published in 2018)? Did the work even use "general language models"? (cf. l.61) AFAIK that challenge was held in 2017, so the methods would have relied on less data-intensive modelling. Methods and the availability of pre-trained models have moved on substantially since then, as has performance on those datasets. Furthermore, it is not meaningful to refer to Recall in isolation from Precision and F-score; there is a reason the three metrics are reported together. NB: in Section 4.2 you refer to Sensitivity rather than Precision. Please stick with the standard terminology used in the NLP/relation extraction literature.

Note that the ChemProt data appears within the BLURB benchmark. Refer to https://academic.oup.com/bioinformatics/article/39/9/btad557/7264174 for results using pre-trained models on this dataset, reaching an F-score of 82.74. The results presented here are therefore misleading.

Why focus on ChemProt when you are going to run your experiments on PPI data? Given that, you need to review PPI work and prior results on these datasets.


57: Reference needed for LLL dataset (and later for HPRD50 and BioInfer). How does this relate to the dataset for the BioCreative VI ChemProt challenge? What is the relevance/scope of that data set?

The references included in the paragraph at the top of page 3 are not numbered, dated, or linked to the reference list. Correct this. As far as I can tell nearly all of the biomedical RE work that is cited here is nearly a decade old. This is not representative. The literature review is not current.

As the authors mention, related to text simplification is work that utilises the core dependency structure of the texts, working directly with the 'core' elements of sentences based on syntactic structure. This approach proved effective relatively early on. The work by Junagadh is mentioned, but it still attempts to simplify the texts. There is also (quite old) work showing that working directly with dependency graphs can be helpful, without the need for simplification: https://doi.org/10.1371/journal.pone.0060954 / https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-017-0168-3
This work and other related work (I leave the authors to do a more comprehensive search) weakens the novelty claims of this paper.

The criticism of this work as "lacking semantic and contextual understanding needed to handle ambiguities and complex relations" is not well-founded without an explanation of how text simplification would do a better job of addressing such issues. Where is your evidence for this? My reading of the literature is that the key reason those methods fall short of state-of-the-art performance is not a failure to handle ambiguity but rather more basic linguistic variability.

116: "Spacy’s pre-trained transformer model" -- Reference? Link? What do we know about this model? Is it tailored to biomedical text in any way? Has it been evaluated? How do you integrate biomedical NER with this pre-trained model, given that they can be multi-word entities? (I later understand that you substitute the entity name with an abstracted entity label before parsing. NB this step is missing in the depiction in Figure 6)

229: BERN2 is mentioned without introduction. No reference, no explanation of what it is. NEN not defined.

There appear to be two key aspects to the sentence simplification: (1) splitting of sentences on semicolons, and (2) retention of only the paths connecting two entities. The experiments do not present the contributions of these two aspects separately.
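To make the requested ablation concrete, a minimal sketch that keeps the two aspects separable (the GENE1/GENE2 placeholders follow the paper's abstraction convention; the function names and the decomposition itself are my suggestion):

```python
# Hedged sketch: the two simplification aspects, kept separable so their
# contributions can be measured independently.
def split_on_semicolons(sentence: str) -> list[str]:
    """Aspect 1: treat each semicolon-delimited clause as a candidate sentence."""
    return [part.strip() for part in sentence.split(";") if part.strip()]

def clauses_with_both_entities(clauses: list[str], e1: str = "GENE1",
                               e2: str = "GENE2") -> list[str]:
    """Precondition for aspect 2: keep only clauses where both abstracted
    entities co-occur; dependency-path retention would then prune within these."""
    return [c for c in clauses if e1 in c and e2 in c]

clauses = split_on_semicolons(
    "GENE1 is induced by heat shock; GENE2 represses GENE1 under stress."
)
print(clauses_with_both_entities(clauses))
# ['GENE2 represses GENE1 under stress.']
```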

Regarding the fitness evaluation (Section 3.3), it appears to be very tied to the surface order of the text rather than taking advantage of structural/dependency information. One would think that shortest path length would be more relevant here than this simple word position-based alignment.

The sentence that is used as the core example in Figure 8 is not presented in terms of its dependency paths. I don't fully understand why "about 1 h earlier than normal does affect Spo0A which when phosphorylated" would be removed from the sentence; presumably the dependency path connecting Gene1 and Gene2 here goes via "affect", and the "which" clause where Gene2 appears attaches to Spo0A, meaning the more relevant simplification should be (A) "Production of GENE1 affect Spo0A which when phosphorylated is activator of GENE2 transcription", not one that eliminates Spo0A. There also may be a misalignment between the text (line 301) and Figure 8 (Nodes list), which do not seem to be fully consistent. I believe the problem here is your text simplification procedure; the fitness evaluation is more about trying to compensate for errors introduced in that procedure.

But I also don't understand equation (1): what are "sequence 1" and "sequence 2"? Why does it make sense to consider the last word of sequence 1 and the first word of sequence 2? You are trying to capture the "sequential positioning distance between parent nodes of the two entities", but is this a difference between the original and simplified sentences, or just an absolute distance within the simplified sentence? In short, this is not well explained or illustrated. I am also not sure why the "parent node" position is the most relevant, rather than the entity positions.

I reviewed Algorithm A3 and it refers to a "continuous sequence containing GENEX", but I don't know what a "continuous sequence" is. It is obviously not the sentence itself, since that is a "continuous sequence" containing both entities, but it isn't clear what it is.
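As a sketch of the structure-aware alternative suggested above, the shortest dependency path between the two placeholders can be read straight off the parse; networkx and the en_core_web_trf pipeline are assumptions here, and the sentence is my paraphrase of the Figure 8 example.

```python
# Hedged sketch: shortest dependency path between two single-token entity
# placeholders, as a structure-aware alternative to word-position alignment.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_trf")

def shortest_dep_path(sentence: str, e1: str, e2: str) -> list[str]:
    doc = nlp(sentence)
    # Undirected graph over the dependency tree, keyed by token index.
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    src = next(tok.i for tok in doc if tok.text == e1)
    dst = next(tok.i for tok in doc if tok.text == e2)
    return [doc[i].text for i in nx.shortest_path(graph, src, dst)]

path = shortest_dep_path(
    "Production of GENE1 does affect Spo0A which is an activator of GENE2 transcription.",
    "GENE1", "GENE2",
)
print(path, "length:", len(path) - 1)
```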

Section 4.3 does not describe Experimental Set up but rather introduces Results.

However, the experimental setup needs to be better described. One important aspect is that the structure of the input to the various models needs to be clarified.
- Do you always provide the abstracted entity strings for each system variant (i.e. replacing "ydhD" with "GENE1")? [i.e. the "Refined NER Tagged Sentences"] -- or only for some?
- What happens when there are multiple genes in the sentence? E.g., for the example discussed around lines 299-301, "Spo0A" is also a gene; effectively there are 3 pairs for which decisions need to be made in that case, as in Figure 7 (see the sketch after this list). Would such a sentence be counted as 3 separate pairs that need to be evaluated in the context of a given input sentence, for each system variant?
- Is KITS applied during training as well as testing, or testing only?

This is not made clear and hence it is difficult to know whether the results are fully comparable.
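To make the multiple-gene question concrete, here is an illustrative sketch of the pair enumeration I would expect, with each non-focus gene abstracted to GENEX; this convention is my inference from the description of Figure 7, not something the manuscript spells out.

```python
# Hedged sketch: a sentence with three gene mentions yields three candidate
# pairs, each producing its own abstracted input variant.
from itertools import combinations

def candidate_pairs(sentence: str, genes: list[str]):
    for g1, g2 in combinations(genes, 2):
        variant = sentence
        for g in genes:
            label = "GENE1" if g == g1 else "GENE2" if g == g2 else "GENEX"
            variant = variant.replace(g, label)
        yield (g1, g2), variant

sent = "sigma(K) production affects Spo0A, an activator of ydhD transcription."
for pair, variant in candidate_pairs(sent, ["sigma(K)", "Spo0A", "ydhD"]):
    print(pair, "->", variant)
# 3 pairs: (sigma(K), Spo0A), (sigma(K), ydhD), (Spo0A, ydhD)
```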

Section 4.3 refers to results with "previous statistical and machine learning approaches". However, you have not introduced the previous results on these datasets specifically; why aren't they in Table 2? Perhaps these are "GK", "CK", "PIPE", and "WWSK"? However, these acronyms are not explained, no references are given, and the reader has no idea what they refer to.

Why was the "strict fitness threshold" of 5 selected? (I see the analysis in Appendix Table B1; however, it is not referenced where the threshold is mentioned in the main text.)

Also, please clarify how the results in Table B1 were derived: are they based on the same cross-validation procedure as the final results? If so, the hyper-parameter tuning of the threshold was done using information from the TEST set for this dataset, which will result in unfairly optimized results on the LLL dataset in the main results.


No error analysis is presented. As the Table 4 heading ("Number of sentences successfully simplified") implies, it is clear that the sentence simplification process can introduce errors. This should be explored.

It is useful that the number of sentences successfully simplified is presented in Table 4. However, the authors should also present the performance of the models on just this subset to emphasise the differences. One would expect *no* difference in the performance w/ and w/o KITS on the other sentences; is this the case?

Also, please clarify how many sentences actually changed amongst these -- you state "some of the remaining sentences were already simple". This would also be relevant to understanding the results -- stratifying performance according to the amount of change or the fitness would be a relevant analysis to do.

Are there any cross-sentential relations? The text simplification would presumably fail in these cases.

How do the results interact with the question above about more than 2 genes being mentioned in a given sentence? This is also extremely relevant. If all that you are doing is focusing attention on the genes, this is less interesting. I would like to see a co-occurrence baseline presented in this context.
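Such a baseline is trivial to compute; a minimal sketch, with a hypothetical data layout (all co-occurring pairs per sentence plus the annotated interacting subset):

```python
# Hedged sketch: co-occurrence baseline -- predict "interacts" for every gene
# pair that co-occurs in a sentence, then score with the standard P/R/F metrics.
def cooccurrence_baseline(sentences):
    """sentences: list of (pairs, gold_pairs) tuples per sentence."""
    tp = fp = fn = 0
    for pairs, gold in sentences:
        predicted, gold = set(pairs), set(gold)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one sentence, three co-occurring pairs, one true interaction.
print(cooccurrence_baseline([([("A", "B"), ("A", "C"), ("B", "C")], [("A", "C")])]))
# (0.333..., 1.0, 0.5)
```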

Comments on the Quality of English Language


Typos
47: no apostrophe needed here: "approaches’"
There are many sentences that have extra space before full stop (48, 53, 57, 62, ... please search)
214: Name -> Named
290: fine-tunned -> fine-tuned
294: evalutation -> evaluation
406: does reference 27 here actually refer to 27 in the bibliography or is this a mistake?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The topic of this NLP paper on text simplification, "Knowledge-Based Intelligent Text Simplification for Biological Relation Extraction", seemed interesting. Capturing the relational context among the various binary relations within sentences from biological publications, while using KITS to prevent any potential changes to the meaning of those sentences, makes sense. However, certain areas may need changes.

1. The manuscript needs background regarding the issues associated with syntactic simplification in the biological domain, where sentence formation, named entities, and relation descriptions are intrinsically complicated and make simplification difficult.

2. A relation extraction process is proposed that uses NER output sequences with labelled named entities to guide sentence simplification, where spaCy's pre-trained transformer model is applied to parse the input sequence into a grammatical dependency tree. However, which transformer architecture from spaCy's package is used is not mentioned, and the selection of this particular package is not justified.
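If en_core_web_trf was the package used (my assumption), the underlying architecture can be read directly from the pipeline configuration:

```python
# Hedged sketch: inspect which Hugging Face model backs the spaCy pipeline.
# For en_core_web_trf this reports a general-domain RoBERTa model, i.e. one
# not tuned to biomedical text.
import spacy

nlp = spacy.load("en_core_web_trf")
print(nlp.config["components"]["transformer"]["model"])
```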

3. There is no explanation of the criteria for determining the threshold value of the fitness measure used to evaluate change in sentence meaning.

4. How does this novel fitness measure compare to existing methods for text simplification? Can it be generalized to text from other domains, or to other languages? Moreover, Equation 1 needs more details.

5. How were the Decision Tree classifier and BioBERT selected for classification? Does it make sense to compare their performances? What kind of LLM-based classification approach is used?

 

Comments on the Quality of English Language

Presentation is appropriate.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The paper is dedicated to applying large language models to a specific task, text simplification. While the paper focuses on biological data, I hope it may be helpful for other areas that apply large language models to specific tasks.

Since I have some questions about formatting only, here are some issues I found in your paper.

1. Some figures are wider than the text margins, in particular Figures 4, 5, 6, 7, and 8.

2. A text section may not start on the last line of a page (lines 203 and 430).

3. Similar to remark 1, but related to the tables: Tables 1, 2, and 3 must not exceed the page margins, unless a table is specially formatted for the appendix and has specific formatting to fit the page.

4. Similar to remark 2: a table name or column headers must not start on the last line of a page (see line 406).

5. Similar to remarks 1 and 3: algorithms must not exceed the page margins. This applies to Algorithms A1, A2, and A3.

6. A figure should be referenced in the text before it appears (Figures 6 and 7).

7. A table should be referenced in the text before it appears (Table 4). Also note that Table 3 would be better moved to just after Table 2, because it is used too far from where it is first referenced. The general rule is to place a figure or table on the following page (if it is too big), but no later than the end of the paragraph that begins on the current page and continues onto the following page.

Also, I must inform you that 50% of your sources are more than 5 years old. While in general I would consider this an issue, because of the specificity of your work (studying biology and medicine) I do not consider it an issue here, since such studies are conducted over quite long periods of time and retain enough relevance over longer periods. However, please note that papers or sources that are not specific to biology and medicine, for instance source 34, should not be more than 5 years old.

In general, my opinion of the work is positive, although I strongly recommend fixing the formatting issues. These issues don't harm the quality of the paper or its usefulness, but fixing them will improve the experience of readers on small screens (such as smartphones). In my opinion, the paper is well illustrated, has enough tables and formulae, includes supplementary algorithms, and, most importantly, states its contributions, a short outline of contents, and the problem definition in the Introduction section. Apart from remarks 1-7, I see no obstacle to the paper being published; these issues can be easily fixed. As noted above, the older sources are very area-specific and still relevant to your studies, given the nature of biological research and biological data, so they do not fall into the "outdated" category; however, I must underscore that any sources not related to the medicine or biology field (for instance, Python libraries) should not be outdated.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I really appreciate the authors incorporating all the suggestions into the manuscript precisely. There are many conventional ML models used for classification tasks: Random Forest, SVM, Logistic Regression, and others. Their results vary depending on the problem or domain. Still, in the manuscript, selecting a Decision Tree classifier to compare with a very recent deep learning model such as BERT is not justified. Obviously, BERT's variants would produce better performance, although the size of the annotated datasets matters. This may be reconsidered.
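A sweep over the standard classical baselines is cheap to run; a self-contained sketch with placeholder data (the real feature matrix would come from the paper's pipeline):

```python
# Hedged sketch: compare several conventional classifiers under the same
# cross-validation; make_classification stands in for the actual features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```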

 

Comments on the Quality of English Language

May be checked for special characters.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
