Data Descriptor
Peer-Review Record

TKGQA Dataset: Using Question Answering to Guide and Validate the Evolution of Temporal Knowledge Graph

by Ryan Ong 1,*, Jiahao Sun 1,2, Ovidiu Șerban 1 and Yi-Ke Guo 1
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 29 October 2022 / Revised: 23 January 2023 / Accepted: 12 March 2023 / Published: 14 March 2023
(This article belongs to the Section Information Systems and Data Management)

Round 1

Reviewer 1 Report

The authors present a dataset that connects temporal knowledge graphs with documents and question-answer pairs.

The proposed dataset (TKGQA) consists of over 5000 documents and its validation looks technically sound.

Author Response

Thank you for taking the time to review and provide feedback on our paper!

Reviewer 2 Report

1. Add full names when an acronym appears for the first time.

For example, what does TKGQA stand for? (Temporal knowledge graph question answering?)

What does M&A stand for? NER? …

2. Grammar errors, e.g., L50 start -> started, L170 mention -> mentioned.

3. Increase the font size in the graphs; it is too small.

4. Table 5 shows that scores with O tags are always better than without O tags. Why is that? Some discussion is needed to explain this result.

5. Section 3.2 mentions that both annotators and manual labels are used for annotation. What is the difference between the two? Aren't they both human labelling? The same question applies to Table 5: what are players 1, 2, and 3? Are they the annotators, and what is the difference between the different players? What is the intention of comparing the 3 players in Table 5? And what is the intention of discussing the confusion matrix by comparing only manual labels vs. the players (Fig. 4, Fig. 5, Fig. 6), but not gold labels vs. players or other combinations from Table 5? Please add this information to the results/discussion section of the paper.

6. What is the significance of comparing the performance of the different players (Fig. 4, Fig. 5, Fig. 6)? How do the conclusions from Fig. 4, Fig. 5, and Fig. 6 link to the performance evaluation of the TKGQA dataset?

Author Response

Thank you for taking the time to review and provide feedback on our paper! Please see our responses to each of your comments below:

Add full names when an acronym appears for the first time. For example, what does TKGQA stand for? (Temporal knowledge graph question answering?) What does M&A stand for? NER? …
→ There is an abbreviations section towards the end of the paper that lists all the acronyms used. We have added the missing ones you pointed out.

Increase the font size in the graphs; it is too small.
→ We have increased the font size in all the figures.

Grammar errors, e.g., L50 start -> started, L170 mention -> mentioned.
→ We have gone through the paper again and made grammar improvements.

Table 5 shows that scores with O tags are always better than without O tags. Why is that? Some discussion is needed to explain this result.
→ The IAA scores are computed at the token level, and as such we have included scores computed both with and without O tags (non-entity tokens). Scores with O tags are always higher than scores without O tags because there are many O-tag tokens within a document, and any match between annotators on an O tag counts as agreement.
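For illustration, a minimal sketch of how such a token-level agreement score can be computed with and without O tags; the BIO label names and the choice of Cohen's kappa from scikit-learn below are illustrative assumptions for this example, not the exact implementation behind Table 5:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical token-level BIO annotations from two annotators over the same sentence
# (label names are illustrative only, not the dataset's actual tag set).
ann_a = ["B-ACQUIRER", "O", "O", "B-ACQUIRED", "I-ACQUIRED", "O", "O"]
ann_b = ["B-ACQUIRER", "O", "O", "B-ACQUIRED", "O", "O", "O"]

# Agreement including O tags: every token counts, so the many O/O matches inflate the score.
kappa_with_o = cohen_kappa_score(ann_a, ann_b)

# Agreement excluding O tags: keep only tokens where at least one annotator assigned an entity tag.
pairs = [(a, b) for a, b in zip(ann_a, ann_b) if a != "O" or b != "O"]
kappa_without_o = cohen_kappa_score([a for a, _ in pairs], [b for _, b in pairs])

print(f"with O tags:    {kappa_with_o:.3f}")
print(f"without O tags: {kappa_without_o:.3f}")
```

Because O/O matches dominate the token stream, the score computed with O tags comes out systematically higher, which is the pattern seen in Table 5.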

Section 3.2 mentions that both annotators and manual labels are used for annotation. What is the difference between the two? Aren't they both human labelling? The same question applies to Table 5: what are players 1, 2, and 3? Are they the annotators, and what is the difference between the different players? What is the intention of comparing the 3 players in Table 5? And what is the intention of discussing the confusion matrix by comparing only manual labels vs. the players (Fig. 4, Fig. 5, Fig. 6), but not gold labels vs. players or other combinations from Table 5? Please add this information to the results/discussion section of the paper.
→ The “manual labels” are the 1200 sentences that we (the authors) manually labelled, 400 sentences from each annotator's annotation set, so that we could compare our annotations against the annotators' annotations and further assess their quality. The same applies to Table 5. Essentially, we treat the authors’ labels as the “ground truth”. To avoid confusion, we have changed “manual labels” to “authors’ labels” in Section 3.2, Table 5, and everywhere else in the paper where appropriate.
→ There is no difference between the annotators (players). We simply need several people to annotate the dataset so that we can compute the agreement between the annotators and ensure the quality of the annotations is reliable.
→ The intention of comparing the 3 players is to assess the quality of the annotations. A high IAA score between annotators means there is a high level of agreement in their labelling, signalling that the annotations are reliable enough to be used for training our NER models.
→ We only compare the manual labels (now renamed to authors’ labels) against the players, and not any other combinations, because we treat the authors’ labels as the “ground truth”: we (the authors) labelled them ourselves and have a clear sense of their quality. By comparing the authors’ labels against the annotators' labels (via confusion matrices), we can see what kinds of “errors” (relative to the authors) each annotator is making. The gold labels are computed from the players’ annotations, so they would not allow us to spot these errors as clearly. Similar reasoning applies to the NER models (trained on the annotators' annotations), since they are only as good as the annotators' annotations.
→ We used players and annotators interchangeably, but given the confusion, we have changed “players” to “annotators”.
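For illustration, a minimal sketch of how such an authors-vs-annotator confusion matrix can be built; the relation label names below are hypothetical placeholders, not the dataset's actual label set:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical per-sentence relation labels: the authors' labels are treated as ground truth
# and compared against one annotator's labels (label names are placeholders).
authors = ["ACQUISITION", "MERGER", "NONE", "ACQUISITION", "MERGER"]
annotator_1 = ["ACQUISITION", "NONE", "NONE", "ACQUISITION", "MERGER"]

labels = ["ACQUISITION", "MERGER", "NONE"]
cm = confusion_matrix(authors, annotator_1, labels=labels)

# Rows correspond to the authors' labels and columns to the annotator's labels;
# off-diagonal cells show which labels the annotator confuses relative to the authors.
for label, row in zip(labels, cm):
    print(label, row)
```

One such matrix per annotator, compared against the authors’ labels, is what Figs. 4, 5, and 6 visualise.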

What is the significance of comparing the performance of the different players (Fig. 4, Fig. 5, Fig. 6)? How do the conclusions from Fig. 4, Fig. 5, and Fig. 6 link to the performance evaluation of the TKGQA dataset?
→ Sorry for the confusion, but Figs. 4, 5, and 6 compare the authors’ labels against each annotator's labels, not different annotators against each other. The three figures (together with the results in Table 5) show that although there is some confusion in labelling certain relations, it is relatively small, and as such the annotations are reliable for training an accurate NER model to extract good entities and relations for our TKGQA dataset.

Reviewer 3 Report

Dear Authors,

I find your paper very interesting. But my first concern is about the reference list: are there really only 15 relevant references for your research? Additionally, the quality of these references is not convincing; most of them are proceedings papers rather than journal papers, as would be appropriate for sound research.

I consider that the first part should be named Introduction or The Topic in the Literature (or something similar) and not Summary as it is now (the Abstract is actually a summary of your paper). This section should end by presenting the structure of the paper.

Additionally, it should be clearer in the Abstract and Introduction that the case study is on M&As. 

The article is very technical (good from this point of view), but it lacks a description of its utility. How is your research helpful for other researchers or for the general public? Please improve the manuscript by considering this important aspect.

You lack conclusions. 

 

 

Author Response

Thank you for taking the time to review and provide feedback on our paper! Please see our responses to each of your comments below:

1. Increase the size of the reference list and its quality (more journal papers).
→ We have expanded our reference list to include more papers on a) QA over temporal knowledge graphs and b) other KG tasks (besides the update task) for which our dataset helps facilitate further research.

2. Rename Summary to Introduction (and end this section by presenting the structure of the paper).
→ We have renamed the Summary section to Introduction and presented the structure of the paper at the end of the Introduction section.

3. Make it clear in the Abstract and Introduction that the case study is on M&As.
→ We have made it clearer in the Abstract and Introduction that our dataset is related to M&As.

4. Focus on how your research is helpful for other researchers or for the general public (as the paper lacks a utility description; please improve the manuscript by considering this important aspect).
→ We have Section 5, User Notes, where we discuss the primary and secondary use cases of our TKGQA dataset. The primary use case of the TKGQA dataset is the temporal knowledge graph binary classification update task; specifically, the use of question answering to guide and validate the evolution of a temporal knowledge graph. With the TKGQA dataset, we are better enabling an important research area, validating the evolution of temporal knowledge graphs, and making sure that knowledge graphs are accurately updated with the latest information so that applications such as recommendation systems, chatbots, semantic search, etc., can perform better and in a more timely manner.
→ Additionally, given the modularity of the dataset (4 components), the TKGQA dataset makes it easy for researchers to use it solely for research on the individual components. For example, to mimic real-world scenarios, the TKGQA dataset introduces new entities and relations in the validation and test sets that have not been seen during training. As such, reliable inductive temporal knowledge graph embedding models are needed to represent the zero-shot entities and relations; the traditional transductive models used in most link prediction research will not work here.

5. You lack conclusions.
→ A conclusion is not a required section, and hence we have not added one. However, we have edited Section 4, Technical Validation, and Section 5, User Notes, to better conclude a) how the methods we have used validate the quality of our dataset and b) how our dataset is useful for other researchers and what other KG tasks it can be used for.
