Automatic Construction of Educational Knowledge Graphs: A Word Embedding-Based Approach
Round 1
Reviewer 1 Report
Main Work and Contributions: The paper proposes a pipeline to automatically construct educational knowledge graphs (EduKGs) from textual learning materials using word and sentence embedding techniques. The paper explores the potential of leveraging state-of-the-art word and sentence embedding techniques for automatic EduKG construction, which is a novel application. It enhances SIFRank keyphrase extraction using SqueezeBERT, showing significant improvements in accuracy and efficiency over baseline methods. It proposes a concept weighting strategy based on SBERT sentence embeddings that considers semantic similarity between concepts and materials. The methods provide a simple yet powerful approach for automatic construction of accurate and efficient EduKGs.
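To make the reported enhancement concrete, the embedding-based ranking at the core of this approach can be sketched as follows. This is a minimal illustration, not the authors' exact SIFRank implementation: it assumes the Hugging Face transformers library, the public squeezebert/squeezebert-uncased checkpoint, simple mean pooling, and candidate phrases already extracted (e.g., by noun-phrase chunking).

```python
# Minimal sketch of embedding-based keyphrase ranking in the spirit of
# SIFRank enhanced with SqueezeBERT. Candidate phrases are assumed to be
# pre-extracted (e.g., via noun-phrase chunking); this is an illustration,
# not the authors' exact implementation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
model = AutoModel.from_pretrained("squeezebert/squeezebert-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into a single vector for `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def rank_keyphrases(document: str, candidates: list[str], top_k: int = 10):
    """Rank candidate phrases by cosine similarity to the document vector."""
    doc_vec = embed(document)
    scored = [(c, torch.cosine_similarity(embed(c), doc_vec).item())
              for c in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]
```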
While the paper makes useful contributions, there are some limitations and unclear points regarding the novelty of the proposed methods and techniques:
- The use of word embeddings for keyphrase extraction in Knowledge Graphs is not entirely novel: prior works (e.g., CAREA: Co-training Attribute and Relation Embeddings for Cross-Lingual Entity Alignment in Knowledge Graphs, and MAUIL: Multilevel attribute embedding for semisupervised user identity linkage) have used word embeddings for tasks such as keyword extraction and document summarization. Applying it specifically to educational concept extraction is an incremental innovation. Please add to the article how its embedding approach differs from the works mentioned above.
- The overall pipeline, comprising standard NLP tasks such as keyphrase extraction, entity linking, and graph construction, has been explored in prior literature on generating knowledge graphs. The application to the education domain is incremental. This paper should further emphasize or fully review the preliminary work on NLP technology in the field of education.
The techniques are applications of known NLP methods rather than novel inventions. However, the ensemble of techniques proposed and the results obtained are still an important advancement of research in this application area.
I also noticed a few minor problems that should not appear in the manuscript; the authors should examine them thoroughly.
- Incorrect use of articles in some places:
  - "a MOOC platform" should be "an MOOC platform" (Introduction, line 41)
  - "a heterogeneous information network" should be "an heterogeneous information network" (Section 3, line 152)
- Inconsistent capitalization:
  - "Knowledge Graph" is capitalized in some places but not others; this should be consistent. (Line 76, line 475)
The main contribution is in effectively combining advanced NLP techniques to address the key challenges of accuracy and efficiency in automatic EduKG construction. The proposed techniques and experimental results advance research in this direction. I recommend accepting this paper after a minor revision.
The quality of the English language needs improvement.
Author Response
Reviewer 1:
1. Comment: The use of word embeddings for keyphrase extraction in Knowledge Graphs is not entirely novel: prior works (e.g., CAREA: Co-training Attribute and Relation Embeddings for Cross-Lingual Entity Alignment in Knowledge Graphs, and MAUIL: Multilevel attribute embedding for semisupervised user identity linkage) have used word embeddings for tasks such as keyword extraction and document summarization. Applying it specifically to educational concept extraction is an incremental innovation. Please add to the article how its embedding approach differs from the works mentioned above.
- Response: The first paper, “CAREA: Co-training Attribute and Relation Embeddings for Cross-Lingual Entity Alignment in Knowledge Graphs”, focuses on the problem of cross-lingual entity alignment, which aims to automatically determine whether an entity pair in different KGs refers to the same real-world entity. The second paper, “MAUIL: Multilevel attribute embedding for semisupervised user identity linkage”, addresses the challenge of linking user identities across different social networks using a novel semisupervised model called MAUIL, which incorporates multilevel attribute embedding and RCCA-based linear projection. Neither of these papers is relevant to our research, since they address entirely different problems; moreover, neither of them performs keyphrase extraction or considers the use of word embeddings for it. Therefore, they do not align with our research. Nevertheless, we have further clarified the motivation for using word and sentence embedding techniques in our work, namely capturing semantic and contextual information (lines 288-293).
2. Comment: The overall pipeline, comprising standard NLP tasks such as keyphrase extraction, entity linking, and graph construction, has been explored in prior literature on generating knowledge graphs. The application to the education domain is incremental. This paper should further emphasize or fully review the preliminary work on NLP technology in the field of education.
- Response: NLP is widely used in the field of education (i.e., educational NLP) for multiple purposes, for example, automated grading and feedback, chatbots for student support, sentiment analysis, plagiarism detection, and many more. We believe that surveying all works on NLP in the field of education would be out of the scope of the current paper. In the related work section, we have already extensively reviewed the existing NLP literature related to the construction of educational knowledge graphs. For the sake of clarification, the related work (Section 2) has been divided into two parts. The first part (Section 2.1, lines 71-122) presents the use of knowledge graphs in education (educational knowledge graphs) and a comprehensive overview of the methods used to construct them. The second part (Section 2.2, lines 123-148) presents a fine-grained analysis of how prior works have automatically extracted the concepts that form the backbone of knowledge graphs, and of how our approach differs from them (lines 139-148).
3. Comment: Incorrect use of articles in some places: "a MOOC platform" should be "an MOOC platform", "a heterogeneous information network" should be "an heterogeneous information network"
- Response: Since the 'M' in 'MOOC' and the 'h' in 'heterogeneous' are consonants and are not pronounced with a vowel sound, the correct article is 'a' rather than 'an'.
4. Comment: Inconsistent capitalization: "Knowledge Graph" is capitalized in some places but not others, should be consistent.
- Response: Thank you for identifying the inconsistencies. We have read the paper again and fixed the identified problems as suggested.
Reviewer 2 Report
This paper focuses on the automatic construction of educational Knowledge Graphs. Specifically, the proposed approach aims to enhance concept extraction and weighting mechanisms by leveraging state-of-the-art word and sentence embedding techniques. To be more precise, this paper improves the SIFRank keyphrase extraction method by incorporating SqueezeBERT and introduces a concept weighting strategy based on SBERT. Furthermore, the authors have conducted extensive experiments on different datasets, demonstrating significant improvements over several state-of-the-art keyphrase extraction and concept weighting techniques. However, there are some minor issues that could be addressed:
Figure 3 appears somewhat vague, and it could benefit from additional details to make it clearer.
The technical details regarding concept weights are hard to follow and may require further clarification or simplification.
Including case studies and conducting a more comprehensive analysis could enhance the overall quality of this paper.
Missing references:
EasyKG: An End-to-End Knowledge Graph Construction System
Seq2KG: An End-to-End Neural Model for Domain Agnostic Knowledge Graph (not Text Graph) Construction from Text
KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction
End-to-End Construction of NLP Knowledge Graph
DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
Some small typos should be corrected.
Author Response
Reviewer 2:
1. Comment: Figure 3 appears somewhat vague, and it could benefit from additional details to make it clearer.
- Response: The figure presents our proposed pipeline for KG construction and is kept at an abstract level to make it understandable for everyone. Adding further details would make it complex; readers interested in more details can refer to the text of the paper (Section 4).
2. Comment: The technical details regarding concept weights are hard to follow and may require further clarification or simplification.
- Response: We have provided more clarification of the technical details regarding concept weights and highlighted the changes in the paper accordingly. Please refer to Section 5.2.3, lines 447-451, 459-465, and 466-475.
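For readers who want the intuition in code, the following is a minimal sketch of SBERT-based concept weighting: each concept is weighted by the semantic similarity between its embedding and the embedding of the learning material. The model name and the min-max scaling used here are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch of SBERT-based concept weighting: each concept is weighted by
# the semantic similarity between its embedding and the embedding of the
# learning material. The model name and min-max scaling are illustrative
# assumptions, not necessarily the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT variant

def weight_concepts(material_text: str, concepts: list[str]) -> dict[str, float]:
    doc_emb = model.encode(material_text, convert_to_tensor=True)
    con_embs = model.encode(concepts, convert_to_tensor=True)
    sims = util.cos_sim(con_embs, doc_emb).squeeze(-1)             # one score per concept
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-9)  # scale to [0, 1]
    return {c: s.item() for c, s in zip(concepts, sims)}

# Example: concepts closer in meaning to the material receive higher weights.
weights = weight_concepts(
    "An introduction to word embeddings for natural language processing.",
    ["word embedding", "natural language processing", "photosynthesis"],
)
```

The design choice this illustrates is that a concept's weight reflects how semantically central it is to the material as a whole, rather than its raw frequency.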
3. Comment: Including case studies and conducting a more comprehensive analysis could enhance the overall quality of this paper.
- Response: The proposed pipeline is applied in CourseMapper for automatic KG construction of a learning material, as discussed in Section 4.7. The focus of this paper is to present the proposed pipeline in detail, together with an extensive experimental offline evaluation of the keyphrase extraction and concept weighting steps on different benchmark datasets. In future work, we are planning to evaluate the accuracy of the generated KGs with CourseMapper users, as highlighted in Section 6, lines 492-495.
4. Comment: Missing References:
EasyKG: An End-to-End Knowledge Graph Construction System
Seq2KG: An End-to-End Neural Model for Domain Agnostic Knowledge Graph (not Text Graph) Construction from Text
KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction
End-to-End Construction of NLP Knowledge Graph
DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
- Response: All the mentioned missing references have been added to the paper, except one, i.e., “EasyKG: An End-to-End Knowledge Graph Construction System”. This paper presents a pipeline for KG construction containing components for knowledge modeling, knowledge extraction, knowledge reasoning, knowledge management, and so forth. However, for knowledge modeling, domain experts are required to define the KG schema, including entities, relations, and their attributes. This is manual annotation rather than automatic extraction. Since the corresponding steps in our pipeline (concept extraction and relationship building) are fully automatic, this paper does not align with our context.
5. Comment: Some small typos should be corrected.
- Response: We have carefully read the paper again and fixed the typos.
Reviewer 3 Report
This manuscript presents a method for automatically constructing an EduKG from PDFs of texts and other learning materials. The EduKG is then linked to associated slides, Wikipedia categories, and other external resources.
While the execution and overall design of the project, research, and manuscript seem fine, I find myself confused regarding the actual success of the project.
There are no presented results of how well the system works on material inside CourseMapper (I see instead several datasets which seem related, but are not exact proxies for the actual use case at hand). Moreover, I see that SqueezeBERT performs quite well compared to the state of the art, but it still achieves an F1 score below 0.5, which to my understanding is exceptionally low. How is its use justified in an educational context, where high performance is exceptionally important?
Author Response
Reviewer 3:
1. Comment: There are no presented results of how well the system works on material inside of CourseMapper (I see instead several datasets which seem related, but are not exact proxies for the actual use case at hand).
- Response: The proposed pipeline is applied in CourseMapper for automatic KG construction of a learning material, as discussed in Section 4.7. The focus of this paper is to present the proposed pipeline in detail, together with its experimental offline evaluation. In future work, we are planning to evaluate the accuracy of the generated KGs with CourseMapper users, as highlighted in Section 6, lines 492-495.
2. Comment: Moreover, I see that SqueezeBERT performs quite well compared to the state of the art, but it still achieves an F1 score below 0.5, which to my understanding is exceptionally low. How is its use justified in an educational context, where high performance is exceptionally important?
- Response: Yes, that’s correct. The performance of SqueezeBERT is below 0.5 F1 score, but this is the state of the art that can currently be achieved on the keyphrase extraction task. This can be verified by comparing it to the SIFRank paper, which presents results comparing different keyphrase extraction algorithms on the same datasets [1]. Furthermore, these results reflect only the keyphrase extraction step, not the accuracy of the whole KG. Acknowledging that keyphrase extraction using SqueezeBERT does not produce results with high accuracy, we have added the ‘concept expansion’ step (Section 4.4) to our pipeline. In this way, we can cover more concepts than the ones identified in the keyphrase extraction step. With this step in our pipeline, we are confident that the constructed KG expands to include additional concepts, thus covering the wide range of concepts needed in an educational context.
We appreciate your careful review of our paper and hope these revisions adequately address your concerns and suggestions. We believe these changes have improved our work's overall quality and clarity.
Round 2
Reviewer 2 Report
The revised version has addressed my concerns.
The writing can be improved.
Author Response
Thank you for your time to review the paper and for providing valuable feedback.
Reviewer 3 Report
Thank you for the responses to my comments. I remain unconvinced that the pipeline makes sense with such a low-performing piece. Why even include it in the first place? I do not think this is adequately explained in the manuscript. Simply adding an additional step to increase coverage seems to me like a stopgap, instead of actually leveraging the technology in a usable state, and could not be considered a significantly innovative piece. However, the other reviewers do not seem to share this concern.
I would like to see minor language added clarifying these design decisions.
I agree with Reviewer 2 that Figure 3 is too vague, and disagree with the author response that it is supposed to be approachable. It is so vague as to be meaningless. What is the direction of data flow, etc.? These details can be added minimally without confusing the reader.
Author Response
We have carefully considered your comments and suggestions and have made the following changes based on your recommendations:
1. Comment: Thank you for the responses to my comments. I remain unconvinced that the pipeline makes sense with such a low-performing piece. Why even include it in the first place? I do not think this is adequately explained in the manuscript. Simply adding an additional step to increase coverage seems to me like a stopgap, instead of actually leveraging the technology in a usable state, and could not be considered a significantly innovative piece. However, the other reviewers do not seem to share this concern.
I would like to see minor language added clarifying these design decisions.
- Response: We understand your concern, but as mentioned earlier, this is the state of the art for keyphrase extraction algorithms. We further tried to improve the state-of-the-art SIFRank by using SqueezeBERT (Section 5.1.3, line 388). The decision to use keyphrase extraction as the first step in our pipeline is motivated by the need to avoid sending learning materials with a large amount of text to an entity linking service, thus improving the efficiency of the next step in the pipeline, i.e., concept identification. These design decisions are further explained and highlighted in the paper (Section 2.1, lines 118-121, and Section 4.2).
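As an illustration of this efficiency argument, the sketch below sends only the short list of extracted keyphrases, never the full material text, onward for concept identification. The public MediaWiki search endpoint is used here as a hedged stand-in; it is not necessarily the entity-linking service actually used in the pipeline.

```python
# Illustrative sketch of the keyphrase-first design: only the short list of
# extracted keyphrases is sent to a concept-identification service, never the
# full text of the learning material. The public MediaWiki search endpoint is
# used here as a stand-in for the entity-linking service in the pipeline.
import requests

API = "https://en.wikipedia.org/w/api.php"

def link_keyphrase(phrase: str):
    """Return the best-matching Wikipedia article title, or None."""
    params = {"action": "query", "list": "search",
              "srsearch": phrase, "srlimit": 1, "format": "json"}
    hits = requests.get(API, params=params, timeout=10).json()["query"]["search"]
    return hits[0]["title"] if hits else None

keyphrases = ["word embedding", "knowledge graph"]   # output of the extraction step
concepts = {p: link_keyphrase(p) for p in keyphrases}
```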
Regarding the ‘concept expansion’ step, we do not consider it a stopgap; rather, it adds more coverage to the KG by further exploring the related concepts and categories in Wikipedia. For example, for the concept “Natural language processing”, the categories “Category:Computational linguistics” and “Category:Artificial intelligence” and the related concept “Natural language understanding” are added to the EduKG. In this way, users can explore these newly added concepts and understand them by directly clicking on the provided Wikipedia button. This has also been further clarified in the paper (Section 4.4, lines 258-263).
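A minimal sketch of this expansion step is given below, assuming the public MediaWiki API; the paper's actual expansion logic may differ.

```python
# Minimal sketch of the concept-expansion idea: fetch the (non-hidden)
# Wikipedia categories of an identified concept so they can be added to the
# EduKG as related nodes. Parameters follow the public MediaWiki API; the
# paper's actual expansion logic may differ.
import requests

API = "https://en.wikipedia.org/w/api.php"

def expand_concept(title: str) -> list[str]:
    """Return the non-hidden Wikipedia categories of an article."""
    params = {"action": "query", "prop": "categories", "clshow": "!hidden",
              "cllimit": 50, "titles": title, "format": "json"}
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return [c["title"] for c in page.get("categories", [])]

# expand_concept("Natural language processing") is expected to include
# "Category:Computational linguistics", as in the example above.
```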
2. Comment: I agree with Reviewer 2 that Figure 3 is too vague, and disagree with the author response that it is supposed to be approachable. It is so vague as to be meaningless. What is the direction of data flow, etc.? These details can be added minimally without confusing the reader.
- Response: We have added more details to the figure, including the direction of data flow (shown with blue arrows) as well as the techniques used in each step.
We appreciate again your careful review of our paper and hope these revisions adequately address your concerns and suggestions.
Sincerely,