Systematic Construction of Knowledge Graphs for Research-Performing Organizations
Abstract
1. Introduction
2. Related Work
3. Research-Performing Organizations in Spain: The Hercules Project
The Hercules Ontology Network
- Project entities describe information related to any business or science activity that is carefully planned with milestones, work packages, risks, etc. Each project can be classified into different categories (private, national, European, etc.) depending on how it is funded, and it may also be part of another project.
- Person entities focus on representing information about researchers, based mainly on the foaf:Person class. The ontology extends this class to incorporate specific research information. For example, it includes data properties such as roh:scopusID, roh:orcid, and vivo:researchId and object properties such as roh:hasRole, roh:hasCV, and roh:hasKnowledgeArea.
- Organization entities include the general description of research-performing organizations (e.g., universities, research centers, etc.). Similarly to the person entity, an organization entity mainly extends foaf:Organization, including specific data and object properties for this kind of institution. In addition, it includes a subclass to represent data from organizations that are allowed to issue academic accreditations.
- Funding entities represent the information associated with the funding of a project or an organization. The entity presents a general class (roh:Funding) and a set of subclasses (e.g., roh:Grant, roh:Loan). Each funding is divided into several roh:FundingAmount instances. Furthermore, the entity defines three other classes that are related to each other: a roh:FundingProgram comes from a roh:FundingSource, which is supported by a vivo:FundingOrganization.
- Research Object entities aim to follow good open science practices, providing support to semantically represent all the results from projects (e.g., deliverables, reports, datasets), academic courses (e.g., Ph.D. or Master's theses), and other common research outputs (e.g., scientific papers, patents, etc.). It imports two main classes from the Information Artifact Ontology (OBO-IAO) [34] to represent in more detail any kind of research document, software repositories such as GitHub, Zenodo, or BitBucket, experimental protocols, or pieces of software.
- Activity entities focus on representing information about the actions or participations in events that researchers usually carry out during their career. In addition to the general class roh:Activity, the ontology includes a set of more specific classes such as bibo:Conference, vivo:Internship, and vivo:InvitedTalk.
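As a rough illustration of how the Person entity combines data properties and object properties, the following sketch describes a single researcher as plain (subject, predicate, object) tuples. Only the class and property names are taken from the description above; the `ex:` resources and all literal values are hypothetical placeholders, and prefix declarations are omitted.

```python
# Minimal sketch: one researcher described with the properties named above.
# The ex: identifiers and the literal values are hypothetical placeholders.

DATA_PROPERTIES = {"roh:scopusID", "roh:orcid", "vivo:researchId"}
OBJECT_PROPERTIES = {"roh:hasRole", "roh:hasCV", "roh:hasKnowledgeArea"}

triples = [
    ("ex:researcher1", "rdf:type", "foaf:Person"),
    ("ex:researcher1", "roh:orcid", '"0000-0000-0000-0000"'),
    ("ex:researcher1", "roh:scopusID", '"00000000000"'),
    ("ex:researcher1", "roh:hasRole", "ex:role1"),
    ("ex:researcher1", "roh:hasKnowledgeArea", "ex:area1"),
]


def split_by_kind(statements):
    """Separate literal-valued statements from resource-valued ones."""
    literals = [t for t in statements if t[1] in DATA_PROPERTIES]
    links = [t for t in statements if t[1] in OBJECT_PROPERTIES]
    return literals, links


def to_turtle_lines(statements):
    """Serialize each tuple as one Turtle-style statement (prefixes omitted)."""
    return [f"{s} {p} {o} ." for s, p, o in statements]


literals, links = split_by_kind(triples)
for line in to_turtle_lines(triples):
    print(line)
```

Data properties carry literal values such as identifiers, whereas object properties point to other resources (roles, CVs, knowledge areas) that are described by their own entities.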
4. Sustainable Workflow for Constructing KGs
4.1. Generation and Refinement of Mapping Templates
1. For each ontology class (e.g., vivo:Project), an empty mapping is generated with a common structure: one empty source; one subject map with a potential URI created following good practices on resource naming strategy (we define it in a rr:template property, using the ontology URI base followed by the name of the class and an empty reference); and one predicate object map that indicates the class and subclasses of the entity.
2. For each data property associated with a class (i.e., the domain of the property is the class), a tuple of predicate object maps is created. It is composed of a predicate, which is the actual value of the data property, and an empty reference in the object. In addition, if the range of the property is defined with a datatype, a third value is added with that datatype. For example, in Listing 1 the property roh:projectStatus has the class vivo:Project in its domain and the datatype xsd:integer in the range, hence the corresponding mapping tuple is created.
3. Similar to the previous step, for each object property and its domain class (if the property has several domain classes, the process is repeated for each of them), a reference predicate object map is created where the parent triples map is the triples map that defines the rules for the class in the range of the property. The join conditions remain empty in this step. In our example, the property roh:produces has the class vivo:Project as domain and roh:ResearchObject as range, so the corresponding rule is created.
Listing 1: Automatic YARRRML template.
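The three generation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of OWL2YARRRML: the ontology is given as a hand-written `OntologyClass` structure rather than parsed from OWL, the output is a plain dictionary whose keys mirror YARRRML shortcuts (`sources`, `s`, `po`, `mapping`, `condition`), `$()` is assumed as the empty-reference placeholder, and the base URI is hypothetical. Subclass typing from step 1 is left out for brevity.

```python
import json
from dataclasses import dataclass, field

EMPTY_REF = "$()"  # assumed placeholder for a reference to be filled in later


@dataclass
class OntologyClass:
    """Hypothetical, hand-written view of one ontology class."""
    name: str                                               # e.g. "vivo:Project"
    data_properties: dict = field(default_factory=dict)     # property -> datatype or None
    object_properties: dict = field(default_factory=dict)   # property -> range class


def local_name(prefixed_name):
    """Return 'Project' for 'vivo:Project' (name clashes across prefixes are ignored)."""
    return prefixed_name.split(":", 1)[1]


def build_template(classes, base_uri):
    """Create one empty mapping per class, following steps 1 to 3."""
    mappings = {}
    for cls in classes:
        name = local_name(cls.name)
        # Step 1: the rule that states the class of the entity.
        po = [["a", cls.name]]
        # Step 2: one rule per data property, with the datatype when it is known.
        for prop, datatype in cls.data_properties.items():
            rule = [prop, EMPTY_REF]
            if datatype is not None:
                rule.append(datatype)
            po.append(rule)
        # Step 3: one join rule per object property; join conditions stay empty.
        for prop, range_class in cls.object_properties.items():
            po.append({
                "p": prop,
                "o": {
                    "mapping": local_name(range_class),
                    "condition": {
                        "function": "equal",
                        "parameters": [["str1", EMPTY_REF], ["str2", EMPTY_REF]],
                    },
                },
            })
        mappings[name] = {
            "sources": [],                            # empty source, filled in later
            "s": f"{base_uri}{name}/{EMPTY_REF}",     # base URI + class name + empty reference
            "po": po,
        }
    return {"mappings": mappings}


project = OntologyClass(
    "vivo:Project",
    data_properties={"roh:projectStatus": "xsd:integer"},
    object_properties={"roh:produces": "roh:ResearchObject"},
)
research_object = OntologyClass("roh:ResearchObject")

template = build_template([project, research_object], "http://example.org/roh/")
print(json.dumps(template, indent=2))
```

Because every class yields a mapping with the same skeleton, the knowledge engineer only has to fill in sources, references, and join conditions in the next phase instead of writing each rule from scratch.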
4.2. Systematic Filling of Mapping Rules
Listing 2: YARRRML mapping filled.
Listing 3: SPARQL query for extracting basic information from a researcher.
4.3. Validating the Mappings with Experts
- Semantic validation of mapping relations. The domain experts validate that the relationships declared in the mappings between the concepts and the properties of the ontology are semantically equivalent to the references (tables and columns in this case) of the input sources.
- Validation of the SQL views. During the mapping process, the knowledge engineer creates a set of SQL views to transform and prepare the data in the RDB to generate the desired knowledge graph. Domain experts review and validate these SQL views to ensure their correctness and that the transformations are also semantically equivalent (for example, in the SKOS lists).
- Identification of missing references. The experts help the knowledge engineer identify and fill in missing references from the database to the ontology properties. This activity can follow a bottom-up approach, reviewing database references and finding their correspondence in the ontology, or the other way around. The knowledge engineer decides which approach to follow depending on the size and knowledge coverage of both resources.
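The two directions for identifying missing references can be supported by a simple coverage check. The sketch below is a minimal illustration, assuming the filled mapping has been reduced to a dictionary from ontology property to database column; the column names are hypothetical and do not come from the actual database.

```python
# Minimal sketch of a coverage check over a (hypothetical) filled mapping.
# Keys are ontology properties; values are the database columns they refer to,
# or None when the reference is still empty.

def missing_ontology_references(mapping):
    """Ontology-first direction: properties whose reference is still empty."""
    return sorted(prop for prop, column in mapping.items() if not column)


def unmapped_columns(db_columns, mapping):
    """Bottom-up direction: database columns that no mapping rule refers to."""
    used = {column for column in mapping.values() if column}
    return sorted(set(db_columns) - used)


mapping = {
    "roh:projectStatus": "PROJECT.STATUS",
    "roh:orcid": "PERSON.ORCID",
    "roh:scopusID": None,  # reference not yet identified
}
db_columns = ["PROJECT.STATUS", "PROJECT.START_DATE", "PERSON.ORCID"]

print(missing_ontology_references(mapping))
print(unmapped_columns(db_columns, mapping))
```

Both lists give the domain experts a concrete agenda for the review: the first names the ontology properties that still need a source, and the second names the database content that the ontology may not yet cover.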
4.4. Mapping Integration and KG Construction
5. Use Case: Universitas XXI
5.1. Feed the HERCULES Central Node
5.2. Publishing the Knowledge Graphs through REST APIs
5.3. Exploiting Integrated Knowledge within the Organizations
6. Lessons Learned
- Simple but useful support tools. The construction of a knowledge graph in complex domains requires tools that support the creation of the rules and the handling of complex data management tasks. Although there are solutions that aim to automatically create semantic annotations [46], they usually need a target KG (e.g., DBpedia or Wikidata) to create the actual instances from the input sources. However, most use cases have to fit the input data to a domain ontology, which demands manual work from a knowledge engineer to create the mapping rules. We notice that simple tools such as OWL2YARRRML [38], the use of the YARRRML [24] syntax instead of the common Turtle-based syntax for the rules, or the deployment of virtual SPARQL endpoints per resource facilitate the creation and management of the rules and also help ensure their quality and correctness.
- Domain experts with technical knowledge in the loop. One of the most relevant tasks during the construction of a knowledge graph is to ensure the correctness of the mapping (i.e., that a column/field/record from the input source means exactly the same as a property/class of the ontology). The developer of the mapping rules (aka the knowledge engineer) knows the ontology and its structure very well, but in complex environments having a complete overview of the input sources is complicated, as documentation is not always sufficient. Involving domain experts who go beyond domain knowledge and also have technical skills (for querying the database, understanding the tables and relations, etc.) is one of the key aspects of a successful mapping process. They help the knowledge engineer understand the meaning of tables and columns as well as the complex relationships and modeling decisions of the database.
- Divide and Conquer. Mapping has been understood as an engineering task, but we believe that it is actually one of the most relevant and complex steps for constructing high-quality domain knowledge graphs. For example, in our use case, we have a complex data integration problem where the ontology and the database have been developed completely independently, modeling a rich domain such as research in different ways. First, it is necessary to identify which of the two inputs covers the other, i.e., whether the ontology covers more knowledge than the database or vice versa, and then a divide and conquer process can be followed to ensure a systematic mapping task. In our case, the database is the resource that covers more knowledge than the ontology, so we decided to split the mapping process by each class of the ontology. In this manner, we follow a systematic mapping process, ensuring that all the classes and properties from the ontology will be mapped.
- Delegate complex tasks to the DBMS. During the knowledge graph construction process there are many cases where the input data needs to be transformed or modified to obtain the desired structure in the generated RDF. This is usually the case for the SKOS lists, where the URIs defined in the thesaurus do not have a 1-1 mapping to the actual values of the database. Hence, some data transformation functions have to be applied. There are approaches that allow the declarative description of functions within mapping rules [39,47], which are interesting when the input source is not loaded in a database (i.e., raw files, APIs, etc.). However, in cases where the input sources are supported by a DBMS, it is better to create views and apply the transformation functions using the capabilities of the database. On the one hand, by delegating their application to the DBMS, we ensure an efficient execution of the transformation functions while still maintaining them in a declarative form. On the other hand, the domain experts and database managers are able to understand, manage, change, and execute the created views without adding more complexity to the mapping documents.
- Sustainable procedures. Either the ontology or the input sources can undergo changes after the construction of the knowledge graph is finished. This has a direct impact on the mapping rules, as they need to be adapted to new versions of the involved artifacts. In complex environments where the mapping rules can be huge (e.g., in our case the mapping document is defined by more than 5000 rules following N-Triples syntax), changing a property, a reference to a column, or a class might be a difficult task. Defining common procedures and sustainable workflows to address these potential problems is also part of the construction of the knowledge graphs. The domain experts and database managers have to understand the mapping rules, both their syntax and their semantics, to be able to make these changes without the support of the knowledge engineer. Solutions such as YARRRML [24], which defines the mapping rules following the YAML syntax, or Mapeathor [48], which does the same using Excel sheets, are good examples of how to use the general technical know-how that developers, engineers, or database managers have for declaring the rules in a more human-friendly form.
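The lesson on delegating transformations to the DBMS can be illustrated with a small, self-contained sketch. It assumes an in-memory SQLite database accessed through Python's standard `sqlite3` module; the table names, column names, status codes, and concept URIs are all hypothetical and only stand in for a SKOS list whose URIs do not match the stored values one to one.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical schema with a lookup table and a view over it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE project (id INTEGER PRIMARY KEY, status_code TEXT);
        CREATE TABLE status_concept (status_code TEXT PRIMARY KEY, concept_uri TEXT);

        INSERT INTO project VALUES (1, 'ACT'), (2, 'FIN');
        INSERT INTO status_concept VALUES
            ('ACT', 'http://example.org/status/active'),
            ('FIN', 'http://example.org/status/finished');

        -- The view performs the code-to-URI transformation inside the DBMS,
        -- so the mapping rule only needs a plain reference to status_uri.
        CREATE VIEW project_status_view AS
            SELECT p.id AS project_id, c.concept_uri AS status_uri
            FROM project p
            JOIN status_concept c ON c.status_code = p.status_code;
    """)
    return conn


conn = build_demo_db()
rows = conn.execute(
    "SELECT project_id, status_uri FROM project_status_view ORDER BY project_id"
).fetchall()
for project_id, status_uri in rows:
    print(project_id, status_uri)
```

With this design the mapping document points at the view instead of the base table, and database managers can adjust the lookup table or the view definition without touching the mapping rules.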
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
KG | Knowledge Graph |
RDF | Resource Description Framework |
RML | RDF Mapping Language |
SHACL | Shapes Constraint Language |
SQL | Structured Query Language |
ROH | Hercules Ontology Network |
References
- Asserson, A.; Jeffery, K.G.; Lopatenko, A. CERIF: Past, Present and Future: An Overview; Technical Report; euroCRIS: Kassel, Germany, 2002.
- Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; Melo, G.d.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge Graphs. Synth. Lect. Data, Semant. Knowl. 2021, 12, 1–257.
- Belleau, F.; Nolin, M.A.; Tourigny, N.; Rigault, P.; Morissette, J. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform. 2008, 41, 706–716.
- Jaradeh, M.Y.; Oelen, A.; Farfar, K.E.; Prinz, M.; D’Souza, J.; Kismihók, G.; Stocker, M.; Auer, S. Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge. In Proceedings of the 10th International Conference on Knowledge Capture, Marina Del Rey, CA, USA, 19–21 November 2019; pp. 243–246.
- Scrocca, M.; Comerio, M.; Carenini, A.; Celino, I. Turning transport data to comply with EU standards while enabling a multimodal transport knowledge graph. In Proceedings of the International Semantic Web Conference, Athens, Greece, 2–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 411–429.
- Google Knowledge Graph. Available online: https://developers.google.com/knowledge-graph (accessed on 3 October 2022).
- Amazon Knowledge Graph. Available online: https://www.amazon.science/tag/knowledge-graphs (accessed on 3 October 2022).
- Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. DBpedia: A nucleus for a web of open data. In The Semantic Web; Springer: Berlin/Heidelberg, Germany, 2007; pp. 722–735.
- Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85.
- Spanish Association of Universities (CRUE). Available online: https://www.crue.org/ (accessed on 3 October 2022).
- Emaldi, M.; Puerta, M.; Buján, D.; López-de Ipiña, D.; Azcona, E.R.; Gayo, J.E.L.; Sota, E.; Maturana, R.A. ROH: Towards a highly usable and flexible knowledge model for the academic and research domains. Semantic Web, 2022; under review.
- Hercules Project—University of Murcia. Available online: https://www.um.es/en/web/hercules/inicio (accessed on 3 September 2022).
- Corson-Rikert, J.; Mitchell, S.; Lowe, B.; Rejack, N.; Ding, Y.; Guo, C. The VIVO ontology. Synthesis Lectures on Semantic Web: Theory and Technology; Morgan and Claypool Publishers: San Rafael, CA, USA, 2012; p. 3.
- Bibliographic Ontology (BIBO). Available online: https://bibliontology.com/ (accessed on 3 September 2022).
- Peroni, S.; Shotton, D. The SPAR ontologies. In Proceedings of the International Semantic Web Conference, Monterey, CA, USA, 8–12 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 119–136.
- Sure, Y.; Bloehdorn, S.; Haase, P.; Hartmann, J.; Oberle, D. The SWRC ontology–semantic web for research communities. In Proceedings of the Portuguese Conference on Artificial Intelligence, Covilhã, Portugal, 5–8 December 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 218–231.
- Jeffery, K.; Houssos, N.; Jörg, B.; Asserson, A. Research information management: The CERIF approach. Int. J. Metadata Semant. Ontol. 2014, 9, 5–14.
- Das, S.; Sundara, S.; Cyganiak, R. R2RML: RDB to RDF Mapping Language. W3C Recommendation, W3C. 2012. Available online: http://www.w3.org/TR/r2rml/ (accessed on 15 September 2022).
- Dimou, A.; Vander Sande, M.; Colpaert, P.; Verborgh, R.; Mannens, E.; Van de Walle, R. RML: A generic language for integrated RDF mappings of heterogeneous data. In Proceedings of the Ldow, Seoul, Korea, 8 April 2014.
- Kalaycı, E.G.; Grangel González, I.; Lösch, F.; Xiao, G.; Kharlamov, E.; Calvanese, D. Semantic integration of Bosch manufacturing data using virtual knowledge graphs. In Proceedings of the International Semantic Web Conference, Athens, Greece, 2–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 464–481.
- Calvanese, D.; Cogrel, B.; Komla-Ebri, S.; Kontchakov, R.; Lanti, D.; Rezk, M.; Rodriguez-Muro, M.; Xiao, G. Ontop: Answering SPARQL queries over relational databases. Semant. Web 2017, 8, 471–487.
- Xiao, G.; Lanti, D.; Kontchakov, R.; Komla-Ebri, S.; Güzel-Kalaycı, E.; Ding, L.; Corman, J.; Cogrel, B.; Calvanese, D.; Botoeva, E. The virtual knowledge graph system ontop. In Proceedings of the International Semantic Web Conference, Athens, Greece, 2–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 259–277.
- Rojas, J.A.; Aguado, M.; Vasilopoulou, P.; Velitchkov, I.; Assche, D.V.; Colpaert, P.; Verborgh, R. Leveraging Semantic Technologies for Digital Interoperability in the European Railway Domain. In Proceedings of the International Semantic Web Conference, Virtual, 24–28 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 648–664.
- Heyvaert, P.; De Meester, B.; Dimou, A.; Verborgh, R. Declarative rules for linked data generation at your fingertips! In Proceedings of the European Semantic Web Conference, Anissaras, Greece, 3–7 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 213–217.
- RMLMapper Implementation. Available online: https://github.com/RMLio/rmlmapper-java (accessed on 1 October 2022).
- Xiao, G.; Ding, L.; Cogrel, B.; Calvanese, D. Virtual knowledge graphs: An overview of systems and use cases. Data Intell. 2019, 1, 201–223.
- Chaves-Fraga, D.; Priyatna, F.; Santana-Pérez, I.; Corcho, O. Virtual statistics knowledge graph generation from CSV files. In Emerging Topics in Semantic Technologies; IOS Press: Washington, DC, USA, 2018; pp. 235–244.
- Arenas-Guerrero, J.; Chaves-Fraga, D.; Toledo, J.; Pérez, M.S.; Corcho, O. Morph-KGC: Scalable Knowledge Graph Materialization with Mapping Partitions. Semant. Web J. 2022.
- Iglesias, E.; Jozashoori, S.; Chaves-Fraga, D.; Collarana, D.; Vidal, M.E. SDM-RDFizer: An RML interpreter for the efficient creation of RDF knowledge graphs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 3039–3046.
- Heling, L.; Bensmann, F.; Zapilko, B.; Acosta, M.; Sure-Vetter, Y. Building knowledge graphs from survey data: A use case in the social sciences (extended version). In Proceedings of the European Semantic Web Conference, Portorož, Slovenia, 2–6 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 285–299.
- Liu, Z.; Shi, M.; Janowicz, K.; Regalia, B.; Delbecque, S.; Mai, G.; Zhu, R.; Hitzler, P. LD Connect: A Linked Data Portal for IOS Press Scientometrics. In Proceedings of the European Semantic Web Conference, Hersonissos, Greece, 29 May–2 June 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 323–337.
- Shen, Y.; Chen, Z.; Cheng, G.; Qu, Y. CKGG: A Chinese knowledge graph for high-school geography education and beyond. In Proceedings of the International Semantic Web Conference, Virtual, 24–28 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 429–445.
- Iglesias-Molina, A.; Chaves-Fraga, D.; Priyatna, F.; Corcho, O. Enhancing the Maintainability of the Bio2RDF Project Using Declarative Mappings. In Proceedings of the SWAT4HCLS, Edinburgh, UK, 9–12 December 2019; pp. 1–10.
- Information Artifact Ontology (OBO-IAO). Available online: https://obofoundry.org/ontology/iao.html (accessed on 3 September 2022).
- Garijo, D. WIDOCO: A wizard for documenting ontologies. In Proceedings of the International Semantic Web Conference, Vienna, Austria, 21–25 October 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 94–102.
- Hercules Ontology Network (ROH). Available online: http://w3id.org/roh/ (accessed on 13 October 2022).
- Börner, K.; Conlon, M.; Corson-Rikert, J.; Ding, Y. VIVO: A semantic approach to scholarly networking and discovery. Synth. Lect. Semant. Web Theory Technol. 2012, 7, 1–178.
- Chaves-Fraga, D. oeg-upm/owl2yarrrml. 2022. Available online: https://doi.org/10.5281/zenodo.5603173 (accessed on 13 June 2022).
- Meester, B.D.; Maroy, W.; Dimou, A.; Verborgh, R.; Mannens, E. Declarative Data Transformations for Linked Data Generation: The Case of DBpedia. In Proceedings of the European Semantic Web Conference, Portorož, Slovenia, 28 May–1 June 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 33–48.
- Chaves-Fraga, D.; Ruckhaus, E.; Priyatna, F.; Vidal, M.E.; Corcho, O. Enhancing virtual ontology based access over tabular data with Morph-CSV. Semant. Web 2021, 12, 869–902.
- Chaves, D.; LuisLopezPi; Doña, D.; Guerrero, J.A.; Corcho, O. oeg-upm/yarrrml-translator. 2022. Available online: https://doi.org/10.5281/zenodo.7024500 (accessed on 10 October 2022).
- Hercules Ontology Network Competency Questions. Available online: https://github.com/HerculesCRUE/ROH/tree/main/validation-questions/sparql-query (accessed on 3 September 2022).
- Espinoza-Arias, P.; Garijo, D.; Corcho, O. Crossing the chasm between ontology engineering and application development: A survey. J. Web Semant. 2021, 70, 100655.
- Meroño-Peñuela, A.; Lisena, P.; Martínez-Ortiz, C. Web Data APIs for Knowledge Graphs: Easing Access to Semantic Data for Application Developers. Synth. Lect. Data Semant. Knowl. 2021, 12, 1–118.
- Badenes-Olmedo, C.; Espinoza-Arias, P.; Corcho, O. R4R: Template-based REST API Framework for RDF Knowledge Graphs. In Proceedings of the ISWC (Demos/Industry), Virtual, 24–28 October 2021.
- Chaves-Fraga, D.; Dimou, A. Declarative Description of Knowledge Graphs Construction Automation: Status & Challenges. In Proceedings of the 3rd International Workshop on Knowledge Graph Construction, Crete, Greece, 30 May 2022.
- Jozashoori, S.; Chaves-Fraga, D.; Iglesias, E.; Vidal, M.E.; Corcho, O. Funmap: Efficient Execution of Functional Mappings for Knowledge Graph Creation. In Proceedings of the International Semantic Web Conference, Athens, Greece, 2–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 276–293.
- Iglesias-Molina, A.; Pozo-Gilo, L.; Dona, D.; Ruckhaus, E.; Chaves-Fraga, D.; Corcho, O. Mapeathor: Simplifying the specification of declarative rules for knowledge graph construction. In Proceedings of the ISWC (Demos/Industry), Virtual, 1–6 November 2020.
- Brunner, U.; Stockinger, K. Entity matching with transformer architectures-a step forward in data integration. In Proceedings of the International Conference on Extending Database Technology, Copenhagen, Denmark, 30 March–2 April 2020.
- Heling, L.; Acosta, M. Federated SPARQL Query Processing over Heterogeneous Linked Data Fragments. In Proceedings of the ACM Web Conference 2022, Virtual, 25–29 April 2022; pp. 1047–1057.
- Manghi, P.; Bardi, A.; Atzori, C.; Baglioni, M.; Manola, N.; Schirrwagen, J.; Principe, P.; Artini, M.; Becker, A.; De Bonis, M.; et al. The OpenAIRE research graph data model. Zenodo 2019.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chaves-Fraga, D.; Corcho, O.; Yedro, F.; Moreno, R.; Olías, J.; De La Azuela, A. Systematic Construction of Knowledge Graphs for Research-Performing Organizations. Information 2022, 13, 562. https://doi.org/10.3390/info13120562