1. Introduction
Scientific innovations drive progress in companies, industries, and the economy. Currently, scholarly publishing proceeds at the alarming rate of 2.5 million articles per year [1]. Thus, the traditional ranked document lists offered by scholarly search engines no longer support efficient research and development (R&D). While they pinpoint individual papers of interest from a mass of documents, they do not give researchers an overview of the field. Researchers drown in the deluge of publications as a consequence of the tediously long information assimilation cycle needed to manually scan for the salient aspects of research contributions buried in static text. Thus, enabling machine-actionability of scholarly knowledge is warranted now more than ever. In this vein, strategic reading of scholarly knowledge powered by Natural Language Processing (NLP) is being advocated for research, business, government, and non-governmental organization (NGO) stakeholders [
2]. Most current strategic reading relies on human recognition of scientific terms from text, perhaps assisted with string searching and mental calculation of ontological relationships, combined with burdensome tactics of bookmarking, note-taking, and window arrangement. To this end, recently, an increasing number of research efforts are geared toward putting in place next-generation Findable, Accessible, Interoperable, and Reusable (FAIR) [
3] scholarly knowledge representation models as Knowledge Graphs (KGs) [
4,
5]. They advocate advanced semantic machine-interpretability of publications via KGs to enable more intelligent automated processing (e.g., smart information access). This development started in advanced scholarly digital libraries (DL) such as the Open Research Knowledge Graph (ORKG,
https://orkg.org/, accessed on 14 January 2024) [
5], which crowdsources templated research contributions, resulting in tabulated surveys of comparable contributions (cf. Figure 1), thus demonstrating strategic reading in practice.
To represent scholarly publications as KGs, from an Information Extraction (IE) perspective, named entity recognition (NER) over scholarly publications becomes a vital task since entities are at the core of KGs. NER over scholarly documents is a long-standing task in the NLP community–the Computer Science domain alone has been addressed by a wide body of works with various knowledge-capture objectives [
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19]. However, this well-established research area [
20,
21,
22,
23], has thus far not seen any practical applications in the domain of agricultural scholarly publications.
In the domain of agriculture, the gradual sophistication of food production and agricultural methods has led to an increasing demand for data exchange, processing, and information retrieval. Thus the recording of knowledge as information islands via manual note-taking had to evolve into the recording of relational knowledge in databases via protocols. These protocols facilitated the standardized recording and exchange of knowledge between different databases via purposefully invented data dictionaries and coding systems that assigned simple alphanumeric codes to products, varieties, breeds, or crops. Examples include the ISOBUS [
24]/ISO11783 [
25] data dictionary or the European and Mediterranean Plant Protection Organization (EPPO) codes of crops used for plant protection applications [
26]. Today, however, we are faced with not only sophisticated agricultural practices but also voluminous masses of agricultural research findings published worldwide. Hence the call for the adoption of a next-generation semantic web publishing model [
27] of machine-actionable, structured scholarly contributions content via the ORKG platform. Within this model, a large-scale agricultural KG would be predicated on standardized templated subgraph patterns for recording interoperable structured scholarly contributions in agriculture. The custom-templated subgraphs ensure the standardized recording of
comparable research contributions in an overarching interoperable graph of highly varied underlying research domains. The research domains can include appraisals of agricultural products, e.g., A chemotaxonomic reappraisal of the Section Ciconium Pelargonium (Geraniaceae) [
28], or the restoration and management of plant systems, e.g., mangrove systems.
Table 1 lists 15 research sub-domains of contemporary research in agriculture. An information modeling objective ensures capturing contributions under a uniform set of salient properties within a single domain, while allowing for the definition of varied sets of salient properties across domains. This enables machine-assisted strategic reading within the semantic web publishing model, directly addressing researchers’ problem of ingesting massive volumes of findings via smart machine assistance, e.g., as structured contribution comparisons computed over the set of salient contribution properties in one domain, as depicted in
Figure 1.
The road to discovering contribution templates for research domains should be based on a set of generic entity types applicable across all domains that can be further specialized and instantiated as domain-specific, full-fledged templates. In other words, prior to obtaining research-domain-specific contribution template patterns, a standardized set of generic entity types must be put in place that can foster the further development of problem-specific contribution templates constituted by additional semantic properties. As such, the Agriculture Named Entity Recognition service of the ORKG (the ORKG Agri-NER service), addressed seminally in this work, proposes a set of seven generic entity types that encapsulate the contribution of a work extracted from paper titles. The seven
contribution-centric entity types are:
research problem,
resource,
process,
location,
method,
solution, and
technology. Building on this idea, this study makes two key novel contributions: (1) we propose, for the first time, an NER service specifically tailored to the agricultural domain; and (2) predicated on seven contribution-centric entities derived from paper titles and inspired by the top-level concepts of the AGROVOC ontology (
https://agrovoc.fao.org, accessed on 14 January 2024) of the Food and Agriculture Organization of the United Nations (FAO,
https://www.fao.org/home/en/, accessed on 14 January 2024), we lay the groundwork for the discovery of domain-specific contribution templates for the further specification of the generic entity types.
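To make the schema concrete, an annotated title can be sketched as a minimal Python record constrained to the seven entity types. The helper function, the span offsets, and the record layout below are illustrative assumptions for exposition, not the service’s actual data model.

```python
# Illustrative sketch (not the service's actual data model) of a paper title
# annotated with the seven contribution-centric Agri-NER entity types.

AGRI_NER_TYPES = {
    "research problem", "resource", "process", "location",
    "method", "solution", "technology",
}

def annotate(title, spans):
    """Attach typed (start, end, type) entity spans to a paper title,
    rejecting any type outside the Agri-NER schema."""
    entities = []
    for start, end, etype in spans:
        if etype not in AGRI_NER_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        entities.append({"type": etype, "text": title[start:end]})
    return {"title": title, "entities": entities}

# Example title from the text; the chosen span is illustrative.
title = "A chemotaxonomic reappraisal of the Section Ciconium Pelargonium (Geraniaceae)"
start = title.index("chemotaxonomic")
record = annotate(title, [(start, start + len("chemotaxonomic reappraisal"), "research problem")])
```

A record of this shape is the minimal unit from which domain-specific contribution templates could later be aggregated.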
The ORKG Agri-NER service is an IE system over seven entity types, such as research problem, resource, location of study, etc., which, being extracted from paper titles, implicitly encapsulate the contributions of scholarly articles. Conceptually, the shared understanding of paper titles is that they are succinct summarizations of the contribution of a work [
18]. Thus when looking to formulate a contribution-centric entity extraction objective, the first place to seek out this information is from paper titles. Specifically, ORKG Agri-NER provides a conceptual ecosphere of seven entity types to begin to generically structure and compare the contributions of scholarly articles in the domain of Agriculture as illustrated in
Figure 2. A striking feature of the proposed work is that it supports retrieving, exploring and comparing research findings based on explicitly named entities of the knowledge contained in agricultural scientific publications. If applied widely, ORKG Agri-NER can have a significant impact on scholarly communication in the agricultural domain. It specifically addresses researchers who want to compare their research with related works, get an overview of works in a certain field, or search for research contributions addressing a particular problem or having certain characteristics.
Figure 3 gives a high-level overview of the proposed semantic model by showing the seven core entity types in Agri-NER. The ORKG Agri-NER service then is the first step in a long-term research agenda to create a paradigm shift from document-based to structured knowledge-based scholarly communication for the agricultural domain. Other than the discovery of contribution-centric template patterns in the ORKG, the machine-readable description of research knowledge in the seven entity types could support other services for analyzing scientific literature in the agricultural domain such as forecasting agricultural research dynamics, identifying key insights, informing funding decisions, and confirming claims in news on contemporary agricultural research. To facilitate further research, we contribute two resources to the community: (1) The ORKG Agri-NER human-annotated gold-standard corpus which can be downloaded at
https://github.com/jd-coderepos/contributions-ner-agri (accessed on 14 January 2024) under the CC BY-SA 4.0 license; and (2) The ORKG Agri-NER tool whose source code can be accessed at
https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-agriculture-ner (accessed on 14 January 2024) under the MIT license, which is furthermore available as a service to the community in two ways: the Python package version of the service can be accessed at
https://orkg-nlp-pypi.readthedocs.io/en/latest/services/services.html (accessed on 14 January 2024); alternatively, it is possible to interact with the REST API for the Agri-NER service directly via the interactive documentation page at
https://orkg.org/nlp/api/docs#/annotation/annotates_agri_paper_annotation_agriner_post (accessed on 14 January 2024). The remainder of the paper explains both the creation of the dataset resource and tool in detail.
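For readers who prefer programmatic access, the REST interaction can be sketched from Python using only the standard library. The endpoint path and the `{"title": ...}` payload shape below are assumptions inferred from the interactive documentation URL above; consult that page for the authoritative request schema before use.

```python
import json
import urllib.request

# Assumed endpoint path, inferred from the docs URL; verify against the
# interactive documentation before relying on it.
AGRI_NER_ENDPOINT = "https://orkg.org/nlp/api/annotation/agriner"

def build_request(title: str) -> urllib.request.Request:
    """Build an (unsent) POST request annotating one paper title."""
    body = json.dumps({"title": title}).encode("utf-8")
    return urllib.request.Request(
        AGRI_NER_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Mangrove restoration and management in Borneo")
# To actually send it (network access required):
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       annotations = json.load(resp)
```

The request is deliberately built but not sent, so the sketch can be inspected offline.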
2. Background
“Semantic Web ⋯ does not require complex artificial intelligence to interpret human ideas, but ‘relies solely on the machine’s ability to solve well-defined problems by performing well-defined operations on well-defined data’ ”.
The FAIRification guidelines [
3] for scholarly knowledge publishing in effect advocate adopting semantic models for machine-actionable knowledge capture of certain aspects of article
content such that they are findable, accessible, interoperable, and reusable. Ontological or entity-centric conceptual schemas are an elegant demonstration of what “going FAIR” (
https://www.go-fair.org/fair-principles/, accessed on 14 January 2024) means in practice across the broad spectrum of the research landscape, as long as researchers are involved in the publication of their work. These schemas, by going beyond ‘data’ in the conventional sense and instead applying to the algorithms, tools, and workflows that lead to the data, which are traditionally captured in discourse text, bring the recording of these aspects of scholarly knowledge into the FAIR landscape. Thereby, transparency, reproducibility, and reusability of scholarly analytical pipelines are fostered. The research paradigms around the generation of FAIR data can be classified into two broad types: (1) ontological models that can directly produce FAIR-compliant data when instantiated; and (2) informal conceptual annotation models which are not characteristically FAIR-compliant but which work on data instances that support the discovery of ontologies in a bottom-up manner. These models equip experts with tools for semantifying their scholarly publications, ranging from strictly-ontologized methodologies [
30,
31] to less-strict, flexible conceptual description schemes [
7,
17], wherein the latter aim toward the bottom-up, data-driven discovery of an ontology.
The remainder of this section is organized per these two broad paradigms.
2.1. Ontological Structuring of Scholarly Publications
Early works can be traced to the Dublin Core Metadata Terms (DCTerms) [
32] ontology (
http://purl.org/dc/terms/, accessed on 14 January 2024). The original “Dublin Core” was the result of a March 1995 workshop in Dublin, Ohio, which sought to define a metadata record generic enough to describe a wide range of electronic objects [
33]. Subsequent ontologies specifically modeled scholarly articles but inherited DCTerms in an upper-level ontology space.
Some ontologies focused on modeling the scholarly document structure and rhetorics. In this vein, the Document Components Ontology (DoCO) [
34] is an ontology for describing both structural and rhetorical document components in RDF. For structural annotations, DoCO imports the Document Structural Patterns Ontology (
https://sparontologies.github.io/po/current/po.html, accessed on 14 January 2024) with classes such as Sentence, Paragraph, Footnote, Table, Figure, CaptionedBox, FigureBox, List, BibliographicReferenceList, etc. The patterns ontology formally defines patterns for segmenting a document into atomic components so that they can be manipulated independently and reflowed in different contexts. For the rhetorical annotations, DoCO imports the Discourse Elements Ontology (
https://sparontologies.github.io/deo/current/deo.html, accessed on 14 January 2024), which describes the major rhetorical elements of a document such as a journal article. Its classes include deo:Introduction, deo:Materials, deo:Methods, deo:Results, deo:RelatedWork, deo:FutureWork, etc. These rhetorical components give a defined rhetorical structure to the paper, assisting readers in identifying its important aspects. DEO reuses some of the rhetorical blocks from the SALT Rhetorical Ontology [
35] and extends them by introducing 24 additional classes. In the context of structural and rhetorical organization of scholarly articles, it was noted that the rhetoric organization of a paper does not necessarily correspond neatly to its structural components (sections, paragraphs, etc.). The Ontology of Rhetorical Blocks (orb) [
36] introduces rhetorical classes to semantify sections of scholarly publications, e.g., orb:Introduction, orb:Methods, orb:Results, and orb:Discussion to structure the Body of an article, inspired by the IMRAD structure [
37]. The hypothesis underlying this ontology is that the coarse rhetoric emerging from publications’ content has commonly shared semantics. Thus ORB provided a minimal set of rhetorical blocks that could be leveraged from the Header, Body, and Tail of scholarly publications. The Ontology of Scientific Experiments [
38], EXPO, advocated that the development of an ontology of experiments–which are testbeds for cause-effect relations–is a fundamental step in the formalization of science. Reported scientific findings, with their salient attributes buried in discourse, are made explicit by the formal semantic annotations EXPO supports, increasing their findability. EXPO constitutes the intermediate layer of a general ontology of scientific experiments, with ontological concepts such as experimental goals, experimental methods and actions, types of experiments, and rules for experimental design that are common between different scientific areas.
With the ontologies discussed, one observes that each ontology defines an information scope for formalization. The current level of formalization varies greatly in granularity and between the sciences. Ontology reuse [
39] addresses, in a sense, the extent to which generalization or specialization can occur depending on the level of the ontology model being applied; it is nonetheless a key enabler of what would otherwise seem the impossible goal of designing an ontology for Science. One of the first attempts to describe the whole publishing domain was the introduction of the Semantic Publishing and Referencing (SPAR) ontologies (
http://www.sparontologies.net/, accessed on 14 January 2024). SPAR is a suite of orthogonal and complementary OWL2 ontologies that enable all aspects of the publishing process to be described in machine-readable metadata statements, encoded using RDF. It includes FaBiO, CiTO proposed by [
40], BiRO, C4O proposed by [
41], among others. Another noteworthy example that followed best practices in ontology development by reusing related ontologies [
39] listed in the Linked Open Vocabularies (LOV) was Semsur, the Semantic Survey Ontology, proposed by [
30,
42]. It introduced a semantification model for survey articles as a core ontology for describing individual research problems, approaches, implementations, and evaluations in a structured, comparable way. It modeled metadata based on DCTerms, Semantic Web for Research Communities (SWRC) [
43] and Friend of a Friend (FOAF) (
http://xmlns.com/foaf/0.1/, accessed on 14 January 2024) ontologies. The inner structure of scientific articles was partially modeled by Discourse Elements Ontology (DEO) (
http://www.sparontologies.net/ontologies/deo, accessed on 14 January 2024) and Linked Science Core (LSC) [
44] to model publication workflows. Survey articles have been the traditional method for documenting overviews of research progress. However, with the document-based publishing model, many of the data points presenting research progress remained buried in discourse and, as a result, were forever statically encoded. Semsur aimed to offer machine-actionability to these key resources.
2.2. Entity-Centric Annotation Models of Scholarly Publications
The trend towards scientific terminology mining methods in NLP steered the release of phrase-based annotated datasets in various domains. An early dataset in this line of work was the ACL RD-TEC corpus [
8] which identified seven conceptual classes for terms in the full-text of scholarly publications in Computational Linguistics, viz.
Technology and Method;
Tool and Library;
Language Resource;
Language Resource Product;
Models;
Measures and Measurements; and
Other. Another dataset, focused on discovering research dynamics around scientific terminology in Computational Linguistics, was the FTD corpus [
7] annotated with
Focus,
Task and
Domain of application entity types. Similar to terminology mining is the task of scientific keyphrase extraction. Extracting keyphrases is an important task in publishing platforms as they help recommend articles to readers, highlight missing citations to authors, identify potential reviewers for submissions, and analyse research trends over time. Scientific keyphrases, in particular, of type
Processes,
Tasks and
Materials were the focus of the SemEval17 corpus annotations [
10]. The dataset comprised annotations of the full text articles in Computer Science, Material Sciences, and Physics. Following suit was the SciERC corpus [
12] of annotated abstracts from the Artificial Intelligence domain. It included annotations for six concepts, viz.
Task,
Method,
Metric,
Material,
Other-Scientific Term, and
Generic. Subsequently, based on this conceptual formalism, large-scale knowledge graphs such as AI-KG [
14] and CS-KG [
45] were generated. Recently, tackling the multidisciplinary discovery of entities, the STEM-ECR corpus [
15] was introduced notably including the Science, Technology, Engineering, and Medicine domains. It was annotated with four generic concept types, viz.
Process,
Method,
Material, and
Data that mapped across all domains, and further with terms grounded in the real-world via Wikipedia/Wiktionary links. Furthermore, along the lines of the motivation of Agri-NER is the CS-NER service [
18,
19] that addresses the extraction of seven contribution-centric entities applicable in the Computer Science research field, viz.
Research problem,
Resource,
Method,
Tool,
Dataset,
Language, and
Solution entity types from Computer Science paper titles and abstracts. These seven entity types were proposed to foster the discovery of research-domain-specific contribution templates in Computer Science.
The Leaderboard construct, a progress tracker adopted for recording results in the field of empirical Artificial Intelligence (AI) at large, is a case in point of templates developing from contribution-centric entities. This construct underlies the PapersWithCode
https://paperswithcode.com/ (accessed on 14 January 2024) framework, as well as the ORKG Benchmarks
https://orkg.org/benchmarks (accessed on 14 January 2024) feature. The construct defines the recording of results around four entity types viz.
Task,
Dataset,
Metric, and
Score from the full text of scholarly articles. The entities were then combined into the full-fledged semantic construct of a Leaderboard using either three or all four types for machine learning [
13,
17,
46,
47,
48].
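As a concrete illustration, the Leaderboard construct described above can be sketched as a small record type in Python; the field values in the example are invented for illustration and do not come from any of the cited systems.

```python
# Sketch of the Leaderboard construct: a result recorded as a
# (Task, Dataset, Metric, Score) tuple. Example values are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Leaderboard:
    task: str
    dataset: str
    metric: str
    score: Optional[float] = None  # some variants record only three of the four types

entry = Leaderboard(
    task="crop disease classification",  # illustrative task name
    dataset="PlantVillage",
    metric="accuracy",
    score=0.97,  # invented score, for illustration only
)
```

The optional `score` field mirrors the observation that Leaderboards are built from either three or all four entity types.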
The Agri-NER service is situated within this latter broad paradigm of obtaining structured comparable, FAIR descriptions of scholarly contributions for the agricultural domain with the aim of bottom-up discovery of template patterns. However, it also relies on the first paradigm of scholarly knowledge structuring by mapping the automatically extracted terms to the AGROVOC ontology [
49], which offers a controlled vocabulary designed to provide unambiguous semantic descriptions for terminology under the FAO’s areas of interest. The following desiderata guided the creation of Agri-NER: (1) manual curation of agriculture named entities from 5500 article titles that reflect the contribution of a work, enabling machine learning model training and development; (2) associating terms with the AGROVOC ontology, allowing for conceptual enrichment of the terms; (3) allowing for ongoing, collaborative expert curation of named entities, both term-wise and of their typing; and (4) juxtaposing a contribution-centric information extraction objective with term standardization in ontologies–why does simple term normalization against authoritative ontologies not serve the objective of obtaining contribution-centric models? The rest of the paper discusses how these requirements were accomplished.
In essence, our work’s focus on FAIR principles, advanced NLP techniques, and the integration of machine-actionable knowledge capture aligns well with the core tenets of Industry 5.0, specifically in terms of its influence on agriculture [
50,
51]. Industry 5.0 emphasizes personalized and sustainable solutions, blending human-centric approaches with advanced technological innovations [
52]. The focus of the ORKG Agri-NER service, proposed in this work, on creating interoperable, reusable, and machine-interpretable models of scholarly contributions in agriculture fits into this paradigm by enabling more nuanced, efficient, and collaborative research practices. This approach can lead to more tailored agricultural practices and innovations, reflecting the personalized and sustainable ethos of Industry 5.0.
5. Discussion
“The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web—a web of data that can be processed directly or indirectly by machines”.
The Web flourished based on the hypertext linked information principle. Hypertext linking of information on the Web as a global information space revolutionized information access by enabling users to traverse, search, share, and browse information with the all-pervasive technology of web browsers. With the formalization of the Semantic Web [
29], these same principles that applied to information represented as document descriptions are being applied to data. This has fostered the evolution of the Web from a global information space of only linked documents to one where both documents and data are linked. A prerequisite to realizing the Semantic Web is establishing what is called the Linked Open Data Cloud (LOD Cloud), which is constituted by Linked Data. In other words, the LOD Cloud is a KG that manifests as a Semantic Web of Linked Data via a small set of standardized technologies: URIs and HTTP as the identification and access mechanisms for data resources on the web, and RDF as the content representation format. Thus Linked Data realizes the vision of evolving the Web into a global data commons, the Semantic Web, allowing applications to operate on top of an unbounded set of data sources via standardised access mechanisms [
78]. The LOD Cloud
https://lod-cloud.net/ (accessed on 14 January 2024) constitutes the central hub that allows users to start browsing in one open-access submitted data source and then navigate along links into related data sources. This global data space connects data from diverse domains such as geography, government, life sciences, linguistics, media, scholarly publications, social networks, etc. Without the Linked Data creation tools and technologies, earlier data creation processes always resulted in data silos worldwide with no means of interaction or interoperability. Now, however, leveraging the small standardized set of technologies of the Linked Data creation paradigm, any data source can be submitted to the LOD Cloud, fostering the building of the Semantic Web. In light of these technological inventions, the FAIR guiding principles [
3] for scientific data creation can indeed be a practice.
The next natural question is, is the ORKG Agri-NER corpus released in the LOD Cloud? The response is
not yet. However, in this concluding section of the paper, we set the stage for realizing the vision of releasing the ORKG Agri-NER corpus within the LOD Cloud, to be taken up in future work. The research paradigms underlying the NLP production of data and the Semantic Web production of data over a new domain each involve several methodological and technological considerations, which merits dedicated discussion of the respective research processes and outcomes. The NLP data production lifecycle focuses on instantiated data annotation and all the steps that precede it, including selecting a task and defining a conceptual annotation space for the task. The Semantic Web data production lifecycle, in contrast, focuses on data representation in a strict machine-readable semantic representation language such as RDF or OWL to facilitate axiomatic machine reasoning. In other words, it is a natural product of the following ingredients: (1) open standards, such as URI, URL, HTTP, HTML, RDF, RDF-Turtle (and other RDF notations), the SPARQL Query Language, the SPARQL Protocol, and SPARQL Query Solution Document Types; and (2) a modern DBMS platform, such as Virtuoso from OpenLink Software or Neo4J (
https://neo4j.com/, accessed on 14 January 2024) as a graph database management system.
This work has described the NLP NER research paradigm over the novel agricultural domain. As such it entailed presenting the selected
contribution-centric NER task for the agricultural domain, defining the selected entity types for annotation, and annotating a corpus of 5500 paper titles as instantiated data for Agri-NER. In follow-up work, the aim is to address the Semantic Web research paradigm such that scholarly contribution resources in the agricultural domain will be made into FAIR and reusable Linked Data. Linked Data refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can in turn be linked to from external data sets [
78]. Machine-readability will utilize URIs and HTTP as identification and access mechanisms and RDF content representation. Meaning definition will be handled via a schema model. Links to external datasets will be handled as linking to the AGROVOC ontology [
49] as it is the only other semantic representation model for the agricultural domain. As already alluded to, Agri-NER and AGROVOC prescribe different conceptual spaces for how the entities are expected to be processed by machines. Specifically, AGROVOC enables the processing of entities within a terminologically defined semantic space: it provides concepts resolved to URIs and supplemented with RDF descriptions for thousands of terms in the FAO’s areas of interest. ORKG Agri-NER, in contrast, permits the processing of entities w.r.t. their functional role in reflecting the contribution of a scholarly work. By aiming to link the entities in our ORKG Agri-NER corpus to AGROVOC, we enable users to fetch an enriched representation of the terms, answering questions such as: What is its terminological definition? What are the alternative term namings across languages? Which other data linkings can be facilitated via the Linked Data source in consideration? For instance, “Borneo”, a
location entity from Agri-NER, is first resolved to the AGROVOC concept for Borneo at
https://agrovoc.fao.org/browse/agrovoc/en/page/c_1017 (accessed on 14 January 2024). This Linked Data enriches the term with its definition, alternate names of Borneo in various languages, etc. Furthermore, the AGROVOC Linked Data connects to the DBpedia Linked Data source [
79]. Thus via AGROVOC the concept Borneo is enriched via a DBpedia knowledge source link
https://dbpedia.org/page/Borneo (accessed on 14 January 2024), which offers additional information such as its total geographical area, geo-coordinates, total population size, etc. In this way, by adopting data linking, the Linked Data principles will foster scaling the development approach of Agri-NER beyond a fixed, predefined data silo capturing
contribution-centric entities, to encompass a larger number of relevant structured knowledge sources on the LOD cloud comprising heterogeneous data models that each constitute unique semantic spaces for the machine-actionability of terms.
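The Borneo linking chain above can be sketched as a short list of subject-predicate-object triples, using plain Python tuples in place of an RDF library. The Agri-NER entity URI and class URI below are hypothetical placeholders; the AGROVOC concept id c_1017 and the DBpedia resource are those cited in the text.

```python
# Schematic Linked Data sketch: Python tuples standing in for RDF triples.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

entity = "https://orkg.org/resource/agri-ner/borneo"   # hypothetical entity URI
location_class = "https://orkg.org/class/Location"     # hypothetical class URI
agrovoc_borneo = "https://agrovoc.fao.org/browse/agrovoc/en/page/c_1017"
dbpedia_borneo = "https://dbpedia.org/page/Borneo"

triples = [
    (entity, RDF_TYPE, location_class),                  # typed per the Agri-NER schema
    (entity, SKOS_EXACT_MATCH, agrovoc_borneo),          # Agri-NER -> AGROVOC link
    (agrovoc_borneo, SKOS_EXACT_MATCH, dbpedia_borneo),  # AGROVOC -> DBpedia link
]
```

In an actual release, such triples would be serialized in RDF (e.g., Turtle) with the authoritative AGROVOC concept URIs rather than the browse-page URLs shown here.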
Toward FAIR, Reusable Scholarly Contributions in Agriculture, the schema and URI space for machine readability and semantic representation will be implemented via global property and resource identifiers within the ORKG web ecosphere at
https://orkg.org/ (accessed on 14 January 2024). For obtaining Linked Data, AGROVOC will be utilized. In this section, we offer concrete implementation details that contrast the ORKG Agri-NER and AGROVOC models as potential related Linked Data sources. The preliminary findings discussed below are obtained w.r.t. the following research question.
RQ6: How many ORKG Agri-NER entities can be mapped to AGROVOC? To answer the question, a programmatic process flow depicted in
Figure 5 was established. The process was fairly straightforward. Given the terms annotated in the Agri-NER model, the AGROVOC concept nodes are queried with the terms. For those terms found as a whole, the corresponding AGROVOC concept URI is the desired retrieval unit. Terms not found as a whole were iteratively split into their longest spanning subphrases, with subphrase lengths ranging from the original phrase length − 1 down to 1. The link retrieval step was stopped as soon as one or more of the subphrases at a given subphrase length could be resolved to one or more AGROVOC concepts. As a result, the statistical insights shown in
Table 7 were obtained. This will form the basis of Linked Data creation in future work toward realizing FAIR, Reusable Scholarly Contributions in Agriculture. Of all the entities annotated in Agri-NER, 16% are found as whole-phrase AGROVOC concepts, and 53.75% are found as subphrase AGROVOC concepts. Per Agri-NER entity type, the most linkable types involved the least amount of subjectivity in phrasal boundary determination. One way of gauging the subjectivity of boundary determination decisions for the Agri-NER entity types, from least to most, is the proportion of each type’s terms that could be directly resolved to AGROVOC. From the least to the most subjective, they were:
location,
technology,
process,
method,
research problem,
resource, and
solution. The corpus used in the analysis is publicly released at
https://github.com/jd-coderepos/contributions-ner-agri/tree/main/AGROVOC-linked-data-analysis (accessed on 14 January 2024).
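The subphrase fallback described above can be sketched in Python. The `lookup` dict below stands in for querying the AGROVOC concept nodes; the concept id c_1017 (Borneo) is from the text, while the other identifier is illustrative.

```python
def link_to_vocabulary(term, lookup):
    """Resolve an entity term against a concept lookup (phrase -> URI),
    falling back from the whole phrase to progressively shorter contiguous
    subphrases. Returns all matches at the first (longest) subphrase length
    that resolves, mirroring the stopping criterion described above."""
    tokens = term.lower().split()
    for length in range(len(tokens), 0, -1):      # whole phrase first, then n-1, ..., 1
        matches = []
        for start in range(len(tokens) - length + 1):
            subphrase = " ".join(tokens[start:start + length])
            if subphrase in lookup:
                matches.append((subphrase, lookup[subphrase]))
        if matches:                                # stop at the longest resolvable length
            return matches
    return []

# Toy stand-in for the AGROVOC concept nodes.
vocab = {"borneo": "agrovoc:c_1017", "mangroves": "agrovoc:mangroves"}

link_to_vocabulary("Borneo", vocab)                 # whole phrase resolves directly
link_to_vocabulary("mangroves restoration", vocab)  # only the subphrase resolves
```

The same two-tier outcome drives the whole-phrase (16%) versus subphrase (53.75%) linking statistics reported above.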
Future Directions
As we advance in the field of Agriculture NER, the integration and utilization of large language models (LLMs) present a promising avenue for future research and development [
80]. These models, known for their deep learning capabilities and extensive training on diverse datasets, offer significant potential for enhancing the accuracy and scope of entity recognition in agricultural texts. The application of LLMs could revolutionize the way we extract, process, and interpret complex scientific entities, leading to more nuanced and contextually aware recognition systems. In the context of furthering Agri-NER research, a key direction for future work is the customization of LLMs to better understand and interpret the unique terminologies and concepts specific to agriculture. This involves training models on domain-specific datasets such as ours, including scholarly articles and technical documents in the agricultural sector. Such specialized training would enable LLMs to accurately identify and classify a wide range of agricultural entities, thereby enhancing the overall quality and reliability of knowledge extraction in this field.