Construction of a Multi-Source, Heterogeneous Rice Disease and Pest Knowledge Graph Based on the MARBC Model

Li, Chunchun; Yang, Siyi; Liang, Dong; Chen, Peng; Dong, Wei

doi:10.3390/agronomy15030566


by Chunchun Li ¹,²,*, Siyi Yang ¹, Dong Liang ¹,², Peng Chen ¹,² and Wei Dong ³

¹ School of Internet, Anhui University, Hefei 230039, China

² National Engineering Research Center for Agro-Ecological Big Data Analysis and Application, Anhui University, Hefei 230601, China

³ Agricultural Economy and Information Research Institute, Anhui Academy of Agricultural Sciences, Hefei 230031, China

* Author to whom correspondence should be addressed.

Agronomy 2025, 15(3), 566; https://doi.org/10.3390/agronomy15030566

Submission received: 5 January 2025 / Revised: 17 February 2025 / Accepted: 21 February 2025 / Published: 25 February 2025

(This article belongs to the Section Pest and Disease Management)


Abstract

Diseases and pests have a significant impact on rice production, affecting both yield and quality. Therefore, their effective management and control are crucial for successful rice cultivation. However, current research based on rice diseases and pests (RDPs) encounters challenges such as data scarcity, the integration of multi-source heterogeneous data and usability issues related to knowledge graphs. To tackle these issues, this paper proposes a novel entity and relationship extraction model called Multi-head Attention RoBERTa BiLSTM CRF (MARBC). Specifically, the MARBC model utilizes RoBERTa to obtain related word vector representations, and then employs BiLSTM to extract features from within the input sequences. By integrating a multi-head attention mechanism, the model retrieves contextual information and relevance from the text, enhancing the accuracy and depth of the knowledge graph. Additionally, Conditional Random Fields are used to model sequence labeling for entities and relationships. Experimental results demonstrate the model’s impressive performance, achieving precision, recall, and F1 scores of 95.31%, 93.58%, and 94.44%, respectively. Furthermore, this paper constructs a dedicated knowledge graph for RDPs from both ontology and data layers. By effectively integrating and organizing multi-source heterogeneous RDP data, this paper provides valuable resources and decision support for agricultural researchers and farmers.

Keywords: RDP; MARBC; multi-source heterogeneous; knowledge graph; ontology

1. Introduction

Rice is acknowledged as one of the most crucial strategic commodities globally; it is not only intricately connected to global food security, but also closely associated with economic growth, employment, social stability, and regional peace [1]. However, the occurrence of rice diseases and pests (RDPs) poses a significant threat to both rice yield and quality [2]. Obtaining RDP information in a timely and accurate manner is therefore essential for effective management and control. At present, however, the channels for acquiring knowledge on RDPs are numerous and complex, which leaves the information highly heterogeneous, diverse, and fragmented, and exposes the lack of adequate methods for managing it [3].

In recent years, the rapid development of information technology has driven a research trend that combines big data technology to assist in decision-making for rice pest and disease prevention and control [4,5,6,7]. Within this trend, knowledge graph technology, which focuses on the effective organization and correlation analysis of knowledge, has become an important research direction [8,9,10]. The construction of a knowledge graph enables the extraction of knowledge from scattered, unrelated, and multi-source data, organizing it into structured “entity–relationship–entity” triplets and corresponding “attribute–value” pairs [11]. This approach provides a critical mechanism for efficiently retrieving and accessing expert-level knowledge within the system [12]. Knowledge extraction is a fundamental element in building knowledge graphs, involving tasks such as named entity recognition (NER) [13] and relation extraction (RE) [14]. Currently, many scholars are exploring the prospects of knowledge graphs and actively applying them across domains such as healthcare [15], educational smart assistants [16], finance [17], and agriculture [18], aiming to provide users with more efficient and convenient information retrieval channels.

Qiao et al. constructed a hybrid model-based knowledge graph for food nutrition ingredients, enabling intuitive and detailed understanding of nutritional information through structured data triple conversion, unstructured data extraction, knowledge fusion, tree clustering, and Neo4j-based visualization [19]. Li et al. proposed an automated framework for knowledge extraction, visualization of knowledge graph construction, and graph fusion in the field of electronic information, aimed at improving students’ learning efficiency and exploring new educational paradigms supported by artificial intelligence [20]. Chen et al. proposed a new method for predicting Chinese financial event knowledge graph links, based on graph attention networks and convolutional neural networks, to improve the integrity of the Chinese financial knowledge graph [21]. Zhao et al. conducted a comprehensive review and detailed comparison of work related to the cybersecurity knowledge graph (CKG), discussed the challenges of applying the CKG, and proposed future research opportunities for the CKG [22]. Wu et al. explored the construction of a construction safety knowledge representation model and safety accident graph through deep learning methods, extracting construction safety knowledge entities through the BERT-BiLSTM-CRF model [23]. Yang et al. presented a knowledge graph construction method using ALBert-BiLSTM-Self_Att-CRF to automatically extract and organize potato disease and pest knowledge from heterogeneous data, which enhanced prevention and control efforts and demonstrated improved accuracy and efficiency over existing models [24]. Zhang et al. proposed a Chinese named-entity–relation-extraction model (BBCPF) for crop diseases, utilizing knowledge graphs to effectively integrate fragmented text data and enhance knowledge management, achieving high precision and recall rates in entity and relation extraction [25]. Lu et al. introduced an ontology-based, fine-grained knowledge extraction method for the wheat production chain, defining conceptual layers and utilizing Word2vec-BiLSTM-CRF models to enhance entity–relationship–attribute extraction accuracy [26]. Wang et al. integrated multi-source data to construct a knowledge graph for tomato leaf pests and diseases, utilizing the ALBERT-BiLSTM-CRF model for efficient knowledge extraction and Neo4j for visualization [27].

However, RDP data are intricate and diverse in structure, and no publicly annotated datasets are available, which presents a significant challenge to the automated construction of RDP knowledge graphs. Furthermore, current research on RDP knowledge graphs remains relatively limited, with entity and relationship extraction techniques still in the early stages of exploration. To address these issues, this paper focuses on constructing an RDP knowledge graph, conducting thorough research on named entity recognition and relationship extraction methods based on deep learning technology. The aim is to extract fragmented rice pest and disease knowledge from multi-source data by building a model, and to form a knowledge base with rich semantic relationships. The main contributions of this paper are as follows:

(1) Given the scarcity of data, this paper aims to integrate diverse data resources and collect information. After data cleaning and processing, a comprehensive Chinese rice disease and pest dataset was constructed.

(2) By integrating the efficient RoBERTa model and multi-head attention mechanism into the BiLSTM-CRF model, this paper proposes a multi-layer network extraction model, called MARBC, for extracting entities and relationships from RDP data.

(3) Aiming at the organization and representation of multi-source heterogeneous data, the construction of an RDP knowledge graph is achieved from two dimensions: the data layer and the ontology layer.

2. Materials and Methods

The construction process of the RDP knowledge graph, as illustrated in Figure 1, primarily encompassed the following stages: data acquisition and processing, ontology construction, knowledge extraction, knowledge fusion, and, ultimately, knowledge storage and visualization (a detailed description can be found in Section 3.3). Firstly, a diverse range of data sources was gathered. Structured data were sourced from the Agricultural Economy and Information Research Institute (AEIRI), Anhui Academy of Agricultural Sciences, located in Hefei, China (230031). Semi-structured data were crawled at regular intervals from authoritative agricultural websites and platforms such as Baidu Encyclopedia (https://baike.baidu.com/). Unstructured data refer to pure text corpora containing multiple complex relationships, together with books related to RDPs. Following this, the ontology for RDPs was constructed, taking into account the unique characteristics of the corpora. This involved defining classes, relationships, and attributes, and establishing corresponding constraints to delineate the boundaries for data extraction. Subsequently, knowledge extraction was conducted separately on the structured, semi-structured, and unstructured data, followed by a comprehensive knowledge fusion process. Ultimately, all the extracted triple data were stored and visualized using the Neo4j graph database.

2.1. Data Acquisition and Processing

This study extensively collected multi-source heterogeneous data in the field of RDPs, covering structured, semi-structured, and unstructured data to ensure the comprehensiveness and diversity of the data. Figure 2 illustrates the process of collecting this multi-source heterogeneous RDP data. The primary data sources included AEIRI, official agricultural websites, Baidu Encyclopedia and RDP-related books. The specific ways of collecting and processing data for these three structures are detailed as follows:

2.1.1. Structured Data Collection and Extraction

The structured data used in this paper come from AEIRI, which focuses on research in the fields of agricultural economics and information. AEIRI adheres to high standards in data accuracy and update frequency, aiming to provide solid data support for various types of research. Table 1 provides a detailed display of some structured data in the relational database.

The process of converting structured data into triples is relatively simple and intuitive. By mapping the table name, field name, and field value in the structured data to the entity, attribute, and attribute value in the triple, efficient knowledge representation can be achieved. In a relational database, each database table usually corresponds to a type of entity, each record in the table represents a specific instance, and the fields in the table are converted into the attributes of the entity. In addition, the names of all diseases and pests are used as head entities to facilitate subsequent data manipulation and management.

According to the mapping rules provided above, the data in Table 1 could be converted into triples such as (rice bakanae disease, synonym, phytophthora), (rice bakanae disease, pathogen type, fusarium oxysporum), (rice ragged stunt virus, geographical distribution, Guangdong), and so forth. This transformation process ensured that the structured data were accurately represented in a format suitable for knowledge graph construction.
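The mapping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function `row_to_triples`, the key column `name`, and the field names in the sample record are assumptions, while the sample values follow the triples quoted in the text.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def row_to_triples(row: Dict[str, str], name_field: str = "name") -> List[Triple]:
    """Map one relational record to (head entity, attribute, value) triples.

    The disease or pest name is used as the head entity; every other
    non-empty field becomes an (attribute, value) pair.
    """
    head = row[name_field]
    triples: List[Triple] = []
    for field, value in row.items():
        if field == name_field or not value:
            continue  # skip the key column and empty cells
        triples.append((head, field, value))
    return triples


# Hypothetical record; the column names are illustrative only.
record = {
    "name": "rice bakanae disease",
    "synonym": "phytophthora",
    "pathogen type": "fusarium oxysporum",
    "geographical distribution": "",
}
triples = row_to_triples(record)
for triple in triples:
    print(triple)
```

Empty cells are skipped here so that no triple with a blank value reaches the graph; whether the original pipeline does the same is not stated in the text.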

2.1.2. Semi-Structured Data Collection and Extraction

The semi-structured data primarily originated from the Chinese Crop Germplasm Resources Information System, the Hubei Pest and Weed Data Platform, and the Baidu Encyclopedia. This study first analyzed the structure of the web pages and wrote crawler code to request each page and obtain the website’s response. XPath and other tools were then used to parse the pages retrieved from each Uniform Resource Locator (URL). Based on the crawling rules and the characteristics of the pages, the data on diseases and pests could be crawled from the web pages in batches. Subsequently, the source code of the web pages was analyzed, and regular expressions were used to strip all irrelevant markup tags. Finally, combined with manual review, the data were cleaned and stored in files.
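The regular-expression cleaning step might look roughly like the sketch below. It uses only the Python standard library; the patterns, the function name `clean_html_fragment`, and the sample fragment are assumptions made for illustration, since the paper does not list the expressions it used.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")   # any markup tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including non-breaking spaces


def clean_html_fragment(fragment: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", fragment)   # replace tags with a space so words do not merge
    text = unescape(text)              # e.g. &nbsp; becomes a non-breaking space
    return SPACE_RE.sub(" ", text).strip()


# Hypothetical fragment of a crawled page.
sample = "<div class='para'><b>Rice blast</b>&nbsp;damages seedlings.</div>"
cleaned = clean_html_fragment(sample)
print(cleaned)
```

A regex pass like this is only suitable for removing leftover tags from already-selected text nodes; selecting the nodes themselves is better left to an XPath-capable parser, as the text describes.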

Semi-structured data have a certain structure, between that of structured data and unstructured data. This study processed the crawled data files mentioned above through Python programming to extract the relevant attributes and relationship information about RDPs. The extraction process is shown in Figure 3. For triple extraction of this type of data, RDPs could be used as entities, and other key-value pairs could be used as attributes and attribute values—for example, (rice blast, damaged parts, seedlings), (rice blast, pathogen, rice blast pathogen), etc.

2.1.3. Unstructured Data Collection and Extraction

The aforementioned semi-structured data only focused on the recommended attribute terms from the website pages, simplifying the operation process to a certain extent. However, this also introduced issues of information loss and limitations to the data. For example, within the symptom attribute value of rice brown spot disease, there are still many undiscovered pieces of information, including alias information, affected parts, distribution areas, etc. Recovering this entity and relationship information requires extracting knowledge from unstructured data. For data found in books on rice pest and disease diagnosis and control, this study used a high-resolution scanner to scan the useful content within the books. After the text areas were identified, Optical Character Recognition (OCR) technology [28] was applied to convert them into editable documents, which greatly facilitated subsequent knowledge extraction tasks. This study employed the MARBC model for extracting knowledge from the unstructured data. Implementation details can be found in Section 2.3.

2.1.4. Data Annotation

This study used the Label-Studio data annotation tool for annotating text, utilizing a set of five category labels: HE for head entity, ON for alias, LOC for location, DP for damaged part, and CP for control pesticide. Once the labels were configured, the uploaded data could be annotated, as shown in Figure 4. Subsequently, the annotated data were exported, and a Python script was utilized to convert the exported Comma-Separated Values (CSV) file into the Begin-Inside-End-Single-Other (BIESO) data format. The data were then partitioned into training, testing, and validation sets in a ratio of 7:2:1.
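As a rough illustration of the conversion and partitioning steps, the sketch below turns labeled spans into BIESO tags and splits samples 7:2:1. The span format, the function names, and the English tokenization are assumptions (the actual corpus is Chinese and the export is a CSV file whose layout is not given); only the tag scheme, the label names, and the split ratio come from the text.

```python
import random
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def spans_to_bieso(n_tokens: int, spans: Sequence[Span]) -> List[str]:
    """Convert labeled spans into a Begin-Inside-End-Single-Other tag sequence."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-token entity
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def split_7_2_1(samples: Sequence, seed: int = 0):
    """Shuffle and partition samples into training, testing, and validation sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_test = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_test], items[n_train + n_test:]


tokens = ["rice", "bakanae", "disease", "also", "called", "leggy", "growth", "disease"]
tags = spans_to_bieso(len(tokens), [(0, 3, "HE"), (5, 8, "ON")])
train, test, val = split_7_2_1(range(10))
print(list(zip(tokens, tags)))
```

Integer arithmetic is used for the split sizes to avoid floating-point rounding; any remainder falls into the validation set.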

In the task of text entity and relationship extraction focused on a single head entity, the core lies in extracting entities Ei (where i = 1, 2, …, n) related to the head entity HE, as well as the relationships Ri (where i = 1, 2, …, n) between these entities. As shown in Figure 5, considering the given head entity “rice bakanae disease” and its relationship with “leggy growth disease”, labeled as “alias”, when encountering the BIE (Begin, Inside, End) tags corresponding to the HE and another entity ON, a triplet (rice bakanae disease, alias, leggy growth disease) is formed. This process continues iteratively until the next head entity HE is matched, signifying the completion of the triplet extraction for the current head entity.
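The head-entity-driven assembly of triples can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: the relation names in `RELATION_BY_LABEL` are inferred from the label descriptions in Section 2.1.4, and the helper names and English tokens are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed mapping from non-head entity labels to relation names.
RELATION_BY_LABEL = {
    "ON": "alias",
    "LOC": "location",
    "DP": "damaged part",
    "CP": "control pesticide",
}


def decode_entities(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from a BIESO-tagged sequence."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            entities.append((token, label))
            current = []
        elif prefix == "B":
            current = [token]
        elif prefix == "I":
            current.append(token)
        elif prefix == "E":
            current.append(token)
            entities.append((" ".join(current), label))
            current = []
    return entities


def build_triples(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Attach every entity to the most recent head entity (HE) seen so far."""
    triples, head = [], None
    for text, label in decode_entities(tokens, tags):
        if label == "HE":
            head = text  # a new head entity closes the previous one
        elif head is not None:
            triples.append((head, RELATION_BY_LABEL.get(label, label), text))
    return triples


tokens = ["rice", "bakanae", "disease", "also", "called", "leggy", "growth", "disease"]
tags = ["B-HE", "I-HE", "E-HE", "O", "O", "B-ON", "I-ON", "E-ON"]
triples = build_triples(tokens, tags)
print(triples)
```

For character-level Chinese input, the entity text would be joined without spaces.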

2.2. RDP Ontology Construction

Ontology represents an artificially constructed assembly of concepts and conceptual frameworks [29,30,31]. It serves as the cornerstone for the semantic model of a knowledge graph, offering standardized and restricted semantics for data population. This facilitates the scalable development of knowledge graphs, bolstering applications at advanced layers. Furthermore, through the construction of the ontology, the data layer can be efficiently organized and managed. This study used the open-source tool Protégé [32] to construct an ontology structure and delineate the boundaries of knowledge extraction by defining the ontology and data schema. The construction process is shown in Figure 6.

(1) Collect relevant domain knowledge: Firstly, prior to constructing the RDP ontology, a comprehensive acquisition of domain knowledge is imperative through pertinent literature review, data analysis, field surveys, and other means. This process aids in grasping the symptom profiles, life cycles, influencing factors, and other relevant aspects of diseases and pests. By furnishing a wealth of background knowledge and data, this undertaking fortifies the accuracy and comprehensiveness of the constructed rice pest and disease ontology.

(2) Determine the ontology elements and scope: By referencing existing relevant ontology construction cases and domain standards, key elements of ontology construction can be extracted, such as the names, characteristics, classifications, and relationships of diseases and pests. This process involves defining the granularity of concepts within the ontology and determining the scope of RDPs, including their types and geographical distribution, to ensure comprehensive coverage. The goal is to construct an ontology that not only meets practical requirements, but also exhibits scalability and broad applicability, enabling it to adapt to evolving research and application needs in the field.

(3) Identify relevant terms and term relationships: In the agricultural field, the construction of an ontology requires precise definition of specialized terminology to eliminate ambiguity and ensure clarity. This includes clearly delineating concepts such as viruses, pests, and their respective characteristics, as well as establishing the relationships between these entities. For instance, it is essential to define the harmful interactions between crops and pests or diseases, the hierarchical genus–species relationships among pests, and the taxonomic classifications at the levels of genus, order, family, and kingdom. Such meticulous structuring ensures that the ontology accurately represents the complex relationships and classifications within the agricultural ecosystem, facilitating effective knowledge organization and retrieval.

(4) Design the ontology hierarchy: This paper categorizes the RDP ontology into two primary classes—rice diseases and rice pests—based on the characteristics of crop attributes. Each category is further divided into subcategories that encompass the attributes and relationships relevant to the acquired multi-source heterogeneous data. Specifically, rice diseases comprise 13 subcategories, while rice pests include 14 subcategories, each with distinct branches. For instance, prevention and control methods are subdivided into biological, physical, chemical, and agricultural approaches. Morphological characteristics are further classified based on traits such as body shape, body color, and body length of diseases and pests. These detailed branches serve as the lowest layer of the ontology, with the overall hierarchical structure limited to four layers. A partial representation of this hierarchical structure is illustrated in Figure 7.

(5) Define attributes, relationships, and constraints: For each entity category within the classification system, its attributes and relationships are explicitly defined. Attributes are utilized to describe the characteristics or properties of entities. Relationships, on the other hand, are employed to depict the connections or associations between entities. When defining attributes, it is essential to assign names to these attributes and specify their domains and ranges. The domain identifies the entity categories to which the attributes and relationships apply, while the range specifies the type of values they can take. For instance, attributes such as symptoms, transmission routes, and affected crops are defined with specific data types and value ranges as constraints, providing detailed descriptions and restrictions for entity representation. Examples of these definitions are illustrated in Table 2 and Table 3. Attributes such as symptoms and control methods are used to describe the intrinsic characteristics of concepts, while relationships such as harmful crops and distribution areas are used to describe the associations between different concepts.

(6) Instantiate the ontology: After finalizing the ontology hierarchy and its corresponding constraints, the next step involves populating the ontology with actual RDP data. For example, instances such as the rice disease “stripe leaf blight” and the rice pest “rice planthopper” are added to the ontology. This process ensures that the ontology accurately reflects real-world scenarios in the field of RDPs. Figure 8 illustrates the constructed RDP ontology, showcasing a portion of the instances that have been incorporated.

Figure 7. RDP ontology hierarchical structure diagram.

Table 2. RDP relationship constraint settings.

Relations	Domains	Ranges
Alias	Rice diseases and pests	Rice diseases and pests
Pest site	Rice diseases and pests	Rice
Distribution area	Rice diseases and pests	Geography
Pesticides	Rice diseases and pests	Pesticides

Table 3. Attribute constraint settings.

Attributes	Domains	Ranges
Pathogen	Rice diseases	String
Symptoms	Rice diseases	String
Transmission pathways and conditions	Rice diseases	String
Habits	Rice pests	String
Scientific name	Rice pests	String
Characteristics of the disease	Rice pests	String
Morphological characteristics	Rice pests	String
Prevention and control methods	Rice diseases and pests	String
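The domain and range constraints of Table 2 can be used to validate candidate triples before they enter the graph. The sketch below is an illustration under assumptions: the constraint values are copied from Table 2, but the parent-class mapping and the function names are hypothetical, and the actual constraints are enforced in Protégé rather than in Python.

```python
from typing import Dict, Optional, Tuple

# (domain, range) pairs copied from Table 2.
RELATION_CONSTRAINTS: Dict[str, Tuple[str, str]] = {
    "Alias": ("Rice diseases and pests", "Rice diseases and pests"),
    "Pest site": ("Rice diseases and pests", "Rice"),
    "Distribution area": ("Rice diseases and pests", "Geography"),
    "Pesticides": ("Rice diseases and pests", "Pesticides"),
}

# Assumed hierarchy: the two primary classes share one parent class.
PARENT_CLASS: Dict[str, str] = {
    "Rice diseases": "Rice diseases and pests",
    "Rice pests": "Rice diseases and pests",
}


def is_a(cls: str, target: str) -> bool:
    """Return True if cls equals target or is one of its descendants."""
    current: Optional[str] = cls
    while current is not None:
        if current == target:
            return True
        current = PARENT_CLASS.get(current)
    return False


def satisfies_constraint(relation: str, head_class: str, tail_class: str) -> bool:
    """Check a candidate triple's classes against the relation's domain and range."""
    if relation not in RELATION_CONSTRAINTS:
        return False
    domain, range_ = RELATION_CONSTRAINTS[relation]
    return is_a(head_class, domain) and is_a(tail_class, range_)


ok = satisfies_constraint("Distribution area", "Rice diseases", "Geography")
bad = satisfies_constraint("Distribution area", "Rice diseases", "Pesticides")
print(ok, bad)
```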

Figure 8. Partial construction of RDP ontology layer based on Protégé.

2.3. Joint Extraction of Entity Relationships Based on MARBC Model

This paper innovatively integrates the RoBERTa model and a multi-head attention mechanism into the BiLSTM-CRF framework, constructing a multi-layer network knowledge joint extraction model named MARBC (Multi-head Attention RoBERTa BiLSTM CRF). The goal of this model is to accurately extract complex nested knowledge from a corpus related to rice diseases and pests. The introduction of the RoBERTa model and the multi-head attention mechanism offers two key advantages. First, the RoBERTa model excels at learning complex language structures and contextual information, particularly when processing domain-specific data. This capability significantly enhances the effectiveness of pre-training. Second, the multi-head attention mechanism enables the capture of richer and more multi-dimensional information, facilitating the identification of diverse potential patterns and relationships within the input sequence. This enhances the model’s understanding and generalization capabilities for complex sequence tasks. This combination not only improves the accuracy of knowledge extraction, but also provides robust technical support for handling complex semantic relationships.

The overall framework of the MARBC model is shown in Figure 9. It primarily consists of four components: the RoBERTa layer, the BiLSTM layer, the multi-head attention layer, and the Conditional Random Fields (CRF) layer. Firstly, the RoBERTa pre-trained language model (PLM) is utilized to encode the annotated data into word representations, with the aim of obtaining the representation vectors of the text. Based on the contextual features of characters, character vector Ti (where i = 1, 2, …, n) is dynamically generated. Subsequently, these character vectors are fed into the BiLSTM module for bidirectional encoding. Multi-head attention is utilized to prioritize important information in the text, culminating in the final text representation. This representation is subsequently passed to the CRF module for decoding, where label transition probabilities and constraints are learned through training to predict the final annotation sequence accurately. The following is a detailed explanation and introduction of each module in the MARBC model.

(1) Word Embedding Module

Recently, RoBERTa has demonstrated outstanding performance in numerous natural language processing (NLP) tasks, and has emerged as one of the most widely utilized PLMs [33]. In this study, RoBERTa was employed to embed data and acquire contextually relevant word vector representations. The input representation method of RoBERTa is shown in Figure 10.

Upon receiving a text input, RoBERTa constructs its input vector representation by combining the word vector, segment embedding, and position representation of each token. These vectors are crucial for accurately capturing detailed vocabulary information, effectively distinguishing between different text fragments, and precisely determining the position of words within the sentence. The word vector is derived from the embedding layer, which assigns each token in the RoBERTa vocabulary to a fixed-dimension vector that expresses its semantics. Furthermore, segment embeddings are used to distinguish tokens from different text fragments. Two segment embeddings are typically used: one for the first text segment, and another for the second. These segment embeddings allow the model to differentiate between tokens from different sources within the input text, enabling it to handle multiple sentences or text pairs. Positional embeddings represent the position of each token. Since the transformer model lacks inherent position information, a position-dependent vector is added for each token. The original Transformer derives this vector from sine and cosine functions, whereas BERT-style models such as RoBERTa learn it during pre-training; in both cases it helps the model understand the order of tokens in the input. At each token position, the vectors from these three components are summed to form the input representation for RoBERTa. This combination allows the model to simultaneously consider vocabulary, text structure, and position information, enhancing its performance in NLP tasks.

(2) Feature Extraction Module

The BiLSTM (Bidirectional Long Short-Term Memory) utilized in MARBC is a Recurrent Neural Network (RNN) [34] with a model structure as depicted in Figure 11.

It comprises a fusion of forward and backward LSTM. For each time step $t$, the input $X_{t}$, the hidden state of the previous time step $H_{t-1}$, and the cell state $C_{t-1}$ are input to the LSTM unit. The forget gate regulates the information to be discarded from the cell state, while the input gate determines the information to be added to the cell state. To update the cell state $C_{t}$, $f_{t}$, $i_{t}$, and $\tilde{C}_{t}$, respectively, denote the forget gate, the input gate, and the output of the candidate memory cell. The formula for updating the cell state is given by $C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t}$. The forward LSTM layer begins processing from the start of the sequence, while the backward LSTM layer starts from the end. They independently compute hidden state representations for the sequence. Subsequently, the hidden states from both directions are concatenated to form the representation of the entire sequence, which is used as input for the subsequent layer.
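To make the gate mechanics concrete, here is a minimal single-unit sketch of one LSTM step and of the bidirectional pass, written in plain Python. It follows the standard LSTM formulation (sigmoid gates, tanh candidate), which is assumed rather than quoted from the paper; the scalar weights, the `w` dictionary layout, and the function names are hypothetical, and a real model would use vector-valued states and a library such as PyTorch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Weights = Dict[str, Tuple[float, float, float]]  # gate name -> (w_x, w_h, bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t: float, h_prev: float, c_prev: float, w: Weights) -> Tuple[float, float]:
    """One LSTM step for a single hidden unit."""
    def pre(gate: str) -> float:
        w_x, w_h, b = w[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    o_t = sigmoid(pre("o"))            # output gate
    c_tilde = math.tanh(pre("c"))      # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde # cell state update from the text
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def bilstm(xs: Sequence[float], w_fwd: Weights, w_bwd: Weights) -> List[Tuple[float, float]]:
    """Run forward and backward passes and pair up the hidden states per time step."""
    fwd, h, c = [], 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, w_fwd)
        fwd.append(h)
    bwd, h, c = [], 0.0, 0.0
    for x in reversed(xs):
        h, c = lstm_step(x, h, c, w_bwd)
        bwd.append(h)
    bwd.reverse()                      # realign backward states with time order
    return list(zip(fwd, bwd))         # concatenation of both directions


zero_w: Weights = {g: (0.0, 0.0, 0.0) for g in ("f", "i", "o", "c")}
h_t, c_t = lstm_step(1.0, 0.0, 2.0, zero_w)
states = bilstm([0.5, -1.0, 2.0], zero_w, zero_w)
print(h_t, c_t, len(states))
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved; this makes the update rule easy to check by hand.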

(3) Richer Contextual Information Capture Module

The MARBC model integrates the multi-head attention mechanism to strengthen attention to different positions in the sequence [35]. Operating on the feature output of the BiLSTM, multiple attention heads can focus on different features, enabling vital information within the sequence to be extracted more effectively. Firstly, the output of the BiLSTM is multiplied by the corresponding parameter matrices, which linearly transform it into the query, key, and value vectors of the attention mechanism. Subsequently, the attention weights are derived through a sequence of operations including scaled dot products and softmax normalization. The calculation formula is given in Equation (1).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{1}$$

where Q, K and V represent the query, key, and value matrices, respectively, and $d_{k}$ denotes the dimension of each attention head; dividing by $\sqrt{d_{k}}$ keeps the inner product of Q and K from growing too large.
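Equation (1) can be worked through with a small pure-Python sketch. The function names and the toy matrices are illustrative only; production code would rely on a tensor library rather than nested lists.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled inner product of the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return output


# Toy example: a zero query attends equally to both keys.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Because the query is all zeros, both scores are equal, the weights are 0.5 each, and the output is the average of the two value vectors.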

As shown in Figure 12, the multi-head attention mechanism functions in a parallel manner. It independently computes weighted sums for each attention head. The resulting weighted feature matrices from each head are merged through concatenation to form the weighted feature representation. This composite representation is then fed into the CRF for classification. The calculation formulas are shown in (2) and (3).

$$\mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) \tag{2}$$

$$M = \mathrm{Concat}(\mathrm{head}_{1}, \mathrm{head}_{2}, \dots, \mathrm{head}_{n}) W^{O} \tag{3}$$

where $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ and $W^{O}$ are parameter matrices; $\mathrm{head}_{i}$ refers to the i-th attention head; Concat indicates the concatenation operation; and n denotes the number of attention heads (in this paper, n = 12).

(4) Sequence Prediction Module

Conditional Random Fields is a statistical model used for modeling sequence labeling tasks [36]. In tasks like NER, CRF is employed to capture dependencies among labels and predict labels for each word accurately. In the MARBC model, the CRF module uses the BiLSTM attention output as its input. By defining both the transition matrix and the emission matrix, the labeling probability of the sequence can be calculated. This process involves modeling named entities to determine the dependency between labels and predicting the label for each word. In this way, the entire sequence labeling task is optimized. The transition matrix is utilized to measure the transition probability between labels at adjacent positions. Conversely, the emission matrix is employed to represent the probability distribution of each label generated at each specific position in the sequence.

CRF comprehensively considers the relationship between the current label and adjacent labels. It undergoes training to maximize the probability of the entire label sequence. During this training process, CRF learns the transition matrix, which is crucial for obtaining the optimal label sequence. This end-to-end training strategy has introduced new ideas and technological breakthroughs to the realm of text information processing.
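The interplay of the emission and transition matrices is easiest to see at decoding time. The sketch below uses Viterbi decoding, the standard procedure for linear-chain CRFs; the paper does not name its decoding algorithm, so this choice, the three-label tag set, and all scores are assumptions made for illustration.

```python
from typing import List

LABELS = ["O", "B", "E"]  # reduced tag set for illustration
FORBID = -100.0           # large penalty for invalid transitions

# transitions[i][j]: score of moving from label i to label j.
TRANSITIONS = [
    [0.0, 0.0, FORBID],     # O -> O, O -> B, O -> E (invalid)
    [FORBID, FORBID, 0.0],  # B -> O (invalid), B -> B (invalid), B -> E
    [0.0, 0.0, FORBID],     # E -> O, E -> B, E -> E (invalid)
]


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label index sequence."""
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    best_last = max(range(n_labels), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Per-position label scores; position 1 prefers "O" in isolation.
emissions = [
    [0.1, 2.0, 0.0],
    [1.0, 0.0, 0.9],
    [1.5, 0.0, 0.2],
]
path = viterbi(emissions, TRANSITIONS)
decoded = [LABELS[i] for i in path]
print(decoded)
```

Taking the best label at each position independently would give B, O, O, which breaks the tag scheme; the transition scores steer the decoder to the valid sequence B, E, O instead.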

2.4. Knowledge Fusion of Multi-Source Heterogeneous Data

The knowledge extracted from different data sources may contain a significant amount of ambiguous and duplicate data. The objective of knowledge fusion is to effectively consolidate and integrate these data, enhancing the quality of the knowledge graph database. For instance, the entity “bacterial grain blight” obtained from web crawling is stored as “grain blight” in the relational database of the AEIRI. Although they refer to the same entity, they are presented in distinct forms, each with its own set of relationships and attributes that share similarities and differences. In the structured data source pertaining to rice blast, attributes like “geographical distribution” and “pathogenesis factors” exist, whereas the semi-structured web page data source includes attributes such as “epidemic law” and “grading standard”. Hence, the extracted entities must be evaluated pairwise to determine whether they denote the same object. If they do, their shared and differing relationships and attributes must be reconciled so that the entities can be merged effectively.

Upon observation, certain entities within the extracted triples have distinct names but actually refer to the same entity. This ambiguity in entity representation makes the quality of the extracted triples inconsistent and necessitates entity alignment. To this end, this paper combines cosine similarity and the Jaccard coefficient to compute the similarity between entities. This approach considers the similarity of entities at both the vector representation and attribute set levels, aiming to yield more precise and comprehensive measurements. The specific formulas are given in Equations (4) and (5). A threshold is then applied to decide whether two candidate entities match, which in turn enables knowledge fusion.

\mathrm{Align}_{\cos}(e_{1}, e_{2}) = \frac{|A(e_{1}) \cap A(e_{2})|}{\sqrt{|A(e_{1})| \, |A(e_{2})|}}

(4)

\mathrm{Align}_{Jacc}(e_{1}, e_{2}) = \frac{|A(e_{1}) \cap A(e_{2})|}{|A(e_{1}) \cup A(e_{2})|}

(5)

where A(e_{1}) and A(e_{2}) represent the attribute sets of entities e_{1} and e_{2}, respectively.
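The two measures in Equations (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the paper does not state how the two scores are combined or which threshold is used, so the weighted average, the `weight` and `threshold` values, and the example attribute sets below are all assumptions.

```python
from math import sqrt
from typing import Set, Tuple


def align_cos(a1: Set[str], a2: Set[str]) -> float:
    """Equation (4): set-based cosine similarity of two attribute sets."""
    if not a1 or not a2:
        return 0.0
    return len(a1 & a2) / sqrt(len(a1) * len(a2))


def align_jacc(a1: Set[str], a2: Set[str]) -> float:
    """Equation (5): Jaccard coefficient of two attribute sets."""
    union = a1 | a2
    if not union:
        return 0.0
    return len(a1 & a2) / len(union)


def is_same_entity(a1: Set[str], a2: Set[str],
                   weight: float = 0.5, threshold: float = 0.4) -> Tuple[bool, float]:
    """Combine both scores and compare against a threshold.

    The weighted average, weight and threshold are illustrative assumptions.
    """
    score = weight * align_cos(a1, a2) + (1.0 - weight) * align_jacc(a1, a2)
    return score >= threshold, score


# Hypothetical attribute sets for one disease as seen in two data sources.
web_entity = {"symptoms", "pathogen", "epidemic law", "grading standard"}
db_entity = {"symptoms", "pathogen", "geographical distribution", "pathogenesis factors"}

matched, score = is_same_entity(web_entity, db_entity)
print(matched, round(score, 3))
```

With two shared attributes out of four per source, the cosine score is 0.5 and the Jaccard score is 1/3, so the pair passes the assumed threshold of 0.4. In practice the threshold would need to be tuned on labeled entity pairs.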

3. Results

3.1. Configuration of Experimental Parameters and Evaluation Indicators

The experiment employed the Anaconda development environment, alongside Python 3.6, for training and testing the model. The model parameters of the RoBERTa model were optimized using the Adam optimizer and a learning rate scheduler, both provided by PyTorch version 1.7.1. This facilitated efficient weight updates during the fine-tuning process.

Throughout the model training process, several adjustments were made: (a) The batch size was determined based on the graphics processing unit (GPU) memory capacity. (b) The maximum length of the sequence was set according to the average length of the sentences. (c) Adjustments to the dropout rate and learning rate were made based on the convergence patterns observed in the loss function within the training logs. The final optimal combination of parameters for the model was obtained through this meticulous process, as shown in Table 4.

To accurately assess the performance of the model, three typical evaluation metrics, namely precision (P), recall (R), and F1 score, were utilized. The calculation formulas are as follows:

P = \frac{TP}{TP + FP} \times 100\%

(6)

R = \frac{TP}{TP + FN} \times 100\%

(7)

F1 = \frac{2PR}{P + R}

(8)

where TP, TN, FP, and FN represent correctly predicted positive/negative samples and incorrectly predicted positive/negative samples, respectively.
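As a small worked example, the three metrics can be computed from raw counts as follows. The function names and the example counts are hypothetical. Precision and recall are returned in percent, so their harmonic mean is already a percentage.

```python
from typing import Tuple


def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) in percent from TP, FP and FN counts."""
    p = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    r = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return p, r, f1_from_pr(p, r)


# Hypothetical counts: 90 correct entities, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.2f}% R={r:.2f}% F1={f1:.2f}%")

# Consistency check against the headline MARBC scores reported in the paper.
print(round(f1_from_pr(95.31, 93.58), 2))
```

The last line reproduces the reported F1 score of 94.44% from the reported precision (95.31%) and recall (93.58%).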

3.2. Prediction Results

3.2.1. Comparison Experiment

This study compared MARBC with several classic named entity recognition models on the RDP dataset constructed in this study. The results are presented in Table 5 and Figure 13 and show that MARBC outperforms every baseline on all three metrics. A detailed analysis based on precision, recall, and F1 score follows:

(1) Precision: The MARBC model achieves the highest precision of 95.31%, significantly outperforming BiLSTM-CRF (91.53%), IDCNN-CRF (92.95%), and BERT-BiLSTM-CRF (91.62%). This improvement is attributed to the sophisticated semantic representation capabilities of RoBERTa, which bolster the model’s proficiency in accurately recognizing pertinent entities while reducing false positives. The incorporation of the multi-head attention mechanism further enhances this capability by allowing the model to concentrate on multiple contextual features concurrently, thereby ensuring more accurate predictions.

(2) Recall: MARBC also leads in recall, with a score of 93.58%, surpassing BiLSTM-CRF (87.57%), IDCNN-CRF (88.43%), and BERT-BiLSTM-CRF (92.69%). By integrating RoBERTa, the model can capture a wider array of semantic relationships, while the multi-head attention (MA) mechanism helps ensure that crucial contextual information is not overlooked. This combination enables MARBC to detect a greater number of true positives, even within intricate and varied datasets.

(3) F1 Score: The MARBC model achieves the highest F1 score of 94.44%, reflecting its balanced performance in both precision and recall. This score is notably higher than those of BiLSTM-CRF (89.47%), IDCNN-CRF (90.60%), and BERT-BiLSTM-CRF (92.15%). The synergy between RoBERTa’s deep semantic understanding and the MA mechanism’s ability to process diverse contextual features ensures that MARBC excels in tasks requiring both precision and recall.

As a result, the MARBC model’s outstanding performance in terms of precision, recall, and F1 score underscores the effectiveness of combining RoBERTa and the multi-head attention mechanism. This makes MARBC a state-of-the-art solution for tasks requiring accurate entity recognition and relationship extraction.

Table 5. Comparison of experimental results based on RDP dataset constructed in this study.

Model	Precision (%)	Recall (%)	F1 (%)
BiLSTM-CRF	91.53	87.57	89.47
IDCNN-CRF	92.95	88.43	90.60
BERT-BiLSTM-CRF	91.62	92.69	92.15
MARBC	95.31	93.58	94.44

Figure 13. Comparison of different models’ evaluation results.

3.2.2. Ablation Experiment

The ablation experiment was designed to analyze the effectiveness of each module in the MARBC model. This study examined the specific contribution of each module to the model’s overall performance by removing, one at a time, the key modules: BiLSTM, multi-head attention, and RoBERTa. The variant resulting from the removal of the BiLSTM module is named RMAC, the variant after eliminating the multi-head attention module is referred to as RBC, and the variant following the removal of the RoBERTa module is designated as BMAC.

The results of the ablation experiment are displayed in Table 6. Firstly, BMAC shows the weakest performance, strongly emphasizing the advantage of the RoBERTa pre-trained language model in data embedding representation. This advantage significantly enhances the performance of MARBC in NLP tasks. Secondly, the results of RBC reveal that the multi-head attention mechanism effectively captures diverse features within the sequence by introducing multiple attention heads. This process leads to more efficient extraction of key information, which is essential for enhancing model performance. Lastly, the performance of RMAC further confirms the positive impact of the BiLSTM module in text feature extraction. This module also makes a notable contribution to enhancing the overall performance of MARBC.

3.2.3. Entity Label Prediction

Furthermore, this study conducted a detailed verification of the precision, recall, and F1 score of the MARBC model across five entity labels. Figure 14 visually presents the experimental results for each label with respect to these three evaluation indicators.

Upon analyzing the data distribution results, it is evident that the “DP” (damaged part) label generally exhibits lower scores compared to other labels. Upon further exploration of the reasons behind this, we speculate that it may be closely linked to the data content encompassed by the label. Specifically, the “DP” label primarily pertains to leaves, sheaths, leaf bases, and other components of rice. These parts often lead to confusion in practical applications, thereby increasing the complexity of model prediction. Consequently, during the prediction process, the model may be more prone to misclassifying these parts, leading to deviations between the prediction results and the actual situation. This revelation offers valuable insights and guidance for refining the model and enhancing prediction accuracy in subsequent optimizations.
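Per-label scores of this kind can be computed by comparing gold and predicted tag sequences. The sketch below assumes entity-level exact matching and BIESO tags written as `B-DP`, `I-DP`, `E-DP`, `S-LOC`, and `O`. Both the tag spelling and the matching criterion are assumptions, since the paper does not describe its evaluation script.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index inclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIESO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, lab = tag.partition("-")
        if prefix == "S":
            spans.add((lab, i, i))
            start, label = None, None
        elif prefix == "B":
            start, label = i, lab
        elif prefix == "I" and start is not None and lab == label:
            continue  # still inside the open span
        elif prefix == "E" and start is not None and lab == label:
            spans.add((lab, start, i))
            start, label = None, None
        else:  # "O" or an ill-formed transition discards the open span
            start, label = None, None
    return spans


def per_label_scores(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, F1)} in percent, using exact span matches."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred_spans:
        counts[span[0]]["tp" if span in gold_spans else "fp"] += 1
    for span in gold_spans - pred_spans:
        counts[span[0]]["fn"] += 1
    scores = {}
    for lab, c in counts.items():
        p = 100.0 * c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = 100.0 * c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2.0 * p * r / (p + r) if p + r else 0.0
        scores[lab] = (p, r, f1)
    return scores


# Hypothetical sentence: the DP span is predicted one token too short.
gold = ["B-HE", "E-HE", "O", "B-DP", "I-DP", "E-DP", "O", "S-LOC"]
pred = ["B-HE", "E-HE", "O", "B-DP", "E-DP", "O", "O", "S-LOC"]
scores = per_label_scores(gold, pred)
print(scores)
```

In this toy case the boundary error on the “DP” span counts as both a false positive and a false negative, which illustrates how ambiguous part boundaries can lower the scores of a single label while the other labels are unaffected.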

To further analyze the experimental results, this study compared the results for each label across the MARBC model and two baseline models (BiLSTM-CRF and BERT-BiLSTM-CRF). Figure 15 illustrates the comparison of the results for the first entity label, “HE”, between Model A (BiLSTM-CRF), Model B (BERT-BiLSTM-CRF), and Model C (MARBC). The experimental results show that the MARBC model proposed in this paper exhibits the best performance in head entity prediction. Specifically, compared with the other two models, the F1 value of the MARBC model for the HE label is improved by 5.13% and 2.09%, respectively.

Furthermore, this study examined the impact of additional relational entity labels, namely “ON”, “LOC”, “DP”, and “CP”, on the MARBC model in comparison with the two other models (BiLSTM-CRF and BERT-BiLSTM-CRF), with the results presented in Figure 16. Upon comparing the three evaluation indicators, it is evident that the performance of MARBC improves for each label, with varying degrees of enhancement. This outcome indicates that MARBC exhibits superior performance in handling these relational entity labels.

3.2.4. Discussion of Computational Efficiency of Different Models

To evaluate the time efficiency of different models, this paper compared the training time of the IDCNN-CRF, BiLSTM-CRF, BERT-BiLSTM-CRF, and MARBC models. The number of epochs was set to 20 for all models, while the remaining parameters were configured according to Table 4. The results are shown in Table 7. Combining the model performance comparison in Table 5 with these computational efficiency results, the analysis is as follows:

(1) The BiLSTM-CRF model merges bidirectional LSTM with CRF to grasp the contextual nuances within sequence data. Nonetheless, LSTM’s computational complexity spikes notably, particularly when handling extensive sequences, leading to prolonged training periods. Its effectiveness stands at a moderate level, primarily restrained by LSTM’s inherent struggle in capturing long-range dependencies.

(2) IDCNN enhances the receptive field by leveraging dilated convolution, enabling efficient capture of local features with minimal computational overhead, contributing to its notably swift training duration. Nevertheless, IDCNN grapples with constrained modeling capacities for global contextual insights, culminating in relatively diminished recall and F1 scores.

(3) The BERT-BiLSTM-CRF model combines the pre-trained language representation capabilities of BERT and the sequence modeling capabilities of BiLSTM-CRF. BERT can capture rich semantic information and, relative to IDCNN-CRF, significantly improves the recall (+4.26) and F1 (+1.55) values. Nonetheless, BERT’s large parameter count and high computational complexity, coupled with its fusion with BiLSTM, extend the training time by 0.26 h compared with IDCNN-CRF.

(4) The RoBERTa-based MARBC model has a 0.53 h longer training time than the BERT-BiLSTM-CRF model, primarily due to its integration of the multi-head attention mechanism (MA) and the pre-trained architecture of RoBERTa. As an enhanced version of BERT, RoBERTa achieves superior semantic representation capabilities by leveraging larger-scale pre-training datasets and more optimized training strategies. Simultaneously, the multi-head attention mechanism enhances the model’s ability to capture contextual information and textual relationships through parallel computation of multiple attention heads. The training time of MARBC (0.87 h) is still within 1 h, which is acceptable. Despite the extended training time, the MARBC model outperforms the best competing model on each metric, with gains in precision (+2.36 over IDCNN-CRF), recall (+0.89 over BERT-BiLSTM-CRF), and F1 score (+2.29 over BERT-BiLSTM-CRF), demonstrating its capability in handling complex semantic relationships and long-range dependency tasks. This performance advantage underscores the model’s suitability for applications requiring high accuracy and deep semantic understanding.

Table 7. Comparison of training time for different models.

Model	Training Time (h)
BiLSTM-CRF	0.24
IDCNN-CRF	0.08
BERT-BiLSTM-CRF	0.34
MARBC	0.87
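The differences quoted in items (3) and (4) follow directly from Tables 5 and 7. The short script below reproduces them. The values are copied from the two tables, while the dictionary layout and helper names are only illustrative.

```python
# Precision, recall and F1 (%) from Table 5; training time (h) from Table 7.
results = {
    "BiLSTM-CRF":      {"P": 91.53, "R": 87.57, "F1": 89.47, "hours": 0.24},
    "IDCNN-CRF":       {"P": 92.95, "R": 88.43, "F1": 90.60, "hours": 0.08},
    "BERT-BiLSTM-CRF": {"P": 91.62, "R": 92.69, "F1": 92.15, "hours": 0.34},
    "MARBC":           {"P": 95.31, "R": 93.58, "F1": 94.44, "hours": 0.87},
}


def diff(model_a: str, model_b: str, key: str) -> float:
    """Difference model_a - model_b for one column, rounded to two decimals."""
    return round(results[model_a][key] - results[model_b][key], 2)


# Item (3): BERT-BiLSTM-CRF compared with IDCNN-CRF.
bert_vs_idcnn = {k: diff("BERT-BiLSTM-CRF", "IDCNN-CRF", k) for k in ("R", "F1", "hours")}

# Item (4): MARBC compared with the best baseline on each metric.
baselines = [name for name in results if name != "MARBC"]
marbc_gain = {
    k: round(results["MARBC"][k] - max(results[m][k] for m in baselines), 2)
    for k in ("P", "R", "F1")
}
marbc_extra_hours = diff("MARBC", "BERT-BiLSTM-CRF", "hours")

print(bert_vs_idcnn)
print(marbc_gain, marbc_extra_hours)
```

The output matches the figures in the text: +4.26 recall, +1.55 F1, and 0.26 h for BERT-BiLSTM-CRF over IDCNN-CRF, and +2.36 precision, +0.89 recall, +2.29 F1, and 0.53 h for MARBC.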

3.2.5. Discussion of Broader Applications

The aforementioned research primarily focuses on the application of the MARBC model in the field of RDPs, demonstrating its significant potential in the agricultural domain. On the other hand, the applicability of the MARBC model extends well beyond this scope. By adapting this approach to other crops or agricultural challenges, the impact and practical utility of the research can be substantially enhanced. Currently, our team is engaged in several projects to validate the versatility and generalization capabilities of the MARBC model across diverse fields. These projects include the following:

(1) Anhui University Undergraduate Innovation and Entrepreneurship Training Program National Project: Research on Question-Answering System Based on Knowledge Graph of Tea Diseases and Pests (202410357086): This project uses the MARBC model to extract tea disease and pest knowledge, establish a knowledge graph of tea diseases and pests, and develop a question-answering system. It integrates scattered knowledge resources, provides fast and accurate answers to questions, greatly improves the efficiency of problem solving, and provides beneficial exploration and practice for the intelligent development of the agricultural field.

(2) Anhui Provincial Scientific Research Plan Project: Research on News Recommendation Based on Sentiment Analysis and Knowledge Graph (2024AH052212): This project uses the MARBC model to process news data and build a unique knowledge graph in the news field. Then, it integrates sentiment features with the deep information of the knowledge graph to build an intelligent news recommendation model. This model automatically explores and learns complex relationships between features, aiming to significantly improve the comprehensibility, diversity, and comprehensive coverage of recommendation results, and bring users a more accurate news recommendation experience.

(3) State Grid Anhui Xintong Company Data Center Link Operation Abnormal Prediction and Analysis Self-Healing Technology Research (Enterprise Cooperation Project): In response to the problems of various types of abnormal operation in the power data center and the reliance on manual on-site processing for potential abnormalities, this project conducts research in the following three areas: research on the construction and completion methods of knowledge graphs for the power data center, research on the construction of abnormal prediction models based on artificial intelligence, and research on the construction of abnormal self-healing models based on knowledge graph retrieval enhancement technology. In the process of constructing the knowledge graph of the data center, the MARBC model is used to extract key features from massive power data to build a high-quality knowledge graph.

The research of these projects shows that the MARBC model not only has broad application prospects in the agricultural field, but can also play an important role in other fields, such as news recommendations and power systems. This further proves the generalization ability and adaptability of the MARBC model, and provides a solid foundation for its application in more fields in the future.

3.3. Knowledge Storage and Downstream Application

3.3.1. Knowledge Storage

Currently, the prevalent methods for storing knowledge graphs encompass relational databases and graph databases [37]. Relational databases construct tables to represent entities and relationships within the knowledge graph. However, they suffer from scalability issues, and tend to accumulate substantial redundant information. Conversely, graph databases leverage the graph structure for storage and querying. In these databases, entities and concepts are represented as graph vertices, while attributes and relations serve as edges. This approach visually demonstrates data relationships, facilitating graph querying and knowledge reasoning. Consequently, graph databases have emerged as the primary choice for storing knowledge graphs today. This paper utilizes the open-source graph database Neo4j [38] to dynamically add and modify nodes and relationships. This flexibility enables us to accommodate data fluctuations effectively and manage RDP knowledge storage efficiently.

This study used Cypher and Python to import the data, which fall into two parts. The first part consists of attribute data, covering symptoms, pathogens, transmission routes, etc. This information was crawled with Scrapy and, after processing, stored in the Neo4j graph database through Python programming. The second part consists of the triples extracted by the MARBC model. These triples were imported into the graph database using Cypher statements to construct the RDP knowledge graph.
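One way to turn extracted triples into Cypher statements is sketched below. It is a hedged illustration rather than the authors' import script: the `Entity` node label, the `name` property, and the helper names are hypothetical, and the example triples reuse the rice blast aliases mentioned in this section. `MERGE` is used so that re-importing a triple does not create duplicate nodes or relationships. Because Cypher cannot take a relationship type as a query parameter, the relation name is sanitized and written into the statement, while entity names are passed as parameters.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def relation_type(relation: str) -> str:
    """Normalize a relation name, e.g. 'damaged part' -> 'DAMAGED_PART'."""
    cleaned = re.sub(r"\W+", "_", relation).strip("_").upper()
    if not cleaned:
        raise ValueError(f"unusable relation name: {relation!r}")
    return cleaned


def triple_to_cypher(triple: Triple) -> Tuple[str, Dict[str, str]]:
    """Build one parameterized MERGE statement for a triple."""
    head, relation, tail = triple
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:`{relation_type(relation)}`]->(t)"
    )
    return query, {"head": head, "tail": tail}


triples: List[Triple] = [
    ("rice blast", "alias", "fire blight"),
    ("rice blast", "alias", "rice fever"),
    ("rice blast", "alias", "nodding blight"),
]
statements = [triple_to_cypher(t) for t in triples]
for query, params in statements:
    print(query, params)

# With the official neo4j Python driver (connection details are placeholders):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver(uri, auth=(user, password))
#   with driver.session() as session:
#       for query, params in statements:
#           session.run(query, params)
#   driver.close()
```

Keeping statement generation separate from execution makes the import easy to inspect and test before any data reach the database.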

Figure 17 illustrates the visualization of entities and relationships associated with rice blast disease. In the figure, edges denote relationships among entities, like “harms” and “location”, while nodes represent specific entities, such as “rice seeding”. Specifically, the three connections between the blue and orange nodes indicate that rice blast disease has three aliases, namely fire blight, rice fever, and nodding blight. This visualization form intuitively presents the related entities of rice blast disease and their complex relationships, which helps users better understand the multi-dimensional information of the disease.

3.3.2. Downstream Application

Traditional rice disease and pest retrieval methods suffer from several limitations. The results often contain a significant amount of redundant data, making it challenging to quickly identify relevant and useful information. Furthermore, a substantial portion of disease and pest control knowledge is scattered across specialized literature, which is difficult to access and time-consuming to retrieve. This makes it difficult for farmers and agricultural experts to efficiently acquire the necessary professional knowledge.

Currently, knowledge graph-based question-answering systems are widely applied in various fields, such as agriculture and healthcare, providing users with efficient and accurate ways to access information. In the agricultural field, with the advancement of artificial intelligence initiatives, China is actively promoting the development of intelligent agricultural informatization. A knowledge graph-based agricultural question-answering system can enable farmers to access a wealth of agricultural knowledge via the Internet, allowing them to acquire rice cultivation expertise. Therefore, our team designed and implemented an intelligent question-answering system based on the RDP knowledge graph. The system not only provides accurate answers, but also has powerful explanatory functions, offering users a novel solution. This approach significantly enhances the efficiency and accuracy of accessing knowledge related to RDP.

Figure 18 illustrates the system’s response to the natural language query, “Where are the distribution locations of the rice stem borer?” The answer not only provides a detailed textual description, but also intuitively presents relevant information through a visual map. Specifically, Figure 18 displays both the Chinese-language interface of the question-answering system and the corresponding English-language interface. Currently, the system is running stably at the Agricultural Economy and Information Research Institute, Anhui Academy of Agricultural Sciences, significantly enhancing the user experience in interactive question-answering.

4. Conclusions

This paper endeavors to construct an RDP knowledge graph to extract knowledge from multi-source data. It explores deep learning-based entity recognition and relationship extraction methods, establishing a semantically rich knowledge base to aid agricultural experts in decision-making. The main work of the paper is as follows:

(1) Addressing the scarcity of data and the need for effective organization of multi-source data, this paper collects data from diverse sources, including the Baidu and rice agriculture websites, to create a comprehensive dataset of Chinese rice diseases and pests.

(2) By integrating the efficient RoBERTa model and multi-head attention into the BiLSTM-CRF model, the paper proposes a multi-layer network extraction model named MARBC. To validate the model’s efficiency and accuracy, comparative experiments were conducted on the constructed RDP dataset. Experimental results demonstrate that the MARBC model achieves the best performance among the compared models in precision, recall, and F1 score, with values of 95.31%, 93.58%, and 94.44%, respectively. Notably, the F1 score surpasses that of the BiLSTM-CRF and BERT-BiLSTM-CRF models by 4.97 and 2.29 percentage points, respectively.

(3) The paper constructs an RDP knowledge graph at both the data and ontology levels, offering a reference for knowledge graph representation techniques and methods in vertical domains. This enhances users’ understanding and problem-solving capabilities regarding rice diseases and pests, facilitating timely warnings and prevention measures.

Although significant progress has been achieved in constructing RDP knowledge graphs, there is still ample room for continuous exploration. Specifically, this exploration can focus on refining construction methods and improving overlapping relationship extraction. Furthermore, enhancing the modalities of data sources, including images, videos, and geographical information, can enrich the study of RDP recognition. This endeavor involves developing multimodal deep learning models to merge features from diverse data sources, thereby enhancing the overall performance of named entity recognition. Additionally, exploring downstream applications based on the RDP knowledge graph, such as question-answering systems and recommendation systems, holds promise for further advancements in the field.

Author Contributions

Conceptualization, D.L.; methodology, C.L. and P.C.; software, C.L. and S.Y.; validation, C.L. and W.D.; formal analysis, C.L. and S.Y.; resources, W.D.; data curation, P.C. and W.D.; writing—original draft preparation, C.L.; writing—review and editing, C.L., S.Y. and D.L.; visualization, S.Y.; supervision, P.C.; project administration, D.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62273001 and 71971002); the Natural Science Foundation of Anhui Province (2108085QA35); the Open Research Fund of the National Engineering Research Center for Agro-Ecological Big Data Analysis and Application, Anhui University (AE202006); and the Anhui Provincial Scientific Research Plan Preparation Project (2024AH052212).

Data Availability Statement

Presented data in this paper are available on request from the corresponding author.

Acknowledgments

We thank the Agricultural Economy and Information Research Institute, Anhui Academy of Agricultural Sciences, for providing the structured rice disease and pest data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rahman, A.N.M.R.B.; Zhang, J. Trends in rice research: 2030 and beyond. Food Energy Secur. 2023, 12, e390.
2. Saberali, S.F.; Darzi-Naftchali, A. Yield gap analysis and the relative importance of factors explaining yield variability in paddy fields. Eur. J. Agron. 2024, 156, 127172.
3. Dunn, L.; Latty, T.; Van Ogtrop, F.F.; Tan, D.K. Cambodian rice farmers’ knowledge, attitudes, and practices (KAPs) regarding insect pest management and pesticide use. Int. J. Agric. Sustain. 2023, 21, 2178804.
4. Muhammed, D.; Ahvar, E.; Ahvar, S.; Trocan, M.; Montpetit, M.-J.; Ehsani, R. Artificial Intelligence of Things (AIoT) for smart agriculture: A review of architectures, technologies and solutions. J. Netw. Comput. Appl. 2024, 228, 103905.
5. Zhang, J.; Tao, D. Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things. IEEE Internet Things J. 2021, 8, 7789–7817.
6. Fountas, S.; Espejo-Garcia, B.; Kasimati, A.; Mylonas, N.; Darra, N. The future of digital agriculture: Technologies and opportunities. IT Prof. 2020, 22, 24–28.
7. Adli, H.K.; Remli, M.A.; Wong, K.N.S.W.S.; Ismail, N.A.; González-Briones, A.; Corchado, J.M.; Mohamad, M.S. Recent advancements and challenges of AIoT Application in smart agriculture: A review. Sensors 2023, 23, 3752.
8. Abu-Salih, B. Domain-specific knowledge graphs: A survey. J. Netw. Comput. Appl. 2021, 185, 103076.
9. Ye, H.; Zhang, N.; Chen, H.; Chen, H. Generative knowledge graph construction: A review. arXiv 2022, arXiv:2210.12714.
10. Peng, C.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge Graphs: Opportunities and Challenges. Artif. Intell. Rev. 2023, 56, 13071–13102.
11. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 494–514.
12. Yang, Y.; Zhuang, Y.; Pan, Y. Multiple knowledge representation for big data artificial intelligence: Framework, applications, and case studies. Front. Inf. Technol. Electron. Eng. 2021, 22, 1551–1558.
13. Keraghel, I.; Morbieu, S.; Nadif, M. A survey on recent advances in named entity recognition. arXiv 2024, arXiv:2401.10825.
14. Yu, H.; Li, H.; Mao, D.; Cai, Q. A relationship extraction method for domain knowledge graph construction. World Wide Web 2020, 23, 735–753.
15. Wu, X.; Duan, J.; Pan, Y.; Li, M. Medical knowledge graph: Data sources, construction, reasoning, and applications. Big Data Min. Anal. 2023, 6, 201–217.
16. Hou, Y.; Liu, B.; Fan, Q.; Zhou, J. Research on the application mode of knowledge graph in education. In Proceedings of the 2023 6th International Conference on Educational Technology Management, Guangzhou, China, 3–5 November 2023.
17. Zheng, J.; Wu, X.; Tan, L.; Xu, P.; Xu, H.; Guo, Z.; Li, C. Intelligent financial risk warning for enterprises through knowledge graph-based deep learning. J. Circuits Syst. Comput. 2024, 33, 2450262.
18. Zhang, F.; Wu, J.; Nie, Y.; Jiang, L.; Zhou, A.; Xie, N. Research of knowledge graph technology and its applications in agricultural information consultation field. In Proceedings of the 2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC), Austin, TX, USA, 6–8 November 2020.
19. Qiao, L.; Li, H.; Wang, W.; Wang, D. A knowledge graph construction method for food nutrition. In Proceedings of the 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT, Niagara Falls, ON, Canada, 17–20 November 2022.
20. Li, Z.; Cheng, L.; Zhang, C.; Zhu, X.; Zhao, H. Multi-source education knowledge graph construction and fusion for college curricula. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies, ICALT, Orem, UT, USA, 10–13 July 2023.
21. Cheng, H.; Wang, K.; Tan, X. A link prediction method for Chinese financial event knowledge graph based on graph attention networks and convolutional neural networks. Eng. Appl. Artif. Intell. 2024, 138, 109361.
22. Zhao, X.; Jiang, R.; Han, Y.; Li, A.; Peng, Z. A survey on cybersecurity knowledge graph construction. Comput. Secur. 2024, 136, 103524.
23. Wu, W.; Wen, C.; Yuan, Q.; Chen, Q.; Cao, Y. Construction and application of knowledge graph for construction accidents based on deep learning. Eng. Constr. Archit. Manag. 2023, 32, 1097–1121.
24. Yang, W.; Yang, S.; Wang, G.; Liu, Y.; Lu, J.; Yuan, W. Knowledge graph construction and representation method for potato diseases and pests. Agronomy 2024, 14, 90.
25. Zhang, W.; Wang, C.; Wu, H.; Zhao, C.; Teng, G.; Huang, S.; Liu, Z. Research on the Chinese named-entity–relation-extraction method for crop diseases based on BERT. Agronomy 2022, 12, 2130.
26. Lu, J.; Yang, W.; He, L.; Feng, Q.; Zhang, T.; Yang, S. A method for extracting fine-grained knowledge of the wheat production chain. Agronomy 2024, 14, 1903.
27. Wang, K.; Miao, Y.; Wang, X.; Li, Y.; Li, F.; Song, H. Research on the construction of a knowledge graph for tomato leaf pests and diseases based on the named entity recognition model. Front. Plant Sci. 2024, 15, 1482275.
28. Memon, J.; Sami, M.; Khan, R.A.; Uddin, M. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access 2020, 8, 142642–142668.
29. Asim, M.N.; Wasim, M.; Khan, M.U.G.; Mahmood, W.; Abbasi, H.M. A survey of ontology learning techniques and applications. Database 2018, 2018, bay101.
30. Chen, J.; Mashkova, O.; Zhapa-Camacho, F.; Hoehndorf, R.; He, Y.; Horrocks, I. Ontology embedding: A survey of methods, applications and resources. arXiv 2024, arXiv:2406.10964.
31. Naqvi, M.R.; Elmhadhbi, L.; Sarkar, A.; Archimede, B.; Karray, M.H. Survey on ontology-based explainable AI in manufacturing. J. Intell. Manuf. 2024, 35, 3605–3627.
32. Musen, M.A. The protégé project: A look back and a look forward. AI Matters 2015, 1, 4–12.
33. Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
34. Wang, S.; Wang, X.; Wang, S.; Wang, D. Bi-directional long short-term memory method based on attention mechanism and rolling update for short-term load forecasting. Int. J. Electr. Power Energy Syst. 2019, 109, 470–479.
35. Reza, S.; Ferreira, M.C.; Machado, J.; Tavares, J.M.R. A multi-head attention-based transformer model for traffic flow forecasting with a comparative analysis to recurrent neural networks. Expert Syst. Appl. 2022, 202, 117275.
36. Bose, K.; Sarkar, K. Named entity recognition in bengali and hindi using muril and conditional random fields. SN Comput. Sci. 2024, 5, 856.
37. Ma, R.; Han, X.; Yan, L.; Khan, N.; Ma, Z. Modeling and querying temporal RDF knowledge graphs with relational databases. J. Intell. Inf. Syst. 2023, 61, 569–609.
38. Miller, J.J. Graph database applications and concepts with Neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, 23–24 March 2018; pp. 141–147.

Figure 1. Block diagram of constructing knowledge graph for RDPs.

Figure 2. Flowchart for collecting multi-source heterogeneous RDP data.

Figure 3. Extraction process of semi-structured data.

Figure 4. Example of RDP data annotation based on Label-Studio tool.

Figure 5. Example of RDP data annotation based on BIESO.

Figure 6. Ontology construction process of RDP.

Figure 9. Framework of MARBC model.

Figure 10. Input representations of RoBERTa model.

Figure 11. LSTM structure.

Figure 12. Multi-head attention mechanism.

Figure 14. Entity labeling results of MARBC model.

Figure 15. Head entity label comparison results. (Model A represents BiLSTM-CRF, Model B represents BERT-BiLSTM-CRF, and Model C represents MARBC).

Figure 16. Entity relationship label comparison results. (Model A represents BiLSTM-CRF, Model B represents BERT-BiLSTM-CRF, and Model C represents MARBC).

Figure 17. Example visualization of RDP knowledge graph.

Figure 18. Example of question-answering system based on RDP knowledge graph.

Table 1. Examples of selected raw data in relational database.

Entity (English Name)	Synonym	Pathogen Type	Geographical Distribution
Rice bakanae disease	Phytophthora	Fusarium oxysporum	The whole world
Rice ragged stunt virus	Dwarfism with stippled epiphyses	Rice tungro bacilliform virus	Guangdong, Guangxi, Hunan, Fujian
Rice dwarf	General dwarf, green dwarf	Plant reovirus group viruses	Southern China

Table 4. Optimal model parameters.

Parameter Name	Parameter Value
batch_size	32
seq_max_len	128
dropout	0.5
learning rate	1 × 10⁻⁵

Table 6. Ablation experiments.

Model	Precision (%)	Recall (%)	F1 (%)
RoBERTa-MA-CRF(RMAC)	93.48	91.66	93.56
RoBERTa-BiLSTM-CRF(RBC)	92.35	93.01	92.68
BiLSTM-MA-CRF(BMAC)	91.72	89.70	90.70
MARBC	95.31	93.58	94.44

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

