1. Introduction
Diseases and pests are among the most important factors affecting the quality and yield of potatoes. The Great Famine in Ireland in the mid-19th century, caused by potato blight, is the best-known example [1]. The prevention and control of potato diseases and pests is therefore essential. However, potato growers often lack specialised, systematic knowledge, and authoritative knowledge on potato disease and pest control is mostly contained in books or on certain agricultural websites. Such knowledge is dispersed and poorly interlinked, which makes it hard for potato growers and managers to access and to apply systematically. A knowledge graph addresses this problem: it extracts knowledge from fragmented, unrelated, multi-source heterogeneous data and organises it as “entity-relationship-entity” triplets with corresponding “attribute-value” pairs, providing an important way to quickly query and obtain systematic expert knowledge [2]. For example, when a grower finds that potato leaves show brown spots with irregular texture, no concentric whorls, and white mould on the spots, he or she can enter these features into the potato pests and diseases knowledge base as a query. Based on semantic association, the query returns the corresponding disease name, i.e., potato late blight, together with the cause of the disease, its relationship with the environment, and the corresponding control methods. This helps potato growers to understand the causes of the disease correctly and to treat it as soon as possible.
However, the complex structure and variety of potato pest and disease data pose great challenges to the automatic construction of such a knowledge graph, and there are few research reports on the construction and application of a knowledge graph for potato pest and disease control. Therefore, with the goal of constructing a knowledge graph of potato pests and diseases, this paper studies named entity recognition and relationship extraction methods based on deep learning and establishes a model that extracts fragmented knowledge of potato pests and diseases from a large amount of data to form a knowledge base with semantic relationships. The resulting knowledge base can assist agricultural experts in decision-making, provide knowledge guidance for potato growers and managers, and provide knowledge support for the intelligent question answering and recommendation systems of agricultural intelligent service platforms [3,4]. In this way, static knowledge is put into circulation and applied quickly to agricultural production practice, which gives the work both theoretical and practical value.
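The symptom-based query described above can be illustrated with a minimal sketch. It assumes the knowledge base is held as a plain list of “entity-relationship-entity” triplets; the relation name `has_symptom`, the function `rank_diseases`, and the early blight contrast entry are hypothetical and are not taken from the knowledge graph built in this paper.

```python
from collections import defaultdict

# Toy triplets in (head entity, relationship, tail entity) form.
# The late blight symptoms follow the example in the text; the relation
# name "has_symptom" is an assumption made for this sketch.
TRIPLES = [
    ("potato late blight", "has_symptom", "brown leaf spots"),
    ("potato late blight", "has_symptom", "irregular spot texture"),
    ("potato late blight", "has_symptom", "no concentric whorls"),
    ("potato late blight", "has_symptom", "white mould on spots"),
    # Contrast entry added for illustration only; not from the paper.
    ("potato early blight", "has_symptom", "brown leaf spots"),
    ("potato early blight", "has_symptom", "concentric whorls"),
]


def rank_diseases(observed, triples=TRIPLES):
    """Rank head entities by how many observed symptoms they are linked to."""
    counts = defaultdict(int)
    for head, relation, tail in triples:
        if relation == "has_symptom" and tail in observed:
            counts[head] += 1
    # Most matches first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


observed = {
    "brown leaf spots",
    "irregular spot texture",
    "no concentric whorls",
    "white mould on spots",
}
ranking = rank_diseases(observed)
print(ranking)
```

A graph database query would replace the list scan in practice; the sketch only shows how triplets link observed features to a disease entity.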
Named entity recognition (NER) and relationship extraction (RE) are two core tasks in knowledge graph construction [5,6]. With the development of deep learning technologies, NER and RE models based on recurrent neural networks (RNNs) [7,8] and Bidirectional Encoder Representations from Transformers (BERT) [9] have performed well and become the main methods for knowledge extraction. They are widely used in the automatic construction of knowledge bases in agriculture, medicine, finance, and other fields. For example, the RNN-based NER model developed in [10] can accurately recognise entities in disease and pest data. Researchers have also found that extracting features by combining word embeddings with neural networks improves the recognition of entities and relationships. Zhang et al. [11] constructed an entity recognition model using word2vec and BiLSTM-CRF, which significantly improved on the performance of the original model. However, word2vec embeddings are static and cannot handle polysemy. To address this, BERT word embeddings [12,13] were introduced. Reference [14] combines the character-level features of BERT representations with external dictionary features for NER in the agricultural field, achieving a recognition accuracy of 94.84%. However, the BERT model has a large number of parameters, which makes training time-consuming. For this reason, the lightweight ALBert (A Lite BERT) model [15] was developed, which reduces the parameter space mainly by sharing parameters. For example, Chen et al. used ALBert to improve the effectiveness of NER while maintaining efficiency. The BiLSTM model can capture global contextual information, but it struggles to capture long-range dependencies. Attention mechanisms are used to address this problem. For example, an attention mechanism introduced into BiLSTM-CRF has been successfully applied to entity extraction tasks in the agricultural domain [16]. Guo et al. constructed the Att-BiLSTM-CRF model for pest and disease feature extraction, which solved the problem of missing inner semantic information [17].
In summary, this paper introduces the efficient ALBert model and an attention mechanism into the BiLSTM-CRF model to construct a multilayer joint knowledge extraction model, which extracts complex nested knowledge from the potato pest and disease corpus. This method balances the efficiency and complexity of the model, provides a scheme for the automatic construction of large-scale knowledge graphs, and offers experimental guidance for the construction of multilayer models.
Furthermore, compared to knowledge graph construction, knowledge representation learning [18] is more important. It can not only explore the implicit relationships between disease and pest entities, so as to infer new knowledge, but also discover the interactions between different pests and diseases, so as to better predict and prevent their occurrence. Therefore, knowledge representation learning based on the constructed knowledge graph is further explored to achieve knowledge inference and fusion. In turn, automatic knowledge updating can be achieved, which greatly enhances the value of the knowledge graph.
4. Results
In order to improve the representation and reasoning capabilities of the knowledge graph constructed in this paper, knowledge representation technology was further employed to enhance its scalability and increase its application value.
4.1. Experimental Dataset
APOC (A Package of Components) technology was used to export data from the Neo4j graph database as a CSV-type relationship file, which was then processed into entity txt files, relationship txt files, and triplet txt files. The entity file was a mapping of entities to entity IDs, with each line in the format of “entity\t entityID”; the relationship file was a mapping of relationships to relationship IDs, with each line in the format of “relationship\t relationshipID”; each line of the triplet file was in the format of “head entity\t tail entity\t relationship”. Finally, the triplet file was divided into training, validation, and test sets in an 8:1:1 ratio.
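The file preparation and the 8:1:1 split can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing script used in the experiment: the function names, the use of integer IDs inside the triplet lines, the random seed, and the synthetic placeholder triplets are all hypothetical.

```python
import random


def build_id_maps(triples):
    """Assign consecutive integer IDs to entities and relationships."""
    entity2id, relation2id = {}, {}
    for head, relation, tail in triples:
        for entity in (head, tail):
            entity2id.setdefault(entity, len(entity2id))
        relation2id.setdefault(relation, len(relation2id))
    return entity2id, relation2id


def mapping_lines(name2id):
    """Lines in the 'name<TAB>ID' format of the entity and relationship files."""
    return [f"{name}\t{idx}" for name, idx in name2id.items()]


def triple_lines(triples, entity2id, relation2id):
    """Lines in the 'head<TAB>tail<TAB>relationship' order described above.
    Writing IDs rather than names here is an assumption of this sketch."""
    return [
        f"{entity2id[h]}\t{entity2id[t]}\t{relation2id[r]}"
        for h, r, t in triples
    ]


def split_8_1_1(items, seed=0):
    """Shuffle and split into training, validation, and test sets (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


# Synthetic placeholder triplets standing in for the exported CSV rows.
toy = [(f"entity_{i}", f"relation_{i % 3}", f"entity_{i + 1}") for i in range(20)]
entity2id, relation2id = build_id_maps(toy)
lines = triple_lines(toy, entity2id, relation2id)
train, valid, test = split_8_1_1(lines)
print(len(train), len(valid), len(test))
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when several embedding models are compared on the same test set.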
4.2. Evaluation Metrics
In knowledge graph representation learning, MRR (mean reciprocal rank) and Hit@N (such as Hit@10 and Hit@3) are typically used to evaluate model performance. MRR is the mean of the reciprocal ranks of the correct entities, so a higher value indicates that the correct entities are ranked nearer the top. Hit@N measures the proportion of correct entities ranked in the top N, with a higher value indicating better performance. Here, N was set to 10 and 3. The higher the MRR and Hit@N scores, the closer the model’s predictions are to the actual values, and the better the model’s performance.
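As a minimal sketch of how these two metrics are computed, assume each test triplet yields the rank (1 = best) of its correct entity among all candidates. The rank values below are hypothetical and are not results from this paper.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all test triplets; higher is better, maximum 1."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of test triplets whose correct entity is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Hypothetical ranks of the correct entity for four test triplets.
ranks = [1, 2, 4, 20]

mrr = mean_reciprocal_rank(ranks)   # (1 + 1/2 + 1/4 + 1/20) / 4
hit3 = hits_at_n(ranks, 3)          # ranks 1 and 2 fall within the top 3
hit10 = hits_at_n(ranks, 10)        # ranks 1, 2, and 4 fall within the top 10
print(mrr, hit3, hit10)
```

Because MRR averages reciprocals, a single correct entity ranked first contributes far more than several ranked around tenth, whereas Hit@10 treats all ranks from 1 to 10 alike; reporting both gives a fuller picture.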
4.3. Experimental Environment and Parameter Indicators
The experiment was run on the Windows 11 operating system using Python 3.7 in the PyCharm IDE with the TensorFlow 1.14.0 learning framework. The model training parameters were as follows: learning_rate was 0.001, hidden_size (the length of the entity and relation vectors) was 200, and batch_size (the number of triplets fed to the model in each batch) was 300. In the TransR model, hidden_sizeR was set to only 10, and the length of the relationship vector was likewise kept at 10, to ensure the model’s performance.
4.4. Experiment and Results Analysis
The knowledge representation performance of the TransE, TransH, TransR, and TransD models on the constructed knowledge base was compared in the experiment. The results are shown in Table 6 and visualised in Figure 10. The Hit@10, Hit@3, and MRR metrics of the TransH model are all higher than those of the other three models, indicating that the TransH model is well suited to capturing and representing the special semantic relationships in this knowledge graph. In addition, the TransH model can effectively handle the 1-to-N, N-to-1, and N-to-N relationships among the entities in the knowledge graph. However, the metrics of the TransD model in the experimental results are low, which differs somewhat from the theoretical analysis of this model. A possible reason is that the experimental dataset contains too few types of entity relationships to fully demonstrate the performance of the model. Therefore, the TransH model was chosen to represent the knowledge graph of potato diseases and pests in this paper. The change in the training loss of the TransH model over epochs is illustrated in Figure 11, which shows that the loss converges well.
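To make the point about 1-to-N relationships concrete, the following sketch contrasts the TransE and TransH scoring functions on a two-dimensional toy example. It uses the standard published definitions (TransE scores a triplet by the distance between h + r and t; TransH first projects h and t onto a relation-specific hyperplane with unit normal w_r and then translates by d_r). The vectors are hand-picked for illustration and are not embeddings learned in this experiment, and the plain L2 distance is used instead of its square for readability.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def project(v, w):
    """Project v onto the hyperplane whose unit normal vector is w."""
    s = dot(w, v)
    return [x - s * wi for x, wi in zip(v, w)]


def transe_score(h, r, t):
    """TransE: distance between h + r and t (lower means more plausible)."""
    return norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def transh_score(h, d_r, w_r, t):
    """TransH: the same translation, applied after projecting h and t
    onto the relation-specific hyperplane."""
    h_p, t_p = project(h, w_r), project(t, w_r)
    return norm([hp + d - tp for hp, d, tp in zip(h_p, d_r, t_p)])


# One head entity with two different tails under the same relation (1-to-N).
h = [1.0, 0.0]
t1 = [2.0, 1.0]
t2 = [2.0, -3.0]
d_r = [1.0, 0.0]   # translation vector lying in the hyperplane
w_r = [0.0, 1.0]   # unit normal of the hyperplane

print(transh_score(h, d_r, w_r, t1), transh_score(h, d_r, w_r, t2))
print(transe_score(h, d_r, t1), transe_score(h, d_r, t2))
```

In this toy setting both tails differ only along the normal direction, so TransH can score both triplets as perfectly plausible while keeping the two tail embeddings distinct. TransE with a single translation vector can reach a zero score for both only if the two tail embeddings coincide.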