1. Introduction
In recent years, the agrifood sector has become knowledge-intensive, with the need for standardized data and information exchange continuously increasing [1]. Methods for extracting and retrieving data and information are important for developing services and applications that automate agrifood production. The extraction of agricultural terms from domain corpora is a task usually implemented at the beginning of pipelines for knowledge base creation and updates, question answering, and text summarization. Advances in language modeling based on transformer architectures have enabled term extraction methods that capture the semantic context in which terms occur [2]. In addition, ranked lists of terms, instead of unordered term sets, can be helpful in assigning weights to tags for indexing documents or in prioritizing the terms to consider for knowledge base updates. The rationale is that not all terms are equally important. Tf-idf is widely used for computing term importance. However, it is a document frequency-based metric that ignores immediate contextual factors such as the words around a term, word sequences, or their respective semantics. It is reasonable to assume that an agricultural term’s context plays a role in its importance. To the best of our knowledge, there is currently little or no research on this question.
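To make the frequency-based baseline concrete, tf-idf can be sketched in a few lines of Python. This is a minimal illustration of one common variant (raw term frequency, unsmoothed idf) on invented toy abstracts, not the exact configuration used in our experiments.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """tf-idf of `term` in one document, relative to a corpus.

    tf  = count of the term in the document / document length
    idf = log(N / df), where df = number of documents containing the term.
    Note that nothing here looks at the words *around* the term:
    tf-idf is purely frequency-based.
    """
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus_tokens if term in d)
    idf = math.log(len(corpus_tokens) / df)
    return tf * idf

# Toy corpus of three pre-processed "abstracts".
corpus = [
    "wheat yield increased under drought stress".split(),
    "drought stress reduced maize yield".split(),
    "soil microbiome diversity under wheat cultivation".split(),
]
score = tf_idf("wheat", corpus[0], corpus)  # (1/6) * log(3/2)
```

Two documents with identical counts of a term always receive the same score, regardless of the surrounding words, which motivates the context-aware perspective investigated here.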
In this paper, we investigate whether there is a relation between agricultural term importance and context. We conceptualize context as the semantics of the text in which a term occurs and which is provided to a neural network-based language model for vectorization. Context is encoded in the embeddings (real-valued vectors) output by a language model. However, embeddings are not human-interpretable and provide no direct information about a term’s context. Therefore, we use the semantic distances of the generated embeddings as a proxy for context. We computed the correlation between the tf-idf scores of agricultural concepts and the following variables used as a proxy for term context: (i) the semantic distances between the various occurrences of the concept; (ii) the semantic distances between the concept and the abstracts it appears in; and (iii) the semantic distances between the abstracts in which the concept occurs. Our term set consisted of 50 concepts randomly extracted from AGROVOC. We used it to build a dataset of approximately 33.7 K AGRIS abstracts published in 2023 and 2024. Text vectorization was carried out with Agriculture-BERT [3].
2. Related Work
Agrifood-related research using embeddings is limited. Existing work focuses on semantic matching for question answering tasks and chatbot applications. Rezayi et al. [4] pre-trained BERT from scratch for semantically matching food-related text data to data on the nutritional value of food. The task is performed by computing the semantic distances between food data vectors and food nutritional value vectors. Word and text embeddings are also used in research on on-demand farmer support using chatbots. Such work is presented in [5], which describes a chatbot providing support tailored to parameters such as the type of crop production and geographic location. In [6], a chatbot is proposed for farmer support in weather forecasting, plant protection, market rates, and policy measures. Sen2vec [7] is used to vectorize text to match user queries to responses indexed in the system database. Initial steps in using the semantic context of words to compute their importance are taken in [8]. The study presents a method of keyword extraction from medical documents using semantically enriched tf-idf. Semantic enrichment is achieved with the embeddings generated by word2vec [9].
3. Materials and Methods
3.1. Dataset and Measure of Concept Importance
Our dataset consisted of approximately 33.7 K AGRIS abstracts. The terms used to build the dataset were 50 unigram concepts randomly extracted from AGROVOC and contained in the AGRIS abstracts. Acronyms were not considered. The abstracts were indexed in AGRIS in 2023 and 2024; we restricted the dataset to these years to reduce the risk of drift in the meaning of concepts over time. Abstracts with more than 350 words were excluded (see Section 3.2). Abstracts were cleaned of noise (e.g., non-alphanumeric characters) before being used for concept and text vectorization. Additional pre-processing (stopword and punctuation removal, text lower-casing) was carried out before the abstracts were used for the computation of concept importance. The importance of a concept for the dataset was computed by averaging the per-abstract tf-idf scores of the concept. Dataset statistics are provided in Table 1.
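The pre-processing and importance computation described above can be sketched as follows; the stopword list and the per-abstract scores are illustrative only.

```python
import re

STOPWORDS = {"the", "of", "in", "and", "a", "to", "on"}  # illustrative subset

def preprocess(text):
    """Lower-case, strip non-alphanumeric characters, drop stopwords
    (the extra pre-processing applied before computing tf-idf)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def concept_importance(per_abstract_tfidf):
    """Corpus-level importance of a concept: the mean of its
    per-abstract tf-idf scores."""
    return sum(per_abstract_tfidf) / len(per_abstract_tfidf)

tokens = preprocess("Effects of Drought on Wheat (2023)!")
importance = concept_importance([0.012, 0.008, 0.015, 0.009])  # mean = 0.011
```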
3.2. Text Vectorization and Semantic Distance Computation
Text vectorization was carried out with Agriculture-BERT, a model pre-trained on agricultural texts. For text vectorization, we used the cleaned (not pre-processed) abstracts. Given BERT’s 512-token limit, we used abstracts of 350 words maximum. This threshold was set after preliminary tests showed that BERT’s tokenizer produces 1.5 tokens per word on average. The embedding of the [CLS] token was used as the embedding of each abstract. [CLS] is a special token added by BERT at the beginning of each abstract [10]. Concept embeddings were derived from their token embeddings by mean pooling. We considered all concept occurrences in the dataset. Embeddings were retrieved from the tensor output by Agriculture-BERT’s last hidden state.
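The two pooling steps can be sketched with NumPy. Since loading Agriculture-BERT is beyond the scope of this sketch, a random array stands in for the last hidden state, and the concept’s word-piece positions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one abstract's last hidden state:
# shape (sequence_length, hidden_size); BERT-base uses hidden_size = 768.
last_hidden_state = rng.normal(size=(128, 768))

# [CLS] sits at position 0; its vector is used as the abstract embedding.
abstract_embedding = last_hidden_state[0]

# A concept split into several word-piece tokens is embedded by
# mean-pooling its token vectors (hypothetical positions 17-19).
concept_positions = [17, 18, 19]
concept_embedding = last_hidden_state[concept_positions].mean(axis=0)
```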
Semantic distances were measured using cosine similarities. In our work, increased similarity (a small semantic distance) between the embeddings of the various occurrences of a concept indicates the fixed use of the concept (i.e., the concept is not polysemous). High similarities between the embeddings of the abstracts that a concept occurs in show that there is also a specific semantic context in which the concept appears. High concept–abstract similarities indicate a close match between a concept’s meaning and the information that the abstract conveys, showing that the abstract is a good semantic context for the concept.
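A minimal cosine similarity helper illustrates the measure; taking distance as the complement of similarity is one common convention, assumed here for illustration (any monotone transform yields the same Spearman ranks).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embeddings (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance(u, v):
    """Distance as the complement of cosine similarity (one convention)."""
    return 1.0 - cosine_similarity(u, v)

sim = cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0]))  # 1/sqrt(2)
```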
4. Results
Relations between concept importance and semantic context were sought by computing correlations between the tf-idf scores of the AGROVOC concepts in the corpus and each of the semantic distances. We averaged each of the semantic distances computed per concept.
Table 2 provides statistics related to concept importance. Statistics for the semantic distances are available in Table 3.
To determine the distributions of the concept importance and semantic distance variables, we ran the Kolmogorov–Smirnov test. Only the variable “semantic distance of concept from abstracts where it occurs” had a normal distribution. Consequently, Spearman’s rho was computed. The results are shown in Table 4.
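For reference, Spearman’s rho is the Pearson correlation of rank-transformed data. The sketch below covers the tie-free case only (in practice, scipy.stats.spearmanr also handles ties and provides p-values); the input values are invented.

```python
def spearman_rho(x, y):
    """Spearman's rho for tie-free data: Pearson correlation of the ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented concept-importance and semantic-distance values with the
# same rank order, giving rho = 1.0.
rho = spearman_rho([0.010, 0.012, 0.008, 0.015],
                   [0.30, 0.35, 0.28, 0.41])
```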
The results show a tendency towards a statistically significant, positive correlation between concept importance and the semantic distance of the occurrences of the concept (r = 0.269; p = 0.059). There is also a tendency towards a statistically significant, negative correlation between concept importance and the semantic distance of concepts from the abstracts in which they occur (r = −0.225; p = 0.116). Concept importance and the semantic distance of the abstracts containing the concept are not significantly correlated (r = 0.139; p = 0.337). It is worth highlighting the statistically significant, positive correlations between the semantic distance of the abstracts containing a concept and the semantic distance of (i) the concept occurrences in the corpus and (ii) the concept from the abstracts containing it.
5. Discussion and Conclusions
In our work, we investigated whether there is a relation between agricultural term importance and semantic context. To the best of our knowledge, this work is novel. Researching the relation between term importance and semantic context is a significant step towards term importance computation methods that consider semantic, context-related factors. Such methods can power the development of information search and delivery applications tailored to the needs of farmers and extension services. Because they are not based on mere measures of occurrence, they can be useful in assigning weights to tags for document indexing or in prioritizing the terms to consider for knowledge base updates, thus enabling the delivery of up-to-date information that accurately reflects the domain advances described in the corpus used for term extraction. Our results show a tendency towards statistical significance in some cases. The statistically significant correlations between the semantic distances validate our decision to consider them a proxy for context. The negative correlation between concept importance and the semantic distances of concepts from the abstracts in which they occur needs attention: it implies that the more important a concept, the lower its semantic similarity to the abstracts it appears in. The random selection of concepts from AGROVOC for building our dataset is a limitation that may explain our results: the concept tf-idf scores for the dataset were low and also had low variability (std. deviation = 0.003133). Our focus on unigram concepts to minimize complexity is also a limitation. The 350-word maximum text length can also be considered a limitation; however, it was necessary in order to address BERT’s token limit.
Based on our results, further research is needed to investigate the relation between term importance and the semantic context of term occurrence. Given the low corpus-level tf-idf scores (obtained by averaging the per-abstract tf-idf scores of concepts), we will use datasets, or samples from a dataset, with greater variability in term importance. This will allow for rigorous conclusions as to whether a relation between term importance and semantic context exists. We will also broaden the scope of our research by considering multi-word terms. We intend to use texts from sources other than AGRIS to better address the need for a dataset containing terms of more distinct importance.
Author Contributions
Conceptualization, H.P.; methodology, H.P.; software, P.S. and H.P.; validation, H.P., P.S. and X.W.; formal analysis, H.P. and P.S.; investigation, H.P. and P.S.; resources, H.P. and P.S.; data curation, H.P. and P.S.; writing—original draft preparation, H.P.; writing—review and editing, P.S., X.W. and C.B.; visualization, H.P.; supervision, C.B.; project administration, H.P.; funding acquisition, H.P. and C.B. All authors have read and agreed to the published version of the manuscript.
Funding
This work is partly supported by the Horizon Europe EU-FarmBook project with contract number 101060382.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Subirats-Coll, I.; Kolshus, K.; Turbati, A.; Stellato, A.; Mietzsch, E.; Martini, D.; Zeng, M. AGROVOC: The linked data concept hub for food and agriculture. Comput. Electron. Agric. 2022, 196, 105965.
- Tran, H.T.H.; Martinc, M.; Caporusso, J.; Doucet, A.; Pollak, S. The Recent Advances in Automatic Term Extraction: A Survey. arXiv 2023, arXiv:2301.06767.
- Quadros, V.P. BERT for Agriculture Domain. 2021. Available online: https://medium.com/@vionaquadros/bert-for-agriculture-domain-f655d80c7da4 (accessed on 12 March 2024).
- Rezayi, S.; Liu, Z.; Wu, Z.; Dhakal; Ge, B.; Zhen, C.; Liu, T.; Li, S. AgriBERT: Knowledge-Infused Agricultural Language Models for Matching Food and Nutrition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Special Track on AI for Good, Vienna, Austria, 23–29 July 2022; pp. 5150–5156.
- Jain, N.; Jain, P.; Kayal, P.; Sahit, J.; Pachpande, S.; Choudhari, J.; Singh, M. AgriBot: Agriculture-Specific Question Answer System. 2019. Available online: https://osf.io/preprints/indiarxiv/3qp98_v1 (accessed on 7 April 2024).
- Gounder, S.; Patil, M.; Rokade, V.; More, N. Agrobot: An agricultural advancement to enable smart farm services using NLP. J. Emerg. Technol. Innov. Res. 2021, 8, 445–454.
- Arora, S.; Liang, Y.; Ma, T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Jalilifard, A.; Caridá, V.F.; Mansano, A.F.; Cristo, R.S.; da Fonseca, F.P.C. Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In Advances in Computing and Network Communications; Thampi, S.M., Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, K.C., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; Volume 736, pp. 327–337.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA, 2–4 May 2013.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).