Proceeding Paper

Unveiling the Secrets of Embeddings: Does the Importance of Agricultural Terms Relate to the Context They Occur In? †

by Hercules Panoutsopoulos 1,2,*, Panagiotis Stamatelopoulos 2, Xu Wang 1 and Christopher Brewster 1,3

1 Institute of Data Science, Maastricht University, Paul-Henri Spaaklaan 1 (PHS1), 6229 EN Maastricht, The Netherlands
2 Department of Natural Resources Development & Agricultural Engineering, Agricultural University of Athens, 75 Iera Odos Street, GR11855 Athens, Greece
3 Data Science Group, TNO, Kampweg 55, 3769 DE Soesterberg, The Netherlands
* Author to whom correspondence should be addressed.
† Presented at the 11th International Conference on Information and Communication Technologies in Agriculture, Food & Environment, Samos, Greece, 17–20 October 2024.
Proceedings 2025, 117(1), 5; https://doi.org/10.3390/proceedings2025117005
Published: 18 April 2025

Abstract

Advances in language modeling have provided affordances for term extraction by capturing the lexical context and its semantics and encoding them in real-valued vectors (embeddings). Term importance is usually computed using quantitative measures, ignoring the semantic context. Until now, there has been limited or no research on the effect of context on term importance using machine learning methods. In this paper, we investigate whether there is a relation between the importance of agricultural terms and the context of their occurrence as represented by text embeddings. Using a dataset of almost 33.7 K AGRIS abstracts containing 50 concepts randomly extracted from AGROVOC, we computed the correlation between the concept tf-idf scores and each of three semantic distances (cosine similarity of embeddings) used as a proxy for context: (i) the semantic distances of the various occurrences of the concept; (ii) the semantic distances between the concept and the abstracts that it appears in; (iii) the semantic distances of the abstracts in which the concept occurs. Embeddings were generated using Agriculture-BERT. We present a methodology and initial results from the computation of correlations. The novelty of our work is in the systematic investigation of the relation between term importance and semantic context.

1. Introduction

In recent years, the agrifood sector has become knowledge-intensive, with the need for standardized data and information exchange continuously increasing [1]. Methods for extracting and retrieving data and information are important for developing services and applications automating agrifood production. The extraction of agricultural terms from domain corpora is a task usually implemented at the beginning of pipelines for knowledge base creation/updates, question answering, and text summarization. Advances in language modeling using transformer technologies have provided affordances for term extraction based on the capture of the semantic context in which terms occur [2]. In addition, ranked lists of terms, instead of unordered term sets, can be helpful in assigning weights to tags for indexing documents or setting priorities regarding the terms to consider for knowledge base updates. The rationale is that not all terms are equally important. Tf-idf is widely used for term importance computation. However, it is a document frequency-based metric, ignoring immediate contextual factors such as the words around a term, word sequences, or their respective semantics. It is valid to assume that an agricultural term’s context plays a role in term importance. To the best of our knowledge, there is currently limited or no research on this.
In this paper, we investigate whether there is a relation between agricultural term importance and context. We conceptualize context as the semantics of the text in which a term occurs, which is provided to a neural network-based language model for vectorization. This context is encoded in the embeddings (real-valued vectors) output by the language model. However, embeddings are not human-interpretable and provide no direct information about a term’s context. Therefore, we use the semantic distances of the generated embeddings as a proxy for context. We computed the correlation between the tf-idf scores of agricultural concepts and the following variables used as a proxy for term context: (i) the semantic distances of the various occurrences of the concept; (ii) the semantic distances between the concept and the abstracts that it appears in; and (iii) the semantic distances of the abstracts in which the concept occurs. Our term set consisted of 50 concepts randomly extracted from AGROVOC. We used it to build a dataset of approximately 33.7 K AGRIS abstracts published in 2023 and 2024. Text vectorization was carried out with Agriculture-BERT [3].

2. Related Work

Agrifood-related research using embeddings is limited. Existing work focuses on semantic matching for question answering tasks and chatbot applications. Rezayi et al. [4] pre-trained BERT from scratch for semantically matching food-related text data to data on the nutritional value of food. The task is performed by computing the semantic distances of food data vectors to those of food nutritional value. Word and text embeddings are also used in research into on-demand farmer support using chatbots. Such work is presented in [5], which describes a chatbot providing support tailored to parameters such as the type of crop production and geographic location. In [6], a chatbot is proposed for farmer support in weather forecasting, plant protection, market rates, and policy measures. Sen2vec [7] is used to vectorize text to match user queries to responses indexed in the system database. Initial steps in using the semantic context of words to compute their importance are taken in [8]. The study presents a method of keyword extraction from medical documents using semantically enriched tf-idf. Semantic enrichment is achieved with the embeddings generated by word2vec [9].

3. Materials and Methods

The steps implemented in our work are illustrated in Figure 1. All code and datasets are available in the paper’s GitHub repository (https://github.com/herculespan/term-to-context_relation), created on 17 May 2024.

3.1. Dataset and Measure of Concept Importance

Our dataset consisted of approximately 33.7 K AGRIS abstracts. The terms used to build it were 50 unigram concepts randomly extracted from AGROVOC and contained in the AGRIS abstracts. Acronyms were not considered. We restricted the dataset to abstracts indexed in AGRIS in 2023 and 2024 to reduce the risk of drift in the meaning of concepts over time. Abstracts with more than 350 words were excluded (see Section 3.2). Abstracts were cleaned of noise (e.g., non-alphanumeric characters) before being used for concept and text vectorization. Additional pre-processing (stopword and punctuation removal, lower-casing) was carried out before abstracts were used for the computation of concept importance. The importance of a concept for the dataset was computed by averaging the per-abstract tf-idf scores of the concept. Dataset statistics are provided in Table 1.
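The averaging step described above can be illustrated with a minimal sketch. It assumes one common tf-idf variant (term frequency normalized by abstract length, idf = ln(N/df)) and assumes that the average is taken over the abstracts containing the concept; the text does not specify either choice, and the function names and toy abstracts below are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumed variant (not stated in the paper):
    tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens)
        scores.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def concept_importance(concept, docs):
    """Average the concept's tf-idf over the abstracts that contain it."""
    per_doc = [s[concept] for s in tfidf_scores(docs) if concept in s]
    return sum(per_doc) / len(per_doc) if per_doc else 0.0


# Toy, already pre-processed abstracts (lower-cased, stopwords removed).
abstracts = [
    ["soil", "erosion", "reduces", "maize", "yield"],
    ["maize", "irrigation", "improves", "yield"],
    ["honey", "production", "costs"],
]
print(concept_importance("maize", abstracts))
```

Averaging over all abstracts in the corpus, rather than only those containing the concept, would be an equally plausible reading and would yield lower scores for rare concepts.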

3.2. Text Vectorization and Semantic Distance Computation

Text vectorization was carried out with Agriculture-BERT, a model pre-trained on agricultural texts. For text vectorization, we used the cleaned (not pre-processed) abstracts. Given BERT’s 512-token limit, we used abstracts of 350 words maximum. We set this threshold after tests showing that BERT’s tokenizer splits each word into 1.5 tokens on average. The embedding of the [CLS] token was used as the embedding of each abstract. [CLS] is a special token added by BERT at the beginning of each abstract [10]. Concept embeddings were obtained from their token embeddings by mean pooling. We considered all concept occurrences in the dataset. Embeddings were retrieved from the tensor output by Agriculture-BERT’s last hidden state.
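The two pooling strategies can be sketched on a toy hidden-state matrix, without loading a model. The sketch assumes that the last hidden state is available as a list of per-token vectors with [CLS] at position 0, and that the positions of the concept's sub-tokens are already known; the helper names and the numbers are illustrative only.

```python
def cls_embedding(hidden_states):
    """Abstract embedding: the vector at position 0, where BERT places [CLS]."""
    return list(hidden_states[0])


def mean_pool(hidden_states, token_positions):
    """Concept embedding: element-wise mean of the concept's sub-token vectors."""
    vectors = [hidden_states[i] for i in token_positions]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy last hidden state: 5 tokens x 3 dimensions (hypothetical values;
# BERT-base models produce 768-dimensional vectors).
last_hidden_state = [
    [0.1, 0.2, 0.3],  # [CLS]
    [1.0, 0.0, 2.0],  # first sub-token of the concept
    [3.0, 4.0, 0.0],  # second sub-token of the concept
    [0.5, 0.5, 0.5],  # another word
    [0.0, 0.0, 0.0],  # [SEP]
]

abstract_vec = cls_embedding(last_hidden_state)
concept_vec = mean_pool(last_hidden_state, [1, 2])
print(abstract_vec, concept_vec)
```

Mean pooling matters because a single concept may be split into several sub-tokens by the tokenizer, so one vector per occurrence has to be derived from several token vectors.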
Semantic distances were measured using cosine similarities. In our work, increased similarity (a small semantic distance) between the embeddings of the various occurrences of a concept indicates the fixed use of the concept (i.e., the concept is not polysemous). High similarities between the embeddings of the abstracts that a concept occurs in show that there is also a specific semantic context in which the concept appears. High concept–abstract similarities indicate a close match between a concept’s meaning and the information that the abstract conveys. They show that the abstract is a good fit as semantic context for the concept.
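As an illustration, the three per-concept quantities can be computed as averages of cosine similarities. The sketch below assumes that similarities among occurrences (and among abstracts) are averaged over all unordered pairs, and that each occurrence is compared with the abstract it comes from; these aggregation choices and all names and vectors are assumptions for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def mean_paired_similarity(concept_vecs, abstract_vecs):
    """Average similarity between each occurrence and its own abstract."""
    sims = [cosine_similarity(c, a) for c, a in zip(concept_vecs, abstract_vecs)]
    return sum(sims) / len(sims)


# Hypothetical 2-dimensional embeddings for three occurrences of one concept
# and for the three abstracts in which those occurrences appear.
occurrence_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
abstract_vecs = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

occurrence_sim = mean_pairwise_similarity(occurrence_vecs)               # variable (i)
concept_abstract_sim = mean_paired_similarity(occurrence_vecs, abstract_vecs)  # variable (ii)
abstract_sim = mean_pairwise_similarity(abstract_vecs)                   # variable (iii)
print(occurrence_sim, concept_abstract_sim, abstract_sim)
```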

4. Results

Relations between concept importance and semantic context were sought by computing correlations between the tf-idf scores of the AGROVOC concepts in the corpus and each of the semantic distances. We averaged each of the semantic distances computed per concept. Table 2 provides statistics related to concept importance. Statistics for the semantic distances are available in Table 3.
To check whether the concept importance and semantic distance variables were normally distributed, we ran the Kolmogorov–Smirnov test. Only the variable “semantic distance of concept from abstracts where it occurs” had a normal distribution. Consequently, we computed Spearman’s rho, a rank-based correlation coefficient that does not assume normality. The results are shown in Table 4.
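For readers unfamiliar with Spearman’s rho, it is the Pearson correlation of the ranks of the two variables, with tied values receiving the mean of their ranks. The following is a minimal sketch of the coefficient only (no significance test); the function names and the five data points are made up for illustration and are not taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for five concepts: importance and one averaged similarity.
tfidf = [0.001, 0.004, 0.002, 0.010, 0.003]
similarity = [0.15, 0.22, 0.30, 0.35, 0.18]
rho = spearman_rho(tfidf, similarity)
print(round(rho, 3))
```

In practice, a statistics package would normally be used instead, since it also reports the two-tailed significance value that accompanies each coefficient.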
The results show a tendency towards a statistically significant, positive correlation between concept importance and the semantic distance of the occurrences of the concept (r = 0.269; p = 0.059). There is also a weaker tendency towards a negative correlation between concept importance and the semantic distance of concepts from the abstracts in which they occur (r = −0.225; p = 0.116), which does not reach significance at the 0.05 level. Concept importance and the semantic distance of the abstracts containing the concept are not significantly correlated (r = 0.139; p = 0.337). It is worth highlighting the statistically significant, positive correlations between the semantic distance of the abstracts containing a concept and the semantic distance of (i) the concept occurrences in the corpus (r = 0.342; p = 0.015) and (ii) the concept from the abstracts containing it (r = 0.341; p = 0.015).

5. Discussion and Conclusions

In our work, we investigated whether there is a relation between agricultural term importance and semantic context. To the best of our knowledge, this work is novel. Researching the relation between term importance and semantic context is a significant step towards term importance computation methods that consider semantic, context-related factors. Such methods can power the development of information search and delivery applications tailored to the needs of farmers and extension services. Because they are not based on mere measures of occurrence, they can be useful in assigning weights to tags for document indexing or in prioritizing the terms to consider for knowledge base updates, thus enabling the delivery of up-to-date information that accurately reflects the domain advances described in the corpus used for term extraction. Our results show a tendency towards statistical significance in some cases. The statistically significant correlations between the semantic distances support our decision to consider them as a proxy for context. The negative correlation between concept importance and the semantic distances of concepts from the abstracts in which they occur needs attention. It implies that the more important a concept, the lower its semantic similarity to the abstracts it appears in. The random selection of concepts from AGROVOC for building our dataset is a limitation that may explain our results. The concept tf-idf scores for the dataset were low and also had low variability (std. deviation = 0.003133). Our focus on unigram concepts to minimize complexity is also a limitation. The 350-word maximum text length can also be considered a limitation. However, it was necessary in order to address BERT’s token limit.
Based on our results, further research is needed to investigate the relation between term importance and the semantic context of term occurrence. The low corpus-wise concept importance scores result from averaging the per-abstract tf-idf scores of the concepts. We will therefore use datasets, or samples from a dataset, with different variability regarding term importance. This will allow for rigorous conclusions as to whether a relation between term importance and semantic context exists or not. We will also broaden the scope of our research by considering multi-gram terms. We intend to use texts from sources other than AGRIS to better address the need for a dataset containing terms of more distinct importance.

Author Contributions

Conceptualization, H.P.; methodology, H.P.; software, P.S. and H.P.; validation, H.P., P.S. and X.W.; formal analysis, H.P. and P.S.; investigation, H.P. and P.S.; resources, H.P. and P.S.; data curation, H.P. and P.S.; writing—original draft preparation, H.P.; writing—review and editing, P.S., X.W. and C.B.; visualization, H.P.; supervision, C.B.; project administration, H.P.; funding acquisition, H.P. and C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partly supported by the Horizon Europe EU-FarmBook project with contract number 101060382.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code and datasets are available in the paper’s GitHub repository (https://github.com/herculespan/term-to-context_relation), created on 17 May 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Subirats-Coll, I.; Kolshus, K.; Turbati, A.; Stellato, A.; Mietzsch, E.; Martini, D.; Zeng, M. AGROVOC: The linked data concept hub for food and agriculture. Comput. Electron. Agric. 2022, 196, 105965.
  2. Tran, H.T.H.; Martinc, M.; Caporusso, J.; Doucet, A.; Pollak, S. The Recent Advances in Automatic Term Extraction: A survey. arXiv 2023, arXiv:2301.06767.
  3. Quadros, V.P. BERT for Agriculture Domain. 2021. Available online: https://medium.com/@vionaquadros/bert-for-agriculture-domain-f655d80c7da4 (accessed on 12 March 2024).
  4. Rezayi, S.; Liu, Z.; Wu, Z.; Dhakal; Ge, B.; Zhen, C.; Liu, T.; Li, S. AgriBERT: Knowledge-Infused Agricultural Language Models for Matching Food and Nutrition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22) Special Track on AI for Good, Vienna, Austria, 23–29 July 2022; pp. 5150–5156.
  5. Jain, N.; Jain, P.; Kayal, P.; Sahit, J.; Pachpande, S.; Choudhari, J.; Singh, M. AgriBot: Agriculture-Specific Question Answer System. 2019. Available online: https://osf.io/preprints/indiarxiv/3qp98_v1 (accessed on 7 April 2024).
  6. Gounder, S.; Patil, M.; Rokade, V.; More, N. Agrobot: An agricultural advancement to enable smart farm services using NLP. J. Emerg. Technol. Innov. Res. 2021, 8, 445–454.
  7. Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  8. Jalilifard, A.; Caridá, V.F.; Mansano, A.F.; Cristo, R.S.; da Fonseca, F.P.C. Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In Advances in Computing and Network Communications; Thampi, S.M., Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, K.C., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; Volume 736, pp. 327–337.
  9. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013.
  10. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.

Figure 1. Steps to investigate the relation between agricultural term importance and semantic context.

Table 1. Dataset statistics.

| | Minimum | Maximum | Average | Std. Deviation |
|---|---|---|---|---|
| Abstract length (in words) | 20.00 | 350.00 | 222.23 | 52.98 |
| Number of abstracts per concept | 8.00 | 10,000.00 | 734.96 | 2006.03 |
| Number of concepts per abstract | 1.00 | 4.00 | 1.25 | 0.50 |

Table 2. Statistics for the tf-idf scores (concept importance).

| | Minimum | Maximum | Average | Std. Deviation |
|---|---|---|---|---|
| Concept tf-idf scores | 0.000025 | 0.017451 | 0.001374 | 0.003133 |

Table 3. Statistics for the semantic distances (context).

| | Minimum | Maximum | Average | Std. Deviation |
|---|---|---|---|---|
| Semantic distance of concept occurrences | 0.1074 | 0.3602 | 0.1957 | 0.0465 |
| Semantic distance of concept from abstracts where it occurs | 0.3587 | 0.7748 | 0.5577 | 0.0940 |
| Semantic distance of abstracts containing a concept | 0.0800 | 0.4749 | 0.3638 | 0.0727 |

Table 4. Results of correlation computation (Spearman’s rho).

| Variable | Statistic | tf-idf scores | Semantic distance of concept occurrences | Semantic distance of concept from abstracts it occurs in | Semantic distance of abstracts containing a concept |
|---|---|---|---|---|---|
| tf-idf scores | Correlation coeff. | 1.000 | 0.269 | −0.225 | 0.139 |
| | Sig. (2-tailed) | - | 0.059 | 0.116 | 0.337 |
| Semantic distance of concept occurrences | Correlation coeff. | 0.269 | 1.000 | 0.023 | 0.342 (*) |
| | Sig. (2-tailed) | 0.059 | - | 0.877 | 0.015 |
| Semantic distance of concept from abstracts it occurs in | Correlation coeff. | −0.225 | 0.023 | 1.000 | 0.341 (*) |
| | Sig. (2-tailed) | 0.116 | 0.877 | - | 0.015 |
| Semantic distance of abstracts containing a concept | Correlation coeff. | 0.139 | 0.342 (*) | 0.341 (*) | 1.000 |
| | Sig. (2-tailed) | 0.337 | 0.015 | 0.015 | - |

The asterisk (*) indicates statistical significance at the 0.05 level.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
