Article

Hybrid Transformer-Based Large Language Models for Word Sense Disambiguation in the Low-Resource Sesotho sa Leboa Language

by Hlaudi Daniel Masethe 1,*, Mosima Anna Masethe 2,*, Sunday O. Ojo 3, Pius A. Owolawi 4 and Fausto Giunchiglia 5
1 Department of Computer Science, Faculty of Information and Communication Technology, Tshwane University of Technology, Pretoria 0183, South Africa
2 Department of CSIT, School of Science and Technology, Sefako Makgatho Health Sciences University, Pretoria 0204, South Africa
3 Department of Information Technology, Faculty of Accounting and Informatics, Durban University of Technology, Durban 4001, South Africa
4 Department of Computer Engineering, Faculty of Information and Communication Technology, Tshwane University of Technology, Pretoria 0183, South Africa
5 Department of Information Engineering and Computer Science, Faculty of Information Communication Technology, University of Trento, 38122 Trento, Italy
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3608; https://doi.org/10.3390/app15073608
Submission received: 1 January 2025 / Revised: 19 March 2025 / Accepted: 21 March 2025 / Published: 25 March 2025

Abstract

This study addresses a lexical ambiguity issue in Sesotho sa Leboa that arises from terms with multiple meanings, commonly known as homonyms or polysemous words. Compared with European languages, for instance, this lexical ambiguity in Sesotho sa Leboa causes computational semantic problems in NLP when attempting to identify the lexicon of the language; in other words, the ambiguity makes it challenging to determine the proper lexical category and sense of words. To address the issue of polysemy in the Sesotho sa Leboa language, this study set out to create a word sense disambiguation (WSD) scheme using a corpus-based hybrid transformer-based architecture and deep learning models. Additionally, the performance of baseline and improved machine learning models for a sequence-based natural language processing (NLP) task was assessed and compared. The baseline models included RNN-LSTM, BiGRU, LSTMLM, DeBERTa, and DistilBERT, with accuracies of 61%, 79%, 74%, 70%, and 64%, respectively. Among these, BiGRU emerged as the strongest performer, leveraging its bidirectional architecture to achieve the highest baseline accuracy. Transformer-based models, such as DeBERTa and DistilBERT, demonstrated moderate performance, with the latter prioritizing efficiency at the cost of accuracy. The enhanced experiments explored optimization techniques and hybrid model architectures to improve performance. BiGRU optimized with ADAM achieved an accuracy of 84%, while BiGRU with attention mechanisms further improved to 85%, showcasing the effectiveness of these enhancements. Hybrid models integrating BiGRU with transformer architectures demonstrated varying results. BiGRU + DeBERTa and BiGRU + ALBERT achieved the highest accuracies of 85% and 84%, respectively, highlighting the complementary strengths of bidirectional context modeling and advanced transformer-based contextual understanding. Conversely, the hybrid BiGRU + RoBERTa model underperformed, with an accuracy of 70%, indicating potential mismatches in model synergy. These findings highlight how crucial hybridization and optimization are to reaching state-of-the-art performance on NLP tasks. According to this study’s findings, the most promising approaches for combining accuracy and efficiency are attention-based BiGRU and BiGRU–transformer hybrids, especially those that incorporate DeBERTa and ALBERT. Future research should concentrate on task-specific optimizations and improved hybrid model integration to further improve performance.

1. Introduction

Sesotho sa Leboa is a Bantu language of the Sotho group. It is distinguished by its disjunctive orthography, particularly in the category of verbal prefixal morphemes. The language is regarded as semi-conjunctive since its suffixal morphemes are written conjunctively. Additionally, because of its phonology and history, the language is regarded as agglutinative; it shares significant intrinsic structural similarities with related Bantu languages, yet differs considerably in its orthography [1,2].
When polysemous words occur in sentences during human discourse, listeners can infer the meaning of the words from the context. Since computers lack the skills and knowledge necessary to understand such sentences, a method is needed for disambiguation. In Sesotho sa Leboa language processing, WSD is among the most attractive and difficult research topics, and it is highly significant for text processing and language understanding applications. In this study, the researchers propose methods for disambiguating ambiguous Sesotho sa Leboa terms in sentences.
Because of their straightforward morphological characteristics, English and other European languages have been investigated in a considerable number of studies published in the field of WSD [3].
Lexical ambiguity encompasses both homonymy and polysemy. Ambiguity is a form of multiplicity of meaning: one word or lexeme may have several distinct senses. Polysemy refers to one word with two or more related meanings, whereas homonymy refers to two or more distinct words with unrelated meanings that are spelled identically and share the same pronunciation or sound. Distinguishing between polysemy and homonymy can be challenging because the boundary between them is fluid; nevertheless, the two are very distinct [4,5,6].
One of the main problems in natural language processing (NLP) is word sense disambiguation (WSD), which becomes more difficult in languages with limited resources like Sesotho sa Leboa. Sesotho sa Leboa lacks the extensive annotated corpora, strong linguistic resources, and pre-trained models required for efficient computational linguistic applications, in contrast to highly resourced languages like English. In addition to addressing the difficulties presented by WSD in Sesotho sa Leboa, this study intends to further the area of computational linguistics in low-resource languages. This WSD research is essential for closing the digital language gap caused by the growing demand for NLP solutions in native African languages. This work aims to improve the accessibility of linguistic technology and encourage the inclusion of Sesotho sa Leboa in the global NLP environment by creating strong, context-aware models.
Languages are gradually acquired and formed via practice, production, social networking, and usage in daily life; as a result, word definitions must always accurately reflect objective reality [7]. Common nouns, entities, and modifiers in natural languages can suggest several ways to express the same idea, yet distinct terms in formal languages, once disambiguated, may align with a concept articulated in another formal language [8]. Spoken by about 4.7 million people, Sesotho sa Leboa, commonly referred to as “Sepedi” or “Northern Sotho”, is one of South Africa’s 11 official languages. Thirty dialects make up Sesotho sa Leboa, some of which notably differ from the standardized written form [6].
Words in Sesotho sa Leboa often have many meanings, making it a highly polysemous language. Part-of-Speech (POS), specialized, metaphoric, and other types of polysemy exist in Sesotho sa Leboa, and each presents a unique computational problem for the delivery of WSD solutions. In current computational linguistics research, polysemy is still an unresolved subject that presents significant computational challenges [9]. Handling sense distinctiveness and vagueness versus polysemy remains an unresolved theoretical issue, as shown by the shortcomings of several polysemy tests; thus, polysemy is still a theoretically unresolved topic [10]. However, it is also a real problem in modern NLP applications, since WSD systems struggle to handle extremely polysemous terms. WSD therefore calls for both theoretical and applied answers. As with other African languages, while several WSD solutions have been proposed for English, comparatively little work has addressed the Sesotho sa Leboa language.
This study addresses the challenge of improving NLP model performance by systematically evaluating baseline architectures and investigating optimization techniques and hybrid designs. It seeks to identify methods that maximize accuracy while balancing computational efficiency, ultimately advancing the development of robust NLP solutions.
The absence of extensive data with word meaning annotations has frequently worsened this problem [11]. Word sense disambiguation is one of the most important tasks for determining whether a model is capable of deep comprehension, and the emergence of pre-trained language models has led to significant progress on such tasks. Nevertheless, most pre-training objectives used nowadays operate at the token level, substituting or masking tokens while ignoring their linguistic senses. It is therefore debatable whether learning polysemy and disambiguating senses can be accomplished through token-predicting approaches alone [12].
Researchers continue to explore multilingual ways to improve disambiguation accuracy [13]. WSD refers to the computational difficulty of determining a word’s meaning in relation to its context, a problem sometimes described as AI-complete. Diverse methodologies for word sense disambiguation (WSD) have emerged, including supervised machine learning techniques that leverage extensive annotated corpora, dictionary-based strategies that utilize lexical knowledge bases, and unsupervised methods that infer word senses through clustering techniques [14,15]. Among these, supervised learning methods have demonstrated the highest efficacy [16]. Research suggests that semi-supervised methods yield an accuracy similar to that of supervised techniques; nevertheless, neural networks, particularly convolutional neural networks (CNNs), exhibit superior effectiveness in natural language processing for word sense disambiguation (WSD) [17].
The absence of substantial corpora for low-resource languages, such as Sesotho sa Leboa (SsL), exacerbates the challenge of word sense disambiguation (WSD). While WSD approaches have been thoroughly examined for English, their application to morphologically rich, low-resource languages such as SsL poses distinct issues due to intricate linguistic structures. Resolving lexical ambiguity in SsL is crucial for enhancing machine translation and information retrieval efficacy in these languages. A global network was trained to disambiguate target words, attaining roughly 90% accuracy. In a separate study, Shafi et al. (2023) [18] devised and assessed Urdu semantic tagging approaches on a manually annotated corpus of 8000 tokens spanning multiple genres. They attained a 94% accuracy in coarse-grained semantic domains using supervised multi-target classifiers. Supervised word sense disambiguation techniques, encompassing algorithms such as neural networks, K-nearest neighbors, support vector machines (SVMs), decision trees, and Naive Bayes, utilize manually annotated data for the training of classification models [19,20,21]. Although effective, these methods necessitate considerable annotated corpora, rendering them resource-intensive. Sarmah and Sarma (2016) [21] utilized a Naive Bayes classifier for word sense disambiguation in the Assamese language, attaining 71% accuracy with 160 ambiguous phrases sourced from WordNet and the Assamese Corpus.
Hybrid learning approaches combine supervised, unsupervised, and knowledge-based techniques to enhance WSD performance. For instance, Demlew and Yohannes (2022) [22] proposed a hybrid approach for Amharic WSD, which achieved an accuracy of 86% using a combination of supervised and unsupervised methods. Similarly, Gahankari et al. (2023) [15] integrated supervised and knowledge-based methods for Marathi WSD, improving the overall accuracy and performance.
This paper advances the theoretical foundations of word sense disambiguation (WSD) and hybrid transformer-based large language models (LLMs) for low-resource languages, particularly Sesotho sa Leboa. This research offers useful, real-world applications in low-resource NLP, notably for Sesotho sa Leboa and other Bantu languages, in addition to theoretical advances:
  • It broadens the theoretical knowledge of WSD in low-resource, morphologically rich languages, particularly for Sesotho sa Leboa, by achieving up to 85% accuracy in resolving WSD using BiGRU and transformer-based hybrid models.
  • It proposes a novel hybrid model to optimize WSD for Sesotho sa Leboa, integrating deep learning (BiLSTM, BiGRU, and transformers) with classical machine learning (SVMs and decision trees), achieving a peak accuracy of 85% (BiGRU + DeBERTa).
  • It enhances the theoretical foundation for WSD by incorporating linguistic information into deep learning models, improving polysemous word disambiguation with models like Hybrid BiGRU + BERT (79%) and Hybrid BiGRU + RoBERTa (70%).
  • It develops domain-specific and hierarchical graph representations to simulate semantic relationships between Sesotho sa Leboa words, with hierarchical graph models achieving a 79% accuracy and domain-specific graph models achieving a 75% accuracy.
  • It creates a sense-annotated dataset for WSD, contributing to training and evaluation benchmarks for NLP models in Sesotho sa Leboa.
  • It delivers a high-performing hybrid transformer-based WSD model, significantly improving the accuracy (up to 85%), recall, and F1-score in resolving polysemous word ambiguity.
  • It bridges the data scarcity gap in computational linguistics for Sesotho sa Leboa by integrating corpus-based, transformer-based, and deep learning approaches, outperforming classical techniques (e.g., decision trees at 80%) in WSD tasks.

2. Related Literature Review

Distributional semantics based on neural techniques, a fundamental component of natural language processing, also shows unexpected parallels to human meaning representation [23]. Transformer-based language models have demonstrated their ability to generate consistent, sense-specific contextual word representations. The researchers in [23] presented a more principled method of utilizing data from all neural language model (NLM) layers.
Sesotho sa Leboa WSD is comparable to the joint supervised and unsupervised sense disambiguation technique used for Amharic [22], in that both languages are low-resource and morphologically rich, and both face problems such as polysemy and homonymy. The use of neural word embeddings in Amharic is consistent with this study’s investigation of BiGRU-based models and context-aware transformer embeddings for WSD. The integration of supervised and unsupervised learning suggests that hybrid approaches combining rule-based, statistical, and deep learning techniques could further enhance WSD performance in Sesotho sa Leboa.
Hybrid learning in word sense disambiguation (WSD) integrates many approaches—supervised, unsupervised, and knowledge-based methods—to capitalize on their respective strengths and enhance the model’s overall efficacy. Hybrid models seek to amalgamate the advantages of labeled data, the capacity to learn from unlabeled data, and external knowledge sources.
Equation (1) can be used to express a hybrid learning model as a weighted combination of the outputs from the supervised, unsupervised, and knowledge-based models. Hybrid learning can be expressed as an optimization problem, where the final decision for the word sense is a combination of multiple methods, each contributing to the final prediction.
Let
  • $M_s$ be the supervised learning model;
  • $M_u$ be the unsupervised learning model;
  • $M_k$ be the knowledge-based model;
  • $w_s$, $w_u$, and $w_k$ be the weights assigned to each model based on its contribution to the final decision.
$$S_{best} = \arg\max_{S_i} \left[ w_s \cdot P_s(S_i \mid f) + w_u \cdot P_u(S_i \mid f) + w_k \cdot P_k(S_i \mid f) \right] \quad (1)$$
where
  • $P_s(S_i \mid f) = \mathrm{SupervisedModel}(f)$ is a supervised model $M_s$ trained on labeled data using a classifier such as Naïve Bayes or an SVM;
  • $P_u(S_i \mid f) = \mathrm{UnsupervisedModel}(f)$ is an unsupervised model $M_u$ that groups similar contexts into clusters or topics and computes the probability of each sense over the latent clusters or topics;
  • $P_k(S_i \mid f) = \mathrm{KnowledgeModel}(f)$ is a knowledge-based model $M_k$ that uses lexical resources such as WordNet to compute the probability of a word sense based on the overlap between the context and the dictionary definitions;
  • $w_s$, $w_u$, and $w_k$ are weights that reflect the relevance or accuracy of each model.
Equation (1) represents the sum of the outputs from all selected models, and the final word sense is the sense  S i  that maximizes the weighted sum of the probability from all selected models.
The supervised learning element in hybrid word sense disambiguation models depends on annotated training data to construct a model that forecasts word senses based on acquired attributes. Unsupervised learning methods, like clustering and topic modeling, yield insights from unlabeled data, facilitating the identification of latent semantic structures. The knowledge-based learning component employs external resources, including definitions from WordNet or other lexicons, to facilitate the sense disambiguation process. The weighting and integration phase in hybrid models allocates significance to each element according to its dependability, amalgamating them to identify the most suitable word sense.
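To make Equation (1) concrete, the following minimal Python sketch shows how supervised, unsupervised, and knowledge-based sense probabilities could be combined with weights to select the best sense. The sense labels, probability values, and weights are illustrative assumptions, not outputs of the study's actual models.

```python
# Minimal sketch of the weighted hybrid combination in Equation (1).
# Sense labels, probabilities, and weights below are illustrative only.

def hybrid_sense_selection(p_supervised, p_unsupervised, p_knowledge,
                           w_s=0.5, w_u=0.2, w_k=0.3):
    """Return the sense S_i maximizing w_s*P_s + w_u*P_u + w_k*P_k."""
    senses = p_supervised.keys()
    scores = {
        s: w_s * p_supervised.get(s, 0.0)
           + w_u * p_unsupervised.get(s, 0.0)
           + w_k * p_knowledge.get(s, 0.0)
        for s in senses
    }
    return max(scores, key=scores.get), scores

# Example: the word "leleme" with two hypothetical senses.
p_s = {"tongue": 0.55, "language": 0.45}   # supervised classifier output
p_u = {"tongue": 0.30, "language": 0.70}   # cluster/topic-based estimate
p_k = {"tongue": 0.40, "language": 0.60}   # lexicon-overlap estimate

best_sense, all_scores = hybrid_sense_selection(p_s, p_u, p_k)
print(best_sense, all_scores)
```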
The combined supervised and unsupervised sense disambiguation method for Amharic is similar to WSD in Sesotho sa Leboa. Both languages are morphologically rich and low-resource, and they face issues like homonymy and polysemy [22].
A unique knowledge-based method for Persian word sense disambiguation (WSD) employs latent Dirichlet allocation (LDA) for semantic expansion and is analogous to Sesotho sa Leboa WSD; both languages are low-resource and morphologically intricate, requiring effective sense disambiguation [24].
In practice, Demlew and Yohannes (2022) [22] developed a hybrid WSD approach combining supervised and unsupervised methods for Amharic, achieving 86% accuracy by leveraging both labeled and unlabeled data. Similarly, Gahankari et al. (2023) [15] integrated supervised and knowledge-based methods for Marathi WSD, improving the overall performance by merging data-driven learning with external lexical knowledge. This hybrid model offers greater flexibility and accuracy by incorporating multiple learning paradigms, making it especially effective for WSD tasks in low-resource settings or languages with limited labeled data. The researchers in [25] have given the research community access to a well-preprocessed Ewe dataset for text classification under limited resources. Additionally, to support low-resource semantic representation, they created Ewe word embeddings based on this preprocessed dataset [25].
The genetic algorithm for WSD in Hindi text documents relates to WSD in Sesotho sa Leboa in that it resolves word ambiguity in a morphologically complex language through an optimization-based method. Likewise, the hybrid methodology combining a knowledge-based approach with an SVM for WSD in Marathi relates to WSD in Sesotho sa Leboa because it uses linguistic resources and machine learning to resolve word ambiguity in a morphologically rich language [15].

2.1. Transformer-Based Language Models

This section briefly discusses the key characteristics of each model relevant to the literature review. This study delineates the differences among each of these models. The transformer is a non-recurrent design including an encoder and a decoder for sequence-to-sequence (seq2seq) communication [26]. The encoder and decoder of a transformer are situated next to each other, consisting of several identical blocks. Each encoder block consists of a position-wise feed-forward layer and a multi-head self-attention layer. Each decoder block has an additional cross-attention layer compared to the encoder block, as the decoder requires the encoder’s output as contextual information for generation [26]. Transformers were introduced in the 2017 article “Attention Is All You Need”. Their objective is to rectify the shortcomings of RNNs. The self-attention mechanism represents the primary innovation of transformers. Transformers exceed RNNs in their capacity to capture long-range dependencies in phrases, exhibiting considerable effectiveness in neural machine translation (NMT) [27].
The domain of natural language processing (NLP) has gained significance over time. Transformer-based pre-trained models, particularly BERT, have demonstrated remarkable efficacy in many NLP tasks; yet, their substantial parameter count and extended processing demands complicate their deployment on resource-limited embedded platforms. To address this problem, researchers [28] have employed the ALBERT model with the enhanced early exit technique (ELBERT) and developed an efficient VLSI architecture via an algorithmic and hardware co-design methodology. Initially, through the implementation of quantization and encoder-level parameter sharing approaches, the storage capacity for BERT is diminished from 1208.88 MB to 20.99 MB without any degradation in accuracy [28].
Word sense disambiguation (WSD) in Arabic [29] is related to WSD in Sesotho sa Leboa because BERT uses contextualized word embeddings to resolve lexical ambiguity in a low-resource, morphologically complicated language. Arabic, like Sesotho sa Leboa, has extensive morphological features, polysemy, and homonymy, which reduces the effectiveness of classic WSD techniques. This study investigates how to enhance WSD resolution in Sesotho sa Leboa by combining transformer-based embeddings (BERT, RoBERTa, and DeBERTa) with BiGRU and attention processes. BERT’s success in Arabic WSD shows that multilingual BERT models could be enhanced, or Sesotho sa Leboa-specific embeddings could be created for better WSD performance.

2.1.1. BERT

The inaugural prominent transformer-based neural language model designed explicitly for language understanding is referred to as BERT [26,30,31]. Two unsupervised modeling objectives—Next Sentence Prediction (NSP) and masked language modeling (MLM)—are applied for pre-training, while WordPiece tokenization is implemented to decompose words into their constituent character-level components [23,26,30,32]. The researchers in [29] introduced a dataset including one hundred polysemous Arabic phrases, each demonstrating three to eight unique interpretations, accompanied by ten illustrative utterances for each term. To gain a deeper understanding of the dataset’s characteristics and properties, various statistical analyses were performed. An innovative BERT-based method for word sense disambiguation was developed to determine the relationship between dictionary meanings and contextual information using similarity metrics, and the proposed pre-trained language model enabled effective Arabic word disambiguation [25]. New attributes were integrated during training to improve the model’s capacity to distinguish between distinct meanings of words. A composite model architecture was created by incorporating the proposed BERT model [33], leading to the WSD system attaining an F1-score of 96%, exceeding the latest systems [29].
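As an illustration of how contextual embeddings separate senses of the same surface form, the sketch below uses the Hugging Face transformers library to compare representations of an ambiguous word in two contexts. The multilingual checkpoint and the English example sentences are assumptions for illustration only, not the configuration used in this study.

```python
# Sketch: contextual embeddings for an ambiguous word in two contexts.
# The checkpoint and example sentences are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden states into one sentence-level vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# English placeholder example of a polysemous word in two contexts.
e1 = embed("He deposited cash at the bank.")
e2 = embed("They camped on the river bank.")
similarity = torch.cosine_similarity(e1, e2, dim=0)
print(f"Cosine similarity between contexts: {similarity.item():.3f}")
```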

2.1.2. RS-BERT

The researchers in [12] presented sense-aware language modeling, a novel pre-training objective that adds sense-level information to the model. RS-BERT is the resulting radical-enhanced sense embedding model. At each training step, the senses were predicted and these predictions were then used to update the model; the two steps were carried out alternately in an expectation–maximization fashion during training. Furthermore, pre-training was initialized by introducing radical knowledge into RS-BERT. The investigations employed two datasets for the disambiguation of Chinese word senses. The experimental results indicated that RS-BERT is competitive, and it exhibits exceptional performance when combined with further dataset-specific modifications. Moreover, the research indicates that Chinese characters are effectively categorized into several meanings via RS-BERT [12].

2.1.3. ALBERT

ALBERT [33] employs parameters more efficiently and possesses a more streamlined architecture than BERT. ALBERT is constructed on an architecture analogous to that of BERT and is offered in several variants [23]. It utilizes fewer parameters than BERT while exhibiting a comparable benchmark performance, offering significant insights into parameter reduction: first, the input word embedding matrix is factorized into two smaller matrices; second, ALBERT enables all transformer layers to share parameters, significantly reducing the total number of parameters [26,33,34].
ALBERT [35] was developed to enhance recent progress in language representation learning and to address the shortcomings of the conventional BERT framework. The ALBERT model employs a sentence order prediction objective, whereas the conventional BERT model utilizes the next-sentence prediction (NSP) objective. ALBERT proposes two parameter-reduction techniques to enhance memory efficiency and expedite training: (a) embedding factorization and (b) cross-layer parameter sharing. Furthermore, ALBERT argues that the NSP objective is excessively simplistic, as it merges topic estimation and coherence estimation into a single task by creating negative instances through the concatenation of fragments from different sources [22,33]. Researchers have investigated language embedding models for both ALBERT and BERT; the empirical results indicated that, on the STS benchmark, the CNN architecture markedly surpassed BERT models in enhancing ALBERT models [35].

2.1.4. RoBERTa and DistilRoBERTa

Subsequent to GPT and BERT, enhanced models such as RoBERTa and ALBERT were introduced. RoBERTa, a successful variant of BERT, incorporates four straightforward modifications: (1) the removal of the NSP task; (2) an increase in training steps and batch sizes; (3) an extension of training durations; and (4) a dynamic alteration of the [MASK] pattern. RoBERTa, founded on BERT, produces exceptional empirical outcomes [26,36]. RoBERTa, pre-trained on a vast corpus of unlabeled text data, seeks to establish a universal language representation that can be fine-tuned for many downstream natural language processing applications [25,36].
Pre-trained language models (PLMs) offer markedly improved accuracy; yet, executing a substantive comparison has been difficult [25]. The hyperparameters of a low-resource dataset, such as the Ewe news dataset, profoundly affect the outcomes, rendering training on these datasets computationally demanding. In response to this problem, the researchers in [25] adopted RoBERTa, an improved variant of BERT that replaces BERT’s static masking method with a dynamic one. Conversely, DistilRoBERTa is an optimized variant of RoBERTa that maintains approximately 95% of the original’s performance [25].

2.1.5. DistilBERT

This section analyzes DistilBERT [25,37,38], a distilled variant of BERT obtained through knowledge distillation that is more efficient, cost-effective, lightweight, and compact. It is available in two versions, DistilBERT-base-cased and DistilBERT-base-uncased, which follow the learning framework of the respective BERT-base-cased and BERT-base-uncased models [25,38]. DistilBERT is a more concise and efficient alternative to the well-known BERT (Bidirectional Encoder Representations from Transformers) language model for natural language processing (NLP) applications and is founded on the transformer architecture. The primary aim of DistilBERT is to maintain BERT’s performance while substantially minimizing its size, improving speed, reducing cost, and decreasing weight. Researchers [34] have suggested a technique to enhance text matching by modeling the meanings of potentially ambiguous terms with lexical knowledge from external sources. A sense-aware methodology was implemented that incorporates a word sense disambiguation (WSD) model into text matching, utilizing multi-task learning to simultaneously enhance both tasks (matching and WSD). The proposed WSD model is a streamlined adaptation that employs WordNet’s lexical data to enhance a pre-trained BERT-based model [34]. Optimizing large language models (LLMs) like BERT for natural language processing (NLP) applications is challenging in resource-constrained circumstances. DistilBERT [39] is a more compact and efficient variation of BERT; yet, its size may still pose challenges for deployment on devices with limited memory and processing power. While model compression techniques can enhance inference speed and reduce the size of large language models, they often lead to diminished model performance.

2.1.6. GPT

The GPT methodology distinctly combines the self-supervised pre-training objective with the contemporary transformer architecture. Empirically, GPT demonstrates superiority in nearly all natural language processing (NLP) tasks, encompassing question answering, semantic similarity, natural language inference, and classification. GPT utilizes extensive unlabeled data and implements a conventional autoregressive language model to optimize the conditional probability of each word based on its prior context. The transformer element calculates the conditional probability of each word during the pre-training phase of the GPT model [22,26].

2.1.7. DeBERTa

DeBERTa is a sophisticated evolution of the RoBERTa and BERT models, refined in adversarial settings subsequent to pre-training for masked language modeling. A disentangled attention layer is integrated into DeBERTa [25,40].
Word sense disambiguation (WSD) significantly depends on DeBERTa’s disentangled attention mechanism. Unlike traditional self-attention mechanisms, exemplified by BERT and RoBERTa, DeBERTa separates two critical representations—”Content Embeddings” and “Position Embeddings”—through disentangled attention. The disentangled attention mechanism of DeBERTa enhances contextual representations and captures subtle semantic distinctions [40].
DeBERTa introduces two unique strategies: disentangled attention and enhanced masked decoding, rendering it a more decoding-centric variant of BERT with disentangled attention. In the input layer, each word or token is represented by two vectors that signify its content and position within the corpus, exemplifying the principle of disentangled attention. This is indicated by the dependence of content extraction on the word’s position [40]. DeBERTa enhances RoBERTa and BERT through the implementation of disentangled attention processes that independently encode word content and location information, leading to superior performance in word sense disambiguation tests [25].
The attention mechanism operates with two types of embeddings in DeBERTa:
  • the content embedding $c_i$ for the word content;
  • the positional embedding $p_i$ for the word’s position in the sequence.
The attention score for word $i$ and word $j$ is computed using Equation (2):
$$\mathrm{Attention}(i, j) = \frac{(W_Q c_i + W_Q^P p_i)(W_K c_j + W_K^P p_j)^T}{\sqrt{d_k}} \quad (2)$$
where
  • $W_Q$ and $W_Q^P$ are the weight matrices for the content and positional query;
  • $W_K$ and $W_K^P$ are the weight matrices for the content and positional key;
  • $d_k$ is the dimension of the key vectors.
DeBERTa enhances the expressiveness of the attention mechanism by separating positional and content embeddings, which results in improved performance on tasks such as word sense disambiguation (WSD).
By distinguishing between positional information and word content, DeBERTa improves on BERT and makes it possible to model word meanings in context more accurately. Performance on a variety of NLP tasks is enhanced using this method, particularly in WSD. A summary table including the paper title, authors, research techniques/methodology, research gaps, summary of findings, and interventions is presented in Table 1 for studies on low-resource languages.
This summary offers a succinct overview of the important facets of research on low-resource languages, highlighting the difficulties and the cutting-edge strategies being investigated.
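A minimal NumPy sketch of the simplified disentangled attention score in Equation (2) is given below. The dimensions and randomly initialized matrices are placeholders, and the simplified form omits the relative-position bucketing and additional cross terms used in the full DeBERTa model.

```python
# Sketch of the simplified disentangled attention score in Equation (2).
# Dimensions and randomly initialized matrices are placeholders only.
import numpy as np

d_model, d_k = 16, 16
rng = np.random.default_rng(0)

W_Q, W_QP = rng.normal(size=(d_k, d_model)), rng.normal(size=(d_k, d_model))
W_K, W_KP = rng.normal(size=(d_k, d_model)), rng.normal(size=(d_k, d_model))

def disentangled_score(c_i, p_i, c_j, p_j):
    """Attention(i, j) = (W_Q c_i + W_QP p_i) . (W_K c_j + W_KP p_j) / sqrt(d_k)."""
    query = W_Q @ c_i + W_QP @ p_i   # content + position query for word i
    key = W_K @ c_j + W_KP @ p_j     # content + position key for word j
    return float(query @ key) / np.sqrt(d_k)

# Toy content and positional embeddings for two words.
c_i, p_i = rng.normal(size=d_model), rng.normal(size=d_model)
c_j, p_j = rng.normal(size=d_model), rng.normal(size=d_model)
print(disentangled_score(c_i, p_i, c_j, p_j))
```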

2.1.8. XLNet

XLNet, a neural language model, was developed to address several shortcomings inherent in the prevalent transformer-based language model BERT, including the limitations of its masked pre-training approach and its inability to perform permutation-based language modeling. BERT is nonetheless a powerful model proficient in capturing bidirectional contextual information in text. XLNet, by contrast, can model the relationships among all permutations of the input sequence through permutation language modeling [41].

2.1.9. T5

The extensive scale of T5 [42], a distinguished language model created by Google’s artificial intelligence team, distinguishes it from other models. T5 is a versatile model that may be tailored for question answering, summarization, and language translation, in contrast to task-specific models. T5 employs a “text-to-text” methodology, which entails converting an input text into a target text. This method enhances generalization to novel tasks and promotes more flexible and adaptive training. T5 has demonstrated exceptional performance on numerous NLP benchmarks, solidifying its status as a preferred instrument for NLP research and applications [41,42].
The pre-trained transformer models BERT, RoBERTa, DistilBERT, and XLNet were evaluated in [43] for their ability to detect emotions in texts. The outputs of all potential models were compared in the study in [43]. RoBERTa attained the highest recognition accuracy, demonstrating the models’ efficacy in emotion identification from text. The superiority of RoBERTa compared to other candidate models in emotion identification was further evidenced by the precision, recall, and F1-scores [43].

2.2. Deep Learning Models

Deep learning architectures, such as long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), and Bidirectional Gated Recurrent Unit (BiGRU) networks, have demonstrated efficacy in the field of word sense disambiguation (WSD). These models adeptly capture sequential information and address challenges such as the loss of word order, rendering them especially beneficial in low-resourced languages, where conventional methods fall short due to insufficient data.

2.2.1. RNN-LSTM

Recurrent neural networks (RNNs) [44] are a class of networks that use recurrent connections to create memory. In feed-forward networks, the inputs are unrelated to one another, whereas in an RNN each input is linked to the preceding inputs through the recurrent state. A modified RNN incorporates LSTM, or long short-term memory, units. These LSTM units help to prevent learning errors: they allow RNNs to continue learning across many time steps by maintaining a more constant error signal. Information outside of the RNN’s core flow is contained in the LSTM units, which are gated blocks [44].

2.2.2. LSTM

Recurrent neural networks (RNNs) employing the long short-term memory (LSTM) architecture are used to manage long-term information values. These networks were designed to mitigate the inherent vanishing gradient problem of RNNs. In LSTM architecture, gates are employed to control the flow of inputs and outputs. LSTM cells generally comprise three distinct types of gates. The input gate, as the principal gate, governs the transmission of new information values to memory. The second gate, referred to as the forget gate, regulates the duration for resetting memory cells and the retention interval for input values in memory. The output gate, the terminal component, governs the timing of data value release from memory. The gates control the dissemination of information [29,45,46]. Currently, deep learning approaches are producing promising outcomes in several NLP tasks, with LSTM recognized as the most successful RNN variation for these applications, including word sense disambiguation.
LSTM cells are designed to capture long-range dependencies via memory cells that store and modify information over time, hence alleviating the vanishing gradient issue prevalent in traditional recurrent neural networks (RNNs). The LSTM cell is governed by the input, forget, and output gates, represented mathematically [46] as follows:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \quad (3)$$
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \quad (4)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \quad (5)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad \text{(cell state candidate)} \quad (6)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(new cell state)} \quad (7)$$
$$h_t = o_t \odot \tanh(C_t) \quad \text{(new hidden state)} \quad (8)$$
where
  • $x_t$ is the input at time $t$;
  • $h_{t-1}$ is the hidden state from the previous time step;
  • $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively;
  • $C_t$ is the cell state at time $t$;
  • $W_i$, $W_f$, $W_o$, and $W_C$ are weight matrices, and $b_i$, $b_f$, $b_o$, and $b_C$ are biases;
  • $\sigma$ is the sigmoid activation function, and $\tanh$ is the hyperbolic tangent activation function.
Long-term dependencies are maintained by LSTMs [46], which makes them well suited to WSD tasks where word order and sequence information are crucial. The cell state, shown by the horizontal line along the top of Figure 1, is the main concept of LSTM, and the input, forget, and output gates add or delete information from the cell state (Equations (3)–(8)). For processing temporal information, the LSTM model is widely used and, with very little variation, appears throughout most works.
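As a concrete companion to Equations (3)–(8), the sketch below implements a single LSTM cell step with NumPy. The weight shapes and the toy input are placeholders, not the configuration trained in this study.

```python
# Sketch of one LSTM cell step following Equations (3)-(8).
# Weights act on the concatenation [h_{t-1}, x_t]; values are toy placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W and b hold parameters for the input (i), forget (f), output (o),
    and candidate (c) transformations, each of shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate, Eq. (3)
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate, Eq. (5)
    c_hat = np.tanh(W["c"] @ z + b["c"])        # candidate cell state, Eq. (6)
    c_t = f_t * c_prev + i_t * c_hat            # new cell state, Eq. (7)
    h_t = o_t * np.tanh(c_t)                    # new hidden state, Eq. (8)
    return h_t, c_t

hidden, inp = 4, 3
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "ifoc"}
b = {k: np.zeros(hidden) for k in "ifoc"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
print(h)
```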

2.2.3. BiLSTM

Bidirectional LSTM (BiLSTM) networks represent an enhancement of LSTM models. One LSTM is applied to the input sequence as a forward state layer, while a second LSTM processes the reversed input sequence as a backward state layer in time. This model links two distinct hidden LSTM layers oriented in opposite directions to a single output, so information flows once from the end of the sequence to the beginning and once from the beginning to the end [32,48,49,50]. BiLSTM can assign varying weights to words based on their contextual usage. In addressing the deficiencies of deep neural networks in capturing local aspects and highlighting the importance of individual words within the broader context, BiLSTM ensures contextually relevant semantic association information [32,49,50].

2.2.4. BiGRU

A forward Gated Recurrent Unit (GRU) and a backward GRU were integrated to form the Bidirectional Gated Recurrent Unit (BiGRU) network, incorporating representations of the hidden layers from both the forward and backward GRUs. In a conventional unidirectional RNN, information progresses only from the past to the future. The BiGRU incorporates contextual information from both past and future data in time series analysis by adding a reverse layer, hence enhancing its capacity to extract contextual information from sequential data [37,51]. The pertinent information regarding the past and future of the input is comprehensively employed throughout the mapping process between sequences, guaranteeing that the context preceding and succeeding the sentence is captured and that the past and future information is thoroughly acknowledged. BiGRU enhances the unidirectional network through the second layer of the network [51].
Linked characteristics and possible text connections have been identified by researchers [52] using MLCNN and BiGRU with attention mechanisms, respectively. Horizontal fusion has been used for classification, which can lead to an increase in classification accuracy [52].
Furthermore, the downstream-specific NLP job and the pre-training-generated word vector were methodically transformed into the pre-training-generated word vector utilizing the BERT-BIGRU-CRF entity extraction technique, as evidenced by the researchers in [53].
BiGRU, an enhanced variant of BiLSTM, utilizes fewer parameters by merging the input and forget gates into a single update gate. Two gates—the reset gate and the update gate—are then used to modify the GRU’s hidden state. BiGRU streamlines the process while maintaining the ability to capture long-term dependencies in both the forward and backward directions. Equations (9)–(12) calculate the gates and states for the forward GRU [54]:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad \text{(update gate)} \quad (9)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad \text{(reset gate)} \quad (10)$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t]) \quad (11)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (12)$$
For the bidirectional GRU [47], see Equations (13)–(15):
$$\overrightarrow{h_t} = \mathrm{GRU}_{forward}(x_t, \overrightarrow{h_{t-1}}) \quad (13)$$
$$\overleftarrow{h_t} = \mathrm{GRU}_{backward}(x_t, \overleftarrow{h_{t+1}}) \quad (14)$$
$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}] \quad \text{(final hidden state)} \quad (15)$$
There are two kinds of gates in the GRU design shown in Figure 1: update gates and reset gates. The hidden state $h_t$ is defined as a linear interpolation between the previous hidden state and the candidate state. The BiGRU structure results from combining two GRUs operating in opposite directions, as seen in Figure 2. In this study, both directional GRUs received the same input, and together they determined the output.
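The bidirectional composition in Equations (13)–(15) can be realized directly with a standard deep learning library. The PyTorch sketch below is illustrative only; the embedding size, hidden size, and random input batch are placeholders rather than the hyperparameters used in this study.

```python
# Sketch of a bidirectional GRU encoder corresponding to Equations (13)-(15).
# Embedding size, hidden size, and the random input batch are placeholders.
import torch
import torch.nn as nn

embedding_dim, hidden_dim, batch, seq_len = 64, 32, 2, 10

bigru = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim,
               batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, embedding_dim)   # e.g., word embeddings
outputs, h_n = bigru(x)

# outputs[:, t, :] concatenates the forward and backward hidden states
# [h_t_forward, h_t_backward] for each time step t (Equation (15)).
print(outputs.shape)   # torch.Size([2, 10, 64]) -> 2 * hidden_dim
```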

2.2.5. LSTMLM

Lattice rescoring employing long short-term memory language models (LSTMLMs) and N-best rescoring are the predominant strategies utilized for rescoring. These methodologies, however, encounter challenges related to a limited search space or inconsistent training and evaluation procedures. Researchers employ an end-to-end model to accurately derive the optimum hypothesis from the word lattice to address these challenges. An attentional LSTM decoder follows a bidirectional lattice LSTM encoder in the model [55]. An LSTMLM is ill equipped for ascertaining the best path within the lattice because of its training being based on word prediction criteria. The model may struggle to differentiate between competing hypotheses during evaluation if the training data comprise just positive cases [55].
WSD in Sesotho sa Leboa is related to the knowledge-based approach with an SVM for Marathi [15] because of its hybrid methodology, which combines linguistic resources and machine learning to resolve word ambiguity in a morphologically rich language. Word senses are correctly classified by the SVM using extracted features, but the knowledge-based component uses semantic links and lexical resources. Similarly, the study looks into ensemble approaches (SVM, NB, and LR), transformer embeddings, and BiGRU + attention to handle WSD in Sesotho sa Leboa. Accordingly, a hybrid approach that blends knowledge-based and machine learning techniques could further enhance WSD performance in Sesotho sa Leboa.

3. Materials and Methods

This section describes the research design as it lays out strategies for optimal WSD outcomes. The researchers detail the model architecture, training implementation, and evaluation methods, justifying each choice along the way. With this framework, this section presents a wide-angle view of how the design and implementation in this study developed and pushed forward existing WSD practices.
This study uses the baseline deep learning models depicted in Figure 3, such as LSTMLM, BiGRU, and RNN-LSTM. To enhance performance for context-aware disambiguation, the study also optimizes BiGRU using attention mechanisms and the ADAM algorithm. Furthermore, the transformer models DeBERTa, RoBERTa, and ELECTRA, as well as pre-trained large language models, provide improved contextual awareness for WSD in morphologically rich languages like Sesotho sa Leboa. Using ADAM and attention-based mechanisms, the study improved the language models even further. The study then expanded on these findings by developing hybrid models that integrate BiGRU with transformer architectures, such as BiGRU + DeBERTa, BiGRU + RoBERTa, and BiGRU + ALBERT.
This research tackles the challenges of polysemy by creating, executing, and validating a word sense disambiguation (WSD) system for Sesotho sa Leboa, a low-resource and morphologically complex language. The proposed solution integrates transformer architectures, corpus-based ensemble methods, and enhanced deep learning models, including Hybrid BiGRU-transformer techniques and attention mechanisms, to achieve superior semantic disambiguation performance. This research enhances the capabilities of natural language processing (NLP) for underrepresented languages and offers a versatile and scalable framework for additional computational and linguistic investigations.
Recent advances in natural language processing (NLP), notably transformer-based designs like BERT, GPT, and RoBERTa, have yielded exceptionally competitive outcomes across various language comprehension tasks [51]. Since these models are capable of capturing subtle details in contextual semantics, they can be used for WSD tasks requiring fine-grained sense distinction [56]. Building on these innovations with transformers, this work aims to strengthen WSD systems from a more robust, context-aware perspective [53].
This research proposes an enhanced BiGRU architecture that incorporates additional mechanisms to better capture nuanced semantic relationships in WSD. By integrating elements such as task-specific fine-tuning and advanced contextual embeddings, the enhanced BiGRU aims to bridge the performance gap between traditional RNN-based architectures and transformer-based models. This modified BiGRU is specifically tailored to address the challenges posed by WSD, ensuring that both immediate and extended contextual information are effectively utilized for sense disambiguation.
Building on these advancements, this methodology integrates transformer-based architectures with an enhanced BiGRU, designed to refine the model’s interpretive accuracy in distinguishing subtle contextual variations in word meaning. Through this approach, this section seeks to contribute to the growing body of research focused on improving WSD outcomes, moving closer to achieving nuanced, accurate sense disambiguation across diverse language contexts.

3.1. Data Collection and Annotation

To compile an SsL WSD dataset, data were collected from several authentic sources, including academic dissertations, research papers, dictionaries, and web pages specifically containing ambiguous words. These sources were selected due to their standardized usage of language, which serves as a basis for reliable word meanings. Content extraction was achieved through web scraping techniques using Beautiful Soup, which enabled the automated collection of text by parsing HTML and filtering out any HTML tags. This extraction focused on gathering contextual examples of ambiguous words, ensuring that the collected text represented natural, everyday language usage.
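The text extraction step can be sketched as follows. The URL, tag filtering, and target word below are hypothetical placeholders; the actual sources and filtering rules used to build the corpus are those described in the text.

```python
# Sketch of the web-scraping step with Beautiful Soup.
# The URL, tag filtering, and target word are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url):
    """Download a page and return visible paragraph text with HTML tags stripped."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove scripts and styles so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

paragraphs = scrape_paragraphs("https://example.org/sesotho-sa-leboa-article")
ambiguous_word = "leleme"
contexts = [p for p in paragraphs if ambiguous_word in p.lower()]
print(f"Collected {len(contexts)} candidate sentences containing '{ambiguous_word}'.")
```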
The dataset for this study was created using web scraping techniques from a variety of sources because structured linguistic resources for Sesotho sa Leboa are scarce. These methods also included extracting articles from newspapers and blogs to capture linguistic diversity in the real world, gathering informal and conversational texts through social media and forums to include a variety of word senses and usages, and making sure that formal language samples from academic and government publications are of high quality for domain-specific disambiguation.
During the manual annotation phase of the Sesotho sa Leboa dataset, two linguists with expertise in Sesotho sa Leboa morphology, syntax, and semantics contributed their professional insights on intricate linguistic structures and cases of ambiguity. Additionally, three native speakers of Sesotho sa Leboa from various dialectal backgrounds worked as a team to confirm contextual appropriateness and genuine language usage. Two master’s students from the Applied Language Department also ensured quality by monitoring consistency and settling disagreements regarding annotations. This manual approach was essential for maintaining consistency and accuracy across the dataset, creating a dependable benchmark for WSD model development. Researchers use these types of annotated datasets to explore patterns in ambiguity and observe how words vary in meaning depending on their context within a sentence.
A taxonomy of Sesotho sa Leboa lexical ambiguities is shown in Figure 4. Polysemy and homonymy were used in this study to identify ambiguity in the dataset. Lexical ambiguity encompasses both homonymy and polysemy. One form of ambiguity is multiplicity of meaning: a single word or lexeme may have multiple distinct senses. Polysemy denotes a single word with multiple related senses, whereas homonymy refers to two or more words with unrelated meanings that are spelled the same way and have the same pronunciation or sound. Although it can be challenging to distinguish between polysemy and homonymy because the line between the two is blurry, they are significantly different [4,5,6,54,57]. Specifically, polysemy is defined as a type of lexical ambiguity in which a word or phrase has multiple semantically connected meanings that share the same etymology; it falls into three sub-categories: metonymy, specialization polysemy, and metaphor [54,57].
Again, it is possible for two lexemes with different etymologies to be written and pronounced identically. The various senses of a polysemous word share a common ancestor, and it is possible to distinguish between basic and derived senses whenever they are hypothesized. In other words, a term is considered polysemous if and only if it has two or more related senses. For instance, the word “leleme” can mean “language”, “tongue”, or “telling lies”.
Homonymy occurs when a word’s meanings are unrelated, which also suggests that words do not create certain structures. For instance, the word “lewa” might signify “spread of divine bones”, “cooked corn”, or “cave”.
The detailed characteristics of the SsL WSD dataset are presented in Table 1, which summarizes essential metrics, including the total number of sentences, tokens, annotations, sense types, lemmas, and ambiguity level. In total, the dataset contains 2859 sentences, 3289 tokens, and 3288 annotated instances across 162 distinct sense types and 22 lemmas. The average ambiguity level of 4.85 highlights the complexity of the dataset, as many words have multiple senses that depend heavily on the context.
This dataset forms the backbone of the WSD model training process, serving as both the training and evaluation benchmark.

3.2. Baseline Model: BiGRU

To establish a performance benchmark, the researchers first employed a baseline BiGRU model without additional enhancements. The BiGRU analyzes input sequences in both forward and backward orientations, enabling it to record bidirectional context for each word in the sentence. However, despite its bidirectional structure, the baseline BiGRU may face limitations in handling the complex contextual dependencies required for fine-grained word sense disambiguation, particularly for polysemous words, where subtle contextual nuances define meaning. Using this baseline, the researchers aimed to quantify the enhancements offered by additional model components in subsequent configurations.
Although transformer-based models like BERT and RoBERTa are frequently used to establish robust benchmarks, BiGRU was chosen as the baseline model because of their high computational cost and the substantial hardware they require for both training and inference. As a recurrent neural network (RNN) variant, BiGRU is more efficient and lightweight, which makes it suitable for low-resource settings. BiGRU is well suited to WSD because it can capture bidirectional dependencies in textual sequences; it enhances contextual knowledge by enabling the model to consider both past and future context when making predictions. Additionally, using BiGRU as a baseline makes it possible to assess the additional impact of transformer embeddings, offering a robust but interpretable standard: incorporating contextual embeddings into the model helps to isolate their distinct contributions.
BiGRU and transformer embeddings work well together to enable a hybrid strategy that improves sense disambiguation by processing contextualized representations from transformers in a sequential manner.

3.3. Enhanced BiGRU with Attention

An attention mechanism was incorporated into the BiGRU to enhance the baseline model’s capacity to capture pertinent context. The attention layer enables the model to allocate differing significance to various words in the phrase, concentrating on contextually pertinent segments of the input. This enhancement is expected to refine the model’s interpretive accuracy by enabling it to prioritize critical contextual cues, thereby addressing one of BiGRU’s baseline limitations in capturing nuanced semantic details necessary for WSD. In directing focus to important words or phrases, BiGRU with attention is anticipated to enhance disambiguation accuracy and improve the model’s ability to handle rare or less frequent senses.
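A minimal sketch of an attention layer over BiGRU outputs is shown below. The additive scoring function, dimensions, and the sense inventory size (taken from the dataset's 162 sense types) are assumptions for illustration, since the exact attention variant is not specified here.

```python
# Sketch: attention pooled over BiGRU outputs for sense classification.
# Dimensions and the attention formulation are illustrative assumptions.
import torch
import torch.nn as nn

class BiGRUWithAttention(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_senses):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.bigru = nn.GRU(embedding_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden_dim, 1)    # scores each time step
        self.classifier = nn.Linear(2 * hidden_dim, num_senses)

    def forward(self, token_ids):
        outputs, _ = self.bigru(self.embedding(token_ids))        # (B, T, 2H)
        weights = torch.softmax(self.attn_score(outputs), dim=1)  # (B, T, 1)
        context = (weights * outputs).sum(dim=1)                  # weighted sum
        return self.classifier(context)

model = BiGRUWithAttention(vocab_size=5000, embedding_dim=64,
                           hidden_dim=32, num_senses=162)
logits = model(torch.randint(0, 5000, (2, 10)))   # toy batch of token ids
print(logits.shape)   # torch.Size([2, 162])
```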

3.4. BiGRU with Attention + Transformer Embeddings

Further enhancing the model of BiGRU with attention, the researchers introduced transformer-based embeddings, such as BERT or RoBERTa, as input representations. Transformer embeddings are known for their ability to capture deep contextual semantics through self-attention mechanisms. Integrating these embeddings provides the BiGRU model with a rich, contextually informed starting point, allowing it to leverage both the fine-grained semantic information encoded by transformers and the sentence-level focus introduced by the attention mechanism. This configuration—BiGRU with Attention + transformer embeddings—is expected to bridge the gap between traditional sequential models and state-of-the-art WSD techniques, enabling the model to better handle complex word senses across diverse contexts.
Figure 5 illustrates the process of transformer embedding generation, showcasing key stages such as tokenization, positional encoding, self-attention, feed-forward layers, and output embeddings. Transformer embeddings are incorporated into the BiGRU model to increase WSD accuracy.
The hybrid architecture comprises the following components:
  • Embedding Layer: pre-trained transformer-based embeddings (such as BERT, RoBERTa, or ALBERT) create dense vector representations of words, capturing contextual information from the surrounding text.
  • BiGRU Encoder: by processing the sequential embeddings in both the forward and backward directions, the Bidirectional Gated Recurrent Unit (BiGRU) maintains contextual relationships from the past and the future.
  • Attention Mechanism (for attention-optimized BiGRU models): by dynamically weighing contextual contributions, the attention mechanism helps the model concentrate on the most pertinent words in a phrase, enhancing disambiguation.
  • Fully Connected Layer and Softmax Layer: the Fully Connected Layer maps the learnt feature representations to word sense categories, and the Softmax Layer determines the most likely interpretation based on context by computing probability distributions across all potential senses.
This hybrid design preserves the sequential processing advantages of BiGRU while utilizing transformers; a minimal sketch of the pipeline is given below.
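The following PyTorch sketch outlines this hybrid pipeline, with frozen transformer embeddings feeding a BiGRU, attention pooling, and a classification layer. The checkpoint name, the freezing strategy, the dimensions, and the placeholder input are illustrative assumptions, not the exact configuration reported in this study.

```python
# Sketch of the hybrid transformer-embedding + BiGRU + attention pipeline.
# The checkpoint, frozen encoder, dimensions, and input are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HybridTransformerBiGRU(nn.Module):
    def __init__(self, checkpoint="xlm-roberta-base", hidden_dim=128, num_senses=162):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        for p in self.encoder.parameters():    # keep transformer weights frozen
            p.requires_grad = False
        enc_dim = self.encoder.config.hidden_size
        self.bigru = nn.GRU(enc_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_senses)

    def forward(self, input_ids, attention_mask):
        embeddings = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        outputs, _ = self.bigru(embeddings)
        weights = torch.softmax(self.attn(outputs), dim=1)
        pooled = (weights * outputs).sum(dim=1)
        return self.classifier(pooled)   # softmax is applied via the loss function

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Placeholder input; real inputs are annotated Sesotho sa Leboa sentences.
batch = tokenizer(["leleme"], return_tensors="pt", padding=True)
model = HybridTransformerBiGRU()
print(model(batch["input_ids"], batch["attention_mask"]).shape)
```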
A condensed representation of the input is created by the hidden layer shown in Figure 4. Using this representation, the decoder recreates the original input. Only the compressed feature vector would be available without the decoder, and we would not be able to determine how accurately it replicates the original input. When the task involves converting the hidden feature representation back into a structured or interpretable form, a decoder is required. A decoder might not be required if the feature vector alone is adequate for the task (such as classification).

4. Results

4.1. Heuristics and Correlation Analysis

Heuristics refers to establishing guidelines or approximations used to evaluate performance data. The graphical representation in Figure 6 indicates which words perform better than others across a variety of parameters. Moreover, correlation analysis measures how strongly, and in which direction, the metrics are related to one another. These analyses aid in identifying trends and insights. The graphical representation in Figure 6 can also be used to analyze sentence lengths across the dataset and offers insight into the input size variability that affects model efficiency and resource requirements.
Figure 6 consists of two histograms representing the following information:
  1. Distribution of unique senses per word (left): the X-axis represents the number of unique senses per word and the Y-axis the frequency (number of words with that many senses). Most words have few unique senses (1–3 meanings), and the frequency drops sharply as the number of senses increases. A few outlier words have more than 10 senses, indicating highly polysemous words. A density curve (smoothed blue line) shows the distribution trend, confirming that words with higher polysemy are less frequent.
  2. Distribution of sentence lengths (right): the X-axis represents sentence length in word count and the Y-axis the frequency (number of sentences of that length). The distribution is highly skewed, with most sentences containing very few words (mostly 2–3 words) and very few sentences longer than 6 words. A density curve (smoothed green line) follows the histogram, showing the extreme concentration of short sentences.
A sketch of how these two distributions can be computed from the annotated corpus follows this list.
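The following minimal pandas sketch shows one way such distributions could be derived from an annotated corpus; the file name and column names ("sentence", "lemma", "sense") are assumptions for illustration, not the study’s actual files.

```python
from collections import Counter
import pandas as pd

# Illustrative sketch of the two Figure 6 distributions.
df = pd.read_csv("corpus.csv")

senses_per_word = df.groupby("lemma")["sense"].nunique()     # unique senses per lemma
sentence_lengths = df["sentence"].str.split().str.len()      # word count per sentence

print(senses_per_word.value_counts().sort_index())   # how many lemmas have 1, 2, 3, ... senses
print(sentence_lengths.describe())                   # reveals the skew toward very short sentences
```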

4.2. Frequent Words and Phrases (e.g., N-Grams) Associated with Each Sense Category

To find lexical patterns and links, it can be helpful to examine common words and phrases (such as n-grams) connected to each sense category. In Figure 7, bar charts show the most common n-grams for each sense category.
This bar chart represents the top 10 most frequent 2-grams (word pairs) occurring in text data associated with the word sense “asthma” in Sesotho sa Leboa. The word “mafaḥla” appears most frequently, suggesting its strong relevance to the sense “asthma”. The words “bolwetši” (illness/disease) and “bja” (of) frequently co-occur, indicating that they may form a phrase such as “bolwetši bja mafaḥla” (asthma disease). The most common 2-grams (e.g., “bolwetši bja” and “bja mafaḥla”) suggest standard expressions for referring to asthma in Sesotho sa Leboa.
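A small sketch of how the per-sense bigram counts behind Figure 7 could be produced is shown below; the example sentences are placeholders rather than items from the study’s corpus.

```python
from collections import Counter
from itertools import islice

# Assumed list of Sesotho sa Leboa sentences annotated with the "asthma" sense.
sentences_for_sense = [
    "bolwetši bja mafaḥla bo a mo tshwenya",
    "o na le bolwetši bja mafaḥla",
]

def bigrams(tokens):
    # Pair each token with its successor: ("bolwetši", "bja"), ("bja", "mafaḥla"), ...
    return zip(tokens, islice(tokens, 1, None))

counts = Counter()
for sent in sentences_for_sense:
    counts.update(bigrams(sent.lower().split()))

for pair, freq in counts.most_common(10):   # top 10 two-word sequences, as in Figure 7
    print(" ".join(pair), freq)
```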

4.3. Outlier Detection and Handling

Figure 8 presents the outlier analysis, based on the Z-score computed for each sentence length and on a scatter plot of text embeddings projected onto a two-dimensional space via Principal Component Analysis (PCA). Identifying and managing outliers is crucial for developing reliable models and accurate analyses, and the scatter plot makes it possible to spot points in a multivariate dataset that lie far from the dense data cluster. The X-axis (PCA Component 1) and Y-axis (PCA Component 2) represent the two principal components that condense the high-dimensional embeddings into a two-dimensional space.
The blue points represent normal sentence embeddings. The red points indicate outliers, identified by an outlier detection algorithm.
Outliers (red points) are dispersed at the edges of the distribution, indicating that they are significantly different from most sentence embeddings. There are more outliers in the lower part of the plot, suggesting embeddings with extreme variations in one or both PCA dimensions. Some isolated red points at the top and far left/right indicate extreme deviations, possibly due to noise, rare linguistic patterns, or data errors.
The outliers indicate polysemous words that require disambiguation. The red points are useful for detecting incoherent, adversarial, or irrelevant text.
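The following sketch illustrates the two outlier checks described above: a Z-score filter on sentence length and a PCA projection of sentence embeddings with a simple distance-based flag. The arrays and thresholds are illustrative assumptions, not the study’s actual data or cut-offs.

```python
import numpy as np
from sklearn.decomposition import PCA

sentence_lengths = np.array([2, 3, 2, 4, 3, 2, 15, 3, 2, 40])   # toy word counts
embeddings = np.random.rand(10, 768)      # placeholder for transformer sentence embeddings

# 1) Z-score on sentence length: flag values beyond a chosen threshold (2.5 here).
z = (sentence_lengths - sentence_lengths.mean()) / sentence_lengths.std()
length_outliers = np.where(np.abs(z) > 2.5)[0]

# 2) PCA projection of embeddings; points far from the cluster centre are candidates.
coords = PCA(n_components=2).fit_transform(embeddings)
dist = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
embedding_outliers = np.where(dist > dist.mean() + 2 * dist.std())[0]

print(length_outliers, embedding_outliers)
```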

4.4. Evaluation Metrics

A series of performance criteria was employed to analyze the enhanced BiGRU model at various developmental phases, providing a comprehensive evaluation of the model’s efficacy in word sense disambiguation (WSD). The accuracy, precision, recall, and F1-score were employed to evaluate each model iteration, as they jointly demonstrate the model’s efficacy in differentiating between several interpretations of ambiguous terms within a given context. Accuracy is represented numerically in Equation (16):
$\text{Accuracy} = \dfrac{\text{Correct Predictions}}{\text{Total Predictions}}$
This metric is essential for understanding the model’s general performance in predicting the correct word sense without additional enhancements. The accuracy of each subsequent model is compared against this baseline to assess improvements.
Precision and recall are particularly relevant for WSD tasks, where accuracy alone may not capture the nuances of identifying the correct sense among multiple candidates. Precision measures the proportion of correctly predicted senses among those predicted as positive by the model, reflecting its ability to avoid false positives. The precision metric is calculated using Equation (17):
$\text{Precision} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
Recall, conversely, assesses the model’s ability to recognize all pertinent instances of each word sense. This statistic indicates the model’s sensitivity to ambiguous circumstances by quantifying the ratio of accurately predicted senses to the total real senses, derived using Equation (18):
$\text{Recall} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
For each model variation, the F1-score is calculated to balance precision and recall, which is especially beneficial when word senses are unevenly distributed or when specific senses are more difficult to predict precisely. The F1-score integrates both measurements as their harmonic mean, providing a single measure of the model’s efficacy in accurately and consistently disambiguating senses. This is delineated in Equation (19):
$\text{F1-score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
Compared to that of the baseline BiGRU, performance is then evaluated for the BiGRU with an attention model, which incorporates an attention mechanism to selectively emphasize relevant context words for WSD. The BiGRU with attention + transformer embeddings model adds transformer-based embeddings to further capture contextual nuances, and metrics for this model show how pre-trained contextual embeddings impact word sense identification.
These metrics, stored and compared across each model variant, provide an in-depth look at the contributions of each architectural enhancement. The progression of the accuracy, precision, recall, and F1-score in each version of the enhanced BiGRU highlights the incremental improvements achieved by refining context representation and integrating syntactic features. Through this iterative evaluation, the final model’s performance demonstrates how each component optimally contributes to addressing the challenges of WSD.
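As a concrete illustration of Equations (16)–(19), the sketch below computes the four metrics for one model variant with scikit-learn; the label arrays and the macro-averaging choice are illustrative assumptions, not the study’s evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder gold and predicted sense IDs for one model variant.
y_true = [0, 1, 1, 2, 0, 2, 1]
y_pred = [0, 1, 2, 2, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # macro average weights rare senses equally

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```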

4.5. Comparative Analysis of Model Accuracies in WSD

Based on the models’ stated accuracy scores (61%, 79%, 74%, 70%, 64%, 62%, and 63%) as well as their intrinsic qualities, Figure 9 and Table 2 provide a thorough comparison of the models. Recall that long short-term memory (LSTM) is a kind of recurrent neural network (RNN) created to deal with long-term dependencies in sequence data. It is especially well suited for applications like sequence labeling, language modeling, and time series prediction. Although the model learns from the data, its ability to recognize intricate patterns may be constrained, as indicated by its 61% accuracy rate. This is to be expected since LSTM models generally perform worse than transformer models, despite being strong for sequential tasks.
A BiGRU (Bidirectional Gated Recurrent Unit) is a type of GRU (Gated Recurrent Unit) that processes input data in both the forward and backward directions. Among the models on the list, BiGRU performs the best, with an accuracy of 79%; its bidirectional nature, which enables the model to learn from both past and future contexts, is responsible for this performance. An LSTMLM is an LSTM model intended for language modeling (LM) objectives, such as next-word prediction, text generation, or sentence structure comprehension. The LSTMLM performs well on language modeling tasks with an accuracy of 74% but lags behind BiGRU, suggesting that, although LSTM-based models are effective for sequence learning, they do not match the bidirectional BiGRU in accuracy on this task.
DistilBERT is a condensed, more efficient variant of BERT that is 60% faster and uses fewer resources while maintaining 97% of BERT’s language comprehension abilities. DeBERTa (decoding-enhanced BERT with disentangled attention) is an upgraded version of BERT that incorporates two advancements: disentangled attention and improved position embeddings. With an accuracy of 70%, DeBERTa appears to perform well, but it is not the best model on the list. It is crucial to remember that DeBERTa’s architecture enables it to perform well on a variety of downstream tasks, and that 70% accuracy may be the result of particular task-related difficulties rather than a basic flaw in the model. With an accuracy of 64%, DistilBERT is a little less accurate than one might anticipate for a transformer model. However, in contexts with limited resources or where rapid deployment is necessary, DistilBERT may be worth the trade-off because it sacrifices some capacity for speed and efficiency.
T5 is a single text-to-text model created to handle a broad range of natural language processing tasks by casting them into a text-to-text format. It appears to perform worse than models such as BiGRU and LSTMLM, with an accuracy of 62%; the nature of the task or the T5 fine-tuning procedure may be responsible. Nonetheless, T5’s adaptability to various NLP tasks remains one of its main advantages. ALBERT, a more parameter-efficient variant of BERT, shares parameters between layers to preserve strong performance while minimizing model size. With a 63% accuracy rate, ALBERT offers a fair balance between performance and efficiency; despite not achieving the best accuracy, its lower computing requirements make it a good option for resource-constrained contexts.
Table 3 and Table 4 report pairwise comparisons for each model pair. McNemar’s test was computed to check whether the accuracy differences are statistically significant, and confidence intervals were then computed for each model comparison. The RNN-LSTM versus DistilBERT comparison in Table 3 yields a p-value of 0.07010, indicating that the difference between these two models is not statistically significant, whereas most other comparisons in Table 3 show p-values < 0.05, indicating statistically significant differences.
The statistical significance of the model differences is revealed by the p-values and confidence intervals. A lower p-value (usually less than 0.05) indicates a significant difference, while the confidence interval gives a range within which the true difference most likely falls. In particular, comparisons involving Hybrid BiGRU + RoBERTa against BiGRU (ADAM), BiGRU (Attention), and Hybrid BiGRU + BERT frequently show highly significant differences (p-value = 0.00000), with the confidence intervals indicating consistently worse performance. Similarly, Hybrid BiGRU + DeBERTa and Hybrid BiGRU + ALBERT show strong statistical significance when compared to Hybrid BiGRU + RoBERTa, indicating their superior performance. By contrast, the comparisons between BiGRU (ADAM) and BiGRU (Attention), between Hybrid BiGRU + DeBERTa and Hybrid BiGRU + ALBERT, and between Hybrid BiGRU + DeBERTa and BiGRU (Attention) yield higher p-values (>0.05), indicating no statistically significant differences.
The confidence intervals for these comparisons overlap considerably, reinforcing the similarity in performance. These results suggest that Hybrid BiGRU models leveraging DeBERTa and ALBERT embeddings perform better and are statistically distinct from those using RoBERTa, while attention-based BiGRU models do not significantly outperform their ADAM-optimized counterparts.
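The pairwise comparisons behind Tables 3 and 4 can, in principle, be reproduced with McNemar’s test on a 2 × 2 contingency table of per-item agreement between two models. The sketch below uses statsmodels; the counts are invented for illustration and do not correspond to any specific pair in the tables.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / wrong; columns: model B correct / wrong (invented counts).
table = np.array([[150, 35],
                  [12, 53]])

result = mcnemar(table, exact=False, correction=True)   # chi-square approximation
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.5f}")
```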

Enhanced Models

When evaluating the improved performance of different models, as shown in Figure 10 and Table 5, it is critical to compare their accuracy metrics with the methods used for enhancement, particularly for models that have been optimized using different techniques like Adam optimization, attention mechanisms, or hybrid models. A thorough examination of the models based on the updated accuracy scores and improvements made to them is provided below.
Better performance and faster convergence are usually achieved when the Adam optimizer is used with BiGRU (Bidirectional Gated Recurrent Unit), particularly for non-linear optimization and large datasets. The improved accuracy (84%) shows how effectively Adam optimization speeds convergence and improves the final model, indicating that optimizing conventional RNN-based models such as BiGRU can yield considerable gains. Adding an attention mechanism, which helps the model focus on particular segments of the input sequence, further strengthens BiGRU’s learning of significant patterns, particularly in long sequences. Among the BiGRU-based models, the resulting accuracy of 85% is the highest, demonstrating the importance of attention mechanisms in identifying subtle patterns and enhancing overall accuracy.
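A minimal sketch of one ADAM-optimized training step for the BiGRU classifier defined in the earlier sketch is shown below; the learning rate, batch shapes, and dummy tensors are illustrative assumptions rather than the study’s training setup.

```python
import torch
import torch.nn as nn

model = BiGRUWithAttention()                   # class from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

token_ids = torch.randint(1, 8000, (32, 12))   # dummy batch: 32 sentences, 12 tokens each
labels = torch.randint(0, 16, (32,))           # dummy gold sense IDs

optimizer.zero_grad()
loss = criterion(model(token_ids), labels)     # softmax + negative log-likelihood in the loss
loss.backward()
optimizer.step()
```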
The Hybrid BiGRU + BERT model combines BiGRU with BERT, a pre-trained transformer renowned for its strong contextual language understanding, with the goal of exploiting the advantages of both components. Although its accuracy of 79% is respectable, it suggests that the combination may not fully exploit each component’s strengths; performance could be further enhanced by adjusting the model or implementing a more effective integration method. RoBERTa (robustly optimized BERT approach) is a BERT variant that optimizes the training procedure for improved performance. Combining BiGRU with RoBERTa aims to improve BiGRU’s sequential learning capability while also drawing on RoBERTa’s enhanced comprehension. Although RoBERTa generally outperforms BERT, its 70% accuracy here indicates that it does not enhance BiGRU as effectively as other transformer models, possibly because of problems with model integration or training methodology.
DeBERTa is an improved version of BERT that uses enhanced position embeddings and disentangled attention to handle more challenging language problems. The hybrid technique combines BiGRU and DeBERTa to capture both contextual information and sequential dependencies; its accuracy of 85% mirrors that of the attention-optimized BiGRU, indicating that DeBERTa’s sophisticated capabilities complement BiGRU. ALBERT is a compact, parameter-efficient variant of BERT. The BiGRU + ALBERT hybrid uses efficient embeddings and parameter sharing to reduce the computational load while preserving high performance, leveraging both the sequential learning capability of BiGRU and the efficiency of ALBERT, as evidenced by its 84% accuracy.
T5 is a flexible transformer model that treats every NLP problem as a text-to-text problem. Combining BiGRU with T5 aims to pair BiGRU’s sequential learning with T5’s powerful text-to-text generation capabilities. However, the Hybrid BiGRU + T5 model’s 58% accuracy suggests that it is not well aligned with the task at hand: because T5 is essentially a text generation model, combining it with BiGRU may yield suboptimal performance on tasks that are better framed as classification.
At prediction time, the model uses the classification output, together with the original sentence, as context to assign the correct sense to the ambiguous word, as outlined in Table 6.
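The predictions in Table 6 can be obtained, in principle, by running the trained classifier on a sentence containing the ambiguous word and mapping the highest-probability class back to its sense label. The sketch below reuses the tokenizer and hybrid classifier class from the earlier sketch; the sense label list and the (untrained) weights are placeholders, not the study’s released model.

```python
import torch

sense_labels = ["Location", "Number seven", "Time", "See"]             # illustrative sense inventory
model = TransformerBiGRUClassifier(num_senses=len(sense_labels))       # class from the earlier sketch

sentence = "Ke na le maeba a šupa."
batch = tokenizer([sentence], return_tensors="pt", padding=True)       # tokenizer from the same sketch

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(batch["input_ids"], batch["attention_mask"]), dim=-1)

print(sense_labels[int(probs.argmax())], float(probs.max()))           # predicted sense and its probability
```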

5. Discussion

The application of optimization techniques or hybrid approaches leads to notable increases in model performance when compared to the baseline model accuracies. A thorough discussion and comparison are provided below.
The baseline results in Table 7 indicate that BiGRU’s efficient and context-aware bidirectional architecture makes it the most effective standalone model.
BiGRU (optimized with ADAM): this optimizer improves training efficiency and gradient updates, resulting in a 5-percentage-point increase in accuracy over the standard BiGRU, as seen in Table 7 and Table 8. Attention mechanisms, which emphasize the value of focusing on pertinent portions of the input sequence, are introduced to further improve accuracy.
The performances of the hybrid models BiGRU + DeBERTa (85%) and BiGRU + ALBERT (84%) are shown in Table 8:
  • These hybrids achieve the highest ratings by combining the potent context modeling of transformer-based architectures with the bidirectional efficiency of BiGRU.
  • Hybrid BiGRU + BERT (79%): its performance is comparable to the basic BiGRU, indicating either little additional advantage or inadequate hybrid tuning.
  • Hybrid BiGRU + RoBERTa (70%): the outcome is disappointing, which could be the result of difficulties in integrating the hybrid model or the less-than-ideal architecture synergy.

Comparative Perspectives and Benchmarking WSD Performance

The benchmarking results for Sesotho sa Leboa demonstrate that the BiGRU-based models achieve competitive accuracy compared to that of other language models for WSD. Among the Sesotho sa Leboa models, BiGRU optimized with attention (85%) slightly outperforms BiGRU optimized with ADAM (84%), indicating that attention mechanisms enhance contextual understanding in polysemous word disambiguation. The Hybrid BiGRU + DeBERTa model (85%) performs equally to BiGRU + attention, suggesting that DeBERTa’s improved encoding of syntactic dependencies benefits WSD in Sesotho sa Leboa. However, Hybrid BiGRU + BERT (79%) and Hybrid BiGRU + RoBERTa (70%) show lower accuracy, likely due to BERT and RoBERTa’s suboptimal generalization for low-resource linguistic structures.
When compared to other languages in Table 9, Amharic (BiGRU: 99.99%) and Ge’ez (99.52% using multiple classifiers) significantly outperform Sesotho sa Leboa models, likely due to larger annotated datasets and optimized architectures. Portuguese (BERT: 84%) achieves similar accuracy to BiGRU-based models in Sesotho sa Leboa, while Korean (Viterbi Algorithm: 76.4%) and Hindi (Word2Vec: 58%) perform notably lower, reinforcing the importance of contextual embeddings and neural architectures for effective WSD. Amharic’s joint supervised and unsupervised approach (86% accuracy, 92.5% F1-score) suggests that semi-supervised methods could enhance Sesotho sa Leboa’s WSD performance. Meanwhile, Arabic’s BERT model (96% F1-score) further highlights the benefits of transformer-based architectures in high-resource settings.
Overall, while Sesotho sa Leboa’s BiGRU-based models achieve robust performance, they still lag behind top-performing WSD models for Amharic and Ge’ez, indicating a need for larger annotated corpora, improved hybrid architectures, and semi-supervised learning strategies. Future research could explore pre-training custom transformer models on Sesotho sa Leboa-specific datasets to bridge the performance gap with high-resource languages.
In summary:
  • BiGRU as a robust baseline: because of its efficient and bidirectional design, BiGRU consistently performs well, even in its baseline configuration.
  • Improvements drive performance: accuracy is greatly increased by both hybridization and optimization (such as ADAM and attention), and combining BiGRU with cutting-edge transformers like DeBERTa and ALBERT yields the greatest benefits.
  • The function of transformers in hybrids: not every transformer integration works equally well. Some, like RoBERTa, have limited success, probably because of architectural or task-specific mismatches, whereas DeBERTa and ALBERT improve BiGRU’s capabilities.

6. Conclusions

The experimental findings indicate that BiGRU models, whether standalone or hybrid, are successful in addressing word sense disambiguation (WSD) for Sesotho sa Leboa. The highest accuracy of 85% was achieved with BiGRU (Optimized with Attention) and Hybrid BiGRU + DeBERTa, highlighting the importance of advanced transformer architectures and attention mechanisms in enhancing context representation. The model’s disambiguation capabilities are augmented by the attention mechanism, enabling it to focus on significant portions of the input sequence. DeBERTa performs better at catching complex word meanings because of its improved spatial encoding and disentangled attention. Strong accuracy levels of 84% were maintained by BiGRU (Optimized with ADAM) and Hybrid BiGRU + ALBERT, demonstrating the value of ALBERT’s parameter efficiency in capturing semantic subtleties and the efficacy of ADAM optimization for training stability. While ALBERT’s factorized embedding parameterization maximizes model performance with fewer parameters, ADAM’s adaptive learning rate aids in effective convergence.
Although it performed well, the Hybrid BiGRU + BERT model’s accuracy of 79% was marginally worse than that of its ALBERT and DeBERTa counterparts. The heavier architecture of BERT, which may not integrate as well with BiGRU, could be the cause: BERT’s large model size and conventional attention mechanisms may add complexity without corresponding benefits. The lowest accuracy, 70%, was obtained by Hybrid BiGRU + RoBERTa, indicating that RoBERTa’s pre-training dynamics or tokenization strategy might not be the best fit for WSD tasks in morphologically rich, low-resource languages; combining RoBERTa’s aggressive masking method with BiGRU’s sequential processing may limit its success. Attention mechanisms greatly enhance BiGRU’s ability to recognize the contextual dependencies that are essential for deciphering polysemous words. Because of its novel attention mechanism and improvements in positional encoding, DeBERTa’s design integrates better with BiGRU. Parameter-efficient models like ALBERT maintain good performance with less computational complexity, while optimization techniques like ADAM increase training efficiency. When addressing WSD in low-resource languages like Sesotho sa Leboa, the performance variation across hybrid models emphasizes the significance of model compatibility and the need for customized structures. These results imply that BiGRU-based models for WSD tasks in morphologically rich languages can be considerably improved by utilizing attention mechanisms and choosing appropriate transformer architectures.

Author Contributions

Conceptualization, M.A.M. and H.D.M.; methodology, M.A.M. and H.D.M.; formal analysis, M.A.M. and H.D.M.; data curation, M.A.M. and H.D.M.; writing—original draft preparation, M.A.M. and H.D.M.; writing—review and editing, M.A.M. and H.D.M.; visualization, M.A.M. and H.D.M.; supervision, S.O.O., P.A.O. and F.G.; project administration, M.A.M. and H.D.M.; funding acquisition, M.A.M. and H.D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation (NRF), grant number BAAP2204052075-PR-2023, through Sefako Makgatho Health Sciences University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The material used in this study is available on request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pretorius, R.; Pretorius, L. Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography. In Proceedings of the EACL Workshop on Language Technologies for African Languages (AfLaT), Athens, Greece, 31 March 2009; pp. 66–73. [Google Scholar]
  2. Masethe, M.A.; Masethe, H.D.; Ojo, S.O.; Pius, A. Word Sense Disambiguation Pipeline Framework for Low Resourced Morphologically Rich Languages. In Proceedings of the International Conference on Information Systems and Emerging Technologies (ICISET), Windhoek, Namibia, 23–25 November 2022. [Google Scholar]
  3. Pal, A.R.; Saha, D.; Naskar, S.K. Word Sense Disambiguation in Bengali: A Knowledge based Approach using Bengali WordNet. In Proceedings of the Electrical, Computer and Communication Technologies (ICECCT), 2017 Second International Conference, Coimbatore, India, 22–24 February 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5. [Google Scholar] [CrossRef]
  4. Chokoe, S.J. Linguistic Ambiguity in Northern Sotho: Saying the Unmeant. Ph.D. Thesis, Randse Afrikaanse Universiteit, Johannesburg, South Africa, 2000. [Google Scholar]
  5. Mojela, V.M. Polysemy and Homonymy: Challenges Relating to Lexical Entries in the Sesotho sa Leboa—English Bilingual Dictionary. Lexikos 2007, 17, 433–439. [Google Scholar]
  6. Faaß, G. A Morphosyntactic Description of Northern Sotho as a Basis for an Automated Translation from Northern Sotho into English. Ph.D. Thesis, University of Pretoria, Pretoria, South Africa, 2010. Available online: https://www.up.ac.za/ (accessed on 8 October 2019).
  7. Zhang, Y. A Constructing Method of Mongolia-Chinese-English Multilingual Semantic Net based on WordNet. In Proceedings of the International Conference on Computer Science and Applications, Wuhan, China, 20–22 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 196–198. [Google Scholar] [CrossRef]
  8. Giunchiglia, F.; Maltese, V.; Dutta, B. Domains and context: First steps towards managing diversity in knowledge. J. Web Semant. Sci. Serv. Agents World Wide Web 2012, 12–13, 53–63. [Google Scholar] [CrossRef]
  9. Marobela, R.M. Polysemy of the Verbs ya and tla in Northern Sotho. Ph.D. Thesis, University of Stellenbosch, Stellenbosch, South Africa, 2006. Available online: http://hdl.handle.net/10019.1/1033 (accessed on 30 October 2019).
  10. Popov, A. Neural Network Models for Word Sense Disambiguation: An Overview. Cybern. Inf. Technol. 2018, 18, 139–151. [Google Scholar] [CrossRef]
  11. Roh, J.; Park, S.; Kim, B.K.; Oh, S.H.; Lee, S.Y. Unsupervised multi-sense language models for natural language processing tasks. Neural Networks 2021, 142, 397–409. [Google Scholar] [CrossRef] [PubMed]
  12. Zhou, X.; Huang, H.; Chi, Z.; Ren, M.; Gao, Y. RS-BERT: Pre-training radical enhanced sense embedding for Chinese word sense disambiguation. Inf. Process. Manag. 2024, 61, 103740. [Google Scholar] [CrossRef]
  13. Farouk, G.M.; Ismail, S.S.; Aref, M.M. Transformer-Based Word Sense Disambiguation: Advancements, Impact, and Future Directions. In Proceedings of the 2023 Eleventh International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 21–23 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 140–146. [Google Scholar] [CrossRef]
  14. Srivastav, A.; Tayal, D.K.; Agarwal, N. A Novel Fuzzy Graph Connectivity Measure to Perform Word Sense Disambiguation Using Fuzzy Hindi WordNet. In Proceedings of the 3rd IEEE 2022 International Conference on Computing, Communication, and Intelligent Systems, ICCCIS 2022, Greater Noida, India, 4–5 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 648–654. [Google Scholar] [CrossRef]
  15. Gahankari, A.; Kapse, A.S.; Atique, M.; Thakare, V.M.; Kapse, A.S. Hybrid approach for Word Sense Disambiguation in Marathi Language. In Proceedings of the 2023 4th IEEE Global Conference for Advancement in Technology, GCAT 2023, Bangalore, India, 6–8 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar] [CrossRef]
  16. Abdelaali, B.; Tlili-Guiassa, Y. Swarm optimization for Arabic word sense disambiguation based on English pre-trained word embeddings. In Proceedings of the ISIA 2022—International Symposium on Informatics and its Applications, Proceedings, M’sila, Algeria, 29–30 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
  17. Kokane, C.D.; Babar, S.D.; Mahalle, P.N. Word sense disambiguation for large documents using neural network model. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar] [CrossRef]
  18. Shafi, J.; Nawab, R.M.A.; Rayson, P. Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–32. [Google Scholar] [CrossRef]
  19. Bakx, G.E. Machine Learning Techniques for Word Sense Disambiguation. Ph.D. Thesis, Universitat Politècnica de Catalunya, Barcelona, Spain, 2006. Available online: https://www.lsi.upc.edu/~escudero/wsd/06-tesi.pdf (accessed on 5 August 2024).
  20. Hladek, D.; Stas, J.; Pleva, M.; Ondas, S.; Kovacs, L. Survey of the Word Sense Disambiguation and Challenges for the Slovak Language. In Proceedings of the 17th IEEE International Symposium on Computational Intelligence and Informatics, Budapest, Hungary, 17–19 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 225–230. [Google Scholar] [CrossRef]
  21. Sarmah, J.; Sarma, S.K. Word Sense Disambiguation for Assamese. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India, 27–28 February 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 146–151. [Google Scholar] [CrossRef]
  22. Demlew, G.; Yohannes, D. Resolving Amharic Lexical Ambiguity using Neural Word Embedding. In Proceedings of the 2022 International Conference on Information and Communication Technology for Development for Africa (ICT4DA), Bahir Dar, Ethiopia, 28–30 November 2022; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar] [CrossRef]
  23. Loureiro, D.; Jorge, A.M.; Camacho-Collados, J. LMMS reloaded: Transformer-based sense embeddings for disambiguation and beyond. Artif. Intell. 2022, 305, 103661. [Google Scholar] [CrossRef]
  24. Rouhizadeh, H.; Shamsfard, M.; Rouhizadeh, M. Knowledge Based Word Sense Disambiguation with Distributional Semantic Expansion for the Persian Language. In Proceedings of the 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2020; pp. 329–335. [Google Scholar] [CrossRef]
  25. Agbesi, V.K.; Chen, W.; Yussif, S.B.; Hossin, A.; Ukwuoma, C.C.; Kuadey, N.A.; Agbesi, C.C.; Samee, N.A.; Jamjoom, M.M.; Al-Antari, M.A. Pre-Trained Transformer-Based Models for Text Classification Using Low-Resourced Ewe Language. Systems 2025, 12, 1. [Google Scholar] [CrossRef]
  26. Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2023, 2, 225–250. [Google Scholar] [CrossRef]
  27. Ghadekar, P.; Malwatkar, N.; Sontakke, N.; Soni, N. Comparative Analysis of LSTM, GRU and Transformer Models for German to English Language Translation. In Proceedings of the 2023 3rd Asian Conference on Innovation in Technology, ASIANCON 2023, Pune, India, 25–27 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [Google Scholar] [CrossRef]
  28. Li, B.; Lu, S.; Xie, K.; Wang, Z. Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method. In Proceedings of the 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Nicosia, Cyprus, 4–6 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 410–413. [Google Scholar] [CrossRef]
  29. Kaddoura, S.; Nassar, R. EnhancedBERT: A feature-rich ensemble model for Arabic word sense disambiguation with statistical analysis and optimized data collection. J. King Saud Univ.—Comput. Inf. Sci. 2024, 36, 101911. [Google Scholar] [CrossRef]
  30. Khuntia, M.; Gupta, D. Indian News Headlines Classification using Word Embedding Techniques and LSTM Model. Procedia Comput. Sci. 2023, 218, 899–907. [Google Scholar] [CrossRef]
  31. Aurpa, T.T.; Ahmed, S. Heliyon An ensemble novel architecture for Bangla Mathematical Entity Recognition (MER) using transformer based learning. Heliyon 2024, 10, e25467. [Google Scholar] [CrossRef] [PubMed]
  32. Nicula, B.; Dascalu, M.; Newton, N.N.; Orcutt, E.; McNamara, D.S. Automated Paraphrase Quality Assessment Using Language Models and Transfer Learning. Computers 2021, 10, 166. [Google Scholar] [CrossRef]
  33. Nicolae, D.C.; Yadav, R.K.; Tufiş, D. A Lite Romanian BERT: ALR-BERT. Computers 2022, 11, 57. [Google Scholar] [CrossRef]
  34. Pu, X.; Yuan, L.; Leng, J.; Wu, T.; Gao, X. Lexical knowledge enhanced text matching via distilled word sense disambiguation. Knowledge-Based Syst. 2023, 263, 110282. [Google Scholar] [CrossRef]
  35. Choi, H.; Kim, J.; Joe, S.; Gwon, Y. Evaluation of BERT and Albert sentence embedding performance on downstream NLP tasks. In Proceedings of the International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2020; pp. 5482–5487. [Google Scholar] [CrossRef]
  36. Xu, M.; Liu, S. A RoBERTa-Based Model with Bi-GRU and Multi-Head Attention for Chinese Offensive Language Detection in Social Media. Appl. Sci. 2023, 13, 11000. [Google Scholar] [CrossRef]
  37. Alshanqiti, A.; Namoun, A.; Alsughayyir, A.; Mashraqi, A.M.; Gilal, A.R.; Albouq, S.S. Leveraging DistilBERT for Summarizing Arabic Text: An Extractive Dual-Stage Approach. IEEE Access 2021, 9, 135594–135607. [Google Scholar] [CrossRef]
  38. Benselloua, A.Y.M.; Messadi, S.A.; Belfedhal, A.E. Effective Malicious PowerShell Scripts Detection Using DistilBERT. In Proceedings of the 2023 1st IEEE Afro-Mediterranean Conference on Artificial Intelligence, AMCAI 2023—Proceedings, Hammamet, Tunisia, 13–15 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
  39. Prema, V.; Elavazhahan, V. Sculpting DistilBERT: Enhancing Efficiency in Resource-Constrained Scenarios. In Proceedings of the 2023 12th International Conference on System Modeling and Advancement in Research Trends, SMART 2023, Moradabad, India, 22–23 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 251–256. [Google Scholar] [CrossRef]
  40. Nemani, P.; Vollala, S. A Cognitive Study on Semantic Similarity Analysis of Large Corpora: A Transformer-based Approach. In Proceedings of the INDICON 2022—2022 IEEE 19th India Council International Conference, Kochi, India, 24–26 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
  41. Khan, W.; Daud, A.; Khan, K.; Muhammad, S.; Haq, R. Exploring the frontiers of deep learning and natural language processing: A comprehensive overview of key challenges and emerging trends. Nat. Lang. Process. J. 2023, 4, 100026. [Google Scholar] [CrossRef]
  42. Adewumi, T.; Sabry, S.S.; Liwicki, F.; Abid, N.; Liwicki, M. T5 for Hate Speech, Augmented Data, and Ensemble. Sci 2023, 5, 37. [Google Scholar] [CrossRef]
  43. Belete, M.D.; Salau, A.O.; Alitasb, G.K.; Bezabh, T. Contextual word disambiguates of Ge’ez language with homophonic using machine learning. Ampersand 2024, 12, 100169. [Google Scholar] [CrossRef]
  44. Sharfuddin, A.A.; Tihami, M.N.; Islam, M.S. A Deep Recurrent Neural Network with BiLSTM model for Sentiment Classification. In Proceedings of the 2018 International Conference on Bangla Speech and Language Processing, ICBSLP 2018, Sylhet, Bangladesh, 21–22 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4. [Google Scholar] [CrossRef]
  45. Catelli, R.; Casola, V.; De Pietro, G.; Fujita, H.; Esposito, M. Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification. Knowledge-Based Syst. 2021, 213, 106649. [Google Scholar] [CrossRef]
  46. Alom, M.Z.; Moody, A.T.; Maruyama, N.; Van Essen, B.C.; Taha, T.M. Effective Quantization Approaches for Recurrent Neural Networks. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018. [Google Scholar] [CrossRef]
  47. Meng, S.; Jiang, X.Q.; Gao, Y.; Hai, H.; Hou, J. Performance Evaluation of Channel Decoder based on Recurrent Neural Network. J. Phys. Conf. Ser. 2020, 1438, 012001. [Google Scholar] [CrossRef]
  48. Al, A.; Hoenig, A.; Roy, K. Sentence subjectivity analysis of a political and ideological debate dataset using LSTM and BiLSTM with attention and GRU models. J. King Saud Univ.—Comput. Inf. Sci. 2022, 34, 7974–7987. [Google Scholar] [CrossRef]
  49. Aziz, A.; Hossain, M.A.; Chy, A.N.; Ullah, M.Z.; Aono, M. Leveraging contextual representations with BiLSTM-based regressor for lexical complexity prediction. Nat. Lang. Process. J. 2023, 5, 100039. [Google Scholar] [CrossRef]
  50. Ali, M.N.A.; Tan, G.; Hussain, A. Bidirectional Recurrent Neural Network Approach for Arabic Named Entity Recognition. Future Internet 2018, 10, 123. [Google Scholar] [CrossRef]
  51. Lukasik, M.; Dadachev, B.; Papineni, K.; Simões, G. Text segmentation by cross segment attention. In Proceedings of the EMNLP 2020—2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual, 16–20 November 2020; pp. 4707–4716. [Google Scholar] [CrossRef]
  52. Duan, J.; Zhao, H.; Qin, W.; Qiu, M.; Liu, M. News Text Classification Based on MLCNN and BiGRU Hybrid Neural Network. In Proceedings of the Proceedings—2020 3rd International Conference on Smart BlockChain, SmartBlock 2020, Zhengzhou, China, 23–25 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 137–142. [Google Scholar] [CrossRef]
  53. Pappagari, R.; Zelasko, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Hierarchical Transformers for Long Document Classification. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 838–844. [Google Scholar] [CrossRef]
  54. Lohk, A.; Orav, H.; Vare, K.; Bond, F.; Vaik, R. New Polysemy Structures in Wordnets Induced by Vertical Polysemy. In Proceedings of the Tenth Global Wordnet Conference, Wroclaw, Poland, 23–27 July 2019; Global Wordnet Association: Wroclaw, Poland, 2019; pp. 394–403. Available online: https://www.aclweb.org/anthology/2019.gwc-1.50 (accessed on 28 May 2021).
  55. Belete, M.D.; Shiferaw, L.G.; Alitasb, G.K.; Tamir, T.S. Enhancing Word Sense Disambiguation for Amharic homophone words using Bidirectional Long Short-Term Memory network. Intell. Syst. Appl. 2024, 23, 200417. [Google Scholar] [CrossRef]
  56. Kanade, A.; Maniatis, P.; Balakrishnan, G.; Shi, K. Learning and Evaluating Contextual Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual, 13–18 July 2020; pp. 5066–5077. [Google Scholar]
  57. Liang, X.; Huang, F.; Liu, D.; Xu, M. Brain and Language Brain representations of lexical ambiguity: Disentangling homonymy, polysemy, and their meanings. Brain Lang. 2024, 253, 105426. [Google Scholar] [CrossRef] [PubMed]
  58. Do Nascimento, C.H.; Garcia, V.C.; de Andrade Araújo, R. A Word Sense Disambiguation Method Applied to Natural Language Processing for the Portuguese Language. IEEE Open J. Comput. Soc. 2024, 5, 268–277. [Google Scholar] [CrossRef]
  59. Yoon, Y.; Seon, C.N.; Lee, S.; Seo, J. Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary. Inf. Process. Manag. 2007, 43, 836–847. [Google Scholar] [CrossRef]
  60. Kumari, A.; Lobiyal, D.K. Efficient estimation of Hindi WSD with distributed word representation in vector space. J. King Saud Univ.—Comput. Inf. Sci. 2022, 34, 6092–6103. [Google Scholar] [CrossRef]
Figure 1. The GRU framework [47].
Figure 2. The architecture of Bi-GRU [47].
Figure 3. Theoretical framework for WSD research methodology.
Figure 4. Taxonomy of lexical ambiguity in Sesotho sa Leboa.
Figure 5. Diagram for transformer embedding generation.
Figure 6. Heuristics and correlation analysis.
Figure 7. N-grams for senses.
Figure 8. Outlier detection.
Figure 9. Comparative analysis of model accuracies in WSD.
Figure 10. Comparative analysis of enhanced WSD models.
Table 1. Summary of dataset description.
Dataset# Sentences# Tokens# Annotations# Sense Types# LemmasAmbiguity
Augmented dataset285932893288162224.85
Table 2. Summary table of baseline model performances.
Model | Accuracy | Key Strengths | Key Weaknesses
RNN-LSTM | 61% | Good for sequence learning, handles long dependencies | Limited compared to transformer models
BiGRU | 79% | Bidirectional, efficient, high accuracy | RNN-based, may underperform on complex tasks
LSTMLM | 74% | Effective for language modeling | May not be as strong in other NLP tasks
DeBERTa | 70% | State-of-the-art, better context handling | Larger, more computationally expensive
DistilBERT | 64% | Faster, resource-efficient, good for real-time applications | Poorer performance than full BERT models
Table 3. Contingency table for base models.
Model 1 | Model 2 | Statistic | p-Value | Confidence Interval
RNN-LSTM | BiGRU | 122.000 | 0.00000 | (0.6830067283444499, 0.7662032039354598)
RNN-LSTM | LSTMLM | 145.000 | 0.00000 | (0.6369521078503567, 0.7228712917081422)
RNN-LSTM | DeBERTa | 188.000 | 0.00001 | (0.557530428504149, 0.6458594020043255)
RNN-LSTM | DistilBERT | 212.000 | 0.07010 | (0.49777829086492364, 0.5884286056868006)
BiGRU | LSTMLM | 162.000 | 0.06494 | (0.3986093107729891, 0.5013906892270109)
BiGRU | DeBERTa | 148.000 | 0.00000 | (0.323529705809955, 0.41832493078152366)
BiGRU | DistilBERT | 133.000 | 0.00000 | (0.2688570801976323, 0.3570252727435441)
LSTMLM | DeBERTa | 181.000 | 0.00141 | (0.37517803244141085, 0.46864481138143294)
LSTMLM | DistilBERT | 153.000 | 0.00000 | (0.31131578957374595, 0.4019709237129674)
DeBERTa | DistilBERT | 207.000 | 0.01110 | (0.3955443431952622, 0.48530672063452507)
Table 4. Contingency table for enhanced models.
Model 1 | Model 2 | Statistic | p-Value | Confidence Interval
BiGRU (ADAM) | BiGRU (Attention) | 127.000 | 0.22774 | (0.4792592363193085, 0.597104400044328)
BiGRU (ADAM) | Hybrid BiGRU + BERT | 141.000 | 0.13899 | (0.4007746830967713, 0.5118466761265298)
BiGRU (ADAM) | Hybrid BiGRU + RoBERTa | 121.000 | 0.00000 | (0.27232675494027975, 0.3661956725003535)
BiGRU (ADAM) | Hybrid BiGRU + DeBERTa | 117.000 | 0.13573 | (0.48765376745025135, 0.6088713290748452)
BiGRU (ADAM) | Hybrid BiGRU + ALBERT | 134.000 | 0.28799 | (0.47538143894824697, 0.5908206516440875)
BiGRU (Attention) | Hybrid BiGRU + BERT | 125.000 | 0.00638 | (0.3634355216212898, 0.4754906528753545)
BiGRU (Attention) | Hybrid BiGRU + RoBERTa | 112.000 | 0.00000 | (0.24754338530870543, 0.3388440492462683)
BiGRU (Attention) | Hybrid BiGRU + DeBERTa | 133.000 | 0.85518 | (0.4477741478105329, 0.567040667004282)
BiGRU (Attention) | Hybrid BiGRU + ALBERT | 129.000 | 0.95056 | (0.43537974112738154, 0.5569279511803108)
Hybrid BiGRU + BERT | Hybrid BiGRU + RoBERTa | 150.000 | 0.00000 | (0.31923018679206183, 0.4124771302811089)
Hybrid BiGRU + BERT | Hybrid BiGRU + DeBERTa | 117.000 | 0.00250 | (0.5339274010633905, 0.6478907807547913)
Hybrid BiGRU + BERT | Hybrid BiGRU + ALBERT | 131.000 | 0.01023 | (0.5194619493666492, 0.6298886999840001)
Hybrid BiGRU + RoBERTa | Hybrid BiGRU + DeBERTa | 106.000 | 0.00000 | (0.6709043846204781, 0.7622506955934256)
Hybrid BiGRU + RoBERTa | Hybrid BiGRU + ALBERT | 103.000 | 0.00000 | (0.6689909392452787, 0.7619482872740583)
Hybrid BiGRU + DeBERTa | Hybrid BiGRU + ALBERT | 135.000 | 0.76350 | (0.43015639091530755, 0.5481044786499099)
Table 5. Summary table of enhanced model performances.
Model | Accuracy | Key Strengths | Key Weaknesses
BiGRU (Optimized with ADAM) | 84% | Faster convergence, efficient optimization | RNN-based, may struggle with long-range dependencies
BiGRU (Optimized with Attention) | 85% | Focus on important parts of the input, high performance | Increased complexity, longer training time
Hybrid BiGRU + BERT | 79% | Combines sequential and contextual learning | May not fully leverage the strengths of both models
Hybrid BiGRU + RoBERTa | 70% | Combines sequential learning with robust contextual understanding | Model integration issues, lower performance
Hybrid BiGRU + DeBERTa | 85% | Superior contextual handling, high performance | Model complexity, requires optimization
Hybrid BiGRU + ALBERT | 84% | Efficient parameter usage, good performance | May not outperform larger models
Table 6. Predicted sense.
# | Original Sentence | True Sense | Predicted Sense | Correct
0 | Ke ile ka šupa gore ga bo Mahlatse ke kae. | Location | Time | False
1 | Bomma bana le bana ba šupa dintlo tša botse. | Number seven | Number seven | True
2 | Mmane o rile ketšwa kae ka iri ya bo šupa. | Time | Time | True
3 | Maabane nna le Mogwera waka re rekile dipene tše šupa. | Number seven | Number seven | True
4 | Bo koko ba rile ke se ka šupa batho ke sa ba tsebe. | Location | Location | True
5 | Ngwana wa gešo o na le mengwaga ye šupa. | Number seven | Number seven | True
6 | Rakgolo o rile ge o bala dikgomo o bale ka go di šupa. | Location | Location | True
7 | Ka gae re na le di katse tše šupa. | Number seven | Number seven | True
8 | Ke ile ka šupa gore sekolo saka ke sefe. | Location | See | False
9 | Ke na le maeba a šupa. | Number seven | Number seven | True
Table 7. Baseline summary discussions.
Model | Accuracy | Key Insights
RNN-LSTM | 61% | Struggles with contextual understanding; lacks bidirectional processing.
BiGRU | 79% | Strong baseline; excels in bidirectional context capture.
LSTMLM | 74% | Effective for language modeling; slightly less capable than BiGRU.
DeBERTa | 70% | Transformer model with good context handling; computationally intensive.
DistilBERT | 64% | Lightweight model with lower accuracy, prioritizing efficiency over performance.
Table 8. Enhanced results.
Model | Accuracy | Key Insights
BiGRU (Optimized with ADAM) | 84% | Optimization improves gradient handling, boosting accuracy significantly.
BiGRU (Optimized with Attention) | 85% | Attention mechanism enhances focus on critical context, achieving the highest accuracy.
Hybrid BiGRU + BERT | 79% | Combines BiGRU’s efficiency with BERT’s contextual power, matching baseline BiGRU.
Hybrid BiGRU + RoBERTa | 70% | Underwhelming synergy; possible mismatch in architecture or task alignment.
Hybrid BiGRU + DeBERTa | 85% | Highly effective integration, leveraging DeBERTa’s advanced transformer capabilities.
Table 9. Benchmarking table for WSD.
Language | Model | Accuracy | F1 Score | Authors
Amharic | BiGRU | 99.99 | 92.5 | [55]
Amharic | Joint supervised, unsupervised | 86 | 83 | [22]
Portuguese | BERT | 84 | | [58]
Korean | Unsupervised (Viterbi algorithm) | 76.4 | 96 | [59]
Arabic | BERT | | | [59]
Ge’ez | Naive Bayes, decision trees, random forests, K-nearest neighbor, logistic regression, linear support vector machine | 99.52 | 92.5 | [43]
Hindi | Word2Vec | 58 | 57 | [60]
