2. Related Work
The researchers [7] developed a word sense disambiguation system for Marathi by employing the Marathi WordNet and the Lesk technique. They used synset verification, implemented with the Lesk algorithm, which produces accurate results in many cases but is less effective in others. The similarity statistics are not particularly reliable because the morphology of Indian languages varies greatly, and the authors evaluated the method only on a limited set of test terms. A further limitation of the chosen approach is that the algorithm considers only noun terms.
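Since the Lesk technique recurs throughout the literature reviewed here, a minimal sketch may help clarify it. The example below uses NLTK's Princeton WordNet purely for illustration (the Marathi WordNet used in [7] is not assumed to be available through NLTK); the gloss-overlap scoring is the standard simplified Lesk idea, not the authors' exact implementation.

```python
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def simplified_lesk(target_word, context_words):
    """Pick the sense of target_word whose gloss overlaps most with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(target_word):
        # The gloss plus example sentences form the signature of this sense.
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature.update(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Example: disambiguate "bank" in a financial context.
sense = simplified_lesk("bank", "I deposited cash at the bank yesterday".split())
print(sense, "-", sense.definition() if sense else "no sense found")
```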
For supervised medical word sense disambiguation, the researchers [11] presented a novel deep neural network architecture based on a layered bidirectional LSTM network, upon which max-pooling over many time steps is performed to produce a dense representation of the context. To determine the best input form for the max-pooling layer, four further variations of the LSTM's output were examined. Additionally, the researchers trained a "universal" network to jointly disambiguate all the target ambiguous terms; in the universal network, the ambiguous word's embedding is concatenated to the max-pooled vector as a "hint" layer. According to the results, the universal network achieves around 90% test accuracy [11].
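A compact PyTorch sketch of this kind of architecture is shown below: a layered BiLSTM, max-pooling over time, and a "hint" layer that concatenates the target word's embedding before classification. The layer sizes, the use of a shared embedding table for the hint, and the toy smoke test are assumptions for illustration, loosely following the description in [11] rather than reproducing it.

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolWSD(nn.Module):
    """Sketch of a BiLSTM sense classifier with max-pooling over time and a 'hint' layer."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_senses=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Max-pooled BiLSTM states plus the target word embedding feed the classifier.
        self.classifier = nn.Linear(2 * hidden_dim + embed_dim, num_senses)

    def forward(self, context_ids, target_id):
        states, _ = self.lstm(self.embed(context_ids))   # (batch, seq, 2*hidden)
        pooled, _ = states.max(dim=1)                     # max over time steps
        hint = self.embed(target_id)                      # (batch, embed_dim)
        return self.classifier(torch.cat([pooled, hint], dim=-1))

# Tiny smoke test with random token ids.
model = BiLSTMMaxPoolWSD(vocab_size=1000)
logits = model(torch.randint(0, 1000, (4, 12)), torch.randint(0, 1000, (4,)))
print(logits.shape)  # torch.Size([4, 2])
```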
The researcher [10] proposed a creative strategy for addressing WSD in a low-resourced language with the assistance of a high-resource language. To enhance performance and increase the efficacy of WSD, the researchers employed a modified version of the Lesk algorithm driven by a Word2Vec model trained on a high-resource language. Although the experiment was conducted on the low-resourced Assamese language, the technique works with other low-resourced languages as well. The agglutinative nature of Assamese, an Indian language with rich morphology, makes the problem hard. By creating and utilizing a word extractor for Assamese to provide the tokens during processing, the researchers were able to significantly enhance performance. The results show that it is possible to distinguish between distinct senses of the same word using this approach.
The researchers [12] give an example of how Urdu semantic labeling techniques might be developed and assessed using the suggested corpus. The corpus has 8000 tokens spread over the following genres or domains: Wikipedia, social media, news, and historical texts (each with 2000 tokens). Using the USAS (UCREL Semantic Analysis System) semantic taxonomy, which offers a comprehensive set of semantic fields for coarse-grained annotation, the corpus has been manually annotated with 21 major semantic fields and 232 sub-fields. From the corpus, the researchers extracted local, topical, and semantic features and then applied seven distinct supervised multi-target classifiers to them. The findings indicate an accuracy of 94% for coarse-grained semantic-domain annotation on the suggested corpus [12].
The method proposed by [10] uses a Word2Vec model trained on a high-resource language to power a modified version of the Lesk algorithm, which addresses WSD through a lexical dictionary-based approach. In addition to eliminating stopwords, the researcher improved efficiency by creating and utilizing a Word-Extractor that supplies tokens with data from the Assamese WordNet, even in cases where the original word form is not present. The solution is fully automated. The experiment's findings support the viability of employing this method to distinguish between senses of the same term. When used independently as a plug-in alongside other natural language processing pipelines for low-resourced languages, this WSD paradigm can greatly enhance processing for other low-resourced languages.
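A rough sketch of the idea of replacing Lesk's exact gloss overlap with embedding similarity is shown below. The gensim KeyedVectors interface, the placeholder vector file, and the averaging of word vectors are assumptions for illustration; the actual Word2Vec model, Word-Extractor, and Assamese WordNet resources of [10] are not reproduced here.

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

# Hypothetical pre-trained vectors from a high-resource language (path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("high_resource_vectors.bin", binary=True)

def avg_vector(words):
    """Average the embeddings of in-vocabulary words; None if nothing matches."""
    vecs = [vectors[w] for w in words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def embedding_lesk(target_word, context_words):
    """Score each sense by cosine similarity between context and gloss embeddings."""
    ctx = avg_vector(context_words)
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(target_word):
        gloss = avg_vector(sense.definition().lower().split())
        if ctx is None or gloss is None:
            continue
        score = float(np.dot(ctx, gloss) /
                      (np.linalg.norm(ctx) * np.linalg.norm(gloss)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```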
The supervised WSD technique applies machine-learning algorithms to manually sense-annotated data or semantically annotated corpora, inducing classification models that determine the appropriate sense for each specific context [13,14]. According to [15], supervised learning techniques organize structured data into an annotated training corpus, and their primary drawback is the reliance on a substantial amount of manually annotated data. Neural networks, K-Nearest Neighbors (KNN), Support Vector Machines, Decision Trees, and the Naïve Bayes classifier are some of the computational intelligence approaches used for supervised classification. Sarmah and Sarma (2016) [16] employed the supervised Naïve Bayes classifier for an autonomous disambiguation task. They created training data with sense-annotated features using 160 ambiguous terms from WordNet and the Assamese Corpus, and their results showed an accuracy of 71%.
The researchers [17] presented two deep learning-based models for supervised WSD: a model based on a bi-directional long short-term memory (BiLSTM) network and an attention model based on a self-attention architecture. The results demonstrate that, on the MSH WSD dataset, the BiLSTM neural network model with an appropriate upper-layer structure outperforms the current state-of-the-art models, while the attention model, with good accuracy, outperformed the BiLSTM model by a factor of three or four. Furthermore, the researchers [17] trained "universal" models to jointly disambiguate all ambiguous words, concatenating the target ambiguous word's embedding to the max-pooled vector as a "hint". According to the results, the universal BiLSTM neural network model achieved an accuracy of almost 90% [17].
Automatic annotators annotate the data to create the training set for the supervised classifier, although they lack subjectivity and overlook the semantics of the underlying textual structures. The researchers [18] considered domain-specific aspects of the annotation process to create an automated annotation system that is both scalable and semantically rich. Using distributional semantic models (LSA and Word2Vec) to supplement a novel bootstrapping algorithm, the authors [18] developed an improved method for automatically annotating Tweets. The suggested algorithm was tested on 12,000 crowdsourced annotated Tweets and produced an accuracy of 68.56%, higher than the baseline.
The goal of the research in [19] is to ascertain the meaning of an ambiguous word by addressing the issue of word sense disambiguation in the Arabic language. To model the problem, the researchers [19] use supervised sequence-to-sequence learning. For Arabic word sense disambiguation, the researchers presented recurrent neural network models, BERT-based models, and combined POS-BERT models. The POS-BERT method yields an accuracy of 96.33% [19].
The researchers [20,21] have demonstrated strong performance in addressing this issue by thoroughly examining supervised machine-learning techniques in this field. The scholars [20,21] examined four approaches that incorporate pre-trained word embeddings as features for training two supervised machine learning models: Naïve Bayes (NB) and Support Vector Machines (SVM). One of the training features is the context of the target abbreviation, applied to 500 sentences for each of 13 abbreviations taken from public clinical note datasets from Fairview Health Services, a Twin Cities organization affiliated with the University of Minnesota (UMN). Their findings demonstrated that SVM outperforms NB in all four strategies; the model with the highest accuracy, which was pre-trained using texts from PubMed, Wikipedia, and PMC (PubMed Central), had a 97.08% accuracy rate.
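A minimal sketch of the general pattern described above, training an SVM on averaged pre-trained word embeddings of the target abbreviation's context, is given below. The gensim/scikit-learn calls, the placeholder vector file, and the toy data are assumptions for illustration, not the setup used in [20,21].

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import SVC

# Hypothetical pre-trained embeddings (file path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

def context_features(sentence):
    """Represent a sentence by the mean of its in-vocabulary word vectors."""
    vecs = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

# Toy sense-annotated examples for one ambiguous abbreviation (illustrative only).
sentences = ["patient history of ms multiple sclerosis",
             "ms was administered morphine sulfate"]
senses = ["multiple_sclerosis", "morphine_sulfate"]

X = np.stack([context_features(s) for s in sentences])
clf = SVC(kernel="linear").fit(X, senses)
print(clf.predict([context_features("ms lesions on mri")]))
```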
The work on word sense disambiguation (WSD) in Bengali, one of the Indian languages with fewer resources, is presented in [22]. The work proceeds in two phases. In the first phase, four popular approaches that are frequently used for word sense disambiguation, namely Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naïve Bayes (NB), are examined using conventional methodologies, with appropriate adjustments made during application to obtain the intended effects. The outcomes of the first experiments are then used to propose a combined strategy in the second phase. The accuracy rates of these baseline techniques were 63.84%, 76.9%, 76.23%, and 80.23%, respectively [22].
Words in the Punjabi language have been disambiguated using the Naïve Bayes supervised classifier. Building supervised machine-learning models requires a thorough understanding of the feature-extraction procedure. Bag-of-Words (BoW) and collocation models are used independently in the proposed Punjabi WSD system to extract pertinent features. While the collocation model uses the two words before and the two words after the target word as features, the BoW model uses all words surrounding the target word. The same training dataset was used to construct both models. It has been noted that the Naïve Bayes algorithm's performance is greatly influenced by the choice of smoothing parameter. Tests for this work were conducted on 150 of the most ambiguous noun words taken from Punjabi WordNet [23].
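The difference between the two feature schemes can be made concrete with a short sketch; the function names and the fixed ±2 window follow the description above rather than the exact implementation of [23].

```python
def bow_features(tokens, target_index):
    """Bag-of-Words: every word around the target, regardless of position."""
    context = tokens[:target_index] + tokens[target_index + 1:]
    return {f"bow={w}": 1 for w in context}

def collocation_features(tokens, target_index, window=2):
    """Collocation: the two words before and after the target, keyed by position."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            feats[f"w[{offset:+d}]={tokens[pos]}"] = 1
    return feats

tokens = "he sat on the bank of the river".split()
print(bow_features(tokens, tokens.index("bank")))
print(collocation_features(tokens, tokens.index("bank")))
```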
The NB algorithm is a probabilistic model that applies the Bayes rule and assumes conditional independence of the features given the class label; it has been used considerably, and with success, for WSD tasks [13,24]. The NB approach categorizes text documents using two quantities: the probability of each sense $S_i$ of a word $w$ and the conditional probability of the features $f_j$ in the context given that sense [14]. The appropriate sense in the context is the one that maximizes the following expression [14]:

$$\hat{S} = \underset{S_i}{\arg\max}\; P(S_i) \prod_{j=1}^{m} P(f_j \mid S_i)$$

In this formula, $m$ denotes the number of features, $P(S_i)$ is computed from the co-occurrence frequency of the sense in the training set, and $P(f_j \mid S_i)$ is computed from the frequency of the feature in the presence of the sense.
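As a concrete illustration of the formula, the short sketch below scores senses from toy probability tables in log space; the probabilities and the smoothing constant are hypothetical and not taken from any of the cited studies.

```python
import math

def naive_bayes_sense(context_features, priors, likelihoods, smoothing=1e-6):
    """Return argmax_Si P(Si) * prod_j P(fj | Si), computed in log space."""
    best_sense, best_score = None, float("-inf")
    for sense, prior in priors.items():
        score = math.log(prior)
        for feature in context_features:
            score += math.log(likelihoods[sense].get(feature, smoothing))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy model for the ambiguous word "bank" (illustrative numbers only).
priors = {"river_bank": 0.4, "financial_bank": 0.6}
likelihoods = {
    "river_bank": {"water": 0.30, "fishing": 0.20, "money": 0.01},
    "financial_bank": {"money": 0.35, "loan": 0.25, "water": 0.01},
}
print(naive_bayes_sense(["money", "loan"], priors, likelihoods))  # financial_bank
```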
Researchers [25] describe how, after making the required adjustments, a supervised methodology was used for the task of word sense disambiguation in Bangla. When applied to a database of the 19 most frequently used ambiguous Bangla words, the Naïve Bayes probabilistic model, selected as a baseline method for sense classification, initially produces modest results with 81% accuracy. Two modifications are made to the baseline method: (1) the lemmatization process is integrated into the system, and (2) the operational process is bootstrapped. As a result, the approach's accuracy increases to 84%, which is encouraging for the disambiguation process overall since it allows the current method to be refined further to obtain better results.
Using supervised approaches, the researchers [26] explicitly investigated a WSD system for the Punjabi language. A sense-tagged corpus of 150 ambiguous Punjabi noun terms was prepared manually. The work investigated the following techniques: Decision List, Decision Tree, Naïve Bayes, K-Nearest Neighbor (K-NN), Random Forest, and Support Vector Machines (SVM). The classifiers employed unigram, bigram, collocation, co-occurrence, and syntactic count-based features. From unlabeled Wikipedia text, the semantic characteristics of Punjabi were derived using word2vec CBOW and skip-gram shallow neural network models. For the WSD of Punjabi words, two additional deep-learning neural network classifiers, multilayer perceptrons and long short-term memory networks, were used. Using the word-embedding features, the LSTM classifier attained an accuracy of 84% [26].
Unsupervised methods organize vast amounts of textual material by processing unstructured semantic information [15]. Using training data without precise annotation, unsupervised techniques circumvent the limitations of supervised techniques and analyze unstructured semantic information in the context of enormous volumes of textual data [15]. Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and PageRank approaches are examples of dimension-reduction or clustering techniques used to identify semantic concepts. One drawback of these techniques is that they do not demarcate context groups [15].
An innovative approach to identifying the latent meaning that connects a sentence's words has been proposed by researchers [27]. A graph is employed by the researchers [27] to uncover this implicit information, which is then applied to disambiguate the unclear word. The results of the studies demonstrate that the suggested algorithm correctly identifies the sense of both homonyms and polysemous words. The suggested technique demonstrated an accuracy of 79.6%, which is 2.5% better than the best unsupervised approach in SemEval-2007, and it outperformed the approaches presented in the SemEval-2013 word sense disambiguation task.
The researchers [28,29] developed an unsupervised graph-based system for the Hindi word sense disambiguation challenge, with an emphasis on word sense ambiguity. The suggested method produces a weighted graph, with nodes standing for the meanings of words that occur in the context of ambiguous phrases and edges for the relationships between them. It employs a random-walk-style approach to determine which sense of a polysemous word is best suited to a particular context and leverages semantic similarity computed from Hindi WordNet to weight the edges. Twenty polysemous nouns from a sense-annotated dataset were used for the evaluation. The researchers [28,29] reported a higher overall accuracy of 63.39% than previously published studies using the same dataset.
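A compact sketch of the graph-based random-walk idea appears below: sense nodes are connected by similarity-weighted edges and scored with PageRank, a standard random-walk ranking. The networkx calls and the English WordNet path similarity stand in for the Hindi WordNet resources used in [28,29].

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def rank_senses(context_lemmas, target_lemma):
    """Build a sense graph with similarity-weighted edges and rank the target's senses."""
    lemmas = context_lemmas + [target_lemma]
    sense_of = [(lemma, s) for lemma in lemmas for s in wn.synsets(lemma, pos=wn.NOUN)]
    graph = nx.Graph()
    graph.add_nodes_from(s for _, s in sense_of)
    # Connect senses of *different* words; edge weight = path-based similarity.
    for i, (la, sa) in enumerate(sense_of):
        for lb, sb in sense_of[i + 1:]:
            if la == lb:
                continue
            sim = sa.path_similarity(sb)
            if sim:
                graph.add_edge(sa, sb, weight=sim)
    scores = nx.pagerank(graph, weight="weight")  # stationary random-walk scores
    return max(wn.synsets(target_lemma, pos=wn.NOUN), key=lambda s: scores.get(s, 0.0))

print(rank_senses(["river", "water"], "bank"))
```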
An innovative unsupervised graph-based method for Hindi word sense disambiguation has been put into practice by the researchers [30]. The researchers use a random walk on a graph built specifically for each occurrence to help with disambiguation. The various meanings of the words that appear in the context of the ambiguous word are represented as nodes in the graph, and edge weights are determined by the semantic similarity of the two nodes they connect. Two path-based similarity metrics are compared by the researchers; according to the experimental results, the Leacock–Chodorow similarity measure outperforms the shortest-path measure. The researchers [30] reported an average accuracy of 72.09% over all five cases of polysemous nouns.
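Both path-based measures compared in [30] are available for the English WordNet in NLTK, which makes the comparison easy to illustrate; the Hindi WordNet values used in the study would of course differ.

```python
from nltk.corpus import wordnet as wn

river, water = wn.synset("river.n.01"), wn.synset("water.n.01")

# Shortest-path similarity: inversely related to the path length between synsets.
print("path similarity:", river.path_similarity(water))

# Leacock-Chodorow: -log(path_length / (2 * taxonomy depth)); same part of speech only.
print("LCH similarity:", river.lch_similarity(water))
```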
Using the masked-language-model task of pre-trained language models, the researchers [31] present a novel unsupervised technique for HowNet-based Chinese WSD. Considering the current evaluation dataset to be small and outdated, the researchers construct a new and larger HowNet-based WSD dataset for their studies. According to the experimental results, the model performs noticeably better than all the baseline techniques.
The impact of word embeddings on developing an unsupervised Arabic sense inventory is presented by the researchers [32]. Three pre-trained embeddings are examined to explore their impact on the resultant sense inventory and their effectiveness in word sense disambiguation for Arabic contexts. A fully unsupervised technique based on a graph-based word sense induction algorithm is used to create the sense inventories. According to the findings, the Aravec-Twitter inventory obtains the best result, an accuracy of 0.47 for 50 neighbors, and for 200 neighbors it is almost as accurate as the Fasttext inventory [32].
The knowledge-based approach relies on diverse knowledge sources such as machine-readable dictionaries (e.g., WordNet) or sense inventories and uses information explicitly articulated in the form of rules or lexicons [15]. Knowledge-based schemes are categorized by the type of resource they use, such as Machine-Readable Dictionaries (MRDs), thesauri, and computational lexicons or lexical knowledge bases [13,15].
A knowledge-based method for word sense disambiguation (WSD) has been introduced for the Bengali language. The Bengali Text Corpus, created as part of the Government of India's Technology Development for Indian Languages (TDIL) initiative, provided the input dataset, and Bengali WordNet, a knowledge base created at ISI Kolkata, served as the lexical resource [33]. The suggested method determines the precise meaning of an ambiguous Bengali word by finding the largest overlap between the dictionary definition of the word, the definitions of the terms that collocate with it in the sentence, and the synonyms of those collocating words. Nine frequently used ambiguous Bengali terms were selected to test the system, and 75% of the output is accurate [33].
To address the issue of inadequate usage of existing knowledge bases, the researchers [34] presented a word sense disambiguation method based on graphs and knowledge bases. It builds the disambiguation graph by processing samples in the lexical knowledge base that have a strong sense-differentiation capability and uses dependency parsing to gather contextual knowledge. Disambiguation is then completed by merging the contextual and dependency disambiguation graphs. Tests conducted on the SemEval-2007 task #5 dataset show a disambiguation accuracy of 47%, surpassing that of the previously listed techniques.
The knowledge-based approach is a compromise between the supervised and unsupervised methods and makes use of WordNet, ontologies, and manually built lexical databases. Research indicates that knowledge-based applications are a viable substitute for supervised systems [15]. Thanks to sophisticated graph-based methods, knowledge-based disambiguation has become more relevant in NLP [35].
An innovative knowledge-based sense disambiguation (KSD) technique was proposed by researchers [36] to address the issue of lexical ambiguity in question-answering (QA) systems. The main contribution is the suggested process, which combines several knowledge sources, building a shallow NLP pipeline from the question's metadata (date/GPS), context information, and a domain ontology. The suggested KSD approach evolved into a dedicated tool for a mobile QA application that seeks to ascertain the pilgrims' intended meanings when they ask questions. The experimental findings demonstrate that, within the pilgrimage domain, the approach achieved accuracy performance that was equivalent and even superior to the baselines.
A knowledge-based coarse-grained sense disambiguation technique based on selectional preferences defined by topic models is presented by the researchers [37]. Against three competitive baselines, the method's overall accuracy of 83% is a substantial improvement [37]. To develop a knowledge base, the researchers [38] studied the learning of semantic class-level selectional preferences (SP). First, the noun taxonomy of the Semantic Knowledge-base of Contemporary Chinese (SKCC) is modified for SP acquisition. Second, an MDL-based tree-cut model is implemented. Third, the SP presented in SKCC serves as the gold-standard test set used to assess the performance of SP acquisition. The experiments investigate verb–object, verb–subject, and adjective–noun relations as three types of predicate–argument relations. The top-three relaxed accuracy for the verb–object relation is 75.26%, while the top-one strict accuracy is 24.74% [38].
Researchers [21,39,40] offer a knowledge-based method for word sense disambiguation that can identify a term's correct sense in a particular context by utilizing a variety of semantic similarity metrics. The studies demonstrate that, when using WordNet-based similarity measurements, the technique came very near the performance obtained with semantic measures based on word embeddings. Additionally, using real-world data, the researchers created a small dataset in which the annotators' feedback allowed them to distinguish between terms that were ambiguous and those that were merely unclear. Lastly, an analysis of a state-of-the-art dataset, including linguistic factors, was conducted to support and explain the effectiveness of the approach; it showed that texts with high noun and adjective ratios and high lexical-richness scores correlate with improved WSD performance.
Researchers [41] examined Hindi WSD using a knowledge-based methodology. To resolve the ambiguity of words, word knowledge from external knowledge sources is incorporated. In this work, the researchers attempted to create a WSD tool for Hindi by taking a knowledge-based approach built on the Hindi WordNet. The program employs the knowledge-based Lesk algorithm for Hindi WSD, and the suggested approach provides an accuracy of roughly 71.4%.
For Persian WSD, the researchers [42] suggest a new knowledge-based technique. Each document's topics are retrieved using a pre-trained LDA model, and each ambiguous content word is assigned to a topic. For each possible sense s of a particular word w, the study measures how similar the words of w's assigned topic are to the gloss of s in FarsNet (the Persian WordNet). The sense with the highest score is then selected as the most likely one. According to an evaluation conducted on a Persian all-words WSD dataset, the method attains state-of-the-art performance compared with other knowledge-based methods [42].
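The scoring step described above can be sketched compactly: the words of the assigned topic are compared with each sense gloss, and the best-scoring sense wins. The overlap scoring, the English WordNet glosses, and the hard-coded topic words below are illustrative stand-ins for the LDA topics and FarsNet glosses used in [42].

```python
from nltk.corpus import wordnet as wn

def disambiguate_by_topic(word, topic_words):
    """Pick the sense whose gloss shares the most words with the assigned LDA topic."""
    topic = set(w.lower() for w in topic_words)
    best_sense, best_score = None, -1
    for sense in wn.synsets(word):
        gloss_words = set(sense.definition().lower().split())
        score = len(gloss_words & topic)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Pretend LDA assigned "bank" to a finance-flavored topic.
topic_words = ["money", "loan", "credit", "deposit", "financial"]
print(disambiguate_by_topic("bank", topic_words))
```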
Based on a distributional semantic space, the researchers [43] suggested a hybrid supervised and unsupervised method for Amharic sense disambiguation. Ten ambiguous words in total are used to evaluate each strategy and the combined approach. The supervised technique obtained an F1-score of 82.3% and an accuracy of 70%, whereas the unsupervised approach obtained an F1-score of 85.7% and an accuracy of 60%. The combined method attained an accuracy of 86% and an F1-score of 92.5%.
Utilizing additional context-based information on ambiguous terms, researchers [44] offer a scoring system that allows for knowledge-based resolution of ambiguity. To achieve this, the researchers assembled two lists of terms: words around the target word chosen from the corpus, and related words of a sense extracted from WordNet. Using the suggested strategy, an accuracy of 80.1% was ultimately attained on the TWA corpus, which is encouraging compared with the outcomes of previous studies using comparable techniques.
Using the public MSH WSD dataset as the test set, researchers [45] use knowledge-based techniques that take advantage of recent developments in neural word/concept embeddings to outperform the state of the art in biomedical WSD. To obtain concept vectors, the researchers [45] use MetaMap 2016V2, an existing concept-mapping tool. The accuracy of the linear-time technique is 92.24%, a 3% gain over unsupervised methods. Their more costly method, which uses a nearest-neighbor structure, achieves an accuracy of 94.34%, effectively halving the error rate. The work demonstrates that biomedical WSD is not an exception to the current trend of many language-processing tasks benefiting from dense vector representations learned from unlabeled free text.
A new word sense disambiguation method based on the semantic relations of the lexical database PolyWordNet was created by researchers [46]. Unlike contextual-overlap-count knowledge-based word sense disambiguation methods, the algorithm does not count word overlaps between the glosses of the context and sense bags. Rather, the algorithm looks for connections between the senses of the target word and the paths or links of context terms, recording every path or link that joins a sense of the target word with a context term. The program then counts the number of paths, links, or connections for every linked sense. The technique using PolyWordNet has an accuracy of 96.11%, which is higher than the alternative contextual-overlap-count WSD approach using Princeton WordNet, which has an accuracy of 58.33%.
A deep learning-based methodology for constructing an RDF-based ontology from unstructured text was proposed by researchers [47]. Using databases of newspaper articles, the researchers assess the suggested model by developing a general-knowledge ontology. The suggested model includes a relation-extraction model and a novel implementation of the RDF mapping technique, and it is built on the foundation of transformer architectures and natural language processing. The model's capacity to resolve the word sense disambiguation issue is its primary selling point, and it demonstrated strong performance with exceptionally high accuracy ratings.
ShotgunWSD is an unsupervised knowledge-based algorithm for global word sense disambiguation (WSD), modeled on the Shotgun sequencing technique, a popular whole-genome sequencing method. ShotgunWSD 2.0 is an enhanced version of the tool that keeps the fundamental process of building local sense configurations in place while computing the relatedness score between two word senses with a new method. To create a sense bag for every sense, the researchers [48] gather all the words from the associated WordNet synset, its gloss, and the related synsets. They then use a common word-embedding framework to embed all the words from all the sense bags across the document in a vector space. To determine the sense embedding for a specific word sense, the researchers [48] take the median of all the remaining word embeddings in that sense bag. On six benchmarks, SemEval 2007, Senseval-2, Senseval-3, SemEval 2013, SemEval 2015, and overall (uniform), the researchers compare the enhanced ShotgunWSD algorithm (version 2.0) with its prior version (1.0) as well as numerous state-of-the-art unsupervised WSD techniques. They show that ShotgunWSD 2.0 outperforms several other recent unsupervised or knowledge-based methods, as well as ShotgunWSD 1.0. According to the study's paired McNemar's significance tests at a significance level of 0.01, the improvements of ShotgunWSD 2.0 over ShotgunWSD 1.0 are, for the most part, statistically significant.
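A minimal sketch of the sense-bag construction and median-based sense embedding described above is given below; the random toy embeddings stand in for the pre-trained vectors used by [48], and only the immediate hypernyms and hyponyms are taken as "related synsets".

```python
import numpy as np
from nltk.corpus import wordnet as wn

# Toy embedding lookup (random vectors keyed by word) standing in for real pre-trained vectors.
rng = np.random.default_rng(0)
def embed(word, _cache={}):
    if word not in _cache:
        _cache[word] = rng.normal(size=50)
    return _cache[word]

def sense_bag(sense):
    """Words from the synset lemmas, its gloss, and directly related synsets."""
    words = {l.name() for l in sense.lemmas()}
    words.update(sense.definition().lower().split())
    for related in sense.hypernyms() + sense.hyponyms():
        words.update(l.name() for l in related.lemmas())
    return words

def sense_embedding(sense):
    """Element-wise median of the embeddings of all words in the sense bag."""
    return np.median([embed(w) for w in sense_bag(sense)], axis=0)

for s in wn.synsets("bank")[:2]:
    print(s.name(), sense_embedding(s)[:3])
```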
To categorize WSD algorithms, the researchers [49] examined machine-learning and knowledge-based approaches. Every category is thoroughly analyzed, and the algorithms that correspond to it are explained. The research examines a range of WSD strategies and resources, covering publications from a variety of journals and discussing current research directions as well as competitions and trends in the field.
The related literature reviewed above discusses the general efficacy of knowledge-based, supervised, and unsupervised approaches. Table 1 lists particular methods or algorithms in each of these categories, along with information about how well they performed and in which situations.
Models can be adapted to low-resource languages using WSD approaches that take advantage of transfer learning from high-resource languages. Even with little data, methods such as cross-lingual models and multilingual embeddings help transfer knowledge between languages and improve performance. Low-resource languages can benefit from fine-tuning pre-trained models trained on high-resource languages (e.g., multilingual BERT), which provides a foundation and increases accuracy with less data; a minimal sketch of this route follows the next paragraph. In resource-limited environments, methods that make use of both labeled and unlabeled data can also be successful: semi-supervised learning can improve WSD performance by exploiting large amounts of unlabeled data to accelerate learning. One of the main obstacles to training and evaluating WSD models in low-resource languages is the absence of substantial, high-quality annotated corpora, and the presence of multiple dialects in low-resource languages makes data collection and model training even more difficult.
The training of complex models necessitates large computational resources, which are scarce in environments centered on low-resource languages. The lack of common assessment datasets and standards for low-resource languages makes evaluating and contrasting the effectiveness of various WSD techniques challenging. Low-resource languages may also have distinctive or sophisticated linguistic characteristics (such as morphology or syntax) that are poorly represented in current models, which can affect WSD performance. By addressing these weaknesses and capitalizing on the advantages, WSD for low-resource languages can improve significantly, resulting in more inclusive and efficient natural language processing systems.
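As a concrete illustration of the fine-tuning route mentioned above, the sketch below frames WSD for a low-resource language as sequence classification on top of multilingual BERT using the Hugging Face transformers library; the label set, toy data, and hyperparameters are placeholders rather than a recommendation drawn from the reviewed studies.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual BERT as a starting point for a low-resource target language.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy sense-annotated examples for one ambiguous word (placeholder data).
texts = ["She sat by the bank of the river.", "He opened an account at the bank."]
labels = torch.tensor([0, 1])  # 0 = river_bank, 1 = financial_bank

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few fine-tuning steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print("final loss:", out.loss.item())
```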