1. Introduction
Morphological disambiguation (MD) is a long-standing problem in processing text for morphologically complex languages. It is similar to part-of-speech (POS) tagging [1]; however, for MD, not only the POS tag but also the lemma/root and its corresponding morphological tags must be correctly predicted.
Because an analysis involves several components, such as the root form, the part-of-speech tag, and the morpheme chain, treating MD as a straightforward tagging task leads to a very large tagset with sparse data. To address this issue, several strategies have been proposed. A common approach is to decompose the tag sequence into smaller, more manageable segments. For example, in the HMM-based method proposed by Hakkani-Tur et al. [2], the analysis is broken down into smaller parts called inflectional groups. This method operates under the assumption that the tags in the current analysis depend only on the previous one, which simplifies disambiguation. However, this assumption has significant drawbacks. First, it limits the model’s ability to capture long-term dependencies, which are essential for this type of task. Second, despite the decomposition, the overall tagset remains quite large, making the approach less efficient.
To alleviate this issue, a voted perceptron approach was proposed in [3]; instead of using a certain part of the analysis, the author proposed a set of features and sought to represent a sequence of analyses with feature vectors. The approach uses trigram decoding, which relaxes the previous assumption and, compared to the HMM-based approach, may better capture longer dependencies between tags. The underlying hypothesis is that the model maximizes an objective function that ensures the feature vectors from the correct path of analyses obtain larger values than those from incorrect paths. This is a discrete feature-based approach, which requires manual feature engineering, and it still cannot capture long-term dependencies.
To address this issue, a deep learning-based approach was proposed [4], which is currently considered the best for this task. The authors segment an analysis into (i) the root, (ii) its POS, and (iii) the morpheme chain (MC), then use a nonlinear layer to calculate a dense representation for the analysis. To capture long-term dependencies, a bidirectional long short-term memory (LSTM) network is applied for context learning from sentences. Another character-level LSTM is used to capture word-internal features. Combining the fine-grained information from characters with the contextualized representation obtained for each word, and guided by the intuition that the correct analysis should be most similar to the context’s representation, a dot product of the two representations is computed to perform disambiguation. One disadvantage of this approach is that it uses a binary vector to calculate analysis embeddings, which are therefore not in a continuous space.
Another limitation is that it relies on a biLSTM to compute the contextual representation, which may not capture the complexities of language use in syntactic and semantic contexts as effectively as embeddings generated from large language models. With the advent of large language models (LLMs) [5,6,7], many results for downstream tasks of natural language processing (NLP) [8,9] have improved significantly. However, the impact of pre-trained large language models on the performance of MD remains unclear.
In this paper, a contrastive learning approach for MD using an LLM is presented. A contrastive loss function is introduced for training, which reduces the distance between the correct analysis and contextual embeddings while maintaining a margin between correct and incorrect embeddings. One aim of this work is to analyze the effectiveness of LLMs on MD and how the knowledge encoded in LLMs can be transferred to low-resource languages to improve model performance. Another aim is to experimentally analyze the effectiveness of different distance measurements for morphological disambiguation by calculating the distance between the context and morphological analysis embeddings via the contrastive loss function. Experimental results show that the model incorporating knowledge from an LLM gives better results without using any designed internal or external features. The comparison of distance measurements shows that the Euclidean distance and dot product yield better outcomes than the others. The results also show that LLMs are useful in enhancing the performance of MD for various morphologically complex languages (MCLs), especially for low-resource languages.
The paper is organized as follows: (i) Section 1 introduces the task and the purpose of this work; (ii) Section 2 describes the existing work related to MD; (iii) Section 3 details the proposed model; (iv) Section 4 reports the experimental results; (v) Section 5 discusses the most common error cases with statistics; and (vi) Section 6 concludes the paper.
2. Related Work
A morphological analyzer generates a set of morphological analyses for an input word, and the MD task selects the correct analysis among these candidates depending on the context. Morphological disambiguation has been studied extensively over the past decades, especially for agglutinative languages like Turkish and Kazakh.
Approaches for morphological ambiguity resolution can be categorized into three problem groups as follows: (i) sequence labeling problem; (ii) morphological disambiguation problem; (iii) sequence-to-sequence (Seq2Seq) problem.
Sequence labeling is a type of problem in NLP where the goal is to assign a categorical label to each token in a sequence of tokens. This type of problem is essential in various NLP tasks, such as POS tagging [10] and named entity recognition (NER) [11]. In a sequence labeling problem, an input sequence $X = (x_1, \dots, x_n)$ of length $n$ is given, and the task is to predict a corresponding sequence of labels $Y = (y_1, \dots, y_n)$. Each label $y_i$ corresponds to an input element $x_i$, and the labels are drawn from a predefined set of categories. MD can be treated as a sequence labeling problem, similar to POS tagging. However, MD is more complex than POS tagging because its labels include the root, POS tag, and a sequence of morpheme tags.
There are two ways of performing MD as sequence labeling: (i) treat each morphological analysis as a label; usually they are drawn from the training as a predefined set of categories. Since a morphological analysis contains the root, POS tag, and a sequence of morpheme tags, it results in a large number of unique labels. It not only increases label data sparsity but also leads to a potential issue of out-of-vocabulary for a label. (ii) To avoid this issue, most of the sequence labeling approaches for MD use a multi-class and multi-label model to predict different morphological categories. For each morphological category, there is a separate classifier.
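The label-sparsity argument above can be illustrated with a small sketch. The analyses below are invented toy data, and the three-part tuple layout (root, POS, morpheme tags) is an assumption made only for illustration; it is not the format used by any of the cited systems.

```python
# Toy analyses as (root, POS, morpheme tags); all values are illustrative only.
train_analyses = [
    ("kitap", "Noun", ("Pl", "Nom")),
    ("kitap", "Noun", ("Sg", "Acc")),
    ("ev", "Noun", ("Pl", "Nom")),
    ("ev", "Noun", ("Sg", "Dat")),
    ("gel", "Verb", ("Past", "A3sg")),
]

# Option (i): each complete analysis is a single atomic label.
full_labels = set(train_analyses)

# Option (ii): a separate, smaller label inventory per category.
roots = {root for root, _, _ in train_analyses}
pos_tags = {pos for _, pos, _ in train_analyses}
morph_tags = {tag for _, _, tags in train_analyses for tag in tags}

# An analysis never seen as a whole, but built only from parts seen in training.
unseen = ("ev", "Noun", ("Sg", "Acc"))
oov_as_full_label = unseen not in full_labels
root, pos, tags = unseen
covered_by_parts = (
    root in roots and pos in pos_tags and all(tag in morph_tags for tag in tags)
)

print(oov_as_full_label, covered_by_parts)  # True True
```

Under option (i) the unseen analysis is an out-of-vocabulary label, whereas under option (ii) every component can still be predicted by its own classifier.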
In [12], the authors proposed different models with different architectures for morphological tagging. In their multiclass, multilabel (McMI) model, which predicts POS, different morphological categories are employed separately as the output of the model; however, they share an input layer.
Since morphological tagging involves tag sensitivity (one tag may depend on previous tags), the authors also proposed a hierarchical multiclass, multilabel model (HMcMI) to capture this information; in this architecture, only the sensitivity to POS tags is considered. They further tested a sequence model (Seq), which feeds each word in a sentence to a long short-term memory (LSTM) [13] network to obtain a context vector, from which the decoder generates category-value pairs.
The authors take a multiclass model (MC) as their baseline, treating an analysis as a label. The experiments were conducted on UDv2.1 corpora for 49 languages. The experiments showed that, on average, the Seq model performed the best compared to the others. For the OOV case, no significant difference was observed between the HMcMI and McMI models. For large datasets, it seems that the baseline MC outperformed these two models. For POS tagging, HMcMI outperformed McMI. This type of approach is efficient for languages with less complex morphology, such as English and others; it treats the problem as a morphological tagging problem, which essentially involves predicting the entire set of morphological tags given the context.
In contrast to predicting the entire analysis given a word context, treating the task as a disambiguation problem uses all candidates and selects the most probable analysis. In this direction, approaches to the MD task for Turkish began with a rule-based method reported in [14].
It uses a constrained lexical rule to select the most probable analysis among the candidates. The lexical rule contains the word case, POS tags, and the positional features of the word. Disambiguation involves using different combinations of these lexical rules to detect the pattern.
In another work, the authors explored a rule-based approach to MD that integrates a set of predefined constraint rules with an algorithm that automatically learns additional rules; it is denoted as constraint-based MD [15]. Following this constraint-based MD, a purely rule-based approach [16] was proposed for Turkish MD. The approach is similar to that described in [15], since all the rules are designed and chosen manually; the disambiguator uses a more capable and descriptive format for the disambiguation rules.
Statistical approaches have been proposed for extracting the lexical patterns of words to resolve morphological ambiguity. In this direction, hidden Markov model (HMM)-based approaches [2] were proposed for MD by modeling the transition and emission probabilities from features of the observations. To avoid the complex structure of the MD label, the authors break down each analysis into smaller units called inflectional groups (IGs). The HMM is then used to calculate the transition and emission probabilities between these inflectional groups. This approach assumes that the prediction for the current word depends only on the previous words (trigrams); in MD, this assumption is applied between IGs. For instance, in one model, the presence of IGs in a word depends only on the final IGs of the previous words.
The IGs-based model for Turkish achieves 92.08% accuracy, while the root-based model achieves 80.36%. The combined model improves the accuracy further, achieving 93.95%.
In [17], the authors proposed an open-source toolkit for Turkish text processing that includes a morphological analyzer and disambiguator, in which they tried different methods of disambiguation: (i) a rote-learning disambiguator, which counts the analyses observed in the training data and, at test time, picks the analysis with the highest frequency as correct; (ii) a model without root, which, instead of treating the analysis as a whole, decomposes it into two parts, a root $r$ and a sequence of morphological tags $a$, and models their joint probability; (iii) a model with IGs, in which the analysis is split into IGs rather than into two parts, and the joint probabilities between these IGs are modeled. Experimental results showed that the last two models achieved comparable accuracy on a test set similar to the training set (the text domain is news). The last model demonstrated better accuracy on a smaller test set that differed more significantly from the training set. The results also showed that the models performed worse when the boundaries of the IGs or tags were not clear. In [18], a set of classifiers was proposed for MD; using the J48 Tree algorithm, the highest accuracy obtained was 95.61%. A perceptron approach using a trigram sequence and 23 features was proposed in [3]; with these features, the accuracy in MD improved from 93.95% to 96.80%.
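The rote-learning baseline described above is simple enough to sketch directly. The following is a minimal illustration and not the implementation from [17]; the function names, the string encoding of analyses, and the fallback to the first candidate for unseen words are all assumptions.

```python
from collections import Counter, defaultdict

def train_rote(tagged_tokens):
    """Count how often each analysis was the gold one for each surface form."""
    counts = defaultdict(Counter)
    for word, analysis in tagged_tokens:
        counts[word][analysis] += 1
    return counts

def disambiguate_rote(counts, word, candidates):
    """Pick the candidate seen most often in training.

    For unseen words every candidate has count 0, so max() returns the
    first candidate (an assumed fallback, not taken from the paper).
    """
    seen = counts.get(word, Counter())
    return max(candidates, key=lambda analysis: seen[analysis])

# Toy training data: the surface form "at" is ambiguous (noun vs. verb).
training = [("at", "at+Noun"), ("at", "at+Verb"), ("at", "at+Verb")]
model = train_rote(training)
choice = disambiguate_rote(model, "at", ["at+Noun", "at+Verb"])
print(choice)
```

Because the choice ignores context entirely, this baseline always returns the same analysis for a given surface form, which is exactly the limitation the context-sensitive models address.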
In [19], a neural network architecture based on a bidirectional long short-term memory network was introduced for disambiguating morphological parses using different amounts of contextual information. The results demonstrated that the type and amount of context required for effective disambiguation vary across languages, depending on their linguistic characteristics. In languages like Turkish, where morphological information is largely conveyed by the surrounding context, models utilizing surface context can effectively capture long-range dependencies to resolve ambiguities. In contrast, for languages like Arabic, where surface representations are less informative, there is a significant benefit from incorporating representations of surrounding parse candidates alongside the surface forms of neighboring words.
MD has also been treated as a Seq2Seq problem [20,21,22]. Such a model takes the characters of a word as an input sequence and generates a morphological analysis for each word. Because each analysis can be decomposed into a sequence of tags, a sequence-to-sequence approach from machine translation can be applied to the two sequences. In [20], the authors proposed a neural architecture, Morpheus, which is based on sequential neural encoder-decoders and jointly solves the lemmatization and morphological tagging tasks. It uses a two-level LSTM network that produces context-aware vector encodings, which serve as input to the decoders. The outputs are both the morphological tags associated with each word and the minimal edit operations required to transform the surface words into their respective lemmas. Previous work [21] employed an encoder-decoder framework with a bidirectional LSTM for encoding, paired with an attention mechanism during decoding to better capture the semantic relationships between the suffixes of words and other elements in a sentence. These approaches have been validated across multiple languages, including Finnish, Turkish, and English, demonstrating that such models can achieve performance that rivals or exceeds existing state-of-the-art techniques.
Overall, the shortcomings of existing approaches can be summarized as follows:
(i) The variety and complexity of morphological analysis result in a large number of unique labels. This not only increases label data sparsity but also leads to the potential issue of out-of-vocabulary labels.
(ii) Long-term dependencies are not captured well. HMM-based approaches fail to capture the long-term dependencies between tags. While CRF-based approaches may mitigate this issue, they require feature engineering since features are extracted from the analysis. Additionally, CRFs still operate under a second-order Markov assumption, making it impractical to fully consider all the possible paths of different analyses.
The approach proposed in this work attempts to address these problems using contrastive learning methods. It models the analyses separately and employs a Transformer architecture to process input at the sentence level, incorporating knowledge encoded in LLMs. To our knowledge, there are few papers available on the MD task with LLMs, and the work presented here not only represents one of the first efforts to apply LLMs to this task but also introduces a contrastive learning method for the approach.
3. Proposed Approach
To analyze the impact of pre-trained large language models on MD and to effectively utilize the knowledge encoded in a PLM, this approach integrates contextual embeddings from a PLM with dense morphological representations to perform ambiguity resolution.
Figure 1 illustrates a contrastive learning approach for morphological disambiguation. Given an input token, the LLM generates a context representation based on the token’s context in the sentence. Multiple possible morphological analyses are transformed into morphological embeddings. The learning process aims to minimize the distance between the context representation and the correct morphological analysis (a1+, in blue), while maximizing the distance between the context and incorrect analyses (a2−, a3−, in red). During the learning process, the model improves its ability to select the correct morphological form by contrasting correct and incorrect analyses.
3.1. Task Formulation
Morphological disambiguation is the process of assigning the most appropriate morphological analysis to a given word form within its context. This task is important for languages with complex morphology, where a single word form can have multiple valid analyses depending on its use in a sentence. Formally, this task can be defined as follows:
Let $S = (w_1, \dots, w_n)$ denote a sentence, where each $w_i$ is a word. Each word $w_i$ has a set of morphological analyses, denoted by $A_i = \{a_i^1, a_i^2, \dots\}$. Each analysis $a_i^k$ can be represented as a tuple $(r_i^k, p_i^k, g_i^k)$, where:
$r_i^k$ represents a root/lemma,
$p_i^k$ denotes its part of speech,
$g_i^k$ indicates a sequence of grammatical features.
The objective of morphological disambiguation is to identify the most probable analysis $\hat{a}_i$ for each word $w_i$ in the sentence, such that $\hat{a}_i \in A_i$. This prediction should maximize the overall probability of the sequence of analyses given the sentence $S$.
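The probability of the sequence of morphological analyses given the sentence $S$ is expressed as $P(a_1, \dots, a_n \mid S)$. This can be factorized using the chain rule:
$$P(a_1, \dots, a_n \mid S) = \prod_{i=1}^{n} P(a_i \mid a_1, \dots, a_{i-1}, S)$$
where $P(a_i \mid a_1, \dots, a_{i-1}, S)$ represents the probability of the analysis $a_i$ for the word $w_i$, conditioned on the sentence $S$ and the analyses of the preceding words.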
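As a small numerical illustration of the chain-rule factorization, the sketch below multiplies per-word conditional probabilities in log space, which is the usual way to avoid underflow on long sentences. The probability values are invented, and the function name is hypothetical.

```python
import math

def sequence_log_prob(step_probs):
    """Sum log P(a_i | a_1..a_{i-1}, S) over the words of a sentence."""
    return sum(math.log(p) for p in step_probs)

# Invented conditional probabilities of the chosen analysis for three words.
step_probs = [0.9, 0.5, 0.8]
log_joint = sequence_log_prob(step_probs)
joint = math.exp(log_joint)  # equals 0.9 * 0.5 * 0.8 up to rounding
print(round(joint, 6))
```

Any model that supplies the per-word conditional probabilities (an HMM, an RNN, or a transformer) can be plugged into this product.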
To approximate these probabilities, various models can be employed: (i) HMMs are used to capture the transition probabilities between analyses and the likelihood of words given the analyses. (ii) Models such as recurrent neural networks (RNNs) [23,24] and transformers [25] are capable of learning complex patterns and dependencies in the data, providing a better understanding of context.
3.2. Contextual Embedding
In MD, capturing the context in which a word appears is crucial for resolving ambiguity. Pre-trained language models (PLMs) are trained on large corpora and provide contextual embeddings that may contain syntactic and semantic information. To generate a sub-word context embedding using PLMs, a byte pair encoding (BPE) for tokenization is needed, which is a subword tokenization technique that splits words into smaller units.
Given a word $w_i$ in a sentence $S$, BPE tokenizes it into a sequence of subwords $(s_{i,1}, \dots, s_{i,m})$. To handle the alignment of these subwords with their analysis sequence, only the first subword of a word carries morphological labels, while subsequent subwords are assigned an “ignored” label. To ensure that each sentence has the same length, sequences are padded to a maximum length using special padding tokens and masks. To ensure that each word has the same number of analyses, analyses are padded to a maximum number using a special padding analysis and masks.
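The alignment and padding scheme can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_subwords` is a stand-in that cuts words into fixed-size character chunks rather than a real BPE tokenizer, and the ignore value of -100 and the `<pad>` string are assumed conventions, not values reported in the paper.

```python
IGNORE = -100   # assumed id for the "ignored" label on non-initial subwords
PAD = "<pad>"   # assumed padding token

def toy_subwords(word, size=3):
    """Stand-in for a subword tokenizer: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

def align(words, labels, max_len):
    """Give each word's label to its first subword, then pad to max_len."""
    tokens, aligned, mask = [], [], []
    for word, label in zip(words, labels):
        pieces = toy_subwords(word)
        tokens.extend(pieces)
        aligned.extend([label] + [IGNORE] * (len(pieces) - 1))
        mask.extend([1] * len(pieces))
    # Truncate overly long sentences, then pad the rest.
    tokens, aligned, mask = tokens[:max_len], aligned[:max_len], mask[:max_len]
    pad = max_len - len(tokens)
    return tokens + [PAD] * pad, aligned + [IGNORE] * pad, mask + [0] * pad

tokens, aligned, mask = align(["kitaplar", "geldi"], [3, 7], max_len=8)
print(tokens)
print(aligned)
print(mask)
```

Here the labels 3 and 7 stand for the indices of the gold analyses of the two words; only the positions holding them contribute to training, while ignored and padded positions are skipped.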
With a transformer-based PLM, we calculate contextual embeddings for input tokens by transferring knowledge from the PLM, capturing the fine-grained usage of words across various contexts as well as the long-term dependencies captured by multi-head attention. The subword tokenizer captures word-internal features, and the transformer layers then calculate subword-aware representations for the input sequence, ranging from fine to coarse. More formally, let $T$ be the tokenization function using BPE. A word $w_i$ is tokenized into a sequence of subwords:
$$T(w_i) = (s_{i,1}, s_{i,2}, \dots, s_{i,m})$$
where $m$ is the number of subwords in $w_i$.
Let $E$ be the embedding function of the PLM. For a given sentence $S$, the tokenized subwords are fed into the PLM, which generates an embedding for each subword while taking the entire sentence $S$ into account for context:
$$\mathbf{H} = E(T(S))$$
where $E(T(S))$ generates the embeddings for the sequence of subwords in the entire tokenized sentence.
3.3. Morphological Embedding
For a word $w_i$, its analyses can be denoted by $A_i = \{a_i^1, a_i^2, \dots\}$. For simplicity, we do not distinguish between the root, POS, and morpheme chain, and instead consider them as a sequence of morphological tags.
Let $a_i^k = (t_{i,1}^k, \dots, t_{i,L}^k)$ represent a sequence of tags for the $k$-th analysis of the $i$-th word, where each tag $t_{i,l}^k$ can be a root, POS, or any morpheme tag. Let $\mathbf{t}_{i,l}^k$ (bold indicates a vector, while non-bold represents a tag) be an embedding for the $l$-th tag of the $k$-th analysis of the $i$-th word; each analysis can have up to $L$ tags (the maximum tag length). The average of the embeddings of all tags in an analysis’s tag sequence forms a single embedding for the analysis:
$$\bar{\mathbf{a}}_i^k = \frac{1}{L} \sum_{l=1}^{L} \mathbf{t}_{i,l}^k$$
The averaged embedding vector is passed through a nonlinear layer to compute the final analysis embedding $\mathbf{a}_i^k$:
$$\mathbf{a}_i^k = f(\mathbf{W} \bar{\mathbf{a}}_i^k + \mathbf{b})$$
where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector of the nonlinear layer, and $f$ is a nonlinear activation function.
All the analysis embeddings for the word $w_i$ are collected into a matrix $\mathbf{A}_i$. If the number of analyses is less than $K$, the matrix is padded with a special padding vector $\mathbf{a}_{pad}$:
$$\mathbf{A}_i = [\mathbf{a}_i^1; \dots; \mathbf{a}_i^k; \mathbf{a}_{pad}; \dots; \mathbf{a}_{pad}]$$
where $\mathbf{A}_i \in \mathbb{R}^{K \times d}$ is a matrix, $k$ is the number of actual analyses (where $k \le K$), and $d$ is the dimension of each analysis embedding.
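The construction of the analysis matrix can be sketched in plain Python. This is an illustration under assumptions rather than the authors' code: the activation is taken to be tanh (the text only says "nonlinear"), the average is taken over the tags actually present in each analysis (that is, padded tag positions are assumed to be excluded), and all names and numeric values are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nonlinear(vec, W, b):
    """tanh(W @ vec + b), with W given as a list of rows (tanh is assumed)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec)) + bias)
            for row, bias in zip(W, b)]

def analysis_matrix(analyses, tag_emb, W, b, K, pad_vec):
    """One embedding row per analysis, padded with pad_vec up to K rows."""
    rows = [nonlinear(mean_vector([tag_emb[tag] for tag in tags]), W, b)
            for tags in analyses]
    return rows + [pad_vec] * (K - len(rows))

# Toy two-dimensional tag embeddings and an identity nonlinear layer.
tag_emb = {"kitap": [1.0, 0.0], "Noun": [0.0, 1.0], "Pl": [1.0, 1.0]}
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
A = analysis_matrix([["kitap", "Noun"], ["kitap", "Noun", "Pl"]],
                    tag_emb, W, b, K=3, pad_vec=[0.0, 0.0])
print(len(A), len(A[0]))  # 3 2
```

In a trained model, the tag embeddings, `W`, and `b` would be learned jointly with the rest of the network; the sketch only shows the shape of the computation.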
3.4. Ambiguity Resolution
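To perform disambiguation, the distance between the context embedding and each analysis embedding is calculated. Since the context embedding is based on subwords, for a word $w_i$, we only compute the distance between the first subword $s_{i,1}$ of that word and its corresponding analysis embeddings. For the subsequent subwords $s_{i,j}$ ($j > 1$), the distance calculation uses the padding analysis embedding $\mathbf{a}_{pad}$ as follows:
$$D_{i,j}^k = \begin{cases} \mathrm{dist}(\mathbf{h}_{i,1}, \mathbf{a}_i^k), & j = 1 \\ \mathrm{dist}(\mathbf{h}_{i,j}, \mathbf{a}_{pad}), & j > 1 \end{cases}$$
where $\mathbf{h}_{i,j}$ is the contextual embedding of subword $s_{i,j}$, $\mathbf{a}_i^k$ is the embedding of the $k$-th analysis of $w_i$, and $\mathrm{dist}$ is the chosen distance function.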
The following distance metrics can be calculated for measuring the distance between the context and analysis embeddings:
(i) Euclidean distance: this measures the straight-line distance between two points in the embedding space.
(ii) Cosine similarity: this measures the cosine of the angle between two vectors, reflecting how similar their directions are.
(iii) Dot product: this measures the magnitude of projection of one vector onto another, reflecting their similarity.
(iv) Linear layer with a single output: this combines the embeddings through a weighted sum and a bias term to produce a scalar output.
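The four scoring functions listed above can be written out compactly. The sketch below is illustrative: the function names are hypothetical, and the `select` helper assumes that smaller is better for the Euclidean distance and larger is better for cosine similarity and the dot product. The linear scorer needs a learned weight vector and bias, so it is shown as a standalone function.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def linear_score(u, v, w, b):
    """Weighted sum over the concatenation [u; v] plus a bias term."""
    return dot(w, u + v) + b

def select(context, candidates, metric):
    """Return the index of the best candidate under the given metric."""
    if metric == "euclidean":
        scores = [-euclidean(context, c) for c in candidates]
    elif metric == "cosine":
        scores = [cosine(context, c) for c in candidates]
    elif metric == "dot":
        scores = [dot(context, c) for c in candidates]
    else:
        raise ValueError(f"unknown metric: {metric}")
    return max(range(len(candidates)), key=scores.__getitem__)

context = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1]]
picks = {m: select(context, candidates, m) for m in ("euclidean", "cosine", "dot")}
print(picks)
```

Note that cosine similarity discards vector magnitude while the dot product and Euclidean distance do not, which is one way the metrics can disagree on less clear-cut inputs than this toy example.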
3.5. Contrastive Loss and Training
The training process for MD optimizes the model’s parameters so that it computes embeddings for both subword-level contextual representations and morphological analyses. (For each tag in an analysis, we use dense vectors of the same dimension for the root, POS, and morpheme tags, and we do not distinguish between them here.) To learn effectively from both correct and incorrect analyses, we introduce a contrastive loss function, which reduces the distance between the contextual embedding and the correct analysis embedding while maintaining a margin between the correct and incorrect analysis embeddings.
This process is set up as a metric-based approach, where the objective is to minimize the distance between the correct analysis and the context, while maximizing the margin between the correct and incorrect analyses from multiple candidates.
To illustrate how the contrastive loss works in the proposed approach, we take the dot product as the distance metric in the following notation; for other distance metrics, the process is similar, and the gradient is calculated with the corresponding formulae of the chosen metric. For the contrastive loss, we calculate the distance (in the case of the dot product) between the context embedding $\mathbf{h}_i$ and the morphological analysis embedding $\mathbf{a}_i^k$, treating the dot product as the similarity score:
$$z_i^k = \mathbf{h}_i \cdot \mathbf{a}_i^k, \qquad \delta_i^k = -z_i^k$$
where $z_i^k$ is the similarity score between the context and the $k$-th analysis embedding, and $\delta_i^k$ is the corresponding distance. By taking the negative of the dot product, we ensure that higher similarity (a larger dot product) results in a smaller distance, and lower similarity (a smaller dot product) results in a greater distance.
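The contrastive loss for MD includes two parts:
(i) Positive loss: minimizes the distance (maximizes the similarity) for the correct morphological analysis $a_i^{+}$ for word $w_i$:
$$\mathcal{L}_i^{+} = -\log p_i^{+}$$
where $p_i^{+}$ is the probability for the correct morphological analysis $a_i^{+}$ for word $w_i$.
(ii) Negative loss: ensures that the sum of probabilities for incorrect analyses is below a margin $\gamma$:
$$\mathcal{L}_i^{-} = \max\left(0, \sum_{k:\, a_i^k \neq a_i^{+}} p_i^k - \gamma\right)$$
where $\sum_{k:\, a_i^k \neq a_i^{+}} p_i^k$ is the sum of probabilities for all incorrect analyses for the $i$-th word, and $\gamma$ is the margin.
The total loss for the entire training set, averaged over $N$ words, can be defined as follows:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left(\mathcal{L}_i^{+} + \mathcal{L}_i^{-}\right)$$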
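A possible reading of the two-part loss is sketched below. Several points are assumptions rather than details confirmed by the text: the probabilities are obtained with a softmax over the similarity scores of a word's candidate analyses, the positive term is the negative log-probability of the correct analysis, and the negative term is a hinge on the summed probability of the incorrect analyses. The function names and the margin value are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_loss(scores, gold, margin):
    """Positive term plus hinge-style negative term for one word."""
    probs = softmax(scores)
    positive = -math.log(probs[gold])
    wrong_mass = sum(p for k, p in enumerate(probs) if k != gold)
    negative = max(0.0, wrong_mass - margin)
    return positive + negative

def total_loss(batch, margin):
    """Average the per-word loss over (scores, gold index) pairs."""
    return sum(word_loss(s, g, margin) for s, g in batch) / len(batch)

# Dot-product scores for three candidate analyses of one word.
scores = [2.0, 0.0, 0.0]
good = word_loss(scores, gold=0, margin=0.1)  # gold has the highest score
bad = word_loss(scores, gold=1, margin=0.1)   # gold has a low score
print(good < bad)  # True
```

With this formulation the negative term vanishes once the incorrect analyses together hold less probability mass than the margin, so well-separated words stop contributing to that part of the gradient.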
4. Experiments
Two sets of experiments were carried out in this work, and they are summarized as follows:
(i) To explore the effectiveness of various distance measurements for performing disambiguation, we chose two Kazakh datasets and report the results for all tokens, out-of-vocabulary (OOV) tokens, and ambiguous tokens, providing detailed results for separate units of analysis: root, part-of-speech, and morpheme tags. It should be noted that morpheme tags form a sequence of multiple morphological tags in a specified order.
(ii) After selecting the distance measurement that gives the highest result, the second experiment compares a model with a pre-trained language model to a model without one. The comparison covers two languages and three datasets. A general comparison was conducted, and for each dataset, a detailed comparison is provided for predicting the root, POS, and morpheme tags.
4.1. Datasets
Table 1 presents the data statistics for the Kazakh and Turkish datasets in terms of dataset size, OOV rate, and the average number of analyses per token (AT). The Turkish dataset is much larger than the Kazakh datasets. The OOV rates for Kazakh are higher, at 27.58% and 43.9% for kD1 and kD2, while the Turkish dataset has an OOV rate of 10.24%. The difference is also reflected in the AT: the Kazakh datasets have an AT of 2.95 and 2.85 for kD1 and kD2, respectively, compared to 1.76 for the Turkish dataset. These statistics suggest that while both languages exhibit rich morphological structures, Kazakh presents more challenges due to its smaller datasets and higher OOV rates. These factors are important in model development because fine-tuning PLMs could enhance performance for low-resource languages such as Kazakh.
4.2. Model Setup
The model’s hidden size is set to 768. The dimensions of both the word embeddings and the tag embeddings are set to 768, ensuring uniformity across the model’s embedding spaces. The learning rate is set to $5 \times 10^{-4}$, providing a balance between effective training and convergence. A weight decay of 0.01 is employed to avoid overfitting by penalizing large weights. The model was trained for a total of 100 epochs. Additionally, a warmup proportion of 0.3 is used. The xlm-roberta-base model was used as the base model, as it is a multilingual pre-trained language model trained on a large dataset covering over 100 languages.
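To make the warmup setting concrete, the sketch below computes a learning rate for a given training step. The peak rate and warmup proportion are the values reported above; the shape of the schedule (linear warmup followed by linear decay to zero) is an assumption, since the text does not state which schedule was used, and the function name is hypothetical.

```python
def lr_at(step, total_steps, peak_lr=5e-4, warmup_proportion=0.3):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# With 1000 total steps, the first 300 steps ramp up to the peak rate.
for step in (0, 150, 300, 650, 1000):
    print(step, lr_at(step, total_steps=1000))
```

A warmup proportion of 0.3 means that nearly a third of training is spent below the peak rate, which is a comparatively long ramp for fine-tuning a pre-trained encoder.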
4.3. Results
In the experimental results, the accuracy for three types of tokens is reported as follows: (i) all tokens; (ii) ambiguous tokens (those with at least two analyses); (iii) OOV tokens.
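The three reported accuracies can be computed from per-token predictions as sketched below. The record layout (`pred`, `gold`, `n_analyses`, `in_train_vocab`) is a hypothetical schema chosen for this illustration, not the format of the datasets used here.

```python
def accuracy_report(tokens):
    """Accuracy over all tokens, ambiguous tokens, and OOV tokens."""
    def acc(subset):
        subset = list(subset)
        if not subset:
            return float("nan")
        return sum(t["pred"] == t["gold"] for t in subset) / len(subset)

    return {
        "all": acc(tokens),
        # Ambiguous tokens have at least two candidate analyses.
        "ambiguous": acc(t for t in tokens if t["n_analyses"] >= 2),
        # OOV tokens were not observed in the training vocabulary.
        "oov": acc(t for t in tokens if not t["in_train_vocab"]),
    }

tokens = [
    {"pred": "a", "gold": "a", "n_analyses": 1, "in_train_vocab": True},
    {"pred": "b", "gold": "c", "n_analyses": 3, "in_train_vocab": True},
    {"pred": "d", "gold": "d", "n_analyses": 2, "in_train_vocab": False},
    {"pred": "e", "gold": "e", "n_analyses": 1, "in_train_vocab": False},
]
report = accuracy_report(tokens)
print(report)  # {'all': 0.75, 'ambiguous': 0.5, 'oov': 1.0}
```

Unambiguous tokens are trivially correct, so the ambiguous-token accuracy is the more demanding of the three figures.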
Table 2 shows the overall, OOV, and ambiguous case accuracy for four distance measures: cosine similarity, Euclidean distance, dot product, and linear layer. For the kD1 dataset, the dot product distance metric achieved the highest overall accuracy, with 92.39% accuracy, 86.46% for OOV cases, and 88.36% for ambiguous cases. The Euclidean distance metric also performed strongly, with 92.14% overall accuracy. The cosine similarity achieved 89.67% overall accuracy, while the linear layer produced the lowest results, with an overall accuracy of 73.47%. In the kD2 dataset, the Euclidean distance yielded the highest overall accuracy (88.46%), followed by dot product at 87.83%. The cosine similarity showed an overall accuracy of 87.06%, and the linear layer again demonstrated the lowest performance, with 80.97% accuracy.
Table 3 reports the accuracy results for MD on the Kazakh datasets, measured across three aspects: root, POS, and morpheme. For the kD1 dataset, the Euclidean distance yielded the highest accuracy for root disambiguation, with 98.33% overall accuracy, followed by the dot product at 98.27%. In the OOV and ambiguous cases, the Euclidean distance achieved 96.44% and 97.45%, respectively. The cosine similarity yielded 97.71% overall accuracy. The linear layer achieved 93.75% overall accuracy, with a drop to 90.44% for ambiguous cases. In POS disambiguation, the Euclidean distance reached 96.60% overall accuracy, while the dot product scored 96.35%. The linear layer performed at 81.08%, with a significant decrease to 71.05% for ambiguous cases. For morpheme disambiguation, the dot product reached 93.94% overall accuracy and 88.36% OOV accuracy, with the Euclidean distance at 93.69% overall. The cosine similarity reached 91.22% overall accuracy, and the linear layer reached 80.27%, with the lowest ambiguous case accuracy of 69.82%.
In the kD2 dataset, the Euclidean distance achieved 98.65% overall accuracy for root disambiguation, followed by the dot product at 98.17% and the cosine similarity at 98.16%. The linear layer achieved 96.42%. For POS disambiguation, the Euclidean distance reached 94.40%, while the dot product followed with 93.81%. The linear layer achieved 89.52%. For morpheme disambiguation, the Euclidean distance reached 88.85% overall accuracy, with the dot product at 88.27%. The cosine similarity achieved 87.64%, and the linear layer reached 81.46%. Across both datasets, the Euclidean distance and the dot product consistently produced higher accuracy in all categories. The linear layer consistently showed lower results, particularly in ambiguous cases.
A possible reason why the linear layer showed lower results than the other metrics is the large tagset in morphological disambiguation (MD). The linear layer uses a parameter $\mathbf{w} \in \mathbb{R}^{d}$, which is applied to the concatenation of the context representation and the representation of one analysis. Here, $d$ represents the dimension of the concatenation of the context and one analysis embedding. Since the number of unique analyses is large and $\mathbf{w}$ is a single vector parameter, it becomes difficult to effectively capture the relationship between the context and the numerous different analyses. In comparison, in named entity recognition (NER), there is a fixed tagset, and in that case, each parameter vector is tied to a specific label, making it easier to model the relationships.
Table 4 shows that fine-tuning with PLMs (cMD$_{\text{plm}}$) results in improved accuracy across both languages compared to the baseline (cMD$_{\text{mlp}}$) that uses a multilayer perceptron. Both models, cMD$_{\text{plm}}$ and cMD$_{\text{mlp}}$, were trained and tested without using any externally designed features. For the Turkish dataset, the overall accuracy increases slightly from 91.47% to 92.44%, with a gain in the OOV accuracy from 87.97% to 88.45%. The ambiguous case accuracy improves by nearly 2%. In the kD1 dataset, the overall accuracy improves by 6%, with the OOV accuracy reaching 85.27% and the ambiguous case accuracy increasing by over 9%. The kD2 dataset shows an improvement of over 6% in overall accuracy, with similar increases in the OOV and ambiguous case accuracy.
Table 5 provides a detailed breakdown of the accuracy improvements across the root, POS, and morpheme categories. With the PLM, the accuracy for the ambiguous case improves across all categories. For the Turkish dataset, the accuracy for the root increases slightly, with the POS and morpheme categories seeing gains of around 0.9% and 2%, respectively. In the Kazakh datasets, the model cMD$_{\text{plm}}$ shows significant improvements. For kD1, there is a nearly 1.5% increase in the root ambiguous accuracy, a gain of over 5% in POS, and a nearly 8% improvement in morpheme ambiguous accuracy compared with cMD$_{\text{mlp}}$.
In the kD2 dataset, the model shows similar results, with improvements of 1.5% in root, 6% in POS, and 10% in morpheme ambiguous accuracy.
The results show that the proposed PLM-based model trained with contrastive loss improves accuracy on ambiguous cases and OOV tokens without the help of any designed features, especially for low-resource languages such as Kazakh, for which only a small dataset is available.
Table 6 compares the proposed approach with previous work. Since the datasets are the same as those used in the previous work, the results are directly comparable. The results for the HMM and voted perceptron approaches were obtained from [26]. The voted perceptron approach for Turkish was trained with the tool developed by [3]. It can be observed that cMDplm outperforms the other approaches for both Kazakh and Turkish without using any designed features or decoding techniques.
6. Conclusions
Morphological disambiguation is a long-standing challenge in processing MCLs. It is akin to POS tagging, but with a key distinction: in MD, it is not just the POS tag that needs to be correctly predicted, but also the lemma or root form of the word along with its corresponding morphological tags. This adds layers of complexity to the task compared to traditional POS tagging.
In this paper, a contrastive learning approach for morphological disambiguation (MD) using pre-trained language models (PLMs) was presented. The contrastive loss of the proposed approach is intended to reduce the distance between the embedding of the correct analysis and the contextual embedding while maintaining a margin between the embeddings of the incorrect analyses and the contextual embedding.
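The intuition behind such a loss can be sketched as follows. The code uses one common margin-based contrastive form (a squared-distance pull term for the correct analysis plus a squared hinge push term for each incorrect analysis); this particular form, the Euclidean distance, the default margin, and all names are assumptions for illustration, and the loss defined in the method section of the paper remains the authoritative formulation.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(context, correct, incorrect, margin=1.0):
    """Margin-based contrastive loss for one token (illustrative form).

    context:   contextual embedding of the token
    correct:   embedding of the correct analysis
    incorrect: list of embeddings of the incorrect candidate analyses
    """
    # Pull term: penalize distance between context and correct analysis.
    pull = euclidean(context, correct) ** 2
    # Push term: penalize incorrect analyses that fall inside the margin.
    push = sum(max(0.0, margin - euclidean(context, neg)) ** 2
               for neg in incorrect)
    return pull + push

# Toy vectors (hypothetical values).
context = [0.0, 0.0]
loss_ideal = contrastive_loss(context, [0.0, 0.0], [[3.0, 4.0]])
loss_collapsed = contrastive_loss(context, [0.0, 0.0], [[0.0, 0.0]])
loss_far_correct = contrastive_loss(context, [3.0, 4.0], [[3.0, 4.0]])
print(loss_ideal, loss_collapsed, loss_far_correct)
```

The loss is zero only when the correct analysis coincides with the context and every incorrect analysis lies at least one margin away; incorrect analyses beyond the margin contribute nothing, so training effort is spent on the confusable candidates.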
One of the aims of the paper was to explore the impact of fine-tuning a PLM for MD on Kazakh, a low-resource language, as well as on Turkish. Traditional approaches to MD, such as HMM-based and feature-engineered models, cannot capture long-term dependencies. The study therefore leveraged the contextualized embeddings from the PLMs, which allowed the model to handle ambiguous cases and OOV tokens better without depending on other features. The experimental results show that the fine-tuned PLM gave better MD results without using any designed features, especially for Kazakh. Further, the disambiguation analysis performed with different distance measurements shows that the Euclidean distance and the dot product yield the best results when comparing context and morphological analysis embeddings. These results show that PLMs are useful in enhancing the performance of MD tasks for various morphologically complex languages, including low-resource ones like Kazakh.
The error analysis for Kazakh revealed that the model has difficulty differentiating singular from plural noun forms (a distinction with a strong long-term dependency), maintaining noun-adjective agreement in both number and case, and distinguishing between tense and auxiliary verb forms. For the Turkish dataset, the model struggled with distinguishing between adjectives and nouns, identifying proper nouns versus common nouns, handling case and number distinctions (especially with possession and plurality), and managing complex verb forms involving tense and mood.
Overall, cMDplm, which combines a contrastive learning method with a PLM, achieves an overall accuracy of 92.64% for Kazakh (kD1). It outperforms the HMM-based approach by approximately 7% and the voted-perceptron approach (a set of features combined with decoding techniques) by 2%. For Turkish, it slightly outperforms the voted-perceptron approach, which uses a group of features with decoding.
Future work can be summarized as follows: (i) to improve the model so that it addresses the errors identified above; (ii) to explore how morphological analysis and disambiguation can be integrated into a unified learning framework through large language models, thus significantly improving the performance of the MD models in various linguistic contexts; (iii) to explore how to use the morphological features effectively, since long-term dependencies exist between these tags.