A Benchmark for Morphological Segmentation in Uyghur and Kazakh

Abudouwaili, Gulinigeer; Ruzmamat, Sirajahmat; Abiderexiti, Kahaerjiang; Wu, Binghong; Wumaier, Aishan

doi:10.3390/app14135369


by Gulinigeer Abudouwaili ^1,2, Sirajahmat Ruzmamat ^1,2, Kahaerjiang Abiderexiti ^1,2, Binghong Wu ^1,2 and Aishan Wumaier ^1,2,*

¹ School of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China

² Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(13), 5369; https://doi.org/10.3390/app14135369

Submission received: 15 May 2024 / Revised: 13 June 2024 / Accepted: 19 June 2024 / Published: 21 June 2024


Abstract:

Morphological segmentation and stemming are foundational tasks in natural language processing. Because of the way words are formed in agglutinative languages, these tasks have become effective ways to alleviate data sparsity in such languages. For Uyghur and Kazakh, two typical agglutinative languages, significant progress has been made in morphological segmentation and stemming in recent years. However, the evaluation metrics used in previous work are character-level, which may not comprehensively reflect the performance of models in morphological segmentation or stemming. Moreover, existing methods avoid manual feature extraction, but the model’s ability to learn features is inadequate in complex scenarios, and the correlation between different features has not been considered. Consequently, these models lack representation ability in complex contexts, which affects their generalization in practical scenarios. To address these issues, this paper redefines morpheme-level evaluation metrics, F1-score and accuracy (ACC), for morphological segmentation and stemming tasks. In addition, two models are proposed for morpheme segmentation and stem extraction tasks: a supervised model and an unsupervised model. The supervised model learns character and contextual features simultaneously; the feature embeddings are then input into a Transformer encoder to learn the correlation between character and context embeddings. The last layer of the model uses a CRF or softmax layer to determine morphological boundaries. In the unsupervised model, an encoder–decoder structure introduces n-gram correlation assumptions and masked attention mechanisms, enhancing the correlation between characters within n-grams and reducing the impact of characters outside n-grams on boundaries. Finally, comprehensive comparative analyses of the performance of different models are conducted from various points of view. Experimental results demonstrate that: (1) The proposed evaluation method effectively reflects the differences in morphological segmentation and stemming for Uyghur and Kazakh; (2) Learning different features and their correlation can enhance the model’s generalization ability in complex contexts. The proposed models achieve state-of-the-art performance on Uyghur and Kazakh datasets.

Keywords:

agglutinative languages; morphological segmentation; stemming; feature extraction; evaluation metrics

1. Introduction

A morpheme is the smallest meaningful unit in a language, and the process of segmenting words into morphemes is called morphological segmentation [1]. Due to the derivational characteristics of agglutinative languages, morphological segmentation has become a fundamental task in agglutinative language processing. Agglutinative languages are a type of language in which words are typically formed by stringing together a sequence of morphemes; based on the different meanings expressed by morphemes, agglutinative words can be divided into a stem and affixes [2]. The stem expresses the meaning of the word, whereas the affixes express grammatical information. Uyghur and Kazakh are typical agglutinative languages with extremely rich morphology [3] and a rich set of affixes expressing derivation or inflection. The lexical and grammatical structures of the two languages are very similar [4], and their word formation rules can be represented as [5] $prefix_{1} + \dots + prefix_{n} + stem + suffix_{1} + \dots + suffix_{m}$. Stemming, as an extension of morphological segmentation, aims to extract the morphemes that represent the meaning of a word and remove the morphemes that express grammatical meanings [5]. Figure 1 shows examples of morphological segmentation and stemming for Uyghur and Kazakh.

In Figure 1, the first line shows the words, and the second to fifth lines present the analysis results following the Leipzig Glossing Rules. The fifth line specifically marks the word’s stem and the original form of each morpheme before the phonological changes. Figure 1 shows that morphological segmentation and stemming are related: in agglutinative words without prefixes, the first morpheme is the stem. Furthermore, when morphemes are concatenated in Uyghur and Kazakh, phonological harmony may cause changes such as deletion, addition, and weakening of the characters at the junction [6]. Figure 1 shows three different phonological changes. The differing spellings of morphemes after harmony can increase the number of out-of-vocabulary words, impacting the model’s generalization ability. This characteristic distinguishes the morphological segmentation task in agglutinative languages from the word segmentation or morphological segmentation tasks in other languages, such as Chinese or English. Therefore, morphological segmentation models designed for other languages may not perform well when applied to morphologically rich languages like Uyghur or Kazakh.

Due to the agglutinative nature of Uyghur and Kazakh, theoretically, an infinite vocabulary can be generated [7]. As a result, data sparsity in agglutinative languages poses a challenge for downstream NLP tasks, as even small datasets lead to a large vocabulary [5]. However, morphological segmentation, which divides words into their smallest semantic units while maintaining semantic information, effectively alleviates the data sparsity issue caused by rich morphology. Therefore, morphological segmentation and stemming are widely used in various downstream natural language processing tasks such as named entity recognition [8], keyword extraction [4], question answering [9], speech recognition [10], machine translation [11,12], and language modeling [3].

Sequence labeling is a fundamental problem in NLP that involves assigning a label to each element of an input sequence. In Uyghur and Kazakh, morphological segmentation and stemming are often treated as character-level sequence labeling problems in which models predict a label for each character. Therefore, character-level evaluation methods are commonly used. However, character-level evaluation cannot reflect the overall performance of models in morphological segmentation for agglutinative languages. As shown in Table 1, the correct segmentation of the word “سانائەتنىڭ ” is “سانائەت نىڭ ”, which consists of two morphemes, and the true labels are “BMMMMEBME”. In the labels, “B” represents the starting character of a morpheme, “M” represents a middle character of a morpheme, and “E” represents the ending character of a morpheme. Suppose the model incorrectly predicts the label of the eighth character as “E” while all other labels are correct. Under character-level evaluation, this counts as a single wrong label, and the scores of the other characters in the same morpheme are unaffected. However, when the characters are merged into morphemes, this one wrong label produces two incorrect morphemes, “نى” and “ڭ”. Although current state-of-the-art (SOTA) models have achieved around 97% accuracy in character-level evaluation, the evaluation methods used focus only on individual characters (a single point) and do not take into account the morphological context of the characters (the horizontal relationships between points).

Table 1. The effect of model prediction errors on cutoff results.
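The one-label error discussed above can be reproduced with a short sketch. The code is illustrative only: the nine character tokens are a hypothetical Latin transliteration used in place of the Uyghur script, and the helper names `labels_to_morphemes` and `char_accuracy` are not from the paper. It rebuilds morphemes from BMES-style labels and contrasts character-level accuracy with the resulting morpheme sequences.

```python
def labels_to_morphemes(chars, labels):
    """Rebuild morphemes from per-character B/M/E/S labels."""
    morphemes, current = [], []
    for ch, tag in zip(chars, labels):
        # A new morpheme starts at B or S; flush anything still open.
        if tag in ("B", "S") and current:
            morphemes.append("".join(current))
            current = []
        current.append(ch)
        # A morpheme closes at E or S.
        if tag in ("E", "S"):
            morphemes.append("".join(current))
            current = []
    if current:  # unterminated morpheme at the end of the word
        morphemes.append("".join(current))
    return morphemes


def char_accuracy(gold, pred):
    """Fraction of characters whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical transliteration: six stem characters plus three suffix characters.
chars = ["s", "a", "n", "a", "e", "t", "n", "i", "ng"]
gold = "BMMMMEBME"
pred = "BMMMMEBEE"  # only the eighth label is wrong (M predicted as E)

gold_morphemes = labels_to_morphemes(chars, gold)
pred_morphemes = labels_to_morphemes(chars, pred)
print(char_accuracy(gold, pred))  # 8 of 9 labels are correct
print(gold_morphemes)             # ['sanaet', 'ning']
print(pred_morphemes)             # ['sanaet', 'ni', 'ng']
```

Eight of nine labels match, yet only one of the three predicted morphemes agrees with the gold segmentation, which is the gap a morpheme-level metric is meant to expose.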

To enhance the performance of stemming and morphological segmentation models for Uyghur and Kazakh, this paper redefines morpheme-based evaluation metrics (F1-score and accuracy) in morphological segmentation and stemming. In addition, we propose two benchmark models based on different training methods: (1) a supervised model, the Feature-Enhanced Morphological Segmentation Model (FEMSeg), for morphological segmentation and stemming; (2) an unsupervised morphological segmentation model, Masked Morphological Segmentation (MMSeg). In the supervised model, character-level and contextual features are learned through CNN and BiLSTM networks and together form the input embedding. This embedding is then fed into the encoder of the Transformer model, where linear transformations and multi-head attention mechanisms are used to learn the relationships between character features, contextual features, and morphological boundaries in different spaces. In the unsupervised model, character embeddings pre-trained with the word2vec model are fed into an encoder–decoder structure composed of LSTM networks with n-gram correlations and masked multi-head attention. This reduces the interference of characters outside the n-gram in determining morphological boundaries. Finally, the paper compares the performance of recent stemming and morphological segmentation models on Uyghur and Kazakh from several different perspectives. The contributions of this paper can be summarized as follows:

This paper redefines evaluation metrics in morphological segmentation and stemming tasks from a morphological perspective. Then, a comparison is made between recently proposed stemming and morphological segmentation models across various criteria, providing a comprehensive performance analysis.
To address the limited feature learning of existing methods, this paper proposes two models employing different training approaches: supervised and unsupervised. Both models utilize character features, contextual features, and correlations between them to improve the model’s generalization ability in complex scenarios (such as phonological harmony).
The two models proposed in this paper achieve SOTA results in morphological segmentation and stemming for Uyghur and Kazakh, updating the benchmark models and evaluation metrics.

2. Related Work

As fundamental tasks in NLP, stemming and morphological segmentation have produced numerous representative research achievements in high-resource and low-resource languages. When applied to downstream tasks, these achievements can effectively mitigate the problem of data sparsity. Like other NLP tasks, the research methods for stemming and morphological segmentation have evolved from dictionary-based or finite state automaton methods [13,14] to statistical learning methods based on manual feature extraction [15,16], and to methods based on deep learning [17,18]. Rule-based methods rely on grammatical rules and require linguistic experts to construct a rule base. The segmentation results are not ideal when rule conflicts or ambiguities occur. Supervised statistical machine learning models perform better than rule-based methods, as they do not require the construction of large dictionaries and complex grammatical rules. However, to enhance model performance, they rely on manual feature engineering. Commonly used statistical machine learning algorithms include Conditional Random Fields [19], Perceptrons [20], Graph-based models [21], and Morfessor [22].

In supervised learning, stemming and morphological segmentation are usually regarded as sequence labeling tasks. By combining different labeling schemes, characters in morphemes are labeled according to their positions [23]. Commonly used labeling schemes include BIO, BMES, BIOES, etc. The letter B represents the starting character of a morpheme, M and I represent the middle characters of a morpheme, E represents the ending character of a morpheme, S represents a single morpheme, and O represents a character that does not belong to the morpheme. Qiu et al. [24] proposed a multi-task learning model for Chinese word segmentation on multi-criteria. The model consists of a Transformer encoder and a CRF layer, with a set of embeddings (including criterion, bigram, and position embeddings) added to the input, prompting the output criteria and exhibiting excellent transfer capabilities. Huang et al. [25] addressed issues such as the explosive growth of model parameters on multi-criteria and trained a joint multi-criteria Chinese word segmentation model with shared parameters on multiple benchmark datasets using pre-trained language models. Pre-trained models have shown strong competitiveness in word segmentation tasks, but these models tend to learn word segmentation knowledge from in-vocabulary words rather than from context. Therefore, Lin et al. [17] proposed a context-aware Chinese word segmentation model. This method introduces an unsupervised sentence representation learning auxiliary task into the multi-criteria training framework, enabling the model to understand the entire context better. Specifically, the multi-criteria training framework incorporates unsupervised sentence representation learning with different dropout masks. Through contrastive learning, the differences between sentence representations of the same sentence under different masks are minimized. 
Currently, Chinese word segmentation methods based on pre-trained models have reached state-of-the-art levels, but they pose certain challenges for deployment. To improve model efficiency and generality, Li et al. [18] proposed a method for enhancing pre-trained models for Chinese word segmentation through cohort training and versatile decoding strategies. Numerous word segmentation models have been proposed for resource-rich languages like Chinese, achieving near-human annotation levels.
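As a concrete illustration of the BMES labeling scheme described above, the following sketch converts a segmented word into its label string. It is a minimal example: the function name `morphemes_to_bmes` and the placeholder morphemes are hypothetical and not taken from any of the cited systems.

```python
def morphemes_to_bmes(morphemes):
    """Label each character by its position inside its morpheme (BMES scheme)."""
    labels = []
    for morpheme in morphemes:
        if len(morpheme) == 1:
            labels.append("S")  # single-character morpheme
        else:
            # First character B, last character E, everything between M.
            labels.extend(["B"] + ["M"] * (len(morpheme) - 2) + ["E"])
    return "".join(labels)


# Placeholder morphemes: a six-character stem and a three-character suffix.
print(morphemes_to_bmes(["abcdef", "ghi"]))  # BMMMMEBME
print(morphemes_to_bmes(["ab", "c"]))        # BES
```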

Unlike supervised models, unsupervised models may not provide high-quality segmentation results, but they have certain advantages in open domains or specific applications. Downey et al. [26] proposed a new Masked Segmental Language Model (MSLM) that generates unsupervised subword segmentation by training a masked neural language model. MSLM is based on a bidirectional Transformer architecture with span masking, utilizing context and attention to increase the model’s scalability. To improve word segmentation performance in open domains, Pan et al. [27] proposed a new model called TopWORDS-Seg based on Bayesian inference. This model combines the TopWORDS and PKUSEG tools, enabling word segmentation and the discovery of new words. A series of experimental studies have demonstrated the robustness and interpretability of this model in open-domain Chinese word segmentation tasks. To address the lack of labeled data resources, Yan et al. [28] proposed the concept of word influence. They argued that the influence between words can be divided into strong and weak influences, assuming they follow a Gaussian distribution. By calculating the mutual influence between words using a pre-trained language model, they proposed a new loss function that separates the distributions of strong and weak influences as much as possible. Morfessor [29] is a classic unsupervised morphological segmentation tool. Rouhe et al. [30] investigated the possibility of using the unsupervised morphological segmentation model Morfessor under supervised conditions for segmentation. Specifically, they used Morfessor to segment words and enrich the model’s input features, then used a seq2seq model to determine whether Morfessor’s segmentation results were correct. Song et al. [11] proposed a self-supervised subword segmentation model that optimizes the word generation probability of partially masked character sequences and uses dynamic programming to generate the segmentation with the maximum posterior probability.

In statistical stemming and morphological segmentation for Uyghur and Kazakh, features such as syllables [31], part-of-speech, context [19,32,33], phonetic classes, the presence of sound change phenomena, and phonetic features [34] are often selected and added to the model to improve its performance. In deep learning-based models, (Bi)RNN [35], BiLSTM-CRF [36], CNN-BiLSTM-CRF [7], pointer networks [37], and attention mechanisms [7,37,38] have been used to learn the labels of the input sequence and distinguish morpheme boundaries. The studies mentioned above rely on labeling schemes, but the labels are not independent of one another, which can easily lead to model overfitting. Therefore, Yang et al. [37] modeled only the segmentation points. They proposed a morphological segmentation model based on a pointer network with a fused attention mechanism, whose segmentation performance is superior to that of the BiGRU model [35]. Abudukelimu et al. [7] applied the CNN-BiLSTM-CRF model to the morphological segmentation task and compared it with the pointer network [37]; the F1-score improved by 0.33%, and typical error types were analyzed comprehensively. The model improved the ability to recognize out-of-vocabulary words and low-frequency morphemes. Gvzelnur et al. [38] and Imin et al. [36] introduced an attention mechanism on top of BiLSTM-CRF, considering contextual sentence information and capturing the boundaries of stems and affixes through global features. On word-level datasets, this approach underperforms BiGRU [35]; on sentence-level datasets, however, the F1-score reached 96.07%. Zhang et al. [39] proposed an unsupervised morphological segmentation model for Uyghur based on meta-learning, realizing morphological segmentation in a few-shot learning environment and alleviating overfitting.

In summary, although supervised models for Uyghur and Kazakh lexical analysis have achieved certain research progress [6,40,41,42], these models’ evaluation metrics are based on character-level metrics. The performance and differences of these models have not been explored using morpheme-level evaluation metrics. In addition, the correlation between different features has not yet been investigated. Based on the above analysis, this paper redefines the evaluation metrics and proposes two feature-enhanced models for Uyghur and Kazakh morphological segmentation and stemming. Then, it comprehensively analyzes the experimental results from different dimensions.

3. Method

3.1. Task Definition

This paper utilizes unsupervised and supervised models to learn Uyghur and Kazakh stemming and morphological segmentation tasks. Specifically, the unsupervised model is used to learn morphological segmentation, and the supervised model is used to learn both stemming and morphological segmentation. Assume a word W of arbitrary length, consisting of x characters or y morphemes, i.e., $W = \{c_{1}, c_{2}, \dots, c_{x}\}$ or $W = \{m_{1}, m_{2}, \dots, m_{y}\}$, where $m_{i} = \{c_{1}, c_{2}, \dots, c_{z}\}$. In supervised morphological segmentation and stemming, each character is labeled, and the model learns the corresponding tags. In unsupervised morphological segmentation, the boundaries of morphemes are determined through the correlations between characters.

3.2. Feature-Enhanced Morphological Segmentation Model

This paper proposes two Feature Enhanced Morphological Segmentation models—FEMSeg and FEMSeg-CRF—based on the advantages of the sequence labeling model. The model integrates a CNN character-level representation layer, a BiLSTM context-level representation layer, a Transformer encoder layer, a linear layer, and a softmax layer or CRF layer. The model structure is shown in Figure 2. Specifically, the model learns character-level embeddings from the input sequence. Then, it captures character-level features of word and contextual features between characters (referred to as character-level context in this paper) through a CNN convolutional layer and a BiLSTM network layer, respectively. To learn the correlations between character features and contextual features and further determine morphological boundaries, character representation and contextual representation are concatenated and fed into a Transformer encoder layer for relevant feature learning. Finally, the CRF layer or softmax layer is used to predict the labels of the input sequence.

After the input characters are embedded, they are fed into the CNN and BiLSTM layers to learn character-level and context-level representations. In stemming and morphological segmentation, character-level feature extraction is particularly important. In the literature on stemming and morpheme segmentation in low-resource languages, feature engineering relies primarily on manual feature extraction. Therefore, to reduce manual feature extraction, this paper uses a character-level CNN network to learn character features. Each input character is a d-dimensional embedding, i.e., $C \in \mathbb{R}^{d}$. Assuming there is a convolution kernel $\omega_{i} \in \mathbb{R}^{h \times d}$, the convolution operation on the input yields the result $O_{i}$, as shown in Equation (1). To obtain more features, $n$ $(n \geq 1)$ convolution kernels are set up, and their outputs are concatenated together to represent the output of the convolutional layer, as shown in Equation (2).

$$O_{i} = f(\omega_{i} \cdot C + b), \quad i \in \{1, \dots, n\} \tag{1}$$

$$CNN_{out} = \mathrm{cat}(O_{1} \oplus O_{2} \oplus \dots \oplus O_{n}) \tag{2}$$
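Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the FEMSeg implementation: the activation f is taken to be ReLU, zero padding keeps one output per character, each kernel contributes a single feature column, and the names `conv1d_feature` and `cnn_out` as well as all sizes are hypothetical.

```python
import numpy as np


def conv1d_feature(emb, kernel, bias=0.0):
    """Eq. (1): slide one (h x d) kernel over a (length x d) embedding matrix."""
    length, _ = emb.shape
    h = kernel.shape[0]
    left = h // 2
    # Zero-pad so that every character position yields exactly one output value.
    padded = np.pad(emb, ((left, h - 1 - left), (0, 0)))
    out = np.array([np.sum(padded[t:t + h] * kernel) + bias for t in range(length)])
    return np.maximum(out, 0.0)  # f is assumed to be ReLU


def cnn_out(emb, kernels):
    """Eq. (2): concatenate the outputs of all kernels, one column per kernel."""
    return np.stack([conv1d_feature(emb, k) for k in kernels], axis=1)


rng = np.random.default_rng(0)
emb = rng.normal(size=(9, 8))                           # 9 characters, d = 8
kernels = [rng.normal(size=(h, 8)) for h in (2, 3, 4)]  # n = 3 kernels
features = cnn_out(emb, kernels)
print(features.shape)  # (9, 3)
```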

For sequence labeling tasks, RNN models can effectively learn temporal information. Bidirectional RNNs can learn not only past information but also future information. However, for sequences with long-span dependencies, there exist problems of gradient vanishing or explosion. To solve the long-distance dependency issue, Hochreiter and Schmidhuber [43] proposed a variant of RNN called LSTM, which controls the information flow and forgetting through a gate mechanism. The BiLSTM model overcomes the shortcomings of the LSTM model, which only records the previous context information without considering the future context information. The BiLSTM layer obtains two hidden layer outputs and concatenates them as the final output. Specifically, the hidden layer representation at time step t is $H_{t} = [\overrightarrow{h_{t}}; \overleftarrow{h_{t}}]$, where $\overrightarrow{h_{t}}$ represents the forward hidden layer at time step t, and $\overleftarrow{h_{t}}$ represents the backward hidden layer at time step t.

After the character-level and context-level representations are obtained, they are concatenated to form the final feature representation, i.e., $Feat_{out} = [CNN_{out}, BiLSTM_{out}]$, where $BiLSTM_{out} = H_{t}$. The process of multi-feature extraction is summarized in Algorithm 1.

Algorithm 1 Multi-feature extraction.
	Input: Word W, Kernel size K, Parameter P
	Output: Feature $Feat_{out}$
1:	$Emb \leftarrow \mathrm{Embedding}(W, P)$
2:
3:	function CNN_Feature($Emb$, K, P)
4:	for i in K do
5:	$O_{i} \leftarrow \mathrm{Conv1d}(Emb, P)$
6:	end for
7:	$CNN_{out} \leftarrow [O_{1}, O_{2}, \dots, O_{n}]$	▹ concat $O_{1}$ to $O_{n}$
8:	$CNN_{out} \leftarrow \mathrm{Dropout}(CNN_{out}, P)$
9:	$CNN_{out} \leftarrow \mathrm{Linear}(CNN_{out}, P)$
10:	return $CNN_{out}$
11:	end function
12:
13:	function BiLSTM_Feature($Emb$, P)
14:	$BiLSTM_{out} \leftarrow \mathrm{LSTM}(Emb, P)$
15:	return $BiLSTM_{out}$
16:	end function
17:
18:	$CNN_{out} \leftarrow \mathrm{CNN\_Feature}(Emb, K, P)$
19:	$BiLSTM_{out} \leftarrow \mathrm{BiLSTM\_Feature}(Emb, P)$
20:	$Feat_{out} \leftarrow [CNN_{out}, BiLSTM_{out}]$	▹ concat $CNN_{out}$ and $BiLSTM_{out}$

The multi-head attention mechanism can learn correlations between elements in the input sequence from different dimensions. Typically, it processes inputs of the same type. For instance, if the input sequence is a sentence, it learns the relationships between words; if it is a word, it learns the relationships between characters. To learn character features, contextual features, and the correlation between them, the concatenated representation $Feat_{out}$ is input into the encoder of the Transformer model. The encoder utilizes a multi-head attention mechanism, residual normalization layers, and a feed-forward network layer to capture the dependencies and semantic information between the character-level input sequences. This enhances the model’s awareness of morphological boundaries and semantic dependencies between characters and their context. The calculation steps are shown in the following equations:

$$Q, K, V = Feat_{out} \cdot (W_{q}, W_{k}, W_{v}) \tag{3}$$

$$\mathrm{Attention}(Q, K, V)_{i} = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d}}\right) V_{i} \tag{4}$$

$$\mathrm{MultiHead}(Q, K, V) = (\mathrm{Attention}_{1} \oplus \mathrm{Attention}_{2} \oplus \dots \oplus \mathrm{Attention}_{n}) W^{o} \tag{5}$$

$$X_{Add\&Norm_{1}} = \mathrm{LayerNorm}(Feat_{out} + \mathrm{MultiHead}(Q, K, V)) \tag{6}$$

$$X_{FNN} = \max(0, X_{Add\&Norm_{1}} W_{1} + b_{1}) W_{2} + b_{2} \tag{7}$$

$$X_{Add\&Norm_{2}} = \mathrm{LayerNorm}(X_{Add\&Norm_{1}} + X_{FNN}) \tag{8}$$

where n is the number of heads and $\mathrm{Attention}(Q, K, V)_{i}$ represents the ith attention head. The concatenation of the n heads followed by a linear transformation gives the final multi-head attention output. After the multi-dimensional span dependencies are obtained through the multi-head attention mechanism, the result is fed into the residual normalization layer to prevent model degradation. There are two residual normalization layers in the Transformer encoder. The output from the first residual normalization layer is then fed into the feed-forward layer. The feed-forward layer consists of two fully connected layers, with a ReLU activation function in the first layer and no activation function in the second layer. The output of the Transformer encoder is then fed into a fully connected layer, whose purpose is to unify the dimensions of the vectors input to the CRF and softmax layers. The calculation is shown in Equation (9):

$$X_{Linear} = X_{Add\&Norm_{2}} W_{Linear} + b_{Linear} \tag{9}$$
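The encoder computation in Equations (3) to (9) can be traced with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the scaling factor d is taken to be the per-head dimension, layer normalization has no learnable gain or bias, heads are formed by slicing the projected matrices, and the parameter names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder_layer(feat, params, n_heads):
    """Map (length x d_model) features to (length x n_labels) scores."""
    d_head = feat.shape[1] // n_heads
    # Eq. (3): project the concatenated features to queries, keys, and values.
    q, k, v = feat @ params["W_q"], feat @ params["W_k"], feat @ params["W_v"]
    heads = []
    for i in range(n_heads):
        cols = slice(i * d_head, (i + 1) * d_head)
        scores = q[:, cols] @ k[:, cols].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ v[:, cols])                # Eq. (4)
    multi_head = np.concatenate(heads, axis=-1) @ params["W_o"]  # Eq. (5)
    x1 = layer_norm(feat + multi_head)                            # Eq. (6)
    hidden = np.maximum(0.0, x1 @ params["W_1"] + params["b_1"])
    ffn = hidden @ params["W_2"] + params["b_2"]                  # Eq. (7)
    x2 = layer_norm(x1 + ffn)                                     # Eq. (8)
    return x2 @ params["W_linear"] + params["b_linear"]           # Eq. (9)


rng = np.random.default_rng(0)
d_model, d_ff, n_labels, n_heads = 16, 32, 4, 4
shapes = {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model),
    "W_v": (d_model, d_model), "W_o": (d_model, d_model),
    "W_1": (d_model, d_ff), "b_1": (d_ff,),
    "W_2": (d_ff, d_model), "b_2": (d_model,),
    "W_linear": (d_model, n_labels), "b_linear": (n_labels,),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
feat = rng.normal(size=(9, d_model))  # stand-in for Feat_out of a 9-character word
label_scores = encoder_layer(feat, params, n_heads)
print(label_scores.shape)  # (9, 4)
```

The four output columns correspond to one score per label (for example B, M, E, S) for each character, which is the input expected by the softmax or CRF layer.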

The last layer of a sequence labeling model is generally set as a softmax classification layer or a CRF layer, and this paper also follows this structure. When the last layer is a softmax function, it transforms the feature vector into a probability distribution in the range [0, 1], predicting the probability that the feature embedding belongs to a specific label. When the last layer is a CRF model, given a sequence $X = x_{1}, x_{2}, \dots, x_{n}$ and a label sequence $Y = y_{1}, y_{2}, \dots, y_{n}$ predicted by the CRF, the score of the sequence is defined as in Equation (10):

$$s(X, Y) = \sum_{i = 0}^{n} A_{y_{i}, y_{i + 1}} + \sum_{i = 1}^{n} P_{i, y_{i}} \tag{10}$$

In this paper, P is the output $X_{Linear}$, where $P_{i, y_{i}}$ represents the score of label $y_{i}$ for the ith character in the sequence; A is the matrix of transition scores, which is position-independent, and $A_{i, j}$ represents the score of the transition from label i to label j.
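Equation (10) can be made concrete with a few lines of Python. This sketch assumes that $y_{0}$ and $y_{n+1}$ are dedicated start and end tags, as is common in CRF taggers; the text above does not define them, so this is an assumption. The tag names, the function name `crf_sequence_score`, and the toy scores are all hypothetical.

```python
def crf_sequence_score(emissions, transitions, labels, start="<s>", end="</s>"):
    """Eq. (10): transition scores along the padded path plus emission scores."""
    path = [start] + list(labels) + [end]
    # Transitions that are not listed are treated as having score 0.
    transition_score = sum(transitions.get((a, b), 0.0) for a, b in zip(path, path[1:]))
    emission_score = sum(emissions[i][y] for i, y in enumerate(labels))
    return transition_score + emission_score


# Toy scores for a two-character word with labels B and E only.
emissions = [{"B": 2.0, "E": 0.5}, {"B": 0.1, "E": 1.5}]
transitions = {("<s>", "B"): 1.0, ("B", "E"): 0.8, ("E", "</s>"): 0.3}

score_be = crf_sequence_score(emissions, transitions, "BE")
score_eb = crf_sequence_score(emissions, transitions, "EB")
print(score_be > score_eb)  # True: the well-formed path B, E scores higher
```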

3.3. Masked Morphological Segmentation Model

Unsupervised morphological or word segmentation methods, such as BPE, Morfessor, WordPiece, and BBPE, commonly learn the correlations between characters at the sentence level. These methods have shown strong performance in downstream tasks, but their performance when evaluated directly on morphological segmentation and word segmentation is limited. Therefore, this paper takes words as units and learns the n-gram correlations within words. It uses a masked self-attention mechanism to avoid the influence of characters outside the n-gram on determining morpheme boundaries. The model structure is shown in Figure 3.

The Masked Morphological Segmentation (MMSeg) model consists of an input layer, an encoding layer, a decoding layer, and an output layer. Each component is introduced below:

Input Layer: The input sequence is represented as a fixed-size vector. Given an input sequence $X = \{x_{1}, x_{2}, x_{3}, \dots, x_{n}\}$, it is represented as $E = \{e_{1}, e_{2}, e_{3}, \dots, e_{n}\}$. Here, $E \in \mathbb{R}^{n \times d}$, and d is the embedding dimension. To incorporate prior knowledge into the model, this paper initializes the embedding representation of the input sequence using pre-trained character embeddings. These embedding representations are obtained by training a word2vec model. During unsupervised learning, the embeddings are continuously updated.

Encoding Layer and Masked Attention Mechanism: After the input sequence is vectorized, it is fed into the encoding layer. The encoder consists of a single-layer unidirectional LSTM network. The output at the ith time step is given by Equation (11):

$$h_{i} = \mathrm{Encoder}(h_{i - 1}, x_{i}) \tag{11}$$

where $h_{i - 1}$ represents the hidden layer state at time step $i - 1$, and $x_{i}$ is the input at time step i.

After the input sequence passes through the encoding layer, the hidden layer is truncated based on a predefined maximum character length and fed into the masked attention mechanism. The masked attention mechanism uses multi-head self-attention with four heads. The calculation formulas were introduced in Section 3.2. Unlike other attention mechanisms, an upper triangular matrix is used as the masking matrix, ensuring that the attention calculation emphasizes features close in time to the tth time step. The masking values for nearby positions are set to 1, whereas the masking values for other positions are set to 0. Algorithm 2 summarizes the masked attention mechanism.
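The masking idea can be sketched in NumPy as follows. Two points are assumptions rather than the paper's specification: the optional `window` argument, which restricts each position to the next few characters, is a hypothetical reading of "nearby positions", and masked-out scores are set to negative infinity before the softmax, a common alternative to multiplying the scores by the 0/1 mask as written in Algorithm 2.

```python
import numpy as np


def attention_mask(seq_len, window=None):
    """Upper-triangular 0/1 mask; with `window`, keep only i <= j < i + window."""
    mask = np.triu(np.ones((seq_len, seq_len))) == 1
    if window is not None:
        # Drop positions that are `window` or more steps ahead of the query.
        mask &= ~(np.triu(np.ones((seq_len, seq_len)), k=window) == 1)
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted to mask == True."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))            # stand-in for truncated LSTM outputs
mask = attention_mask(5, window=3)     # each character sees itself and the next two
context, weights = masked_attention(h, h, h, mask)
print(mask.astype(int))
print(weights.round(2))
```

Because the diagonal is always kept, every row of the mask has at least one permitted position, so the softmax is well defined.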

Decoding Layer and Objective Function: The decoding layer of the model consists of a single-layer unidirectional LSTM. The output A from the masked attention mechanism is fed into the LSTM layer for decoding, as shown in Equation (12):

$$\hat{p}(y_{j}^{i} \mid y_{1 : j - 1}^{i}, y^{1 : i - 1}) = \mathrm{Decoder}(A_{j}^{i}, y_{j}^{i}) \tag{12}$$

where $y_{j}^{i}$ represents the jth character in the ith segment, and $y^{1 : i - 1}$ represents the segments from the first to the $(i - 1)$th. Concatenating all the segments $y_{1 : T_{i}}^{i}$ yields the entire word $y_{1 : T}$, where $T_{i}$ is the length of the ith segment $y^{i}$, and T is the length of the sequence Y. The ith segment depends on the previous segments $1 : i - 1$: the initial state $h_{0}$ of the ith segment is initialized using the results of the previous segments. The model achieves unsupervised word segmentation by learning the joint probability of the segmented character sequences, as shown in Equation (13):

$$p(y_{1 : T}) = \sum_{T_{1}, T_{2}, \dots} \prod_{i} \hat{p}(y_{1 : T_{i}}^{i}) = \sum_{T_{1}, T_{2}, \dots} \prod_{i} \prod_{j = 1}^{T_{i} + 1} \hat{p}(y_{j}^{i} \mid y_{0 : j - 1}^{i}) \tag{13}$$

where the embedding of $y_{0}^{i}$ is represented as $y^{1 : i - 1}$. We train this model by maximizing the log-likelihood, that is, by minimizing the loss in Equation (14):

$$L = - \log p(y_{1 : T}) \tag{14}$$

Algorithm 2 Masked attention mechanism.
	Input: sequence_length $s_{l}$, the output of LSTM H, max_length $m_{l}$, Parameter P, Head_num $H_{n}$, $D_{model}$ D
	Output: Attention Feature A
1:	Initialize $W_{q}, W_{k}, W_{v}, W_{o}$
2:
3:	function Get_Attention_Mask($s_{l}, m_{l}$)
4:	$mask = \mathrm{triu}(\mathrm{np.ones}(s_{l}, s_{l})) == 1$
5:	return $mask$
6:	end function
7:
8:	$att\_mask = \mathrm{Get\_Attention\_Mask}(s_{l}, m_{l})$
9:	function Multi_Head_Attention($H, P, H_{n}, att\_mask$)
10:	for i in $H_{n}$ do
11:	$Q_{i}, K_{i}, V_{i} \leftarrow H \cdot (W_{q}^{i}, W_{k}^{i}, W_{v}^{i})$
12:	$attention_{i} \leftarrow \mathrm{softmax}((Q_{i} K_{i}^{T}) \cdot att\_mask / \sqrt{D}) \cdot V_{i}$
13:	end for
14:	$Attention \leftarrow [attention_{1}, attention_{2}, \dots, attention_{H_{n}}] \cdot W_{o}$	▹ concat $attention_{1}$ to $attention_{H_{n}}$
15:
16:	$A \leftarrow \mathrm{LayerNorm}(Attention, P)$
17:	$A \leftarrow \tanh(A)$
18:	return A
19:	end function

When the initial condition $p(y_{1:0}) = 1$ is satisfied, the loss function can be calculated using dynamic programming, as shown in Equation (15):

p(y_{1:n}) = \sum_{k=1}^{K} p(y_{1:n-k}) \hat{p}(y_{n-k+1:n})

(15)

where $p(\cdot)$ is the joint probability of all segmentation results, $\hat{p}(\cdot)$ is the probability of a segment, and K is the maximum length of a segment.
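The recursion in Equation (15) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `log_marginal`, `toy_scorer`, and the constant per-character probability are hypothetical stand-ins for the LSTM decoder, chosen only so that the result can be checked by hand. The sketch works in log space for numerical stability.

```python
import math


def log_marginal(chars, seg_logprob, max_len):
    """Log of p(y_{1:n}) summed over all segmentations, following Equation (15).

    seg_logprob(prefix, segment) is a hypothetical scorer that returns the log
    probability of `segment` given the characters before it.
    """
    n = len(chars)
    # alpha[t] holds log p(y_{1:t}); alpha[0] = log 1 encodes p(y_{1:0}) = 1.
    alpha = [0.0] + [-math.inf] * n
    for t in range(1, n + 1):
        # The last segment has length k, with 1 <= k <= min(K, t).
        terms = [
            alpha[t - k] + seg_logprob(chars[: t - k], chars[t - k : t])
            for k in range(1, min(max_len, t) + 1)
        ]
        top = max(terms)
        if top == -math.inf:
            continue  # no segmentation of this prefix has non-zero probability
        # Log-sum-exp keeps the sum numerically stable.
        alpha[t] = top + math.log(sum(math.exp(x - top) for x in terms))
    return alpha[n]


def toy_scorer(prefix, segment):
    """Toy stand-in for the decoder: 0.25 per character, context ignored."""
    return len(segment) * math.log(0.25)


# "abc" has 4 segmentations when K >= 3 and 3 segmentations when K = 2.
p_full = math.exp(log_marginal("abc", toy_scorer, max_len=3))
p_short = math.exp(log_marginal("abc", toy_scorer, max_len=2))
```

With the toy scorer, every segmentation of a three-character word has probability 0.25^3, so the marginal is simply the number of admissible segmentations divided by 64. Lowering the maximum segment length K removes the single-segment analysis from the sum.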

4. Experiments

4.1. Data

This paper used the Uyghur morphological segmentation dataset constructed by Tsinghua University (http://thuuymorph.thunlp.org/ accessed on 18 June 2024) and the Kazakh dataset collected and annotated by our laboratory. Both datasets were word-level datasets, and neither contained contextual information. This paper used the BMES tagging scheme to annotate the morphemes for stemming and morphological segmentation. The datasets were split into training, testing, and validation sets in an 8:1:1 ratio, and 10-fold cross-validation was also performed. The distributions of words and morphemes in the datasets are shown in Table 2.
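As an illustration of the BMES scheme, the following sketch converts a morpheme sequence into character-level tags and back. The function names and the Latin-script token are hypothetical and are not taken from the datasets. The sketch assumes the usual reading of the labels: B, M, and E mark the beginning, middle, and end of a multi-character morpheme, and S marks a single-character morpheme.

```python
def bmes_tags(morphemes):
    """Return one BMES tag per character for a segmented word."""
    tags = []
    for morpheme in morphemes:
        if len(morpheme) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(morpheme) - 2) + ["E"])
    return tags


def decode_bmes(chars, tags):
    """Rebuild the morpheme sequence from characters and predicted tags."""
    segments, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):  # these two tags close a morpheme
            segments.append(current)
            current = ""
    if current:  # tolerate an ill-formed tail such as a trailing "B" or "M"
        segments.append(current)
    return segments


# Hypothetical token made of a stem and two suffixes.
example = ["kitab", "lar", "i"]
example_tags = bmes_tags(example)
restored = decode_bmes("".join(example), example_tags)
```

The decoding direction is the one needed for evaluation: predicted tags are turned back into segments before they are compared with the ground truth.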

4.2. Evaluation Metrics

In previous literature, stemming and morphological segmentation have been treated as character-level sequence labeling tasks, and evaluation has focused on the predicted label of each character. Specifically, recall, precision, and F1-score were calculated for the labels “B”, “M”, “E”, and “S” individually, and these values were then averaged to obtain the final recall, precision, and F1-score. However, this paper argues that such an evaluation method is unsuited to stemming and morphological segmentation. Model performance should be evaluated from a morphological perspective to truly assess differences in morphological segmentation ability. Therefore, when evaluating morphological segmentation in this paper, the predicted BMES labels were used to reconstruct the segmentation results, which were then compared with the ground truth. This paper uses recall, precision, and F1-score, calculated as in Equations (16)–(18):

R = \frac{\#\,\text{correct segmentation}}{\#\,\text{ground truth}}

(16)

P = \frac{\#\,\text{correct segmentation}}{\#\,\text{model's segmentation}}

(17)

In Equations (16) and (17), $\#\,\text{ground truth}$ refers to the number of segmentation results used as a reference by the model, $\#\,\text{correct segmentation}$ refers to the number of segments in the model’s predictions that match the ground-truth segmentation results, and $\#\,\text{model's segmentation}$ refers to the number of segmentation results predicted by the model. The F1-score can be calculated using recall and precision, as shown in Equation (18):

F_{\beta}\text{-score} = (1 + \beta^{2}) \frac{P \cdot R}{(\beta^{2} \cdot P) + R}

(18)

The $F_{\beta}$-score is the weighted harmonic mean of precision and recall. When $\beta = 1$, the $F_{\beta}$-score becomes the F1-score, where precision and recall have equal weight. Additionally, there are cases of F0.5 and F2, where the F0.5-score gives more weight to precision, and the F2-score gives more weight to recall.

Figure 4 illustrates the differences between the two evaluations for a Uyghur word. The evaluation process in the pink box on the left side of Figure 4 represents the character-level method. In contrast, the evaluation process in the green box above represents the morphological-level method. It can be observed that the character-level evaluation method focuses on labels or points rather than the impact of labels or points on morphological errors. The F1-score for the character-level evaluation method is 0.89, whereas in the morphological-level evaluation method, $\#\,\text{ground truth} = 2$, $\#\,\text{correct segmentation} = 1$, and $\#\,\text{model's segmentation} = 3$. Therefore, $R = 0.50$, $P = 0.33$, and the F1-score is 0.40. The morphological-level evaluation more accurately reflects the model’s performance in morpheme segmentation.
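The morphological-level metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes that a predicted segment counts as correct when its character span coincides with a ground-truth span, and the helper names and the toy word are hypothetical. The toy word is chosen to reproduce the counts of the worked example (two reference segments, three predicted segments, one match).

```python
def spans(segments):
    """Character spans (start, end) covered by each segment of one word."""
    out, start = set(), 0
    for segment in segments:
        out.add((start, start + len(segment)))
        start += len(segment)
    return out


def precision_recall_f(gold_words, pred_words, beta=1.0):
    """Morpheme-level precision, recall, and F-beta over a list of words."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_words, pred_words):
        gold_spans, pred_spans = spans(gold), spans(pred)
        correct += len(gold_spans & pred_spans)  # segments matching the reference
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    recall = correct / n_gold if n_gold else 0.0
    precision = correct / n_pred if n_pred else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Toy word: the reference has 2 segments, the prediction has 3, and 1 matches.
gold = [["abcd", "ef"]]
pred = [["abcd", "e", "f"]]
p, r, f1 = precision_recall_f(gold, pred)
```

Here `p` is 1/3, `r` is 1/2, and `f1` is 0.4, matching the values computed by hand in the worked example. Passing `beta=0.5` or `beta=2.0` gives the precision-weighted and recall-weighted variants.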

To evaluate the model’s stemming ability, this paper also restores the characters into “stem–affix” sequences and analyzes the stemming results using accuracy and two types of error rates. The accuracy calculation is shown in Equation (19):

ACC = \frac{\#\,\text{correct stemming}}{\#\,\text{all stemming}}

(19)

where $\#\,\text{correct stemming}$ refers to the number of correct stemming outputs by the model, and $\#\,\text{all stemming}$ refers to the total number of stemming outputs by the model. This paper divides stemming errors into over-stemming and under-stemming. Over-stemming refers to the case where the stem extracted by the model is shorter than the ground truth, whereas under-stemming refers to the case where the stem extracted by the model is longer than the ground truth. Definitions of these error types vary with the characteristics of the experimental data and tasks [44]; the definitions above are the ones used in this paper.
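A minimal sketch of the stemming evaluation follows, under the definitions given above. The function name, the dictionary keys, and the toy stems are hypothetical. The sketch assumes that each predicted stem is a prefix of the same word as the reference stem, so two stems of equal length are identical.

```python
def stemming_report(gold_stems, pred_stems):
    """Accuracy plus over- and under-stemming rates over paired stems."""
    correct = over = under = 0
    for gold, pred in zip(gold_stems, pred_stems):
        if pred == gold:
            correct += 1
        elif len(pred) < len(gold):
            over += 1   # predicted stem is shorter than the reference
        elif len(pred) > len(gold):
            under += 1  # predicted stem is longer than the reference
    total = len(gold_stems)
    return {
        "acc": correct / total,
        "over_stemming": over / total,
        "under_stemming": under / total,
    }


# Toy stems: one correct, one too short, one too long.
report = stemming_report(["kitab", "bar", "oqu"], ["kitab", "ba", "oqugh"])
```

In this toy case each of the three outcomes occurs once, so accuracy and both error rates are 1/3.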

4.3. Hyperparameters

The experiments were run on Ubuntu 20.04.1 with Python 3.8 (https://www.python.org/downloads/ accessed on 18 June 2024) and the deep learning framework PyTorch 1.9.0 (https://pytorch.org/ accessed on 18 June 2024). Models are trained on a 3090 GPU. The supervised experiment parameters are as follows: the character embedding dimension is 128, the CNN window sizes are [1, 3, 5], and the BiLSTM hidden layer dimension is 128. The input dimension of the Transformer is 512, with a single-layer Transformer structure and an 8-head attention mechanism. To prevent overfitting, the dropout rate is 0.1. The batch size is 64, and the parameters are updated using the Adam optimizer with an initial learning rate of 0.001 and a decay rate of 0.05. The unsupervised experiment parameters are as follows: a single-layer unidirectional LSTM is used for both the encoder and the decoder, with an embedding dimension and hidden layer dimension of 256, 4-head attention, and a maximum segment length K of 8 characters. The parameters are updated using the Adam optimizer with an initial learning rate of 0.0001 and a dropout rate of 0.1. The parameters used by the model are shown in Table 3.

4.4. Baseline Models

To compare the effectiveness of different models, this paper selects classic sequence labeling models and recently published stemming and morphological segmentation models as baseline models. Based on the similarity between the morphological segmentation task and the stemming, this paper uses the same supervised models for both tasks, specifically including the following models.

4.4.1. Unsupervised Comparison Models

Morfessor [29]: A classic unsupervised model for the morphological segmentation task. Morfessor 2.0 adds semi-supervised learning to Morfessor 1.0, which can effectively alleviate the problems of over-segmentation and under-segmentation. This paper used the open-source Morfessor toolkit. Because Morfessor 2.0 is a semi-supervised model and does not meet the training conditions for unsupervised models, Morfessor 1.0 was used instead. Training and testing were conducted using the default configuration.

BPE [45]: The most widely used and high-performing unsupervised subword segmentation model in NLP tasks. It segments words by iteratively merging the most frequent pair of adjacent subwords. In this study, the number of merges for BPE was set to 2000, 4000, 8000, and 16,000, and the highest F1-score was achieved with 8000. (Due to the data scale and the agglutinative nature of the languages, the BPE algorithm stopped merging at 5414 merges for Uyghur and 5249 merges for Kazakh. Therefore, once the configured number of merges exceeds these values, the model’s F1-score no longer changes.)
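The merge procedure of BPE, and one way it can stop before the requested number of merges, can be sketched as follows. This is a simplified illustration rather than the toolkit used in the experiments: `learn_bpe` and the toy frequency table are hypothetical, and the sketch stops only when no adjacent pair remains, whereas real implementations may also stop at a minimum pair frequency.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency table."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is a single symbol, so nothing is left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


# Ten merges are requested, but the toy data supports only two.
toy_merges = learn_bpe({"aba": 2, "ab": 1}, num_merges=10)
```

On the toy table the procedure returns two merges although ten were requested, which mirrors the behavior reported above: on a small word-level dataset, asking for more merges than the data supports does not change the resulting segmentation.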

4.4.2. Supervised Comparison Models

HMM and CRF: Statistical learning models widely used in sequence labeling tasks. HMM is a generative model that relies on the observation independence and homogeneous Markov assumptions. CRF is a discriminative model that can use more context than HMM and predicts labels through a globally optimal solution.

BiLSTM and BiGRU [35]: Deep learning models with strong modeling capabilities for sequential data; both are bidirectional variants of the RNN. BiLSTM captures forward and backward information and mitigates gradient explosion and gradient vanishing on sequence data through its gating mechanism and memory cell state. BiGRU also alleviates the gradient vanishing problem through a gating mechanism and, compared to BiLSTM, has fewer model parameters and a simpler structure. In practice, the performance of the models varies depending on the task and requirements.

BiLSTM-CRF [34]: To enhance the ability of the BiLSTM model to constrain labels, the softmax function in its output layer is replaced with a CRF layer.

BiLSTM-ATT-CRF [36,38]: An attention mechanism is added to the middle layer of the BiLSTM-CRF model to enhance the model’s ability to extract features.

As only a few papers have made their source code public, this paper implemented the above benchmark models, except for the Morfessor, BPE, and CRF models, to ensure parameter consistency and comparability of model results.

4.5. Results

4.5.1. Morphological Segmentation Comparison

The experimental results comparing unsupervised and supervised morphological segmentation are shown in Table 4. This paper uses recall (R), precision (P), and F1-score to compare the performance of the models.

In Table 4, the row immediately below each of the MMSeg, FEMSeg, and FEMSeg-CRF results (rows 4, 8, and 14) reports the corresponding 10-fold cross-validation results. The experimental results show that the methods proposed in this paper improve on both the unsupervised and the supervised baselines. In the unsupervised experiments, compared with the mainstream unsupervised segmentation models BPE and Morfessor, the F1-score of MMSeg increased by 29.64%, 11.59%, and 14.75%, 8.08% on the test sets of Uyghur and Kazakh, respectively. In the supervised segmentation, the experimental results reveal the following: (1) Models can generally be divided into two categories based on whether they incorporate a CRF layer. Adding a CRF layer strengthens the model’s constraints on the predicted labels, ensuring that the final output is valid. For example, compare BiLSTM (91.51%, 90.97%) with BiLSTM-CRF (91.92%, 91.29%), and FEMSeg (92.04%, 92.34%) with FEMSeg-CRF (92.69%, 92.84%). (2) The FEMSeg and FEMSeg-CRF models show clear improvements in both languages. The F1-score of FEMSeg reaches 92.04% and 92.34% on the test sets of Uyghur and Kazakh, respectively, which is an improvement of 1.57% and 2.43% compared with the BiGRU model, and an improvement of 0.53% and 1.37% compared with the BiLSTM model. The F1-score of FEMSeg-CRF reaches 92.69% and 92.84% on the test sets of Uyghur and Kazakh, respectively, which is an improvement of 51.25% and 50.94% compared with the HMM model, an improvement of 4.23% and 12.60% compared with the CRF model, an improvement of 0.77% and 1.55% compared with the BiLSTM-CRF model, and an improvement of 1.28% and 1.75% compared with the BiLSTM-ATT-CRF model.

On the test sets of both languages, FEMSeg and FEMSeg-CRF, unlike the other models, learn character features and the relationships between characters and their contexts in different dimensions, thereby improving the accuracy of morphological segmentation. Compared with BiLSTM, BiLSTM-ATT-CRF performs slightly worse on the Uyghur test set and only slightly better on the Kazakh test set (0.12%). By contrast, comparing BiLSTM with BiLSTM-CRF demonstrates the effectiveness of the CRF layer. Therefore, it can be concluded that the soft attention mechanism does not necessarily improve model performance on word-level datasets. The differences between FEMSeg and FEMSeg-CRF and other models are further discussed in Section 4.6.

This comparative experiment proved the effectiveness of the two methods proposed in this paper for the morphological segmentation task. In unsupervised segmentation, the morpheme boundaries were modeled using the correlation between n-grams. In supervised segmentation, words were fully represented through character-level feature learning and context feature learning and then input into the encoder of the Transformer model, where further attention was paid to the key features in the multidimensional space.

4.5.2. Stemming Comparison

The comparative experimental results of stemming for Uyghur and Kazakh are shown in Table 5. In the comparative experiments, this paper evaluated the accuracy and error rates of the models, with the error rates further subdivided into under-segmentation and over-segmentation. The accuracy of FEMSeg-CRF reaches 91.51% and 90.70%, and the accuracy of FEMSeg reaches 91.08% and 90.35% on the test sets of Uyghur and Kazakh, respectively, which is higher than other methods, and the occurrence frequencies of the two error types are also close to the lowest frequency. Among the previously proposed models, the best-performing model is BiLSTM-CRF; the impact of whether to add an attention mechanism on model performance is not significant in the comparative experiments. In the Uyghur test set, the BiLSTM model has a lower over-segmentation error rate, whereas the BiGRU model has a lower under-segmentation error rate; in the Kazakh test set, FEMSeg has a lower over-segmentation error rate, whereas the BiLSTM-CRF model has a lower under-segmentation error rate.

The morphological segmentation and stemming experiments show that FEMSeg applies to both tasks, and the model performance is higher than other models.

4.5.3. Ablation Studies

To evaluate the impact of different feature extraction modules on morphological segmentation, we conducted a set of ablation studies in this section. The experimental results are presented in Table 6.

The experimental results in Table 6 indicate that the masked attention mechanism (MAM) used in MMSeg effectively captures relevant features of n-gram characters, thereby improving the model’s performance on Uyghur and Kazakh test datasets. When the CRF layer is used in the FEMSeg-CRF model, it adds constraints to the predicted labels to ensure their validity, whereas using softmax (FEMSeg) alone does not capture the transition relationships between labels. Additionally, when using a Transformer encoder, the model can capture as much inter-sequence correlation in different spaces as possible, thus mitigating the uncertainty caused by phonetic variations and enhancing the model’s ability to recognize morpheme boundaries.

4.6. Result Analysis

To further evaluate the ability of all models to predict morphology, this paper evaluates these models in terms of OOV recall rate, IV recall rate, morphological error rate, and average edit distance.

4.6.1. OOV and IV Analysis

Discovering new words is a core challenge in the field of NLP. Out-of-vocabulary (OOV) or out-of-set words refer to new words encountered in the validation or test set. The OOV recall rate in this paper can be understood as evaluating the model’s ability to identify new morphologies correctly. In-vocabulary (IV) words, also known as in-set words, usually refer to words encountered in the validation or test set that have appeared in the training set. The IV recall rate in this paper can be understood as evaluating the model’s ability to identify morphologies that appear in the dictionary correctly. In this paper, the morphologies that appear in the training set are used as the dictionary for evaluation. Table 7 shows the out-of-vocabulary rate of the validation and test sets, and Table 8 shows the situation of different models in identifying OOV and IV.
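The OOV and IV recall rates described above can be sketched as follows. The sketch assumes that a reference morpheme is recalled when the prediction contains a segment with the same characters at the same position, and that the dictionary is the set of morphemes seen in the training set. The helper names and the toy data are hypothetical.

```python
def positioned(segments):
    """Set of (start, end, text) triples for the segments of one word."""
    out, start = set(), 0
    for segment in segments:
        out.add((start, start + len(segment), segment))
        start += len(segment)
    return out


def oov_iv_recall(train_morphemes, gold_words, pred_words):
    """Recall of reference morphemes, split by membership in the dictionary."""
    dictionary = set(train_morphemes)
    hits = {"oov": 0, "iv": 0}
    totals = {"oov": 0, "iv": 0}
    for gold, pred in zip(gold_words, pred_words):
        predicted = positioned(pred)
        for item in positioned(gold):
            kind = "iv" if item[2] in dictionary else "oov"
            totals[kind] += 1
            hits[kind] += item in predicted  # True counts as 1
    return {k: (hits[k] / totals[k] if totals[k] else 0.0) for k in totals}


# Toy data: the two suffixes were seen in training, the two stems were not.
recall = oov_iv_recall(
    train_morphemes=["lar", "i"],
    gold_words=[["kitab", "lar"], ["oqu", "i"]],
    pred_words=[["kitab", "lar"], ["oq", "ui"]],
)
```

In the toy data, one of the two unseen stems and one of the two seen suffixes are recovered, so both recall rates are 0.5.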

Table 8 shows that, among the unsupervised models, MMSeg recalls out-of-vocabulary and in-vocabulary words better than the other methods. On the Uyghur test set, its OOV and IV recall rates reach 46.69% and 50.55%, respectively; on the Kazakh test set, they reach 8.52% and 48.11%, respectively. Although the OOV rate of the Kazakh dataset is lower than that of the Uyghur dataset, the ability of unsupervised models to recall OOV words in Kazakh is substantially lower than in Uyghur. Experimental analysis attributes this problem to two causes: (1) Because the unsupervised models introduce the n-gram correlation assumption, the morphemes they predict after training tend to be much shorter than N. The relatively long morphemes in the dataset are stems. In the training set, the average stem length is 6.3 in Uyghur and 5.9 in Kazakh; in the test set, it is 6.3 in Uyghur and 6.1 in Kazakh. The average stem length in Kazakh therefore differs between the training and test sets, with longer stems in the test set. As a result, the model’s segmentation granularity is too fine during testing, which breaks the integrity of the stems. (2) Morphological ambiguity is common in Kazakh, which also contributes to the low OOV recall rate.

In the models based on supervised learning, FEMSeg or FEMSeg-CRF has higher or nearly the best OOV and IV recall rates compared to other models. FEMSeg achieved OOV and IV recall rates of 87.67% and 93.93%, respectively, on the Uyghur test set, which is close to the highest recall rate (BiLSTM: 88.91%); on the Kazakh test set, the OOV and IV recall rates reached 81.51% and 93.57%, respectively. FEMSeg-CRF has achieved OOV and IV recall rates of 87.37% and 94.06%, respectively, on the Uyghur test set, and OOV and IV recall rates of 78.83% and 94.39%, respectively, on the Kazakh test set. It is worth noting that the FEMSeg has achieved higher out-of-vocabulary (OOV) and in-vocabulary (IV) recall rates compared to BiLSTM and BiGRU on both datasets, demonstrating stronger OOV recognition capabilities on the Kazakh dataset. Similarly, the FEMSeg-CRF model also achieved higher OOV recall rates on the Kazakh dataset.

The OOV recall rate is relatively low for all models. Even so, the two models proposed in this paper have a comparatively high ability to recall OOV words, and their IV recall is close to the highest value.

4.6.2. Average Edit Distance and Morphological Error Rate Analysis

The above analyses evaluated the model’s predictive ability from the morphological perspective. Next, this paper evaluates the model’s ability from the word perspective. The average edit distance measures, at the character level, the difference between the model’s segmentation results and the ground truth. This paper uses the Levenshtein distance, which counts the insertion, deletion, and substitution operations needed to turn one sequence into the other. Figure 5 shows the average edit distance of different models on the Uyghur and Kazakh test sets. The shorter the edit distance, the more similar the morphemes are. In unsupervised learning, MMSeg obtains a lower edit distance on Uyghur, 0.42 lower than the Morfessor model and 0.97 lower than the BPE model, but the reduction is insignificant on Kazakh. This phenomenon also appears in the evaluation of the morphological error rate. This paper posits that the reason is related to the n-gram assumption of the unsupervised model, which is analyzed with examples in the case study. In supervised learning, BiLSTM-ATT-CRF and FEMSeg-CRF achieve the smallest edit distance of 0.12 on Uyghur, and FEMSeg-CRF achieves the smallest edit distance of 0.14 on Kazakh.
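A minimal sketch of the average edit distance follows. The Levenshtein routine is the standard dynamic program. How each segmentation is serialized before comparison is not specified in the text, so joining morphemes with a separator character is an assumption made here for illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def average_edit_distance(gold_words, pred_words, sep=" "):
    """Mean Levenshtein distance between separator-joined segmentations."""
    distances = [
        levenshtein(sep.join(gold), sep.join(pred))
        for gold, pred in zip(gold_words, pred_words)
    ]
    return sum(distances) / len(distances)


# One word differs by a single extra boundary; the other is identical.
avg = average_edit_distance(
    [["abcd", "ef"], ["gh"]],
    [["abcd", "e", "f"], ["gh"]],
)
```

Under this serialization, each spurious or missing boundary costs one edit, so the first toy word contributes a distance of 1 and the second a distance of 0.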

The morphological error rate evaluates the model’s ability to correctly predict all morphemes that appear in words from the word perspective. Figure 6 shows the morphological error rates of different models on Uyghur and Kazakh validation and test sets.
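Under this word-level definition, the morphological error rate can be sketched as the share of words whose predicted segmentation is not identical to the reference. The function name and the toy data are hypothetical, and the exact counting used for Figure 6 may differ.

```python
def morphological_error_rate(gold_words, pred_words):
    """Fraction of words with at least one wrongly predicted morpheme."""
    wrong = sum(
        1 for gold, pred in zip(gold_words, pred_words)
        if list(gold) != list(pred)
    )
    return wrong / len(gold_words)


# One of the two toy words is segmented incorrectly.
error_rate = morphological_error_rate(
    [["abcd", "ef"], ["gh"]],
    [["abcd", "e", "f"], ["gh"]],
)
```

Unlike the morpheme-level F1-score, this measure gives no partial credit: a word with one wrong boundary counts the same as a word with every boundary wrong.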

Figure 6 shows that the morphological segmentation error rates of the unsupervised learning models and the HMM model are relatively high. Among the unsupervised learning models, MMSeg obtains a relatively low error rate on Uyghur but a high error rate on Kazakh. Combined with the experimental results in Table 4, this shows that on Uyghur, MMSeg has certain advantages from both the morphological and the word perspective. On Kazakh, however, although MMSeg predicts more correct morphemes from the morphological perspective, it does not outperform the Morfessor or BPE model in terms of word or morphological integrity. In supervised learning, regardless of whether a CRF layer is added to the model, FEMSeg has a lower error rate than other models in the same group. The error rate of FEMSeg-CRF is reduced by 1.53% and 0.90% on the Uyghur and Kazakh test sets, respectively. The error rate of FEMSeg is reduced by 0.53% and 2.10% on the Uyghur and Kazakh test sets compared to the BiLSTM model. The error rate of FEMSeg-CRF is reduced by 0.63% and 2.35% on the Uyghur and Kazakh test sets compared to the BiLSTM-ATT-CRF model. The error rate reduction achieved by FEMSeg-CRF over the other methods is larger on Kazakh than on Uyghur. Table 4 shows that the other methods perform worse on Kazakh than on Uyghur, whereas the performance of FEMSeg-CRF differs little between the two languages. From the perspective of morphological integrity, FEMSeg-CRF segments the morphemes in words more accurately, which explains why the error rate reduction on Kazakh is more significant than that on Uyghur. In addition, these two sets of comparisons indicate that adding a CRF layer improves the correct segmentation of morphemes by constraining the labels.

4.6.3. Case Study

To analyze the problems existing in different models further, this paper will provide cases to study the prediction results of the models. Figure 7 and Figure 8 compare the ground truth and the predicted results. Combined with Table 4, several comparison models with higher experimental results are selected. We use bold font to mark the same morpheme as the ground truth.

The experimental results in Figure 7 show that unsupervised learning models often segment morphemes more finely than the ground truth. This problem is more obvious in MMSeg. For example, in the Uyghur word “ئۆمىكىنى”, three morphemes can be obtained after correct segmentation. Morfessor obtains two morphemes (one of which is correctly matched), and MMSeg obtains three morphemes (three of which are correctly matched). In the Kazakh word “[inline image Applsci 14 05369 i001]”, three morphemes can be obtained after correct segmentation, and Morfessor also obtains three morphemes (one of which is correctly matched). In comparison, MMSeg obtains six morphemes (one of which is correctly matched). This situation also occurs in the word “[inline image Applsci 14 05369 i002]”. This is because the n-gram morpheme boundary assumption is introduced when training the model, so every segment the model produces lies within the assumed boundary length. Compared with Uyghur, single morphemes in Kazakh are longer, so although MMSeg improves on Kazakh, the improvement is not very obvious. This also explains why the model performs poorly on Kazakh in several of the analyses above.

The experimental results in Figure 8 show that the main errors of the supervised models appear in the stem. For example, in the word “[inline image Applsci 14 05369 i003]”, only FEMSeg-CRF segments the stem correctly; the results of the other models are problematic. It is worth noting that, because the dataset is manually annotated, it contains some inconsistent or inaccurate annotations, which affect the model’s performance. For instance, Figure 8 shows that although the ground-truth segmentation of the word “سېلىشقا” is incorrect, the predictions of all models except BiLSTM-CRF are correct. Furthermore, when encountering phonetic changes, both the FEMSeg and FEMSeg-CRF models segment morphemes more accurately than the other models. For example, the word “بارسىڭىز” is divided as “بار سى ڭىز” after morphological segmentation, and it becomes “بار سا ڭىز” after morphological restoration.

In summary, the methods proposed in this paper demonstrate superior performance in Uyghur and Kazakh stemming and morphological segmentation across various aspects. Both the supervised and unsupervised models show strong feature-learning capabilities. The FEMSeg and FEMSeg-CRF models do not rely on manual feature extraction; instead, they leverage CNN, BiLSTM, and Transformer encoders to learn different features and the correlations between them. The MMSeg model learns the correlations between n-grams and determines morphological boundaries through a unidirectional LSTM model and the masked attention mechanism. Both models are easy to implement and highly portable, providing valuable references for stemming and morphological segmentation in other agglutinative languages. Additionally, the reliability and stability of the models were demonstrated through 10-fold cross-validation.

5. Conclusions

This paper redefines the model evaluation metrics to effectively evaluate the performance of the Uyghur and Kazakh morphological segmentation and stemming models proposed in recent years. The morphological segmentation metrics shift the evaluation focus from the character to the morpheme. We propose two benchmark models for morphological segmentation and stemming for Uyghur and Kazakh: MMSeg and FEMSeg (FEMSeg-CRF). The stability of the models was demonstrated through 10-fold cross-validation. In unsupervised morphological segmentation, MMSeg achieves F1-scores of 47.55% and 35.22% on Uyghur and Kazakh, respectively. In supervised morphological segmentation, FEMSeg-CRF achieves F1-scores of 92.69% and 92.84% on Uyghur and Kazakh, respectively. In stemming, FEMSeg-CRF achieves accuracy rates of 91.51% and 90.70% on Uyghur and Kazakh, respectively. These F1-scores and accuracy rates are higher than those of the previously proposed models. Finally, we analyzed the advantages and disadvantages of the different models in the two tasks from various perspectives. Although this work improves on character-level evaluation metrics, there is still room for improvement in morpheme-level and word-level metrics. In future research, we will continue to explore high-performance morphological segmentation models for low-resource languages by combining large language models and prompt learning methods.

Author Contributions

Conceptualization, G.A.; methodology, G.A.; software, G.A. and S.R.; validation, G.A. and S.R.; formal analysis, G.A.; investigation, G.A. and B.W.; writing—original draft preparation, G.A.; writing—review and editing, K.A.; visualization, G.A.; supervision, A.W.; project administration, A.W.; funding acquisition, A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China [No. 62166044] and Natural Science Foundation of Xinjiang Province [No. 2021D01C079].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this paper can be accessed at http://thuuymorph.thunlp.org/ (accessed on 20 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sorokin, A. Convolutional neural networks for low-resource morpheme segmentation: Baseline or state-of-the-art? In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, Florence, Italy, 2 August 2019; pp. 154–159. [Google Scholar] [CrossRef]
Li, Z.; Li, X.; Sheng, J.; Slamu, W. AgglutiFiT: Efficient Low-Resource Agglutinative Language Model Fine-Tuning. IEEE Access 2020, 8, 148489–148499. [Google Scholar] [CrossRef]
Liu, R.; Hu, Y.; Zuo, H.; Luo, Z.; Wang, L.; Gao, G. Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1075–1087. [Google Scholar] [CrossRef]
Parhat, S.; Sattar, M.; Hamdulla, A.; Kadir, A. Uyghur–Kazakh–Kirghiz Text Keyword Extraction Based on Morpheme Segmentation. Information 2023, 14, 283. [Google Scholar] [CrossRef]
Pan, Y.; Li, X.; Yang, Y.; Dong, R. Multi-Task Neural Model for Agglutinative Language Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Online, 5–10 July 2020; pp. 103–110. [Google Scholar] [CrossRef]
Abudouwaili, G.; Abiderexiti, K.; Shen, Y.; Wumaier, A. Research on the Uyghur morphological segmentation model with an attention mechanism. Connect. Sci. 2022, 34, 2577–2596. [Google Scholar] [CrossRef]
Abudukelimu, A.; Liu, C.; Abudukelimu, H.; Guo, W. Research on Uyghur Morphological Segmentation Based on Character Feature. Comput. Simul. 2022, 39, 257–262. [Google Scholar]
Bareket, D.; Tsarfaty, R. Neural Modeling for Named Entities and Morphology (NEMO2). Trans. Assoc. Comput. Linguist. 2021, 9, 909–928. [Google Scholar] [CrossRef]
Haisa, G.; Altenbek, G. Multi-Task Learning Model for Kazakh Query Understanding. Sensors 2022, 22, 9810. [Google Scholar] [CrossRef]
Abudubiyaz, A.; Ablimit, M.; Hamdulla, A. The Acoustical and Language Modeling Issues on Uyghur Speech Recognition. In Proceedings of the 2020 13th International Conference on Intelligent Computation Technology and Automation (ICICTA), Xi’an, China, 24–25 October 2020; pp. 366–369. [Google Scholar] [CrossRef]
Song, H.; Dabre, R.; Chu, C.; Kurohashi, S.; Sumita, E. SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–24. [Google Scholar] [CrossRef]
Chen, W.; Fazio, B. Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), Virtual, 16 August 2021; pp. 20–31. [Google Scholar]
Dawel, A.; Altenbek, G. Study and implementation of Kazakh lexical scanner. Comput. Eng. Appl. 2008, 44, 146–149. [Google Scholar]
Altenbek, G.; Wang, X.l. Kazakh Segmentation System of Inflectional Affixes. In Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China, 28–29 August 2010. [Google Scholar]
Ruokolainen, T.; Kohonen, O.; Virpioja, S.; Kurimo, M. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, 8–9 August 2013; pp. 29–37. [Google Scholar]
Cotterell, R.; Müller, T.; Fraser, A.; Schütze, H. Labeled Morphological Segmentation with Semi-Markov Models. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, Beijing, China, 30–31 July 2015; pp. 164–174.
Lin, C.; Lin, Y.J.; Yeh, C.J.; Li, Y.T.; Yang, C.; Kao, H.Y. Improving Multi-Criteria Chinese Word Segmentation through Learning Sentence Representation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 12756–12763.
Li, D.; Zhao, R.; Tan, F. CWSeg: An Efficient and General Approach to Chinese Word Segmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 5: Industry Track, pp. 1–10.
Turghun, O.; Yang, Y.; Eziz, T.; Li, C. Collaborative Analysis of Uyghur Morphology Based on Character Level. Acta Sci. Nat. Univ. Pekin. 2019, 1, 47–54.
Wu, H.; Altenbek, G. Improved Joint Kazakh POS Tagging and Chunking. In Proceedings of the Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, Yantai City, China, 15–16 October 2016; pp. 114–124.
Xu, C.; Jiang, T.; Yu, K.; Jiang, W. Model Construction of Uygur and Korean Morphological Analysis. J. Beijing Univ. Posts Telecommun. 2018, 41, 88–94.
Grönroos, S.A.; Virpioja, S.; Kurimo, M. Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 3944–3953.
Shao, Y. Cross-lingual word segmentation and morpheme segmentation as sequence labelling. arXiv 2017, arXiv:1709.03756.
Qiu, X.; Pei, H.; Yan, H.; Huang, X. A Concise Model for Multi-Criteria Chinese Word Segmentation with Transformer Encoder. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 2887–2897.
Huang, K.; Huang, D.; Liu, Z.; Mo, F. A Joint Multiple Criteria Model in Transfer Learning for Cross-domain Chinese Word Segmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3873–3882.
Downey, C.; Xia, F.; Levow, G.A.; Steinert-Threlkeld, S. A Masked Segmental Language Model for Unsupervised Natural Language Segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 14 July 2022; pp. 39–50.
Pan, C.; Sun, M.; Deng, K. TopWORDS-Seg: Simultaneous Text Segmentation and Word Discovery for Open-Domain Chinese Texts via Bayesian Inference. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 158–169.
Yan, R.; Zhang, H.; Silamu, W.; Hamdulla, A. Unsupervised word Segmentation Based on Word Influence. In Proceedings of the ICASSP—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
Smit, P.; Virpioja, S.; Grönroos, S.A.; Kurimo, M. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, 26–30 April 2014; pp. 21–24.
Rouhe, A.; Grönroos, S.A.; Virpioja, S.; Creutz, M.; Kurimo, M. Morfessor-enriched features and multilingual training for canonical morphological segmentation. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 14 July 2022; pp. 144–151.
Wumaier, A.; Yibulayin, T.; Kadeer, Z.; Tian, S. Conditional random fields combined FSM stemming method for Uyghur. In Proceedings of the 2009 2nd IEEE International Conference on Computer Science and Information Technology, Beijing, China, 8–11 August 2009; pp. 295–299.
Sediyegvl, E.; Xiang, L.; Zong, C.; Akbar, P.; Askar, H. A Multi-Strategy Approach to Uyghur Stemming. J. Chin. Inf. Process. 2015, 29, 204–210.
Kuwatebaike, M. Research on Kazakh Stemming Based on Machine Learning. Comput. Technol. Dev. 2020, 30, 7.
Abuduwaili, G.; Maimaiti, M.; Yibulayin, T.; Kadeer, Z.; Hairula, X.; Lulu, W. Method of Uyghur stemming based on character sequence labeling. Mod. Electron. Tech. 2020, 43, 151–154+160.
Abudukelimu, H.; Cheng, Y.; Liu, Y.; Sun, M. Uyghur morphological segmentation with bidirectional GRU neural networks. J. Tsinghua Univ. Sci. Technol. 2017, 57, 1–6.
Imin, G.; Ablimit, M.; Yilahun, H.; Hamdulla, A. A character string-based stemming for morphologically derivative languages. Information 2022, 13, 170.
Yang, Y.; Li, S.; Zhang, Y.; Zhang, H.P. Point the Point: Uyghur Morphological Segmentation Using PointerNetwork with GRU. In Proceedings of the Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, 18–20 October 2019; pp. 371–381.
Gvzelnur, I.; Mijit, A.; Hankiz, Y.; Askar, H. Phoneme Sequence Based Stemming of Agglutinative Language. J. Chin. Comput. Syst. 2023, 44, 2362–2368.
Zhang, Y.; Li, W.; Abudukelimu, H.; Abulizi, A. Meta-Learning Method of Uyghur Morphological Segmentation. Comput. Eng. Appl. 2023, 59, 98–104.
Tolegen, G.; Toleu, A.; Mussabayev, R. Voted-perceptron approach for Kazakh morphological disambiguation. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), Marseille, France, 11–12 May 2020; pp. 258–264.
Toleu, A.; Tolegen, G.; Mussabayev, R. Language-Independent Approach for Morphological Disambiguation. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 5288–5297.
Toleu, A.; Tolegen, G.; Makazhanov, A. Character-Aware Neural Morphological Disambiguation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 2: Short Papers, pp. 666–671.
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
Singh, J.; Gupta, V. Text stemming: Approaches, applications, and challenges. ACM Comput. Surv. (CSUR) 2016, 49, 1–46.
Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1: Long Papers, pp. 1715–1725.

Figure 1. Examples of segmentation and stemming.

Figure 2. The architecture of FEMSeg or FEMSeg-CRF for stemming and morphological segmentation.

Figure 3. The architecture of MMSeg for stemming and morphological segmentation.

Figure 4. Comparison of character-level and morphological-level evaluations.

Figure 5. Average edit distance (Levenshtein distance) of different models on the test set.

Figure 6. Morphological error rates of different models on the validation and test sets.

Figure 7. Case study of unsupervised models.

Figure 8. Case study of supervised models.

Table 1. The effect of a model prediction error on segmentation. The differences between the ground truth labels and the model-predicted labels are marked in blue.

Word	سانائەتنىڭ (industrial)
Ground Truth	سانائەت - نىڭ
Ground Truth Label	BMMMMEBME
Model Predicted Label	BMMMMEBEE
Merging	سانائەت - نى - ڭ
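The merging step in Table 1 can be illustrated with a minimal sketch that turns a per-character label sequence into morphemes. Several things here are assumptions rather than the authors' implementation: the function name `labels_to_morphemes` is hypothetical, the `S` label for single-character morphemes is assumed (Table 1 only shows B, M, and E), and the word is split into nine units (treating ئە as one unit) so that the unit count matches the nine labels in the table. The merge rule used is simply "close a morpheme after every E or S, open a new one at every B or S".

```python
from typing import List, Sequence


def labels_to_morphemes(units: Sequence[str], labels: str) -> List[str]:
    """Merge per-unit B/M/E/S labels into a list of morphemes (illustrative only)."""
    if len(units) != len(labels):
        raise ValueError("need exactly one label per unit")
    morphemes: List[str] = []
    current: List[str] = []
    for unit, label in zip(units, labels):
        if label in "BS" and current:
            # A new morpheme starts: flush whatever was being built.
            morphemes.append("".join(current))
            current = []
        current.append(unit)
        if label in "ES":
            # This unit closes the current morpheme.
            morphemes.append("".join(current))
            current = []
    if current:
        # The sequence ended without a closing E or S label.
        morphemes.append("".join(current))
    return morphemes


# Units of the word from Table 1; the split into nine units is an assumption.
units = ["س", "ا", "ن", "ا", "ئە", "ت", "ن", "ى", "ڭ"]

gold = labels_to_morphemes(units, "BMMMMEBME")  # ground truth labels
pred = labels_to_morphemes(units, "BMMMMEBEE")  # model-predicted labels

print(" - ".join(gold))
print(" - ".join(pred))
```

Under these assumptions, a single wrong label (M predicted as E) splits the suffix into two pieces, so the predicted segmentation has three morphemes instead of two. At the character level only one of nine labels is wrong, while at the morphological level the suffix is entirely incorrect.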

Table 2. Statistics of the two datasets.

Data	Uyghur		Kazakh
Data	Words	Morphemes	Words	Morphemes
Train	16,892	7839	16,000	5686
Test	2112	1807	2000	1699
Validation	2112	1800	2000	1665

Table 3. Hyperparameters of models.

Model	Hyperparameters	Values
FEMSeg/FEMSeg-CRF	Char Embedding	128
	CNN kernel-size	[1, 3, 5]
	CNN dimension	128
	LSTM dimension (bidirectional)	128
	Transformer dimension	512
	Head number	8
MMSeg	Char Embedding	256
	LSTM dimension	256
	Head number	4
	K	8

Table 4. Comparative experimental results of morphological segmentation (%). Bold represents the best result in the same group of experiments.

Lang.	Uyghur						Kazakh
	Validation			Test			Validation			Test
	R	P	F1	R	P	F1	R	P	F1	R	P	F1
Morfessor	34.68	31.19	32.84	35.17	30.73	32.80	29.25	26.35	27.72	28.80	25.66	27.14
BPE	6.28	5.76	6.01	19.01	16.92	17.91	26.44	25.04	25.72	23.92	23.35	23.63
MMSeg	49.46	44.88	47.06	49.99	45.34	47.55	43.39	27.91	33.97	44.71	29.05	35.22
10-fold	48.84	44.84	46.75	48.73	45.07	46.75	46.12	35.53	40.09	46.20	35.69	40.22
BiLSTM	90.53	90.26	90.40	91.58	91.45	91.51	91.17	89.54	90.35	91.57	90.37	90.97
BiGRU	91.02	88.57	89.78	91.85	89.13	90.47	90.62	90.16	90.39	90.23	89.61	89.91
FEMSeg	91.85	89.88	90.85	93.02	91.09	92.04	92.08	91.96	92.02	92.53	92.14	92.34
10-fold	92.11	90.52	91.32	92.26	90.86	91.55	91.19	90.26	90.72	91.19	91.85	91.82
HMM	47.60	36.87	41.55	47.67	36.65	41.44	45.84	36.76	40.80	47.07	37.75	41.90
CRF	82.91	82.28	82.60	88.76	88.16	88.46	81.40	79.70	80.54	81.12	79.38	80.24
BiLSTM-CRF	90.80	91.00	90.90	91.62	92.22	91.92	91.61	89.55	90.57	92.32	90.28	91.29
BiLSTM-ATT-CRF	91.70	91.14	91.42	91.76	91.06	91.41	91.36	90.55	90.95	91.44	90.74	91.09
FEMSeg-CRF	92.62	91.85	92.24	93.08	92.29	92.69	92.58	92.47	92.52	93.05	92.63	92.84
10-fold	92.74	91.22	91.97	92.82	91.27	92.04	92.65	92.08	92.36	92.48	92.65	93.56

Table 5. Experimental results of stemming (%). Bold represents the best result in the same group of experiments.

Lang.	Uyghur						Kazakh
	Validation			Test			Validation			Test
	ACC	Over Stem.	Under Stem.	ACC	Over Stem.	Under Stem.	ACC	Over Stem.	Under Stem.	ACC	Over Stem.	Under Stem.
BiLSTM	90.17	4.77	5.06	89.15	5.45	5.40	87.65	7.65	4.70	88.80	6.75	4.45
BiGRU	88.68	8.12	3.20	89.46	7.78	2.77	86.85	7.40	5.75	86.80	7.70	5.50
FEMSeg	90.21	6.40	3.39	91.08	5.87	3.05	89.75	5.00	5.25	90.35	4.80	4.85
HMM	29.67	57.19	13.14	31.20	55.15	13.65	26.25	62.05	11.70	27.10	62.80	10.10
CRF	79.17	11.28	9.56	86.40	7.25	6.35	74.10	15.50	10.40	74.05	15.75	10.20
BiLSTM-CRF	89.82	4.59	5.59	90.36	3.82	5.82	87.75	8.30	3.95	89.00	7.90	3.10
BiLSTM-ATT-CRF	90.35	5.26	4.40	90.08	5.49	4.44	87.95	7.35	4.70	88.55	7.00	4.45
FEMSeg-CRF	90.92	5.64	3.44	91.51	5.20	3.29	90.15	4.95	4.90	90.70	5.00	4.30

Table 6. Experimental results of ablation studies (%). Bold represents the best result in the same group of experiments.

Lang.	Uyghur	Kazakh
MMSeg	47.55	35.22
w/o MAM	42.53	33.67
FEMSeg-CRF	92.69	92.84
w/o CRF(FEMSeg)	92.04	92.34
w/o Transformer-encoder	90.89	90.56

Table 7. Out-of-vocabulary (OOV) rate in datasets (%).

Lang.	Uyghur	Kazakh
Validation	13.44	7.27
Test	14.62	8.60

Table 8. Recall rates of out-of-vocabulary (OOV) and in-vocabulary (IV) words in validation and test sets. Bold represents the best result in the same group of experiments.

Lang.	Uyghur				Kazakh
Models	Validation		Test		Validation		Test
Models	OOV	IV	OOV	IV	OOV	IV	OOV	IV
Morfessor	5.88	39.10	4.78	40.37	2.33	31.36	1.95	31.33
BPE	1.80	19.16	1.70	21.98	2.04	28.35	3.65	25.46
MMSeg	45.75	50.04	46.69	50.55	6.98	46.24	8.52	48.11
BiLSTM	83.82	91.58	88.91	92.03	64.54	93.25	70.56	93.54
BiGRU	80.88	92.59	85.52	92.93	63.15	92.46	67.40	92.38
FEMSeg	82.35	93.33	87.67	93.93	81.98	92.87	81.51	93.57
HMM	29.41	50.42	31.43	50.45	17.73	48.04	23.84	49.26
CRF	72.71	83.48	87.06	89.05	58.43	83.20	62.53	82.87
BiLSTM-CRF	84.15	91.83	87.83	92.27	68.41	93.67	69.83	94.44
BiLSTM-ATT-CRF	85.95	92.59	87.67	92.45	68.61	93.14	70.07	93.45
FEMSeg-CRF	83.66	94.01	87.37	94.06	78.20	93.71	78.83	94.39

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Abudouwaili, G.; Ruzmamat, S.; Abiderexiti, K.; Wu, B.; Wumaier, A. A Benchmark for Morphological Segmentation in Uyghur and Kazakh. Appl. Sci. 2024, 14, 5369. https://doi.org/10.3390/app14135369
