1. Introduction
Word segmentation is usually a preprocessing step for downstream natural language processing tasks, such as text classification, information retrieval, and machine translation. Scripts such as Chinese and Japanese do not mark word boundaries with spaces, making word segmentation necessary. The problem is even more pronounced in southeast Asian Abugida languages such as Thai, Burmese, and Khmer, whose writing systems likewise lack clear word boundaries. However, Thai, Burmese, and Khmer scripts differ from Chinese and Japanese in that a single character represents only a consonant or a vowel, whereas a Chinese character represents a whole syllable. Hence, extending Chinese word segmentation methods [1] to the Thai, Burmese, and Khmer word segmentation problems is not straightforward.
Complex orthographic structures present significant challenges for word segmentation in southeast Asian Abugida scripts. These scripts encode consonant and vowel information within single syllables, frequently incorporating diacritics, stacked characters, and reordering rules. This orthographic complexity requires models to process both character and sub-word features effectively. Another challenge is the ambiguity of word boundaries. In languages like Thai and Khmer, the absence of explicit markers between words often produces character sequences that admit multiple segmentations. For example, กอดอกไม้ can be segmented as กอ|ดอก|ไม้ (flower bush) or กอด|อก|ไม้ (hugging a wooden chest), although the latter reading is rather nonsensical and statistically unlikely. Such ambiguity requires heavy reliance on contextual information, substantially increasing the computational demands on word segmentation models.
To address the challenge of efficient and accurate word segmentation, several neural-network-based methods have been developed, such as Sertis [2], KhPOS [3], DeepCut [4], AttaCut [5], THDICTSDR [6], and BERT+BiLSTMCRF+Fine-tune [7]. In these models, word segmentation is treated as a character-based sequence tagging task, employing various sequence encoders.
Sertis and KhPOS utilize bidirectional recurrent neural networks (RNNs) to capture contextual character information. In contrast, DeepCut and AttaCut replace RNNs with convolutional neural networks (CNNs) as sequence encoders. DeepCut uses multiple convolutional filters with variable kernel widths to extract segmentation features from characters and character types, but its segmentation speed is slow. AttaCut instead leverages syllable embeddings as supplementary features and improves processing speed through three dilated CNN filters; this speedup comes at the expense of accuracy and requires a preprocessing step for syllable segmentation. THDICTSDR improves dictionary-based methods by applying Sparse Distributed Representations (SDRs) to learn contextual information and integrating n-grams to select the correct word, achieving accuracy close to or better than DeepCut in some cases. However, THDICTSDR still suffers from relatively high processing times, since it retrieves features from an external dictionary. The BERT+BiLSTMCRF+Fine-tune model leverages character, syllable, and word embeddings to capture rich syntactic and semantic features for word segmentation. However, because it relies on the BERT language model and a BiLSTM encoder, it is the largest of these models, which makes its prediction speed considerably slower.
To develop an efficient and accurate word segmentation model for southeast Asian Abugida languages, it is essential to learn rich and effective segmentation features from sequences of characters and sub-words while keeping contextual modeling time-efficient. As DeepCut [4] and AttaCut [5] show, the windowed context modeling strategy is more efficient than RNN-based models. Given the word formation characteristics of southeast Asian Abugida languages, an excessively long context does not improve segmentation performance and only adds unnecessary model complexity. Our approach follows this modeling strategy, learning segmentation features for the target character from fixed-length windows of multiple n-gram sequences. The n-gram inputs require no language-specific preprocessing, such as syllable segmentation or character type recognition, which makes the approach more versatile. Our method introduces an attention mechanism that learns segmentation features from windowed contexts, together with a multi-head attention design that models these features from multiple perspectives. Unlike the convolutional networks used in DeepCut and AttaCut, the multi-head attention mechanism enables the model to learn more nuanced and effective segmentation features within the same computational framework. The multi-head attention features enhance the model's generalization ability and improve the recall of out-of-vocabulary (OOV) words. Moreover, compared to CNN-based models, the proposed approach supports greater parallelism, leading to higher segmentation efficiency.
Experiments demonstrate that UnifiedCut surpasses state-of-the-art systems in in-domain word segmentation accuracy and provides comparable or superior results in cross-domain scenarios. Additionally, UnifiedCut proves versatile and effective for syllable segmentation tasks, offering a robust and efficient solution across applications.
Our contributions can be summarized as follows:
We propose a simple and efficient unified neural model for Thai, Burmese, and Khmer word segmentation, which learns word segmentation features from windowed multiple character n-gram sequences via an attention mechanism.
The proposed multi-head attention encoder has far fewer learnable parameters than the CNN-based models and supports greater parallelism. The proposed UnifiedCut runs about 4× faster than the state-of-the-art Thai word segmentation model, DeepCut.
Experiments on public Thai, Burmese, and Khmer word segmentation datasets show that our model consistently outperforms state-of-the-art word segmentation models and achieves higher recall on out-of-vocabulary words.
2. Related Works
2.1. Word Segmentation
At present, the main research methods for word segmentation in Thai [5,8,9], Burmese [7,10,11], and Khmer [12,13,14] are divided into rule-based, traditional machine learning, and neural network methods. Common traditional machine learning methods include hidden Markov models (HMMs), maximum entropy models (MEMs), and conditional random fields (CRFs). In recent years, deep learning has made great breakthroughs in natural language processing through architectures such as RNNs, CNNs, BiLSTMs, and pre-training-based methods.
Dictionary-based or rule-based word segmentation methods are efficient [15,16,17], but they depend heavily on the quality of the dictionaries, and their ability to recognize new words is limited. Limcharoen et al. propose a statistics-based method [18] that uses word n-grams and the Generalized Left-to-right Reduce (GLR) parsing algorithm. In this technique, the input Thai text is first segmented into Thai Character Clusters (TCCs), units that are inseparable under Thai writing rules, which helps reduce the chance of segmenting words incorrectly. Although the method can automatically resolve some ambiguities, identify out-of-vocabulary words, and outperform dictionary-based methods, it has a large parameter space, leading to the curse of dimensionality. Mon et al. [19] proposed a combination of rule-based heuristic and statistical methods for Burmese word segmentation. This method segments Burmese words by manually summarizing the combination characteristics of Burmese syllables, but the segmentation quality depends heavily on the manually defined rules.
Due to the drawbacks of dictionary-based and statistics-based methods, machine learning (ML) methods, especially neural network models, have been adopted for word segmentation. Bi et al. [13] proposed a bidirectional maximum matching approach to maximize segmentation accuracy. The model performs maximum matching twice, forward and backward; however, it cannot handle out-of-vocabulary (OOV) words or take context into account. Phyu et al. [10] proposed a Burmese word segmentation method based on CRFs and feature clustering. However, CRFs require a customized language feature model, which complicates the process, and the error rate can be high if the feature model is defined inaccurately.
At present, the mainstream approach is to use neural network methods, which automatically learn feature representations for the task and clearly outperform statistical machine learning. Two commonly adopted network structures are the RNN and the CNN. Jussi et al. [2] apply a bidirectional RNN to Thai word segmentation, which does not depend on hand-crafted features. Compared with CNNs, RNNs need more training time because of their recurrent nature. Subsequently, Kittinaradorn et al. [4] proposed a CNN-based Thai word segmentation model named DeepCut, which uses characters and character category features as input and achieves high performance. Chormai et al. [5] proposed the CNN-based Thai word segmentation models AttaCut-C and AttaCut-SC. Both models employ dilated convolutional layers [20,21], which allow them to cover sufficient context with fewer, non-redundant convolution layers. Their architecture has fewer computational dependencies and hence a higher degree of parallelism than DeepCut. In addition, AttaCut-SC uses syllable embeddings as additional features, which provide higher-level information than character information alone, but this requires an additional preprocessing step. In contrast, the model in this paper uses only character-level n-gram embeddings as the representation, avoiding the error propagation of syllable segmentation. Buoy et al. [3] proposed a joint word segmentation and POS tagging model for Khmer using a deep learning approach. The model uses a bidirectional LSTM network that takes character-level inputs and outputs a sequence of POS tags. Although the gating mechanism of the BiLSTM can model order dependence between sequence elements, it struggles with long-distance dependencies and is difficult to parallelize. However, modeling long-distance context appears to contribute little to word segmentation in these languages. Our model therefore learns contextual features through a local attention mechanism, which fundamentally avoids the long-distance dependence problem; current pre-trained language models also learn contextual features via this attention mechanism. Mao et al. [7] proposed a joint learning neural network model based on BERT pre-training. However, BERT pre-training significantly increases the number of model parameters and reduces prediction speed. Therefore, this paper uses only a one-layer transformer as the encoder, which has a small number of parameters and runs fast.
2.2. Syllable Segmentation
The syllable segmentation task is defined similarly, and the performance of syllable segmentation has a direct impact on downstream natural language processing tasks. At present, many attempts have been made to deal with syllable segmentation in Thai, while there have been fewer for Burmese and Khmer. Drawing on the characteristics of syllables, Poowarawan [22] proposed the first dictionary-based method for Thai syllable segmentation using a greedy algorithm, but the method is hard to generalize to other domains. Aroonmanakun [23] uses more than 200 rules to segment Thai syllables and achieves good results. Thet et al. [24] proposed a rule-based method for Burmese syllable segmentation. However, such syllable segmentation rules are complicated, and conflicts may arise between rules when their number is large. At present, the mainstream approach to Thai syllable segmentation is based on statistical learning. Chormai et al. [5] developed the first ML-based Thai orthographical syllable segmenter that is robust to typos. Overall, machine learning methods achieve higher performance than dictionary-based and statistics-based methods and show great potential for syllable segmentation.
3. Methodology
In our proposed model, we frame the word segmentation task as a character-level sequence labeling problem. Each character is classified based only on features learned from a fixed-length context. During both training and prediction, the model processes multiple n-gram sequences derived from this fixed context. UnifiedCut, our proposed model, learns the classification features for each target character from these sequences and performs classification independently for each character. This independent classification allows for parallel processing, which improves segmentation efficiency.
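For illustration, the following minimal Python sketch (our own example, not the authors' code) shows how a labeled sentence can be converted into one fixed-length window per character; the `make_windows` helper and the padding symbol are hypothetical names.

```python
# A minimal sketch of the windowed, character-level labeling formulation.
PAD = "<pad>"

def make_windows(chars, labels, window=21):
    """Yield (window_of_chars, label) pairs, one per character.

    chars  : list of characters in the sentence
    labels : list of 'B'/'I' tags, one per character
    window : odd window length centred on the target character
    """
    half = window // 2
    padded = [PAD] * half + list(chars) + [PAD] * half
    for i, label in enumerate(labels):
        yield padded[i:i + window], label

# Toy example: a 4-character "sentence" made of two 2-character words.
chars = list("abcd")
labels = ["B", "I", "B", "I"]
for ctx, y in make_windows(chars, labels, window=5):
    print(ctx, y)
```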
3.1. UnifiedCut Model
We propose learning classification features through a multi-head attention transformer encoder.
Figure 1 provides an overview of the UnifiedCut architecture, which includes embedding layers, multi-head attention layers, a fully connected layer, and a word label classification layer.
3.1.1. Multiple n-Gram Embedding Layer
We use multiple n-gram sequences as model input to perform word segmentation. The embedding layer encodes each character n-gram into a dense vector. For example, given the $i$-th n-gram token $g_i^{k}$, we embed it as a dense vector $\mathbf{e}_i^{k}$ using the corresponding lookup table $E^{k}$,
$$\mathbf{e}_i^{k} = E^{k}\big(g_i^{k}\big),$$
where $k$ is the length of the n-gram. We concatenate the n-gram embeddings at all positions to learn richer features from multiple n-grams. We denote the concatenated vector for the $i$-th character in the context as
$$\mathbf{x}_i = \big[\mathbf{e}_i^{1}; \mathbf{e}_i^{2}; \dots; \mathbf{e}_i^{K}\big] \in \mathbb{R}^{d},$$
where $1 \le k \le K$, $K$ is the maximum n-gram length, and $d$ is the size of the concatenated vector.
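As a concrete illustration of this layer, the PyTorch sketch below (an assumption of ours, not the released implementation) keeps one embedding table per n-gram order and concatenates the looked-up vectors at every window position; the vocabulary sizes and embedding dimension are placeholders.

```python
import torch
import torch.nn as nn

class MultiNGramEmbedding(nn.Module):
    """One embedding table per n-gram order; outputs are concatenated per position."""
    def __init__(self, vocab_sizes, dim=32):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(v, dim) for v in vocab_sizes)

    def forward(self, ngram_ids):
        # ngram_ids: list of K tensors, each of shape (batch, window)
        embedded = [table(ids) for table, ids in zip(self.tables, ngram_ids)]
        return torch.cat(embedded, dim=-1)   # (batch, window, K * dim)

# Toy usage: a window of 21 positions with 1-gram, 2-gram, and 3-gram ids.
layer = MultiNGramEmbedding(vocab_sizes=[100, 500, 2000], dim=32)
ids = [torch.randint(0, v, (2, 21)) for v in (100, 500, 2000)]
print(layer(ids).shape)   # torch.Size([2, 21, 96])
```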
3.1.2. Transformer Encoder
We use a transformer encoder [25] to learn contextual features from an input sliding window of n-gram embeddings $X = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ of length $n$. The sequence of input n-gram embeddings is fed into the transformer encoder. The transformer layer consists of two sub-layers, a multi-head attention layer and a position-wise feed-forward layer. The encoder employs residual connections around each of the sub-layers, followed by layer normalization [26].
Before applying self-attention to the input sequence, the model first transforms the input $X$ into query $Q$, key $K$, and value $V$ sequences via separate linear transformations. Then, we calculate the self-attention outputs by scaled dot-product attention,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
Multi-head attention enables the model to simultaneously focus on information from different representation subspaces at various positions. This allows our model to effectively learn multi-view word segmentation features from the surrounding context. The output of the multi-head attention layer is a projection formed by concatenating the outputs of all attention heads,
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\big(XW_i^{Q}, XW_i^{K}, XW_i^{V}\big),$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection parameters of head $i$, and $W^{O}$ denotes the output projection matrix. For each parallel attention head, we use a head dimension of $d_k = d_{\mathrm{model}}/h$.
After the multi-head attention layer, layer normalization is applied to the sum of the sub-layer output and the residual connection, i.e., $\mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X)\big)$.
In addition to the attention sub-layer, the encoder block contains a fully connected feed-forward network applied to each position separately and identically. The feed-forward network consists of two linear transformations with a ReLU activation in between,
$$\mathrm{FFN}(z) = \max(0,\, zW_1 + b_1)\,W_2 + b_2,$$
where $W_1$, $b_1$, $W_2$, and $b_2$ are the projection parameters, and $z$ denotes the output of the previous layer normalization. Another layer normalization is applied after the feed-forward network, yielding the outputs $H = (\mathbf{h}_1, \dots, \mathbf{h}_n)$ of the encoder block.
We derive the contextual prediction feature from the multiple input n-grams by flattening the transformer encoder output into a single vector,
$$\mathbf{c} = \mathrm{Flatten}(H) = \big[\mathbf{h}_1; \mathbf{h}_2; \dots; \mathbf{h}_n\big].$$
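A minimal PyTorch sketch of this encoder and the flattening step is given below; it relies on the built-in `nn.TransformerEncoderLayer` rather than a custom implementation, and the dimensions (96-dimensional inputs, 8 heads, a window of 21) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Single transformer encoder layer over the windowed n-gram embeddings,
# followed by flattening into one contextual prediction feature per window.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=96, nhead=8, dim_feedforward=256,
    activation="relu", batch_first=True)

x = torch.randn(2, 21, 96)    # (batch, window length, concatenated n-gram dim)
h = encoder_layer(x)          # contextualized outputs H, same shape as x
c = h.flatten(start_dim=1)    # flattened prediction feature, shape (2, 21 * 96)
print(c.shape)
```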
3.1.3. Decoding Layer
In the decoding layer, we use a two-layer MLP binary classifier to predict the word segmentation label for the center character of the input n-gram window sequences. We use binary classification because in southeast Asian languages, it is uncommon for a single character to form a word. This approach is simpler and more effective than using more complex labeling schemes.
In the first layer, the feature vector $\mathbf{v}$ for the word segmentation classifier is learned by a linear layer with ReLU activation,
$$\mathbf{v} = \max(0,\, \mathbf{c}W_3 + b_3),$$
where $W_3$ and $b_3$ are the parameter matrix and bias. Then, the segmentation classifier computes the label probability using
$$\hat{y} = \sigma(\mathbf{v}W_4 + b_4),$$
where $W_4$ and $b_4$ are the parameter matrix and bias, $\sigma$ is the sigmoid function, and $\hat{y}$ is the word segmentation label probability.
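The decoding layer can be sketched in PyTorch as follows; the hidden size of 256 and the use of a single sigmoid output for the binary decision are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Two-layer MLP decoder: ReLU hidden layer over the flattened contextual
# feature c, then a boundary probability for the window's centre character.
decoder = nn.Sequential(
    nn.Linear(21 * 96, 256),  # first layer with ReLU activation
    nn.ReLU(),
    nn.Linear(256, 1),        # second layer producing a single logit
    nn.Sigmoid(),             # label probability (B vs. I)
)

c = torch.randn(2, 21 * 96)   # flattened encoder outputs for two windows
print(decoder(c).shape)       # torch.Size([2, 1])
```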
3.2. Objective Function
We use binary labels, the starting-word character (B) and the in-word character (I), to tag the characters within a sentence. Therefore, we select the binary cross-entropy loss to train our model. The objective function is defined as follows,
$$\mathcal{L} = -\frac{1}{N}\sum_{j=1}^{N}\Big[y_j \log \hat{y}_j + (1 - y_j)\log\big(1 - \hat{y}_j\big)\Big],$$
where $y_j$ denotes the ground truth label and $\hat{y}_j$ denotes the predicted label probability of the $j$-th character.
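For completeness, this objective corresponds directly to PyTorch's binary cross-entropy; the tensors below are toy values, not data from the paper.

```python
import torch
import torch.nn.functional as F

probs  = torch.tensor([0.9, 0.2, 0.7])   # predicted boundary probabilities
labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = starting-word (B), 0 = in-word (I)
print(F.binary_cross_entropy(probs, labels).item())
```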
4. Experiments
We conducted extensive experiments on public datasets to demonstrate the performance and efficiency of our proposed model. The following sections provide a detailed discussion of the datasets, evaluation metrics, experimental results, and hyperparameter analysis.
4.1. Datasets and Evaluation Metrics
For Thai word segmentation tasks, we evaluate the proposed method using the InterBEST2010 [27] and LST20 [28] datasets, both annotated by Thailand's National Electronics and Computer Technology Center (NECTEC). The BEST-2010 corpus consists of 34,107 samples (rows) across 415 documents in four categories: articles, encyclopedias, news, and novels. We reserve 10% of the training samples from each category as validation data. The LST20 corpus includes 74,180 sentences across 4751 documents spanning fifteen categories, with 3794 documents in the training set, 474 in the validation set, and 483 in the test set.
For Burmese word segmentation tasks, we use the ALT corpus (https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/, accessed on 1 December 2024) from the Asian Language Treebank (ALT) project [29]. In this project, 20,000 English sentences collected from Wikinews were manually translated into different Asian languages as the raw data. We divide the Burmese data into training, validation, and test sets in an 8:1:1 ratio.
For Khmer word segmentation tasks, we use the khPOS dataset [3], which is published on GitHub (https://github.com/ye-kyaw-thu/khPOS, accessed on 1 December 2024). The khPOS dataset consists of 12,000 sentences (25,626 words in total), with an average of 10.75 words per sentence. We divide the 12,000 sentences into training and validation sets in a 9:1 ratio. Thu [3] provides a separate open test set for evaluating model performance.
Table 1 shows statistical information about the experimental corpora.
Following AttaCut [5], we use precision, recall, and F1-score as evaluation metrics, calculated at both the character level and the word level. Word-level precision is defined as the ratio of correctly segmented words to the total number of predicted words, while word-level recall is the ratio of correctly segmented words to the total number of words in the ground truth; $F1_{\mathrm{word}}$ denotes the word-level F1-score, the harmonic mean of the two. We also assess the recall of out-of-vocabulary (OOV) words, denoted $R_{\mathrm{OOV}}$, to highlight the model's ability to generalize. We use the PyThaiNLP project's word segmentation benchmark module to compute these metrics (https://github.com/PyThaiNLP/pythainlp/commit/bec416f9b97fe4198c4f4288df517696557b475e, accessed on 1 December 2024).
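To make the word-level metrics concrete, the sketch below (our own illustration, not the PyThaiNLP benchmark module) counts a predicted word as correct only when both of its boundaries match the gold segmentation; the helper names are hypothetical.

```python
def word_spans(tags):
    """Convert a B/I tag sequence into a set of (start, end) word spans."""
    spans, start = [], 0
    for i, t in enumerate(tags[1:], 1):
        if t == "B":
            spans.append((start, i))
            start = i
    spans.append((start, len(tags)))
    return set(spans)

def word_prf(gold_tags, pred_tags):
    """Word-level precision, recall, and F1 from two tag sequences."""
    gold, pred = word_spans(gold_tags), word_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(word_prf(list("BIBIIB"), list("BIBIBB")))   # (0.5, 0.667, 0.571) approx.
```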
4.2. Implementation Details
We use PyTorch [30] to implement the proposed model and train and evaluate it on a single Nvidia GeForce RTX 2080 Ti GPU. The hyperparameters of our model are listed in Table 2. By default, we use 1-gram, 2-gram, and 3-gram sequences as input to segment a sentence, and the sliding window size is set to 21. The model's sensitivity to these hyperparameters is discussed further in Section 4.4.
We employ the Adam [31] optimizer to train the proposed models. We set the learning rate to . We also apply dropout to the n-gram embeddings with a rate of 0.15. We use an early stopping strategy, selecting the best model by the minimum loss on the validation data.
4.3. Main Results
4.3.1. Word Segmentation for Thai
Table 3 presents a performance comparison of several models on the BEST2010 and LST20 datasets for Thai word segmentation. PyThaiNLP serves as a dictionary-based model that segments words using a maximum matching algorithm, providing a basic baseline. Sertis leverages bidirectional GRU networks, making it a neural model that captures context in both directions. DeepCut, considered a state-of-the-art segmentation model, uses multiple CNN filters as feature encoders, with inputs that include both characters and their types. AttaCut-C is another CNN-based model that stands out for using dilated CNN filters, designed for faster processing due to its reduced number of CNN filters compared to DeepCut. Its variant, AttaCut-SC, further enhances segmentation by incorporating syllable embeddings as additional input. Finally, THDICTSDR takes a dictionary-based approach but introduces Sparse Distributed Representations (SDRs) to capture contextual information, improving its understanding of Thai text structure. Together, these models demonstrate a range of techniques and architectural choices that address the unique challenges of Thai word segmentation.
The performance results in Table 3 highlight the strengths of UnifiedCut in Thai word segmentation, especially in comparison to the other models. UnifiedCut consistently achieves top scores in precision, recall, F1-score, and out-of-vocabulary (OOV) handling, demonstrating its versatility and robustness. UnifiedCut shows the highest precision among all models, with 0.971 on BEST2010 and 0.988 on LST20. It also performs well in recall, although DeepCut slightly surpasses it on the BEST2010 dataset. UnifiedCut's F1-score is the highest on both datasets (0.965 on BEST2010 and 0.988 on LST20), showing a strong balance between precision and recall. These results suggest that UnifiedCut is well rounded, effectively capturing correct segments from multiple n-gram inputs via the multi-head attention encoder.
UnifiedCut demonstrates the best OOV handling, with $R_{\mathrm{OOV}}$ scores of 0.703 on BEST2010 and 0.697 on LST20, outperforming all other models. A high $R_{\mathrm{OOV}}$ means that UnifiedCut is especially adept at handling words not seen in the training data, a crucial advantage in real-world applications where new or rare words frequently appear.
4.3.2. Word Segmentation Results for Burmese and Khmer
To demonstrate the performance of UnifiedCut in other southeast Asian languages, namely Burmese and Khmer, we also conducted a comparative analysis against the most advanced models on the benchmark datasets of both languages. BERT + BiLSTM-CRF + Fine-tune is the state-of-the-art Burmese word segmentation model; it uses a pre-trained Burmese BERT and a BiLSTM-CRF encoder to learn contextual Burmese word segmentation features. KhPOS is a deep learning-based Khmer word segmentation and POS tagging system designed to remove adverse dependency effects.
The results in Table 4 indicate that UnifiedCut outperforms the benchmark models, BERT + BiLSTM-CRF + Fine-tune for Burmese and KhPOS for Khmer, across all evaluation metrics, highlighting its superior accuracy and adaptability. In terms of precision, UnifiedCut achieves the highest scores, reaching 0.979 for Burmese and 0.983 for Khmer, reflecting its ability to minimize false positives and ensure accurate segmentation of word boundaries. It also demonstrates leading recall scores, with 0.981 for Burmese and 0.987 for Khmer, indicating its strong capability to capture actual word boundaries. These high precision and recall values yield the highest F1-scores for both languages, 0.980 for Burmese and 0.985 for Khmer, underscoring the model's balanced performance. Additionally, UnifiedCut excels in out-of-vocabulary (OOV) word handling, with $R_{\mathrm{OOV}}$ scores of 0.595 for Burmese and 0.613 for Khmer, significantly surpassing the baseline models. This robust OOV performance demonstrates UnifiedCut's generalizability to unseen words, a crucial advantage for real-world applications. Overall, these results confirm UnifiedCut's effectiveness in segmenting Burmese and Khmer, leveraging multiple n-grams and the multi-head attention encoder to deliver consistent, high-quality segmentation.
4.3.3. Cross-Domain Word Segmentation Results
To comprehensively illustrate cross-domain performance, we also evaluate our UnifiedCut model (trained on the BEST2010 dataset) on another three Thai datasets: (i) ORCHID [32], a Thai part-of-speech (POS) tagged corpus containing 23,125 sentences with about 2 M annotated words; (ii) the Thai National Historical Corpus (TNHC), which consists of human-annotated Thai classical literature documents and contains 20,791 samples with about 599 K words and 2.14 M characters; and (iii) the Wisesight-1000 corpus [5], which contains 1000 social media text samples with around 22 K words and 75 K characters.
Table 5 shows the results of different models on the cross-domain datasets. Our model outperforms the state-of-the-art neural systems on the ORCHID and Wisesight-1000 datasets. We also notice that PyThaiNLP achieves the highest word-level F1 value on the ORCHID and TNHC datasets, while the other neural network baselines perform poorly. Compared to the neural network baselines, UnifiedCut obtains significant improvements on the ORCHID dataset and achieves a result comparable to the PyThaiNLP model on the TNHC dataset. These results show that UnifiedCut has better generalization and domain adaptability than the other neural baselines.
It is worth noting that none of the neural network models perform well on the TNHC dataset. The main reason may be that TNHC differs significantly from BEST-2010 in writing style and vocabulary: the literature in TNHC is typically written with archaic words, and its poems follow additional linguistic structures that are quite different from the training samples.
4.3.4. Syllable Segmentation Performance
Syllable segmentation is similar to word segmentation, with the key difference that in southeast Asian Abugida languages, syllables are smaller linguistic units that form words. Syllable boundaries are generally clearer than word boundaries, so our model can be directly applied to the syllable segmentation task.
For syllable segmentation, we train the model on a Thai syllable benchmark dataset, the Thai National Corpus (TNC) [33], which contains 2.56 M annotated syllables (around 8 M characters). For the transfer learning model, we split the training, development, and testing data with a 7:2:1 ratio, containing 1.8 M, 0.5 M, and 0.25 M syllables, respectively.
We selected four baseline models for the syllable segmentation task. The first is a Conditional Random Field (CRF) model that uses character and trigram features. The second is a Maximum Entropy (MaxEnt) model that relies on trigrams as features. We also include PyThaiNLP, a dictionary-based Thai syllable segmentation model that applies a maximum matching algorithm, and SSG, the first machine learning-based model for Thai syllable segmentation [5].
Table 6 presents the performance comparison for syllable segmentation. The results indicate that UnifiedCut, using a unigram sequence as input, achieves the highest scores, with an F1-score of 0.98 and an OOV recall of 0.78. Compared to the second-best model, SSG, UnifiedCut shows a notable improvement in OOV recall by 0.17. These results demonstrate that UnifiedCut is highly effective in producing reliable syllable segmentation outcomes.
4.4. Hyperparameter Study
Several important hyperparameters are present in our model, such as the choice of n-grams as input, the n-gram embedding dimension, the context window length, the number of encoder layers, and the number of heads in multi-head attention. We studied the impact of these parameters on model performance while keeping the other hyperparameters at their default values. The results of the hyperparameter study on the BEST2010 dataset are shown in Figure 2.
Figure 2a illustrates how different n-gram embedding dimensions affect model performance and size when using various n-gram sequences as input. The results show that models using n-grams (1, 2, 3) consistently achieve better performance across embedding dimensions than those using n-grams (1, 2), though at the cost of increased model size. The figure also indicates that both very small and very large embedding dimensions can adversely impact performance. The optimal setting is achieved with n-grams (1, 2, 3) and an embedding dimension of 32, resulting in the best performance with a model size of 0.5 M parameters.
Context window length is another important parameter impacting model performance.
Figure 2b illustrates the effect of varying context window sizes, from 13 to 29, on model accuracy. The results show that our model performs well without needing an excessively long context; a window size of around 20 characters achieves optimal results. Generally, a longer context window improves model precision, but an overly long context can reduce recall. Additionally, longer contexts increase the number of model parameters, as indicated in the figure, where model size grows linearly with context length.
In general, multi-layer encoders help improve a model’s learning capacity.
Figure 2c shows the performance and size of UnifiedCut with different numbers of encoder layers. The results indicate that the two-layer encoder performs slightly better than the one-layer encoder; however, as the number of layers increases, model performance declines significantly, while the model size grows linearly with additional layers. Therefore, UnifiedCut does not require a deep encoder. Given that the performance difference between one and two layers is minimal, with only a 0.001 increase in F1 score, and the one-layer encoder is more time-efficient, we selected the one-layer encoder as the default.
Another key hyperparameter we examine is the number of attention heads. By using multiple heads, the model can capture segmentation features from various perspectives within an input sequence. As shown in Figure 2d, choosing an appropriate number of attention heads significantly enhances model performance, primarily by improving recall. In our experiments, setting the number of heads to 8 produced the best segmentation results.
4.5. Speed Benchmark
We measure model speed by segmenting the Wisesight training set line by line on the same device. We use the state-of-the-art neural word segmenters DeepCut and AttaCut as baselines, and exclude the PyThaiNLP segmenter from the benchmark due to its low performance.
Figure 3 illustrates the speed benchmark results. DeepCut takes the longest time, around 474 s on a Linux server with a 64-core Intel(R) Xeon(R) 2.10 GHz CPU and 64 GB RAM. UnifiedCut takes 123.7 s on the CPU, which is about 3.8 times faster than DeepCut. All models in the benchmark can be sped up significantly by a GPU; for example, using an Nvidia GeForce RTX 2080 Ti GPU, UnifiedCut finishes the segmentation test in only 38.3 s, compared with 127.7 s on the CPU.
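The timing protocol can be reproduced with a simple wall-clock loop like the one below; `segmenter.segment` is a stand-in interface for any of the benchmarked models, not a specific library call.

```python
import time

def benchmark(segmenter, path):
    """Segment a text file line by line and return the elapsed wall-clock time."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            segmenter.segment(line.rstrip("\n"))
    return time.perf_counter() - start
```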
5. Discussion
The UnifiedCut model presents significant strengths in word segmentation for Thai, Burmese, and Khmer scripts. A key advantage lies in its simplicity and efficiency: by employing a single-layer transformer encoder, UnifiedCut maintains a streamlined architecture that accelerates processing, making it considerably faster than state-of-the-art models such as DeepCut and AttaCut. Additionally, UnifiedCut demonstrates robust performance in recognizing out-of-vocabulary (OOV) words and identifying new and meaningful terms effectively. Its high accuracy across both word and syllable segmentation tasks further underscores its versatility and robustness across various southeast Asian languages.
Compared to dictionary-based methods, UnifiedCut has distinct advantages. While dictionary-based approaches can be efficient and precise for known vocabulary, they often struggle with OOV words and depend heavily on comprehensive and up-to-date lexicons. In contrast, UnifiedCut leverages neural network-based multi-head attention, enabling it to generalize better by learning patterns directly from data rather than relying on static dictionaries. This flexibility allows UnifiedCut to adapt more easily to dynamic language changes and linguistic variations.
Despite these strengths, UnifiedCut faces certain limitations, particularly in cross-domain applications. Its accuracy and recall decline when applied to datasets from different domains, highlighting challenges in generalization. This limitation suggests that while UnifiedCut excels with data closely aligned to its training set, its effectiveness diminishes in unfamiliar linguistic contexts. Additionally, although UnifiedCut performs well in OOV word recognition, improvement levels vary across different datasets, revealing inconsistencies in handling diverse language data.
6. Conclusions and Future Works
In this paper, we present a simple yet effective neural model for word segmentation in Thai, Burmese, and Khmer. By using windowed multiple n-grams as input and leveraging a transformer encoder to capture contextual features, our model outperforms state-of-the-art approaches while benefiting from the efficiency of parallelized multi-head attention. Future work could focus on enhancing this model through the integration of pre-trained models to further improve contextual understanding in segmentation tasks. Additionally, exploring hybrid approaches that combine neural networks with rule-based methods may offer further gains in accuracy and robustness, especially in handling complex segmentation cases.