1. Introduction
Word segmentation is usually a preprocessing step for downstream natural language processing tasks, such as text classification, information retrieval, and machine translation. Scripts such as Chinese and Japanese do not mark word boundaries with spaces, making word segmentation necessary. The problem is even more pronounced in southeast Asian Abugida languages such as Thai, Burmese, and Khmer, whose writing systems likewise lack clear word boundaries. However, Thai, Burmese, and Khmer scripts differ from Chinese and Japanese in that a single character represents only a consonant or a vowel, whereas a Chinese character represents a whole syllable. Hence, extending Chinese word segmentation methods [1] to the Thai, Burmese, and Khmer word segmentation problems is not straightforward.
Complex orthographic structures present significant challenges for word segmentation in southeast Asian Abugida scripts. These scripts encode consonant and vowel information within single syllables, frequently incorporating diacritics, stacked characters, and reordering rules. This orthographic complexity requires models to process both character and sub-word features effectively. Another challenge is the ambiguity of word boundaries. In languages like Thai and Khmer, the absence of explicit markers between words often produces character sequences that admit multiple segmentations. For example, กอดอกไม้ can be segmented as กอ|ดอก|ไม้ (flower bush) or กอด|อก|ไม้ (hugging a wooden chest), although the latter reading is rather nonsensical and statistically unlikely. Such ambiguity requires heavy reliance on contextual information, substantially increasing the computational demands on word segmentation models.
To address the challenge of efficient and accurate word segmentation, several neural-network-based methods have been developed, such as Sertis [2], KhPOS [3], DeepCut [4], AttaCut [5], THDICTSDR [6], and BERT+BiLSTMCRF+Fine-tune [7]. In these models, word segmentation is treated as a character-based sequence tagging task, employing various sequence encoders.
Sertis and KhPOS utilize bidirectional recurrent neural networks (RNNs) to capture contextual character information. In contrast, DeepCut and AttaCut replace RNNs with convolutional neural networks (CNNs) as sequence encoders. DeepCut uses multiple convolutional filters with variable kernel widths to extract segmentation features from characters and character types, but its segmentation speed is slow. AttaCut instead leverages syllable embeddings as supplementary features and improves processing speed through three dilated CNN filters; this speedup comes at the expense of accuracy and requires a preprocessing step for syllable segmentation. THDICTSDR improves dictionary-based methods by applying Sparse Distributed Representations (SDRs) to learn contextual information and integrating n-grams to select the correct word, achieving accuracy close to or better than DeepCut in some cases. However, THDICTSDR still suffers from relatively high processing times, since it retrieves features from an external dictionary. The BERT+BiLSTMCRF+Fine-tune model leverages character, syllable, and word embeddings to capture rich syntactic and semantic features for word segmentation. However, because it relies on the BERT language model and a BiLSTM encoder, it is the largest of these models, which makes its prediction speed considerably slower.
To develop an efficient and accurate word segmentation model for southeast Asian Abugida languages, it is essential to learn rich and effective segmentation features from sequences of characters and sub-words while keeping contextual modeling time-efficient. As DeepCut [4] and AttaCut [5] show, the windowed context modeling strategy is more efficient than RNN-based models. Given the word formation characteristics of southeast Asian Abugida languages, an excessively long context does not improve segmentation performance and only adds unnecessary model complexity. Our approach follows this modeling strategy, learning segmentation features for the target character from fixed-length windows of multiple n-gram sequences. The n-gram inputs require no language-specific preprocessing, such as syllable segmentation or character type recognition, which makes the approach more versatile. Our method introduces an attention mechanism that learns segmentation features from windowed contexts, together with a multi-head attention design that models these features from multiple perspectives. Unlike the convolutional networks used in DeepCut and AttaCut, the multi-head attention mechanism enables the model to learn more nuanced and effective segmentation features within the same computational framework. The multi-head attention features enhance the model's generalization ability and improve the recall of out-of-vocabulary (OOV) words. Moreover, compared to CNN-based models, the proposed approach supports greater parallelism, leading to higher segmentation efficiency.
Experiments demonstrate that UnifiedCut surpasses state-of-the-art systems in in-domain word segmentation accuracy and provides comparable or superior results in cross-domain scenarios. Additionally, UnifiedCut proves versatile and effective for syllable segmentation tasks, offering a robust and efficient solution across applications.
Our contributions can be summarized as follows:
We propose a simple and efficient unified neural model for Thai, Burmese, and Khmer word segmentation, which learns word segmentation features from windowed multiple character n-gram sequences via an attention mechanism.
The proposed multi-head attention encoder has far fewer learnable parameters than the CNN-based models and supports greater parallelism. The proposed UnifiedCut runs about 4× faster than the state-of-the-art Thai word segmentation model, DeepCut.
Experiments on public Thai, Burmese, and Khmer word segmentation datasets show that our model consistently outperforms state-of-the-art word segmentation models and achieves higher recall on out-of-vocabulary words.
2. Related Works
2.1. Word Segmentation
At present, the main research methods for word segmentation in Thai [5,8,9], Burmese [7,10,11], and Khmer [12,13,14] are divided into rule-based, traditional machine learning, and neural network methods. Common traditional machine learning methods include hidden Markov models (HMMs), maximum entropy models (MEMs), and conditional random fields (CRFs). In recent years, deep learning has made great breakthroughs in natural language processing through architectures such as RNNs, CNNs, BiLSTMs, and pre-training-based methods.
Dictionary-based or rule-based word segmentation methods are efficient [15,16,17], but they depend heavily on the quality of the dictionaries, and their ability to recognize new words is limited. Limcharoen et al. propose a statistics-based method [18] that uses word n-grams and the Generalized Left-to-right Reduce (GLR) parsing algorithm. In this technique, the input Thai text is first segmented into Thai Character Clusters (TCCs), units that are inseparable under Thai writing rules, which helps reduce the chance of segmenting words incorrectly. Although the method can automatically resolve some ambiguities, identify out-of-vocabulary words, and outperform dictionary-based methods, it has a large parameter space, leading to the curse of dimensionality. Mon et al. [19] proposed a combination of rule-based heuristic and statistical methods for Burmese word segmentation. This method segments Burmese words by manually summarizing the combination characteristics of Burmese syllables, but the segmentation quality depends heavily on the manually defined rules.
Due to the drawbacks of dictionary-based and statistics-based methods, machine learning (ML) methods, especially neural network models, have been adopted for word segmentation. Bi et al. [13] proposed a bidirectional maximum matching approach to maximize segmentation accuracy. The model performs maximum matching twice, forward and backward; however, it cannot handle out-of-vocabulary (OOV) words or take context into account. Phyu et al. [10] proposed a Burmese word segmentation method based on CRFs and feature clustering. However, CRFs require a customized language feature model, which complicates the process, and the error rate can be high if the feature model is defined inaccurately.
At present, the mainstream approach is to use neural network methods, which automatically learn feature representations for the task and clearly outperform statistical machine learning. Two commonly adopted network structures are the RNN and the CNN. Jussi et al. [2] apply a bidirectional RNN to Thai word segmentation, which does not depend on hand-crafted features. Compared with CNNs, RNNs need more training time because of their recurrent nature. Subsequently, Kittinaradorn et al. [4] proposed a CNN-based Thai word segmentation model named DeepCut, which uses characters and character category features as input and achieves high performance. Chormai et al. [5] proposed the CNN-based Thai word segmentation models AttaCut-C and AttaCut-SC. Both models employ dilated convolutional layers [20,21], which allow them to cover sufficient context with fewer, non-redundant convolution layers. Their architecture has fewer computational dependencies and hence a higher degree of parallelism than DeepCut. In addition, AttaCut-SC uses syllable embeddings as additional features, which provide higher-level information than character information alone, but this requires an additional preprocessing step. In contrast, the model in this paper uses only character-level n-gram embeddings as the representation, avoiding the error propagation of syllable segmentation. Buoy et al. [3] proposed a joint word segmentation and POS tagging model for Khmer using a deep learning approach. The model uses a bidirectional LSTM network that takes character-level inputs and outputs a sequence of POS tags. Although the gating mechanism of the BiLSTM can model order dependence between sequence elements, it struggles with long-distance dependencies and is difficult to parallelize. However, modeling long-distance context appears to contribute little to word segmentation in these languages. Our model therefore learns contextual features through a local attention mechanism, which fundamentally avoids the long-distance dependence problem; current pre-trained language models also learn contextual features via this attention mechanism. Mao et al. [7] proposed a joint learning neural network model based on BERT pre-training. However, BERT pre-training significantly increases the number of model parameters and reduces prediction speed. Therefore, this paper uses only a one-layer transformer as the encoder, which has a small number of parameters and runs fast.
2.2. Syllable Segmentation
The syllable segmentation task is defined similarly, and the performance of syllable segmentation has a direct impact on downstream natural language processing tasks. At present, many attempts have been made to deal with syllable segmentation in Thai, while there have been fewer for Burmese and Khmer. Drawing on the characteristics of syllables, Poowarawan [22] proposed the first dictionary-based method for Thai syllable segmentation using a greedy algorithm, but the method is hard to generalize to other domains. Aroonmanakun [23] uses more than 200 rules to segment Thai syllables and achieves good results. Thet et al. [24] proposed a rule-based method for Burmese syllable segmentation. However, such syllable segmentation rules are complicated, and conflicts may arise between rules when their number is large. At present, the mainstream approach to Thai syllable segmentation is based on statistical learning. Chormai et al. [5] developed the first ML-based Thai orthographical syllable segmenter that is robust to typos. Overall, machine learning methods achieve higher performance than dictionary-based and statistics-based methods and show great potential for syllable segmentation.
3. Methodology
In our proposed model, we frame the word segmentation task as a character-level sequence labeling problem. Each character is classified based only on features learned from a fixed-length context. During both training and prediction, the model processes multiple n-gram sequences derived from this fixed context. UnifiedCut, our proposed model, learns the classification features for each target character from these sequences and performs classification independently for each character. This independent classification allows for parallel processing, which improves segmentation efficiency.
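For illustration, the following minimal Python sketch (our own example, not the authors' code) shows how a labeled sentence can be converted into one fixed-length window per character; the `make_windows` helper and the padding symbol are hypothetical names.

```python
# A minimal sketch of the windowed, character-level labeling formulation.
PAD = "<pad>"

def make_windows(chars, labels, window=21):
    """Yield (window_of_chars, label) pairs, one per character.

    chars  : list of characters in the sentence
    labels : list of 'B'/'I' tags, one per character
    window : odd window length centred on the target character
    """
    half = window // 2
    padded = [PAD] * half + list(chars) + [PAD] * half
    for i, label in enumerate(labels):
        yield padded[i:i + window], label

# Toy example: a 4-character "sentence" made of two 2-character words.
chars = list("abcd")
labels = ["B", "I", "B", "I"]
for ctx, y in make_windows(chars, labels, window=5):
    print(ctx, y)
```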
3.1. UnifiedCut Model
We propose learning classification features through a multi-head attention transformer encoder.
Figure 1 provides an overview of the UnifiedCut architecture, which includes embedding layers, multi-head attention layers, a fully connected layer, and a word label classification layer.
3.1.1. Multiple n-Gram Embedding Layer
We use multiple n-gram sequences as model input to perform word segmentation. The embedding layer encodes each character n-gram into a dense vector. For example, given the $i$-th n-gram token $g_i^{k}$, we embed it as a dense vector $\mathbf{e}_i^{k}$ using the corresponding lookup table $E^{k}$,
$$\mathbf{e}_i^{k} = E^{k}\big(g_i^{k}\big),$$
where $k$ is the length of the n-gram. We concatenate the n-gram embeddings at all positions to learn richer features from multiple n-grams. We denote the concatenated vector for the $i$-th character in the context as
$$\mathbf{x}_i = \big[\mathbf{e}_i^{1}; \mathbf{e}_i^{2}; \dots; \mathbf{e}_i^{K}\big] \in \mathbb{R}^{d},$$
where $1 \le k \le K$, $K$ is the maximum n-gram length, and $d$ is the size of the concatenated vector.
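As a concrete illustration of this layer, the PyTorch sketch below (an assumption of ours, not the released implementation) keeps one embedding table per n-gram order and concatenates the looked-up vectors at every window position; the vocabulary sizes and embedding dimension are placeholders.

```python
import torch
import torch.nn as nn

class MultiNGramEmbedding(nn.Module):
    """One embedding table per n-gram order; outputs are concatenated per position."""
    def __init__(self, vocab_sizes, dim=32):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(v, dim) for v in vocab_sizes)

    def forward(self, ngram_ids):
        # ngram_ids: list of K tensors, each of shape (batch, window)
        embedded = [table(ids) for table, ids in zip(self.tables, ngram_ids)]
        return torch.cat(embedded, dim=-1)   # (batch, window, K * dim)

# Toy usage: a window of 21 positions with 1-gram, 2-gram, and 3-gram ids.
layer = MultiNGramEmbedding(vocab_sizes=[100, 500, 2000], dim=32)
ids = [torch.randint(0, v, (2, 21)) for v in (100, 500, 2000)]
print(layer(ids).shape)   # torch.Size([2, 21, 96])
```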
3.1.2. Transformer Encoder
We use a transformer encoder [25] to learn contextual features from an input sliding window of n-gram embeddings $X = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ of length $n$. The sequence of input n-gram embeddings is fed into the transformer encoder. The transformer layer consists of two sub-layers, a multi-head attention layer and a position-wise feed-forward layer. The encoder employs residual connections around each of the sub-layers, followed by layer normalization [26].
Before applying self-attention to the input sequence, the model first transforms the input $X$ into query $Q$, key $K$, and value $V$ sequences via separate linear transformations. Then, we calculate the self-attention outputs by scaled dot-product attention,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
Multi-head attention enables the model to simultaneously focus on information from different representation subspaces at various positions. This allows our model to effectively learn multi-view word segmentation features from the surrounding context. The output of the multi-head attention layer is a projection formed by concatenating the outputs of all attention heads,
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\big(XW_i^{Q}, XW_i^{K}, XW_i^{V}\big),$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection parameters of head $i$, and $W^{O}$ denotes the output projection matrix. For each parallel attention head, we use a head dimension of $d_k = d_{\mathrm{model}}/h$.
After the multi-head attention layer, layer normalization is applied to the sum of the sub-layer output and the residual connection, i.e., $\mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X)\big)$.
In addition to the attention sub-layer, the encoder block contains a fully connected feed-forward network applied to each position separately and identically. The feed-forward network consists of two linear transformations with a ReLU activation in between,
$$\mathrm{FFN}(z) = \max(0,\, zW_1 + b_1)\,W_2 + b_2,$$
where $W_1$, $b_1$, $W_2$, and $b_2$ are the projection parameters, and $z$ denotes the output of the previous layer normalization. Another layer normalization is applied after the feed-forward network, yielding the outputs $H = (\mathbf{h}_1, \dots, \mathbf{h}_n)$ of the encoder block.
We derive the contextual prediction feature from the multiple input n-grams by flattening the transformer encoder output into a single vector,
$$\mathbf{c} = \mathrm{Flatten}(H) = \big[\mathbf{h}_1; \mathbf{h}_2; \dots; \mathbf{h}_n\big].$$
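A minimal PyTorch sketch of this encoder and the flattening step is given below; it relies on the built-in `nn.TransformerEncoderLayer` rather than a custom implementation, and the dimensions (96-dimensional inputs, 8 heads, a window of 21) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Single transformer encoder layer over the windowed n-gram embeddings,
# followed by flattening into one contextual prediction feature per window.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=96, nhead=8, dim_feedforward=256,
    activation="relu", batch_first=True)

x = torch.randn(2, 21, 96)    # (batch, window length, concatenated n-gram dim)
h = encoder_layer(x)          # contextualized outputs H, same shape as x
c = h.flatten(start_dim=1)    # flattened prediction feature, shape (2, 21 * 96)
print(c.shape)
```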
3.1.3. Decoding Layer
In the decoding layer, we use a two-layer MLP binary classifier to predict the word segmentation label for the center character of the input n-gram window sequences. We use binary classification because in southeast Asian languages, it is uncommon for a single character to form a word. This approach is simpler and more effective than using more complex labeling schemes.
In the first layer, the feature vector $\mathbf{v}$ for the word segmentation classifier is learned by a linear layer with ReLU activation,
$$\mathbf{v} = \max(0,\, \mathbf{c}W_3 + b_3),$$
where $W_3$ and $b_3$ are the parameter matrix and bias. Then, the segmentation classifier computes the label probability using
$$\hat{y} = \sigma(\mathbf{v}W_4 + b_4),$$
where $W_4$ and $b_4$ are the parameter matrix and bias, $\sigma$ is the sigmoid function, and $\hat{y}$ is the word segmentation label probability.
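The decoding layer can be sketched in PyTorch as follows; the hidden size of 256 and the use of a single sigmoid output for the binary decision are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Two-layer MLP decoder: ReLU hidden layer over the flattened contextual
# feature c, then a boundary probability for the window's centre character.
decoder = nn.Sequential(
    nn.Linear(21 * 96, 256),  # first layer with ReLU activation
    nn.ReLU(),
    nn.Linear(256, 1),        # second layer producing a single logit
    nn.Sigmoid(),             # label probability (B vs. I)
)

c = torch.randn(2, 21 * 96)   # flattened encoder outputs for two windows
print(decoder(c).shape)       # torch.Size([2, 1])
```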
3.2. Objective Function
We use binary labels, the starting-word character (B) and the in-word character (I), to tag the characters within a sentence. Therefore, we select the binary cross-entropy loss to train our model. The objective function is defined as follows,
$$\mathcal{L} = -\frac{1}{N}\sum_{j=1}^{N}\Big[y_j \log \hat{y}_j + (1 - y_j)\log\big(1 - \hat{y}_j\big)\Big],$$
where $y_j$ denotes the ground truth label and $\hat{y}_j$ denotes the predicted label probability of the $j$-th character.
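For completeness, this objective corresponds directly to PyTorch's binary cross-entropy; the tensors below are toy values, not data from the paper.

```python
import torch
import torch.nn.functional as F

probs  = torch.tensor([0.9, 0.2, 0.7])   # predicted boundary probabilities
labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = starting-word (B), 0 = in-word (I)
print(F.binary_cross_entropy(probs, labels).item())
```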
4. Experiments
We conducted extensive experiments on public datasets to demonstrate the performance and efficiency of our proposed model. The following sections provide a detailed discussion of the datasets, evaluation metrics, experimental results, and hyperparameter analysis.
4.1. Datasets and Evaluation Metrics
For Thai word segmentation tasks, we evaluate the proposed method using the InterBEST2010 [27] and LST20 [28] datasets, both annotated by Thailand's National Electronics and Computer Technology Center (NECTEC). The BEST-2010 corpus consists of 34,107 samples (rows) across 415 documents in four categories: articles, encyclopedias, news, and novels. We reserve 10% of the training samples from each category as validation data. The LST20 corpus includes 74,180 sentences across 4751 documents spanning fifteen categories, with 3794 documents in the training set, 474 in the validation set, and 483 in the test set.
For Burmese word segmentation tasks, we use the ALT corpus (https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/, accessed on 1 December 2024) from the Asian Language Treebank (ALT) project [29]. In this project, 20,000 English sentences collected from Wikinews were manually translated into different Asian languages as the raw data. We divide the Burmese data into training, validation, and test sets in an 8:1:1 ratio.
For Khmer word segmentation tasks, we use the khPOS dataset [3], which is published on GitHub (https://github.com/ye-kyaw-thu/khPOS, accessed on 1 December 2024). The khPOS dataset consists of 12,000 sentences (25,626 words in total), with an average of 10.75 words per sentence. We divide the 12,000 sentences into training and validation sets in a 9:1 ratio. Thu [3] provides a separate open test set for evaluating model performance.
Table 1 shows statistical information about the experimental corpora.
Following AttaCut [5], we use precision, recall, and F1-score as evaluation metrics, calculated at both the character level and the word level. Word-level precision is defined as the ratio of correctly segmented words to the total number of predicted words, while word-level recall is the ratio of correctly segmented words to the total number of words in the ground truth; $F1_{\mathrm{word}}$ denotes the word-level F1-score, the harmonic mean of the two. We also assess the recall of out-of-vocabulary (OOV) words, denoted $R_{\mathrm{OOV}}$, to highlight the model's ability to generalize. We use the PyThaiNLP project's word segmentation benchmark module to compute these metrics (https://github.com/PyThaiNLP/pythainlp/commit/bec416f9b97fe4198c4f4288df517696557b475e, accessed on 1 December 2024).
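To make the word-level metrics concrete, the sketch below (our own illustration, not the PyThaiNLP benchmark module) counts a predicted word as correct only when both of its boundaries match the gold segmentation; the helper names are hypothetical.

```python
def word_spans(tags):
    """Convert a B/I tag sequence into a set of (start, end) word spans."""
    spans, start = [], 0
    for i, t in enumerate(tags[1:], 1):
        if t == "B":
            spans.append((start, i))
            start = i
    spans.append((start, len(tags)))
    return set(spans)

def word_prf(gold_tags, pred_tags):
    """Word-level precision, recall, and F1 from two tag sequences."""
    gold, pred = word_spans(gold_tags), word_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(word_prf(list("BIBIIB"), list("BIBIBB")))   # (0.5, 0.667, 0.571) approx.
```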
4.2. Implementation Details
We use PyTorch [30] to implement the proposed model and train and evaluate it on a single Nvidia GeForce RTX 2080 Ti GPU. The hyperparameters of our model are listed in Table 2. By default, we use 1-gram, 2-gram, and 3-gram sequences as input to segment a sentence, and the sliding window size is set to 21. The model's sensitivity to these hyperparameters is discussed further in Section 4.4.
We employ the Adam [31] optimizer to train the proposed models. We set the learning rate to . We also apply dropout to the n-gram embeddings with a rate of 0.15. We use an early stopping strategy, selecting the best model by the minimum loss on the validation data.
4.3. Main Results
4.3.1. Word Segmentation for Thai
Table 3 presents a performance comparison of several models on the BEST2010 and LST20 datasets for Thai word segmentation. PyThaiNLP serves as a dictionary-based model that segments words using a maximum matching algorithm, providing a basic baseline. Sertis leverages bidirectional GRU networks, making it a neural model that captures context in both directions. DeepCut, considered a state-of-the-art segmentation model, uses multiple CNN filters as feature encoders, with inputs that include both characters and their types. AttaCut-C is another CNN-based model that stands out for using dilated CNN filters, designed for faster processing due to its reduced number of CNN filters compared to DeepCut. Its variant, AttaCut-SC, further enhances segmentation by incorporating syllable embeddings as additional input. Finally, THDICTSDR takes a dictionary-based approach but introduces Sparse Distributed Representations (SDRs) to capture contextual information, improving its understanding of Thai text structure. Together, these models demonstrate a range of techniques and architectural choices that address the unique challenges of Thai word segmentation.
The performance results in Table 3 highlight the strengths of UnifiedCut in Thai word segmentation, especially in comparison to the other models. UnifiedCut consistently achieves top scores in precision, recall, F1-score, and out-of-vocabulary (OOV) handling, demonstrating its versatility and robustness. UnifiedCut shows the highest precision among all models, with 0.971 on BEST2010 and 0.988 on LST20. It also performs well in recall, although DeepCut slightly surpasses it on the BEST2010 dataset. UnifiedCut's F1-score is the highest on both datasets (0.965 on BEST2010 and 0.988 on LST20), showing a strong balance between precision and recall. These results suggest that UnifiedCut is well rounded, effectively capturing correct segments from multiple n-gram inputs via the multi-head attention encoder.
UnifiedCut demonstrates the best OOV handling, with $R_{\mathrm{OOV}}$ scores of 0.703 on BEST2010 and 0.697 on LST20, outperforming all other models. A high $R_{\mathrm{OOV}}$ means that UnifiedCut is especially adept at handling words not seen in the training data, a crucial advantage in real-world applications where new or rare words frequently appear.
4.3.2. Word Segmentation Results for Burmese and Khmer
To demonstrate the performance of UnifiedCut in other southeast Asian languages, namely Burmese and Khmer, we also conducted a comparative analysis against the most advanced models on the benchmark datasets of both languages. BERT + BiLSTM-CRF + Fine-tune is the state-of-the-art Burmese word segmentation model; it uses a pre-trained Burmese BERT and a BiLSTM-CRF encoder to learn contextual Burmese word segmentation features. KhPOS is a deep learning-based Khmer word segmentation and POS tagging system designed to remove adverse dependency effects.
The results in Table 4 indicate that UnifiedCut outperforms the benchmark models, BERT + BiLSTM-CRF + Fine-tune for Burmese and KhPOS for Khmer, across all evaluation metrics, highlighting its superior accuracy and adaptability. In terms of precision, UnifiedCut achieves the highest scores, reaching 0.979 for Burmese and 0.983 for Khmer, reflecting its ability to minimize false positives and ensure accurate segmentation of word boundaries. It also demonstrates leading recall scores, with 0.981 for Burmese and 0.987 for Khmer, indicating its strong capability to capture actual word boundaries. These high precision and recall values yield the highest F1-scores for both languages, 0.980 for Burmese and 0.985 for Khmer, underscoring the model's balanced performance. Additionally, UnifiedCut excels in out-of-vocabulary (OOV) word handling, with $R_{\mathrm{OOV}}$ scores of 0.595 for Burmese and 0.613 for Khmer, significantly surpassing the baseline models. This robust OOV performance demonstrates UnifiedCut's generalizability to unseen words, a crucial advantage for real-world applications. Overall, these results confirm UnifiedCut's effectiveness in segmenting Burmese and Khmer, leveraging multiple n-grams and the multi-head attention encoder to deliver consistent, high-quality segmentation.
4.3.3. Cross-Domain Word Segmentation Results
To comprehensively illustrate cross-domain performance, we also evaluate our UnifiedCut model (trained on the BEST2010 dataset) on another three Thai datasets: (i) ORCHID [32], a Thai part-of-speech (POS) tagged corpus containing 23,125 sentences with about 2 M annotated words; (ii) the Thai National Historical Corpus (TNHC), which consists of human-annotated Thai classical literature documents and contains 20,791 samples with about 599 K words and 2.14 M characters; and (iii) the Wisesight-1000 corpus [5], which contains 1000 social media text samples with around 22 K words and 75 K characters.
Table 5 shows the results of different models on the cross-domain datasets. Our model outperforms the state-of-the-art neural systems on the ORCHID and Wisesight-1000 datasets. We also notice that PyThaiNLP achieves the highest word-level F1 value on the ORCHID and TNHC datasets, while the other neural network baselines perform poorly. Compared to the neural network baselines, UnifiedCut obtains significant improvements on the ORCHID dataset and achieves a result comparable to the PyThaiNLP model on the TNHC dataset. These results show that UnifiedCut has better generalization and domain adaptability than the other neural baselines.
It is worth noting that none of the neural network models perform well on the TNHC dataset. The main reason may be that TNHC differs significantly from BEST-2010 in writing style and vocabulary: the literature in TNHC is typically written with archaic words, and its poems follow additional linguistic structures that are quite different from the training samples.
4.3.4. Syllable Segmentation Performance
Syllable segmentation is similar to word segmentation, with the key difference that in southeast Asian Abugida languages, syllables are smaller linguistic units that form words. Syllable boundaries are generally clearer than word boundaries, so our model can be directly applied to the syllable segmentation task.
For syllable segmentation, we train the model on a Thai syllable benchmark dataset, the Thai National Corpus (TNC) [33], which contains 2.56 M annotated syllables (around 8 M characters). For the transfer learning model, we split the training, development, and testing data with a 7:2:1 ratio, containing 1.8 M, 0.5 M, and 0.25 M syllables, respectively.
We selected four baseline models for the syllable segmentation task. The first is a Conditional Random Field (CRF) model that uses character and trigram features. The second is a Maximum Entropy (MaxEnt) model that relies on trigrams as features. We also include PyThaiNLP, a dictionary-based Thai syllable segmentation model that applies a maximum matching algorithm, and SSG, the first machine learning-based model for Thai syllable segmentation [5].
Table 6 presents the performance comparison for syllable segmentation. The results indicate that UnifiedCut, using a unigram sequence as input, achieves the highest scores, with an F1-score of 0.98 and an OOV recall of 0.78. Compared to the second-best model, SSG, UnifiedCut shows a notable improvement in OOV recall by 0.17. These results demonstrate that UnifiedCut is highly effective in producing reliable syllable segmentation outcomes.
4.4. Hyperparameter Study
Several important hyperparameters are present in our model, such as the choice of n-grams as input, the n-gram embedding dimension, the context window length, the number of encoder layers, and the number of heads in multi-head attention. We studied the impact of these parameters on model performance while keeping the other hyperparameters at their default values. The results of the hyperparameter study on the BEST2010 dataset are shown in Figure 2.
Figure 2a illustrates how different n-gram embedding dimensions affect model performance and size when using various n-gram sequences as input. The results show that models using n-grams (1, 2, 3) consistently achieve better performance across embedding dimensions than those using n-grams (1, 2), though at the cost of increased model size. The figure also indicates that both very small and very large embedding dimensions can adversely impact performance. The optimal setting is achieved with n-grams (1, 2, 3) and an embedding dimension of 32, resulting in the best performance with a model size of 0.5 M parameters.
Context window length is another important parameter impacting model performance.
Figure 2b illustrates the effect of varying context window sizes, from 13 to 29, on model accuracy. The results show that our model performs well without needing an excessively long context; a window size of around 20 characters achieves optimal results. Generally, a longer context window improves model precision, but an overly long context can reduce recall. Additionally, longer contexts increase the number of model parameters, as indicated in the figure, where model size grows linearly with context length.
In general, multi-layer encoders help improve a model’s learning capacity.
Figure 2c shows the performance and size of UnifiedCut with different numbers of encoder layers. The results indicate that the two-layer encoder performs slightly better than the one-layer encoder; however, as the number of layers increases, model performance declines significantly, while the model size grows linearly with additional layers. Therefore, UnifiedCut does not require a deep encoder. Given that the performance difference between one and two layers is minimal, with only a 0.001 increase in F1 score, and the one-layer encoder is more time-efficient, we selected the one-layer encoder as the default.
Another key hyperparameter we examine is the number of attention heads. By using multiple heads, the model can capture segmentation features from various perspectives within an input sequence. As shown in Figure 2d, choosing an appropriate number of attention heads significantly enhances model performance, primarily by improving recall. In our experiments, setting the number of heads to 8 produced the best segmentation results.
4.5. Speed Benchmark
We measure model speed by segmenting the Wisesight training set line by line on the same device. We use the state-of-the-art neural word segmenters DeepCut and AttaCut as baselines, and exclude the PyThaiNLP segmenter from the benchmark due to its low performance.
Figure 3 illustrates the speed benchmark results. DeepCut takes the longest time, around 474 s on a Linux server with a 64-core Intel(R) Xeon(R) 2.10 GHz CPU and 64 GB RAM. UnifiedCut takes 123.7 s on the CPU, which is about 3.8 times faster than DeepCut. All models in the benchmark can be sped up significantly by a GPU; for example, using an Nvidia GeForce RTX 2080 Ti GPU, UnifiedCut finishes the segmentation test in only 38.3 s, compared with 127.7 s on the CPU.
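The timing protocol can be reproduced with a simple wall-clock loop like the one below; `segmenter.segment` is a stand-in interface for any of the benchmarked models, not a specific library call.

```python
import time

def benchmark(segmenter, path):
    """Segment a text file line by line and return the elapsed wall-clock time."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            segmenter.segment(line.rstrip("\n"))
    return time.perf_counter() - start
```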
5. Discussion
The UnifiedCut model presents significant strengths in word segmentation for Thai, Burmese, and Khmer scripts. A key advantage lies in its simplicity and efficiency: by employing a single-layer transformer encoder, UnifiedCut maintains a streamlined architecture that accelerates processing, making it considerably faster than state-of-the-art models such as DeepCut and AttaCut. Additionally, UnifiedCut demonstrates robust performance in recognizing out-of-vocabulary (OOV) words and identifying new and meaningful terms effectively. Its high accuracy across both word and syllable segmentation tasks further underscores its versatility and robustness across various southeast Asian languages.
Compared to dictionary-based methods, UnifiedCut has distinct advantages. While dictionary-based approaches can be efficient and precise for known vocabulary, they often struggle with OOV words and depend heavily on comprehensive and up-to-date lexicons. In contrast, UnifiedCut leverages neural network-based multi-head attention, enabling it to generalize better by learning patterns directly from data rather than relying on static dictionaries. This flexibility allows UnifiedCut to adapt more easily to dynamic language changes and linguistic variations.
Despite these strengths, UnifiedCut faces certain limitations, particularly in cross-domain applications. Its accuracy and recall decline when applied to datasets from different domains, highlighting challenges in generalization. This limitation suggests that while UnifiedCut excels with data closely aligned to its training set, its effectiveness diminishes in unfamiliar linguistic contexts. Additionally, although UnifiedCut performs well in OOV word recognition, improvement levels vary across different datasets, revealing inconsistencies in handling diverse language data.
6. Conclusions and Future Works
In this paper, we present a simple yet effective neural model for word segmentation in Thai, Burmese, and Khmer. By using windowed multiple n-grams as input and leveraging a transformer encoder to capture contextual features, our model outperforms state-of-the-art approaches while benefiting from the efficiency of parallelized multi-head attention. Future work could focus on enhancing this model through the integration of pre-trained models to further improve contextual understanding in segmentation tasks. Additionally, exploring hybrid approaches that combine neural networks with rule-based methods may offer further gains in accuracy and robustness, especially in handling complex segmentation cases.