Article

Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method

College of Computer Science & Engineering, Northwest Normal University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2989; https://doi.org/10.3390/app14072989
Submission received: 22 February 2024 / Revised: 27 March 2024 / Accepted: 28 March 2024 / Published: 2 April 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Most existing approaches to automatic Tibetan sentence segmentation rely on rule-based methods, statistical learning, or a combination of the two; these approaches place high demands on the corpus and on the researchers' linguistic expertise, and their manual annotation is costly. In this study, we explore Tibetan SBD using deep learning. First, we analyze the characteristics of Tibetan and several subword techniques, select Byte Pair Encoding (BPE) and SentencePiece (SP) for text segmentation, and train Bidirectional Encoder Representations from Transformers (BERT) pre-trained language models. Second, we study Tibetan SBD with different BERT pre-trained language models, which learn the ambiguity of the shad (“།”) at different positions in modern Tibetan texts and determine whether a given shad functions as a sentence delimiter. We also introduce four models based on BERT, namely BERT-CNN, BERT-RNN, BERT-RCNN, and BERT-DPCNN, for performance comparison. Finally, to verify the performance of pre-trained language models on the SBD task, we conduct SBD experiments on both the publicly available Tibetan pre-trained language model TiBERT and the multilingual pre-trained language model Multi-BERT. The experimental results show that the F1 score of the BERT (BPE) model trained in this study reaches 95.32% on 465,669 Tibetan sentences, nearly five percentage points higher than BERT (SP) and Multi-BERT. The SBD method based on pre-trained language models presented here lays the foundation for building datasets for downstream Tibetan tasks such as pre-training, summary extraction, and machine translation.
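The BPE subword step the abstract refers to can be illustrated with a minimal pure-Python sketch of merge learning (illustrative only, shown on Latin toy strings rather than Tibetan syllables; the authors' actual tokenizer and training corpus are not reproduced here):

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a sequence of single characters; at every step
    the most frequent adjacent symbol pair across the corpus is merged
    into one symbol, as in the BPE subword method.
    """
    # Represent each word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy usage: "ab" occurs twice, so ("a", "b") is the first merge.
merges, vocab = learn_bpe_merges(["ab", "ab", "ac"], num_merges=1)
```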
Keywords: sentence boundary disambiguation; Tibetan; pre-trained language model; BERT (BPE); shad (“།”)
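The SBD task itself can be framed as binary classification of each shad occurrence: given its surrounding context, does this shad end a sentence or merely mark an internal pause? A hypothetical sketch of extracting such classification examples is shown below (the window size and function name are assumptions for illustration, not details from the paper):

```python
SHAD = "\u0f0d"  # Tibetan mark shad "།" (U+0F0D)

def shad_contexts(text, window=10):
    """Return one (left, right) context pair per shad in the text.

    Each pair is a candidate classification example: a model such as
    BERT would predict whether the shad between the two contexts ends
    a sentence (label 1) or not (label 0).
    """
    examples = []
    for i, ch in enumerate(text):
        if ch == SHAD:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            examples.append((left, right))
    return examples

# Toy usage on a placeholder string with two shad marks:
pairs = shad_contexts("abc\u0f0ddef\u0f0dgh", window=10)
```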

Share and Cite

MDPI and ACS Style

Li, F.; Zhao, Z.; Wang, L.; Deng, H. Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method. Appl. Sci. 2024, 14, 2989. https://doi.org/10.3390/app14072989


