
Search Results (4)

Search Parameters:
Keywords = sentence boundary disambiguation

20 pages, 2098 KiB  
Article
Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method
by Fenfang Li, Zhengzhang Zhao, Li Wang and Han Deng
Appl. Sci. 2024, 14(7), 2989; https://doi.org/10.3390/app14072989 - 2 Apr 2024
Viewed by 1225
Abstract
Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Currently, most automatic sentence segmentation in Tibetan relies on rule-based methods, statistical learning, or a combination of the two; these approaches place high demands on the corpus and on the researchers' linguistic expertise, and manual annotation is costly. In this study, we explore Tibetan SBD using deep learning. First, we analyze the characteristics of Tibetan and various subword techniques, selecting Byte Pair Encoding (BPE) and SentencePiece (SP) for text segmentation and training the Bidirectional Encoder Representations from Transformers (BERT) pre-trained language model. Second, we study Tibetan SBD with different BERT pre-trained language models, which mainly learn the ambiguity of the shad (“།”) in different positions in modern Tibetan texts and determine whether a given shad functions as a sentence delimiter. This study also introduces four models based on BERT, namely BERT-CNN, BERT-RNN, BERT-RCNN, and BERT-DPCNN, for performance comparison. Finally, to verify the performance of pre-trained language models on the SBD task, this study conducts SBD experiments on both the publicly available Tibetan pre-trained language model TiBERT and the multilingual pre-trained language model Multi-BERT. The experimental results show that the F1 score of the BERT (BPE) model trained in this study reaches 95.32% on 465,669 Tibetan sentences, nearly five percentage points higher than BERT (SP) and Multi-BERT. The SBD method based on pre-trained language models in this study lays the foundation for building datasets for the later tasks of Tibetan pre-training, summary extraction, and machine translation.
(This article belongs to the Section Computing and Artificial Intelligence)
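The subword step this abstract describes can be illustrated with a minimal Byte Pair Encoding sketch. This is a toy example of the generic BPE merge-learning algorithm, not the authors' Tibetan pipeline; the English corpus and merge count are purely illustrative:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 2)
```

In a real pipeline the learned merges would be applied to segment raw text before feeding it to the BERT model; libraries such as SentencePiece implement this efficiently.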

9 pages, 1925 KiB  
Proceeding Paper
A New Approach for Carrying Out Sentiment Analysis of Social Media Comments Using Natural Language Processing
by Mritunjay Ranjan, Sanjay Tiwari, Arif Md Sattar and Nisha S. Tatkar
Eng. Proc. 2023, 59(1), 181; https://doi.org/10.3390/engproc2023059181 - 17 Jan 2024
Cited by 3 | Viewed by 5876
Abstract
Business and science use sentiment analysis to extract and assess subjective information from the web, social media, and other sources using NLP, computational linguistics, text analysis, image processing, audio processing, and video processing. It models polarity, attitudes, and urgency from positive, negative, or neutral inputs. Unstructured data make emotion assessment difficult, yet unstructured consumer data allow businesses to market, engage, and connect with consumers on social media, and text data can be assessed instantly for user sentiment. Opinion mining identifies a text’s positive, negative, or neutral opinions, attitudes, views, emotions, and sentiments. Text analytics uses machine learning to evaluate “unstructured” natural language text data, and these data can help firms make money and decisions. Sentiment analysis shows how individuals feel about products, services, organizations, people, events, themes, and qualities; it is applied to reviews, forums, blogs, social media, and other articles. Data-driven (DD) methods find complicated semantic representations of texts without feature engineering. DD sentiment analysis is three-tiered: document-level sentiment analysis determines the polarity and sentiment of a whole document, aspect-based sentiment analysis assesses document segments for emotion and polarity, and word-level sentiment analysis recognizes word polarity and labels sentiments as positive, negative, or neutral. Our innovative method captures sentiments from text comments. The syntactic layer encompasses sentence-level normalisation, identification of ambiguities at paragraph boundaries, part-of-speech (POS) tagging, text chunking, and lemmatization. Pragmatics includes personality recognition, sarcasm detection, metaphor comprehension, aspect extraction, and polarity detection; semantics includes word sense disambiguation, concept extraction, named entity recognition, anaphora resolution, and subjectivity detection.
(This article belongs to the Proceedings of Eng. Proc., 2023, RAiSE-2023)
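The normalisation and polarity-detection steps the abstract lists can be sketched in miniature. This is a hypothetical lexicon-based illustration, not the paper's actual pipeline; the word lists are invented for the example:

```python
# Tiny illustrative sentiment lexicons (not from the paper).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def normalize(comment):
    """Sentence-level normalisation: lowercase and strip punctuation."""
    return "".join(ch for ch in comment.lower() if ch.isalnum() or ch.isspace())

def polarity(comment):
    """Classify a comment as positive, negative, or neutral by lexicon hits."""
    tokens = normalize(comment).split()
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = polarity("I love this great phone!")
```

A production system would replace the lexicon lookup with the learned representations and the POS-tagging, chunking, and lemmatization stages the abstract describes.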

15 pages, 770 KiB  
Article
Pauses and Parsing: Testing the Role of Prosodic Chunking in Sentence Processing
by Caoimhe Harrington Stack and Duane G. Watson
Languages 2023, 8(3), 157; https://doi.org/10.3390/languages8030157 - 28 Jun 2023
Cited by 1 | Viewed by 2219
Abstract
It is broadly accepted that the prosody of a sentence can influence sentence processing by providing the listener with information about the syntax of the sentence. It is less clear what mechanism underlies the transmission of this information. In this paper, we test whether the influence of prosodic structure on parsing is a result of perceptual breaks such as pauses, or whether it is the result of more abstract prosodic elements, such as intonational phrases. In three experiments, we test whether different types of perceptual breaks influence syntactic attachment in ambiguous sentences: intonational boundaries (Experiment 1), an artificial buzzing sound (Experiment 2), and an isolated pause (Experiment 3). We find that although full intonational boundaries influence syntactic disambiguation, the artificial buzz and isolated pause do not. These data rule out theories that argue that perceptual breaks indirectly influence grammatical attachment through memory mechanisms, and instead show that listeners use prosodic breaks themselves as cues to parsing.
(This article belongs to the Special Issue Pauses in Speech)

25 pages, 1359 KiB  
Article
Estimating Sentence-like Structure in Synthetic Languages Using Information Topology
by Andrew D. Back and Janet Wiles
Entropy 2022, 24(7), 859; https://doi.org/10.3390/e24070859 - 22 Jun 2022
Cited by 1 | Viewed by 2326
Abstract
Estimating sentence-like units and sentence boundaries in human language is an important task in the context of natural language understanding. While this topic has been considered using a range of techniques, including rule-based approaches and supervised and unsupervised algorithms, a common aspect of these methods is that they inherently rely on a priori knowledge of human language in one form or another. Recently we have been exploring synthetic languages based on the concept of modeling behaviors using emergent languages. These synthetic languages are characterized by a small alphabet and limited vocabulary and grammatical structure. A particular challenge for synthetic languages is that there is generally no a priori language model available, which limits the use of many natural language processing methods. In this paper, we explore how natural ‘chunks’, in the sense of sentence-like units, may be discovered in synthetic language sequences without any linguistic or semantic language model. Our approach is to consider the problem from the perspective of information theory. We extend the basis of information geometry and propose a new concept, which we term information topology, to model the incremental flow of information in natural sequences. We introduce an information topology view of the incremental information and the incremental tangent angle of the Wasserstein-1 distance of the probabilistic symbolic language input. It is not suggested as a fully viable alternative for sentence boundary detection per se, but it provides a new conceptual method for estimating the structure and natural limits of information flow in language sequences without any semantic knowledge. We consider relevant existing performance metrics such as the F-measure, indicate their limitations, and introduce a new information-theoretic global performance measure based on modeled distributions. Although the methodology is not proposed for human language sentence detection, we provide some examples using human language corpora where potentially useful results are shown. The proposed model shows potential advantages for overcoming difficulties due to the disambiguation of complex language and suggests potential improvements for human language methods.
(This article belongs to the Special Issue Statistical Methods for Complex Systems)
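The incremental Wasserstein-1 distance the abstract refers to can be sketched for distributions over an ordered symbol alphabet, where W1 reduces to the summed absolute difference of the CDFs. This is an illustrative reading of the incremental-distance idea, not the authors' implementation; the sequence and alphabet are made up for the example:

```python
from collections import Counter

def wasserstein1(p, q, alphabet):
    """W1 between two discrete distributions over an ordered alphabet."""
    cp = cq = total = 0.0
    for s in alphabet:
        cp += p.get(s, 0.0)  # running CDF of p
        cq += q.get(s, 0.0)  # running CDF of q
        total += abs(cp - cq)
    return total

def incremental_w1(sequence, alphabet):
    """W1 between the empirical symbol distributions before and after
    each new symbol arrives, tracking the incremental flow of information."""
    counts = Counter()
    prev = {}
    distances = []
    for i, s in enumerate(sequence, 1):
        counts[s] += 1
        cur = {k: v / i for k, v in counts.items()}
        if prev:
            distances.append(wasserstein1(prev, cur, alphabet))
        prev = cur
    return distances

flow = incremental_w1("abab", "ab")
```

Peaks in such a distance series could mark points where the symbol statistics shift, which is the kind of structural signal the information topology approach builds on.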
