Examining Sentiment Analysis for Low-Resource Languages with Data Augmentation Techniques

Thakkar, Gaurish; Preradović, Nives Mikelić; Tadić, Marko

doi:10.3390/eng5040152


by Gaurish Thakkar *, Nives Mikelić Preradović * and Marko Tadić

Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lućića 3, 10000 Zagreb, Croatia

* Authors to whom correspondence should be addressed.

Eng 2024, 5(4), 2920-2942; https://doi.org/10.3390/eng5040152

Submission received: 13 September 2024 / Revised: 1 November 2024 / Accepted: 4 November 2024 / Published: 7 November 2024

(This article belongs to the Special Issue Feature Papers in Eng 2024)


Abstract

This study investigates the influence of a variety of data augmentation techniques on sentiment analysis in low-resource languages, with a particular emphasis on Bulgarian, Croatian, Slovak, and Slovene. The following primary research question is addressed: is it possible to improve sentiment analysis efficacy in low-resource languages through data augmentation? Our sub-questions look at how different augmentation methods affect performance, how effective WordNet-based augmentation is compared to other methods, and whether lemma-based augmentation techniques can be used, especially for Croatian sentiment tasks. Our data sources comprise sentiment-labelled evaluations in the selected languages, which were curated with additional annotations to standardise labels and mitigate ambiguities. Our findings show that techniques like replacing words with synonyms, masked language model (MLM)-based generation, and permuting and combining sentences can make training datasets only slightly bigger and provide limited improvements in model accuracy for low-resource language sentiment classification. WordNet-based techniques, in particular, exhibit a marginally superior performance compared to other methods; however, they fail to substantially improve classification scores. From a practical perspective, this study emphasises that conventional augmentation techniques may require refinement to address the complex linguistic features that are inherent to low-resource languages, particularly in mixed-sentiment and context-rich instances. Theoretically, our results indicate that future research should concentrate on the development of augmentation strategies that introduce novel syntactic structures rather than solely relying on lexical variations, as current models may not effectively leverage synonymic or lemmatised data. These insights emphasise the nuanced requirements for meaningful data augmentation in low-resource linguistic settings and contribute to the advancement of sentiment analysis approaches.

Keywords:

sentiment analysis; language models; data augmentation

1. Introduction

“A neural network is a computational model inspired by the way biological neural networks in the human brain function. It consists of layers of interconnected nodes (called neurons), where each node performs a simple computation, and information is passed from one layer to another”. In the context of a neural network, parameters refer to the internal variables that the model learns from the training data. These include the weights and biases associated with the neurons in each layer [1]. “Hyperparameters are configuration settings that are used to control the learning process of a machine learning model but are not learnt from the data itself. They differ from model parameters in that they are set before training begins and remain constant during the training process” [1,2]. In contrast to learnt parameters (such as weights and biases), which are modified during training, hyperparameters govern the learning process and must be defined before model training commences. Examples include the learning rate, batch size, layer count, number of neurons per layer, and dropout rate.

The performance of a neural network depends on its hyperparameters and on the parameters learned from the training set [3]. It is commonly believed that having more data points is the default method for improving performance [4]. A direct approach requires running an annotation campaign, which is expensive, time-consuming, and labour-intensive in terms of annotation and training [5]. Because these models have a large number of parameters and therefore need many training instances to perform the intended task, this requirement cannot be eliminated.

In the reverse direction, new data points are generated from existing supervised or unsupervised text bodies [6]. To date, numerous techniques for data generation have been identified. Ref. [7] reported using contextual language under the assumption that sentences are invariant when original words are replaced by words with paradigmatic relations [8]. When compared to original texts, in-context predicted words were deemed to be better options for creating data samples that vary in terms of pattern. Attempts [9,10] have also been made at using data augmentation for different text classifications in large English-language datasets. The augmentations were derived from an English thesaurus and then trained using various machine learning and deep learning algorithms. Ref. [6] described simple augmentation operations (such as insertion, deletion, swap, and replacement) that produced comparable results when only half of the original dataset was used.

In data-driven research, these techniques focus primarily on resolving low-data scenarios, mitigating the phenomenon of class imbalance, or serving as regularising terms to make systems more resistant to adversarial attacks. The purpose of enhancing a neural network model’s resistance against adversarial attacks is to guarantee its robustness and reliability in practical applications. Adversarial attacks entail the creation of minor, frequently undetectable, modifications to input data that can lead the model to produce erroneous predictions. This presents considerable security threats in applications like autonomous driving, medical diagnosis, and facial recognition [11].

Existing data augmentation strategies for other tasks in languages with abundant resources (especially English) have also been investigated. To detect event causality, Ref. [12] employed a remote annotator, followed by filtering, relabelling, and annealing on instances with noisy labels. For the common-sense reasoning task, Ref. [13] used a pre-trained task model (XLM-R) and a generative language model (GPT-2) to generate synthetic data instances. Data selection was conducted using filtering functions that considered the quality and diversity of synthetic instances. The approaches that have been published for high-resource languages, such as English, are constructed using other linguistic resources as primary building blocks. To produce facts from an existing knowledge repository or knowledge graph, such a resource must be available in the target language. Therefore, a language with limited resources may lack these dependent resources, thereby rendering the method inapplicable. Empirical evidence regarding the effectiveness of these interventions in low-resource settings is still lacking. Even though data augmentation techniques such as EDA (Easy Data Augmentation) [6] are simple to implement, it is essential to conduct additional research on their applicability in low-resource settings.

This paper aims to investigate the efficacy of various data augmentation (DA) strategies in enhancing sentiment analysis for low-resource languages, particularly South Slavic languages. The article presents a novel strategy termed “expand-permute-combine” and assesses its efficacy in comparison to other methods to evaluate their influence on classification accuracy for under-resourced languages. We hypothesise that DA strategies are equivalent to cross-lingual and cross-family configurations. For the task of sentiment classification, we experiment with various data augmentation techniques on a set of low-resource languages from the same language family (i.e., South Slavic languages). To analyse each of these facets, we employ three distinct data augmentation techniques that rely on synonymy [14] and pre-trained large language models [15,16]. In addition, we propose a straightforward method of augmentation that requires no additional resources. To determine the effectiveness of these techniques, evaluation was performed on the task of sentiment classification. Experiments were conducted on South-Slavic languages (i.e., Bulgarian, Croatian, Slovak, and Slovene). To enable a three-class classification of the dataset for the Croatian language, we also conducted an annotation campaign to label instances that were claimed to be noisy by the original authors of the dataset.

2. Research Question

In this study, we explore DA methods as a means to artificially increase the instance space and compare the performance with that obtained when using resources from the same language family. This study has the following main research question: can data augmentation be utilised effectively for sentiment analysis in low-resource languages? Additionally, four more specific questions are addressed, as follows:

(1) Can the data augmentation technique improve the performance metric?
(2) What is the effect of using augmented data generated from different techniques? We explore three different data augmentation techniques and compare their performances with each other.
(3) Can WordNet-based augmentation techniques work better with sentiment classification tasks?
(4) Does training with lemma-based instances work for Croatian?

We hypothesise that the accuracy of the data augmentation techniques is comparable to that of supervised methods when applied to typologically related languages.

3. Literature Review

The section commences by examining the key methodologies and advancements in the field that are relevant to this investigation. This includes an analysis of different approaches, including data augmentation, adversarial attacks, and distant supervision, that have been used to enhance the performance of NLP tasks. In the subsequent subsections, we will explore specific techniques and models, emphasising their application in a variety of domains, with a particular emphasis on their relevance to sentiment analysis and low-resource languages. This investigation establishes the groundwork for understanding the landscape of previous research and the context for the methodologies proposed in this work.

Data Augmentation

Distant supervision is a method for curating labelled data instances by utilising an existing knowledge base [17]. Ref. [18] reported the first instance of using distant supervision in NLP. The work entailed curating datasets for the task of relation extraction. The authors used Freebase, a large database that stores the relationships between two entities. The assumption was that any sentence containing two freebase entities could express the relationship. As a result, Freebase was used as an unsupervised lookup table. Various features were designed, ranging from POS tag, NER, and n-words within the context window. Ref. [17] introduced a similar approach in the BioNLP domain, in which knowledge from a database is used to label sentences containing two entities to generate a dataset based on remote supervision. In the same work, heuristics (trigger words and high confidence patterns) were proposed to reduce noise in the sentence augmentation process. A CNN trained with an automatically created dataset and then trained on a manually annotated dataset achieved the highest score. The authors hypothesised that the direct union of two datasets (distant supervision-based and manually annotated) is not advantageous because noisy datasets lead to a decline in the final performance.

Two types of augmentation methods for NLP can be broadly distinguished: (1) text-based augmentation and (2) feature-based augmentation. The text-based augmentations operate at the text level. The process of augmentation can be implemented at various linguistic levels (morphological, syntactic, and semantic). Another branch of research focuses on adversarial attacks against the trained model. This is accomplished by generating text instances $X'$ similar to the training data $X$, such that the model attempting to perform the intended task fails. Instances $X$ and $X'$ should have identical human predictions, with $X'$ containing minimal textual changes relative to the original instance. All adversarial attack techniques [19,20,21] on classification tasks rely on text-augmenters as their primary component when supplying augmented instances for adversarial attacks.

Ref. [22] experimented with various synonym replacement methods to generate adversarial samples. The synonyms were obtained from WordNet. The method for choosing a synonym for a word ranged from random selection to a more sophisticated method based on the Word Saliency [23] score. Another way of finding a replacement for a given word is to use a pre-trained language model that uses context to predict the replacement word. Ref. [7] altered the language model so that it integrates the label along with the context during the word prediction stage. The language model was trained on the WikiText-103 corpus of English Wikipedia articles. Ref. [19] used contextual perturbations from a BERT masked language model to replace and insert tokens at masked locations. Ref. [24] extended the work using RoBERTa and three contextualised perturbations, i.e., replace, insert, and merge. All of these studies were conducted on English datasets.

In the field of Neural Machine Translation (NMT), the technique of translating a target language into a source language is known as back-translation [25]. The ultimate goal of this procedure is to increase the number of samples by paraphrasing using the translation module. The final system is trained using both the parallel synthetic corpus and the original training data. Although back-translation is an easy-to-use technique, it necessitates the training of a machine-translation model for low-resource languages, which may not be a viable option given the required volume of data. Ref. [26] showed through experiments that sampling and noisy beam outputs (delete, swap, and replace words) are better for generating synthetic data than pure beam and greedy search. Ref. [6] introduced EDA (Easy Data Augmentation), a set of augmentation techniques consisting of synonym replacement, random insertion, random swap, and random deletion. The processes were executed and benchmarked on five distinct datasets. The authors conducted experiments with an augmentation parameter $\alpha \in \{0.05, 0.1, 0.2, 0.3, 0.4, 0.5\}$ and discovered that small $\alpha$ values provided greater gains than large values. The same work was expanded by [27] to include two additional datasets for examining the impact of data augmentations using pre-trained language models (BERT, XLNet, and RoBERTa). EDA and back-translation are two task-independent data augmentation techniques. The authors reported that these data augmentation methods do not provide any consistent improvement for pre-trained transformers. They attributed this phenomenon to large-scale, unsupervised, domain-spanning pre-training, although all datasets utilised in the study were English-based.
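Two of the EDA operations described above, random swap and random deletion, can be sketched as follows. This is a minimal illustration rather than the implementation from [6]: the helper names, the English example sentence, and the rule that at least one word is changed per sentence are assumptions made for this example.

```python
import random


def n_changes(words, alpha):
    """Number of words to change, taken as alpha * sentence length (assumed minimum of 1)."""
    return max(1, int(alpha * len(words)))


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "very good pizza and fast delivery".split()
alpha = 0.1  # a small value, in line with the finding that small alpha works better

swapped = random_swap(sentence, n_changes(sentence, alpha), rng)
deleted = random_deletion(sentence, alpha, rng)
```

Random swap only reorders the tokens and random deletion only removes them, so neither operation introduces new vocabulary.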

Consistency training is based on the premise that small changes or noise in the input should not impact model predictions. Ref. [28] used data augmentation in place of noise signals to enforce consistency constraints during training. The overall loss consisted of classification losses and consistency losses between the original input and its augmented version. The consistency loss is only computed for instances in which the model has high confidence. The authors used back-translation, RandAugment (for image classification), and TF-IDF word replacement for augmentations. A data filter was implemented within the domain to prevent domain mismatch.

Ref. [29] proposed the first method for classifying the sentiment of tweets using emoticons as remote supervisors. The technique was based on the premise that the emoticons “:)” and “:(” (and their variants) are weak (noisy) indicators of positive and negative emotions. Therefore, each tweet containing these emoticons was tagged with its respective class. There was an assumption that the statements in Wikipedia and newspaper headlines were neutral. The neutral class was not classified because it had no emoticons associated with it. The dataset was used to train the machine learning algorithms Naive Bayes, Maximum Entropy (MaxEnt), and Support Vector Machines (SVM). The entire setup was studied using English as the study language. Ref. [30] compared multiple data augmentation strategies (such as WordNet and BERT-based) for the generation of news headlines in Croatian, Finnish, and English. In addition to ROUGE, the authors employed two additional methods to assess the performance score. One technique was the computation of semantic similarity using a sentence transformer trained in the task of paraphrasing. The second method employed a metric based on natural language inference to quantify the similarity between the original and generated headlines. The authors did note that there was no NLI model covering Croatian and Estonian. The other branch of data augmentation directly focuses on the latent space. Training as a whole aims to add new latent information without altering the original class representation. This enables difficult-to-input semantic cases with limited training data to be induced. Ref. [31] proposed that difficult-to-classify samples are the best candidates for data augmentation because they contain more information. Latent space augmentations were created using interpolation, extrapolation, noise addition, and the difference transform. Table 1 presents a summary of all the aforementioned approaches.

Techniques dependent on external knowledge bases [18] encounter challenges in disambiguating and resolving contexts for a singular matched item. This introduces noisy labels, which impact the system’s accuracy. Ref. [29] faced challenges with noisy text and informality, as well as the effect of emoticons as labels. Methods employing NMT presume the existence of an NMT system and a substantial monolingual corpus within the domain. The NMT system generates noisy back-translations mostly characterised by lexical inaccuracies [25]. Previous research indicates that sentiment analysis using augmented data for low-resource languages has received little attention.

Morphology is the study of the structure of words and the process by which they are assembled from smaller elements, known as morphemes [32]. In a number of low-resource languages, the morphological features of words, such as their tense, case, and number, are substantially modified according to their grammatical function within a sentence. These transformations have the potential to substantially alter the form of words, which poses a challenge for models that were trained on smaller datasets. Inflection systems are a component of morphological structure and the process by which words alter their form to convey various grammatical categories, such as tense, mood, or number. For instance, in highly inflected languages, a single word can take on numerous forms based on its function in a sentence, which complicates the process of generalising machine learning models across various forms of the same word. In the absence of sufficient data to account for these variations, models may encounter difficulty in generalising, which may result in inaccurate classifications. The situation is further complicated because the grammars of these languages are not simple and their morphological and inflectional systems are complex.

4. Data

This study employed a mixed-methods research approach, combining both qualitative and quantitative methods to provide a comprehensive understanding of the research phenomenon. The quantitative component of the study entailed the collection and modelling of a dataset, which provided a comprehensive understanding of performance. In contrast, the qualitative component involved an in-depth analysis of the predictions from the trained classification systems, which offered contextualised and nuanced perspectives on the research phenomenon. To address the research questions, this mixed-methods approach was considered necessary, as it enabled the triangulation of data modelling and the validation of findings through error analysis, thereby enhancing the reliability and validity of the results.

We used sentiment classification datasets to answer our research questions, employing existing datasets from the previous studies. However, we targeted only low-resource languages in our experiment: Bulgarian, Croatian, Slovak, and Slovene. A single dataset was selected for each language in the study. In Table 2, the sizes of the original training, development, and test dataset splits are displayed.

4.1. Croatian Re-Annotation

The authors of the Pauza dataset [33] eliminated reviews with a rating between 2.5 and 4.0 because these reviews were noisy. Therefore, ratings below 2.5 are considered negative, whereas ratings above 4.0 are considered positive. The reviews with ratings ranging from 2.5 to 4.0 include instances where the text is positive but the rating might tag it as a negative instance, and vice versa. We hypothesise that this might lead to semantic drift, meaning that the model might learn to classify instances incorrectly. Our methodology involves artificially augmenting data using multiple techniques; however, a text with contradictory labels, when excessively augmented, may hinder the model’s learning process. Hence, we re-annotated our Croatian dataset. We re-evaluated the reviews with ratings between 2.5 and 4.0 and asked three native speakers to annotate them. Annotators were asked to classify the given text as positive, negative, or neutral/mixed. Only two annotators managed to complete the annotation of all the provided instances. Instances without consensus were filtered out: nine instances were not included in the final set because the annotators did not reach agreement on them.
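The labelling rule and the consensus filter described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, and consensus is taken here to mean that all available annotations for an instance are identical.

```python
def rating_to_label(rating):
    """Map a star rating to a polar label, or None when the review needs re-annotation."""
    if rating < 2.5:
        return "negative"
    if rating > 4.0:
        return "positive"
    return None  # rating between 2.5 and 4.0: considered noisy, sent to annotators


def resolve(annotations):
    """Return the agreed label, or None when the annotators disagree (assumed rule)."""
    return annotations[0] if len(set(annotations)) == 1 else None


# Hypothetical annotations from the two annotators who completed the task.
agreed = resolve(["neutral/mixed", "neutral/mixed"])
dropped = resolve(["positive", "neutral/mixed"])
```

Instances for which `resolve` returns `None` correspond to the reviews that were filtered out for lack of consensus.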

4.2. Sentiment Analysis Datasets

This section provides a detailed overview of the dataset’s characteristics, including size, source, and distribution across different sentiment classes, which form the foundation for training the sentiment classification models.

Bulgarian: The Cinexio [34] dataset is composed of film reviews with 11-point star ratings: 0 (negative), 0.5, 1, …, 4.5, 5 (positive). Other meta-features included in the dataset were film length, director, actors, genre, country, and various scores.
Croatian: Pauza [33] contains restaurant reviews from Pauza.hr, the largest food-ordering website in Croatia. Each review is assigned an opinion rating ranging from 0.5 (worst) to 6 (best). User-assigned ratings are the benchmark for the labels. The dataset also contains opinionated aspects.
Slovak: The Review3 [35] dataset is composed of customer evaluations of a variety of services. The dataset is categorised using the 1–3 and 1–5 scales.
Slovene: The Opinion corpus of Slovene web commentaries KKS 1.001 [36] includes web commentaries on various topics (business, politics, sports, etc.) from four Slovene web portals (RtvSlo, 24ur, Finance, Reporter). Each instance within the dataset is tagged with one of three labels (negative, neutral, or positive).

The following two sections explain the overall methodology: data generation and model training. First, we used tools for natural language processing and data augmentation to create samples of the data. Then, we used the samples to train a transformer-based classification model.

4.3. Data Generation and Augmentation

To answer the questions posed in earlier sections, we utilised four simple language processing techniques and three existing data augmentation methods. The aforementioned existing data augmentation strategies are used in adversarial attacks against trained classification models and can be utilised to obtain samples that are more semantically similar to the original dataset. Next, we describe the individual techniques for augmenting data and the overall procedure for augmenting and training the classifier.

$Data_{lemma}$ based on lemmatisation.
$Data_{expanded}$ based on sentence tokenisation [ours].
$Data_{expanded-combined}$ based on sentence tokenisation [ours].
$Data_{expanded-permuted}$ based on sentence tokenisation [ours].
WordNet [22].
Masked Language Model (MLM)-based CLARE [24].
Causal Language Model (CLM)-based Generative Pre-trained Transformer (GPT)-2 [37].

4.4. Lemmatisation

Lemmatisation performs a morphological analysis and returns the word’s morphological base, i.e., the canonical form of the original word. Since South Slavic languages are rich in morphology, we decided to create a lemma-form variant of the original dataset. Previous studies [38,39] fed lemmas as input features into machine learning classification algorithms (such as Support Vector Machines and Random Forests). Transformer-based models use byte-pair encoding to reduce the vocabulary size, which is required to avoid sparse vector representations of the input text.

For instance, the word running is converted to run + ##ing and the neural network learns to weight individual byte-pairs based on the dataset and the requirements of the task. Therefore, the affixes may be useful for tasks that take the additional information into account. However, this requirement has not been looked at in pre-trained models with languages that are rich in morphology, or for sentiment analysis in particular. We made a lemmatised version of the original dataset to see how lemmatisation affects the final performance of a language model that has already been trained.

Original HR: super, odlicni cevapi.
Lemmatised: super, odličan ćevap.

4.5. Expansion [Ours]

Every labelled instance $D^{i}$ from the train-set, i.e., the document or text, consists of one or more sentences $D_{1..n}^{i}$ and a single instance $D^{i} \in L$, where $L$ can be positive, negative, or neutral/mixed.

$$D^{i} = D_{1..n}^{i} \tag{1}$$

$$D^{i} \in L \tag{2}$$

$$D_{1..n}^{i} \in L \tag{3}$$

$$D^{1}, D^{2}, \ldots, D^{n} \in L \Rightarrow D^{i} \in L \tag{4}$$

From (4), it follows that each of the sentences ($D^{1}, D^{2}, \ldots, D^{n}$) of a single training instance can be weakly assumed to be labelled with the same class. Therefore, every sentence from a review can be individually treated as a new labelled instance. For example:

(Original HR): “Pizze Capriciosa i tuna, dobre. Inače uvijek dostava na vrijeme i toplo jelo”.
(Translated EN): “Pizza Capricios and tuna, good. Otherwise always delivery on time and hot food”.

This example belongs to the positive class, and individual sentences may be treated as reviews of the positive class. Theoretically, this assumption may hold true for extremely polar classes, such as positive and negative, but may fail for classes that are mixed or neutral. The mixed and neutral instances are indistinguishable. A mixed review consists of both positive and negative elements that are either connected by a conjunction or presented as two distinct phrases. There is no clear mechanism to differentiate between the positive and negative components. As the polar components are indiscernible without further processing, employing a positive statement from a mixed review and exponentially augmenting it would immediately lead to the inclusion of positive instances for the case of mixed classes in varying proportions. This would eventually lead to the misrepresentation of mixed classes during the training of neural networks. In practice, we are also presented with instances in which the service was poor, but the reviewer still awarded a high rating due to previous positive experiences.
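The expansion step can be sketched as follows, using the review quoted above. This is a minimal illustration: the regular-expression sentence splitter and the `expand` helper are assumptions made for this example and stand in for a proper sentence tokeniser.

```python
import re


def expand(review, label):
    """Split a review into sentences and give each one the parent label (weak labelling)."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", review)
    return [(s.strip(), label) for s in parts if s.strip()]


review = "Pizze Capriciosa i tuna, dobre. Inače uvijek dostava na vrijeme i toplo jelo."
expanded = expand(review, "positive")
# Each sentence of the positive review becomes its own positive training instance.
```

Applying the same helper to a neutral/mixed review would copy the parent label onto sentences that may be purely positive or purely negative, which is the failure mode discussed above.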

4.5.1. Expansion-Combination [Ours]

Based on the previous technique for expansion, we propose a straightforward extension. Assuming that all individual sentences from all reviews for a given class also belong to the same parent class, we can now create a brand-new dataset by randomly sampling from this set of individual sentences. Here, we consider the entire $D_{1..n}^{i}$ range to be the universal set. We obtained the new dataset by ordering the instances using combinations, as denoted by Equation (5). For a more intuitive explanation, assume A, B, C, and D to be four positive sentences from various positive reviews. Combination ordering produces a new sampled dataset represented by combinations('ABCD', 2) --> AB AC AD BC BD CD. “Elements are treated as unique based on their position, not on their value. So, if the input elements are unique, there will be no repeat values in each combination” [40]. This indicates that AB and BA will not both be present in the final sampled dataset.

$${}^{n}C_{k} = \frac{n!}{k!\,(n - k)!} \quad \text{(combination)} \tag{5}$$

4.5.2. Expansion-Permutation [Ours]

We also propose a second simple method that replaces the previous combination sampling with a permutational process. Mathematically, this is denoted by Equation (6), in which the universal set of individual sentences belonging to a single class can be combined, as depicted by permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC. The permutation tuples are returned in lexicographic order according to the order of the input iterable. Therefore, if the input iterable is sorted, the output permutation tuples will also be sorted. “Elements are treated as unique based on their position, not on their value. So if the input elements are unique, there will be no repeat values in each permutation” [41]. In other words, AB and BA will represent two distinct instances of the generated dataset.

$${}^{n}P_{k} = \frac{n!}{(n - k)!} \quad \text{(permutation)} \tag{6}$$
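The difference between the two sampling schemes can be illustrated with the Python standard library. This is a sketch only: the four placeholder sentences and the choice to join each pair with a space are assumptions made for this example.

```python
from itertools import combinations, permutations
from math import comb, perm

# Four placeholder sentences assumed to come from reviews of the same class.
sentences = ["A", "B", "C", "D"]

# Expansion-combination: unordered pairs, so "A B" appears but "B A" does not.
combined = [" ".join(pair) for pair in combinations(sentences, 2)]

# Expansion-permutation: ordered pairs, so "A B" and "B A" are distinct instances.
permuted = [" ".join(pair) for pair in permutations(sentences, 2)]

# The dataset sizes match Equations (5) and (6) for n = 4 and k = 2.
n_combined = comb(len(sentences), 2)
n_permuted = perm(len(sentences), 2)
```

For the same set of sentences and k = 2, the permuted dataset is twice the size of the combined one, because every unordered pair yields two ordered pairs.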

4.5.3. WordNet Augmentations

WordNet [14,42,43,44] provides a straightforward formal synonym model for locating replacement words in context. This method replaces words in a given text with their synonyms. The assumption that a word’s synonym will not affect the polarity of the given instance makes this one of the most straightforward data augmentation techniques. Synonyms are derived from synsets by querying WordNet with candidate keywords. The synset includes words with equivalent meanings. Notably, the word being searched may belong to multiple synsets, necessitating additional processing, such as word-sense disambiguation, to prevent incorrect synset selection (due to the limited resources available, we did not pursue a more sophisticated synset selection).

(1): Lemma HR: Jako dobar pizza. (Translation: very good pizza.)
(2): Augmented HR: jako divan pizza.
(3): Augmented HR: jako krasan pizza.

Here, the word dobar (“good”) has been replaced with its synonyms, ‘divan’ and ‘krasan’. WordNet’s entries are in lemmatised form, which is an important detail to note. Therefore, in order to obtain more results for the words in context, they must be lemmatised. The lemma can then be used to retrieve the synonym set. The retrieved results are also in lemma form. Lemmatisation is not strictly necessary, however: even without it, we can still obtain a significant number of terms to replace the words in the dataset. This is illustrated by the following examples:

(1): HR: Jako dobra pizza i brza dostava. (Translation: Very good pizza and fast delivery.)
(2): Augmented HR: Jako dobra pizza i brza dostavljanje.
(3): Augmented HR: Jako dobra pizza i brza doprema.

To prevent semantic drift, no additional relations were employed. To implement a custom WordNet augmentor for each of the languages (Bulgarian, Croatian, Slovak, and Slovene), we used the textattack (https://github.com/QData/TextAttack, accessed on 21 July 2022) library, and derived a new class from the Augmentor (https://tinyurl.com/wz85rf43, accessed on 21 July 2022) base class. In the augmentor, we introduced constraints to prevent modifications to stopwords and to words that had already been modified. Based on the recommendation reported by [6], the pct-words-swap parameter (i.e., percentage of words to swap) was set to 0.05, limiting the number of words that were to be replaced with synonyms. The number of augmentations per instance was set at 16. We used Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw, accessed on 24 July 2022) to find synonym replacements.
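The constrained synonym replacement described above can be sketched in plain Python as follows. This is not the textattack-based implementation used in the study: the synonym table and stopword set are toy stand-ins for an Open Multilingual WordNet lookup, the function name augment is hypothetical, and swapping at least one word per instance is an assumption made so that short reviews still produce output.

```python
import random

# Toy stand-ins (assumptions): a lemma -> synonyms table instead of a real
# WordNet query, and a tiny stopword list.
TOY_SYNONYMS = {
    "dobar": ["divan", "krasan"],
    "dostava": ["dostavljanje", "doprema"],
}
TOY_STOPWORDS = {"i", "jako"}


def augment(tokens, synonyms, stopwords, pct_words_swap=0.05, n_aug=16, seed=0):
    """Return up to n_aug distinct variants with a few words swapped."""
    rng = random.Random(seed)
    # Constraint 1: never touch stopwords; only words with a known synonym.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok.lower() not in stopwords and tok.lower() in synonyms
    ]
    if not candidates:
        return []
    # Assumption: swap at least one word even when pct * length rounds to 0.
    n_swap = max(1, round(pct_words_swap * len(tokens)))
    variants = set()
    for _ in range(n_aug * 4):  # a few extra draws to collect distinct variants
        new_tokens = list(tokens)
        # Constraint 2: sampling distinct positions means no word is
        # modified more than once within a variant.
        for i in rng.sample(candidates, min(n_swap, len(candidates))):
            new_tokens[i] = rng.choice(synonyms[tokens[i].lower()])
        variants.add(" ".join(new_tokens))
        if len(variants) >= n_aug:
            break
    return sorted(variants)


print(augment(["jako", "dobar", "pizza"], TOY_SYNONYMS, TOY_STOPWORDS))
```

With such a small swap percentage, most short reviews yield only a handful of distinct variants, which is one reason the final size of a WordNet-augmented training set depends on the original text.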

4.6. Language Tools

Each dataset for each of the four languages was required to undergo tokenisation, part-of-speech tagging, and lemmatisation. The Classla (https://github.com/clarinsi/classla, accessed on 26 July 2022) library was used for processing Bulgarian, Croatian, and Slovene, while the Stanza (https://stanfordnlp.github.io/stanza/, accessed on 26 July 2022) library was utilised for Slovak (https://huggingface.co/stanfordnlp/stanza-sk, accessed on 26 July 2022). We used the tokenised and lemmatised data to generate the lemmatised (Data_{lemma}) and expanded (Data_{expanded}) versions of the dataset. The expanded version was converted into Data_{expanded-combined} and Data_{expanded-permuted} by combining two individual sentences into a single training instance via sampling.

4.7. MLM Augmentations

CLARE (ContextuaLized AdversaRial Example) [24] is an adversarial attack text generation technique. In this method, each word in the given sentence is greedily masked, followed by an infill procedure that is used to obtain a replacement word for the masked word. The method permits data augmentation through the replace, insert, and merge operations. This method makes locally optimal choices, which may not always lead to globally optimal solutions, as it replaces all the words in a sentence with substitutes. This typically results in augmentations with a different semantic meaning than the original, so it relies on multiple constraints to generate meaningful data. These constraints eliminate augmentations that do not meet the given criteria. One of these constraints checks the semantic similarity between the augmented sentence and the original input. Using a neural network already trained on sentence similarity, cosine distance (i.e., 1 − cosine similarity) can be used to compute the semantic similarity in its most basic form. This distance ranges from 0 to 2, where a value of 0 indicates that the vectors point in the same direction (i.e., the angle between them is 0°). A value of 1 indicates that the vectors are orthogonal (i.e., the angle between them is 90°). A value of 2 indicates that the vectors are diametrically opposed (i.e., the angle between them is 180°) [45]. To compute the similarity between the encoding of original sentences and augmentations, the authors utilised the Universal Sentence Encoder, a text encoder model that maps variable-length English input to a fixed-size 512-dimensional vector. In addition to the encoding model, there are dataset-dependent parameters such as minimum confidence, window size, and maximum candidates. To prevent semantic drift due to arbitrary deletions and insertions, we only used the Replace method.
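The three reference values of the cosine distance can be checked with a short Python sketch. The two-dimensional vectors are toy inputs chosen for illustration; in practice, the vectors would be sentence embeddings, and the function name cosine_distance is an assumption.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # angle of 0 degrees
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])       # angle of 90 degrees
opposed = cosine_distance([1.0, 0.0], [-1.0, 0.0])         # angle of 180 degrees
print(same_direction, orthogonal, opposed)
```

Note that the distance depends only on the angle between the vectors, so two vectors of different length that point in the same direction still have a distance of 0.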

(1): HR: Ne narucivat chilly. (Translation: Do not order chilly.)
(2): Augmented HR: Ne narucivat meso. (Translation: Do not order meat.)

Initially, we compared each augmentation to the original sentence using a second pre-trained language model. The CLARE authors suggested the Universal Sentence Encoder for this purpose. The multilingual Universal Sentence Encoder (https://tfhub.dev/google/universal-sentence-encoder-multilingual/3, accessed on 28 July 2022) has been trained on 16 languages, but none of them is South Slavic; as a result, it was not a good candidate for encoding our data. Consequently, we utilised LaBSE (https://tfhub.dev/google/LaBSE/2, accessed on 28 July 2022), which has been trained on 109 languages. We used cosine similarity as the similarity measure and eliminated all augmentations whose cosine similarity with the original sentence was less than 0.80. The aim was to obtain augmentations that share the class label of the original sentence because of their similar meaning. We implemented a custom MLM-CLARE augmentor with these constraints using the CLARE (https://tinyurl.com/wz85rf43, accessed on 28 July 2022) base class from the textattack library. The percentage of exchanged words was set at 0.5 percent. For Croatian, MLM augmentations were performed using a variety of pre-trained language models, including EMBEDDIA/crosloengual-bert, Andrija/SRoBERTa-F, macedonizer/hr-roberta-base, and classla/bcms-bertic. In terms of perplexity score, EMBEDDIA/crosloengual-bert, xlm-roberta-base, and Andrija/SRoBERTa-F performed the best. Ultimately, EMBEDDIA/crosloengual-bert was selected after examining its augmented output. Similar procedures were repeated for the other languages.
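The similarity constraint can be sketched as a simple filter. In the sketch below, embed is a placeholder for a sentence encoder such as LaBSE; the lookup table of two-dimensional vectors and the sentence labels are invented for illustration and carry no information about real embeddings. Only the 0.80 threshold comes from the text.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def filter_augmentations(original, candidates, embed, threshold=0.80):
    """Keep candidates whose similarity to the original is at least threshold."""
    reference = embed(original)
    return [c for c in candidates if cosine_similarity(reference, embed(c)) >= threshold]


# Toy encoder (assumption): a fixed table standing in for a real model.
TOY_VECTORS = {
    "original": [1.0, 0.0],
    "close paraphrase": [0.9, 0.3],
    "unrelated sentence": [0.2, 1.0],
}
kept = filter_augmentations(
    "original", ["close paraphrase", "unrelated sentence"], TOY_VECTORS.__getitem__
)
print(kept)
```

A filter of this kind is only as reliable as the encoder behind it: if the encoder scores a polarity-flipping rewrite as highly similar, the rewrite passes the constraint and inherits the wrong label.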

4.8. CLM Augmentations

Causal language models such as GPT-2 perform competitively on language generation tasks. During training, the model is tasked with predicting the next word in a text sequence, so it learns to generate the next suitable word based on the previous words or context. During the inference stage, the model is fed an initial prompt and instructed to predict the next word. The entire procedure can easily be used to generate training resources for a model. This method was reported by [37] using a small supervised English dataset. Typically, a single model is trained with data from multiple classes in such a way that the generated text depends on the label. For instance, to generate a positive review, we presented the model, during training, with the start token, class label, and text (i.e., ‘<|startoftext|> <|review pos|> WHOLE TEXT <|endoftext|>’). During inference, only a few initial words (such as ‘<|startoftext|> <|review pos|> PROMPT-TEXT’) are needed to produce the entire text. Using a single model to generate data for all classes is possible when a large amount of data is available. After training in this setting, we noticed that the model began to generate negative reviews for the mixed/neutral class. Consequently, we trained three distinct models, one for each class. Because each class has its own model, each model can only generate text for its own class. We decided to use nouns as prompts to capture the overall context during the generation process, since nouns name what is discussed in the reviews. Typically, the context is food, such as pizza or risotto, or a service, such as delivery. Using morphosyntactic (MSD) tags, we extracted all nouns from the dataset. The nouns were manually inspected for pipeline-annotated false-positive artefacts. The obtained nouns were then used as inputs for the three fine-tuned GPT-2 models to generate the datasets.

(1): HR: naručili salatu, dostava je bila na vrijeme, dostavljac simpatican.
(2): translation: pizza arrived, no complaints just ordered a salad in advance, delivery was on time, the delivery man was nice.

Using the original and WordNet-augmented datasets, we fine-tuned three distinct GPT-2 models, one for each of the three classes, so that each model generates positive, negative, or mixed reviews. For the purpose of training the language generator, we eliminated all reviews longer than five words. We utilised GPT-2 models pre-trained in the respective languages as the initial backbone. We fine-tuned each model for the language generation task using a learning rate of 0.001, 1 epoch, a batch size of 4, and 1000 warm-up steps. For text generation, we employed beam search with five beams and a penalty for bi-gram repetition. Using this method, we created three datasets of increasing size so that we could study corpus size as a variable.
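The label-conditioned text format described in Section 4.8 can be sketched as two small helpers. The sketch assumes that the control tokens are written as <|startoftext|>, <|review LABEL|>, and <|endoftext|>; the function names are hypothetical, and the sketch only builds strings without training or calling any model.

```python
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"


def label_token(label):
    """Control token that encodes the class, e.g. 'pos' -> '<|review pos|>'."""
    return f"<|review {label}|>"


def training_instance(label, text):
    """Full sequence seen during fine-tuning: start token, label, text, end token."""
    return f"{BOS} {label_token(label)} {text} {EOS}"


def inference_prompt(label, noun):
    """Prefix given at generation time; the model continues after the noun."""
    return f"{BOS} {label_token(label)} {noun}"


print(training_instance("pos", "Jako dobra pizza i brza dostava."))
print(inference_prompt("pos", "pizza"))
```

When one model is fine-tuned per class, the label token is redundant for that model but keeps the data format identical across the single-model and per-class settings. In the Hugging Face transformers library, beam search with five beams and a bi-gram repetition penalty would typically be requested through the num_beams=5 and no_repeat_ngram_size=2 arguments of generate(); whether the authors used exactly these arguments is not stated.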

4.9. Experiments

Using a transformer-based classifier, we compared the efficacy of various data generation methods. Two distinct dataset versions were created: two-class, which is the binary version (positive and negative), and three-class, which is the ternary version (positive, negative, or neutral; we refer to the class as neutral despite the fact that it consists of both positive and negative elements). Using the various training sets, the parameters of the entire network were optimised. We trained a separate model for each language in the study and for each dataset generated using the previously described methods (including the original dataset) while maintaining the same network parameters. When the dataset was not balanced, labels from the training set were used to determine the class weights, which were then used as the rescaling weight parameter in the cross-entropy loss. This imposes a greater penalty when an instance of a class with few examples is misclassified. We trained the model with a learning rate of 1 × 10⁻⁵, a weight decay of 0.01, and early stopping on validation loss with a patience of four to five epochs. The class probabilities were calculated using a softmax classifier. We report the final scores on the original, manually curated test set associated with each dataset. Table 3 presents the various transformer-based models used for MLM and CLM augmentations. We utilised the “unsloth/gemma-7b-bnb-4bit” model to perform instruction fine-tuning on all datasets under examination. This is a multilingual large language model and is a quantised version of Gemma-7b [46].
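The class-weighting step can be sketched as follows. The source does not give the weighting formula, so the sketch assumes the common "balanced" heuristic, n_samples / (n_classes × class_count); the function names and the toy label distribution are likewise illustrative.

```python
import math
from collections import Counter


def class_weights(labels):
    """Balanced weights (assumed formula): rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def weighted_nll(probs, gold, weights):
    """Weighted cross-entropy for one instance, given softmax probabilities."""
    return -weights[gold] * math.log(probs[gold])


# Toy imbalanced training labels (assumption).
train_labels = ["pos"] * 6 + ["neg"] * 3 + ["neu"] * 1
weights = class_weights(train_labels)
print(weights)

# The same confidence in the gold class costs more for the rare class.
probs = {"pos": 0.5, "neg": 0.3, "neu": 0.2}
print(weighted_nll({"pos": 0.2, "neg": 0.2, "neu": 0.6}, "pos", weights))
print(weighted_nll({"pos": 0.2, "neg": 0.6, "neu": 0.2}, "neu", weights))
```

In PyTorch, per-class weights of this kind are passed through the weight argument of torch.nn.CrossEntropyLoss, which matches the "rescaling weight" wording used above.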

4.10. Training Set Size

Table 4 displays the final distribution of the original, expanded–combined, and expanded–permuted datasets. For the expanded–combined and expanded–permuted datasets, we varied the training set by sampling 10k, 20k, and 40k instances for each class. In the cases of WN, MLM, and CLM, the augmentation methods affected the final size of the training set, as the process of augmentation is influenced by several factors, including the nature of the original text, the matching of the words, WordNet, and semantic constraints. We obtained 10,000 and 20,000 (and, in some cases, 25,000 and 40,000) samples to be trained and tested for all languages, except for Bulgarian, where the number of instances remained low.

5. Results and Discussion

Our findings indicate that augmentation methods do not contribute directly to sentiment classification. We found that the performance of augmentations based on pre-trained contextualised language models is inferior to that of methods constructed by combining multiple datasets from the same and different languages. Factors that indirectly affect the final classification score include noisy text and code-mixing. In addition, we found that WordNet-based augmentations are more effective than those based on masked language models or causal language models. In seven instances, the expansion–permutation–combination technique resulted in an improvement. The results of the experiments are shown in Table 5, Table 6 and Table 7. The F1-score and accuracy values for the original, lemma, and expanded versions are shown in Table 5. The results of all the experiments for all the languages are shown in Figure 1, Figure 2, Figure 3 and Figure 4. The performance of the original version of the dataset was superior to that of the other two versions.

The performance of the binary-lemmatised version was 1% worse than that of the original dataset. This performance decline is greater in a three-class setting. This demonstrates that the pre-trained models, in this case, XLM-R, which were trained on unprocessed text, prefer a grammatically correct form over a lemma form for the given text. We conclude that non-lemmatised data should be used when using pre-trained models like XLM-R. Similarly, separating reviews into individual sentences and using them for training did not lead to a better performance than the other two settings. In conclusion, treating opinionated text as a sum of parts does not make any contribution to training classification models. In addition, we compared the scores obtained with augmentation techniques with the scores obtained with a large language model, i.e., Gemma. The Gemma model provided higher overall scores than other models without any additional data.

In all languages except Croatian, the *nary-original and *nary-lemmatised settings outperformed the simple expansion technique. The results of using permuted and combined versions of the datasets are presented in Table 6. Based on the data presented in the table, using the 20k/class version of the dataset yielded a slight improvement in the F1 score for Croatian compared to the original training dataset. There were no significant changes for Bulgarian. For Slovak, the expanded–permuted 10k/class version produced a four-point improvement in binary classification, but no improvement was observed for ternary classification. The performance for Slovene decreased when permuted and combined versions of the dataset were utilised. Except for Slovak, all other languages scored higher on the expanded–combined training set.

According to the data in Table 7, training on the three augmented datasets did not improve the final classification scores. Some cells in the table were left blank because the augmentation technique did not generate the required number of training instances. In the final column, we present the scores for the data points for each class that were either less than 10,000 or greater than 40,000. We performed approximate randomisation tests [47] using the sigf package with 10,000 iterations to determine the statistical significance of differences between the models. For all the languages, none of the models showed a statistically significant improvement (p < 0.05) in score compared to the model trained with the original data. Our findings related to the MLM-based DA techniques are very similar to the ones for Norwegian reported by [48]. The authors indicate that augmentation strategies frequently yield gains; nevertheless, the impacts are moderate, and the significant volatility complicates the ability to draw definitive conclusions.
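The idea behind a paired approximate randomisation test can be sketched in plain Python. This is a generic illustration and not the sigf package: it assumes the number of correct predictions as the test statistic (the study reports F1 and accuracy), and the function names, toy predictions, and seed are all illustrative.

```python
import random


def n_correct(gold, pred):
    return sum(g == p for g, p in zip(gold, pred))


def approximate_randomisation(gold, pred_a, pred_b, iterations=10_000, seed=0):
    """Two-sided p-value for the difference between two systems' outputs.

    Under the null hypothesis the systems are interchangeable, so each
    paired prediction is swapped with probability 0.5 and the statistic
    is recomputed on the shuffled outputs.
    """
    rng = random.Random(seed)
    observed = abs(n_correct(gold, pred_a) - n_correct(gold, pred_b))
    at_least_as_extreme = 0
    for _ in range(iterations):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(n_correct(gold, shuffled_a) - n_correct(gold, shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (iterations + 1)


gold = [1] * 20
p_same = approximate_randomisation(gold, [1] * 20, [1] * 20, iterations=1000)
p_diff = approximate_randomisation(gold, [1] * 20, [0] * 20, iterations=1000)
print(p_same, p_diff)
```

Two systems with identical outputs give a p-value of 1, whereas a system that is right on every instance where the other is wrong gives a p-value close to the smallest value the number of iterations allows.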

5.1. Error Analysis

For the best scoring models, we randomly sampled incorrectly classified instances from the test set for each language. We manually examined the cases and present a summary of the results. A majority of the issues encountered throughout the evaluations were previously reported in other studies [49].

5.1.1. Text Accompanied by Additional Context

In this category of incorrectly classified instances, the statement begins with a premise or speculation (I believe it will be good) and ends with the user’s opinion (But I did not like it). Alternatively, the text might start with an opinion and then move on to speculation. The additional information may or may not justify the users’ feelings. The user discusses audience members leaving the theatre in the following example, then he provides his own review. The original label of the review is positive, but the predicted label is negative.

(Original BG) Пoлoвината салoн си тръгна на 30тата минута. Аз следя сериала oт кактo гo има и филма ми хареса.
(Transliteration BG) Polovinata salon si trŭgna na 30tata minuta. Az sledya seriala ot kakto go ima i filma mi kharesa.
(Translation EN) Half the salon left at the 30 min mark. I’ve been following the series since it started and I liked the movie.
Original label: positive; predicted: negative.

The sentence “I liked the movie” points to the final user sentiment, while the first sentence causes the model to predict the review to be negative.

5.1.2. Reviews with Aspect Ratings

In this type of text, each aspect is evaluated separately by the user. The current classifier fails to classify these formats, and a specialised process may be required to classify them.

(Original BG) 1 за декoрите … Начoсът заслужава 5.
(Transliteration BG) 1 za dekorite … Nachosŭt zasluzhava 5.
(Translation EN) 1 for the decorations … The nachos deserve a 5.
Original label: negative; predicted: positive.

5.1.3. Mixed Aspects

The majority of cases fall into this category. The text comprises a compound or a complex sentence with multiple targets.

(Original BG) Твърде мнoгo ненужнo пеене,нo всичкo oстаналo е супер!:)
(Transliteration BG) Tvŭrde mnogo nenuzhno peene, no vsichko ostanalo e super!:)
(Translation EN) Too much unnecessary singing, but everything else is great!:)
Original label: negative; predicted: positive.

5.1.4. Contradictory Expressions

The conflicting sub-parts of a sentence are presented as a single unit rather than a compound sentence, as in the previous error type.

(Original BG) Красив филм с безкрайнo несъстoятелен сценарий.
(Transliteration BG) Krasiv film s bezkraino nesŭstoyatelen stsenarii.
(Translation EN) A beautiful film with an endlessly unworkable script.
Original label: negative; predicted: positive.

The neutral/mixed-class instances in the Croatian test set have the highest number of misclassifications. We used the SHAP (https://github.com/shap/shap, accessed on 1 June 2022) (SHapley Additive exPlanations) tool to observe and study the model predictions. The text of binary-classified reviews consists of only positive or negative words. When used with the Transformer encoder, these polar words receive a heightened focus, which ultimately determines whether the final classification is positive or negative. In the case of the mixed class, the text is composed of both positive and negative polar words, with one group receiving a disproportionate amount of attention, resulting in an incorrect classification. We discovered that sentences containing ‘ali’ (“but”) were misclassified because the model could not identify compound sentences. As specified by [50], dealing with mixed-class sentences is difficult because the assumption that the document or sentence has a single target is false. Further examination of the test-set predictions and ground-truth labels yielded the following findings:

(1): Some reviews contain sentences that are lengthy. XLM-R accepts at most 512 tokens produced by its tokeniser, two of which are reserved for special tokens [16]. Because tokens beyond this limit are omitted, the model performs poorly when the text is exceedingly long. This phenomenon is notable in the Slovene and Croatian datasets.
(2): Cases in which the author gave the review a positive rating, but the text contains many unrelated negative statements. This occurs when the author rants about many other stores and writes one positive line about the target entity [50].
(3): We also found that the greater the distance between the negation cue and the scope of the negation, the less likely the model is to capture the negation. For example, “Pizza došla mlaka, i ne ukusna”, vs. “Pizza došla mlaka, i ne baš ukusna”, and “Pizza došla mlaka, i ne baš previše ukusna”. The first sample was correctly classified, but the second and third samples were not [51].
(4): People write negative reviews but rate the restaurant highly because they had a pleasant experience there [52].
(5): Code-mixing and English text in Croatian and Slovene [53].

Additionally, we observe that customers may rate the overall review positively even if something was missing from the delivery.

(1): Brza dostava, ok hrana. Jedino kaj su zaboravili coca colu :(. (Translation EN) Fast delivery, ok food. The only thing is that they forgot the Coca-Cola :(.
(2): Nisam vidjela pršut na pizzi special, al nema veze, vratina je bila sasvim dovoljna! (Translation EN) I did not see the prosciutto on the pizza special, but it does not matter, the pork neck was quite enough!
(3): Malo gumasto tijesto, inace OK pizza. (Translation EN) Slightly rubbery dough, otherwise an OK pizza.

The MLM augmentor generated “Treba narucivat chilly” (“One should order chilly”) as a valid augmentation for “Ne narucivat chilly” (“Do not order chilly”), despite the paraphrase constraints. This may be due to the LaBSE model incorrectly treating the two texts as paraphrases of one another. Therefore, improved constraints are recommended. For Slovak, we identified cases that contained positive phrases but were labelled neutral by the authors.

(1): Bol som veľmi spokojný. (Translation EN) I have been very satisfied.
(2): super super super. (Translation EN) Super Super Super.
(3): Bola veľmi príjemná a milá. (Translation EN) She was very pleasant and nice.
(4): Veľmi ústretová a ochotná. (Translation EN) Very helpful and willing.
(5): Bagety, ktoré som kúpila boli perfektné … ďakujem. (Translation EN) Baguettes I bought were perfect … Thank you.

In addition to classification errors, the following text-processing errors were observed. With the Classla package, errors are introduced at three stages (sentence tokenisation, lemmatisation, and POS tagging). For instance, garbled tokens are identified as nouns in the text, and improper sentence boundary detection also occurs. Typically, the user text lacks diacritics (e.g., naručivati -> narucivati). Therefore, processing is required to correct the spelling in order to reduce the number of failed WordNet lookups. The Bulgarian dataset consists of movie reviews with emoticons included in the text. This calls for an emoticon-aware tokeniser. Classla did not support the processing of non-standard text types for Bulgarian, so standard mode was used for sentence splitting, lemmatisation, and POS tagging. This is a potential entry point for errors.

5.2. Revisiting Research Questions

We can answer our research questions after conducting the experiments and analysing the data.

Can the data augmentation techniques improve the performance metric? According to our findings, using a pre-trained contextualised language encoder reduces the impact of an augmented dataset. As previously reported by [27], these transformer-based models are invariant to certain transformations, such as synonym substitution. This is attributable to the proximity of synonyms in the representation space of these encoders. Therefore, using synonyms obtained from WordNet or other sources and encoding them in these spaces does not result in a significant gain. The only way to improve performance is to generate novel linguistic structures that were not encountered during the Transformer model’s pre-training.

What is the effect of having augmented data generated from different techniques? We investigated three distinct data augmentation techniques in addition to three text expansion techniques. Comparing their performance reveals that training with augmented data does not lead to a performance improvement compared with training with the original dataset alone. Although binary class performance improved by a few points, this improvement was not consistent. In addition, increasing the size of the augmented data has little effect on the performance of the techniques.

Can WordNet-based augmentation techniques work better with sentiment classification tasks? Although WordNet-based augmentation techniques appear to be more effective than MLM and CLM-based techniques, they provided no significant improvement for the downstream task. Training with lemma-based instances decreased system performance by one point for binary classification but drastically decreased system performance for ternary classification. Also, as [28] pointed out, it is easy to improve the performance of binary sentiment classification by adding more data, but fine-grained classification faces the same problem as training on the whole dataset.

6. Conclusions

In summary, this investigation assessed the efficacy of data augmentation methodologies in enhancing sentiment analysis in low-resource languages, with a particular emphasis on Slovene, Slovak, Croatian, and Bulgarian. Our results suggest that traditional augmentation methods, such as WordNet-based synonym replacement, MLM-based augmentations, and sentence permutation and combination, provide limited benefits to model performance, particularly when transformer-based encoders are used. Although the results of the WordNet-based augmentation were marginally superior to those of other methods, none of the techniques achieved significant improvements over the original datasets. In practical terms, this implies that existing augmentation strategies may require modification to accommodate the distinctive complexities and linguistic variability in low-resource languages. In theory, these results suggest that more innovative methods, such as the development of syntactic diversity rather than lexical diversity, may be necessary to more accurately simulate real-world language use in order to effectively augment sentiment analysis in these languages. Therefore, future research should investigate innovative augmentation methods that integrate syntactic transformations and intricate language structures, as these have the potential to provide more significant enhancements in sentiment analysis in low-resource language contexts.

Author Contributions

Conceptualization, G.T.; Methodology, G.T.; Software, G.T.; Validation, G.T.; Formal analysis, G.T.; Investigation, N.M.P.; Resources, M.T.; Writing—original draft, G.T.; Writing—review & editing, N.M.P. and M.T.; Visualization, G.T.; Supervision, N.M.P. and M.T.; Project administration, N.M.P. and M.T.; Funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

The work presented in this paper received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2022. [Google Scholar]
Bengio, Y. Practical Recommendations for Gradient-Based Training of Deep Architectures. Neural Netw. Tricks Trade Second. Ed. 2012, 7700, 437–478. [Google Scholar] [CrossRef]
Halevy, A.; Norvig, P.; Pereira, F. The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 2009, 24, 8–12. [Google Scholar] [CrossRef]
Schreiner, C.; Torkkola, K.; Gardner, M.; Zhang, K. Using Machine Learning Techniques to Reduce Data Annotation Time. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Sydney, Australia, 20–22 November 2006; pp. 2438–2442. [Google Scholar] [CrossRef]
Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar]
Kobayashi, S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 452–457. [Google Scholar] [CrossRef]
Ribeiro, M.T.; Wu, T.; Guestrin, C.; Singh, S. Beyond Accuracy: Behavioral Testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4902–4912. [Google Scholar]
Wang, J.; Lan, C.; Liu, C.; Ouyang, Y.; Qin, T. Generalizing to Unseen Domains: A Survey on Domain Generalization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Montreal, QC, Canada, 19–27 August 2021; Zhou, Z.-H., Ed.; Survey Track. pp. 4627–4635. [Google Scholar] [CrossRef]
Zhang, X.; Zhao, J.; LeCun, Y. Character-Level Convolutional Networks for Text Classification. Adv. Neural Inf. Process. Syst. 2015, 28, 649–657. [Google Scholar]
Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
Zuo, X.; Chen, Y.; Liu, K.; Zhao, J. KnowDis: Knowledge Enhanced Data Augmentation for Event Causality Detection via Distant Supervision. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 1544–1550. [Google Scholar] [CrossRef]
Yang, Y.; Malaviya, C.; Fernandez, J.; Swayamdipta, S.; Bras, R.L.; Wang, J.; Bhagavatula, C.; Choi, Y.; Downey, D. Generative Data Augmentation for Commonsense Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1008–1025. [Google Scholar] [CrossRef]
Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Su, P.; Li, G.; Wu, C.; Vijay-Shanker, K. Using Distant Supervision to Augment Manually Annotated Data for Relation Extraction. PLoS ONE 2019, 14, e0216913.
Mintz, M.; Bills, S.; Snow, R.; Jurafsky, D. Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 2–7 August 2009; pp. 1003–1011.
Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181.
Li, L.; Ma, R.; Guo, Q.; Xue, X.; Qiu, X. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6193–6202.
Yoo, J.Y.; Qi, Y. Towards Improving Adversarial Training of NLP Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual, 16–20 November 2021; pp. 945–956.
Ren, S.; Deng, Y.; He, K.; Che, W. Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097.
Samanta, S.; Mehta, S. Towards Crafting Text Adversarial Samples. arXiv 2017, arXiv:1707.02812.
Li, D.; Zhang, Y.; Peng, H.; Chen, L.; Brockett, C.; Sun, M.; Dolan, B. Contextualized Perturbation for Textual Adversarial Attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5053–5069.
Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 86–96.
Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 489–500.
Longpre, S.; Wang, Y.; DuBois, C. How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 4401–4411.
Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; Le, Q. Unsupervised Data Augmentation for Consistency Training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268.
Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Proj. Rep. Stanf. 2009, 1, 2009.
Martinc, M.; Montariol, S.; Pivovarova, L.; Zosa, E. Effectiveness of Data Augmentation and Pretraining for Improving Neural Headline Generation in Low-Resource Settings. In Proceedings of the LREC 2022, Marseille, France, 20–25 June 2022.
Cheung, T.-H.; Yeung, D.-Y. MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021.
Goldsmith, J.; Riggle, J.; Yu, A.C.L. The Handbook of Phonological Theory; Wiley Online Library: Hoboken, NJ, USA, 1995.
Glavaš, G.; Korenčić, D.; Šnajder, J. Aspect-Oriented Opinion Mining from User Reviews in Croatian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, Sofia, Bulgaria, 8–9 August 2013; pp. 18–23.
Kapukaranov, B.; Nakov, P. Fine-Grained Sentiment Analysis for Movie Reviews in Bulgarian. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria, 7–9 September 2015; pp. 266–274.
Pecar, S.; Simko, M.; Bielikova, M. Improving Sentiment Classification in Slovak Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, Florence, Italy, 2 August 2019.
Kadunc, K.; Robnik-Šikonja, M. Opinion Corpus of Slovene Web Commentaries KKS 1.001; Slovenian Language Resource Repository CLARIN.SI. 2017. Available online: http://hdl.handle.net/11356/1115 (accessed on 19 July 2022).
Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, the Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; pp. 7383–7390.
Bollegala, D.; Weir, D.; Carroll, J.A. Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 132–141.
Gamon, M. Sentiment Classification on Customer Feedback Data: Noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of the COLING 2004: 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 841–847.
Itertools Combinations. 2022. Available online: https://docs.python.org/3/library/itertools.html#itertools.combinations (accessed on 26 July 2022).
Itertools Permutations. 2022. Available online: https://docs.python.org/3/library/itertools.html#itertools.permutations (accessed on 26 July 2022).
Erjavec, T.; Fišer, D. Building Slovene Wordnet. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 22–28 May 2006.
Koeva, S.; Genov, A.; Totkov, G. Towards Bulgarian Wordnet. Rom. J. Inf. Sci. Technol. 2004, 7, 45–60.
Raffaelli, I.; Tadić, M.; Bekavac, B.; Agić, Ž. Building Croatian WordNet. In Proceedings of the GWC, Szeged, Hungary, 22–25 January 2008; pp. 349–360.
Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval; ACM Press: New York, NY, USA, 1999; Volume 463.
Gemini Team Google. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
Yeh, A. More Accurate Tests for the Statistical Significance of Result Differences. In Proceedings of the COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics, Saarbrücken, Germany, 31 July–4 August 2000.
Kolesnichenko, L.; Velldal, E.; Øvrelid, L. Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis. In Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), Tórshavn, Faroe Islands, 22 May 2023; pp. 42–47.
Pang, B.; Lee, L. Opinion Mining and Sentiment Analysis. In Foundations and Trends in Information Retrieval; Now Publishers Inc.: Norwell, MA, USA, 2008; pp. 1–135.
Liu, B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions; Cambridge University Press: Cambridge, UK, 2020.
Khandelwal, A.; Sawant, S. NegBERT: A Transfer Learning Approach for Negation Detection and Scope Resolution. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 5739–5748.
Askalidis, G.; Kim, S.J.; Malthouse, E.C. Understanding and Overcoming Biases in Online Review Systems. Decis. Support Syst. 2017, 97, 23–30.
Barman, U.; Das, A.; Wagner, J.; Foster, J. Code Mixing: A Challenge for Language Identification in the Language of Social Media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, 25 October 2014; pp. 13–23.

Figure 1. Comparison of F1 scores for Bulgarian datasets. Our proposed methods are labelled with the prefix “expanded”.

Figure 2. Comparison of F1 scores for Croatian datasets. Our proposed methods are labelled with the prefix “expanded”.

Figure 3. Comparison of F1 scores for Slovak datasets. Our proposed methods are labelled with the prefix “expanded”.

Figure 4. Comparison of F1 scores for Slovene datasets. Our proposed methods are labelled with the prefix “expanded”.

Table 1. Literature review.

Author	Purpose	Method	Sample Size	Key Findings
[18]	Relation extraction in NLP	Distant supervision (DS) using Freebase as a lookup table	800 K	Multi-instance learning framework.
[29]	Classifying sentiment in tweets	Distant supervision using emoticons as labels	1600 K	Emoticons were used as labels for the SA of tweets.
[25]	Enhancing NMT with synthetic data	Back-translation	100 K	Used machine translation as a paraphraser.
[7]	Improving adversarial attack performance	Altered language model trained on WikiText-103 corpus	7 K–540 K	Contextual DA method outperforms traditional DA methods.
[26]	Improving NMT sample quality	Sampling and noisy beam outputs for back-translation	29 M	Noisy beam outputs create better synthetic data than beam or greedy search.
[17]	Curating datasets for BioNLP tasks	Distant supervision with heuristics to reduce noise	25 K–77 K	Proposed heuristics to reduce noise.
[22]	Generating adversarial samples for NLP	Synonym replacement using WordNet	25 K–1.4 M	Saliency-based methods for detecting important words.
[6]	Simplifying data augmentation	EDA: synonym replacement, random replacement, swap, deletion	500–5 K	Found that small augmentation values ($\alpha$) produced better performance gains than large values.
[19]	Improving adversarial sample generation	Contextual perturbations using BERT masked language model	10 K–598 K	Used BERT for replacing and inserting tokens at masked locations.
[27]	Examining the impact of pre-trained language models on data augmentation	Augmentation with BERT, XLNet, and RoBERTa	500–10 K	DA did not provide consistent improvements for pre-trained transformers.
[28]	Enforcing consistency in model predictions with augmented data	Consistency training with back-translation and TF-IDF	25 K	Used consistency loss to improve model predictions.
[24]	Extending adversarial attack methods	Contextualized perturbations with RoBERTa	105 K–560 K	Introduced replace, insert, and merge operations for adversarial attacks.
[31]	Proposing data augmentation using latent space for difficult-to-classify samples	Latent space augmentation using interpolation and noise addition	50 K–120 K	Difficult-to-classify samples contain more information, making them ideal for DA in low-data settings.
[30]	Comparing augmentation strategies for headline generation in various languages	WordNet and BERT-based augmentation	10 K–260 K	Domain-specific data benefit more from data augmentation and pretraining schemes.
Ours	Comparing multiple DA strategies for SA in various low-resourced languages	Expansion and permutation-based techniques	10 K–40 K	Transformer-based models do not benefit from DA based on synonymy.

Table 2. The original distribution of sentiment analysis datasets.

Language	Dataset	Train	Val	Test
Bulgarian	Cinexio	5520	614	682
Croatian	Pauza	2050	227	1033
Slovak	Reviews3	3834	661	1235
Slovene	KKS	3977	200	600

Table 3. Transformer models used in the training as base encoders for CLM and MLM.

Language	Method	Model Name
Croatian	CLM	macedonizer/hr-gpt2
	MLM	EMBEDDIA/crosloengual-bert
Bulgarian	CLM	rmihaylov/gpt2-medium-bg
	MLM	rmihaylov/bert-base-bg
Slovak	CLM	Milos/slovak-gpt-j-405M
	MLM	gerulata/slovakbert
Slovene	CLM	macedonizer/sl-gpt2
	MLM	EMBEDDIA/sloberta
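
The names in Table 3 read as Hugging Face Hub model identifiers. As a minimal sketch of how they might be organised and loaded, the snippet below stores them in a lookup table and builds a generation pipeline on demand. The dictionary layout, the helper name `load_augmenter`, and the choice of the `transformers` `pipeline` API are assumptions made for illustration, not the authors' code.

```python
# Model identifiers copied from Table 3; the dictionary layout is an assumption.
BASE_MODELS = {
    "Croatian": {"CLM": "macedonizer/hr-gpt2", "MLM": "EMBEDDIA/crosloengual-bert"},
    "Bulgarian": {"CLM": "rmihaylov/gpt2-medium-bg", "MLM": "rmihaylov/bert-base-bg"},
    "Slovak": {"CLM": "Milos/slovak-gpt-j-405M", "MLM": "gerulata/slovakbert"},
    "Slovene": {"CLM": "macedonizer/sl-gpt2", "MLM": "EMBEDDIA/sloberta"},
}

# Assumed mapping from augmentation method to a transformers pipeline task.
PIPELINE_TASKS = {"CLM": "text-generation", "MLM": "fill-mask"}


def load_augmenter(language, method):
    """Hypothetical helper: build a pipeline for one language and method.

    The import is done lazily so that the lookup tables above can be used
    without the `transformers` package installed. Calling this function
    downloads the model from the Hugging Face Hub.
    """
    from transformers import pipeline

    return pipeline(PIPELINE_TASKS[method], model=BASE_MODELS[language][method])
```

For example, `load_augmenter("Slovak", "MLM")` would return a fill-mask pipeline whose top predictions for a masked token could serve as candidate word substitutions.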

Table 4. Train–development–test distribution of the original and expanded datasets: pos—positive; neg—negative; neu—neutral.

Language	Version	Train			Dev			Test
		neg	pos	neu	neg	pos	neu	neg	pos	neu
Croatian	Original	467	1586	145	47	159	14	236	719	78
	lemma	467	1586	145	47	159	14	236	719	78
	expanded	1523	3979	436	44	398	152	742	1787	254
Bulgarian	Original	864	3898	710	96	436	80	107	486	88
	lemma	864	3898	710	96	436	80	107	486	88
	expanded	1435	6321	1060	154	686	116	185	803	133
Slovak	Original	297	1337	1926	46	211	265	80	416	545
	lemma	297	1337	1926	46	211	265	80	416	545
	expanded	879	2493	2397	136	352	326	279	841	627
Slovene	Original	2722	749	506	138	37	25	431	112	57
	lemma	2722	749	506	138	37	25	431	112	57
	expanded	13,676	2165	2073	559	170	141	2183	400	229
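
The class imbalance in Table 4 is easier to see as percentages. The short sketch below converts the original training counts into class shares; the counts are copied from the table, while the function name `class_shares` and the rounding to one decimal place are illustrative choices.

```python
# Training-split counts (neg, pos, neu) of the original datasets, from Table 4.
ORIGINAL_TRAIN = {
    "Croatian": (467, 1586, 145),
    "Bulgarian": (864, 3898, 710),
    "Slovak": (297, 1337, 1926),
    "Slovene": (2722, 749, 506),
}


def class_shares(counts):
    """Return each class's percentage of the total, rounded to one decimal."""
    total = sum(counts)
    return tuple(round(100 * count / total, 1) for count in counts)


for language, counts in ORIGINAL_TRAIN.items():
    neg, pos, neu = class_shares(counts)
    print(f"{language}: neg {neg}%, pos {pos}%, neu {neu}%")
```

Read this way, the positive class is the largest in the Croatian and Bulgarian training sets, the neutral class in Slovak, and the negative class in Slovene.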

Table 5. Results of original, lemmatised, and expanded (ours) versions of the dataset.

Language	Version	Binary		Ternary
		F1	ACC	F1	ACC
Croatian	Original	94.11	95.86	75.04	88.18
	lemma	93.61	95.53	60.95	77.77
	expanded	73.99	78.76	73.31	86.93
	gemma	98.05	98.03	90.84	90.99
Bulgarian	Original	90.00	94.43	72.90	83.55
	lemma	88.82	93.76	68.31	81.20
	expanded	84.44	91.09	65.89	80.55
	gemma	96.41	96.45	80.39	84.43
Slovak	Original	94.83	97.17	79.50	81.07
	lemma	94.65	96.97	79.43	81.84
	expanded	88.07	90.98	71.60	72.46
	gemma	98.99	98.99	76.07	76.65
Slovene	Original	80.92	87.84	68.70	79.33
	lemma	79.25	87.29	66.38	77.16
	expanded	68.05	85.63	49.96	67.03
	gemma	93.57	93.73	85.8	85.83

Table 6. Results of expanded–combined (ours) and expanded–permuted (ours) datasets for all languages.

Lang	Ver	Binary_10k		Ternary_10k		Binary_20k		Ternary_20k		Binary_40k		Ternary_40k
		F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC
Hr	expanded-combined	95.37	96.84	73.17	87.41	95.84	97.16	72.96	85.96	94.26	96.07	71.84	87.6
	expanded-permuted	95.53	96.84	73.87	87.99	94.79	96.4	68.72	84.99	93.06	95.31	71.63	86.93
Bg	expanded-combined	90.16	94.26	66.18	76.35	89.88	93.92	72.23	81.93	89.41	93.76	72.27	82.96
	expanded-permuted	89.85	94.26	71.7	80.91	89.17	93.76	71.69	81.64	89.08	93.76	70.5	79.29
Sk	expanded-combined	97.76	98.79	76.58	77.52	96.92	98.38	77.55	78.09	96.72	98.18	79.34	80
	expanded-permuted	98.12	98.99	76.4	76.94	97.37	98.58	78.31	79.05	97.8	98.79	77.86	79.05
Sv	expanded-combined	75.89	81.76	59.73	70.16	77.9	84.16	62.89	74.88	77.67	83.6	58.8	67
	expanded-permuted	75.57	81.21	53.66	60.16	74.07	79.92	54.62	59.33	77.84	83.24	61.5	73.5
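
Table 6 compares the expanded-combined and expanded-permuted datasets, and the reference list cites Python's `itertools.combinations` and `itertools.permutations`. One plausible reading is that sentences sharing a label are joined in unordered groups (combination) or ordered arrangements (permutation) to form new samples. The sketch below illustrates that idea only; the function names, the group size `r=2`, joining with a single space, and the toy English sentences are all assumptions rather than the procedure used in the paper.

```python
from itertools import combinations, permutations


def expand_combined(sentences, r=2):
    """Join every unordered group of r sentences into one new sample."""
    return [" ".join(group) for group in combinations(sentences, r)]


def expand_permuted(sentences, r=2):
    """Join every ordered arrangement of r sentences into one new sample."""
    return [" ".join(group) for group in permutations(sentences, r)]


# Toy sentences assumed to share the same sentiment label (hypothetical data).
positive = ["Great food.", "Fast delivery.", "Friendly staff."]

combined = expand_combined(positive)
permuted = expand_permuted(positive)
print(len(combined), len(permuted))
```

The number of generated samples grows quickly with the number of source sentences, so under this reading the output would need to be sampled down to a target size such as the 10k, 20k, and 40k settings reported in Table 6.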

Table 7. Results on datasets augmented using WordNet (WN), MLM, and CLM. Bold values represent the best-performing system.

		10k				20k				25k				40k				All
Lang	Version	Binary		Ternary		Binary		Ternary		Binary		Ternary		Binary		Ternary		Binary		Ternary
		F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC	F1	ACC
Hr	WN	94.18	95.96	71.90	87.12	93.09	95.31	68.73	84.80					94.20	95.96	61.78	84.31	93.94	95.86	69.43	86.73
	MLM	92.30	94.55	67.74	81.31	90.26	93.35	70.63	83.93	90.76	93.68	69.36	83.15
	CLM	92.06	94.44	64.96	81.89	90.74	93.89	62.35	81.80									89.73	93.02	67.11	83.83
Bg	WN																	91.56	94.94	70.64	84.43
	MLM																	88.73	93.76	70.07	81.49
	CLM	87.07	92.58	61.87	79.73	84.15	90.55	59.05	77.09	82.76	88.87	58.43	80.02					84.10	91.23	58.35	76.65
Sk	WN	96.00	97.78	74.86	79.82	95.61	97.58	79.35	82.32					95.22	97.37	77.67	80.97	97.37	98.58	76.50	78.96
	MLM	96.19	97.98	77.24	78.67	94.93	97.17	76.49	76.75	96.27	97.98	73.44	74.25
	CLM	92.31	95.96	70.01	72.14	90.54	94.55	69.8	71.85					91.63	95.56	68.79	71.66	91.40	95.16	68.66	70.50
Sv	WN	73.47	79.18	59.39	68.83	78.25	84.71	53.33	65.00					78.25	84.71	58.53	69.5	77.83	86.37	59.87	73.5
	MLM	63.02	66.11	62.00	72.16	73.99	79.37	60.827	72.33	76.152	82.13	56.11	64.16
	CLM	74.29	81.03	55.16	65.33	67.19	72.19	54.46	69.83	90.26	73.66	56.38	65.83					65.89	69.98	47.68	57.83

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
