Article

Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis

HUN-REN Centre for Social Sciences, Tóth Kálmán u. 4, 1097 Budapest, Hungary
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 330; https://doi.org/10.3390/info16040330
Submission received: 14 March 2025 / Revised: 30 March 2025 / Accepted: 16 April 2025 / Published: 21 April 2025

Abstract

Emotion classification in natural language processing (NLP) has recently witnessed significant advancements. However, class imbalance in emotion datasets remains a critical challenge, as dominant emotion categories tend to overshadow less frequent ones, leading to biased model predictions. Traditional techniques, such as undersampling and oversampling, offer partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions using a GPT-4–based model in a few-shot prompting setting. Beyond enhancing model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine their contribution to overall model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may introduce limitations in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena.

1. Introduction

In recent years, emotion classification has been widely explored across multiple languages, often focusing on large, resource-rich language corpora [1,2,3]. However, even within well-studied languages, achieving optimal performance requires managing several challenges, notably the inherent imbalance of labeled emotion datasets [1,4,5]. Imbalanced categories can skew model training, disproportionately favoring common labels at the expense of those that appear only rarely [4,5,6,7]. Traditional solutions range from basic undersampling, which reduces the size of overrepresented classes, to oversampling minority classes to balance label distributions. More recently, data augmentation using large language models (LLMs) has emerged as a promising approach to increase coverage and diversity in training data [1,2,3].
This study exclusively focuses on building a high-performing emotion classification model for German-language text. We systematically experimented with multiple balancing strategies based on a German dataset exhibiting a strong label frequency imbalance. First, we applied undersampling to reduce the sheer volume of extremely frequent categories to a more moderate level. Subsequently, we used oversampling through synthetic data generation, leveraging a GPT-4–based model in a 10-shot prompting setting. This process aims to enhance the minority emotion classes by producing novel yet contextually coherent sentences [1].
Beyond merely integrating generated data, we investigated the linguistic properties of these synthetic examples in detail. Specifically, we analyzed the augmented sentences from syntactic, semantic, and lexical perspectives to determine whether they add genuine diversity—or merely replicate patterns from the original dataset. By examining both the distributional outcomes of balancing techniques and the intrinsic qualities of the synthetic text, we offer insight into the most effective ways to improve German-language emotion classification performance [8,9,10].
Overall, this work addresses two primary questions: (1) Which combination of undersampling and synthetic oversampling yields the best classification results on a highly imbalanced German dataset? and (2) How does the linguistic diversity of generated data affect the final model’s generalization capacity? Our findings contribute to a deeper understanding of how balancing methods and the fine-grained analysis of synthetic data can be applied to maximize performance in practical emotion classification tasks for the German language. Our key contributions are as follows:
  • A stepwise data-balancing framework for German emotion classification combining undersampling, synthetic oversampling via GPT-4, and original sample re-weighting.
  • A linguistically informed evaluation of synthetic data quality using lexical, syntactic, and semantic metrics.
  • An experimental setup that distinguishes apparent vs. true performance gains from synthetic augmentation, providing insights into generalization vs. domain adaptation.

2. Relevant Works

The field of computational text analysis has advanced significantly, particularly in emotion detection and classification. In contrast to traditional sentiment analysis—which typically categorizes text as positive, negative, or neutral—emotion analysis attempts to classify specific emotional states, such as happiness, anger, or fear [4,5]. This distinction is especially pertinent in political communication research, where emotional cues, however subtle, can substantially shape how a message is received and interpreted by the public [11,12]. Emotions may reinforce or undermine credibility, alter perceptions of policy issues, and even affect voting decisions. Consequently, understanding emotional tones in textual data has become a focal point for scholars and practitioners aiming to quantify the impact of political discourse on audiences [1,2,13,14,15].
Recent research in the German context has also explored emotion classification with various methodological approaches. Christian Rauh validated a sentiment dictionary for analyzing German political language [16], testing it on parliamentary speeches, party manifestos, and media content. The results indicated that positive emotions are easier to detect, while negative emotions pose more significant challenges, especially in shorter texts. Widmann and Wich [17] compared three models for emotion detection in German political texts: a dictionary-based model, a neural network with word embeddings, and a transformer-based model (ELECTRA). Their findings demonstrated that the transformer-based model significantly outperformed traditional dictionary-based approaches in accurately measuring emotional nuances.
Early research efforts in this domain heavily focused on sentiment polarity detection within political news [6,7]. Although foundational for analyzing broad positive or negative trends, these studies highlighted the need for more granular models that account for varied emotional expressions. Later, researchers began to adopt more nuanced theoretical frameworks, such as Martin and White’s appraisal approach [18], which addresses subtleties like affect and judgment in a text. Likewise, explicit opinion analysis [19] broadened the methodological toolkit for analyzing how political actors, media outlets, or stakeholders articulate attitudes and stances. These diversifications made capturing subtle emotional and evaluative dimensions in political statements, campaign materials, and news articles increasingly possible. Mullen and Malouf [20], for instance, conducted pioneering work on informal political discourse—particularly in online debate forums—while Van Atteveldt et al. [21] employed machine learning techniques to distinguish positive from negative sentiments in campaign texts. Additionally, innovative approaches such as dictionary-based methods or crowd-coding allowed researchers to combine computational efficiency with human insights [13], and subsequent comparisons with expert-coded data provided a clearer view of how automated sentiment tools align with human judgments [14].
However, most prior studies either focused on high-resource languages or relied on traditional classification models, often overlooking the impact of data imbalance on minority emotion categories or the linguistic quality of synthetic inputs. Beyond the English-speaking world, scholars have tackled emotion and sentiment analysis in under-resourced or morphologically rich languages, further broadening the field’s scope. Turkish [22,23], Norwegian [24], and Albanian [25] exemplify languages where morphological complexity and limited annotated data present challenges for traditional approaches. Similarly, Ukrainian, Russian [26], Indonesian [27], and Hungarian [28,29,30] have garnered increased scholarly attention. Political news in these contexts often contains unique linguistic structures and cultural references, necessitating specialized lexicons or domain-specific models to detect sentiment and emotion accurately.
In contrast to previous works, our study introduces a twofold contribution: first, we systematically assess the effectiveness of GPT-4-generated synthetic data in balancing imbalanced emotion classes in German political text; second, we offer a fine-grained linguistic evaluation of the generated data’s lexical, syntactic, and semantic features. This allows us to not only boost classification performance but also reflect critically on the limitations of current augmentation practices. Unlike [31], which focused on Hungarian political sentiment using monolingual resources, and [32], which addressed Arabic emotion detection, our work combines large-scale generative augmentation with structural linguistic diagnostics, offering a complementary perspective to transformer-based emotion modeling [33,34,35].
Recent developments in natural language processing (NLP), particularly the advent of large language models (LLMs), have significantly influenced emotion analysis in political contexts. Research suggests that these models can outperform some fine-tuned BERT variants in entity-specific tasks, such as identifying sentiments directed at particular political figures or policy issues [36]. Furthermore, specialized adaptations of these models have appeared for underrepresented languages. For instance, ref. [31] investigates political communication in Hungarian, while [32] explores Arabic emotion detection, demonstrating how LLMs can be tailored to diverse linguistic landscapes. The success of transformer-based approaches in classifying emotions from headlines, where brevity and possible sensationalism can confound analysis, highlights the nuanced capabilities of modern architectures [37].
Transformer architectures themselves have revolutionized NLP since their introduction [38], offering more efficient parallelization and robust feature learning. BERT, in particular, popularized fully bidirectional attention mechanisms and established new baselines for many downstream tasks by leveraging large-scale pre-training [15]. GPT-3 illustrated the feasibility of few-shot or zero-shot learning setups, drastically reducing data requirements [1], while RoBERTa refined training procedures for greater performance gains, and T5’s unified text-to-text paradigm streamlined different tasks under a common framework [2,3]. These strategies have been extended to multilingual domains as well, where models like mBERT, mBART, and XLM-R [39] share parameters across languages. Although distributing representational capacity across many languages can dilute performance for high-resource languages, it can significantly benefit under-resourced settings through cross-lingual transfer and shared vocabulary.
Model selection thus becomes a balancing act between specialization and generalization. Monolingual models can yield higher accuracy by concentrating on language-specific traits, including morphological rules or idiomatic expressions, whereas multilingual counterparts facilitate transfer learning and cost-effective training for low-resource languages. The real-world utility of each approach often hinges on the data availability, the domain’s complexity, and the research or application goals [40,41,42].
Data augmentation methods also offer ways to enhance training in scenarios where collecting labeled samples is expensive or logistically difficult. Techniques originally refined for computer vision [40] have been adapted to NLP, including “Easy Data Augmentation” (EDA) strategies that randomly insert, delete, swap, or replace words [41]. While these techniques can improve model robustness, especially for simpler classification tasks, their random nature may distort semantic nuance [42]. More sophisticated augmentation involves embeddings like Word2Vec, GloVe, or transformer-based vectors, enabling semantically aware replacements that preserve context and meaning [43,44,45].
In parallel, generative models—particularly instruction-tuned LLMs—provide another avenue for synthetic data generation to counter data scarcity [1]. When prompted with detailed instructions, these models can produce diverse text samples that capture specific styles, emotional tones, or domain features, thereby expanding a corpus in controlled ways. Nevertheless, synthetic data quality is closely tied to the underlying model’s capabilities and prompt design. Over-reliance on machine-generated samples may introduce spurious patterns, ultimately weakening real-world generalization if not balanced with manually verified or gold-standard data [46,47]. Integrating few-shot examples—where a small set of high-quality human-labeled samples are provided—can improve generative consistency and thematic accuracy, proving especially useful for tasks like emotion classification that rely on nuanced linguistic cues [3,4,11].
In summary, advancing emotion detection and classification in political communication requires both technical innovations and domain-specific considerations. Robust transformer-based architectures, multilingual and monolingual model trade-offs, and carefully executed data augmentation techniques collectively shape how effectively we can capture the emotional signals embedded in political texts. Although breakthroughs in LLMs demonstrate promising results, ongoing research highlights the need to balance automated methods with reliable human-labeled data and contextually aware modeling approaches, ensuring that emotion analysis remains both accurate and reflective of complex real-world discourse.

3. Dataset

3.1. Main Statistics

For the present study, the main underlying resource was the Widmann German-language dataset [17], which originally contained ten labels: disgust, fear, sadness, pride, joy, enthusiasm, cannot be coded, hope, anger, and no emotion. Table 1 shows the full distribution before any exclusions, as calculated from the raw dataset. In this work, instances labeled cannot be coded were removed, leaving nine effective categories for experimentation.
By excluding “cannot be coded” instances (6869 instances), we retained nine final classes (disgust, fear, sadness, pride, joy, enthusiasm, hope, anger, no emotion), for a total of 120,217 sentences. Within those nine labels, no emotion and anger remained the most frequent categories, while disgust, fear, and sadness appeared relatively rarely.

3.2. Corpus Variants

As Table 1 makes clear, the original corpus was highly imbalanced, which often leads to suboptimal classification performance. To address this, we explored various strategies to improve the models’ accuracy. First, we used undersampling for the heavily overrepresented classes so as to reduce their impact on training. We then applied oversampling via synthetic data to the classes with few examples, thereby increasing the variety of input for those categories. Finally, we observed that the no emotion class continued to underperform. Because this category is critical—false positives here strongly degrade the accuracy of all other categories—we decided to double it using additional original samples. Sufficient real data were available for this step, ensuring a stronger model focus on detecting truly non-emotional instances.
The above steps led to the following models, in order. The baseline_model uses the raw label distribution (after removing the cannot be coded category) without any resampling or synthetic data. Next, model_1 applies undersampling to the two most frequent categories, anger and no emotion, to reduce them to the size of the next-largest category. Subsequently, model_2a expands the minority classes (such as disgust, fear, and sadness) with synthetic sentences to achieve a more balanced overall distribution; in this variant, anger, hope, and no emotion remain at the levels determined in model_1. Finally, model_2b further increases the no emotion class by adding additional original German data on top of the synthetic augmentation already performed for minority classes. The main characteristics of the model versions are summarized in Table 2.
The composition of the training data used for all models is shown in Table 3.
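The balancing steps described above can be expressed compactly; the following is a minimal sketch assuming a pandas DataFrame with text and label columns. The variable names (df, synthetic_df) and the way the target size is derived are illustrative, not the authors’ exact implementation.

```python
import pandas as pd

def undersample(df: pd.DataFrame, label: str, target_size: int, seed: int = 42) -> pd.DataFrame:
    """Reduce one overrepresented class to target_size rows; keep all other rows unchanged."""
    cls = df[df["label"] == label].sample(n=target_size, random_state=seed)
    rest = df[df["label"] != label]
    return pd.concat([cls, rest]).reset_index(drop=True)

# model_1 corpus: cap the two most frequent classes (anger, no emotion)
# at the size of the next-largest category (third-largest count overall).
next_largest = df["label"].value_counts().iloc[2]          # df is a hypothetical labeled corpus
balanced = undersample(undersample(df, "anger", next_largest), "no emotion", next_largest)

# model_2a corpus: append GPT-4-generated sentences for the minority classes.
# synthetic_df is a hypothetical DataFrame of generated sentences with emotion labels.
augmented = pd.concat([balanced, synthetic_df]).reset_index(drop=True)
```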

4. Methods

4.1. Model Training (XLM-RoBERTa)

In this work, we adopted the XLM-RoBERTa architecture, which has proven effective for multilingual tasks yet has also demonstrated robust performance in monolingual settings. Although our investigation focuses on German-language texts exclusively, the multilingual pre-training of XLM-RoBERTa confers additional benefits for morphological richness and diverse token coverage. By being exposed to multiple languages during pre-training, the model captures deeper linguistic representations that positively impact downstream tasks in a single target language.
To fine-tune XLM-RoBERTa, we relied on the standard training pipeline. We split our labeled corpus into training, validation, and test sets using stratified sampling, ensuring consistent label proportions. We then initialized the model with a learning rate typically set between $1 \times 10^{-5}$ and $2 \times 10^{-5}$, and applied an early-stopping criterion with a patience of two epochs. Batch sizes ranged from 16 to 32, depending on available GPU memory. Throughout training, we monitored macro-averaged F1-scores, precision, recall, and accuracy across all emotion labels to assess each balancing technique’s efficacy. We found the macro-averaged F1-score particularly useful, as it balances precision and recall while giving equal weight to each class, making it well suited for evaluating performance on imbalanced datasets such as ours. Models converged efficiently under these settings, while careful checkpointing guarded against potential overfitting.
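A minimal sketch of such a fine-tuning setup, assuming the Hugging Face transformers 4.x Trainer API; the dataset objects (train_ds, val_ds) are hypothetical pre-tokenized datasets, and the hyperparameter values simply follow the ranges reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"
NUM_LABELS = 9  # nine emotion categories after removing "cannot be coded"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

def compute_metrics(eval_pred):
    """Macro-averaged F1, precision, recall, and accuracy over all emotion labels."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "macro_precision": prec, "macro_recall": rec, "macro_f1": f1}

args = TrainingArguments(
    output_dir="xlmr-emotion",
    learning_rate=2e-5,                  # within the 1e-5 to 2e-5 range used in the paper
    per_device_train_batch_size=16,      # 16-32, depending on GPU memory
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # hypothetical tokenized datasets with a "labels" column
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience of two epochs
)
trainer.train()
```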

4.2. Prompt Design for Synthetic Augmentation

To generate additional German sentences for minority classes, we employed a GPT-4-based large language model in a few-shot manner. Each prompt included the following:
  • A concise description of the target emotion (e.g., “Generate a German sentence expressing fear, according to the following description: …”),
  • Ten real, representative examples sampled from the original dataset.
By providing 10 authentic sentences per emotion (randomly selected from the original corpus in each iteration), the LLM received sufficient context to produce coherent and diverse synthetic samples. We enforced strict requirements for German text only, specifying maximum sentence length where possible. The resulting data were merged with the original corpus for model re-training in our oversampling trials. This procedure effectively boosted the size of minority classes, improving overall classification performance in subsequent experiments. The OpenAI API was used for generation with the gpt-4o model at a temperature setting of 0.6. The temperature parameter controls the randomness of the generated output: lower values make the output more focused and deterministic, while higher values increase diversity and creativity. We used a moderate setting to balance coherence and variation in the synthetic data.
While we have summarized our prompt design here, the complete prompt template—used to guide the generative model in producing synthetic German sentences—can be found in full in Appendix A. This template details the instructions provided to the model, specifying language requirements, emotion constraints, and output format.
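The generation step could be implemented roughly as follows; this is a sketch assuming the current OpenAI Python SDK, with an abridged prompt standing in for the full template in Appendix A. The function name, example pool, and prompt wording are illustrative.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_synthetic(emotion_description: str, example_pool: list[str], n_shots: int = 10) -> str:
    """Ask gpt-4o for German sentences expressing one target emotion (sketch)."""
    shots = random.sample(example_pool, k=min(n_shots, len(example_pool)))  # 10 real examples per prompt
    prompt = (
        "Generate German sentences expressing the emotion described below.\n"
        f"Description: {emotion_description}\n"
        "Return exactly 10 German sentences as a Python list, and nothing else.\n"
        "Example sentences:\n" + "\n".join(f"- {s}" for s in shots)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.6,  # moderate randomness, as in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```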

4.3. Synthetic Data Quality

In order to obtain a clear picture of the quality of the augmented synthetic data, we examined it from lexical, syntactic, and semantic points of view.

4.3.1. Lexical Diversity

Lexical diversity plays a crucial role in ensuring that language models generalize effectively across different contexts. A dataset with a rich and varied vocabulary enables a model to move beyond simple pattern recognition and develop a deeper understanding of linguistic variation. One of the common challenges with synthetic data is its tendency to repeat frequent words and phrases rather than introduce novel expressions. If synthetic data fail to expand the lexical space, models may become overly reliant on familiar linguistic patterns, limiting their ability to adapt to unseen texts.

Type–Token Ratio (TTR)

Lexical diversity measures the variation in word usage within a dataset. One of the most common metrics for this is the Type–Token Ratio (TTR) [8,9], which is defined as
$$\mathrm{TTR} = \frac{\text{Number of Unique Words (Types)}}{\text{Total Number of Words (Tokens)}}$$
This metric takes a value between 0 and 1, where 1 corresponds to completely repetition-free text. Therefore, a higher TTR indicates a more diverse vocabulary, while a lower TTR suggests repetitive language use. In our analysis, we computed the TTR for both the original and synthetic datasets using spaCy’s German-language model (de_core_news_sm) for tokenization and lemmatization [48]. The latter was necessary due to the extensive conjugation system of the German language.
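A sketch of how the lemma-based TTR can be computed with spaCy’s German model; the variable names and the exclusion of punctuation are assumptions about preprocessing rather than the exact implementation.

```python
import spacy

nlp = spacy.load("de_core_news_sm")

def type_token_ratio(sentences: list[str]) -> float:
    """Lemma-based TTR: unique lemmas (types) divided by total tokens, ignoring punctuation."""
    lemmas = []
    for doc in nlp.pipe(sentences):
        lemmas.extend(tok.lemma_.lower() for tok in doc if not tok.is_punct and not tok.is_space)
    return len(set(lemmas)) / len(lemmas) if lemmas else 0.0

# ttr_original = type_token_ratio(original_sentences)    # hypothetical corpora
# ttr_synthetic = type_token_ratio(synthetic_sentences)
```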

Lexical Overlap: Jaccard Index

The Jaccard index [10] is a commonly used measure of set similarity that quantifies the lexical overlap between two datasets. It is defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A represents the set of unique words (types) in the original dataset, and B represents the set of unique words in the synthetic dataset. The Jaccard index ranges from 0 to 1, where a value close to 1 indicates high lexical similarity and a value close to 0 suggests significant vocabulary divergence.
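Computed over the lemma sets of the two corpora, the index reduces to a few lines; the following sketch reuses a spaCy pipeline for lemmatization and is illustrative only.

```python
import spacy

nlp = spacy.load("de_core_news_sm")

def jaccard_index(sentences_a: list[str], sentences_b: list[str]) -> float:
    """Jaccard index over the sets of unique lemmas (types) of two corpora."""
    def types(sentences: list[str]) -> set[str]:
        return {tok.lemma_.lower()
                for doc in nlp.pipe(sentences)
                for tok in doc
                if not tok.is_punct and not tok.is_space}
    a, b = types(sentences_a), types(sentences_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```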

4.3.2. Syntactic Analysis

Syntactic diversity is equally important for improving the generalization capabilities of models. When datasets predominantly contain homogeneous sentence structures, models may struggle to understand alternative syntactic constructions or handle complex sentence patterns.

POS Diversity

To evaluate the structural variation between the original and synthetic datasets, we analyzed the Part-of-Speech (POS) patterns using bigram and trigram frequency distributions.

Divergence Analysis Using Jensen–Shannon Divergence

The Jensen–Shannon Divergence (JSD) [49] is used to measure the difference between the POS n-gram distributions in the two datasets. JSD is a symmetric and smoothed measure of divergence between two probability distributions P and Q, calculated as
$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M),$$
where $M = \frac{1}{2}(P + Q)$ is the mean distribution, and $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence. A higher JSD value indicates greater divergence in the POS structure, while a lower value suggests similar grammatical patterns.
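A sketch of the POS n-gram extraction and the divergence computation. It assumes spaCy POS tags and SciPy; note that scipy’s jensenshannon returns the Jensen–Shannon distance (the square root of the divergence), so it is squared here, and the base-2 logarithm is an assumption.

```python
from collections import Counter

import numpy as np
import spacy
from scipy.spatial.distance import jensenshannon

nlp = spacy.load("de_core_news_sm")

def pos_ngram_distribution(sentences: list[str], n: int = 2) -> Counter:
    """Count POS n-grams, e.g. ('DET', 'NOUN'), over a corpus."""
    counts = Counter()
    for doc in nlp.pipe(sentences):
        tags = [tok.pos_ for tok in doc]
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts

def jsd(p_counts: Counter, q_counts: Counter) -> float:
    """Jensen-Shannon divergence between two POS n-gram frequency distributions."""
    vocab = sorted(set(p_counts) | set(q_counts))
    p = np.array([p_counts[v] for v in vocab], dtype=float)
    q = np.array([q_counts[v] for v in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return jensenshannon(p, q, base=2) ** 2  # square the JS distance to obtain the divergence
```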

4.3.3. Semantic Analysis

Dependency Tree Edit Distance

To further quantify the structural differences between the original and synthetic datasets, we utilized the Dependency Tree Edit Distance (TED) metric, which is common practice in linguistics-focused studies [50,51]. TED measures the minimal number of insertions, deletions, and substitutions required to transform one dependency tree into another. Higher TED scores indicate greater syntactic divergence, whereas lower scores suggest structural similarity.
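One way to compute TED is the Zhang–Shasha algorithm; the sketch below converts spaCy dependency parses into trees for the zss package. This choice of tooling and the use of dependency relations as node labels are assumptions, not necessarily the authors’ exact setup.

```python
import spacy
from zss import Node, simple_distance

nlp = spacy.load("de_core_news_sm")

def dep_tree(token) -> Node:
    """Recursively build a zss tree from a spaCy token, labeling nodes by dependency relation."""
    node = Node(token.dep_)
    for child in token.children:
        node.addkid(dep_tree(child))
    return node

def tree_edit_distance(sent_a: str, sent_b: str) -> int:
    """Zhang-Shasha tree edit distance between the dependency trees of two sentences."""
    doc_a, doc_b = nlp(sent_a), nlp(sent_b)
    root_a = next(tok for tok in doc_a if tok.head == tok)  # sentence root points to itself
    root_b = next(tok for tok in doc_b if tok.head == tok)
    return simple_distance(dep_tree(root_a), dep_tree(root_b))
```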

4.4. Performance Evaluation Metrics

Following training, we evaluated models on various test conditions:
  • Both training and test sets contained synthetic data.
  • Trained with synthetic data, tested only on original data.
  • Baseline, no synthetic data in either training or test sets.
We employed standard metrics and visual tools to measure classification robustness:
  • Confusion Matrices provided a breakdown of classification outcomes across each emotion category, highlighting misclassification patterns.
  • ROC Curve Analysis assessed the trade-off between the true positive rate and false positive rate, with the AUC quantifying the overall model discrimination capability.
  • Precision–Recall Curves offered further insights into performance under class-imbalanced conditions, focusing on the trade-off between precision and recall.
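These metrics can be produced with scikit-learn in a one-vs-rest setting; the sketch below assumes integer-encoded true labels and per-class predicted probabilities from the fine-tuned model, and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import (auc, average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_curve)
from sklearn.preprocessing import label_binarize

def per_class_curves(y_true: np.ndarray, y_prob: np.ndarray, labels: list[str]) -> dict:
    """AUC and average precision per emotion, treating each class one-vs-rest."""
    y_bin = label_binarize(y_true, classes=list(range(len(labels))))
    results = {}
    for i, name in enumerate(labels):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_prob[:, i])
        prec, rec, _ = precision_recall_curve(y_bin[:, i], y_prob[:, i])
        results[name] = {"auc": auc(fpr, tpr),
                         "average_precision": average_precision_score(y_bin[:, i], y_prob[:, i])}
    return results

# cm = confusion_matrix(y_true, y_prob.argmax(axis=1))  # raw per-class counts for the heatmaps
```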

4.5. Apparent vs. True Effect of Synthetic Data

Finally, to differentiate between domain adaptation gains (where synthetic data appear in both training and test sets) and genuine improvements in classifying real-world data, we contrasted results across apparent effect and true effect scenarios. The apparent effect compared Model 2b and Model 1 when both the training and test sets included synthetic data, whereas the true effect examined Model 2b’s performance on original-only data vis-à-vis Model 1 on the same dataset. This two-step comparison clarified whether performance gains generalized to human-generated text beyond synthetic–synthetic conditions.
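The two comparisons reduce to simple F1 differences; the sketch below uses hypothetical prediction arrays for the three evaluation conditions described above.

```python
from sklearn.metrics import f1_score

def effect_of_synthetic_data(y_mixed, pred_2b_on_mixed,
                             y_orig, pred_2b_on_orig, pred_1_on_orig,
                             average: str = "macro") -> dict:
    """Apparent effect: model 2b on the mixed (original + synthetic) test set vs. model 1 on original data.
    True effect: model 2b vs. model 1, both evaluated on the original-only test set."""
    f1_2b_mixed = f1_score(y_mixed, pred_2b_on_mixed, average=average)
    f1_2b_orig = f1_score(y_orig, pred_2b_on_orig, average=average)
    f1_1_orig = f1_score(y_orig, pred_1_on_orig, average=average)
    return {"apparent": f1_2b_mixed - f1_1_orig, "true": f1_2b_orig - f1_1_orig}
```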
Overall, these methods provide a comprehensive framework for quantifying how synthetic text, generated via GPT-4 in a few-shot prompting environment, influences emotion classification. By examining both quantitative metrics and the nuances of how synthetic data interact with real text, this study offers insights into best practices and potential caveats for data augmentation in NLP.

5. Results and Discussion

5.1. Model Evaluation

Table 4 summarizes the main metrics for model variants. The baseline_model, trained on the raw, imbalanced dataset, shows a relatively high overall accuracy of 0.64 but obtains a low macro-averaged F1 of 0.44, indicating it struggles with minority classes. By reducing overrepresented categories, model_1 manages to improve the F1-score (0.55), signaling better handling of rarer emotions.
Further progress can be seen with model_2a, where synthetic oversampling significantly boosts minority-class representation, yielding a macro F1 of 0.70 and more balanced precision–recall values. The best performer, model_2b, not only employed synthetic data for minority classes but also doubled the no emotion category with original sentences. This strategy raised the macro F1 to 0.73, alongside an equivalent precision and recall, ultimately delivering the most stable performance across all emotion categories.
From Table 5, we observe that the baseline_model, trained on highly imbalanced data, underperforms especially for fear (0.21) and enthusiasm (0.23), even though it attains a reasonably high F1-score (0.74) for the dominant no emotion category. By reducing the overrepresented classes, model_1 improves on most minority labels (e.g., fear to 0.54), though no emotion drops to 0.41 as a result of undersampling.
When synthetic data are introduced for the underrepresented categories, as in model_2a, several F1-scores increase markedly, notably fear (0.77) and disgust (0.93). This underscores the benefits of oversampling with generative text. Finally, model_2b goes further by replenishing the no emotion category with additional original examples, thereby boosting its F1 to 0.66 while retaining strong performance for other emotions.
Overall, the synthetic data appear highly effective in lifting minority-class performance, but we also wished to identify how much of this improvement was real generalization versus potential overfitting to artificially generated content. To that end, we performed a detailed analysis of the synthetic sentences themselves—evaluating their linguistic diversity (lexical, syntactic, and semantic) and assessing out-of-distribution test scenarios. The results of these deeper investigations clarify the extent to which synthetic augmentation genuinely enhances classification, beyond simply inflating performance metrics.

5.2. Quality and Diversity of Synthetic Data

In this section, we examine the diversity of the generated synthetic data from three different angles, using lexical, syntactic, and semantic measurements. Understanding these factors is crucial to accurately assessing the effectiveness and limitations of the created emotion models.

5.2.1. Type–Token Ratio (TTR)

The results reveal a significant difference in lexical diversity. The original data set achieved a TTR of 0.035, while the synthetic data set exhibited a much lower value of 0.010. This suggests that the synthetic data are more heavily based on repetitive word patterns, potentially reducing their effectiveness in enhancing model generalization. Figure 1 visualizes the TTR values for both datasets.
The reduced TTR in synthetic data indicates that the data generation process did not introduce sufficient lexical variation. Although controlled repetition can be beneficial for maintaining coherence over the distinct emotion categories, an overly constrained vocabulary may limit the ability of models to generalize effectively to unseen text.

5.2.2. Lexical Overlap: Jaccard Index

We computed the Jaccard index for each emotion category separately to determine whether synthetic data maintained lexical consistency with the original dataset. The results are summarized in Table 6.
The results indicate that disgust (0.1673) and fear (0.1642) exhibit the highest lexical overlap between synthetic and original data, suggesting that the generated data closely mirror the vocabulary distribution of human-written text in these categories. Sadness (0.1341), pride (0.1291), and joy (0.1221) show moderate overlap, implying that while the synthetic data maintain some lexical consistency, they also introduce variation. The enthusiasm category (0.0851) has the lowest observed overlap, suggesting greater lexical differences between the original and synthetic datasets. For hope, no emotion, and anger, no synthetic data were generated, and thus, no Jaccard index could be computed for these categories.
These findings align with our earlier lexical diversity results, which highlighted lower overall Type–Token Ratios (TTRs) in synthetic data. The moderate to high Jaccard indices for some categories suggest that the synthetic data generation process is effective at maintaining lexical similarity but may still introduce variations that impact generalization.

5.2.3. POS Diversity

To evaluate the structural variation between the original and synthetic datasets, we analyze the Part-of-Speech (POS) patterns using bigram and trigram frequency distributions.

POS Bigram and Trigram Variability

The comparison of POS patterns highlights key structural differences. While the synthetic data largely preserve the most common grammatical structures of the original dataset, there are noticeable shifts in the frequency of certain phrase patterns.
Bigram Patterns. The bigram analysis shows that “DET + NOUN” remains the most frequent structure in both datasets, although its frequency is slightly reduced in the synthetic data. A more significant drop is observed for “ADJ + NOUN”, which suggests that the synthetic data employ fewer adjective modifications compared to the original. Similarly, “NOUN + ADP” appears much less frequently in the generated sentences, indicating that prepositional phrase constructions are underrepresented in the synthetic dataset.
Interestingly, “NOUN + VERB” appears with similar frequency in both datasets, implying that basic subject–verb constructions are well preserved. However, some lower-ranked bigrams (e.g., “VERB + PUNCT”) appear more frequently in the synthetic data, suggesting that sentence–final structures might be overrepresented.
Trigram Patterns. The trigram analysis reveals a more pronounced divergence. The “PUNCT + ‘,’ + PUNCT” structure remains dominant in both datasets, suggesting that punctuation conventions are consistently maintained. However, differences emerge in noun phrase structures. For instance, we observed the following:
  • “ADP + DET + NOUN” is significantly underrepresented in the synthetic data, implying that prepositional phrase constructions are less frequent.
  • “DET + NOUN + ADP” appears at a much lower rate, further reinforcing that the use of prepositions in complex noun phrases is more constrained.
  • The reduction in “NOUN + DET + NOUN” suggests that noun–noun compounds or appositive structures are also less frequent.
Interestingly, verb-related trigrams such as “NOUN + VERB + NOUN” appear in similar proportions in both datasets, meaning that basic subject–verb–object (SVO) constructions are relatively well preserved. However, “VERB + PUNCT + ‘,’” has a higher occurrence in the synthetic data, possibly indicating a more structured, template-driven sentence generation approach.
Implications. These findings indicate that while synthetic data retain the fundamental syntactic structures of the original dataset, they introduce notable differences in certain phrase constructions. The decreased frequency of adjective–noun combinations and prepositional phrases suggests that synthetic data may simplify noun phrase structures, potentially limiting the diversity of descriptive content. Additionally, the shift in punctuation-related structures hints at a more rigid and formulaic sentence generation process.
Figure 2 and Figure 3 illustrate these trends by comparing the most frequent bigram and trigram structures in both datasets.

5.2.4. Jensen–Shannon Divergence

The computed Jensen–Shannon Divergence (JSD) values between original and synthetic POS n-grams were 0.073 for bigrams and 0.092 for trigrams, suggesting that the synthetic data maintain a POS structure similar to that of the original data, but with noticeable variations, especially in longer syntactic patterns.
The results indicate that synthetic data preserve major syntactic structures, but subtle differences appear in longer POS sequences. The trigram divergence is slightly higher, suggesting that synthetic sentences may introduce distinct multi-word phrase structures. However, the relatively low JSD values imply that the synthetic data do not drastically alter the grammatical framework.
Overall, the synthetic data mirror the structural tendencies of the original corpus but with reduced variability, particularly in longer POS sequences. Consequently, while the generated sentences appear structurally appropriate on the surface, they may not provide the same depth and diversity as the original dataset.

5.2.5. Dependency Tree Edit Distance

Computation and Statistical Summary

We computed TED scores for sentence pairs consisting of original and synthetic samples from the dataset. A statistical summary of the TED scores is presented in Table 7.
The mean TED score of 16.55 suggests a moderate degree of syntactic variation between the original and synthetic data. However, the standard deviation of 6.42 indicates considerable variability, meaning that some synthetic sentences closely resemble their original counterparts, while others exhibit substantial modifications. The median TED score (15.00) being slightly lower than the mean suggests a right-skewed distribution with a subset of sentences exhibiting high syntactic divergence.

Distribution of Syntactic Differences

To further investigate the syntactic divergence, we visualized the distribution of TED scores (Figure 4).
The histogram reveals a right-skewed distribution, with the majority of TED scores concentrated between 10 and 25. This suggests that most synthetic sentences introduce moderate structural modifications rather than drastic transformations. However, the presence of outliers with TED scores exceeding 40 indicates that some synthetic sentences undergo significant syntactic changes. The average TED score was 16.55, with a standard deviation of 5.91 and a median of 16.00, indicating a moderately concentrated distribution with some variability in syntactic transformations.
The observed TED score distribution suggests that the synthetic data retain a degree of syntactic similarity to the original dataset but introduce moderate structural variation. However, the high variability in TED scores raises concerns regarding the consistency of syntactic transformations, specifically concerning the following:
  • The lower TED scores (below 10) indicate that some synthetic sentences are nearly identical in structure to their original counterparts, suggesting limited diversity in augmentation.
  • The moderate TED range (10–25) represents the bulk of the synthetic dataset, demonstrating that the augmentation process introduces reasonable syntactic modifications.
  • The higher TED scores (above 40) highlight cases where the synthetic data significantly alter the dependency structure, which may introduce unintended distortions rather than meaningful linguistic variability.
Overall, while the synthetic dataset successfully introduces structural variation, its consistency varies considerably, potentially affecting its usefulness for training robust language models. Our analysis of synthetic data suggests that generated texts often replicate learned syntactic patterns but rarely introduce significant variations. This can restrict a model’s ability to perform well in real-world scenarios where diverse sentence structures are common. Increasing syntactic diversity requires synthetic data to incorporate varied dependency structures, subordinate clauses, and compound sentence constructions to ensure a richer representation of natural language.

5.3. Model Performance

In this section, we describe the performance of each model version in detail.

5.3.1. Confusion Matrix Analysis

To better understand the impact of synthetic data on model performance, we analyze confusion matrices for different test conditions. These matrices provide insights into how well each model generalizes across datasets and whether synthetic data introduce biases or improve classification accuracy.
For simplicity, we refer to the model trained only on the original data as model 1 and the model trained on both the original and synthetic data as model 2b (following the numbering assigned to the variants during the experiments).
In all confusion matrix figures, each cell contains both the absolute number of instances (top) and the corresponding row-wise percentage (bottom), rounded to one decimal place. This allows for a more nuanced comparison across emotion categories and model variants. The percentages are calculated relative to the total number of true instances per class (i.e., row-wise normalization), making it possible to assess the distribution of predictions independent of class imbalance. The color intensity of the heatmaps reflects these relative percentages rather than the absolute counts, thereby highlighting proportional misclassifications and improvements across models.
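A sketch of how such a row-normalized heatmap can be produced, assuming seaborn and matplotlib; the annotation strings combine the absolute counts with the row-wise percentages, and the color map choice is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_row_normalized_confusion(y_true, y_pred, labels: list[str]) -> None:
    """Heatmap colored by row-wise percentages, annotated with counts and percentages."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
    row_pct = cm / cm.sum(axis=1, keepdims=True) * 100          # row-wise normalization
    annot = np.array([[f"{c}\n{p:.1f}%" for c, p in zip(cr, pr)]
                      for cr, pr in zip(cm, row_pct)])
    sns.heatmap(row_pct, annot=annot, fmt="", cmap="Blues",      # color reflects percentages, not counts
                xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()
```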

Model 2b on Test Set 2b

Model 2b, trained on both original and synthetic data, performs well on test set 2b, which also includes synthetic examples. The confusion matrix (Figure 5) demonstrates that the model correctly classifies most instances. However, notable misclassifications occur in some categories, particularly anger and no emotion. A significant number of no emotion instances are classified as anger, suggesting that the model may have absorbed certain patterns more strongly from the synthetic augmentation.

Model 2b on Test Set 1

When evaluating model 2b on test set 1 (containing only original data), we observe a decline in classification accuracy (Figure 6). The model still performs reasonably well, but misclassifications increase, particularly in categories such as hope and enthusiasm. This suggests that while synthetic data improved overall classification performance, they introduced patterns that do not fully align with the original data. This discrepancy indicates that synthetic data augmentation can shift decision boundaries in ways that may not always transfer effectively to real-world data.

Model 1 on Test Set 1

Model 1, trained exclusively on original data, performs consistently when evaluated on test set 1 (Figure 7). This is expected, as there is no distributional shift between training and testing. However, in comparison to model 2b on the same dataset, its classification accuracy appears slightly lower for some categories. This suggests that exposure to a more diverse dataset, including synthetic examples, may have improved generalization, despite the observed misclassification shifts in model 2b.

Key Observations

From the above, we drew the following main conclusions:
  • Model 2b benefits from synthetic data when evaluated on a test set that also includes synthetic examples, maintaining high classification accuracy.
  • When applied to a dataset with only original data, model 2b exhibits more misclassifications, likely due to learned patterns from synthetic data that do not perfectly align with purely human-written text.
  • Model 1 performs consistently on original data but lacks the potential generalization advantages observed with model 2b trained on diverse data sources.
These findings suggest that while synthetic data augmentation enhances model robustness, its impact on generalization must be carefully considered. The observed misclassification patterns indicate that synthetic data introduce valuable training diversity but also the potential for distributional shifts that require mitigation in real-world applications.

5.3.2. Receiver Operating Characteristic (ROC) Curve Analysis

The Receiver Operating Characteristic (ROC) curve provides insight into the trade-off between true positive and false positive rates for different classification thresholds. The Area Under the Curve (AUC) summarizes the overall performance, with values closer to 1 indicating a better classifier.
We evaluated the classification performance for each emotion category using three different methods:
  • 2b_on_2b: Model 2b trained on both original and synthetic data and evaluated on a test set containing both synthetic and original sentences.
  • 2b_on_1: Model 2b evaluated on model 1’s original-data-only test set.
  • 1_on_1: The model trained only on original data evaluated on its own original-data-only test set.

Key Observations

The ROC analysis indicates that the model trained on synthetic data (2b) generally achieves higher AUC scores compared to the model trained exclusively on original data (model 1). However, the performance difference varies across emotion categories.

High Performance in Certain Categories

For emotions such as disgust, sadness, and joy, the 2b_on_2b evaluation consistently resulted in near-perfect AUC scores (ranging between 0.97 and 1.00), suggesting that training with synthetic data enhances classification capabilities for these emotions. Figure 8 presents the ROC curve for disgust, where the synthetic-data-trained model reaches an AUC of 1.00, while the original-data-only model has a slightly lower score of 0.97.

Moderate Improvements for Certain Emotions

For emotions like hope, pride, and fear, the model trained with synthetic data shows improvement but at a lesser magnitude. As seen in Figure 9, the AUC score of the 2b_on_2b evaluation (0.92) is only slightly higher than in the cases of 2b_on_1 (0.90) and 1_on_1 (0.87).

Significant Differences in the No Emotion Category

One of the most notable divergences occurs in the no emotion category (cf. Figure 10). The 2b_on_2b evaluation significantly outperforms the original-data-only model (AUC of 0.91 vs. 0.77). This suggests that including synthetic data in training helps distinguish between neutral and emotional sentences more effectively.

Summary of ROC Curve Analysis

Overall, the results demonstrate that incorporating synthetic data into training improves classification performance across multiple emotion categories. The enhancement is most pronounced in disgust and no emotion, while emotions like fear and pride show only marginal improvements. The general trend suggests that synthetic data contribute positively to classification robustness, particularly in distinguishing complex emotional nuances.
Given the consistency in performance across most emotions, we moved the less informative ROC curves to the appendix. Categories with minimal differences, such as anger and sadness, for which all models perform comparably well, are not discussed in the main text. The related figures can be found in Appendix B.

5.3.3. Precision–Recall Curve Analysis

To further evaluate the classification performance across different models and datasets, we analyze the precision–recall (PR) curves for each emotion category. PR curves provide a representation of model behavior for imbalanced datasets from another perspective than ROC curves, as they focus on the trade-off between precision and recall.

Key Observations

The PR curves for different emotions indicate significant differences in classification effectiveness. The most notable trends are the following:
  • Sadness and Joy: The PR curves for these emotions indicate strong performance across all models, with the model trained on synthetic and original data (2b) achieving the highest average precision (AP). Model 2b generalizes well, even when tested on purely original data.
  • Pride and Hope: These categories exhibit more divergence in PR curves. Model 2b outperforms the baseline (1) significantly, suggesting that synthetic data help the model capture nuanced patterns.
  • No Emotion: The PR curve for the no emotion class shows the weakest overall performance, especially in the original-data-only model (1). The synthetic-data-trained model (2b) maintains better precision across recall levels.
  • Fear and Disgust: These categories exhibit strong classification, particularly for the synthetic-data-trained model, which achieves near-optimal precision for high recall values.

Comparison of Models

For most categories, the model trained on both synthetic and original data (2b) performs significantly better, confirming that data augmentation improves classification robustness (cf. Figure 11, Figure 12 and Figure 13). The comparison between model 2b evaluated on its own test set and model 2b tested on original data suggests that the benefits of synthetic data persist, even when tested in an out-of-domain setting.
For the remaining PR curves, please refer to Appendix C.

5.4. Evaluating the True Performance Improvement from Synthetic Data

The final step in our analysis is to assess the actual effect of synthetic data on model performance. The comparison between apparent and true improvements in F1-score across different emotions provides deeper insight into the impact of synthetic augmentation.
Our starting assumption here was that the synthetic data form a homogeneous structure that the model learns easily. This results in a significant increase in performance when the evaluation is carried out on a model trained on synthetic data and a test set containing synthetic data. When we compare this to a test set consisting of only original data or to the results of a model trained on only original data on the same test set, we can obtain the “true value” of the synthetic data.

Apparent vs. True Effect of Synthetic Data

Figure 14 illustrates the difference between the apparent and true performance effects of synthetic data. The apparent effect (blue bars) measures the improvement observed when testing the model trained on synthetic and original data (2b) against a baseline model trained only on original data (1). However, this improvement might partially stem from domain adaptation rather than genuine generalization.
To isolate the true effect (orange bars), we compare the performance difference between model 2b tested on the original dataset versus baseline model 1 tested on the same dataset. This removes the potential bias introduced by synthetic data into the training and test sets.
Additionally, the green line represents the synthetic-to-original data ratio, showing how much synthetic data was introduced relative to the original dataset.

5.5. Key Observations

  • Hope, pride, and fear exhibit the highest true performance gains, indicating that synthetic data successfully enhance classification robustness for these emotions. We did not provide synthetic data for the hope and pride categories. This implies that, for these two categories, the synthetic data provided for other categories also helped the model identify them.
  • Anger shows a negative apparent effect, suggesting that the synthetic data might not have contributed positively or that domain adaptation played a role in performance fluctuations. In fact, the model performed slightly worse than its counterpart trained without synthetic augmentation. This may be due to overfitting to synthetic patterns or domain shifts introduced by the augmented data.
  • Joy, sadness, and disgust have a relatively small true improvement, indicating that while synthetic data helped, their effect was limited.
  • The synthetic data ratio varies significantly: for fear, for which the ratio is relatively high, we observe a strong improvement, whereas in the case of anger, for which there were even more synthetic data, the improvement was much smaller.

Implications

These findings suggest that synthetic data augmentation can significantly improve model performance, particularly for emotions that are harder to distinguish in natural data distributions (e.g., fear and sadness). However, the effect is not uniform across categories, and in some cases (e.g., anger), the synthetic data may introduce biases that do not generalize well to purely original test data.
Future research should focus on refining data augmentation strategies, particularly for classes where synthetic data have a smaller impact, ensuring that the introduced variations align closely with real-world distributions.

6. Conclusions and Future Work

The analysis presented in this study provides a detailed evaluation of the benefits and limitations of synthetic data in emotion classification. Our findings suggest that while synthetic data contribute to improved model performance in several aspects, they also introduce challenges that must be carefully addressed.
From a lexical and syntactic perspective, synthetic data exhibit reduced diversity compared to original data. The significantly lower Type–Token Ratio (TTR) of synthetic data suggests that they rely on a more repetitive vocabulary, which may limit their generalization capabilities. The Jaccard index analysis further revealed that a substantial portion of the synthetic data’s vocabulary overlaps with the original dataset, indicating that the generated text does not expand the lexical space as much as one might hope. Structurally, the analysis of part-of-speech (POS) patterns and dependency trees suggests that synthetic data follow the structural tendencies of the original corpus but with limited novel variations. The Dependency Tree Edit Distance (TED) scores confirm that while some synthetic sentences introduce meaningful modifications, many remain structurally similar to their original counterparts.
The impact of synthetic data on model performance is evident in the classification results. Models trained on both synthetic and original data outperform those trained exclusively on original data in most emotion categories. The strongest improvements are observed in emotions such as hope, pride, and fear, for which synthetic data appear to provide useful additional context for the model. However, when tested on a dataset containing only original text, models trained with synthetic data show increased misclassification rates. This suggests that while synthetic data enhance performance within the domain they were generated for, they may introduce subtle distributional shifts that do not always translate well to purely human-written text.
Our analysis of Receiver Operating Characteristic (ROC) curves and precision–recall (PR) curves further supports this conclusion. The models trained with synthetic data achieved higher AUC scores across most emotions, particularly in disgust and no emotion. However, while recall improved, precision was often more sensitive to dataset shifts. This indicates that synthetic data help detect more positive cases but may lead to an increase in false positives. The comparison of apparent and true model improvements revealed that the observed gains in the F1-score were, in part, influenced by the effects of domain adaptation rather than true generalization gains. In all cases, the true improvement was more modest than the apparent gains suggested, highlighting the need for a more refined approach to synthetic data generation.
Overall, synthetic data have proven to be a valuable augmentation tool, but they are not without their limitations. They improve classification robustness, particularly in complex emotional categories, but their over-reliance on repetitive patterns and potential to introduce distributional biases require further investigation. To maximize the benefits of synthetic data, future work will focus on several key areas.
First, we aim to improve the lexical and syntactic diversity of synthetic data. By refining the generation process to introduce more varied vocabulary and structural patterns, we hope to mitigate the observed limitations in generalization. One potential approach is to incorporate multiple-generation models or apply controlled paraphrasing techniques to ensure greater linguistic variation. Additionally, we plan to explore the use of adversarial training methods to reduce the risk of models overfitting to synthetic data patterns.
Second, we intend to investigate hybrid augmentation strategies that combine synthetic data with other data augmentation techniques, such as back-translation and contextual word replacements. By diversifying the augmentation methods, we aim to ensure that models are exposed to a broader range of linguistic variations, ultimately improving their ability to generalize to unseen data.
Another crucial aspect of future research will be the mitigation of distributional shifts introduced by synthetic data. One possible approach is to apply domain adaptation techniques to better align synthetic and original distributions, ensuring that the synthetic data remain useful even in settings where only human-written text is present. We will also explore fine-tuning strategies that incorporate a weighting mechanism, where the model learns to differentiate between synthetic and original instances and adjust its reliance on them accordingly.
Finally, we plan to conduct a more detailed error analysis to identify specific cases where synthetic data improve performance and lead to misclassification. Understanding these patterns will provide deeper insights into when synthetic augmentation is most beneficial and when alternative approaches may be needed.
Synthetic data remain a powerful tool in model training, but their effectiveness depends on careful implementation and ongoing refinement. Through the planned improvements, we aim to make synthetic data augmentation more robust, reducing its weaknesses while preserving its ability to enhance model performance in emotion classification tasks.

Author Contributions

Conceptualization, I.Ü. and O.R.; methodology, I.Ü. and O.R.; software, I.Ü.; validation, O.R.; formal analysis, I.Ü.; investigation, I.Ü. and O.R.; resources, O.R.; data curation, I.Ü. and O.R.; writing—original draft preparation, I.Ü. and O.R.; writing—review and editing, I.Ü. and O.R.; visualization, I.Ü.; supervision, O.R.; project administration, I.Ü. and O.R.; funding acquisition, O.R. All authors have read and agreed to the published version of the manuscript.

Funding

The project was co-financed by the governments of Czechia, Hungary, Poland, and Slovakia through Visegrad Grants from the International Visegrad Fund (ID #22310057); by the Hungarian Academy of Sciences (MOMENTUM V-SHIFT); and by the Ministry of Innovation and Technology National Research, Development and Innovation (NRDI) Office and the European Union within the framework of the Artificial Intelligence National Laboratory Project (Grant Number: RRF-2.3.1-21-2022-00004). The project was also supported by Miklós Sebők’s Excellence project, funded by the Hungarian National Research, Development and Innovation Office’s National Research Excellence Programme (Grant Number: 151324).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Prompt Template

Below is the exact template employed to generate synthetic sentences for the underrepresented emotion categories. It specifies the instructions for sentence length, emotion specificity, and output format; the {} placeholder is filled with the English-language description of the target emotion:
“You are a text generator, whose job is to generate sentences based on the description you are given. The descriptions are about a single human emotion.
1. Your task is to generate sentences based on your own knowledge that reflect the given emotion.
2. The sentences you generate must not contain any other emotion than the one referred to in the description, you got.
3. The description is in English, but you must create sentences in German! Your answer should not contain anything else than the generated sentences, as a python list!
4. Don’t write anything else, only sentences with the right emotion. Do not write any comments next to your answer, only the generated sentences in a python list!
The description of which you must generate sentences with the appropriate emotion:
###
{}
###
Generate exactly 10 sentences with the emotion given in the description!
It is very important to vary the length of the sentences and the topic as much as possible! Try to generate sentences that are as long as possible.
Example sentences:”
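For illustration only, the sketch below shows one way such a template could be filled and submitted to a GPT-4–class model using the OpenAI Python SDK (v1+); the model name, emotion description, and example sentences are placeholders, and this is not the exact generation pipeline used in the study.

```python
# Illustrative sketch: fill the template with an emotion description and example
# sentences, then request synthetic sentences from a GPT-4-class model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

description = "A feeling of strong displeasure or revulsion towards something (disgust)."
examples = ["Beispielsatz 1 ...", "Beispielsatz 2 ..."]  # 10 shots in the actual setting

prompt = (
    "You are a text generator, whose job is to generate sentences based on the "
    "description you are given. [...]\n"
    f"###\n{description}\n###\n"
    "Generate exactly 10 sentences with the emotion given in the description!\n"
    "Example sentences:\n" + "\n".join(examples)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: a Python list of German sentences
```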

Appendix B. Supplementary ROC Curves

Figure A1. ROC curve for joy.
Figure A2. ROC curve for fear.
Figure A3. ROC curve for anger.
Figure A4. ROC curve for sadness.
Figure A5. ROC curve for pride.
Figure A6. ROC curve for enthusiasm.

Appendix C. Additional Precision–Recall Curves

Figure A7. Precision–recall curve for joy.
Figure A8. Precision–recall curve for hope.
Figure A9. Precision–recall curve for fear.
Figure A10. Precision–recall curve for enthusiasm.
Figure A11. Precision–recall curve for disgust.
Figure A12. Precision–recall curve for anger.

Figure 1. Comparison of Type–Token Ratio (TTR) values indicating lexical diversity in original and synthetic texts.
Figure 2. Comparison of top 10 POS bigrams between original and synthetic data.
Figure 3. Comparison of top 10 POS trigrams between original and synthetic data.
Figure 4. Distribution of TED scores between original and synthetic sentences (mean = 16.55, std = 5.91).
Figure 5. Confusion matrix for model 2b evaluated on test set 2b.
Figure 6. Confusion matrix for model 2b evaluated on test set 1.
Figure 7. Confusion matrix for model 1 evaluated on test set 1.
Figure 8. ROC curve for disgust. The 2b_on_2b evaluation achieves an AUC of 1.00, indicating near-perfect classification performance.
Figure 9. ROC curve for hope. The model trained with synthetic data shows an improvement over the original-data-only model, but differences remain marginal.
Figure 10. ROC curve for no emotion. The synthetic-data-trained model demonstrates superior classification compared to the original-data-only model.
Figure 11. Precision–recall curve for sadness.
Figure 12. Precision–recall curve for pride.
Figure 13. Precision–recall curve for no emotion.
Figure 14. Comparison of apparent and true effects of synthetic data on model performance across emotions.
Table 1. Original label distribution of the Widmann dataset (including cannot be coded).

Category          Count     Proportion (%)
disgust           1056      0.8
fear              3168      2.5
sadness           3337      2.6
pride             3542      2.8
joy               4275      3.4
enthusiasm        5434      4.3
cannot be coded   6869      5.4
hope              7763      6.1
anger             29,357    23.1
no emotion        62,285    49.0
Total             127,086   100.0
Table 2. Key differences between the evaluated models in terms of data composition and augmentation strategy.

Model            Undersampling                    Synthetic Augmentation        Original Data Duplication
baseline_model   No                               No                            No
model_1          Yes (for anger and no emotion)   No                            No
model_2a         Yes (same as model_1)            Yes (for minority classes)    No
model_2b         Yes (same as model_1)            Yes (for minority classes)    Yes (for no emotion)
Table 3. Number of original (ORIG) and synthetic (SYN) samples for each category in four dataset variants.

             baseline_model             model_1                model_2a                   model_2b
Category     ORIG     SYN  All          ORIG    SYN  All       ORIG    SYN     All        ORIG    SYN     All
disgust      1056     0    1056         1056    0    1056      1056    6707    7763       1056    6707    7763
fear         3168     0    3168         3168    0    3168      3168    4595    7763       3168    4595    7763
sadness      3337     0    3337         3337    0    3337      3337    4426    7763       3337    4426    7763
pride        3542     0    3542         3542    0    3542      3542    4221    7763       3542    4221    7763
joy          4275     0    4275         4275    0    4275      4275    3488    7763       4275    3488    7763
enthusiasm   5434     0    5434         5434    0    5434      5434    2329    7763       5434    2329    7763
hope         7763     0    7763         7763    0    7763      7763    0       7763       7763    0       7763
anger        29,357   0    29,357       7763    0    7763      7763    0       7763       7763    0       7763
no emotion   62,285   0    62,285       7763    0    7763      7763    0       7763       15,526  0       15,526
All          120,217  0    120,217      44,101  0    44,101    69,867  20,966  90,833 ¹   77,630  21,066  98,696 ²

¹ 69,867 ORIG + 20,966 SYN = 90,833. ² 77,630 ORIG + 21,066 SYN = 98,696.
Table 4. Comparison of macro F1, precision, recall, and accuracy for the four evaluated models on the German test set.

Model            Macro F1   Precision   Recall   Accuracy
model_2b         0.73       0.73        0.73     0.72
model_2a         0.70       0.70        0.71     0.71
model_1          0.55       0.55        0.56     0.54
baseline_model   0.44       0.58        0.39     0.64
Table 5. Per-class F1-scores for each model on the German dataset, with classes as rows and models as columns.

Category     Baseline_Model   Model_1   Model_2a   Model_2b
Anger        0.59             0.62      0.61       0.57
Fear         0.21             0.54      0.77       0.80
Disgust      0.43             0.59      0.93       0.93
Sadness      0.51             0.63      0.85       0.86
Joy          0.57             0.68      0.81       0.83
Enthusiasm   0.23             0.44      0.63       0.63
Hope         0.29             0.55      0.53       0.54
Pride        0.41             0.52      0.77       0.78
No emotion   0.74             0.41      0.39       0.66
Table 6. Jaccard index by emotion category.

Emotion      Jaccard Index
Disgust      0.1673
Fear         0.1642
Sadness      0.1341
Pride        0.1291
Joy          0.1221
Enthusiasm   0.0851
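The Jaccard index reported in Table 6 measures vocabulary overlap between original and synthetic texts per emotion; a minimal sketch of the computation, with hypothetical token sets rather than the study's vocabularies, is:

```python
# Minimal sketch of a per-emotion Jaccard index between original and synthetic
# vocabularies. The two token sets below are placeholders.
def jaccard_index(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

orig_vocab = {"angst", "sorge", "furcht", "bedrohung"}     # placeholder original vocabulary
syn_vocab = {"angst", "furcht", "panik", "unsicherheit"}   # placeholder synthetic vocabulary
print(round(jaccard_index(orig_vocab, syn_vocab), 4))      # 2 shared / 6 total = 0.3333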
Table 7. Summary statistics of TED scores.

Metric                 Value
Mean TED Score         16.55
Standard Deviation     6.42
Median TED Score       15.00
25th Percentile (Q1)   12.00
75th Percentile (Q3)   20.00
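Tree edit distance (TED) values of the kind summarized in Table 7 can be computed, for example, with the `zss` library (an assumption for illustration; the study's scores compare original and synthetic sentences, which in practice first requires parsing them into trees). The toy trees below are placeholders.

```python
# Sketch of a tree edit distance computation with the zss library.
from zss import Node, simple_distance

t1 = Node("ROOT", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
t2 = Node("ROOT", [Node("NP"), Node("VP", [Node("V")])])

# One leaf must be deleted to turn t1 into t2, so the expected distance is 1.
print(simple_distance(t1, t2))
```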
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
