2. Relevant Works
The field of computational text analysis has advanced significantly, particularly in emotion detection and classification. In contrast to traditional sentiment analysis—which typically categorizes text as positive, negative, or neutral—emotion analysis attempts to classify specific emotional states, such as happiness, anger, or fear [4,5]. This distinction is especially pertinent in political communication research, where emotional cues, however subtle, can substantially shape how a message is received and interpreted by the public [11,12]. Emotions may reinforce or undermine credibility, alter perceptions of policy issues, and even affect voting decisions. Consequently, understanding emotional tones in textual data has become a focal point for scholars and practitioners aiming to quantify the impact of political discourse on audiences [1,2,13,14,15].
Recent research in the German context has also explored emotion classification with various methodological approaches. Christian Rauh validated a sentiment dictionary for analyzing German political language [16], testing it on parliamentary speeches, party manifestos, and media content. The results indicated that positive emotions are easier to detect, while negative emotions pose more significant challenges, especially in shorter texts. Widmann and Wich [17] compared three models for emotion detection in German political texts: a dictionary-based model, a neural network with word embeddings, and a transformer-based model (ELECTRA). Their findings demonstrated that the transformer-based model significantly outperformed traditional dictionary-based approaches in accurately measuring emotional nuances.
Early research efforts in this domain focused heavily on sentiment polarity detection within political news [6,7]. Although foundational for analyzing broad positive or negative trends, these studies highlighted the need for more granular models that account for varied emotional expressions. Later, researchers began to adopt more nuanced theoretical frameworks, such as Martin and White’s appraisal approach [18], which addresses subtleties such as affect and judgement in a text. Likewise, explicit opinion analysis [19] broadened the methodological toolkit for analyzing how political actors, media outlets, or stakeholders articulate attitudes and stances. These methodological extensions made it increasingly possible to capture subtle emotional and evaluative dimensions in political statements, campaign materials, and news articles. Mullen and Malouf [20], for instance, conducted pioneering work on informal political discourse—particularly in online debate forums—while Van Atteveldt et al. [21] employed machine learning techniques to distinguish positive from negative sentiments in campaign texts. Additionally, innovative approaches such as dictionary-based methods or crowd-coding allowed researchers to combine computational efficiency with human insight [13], and subsequent comparisons with expert-coded data provided a clearer view of how automated sentiment tools align with human judgments [14].
However, most prior studies either focused on high-resource languages or relied on traditional classification models, often overlooking the impact of data imbalance on minority emotion categories or the linguistic quality of synthetic inputs. Beyond the English-speaking world, scholars have tackled emotion and sentiment analysis in under-resourced or morphologically rich languages, further broadening the field’s scope. Turkish [22,23], Norwegian [24], and Albanian [25] exemplify languages where morphological complexity and limited annotated data present challenges for traditional approaches. Similarly, Ukrainian, Russian [26], Indonesian [27], and Hungarian [28,29,30] have garnered increased scholarly attention. Political news in these contexts often contains unique linguistic structures and cultural references, necessitating specialized lexicons or domain-specific models to detect sentiment and emotion accurately.
In contrast to previous works, our study introduces a twofold contribution: first, we systematically assess the effectiveness of GPT-4-generated synthetic data in balancing imbalanced emotion classes in German political text; second, we offer a fine-grained linguistic evaluation of the generated data’s lexical, syntactic, and semantic features. This allows us not only to boost classification performance but also to reflect critically on the limitations of current augmentation practices. Unlike [31], which focused on Hungarian political sentiment using monolingual resources, and [32], which addressed Arabic emotion detection, our work combines large-scale generative augmentation with structural linguistic diagnostics, offering a perspective complementary to transformer-based emotion modeling [33,34,35].
Recent developments in natural language processing (NLP), particularly the advent of large language models (LLMs), have significantly influenced emotion analysis in political contexts. Research suggests that these models can outperform some fine-tuned BERT variants in entity-specific tasks, such as identifying sentiments directed at particular political figures or policy issues [36]. Furthermore, specialized adaptations of these models have appeared for underrepresented languages. For instance, ref. [31] investigates political communication in Hungarian, while [32] explores Arabic emotion detection, demonstrating how LLMs can be tailored to diverse linguistic landscapes. The success of transformer-based approaches in classifying emotions from headlines, where brevity and possible sensationalism can confound analysis, highlights the nuanced capabilities of modern architectures [37].
Transformer architectures themselves have revolutionized NLP since their introduction [38], offering more efficient parallelization and robust feature learning. BERT, in particular, popularized fully bidirectional attention mechanisms and established new baselines for many downstream tasks by leveraging large-scale pre-training [15]. GPT-3 illustrated the feasibility of few-shot or zero-shot learning setups, drastically reducing data requirements [1], while RoBERTa refined training procedures for greater performance gains, and T5’s unified text-to-text paradigm streamlined different tasks under a common framework [2,3]. These strategies have been extended to multilingual domains as well, where models like mBERT, mBART, and XLM-R [39] share parameters across languages. Although distributing representational capacity across many languages can dilute performance for high-resource languages, it can significantly benefit under-resourced settings through cross-lingual transfer and shared vocabulary.
Model selection thus becomes a balancing act between specialization and generalization. Monolingual models can yield higher accuracy by concentrating on language-specific traits, including morphological rules or idiomatic expressions, whereas multilingual counterparts facilitate transfer learning and cost-effective training for low-resource languages. The real-world utility of each approach often hinges on data availability, the domain’s complexity, and the research or application goals [40,41,42].
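As a minimal illustration of this trade-off, the following sketch (not taken from our experimental setup) shows how a monolingual German encoder and a multilingual encoder might be loaded for emotion classification with the Hugging Face transformers library; the checkpoints and the number of labels are assumptions for demonstration purposes.

```python
# Minimal sketch contrasting a monolingual German encoder with a multilingual
# one for emotion classification. Checkpoints and label count are illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_EMOTIONS = 8  # hypothetical number of emotion classes

def load_classifier(model_name: str):
    """Load a tokenizer and a sequence-classification head for `model_name`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=NUM_EMOTIONS
    )
    return tokenizer, model

# Monolingual: tuned to German morphology and vocabulary.
de_tokenizer, de_model = load_classifier("bert-base-german-cased")

# Multilingual: shares parameters across many languages, enabling
# cross-lingual transfer at the cost of per-language capacity.
xlm_tokenizer, xlm_model = load_classifier("xlm-roberta-base")
```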
Data augmentation methods also offer ways to enhance training in scenarios where collecting labeled samples is expensive or logistically difficult. Techniques originally refined for computer vision [40] have been adapted to NLP, including “Easy Data Augmentation” (EDA) strategies that randomly insert, delete, swap, or replace words [41]. While these techniques can improve model robustness, especially for simpler classification tasks, their random nature may distort semantic nuance [42]. More sophisticated augmentation involves embeddings like Word2Vec, GloVe, or transformer-based vectors, enabling semantically aware replacements that preserve context and meaning [43,44,45].
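For concreteness, the following simplified sketch illustrates two of the EDA-style operations mentioned above (random swap and random deletion); it is an illustrative approximation rather than the original EDA implementation, and synonym-based operations would additionally require a lexical resource.

```python
# Simplified sketch of two EDA-style operations: random swap and random
# deletion. Synonym replacement/insertion would also need a synonym lexicon.
import random

def random_swap(tokens, n_swaps=1):
    """Swap two randomly chosen token positions n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p (keep at least one)."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

sentence = "Die Regierung verspricht neue Investitionen in Bildung".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```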
In parallel, generative models—particularly instruction-tuned LLMs—provide another avenue for synthetic data generation to counter data scarcity [1]. When prompted with detailed instructions, these models can produce diverse text samples that capture specific styles, emotional tones, or domain features, thereby expanding a corpus in controlled ways. Nevertheless, synthetic data quality is closely tied to the underlying model’s capabilities and prompt design. Over-reliance on machine-generated samples may introduce spurious patterns, ultimately weakening real-world generalization if not balanced with manually verified or gold-standard data [46,47]. Integrating few-shot examples—where a small set of high-quality human-labeled samples is provided—can improve generative consistency and thematic accuracy, proving especially useful for tasks like emotion classification that rely on nuanced linguistic cues [3,4,11].
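A hedged sketch of this few-shot prompting strategy is given below; the model name, prompt wording, and example sentences are illustrative assumptions and do not reproduce the exact prompts used in this study.

```python
# Illustrative few-shot prompt for generating synthetic minority-class
# examples; model, prompt, and examples are assumptions, not this study's setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

few_shot_examples = [
    "Wir sind zuversichtlich, dass die Reform den Familien endlich Entlastung bringt.",
    "Diese Einigung gibt uns Hoffnung auf einen fairen Kompromiss.",
]

prompt = (
    "Du bist ein Assistent für Datenaugmentierung. Erzeuge fünf neue, kurze "
    "Sätze im Stil deutscher politischer Pressemitteilungen, die die Emotion "
    "'Hoffnung' ausdrücken. Beispiele:\n- " + "\n- ".join(few_shot_examples)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,  # higher temperature to encourage lexical variety
)
print(response.choices[0].message.content)
```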
In summary, advancing emotion detection and classification in political communication requires both technical innovations and domain-specific considerations. Robust transformer-based architectures, multilingual and monolingual model trade-offs, and carefully executed data augmentation techniques collectively shape how effectively we can capture the emotional signals embedded in political texts. Although breakthroughs in LLMs demonstrate promising results, ongoing research highlights the need to balance automated methods with reliable human-labeled data and contextually aware modeling approaches, ensuring that emotion analysis remains both accurate and reflective of complex real-world discourse.
6. Conclusions and Future Work
The analysis presented in this study provides a detailed evaluation of the benefits and limitations of synthetic data in emotion classification. Our findings suggest that while synthetic data contribute to improved model performance in several aspects, they also introduce challenges that must be carefully addressed.
From a lexical and syntactic perspective, synthetic data exhibit reduced diversity compared to original data. The significantly lower Type–Token Ratio (TTR) of synthetic data suggests that they rely on a more repetitive vocabulary, which may limit their generalization capabilities. The Jaccard index analysis further revealed that a substantial portion of the synthetic data’s vocabulary overlaps with the original dataset, indicating that the generated text does not expand the lexical space as much as one might hope. Structurally, the analysis of part-of-speech (POS) patterns and dependency trees suggests that synthetic data follow the structural tendencies of the original corpus but with limited novel variations. The Dependency Tree Edit Distance (TED) scores confirm that while some synthetic sentences introduce meaningful modifications, many remain structurally similar to their original counterparts.
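For reference, the following minimal sketch shows how the two lexical diagnostics discussed above can be computed from their standard definitions; tokenization is deliberately naive (whitespace splitting) and the example sentences are invented.

```python
# Minimal sketch of the lexical diagnostics: Type-Token Ratio and Jaccard
# vocabulary overlap between an original and a synthetic corpus.
def type_token_ratio(texts):
    """TTR = number of unique tokens / total number of tokens."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def jaccard_overlap(texts_a, texts_b):
    """Jaccard index between the vocabularies of two corpora."""
    vocab_a = {tok.lower() for text in texts_a for tok in text.split()}
    vocab_b = {tok.lower() for text in texts_b for tok in text.split()}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

original = ["Die Opposition kritisiert den Haushaltsentwurf scharf."]
synthetic = ["Die Opposition kritisiert den neuen Haushaltsentwurf."]
print(type_token_ratio(synthetic), jaccard_overlap(original, synthetic))
```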
The impact of synthetic data on model performance is evident in the classification results. Models trained on both synthetic and original data outperform those trained exclusively on original data in most emotion categories. The strongest improvements are observed in emotions such as hope, pride, and fear, for which synthetic data appear to provide useful additional context for the model. However, when tested on a dataset containing only original text, models trained with synthetic data show increased misclassification rates. This suggests that while synthetic data enhance performance within the domain they were generated for, they may introduce subtle distributional shifts that do not always translate well to purely human-written text.
Our analysis of Receiver Operating Characteristic (ROC) curves and precision–recall (PR) curves further supports this conclusion. The models trained with synthetic data achieved higher AUC scores across most emotions, particularly in disgust and no emotion. However, while recall improved, precision was often more sensitive to dataset shifts. This indicates that synthetic data help detect more positive cases but may lead to an increase in false positives. The comparison of apparent and true model improvements revealed that the observed gains in the F1-score were, in part, influenced by the effects of domain adaptation rather than true generalization gains. In all cases, the true improvement was more modest than the apparent gains suggested, highlighting the need for a more refined approach to synthetic data generation.
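The per-emotion evaluation described above can be reproduced in a one-vs-rest fashion as sketched below; the labels and predicted probabilities are random placeholders, and average precision is used here as a summary of the PR curve.

```python
# Sketch of per-emotion, one-vs-rest ROC AUC and average precision with
# scikit-learn; y_true and y_score stand in for real labels and probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

emotions = ["anger", "fear", "hope", "pride", "disgust", "no emotion"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(emotions), size=200)           # gold labels
y_score = rng.dirichlet(np.ones(len(emotions)), size=200)   # predicted probabilities

for idx, emotion in enumerate(emotions):
    binary_true = (y_true == idx).astype(int)
    print(
        f"{emotion:>10}  ROC AUC={roc_auc_score(binary_true, y_score[:, idx]):.3f}  "
        f"AP={average_precision_score(binary_true, y_score[:, idx]):.3f}"
    )
```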
Overall, synthetic data have proven to be a valuable augmentation tool, but they are not without their limitations. They improve classification robustness, particularly in complex emotional categories, but their over-reliance on repetitive patterns and potential to introduce distributional biases require further investigation. To maximize the benefits of synthetic data, future work will focus on several key areas.
First, we aim to improve the lexical and syntactic diversity of synthetic data. By refining the generation process to introduce more varied vocabulary and structural patterns, we hope to mitigate the observed limitations in generalization. One potential approach is to incorporate multiple generation models or apply controlled paraphrasing techniques to ensure greater linguistic variation. Additionally, we plan to explore the use of adversarial training methods to reduce the risk of models overfitting to synthetic data patterns.
Second, we intend to investigate hybrid augmentation strategies that combine synthetic data with other data augmentation techniques, such as back-translation and contextual word replacements. By diversifying the augmentation methods, we aim to ensure that models are exposed to a broader range of linguistic variations, ultimately improving their ability to generalize to unseen data.
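As an example of one such technique, the following sketch outlines back-translation through an English pivot; the MarianMT checkpoints and the choice of pivot language are illustrative assumptions rather than a finalized design.

```python
# Hedged sketch of back-translation (German -> English -> German) as one of
# the hybrid augmentation techniques mentioned above.
from transformers import pipeline

de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def back_translate(sentence: str) -> str:
    """Paraphrase a German sentence by translating it to English and back."""
    english = de_to_en(sentence)[0]["translation_text"]
    return en_to_de(english)[0]["translation_text"]

print(back_translate("Die Regierung kündigt umfassende Reformen im Gesundheitswesen an."))
```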
Another crucial aspect of future research will be the mitigation of distributional shifts introduced by synthetic data. One possible approach is to apply domain adaptation techniques to better align synthetic and original distributions, ensuring that the synthetic data remain useful even in settings where only human-written text is present. We will also explore fine-tuning strategies that incorporate a weighting mechanism, where the model learns to differentiate between synthetic and original instances and adjust its reliance on them accordingly.
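A possible form of this weighting mechanism is sketched below, where synthetic instances contribute less to a cross-entropy loss than human-written ones; the weighting factor and batch setup are assumptions for illustration.

```python
# Sketch of instance weighting: synthetic examples are down-weighted relative
# to human-written ones. The weight value is a hypothetical choice.
import torch
import torch.nn.functional as F

SYNTHETIC_WEIGHT = 0.5  # hypothetical down-weighting factor

def weighted_cross_entropy(logits, labels, is_synthetic):
    """Per-example cross-entropy, down-weighted for synthetic instances."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(
        is_synthetic,
        torch.full_like(per_example, SYNTHETIC_WEIGHT),
        torch.ones_like(per_example),
    )
    return (weights * per_example).sum() / weights.sum()

logits = torch.randn(4, 8)                       # batch of 4, 8 emotion classes
labels = torch.tensor([0, 3, 5, 2])
is_synthetic = torch.tensor([True, False, True, False])
print(weighted_cross_entropy(logits, labels, is_synthetic))
```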
Finally, we plan to conduct a more detailed error analysis to identify specific cases where synthetic data improve performance and those where they lead to misclassification. Understanding these patterns will provide deeper insights into when synthetic augmentation is most beneficial and when alternative approaches may be needed.
Synthetic data remain a powerful tool in model training, but their effectiveness depends on careful implementation and ongoing refinement. Through the planned improvements, we aim to make synthetic data augmentation more robust, reducing its weaknesses while preserving its ability to enhance model performance in emotion classification tasks.