Article

The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets

by
Fernando Henrique Calderón Alvarado
Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 24205, Taiwan
Appl. Sci. 2025, 15(7), 3490; https://doi.org/10.3390/app15073490
Submission received: 13 December 2024 / Revised: 27 February 2025 / Accepted: 21 March 2025 / Published: 22 March 2025
(This article belongs to the Special Issue Application of Affective Computing)

Abstract

This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses—including frequency distribution, term frequency-inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction—revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two transformer models, bidirectional encoder representations from transformers (BERT) and its denoising sequence-to-sequence counterpart (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications.

1. Introduction

Emotion detection is a critical task in NLP with applications spanning from sentiment analysis to human–computer interaction. The ability to accurately identify and interpret emotions in text can significantly enhance user experience in various domains, including social media, customer service, and mental health monitoring [1]. Emotions play a central role in human communication, influencing decision making, behavior, and social interactions. Consequently, the accurate detection and analysis of emotions from text have become increasingly important in developing intelligent systems that can understand and respond to human emotions effectively [2]. For instance, in social media, emotion detection can help identify and mitigate cyberbullying, enhance content recommendation systems, and provide insights into public sentiment on various issues [3]. In customer service, emotion-aware systems can improve user satisfaction by tailoring responses to the emotional state of the customer [4]. Moreover, in mental health monitoring, emotion detection can assist in identifying signs of emotional distress and providing timely interventions [5].
Recent advancements in synthetic dataset generation have opened new avenues for improving the performance of emotion detection models. Synthetic datasets, generated using large language models, offer a scalable solution to the challenges of data scarcity and diversity [6]. However, the effectiveness of these datasets can be influenced by the specificity of the prompts used during their creation. Incorporating regional language nuances into the prompts can potentially enhance the diversity and relevance of the generated datasets, leading to better model performance.
The primary objectives of this study are to generate synthetic datasets for emotion detection that incorporate regional language variations, analyze the linguistic diversity and regional differences in the generated datasets, and evaluate the impact of region-specific synthetic datasets on the performance of emotion detection models.
This study makes several key contributions:
  • Demonstrate the feasibility of generating region-specific synthetic datasets for emotion detection using advanced language models.
  • Provide a comprehensive analysis of the linguistic diversity and regional differences in the generated datasets.
  • Show that region-specific synthetic datasets can significantly improve the performance of emotion detection models.
  • Highlight the importance of incorporating regional language variations in synthetic dataset generation, offering valuable insights for future research and applications in NLP.
The novelty of this research lies in its focus on regional language nuances and their impact on synthetic dataset generation and emotion detection performance. By addressing these aspects, the study contributes to the broader understanding of synthetic data’s role in NLP and paves the way for more inclusive and effective emotion detection models.
The remainder of this paper is organized as follows. Section 2 provides an overview of the related work, with subsections covering synthetic dataset generation (Section 2.1), emotion detection (Section 2.2), and regional language variations (Section 2.3). Section 3 details the methodology, including dataset generation (Section 3.3), statistical analysis (Section 3.4), classification experiments (Section 3.5), and evaluation (Section 3.6). Section 4 presents the experimental results, followed by a discussion in Section 5. Finally, Section 6 concludes the study by summarizing key findings and outlining potential directions for future research.

2. Related Work

2.1. Synthetic Dataset Generation for NLP

The increasing reliance on data-driven approaches in natural language processing (NLP) has fueled a growing demand for large, diverse, and high-quality datasets. However, the acquisition and annotation of real-world data can be expensive, time-consuming, and fraught with privacy concerns. This has led to a surge of interest in synthetic data generation as a viable alternative [7]. Early efforts in synthetic data generation primarily relied on simpler rule-based or statistical methods, which often struggled to capture the nuances and complexities of natural language [8].
The advent of powerful language models (LMs), like the Generative Pre-trained Transformer (GPT), Language Model for Dialogue Applications (LaMDA), and Pathways Language Model (PaLM), has marked a significant turning point in this field [9,10,11]. These models, trained on massive text corpora, have exhibited remarkable capabilities in generating human-quality text that is both coherent and contextually relevant [9]. This has opened up new possibilities for creating synthetic datasets that can effectively mimic real-world data while overcoming the limitations of traditional methods [7].
Recent research has explored various strategies to further enhance the quality and utility of synthetic data. For instance, Li et al. (2021) [12] investigated the use of data augmentation techniques, including synthetic data generation, to improve cross-lingual transfer learning in dependency parsing. Their findings demonstrated that synthetic data can be particularly beneficial for low-resource languages, where real-world data are scarce [12]. In another study, Sánchez-Junquera et al. (2021) [13] delved into the generation of synthetic data for emotion recognition in conversation, highlighting the potential of LMs to create data that capture the nuances of emotional expression in conversational contexts.
Despite the significant progress, the field of synthetic data generation still faces several challenges. One critical challenge is ensuring the diversity and representativeness of the generated data, especially when dealing with tasks like emotion detection, where cultural and linguistic variations play a crucial role [7].

2.2. Emotion Detection: Advancements and Challenges

Emotion detection, a key task in NLP, has witnessed substantial progress in recent years, driven by advancements in deep learning and the availability of large datasets. Early approaches often relied on handcrafted features and lexical resources, which limited their ability to capture the complexities of emotional expression [14]. However, the emergence of deep learning models, particularly transformer-based architectures like BERT and RoBERTa, has revolutionized the field. These models, with their ability to learn contextualized representations of language, have achieved state-of-the-art results on various emotion detection benchmarks [15].
Despite the impressive performance of these models, several challenges remain. One major challenge is addressing the issue of bias and fairness. Studies have shown that emotion detection models can exhibit biases against certain demographic groups, leading to unfair or discriminatory outcomes. This bias often stems from the training data, which may not be representative of all populations [16]. Another challenge is capturing the nuances of emotional expression across different cultures and linguistic contexts. Emotions are often expressed differently in different cultures, and models trained on data from one culture may not generalize well to others.
To address these challenges, researchers have explored various approaches. For example, Schuff et al. (2020) [14] conducted a computational analysis of emotion expression in English and Chinese, highlighting the cultural differences in how emotions are conveyed in text. Their work emphasizes the need for culturally aware emotion detection models that can adapt to different cultural contexts. In another study, Acheampong et al. (2020) [15] reviewed BERT-based approaches for emotion detection, highlighting their strengths and limitations in capturing implicit emotions, particularly in diverse linguistic contexts.

2.3. Regional Language Variations in NLP

Regional variations in language, encompassing dialects, accents, and colloquialisms, pose significant challenges for NLP tasks, including emotion detection. Models trained on data from one region may not generalize well to other regions due to differences in linguistic patterns and cultural norms. This has led to a growing body of research focused on capturing and modeling regional language variations in NLP.
One approach is to incorporate region-specific information into the models. For example, Jurgens and Lu (2017) [17] explored the use of geolocation data to enhance sentiment analysis, demonstrating that incorporating regional context can improve model performance. Another approach is to develop region-specific language models. This involves training separate models on data from different regions, allowing each model to specialize in the linguistic characteristics of its respective region [18].
The study of regional language variations also has implications for data collection and annotation. To ensure that NLP models are fair and inclusive, it is crucial to collect and annotate data that are representative of all regions and linguistic communities. This requires careful consideration of sampling strategies and annotation guidelines to avoid biases and ensure that all voices are represented.
Despite the growing use of synthetic datasets in NLP, many studies do not explicitly consider regional linguistic variations, which can impact model generalization and performance. The existing research on synthetic data generation has primarily focused on general-purpose datasets, often assuming linguistic uniformity across English-speaking populations [19,20]. However, regional language variations influence how emotions are expressed, which has been observed in sentiment analysis and other NLP tasks [17,21]. Failing to capture these variations may lead to biases in emotion detection models, limiting their applicability across diverse populations. This study builds upon the existing literature by investigating the use of region-specific synthetic datasets for emotion detection. By leveraging the capabilities of large language models (LLMs) to generate data that capture regional language variations, we aim to develop more robust and culturally sensitive emotion detection models that can be deployed in diverse real-world settings.

3. Methodology

3.1. Objectives

To address the previously mentioned gaps, this study seeks to answer the following research questions:
  • RQ1: How do region-specific prompts influence the linguistic diversity of synthetic datasets?
  • RQ2: What are the regional differences in language use for expressing emotions in synthetic datasets?
  • RQ3: How do region-specific synthetic datasets affect the performance of emotion detection models compared to baseline datasets?
Based on these questions, the following hypotheses are proposed:
  • H1: Prior research has shown that prompt engineering can guide language models to generate domain-specific content [19]. By incorporating region-specific cues, prompts are expected to elicit greater linguistic diversity, reflecting local idioms, word choices, and syntactic structures.
  • H2: Studies on regional language variations suggest that people in different regions use distinct expressions and lexical choices to convey emotions [16,22]. This hypothesis assumes that these differences will be evident in synthetic datasets when region-specific prompts are used.
  • H3: Empirical findings indicate that training NLP models on more diverse and contextually relevant data improves performance [15]. Since region-specific datasets are expected to better capture local linguistic nuances, models trained on them should outperform those trained on baseline datasets.

3.2. Overview

This study employs a mixed-method approach, combining quantitative and qualitative analyses to investigate the impact of incorporating regional variations in language on synthetic datasets for emotion detection. The methodology encompasses four key stages:
  • Dataset Generation: Synthetic datasets are generated using the GPT-3.5 language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity are employed to assess the influence of regional linguistic nuances on the generated data.
  • Statistical Analysis: A range of statistical analyses are conducted to evaluate the linguistic diversity and regional differences in the generated datasets. These analyses include frequency distribution, TF-IDF, type–token ratio, hapax legomena, PMI scores, and key-phrase extraction.
  • Classification Experiments: BERT and BART models are employed to conduct classification experiments, starting with zero-shot classification on the CARER dataset, followed by fine-tuning with both baseline and region-specific datasets. This stage evaluates the impact of region-specific synthetic datasets on emotion detection performance.
  • Evaluation: The performance of the models is evaluated using standard metrics, and comparisons are made between the baseline and region-specific datasets to assess the impact of regional language variations on model performance.
This comprehensive methodology allows for a thorough investigation of the research questions and hypotheses, providing insights into the role of regional linguistic variations in synthetic dataset generation and their impact on emotion detection performance. The complete methodological framework is depicted in Figure 1.

3.3. Dataset Generation

GPT-3.5 was utilized to generate synthetic datasets for emotion detection based on Ekman’s six basic emotions [23]. The datasets were tailored to English variations from the United States, United Kingdom, and India, alongside a baseline dataset. These countries were chosen for this initial study because they represent regions where English is a native or widely spoken language, yet they exhibit significant linguistic and cultural differences. This selection allows us to explore how regional language nuances impact the generated datasets. Two levels of prompt specificity were employed: a general prompt requesting regional language and a detailed prompt with more specific instructions regarding vocabulary, grammar, and cultural references. The baseline dataset was generated by simply requesting “English social media texts” with no specific regional details. This approach allows us to investigate how regional language nuances influence the diversity and relevance of the generated datasets, addressing our research questions on the impact of regional prompts. While this study focuses on these three regions, the methodology can be expanded to include other languages, regions, and more fine-grained locations, providing a broader understanding of regional variations in emotion expression.
For each generated record, the model would randomly choose one of the six target emotions, resulting in a different emotion distribution for each set. A summary of the counts of the records corresponding to each emotion per dataset is presented in Table 1. The total for the baseline is three times that of the other datasets because, during the fine-tuning stages, the three regional general datasets are merged into a single set, as are the three regional detailed datasets.
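As an illustration of this generation procedure, the following is a minimal sketch using the OpenAI chat API. The prompt templates, region keys, and record format are hypothetical stand-ins; the study’s exact prompt wording is not reproduced here.

```python
import json
import random

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EMOTIONS = ["joy", "fear", "surprise", "sadness", "disgust", "anger"]

# Hypothetical prompt templates; the study's exact wording is not reproduced here.
PROMPTS = {
    ("baseline", "general"): "Write a short English social media text expressing {emotion}.",
    ("US", "general"): "Write a short social media text in American English expressing {emotion}.",
    ("US", "detailed"): ("Write a short social media text in American English expressing {emotion}. "
                         "Use US-specific vocabulary, slang, spelling, and cultural references."),
}

def generate_record(region: str, specificity: str) -> dict:
    emotion = random.choice(EMOTIONS)  # each record gets a randomly chosen target emotion
    prompt = PROMPTS[(region, specificity)].format(emotion=emotion)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"text": response.choices[0].message.content, "label": emotion, "region": region}

if __name__ == "__main__":
    print(json.dumps([generate_record("US", "detailed") for _ in range(3)], indent=2))
```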

3.4. Statistical Analysis

To evaluate the linguistic diversity and regional differences in the generated datasets, several statistical analyses were performed. These analyses help us understand the characteristics of the datasets and how they vary across different regions, which is crucial for assessing the effectiveness of our synthetic data generation approach.

3.4.1. Frequency Distribution

Analyzing the frequency distribution of words in each dataset helps determine the most commonly used terms across different regions. This analysis provides insight into whether region-specific prompts lead to distinct vocabularies, which is essential for assessing the presence of regional language characteristics. By comparing high-frequency words, it is possible to identify commonalities and differences in how emotions are expressed across linguistic variations.
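A minimal sketch of this frequency analysis, assuming simple lowercase whitespace tokenization (the paper does not specify its tokenizer):

```python
from collections import Counter

def top_terms(texts, k=20):
    """Return the k most frequent lowercase tokens across a dataset."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return counts.most_common(k)

# Toy comparison of high-frequency vocabulary across two regional datasets.
us_texts = ["totally shook by those vibes today", "feelin great about the game"]
uk_texts = ["gutted about the match, innit", "proper chuffed with my mates"]
print(top_terms(us_texts, 5))
print(top_terms(uk_texts, 5))
```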

3.4.2. TF-IDF

TF-IDF was applied to identify words that hold particular significance in each dataset. This metric highlights words that appear frequently within a region-specific dataset while remaining relatively uncommon across the entire corpus, thus uncovering regionally distinctive terms. The use of TF-IDF helps assess whether regional prompts result in datasets that emphasize unique lexical choices, which can enhance the adaptability of emotion detection models to specific linguistic contexts. The TF-IDF score for a term t in document d is given by
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log \frac{N}{\text{DF}(t)},$$
where $\text{TF}(t, d)$ is the term frequency of $t$ in $d$, $N$ is the total number of documents, and $\text{DF}(t)$ is the document frequency of $t$.
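For illustration, the following sketch ranks regionally distinctive terms with scikit-learn’s TfidfVectorizer; note that scikit-learn applies a smoothed IDF rather than the plain $\log(N/\text{DF}(t))$ above, so the scores differ slightly while the ranking logic is analogous.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def distinctive_terms(texts, k=10):
    """Rank terms by their maximum TF-IDF weight across documents."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(texts)  # one row per document
    scores = np.asarray(matrix.max(axis=0).todense()).ravel()
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(terms, scores), key=lambda pair: -pair[1])[:k]

print(distinctive_terms([
    "gutted about the weather innit",
    "totally shook by those vibes",
    "missing my morning chai yaar",
]))
```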

3.4.3. Type–Token Ratio and Hapax Legomena

Lexical diversity is a crucial factor in determining the richness of a dataset. The TTR measures the variety of vocabulary by computing the ratio of unique words (types) to the total word count (tokens). A higher TTR indicates a more lexically diverse dataset, suggesting greater linguistic expressiveness. TTR is defined as
$$\text{TTR} = \frac{\text{Number of Types}}{\text{Number of Tokens}}.$$
Additionally, the hapax legomena count—words that appear only once in a dataset—was calculated. A higher number of hapax legomena suggests the presence of rare or unique words, indicating richer linguistic diversity. These measures collectively help determine whether region-specific prompts enhance the expressiveness and variability of the generated datasets, which is critical for improving emotion detection models.
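Both measures reduce to simple token counting; a sketch, again assuming whitespace tokenization:

```python
from collections import Counter

def lexical_diversity(texts):
    """Compute the type-token ratio and hapax legomena count for a dataset."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)                     # unique types / total tokens
    hapax = sum(1 for c in counts.values() if c == 1)   # words occurring exactly once
    return ttr, hapax

ttr, hapax = lexical_diversity(["proper gutted about the match", "so so happy today"])
print(f"TTR = {ttr:.3f}, hapax legomena = {hapax}")
```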

3.4.4. PMI Scores

PMI was computed to analyze how words co-occur in different regional datasets. This measure quantifies how strongly two words are associated beyond random chance, providing insights into regional variations in phrase usage and contextual meaning. PMI is particularly useful in emotion detection because it helps uncover patterns in how emotions are expressed through word pairings. By comparing PMI scores across regional datasets, this study examines whether region-specific datasets reflect distinctive semantic structures that could improve model performance. PMI for a word pair $(w_1, w_2)$ is given by
$$\text{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},$$
where $P(w_1, w_2)$ is the joint probability of $w_1$ and $w_2$, and $P(w_1)$ and $P(w_2)$ are the individual probabilities of $w_1$ and $w_2$. This analysis helps us understand the contextual relationships between words, revealing how regional language variations influence the way emotions are expressed.
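A minimal sketch of this computation, assuming adjacent-bigram co-occurrence (the paper does not state its co-occurrence window) and probabilities estimated from raw counts:

```python
import math
from collections import Counter

def pmi_scores(texts, min_count=2):
    """PMI for adjacent word pairs, with probabilities estimated from raw counts."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for text in texts:
        tokens = text.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        total += len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / total
        scores[(w1, w2)] = math.log(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(pmi_scores(["feeling shook today", "feeling shook again", "so happy today"])[:10])
```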

3.4.5. Key-Phrase Extraction

Key-phrase extraction was performed using the rapid automatic keyword extraction (RAKE) algorithm [24] to identify prominent themes in each dataset. This analysis helps determine whether regional variations influence the way emotions are discussed. Since emotions are often context-dependent, identifying key phrases allows for a deeper understanding of how different linguistic communities articulate emotional experiences. The results from this technique further inform how well synthetic datasets capture the thematic elements essential for emotion detection.
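For illustration, a sketch using the rake-nltk package (an assumption; the paper names only the RAKE algorithm, not a specific implementation):

```python
from rake_nltk import Rake  # pip install rake-nltk; may require nltk.download("stopwords")

def top_keyphrases(texts, k=10):
    """Extract the k highest-scoring RAKE keyphrases from a dataset."""
    rake = Rake()
    rake.extract_keywords_from_sentences(texts)
    return rake.get_ranked_phrases()[:k]

print(top_keyphrases([
    "Gutted about the match but the banter with my mates was top notch",
    "Missing my morning chai and Bollywood tunes, yaar",
]))
```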

3.5. Classification Experiments

A series of classification experiments were conducted using BERT [25] and BART [26] models to evaluate the impact of region-specific synthetic datasets on emotion detection performance. Specifically, the “bert-base-uncased” and “bart-large-mnli” models were used, respectively. These models were selected due to their established reliability and robustness in the task of emotion detection. Nonetheless, any language model capable of both zero-shot classification and fine-tuning would arguably be equally suitable for this task. These experiments help us test our hypothesis that region-specific datasets improve model performance.

3.5.1. Zero-Shot Classification

Initially, zero-shot classification was performed on the CARER dataset [27] using pre-trained BERT and BART models to establish baseline performance. This step allows us to compare the performance of models without any fine-tuning, providing a reference point for evaluating the impact of our synthetic datasets.
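The BART side of this step can be reproduced almost directly with the Hugging Face zero-shot pipeline built on the bart-large-mnli checkpoint; the exact BERT zero-shot setup is not specified here, so only the NLI route is sketched.

```python
from transformers import pipeline

# facebook/bart-large-mnli is the NLI-pretrained checkpoint named in Section 3.5.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["joy", "fear", "surprise", "sadness", "disgust", "anger"]
result = classifier("I can't believe they cancelled the show, absolutely gutted.", labels)
print(result["labels"][0], round(result["scores"][0], 3))  # top-ranked emotion
```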

3.5.2. Fine-Tuning

The fine-tuning process in this study involves adapting the pre-trained models for multi-class emotion classification using a custom-labeled dataset. The models are initialized with six output labels corresponding to the emotion categories: joy, fear, surprise, sadness, disgust, and anger. The emotion dataset is pre-processed by mapping textual emotion labels to numerical values and tokenizing the text inputs using the corresponding BERT and BART tokenizers. Tokenization ensures uniform sequence lengths with truncation and padding to a maximum length of 128 tokens. The dataset is split into training and validation subsets for supervised learning, where the model is fine-tuned using a cross-entropy loss function to minimize the difference between predicted and true labels. The training configuration includes a learning rate of $2 \times 10^{-5}$, a batch size of 4 for both training and evaluation, and weight decay for regularization. The training process spans three epochs, with the model evaluated after each epoch based on accuracy. The Trainer API from Hugging Face’s Transformers library manages the optimization, evaluation, and saving of the best-performing model checkpoint. This approach leverages the model’s pre-trained language understanding while adapting its classification head to the specific task of emotion recognition, resulting in a domain-specific fine-tuned model.
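A condensed sketch of this fine-tuning setup for the BERT variant, using the hyperparameters reported above; the toy records, the weight-decay value, and the dataset split ratio are illustrative assumptions.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["joy", "fear", "surprise", "sadness", "disgust", "anger"]
label2id = {label: i for i, label in enumerate(LABELS)}

# Toy records standing in for a synthetic regional dataset of {"text", "label"} pairs.
records = [{"text": "Feeling chuffed about the weekend!", "label": "joy"}] * 64
dataset = Dataset.from_list(records).map(lambda r: {"label": label2id[r["label"]]})
dataset = dataset.train_test_split(test_size=0.2)  # split ratio assumed

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=6)

def tokenize(batch):
    # Truncation and padding to a maximum length of 128 tokens, as described above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (preds == eval_pred.label_ids).mean()}

args = TrainingArguments(
    output_dir="emotion-ft",
    learning_rate=2e-5,              # hyperparameters as reported in Section 3.5.2
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,               # value assumed; the paper does not report it
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

Trainer(model=model, args=args, train_dataset=dataset["train"],
        eval_dataset=dataset["test"], compute_metrics=compute_metrics).train()
```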
The models were then fine-tuned with the following datasets:
  • Baseline synthetic dataset (general English)
  • Region-specific synthetic datasets with general prompts
  • Region-specific synthetic datasets with detailed prompts
Fine-tuning the models with these different datasets allows us to assess how regional language variations and prompt specificity affect model performance. This step is crucial for testing our hypothesis that region-specific synthetic datasets lead to better emotion detection. Cross-dataset classification was chosen as it presents a more challenging task and better simulates the practical application of emotion detection in online environments, where classification is often performed on previously unseen data samples.

3.6. Evaluation

The performance of the models was evaluated using standard metrics [28]. Comparisons were made between the baseline and region-specific datasets to assess the impact of regional language variations on model performance. This evaluation helps us determine the effectiveness of our approach and provides evidence for the benefits of incorporating regional nuances in synthetic dataset generation.
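Assuming the standard metrics are accuracy and F1 (the quantities shown in Figure 2 and Figure 3), the comparison can be computed as follows; the label vectors here are toy placeholders standing in for CARER gold labels and model predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Toy placeholders; y_true would come from CARER labels and y_pred from the
# fine-tuned model's predictions on the same test texts.
y_true = ["joy", "anger", "sadness", "joy", "fear"]
y_pred = ["joy", "anger", "joy", "joy", "fear"]

print(f"accuracy = {accuracy_score(y_true, y_pred):.3f}")
print(f"macro F1 = {f1_score(y_true, y_pred, average='macro'):.3f}")
print(classification_report(y_true, y_pred, zero_division=0))
```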

4. Results

4.1. Zero-Shot Classification

In the context of zero-shot classification, the pre-trained models exhibited significantly different performance levels. As shown in Figure 2 and Figure 3, the pre-trained BERT model performs poorly without additional training. In contrast, the selected version of the BART model, pre-trained for natural language inference, demonstrates inherent zero-shot capabilities, which are evident in its performance.

4.2. Fine-Tuning with Synthetic Datasets

The models were then fine-tuned with the synthetic datasets generated for this study. The performance of models fine-tuned with the baseline dataset (general English) was compared against those fine-tuned with region-specific datasets (US, UK, India) using both general and detailed prompts. This comparison addresses our research question on how region-specific synthetic datasets affect model performance.
The performance results presented in Figure 2 and Figure 3 provide insights into our research question (RQ3). These results demonstrate that the highest classification performance is achieved when models are fine-tuned using regionally detailed synthetic datasets. The hypothesis is that the diversity and richness of expressions present in these datasets enable the models to generalize better, allowing them to correctly predict more complex and nuanced unseen cases.
The BART model fine-tuned with the regionally detailed datasets consistently outperforms other configurations, achieving the highest accuracy and F1 score across all experiments. This suggests that the pre-trained BART-large-MNLI model, which has been specifically optimized for natural language inference, benefits significantly from fine-tuning on domain-specific data, further enhancing its capability to handle nuanced classification tasks.
In the case of the BERT-based models, there is a clearer progression in performance from zero-shot classification to the best-performing fine-tuned configuration. This progression highlights the model’s need for domain adaptation to achieve optimal results. This can be attributed to the fact that the BERT model, being a general-purpose pre-trained language model, requires fine-tuning to specialize in emotion classification tasks. In contrast, the selected BART model exhibits relatively strong zero-shot performance due to its pre-training for language inference tasks, which likely aligns more closely with the requirements of emotion classification.
These findings underscore the importance of both pre-training objectives and fine-tuning datasets in optimizing model performance for emotion detection tasks, particularly in scenarios involving complex and diverse data distributions.

4.3. Statistical Analysis of Datasets

Our statistical analyses provided further insights into the characteristics of the generated datasets, specifically addressing our research question on the influence of region-specific prompts on linguistic diversity. The frequent term analysis revealed how varying levels of prompting, as well as the different regional contexts, elicit more unique and regionally specific vocabulary. Table 2 presents the top-20 most frequent words from the different synthetic datasets, highlighting several unique terms that appear to be generated as a result of the different regional prompts.
One of the key observations is that the datasets generated from more detailed prompts exhibit a higher proportion of specialized or domain-specific terms, even within the top-20 most frequent words. This suggests that the regional prompts successfully captured and emphasized the unique linguistic characteristics of each region. This finding further supports the hypothesis that more detailed prompts lead to richer and more diverse word usage.
In the case of the regional detailed dataset for English from the United States, the use of slang and colloquial expressions, such as “vibes”, “shook”, and “feelin”, can be observed. These terms reflect the informal, conversational nature of the language in this region. Similarly, the regional detailed dataset from the United Kingdom includes vocabulary that is distinctly British, such as “mates”, “gutted”, and “innit”, among others. This confirms the presence of region-specific linguistic features that distinguish British English from other variants of the language.
Moreover, the dataset from India reveals a fascinating mix of cultural cues, with terms such as “Bollywood” and “chai” commonly appearing. Additionally, this dataset contains numerous instances of code-switching, where English is mixed with local languages. Terms like “yaar”, “hai”, and “kia” are indicative of the multilingual nature of communication in this region, where speakers frequently incorporate words from Hindi, Urdu, and other regional languages into their English discourse. These findings underscore the importance of regional prompts in capturing the nuances of language use and how cultural and linguistic diversity are reflected in the generated datasets.
From Figure 4, it is evident that both the type–token ratio (TTR) and the counts of hapax legomena were significantly higher in the region-specific detailed datasets compared to the baseline. This indicates that the region-specific prompts resulted in a richer and more diverse vocabulary. Specifically, the higher TTR values suggest that a greater proportion of unique words were used in the regionally tailored datasets, reflecting a broader lexical diversity. Additionally, the increased occurrence of hapax legomena—words that appear only once in the corpus—further supports the notion that these datasets contain a wider range of rare and specialized terms, which are often characteristic of more specific linguistic contexts.
These findings imply that the regional prompts successfully encouraged the generation of language that is more varied and representative of the distinct linguistic features and cultural nuances inherent to each region. The higher TTR and hapax legomena counts also suggest that the models trained on these region-specific datasets are better equipped to understand and generate more nuanced expressions reflective of different regional and cultural contexts. This increased linguistic diversity may also indicate that the models trained on these region-specific datasets are better equipped to understand and generate more complex and varied text, moving beyond the generic or repetitive vocabulary typically found in more general datasets.
These results directly address our first research question (RQ1), which investigates the influence of region-specific prompts on the linguistic diversity and richness of the generated datasets.
The pointwise mutual information (PMI) scores for the top 10 co-occurring word pairs from each regional synthetic dataset are presented in Figure 5. These pairs reveal significant associations that are unique to each region, underscoring how regional language variations influence the expression of emotions. The PMI analysis provides valuable insights into how the choice of regional prompts can shape the types of emotional expressions generated by the model, highlighting the subtle but distinct ways in which cultural and linguistic contexts affect the language produced.
An interesting observation from this analysis is the variety of emojis and expressions that the language model produces when given appropriate region-specific prompts. The model not only captures regional language nuances but also reflects cultural-specific symbols, like emojis, which are an integral part of modern communication. This suggests that when prompted with regional cues, the model can adapt its output to align with local forms of expression.
The analysis of the co-occurring word pairs further illustrates the differences in expressions elicited across the various datasets. These differences are particularly evident in the use of hashtags, a modern cultural feature that encapsulates regional identity. For example, in the UK dataset, hashtags such as #AngryBritish and #ProudBritish are present, directly reflecting the socio-political identity of the region. In contrast, the India dataset includes region-specific hashtags like #IndiaPride, explicitly signifying national pride and cultural identity.
Additionally, the dataset from the UK includes subtle references to British cultural elements, such as #RoyallyShocked, which evokes the British royal family, and #TopBanter, a term commonly associated with UK slang. The India dataset, on the other hand, features the hashtag #DesiHipster, a clear cultural marker linked to the youth subculture in India. These cultural nuances, although less overt, provide important context to the emotional expressions generated by the model.
In the US dataset, more region-specific cultural references are observed, such as “Eggo waffles” and the hashtag #SayNoToPineapplePizza, both of which highlight American food culture and regional humor. Similarly, the UK dataset features references to “tea”, a beverage deeply rooted in British tradition, while the India dataset includes “vegan”, reflecting growing trends in Indian food culture. These examples illustrate how socio-cultural aspects, particularly food, are reflected in the synthetic datasets, offering further evidence of the model’s sensitivity to regional influences.
This detailed analysis of co-occurring word pairs and regional expressions provides crucial insights into the second research question (RQ2), which explores the influence of regional prompts on the expression of emotions and the way cultural elements shape the language model’s output. By observing these region-specific linguistic features, deeper understanding of how models can be fine-tuned to capture and reflect cultural nuances in emotional expressions was gained.
To further investigate the variations in the generated content, keyphrase extraction was conducted using the rapid automatic keyword extraction (RAKE) algorithm. This method allowed us to identify prominent topics within each dataset, revealing thematic differences across regions. The top 10 keyphrases for each dataset are presented in Figure 6. One of the first observations from this analysis is that the baseline dataset contains highly repetitive keyphrases, with almost all of the top ten phrases being identical. This suggests a lack of linguistic diversity in the baseline dataset, which likely arises from the general nature of the prompts used to generate the content.
In contrast, the regional datasets—corresponding to the IN (India), US, and UK regions—are presented on the left side of Figure 6. These datasets were generated using a general prompt, and as expected, many of the keyphrases seem to follow a templated structure. Although there is some regional specificity reflected in the keyphrases, such as the inclusion of cultural references and colloquialisms, the overall linguistic diversity appears somewhat limited due to the broad nature of the prompting.
On the right side of Figure 6, the keyphrases extracted from datasets generated with more specific prompts are presented. These datasets exhibit a much greater variety in terms of linguistic expression, with the keyphrases being more free-form and less constrained by a predefined structure. For example, the India dataset reveals the interesting feature of code-switching, with local languages interspersed with English terms, reflecting the cultural and linguistic fluidity present in Indian communication.
These differences highlight the power of prompt design in eliciting region-specific linguistic features. By using more specific prompts, it is possible to generate content that is not only richer in variety but also more authentic to the cultural context of the region. This underscores the importance of tailoring prompts to capture the linguistic diversity of different regions, which can significantly influence the quality and authenticity of the generated content. Ultimately, this analysis further demonstrates how prompt engineering can be used to steer the generation of language models toward more regionally relevant outputs.
In summary, the results of our experiments demonstrate that incorporating regional language variations into synthetic dataset generation significantly enhances the performance of emotion detection models. The detailed prompts, in particular, provide a richer and more diverse linguistic representation, leading to improved model accuracy and robustness. These findings directly address our research questions and support our hypotheses regarding the benefits of region-specific synthetic datasets.

5. Discussion

5.1. Linguistic Diversity and Regional Differences

Our statistical analyses revealed significant regional differences and enhanced linguistic diversity in the datasets generated using detailed prompts. In Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, we present the term frequency-inverse document frequency (TF-IDF) word clouds for the different regional datasets. On the left side of these figures, we display the datasets generated using a general prompt, while on the right side, we show the corresponding datasets generated with a more detailed prompt. The word clouds serve as an effective tool for visually comparing the linguistic differences elicited by varying levels of prompt detail. By examining these word clouds, we can quickly identify the specific terms and phrases that are more prominent in each dataset, providing insight into how the detail in the prompt influences the richness and specificity of the generated vocabulary.
The word clouds from the general prompts typically show more common, broadly used terms, indicating less diversity in the generated content. In contrast, the word clouds from the detailed prompts reveal a wider range of region-specific words and expressions. These differences underscore how detailed prompting leads to more nuanced and contextually appropriate language, reflecting the unique linguistic characteristics of each region. The use of TF-IDF further emphasizes the prominence of these region-specific terms, highlighting how such prompts can better capture cultural and linguistic nuances in the generated data.
The observed increase in the type–token ratio (TTR) and hapax legomena ratios in the region-specific datasets compared to the baseline suggests a richer and more diverse vocabulary in the regionally tailored data. This outcome aligns with our hypothesis that the inclusion of region-specific prompts fosters the generation of more varied and nuanced synthetic datasets. The higher diversity in word usage reflects a broader range of emotional expressions, which is crucial for improving the robustness and generalizability of emotion detection models.
Furthermore, the pointwise mutual information (PMI) scores for co-occurring word pairs highlighted significant associations unique to each region. These findings underscore the influence of regional language variations on emotional expression, as they show how different cultural and linguistic factors shape the way emotions are conveyed in text. For instance, the use of region-specific idioms, slang, and culturally relevant terms all contribute to the nuanced expression of emotions, which must be accounted for when developing emotion detection systems.
In addition, key-phrase extraction via the RAKE algorithm provided valuable insights into the thematic differences across the regional datasets. The identified keyphrases revealed distinctive topics that are of particular relevance to each region, further emphasizing the importance of considering regional context in emotion detection tasks. These regional variations in keyphrases reinforce the idea that emotions are not universally discussed in the same way; instead, they are shaped by local cultural and social factors. By incorporating regional-specific cues, models can achieve better performance and more accurate predictions when dealing with diverse, real-world data. Therefore, these findings highlight the significance of region-specific prompts in capturing the full spectrum of emotional expressions, which is essential for building more effective and inclusive emotion detection systems. However, it is crucial to be mindful of potential biases and ethical considerations when generating synthetic data that reflect regional linguistic variations. As highlighted by Mozes et al. (2023) [7], careful prompt engineering and dataset evaluation are necessary to ensure inclusivity and avoid perpetuating stereotypes. Additionally, exploring the impact of synthetic data on cross-lingual transfer learning and its potential to improve NLP models in low-resource languages is an important area for future research, as suggested by Li et al. (2021) [12].

5.2. Impact on Emotion Detection Performance

The results from our classification experiments with both BERT and BART models clearly demonstrate that region-specific synthetic datasets, particularly those generated using detailed prompts, significantly enhance emotion detection performance compared to the baseline. This improvement underscores the value of region-specific data in capturing the linguistic and cultural nuances of emotional expression. By tailoring the datasets to reflect regional variations in language, such as slang, idioms, and culturally specific references, the models become better equipped to understand and classify emotions more accurately. This suggests that emotion detection systems can benefit from regionally diverse data, which may lead to more robust and contextually sensitive models that are better suited for real-world applications across different demographics.
The zero-shot classification results on the CARER dataset initially provided a useful baseline for performance evaluation. However, the subsequent fine-tuning of the models with the synthetic region-specific datasets resulted in a notable improvement in performance. This finding supports our hypothesis that incorporating regional language variations into the fine-tuning process enhances emotion detection capabilities. Models fine-tuned with region-specific data consistently outperformed those fine-tuned with the baseline dataset, further emphasizing the importance of region-aware training. The improved performance highlights the ability of regionally adapted datasets to capture contextually relevant emotional expressions, thereby providing more accurate predictions.
Overall, our findings suggest that region-specific data are a crucial factor in refining emotion detection models, and incorporating such data into the fine-tuning process can significantly improve the robustness, accuracy, and cultural sensitivity of these models. This has important implications for the development of emotion detection systems, particularly those intended for use in diverse, global contexts where emotional expression may vary widely across regions. These results are consistent with the findings in the literature that emphasize the importance of considering regional variations in language for various NLP tasks. For instance, Acheampong et al. (2020) [15] highlighted the limitations of current models in capturing implicit emotions, particularly in diverse linguistic contexts, and Hovy and Purschke (2018) [21] emphasized the need for models that can adapt to different linguistic contexts. Our study contributes to this body of knowledge by demonstrating the effectiveness of region-specific synthetic datasets in enhancing emotion detection performance. Furthermore, Sánchez-Junquera et al. (2021) [13] have shown that models trained on synthetic data can achieve comparable performance to those trained on real data for emotion recognition in conversation. This supports the notion that synthetic data can be a valuable resource for improving emotion detection models. Additionally, the influence of cultural factors on emotion expression in text has been well-documented by Schuff et al. (2020) [14], highlighting the differences in how emotions are conveyed across different cultures. This underscores the importance of considering regional variations in language for emotion detection tasks and developing culturally sensitive models.

5.3. Implications for Future Research and Applications

Our findings have several important implications for the future of research and applications in natural language processing (NLP). First, they emphasize the critical need for inclusive and representative datasets that accurately capture regional language variations. This is especially significant for tasks like emotion detection, where cultural and linguistic nuances deeply influence how emotions are expressed, understood, and perceived. As our results show, regional prompts can effectively capture these subtleties, highlighting the importance of incorporating diverse linguistic expressions into training data to build more inclusive and accurate models. Without such diversity, NLP systems risk overlooking or misinterpreting emotionally charged expressions that may vary widely across different cultural contexts.
The primary focus of this study is to highlight the critical need to consider regional variations in language when developing and training emotion detection models. While synthetic data serve as a valuable tool in this particular research due to the limited availability of diverse and readily accessible real-world datasets, they are not presented as the ultimate solution for addressing linguistic diversity. The central aim is to underscore the importance of incorporating regional linguistic nuances in the training process, regardless of whether the data are sourced from real-world corpora or generated synthetically. This emphasis on diversity aims to promote the development of more inclusive and robust emotion detection models that can better cater to the diverse linguistic landscape of the digital world.
While our study provides valuable insights into the impact of region-specific datasets on emotion detection, it also has some limitations that warrant consideration. The primary focus on English variations from the US, UK, and India represents an initial exploration of regional differences, but further research is necessary to expand the scope to other languages and regions. By including a more diverse set of languages and cultural contexts, future studies could provide a more comprehensive understanding of how regional variations influence emotional expression across the globe.
Furthermore, while this study specifically evaluated the effectiveness of synthetic datasets for emotion detection, future work could extend these findings to explore the impact of region-specific datasets on other NLP tasks, such as sentiment analysis, text classification, and even language translation. This would further demonstrate the broad applicability of regionally tailored data in improving NLP systems.
Additionally, the focus on just three countries leaves room for further exploration of more fine-grained regional variations within countries. Many nations exhibit diverse dialects, slang, and cultural subgroups that may significantly influence language use. By investigating regional variations within a single country, researchers could gain deeper insights into the nuances of emotional expression at a more localized level, offering even greater precision in emotion detection models.
Beyond linguistic variations, the integration of multimodal data (e.g., text, audio, and video) presents another exciting avenue for future research. Emotion detection, for instance, is often enriched by non-verbal cues such as tone of voice, facial expressions, and body language. Incorporating multimodal data could provide a more holistic view of emotional expression, leading to more accurate and robust models.
Moreover, the development of more sophisticated prompt engineering techniques could further improve the quality and relevance of synthetic datasets. By refining the prompts to better capture specific emotional cues, cultural nuances, and contextual information, researchers could enhance the performance of models trained on these datasets. This would allow for a more precise and tailored approach to synthetic data generation, especially when dealing with complex tasks like emotion detection. The use of synthetic data for training large language models has also shown promising results, as demonstrated by Saharia et al. (2023) [9], indicating their potential to improve the performance of these models on various NLP tasks.
Overall, our study demonstrates the significant benefits of incorporating regional language variations into the generation of synthetic datasets for emotion detection. By addressing the limitations of the existing approaches and providing a scalable solution to data scarcity and diversity, our research contributes to advancing the field of NLP. Our findings emphasize the need for more inclusive, culturally aware, and context-sensitive models, ultimately contributing to the development of more effective emotion detection systems that can be deployed in real-world applications across diverse linguistic and cultural contexts. However, it is essential to address ethical considerations, particularly concerning privacy and consent, and ensure responsible AI development when generating and utilizing synthetic data, as emphasized by Wang et al. (2022) [29].

5.4. Ethical Considerations

The use of synthetic datasets and emotion detection models presents several ethical challenges that must be carefully addressed to ensure responsible and fair applications. In this study, we implemented specific measures to mitigate these ethical concerns.

5.4.1. Bias and Fairness

To address potential biases in synthetic datasets, we employed a multi-step validation process. First, we designed region-specific prompts with neutral and inclusive language to prevent reinforcement of stereotypes. Second, we analyzed the linguistic diversity of the generated datasets using statistical methods, such as type–token ratio and keyphrase extraction, to ensure they reflect a broad and representative sample of regional language variations. Third, we compared classification performance across multiple regions to detect any systematic biases in model predictions. Any observed discrepancies were addressed by refining prompt engineering strategies to enhance fairness and balance.

5.4.2. Privacy and Consent

Since our study utilized synthetic datasets generated by a large language model, no personally identifiable information (PII) from real individuals was involved. However, we ensured that the synthetic data generation process did not inadvertently reproduce sensitive or private information by employing strict filtering mechanisms and auditing generated text for anomalies. Additionally, our methodology adheres to ethical AI principles by ensuring transparency in data sources and generation processes, preventing potential privacy violations.

5.4.3. Misuse and Misinterpretation

To minimize the risk of misuse, we provide clear documentation on the intended applications and limitations of our emotion detection models. The models were developed for research purposes in NLP and were not designed for applications involving surveillance or high-stakes decision-making. Additionally, we emphasize the importance of human oversight when deploying these models in real-world scenarios to prevent misinterpretation of emotional signals and unintended consequences.

5.4.4. Transparency and Accountability

To promote transparency, we have fully documented our dataset generation methodology, evaluation criteria, and analysis techniques. The research follows established ethical AI frameworks, and the results are reported with thorough explanations of their implications. We also encourage independent validation of our findings by making key resources and evaluation scripts available for replication. Moreover, our study adheres to ethical guidelines set forth by relevant research institutions and follows best practices in NLP research to uphold accountability and responsible AI development.

6. Conclusions

In this study, the impact of region-specific synthetic datasets on emotion detection performance was explored. By leveraging GPT-3.5, synthetic datasets were generated and tailored to distinct English variations from the US, UK, and India, alongside a baseline dataset. Our methodology incorporated two levels of prompt specificity to evaluate how regional language nuances influenced the diversity and relevance of the generated datasets. This approach enabled us to assess the extent to which linguistic and cultural variations shape emotion detection tasks.
Our comprehensive statistical analyses, including frequency distribution, TF-IDF, type–token ratio, hapax legomena, PMI scores, and keyphrase extraction, revealed notable regional differences and enhanced linguistic diversity in datasets generated with more detailed prompts. These findings underscore the importance of incorporating regional language variations in synthetic dataset generation, as they contribute to a richer and more diverse representation of language, enhancing the quality of models trained on such data.
The classification experiments using BERT and BART models demonstrated that region-specific synthetic datasets, particularly those generated with detailed prompts, led to significant improvements in emotion detection performance compared to the baseline. This improvement highlights the potential of region-specific datasets to enhance the robustness and accuracy of emotion detection models. The results also address the limitations of the current approaches, which often fail to capture the intricacies of regional language variations that can significantly affect emotion recognition.
Our research contributes to the broader understanding of synthetic data’s role in natural language processing by systematically evaluating the impact of regional language variations. Valuable insights into how synthetic datasets can be tailored to better reflect linguistic diversity were gained, ultimately making emotion detection models more robust and accurate. Additionally, this study opens up new avenues for future research, suggesting that the methodology can be expanded to include other languages, regions, and more fine-grained regional distinctions. By doing so, researchers can enhance the applicability and effectiveness of synthetic datasets across various NLP tasks.
In summary, our findings demonstrate that incorporating regional language nuances into synthetic dataset generation is a promising and effective approach to improving emotion detection models. This research not only advances the field of synthetic data generation but also emphasizes the importance of developing inclusive and representative datasets that reflect the full spectrum of linguistic and cultural diversity. Future work can build on this foundation by exploring additional regions and languages, with the ultimate goal of creating more robust, accurate, and culturally aware NLP systems that can better serve global, multilingual audiences.

Funding

This research was funded by the National Science and Technology Council of Taiwan, grant number NSTC 113-2221-E-030-006. The funding agreement ensured the author’s independence in designing the study, interpreting the data, and writing and publishing the report.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and analyzed during the current study are available in the GitHub repository at the following link: https://github.com/fhcalderon87/SyntheticRegionalEmotion (accessed on 1 December 2024). This repository includes the synthetic datasets for emotion detection tailored to English variations from the US, UK, and India, as well as the baseline dataset. Additional data and code used for statistical analyses and classification experiments are also provided to ensure reproducibility and facilitate further research.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Alswaidan, N.; Menai, M.E.B. A survey of state-of-the-art approaches for emotion recognition in text. Knowl. Inf. Syst. 2020, 62, 2937–2987.
2. Plaza-del-Arco, F.M.; Curry, A.; Curry, A.C.; Hovy, D. Emotion Analysis in NLP: Trends, Gaps and Roadmap for Future Directions. arXiv 2024, arXiv:2403.01222.
3. Safari, F.; Chalechale, A. Emotion and personality analysis and detection using natural language processing, advances, challenges and future scope. Artif. Intell. Rev. 2023, 56, 3273–3297.
4. Prajapati, Y.; Khande, R.; Parasar, A. Sentiment Analysis of Emotion Detection Using Natural Language Processing; Springer: Berlin/Heidelberg, Germany, 2021.
5. Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors 2024, 24, 3484.
6. Gamage, G.; Silva, D.D.; Mills, N.; Alahakoon, D.; Manic, M. Emotion AWARE: An artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis. J. Big Data 2024, 11, 93.
7. Mozes, M.; Tutelman, D.; Winter, Y.; Globerson, A. Synthetic Data for Natural Language Processing: A Survey. arXiv 2023, arXiv:2304.03751.
8. Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating sentences from a continuous space. arXiv 2015, arXiv:1511.06349.
9. Saharia, C.; Tamkin, A.; Allmendinger, F.; Lowe, R.; Pang, R.; Tabrizian, M.; Liang, P.; Joshi, N.; Wei, J.; Chowdhery, A.; et al. Language Models Can Teach Themselves to Program Better. arXiv 2023, arXiv:2305.12413.
10. Chen, X.; Sun, T.; Qiu, X.; Huang, X. A Survey on Multilingual Pre-trained Language Models. arXiv 2022, arXiv:2210.10271.
11. Zhao, J.; Lin, Z.; Lin, J.; Tan, M.; Yu, J. Generative Pre-trained Models for Text Generation: A Survey. arXiv 2022, arXiv:2204.08582.
12. Li, D.; Bu, K.; Wu, J.; Chang, B. Data Augmentation for Cross-lingual Transfer Learning of BERT-based Dependency Parsers. Trans. Assoc. Comput. Linguist. 2021, 9, 942–957.
13. Sánchez-Junquera, J.; Fernández-González, D.; Rodríguez-Fórtiz, M.J.; Montero, J.M.; Ordóñez, L. Generating Synthetic Data for Emotion Recognition in Conversation. arXiv 2021, arXiv:2104.08773.
14. Schuff, H.; Barnes, J.; Crossley, S.; McNamara, D.S.; Cai, Z. Culture and Emotion in Text: A Computational Analysis of Emotion Expression in English and Chinese. arXiv 2020, arXiv:2010.06435.
15. Acheampong, F.A.; Nunoo-Mensah, H.; Chen, W. Transformer models for text-based emotion detection: A review of BERT-based approaches. arXiv 2020, arXiv:2011.04071.
16. Blodgett, S.L.; Green, L.; O’Connor, B. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1119–1130.
17. Jurgens, D.; Lu, T.C. Incorporating geolocation into supervised learning models of user demographics. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 2359–2368.
18. Han, X.; Su, J.; Wan, X. Unsupervised Multi-Target Domain Adaptation: An Information-Theoretic Approach. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 3006–3016.
19. Wang, A.; Cho, K.; Lewis, M. Towards understanding the impact of artificial data on sequential language tasks. arXiv 2021, arXiv:2109.09193.
20. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; So, P.; Srinivasan, M.; Shinn, J.; Stoyanov, V. Synthetic data generation for natural language processing. arXiv 2022, arXiv:2204.08582.
21. Hovy, D.; Purschke, C. Social and regional variation in language processing and its implications for NLP. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 632–637.
22. Bamman, D.; Dyer, C.; Smith, N.A. Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 828–834.
23. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200.
24. Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. In Text Mining; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2010; pp. 1–20.
25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
26. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Li, J.; Zettlemoyer, L.; Schuster, M. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
27. Saravia, E.; Liu, H.; Huang, Y.; Wu, J.; Chen, Y. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3687–3697.
28. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
29. Wang, X.; Zhao, J.; Li, Y. Prompt-Based Learning for Text-Based Emotion Detection: A Survey. arXiv 2022, arXiv:2210.10271.
Figure 1. Methodological Framework for Evaluating the Impact of Region-Specific Synthetic Datasets on Emotion Detection.
Figure 2. Classification performance for BERT-based models in their different variations. Metrics are based on the weighted averages.
Figure 3. Classification performance for BART-based models in their different variations. Metrics are based on the weighted averages.
Figure 4. Type–token ratio (TTR) and hapax legomena scores for the synthetic baseline and region-specific datasets.
Figure 5. Top PMI sample pairs from different synthetic regional datasets. The usage of different emojis highlights the emotion-loaded content in the generated texts.
Figure 6. Top 10 keyphrases extracted by RAKE from each synthetic regional dataset. The usage of different emojis highlights the emotions in the generated texts.
Figure 7. Top 20 TF-IDF terms in the Regional General US dataset.
Figure 8. Top 20 TF-IDF terms in the Regional Detailed US dataset.
Figure 9. Top 20 TF-IDF terms in the Regional General UK dataset.
Figure 10. Top 20 TF-IDF terms in the Regional Detailed UK dataset.
Figure 11. Top 20 TF-IDF terms in the Regional General India dataset.
Figure 12. Top 20 TF-IDF terms in the Regional Detailed India dataset.
Table 1. Number of records generated for each emotion in the synthetic datasets.

Dataset              Anger   Fear   Joy    Love   Sadness   Surprise   Total
Baseline              471    514    504    517     505        489       3000
Regional General
  US                  155    178    169    163     159        176       1000
  UK                  173    148    157    169     176        177       1000
  IN                  162    155    158    184     169        172       1000
Regional Detailed
  US_2                150    174    169    162     171        174       1000
  UK_2                177    150    168    163     187        155       1000
  IN_2                160    160    168    169     160        183       1000
Table 2. Top-20 words by frequency for the generated synthetic datasets.

              Regional General                        Regional Detailed
Baseline      US           UK           IN           US_2         UK_2         IN_2
feeling       feeling      feeling      feeling      like         proper       feeling
love          love         absolutely   love         just         feeling      just
today         today        love         india        feeling      just         like
life          grateful     today        today        today        chuffed      yaar
absolutely    life         uk           absolutely   right        mate         today
grateful      believe      life         let          life         like         chai
believe       absolutely   grateful     sending      got          absolutely   time
spread        right        believe      believe      believe      believe      life
fear          joyful       just         grateful     vibes        right        believe
sadness       time         right        time         shook        today        bollywood
right         let          sending      life         good         gutted       totally
time          spread       fuming       stay         blessed      cuppa        movie
remember      world        bit          knows        feelin       time         hai
joyful        scared       people       staystrong   bummed       bits         world
just          stay         okay         times        deal         mates        kya
wow           just         chuffed      boundaries   living       innit        vibe
let           times        spooked      spread       best         ve           boss
okay          support      remember     support      ride         fuming       desi
spreadlove    staystrong   blimey       fear         time         life         shake
unexpected    way          throws       just         love         blimey       nailed
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
