Article

Enhancing Green Practice Detection in Social Media with Paraphrasing-Based Data Augmentation

Carbon Measurement Test Area in Tyumen’ Region (FEWZ-2024-0016), University of Tyumen, Tyumen 625003, Russia
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(4), 81; https://doi.org/10.3390/bdcc9040081
Submission received: 11 February 2025 / Revised: 27 March 2025 / Accepted: 28 March 2025 / Published: 31 March 2025

Abstract

Detecting mentions of green waste practices on social networks is a crucial tool for environmental monitoring and sustainability analytics. Social media serve as a valuable source of ecological information, enabling researchers to track trends, assess public engagement, and predict the spread of sustainable behaviors. Automatic extraction of mentions of green waste practices facilitates large-scale analysis, but the uneven distribution of such mentions presents a challenge for effective detection. To address this, data augmentation plays a key role in balancing class distribution in green practice detection tasks. In this study, we compared existing data augmentation techniques based on paraphrasing of the original texts. We evaluated the effectiveness of additional explanations in prompts, Chain-of-Thought prompting, synonym substitution, and text expansion. Experiments were conducted on the GreenRu dataset, which focuses on detecting mentions of green waste practices in Russian social media. Our results, obtained using two instruction-based large language models, demonstrated the effectiveness of Chain-of-Thought prompting for text augmentation. These findings contribute to advancing sustainability analytics by improving the automated detection and analysis of environmental discussions. Furthermore, the results of this study can be applied to other tasks that require augmentation of text data in the context of ecological research and beyond.

1. Introduction

Demand-side solutions have emerged as a pivotal area of focus in climate change research [1,2,3]. These solutions encompass a wide range of factors, including “consumers’ technology choices, behaviors, lifestyle changes, coupled with production-consumption infrastructures and systems, service provision strategies, and associated socio-technical transitions” [4]. According to certain studies, 72% of global greenhouse gas emissions originate from household consumption behavior [5]. In the Russian Federation, households are the primary contributors to the use of electricity, gas, food, clothing, transportation, and other goods and services [6]. Consequently, changes in consumer practices, the infrastructure of everyday activities, and the ways in which needs are met can lead to reductions in anthropogenic impacts on climate.
Changes in everyday practices are initiated through innovation [7] including grassroots initiatives where new consumption and production patterns are created and tested, e.g., separate waste collection, repair cafes, and the exchange of children’s toys [8]. These initiatives have the potential to be a significant instrument in achieving climate goals [9]. However, researchers have noted that the role of grassroots initiatives in shaping behavioral change in climate policy is often underestimated [10].
In the Russian Federation, systemic support for demand-side solutions for climate change mitigation is currently lacking. Existing grassroots environmental initiatives are not linked to other climate policy activities and are not included in behavioral change programs to reduce anthropogenic impacts on climate. To organize systemic support for behavioral change and the integration of grassroots initiatives into climate policy, it is necessary to understand the overall picture of changes on the demand side and the prevalence of environmental initiatives among the population. This study focuses on the task of detecting mentions of green waste practices in social media texts.
Social media serve as a valuable resource for understanding socially significant behaviors, including environmental practices. However, detecting mentions of green waste practices in social media texts presents several challenges. These mentions are often sparse, highly context-dependent, and expressed in diverse linguistic forms. Traditional text classification models struggle with data imbalance, where some types of green practices are mentioned far less frequently than others. As a result, many environmentally significant discussions remain undetected, limiting the ability of researchers and policymakers to gain a comprehensive understanding of sustainability trends.
To overcome these limitations, data augmentation techniques, particularly those based on paraphrasing, have been proposed as a potential solution. The primary aim of this study is to explore the effectiveness of data augmentation based on paraphrasing using instruction-based large language models (LLMs) to improve the performance of detecting mentions of green waste practices in social media. By systematically investigating different augmentation strategies, we aim to contribute to both natural language processing (NLP) research and environmental sustainability efforts. Currently, data augmentation methods based on paraphrasing demonstrate impressive results in various natural language processing tasks [11,12,13]. However, their potential for improving the identification of environmentally significant behaviors remains underexplored. This study applies these methods for the first time to the task of detecting mentions of green waste practices. We seek to answer the following research question:
  • What additional techniques, such as adding explanations, Chain-of-Thought (CoT) prompting, text expansion, and synonym replacement, could improve the effectiveness of paraphrasing-based data augmentation using instruction-based LLMs for the task of detecting mentions of green waste practices?
The practical and theoretical contributions of this paper can be summarized as follows:
  • This study contributes to NLP research by comparing various LLM-based data augmentation approaches. The experiments were conducted using two instruction-based LLMs and two BERT-based classification models. Our findings showed the effectiveness of CoT prompting for augmenting texts on a dataset of mentions of green waste practices.
  • The study enhances the automatic detection of mentions of green waste practices in Russian social media using paraphrasing-based data augmentation. By integrating NLP techniques into environmental research, this work attempts to bridge the gap between computational methods and sustainability analysis, demonstrating how AI-driven approaches can facilitate large-scale environmental impact assessments. The application of paraphrase-based data augmentation to environmental discourse, specifically for detecting mentions of green waste practices, is novel.
The paper is organized as follows. Section 2 includes a brief review of related work. In Section 3, we describe the dataset, data augmentation techniques, and the models used. The results are presented in Section 4. Section 5 contains discussion and describes the limitations of the study. Section 6 concludes this paper.

2. Related Work

The use of LLMs is currently widespread across different NLP applications. Thanks to pre-training on large volumes of data, including multilingual data, LLMs are employed to address text processing tasks in various languages (particularly in recent papers [14,15]). Data augmentation using LLMs shows impressive results in various NLP tasks [11,16]. Instruction-based LLMs enable the generation of semantically and grammatically coherent texts that can be utilized as synthetic data to enhance minority classes in imbalanced datasets.
Currently, there are three main approaches to data augmentation using instruction-based LLMs. The first approach is based on paraphrasing texts from the original dataset [11,12]. The second approach involves generating entirely new texts corresponding to a given category [16,17]. The third approach combines elements of the first two, such as paraphrasing texts while taking into account the category to which they belong [13,18]. A comparison of the approaches presented in [18] demonstrated that combining paraphrasing with explicit category indication is more effective than simple paraphrasing or generating new texts. Some studies have extended paraphrasing by expanding the original text or replacing words with synonyms [19,20]. A deep analysis of prompting techniques for data augmentation using LLMs is presented in [21]. The authors divide prompting techniques into single-step and multi-step approaches. In addition, they identify the role-based technique, which assigns specific roles or personas to the model, shaping its response style and content based on predefined characteristics. In [22], this technique was used for the emotional support conversation task. Other data augmentation techniques are the tuple technique [23], which structures prompts as tuples, and the template technique [24,25], which utilizes structured templates to guide the model’s responses according to a designed format or instruction.
Additional explanations in prompts and CoT prompting have been shown to significantly enhance the effectiveness of LLMs in various tasks. By incorporating detailed explanations or step-by-step reasoning, these techniques enable models to better interpret complex tasks and produce more accurate and coherent outputs. The authors of [26,27] showed that providing additional explanations in prompts helps align the model’s understanding of the task requirements with the desired outcome. CoT prompting has shown remarkable success in tasks that require logical reasoning, arithmetic problem solving, and multi-step decision making [28,29,30]. The authors of [31] applied CoT prompting to replace attributes in the original texts for data augmentation. The authors of [32] proposed an approach to data augmentation that involves three steps: generating CoTs, augmenting inputs with them, and fine-tuning a task-specific model on the CoT-augmented data.
An overview of recent work utilizing prompt-based data augmentation for text classification tasks is presented in Table 1.

3. Methods

This section describes the data used and the evaluated augmentation approaches. We also list the models used in this study, which include instruction-based LLMs for augmenting texts and classifiers for detecting mentions of green waste practices.

3.1. Dataset

This study employs the GreenRu (https://github.com/green-solutions-lab/GreenRu, accessed on 11 February 2025) dataset [40] to identify mentions of green waste practices in Russian social media texts. GreenRu comprises 1326 Russian-language posts with an average length of 880 characters, sourced from online green communities. The dataset has a sentence-level multi-label annotation for green waste practices, with the average sentence length being 110 characters.
Nine categories of green waste practices, defined in [41], were used for annotation. GreenRu is pre-split into training and test sets, with detailed characteristics provided in Table 2.

3.2. Data Augmentation

We applied several approaches to creating prompts for data augmentation. The text templates for the prompts used in Russian and English are presented in Table 3.
  • Rephrasing: paraphrasing of the original text with explicit indication of its topics, i.e., green waste practices.
  • Adding explanations: paraphrasing of the original text with a detailed explanation of its topics. The translations of the explanations into English are presented in Table 2.
  • CoT prompting: the use of a chain of thoughts to paraphrase the original text. In this work, the chain of thoughts included the text’s domain of application (social media) and a step-by-step explanation of its topics.
  • Expanding: paraphrasing and expanding the original text by specifying the addition of more details.
  • Replacing by synonyms: paraphrasing the original text by replacing key words with synonyms.
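The five prompting strategies above can be sketched as a single prompt builder. This is a minimal illustration: the English wording below is an assumption for readability, and the actual Russian and English templates are the ones given in Table 3.

```python
def build_prompt(strategy, text, practices, explanations=None):
    """Assemble an augmentation prompt for one training sentence.
    `practices` are the green waste practices mentioned in the text;
    `explanations` maps each practice to a short description (used by
    the explanation and CoT strategies)."""
    topics = ", ".join(practices)
    if strategy == "rephrasing":
        return f"Paraphrase the following post about {topics}: {text}"
    if strategy == "explanations":
        notes = "; ".join(f"{p} means {explanations[p]}" for p in practices)
        return f"Paraphrase the following post about {topics} ({notes}): {text}"
    if strategy == "cot":
        # The chain of thought states the domain (social media) and walks
        # through the topics step by step before asking for a paraphrase.
        steps = " ".join(
            f"Step: it mentions {p}, that is, {explanations[p]}." for p in practices
        )
        return (
            f"The text below is a social media post. {steps} "
            f"Now paraphrase it, preserving these topics: {text}"
        )
    if strategy == "expanding":
        return f"Paraphrase and expand the following post about {topics}, adding more details: {text}"
    if strategy == "synonyms":
        return f"Paraphrase the following post about {topics}, replacing key words with synonyms: {text}"
    raise ValueError(f"unknown strategy: {strategy}")
```
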
We compared the results of prompt-based data augmentation with the following baselines:
  • Random Duplication: In this approach, no new samples were generated for the original sentences. Instead, random sentences from the training set were duplicated without any modifications.
  • Back Translation: This method involves translating phrases back and forth between two languages. We employed the BackTranslation library (https://pypi.org/project/BackTranslation, accessed on 11 February 2025), which utilizes Google Translate, with English as the target language.
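The back-translation baseline amounts to a round trip through a pivot language. A minimal sketch follows, with the translation backend injected as a callable; the paper uses the BackTranslation library (Google Translate) with English as the pivot, and the `(text, source, target)` signature here is a hypothetical stand-in for whichever backend is plugged in.

```python
def back_translate(text, translate, src="ru", pivot="en"):
    """Round-trip a sentence through a pivot language.

    `translate` is any callable (text, source, target) -> text.
    The round trip introduces small lexical and syntactic variations
    while keeping the meaning of the original sentence."""
    pivot_text = translate(text, src, pivot)   # ru -> en
    return translate(pivot_text, pivot, src)   # en -> ru
```
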
For all data augmentation approaches considered, the following augmentation strategy was applied. As shown by the statistics presented in Table 2, the distribution of green waste practice mentions in the GreenRu dataset is uneven. In particular, some practices (namely, Studying the product labeling, Signing petitions, Sharing, and Repairing) account for less than 5% of the training set. These minority practices were selected for data augmentation.
The data augmentation process was carried out as follows. An entry corresponding to one of the minority practices was randomly selected from the original training set. Then, an augmentation method was applied to this entry, and the resulting text was added to the set of synthetic texts. This process continued until the size of the synthetic text set reached half of the original training set. The synthetic texts were then added to the training set, and the augmented dataset was randomly shuffled.
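The sampling loop described above can be sketched as follows. The entry format and the `augment` callable are illustrative assumptions, standing in for the actual dataset records and for any of the LLM prompting strategies.

```python
import random

# Minority practices selected for augmentation (each under 5% of the training set).
MINORITY = {"Studying the product labeling", "Signing petitions", "Sharing", "Repairing"}

def augment_training_set(train, augment, seed=0):
    """Grow the training set with synthetic paraphrases of minority-practice
    entries until the synthetic set reaches half the size of the original
    training set, then shuffle the combined data."""
    rng = random.Random(seed)
    minority_pool = [ex for ex in train if MINORITY & set(ex["labels"])]
    synthetic = []
    while len(synthetic) < len(train) // 2:
        entry = rng.choice(minority_pool)  # random minority-practice entry
        synthetic.append({"text": augment(entry["text"]), "labels": entry["labels"]})
    augmented = train + synthetic
    rng.shuffle(augmented)
    return augmented
```

With the 2442-instance original training set, this procedure yields 1221 synthetic texts, consistent with the 3663-instance augmented set reported below.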
Since GreenRu has a multi-label annotation, the augmentation process increased not only the number of mentions of minority green waste practices but also the number of mentions of more common practices. The original training set contained 2442 instances, while the augmented dataset consisted of 3663 instances. The distribution of green practice mentions in the original and augmented training sets is shown in Figure 1 and Figure 2.

3.3. Instruction-Based Models

For our experiments, we adopted two publicly available LLMs: T-lite, an instruction-based model pre-trained in a monolingual (Russian) manner, and Llama, a multilingual instruction-based model.
For the T-lite model, we used prompts in Russian. For Llama, since it is a multilingual model, we used English-language prompts, green waste practice names, and explanations; the original text was fed into the model in Russian.
The selection of these models is based on their availability as open-source tools. Additionally, as highlighted in the literature review (see Table 1), these models have been successfully applied for text augmentation in languages other than English [18,37].

3.4. Classification Models

The following models [43] were used for text classification:
  • ruBERT-base (ruBERT) (https://huggingface.co/ai-forever/ruBert-base, accessed on 11 February 2025), the adaptation of the BERT architecture [44] for Russian. The model was pre-trained on a vast collection of Russian texts from various publicly available sources, covering a wide range of domains.
  • ruELECTRA-large (ruELECTRA) (https://huggingface.co/ai-forever/ruElectra-large, accessed on 11 February 2025), a model based on the ELECTRA architecture [45]. The same data used for ruBERT was used for pretraining the model.
Both models were fine-tuned for five epochs with a maximum sequence length of 256 tokens and a learning rate of 4 × 10⁻⁵. Both models were fine-tuned in a multi-label manner, with the target represented as a list of n unique binary labels. The multi-label text classification model is based on a transformer architecture: a transformer model with a classification layer on top, where the classification layer contains n output neurons, each corresponding to a specific label. The models were implemented using the Simpletransformers (https://simpletransformers.ai/, accessed on 11 February 2025) framework. The multi-label F1-score was used as the evaluation metric; it is calculated as the average of the F1-scores for each class label.
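The evaluation metric can be written out explicitly. Below is a minimal, dependency-free sketch of the multi-label F1-score as the unweighted average of per-label binary F1-scores (equivalent in spirit to a macro-averaged F1 over the n binary labels).

```python
def f1_per_label(y_true, y_pred):
    """Binary F1 for one label, given parallel 0/1 lists over samples."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def multilabel_f1(Y_true, Y_pred):
    """Average of per-label F1 over the n labels; rows are samples,
    columns are the n binary labels."""
    n = len(Y_true[0])
    return sum(
        f1_per_label([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n)
    ) / n
```
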

4. Results

The results are presented in Table 4. The best results for each pair of instruction-based LLM and classifier are highlighted in bold. The results that exceeded both baselines are marked with the ↑ sign. The metrics obtained for individual green practices are presented in Table A1. All considered augmentation methods improved performance relative to the model trained only on the original data. For the ruBERT model, most approaches also outperformed both baselines. Overall, ruBERT showed more substantial improvements, as this model initially demonstrated low performance for minority green waste practices when using the original data.
In most cases, the relative effectiveness of the strategies varies depending on the selected instruction-based model and classifier. However, the highest performance gains in each case were achieved using CoT prompting. Performance improvements with CoT prompting amounted to 13.78 (T-lite) and 12.43 (Llama) percentage points for ruBERT and 3.03 (T-lite) and 3.73 (Llama) percentage points for ruELECTRA. We conducted a bootstrap analysis of the results obtained using CoT and standard rephrasing. The Wilcoxon signed-rank test revealed a statistically significant difference between the performance of all four models fine-tuned on data augmented using rephrasing and CoT (p < 0.05, number of bootstrap samples = 1000), suggesting that CoT prompting outperforms rephrasing-based data augmentation in terms of classification performance.
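A paired bootstrap of this kind can be sketched with the standard library alone. The `metric` callable is an assumption standing in for the multi-label F1-score; the two paired score lists produced here are the kind of input that would then be compared with a significance test such as scipy.stats.wilcoxon, as done in the paper.

```python
import random

def paired_bootstrap(y_true, pred_a, pred_b, metric, n_samples=1000, seed=0):
    """Score two systems on bootstrap resamples of the same test set.
    Returns the paired score lists plus the fraction of resamples on
    which system A outscores system B."""
    rng = random.Random(seed)
    n = len(y_true)
    scores_a, scores_b = [], []
    for _ in range(n_samples):
        ids = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        t = [y_true[i] for i in ids]
        scores_a.append(metric(t, [pred_a[i] for i in ids]))
        scores_b.append(metric(t, [pred_b[i] for i in ids]))
    wins = sum(a > b for a, b in zip(scores_a, scores_b)) / n_samples
    return scores_a, scores_b, wins
```
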

5. Discussion

The results presented in Table 4 show that all the models fine-tuned on the augmented datasets outperformed the models fine-tuned only on the original data. This highlights the importance of data augmentation for datasets with a high degree of category imbalance. For both instruction-based models, the highest results were obtained using CoT prompting. The advantage of CoT prompting over other methods is more pronounced when using the Llama model. The results obtained with the T-lite model are generally more homogeneous. The performance of ruELECTRA with T-lite augmentation varies from 73.52% to 75.16%, and the results for ruBERT range from 72.04% to 78.42%. When using Llama, ruELECTRA demonstrates performance ranging from 72.89% to 75.53%, while ruBERT ranges from 66.74% to 77.07%.
Table A1, which shows the full classification results for each type of green waste practice, demonstrates that ruELECTRA generally performs better at detecting rarely represented practices. Specifically, when using the original data, random duplication, or expanding, ruBERT fails to recognize texts containing mentions of the rarest practice in our dataset, i.e., repairing. In contrast, ruELECTRA shows a minimum performance of 66.67% for this practice, which is significantly higher than the worst result of ruBERT.
To evaluate the similarity between the texts generated using LLMs and the original texts, we assessed the semantic textual similarity using the paraphrase-distiluse-base-multilingual-cased-v1 model and the Sentence Transformers library [46] (Figure 3). The results obtained through rephrasing and adding explanations show very similar values in terms of semantic similarity. The values for the expanding strategy differ across instruction-based models. Specifically, for T-lite, this strategy shows the least similarity with the original texts. The results of CoT prompting show relatively low similarity with the original texts (the lowest result for Llama and the second-lowest for T-lite). The highest similarity score is demonstrated by the synonym replacement strategy, which provides the least additional information. The scores of semantic similarity are provided in Table 5.
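For reference, the similarity score in question is the cosine similarity between sentence embeddings. The sketch below implements the metric itself; the docstring notes how the embedding vectors would be obtained with the Sentence Transformers model named above (the variable names are illustrative).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors.
    In the paper, the vectors come from the
    paraphrase-distiluse-base-multilingual-cased-v1 model, e.g.:
        model = SentenceTransformer("paraphrase-distiluse-base-multilingual-cased-v1")
        u, v = model.encode([original_text, generated_text])
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```
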
Examples of generated texts are presented in Table A2 and Table A3. Overall, adding explanations to the initial prompt typically results in the generation of texts that explicitly mention the names of green practices and often include explanatory fragments (for example, “I suggest organizing an exchange: someone can give away unwanted books and, in return, receive those they have long wanted to read” (Hereinafter translated from Russian.)). The texts generated using CoT also frequently contain explanatory fragments. However, unlike adding explanations, these texts tend to differ more from the original ones while still preserving their overall theme. Expanding often introduces introductory sentences (e.g., “In our community, we are excited to announce a new initiative aimed at environmental conservation and supporting repair professionals!” or “We need to share this wonderful news!”) and synonym lists (e.g., replacing “household appliances” with “refrigerators, washing machines, microwaves, and other devices”). The texts generated using synonym replacement exhibit the least semantic difference from the original texts. They are generally shorter than those produced by the three previous augmentation techniques.

Limitations

We note the following limitations of the current study. From a data perspective, our study was conducted on a single dataset. This is because, to the best of our knowledge, GreenRu is the only dataset available for detecting mentions of green practices. Therefore, the results of this study are potentially limited by dataset bias and by the fact that it contains only texts in Russian. Social media data may also introduce bias, as digital engagement can differ across demographic groups. Specifically, the frequency of green practice mentions may vary across factors such as age group.
The use of LLMs for data augmentation raises concerns regarding biases in generated text. We acknowledge that augmented data might reflect existing biases in pre-trained models, potentially influencing classification results. Future work should explore techniques for bias mitigation and assess the ethical implications of using synthetic data in decision-making processes.
Additionally, the study is limited by the use of a single data augmentation strategy. Since the dataset is multi-label, balancing the number of examples in it presents a challenging task, which could be the subject of further research. Regarding the models used, the study utilized two instruction-based LLMs (T-lite and Llama), which may introduce biases based on their pre-training data and architectures. Finally, in this work, we used a relatively simple prompt structure for an initial exploration of the potential of paraphrasing-based data augmentation for this task.

6. Conclusions

In this study, we explored the effectiveness of data augmentation techniques based on paraphrasing using instruction-based LLMs for detecting mentions of green waste practices in social media. Given the uneven distribution of green waste practices, data augmentation plays a crucial role in addressing dataset imbalance and improving model performance.
Our experiments, conducted on a dataset of green waste practice mentions, demonstrated that CoT prompting is particularly effective for augmenting texts. Compared to other augmentation strategies, such as synonym replacement, text expansion, and explanation-based prompting, CoT showed a higher improvement in classification performance, suggesting that guiding LLMs through structured reasoning can enhance the diversity and quality of generated text.
These findings contribute to the broader field of sustainability analytics by enhancing the ability to automatically detect and analyze discussions on environmental topics. Furthermore, the proposed approach can be applied to other NLP tasks that require text augmentation, particularly in domains where social media monitoring is used for policy-making and public engagement analysis.

Further Research

We foresee several directions for further research:
  • Expanding dataset. Future research should focus on expanding the dataset to include a more diverse range of texts, potentially incorporating data from different languages and regions to improve model generalization.
  • Bias mitigation. Further research should explore existing datasets for detecting green waste practices and data augmentation techniques for bias mitigation.
  • Exploring additional LLMs. Testing the effectiveness of different instruction-based LLMs could provide insights into optimizing text augmentation for better classification performance.
  • Integrating multimodal data. Incorporating multimodal data sources, such as images and videos from social media, could improve the detection of green waste practices beyond textual analysis.
  • Pre-training LLMs for domain-specific tasks. The LLMs pre-trained on ecology-related data might lead to better results in text classification and data augmentation tasks.
  • Multilingual data augmentation. Experimenting with multilingual data augmentation, where texts are paraphrased across multiple languages, can improve the robustness and applicability of the models.

Author Contributions

Conceptualization, methodology, project administration, and writing—original draft preparation, A.G.; writing—original draft preparation and methodology, O.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Ministry of Science and Higher Education of the Russian Federation within the framework of the Carbon Measurement Test Area in Tyumen’ Region (FEWZ-2024-0016).

Institutional Review Board Statement

Ethical review and approval were waived for this study because no personal data were used.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used in this study are freely available at https://github.com/green-solutions-lab/GreenRu (accessed on 11 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Full results (F1-score, %). The correspondence between green waste practices and their indices is given in Table 2.
Training Data | Avg F1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
ruBERT
Original data | 64.64 | 88.5 | 64.71 | 58.56 | 70.83 | 70.83 | 84.91 | 63.72 | 79.73 | 0
Random duplication | 67.24 | 87.87 | 68.57 | 63.67 | 85.19 | 73.1 | 83.02 | 63.72 | 80 | 0
Back translation | 71.96 | 88.25 | 64.86 | 60.98 | 85.71 | 69.93 | 84.11 | 64.29 | 79.53 | 50
T-lite
Rephrasing | 76.33 | 88.99 | 61.54 | 59.74 | 91.23 | 70.2 | 85.98 | 67.83 | 81.5 | 80
Adding explanations | 77.13 | 89.14 | 70.27 | 63.16 | 87.27 | 72.96 | 86.79 | 64.81 | 79.81 | 80
Chain-of-Thought | 78.42 | 88.45 | 72.22 | 62.95 | 91.23 | 70.83 | 85.71 | 69.57 | 79.07 | 85.71
Expanding | 76.23 | 88.52 | 66.67 | 61.18 | 87.27 | 72.73 | 84.31 | 66.06 | 79.33 | 80
Replacing by synonyms | 72.04 | 87.31 | 64.86 | 59.92 | 89.29 | 72.11 | 84.91 | 62.5 | 77.49 | 50
Llama
Rephrasing | 71.86 | 88.99 | 58.82 | 59.17 | 88.14 | 72.97 | 84.62 | 63.55 | 80.45 | 50
Adding explanations | 74.41 | 87.59 | 68.57 | 61.34 | 85.71 | 71.24 | 85.71 | 63.25 | 79.63 | 66.67
Chain-of-Thought | 77.07 | 88.77 | 64.86 | 63.75 | 89.29 | 70.37 | 85.98 | 70.49 | 80.09 | 80
Expanding | 66.74 | 88.63 | 63.16 | 62.66 | 87.27 | 69.39 | 85.44 | 64.91 | 79.16 | 0
Replacing by synonyms | 71.63 | 88.32 | 64.71 | 60.63 | 91.23 | 69.23 | 84.31 | 64.76 | 81.52 | 40
ruELECTRA
Original data | 71.8 | 88.15 | 68.57 | 62.93 | 70.83 | 70.13 | 82.35 | 60.38 | 76.15 | 66.67
Random duplication | 72.31 | 87.69 | 57.14 | 60.94 | 70.59 | 69.8 | 81.9 | 66.67 | 76.06 | 80
Back translation | 74.27 | 87.2 | 64.71 | 60.79 | 78.43 | 69.01 | 83.64 | 67.74 | 76.92 | 80
T-lite
Rephrasing | 74.1 | 88.04 | 68.57 | 60.83 | 79.25 | 66.22 | 80.73 | 66.09 | 77.21 | 80
Adding explanations | 74.07 | 87.71 | 64.86 | 60.87 | 78.43 | 68.09 | 80.37 | 62.9 | 77.67 | 85.71
Chain-of-Thought | 75.16 | 88.56 | 63.16 | 64.63 | 78.43 | 71.72 | 85.44 | 62.9 | 75.89 | 85.71
Expanding | 73.52 | 88.44 | 64.71 | 58.93 | 73.08 | 65.25 | 83.02 | 66.09 | 76.44 | 85.71
Replacing by synonyms | 74.29 | 87.49 | 62.86 | 62.45 | 76.92 | 71.05 | 88.24 | 61.95 | 77.67 | 80
Llama
Rephrasing | 73.57 | 88.27 | 55.56 | 62.13 | 78.43 | 69.93 | 84.11 | 65 | 76.78 | 80
Adding explanations | 73.89 | 88.4 | 66.67 | 61.47 | 76 | 69.74 | 84.4 | 64.86 | 78.5 | 75
Chain-of-Thought | 75.53 | 88.11 | 64.71 | 60.61 | 87.72 | 74.03 | 83.02 | 59.81 | 76.06 | 85.71
Expanding | 72.89 | 88.54 | 62.86 | 60.16 | 79.25 | 72.61 | 84.91 | 64.35 | 76.67 | 66.67
Replacing by synonyms | 74.49 | 87.43 | 62.5 | 65.57 | 76 | 71.72 | 84.91 | 65.45 | 76.78 | 80
Table A2. Examples of the texts generated by T-lite.
  • Original texts. Example 1: “Old household appliances will bring life together with fans of minor repairs and the soldering iron.” Example 2: “Shall we all go looking for good books and give them to each other?”
  • Original labels. Example 1: Repairing. Example 2: Sharing; Participating in actions to promote responsible consumption.
  • Rephrasing. Example 1: “In our community, we unite those who value the craft of minor repair and the art of soldering. Together we give old household appliances a second life, turning them into a source of inspiration and useful skills. Join us to learn and share experience in turning unneeded things into unique artifacts!” Example 2: “Let’s come together and exchange useful books so that everyone can find something interesting! Together we can build a real library where everyone will discover something new and share their finds. After all, exchanging knowledge and inspiration is a wonderful way to support each other and our community.”
  • Adding explanations. Example 1: “Restoring old household appliances is not only a way to extend their life but also an opportunity to join forces with enthusiasts who value craftsmanship and the soldering iron. Instead of throwing appliances away, let’s give them a second life by building communities and sharing experience in minor repair and restoration.” Example 2: “Let’s come together and exchange useful things, such as books! I suggest organizing an exchange: someone can give away unwanted books and, in return, receive those they have long wanted to read. This is a great way not only to enrich your library but also to share knowledge with others. And why not try shared use of books? Perhaps some of us could offer access to their collections so that everyone can enjoy their favorite works.”
  • Chain-of-Thought. Example 1: “Restoration and a second life for old appliances: an idea for those who love to tinker and care about nature! Instead of throwing away old household appliances, let’s try to breathe new life into them!” Example 2: “#EcoCommunity #Exchange #SharedUse Friends, let’s join forces and create an environmentally friendly initiative! I suggest exchanging your favorite books that have already been read but are still in good condition. Instead of throwing them away, we can share literature, thereby reducing consumption and preserving nature.”
  • Expanding. Example 1: “In our community, we are excited to announce a new initiative aimed at environmental conservation and supporting repair professionals! Instead of throwing away old household appliances, let’s give them a second chance. Here is how it works: you can bring your old refrigerators, washing machines, microwaves, and other devices to specially organized collection points.” Example 2: “Together for books and good deeds! Here is my suggestion: let’s come together and create a real ecological club where everyone can find and share their treasures, books! Instead of throwing away old books, let’s collect them and organize exchange points.”
  • Replacing by synonyms. Example 1: “Restored old appliances will become a connecting link for those who are passionate about minor repair work and the art of soldering.” Example 2: “Let’s all set out together in search of wonderful books and exchange them so that everyone can enjoy new knowledge and stories!”
Table A3. Examples of the texts generated by Llama.
Approach | Example 1 | Example 2
Original texts | Старая бытовая техника свяжет жизнь с поклонниками мелкого ремонта и паяльника | Пойдемте все искать хорошие книжки и дарить их друг другу?
Original labels | Repairing | Sharing, Participating in actions to promote responsible consumption
Rephrasing | У нас есть opportunity для решения проблемы устаревших бытовых техники! Мы хотим видеть, как вы и ваше сообщество решают проблему устаревших приборов. Важно поддерживать и улучшать нашу окружающую среду, и это может быть началом для изменений. Мы хотим узнать, как вы решаете проблемы устаревших бытовых техники и как вы хотите, чтобы мы могли помочь. #rеpaired #sustainable #environmentalism | Помним, что книгами мы можем облегчить жизнь друг другу и помочь окружающей среде. Мы можем обменять книги, чтобы решить проблему устойчивого обслуживания. Книги можно обменять с другими, чтобы получить книги, которые не будут использованы. Это может быть способ сделать жизнь более полезной для всех, а не только для тех, кто может их купить. Нам нужно помнить о важности сохранения книги и их значения. Мы можем сделать выбор, чтобы выбрать книги, которые не будут использованы
Adding explanations | Нам нужно больше поддержки от нашего сообщества для того, чтобы мы могли продолжать сэкономить и ремонтировать старую технику. Мы хотим, чтобы все наше члены, которые любят мелкий ремонт паяльников и бытовой техники, знали, что их вклад имеет значение. | Кто хочет обменяться прочитанными книгами с другими любителями чтения? Давайте делиться любимыми произведениями и открывать для себя новые! Мы можем не только обмениваться книгами, но и обсуждать любимые истории, знакомиться с новыми авторами и жанрами. Не обязательно покупать или хранить книги просто потому, что они «достаточно хороши» — давайте распространять любовь к чтению и делиться радостью!
Chain-of-Thought | Присоединяйтесь к нашему сообществу, чтобы вместе праздновать искусство ремонта и восстановления! Мы верим, что починка старых и изношенных вещей может вдохнуть новую жизнь в наши дома, сообщества и окружающую среду. Давайте вместе возрождать любимую технику, лампы и многое другое. Объединим усилия, чтобы сократить количество отходов и создать более устойчивое будущее! | Чтобы поддержать окружающую среду и снизить потребность в ненужных вещах, мы предлагаем обмен: дайте друг другу хорошие книги, а затем получите их обратно! Это может быть простой способ поддерживать экологическую среду и снизить потребность в ненужных предметах.
Expanding | Нам нужно поделиться этой замечательной новостью! Сотни человек, которые любят не только утилитарные решения, но и более интуитивные и эмоциональные, теперь могут finally найти решение для своих проблем с бытовой техникой. Нам сообщили, что в нашем районе есть старая бытовая техника, которая может быть сэкономительством и экологически чистой. Это не только решение для тех, кто любит не только утилитарные решения, но и более интуитивные и эмоциональные. | Оставайтесь в нашем сообществе и позвольте нам обменивать книги! Мы знаем, что каждый из нас имеет отличный взгляд на мир, и мы хотим, чтобы вы также были частью этого сообщества. Мы хотим, чтобы вы знали, что мы ищем книги, которые могут помочь нам с решениями окружающей среды и социальной справедливости. Мы хотим, чтобы вы также знали, что мы ищем книги, которые могут помочь нам понять и решить проблемы, которые нас окружают.
Replacing by synonyms | Обновление старой бытовой техники объединит энтузиастов мелкого ремонта и газовых плит. | Давайте искать отличные книги и обмениваться ими друг с другом!

References

  1. Creutzig, F.; Roy, J.; Lamb, W.F.; Azevedo, I.M.; Bruine de Bruin, W.; Dalkmann, H.; Edelenbosch, O.Y.; Geels, F.W.; Grubler, A.; Hepburn, C.; et al. Towards demand-side solutions for mitigating climate change. Nat. Clim. Chang. 2018, 8, 260–263. [Google Scholar]
  2. Dubois, G.; Sovacool, B.; Aall, C.; Nilsson, M.; Barbier, C.; Herrmann, A.; Bruyère, S.; Andersson, C.; Skold, B.; Nadaud, F.; et al. It starts at home? Climate policies targeting household consumption and behavioral decisions are key to low-carbon futures. Energy Res. Soc. Sci. 2019, 52, 144–158. [Google Scholar]
  3. Spurling, N.; McMeekin, A.; Shove, E.; Southerton, D.; Welch, D. Interventions in Practice: Re-Framing Policy Approaches to Consumer Behaviour. 2013. Available online: https://research.manchester.ac.uk/en/publications/interventions-in-practice-re-framing-policy-approaches-to-consume (accessed on 11 February 2025).
  4. Creutzig, F.; Roy, J.; Devine-Wright, P.; Díaz-José, J.; Geels, F.; Grubler, A.; Maïzi, N.; Masanet, E.; Mulugetta, Y.; Onyige-Ebeniro, C.; et al. Demand, Services and Social Aspects of Mitigation; Technical Report; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  5. Hertwich, E.G.; Peters, G.P. Carbon footprint of nations: A global, trade-linked analysis. Environ. Sci. Technol. 2009, 43, 6414–6420. [Google Scholar]
  6. Boev, P.A.; Burenko, D.L. (Eds.) Ecological Footprint of the Subjects of the Russian Federation—2016; WWF Russia: Moscow, Russia, 2016; p. 112. [Google Scholar]
  7. Hui, A.; Schatzki, T.; Shove, E. (Eds.) The Nexus of Practices: Connections, Constellations, Practitioners; Routledge: London, UK, 2017. [Google Scholar]
  8. Zakharova, O.; Glazkova, A. Green Waste Practices as Climate Adaptation and Mitigation Actions: Grassroots Initiatives in Russia. BRICS Law J. 2024, 11, 145–167. [Google Scholar] [CrossRef]
  9. van Lunenburg, M.; Geuijen, K.; Meijer, A. How and why do social and sustainable initiatives scale? A systematic review of the literature on social entrepreneurship and grassroots innovation. Volunt. Int. J. Volunt. Nonprofit Organ. 2020, 31, 1013–1024. [Google Scholar]
  10. Schmid, B. Hybrid infrastructures: The role of strategy and compromise in grassroot governance. Environ. Policy Gov. 2021, 31, 199–210. [Google Scholar]
  11. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Zeng, F.; Liu, W.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. IEEE Trans. Big Data 2025, 1–12. [Google Scholar]
  12. Sarker, S.; Qian, L.; Dong, X. Medical data augmentation via ChatGPT: A case study on medication identification and medication event classification. arXiv 2023, arXiv:2306.07297. [Google Scholar]
  13. Woźniak, S.; Kocoń, J. From Big to Small Without Losing It All: Text Augmentation with ChatGPT for Efficient Sentiment Analysis. In Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), Shanghai, China, 1–4 December 2023; pp. 799–808. [Google Scholar]
  14. Chen, W.; Qiu, P.; Cauteruccio, F. MedNER: A Service-Oriented Framework for Chinese Medical Named-Entity Recognition with Real-World Application. Big Data Cogn. Comput. 2024, 8, 86. [Google Scholar] [CrossRef]
  15. Pires, H.; Paucar, L.; Carvalho, J.P. DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain. Big Data Cogn. Comput. 2025, 9, 51. [Google Scholar] [CrossRef]
  16. Piedboeuf, F.; Langlais, P. Is ChatGPT the ultimate Data Augmentation Algorithm? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 15606–15615. [Google Scholar]
  17. Zhao, H.; Chen, H.; Ruggles, T.A.; Feng, Y.; Singh, D.; Yoon, H.J. Improving Text Classification with Large Language Model-Based Data Augmentation. Electronics 2024, 13, 2535. [Google Scholar] [CrossRef]
  18. Glazkova, A.; Zakharova, O. Evaluating LLM Prompts for Data Augmentation in Multi-Label Classification of Ecological Texts. In Proceedings of the 2024 Ivannikov Ispras Open Conference (ISPRAS), Moscow, Russia, 11–12 December 2024; pp. 1–7. [Google Scholar]
  19. Li, Y.; Ding, K.; Wang, J.; Lee, K. Empowering Large Language Models for Textual Data Augmentation. arXiv 2024, arXiv:2404.17642. [Google Scholar]
  20. Xu, L.; Xie, H.; Qin, S.J.; Wang, F.L.; Tao, X. Exploring ChatGPT-Based Augmentation Strategies for Contrastive Aspect-Based Sentiment Analysis. IEEE Intell. Syst. 2025, 40, 69–76. [Google Scholar] [CrossRef]
  21. Chai, Y.; Xie, H.; Qin, J.S. Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities. arXiv 2025, arXiv:2501.18845. [Google Scholar]
  22. Zheng, C.; Sabour, S.; Wen, J.; Zhang, Z.; Huang, M. AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL, Toronto, ON, Canada, 9–14 July 2023; pp. 1552–1568. [Google Scholar]
  23. Yoo, K.M.; Park, D.; Kang, J.; Lee, S.W.; Park, W. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2225–2239. [Google Scholar]
  24. Sahu, G.; Vechtomova, O.; Bahdanau, D.; Laradji, I. PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 5316–5327. [Google Scholar]
  25. Honovich, O.; Scialom, T.; Levy, O.; Schick, T. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 14409–14428. [Google Scholar]
  26. Krishna, S.; Ma, J.; Slack, D.; Ghandeharioun, A.; Singh, S.; Lakkaraju, H. Post hoc explanations of language models can improve language models. Adv. Neural Inf. Process. Syst. 2024, 36, 65468–65483. [Google Scholar]
  27. Ye, X.; Iyer, S.; Celikyilmaz, A.; Stoyanov, V.; Durrett, G.; Pasunuru, R. Complementary Explanations for Effective In-Context Learning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 4469–4484. [Google Scholar]
  28. Cheng, X.; Li, J.; Zhao, W.X.; Wen, J.R. ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 2969–2983. [Google Scholar]
  29. Tan, J.T. Causal abstraction for chain-of-thought reasoning in arithmetic word problems. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Singapore, 7 December 2023; pp. 155–168. [Google Scholar]
  30. Zhao, X.; Li, M.; Lu, W.; Weber, C.; Lee, J.H.; Chu, K.; Wermter, S. Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 6144–6166. [Google Scholar]
  31. Peng, L.; Zhang, Y.; Shang, J. Controllable data augmentation for few-shot text mining with chain-of-thought attribute manipulation. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 1–16. [Google Scholar]
  32. Wu, D.; Zhang, J.; Huang, X. Chain of Thought Prompting Elicits Knowledge Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 6519–6534. [Google Scholar]
  33. Li, D.; Li, Y.; Mekala, D.; Li, S.; Wang, X.; Hogan, W.; Shang, J. DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase. arXiv 2023, arXiv:2311.03319. [Google Scholar]
  34. Ubani, S.; Polat, S.O.; Nielsen, R. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv 2023, arXiv:2304.14334. [Google Scholar]
  35. Cohen, S.; Presil, D.; Katz, O.; Arbili, O.; Messica, S.; Rokach, L. Enhancing social network hate detection using back translation and GPT-3 augmentations during training and test-time. Inf. Fusion 2023, 99, 101887. [Google Scholar] [CrossRef]
  36. Shushkevich, E.; Alexandrov, M.; Cardiff, J. Improving multiclass classification of fake news using BERT-based models and ChatGPT-augmented data. Inventions 2023, 8, 112. [Google Scholar] [CrossRef]
  37. Møller, A.G.; Pera, A.; Dalsgaard, J.; Aiello, L. The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), St. Julians, Malta, 17–22 March 2024; pp. 179–192. [Google Scholar]
  38. Yandrapati, P.B.; Eswari, R. Data augmentation using instruction-tuned models improves emotion analysis in tweets. Soc. Netw. Anal. Min. 2024, 14, 149. [Google Scholar] [CrossRef]
  39. Latif, A.; Kim, J. Evaluation and Analysis of Large Language Models for Clinical Text Augmentation and Generation. IEEE Access 2024, 12, 48987–48996. [Google Scholar]
  40. Zakharova, O.; Glazkova, A. GreenRu: A Russian Dataset for Detecting Mentions of Green Practices in Social Media Posts. Appl. Sci. 2024, 14, 4466. [Google Scholar] [CrossRef]
  41. Zakharova, O.V.; Glazkova, A.V.; Pupysheva, I.N.; Kuznetsova, N.V. The Importance of Green Practices to Reduce Consumption. Chang. Soc. Personal. 2022, 6, 884–905. [Google Scholar]
  42. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
  43. Zmitrovich, D.; Abramov, A.; Kalmykov, A.; Kadulin, V.; Tikhonova, M.; Taktasheva, E.; Astafurov, D.; Baushenko, M.; Snegirev, A.; Shavrina, T.; et al. A Family of Pretrained Transformer Language Models for Russian. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 507–524. [Google Scholar]
  44. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  45. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  46. Reimers, N.; Gurevych, I. Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online, 16–20 November 2020. [Google Scholar]
Figure 1. The distribution of green waste practice mentions for the original training set.
Figure 2. The distribution of green waste practice mentions for the augmented training set.
Figure 3. Semantic similarity between generated data and original texts, %.
Table 1. Overview of recent research utilizing prompt-based data augmentation to enhance text classification performance.
Paper | Classification Task | Domain | Model | Language | Prompting Strategy
Year of publication: 2021
[23] | Multiclass | Multiple (7 datasets) | GPT-3 | English | Generating new samples using given categories and the examples from the original dataset
Year of publication: 2023
[12] | Multiclass | Medical | ChatGPT | English | Paraphrasing
[33] | Multiclass, binary | Multiple (3 datasets) | ChatGPT | English | Generating
[34] | Multiclass, binary | Multiple (8 datasets) | ChatGPT, PaLM | English | Paraphrasing followed by LLM-assisted evaluation
[24] | Multiclass, binary | Multiple (4 datasets) | GPT-3.5 | English | Generating new samples based on two classes and adding class descriptions
[16] | Multiclass, binary | Multiple (5 datasets) | ChatGPT | English | Paraphrasing, generating new samples
[35] | Binary | Social media (hate speech detection) | GPT-3 | English | Paraphrasing
[36] | Multiclass | News (fake news detection) | ChatGPT | English, German | Paraphrasing
[13] | Multiclass | Reviews, crowd-sourced annotations | GPT-3.5 | English | Paraphrasing, generating new samples using given categories and the examples from the original dataset
Year of publication: 2024
[37] | Multiclass, binary | Multiple (10 datasets) | ChatGPT, Llama | English, Danish | Generating new samples using given categories and the examples from the original dataset
[17] | Multiclass | News, ecological | ChatGPT | English | Paraphrasing, generating, combining paraphrasing and generating through rewriting of the generated sample
[18] | Multi-label | Ecological | T-lite | Russian | Paraphrasing, generating, combining paraphrasing and generating using name of categories and examples
[19] | Multiclass, binary | Multiple | GPT-3.5 | English | Automated generation and selection of augmentation instructions, including synonym replacement, paraphrasing, etc.
[38] | Multiclass | Social media (sentiment analysis) | ChatGPT | English | Paraphrasing
[39] | Multiclass | Clinical | ChatGPT | English | Paraphrasing
Year of publication: 2025
[20] | Multiclass | Reviews | ChatGPT | English | Paraphrasing context of an aspect term or replacing an aspect term with a synonym
[11] | Multiclass | Multiple (3 datasets) | ChatGPT | English | Paraphrasing
Table 2. The dataset statistics. The names of green waste practices are shown in italic.
Characteristic | Training Set | Test Set
Total number of posts | 913 | 413
Total number of sentences with multi-label markup | 2442 | 1058
Distribution of green practice mentions
1 | Waste sorting: separating waste by its type | 1275 | 560
2 | Studying the product labeling: identifying product packaging as a type of waste | 55 | 17
3 | Waste recycling: converting waste materials into reusable materials for further use in production | 272 | 121
4 | Signing petitions: signing documents to influence authorities | 22 | 31
5 | Refusing purchases: choosing not to buy certain products or services that negatively impact the environment | 236 | 75
6 | Exchanging: trading an unnecessary item or service to receive a desired item or service | 146 | 52
7 | Sharing: allowing multiple people to use one item, for free or for a fee | 109 | 62
8 | Participating in actions to promote responsible consumption: joining events (workshops, festivals, lessons) to promote reducing consumption | 510 | 209
9 | Repairing: restoring consumer properties of things as an alternative to disposal | 10 | 3
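The class imbalance in Table 2 (for example, 1275 training mentions of waste sorting against only 10 of repairing) is what motivates augmenting the minority classes. As a minimal illustration of how per-class augmentation quotas can be derived from these counts (the balancing rule here, raising every class toward the size of the largest, is a sketch and not necessarily the exact procedure used in the paper):

```python
# Per-class mention counts from Table 2 (training set).
train_counts = {
    "Waste sorting": 1275,
    "Studying the product labeling": 55,
    "Waste recycling": 272,
    "Signing petitions": 22,
    "Refusing purchases": 236,
    "Exchanging": 146,
    "Sharing": 109,
    "Participating in actions": 510,
    "Repairing": 10,
}

def augmentation_quota(counts, target=None):
    """Synthetic samples needed to raise every class to the size
    of the largest class (or to an explicit target size)."""
    target = target or max(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}

quota = augmentation_quota(train_counts)
print(quota["Repairing"])      # 1275 - 10 = 1265 samples to generate
print(quota["Waste sorting"])  # 0: the majority class needs none
```

In the multi-label setting one generated post can count toward several classes at once, so such quotas are an upper bound rather than an exact generation budget.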
Table 3. Prompts.
Prompting Approach | Russian Version | English Version
Rephrasing | Перефразируй текст поста из экологического сообщества в социальной сети с учетом того, что он относится к следующим темам: [TOPICS]. Текст: [TEXT] | Rephrase the text of a post from an environmental community on a social network, considering that it relates to the following topics: [TOPICS]. Text: [TEXT].
Adding explanations | Перефразируй текст поста из экологического сообщества в социальной сети с учетом того, что он относится к следующим темам: [TOPIC 1 (EXPLANATION 1), …, TOPIC N (EXPLANATION N)]. Текст: [TEXT] | Rephrase the text of a post from an environmental community on a social network, considering that it relates to the following topics: [TOPIC 1 (EXPLANATION 1), …, TOPIC N (EXPLANATION N)]. Text: [TEXT].
CoT prompting | Russian version:
Перефразируй текст поста из экологического сообщества в социальной сети с учетом того, что он относится к следующим темам: [TOPICS]. Текст: [TEXT]
Рассуждай шаг за шагом:
1. Текст является постом в экологическом сообществе в социальной сети.
2. Тема [TOPIC 1] означает [EXPLANATION 1].
…
N. Тема [TOPIC N−1] означает [EXPLANATION N−1].
Ответ:
English version:
Rephrase the text of a post from an environmental community on a social network, considering that it relates to the following topics: [TOPICS]. Text: [TEXT]
Let's think step by step:
1. The text is a post in an environmental community on a social network.
2. The topic [TOPIC 1] means [EXPLANATION 1].
…
N. The topic [TOPIC N−1] means [EXPLANATION N−1].
Answer:
Expanding | Перефразируй текст поста из экологического сообщества в социальной сети, добавив в него больше деталей, с учетом того, что текст относится к следующим темам: [TOPICS]. Текст: [TEXT] | Rephrase the text of a post from an environmental community on a social network, adding more details, considering that the text relates to the following topics: [TOPICS]. Text: [TEXT]
Replacing by synonyms | Перефразируй текст поста из экологического сообщества в социальной сети, заменяя ключевые слова на синонимы, с учетом того, что текст относится к следующим темам: [TOPICS]. Текст: [TEXT] | Rephrase the text of a post from an environmental community on a social network, replacing key words with synonyms, considering that the text relates to the following topics: [TOPICS]. Text: [TEXT]
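The templates in Table 3 are instantiated per post by substituting the post text and its labels (plus explanations for the CoT variant). A minimal sketch of assembling the English Chain-of-Thought prompt is shown below; the function name and the example label/explanation are illustrative, and the actual generation call to T-lite or Llama is not shown:

```python
def build_cot_prompt(text, topics):
    """Assemble the English CoT prompt of Table 3.

    `topics` maps each label assigned to the post to its short explanation.
    """
    names = ", ".join(topics)
    lines = [
        f"Rephrase the text of a post from an environmental community "
        f"on a social network, considering that it relates to the "
        f"following topics: [{names}]. Text: {text}",
        "Let's think step by step:",
        "1. The text is a post in an environmental community on a social network.",
    ]
    # One reasoning step per topic, numbered from 2 as in the template.
    for i, (topic, explanation) in enumerate(topics.items(), start=2):
        lines.append(f"{i}. The topic [{topic}] means [{explanation}].")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_cot_prompt(
    "Let's look for good books and give them to each other!",
    {"Sharing": "allowing multiple people to use one item"},
)
print(prompt)
```

The explicit numbered steps are what distinguish this prompt from the plain rephrasing template: the model is told what each label means before it is asked to produce the paraphrase.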
Table 4. Results (F1-score, %). The results that exceed both baselines are marked with ↑.
Training Data | ruBERT | ruELECTRA
Original data | 64.64 | 71.8
+ Random duplication | 67.24 | 72.31
+ Back translation | 71.96 | 74.27
T-lite
+ Rephrasing | 76.33 ↑ | 74.1
+ Adding explanations | 77.13 ↑ | 74.07
+ Chain-of-Thought | 78.42 ↑ | 75.16 ↑
+ Expanding | 76.23 ↑ | 73.52
+ Replacing by synonyms | 72.04 ↑ | 74.29 ↑
Llama
+ Rephrasing | 71.86 ↑ | 73.57
+ Adding explanations | 74.41 ↑ | 73.89
+ Chain-of-Thought | 77.07 ↑ | 75.53 ↑
+ Expanding | 66.74 | 72.89
+ Replacing by synonyms | 71.63 | 74.49 ↑
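The scores in Table 4 are F1-scores computed over multi-label sentence annotations. As a self-contained illustration of the metric (a stand-in for the library routine presumably used; the label sets below are toy data, not GreenRu annotations), micro-averaged F1 pools true/false positives and negatives across all samples and labels:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-sample sets of labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"Repairing"}, {"Sharing", "Exchanging"}]
pred = [{"Repairing"}, {"Sharing"}]
# tp=2, fp=0, fn=1 -> precision 1.0, recall 2/3, F1 = 0.8
print(micro_f1(gold, pred))
```

Micro-averaging weights frequent classes such as waste sorting more heavily than rare ones such as repairing, which is one reason balancing the training data matters for the rare classes.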
Table 5. Semantic similarity.
Approach | Semantic Similarity, % (T-lite) | Semantic Similarity, % (Llama)
Rephrasing | 44.56 | 47.49
Adding explanations | 44.4 | 45.91
Chain-of-Thought | 39.87 | 40.98
Expanding | 36.54 | 47.18
Replacing by synonyms | 54.88 | 49.17
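The semantic similarity in Table 5 compares each generated text with its original; the paper uses multilingual sentence embeddings [46] for this. The cosine-similarity step itself is sketched below with simple word-count vectors standing in for the sentence encoder (the tokenizer and vectors are illustrative only, so the absolute scores are not comparable to Table 5):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bow(text):
    # Toy stand-in for a sentence encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

orig = "let's look for good books and share them"
gen = "let's search for great books and share them with each other"
score = cosine_similarity(bow(orig), bow(gen))
print(round(score * 100, 1))  # similarity expressed as a percentage
```

With real sentence embeddings the same cosine formula applies to dense vectors, so a lower score for Chain-of-Thought outputs (Table 5) indicates paraphrases that drift further from the original wording while keeping the labels.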
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Glazkova, A.; Zakharova, O. Enhancing Green Practice Detection in Social Media with Paraphrasing-Based Data Augmentation. Big Data Cogn. Comput. 2025, 9, 81. https://doi.org/10.3390/bdcc9040081