Article

Vegetarianism Discourse in Russian Social Media: A Case Study

by
Nikita Gorduna
and
Natalia Vanetik
*,†
Department of Software Engineering, Shamoon College of Engineering, Beer Sheva 84100, Israel
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(1), 259; https://doi.org/10.3390/app15010259
Submission received: 15 December 2024 / Revised: 22 December 2024 / Accepted: 25 December 2024 / Published: 30 December 2024

Abstract

Dietary choices, especially vegetarianism, have attracted much attention lately due to their potential effects on the environment, human health, and morality. Despite this, public discourse on vegetarianism in Russian-language contexts remains underexplored. This paper introduces VegRuCorpus, a novel, manually annotated dataset of Russian-language social media texts expressing opinions on vegetarianism. Through extensive experimentation, we demonstrate that contrastive learning significantly outperforms traditional machine learning and fine-tuned transformer models, achieving the best classification performance for distinguishing pro- and anti-vegetarian opinions. While traditional models perform competitively using syntactic and semantic representations and fine-tuned transformers show promise, our findings highlight the need for task-specific data to unlock their full potential. By providing a new dataset and insights into model performance, this work advances opinion mining and contributes to understanding nutritional health discourse in Russia.

1. Introduction

Dietary choices, especially vegetarianism, have spurred increased discussion of health benefits, environmental impacts, and ethical concerns [1,2]. Globally, the adoption of vegetarian diets is seen as a key strategy for mitigating climate change by reducing greenhouse gas emissions from livestock farming and for preserving biodiversity [3]. Furthermore, the ethical treatment of animals increasingly influences dietary decisions, underscoring the intersection of personal choices and global sustainability [4]. While significant research has examined vegetarianism in Western and English-speaking contexts, its cultural, ethical, and social dimensions in Russian-speaking regions remain underexplored. This gap is critical to address, as dietary behaviors are shaped by unique cultural narratives, historical traditions, and economic conditions [5]. Russia has witnessed slow but steady growth in vegetarian and vegan communities, often amidst cultural resistance to non-meat-based diets [6]. Understanding public opinion in this context is essential for developing targeted public health campaigns and fostering constructive dialogue on sustainable diets.
Vegetarianism encompasses various diets distinguished by their restrictiveness: while commonly associated with meat exclusion, some definitions include less restrictive practices. Pescatarians, for example, consume fish but no other meat; flexitarians occasionally include meat; and ovo-lacto vegetarians avoid all meat while consuming dairy and eggs. Strict vegetarians, in contrast, exclude all animal products entirely [7]. Numerous studies highlight the health benefits associated with vegetarian diets, yet further research is needed to explore their impact on overall quality of life, encompassing physical, mental, social, and environmental well-being [8,9,10].
Within the field of natural language processing (NLP), advancements now allow for the automated interpretation and classification of human communication. As a branch of artificial intelligence, NLP has become critical for various applications, from sentiment analysis and machine translation to text summarization. Text classification, assigning categories to text, underpins these applications, playing a fundamental role in spam detection, topic categorization, and opinion mining [11,12].
For the Russian language, NLP faces unique challenges. Unlike English, a Germanic language, Russian belongs to the Slavic family and has complex inflectional grammar. The language’s morphological richness, where words change according to number, case, tense, and gender, complicates tokenization and parsing. Additionally, Russian’s flexible syntax allows for variable word order, further complicating NLP tasks [13]. The shortage of large, annotated Russian datasets further limits progress; while English NLP benefits from extensive resources, Russian models often depend on smaller, domain-specific datasets, which hinders generalizability. The development of models specifically trained in Russian, like ruBERT [14] and ruRoBERTa [15], is essential to address these language-specific challenges effectively.
Despite these advancements, most dietary opinion datasets focus on English or broader nutritional topics. The MIND dataset, for example, contains around 10,000 English texts related to food, while other studies explore large-scale collections of vegetarian-related keywords for topic modeling [16,17]. Meanwhile, work with the NutriGreen dataset [18] primarily aids dietary classification tasks, such as vegan or organic product detection, using image data. Other studies address nutrition preferences for specific populations or analyze dietary nutrients using structured datasets [19,20,21]. These resources underscore the limited focus on Russian-language content and opinion mining specific to vegetarianism. This limitation not only hinders comprehensive understanding of public attitudes but also restricts the applicability of advanced NLP methods in culturally diverse settings. By introducing VegRuCorpus, this study addresses a critical gap, offering insights into both the linguistic challenges of Russian NLP and the socio-cultural nuances of dietary opinions.
The primary aim of this paper is to classify Russian social media texts to determine whether they support or oppose vegetarianism, contributing to the understudied area of nutritional health discourse in Russia. This study introduces VegRuCorpus, a manually annotated dataset containing online Russian texts with opinions on vegetarianism, and evaluates it using advanced transformer-based models, including ruBERT [22] and ruRoBERTa [23]. This aligns with the broader goals of NLP research to improve linguistic inclusivity, ensuring that advancements in machine learning extend to under-represented languages and regions. Traditional text classification models, such as naïve Bayes, support vector machines (SVMs), and random forests, rely on manually extracted features like n-grams and term frequencies. While effective in some contexts, these models struggle to capture nuanced semantic and syntactic patterns in Russian [11,24]. The rise of deep learning approaches, especially transformer-based architectures like BERT and GPT, has enabled models to learn these intricate dependencies directly, making them highly effective for complex classification tasks [12,25].
In this study, we aim to classify Russian social media content to determine whether individual texts support or oppose vegetarianism. This study contributes a novel Russian-language dataset and benchmarks state-of-the-art NLP models, addressing a dual challenge: advancing Russian NLP and enriching the discourse on dietary choices in Russia. By shedding light on public attitudes toward vegetarianism, this work provides actionable insights for policymakers, health advocates, and researchers, and improves the understanding of public opinion on nutritional health in Russia, a topic that has received limited attention in the NLP field.
The contributions of this paper are as follows: (1) VegRuCorpus, the first manually annotated dataset of online Russian texts expressing opinions on vegetarianism; it fills a gap in research on Russian-language content regarding dietary preferences, particularly vegetarianism, which has been far less explored than in English-speaking regions; (2) an evaluation of VegRuCorpus with traditional and advanced transformer models [22,23,26] to determine whether these texts support or oppose vegetarianism.
We address the following research questions in our work.
  • RQ1: Which text representations are better for our classification task?
  • RQ2: Do lemmatization and topic modeling have a positive effect on classification accuracy?
  • RQ3: Does sentiment analysis have a positive effect on classification accuracy?
  • RQ4: Do transformer-based models perform better than traditional ones?
  • RQ5: Is contrastive learning more effective in classifying opinions on vegetarianism than traditional or transformer models?
This paper is organized as follows. Section 2 covers the related work. Section 3 describes our dataset and the process of its collection and annotation. In Section 4 and Section 5, we describe the text representations and classification models that we used to perform the classification of our data. Section 6 describes the hardware and software setup and full results of our experimental evaluation. Finally, Section 7 and Section 8 discuss conclusions, limitations, and applications of our approach.

2. Background

2.1. Russian NLP

The advancement of Russian natural language processing (NLP) addresses the unique difficulties presented by Russian morphology and syntax while reflecting the worldwide development of computational linguistics. Early research, dating back to the 1960s–1980s, concentrated on rule-based morphological analysis and machine translation systems. Soviet researchers created linguistic processors such as TULIPS-2 to handle the Russian language's rich inflectional system [27]. These early systems were limited by the complexity of Russian grammar and the scarcity of computing power.
Statistical techniques such as hidden Markov models (HMMs) were introduced in the 1990s [28], and the Russian National Corpus was established in 2003 [29]. Because of Russian's free word order and grammatical agreement system, advancements in part-of-speech tagging and lemmatization were needed, and the corpus's provision of important annotated data made them possible. During this time, morphological analysis tools such as Yandex's Mystem gained popularity [30].
Named entity recognition (NER) and other text-processing tasks improved as machine learning models entered the NLP field in the 2000s. However, the shift to machine learning also brought to light enduring issues, such as handling colloquial speech and dialectal variation. Data annotation was a major barrier, since NLP systems had to handle the Russian language's complex case system and agreement rules. The paper [31] presents the organization and application of an inflectional morphological model for Russian. Its main objective is to efficiently recognize the morpho-syntactic features of words and to generate words according to requested features. The system uses a templated word-paradigm model with simple data structures and control mechanisms. It is fully implemented for a substantial subset of Russian and provides an extensive list of morpho-syntactic features and stress positions. Special dictionary management tools were built for browsing, debugging, and extending the lexicon.
The advent of deep learning methods and word embeddings like Word2Vec [32] and FastText [33] in the 2010s signaled a sea change by enabling models to more accurately represent the semantic relationships between words.
Russian NLP was transformed by the introduction of transformer models, which enabled notable progress in text classification, question answering, and sentiment analysis. Research and applications were further accelerated by the DeepPavlov library from the Moscow Institute of Physics and Technology [34], which offers pre-trained models tailored for Russian NLP tasks. Among transformer models, RuBERT [22], a specialized adaptation of BERT pre-trained on large Russian text corpora, has performed exceptionally well, capturing the complex morphology and syntax of the Russian language and surpassing general-purpose multilingual models such as mBERT [36] in tasks like sentiment analysis and text classification [37]. The paper [38] discusses the adaptation of multilingual masked language models to specific languages, highlighting their superior performance in tasks like reading comprehension and sentiment analysis; it also demonstrates the benefits of transfer learning from multilingual to monolingual models, which reduces training time, and releases open-sourced pre-trained models for Russian.
The SBERT transformer model [26] is a multilingual, sentence-transformer model developed specifically for natural language understanding (NLU) in Russian and other languages. Part of the SBERT (Sentence-BERT) family, it is fine-tuned to generate high-quality sentence embeddings, which capture the semantic meaning of text and are useful in tasks like semantic search, sentence similarity, and clustering. The model supports multilingual capabilities, allowing it to work effectively with Russian as well as other languages. Some recent research highlights the application of this model in tasks like semantic textual similarity (STS), clustering, and text retrieval [39], especially in projects seeking to enhance Russian-language support in embedding benchmarks.
In addition, Sber AI-Forever’s ruRoBERTa [23], an enhanced version of RoBERTa [40] for Russian, expands on the achievements of previous transformer models. It applies architecture-level enhancements designed to handle Russian-specific linguistic complexity, like morphological variance and free word order, and integrates larger training data. The field of Russian NLP is further advanced by this improved architecture, which guarantees greater performance across a variety of NLP tasks, such as text categorization and sentiment analysis [15].
The Vikhr family of models represents a cutting-edge advancement in Russian NLP, featuring instruction-tuned, open-source large language models optimized for the Russian language. These models demonstrate improved performance and computational efficiency by leveraging an adapted tokenizer vocabulary and extensive instruction tuning [41]. This model family does not, however, contain a sentence classification model suitable for our task.
The DeepPavlov family of models has been widely adopted in the research community, with its foundational paper by Burtsev et al. [34] and the framework’s GitHub repository garnering more than 6700 stars, indicating significant interest and utilization among developers. The DeepPavlov organization maintains over 160 repositories, reflecting a broad ecosystem of tools and extensions built around the core library. Russian NLP research and development has accelerated significantly as a result of the release of DeepPavlov’s open-sourced pre-trained models [14].
Benchmarks like Russian SuperGLUE [35] are used to assess how well models like RuBERT and ruRoBERTa handle coreference resolution, commonsense reasoning, and other complex NLP tasks [42]. Owing to their specialized focus on Russian syntax and morphology, both models have been shown to be more effective than mBERT for Russian NLP tasks [15]: RuBERT outperforms multilingual alternatives in sentiment analysis, text classification, and reading comprehension, while ruRoBERTa, an optimized version of RoBERTa, addresses Russian linguistic challenges such as morphological variance and free word order.
The majority of Russian NLP research on text categorization focuses on sentiment analysis, where models such as RuBERT employ Russian-specific data to identify emotions in texts more precisely than multilingual models like mBERT. Topic classification is also very common, particularly for literary and journalistic datasets, where models like ruRoBERTa are used to handle complex syntax. The detection of spam and harmful content on social media platforms is another crucial area, where pre-trained models from libraries like DeepPavlov help remove unwanted content. The work in [43] presents RuSentiment, a novel dataset for sentiment analysis of Russian social media posts, together with a new set of thorough annotation criteria that can be extended to other languages. RuSentiment contains 31,185 posts annotated with a Fleiss' kappa of 0.58 (three annotations per post), making it the largest dataset in its class for the Russian language. A total of 6950 posts were pre-selected using an active-learning-style approach to diversify the dataset. In addition to releasing the top-performing word embeddings trained on 3.2 billion Russian social media posts, the authors also present baseline classification results. They experimented with a variety of classifiers, such as a gradient boosting classifier, logistic regression, and linear SVM, and also implemented a basic neural network classifier (NNC) with four fully connected layers and non-linear activation functions between them.
The survey [44] reviews applications of sentiment analysis to Russian-language content, focusing on its potential for processing and analyzing large-scale opinion data. It systematically characterizes existing studies by data source, purpose, employed approach, and primary outcomes; identifies challenges and future research directions; and presents a research agenda to improve the quality of applied sentiment analysis studies and expand the existing research base. Additionally, it provides a literature review of publicly available sentiment datasets of Russian-language texts. Other recent works span authorship identification, opinion analysis, and specific text classification tasks [45,46,47].

2.2. Vegetarianism in Text Analysis

It is worth noting that most research papers in the area of machine learning (ML) that address vegetarianism focus on food ingredients rather than on texts. In this section, we describe these datasets.
The NutriGreen dataset [18] is a collection of 10,472 images of branded food products aimed at training segmentation models for detecting labels on food packaging. Each image in the dataset comes with three distinct labels: its nutritional quality, whether the product is vegan or vegetarian, and the presence of the EU organic certification logo. The paper [19] predicts elderly people's vegetarian food preferences from chronic disease data using a hybrid neural network method. The data were collected by interviewing 100 elderly people about their vegetarian food preferences, gender, and chronic diseases.
The paper [20] presents an analysis of nutrient-rich foods suitable for vegetarian and vegan diets. By leveraging a large-scale dataset of food nutrient compositions, the authors identify optimal foods that meet dietary requirements for individuals following plant-based diets. The study aims to support healthier vegetarian eating patterns by recommending nutrient-dense food options that align with both semi-vegetarian and strict vegetarian lifestyles.
The paper [21] outlines the development of a decision support system that utilizes deep learning techniques to enhance vegetarian food flavoring tailored for older adults. The authors address the challenges faced by the aging population in maintaining adequate nutrition and satisfaction in their diets by proposing a system that recommends flavorful vegetarian options. The authors of [48] explore the factors influencing consumers’ readiness to shift toward plant-based diets through a data-driven approach.
The paper [49] discusses the methodologies used to standardize and integrate diverse dietary datasets to enhance global dietary surveillance efforts. It details the challenges of reconciling variations in dietary data collection methods and reporting standards across different populations and regions. The findings reveal a comprehensive dataset that includes dietary information from multiple sources, enabling more accurate assessments of global dietary patterns and facilitating public health interventions aimed at improving nutrition worldwide. The paper [50] introduces a dataset of Central Asian foods that contains 42 food categories and over 16,000 images of national dishes unique to this region.
The paper [51] uses machine learning methods like SVMs, neural networks, and naive Bayes to categorize user interests in English- and Russian-language social media communities, including vegetarianism-related ones. According to the study, results differ between Russian-language platforms (such as VKontakte and Twitter) and English-language Twitter pages, with English-language Twitter pages achieving the best categorization accuracy. This implies that the efficiency of interest classification in social networks is influenced by platform and language variations.
The paper [52] analyzes sentiment in vegan-related tweets to investigate how the general public views veganism. The study detects emotions such as positivity, negativity, and fear within the dataset by using mutual information for feature selection. The results show that veganism is becoming increasingly popular, with a generally favorable attitude trend, even though expressions of fear remained prevalent over the period under study.

2.3. Contrastive Learning

We employ contrastive learning to enhance the quality of representation learning, training our model to distinguish between different texts by reducing similarity between representations of samples from distinct classes while increasing similarity between representations of samples within the same class [53,54,55]. This approach leverages input data and corresponding class labels, guiding the model to develop representations that effectively discriminate among classes through dual contrastive learning. In this context, dual contrastive learning (DualCL) drives the model to align representations based on input characteristics and labels, facilitating the learning of class-specific patterns.
The DualCL framework, as introduced in [56], adapts the contrastive loss function for supervised learning scenarios, enabling the simultaneous optimization of classifier parameters and input sample representations within a unified space. DualCL achieves this by contrasting input samples with augmented versions, interpreting classifier parameters as enriched representations associated with each text label. By aligning representations in this shared space, DualCL improves an understanding of opinion distinctions, incorporating label information directly into the learning process.
In natural language processing (NLP), contrastive learning has shown success across various tasks, including text classification, sentiment analysis, and language modeling. The flexibility and power of contrastive frameworks allow for improved model performance by refining the model’s ability to capture distinctions across opinions expressed in texts [57].

3. The VegRuCorpus Dataset

3.1. Data Collection

This paper presents VegRuCorpus, a newly developed dataset designed to investigate the divergent perspectives on vegetarianism, including its benefits and drawbacks. This dataset consists of 1024 handpicked articles. Data were gathered by accessing articles from the Russian platform dzen.ru [58] and querying the google.com search engine [59]. Dzen.ru is a platform for posting and consuming information that lets users share their own content as well as access blogs, news, photographs, videos, and articles. Dzen.ru was selected as the primary source for this study due to its unique structure and user engagement dynamics, which are well suited for analyzing public opinions. Unlike Twitter or Facebook, where comments are often brief and fragmented, Dzen.ru articles contain more in-depth discussions, allowing users to express detailed opinions and arguments. The longer comment lengths on Dzen.ru provide richer linguistic and contextual data, enabling better analysis of attitudes and opinions, particularly in the context of complex topics like vegetarianism.
Table 1 contains a complete list of the queries that we used and their translations.
Based on the search results, each website was manually reviewed to identify relevant content on vegetarianism, specifically focusing on its pros and cons. Often, several links led to the same article, which resulted in a limited set of unique sources; we have eliminated all duplicated articles. Additional manual filtering was performed to represent different opinions and to obtain a balanced dataset. Note that, on average, every article contains more than 100 words and represents a coherent opinion rather than a simple short statement. We provide text samples from the corpus in Appendix A.1.

3.2. Annotation

Supporting vegetarianism is defined as expressing a positive attitude toward vegetarian habits, advocating for reducing or giving up the consumption of animal products, or highlighting the moral, health, environmental, or cultural advantages of a plant-based diet. A viewpoint is considered opposed to vegetarianism when it minimizes the advantages of a plant-based diet, criticizes or rejects the tenets of vegetarianism, or argues in favor of eating meat.
We had four native Russian speakers serve as annotators: two males and two females, all of whom possess academic degrees. Two annotators were responsible for labeling the data in the context of attitudes toward vegetarianism in Russia (‘pos’ for texts in favor and ‘neg’ for texts arguing against). A third annotator reviewed the labels assigned by both the first and second annotators and served as a judge. She compared the two sets of labels and decided which label was closer to her judgment. The fourth annotator oversaw and corrected the judging in a small number of cases. This ensured an additional level of reliability in determining the most accurate labeling.
Cohen's kappa [60] is a statistic that measures inter-rater agreement for categorical items. It is generally considered a more robust measure than simple percent agreement, as $\kappa$ takes into account the agreement occurring by chance. The formula for Cohen's kappa is

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
The observed proportion of agreement among raters is denoted by $P_o$, while the expected proportion of agreement by chance is denoted by $P_e$. Table 2 displays the definition and interpretation of the kappa coefficient in accordance with [61]. A high degree of agreement indicates that the evaluations are consistent and that the annotation procedure is reliable and sound for the task at hand.
For the VegRuCorpus, the agreement between two initial annotators, measured as Cohen’s kappa score, was 0.81032, which indicates nearly perfect agreement [61]. Examples of texts on which the annotators disagreed appear in Table 3—it is not immediately clear from the text if the authors support or object to vegetarianism. The opinions expressed in these texts contain both arguments supporting and criticizing vegetarianism.
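For illustration, Cohen's kappa can be computed directly from two annotators' label lists with scikit-learn; the labels below are hypothetical examples, not actual VegRuCorpus annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotation lists; 'pos' = supports vegetarianism, 'neg' = opposes.
annotator_1 = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"]

# Cohen's kappa corrects raw percent agreement for chance agreement.
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```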
Table 4 contains basic statistics of the annotated texts. The dataset is nearly balanced, with majority class "pos" and a majority-class baseline of 0.513.

4. Text Representations

In this section, we provide the details of text representation construction. The general pipeline is depicted in Figure 1; it includes preprocessing, construction of syntactic and semantic representations, and the incorporation of topic modeling and sentiment analysis into the representation.

4.1. Preprocessing

For VegRuCorpus, the data cleaning process focuses on optimizing the text for classification by removing stopwords to eliminate unnecessary noise. This step helps to ensure that only the most relevant and informative words remain for effective text classification. First, we performed word tokenization using the transformer-based ruRoBERTa tokenizer [15]. Then, we removed stopwords using the list provided at [62]. This Russian stopword list contains 153 stopwords and is less comprehensive than its English counterparts [63,64,65,66]. A word cloud of the resulting texts and its translation appear in Figure 2. Further discussion and the list of stopwords are provided in Appendix A.2.
After stopword removal, we performed (optional) lemmatization with the spaCy Python library [64], using the ru_core_news_lg model.
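A minimal sketch of this pipeline is shown below. We assume the ruRoBERTa tokenizer is the ai-forever/ruRoberta-large checkpoint; since its byte-level subword pieces do not map one-to-one to words, the sketch filters stopwords at the word level before optional spaCy lemmatization, and the stopword set shown is a small placeholder.

```python
import spacy
from transformers import AutoTokenizer

# Assumed Hugging Face checkpoint for the ruRoBERTa tokenizer.
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruRoberta-large")
# Russian spaCy model used for optional lemmatization.
nlp = spacy.load("ru_core_news_lg")

# Placeholder; the list used in the paper contains 153 stopwords.
russian_stopwords = {"и", "в", "не", "на", "что", "это"}

def preprocess(text: str, lemmatize: bool = False) -> list[str]:
    doc = nlp(text)
    # Word-level stopword removal (and punctuation/number filtering).
    return [tok.lemma_ if lemmatize else tok.text for tok in doc
            if tok.is_alpha and tok.text.lower() not in russian_stopwords]

def to_subwords(words: list[str]) -> list[str]:
    # Subword tokenization with the ruRoBERTa tokenizer.
    return tokenizer.tokenize(" ".join(words))

print(to_subwords(preprocess("Вегетарианство полезно для здоровья", lemmatize=True)))
```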

4.2. Syntactic Representations

A statistical measure called term frequency–inverse document frequency (tf-idf) is used to show how important a word is to a document in a corpus. The tf-idf value rises in direct proportion to the word's frequency of occurrence in the document and is offset by the number of documents in the corpus that contain the term. The weight $W_d(t)$ of a term t in a document d is calculated as follows:
$$W_d(t) = tf(t) \times \log \frac{N}{df(t)}$$
where
  • $tf(t)$ is the term frequency of t in document d;
  • N is the number of documents in the corpus;
  • $df(t)$ is the number of documents in the corpus containing t.
N-grams are sequences of n consecutive words or characters, where n is a parameter: for example, two successive words make up a bigram, whereas a single word is referred to as a unigram. N-grams combine nearby tokens into cohesive units, preserving information about word co-occurrence.
Using these syntactic representations, we created tf-idf vectors, word n-gram vectors for n = 1–3, and character n-gram vectors for n = 1–3 to represent our texts. To restrict the size of the n-gram vectors, we excluded all n-grams that occur in only a single text (i.e., have count 1). The sizes of these vector representations, with and without lemmatization and with stopwords removed, are shown in Table 5.
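The sketch below shows how such representations can be built with scikit-learn; setting min_df=2 implements the exclusion of n-grams that occur in only one text, and the texts variable is a placeholder for the preprocessed corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["пример текста о вегетарианстве", "другой пример текста"]  # placeholder corpus

# tf-idf vectors over single words.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)

# Word n-grams for n = 1..3; min_df=2 drops n-grams that appear in only one text.
word_ngrams = CountVectorizer(ngram_range=(1, 3), min_df=2)
X_words = word_ngrams.fit_transform(texts)

# Character n-grams for n = 1..3.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(1, 3), min_df=2)
X_chars = char_ngrams.fit_transform(texts)

print(X_tfidf.shape, X_words.shape, X_chars.shape)
```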
Table 6 contains the most frequent word n-grams of various sizes in the VegRuCorpus dataset. We can see that various morphological forms of the words "мясо" (meat) and "животное происхождение" (animal origin) dominate the list.

4.3. Semantic Representations

In contrast to traditional word embeddings, BERT sentence embeddings are vector representations of complete sentences produced by the BERT model, which captures the contextual meanings of words and their relationships, allowing for a deeper understanding of the text [67]. Text categorization, sentiment analysis, and semantic similarity assessments are just a few of the NLP tasks that can make use of these embeddings as features.
The sentence embeddings used in this paper were produced by three pre-trained BERT-family models adapted for the Russian language. The first model, RuBERT, is trained on an extensive corpus of Russian text, ensuring its efficacy in processing the Russian language [68]; it can be accessed as DeepPavlov/rubert-base-cased on Hugging Face [22]. The second, the SBERT Large Multitask (cased) model, is specifically engineered for multitask learning in Russian and offers improved sentence-level embeddings; it is well suited for tasks requiring deep semantic comprehension and is accessible as ai-forever/sbert_large_mt_nlu_ru on Hugging Face [26]. The third model, ruRoBERTa, has been specifically optimized for sentence-level tasks in Russian and represents a substantial step forward in capturing the intricacies and subtleties of Russian text; it is accessible as ai-forever/ruRoberta-large on Hugging Face [23]. The embedding sizes are 768 for the RuBERT model [22] and 1024 for both the SBERT [26] and ruRoBERTa [23] models.
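The paper does not specify the pooling strategy used to derive sentence vectors from token embeddings; the sketch below assumes mean pooling over the last hidden layer, a common choice for SBERT-style models.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ai-forever/sbert_large_mt_nlu_ru"  # or DeepPavlov/rubert-base-cased, ai-forever/ruRoberta-large
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, d)
    # Mean pooling over non-padding tokens (an assumption; CLS pooling is another option).
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, d)

vectors = embed(["Вегетарианство полезно.", "Мясо необходимо человеку."])
print(vectors.shape)  # (2, 1024) for the SBERT and ruRoBERTa models
```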

4.4. Representation Visualization and Analysis

The dimensionality reduction approach known as t-distributed stochastic neighbor embedding (t-SNE) [69] is frequently used to visualize high-dimensional data in a lower-dimensional space, usually 2D or 3D. It is especially helpful for visualizing textual data, with each document represented as a high-dimensional vector of attributes such as n-gram frequencies or tf-idf scores. We applied this method to both the syntactic and semantic text representations generated for VegRuCorpus in hopes of gaining better insight into the complexity of the text classification task at hand.
Figure 3 demonstrates the results for character n-grams of sizes 1–3, showing some degree of clustering and separation. However, the observed overlap indicates that more advanced features or additional data may be necessary for improved categorization.
Figure 4 shows the results for word n-grams of sizes 1–3. It illustrates that, although there is significant separation between the two groups, they are not completely distinct in the reduced 2D space. The clustering suggests that this range of word n-grams captures important differences between the text samples of the two groups as well as certain similarities and overlaps.
The t-SNE distribution for the tf-idf text representation is displayed in Figure 5. When reduced to a 2D space, we see a large overlap with no discernible difference between the two classes, indicating that the tf-idf representation is ineffective at differentiating between them. The lack of distinguishable clusters suggests that, in this situation, the textual properties that tf-idf is able to capture might not be enough to distinguish between the two groups.
The t-SNE plot shown in Figure 6 illustrates the dimensionality reduction of text data using sentence embeddings generated by the SBERT transformer model. The distribution of points suggests that there is extensive overlap and no evident clustering or distinction between the two classes. This visualization implies that these embeddings do not easily differentiate between texts supporting and opposing vegetarianism in the reduced 2D space. The lack of clear clustering indicates that the text properties captured by this model may not be sufficient to distinctly separate the two groups in this scenario.
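Such projections can be reproduced with scikit-learn; in the sketch below, the perplexity value and color scheme are our assumptions rather than settings reported in the paper.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# X: dense (n_texts, n_features) matrix (call .toarray() on sparse vectorizer output);
# y: NumPy array of 0/1 labels ('neg'/'pos').
def plot_tsne(X: np.ndarray, y: np.ndarray, title: str) -> None:
    # Project high-dimensional text vectors to 2D; perplexity=30 is an assumed setting.
    coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
    for label, color, name in [(0, "tab:red", "neg"), (1, "tab:green", "pos")]:
        pts = coords[y == label]
        plt.scatter(pts[:, 0], pts[:, 1], c=color, s=10, label=name)
    plt.title(title)
    plt.legend()
    plt.show()
```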

4.5. Representation Enhancement

In this section, we describe how we enhance the syntactic and semantic representations of VegRuCorpus texts by incorporating topic modeling and sentiment analysis, in hopes of improving classification accuracy. Topic modeling [70] helps to identify the underlying topics in conversations, which may help to discern pro- and anti-vegetarian positions. Sentiment analysis [71,72] adds a supplementary layer to the categorization process, providing insight into the texts' emotional content and tone.

4.5.1. Topic Modeling

To perform latent Dirichlet allocation (LDA) topic modeling on Russian social media texts, we first preprocessed the text data by removing Russian stopwords (as described in Section 4.1) and applied lemmatization to reduce vocabulary size, enhancing the model’s focus on meaningful content. Then, we applied LDA, where each text is treated as a distribution of latent topics and each topic is represented by a distribution over words.
Formally, each text d has a topic distribution $\theta_d$ drawn from a Dirichlet prior with parameter $\alpha$, $\theta_d \sim \mathrm{Dirichlet}(\alpha)$, and each topic k has a word distribution $\phi_k$ drawn from a Dirichlet prior with parameter $\beta$, $\phi_k \sim \mathrm{Dirichlet}(\beta)$. The marginal probability of a word w in a text d is computed as
$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k) \cdot P(z = k \mid d)$$

where $P(w \mid z = k)$ is the probability of word w given topic k and $P(z = k \mid d)$ is the probability of topic k in text d.
To create a topic vector $v_d$ for each text d, we use the probabilities $\theta_{d,k}$ of topic k in document d over all topics $1, \ldots, K$:

$$v_d = (\theta_{d,1}, \theta_{d,2}, \ldots, \theta_{d,K})$$
Empirically, we set the number K of topics to 100. To perform LDA topic modeling, we used the Gensim Python library [66]. We created a dictionary and a document–term matrix using tokenized and lemmatized texts after stopword removal.
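A minimal Gensim sketch of this step follows; K = 100 matches the setting above, while the number of passes and the toy corpus are placeholder assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

# tokenized_texts: list of lemmatized, stopword-free token lists (placeholder here).
tokenized_texts = [["вегетарианство", "здоровье"], ["мясо", "белок", "здоровье"]]

dictionary = corpora.Dictionary(tokenized_texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_texts]

K = 100  # number of topics used in the paper
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=K, passes=10, random_state=42)

def topic_vector(doc_tokens: list[str]) -> list[float]:
    # Dense K-dimensional vector of topic probabilities theta_{d,k}.
    bow = dictionary.doc2bow(doc_tokens)
    probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [probs.get(k, 0.0) for k in range(K)]

print(len(topic_vector(tokenized_texts[0])))  # 100
```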

4.5.2. Sentiment Analysis

The RuBERT [73] architecture serves as the foundation for the pre-trained transformer-based model blanchefort/rubert-base-cased-sentiment, which has been optimized for sentiment classification in Russian [74]. To accurately categorize sentiments as positive, negative, or neutral, it uses a bidirectional transformer that comprehends the context of each sentence by examining the links between words. The model is well suited for sentence-level sentiment analysis, since it can identify the emotional tone of a sentence and has been trained on a sizable corpus of Russian texts. For every sentence, the model returns a predicted sentiment (POS, NEG, or NEU) and its probability p. We turned these data into vectors (pos, neg, neu), where the element corresponding to the predicted class is set to p and all other elements are set to zero.
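A sketch of this conversion using the Hugging Face pipeline API is given below; the exact label strings returned by the model (e.g., POSITIVE rather than POS) are an assumption that should be checked against the model card.

```python
from transformers import pipeline

sentiment = pipeline("text-classification",
                     model="blanchefort/rubert-base-cased-sentiment")

LABEL_INDEX = {"POSITIVE": 0, "NEGATIVE": 1, "NEUTRAL": 2}  # assumed label names

def sentiment_vector(text: str) -> list[float]:
    # One prediction per text: predicted label plus its probability p.
    result = sentiment(text, truncation=True)[0]
    vec = [0.0, 0.0, 0.0]  # (pos, neg, neu)
    vec[LABEL_INDEX[result["label"]]] = result["score"]
    return vec

print(sentiment_vector("Вегетарианство улучшило моё самочувствие."))
```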

5. Classification Models

In this section, we describe the classification models that we applied to evaluate the VegRuCorpus dataset. For the evaluation, we generate text representations (for traditional models) or use the text directly (for transformer models), split the data into training and test sets with an 80%/20% ratio, train the models on the training set, and evaluate them on the test set. The pipeline is depicted in Figure 7.

5.1. Traditional Models

We have applied extreme gradient boosting (XGB) [75], random forest (RF) [76], and logistic regression (LR) [77] classifiers to these text representations.
A random forest is a meta-estimator that fits multiple decision tree classifiers to various subsets of the dataset and averages their predictions to increase accuracy and decrease overfitting [78]. The RF method is an ensemble learning algorithm, meaning that it combines several basic machine learning models to enhance overall performance. Each base model votes for a class, and once all of them have voted, the ensemble predicts the class with the most votes.
Extreme gradient boosting (XGB) is a boosting technique that reduces errors by using gradient descent to build a series of weak prediction models, such as decision trees [75]. In gradient boosting, the model iteratively updates weights by following the gradient, minimizing the loss at each iteration. It is an additive modeling technique: new decision trees are added to the ensemble one at a time, each fitted to reduce the remaining loss, and each new tree's output is combined with that of the previous trees until the loss falls below a threshold or a set limit on the number of trees is reached.
The logistic regression model is based on the odds of a two-level outcome of interest, that is, the ratio of the probability of the event happening to the probability of it not happening (even odds correspond to an event that happens half the time). The model expresses the natural logarithm of the odds as a regression function of the predictors; with a single predictor X, the odds ratio is obtained by taking the exponential of the coefficient $\beta_1$, which represents the change in the logarithm of the odds for a one-unit change in X [79].
Additionally, we used a voting classifier (VC), implemented with sklearn [65], that combines the above three models (RF, XGB, LR) with hard voting. In this scheme, the dataset is preprocessed to filter out irrelevant data, a ranking step eliminates low-ranking features that do not meet a global minimum threshold, and the filtered dataset is then used with each classifier individually and in combination. The prediction outputs from the classifiers are pooled, and the most frequently predicted class is returned [80].
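A sketch of this ensemble using scikit-learn and xgboost follows; any hyperparameters not listed in Table 7 are defaults or assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: representation matrix (e.g., tf-idf vectors); y: 0/1 opinion labels.
def train_voting_classifier(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    vc = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",  # majority vote over the three models' predicted classes
    )
    vc.fit(X_tr, y_tr)
    print("test accuracy:", vc.score(X_te, y_te))
    return vc
```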

5.2. Fine-Tuned Transformers

As baselines, we employed the RuBERT [73], SBERT [26], and ruRoBERTa [15] transformer models, pre-trained for the task of sentence classification on relevant data. We fine-tuned these models on the training portion of our data and evaluated them on the test portion.
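A condensed fine-tuning sketch using the Hugging Face Trainer API is shown below; the training arguments are illustrative assumptions rather than the exact values from Table 7.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "DeepPavlov/rubert-base-cased"  # or the SBERT / ruRoBERTa checkpoints
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# train_ds / test_ds: datasets with 'input_ids', 'attention_mask', and 'labels'
# fields (e.g., built with datasets.Dataset and the tokenizer above).
def fine_tune(train_ds, test_ds):
    args = TrainingArguments(
        output_dir="vegru-finetune",
        num_train_epochs=3,              # assumed value
        per_device_train_batch_size=16,  # assumed value
        learning_rate=2e-5,              # assumed value
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=test_ds)
    trainer.train()
    return trainer
```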

5.3. Contrastive Learning

To address the classification of opinions on vegetarianism, we applied the advanced text classification method DualCL [56], which utilizes contrastive learning with label-aware data augmentation.
This method's objective function incorporates two contrastive losses: one for labeled data and another for unlabeled data. For each labeled instance $(x_i, y_i)$, we computed the contrastive loss as
$$\mathcal{L}_L = -\log \frac{e^{f(x_i) \cdot g(y_i)}}{\sum_{j=1}^{N} e^{f(x_i) \cdot g(y_j)}}$$

where $f(x_i)$ is the feature representation of input $x_i$, $y_i$ is the corresponding label, $g(y_i)$ is the embedding of label $y_i$, and N is the number of classes.
In our experiments, we used pre-trained transformer models—RuBERT [73], SBERT [26], and ruRoBERTa [23]—to compute feature representations.
For the binary classification of opinions on vegetarianism, we set $N = 2$. For unlabeled data, we computed the contrastive loss as

$$\mathcal{L}_U = -\log \frac{e^{f(x_i) \cdot f(x_j)/\tau}}{\sum_{k=1}^{M} e^{f(x_i) \cdot f(x_k)/\tau}}$$

where $f(x_i)$ and $f(x_j)$ are the feature representations of inputs $x_i$ and $x_j$, M is the number of unlabeled instances, and $\tau$ is a temperature parameter controlling the concentration of the distribution, set to the default value in [81].
The combined dual contrastive loss, incorporating labeled and unlabeled losses and a regularization term, is defined as
$$\mathcal{L} = \mathcal{L}_L + \lambda \mathcal{L}_U + \beta \|\theta\|^2$$

with $\theta$ standing for the model parameters, and the parameters $\lambda$ and $\beta$ controlling the ratio of supervised to unsupervised losses and the weight of the regularization term. We used values for these parameters based on the native implementation in [81].
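To make the labeled loss concrete, here is a short PyTorch sketch of the loss defined above; it is our illustration and not the reference DualCL implementation [81].

```python
import torch
import torch.nn.functional as F

def labeled_contrastive_loss(features: torch.Tensor,
                             label_embeddings: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
    """features: (B, d) encoder outputs f(x_i);
    label_embeddings: (N, d) label representations g(y);
    labels: (B,) integer class indices (N = 2 in our binary setting)."""
    # Dot products f(x_i) · g(y_j) for every input/label pair.
    logits = features @ label_embeddings.T            # (B, N)
    # -log softmax at the true label, i.e., the labeled loss above.
    return F.cross_entropy(logits, labels)

# Toy usage: batch of 4 samples, 768-dim features, 2 classes.
feats = torch.randn(4, 768)
label_embs = torch.randn(2, 768)
y = torch.tensor([0, 1, 1, 0])
print(labeled_contrastive_loss(feats, label_embs, y))
```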

6. Experimental Evaluation

6.1. Hardware Setup

All tests were executed on Google Colab [82], a cloud-based platform that provides access to robust computational resources for machine learning applications. The tests were run with standard settings and a T4 GPU.

6.2. Software Setup

Our software environment used Python v3.10.12 alongside essential libraries. NumPy v1.26.4 and Pandas v2.2.2 were used for data manipulation and numerical computations; these libraries were chosen for their robust performance on large datasets and their extensive functionality.
The scikit-learn library [65], an open-source machine learning library for Python, was used for traditional machine learning models such as logistic regression, random forests, and the voting classifier. We selected scikit-learn for its comprehensive suite of machine learning algorithms and its seamless integration with Python.
LDA topic modeling was performed with the Gensim library [66], and sentiment analysis was performed with the RuBERT-based sentiment model [74]. Gensim was chosen for its proven efficiency in handling large text corpora and its ease of implementation; the sentiment model was chosen because it leverages DeepPavlov's RuBERT [34] and has proven effective in multiple research works [83].
For deep learning models, we used Hugging Face’s Transformers library to fine-tune pre-trained models such as ruBERT [22], ruRoBERTa [23], and SBERT [26]. We opted for this library for fine-tuning deep learning models because of its extensive documentation, active community support, and optimized implementations of state-of-the-art transformer architectures.
For contrastive learning, we utilized a publicly available Python implementation, DualCL [81]. It was run for 20 epochs. This implementation of DualCL was selected for contrastive learning due to its effective design for representation learning and its demonstrated success in similar opinion classification tasks.
Hyperparameter setup for all the models used in our experiments is shown in Table 7.

6.3. Metrics

We report the following evaluation metrics across all models to provide a comprehensive assessment of performance.
Precision quantifies the model’s accuracy in its positive predictions, measuring how many predicted positives are actual positives. We compute it as
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Recall (or sensitivity) gauges the model’s ability to identify all relevant positive instances, focusing on minimizing missed positive cases. It is calculated as
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
The F1 score unifies precision and recall into a single metric that is especially informative when class distributions are imbalanced. It is defined as
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Accuracy computes the proportion of correct predictions over the entire dataset, offering an overall evaluation of model performance:
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Data Size}}$$
Since each metric highlights distinct facets of model performance, using them in tandem is crucial for accurate assessment in text categorization. Precision emphasizes the importance of accurate positive predictions, vital in applications where false positives carry a high cost. Recall measures the model’s capability to capture all relevant cases, reducing the likelihood of missing critical instances, which is crucial in contexts where overlooking positives is highly undesirable. The F1 score balances recall and precision, offering a robust metric in situations where data classes are imbalanced, thus providing a more comprehensive performance summary.
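These metrics can be computed directly with scikit-learn, as in the sketch below; y_true and y_pred are placeholders for gold test labels and model predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1]  # placeholder gold labels (1 = 'pos', 0 = 'neg')
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"  # binary task: pro- vs. anti-vegetarian
)
accuracy = accuracy_score(y_true, y_pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} Acc={accuracy:.3f}")
```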

6.4. Results

Table 8 and Table 9 show evaluation results of traditional models applied to syntactic and semantic text representations, with or without topic modeling vectors (denoted by topics) and sentiment scores (denoted by SA). For better readability, we report the result of the best classifier out of four (RF, LR, XGB, and VC) in every case. Full results are available in the Appendix A. For completeness’ sake, we also report the scores for topics and SA vectors separately to evaluate their capability of detecting opinions on vegetarianism in Russian (Table 8).
From Table 8, we conclude that adding topic information does not produce better results. This implies that text topics are not a good indicator of writers' opinions on vegetarianism, which is also evident from the low scores achieved by the topic vectors alone. Adding sentiment data produces a minor improvement for the tf-idf and word n-gram representations, indicating that sentiment is helpful but does not have a strong impact on performance. We also see that lemmatization improves scores for the tf-idf and word n-gram representations. It is also clear that the LR classifier almost always achieves the best results for this type of text representation.
The results for semantic text representations in Table 9 indicate that they perform slightly better than syntactic ones. We also see that the sentence embeddings produced by the ruRoBERTa transformer model generate better scores in every case. The LR classifier is again the best-performing classifier, and adding topics does not improve accuracy, just as with syntactic representations. However, adding sentiment analysis data results in a slight improvement for the ruRoBERTa representations.
Table 10 contains the results of transformer models fine-tuned on the training set of VegRuCorpus and tested on its test set. All the models produced low results that are close to the majority baseline, indicating that fine-tuning alone is insufficient for our task: transformers require extensive domain-relevant examples to capture nuanced distinctions, and VegRuCorpus does not provide enough task-specific training data. Still, SBERT produced slightly better scores than its counterparts.
Table 11 shows the results of binary text classification with contrastive learning (CL) for different base models (ruBERT, SBERT, and ruRoBERTa). CL with the SBERT and ruRoBERTa models surpasses both the fine-tuned baselines and the traditional classifiers with syntactic and semantic text representations. ruRoBERTa produced the best scores, and we therefore conclude that it is the best base model for the CL setup.

7. Conclusions

Our evaluation indicates that traditional classifiers combined with syntactic and semantic representations perform competitively but do not distinguish opinions on vegetarianism reliably. Adding topic-modeling features does not improve their performance, and lemmatization helps only for some representations; we therefore answer RQ2 negatively overall. Sentiment analysis slightly improves scores for selected models, allowing us to rule in favor of using sentiment data for our task and to answer RQ3 positively.
Among semantic representations, those produced by the ruRoBERTa model show a clear advantage. Semantic text representations outperform syntactic ones overall, which answers RQ1: semantic representations, particularly ruRoBERTa embeddings, are better suited to our classification task.
Fine-tuned transformer models perform close to the majority baseline on VegRuCorpus, underscoring the need for more task-specific training data and resulting in a negative answer to RQ4. Among the transformer models, SBERT demonstrates a slight performance edge, showing potential but limited suitability for this dataset. Contrastive learning significantly outperforms both traditional methods and fine-tuning, with ruRoBERTa delivering the best results overall, affirming the hypothesis of RQ5.
These findings highlight the superiority of contrastive learning over simple fine-tuning and suggest that sophisticated representations, such as those from contrastive learning, are better suited to this domain.

8. Applications and Limitations

The task described in this paper and its results have potential applications for monitoring public opinion on dietary choices, enabling a more precise analysis of social attitudes toward vegetarianism and related lifestyle trends. Classification model insights can help policymakers and health organizations to identify public concerns and create targeted campaigns to address common misconceptions or promote dietary shifts. Additionally, businesses in the food industry can use these models to analyze consumer sentiment and adjust marketing strategies to align with pro- or anti-vegetarian perspectives.
Our results indicate that while VegRuCorpus provides a valuable resource for opinion mining in Russian, the classification accuracy achieved by both traditional and transformer-based models highlights challenges in effectively capturing diverse opinions on vegetarianism. Additionally, the complexity of the Russian language, including its rich morphology and flexible syntax, further complicates accurate classification.
A key limitation of this study is its reliance on a relatively small dataset, VegRuCorpus, which may not fully reflect the diversity of opinions on vegetarianism across demographics and regions. Moreover, the models, particularly the fine-tuned transformers, perform poorly, indicating a need for more extensive task-specific training data to improve accuracy. The lack of improvement from the topic modeling and sentiment analysis vectors suggests that these features are less effective for the task, emphasizing the need for more sophisticated text representations. Finally, although contrastive learning shows promise, its use is constrained by the computational demands of large transformer models like ruRoBERTa, which limits scalability and generalizability in practical applications.
To address these limitations, future work could focus on expanding VegRuCorpus with more annotated examples, encompassing diverse contexts and nuanced opinions. Incorporating additional linguistic features, such as contextual embeddings from multilingual or Russian-specific transformers, could also improve model performance. Also, the application of advanced techniques like domain-adaptive pretraining and data augmentation may help models to better generalize to this challenging classification task.

Author Contributions

Methodology, N.V.; Software, N.G. and N.V.; Validation, N.G. and N.V.; Formal analysis, N.V.; Investigation, N.G.; Data curation, N.G. and N.V.; Writing—original draft, N.G. and N.V.; Writing—review & editing, N.V.; Visualization, N.G.; Supervision, N.V.; Project administration, N.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is stored at GitHub repository https://github.com/NataliaVanetik/VegRuCorpus. It is publicly accessible for research purposes.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1 shows two examples of opinions from the VegRuCorpus: one labeled as positive and the other as negative.

Appendix A.1. VegRuCorpus Text Samples

Table A1. Examples of positive and negative opinions from VegRuCorpus.
Text | Translation
positive
Чтo еще, крoме нравственнo-мoральных аспектoв привлекает людей в вегетарианстве? Пo мнению приверженцев этoй системы питания, oтказ oт пищи живoтнoгo прoисхoждения пoзвoляет oчистить телo oт гoрмoна страха, кoтoрый выбрасывается в крoвь живoтнoгo в мoмент смерти. И хoть ни oднo исследoвание не нашлo этoму научнoгo пoдтверждения, вегетарианцы прoдoлжают в этo верить. Весьма спoрным считается и утверждение, чтo oтказ oт мяса делает челoвека бoлее мягким и менее агрессивным. Практика пoказывает, чтo япoнцы, кoтoрые в древнoсти вooбще не ели мяса, а питались в oснoвнoм рисoм и рыбoй, никoгда не были мирoлюбивoй нацией, тo же мoжнo сказать и oб Адoльфе Гитлере, кoтoрые придерживался вегетарианства.Впрoчем, у этoй системы питания есть неoспoримые плюсы. Среди них:низкий прoцент развития сердечнo-сoсудистых забoлеваний среди вегетарианцев (растительная пища не сoдержит хoлестерина, кoтoрый oткладывается на сoсудах, прoвoцируя развитие инфаркта, атерoсклерoза, инсульта).среди вегетарианцев практически нет людей с избытoчным весoм (растительная пища бoгата клетчаткoй, кoтoрая быстрo запoлняет желудoк, вызывая чувствo сытoсти и при этoм сoдержит малo калoрий);низкий урoвень развития oнкoлoгических забoлеваний среди вегетарианцев (злаки, oвoщи и фрукты сoдержат бoльшoе кoличествo витаминoв и антиoксидантoв, кoтoрые блoкируют прoцессы старения и перерoждения клетoк).What else, besides the moral and ethical aspects, attracts people to vegetarianism? According to the adherents of this diet, giving up food of animal origin allows you to cleanse the body of the hormone of fear, which is released into the blood of an animal at the moment of death. And although no study has found scientific confirmation of this, vegetarians continue to believe in it. The assertion that giving up meat makes a person softer and less aggressive is also considered quite controversial. Practice shows that the Japanese, who in ancient times did not eat meat at all, but ate mainly rice and fish, have never been a peace-loving nation, the same can be said about Adolf Hitler, who adhered to vegetarianism. However, this diet has undeniable advantages. Among them: a low percentage of cardiovascular diseases among vegetarians (plant foods do not contain cholesterol, which is deposited on blood vessels, causing the development of heart attacks, atherosclerosis, and strokes). among vegetarians, there are practically no overweight people (plant foods are rich in fiber, which quickly fills the stomach, causing a feeling of satiety and at the same time contains few calories); a low level of cancer among vegetarians (cereals, vegetables, and fruits contain a large number of vitamins and antioxidants that block the processes of aging and cell degeneration).
negative
Отмечается и негативнoе влияние вегетаринства и веганства на физическoе здoрoвье челoвека. Результаты исследoваний пoказывают, чтo люди, кoтoрые oтказываются oт мяса или других прoдуктoв живoтнoгo прoисхoждения, мoгут страдать oт дефицита питательных веществ—витаминoв B12 и D, Омега-3 жирных кислoт, кальция, железа и цинка,—чтo мoжет сильнo сказываться на здoрoвье. Также пoявляется всё бoльше дoказательств тoгo, чтo oтказ oт упoтребления мяса связан с психическими расстрoйствами и бoлее высoкими рисками психoлoгических прoблем. Пo сравнению с людьми, упoтребляющими мясo (назoвём их «мясoедами»), вегетарианцы чаще страдают oт тяжёлoй депрессии или тревoжнoгo расстрoйства, oни бoлее склoнны к самoубийствам и причинению себе вреда. Нo свидетельства, связывающие вегетарианствo с психическими расстрoйствами, неoднoзначны. В 2010 и 2015 гoдах исследoватели oбнаружили, чтo в oтнoшении некoтoрых аспектoв oценки психическoгo здoрoвья вегетарианцы oказались здoрoвее мясoедoв. В 2017 гoду Всемирная oрганизация здравooхранения (ВОЗ) сooбщила, чтo психические забoлевания являются oснoвнoй причинoй инвалиднoсти вo всём мире и oказывают серьёзнoе влияние на верoятнoсть сердечнo-сoсудистых забoлеваний (главную причину смертнoсти в мире). Пo oценкам исследoвателей ВОЗ, бoлее 300 миллиoнoв челoвек страдают oт депрессии (4,4% населения) и бoлее 260 миллиoнoв челoвек (3,6% населения) oт тревoжнoсти. Эти oценки oтражают, чтo за пoследние два десятилетия значительнo увеличилoсь числo людей, живущих с психическими расстрoйствами и забoлеваниями.Учитывая рoст числа людей с психическими расстрoйствами и пoпуляризацией вегетарианства, актуальнo oпределить связь между oтказoм oт упoтребления мяса и психoлoгическим здoрoвьем.There are also concerns about the negative impact of vegetarianism and veganism on physical health. Research suggests that people who give up meat or other animal products may suffer from nutrient deficiencies—vitamins B12 and D, omega-3 fatty acids, calcium, iron, and zinc—which can have significant health consequences. There is also growing evidence that not eating meat is associated with mental health disorders and higher risks of psychological problems. Compared with people who eat meat (let us call them “meat eaters”), vegetarians are more likely to suffer from severe depression or anxiety disorders, and are more likely to commit suicide and harm themselves. But the evidence linking vegetarianism to mental health disorders is mixed. In 2010 and 2015, researchers found that vegetarians were healthier than meat eaters on some measures of mental health. In 2017, the World Health Organization (WHO) reported that mental illness is the leading cause of disability worldwide and has a significant impact on the risk of cardiovascular disease (the leading cause of death worldwide). WHO researchers estimate that more than 300 million people suffer from depression (4.4% of the population) and more than 260 million people (3.6% of the population) from anxiety. These estimates reflect a significant increase in the number of people living with mental disorders and illnesses over the past two decades. Given the rise in mental health and the rise of vegetarianism, it is important to determine the link between eating meat and mental health.
The positive opinion in Table A1 blends skepticism with endorsement. The speaker acknowledges moral and ethical motivations but also cites unconventional beliefs held by some vegetarians, such as the claim that abstaining from animal products “cleanses the body of the hormone of fear” supposedly released at the moment of an animal’s death, while noting that no scientific evidence supports this. The speaker likewise questions the notion that vegetarianism makes people less aggressive, countering it with references to the ancient Japanese diet and to Adolf Hitler’s vegetarianism. Despite these contested points, the speaker recognizes concrete health benefits of the diet, so the text ultimately endorses vegetarianism on health grounds while criticizing some beliefs associated with it.
The negative opinion in Table A1 focuses on the possible harms of meat-free diets. It cites research suggesting that people who give up meat or other animal products may develop deficiencies in vitamins B12 and D, omega-3 fatty acids, calcium, iron, and zinc, and it links meat avoidance to higher risks of depression, anxiety, self-harm, and suicide. At the same time, it concedes that the evidence is mixed, noting studies from 2010 and 2015 in which vegetarians scored better than meat eaters on some mental health measures, and it frames the question against WHO statistics on the global prevalence of depression and anxiety.
In conclusion, both opinions weigh pro and con arguments rather than advocating a single viewpoint, which makes such texts challenging to classify.

Appendix A.2. Stopwords in Russian Language

For VegRuCorpus, the data cleaning process prepares the text for classification by removing stopwords, thereby reducing noise and leaving only the most relevant and informative words.

Russian stopwords, such as common conjunctions, prepositions, and auxiliary verbs (e.g., “и” (and), “в” (in), “на” (on), “но” (but)), are removed from the text. These words carry little meaning on their own and may interfere with the classification task by adding irrelevant information; removing them focuses the analysis on content-rich words that help determine the stance on vegetarianism. We used a predefined Russian stopword list from NLTK [62], comprising 153 stopwords, shown in Table A2; a minimal code sketch of this cleaning step is given below.
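The following Python sketch illustrates the cleaning step. It is a minimal example, assuming NLTK and its stopwords corpus are installed; the whitespace tokenization is a simplification of the actual preprocessing pipeline.

```python
# Minimal stopword-removal sketch; assumes `pip install nltk` and a one-time
# download of the stopwords corpus. Tokenization here is naive whitespace
# splitting, used for illustration only.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
RU_STOPWORDS = set(stopwords.words("russian"))

def remove_stopwords(text: str) -> str:
    # Lowercase, split on whitespace, and drop tokens found in the stopword list.
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in RU_STOPWORDS)

print(remove_stopwords("мы считаем что вегетарианство полезно для здоровья"))
# -> "считаем вегетарианство полезно здоровья"
```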
Table A2. Russian stopwords.

и, в, что, не, на, с, для, к, из, у, как, а, это, о, или, по, также, но, его, есть, может, которые, только, при, более, чем, все, так, их, во, я, со, она, да, ты, вы, за, бы, ее, мне, было, вот, меня, еще, нет, ему, теперь, когда, даже, ну, вдруг, ли, если, уже, ни, быть, был, до, вас, нибудь, опять, уж, вам, ведь, там, потом, себя, ничего, ей, они, тут, где, надо, ней, мы, тебя, чем, была, сам, чтоб, без, будто, чего, раз, тоже, себе, под, будет, ж, тогда, кто, этот, того, потому, этого, какой, совсем, ним, здесь, этом, один, почти, мой, тем, чтобы, нее, сейчас, были, куда, зачем, всех, никогда, можно, наконец, два, другой, хоть, после, над, больше, тот, через, эти, нас, про, всего, них, какая, много, разве, три, эту, моя, впрочем, хорошо, свою, этой, перед, иногда, лучше, чуть, том, нельзя, такой, им, всегда, конечно, всю, между, которые
The translations of the terms in Table A2 are shown in Table A3.
Table A3. Russian stopwords (translation).

and, in, what, not, on, with, for, from, to, at, how, this, about, or, by, also, but, his, is, that, can, which, only, when, more, than, all, so, same, their, he, I, she, yes, you, her, me, was, here, still, no, him, now, even, well, suddenly, whether, if, neither, be, before, someone, again, because, there, then, self, nothing, they, where, must, without, as if, once, under, will, who, this, that, because, one, almost, my, to, were, why, never, finally, two, other, though, after, over, more, through, these, us, them, many, really, three, my, however, good, own, before, sometimes, better, slightly, can't, such, always, between, of course, which
Alternative stopword lists for the Russian language exist; while similar in content, each has distinctive features. Such lists, provided by prominent libraries like NLTK and SpaCy, are tailored to different NLP needs and differ in size and specific word choices. The NLTK stopword list for Russian [84] contains 151 words, including prevalent conjunctions, prepositions, and auxiliary verbs that carry minimal semantic weight in text classification. The SpaCy stopword list for Russian [85] comprises 768 words, covering the same categories plus other commonly used terms. The SpaCy list is too extensive for our purposes, as it may eliminate contextually significant words that are important for differentiating pro and con stances on vegetarianism in Russian.
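For reference, the two inventories can be compared programmatically. This is a brief sketch assuming both nltk (with its stopwords corpus) and spacy are installed; the spaCy Russian stopword list ships with the base library, so no model download is needed.

```python
# Compare the NLTK and spaCy Russian stopword inventories.
from nltk.corpus import stopwords
from spacy.lang.ru.stop_words import STOP_WORDS as SPACY_RU_STOPWORDS

nltk_ru = set(stopwords.words("russian"))

print("NLTK size:  ", len(nltk_ru))             # ~151 words on current releases
print("spaCy size: ", len(SPACY_RU_STOPWORDS))  # several hundred words
print("shared:     ", len(nltk_ru & SPACY_RU_STOPWORDS))
print("spaCy-only examples:", sorted(SPACY_RU_STOPWORDS - nltk_ru)[:5])
```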

Appendix A.3. Full Results of Traditional Models for Syntactic Text Representations

Table A4 and Table A5 contain full evaluation results of traditional models (RF, LR, XGB, and VC) for syntactic text representations. Table A4 covers representations computed with and without lemmatization, and Table A5 covers representations enhanced with topic modeling and SA.
Table A4. Full evaluation results of traditional classifiers with syntactic text representations, with and without lemmatization.

Representation | Classifier | P | R | F1 | Acc
char n-grams | RF | 0.8009 | 0.7879 | 0.7873 | 0.7902
char n-grams | LR | 0.7966 | 0.7960 | 0.7951 | 0.7951
char n-grams | XGB | 0.7355 | 0.7243 | 0.7227 | 0.7268
char n-grams | VC | 0.7871 | 0.7843 | 0.7845 | 0.7854
word n-grams | RF | 0.8028 | 0.7931 | 0.7929 | 0.7951
word n-grams | LR | 0.7806 | 0.7800 | 0.7802 | 0.7805
word n-grams | XGB | 0.7293 | 0.6669 | 0.6468 | 0.6732
word n-grams | VC | 0.7996 | 0.7826 | 0.7816 | 0.7854
tf-idf | RF | 0.8024 | 0.7824 | 0.7810 | 0.7854
tf-idf | LR | 0.7996 | 0.7826 | 0.7816 | 0.7854
tf-idf | XGB | 0.7489 | 0.7021 | 0.6911 | 0.7073
tf-idf | VC | 0.8056 | 0.7821 | 0.7803 | 0.7854
lemmatization + n-grams | RF | 0.8069 | 0.7981 | 0.7981 | 0.8000
lemmatization + n-grams | LR | 0.8100 | 0.8093 | 0.8095 | 0.8098
lemmatization + n-grams | XGB | 0.7585 | 0.7276 | 0.7222 | 0.7317
lemmatization + n-grams | VC | 0.8150 | 0.8026 | 0.8024 | 0.8049
lemmatization + tf-idf | RF | 0.8024 | 0.7988 | 0.7991 | 0.8000
lemmatization + tf-idf | LR | 0.8151 | 0.8081 | 0.8083 | 0.8098
lemmatization + tf-idf | XGB | 0.7440 | 0.7181 | 0.7131 | 0.7220
lemmatization + tf-idf | VC | 0.8293 | 0.8229 | 0.8232 | 0.8244
Table A5. Full evaluation results of traditional classifiers with syntactic text representations, topic modeling, and SA.

Representation | Classifier | P | R | F1 | Acc
char n-grams + topics | RF | 0.7994 | 0.7936 | 0.7937 | 0.7951
char n-grams + topics | LR | 0.7966 | 0.7960 | 0.7951 | 0.7951
char n-grams + topics | XGB | 0.7502 | 0.7336 | 0.7312 | 0.7366
char n-grams + topics | VC | 0.7871 | 0.7843 | 0.7845 | 0.7854
word n-grams + topics | RF | 0.7606 | 0.7433 | 0.7412 | 0.7463
word n-grams + topics | LR | 0.7806 | 0.7800 | 0.7802 | 0.7805
word n-grams + topics | XGB | 0.7630 | 0.7067 | 0.6940 | 0.7122
word n-grams + topics | VC | 0.7663 | 0.7429 | 0.7396 | 0.7463
tf-idf + topics | RF | 0.7846 | 0.7733 | 0.7727 | 0.7756
tf-idf + topics | LR | 0.7739 | 0.7693 | 0.7693 | 0.7707
tf-idf + topics | XGB | 0.7489 | 0.7021 | 0.6911 | 0.7073
tf-idf + topics | VC | 0.7804 | 0.7524 | 0.7488 | 0.7561
char n-grams + SA | RF | 0.8034 | 0.7876 | 0.7868 | 0.7902
char n-grams + SA | LR | 0.7795 | 0.7740 | 0.7741 | 0.7756
char n-grams + SA | XGB | 0.7530 | 0.6917 | 0.6759 | 0.6976
char n-grams + SA | VC | 0.8024 | 0.7824 | 0.7810 | 0.7854
word n-grams + SA | RF | 0.8056 | 0.7769 | 0.7743 | 0.7805
word n-grams + SA | LR | 0.8252 | 0.8124 | 0.8123 | 0.8146
word n-grams + SA | XGB | 0.7197 | 0.6621 | 0.6426 | 0.6683
word n-grams + SA | VC | 0.8096 | 0.7767 | 0.7735 | 0.7805
tf-idf + SA | RF | 0.8164 | 0.7867 | 0.7843 | 0.7902
tf-idf + SA | LR | 0.8238 | 0.7914 | 0.7890 | 0.7951
tf-idf + SA | XGB | 0.7065 | 0.6524 | 0.6321 | 0.6585
tf-idf + SA | VC | 0.8187 | 0.7762 | 0.7718 | 0.7805

Appendix A.4. Full Results of Traditional Models for Semantic Text Representations

Table A6 and Table A7 contain full evaluation results of traditional models (RF, LR, XGB, and VC) for semantic text representations. Table A6 covers semantic representations computed for the original texts, and Table A7 covers semantic representations enhanced with topic modeling and SA.
Table A6. Full evaluation results of traditional classifiers with semantic text representations.

Representation | Classifier | P | R | F1 | Acc
ruBERT SE | RF | 0.7513 | 0.7507 | 0.7508 | 0.7512
ruBERT SE | LR | 0.7609 | 0.7607 | 0.7608 | 0.7610
ruBERT SE | XGB | 0.6637 | 0.6624 | 0.6623 | 0.6634
ruBERT SE | VC | 0.7470 | 0.7455 | 0.7456 | 0.7463
SBERT SE | RF | 0.7961 | 0.7943 | 0.7945 | 0.7951
SBERT SE | LR | 0.7804 | 0.7802 | 0.7803 | 0.7805
SBERT SE | XGB | 0.7432 | 0.7402 | 0.7403 | 0.7415
SBERT SE | VC | 0.7894 | 0.7838 | 0.7839 | 0.7854
ruRoBERTa SE | RF | 0.7999 | 0.8000 | 0.7999 | 0.8000
ruRoBERTa SE | LR | 0.8438 | 0.8438 | 0.8438 | 0.8439
ruRoBERTa SE | XGB | 0.7124 | 0.7114 | 0.7115 | 0.7122
ruRoBERTa SE | VC | 0.8389 | 0.8390 | 0.8390 | 0.8390
Table A7. Full evaluation results of traditional classifiers with semantic text representations with topic modeling and SA.

Representation | Classifier | P | R | F1 | Acc
ruBERT SE + topics | RF | 0.7270 | 0.7262 | 0.7263 | 0.7268
ruBERT SE + topics | LR | 0.7413 | 0.7412 | 0.7412 | 0.7415
ruBERT SE + topics | XGB | 0.6700 | 0.6667 | 0.6660 | 0.6683
ruBERT SE + topics | VC | 0.7222 | 0.7212 | 0.7213 | 0.7220
SBERT SE + topics | RF | 0.7811 | 0.7798 | 0.7800 | 0.7805
SBERT SE + topics | LR | 0.7852 | 0.7852 | 0.7852 | 0.7854
SBERT SE + topics | XGB | 0.7342 | 0.7302 | 0.7301 | 0.7317
SBERT SE + topics | VC | 0.7916 | 0.7893 | 0.7895 | 0.7902
ruRoBERTa SE + topics | RF | 0.7999 | 0.8000 | 0.7999 | 0.8000
ruRoBERTa SE + topics | LR | 0.8389 | 0.8390 | 0.8390 | 0.8390
ruRoBERTa SE + topics | XGB | 0.7316 | 0.7317 | 0.7316 | 0.7317
ruRoBERTa SE + topics | VC | 0.8194 | 0.8195 | 0.8194 | 0.8195
ruBERT SE + SA | RF | 0.7719 | 0.7698 | 0.7699 | 0.7707
ruBERT SE + SA | LR | 0.7511 | 0.7512 | 0.7511 | 0.7512
ruBERT SE + SA | XGB | 0.6740 | 0.6719 | 0.6716 | 0.6732
ruBERT SE + SA | VC | 0.7415 | 0.7410 | 0.7411 | 0.7415
SBERT SE + SA | RF | 0.7569 | 0.7552 | 0.7554 | 0.7561
SBERT SE + SA | LR | 0.7852 | 0.7852 | 0.7852 | 0.7854
SBERT SE + SA | XGB | 0.7297 | 0.7252 | 0.7249 | 0.7268
SBERT SE + SA | VC | 0.7950 | 0.7950 | 0.7950 | 0.7951
ruRoBERTa SE + SA | RF | 0.7907 | 0.7907 | 0.7902 | 0.7902
ruRoBERTa SE + SA | LR | 0.8536 | 0.8536 | 0.8536 | 0.8537
ruRoBERTa SE + SA | XGB | 0.7072 | 0.7069 | 0.7070 | 0.7073
ruRoBERTa SE + SA | VC | 0.8487 | 0.8488 | 0.8487 | 0.8488

References

  1. Dietz, T.; Frisch, A.S.; Kalof, L.; Stern, P.C.; Guagnano, G.A. Values and vegetarianism: An exploratory analysis 1. Rural Sociol. 1995, 60, 533–542. [Google Scholar] [CrossRef]
  2. Nezlek, J.B.; Forestell, C.A. Vegetarianism as a social identity. Curr. Opin. Food Sci. 2020, 33, 45–51. [Google Scholar] [CrossRef]
  3. Poore, J.; Nemecek, T. Reducing food’s environmental impacts through producers and consumers. Science 2018, 360, 987–992. [Google Scholar] [CrossRef] [PubMed]
  4. Monteiro, B.M.A.; Pfeiler, T.M.; Patterson, M.D.; Milburn, M.A. The Carnism Inventory: Measuring the ideology of eating animals. Appetite 2017, 113, 51–62. [Google Scholar] [CrossRef] [PubMed]
  5. LeBlanc, R.D. Vegetarianism in Russia: The Tolstoy (an) Legacy; The Carl Beck Papers in Russian and East European Studies; University of New Hampshire: Durham, NH, USA, 2001. [Google Scholar]
  6. Leblanc, R.D. The Ethics and Politics of Diet: Tolstoy, Pilnyak, and the Modern Slaughterhouse. Gastronomica 2017, 17, 9–25. [Google Scholar] [CrossRef]
  7. Hargreaves, S.M.; Raposo, A.; Saraiva, A.; Zandonadi, R.P. Vegetarian diet: An overview through the perspective of quality of life domains. Int. J. Environ. Res. Public Health 2021, 18, 4067. [Google Scholar] [CrossRef] [PubMed]
  8. Sindhu, S.; Mageshwari, S.U. A study on behavior, diet patterns and physical activity among selected GDM and non-GDM women in south India. J. Diabetol. 2024, 15, 86–93. [Google Scholar] [CrossRef]
  9. Wang, T.; Masedunskas, A.; Willett, W.C.; Fontana, L. Vegetarian and vegan diets: Benefits and drawbacks. Eur. Heart J. 2023, 44, 3423–3439. [Google Scholar] [CrossRef]
  10. Key, T.J.; Davey, G.K.; Appleby, P.N. Health benefits of a vegetarian diet. Proc. Nutr. Soc. 1999, 58, 271–275. [Google Scholar] [CrossRef] [PubMed]
  11. Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A survey on text classification algorithms: From text to predictions. Information 2022, 13, 83. [Google Scholar] [CrossRef]
  12. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [Google Scholar] [CrossRef]
  13. Artemova, E. Deep learning for the Russian language. In The Palgrave Handbook of Digital Russia Studies; Palgrave Macmillan: Cham, Switzerland, 2021; pp. 465–481. [Google Scholar]
  14. Kuratov, Y.; Arkhipov, M. RuBERT: A Russian BERT Model. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2019), Miyazaki, Japan, 13–17 May 2019. [Google Scholar]
  15. Zmitrovich, D.; Abramov, A.; Kalmykov, A.; Tikhonova, M.; Taktasheva, E.; Astafurov, D.; Baushenko, M.; Snegirev, A.; Shavrina, T.; Markov, S.; et al. A Family of Pretrained Transformer Language Models for Russian. arXiv 2023, arXiv:2309.10931. [Google Scholar]
  16. Lee, C.; Kim, S.; Jeong, S.; Lim, C.; Kim, J.; Kim, Y.; Jung, M. MIND dataset for diet planning and dietary healthcare with machine learning: Dataset creation using combinatorial optimization and controllable generation with domain experts. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (Round 2), Online, 6–14 December 2021. [Google Scholar]
  17. Olavsrud, M.A. Natural Language Processing and Topic Modeling for Exploring the Vegetarian and Vegan Trends. Master’s Thesis, Norwegian University of Life Sciences, Ås, Norway, 2020. [Google Scholar]
  18. Drole, J.; Pravst, I.; Eftimov, T.; Koroušić Seljak, B. NutriGreen image dataset: A collection of annotated nutrition, organic, and vegan food products. Front. Nutr. 2024, 11, 1342823. [Google Scholar] [CrossRef]
  19. Kengpol, A.; Punyota, W. Prediction of Vegetarian Food Preferences for the Aging Society. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1163, 012021. [Google Scholar] [CrossRef]
  20. Kim, S.; Fenech, M.F.; Kim, P.J. Nutritionally recommended food for semi-to strict vegetarian diets based on large-scale nutrient composition data. Sci. Rep. 2018, 8, 4344. [Google Scholar] [CrossRef] [PubMed]
  21. Duangsuphasin, A.; Kengpol, A.; Lima, R.M. Design of a decision support system for vegetarian food flavoring by using deep learning for the ageing society. In Proceedings of the 2021 Research, Invention, and Innovation Congress: Innovation Electricals and Electronics (RI2C), Bangkok, Thailand, 1–3 September 2021; pp. 54–59. [Google Scholar]
  22. DeepPavlov. ruBERT-base-cased. Pretrained Model on Hugging Face Hub. Available online: https://huggingface.co/DeepPavlov/rubert-base-cased (accessed on 1 June 2024).
  23. Sber AI. ruRoberta-large. Pretrained Model on Hugging Face Hub. Available online: https://huggingface.co/ai-forever/ruRoberta-large (accessed on 1 June 2024).
  24. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From shallow to deep learning. arXiv 2020, arXiv:2008.00364. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Sber AI. BERT Large Model Multitask (Cased) for Sentence Embeddings in Russian Language. Pretrained Model on Hugging Face Hub. Available online: https://huggingface.co/ai-forever/sbert_large_nlu_ru (accessed on 1 June 2024).
  27. Malkovsky, M.G. TULIPS-2-Natural Language Learning System. In Proceedings of the Coling 1982: Proceedings of the Ninth International Conference on Computational Linguistics, Prague, Czech Republic, 5–10 July 1982. [Google Scholar]
  28. Krogh, A. An introduction to hidden Markov models for biological sequences. In New Comprehensive Biochemistry; Elsevier: Amsterdam, The Netherlands, 1998; Volume 32, pp. 45–63. [Google Scholar]
  29. Zdorenko, T. Subject omission in Russian: A study of the Russian National Corpus. In Corpus-Linguistic Applications; Brill: Leiden, The Netherlands, 2010; pp. 119–133. [Google Scholar]
  30. Minetz, D.; Gorushkina, A. Morphological Analysizer of a Text: Functional Opportunities. Litera 2017, 1, 12–22. [Google Scholar]
  31. Mikheev, A.; Liubushkina, L. Russian morphology: An engineering approach. Nat. Lang. Eng. 1995, 1, 235–260. [Google Scholar] [CrossRef]
  32. Popova, E.; Spitsyn, V. Sentiment analysis of short russian texts using bert and word2vec embeddings. In Proceedings of the Graphion Conferences on Computer Graphics and Vision, Nizhny Novgorod, Russia, 27–30 September 2021; Volume 31, pp. 1011–1016. [Google Scholar]
  33. Korogodina, O.; Klyshinsky, E.; Karpik, O. Evaluation of vector transformations for Russian Word2Vec and FastText Embeddings. In Proceedings of the CEUR Workshop Proceedings, Luxembourg, 3–4 December 2020. [Google Scholar]
  34. Burtsev, M.; Seliverstov, A.; Airapetyan, N.; Arkhipov, M.; Kuratov, Y.; Kuznetsov, V.; Litinsky, D.; Ryabinin, M.; Sapunov, A.; Semenov, A.; et al. DeepPavlov: Open-Source Library for Dialogue Systems. In Proceedings of the ACL 2018, System Demonstrations, Melbourne, Australia, 15–20 July 2018; pp. 122–127. [Google Scholar]
  35. Shavrina, T.; Fenogenova, A.; Emelyanov, A.; Shevelev, D.; Artemova, E.; Malykh, V.; Mikhailov, V.; Tikhonova, M.; Chertok, A.; Evlampiev, A. RussianSuperGLUE: A Russian language understanding evaluation benchmark. arXiv 2020, arXiv:2010.15925. [Google Scholar]
  36. Google Research. bert-base-multilingual-cased. 2020. Available online: https://huggingface.co/google-bert/bert-base-multilingual-cased (accessed on 10 December 2024).
  37. Pires, T.; Schlinger, E.; Garrette, D. How multilingual is multilingual BERT? arXiv 2019, arXiv:1906.01502. [Google Scholar]
  38. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv 2019, arXiv:1905.07213. [Google Scholar]
  39. Snegirev, A.; Tikhonova, M.; Maksimova, A.; Fenogenova, A.; Abramov, A. The Russian-focused embedders’ exploration: ruMTEB benchmark and Russian embedding model design. arXiv 2024, arXiv:2408.12503. [Google Scholar]
  40. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  41. Shvetsova, V.; Smirnov, I.; Nikolaev, S. Vikhr: Instruction-tuned Open-Source Models for Russian. arXiv 2024, arXiv:2405.13929. [Google Scholar]
  42. Fenogenova, A.; Tikhonova, M.; Mikhailov, V.; Shavrina, T.; Emelyanov, A.; Shevelev, D.; Kukushkin, A.; Malykh, V.; Artemova, E. Russian superglue 1.1: Revising the lessons not learned by russian nlp models. arXiv 2022, arXiv:2202.07791. [Google Scholar]
  43. Rogers, A.; Romanov, A.; Rumshisky, A.; Volkova, S.; Gronas, M.; Gribov, A. RuSentiment: An enriched sentiment analysis dataset for social media in Russian. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 755–763. [Google Scholar]
  44. Smetanin, S. The applications of sentiment analysis for Russian language texts: Current challenges and future perspectives. IEEE Access 2020, 8, 110693–110719. [Google Scholar] [CrossRef]
  45. Zakharova, O.; Glazkova, A. GreenRu: A Russian Dataset for Detecting Mentions of Green Practices in Social Media Posts. Appl. Sci. 2024, 14, 4466. [Google Scholar] [CrossRef]
  46. Romanov, A.; Kurtukova, A.; Shelupanov, A.; Fedotova, A.; Goncharov, V. Authorship identification of a Russian-language text using support vector machine and deep neural networks. Future Internet 2020, 13, 3. [Google Scholar] [CrossRef]
  47. Kalabikhina, I.; Moshkin, V.; Kolotusha, A.; Kashin, M.; Klimenko, G.; Kazbekova, Z. Advancing Semantic Classification: A Comprehensive Examination of Machine Learning Techniques in Analyzing Russian-Language Patient Reviews. Mathematics 2024, 12, 566. [Google Scholar] [CrossRef]
  48. Graça, J.; Oliveira, A.; Calheiros, M.M. Meat, beyond the plate. Data-driven hypotheses for understanding consumer willingness to adopt a more plant-based diet. Appetite 2015, 90, 80–90. [Google Scholar] [CrossRef]
  49. Karageorgou, D.; Castor, L.L.; de Quadros, V.P.; de Sousa, R.F.; Holmes, B.A.; Ioannidou, S.; Mozaffarian, D.; Micha, R. Harmonising dietary datasets for global surveillance: Methods and findings from the Global Dietary Database. Public Health Nutr. 2024, 27, e47. [Google Scholar] [CrossRef] [PubMed]
  50. Karabay, A.; Bolatov, A.; Varol, H.A.; Chan, M.Y. A central Asian food dataset for personalized dietary interventions. Nutrients 2023, 15, 1728. [Google Scholar] [CrossRef] [PubMed]
  51. Mikhalkova, E.; Ganzherli, N.; Karyakin, Y. A Comparative Analysis of Social Network Pages by Interests of Their Followers. arXiv 2017, arXiv:1707.05481. [Google Scholar]
  52. Shamoi, E.; Turdybay, A.; Shamoi, P.; Akhmetov, I.; Jaxylykova, A.; Pak, A. Sentiment analysis of vegan related tweets using mutual information for feature selection. PeerJ Comput. Sci. 2022, 8, e1149. [Google Scholar] [CrossRef] [PubMed]
  53. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive representation learning: A framework and review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
  54. Tolegen, G.; Toleu, A.; Mussabayev, R. Contrastive Learning for Morphological Disambiguation Using Large Language Models in Low-Resource Settings. Appl. Sci. 2024, 14, 9992. [Google Scholar] [CrossRef]
  55. Wu, T.; Yang, S. Contrastive Enhanced Learning for Multi-Label Text Classification. Appl. Sci. 2024, 14, 8650. [Google Scholar] [CrossRef]
  56. Chen, Q.; Zhang, R.; Zheng, Y.; Mao, Y. Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation. arXiv 2022, arXiv:2201.08702. [Google Scholar]
  57. Sun, H.; Liu, J.; Zhang, J. A survey of contrastive learning in NLP. In Proceedings of the 7th International Symposium on Advances in Electrical, Electronics, and Computer Engineering, Xishuangbanna, China, 18–20 March 2022; Volume 12294, pp. 1073–1078. [Google Scholar]
  58. Yandex Zen. Yandex Zen Platform. Available online: https://zen.yandex.ru (accessed on 1 June 2024).
  59. Google. Google Search Engine. Available online: https://www.google.com (accessed on 1 June 2024).
  60. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  61. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Medica 2012, 22, 276–282. [Google Scholar] [CrossRef]
  62. xiamx. node-nltk-stopwords. Available online: https://github.com/xiamx/node-nltk-stopwords (accessed on 1 June 2024).
  63. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
  64. Honnibal, M.; Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. 2017; to appear. [Google Scholar]
  65. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. scikit-learn: Machine Learning in Python. 2011. Available online: https://scikit-learn.org (accessed on 4 June 2024).
  66. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50. [Google Scholar]
  67. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  68. Kochetova, L.; Popov, V. Research of Axiological Dominants in Press Release Genre based on Automatic Extraction of Key Words from Corpus. Nauchnyi Dialog 2019, 1, 32–49. [Google Scholar] [CrossRef]
  69. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  70. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  71. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA, 6–7 July 2002; pp. 79–86. [Google Scholar]
  72. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
  73. Klyuev, G.; Gritsenko, I.; Panchenko, A.; Ruder, S.; Klyuev, M.D.; Oseledets, M.S.; Rakhlin, A.S. RuBERT: Pretrained Contextualized Embeddings for Russian. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 16–20 November 2020; pp. 5369–5374. [Google Scholar]
  74. Blanchefort, G. blanchefort/rubert-base-cased-sentiment. 2020. Available online: https://huggingface.co/blanchefort/rubert-base-cased-sentiment (accessed on 8 November 2024).
  75. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  76. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222. [Google Scholar] [CrossRef]
  77. Wright, R.E. Logistic regression. In Reading and Understanding Multivariate Statistics; Springer: New York, NY, USA, 1995; pp. 217–244. [Google Scholar]
  78. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282. [Google Scholar]
  79. LaValley, M.P. Logistic regression. Circulation 2008, 117, 2395–2399. [Google Scholar] [CrossRef]
  80. Kumar, U.K.; Nikhil, M.S.; Sumangali, K. Prediction of breast cancer using voting classifier technique. In Proceedings of the 2017 IEEE International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), Chennai, India, 2–4 August 2017; pp. 108–114. [Google Scholar]
  81. Hiyouga. Dual Contrastive Learning. 2022. Available online: https://github.com/hiyouga/Dual-Contrastive-Learning (accessed on 26 March 2024).
  82. Google Colaboratory. Google Colaboratory. Available online: https://colab.research.google.com/ (accessed on 1 June 2024).
  83. Kotelnikova, A.; Paschenko, D.; Razova, E. Lexicon-based methods and BERT model for sentiment analysis of Russian text corpora. In Proceedings of the CEUR Workshop Proceedings, online, 16–19 December 2021; Volume 2922, pp. 73–81. [Google Scholar]
  84. NLTK Team. Stopwords Documentation. Available online: https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default (accessed on 1 June 2024).
  85. SpaCy Team. SpaCy Russian Language Models. Available online: https://spacy.io/models/ru (accessed on 1 June 2024).
Figure 1. Text representation construction pipeline.
Figure 2. Word cloud of VegRuCorpus after preprocessing.
Figure 3. t-SNE results for n-gram-range (1,3) of character n-grams.
Figure 4. t-SNE results for word n-grams of range (1,3).
Figure 5. t-SNE results for the tf-idf representation.
Figure 6. t-SNE for sentence embeddings generated with the SBERT transformer model.
Figure 7. Evaluation pipeline.
Table 1. Search queries.

Query | Translation
вегетарианство за и против | vegetarianism pros and cons
вегетарианство вред и польза | vegetarianism benefits and harms
вегетарианство достоинства и недостатки | vegetarianism advantages and disadvantages
вегетарианство плюсы и минусы | vegetarianism pluses and minuses
Table 2. Interpretation of Cohen’s kappa score.

Cohen’s Kappa Score | Interpretation
< 0 | No agreement
0.01–0.20 | None to mild agreement
0.21–0.40 | Fair agreement
0.41–0.60 | Moderate agreement
0.61–0.80 | Substantial agreement
0.81–1.00 | Nearly perfect agreement
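The agreement score interpreted in Table 2 can be computed in the standard way; the sketch below uses scikit-learn's cohen_kappa_score on two hypothetical annotator label sequences (not the actual annotation data).

```python
# Cohen's kappa for two annotators; labels below are illustrative only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg"]
annotator_b = ["pos", "neg", "pos", "neg", "neg", "neg", "pos", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # 0.50 here, i.e., moderate agreement per Table 2
```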
Table 3. Examples of annotator disagreements.

Example 1 (final annotation: Neg)
Russian: За последние два года количество выбросов углекислого газа в атмосферу Земли выросло до рекордных 37 миллиардов тонн. Пока борцы за экологию и сочувствующие им вегетарианцы утверждают, что производство мясных продуктов является основным фактором, загрязняющим атмосферу, специалисты доказывают обратное.
Translation: In the past two years, carbon dioxide emissions into Earth’s atmosphere have risen to a record 37 billion tons. While environmental activists and supportive vegetarians claim that meat production is the main factor polluting the atmosphere, experts prove the opposite.

Example 2 (final annotation: Pos)
Russian: Для того, чтобы не столкнуться с недостатком витаминов и минералов, необходимо разнообразно питаться. Насытиться растительной пищей сложнее, поэтому её нужно готовить в больших количествах и, соответственно, тратить больше денег на продукты. С другой стороны, люди перестают тратить деньги на мясо и начинают покупать фрукты, поэтому разница может быть не такой уж и большой.
Translation: To avoid vitamin and mineral deficiencies, one must eat a varied diet. It is harder to feel full on plant-based food, so you need to cook it in larger quantities and, accordingly, spend more money on groceries. On the other hand, people stop spending money on meat and start buying fruits, so the difference may not be that significant.
Table 4. Data statistics.

Label | Texts | Avg Words | Avg Chars
pos | 526 | 147.92 | 895.85
neg | 498 | 118.43 | 892.71
total | 1024 | 145.22 | 884.95
Table 5. Sizes of tf-idf and n-gram vectors.

Vector | Range | Size
tf-idf | — | 19,877
lemmatized tf-idf | — | 10,541
word n-grams | (1,3) | 20,499
lemmatized word n-grams | (1,3) | 19,398
character n-grams | (1,3) | 19,220
Table 6. Top 3 n-grams for each n-gram range in VegRuCorpus.

Range | Top n-Grams | Count | Translation
(1,1) | это | 705 | this
(1,1) | мяса | 485 | meat
(1,1) | питания | 343 | nutrition
(2,2) | животного происхождения | 183 | animal origin
(2,2) | вегетарианская диета | 98 | vegetarian diet
(2,2) | питательных веществ | 69 | nutrients
(3,3) | продуктов животного происхождения | 59 | products of animal origin
(3,3) | продуктах животного происхождения | 32 | in products of animal origin
(3,3) | продукты животного происхождения | 28 | products of animal origin
(1,3) | это | 705 | this
(1,3) | мяса | 485 | meat
(1,3) | питания | 343 | nutrition
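Counts like those in Table 6 can be reproduced with scikit-learn's CountVectorizer; in this sketch, corpus_texts is a hypothetical stand-in for the cleaned VegRuCorpus documents.

```python
# Top-3 most frequent n-grams per n-gram range.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus_texts = [
    "отказ от мяса полезен для здоровья",
    "продукты животного происхождения содержат витамины",
]  # stand-in documents

for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    counts = vec.fit_transform(corpus_texts)
    totals = np.asarray(counts.sum(axis=0)).ravel()   # corpus-wide frequencies
    names = vec.get_feature_names_out()
    top3 = totals.argsort()[::-1][:3]
    print(ngram_range, [(names[i], int(totals[i])) for i in top3])
```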
Table 7. Model settings used in experiments.

Model | Settings
RF | n_estimators = 200; no maximum depth for each tree; Gini impurity for splitting; square root of the number of features considered for splitting at each node.
XGB | learning rate = 0.1; maximum depth = 3 for each tree; logistic loss function for binary classification.
LR | L2 regularization penalty; regularization strength (C) = 1.0; lbfgs solver.
VC | combined predictions of RF and LR using a majority voting scheme.
ruBERT, SBERT | pretrained on masked language modeling (MLM) and next sentence prediction (NSP) objectives; byte-pair encoding (BPE) tokenization; vocabulary size = 12·10^4 tokens; fine-tuned for sentence classification on relevant data.
ruRoBERTa | pretrained on the MLM objective; byte-level BPE tokenization; vocabulary size = 5·10^4 tokens; fine-tuned for sentence classification on relevant data.
DualCL | dual contrastive learning method; uses ruBERT, SBERT, or ruRoBERTa for feature representations; trained for 20 epochs; contrastive loss with label-aware data augmentation.
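As a minimal sketch, the traditional classifiers from Table 7 can be instantiated with scikit-learn and xgboost as follows; any parameter not listed in the table is left at its library default, which is an assumption.

```python
# Classifier configurations mirroring Table 7.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,          # no maximum depth for each tree
    criterion="gini",        # Gini impurity for splitting
    max_features="sqrt",     # sqrt(#features) considered at each node
)
xgb = XGBClassifier(
    learning_rate=0.1,
    max_depth=3,
    objective="binary:logistic",  # logistic loss for binary classification
)
lr = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")
vc = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="hard")  # majority vote
```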
Table 8. Evaluation results of traditional classifiers with syntactic text representations (the best result is marked with an asterisk).

Representation | Classifier | P | R | F1 | Acc
char n-grams | LR | 0.7966 | 0.7960 | 0.7951 | 0.7951
word n-grams | RF | 0.8028 | 0.7931 | 0.7929 | 0.7951
tf-idf | LR | 0.7996 | 0.7826 | 0.7816 | 0.7854
lemmatization + word n-grams | LR | 0.8100 | 0.8093 | 0.8095 | 0.8098
lemmatization + tf-idf | LR | 0.8151 | 0.8081 | 0.8083 | 0.8098
char n-grams + topics | LR | 0.7966 | 0.7960 | 0.7951 | 0.7951
word n-grams + topics | LR | 0.7806 | 0.7800 | 0.7802 | 0.7805
tf-idf + topics | RF | 0.7846 | 0.7733 | 0.7727 | 0.7756
topics | LR | 0.6225 | 0.6110 | 0.6037 | 0.6146
char n-grams + SA | RF | 0.8034 | 0.7876 | 0.7868 | 0.7902
word n-grams + SA | LR | 0.8252 | 0.8124 | 0.8123 | 0.8146 *
tf-idf + SA | LR | 0.8238 | 0.7914 | 0.7890 | 0.7951
SA | LR | 0.5767 | 0.5733 | 0.5670 | 0.5707
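The evaluation behind these rows follows the usual fit/predict/score pattern. The sketch below shows it for the tf-idf representation with LR on toy data; the macro averaging of P, R, and F1 is an assumption about the reporting.

```python
# Toy evaluation pipeline: tf-idf features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

texts = ["мясо вредно", "мясо полезно", "растительная пища полезна",
         "без мяса нельзя", "отказ от мяса полезен", "мясо необходимо"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = pro-vegetarian, 0 = anti-vegetarian

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=1/3, stratify=labels, random_state=0)

vec = TfidfVectorizer()
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")
clf.fit(vec.fit_transform(x_tr), y_tr)
pred = clf.predict(vec.transform(x_te))

p, r, f1, _ = precision_recall_fscore_support(
    y_te, pred, average="macro", zero_division=0)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f} Acc={accuracy_score(y_te, pred):.4f}")
```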
Table 9. Evaluation results of traditional classifiers with semantic text representations (the best result is marked with an asterisk).

Representation | Classifier | P | R | F1 | Acc
ruBERT SE | LR | 0.7609 | 0.7607 | 0.7608 | 0.7610
SBERT SE | RF | 0.7961 | 0.7943 | 0.7945 | 0.7951
ruRoBERTa SE | LR | 0.8438 | 0.8438 | 0.8438 | 0.8439
ruBERT SE + topics | LR | 0.7413 | 0.7412 | 0.7412 | 0.7415
SBERT SE + topics | LR | 0.7852 | 0.7852 | 0.7852 | 0.7854
ruRoBERTa SE + topics | LR | 0.8389 | 0.8390 | 0.8390 | 0.8390
ruBERT SE + SA | RF | 0.7719 | 0.7698 | 0.7699 | 0.7707
SBERT SE + SA | LR | 0.7852 | 0.7852 | 0.7852 | 0.7854
ruRoBERTa SE + SA | LR | 0.8536 | 0.8536 | 0.8536 | 0.8537 *
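A sketch of how such semantic representations can be extracted is given below: mean-pooled hidden states from ruRoBERTa [23] feed a linear classifier. The mean pooling and the toy inputs are assumptions; the paper's exact extraction procedure may differ, and the model weights are downloaded from the Hugging Face Hub on first use.

```python
# Sentence embeddings via mean pooling, then logistic regression.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("ai-forever/ruRoberta-large")
enc = AutoModel.from_pretrained("ai-forever/ruRoberta-large")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean over real tokens

X = embed(["мясо вредно", "мясо полезно"])  # toy inputs
clf = LogisticRegression().fit(X, [1, 0])
```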
Table 10. Evaluation results of fine-tuned transformer models (the best result is marked with an asterisk).

Representation | Classifier | P | R | F1 | Acc
FT ruBERT | SE | 0.5317 | 0.4288 | 0.4747 | 0.5050
FT ruRoBERTa | SE | 0.4926 | 0.3988 | 0.4410 | 0.4690
FT SBERT | SE | 0.5707 | 0.5563 | 0.5634 | 0.5478 *
Table 11. Evaluation results of contrastive learning (the best result is marked with an asterisk).

Classifier | Base Model | P | R | F1 | Acc
CL | ruBERT | 0.7817 | 0.7791 | 0.7795 | 0.7805
CL | SBERT | 0.8193 | 0.8195 | 0.8194 | 0.8195
CL | ruRoBERTa | 0.8784 | 0.8787 | 0.8780 | 0.8780 *
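For intuition, a generic supervised contrastive loss is sketched below in PyTorch. This is a simplification: the CL rows above use the dual contrastive learning method [56,81], which additionally contrasts label-aware feature representations, so this sketch illustrates only the core idea.

```python
# Generic supervised contrastive loss: same-label embeddings are pulled
# together, different-label embeddings pushed apart.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    z = F.normalize(features, dim=1)                  # (B, H) unit vectors
    sim = z @ z.T / temperature                       # pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(1).clamp(min=1)          # avoid division by zero
    # Negative mean log-probability of same-label pairs, averaged over anchors.
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count).mean()

feats = torch.randn(8, 16)                      # stand-in embeddings
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
print(supervised_contrastive_loss(feats, labels))
```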
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
