Monitoring of Sustainable Development Trends: Text Mining in Regional Media

Chernyshova, Galina; Taran, Evgeniy; Firsova, Anna; Vavilina, Alla

doi:10.3390/su17073122


¹ Faculty of Computer Science and Information Technologies, Saratov State University, 83, Astrakhanskaya Str., 410600 Saratov, Russia

² Faculty of Economics, Peoples Friendship University of Russia (RUDN University), 6, Miklukho-Maklaya Str., 117198 Moscow, Russia

³ Faculty of Economics, Saratov State University, 83, Astrakhanskaya Str., 410600 Saratov, Russia

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(7), 3122; https://doi.org/10.3390/su17073122

Submission received: 5 March 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 1 April 2025

(This article belongs to the Section Development Goals towards Sustainability)


Abstract

The monitoring of regional development sustainability is closely linked to the development of an indicator system that best meets stakeholders’ requirements, providing a solid foundation for strategic decision-making. In pursuit of progress in achieving the Sustainable Development Goals (SDG), efforts are continuously being undertaken to refine and enhance the indicator framework. Implementing interdisciplinary approaches for a comprehensive assessment of sustainable development in regions allows for a swift expansion and augmentation of data on regional transformations. An important aspect of the study of sustainability at the regional level is the additional possibility of using unstructured news content through text mining methods. The issue of applying natural language processing techniques for Russian-language sources is significant, as a large number of relevant tools are developed for English. Additionally, the analysis of news content has several features that complicate the classification of sentiments of messages with mostly neutral wording. The proposed methodology for processing specific news content in assessing the sustainability of regional development was implemented. An application for data scraping was developed, data were collected taking into account the selected regions and periods, stop word dictionaries were configured, frequency analysis was implemented, and the sentiment analysis of the obtained slices was carried out. For the formed set of news documents related to sustainable development by keywords according to SDGs 1–17, for the regions of the Volga Federal District, a corpus of documents was obtained representing data for 2021, 2022, and 2023 for 14 regions. The analysis of key topics for different areas and periods was carried out using the cosine similarity measure. The developed approach to news analysis allows for increasing the efficiency of monitoring on various topics. 
This methodology has been tested for systemic and operational assessment in the dynamics of the sustainable development of regions. Text analysis methods within the framework of decision support at the regional level provide the opportunity to identify emerging trends.

Keywords:

mathematical modeling; regional sustainable development; text analysis; dynamics assessment; sentiment analysis

1. Introduction

Sustainable development is a crucial trend in today’s world, on which the well-being of future generations depends. Achieving the goals outlined in the SDG agenda is a complex process that requires careful planning and execution. The analysis of progress towards these goals, especially at the regional level, is directly related to the involvement of additional indicators of sustainability, which are based on ESG principles.

The transition to sustainable development requires the efforts of not only national governments but also the actions of many micro-level actors, such as firms, households, and individuals.

Quantitative assessments of regional development are traditionally based on the statistical data of various levels. This inevitably leads to a delay due to the complex procedures of collecting, processing, and publishing a wide range of indicators.

However, the lack or incompleteness of available data can be a significant barrier to evaluating the effectiveness of these goals. This can make it difficult to accurately measure progress and make organizational decisions. Issues related to data availability require an expansion of innovative approaches, such as the use of Big Data, to overcome these challenges [1].

Analysis of news content provides a more prompt understanding of socio-economic processes. The use of social networks as a source of information is associated with a number of problems, including the presence of fake news and information chaos [2]. Therefore, the emphasis in our study is on using reliable news sources, such as official regional pages and large online communities affiliated with media.

The set of potential indicators of regional sustainable development is very extensive for empirical analysis. However, it should be noted that it is not always possible to collect statistical information on all aspects of sustainability in a timely manner. The use of text mining methods provides additional opportunities to solve problems related to the systemic assessment of regional processes.

The motivation for this study is to expand the important areas of analysis of territorial and temporal features of sustainable development, taking into account additional sources of information. The aim of the research is to expand the tools for monitoring changes in the environmental, social, and governance (ESG) agenda of Russian regions. The novelty of this approach is determined by the development of a methodology for tracking systemic changes, assessing positive and negative processes in the development of regions from the point of view of sustainability.

The following main hypotheses were considered:

H1. There is a high degree of heterogeneity in the implementation of the Sustainable Development Goals of the regions;

H2. There are insignificant dynamics in the expansion of strategic directions for sustainable regional development.

The existing classification of the Sustainable Development Goals by relevant areas is aimed at solving various social, economic, and environmental problems [3]. The Sustainable Development Goals (SDGs) are designed to reflect the most important and urgent challenges facing the global community [4]. Moreover, these goals are interconnected, necessitating a comprehensive approach when assessing the outcomes of their implementation [5]. Our research proposes to systematically assess the sustainability of regional development taking into account all ESG dimensions.

The concept of sustainable development based on the formulated goals represents the joint influence of business, government, and civil society. Additional clarification is required regarding the regions’ characteristics and contributions to the Sustainable Development Goals. Several studies have been conducted on regional development practices, considering the necessary adjustments to indicators [6,7,8].

Science-based approaches to SDG implementation and to indicator-based assessment are an area in need of additional research [9,10,11].

To understand the range of factors contributing to sustainable development, it is essential to consider the diverse characteristics of different countries and regions [12].

Considering the rather abstract nature of the Sustainable Development Goals, it is important to formulate them correctly to search for relevant news reports [13,14].

Regression models are traditionally used to quantify the impact of time and macro-regional factors. Analysis of panel data and surveys of respondents made it possible to determine the impact of efforts related to the Sustainable Development Goals on the perception of residents of regions [15].

It is proposed to determine the corresponding weighting factors in order to form a comprehensive assessment through sustainable development indicators based on the Sustainable Development Goals in accordance with their priorities [16].

To evaluate the trends and progress of selected indicators of sustainable development, a method is employed that calculates the distance from the indicator’s current value to its target value of 2030 [17,18].

The use of factor analysis for the selection of indicators and hierarchical methods of cluster analysis partially avoids the disadvantages of combining a large number of indicators into a single index [19].

For the purpose of cause-and-effect analysis, we propose to consider various combinations of factors for achieving the Sustainable Development Goals. Methods based on fuzzy logic and qualitative comparative analysis (QCA), in particular fsQCA (fuzzy set qualitative comparative analysis), make it possible to identify indicators for specific Sustainable Development Goals that may be sufficient for different regions [20].

Data Envelopment Analysis (DEA) has been applied to assess the sustainable development of cities. The method of variable selection and the DEA models allowed us to identify five social, five economic, and three environmental key indicators for evaluation [21].

A relevant approach for evaluating Russian regions is the ESG (Environmental, Social, and Governance) City and Regional Index, which comprises 60 indicators across 16 comprehensive factors grouped into three categories related to sustainable development [22]. The final assessment is formed by the additive method, which uses official public statistical data and other ratings. Open Internet sources were not used in this research.

Approaches based on artificial intelligence techniques are employed to analyze the implementation processes of the Sustainable Development Goals [23,24].

When text analysis was applied to sustainability assessment, machine learning methods, particularly modified convolutional neural networks, were used to classify companies according to their sustainability level [25].

The use of news reports to form a corpus of documents for further text analysis is due to the presence of a cause-and-effect relationship between the relevance of social problems among the population and the thematic priorities of the media [26]. Some studies use news articles to assess individual sustainability indicators related to climate change [27,28].

Assessing news sentiment during crises can serve as an operational indicator of the economy’s state and can be used in combination with more traditional explanatory variables [29].

Improving the tools for monitoring sustainable development provides an opportunity to gain insight into the extent to which local and national territories are moving towards Sustainable Development Goals.

2. Materials and Methods

The proposed methodology includes a number of stages (Figure 1):

The creation of a news articles corpus based on the extraction of data from media resources using a developed application, considering the subject matter, selected region, and period;
Preprocessing and stemming of the obtained corpus of documents, the removal of stop words;
Construction of a word cloud and analysis of keyword frequencies in document sets for different regional topics and periods;
Comparative analysis of news topics for various regions and different periods using text mining techniques;
Sentiment analysis of the generated data slices for individual regions and periods.

Web scraping allows solving various analytical tasks, in particular, automation of the process of collecting data from websites in accordance with the user’s needs [30]. An application for parsing user requests for the social network VKontakte has been developed.

Currently, VKontakte is the most popular, publicly accessible, and widespread social network in the Russian Federation, which is confirmed by expert assessments of professional communities (https://mediascope.net/data/ (accessed on 4 March 2025)). The official pages of the regions are presented in this network, which ensures that the sample contains information from official sources. Regional mass media are also represented in VKontakte communities. The activity of regions in news communities may vary, but the selection of a certain number of news communities by these methods ensures the relevance of the sample. The results obtained by applying the developed methodology are therefore relevant to the regional news agenda.

Various tools are offered for implementing parsing, including on the Python, Java, Ruby, and JavaScript platforms. This application uses BeautifulSoup and Scrapy framework on the Python 3.9 platform, which has an extensive range of text mining libraries.

The need to create a representative sample of textual data led to the search for information sources that have special APIs and contain a sufficient amount of news articles. VKontakte, the most popular Russian-language social network, is used as such a source. The VK API is based on the HTTP protocol and utilizes the JSON data transfer format [31].

The news search was carried out for each region using a special VK API groups.search method, which can be used to specify a search query. The search query was specified in a format that included the name of the region and the keyword “news”.
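The search step can be sketched as follows. This is a minimal illustration, not the authors' application: it only builds the request URL for the VK API `groups.search` method and sends nothing. The helper name `build_group_search_url`, the token placeholder, the API version string, the `count` limit, and the exact wording of the query are assumptions made for the example.

```python
from urllib.parse import urlencode

VK_API_URL = "https://api.vk.com/method/groups.search"

def build_group_search_url(region, token, count=20, version="5.131"):
    """Build a groups.search request URL for one region (no request is sent)."""
    params = {
        # Region name plus the keyword "news"; the exact wording is assumed.
        "q": f"{region} news",
        # Assumed cap on the number of communities returned per region.
        "count": count,
        "access_token": token,
        # Assumed API version string.
        "v": version,
    }
    return f"{VK_API_URL}?{urlencode(params)}"

print(build_group_search_url("Saratov", "YOUR_TOKEN"))
```

The response is a JSON document listing communities, from which a limited number would then be selected per region as described in the text.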

This method provides a list of relevant news communities using the internal algorithms of the VKontakte network. A limited number of communities were selected for each region. The results of the expert assessment confirm that the final list of received network resources includes news from regional media and official sources that maintain their communities on VKontakte.

Then, for each region, the news in the resulting list of communities was selected by year and a list of keywords converted into a regular expression. The news data slices obtained in this way include a fairly wide range of news from regional authorities and other official sources.
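A minimal sketch of this selection step is given below, assuming that each post is a dictionary with a Unix timestamp in `date` and the message body in `text` (field names chosen to mirror VK post objects). The helper names and the sample posts are hypothetical. The pattern matches keyword prefixes so that inflected forms are also caught, which is a design assumption rather than the documented behavior of the application.

```python
import re
from datetime import datetime, timezone

def build_keyword_pattern(keywords):
    """Combine keywords into one case-insensitive regular expression."""
    # Longer phrases first, so multi-word keywords are tried before their parts.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Leading word boundary only: inflected endings after the keyword still match.
    return re.compile(rf"\b(?:{alternatives})", re.IGNORECASE)

def select_posts(posts, keywords, year):
    """Keep posts published in `year` whose text mentions at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    selected = []
    for post in posts:
        published = datetime.fromtimestamp(post["date"], tz=timezone.utc)
        if published.year == year and pattern.search(post["text"]):
            selected.append(post)
    return selected

# Hypothetical posts; 1640995200 is 2022-01-01 and 1609459200 is 2021-01-01 (UTC).
posts = [
    {"date": 1640995200, "text": "New renewable energy plant opens"},
    {"date": 1640995200, "text": "Local football team wins"},
    {"date": 1609459200, "text": "Public health programme extended"},
]
slice_2022 = select_posts(posts, ["renewable energy", "public health"], 2022)
print([p["text"] for p in slice_2022])
```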

To optimize the execution of a large number of queries, VK Script, a language for automating various tasks in VK, was used. The developed parser was designed to extract news reports. It provides a comprehensive set of features to perform data preparation, including stemming, specifying a list of keywords for searching, setting the period, specifying specific news sources to apply territorial restrictions to the generated sample, collecting data according to the specified parameters, and saving the collected documents in convenient file formats (e.g., .xlsx). In addition to extracting posts, it is also possible to generate a set of associated comments, significantly expanding the possibilities for analyzing textual information. The document corpus generated in this way provides additional solutions to various text mining tasks, including word cloud generation, sentiment analysis, clustering, and text categorization.

During the pre-processing of the collected news documents, tokenization was performed, i.e., the text was divided into sentences, and the sentences were divided into words. Stemming was implemented, which is the normalization of words, reducing a word to its base. Removing stop words that are common in the language and that do not carry serious substantive information is an important element of text pre-processing. Stop word dictionaries used to process texts in Russian require expansion or the creation of separate dictionaries that will be used in the analysis using tokenization.

A word cloud is a weighted dictionary representation, showing the frequency of words in a certain text, which provides convenient visualization of the document. A word cloud as a method of text compression can easily obtain the most significant information about the content of the text and highlight subtopics.

One of the main benefits of a word cloud is its ability to visually represent large volumes of data that are difficult to describe quickly and precisely using the original text. Additional information, such as the relative importance of each keyword, can be added to the word cloud.

In the received news texts, not only stop words from the standard dictionary were removed. The list of stop words was modified; for example, Russian proper names, place names, and similar items were additionally removed. During the study, our own list was formed, which made it possible to obtain a more relevant set of keywords. To do this, during computational experiments, the list of stop words was adjusted by adding the names of regions and territorial entities, special characters, and certain frequently used abbreviations. The modified list of stop words includes numbers, punctuation marks, emoji, names of months, names of regions and their capitals, and heads of regions. These items do not carry semantic meaning in this context.
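The effect of an extended stop list can be illustrated with a short sketch. The stop-word entries below are illustrative placeholders, not the list built in the study, and stemming is deliberately left out. Tokens are restricted to runs of Cyrillic letters, so numbers, punctuation, and emoji are discarded during tokenization itself.

```python
import re

# Hypothetical base stop list plus the kinds of additions described above
# (month names, region names); these entries are illustrative only.
BASE_STOP_WORDS = {"и", "в", "на", "по", "с"}
EXTRA_STOP_WORDS = {"января", "саратов", "саратовской", "области"}
STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS

# Only runs of Cyrillic letters count as tokens, so digits, punctuation,
# and emoji never reach the stop-word filter.
TOKEN_RE = re.compile(r"[а-яё]+")

def preprocess(text):
    """Lowercase, tokenize, and remove stop words (stemming is omitted here)."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("В Саратовской области 15 января открыли новую школу! 🎉"))
# → ['открыли', 'новую', 'школу']
```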

Various methods are used for vector representation of texts, including bag-of-words, tf–idf (tf—term frequency, idf—inverse document frequency), Word2Vec, GloVe, BERT [32]. The two most commonly used methods are frequency analysis and word weight analysis. Frequency analysis entails counting the number of occurrences of a word in a text and making inferences based on the frequency of the most common words.

The statistical approach tf–idf can be used as a preliminary step to assess document similarity [33].

To determine the relevance of keywords, we calculate the frequency of occurrence of a word within a selected data subset (slice). In this context, a slice refers to a specific fragment of information. For the purpose of evaluating news reports, we used slices that consisted of texts related to a selected region in a given time period. A higher occurrence frequency of a keyword within a data subset indicates a higher level of relevance. The top-ranking keywords are presented in a list in order of decreasing relevance.
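As a minimal sketch of this ranking, the function below counts words across all documents of one slice and returns the top entries with their relative frequencies. The function name, the tie-breaking rule, and the toy slice are assumptions made for the example. The default of 15 follows the number of keywords reported per slice in the Results.

```python
from collections import Counter

def top_keywords(slice_tokens, k=15):
    """Rank words in a slice by relative frequency (count / total tokens)."""
    counts = Counter(token for doc in slice_tokens for token in doc)
    total = sum(counts.values())
    # Descending frequency, then alphabetical order to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked[:k]]

# Hypothetical slice: two already preprocessed documents for one region and year.
slice_tokens = [
    ["education", "school", "education"],
    ["health", "education"],
]
print(top_keywords(slice_tokens, k=2))
# → [('education', 0.6), ('health', 0.2)]
```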

Let t be a word, d a document, and D a collection of documents. The term frequency $tf(t, d)$ is defined as follows:

$$tf(t, d) = \frac{n_{t}}{\sum_{k} n_{k}}, \tag{1}$$

where $n_{t}$ is the number of occurrences of word t in document d, and the denominator is the total number of words in document d. The inverse document frequency $idf(t, D)$ is given by

$$idf(t, D) = \ln\left(\frac{|D| + 1}{|\{d_{i} \in D \mid t \in d_{i}\}| + 1} + 1\right), \tag{2}$$

where |D| is the number of documents in the collection D, and $|\{d_{i} \in D \mid t \in d_{i}\}|$ is the number of documents in collection D in which word t occurs. Then,

$$tf\text{-}idf(t, d, D) = tf(t, d) \cdot idf(t, D). \tag{3}$$

For each word $t_{i}$ from document $d_{j}$, the value $tf\text{-}idf(t_{i}, d_{j}, D)$ is calculated. We obtain a matrix V consisting of elements $v_{ij} = tf\text{-}idf(t_{i}, d_{j}, D)$, $i = 1, \dots, n$, $j = 1, \dots, m$, where n is the number of unique words in all documents from the document collection D, and $m = |D|$.

To estimate the similarity of documents transformed into a vector representation, a cosine measure is determined. Each document is described by a vector, each component of which corresponds to a word. Let $R = (r_{i})$ and $P = (p_{i})$, $i = 1, \dots, n$, be the vectors corresponding to two documents. Then, the cosine similarity measure is calculated as follows:

$$\cos(R, P) = \frac{\sum_{i = 1}^{n} r_{i} \cdot p_{i}}{\sqrt{\sum_{i = 1}^{n} (r_{i})^{2}} \cdot \sqrt{\sum_{i = 1}^{n} (p_{i})^{2}}} \tag{4}$$
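Equations (1)–(4) can be traced in a few lines of plain Python. The sketch below is illustrative only: the function names and the toy documents are assumptions. The idf term keeps both +1 corrections and the +1 inside the logarithm exactly as written in Equation (2). Each returned vector corresponds to one column of the matrix V.

```python
import math
from collections import Counter

def tf(doc):
    """Eq. (1): relative frequency of each word in one tokenized document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def idf(word, docs):
    """Eq. (2), with the +1 terms placed exactly as in the text."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log((len(docs) + 1) / (containing + 1) + 1)

def tfidf_vectors(docs):
    """Eq. (3): one tf-idf vector per document over the shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc})
    idf_values = {word: idf(word, docs) for word in vocab}
    vectors = []
    for doc in docs:
        weights = tf(doc)
        vectors.append([weights.get(w, 0.0) * idf_values[w] for w in vocab])
    return vocab, vectors

def cosine(r, p):
    """Eq. (4): cosine similarity of two equally long vectors."""
    dot = sum(ri * pi for ri, pi in zip(r, p))
    norm = math.sqrt(sum(ri * ri for ri in r)) * math.sqrt(sum(pi * pi for pi in p))
    return dot / norm if norm else 0.0

# Hypothetical tokenized documents (for example, one per region or per year).
docs = [
    ["education", "school", "education"],
    ["education", "health"],
    ["ecology", "waste"],
]
vocab, vectors = tfidf_vectors(docs)
print(round(cosine(vectors[0], vectors[1]), 3))  # documents sharing "education"
print(cosine(vectors[0], vectors[2]))            # no shared words
```

Documents with no words in common have a similarity of 0, and identical documents have a similarity of 1, so values in between indicate partial overlap of weighted topics.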

Sentiment analysis is a crucial component in natural language processing (NLP), particularly in the fields of review analysis, social media monitoring, and many other areas. The selection of an appropriate model for this task is of paramount importance, as it significantly impacts the accuracy of tonality assessments.

Various approaches are used for sentiment analysis; namely, lexicon methods, machine learning methods, deep learning and neural network approaches, and combined methods.

Classical machine learning methods, such as logistic regression, SVM, and ensemble models also remain relevant due to their simplicity.

Among the deep learning methods, Transformer models, such as Bidirectional Encoder Representations from Transformers (BERT) and GPT (Generative Pre-trained Transformer), stand out, as they account for the contextual and semantic aspects of language [34]. The main problem in the case of news content in Russian is the lack of available high-quality samples for training such multiparameter models.

In recent years, transformers such as RuBERT, FinBERT, Multilingual BERT, and XLM-R have become popular.

The use of transformers in this case is limited by the need for a large amount of data for fine-tuning. Although RuBERT can be adapted to specific tasks using fine-tuning, this process requires a fairly large amount of labeled data. In this study, fine-tuning of the pre-trained RuBERT model was carried out for a specific task; namely, sentiment analysis. This process allows using the knowledge gained by the model during pre-training on a large corpus of texts and applying it to highly specialized data.

To compare the effectiveness of different models, it is advisable to use samples focused on news texts. In particular, FinBERT is a specialized BERT-based model developed for text analysis in the financial sector [35]. It was trained on a corpus of financial documents, including company reports, news, and analytical articles, which makes it especially effective for tasks related to financial analytics, including sentiment analysis of financial texts. This model has a limitation typical of other transformers; namely, high resource requirements. In addition, FinBERT has a fairly narrow specialization focused specifically on the financial sector. To analyze Russian-language financial texts, the model needs to be adapted, for example, by retraining on Russian-language data and using machine translation.

Multilingual BERT is a version of BERT with the ability to process texts for multilingual tasks [36]. It should be noted that for the Russian language, the accuracy may be low due to the peculiarities of training.

The XLM-R model is an improved version of the Cross-lingual Language Model (XLM), based on the RoBERTa architecture. The model demonstrates high efficiency for tasks related to languages with a small amount of available training data. This is due to its improved pre-training process. However, it requires additional training for tonality analysis using a larger sample size.

Dictionary methods are characterized by the interpretability of results and, unlike machine learning, do not require large volumes of labeled data for training. In addition, they are less susceptible to the problem of overfitting. An argument in favor of the dictionary approach was the lack of large Russian-language samples that allow for training classification models for news content.

The publicly available tonal dictionary LinisCrowd (https://linis-crowd.org/ (accessed on 4 March 2025)) is focused on solving the fundamental linguistic problem of the lack of a Russian-language dictionary of tonal vocabulary for user texts on socio-political topics. The LinisCrowd dictionary includes more than 26,000 lexemes [37].

To select the most accurate sentiment analysis method for specific news texts, the RuBERT, FinBERT, and XLM-R transformer models were considered, and the LinisCrowd and KartaSlovSent dictionaries were also used. In our study, 1000 news articles on topics related to sustainable development were added to adapt transformer models.

In this study, the open linguistic dataset KartaSlovSent was used to assess the sentiment. The sentiment dictionary of the Russian language KartaSlovSent contains words and expressions of the Russian language [38]. Dictionary lexemes are supplied with a sentiment label (“positive”, “negative”, “neutral”) and a scalar value of the strength of the emotional assessment from the continuous range [−1, 1], where +1 corresponds to inputs with the most positive sentiment strength score, −1 corresponds to inputs with the most negative sentiment, 0 corresponds to inputs with a neutral assessment (the same as no coloring). The total volume of the dictionary obtained as a result of expert sentiment labeling is 28,197 words and expressions of the Russian language. To analyze the sentiment of news messages, a scale was configured for sentiment classification taking into account the features of news texts with a neutral tonality of presentation.
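A dictionary-based classifier of this kind can be sketched as follows. The lexicon entries, the averaging rule, and the width of the neutral band are assumptions made for illustration. They are not taken from KartaSlovSent and do not reproduce the scale configured in the study.

```python
# Hypothetical lexicon entries on the [-1, 1] scale described above;
# they are NOT taken from KartaSlovSent.
LEXICON = {"growth": 0.6, "support": 0.5, "accident": -0.8, "crisis": -0.7}

# Assumed width of the neutral band: news wording is mostly neutral, so
# small average scores are not treated as positive or negative.
NEUTRAL_BAND = 0.1

def sentiment(tokens, lexicon=LEXICON, band=NEUTRAL_BAND):
    """Average the scores of lexicon words and map the mean to a label."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return "neutral", 0.0
    mean = sum(scores) / len(scores)
    if mean > band:
        return "positive", mean
    if mean < -band:
        return "negative", mean
    return "neutral", mean

print(sentiment(["regional", "support", "for", "growth"]))
print(sentiment(["road", "accident", "reported"]))
print(sentiment(["council", "meeting", "held"]))
```

Widening the neutral band makes the classifier more conservative, which is one way to account for the predominantly neutral wording of news messages.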

Computational experiments to assess the accuracy of the constructed models were carried out on a special test data sample. It contains 430 news messages on socio-economic topics. To ensure a relevant assessment of the sentiment detection model, the test sample was marked by experts. Generated positive and negative news texts were used to expand the test sample. Table 1 presents the Accuracy and F1-scores of models that use different approaches to analyzing the sentiment of news articles in Russian.

The F1-score of the sentiment assessment model based on the KartaSlovSent dictionary approach was 0.87. As a result, it was used to assess the tonality of news.
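For reference, the two reported metrics can be computed from expert labels and model predictions as in the sketch below. The label names and the sample data are hypothetical. Macro-averaging of the per-class F1 is an assumption, since the text does not state which averaging scheme was used.

```python
def accuracy(y_true, y_pred):
    """Share of messages whose predicted label matches the expert label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical expert labels and model predictions for six messages.
expert = ["pos", "neg", "neu", "neu", "pos", "neg"]
model = ["pos", "neg", "neu", "pos", "pos", "neu"]
print(round(accuracy(expert, model), 3))
print(round(macro_f1(expert, model), 3))
```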

3. Results

The proposed approach has been tested on a group of Russian regions. This study collected data on the Volga Federal District, which includes 14 territorial entities, including republics and regions. This district was chosen purposefully to test the methodology. These regions were selected as a full-fledged research object. Given the significant unevenness of regional development, it was necessary to consider regions that share a territorial and geographical community and are relatively equal in terms of development. These regions are linked by complex connections of the same level, but they differ sufficiently to illustrate the possibilities of the methodology. It is for such regions that benchmarking analysis of the current level and directions of sustainable development in neighboring regions is especially important.

Case studies of the following regions have been used: Republic of Tatarstan, Udmurt Republic, Chuvash Republic, Republic of Bashkortostan, Republic of Mari El, Republic of Mordovia, Nizhny Novgorod region, Orenburg region, Kirov region, Penza region, Perm territory, Samara region, Saratov region, Ulyanovsk region. A corpus of news reports was compiled for these 14 regions.

In the developed application for assessing the sustainable development of regions, a set of keywords was used in the process of collecting data from news resources of the VKontakte social network. A selection of network regional news resources was made. Official resources representing regional authorities and additional news sources with a sufficient number of subscribers were selected. During the parsing process, regional resources in the VKontakte network were selected as sources based on the VK API popularity rating. The corresponding news slices were obtained by this application for different periods. The application allows parsing news resources for thematic selection of texts taking into account the presence of keywords and phrases. The selection was carried out based on the following keywords in accordance with the Sustainable Development Goals and taking into account Russian specifics:

Poverty, income distribution, income of the population, welfare, social security;
Food security, food sovereignty, food culture, regenerative agriculture, organic food, food price;
Mental health, public health, mental well-being, disability, health education, infectious diseases, child mortality, family planning, neonatal mortality, infant mortality, child health, road accidents, reproductive health, epidemics, health insurance;
Education, environmental education, technical and vocational education, free education, accessible education, primary education, secondary education, higher education;
Gender equality;
Reclamation, water efficiency, groundwater depletion, desertification, green infrastructure;
Renewable energy, wind, solar, geothermal, hydroelectric, fuel-efficient technologies, emissions, greenhouse effect, biofuels;
Employment, economic growth, sustainable development, wages, economic empowerment, small and medium enterprises, youth employment;
Infrastructure, investment, internet, industrial diversification;
Trade, financial market, taxation, social security, government program;
Public transport, climate change adaptation, affordable housing, pedestrian zone, public spaces;
Natural resources, recycling, industrial ecology, reuse, decarbonization, food waste;
Climate, greenhouse gas, global warming, weather, environment;
Water protection, fish stocks;
Land use, ecological land restoration, forest conservation, deforestation, reforestation;
Social justice, legal system, and fight against terrorism.

Filtering using keywords allowed us to limit the topics of messages, which provides relevant news collections following the topic of sustainable development. Corresponding data slices were formed for regions of the Volga Federal District. The number of news items by region for 2021–2023 is presented in Table 2.

After preliminary processing of the obtained news slices, word clouds were generated to select the most relevant terms. In addition to the visual representation as a word cloud, frequency estimates were obtained for keywords in the corpus of news documents by region and corresponding periods.

As a result, it is possible to quantitatively evaluate the relevant keywords to provide more objective characteristics of the main topics in the news content.

The next stage of the study is presented as ranked lists of keywords for some slices. Figure 2, Figure 3, Figure 4 and Figure 5 show the obtained results for several regions (Tatarstan, Samara, Bashkortostan, Penza) as examples. Fifteen keywords were identified for these regions in the slices according to tf value.

It should be noted that the presented ranked lists illustrate the heterogeneity of key topics related to regional sustainable development. It can be concluded that there is no broad coverage in the areas formulated on the basis of Sustainable Development Goals. This is largely due to the limited resources of Russian regions due to the current economic situation. In the regions considered as examples, much attention is paid to pressing aspects of education at different levels.

According to the Russian ESG index ratings, Tatarstan is one of the leading regions. In addition, this region is traditionally considered a research and university region. In the resulting list of key topics, education is presented as the most relevant topic for this region. The keywords cover all three areas of the ESG assessment. In the recent period, the environmental agenda has not been so relevant for the region. Samara Oblast demonstrates increased attention to goals related to the development of education and entrepreneurship; environmental issues are not a priority. For Bashkortostan, social aspects of ensuring sustainable development (promoting social development) are presented for the most part. In the Penza region, health care and demography issues are actively considered during all the periods under study.

Similarity metrics were calculated to quantitatively assess changes in the discussed issues of sustainable development. Table 3 presents numerical estimates of the similarity of sustainability issues discussed in the news agenda for all 14 regions of the Volga Federal District based on the cosine measure.

The numerical estimates for 2023 based on the cosine measure do not exceed 0.6, which indicates significant differences in regional approaches to increasing sustainability (H1). To assess the change in the topic in news reports for individual periods, term-document matrices were obtained, and the cosine metric was applied. Using the developed methodology, it is possible to assess the sustainability of regional development in dynamics. For this purpose, similarity matrices were constructed based on news content for all periods under consideration. For example, Table 4 shows similarity metrics for Tatarstan for 2021, 2022, and 2023. Table 5 shows similarity metrics for the Samara region, Table 6 contains data for Bashkortostan, and data for the Penza region are available in Table 7.

The presented numerical estimates of the cosine measure show a relatively high similarity between the key topics discussed in news reports in 2021, 2022, and 2023 in Tatarstan, Samara region, and Penza region. Bashkortostan is characterized by a more significant revision of the target areas for increasing the region’s resilience.
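The contrast between Bashkortostan and the other three regions can be made explicit with a short Python sketch that averages the pairwise year-to-year values from Tables 4–7. The averaging itself is an assumption introduced here as a convenient summary; it is not a statistic reported in the study.

```python
# Pairwise year-to-year cosine similarities copied from Tables 4-7.
similarities = {
    "Tatarstan": {("2021", "2022"): 0.86, ("2021", "2023"): 0.82, ("2022", "2023"): 0.88},
    "Samara": {("2021", "2022"): 0.83, ("2021", "2023"): 0.81, ("2022", "2023"): 0.89},
    "Bashkortostan": {("2021", "2022"): 0.58, ("2021", "2023"): 0.59, ("2022", "2023"): 0.75},
    "Penza": {("2021", "2022"): 0.84, ("2021", "2023"): 0.80, ("2022", "2023"): 0.82},
}


def mean_similarity(pairs):
    """Average of the pairwise similarities (an illustrative summary, not from the paper)."""
    return sum(pairs.values()) / len(pairs)


# Regions ordered from the least to the most stable news agenda.
ranked = sorted(similarities, key=lambda region: mean_similarity(similarities[region]))
for region in ranked:
    print(region, round(mean_similarity(similarities[region]), 2))
```

Under this summary, Bashkortostan has the lowest average similarity, in line with the more significant revision of its agenda noted above.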

Tatarstan is included in the sustainable development rankings as one of the leaders in Russia during the periods under review. According to the values of the cosine metric, this region demonstrates a high degree of compliance with the ESG goals formulated earlier in a number of development programs.

Based on the obtained similarity estimates, it can be concluded that news reports devoted to regional sustainability changed only slightly over the selected periods (H2). This is largely due to the adoption of federal strategic programs for increasing resilience at the regional level.

The final stage of the study is the sentiment analysis of news reports related to the goals of sustainable regional development. News items in each slice are assessed by tonality. The relative numbers of positive and negative documents in the slices for 2021–2023 are presented below in Figure 6 and Figure 7.

On average across regions, the share of positive news on sustainable development is decreasing: in the Volga Federal District it was 32.47% in 2021, 30.24% in 2022, and 29.75% in 2023. However, in certain regions (Nizhny Novgorod Region, Penza Region, Bashkortostan, Mordovia, Udmurtia), the share of positive news tends to increase.
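The aggregation behind such percentages can be sketched in Python as follows. The function name and the sample labels are hypothetical; only the three district-level percentages are taken from the text, and the year-over-year differences are derived from them by simple subtraction.

```python
from collections import Counter


def tonality_shares(labels):
    """Return the percentage of each tonality label within one slice."""
    total = len(labels)
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in Counter(labels).items()}


# Hypothetical document labels for a single region-year slice.
labels = ["positive", "neutral", "neutral", "negative",
          "positive", "neutral", "neutral", "neutral"]
shares = tonality_shares(labels)
print(shares)

# Share of positive news in the Volga Federal District as reported in the text (percent).
positive_share = {2021: 32.47, 2022: 30.24, 2023: 29.75}
years = sorted(positive_share)
# Change in percentage points relative to the previous year.
changes = {
    year: round(positive_share[year] - positive_share[prev], 2)
    for prev, year in zip(years, years[1:])
}
print(changes)
```

The differences show that most of the decline in the district-level share occurred between 2021 and 2022.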

The proposed methodology can be applied in a similar way for different regions. The expansion of the list of regions will be associated with the parsing of relevant news sources, which is a rather complicated process. The methodology is a means of forming an up-to-date cross-section of news content based on regional ESG factors for further expert work, comparison with other regions, and identification of development directions.

4. Discussion and Conclusions

Various aspects of the sustainable development of regions have long remained in the spotlight. Comprehensively applying mathematical methods to provide operational support for decision-making at the regional level expands traditional approaches to sustainability management [39,40]. The contribution of our research is the application of text mining methods to the processing of news texts related to the sustainability of Russian regions.

Textual information presented in various reports was used to solve individual sustainability assessment tasks, particularly at the enterprise level [41]. In our research on sustainable development, an important aspect is the regional level with coverage of online news resources.

The existing methodology for rating Russian regions includes a large set of sustainable development indicators, drawn both from official statistics and from other consolidated ratings. Such a systematic assessment can only be carried out once the initial data for a period are fully reflected in official sources, so the methodology is retrospective in nature [22].

To increase the efficiency of the assessment, we used additional sources of information in the form of news reports. This also increases the objectivity of the analysis of sustainable development trends in Russian regions.

The differences in news topics across regional slices, revealed by the proposed quantitative assessment, are associated with various external and internal features. Among the many factors influencing changes in the news agenda on regional sustainable development goals, the presence of regional strategic programs plays an important role. Limited regional resources and high key rates in the current situation have led to a change in priority tasks for increasing sustainability. Factors that determine changes in the sustainable development agenda include economic, financial, and reputational incentives. It is also important to keep in mind the interest of regional governments and government structures in decision-making, which is associated with the emergence of sustainability criteria in the distribution of subsidies and the provision of budget loans.

Sustainability issues at the macro level are studied in sufficient detail, but the proposed methodology is aimed at the regional level. This enables key stakeholders to assess the dynamics of sustainable development in regions.

The text mining approach is innovative, providing the opportunity to assess and compare the current level of sustainability for different regions in each period of time based on an operational assessment of the news context. Comparative analysis of news content can be performed as a benchmarking stage of the strategic development of regions.

The proposed methodology will enable stakeholders at both the regional and federal levels to identify key trends in the sustainable development of regions. This will allow for the identification of common goals and strategies among different regions, as well as the comparison of innovative solutions and institutional support measures for enhancing sustainability. Taking into account a wide range of sustainable development objectives, this process will also contribute to the development of strategic planning and corrective actions to ensure successful outcomes. The tonality assessment, as part of this process, provides additional opportunities for comparative analysis among regions.

System analysis will make it possible to track the dynamics of regional sustainability more quickly and thereby identify the competitive advantages of regions. The developed approach can be applied not only at the regional level but also to individual enterprises. For enterprises, this is an additional opportunity to integrate with ESG goals at the regional level.

A limited list of keywords and phrases, based on commonly accepted formulations, was compiled to form a cross-section of news data related to the various SDGs. Future research involves focusing on a specific SDG for a differentiated assessment of regions, which will require expanding the list of keywords used to scrape suitable documents.

Comparison with official statistics on sustainable development is possible, provided that the time lag caused by the significant delay in publishing the corresponding indicators is taken into account. Further research will involve a comprehensive analysis of statistical indicators of sustainable development and the construction of predictive models that combine a multifactorial set of these indicators with additional information from news sources.

Author Contributions

Conceptualization, G.C. and A.F.; methodology, G.C.; software, G.C. and E.T.; validation, G.C., E.T. and A.V.; formal analysis, G.C., A.F. and A.V.; investigation, G.C., E.T. and A.V.; resources, G.C., E.T. and A.V.; data curation, G.C., A.F. and A.V.; writing—original draft preparation, G.C.; writing—review and editing, G.C., E.T., A.F. and A.V.; visualization, G.C. and E.T.; supervision, G.C.; project administration, G.C. and A.F.; funding acquisition, A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Peoples’ Friendship University of Russia named after P. Lumumba in the framework “Development of an intelligent system for supporting management decision-making to improve enterprise efficiency in the conditions of the data economy”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Nilashi, M.; Keng Boon, O.; Tan, G.; Lin, B.; Abumalloh, R. Critical Data Challenges in Measuring the Performance of Sustainable Development Goals: Solutions and the Role of Big-Data Analytics. Harv. Data Sci. Rev. 2023, 5, 1–36.
Tomassi, A.; Falegnami, A.; Romano, E. Mapping automatic social media information disorder. The role of bots and AI in spreading misleading information in society. PLoS ONE 2024, 19, e0303183.
Sustainable Development Goals. The United Nations Official Website. Available online: https://www.un.org/sustainabledevelopment/ru/sustainable-development-goals (accessed on 24 December 2024).
Confraria, H.; Ciarli, T.; Noyons, E. Countries’ research priorities in relation to the Sustainable Development Goals. Res. Policy 2024, 53, 104950.
Biggeri, M.; Clark, D.A.; Ferrannini, A.; Mauro, V. Tracking the SDGs in an ‘integrated’ manner: A proposal for a new index to capture synergies and trade-offs between and within goals. World Dev. 2019, 122, 628–647.
Moallemi, E.A.; Malekpour, S.; Hadjikakou, M.; Raven, R.; Szetey, K.; Ningrum, D.; Dhiaulhaq, A.; Bryan, B.A. Achieving the Sustainable Development Goals requires transdisciplinary innovation at the local scale. One Earth 2020, 3, 300–313.
Krantz, V.; Gustafsson, S. Localizing the sustainable development goals through an integrated approach in municipalities: Early experiences from a Swedish forerunner. J. Environ. Plan. Manag. 2021, 64, 2641–2660.
D’Adamo, I.; Gastaldi, M.; Imbriani, C.; Morone, P. Assessing regional performance for the Sustainable Development Goals in Italy. Sci. Rep. 2021, 11, 24117.
Allen, C.; Metternicht, G.; Wiedmann, T. Initial progress in implementing the Sustainable Development Goals (SDGs): A review of evidence from countries. Sustain. Sci. 2018, 13, 1453–1467.
Yeh, S.-C.; Hsieh, Y.-L.; Yu, H.-C.; Tseng, Y.-H. The Trends and Content of Research Related to the Sustainable Development Goals: A Systemic Review. Appl. Sci. 2022, 12, 6820.
D’Adamo, I.; Di Carlo, C.; Gastaldi, M.; Rossi, E.N.; Uricchio, A.F. Economic Performance, Environmental Protection and Social Progress: A Cluster Analysis Comparison towards Sustainable Development. Sustainability 2024, 16, 5049.
Shlyapina, M.V.; Tretyakova, E.A. A nexus between regional welfare and sustainable development: A conceptual model. J. New Econ. 2024, 25, 85–105.
Montiel, I.; Cuervo-Cazurra, A.; Park, J.; Antolín-López, R.; Husted, B.W. Implementing the United Nations’ sustainable development goals in international business. J. Int. Bus. Stud. 2021, 52, 999–1030.
Van Zanten, J.A.; Van Tulder, R. Multinational enterprises and the Sustainable Development Goals: An institutional approach to corporate engagement. J. Int. Bus. Policy 2018, 1, 208–233.
Foroudi, P.; Marvi, R.; Cuomo, M.T.; D’Amato, A. Sustainable Development Goals in a regional context: Conceptualising, measuring and managing residents’ perceptions. Reg. Stud. 2024, 59, 1–16.
Mindrinos, L.; Panagiotopoulos, P. Measuring Sustainable Development: A Weighting Approach to Sustainable Development Indicators. Int. J. Multidiscip. Res. Anal. 2023, 6, 4510–4520.
Cohen, G.; Shinwell, M. How to measure distance to SDG targets anywhere. In OECD Statistics Working Papers, 2020/03; OECD Publishing: Paris, France, 2020.
Garcia, C.; López-Jiménez, P.A.; Pérez-Sánchez, M.; Sanchis, R. Methodology for assessing progress in sustainable development goals indicators in urban water systems. How far are we from the 2030 targets? Sustain. Cities Soc. 2024, 112, 105616.
Pravitasari, A.E.; Rustiadi, E.; Mulya, S.P.; Fuadina, L.N. Developing Regional Sustainability Index as a New Approach for Evaluating Sustainability Performance in Indonesia. Environ. Ecol. Res. 2018, 6, 157–168.
Carvalho, L.; Almeida, D.; Loures, A.; Ferreira, P.; Rebola, F. Quality Education for All: A Fuzzy Set Analysis of Sustainable Development Goal Compliance. Sustainability 2024, 16, 5218.
Rebolledo-Leiva, R.; Vásquez-Ibarra, L.; Feijoo, G.; Moreira, M.T.; González-García, S. Determining key indicators for the assessment of sustainable development in Spanish cities under a multi-criteria approach. Clean. Prod. Lett. 2023, 5, 100046.
ESG-Index of Cities and Regions. Available online: https://xn----ctbjbleaab3chwacdqgef8f3d.xn--80afd3bal.xn--p1ai/ (accessed on 22 February 2025).
Vaio, A.; Palladino, R.; Hassan, R.; Escobar, O. Artificial intelligence and business models in the sustainable development goals perspective: A systematic literature review. J. Bus. Res. 2020, 121, 283–314.
Nasir, O.; Javed, R.T.; Gupta, S.; Vinuesa, R.; Qadir, J. Artificial intelligence and sustainable development goals nexus via four vantage points. Technol. Soc. 2023, 72, 102171.
Spinder, S.; Frasincar, F.; Matsiiako, V.; Boekestijn, D.; Brandt, T. A text mining approach to identifying sustainability in the private sector. Comput. Ind. 2023, 149, 103932.
Rogers, E.M.; Dearing, J.W.; Bregman, D. The Anatomy of Agenda-Setting Research. J. Commun. 2006, 43, 68–84.
Rivera, S.J.; Minsker, B.S.; Work, D.B.; Roth, D. A text mining framework for advancing sustainability indicators. Environ. Model. Softw. 2014, 62, 128–138.
Hwang, H.; An, S.; Lee, E.; Han, S.; Lee, C.-H. Cross-Societal Analysis of Climate Change Awareness and Its Relation to SDG 13: A Knowledge Synthesis from Text Mining. Sustainability 2021, 13, 5596.
Makeeva, N.M.; Stankevich, I.P.; Lyubaykin, N.S. Nowcasting the Russian economy macroeconomic indicators under uncertainty: Does taking into account the news sentiment help? Vopr. Ekon. 2024, 3, 120–142.
Ryan, M. Web Scraping with Python: Collecting More Data from the Modern Web; O’Reilly Media: Sebastopol, CA, USA, 2018; Volume 308.
VK API Documentation. Available online: https://dev.vk.com/method (accessed on 2 October 2024).
Asudani, D.S.; Nagwani, N.K.; Singh, P. Impact of word embedding models on text analytics in deep learning environment: A review. Artif. Intell. Rev. 2023, 56, 10345–10425.
Chen, L.C. An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus. Data Knowl. Eng. 2024, 153, 102322.
Zhao, Z.; Aletras, N. Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; Volume 1, pp. 3226–3244.
Yang, Y.; Uy, M.C.; Huang, A. FinBERT: A Pretrained Language Model for Financial Communications. arXiv 2020, arXiv:2006.08097.
Manias, G.; Mavrogiorgou, A.; Kiourtis, A. Multilingual text categorization and sentiment analysis: A comparative analysis of the utilization of multilingual approaches for classifying twitter data. Neural Comput. Appl. 2023, 35, 21415–21431.
Koltsova, O.; Alexeeva, S.; Kolcov, S. An opinion word lexicon and a training dataset for Russian sentiment analysis of social media. Comput. Linguist. Intellect. Technol. Mater. Dialogue 2016, 2016, 277–287.
Kulagin, D. Publicly available sentiment dictionary for the Russian language KartaSlovSent. In Proceedings of the Annual International Conference “Computational Linguistics and Intellectual Technologies”, Moscow, Russia, 31 May–3 June 2021; Volume 20, pp. 1106–1119.
Zeigermann, U.; Böcher, M. Challenges for bridging the gap between knowledge and governance in sustainability policy—The case of OECD ‘Focal Points’ for Policy Coherence for Development. For. Policy Econ. 2020, 114, 102005.
Chernyshova, G.; Veshneva, I.; Firsova, A.; Makarova, E.L.; Makarova, E.A. Methodology for Assessing the Risks of Regional Competitiveness Based on the Kolmogorov–Chapman Equations. Mathematics 2023, 11, 4206.
Yoon, J.; Han, S.; Lee, Y.; Hwang, H. Text Mining Analysis of ESG Management Reports in South Korea: Comparison With Sustainable Development Goals. Sage Open 2023, 13, 21582440231202896.

Figure 1. The main stages of using text mining to analyze news content to assess the sustainability of regional development.

Figure 2. Keywords of news reports corresponding to the data slices of Tatarstan for 2021–2023.

Figure 3. Keywords of news reports corresponding to the data slices of Samara for 2021–2023.

Figure 4. Keywords of news reports corresponding to the data slices of Bashkortostan for 2021–2023.

Figure 5. Keywords of news reports corresponding to the data slices of Penza for 2021–2023.

Figure 6. Evaluation of positive messages in news reports.

Figure 7. Evaluation of negative messages in news reports.

Table 1. Evaluating various models for sentiment analysis of news content.

Model	Accuracy	F1-Score
RuBERT	0.77	0.76
FinBERT	0.81	0.80
Multilingual BERT	0.52	0.64
XLM-R	0.82	0.83
LinisCrowd	0.68	0.69
KartaSlovSent	0.85	0.87

Table 2. Results of parsing regional news for 2021–2023.

Region	2021	2022	2023
Bashkortostan	1823	2270	2615
Orenburg	2187	2017	1680
Ulyanovsk	210	556	501
Saratov	1586	1113	1036
Perm	1698	1386	1692
Tatarstan	1539	1329	1259
Udmurtia	2151	1482	2240
Penza	1191	903	588
Kirov	1056	1130	1823
Chuvashia	1541	1850	1772
Mordovia	690	585	564
Samara	1398	1842	1900
Nizhny Novgorod	626	859	1158
Mari El	630	644	491

Table 3. Comparison of news text corpora of regions for 2023.

Region	Ulyanovsk	Bashkortostan	Kirov	Mari El	Mordovia	Nizhny Novgorod	Orenburg	Penza	Perm	Samara	Saratov	Tatarstan	Udmurtia	Chuvashia
Bashkortostan	1	0.315	0.418	0.319	0.330	0.352	0.274	0.440	0.450	0.404	0.355	0.411	0.280	0.287
Kirov	0.315	1	0.400	0.327	0.327	0.214	0.268	0.400	0.402	0.383	0.325	0.359	0.220	0.264
Mari El	0.418	0.400	1	0.373	0.394	0.286	0.334	0.456	0.548	0.483	0.408	0.377	0.296	0.339
Mordovia	0.319	0.327	0.373	1	0.326	0.273	0.278	0.387	0.403	0.375	0.343	0.376	0.280	0.294
Nizhny Novgorod	0.330	0.327	0.394	0.326	1	0.261	0.285	0.419	0.453	0.418	0.346	0.371	0.245	0.305
Orenburg	0.352	0.214	0.286	0.273	0.261	1	0.209	0.329	0.349	0.313	0.283	0.425	0.541	0.283
Penza	0.274	0.268	0.334	0.278	0.285	0.209	1	0.351	0.386	0.335	0.298	0.316	0.208	0.250
Perm	0.440	0.400	0.456	0.387	0.419	0.329	0.351	1	0.521	0.496	0.430	0.502	0.305	0.366
Samara	0.450	0.402	0.548	0.403	0.453	0.349	0.386	0.521	1	0.568	0.468	0.450	0.339	0.420
Saratov	0.404	0.383	0.483	0.375	0.418	0.313	0.335	0.496	0.568	1	0.443	0.458	0.305	0.391
Tatarstan	0.355	0.325	0.408	0.343	0.346	0.283	0.298	0.430	0.468	0.443	1	0.392	0.278	0.340
Udmurtia	0.411	0.359	0.377	0.376	0.371	0.425	0.316	0.502	0.450	0.458	0.392	1	0.372	0.385
Chuvashia	0.280	0.220	0.296	0.280	0.245	0.541	0.208	0.305	0.339	0.305	0.278	0.372	1	0.248
Ulyanovsk	0.287	0.264	0.339	0.294	0.305	0.283	0.250	0.366	0.420	0.391	0.340	0.385	0.248	1

Table 4. Cosine similarity matrix of news content topics for Tatarstan.

	2021	2022	2023
2021	1
2022	0.86	1
2023	0.82	0.88	1

Table 5. Matrix of cosine similarity of news content topics for the Samara region.

	2021	2022	2023
2021	1
2022	0.83	1
2023	0.81	0.89	1

Table 6. Cosine similarity matrix of news content topics for Bashkortostan.

	2021	2022	2023
2021	1
2022	0.58	1
2023	0.59	0.75	1

Table 7. Matrix of cosine similarity of news content topics for the Penza region.

	2021	2022	2023
2021	1
2022	0.84	1
2023	0.80	0.82	1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
