Article

Towards Media Monitoring: Detecting Known and Emerging Topics through Multilingual and Crosslingual Text Classification

by Jurgita Kapočiūtė-Dzikienė 1,2,* and Arūnas Ungulaitis 3
1 JSC Tilde IT, Jasinskio Str. 12, LT-01112 Vilnius, Lithuania
2 Department of Applied Informatics, Vytautas Magnus University, Universiteto Str. 10, Akademija, LT-53361 Kaunas, Lithuania
3 JSC Novian Pro, Gynėjų Str. 14, LT-01109 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4320; https://doi.org/10.3390/app14104320
Submission received: 6 May 2024 / Revised: 15 May 2024 / Accepted: 17 May 2024 / Published: 20 May 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract:
This study aims to address challenges in media monitoring by enhancing closed-set topic classification in multilingual contexts (where both training and testing occur in several languages) and crosslingual contexts (where training is in English and testing spans all languages). To achieve this goal, we utilized a dataset from the European Media Monitoring webpage, which includes approximately 15,000 article titles across 18 topics in 58 different languages, spanning a period of nine months from May 2022 to March 2023. Our research conducted comprehensive comparative analyses of nine approaches, encompassing a spectrum of embedding techniques (word, sentence, and contextual representations) and classifiers (trainable/fine-tunable, memory-based, and generative). Our findings reveal that the LaBSE+FFNN approach achieved the best performance, reaching macro-averaged F1-scores of 0.944 ± 0.015 and 0.946 ± 0.019 in the multilingual and crosslingual scenarios, respectively. LaBSE+FFNN’s similar performance in multilingual and crosslingual scenarios eliminates the need for machine translation into English. We also tackled the open-set topic classification problem by training a binary classifier capable of distinguishing between known and new topics with an average loss of ∼0.0017 ± 0.0002. Various feature types were investigated, reaffirming the robustness of LaBSE vectorization. The experiments demonstrate that, depending on the topic, new topics can be identified with accuracies above ∼0.796 and of ∼0.9 on average. Both closed-set and open-set topic classification modules, along with additional mechanisms for clustering new topics to organize and label them, are integrated into our media monitoring system, which is now used by a real client.

1. Introduction

In today’s interconnected world, where digital communication transcends borders, comprehending and analyzing text in multiple languages has become crucial. This is particularly relevant in the field of media monitoring, which has shifted significantly towards a global perspective. The expansion of digital media across diverse linguistic communities underscores the need for sophisticated tools that can effectively navigate and process multilingual content. Our research is positioned within this context, aiming to advance multilingual text classification. We are developing a robust and accurate classification technique for known and new topics, utilizing a rich dataset spanning various languages. For a comprehensive understanding of these issues, readers may refer to several sources: [1] presents a comprehensive overview of media monitoring problems and best practices in Europe for 14 languages from a social perspective; [2] describes various social media monitoring tools published by the NATO Strategic Communications Center of Excellence, and [3] discusses the challenges of highly inflected languages in multilingual media monitoring.
The design of our media monitoring system is specifically tailored to meet the unique needs of the client who will deploy it. The system is engineered to handle a wide variety of languages, offering multilingual support for the seamless processing and classification of media content across language barriers. It includes enhanced crosslingual functionality, enabling the topic classification model to (1) adapt to and learn from the available languages and (2) comprehend even those languages not included in the initial training dataset. The system can detect both known and new topics, ensuring their accurate labeling and the effective organization of emerging topics. To achieve the highest possible accuracy, the system’s performance undergoes meticulous testing and comparison across various approaches to identify the most effective ones. Additionally, the system operates autonomously to ensure the security of analyzed data and minimize operational costs by reducing reliance on external services.
Before starting the research, we identified several potential limitations that must be addressed. Despite the availability of approximately 40 media datasets (https://data.world/datasets/media (accessed on 14 May 2024)) and approximately 60 social media datasets (https://data.world/datasets/social-media (accessed on 14 May 2024)) online, these datasets have some limitations: they are predominantly focused on US content or social issues such as marketing, shopping, and dating. A comprehensive news media dataset covering a broad range of languages and topics is not publicly available. Therefore, we plan to create such a dataset within the framework of this research. We do not intend to train large language models from scratch but will instead leverage available models by fine-tuning them or using them as vectorizers. This approach relies on their quality and language support, and we will carefully select and evaluate crosslingual models for their accuracy in topic classification. Additionally, detecting new topics in open-set text classification is a significant obstacle in machine learning. We plan to explore various approaches and feature types to address this issue.
The next step involves conducting an analytical review of available tools and solutions, narrowing them down to a select few that can be experimentally tested to address our problem effectively.

2. Related Work

Media monitoring has been a crucial area of focus for many years, as it involves tracking and analyzing a wide range of media outlets to glean actionable insights [4]. Efforts to enhance the capabilities of media monitoring systems have been continuous, particularly in the realm of processing and understanding content in diverse languages such as Swahili [5], Hungarian [6], and many others. This multilingual approach is vital as it allows for more comprehensive coverage across different geopolitical and cultural landscapes, enabling businesses and organizations to maintain a global perspective. Today, the market offers a variety of media monitoring solutions, each designed to meet the specific needs of different user groups. For instance, social media monitoring tools have become indispensable in the journalism sector, significantly streamlining the workflow for journalists by providing quick access to trending topics and public opinions [7]. Similarly, businesses leverage these tools to gain deeper insights into customer behavior, market trends, and the effectiveness of marketing campaigns [8]. These insights are crucial for developing targeted strategies that resonate well with the intended audience. However, despite the wide array of tools available, a significant limitation remains. Many of these tools are predominantly tailored for larger languages, which are more commonly spoken and hence provide a broader audience base. This focus on major languages often leaves lesser-spoken languages under-represented, thereby limiting the scope of media monitoring to more general and broadly applicable purposes. Furthermore, the existing tools tend to prioritize general usability over specialized functions, which can dilute the effectiveness of media monitoring in niche or specialized contexts. The comprehensive list of these tools, while extensive (available at https://www.stateofdigitalpublishing.com/digital-platform-tools/best-media-monitoring-tools/ (accessed on 10 March 2024)), highlights the need for more nuanced and adaptable solutions that can cater to the unique demands of diverse linguistic and cultural groups.
Media monitoring is a multifaceted field involving several key analytical tasks critical for effectively interpreting vast amounts of media data. These tasks include topic classification, which categorizes content into predefined topics; Named Entity Recognition (NER), which identifies and classifies key elements like names, organizations, and locations within text; sentiment analysis, aimed at determining the emotional tone behind a body of text; topic modeling, which identifies the underlying themes within large volumes of text; and event detection, which focuses on identifying significant occurrences reported in the media. Each of these tasks addresses a unique aspect of media analysis, contributing to a comprehensive understanding of the media landscape. Despite the broad spectrum of tasks associated with media monitoring, our research specifically focuses on text classification. This focus is driven by the critical role text classification plays in structuring unstructured data, thus facilitating more effective data analysis and interpretation. The Papers with Code repository (https://paperswithcode.com/task/text-classification (accessed on 10 March 2024)) serves as a valuable resource in this area, listing approximately 1000 scientific articles that explore various aspects of text classification. These articles span 150 benchmarks and involve 132 different datasets, covering a wide range of related tasks including, but not limited to, topic classification, sentiment analysis, intent detection, and hate speech detection. A unifying factor across these diverse text classification tasks is their reliance on supervised machine learning techniques, which require large annotated datasets for model training. The advent of transformer-based models has marked a significant advancement in the field. Techniques such as XLNet [9] and RoBERTa [10] represent some of the most effective approaches currently available. These models have consistently performed at the top of various leaderboards, demonstrating their ability to handle complex, nuanced tasks in text classification. Their architecture, which allows for a deep understanding of context and the relationships between words in longer texts, makes them particularly well-suited for the intricate demands of media monitoring tasks. This high level of performance underscores the importance of sophisticated computational models in tackling the challenges of modern media analysis.
Topic classification stands out as one of the most advanced areas within text classification, particularly for languages that are well-supported with substantial resources and research. This maturity has facilitated the development of zero-shot learning models, which have significantly changed the landscape by eliminating the need for traditional supervised data collection or extensive model training. These models can understand and categorize content they have never seen during training, a revolutionary step for scaling applications across languages without annotated data. However, the evolution of zero-shot learning has not just been about broadening its application. Recent shifts in focus have aimed at refining these models to enhance their versatility and accuracy. Techniques such as unsupervised clustering for data compression before classification [11] help reduce the dimensionality of data, making the classification process more efficient without supervised labels. Additionally, fine-tuning models by predicting the first sentence in a paragraph has shown promise in adapting zero-shot models to new tasks and contexts more effectively [12]. These methods improve the model’s ability to generalize from limited information and apply its learning to broader scenarios, a crucial enhancement for practical applications. Despite these advancements, zero-shot learning models face significant challenges when applied to less-supported languages. For example, despite its strengths, the XLM-R-based zero-shot model struggles with languages with minimal digital resources, as demonstrated with 10 American low-resourced languages [13]. This limitation highlights a critical gap in the model’s ability to deal with linguistic diversity. Additionally, strategies like translating text from low-resourced languages into high-resourced ones before applying zero-shot models can be problematic. This approach, although sometimes effective, may not always capture the nuanced meanings and cultural contexts embedded in the original language, as shown in studies [14]. In our research, we also recognize potential limitations with the application of zero-shot models, particularly in the multilingual and open-set classification scenarios.
A pivotal aspect of contemporary text classification is the use of multilingual models, which are equipped to understand and process linguistic nuances across multiple languages. These models are designed to capture the similarities and semantic relationships of words in diverse languages, making them invaluable for global applications. For effective deployment in specific scenarios, these models often undergo a process of fine-tuning to tailor their capabilities to the particular nuances of downstream tasks. A notable example of this is in the multilingual epidemiological text classification field, which has been explored using a variety of machine learning approaches, including advanced deep learning techniques [15]. This research spanned six languages and demonstrated the superior performance of BERT models, which were fine-tuned for the specific requirements of epidemiological data analysis. These models significantly outperformed other techniques, showcasing their robust adaptability across diverse linguistic datasets. Further evidence of the effectiveness of multilingual models can be seen in research addressing sentiment classification and hate speech detection [16]. This study explored the application of these models across four languages for sentiment classification and two for hate speech detection, finding that multilingual models, especially those based on the XLM-R architecture, consistently outperformed their monolingual counterparts. Such findings underscore the potential of multilingual models to provide more uniform and effective solutions across different languages and content types. Additional research with the Sinhala language has further reinforced the advantages of multilingual models in handling a variety of linguistic tasks such as sentiment analysis, news category classification, writing style analysis, and news source identification [17]. In these cases, models like XLM-R, LaBSE, and LASER were compared against monolingual models specifically developed for the Sinhala language, such as SinBERTo and SinhalaBERTo. The multilingual models, particularly XLM-R, consistently delivered superior performance across all tasks, highlighting their effectiveness in leveraging crosslingual knowledge. The prowess of XLM-R was further demonstrated in a study involving product categorization and sentiment analysis across multiple datasets covering three and six languages, respectively [18]. In these experiments, XLM-R outperformed other models, including mBERT and DistilmBERT, as well as various zero-shot models. Its capabilities were even further enhanced when fine-tuned on a diverse set of multilingual Twitter data that included over 60 languages [19]. The resultant model, named XLM-T, was applied to sentiment classification tasks across eight languages and proved even more effective than the standard XLM-R, offering improved performance and demonstrating the model’s ability to adapt to an even broader array of linguistic contexts. These studies collectively illustrate the critical role of multilingual models in modern text classification, particularly their capacity to transcend language barriers and provide scalable, effective solutions for a wide range of text-based applications. The ongoing advancements in model training and fine-tuning are likely to enhance their applicability further, making them even more central to the future of multilingual text processing.
After delving into the capabilities and challenges of multilingual models, our research turns its attention to crosslingual models. These models are distinctive because they are generally trained in a dominant language, often English, and are then applied to different languages without retraining. Such an approach allows for leveraging high-quality, abundant resources available in English to benefit less-resourced languages. However, it is important to acknowledge that crosslingual models often experience a drop in accuracy compared to their monolingual counterparts, a phenomenon well documented in the literature [20]. This accuracy dip is partly due to the intrinsic variability and complexity of language translation and contextual usage across languages. Some crosslingual strategies rely heavily on translated knowledge to bridge language gaps. For instance, knowledge acquired through the Expectation-Maximization (EM) algorithm and refined via semi-supervised learning techniques has shown potential in enhancing performance across diverse linguistic pairs, such as English to Chinese and English to French, providing substantial improvements over methods dependent solely on direct machine translation [21]. Furthermore, a novel teacher–student approach provides ’weak’ supervision in the target language using a curated set of translated seed words, which facilitates learning by contextualizing these seed words within unlabeled texts. This method has demonstrated its efficacy across an impressive range of 18 languages, outperforming established models like mBERT [22]. Neural networks also play a pivotal role in advancing crosslingual solutions. One approach involves using a CNN-based classifier trained in a source language with labeled data, which is then adapted using unlabeled data from a parallel corpus in the target language. Soft labels derived from this corpus help train a classifier for the target language, proving effective in experimental setups involving multiple language pairs [23]. Additionally, pretrained multilingual encoders have been utilized to create universal sentence representations that can be shared across languages, facilitating the prediction of target classes in various linguistic settings [24]. Another sophisticated strategy combines language-invariant and language-specific features using adversarial networks and a mixture-of-experts model. This dynamic approach establishes a nuanced similarity between target and source languages, enhancing the overall classification performance across multiple languages more effectively than conventional models [25]. Moreover, innovative data augmentation techniques that incorporate multilingual code-switching have been developed to support the language-agnostic capabilities of models. These techniques have been applied successfully in a setup that includes nine languages, with LSTM and mBERT classifiers, where mBERT demonstrated superior crosslingual transfer capabilities [26]. In a similar approach, methods using graph convolutional networks to manage heterogeneous information within and across languages have shown significant promise. These models surpass traditional approaches like mBERT or XLM-R in various tasks spread over six languages [27]. The flexibility and robustness of XLM-R have also been validated in rule-based, dictionary-based, zero-shot, few-shot, and supervised learning environments across multiple datasets, solidifying its reputation as a formidable crosslingual tool [28]. 
Further expanding the capabilities of multilingual models, recent studies have enhanced mBERT and XLM-R with language-independent Wiki entities to tackle the challenges of crosslinguality more effectively. These enriched models have demonstrated impressive performance in topic classification tasks involving up to twelve languages [29]. Moreover, a cutting-edge teacher–student framework has been explored to improve both supervised and zero-shot performances of multilingual models. This framework strategically employs multiple source languages and target languages, showcasing a sophisticated method to optimize multilingual model training [30]. In conclusion, our exploration into crosslingual modeling highlights a range of innovative approaches that effectively address the inherent complexities of language processing across diverse linguistic environments. These advancements improve the practical applications of language models and pave the way for future innovations in handling multilingual and crosslingual data more efficiently.
In addition to closed-set classification, we also explore the open-set text classification problem. Previously presented zero-shot techniques expand the range of potential topics but require prior knowledge of the topics to search for. Additionally, their applicability is limited by language support, necessitating machine translation into languages that zero-shot models support, which complicates the pipeline. The paper [31] tests several approaches. One of them is a straightforward approach using word2vec to vectorize text documents and calculate cosine similarities between the mean of all document vectors and a test example. However, due to the similarities often being too close or even overlapping, it was concluded that cosine similarity at the document level is unsuitable for open-set classification. Another tested approach involved using CNNs to extract useful features from the data. Various CNN architectures were experimented with, modifying the softmax layer to include an unseen class by replacing it with an OpenMax layer. This layer uses a learned distance metric to account for the open-set risk, enhancing the model’s ability to handle unknown classes. The modified model employs an ensemble approach to decision-making using activations in the penultimate layer and is incremental in nature, meaning it does not require retraining to accommodate new unknown classes. In general, the OpenMax method [32] is a very popular approach that enhances traditional neural network classifiers for open-set recognition by adjusting the final softmax layer. This adjustment estimates the likelihood of an input being from an unknown class by modeling the tail of the activation distributions for each known class and recalibrating the softmax scores. This provides a measure of uncertainty that effectively helps to reject unknown inputs. One more possible approach to deal with the open-set problems is Binary Relevance, which treats each label as independent as in [33] with CNN, classifying each as relevant or irrelevant through a binary classifier trained specifically for that label. This approach uses multiple binary models, each trained with a sigmoid function threshold of 0.5, to decide label relevance. However, Binary Relevance is more appropriate for multi-label classification scenarios, not open-class classification problems.
Our comprehensive review of existing approaches highlights a significant gap in current technologies and methodologies. Despite the various systems and models available, none fully meet the complex requirements of our media monitoring system. Our analysis reveals that, while many systems provide solutions tailored to specific linguistic or topical scopes, none adequately address the simultaneous processing of a diverse range of languages and the emergence of unseen topics. Moreover, existing research rarely addresses three critical aspects: (1) comparing a broad spectrum of analytical techniques—classification, memory-based, and generative methods—within a single study; (2) exploring these techniques under both multilingual scenarios (training and testing on many languages) and crosslingual scenarios (training on one language, e.g., English in our case, and testing on any language in the dataset); (3) tackling both closed-set classification problems (where the training and testing are on a predefined number of classes) and open-set classification problems (where new classes can emerge).
In response to these deficiencies, our research contributes to the field of media monitoring in several key ways:
  • Comparative analysis across techniques. Our study rigorously compares various vectorization and topic classification methods, including trainable/fine-tunable, memory-based, and generative approaches.
  • Comparative analysis across multilingual and crosslingual scenarios.
  • Addressing closed-set vs. open-set topic classification problems.
  • Incorporating additional clustering mechanisms. Besides identifying the emergence of a new topic, we introduce an additional clustering mechanism to differentiate between multiple new topics and accurately label them.
  • Integration into our media monitoring systems. We integrate our trained models and clustering module into the real media monitoring system, enabling its practical use and further testing with real clients.
  • Open access dataset. To foster continued research and advancement in the field, we make our datasets publicly available for further investigation.
Based on the outlined contributions, we plan to address two major research questions in this paper:
  • How do various vectorization and topic classification methods, including trainable/fine-tunable, memory-based, and generative approaches, compare in terms of effectiveness and accuracy in multilingual and crosslingual scenarios?
  • Which solutions are effective in addressing open-set topic classification problems within the context of media monitoring, and to what extent are they effective in terms of accuracy?

3. Formal Definition of the Problem

Closed-set text classification problem. Let $X$ represent the input space, where $X = \{x_1, x_2, \ldots, x_n\}$, and let $C$ represent the set of predefined classes $C = \{c_1, c_2, \ldots, c_m\}$, where $m$ is the number of predefined classes. A classifier is a function $f: X \rightarrow C$ that maps an input $x_i \in X$ to one of the known classes $c_j \in C$. The goal in a closed-set classification problem is to efficiently and accurately train the classifier $\Gamma$ with the input–output pairs $\{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$, which learns to approximate $f$. This involves minimizing the error between the predicted and actual classes, using a loss function specifically designed to enhance accuracy, thus ensuring robust performance even under varying input conditions.
Open-set text classification problem. In addition to the standard closed-set classification task, this approach involves a binary classification problem to determine the presence of potentially new classes not covered by the predefined class set $C$. Here, all classes in $C$ are labeled as known, and any other classes are labeled as new, effectively transforming $C$ into $\{c_1, c_2\}$, where $c_1$ represents known and $c_2$ represents new classes. The input space $X$ not only encompasses texts $\{x_1, x_2, \ldots, x_n\}$ of known classes but also includes additional texts $\{x_{n+1}, x_{n+2}, \ldots, x_{n+N}\}$ potentially belonging to new classes. A classifier $f_{binary}: X \rightarrow \{c_1, c_2\}$ maps an input $x_i \in X$ to either $c_1$ or $c_2$. The objective is to train the classifier $\Gamma_{binary}$ with input–output pairs $\{(x_1, c_1), (x_2, c_1), \ldots, (x_n, c_1), (x_{n+1}, c_2), (x_{n+2}, c_2), \ldots, (x_{n+N}, c_2)\}$, aiming to approximate $f_{binary}$.
This dual framework facilitates a comprehensive system that handles the well-defined task of closed-set classification and adapts to emerging text categories in an open-set environment, thereby enhancing the system’s applicability and robustness in real-world scenarios.

4. Dataset

We found no specifically tailored or publicly available datasets that met our experimental needs at this project stage. To address this problem, we created our own using categorized texts from the European Media Monitor (EMM) website (https://emm.newsbrief.eu/ (accessed on 2 May 2022)).
The harvested texts span a period of nine months, from May 2022 to March 2023. We excluded articles assigned to multiple categories, yielding a dataset of over 2.2 million articles across 41 categories in 68 languages. For our experiments, we exclusively used article titles for several reasons. Crawling complete texts from websites was cost-prohibitive due to their diverse structures. Additionally, longer texts tended to introduce ambiguity, necessitating segmentation and significant reannotation efforts. We divided this dataset into three subsets based on equal time intervals, aligning with the timestamps of the articles: the first three months for training, the next for validation, and the remainder for testing. This division accounts for the evolving nature of media content; although the list of topics may remain stable, the content within these topics changes over time, which can impact the model’s accuracy.
Our initial experiments with the complete dataset did not meet the desired accuracy levels, achieving only approximately ∼0.341 ± 0.03 on the testing subset, averaged over five runs. These results were obtained after training and validating with the corresponding subsets using the Language Agnostic BERT Sentence Embedding (LaBSE) [34] model as the vectorizer and an optimized Feed Forward Neural Network (FFNN) as the classifier. To identify the causes of these low results, we manually reviewed a random sample of texts in English and Lithuanian to check the accuracy of the labels assigned. This analysis highlighted two main issues: (1) incorrect assignment of titles to their corresponding topics, and (2) titles that were not sufficiently informative or were misleading, considering the articles they represent.
Manually verifying and relabeling this dataset would require enormous human resources, specifically experts proficient in 68 languages. Given that such a task was infeasible within the project’s scope, we automatically cleaned the dataset. For this purpose, we employed the LaBSE sentence vectorizer, using cosine similarity to compare the sentence vectors of the title with the sentence vectors of all topics. Before vectorization, the names of topics were processed by separating concatenated words with white spaces (e.g., EnergyMarketandStrategies → Energy Market and Strategies). After comparison, we excluded titles from the dataset that did not meet the following criteria: (1) the title must achieve a similarity score greater than 0.5 with a specific topic pre-assigned by the EMM; (2) none of the other topics could have a cosine similarity score higher than 0.3. This process resulted in the creation of our dataset, named MM18x58 (for statistics, see Table 1), which comprises approximately 15,000 texts across 18 topics and 58 languages (listed in descending order of coverage in our dataset: ru, en, sq, de, el, bg, es, uk, hu, pl, tr, ro, fr, pt, it, bs, mk, sr, hr, lt, vi, sk, cs, ar, ja, et, lv, sv, fi, no, id, nl, az, da, sl, fa, km, ca, sw, be, ka, ky, he, ko, ku, zh, hi, th, gl, lb, lo, ml, mn, ha, bn, hy, pap, so). This dataset is now publicly available. (The MM18x58 dataset is available at https://github.com/novian-pro/EMM_18x58_dataset (accessed on 16 May 2024). In this reference, you can find useful statistics about the distribution of texts among different languages and splits.) Despite the evident imbalance highlighted in Table 1, we made a deliberate choice to maintain all topics in our MM18x58 dataset. We believe that preserving the full spectrum of topics, even those poorly covered, reflects a more realistic and nuanced representation of real-world scenarios.
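A minimal sketch of this filtering step is shown below. It assumes the sentence-transformers package for LaBSE vectorization and topic names that have already been split into space-separated phrases; keep_title and topic_names are illustrative names rather than parts of the original pipeline.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def keep_title(title, assigned_topic, all_topics):
    """Return True if the title satisfies both cleaning criteria."""
    title_vec = model.encode([title])
    topic_vecs = model.encode(all_topics)
    sims = util.cos_sim(title_vec, topic_vecs)[0].tolist()  # cosine similarity to every topic
    assigned_idx = all_topics.index(assigned_topic)
    other_sims = [s for i, s in enumerate(sims) if i != assigned_idx]
    # (1) similarity to the EMM-assigned topic must exceed 0.5
    # (2) similarity to every other topic must stay below 0.3
    return sims[assigned_idx] > 0.5 and max(other_sims) < 0.3

# Example: keep_title("Gas prices surge across Europe",
#                     "Energy Market and Strategies", topic_names)
```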
Alongside the MM18x58 dataset, we created its English version by machine-translating all non-English texts in the training and validation splits into English using the Googletrans Python library (https://pypi.org/project/googletrans/ (accessed on 2 May 2022)). Approximately 10% of texts in both the training and testing splits were already in English and required no translation. The testing split remained original (i.e., untranslated). We intentionally translated the training and validation splits into English, as this version will be utilized in crosslingual experiments, where training is recommended to be conducted consistently in a resource-rich language. This modified version of the dataset is named MM18x58_En. Both MM18x58 and MM18x58_En datasets can be used for the closed-set text classification problems (see Section 3).
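A minimal sketch of this translation step is shown below; it assumes the Googletrans library mentioned above and a call pattern typical of its recent releases, so the exact API may differ from the version used in the original experiments.
```python
from googletrans import Translator  # the Googletrans library mentioned above

translator = Translator()

def to_english(title: str) -> str:
    """Translate a title into English, leaving already-English titles untouched."""
    if translator.detect(title).lang == "en":  # roughly 10% of titles need no translation
        return title
    return translator.translate(title, dest="en").text
```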
For the binary classification problem, we created an additional dataset MM18x58_binary (refer to Table 2). This dataset includes texts where the cosine similarities between the text vectors and the vectors of all 18 topics (as listed in Table 1) were lower than 0, assigned to the new (positive) class. The known (negative) class comprises all instances from MM18x58 belonging to those 18 categories. The MM18x58_binary dataset can be used for the open-set text classification problems (see Section 3).

5. Approaches

During the experiments with the datasets described in Section 4, we tested a wide variety of approaches. These approaches were selected based on several criteria: (1) their ability to support all 58 languages of interest present in our dataset; (2) the goal to evaluate diverse types of methods, including trainable/fine-tunable, memory-based, and generative models, ensuring that at least one representative from each category would be tested:
  • BERT+CNN. Utilizing the BERT (Bidirectional Encoder Representations from Transformers) model [35] for vectorization, in conjunction with a CNN (Convolutional Neural Network) classifier [36], presents multiple benefits. BERT effectively encodes the semantics of individual words within their contextual environments. These encoded word representations are subsequently input into the CNN. The CNN’s role is to delineate features from the n-gram patterns of these embeddings, enabling the identification of keywords or phrases within the text. This capability is particularly advantageous for topic detection. Our CNN’s structure is similar to that shown in Figure 3 of [37]. For BERT, the bert-base-multilingual-cased model was chosen (https://huggingface.co/bert-base-multilingual-cased (accessed on 2 May 2022)), which supports 104 languages and is case-sensitive. The BERT+CNN training involved tuning approximately 735 thousand parameters, suggesting that this method could either solve our current problem effectively or establish a strong baseline.
  • XLM-R_fine_tuning. For our classification problem, we employed the XLM-R model as described in [38], making several targeted adjustments. All layers of the model were unfrozen, rendering approximately 278 million parameters trainable. Specifically, we utilized the xlm-roberta-base model (https://huggingface.co/xlm-roberta-base (accessed on 3 March 2023)), a multilingual variant that supports 100 languages.
  • LaBSE+FFNN. The LaBSE model [34], a vectorizer, was used in combination with an FFNN classifier. Unlike BERT embeddings, which provide word-level representations, LaBSE specializes in sentence-level embeddings, vectorizing entire texts into single aggregated vectors. This approach effectively handles varying sentence structures with identical semantic meanings, making it ideal for languages featuring flexible sentence structures. The LaBSE model (https://huggingface.co/sentence-transformers/LaBSE (accessed on 3 March 2023)) supports 109 languages and is both multilingual and crosslingual. The sentence vectors generated by LaBSE are input into the FFNN. The FFNN’s architecture and hyperparameters were optimized using Hyperas (https://github.com/maxpumperla/hyperas (accessed on 3 March 2023)) and Hyperopt, which are Python libraries for hyperparameter optimization. We employed the Tree-structured Parzen Estimator (TPE) optimization algorithm, conducting 200 optimization trials to determine optimal settings, including discrete parameters like neuron counts and activation functions, continuous parameters like dropout rates, and conditional parameters such as additional FFNN layers. The training process required adjustments to approximately 1.3 thousand parameters. (A minimal code sketch of this approach follows the list.)
  • LaBSE_fine_tuning. This process entailed two primary steps: (1) unfreezing all layers of LaBSE and optimizing all parameters, and (2) introducing an additional layer specifically tailored to our classification problem. We utilized the LaBSE2 model (https://tfhub.dev/google/LaBSE/2 (accessed on 3 March 2023)) for fine-tuning purposes. In contrast to the LaBSE+FFNN method, where all LaBSE layers remained frozen, the fine-tuning strategy here involved the adjustment of approximately 490 million parameters.
  • LaBSE_LangChain_k1. This method employs the LangChain framework (https://python.langchain.com/ (accessed on 3 March 2023)) designed to build context-aware applications utilizing large language models. It operates based on memory and similarity: training instances are vectorized (omitting validation instances), and a semantic search is performed to find the most similar training instance to the instance under test using cosine similarity. The classification of the test instance is determined by the class of the closest training instance identified in this search. Vectorization is conducted using the LaBSE model.
  • LaBSE_LangChain_k10_mv. This approach is an extension of the LaBSE_LangChain_k1 method. It is configured to retrieve the 10 most similar instances rather than just one. A majority voting mechanism is then employed, where the most frequently occurring class among these top 10 instances determines the class label for the testing instance.
  • ADA_LangChain_k1. This method is analogous to LaBSE_LangChain_k1, but utilizes OpenAI’s text-embedding-ada-002 model [39] for vectorization instead of the LaBSE model.
  • The ADA_LangChain_k10_mv approach uses the same vectorization model as in ADA_LangChain_k1 and follows the methodology of LaBSE_LangChain_k10_mv.
  • Davinci_fine_tuning. This generative-based approach employs the Davinci model as discussed in [40]. We configured the model to generate only the first token corresponding to the class label. In our experiments, the Davinci-002 version was fine-tuned using both the training and validation datasets, with all hyperparameters maintained at their default settings.
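As referenced in the LaBSE+FFNN item above, the sketch below illustrates that approach under stated assumptions: the sentence-transformers package provides the frozen LaBSE vectorizer, Keras provides the FFNN head, and the layer size, dropout rate, and activation are illustrative stand-ins for the Hyperas/Hyperopt-optimized architecture reported in Figure 3.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

labse = SentenceTransformer("sentence-transformers/LaBSE")  # frozen vectorizer

def build_classifier(num_classes: int = 18) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(768,)),          # LaBSE sentence embeddings are 768-dimensional
        keras.layers.Dense(64, activation="relu"),  # illustrative hidden layer, not the tuned one
        keras.layers.Dropout(0.3),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: vectorize titles once, then train only the small FFNN head.
# X_train = labse.encode(train_titles)             # shape: (n, 768)
# clf = build_classifier()
# clf.fit(X_train, np.asarray(train_labels), validation_data=(X_val, y_val), epochs=100)
```
Only the small classification head is trained here, while LaBSE itself stays frozen; this mirrors the later finding that fully fine-tuning LaBSE brought no statistically significant gain over this much cheaper setup.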

6. Experiments and Results

During the investigation, we conducted closed-set multilingual and crosslingual text classification experiments using the MM18x58 and MM18x58_En datasets, respectively. These experiments were designed to compare the approaches presented in Section 5.
For the training of classification models with BERT+CNN, LaBSE+FFNN, LaBSE_fine_tuning, and XLM-R_fine_tuning, we set a maximum of 100 epochs but used early stopping to monitor the validation loss. Early stopping was triggered when the loss showed minimal improvement (with a minimum delta of 0.01) and after a patience of three epochs without improvement. All other parameters remained at their defaults. We also used default parameters with Davinci_fine_tuning. We repeated each experiment five times, averaged the results, and calculated the confidence intervals.
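The early-stopping configuration described above could look as follows in Keras (assumed here as the training framework); min_delta and patience follow the values stated in the text, while restore_best_weights is an added convenience.
```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",   # monitor the validation loss
    min_delta=0.01,       # minimal change counted as an improvement
    patience=3,           # stop after three epochs without improvement
    restore_best_weights=True,
)

# clf.fit(X_train, y_train, validation_data=(X_val, y_val),
#         epochs=100, callbacks=[early_stopping])
```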
Because all memory-based approaches (LaBSE_LangChain_k1, LaBSE_LangChain_k10_mv, ADA_LangChain_k1, and ADA_LangChain_k10_mv) do not require training but rather the storage of training instances, the validation dataset was not utilized. Since there is no randomness with memory-based approaches, each experiment was performed only once; neither averaging of results nor confidence intervals were needed.
Our closed-set text classification experiments were conducted with a highly imbalanced dataset, with the largest class, NATO, representing 75%, 69%, and 72% of all instances in the training, validation, and testing splits, respectively. To address the potential issue of the accuracy metric favoring the majority class and potentially yielding misleadingly high scores, we have chosen the F1 score (see Equation (1)) as our primary evaluation metric. Furthermore, we adopted the macro-averaged evaluation method to ensure unbiased evaluation across all classes. The results with different approaches are summarized in Figure 1.
$$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \qquad (1)$$
where T P (True Positives) represents the number of instances correctly predicted as the class of interest, F P (False Positives) represents the number of instances incorrectly classified as the class of interest, and F N (False Negatives) represents the number of instances that are actually the class of interest but were predicted as any other class.
To determine whether the differences between the results achieved with different methods are statistically significant, we used the Student’s t-test [41] with a significance level of α = 0.05. In cases where we needed to compare a group of values in one experiment with a single value in another, we conducted the one-sample t-test [42] with the same α = 0.05.
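For illustration, the two significance tests could be run with SciPy as sketched below; all F1 values are placeholders rather than results from the paper.
```python
from scipy import stats

f1_method_a = [0.95, 0.93, 0.94, 0.96, 0.94]   # five runs of one trainable method
f1_method_b = [0.91, 0.90, 0.92, 0.89, 0.91]   # five runs of another trainable method

# Student's t-test between two groups of runs
_, p_two_sample = stats.ttest_ind(f1_method_a, f1_method_b)

# One-sample t-test against a single deterministic (memory-based) score
_, p_one_sample = stats.ttest_1samp(f1_method_a, popmean=0.921)

alpha = 0.05
print(p_two_sample < alpha, p_one_sample < alpha)  # True means the difference is significant
```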
The second set of experiments with the binary dataset (see Table 2) was dedicated to finding the best set of features that could effectively distinguish new topics from known ones, addressing the open-set text classification problem. We tested the optimized FFNN approach using the sigmoid function instead of softmax for these experiments. For details on the optimization of FFNN with Hyperas and Hyperopt, see the LaBSE+FFNN description in Section 5. Once the probabilities were obtained via the sigmoid function, binary cross-entropy was used as the loss function L presented in Equation (2). As the evaluation metric, this loss function is much more precise (especially in scenarios with a very imbalanced dataset, as in our case) because it measures how far off the predictions are from the actual values. The accuracy defined in Equation (3) was used as our second evaluation metric.
$$L = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right] \qquad (2)$$
where y is the actual label, which can be either 0 or 1; p is the predicted probability, the output from the sigmoid function, that the observation is of class 1.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
where T N (True Negatives) represents the number of instances that are not the class of interest and were correctly not predicted as the class of interest. For the other notation, see Equation (1).
To determine the most accurate FFNN-based binary classifier, various feature types were investigated (the results are summarized in Table 3; a sketch of how these features can be assembled follows the list):
  • LaBSE vectors.
  • Softmax values for each of the 18 known classes were obtained from the multi-class classification model (specifically, LaBSE+FFNN, which achieved the best results with the MM18x58 dataset) and were used as the feature vector.
  • Penultimate layer’s values are taken from the same model that provides the softmax values, but are extracted before the application of the softmax function.
  • Cosine similarities to the cluster centers of all 18 known classes were used as feature vectors. These cluster centers were calculated by averaging all LaBSE instance vectors belonging to each of these classes.
  • Concatenation (denoted as “+”) of various features.
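A sketch of how these feature types could be assembled is given below, as referenced before the list; multiclass_model is assumed to be the Keras LaBSE+FFNN model sketched in Section 5, class_centers the per-topic mean LaBSE vectors, and all names are illustrative.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tensorflow import keras

def build_feature_sets(labse_vecs, multiclass_model, class_centers):
    """Assemble the candidate feature types compared in Table 3."""
    # Softmax values for the 18 known classes (output of the closed-set model)
    softmax_feats = multiclass_model.predict(labse_vecs)
    # Penultimate-layer activations of the same model, taken before the softmax
    penultimate = keras.Model(inputs=multiclass_model.input,
                              outputs=multiclass_model.layers[-2].output)
    penult_feats = penultimate.predict(labse_vecs)
    # Cosine similarities to the mean LaBSE vector (cluster center) of each known class
    cosine_feats = cosine_similarity(labse_vecs, class_centers)
    return {
        "LaBSE": labse_vecs,
        "softmax": softmax_feats,
        "penultimate": penult_feats,
        "cosine": cosine_feats,
        # Concatenation ("+") of feature types, e.g., LaBSE + cosine similarities
        "LaBSE+cosine": np.concatenate([labse_vecs, cosine_feats], axis=1),
    }
```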
The best results, characterized by the lowest loss function values, were achieved using the LaBSE+FFNN approach (see Table 3); hence, this approach was utilized in subsequent experiments. During these experiments, we (1) removed class X from both the training and validation splits; (2) retrained the multi-class LaBSE+FFNN classification model using the MM18x58 dataset, excluding class X; (3) retrained the binary LaBSE+FFNN classification model using the MM18x58_binary dataset, also excluding class X; (4) evaluated the method’s ability to detect the new class by using all instances of class X across training, validation, and testing splits. The results obtained are summarized in Table 4.
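A compact sketch of this leave-one-topic-out protocol follows; the instances are assumed to be dictionaries with a "topic" key, and train_multiclass, train_binary, and evaluate_new_topic_detection are hypothetical wrappers around the training and evaluation code described above, passed in as callables so that the loop itself stays self-contained.
```python
def leave_one_topic_out(topics, train_set, val_set, test_set,
                        train_multiclass, train_binary, evaluate_new_topic_detection):
    """Evaluate how well the binary model detects each topic when it is held out as 'new'."""
    results = {}
    for topic_x in topics:
        # (1) drop topic X from the training and validation splits
        train_wo_x = [ex for ex in train_set if ex["topic"] != topic_x]
        val_wo_x = [ex for ex in val_set if ex["topic"] != topic_x]
        # (2) retrain the multi-class LaBSE+FFNN model without topic X
        multiclass_model = train_multiclass(train_wo_x, val_wo_x)
        # (3) retrain the binary known-vs-new model without topic X
        binary_model = train_binary(train_wo_x, val_wo_x)
        # (4) all instances of topic X (train+val+test) now play the role of a new topic
        x_instances = [ex for ex in train_set + val_set + test_set if ex["topic"] == topic_x]
        results[topic_x] = evaluate_new_topic_detection(binary_model, x_instances)
    return results
```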

7. System’s Architecture

The previously trained multi-class classification model, denoted as $\Gamma$, and the binary text classification model, $\Gamma_{binary}$, are integrated into the final system. The final system is depicted in Figure 2, which is composed of multiple blocks detailed in the sections below (for notation details, see Section 3):
  • Vectorization. The test text, denoted as $x_t$, is vectorized using the LaBSE vectorization model $\Gamma_{LaBSE}$, resulting in the vector output $\mathbf{x}_t$. Specifically, $\mathbf{x}_t = \Gamma_{LaBSE}(x_t)$.
  • Binary classification. The vector $\mathbf{x}_t$ is then passed into the binary classification model $\Gamma_{binary}$, which returns the sigmoid function value from the interval $[0, 1]$. If the output of $\Gamma_{binary}(\mathbf{x}_t) \leq 0.5$, the text $x_t$ is labeled with the known (i.e., stable) class (denoted in Table 1); otherwise, it is labeled as the new class.
  • Multi-class classification (conditional). If $\Gamma_{binary}(\mathbf{x}_t) \leq 0.5$, then $\mathbf{x}_t$ is passed to the multi-class classification model $\Gamma$. This model returns pairs of class labels and their softmax probabilities: $\Gamma(\mathbf{x}_t) = \{(c_j, P(c_j)) \mid P(c_j) > th \text{ or } c_j = c_{\max}\}$, where $c_{\max}$ is the class with the highest probability, irrespective of the threshold. The pairs $(c_j, P(c_j))$ are ordered such that $P(c_j) > P(c_{j+1})$ for all $j$. In our system, the threshold value $th$ is arbitrarily set at 0.3, allowing a maximum of three classes to be determined.
  • Clustering (conditional). If $\Gamma_{binary}(\mathbf{x}_t) > 0.5$, the input text $x_t$ and its vector $\mathbf{x}_t$ are passed into the clustering mechanism (a code sketch of steps (b) and (d) follows this list):
    (a) Data storage. The incoming instance $x_t$ is added to the $D_{Cl}$ dataset: $D_{Cl} = D_{Cl} \cup \{x_t\}$, and its vector $\mathbf{x}_t$ to the vector store $\mathbf{D}_{Cl}$: $\mathbf{D}_{Cl} = \mathbf{D}_{Cl} \cup \{\mathbf{x}_t\}$. The text $x_t$ is machine translated into English and stored in $D^{En}_{Cl}$: $D^{En}_{Cl} = D^{En}_{Cl} \cup \{x^{En}_t\}$. The machine translation is performed using the Googletrans Python library.
    (b) Check against existing clusters. The algorithm checks whether $\mathbf{x}_t$ can be assigned to one of $n$ clusters $\{Cl_1, Cl_2, \ldots, Cl_n\}$, each with names represented as keyword collections $\{Cl_1^{name}, Cl_2^{name}, \ldots, Cl_n^{name}\}$. Each cluster $Cl_i$ has a centroid $\mu_i$, calculated as the mean vector of all instances belonging to it. The instance $\mathbf{x}_t$ is assigned to the cluster $Cl_k$ for which the Euclidean distance is minimal, yet below the threshold $thr$: $k = \arg\min_i \{\lVert \mathbf{x}_t - \mu_i \rVert \mid \lVert \mathbf{x}_t - \mu_i \rVert \leq thr\}$. Here, $thr = 1.037$ serves as the maximum radius, corresponding to the largest Euclidean distance between any instance in the training dataset and the centroid of its known class, as listed in Table 1. This threshold ensures that the dimensions of newly formed clusters approximate those of the known classes. If $\mathbf{x}_t$ is successfully assigned to cluster $Cl_k$, then the cluster’s name, $Cl_k^{name}$, is returned.
    (c) Reclustering. If $\mathbf{x}_t$ cannot be assigned to any of the current $n$ clusters, this number is incremented by 1, resulting in $n+1$. This new number is used as the argument for the number of clusters in the K-means clustering algorithm [43]. The repeats parameter is set to 25 to minimize the impact of initial centroid placements; all other parameters are set to their defaults. The clustering is performed using the nltk.cluster Python library. During clustering, all instances from $D_{Cl}$, including their vectors from $\mathbf{D}_{Cl}$ and their English translations from $D^{En}_{Cl}$, are organized into new, non-overlapping clusters $\{Cl_1, Cl_2, \ldots, Cl_{n+1}\}$.
    (d) Naming of clusters. For each cluster $Cl_i$, all English texts belonging to that cluster are concatenated into a single text document $T_i$: $T_i = \bigoplus_{x^{En}_t \in Cl_i} x^{En}_t$, where $\bigoplus$ denotes concatenation. These texts are then passed into the KeyBERT model $\Gamma_{KeyBERT}$. This model returns either one keyword with a cosine similarity score above 0.5 or the three keywords with the highest similarity scores. Specifically, the output is defined as follows:
    $$\Gamma_{KeyBERT}(T_i) = Cl_i^{name} = \begin{cases} \{k\} & \text{if } \max(\cos\_sim(k, T_i)) > 0.5, \\ \{k_1, k_2, k_3\} & \text{otherwise (top 3 with the largest } \cos\_sim). \end{cases}$$
    In our implementation, KeyBERT utilizes BERT embeddings with the keyphrase n-gram range of one or two words; all other parameters are set to their defaults. After reclustering and renaming, all stored instances, including $x_t$, are relabeled with the new cluster names $\{Cl_1^{name}, Cl_2^{name}, \ldots, Cl_{n+1}^{name}\}$. The cluster name $Cl_k^{name}$, to which $x_t$ is assigned, is then returned.
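As referenced in the clustering item above, the sketch below illustrates the cluster-assignment check (step (b)) and the KeyBERT-based naming rule (step (d)); the threshold 1.037 and the one-versus-three keyword rule follow the text, while the data structures and the default KeyBERT backbone are simplifying assumptions.
```python
import numpy as np
from keybert import KeyBERT

THR = 1.037                # maximum radius observed among the known classes
kw_model = KeyBERT()       # default backbone; the paper's setup uses BERT embeddings

def assign_to_cluster(vec, centroids):
    """Return the index of the nearest centroid within the radius, or None."""
    dists = [np.linalg.norm(vec - mu) for mu in centroids]
    k = int(np.argmin(dists))
    return k if dists[k] <= THR else None   # None triggers reclustering with n+1 clusters

def name_cluster(english_texts):
    """Name a cluster: one keyword if its similarity exceeds 0.5, otherwise the top three."""
    document = " ".join(english_texts)
    keywords = kw_model.extract_keywords(document,
                                         keyphrase_ngram_range=(1, 2), top_n=3)
    if keywords and keywords[0][1] > 0.5:
        return [keywords[0][0]]
    return [kw for kw, _ in keywords]
```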
The logical flow of the system is as follows. The incoming text instance is first vectorized (step no. 1) using the LaBSE vectorization model. Then, the binary classifier for known and new topics is applied to the just vectorized text (step no. 2). If the binary classifier detects a known topic, the vectorized text is passed to the multi-class classifier module (in step no. 3) to determine class label(s) from 18 topics, and the determined labels are returned to the user. If the binary classifier (in step no. 2) detects the new class, the vectorized text (together with the input text) is passed to the clustering module (in step no. 4) and memorized (in step no. 4(a)). If the incoming text belongs to one of the known existing clusters (in step no. 4(b)), then this cluster’s label is returned. If the text cannot be attached to any of the existing clusters, the stored vectorized instances are reclustered (step no. 4(c)) and relabeled (step no. 4(d)) again. The label of the cluster to which the incoming instance belongs is returned to the user.

8. Discussion

The first part of this section, focusing on the multi-class closed-set classification experiments and their outcomes, will enable us to address the first research question (outlined in the concluding part of Section 2). We will achieve this by breaking it down into more detailed sub-questions:
  • How effective is the best solution to this text classification problem?
  • Which scenario—multilingual or crosslingual—is more suitable for solving our problem? This will determine whether machine translation into English is necessary or whether methods can be applied directly to the original texts in various languages.
  • Which categories are predicted least accurately, and what causes these shortcomings?
  • Which type (i.e., trainable/fine-tunable, memory-based, or generative) and which of the tested approaches are most recommended for solving our problem, and which are not?
The comparative multi-class closed-set topic classification experiments determined the superiority of the LaBSE+FFNN (i.e., optimized FFNN applied on top of LaBSE vectorization) over the other tested memory-based approaches (i.e., LaBSE_LangChain_k1, LaBSE_LangChain_k10_mv, ADA_LangChain_k1, and ADA_LangChain_k10_mv), trainable/fine-tunable approaches (BERT+CNN, XLM-R_fine_tuning, and LaBSE_fine_tuning), and LLM-based generative approaches (i.e., Davinci_fine_tuning). Under multilingual and crosslingual scenarios, the LaBSE+FFNN approach achieves the best macro-averaged F1 score values, which are ∼0.944 ± 0.015 and ∼0.946 ± 0.019, respectively. The FFNN part of this approach was optimized using Hyperas and Hyperopt, and the best architecture is presented in Figure 3. Our multilingual and crosslingual experiments produced nearly identical results (the differences are insignificant with the p-value ≫ 0.05) with LaBSE+FFNN, affirming the robustness of this multilingual and crosslingual LaBSE language model. This leads us to conclude that machine translation of texts into English is unnecessary, as it does not significantly impact the results.
Zooming into the results for different topics (see Figure 4) reveals that the worst performance is observed in the ClimateAction, EnergyMarketsandStrategies, and ForgeryMoney topics. This outcome is unsurprising given our training dataset’s limited coverage of these topics. Manual error analysis revealed that ClimateAction is frequently confused with a similar ClimateChange topic. Additionally, the accuracy of the best-covered NATO topic reaches approximately 95%, indicating that augmenting training data for less-covered topics would likely enhance prediction accuracy.
The second-best performing method was LaBSE_fine_tuning, which required the adjustment of approximately 490 million parameters, while LaBSE+FFNN required only around 1.3 thousand. The minimal difference in F1 score values with MM18x58 and MM18x58_En datasets further underscores LaBSE’s strength in both multilingual and crosslingual scenarios. The calculated p-values between the results of LaBSE_fine_tuning and LaBSE+FFNN exceeded 0.1, indicating that the observed differences lack statistical significance. This allows us to conclude that the utilization of LaBSE as a sentence vectorizer eliminates the need to unfreeze and adjust all its parameters. It appears that LaBSE already possesses sufficient knowledge about languages, including their vocabularies, sentence structures, and domain-specific information, thereby leaving us with the sole task of adapting it to our specific topic classification problem.
The XLM-R_fine_tuning method, a popular choice, ranks third in our experiments. Compared to LaBSE+FFNN, significant differences emerge with p-values of 0.02 and 0.004 in multilingual and crosslingual experiments, respectively. In contrast, when compared to LaBSE_fine_tuning, the differences are not statistically significant with the multilingual dataset (p = 0.09) and remain insignificant with the crosslingual dataset (p = 0.03). However, ranking third does not necessarily imply that XLM-R_fine_tuning is not effective; it is simply less suitable for our specific problem.
The fourth-ranked approach, LaBSE_LangChain_k1, shares similarities with other methods using LaBSE as a sentence vectorizer, with negligible differences between multilingual and crosslingual results. We performed a one-sample t-test to compare the performance of LaBSE_LangChain_k1 against LaBSE+FFNN, yielding p-values of 0.007 and 0.01 for MM18x58 and MM18x58_En, respectively. Thus, LaBSE_LangChain_k1 significantly underperforms compared to the top-ranked method in both multilingual and crosslingual scenarios. Moreover, all other memory-based approaches yielded even worse results than LaBSE_LangChain_k1, underscoring that relying solely on storing training instances’ vectors may not be the most efficient strategy. Given the broad range of topics, it is presumed that the distribution of training instances in the semantic space is also broad. Hence, classifiers capable of extracting relevant information from text data through their layers are more advisable. Further analysis of memory-based methods reveals that selecting the closest instance tends to be more effective than majority voting among the 10 closest instances. Experiments confirm LaBSE’s superiority over ADA. Notably, ADA_LangChain_k10_mv was the only method that performed better in crosslingual experiments compared to multilingual ones. However, this does not necessarily prove that ADA is more suitable for crosslingual experiments, as its overall result is worse than that of other methods.
The performance gap between multilingual and crosslingual results using Davinci_fine_tuning is substantial, with multilingual performance being notably better. The significant discrepancy between Davinci_fine_tuning and the top-performing method, LaBSE+FFNN, suggests that Davinci_fine_tuning is not well-suited for addressing our problem. Additionally, this approach is not suitable for integration into our media monitoring system, as it violates the requirement for autonomous operation since the trained model must be stored on OpenAI’s servers and incurs a fee for its usage.
The least effective method for our problem is BERT+CNN. Unlike LaBSE, which is designed specifically to capture crosslingual nuances, BERT’s performance in multilingual contexts is notably and statistically significantly worse. Our application of BERT as a word vectorizer, which produces sequence vectors by concatenating word embeddings, fails to adequately adapt to the varied word orders in sentences across languages, diminishing its ability to learn patterns accurately. Consequently, we strongly recommend against using this approach for problems of the nature encountered in our case.
While a direct comparison of our results with other research works is challenging due to differing experimental conditions, some of our findings are consistent with those of previous researchers. These findings highlight the superiority of sophisticated transformer-based approaches. For instance, multilingual (and crosslingual) models such as XLM-R or LaBSE have demonstrated superior performance compared to multilingual BERT, as discussed in [18]; the superiority of XLM-R is also affirmed in [19]. Although LaBSE is not usually the best option in other studies, it excels in our context primarily because of the limited number of languages tested. LaBSE, serving as the crosslingual model fine-tuned on bilingual English–other language sentences (as shown in Figure 7 in [34]), supports 109 languages, more than are included in our dataset. Moreover, LaBSE benefits from the sheer volume of these bilingual sentences and from sentences in related languages within groups such as Germanic, Romance, Slavic, etc. There is always a risk that languages not covered by the list of 109 supported languages may not be well recognized; however, their similarity to supported languages can still provide some benefit.
The second part of Section 8, focusing on the open-set text classification experiments and their outcomes, will enable us to address the second research question (described at the end of Section 2). We will accomplish this by dividing it into more detailed sub-questions:
  • How accurately can the binary classifier distinguish between known and new topics?
  • What is the optimal feature type for the binary classifier, and why are other types less effective?
  • How does the performance of the open-class classifier vary across specific topics?
The open-set text classification problem was addressed by training a binary classification model to distinguish known topics from new ones. The results, summarized in Table 3, reveal that the LaBSE vectorizer is the best option, as in closed-set classification. However, we cannot rely on the previously trained closed-set classification model, as neither its softmax probabilities nor the outputs of its penultimate layer are adequate to detect the emergence of new topics. Moreover, the cosine similarities, which assess the similarity between the cluster centers of the 18 known classes (each averaged from all of its training instance vectors) and the vectors of new texts, also fail to provide a reliable measure. Consequently, we examined the similarity values between the cluster centers of these known classes (see Table 5), hoping that the instances of different classes would occupy distinct, non-overlapping clusters. Contrary to expectations, these values range from 0.836 down to 0.099, indicating the absence of a clear threshold that could effectively delineate the boundaries of the known classes due to their overlap, thereby complicating the search for new ones in the entire vectorization space. These findings also clarify why memory-based approaches are unsuitable for our closed-set topic classification problem.
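The cluster-center similarities reported in Table 5 can, in principle, be reproduced as follows; this is a minimal sketch assuming the per-topic LaBSE training vectors are already available as NumPy matrices.

```python
import numpy as np
from itertools import combinations

def class_centroids(vectors_by_topic: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    # Cluster center of a class = mean of all its LaBSE training-instance vectors.
    return {topic: vecs.mean(axis=0) for topic, vecs in vectors_by_topic.items()}

def centroid_similarities(centroids: dict[str, np.ndarray]) -> list[tuple[str, str, float]]:
    # Pairwise cosine similarity between class centers, sorted from highest to lowest.
    pairs = []
    for a, b in combinations(centroids, 2):
        va, vb = centroids[a], centroids[b]
        sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        pairs.append((a, b, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# vectors_by_topic would map each of the 18 topics to an (n_instances, 768) LaBSE matrix.
```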
The results presented in Table 4 demonstrate the accuracy of the open-class classifier in detecting new topics. The accuracy for all classes is above 90%, except for the NATO category, which reached 79.6%. A manual analysis of texts belonging to this topic revealed several issues: the training dataset for the NATO category includes various military and sometimes political actions, which can be easily mistaken for other topics such as Migration and TerroristAttack, especially where military involvement is aimed at stopping or controlling these situations.
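The per-topic accuracies in Table 4 follow a hold-one-topic-out scheme: a topic is removed from the known set, and the binary classifier should flag its instances as "new". Below is a minimal sketch of that evaluation step, assuming a trained binary model over LaBSE vectors and an illustrative 0.5 decision threshold; the variable names are placeholders.

```python
import numpy as np

def new_topic_accuracy(binary_clf, removed_class_vectors: np.ndarray,
                       threshold: float = 0.5) -> float:
    # Fraction of the held-out topic's LaBSE vectors that the binary classifier flags as "new"
    # (the positive class); this corresponds to the per-topic accuracy reported in Table 4.
    probs = binary_clf.predict(removed_class_vectors).ravel()
    return float((probs >= threshold).mean())

# Usage sketch: train the binary FFNN with the remaining topics as "known" plus the harvested
# "new" texts, then evaluate on the removed topic's vectors:
# acc = new_topic_accuracy(binary_ffnn, labse_vectors_of_removed_topic)
```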
As in the multi-class closed-set topic classification experiments, the crosslingual LaBSE model has again demonstrated its strength in the binary open-set setting, this time as the vectorizer, outperforming the other feature types. This success is attributed to its crosslingual mechanisms, which enable the model to understand texts in different languages and discern the similarity of their content. In contrast, other researchers in similar studies have adopted quite different methodologies. The cosine similarity-based approach applied to word2vec, as in [31], does not demonstrate promising results, mainly due to its less sophisticated and non-crosslingual vectorization model. Some authors, such as in [32], utilized the OpenMax approach; however, this method is more suitable when new topics appear infrequently, so that such softmax refinements do not need to be performed constantly. The Binary Relevance approach, as detailed in [33], is typically adjusted for multi-label classification problems. Training a separate binary model for each topic is an alternative solution, which would allow a new topic to be detected whenever all binary models return a negative class result. However, this solution also introduces the additional risk of identifying multiple potential dominant classes, which we aim to avoid. Not only do the methods differ, but so do the numbers of classes and languages. Thus, there are significant reasons why our results and those of previous studies cannot be directly compared.
The outcomes of the closed-set and open-set classification experiments were instrumental in training the models that were subsequently integrated into the overall media monitoring system. The pipeline for detecting known and new topics was further enhanced with additional modules, such as a clustering module, as described in Section 7. The system has been deployed with our real client for media monitoring purposes and is currently being evaluated in a real-world scenario.

9. Conclusions

This research investigates both closed-set and open-set topic classification problems.
For the closed-set topic classification problem, we utilized texts collected from the European Media Monitoring webpage, covering 58 languages and 18 topics. Our comparative experiments explored different embeddings, including word embeddings (bert-base-multilingual-cased), semantic sentence embeddings (LaBSE, text-embedding-ada-002), and contextual embeddings (XLM-R), along with various classification (BERT+CNN, LaBSE+FFNN, LaBSE_fine_tuning, XLM-R_fine_tuning), memory-based (LaBSE_LangChain_k1, LaBSE_LangChain_k10_mv, ADA_LangChain_k1, ADA_LangChain_k10_mv), and generative (Davinci_fine_tuning) approaches under both multilingual and crosslingual scenarios (training on English and testing on any of the 58 languages). Our experimental findings demonstrated the superiority of trainable/fine-tunable approaches over memory-based and generative methods. LaBSE+FFNN emerged as the most accurate method, achieving macro-average F1 score values of 0.944 ± 0.015 in multilingual experiments and 0.946 ± 0.019 in crosslingual experiments. Furthermore, the experimental investigation revealed similar performance in both multilingual and crosslingual scenarios, thereby affirming the robustness of LaBSE in crosslingual mechanisms and eliminating the need for machine translation of texts into English. Additionally, a deeper analysis of the results indicated a correlation between topics that are less covered in the training dataset and their poorer prediction outcomes.
The open-set classification experiments were conducted on the multilingual dataset of known classes, supplemented with additional texts harvested from the European Media Monitoring page. These experiments confirmed the superiority of LaBSE vectorization over other feature types that rely on the output of the closed-set model or the distribution of known classes in the semantic vector space. The average loss of distinguishing known topics from new ones is very low (∼0.0017 ± 0.0002), and the accuracy of detecting separate emerging specific topics is generally high, with most topics achieving over 0.9. The exception, a lower accuracy value of 0.796 for one broader topic, underscores the challenges of handling topics that have diverse and sometimes overlapping content with others.
The open-set and closed-set classification models have been integrated into our media monitoring system, which has been further enhanced with additional clustering modules used to define the boundaries and names of newly detected classes. This system is currently deployed and undergoing testing in real-world scenarios. In the future, we plan to expand the number of topics that are investigated and consider other suggestions from our clients.

Author Contributions

Conceptualization, J.K.-D.; methodology, J.K.-D.; software, J.K.-D.; validation, J.K.-D.; formal analysis, J.K.-D.; investigation, J.K.-D.; resources, A.U.; data curation, A.U.; writing—original draft preparation, J.K.-D. and A.U.; writing—review and editing, J.K.-D.; visualization, A.U.; supervision, J.K.-D.; project administration, A.U.; funding acquisition, A.U. All authors have read and agreed to the published version of the manuscript.

Funding

The project “Development of the National Information Impact Identification and Analysis Ecosystem (NAAS)” (No. 01.2.1-LVPA-V-835-03-000) was financed from the funds of the European Regional Fund under the priority 1 “Promotion of scientific research, experimental development, and innovation” measure “Pre-purchase LT”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset that was created and used for this research is publicly available at https://github.com/novian-pro/EMM_18x58_dataset, accessed on 16 May 2024.

Acknowledgments

We gratefully acknowledge the contributions of Evaldas Bružė from Mykolas Romeris University for recommending valuable data sources and Jolanta Čekanauskaitė from JSC Novian Pro for her assistance in data collection. Their expertise and support were instrumental in the successful completion of this research.

Conflicts of Interest

Author Jurgita Kapočiūtė-Dzikienė was employed by the company JSC Tilde IT. Author Arūnas Ungulaitis was employed by the company JSC Novian Pro. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Harro-Loit, H.; Eberwein, T. News Media Monitoring Capabilities in 14 European Countries: Problems and Best Practices. Media Commun. 2024, 12. [Google Scholar] [CrossRef]
  2. Grizāne, A.; Isupova, M.; Vorteil, V. Social Media Monitoring Tools: An In-Depth Look; NATO Strategic Communications Centre of Excellence: Riga, Latvia, 2022. [Google Scholar]
  3. Steinberger, R.; Ehrmann, M.; Pajzs, J.; Ebrahim, M.; Steinberger, J.; Turchi, M. Multilingual Media Monitoring and Text Analysis—Challenges for Highly Inflected Languages. In Proceedings of the Text, Speech, and Dialogue, Pilsen, Czech Republic, 1–5 September 2013; Habernal, I., Matoušek, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 22–33. [Google Scholar]
  4. Steinberger, R. Multilingual and Cross-Lingual News Analysis in the Europe Media Monitor (EMM); Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–4. [Google Scholar] [CrossRef]
  5. Steinberger, R.; Ombuya, S.; Kabadjov, M.; Pouliquen, B.; Della Rocca, L.; Belyaeva, E.; De Paola, M.; Ignat, C.; Van Der Goot, E. Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili. Lang. Resour. Eval. 2011, 45, 311–330. [Google Scholar] [CrossRef]
  6. Pajzs, J.; Steinberger, R.; Ehrmann, M.; Ebrahim, M.; Della Rocca, L.; Bucci, S.; Simon, E.; Váradi, T. Media monitoring and information extraction for the highly inflected agglutinative language Hungarian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 2049–2056. [Google Scholar]
  7. Thurman, N.; Hensmann, T. Social Media Monitoring Apps in News Work: A Mixed-Methods Study of Professional Practices and Journalists’ and Citizens’ Opinions. Available online: https://ssrn.com/abstract=4393018 (accessed on 5 February 2024).
  8. Perakakis, E.; Mastorakis, G.; Kopanakis, I. Social Media Monitoring: An Innovative Intelligent Approach. Designs 2019, 3, 24. [Google Scholar] [CrossRef]
  9. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
  10. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  11. Alcoforado, A.; Ferraz, T.P.; Gerber, R.; Bustos, E.; Oliveira, A.S.; Veloso, B.M.; Siqueira, F.L.; Costa, A.H.R. ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling. In Proceedings of the Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., Pinto, H., Eds.; Springer: Cham, Switzerland, 2022; pp. 125–136. [Google Scholar]
  12. Liu, C.; Zhang, W.; Chen, G.; Wu, X.; Luu, A.T.; Chang, C.H.; Bing, L. Zero-Shot Text Classification via Self-Supervised Tuning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 1743–1761. [Google Scholar] [CrossRef]
  13. Ebrahimi, A.; Mager, M.; Oncevay, A.; Chaudhary, V.; Chiruzzo, L.; Fan, A.; Ortega, J.; Ramos, R.; Rios, A.; Meza Ruiz, I.V.; et al. AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 6279–6299. [Google Scholar] [CrossRef]
  14. Song, Y.; Upadhyay, S.; Peng, H.; Mayhew, S.; Roth, D. Toward any-language zero-shot topic classification of textual documents. Artif. Intell. 2019, 274, 133–150. [Google Scholar] [CrossRef]
  15. Mutuvi, S.; Boros, E.; Doucet, A.; Jatowt, A.; Lejeune, G.; Odeo, M. Multilingual Epidemiological Text Classification: A Comparative Study. In Proceedings of the 28th International Conference on Computational Linguistics, Virtual, 8–13 December 2020; pp. 6172–6183. [Google Scholar] [CrossRef]
  16. Wang, C.; Banko, M. Practical Transformer-based Multilingual Text Classification. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Virtual, 6–11 June 2021. [Google Scholar]
  17. Dhananjaya, V.; Demotte, P.; Ranathunga, S.; Jayasena, S. BERTifying Sinhala—A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 7377–7385. [Google Scholar]
  18. Manias, G.; Mavrogiorgou, A.; Kiourtis, A.; Symvoulidis, C.; Kyriazis, D. Text categorization and sentiment analysis: A comparative analysis of the utilization of multilingual approaches for classifying twitter data. Neural Comput. Appl. 2023, 35, 21415–21431. [Google Scholar] [CrossRef] [PubMed]
  19. Barbieri, F.; Espinosa Anke, L.; Camacho-Collados, J. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 258–266. [Google Scholar]
  20. Kapočiūtė-Dzikienė, J.; Salimbajevs, A.; Skadiņš, R. Monolingual and Cross-Lingual Intent Detection without Training Data in Target Languages. Electronics 2021, 10, 1412. [Google Scholar] [CrossRef]
  21. Shi, L.; Mihalcea, R.; Tian, M. Cross Language Text Classification by Model Translation and Semi-Supervised Learning. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 1057–1067. [Google Scholar]
  22. Karamanolakis, G.; Hsu, D.; Gravano, L. Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3604–3622. [Google Scholar] [CrossRef]
  23. Xu, R.; Yang, Y. Cross-lingual Distillation for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1: Long Papers, pp. 1415–1425. [Google Scholar] [CrossRef]
  24. Dong, X.; de Melo, G. A Robust Self-Learning Framework for Cross-Lingual Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6306–6310. [Google Scholar] [CrossRef]
  25. Chen, X.; Awadallah, A.H.; Hassan, H.; Wang, W.; Cardie, C. Multi-Source Cross-Lingual Model Transfer: Learning What to Share. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3098–3112. [Google Scholar] [CrossRef]
  26. Xu, W.; Haider, B.; Mansour, S. End-to-End Slot Alignment and Recognition for Cross-Lingual NLU. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5052–5063. [Google Scholar] [CrossRef]
  27. Wang, Z.; Liu, X.; Yang, P.; Liu, S.; Wang, Z. Cross-lingual Text Classification with Heterogeneous Graph Neural Network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 2: Short Papers, pp. 612–620. [Google Scholar] [CrossRef]
  28. Barnes, J. Sentiment and Emotion Classification in Low-resource Settings. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Toronto, ON, Canada, 14 July 2023; pp. 290–304. [Google Scholar] [CrossRef]
  29. Nishikawa, S.; Yamada, I.; Tsuruoka, Y.; Echizen, I. A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 1–12. [Google Scholar] [CrossRef]
  30. Yang, Z.; Cui, Y.; Chen, Z.; Wang, S. Cross-Lingual Text Classification with Multilingual Distillation and Zero-Shot-Aware Training. arXiv 2022, arXiv:2202.13654. [Google Scholar]
  31. Prakhya, S.; Venkataram, V.; Kalita, J. Open Set Text Classification Using CNNs. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), Kolkata, India, 18–21 December 2017; pp. 466–475. [Google Scholar]
  32. Bendale, A.; Boult, T.E. Towards Open Set Deep Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1563–1572. [Google Scholar]
  33. Yang, Z.; Emmert-Streib, F. Optimal performance of Binary Relevance CNN in targeted multi-label text classification. Knowl.-Based Syst. 2024, 284, 111286. [Google Scholar] [CrossRef]
  34. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 878–891. [Google Scholar] [CrossRef]
  35. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  36. Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882. [Google Scholar]
  37. Kapočiūtė-Dzikienė, J.; Balodis, K.; Skadiņš, R. Intent Detection Problem Solving via Automatic DNN Hyperparameter Optimization. Appl. Sci. 2020, 10, 7426. [Google Scholar] [CrossRef]
  38. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  39. Greene, R.; Sanders, T.; Weng, L.; Neelakantan, A. New and Improved Embedding Model. Available online: https://openai.com/blog/new-and-improved-embedding-model (accessed on 15 December 2022).
  40. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  41. Gosset, W.S. The Probable Error of a Mean. Biometrika 1908, 6, 1–25. [Google Scholar]
  42. Ross, A.; Willson, V.L. One-Sample T-Test. In Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures; SensePublishers: Rotterdam, The Netherlands, 2017; pp. 9–12. [Google Scholar] [CrossRef]
  43. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
Figure 1. Macro-average F1 score values (with confidence intervals) achieved on the MM18x58 and MM18x58_En datasets in multilingual and crosslingual experiments, respectively.
Figure 2. Integrated pipeline for topic classification (known classes) and clustering (new classes).
Figure 3. The optimal architecture of the LaBSE+FFNN approach that was determined using Hyperas and Hyperopt.
Figure 4. Macro-average F1 score values (with confidence intervals) achieved on the MM18x58 dataset with the LaBSE+FFNN approach.
Table 1. The distribution of instances over different topics (categories) and training/validation/testing subsets in the MM18x58 dataset.
Topic | Training | Validation | Testing
1. AlternativeEnergy | 10 | 19 | 14
2. ClimateAction | 10 | 14 | 23
3. ClimateChange | 475 | 484 | 358
4. CoronavirusInfection | 71 | 108 | 95
5. CybersecurityAntifraud | 62 | 118 | 122
6. Drugs | 18 | 39 | 39
7. EUEconomy | 28 | 28 | 23
8. EUInternet | 13 | 25 | 16
9. EnergyMarketsandStrategies | 6 | 10 | 1
10. EuropeanCouncil | 107 | 174 | 111
11. Europol | 169 | 432 | 233
12. FightagainstFraud | 21 | 24 | 38
13. FinancialEconomicCrime | 6 | 9 | 15
14. ForgeryMoney | 2 | 3 | 4
15. InformationSecurity | 26 | 31 | 56
16. Migration | 49 | 65 | 67
17. NATO | 3516 | 3865 | 3415
18. TerroristAttack | 92 | 117 | 125
In total: | 4681 | 5565 | 4755
Table 2. The distribution of instances over new/known categories and training/validation/testing subsets in the MM18x58_binary dataset.
Category | Training | Validation | Testing
Known (negative) | 4681 | 5565 | 4755
New (positive) | 11,350 | 13,665 | 13,647
In total: | 16,031 | 19,230 | 18,402
Table 3. The loss values, averaged from 5 runs and presented with confidence intervals, were obtained by exploring different features with the optimized binary FFNN. The “+” symbol denotes the concatenation of features.
Features | Avg. Loss ± Confidence Int.
LaBSE vectors | 0.0017 ± 0.0002
1. Softmax values | 0.0346 ± 0.0014
2. Penultimate layer’s values | 0.1101 ± 0.0596
3. Cosine similarities | 0.0049 ± 0.0077
1 + 2 | 0.0661 ± 0.0850
1 + 3 | 0.0035 ± 0.0031
2 + 3 | 0.0137 ± 0.0334
1 + 2 + 3 | 0.0057 ± 0.0036
Table 4. The accuracy values for classifying the removed class X as the new class.
Class X | Numb. of Tested Instances | Accuracy
1. AlternativeEnergy | 43 | 1.000
2. ClimateAction | 47 | 1.000
3. ClimateChange | 1317 | 0.992
4. CoronavirusInfection | 274 | 0.971
5. CybersecurityAntifraud | 302 | 0.990
6. Drugs | 96 | 0.906
7. EUEconomy | 79 | 0.941
8. EUInternet | 54 | 1.000
9. EnergyMarketsandStrategies | 17 | 1.000
10. EuropeanCouncil | 392 | 0.980
11. Europol | 834 | 0.978
12. FightagainstFraud | 83 | 0.940
13. FinancialEconomicCrime | 30 | 1.000
14. ForgeryMoney | 9 | 1.000
15. InformationSecurity | 113 | 1.000
16. Migration | 181 | 0.972
17. NATO | 10,796 | 0.796
18. TerroristAttack | 334 | 0.997
Table 5. The cosine similarity values between cluster centers (mean vector of all instance vectors belonging to a particular class) of different topics.
Topic | Topic | Cosine Similarity
Top 10 topic pairs with the highest cosine similarity
ClimateAction | ClimateChange | 0.836
EUEconomy | EuropeanCouncil | 0.802
AlternativeEnergy | EnergyMarketsandStrategies | 0.791
CybersecurityAntifraud | InformationSecurity | 0.713
EUEconomy | EnergyMarketsandStrategies | 0.711
FightagainstFraud | FinancialEconomicCrime | 0.702
EUEconomy | EUInternet | 0.694
EUInternet | EuropeanCouncil | 0.690
EuropeanCouncil | Europol | 0.677
CybersecurityAntifraud | EUInternet | 0.635
Last 10 topic pairs with the lowest cosine similarity
ClimateChange | Drugs | 0.234
EUEconomy | FightagainstFraud | 0.231
ClimateAction | Drugs | 0.229
AlternativeEnergy | CoronavirusInfection | 0.225
AlternativeEnergy | FightagainstFraud | 0.220
AlternativeEnergy | ForgeryMoney | 0.216
EuropeanCouncil | FightagainstFraud | 0.206
EnergyMarketsandStrategies | FightagainstFraud | 0.203
ClimateAction | ForgeryMoney | 0.180
FightagainstFraud | NATO | 0.099
