**Citation:** Tundis, A.; Mukherjee, G.; Mühlhäuser, M. An Algorithm for the Detection of Hidden Propaganda in Mixed-Code Text over the Internet. *Appl. Sci.* **2021**, *11*, 2196. https://doi.org/10.3390/app11052196

Academic Editor: David Megías

Received: 29 December 2020; Accepted: 24 February 2021; Published: 3 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**1. Introduction**

Language is a medium of expression in human society. Communication is a form of social interaction that involves the exchange or negotiation of information, primarily through the use of *language* [1]. As discussed in [2], languages are socio-cultural as well as individual products and, as such, they change over time, keeping up with the evolution of the social environment. IT has revolutionised the way we communicate [3], as evidenced by the amount of information generated and spread via social media such as Facebook, Instagram, Twitter and YouTube. In particular, Internet-based communication has become one of the basic necessities of modern life. However, some users also exploit this technology for malicious purposes. The high number of users makes it very easy for any kind of information to diffuse very quickly and to influence the perception of users themselves [4]. This kind of information, commonly known as *viral messages*, is often unverified and can result in outrage or hysteria. Furthermore, as discussed in [5], such methods can be used to shape public opinion or finances, or even to create panic in society; in this sense, it can be treated as an *infection of the mind*. Terrorist organisations, for example, use online networking to generate and promote propaganda-based content that legitimises or instructs illegal activities to the public, as the huge number of registered online users makes information spread faster and easier [6–8]. That is why criminals exploit the Internet as an active means to perpetrate crimes [9–12].

In particular, online social media is skillfully utilised by terrorists as a tool of persuasion and behaviour change (spanning political opinion, marketing and sales, and humanitarian and community acts) in order to cause significant damage as well as social unrest in sensitive places through fake news and propaganda, which might result in individuals joining terrorist networks [13–16].

It is not trivial to monitor all ongoing activities even when such cyber communication takes place in the clear, without anonymisation or encryption techniques [17,18]. It is especially difficult to recognise when the communication is structured in such a fashion that it avoids raising security flags while still reaching its intended targets. The writing style, for example, is a powerful way to spread propaganda while evading the authorities [19], and it can represent a major obstacle, from a machine perspective, to automatically detecting and correctly interpreting a message written in clear as plain text. There are a few techniques that enable this, such as code-mixing and mixed-code text. *Code-mixing* is the term used when a person writes in two different languages, especially languages with two different scripts. For example, "Aap kaise ho?" is a Hindi sentence written in Roman script. In common parlance this is called *Hinglish* text, and likewise: *Romanji* (Japanese in Roman script), *Benglish* (Bengali in Roman script), *Roman Urdu* (Urdu in Roman script) and so on. Furthermore, when code-mixing is used, the spelling of words tends to differ from person to person, while variation in the phonetic usage of Roman letters can result in varying interpretations, adding to the complexity. The major challenge lies in properly monitoring these harmful messages circulating in social networks and removing or delegitimising them. This is a big concern in multilingual societies such as India, where social media content is predominantly expressed in Roman script and mixed code is widely used to converse on a daily basis. Correctly identifying a potential threat there is a huge hurdle, as the mixed-code environment makes it hard for a machine to properly "understand" the semantics of the intended message.
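The word-level language identification problem described above can be sketched with a simple dictionary lookup; the two word lists below are tiny toy dictionaries introduced purely for illustration, not resources used by any of the cited works:

```python
# Illustrative word-level language tagging for code-mixed (Hinglish) text.
# The word sets are toy examples, not a real lexicon.

HINDI_ROMANISED = {"aap", "kaise", "ho", "kya", "hai", "nahi"}
ENGLISH = {"how", "are", "you", "what", "is", "not"}

def tag_words(sentence):
    """Label each token as Hindi ('hi'), English ('en') or unknown ('univ')."""
    tags = []
    for token in sentence.lower().split():
        word = token.strip("?!.,")          # drop trailing punctuation
        if word in HINDI_ROMANISED:
            tags.append((word, "hi"))
        elif word in ENGLISH:
            tags.append((word, "en"))
        else:
            tags.append((word, "univ"))
    return tags

print(tag_words("Aap kaise ho?"))
# [('aap', 'hi'), ('kaise', 'hi'), ('ho', 'hi')]
```

Real systems replace the dictionaries with learned classifiers precisely because, as noted above, romanised spellings vary from person to person.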
On the other hand, *mixed-code* text consists of using special characters to write words in such a way that they graphically resemble the intended words, while from the machine's point of view they form a nonsensical sequence of symbols. What, then, is the function of mixed-code text? The answer lies in the simplicity of the method: it requires only a bit of creativity from the writer and, on the reader's part, only knowledge of the language, as a reader in the flow of reading easily understands the intended text. A machine, however, takes the text literally and checks its content, which in most cases is dismissed as garbage. Consequently, careful deliberation is needed to re-evaluate such text and to provide a method for improving its interpretation.
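As a minimal illustration of why a human reads such text fluently while a machine sees garbage, the sketch below reverses a handful of common single-symbol substitutions; the mapping table is an illustrative subset and is not the method proposed in this paper (which handles multi-symbol compositions via image recognition):

```python
# Sketch: normalising mixed-code ("leet") text back to plain letters.
# Real leet spellings also combine several symbols per letter
# (e.g. "|-|" for "h"), which a per-character map cannot handle.

LEET_MAP = {
    "@": "a", "4": "a", "3": "e", "1": "i", "!": "i",
    "0": "o", "5": "s", "$": "s", "7": "t", "+": "t",
}

def normalise_leet(text):
    """Replace single-symbol leet substitutions with their letters."""
    return "".join(LEET_MAP.get(ch, ch) for ch in text.lower())

print(normalise_leet("pr0p4g@nd4"))  # -> "propaganda"
```

The limitation of such a lookup table, namely that symbol combinations and novel substitutions escape it, is exactly what motivates the image-based approach of this work.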

In this panorama, this article, which is an extended version of [20], focuses on the detection of hidden propaganda in mixed-code text, by proposing an algorithm for supporting the analysis of suspicious mixed-code text, based on the optical character recognition (OCR) analogy [21]. Specifically, a segregation approach is adopted to analyse character combinations, which in turn are transformed into images. A character recognition model, based on machine learning techniques, is trained to recognise such modified letters of the alphabet. The process iterates through all possible combinations to find the intended word. A second model deals with text classification, to determine whether a text contains hidden propaganda messages or not. Since this problem falls within the detection field, accuracy, precision, recall and F1-score have been employed as evaluation criteria for assessing the proposal's performance in detecting hidden propaganda in mixed code. The conducted experiments show that each of these metrics exceeds 91%, indicating good performance in line with the related works described in Section 2.
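The four evaluation criteria mentioned above follow the standard definitions for binary detection; as a reference, a minimal pure-Python computation on toy labels (1 = propaganda, 0 = not propaganda) looks as follows:

```python
# Standard detection metrics computed from a binary confusion matrix.

def confusion(y_true, y_pred):
    """Count true/false positives and negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F1) for binary labels."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: 5 texts, one propaganda item missed by the detector.
acc, prec, rec, f1 = metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
```

In practice these values would be produced by a library such as scikit-learn, but the formulas are the same ones summarised here.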

The rest of the paper is structured as follows: Section 2 provides an overview of the related works, distinguishing between mixed-code analysis and propaganda detection. An overview of optical character and image recognition, which represent the enabling technologies, is introduced in Section 3. In Section 4, the most common identified mixed-code categories and the overall adopted analysis process, along with the proposed algorithm, are elaborated. Its evaluation is then presented and discussed in Section 5. Finally, conclusions and future works are highlighted in Section 6.

#### **2. Related Work**

This section provides an overview of related scientific contributions in two fields: (i) mixed-code analysis (see Table 1), which takes place when two distinct languages, syntaxes or written forms are employed in a conversation or communication [22]. It is generally used to overcome linguistic shortfalls, to help in understanding groups' dynamics and characteristics, and to capture various ethnic peculiarities which can define group identities [23]; and (ii) online propaganda detection, which is devoted to identifying specific content that aims to incite the masses by influencing their mindset. An overview of the most popular related works is presented in Sections 2.1 and 2.2, respectively.

#### *2.1. Mixed Code Analysis*

A contribution centred on machine translation of so-called *Hinglish* text is proposed in [24], where Hinglish represents a form of communication that mixes Hindi and English terms within a conversation. The authors proposed a system centred on machine learning in order to translate Hinglish script. They discussed how the constraints of mixed code should be considered when defining a model in order to obtain an accurate interpretation of the mixed code. They claimed that acceptable results were achieved in more than 90% of cases; however, no metrics or quantitative results related to the evaluation approach were presented. Furthermore, the application context in which the experimentation was conducted was not clearly specified.

In [25], the hate speech detection problem in code-mixed texts on social media has been addressed by exploiting a Hindi-English code-mixed dataset consisting of online posts from Twitter. Different types of textual features have been used to define a detection model: character n-grams, word n-grams, punctuation, negation words and a hate lexicon. The annotation comprised two factors, word-level language and hate/normal speech, where the word-level labels indicated which words of the text were English and which were Hindi. N-grams were adopted to support feature extraction (both at character and word level) in order to detect relevant words. Furthermore, punctuation and negation words were considered as indicators of potential hate speech. The experiments showed that the best result was achieved by combining all such features to train an SVM, thus achieving 71.7% accuracy.
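For illustration, character n-gram features of the kind fed to the SVM in [25] can be extracted as simple counts; weighting schemes and the remaining features (word n-grams, lexicon matches) are omitted in this sketch:

```python
# Sketch: counting overlapping character n-grams of a string,
# the basic feature type used in n-gram-based text classifiers.

from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams (lowercased)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngrams("hate", 2))
# Counter({'ha': 1, 'at': 1, 'te': 1})
```

Character-level n-grams are attractive for code-mixed text precisely because they tolerate the spelling variation discussed in Section 1.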

In [26], an approach for language identification on social media has been elaborated. A dataset of Facebook posts and comments that contains code-mixing between Bengali, English and Hindi has been built. The authors considered a text as mixed-code not only when the script was different but also when words of different languages were used randomly or sentences switched languages within a single text or paragraph. Different analysis techniques have been experimented with, some based on an unsupervised dictionary approach and others on supervised word-level classification. However, no clear evaluation has been provided. The work proposed in [27], instead, focused on the classification of offensive tweets written in Hinglish. A Twitter dataset containing tweets in Hindi-English code-switched language has been collected and organised into three main classes: non-offensive, abusive and hate speech. Convolutional Neural Networks (CNNs) combined with transfer learning have been used in the classification process, reaching 83.9% accuracy.

In [28], the identification of hate speech from code-mixed text on social media has been addressed, using a Hindi-English mixed-code corpus as the initial baseline. A Long Short-Term Memory (LSTM) technique operating at the subword level has been mainly considered. The idea behind the subword-level selection is to identify root words and to match their variations, since in a mixed-code setting the spelling may differ for each writer. The hierarchical LSTM model, as well as Random Forest (RF) and Support Vector Machine (SVM) classifiers, have been experimented with using subword phonetics. As Hindi is written phonetically, a phonetic input of the text has also been provided. The words were divided into consonant-vowel sequences, and these sequences form the base of the attention model. This enabled the model both to determine the words that make up the vocabulary and to match the phonetics to such words. The results showed that the hierarchical LSTM reached the highest recall and F1-score, whereas the SVM reached the highest accuracy, equal to 70.7%.
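The consonant-vowel sequencing described above can be sketched as a segmentation into maximal runs of consonants or vowels; note that the actual segmentation rules of [28] may differ from this simplified version:

```python
# Sketch: split a romanised word into maximal consonant/vowel runs,
# the kind of subword units fed to the attention model in [28].

VOWELS = set("aeiou")

def cv_segments(word):
    """Group a word into maximal runs of consonants or vowels."""
    segments = []          # list of [run_string, is_vowel_run]
    for ch in word.lower():
        is_vowel = ch in VOWELS
        if segments and segments[-1][1] == is_vowel:
            segments[-1][0] += ch      # extend the current run
        else:
            segments.append([ch, is_vowel])
    return [run for run, _ in segments]

print(cv_segments("kaise"))  # -> ['k', 'ai', 's', 'e']
```

Such units are more stable than whole words under the per-writer spelling variation that motivates the subword approach.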

In [29], different analysis techniques for supporting sentiment analysis of textual information, after normalising the influence of multiple languages in mixed-code script, have been experimented with. In particular, several machine learning techniques able to characterise a text by determining its polarity have been used. The mixed-code text used to train the model was first annotated to mark the words as Hindi or English. The model used an Artificial Neural Network (ANN) to identify the affinity of the text and to mark it with values from "very negative" to "very positive" (−5 to +5) so as to provide a sentiment score. The results showed circa 80% accuracy. However, regarding the meaning of polarity, it is not clearly specified whether positivity means happiness or joy, and negativity anger or sadness.

Another research contribution on code-mixed text, centred on sentiment analysis, is described in [30]. It is based on the combination of two models: an N-gram probabilistic model, namely Multinomial Naive Bayes (MNB), and an LSTM model centred on character tri-grams. In particular, the LSTM model has been used to capture deep sequential patterns contained in the text, whereas the low-level combination of keywords has been handled by the MNB model in order to compensate for grammatical inconsistencies. The overall achieved accuracy was 70.8%. In [31], the authors extended NLP techniques to detect humour in a Hindi-English mixed-code environment. N-grams, Bag-of-Words (BoW), common words and hashtags have been used and experimented with on a manually annotated dataset. SVM, Random Forest and Naïve Bayes classifiers have been evaluated, with a highest accuracy of 69.3%. In [32], instead, another NLP technique to normalise words in a code-mixed environment (as spelling tends to vary across persons) has been proposed. A Skip-gram model, which clusters spelling variants in order to deliver the base word as output in the normalised text, has been used. The results have shown an accuracy of 71% and an F1-score of 66%.
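As an illustration of the normalisation goal pursued in [32], the sketch below maps spelling variants to a base word using edit distance; this is a deliberately simpler stand-in for the Skip-gram clustering actually proposed there:

```python
# Sketch: map spelling variants of romanised words to a known base word
# when they are within a small edit distance. Purely illustrative; [32]
# clusters variants with a Skip-gram model instead.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalise(word, base_words, max_dist=2):
    """Return the closest base word if it is close enough, else the word."""
    best = min(base_words, key=lambda b: edit_distance(word, b))
    return best if edit_distance(word, best) <= max_dist else word

print(normalise("kaisee", ["kaise", "acha", "nahi"]))  # -> "kaise"
```

Distance-based matching handles surface variation but, unlike the embedding-based approach of [32], cannot group variants that differ phonetically rather than orthographically.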



A clear aspect that emerged from the related works identified above is that previous research has focused on mixed-code analysis mainly in terms of language translation. That research direction differs from the purpose of the current work, which focuses on the detection of hidden propaganda in mixed code in the form of *leet text* [33], a particular form of expression primarily used on the Internet. The distinguishing characteristic of this textual form lies in combining characters and symbols of the computer keyboard to graphically represent a text. This writing style makes *leet text* a powerful way of communicating and spreading propaganda, as it allows its automatic identification to be easily bypassed.

#### *2.2. Online Propaganda Detection*

Online propaganda is a modern way of conducting campaigns centred on IT-based tools and, in particular, online social networking. Twitter represents one of the most used media for propaganda and is heavily utilised by extremist and terrorist groups to reach the masses. The recent interest in detecting online propaganda is documented by the different research efforts conducted in this field (see Table 2).

In [34], for example, the authors focused on the detection of extremist ISIS groups, proposing a method to actively identify tweets containing propaganda-related content. The method aimed to recognise not only patterns containing specific hashtags but also user accounts, by profiling them so as to enable potential predictions. It was centred on the term frequency (TF) of suspected words in order to derive relevant information, whereas a Regression-based Neural Network (RBNN) was used as the classification algorithm; however, the achievements were not quantified.

Similarly, the detection of radical content was addressed in [35] through the definition of a model. It was used to analyse the text produced by users from a psychological perspective, by considering specific behavioural patterns. In particular, the authors utilised TF-IDF to identify suspicious terms related to radical behaviour, which were used to train Random Forest (RF), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) classifiers, obtaining 94% accuracy as the highest result.
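As a reference for the TF-IDF weighting used in [35] to surface suspicious terms, a minimal pure-Python variant can be computed as follows; production pipelines would typically rely on a library implementation (e.g. scikit-learn's `TfidfVectorizer`), whose smoothing details may differ:

```python
# Sketch: TF-IDF weights for tokenised documents. Terms appearing in
# fewer documents receive higher weights, surfacing distinctive words.

import math

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    df = {}                                  # document frequency
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term) / len(doc)            # term frequency
            idf = math.log(n / df[term]) + 1.0         # inverse doc freq
            w[term] = tf * idf
        weights.append(w)
    return weights

docs = [["attack", "the", "city"], ["the", "city", "sleeps"]]
w = tf_idf(docs)   # "attack" outweighs the common words "the"/"city"
```

The same scheme, applied to a radicalisation corpus, is what lets common function words fade while behaviourally distinctive terms stand out.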

A deep learning model for supporting the identification of Sunni extremist propaganda via text analysis is presented in [36]. Here, word2vec and doc2vec techniques have been adopted to detect specific relationships among extremist-related terms which in turn were used to identify and classify new text as propaganda or not. Among the conducted tests, Artificial Neural Networks (ANNs) showed the best classification results by providing accuracy and precision of about 90%.

In [37], the authors focused on the creation of a freely available dataset as a benchmark to support propaganda detection research, which has been used in different research competitions related to propaganda detection. It contains sentences annotated by experts as propaganda or not. In addition, the paper presents preliminary classification results, achieving about 63% accuracy using a neural network classifier. Using the same dataset, in [38] the authors exploited BERT-based classification to identify single non-propaganda rows, employing the ELMo, BERT and RoBERTa approaches and achieving 79% recall, 66% precision and a 60% F1-score.

In [39], instead, the authors focused on the assessment of radical and extremist online propaganda, proposing a pyramidal conceptual model that makes it possible to distinguish propaganda-related content at different levels of radicalisation. The model is centred on three sociological human aspects which characterise traits of radicals and terrorists. A preliminary experimentation was carried out in order to illustrate how propaganda items can be allocated to the pyramid model.


#### **Table 2.** Online propaganda detection related works.

From the papers reviewed above, it emerges that online propaganda detection is an active research field centred on the analysis of extremist-related content. However, all previous research efforts are limited to the analysis of the natural language of plain text. Our scope, instead, is to deal with a form of expression which is not directly attributable to plain text but which is made up to "hide" the real message, by composing different symbols to graphically resemble the standard alphabet.
