**3. Background**

For completeness and to make the paper self-contained, an overview of basic concepts regarding character recognition and writing styles is given in Sections 3.1 and 3.2, respectively.

#### *3.1. Optical Character Recognition*

The objective of character recognition is to enable a machine to recognise characters by using optical devices, such as a camera. The machine should be able to capture the image of a character and associate it with its corresponding symbolic identity [40]. Optical Character Recognition (OCR) can be traced back to as early as 1900, when the scientist Tyurin successfully implemented the idea; with the invention of modern computers, research in this field has increased considerably. OCR was initially developed to assist visually challenged individuals in reading text, and was then adopted in other application contexts, such as banking, health and security, in combination with Machine Learning (ML). A common OCR application is handwriting recognition, centred on smart devices that are capable of transforming handwritten text into its digital counterpart [41]. OCR is utilised to recognise languages of different origins, not only in Roman script [42] but also in the case of degraded language documents. Another use of OCR to recognise handwriting is presented in [43], where the authors used the n-gram model, which produces groups of words that overlay each other. The authors extended such a model and utilised the likelihood score from a statistical machine translation system as the main feature, so as to be able to capture the image of the words and to translate each image into other languages.

A further application of OCR is provided in [44], which describes a method to recognise Bangla handwritten numerical characters using local binary patterns, whereas in [45] the authors used OCR to detect printed English and Fraktur text using Long Short-Term Memory (LSTM). The authors observed that handwritten content may vary in the position and baseline of the text; as a consequence, a text-line normalisation approach was used to uniformly position the text. Such a normalisation approach was based on a dictionary containing data generated on the basis of the connected component shapes and the associated baseline. For recognition and classification, the authors implemented a 1-D Bidirectional Long Short-Term Memory model, establishing the ground-truth alignment by using a forward-backward algorithm, which also provides a decoding mechanism to map the frame network onto a sequence of symbols.

Further applications of OCR, supporting the recognition of handwritten digits, are presented in the literature. For example, in [46] the use of Random Forest, Decision Tree and Hoeffding Tree classifiers has been proposed to build a tool that can easily recognise digits. For this, the authors utilised the MNIST handwritten digit dataset, which is preprocessed before being used in order to extract potential characterising features. In such work, a comparative study among the different classifiers was conducted, concluding that the Hoeffding Tree was the best-performing one, reaching 73% accuracy. Another approach to support character recognition was based on the use of the Tesseract OCR open-source engine to train a Tamil OCR model [47]. In particular, a segmentation approach was adopted by using a box file system. Each character was numbered, so that the number of boxes was equal to the number of training characters. Starting from such boxed characters, a classifier was trained by using various images with different font sizes and types. The evaluation of the proposed OCR system was performed using 20 scanned images from 20 ancient Tamil books and over 14,000 characters, obtaining 81% accuracy.

A method for the recognition of Roman script and the English language was proposed in [21], based on an Artificial Neural Network (ANN) and Nearest Neighbour (NN) to detect and interpret scanned English documents in three different font types. Using such a method, the authors achieved 98% accuracy by experimenting on a dataset, created by themselves and not publicly available, consisting of English alphabets in different fonts. As a consequence, the dataset could have been biased. Unfortunately, this is not verifiable, as no information is available either regarding such dataset or regarding the adopted evaluation approach.

### *3.2. The Art of Writing*

When we talk about *Word Art*, we typically refer to the graphic, structural aspect with which words are written, commonly known as calligraphy. It is often considered to be the craft of writing text in an appealing form, and it is regularly used in font design, typography, logo and graphic design. Word Art can be seen as a text modifier which adds visual enhancements to text, like shadows, outlines, colours, skew and 3-D effects, to make it more attractive to the user. *Text fonts* are also an integral part of modern writing. Since there are multiple fonts to choose from, a user can use a font to describe the purpose of the text and to emphasise the mood of the article. For example, a user might choose a calligraphy font for invitations and the Calibri font for formal documents, or even create custom fonts to describe a unique style of presenting textual content.

In [48], a distinction between the terms *Computer Art* and *Digital Art* is discussed. *Digital art*, which is considered more general, represents the use of the computer's capacity to convert different media types (such as music, pictures, movies, stories and text) into a digital form and to process them for multiple purposes. For example, a digitally encoded video of an event can be integrated with music of another origin while displaying text of yet another. The integration into one artefact is possible because all of them are fundamentally series of binary code. In contrast, *Computer Art* is art created without external methods, using only the tools available on the computer itself. In its initial stages it was just art made with the characters available on a standard keyboard. These types of images are called ASCII art, which relies only on the use of standard characters to make images, due to the lack of other graphical resources and professional tools. The use of ASCII art is still prominent today in chats and forums, like Reddit and Stack Overflow, as well as among players in multiplayer game communities. In addition, the symbols used to create the full picture are typically linked to the meaning and the context of the picture itself.

As argued in [49], *ASCII art* is a more complex art form than originally intended. It requires precision in the alignment of the text to avoid misinterpretation, which would otherwise yield inconsistent results for recognition techniques such as optical character recognition (OCR). Furthermore, with today's computers it is easier to produce text in multiple forms and fonts, as operating systems offer built-in writing systems that support a multitude of fonts and design styles, as well as design tools for creating the so popularly called "Word Art".

Another form of writing text that combines "Text Art" and "Font", typically found on websites, is called *Leet Text*. It uses characters found on the keyboard to type out text, playing with the similarities between the available characters and those to be represented. Since the language used on the Internet is predominantly English and the keyboards of most users carry Roman characters, it is generally used for Roman scripts. Other languages are also possible, but the scope may be limited. The term "leet" comes from the word "Elite", and it is used online as a symbol of proficiency in certain fields, especially in gaming. The word "leet" itself is represented in leet text as "l337". Leet is considered similar to ASCII or Word Art, and it can also convey emotions. Leet is based on conventions, but no standard rules are defined.

Since leet speak depends on a base language for communication, it adheres to that language's grammatical constraints. However, leet speak has some unique sets of texts of its own, which are generally known to its users. It uses misspellings and abbreviations to convey a word or message, and it uses numbers as valid letters to write a word. Furthermore, it makes extensive use of special characters to depict certain characters. Leet speak is also used to censor certain text which can be provocative or hate inducing. These texts require precision and practice to be understood and interpreted. More insights into the complexities of leet speak are provided in [33]. For example, it is mentioned that there are many varieties of leet speak and that many others can be developed on the basis of the specific context.

#### **4. Propaganda Detection in Mixed-Code Text**

In this section, the proposed method for supporting the detection of online propaganda hidden in mixed-code text is elaborated. In particular, in Section 4.1 the different mixed-code types are first presented, and then the adopted research process is briefly introduced. In Section 4.2, the proposed normalisation algorithm for transforming mixed-code into standard text is elaborated, whereas the approach for the analysis and classification of propaganda-related content is described in Section 4.3.

#### *4.1. Writing Styles and Mixed-Code Categorisation*

Thoughts can be expressed in written form in different ways. Indeed, not only is a text a composition of symbols and characters used to build words in order to convey a message, but it can also be used to depict phonetic structures or art forms. The basic way of writing is called *Text on Document (ToD)*, which is characterised by the use of standard alphabetic characters, that is, symbols belonging to a specific language, whose structure is governed by a set of grammar rules, semantics and a vocabulary. *ToD* is supposed to contain an explicit message that is easy to read and understand. *Text in Visual Media (TiVM)*, instead, represents the use of text in graphical representations to enrich the visual information by imitating sounds or emotions. It does not necessarily follow grammar rules or have any semantic identity. *Text as Art Form (TaAF)*, finally, aims to emphasise the artistry of the writer by using forms or symbols that are not part of the alphabet, in order to represent a code or a coded message in clear form.

Mixed-code text belongs to *TaAF* and, according to the conducted research, we classified it into three main types (i.e., *Single row—multiple language*, *Single row—single language*, and *Multiple rows*) by considering two main parameters, as depicted in Figure 1: (i) the *language*, that is, whether the mixed-code text contains multilanguage factors or writing styles linked to one or more languages, and (ii) the *graphical writing style*, that is, whether a standard row-based writing style is followed or multiple rows are used in order to graphically represent the alphabetic characters.

**Figure 1.** Mixed-code categories.

Moreover, for each of them further subcategories have been identified, which are described in the following:


Due to the high diversity of such subcategories, their analysis requires in turn an individual and specific study. For this reason, the rest of this research work focuses on one of them, in particular the *Text as Art Form* called "Special Characters for Alphabets", which has not been investigated yet.

Figure 2 depicts the adopted analysis process. In particular, given a textual Input, a first check verifies whether the text, which is from now on referred to as *TaAF*, contains special characters. If so, the mixed-code text *Normalization Algorithm* is applied, so as to transform the *TaAF* into a standard computer readable text. At this point, the transformed text can be further analysed. In particular, the normalised *TaAF* is given in input to the "Propaganda detection classifier", trained on a specific dataset, which is able to discriminate whether it is Propaganda or Nonpropaganda related. If the textual Input is not a *TaAF*, the propaganda detection step can be directly applied. The next sections elaborate the proposed "Normalization Algorithm" as well as the training of the "Propaganda detection classifier".

**Figure 2.** Research approach: data management, phases and work-products.

#### *4.2. A Mixed Code Text Normalisation Algorithm*

This section presents the proposed algorithm to normalise mixed-code text related to "Special Characters for Alphabets", whose pseudocode is reported in Algorithm 1. It consists of four main steps: (i) *Text Segregation*, which splits words by generating different subsets of symbols, (ii) *Character Transformation*, which aims to derive, from each subset of symbols, letters of the alphabet, (iii) *Word Selection*, which deals with the generation and selection of existing words, according to a dictionary, on the basis of the derived letters, and (iv) *Sentence Reconstruction*, which aims to replace the initial mixed-code text with the selected words in order to obtain a meaningful standard textual sentence.

A more detailed description of each step is given through a simple reference example. In particular, given the sentence *S* in input, as represented in Equation (1), which contains a mixed-code text called from now on *Art Form Word—AFW* (e.g., "F{}{}T"), each step of the proposed normalisation algorithm is elaborated in the following.

$$S = \text{My F\{\}\{\}T is odd} \tag{1}$$

*Text Segregation step*: this step has a syntactic function and works on the single words. It aims to create groups of symbols starting from the *AFW* in consideration. In particular, after splitting the sentence into subwords and identifying those containing nonalphabetic characters, for each *AFW* the algorithm reworks its sequence of characters by grouping them differently, generating several combinations, as shown in Table 3. A recursive operation is used to build the combinations: it keeps the first character constant and combines the remaining characters in all possible ways, then keeps the first and second characters together and again combines the remaining ones, and so on, exhaustively. It is assumed that the order of the symbols through which the mixed-code text is built, and as a consequence the order within each combination, reflects the order of the letters in the standard word; that is, the word has been written in the order it is meant to be read and is not an anagram.

```
Algorithm 1: Pseudocode for text normalisation
  Data: String S={W1, W2, ..., Wk−1, Wk };
  Placeholder={P1, P2, ..., Pk−1, Pk };
  CandidateWords cw[];
  Integer i=1; j=1; z=1;
  Word w; CharactersGroup cg; String SNorm;
  SegregationMatrix SM [K][N][M];
  TransformedCharacter TC [M];
  Text Segregation step: while i<= S.size() do
     // - as long as there are words
     w=S.getWord(i); //take the i-th word
      if w.isNotMixedCode() then
         //if the i-th word is a regular word
         SNorm.append(w); //it does not have to be normalised
      else
         SNorm.append(Pi); // create a placeholder in the i-th position
         SM[i]=w.segregate(); // segregate the i-th word
      end
      i++;
  end
  Character Transformation step: while j<= SM.GroupSize() do
     // for each segregated word
      cg=SM.getGroup(j); // select the j-th group of characters
      if cg.isAnAlphabeticCharacter() then
         //if it is one of the standard alphabetic letters
         TC[j]=cg; //apply the rule-1 by considering it as it is
      else
         TC[j]=convertIntoSingleCharacter(cg); //apply the rule-2 by using OCR to transform the j-th cg group of
          symbols into one alphabetic character.
      end
      j++;
  end
  k=1;
  Word Selection step: while k<=S.size() do
      z=1; while z<=SM(k).CombinationNumber() do
         //as long as there is a combination for each word
         cz=SM(k).getCombination(z);
         cz.replaceGroupsWithTheTransformedCharacter(SM,TC);
         if cz.string().isNotAnExistingEnglishWord() then
            //if the generated word is not part of the English vocabulary, it is not considered
             SM(k).removeCombination(cz)
         else
            // otherwise it becomes a candidate placeholder replacement
             cw.addCandidateWord(k,cz)
         end
         z++;
      end
      k++;
  end
  k=1; Sentence Reconstruction step: while k<=S.size() do
      SNorm.replacePlaceholder(k,cw(k).candidateWord());
      k++;
  end
  Result: SNorm;
```


**Table 3.** An example of segregation with Art Form Word (AFW) = "**F{}{}T**".

*Character Transformation step*: in this step, for each combination generated in the previous step the following transformation rules have been defined:


A classifier for character recognition has been trained by using the dataset provided in [50], which consists of a collection of images of the 26 characters of the English alphabet, both handwritten and typed on a computer. Furthermore, additional characters from other languages (such as *ü, Ü, ä, Ä, ö, Ö, ß, Ø, ø* and so on) have been created in order to extend and enrich the initial dataset.

**Figure 3.** An example of a recognition rule.

To create such images, Python and the Pillow drawing library [51] were used, with the aim of incorporating certain non-English characters belonging to other languages. The images have a width and height of 200 × 200 px. They were generated in grayscale because, for the purpose of character identification, colour (RGB) does not provide any additional information. The best performance for character recognition has been achieved by using a Convolutional Neural Network (CNN). A training/testing split with a ratio of 80:20 and 15 epochs were used to configure the CNN. Its performance, in terms of evaluation metrics, confusion matrix as well as training vs. testing loss, is depicted in Figure 4.
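A minimal sketch of this image generation step, using Pillow as referenced in [51] (the exact font and text placement used by the authors are not stated, so those details here are assumptions):

```python
from PIL import Image, ImageDraw

def render_character(ch, size=(200, 200)):
    """Render a single character on a 200 x 200 px grayscale canvas."""
    img = Image.new("L", size, color=255)  # mode "L": 8-bit grayscale, white background
    draw = ImageDraw.Draw(img)
    # Default bitmap font as a stand-in; the paper does not name the fonts used.
    draw.text((size[0] // 3, size[1] // 3), ch, fill=0)
    return img

img = render_character("A")
print(img.size, img.mode)  # → (200, 200) L
```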

**Figure 4.** Performances in the recognition of single characters.

Table 4 shows an example of character transformation obtained by applying the above-mentioned rules. For example, only the letters of groups 3, 6 and M have been used as they are, whereas the remaining groups generated by the segregation process have first been transformed into images and then mapped to single letters.


**Table 4.** An example of Character Transformation.

*Word Selection step*: this part of the algorithm aims to derive candidate words that can replace the related *AFW*. First, for each *Combination* generated during the *Text Segregation step*, each group of symbols is replaced with its related *Transformed Character* obtained in the *Character Transformation step*. As shown in Table 5, a set of words written syntactically with letters of the alphabet only is thus built.



The existence of the derived words is then checked against an English dictionary. Through this step, only the words that are part of the standard English language are selected and, as a consequence, used for further analysis. Figure 5 graphically shows an example of word selection on the basis of existence in the English dictionary.
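The dictionary check amounts to a simple membership filter; in the sketch below a tiny hand-picked vocabulary stands in for the full English dictionary actually used:

```python
# Tiny stand-in vocabulary; the actual algorithm checks against a full
# English dictionary.
ENGLISH_WORDS = {"foot", "food", "fool", "my", "is", "odd"}

def select_words(candidates):
    """Keep only candidate words that exist in the reference dictionary."""
    return [w for w in candidates if w.lower() in ENGLISH_WORDS]

print(select_words(["FOOT", "FDDT", "FQOT"]))  # → ['FOOT']
```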

**Figure 5.** Selection of the derived words based on a dictionary.

*Sentence Reconstruction step*: this is the last part of the algorithm, and it is centred on two main input parameters: (i) all the existing words that have been generated and selected, that is, those that are part of the English dictionary, and (ii) the initial sentence *S* containing a *placeholder* for each identified *AFW*. In particular, from the previous steps, a set of potential words has been derived and selected for each *AFW*. As represented in Figure 6, at this point the algorithm outputs different versions of the normalised sentence, *SNormalized*={*S*1, *S*2, ..., *Sm*}, that is, all sentences that can be reconstructed by replacing the *placeholders* with the generated words in all possible combinations.
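Reconstructing all sentence variants amounts to taking the Cartesian product of the candidate-word sets, one per placeholder (a minimal sketch; the placeholder syntax is ours):

```python
from itertools import product

def reconstruct(template, candidate_sets):
    """Build every normalised sentence variant.

    template:       sentence with one '{}' placeholder per AFW
    candidate_sets: one list of selected words per placeholder
    """
    return [template.format(*combo) for combo in product(*candidate_sets)]

sentences = reconstruct("My {} is odd", [["FOOT", "FOOD"]])
print(sentences)  # → ['My FOOT is odd', 'My FOOD is odd']
```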

**Figure 6.** An example of the sentence reconstruction step.

If multiple sentences are produced, all of them are evaluated. In particular, in the presence of an odd number of sentences, the final evaluation is based on the majority value; with an even number of sentences, if the majority criterion is not applicable, the evaluation is delegated to a human through manual intervention.
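The majority rule described above can be sketched as follows (returning `None` whenever no strict majority exists, i.e., when manual intervention is required; names are ours):

```python
from collections import Counter

def final_evaluation(labels):
    """Majority vote over the per-sentence classifications.

    Returns the majority label, or None when the majority criterion is
    not applicable (a tie), signalling that a human decision is needed.
    """
    counts = Counter(labels).most_common()
    (top, n), rest = counts[0], counts[1:]
    if rest and rest[0][1] == n:  # tie: no strict majority
        return None
    return top

print(final_evaluation(["Propaganda", "Propaganda", "Nonpropaganda"]))  # → Propaganda
print(final_evaluation(["Propaganda", "Nonpropaganda"]))                # → None
```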

#### *4.3. Dealing with the Propaganda Detection*

In the previous section, the identification and transformation of a particular type of mixed-code text, based on "Special Characters for Alphabets", has been addressed by proposing an algorithm which is able to derive potential sentences in natural language. This section, instead, focuses on the description of the propaganda detection approach of Figure 2, which aims to automatically analyse the reconstructed sentences *SNormalized*={*S*1, *S*2, ..., *Sm*}, in order to discriminate whether or not they are related to propaganda.

#### 4.3.1. Machine Learning Approach

Among the available analysis techniques, machine learning (ML) is one of the most popular. It is adopted in different research fields, including (i) computational finance, for the evaluation of credit risk and algorithmic trading; (ii) image processing and artificial vision, for facial recognition, motion detection and object identification; (iii) computational biology, for the diagnosis of tumours, pharmaceutical research and DNA sequencing; (iv) energy production, for price and load forecasts; (v) the automotive, aerospace and manufacturing sectors, for predictive maintenance; and (vi) natural language processing, for speech recognition applications, and so on.

Furthermore, from the literature review described in the related work section, it emerged that four ML techniques represent the most popular algorithms in the automatic detection field, that is: (i) *Multinomial Naïve Bayes (MNB)*, a variant of the Naive Bayes classifier which assumes a multinomial distribution for each of the features; (ii) *Support Vector Machine (SVM)*, a model for classification and regression problems which can be used to solve both linear and nonlinear problems; (iii) *Logistic Regression (LR)*, a linear predictive analysis algorithm based on the concept of probability, used for classification problems; and (iv) *Convolutional Neural Network (CNN)*, a type of artificial neural network, widely used in image recognition and processing, centred on deep learning to perform both generative and descriptive tasks.

As a consequence, given a sentence S, to be able to assess whether it is Propaganda or Nonpropaganda related, four different classifiers have been trained and evaluated (one for each of the above mentioned ML techniques respectively) regarding propaganda detection. The best one has been then selected.

### 4.3.2. Dataset Description and Evaluation Metrics

The four classifiers have been trained using an available and freely downloadable dataset [52], which represents research results achieved through a collaboration among the MIT Computer Science and Artificial Intelligence Laboratory of Cambridge (USA), the Qatar Computing Research Institute (Qatar) and the University of Bologna (Italy). The dataset details are fully reported in [37]. It contains a collection of 15,847 textual items labelled with "0" and "1" to indicate Propaganda and Nonpropaganda. In particular, 4270 items are Propaganda related and 11,577 are Nonpropaganda related. As the dataset was unbalanced, two sampling techniques have been applied and evaluated in order to reduce the bias of the classifier: undersampling and oversampling. In particular, by applying the undersampling approach, we used the same number of textual items for both Propaganda and Nonpropaganda, for a total of 8540, whereas by applying the oversampling technique we used all the 11,577 Nonpropaganda-related items and increased the number of Propaganda-related items to 9440. In both cases, the generated datasets have been divided into two subsets: 80% of the items have been used to train the four models, whereas 20% have been used to test the classifiers. To allow the comparison with previous research contributions, the trained classifiers have been evaluated through Accuracy, Precision, Recall and F1-Score, which are the most popular metrics used in ML.
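The two balancing strategies can be sketched as simple random resampling (the paper does not state which resampling method was used, so this is an assumption):

```python
import random

def undersample(majority, minority, seed=0):
    """Shrink the majority class to the minority size, yielding a balanced set."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, target, seed=0):
    """Grow the minority class to `target` items by resampling with replacement."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return list(majority) + list(minority) + extra

# With the dataset figures above: undersampling gives 2 x 4270 = 8540 items,
# while oversampling keeps the 11,577 Nonpropaganda items and grows the
# Propaganda class towards 9440.
```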

#### 4.3.3. Classifier Assessment

Different experiments have been conducted to determine the best classifier among the selected ones. More specifically, in order to reduce possible bias that could have arisen from several factors during the training and classification phases, the MNB, SVM and LR classifiers have been assessed by using different vectorization approaches: Count Vectorizer, Term Frequency–Inverse Document Frequency (TF-IDF), TF-IDF with word n-grams, and TF-IDF with character n-grams. The overall performance reached by all classifiers, including the CNN, is reported in Table 6.
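To illustrate the idea behind these vectorizers, a toy TF-IDF over word 1-grams is sketched below (a didactic simplification, not the library implementations used in the experiments):

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF over whitespace word tokens (word 1-grams only)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # term frequency x inverse document frequency, one weight per vocab term
        vectors.append([tf[t] / len(doc) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf(["propaganda hidden text", "plain text"])
# terms occurring in every document (here "text") get weight 0
```

Character n-gram variants apply the same weighting to substrings instead of whole words, which makes them more robust to the spelling distortions typical of mixed-code text.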

In particular, the Convolutional Neural Network performed better than the other classifiers. Using the undersampled dataset, it reached the following results: 73% accuracy, 74% precision, 73% recall and 73% F1-Score. As reported in Table 6, by oversampling the same dataset, better performances have been reached after retraining and reassessing the CNN-based classifier: 94% accuracy, 92% precision, 92% recall and 91% F1-Score. These are not only better than those of the other classifiers tested within this research work, but are also in line with the performances of the related works reported in Table 2.


**Table 6.** Comparison of the performance achieved from each classifier with oversampling.

#### **5. Evaluation and Results Discussion**

In the previous section, the proposed algorithm for supporting the analysis of mixed-code text has been described, and the approach for assessing and selecting the most promising machine learning classifier for the propaganda detection step has been contextually presented.

This section, instead, aims to discuss and show how the evaluation of the overall workflow has been conducted, by explaining the problems encountered, by clarifying how the experiment has been set up, and by discussing the achieved results.

One of the problems encountered during the evaluation concerned the lack of available datasets containing Word Art mixed-code text and, especially, related to propaganda. To deal with this, the dataset described in Section 4.3 [37] has been taken into consideration as a starting point. In particular, a subset of its instances, called *SS*, has been selected, creating a smaller balanced dataset, that is, by taking 50% of sentences labelled as propaganda and 50% labelled as nonpropaganda related.

After that, an online tool called Universal Leet [53] has been used. Given in input a word *W*, it is able to automatically generate a possible related *Art Form Word* (*WAFW*). The experiment dataset has thus been built by replacing, in each sentence of the *SS* subset, at least one word or even the full text with its related Art Form Word, generated by the above-mentioned tool, so as to obtain a labelled *Art Form Word* dataset (*SSAFW*). In particular, we first used, alternately, the three standard available modalities (*basic*, *advanced*, and *ultimate* leet) to obtain a first version; then, in order to make such versions nonstandard and more human-like in terms of variety, we further updated part of them manually, so as to make them more difficult and less machine-related. This encoding step has been carried out manually because, to the best of our knowledge, no APIs are available that would allow us to automate the process. An excerpt is shown in Table 7.
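A minimal sketch of such an encoding is given below (the substitution table is a hypothetical "basic leet" mapping of ours; Universal Leet's actual tables are richer):

```python
# Hypothetical "basic leet" substitution table (illustrative only).
BASIC_LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def to_leet(text):
    """Replace characters with look-alike symbols, leaving the rest unchanged."""
    return "".join(BASIC_LEET.get(c.lower(), c) for c in text)

print(to_leet("leet"))  # → l337
```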



Instead, the WordNet package [54] has been used as the reference English dictionary, in order to implement the *Word Selection step* of the proposed normalisation algorithm described in Section 4.2. It is centred on Python's Natural Language Toolkit (NLTK) module and consists of a database where the collected nouns, adjectives, adverbs and verbs are grouped into sets of cognitive synonyms, called synsets. As already mentioned, on the basis of the results gathered from the classifier assessment described in Section 4, a CNN has been chosen for supporting the propaganda detection part of the process, as the trained classifier performed best in comparison to the others. The configuration parameters used to set up the CNN classifier are reported in Table 8.

**Table 8.** Configuration parameters of the CNN classification model.


An example of the results, gathered by experimenting with the method presented in Section 4, is reported in Table 9.

As can be seen in Table 9, for each sentence of Table 7 at least one reconstructed sentence is obtained. Indeed, according to the *Word Selection step* of the method, different "normal" words, and as a consequence multiple reconstructed sentences, can be generated from one single Art Form Word. Consequently, different evaluations are possible, as for the sentence with Id = "1", or as for the sentence with Id = "6", which is reconstructed in different ways but with the same evaluation result. The sentence with Id = "7", instead, is not properly reconstructed and is therefore misclassified. To overcome the classification problem in case of discordant multiple classifications, human intervention is required in order to select one of the available alternatives.


**Table 9.** Reconstructed sentences and related classification results.

Figure 7 summarises, instead, the confusion matrix at the end of the overall evaluation process based on the selected Convolutional Neural Network (CNN), which shows that only 9% of the instances are wrongly classified and, in particular, that only 5% of those related to propaganda are misclassified as nonpropaganda related.


**Figure 7.** Confusion matrix of propaganda detection in hidden mixed-code text.

Figure 8, instead, shows the classification performances, in terms of accuracy, precision, recall and F1-score, comparing the detection of propaganda in standard text (i.e., without mixed-code and, as a consequence, without applying the normalisation algorithm) with the detection of propaganda hidden in mixed-code text. In particular, not only does the diagram show very similar performances, meaning that in the presence of mixed-code the normalisation algorithm is able to appropriately reconstruct and analyse the sentences; it also highlights that the proposed approach is on average in line with the related works presented in Section 2, in terms of propaganda detection, ranging from 90% to 92% in accuracy, precision, recall and F1-score. Moreover, the correctness of the normalisation heuristic can be expressed as the number of correctly reconstructed sentences over the total number of original ones. Intuitively, a correct classification of a sentence/text is directly related to its correct normalisation. This means that the classification values presented in Figure 8 represent, as a consequence, a lower bound for the heuristic's ability to correctly retrieve the original text starting from its TaAF representation.

**Figure 8.** Results evaluation using CNN.
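The reported metrics can be reproduced directly from the cells of a binary confusion matrix; the counts used below are illustrative placeholders chosen to be consistent with the percentages in Figures 7 and 8, not the paper's actual figures:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score for the positive
    ('propaganda') class of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts over 100 instances: 9% misclassified overall,
# 5 propaganda instances missed (cf. Figure 7).
acc, prec, rec, f1 = binary_metrics(tp=45, fp=4, fn=5, tn=46)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
# -> 0.91 0.92 0.9 0.91
```

With these assumed counts the four metrics land in the 90-92% band reported in the evaluation, which is the behaviour the diagram in Figure 8 summarises.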

It is worth noting that the current results improve on those presented in the previous version of this work [20]. This improvement is not due to a new version of the presented heuristics, whose logic has not changed, but derives from a better training of the CNN algorithm, which produces fewer misclassifications and thereby positively impacts the overall process.

#### **6. Conclusions**

The paper dealt with the identification of mixed-code text for the detection of hidden propaganda. First, a categorisation of the different types of existing mixed code, based on two parameters related to the *language* and the *graphical writing style*, has been provided. The study focused on the analysis of one type of *Text as Art Form* written on a single row (called "Special Characters for Alphabets"), by adopting a methodological approach centred on two main aspects: (i) mixed-code text analysis and (ii) hidden propaganda detection. In particular, regarding the mixed-code text analysis, a four-step algorithm (consisting of *Text Segregation*, *Character Transformation*, *Word Selection* and *Sentence Reconstruction*) has been proposed, which supports both the identification of mixed code in text and its normalisation into natural language.
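The four steps can be sketched as a minimal pipeline; the character-substitution table and the lexicon below are hypothetical placeholders introduced for illustration, not the paper's actual mappings:

```python
def text_segregation(raw):
    """Split the mixed-code input into candidate symbol groups."""
    return raw.split()

def character_transformation(symbols):
    """Map special characters onto candidate Latin letters."""
    table = {"4": "a", "3": "e", "1": "i", "0": "o"}  # illustrative mapping
    return ["".join(table.get(c, c) for c in s) for s in symbols]

def word_selection(candidates):
    """Keep only the candidates that are 'normal' dictionary words."""
    dictionary = {"free", "people"}                    # illustrative lexicon
    return [w for w in candidates if w in dictionary]

def sentence_reconstruction(words):
    """Recompose the selected words into a natural-language sentence."""
    return " ".join(words)

sentence = sentence_reconstruction(
    word_selection(character_transformation(text_segregation("fr33 p30pl3"))))
print(sentence)  # -> free people
```

The sketch only conveys the data flow between the four steps; the real algorithm additionally has to enumerate alternative segregations and transformations, which is where the multiple reconstructed sentences of Table 9 come from.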

Regarding the hidden propaganda detection, a Convolutional Neural Network classifier has been chosen among other ML algorithms, as it provided the best performance in the detection of propaganda. The overall performance of the method has been evaluated on a publicly available dataset containing a collection of 15,847 propaganda- and nonpropaganda-related textual items. The results showed good performance, achieving 92% accuracy and precision, 91% F1-score and 90% recall, which are on average better than those of the related work.

The impact of this solution lies in the ability to automate, and consequently speed up, the identification of sources and individuals who use mixed code to disseminate propaganda content linked to extremist behaviour, so as to support Law Enforcement Agencies (LEAs) and Police Forces (PFs) in the fight against this phenomenon. Future work will focus on: (i) improving the performance of the current algorithm by defining a more efficient heuristic for the Text Segregation step, in terms of computational time. The current version generates all possible combinations of symbols, which works fine as long as the input is "reasonably contained"; a smarter way to segregate the text needs to be investigated to make it work in larger contexts as well; (ii) extending the proposed method and experimenting with it to support the detection of hidden propaganda by considering other types of mixed code.
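The combinatorial cost mentioned for the Text Segregation step can be illustrated with an exhaustive enumeration (a sketch of the idea, not the paper's implementation): splitting a string of n symbols into contiguous groups yields 2^(n-1) possible segregations, which is why the exhaustive approach only remains tractable for "reasonably contained" input.

```python
def segregations(symbols):
    """Enumerate every split of a symbol string into contiguous groups.
    There are 2**(n-1) such splits for n symbols, so an exhaustive
    Text Segregation scales exponentially with the input length."""
    if len(symbols) <= 1:
        return [[symbols]]
    out = [[symbols]]  # the trivial split: no cut at all
    for i in range(1, len(symbols)):
        # cut after position i, then split the remainder recursively
        out.extend([symbols[:i]] + rest for rest in segregations(symbols[i:]))
    return out

print(len(segregations("abcd")))  # -> 8 (= 2**3)
```

Memoising on the suffix, or pruning cuts that cannot lead to dictionary words, are the kinds of refinement the future-work item above would need to explore.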

**Author Contributions:** Software, G.M.; Supervision, M.M.; Research, Writing—original draft, A.T. The authors contributed equally to all parts of the article in terms of literature review, adopted methodology, feature identification, model definition, experimentation and results analysis. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially performed in the context of the CHAMPIONs research project, which receives funding from the European Union's Internal Security Fund—Police, grant agreement no. 823705.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors wish to thank José L. Diego, Project Manager at the Valencia Local Police (Spain), for supporting this research activity.

**Conflicts of Interest:** The authors declare no conflicts of interest.

MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland. Tel. +41 61 683 77 34; Fax +41 61 302 89 18; www.mdpi.com

*Applied Sciences* Editorial Office E-mail: applsci@mdpi.com www.mdpi.com/journal/applsci
