*Article* **Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks**

**Aleksandr Romanov <sup>1</sup>, Anna Kurtukova <sup>1,</sup>\*, Alexander Shelupanov <sup>1</sup>, Anastasia Fedotova <sup>1</sup> and Valery Goncharov <sup>2</sup>**


**Abstract:** The article explores approaches to determining the author of a natural language text, along with their advantages and disadvantages. The problem is important because of the active digitalization of society and the shift of most everyday activities online. Authorship identification methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both for intellectual property protection in the digital space and for the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy: its average accuracy reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted by aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models' accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts, achieving 81% accuracy.

**Keywords:** authorship; text mining; machine learning; attribution; neural networks; deep learning; forensic intelligence

#### **1. Introduction**

It is well established that an author's individual characteristics can be determined from their writing style, since each text reflects a specific linguistic personality [1].

The topic of attribution overlaps with information security [2–5]. With the constant increase in the volume of transmitted and received documents, there are many opportunities for the illegal use of personal data. An example is a type of fraud in which an attacker sends an employee of an organization an email on behalf of a manager asking them to perform a specific action (e.g., to divulge confidential information of the organization or to transfer funds). In addition, attackers quite often hack a victim's social media accounts and send messages on the victim's behalf. One solution to this kind of problem is to compare the writing style of the suspicious texts with that of texts known to have been written by the person in question. The comparison makes it possible to determine the author. Establishing differences between documents based on writing style is most relevant when no other data are available to identify the author.

**Citation:** Romanov, A.; Kurtukova, A.; Shelupanov, A.; Fedotova, A.; Goncharov, V. Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks. *Future Internet* **2021**, *13*, 3. https://doi.org/10.3390/fi13010003

Received: 10 December 2020 Accepted: 23 December 2020 Published: 25 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

One type of violation in cyberspace is infringement of copyright and related rights in a text, for example, by claiming another author's text as one's own for material gain or by passing off one's own text as the work of another person. The effectiveness of intellectual property protection in the digital space is determined by the ability to resist such violations and the threat of their occurrence. Authorship identification methods make it possible to detect such infringements and establish the identity of the text's creator.

Interest in the topic is also due to a growth in the volume of text data, the evolution of technology, and social networks. Thus, automatic identification of authorship is a growing area of research, which is also important in the fields of forensic science and marketing.

In this article, we solve the problem of identifying the author of a Russian-language text using a support vector machine and deep neural networks. Literary texts written by Russian-speaking writers were used as input data. The article includes an overview of related works, the statement of the text authorship problem, a detailed description of approaches to solving the authorship identification problem, an impact evaluation of attacks on the developed approaches, and a discussion of the results obtained.

#### **2. Related Works**

An excellent overview of articles up to 2010 is presented in [1]. However, since then, methods based on deep neural networks (NN) have become more and more popular, replacing classical methods of machine learning. For example, the topic of author identification is considered annually at the PAN conference [6]. As part of the conference, researchers were offered two datasets of different sizes, containing texts by well-known authors.

The authors of [7] emphasize that they have proposed an approach that takes into account only the topic-independent features of a writing style. Guided by this idea, the authors chose several features such as the frequency of punctuation marks, highlighting the last word in a sentence, consideration of all existing categories of functional words, abbreviations and contractions, verb tenses, and adverbs of time and place. An ensemble of classifiers was used in the work. Each of them accepts or rejects the supposed authorship. Research is distinguished by the application of an approach that, in general, is aimed at recognizing a person based on his behavior. Here, Equal Error Rate (EER) has been applied as the thresholding mechanism. Essentially, the EER corresponds to the point on the curve where the false acceptance rate is equal to the false rejection rate. The results are 80% and 78% accuracy for the large and small datasets, respectively. The results of the approach allowed the authors to take third place among all the submitted works.
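The EER threshold described above can be illustrated with a minimal sketch. It assumes similarity scores where a higher value means "same author"; the function name and the toy scores are hypothetical and are not taken from [7].

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) for scores where higher means 'same author'."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for thr in thresholds:
        # False rejection: a genuine sample scored below the threshold.
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor sample scored at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest to each other.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores (illustrative only).
genuine = [0.9, 0.8, 0.7, 0.6, 0.4]
impostor = [0.5, 0.3, 0.2, 0.1, 0.65]
eer, thr = equal_error_rate(genuine, impostor)
```

On real data the two error curves rarely cross exactly at an observed score, so the sketch reports the average of the two rates at the closest point.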

In [8], stylometric features were extracted for each pair of documents. The absolute difference between the feature vectors was used as input data for the classifier. Logistic regression was used for a small dataset, and a NN was used for a large one. These models achieved 86% and 90% accuracy for small and large datasets, respectively. As a result, the authors of the study took second place.

The work that achieved the best result in the competition [9] presents the combination of NN with statistical modeling. Research is aimed at studying pseudo metrics that represent a variable-length text in the form of a fixed-size feature vector. To estimate the Bayesian factor in the studied metric space, a probability layer was added. The ADHOMINEM system [10] was designed to transmit the association of selected tokens into a two-level bi-directional long short-term memory (LSTM) network with an attention mechanism. Using additional attention levels made it possible to visualize words and sentences that were marked by the system as "very significant". It was also found that using the sliding window method instead of dividing a text into sentences significantly improves results. The proposed method showed excellent overall performance, surpassing all other systems in the PAN 2020 competition on both datasets. The accuracy was 94% for the large dataset and 90% for the small one.

The authors of [11] took into account the syntactic structure of a sentence when determining the author of a text, highlighting two components of the self-supervised network: lexical and syntactic sub-networks, which took a sequence of words and their corresponding structural labels as input data. The lexical sub-network was used to encode a sequence of words in a sentence, while the syntactic sub-network was used to encode selected labels, e.g., parts of speech. The proposed model was trained on the publicly available LAMBADA dataset, which contains 2662 texts of 16 different genres in English. The consideration of the syntactic structure made it possible to eliminate the need for semantic analysis. The resulting accuracy was 92.4%.

The work in [12] provides an overview of the methods for establishing authorship with the possibility of their subsequent application in the field of forensic research on social networks. According to the authors, in forensic sciences, there is a significant need for new attribution algorithms that can take context into account when processing multimodal data. Such algorithms should overcome the problem of a lack of information about all candidate authors during training. Functional words have been chosen as a feature, as they are quite likely to appear even in small samples and can therefore be particularly effective for analyzing social networks. Combinations of different sets of *n*-grams at symbol and word level with *n*-grams at the part-of-speech level were investigated. An accuracy of 70% was obtained for 50 authors.

The main idea of the study [13] is to modify the approach to establishing authorship by combining it with pre-trained language models. The corpus of texts consisted of essays by 21 undergraduate students written in five formats (essay, email, blog post, interview, and correspondence). The method is based on a recurrent neural network (RNN) operating at the symbol level and a multiheaded classifier. In cross-thematic authorship determination, the results were 67–91%, depending on the subject, and in cross-genre, 77–89%, depending on the genre.

The essence of [14] is to research document vectors based on *n*-grams. Experiments were conducted on a cross-thematic corpus containing some articles from 1999 to 2009 published in the English newspaper *The Guardian*. Articles by 13 authors were collected and grouped into five topics. To avoid overlapping, those articles for which content included more than one category were discarded. The results show that the method is superior to linear models based on *n*-gram symbols. To train the Doc2vec model, the authors used a third-party library called GENSIM 3. The best results were achieved on texts of large sizes. Accuracy for different categories ranged from 90.48 to 96.77%.

In [15], an ensemble approach that combines predictions made by three independent classifiers is presented. The method is based on variable-length *n*-gram models and multinomial logistic regression and selects the highest-likelihood prediction among the three models. Two evaluation experiments were conducted: using the PAN-CLEF 2018 test dataset (93% accuracy) and a new corpus of lyrics in English and Portuguese (52% accuracy). The results demonstrate that the proposed approach is effective for fiction texts but not for lyrics.

The research conducted in [16] used the support vector machine (SVM). Parameters for defining the writing style were highlighted at different levels of the text. The authors demonstrated that more complex parameters are capable of extracting the stylometric elements presented in the texts. However, they are most efficiently used in combination with simpler and more understandable *n*-grams. In this case, they improve the result. The dataset included 20 samples in four different languages (English, French, Italian, and Spanish). Thus, five samples from 500 to 1000 words in each language were used. The challenge was to assign each document in the set of unknown documents to a candidate author from the problem set. The results were 77.7% for Italian, 73% for Spanish, 68.4% for French, and 55.6% for English.

Authorship identification methods are used not only for literary texts but also to determine plagiarism in scientific works. For example, [17] presents a system for resolving the ambiguity of authorship of articles in English using Russian-language data sources. Such a solution can improve the search results for articles by a specific author and the calculation of the citation index. The link.springer.com database was used as the initial repository of publications, and the eLIBRARY.ru scientific electronic library was used to obtain reliable information about authors and their articles. To assess the quality of the comparison, experiments were carried out on the data of employees of the A.P. Yershov Institute of Informatics Systems. The sample included 25 employees whose publications are contained in the link.springer.com system. To calculate the similarity of natural language texts, they were represented as vectors in a multidimensional space. To construct the vector representation, a bag-of-words algorithm was used with the term frequency–inverse document frequency (TF-IDF) measure. Stop-words were removed from the texts beforehand, and the words were stemmed. Experiments were also conducted on the vectorization of natural language texts using word2vec. The average percentage of an author's publications recognized by the system was 79%, while the number of publications that did not belong to the author but were assigned to his group was close to zero. The approaches used in the system are applicable for disambiguating authorship of publications from various bibliographic databases. The implemented system showed a result of 92%.
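As a rough illustration of the bag-of-words representation with TF-IDF weights, the sketch below builds sparse vectors and compares them. Cosine similarity is an assumption here, since the text does not name the similarity measure used in [17]; stop-word removal and stemming are omitted for brevity, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy "publications", already tokenized (illustrative only).
docs = [
    "graph ontology query graph".split(),
    "graph ontology reasoning".split(),
    "compiler parser grammar".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents sharing vocabulary
sim_unrelated = cosine(vecs[0], vecs[2])  # documents with disjoint vocabulary
```

Documents with disjoint vocabularies get a similarity of exactly zero, which is one reason dense representations such as word2vec are often tried alongside TF-IDF.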

There were only a few works that achieved a high level of author identification in Arabic texts. In [18], the Technique for Order Preferences by Similarity to Ideal Solution (TOPSIS) was used to select the basic classifier of the ensemble. More than 300 stylometric parameters were extracted as attribution features. The AdaBoost and Bagging methods were applied to the dataset in Arabic. Texts were taken from six sources. Corpora included both short and long texts by three hundred authors writing in various genres and styles. The final accuracy was 83%.

A new area of research is attribution, which uses not only human-written texts but also texts obtained using generation [19]. Several recently proposed language models have demonstrated an amazing ability to generate texts that are difficult to distinguish from those written by humans. In [20], a study of the problem of authorship attribution is proposed in two versions: determining the authorship of two alternative human–machine texts and determining the method that generated the text. One human-written text and eight machine-generated texts (CTRL, GPT, GPT2, GROVER, XLM, XL-NET, PPLM, FAIR) were used. Most generators still produce texts that significantly differ from texts written by humans, which makes it easier to solve the problem. However, the texts generated by GPT2, GROVER, and FAIR are of significantly better quality than the rest, which often confuses classifiers. For these tasks, convolutional neural networks (CNN) were used, since the CNN architecture is better suited to reflect the characteristics of each author. In addition, the authors improved the implementation of the CNN using *n*-gram words and part-of-speech (PoS) tags. The result in the "human-machine" category ranges from 81% to 97%, depending on the generator, and, for determining the generation method, 98%.

The author of [21] presented the software product StylometRy, which identifies the author of a disputed text. Texts were represented with a bag-of-words model. A naive Bayes classifier, the *k*-nearest neighbors method, and logistic regression were chosen as classifiers, and pronouns were used as linguistic features. The models were tested on texts by L. Tolstoy, M. Gorky, and A. Chekhov. The minimum text volume for analysis was 5500 words. The accuracy of the model for texts over 150,000 characters was in the range of 60–100% (average 87%).

The scientific work [22] describes the features of four styles of the Russian language: scientific, official, literary, and journalistic. The parameters selected for text analysis were: the ratio of the number of verbs, nouns, adjectives, pronouns, particles, and interjections to the number of words in the text, the number of "noun + noun" constructions, the number of "verb + noun" constructions, the average word length, and the average sentence length. Decision trees were used for classification. The accuracy of the analysis of 65 texts of each style was 88%. The highest accuracy was achieved when classifying official and literary texts, and the lowest was achieved for journalistic texts.

The authors of [23] present the analysis and application of various NN architectures (RNN, LSTM, CNN, bi-directional LSTM). The study was conducted on three datasets in Russian (Habrahabr blog: 30 authors, average text length 2000 words; vk.com: 50 and 100 authors, average text length 100 words; Echo.msk.ru: 50 and 100 authors, average text length 2000 words). The best results were achieved by CNN (87% for Habrahabr blog, 59% and 53% for 50 and 100 authors with vk.com, respectively). Character trigrams performed significantly better for short texts from social networks, while for longer texts, trigram and tetragram representations achieved almost the same accuracy (84% for trigrams, 87% for tetragrams).

The object of research study [24] is journalistic articles from Russian pre-revolutionary magazines. The information system Statistical Methods of Literary Texts Analysis (SMALT) has been developed to calculate various linguistic and statistical features (distribution of parts of speech, average word and sentence length, vocabulary diversity index). Decision trees were used to determine the authorship. The resulting accuracy was 56%.

The problem of authorship attribution of short texts obtained from Twitter was considered in [25]. The authors proposed a method of learning text representations that uses words and character *n*-grams jointly as input to the NNs. The authors used an additional feature set with 10 elements: text length, number of usernames, topics, emoticons, URLs, numeric expressions, time expressions, date expressions, polarity level, and subjectivity level. Two series of comparative experiments were conducted using CNN and LSTM. The method achieved an accuracy of 83.6% on the corpus containing 50 authors.

The authors of [26] applied integrated syntactic graphs (ISGs) to the task of automatic authorship attribution. ISGs allow for combining different levels of language description into a single structure. Textual patterns were extracted based on features obtained from the shortest path walks over integrated syntactic graphs. The analysis was provided on lexical, morphological, syntactic, and semantic levels. Stanford dependency parser and WordNet taxonomy were applied in order to obtain the parse trees of the sentences. The feature vectors extracted from the ISGs can be used for building syntactic *n*-grams by introducing them into machine learning methods or as representative vectors of a document collection. Authors showed that these patterns, used as features, allow determining the author of a text with a precision of 68% for the C10 corpus and also performed experiments for the PAN'13 corpus, obtaining a precision of 83.3%.

An approach based on joint implementation of words, *n*-grams, and the latent Dirichlet allocation (LDA) was presented in [27]. The LDA-based approach allows the processing of sparse data and volumetric texts, giving a more accurate representation. The described approach is an unsupervised computational methodology that is able to take into account the heterogeneity of the dataset, a variety of text styles, and also the specificity of the Urdu language. The considered approach was tested on 6000 texts written by 15 authors in Urdu. The improved sqrt-cosine similarity was used as a classifier. As a result, an accuracy of 92.89% was achieved.

The idea of encoding the syntax parse tree of a sentence into a learnable distributed representation is proposed in [28]. An embedding vector is created for each word in the sentence, encoding the corresponding path in the syntax tree for the word. The one-to-one correspondence between syntax-embedding vectors and words (hence their embedding vectors) in a sentence makes it easy to integrate obtained representation into the word-level Natural Language Processing (NLP) model. The demonstrated approach has been tested using CNN. The model consists of five types of layers: syntax-level feature embedding, content-level feature embedding, convolution, max pooling, and softmax. The accuracy obtained on the datasets was 88.2%, 81%, 96.16%, 64.1%, and 56.73% on five benchmarking datasets (CCAT10, CCAT50, IMDB62, Blogs10, and Blogs50, respectively).

The authors of [29] combined widely known text features (verb tense frequency, verb frequency in a sentence, verb usage frequency, comma frequency in a sentence, sentence length frequency, word usage frequency, word length frequency, character *n*-gram frequency) with a genetic algorithm to find the optimal weight distribution. The genetic algorithm is configured with a mutation probability of 0.2 using a Gaussian convolution on the values with a standard deviation of 0.3 and evolved over 1000 generations. The method was tested on the Gutenberg Dataset, consisting of 3036 texts written by 142 authors. The method is implemented using Stanford CoreNLP, stemming, PoS tagging, and a genetic algorithm. The obtained accuracy was 86.8%.

There is no generally accepted opinion regarding the set of text features that provides the best result. In most works, text features such as bigrams and trigrams of symbols and words, functional words, the most frequent words in the language, the distribution of words in parts of speech, punctuation marks, and the distribution of word length and sentence length have proven to be effective. It is incorrect to judge the accuracy of the methods applied to the Russian language based on the results of research in the English language or any other languages because of the specific structure of each language. The choice of approach depends on the text language, the authorship identification method, and the accuracy of the available analysis methods. Particularly, the peculiarity of the Russian language in comparison with English, for which most of the results are presented, is its flexibility and, consequently, more complex word formation and a high degree of morphological and syntactic homonymy, which makes it difficult to use some features useful for the English language. The problems of genre, sample representativeness, and dataset size also limit the implementation of some approaches.

Investigations aimed at finding a method with high separating ability with a large number of possible authors are not always useful when solving real-life tasks. It is necessary to continue further research aimed at finding new methods or improving/combining existing methods of identifying the author, as well as conducting experiments aimed at finding features that allow accurately dividing the styles of authors of Russian-language texts. By using these features, it will be possible to work with small samples.

#### **3. Problem Statement**

We define the identification of the text author as the process of determining the author based on a set of general and specific features of the text that formed the author's style.

The problem of identifying the author of a text with a limited set of alternatives is formulated as follows. There are a set of texts **T** = {*t*<sub>1</sub>, . . . , *t<sub>k</sub>*} and a set of authors **A** = {*a*<sub>1</sub>, . . . , *a<sub>l</sub>*}. For a certain subset of texts **T**′ = {*t*<sub>1</sub>, . . . , *t<sub>m</sub>*} ⊆ **T**, the authors are known; i.e., there is a set of text–author pairs **D** = {(*t<sub>i</sub>*, *a<sub>j</sub>*)}<sup>*m*</sup><sub>*i*=1</sub>. It is necessary to determine which author from set **A** is the true author of each of the remaining (anonymous or disputed) texts **T**″ = {*t*<sub>*m*+1</sub>, . . . , *t<sub>k</sub>*} ⊆ **T**.

In this statement, the author identification problem can be considered as a multi-label classification task. In this case, set **A** is the set of predefined classes and their labels, set **D** is the set of training samples, and the objects to be classified are included in the set **T**″. The goal is to develop a classifier that solves the problem, that is, to find the objective function **F** : **T** × **A** → [−1, 1], which assigns a text from the set **T** to its true author. The function value is interpreted as the degree to which the object belongs to the class, where 1 corresponds to a completely positive decision and −1 to a completely negative one.
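One natural way to turn the membership degrees into a decision, assumed here rather than stated in the text, is to pick the candidate author with the highest value of **F**. The sketch below uses hypothetical text and author identifiers and hand-written scores in place of a trained classifier.

```python
def identify_author(text, authors, score):
    """Return the candidate author with the highest membership degree F(text, author)."""
    return max(authors, key=lambda author: score(text, author))


# Hypothetical membership degrees in [-1, 1] for one disputed text "t7".
toy_scores = {("t7", "a1"): -0.6, ("t7", "a2"): 0.8, ("t7", "a3"): 0.1}
predicted = identify_author(
    "t7", ["a1", "a2", "a3"], lambda t, a: toy_scores[(t, a)]
)
```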

#### **4. Methods for Determining the Author of a Natural Language Text**

Early research [1] was aimed at evaluating the accuracy and speed of classifiers based on machine learning algorithms. At that time, the SVM classifier demonstrated the best results on all criteria. However, over the past 10 years, many solutions based on deep NNs have appeared in the field of NLP: RNN and CNN for multi-label text categorization, category text generation, and learning word dependencies, and hybrid networks for aspect-based sentiment analysis. These solutions significantly exceed the effectiveness of traditional algorithms. As of 2020, LSTM, CNN with self-attention, and Transformer [30,31] are the models that successfully solve related text analysis problems. Thus, the purpose of the study was to compare SVM with modern classification methods based on deep NNs. The enumerated models, their mathematical foundations, and the techniques for applying them to authorship attribution are described below.

#### *4.1. Support Vector Machine*

The SVM classifier is similar to the classical perceptron. Applying kernel transformations makes it possible to train a radial basis function network or a perceptron with a sigmoidal activation function whose weights are determined by solving a quadratic programming problem with linear constraints, whereas training a standard NN implies solving an unconstrained non-convex minimization problem. In addition, SVM works directly with a high-dimensional vector space without preliminary analysis and without manual selection of the number of neurons in the hidden layer.

The main difference between SVM and deep-learning models is that SVM is unable to find unobvious informative features in text that have not been pre-processed. Therefore, it is necessary to first extract such features from the text.

Let us denote the set of letters of the alphabet, numbers, and separators **A** = {*a*<sub>1</sub>, *a*<sub>2</sub>, . . . , *a*<sub>|**A**|</sub>}, the set of possible morphemes **M** = {*m*<sub>1</sub>, *m*<sub>2</sub>, . . . , *m*<sub>|**M**|</sub>}, the language dictionary **W** = {*w*<sub>1</sub>, *w*<sub>2</sub>, . . . , *w*<sub>|**W**|</sub>}, the set of phrases **C** = {*c*<sub>1</sub>, *c*<sub>2</sub>, . . . , *c*<sub>|**C**|</sub>}, the set of sentences **S** = {*s*<sub>1</sub>, *s*<sub>2</sub>, . . . , *s*<sub>|**S**|</sub>}, and the set of paragraphs **P** = {*p*<sub>1</sub>, *p*<sub>2</sub>, . . . , *p*<sub>|**P**|</sub>}. Then, the text *T* can be represented as sequences of elements as follows:

$$T = \left\{ a_j^i \right\}_{i=1}^{N_a} = \left\{ m_j^i \right\}_{i=1}^{N_m} = \left\{ w_j^i \right\}_{i=1}^{N_w} = \left\{ c_j^i \right\}_{i=1}^{N_c} = \left\{ s_j^i \right\}_{i=1}^{N_s} = \left\{ p_j^i \right\}_{i=1}^{N_p} \tag{1}$$

where *a<sup>i</sup><sub>j</sub>* ∈ **A**, *m<sup>i</sup><sub>j</sub>* ∈ **M**, *w<sup>i</sup><sub>j</sub>* ∈ **W**, *c<sup>i</sup><sub>j</sub>* ∈ **C**, *s<sup>i</sup><sub>j</sub>* ∈ **S**, *p<sup>i</sup><sub>j</sub>* ∈ **P**; *N<sub>a</sub>*, *N<sub>m</sub>*, *N<sub>w</sub>*, *N<sub>c</sub>*, *N<sub>s</sub>*, *N<sub>p</sub>* are the numbers of characters, morphemes, words, phrases, sentences, and paragraphs in the text, respectively.

Thus, the SVM feature space can be described as vectors of features that reflect the properties of text elements: {*a*′<sub>1</sub>, . . . , *a*′<sub>*k*</sub>} for symbols, {*m*′<sub>1</sub>, . . . , *m*′<sub>*l*</sub>} for morphemes, {*w*′<sub>1</sub>, . . . , *w*′<sub>*n*</sub>} for words, {*c*′<sub>1</sub>, . . . , *c*′<sub>*r*</sub>} for phrases, {*s*′<sub>1</sub>, . . . , *s*′<sub>*t*</sub>} for sentences, and {*p*′<sub>1</sub>, . . . , *p*′<sub>*u*</sub>} for paragraphs.

In the study, informative features are fed to the SVM as an unordered collection. The frequencies of individual text elements are used as follows:

$$t_k = \begin{cases} 1, & \text{if } w_i^j \in \mathbf{W} \\ 0, & \text{if } w_i^j \notin \mathbf{W} \end{cases}, \quad j = \overline{1, n_i}, \quad k = \overline{1, |\mathbf{W}|}. \tag{2}$$
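Reading Equation (2) as a presence indicator over dictionary entries, a text can be turned into a binary vector of length |**W**|. The sketch below is one possible interpretation rather than the authors' implementation; the five-word dictionary and the sample sentence are hypothetical.

```python
def presence_vector(text_words, dictionary):
    """Return t_k in {0, 1} for every dictionary word: 1 if it occurs in the text."""
    seen = set(text_words)
    return [1 if word in seen else 0 for word in dictionary]


# Hypothetical dictionary of Russian function words and a toy sentence.
dictionary = ["и", "в", "не", "однако", "впрочем"]
text = "однако он не пришёл и не позвонил".split()
features = presence_vector(text, dictionary)
```

Because the vector ignores word order, it matches the "unordered collection" of features that the SVM receives.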

In addition, sequences of text elements of a given length (*n*-grams), or a limited number of them from the dictionary, are used as follows:

$$f(a_i, \dots, a_{i+n-1}) = \frac{C(a_i, \dots, a_{i+n-1})}{L}, \tag{3}$$

$$P(a_i \mid a_{i-n+1}, \dots, a_{i-1}) = \frac{C(a_{i-n+1}, \dots, a_i)}{C(a_{i-n+1}, \dots, a_{i-1})}, \tag{4}$$

where *L* is the total number of counted *n*-grams; *k* is the threshold value; *f*() is the relative frequency of the element in the text; *a* is the symbol; *P*() is the probability of the element appearing in the text; *n* is the length of the *n*-gram.
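Equations (3) and (4) can be checked on a short string. The sketch below counts character *n*-grams with the standard library; the helper names and the sample word are illustrative assumptions.

```python
from collections import Counter


def ngram_counts(symbols, n):
    """Count every contiguous n-gram in a sequence of symbols."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))


def relative_frequency(symbols, ngram):
    """Equation (3): occurrences of the n-gram divided by the number of counted n-grams L."""
    counts = ngram_counts(symbols, len(ngram))
    total = sum(counts.values())
    return counts[tuple(ngram)] / total


def conditional_probability(symbols, history, symbol):
    """Equation (4): C(history followed by symbol) / C(history)."""
    n = len(history) + 1
    full = ngram_counts(symbols, n)[tuple(history) + (symbol,)]
    prefix = ngram_counts(symbols, n - 1)[tuple(history)]
    return full / prefix if prefix else 0.0


text = list("abracadabra")
f_ab = relative_frequency(text, "ab")                  # 2 of the 10 bigrams are "ab"
p_r_given_ab = conditional_probability(text, "ab", "r")  # "ab" is always followed by "r"
```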

It should be noted that for short texts, it is proposed to use frequencies smoothed by the Laplace (5), Good-Turing (6), and Katz (7) methods, which makes it possible to estimate the probabilities of events that did not occur:

$$P_{ADD}(a_i, \dots, a_{i+n-1}) = \frac{1 + C(a_i, \dots, a_{i+n-1})}{|\mathbf{W}| + \sum_i C(a_i, \dots, a_{i+n-1})}, \tag{5}$$

where *P<sub>ADD</sub>* is the Laplace estimate; |**W**| is the size of the language dictionary **W**; *C*() is the number of occurrences of the element in the text.
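A small sketch of add-one (Laplace) smoothing follows. It assumes that the dictionary term in the denominator of Equation (5) is the number of distinct possible elements; the 26-letter inventory and the sample word are illustrative.

```python
from collections import Counter


def laplace_probability(counts, vocabulary_size, item):
    """Equation (5): (1 + C(item)) / (vocabulary size + total count)."""
    total = sum(counts.values())
    return (1 + counts.get(item, 0)) / (vocabulary_size + total)


counts = Counter("abracadabra")  # a: 5, b: 2, r: 2, c: 1, d: 1
alphabet_size = 26               # assumed inventory: lowercase Latin letters
p_seen = laplace_probability(counts, alphabet_size, "a")
p_unseen = laplace_probability(counts, alphabet_size, "z")
```

Unseen symbols receive a small non-zero probability, and the estimates over the whole inventory still sum to one.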

$$P_{GT}^{*} = \frac{C^{*}}{N}, \quad P_{GT}^{*} = \frac{N_1}{N}, \quad C^{*} = (C + 1)\frac{N_{C+1}}{N_C}, \tag{6}$$

where *P*<sup>\*</sup><sub>*GT*</sub> is the Good-Turing estimate (*C*\*/*N* for an element that occurred *C* > 0 times and *N*<sub>1</sub>/*N* for the total probability of elements that did not occur); *N* is the total number of the considered elements of the text; *N<sub>C</sub>* is the number of text elements encountered exactly *C* times; *C*\* is the discounted Good-Turing count.
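The Good-Turing quantities in Equation (6) can be computed from a frequency-of-frequencies table. The sketch below is a simplified illustration: when no element occurs *C* + 1 times it keeps the raw count, which is an assumption of this example and not part of the method as stated; practical implementations usually smooth the *N<sub>C</sub>* values first.

```python
from collections import Counter


def good_turing(counts):
    """Return (discounted counts C*, probability mass N_1 / N reserved for unseen elements)."""
    total = sum(counts.values())             # N
    freq_of_freq = Counter(counts.values())  # N_C: how many elements occur exactly C times
    discounted = {}
    for item, c in counts.items():
        if freq_of_freq.get(c + 1, 0) > 0:
            discounted[item] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # N_{C+1} = 0 would zero the estimate; keep the raw count (simplification).
            discounted[item] = float(c)
    unseen_mass = freq_of_freq.get(1, 0) / total
    return discounted, unseen_mass


counts = Counter("abracadabra")  # a: 5, b: 2, r: 2, c: 1, d: 1
c_star, p_unseen = good_turing(counts)
```

On a sample this small the discounted counts are not meaningful estimates; the example only shows how each term of the formula is obtained.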

$$P_{KATZ}(a_i \mid a_{i-n+1}, \dots, a_{i-1}) = \begin{cases} P^{*}(a_i \mid a_{i-n+1}, \dots, a_{i-1}), & \text{if } C(a_{i-n+1}, \dots, a_i) > k \\ \alpha(a_{i-n+1}, \dots, a_{i-1}) \, P_{KATZ}(a_i \mid a_{i-n+2}, \dots, a_{i-1}), & \text{if } 1 \le C(a_{i-n+1}, \dots, a_i) \le k \end{cases}, \tag{7}$$

where *t<sub>k</sub>* indicates whether the *j*-th word of the *i*-th text exists in the dictionary **W**; *P<sub>KATZ</sub>* is the Katz estimate; *α*() is the weight coefficient.

When attributing authorship of a natural language text with classical machine learning methods, standard feature sets can be supplemented with features obtained by solving related tasks, such as determining the author's gender and age, the author's level of education, or the sentiment of the text. In this study, aspect-oriented analysis was also used to extract informative features. This type of analysis involves understanding the meaning of a text by identifying aspect terms or categories, which makes it possible to extract keywords and opinions related to aspects.

There are two well-known approaches to implementing aspect analysis: statistical and linguistic. The statistical approach extracts aspects, determines a threshold value for them, and selects those aspects whose values exceed the threshold. The linguistic approach takes into account the syntactic structure of the sentence and searches for aspects by patterns.

We decided to use a combination of these methods. Aspects chosen were nouns and noun phrases (statistical approach), and the syntactic structure of the sentence was determined based on the dependencies between words (linguistic approach).

A multi-layered NN consisting of fully connected layers was implemented to extract aspects. The following training parameters were used:


The principle of operation of SVM is to construct a hyperplane in a high-dimensional feature space in such a way that the gap between the support vectors (the extreme points of the two classes) is maximized. The original data are mapped onto a space with a linear separating surface using a kernel transformation:

$$\left(\Phi(\mathbf{x}), \Phi(\mathbf{x'})\right) = k(\mathbf{x}, \mathbf{x'}), \tag{8}$$

where (Φ(*x*), Φ(*x*′)) is the inner product between the sample being recognized and the training samples, Φ is a mapping of the original space onto a space with an inner product (a space of dimension sufficient for linear separability), and *k* is the kernel function.

Then the function performing the classification looks like this:

$$f(\mathbf{x}) = \sum\_{i=1}^{l} \alpha\_i y\_i k(\mathbf{x}\_i, \mathbf{x}) + b,\tag{9}$$

where *α<sub>i</sub>* are the optimal coefficients, *k* is the kernel function, *y<sub>i</sub>* are the class labels, and *b* is the parameter that ensures the fulfillment of the second Karush-Kuhn-Tucker condition for all input samples corresponding to Lagrange multipliers that are not on the boundaries.

The optimal coefficient *α* is determined by maximizing the objective function:

$$W(\alpha) = \sum\_{i=1}^{l} \alpha\_{i} - \frac{1}{2} \sum\_{i,j=1}^{l} y\_{i} y\_{j} \alpha\_{i} \alpha\_{j} k(\mathbf{x}\_{i}, \mathbf{x}\_{j}),\tag{10}$$

subject to the constraint

$$\sum\_{i=1}^{l} \alpha\_i y\_i = 0,\tag{11}$$

in the positive quadrant 0 ≤ *α<sub>i</sub>* ≤ *C*, *i* = 1, …, *l*.

The regularization parameter *C* controls the trade-off between the number of errors on the training set and the size of the gap (margin).
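The decision function (9) and the dual objective (10) can be illustrated on a toy problem small enough to solve by hand. The sketch below assumes a linear kernel and two one-dimensional training points, for which the optimal coefficients are 0.5 each and *b* = 0. All names and values are hypothetical and unrelated to the parameters used in the study.

```python
def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, samples, labels, alphas, b, kernel=linear_kernel):
    """Equation (9): f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * y * kernel(s, x) for s, y, a in zip(samples, labels, alphas)) + b


def dual_objective(samples, labels, alphas, kernel=linear_kernel):
    """Equation (10): sum_i alpha_i - 0.5 * sum_ij y_i y_j alpha_i alpha_j k(x_i, x_j)."""
    size = len(samples)
    quadratic = sum(
        labels[i] * labels[j] * alphas[i] * alphas[j] * kernel(samples[i], samples[j])
        for i in range(size)
        for j in range(size)
    )
    return sum(alphas) - 0.5 * quadratic


# Toy training set: one point per class, symmetric around the origin.
X = [(1.0,), (-1.0,)]
y = [1, -1]
alpha = [0.5, 0.5]   # hand-derived optimum for this toy problem
bias = 0.0

print(decision_function((2.0,), X, y, alpha, bias))
print(dual_objective(X, y, alpha))
```

The chosen coefficients satisfy constraint (11), and moving both of them away from 0.5 (for example to 0.4 or 0.6) lowers the value of the objective, which is consistent with 0.5 being the maximizer in this toy case.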

#### *4.2. Deep Neural Networks*

A distinctive feature of deep NNs is their ability to analyze a text sequence and extract informative features on their own. In some studies, texts are fed to the model unchanged [1]. However, when determining the author of a natural language text, preliminary preparation is an important stage.

The purpose of preprocessing is to clean the dataset of noise and redundant information. In this study, the following actions were taken to clean up the texts:


The data obtained from preprocessing must be converted into a vector form that the NN can process. For this purpose, word embeddings were used: a text representation in which words with similar meanings are represented by vectors close to each other in the vector space. The resulting word representations are fed to the inputs of the deep NN.

#### 4.2.1. Long Short-Term Memory

LSTM is a successful modification of the classical RNN that avoids the problem of vanishing or exploding gradients. This problem arises because the weights of a classical RNN are the same for all time steps during error backpropagation, so the signal becomes either too weak (decreases exponentially) or too strong (increases exponentially). LSTM is designed to solve this problem.

The LSTM model contains the following elements:


Consider the time step *t*. The inputs to the LSTM cell are the current input vector **X**<sub>*t*</sub>, the previous hidden state *H*<sub>*t*−1</sub>, and the previous memory state *C*<sub>*t*−1</sub>. The cell outputs are the current hidden state *H<sub>t</sub>* and the current memory state *C<sub>t</sub>*. The following formulas are used to calculate the outputs:

$$f\_t = \sigma(\mathbf{X}\_t \* \mathbf{U}\_f + H\_{t-1} \* \mathbf{W}\_f),\tag{12}$$

$$\overline{\mathbf{C}}\_{t} = \tanh(\mathbf{X}\_t \* \mathbf{U}\_c + H\_{t-1} \* \mathbf{W}\_c),\tag{13}$$

$$I\_t = \sigma(\mathbf{X}\_t \* \mathbf{U}\_i + H\_{t-1} \* \mathbf{W}\_i),\tag{14}$$

$$O\_t = \sigma(\mathbf{X}\_t \* \mathbf{U}\_o + H\_{t-1} \* \mathbf{W}\_o),\tag{15}$$

where **X**<sub>*t*</sub> is the input vector; *H*<sub>*t*−1</sub> is the hidden state of the previous cell; *C*<sub>*t*−1</sub> is the memory state of the previous cell; *H<sub>t</sub>* is the hidden state of the current cell; *C<sub>t</sub>* is the memory state of the current cell at time *t*; **W** and **U** are the weight matrices of the forget gate *f*, the candidate gate, the input gate, and the output gate; *σ* is the sigmoid function; and *tanh* is the hyperbolic tangent function.

The most important role is played by the memory state *C<sub>t</sub>*, in which the input context is stored. It changes dynamically depending on the need to add or remove information. If the value of the forget gate is 0, the previous state is completely forgotten; if it is 1, the previous state is completely transferred to the cell. Given the previous memory state *C*<sub>*t*−1</sub>, the new one is calculated as:

$$\mathbf{C}\_{t} = f\_{t} \* \mathbf{C}\_{t-1} + I\_{t} \* \overline{\mathbf{C}}\_{t}. \tag{16}$$

Then the hidden state *H<sub>t</sub>* at time *t* is calculated from the memory state:

$$H\_t = O\_t \* \tanh(C\_t).\tag{17}$$

The resulting *C<sub>t</sub>* and *H<sub>t</sub>* are passed to the next time step, and the process is repeated.
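A single LSTM time step following (12)–(17) can be sketched in plain Python as shown below. The sketch assumes that every gate uses the previous hidden state, that no bias terms are present (as in the formulas above), and that the gate outputs are combined elementwise. The parameter names and the all-zero weights are hypothetical and serve only to make the result easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, U, h, W):
    """Compute x @ U + h @ W for row vectors x, h and matrices given as lists of rows."""
    size = len(W[0])
    return [
        sum(x[i] * U[i][j] for i in range(len(x)))
        + sum(h[i] * W[i][j] for i in range(len(h)))
        for j in range(size)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps hypothetical names such as 'U_f' to weight matrices."""
    f_t = [sigmoid(z) for z in affine(x_t, p["U_f"], h_prev, p["W_f"])]       # (12)
    c_bar = [math.tanh(z) for z in affine(x_t, p["U_c"], h_prev, p["W_c"])]   # (13)
    i_t = [sigmoid(z) for z in affine(x_t, p["U_i"], h_prev, p["W_i"])]       # (14)
    o_t = [sigmoid(z) for z in affine(x_t, p["U_o"], h_prev, p["W_o"])]       # (15)
    c_t = [f * c + i * cb for f, c, i, cb in zip(f_t, c_prev, i_t, c_bar)]    # (16)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                        # (17)
    return h_t, c_t


# With all-zero weights every gate equals 0.5 and the candidate state equals 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("U_f", "W_f", "U_c", "W_c", "U_i", "W_i", "U_o", "W_o")}
h_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, -2.0], params)
print(h_t, c_t)
```

In this degenerate case the forget gate keeps exactly half of the previous memory state, which illustrates how the gate value scales the information carried forward.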

#### 4.2.2. CNN with Attention

A CNN consists of many convolutional layers and subsampling layers. Each convolutional layer uses filters with input and output dimensions *D<sub>in</sub>* and *D<sub>out</sub>*. The layer is parameterized by the four-dimensional kernel tensor **W** and the bias vector *b* of dimension *D<sub>out</sub>*. The output value for some word *q* is therefore:

$$Y\_q = \sum\_{\Delta} \mathbf{X}\_{q+\Delta} \mathbf{W}\_{\Delta} + b,\tag{18}$$

where ∆ is the shift within the kernel.
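The convolution in (18) can be sketched for a one-dimensional token sequence as follows. The sketch assumes that the kernel weights are indexed by the shift ∆, that the same weights are shared across all positions *q*, and that positions outside the sequence are treated as zeros; the function name and the toy values are hypothetical.

```python
def conv1d(X, W, b):
    """Apply (18) to a sequence.

    X: list of T token vectors, each of length D_in.
    W: dict mapping a shift (for example -1, 0, 1) to a D_in x D_out matrix.
    b: bias vector of length D_out.
    """
    length = len(X)
    d_out = len(b)
    output = []
    for q in range(length):
        y_q = list(b)
        for shift, w_shift in W.items():
            position = q + shift
            if 0 <= position < length:   # assumption: zero padding at the borders
                for i, x_value in enumerate(X[position]):
                    for j in range(d_out):
                        y_q[j] += x_value * w_shift[i][j]
        output.append(y_q)
    return output


# Toy example: scalar tokens and a window of three with unit weights (a moving sum).
tokens = [[1.0], [2.0], [3.0]]
kernel = {-1: [[1.0]], 0: [[1.0]], 1: [[1.0]]}
features = conv1d(tokens, kernel, [0.0])
print(features)
```

Each output depends only on the tokens inside the window around position *q*, which is the locality that the attention mechanism described next removes.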

The main difference between the attention mechanism and CNN is that the new meaning of a word is determined by all the other words of the sentence, since the receptive field of attention includes the full context and not just a grid of nearby words.

The attention mechanism takes as input a token feature matrix, query vectors, and several key-value pairs. Each of the vectors is transformed by a trainable linear transform, and then the inner product of the query vector with each key is calculated in turn. The result is passed through Softmax, and the value vectors are summed into a single vector with the weights obtained from Softmax. As a result of applying the attention mechanism, a matrix is obtained in which each vector contains information about the value of the corresponding token in the context of the other tokens.
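The sequence of operations described above (inner products, Softmax, weighted sum of values) can be sketched for a single query as follows. For brevity the sketch omits the trainable linear transforms and any score scaling, so it shows only the core computation; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Return the weighted sum of value vectors and the attention weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


# Two identical keys receive equal weights, so the result is the mean of the values.
context, weights = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 4.0]])
print(context, weights)
```

Because the weights are computed over all keys, every token can contribute to the result regardless of its distance from the query position.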

#### 4.2.3. Transformer

The attention mechanism in its pure form can lose information and complicate convergence. To address this problem, a more complex modification was also tried: the transformer.

The transformer consists of an encoder and a multi-head attention mechanism. Some of the transformer layers are fully connected, and some have shortcut (residual) connections. A mandatory component of the architecture is multi-head attention, which allows each input vector to interact with other tokens using the attention mechanism. The study uses a common combination of multi-head attention, a residual layer, and a fully connected layer. The depth of the model is created by repeating this combination 6 times.

A distinctive feature of multi-head attention is that there are several attention mechanisms and they are trained in parallel. The final result is concatenated, passed through a trainable linear transformation once again, and goes to the output. Formally, it can be described as follows. The attention layer is determined by the size of the key/query *D<sub>k</sub>*, the number of heads *N<sub>h</sub>*, the size of the head *D<sub>h</sub>*, and the output dimension *D<sub>out</sub>*. The layer is parametrized by the key matrix **W**<sub>key</sub>, the query matrix **W**<sub>qry</sub>, and the value matrix **W**<sub>val</sub> for each head, together with the projection matrix **W**<sub>out</sub> used to assemble all the heads together. Attention for each head is calculated as:

$$A\_q = \mathbf{X}\_q \mathbf{W}\_{qry} \mathbf{W}\_{key}^{T} \mathbf{X}\_k^{T}.\tag{19}$$

The actual head value is calculated as:

$$H\_q^{(h)} = \sum\_{k' \in [W] \times [H]} softmax(A\_q^{(h)})\_{k'} \mathbf{X}\_{\mathbf{k'}} \mathbf{W}\_{\text{val}}^{(h)}.\tag{20}$$

And the output value is calculated as follows:

$$H\_q = \text{concat}(H\_q^{(1)}, \dots, H\_q^{(N\_h)}) \mathbf{W}\_{\text{out}} + b\_{\text{out}},\tag{21}$$

where **X** are the input values, **W**<sub>key</sub> is the matrix of keys, T is the transposition operation, *A<sub>q</sub>* is the attention value for a particular head, *k* is the key position, *q* is the query position, *N<sub>h</sub>* is the number of heads, and *b<sub>out</sub>* is the bias vector of dimension *D<sub>out</sub>*.
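Formulas (19)–(21) can be combined into a compact sketch of a multi-head attention layer. It assumes that **X** is a matrix with one row per token, that Softmax is applied over key positions, and that no score scaling or masking is used. The helper names and the toy weights are hypothetical and are not the configuration used in the study.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(X, heads, W_out, b_out):
    """X: T x D_in; heads: list of (W_qry, W_key, W_val); W_out: (N_h * D_h) x D_out."""
    head_outputs = []
    for W_qry, W_key, W_val in heads:
        # (19): attention scores between every query position and every key position.
        scores = matmul(matmul(matmul(X, W_qry), transpose(W_key)), transpose(X))
        weights = [softmax(row) for row in scores]
        # (20): weighted sum of the projected values for each query position.
        head_outputs.append(matmul(weights, matmul(X, W_val)))
    # (21): concatenate the heads for each position, project, and add the bias.
    concatenated = [sum((head[q] for head in head_outputs), []) for q in range(len(X))]
    projected = matmul(concatenated, W_out)
    return [[value + bias for value, bias in zip(row, b_out)] for row in projected]


# Toy layer: zero query/key weights give uniform attention in both heads.
tokens = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
head_1 = (zero, zero, [[1.0, 0.0], [0.0, 1.0]])
head_2 = (zero, zero, [[2.0, 0.0], [0.0, 2.0]])
result = multi_head_attention(tokens, [head_1, head_2], [[1.0], [1.0], [1.0], [1.0]], [0.5])
print(result)
```

With uniform attention each head returns the mean of its projected values, so both positions produce the same output; trained query and key matrices would make the weights position-dependent.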

#### **5. Experiment Setup and Results**

About 45 groups of different features of text were used to train the SVM classifier [1]. Vectors ranging in size from 33 to 5000 features were used, including characteristics of different levels of text analysis:


Even a carefully selected feature space does not guarantee high model efficiency; the training parameters of the SVM model are equally important. In an earlier study [1], the following parameters were identified as the most appropriate:


As stated earlier, deep NNs do not need a predetermined set of informative text features, as they are able to search for them on their own. However, these models are also extremely sensitive to learning parameters. These parameters have been selected based on the results of model experiments for related tasks [32,33]:


A large amount of data is required to train the models. For this purpose, a corpus was collected from the Moshkov library [34]. The corpus includes 2086 texts written by 500 Russian authors. The minimum size of each text was 100,000 symbols.

In the experiments, the number of training examples was varied to reflect the needs of real-life authorship identification tasks (including cases where the training data are limited). Therefore, the texts were divided into fragments ranging from 1000 to 100,000 characters (~200–20,000 words). We used three training examples for each author and one for testing.
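The fragmentation step can be sketched as follows. How the fragments were actually selected is not specified above, so the sketch assumes consecutive, non-overlapping fragments with the incomplete tail discarded; the function names are hypothetical.

```python
def split_into_fragments(text: str, fragment_size: int) -> list:
    """Cut a text into consecutive non-overlapping fragments of equal length."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    full_fragments = len(text) // fragment_size   # the incomplete tail is dropped
    return [text[i * fragment_size:(i + 1) * fragment_size] for i in range(full_fragments)]


def train_test_fragments(text: str, fragment_size: int, n_train: int = 3, n_test: int = 1):
    """Take the first n_train fragments for training and the next n_test for testing."""
    fragments = split_into_fragments(text, fragment_size)
    if len(fragments) < n_train + n_test:
        raise ValueError("the text is too short for the requested number of fragments")
    return fragments[:n_train], fragments[n_train:n_train + n_test]


# Placeholder text of 4500 characters split into fragments of 1000 characters.
placeholder_text = "x" * 4500
train_part, test_part = train_test_fragments(placeholder_text, 1000)
print(len(train_part), len(test_part))
```

With three training fragments and one test fragment per author, a source text must contain at least four times the fragment size in characters.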

Table 1 shows the accuracy of the SVM model for datasets of 2, 5, 10, and 50 candidate authors. Table 2 shows the results of applying SVM trained on statistical features and extracted aspects. 10-fold cross-validation was used as the procedure for evaluating the effectiveness of the models.
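The evaluation procedure can be sketched as a plain *k*-fold split that averages the accuracy over the folds. The sketch assumes a simple shuffled split without stratification by author. The `fit` and `predict` callables are hypothetical placeholders for any classifier, and the trivial classifier in the example is used only to keep the block self-contained.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if n_samples < k:
        raise ValueError("the number of samples must be at least k")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint, nearly equal folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_val_accuracy(samples, labels, fit, predict, k: int = 10) -> float:
    """Average accuracy over k folds for a classifier given by fit/predict callables."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Trivial placeholder classifier: always predicts the most frequent training label.
toy_samples = list(range(20))
toy_labels = ["author_a"] * 20
accuracy = cross_val_accuracy(
    toy_samples,
    toy_labels,
    fit=lambda xs, ys: max(set(ys), key=ys.count),
    predict=lambda model, x: model,
)
print(accuracy)
```

Each sample is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during training.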


**Table 1.** Average accuracy of author identification using SVM.

**Table 2.** Average accuracy of author identification using SVM with extracted aspects.


It should be noted that the results presented in Tables 1 and 2 were obtained by joint application of SVM and the Laplace smoothing method, which gives a slight increase in accuracy (from 0.01 to 0.07) on small sample sizes. Experiments have also shown that the Good-Turing and Katz smoothing methods negatively affect the quality of identification, with an average accuracy 0.04–0.11 lower when using them.

Table 3 shows the accuracy of author identification using LSTM on datasets of the same sizes, obtained by 10-fold cross-validation; Table 4 shows the results for CNN with attention, and Table 5 for the Transformer.

**Table 3.** Average accuracy of author identification using LSTM.



**Table 4.** Average accuracy of author identification using CNN with attention.

**Table 5.** Average accuracy of author identification using Transformer.


The results obtained indicate the particular effectiveness of SVM trained on carefully selected parameters and features. The SVM-based approach demonstrates higher accuracy than modern deep NN architectures, regardless of the number of samples and their volume. It should also be noted that the SVM classifier is able to learn on large volumes of data more than 10 times faster than deep NN architectures: the average training time for SVM was 0.25 machine-hours, while deep models were trained for an average of 50 machine-hours.

#### **6. Attacks on the Method**

The SVM classifier showed excellent results in determining the author of a natural-language text. However, the above experiments were not complicated by deliberate modifications aimed at text anonymization. Anonymization may have a negative impact on the accuracy of authorship identification. This hypothesis was confirmed by an earlier study [35], which proposed a text anonymization technique based on a fast correlation filter, dictionary synonymizing, and a universal transformer model with a self-attention mechanism. The results of that study showed that decision-making accuracy can be reduced by almost 50% by the proposed method of anonymization while keeping the text readable and understandable for humans.

As part of this work, the described anonymization technique was evaluated against the developed approaches. The results are presented in Table 6. The results of the experiments confirm that deep models are much more resistant to the anonymization technique than the SVM classifier. This is due to their ability to extract unobvious features that are not controlled by the author on a conscious level, while SVM operates on pre-defined features manually selected by experts, which can therefore be deliberately obscured by anonymization techniques. It should be noted that in such cases, SVM with aspect analysis shows slightly higher accuracy than SVM without it.


**Table 6.** Average accuracy of author identification using Transformer.

#### **7. Discussion and Conclusions**

In the course of the research, the authors analyzed modern approaches to determining the author of a natural-language text, implemented authorship attribution approaches based on SVM and deep NN architectures, evaluated the developed approaches on different numbers of authors and volumes of texts, and evaluated the resistance of the approaches to anonymization techniques. The results obtained allow us to draw several conclusions.

Firstly, despite the great popularity of deep NN architectures, they are inferior to the traditional SVM machine learning algorithm in accuracy by more than 10% on average. This is because NNs require more data than SVM to learn to extract informative features from the text. However, in real-life authorship identification tasks, the amount of data may be insufficient for accurate decision-making by the NN.

Secondly, the SVM classification is based on a carefully selected set of features manually formed by experts. Such informative features are also obvious targets for anonymization techniques and therefore can be removed or significantly corrupted. Thus, both the SVM-based approach and the deep models proposed by the authors are equally suitable for identifying the author of a natural language text. However, when choosing an approach, the data under study and the available technical resources should be objectively evaluated. If resources are limited, an SVM approach should be used. If there are traces of anonymization in the text, deep NN architectures are recommended despite the longer processing time, because they can find both obvious and unobvious dependencies in the text.

Thirdly, when using SVM, we recommend using five of the most informative features of the author's style that may improve the authorship identification process: unigrams and trigrams of Russian letters, high-frequency words, punctuation marks, and distribution of words among parts of speech.
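Four of the five recommended feature groups can be computed with the standard library alone, as in the minimal sketch below. It assumes that letter *n*-grams are counted within words and that punctuation frequency is measured relative to the text length. The part-of-speech distribution is omitted because it requires a morphological tagger. The function names, the punctuation set, and the sample sentence are hypothetical.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[а-яё]+")
# Assumed punctuation set, including guillemets, the long dash, and the ellipsis.
PUNCTUATION = set(".,;:!?()\"'-\u00ab\u00bb\u2014\u2026")


def letter_ngram_frequencies(text: str, n: int) -> dict:
    """Relative frequencies of Russian letter n-grams counted inside words."""
    counts = Counter()
    for word in WORD_PATTERN.findall(text.lower()):
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def most_frequent_words(text: str, k: int) -> list:
    """The k most frequent words with their counts."""
    return Counter(WORD_PATTERN.findall(text.lower())).most_common(k)


def punctuation_frequencies(text: str) -> dict:
    """Number of occurrences of each punctuation mark divided by the text length."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    return {mark: count / len(text) for mark, count in counts.items()}


sample_text = "Мама мыла раму. Мама устала!"
unigrams = letter_ngram_frequencies(sample_text, 1)
trigrams = letter_ngram_frequencies(sample_text, 3)
top_words = most_frequent_words(sample_text, 1)
punctuation = punctuation_frequencies(sample_text)
print(unigrams["м"], trigrams["мам"], top_words, punctuation)
```

The resulting dictionaries can be flattened into a fixed-order feature vector before being passed to a classifier.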

Finally, based on the results obtained, as well as on the experience of earlier research, the authors identified the important criteria to obtain accurate results when identifying the author of a natural language text:


**Author Contributions:** Supervision, A.S.; writing—original draft, A.K., A.F.; writing—review and editing, A.R., V.G., A.S.; conceptualization, A.K., V.G., A.S.; methodology, A.K., A.R.; software, A.K., A.F.; validation, A.K., A.R., A.R.; formal analysis, A.K., A.F.; resources, A.S.; data curation, A.K., A.R.; project administration, A.R.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Higher Education of Russia, Government Order for 2020–2022, project no. FEWM-2020-0037 (TUSUR).

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The authors express their gratitude to the editor and reviewers for their work and valuable comments on the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

