Article

Authorship Detection on Classical Chinese Text Using Deep Learning

Lingmei Zhao, Jianjun Shi, Chenkai Zhang and Zhixiang Liu
1 College of Foreign Languages, Shanghai Ocean University, Shanghai 201306, China
2 Institute of Linguistics, Shanghai International Studies University, Shanghai 201620, China
3 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(4), 1677; https://doi.org/10.3390/app15041677
Submission received: 4 November 2024 / Revised: 24 December 2024 / Accepted: 26 December 2024 / Published: 7 February 2025
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract:
Authorship detection has played an important role in social information science. In this study, we propose a support vector machine (SVM)-based authorship detection model for classical Chinese texts. The term frequency-inverse document frequency (TF-IDF) feature extraction technique is combined with the SVM-based method. The linguistic features used in this model are based on TF-IDF calculations over different function words, including literary-Chinese function words, end-function words, vernacular function words, and transitional function words. Furthermore, a bidirectional long short-term memory (BiLSTM)-based authorship detection model is introduced for classical Chinese texts. The BiLSTM model incorporates an attention mechanism to better capture the meaning and weight of the words. We conduct a comparative analysis between the SVM-based and BiLSTM-based models in the context of authorship detection in Chinese classical literature and examine the applicability of the two models to classical Chinese texts. Results indicate varying authorship between different sections of the texts, with the SVM model outperforming the BiLSTM model. Notably, these classification outcomes are consistent with findings from prior studies in classical Chinese literary analysis. The proposed SVM-based authorship detection model is especially suited for automatic literary analysis, which underscores its potential for broader literary studies.

1. Introduction

Authorship detection, a subset of authorship analysis within social information science, is a classification problem [1] whose aim is to identify the particular author of a text from a set of candidates. The authorship analysis of handwritten texts has been studied since ancient times, as every author has a unique writing style. In authorship detection, an author is identified by analyzing writing-style characteristics based on stylometry [2]. Extracting from texts the most distinctive characteristics that reflect an author's writing style is challenging, and the accuracy of detection depends on the relevance of the extracted features. Lexical [3], syntactic [4], and linguistic features [5] are the most distinctive characteristics of an author's writing style.
Nowadays, large numbers of textual documents have been digitized and made available on the Internet [6]. Authorship detection has advanced alongside the development of science and technology. Several methods have been applied to authorship detection. Keystroke biometrics [7] is based on using software applications; it can extract features from the rhythm when an author is typing words. The most common and popular method is stylometry-based [8], which uses data analysis techniques to extract the most distinctive features of an author’s writing style. It has also been suggested that sentimental words significantly affect authorial writing [9].
Text mining makes a great contribution to authorship detection, as it is widely utilized to extract meaningful features from large quantities of data in various unstructured formats. Pre-processing the text dataset tends to be the starting point [10]. The selected characteristics are employed to convert the text into a feature set. Machine learning is an important tool in data mining [11], utilized to identify patterns in data and create analysis models that extract meaningful information. A feed-forward neural network language model can be applied to authorship detection [12]; it handles a sizeable set of candidate authors while significantly reducing complexity and improving detection accuracy. A multi-headed recurrent neural network language model [13] can reach high accuracy even with small datasets.
Dream of the Red Chamber (DRC), also called The Story of the Stone, written by Cao Xueqin during the Qing Dynasty in the 18th century, is known as the pinnacle of the Four Great Classical Novels of Chinese literature and a gem of world literature [14]. DRC, with its unique artistic style, profound ideological content, superb narrative skills and rich cultural connotations, has occupied a pivotal position in the history of world literature. Through its profound analysis of feudal society and deep concern for human nature, it demonstrates the common problems of human society and the universal characteristics of human nature, providing valuable ideological resources and spiritual wealth for world literature. The novel is about a large feudal family with declining fortunes. It is a fictional reflection of Cao’s own family. Considered the encyclopedia of Chinese feudal society, it has aroused extensive attention from researchers on authorship attribution in classical literature [15].
The authorship detection of the first 80 and the remaining 40 chapters of the novel DRC is a well-known unsolved mystery about authorship attribution in classical Chinese literature [16]. Cao did not finish his novel when he died in 1763, leaving about 80 chapters in circulation. Cheng Weiyuan and Gao E claimed that they had collected more chapters of Cao through different channels. They put them together and published the first printed version. However, the authorship attribution of the last 40 chapters caused widespread controversy. The renowned critic Hu Shih considered them forgeries written by Gao [17], while some scholars believed that Gao made a mistake and took someone else’s forgery as the original. On the other hand, a few scholars support Gao and think the 40 chapters were Cao’s authentic work. Nevertheless, most scholars in the field of classical literature generally believe that DRC was written by two authors.
Some statistical research methods have been applied to the authorship detection of the two texts [18]. Observations of the grammatical and lexical habits in both the first 80 and the last 40 chapters [19] suggest that the author is the same. An analysis of word correlation, including verbs and adverbs, further supports this conclusion [20]. Principal component analysis applied to the verse portions [21] led to the conclusion that there are two authors. Finally, the delta procedure, a method for identifying text authorship [22], confirmed the two-author attribution.
Authorship detection technology was developed with the optimization and improvement of computer technology. Machine learning provides another route to explore the authorship attribution problem of DRC. Among the most effective techniques is the support vector machine (SVM), a classification model known for its performance [23]. Additionally, long short-term memory (LSTM) is a technique that emerged after the rise of deep learning, suitable for natural language processing [24].
In this paper, we propose machine-learning-based methods to solve the authorship attribution problem of DRC using SVM and BiLSTM. We explore the significance of applying author detection techniques to classical literature through these models. The remainder of the paper is organized as follows. Section 2 provides a detailed overview of the data and methods employed in this study, followed by the experiments presented in Section 3. The differences in authorship detection based on SVM and BiLSTM are discussed in Section 4 and the paper concludes in Section 5.

2. Data and Methods

The starting point for authorship detection is to decide on the text resource. Texts are the subject matter to be classified and analyzed. After obtaining the texts, preparation is carried out, including removing redundant spaces and punctuation marks. Artificial-intelligence-based methods are utilized to extract meaningful characteristics from the texts. Classification and description are determined by comparing the extracted characteristics in the non-disputed chapters with those in the disputed ones. The authorship detection process adopted in this work is presented in Figure 1.
Natural language processing tasks often require the processing of large-scale textual data, and many authorship detection systems still require large datasets. However, large datasets are hard to obtain from traditional Chinese literature, which tends to be brief and concise; this also makes it difficult to apply large pre-trained models such as BERT and GPT to the authorship detection problem. For these reasons, machine-learning-based methods built on SVM and BiLSTM were chosen to address the authorship attribution problem of DRC.

2.1. Data

It is important to determine the dataset when conducting author detection using machine learning methods. A high-quality dataset provides sufficient information for the model to train on and validate [25], which allows the model to better understand the features and patterns in the data, improving the accuracy and reliability of results. The edition of DRC used in this work was published by the People’s Literature Publishing House in 1982 and edited by Feng Qiyong et al.
The first 80 chapters are based on the Gengchen book with the Cheng book used as a reference. The remaining 40 chapters are based on the Chengjia book, with the Chengyi book and other versions serving as references. The Gengchen book is one of the early copies of DRC, copied around 1761; it contains 78 chapters, missing the 64th and 67th chapters, and is considered to have preserved the original form of DRC. The Gengchen book does not divide the 17th and 18th chapters. The version of DRC used in this work has 119 chapters.
Nowadays, Simplified Chinese characters are widely used in Mainland China [26]. DRC was written during the Qing Dynasty in the 18th century, when Traditional Chinese characters were in common use. The difference between Simplified and Traditional characters is not relevant to authorship detection [27]. In this work, the texts of DRC are converted from Traditional to Simplified characters, following common practice. Punctuation marks do not reflect the author's writing style and are removed.
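As a minimal sketch of this preparation step, assuming the opencc-python package for the character conversion (the paper does not name its conversion tool) and a regular expression that keeps only CJK characters:

```python
import re
from opencc import OpenCC  # e.g., pip install opencc-python-reimplemented

cc = OpenCC("t2s")  # Traditional-to-Simplified conversion table

def prepare_chapter(text: str) -> str:
    """Convert a chapter to Simplified characters and keep only CJK
    characters, dropping punctuation, digits, and whitespace."""
    simplified = cc.convert(text)
    return re.sub(r"[^\u4e00-\u9fff]", "", simplified)

print(prepare_chapter("滿紙荒唐言，一把辛酸淚。"))  # 满纸荒唐言一把辛酸泪
```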

2.2. Support Vector Machine Based on Term Frequency-Inverse Document Frequency

SVM is a machine-learning-based pattern recognition method that has gained considerable attention in artificial intelligence and remains one of the best-performing and most widely applicable classification models. It originated from the concept of the optimal separating hyperplane proposed by Vapnik et al. [28]. SVM is widely utilized when the training data are clearly labeled, allowing the algorithm to learn from the given data and build a classification model on the extracted features; the model divides the dataset into two classes. A linear SVM is employed in this work, based on preliminary experimental results.
Term frequency-inverse document frequency (TF-IDF) is a common weighting technique used in information retrieval and text mining. It is widely applied to evaluate the importance of a word within a document relative to a collection or corpus. The importance of a word increases with its frequency within a document but decreases with its frequency across the corpus. The frequency of a word within a document is its term frequency (TF), while inverse document frequency (IDF) reflects how widely the word appears throughout the entire corpus. TF-IDF is calculated as
$$
\mathrm{tfidf}(w) = \mathrm{tf}(w) \times \mathrm{idf}(w), \qquad \mathrm{idf}(w) = \log\frac{n}{1 + \mathrm{df}(w)}
$$
where tf(w) is the frequency of a specific word in a document, n is the total number of documents, and df(w) is the number of documents that contain the word. Adding 1 to the denominator prevents division by zero.
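As an illustration, the formula can be computed directly over a fixed vocabulary of function words. Taking tf(w) as the word's relative frequency within a document is an assumption here, since the paper does not spell out its TF normalization:

```python
import math
from collections import Counter

def tfidf_matrix(docs, vocab):
    """One TF-IDF feature row per document for a fixed vocabulary,
    with idf(w) = log(n / (1 + df(w))) as in the formula above."""
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        rows.append([(counts[w] / total) * math.log(n / (1 + df[w]))
                     for w in vocab])
    return rows

# toy example: three "chapters" as word lists, scored on two function words
chapters = [["之", "其", "了"], ["了", "的", "的"], ["之", "了", "的"]]
print(tfidf_matrix(chapters, vocab=["之", "了"]))
```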
In literary novels, every author has a unique writing style, and imitations often reveal traces [29]. In this work, we extract linguistic features that reflect each writer's stylistic patterns to serve as training and test data for the SVM model. Chapters 20 through 29 of the first 80 chapters are selected as training data, as there is no dispute regarding their authorship, together with chapters 110 through 119 of the remaining 40 chapters. The other chapters are used as test data. Preparation steps include removing redundant spaces and Chinese punctuation marks.
Function words within each sentence are evenly distributed in classical literary texts [30]. However, differences between chapters (documents) emerge from the varying frequencies of these function words. Our study uses the TF-IDF of function words as the linguistic feature for authorship detection in DRC. A set of 44 function words is chosen, excluding 2 two-letter function words. Assuming no prior knowledge of differing authorship between the first 80 and last 40 chapters, the model identifies any shift in writing patterns across the chapters. The result of authorship detection is determined by the boundary of that change.
The process of authorship detection using SVM is summarized as follows (a code sketch follows the list):
(1) Pre-process the texts of DRC, dividing them into two parts, each treated as the manuscript of a different author.
(2) Divide these two parts into their chapters and calculate the TF-IDF of function words in each chapter as extracted features, forming the whole dataset in two parts.
(3) Select ten chapters from both the non-disputed chapters and the disputed ones as training data, with the remainder used as testing data.
(4) Train the SVM algorithm on the selected data and build the classification model.
(5) Classify the whole dataset and observe the shift in writing patterns, attributing authorship based on the boundary of changing writing habits.
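A minimal scikit-learn sketch of steps (3)-(5), assuming X_all holds one TF-IDF feature row per chapter in chapter order (e.g., built with tfidf_matrix above); the paper's chapter numbers are 1-based, so they become 0-based indices here:

```python
from sklearn.svm import LinearSVC

# training chapters follow the paper's setup:
# chapters 20-29 -> author A (label 0), chapters 110-119 -> author B (label 1)
train_idx = list(range(19, 29)) + list(range(109, 119))
y_train = [0] * 10 + [1] * 10

clf = LinearSVC()  # linear SVM, as used in this work
clf.fit([X_all[i] for i in train_idx], y_train)

# classify every chapter, then inspect where predictions flip from A to B
pred = clf.predict(X_all)
```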

2.3. Bidirectional Long Short-Term Memory with an Attention Mechanism

The LSTM network, an enhanced form of the recurrent neural network proposed by Hochreiter and Schmidhuber, is widely utilized for sequence labeling and classification of temporal sequences [31]. With sequential data as input, it captures state changes over time. The structure of LSTM is well suited to processing long sequences and storing information for long periods, avoiding vanishing and exploding gradients. The structure is depicted in Figure 2.
The core component of LSTM is the cell state, which acts like a conveyor belt, allowing information to flow along it while being carefully regulated by gates that selectively add or remove information. There are three gates in LSTM: the forget gate, the input gate, and the output gate [32]. Each gate consists of a sigmoid neural net layer and a pointwise multiplication operation. The forget gate decides what information will be removed from the cell state, while the input gate determines what information will be stored in it. After deleting and updating information in the cell state, the output gate controls what information will exit the cell. Mathematically, the hidden state of each node can be expressed as
$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
u_t &= \tanh(W_u x_t + U_u h_{t-1} + b_u) \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where $f_t$, $i_t$, and $o_t$ are the outputs of the forget, input, and output gates, respectively; $x_t$ is the input; $h_{t-1}$ and $h_t$ are the hidden states; $u_t$ is the intermediate (candidate) state; $c_{t-1}$ and $c_t$ are the cell states; $W$ and $U$ are weight matrices; $b$ is a bias; and $\sigma$ is the logistic function.
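To make the update concrete, here is a single LSTM step in NumPy with random toy weights; this illustrates the equations only, not the framework implementation used in the experiments:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update following the gate equations above;
    p holds the weights W*, U* and biases b* for the four transforms."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    u = np.tanh(p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"])  # candidate state
    c = f * c_prev + i * u                                   # new cell state
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    h = o * np.tanh(c)                                       # new hidden state
    return h, c

# toy dimensions: input size 4, hidden size 3, random parameters
rng = np.random.default_rng(0)
shapes = {"W": (3, 4), "U": (3, 3), "b": (3,)}
p = {k + g: rng.normal(size=shapes[k]) for g in "fiuo" for k in ("W", "U", "b")}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), p)
```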
In many natural language processing tasks, the current output depends on both past and future states. However, an LSTM network can only predict the next moment's output based on the sequence information from previous moments. BiLSTM is composed of a forward and a backward LSTM; its structure is shown in Figure 3. The formulas are as follows:
$$
\begin{aligned}
\overrightarrow{h_t} &= \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}) \\
\overleftarrow{h_t} &= \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}}) \\
h_t &= w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t
\end{aligned}
$$
where $\overrightarrow{h_t}$, $\overleftarrow{h_t}$, and $h_t$ are the forward output, the backward output, and the combined output of the network at time $t$, respectively; $w_t$ and $v_t$ are the two output weight matrices, and $b_t$ denotes the offset.
The attention mechanism in deep learning draws its inspiration from the selective attention in human vision. It aims to sift through a plethora of data to identify and prioritize the most relevant information for the task at hand, while sidelining the less relevant details. The core of the attention mechanism is the dynamic allocation of weight coefficients to each value, which are adjusted throughout the training process for each time step, determining the significance of each word in the context of the task.
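A small NumPy illustration of attention pooling over BiLSTM outputs; scoring each time step with tanh followed by a dot product against a learned vector is a common choice and an assumption here, as the paper does not give its attention equations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# H: (T, d) matrix of BiLSTM outputs, one d-dimensional vector per time step;
# w: (d,) scoring vector (learned in practice, random here for illustration)
rng = np.random.default_rng(0)
T, d = 6, 8
H = rng.normal(size=(T, d))
w = rng.normal(size=d)

scores = np.tanh(H) @ w  # one relevance score per time step
alpha = softmax(scores)  # attention weights over time steps, summing to 1
context = alpha @ H      # weighted sum: a single d-dimensional representation
```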
Authorship detection of DRC can be considered as a sequence annotation task. Deep learning requires a large number of samples for training, but for literary works, only a limited number of samples are available. Additionally, the chapter length of DRC is limited, which makes it challenging to determine the type and length of samples to input into BiLSTM for authorship attribution [33]. It is therefore important to reduce the length of the sample sequences and increase the number of samples.
Assuming that the first few chapters of DRC were written by author A and the last few chapters by author B, we do not know exactly how many chapters each author wrote. Portions of the book from both the beginning and the end are used to train the BiLSTM model so that it can classify the unknown chapters. The relationship between the classification results and the number and length of training samples is analyzed, and the authorship attribution of the unknown chapters is determined from the classification results. Training samples are taken from the novel's first and last 10, 20, and 30 chapters, with sequence lengths of 500, 2000, and 5000 words. After conducting the experiments, the results of authorship detection on DRC using SVM and BiLSTM are presented in detail in the next section.
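A sketch of this sampling scheme, assuming each chapter is already a list of words; whether the paper's chunks overlap or cross chapter boundaries is not stated, so non-overlapping, within-chapter chunks are assumed:

```python
def make_samples(chapters, indices, seq_len, label):
    """Cut each selected chapter into non-overlapping seq_len-word chunks,
    each chunk becoming one labeled sample."""
    samples = []
    for i in indices:
        words = chapters[i]
        for start in range(0, len(words) - seq_len + 1, seq_len):
            samples.append((words[start:start + seq_len], label))
    return samples

# e.g., 2000-word samples: first 30 chapters -> author A, last 30 -> author B
n = len(chapters)
train = (make_samples(chapters, range(0, 30), 2000, label=0)
         + make_samples(chapters, range(n - 30, n), 2000, label=1))
```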

3. Experiments

In this section, the authorship identification procedure is introduced. For SVM-based attribution, different kinds of function words are selected and TF-IDF is calculated as linguistic features input into the SVM classification model. The observed change in the boundary of writing patterns serves as evidence to detect authorship across different chapters of DRC. For BiLSTM-based authorship detection, words in the dataset are vectorized as network input. A convolutional network with no activation function is combined with BiLSTM to improve classification performance. The classification results of the softmax classifier are considered as evidence to detect authorship across different chapters of DRC. Precision, recall, and f1-score are employed in the experiments to evaluate the classification performance.

3.1. SVM-Based Authorship Detection on DRC

Table 1 lists 44 function words categorized into four different types: 13 literary-Chinese function words, 6 end-function words, 14 vernacular function words, and 10 transitional function words.
The frequencies of these 44 function words are calculated on a per-mille (‰) scale as linguistic features. The training data, comprising chapters 20 to 29 and 110 to 119, are input into the SVM model. Assuming that chapters 20 to 29 were written by author A and chapters 110 to 119 by author B, the test data, comprising the remaining chapters, are input into the model and classified into two parts. The classification results are depicted in Table 2.
Based on these classification results, a total of 67 chapters were written by author A and 53 chapters by author B. The boundary of changing writing habits is depicted in Figure 4.
There is a clear boundary of inconsistency in writing habits between the 80th and the 81st chapters. Except for the 85th chapter, the writing patterns of chapters 81 through 120 are the same. The majority of the writing patterns of the first 80 chapters are consistent, with only 14 chapters exhibiting inconsistent patterns, accounting for 17.5%.
Figure 5 presents the confusion matrix for the SVM-based method, from which precision, recall, and f1-score are calculated: TP = 66, FN = 14, FP = 1, and TN = 39, with precision = TP/(TP + FP) and recall = TP/(TP + FN). The combination of SVM and TF-IDF produces a good result, with a precision of 98.5%, a recall of 82.5%, and an f1-score of 89.8%.
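The reported scores follow directly from these counts:

```python
tp, fn, fp, tn = 66, 14, 1, 39
precision = tp / (tp + fp)                           # 0.985
recall = tp / (tp + fn)                              # 0.825
f1 = 2 * precision * recall / (precision + recall)   # 0.898
```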
According to our classification results, the first 80 and last 40 chapters of DRC can be considered as literary works of two different authors.

3.2. BiLSTM-Based Authorship Detection on DRC

The necessary samples are taken from the specified chapters, including 500-word, 2000-word, and 5000-word sequences. The sample sizes are shown in Table 3.
For 500-word sequences, the first and last 10, 20, and 30 chapters yield 245, 484, and 734 samples, respectively; for 2000-word sequences, 71, 138, and 206 samples; and for 5000-word sequences, 37, 67, and 104 samples.
The entire structure of the BiLSTM-based authorship detection model is depicted in Figure 6.
It is challenging to convert Chinese texts into a form that BiLSTM can process. After confirming the samples, Word2vec is utilized to pre-process sequences of varying lengths [34]. Word2vec maps words into a vector space, generating a set of feature vectors that represent the words without human intervention. These vectorized words are suitable inputs for deep neural networks.
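A sketch with gensim's Word2Vec, using the 256-dimensional embedding size later listed in Table 4; the remaining hyperparameters are illustrative assumptions:

```python
from gensim.models import Word2Vec

# the word lists produced earlier (see make_samples) serve as the corpus
corpus = [words for words, label in train]
w2v = Word2Vec(sentences=corpus, vector_size=256, window=5,
               min_count=1, sg=1)
vec = w2v.wv["之"]  # 256-dimensional vector for one word in the vocabulary
```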
The word vectors are input into the convolutional network layer [35], where filters extract features by sliding over the vectors. The key to the sliding process is choosing appropriate padding strategies; valid padding is adopted, and the pooling operation is omitted to avoid destroying the sequential relationship of the results [36]. Next, the convolution outputs are passed, without being modified by an activation function, into a two-tier stacked BiLSTM. The BiLSTM retains long-term information and outputs high-level features.
Softmax is used as a classifier, working as a normalization function [37]. It transforms the high-level features into probability values within the interval (0, 1). This function can be expressed as
$$
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$
where $z$ is a vector of $K$ elements. The numerator applies the exponential function to the $i$-th element, and the denominator sums the exponentials of all elements, normalizing the result into a probability distribution; $\sigma(z)_i$ is the predicted probability of the $i$-th category.
The detailed setting for the BiLSTM authorship detection model is shown in Table 4.
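Under those settings, a minimal Keras sketch of the Figure 6 architecture: parallel convolutions with filter sizes 3, 4, and 5 (128 filters each, stride 1, valid padding, no activation or pooling), a two-tier stacked BiLSTM with 128 hidden units, attention pooling, and a softmax classifier. The vocabulary size and sequence length are assumptions, as are the optimizer and loss:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len = 20000, 2000  # assumed; not reported in the paper
inp = layers.Input(shape=(seq_len,))
emb = layers.Embedding(vocab_size, 256)(inp)  # word embedding dimension 256

# parallel Conv1D branches, concatenated along the time axis so the
# sequential structure passed to the BiLSTM is preserved
convs = [layers.Conv1D(128, k, strides=1, padding="valid",
                       activation=None)(emb) for k in (3, 4, 5)]
x = layers.Concatenate(axis=1)(convs)

# two-tier stacked BiLSTM, 128 hidden units per direction
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# attention pooling: score each time step, normalize over time, weighted sum
scores = layers.Dense(1, activation="tanh")(x)
alpha = layers.Softmax(axis=1)(scores)
x = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, alpha])

out = layers.Dense(2, activation="softmax")(x)  # two candidate authors
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```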
Classifications of three different lengths of sample sequences are conducted. The performance of a 500-word sample is shown in Figure 7.
After 80 training epochs using samples from the first and last 10 chapters, the model achieved an accuracy of 68.23%. When the training was conducted for the same number of epochs but with samples from the first and last 20 chapters, the accuracy improved to 69.48%. By extending the sample range to the first and last 30 chapters and reducing the training epochs to just 14, the model’s accuracy increased to 83.96%.
The performance of the 2000-word sample is shown in Figure 8.
After 50 training epochs using samples from the first and last 10 chapters, the model achieved an accuracy of 63.28%. When the training was conducted for the same number of epochs but with samples from the first and last 20 chapters, the accuracy improved to 68.73%. By extending the sample range to the first and last 30 chapters and reducing the training epochs to just 14, the model’s accuracy increased to 85.07%.
The performance of the 5000-word sample is shown in Figure 9.
The model achieved an accuracy of 67.04% after 50 training epochs using samples from the first and last 10 chapters. When the samples were expanded to include the first and last 20 chapters, the model completed 40 training epochs with an accuracy of 67.21%. Further increasing the samples to the first and last 30 chapters, the model maintained 40 training epochs and increased its accuracy to 80.84%.
The BiLSTM model trained on 2000-word samples from the first 30 and last 30 chapters showed the best performance. Our results suggest that the writing patterns are consistent in 85.07% of all chapters, and 97.88% of the first 80 chapters exhibit consistent writing patterns. Of the last 40 chapters, only one chapter's writing pattern differs, while the rest align with the writing pattern of the first 80 chapters. According to these classification results, the first 80 and last 40 chapters of DRC would be attributed to a single author.

4. Discussion

SVM relies on the extraction of linguistic features to identify writing patterns [38]. However, the number of samples used for training is limited. While the discriminant validity of texts with consistent writing patterns is high, controversial texts need to be further studied using alternative methods.
There is a clear boundary of inconsistency in writing style between the 80th and the 81st chapters of DRC when using the SVM model. Here, the two-author scenario aligns with the results of previous classical Chinese literary studies.
On the other hand, BiLSTM is more straightforward, using text sequences directly as training data [39]. Training samples are easy to extract and label, but the model's accuracy depends on the number of training samples, and the sequence length must be at least 500 words.
Unlike the SVM model, the BiLSTM model does not show a clear boundary of inconsistent writing patterns. This model identifies a single author for the entire novel, which is significantly different from the results of previous classical Chinese literary studies [40].
While SVM requires fewer samples, BiLSTM demands larger datasets. The SVM classification model performs well and clearly detects authorship attribution in DRC, whereas the BiLSTM classification model achieves high accuracy but does not directly resolve authorship attribution in DRC. For authorship attribution in Chinese classical literature, SVM is therefore more applicable than BiLSTM.

5. Conclusions

Authorship detection is an important part of social information science. Based on a contrastive study of data-driven applications in this field, this work proposes an SVM-based authorship detection model to identify the authorship of one classical Chinese novel, Dream of the Red Chamber (DRC), whose authorship attribution remains an open research problem due to the text’s historical age. In this study, we consider the TF-IDF of function words as the linguistic feature for detecting the author of DRC using the SVM-based model.
This work also introduces a BiLSTM-based authorship detection model combined with an attention mechanism. We explore the impact of sequence length and size of training samples on the BiLSTM model. The best classification results for this model are achieved using 2000-word samples chosen from the first 30 and last 30 chapters as training samples. Word2vec is utilized to pre-process the dataset, and a convolutional network with no activation function is integrated with BiLSTM to improve the classification performance. The attention mechanism can enhance the performance of the model.
A comparison is made between the applicability of SVM-based and BiLSTM-based models for authorship attribution in Chinese classical literature. For the SVM model, there is a clear boundary of inconsistency in writing habits between the 80th and the 81st chapters of the DRC. In contrast, no boundary of inconsistent writing patterns is detected in the BiLSTM model.
The SVM model proves to be more effective, with its classification results aligning well with the results of previous classical Chinese literary studies, as evidence that DRC may have been written by two different authors.

Author Contributions

Conceptualization, J.S. and Z.L.; methodology, L.Z. and C.Z.; software, C.Z.; validation, C.Z.; formal analysis, Z.L. and C.Z.; investigation, Z.L.; resources, Z.L.; data curation, Z.L.; writing—original draft preparation, Z.L. and C.Z.; writing—review and editing, J.S.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Social Science Foundation of China (22AYY024) and Special Funding for the Development of Science and Technology of Shanghai Ocean University (A2-0203-00-100410).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data generated or analyzed during this study are available from the first author (L. Zhao) on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, X.; Lashkari, A.H.; Vombatkere, N.; Sharma, D.P. Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey. Information 2024, 15, 131. [Google Scholar] [CrossRef]
  2. Kreuz, R. Linguistic Fingerprints: How Language Creates and Reveals Identity; Rowman & Littlefield: Lanham, MD, USA, 2023. [Google Scholar]
  3. Zheng, W.; Jin, M. A review on authorship attribution in text mining. Wiley Interdiscip. Rev. Comput. Stat. 2023, 15, e1584. [Google Scholar] [CrossRef]
  4. Jafariakinabad, F.; Tarnpradab, S.; Hua, K.A. Syntactic neural model for authorship attribution. In Proceedings of the Thirty-Third International Flairs Conference, North Miami Beach, FL, USA, 17–20 May 2020. [Google Scholar]
  5. Lagutina, K.; Lagutina, N.; Boychuk, E.; Larionov, V.; Paramonov, I. Authorship verification of literary texts with rhythm features. In Proceedings of the 2021 28th Conference of Open Innovations Association (FRUCT), Moscow, Russia, 27–29 January 2021; pp. 240–251. [Google Scholar]
  6. Finneran, R.J. The Literary Text in the Digital Age; University of Michigan Press: Ann Arbor, MI, USA, 1996. [Google Scholar]
  7. Acien, A.; Morales, A.; Monaco, J.V.; Vera-Rodriguez, R.; Fierrez, J. TypeNet: Deep learning keystroke biometrics. IEEE Trans. Biom. Behav. Identity Sci. 2021, 4, 57–70. [Google Scholar] [CrossRef]
  8. Zenkov, A.V. A novel method of stylometry based on the statistic of numerals. Comput. Res. Model. 2017, 9, 837–850. [Google Scholar] [CrossRef]
  9. Savoy, J. Machine Learning Methods for Stylometry; Springer: Cham, Switzerland, 2020. [Google Scholar]
  10. Symeonidis, S.; Effrosynidis, D.; Arampatzis, A. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Syst. Appl. 2018, 110, 298–310. [Google Scholar] [CrossRef]
  11. Palmer, A.; Jiménez, R.; Gervilla, E. Data mining: Machine learning and statistical techniques. In Knowledge-Oriented Applications in Data Mining; Funatsu, K., Ed.; IntechOpen: The Balearic Islands, Spain, 21 January 2011; pp. 373–396. [Google Scholar]
  12. Jafariakinabad, F.; Tarnpradab, S.; Hua, K.A. Syntactic recurrent neural network for authorship attribution. arXiv 2019, arXiv:1902.09723. [Google Scholar]
  13. Bagnall, D. Author identification using multi-headed recurrent neural networks. arXiv 2015, arXiv:1506.04891. [Google Scholar]
  14. Liu, C.-L.; Jin, G.-T.; Wang, H.; Liu, Q.-F.; Cheng, W.-H.; Chiu, W.-Y.; Tsai, R.T.-H.; Wang, Y.-C. Textual Analysis for Studying Chinese Historical Documents and Literary Novels. In Proceedings of the ASE BigData & SocialInformatics 2015, Kaohsiung, Taiwan, 7–9 October 2015; pp. 1–10. [Google Scholar]
  15. Anthony, C.Y. Rereading the Stone: Desire and the Making of Fiction in Dream of the Red Chamber; Princeton University Press: Princeton, NJ, USA, 2018. [Google Scholar]
  16. Shi, W. Yi Sheng Er Shu Ge: Hearing Multiple Tones at the Same Time a Study of the Dream of the Red Chamber, The Canterbury Tales and Their Readers; Wheaton College: Norton, MA, USA, 2017. [Google Scholar]
  17. Du, K. Authorship of Dream of the Red Chamber: A Topic Modeling Approach. In Proceedings of the DH, Würzburg, Germany, 8–11 August 2017. [Google Scholar]
  18. Yu, Y.; Liu, W.; Feng, Y. A quantitative study on Dream of the Red Chamber: Word-length distribution and authorship attribution. Complexity 2022, 2022, 9077360. [Google Scholar] [CrossRef]
  19. Karlgren, B. New Excursions in Chinese Grammar; Museum of Far Eastern Antiquities: Stockholm, Sweden, 1952. [Google Scholar]
  20. Chan, B.-C. The Authorship of the Dream of the Red Chamber is Based on a Computerized Statistical Study of Its Vocabulary; Joint Publishing, Company, Limited: Hong Kong, 1986. [Google Scholar]
  21. Zhu, H.; Lei, L.; Craig, H. Prose, verse and authorship in Dream of the Red Chamber: A stylometric analysis. J. Quant. Linguist. 2021, 28, 289–305. [Google Scholar] [CrossRef]
  22. Gupta, S.; Patra, T.K.; Chaudhuri, C. Role of Machine Learning in Authorship Attribution with Select Stylometric Features. In Proceedings of the International Conference on Intelligent Systems Design and Applications, Kolkata, India, 26 March 2021; pp. 920–932. [Google Scholar]
  23. Cervantes, J.; Garcia-Lamont, F.; Rodríguez-Mazahua, L.; Lopez, A. A comprehensive survey on support vector machine classification: Applications, challenges, and trends. Neurocomputing 2020, 408, 189–215. [Google Scholar] [CrossRef]
  24. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  25. Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8430–8439. [Google Scholar]
  26. Zhao, S.; Baldauf, R.B., Jr. Simplifying Chinese Characters: Not. In Handbook of Language and Ethnic Identity: The Success-Failure Continuum in Language and Ethnic Identity Efforts (Volume 2); Oxford University Press: New York, NY, USA, 2011; p. 168. [Google Scholar]
  27. Tsang, Y.-K.; Huang, J.; Wang, S.; Wang, J.; Wong, A.W.-K. Comparing word recognition in simplified and traditional Chinese: A megastudy approach. Q. J. Exp. Psychol. 2023, 77, 593–610. [Google Scholar] [CrossRef]
  28. Vapnik, V.; Izmailov, R. Reinforced SVM method and memorization mechanisms. Pattern Recognit. 2021, 119, 108018. [Google Scholar] [CrossRef]
  29. Kelcher, F. Anonymity and Imitation in Linguistic Identity Disguise; Aston University: Birmingham, UK, 2021. [Google Scholar]
  30. Boukhaled, M.A. A Machine Learning based Study on Classical Arabic Authorship Identification. In Proceedings of the ICAART (1), Valletta, Malta, 22–24 February 2022; pp. 489–495. [Google Scholar]
  31. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  32. DiPietro, R.; Hager, G.D. Deep learning: RNNs and LSTM. In Handbook of Medical Image Computing and Computer Assisted Intervention; Elsevier: Amsterdam, The Netherlands, 2020; pp. 503–519. [Google Scholar]
  33. Uchendu, A.; Le, T.; Shu, K.; Lee, D. Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8384–8395. [Google Scholar]
  34. Hu, W.; Gu, Z.; Xie, Y.; Wang, L.; Tang, K. Chinese text classification based on neural networks and word2vec. In Proceedings of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019; pp. 284–291. [Google Scholar]
  35. Namatēvs, I. Deep convolutional neural networks: Structure, feature extraction and training. Inf. Technol. Manag. Sci. 2017, 20, 40–47. [Google Scholar] [CrossRef]
  36. Luan, Y.; Lin, S. Research on text classification based on CNN and LSTM. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 29–31 March 2019; pp. 352–355. [Google Scholar]
  37. Jogin, M.; Madhulika, M.; Divya, G.; Meghana, R.; Apoorva, S. Feature extraction using convolution neural networks (CNN) and deep learning. In Proceedings of the 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 18–19 May 2018; pp. 2319–2323. [Google Scholar]
  38. Chammas, M.; Makhoul, A.; Demerjian, J. Writer identification for historical handwritten documents using a single feature extraction method. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Virtual Event, 14–17 December 2020; pp. 1–6. [Google Scholar]
  39. Shaikh, S.; Daudpotta, S.M.; Imran, A.S. Bloom’s learning outcomes’ automatic classification using lstm and pretrained word embeddings. IEEE Access 2021, 9, 117887–117909. [Google Scholar] [CrossRef]
  40. Moratto, R.; Liu, K.; Chao, D.-K. Dream of the Red Chamber: Literary and Translation Perspectives; Taylor & Francis: New York, NY, USA, 2022. [Google Scholar]
Figure 1. Authorship detection process adopted in this work.
Figure 2. LSTM structure.
Figure 3. BiLSTM structure.
Figure 4. Change of authors’ writing habits.
Figure 5. Confusion matrix of the SVM-based method.
Figure 6. Structure of BiLSTM-based authorship detection model.
Figure 7. Classification performance of a 500-word sample: (a) Samples are taken from the first 10 and last 10 chapters; (b) Samples are taken from the first 20 and last 20 chapters; (c) Samples are taken from the first 30 and last 30 chapters.
Figure 8. Classification performance of a 2000-word sample: (a) Samples are taken from the first 10 and last 10 chapters; (b) Samples are taken from the first 20 and last 20 chapters; (c) Samples are taken from the first 30 and last 30 chapters.
Figure 9. Classification performance of a 5000-word sample: (a) Samples are taken from the first 10 and last 10 chapters; (b) Samples are taken from the first 20 and last 20 chapters; (c) Samples are taken from the first 30 and last 30 chapters.
Table 1. Function words chosen to calculate frequency-based linguistic features for detecting the authorship of DRC.
Literary-Chinese function words: Zhi (之), Qi (其), Huo (或), Yi (亦), Fang (方), Yu (于), Ji (即), Jie (皆), Yin (因), Reng (仍), Gu (故), Shang (尚), Nai (乃)
End-function words: Ya (呀), Ma (吗), Lie (咧), Ba Me (罢么), A (啊), Ne (呢)
Vernacular function words: Le (了), De (的), Zhe (着), Yi (一), Bu (不), Ba (把), Rang (让), Xiang (向), Wang (往), Shi (是), Zai (在), Bie (别), Hao (好), Er (儿)
Transitional function words: Ke (可), Bian (便), Jiu (就), Dan (但), Yue (越), Zai (再), Geng (更), Bi (比), Hen (很), Pian (偏)
Table 2. Results of SVM-based classification.
Chapters attributed to author A: 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 62, 64, 66, 70, 71, 72, 73, 74, 75, 76, 77, 79, 80, 85
Chapters attributed to author B: 6, 11, 31, 32, 45, 49, 56, 61, 63, 65, 67, 68, 69, 78, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
Table 3. Size of training samples, including different lengths of sequences.
Sequence length    Ch. 1–10 and 111–120    Ch. 1–20 and 101–120    Ch. 1–30 and 91–120
500 words          245                     484                     734
2000 words         71                      138                     206
5000 words         37                      67                      104
Table 4. Settings of the BiLSTM-based authorship detection model.
Convolutional network layer: sliding step 1; filter sizes 3, 4, 5; number of filters 128.
BiLSTM layer: word embedding dimension 256; number of hidden units 128.