1. Introduction
Previous research on Chinese story cycle generation models has demonstrated that Chinese sentences can be effectively generated based on part-of-speech structures constrained by the model. Building on this foundation, this paper aims to enhance the cycle generation process by incorporating a fuzzy matching method to identify the most suitable vocabulary within an FP-Tree. The relevance of generated sentences is then used as the final selection criterion, ensuring that the chosen sentence improves contextual coherence in story generation.
Natural Language Processing (NLP) focuses on enabling computers to process and analyze large volumes of natural language data. Common applications, such as chatbots, automatic summarization, and reading comprehension, primarily generate shorter sentences by analyzing longer text. These applications are often statistical and analytical. Even in reading comprehension tasks, answers are typically extracted directly from a specific paragraph of the given text rather than being generated through sentence restructuring and deeper analysis. In past natural language research, evaluating generated text was often challenging. Standard evaluation metrics, such as BLEU [1] and ROUGE [2], primarily assess text quality by comparing generated outputs to predefined ground truth references. However, in natural language generation, there is no universally fixed evaluation method, particularly in the case of Chinese sentences, where word order and phrase structures exhibit significant variability. Existing NLP evaluation methods rely heavily on the original dataset, making it difficult to assess the performance of text generation tasks effectively.
To address this issue, this study proposes an evaluation approach that includes the Bilingual Evaluation Understudy (BLEU) for simple sentence verification, along with two additional methods designed explicitly for PoS structure evaluation: (1) PoS matching degree and (2) PoS context ordering coherence. The PoS matching degree method determines whether the expected distribution of parts of speech in the generated text aligns with the overall structural patterns. In creative writing, different authors exhibit unique content styles, and the model-generated text is influenced by the writing style present in the training dataset. Thus, assessing whether the distribution of various parts of speech in the generated content aligns with that of the training data can indicate whether the generated sentences reflect the author’s writing habits. By calculating the PoS matching degree, a smaller difference value signifies a higher degree of alignment, suggesting that the generated text more accurately follows the expected linguistic structure.
2. Related Work
In recent years, Natural Language Generation (NLG) has been increasingly recognized as a process in which a computer program produces natural language as output. According to Gatt and Krahmer [3], NLG can be categorized into two major types based on input: text-to-text generation and data-to-text generation. Text-to-text generation involves taking existing text as input and producing new text as output. Common applications of this approach include machine translation [4], text summarization and simplification [5], automatic writing correction [6], and paraphrasing [7].
Transformers, first introduced by Vaswani et al. [8], represent the state-of-the-art neural network architecture for natural language processing and have emerged as the dominant approach for natural language generation (NLG) [9]. Compared to earlier neural network architectures, such as Recurrent Neural Networks (RNNs) [10] and Long Short-Term Memory (LSTM) [11], Transformers effectively address the gradient vanishing problem [12], which can hinder the ability to maintain context over longer sequences and increase processing time for complex texts. Additionally, Transformers support parallel training, allowing for significantly faster processing. As training data and model architectures continue to expand, Transformers are capable of capturing long-range dependencies more effectively, leading to more advanced language understanding and generation [13].
Despite being a relatively new technique in NLP and NLG, Transformers have already demonstrated practical applications across various domains. For instance, Lee and Hsiang [14] utilized GPT-2 with fine-tuning to generate patent claims. By 2020, the development of GPT-3 [15] marked a significant milestone, with substantial increases in both model parameters and the volume of training data, further advancing the capabilities of large-scale language models.
Since 2015, numerous studies have explored language model pre-training, leading to the development of various approaches [16,17,18]. A language model (LM) is designed to estimate the probability of a word occurring in a given sentence, either as a discrete word or as part of a statistical probability distribution. Pre-training involves training a language model on large volumes of unlabeled data, allowing it to learn linguistic patterns and acquire a set of model parameters. These parameters can then be used to initialize the model for specific tasks, followed by fine-tuning on task-specific datasets to enhance performance.
Language model pre-training has been shown to significantly improve various natural language processing tasks. A notable example is the Bidirectional Encoder Representations from Transformers (BERT) [19], introduced by the Google AI Language team in 2019. BERT follows a pre-training and fine-tuning paradigm, leveraging a multi-layer bidirectional encoder model based on the Transformer architecture. By applying unsupervised learning, BERT enables the Transformer model to capture contextual dependencies more effectively, resulting in substantial improvements across multiple NLP applications [20,21,22,23]. While BERT-Finetune and BERT-Pretrain have demonstrated improvements in language generation, their reliance on masked language modeling limits their ability to maintain long-term coherence in generated text. Furthermore, they do not explicitly incorporate structural constraints such as part-of-speech arrangements, which are crucial for story generation.
3. Approach
This section introduces the cycle story generation process and its underlying architecture. One of the key challenges in natural language generation (NLG) is the lack of a standardized verification method. This issue is particularly pronounced in Chinese sentence generation, where word combinations and syntactic structures exhibit high variability. Existing evaluation methods in natural language processing primarily rely on comparisons with the original dataset, which often fail to effectively assess the performance of text generation tasks.
To address this limitation, in addition to using the Bilingual Evaluation Understudy (BLEU) as a verification method for simple sentences, this section proposes two additional evaluation methods specifically designed for PoS structures: (1) Part-of-Speech Matching Degree and (2) Part-of-Speech Context Ordering Coherence. These methods aim to provide a more comprehensive assessment of the syntactic accuracy and structural consistency of generated sentences.
3.1. Story Cyclic Generation Flow
Previous research has demonstrated that methods based on part-of-speech structures and cycle generation can effectively control both the style and content of generated text. Building upon this approach, fuzzy matching is introduced into part-of-speech selection, ensuring that candidate words maintain stronger relevance to the preceding sentence.
The input data structure includes the <SOS>, <EOS>, and <MOS> tokens, where <MOS> serves as a separator between the current and subsequent sentences. The current sentence is first segmented, with each word annotated with its corresponding PoS. The words of the next sentence segment are then used as the prediction target in the label. This design allows the model to anticipate the content of the next sentence based on the given input while also incorporating the expected part-of-speech structure of the upcoming sentence, thereby improving coherence and contextual consistency.
In this structure, the sequence from <SOS> to <MOS> represents the words and their corresponding parts of speech in the input sentence. Meanwhile, the segment from <MOS> to <EOS> denotes the part-of-speech sequence of the next sentence in the dataset. For the label, the original words within the <MOS> to <EOS> segment are extracted from the dataset based on their respective parts of speech, ensuring that the generated content aligns with the expected linguistic structure.
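To make the input and label construction concrete, the following sketch assembles the strings described above from an already segmented and tagged sentence pair. The <SOS>, <MOS>, and <EOS> tokens follow the paper; the helper name and the next-sentence example words are illustrative assumptions rather than items taken from the dataset.

```python
# Illustrative sketch of the input/label format described above.
# The next-sentence words here are hypothetical fillers for the example
# PoS sequence; only the format itself is the point.

def build_training_pair(curr_words, curr_pos, next_words, next_pos):
    # <SOS> ... <MOS>: staggered word/PoS arrangement of the current sentence.
    staggered = " ".join(f"{w} {p}" for w, p in zip(curr_words, curr_pos))
    # <MOS> ... <EOS>: only the PoS arrangement of the next sentence.
    model_input = f"<SOS> {staggered} <MOS> {' '.join(next_pos)} <EOS>"
    # Label: the original words that fill the next sentence's PoS slots.
    label = " ".join(next_words)
    return model_input, label

inp, lab = build_training_pair(
    ["段譽", "今天", "心情", "不錯"], ["Nb", "Nd", "Na", "VH"],
    ["段譽", "因此", "也", "十分", "高興"], ["Nb", "Cbb", "D", "Dfa", "VH"],
)
```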
After processing the data and training the model using Cycle-SG NET, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain, the entire cycle story generation process is illustrated in Figure 1. The cycle story is generated based on a predefined sentence format, where the part-of-speech structure of the next sentence is represented between <MOS> and <EOS>.
The cycle generation concept begins with constructing a Frequent Pattern Tree (FP-Tree) that captures the part-of-speech structures of all sentences in the dataset. When generating a new sentence, the model determines the appropriate part-of-speech structure for the next sentence by referencing the pre-established FP-Tree. The overall workflow is depicted in Figure 1.
3.2. Generating Candidate Part-of-Speech Tags Using FP-Tree
The process of selecting candidate words follows a structured approach. First, the method determines whether all parts of speech between <MOS> and <EOS> belong to the high-frequency category. If they do, a probability is calculated based on the distribution of each PoS and assigned accordingly. If a PoS does not meet the high-frequency criteria, no probability is assigned, and the corresponding word is omitted. Finally, the final candidate words are selected based on their assigned probabilities and used as the root to identify sentence structures within the FP-Tree.
Through this filtering process, only the most suitable candidate words are retained as the primary basis for further selection. These candidate words are then compared against the pre-established FP-Tree database, allowing the system to identify and generate the most contextually appropriate sentences.
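A minimal sketch of this filtering step is shown below. It assumes a pre-established set of common PoS tags (corpus share above roughly 10%, as noted in Algorithm 2) and a corpus-level probability for each tag; the names and numbers are illustrative only.

```python
import random

# Hedged sketch: keep only the high-frequency PoS tags of the target arrangement
# and pick one of them, weighted by its distribution, as the root of the FP-Tree search.
def pick_root_pos(pos_sorting, common_pos, pos_dist):
    candidates = [t for t in pos_sorting if t in common_pos]   # drop low-frequency tags
    if not candidates:
        return None                                            # no probability assigned
    weights = [pos_dist[t] for t in candidates]                # distribution-based weights
    return random.choices(candidates, weights=weights, k=1)[0]

root = pick_root_pos(["Nb", "Cbb", "D", "Dfa", "VH"],
                     common_pos={"Nb", "D", "VH"},
                     pos_dist={"Nb": 0.18, "D": 0.12, "VH": 0.15})
```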
Once the candidate word is identified, the complete sentence part-of-speech structure can be retrieved from the FP-Tree. However, during this search process, multiple sentences may be generated simultaneously based on the same part-of-speech structure. Selecting the most suitable candidate sentence is crucial, as it directly impacts the coherence and quality of the generated article.
Additionally, since the FP-Tree is constructed without considering the sequential relationships between words in adjacent sentences, it is possible to retrieve multiple different candidate sentences for the same candidate word. This introduces a challenge in determining the most appropriate part-of-speech arrangement for the next sentence. To address this issue, fuzzy matching is applied between the candidate sentence and the previous sentence. By evaluating the degree of similarity, the system identifies the candidate sentence with the highest matching score, ensuring that the most contextually appropriate sentence is selected. The process of determining similarity relies on Equations (1) and (2).
The similarity is computed as defined in Equations (1) and (2), where l denotes the shared prefix length between the candidate and reference sentences, and p is a user-defined constant in the range 0 < p ≤ 0.25 that determines the weight assigned to the shared prefix in the similarity calculation. These parameters ensure a balanced assessment of both structural similarity and contextual coherence.
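The definitions of l (shared prefix length) and p (0 < p ≤ 0.25) match the standard Jaro-Winkler formulation, so the sketch below assumes Equations (1) and (2) take that form. It operates directly on PoS tag sequences and is illustrative rather than a reproduction of the paper's exact equations.

```python
# Hedged sketch of a Jaro-Winkler-style fuzzy match over PoS tag sequences,
# assuming Equations (1) and (2) follow the standard formulation.

def jaro(s, t):
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    window = max(ls, lt) // 2 - 1
    s_hit, t_hit = [False] * ls, [False] * lt
    matches = 0
    for i in range(ls):                          # count matching tags within the window
        lo, hi = max(0, i - window), min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and s[i] == t[j]:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(ls):                          # count transposed matches
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / ls + matches / lt + (matches - transpositions) / matches) / 3

def similarity(s, t, p=0.1, max_prefix=4):
    base = jaro(s, t)
    l = 0                                        # shared prefix length, capped at 4
    for a, b in zip(s, t):
        if a != b or l == max_prefix:
            break
        l += 1
    return base + l * p * (1 - base)             # prefix-weighted boost (0 < p <= 0.25)

score = similarity(["Nb", "Cbb", "D", "Dfa", "VH"], ["Nb", "Cbb", "Dfa", "VH"])
```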
The most suitable candidate sentence can be identified through fuzzy matching, which evaluates the alignment between the candidate sentence and the reference context. Once the appropriate sentence is selected, its part-of-speech structure is inserted between the newly defined <MOS> and <EOS> tokens. This updated structure serves as the input for the cycle generation model, enabling the seamless generation of subsequent content while maintaining consistency throughout the article.
3.3. Bilingual Evaluation Understudy (BLEU)
In the early stages, the Bilingual Evaluation Understudy (BLEU) metric was commonly used to assess the quality of generated Chinese sentences. However, this evaluation method requires ground truth references to compare against the generated results. In fields like article and story generation, where ground truth is often unavailable, calculating accuracy becomes challenging. Nevertheless, BLEU can still provide insights by analyzing different n-gram patterns to determine whether a model has improved its understanding of the training set during the training process.
Although BLEU alone cannot fully evaluate the quality of generated content, it remains a valuable metric for comparing the same dataset across different models, providing insights into whether one model demonstrates a better understanding of the underlying data. The BLEU metric operates by performing a co-occurrence comparison between two sentences, effectively assessing their similarity. The calculation involves averaging the unit precision of each n-gram, where n represents the n-gram length. However, n-gram precision alone cannot address the issue of incomplete matching, necessitating the introduction of a penalty mechanism.
To address this, BLEU employs two key components: the brevity penalty (BP) and the overall BLEU score, which are calculated using Equations (3) through (5). These equations incorporate both precision and penalties to provide a more balanced assessment of the quality of generated content.
In the BLEU formula, n is typically set to four, corresponding to the classic BLEU-4 evaluation method, which is widely used in most NLP research papers. However, when applied to text generation, BLEU has some notable limitations. It cannot account for nuanced language expression, is heavily influenced by common words, and may fail to appropriately handle synonyms or similar words, often judging them as mismatched.
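As a concrete reference, the sketch below implements the standard BLEU computation (clipped n-gram precision, geometric mean, and brevity penalty) that Equations (3) to (5) describe; in practice, library implementations such as nltk's sentence_bleu can be used instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal BLEU-N with brevity penalty for one candidate/reference pair."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)                   # brevity penalty
    return bp * geo_mean

# Character-level toy example; max_n=2 keeps the score non-zero for short strings.
score = bleu(list("段譽今天心情不錯"), list("段譽今日心情很好"), max_n=2)
```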
Despite these shortcomings, BLEU offers several advantages, including fast computation, low computational cost, and the ability to assess associations through n-grams effectively. For these reasons, this study will utilize BLEU as a verification method and reference point for evaluating text generation quality.
3.4. Part-of-Speech Matching Degree
This method evaluates the matching degree of each part-of-speech quantity. In the subsequent analysis of Chinese Knowledge and Information Processing (CKIP) statements, variations in sentence arrangements or word combinations may lead to differences in word segmentation and part-of-speech tagging results. Even minor differences can result in further variations in these outputs. To address this, the verification method is divided into two categories: classification-based and non-classification-based matching. In the classification-based algorithm, similar parts of speech are grouped together, simplifying the comparison process. Conversely, in the non-classification approach, the original CKIP part-of-speech tags are retained without grouping, allowing for a more granular analysis. This dual approach ensures flexibility in evaluating part-of-speech consistency under varying linguistic conditions.
For the calculation of part-of-speech matching, please refer to Equations (6) and (7). The definitions for classification and non-classification remain consistent across these equations. The degree of part-of-speech matching is determined by calculating the sum and average of the differences in part-of-speech ratios, along with the quantity ratio of each PoS. In this context, designation refers to the part-of-speech arrangement derived from the part-of-speech arrangement search mechanism, while generation represents the part-of-speech arrangement obtained through CKIP analysis of sentences generated by the model. By comparing these two arrangements, the method evaluates how closely the generated content aligns with the expected part-of-speech structure.
This method employs a part-of-speech ratio algorithm to evaluate whether the overall structure of the expected part-of-speech distribution aligns with that generated by the model. The degree of difference is calculated by summing the absolute values of the differences between the two distributions. A smaller resulting value indicates a smaller degree of difference, which, in turn, reflects a higher matching degree between the model’s output and the expected structure.
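A minimal sketch of this calculation is given below, assuming Equations (6) and (7) average the absolute differences between the PoS ratio distributions of the designated and generated arrangements; the function name is illustrative.

```python
from collections import Counter

def pos_matching_degree(designated, generated):
    """Average absolute difference between the PoS ratio distributions of the
    designated arrangement and the arrangement CKIP extracts from the generated
    sentence. Smaller values indicate a closer structural match."""
    d_ratio = {t: c / len(designated) for t, c in Counter(designated).items()}
    g_ratio = {t: c / len(generated) for t, c in Counter(generated).items()}
    tags = set(d_ratio) | set(g_ratio)
    diffs = [abs(d_ratio.get(t, 0.0) - g_ratio.get(t, 0.0)) for t in tags]
    return sum(diffs) / len(tags)

delta = pos_matching_degree(["Nb", "Cbb", "D", "Dfa", "VH"],
                            ["Nb", "D", "Dfa", "VH", "VH"])
```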
3.5. Part-of-Speech Context Sorting Coherence
The part-of-speech matching process evaluates the degree of alignment generated for each sentence. However, sentence fluency is equally critical for the overall coherence of an article. To address this, Part-of-Speech Context Sorting Coherence is introduced to measure the degree of coherence between part-of-speech arrangements. First, a corresponding FastText model is trained using the dataset containing the part-of-speech arrangements of preceding and succeeding sentences. The FastText model enables the calculation of association probabilities between different parts of speech. The part-of-speech arrangement of the current input sentence is then compared to the part-of-speech arrangement generated by the model. Using these arrangements, the relative probability of the parts of speech is calculated. The coherence score is obtained by taking the average of the summed probabilities, which reflects the relationship between the two arrangements. This value is referred to as the FastText Relation Score (FRS), providing a quantitative measure of part-of-speech coherence.
The FRS between the current and designated part-of-speech arrangements and the FRS between the current and generated arrangements are both computed, and the ordering coherence of the part-of-speech context is defined as the absolute difference between these two scores. For the detailed formulas, please refer to Equations (8) through (10). Because the part-of-speech arrangements differ in length, the current, designated, and generated sequences are handled separately, and the FastText function is employed to calculate the association probabilities between these part-of-speech sequences. For a schematic representation of the algorithm, please refer to Figure 2.
This method aims to determine the correlation difference between the specified sentence and the sentences generated by the model using a part-of-speech association algorithm applied to preceding and succeeding sentences. By evaluating this correlation, the method ensures that the generated sentence aligns with the expected next-sentence structure specified in the model.
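The sketch below illustrates the idea with a gensim FastText model trained on PoS arrangements, using pairwise vector similarity as a stand-in for the association probability; the exact association function and training settings of the paper are not reproduced here.

```python
from gensim.models import FastText

# Hedged sketch: train a skip-gram FastText model on PoS arrangements and use
# pairwise similarity between tags as an illustrative association score.
pos_corpus = [["Nb", "Nd", "Na", "VH"], ["Nb", "Cbb", "D", "Dfa", "VH"]]  # toy data
ft = FastText(sentences=pos_corpus, vector_size=32, window=3, min_count=1, sg=1)

def frs(current_pos, other_pos, model):
    """Average association score between every tag pair of two PoS arrangements."""
    scores = [model.wv.similarity(a, b) for a in current_pos for b in other_pos]
    return sum(scores) / len(scores)

frs_designated = frs(["Nb", "Nd", "Na", "VH"], ["Nb", "Cbb", "D", "Dfa", "VH"], ft)
frs_generated = frs(["Nb", "Nd", "Na", "VH"], ["Nb", "D", "Dfa", "VH"], ft)
coherence_diff = abs(frs_designated - frs_generated)  # PoS context sorting coherence
```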
The cyclic generation framework is composed of two core components: Cyclic Generation Processing and the Part-of-Speech Sorting Search Mechanism (Algorithms 1 and 2). These components are essential in ensuring that the generated text maintains syntactic coherence, linguistic structure, and contextual consistency.
The first phase, Cyclic Generation Processing, serves as the primary loop responsible for generating text iteratively. Initially, the part-of-speech sorting table is preloaded using the FP-growth algorithm, constructing an FP-tree that stores frequently occurring PoS sequences (line 1). This FP-tree acts as a reference for predicting the syntactic structure of upcoming sentences. The generation model is then selected from one of four architectures: New SG-Net, New SG-GAN, BERT-Finetune, or BERT-Pretrain (line 2). The process begins with a starting sentence, which is assigned to both the current sentence and the story result (line 6). The number of generated sentences, n, is specified by the user (line 7), and the cyclic generation process continues until this number is reached. During each iteration, CKIP (the Chinese Knowledge and Information Processing system) is applied to the current sentence to extract its word sequence and PoS sequence (line 8). These extracted features are passed to the Part-of-Speech Sorting Search Mechanism, which queries the FP-tree to retrieve the most contextually appropriate PoS sequence for the next sentence (line 9). The retrieved next PoS sequence, together with the word and PoS sequences of the current sentence, is then handed to the Input Sentence Pre-processing module, which plays a crucial role in constructing a structured input for the generation model: it transforms the word sequence, PoS sequence, and next PoS sequence into an innovative input sentence that aligns with the expected linguistic patterns (line 10). This transformation ensures that the generated sentence adheres to both grammatical constraints and stylistic consistency. The innovative input sentence is then passed to the generation model (line 11), which produces a new sentence following the syntactic rules learned from the training data. The newly generated sentence undergoes further processing before being appended to the story result. First, its word sequence is extracted and assigned to the current sentence to serve as input for the next iteration (line 12). Then, the generated sentence is concatenated with the story result, progressively expanding the narrative (line 13). This cyclic process continues until the required number of sentences is reached, at which point the final generated text is returned as output.
Algorithm 1: Cycle Generation Processing
Pre-loaded part-of-speech sorting table
(01) FP-tree ← FP-growth algorithm(PoS sorting table)
(02) Generation model ← New SG-Net || New SG-GAN || BERT-Finetune || BERT-Pretrain
(03) CKIP(sentence) /*Get word and PoS sorting of a sentence*/
(04) Part-of-speech Sorting Search Mechanism(PoS sorting)
(05) Input Sentence Pre-processing(word sorting, PoS sorting, next PoS sorting)
(06) current sentence ← starting sentence; story result ← starting sentence
(07) NumofSent ← n /*User specifies the number of generated sentences, assume n.*/
while NumofSent > 0 do
(08) word sorting, PoS sorting ← CKIP(current sentence)
(09) next PoS sorting ← Part-of-speech Sorting Search Mechanism(PoS sorting)
(10) innovative input sentence ← Input Sentence Pre-processing(word sorting, PoS sorting, next PoS sorting)
(11) generated sentence ← Generation model(innovative input sentence)
(12) current sentence ← word sorting of generated sentence
(13) story result ← story result + current sentence
Algorithm 2: Part-of-speech Sorting Search Mechanism
Pre-established commonly used parts of speech /*Probability distribution > 10%*/
(01) Among the common PoS tags in the current PoS sorting, select one according to their probability distribution and use it to enter the FP-tree.
(02) Through the FP-growth algorithm, move first to the most frequent node of the selected PoS; combining the paths from that node to all leaves with the unique path from the node to the root yields several next PoS sortings as candidates.
(03) Regarding each PoS sorting as a set and using the smallest common set, find the complete PoS sortings in the PoS sorting table that match the candidates, and take these complete PoS sortings as the new candidates.
(04) Apply fuzzy matching between the new candidates and the current PoS sorting, and return the candidate with the highest similarity as the next PoS sorting.
4. Architecture of Each Model
For Chinese sentence and story generation, we compare Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain. The following sections introduce the architectures of Cycle-SG Net and Cycle-SG GAN, as well as the parameters used for each model. Additionally, we describe the process of adapting BERT-Finetune and BERT-Pretrain to our specific data format.
4.1. Architecture of Cycle-SG Net
In the feature extraction phase, FastText is employed for multi-dimensional quantized vector representation, effectively enhancing the training feature information for the Transformer model. A preprocessed dataset is used to train the FastText model, which adopts the skip-gram technique, predicting the surrounding context from the current word. The FastText model serves as the Word Embedding layer for SG-Net while retaining the Position Embedding originally designed for SG-Net to preserve the input sentence order. Additionally, Adaptive Softmax is implemented to reduce computational costs. To enhance the model’s ability to capture more complex natural language features, the model depth is increased from three layers to six layers, making the overall architecture more intricate. This structural enhancement enables the model to learn richer linguistic patterns.
Furthermore, the AdamW optimizer is utilized to fine-tune weight parameters, improving optimization efficiency. For a detailed overview of the model architecture and parameter settings, please refer to Table 1 and Figure 3a.
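The sketch below assembles the ingredients listed above in PyTorch: pretrained FastText-style word vectors as the embedding layer, a learned position embedding, a six-layer Transformer encoder, an adaptive softmax output, and AdamW. All dimensions are illustrative and do not reproduce the settings in Table 1.

```python
import torch
import torch.nn as nn

class CycleSGNetSketch(nn.Module):
    """Hedged architectural sketch, not the paper's exact Cycle-SG Net."""
    def __init__(self, fasttext_vectors, max_len=128, n_layers=6, n_heads=8):
        super().__init__()
        vocab_size, dim = fasttext_vectors.shape
        self.word_emb = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)
        self.pos_emb = nn.Embedding(max_len, dim)                 # position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Adaptive softmax reduces the cost of a large output vocabulary.
        self.out = nn.AdaptiveLogSoftmaxWithLoss(dim, vocab_size, cutoffs=[1000, 10000])

    def forward(self, token_ids, targets):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.word_emb(token_ids) + self.pos_emb(positions)
        h = self.encoder(h)
        return self.out(h.reshape(-1, h.size(-1)), targets.reshape(-1))

vectors = torch.randn(20000, 256)                                 # stand-in FastText matrix
model = CycleSGNetSketch(vectors)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```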
4.2. Architecture of the Cycle-SG GAN
To enhance the generation performance of Cycle-SG Net, a Generative Adversarial Network (GAN) is integrated, and internal modifications enable the model to train sentences effectively through adversarial learning. Cycle-SG GAN, like Cycle-SG Net, employs FastText as the Word Embedding method. The GAN architecture is designed with Cycle-SG Net functioning as the Generator and TextCNN serving as the Discriminator. In this study, the GAN objective function is formulated based on sentence pairs. The objective function of Cycle-SG GAN is expressed in Equation (11), in which the two distributions involved are the probability distribution of real sentences and the probability distribution of the current input sentence. For a visual representation of the Cycle-SG GAN architecture, please refer to Figure 3b.
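For reference, a conventional sentence-pair GAN objective with Cycle-SG Net as the generator G and TextCNN as the discriminator D would take the following minimax form; this is the standard formulation and may differ in detail from the paper's Equation (11).

```latex
\min_{G}\max_{D} V(D,G) =
  \mathbb{E}_{y \sim P_{\mathrm{real}}}\big[\log D(y)\big]
  + \mathbb{E}_{x \sim P_{\mathrm{input}}}\big[\log\big(1 - D(G(x))\big)\big]
```

Here, P_real denotes the probability distribution of real sentences and P_input the probability distribution of the current input sentence; both symbols are introduced only for this sketch.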
4.3. Architecture of BERT-Finetune
To compare the performance of Cycle-SG Net and Cycle-SG GAN, the experimental results are evaluated against BERT-Finetune and BERT-Pretrain. To ensure a fair comparison, the dataset is retrained using the “Jin Yong” novels. For BERT-Finetune, the BertForMaskedLM model framework is employed for fine-tuning. The model is initialized with pre-trained weight parameters in Traditional Chinese, which are loaded into BertForMaskedLM. To optimize the model, the AdamW optimizer is used for adjusting weight parameters, enabling effective fine-tuning for the task. For details on BERT-Finetune parameters, please refer to Table 2, and for the BERT-Finetune model architecture, see Figure 3c.
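A minimal sketch of this setup with the Hugging Face transformers library is shown below; "bert-base-chinese" is used only as an example checkpoint name, and the training-loop details (masking strategy, batching) are omitted.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

# Hedged sketch: load a Chinese masked-LM checkpoint into BertForMaskedLM
# and fine-tune it with AdamW.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("段譽今天心情[MASK][MASK]", return_tensors="pt")
labels = batch["input_ids"].clone()       # in practice, only masked positions are labeled
outputs = model(**batch, labels=labels)   # masked-LM loss
outputs.loss.backward()
optimizer.step()
```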
4.4. Architecture of BERT-Pretrain
The BERT-Pretrain model follows the same architecture as BERT-Finetune, with the key difference being that it does not utilize the BERT-base-Chinese pre-trained model as an initial weight parameter for BertForMaskedLM. Instead, the model undergoes pre-training from scratch. To enhance the training effect, input sentences are preprocessed and transformed into four feature representations:
Input: The original input sentence.
Attention: Focuses on the words at each position while avoiding attention to padding positions.
Type: Informs the model whether the current position belongs to the previous or next sentence.
Position: Provides the model with the positional sequence of each word.
The AdamW optimizer is used to fine-tune weight parameters, enabling the completion of the pre-training task. For details on the BERT-Pretrain model architecture, please refer to Figure 3d.
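The sketch below shows, in a hedged form, how the four features listed above map onto the inputs of a randomly initialized BertForMaskedLM for pre-training from scratch; vocabulary size, sequence length, and ids are illustrative.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(vocab_size=21128, num_hidden_layers=12)
model = BertForMaskedLM(config)                       # pre-training from scratch, no loaded weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

seq_len = 16
features = {
    "input_ids": torch.randint(0, config.vocab_size, (1, seq_len)),  # Input
    "attention_mask": torch.ones(1, seq_len, dtype=torch.long),      # Attention (ignore padding)
    "token_type_ids": torch.zeros(1, seq_len, dtype=torch.long),     # Type (previous/next sentence)
    "position_ids": torch.arange(seq_len).unsqueeze(0),              # Position
}
loss = model(**features, labels=features["input_ids"]).loss          # illustrative labels
```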
5. Training Methods for Each Model
To train the model to capture the style and structure of Jin Yong’s novels, a new data format is established through Word Sorting and Part-of-Speech Sorting, both of which are completed during data preprocessing. The input sentence consists of three key components:
Special Label—Identifies the type of input data.
Staggered Sorting of Words and Parts of Speech in the Current Sentence—Represents an interleaved arrangement of words and their corresponding parts of speech.
Part-of-Speech Sorting of the Next Sentence—Specifies the expected part-of-speech structure for the upcoming sentence.
Since different models require distinct training approaches, the composition of the data format varies accordingly. Each element in the input sentence carries a specific representational meaning based on the model being trained. For a detailed comparison of data formats across different training modes, please refer to Table 3, which provides examples of data formats for each model.
5.1. Special Label
The BERT-Finetune and BERT-Pretrain models inherently provide special tokens such as [CLS], [SEP], and [MASK] to distinguish different elements within the input. These tokens serve the following purposes:
[CLS]: Marks the beginning of the sentence.
[SEP]: Differentiates between the current sentence, the following sentence, and the target sentence.
[MASK]: Indicates the word whose position needs to be predicted.
In contrast, Cycle-SG Net and Cycle-SG GAN are custom-designed models that do not utilize the predefined BERT tokens. Instead, they require custom tagging mechanisms to structure the input data effectively. To address this, the <SOS>, <EOS>, and <MOS> tags are introduced:
<SOS>: Marks the start of the sentence.
<EOS>: Denotes the end of the sentence.
<MOS>: Separates the word and part-of-speech arrangement of the current sentence from the part-of-speech arrangement of the next sentence.
For Cycle-SG Net and Cycle-SG GAN, the training tasks focus on distinguishing and learning the interleaved structure of words and parts of speech in the current sentence while simultaneously designating the part-of-speech arrangement for the next sentence. This customized approach ensures that the models generate structured and contextually coherent outputs.
5.2. Staggered Sorting of Words and PoS in the Current Sentence
The current input sentence is processed using CKIP word segmentation and part-of-speech analysis, transforming it into a staggered arrangement of words and their corresponding parts of speech. This preprocessing step aims to help the training model comprehend the writing structure of Jin Yong’s novels while reinforcing the relationship between words and their respective parts of speech. By structuring the data in this manner, the model can better capture the stylistic and syntactic patterns characteristic of Jin Yong’s writing.
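As an illustration, the sketch below produces such a staggered arrangement with the ckip-transformers package (assuming its current word segmentation and PoS tagging drivers); the example outputs in the comments are indicative only.

```python
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

# Hedged sketch of turning a raw sentence into the staggered word/PoS arrangement.
ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")

sentence = "段譽今天心情不錯"
words = ws_driver([sentence])[0]          # e.g., ["段譽", "今天", "心情", "不錯"]
tags = pos_driver([words])[0]             # e.g., ["Nb", "Nd", "Na", "VH"]
staggered = " ".join(f"{w} {t}" for w, t in zip(words, tags))
# e.g., "段譽 Nb 今天 Nd 心情 Na 不錯 VH"
```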
5.3. PoS Sorting of the Next Sentence
In the training process, the PoS arrangement of the next sentence is derived from the dataset based on the current sentence. The goal is to enable the model to learn the structural patterns of subsequent sentences and improve its ability to generate coherent text. For the Cycle-SG NET and Cycle-SG GAN models, the PoS length of the next sentence is set to match the length of the target-generated sentence. In contrast, for the BERT-Finetune and BERT-Pretrain models, the PoS length of the next sentence corresponds to the number of [MASK] tokens, ensuring alignment with BERT’s masked language modeling mechanism.
Input Sentence and Next Sentence
Input sentence: represents the current textual input given to the model.
Next sentence: represents the ground truth sentence that follows in the dataset, which the model is expected to learn and predict in a structured manner.
Special Labels
Our proposed Cycle SG-Net and Cycle SG-GAN introduce three special labels:
<SOS> (Start of Sentence): Indicates the beginning of the sequence.
<MOS> (Middle of Sentence): Separates the current sentence’s staggered word-PoS sequence from the PoS sequence of the next sentence.
<EOS> (End of Sentence): Marks the end of the input sequence.
In contrast, BERT-Finetune and BERT-Pretrain utilize predefined BERT special tokens:
[CLS] (Classification Token): Indicates the beginning of the sequence.
[SEP] (Separator Token): Separates different parts of the input.
[MASK] (Mask Token): Represents missing words that the model must predict.
Staggered Sorting of Words and Parts of Speech (PoSs) in the Current Sentence
For Cycle SG-Net and Cycle SG-GAN, the input sentence is transformed into a staggered sequence of words interleaved with their PoS tags:
段譽 (Nb) 今天 (Nd) 心情 (Na) 不錯 (VH) (“Duan Yu is in a good mood today”)
BERT-Finetune and BERT-Pretrain follow a character-level representation, where each character is individually tagged with its corresponding PoS:
段(Nb) 譽(Nb) 今(Nd) 天(Nd) 心(Na) 情(Na) 不(VH) 錯(VH)
Part-of-Speech Sorting of the Next Sentence
For Cycle SG-Net and Cycle SG-GAN, the PoS sequence of the next sentence is explicitly provided as part of the input:
(Nb) (Cbb) (D) (Dfa) (VH)
In BERT-Finetune and BERT-Pretrain, the model must infer the next sentence’s structure by predicting masked tokens. The PoS sequence corresponds to the masked token positions, ensuring that the output sentence aligns with expected syntactic structures.
Innovative Input Sentence Construction and Target/Label Sentence
The complete innovative input sentences and their corresponding target/label sentences for each model are illustrated in Table 3.
By adopting these different strategies, Cycle SG-Net and Cycle SG-GAN focus on PoS-aware text generation with explicit syntactic control, while BERT-based models rely on masked prediction to infer missing text elements.
6. Experiments
The experimental results are divided into three main parts. The first part evaluates the proposed verification methods, which include BLEU scores based on different n-grams, Part-of-Speech Matching Degree, and Part-of-Speech Context Sorting Coherence. The second part involves generating stories and texts using each model to assess their performance in text generation. The third part focuses on human evaluation, in which 25 students majoring in Chinese language and literature participated in an experiment to assess the readability, coherence, and fluency of AI-generated texts. For verification, this experiment utilizes a dataset comprising martial arts novels written by the renowned Chinese novelist Jin Yong. The primary texts used include Demi-Gods and Semi-Devils, The Legend of the Condor Heroes, and The Deer and the Cauldron. For a detailed overview of the dataset, please refer to Table 4.
6.1. BLEU Based Analysis Results
The experiment analyzes the results of Chinese story generation using three evaluation metrics: BLEU scores, Part-of-Speech Matching Degree, and Part-of-Speech Context Sorting Coherence. The performance of Cycle Story Generation, Cycle Story Generation combined with GAN, and multiple BERT fine-tuned models will be compared in this study. For a detailed comparison of the experimental results, please refer to Table 5.
Although the BLEU score cannot fully capture the accuracy of semantics, grammar, and sentence structure, it serves as a useful reference for evaluating generalization ability and overfitting, as well as assessing the ordering of word combinations. According to Table 6, observations on generalization ability indicate that Cycle-SG Net and Cycle-SG GAN outperform other models. Notably, Cycle-SG Net achieves higher BLEU scores across 1-gram to 4-gram evaluations. Additionally, at the 4-gram level, the BLEU score reaches 0.23, suggesting a high degree of variation in word permutations and combinations. However, this may also indicate poor generation quality in some cases. Another critical factor is assessing overfitting, as BLEU scores in this experiment are derived solely from the training dataset. The results show that BERT-Finetune, BERT-Finetune-Large, and BERT-Pretrain exhibit higher BLEU scores, which suggests a potential overfitting issue. These models appear to be more influenced by frequent words or closely resemble the original training data, making them less capable of generating diverse and novel content.
The PoS matching value is measured as the average percentage difference in the number of parts of speech between the generated and expected sentences. According to Table 6, all models exhibit a PoS matching difference of less than 1%, indicating that they successfully accomplish the task of generating structurally consistent sentences. In most cases, the generated sentence structure adheres to the specified format, demonstrating that the models have effectively learned to construct sentences using the designated parts of speech. Experimental results further indicate that when the classification of parts of speech is well-defined, the PoS matching performance of classified models surpasses that of unclassified ones. Additionally, when examining the number of generated sentences, models generating ten sentences perform better than those generating five sentences, suggesting that cycle generation stabilizes sentence structure over a larger number of iterations. This result implies that the models are capable of long-form text generation while maintaining structural consistency.
Finally, in the structural analysis of Chinese sentences, word-level (vocabulary-based) training proves to be more effective than character-level training, because character-level training disperses the PoS structure and increases training complexity. The Cycle-SG Net and Cycle-SG GAN models utilize FastText word embedding technology, ensuring that the vocabulary in the dataset is distributed more effectively within the vector space. This design enhances the models’ ability to learn and represent sentence structures more efficiently, improving the overall quality of the generated text.
The PoS Context Sorting Coherence value is measured as the average percentage difference in the probability of association between generated and expected sentences. According to Table 7, all models achieve a coherence difference of less than 0.1%, indicating that the contextual association structures in the generated text align well with the expected outcomes. From the verification data, it is observed that Cycle-SG GAN outperforms all other models in generating coherent sentences, demonstrating superior contextual fluency. Additionally, findings from the BERT-Finetune model suggest that improving dataset quality can further enhance contextual coherence, highlighting the significance of well-prepared training data in generating structurally and contextually consistent text.
6.2. Context Generation of Each Model
Table 8 and Table 9 present the context generation results for each model, including Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain. These tables provide a comparative analysis of the generated text, highlighting the differences in contextual consistency, fluency, and structural alignment across models.
6.3. Assessing AI-Generated Chinese Text: A Comparison of Readability, Coherence, and Fluency Across Models
To further evaluate the quality of AI-generated Chinese text, this study conducts a human evaluation experiment comparing the performance of Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, BERT-Finetune-Large, BERT-Pretrain, and two additional methods from previous research. The experiment assesses three key criteria: readability, coherence, and fluency.
Experimental Design
A total of 25 students majoring in Chinese language and literature participated in the study. Each participant was presented with 21 generated short texts, with each text limited to 150 characters to maintain consistency in evaluation. The dataset includes:
Six different models (Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, BERT-Finetune-Large, BERT-Pretrain, and SG-Net).
Three generated samples per model, ensuring a balanced and diverse evaluation set.
Each participant read all 21 generated texts and assigned scores based on the following evaluation criteria:
Readability (1–5): Measures how easy and natural the generated text is to read.
Coherence (1–5): Assesses how well sentences connect logically and whether the content follows a natural progression.
Fluency (1–5): Evaluates grammatical correctness, smoothness, and overall text fluidity.
Participants rated each generated text using a five-point Likert scale, where 1 represents the lowest score and 5 the highest. Ratings were collected with decimal precision to capture more nuanced differences in evaluation. The three evaluation criteria in Table 10 (readability, coherence, and fluency) represent the average scores provided by the participants for each model in the respective categories, and the AVG column reflects the overall average score, which combines the results from these three criteria to provide a comprehensive assessment of each model’s performance. For a detailed breakdown of human evaluation scores, please refer to Table 10.
Based on the experimental results presented, the performance of the models in terms of readability, coherence, fluency, and overall average ratings highlights significant distinctions among the tested approaches. The proposed methods, Cycle-SG Net and Cycle-SG GAN, demonstrate notable performance, particularly in coherence and overall average scores. Cycle-SG GAN achieves the highest overall average score of 4.13, excelling in both readability (4.1) and coherence (4.6), which indicates its ability to produce highly coherent and readable Chinese texts. Meanwhile, Cycle-SG Net achieves an average score of 3.43, with balanced performance in readability (3.3), coherence (3.6), and fluency (3.4), showing its capability to maintain consistent sentence structure but with room for improvement in fluency. In comparison, BERT-Finetune-Large shows competitive performance, achieving an average score of 4.03, which is close to Cycle-SG GAN, with strong readability (4.2) but slightly lower coherence (4.0). BERT-Finetune and BERT-Pretrain also perform reasonably well, with average scores of 3.86 and 3.64, respectively, but are slightly weaker in coherence and fluency compared to Cycle-SG GAN.
Finally, SG-Net, another model tested for comparison, achieves the lowest overall average score of 3.23, with its weakest area being fluency (2.9). This highlights the effectiveness of the proposed Cycle-SG GAN model in improving text generation quality, particularly in coherence and readability, compared to both traditional methods and fine-tuned large-scale models. The results suggest that integrating adversarial learning, as demonstrated by Cycle-SG GAN, significantly enhances the coherence and overall quality of generated texts, making it a promising approach for future advancements in AI-generated Chinese stories.
7. Conclusions
In this experiment, in addition to BLEU verification, we also designed Part-of-Speech Matching Degree and Part-of-Speech Context Ordering Coherence as verification metrics to assess the effectiveness of the training task. These measures address the current lack of a comprehensive evaluation method in the text generation literature. The results show that all models successfully grasp the writing objectives of the innovative tasks, with Cycle-SG GAN demonstrating the most outstanding performance.
One of the key factors influencing model performance is that BERT-Finetune is fine-tuned from a language model pre-trained on large volumes of modern Chinese text, and both BERT-Finetune and BERT-Pretrain rely on character-level training. This training approach limits their effectiveness in capturing long-form structured text. However, the verification results suggest that expanding the dataset size could help address this limitation. Future studies should consider extending the human evaluation to linguistic experts and a broader pool of general readers, who could assess the readability, coherence, and fluency of generated texts and provide qualitative insights beyond automatic metrics. The performance of Cycle-SG GAN further suggests that word-level training better aligns with human writing patterns, allowing the model to structure text effectively and produce readable outputs. However, there is still significant room for improvement in word choice and contextual precision, highlighting the potential for further advancements in Chinese story generation. Future research should explore alternative verification methods, such as semantic similarity assessments, human-in-the-loop evaluations, or reinforcement learning-based text optimization.
Additionally, Chinese story generation remains an area with vast research potential, offering opportunities for developing more refined evaluation frameworks and practical applications. If machines can truly learn and understand semantics, they could assist humans in a wider range of NLP tasks, including literary creation, automated storytelling, and intelligent text summarization. Future experiments could focus on hybrid models that integrate statistical and neural-based methods, ensuring greater adaptability to complex narrative structures and diverse writing styles.