1. Introduction
Previous research on Chinese story cycle generation models has demonstrated that Chinese sentences can be effectively generated based on part-of-speech structures constrained by the model. Building on this foundation, this paper aims to enhance the cycle generation process by incorporating a fuzzy matching method to identify the most suitable vocabulary within an FP-Tree. The relevance of generated sentences is then used as the final selection criterion, ensuring that the chosen sentence improves contextual coherence in story generation.
Natural Language Processing (NLP) focuses on enabling computers to process and analyze large volumes of natural language data. Common applications, such as chatbots, automatic summarization, and reading comprehension, primarily generate shorter sentences by analyzing longer text. These applications are often statistical and analytical. Even in reading comprehension tasks, answers are typically extracted directly from a specific paragraph of the given text rather than being generated through sentence restructuring and deeper analysis. In past natural language research, evaluating generated text was often challenging. Standard evaluation metrics, such as BLEU [1] and ROUGE [2], primarily assess text quality by comparing generated outputs to predefined ground truth references. However, in natural language generation, there is no universally fixed evaluation method, particularly in the case of Chinese sentences, where word order and phrase structures exhibit significant variability. Existing NLP evaluation methods rely heavily on the original dataset, making it difficult to assess the performance of text generation tasks effectively.
To address this issue, this study proposes an evaluation approach that includes the Bilingual Evaluation Understudy (BLEU) for simple sentence verification, along with two additional methods designed explicitly for PoS structure evaluation: (1) PoS matching degree and (2) PoS context ordering coherence. The PoS matching degree method determines whether the expected distribution of parts of speech in the generated text aligns with the overall structural patterns. In creative writing, different authors exhibit unique content styles, and the model-generated text is influenced by the writing style present in the training dataset. Thus, assessing whether the distribution of various parts of speech in the generated content aligns with that of the training data can indicate whether the generated sentences reflect the author’s writing habits. By calculating the PoS matching degree, a smaller difference value signifies a higher degree of alignment, suggesting that the generated text more accurately follows the expected linguistic structure.
2. Related Work
In recent years, Natural Language Generation (NLG) has been increasingly recognized as a process in which a computer program produces natural language as output. According to Gatt and Krahmer [3], NLG can be categorized into two major types based on input: text-to-text generation and data-to-text generation. Text-to-text generation involves taking existing text as input and producing new text as output. Common applications of this approach include machine translation [4], text summarization and simplification [5], automatic writing correction [6], and paraphrasing [7].
Transformers, first introduced by Vaswani et al. [8], represent the state-of-the-art neural network architecture for natural language processing and have emerged as the dominant approach for natural language generation (NLG) [9]. Compared to earlier neural network architectures, such as Recurrent Neural Networks (RNNs) [10] and Long Short-Term Memory (LSTM) [11], Transformers effectively address the gradient vanishing problem [12], which can hinder the ability to maintain context over longer sequences and increase processing time for complex texts. Additionally, Transformers support parallel training, allowing for significantly faster processing. As training data and model architectures continue to expand, Transformers are capable of capturing long-range dependencies more effectively, leading to more advanced language understanding and generation [13].
Despite being a relatively new technique in NLP and NLG, Transformers have already demonstrated practical applications across various domains. For instance, Lee and Hsiang [14] utilized GPT-2 with fine-tuning to generate patent claims. By 2020, the development of GPT-3 [15] marked a significant milestone, with substantial increases in both model parameters and the volume of training data, further advancing the capabilities of large-scale language models.
Since 2015, numerous studies have explored language model pre-training, leading to the development of various approaches [16,17,18]. A language model (LM) is designed to estimate the probability of a word occurring in a given sentence, either as a discrete word or as part of a statistical probability distribution. Pre-training involves training a language model on large volumes of unlabeled data, allowing it to learn linguistic patterns and acquire a set of model parameters. These parameters can then be used to initialize the model for specific tasks, followed by fine-tuning on task-specific datasets to enhance performance.
Language model pre-training has been shown to significantly improve various natural language processing tasks. A notable example is the Bidirectional Encoder Representations from Transformers (BERT) [19], introduced by the Google AI Language team in 2019. BERT follows a pre-training and fine-tuning paradigm, leveraging a multi-layer bidirectional encoder model based on the Transformer architecture. By applying unsupervised learning, BERT enables the Transformer model to capture contextual dependencies more effectively, resulting in substantial improvements across multiple NLP applications [20,21,22,23]. While BERT-Finetune and BERT-Pretrain have demonstrated improvements in language generation, their reliance on masked language modeling limits their ability to maintain long-term coherence in generated text. Furthermore, they do not explicitly incorporate structural constraints such as part-of-speech arrangements, which are crucial for story generation.
3. Approach
This section introduces the cycle story generation process and its underlying architecture. One of the key challenges in natural language generation (NLG) is the lack of a standardized verification method. This issue is particularly pronounced in Chinese sentence generation, where word combinations and syntactic structures exhibit high variability. Existing evaluation methods in natural language processing primarily rely on comparisons with the original dataset, which often fail to effectively assess the performance of text generation tasks.
To address this limitation, in addition to using the Bilingual Evaluation Understudy (BLEU) as a verification method for simple sentences, this section proposes two additional evaluation methods specifically designed for PoS structures: (1) Part-of-Speech Matching Degree and (2) Part-of-Speech Context Ordering Coherence. These methods aim to provide a more comprehensive assessment of the syntactic accuracy and structural consistency of generated sentences.
3.1. Story Cyclic Generation Flow
Previous research has demonstrated that methods based on part-of-speech structures and cycle generation can effectively control both the style and content of generated text. Building upon this approach, fuzzy matching is introduced into part-of-speech selection, ensuring that candidate words maintain stronger relevance to the preceding sentence.
The input data structure includes the <SOS>, <EOS>, and <MOS> tokens, where <MOS> serves as a separator between the current and subsequent sentences. The current sentence is first segmented, with each word annotated with its corresponding PoS. The words of the next sentence segment are then used as the prediction target in the label. This design allows the model to anticipate the content of the next sentence based on the given input while also incorporating the expected part-of-speech structure of the upcoming sentence, thereby improving coherence and contextual consistency.
In this structure, the sequence from <SOS> to <MOS> represents the words and their corresponding parts of speech in the input sentence. Meanwhile, the segment from <MOS> to <EOS> denotes the part-of-speech sequence of the next sentence in the dataset. For the label, the original words within the <MOS> to <EOS> segment are extracted from the dataset based on their respective parts of speech, ensuring that the generated content aligns with the expected linguistic structure.
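To make the input and label construction concrete, the following sketch assembles the strings described above from an already segmented and tagged sentence pair. The <SOS>, <MOS>, and <EOS> tokens follow the paper; the helper name and the next-sentence example words are illustrative assumptions rather than items taken from the dataset.

```python
# Illustrative sketch of the input/label format described above.
# The next-sentence words here are hypothetical fillers for the example
# PoS sequence; only the format itself is the point.

def build_training_pair(curr_words, curr_pos, next_words, next_pos):
    # <SOS> ... <MOS>: staggered word/PoS arrangement of the current sentence.
    staggered = " ".join(f"{w} {p}" for w, p in zip(curr_words, curr_pos))
    # <MOS> ... <EOS>: only the PoS arrangement of the next sentence.
    model_input = f"<SOS> {staggered} <MOS> {' '.join(next_pos)} <EOS>"
    # Label: the original words that fill the next sentence's PoS slots.
    label = " ".join(next_words)
    return model_input, label

inp, lab = build_training_pair(
    ["段譽", "今天", "心情", "不錯"], ["Nb", "Nd", "Na", "VH"],
    ["段譽", "因此", "也", "十分", "高興"], ["Nb", "Cbb", "D", "Dfa", "VH"],
)
```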
After processing the data and training the model using Cycle-SG NET, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain, the entire cycle story generation process is illustrated in Figure 1. The cycle story is generated based on a predefined sentence format, where the part-of-speech structure of the next sentence is represented between <MOS> and <EOS>.
The cycle generation concept begins with constructing a Frequent Pattern Tree (FP-Tree) that captures the part-of-speech structures of all sentences in the dataset. When generating a new sentence, the model determines the appropriate part-of-speech structure for the next sentence by referencing the pre-established FP-Tree. The overall workflow is depicted in Figure 1.
3.2. Generating Candidate Part-of-Speech Tags Using FP-Tree
The process of selecting candidate words follows a structured approach. First, the method determines whether all parts of speech between <MOS> and <EOS> belong to the high-frequency category. If they do, a probability is calculated based on the distribution of each PoS and assigned accordingly. If a PoS does not meet the high-frequency criteria, no probability is assigned, and the corresponding word is omitted. Finally, the final candidate words are selected based on their assigned probabilities and used as the root to identify sentence structures within the FP-Tree.
Through this filtering process, only the most suitable candidate words are retained as the primary basis for further selection. These candidate words are then compared against the pre-established FP-Tree database, allowing the system to identify and generate the most contextually appropriate sentences.
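A minimal sketch of this filtering step is shown below. It assumes a pre-established set of common PoS tags (corpus share above roughly 10%, as noted in Algorithm 2) and a corpus-level probability for each tag; the names and numbers are illustrative only.

```python
import random

# Hedged sketch: keep only the high-frequency PoS tags of the target arrangement
# and pick one of them, weighted by its distribution, as the root of the FP-Tree search.
def pick_root_pos(pos_sorting, common_pos, pos_dist):
    candidates = [t for t in pos_sorting if t in common_pos]   # drop low-frequency tags
    if not candidates:
        return None                                            # no probability assigned
    weights = [pos_dist[t] for t in candidates]                # distribution-based weights
    return random.choices(candidates, weights=weights, k=1)[0]

root = pick_root_pos(["Nb", "Cbb", "D", "Dfa", "VH"],
                     common_pos={"Nb", "D", "VH"},
                     pos_dist={"Nb": 0.18, "D": 0.12, "VH": 0.15})
```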
Once the candidate word is identified, the complete sentence part-of-speech structure can be retrieved from the FP-Tree. However, during this search process, multiple sentences may be generated simultaneously based on the same part-of-speech structure. Selecting the most suitable candidate sentence is crucial, as it directly impacts the coherence and quality of the generated article.
Additionally, since the FP-Tree is constructed without considering the sequential relationships between words in adjacent sentences, it is possible to retrieve multiple different candidate sentences for the same candidate word. This introduces a challenge in determining the most appropriate part-of-speech arrangement for the next sentence. To address this issue, fuzzy matching is applied between the candidate sentence and the previous sentence. By evaluating the degree of similarity, the system identifies the candidate sentence with the highest matching score, ensuring that the most contextually appropriate sentence is selected. The process of determining similarity relies on Equations (1) and (2).
The similarity is computed as defined in Equations (1) and (2), where l denotes the shared prefix length between the candidate and reference sentences, and p is a user-defined constant in the range 0 < p ≤ 0.25 that determines the weight assigned to the shared prefix in the similarity calculation. These parameters ensure a balanced assessment of both structural similarity and contextual coherence.
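The definitions of l (shared prefix length) and p (0 < p ≤ 0.25) match the standard Jaro-Winkler formulation, so the sketch below assumes Equations (1) and (2) take that form. It operates directly on PoS tag sequences and is illustrative rather than a reproduction of the paper's exact equations.

```python
# Hedged sketch of a Jaro-Winkler-style fuzzy match over PoS tag sequences,
# assuming Equations (1) and (2) follow the standard formulation.

def jaro(s, t):
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    window = max(ls, lt) // 2 - 1
    s_hit, t_hit = [False] * ls, [False] * lt
    matches = 0
    for i in range(ls):                          # count matching tags within the window
        lo, hi = max(0, i - window), min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and s[i] == t[j]:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(ls):                          # count transposed matches
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / ls + matches / lt + (matches - transpositions) / matches) / 3

def similarity(s, t, p=0.1, max_prefix=4):
    base = jaro(s, t)
    l = 0                                        # shared prefix length, capped at 4
    for a, b in zip(s, t):
        if a != b or l == max_prefix:
            break
        l += 1
    return base + l * p * (1 - base)             # prefix-weighted boost (0 < p <= 0.25)

score = similarity(["Nb", "Cbb", "D", "Dfa", "VH"], ["Nb", "Cbb", "Dfa", "VH"])
```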
The most suitable candidate sentence can be identified through fuzzy matching, which evaluates the alignment between the candidate sentence and the reference context. Once the appropriate sentence is selected, its part-of-speech structure is inserted between the newly defined <MOS> and <EOS> tokens. This updated structure serves as the input for the cycle generation model, enabling the seamless generation of subsequent content while maintaining consistency throughout the article.
3.3. Bilingual Evaluation Understudy (BLEU)
In the early stages, the Bilingual Evaluation Understudy (BLEU) metric was commonly used to assess the quality of generated Chinese sentences. However, this evaluation method requires ground truth references to compare against the generated results. In fields like article and story generation, where ground truth is often unavailable, calculating accuracy becomes challenging. Nevertheless, BLEU can still provide insights by analyzing different n-gram patterns to determine whether a model has improved its understanding of the training set during the training process.
Although BLEU alone cannot fully evaluate the quality of generated content, it remains a valuable metric for comparing the same dataset across different models, providing insights into whether one model demonstrates a better understanding of the underlying data. The BLEU metric operates by performing a co-occurrence comparison between two sentences, effectively assessing their similarity. The calculation involves averaging the unit precision of each n-gram, where n represents the n-gram length. However, n-gram precision alone cannot address the issue of incomplete matching, necessitating the introduction of a penalty mechanism.
To address this, BLEU employs two key components: the brevity penalty (BP) and the overall BLEU score, which are calculated using Equations (3) through (5). These equations incorporate both precision and penalties to provide a more balanced assessment of the quality of generated content.
In the BLEU formula, n is typically set to four, corresponding to the classic BLEU-4 evaluation method, which is widely used in most NLP research papers. However, when applied to text generation, BLEU has some notable limitations. It cannot account for nuanced language expression, is heavily influenced by common words, and may fail to appropriately handle synonyms or similar words, often judging them as mismatched.
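As a concrete reference, the sketch below implements the standard BLEU computation (clipped n-gram precision, geometric mean, and brevity penalty) that Equations (3) to (5) describe; in practice, library implementations such as nltk's sentence_bleu can be used instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal BLEU-N with brevity penalty for one candidate/reference pair."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)                   # brevity penalty
    return bp * geo_mean

# Character-level toy example; max_n=2 keeps the score non-zero for short strings.
score = bleu(list("段譽今天心情不錯"), list("段譽今日心情很好"), max_n=2)
```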
Despite these shortcomings, BLEU offers several advantages, including fast computation, low computational cost, and the ability to assess associations through n-grams effectively. For these reasons, this study will utilize BLEU as a verification method and reference point for evaluating text generation quality.
3.4. Part-of-Speech Matching Degree
This method evaluates the matching degree of each part-of-speech quantity. In the subsequent analysis of Chinese Knowledge and Information Processing (CKIP) statements, variations in sentence arrangements or word combinations may lead to differences in word segmentation and part-of-speech tagging results. Even minor differences can result in further variations in these outputs. To address this, the verification method is divided into two categories: classification-based and non-classification-based matching. In the classification-based algorithm, similar parts of speech are grouped together, simplifying the comparison process. Conversely, in the non-classification approach, the original CKIP part-of-speech tags are retained without grouping, allowing for a more granular analysis. This dual approach ensures flexibility in evaluating part-of-speech consistency under varying linguistic conditions.
For the calculation of part-of-speech matching, please refer to Equations (6) and (7). The definitions for classification and non-classification remain consistent across these equations. The degree of part-of-speech matching is determined by calculating the sum and average of the differences in part-of-speech ratios, along with the quantity ratio of each PoS. In this context, designation refers to the part-of-speech arrangement derived from the part-of-speech arrangement search mechanism, while generation represents the part-of-speech arrangement obtained through CKIP analysis of sentences generated by the model. By comparing these two arrangements, the method evaluates how closely the generated content aligns with the expected part-of-speech structure.
This method employs a part-of-speech ratio algorithm to evaluate whether the overall structure of the expected part-of-speech distribution aligns with that generated by the model. The degree of difference is calculated by summing the absolute values of the differences between the two distributions. A smaller resulting value indicates a smaller degree of difference, which, in turn, reflects a higher matching degree between the model’s output and the expected structure.
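A minimal sketch of this calculation is given below, assuming Equations (6) and (7) average the absolute differences between the PoS ratio distributions of the designated and generated arrangements; the function name is illustrative.

```python
from collections import Counter

def pos_matching_degree(designated, generated):
    """Average absolute difference between the PoS ratio distributions of the
    designated arrangement and the arrangement CKIP extracts from the generated
    sentence. Smaller values indicate a closer structural match."""
    d_ratio = {t: c / len(designated) for t, c in Counter(designated).items()}
    g_ratio = {t: c / len(generated) for t, c in Counter(generated).items()}
    tags = set(d_ratio) | set(g_ratio)
    diffs = [abs(d_ratio.get(t, 0.0) - g_ratio.get(t, 0.0)) for t in tags]
    return sum(diffs) / len(tags)

delta = pos_matching_degree(["Nb", "Cbb", "D", "Dfa", "VH"],
                            ["Nb", "D", "Dfa", "VH", "VH"])
```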
3.5. Part-of-Speech Context Sorting Coherence
The part-of-speech matching process evaluates the degree of alignment generated for each sentence. However, sentence fluency is equally critical for the overall coherence of an article. To address this, Part-of-Speech Context Sorting Coherence is introduced to measure the degree of coherence between part-of-speech arrangements. First, a corresponding FastText model is trained using the dataset containing the part-of-speech arrangements of preceding and succeeding sentences. The FastText model enables the calculation of association probabilities between different parts of speech. The part-of-speech arrangement of the current input sentence is then compared to the part-of-speech arrangement generated by the model. Using these arrangements, the relative probability of the parts of speech is calculated. The coherence score is obtained by taking the average of the summed probabilities, which reflects the relationship between the two arrangements. This value is referred to as the FastText Relation Score (FRS), providing a quantitative measure of part-of-speech coherence.
The FRS between the current and designated part-of-speech arrangements and the FRS between the current and generated arrangements are both computed, and the ordering coherence of the part-of-speech context is defined as the absolute difference between these two scores. For the detailed formulas, please refer to Equations (8) through (10). Because the part-of-speech arrangements differ in length, the current, designated, and generated sequences are handled separately, and the FastText function is employed to calculate the association probabilities between these part-of-speech sequences. For a schematic representation of the algorithm, please refer to Figure 2.
This method aims to determine the correlation difference between the specified sentence and the sentences generated by the model using a part-of-speech association algorithm applied to preceding and succeeding sentences. By evaluating this correlation, the method ensures that the generated sentence aligns with the expected next-sentence structure specified in the model.
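The sketch below illustrates the idea with a gensim FastText model trained on PoS arrangements, using pairwise vector similarity as a stand-in for the association probability; the exact association function and training settings of the paper are not reproduced here.

```python
from gensim.models import FastText

# Hedged sketch: train a skip-gram FastText model on PoS arrangements and use
# pairwise similarity between tags as an illustrative association score.
pos_corpus = [["Nb", "Nd", "Na", "VH"], ["Nb", "Cbb", "D", "Dfa", "VH"]]  # toy data
ft = FastText(sentences=pos_corpus, vector_size=32, window=3, min_count=1, sg=1)

def frs(current_pos, other_pos, model):
    """Average association score between every tag pair of two PoS arrangements."""
    scores = [model.wv.similarity(a, b) for a in current_pos for b in other_pos]
    return sum(scores) / len(scores)

frs_designated = frs(["Nb", "Nd", "Na", "VH"], ["Nb", "Cbb", "D", "Dfa", "VH"], ft)
frs_generated = frs(["Nb", "Nd", "Na", "VH"], ["Nb", "D", "Dfa", "VH"], ft)
coherence_diff = abs(frs_designated - frs_generated)  # PoS context sorting coherence
```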
The cyclic generation framework is composed of two core components: Cyclic Generation Processing and the Part-of-Speech Sorting Search Mechanism (Algorithms 1 and 2). These components are essential in ensuring that the generated text maintains syntactic coherence, linguistic structure, and contextual consistency.
The first phase, Cyclic Generation Processing, serves as the primary loop responsible for generating text iteratively. Initially, the part-of-speech sorting table is preloaded using the FP-growth algorithm, constructing an FP-tree that stores frequently occurring PoS sequences (line 1). This FP-tree acts as a reference for predicting the syntactic structure of upcoming sentences. The generation model is then selected from one of four architectures: New SG-Net, New SG-GAN, BERT-Finetune, or BERT-Pretrain (line 2). The process begins with a starting sentence, which is assigned to both the current sentence and the story result (line 6). The number of generated sentences, n, is specified by the user (line 7), and the cyclic generation process continues until this number is reached. During each iteration, CKIP (the Chinese Knowledge and Information Processing system) is applied to the current sentence to extract its word sequence and PoS sequence (line 8). These extracted features are passed to the Part-of-Speech Sorting Search Mechanism, which queries the FP-tree to retrieve the most contextually appropriate PoS sequence for the next sentence (line 9). The retrieved next PoS sequence, together with the word and PoS sequences of the current sentence, is then handed to the Input Sentence Pre-processing module, which plays a crucial role in constructing a structured input for the generation model: it transforms the word sequence, PoS sequence, and next PoS sequence into an innovative input sentence that aligns with the expected linguistic patterns (line 10). This transformation ensures that the generated sentence adheres to both grammatical constraints and stylistic consistency. The innovative input sentence is then passed to the generation model (line 11), which produces a new sentence following the syntactic rules learned from the training data. The newly generated sentence undergoes further processing before being appended to the story result. First, its word sequence is extracted and assigned to the current sentence to serve as input for the next iteration (line 12). Then, the generated sentence is concatenated with the story result, progressively expanding the narrative (line 13). This cyclic process continues until the required number of sentences is reached, at which point the final generated text is returned as output.
Algorithm 1: Cycle Generation Processing
Pre-loaded part-of-speech sorting table
(01) FP-tree ← FP-growth algorithm(PoS sorting table)
(02) Generation model ← New SG-Net || New SG-GAN || BERT-Finetune || BERT-Pretrain
(03) CKIP(sentence) /*Get word and PoS sorting of a sentence*/
(04) Part-of-speech Sorting Search Mechanism(PoS sorting)
(05) Input Sentence Pre-processing(word sorting, PoS sorting, next PoS sorting)
(06) current sentence ← starting sentence; story result ← starting sentence
(07) NumofSent ← n /*User specifies the number of generated sentences, assume n.*/
while NumofSent > 0 do
(08) word sorting, PoS sorting ← CKIP(current sentence)
(09) next PoS sorting ← Part-of-speech Sorting Search Mechanism(PoS sorting)
(10) innovative input sentence ← Input Sentence Pre-processing(word sorting, PoS sorting, next PoS sorting)
(11) generated sentence ← Generation model(innovative input sentence)
(12) current sentence ← word sorting of generated sentence
(13) story result ← story result + current sentence
Algorithm 2: Part-of-speech Sorting Search Mechanism
Pre-established commonly used parts of speech /*Probability distribution > 10%*/
(01) Among the common PoS tags in the current PoS sorting, select one according to their probability distribution and use it to enter the FP-tree.
(02) Through the FP-growth algorithm, move first to the most frequent node of the selected PoS; combining the paths from that node to all leaves with the unique path from the node to the root yields several next PoS sortings as candidates.
(03) Regarding each PoS sorting as a set and using the smallest common set, find the complete PoS sortings in the PoS sorting table that match the candidates, and take these complete PoS sortings as the new candidates.
(04) Apply fuzzy matching between the new candidates and the current PoS sorting, and return the candidate with the highest similarity as the next PoS sorting.
4. Architecture of Each Model
For Chinese sentence and story generation, we compare Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain. The following sections introduce the architectures of Cycle-SG Net and Cycle-SG GAN, as well as the parameters used for each model. Additionally, we describe the process of adapting BERT-Finetune and BERT-Pretrain to our specific data format.
4.1. Architecture of Cycle-SG Net
In the feature extraction phase, FastText is employed for multi-dimensional quantized vector representation, effectively enhancing the training feature information for the Transformer model. A preprocessed dataset is used to train the FastText model, which adopts the skip-gram technique, predicting the surrounding context from the current word. The FastText model serves as the Word Embedding layer for SG-Net while retaining the Position Embedding originally designed for SG-Net to preserve the input sentence order. Additionally, Adaptive Softmax is implemented to reduce computational costs. To enhance the model’s ability to capture more complex natural language features, the model depth is increased from three layers to six layers, making the overall architecture more intricate. This structural enhancement enables the model to learn richer linguistic patterns.
Furthermore, the AdamW optimizer is utilized to fine-tune weight parameters, improving optimization efficiency. For a detailed overview of the model architecture and parameter settings, please refer to Table 1 and Figure 3a.
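The sketch below assembles the ingredients listed above in PyTorch: pretrained FastText-style word vectors as the embedding layer, a learned position embedding, a six-layer Transformer encoder, an adaptive softmax output, and AdamW. All dimensions are illustrative and do not reproduce the settings in Table 1.

```python
import torch
import torch.nn as nn

class CycleSGNetSketch(nn.Module):
    """Hedged architectural sketch, not the paper's exact Cycle-SG Net."""
    def __init__(self, fasttext_vectors, max_len=128, n_layers=6, n_heads=8):
        super().__init__()
        vocab_size, dim = fasttext_vectors.shape
        self.word_emb = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)
        self.pos_emb = nn.Embedding(max_len, dim)                 # position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Adaptive softmax reduces the cost of a large output vocabulary.
        self.out = nn.AdaptiveLogSoftmaxWithLoss(dim, vocab_size, cutoffs=[1000, 10000])

    def forward(self, token_ids, targets):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.word_emb(token_ids) + self.pos_emb(positions)
        h = self.encoder(h)
        return self.out(h.reshape(-1, h.size(-1)), targets.reshape(-1))

vectors = torch.randn(20000, 256)                                 # stand-in FastText matrix
model = CycleSGNetSketch(vectors)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```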
4.2. Architecture of the Cycle-SG GAN
To enhance the generation performance of Cycle-SG Net, a Generative Adversarial Network (GAN) is integrated, and internal modifications enable the model to train sentences effectively through adversarial learning. Cycle-SG GAN, like Cycle-SG Net, employs FastText as the Word Embedding method. The GAN architecture is designed with Cycle-SG Net functioning as the Generator and TextCNN serving as the Discriminator. In this study, the GAN objective function is formulated based on sentence pairs. The objective function of Cycle-SG GAN is expressed in Equation (11), in which the two distributions involved are the probability distribution of real sentences and the probability distribution of the current input sentence. For a visual representation of the Cycle-SG GAN architecture, please refer to Figure 3b.
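For reference, a conventional sentence-pair GAN objective with Cycle-SG Net as the generator G and TextCNN as the discriminator D would take the following minimax form; this is the standard formulation and may differ in detail from the paper's Equation (11).

```latex
\min_{G}\max_{D} V(D,G) =
  \mathbb{E}_{y \sim P_{\mathrm{real}}}\big[\log D(y)\big]
  + \mathbb{E}_{x \sim P_{\mathrm{input}}}\big[\log\big(1 - D(G(x))\big)\big]
```

Here, P_real denotes the probability distribution of real sentences and P_input the probability distribution of the current input sentence; both symbols are introduced only for this sketch.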
4.3. Architecture of BERT-Finetune
To compare the performance of Cycle-SG Net and Cycle-SG GAN, the experimental results are evaluated against BERT-Finetune and BERT-Pretrain. To ensure a fair comparison, the dataset is retrained using the “Jin Yong” novels. For BERT-Finetune, the BertForMaskedLM model framework is employed for fine-tuning. The model is initialized with pre-trained weight parameters in Traditional Chinese, which are loaded into BertForMaskedLM. To optimize the model, the AdamW optimizer is used for adjusting weight parameters, enabling effective fine-tuning for the task. For details on BERT-Finetune parameters, please refer to Table 2, and for the BERT-Finetune model architecture, see Figure 3c.
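A minimal sketch of this setup with the Hugging Face transformers library is shown below; "bert-base-chinese" is used only as an example checkpoint name, and the training-loop details (masking strategy, batching) are omitted.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

# Hedged sketch: load a Chinese masked-LM checkpoint into BertForMaskedLM
# and fine-tune it with AdamW.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("段譽今天心情[MASK][MASK]", return_tensors="pt")
labels = batch["input_ids"].clone()       # in practice, only masked positions are labeled
outputs = model(**batch, labels=labels)   # masked-LM loss
outputs.loss.backward()
optimizer.step()
```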
4.4. Architecture of BERT-Pretrain
The BERT-Pretrain model follows the same architecture as BERT-Finetune, with the key difference being that it does not utilize the BERT-base-Chinese pre-trained model as an initial weight parameter for BertForMaskedLM. Instead, the model undergoes pre-training from scratch. To enhance the training effect, input sentences are preprocessed and transformed into four feature representations:
Input: The original input sentence.
Attention: Focuses on the words at each position while avoiding attention to padding positions.
Type: Informs the model whether the current position belongs to the previous or next sentence.
Position: Provides the model with the positional sequence of each word.
The AdamW optimizer is used to fine-tune weight parameters, enabling the completion of the pre-training task. For details on the BERT-Pretrain model architecture, please refer to Figure 3d.
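The sketch below shows, in a hedged form, how the four features listed above map onto the inputs of a randomly initialized BertForMaskedLM for pre-training from scratch; vocabulary size, sequence length, and ids are illustrative.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(vocab_size=21128, num_hidden_layers=12)
model = BertForMaskedLM(config)                       # pre-training from scratch, no loaded weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

seq_len = 16
features = {
    "input_ids": torch.randint(0, config.vocab_size, (1, seq_len)),  # Input
    "attention_mask": torch.ones(1, seq_len, dtype=torch.long),      # Attention (ignore padding)
    "token_type_ids": torch.zeros(1, seq_len, dtype=torch.long),     # Type (previous/next sentence)
    "position_ids": torch.arange(seq_len).unsqueeze(0),              # Position
}
loss = model(**features, labels=features["input_ids"]).loss          # illustrative labels
```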
5. Training Methods for Each Model
To train the model to capture the style and structure of Jin Yong’s novels, a new data format is established through Word Sorting and Part-of-Speech Sorting, both of which are completed during data preprocessing. The input sentence consists of three key components:
Special Label—Identifies the type of input data.
Staggered Sorting of Words and Parts of Speech in the Current Sentence—Represents an interleaved arrangement of words and their corresponding parts of speech.
Part-of-Speech Sorting of the Next Sentence—Specifies the expected part-of-speech structure for the upcoming sentence.
Since different models require distinct training approaches, the composition of the data format varies accordingly. Each element in the input sentence carries a specific representational meaning based on the model being trained. For a detailed comparison of data formats across different training modes, please refer to Table 3, which provides examples of data formats for each model.
5.1. Special Label
The BERT-Finetune and BERT-Pretrain models inherently provide special tokens such as [CLS], [SEP], and [MASK] to distinguish different elements within the input. These tokens serve the following purposes:
[CLS]: Marks the beginning of the sentence.
[SEP]: Differentiates between the current sentence, the following sentence, and the target sentence.
[MASK]: Indicates the word whose position needs to be predicted.
In contrast, Cycle-SG Net and Cycle-SG GAN are custom-designed models that do not utilize the predefined BERT tokens. Instead, they require custom tagging mechanisms to structure the input data effectively. To address this, the <SOS>, <EOS>, and <MOS> tags are introduced:
<SOS>: Marks the start of the sentence.
<EOS>: Denotes the end of the sentence.
<MOS>: Separates the word and part-of-speech arrangement of the current sentence from the part-of-speech arrangement of the next sentence.
For Cycle-SG Net and Cycle-SG GAN, the training tasks focus on distinguishing and learning the interleaved structure of words and parts of speech in the current sentence while simultaneously designating the part-of-speech arrangement for the next sentence. This customized approach ensures that the models generate structured and contextually coherent outputs.
5.2. Staggered Sorting of Words and PoS in the Current Sentence
The current input sentence is processed using CKIP word segmentation and part-of-speech analysis, transforming it into a staggered arrangement of words and their corresponding parts of speech. This preprocessing step aims to help the training model comprehend the writing structure of Jin Yong’s novels while reinforcing the relationship between words and their respective parts of speech. By structuring the data in this manner, the model can better capture the stylistic and syntactic patterns characteristic of Jin Yong’s writing.
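As an illustration, the sketch below produces such a staggered arrangement with the ckip-transformers package (assuming its current word segmentation and PoS tagging drivers); the example outputs in the comments are indicative only.

```python
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

# Hedged sketch of turning a raw sentence into the staggered word/PoS arrangement.
ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")

sentence = "段譽今天心情不錯"
words = ws_driver([sentence])[0]          # e.g., ["段譽", "今天", "心情", "不錯"]
tags = pos_driver([words])[0]             # e.g., ["Nb", "Nd", "Na", "VH"]
staggered = " ".join(f"{w} {t}" for w, t in zip(words, tags))
# e.g., "段譽 Nb 今天 Nd 心情 Na 不錯 VH"
```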
5.3. PoS Sorting of the Next Sentence
In the training process, the PoS arrangement of the next sentence is derived from the dataset based on the current sentence. The goal is to enable the model to learn the structural patterns of subsequent sentences and improve its ability to generate coherent text. For the Cycle-SG NET and Cycle-SG GAN models, the PoS length of the next sentence is set to match the length of the target-generated sentence. In contrast, for the BERT-Finetune and BERT-Pretrain models, the PoS length of the next sentence corresponds to the number of [MASK] tokens, ensuring alignment with BERT’s masked language modeling mechanism.
Input Sentence and Next Sentence
Input sentence: represents the current textual input given to the model.
Next sentence: represents the ground truth sentence that follows in the dataset, which the model is expected to learn and predict in a structured manner.
Special Labels
Our proposed Cycle SG-Net and Cycle SG-GAN introduce three special labels:
<SOS> (Start of Sentence): Indicates the beginning of the sequence.
<MOS> (Middle of Sentence): Separates the current sentence’s staggered word-PoS sequence from the PoS sequence of the next sentence.
<EOS> (End of Sentence): Marks the end of the input sequence.
In contrast, BERT-Finetune and BERT-Pretrain utilize predefined BERT special tokens:
[CLS] (Classification Token): Indicates the beginning of the sequence.
[SEP] (Separator Token): Separates different parts of the input.
[MASK] (Mask Token): Represents missing words that the model must predict.
Staggered Sorting of Words and Parts of Speech (PoSs) in the Current Sentence
For Cycle SG-Net and Cycle SG-GAN, the input sentence is transformed into a staggered sequence of words interleaved with their PoS tags:
段譽 (Nb) 今天 (Nd) 心情 (Na) 不錯 (VH) (“Duan Yu is in a good mood today”)
BERT-Finetune and BERT-Pretrain follow a character-level representation, where each character is individually tagged with its corresponding PoS:
段(Nb) 譽(Nb) 今(Nd) 天(Nd) 心(Na) 情(Na) 不(VH) 錯(VH)
Part-of-Speech Sorting of the Next Sentence
For Cycle SG-Net and Cycle SG-GAN, the PoS sequence of the next sentence is explicitly provided as part of the input:
(Nb) (Cbb) (D) (Dfa) (VH)
In BERT-Finetune and BERT-Pretrain, the model must infer the next sentence’s structure by predicting masked tokens. The PoS sequence corresponds to the masked token positions, ensuring that the output sentence aligns with expected syntactic structures.
Innovative Input Sentence Construction and Target/Label Sentence
The complete innovative input sentences and their corresponding target/label sentences for each model are illustrated in Table 3.
By adopting these different strategies, Cycle SG-Net and Cycle SG-GAN focus on PoS-aware text generation with explicit syntactic control, while BERT-based models rely on masked prediction to infer missing text elements.
6. Experiments
The experimental results are divided into three main parts. The first part evaluates the proposed verification methods, which include BLEU scores based on different n-grams, Part-of-Speech Matching Degree, and Part-of-Speech Context Sorting Coherence. The second part involves generating stories and texts using each model to assess their performance in text generation. The third part focuses on human evaluation, in which 25 students majoring in Chinese language and literature participated in an experiment to assess the readability, coherence, and fluency of AI-generated texts. For verification, this experiment utilizes a dataset comprising martial arts novels written by the renowned Chinese novelist Jin Yong. The primary texts used include Demi-Gods and Semi-Devils, The Legend of the Condor Heroes, and The Deer and the Cauldron. For a detailed overview of the dataset, please refer to Table 4.
6.1. BLEU Based Analysis Results
The experiment analyzes the results of Chinese story generation using three evaluation metrics: BLEU scores, Part-of-Speech Matching Degree, and Part-of-Speech Context Sorting Coherence. The performance of Cycle Story Generation, Cycle Story Generation combined with GAN, and multiple BERT fine-tuned models will be compared in this study. For a detailed comparison of the experimental results, please refer to Table 5.
Although the BLEU score cannot fully capture the accuracy of semantics, grammar, and sentence structure, it serves as a useful reference for evaluating generalization ability and overfitting, as well as assessing the ordering of word combinations. According to Table 6, observations on generalization ability indicate that Cycle-SG Net and Cycle-SG GAN outperform other models. Notably, Cycle-SG Net achieves higher BLEU scores across 1-gram to 4-gram evaluations. Additionally, at the 4-gram level, the BLEU score reaches 0.23, suggesting a high degree of variation in word permutations and combinations. However, this may also indicate poor generation quality in some cases. Another critical factor is assessing overfitting, as BLEU scores in this experiment are derived solely from the training dataset. The results show that BERT-Finetune, BERT-Finetune-Large, and BERT-Pretrain exhibit higher BLEU scores, which suggests a potential overfitting issue. These models appear to be more influenced by frequent words or closely resemble the original training data, making them less capable of generating diverse and novel content.
The PoS matching value is measured as the average percentage difference in the number of parts of speech between the generated and expected sentences. According to Table 6, all models exhibit a PoS matching difference of less than 1%, indicating that they successfully accomplish the task of generating structurally consistent sentences. In most cases, the generated sentence structure adheres to the specified format, demonstrating that the models have effectively learned to construct sentences using the designated parts of speech. Experimental results further indicate that when the classification of parts of speech is well-defined, the PoS matching performance of classified models surpasses that of unclassified ones. Additionally, when examining the number of generated sentences, models generating ten sentences perform better than those generating five sentences, suggesting that cycle generation stabilizes sentence structure over a larger number of iterations. This result implies that the models are capable of long-form text generation while maintaining structural consistency.
Finally, in the structural analysis of Chinese sentences, word-level (vocabulary-based) training proves to be more effective than character-level training, because character-level training disperses the PoS structure and increases training complexity. The Cycle-SG Net and Cycle-SG GAN models utilize FastText word embedding technology, ensuring that the vocabulary in the dataset is distributed more effectively within the vector space. This design enhances the models’ ability to learn and represent sentence structures more efficiently, improving the overall quality of the generated text.
The PoS Context Sorting Coherence value is measured as the average percentage difference in the probability of association between generated and expected sentences. According to Table 7, all models achieve a coherence difference of less than 0.1%, indicating that the contextual association structures in the generated text align well with the expected outcomes. From the verification data, it is observed that Cycle-SG GAN outperforms all other models in generating coherent sentences, demonstrating superior contextual fluency. Additionally, findings from the BERT-Finetune model suggest that improving dataset quality can further enhance contextual coherence, highlighting the significance of well-prepared training data in generating structurally and contextually consistent text.
6.2. Context Generation of Each Model
Table 8 and Table 9 present the context generation results for each model, including Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, and BERT-Pretrain. These tables provide a comparative analysis of the generated text, highlighting the differences in contextual consistency, fluency, and structural alignment across models.
6.3. Assessing AI-Generated Chinese Text: A Comparison of Readability, Coherence, and Fluency Across Models
To further evaluate the quality of AI-generated Chinese text, this study conducts a human evaluation experiment comparing the performance of Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, BERT-Finetune-Large, BERT-Pretrain, and two additional methods from previous research. The experiment assesses three key criteria: readability, coherence, and fluency.
Experimental Design
A total of 25 students majoring in Chinese language and literature participated in the study. Each participant was presented with 21 generated short texts, with each text limited to 150 characters to maintain consistency in evaluation. The dataset includes:
Six different models (Cycle-SG Net, Cycle-SG GAN, BERT-Finetune, BERT-Finetune-Large, BERT-Pretrain, and SG-Net).
Three generated samples per model, ensuring a balanced and diverse evaluation set.
Each participant read all 21 generated texts and assigned scores based on the following evaluation criteria:
Readability (1–5): Measures how easy and natural the generated text is to read.
Coherence (1–5): Assesses how well sentences connect logically and whether the content follows a natural progression.
Fluency (1–5): Evaluates grammatical correctness, smoothness, and overall text fluidity.
Participants rated each generated text using a five-point Likert scale, where 1 represents the lowest score and 5 the highest. Ratings were collected with decimal precision to capture more nuanced differences in evaluation. The three evaluation criteria in Table 10 (readability, coherence, and fluency) represent the average scores provided by the participants for each model in the respective categories, and the AVG column reflects the overall average score, which combines the results from these three criteria to provide a comprehensive assessment of each model’s performance. For a detailed breakdown of human evaluation scores, please refer to Table 10.
Based on the experimental results presented, the performance of the models in terms of readability, coherence, fluency, and overall average ratings highlights significant distinctions among the tested approaches. The proposed methods, Cycle-SG Net and Cycle-SG GAN, demonstrate notable performance, particularly in coherence and overall average scores. Cycle-SG GAN achieves the highest overall average score of 4.13, excelling in both readability (4.1) and coherence (4.6), which indicates its ability to produce highly coherent and readable Chinese texts. Meanwhile, Cycle-SG Net achieves an average score of 3.43, with balanced performance in readability (3.3), coherence (3.6), and fluency (3.4), showing its capability to maintain consistent sentence structure but with room for improvement in fluency. In comparison, BERT-Finetune-Large shows competitive performance, achieving an average score of 4.03, which is close to Cycle-SG GAN, with strong readability (4.2) but slightly lower coherence (4.0). BERT-Finetune and BERT-Pretrain also perform reasonably well, with average scores of 3.86 and 3.64, respectively, but are slightly weaker in coherence and fluency compared to Cycle-SG GAN.
Finally, SG-Net, another model tested for comparison, achieves the lowest overall average score of 3.23, with its weakest area being fluency (2.9). This highlights the effectiveness of the proposed Cycle-SG GAN model in improving text generation quality, particularly in coherence and readability, compared to both traditional methods and fine-tuned large-scale models. The results suggest that integrating adversarial learning, as demonstrated by Cycle-SG GAN, significantly enhances the coherence and overall quality of generated texts, making it a promising approach for future advancements in AI-generated Chinese stories.
7. Conclusions
In this experiment, in addition to BLEU verification, we also designed Part-of-Speech Matching Degree and Part-of-Speech Context Ordering Coherence as verification metrics to assess the effectiveness of the training task. These measures address the current lack of a comprehensive evaluation method in the text generation literature. The results show that all models successfully grasp the writing objectives of the innovative tasks, with Cycle-SG GAN demonstrating the most outstanding performance.
One of the key factors influencing model performance is that BERT-Finetune is fine-tuned from a language model pre-trained on large volumes of modern Chinese text, and both BERT-Finetune and BERT-Pretrain rely on character-level training. This training approach limits their effectiveness in capturing long-form structured text. However, the verification results suggest that expanding the dataset size could help address this limitation. Future studies should consider extending the human evaluation to linguistic experts and a broader pool of general readers, who could assess the readability, coherence, and fluency of generated texts and provide qualitative insights beyond automatic metrics. The performance of Cycle-SG GAN further suggests that word-level training better aligns with human writing patterns, allowing the model to structure text effectively and produce readable outputs. However, there is still significant room for improvement in word choice and contextual precision, highlighting the potential for further advancements in Chinese story generation. Future research should explore alternative verification methods, such as semantic similarity assessments, human-in-the-loop evaluations, or reinforcement learning-based text optimization.
Additionally, Chinese story generation remains an area with vast research potential, offering opportunities for developing more refined evaluation frameworks and practical applications. If machines can truly learn and understand semantics, they could assist humans in a wider range of NLP tasks, including literary creation, automated storytelling, and intelligent text summarization. Future experiments could focus on hybrid models that integrate statistical and neural-based methods, ensuring greater adaptability to complex narrative structures and diverse writing styles.