Article

Advancing Quality Assessment in Vertical Field: Scoring Calculation for Text Inputs to Large Language Models

Jun-Kai Yi and Yi-Fan Yao
College of Automation, Beijing Information Science and Technology University, Beijing 100192, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 6955; https://doi.org/10.3390/app14166955
Submission received: 10 May 2024 / Revised: 3 August 2024 / Accepted: 7 August 2024 / Published: 8 August 2024
(This article belongs to the Special Issue Text Mining, Machine Learning, and Natural Language Processing)

Abstract

With the advent of Transformer-based generative AI, there has been a surge in research on large-scale generative language models, especially in natural language processing applications. These models have also demonstrated immense potential across various vertical fields, ranging from education and history to mathematics, medicine, information processing, and cybersecurity. In research on Chinese-language AI applications, the quality of text generated by generative AI has become a central focus of attention, yet the quality of the input text remains overlooked. Consequently, based on the vectorization comparison of vertical field lexicons and on text structure analysis, this paper proposes three input indicators, D1, D2, and D3, that affect generation quality. Building on these, we developed a text quality evaluation algorithm called VFS (Vertical Field Score) and designed an output evaluation metric named V-L (Vertical-Length). Our experiments indicate that higher-scoring input texts enable generative AI to produce more effective outputs. This helps users, particularly those leveraging generative AI for question answering in specific vertical fields, obtain more effective and accurate responses.

1. Introduction

In recent years, Transformer-based models have taken a dominant position in the field of natural language processing (NLP) due to their outstanding performance [1,2]. Prior to this, Recurrent Neural Networks (e.g., LSTM) were widely used to solve a variety of existing NLP problems [3,4,5]. However, Recurrent Neural Networks exhibit limitations in capturing long-distance dependencies within data sequences, struggling in particular with information at the beginning or end of a text and with distant information [6]. Furthermore, their architecture does not effectively support parallelization of training and inference, which poses a significant challenge in terms of computational resources [7]. In contrast, the advent of the Transformer architecture has effectively overcome these issues. It was initially proposed as a sequence-to-sequence encoder–decoder model [8]; its advantages lie in the use of attention mechanisms to capture long-distance relationships in text and in its ability to easily parallelize computation.
With the development of more powerful GPUs and TPUs [9], it has become possible to create models with an increased number of parameters, allowing these models to achieve or even surpass human-level performance in an expanding array of tasks [10,11,12]. To make the model’s responses more aligned with human needs, the development of the InstructGPT model employed a reinforcement learning from human feedback approach [13], which is particularly important for model fine-tuning.
However, due to the characteristics of the encoder–decoder design in the Transformer architecture, large language models (LLMs) need to convert input data into fixed-length representations, which the decoder then uses to generate output. Despite the significant improvements in context and semantic understanding offered by the Transformer's "multi-head" design, its effectiveness remains limited when dealing with complex text problems and specific tasks. In particular, within specific vertical fields, accomplishing certain tasks necessitates preprocessing the input text to enhance the efficiency of the encoder and increase the accuracy of the generated text. For example, Xuemei Dong and colleagues, when utilizing ChatGPT-3.5-Turbo-0301 for zero-shot Text-to-SQL tasks, employed structured inputs and prompt engineering, effectively improving the success rate of SQL statement generation on the Spider test set [14]. Furthermore, to enhance text generation, Le Xiao and colleagues combined a genetic algorithm with ChatGPT's natural language processing capabilities to generate news headlines, achieving notable results [15].
Based on these considerations, we designed an input text quality evaluation algorithm, VFS (Vertical Field Score), which is divided into three parts: prompt word score, vertical industry relevance score, and text logicality score. By performing a weighted calculation of the scores from these three aspects, we obtained the total score. Through experimental testing, we found that higher-scoring inputs were able to achieve higher quality, more specialized output results. This provides strong support for enhancing the output effects of generative AI in vertical industries.

2. VFS Algorithm Model

The VFS (Vertical Field Score) algorithm comprises three components: D1, D2, and D3. The overall score, S, can be calculated using the following formula (Formula (1)):
$$S = \frac{\ln(e + D_1)\,(D_2 + D_3)}{2\ln(e + 1)} \quad (1)$$
where D1 represents the prompt word score, D2 represents the text structure score, and D3 represents the content relevance score.
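For concreteness, Formula (1) can be expressed in a few lines of Python. This is a minimal sketch derived solely from the formula above; the function name and argument handling are our own illustration, not code from the paper.

```python
import math

def vfs_score(d1: float, d2: float, d3: float) -> float:
    """Overall VFS score S from Formula (1).

    d1: prompt word score (0 or 1)
    d2: text structure score in [0, 1]
    d3: content relevance score in [0, 1]
    """
    return math.log(math.e + d1) * (d2 + d3) / (2 * math.log(math.e + 1))

# Example: a prompt word is present and both sub-scores are 0.8.
print(vfs_score(1, 0.8, 0.8))  # 0.8, since ln(e + 1) cancels when d1 = 1
```

Note that with D1 = 0 the numerator's logarithm equals 1, so S is scaled down by the factor ln(e + 1) ≈ 1.313; the maximum score S = 1 is reached only when all three indicators are at their maxima.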
When designing evaluation metrics, it is crucial to understand the semantic processing logic of large language models (LLMs). LLMs first process the input through their multi-layer Transformer architecture in a weighted manner, aiming to extract comprehensive information. When processing human input, however, most model frameworks learn during training to pay special attention to the logical structure of human language, such as conjunctions (e.g., “therefore”, “then”, “finally”). The models are highly sensitive to these terms and segment sentences containing them for analysis, in order to understand the meaning of the input text and its contextual logic more accurately. Large language models therefore place higher demands on the logical coherence of lengthy texts, including those requiring a deep understanding of complex issues. A lack of structural words, or overly verbose narration, can bias the model’s information processing.
Additionally, large models are extremely sensitive to specific prompt words. For example, Xuemei Dong and colleagues improved the accuracy of ChatGPT-3.5-Turbo-0301 on specific tasks, such as zero-shot Text-to-SQL, by prompting ChatGPT to play the role of a professional SQL engineer [14]. The more specialized the information, the more specialized the answers generated by the large model will be.

3. VFS Algorithm Analysis

3.1. Prompt Word Score D1

Prompt words play a crucial role in the output of large language models (LLMs). They enable the model to lock onto specific areas of knowledge within the vast training data, improving accuracy in subsequent tasks. The role of prompt words in LLM applications is extremely broad. For example, in handling visual problems, specific prompt words can significantly improve the accuracy of recognition and analysis [16]. Similarly, in studying the programming capabilities of ChatGPT-3.5, appropriate prompt words are used to enhance the success rate of code generation [17]. When calculating the prompt word score, the prompt words in the input text are evaluated. In the experiment, it was found that when the same prompt appears n times (n > 1), the generated results do not improve and may even deteriorate; when n = 1, the generated results show a significant improvement. Therefore, the metric considers only the case n = 1. If the text contains a prompt word, D1 is set to 1; if it does not, D1 is set to 0. This quantifies the impact of prompt words on the output quality of large language models and reflects it in the subsequent evaluation metrics.
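Since D1 is binary, its computation reduces to a membership test against a prompt-word list. The sketch below is our illustration only; the paper does not publish its prompt-word list, so the example entries are placeholders.

```python
# Hypothetical prompt-word list for the cybersecurity domain;
# the paper's actual list is not published.
PROMPT_WORDS = {"假设你是一名网络安全工程师", "扮演安全专家"}

def d1_score(text: str) -> int:
    """D1 = 1 if the input contains at least one prompt word, else 0.
    Only n = 1 is rewarded; repeating the same prompt does not help."""
    return 1 if any(p in text for p in PROMPT_WORDS) else 0
```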

3.2. Text Structure Score D2

The degree of text structuring directly affects the output efficiency of large language models (LLMs). Strongly structured input texts typically yield higher-quality outputs. In research on the generative effects of large language models across many fields, researchers often subconsciously opt for structured questions and input formats. For instance, in studies on extracting character relationships from texts, questions and task requirements are clearly segmented through formatting or punctuation; this structuring of inputs helps ChatGPT-3.5 provide more accurate answers [18]. However, in most practical application scenarios, users do not fully recognize the importance of input format, leading to varying degrees of misunderstanding by large language models, especially in the field of mathematics [19]. Structuring is therefore crucial for enhancing the output effectiveness of large language models.
Evaluating the degree of text structuring is a complex task. Typically, the degree of text structuring is directly positively correlated with its logical coherence, but the judgment of logical coherence often carries strong subjectivity. Therefore, when assessing the degree of text structuring, we choose to use text structure parameters to quantify the overall structuring score of the text. The text structure score, D2, is composed of three parts: s1, s2, and s3. The method for calculating its total score is shown in the following formula (Formula (2)).
$$D_2 = \frac{1}{3}(s_1 + s_2 + s_3) \quad (2)$$
The overall calculation process of D2 is shown in Figure 1 below. The problem presented in the figure consists of five sentences and two logical connectors, with the logical connectors already highlighted in bold.
Here, s1 represents the lexical richness of the text: the text is segmented into words, the number of unique words n1 is counted, and s1 is the ratio of unique words to the total word count n (Formula (3)).
$$s_1 = \frac{n_1}{n} \quad (3)$$
In this paper’s evaluation metric system, s2 measures sentence complexity. The text is split into individual sentences using periods, the length of each sentence is computed, and the average sentence length l1 is derived. According to related research [20], an analysis of a Chinese corpus of 1.2 million characters found 112,431 sentences with an average sentence length of 10.91 characters; sentences of 30 characters or fewer accounted for over 95% of the total, and sentences of 40 characters or fewer for over 99%. Based on these data, 40 characters is set as the maximum sentence length l, and the average sentence length l1 of each text is compared to l to calculate s2 (Formula (4)). If the average sentence length exceeds 40 characters, l1 is capped at l and s2 takes its maximum value (s2 = 1). This method quantitatively assesses the sentence structure complexity of the text and incorporates it into the overall evaluation metric.
$$s_2 = \frac{l_1}{l} \quad (4)$$
The s3 parameter measures the use of logical connectives in sentences. Logical connectives play a key role in how large language models recognize and understand text: they help the model grasp contextual relationships and the core content of the text more accurately, thereby avoiding misreadings of synonymous sentences. Logical connectives are therefore an important aspect of structured sentence construction.
In the Chinese context, logical connectives cover a variety of types, including causal, parallel, transitional, comparative, and progressive relationships, as well as conclusions. By integrating open-source Chinese word lists from the internet, these logical connectives were extracted and compiled into a connective word dictionary.
Based on this, a formula for calculating the s3 parameter was designed, as shown in Equation (5). Through this approach, it is possible to quantitatively assess the usage of logical connectives in text and incorporate it as an important component of the D2 evaluation metric.
$$s_3 = \frac{z_1}{z} \quad (5)$$
In the evaluation model, z and z1 are two key parameters: the theoretical maximum number of logical connectors in the input text and the actual number of logical connectors in it, respectively. First, the text is segmented into individual sentences using punctuation marks, yielding the total number of sentences. The maximum number of logical connectors per sentence is taken to be one, so the theoretical maximum z equals the total number of sentences. Next, all logical connectors appearing in the text are identified against the predefined connective dictionary, giving the actual count z1. If z1 exceeds z, z1 is set to z and s3 is assigned a value of 1.
This analysis method can quantitatively assess the frequency and distribution of logical connectives in the text. This assessment is important for understanding the degree of sentence structuring and the logical relationships within the context, while also providing crucial quantitative data for the D2 evaluation score.
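Putting Formulas (2)–(5) together, D2 can be computed as in the sketch below. The paper does not name its tokenizer, so the jieba segmenter is our assumption, and the connective dictionary shown is a small placeholder for the compiled dictionary described above.

```python
import re
import jieba  # assumed segmenter; the paper does not name its tokenizer

L_MAX = 40  # maximum sentence length, from the corpus study [20]
# Placeholder connective dictionary; the paper compiles a larger one
# from open-source Chinese word lists.
CONNECTIVES = {"因此", "所以", "然后", "最后", "但是", "而且", "首先", "其次"}

def d2_score(text: str) -> float:
    """Text structure score D2 from Formulas (2)-(5)."""
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()] or [text]
    words = [w for w in jieba.lcut(text) if w.strip()]
    # s1: lexical richness = unique words / total words (Formula (3))
    s1 = len(set(words)) / len(words)
    # s2: average sentence length relative to the 40-character cap (Formula (4))
    l1 = sum(len(s) for s in sentences) / len(sentences)
    s2 = min(l1, L_MAX) / L_MAX
    # s3: connectives found, capped at one per sentence (Formula (5))
    z = len(sentences)
    z1 = min(sum(text.count(c) for c in CONNECTIVES), z)
    s3 = z1 / z
    return (s1 + s2 + s3) / 3  # Formula (2)
```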

3.3. Content Relevance Score D3

In the application of large language models in vertical fields, the importance of content relevance cannot be overlooked. For instance, in the application of large models in the medical field, when processing patient symptom information and providing suggestions, the specificity of the patient’s input text has a significant impact on the output results [21]. Similarly, in specialized models for the legal field, such as Chatlaw, given the complexity of real cases, slight differences in words can lead to drastically different outcomes. Therefore, in professional fields, the precise handling of content-relevant input becomes particularly crucial. Despite the Transformer architecture’s excellent performance in natural language processing, the specificity and relevance of the input content still directly affect its output results [22].
Therefore, the stronger the relevance of content to vertical industries, the richer and more professional the output content of large language models becomes. Based on this premise, this study designed a vertical industry-specific vocabulary comparison algorithm based on the Word2vec model. By integrating genetic algorithms, the issue of irrelevant word vectors was effectively resolved, and ultimately, a score for evaluating the relevance of text content was obtained through vector comparison. The specific process for scoring content relevance is illustrated in Figure 2.
This algorithm not only enhances the model’s understanding of professional terminology but also improves its accuracy and effectiveness in specific field applications.
As shown in Figure 2, the input text is preprocessed to produce tokenization results, which are then matched against a vectorized lexicon. If a word is present in the specialized domain lexicon, its word vector is used; if not, it is assigned the unrelated vector xe, which is obtained by genetic-algorithm screening over the vectorized lexicon. The vectors of all words are then summed to obtain the input text vector Xinput, which is compared with all words in the vectorized lexicon using cosine similarity to obtain the final content relevance score, D3.
The calculation of D3 primarily involves three main modules: the vectorized lexicon module, the unrelated word vector screening module, and the D3 calculation module. These will be detailed in Section 3.3.1, Section 3.3.2 and Section 3.3.3, respectively.

3.3.1. Vectorized Lexicon Module

Word2vec is a word embedding model used for obtaining word vectors. It can vectorize words in text, with the word vectors containing semantic information between words and their context. This allows for a good measure of the relationships between words and provides a basis for understanding their meanings [23].
Word2vec includes two architectures: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW model, the central word is predicted based on the words surrounding it, up to c words before and after. Based on this, each word serves as a central word to adjust the word vectors, as illustrated in Figure 3. Conversely, the Skip-gram model predicts the surrounding words, up to c words before and after, based on the central word, as shown in Figure 4.
In this module, because the vectorized lexicon designed for the vertical domain contains a large amount of specialized vocabulary, the Skip-gram architecture is used to train the lexicon.
Based on a general Chinese cybersecurity textbook and 75 related papers in the field of cybersecurity, a cybersecurity knowledge base was compiled. A Word2vec word vector model with a vector dimension of 200 was trained, serving as the vectorized lexicon for the cybersecurity domain.
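Using gensim, this training step can be sketched as follows. The corpus file name and preprocessing are our assumptions; the paper specifies only a Skip-gram model with 200-dimensional vectors trained on a cybersecurity textbook and 75 related papers.

```python
import jieba
from gensim.models import Word2Vec

# Hypothetical corpus file assembled from the cybersecurity textbook
# and the 75 related papers; one document per line.
with open("cybersecurity_corpus.txt", encoding="utf-8") as f:
    sentences = [jieba.lcut(line.strip()) for line in f if line.strip()]

# sg=1 selects Skip-gram; vector_size=200 matches the paper's setup.
model = Word2Vec(sentences, vector_size=200, sg=1, window=5, min_count=2)
model.wv.save("cyber_lexicon.kv")  # the vectorized lexicon
```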

3.3.2. Unrelated Vector Screening Module

Genetic algorithms [24] are optimization methods derived from biological evolution theory, simulating natural selection and genetic mechanisms to search and optimize the solution space of problems. The core idea of this algorithm is to establish a population composed of a series of individuals, each representing a potential solution to the problem. Through multiple generations of iteration, experiencing genetic operations such as selection, crossover, and mutation, the population continuously improves, thereby gradually approaching the optimal solution to the problem.
In this study, genetic algorithms are applied to optimize the vector representation used in text analysis. Specifically, the operation begins by randomly generating 200 vectors of 200 dimensions each, forming the initial population. These vectors represent potential solutions, from which the vector best suited to represent the irrelevant word vector (xe) is filtered out. We define a fitness function F(v) to assess each vector’s fitness with respect to the problem. Over multiple iterations, this function guides the evolution of the population, ultimately yielding the optimal vector representation for the subsequent evaluation of text content relevance.
The application of this method has improved the accuracy and efficiency of our algorithm, ensuring the relevance and professionalism of model outputs within specific fields.
$$F(v) = \frac{1}{|V|}\sum_{v_i \in V} \mathrm{Similarity}(v, v_i) \quad (6)$$
$$\mathrm{Similarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|} \quad (7)$$
$$|V| = n \quad (8)$$
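A direct reading of Equations (6)–(8) gives the fitness computation below. Because the stated goal of the screening is a vector whose absolute cosine similarity to every lexicon vector is as small as possible (Section 3.3.3), we assume lower F(v) means fitter and take absolute values; the paper does not state the sign convention explicitly.

```python
import numpy as np

def fitness(v: np.ndarray, lexicon: np.ndarray) -> float:
    """Mean cosine similarity of candidate v to all lexicon vectors
    (Equations (6)-(7)); |V| = n is the lexicon size (Equation (8)).
    Absolute values are our assumption: near-orthogonality is the target."""
    sims = lexicon @ v / (np.linalg.norm(lexicon, axis=1) * np.linalg.norm(v))
    return float(np.mean(np.abs(sims)))
```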
In this paper, each individual in the initially generated random population is first compared with all word vectors in the vectorized lexicon V, and its fitness is calculated using the fitness function F(v). These scores are then sorted, and the 80 individuals with the best fitness are selected as the parents of the next generation.
By combining the 100-dimensional features from both parents, new individuals in the population are generated. This process produces a total of 120 new individuals, who, together with the original parent individuals, form a new population. To increase the diversity of vectors and enhance the likelihood of high-quality target vectors emerging, a 5-dimensional genetic mutation is introduced into the offspring population.
In the implementation of the genetic algorithm, the convergence speed of the algorithm needs to be considered. The number of generations required varies for vectorized vocabularies of different scales and dimensions. In this study, based on practical requirements, 50 generations were chosen as the iteration number for the genetic algorithm, which has been proven to yield relatively good target vector individuals, meeting the needs of the research outcomes.
Through this method, the performance of the algorithm can be effectively improved, ensuring that the generated vectors accurately reflect the relevance of the text content, thereby enhancing the effectiveness of large language models in vertical field applications.
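The selection, crossover, and mutation loop described above might look like the following sketch, using the stated population of 200 vectors of 200 dimensions, 80 parents, 120 offspring from 100-dimension crossover, 5-dimension mutation, and 50 generations. Parent pairing, mutation magnitude, and the "lower score is fitter" convention are our assumptions; `fitness` is reused from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def screen_xe(lexicon: np.ndarray, generations: int = 50) -> np.ndarray:
    """Screen the unrelated vector x_e by genetic search.
    Uses fitness() from the previous sketch; individuals with the lowest
    mean |cosine similarity| to the lexicon are treated as fittest."""
    pop = rng.standard_normal((200, 200))  # initial population: 200 x 200-dim
    for _ in range(generations):
        scores = np.array([fitness(v, lexicon) for v in pop])
        parents = pop[np.argsort(scores)[:80]]       # keep 80 fittest parents
        children = []
        for _ in range(120):                          # 120 offspring per generation
            a, b = parents[rng.choice(80, 2, replace=False)]
            child = np.concatenate([a[:100], b[100:]])  # 100-dim crossover
            idx = rng.choice(200, 5, replace=False)     # 5-dim mutation
            child[idx] += rng.standard_normal(5) * 0.1
            children.append(child)
        pop = np.vstack([parents, np.array(children)])  # new 200-member population
    return pop[np.argmin([fitness(v, lexicon) for v in pop])]
```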

3.3.3. D3 Calculation Module

First, the input text is segmented. The segmentation results are matched against the vocabulary in the professional lexicon, counting the number of successfully matched words (denoted M1) and extracting their corresponding vector representations (Equation (9)). The number of unmatched words is also recorded (denoted M2).
Next, the genetic algorithm described above is applied, with the lexicon α used to evaluate the randomly generated initial population. The goal is to screen out a specific vector, the irrelevant vector xe, whose absolute cosine similarity with every vector in the lexicon is as small as possible.
The vectors of unmatched words are replaced by xe. All obtained word vectors are then normalized and summed to obtain the final input text vector Xinput. The final step calculates the cosine similarity between Xinput and each vector in the lexicon α, and the arithmetic mean of these similarity values gives the content relevance score D3 (Equation (11)).
Through this process, we can quantitatively assess the content relevance of the input text, thereby improving the accuracy and effectiveness of large language models in specific field applications.
$$x_i, \quad i = 1, 2, 3, \ldots, M_1 \quad (9)$$
$$X_{input} = \sum_{i=1}^{M_1} x_i + M_2\,x_e \quad (10)$$
$$D_3 = \frac{1}{n}\sum_{i=1}^{n} \frac{X_{input} \cdot x_i}{\|X_{input}\|\,\|x_i\|} \quad (11)$$
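Combining these steps, Equations (9)–(11) can be implemented as below, where the lexicon is the KeyedVectors object trained in Section 3.3.1 and x_e is the screened unrelated vector. Tokenization and normalization details are our assumptions.

```python
import numpy as np
import jieba

def d3_score(text: str, wv, x_e: np.ndarray) -> float:
    """Content relevance score D3 (Equations (9)-(11)).
    wv: gensim KeyedVectors lexicon; x_e: screened unrelated vector."""
    vecs = []
    for word in jieba.lcut(text):
        v = wv[word] if word in wv else x_e  # unmatched words -> x_e
        vecs.append(v / np.linalg.norm(v))   # normalize before summing
    x_input = np.sum(vecs, axis=0)           # Equation (10)
    lexicon = wv.vectors                     # all n lexicon vectors
    sims = lexicon @ x_input / (
        np.linalg.norm(lexicon, axis=1) * np.linalg.norm(x_input))
    return float(np.mean(sims))              # Equation (11)
```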

4. Experimental Results

In this study, ChatGPT-4, the most advanced Transformer-based large language model at the time of writing, was selected as the primary experimental testing tool. The aim is to evaluate the correlation between the proposed evaluation criteria and generation quality, and the effectiveness of those criteria.
To conduct this experiment, we generated 326 input question texts in the field of cybersecurity based on common cybersecurity issues. These texts were used to verify the effectiveness of the D1, D2, and D3 metrics, as well as the final score S. The specific problem classifications are shown in Table 1. The problems are mainly divided into five major categories: information protection, network defense, application security, user education, and physical security. Corresponding input question texts were generated based on common issues in each domain.

4.1. Evaluation Index V-L

This experiment designed a generation evaluation metric, V-L (Vertical-Length), to more intuitively demonstrate the quality of generation by large language models (LLMs) in vertical fields. For the quality evaluation of generated results for questions in vertical fields, two elements were selected to constitute the details of the V-L metric: the content relevance score of the generated result, Doutput, and the score for the length of the generated content, DL. The specific formula for DL is as follows (Equation (12)), where LMax is the maximum content length among all answers and L is the content length of each answer. After obtaining Doutput and DL, they are combined to form the evaluation score for the V-L metric, SV-L, with the specific formula as follows (Equation (13)).
$$D_L = \frac{L}{L_{Max}} \quad (12)$$
$$S_{V\text{-}L} = \frac{1}{2}(D_{output} + D_L) \quad (13)$$
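A minimal sketch of the V-L computation, assuming D_output is obtained by applying the same d3_score routine from Section 3.3.3 to the model's answer:

```python
def vl_score(answer: str, max_answer_len: int, wv, x_e) -> float:
    """V-L evaluation score (Equations (12)-(13)).
    max_answer_len: length of the longest answer in the batch (L_Max).
    Reuses d3_score() from the earlier sketch for the output relevance."""
    d_l = len(answer) / max_answer_len       # Equation (12)
    d_output = d3_score(answer, wv, x_e)     # relevance of the generated text
    return 0.5 * (d_output + d_l)            # Equation (13)
```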

4.2. Result Analysis

When designing the algorithm, prompts are a crucial factor, especially for applications of large language models in vertical domains, since they directly affect the generated results. For the same input text, there may be multiple identical prompts as well as multiple different prompts. Because this algorithm targets vertical domains where prompts are relatively fixed, the scenario of multiple different prompts is not considered. Experiments were therefore conducted on the case where multiple identical prompts accompany the same problem.
The experiment randomly selected five different problems and, while keeping the problems unchanged, added n identical prompts to each problem (n = 1, n = 2, n = 3). Since the quality of answers generated by the large language model can fluctuate even for the same input text, each input text was tested three times. The values of L and Doutput were averaged before calculating SV-L to eliminate the variability in generation quality. The experimental results are shown in Figure 5 below.
The experiment demonstrates that when the number of prompts is 1, the overall generation quality is better than when the number of prompts is greater than 1. Moreover, as the number of prompts n increases, the generation quality deteriorates. Therefore, when designing the prompt scoring system, it is preferable to consider only the case where the number of prompts is 1.
In this study, for each problem, univariate adjustments were made to the variables D1, D2, and D3 to pose questions to ChatGPT.
D1 Effectiveness Validation Experiment: For the same problem, questions were posed to ChatGPT both with and without added prompts. Each state was tested three times. The values of L and Doutput obtained were averaged, and then SV-L was calculated to validate the effectiveness of D1. The scores from the 326 comparison groups were averaged separately to obtain the mean scores for D1 = 1 and D1 = 0. These results are recorded in Figure 6 below.
D2 Effectiveness Validation Experiment: Under the condition of D1 = 0, each problem was rewritten by adding logical connectors, increasing sentence length, and incorporating more cybersecurity domain vocabulary. Ensuring that the D3 scores remain essentially the same, the D2 scores were increased, and questions were posed to ChatGPT. Both the original and modified input texts were tested three times each, and the average values of L and Doutput were calculated to determine SV-L. The scores from the 326 comparison groups were averaged, and the average D2 scores were also calculated. These results are recorded in Figure 6 below.
D3 Effectiveness Validation Experiment: Under the condition of D1 = 0, each problem was rewritten by changing colloquial descriptions to professional expressions and replacing replaceable words with cybersecurity-related vocabulary. Ensuring that the D2 score error remains minimal, the D3 scores were increased, and questions were posed to ChatGPT. Both the original and modified input texts were tested three times each, and the average values of L and Doutput were calculated to determine SV-L. The scores from the 326 comparison groups were averaged, and the average D3 scores were also calculated. These results are recorded in Figure 6 below.
Detailed data are shown in Figure 6. The results show that as the scores of D1, D2, and D3 increase, the SV-L score exhibits a growing trend. This finding indicates that as the scores of these variables improve, the quality of the generated answers correspondingly enhances.
Based on the aforementioned experimental framework, tests examined how the evaluation score S, from low to high, affects the quality of answers generated by Transformer-based large models. The results indicate that as the S score increases, the answers generated by the large model become progressively more detailed and professional. This confirms the effectiveness of the evaluation score S in guiding and predicting the output quality of large language models. Questions with higher S scores, owing to optimized prompt words, structuring, and professional content, enable the model to extract and process information more effectively, thereby generating more precise and in-depth answers.
Therefore, these experimental results not only validate the VFS (Vertical Field Score) input evaluation method but also provide valuable insights into how to utilize generative AI large models effectively.
Within this research framework, a series of continuous experiments was conducted on a total of 326 questions. During the experiment, the VFS of each input question was systematically improved, and the corresponding V-L (Vertical-Length) scores were recorded. Each score is the average of three independent experiments.
To more intuitively demonstrate the correlation between the quality of input questions and the quality of generated content, this study randomly selected 10 samples from these questions and subsequently plotted a graph illustrating the relationship between input quality and generation quality.
As shown in Figure 7, for the same question, the quality of the generated results improves significantly as the VFS increases. Although there are certain limits to improving the input quality for a specific question, the experimental data show a clear trend: higher VFS scores are positively correlated with better generation quality. Notably, when the VFS score begins to rise from a low level, the improvement in generation quality (i.e., the V-L score) is relatively rapid. As the score increases further, the rate of improvement gradually diminishes, indicating a decreasing marginal effect of score improvement on generation quality.
As shown in Figure 8, Figure 9 and Figure 10, with the gradual increase in the Vertical Field Score (VFS), it is observed that the quality and length of ChatGPT’s generated responses for the same question exhibit a corresponding upward trend.
In Figure 11, the generated response is displayed when the input question is modified to S = 0.51. Compared to Figure 8, Figure 9 and Figure 10, the input problem in Figure 11 is more complete, featuring clearer descriptions of the details. Each sentence includes a logical connective to articulate the relationships between different sentences. Additionally, there is an enhanced use of specialized vocabulary.
It is evident that, compared to Figure 8, Figure 9 and Figure 10, ChatGPT’s response in Figure 11 shows a significant improvement in both length and professionalism.
When adjusting an input question, the first step is to eliminate colloquial expressions and describe the question more clearly. Next, the logical relationships between sentences are strengthened by adding a logical connector to each sentence, indicating the connections between different sentences. Finally, the problem is described in a more specialized manner by replacing general vocabulary with professional terms wherever possible. Adding a prompt word increases both the length and professionalism of the output text, focusing the large language model’s responses on the relevant vertical domain. Enhancing the logical coherence of the input text also makes the generated results more specific and significantly increases the length of the output text. More specialized vocabulary makes the responses more professional and targeted, providing more detailed solutions to the problems and potentially offering additional ones.
Additionally, in practical applications, enhancing the professionalism of the input question greatly tests the user’s level of expertise. Therefore, for non-professional users in vertical domains, replacing general vocabulary with specialized terms can be challenging. In such cases, the use of logical connectors and prompts becomes crucial. However, both the use of prompts and the improvement of the input text’s logical coherence significantly enhance the quality of the final generated results.

5. Conclusions

This study verified through experiments the significant impact of the three set scoring indicators D1, D2, and D3 on the generation effect of the large language model. Experiments have shown that as the scores of these three indicators increase, the quality of the answers generated by the large language model improves accordingly. A particularly exciting finding is that as the comprehensive evaluation score S improves, there is a significant optimization in the generation results. This outcome suggests that the input text quality evaluation criteria designed in this paper are effective for assessing the generation effects of large language models in the field of cybersecurity, and this criterion has the potential to be applied to other vertical fields, such as medicine and law. The key difference in applying it across various fields lies in training the corresponding D3 indicator model for the specialized vocabulary of each field.
The study of input text quality is of significant importance for improving the generative effects of large language models. While the Transformer architecture can comprehensively extract input information, variations in the way questions are posed in professional, vertical fields can lead to significant differences in the effectiveness and depth of generated results. This research, based on the three indicators proposed that affect generation quality, has constructed an input text quality standard S, which effectively evaluates input quality and thereby supports the application of large language models in fields such as cybersecurity, filling a gap in research on the quality of inputs for large language models. Additionally, this paper primarily focuses on the vertical field of cybersecurity, conducting validation experiments based on a cybersecurity knowledge base. In future research, knowledge from other fields will be gradually established to extend the applicability of this algorithm.
Future research directions and focuses will include:
  • Building more structured input models on the existing foundation.
  • Continuously optimizing the details of the indicators so that S reflects input quality more accurately.
  • Applying the method within specialized large language models for specific vertical fields.
  • Building knowledge bases in other fields to promote the general applicability of this algorithm.
  • Conducting a more detailed analysis of input texts in the cybersecurity field, including examination of the hardware information and related parameters required for specific issues, and integrating advanced algorithms to enhance the analysis results.

Author Contributions

Conceptualization, J.-K.Y.; Methodology, J.-K.Y.; Software, Y.-F.Y.; Validation, Y.-F.Y.; Formal analysis, Y.-F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the National Key R&D Program of China under Grant No. 2024QY1703 and National Natural Science Foundation of China under Grant No. U1636208.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ni, J.; Young, T.; Pandelea, V.; Xue, F.; Cambria, E. Recent advances in deep learning based dialogue systems: A systematic survey. Artif. Intell. Rev. 2023, 56, 3055–3155.
  2. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132.
  3. Johnson, R.; Zhang, T. Supervised and semi-supervised text categorization using LSTM for region embeddings. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 526–534.
  4. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26.
  5. Alshemali, B.; Kalita, J. Improving the Reliability of Deep Neural Networks in NLP: A Review. Knowl.-Based Syst. 2020, 191, 105210.
  6. Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338.
  7. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A critical review of recurrent neural networks for sequence learning. arXiv 2015, arXiv:1506.00019.
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  9. Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-based Models for NLP Tasks. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; IEEE: New York, NY, USA, 2020; pp. 179–183.
  10. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; p. 2359.
  11. Ganesan, A.V.; Matero, M.; Ravula, A.R.; Vu, H.; Schwartz, H.A. Empirical evaluation of pre-trained transformers for human-level NLP: The role of sample size and dimensionality. Proc. Conf. Assoc. Comput. Linguist. N. Am. Chapter Meet. 2021, 2021, 4515.
  12. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Wang, G. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615.
  13. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Lowe, R. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
  14. Dong, X.; Zhang, C.; Ge, Y.; Mao, Y.; Gao, Y.; Lin, J.; Lou, D. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv 2023, arXiv:2307.07306.
  15. Xiao, L.; Chen, X. Enhancing llm with evolutionary fine tuning for news summary generation. arXiv 2023, arXiv:2307.02839.
  16. Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv 2023, arXiv:2303.04671.
  17. Dong, Y.; Jiang, X.; Jin, Z.; Li, G. Self-collaboration Code Generation via ChatGPT. arXiv 2023, arXiv:2304.07590.
  18. Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Han, W. Zero-shot information extraction via chatting with chatgpt. arXiv 2023, arXiv:2302.10205.
  19. Azaria, A. ChatGPT Usage and Limitations; Ministère de L’enseignement Supérieur et de la Recherche: Paris, France, 2022.
  20. Yu, Y.J.; Liu, Q. N-gram Chinese Characters Counting for Huge Text Corpora. Comput. Sci. 2014, 41, 263–268.
  21. Xiong, H.; Wang, S.; Zhu, Y.; Zhao, Z.; Liu, Y.; Huang, L.; Shen, D. Doctorglm: Fine-tuning your Chinese doctor is not a herculean task. arXiv 2023, arXiv:2304.01097.
  22. Cui, J.; Li, Z.; Yan, Y.; Chen, B.; Yuan, L. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv 2023, arXiv:2306.16092.
  23. Li, X.; Xie, H.; Li, L.J. Research on Sentence Semantic Similarity Calculation Based on Word2vec. Comput. Sci. 2017, 44, 256–260.
  24. Katoch, S.; Chauhan, S.S.; Kumar, V. A review on genetic algorithm: Past, present, and future. Multimed. Tools Appl. 2021, 80, 8091–8126.
Figure 1. Text structure scoring flowchart.
Figure 2. Content relevance scoring flowchart.
Figure 3. CBOW model.
Figure 4. Skip-gram model.
Figure 5. The effect of the number of prompt words.
Figure 6. Validity proof of D1, D2, and D3.
Figure 7. Input and output quality relationship.
Figure 8. Question (S = 0.19).
Figure 9. Question (S = 0.25).
Figure 10. Question (S = 0.35).
Figure 11. Question (S = 0.51).
Table 1. Types of cybersecurity-related questions.

Question Type | Quantity | Content
Information protection | 72 | Data tampering, theft, destruction, information leakage, unauthorized access, etc.
Network defense | 103 | DDoS attacks, firewall vulnerabilities, protocol vulnerabilities, IPS configuration issues, etc.
Application security | 66 | SQL injection, cross-site scripting (XSS), MFA anomalies, etc.
User education | 32 | Phishing emails, malicious links, scam messages, etc.
Physical security | 53 | Server failures, data center protection, computer hardware failures, etc.
