1. Introduction
The COVID-19 epidemic, which emerged in late 2019, severely shocked global society, economy, and public health [
1]. The World Health Organization (WHO) declared it a public health emergency of international concern in January 2020 [
2] and in May 2023 declared the official end of the global emergency phase [
3].
However, the end of the COVID-19 emergency does not eliminate the global threat posed by infectious diseases [
4,
5,
6]. Ongoing COVID-19 cases, influenza outbreaks, and the emergence of new pathogens suggest that future pandemics remain possible. Variants or mutations in existing viruses could again trigger global health crises, and the public should be able to access high-quality health information conveniently when they do. Therefore, timely access to accurate health information remains vital for early warning and public preparedness.
In response to potential future pandemics, enhancing public awareness and access to health information is essential.
While traditional search engines have long served as a primary source of health knowledge, generative artificial intelligence (AI), such as ChatGPT, which was released in November 2022, has emerged as a new tool for information retrieval [
7]. Unlike search engines that retrieve documents from existing databases, generative AI produces novel responses based on patterns learned from large-scale datasets, which may enhance fluency but introduces risks of factual inaccuracy. Compared to conventional methods, generative AI is designed to deliver automated, personalized, and seemingly accurate responses, offering users a more interactive and natural communication experience [
8].
Generative artificial intelligence technology, with its distinctive conversational models, has significantly changed how people engage with knowledge and has become a novel channel for disseminating information about infectious diseases [
9]. These technologies have been increasingly applied in medicine [
10]. Nevertheless, despite their growing adoption, concerns remain regarding the reliability of AI-generated content, particularly in high-stakes areas like healthcare, where issues related to ethics, factual accuracy, and readability are especially salient. The accuracy and readability of medical responses from generative AI have therefore become core concerns [
11].
However, the application of generative AI in the dissemination of health knowledge also brings ethical challenges, including the risk of generating misinformation, unequal access to generative AI, and the expanding reach of these tools. At the same time, the public’s ability to understand AI-generated content may vary. Therefore, in order to guide the general public more effectively and correctly in using generative AI technology to prevent and control major infectious diseases, it is necessary to improve the public’s health literacy through education, especially for vulnerable groups in society.
At present, both in China and internationally, research into generative artificial intelligence technologies in the medical domain has mainly focused on evaluating the accuracy of ChatGPT in medical exams and clinical question-answering tasks [
12,
13,
14]. Its application in both basic and clinical medicine has been tested [
15], such as performance evaluation in Polish medical exams [
16] and assessments of knowledge on bacterial infection [
17]. Notably, the reliability and readability of ChatGPT-4 in addressing hypothyroidism during pregnancy have also been studied [
18], demonstrating that its responses were moderately reliable and readable at a university level. Other studies have explored ChatGPT’s handling of myopia [
19], its comparison with Google Bard in text generation [
20], and the differences between ChatGPT and Internet search engines in responding to patient queries [
21].
In summary, existing research rarely evaluates the accuracy and readability of generative AI responses related to infectious diseases specifically, particularly in Chinese contexts. However, with more than 1 billion Internet users in China and around 230 million daily users querying health topics via generative AI platforms, there is an urgent need for reliable information on disease transmission. This study addresses that gap by comparing the performance of four major generative AI platforms (including Chinese models) in providing infectious disease-related information. The findings aim to guide medical professionals, AI developers, and the general public in selecting appropriate tools for health communication and decision making.
2. Materials and Methods
This paper compares generative artificial intelligence models from four mainstream platforms, two Chinese and two international. The four models are compared in terms of accuracy, readability, comprehensibility, the factors influencing readability, and the topics of the generated text. As research material, we selected the Q&A text of the CDC’s COVID-19 control guide as the standard answer, and the responses provided by the four generative artificial intelligence models were compared against this standard. The domestic generative AI models tested were Kimi and Ernie Bot, and the foreign models were ChatGPT and Gemini.
2.1. Research Materials
This article addresses infectious diseases. It is based on the U.S. Centers for Disease Control and Prevention (CDC), and the corresponding research text is the CDC’s COVID-19 prevention and control guide from August 2020 [22]. This can be downloaded at the following URL:
https://stacks.CDC.gov/view/CDC/89817 (accessed on 1 February 2025). The guide contains 53 questions and answers, which cover the following areas [22]: (1) COVID-19 infection risk and control (six questions); (2) COVID-19 spread and prevention (seven questions); (3) detection and diagnosis (six questions); (4) the clinical management of COVID-19 (eight questions); (5) the management of special populations (seven questions); (6) the treatment of and prevention measures for COVID-19 (six questions); (7) prevention and vaccination (six questions); (8) other relevant questions (seven questions). The guide is aimed at both general medical staff and the scientific community at large in order to provide an authoritative resource for the prevention and control of outbreaks.
2.2. Research Methods
In order to compare the properties of the text generated by four generative artificial intelligence models from China and abroad, this paper examines several relevant indicators. By comparing the text generated by the four models with the standard text provided by the CDC, we compared accuracy, readability, comprehensibility, the factors influencing readability, and the topics of the generated text. Through a comprehensive evaluation of how the domestic and international models respond to infectious disease questions, this study provides a reference guide for medical workers and the public seeking knowledge about infectious diseases.
To evaluate the accuracy and readability of the four AI models’ answers to the CDC COVID-19 prevention and control guidelines, we input 53 English questions to each model and extracted the English answer text generated by each model. English was chosen because the COVID-19 prevention and control guidelines are expressed in English. All questions were manually entered one by one by the researchers, and the full text answers generated by each model were recorded. No follow-up questions or supplementary interactions were conducted throughout the test to ensure that the model’s initial natural output capabilities were obtained.
The variable metrics used for model performance comparison are as follows: SimHash, Flesch–Kincaid grade level (FKGL), Flesch Reading Ease Score (FRES), reading level (RL), average words per sentence (AWPS), average syllables per word (ASPW), and sentences and words, where SimHash stands for text similarity, i.e., accuracy. The Flesch–Kincaid grade level (FKGL) indicates the U.S. school grade corresponding to the reading difficulty of the text. The Flesch Reading Ease Score (FRES) indicates how intelligible the text is. Reading level (RL) indicates the minimum education level, in terms of reading, required for a text. Average words per sentence (AWPS) indicates the average number of words per sentence. Average syllables per word (ASPW) indicates the average number of syllables per word. The “sentences” variable indicates the total number of sentences the text contains and is used to calculate the average sentence length. The “words” variable indicates the total number of words. The definitions and instructions for the use of each indicator are as follows.
- (1)
Comparison of text accuracy
Many algorithms exist for comparing the similarity of two texts; in this study, the SimHash index is used to describe text accuracy. The SimHash algorithm is a hash-function-based method for detecting document similarity [23]. It computes a fingerprint for each text so that the similarity of two texts can be measured. The SimHash algorithm is suitable for text comparison, data classification, and similar tasks, and it ensures the validity and accuracy of the comparison by filtering and optimizing the calculation strategy [24]. The similarity of two texts is represented by the numerical value produced by the SimHash algorithm; the greater this value, the more similar, and hence the more accurate, the texts are.
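To make the computation concrete, the sketch below illustrates one common way a SimHash similarity score can be obtained. It is a minimal illustration only; the online calculator used in this study may tokenize, weight, or size the fingerprint differently, and the example sentences are hypothetical.

```python
import hashlib
import re

def simhash(text, bits=64):
    # Break the text into lowercase word tokens
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def simhash_similarity(a, b, bits=64):
    # Similarity = 1 - normalized Hamming distance between the two fingerprints
    hamming = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1 - hamming / bits

# Illustrative comparison of a reference answer and a model answer
cdc_answer = "Wash your hands often with soap and water for at least 20 seconds."
model_answer = "You should wash your hands frequently with soap and water for 20 seconds or more."
print(round(simhash_similarity(cdc_answer, model_answer), 3))
```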
- (2)
Textual legibility comparison
Text readability is evaluated with the FRES test, which measures the extent to which a text is easy to read [9]. The text readability index used is the Flesch Reading Ease Score (FRES). The Flesch reading index is a measure of the readability of English text proposed by Rudolf Flesch in 1948 [25]. It scores texts on a scale of 0 to 100: the higher the score, the easier the text is to read, and the lower the score, the more difficult it is to read.
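For reference, the standard Flesch formula can be expressed in terms of the two structural quantities already defined above, AWPS (average words per sentence) and ASPW (average syllables per word):

FRES = 206.835 - (1.015 × AWPS) - (84.6 × ASPW)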
- (3)
Textual comprehensibility comparison
The FKGL index is used in our study to measure how easily a text can be understood; it reflects the degree of comprehension a text requires [9]. FKGL stands for Flesch–Kincaid grade level. This indicator is one of the important indicators of text comprehensibility, especially in areas such as medical care and education. The index was first proposed by Rudolf Flesch in 1948 and later revised by J. Peter Kincaid. The score represents a reading level within the U.S. education system [26].
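For reference, the standard Flesch–Kincaid formula uses the same two structural quantities as the FRES but maps them onto a U.S. school grade:

FKGL = (0.39 × AWPS) + (11.8 × ASPW) - 15.59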
- (4)
Analysis of factors influencing text readability
Text readability is also evaluated with the reading level (RL) indicator [27]. RL indicates the minimum grade of education required to read the text in question. In this study, we use a neural network model, one type of intelligent algorithm, to analyze the key factors affecting text readability. Following the requirements of neural network model construction, the input layer indicators of the model are the AWPS, ASPW, sentence, word, and SimHash indicators, and the output layer indicator is the RL indicator. This paper identifies the key factors affecting text readability through multilayer perceptron mining (the multilayer perceptron is one of many neural network algorithms).
- (5)
The semantic comparison of text content
This paper studies the word frequency and the semantics of the word frequencies of each model’s text through text analysis [28]. This is the most basic and common analysis process in natural language understanding, and it mainly consists of two parts. The first is word frequency statistics, and the second is topic mining. Word frequency statistics, the most basic part of text analysis, identify the most common words in a text in order to understand its main content or keywords. Topic mining combines high-frequency words with their context to induce topics. This method is well suited to the analysis of the words in the texts used in our study.
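As a minimal sketch of the word frequency step (topic induction itself was performed manually), using an illustrative snippet of text rather than the study’s actual answer corpus:

```python
from collections import Counter
import re

# Illustrative text standing in for one model's concatenated answers
text = ("COVID-19 spreads between people who are in close contact. "
        "Get tested if you have symptoms of COVID-19 and follow testing advice.").lower()

# Simple stop-word list; the actual study may have filtered terms differently
stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
             "is", "are", "be", "if", "you", "have", "who", "between"}
tokens = [t for t in re.findall(r"[a-z0-9\-]+", text) if t not in stopwords]

# The most frequent content words, later used for topic induction
for word, count in Counter(tokens).most_common(10):
    print(word, count)
```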
- (6)
Manual scoring and comparison
In order to verify the accuracy of machine scoring, this article introduces a manual scoring mechanism. The specific method is to first select some questions and answers from the CDC prevention and control guidelines and then collect manual subjective scoring results through an online questionnaire method. Then, the manual scoring is compared with the machine scoring to verify the accuracy of the machine scoring.
2.3. Statistical Method
This paper is a quantitative analysis study, and a different calculation method is used for each index. The text accuracy index (SimHash) was calculated with the following online computing platform:
https://kiwirafe.pythonanywhere.com/app/xiangsi/ (accessed on 1 March 2025). The text readability and language size metrics (FKGL, FRES, RL, AWPS, ASPW, sentences, words) were calculated with the following online computing platform:
https://goodcalculators.com/flesch-kincaid-calculator/?utm_source=chatgpt.com (accessed on 1 March 2025). This study used the neural model analysis module in SPSS 27 to compute the multilayer perceptron and to carry out the statistical analysis of the full-text data. We used Python 3.9 to compute the word frequency statistics of the text.
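For reproducibility, the same readability battery can also be computed locally in Python; the sketch below assumes the third-party textstat package, which was not part of the original workflow and may round or count syllables slightly differently from the online calculator.

```python
import textstat  # third-party package: pip install textstat

answer = ("COVID-19 spreads mainly through respiratory droplets produced "
          "when an infected person coughs, sneezes, or talks.")

words = textstat.lexicon_count(answer, removepunct=True)
sentences = textstat.sentence_count(answer)
syllables = textstat.syllable_count(answer)

metrics = {
    "FRES": textstat.flesch_reading_ease(answer),
    "FKGL": textstat.flesch_kincaid_grade(answer),
    "words": words,
    "sentences": sentences,
    "AWPS": words / sentences,   # average words per sentence
    "ASPW": syllables / words,   # average syllables per word
}
print(metrics)
```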
3. Results
3.1. Comparison of Text Accuracy
In this study, the SimHash similarity data were used as the accuracy index. In order to compare the accuracy of the four models in answering the 53 CDC questions, this study calculated the SimHash accuracy for each answer, displayed the calculated data as a box plot, and measured the consistency of the four models in answering the same questions by analyzing the box plot. The calculation results are shown in
Figure 1.
- (1)
The consistency of the text
Consistency is indicated by the mean SimHash similarity score. From the calculation results, Kimi scored the highest (0.646), indicating that Kimi’s answers were the most consistent with the reference text. ChatGPT’s score (0.63) was close to that of Ernie Bot (0.627), with both slightly below Kimi. Gemini (0.59) had the lowest average score, indicating that its answers were the least consistent of the four models.
- (2)
Stability comparison
The stability of accuracy across questions is measured by the standard deviation. ChatGPT’s standard deviation is the lowest (0.074), indicating that it is the most stable and that its accuracy changes little across different questions. Ernie Bot’s standard deviation (0.082) is the highest, indicating that the accuracy of its answers fluctuates more and that the quality of its text is less stable. The volatility of the Kimi and Gemini models is similar (0.079 and 0.077, respectively), indicating that their responses demonstrated comparable levels of accuracy stability.
- (3)
Extreme values
The extreme values are compared based on the distribution of the data points. The maximum and minimum values of the four models are as follows: ChatGPT [0.45, 0.75], Gemini [0.44, 0.77], Kimi [0.41, 0.84], and Ernie Bot [0.44, 0.80]. The corresponding ranges (i.e., the difference between the maximum and minimum values) are as follows: ChatGPT (0.30), Gemini (0.33), Kimi (0.43), and Ernie Bot (0.36). Kimi has the largest range, indicating that this model is the most likely to produce markedly abnormal outputs in response to certain questions. Among the other three models, the ranges from largest to smallest are Ernie Bot, Gemini, and ChatGPT.
On the whole, Kimi has the highest accuracy in outputting text, followed by ChatGPT and Ernie Bot, and Gemini is slightly lower. ChatGPT has the best output stability, followed by Kimi, Gemini, and Ernie Bot. After calculation, it can be seen that Kimi has the largest range, indicating greater fluctuation in its responses to certain questions. The ranges of the other three models are relatively close, indicating that their output fluctuations are generally similar and smaller than Kimi.
3.2. Text Readability Comparison
The text readability index uses the Flesch Reading Ease Score. The FRES index of the four models is shown in
Figure 2.
The following is a comparison of Chinese and foreign models in terms of text consistency, stability, and extreme values.
- (1)
Comparison of text consistency
The ChatGPT score (26.59) distribution is the most concentrated, indicating that its text demonstrates high consistency. The Gemini score (18.59) distribution is also relatively concentrated, but its text demonstrates less consistency than that of ChatGPT. The distribution of the domestic model Kimi (28.93) is more diffuse, indicating that its text is less consistent than that of the previous two. The score distribution of Ernie Bot (26.85) is also fragmented, with a median of about 35, indicating that its text is the least consistent.
- (2)
Stability comparison
The calculation results show that the standard deviation of ChatGPT (7.31) is the lowest among the four models, indicating that the readability of its generated text is the most stable. The standard deviation of Ernie Bot (10.13), ranking second, indicates high stability. The standard deviation of Gemini is 10.78, and the standard deviation of Kimi is 11.37. Both have high degrees of variation, indicating poor stability.
- (3)
Extreme comparison
The maximum and minimum values of the four models are as follows: ChatGPT [8.20,42.10], Gemini [0.00,54.20], Kimi [0.00,55.70], and Ernie Bot [3.40,58.00]. ChatGPT does not have obvious extreme values, which indicates that the distribution of the scores is more uniform. Gemini has some extreme values but not many of them, indicating some anomalies in its score distribution. The domestic model Kimi has several extreme values, indicating that there are some anomalies in the distribution of the scores. Ernie Bot has multiple extreme values, indicating that there are many exceptions in the distribution of the scores.
To sum up, the performance of ChatGPT is the best in the three aspects of text consistency, stability, and extreme value, followed by Gemini, Kimi, and Ernie Bot.
3.3. Textual Comprehensibility Comparison
Textual comprehensibility is measured with the FKGL index. The Flesch–Kincaid grade level (FKGL) is one of the indicators of text readability and corresponds to the U.S. school grade required to read the text.
Figure 3 shows the FKGL index box diagram for the four large models.
Based on the box plot above, the four models are compared below in terms of text complexity, output stability, and extreme values.
- (1)
Comparison of text complexity
After calculation, the mean FKGL scores of the four models are as follows: ChatGPT (13.08), Gemini (13.41), Kimi (13.70), and Ernie Bot (13.55). ChatGPT has the lowest average grade score, indicating that its generated text is easier to read, while Kimi has the highest grade score, indicating that its text structure is more complex.
- (2)
Analysis of output stability
The calculated standard deviations of each model (values in brackets) are as follows: ChatGPT (1.33), Gemini (1.66), Kimi (2.35), and Ernie Bot (1.77). ChatGPT has the highest stability, while Kimi has the largest standard deviation, indicating that its complexity varies greatly.
- (3)
Extreme comparison
From the calculation results, we can see that the maximum and minimum ranges of each model are ChatGPT [10.40, 15.20], Gemini [7.90, 16.40], Kimi [8.70, 20.80], and Ernie Bot [8.50, 18.70]. Kimi has the widest score range from 8.70 to 20.80, indicating that its output consistency is the worst. In contrast, the maximum and minimum values of ChatGPT are relatively centered, with small fluctuations and stable performance.
In general, ChatGPT has the best output stability of the four models in English, and the Gemini model is intermediate. The domestic model Kimi has acceptable readability but poor stability, and the Ernie Bot model also has notable extreme values, indicating that its output text is volatile as well.
3.4. Text Readability Influence Factor Analysis
This section uses a neural network model, one type of intelligent algorithm, to analyze the factors influencing the readability of the text generated by the four models and to calculate the relative importance of each input variable. The preceding sections compared the readability of the four models’ text (that is, readability and textual comprehensibility); this section quantifies the factors that drive those readability scores. Few readability studies have examined the words of a text, yet words are the basic unit of a text, and their structure has a direct effect on readability, so studying words and word structure is of practical significance for making text easy to read. Such analysis helps explain the readability of a text from its internal organization. On this basis, this section uses a multilayer perceptron (one of many neural network algorithms) to identify the factors affecting the readability of the generated text and compares the four models. The text computing software used above generates the FRES and FKGL indicators and, by the same calculation, also provides the remaining indicators [9]: the AWPS, ASPW, RL, word, and sentence indicators.
3.4.1. Neural Network Algorithm Construction
This paper uses a multilayer perceptron (MLP), one of many neural network algorithms, as the research tool for analyzing the factors that influence text readability. An MLP is a typical feedforward neural network [29] and consists mainly of an input layer, hidden layers, and an output layer. It learns feature representations of the data through a fully connected (FC) structure and nonlinear activation functions, and it is characterized by full connectivity, nonlinear activation, error backpropagation, and multilayer feature extraction [30]. MLPs can be applied to text classification. The specific structure of the neural network model is shown in Appendix A. Five input layer variables (X) are used in our study: AWPS, ASPW, sentences, words, and SimHash. The output layer variable (Y) is RL, the reading level. ASPW indicates the average number of syllables per word, and AWPS indicates the average sentence length in words.
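The variable importance analysis in this study was run in SPSS 27 (Section 3.4.2); the sketch below is only a rough open-source equivalent, using scikit-learn’s MLP regressor with permutation importance as a stand-in for the importance values that SPSS reports, and hypothetical placeholder data instead of the study’s per-question indicators.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

features = ["AWPS", "ASPW", "sentences", "words", "SimHash"]

# Hypothetical indicator matrix for one model's 53 answers (placeholder values)
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 5))
y = 10 + 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=53)  # RL, the output variable

X_scaled = StandardScaler().fit_transform(X)
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X_scaled, y)

# Relative importance of each input variable for predicting RL
result = permutation_importance(mlp, X_scaled, y, n_repeats=30, random_state=0)
for name, score in sorted(zip(features, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```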
3.4.2. Influence Factor Analysis
Building on the above, a neural network model is trained to predict the output variable; its influence paths are obtained, the relative importance of the input variables for the output variable is analyzed, and objective reference data are thus provided for the quantitative comparison of the models [31]. In this study, we used SPSS 27 software to conduct the neural network calculations. The neural network model used in this article is a multilayer perceptron (MLP) model. After the model runs, the software directly outputs the importance of each input variable, which indicates its relative importance. The variable importance bar diagram corresponding to each model is shown in
Figure 4.
As can be seen from
Figure 4, the readability of the text generated by the ChatGPT model is most affected by ASPW, followed by AWPS, while the variable SimHash has lesser importance, indicating that the readability of the generated text is less affected by the accuracy of the text. The readability of the text generated by the Gemini model is most affected by ASPW, while the SimHash effect is slightly higher than that of ChatGPT, indicating that the text generated by Gemini is more strongly affected by the accuracy of the text. Among the domestic models, the readability of Kimi-generated text is most affected by ASPW, which remains the main influencing factor. AWPS is the most important factor for the readability of Ernie Bot’s generated text, followed by the sentence factor, and it can be seen from the graph that the variable AWPS is much more important than the variable sentence. Other variables have a lesser impact.
In conclusion, the language structure indicators, sentence length (AWPS) and syllables per word (ASPW), are the most important factors affecting the readability of the generated text across the models. This shows that the structure of sentences and words is the most critical driver of generated text readability. The SimHash factor has different degrees of influence in different models. The readability of the text generated by the foreign models ChatGPT and Gemini is less affected by SimHash, indicating that the foreign models show a higher degree of innovation in generating text and that their generated text has little correlation with the CDC standard text. However, the domestic models Kimi and Ernie Bot are more sensitive to SimHash, indicating that the text generated by the domestic models is more similar to the CDC standard text, which also suggests that text accuracy affects the readability of the text.
3.5. Text Content Semantic Comparison
Text analysis is the process of extracting useful information from text by means of mathematical statistics or related algorithms [32]. In line with the purpose of this paper and the characteristics of the answer texts, lexical analysis is used to extract the key words. Then, the theme of each text is induced from its high-frequency words so that the core idea of each text can be understood. Several kinds of text topic-mining algorithms exist, but the amount of text in the present study is small; we therefore use an inductive summary method suited to the characteristics of the research text [33].
3.5.1. Lexical Frequency Statistics
Based on the word frequency data of the COVID-19 text generated by the four large models, word frequency statistics were obtained. The bar chart of the top 10 words and the word frequency ranking chart are shown in
Figure 5. This paper analyzes the following aspects.
By observing
Figure 5, we can see that “COVID-19” was the most frequently used word and that “test” also appears with high frequency in all four big models, indicating that the models pay the most attention to COVID-19 and testing. ChatGPT had the highest frequency of use of “COVID-19” (556 times) and Gemini had the lowest (202 times), suggesting that ChatGPT used the term most intensively, while Gemini may have more frequently used alternative expressions. The high frequency of “Risk” and “test” in the models shows that risk assessment and testing are key topics in the models’ responses. The frequency of a word can indicate its importance in the text.
3.5.2. Topic Mining
In this paper, the topics of the text generated by the various models are summarized by statistical induction. A word frequency table of the text generated by the four models is shown in
Table 1.
Statistical induction yields the research themes of the four models. Through the analysis of
Table 1, the following conclusions can be drawn: among the top 10 word frequency rankings, the words “Patient”, “Infect”, and “Vaccine” appear more in the text generated by ChatGPT, indicating that the model pays more attention to epidemiological knowledge. The words “Medical”, “Healthcare”, and “Advice” appear more frequently in the text generated by the model Gemini, indicating that the field of medical care is more emphasized in the text generated by that model. The words “Risk”, “Symptom”, and “Recommend” appear more often in the domestic model Kimi, indicating that the text generated by that model pays more attention to multidisciplinary background information or early risk prevention. The words “Patient”, “Infect”, and “Healthcare” appear more often in the text generated by the model Ernie Bot, indicating that that model pays more attention to medical clinical topics. In conclusion, the domestic models (Kimi, Ernie Bot) are more suitable for clinical testing, medical system research, health consultation, etc. The foreign models (ChatGPT, Gemini) were more focused on epidemiological analysis, vaccine research, and disease severity assessment. Generally speaking, the domestic model is more applicable, and the foreign model is more professional.
In order to clearly and intuitively show and compare the differences in word frequency features of different models (ChatGPT, Gemini, Kimi, and Ernie Bot) when generating text, the mean value, standard deviation, and
p-value of the word frequency of each model when compared with ChatGPT are calculated according to
Table 1. The calculation results are shown in
Table 2 below.
From
Table 2, we can see the word frequency analysis and significance tests for the different models. Word frequency comparison: the mean word frequency, arranged from high to low, is ChatGPT, Kimi, Ernie Bot, and Gemini. Standard deviation comparison: The standard deviation reflects the degree of dispersion of the word frequency data, that is, the fluctuation range of the word frequencies. ChatGPT (117.46) shows a large difference between high-frequency and low-frequency words. Gemini (43.52) shows that its word frequency data are more concentrated, with small fluctuations. Kimi (51.19) shows that its word frequency distribution is also relatively concentrated. Ernie Bot (62.49) shows that its word frequency distribution has some fluctuation but is relatively concentrated overall.
p-value (vs. ChatGPT) comparison: The
p-value is used for statistical testing to determine whether there is a significant difference in the word frequency distribution of the two models. The ChatGPT
p-value is “-”, indicating that it is a benchmark model. Gemini (0.0006), Kimi (0.006), and Ernie Bot (0.003) are all less than 0.05 (the usual significance level), indicating that the word frequency distribution of Gemini, Kimi, and Ernie Bot is significantly different from that of ChatGPT.
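The exact statistical test behind the p-values in Table 2 is not specified in the text; the sketch below assumes a paired t-test over the shared top-10 word frequencies of two models, and the frequency lists shown are illustrative placeholders rather than the study’s data.

```python
from scipy.stats import ttest_rel

# Illustrative top-10 word frequencies for two models (placeholder values)
chatgpt_freqs = [556, 310, 250, 190, 150, 120, 110, 95, 80, 70]
gemini_freqs = [202, 150, 140, 120, 100, 90, 85, 80, 70, 60]

# Paired t-test: are the two frequency profiles significantly different?
t_stat, p_value = ttest_rel(chatgpt_freqs, gemini_freqs)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```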
3.6. Human Validation of Model Responses
In the preceding sections, all evaluation indicators were calculated using computer software based on the corresponding formulas, representing objective machine-generated scores. However, the purpose of promoting knowledge on the prevention and control of COVID-19 is to enable the public to understand and master specific measures. Therefore, in addition to machine evaluation, human subjective assessment is required to verify whether the AI-generated text actually stands up to public scrutiny. The human evaluation includes five criteria: readability (Is the response easy to read?), comprehensibility (Is the information easy to understand?), usefulness (Is it helpful for promoting healthy behaviors?), credibility (Do you trust the response?), and consistency (Is it consistent with CDC’s official guidance?).
3.6.1. Survey Design and Methods
To conduct human validation, five questions were extracted from the 53-question CDC COVID-19 prevention guideline manual. As the purpose of this phase was to compare automated scoring with human judgment, selection was guided by textual properties, including readability, consistency, and ease of understanding. The chosen questions are shown in
Table 3, with the reasons for selection detailed in the right-hand column.
A questionnaire survey was employed to perform the human validation component of this study. The design involved three steps:
- (1)
Five questions from
Table 3 were selected, and the corresponding CDC responses and AI-generated answers from the four language models were compiled.
- (2)
Evaluation criteria included five dimensions—readability, comprehensibility, usefulness, credibility, and consistency. Each response was rated on a five-point Likert scale, with 1 indicating “strongly disagree” and 5 indicating “strongly agree”.
- (3)
The questionnaire was distributed and collected through an online survey tool.
Participants were primarily undergraduate students, with a smaller number of vocational college and graduate students. Given that the CDC’s COVID-19 prevention guide is designed for the general public, university students were deemed appropriate representatives for evaluating the content.
3.6.2. Analysis of Subjective Assessment
The questionnaire survey was conducted with a relatively small sample size, resulting in 54 valid responses. The respondents included 19 males and 35 females. Regarding educational attainment, 7 were from junior colleges, 43 were from undergraduate programs, and 4 were graduate students.
Using a weighted averaging approach, the scores for each of the five evaluation criteria were computed for all AI models. The final outcomes are summarized in
Table 4.
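As a minimal sketch of the weighted averaging step, assuming a hypothetical distribution of the 54 respondents’ ratings on the 1–5 Likert scale for one model and one criterion (the counts are illustrative, not the actual survey data):

```python
# Hypothetical distribution of 54 ratings (1 = strongly disagree ... 5 = strongly agree)
rating_counts = {1: 0, 2: 1, 3: 8, 4: 25, 5: 20}

total_responses = sum(rating_counts.values())
weighted_mean = sum(rating * n for rating, n in rating_counts.items()) / total_responses
print(round(weighted_mean, 2))  # weighted average score for this criterion
```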
- (1)
Overall performance
Based on the results in
Table 4, all four models received scores between 3.67 and 4.35 across the five indicators, indicating generally high content quality. Nevertheless, noticeable differences emerged, particularly in the dimensions of comprehensibility and credibility.
- (2)
Comparison and summary between foreign models and Chinese models
The performance of text generation by the international models (ChatGPT and Gemini) and the domestic models (Kimi and Ernie Bot) is compared as follows:
In terms of readability, the scores of ChatGPT (4.19), Gemini (4.20), and Kimi (4.20) are similar, while Ernie Bot scored slightly lower (4.09), which shows that the readability of text generated by different models is good.
Regarding comprehensibility, ChatGPT achieved the highest score (4.35), outperforming the others. Chinese models showed relatively stable performance—Kimi at 4.00 and Ernie Bot at 4.15—while Gemini scored the lowest (3.96).
In terms of consistency, ChatGPT (4.15) and Kimi (4.00) scored high, indicating that they are closer to CDC information, whereas Ernie Bot (3.96) and Gemini (3.67) showed lower consistency, indicating deviation from authoritative sources.
In terms of usefulness, three models (ChatGPT, Gemini, and Kimi) scored the same (4.07), with Ernie Bot slightly behind (3.96), reflecting minor variation in their potential to support healthy behavior.
In credibility, ChatGPT again led (4.09), while Gemini and Ernie Bot tied at 3.96. Kimi received the lowest score (3.89), which shows that the credibility of the model generation needs to be improved.
In general, international models excelled in accuracy and credibility, while domestic models demonstrated strength in readability and alignment. These differences may stem from disparities in training corpora, optimization objectives, and cultural–linguistic orientation.
3.6.3. Comparison of Human- and Machine-Based Evaluations
To assess the alignment between manual and algorithmic evaluation methods, machine metrics (SimHash, FRES, FKGL) were paired with their corresponding human evaluation indicators, namely consistency, readability, and comprehensibility.
- (1)
Consistency: Comparing Human Scores with SimHash Similarity
Human ratings for consistency were highest for ChatGPT (4.15) and Kimi (4.00), moderate for Ernie Bot (3.96), and lowest for Gemini (3.67). The SimHash similarity scores reflected a comparable pattern: Kimi scored the highest (0.65), ChatGPT and Ernie Bot followed closely (0.63), and Gemini scored the lowest (0.59).
This strong agreement between human and machine scores indicates that SimHash is a useful proxy for human judgments of textual consistency. It is particularly effective in measuring how closely AI-generated responses align semantically with reference standards, such as those issued by the CDC.
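One way to quantify the agreement described above is a rank correlation between the human consistency ratings and the mean SimHash scores already reported; this computation was not part of the original analysis and is shown only as an illustrative check.

```python
from scipy.stats import spearmanr

models = ["ChatGPT", "Gemini", "Kimi", "Ernie Bot"]
human_consistency = [4.15, 3.67, 4.00, 3.96]  # human ratings from Section 3.6
simhash_mean = [0.63, 0.59, 0.65, 0.63]       # mean SimHash scores from Section 3.1

# Spearman rank correlation between the two score sets
rho, p = spearmanr(human_consistency, simhash_mean)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```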
- (2)
Readability: Comparison Between Human Ratings and FRES Scores
In the dimension of readability, human scores were highest for both Kimi and Gemini (4.20), followed by ChatGPT (4.19) and Ernie Bot (4.09). Turning to the Flesch Reading Ease Score (FRES), Kimi also achieved the highest score (28.93), indicating superior textual readability. ChatGPT (25.59) and Ernie Bot (26.85) had comparable FRES values, while Gemini scored the lowest (18.59), suggesting that its sentence structure may be more complex or less accessible.
However, no clear parallel trend was observed between human ratings and FRES scores. Thus, while FRES may serve as a supplementary indicator of readability, its use should be combined with broader user-centered evaluations in practical applications.
- (3)
Comprehensibility: Human Ratings and FKGL Scores
In the dimension of comprehensibility, ChatGPT received the highest human rating (4.35), indicating that its responses were more easily understood by users. This was followed by Ernie Bot (4.15) and Kimi (4.00), while Gemini scored the lowest (3.96). These results suggest that ChatGPT’s textual organization better aligns with users’ comprehension expectations.
In terms of the Flesch–Kincaid grade level (FKGL), ChatGPT also demonstrated the lowest score (13.08), indicating the lowest required reading level. By contrast, Kimi (13.70) and Gemini (13.41) had higher FKGL values, with Ernie Bot in between (13.55). As higher FKGL scores denote increased reading difficulty, ChatGPT’s consistent performance across both human and machine evaluations confirms its relatively superior comprehensibility.
- (4)
Usefulness: Evaluated Solely by Humans
The usefulness metric assesses whether AI-generated content is effective in promoting pandemic prevention and control. This indicator cannot currently be measured through automated scoring. Human evaluation showed that ChatGPT, Gemini, and Kimi all received identical scores of 4.07, with Ernie Bot slightly lower at 3.96. These results suggest that, in general, the public believes that generative AI can output practical text.
- (5)
Credibility: Perceptions of Trustworthiness
Credibility was used to evaluate how much trust users placed in the AI responses. ChatGPT attained the highest credibility score (4.09), followed by Gemini and Ernie Bot (3.96), and Kimi with the lowest score (3.89). This suggests that ChatGPT’s responses were more convincing. Similar to usefulness, credibility remains a purely human-assessed construct due to the absence of validated machine evaluation metrics.
Integrated Summary of Human and Machine Evaluation Comparisons: This study compared human and machine-based evaluations across five key indicators and found varying degrees of alignment. SimHash and FKGL showed strong consistency with human assessment trends. In contrast, FRES scores diverged significantly from human judgments, indicating that machine-based assessments can only partially reflect the actual performance of AI-generated content.
Notably, two critical dimensions—usefulness and credibility—are inherently based on human perception and cannot currently be quantified through automated means. In summary, while machine algorithms can contribute to the quantification of certain evaluation dimensions, human involvement remains essential to ensure a comprehensive and balanced assessment of generative AI output. These human subjective evaluations partially reflect the users’ psychological feelings and self-perception, providing indirect insights into the dissemination of health knowledge.
4. Discussion
This paper studies the performance of generative artificial intelligence models in providing knowledge and answers regarding infectious diseases. The specific research method was to input the 53 questions from the U.S. CDC COVID-19 prevention and control knowledge guide into the four models in English, extract the English answer text of each model, and evaluate the performance of the four models in generating English text by calculating relevant indicators and modeling the outputs. This study systematically compares the foreign models ChatGPT and Gemini with the domestic models Kimi and Ernie Bot in terms of text generation, the ease with which the text can be read and understood, and the semantic content. Furthermore, this study discusses the influence of generative artificial intelligence models on the dissemination of public health information. A concrete, in-depth discussion is carried out below, and our findings are presented.
In this paper, the accuracy of the artificial intelligence models is compared based on their SimHash similarity scores. The results show that the text generated by Kimi is the most similar to the standard answers provided by the CDC, with ChatGPT and Ernie Bot close behind and Gemini slightly lower. In terms of stability, ChatGPT’s similarity scores vary the least across questions, whereas the similarity of the text generated by the domestic models (Kimi and Ernie Bot) to the CDC standard text is more volatile and less stable.
Although SimHash provides a scalable computational method to evaluate semantic similarity, it cannot fully capture factual correctness from the perspective of human readers. This paper uses machine judgment to obtain objective scores; to compensate for its shortcomings, a questionnaire method is introduced. By combining manual scoring with machine scoring and comparing the two, the shortcomings of machine scoring can be assessed. The manual judgment results of this paper show that the machine scoring is highly consistent with the manual scoring ranking. This indicates that, for the dataset used in this paper, SimHash can serve as an effective automated indicator for the “consistency” evaluation dimension: it can reliably judge the semantic proximity between the text generated by an AI model and the authoritative answer. The research in this paper also shows that, in key areas such as healthcare, human judgment is still needed to supplement the assessment of accuracy.
Although the dataset used in this study is relatively small and standardized due to time and resource constraints, the research framework was further strengthened and verified by incorporating a questionnaire method as a manual verification step. Comparing the manual scoring results with machine-calculated indicators such as SimHash revealed consistent ranking trends, especially in the “consistency” and “accuracy” dimensions. This consistency supports the effectiveness and robustness of the evaluation framework even with limited data.
The readability of the text generated by the models is measured with the Flesch Reading Ease Score (FRES index). The FRES index reveals differences between the Chinese and foreign models in the readability of the text that they produce. The empirical results show that the FRES indexes of the domestic models (Kimi and Ernie Bot) are higher, indicating that the domestic models produce text that is easier to read, with simpler wording and a more popular style of language. In comparison, the FRES values of the foreign models (ChatGPT and Gemini) are lower, indicating that their generated text may be more professional and include words that are less frequently used by most people. However, the foreign models (ChatGPT and Gemini) generate text that is logical, rigorous, and suitable for professional reading. When examining how easily text can be read, the dissemination of public health information should focus on an organic combination of professionalism and accessibility.
Text comprehensibility is measured with the FKGL metric, another readability indicator. The empirical results show that the FKGL values of all four models are above 12, corresponding to a higher-education reading level in the United States, which suggests that the generated text is best suited to readers with higher levels of education; the values of the domestic models (13.55 and 13.70) are slightly higher than those of the foreign models (13.08 and 13.41). The results of this study indicate that the future training of generative artificial intelligence needs to take comprehensibility into account in order to provide timely, accurate, and easy-to-understand professional text for the prevention and control of major infectious diseases.
In addition to the accuracy, readability, and comprehensibility of the four models, this work also used an artificial neural network model (a multilayer perceptron) to study the factors affecting text readability. The empirical results show that the readability of the foreign models’ generated text depends mainly on the complexity of the language structure, especially the number of syllables per word (ASPW) and the sentence length (AWPS). The domestic models also emphasize the number of sentences in the text, reflecting stronger localization of language. The linguistic differences between the texts of the domestic and international models embody the different characteristics of the Chinese and English languages.
The results of topic mining show that the topic content of the foreign models (ChatGPT and Gemini) focuses on “vaccination”, “virus mutation”, and “protection suggestions”, while domestic models (Kimi and Ernie Bot) mainly focus on practical knowledge such as “symptom recognition” and “test suggestions”. In general, the textual subject content of the Chinese and English models, at home and abroad, reflects the cultural and traditional differences in the training data behind them.
To verify the accuracy of AI-generated content and assess the feasibility of the research methodology proposed in this study, it is essential to incorporate manual evaluations that subjectively assess the accuracy of the generated content. This study adopts a questionnaire-based approach for manual assessment. The empirical results indicate that machine-based metrics, such as SimHash and the Flesch–Kincaid grade level (FKGL), closely align with the trends observed in manual evaluations, suggesting a degree of substitutability and reference value. However, the Flesch Reading Ease Score (FRES) shows notable discrepancies compared to manual judgments, highlighting that automated scoring cannot fully replace human assessment.
Human Evaluation as an Indicator of Potential Risk of Misinformation: This paper implements human subjective scoring by adopting a questionnaire method. In this process, in addition to providing subjective assessments of readability and consistency, human judges also provide important insights into the accuracy and credibility of AI-generated content. In some cases, participants rated certain responses as semantically similar but potentially unreliable. These issues show that AI models may generate content that seems reasonable but is actually misleading. This highlights the importance of incorporating human judgment into the evaluation process, as machine-based indicators alone (such as SimHash) may not fully capture factual correctness. Mitigating the risk of misinformation is a further reason for incorporating human judgment in this paper.
5. Conclusions
This study employs the COVID-19 prevention and control Q&A content issued by the U.S. Centers for Disease Control and Prevention as a benchmark to systematically evaluate the performance differences among four mainstream generative artificial intelligence (AI) models—ChatGPT, Gemini, Kimi, and Ernie Bot—both domestic and international. The evaluation focuses on four dimensions: text generation capability, understandability, readability, and semantic content. To assess the validity of machine-generated scoring metrics, a questionnaire-based method was introduced to obtain subjective manual evaluations of AI-generated content, allowing for a comparison between human and machine assessments. Empirical findings reveal that the international models, ChatGPT and Gemini, demonstrate superior accuracy and professionalism; however, their generated texts tend to be more complex. In contrast, the domestic models, Kimi and Ernie Bot, produce more accessible language that is better suited for public health communication, though the level of specialization in their content requires improvement. Furthermore, significant differences were observed between domestic and international models in terms of language generation strategies, audience adaptability, and semantic coverage. Variations in language style and communication preferences were also noted across the models. The study confirms that while machine scoring offers partial insight into the performance of AI-generated texts, it cannot fully substitute for human evaluation. According to these findings, this study recommends that future AI-generated content for infectious disease communication should place greater emphasis on the knowledge base and comprehension level of the general public. Striking a balance between professionalism and accessibility in AI training data is essential to enhance the effectiveness and accuracy of public health knowledge dissemination.
While this study focuses on COVID-19-related disease prevention and control question-answering systems, its research framework—including machine-based metrics (e.g., SimHash, FKGL) and human-judged criteria (e.g., credibility, practicality)—is not disease-specific. This approach can be easily applied to evaluate the performance of generative AI models in other infectious diseases (e.g., influenza, monkeypox, tuberculosis, etc.). Given the growing role of AI in public health communication, evaluating its cross-disease adaptability is an important future direction. The results of this study lay the foundation for such extended research and provide a new evaluation paradigm for related research.
Through the empirical research in this article, it can be found that generative artificial intelligence has a good application prospect in the medical field, especially in the dissemination of knowledge on infectious disease prevention and control. However, when comparing the scoring results through machine judgment and manual judgment, it is found that relying solely on generative artificial intelligence is incomplete, and the judgment of knowledge still needs to rely on the experience and knowledge of professional medical staff, especially experts. In addition, the use of generative artificial intelligence technology in the field of public health should also follow ethical principles—transparency, accountability, and manual supervision—to prevent the abuse of artificial intelligence. Future research work will integrate multidisciplinary knowledge and explore the more efficient integration of generative artificial intelligence technology and public health knowledge dissemination, so as to contribute to the prevention and control of major infectious diseases.
In this paper, we studied the application of generative artificial intelligence in providing general knowledge and answers about the prevention and control of infectious diseases. In addition, we implemented three aspects of innovation: (1) Theoretical innovation. By introducing generative artificial intelligence models as an analysis tool, the research boundary of public health information dissemination is expanded. Traditional public health communication is concentrated in the media and government campaigns, and generative artificial intelligence models have rarely been applied to the study of infectious disease knowledge. This study incorporates generative artificial intelligence into a question-and-answer framework for infectious disease knowledge for the first time, enriching the theoretical map of infectious disease communication. The theoretical innovation is also reflected in the selection of multiple indexes of generated text performance: accuracy, readability, comprehensibility, readability influencing factors, and semantic content, five indexes in total, which overcomes the limitations of traditional single-index evaluation. (2) Methodological innovation. The innovation of this paper is embodied in a multidimensional analysis framework of text accuracy and readability indexes; by introducing a neural network model, the factors influencing the readability of generated text are identified, and a mechanism for comparing text quality is produced. Topic mining further enhances the interpretability of the text. (3) Application innovation. The application innovation of this study is mainly embodied in the comparative perspective on major model performance. We compare the output performance of foreign and domestic models under the same task and reveal the differences between Chinese and foreign models in terms of semantic accuracy, accessibility, and professionalism, which provides a valuable reference for the development of artificial intelligence.
In general, the results of this study can directly serve the healthcare system and disease prevention and control departments. When AI is used, or its output is supervised, for health knowledge dissemination, this study offers technical advice for evaluating the quality of the information, thereby helping generative artificial intelligence models to better serve the public and contributing to improving the control of major infectious diseases worldwide.
6. Study Limitations
Although this study has carried out an effective exploration of the framework design of model evaluation, empirical data comparison, and multidimensional index extraction, it still has limitations that need to be addressed in future studies.
The uniqueness of the standard answer limits the assessment of the models’ overall performance. In this study, the answers to the 53 COVID-19-related questions provided by the CDC were used as the standard answers against which the responses of the artificial intelligence models were compared. However, generative artificial intelligence models produce diverse answers, and measuring similarity to a single reference answer cannot fully capture the diversity of the generated text or the actual professional level of the models.
One limitation of this study is that it relies on the English version of the COVID-19 guidelines published by the Centers for Disease Control and Prevention (CDC) in 2020. While this material provides a standardized and authoritative benchmark, it does not take multilingual public health communication into account. In addition to the CDC, commonly used multilingual COVID-19 prevention and control texts also come from the World Health Organization (WHO), the European Centre for Disease Prevention and Control (ECDC), the official websites of national ministries of health, and other institutions, covering more than 80 languages and topics such as vaccination and protection guidelines. Future research will use updated, multilingual, and culturally diverse corpora to evaluate the performance of AI models.
Another limitation of this study is the iterative nature of large language models (LLMs). Most large models, especially the mainstream models studied in this paper (ChatGPT, Gemini, Kimi, and Ernie Bot), constantly update their versions and modify and improve their algorithms, yet none of them provide timely update logs to users. Therefore, software iteration may cause the same prompt to produce different responses at different times, which poses a challenge to the reproducibility of the research. Although we recorded the access time and conditions during the test, this time dependence needs to be considered when interpreting the conclusions. Future research will integrate multiple versions, multiple time points, and multiple acquisitions for comparison to further enhance reproducibility.
The impact of language training background on AI model performance: This study highlights the issue of differences in the training environments of generative AI models. ChatGPT and Gemini are trained primarily on English-language datasets, whereas Kimi and Ernie Bot are trained on a mix of Chinese and English corpora. Since this study uses the English-language CDC prevention and control guidelines as the evaluation benchmark, and all prompts and model responses are in English, the international models (ChatGPT and Gemini) possess an inherent language advantage in output performance. Previous research has demonstrated that the performance of generative AI models can vary significantly depending on the language in which they are trained. For instance, Luo et al. reported that DeepSeek outperformed ChatGPT in tasks conducted in Chinese, showing that model performance is influenced not only by the task itself but also by the language context [
34]. In light of these findings, future research should explore bilingual or multilingual evaluation frameworks to better understand how language training backgrounds affect model performance across different linguistic settings.
The topic analysis did not include sentiment classification of the text. Although this study carried out text topic analysis, it did not perform sentiment analysis, yet in the actual dissemination of medical information, the emotional tone of text and language is often associated with the efficiency of transmission.
The limited size and scope of the knowledge domain is a further shortcoming of this study. This study was based on the 53 COVID-19-related questions of the CDC, which have certain limitations: the knowledge domain is small and does not cover the wider range of infectious diseases. In future research, we will attempt to include information about a wider range of diseases in order to better support the prevention and control of major infectious diseases.
This study is a practical exploration, providing reference material for generative artificial intelligence technology in the prevention and control of major infectious diseases. Although this study did not directly evaluate clinical scenarios, the manual evaluation questionnaire included the two indicators of usefulness and credibility, which provide preliminary insights into the applicability of answers generated by artificial intelligence in public health education. One future research direction is to cooperate with medical staff and patients to conduct clinical practice research together and so provide more substantial clinical verification for the corresponding findings.
Limitations of generalizability and potential for misinformation: Although the dataset used in this study is the official text of an authoritative medical administration agency, it is relatively small in scale, and whether the results of this study can be generalized to more complex multilingual datasets or larger datasets remains to be verified. In addition, the empirical research in this article, especially the combination of manual and machine judgments, found that the responses generated by generative AI contained unreliable information, which reflects a lack of robustness in the content generated by generative AI. This also means that when larger datasets are used for related research, AI is likely to generate more misinformation. Based on this, one future research direction is to use a large-scale, complex, and multilingual corpus to systematically and comprehensively evaluate the robustness of generative AI output.