1. Introduction
The COVID-19 epidemic, which emerged in late 2019, severely shocked global society, economy, and public health [
1]. The World Health Organization (WHO) declared it a public health emergency of international concern in January 2020 [
2] and in May 2023 declared the official end of the global emergency phase [
3].
However, the end of the COVID-19 emergency does not eliminate the global threat posed by infectious diseases [
4,
5,
6]. Ongoing COVID-19 cases, influenza outbreaks, and the emergence of new pathogens suggest that future pandemics remain possible. Variants or mutations in existing viruses could again trigger global health crises, and the public should be able to access high-quality health information conveniently when they do. Therefore, timely access to accurate health information remains vital for early warning and public preparedness.
In response to potential future pandemics, enhancing public awareness and access to health information is essential.
While traditional search engines have long served as a primary source of health knowledge, generative artificial intelligence (AI), such as ChatGPT, which was released in November 2022, has emerged as a new tool for information retrieval [
7]. Unlike search engines that retrieve documents from existing databases, generative AI produces novel responses based on patterns learned from large-scale datasets, which may enhance fluency but introduces risks of factual inaccuracy. Compared to conventional methods, generative AI is designed to deliver automated, personalized, and seemingly accurate responses, offering users a more interactive and natural communication experience [
8].
Generative artificial intelligence technology, with its distinctive conversational models, has significantly changed how people engage with knowledge and has become a novel channel for disseminating information about infectious diseases [
9]. These technologies have been increasingly applied in medicine [
10]. Nevertheless, despite their growing adoption, concerns remain regarding the reliability of AI-generated content, particularly in high-stakes areas like healthcare, where issues related to ethics, factual accuracy, and readability are especially salient. The accuracy and readability of medical responses from generative AI have therefore become core concerns [
11].
However, the application of generative AI in the dissemination of health knowledge also brings ethical challenges, including the risk of generating misinformation, unequal access to generative AI, and the expanding reach of these tools. At the same time, the public’s ability to understand AI-generated content may vary. Therefore, in order to guide the general public more effectively and correctly in using generative AI technology to prevent and control major infectious diseases, it is necessary to improve the public’s health literacy through education, especially for vulnerable groups in society.
At present, both in China and internationally, research into generative artificial intelligence technologies in the medical domain has mainly focused on evaluating the accuracy of ChatGPT in medical exams and clinical question-answering tasks [
12,
13,
14]. Its application in both basic and clinical medicine has been tested [
15], such as performance evaluation in Polish medical exams [
16] and assessments of knowledge on bacterial infection [
17]. Notably, the reliability and readability of ChatGPT-4 in addressing hypothyroidism during pregnancy have also been studied [
18], demonstrating that its responses were moderately reliable and readable at a university level. Other studies have explored ChatGPT’s handling of myopia [
19], its comparison with Google Bard in text generation [
20], and the differences between ChatGPT and Internet search engines in responding to patient queries [
21].
In summary, existing research rarely evaluates the accuracy and readability of generative AI responses related to infectious diseases specifically, particularly in Chinese contexts. However, with more than 1 billion Internet users in China and around 230 million daily users querying health topics via generative AI platforms, there is an urgent need for reliable information on disease transmission. This study addresses that gap by comparing the performance of four major generative AI platforms (including Chinese models) in providing infectious disease-related information. The findings aim to guide medical professionals, AI developers, and the general public in selecting appropriate tools for health communication and decision making.
2. Materials and Methods
This paper compares generative artificial intelligence models from four mainstream platforms, two Chinese and two international. The four models are compared in terms of accuracy, readability, comprehensibility, the factors influencing readability, and the topics of the generated text. As research material, we selected the Q&A text of the CDC’s COVID-19 control guide as the standard answer, and the responses provided by the four generative artificial intelligence models were compared against this standard. The domestic generative AI models tested were Kimi and Ernie Bot, and the foreign models were ChatGPT and Gemini.
2.1. Research Materials
This article addresses infectious diseases. It is based on the U.S. Centers for Disease Control and Prevention (CDC), and the corresponding research text is the CDC’s COVID-19 prevention and control guide from August 2020 [22]. This can be downloaded at the following URL:
https://stacks.CDC.gov/view/CDC/89817 (accessed on 1 February 2025). The guide contains 53 questions and answers, which cover the following areas [22]: (1) COVID-19 infection risk and control (six questions); (2) COVID-19 spread and prevention (seven questions); (3) detection and diagnosis (six questions); (4) the clinical management of COVID-19 (eight questions); (5) the management of special populations (seven questions); (6) the treatment of and prevention measures for COVID-19 (six questions); (7) prevention and vaccination (six questions); (8) other relevant questions (seven questions). The guide is aimed at both general medical staff and the scientific community at large in order to provide an authoritative resource for the prevention and control of outbreaks.
2.2. Research Methods
In order to compare the properties of the text generated by four generative artificial intelligence models from China and abroad, this paper examines several relevant indicators. By comparing the text generated by the four models with the standard text provided by the CDC, we compared accuracy, readability, comprehensibility, the factors influencing readability, and the topics of the generated text. Through a comprehensive evaluation of how the domestic and international models respond to infectious disease questions, this study provides a reference guide for medical workers and the public seeking knowledge about infectious diseases.
To evaluate the accuracy and readability of the four AI models’ answers to the CDC COVID-19 prevention and control guidelines, we input 53 English questions to each model and extracted the English answer text generated by each model. English was chosen because the COVID-19 prevention and control guidelines are expressed in English. All questions were manually entered one by one by the researchers, and the full text answers generated by each model were recorded. No follow-up questions or supplementary interactions were conducted throughout the test to ensure that the model’s initial natural output capabilities were obtained.
The variable metrics used for model performance comparison are as follows: SimHash, Flesch–Kincaid grade level (FKGL), Flesch Reading Ease Score (FRES), reading level (RL), average words per sentence (AWPS), average syllables per word (ASPW), and sentences and words, where SimHash stands for text similarity, i.e., accuracy. The Flesch–Kincaid grade level (FKGL) indicates the U.S. school grade corresponding to the reading difficulty of the text. The Flesch Reading Ease Score (FRES) indicates how intelligible the text is. Reading level (RL) indicates the minimum education level, in terms of reading, required for a text. Average words per sentence (AWPS) indicates the average number of words per sentence. Average syllables per word (ASPW) indicates the average number of syllables per word. The “sentences” variable indicates the total number of sentences the text contains and is used to calculate the average sentence length. The “words” variable indicates the total number of words. The definitions and instructions for the use of each indicator are as follows.
- (1)
Comparison of text accuracy
Many algorithms exist for comparing the similarity of two texts; in this study, the SimHash index is used to describe text accuracy. The SimHash algorithm is a hash-function-based method for detecting document similarity [23]. It computes a fingerprint for each text so that the similarity of two texts can be measured. The SimHash algorithm is suitable for text comparison, data classification, and similar tasks, and it ensures the validity and accuracy of the comparison by filtering and optimizing the calculation strategy [24]. The similarity of two texts is represented by the numerical value produced by the SimHash algorithm; the greater this value, the more similar, and hence the more accurate, the texts are.
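To make the computation concrete, the sketch below illustrates one common way a SimHash similarity score can be obtained. It is a minimal illustration only; the online calculator used in this study may tokenize, weight, or size the fingerprint differently, and the example sentences are hypothetical.

```python
import hashlib
import re

def simhash(text, bits=64):
    # Break the text into lowercase word tokens
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def simhash_similarity(a, b, bits=64):
    # Similarity = 1 - normalized Hamming distance between the two fingerprints
    hamming = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1 - hamming / bits

# Illustrative comparison of a reference answer and a model answer
cdc_answer = "Wash your hands often with soap and water for at least 20 seconds."
model_answer = "You should wash your hands frequently with soap and water for 20 seconds or more."
print(round(simhash_similarity(cdc_answer, model_answer), 3))
```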
- (2)
Textual legibility comparison
Text readability is evaluated with the FRES test, which measures the extent to which a text is easy to read [9]. The text readability index used is the Flesch Reading Ease Score (FRES). The Flesch reading index is a measure of the readability of English text proposed by Rudolf Flesch in 1948 [25]. It scores texts on a scale of 0 to 100: the higher the score, the easier the text is to read, and the lower the score, the more difficult it is to read.
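For reference, the standard Flesch formula can be expressed in terms of the two structural quantities already defined above, AWPS (average words per sentence) and ASPW (average syllables per word):

FRES = 206.835 - (1.015 × AWPS) - (84.6 × ASPW)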
- (3)
Textual comprehensibility comparison
The FKGL index is used in our study to measure how easily a text can be understood; it reflects the degree of comprehension a text requires [9]. FKGL stands for Flesch–Kincaid grade level. This indicator is one of the important indicators of text comprehensibility, especially in areas such as medical care and education. The index was first proposed by Rudolf Flesch in 1948 and later revised by J. Peter Kincaid. The score represents a reading level within the U.S. education system [26].
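For reference, the standard Flesch–Kincaid formula uses the same two structural quantities as the FRES but maps them onto a U.S. school grade:

FKGL = (0.39 × AWPS) + (11.8 × ASPW) - 15.59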
- (4)
Analysis of factors influencing text readability
Text readability is also evaluated with the reading level (RL) indicator [27]. RL indicates the minimum grade of education required to read the text in question. In this study, we use a neural network model, one type of intelligent algorithm, to analyze the key factors affecting text readability. Following the requirements of neural network model construction, the input layer indicators of the model are the AWPS, ASPW, sentence, word, and SimHash indicators, and the output layer indicator is the RL indicator. This paper identifies the key factors affecting text readability through multilayer perceptron mining (the multilayer perceptron is one of many neural network algorithms).
- (5)
The semantic comparison of text content
This paper studies the word frequency and the semantics of the word frequencies of each model’s text through text analysis [28]. This is the most basic and common analysis process in natural language understanding, and it mainly consists of two parts. The first is word frequency statistics, and the second is topic mining. Word frequency statistics, the most basic part of text analysis, identify the most common words in a text in order to understand its main content or keywords. Topic mining combines high-frequency words with their context to induce topics. This method is well suited to the analysis of the words in the texts used in our study.
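As a minimal sketch of the word frequency step (topic induction itself was performed manually), using an illustrative snippet of text rather than the study’s actual answer corpus:

```python
from collections import Counter
import re

# Illustrative text standing in for one model's concatenated answers
text = ("COVID-19 spreads between people who are in close contact. "
        "Get tested if you have symptoms of COVID-19 and follow testing advice.").lower()

# Simple stop-word list; the actual study may have filtered terms differently
stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
             "is", "are", "be", "if", "you", "have", "who", "between"}
tokens = [t for t in re.findall(r"[a-z0-9\-]+", text) if t not in stopwords]

# The most frequent content words, later used for topic induction
for word, count in Counter(tokens).most_common(10):
    print(word, count)
```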
- (6)
Manual scoring and comparison
In order to verify the accuracy of machine scoring, this article introduces a manual scoring mechanism. The specific method is to first select some questions and answers from the CDC prevention and control guidelines and then collect manual subjective scoring results through an online questionnaire method. Then, the manual scoring is compared with the machine scoring to verify the accuracy of the machine scoring.
2.3. Statistical Method
This paper is a quantitative analysis study, and a different calculation method is used for each index. The text accuracy index (SimHash) was calculated with the following online computing platform:
https://kiwirafe.pythonanywhere.com/app/xiangsi/ (accessed on 1 March 2025). The text readability and language size metrics (FKGL, FRES, RL, AWPS, ASPW, sentences, words) were calculated with the following online computing platform:
https://goodcalculators.com/flesch-kincaid-calculator/?utm_source=chatgpt.com (accessed on 1 March 2025). This study used the neural model analysis module in SPSS 27 to compute the multilayer perceptron and to carry out the statistical analysis of the full-text data. We used Python 3.9 to compute the word frequency statistics of the text.
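For reproducibility, the same readability battery can also be computed locally in Python; the sketch below assumes the third-party textstat package, which was not part of the original workflow and may round or count syllables slightly differently from the online calculator.

```python
import textstat  # third-party package: pip install textstat

answer = ("COVID-19 spreads mainly through respiratory droplets produced "
          "when an infected person coughs, sneezes, or talks.")

words = textstat.lexicon_count(answer, removepunct=True)
sentences = textstat.sentence_count(answer)
syllables = textstat.syllable_count(answer)

metrics = {
    "FRES": textstat.flesch_reading_ease(answer),
    "FKGL": textstat.flesch_kincaid_grade(answer),
    "words": words,
    "sentences": sentences,
    "AWPS": words / sentences,   # average words per sentence
    "ASPW": syllables / words,   # average syllables per word
}
print(metrics)
```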
3. Results
3.1. Comparison of Text Accuracy
In this study, the SimHash similarity data were used as the accuracy index. In order to compare the accuracy of the four models in answering the 53 CDC questions, this study calculated the SimHash accuracy for each answer, displayed the calculated data as a box plot, and measured the consistency of the four models in answering the same questions by analyzing the box plot. The calculation results are shown in
Figure 1.
- (1)
The consistency of the text
Consistency is indicated by the mean SimHash similarity score. From the calculation results, Kimi scored the highest (0.646), indicating that Kimi’s answers were the most consistent with the reference text. ChatGPT’s score (0.63) was close to that of Ernie Bot (0.627), with both slightly below Kimi. Gemini (0.59) had the lowest average score, indicating that its answers were the least consistent of the four models.
- (2)
Stability comparison
The stability of accuracy across questions is measured by the standard deviation. ChatGPT’s standard deviation is the lowest (0.074), indicating that it is the most stable and that its accuracy changes little across different questions. Ernie Bot’s standard deviation (0.082) is the highest, indicating that the accuracy of its answers fluctuates more and that the quality of its text is less stable. The volatility of the Kimi and Gemini models is similar (0.079 and 0.077, respectively), indicating that their responses demonstrated comparable levels of accuracy stability.
- (3)
Extreme values
The extreme values are compared based on the distribution of the data points. The maximum and minimum values of the four models are as follows: ChatGPT [0.45, 0.75], Gemini [0.44, 0.77], Kimi [0.41, 0.84], and Ernie Bot [0.44, 0.80]. The corresponding ranges (i.e., the difference between the maximum and minimum values) are as follows: ChatGPT (0.30), Gemini (0.33), Kimi (0.43), and Ernie Bot (0.36). Kimi has the largest range, indicating that this model is the most likely to produce markedly abnormal outputs in response to certain questions. Among the other three models, the ranges from largest to smallest are Ernie Bot, Gemini, and ChatGPT.
On the whole, Kimi has the highest accuracy in outputting text, followed by ChatGPT and Ernie Bot, and Gemini is slightly lower. ChatGPT has the best output stability, followed by Kimi, Gemini, and Ernie Bot. After calculation, it can be seen that Kimi has the largest range, indicating greater fluctuation in its responses to certain questions. The ranges of the other three models are relatively close, indicating that their output fluctuations are generally similar and smaller than Kimi.
3.2. Text Readability Comparison
The text readability index uses the Flesch Reading Ease Score. The FRES index of the four models is shown in
Figure 2.
The following is a comparison of Chinese and foreign models in terms of text consistency, stability, and extreme values.
- (1)
Comparison of text consistency
The ChatGPT score (26.59) distribution is the most concentrated, indicating that its text demonstrates high consistency. The Gemini score (18.59) distribution is also relatively concentrated, but its text demonstrates less consistency than that of ChatGPT. The distribution of the domestic model Kimi (28.93) is more diffuse, indicating that its text is less consistent than that of the previous two. The score distribution of Ernie Bot (26.85) is also fragmented, with a median of about 35, indicating that its text is the least consistent.
- (2)
Stability comparison
The calculation results show that the standard deviation of ChatGPT (7.31) is the lowest among the four models, indicating that the readability of its generated text is the most stable. The standard deviation of Ernie Bot (10.13), ranking second, indicates high stability. The standard deviation of Gemini is 10.78, and the standard deviation of Kimi is 11.37. Both have high degrees of variation, indicating poor stability.
- (3)
Extreme comparison
The maximum and minimum values of the four models are as follows: ChatGPT [8.20,42.10], Gemini [0.00,54.20], Kimi [0.00,55.70], and Ernie Bot [3.40,58.00]. ChatGPT does not have obvious extreme values, which indicates that the distribution of the scores is more uniform. Gemini has some extreme values but not many of them, indicating some anomalies in its score distribution. The domestic model Kimi has several extreme values, indicating that there are some anomalies in the distribution of the scores. Ernie Bot has multiple extreme values, indicating that there are many exceptions in the distribution of the scores.
To sum up, the performance of ChatGPT is the best in the three aspects of text consistency, stability, and extreme value, followed by Gemini, Kimi, and Ernie Bot.
3.3. Textual Comprehensibility Comparison
Textual comprehensibility is measured with the FKGL index. The Flesch–Kincaid grade level (FKGL) is one of the indicators of text readability and corresponds to the U.S. school grade required to read the text.
Figure 3 shows the FKGL index box diagram for the four large models.
Based on the box plot above, the four models are compared below in terms of text complexity, output stability, and extreme values.
- (1)
Comparison of text complexity
After calculation, the mean FKGL scores of the four models are as follows: ChatGPT (13.08), Gemini (13.41), Kimi (13.70), and Ernie Bot (13.55). ChatGPT has the lowest average grade score, indicating that its generated text is easier to read, while Kimi has the highest grade score, indicating that its text structure is more complex.
- (2)
Analysis of output stability
The calculated standard deviations of each model (values in brackets) are as follows: ChatGPT (1.33), Gemini (1.66), Kimi (2.35), and Ernie Bot (1.77). ChatGPT has the highest stability, while Kimi has the largest standard deviation, indicating that its complexity varies greatly.
- (3)
Extreme comparison
From the calculation results, we can see that the maximum and minimum ranges of each model are ChatGPT [10.40, 15.20], Gemini [7.90, 16.40], Kimi [8.70, 20.80], and Ernie Bot [8.50, 18.70]. Kimi has the widest score range from 8.70 to 20.80, indicating that its output consistency is the worst. In contrast, the maximum and minimum values of ChatGPT are relatively centered, with small fluctuations and stable performance.
In general, ChatGPT has the best output stability of the four models in English, and the Gemini model is intermediate. The domestic model Kimi has acceptable readability but poor stability, and the Ernie Bot model also has notable extreme values, indicating that its output text is volatile as well.
3.4. Text Readability Influence Factor Analysis
This section uses a neural network model, one type of intelligent algorithm, to analyze the factors influencing the readability of the text generated by the four models and to calculate the relative importance of each input variable. The preceding sections compared the readability of the four models’ text (that is, readability and textual comprehensibility); this section quantifies the factors that drive those readability scores. Few readability studies have examined the words of a text, yet words are the basic unit of a text, and their structure has a direct effect on readability, so studying words and word structure is of practical significance for making text easy to read. Such analysis helps explain the readability of a text from its internal organization. On this basis, this section uses a multilayer perceptron (one of many neural network algorithms) to identify the factors affecting the readability of the generated text and compares the four models. The text computing software used above generates the FRES and FKGL indicators and, by the same calculation, also provides the remaining indicators [9]: the AWPS, ASPW, RL, word, and sentence indicators.
3.4.1. Neural Network Algorithm Construction
This paper uses a multilayer perceptron (MLP), one of many neural network algorithms, as the research tool for analyzing the factors that influence text readability. An MLP is a typical feedforward neural network [29] and consists mainly of an input layer, hidden layers, and an output layer. It learns feature representations of the data through a fully connected (FC) structure and nonlinear activation functions, and it is characterized by full connectivity, nonlinear activation, error backpropagation, and multilayer feature extraction [30]. MLPs can be applied to text classification. The specific structure of the neural network model is shown in Appendix A. Five input layer variables (X) are used in our study: AWPS, ASPW, sentences, words, and SimHash. The output layer variable (Y) is RL, the reading level. ASPW indicates the average number of syllables per word, and AWPS indicates the average sentence length in words.
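The variable importance analysis in this study was run in SPSS 27 (Section 3.4.2); the sketch below is only a rough open-source equivalent, using scikit-learn’s MLP regressor with permutation importance as a stand-in for the importance values that SPSS reports, and hypothetical placeholder data instead of the study’s per-question indicators.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

features = ["AWPS", "ASPW", "sentences", "words", "SimHash"]

# Hypothetical indicator matrix for one model's 53 answers (placeholder values)
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 5))
y = 10 + 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=53)  # RL, the output variable

X_scaled = StandardScaler().fit_transform(X)
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X_scaled, y)

# Relative importance of each input variable for predicting RL
result = permutation_importance(mlp, X_scaled, y, n_repeats=30, random_state=0)
for name, score in sorted(zip(features, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```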
3.4.2. Influence Factor Analysis
Building on the above, a neural network model is trained to predict the output variable; its influence paths are obtained, the relative importance of the input variables for the output variable is analyzed, and objective reference data are thus provided for the quantitative comparison of the models [31]. In this study, we used SPSS 27 software to conduct the neural network calculations. The neural network model used in this article is a multilayer perceptron (MLP) model. After the model runs, the software directly outputs the importance of each input variable, which indicates its relative importance. The variable importance bar diagram corresponding to each model is shown in
Figure 4.
As can be seen from
Figure 4, the readability of the text generated by the ChatGPT model is most affected by ASPW, followed by AWPS, while the variable SimHash has lesser importance, indicating that the readability of the generated text is less affected by the accuracy of the text. The readability of the text generated by the Gemini model is most affected by ASPW, while the SimHash effect is slightly higher than that of ChatGPT, indicating that the text generated by Gemini is more strongly affected by the accuracy of the text. Among the domestic models, the readability of Kimi-generated text is most affected by ASPW, which remains the main influencing factor. AWPS is the most important factor for the readability of Ernie Bot’s generated text, followed by the sentence factor, and it can be seen from the graph that the variable AWPS is much more important than the variable sentence. Other variables have a lesser impact.
In conclusion, the language structure indicators, sentence length (AWPS) and syllables per word (ASPW), are the most important factors affecting the readability of the generated text across the models. This shows that the structure of sentences and words is the most critical driver of generated text readability. The SimHash factor has different degrees of influence in different models. The readability of the text generated by the foreign models ChatGPT and Gemini is less affected by SimHash, indicating that the foreign models show a higher degree of innovation in generating text and that their generated text has little correlation with the CDC standard text. However, the domestic models Kimi and Ernie Bot are more sensitive to SimHash, indicating that the text generated by the domestic models is more similar to the CDC standard text, which also suggests that text accuracy affects the readability of the text.
3.5. Text Content Semantic Comparison
Text analysis is the process of extracting useful information from text by means of mathematical statistics or related algorithms [32]. In line with the purpose of this paper and the characteristics of the answer texts, lexical analysis is used to extract the key words. Then, the theme of each text is induced from its high-frequency words so that the core idea of each text can be understood. Several kinds of text topic-mining algorithms exist, but the amount of text in the present study is small; we therefore use an inductive summary method suited to the characteristics of the research text [33].
3.5.1. Lexical Frequency Statistics
Based on the word frequency data of the COVID-19 text generated by the four large models, word frequency statistics were obtained. The bar chart of the top 10 words and the word frequency ranking chart are shown in
Figure 5. This paper analyzes the following aspects.
By observing
Figure 5, we can see that “COVID-19” was the most frequently used word and that “test” also appears with high frequency in all four big models, indicating that the models pay the most attention to COVID-19 and testing. ChatGPT had the highest frequency of use of “COVID-19” (556 times) and Gemini had the lowest (202 times), suggesting that ChatGPT used the term most intensively, while Gemini may have more frequently used alternative expressions. The high frequency of “Risk” and “test” in the models shows that risk assessment and testing are key topics in the models’ responses. The frequency of a word can indicate its importance in the text.
3.5.2. Topic Mining
In this paper, the topics of the text generated by the various models are summarized by statistical induction. A word frequency table of the text generated by the four models is shown in
Table 1.
Statistical induction yields the research themes of the four models. Through the analysis of
Table 1, the following conclusions can be drawn: among the top 10 word frequency rankings, the words “Patient”, “Infect”, and “Vaccine” appear more in the text generated by ChatGPT, indicating that the model pays more attention to epidemiological knowledge. The words “Medical”, “Healthcare”, and “Advice” appear more frequently in the text generated by the model Gemini, indicating that the field of medical care is more emphasized in the text generated by that model. The words “Risk”, “Symptom”, and “Recommend” appear more often in the domestic model Kimi, indicating that the text generated by that model pays more attention to multidisciplinary background information or early risk prevention. The words “Patient”, “Infect”, and “Healthcare” appear more often in the text generated by the model Ernie Bot, indicating that that model pays more attention to medical clinical topics. In conclusion, the domestic models (Kimi, Ernie Bot) are more suitable for clinical testing, medical system research, health consultation, etc. The foreign models (ChatGPT, Gemini) were more focused on epidemiological analysis, vaccine research, and disease severity assessment. Generally speaking, the domestic model is more applicable, and the foreign model is more professional.
In order to clearly and intuitively show and compare the differences in word frequency features of different models (ChatGPT, Gemini, Kimi, and Ernie Bot) when generating text, the mean value, standard deviation, and
p-value of the word frequency of each model when compared with ChatGPT are calculated according to
Table 1. The calculation results are shown in
Table 2 below.
From
Table 2, we can see the word frequency analysis and significance tests for the different models. Word frequency comparison: the mean word frequency, arranged from high to low, is ChatGPT, Kimi, Ernie Bot, and Gemini. Standard deviation comparison: The standard deviation reflects the degree of dispersion of the word frequency data, that is, the fluctuation range of the word frequencies. ChatGPT (117.46) shows a large difference between high-frequency and low-frequency words. Gemini (43.52) shows that its word frequency data are more concentrated, with small fluctuations. Kimi (51.19) shows that its word frequency distribution is also relatively concentrated. Ernie Bot (62.49) shows that its word frequency distribution has some fluctuation but is relatively concentrated overall.
p-value (vs. ChatGPT) comparison: The
p-value is used for statistical testing to determine whether there is a significant difference in the word frequency distribution of the two models. The ChatGPT
p-value is “-”, indicating that it is a benchmark model. Gemini (0.0006), Kimi (0.006), and Ernie Bot (0.003) are all less than 0.05 (the usual significance level), indicating that the word frequency distribution of Gemini, Kimi, and Ernie Bot is significantly different from that of ChatGPT.
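The exact statistical test behind the p-values in Table 2 is not specified in the text; the sketch below assumes a paired t-test over the shared top-10 word frequencies of two models, and the frequency lists shown are illustrative placeholders rather than the study’s data.

```python
from scipy.stats import ttest_rel

# Illustrative top-10 word frequencies for two models (placeholder values)
chatgpt_freqs = [556, 310, 250, 190, 150, 120, 110, 95, 80, 70]
gemini_freqs = [202, 150, 140, 120, 100, 90, 85, 80, 70, 60]

# Paired t-test: are the two frequency profiles significantly different?
t_stat, p_value = ttest_rel(chatgpt_freqs, gemini_freqs)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```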
3.6. Human Validation of Model Responses
In the preceding sections, all evaluation indicators were calculated using computer software based on the corresponding formulas, representing objective machine-generated scores. However, the purpose of promoting knowledge on the prevention and control of COVID-19 is to enable the public to understand and master specific measures. Therefore, in addition to machine evaluation, human subjective assessment is required to verify whether the AI-generated text actually stands up to public scrutiny. The human evaluation includes five criteria: readability (Is the response easy to read?), comprehensibility (Is the information easy to understand?), usefulness (Is it helpful for promoting healthy behaviors?), credibility (Do you trust the response?), and consistency (Is it consistent with CDC’s official guidance?).
3.6.1. Survey Design and Methods
To conduct human validation, five questions were extracted from the 53-question CDC COVID-19 prevention guideline manual. As the purpose of this phase was to compare automated scoring with human judgment, selection was guided by textual properties, including readability, consistency, and ease of understanding. The chosen questions are shown in
Table 3, with the reasons for selection detailed in the right-hand column.
A questionnaire survey was employed to perform the human validation component of this study. The design involved three steps:
- (1)
Five questions from
Table 3 were selected, and the corresponding CDC responses and AI-generated answers from the four language models were compiled.
- (2)
Evaluation criteria included five dimensions—readability, comprehensibility, usefulness, credibility, and consistency. Each response was rated on a five-point Likert scale, with 1 indicating “strongly disagree” and 5 indicating “strongly agree”.
- (3)
The questionnaire was distributed and collected through an online survey tool.
Participants were primarily undergraduate students, with a smaller number of vocational college and graduate students. Given that the CDC’s COVID-19 prevention guide is designed for the general public, university students were deemed appropriate representatives for evaluating the content.
3.6.2. Analysis of Subjective Assessment
The questionnaire survey was conducted with a relatively small sample size, resulting in 54 valid responses. The respondents included 19 males and 35 females. Regarding educational attainment, 7 were from junior colleges, 43 were from undergraduate programs, and 4 were graduate students.
Using a weighted averaging approach, the scores for each of the five evaluation criteria were computed for all AI models. The final outcomes are summarized in
Table 4.
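As a minimal sketch of the weighted averaging step, assuming a hypothetical distribution of the 54 respondents’ ratings on the 1–5 Likert scale for one model and one criterion (the counts are illustrative, not the actual survey data):

```python
# Hypothetical distribution of 54 ratings (1 = strongly disagree ... 5 = strongly agree)
rating_counts = {1: 0, 2: 1, 3: 8, 4: 25, 5: 20}

total_responses = sum(rating_counts.values())
weighted_mean = sum(rating * n for rating, n in rating_counts.items()) / total_responses
print(round(weighted_mean, 2))  # weighted average score for this criterion
```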
- (1)
Overall performance
Based on the results in
Table 4, all four models received scores between 3.67 and 4.35 across the five indicators, indicating generally high content quality. Nevertheless, noticeable differences emerged, particularly in the dimensions of comprehensibility and credibility.
- (2)
Comparison and summary between foreign models and Chinese models
The performance of text generation by the international models (ChatGPT and Gemini) and the domestic models (Kimi and Ernie Bot) is compared as follows:
In terms of readability, the scores of ChatGPT (4.19), Gemini (4.20), and Kimi (4.20) are similar, while Ernie Bot scored slightly lower (4.09), which shows that the readability of text generated by different models is good.
Regarding comprehensibility, ChatGPT achieved the highest score (4.35), outperforming the others. Chinese models showed relatively stable performance—Kimi at 4.00 and Ernie Bot at 4.15—while Gemini scored the lowest (3.96).
In terms of consistency, ChatGPT (4.15) and Kimi (4.00) scored high, indicating that they are closer to CDC information, whereas Ernie Bot (3.96) and Gemini (3.67) showed lower consistency, indicating deviation from authoritative sources.
In terms of usefulness, three models (ChatGPT, Gemini, and Kimi) scored the same (4.07), with Ernie Bot slightly behind (3.96), reflecting minor variation in their potential to support healthy behavior.
In credibility, ChatGPT again led (4.09), while Gemini and Ernie Bot tied at 3.96. Kimi received the lowest score (3.89), which shows that the credibility of the model generation needs to be improved.
In general, international models excelled in accuracy and credibility, while domestic models demonstrated strength in readability and alignment. These differences may stem from disparities in training corpora, optimization objectives, and cultural–linguistic orientation.
3.6.3. Comparison of Human- and Machine-Based Evaluations
To assess the alignment between manual and algorithmic evaluation methods, machine metrics (SimHash, FRES, FKGL) were paired with their corresponding human evaluation indicators, namely consistency, readability, and comprehensibility.
- (1)
Consistency: Comparing Human Scores with SimHash Similarity
Human ratings for consistency were highest for ChatGPT (4.15) and Kimi (4.00), moderate for Ernie Bot (3.96), and lowest for Gemini (3.67). The SimHash similarity scores reflected a comparable pattern: Kimi scored the highest (0.65), ChatGPT and Ernie Bot followed closely (0.63), and Gemini scored the lowest (0.59).
This strong agreement between human and machine scores indicates that SimHash is a useful proxy for human judgments of textual consistency. It is particularly effective in measuring how closely AI-generated responses align semantically with reference standards, such as those issued by the CDC.
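One way to quantify the agreement described above is a rank correlation between the human consistency ratings and the mean SimHash scores already reported; this computation was not part of the original analysis and is shown only as an illustrative check.

```python
from scipy.stats import spearmanr

models = ["ChatGPT", "Gemini", "Kimi", "Ernie Bot"]
human_consistency = [4.15, 3.67, 4.00, 3.96]  # human ratings from Section 3.6
simhash_mean = [0.63, 0.59, 0.65, 0.63]       # mean SimHash scores from Section 3.1

# Spearman rank correlation between the two score sets
rho, p = spearmanr(human_consistency, simhash_mean)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```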
- (2)
Readability: Comparison Between Human Ratings and FRES Scores
In the dimension of readability, human scores were highest for both Kimi and Gemini (4.20), followed by ChatGPT (4.19) and Ernie Bot (4.09). Turning to the Flesch Reading Ease Score (FRES), Kimi also achieved the highest score (28.93), indicating superior textual readability. ChatGPT (25.59) and Ernie Bot (26.85) had comparable FRES values, while Gemini scored the lowest (18.59), suggesting that its sentence structure may be more complex or less accessible.
However, no clear parallel trend was observed between human ratings and FRES scores. Thus, while FRES may serve as a supplementary indicator of readability, its use should be combined with broader user-centered evaluations in practical applications.
- (3)
Comprehensibility: Human Ratings and FKGL Scores
In the dimension of comprehensibility, ChatGPT received the highest human rating (4.35), indicating that its responses were more easily understood by users. This was followed by Ernie Bot (4.15) and Kimi (4.00), while Gemini scored the lowest (3.96). These results suggest that ChatGPT’s textual organization better aligns with users’ comprehension expectations.
In terms of the Flesch–Kincaid grade level (FKGL), ChatGPT also demonstrated the lowest score (13.08), indicating the lowest required reading level. By contrast, Kimi (13.70) and Gemini (13.41) had higher FKGL values, with Ernie Bot in between (13.55). As higher FKGL scores denote increased reading difficulty, ChatGPT’s consistent performance across both human and machine evaluations confirms its relatively superior comprehensibility.
- (4)
Usefulness: Evaluated Solely by Humans
The usefulness metric assesses whether AI-generated content is effective in promoting pandemic prevention and control. This indicator cannot currently be measured through automated scoring. Human evaluation showed that ChatGPT, Gemini, and Kimi all received identical scores of 4.07, with Ernie Bot slightly lower at 3.96. These results suggest that, in general, the public believes that generative AI can output practical text.
- (5)
Credibility: Perceptions of Trustworthiness
Credibility was used to evaluate how much trust users placed in the AI responses. ChatGPT attained the highest credibility score (4.09), followed by Gemini and Ernie Bot (3.96), and Kimi with the lowest score (3.89). This suggests that ChatGPT’s responses were more convincing. Similar to usefulness, credibility remains a purely human-assessed construct due to the absence of validated machine evaluation metrics.
Integrated Summary of Human and Machine Evaluation Comparisons: This study compared human and machine-based evaluations across five key indicators and found varying degrees of alignment. SimHash and FKGL showed strong consistency with human assessment trends. In contrast, FRES scores diverged significantly from human judgments, indicating that machine-based assessments can only partially reflect the actual performance of AI-generated content.
Notably, two critical dimensions—usefulness and credibility—are inherently based on human perception and cannot currently be quantified through automated means. In summary, while machine algorithms can contribute to the quantification of certain evaluation dimensions, human involvement remains essential to ensure a comprehensive and balanced assessment of generative AI output. These human subjective evaluations partially reflect the users’ psychological feelings and self-perception, providing indirect insights into the dissemination of health knowledge.
4. Discussion
This paper studies the performance of generative artificial intelligence models in providing knowledge and answers regarding infectious diseases. The specific research method was to input the 53 questions from the U.S. CDC COVID-19 prevention and control knowledge guide into the four models in English, extract the English answer text of each model, and evaluate the performance of the four models in generating English text by calculating relevant indicators and modeling the outputs. This study systematically compares the foreign models ChatGPT and Gemini with the domestic models Kimi and Ernie Bot in terms of text generation, the ease with which the text can be read and understood, and the semantic content. Furthermore, this study discusses the influence of generative artificial intelligence models on the dissemination of public health information. A concrete, in-depth discussion is carried out below, and our findings are presented.
In this paper, the accuracy of the artificial intelligence models is compared based on their SimHash similarity scores. The results show that the text generated by Kimi is the most similar to the standard answers provided by the CDC, with ChatGPT and Ernie Bot close behind and Gemini slightly lower. In terms of stability, ChatGPT’s similarity scores vary the least across questions, whereas the similarity of the text generated by the domestic models (Kimi and Ernie Bot) to the CDC standard text is more volatile and less stable.
Although SimHash provides a scalable computational method to evaluate semantic similarity, it cannot fully capture factual correctness from the perspective of human readers. This paper uses machine judgment to obtain objective scores; to compensate for its shortcomings, a questionnaire method is introduced. By combining manual scoring with machine scoring and comparing the two, the shortcomings of machine scoring can be assessed. The manual judgment results of this paper show that the machine scoring is highly consistent with the manual scoring ranking. This indicates that, for the dataset used in this paper, SimHash can serve as an effective automated indicator for the “consistency” evaluation dimension: it can reliably judge the semantic proximity between the text generated by an AI model and the authoritative answer. The research in this paper also shows that, in key areas such as healthcare, human judgment is still needed to supplement the assessment of accuracy.
Although the dataset used in this study is relatively small and standardized due to time and resource constraints, the research framework was further strengthened and verified by incorporating a questionnaire method as a manual verification step. Comparing the manual scoring results with machine-calculated indicators such as SimHash revealed consistent ranking trends, especially in the “consistency” and “accuracy” dimensions. This consistency supports the effectiveness and robustness of the evaluation framework even with limited data.
The readability of the text generated by the models is measured with the Flesch Reading Ease Score (FRES index). The FRES index reveals differences between the Chinese and foreign models in the readability of the text that they produce. The empirical results show that the FRES indexes of the domestic models (Kimi and Ernie Bot) are higher, indicating that the domestic models produce text that is easier to read, with simpler wording and a more popular style of language. In comparison, the FRES values of the foreign models (ChatGPT and Gemini) are lower, indicating that their generated text may be more professional and include words that are less frequently used by most people. However, the foreign models (ChatGPT and Gemini) generate text that is logical, rigorous, and suitable for professional reading. When examining how easily text can be read, the dissemination of public health information should focus on an organic combination of professionalism and accessibility.
Text comprehensibility is measured with the FKGL metric, another readability indicator. The empirical results show that the FKGL values of all four models are above 12, corresponding to a higher-education reading level in the United States, which suggests that the generated text is best suited to readers with higher levels of education; the values of the domestic models (13.55 and 13.70) are slightly higher than those of the foreign models (13.08 and 13.41). The results of this study indicate that the future training of generative artificial intelligence needs to take comprehensibility into account in order to provide timely, accurate, and easy-to-understand professional text for the prevention and control of major infectious diseases.
In addition to the accuracy, readability, and comprehensibility of the four models, this work also used an artificial neural network model (a multilayer perceptron) to study the factors affecting text readability. The empirical results show that the readability of the foreign models’ generated text depends mainly on the complexity of the language structure, especially the number of syllables per word (ASPW) and the sentence length (AWPS). The domestic models also emphasize the number of sentences in the text, reflecting stronger localization of language. The linguistic differences between the texts of the domestic and international models embody the different characteristics of the Chinese and English languages.
The results of topic mining show that the topic content of the foreign models (ChatGPT and Gemini) focuses on “vaccination”, “virus mutation”, and “protection suggestions”, while domestic models (Kimi and Ernie Bot) mainly focus on practical knowledge such as “symptom recognition” and “test suggestions”. In general, the textual subject content of the Chinese and English models, at home and abroad, reflects the cultural and traditional differences in the training data behind them.
To verify the accuracy of AI-generated content and assess the feasibility of the research methodology proposed in this study, it is essential to incorporate manual evaluations that subjectively assess the accuracy of the generated content. This study adopts a questionnaire-based approach for manual assessment. The empirical results indicate that machine-based metrics, such as SimHash and the Flesch–Kincaid grade level (FKGL), closely align with the trends observed in manual evaluations, suggesting a degree of substitutability and reference value. However, the Flesch Reading Ease Score (FRES) shows notable discrepancies compared to manual judgments, highlighting that automated scoring cannot fully replace human assessment.
Human Evaluation as an Indicator of Potential Risk of Misinformation: This paper implements human subjective scoring by adopting a questionnaire method. In this process, in addition to providing subjective assessments of readability and consistency, human judges also provide important insights into the accuracy and credibility of AI-generated content. In some cases, participants rated certain responses as semantically similar but potentially unreliable. These issues show that AI models may generate content that seems reasonable but is actually misleading. This highlights the importance of incorporating human judgment into the evaluation process, as machine-based indicators alone (such as SimHash) may not fully capture factual correctness. Mitigating the risk of misinformation is a further reason for incorporating human judgment in this paper.
5. Conclusions
This study employs the COVID-19 prevention and control Q&A content issued by the U.S. Centers for Disease Control and Prevention as a benchmark to systematically evaluate the performance differences among four mainstream generative artificial intelligence (AI) models—ChatGPT, Gemini, Kimi, and Ernie Bot—both domestic and international. The evaluation focuses on four dimensions: text generation capability, understandability, readability, and semantic content. To assess the validity of machine-generated scoring metrics, a questionnaire-based method was introduced to obtain subjective manual evaluations of AI-generated content, allowing for a comparison between human and machine assessments. Empirical findings reveal that the international models, ChatGPT and Gemini, demonstrate superior accuracy and professionalism; however, their generated texts tend to be more complex. In contrast, the domestic models, Kimi and Ernie Bot, produce more accessible language that is better suited for public health communication, though the level of specialization in their content requires improvement. Furthermore, significant differences were observed between domestic and international models in terms of language generation strategies, audience adaptability, and semantic coverage. Variations in language style and communication preferences were also noted across the models. The study confirms that while machine scoring offers partial insight into the performance of AI-generated texts, it cannot fully substitute for human evaluation. According to these findings, this study recommends that future AI-generated content for infectious disease communication should place greater emphasis on the knowledge base and comprehension level of the general public. Striking a balance between professionalism and accessibility in AI training data is essential to enhance the effectiveness and accuracy of public health knowledge dissemination.
While this study focuses on COVID-19-related disease prevention and control question-answering systems, its research framework—including machine-based metrics (e.g., SimHash, FKGL) and human-judged criteria (e.g., credibility, practicality)—is not disease-specific. This approach can be easily applied to evaluate the performance of generative AI models in other infectious diseases (e.g., influenza, monkeypox, tuberculosis, etc.). Given the growing role of AI in public health communication, evaluating its cross-disease adaptability is an important future direction. The results of this study lay the foundation for such extended research and provide a new evaluation paradigm for related research.
Through the empirical research in this article, it can be found that generative artificial intelligence has a good application prospect in the medical field, especially in the dissemination of knowledge on infectious disease prevention and control. However, when comparing the scoring results through machine judgment and manual judgment, it is found that relying solely on generative artificial intelligence is incomplete, and the judgment of knowledge still needs to rely on the experience and knowledge of professional medical staff, especially experts. In addition, the use of generative artificial intelligence technology in the field of public health should also follow ethical principles—transparency, accountability, and manual supervision—to prevent the abuse of artificial intelligence. Future research work will integrate multidisciplinary knowledge and explore the more efficient integration of generative artificial intelligence technology and public health knowledge dissemination, so as to contribute to the prevention and control of major infectious diseases.
In this paper, we studied the application of generative artificial intelligence in providing general knowledge and answers about the prevention and control of infectious diseases. In addition, we implemented three aspects of innovation: (1) Theoretical innovation. By introducing generative artificial intelligence models as an analysis tool, the research boundary of public health information dissemination is expanded. Traditional public health communication is concentrated in the media and government campaigns, and generative artificial intelligence models have rarely been applied to the study of infectious disease knowledge. This study incorporates generative artificial intelligence into a question-and-answer framework for infectious disease knowledge for the first time, enriching the theoretical map of infectious disease communication. The theoretical innovation is also reflected in the selection of multiple indexes of generated text performance: accuracy, readability, comprehensibility, readability influencing factors, and semantic content, five indexes in total, which overcomes the limitations of traditional single-index evaluation. (2) Methodological innovation. The innovation of this paper is embodied in a multidimensional analysis framework of text accuracy and readability indexes; by introducing a neural network model, the factors influencing the readability of generated text are identified, and a mechanism for comparing text quality is produced. Topic mining further enhances the interpretability of the text. (3) Application innovation. The application innovation of this study is mainly embodied in the comparative perspective on major model performance. We compare the output performance of foreign and domestic models under the same task and reveal the differences between Chinese and foreign models in terms of semantic accuracy, accessibility, and professionalism, which provides a valuable reference for the development of artificial intelligence.
In general, the results of this study can directly serve the healthcare system and disease prevention and control departments. When AI is used, or its output is supervised, for health knowledge dissemination, this study offers technical advice for evaluating the quality of the information, thereby helping generative artificial intelligence models to better serve the public and contributing to improving the control of major infectious diseases worldwide.
6. Study Limitations
Although this study has carried out an effective exploration of the framework design of model evaluation, empirical data comparison, and multidimensional index extraction, it still has limitations that need to be addressed in future studies.
The uniqueness of the standard answer limits the assessment of the models’ overall performance. In this study, the answers to the 53 COVID-19-related questions provided by the CDC were used as the standard answers against which the responses of the artificial intelligence models were compared. However, generative artificial intelligence models produce diverse answers, and measuring similarity to a single reference answer cannot fully capture the diversity of the generated text or the actual professional level of the models.
One limitation of this study is that it relies on the English version of the COVID-19 guidelines published by the Centers for Disease Control and Prevention (CDC) in 2020. While this material provides a standardized and authoritative benchmark, it does not take multilingual public health communication into account. In addition to the CDC, commonly used multilingual COVID-19 prevention and control texts also come from the World Health Organization (WHO), the European Centre for Disease Prevention and Control (ECDC), the official websites of national ministries of health, and other institutions, covering more than 80 languages and topics such as vaccination and protection guidelines. Future research will use updated, multilingual, and culturally diverse corpora to evaluate the performance of AI models.
Another limitation of this study is the iterative nature of large language models (LLMs). Most large models, especially the mainstream models studied in this paper (ChatGPT, Gemini, Kimi, and Ernie Bot), constantly update their versions and modify and improve their algorithms, yet none of them provide timely update logs to users. Therefore, software iteration may cause the same prompt to produce different responses at different times, which poses a challenge to the reproducibility of the research. Although we recorded the access time and conditions during the test, this time dependence needs to be considered when interpreting the conclusions. Future research will integrate multiple versions, multiple time points, and multiple acquisitions for comparison to further enhance reproducibility.
The impact of language training background on AI model performance: This study highlights the issue of differences in the training environments of generative AI models. ChatGPT and Gemini are trained primarily on English-language datasets, whereas Kimi and Ernie Bot are trained on a mix of Chinese and English corpora. Since this study uses the English-language CDC prevention and control guidelines as the evaluation benchmark, and all prompts and model responses are in English, the international models (ChatGPT and Gemini) possess an inherent language advantage in output performance. Previous research has demonstrated that the performance of generative AI models can vary significantly depending on the language in which they are trained. For instance, Luo et al. reported that DeepSeek outperformed ChatGPT in tasks conducted in Chinese, showing that model performance is influenced not only by the task itself but also by the language context [
34]. In light of these findings, future research should explore bilingual or multilingual evaluation frameworks to better understand how language training backgrounds affect model performance across different linguistic settings.
The topic analysis did not include sentiment classification of the text. Although this study carried out text topic analysis, it did not perform sentiment analysis, yet in the actual dissemination of medical information, the emotional tone of text and language is often associated with the efficiency of transmission.
The limited size and scope of the knowledge domain is a further shortcoming of this study. This study was based on the 53 COVID-19-related questions of the CDC, which have certain limitations: the knowledge domain is small and does not cover the wider range of infectious diseases. In future research, we will attempt to include information about a wider range of diseases in order to better support the prevention and control of major infectious diseases.
This study is a practical exploration, providing reference material for generative artificial intelligence technology in the prevention and control of major infectious diseases. Although this study did not directly evaluate clinical scenarios, the manual evaluation questionnaire included the two indicators of usefulness and credibility, which provide preliminary insights into the applicability of answers generated by artificial intelligence in public health education. One future research direction is to cooperate with medical staff and patients to conduct clinical practice research together and so provide more substantial clinical verification for the corresponding findings.
Limitations of generalizability and potential for misinformation: Although the dataset used in this study is the official text of an authoritative medical administration agency, it is relatively small in scale, and whether the results of this study can be generalized to more complex multilingual datasets or larger datasets remains to be verified. In addition, the empirical research in this article, especially the combination of manual and machine judgments, found that the responses generated by generative AI contained unreliable information, which reflects a lack of robustness in the content generated by generative AI. This also means that when larger datasets are used for related research, AI is likely to generate more misinformation. Based on this, one future research direction is to use a large-scale, complex, and multilingual corpus to systematically and comprehensively evaluate the robustness of generative AI output.