2. Materials and Methods
This validation study evaluated the 4A program, an AI-driven cognitive assessment tool for nursing education. The validation process employed cross-sectional data and statistical methods to establish the tool’s reliability, accuracy, and generalizability in educational contexts. The research was conducted in two main phases, initial validation and external validation, each designed to rigorously assess different capabilities of the 4A program through agreement testing, accuracy testing, and external validation.
2.1. Study Sample and Setting
Fourth-year nursing students were recruited from a nursing education institution in Northern Thailand. The institution predominantly provides an undergraduate program for around 960 nursing students. The total sample comprised 308 participants recruited in two rounds: the first round included 170 nursing students, and the second round included 138 participants.
2.1.1. Inclusion Criteria
Participants were eligible for this study if they met the following criteria: (1) Academic standing: fourth-year nursing students who had completed all required courses in maternal and child health nursing. (2) Examination preparation: students actively preparing for the national nursing licensure examination administered by the Thailand Nursing and Midwifery Council. (3) Geographical location: enrollment in a nursing educational institute located in northern Thailand. (4) Technological requirements: access to a functional internet-connected device equipped with a microphone capable of recording sound, ensuring effective engagement with the 4A program for online cognitive skills assessments.
2.1.2. Exclusion Criteria
Participants were excluded from the study under the following circumstances: (1) Power outage: individuals who experienced an electrical power outage during the testing session, preventing completion of the assessment. (2) Internet connectivity issues: participants who lost internet connectivity during the testing period, preventing effective participation in the online assessment.
2.1.3. Sample Size: Initial Validation Phase
The sample size for the initial validation phase, encompassing both agreement and accuracy testing, was determined using power analysis. The parameters included a moderate effect size of 0.5, consistent with findings from prior work using virtual nurse labs, where similar outcomes were observed [18], a power (1 − β) of 0.80, and a significance level (α) of 0.05. Entered into G*Power 3.1.9.4 software, these parameters indicated a required sample size of 52 participants for robust testing of agreement and accuracy. This sample size ensures the reliability of comparisons between the assessments made by the 4A program and human experts, as well as the evaluation of the 4A program’s accuracy and precision.
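A calculation of this kind can be approximated in code. The sketch below uses Python’s statsmodels power module as a stand-in for G*Power; because the specific test family and tails configured in G*Power were not reported, the independent-samples t-test shown here is an assumption, and the value it returns may differ from the 52 participants reported above.

```python
# Minimal power-analysis sketch (statsmodels as a stand-in for G*Power 3.1.9.4).
# Assumption: an independent-samples t-test; the study does not report the exact
# test family or tails, so the resulting n may differ from the reported 52.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_required = analysis.solve_power(
    effect_size=0.5,        # moderate effect size (Cohen's d)
    alpha=0.05,             # significance level
    power=0.80,             # 1 - beta
    alternative="two-sided",
)
print(f"Required sample size per group: {n_required:.1f}")
```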
2.1.4. Sample Size: External Validation Phase
For the external validation phase, the 4A system’s outputs were compared with the national nursing examination results. Using the rule of thumb for logistic regression, at least 20 events per variable are required for a robust model. Based on the average pass rate of 82.29% (failure rate 17.71%) reported in the Nursing and Midwifery Council’s national examination data over the past three years (personal communication, December 2022), the required sample size was calculated as 20/0.177 ≈ 113 participants. Accounting for a 20% attrition rate, the total number of participants needed was approximately 138.
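For clarity, the events-per-variable calculation above can be written out explicitly; inflating the result to allow for roughly 20% attrition then yields the recruitment target of approximately 138 participants reported in the text.

```latex
n_{\text{events}} \;=\; \frac{\text{events per variable}}{P(\text{failure})} \;=\; \frac{20}{0.1771} \;\approx\; 113
```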
2.2. Instrumentation
The research instruments utilized in this study consisted of two primary components designed to align with the learning outcomes of the nursing curriculum, specifically targeting maternal and child health nursing practice: (1) Scenario-based assessments: a series of three detailed case studies requiring audio responses from participants covering nursing diagnosis, signs or symptoms, and nursing care. These scenarios were crafted to evaluate the students’ critical thinking, problem-solving, and decision-making skills in maternal and child health contexts. Each scenario prompted the student to articulate responses reflecting their clinical reasoning and ability to handle complex healthcare situations effectively. The maximum possible summed score for the three scenarios was 38 points. (2) Data collection questionnaires, composed of two parts: (2.1) personal information: gender, hometown, primary language used, main device used for testing, and grade point average; (2.2) perception of cognitive skill assessment via AI: a questionnaire developed to assess participants’ responses to an AI-assisted answering assessment program, focusing on their perception of cognitive abilities in decision-making, particularly when these abilities are enhanced or evaluated through an AI system. It consists of 10 items measured on a 5-point Likert scale, each rated from 1 (strongly disagree) to 5 (strongly agree), resulting in a total possible score of 50.
Interpretation of Scores:
41–50: Highly positive perception—Participants strongly agree that the AI system effectively enhances and evaluates their cognitive skills.
31–40: Positive perception—Participants agree that the AI system positively contributes to their cognitive skills, though there may be minor reservations.
21–30: Neutral perception—Participants have mixed feelings about the AI system’s impact on their cognitive skills.
11–20: Negative perception—Participants disagree that the AI system significantly enhances or evaluates their cognitive skills.
1–10: Highly negative perception—Participants strongly disagree and perceive the AI system as ineffective in supporting their cognitive skills.
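The interpretation bands above can be expressed as a simple scoring helper. The Python sketch below is illustrative only; the function name and structure are not part of the original instrument.

```python
# Illustrative mapping of the perception questionnaire total score (10 items,
# 1-5 Likert scale, summed range 10-50) to the interpretation bands listed above.
def interpret_perception_score(total: int) -> str:
    if not 1 <= total <= 50:
        raise ValueError("Total score must be between 1 and 50.")
    if total >= 41:
        return "Highly positive perception"
    if total >= 31:
        return "Positive perception"
    if total >= 21:
        return "Neutral perception"
    if total >= 11:
        return "Negative perception"
    return "Highly negative perception"

print(interpret_perception_score(44))  # -> "Highly positive perception"
```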
2.3. Validation and Reliability
The 4A program underwent both alpha and beta testing phases. During the alpha testing phase, research staff conducted tests in controlled environments to detect and address technical issues before exposing the program to users. The beta testing phase was extended to a small group of 15 recently graduated nursing students to evaluate the program’s usability and performance in real-world scenarios. Feedback gathered during this phase was used to further optimize the program, ensuring that the final version of the 4A program met user expectations and operated effectively. The research instruments, namely the scenario-based assessments and the perception of cognitive skill assessment via AI, were rigorously reviewed and validated by a panel of six experts in the field: two in nursing education, two specializing in obstetric nursing, one midwifery expert, and one expert in AI-generated content. The content validity index (CVI) was 0.98 for both the scenario-based assessments and the perception of cognitive skill assessment via 4A. The perception of cognitive skill assessment via 4A was evaluated for internal consistency using Cronbach’s alpha, which yielded a value of 0.96.
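For readers who wish to see how the internal-consistency coefficient reported above is computed, the following Python sketch implements Cronbach’s alpha from a participants-by-items response matrix; it is a generic illustration with simulated data, not the analysis script used in the study.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a participants-by-items matrix of Likert ratings."""
    k = item_scores.shape[1]                          # number of items (10 here)
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example with simulated responses from 52 participants on 10 five-point items.
rng = np.random.default_rng(0)
simulated = rng.integers(1, 6, size=(52, 10))
print(round(cronbach_alpha(simulated), 2))
```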
2.4. Data Collection
Data collection was conducted in two distinct phases to validate the 4A program and evaluate its efficacy in a real-world educational setting. Phase One (2022) was the initial validation. A total of 170 nursing students participated in this phase. Each participant completed an online cognitive assessment testing cognitive abilities in scenarios relevant to maternal and child health. Participants were allocated 15 minutes per scenario, with three scenarios totaling 45 minutes. The audio responses collected were used to train the 4A program, enhancing its capability to accurately convert spoken words into written text. The transcribed texts were analyzed using natural language processing (NLP) techniques against a set of standard answers to evaluate agreement and accuracy in the assessment responses.
The 4A program in this study utilizes a combination of computer algorithms and artificial intelligence (AI) to evaluate students’ audio responses in online cognitive skill assessments. The text analysis component of the program is powered by the ChatGPT-4o API, a natural language processing (NLP) model designed to process and evaluate textual data. The process involves two main steps: (1) speech-to-text conversion (STT), in which the program converts spoken responses into written text using automatic speech recognition (ASR); a dataset of 170 transcribed audio responses was used to fine-tune the accuracy of the model, ensuring its effectiveness in recognizing technical nursing vocabulary, patient scenarios, and procedural descriptions; and (2) text analysis via NLP, in which the transcribed student responses undergo semantic similarity analysis using the ChatGPT-4o API, which compares each response to predefined model answers. Responses with an overlap score of ≥60% are considered acceptable. This approach enhances objectivity in assessment by leveraging ChatGPT-4o’s contextual understanding and pattern recognition capabilities, aligning with recent advancements in AI-driven educational technologies [19].
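As a rough illustration of the two-step pipeline described above, the sketch below chains an ASR transcription call to a semantic-comparison prompt using the OpenAI Python SDK. The model names, prompt wording, file name, and scoring logic are assumptions for illustration; the actual 4A implementation, its fine-tuned ASR model, and its grading prompts are not reproduced here.

```python
# Illustrative sketch of the STT -> NLP scoring pipeline (not the 4A source code).
# Assumptions: OpenAI Python SDK, "whisper-1" for transcription, "gpt-4o" for
# semantic comparison, and a simple 0-100 similarity prompt with a 60% cutoff.
from openai import OpenAI

client = OpenAI()          # requires OPENAI_API_KEY in the environment
ACCEPT_THRESHOLD = 60      # responses scoring >= 60% overlap are marked acceptable

def transcribe(audio_path: str) -> str:
    """Step 1: speech-to-text conversion of a student's recorded answer."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def score_response(student_text: str, model_answer: str) -> int:
    """Step 2: semantic similarity of the transcript against a predefined model answer."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rate the semantic overlap between a student answer and a "
                        "model answer on a 0-100 scale. Reply with the number only."},
            {"role": "user",
             "content": f"Model answer: {model_answer}\nStudent answer: {student_text}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())

# Example usage (hypothetical file and answer key):
# transcript = transcribe("student_scenario1.mp3")
# acceptable = score_response(transcript, "Risk of postpartum hemorrhage ...") >= ACCEPT_THRESHOLD
```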
2.5. Agreement Between 4A Program and Human Experts
The agreement testing involved comparing the 4A program’s output against the assessments of two human experts, which helped evaluate the reliability of the 4A program’s assessment relative to human judgment. A subset of 52 participants was randomly selected from the 170 participants using the ‘Select Cases’ random sampling feature in SPSS (version 18). Audio clips from all three scenarios for these 52 participants were gathered during the initial testing phase. Two independent experts (referred to as CN and KN) were assigned to listen to and evaluate the selected clips. Each expert independently verified the responses, assessing their accuracy and relevance to the scenarios posed. To measure agreement, the evaluations provided by the 4A program for these clips were compared with the independent evaluations of the two human experts. This comparison aimed to determine the level of concordance among the three assessment sources, focusing on whether the program’s responses aligned with expert judgments.
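Although the agreement analysis was run in SPSS, the same intraclass correlation can be reproduced in Python. The sketch below uses the pingouin package on a long-format table of ratings and is an assumed illustration with placeholder scores, not the study’s analysis script.

```python
# Illustrative ICC computation (the study used SPSS; pingouin is assumed here).
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (participant, rater) pair. Raters would be
# the two human experts (CN, KN) and the 4A program; scores are scenario totals.
ratings = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":       ["CN", "KN", "4A"] * 3,
    "score":       [30, 31, 29, 22, 21, 23, 35, 36, 34],
})

icc = pg.intraclass_corr(data=ratings, targets="participant",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```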
2.6. Accuracy and Precision of the 4A Program
Accuracy refers to how well the 4A program correctly identifies both correct and incorrect responses, providing an overall measure of correctness. Precision, on the other hand, focuses on how well the 4A program identifies correct answers without mistakenly labeling incorrect answers as correct. In simpler terms, accuracy is about overall correctness, while precision is about avoiding false positives [2,20]. Using the same set of 52 clips, the accuracy and precision of the 4A program were assessed based on its classifications compared to those made by the human experts. Accuracy evaluates the overall correctness of the 4A program’s classifications. It is defined as the proportion of correct classifications (both true positives and true negatives) relative to all classifications made. In this context, accuracy measures how effectively the 4A program distinguishes correct from incorrect responses provided by students. Precision focuses on the program’s reliability in identifying correct responses. It is defined as the proportion of true positives (correct answers identified by the program) out of all instances classified as correct by the program. Precision assesses the program’s ability to avoid misclassifying incorrect responses as correct.
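The definitions above correspond to the standard confusion-matrix formulas, sketched below in Python for concreteness; the counts shown are placeholders, not study data.

```python
# Accuracy and precision from confusion-matrix counts (definitions as above).
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of all classifications that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp: int, fp: int) -> float:
    """Proportion of responses labeled 'correct' by the program that truly are."""
    return tp / (tp + fp)

# Placeholder counts for illustration only (not the study's results).
tp, tn, fp, fn = 40, 8, 2, 2
print(f"Accuracy:  {accuracy(tp, tn, fp, fn):.2f}")
print(f"Precision: {precision(tp, fp):.2f}")
```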
2.7. External Validation
Phase Two (2023) involved an external validation process. The primary aim was to externally validate the 4A program by comparing its online cognitive test results with the outcomes of the national nursing licensure examination. The external validation served to confirm the findings from the beta testing and to ensure the program’s effectiveness and generalizability across different groups of users in measuring cognitive ability. In this phase, participants were openly invited to take the test as part of their preparation for the national licensure examination. A total of 138 participants from three nursing institutions in the northern part of Thailand participated in this phase. Each participant completed the online cognitive test and later reported their pass or fail result from the national nursing licensure examination. The outcomes from the 4A program were then compared with the results of their national licensure examination to evaluate the program’s generalizability and predictive validity within a broader educational context.
2.8. Data Analysis
Initial analysis involved computing descriptive statistics to summarize the demographic characteristics of participants. Means, standard deviations, and frequencies were calculated using SPSS version 18 to provide a comprehensive overview of the study population. The intraclass correlation coefficient (ICC) was used to assess the consistency of verification across the different sources of answers: between the two human experts, and between the human experts and the 4A program. Accuracy and precision were evaluated by calculating true positives, true negatives, false positives, and the total number of cases. Accuracy was calculated as (true positives + true negatives) divided by the total number of cases, whereas precision was calculated as true positives divided by (true positives + false positives). McNemar’s test was used to compare the accuracy and precision between the 4A program and the human experts, with the significance level set at p < 0.05. Logistic regression was employed to analyze the capability of the 4A program’s assessments to predict students’ success on the national nursing examination.
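The two inferential analyses named above can be sketched in Python with statsmodels, shown below on placeholder data as an assumed illustration (the study itself used SPSS).

```python
# Illustrative McNemar test and logistic regression (statsmodels; the study used SPSS).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.contingency_tables import mcnemar

# McNemar's test on paired classifications (4A vs. human expert); placeholder 2x2
# table of [[both correct, 4A correct / expert incorrect],
#           [4A incorrect / expert correct, both incorrect]].
table = np.array([[40, 3], [5, 4]])
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi-square = {result.statistic:.2f}, p = {result.pvalue:.3f}")

# Logistic regression: national exam pass/fail (1/0) predicted by the 4A total score.
rng = np.random.default_rng(1)
scores = rng.uniform(10, 38, size=138)                            # placeholder 4A scores
passed = (rng.random(138) < 1 / (1 + np.exp(-0.12 * (scores - 20)))).astype(int)
X = sm.add_constant(scores)
fit = sm.Logit(passed, X).fit(disp=False)
print("Odds ratio per 1-point increase in 4A score:", np.exp(fit.params[1]).round(3))
```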
2.9. Ethical Considerations
This study received prior approval from the Institutional Review Board, ensuring compliance with ethical standards for research involving human participants. All participants were provided with detailed informed consent forms before their participation. These documents clearly explained the study’s purposes, procedures, potential risks, and benefits. Participants were assured that they had the right to withdraw from the study at any point without any consequences, ensuring voluntary participation. Strict measures were implemented to maintain the confidentiality and anonymity of participant data throughout the study. Personal identifiers were removed from all study documents and datasets to prevent any breach of privacy. Protocols for data handling and security were rigorously followed. These included secure storage of digital data and restricted access to ensure that participant information was protected against unauthorized access, use, disclosure, disruption, modification, or destruction. Participants were kept informed about the progress of the study and their rights as research subjects. The study adhered to ethical considerations to ensure participant rights and safety, receiving approval from the Institutional Review Board of the Faculty of Nursing at Chiang Mai University (Study Protocol Code: 2023-EXP003). The research protocol underwent an expedited review process and received official ethical approval on 13 February 2023 (approval valid until 12 February 2024).
4. Discussion
The first objective of this study was to evaluate the level of agreement between the 4A program and human experts in assessing nursing students’ cognitive skills when faced with the three learning scenarios. The findings revealed a high level of consistency between the 4A program and human evaluators, with an intraclass correlation coefficient (ICC) of 0.971 between the two human experts and 0.886 between the AI program and human experts. These results indicate that the 4A program is highly reliable and closely aligns with expert judgment, suggesting its potential as a dependable tool in educational assessments. The program’s ability to replicate human performance is supported by findings that structured testing, whether conducted by AI or human evaluators, enhances cognitive skills through retrieval practices and reinforcement [21,22]. The high ICC value between the two human experts supports the reliability of expert assessment methods. This is consistent with findings from previous studies that emphasize the importance of reliability in educational assessments, particularly in high-stakes testing environments where consistency is critical to ensuring fairness and accuracy [23].
However, the slightly lower ICC between the AI program and the human experts, though still strong, reflects the inherent challenges of algorithmic assessment. Unlike human assessors, 4A systems rely on predefined criteria and algorithms that may not capture the full complexity of cognitive assessments, such as nuanced understanding and contextual interpretation. This limitation is echoed in the literature, where studies have highlighted the difficulty of developing an AI program that can fully replicate human judgment in complex tasks [2]. Another potential reason for the lower agreement between the AI system and human experts could be attributed to the primary language used by participants, particularly those from the northern region of the country. A significant proportion of participants spoke a regional dialect as their main language, which may have introduced challenges in accurately interpreting responses during the AI assessment. Unlike human experts, who can adapt to linguistic nuances and contextual variations, the AI system may struggle with regional accents, word choices, or phrasing that deviate from its training data. This linguistic limitation has been noted in other studies as a common barrier to achieving high accuracy in AI-based assessments, especially in multilingual or dialect-rich contexts [24]. Addressing these linguistic challenges through improved training datasets or natural language processing models that account for regional diversity may enhance the performance of AI systems in future assessments.
Despite these challenges, the strong agreement between the 4A program and human experts demonstrates that AI has the potential to be integrated effectively into educational assessments. As the use of AI in education continues to grow, ensuring that these systems can reliably match human judgment is crucial. The findings from this study align with previous research suggesting that with proper training and refinement, AI systems can achieve high levels of agreement with human assessors, thereby offering scalable and efficient alternatives to traditional assessment methods [2].
The second objective of this study was to evaluate the accuracy and precision of the 4A program in comparison to human experts. The results revealed a non-significant difference in performance between the AI system and the human expert, with a McNemar test (χ² = 0.4, p = 0.527). This indicates that the 4A program’s ability to accurately and precisely assess nursing students’ cognitive skills is comparable to that of human experts. This finding is consistent with previous research that highlights the role of structured testing in reinforcing cognitive processes and improving learning outcomes, a phenomenon widely recognized as the testing effect [21,22]. Comparable accuracy and precision suggest that the 4A program can serve as a reliable alternative to human evaluation, particularly in educational settings where consistency and scalability are essential. The program’s design, which incorporates predefined algorithms for assessing responses, ensures a high level of standardization, reducing the potential for human bias. These findings align with previous research demonstrating the feasibility of AI systems in replicating human performance in structured tasks, such as educational assessments and cognitive evaluations [15].
Although the 4A program showed similar performance to human experts, it is essential to consider factors that could influence its evaluation outcomes. For example, the AI’s reliance on pre-programmed algorithms may limit its ability to interpret contextual nuances or adapt to linguistic diversity, particularly in cases where participants’ primary language or phrasing deviates from the system’s training data. These limitations highlight the importance of continued refinement in AI training models to ensure that systems like the 4A program remain robust across diverse populations and contexts. Despite these challenges, the results support the potential of the 4A program as a scalable and consistent tool for cognitive skill assessment. Further research should explore its application in larger and more diverse samples to validate these findings and identify areas for improvement, particularly in handling variability in language and context.
The third objective of this study was to validate the effectiveness of the 4A program by comparing its assessment outcomes with the results of the national nursing examinations. The external validation phase revealed that higher scores on the 4A program significantly predicted success in the national exams, with an odds ratio of 1.124 (p = 0.031). This finding supports the predictive validity of the 4A program and underscores its potential as a valuable tool in educational settings. The positive correlation between the AI program scores and national exam outcomes indicates that the 4A program is capable of providing meaningful insights into students’ readiness for high-stakes certification exams. This is particularly important in the context of nursing education, where accurate assessment of cognitive skills is crucial for ensuring that students are prepared to deliver safe and effective patient care. The ability of the AI program to predict exam success suggests that it can be effectively integrated into the broader assessment framework, complementing traditional methods and providing additional data points for educators to consider when evaluating student performance [15].
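To make the reported effect size concrete, and assuming the odds ratio of 1.124 is expressed per one-point increase in the 4A total score (an assumption; the scaling is not restated here), the cumulative effect of a larger score difference can be worked out as follows.

```latex
\text{OR}_{10\text{-point difference}} \;=\; 1.124^{10} \;\approx\; 3.2
```

Under this assumption, a student scoring 10 points higher on the 4A assessment would have roughly three times the odds of passing the national licensure examination.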
Furthermore, the feedback from participants demonstrated strong approval of the 4A program, particularly its ability to engage users, reflect cognitive skills, and provide relevant and clear assessments. The high ratings for scenario-based learning and the development of analytical skills suggest that the program effectively supports critical thinking. Areas for improvement include further refinement of the voice-recorded response features and addressing the challenges some participants faced in adapting to non-traditional assessment formats. The findings of this study align with the growing body of research supporting the integration of educational technology in healthcare training. The AI-assisted answer assessment (4A) program exemplifies how AI-powered tools can enhance traditional cognitive assessments, ensuring reliability, accuracy, and external validity. Future research should explore the broader applications of AI-driven assessments within educational technology to optimize learning outcomes in nursing education. These insights provide valuable guidance for enhancing the program’s usability and expanding its application in broader educational contexts.
Limitations
Several methodological limitations should be noted when interpreting this study. Firstly, the study was conducted in a specific nursing education institution and with only fourth-year nursing students, which may limit the generalizability of the findings to other settings and other student groups. Secondly, the duration of the intervention and the follow-up period for assessing long-term effects were relatively short. Future research could explore the sustained impact of the 4A program over an extended period. Thirdly, the study focused on cognitive skills assessment and did not examine other aspects, such as student satisfaction or engagement with the learning program. Lastly, a further limitation is the reliance on the ChatGPT-4o API for NLP-based text analysis. While this model demonstrated robust accuracy in processing nursing students’ audio transcripts, alternative NLP models such as DeepSeek, GPT-3.5, BERT, or healthcare-specific AI models might yield different results due to variations in training data, contextual understanding, and domain adaptation. Future studies could explore comparative analyses of multiple NLP models to assess their relative performance in cognitive skill evaluation within nursing education.