Expanding the Team: Integrating Generative Artificial Intelligence into the Assessment Development Process
Abstract
1. Introduction
1.1. Technology-Assisted Item Development
1.2. Evaluating Items Developed Through Technology Assistance
1.3. Study Context and Research Question
2. Materials and Methods
2.1. Assessment Development and Validation Frameworks
2.2. Assessment Development and Validation Process for This Study
2.2.1. Phase 1—Planning
2.2.2. Phase 2—Developing
Initial Prompt: Your task is to help me write multiple-choice items for undergraduates who are in a school-based social work internship. Each item should have a stem (item question) and four options with only one correct answer. Make sure you identify the correct answer. Write a multiple-choice item based on [insert learning objective] that is focused on [insert specific content linked to lesson materials or framework].
- Mutually Exclusive Options—answer options do not overlap (e.g., avoid K-type items, whose combined-option answers overlap by design)
- Avoid Use of Negatives—stem and options are positively phrased (e.g., do not include "not" or "except")
- Avoid All or Nothing—response options do not include "all of the above" or "none of the above"
- Parallel Response Options—content in options is similar (e.g., if the correct answer is a potential diagnosis, all options reflect comparable clinical possibilities)
- Similar Option Length—response options are similar in length, both literally (word count) and conceptually (same number of concepts mentioned)
- Present a Problem in the Stem—a direct question is presented (e.g., avoid items that ask: Which of the following is true?)
- Do Not Clue the Answer—the correct answer should not echo language from the stem, which could suggest it is correct simply because of similar wording
- Do Not Teach in the Stem—only information needed for test-takers to answer the question is presented, because instruction occurs during a lesson, not a test
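Several of these guidelines can be screened mechanically before items ever reach SME review. The Python sketch below is a hypothetical illustration of such a pre-screen; it was not part of this study's procedure, the `Item` structure and heuristics are assumptions for demonstration, and lexical checks like these can only flag candidates for human judgment, not replace it.

```python
from dataclasses import dataclass

NEGATIVES = {"not", "except", "never"}
ALL_OR_NOTHING = {"all of the above", "none of the above"}


@dataclass
class Item:
    stem: str
    options: list[str]   # four response options
    correct: int         # index of the keyed (correct) answer


def check_item(item: Item) -> list[str]:
    """Return a list of likely guideline violations for one MC item."""
    flags = []
    stem_words = set(item.stem.lower().split())

    # Avoid Use of Negatives: look for negative wording in stem and options.
    if NEGATIVES & stem_words:
        flags.append("negative wording in stem")
    for opt in item.options:
        if NEGATIVES & set(opt.lower().split()):
            flags.append(f"negative wording in option: {opt!r}")
        # Avoid All or Nothing response options.
        if opt.strip().lower().rstrip(".") in ALL_OR_NOTHING:
            flags.append(f"all-or-nothing option: {opt!r}")

    # Similar Option Length: the keyed answer should not stand out as longest.
    lengths = [len(opt.split()) for opt in item.options]
    if lengths[item.correct] == max(lengths) and lengths.count(max(lengths)) == 1:
        flags.append("keyed answer is the single longest option")

    # Do Not Clue the Answer: keyed answer sharing the most stem words is suspect.
    overlap = [len(stem_words & set(opt.lower().split())) for opt in item.options]
    others = [v for i, v in enumerate(overlap) if i != item.correct]
    if overlap[item.correct] > max(others):
        flags.append("keyed answer shares the most wording with the stem")

    return flags


# Example: the keyed answer (index 0) is both longest and echoes the stem.
item = Item(
    stem="Which strategy best supports family engagement in schools?",
    options=[
        "A strategy that supports family engagement through regular school contact",
        "Weekly grade reports",
        "Annual testing",
        "None of the above",
    ],
    correct=0,
)
print(check_item(item))
```

An item flagged by any of these checks would simply be routed back for revision, for instance via follow-up prompts like the examples below, before SME review.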
Follow-Up Prompt Example 1: Make sure the item is positively framed. Do not use negatives in the stem or options.
Follow-Up Prompt Example 2: The correct answer cannot be the longest. Rephrase options so they are all of similar length. Shorter response options are better.
Follow-Up Prompt Example 3: The correct answer has similar language to the stem and the other options do not. This clues the correct answer. Adjust the language so test-takers cannot correctly guess the answer based on the wording used.
Follow-Up Prompt Example 4: The stem is too wordy and teaches content. Streamline it so it is easier to read and does not teach.
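In this study, the initial and follow-up prompts were issued interactively through a chat interface. For readers who want to script the same pattern, the sketch below shows how the initial prompt and a sequence of follow-up revisions might be automated with the OpenAI Python client; the model name, and treating the follow-ups as a fixed list rather than reviewer-driven replies, are simplifying assumptions, not the study's setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK = (
    "Your task is to help me write multiple-choice items for undergraduates "
    "who are in a school-based social work internship. Each item should have "
    "a stem (item question) and four options with only one correct answer. "
    "Make sure you identify the correct answer."
)


def draft_item(learning_objective: str, content_focus: str,
               follow_ups: list[str], model: str = "gpt-4o") -> str:
    """Issue the initial prompt, then each follow-up revision prompt,
    keeping the full conversation history as an interactive chat would."""
    messages = [
        {"role": "system", "content": TASK},
        {"role": "user", "content": (
            f"Write a multiple-choice item based on {learning_objective} "
            f"that is focused on {content_focus}."
        )},
    ]
    for i in range(len(follow_ups) + 1):
        reply = client.chat.completions.create(model=model, messages=messages)
        messages.append({"role": "assistant",
                         "content": reply.choices[0].message.content})
        if i < len(follow_ups):
            messages.append({"role": "user", "content": follow_ups[i]})
    return messages[-1]["content"]


item = draft_item(
    learning_objective="identifying strategies for fostering family engagement",
    content_focus="school-based social work practice",
    follow_ups=["Make sure the item is positively framed. "
                "Do not use negatives in the stem or options."],
)
print(item)
```

Keeping the entire message history in each request is what lets a follow-up such as "Rephrase options so they are all of similar length" operate on the item drafted earlier in the same conversation.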
2.2.3. Phase 3—Initial Qualitative Field-Testing
SWI Test Review Questionnaire Procedures
Focused Consensus Discussion Procedures
3. Results
3.1. SWI Test Review Questionnaire
- Revise Response Option(s) (n = 8, 67% of items)
  - Confusing language (n = 3)
  - More than one correct answer (n = 5)
- Change Stem (n = 2, 17% of items)
  - LO alignment concern (n = 2)
- No Suggestions Offered (n = 3, 25% of items)
  - SME feedback did not warrant changes (n = 1)
  - SME conflicting feedback (n = 2)
Test Item | LO Alignment | Appropriate Stem | One Correct Answer | Appropriate Options | Total Concerns n (%) | Research Team Suggested Modifications |
---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 1 | 1 (5%) | Revise response option(s): Confusing language. |
2 | 0 | 1 | 1 | 1 | 3 (15%) | Revise response option(s): Confusing language. |
3 | 0 | 1 | 1 | 1 | 3 (15%) | Revise response option(s): More than one correct answer. |
4 | 3 | 3 | 4 | 1 | 11 (55%) | Change stem: LO alignment concern. |
5 | 0 | 1 | 0 | 0 | 1 (5%) | No suggestions offered: SME feedback did not warrant changes. |
6 | 0 | 0 | 3 | 3 | 6 (30%) | Revise response option(s): More than one correct answer. |
7 | 0 | 1 | 1 | 1 | 3 (15%) | Change stem: LO alignment concern. Revise response option(s): Confusing language. |
8 | 1 | 1 | 2 | 3 | 7 (35%) | No suggestions offered: SME conflicting feedback. |
9 | 0 | 0 | 2 | 1 | 3 (15%) | Revise response option(s): More than one correct answer. |
10 | 0 | 1 | 1 | 4 | 6 (30%) | Revise response option(s): More than one correct answer. |
11 | 0 | 1 | 0 | 0 | 1 (5%) | No suggestions offered: SME conflicting feedback. |
12 | 0 | 0 | 1 | 2 | 3 (15%) | Revise response option(s): More than one correct answer. |
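The Total Concerns column can be reproduced directly from the four rating columns. The short script below does the arithmetic; note that the denominator of 20 (five SME reviewers each rating four criteria per item) is our inference from the 5% step size in the table, not a value stated in it.

```python
# Concern counts per item, in the table's column order:
# (LO Alignment, Appropriate Stem, One Correct Answer, Appropriate Options)
concerns = {
    1:  (0, 0, 0, 1),  2: (0, 1, 1, 1),  3: (0, 1, 1, 1),  4: (3, 3, 4, 1),
    5:  (0, 1, 0, 0),  6: (0, 0, 3, 3),  7: (0, 1, 1, 1),  8: (1, 1, 2, 3),
    9:  (0, 0, 2, 1), 10: (0, 1, 1, 4), 11: (0, 1, 0, 0), 12: (0, 0, 1, 2),
}

N_REVIEWERS = 5  # assumed from the 5% step size in the table
N_CRITERIA = 4
possible = N_REVIEWERS * N_CRITERIA  # 20 possible concern flags per item

for item, counts in concerns.items():
    total = sum(counts)
    print(f"Item {item}: {total} ({total / possible:.0%})")
# Item 4, for example, prints "11 (55%)", matching the table.
```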
3.2. Focused Consensus Discussion
4. Discussion
Limitations, Implications, and Future Research
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
AERA | American Educational Research Association |
AIG | Automatic Item Generation |
DBR | Design-Based Research |
LO | Learning Objectives |
MCQ | Multiple-Choice Question |
MHSP | Mental Health Service Professional |
NCME | National Council on Measurement in Education |
NGT | Nominal Group Technique |
NRC | National Research Council |
OET | Office of Educational Technology |
PD | Professional Development |
SME | Subject Matter Expert |
SWI | Social Work Intern |
General Introduction to Community Schools and the MHSP Project

Learning Objectives | Taxonomy (Skill Level) | Perceived Emphasis |
---|---|---|
Describe University Assisted Community Schools as a strategy. | Comprehension | ●● |
Describe how activities and work have impacted the school's ability to adopt a whole health mental health approach. | Knowledge | ●●● |

Well-Being of the Self and Therapeutic Relationship (Ethics, DEI, and Self-Care)

Learning Objectives | Taxonomy (Skill Level) | Perceived Emphasis |
---|---|---|
Express comprehension of reality, equality, equity, and justice within the application of principles of diversity, equity, and inclusion in a school setting. | Application | ●●● |
Apply professional judgement and decision-making skills regarding situations that may arise in your professional roles. | Application | ●●● |
Describe methods to manage distress and discomfort associated with professional duty. | Comprehension | ●●● |

Assessment and Intervention

Learning Objectives | Taxonomy (Skill Level) | Perceived Emphasis |
---|---|---|
Identify strategies for fostering family engagement. | Knowledge | ●●● |
Use strategies to monitor personal implicit biases and stereotypes. | Comprehension | ●● |
Implement trauma-informed practice in schools. | Comprehension | ●●●● |
Apply appropriate documentation processes within schools regarding mental health. | Application | ●●● |
Describe immediate action steps based on an identified high-risk behavior and suicide risk. | Comprehension | ●●● |
Describe evidence-based strategies for working with students, families, communities, and schools. | Comprehension | ●●● |
Describe age- and developmentally-appropriate, culturally informed activities to engage students, families, communities, and schools. | Comprehension | ●●● |
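A test blueprint like the one above is straightforward to operationalize: each learning objective carries a Bloom's level and an emphasis weight (reading more filled circles as greater perceived emphasis), and the number of items per objective can be allocated in proportion to emphasis. The sketch below is a hypothetical illustration of that allocation logic, not code from the study; the objective labels are abbreviated paraphrases of the blueprint rows.

```python
# Each blueprint row: (learning objective, Bloom's level, emphasis weight),
# where the weight is the number of filled circles in the tables above.
BLUEPRINT = [
    ("UACS as a strategy",                         "Comprehension", 2),
    ("Impact on whole health mental health",       "Knowledge",     3),
    ("DEI principles in a school setting",         "Application",   3),
    ("Professional judgement and decision-making", "Application",   3),
    ("Managing professional distress",             "Comprehension", 3),
    ("Family engagement strategies",               "Knowledge",     3),
    ("Monitoring implicit biases",                 "Comprehension", 2),
    ("Trauma-informed practice",                   "Comprehension", 4),
    ("Documentation processes",                    "Application",   3),
    ("High-risk behavior and suicide risk",        "Comprehension", 3),
    ("Evidence-based strategies",                  "Comprehension", 3),
    ("Developmentally appropriate activities",     "Comprehension", 3),
]


def allocate_items(blueprint, test_length: int) -> dict[str, int]:
    """Distribute a fixed number of items across objectives in proportion
    to their emphasis weights, using largest-remainder rounding."""
    total_weight = sum(weight for _, _, weight in blueprint)
    exact = [(lo, test_length * weight / total_weight)
             for lo, _, weight in blueprint]
    counts = {lo: int(share) for lo, share in exact}
    # Hand leftover items to the largest fractional remainders.
    leftover = test_length - sum(counts.values())
    by_remainder = sorted(exact, key=lambda pair: pair[1] - int(pair[1]),
                          reverse=True)
    for lo, _ in by_remainder[:leftover]:
        counts[lo] += 1
    return counts


print(allocate_items(BLUEPRINT, test_length=12))
```

With a 12-item test and these weights, the rounding effectively gives each objective one item; a longer test would begin to differentiate the heavily weighted objectives, such as trauma-informed practice at four circles.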
Response(s) Modification (n = 5, 56%)
- LO: Express comprehension of reality, equality, equity, and justice within the application of principles of diversity, equity, and inclusion in a school setting.
- Original Item: In a school setting, equity refers to
- Revised Item: In a school setting, equity refers to
- Annotation: SMEs arrived at consensus regarding confusing language used in the correct answer (Option A). Although the response was not technically incorrect, SMEs felt it was possible their SWIs could interpret the answer differently than intended given their learning through MHSP workshops and lessons. Therefore, the option was modified (as shown in green text) to better align with the SWIs' educational context and instruction. It was then deemed ready for field testing with SWIs.

Stem and Response(s) Modification (n = 3, 33%)
- LO: Apply appropriate documentation processes within schools regarding mental health.
- Original Item: In the context of applying appropriate documentation processes within schools regarding student mental health, which of the following practices align best with the Subjective, Objective, Assessment, and Plan (SOAP) note procedure?
- Revised Item: In the context of applying appropriate documentation processes within schools regarding student mental health, which of the following aligns with best practice?
- Annotation: SMEs agreed that while SWIs are introduced to the SOAP note procedure, it is not a primary focus of their summer intensive learning sessions. Instead, this information is typically provided by Site Supervisors during fieldwork. If the intent of this test is to assess SWI learning as a result of participating in formal training, the item (as written) may not accurately reflect that objective. As such, the item was revised to reference more general documentation processes by removing the orange text from the original item, resulting in a version ready for continued field testing.

Delete Item (n = 1, 11%)
- LO: Describe methods to manage distress and discomfort associated with professional duty.
- Original Item: A social worker uses reflective journaling and seeks supervision to process their difficult emotions after handling a sensitive student case. Which domain of the self-care wheel are they primarily focusing on?
- Revised Item: None; item deleted from test.
- Annotation: The self-care wheel is emphasized during SWI training as a helpful model for managing professional stress. However, after reviewing the test as a whole, SMEs determined this specific content was not essential to include. They were more interested in understanding how SWIs engage with self-care practices—an area better suited for interviews or surveys than cognitive tests. Since this content could not be effectively assessed through an MCQ, the item was removed and will be evaluated using more suitable methods.