1. Introduction
Mastery grading has been discussed in the education research literature since the 1960s, notably in Bloom’s proposal to develop broader mastery learning curricula [1]. Though several meta-analyses corroborate the general efficacy of mastery learning approaches, broader adoption of this assessment approach has not been observed in higher education, and the American higher-education system generally relies on high-stakes exams [2]. Research studies that specifically examine the impact of the mastery outcomes approach in higher education STEM remain limited but are beginning to emerge in the literature [3]. A mastery outcomes structure that utilized second-chance testing in an undergraduate engineering course resulted in significantly improved final exam performance relative to a course that used traditional high-stakes exams, and students in the mastery outcomes course earned twice as many As and half the number of failing grades. Most importantly, traditionally underrepresented minority (URM) students performed on par with non-URM students [3].
Alternative grading and assessment models have been explored across numerous STEM disciplines, including chemistry education [4]. For example, a learner-centered grading method using a standards-based assessment structure for general chemistry has been shown to improve grading transparency. This implementation did not quantify the impact on student learning outcomes, focusing instead on observational data related to the generally positive student learning experience [5]. In another study involving high school chemistry students, mastery learning improved performance and attitude toward learning [6]. Although the research is limited, these studies highlight the generally positive shift in student perspective when moving from a traditional high-stakes grading system to a mastery approach [7,8].
Another outcomes-based assessment approach that has gained recent attention in the chemistry education community is specifications grading [9]. The specifications grading system differs from the mastery outcomes (second-chance testing) approach in that specifications grading allows students to demonstrate mastery by completing bundles of assignments or tasks (e.g., a letter grade of A can be earned by completing 9/10 components in the bundle, a letter grade of B by completing 8/10 components, etc.). Within the chemistry education literature, specifications grading has been implemented predominantly in laboratory courses [7,8]. We speculate that this is because this type of grading structure is well suited to lab courses that focus on skills and the completion of tasks [7], though specifications systems can lead to mixed results with respect to student satisfaction [8]. Nevertheless, implementing a specifications grading system in a large-enrollment organic chemistry laboratory setting notably improved final letter grades [8].
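To make the bundle-to-grade mapping concrete, the following minimal sketch encodes the 9/10 = A, 8/10 = B example given above; the C and D cutoffs are hypothetical placeholders, since actual specifications vary by course.

```python
def letter_grade(completed: int) -> str:
    """Map the number of satisfactorily completed bundle components
    (out of 10) to a letter grade. The A and B thresholds follow the
    example in the text; the C and D thresholds are hypothetical."""
    thresholds = [("A", 9), ("B", 8), ("C", 7), ("D", 6)]
    for grade, minimum in thresholds:
        if completed >= minimum:
            return grade
    return "F"

print(letter_grade(9))  # A
print(letter_grade(8))  # B
print(letter_grade(5))  # F
```

The key design point is that each component is marked complete or incomplete against a specification rather than awarded partial credit, so the letter grade reflects how many outcomes a student has demonstrated.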
Implementing a specifications grading system in a lecture setting represents a dramatic overhaul in course design from traditional points-based systems, where grades are largely determined by a single attempt at high-stakes summative assessments [2]. As a result, traditional points-based grading remains the norm for lecture courses despite mounting evidence that points-based models increase student stress levels, decrease equity, and de-emphasize the acquisition of content knowledge [10]. Several innovative strategies have been examined for implementing specifications grading in general chemistry [11], organic chemistry [12,13,14], analytical chemistry [15], and upper-division chemical biology [16] lecture courses. Many of these studies focus primarily on qualitative aspects of the specifications grading implementations, highlighting reduced self-reported anxiety and generally positive feedback from professors. Hollinsed et al. did report an increase in the conversion of B students to A students; however, the specifications grading model did not significantly impact the number of lower-performing students [11].
The promise of improving the learning environment for students and professors while creating a more equitable grading system continues to motivate the development of alternative grading models. Recently, Noell et al. implemented a hybrid-specs grading system that introduces an element of second-chance testing to shift the emphasis toward content mastery without a full course redesign [17]. The hybrid-specs system increased the conversion of B to A grades, but there was a small increase in DFW rates under this model. The success of the hybrid-specs model is likely tied to the testing effect, a framework based on research showing that retrieval practice is a critical component of the learning process [18]. However, for students to benefit from the second-chance testing model, they must have the skills, resources, and metacognitive strategies required to fill knowledge gaps, thereby improving scores on subsequent tests [19]. Consistent with this, recent literature suggests that the testing effect can decrease or even disappear as the complexity of the learning materials increases [20]. Yet, when regular testing is coupled with additional tools and resources, student performance gains have been realized for advanced topics directly related to chemistry [21].
Successful engagement with the mastery grading model is closely linked to a student’s background, particularly through metacognitive development and familiarity with effective learning strategies. Therefore, comparing student performance across different demographics, such as familial education history, ethnicity, and income, can provide valuable insight into the design and implementation of alternative grading models. Previous reports did not study the specific impacts of these alternative grading strategies on traditionally underrepresented students [3,7,8,12,15,16]. A previous study implementing mastery learning in high school general chemistry did find a particularly pronounced positive effect on learning outcomes and attitudes for students struggling with the chemistry content [6]. However, those results were not disaggregated by ethnicity or familial education level.
To the authors’ knowledge, the present work represents the first study that assesses the impact of a mastery grading system in large-enrollment general chemistry courses relative to an active control course that used traditional high-stakes exams. This work highlights the importance of coupling second-chance testing with interactive courseware designed to promote asynchronous active learning and metacognitive development [22]. Particular emphasis is placed on exploring correlations with a student’s familial education history, ethnicity, and socioeconomic status by examining disaggregated student performance data. This study shows that supporting a mastery grading model with robust interactive courseware improved the average performance of the entire class population on a common final assessment by 6.9 percentage points relative to a control using infrequent high-stakes exams. The improvement was more pronounced (11.6 percentage points) for first-generation college students from a URM background receiving financial aid.
1.1. Theoretical Frameworks
In this project, we implement a mastery outcomes assessment approach rooted in the theoretical frameworks of the testing effect [18] and mindset theory [23]. The testing effect is linked to the phenomenon of retrieval practice: the act of retrieving information can, in some cases, be a more impactful learning event than an encoding event, where encoding refers to the initial acquisition of new knowledge [24]. The mastery outcomes approach implemented in this study naturally leverages the positive impacts of the testing effect by providing more frequent self-assessment scenarios. Furthermore, the mastery grading approach naturally facilitates the incorporation of metacognitive strategies.
Student feedback on mastery assessments is directly linked to interactive courseware content with built-in tools that help students monitor the learning process by identifying gaps in their understanding and addressing them through targeted practice. This assessment, reflection, and practice cycle is designed to provide a pattern of engagement that promotes a growth mindset. Mindset theory is based on research indicating that students who believe intelligence is malleable often experience more positive learning outcomes; students with a growth mindset tend to view initial failure as an opportunity for improvement rather than a predictor of future negative outcomes [23].
1.2. Research Questions and the Current Study
The present study addresses three research questions:
1. How do student performance outcomes differ between courses incorporating a mastery grading/test-retake system and a course using infrequent high-stakes exams?
2. How do courses incorporating a mastery grading/test-retake system affect equity gaps compared to a course using infrequent high-stakes exams?
3. What is the general qualitative student affective response to courses using a mastery grading/test-retake system?
The present work employs a second-chance testing strategy through weekly unit mastery assessments (denoted as Mastery). However, we have chosen a mastery-focused model that dispenses with the token economy and instead allows every student a fixed number of scheduled retakes for a given mastery assessment [7,9]. This mastery grading approach directly fosters a growth mindset in students and signals the instructors’ conviction that students possess the capacity to improve through persistent effort. The cues hypothesis states that instructors who espouse a fixed mindset create threatening situational cues that can demotivate students, especially traditionally underrepresented students [25]. By adopting a mastery outcomes assessment structure, instructors show a commitment to a student growth mindset that should lead to improved cognitive and affective outcomes.
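As a rough illustration of the retake structure, the sketch below assumes a hypothetical scoring rule in which the best score across a fixed number of scheduled attempts is recorded; the actual bookkeeping used in the courses may differ.

```python
MAX_ATTEMPTS = 3  # hypothetical cap on scheduled attempts per mastery assessment

def unit_score(attempts: list[float]) -> float:
    """Return the recorded score for one unit mastery assessment,
    assuming (hypothetically) that the best attempt counts."""
    if not 1 <= len(attempts) <= MAX_ATTEMPTS:
        raise ValueError(f"expected between 1 and {MAX_ATTEMPTS} attempts")
    return max(attempts)

print(unit_score([55.0, 78.0, 92.0]))  # 92.0 -- an early low score is not penalized
```

Under such a rule, a first attempt carries no grade risk, which is precisely the situational cue intended to support a growth mindset.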
Coupling the mastery grading model with metacognitive coaching and interactive courseware designed to promote asynchronous active learning (Mastery+OLI) is a crucial complement to the mastery learning model. If students are not given guidance on evaluating and reflecting upon their learning, it is unlikely that having multiple attempts on the various content assessments will lead to meaningful learning gains [20]. The General Chemistry curriculum available through the Open Learning Initiative (OLI) at Carnegie Mellon University was selected for the present work. OLI General Chemistry provides a rich, interactive learning environment built upon the OpenStax Chemistry textbook, with embedded problems that provide extensive hints and feedback [26,27]. The structure of OLI General Chemistry is based on literature findings suggesting that students learn more by doing interactive problems than by reading text or watching videos [28].
3. Results
We begin by comparing student performance on the common final assessment questions.
Figure 1 plots the distribution of student scores for the control (gray), Mastery (red), and Mastery+OLI (green) courses. The Mastery and Mastery+OLI courses led to higher scores on the common final exam questions than the control. Specifically, the mean performance, expressed as the percent of correct responses on the common final exam questions, for the Mastery+OLI group was 7.1 percentage points greater than the control, a statistically significant difference. Similarly, the mean performance for the Mastery group was 2.6 percentage points higher relative to the control; however, this improvement was not statistically significant at the 0.05 significance level. Figure 1 also shows a narrower distribution of scores for the Mastery+OLI section, and the standard deviation for the Mastery+OLI group (16.4) is smaller than those of both the Mastery (18.9) and control (20.4) groups.
Despite the fact that the Mastery+OLI course had the highest percentages of URM students, first-generation students, and students receiving financial aid (see Table 1), the mean common exam scores for the Mastery+OLI course were significantly higher when the entire class population was analyzed (Figure 1). Disaggregating student performance data by ethnicity, first-generation status, and financial aid status reveals striking trends. In particular, Figure 2a shows that mastery grading alone did not significantly impact the average performance of URM students (blue) relative to the control. However, when the mastery grading model was coupled with the OLI interactive courseware tools, the average performance of URM students improved by 7.1 percentage points relative to the control, with an effect size of 0.181 (see Table 3). Even more pronounced improvements are observed for students with first-generation status, as seen in Figure 2b. Relative to the control, first-generation students enrolled in the Mastery+OLI course demonstrated a mean difference of 12.5 percentage points, with a moderate effect size of 0.272, compared to only a 5.1 percentage point improvement for students who did not identify as first-generation (see Table 3). These findings suggest that mastery grading improves student learning when students have sufficient scaffolding for addressing gaps in content knowledge. The rich interactive tools provided through OLI in the Mastery+OLI course are likely crucial for assisting URM and first-generation college students in addressing gaps in their content knowledge. These findings are consistent with our recent work examining the link between engagement with the OLI courseware and performance on mastery assessments [22].
Financial aid status was used as a proxy for identifying students more likely to have experienced financial hardship. Similar to the results found for ethnicity and first-generation status, students who received financial aid disproportionately benefited from the Mastery+OLI implementation relative to the control group. In particular, the Mastery+OLI group showed a 7.4 percentage point improvement in the mean relative to the control. Interestingly, the mean performance of the students who had not received financial aid in the Mastery and Mastery+OLI courses was the same. These results further support the hypothesis that students from more affluent backgrounds are more likely to have family members who have attended college and are, therefore, more likely to have developed habits and practices conducive to mastering chemistry content.
Finally, we considered intersectionality in the data by comparing the average performance of all students with that of students who simultaneously identify as first-generation college students, belong to a URM group, and receive financial aid.
Figure 2d and Table 3 show that mastery grading alone yielded a mean difference of 4.5 percentage points relative to the control (p = 0.956), while Mastery+OLI yielded a mean difference of 11.6 percentage points relative to the control (p < 0.018). The Mastery+OLI design showed a moderate effect, with a Cohen’s f value of 0.248 (see Table 3) [31]. It should be noted that financial aid status was highly correlated with URM and first-generation status; specifically, only two students were both URM and first-generation and did not receive financial aid.
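For readers who wish to reproduce this style of analysis, the following sketch illustrates a one-way ANOVA with a Cohen’s f effect size computed from eta-squared. The score arrays are simulated stand-ins, not the study data: the standard deviations (20.4, 18.9, 16.4) and mean differences (+2.6, +7.1) follow the values reported above for the whole-class comparison, while the control mean of 68 and the group sizes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for per-student percent-correct scores: the standard
# deviations and mean differences follow the text, but the control mean (68)
# and group sizes (250) are arbitrary.
rng = np.random.default_rng(seed=0)
control = rng.normal(loc=68.0, scale=20.4, size=250)
mastery = rng.normal(loc=70.6, scale=18.9, size=250)
mastery_oli = rng.normal(loc=75.1, scale=16.4, size=250)
groups = [control, mastery, mastery_oli]

# One-way ANOVA across the three sections.
f_stat, p_value = stats.f_oneway(*groups)

# Cohen's f from eta-squared: eta2 = SS_between / SS_total,
# f = sqrt(eta2 / (1 - eta2)).
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta2 = ss_between / ss_total
cohens_f = float(np.sqrt(eta2 / (1.0 - eta2)))

print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}; Cohen's f = {cohens_f:.3f}")
```

By Cohen’s conventions, f values of roughly 0.10, 0.25, and 0.40 mark small, medium, and large effects, which is why the 0.248 value above is described as moderate.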
Student Feedback Results
Recent work suggests a poor correlation between student evaluations and student learning, and the efficacy of student evaluations has been questioned [32]. However, this lack of correlation has been observed for numerical ranking systems, and evidence suggests student evaluations remain a crucial tool for providing instructors with feedback [33]. In particular, the comments section can provide valuable insights from the student’s perspective, and such insights are particularly useful when evaluating novel instructional tools and course design. Nevertheless, analyzing hundreds of student responses for sentiment and relevance to a specific intervention while minimizing the introduction of bias is a challenging task. Recently, Hoar et al. suggested using natural language processing tools to facilitate this process [34]. Here, we apply natural language processing tools to student course evaluation data to analyze student feedback on the mastery grading system.
Student feedback was collected as anonymous course evaluation data for both mastery grading sections. The comments in the course evaluations were separated by sentence, providing 397 responses for the Mastery+OLI section and 95 responses for the Mastery section that did not use OLI. Each sentence was processed using Google’s Natural Language API to identify keywords with a corresponding salience ranking. We selected keywords related to mastery grading and collected the associated sentiment scores (see the Supporting Information for details). This analysis resulted in 103 student comments related to the mastery grading model across both sections. Finally, each relevant student comment was subjected to sentiment analysis using Google’s natural language processing tools. Google Cloud Sentiment Analysis (GCSA) provides a sentiment score between −1 and +1, with larger scores corresponding to more positive sentiment. Sentiment analysis was also carried out using two additional algorithms, and the results were similar to those reported in Figure 3 (see the Supporting Information for details). Readers interested in the details of implementation and the comparative performance of alternative sentiment analysis algorithms are directed to the following review [35].
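A minimal sketch of this pipeline using the google-cloud-language client is shown below. The keyword set and the salience-based filtering rule here are illustrative placeholders; the keywords actually used are detailed in the Supporting Information.

```python
# pip install google-cloud-language  (requires Google Cloud credentials)
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

# Illustrative placeholder keywords; the actual list is in the
# Supporting Information.
MASTERY_KEYWORDS = {"mastery", "retake", "exam", "assessment"}

def mastery_sentiment(sentence: str) -> float | None:
    """Return the sentiment score (-1 to +1) for a course-evaluation
    sentence if it mentions a mastery-grading keyword, else None."""
    doc = language_v1.Document(
        content=sentence, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    # Entity analysis yields the keywords, each with a salience ranking.
    entities = client.analyze_entities(request={"document": doc}).entities
    if not any(e.name.lower() in MASTERY_KEYWORDS for e in entities):
        return None
    # Sentiment analysis scores the sentence on [-1, +1].
    response = client.analyze_sentiment(request={"document": doc})
    return response.document_sentiment.score
```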
The histogram in Figure 3 illustrates the sentiment analysis results. For ease of interpretation, scores below −0.25 are classified as negative (red), scores between −0.25 and +0.25 as neutral (yellow), and scores above +0.25 as positive (green). This analysis was replicated with similar results using the VADER and TextBlob algorithms (see the Supporting Information for details) [36,37]. The mastery grading model resulted in generally positive or neutral student feedback. In particular, numerous students noted decreased stress and anxiety from the second-chance testing. On the other hand, numerous students expressed concern about sacrificing small-group discussion time in favor of repeated mastery exam attempts. Additionally, several students remarked on the lack of flexibility regarding testing times.
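The binning rule and the cross-check with the two additional algorithms can be sketched as follows; the sample comment is invented for illustration, and VADER’s compound score and TextBlob’s polarity are used because both fall on the same −1 to +1 scale as GCSA.

```python
# pip install vaderSentiment textblob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

def classify(score: float) -> str:
    """Bin a sentiment score on [-1, +1] using the cutoffs from Figure 3."""
    if score < -0.25:
        return "negative"
    if score <= 0.25:
        return "neutral"
    return "positive"

comment = "The retakes really lowered my test anxiety."  # invented example
vader = SentimentIntensityAnalyzer().polarity_scores(comment)["compound"]
textblob = TextBlob(comment).sentiment.polarity

print(f"VADER: {vader:+.2f} ({classify(vader)})")
print(f"TextBlob: {textblob:+.2f} ({classify(textblob)})")
```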
4. Discussion
The preliminary implementation of mastery grading in large-enrollment chemistry courses appears to have improved overall student learning outcomes and reduced equity gaps. This study contributes to the growing body of literature on alternative grading systems by demonstrating the efficacy of a mastery-focused approach, particularly when supplemented with interactive courseware like the General Chemistry curriculum provided through the Open Learning Initiative (OLI). The overall performance improvements we observe with the Mastery+OLI course are comparable to those seen in a pharmacokinetics (PK) and pharmacodynamics (PD) course that used a weekly quizzing model. Specifically, Henning et al. reported a 7.93 percentage point improvement in the average scores on the PK/PD component of the final exam [21].
Although several studies have explored alternative grading systems in chemistry, these studies do not directly compare performance across different demographics [6,11,12,14]. However, one study involving high school chemistry students found that a mastery grading model improved learning outcomes for students having difficulty with the content [6]. These previous results reported for high school chemistry are broadly corroborated by our finding that URM students, first-generation students, and students receiving financial aid showed the greatest improvement when using the Mastery+OLI model (Figure 2). Another study, implementing a mastery grading model in an undergraduate engineering course, found that women and URM students benefited from the alternative grading model to the same extent as the general population [3]. The present work did not consider gender; however, Figure 2a does show similar improvement for both URM and non-URM students when using the Mastery+OLI model.
Examining the disaggregated statistics for learning outcomes highlights the importance of incorporating metacognitive tools within the mastery grading model. Mastery grading alone did not improve student learning relative to the control for URM students (Figure 2a). These findings are not necessarily surprising when considered in the context of the recent literature. The role of retrieval practice in consolidating learning is well established [18,24]. However, consolidating learning through repeated testing may be of little value for students who are struggling to grasp the challenging and complex concepts in general chemistry [20]. Casselman et al. have shown that providing responsive online content designed to promote the development of metacognitive skills improved ACS exam performance by 4% relative to the control [19]. The OLI interactive courseware provides an accessible platform where students can regularly assess their abilities, receive detailed feedback regarding progress toward learning goals, and create a future study plan. Developing metacognitive skills is expected to be particularly impactful for students with minimal previous training of this nature (e.g., URM and first-generation students in Figure 2).
Chemistry-specific growth mindset interventions in first-year general chemistry have been shown to improve the student learning experience and even eliminate the ethnicity achievement gap [38]. The mastery grading system reinforces the belief that chemistry content knowledge and problem-solving skills can be developed and improved over time. A closer examination of the sentiment analysis data presented in Figure 3 shows that the majority of the positive student feedback references a reduction in stress surrounding testing and a shift toward viewing mistakes as opportunities for growth, resulting in improved performance on subsequent exams. Though this was not directly explored in the study, we speculate that the absence of a curve in the mastery grading model promoted peer-to-peer engagement because the course grade was no longer tied to a student’s performance relative to their peers.
Limitations and Future Work
Although we do not have incoming knowledge data, all students were placed into the course on the same track based on math placement, so the distribution of math placement scores is likely similar across all three sections. However, as with any observational study, we ultimately could not account for the various confounding variables that might have impacted student performance (prior chemistry knowledge, differences in co-curricular demands among the class populations, etc.). Additionally, though instructor bias could not be fully accounted for, we note that the Mastery course was taught by a Distinguished Professor of Teaching and that the control instructor provided the common exam questions. These factors suggest there was no bias favoring higher student exam performance in the Mastery+OLI group. Further studies are currently being designed to include an initial knowledge assessment and the deployment of the Mastery+OLI model at scale across multiple institutions.
The common assessment items generally emphasized more traditional skills and knowledge. There is an emerging emphasis in the chemistry education community on moving beyond a procedural and skill-based focus to promote learning objectives associated with conceptual understanding and more meaningful learning [39,40]. Therefore, future work will investigate how a mastery grading approach can improve this type of higher-order learning. However, multiple-choice questions administered through an online testing system better accommodate the volume of testing inherent to the Mastery+OLI model, and building a test-retake system with more open-ended conceptual assessment items will be a challenge that needs to be overcome. Finally, the end of the term presents a logistical limitation, wherein students must complete multiple mastery attempts in rapid succession. This did lead to some student anxiety, and future implementations will focus on strategies for providing a more flexible testing schedule.
We have demonstrated that this test-retake system can be implemented in large-enrollment introductory courses at an R1 institution with TA support and built-in recitations/discussions. The absence of either TA support or recitation/discussion sections would make it difficult to replicate this approach. Furthermore, sacrificing small-group discussion time in favor of testing presents a limitation, as evidenced by student comments expressing concern over the loss of small-group interaction time in the discussion/recitation sections. Future work will include exploring alternative second-chance testing models using a testing center, collecting more data on the student experience, and carrying out a more detailed study of affective outcomes and/or outcomes related to student mindset.