1. Introduction
Mixed-format assessments, which incorporate both multiple-choice (MC) and constructed-response (CR) items, are prevalent in current educational evaluations, comprising approximately 63% of statewide assessments in the United States (Lane, 2005). These assessments have gained prominence due to their ability to evaluate diverse cognitive skills through different item formats, providing a more comprehensive picture of student abilities (Xiong et al., 2024). Given the use of different item formats, these assessments typically generate two distinct score scales, as in the Graduate Record Examination (GRE; Educational Testing Service, 2012). In the GRE Verbal test, for instance, MC items are scored automatically and results are provided immediately to students, whereas CR items require detailed human evaluation and their polytomous scores are reported at a later date. This dual scoring system creates a fundamental challenge in educational measurement: how to accurately synthesize these distinct measurements into a meaningful composite ability score that reflects a student’s true capabilities.
The complexity of integrating different scores from mixed-format assessments has spawned numerous psychometric approaches. Traditional methods have relied on item response theory (IRT; Baker & Kim, 2014) models, applying dichotomous models to MC items and polytomous models to CR items to calibrate the different score patterns (Ercikan et al., 1998; Rosa et al., 2001). Although this approach can produce a rough estimate of students’ abilities, it raises persistent questions about the proportional weight each item format contributes to the final composite ability score. Specifically, this method carries an implicit psychometric assumption that weights are allocated according to item reliability (i.e., discrimination), thereby optimizing the scores primarily for reliability, which may diverge from the information provided by the two separate ability scores (Sykes et al., 2001). Alternative approaches using weighted linear combinations have emerged (Thissen et al., 2001), but these methods suffer from two key limitations: the absence of consensus on optimal weight selection and the labor-intensive nature of expert-driven weight determination. Furthermore, the reliance on approximation methods within these frameworks often compromises the precision of the final ability estimates.
Recent advances in Bayesian methods have opened new avenues for addressing these limitations by leveraging prior knowledge about unknown abilities (Smid et al., 2020). Among these methods, Xiong et al. (2023) introduced a two-step Sequential Bayesian (SB) approach based on empirical Bayes theory, which effectively uses empirical prior information to estimate ability values. In the first step, this method derives a prior distribution from a portion of each student’s response score data; in the second step, this empirical prior is used to compute the posterior estimate of ability from the remaining data. Through this two-step framework, the approach eliminates the need for explicit weight assignments, as the empirical Bayes technique automatically adjusts the weights during calibration via the priors. Although this allows results from different item formats to be integrated into a composite, comprehensive ability score, the practical application of the empirical Bayes method, often described as a “pseudo-Bayesian” approach, may present challenges in certain situations. This method estimates hyperparameters directly from the data and then treats them as fixed values within the Bayesian model (Williams & Savitsky, 2021). Such an approach may lead to a significant underestimation of uncertainty because it fails to account for the variance in the hyperparameters themselves. Additionally, this method derives priors from dichotomous scores and subsequently applies these priors to calibrate polytomous scores. This separate calibration can compromise the reliability of the estimates, particularly in assessments with few items, as it effectively splits the data and diminishes the influence of MC items by not incorporating them fully into the calibration of CR responses. Moreover, the SB method’s reliance on maximum likelihood estimation for item parameters before estimating ability parameters effectively fixes the scale of the ability parameters, potentially constraining the flexibility of the model and introducing bias into the final ability estimates.
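To make the two-step logic concrete, the following minimal R sketch illustrates the SB idea for a single student under assumed, known item parameters; the parameter values, item counts, and grid-based estimation are our own illustrative choices, not Xiong et al.’s (2023) implementation.

```r
# Illustrative sketch of the two-step Sequential Bayesian (SB) idea,
# assuming known item parameters; NOT the original authors' implementation.
set.seed(1)

# Hypothetical 2PL parameters for 3 MC items (a = discrimination, b = difficulty)
a_mc <- c(1.0, 1.2, 0.8); b_mc <- c(-0.5, 0.0, 0.7)
# Hypothetical GPCM parameters for 2 CR items (max score 2, two steps each)
a_cr <- c(1.1, 0.9); b_cr <- list(c(-0.3, 0.4), c(0.1, 0.9))

p_2pl  <- function(theta, a, b) plogis(a * (theta - b))
# Generalized partial credit category probabilities for one CR item
p_gpcm <- function(theta, a, steps) {
  num <- exp(cumsum(c(0, a * (theta - steps))))
  num / sum(num)
}

loglik_mc <- function(theta, x)
  sum(dbinom(x, 1, p_2pl(theta, a_mc, b_mc), log = TRUE))
loglik_cr <- function(theta, y)
  sum(mapply(function(yi, ai, si) log(p_gpcm(theta, ai, si)[yi + 1]),
             y, a_cr, b_cr))

# Step 1: ML ability estimate from MC items only -> empirical normal prior.
# In SB the prior hyperparameters come from the sample of MC-based estimates;
# here we plug in illustrative values for one student.
x <- c(1, 0, 1); y <- c(2, 1)            # one student's MC and CR responses
grid <- seq(-4, 4, length.out = 201)
mu0 <- grid[which.max(sapply(grid, loglik_mc, x = x))]
sd0 <- 1

# Step 2: EAP posterior from the CR items under the empirical prior
post <- sapply(grid, function(t) exp(loglik_cr(t, y)) * dnorm(t, mu0, sd0))
post <- post / sum(post)
theta_eap <- sum(grid * post)
theta_eap
```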
To address these methodological shortcomings, this study introduces the Empirical Fully Bayesian (EFB) method as an extension of the original two-step analysis. EFB modifies and extends the traditional SB method by integrating a fully Bayesian sampler, the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2014). First, unlike the SB method’s fixed-hyperparameter approach, EFB obtains posterior estimates by integrating over the full posterior distribution, providing a more comprehensive representation of uncertainty in the parameter estimates. Second, where SB relies on maximum likelihood estimation that fixes the scale of the item parameters, EFB estimates item and ability parameters simultaneously through fully Bayesian sampling, allowing more flexible parameter estimation and better accounting for the interdependence between item and ability parameters. Third, in contrast to SB’s sequential calibration process, which can compromise reliability, EFB uses a unified calibration that processes MC and CR items together, maintaining the integrity of information from both formats throughout the estimation process.
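To illustrate what joint calibration with NUTS can look like, the sketch below embeds a simple mixed-format model (a 2PL for MC items and a generalized partial credit model for CR items, with an assumed standard-normal ability prior) in Stan via the rstan interface. This is a minimal illustration under simplifying assumptions, not the EFB implementation described later; in particular, EFB replaces the fixed standard-normal ability prior with an empirically informed one.

```r
# Minimal rstan sketch of jointly calibrating MC (2PL) and CR (partial credit)
# items in one model, sampled with NUTS (Stan's default sampler).
# An illustration only, not the paper's EFB implementation.
library(rstan)

stan_code <- "
data {
  int<lower=1> N;                          // students
  int<lower=1> J_mc;                       // MC items
  int<lower=1> J_cr;                       // CR items (max score 2 assumed)
  array[N, J_mc] int<lower=0, upper=1> x;  // MC responses
  array[N, J_cr] int<lower=0, upper=2> y;  // CR scores
}
parameters {
  vector[N] theta;                         // abilities
  vector<lower=0>[J_mc] a_mc;              // MC discriminations
  vector[J_mc] b_mc;                       // MC difficulties
  vector<lower=0>[J_cr] a_cr;              // CR discriminations
  matrix[J_cr, 2] step_cr;                 // CR step difficulties
}
model {
  theta ~ normal(0, 1);                    // EFB would use an empirical prior here
  a_mc ~ lognormal(0, 0.5);   b_mc ~ normal(0, 2);
  a_cr ~ lognormal(0, 0.5);   to_vector(step_cr) ~ normal(0, 2);
  for (n in 1:N) {
    for (j in 1:J_mc)
      x[n, j] ~ bernoulli_logit(a_mc[j] * (theta[n] - b_mc[j]));
    for (j in 1:J_cr) {
      vector[3] lp;                        // category log-odds, category 0 as base
      lp[1] = 0;
      lp[2] = a_cr[j] * (theta[n] - step_cr[j, 1]);
      lp[3] = lp[2] + a_cr[j] * (theta[n] - step_cr[j, 2]);
      y[n, j] + 1 ~ categorical_logit(lp);
    }
  }
}
"
# fit <- stan(model_code = stan_code, data = list(N = ..., J_mc = ..., J_cr = ...,
#             x = ..., y = ...), chains = 3, iter = 3000, warmup = 1000)
```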
The structure of this study is as follows: First, we describe the alternative EFB method and illustrate the estimation procedure using specific IRT models. Then, we analyze empirical data with our method to demonstrate its capacity to provide more comprehensive insight than the original method. Next, we validate the recovery of ability parameters through two simulation studies utilizing additional empirical data sets. Finally, we discuss the broader implications and limitations of our findings and suggest directions for future research.
3. Results
In this section, we present both an empirical real-data demonstration and two empirical data-based simulations to comprehensively evaluate the proposed EFB method. The real-data analysis illustrates the practical applicability of our approach, while the simulations, informed by additional empirical data, allow us to systematically examine the robustness of EFB under controlled conditions such as assessment lengths, sample sizes, and ability groups, as compared to the conventional SB method.
3.1. Empirical Study Demonstration
3.1.1. Data Description and Analysis
This section utilizes empirical score data from Grade 3 to Grade 10 English Language Arts (ELA) assessments administered in several school districts in Georgia, United States, to showcase the application of the EFB method. These ELA assessments are designed to evaluate student proficiency in extended reasoning and critical thinking, with a particular focus on skills and knowledge essential for reading and argumentative writing. Summary statistics for the data are presented in Table 1. The data set encompasses a wide range of student participation across grades: Grade 6 has the smallest cohort with 646 students, while Grade 9 has the largest with 7017 students. The structure of the ELA assessments varies by grade level: Grades 3 through 6 feature shorter tests, each comprising three MC items and two CR items, whereas Grades 7 through 10 are composed of 13 MC items and 5 CR items. All MC items are scored dichotomously. Among the CR items, all but the final item in each assessment are short-answer items with a maximum score of 2; the last CR item is a long essay item, scored up to 4. The last column, “Points Possible”, indicates the total MC plus CR points in each assessment.
All analyses were conducted in R version 4.3.0 (R Core Team, 2023) through RStudio. For each Bayesian analysis, three Markov chains were run, each for 3000 iterations, with the first 1000 iterations of each chain discarded as burn-in. This practice mitigates the influence of initial values on the analysis, enhancing the stability and robustness of the final estimates.
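For reference, these settings map directly onto standard rstan sampling arguments; in the sketch below, stan_model_object and stan_data are hypothetical placeholders for a compiled model and its response data.

```r
# Hypothetical call showing the MCMC settings used throughout: 3 chains of
# 3000 iterations each, with the first 1000 per chain discarded as warmup.
library(rstan)

fit <- sampling(stan_model_object,   # a compiled stanmodel (placeholder)
                data = stan_data,    # assessment response data (placeholder)
                chains = 3, iter = 3000, warmup = 1000, seed = 123)

# Convergence diagnostics: Rhat close to 1 and adequate effective sample size
print(summary(fit)$summary[, c("Rhat", "n_eff")])
```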
3.1.2. Empirical Real-Data Results
Prior to deploying the EFB method, results of factor analyses are presented in Table 2. These analyses ascertain the dimensional structure of each assessment data set. In the Grade 3 data set, for instance, the first factor’s eigenvalue of 6.52 was markedly higher than that of the subsequent factor, accounting for 18.94% of the total variance. A similar pattern can be observed for the other grades. Because Grades 7–10 include more MC and CR items than Grades 3–6, the second factor accounted for relatively more variance in Grades 7–10. However, the first factor in these grades still explains the majority of the variance, and there remains a notable drop in eigenvalues from the first factor to the next. This pattern indicates a dominant single factor, supporting the unidimensionality of all the assessment data.
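A dimensionality screen of this kind can be reproduced from the eigenvalues of a polychoric correlation matrix; the sketch below shows one standard approach (the response matrix resp is an assumed student-by-item score matrix, and the exact extraction method behind Table 2 may differ).

```r
# One way to screen dimensionality for mixed dichotomous/polytomous scores:
# eigenvalues of the polychoric correlation matrix.
library(psych)

pc <- polychoric(resp)$rho          # polychoric correlations (resp assumed)
ev <- eigen(pc)$values              # eigenvalues, largest first
prop_var <- ev / sum(ev)            # proportion of variance per factor
round(cbind(eigenvalue = ev, proportion = prop_var), 3)
# A dominant first eigenvalue with a sharp drop to the second
# supports treating the assessment as unidimensional.
```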
The ELA data sets were then analyzed using the EFB method. Figure 2 illustrates this analysis through scatterplots that compare each student’s posterior ability estimate with their prior ability estimate for each grade, alongside the respective distributions. The distribution curves for the posterior ability estimates are depicted on the right side of each scatterplot, while those for the prior ability estimates are shown at the top. A positive correlation between the prior and posterior estimates is expected, and this expectation holds regardless of the test design and the varying number of CR item points. The scatterplots in Figure 2 consistently reveal a positive linear relationship between the posterior and prior ability estimates, although there are visual differences between Grades 7–10 and Grades 3–6. With their larger student populations and greater numbers of items, Grades 7–10 exhibit more diverse score combinations and hence a more varied pattern of estimates, reflecting the increased variability in student responses. In contrast, Grades 3–6, with smaller student populations, show more overlapping points. This is primarily because the smaller numbers of students and items increase the likelihood of students sharing identical response patterns across multiple items. For example, in Grades 3–6, where each ELA assessment includes only three MC items, some students’ response score vectors are identical, leading to identical prior ability estimates for these students. This similarity appears as several vertical lines within the scatterplots. As a result of these identical prior values, the EFB method produces a noticeably narrower posterior distribution in these grades. For Grades 7–10, no such pronounced narrowing of the posterior distribution is observed. It is notable, however, that while the prior ability estimates generally follow a roughly normal distribution, the posterior distributions in certain grades exhibit non-normal characteristics: Grades 7–10 show right-skewed distributions.
Furthermore, an analysis not depicted in Figure 2 but crucial to our findings shows that the EFB method achieves high reliability in the posterior estimates. Among all grades, Grade 6 demonstrated the lowest reliability coefficient of 0.831, with a mean posterior ability of −0.204 and a standard deviation of 0.770. Grade 7 showed an intermediate reliability of 0.880, with a mean posterior ability of −0.113 and a standard deviation of 0.878. Grade 9 exhibited the highest reliability coefficient of 0.909, with a mean posterior ability of 0.048 and a standard deviation of 0.945. Overall, these findings illustrate the EFB method’s consistent performance, with reliability coefficients uniformly exceeding 0.830 across all examined grades.
The analysis of posterior ability measurement precision reveals variability in the standard error of measurement (SEM) and conditional standard error of measurement (CSEM) across ability levels and grades. Table 3 lists the SEM/CSEM at the 25th, 50th (median), and 75th percentile ability estimates across Grades 3–10, along with the overall SEM for each grade. Measurement precision was generally good across all grades, with SEM/CSEM values ranging from 0.29 to 0.41. Grade 8 demonstrated the highest overall precision (SEM = 0.30), while Grade 3 showed the lowest (SEM = 0.39).
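Given posterior draws of ability, these precision summaries are direct to compute; in the sketch below, theta_draws is an assumed draws-by-students matrix of posterior samples, the CSEM for a student is taken as that student’s posterior standard deviation, and the overall SEM as the average CSEM.

```r
# Summarizing measurement precision from posterior draws (theta_draws assumed).
theta_hat <- colMeans(theta_draws)        # EAP ability estimates
csem      <- apply(theta_draws, 2, sd)    # conditional SEM per student

sem_overall <- mean(csem)                 # overall SEM for the grade

# CSEM near the 25th, 50th, and 75th percentile ability estimates
pct <- quantile(theta_hat, c(.25, .50, .75))
idx <- sapply(pct, function(q) which.min(abs(theta_hat - q)))
round(c(SEM = sem_overall, CSEM_p25 = csem[idx[1]],
        CSEM_p50 = csem[idx[2]], CSEM_p75 = csem[idx[3]]), 2)
```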
In practice, each of these assessments has been calibrated to produce three ability estimates: one derived from MC items, another from CR items, and a composite ability estimate that integrates both using weights determined by subject matter experts. To evaluate the efficacy of the EFB method, we performed a comparative analysis between the posterior ability estimates from the EFB method and the traditional composite ability estimates based on expert-determined weights. Table 4 presents the correlation, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) for each grade’s assessment, comparing the two sets of ability estimates. The correlation between the EFB and traditional estimates remains high for Grade 6 even though this grade has the smallest student cohort and shows the largest MAE and RMSE values of all grades, indicating some challenge in estimating abilities with smaller samples. Across all other grades, we observe high correlations and relatively low MAE and RMSE values between the two methods. These findings suggest that the EFB method can achieve estimation results comparable to the traditional method without the labor-intensive and costly process of manually specifying item weights through expert judgment.
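The agreement statistics in Table 4 can be computed with a small helper such as the following; the two input vectors are hypothetical stand-ins for the EFB posterior means and the expert-weighted composite estimates.

```r
# Agreement between EFB posterior estimates and expert-weighted composites
agreement <- function(theta_efb, theta_composite) {
  d <- theta_efb - theta_composite
  c(correlation = cor(theta_efb, theta_composite),
    MAE  = mean(abs(d)),
    RMSE = sqrt(mean(d^2)))
}
# Example with hypothetical vectors:
# agreement(theta_efb = posterior_means, theta_composite = composite_scores)
```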
3.2. Empirical Data Based Simulation Study Demonstration
To effectively evaluate the accuracy of ability estimation recovery by the EFB method, it is essential to establish a benchmark for comparison. This section presents two simulation studies that utilize empirical data to evaluate the precision of the EFB method relative to the traditional SB method. The first simulation focuses on understanding how various assessment conditions such as assessment length and sample size influence the accuracy of ability recovery. In contrast, the second simulation examines the EFB method’s performance and sensitivity in recovering ability parameters for non-normally distributed groups of students, particularly in scenarios characterized by small sample sizes. These simulations are designed to rigorously test the robustness and real-world applicability of the EFB method in educational measurement contexts.
To ensure a fair comparison between the ability estimates derived from the EFB and SB methods, which may yield results on different IRT scales due to their distinct estimation techniques, it is crucial to align these estimates to a common scale (Natesan et al., 2016). For this purpose, the item and person (i.e., ability) parameters are linked to the generating scale of the simulation using the Stocking–Lord method (Stocking & Lord, 1983). This alignment allows a meaningful evaluation of the ability estimates across the divergent scales.
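For intuition, the Stocking–Lord criterion seeks the slope A and intercept B that make the transformed test characteristic curve (TCC) match the target scale’s TCC. The sketch below implements this for dichotomous 2PL items only, as a simplification; the study’s mixed-format linking would also involve the CR items, which dedicated linking packages handle in the general case.

```r
# Stocking-Lord linking for 2PL items (illustrative; CR items omitted).
# Scale transform: theta* = A*theta + B, so a* = a/A and b* = A*b + B.
tcc <- function(theta, a, b)
  sapply(theta, function(t) sum(plogis(a * (t - b))))

stocking_lord <- function(a_from, b_from, a_to, b_to,
                          theta = seq(-4, 4, length.out = 41)) {
  crit <- function(par) {
    A <- par[1]; B <- par[2]
    sum((tcc(theta, a_to, b_to) -
         tcc(theta, a_from / A, A * b_from + B))^2)  # squared TCC difference
  }
  optim(c(1, 0), crit)$par    # returns c(A, B)
}
# Linked abilities on the target scale: theta_linked <- A * theta_from + B
```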
The recovery of ability parameters is quantitatively assessed using RMSEs and the reliability of the estimates, comparing them against the simulated true values. Additionally, the Feldt–Raju reliability coefficient is employed to evaluate the internal consistency of the ability estimates (Shu & Schwarz, 2014). This coefficient is calculated as follows:

$$\rho_{FR} = \frac{\sigma_X^2 - \sum_i \sigma_i^2}{\sigma_X^2\left(1 - \sum_i \lambda_i^2\right)}, \qquad \lambda_i = \frac{\sigma_{iX}}{\sigma_X^2},$$

where $\lambda_i$ is the score weight for the $i$th item, $\sigma_{iX}$ is the covariance between the $i$th item score and the total score, $\sigma_i^2$ is the variance of the $i$th item, and $\sigma_X^2$ is the total score variance.
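Under these definitions, the coefficient can be computed directly from a student-by-item score matrix, as in the sketch below (scores is an assumed matrix of item scores).

```r
# Feldt-Raju reliability from a student-by-item score matrix (scores assumed)
feldt_raju <- function(scores) {
  total   <- rowSums(scores)
  var_tot <- var(total)                       # total score variance
  lambda  <- apply(scores, 2,                 # score weight per item
                   function(s) cov(s, total) / var_tot)
  var_i   <- apply(scores, 2, var)            # item score variances
  (var_tot - sum(var_i)) / (var_tot * (1 - sum(lambda^2)))
}
```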
To determine whether the recovery of the two methods differs substantively, we conducted ANOVA tests to identify significant differences between methods, flagging effects as statistically significant at the conventional level.
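Such a factorial ANOVA, together with the partial effect sizes reported below, can be obtained along the following lines; sim_results is an assumed data frame with one row per replication, and the factor names are illustrative.

```r
# Factorial ANOVA on RMSEs; sim_results is an assumed data frame with
# columns RMSE, Method, AbilityGroup, SampleSize, AssessmentLength.
fit <- aov(RMSE ~ Method * AbilityGroup * SampleSize * AssessmentLength,
           data = sim_results)
summary(fit)

# Partial eta-squared per effect: SS_effect / (SS_effect + SS_residual)
ss <- summary(fit)[[1]][, "Sum Sq"]           # last entry is Residuals
partial_eta2 <- ss[-length(ss)] / (ss[-length(ss)] + ss[length(ss)])
round(partial_eta2, 3)
```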
3.2.1. Simulation 1: Manipulating Assessment Conditions to Evaluate Ability Recovery
Simulation 1 Design
This simulation utilized item parameters sourced from previously calibrated empirical assessment data, detailed in the Supplementary Materials. The simulation manipulated three key variables, Ability Group, Sample Size, and Assessment Length, to explore their effects on ability recovery across different assessment scenarios. Three ability distributions were simulated: a normal distribution, a uniform distribution, and a negatively skewed normal distribution. Within each ability group, we manipulated the sample size and assessment length to generate various operational assessment conditions. The levels of these conditions are listed in Table 5; crossing all levels yields the full set of simulated conditions. For example, “20 (18 MC + 2 CR)” indicates a 20-item assessment consisting of 18 MC items and 2 CR items; MC items are dichotomous, and CR items have a maximum score of 2. Each unique combination of the manipulated conditions was randomized to ensure broad representation of potential testing scenarios, and each configuration was replicated 100 times, allowing a comprehensive analysis of the impact of these variables on the ability estimates produced under different assessment conditions.
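To make the design concrete, one replication of a single cell could be generated as in the sketch below; the item parameter ranges are hypothetical placeholders for the calibrated parameters in the Supplementary Materials, and the negatively skewed generator is one of several reasonable choices.

```r
# One replication for one simulation cell: N students, J_mc MC items (2PL),
# J_cr CR items (partial credit, max score 2). Parameters are placeholders.
set.seed(42)
simulate_cell <- function(N, J_mc, J_cr,
                          ability = c("normal", "uniform", "skewed")) {
  ability <- match.arg(ability)
  theta <- switch(ability,
    normal  = rnorm(N),
    uniform = runif(N, -3, 3),
    skewed  = as.numeric(-scale(rchisq(N, df = 4))))  # negative skew, mean 0, sd 1
  a_mc <- runif(J_mc, 0.8, 1.6); b_mc <- rnorm(J_mc)
  a_cr <- runif(J_cr, 0.8, 1.6)
  steps <- matrix(rnorm(J_cr * 2, sd = 0.8), J_cr, 2)  # two steps per CR item

  x <- sapply(1:J_mc, function(j)
    rbinom(N, 1, plogis(a_mc[j] * (theta - b_mc[j]))))
  y <- sapply(1:J_cr, function(j) {
    lp <- cbind(0, a_cr[j] * (theta - steps[j, 1]),
                   a_cr[j] * (theta - steps[j, 1]) +
                   a_cr[j] * (theta - steps[j, 2]))   # cumulative GPCM logits
    p <- exp(lp) / rowSums(exp(lp))
    apply(p, 1, function(pr) sample(0:2, 1, prob = pr))
  })
  list(theta = theta, mc = x, cr = y)
}
dat <- simulate_cell(N = 150, J_mc = 18, J_cr = 2, ability = "skewed")
```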
Simulation 1 Results
Each simulated data set was analyzed using both the EFB method and the traditional SB method. All Markov chains successfully converged, affirming robustness even in small-sample conditions. The design matrix detailing each simulation condition, along with the average RMSEs and reliability coefficients for ability estimation from both methods, is presented in Table 6; the standard deviations of these estimates are reported in parentheses alongside each value. The observed RMSE values are on the logit metric and are approximately 1.94 times the average standard error of measurement (SEM) across conditions, a relationship that aligns with theoretical expectations for well-calibrated IRT measurement models. The results underscore that larger sample sizes and longer assessments typically yielded more accurate ability estimates: while increasing assessment length produced a small improvement in ability recovery, increasing the sample size produced a larger gain in accuracy. The EFB method consistently shows lower RMSE than SB across all conditions, indicating that it tends to produce more accurate results. In terms of reliability, the EFB method generally outperforms SB, although the margin is narrower and varies more across conditions.
Notably, within the same ability group, at the reduced sample size of 150 the EFB method surpassed the SB method in both the accuracy and the reliability of the estimates, highlighting its efficacy with smaller data sets. The EFB method also showed enhanced performance for the uniform and skewed ability distributions, with significant improvements in both RMSE and reliability relative to the SB method. Under EFB, performance with normally distributed data did not differ significantly from performance with uniform or skewed distributions, and no significant differences were observed between the uniform and skewed distributions, suggesting that the EFB method is robust across distribution types. With the SB method, in contrast, we observed slight performance differences across distribution types, with larger standard deviations for the non-normal groups.
A factorial ANOVA was conducted on the RMSEs for ability recovery. The results, shown in Table 7, reveal significant main effects for the estimation method and for all three manipulated conditions. These effects were substantial, as evidenced by the large partial η² effect sizes. Among the three conditions, Sample Size was the most influential, with the greatest effect size (partial η² = 0.453), suggesting that a larger cohort of students enhances ability recovery. Ability Group was the next most influential condition, with an effect size of 0.280, confirming the impact of the population’s true ability distribution on the precision of estimation. These results corroborate the findings in Table 6, indicating that the number of students directly relates to ability recovery, and they underscore the need for cautious interpretation of ability estimates when sample sizes are small and the ability distribution is unknown.
Furthermore, the estimation method itself also had a substantial effect on ability recovery, with a partial η² of 0.328 in the ANOVA. This highlights the pivotal importance of the choice of estimation method for accurately recovering ability, and it indicates that EFB may provide more precise ability estimation, particularly in operational scenarios where the true ability distribution and sample size are not known in advance. Considered together with the results in Table 6, it is evident that selecting the EFB method under conditions with small sample sizes and ambiguous or non-normally distributed abilities may yield higher accuracy.
In addition, there are also significant interactions between estimation method and ability group, as well as between estimation method and sample size. These significant interactions indicate that the performance differences between EFB and SB methods vary depending on the ability distribution and sample size conditions. Specifically, the EFB method maintains more consistent performance across different ability distributions and sample sizes, while the SB method’s performance is more variable across these conditions.
3.2.2. Simulation 2: Recovery of Ability with Small-Sample Non-Normally Distributed Achievement
Simulation 2 Design
Simulation 2 employed empirical data sourced from summative Science assessments conducted in 2018, covering Grades 3 through 12 within a small local district in the State of Georgia. Note that not all grades have their own separate science assessment: Grades 3 and 4 share one assessment, and Grades 10 through 12 share another, while the remaining grades each have their own individual assessment. The total number of science assessment data sets in this study is therefore seven. The assessment for each grade level comprised 22 MC items and 3 CR items; MC items were scored dichotomously, while the CR items have a maximum score of 2. Table 8 summarizes the real-data assessment information for this simulation study.
All item and ability parameters utilized in this simulation had been previously calibrated and equated against an item bank established from earlier administrations. For the purposes of this simulation, we then used the previously obtained item and ability parameters to simulate all students’ responses to each item without replacement. The previously obtained item parameters are provided in the Supplementary Materials. Student ability estimates are not listed due to state law, but their distributions are discussed in this section.
Simulation 2 Results
Building on the findings from Simulation 1, where the EFB method demonstrated superior accuracy and reliability in ability estimation, especially for small samples, Simulation 2 further explored this effectiveness in small-sample contexts with sample sizes ranging from 469 to 870 students. Table 9 presents the ability recovery results from both methods. A consistent pattern emerges in the RMSE analysis: the EFB method records lower RMSE across all data sets, indicating more precise ability estimation than the traditional SB method. Notably, the smallest RMSE for EFB, 0.583, occurred in Assessment 3, which had the largest sample size of 870, underscoring the observation that larger sample sizes generally contribute to more accurate estimation. Conversely, Assessment 7, which exhibited the highest RMSE for EFB, still showed RMSE values lower than those achieved with the traditional SB method, affirming the EFB method’s superior performance irrespective of sample size variations.
Reliability metrics further reinforce the robustness of the EFB method, with the highest reliability observed in Assessment 3 at a coefficient of 0.930. This aligns with the RMSE data, suggesting that larger sample sizes not only enhance estimation accuracy but also improve reliability. Even in Assessment 7, where EFB had its highest RMSE, it still maintained a reliability rating superior to that of the traditional SB method. Additionally, the relatively minor variations in sample sizes across the seven assessments allow for a direct and meaningful comparison between the traditional SB and EFB methods, highlighting their respective efficacies in ability estimation. The consistent outperformance of EFB across these metrics suggests that it is a more robust method for ability estimation in operational assessment scenarios.
Figure 3 displays the pre-calibrated (regarded as “true”) ability distributions alongside the recovered ability distributions from the two estimation methods, for Assessments 1–7 from bottom to top; solid lines in each density plot represent the true ability distributions. In Figure 3, the red lines reflecting students’ true abilities show that Assessments 2–7 are positively skewed, while Assessment 1 is negatively skewed; Assessments 6 and 7 have the least skewness, so it is less visually apparent. The visualization distinctly demonstrates that the EFB method recovers the posterior ability distributions more accurately, closely mirroring the true ability distributions compared to the traditional SB method. This observation holds for both normally and non-normally distributed abilities. In assessments with a large number of students, the true ability distribution tends to resemble a normal distribution; when the student population is small, however, the true abilities may follow quite different distributions under certain operational scenarios, as shown in Figure 3. From this figure, we can see that EFB provides a useful alternative for calibrating mixed-format assessments with more precise results.
4. Discussion
The use of mixed-format assessments is very common and offers the advantage of measuring a wider range of skills than single-format assessments. Calibrating such assessments requires precise estimation of students’ overall abilities in the contemporary educational measurement field. This study introduced a new estimation method, EFB, with at least three contributions. First, it develops a fully Bayesian framework that effectively adapts and refines the empirical Bayes priors used by the SB method. Second, unlike traditional approaches that may require splitting the data for analysis, the EFB method maintains a holistic view of the data set, thereby enhancing the reliability of the ability estimates it produces. Finally, it demonstrates accurate calibration with small sample sizes while ensuring convergence.
For assessments in which dichotomous and polytomous items are scored separately, the EFB method offers a solution for combining the separate scores into a single scale score. This is especially beneficial when dichotomous and polytomous items are scored on divergent scales or when polytomous item scores are tabulated and released after the MC scores. By circumventing the conventional requirement for explicit weight allocation, the EFB method streamlines the integration of item-specific and composite ability estimates, supporting a holistic and nuanced interpretation of student performance. The method also extends to small-sample assessments, such as the Bureau of Indian Education’s summative assessments, where traditional IRT calibration methods may struggle to converge and to provide stable, reliable ability estimates. By leveraging the empirical Bayesian prior and fully Bayesian sampling, the EFB method was found to overcome some of these limitations, yielding more accurate and reliable parameter estimates. These practical implications highlight the versatility and utility of the EFB method in addressing common challenges in the calibration and scoring of mixed-format assessments.
The present research also acknowledges areas that could be addressed in next-stage studies. First, while the EFB methodology has demonstrated a capacity for more precise estimation, it also carries an elevated computational demand, especially in contrast with methods such as maximum likelihood estimation. This added computational expense underscores the need for more computationally efficient sampling strategies within the NUTS framework. Second, this study has focused predominantly on the applicability of the EFB methodology to unidimensional assessments. It is worth noting, however, that the foundational principle of employing empirical prior information under the EFB method does not strictly limit its application to unidimensional contexts; it only presupposes that all items assess either a single ability or a coherent set of abilities. This opens the possibility of applying EFB to multidimensional assessments, significantly broadening its utility. Finally, we want to highlight another critical consideration. As discussed, students with the same MC responses receive the same prior information under the current method, as suggested in Figure 1 of the empirical study. However, investigations of educational process data, such as students’ behavioral actions and eye-tracking data, show that students with identical responses can nonetheless demonstrate different ability levels on the content being measured. Therefore, using data about how students respond to assessment items could help us better understand their assessment-taking strategies and refine the starting point of score adjustment. In particular, we could tailor the initial empirical prior for students who obtain the same scores but engage with the assessment differently, for example through random guessing. Pursuing this direction could make methods for calibrating mixed-format assessments more precise and provide deeper insight into how students take assessments.
5. Conclusions
Overall, the proposed EFB method successfully addresses the research question on accurate composite ability score estimation in mixed-format assessments. Empirical data analyses consistently demonstrated that this approach provides higher empirical reliability in ability estimation and better reflects non-normally distributed ability groups. Through real-data-based simulations, we confirmed the EFB method’s superior recovery of ability estimations across various assessment conditions compared to traditional approaches.
From a practical standpoint, the EFB method achieves comparable results to traditional methods without requiring manual item weight specification, thus reducing the need for subject matter expert involvement and simplifying the calibration process. This approach proves particularly valuable when the true ability distribution is unclear or sample sizes are limited. These findings highlight the EFB method’s potential to streamline the assessment process while maintaining high standards of accuracy and reliability. Future research should focus on extensions to broader application scenarios such as multidimensional measurement and multimodal educational data mining.