Article

Improving the Measurement of Students’ Composite Ability Score in Mixed-Format Assessments

1 Department of Educational Psychology, The University of Georgia, Athens, GA 30602, USA
2 School of Electrical and Computer Engineering, The University of Georgia, Athens, GA 30602, USA
3 Research Evaluation and Methodology, The University of Florida, Gainesville, FL 32601, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(3), 374; https://doi.org/10.3390/educsci15030374
Submission received: 19 February 2025 / Revised: 17 March 2025 / Accepted: 17 March 2025 / Published: 18 March 2025
(This article belongs to the Section Education and Psychology)

Abstract

Mixed-format assessments, which include both multiple-choice (MC) and constructed-response (CR) items, often produce separate scoring scales, with MC items scored dichotomously and CR items scored polytomously. Conventional methods for estimating composite ability scores, such as weighting or summing, rely on subject matter expertise but overlook the information embedded in MC item scores. Recent work draws on empirical Bayes analysis to estimate composite ability scores, but it may introduce bias because it relies solely on point estimates and does not account for the variability in the unknown ability distributions and other parameters. To address these gaps, this study introduces a practical and easily implementable method, the empirical fully Bayesian method, that leverages MC item scores to derive empirical priors, leading to more accurate composite score estimates. MC scores were found to capture students’ achievement in the assessment domain effectively and thus provide valuable information for final scoring. Through empirical analyses of students in Grades 3 to 10 and two additional simulation studies based on real-world data, we demonstrate that this approach enhances composite ability score reliability, reduces reporting biases, and provides a valuable empirical evaluation tool for mixed-format assessments.

1. Introduction

Mixed-format assessments, which incorporate both multiple-choice (MC) and constructed-response (CR) items, are prevalent in current educational evaluations, comprising approximately 63% of statewide assessments in the United States (Lane, 2005). These assessments have gained prominence due to their ability to evaluate diverse cognitive skills through different item formats, providing a more comprehensive picture of student abilities (Xiong et al., 2024). Given the use of different item formats, these assessments typically generate two distinct score scales; the Graduate Record Examination (GRE; Educational Testing Service, 2012) is one example. In the GRE Verbal test, for instance, MC items are scored automatically and results are provided to students immediately, whereas CR items require detailed human evaluation, and their polytomous scores are reported at a later date. However, this dual scoring system creates a fundamental challenge in educational measurement: how to accurately synthesize these distinct measurements into a meaningful composite ability score that reflects a student’s true capabilities.
The complexity of integrating different scores from mixed-format assessments has spawned numerous psychometric approaches. Traditional methods have relied on item response theory (IRT; Baker & Kim, 2014) models, applying dichotomous models for MC items and polytomous models for CR items to calibrate the different score patterns (Ercikan et al., 1998; Rosa et al., 2001). Although this approach may produce a rough estimate of students’ abilities, it raises persistent questions regarding the proportional weight each item format contributes to the final composite ability score. That is, this method relies on an underlying psychometric assumption that allocates weights based on item reliability (i.e., discrimination), thereby optimizing the scores primarily for reliability, which may diverge from the information provided by the two separate ability scores (Sykes et al., 2001). Alternative approaches using weighted linear combinations have emerged (Thissen et al., 2001), but these methods suffer from two key limitations: the absence of a consensus on optimal weight selection and the labor-intensive nature of expert-driven weight determination. Furthermore, the reliance on approximation methods within these frameworks often compromises the precision of the final ability estimates.
Recent advances in Bayesian methods have opened new avenues for addressing these limitations by leveraging prior knowledge about unknown abilities (Smid et al., 2020). Among these methods, Xiong et al. (2023) introduced a two-step Sequential Bayesian (SB) approach based on empirical Bayes theory, which effectively uses empirical prior information to estimate ability values. In the first step, this method determines the prior distribution from a portion of the response score data for each student; in the next step, this empirical prior is used to compute the posterior estimates of ability from the rest of the data. Through this two-step framework, the approach eliminates the need for explicit weight assignments, as the empirical Bayes technique automatically adjusts the weights during calibration using priors. Although this allows results from different item formats to be integrated into a composite and comprehensive ability score, the practical application of the empirical Bayes method, often described as a “pseudo-Bayesian” approach, may present challenges in certain situations. This method involves estimating hyperparameters directly from the data, which are then treated as fixed values within the Bayesian model (Williams & Savitsky, 2021). Such an approach may lead to a significant underestimation of uncertainty, as it fails to account for the variance in the hyperparameters themselves. Additionally, this method derives priors from dichotomous scores and subsequently applies these priors to calibrate polytomous scores. This separate calibration can compromise the reliability of the estimates, particularly in assessments with limited items, as it effectively splits the data and diminishes the influence of MC items by not incorporating them fully into the calibration of CR responses. Moreover, the SB method’s reliance on maximum likelihood estimation of item parameters before estimating ability parameters effectively fixes the scale of the ability parameters, potentially constraining the flexibility of the model and introducing bias in the final ability estimates.
To address these methodological shortcomings, this study introduces the Empirical Fully Bayesian (EFB) method as an extension of the original two-step analysis. The EFB method modifies and extends the traditional SB method by integrating a fully Bayesian sampler, the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2014). First, unlike the SB method’s fixed-hyperparameter approach, EFB obtains posterior estimates by integrating over the full posterior distribution, providing a more comprehensive representation of uncertainty in the parameter estimates. Second, where SB relies on maximum likelihood estimation that fixes the scale of the item parameters, EFB estimates item and ability parameters simultaneously through fully Bayesian sampling, allowing for more flexible parameter estimation and better accounting for the interdependence between item and ability parameters. Third, in contrast to SB’s sequential calibration process, which potentially compromises reliability, EFB uses a unified calibration approach that processes MC and CR items together, maintaining the integrity of information from both formats throughout the estimation process.
The structure of this study is as follows. First, we describe the alternative EFB method and illustrate the estimation procedure using specific IRT models. Then, we analyze empirical data using our method to demonstrate its capacity to provide more comprehensive insight than the original method. After that, we further validate the recovery of ability parameters through two simulation studies utilizing additional empirical data sets. Finally, we discuss the broader implications and limitations of our findings and suggest potential directions for future research.

2. Materials and Methods

2.1. Overview of the Empirical Fully Bayesian Method

Figure 1 presents a flowchart outlining the EFB method. In this approach, students’ responses to dichotomous MC items are first calibrated using a selected dichotomous IRT model to generate an empirical prior distribution for each student through a Bayesian sampling process. This empirical prior is then utilized in another sampling process to derive the posterior distributions from the complete mixed-format score data, which include both MC and CR scores.

2.2. Mathematical Statistics of the EFB Method

Suppose a mixed-format unidimensional assessment contains n MC items and m CR items (i = 1, ..., n, n + 1, ..., n + m). Let U_j = (U_MC, U_CR) = (u_1, ..., u_n, ..., u_{n+m}) represent the mixed-format response score data from the j-th student. Assume the MC data U_MC can yield true ability estimates θ = (θ_1, ..., θ_J) for all students. The θ’s can be sampled from a population with hyperparameters η, given a probability distribution f(θ | η). Bayes’ theorem then yields Equation (1):
f(\theta) = f(\theta \mid U_{MC}) = \frac{f(U_{MC} \mid \theta)}{f(U_{MC})} \int f(\theta \mid \eta)\, f(\eta)\, d\eta \qquad (1)
This integral is computed by NUTS, and the estimated ability distribution f(θ) is regarded as empirical prior information for subsequent analyses. As the fully Bayesian method selected in this study, NUTS is an extension of the Hamiltonian Monte Carlo (HMC; Duane et al., 1987) method. NUTS enhances HMC by automating the tuning of its hyperparameters, thus alleviating the need for manual adjustments during model convergence. Designed to improve the user-friendliness and robustness of HMC, NUTS incorporates principles from Hamiltonian dynamics to draw samples efficiently. This enables NUTS to reach posterior distributions more rapidly and at lower computational cost than random-walk Monte Carlo, significantly optimizing the sampling process.
For each ability parameter θ_j with probability density function f(θ), the fully Bayesian sampling introduces a set of auxiliary momentum variables r_j to aid convergence. The momentum variables are drawn independently from a normal distribution p(r) ~ N(0, Σ), where Σ is a covariance matrix that can be interpreted as a Euclidean metric used to rotate and scale the target distribution (Hoffman & Gelman, 2014). The target distribution refers to the probability distribution from which we want to generate samples; typically, this distribution is intractable to sample from directly. Therefore, a Hamiltonian function, defined through the joint density of θ and r, is introduced as in Equation (2) to help sample from the target distribution.
p(\theta, r) = p(r \mid \theta)\, p(\theta) = \exp\left[\log p(\theta) + \log p(r \mid \theta)\right] = \exp\left[H(\theta, r)\right] \qquad (2)
where H is the Hamiltonian function defined in Equation (3):
H(\theta, r) = L(\theta) + K(r) \qquad (3)
where L(θ) = log p(θ) and K(r) = log p(r | θ) are the two intermediate components. The sampling process in NUTS uses Hamiltonian dynamics to keep the Hamiltonian function in Equation (3) invariant. NUTS eliminates the need to specify the number-of-steps parameter in the fully Bayesian framework and allows automatic tuning of these parameters by introducing a termination criterion that indicates whether the simulated trajectory is long enough to yield sufficient results. The specific sampling process used by NUTS is presented in the Supplementary Materials.
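To make these sampling mechanics concrete, the following base R sketch implements a plain HMC step with a fixed number of leapfrog steps for a one-dimensional illustrative target; NUTS replaces the fixed trajectory length with the adaptive termination criterion described above. The target density, step size, and unit momentum scale are illustrative assumptions, not the models or settings used in this study.

```r
# Minimal Hamiltonian Monte Carlo sketch in base R (fixed leapfrog steps).
# NUTS extends this by choosing the trajectory length automatically.
log_p    <- function(theta) dnorm(theta, 0, 1, log = TRUE)  # illustrative target: L(theta) = log p(theta)
grad_log <- function(theta) -theta                           # gradient of the log target

hmc_step <- function(theta, eps = 0.1, n_leapfrog = 20) {
  r <- rnorm(1)                              # auxiliary momentum, here r ~ N(0, 1)
  theta_new <- theta; r_new <- r
  for (l in seq_len(n_leapfrog)) {           # leapfrog integration of Hamiltonian dynamics
    r_new     <- r_new + 0.5 * eps * grad_log(theta_new)
    theta_new <- theta_new + eps * r_new
    r_new     <- r_new + 0.5 * eps * grad_log(theta_new)
  }
  # Metropolis correction based on the change in log p(theta) - r^2 / 2
  log_accept <- (log_p(theta_new) - 0.5 * r_new^2) - (log_p(theta) - 0.5 * r^2)
  if (log(runif(1)) < log_accept) theta_new else theta
}

draws <- numeric(3000)
for (s in 2:3000) draws[s] <- hmc_step(draws[s - 1])  # e.g., discard early draws as burn-in
```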
After obtaining the empirical prior distribution for each student, the posterior ability estimation also follows the above NUTS procedures. Equation (4) shows the posterior density of the ability given the mixed-format data:
f(\theta \mid U_{MC}, U_{CR}) \propto f(U_{MC}, U_{CR} \mid \theta)\, f(\theta) \qquad (4)
where f(U_MC, U_CR | θ) represents the likelihood of the complete item response score data given the model parameters, and f(θ) is the empirical prior distribution.

2.3. Item Response Theory Model Selections

Section 2.2 provides the mathematical and statistical foundations of the EFB method. This section introduces two IRT models to specify the probability distribution of students’ response patterns across both MC and CR items. Students’ dichotomous MC response score data are modeled by a two-parameter logistic model (Birnbaum, 1968) as shown in Equation (5):
P_i(\theta_j) = \frac{\exp\left[D a_i (\theta_j - b_i)\right]}{1 + \exp\left[D a_i (\theta_j - b_i)\right]} \qquad (5)
where P_i(θ_j) is the probability of obtaining a correct response on the i-th item given the j-th student’s ability θ_j, b_i is the MC item difficulty, a_i is the MC item discrimination, and D is the scaling constant, equal to 1.7. Students’ CR responses, assumed to be scored in ordered polytomous categories (i.e., v_i = 0, 1, 2, ..., z_i), are modeled by the graded response model (Samejima, 1969) in Equation (6):
P_{ik}(\theta_j) = P(v_i \ge k \mid \theta_j) =
\begin{cases}
1, & k = 0 \\
\frac{\exp\left[D a_i (\theta_j - \beta_{ik})\right]}{1 + \exp\left[D a_i (\theta_j - \beta_{ik})\right]}, & 1 \le k \le z_i \\
0, & k > z_i
\end{cases} \qquad (6)
where P_{ik}(θ_j) is the cumulative category response function representing the probability that the j-th student’s response to the i-th item is at or above the ordered k-th score category given ability θ_j, β_{ik} is the CR item step difficulty parameter for category (k + 1) (the baseline score 0 has no step difficulty), and D is the same as in Equation (5). The probability of responding in score category k is then obtained by subtracting the cumulative probability of the next higher category, P_{i(k+1)}(θ_j), from P_{ik}(θ_j).
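To make the two measurement models concrete, the base R sketch below evaluates Equations (5) and (6) directly, including the subtraction step that turns cumulative probabilities into category probabilities; the item parameters and ability value in the usage lines are illustrative assumptions, not estimates from this study.

```r
# Two-parameter logistic (2PL) probability of a correct MC response, Equation (5)
p_2pl <- function(theta, a, b, D = 1.7) {
  1 / (1 + exp(-D * a * (theta - b)))
}

# Graded response model (GRM) category probabilities for one CR item.
# beta is the ordered vector of step difficulties (beta_i1, ..., beta_iz).
p_grm <- function(theta, a, beta, D = 1.7) {
  p_cum <- 1 / (1 + exp(-D * a * (theta - beta)))  # cumulative P_ik for k = 1, ..., z (Equation (6))
  p_cum <- c(1, p_cum, 0)                          # bound with P_i0 = 1 and P_i(z+1) = 0
  -diff(p_cum)                                     # probabilities of categories 0, 1, ..., z
}

# Illustrative values only: ability 0.5, a short-answer CR item with maximum score 2
p_2pl(theta = 0.5, a = 1.2, b = -0.3)
p_grm(theta = 0.5, a = 1.0, beta = c(-0.5, 0.8))
```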

2.4. Sequential Bayesian Method

The conventional SB method uses the Expected A Posteriori (EAP) method to obtain the ability score estimate. Through marginal maximum likelihood estimation with the expectation–maximization algorithm, it first obtains all item parameters and then uses those item parameters to estimate the ability parameters. Denote all item parameters, including a_i, b_i, and β_ik, by ξ. In the first, MC calibration step, the MC item parameters ξ_MC are used to obtain the prior ability distribution estimate as in Equation (7):
g(\theta_j^{(1)} \mid U_{MC}, \xi_{MC}) = \frac{P(U_{MC,j} \mid \theta_j^{(1)}, \xi_{MC})\, g(\theta^{(0)})}{P(U_{MC,j})} \qquad (7)
where g(θ^(0)) is the initial prior ability distribution used to draw the estimate of θ_j^(1); it is held identical for all students in the calibration. This initial MC calibration step yields an estimate of each student j’s ability, θ_j^(1), together with its standard error σ_j^(1). These values were used to construct J normal distributions, denoted g(θ_j^(1)) ~ N(θ_j^(1), (σ_j^(1))^2), which serve as the individual prior latent ability distributions in the second estimation step.
In the second step, after obtaining the CR item parameters ξ_CR and using the individual prior distributions g(θ_j^(1)), the estimate of ability given the response pattern U_CR can be obtained as in Equation (8):
\hat{\theta}_j = \frac{\int \theta_j\, P(U_{CR} \mid \theta_j)\, g(\theta_j^{(1)})\, d\theta_j}{P(U_{CR})} \qquad (8)
The standard error is given by Equation (9):
\hat{\sigma}^2(\hat{\theta}_j) = \frac{\int (\theta_j - \hat{\theta}_j)^2\, P(U_{CR} \mid \theta_j)\, g(\theta_j^{(1)})\, d\theta_j}{P(U_{CR})} \qquad (9)
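For illustration, the second-step computation in Equations (8) and (9) can be approximated over a quadrature grid as in the sketch below. It reuses the GRM function sketched in Section 2.3, and the CR item parameters, scores, and first-step prior are illustrative assumptions.

```r
# Second-step EAP estimate and standard error for one student over a quadrature grid.
# Assumes p_grm() from Section 2.3; the first-step prior g(theta^(1)) ~ N(0.3, 0.5^2),
# the CR item parameters, and the observed CR scores are illustrative assumptions.
theta_grid <- seq(-4, 4, length.out = 81)
prior      <- dnorm(theta_grid, mean = 0.3, sd = 0.5)

cr_pars   <- list(list(a = 1.0, beta = c(-0.5, 0.8)),
                  list(a = 0.8, beta = c(-1.0, 0.2)))
cr_scores <- c(1, 2)                                 # observed CR scores for this student

# Likelihood of the CR response pattern at each quadrature point
lik <- sapply(theta_grid, function(th)
  prod(mapply(function(par, x) p_grm(th, par$a, par$beta)[x + 1], cr_pars, cr_scores)))

post      <- lik * prior
post      <- post / sum(post)                        # normalized posterior weights
theta_eap <- sum(theta_grid * post)                  # Equation (8)
se_eap    <- sqrt(sum((theta_grid - theta_eap)^2 * post))  # Equation (9)
```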

3. Results

In this section, we present both an empirical real-data demonstration and two empirical data-based simulations to comprehensively evaluate the proposed EFB method. The real-data analysis illustrates the practical applicability of our approach, while the simulations, informed by additional empirical data, allow us to systematically examine the robustness of EFB under controlled conditions such as assessment lengths, sample sizes, and ability groups, as compared to the conventional SB method.

3.1. Empirical Study Demonstration

3.1.1. Data Description and Analysis

This section utilizes empirical score data from Grade 3 to Grade 10 English Language Arts (ELA) assessments administered in several school districts in Georgia, United States, to showcase the application of the EFB method. These ELA assessments are designed to evaluate student proficiency in extended reasoning and critical thinking, with a particular focus on skills and knowledge essential for reading and argumentative writing. To provide a clear understanding of the data, summary statistics are presented in Table 1. The data set encompasses a wide range of student participation across grades: Grade 6 records the smallest cohort with 646 students, while Grade 9 includes the largest at 7017 students. The structure of the ELA assessments varies by grade level. Grades 3 through 6 feature shorter tests, each comprising three MC items and two CR items; Grades 7 and 8 share a longer format with eight MC items and three CR items, whereas Grades 9 and 10 are composed of 13 MC items and 5 CR items. All MC items are scored dichotomously. Among the CR items, all but the final item in each assessment are short-answer items with a maximum score of 2; the last CR item is a long essay item, scored up to 4. The last column, “Points Possible”, indicates the total MC plus CR item points in each assessment.
All analyses were conducted in R 4.3.0 (R Core Team, 2023). For each analysis, three Markov chains were utilized, with each chain undergoing 3000 iterations. To ensure the stability and reliability of the results, the first 1000 iterations of each chain were discarded as burn-in. This approach helps mitigate the impact of initial values on the analysis, thereby enhancing the accuracy and robustness of the final estimations.
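The study does not state which NUTS implementation was used; rstan is one common R interface whose default sampler is NUTS. The sketch below only illustrates the chain configuration reported above (three chains, 3000 iterations, 1000 warmup) on a deliberately trivial toy model; the actual mixed-format IRT model code is not reproduced here.

```r
library(rstan)

# Toy Stan model used solely to illustrate the chain settings; not the study's IRT model.
toy_model <- "
data { int<lower=1> N; vector[N] y; }
parameters { real theta; }
model {
  theta ~ normal(0, 1);
  y ~ normal(theta, 1);
}
"

fit <- stan(model_code = toy_model,
            data = list(N = 20, y = rnorm(20)),
            chains = 3, iter = 3000, warmup = 1000)  # 1000 warmup iterations discarded per chain
```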

3.1.2. Empirical Real-Data Results

Prior to deploying the EFB method, results of factor analyses are first presented in Table 2. These analyses serve to ascertain the dimensional structure of each assessment data set. For instance, in the Grade 3 data set, the eigenvalue for the first factor was substantially higher, at 6.52, accounting for 18.94% of the total variance, than that of the subsequent factor. A similar pattern can be observed for the other grades. Because Grades 7–10 include more MC and CR items than Grades 3–6, the second factor accounted for relatively more variance in Grades 7–10. However, the first factor in these grades still explains the largest share of the variance, and there remains a notable drop in eigenvalues from the first factor to the next. This pattern indicates a strong dominant single factor, supporting the unidimensionality of all the assessment data.
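As a rough sketch of how such an eigenvalue check can be carried out, a simplified version based on the Pearson correlation matrix of item scores is shown below; the data frame name is hypothetical, and the study's exact factor-analysis procedure may have differed.

```r
# Quick dimensionality check: eigenvalues of the item-score correlation matrix.
# `item_scores` is a hypothetical data frame with one column per MC/CR item.
ev <- eigen(cor(item_scores, use = "pairwise.complete.obs"))$values
round(ev, 2)                      # inspect the drop after the first eigenvalue
round(100 * ev / sum(ev), 2)      # percentage of variance associated with each factor
```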
The ELA data sets were analyzed using the EFB method. Figure 2 illustrates this analysis through scatterplots that compare all students’ posterior ability estimates with their prior ability estimates for each grade, alongside their respective distributions. The distribution curves for the posterior ability estimates are depicted on the right side of each scatterplot, while those for the prior ability estimates are shown at the top. A positive correlation between the prior and posterior estimates is expected, and this expectation holds regardless of the test designs with varying numbers of CR item points. The scatterplots in Figure 2 consistently reveal a positive linear relationship between the posterior and prior ability estimates, although some visual differences can be observed between Grades 7–10 and Grades 3–6. Grades 7–10 have larger student populations and therefore more diverse score combinations. With a higher number of students and items, these grades exhibit a more diverse pattern of estimates, reflecting the increased variability in student responses. In contrast, Grades 3–6, with smaller student populations, show more overlapping points in their scatterplots. This is primarily due to the smaller number of students and items, which increases the likelihood of students having identical response patterns across multiple items. For example, in Grades 3–6, where each ELA assessment includes only three MC items, some students’ response score vectors are identical, leading to identical prior ability estimates for these students. This similarity is visually represented by several vertical lines within the scatterplots. As a result of using these identical prior values, the EFB method produces a noticeably narrower posterior distribution in these grades. For Grades 7–10, such pronounced narrowing of the posterior distribution is not observed. However, it is notable that while the prior ability estimates generally follow a roughly normal distribution, the posterior distributions in certain grades exhibit non-normal characteristics: Grades 7–10 show right-skewed distributions.
Furthermore, an analysis not depicted in Figure 2 but crucial to our findings shows that the EFB method achieves high reliability in the posterior estimates. For example, among all grades, Grade 6 demonstrated the lowest reliability coefficient of 0.831, with a mean posterior ability of −0.204 and a standard deviation of 0.770. Grade 7 showed an intermediate reliability of 0.880, with a mean posterior ability of −0.113 and a standard deviation of 0.878. Grade 9 exhibited the highest reliability coefficient of 0.909, with a mean posterior ability of 0.048 and a standard deviation of 0.945. Overall, these findings illustrate the EFB method’s consistent performance, with reliability coefficients uniformly exceeding 0.830 across all examined grades.
The analysis of posterior ability measurement precision reveals variability in the standard error of measurement (SEM) and conditional standard error of measurement (CSEM) across ability levels and grades. Table 3 lists the SEM/CSEM at the 25th, 50th (median), and 75th percentile ability estimates across Grades 3–10, along with the overall SEM for each grade. Measurement precision was generally good across all grades, with SEM values ranging from 0.29 to 0.41. Grade 8 demonstrated the highest overall precision (SEM = 0.30), while Grade 3 showed the lowest (SEM = 0.39).
In practice, each of the assessments has been calibrated to produce three ability estimates: one derived from MC items, another from CR items, and a composite ability estimate that integrates both, using weights determined by subject matter experts. To evaluate the efficacy of the EFB method, we performed a comparative analysis between the posterior ability estimates from the EFB method and the traditional composite ability estimates that incorporate expert-determined weights. Table 4 presents the correlations, mean absolute errors (MAEs), and root mean square errors (RMSEs) for each grade’s assessment, comparing the two sets of ability estimates. The correlation between the EFB and traditional methods remains high for Grade 6, despite this grade having the smallest student cohort and showing the largest MAE and RMSE values compared to other grades. This indicates some challenges in estimating abilities with smaller samples. However, across all other grades, we observe high correlations and relatively low MAE and RMSE values between the two methods. These findings suggest that the EFB method can achieve estimation results comparable to the traditional method without requiring the labor-intensive and costly process of manually specifying item weights through expert judgment.

3.2. Empirical Data Based Simulation Study Demonstration

To effectively evaluate the accuracy of ability estimation recovery by the EFB method, it is essential to establish a benchmark for comparison. This section presents two simulation studies that utilize empirical data to evaluate the precision of the EFB method relative to the traditional SB method. The first simulation focuses on understanding how various assessment conditions such as assessment length and sample size influence the accuracy of ability recovery. In contrast, the second simulation examines the EFB method’s performance and sensitivity in recovering ability parameters for non-normally distributed groups of students, particularly in scenarios characterized by small sample sizes. These simulations are designed to rigorously test the robustness and real-world applicability of the EFB method in educational measurement contexts.
To ensure a fair comparison between the ability estimates derived from the EFB and SB methods, which may yield results on different IRT scales due to their distinct estimation techniques, it is crucial to align these estimates to a common scale (Natesan et al., 2016). For this purpose, the item and person (i.e., ability) parameters are linked to the generating scale of the simulation using the Stocking–Lord method (Stocking & Lord, 1983). This alignment allows for a meaningful evaluation of the ability estimates across the divergent scales.
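For the dichotomous (2PL) items, a compact sketch of the Stocking–Lord criterion is shown below: the slope A and intercept B that place the estimated parameters on the generating scale are found by minimizing the squared distance between test characteristic curves over a quadrature grid. The parameter vectors and grid are hypothetical, and operational linking would typically rely on dedicated equating software.

```r
# Stocking-Lord linking sketch for 2PL items: find slope A and intercept B that
# minimize the squared distance between test characteristic curves on a grid.
# `a_gen`, `b_gen` (generating scale) and `a_est`, `b_est` (estimated scale) are
# hypothetical parameter vectors of equal length.
stocking_lord <- function(a_gen, b_gen, a_est, b_est, D = 1.7,
                          grid = seq(-4, 4, length.out = 41)) {
  tcc <- function(theta, a, b) sum(1 / (1 + exp(-D * a * (theta - b))))
  crit <- function(par) {
    A <- par[1]; B <- par[2]      # theta_linked = A * theta + B, so a -> a / A, b -> A * b + B
    sum(sapply(grid, function(th)
      (tcc(th, a_gen, b_gen) - tcc(th, a_est / A, A * b_est + B))^2))
  }
  optim(c(1, 0), crit)$par        # returns the estimated slope A and intercept B
}
```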
The recovery of ability parameters is quantitatively assessed using RMSEs and the reliability of the estimates, compared against the simulated true values. Additionally, the Feldt–Raju reliability coefficient is employed to evaluate the internal consistency of the ability estimations (Shu & Schwarz, 2014). This coefficient is calculated as in Equation (10):
\alpha_{FR} = \frac{1}{1 - \sum_{i=1}^{n+m} \lambda_i^2} \left( 1 - \frac{\sum_{i=1}^{n+m} \sigma_i^2}{\sigma_x^2} \right) \qquad (10)
where λ_i = σ_ix / σ_x^2 is the score weight for the i-th item, σ_ix is the covariance between the i-th item score and the total score, σ_i^2 is the variance of the i-th item, and σ_x^2 is the total-score variance.
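A direct base R translation of Equation (10) might look as follows; the item-score matrix is a hypothetical input with one row per student and one column per item.

```r
# Feldt-Raju reliability computed from an item-score matrix (rows = students, columns = items).
feldt_raju <- function(scores) {
  total    <- rowSums(scores)
  sigma_x2 <- var(total)                                               # total-score variance
  lambda   <- apply(scores, 2, function(x) cov(x, total)) / sigma_x2   # item score weights
  sigma_i2 <- apply(scores, 2, var)                                    # item variances
  (1 / (1 - sum(lambda^2))) * (1 - sum(sigma_i2) / sigma_x2)           # Equation (10)
}
```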
To determine substantive differences in recovery between the two methods, we conducted ANOVA tests to identify significant differences, with p < 0.05 indicating statistical significance.
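As a sketch, such an ANOVA on replication-level RMSE values could be specified in base R as follows; the data frame and factor names are hypothetical.

```r
# Factorial ANOVA on ability-recovery RMSE. `rmse_df` is a hypothetical data frame with
# one row per replication and factor-coded columns for the method and design conditions.
fit <- aov(rmse ~ method + ability_group + sample_size + assessment_length, data = rmse_df)
summary(fit)   # F statistics and p-values for each main effect
```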

3.2.1. Simulation 1: Manipulating Assessment Conditions to Evaluate Ability Recovery

Simulation 1 Design

This simulation utilized item parameters sourced from previously calibrated empirical assessment data, detailed in the Supplementary Materials. The simulation manipulated three key variables, Ability Group, Sample Size, and Assessment Length, to explore their effects on ability recovery across different assessment scenarios. Three ability distributions were simulated: a normal distribution, a uniform distribution, and a negatively skewed normal distribution. Within each ability group, we manipulated the sample size and assessment length to generate various operational assessment conditions. The levels of these conditions are listed in Table 5, yielding a total of 3 × 3 × 3 = 27 conditions. For example, “20 (18 MC + 2 CR)” indicates a 20-item assessment consisting of 18 MC items and 2 CR items. MC items are dichotomous, and CR items have a maximum score of 2. Each unique combination of the manipulated conditions was randomized to ensure a broad representation of potential testing scenarios. To enhance the robustness of the findings, each condition was replicated 100 times, allowing for a comprehensive analysis of the impacts of these variables on the ability estimates produced under different assessment conditions.
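As an illustration, one replication of a single cell of this design could be generated as in the sketch below, reusing the 2PL and GRM functions sketched in Section 2.3. The parameter-generating distributions are illustrative assumptions; in the study, item parameters were drawn from previously calibrated assessments.

```r
# Simulate one replication of the "20 items (18 MC + 2 CR), N = 150, normal ability" cell.
set.seed(1)
N     <- 150
theta <- rnorm(N, 0, 1)                                 # normal ability group
a_mc  <- runif(18, 0.8, 2.0); b_mc <- rnorm(18, 0, 1)   # 18 MC items (2PL), illustrative draws
a_cr  <- runif(2, 0.8, 2.0)
beta_cr <- lapply(1:2, function(i) sort(rnorm(2)))      # 2 CR items with maximum score 2

mc_resp <- sapply(1:18, function(i)
  rbinom(N, 1, p_2pl(theta, a_mc[i], b_mc[i])))         # dichotomous MC responses
cr_resp <- sapply(1:2, function(i)
  sapply(theta, function(th)
    sample(0:2, 1, prob = p_grm(th, a_cr[i], beta_cr[[i]]))))  # polytomous CR responses
```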

Simulation 1 Results

Each simulated data set was analyzed using both the EFB method and the traditional SB method. All Markov chains successfully converged, affirming robustness even in small-sample conditions. The design matrix detailing each simulation condition, along with the average RMSEs and reliability scores for ability estimation from both methods, is presented in Table 6. The standard deviations of these estimates are reported in parentheses alongside each value. The observed RMSE values in our study are on the logit metric and are approximately 1.94 times the average standard error of measurement (SEM) across conditions. This relationship aligns with theoretical expectations for well-calibrated measurement models in IRT. The results underscore that larger sample sizes and longer assessments typically yielded higher accuracy in ability estimation. While increasing the assessment length produced a small improvement in the recovery of ability, increasing the sample size produced more accurate ability estimates. The EFB method consistently shows a lower RMSE across all conditions compared to SB, indicating that it tends to produce more accurate results. In terms of reliability, the EFB method generally outperforms SB, although the margin is narrower and varies more across conditions.
Notably, under the same ability group with a reduced sample size of 150, the EFB method surpassed the SB method in both accuracy and reliability of the estimates. This highlights the EFB method’s efficacy in handling smaller data sets effectively. The EFB method demonstrated enhanced performance across uniform and skewed ability distributions, with significant improvements in both RMSEs and reliability compared to the SB method. When using the EFB method, performance with normally distributed data was not significantly different from performance with uniform and skewed distributions. In addition, no significant differences were observed in the recovery of estimation between uniform and skewed distributions, suggesting that the EFB method’s performance is robust across varied distribution types. However, with the SB method, we observed slight differences in performance across distribution types, with larger standard deviations for non-normal groups.
A factorial ANOVA was conducted on the RMSEs for ability recovery. The results, shown in Table 7, reveal significant main effects for the estimation method and for all three manipulated conditions (p < 0.001). Notably, these effects were substantial, as evidenced by the large partial η2 effect sizes. Among the three conditions, Sample Size emerged as the most influential, with the greatest effect size (0.453), suggesting that a larger cohort of students enhances ability recovery. Ability Group was the next most influential condition, with an effect size of 0.280, confirming the impact of the true ability distribution in the population on the precision of estimation. These results corroborate the findings in Table 6, indicating that the number of students directly relates to the quality of ability recovery. Moreover, these findings underscore the need for cautious interpretation of ability estimates when dealing with small sample sizes and unknown ability distributions.
Furthermore, the estimation method itself also yielded a substantial effect on ability recovery, as evidenced by a partial η 2 of 0.328 based on the ANOVA analysis. This highlights the pivotal importance of the estimation method choice in accurately recovering ability. This also indicates that EFB may provide more precise ability estimation, particularly in operational scenarios where the true ability group and sample size are unknown. Considering the results from Table 6, it is evident that selecting EFB methods in conditions with small sample sizes and ambiguous or non-normally distributed ability may result in higher levels of accuracy.
In addition, there are also significant interactions between estimation method and ability group, as well as between estimation method and sample size. These significant interactions indicate that the performance differences between EFB and SB methods vary depending on the ability distribution and sample size conditions. Specifically, the EFB method maintains more consistent performance across different ability distributions and sample sizes, while the SB method’s performance is more variable across these conditions.

3.2.2. Simulation 2: Recovery of Ability with Small-Sample Non-Normally Distributed Achievement

Simulation 2 Design

Simulation 2 employed empirical data sourced from summative Science assessments conducted in 2018, covering Grades 3 through 12 within a small local district in the State of Georgia. Note that not all grades have their own separate science assessment; for example, Grades 3 and 4 share one assessment, and Grades 10, 11, and 12 share one assessment, while the remaining grades each have their own individual assessment. Therefore, a total of seven science assessment data sets are used in this study. The assessment for each grade level comprised 22 MC items and 3 CR items. The MC items were scored dichotomously, while the CR items have a maximum score of 2. Table 8 summarizes the real-data assessment information for this simulation study.
All item and ability parameters utilized in this simulation had been previously calibrated and equated against an item bank established from earlier administrations. For the purposes of this simulation, we used the previously obtained item and ability parameters to simulate all students’ responses to each item without replacement. All previously obtained item parameters are provided in the Supplementary Materials. Student ability estimates are not listed due to state law, but their distributions are discussed in this section.

Simulation 2 Results

Building on the findings from Simulation 1, where the EFB method demonstrated superior accuracy and reliability in ability estimation, especially for small samples, Simulation 2 further explored this effectiveness in small-sample contexts with sample sizes ranging from 469 to 870 students.
Table 9 presents the results of ability recovery from both methods. A consistent pattern emerges in the RMSE analysis, where the EFB method invariably records lower RMSE across all data sets, indicating more precise ability estimations compared to the traditional SB method. Notably, the smallest RMSE for EFB, recorded at 0.583, occurred in Assessment 3, which had the largest sample size of 870. This underscores the observation that larger sample sizes generally contribute to more accurate estimations. Conversely, Assessment 7, which exhibited the highest RMSE for EFB, still showed RMSE values lower than those achieved using the traditional SB method, affirming the EFB method’s superior performance irrespective of sample size variations.
Reliability metrics further reinforce the robustness of the EFB method, with the highest reliability observed in Assessment 3 at a coefficient of 0.930. This aligns with the RMSE data, suggesting that larger sample sizes not only enhance estimation accuracy but also improve reliability. Even in Assessment 7, where EFB had its highest RMSE, it still maintained a reliability rating superior to that of the traditional SB method. Additionally, the relatively minor variations in sample sizes across the seven assessments allow for a direct and meaningful comparison between the traditional SB and EFB methods, highlighting their respective efficacies in ability estimation. The consistent outperformance of EFB across these metrics suggests that it is a more robust method for ability estimation in operational assessment scenarios.
Figure 3 displays the pre-calibrated (regarded as “true”) ability distributions alongside the recovered ability distributions from the two estimation methods for Assessments 1–7, from bottom to top. Solid lines in each density plot represent the true ability distributions. In Figure 3, the red lines reflecting students’ true ability show that Assessments 2–7 are positively skewed, while Assessment 1 is negatively skewed. The skewness of Assessments 6 and 7 is the lowest, so it is not visually obvious. This visualization demonstrates that the EFB method more accurately recovers the posterior ability distributions, closely mirroring the true ability distributions compared to the traditional SB method. This observation is consistent across both normally and non-normally distributed abilities. In assessments with a large number of students, the true ability distribution tends to resemble a normal distribution. When the student population is small, however, the true ability may exhibit different distributions under certain operational scenarios, as shown in Figure 3. From this figure, we can see that EFB provides a useful alternative for calibrating mixed-format assessments with more precise results.

4. Discussion

The use of mixed-format assessments is very common and offers the advantage of measuring a wider range of skills compared to single-format assessments. Calibrating such assessments requires precise estimation of students’ overall abilities in the contemporary educational measurement field. This study introduced a new estimation method, EFB, with at least three contributions. First, it develops a fully Bayesian framework that effectively adapts and refines the empirical Bayes priors used by the SB method. Second, unlike traditional approaches that may require splitting data for analysis, the EFB method maintains a holistic view of the data set, thereby enhancing the reliability of the ability estimates it produces. Finally, it demonstrates accuracy in calibrating small-sample data sets while ensuring convergence.
For assessments in which both dichotomous and polytomous items are scored separately, the EFB method offers a solution for combining the separate scores into a single scale score. This is exceedingly beneficial in scenarios where dichotomous and polytomous items are scored on divergent scales or when polytomous item scores are tabulated and released subsequent to MC scores. By circumventing the conventional requirement for explicit weight allocation, the EFB method streamlines the integration of item-specific and composite ability estimates, thereby enhancing the holistic and nuanced interpretation of student performance. In addition, the applicability of this method extends to small-sample-size assessments, such as the Bureau of Indian Education’s summative assessments. In such situations, traditional IRT calibration methods may struggle to converge and to provide stable and reliable ability estimates. By leveraging the empirical Bayesian prior and fully Bayesian sampling, the EFB method was found to overcome some of the limitations of traditional calibration, yielding more accurate and reliable parameter estimates. These practical implications highlight the versatility and utility of the EFB method in addressing common challenges encountered in the calibration and scoring of mixed-format assessments.
The present research also acknowledges several areas that could be addressed in next-stage studies. First, while the EFB methodology has demonstrated a capacity for generating more precise estimates, it is also associated with an elevated computational demand, especially when contrasted with methodologies such as maximum likelihood estimation. This increase in computational expense underscores the need to develop more computationally efficient sampling strategies within the NUTS framework. Second, this study has predominantly focused on the applicability of the EFB methodology to unidimensional assessments. It is critical to note, however, that the foundational principle of employing empirical prior information under the EFB method does not strictly limit its application to unidimensional contexts. On the contrary, it only presupposes that all items are assessing either a singular ability or a coherent set of abilities. This opens the possibility of applying EFB to multidimensional assessments, broadening the scope of its utility significantly. Finally, we also want to highlight another critical consideration. As discussed above, students with the same MC responses will receive the same prior information under the current EFB method, as observed in the empirical study. However, with the investigation of educational process data, such as students’ behavioral actions and eye-tracking data, students with the same responses could still demonstrate different ability values in terms of the content being measured. Therefore, using data about how students respond to assessment items could help us better understand their assessment-taking strategies and improve how we adjust their scores from the start. This means we could tailor the initial empirical prior estimation for students who obtain the same scores but engage with the assessment differently, for example through random guessing behaviors. Looking into this could make our methods for calibrating mixed-format assessments more precise and give us deeper insight into how students take assessments.

5. Conclusions

Overall, the proposed EFB method successfully addresses the research question on accurate composite ability score estimation in mixed-format assessments. Empirical data analyses consistently demonstrated that this approach provides higher empirical reliability in ability estimation and better reflects non-normally distributed ability groups. Through real-data-based simulations, we confirmed the EFB method’s superior recovery of ability estimations across various assessment conditions compared to traditional approaches.
From a practical standpoint, the EFB method achieves comparable results to traditional methods without requiring manual item weight specification, thus reducing the need for subject matter expert involvement and simplifying the calibration process. This approach proves particularly valuable when the true ability distribution is unclear or sample sizes are limited. These findings highlight the EFB method’s potential to streamline the assessment process while maintaining high standards of accuracy and reliability. Future research should focus on extensions to broader application scenarios such as multidimensional measurement and multimodal educational data mining.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/educsci15030374/s1.

Author Contributions

Conceptualization, J.X.; methodology, J.X. and Q.L.; software, J.X. and C.T.; validation, Q.L. and C.T.; formal analysis, J.X.; investigation, Q.L.; resources, C.T.; data curation, B.W.; writing—original draft preparation, J.X.; writing—review and editing, J.X., Q.L., C.T. and A.S.C.; visualization, J.X.; supervision, J.X.; project administration, J.X.; funding acquisition, A.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was previously funded by the National Science Foundation, grant number 1813760.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the proposed activity is not research involving human subjects as defined by DHHS and FDA regulations. Project activities were limited to the use of de-identified data.

Informed Consent Statement

Not applicable.

Data Availability Statement

The empirical data are protected under state law; a description of the assessments can be found at https://coe.uga.edu/directory/k-12-assessment-solutions/ (accessed on 18 February 2025; formerly named the “Georgia Center for Assessment”). The simulation data and simulation code can be downloaded from the Supplementary Materials.

Acknowledgments

The authors would like to thank the reviewers and editors for their work during the review process.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EFB	Empirical Fully Bayesian
SB	Sequential Bayesian
IRT	Item response theory
MC	Multiple choice
CR	Constructed response
HMC	Hamiltonian Monte Carlo
NUTS	No-U-Turn Sampler
RMSE	Root mean square error
MAE	Mean absolute error
ELA	English Language Arts
ANOVA	Analysis of variance
GRE	Graduate Record Examination

References

  1. Baker, F. B., & Kim, S.-H. (Eds.). (2014). Item response theory: Parameter estimation techniques. CRC Press.
  2. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
  3. Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
  4. Educational Testing Service. (2012). GRE: The official guide to the revised general test. McGraw Hill Professional.
  5. Ercikan, K., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with multiple-choice and constructed-response item types. Journal of Educational Measurement, 35(2), 137–154.
  6. Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
  7. Lane, S. (2005, April 11–15). Status and future directions for performance assessments in education. Annual Meeting of the American Educational Research Association, Montreal, QC, Canada.
  8. Natesan, P., Nandakumar, R., Minka, T., & Rubright, J. D. (2016). Bayesian prior choice in IRT estimation using MCMC and variational Bayes. Frontiers in Psychology, 7, 1–11.
  9. R Core Team. (2023). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing.
  10. Rosa, K., Swygert, K. A., Nelson, L., & Thissen, D. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Scale scores for patterns of summed scores. In Test scoring (pp. 253–292). Routledge.
  11. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(S1), 1–97.
  12. Shu, L., & Schwarz, R. D. (2014). IRT-estimated reliability for tests containing mixed item formats. Journal of Educational Measurement, 51(2), 163–177.
  13. Smid, S. C., McNeish, D., Miočević, M., & Van De Schoot, R. (2020). Bayesian versus frequentist estimation for structural equation models in small sample contexts: A systematic review. Structural Equation Modeling: A Multidisciplinary Journal, 27(1), 131–161.
  14. Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
  15. Sykes, R. C., Truskosky, D., & White, H. (2001, April 11–13). Determining the representation of constructed response items in mixed-item format exams (p. 42). Annual Meeting of the National Council on Measurement in Education, Seattle, WA, USA.
  16. Thissen, D., Nelson, L., & Swygert, K. A. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Approximation methods for scale scores. In Test scoring (pp. 293–341). Routledge.
  17. Williams, M. R., & Savitsky, T. D. (2021). Uncertainty estimation for pseudo-Bayesian inference under complex sampling. International Statistical Review, 89(1), 72–107.
  18. Xiong, J., Cohen, A. S., & Xiong, X. (2023). Sequential Bayesian ability estimation applied to mixed-format item tests. Applied Psychological Measurement, 47(5–6), 402–419.
  19. Xiong, J., Engelhard, G., & Cohen, A. S. (2024). Analysis of mixed-format assessments using measurement models and topic modeling. Measurement: Interdisciplinary Research and Perspectives, 1–15.
Figure 1. Flowchart for the empirical fully Bayesian method on mixed-format assessment data.
Figure 2. Scatter plots and distributions of prior ability and posterior ability estimations.
Figure 3. True ability distribution and recovered ability distributions from two estimation methods.
Table 1. Summary statistics for Grade 3 to Grade 10 ELA assessments scores.
Grade | Number of Students | Number of MC Items | Number of CR Items | Points Possible
3 | 872 | 3 | 2 | 9
4 | 2320 | 3 | 2 | 9
5 | 1249 | 3 | 2 | 9
6 | 646 | 3 | 2 | 9
7 | 4283 | 8 | 3 | 16
8 | 6254 | 8 | 3 | 16
9 | 7017 | 13 | 5 | 25
10 | 4542 | 13 | 5 | 25
Table 2. Results of factor analyses for ELA assessments.
Grade | Factor | Eigenvalue | Percent of Variance (%)
3 | 1 | 6.52 | 18.94
  | 2 | 0.83 | 2.73
4 | 1 | 9.44 | 23.57
  | 2 | 1.26 | 2.76
5 | 1 | 9.57 | 28.35
  | 2 | 1.56 | 3.08
6 | 1 | 8.63 | 17.61
  | 2 | 1.74 | 3.55
7 | 1 | 12.43 | 24.63
  | 2 | 4.68 | 9.65
8 | 1 | 13.57 | 27.06
  | 2 | 5.42 | 9.11
9 | 1 | 20.45 | 30.42
  | 2 | 8.93 | 10.68
10 | 1 | 26.47 | 34.66
  | 2 | 9.80 | 9.88
Table 3. Standard error of measurement and conditional standard error of measurement for posterior ability estimates on each grade assessment.
Percentiles | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 | Grade 9 | Grade 10
25th | 0.41 | 0.37 | 0.36 | 0.33 | 0.35 | 0.31 | 0.33 | 0.35
50th (Median) | 0.35 | 0.36 | 0.38 | 0.35 | 0.37 | 0.32 | 0.34 | 0.33
75th | 0.33 | 0.32 | 0.30 | 0.36 | 0.33 | 0.29 | 0.31 | 0.37
Overall | 0.39 | 0.35 | 0.36 | 0.35 | 0.34 | 0.30 | 0.32 | 0.34
Table 4. Comparison between EFB ability estimation and composite ability estimation with expert determined weights.
 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 | Grade 9 | Grade 10
Correlation | 0.97 | 0.97 | 0.96 | 0.95 | 0.98 | 0.98 | 0.97 | 0.98
MAE | 0.22 | 0.19 | 0.24 | 0.31 | 0.23 | 0.20 | 0.23 | 0.21
RMSE | 0.34 | 0.23 | 0.36 | 0.42 | 0.28 | 0.28 | 0.29 | 0.27
Table 5. Simulation condition and levels.
Condition | Level 1 | Level 2 | Level 3
Ability Group | Normal distribution | Uniform distribution | Skewed distribution
Sample Size | 150 | 2000 | 4000
Assessment Length | 20 (18 MC + 2 CR) | 45 (38 MC + 7 CR) | 70 (58 MC + 12 CR)
Table 6. Design matrix and recovery of ability from two estimation methods.
Order | Group | Sample | Length | RMSE (EFB) | RMSE (SB) | Reliability (EFB) | Reliability (SB)
1 | Normal | 150 | 20 | 0.604 (0.086) | 0.725 (0.082) | 0.848 (0.087) | 0.807 (0.087)
2 | Normal | 150 | 45 | 0.583 (0.087) | 0.688 (0.085) | 0.874 (0.076) | 0.809 (0.076)
3 | Normal | 150 | 70 | 0.571 (0.075) | 0.664 (0.078) | 0.879 (0.086) | 0.833 (0.086)
4 | Normal | 2000 | 20 | 0.433 (0.078) | 0.569 (0.087) | 0.908 (0.083) | 0.882 (0.087)
5 | Normal | 2000 | 45 | 0.422 (0.079) | 0.544 (0.075) | 0.927 (0.086) | 0.883 (0.083)
6 | Normal | 2000 | 70 | 0.398 (0.076) | 0.510 (0.072) | 0.924 (0.082) | 0.909 (0.075)
7 | Normal | 4000 | 20 | 0.385 (0.079) | 0.473 (0.076) | 0.944 (0.084) | 0.903 (0.084)
8 | Normal | 4000 | 45 | 0.333 (0.082) | 0.436 (0.087) | 0.953 (0.085) | 0.922 (0.087)
9 | Normal | 4000 | 70 | 0.310 (0.075) | 0.399 (0.076) | 0.959 (0.088) | 0.937 (0.088)
10 | Uniform | 150 | 20 | 0.643 (0.095) | 0.791 (0.091) | 0.811 (0.094) | 0.781 (0.086)
11 | Uniform | 150 | 45 | 0.617 (0.077) | 0.739 (0.074) | 0.823 (0.076) | 0.798 (0.075)
12 | Uniform | 150 | 70 | 0.600 (0.086) | 0.703 (0.085) | 0.830 (0.087) | 0.801 (0.082)
13 | Uniform | 2000 | 20 | 0.619 (0.082) | 0.728 (0.089) | 0.866 (0.088) | 0.785 (0.075)
14 | Uniform | 2000 | 45 | 0.596 (0.087) | 0.686 (0.086) | 0.871 (0.080) | 0.804 (0.084)
15 | Uniform | 2000 | 70 | 0.570 (0.089) | 0.607 (0.087) | 0.879 (0.085) | 0.832 (0.083)
16 | Uniform | 4000 | 20 | 0.504 (0.079) | 0.661 (0.080) | 0.920 (0.086) | 0.843 (0.078)
17 | Uniform | 4000 | 45 | 0.478 (0.086) | 0.619 (0.084) | 0.925 (0.085) | 0.877 (0.089)
18 | Uniform | 4000 | 70 | 0.454 (0.084) | 0.587 (0.086) | 0.929 (0.086) | 0.883 (0.085)
19 | Skewed | 150 | 20 | 0.637 (0.085) | 0.785 (0.089) | 0.807 (0.085) | 0.780 (0.087)
20 | Skewed | 150 | 45 | 0.603 (0.076) | 0.730 (0.084) | 0.811 (0.082) | 0.796 (0.087)
21 | Skewed | 150 | 70 | 0.582 (0.077) | 0.692 (0.088) | 0.829 (0.084) | 0.803 (0.080)
22 | Skewed | 2000 | 20 | 0.543 (0.089) | 0.690 (0.089) | 0.833 (0.087) | 0.774 (0.083)
23 | Skewed | 2000 | 45 | 0.502 (0.076) | 0.667 (0.085) | 0.846 (0.088) | 0.785 (0.087)
24 | Skewed | 2000 | 70 | 0.476 (0.079) | 0.631 (0.086) | 0.852 (0.086) | 0.797 (0.084)
25 | Skewed | 4000 | 20 | 0.487 (0.078) | 0.574 (0.083) | 0.897 (0.075) | 0.811 (0.079)
26 | Skewed | 4000 | 45 | 0.463 (0.085) | 0.543 (0.082) | 0.909 (0.085) | 0.843 (0.085)
27 | Skewed | 4000 | 70 | 0.445 (0.077) | 0.502 (0.084) | 0.903 (0.076) | 0.859 (0.077)
Table 7. ANOVA for ability parameter recovery RMSE.
Conditions | d.f. 1 | Sum Square | Mean Square | F Statistic | p | Partial η²
Estimation Method | 1 | 17.808 | 17.808 | 2607.232 | <0.001 | 0.328
Ability Group | 2 | 14.224 | 7.112 | 1041.287 | <0.001 | 0.280
Sample Size | 2 | 30.269 | 15.135 | 2215.850 | <0.001 | 0.453
Assessment Length | 2 | 3.550 | 1.775 | 259.882 | <0.001 | 0.089
1 d.f. indicates Degrees of Freedom.
Table 8. The real-data assessment information for Simulation 2.
Assessment | Sample Size | MC Number | CR Number | Possible Points | Skewness Direction | Skewness Magnitude
Assessment 1 | 486 | 22 | 3 | 28 | Negative | Mild
Assessment 2 | 492 | 22 | 3 | 28 | Positive | Moderate
Assessment 3 | 870 | 22 | 3 | 28 | Positive | Moderate to High
Assessment 4 | 630 | 22 | 3 | 28 | Positive | Mild to Moderate
Assessment 5 | 639 | 22 | 3 | 28 | Positive | Moderate
Assessment 6 | 469 | 22 | 3 | 28 | Positive | Low
Assessment 7 | 583 | 22 | 3 | 28 | Positive | Low
Table 9. Recovery of ability from two estimation methods.
Assessment | Student Number | RMSE (EFB) | RMSE (SB) | Reliability (EFB) | Reliability (SB)
Assessment 1 | 486 | 0.668 (0.082) | 0.763 (0.080) | 0.877 (0.092) | 0.808 (0.089)
Assessment 2 | 492 | 0.690 (0.089) | 0.758 (0.079) | 0.865 (0.089) | 0.795 (0.076)
Assessment 3 | 870 | 0.583 (0.077) | 0.686 (0.081) | 0.930 (0.081) | 0.872 (0.080)
Assessment 4 | 630 | 0.618 (0.075) | 0.695 (0.073) | 0.893 (0.080) | 0.826 (0.077)
Assessment 5 | 639 | 0.626 (0.078) | 0.674 (0.083) | 0.899 (0.076) | 0.835 (0.079)
Assessment 6 | 469 | 0.644 (0.083) | 0.703 (0.080) | 0.859 (0.083) | 0.793 (0.081)
Assessment 7 | 583 | 0.649 (0.085) | 0.759 (0.088) | 0.873 (0.099) | 0.801 (0.087)
