Article

On the Choice of the Item Response Model for Scaling PISA Data: Model Selection Based on Information Criteria and Quantifying Model Uncertainty

by Alexander Robitzsch 1,2
1 IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany
2 Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
Entropy 2022, 24(6), 760; https://doi.org/10.3390/e24060760
Submission received: 19 April 2022 / Revised: 12 May 2022 / Accepted: 25 May 2022 / Published: 27 May 2022
(This article belongs to the Special Issue Data Science: Measuring Uncertainties II)

Abstract

In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students’ performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used as an example of the choice of the IRT model. It turned out that the three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of the ability θ provided the best model fit. Furthermore, model uncertainty was relatively small compared to sampling error regarding country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.

1. Introduction

Item response theory (IRT) models [1] are central to analyzing dichotomous random variables. IRT models can be regarded as a factor-analytic multivariate technique to summarize a high-dimensional contingency table by a few latent factor variables of interest. Of particular interest is the application of an IRT model in educational large-scale assessment (LSA; [2]), such as the programme for international student assessment (PISA; [3]), which summarizes the ability of students on test items in different cognitive domains.
In the official reporting of outcomes of LSA studies such as PISA, the set of test items is represented by a unidimensional summary measure extracted by applying a unidimensional IRT model. Across different LSA studies, there is no consensus on which particular IRT model should be utilized [4,5,6]. In previous research, there have been a few attempts to quantify the impact of IRT model choice on distribution parameters of interest such as country means, standard deviations, or percentiles. However, previous research did not systematically study a large number of competing IRT models [7,8,9]. Our research fills this gap by conducting an empirical comparison of 11 different IRT models for scaling PISA 2009 data in three ability domains. Moreover, we compare the model fit of these different IRT models and quantify model uncertainty by means of the model error. We compare the model error with the standard error associated with the uncertainty due to the sampling of students.
The rest of the article is structured as follows. In Section 2, we discuss different IRT models used for scaling. Section 3 introduces the concepts of model selection and model uncertainty. Section 4 describes the method used to analyze PISA 2009 data. In Section 5, we discuss the empirical results for the PISA 2009 dataset. Finally, the paper closes with a discussion in Section 6.

2. Item Response Models for Scaling Cognitive Test Items

In this section, we present an overview of different IRT models that are used for scaling cognitive test data to obtain a unidimensional summary score [10,11,12]. In the rest of the article, we restrict ourselves to the treatment of dichotomous items. However, the principle can similarly be applied to polytomous items.
Let $\boldsymbol{X} = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous items $X_i \in \{0,1\}$. A unidimensional IRT model [11,12] is a statistical model for the probability distribution $P(\boldsymbol{X} = \boldsymbol{x})$ for $\boldsymbol{x} \in \{0,1\}^I$, where
$$P(\boldsymbol{X} = \boldsymbol{x}; \boldsymbol{\gamma}) = \int \prod_{i=1}^{I} P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \left[ 1 - P_i(\theta; \boldsymbol{\gamma}_i) \right]^{1 - x_i} f(\theta) \, d\theta, \quad \theta \sim F. \qquad (1)$$
In Equation (1), a latent variable $\theta$ is involved that can be interpreted as a unidimensional summary of the test items $\boldsymbol{X}$. The distribution of $\theta$ is modeled using a (semi)parametric distribution $F$ with density function $f$. In the rest of the article, we fix this distribution to be standard normal, but this assumption can be weakened [13,14,15]. The item response functions (IRF) $P_i(\theta; \boldsymbol{\gamma}_i)$ model the relationship of the dichotomous items with the latent variable, and we collect all item parameters in the vector $\boldsymbol{\gamma}$. In most cases, a parametric model is utilized in the estimation of the IRF (but see [16] for a nonparametric identification), which is indicated by the item parameter $\boldsymbol{\gamma}_i$ in Equation (1). Note that in (1), item responses $X_i$ are conditionally independent given $\theta$; that is, after controlling for the latent ability $\theta$, pairs of items $X_i$ and $X_j$ are conditionally uncorrelated. This property is also known as the local independence assumption, which can be statistically tested [12,17]. The item parameters $\boldsymbol{\gamma}_i$ of the IRFs in Equation (1) can be estimated by (marginal) maximum likelihood (ML) using an EM algorithm [18,19,20]. The estimation can involve sampling weights for students [21] and a multi-matrix design in which only a subset of items is administered to each student [22]. In the likelihood formulation of (1), non-administered items are skipped in the multiplication term.
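To make this estimation step concrete, the following minimal sketch evaluates the marginal likelihood of Equation (1) by Gauss–Hermite quadrature for a simple logistic (2PL-type) IRF and a standard normal ability distribution; the item parameters, the small response matrix, and the function name marginal_loglik are hypothetical illustrations, not the operational PISA implementation (which relies on an EM algorithm with sampling weights).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Hermite nodes/weights


def marginal_loglik(resp, a, b, n_quad=31):
    """Marginal log-likelihood of Equation (1) for a 2PL-type IRF and theta ~ N(0, 1),
    approximated by Gauss-Hermite quadrature; missing responses (np.nan) are skipped,
    as in a multi-matrix design."""
    nodes, weights = hermegauss(n_quad)          # integrate against exp(-x^2 / 2)
    weights = weights / np.sqrt(2 * np.pi)       # normalize to the N(0, 1) density
    p = 1.0 / (1.0 + np.exp(-(a[None, :] * nodes[:, None] + b[None, :])))  # (Q, I)
    loglik = 0.0
    for x in resp:                               # one student at a time
        obs = ~np.isnan(x)
        like_q = np.prod(np.where(x[obs] == 1, p[:, obs], 1 - p[:, obs]), axis=1)
        loglik += np.log(np.sum(weights * like_q))
    return loglik


# Hypothetical example: 5 items, 3 students, one missing-by-design response
a = np.array([1.0, 0.8, 1.2, 1.0, 0.9])
b = np.array([0.0, -0.5, 0.5, 1.0, -1.0])
resp = np.array([[1, 0, 1, np.nan, 1],
                 [0, 0, 1, 0, 1],
                 [1, 1, np.nan, 0, 0]], dtype=float)
print(marginal_loglik(resp, a, b))
```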
In practice, the IRT model (1) is likely to be misspecified because the unidimensionality assumption is implausible. Moreover, the parametric assumption $P_i(\theta; \boldsymbol{\gamma}_i)$ of the IRF might be incorrect. In addition, in educational LSA studies involving a large number of countries, there will typically be country differential item functioning [23,24,25]; that is, item parameters will vary across countries. In this case, applying ML using country-invariant item parameters defines the best approximation with respect to the Kullback–Leibler distance between the true distribution and the model-implied distribution. In this sense, an IRT model is selected by purpose and not for reasons of model fit because it will not even approximately fit the data (see also [26]). If country means are computed based on a particular IRT model, the parameter of interest should rather be interpreted as a descriptive statistic [27]. Using a particular model does not mean that we believe that the model (approximately) fits the data. Rather, we think of the vector of country means $\boldsymbol{\mu}$ and the item parameters $\boldsymbol{\gamma}$ as summarizing the high-dimensional contingency table $P(\boldsymbol{X} = \boldsymbol{x})$.
Locally optimal weights [28] can be used to discuss the consequences for scoring when using a particular IRT model. A local scoring rule for the ability $\theta$ can be defined by a weighted sum $\sum_{i=1}^{I} \nu_i(\theta) X_i$ for abilities near $\theta = \theta_0$. The ability $\theta$ is determined by ML estimation using previously estimated item parameters. The locally optimal weights can be derived as (see [27,28,29]):
$$\nu_i(\theta) = \frac{P_i'(\theta)}{P_i(\theta) \left[ 1 - P_i(\theta) \right]} \qquad (2)$$
If the local weight $\nu_i(\theta)$ (also referred to as the local item score) varies across different $\theta$ values, the impact of individual items on the ability estimate differs across the $\theta$ range. This property can be viewed critically, particularly for country comparisons in LSA studies [29]. Subsequently, we will discuss the properties of different IRT models regarding the optimal weights $\nu_i(\theta)$.
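The following sketch illustrates Equation (2) by computing locally optimal weights numerically for two hypothetical IRFs (a 2PL-type item and a 3PL-type item with guessing); all parameter values are invented for illustration.

```python
import numpy as np


def local_weight(irf, theta, eps=1e-5):
    """Locally optimal item weight of Equation (2): nu_i(theta) = P_i'(theta) / [P_i(theta) (1 - P_i(theta))],
    with the derivative approximated by central differences."""
    p = irf(theta)
    dp = (irf(theta + eps) - irf(theta - eps)) / (2 * eps)
    return dp / (p * (1 - p))


# Hypothetical IRFs: a 2PL item and a 3PL item with a guessing parameter
irf_2pl = lambda th: 1 / (1 + np.exp(-(1.2 * th + 0.3)))
irf_3pl = lambda th: 0.2 + 0.8 / (1 + np.exp(-(1.2 * th + 0.3)))

for th in (-2.0, 0.0, 2.0):
    print(th, local_weight(irf_2pl, th), local_weight(irf_3pl, th))
# For the 2PL item the weight is constant (equal to a_i); for the 3PL item it increases with theta.
```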
In this article, several competing functional forms of the IRF are compared, and their consequences for distribution parameters (e.g., means, standard deviations, and percentiles) in the prominent LSA study PISA are discussed. Performing such a fit index contest [30,31] does not necessarily mean that we favor model selection based on model fit. In the next Section 2.1, we discuss several IRFs later utilized for model comparisons. In Section 2.2, we investigate the behavior of the estimated ability distribution under misspecified IRFs. Finally, we conclude this section with some thoughts on the choice of the IRT model (see Section 2.3).

2.1. Different Functional Forms for IRT Models

In this section, we discuss several parametric specifications of the IRF P i ( θ ) that appear in the unidimensional IRT model defined in Equation (1).
The one-parameter logistic model (1PL; also known as the Rasch model; [32,33]) employs a logistic link function and parametrizes an item with a single parameter b i that is called item difficulty. The model is defined by
$$\text{Model 1PL:} \quad P_i(\theta) = \frac{1}{1 + \exp(-a\theta - b_i)}, \qquad (3)$$
where $a$ is the common item discrimination parameter. Alternatively, one can fix the parameter $a$ to 1 and estimate the standard deviation of the latent variable $\theta$. Notably, the sum score $\sum_{i=1}^{I} x_i$ is a sufficient statistic for $\theta$ in the 1PL model. The 1PL model has wide applicability in educational assessment [34,35].
The 1PL model uses a symmetric link function. However, asymmetric link functions could also be used for choosing an IRF. The cloglog link function is used in the one-parameter cloglog (1PCL) model [36,37]:
$$\text{Model 1PCL:} \quad P_i(\theta) = 1 - \exp\left( -\exp(a\theta + b_i) \right). \qquad (4)$$
Consequently, items are differentially weighted in the estimation of  θ at each θ location, and the sum score is not a sufficient statistic. The cloglog link function has similar behavior to the logistic link function in the 1PL model in the lower tail (i.e., for negative values of  θ ), but differs from it in the upper tail.
The one-parameter loglog (1PLL) IRT model is defined by
$$\text{Model 1PLL:} \quad P_i(\theta) = \exp\left( -\exp(-a\theta - b_i) \right). \qquad (5)$$
In contrast to the cloglog link function, the loglog function is similar to the logistic link function in the upper tail (i.e., for positive θ values), but different from it in the lower tail.
Figure 1 compares the 1PL, 1PCL, and 1PLL models regarding the IRF $P_i$ and the locally optimal weight $\nu_i$. The loglog IRT model (1PLL) is stretched more in the lower $\theta$ tail than the logistic link function. The converse is true for the cloglog IRT model (1PCL), which is significantly stretched in the upper $\theta$ tail. In the right panel of Figure 1, locally optimal weights are displayed. The 1PL model has a constant weight of 1, while the local contribution of the item score for $\theta$ differs across the $\theta$ range for the 1PCL and the 1PLL model. The 1PCL model provides a higher local item score for higher $\theta$ values than for lower $\theta$ values. Hence, more difficult items receive lower local item scores than easier items. In contrast, the 1PLL model results in higher local item scores for difficult items compared to easier items. This idea is reflected in the D-scoring method [38,39].
Notably, the 1PCL and 1PLL models use asymmetric IRFs. One can try to estimate the extent of asymmetry in IRFs by using a generalized logistic link function (also called the Stukel link function; [40]):
$$\text{Model 1PGL:} \quad P_i(\theta) = \frac{1}{1 + \exp\left( -S(a\theta + b_i; \alpha_1, \alpha_2) \right)}, \qquad (6)$$
where the generalized logit link function is defined as
$$S(x; \alpha_1, \alpha_2) = \begin{cases} \alpha_1^{-1} \left[ \exp(\alpha_1 x) - 1 \right] & \text{if } x \geq 0 \text{ and } \alpha_1 > 0 \\ x & \text{if } x \geq 0 \text{ and } \alpha_1 = 0 \\ -\alpha_1^{-1} \log(1 - \alpha_1 x) & \text{if } x \geq 0 \text{ and } \alpha_1 < 0 \\ -\alpha_2^{-1} \left[ \exp(-\alpha_2 x) - 1 \right] & \text{if } x < 0 \text{ and } \alpha_2 > 0 \\ x & \text{if } x < 0 \text{ and } \alpha_2 = 0 \\ \alpha_2^{-1} \log(1 + \alpha_2 x) & \text{if } x < 0 \text{ and } \alpha_2 < 0 \end{cases} \qquad (7)$$
In this 1PGL model, common shape parameters α 1 and α 2 for the IRFs are additionally estimated. The 1PL, 1PCL and 1PLL models can be obtained as special cases of (6).
The four models 1PL, 1PCL, 1PLL, and 1PGL have in common that they estimate only one parameter per item. The assumption of a common item discrimination is weakened in the two-parameter logistic (2PL) IRT model [28], which generalizes the 1PL model by making the discriminations $a_i$ item-specific:
$$\text{Model 2PL:} \quad P_i(\theta) = \frac{1}{1 + \exp(-a_i\theta - b_i)}. \qquad (8)$$
Note that $\sum_{i=1}^{I} a_i x_i$ is a sufficient statistic for $\theta$. Hence, items $X_i$ are differentially weighted by the weights $a_i$, which are determined within the statistical model.
Further, the assumption of a symmetric logistic link function might be weakened, and a four-parameter generalized logistic (4PGL) model can be estimated:
$$\text{Model 4PGL:} \quad P_i(\theta) = \frac{1}{1 + \exp\left( -S(a\theta + b_i; \alpha_{1i}, \alpha_{2i}) \right)}. \qquad (9)$$
In the IRT model (9), the shape parameters α 1 i and α 2 i are made item-specific. Hence, the extent of asymmetry of the IRF is estimated for each item.
The 2PL model (8) can be generalized to the three-parameter logistic (3PL; [41]) IRT model that assumes an item-specific lower asymptote c i larger than 0 for the IRF:
$$\text{Model 3PL:} \quad P_i(\theta) = c_i + (1 - c_i) \, \frac{1}{1 + \exp(-a_i\theta - b_i)}. \qquad (10)$$
Parameter c i is often referred to as a (pseudo-)guessing parameter [42,43]. The 3PL model might be reasonable if multiple-choice items are used in the test.
The 3PL model can be generalized in the four-parameter logistic (4PL; [44,45,46]) model such that it also contains upper asymptotes d i smaller than 1 for the IRF:
$$\text{Model 4PL:} \quad P_i(\theta) = c_i + (1 - d_i - c_i) \, \frac{1}{1 + \exp(-a_i\theta - b_i)}. \qquad (11)$$
The d i parameter is often referred to as a slipping parameter, which characterizes careless (incorrect) item responses [47]. In contrast to the 1PL, 2PL, or the 3PL model, the 4PL model has not yet been applied in the operational practice of LSA studies. However, there are a few research papers that apply the 4PL model to LSA data [48,49].
It should be mentioned that the 3PL or the 4PL model might suffer from empirical nonidentifiability [45,50,51,52]. This is why prior distributions for guessing (3PL and 4PL) and slipping (4PL) parameters are required for stabilizing model estimation. As pointed out by an anonymous reviewer, the use of prior distributions changes the meaning of the IRT model. However, we think that identifiability issues are of less concern in the large-sample-size situations that are present in educational LSA studies. If item parameters are obtained in a pooled sample of students comprising all countries, sample sizes are typically above 10,000. In this case, the empirical data will typically dominate prior distributions, and prior distributions are therefore not needed.
In Figure 2, IRFs and locally optimal weights for the 4PL, 3PL, and 2PL models are displayed. The item parameters for the 4PL model were $a_i = 1$, $b_i = 0$, $c_i = 0.25$, and $d_i = 0.1$. The parameters of the displayed 2PL and 3PL models were obtained by minimizing the weighted squared distance between the IRF of the 4PL model and the simpler model under the constraint that the model-implied item means coincide under the normal distribution assumption of $\theta$. Importantly, it can be seen in the right panel that the 2PL model has a constant local item score, while it is increasing for the 3PL model and inversely U-shaped for the 4PL model. Hence, under the 4PL model, an item must be neither too easy nor too difficult for a correct response to receive a high local item score.
A different strand of model extensions also starts from the 2PL model but introduces more item parameters to model asymmetry or nonlinearity while retaining the logistic link function. The three-parameter logistic model with quadratic effects (3PLQ) extends the 2PL model by including quadratic effects of $\theta$ [42,50]:
$$\text{Model 3PLQ:} \quad P_i(\theta) = \frac{1}{1 + \exp(-a_{2i}\theta^2 - a_{1i}\theta - b_i)}. \qquad (12)$$
Due to the presence of the $a_{2i}$ parameter, asymmetric IRFs can be modeled. As a disadvantage, the IRF in (12) need not be monotone, although a monotonicity constraint can be incorporated in the estimation [53,54].
The three-parameter model with residual heterogeneity (3PLRH) extends the 2PL model by including an asymmetry parameter $\delta_i$ [55,56]:
$$\text{Model 3PLRH:} \quad P_i(\theta) = \frac{1}{1 + \exp\left( -\left[ 1 + \exp(\delta_i \theta) \right]^{-1/2} (a_i\theta + b_i) \right)}. \qquad (13)$$
The 3PLRH model has been successfully applied to LSA data and often resulted in superior model fit compared to the 3PL model [57,58].
In Figure 3, IRFs and locally optimal weights are displayed for three parameter specifications of the 3PLRH model (i.e., $a_i = 1$, $b_i = 0$, and $\delta_i = -0.5, 0, 0.5$). One can see that the introduced asymmetry parameter $\delta_i$ governs the behavior of the IRF in the lower or upper tails. The displayed IRFs mimic the 1PL, 1PCL, and 1PLL models. Moreover, with $\delta_i$ parameters different from zero, different locally optimal weights across the $\theta$ range are introduced. Notably, a positive $\delta_i$ parameter is associated with a larger local item score in the lower $\theta$ tail. The opposite is true for a negative $\delta_i$ parameter.
Finally, the 3PL model is extended in the four-parameter logistic model with quadratic effects (4PLQ), in which additional item-specific quadratic effects for  θ are included [50]
$$\text{Model 4PLQ:} \quad P_i(\theta) = c_i + (1 - c_i) \, \frac{1}{1 + \exp(-a_{2i}\theta^2 - a_{1i}\theta - b_i)}. \qquad (14)$$
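As a compact summary of this subsection, the sketch below implements a few of the IRFs defined above (2PL/1PL, loglog, 3PL, and 3PLRH) under the sign conventions used in Equations (3)-(14); all parameter values are hypothetical illustrations.

```python
import numpy as np

# Item response functions from Section 2.1 (sign conventions as in Equations (3)-(14));
# all parameter values below are hypothetical.
psi = lambda x: 1 / (1 + np.exp(-x))                       # logistic link


def irf_2pl(theta, a, b):                                   # Equation (8); a constant gives the 1PL
    return psi(a * theta + b)


def irf_1pll(theta, a, b):                                  # loglog link, Equation (5)
    return np.exp(-np.exp(-(a * theta + b)))


def irf_3pl(theta, a, b, c):                                # Equation (10)
    return c + (1 - c) * psi(a * theta + b)


def irf_3plrh(theta, a, b, delta):                          # residual heterogeneity, Equation (13)
    return psi((1 + np.exp(delta * theta)) ** (-0.5) * (a * theta + b))


theta = np.linspace(-3, 3, 7)
print(irf_2pl(theta, 1.0, 0.0))
print(irf_1pll(theta, 1.0, 0.0))
print(irf_3pl(theta, 1.0, 0.0, 0.25))
print(irf_3plrh(theta, 1.0, 0.0, 0.5))
```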

2.2. Ability Estimation under Model Misspecification

In this section, we study the estimation of  θ when working with a misspecified IRT model. In the treatment, we assume that there is a true IRT model with unknown IRFs. We study the bias in estimated abilities for a fixed value of θ if misspecified IRFs are utilized. This situation refers to the empirical application in an LSA study, in which a misspecified IRF is estimated based on data comprising all countries, and the distribution of  θ is evaluated at the level of countries. The misspecification emerges due to incorrectly assumed functional forms of the IRF or the presence of differential item functioning at the level of countries [24,59].
We assume that there are true but unknown IRFs $P_i(\theta) = \Psi(a_i(\theta))$ with continuously differentiable functions $a_i$, where $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic link function. We assume that the local independence assumption holds in the IRT model. For estimation, we use a misspecified IRT model with IRFs $P_i^*(\theta) = \Psi(\alpha_i(\theta))$ with continuously differentiable functions $\alpha_i$. Notably, there exists a misspecification if $\alpha_i \neq a_i$. In Appendix A, we derive the estimate $\theta_1$ under the misspecified IRT model if $\theta_0$ is the data-generating ability value under the true IRT model. Hence, we derive a transformation function $\theta_1 = m(\theta_0) = \theta_0 + B(\theta_0)$, where $B(\theta)$ is the bias function that indicates the bias in the estimated ability due to the application of the misspecified IRT model. We assume that the item parameters under the misspecified IRT model are known (i.e., the IRFs $\alpha_i(\theta)$ are known). Then, the ML estimate is determined based on the misspecified IRT model, taking into account that $\theta_0$ solves the maximum likelihood equation under the true IRT model. It is assumed that the number of items $I$ is large. Moreover, we apply two Taylor approximations that rely on the assumption that $|\alpha_i(\theta) - a_i(\theta)|$ is sufficiently small.
The derivation in Appendix A (see Equation (A10)) provides
$$\theta_1 \approx \theta_0 + A^{-1} \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i'(\theta_0) = \theta_0 + B(\theta_0), \qquad (15)$$
where the bias term $B$ is defined by $B(\theta) = A^{-1} \sum_{i=1}^{I} \left[ \Psi(a_i(\theta)) - \Psi(\alpha_i(\theta)) \right] \alpha_i'(\theta)$ and $A$ is determined by item information functions (see Appendix A). Equation (15) clarifies how the misspecified IRFs enter the computation of $\theta$. Interestingly, the extent of misspecification $\Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0))$ is weighted by $\alpha_i'(\theta_0)$.
Equation (15) has practical consequences when applying misspecified IRT models. For instance, $\theta_0$ might be the true country percentile, referring to the true IRT model. If the transformation $\theta_1 = m(\theta_0)$ is monotone, the corresponding percentile under the misspecified model is $\theta_1$, and Equation (15) quantifies the bias of the estimated percentile. Moreover, let $f_c$ be the density of the ability under the true IRT model for country $c$; then, one can determine the bias in the country means by using (15). The true mean of country $c$ is given by $\mu_c = \int \theta f_c(\theta) \, d\theta$. The estimated country mean $\mu_c^*$ under the misspecified model is given by
$$\mu_c^* = \mu_c + \int B(\theta) f_c(\theta) \, d\theta. \qquad (16)$$
Note that the bias term $B(\theta)$ will typically be country-specific because the true IRFs $P_i(\theta) = \Psi(a_i(\theta))$ are country-specific due to differential item functioning at the level of countries. Hence, it can be considered a desirable property if item-specific relative country effects in the IRFs are uniformly weighted in Equation (15).
In the case of a fitted 2PL model, it holds that $\alpha_i(\theta) = a_i\theta + b_i$, and deviations $\Psi(a_i(\theta)) - \Psi(\alpha_i(\theta))$ are weighted by $\alpha_i'(\theta) = a_i$ in the derived bias in (15). For the 1PL model, the deviations are equally weighted because $\alpha_i'(\theta) = a$ is the same constant for all items. This property might legitimate the use of the often ill-fitting 1PL model because model deviations are equally weighted across items (see [27]). We elaborate on this discussion in the following Section 2.3.
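A small numerical sketch of Equations (15) and (16) is given below: the bias function B(θ) is evaluated for hypothetical true (3PL-type) and fitted (2PL-type) IRFs, and the country-mean bias is approximated by Monte Carlo integration over a hypothetical country ability distribution. All functions and parameter values are assumptions for illustration only.

```python
import numpy as np

psi = lambda x: 1 / (1 + np.exp(-x))


def bias_function(theta, true_irfs, fitted_funcs, fitted_derivs):
    """Approximate bias B(theta) of Equation (15): the ability estimate under the fitted
    (misspecified) IRFs deviates from theta by A^{-1} * sum_i [P_i(theta) - Psi(alpha_i(theta))] * alpha_i'(theta)."""
    num, A = 0.0, 0.0
    for p_true, alpha, d_alpha in zip(true_irfs, fitted_funcs, fitted_derivs):
        x, dx = alpha(theta), d_alpha(theta)
        info = psi(x) * (1 - psi(x))
        num += (p_true(theta) - psi(x)) * dx
        A += info * dx ** 2
    return num / A


# Hypothetical example: true IRFs are 3PL-like, the fitted model is a 2PL (alpha_i(theta) = a_i*theta + b_i)
a, b, c = np.array([1.0, 0.8, 1.2]), np.array([0.0, -0.5, 0.5]), np.array([0.2, 0.1, 0.15])
true_irfs = [lambda th, i=i: c[i] + (1 - c[i]) * psi(a[i] * th + b[i]) for i in range(3)]
fitted = [lambda th, i=i: a[i] * th + b[i] for i in range(3)]
fitted_d = [lambda th, i=i: a[i] for i in range(3)]

# Bias at selected ability values, and an approximate country-mean bias (Equation (16))
# for a hypothetical country with ability ~ N(0.3, 1), using Monte Carlo integration over f_c
thetas = np.random.default_rng(1).normal(0.3, 1.0, 5000)
bias = np.array([bias_function(t, true_irfs, fitted, fitted_d) for t in (-1.0, 0.0, 1.0)])
mean_bias = np.mean([bias_function(t, true_irfs, fitted, fitted_d) for t in thetas])
print(bias, mean_bias)
```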

2.3. A Few Remarks on the Choice of the IRT Model

In Section 2.1, we introduced several IRT models and it might be asked which criteria should be used for selecting one among these models. We think that model-choice principles depend on the purpose of the scaling models. Pure research purposes (e.g., understanding cognitive processes underlying item response behavior; modeling item complexity) must be distinguished from policy-relevant reporting practice (e.g., country rankings in educational LSA studies). Several researchers have argued that model choice should be primarily a matter of validity and not based on purely statistical criteria [27,60,61,62,63,64].
Myung et al. [63] discussed several criteria for model selection with a focus on cognitive science. We would like to emphasize that these criteria might be weighted differently if applied to educational LSA studies that are not primarily conducted for research purposes. The concept of the interpretability of a selected IRT model means that the model parameters should be linked to psychological processes and constructs. We think that simple unidimensional IRT models in LSA studies are not used because one believes that a unidimensional underlying (causal) variable exists. The chosen IRT model is used for summarizing item response patterns and for providing simple and interpretable descriptive statistics. In this sense, we have argued elsewhere [27] that model fit should not have any relevance for model selection in LSA studies. However, it seems that in official LSA publications, such as those from PISA, information criteria are also used for justifying the use of scaling models [5]. We would like to note that these model comparisons are often biased in the sense that the personally preferred model is often the winner of the fit contest, while other plausible IRT models are excluded from these contests because they could potentially provide a better model fit. Information-criteria-based model selection falls under the criterion of generalizability according to Myung et al. [63]. These criteria are briefly discussed in Section 3.1.
Notably, different IRT models imply a differential weighting of items in the summary variable $\theta$ [29,65]. This characteristic is quantified with locally optimal weights (see Section 2.1). The differential item weighting might impair the comparison of subgroups. More critically, the weighting of items is, in most applications, determined by statistical models and might, hence, have undesirable consequences because practitioners have an implicitly defined, and typically different, weighting of items in mind when composing a test from a set of items. Nevertheless, our study investigates the consequences of using different IRT models for LSA data. To sum up, which of the models should be chosen in operational practice is a difficult question that should not be (entirely) determined by statistical criteria.

3. Model Selection and Model Uncertainty

3.1. Model Selection

It is of particular interest to conduct model comparisons of the different scaling models that involve different IRFs (see Section 2.1). The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used for conducting model comparisons in this article (see [66,67,68,69]). Moreover, the Gilula–Haberman penalty (GHP; [70,71,72]) is used as an effect size that is relatively independent of the sample size and the number of items. The GHP is defined as $\mathrm{GHP} = \mathrm{AIC} / (2 \sum_{p=1}^{N} I_p)$, where $I_p$ is the number of items administered to person $p$. The GHP can be seen as a normalized variant of the AIC. A difference in GHP larger than 0.001 is a notable difference regarding global model fit [72,73].
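As a minimal illustration (with invented numbers), the GHP and the 0.001 criterion can be computed as follows, assuming that items_per_person holds the number of item responses $I_p$ of each person.

```python
import numpy as np


def ghp(aic, items_per_person):
    """Gilula-Haberman penalty: AIC normalized by twice the total number of
    person-item observations (items_per_person holds I_p for each person p)."""
    return aic / (2 * np.sum(items_per_person))


# Hypothetical example: two models fitted to 13,000 students with 30 items each
items_per_person = np.full(13000, 30)
ghp_m1 = ghp(4.292e5, items_per_person)
ghp_m2 = ghp(4.280e5, items_per_person)
print(ghp_m1 - ghp_m2 > 0.001)   # a GHP difference above 0.001 is considered notable
```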

3.2. Model Uncertainty

Country comparisons in LSA studies such as PISA can depend on the chosen IRT model. In this case, choosing a single best-fitting model might be questionable [74,75]. To investigate the impact of model dependency, we discuss the framework of model uncertainty [76,77,78,79,80,81,82,83,84,85,86] in this section and quantify it by a statistic that characterizes model error.
To quantify model uncertainty, each model $m$ is associated with a weight $w_m \geq 0$, and we assume $\sum_{m=1}^{M} w_m = 1$ [87]. To adequately represent the diversity of findings from different models, an equal weighting of models has been criticized [88]. In contrast, particular models in the set of all models are downweighted if they are highly dependent and produce similar results [89,90,91]. We believe that model fit should not influence model weights [92]. The goal is to represent differences between models in the model error. If the model weights were determined by model fit, plausible but non-fitting models such as the 1PL model would receive a model weight of zero, which is not preferred because the 1PL model should not be excluded from the set of specified models. Moreover, if model weights are computed based on information criteria [80], only one or a few models receive weights that differ from zero, and all other models do not impact the statistical inference. This property is why we do not prefer Bayesian model averaging in our application [82,93,94].
Let $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_M)$ be the vector of a statistical parameter of interest across all $M$ models. We can define a composite parameter $\gamma_{\mathrm{comp}}$ as
$$\gamma_{\mathrm{comp}} = \sum_{m=1}^{M} w_m \gamma_m \qquad (17)$$
We can also define a population-level model error (ME) as
$$M_{\gamma_{\mathrm{comp}}} = \sqrt{ \sum_{m=1}^{M} w_m \left( \gamma_m - \gamma_{\mathrm{comp}} \right)^2 } \qquad (18)$$
Now, assume that data are available and $\hat{\boldsymbol{\gamma}} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_M)$ is estimated. The estimate $\hat{\boldsymbol{\gamma}}$ is multivariate normally distributed with mean $\boldsymbol{\gamma}$ and a covariance matrix $\boldsymbol{V}$. Typically, estimates of different models using the same dataset will be (strongly) positively correlated. An estimate of the composite parameter $\gamma_{\mathrm{comp}}$ is given as
$$\hat{\gamma}_{\mathrm{comp}} = \sum_{m=1}^{M} w_m \hat{\gamma}_m \qquad (19)$$
Due to $E(\hat{\gamma}_m) = \gamma_m$, we obtain that $\hat{\gamma}_{\mathrm{comp}}$ is an unbiased estimate of $\gamma_{\mathrm{comp}}$. The empirical model error ME is defined as
$$\mathrm{ME} = \sqrt{ \sum_{m=1}^{M} w_m \left( \hat{\gamma}_m - \hat{\gamma}_{\mathrm{comp}} \right)^2 } \qquad (20)$$
Now, it can be shown that $\mathrm{ME}^2$ is a positively biased estimate of $M_{\gamma_{\mathrm{comp}}}^2$ because the former also contains sampling variability. Define $\gamma_{\mathrm{comp}} = \boldsymbol{w}^\top \boldsymbol{\gamma}$, where $\boldsymbol{w} = (w_1, \ldots, w_M)$. Similarly, we can write $\hat{\gamma}_{\mathrm{comp}} = \boldsymbol{w}^\top \hat{\boldsymbol{\gamma}}$. Let $\boldsymbol{e}_m$ be the $m$-th unit vector of length $M$ that has an entry of 1 at the $m$-th position and 0 otherwise. This notation enables the representation $\gamma_m = \boldsymbol{e}_m^\top \boldsymbol{\gamma}$. Define $\boldsymbol{u}_m = \boldsymbol{e}_m - \boldsymbol{w}$. From (18), we obtain
$$M_{\gamma_{\mathrm{comp}}}^2 = \sum_{m=1}^{M} w_m \left( \boldsymbol{u}_m^\top \boldsymbol{\gamma} \right)^2 = \sum_{m=1}^{M} w_m \, \boldsymbol{u}_m^\top \boldsymbol{\gamma} \boldsymbol{\gamma}^\top \boldsymbol{u}_m \qquad (21)$$
Furthermore, we can then rewrite the expected value of $\mathrm{ME}^2$ from Equation (20) as
$$E(\mathrm{ME}^2) = M_{\gamma_{\mathrm{comp}}}^2 + \sum_{m=1}^{M} w_m E\!\left[ \left( \boldsymbol{u}_m^\top (\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}) \right)^2 \right] = M_{\gamma_{\mathrm{comp}}}^2 + \sum_{m=1}^{M} w_m \, \boldsymbol{u}_m^\top \boldsymbol{V} \boldsymbol{u}_m = M_{\gamma_{\mathrm{comp}}}^2 + B, \qquad (22)$$
where the second term $B$ is a bias term that reflects the variation across models due to sampling error. This term can be estimated if an estimate of the covariance matrix $\boldsymbol{V}$ of the vector of model estimates $\hat{\boldsymbol{\gamma}}$ is available. As an alternative, the bias in $E(\mathrm{ME}^2)$ can be removed by estimating $B$ in (22) with resampling techniques such as the bootstrap, the jackknife, or (balanced) half sampling [21,95]. Let $\hat{B}$ be an estimate of the bias; a bias-corrected model error can then be estimated by
$$\mathrm{ME}_{\mathrm{bc}} = \sqrt{ \max\left( \mathrm{ME}^2 - \hat{B}, \, 0 \right) } \qquad (23)$$
One can define a total error TE that includes the sampling error SE due to person sampling and a model error estimate ME bc :
$$\mathrm{TE} = \sqrt{ \mathrm{SE}^2 + \mathrm{ME}_{\mathrm{bc}}^2 } \qquad (24)$$
This total error also takes the variability in the model choice into account and allows for broader inference. Constructed confidence intervals relying on  TE will be wider than ordinary confidence intervals that are only based on the  SE .
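The following sketch summarizes the computations of this subsection for a single distribution parameter: the composite estimate of Equation (19), the empirical model error of Equation (20), the bias-corrected model error of Equation (23), and the total error of Equation (24). The eleven model estimates, the bias estimate, and the standard error are hypothetical; the model weights are those later used in Section 4.2.

```python
import numpy as np


def model_uncertainty(estimates, weights, bias_est, se):
    """Composite estimate (Equation (19)), empirical model error (Equation (20)),
    bias-corrected model error (Equation (23)), and total error (Equation (24))."""
    weights = np.asarray(weights) / np.sum(weights)
    composite = np.sum(weights * estimates)
    me = np.sqrt(np.sum(weights * (estimates - composite) ** 2))
    me_bc = np.sqrt(max(me ** 2 - bias_est, 0.0))
    te = np.sqrt(se ** 2 + me_bc ** 2)
    return composite, me, me_bc, te


# Hypothetical country mean estimates from 11 scaling models (model weights as in Section 4.2);
# bias_est plays the role of the estimated bias term B-hat of Equation (22), se is the sampling error
means = np.array([503.2, 501.9, 504.0, 503.5, 502.8, 503.1, 502.5, 502.7, 503.0, 502.9, 503.3])
weights = [0.273, 0.136, 0.136, 0.061, 0.061, 0.061, 0.068, 0.068, 0.045, 0.045, 0.045]
print(model_uncertainty(means, weights, bias_est=0.05, se=2.6))
```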

4. Method

In our empirical application, we used data from PISA 2009 to assess the influence of the choice of different scaling models. Similar research with substantially fewer IRT modeling alternatives was conducted in [8,96,97].

4.1. Data

PISA 2009 data was used in this empirical application [3]. The impact of the choice of the scaling model was investigated for the three cognitive domains mathematics, reading, and science. In total, 35, 101, and 53 items were included in our analysis for the domains mathematics, reading, and science, respectively. All polytomous items were dichotomously recoded, with only the highest category being recoded as correct.
A total number of 26 countries were included in the analysis. The median sample sizes at the country level were Med = 5398 (M = 8578.0, Min = 3628, Max = 30,905) for reading, Med = 3761 (M = 5948.2, Min = 2510, Max = 21,379) for mathematics, and Med = 3746.5 (M = 5944.2, Min = 2501, Max = 21,344) for science.
For all analyses at the country level, student weights were taken into account. Within a country, student weights were normalized to a sum of 5000, so that all countries contributed equally to the analyses.

4.2. Analysis

We compared the fit of 11 different scaling models (see Section 2.1) in an international calibration sample [98]. To this end, 500 students were randomly sampled from each of the 26 countries and each of the three cognitive domains. Model comparisons were conducted based on the resulting samples involving 13,000 students.
In the next step, the item parameters obtained from the international calibration sample were fixed in the country-specific scaling models. In this step, plausible values for the $\theta$ distribution in each of the countries were drawn [99,100]. We did not include student covariates when drawing plausible values. Note that sampling weights were taken into account in this scaling step. The resulting plausible values were subsequently linearly transformed such that a weighted mean of 500 and a weighted standard deviation of 100 held in the total sample of students comprising all countries. Weighted descriptive statistics and their standard errors for the $\theta$ distribution were computed according to the Rubin rules of multiple imputation [3]. The only difference to the original PISA approach is that we applied balanced half sampling instead of balanced repeated replication for computing standard errors (see [21,101]). Balanced half sampling has the advantage of an easy computation of the bias term for the model error (see Equation (23)).
For quantifying model uncertainty, model weights were assigned prior to the analysis based on the principles discussed in Section 3.2. First, because the 1PL, 2PL, and the 3PL are the most frequently used models in LSA studies, we decided that the sum of their model weights should exceed 0.50. Second, the weights of models with similar behavior (i.e., models that result in similar country means) should be decreased. These considerations resulted in the following weights: 1PL: 0.273, 2PL: 0.136, 3PL: 0.136; 1PCL: 0.061; 1PLL: 0.061; 1PGL: 0.061; 3PLQ: 0.068; 3PLRH: 0.068; 4PGL: 0.045; 4PL: 0.045; 4PLQ: 0.045. It is evident that a different choice of model weights will change the composite parameter of interest and the associated model error. In order to ease the presentation of results in this paper, we did not conduct an extensive sensitivity analysis employing alternative sets of model weights; only a single comparison with uniform model weights is reported for the reading domain (see Section 5.2). In order to study the relative importance of the sampling error (SE) and the bias-corrected model error ($\mathrm{ME}_{\mathrm{bc}}$), we computed an error ratio (ER) defined by $\mathrm{ER} = \mathrm{ME}_{\mathrm{bc}} / \mathrm{SE}$. Moreover, we computed the total error as $\mathrm{TE} = \sqrt{\mathrm{SE}^2 + \mathrm{ME}_{\mathrm{bc}}^2}$.
All analyses were carried out with the statistical software R [102]. The different IRT models were fitted using the xxirt() function in the R package sirt [103]. Plausible value imputation was conducted using the R package TAM [104].

5. Results

5.1. Model Comparisons Based on Information Criteria

The 11 different scaling models were compared for the three cognitive domains mathematics, reading, and science for the PISA 2009 dataset. Table 1 displays model comparisons based on AIC, BIC, and Δ GHP , which is defined as the difference between the GHP values of a particular model and the best-fitting model.
Based on the AIC or $\Delta\mathrm{GHP}$, one of the models 4PGL, 3PLQ, 3PLRH, 3PL, 4PL, or 4PLQ was preferred, depending on the domain. If the BIC were used as a selection criterion, the 3PLQ or the 3PLRH would always be chosen across the models. Notably, the operationally used 2PL model provided a satisfactory model fit only for the reading domain. By inspecting $\Delta\mathrm{GHP}$, it is evident that the largest gain in model fit is obtained by switching from one- to two-, three-, or four-parameter models. However, the gain in model fit from the 2PL to the 3PL model is not noteworthy.
In contrast, the gains in model fit for the 3PLQ or 3PLRH can be substantial. Among the one-parameter models, it is interesting that the loglog link function resulted in a better model fit for mathematics compared to the logistic or the cloglog link functions. This was not the case for reading or science. Overall, the model comparison for PISA 2009 demonstrated that the 3PLQ or 3PLRH should be preferred over the 2PL model for reasons of model fit.

5.2. Model Uncertainty for Distribution Parameters

To obtain visual insight into the similarity of the different scaling models, we computed pairwise absolute differences in the country means. The averages of these differences across countries served as a distance matrix, which was used as input for a hierarchical cluster analysis based on the Ward method. Figure 4 shows the dendrogram of this cluster analysis. It can be seen that the 2PL and 3PL models provided similar results. Another cluster of models was formed by the more complex models 3PLQ, 3PLRH, 4PGL, 4PL, and 4PLQ. Finally, the different one-parameter models 1PL, 1PCL, 1PLL (and 1PGL) provided relatively distinct findings.
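The clustering step can be sketched as follows; the matrix of country means is simulated here purely for illustration (the analysis above used the estimates underlying Table 2), and applying the Ward method to a non-Euclidean distance matrix should be regarded as a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical matrix of country means: rows = 11 scaling models, columns = 26 countries
rng = np.random.default_rng(0)
country_means = 500 + rng.normal(0, 3, size=(11, 26))
model_labels = ["1PL", "1PCL", "1PLL", "1PGL", "2PL", "4PGL",
                "3PLQ", "3PLRH", "3PL", "4PL", "4PLQ"]

# Distance between two models = average absolute difference of their country means
dist = np.mean(np.abs(country_means[:, None, :] - country_means[None, :, :]), axis=2)
np.fill_diagonal(dist, 0.0)

# Hierarchical cluster analysis with the Ward method (as used for Figure 4)
Z = linkage(squareform(dist, checks=False), method="ward")
dendrogram(Z, labels=model_labels, no_plot=True)  # set no_plot=False to draw the dendrogram
```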
In Table 2, detailed results for the 11 different scaling models for country means in PISA 2009 reading are shown. The largest numbers of substantial deviations of country means from the weighted mean (i.e., the composite parameter), defined as deviations of at least 1, were obtained for the 1PCL model (10), the 1PLL model (9), and the 4PLQ model (9). At the level of countries, there were 11 countries for which none of the scaling models substantially differed from the weighted mean. In contrast, there was a large number of deviations for Denmark (DNK; 9) and South Korea (KOR; 10). The ranges in country means across the different scaling models varied between 0.3 (SWE; Sweden) and 7.7 (JPN; Japan), with a mean of 2.4.
In Table A1 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 mathematics are shown. The largest number of substantial deviations from the weighted mean was obtained for the 1PCL (12), the 1PLL (11), and the 1PGL (9) model. The ranges of the country means across models ranged between 0.5 and 7.9, with a mean of 2.8.
In Table A2 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 science are shown. For science, many models showed a large number of deviations. This demonstrates large model uncertainty. The ranges of the country means across models varied between 0.6 and 7.8, with a mean of 2.8.
In Table 3, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 reading are shown. The unadjusted model error had an average of M = 0.66. The bias-corrected model error ME bc was slightly smaller, with M = 0.62. On average, the error ratio was 0.24, indicating that the larger portion of uncertainty is due to sampling error compared to model error.
The estimated country standard deviations for reading were much more model-dependent. The bias-corrected model error had an average of 0.96 (ranging between 0.00 and 2.68). This was also pronounced in the error ratio, which had an average of 0.60. The maximum error ratio was 2.05 for Finland (FIN; with a range of 9.8 across models), indicating that the model error was twice as large as the sampling error. Overall, the model error turned out to be much more important for the standard deviation than for the mean.
In Table 4, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 reading are shown. For the 10th percentile Q10, the error ratio was on average 0.60, with a range between 0.13 and 2.61. The average error ratio was even larger for the 90th percentile Q90 (M = 0.84, Min = 0.23, Max = 2.16). Hence, quantile comparisons across countries can be sensitive to the choice of the IRT scaling model.
In Table A3 in Appendix C, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 mathematics are shown. As for reading, the error ratio was on average smaller for country means (M = 0.24, Max = 0.66) than for country standard deviations (M = 0.77, Max = 1.58). Nevertheless, the additional uncertainty associated with model uncertainty is too large to be ignored in statistical inference. For example, South Korea (KOR) had a range of 15.7 for the standard deviation across models, which corresponds to an error of 3.75 and an error ratio of 1.58.
In Table A4 in Appendix C, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics are shown. The error ratios for the 10th and the 90th percentiles were similar (Q10: M = 0.66; Q90: M = 0.65). In general, the relative increase in uncertainty due to model error for percentiles was similar to the standard deviation.
In Table A5 in Appendix C, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 science are shown. As for reading and mathematics, the importance of model error was relatively small for country means (M = 0.27 for the error ratio). However, it reached 0.72 for Denmark with a bias-corrected model error of 1.89. For country standard deviations, the error ratio was larger (M = 0.53, Min = 0.00, Max = 1.50).
In Table A6 in Appendix C, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science are shown. The influence of model error on percentiles was slightly smaller in science than in reading or mathematics. The average error ratios were M = 0.44 (Q10) and M = 0.57 (Q90), but the maximum error ratios of 1.53 (Q10) and 2.04 (Q90) indicated that model error was more important than sampling error for some countries.
To investigate the impact of the choice of model weights in our analysis (see Section 4.2), we additionally conducted a sensitivity analysis for the reading domain by using uniform model weights (weighting scheme W2). That is, we weighted each of the 11 scaling models by $w_m = 1/11 = 0.091$ ($m = 1, \ldots, 11$). We studied changes in country means and country standard deviations regarding the composite mean, standard errors (SE), and model errors ($\mathrm{ME}_{\mathrm{bc}}$). The results are displayed in Table 5.
For the composite estimate of the country mean, we only observed tiny differences between the proposed model weighting W1 and the uniform weighting W2. The absolute difference in country means was 0.14 on average (SD = 0.11) and ranged between 0.01 and 0.36 (South Korea, KOR). The average absolute difference for the change in country standard deviations was also small (M = 0.26; SD = 0.20). Notably, there were almost no changes in the standard error for country means and country standard deviations for the weighting methods. However, the model error slightly increased with uniform weighting from M = 0.62 to M = 0.68 for country means and from 0.96 to 1.12 for country standard deviation. In conclusion, one can state that employing a different weighting scheme might not strongly change the composite estimate or the standard error but can have importance regarding the quantified model uncertainty in the model error ME bc .

6. Discussion

Overall, our findings demonstrate that uncertainty regarding the choice of the IRT scaling model influences country means. This kind of uncertainty is too large to be neglected in reporting. For some of the countries, the model error exceeded the sampling error. In this case, confidence intervals based only on standard errors for the sampling of students might be overly narrow.
A different picture emerged for standard deviations and percentiles. In this case, the choice of the IRT model turned out to be much more important. Estimated error ratios were, on average, between 0.40 and 0.80, indicating that the model error introduced a non-negligible amount of uncertainty in parameters of interest. However, the importance of model error compared to sampling error was even larger for some of the countries. In particular, distribution parameters for high- and low-performing countries were substantially affected by the choice of the IRT model.
In our analysis, we only focused on 11 scaling models studied in the literature. However, semi- or nonparametric IRT models could alternatively be utilized [16,53,105,106,107], and their impact on distribution parameters could be an exciting topic for future research. If IRT models with even more parameters were included, we would expect an even larger impact of model choice on distribution parameters.
In our analysis, we did not use student covariates for drawing plausible values [100,108]. It could be that the impact of the choice of the IRT model would be smaller if relevant student covariates were included [109]. Future research can provide answers to this important question. As a summary of our research (see also Section 2.3), we would like to argue that model uncertainty should also be reported in educational LSA studies. This could be particularly interesting because only the 1PL, 2PL, or the 3PL models are currently applied in these studies. In our model comparisons, we have shown that the 3PL with residual heterogeneity (3PLRH) and the 3PL with quadratic effects of $\theta$ (3PLQ) were superior to the alternatives. If the 2PL model is preferred over the 1PL model for reasons of model fit, three-parameter models must be preferred for the same reason. However, a central question might be whether the 3PLRH should be implemented in the operational practice of LSA studies. Technically, it would certainly be feasible, and there is no practical added complexity compared to the 2PL or the 3PL model.
Interestingly, some specified IRT models have the same number of item parameters but a different ability to fit the item response data. For example, the 3PL and the 3PLRH models have the same number of parameters, but the 3PLRH is often preferred in terms of model fit. This underlines that the choice of the functional form is also relevant, not only the number of item parameters [30].
Frequently, the assumed IRT models will be grossly misspecified for educational LSA data. The misspecification could lie in the functional form of the IRFs or the assumption of invariant item parameters across countries. The reliance of ML estimation on misspecified IRT models might be questioned. As an alternative, (robust) limited-information (LI) estimation methods [110] can be used. Notably, ML and LI methods result in a different weighing of model errors [111]. If differential item functioning (DIF) across countries is critical, IRT models can also be separately estimated in each country, and the results brought onto a common international metric through linking methods [112,113]. In the case of a small sample size at the country level, regularization approaches for more complex IRT models can be employed to stabilize estimation [114,115]. Linking methods have the advantage of a clear definition of model loss regarding country DIF [116,117,118] compared to joint estimation with ML or LI estimation [119].
As pointed out by an anonymous reviewer, applied psychometric researchers seem to have a tendency to choose the best fitting model with little care for whether that choice is appropriate in the particular research context. We have argued elsewhere that the 1PL model compared to other IRT models with more parameters is more valid because of its equal weighting of items [27]. If Pandora’s box is opened via the argument of choosing a more complex IRT model due to improved model fit, we argue for a specification of different IRT models and an integrated assessment of model uncertainty, as has been proposed in this article. In this approach, however, the a priori choice of model weights has to be carefully conducted.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The PISA 2009 dataset is available from https://www.oecd.org/pisa/data/pisa2009database-downloadabledata.htm (accessed on 13 March 2022).

Acknowledgments

I sincerely thank three anonymous reviewers for their valuable comments that improved this article.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
AIC   Akaike information criterion
BIC   Bayesian information criterion
DIF   differential item functioning
GHP   Gilula–Haberman penalty
IRF   item response function
IRT   item response theory
LI    limited information
LSA   large-scale assessment
ME    model error
ML    maximum likelihood
PISA  programme for international student assessment
SE    standard error
TE    total error

Appendix A. Ability Estimation in Misspecified IRT Models

Let $P_i(\theta) = \Psi(a_i(\theta))$ be the true but unknown IRFs, where $\Psi$ is the logistic link function and $a_i$ is a differentiable function. If the IRFs are known, the latent ability $\theta$ can be obtained by maximizing the following log-likelihood function
$$l(\theta) = \sum_{i=1}^{I} \left\{ x_i \log \Psi(a_i(\theta)) + (1 - x_i) \log\left[ 1 - \Psi(a_i(\theta)) \right] \right\}. \qquad \mathrm{(A1)}$$
The maximization of Equation (A1) provides the estimating equation
$$l_1(\theta_0) = \left. \frac{\partial l}{\partial \theta} \right|_{\theta = \theta_0} = \sum_{i=1}^{I} \left[ x_i - \Psi(a_i(\theta_0)) \right] a_i'(\theta_0) = 0, \qquad \mathrm{(A2)}$$
where $a_i'$ denotes the first derivative of $a_i$. Note that
$$E\left[ l_1(\theta_0) \right] = \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(a_i(\theta_0)) \right] a_i'(\theta_0) = 0. \qquad \mathrm{(A3)}$$
Now assume that misspecified IRFs P i * ( θ ) = Ψ ( α i ( θ ) ) instead of  P i ( θ ) are used. The following estimating equation provides an ability estimate θ 1 :
$$l_1^*(\theta_1) = \sum_{i=1}^{I} \left[ x_i - \Psi(\alpha_i(\theta_1)) \right] \alpha_i'(\theta_1) = 0. \qquad \mathrm{(A4)}$$
We make use of the following Taylor approximations
$$\alpha_i(\theta_1) \approx \alpha_i(\theta_0) + \alpha_i'(\theta_0)(\theta_1 - \theta_0) \quad \text{and} \qquad \mathrm{(A5)}$$
$$\Psi(\alpha_i(\theta_1)) \approx \Psi(\alpha_i(\theta_0)) + I(\alpha_i(\theta_0)) \, \alpha_i'(\theta_0)(\theta_1 - \theta_0), \qquad \mathrm{(A6)}$$
where $I(x) = \Psi(x)\left[ 1 - \Psi(x) \right]$. Set $\Delta\theta = \theta_1 - \theta_0$. We obtain by inserting (A5) and (A6) in (A4)
$$l_1^*(\theta_1) \approx \sum_{i=1}^{I} \left[ x_i - \Psi(\alpha_i(\theta_0)) - I(\alpha_i(\theta_0)) \, \alpha_i'(\theta_0) \, \Delta\theta \right] \left[ \alpha_i'(\theta_0) + \alpha_i''(\theta_0) \, \Delta\theta \right] = 0. \qquad \mathrm{(A7)}$$
We can now determine the bias $\Delta\theta$ by solving $E(l_1^*(\theta_1)) = 0$ for $\theta_1$ and taking $E(l_1(\theta_0)) = 0$ into account. Moreover, we take the expectation and ignore the squared term in $\Delta\theta$ (i.e., $(\Delta\theta)^2 \approx 0$). Then, we compute from (A7)
$$\begin{aligned} E\left[ l_1^*(\theta_1) \right] &= E\left\{ \sum_{i=1}^{I} \left[ x_i - \Psi(\alpha_i(\theta_0)) - I(\alpha_i(\theta_0)) \, \alpha_i'(\theta_0) \, \Delta\theta \right] \left[ \alpha_i'(\theta_0) + \alpha_i''(\theta_0) \, \Delta\theta \right] \right\} \\ &= \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) - I(\alpha_i(\theta_0)) \, \alpha_i'(\theta_0) \, \Delta\theta \right] \left[ \alpha_i'(\theta_0) + \alpha_i''(\theta_0) \, \Delta\theta \right] \\ &= \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i'(\theta_0) - \Delta\theta \sum_{i=1}^{I} \left\{ I(\alpha_i(\theta_0)) \left[ \alpha_i'(\theta_0) \right]^2 - \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i''(\theta_0) \right\} = 0 \end{aligned} \qquad \mathrm{(A8)}$$
Finally, we obtain from (A8)
$$\theta_1 = \theta_0 + \frac{\displaystyle \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i'(\theta_0)}{\displaystyle \sum_{i=1}^{I} \left\{ I(\alpha_i(\theta_0)) \left[ \alpha_i'(\theta_0) \right]^2 - \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i''(\theta_0) \right\}}. \qquad \mathrm{(A9)}$$
We can further approximate the term in (A9) to
$$\theta_1 \approx \theta_0 + A^{-1} \sum_{i=1}^{I} \left[ \Psi(a_i(\theta_0)) - \Psi(\alpha_i(\theta_0)) \right] \alpha_i'(\theta_0), \qquad \mathrm{(A10)}$$
where $A = \sum_{i=1}^{I} I(\alpha_i(\theta_0)) \left[ \alpha_i'(\theta_0) \right]^2$.
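The quality of the approximation (A10) can be checked numerically, as in the following sketch with hypothetical linear functions $a_i(\theta)$ and $\alpha_i(\theta)$: the exact solution of the expected estimating Equation (A4) is compared with the approximation (A10).

```python
import numpy as np
from scipy.optimize import brentq

psi = lambda x: 1 / (1 + np.exp(-x))

# Hypothetical true IRFs a_i(theta) and misspecified IRFs alpha_i(theta) (both linear here,
# with slightly perturbed slopes/intercepts so that |alpha_i - a_i| is small)
a_sl, a_ic = np.array([1.0, 0.8, 1.2, 0.9]), np.array([0.0, -0.4, 0.5, 0.2])
al_sl, al_ic = a_sl + np.array([0.1, -0.05, 0.08, -0.1]), a_ic + 0.05

theta0 = 0.7
# Expected misspecified score equation: sum_i [Psi(a_i(theta_0)) - Psi(alpha_i(theta_1))] alpha_i'(theta_1) = 0
score = lambda t1: np.sum((psi(a_sl * theta0 + a_ic) - psi(al_sl * t1 + al_ic)) * al_sl)
theta1_exact = brentq(score, theta0 - 2, theta0 + 2)

# Approximation (A10): theta_1 ~= theta_0 + A^{-1} sum_i [Psi(a_i) - Psi(alpha_i)] alpha_i'
x = al_sl * theta0 + al_ic
A = np.sum(psi(x) * (1 - psi(x)) * al_sl ** 2)
theta1_approx = theta0 + np.sum((psi(a_sl * theta0 + a_ic) - psi(x)) * al_sl) / A
print(theta1_exact, theta1_approx)
```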

Appendix B. Country Labels for PISA 2009 Study

The following country labels were used in the Results Section 5 for the PISA 2009 analysis:
AUS = Australia; AUT = Austria; BEL = Belgium; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; JPN = Japan; KOR = Republic of Korea; LUX = Luxembourg; NLD = Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; SWE = Sweden.

Appendix C. Additional Results for PISA 2009 Mathematics and Science

In Table A1, detailed results for 11 different scaling models for country means in PISA 2009 mathematics are shown. In Table A2, detailed results for 11 different scaling models for country means in PISA 2009 science are shown.
In Table A3, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 mathematics are shown. In Table A4, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics are shown. In Table A5, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 science are shown. In Table A6, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science are shown.
Table A1. Detailed results for all 11 different scaling models for country means in PISA 2009 mathematics.
CNT | M | rg | MEbc | 1PL | 1PCL | 1PLL | 1PGL | 2PL | 4PGL | 3PLQ | 3PLRH | 3PL | 4PL | 4PLQ
AUS511.20.720.02511.3510.9510.8511.1511.4511.4511.2511.2511.5511.2511.3
AUT492.52.900.71492.7491.2494.1493.9492.7492.4492.3492.9491.2492.1492.1
BEL512.42.990.86513.0511.3514.2514.2511.6512.2512.4512.1511.5512.3512.2
CAN523.02.170.62522.5521.9522.7522.9523.8523.1523.1523.2524.0523.0523.0
CHE533.56.221.44532.5529.0535.0535.2533.9534.5534.4534.4533.4534.9534.6
CZE488.11.210.20488.2488.9487.8487.7488.5487.8487.8488.0488.0487.7487.7
DEU508.92.460.89509.7508.9510.4510.3508.1508.3507.9508.2508.0508.1508.0
DNK497.43.520.93498.0499.7496.3496.6497.6496.2496.4496.2497.9496.4496.4
ESP478.90.530.06479.1479.0478.8478.8478.6479.1478.9479.0478.6478.9478.9
EST508.15.351.35507.6510.8505.5505.4508.9507.8507.9507.7509.9507.9507.9
FIN538.15.131.27539.3541.1538.2537.9537.9536.5536.4536.8538.2536.0536.2
FRA490.81.790.50491.3490.0491.6491.8490.0490.4490.7490.6490.4490.5490.5
GBR486.92.300.53486.6486.9485.3485.9487.1487.1487.3487.1487.6487.3487.3
GRC458.03.950.97458.6457.6459.9459.2457.3458.3458.0458.2456.0457.9457.8
HUN483.41.110.00483.5484.1483.1483.0483.5483.5483.2483.4483.1483.3483.4
IRL482.61.970.55482.1482.1481.6482.0483.1483.0483.0482.7483.6483.2483.2
ISL501.03.020.74501.5503.0500.1500.2500.7500.0500.4500.1501.3500.3500.4
ITA478.00.880.18478.1478.6478.1477.8477.7478.2478.2478.2477.8478.2478.2
JPN529.93.061.11528.4529.1529.1528.9530.5531.3531.1531.0530.5531.4531.3
KOR544.77.872.45541.6540.0546.4545.8545.6546.7547.5547.1545.6547.7547.8
LUX483.41.550.46483.8482.8484.1484.0482.7483.7483.3483.7482.5483.4483.5
NLD521.51.980.51522.0522.6521.4521.5521.2520.8520.8520.8521.5520.6520.7
NOR493.34.110.87493.4495.6491.5491.6493.5492.9493.0492.8493.9493.0493.0
POL487.01.220.15487.1488.0486.8486.7487.1486.9486.9486.8486.8487.0486.9
PRT480.12.260.49479.8478.7479.7480.0480.3480.7480.8481.0480.2480.8480.7
SWE487.41.440.47488.1488.3487.4487.6486.8487.2487.0487.0486.9487.0487.1
Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); For model descriptions see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of country means of the 11 different models more than 1 are printed in bold.
Table A2. Detailed results for all 11 different scaling models for country means in PISA 2009 science.
CNT | M | rg | MEbc | 1PL | 1PCL | 1PLL | 1PGL | 2PL | 4PGL | 3PLQ | 3PLRH | 3PL | 4PL | 4PLQ
AUS517.62.730.83518.4518.1519.2518.3516.7517.3517.1517.2516.5517.2517.1
AUT488.11.110.18487.9488.6487.6488.0488.4488.3488.4488.7487.9488.3488.2
BEL498.12.370.55497.8496.6498.9497.7498.5498.6498.5498.5498.2498.7498.6
CAN519.60.650.09519.6519.5520.0519.6519.4519.6519.6519.5519.6519.4519.6
CHE509.20.960.35508.7508.8508.9508.7509.5509.6509.7509.7509.4509.6509.6
CZE494.12.890.98495.1495.7494.5495.2493.5493.0492.8492.9493.6492.9492.9
DEU513.92.130.53514.2514.9514.7514.2514.0513.3513.1513.5513.7513.1512.8
DNK488.34.701.89490.3490.9489.6490.4486.2486.6486.5486.4486.8486.7486.6
ESP478.22.070.42478.2479.0477.0478.4478.1477.8477.9477.7478.7478.0478.0
EST517.51.000.23517.4517.2517.9517.3517.4517.6517.2517.4518.2517.4517.2
FIN546.53.540.79547.1546.3549.0546.9546.0546.4546.0546.1545.5546.3546.1
FRA488.23.741.02487.2485.9488.3487.1488.9489.3489.6489.5488.8489.3489.5
GBR505.01.120.28504.7504.8505.2504.7504.9505.8505.4505.4504.7505.5505.6
GRC461.44.511.26460.3458.3461.6460.0462.4462.8462.5462.5462.1462.5462.5
HUN494.65.051.36495.8498.0493.5496.1493.9492.9493.0493.0494.5493.0493.1
IRL497.00.950.27497.3497.4497.4497.3496.7496.8496.5496.7496.7496.5496.6
ISL487.63.341.09486.5487.4485.5486.6488.8488.4488.2488.4488.8488.1488.2
ITA479.70.570.17479.9479.5479.5479.9479.8479.5479.4479.3479.7479.3479.3
JPN534.67.852.29532.4530.2534.6532.1536.1536.3537.6536.9535.0538.1537.6
KOR530.63.571.42529.1529.0529.1529.2531.0532.0532.5532.4531.5532.3532.4
LUX474.83.490.87474.2472.6475.1474.0475.3476.1476.0476.1474.6475.7475.7
NLD514.22.630.93515.2515.6514.8515.2513.6513.1513.0513.2513.4513.0513.1
NOR491.03.241.10492.2492.6491.2492.3490.5489.4489.6489.4490.6489.6489.7
POL499.63.080.70500.0501.7498.6500.2499.3499.0498.9499.0499.7498.7498.9
PRT483.44.410.88483.2485.3480.9483.5483.8483.0483.1482.9484.4483.1483.1
SWE487.31.540.34487.1486.3487.2487.0487.5487.6487.9487.7487.5487.9487.9
Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); For model descriptions see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of country means of the 11 different models more than 1 are printed in bold.
Table A3. Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 mathematics.
Country Mean | Country Standard Deviation
CNT | N | M | rg | SE | ME | MEbc | ER | TE | M | rg | SE | ME | MEbc | ER | TE
AUS9889511.20.72.750.190.020.012.75101.52.71.820.890.830.452.00
AUT4575492.52.93.170.800.710.223.25105.16.02.051.761.680.822.65
BEL5978512.43.02.390.880.860.362.54111.54.22.201.361.320.602.56
CAN16,040523.02.21.700.620.620.371.8193.55.51.281.731.731.352.16
CHE8157533.56.23.591.451.440.403.87105.27.21.852.332.291.232.94
CZE4223488.11.23.160.320.200.063.1698.92.82.100.930.860.412.27
DEU3503508.92.53.450.910.890.263.56104.62.32.270.860.730.322.38
DNK4088497.43.52.860.950.930.333.0191.91.81.780.360.080.051.79
ESP17,920478.90.52.210.200.060.032.2195.46.11.641.631.600.982.29
EST3279508.15.32.821.371.350.483.1383.55.91.961.601.560.802.50
FIN4019538.15.12.221.321.270.572.5687.88.41.822.612.591.423.17
FRA2965490.81.83.670.590.500.143.71104.74.62.771.341.260.453.05
GBR8431486.92.32.770.590.530.192.8294.23.11.750.900.820.471.93
GRC3445458.03.94.131.030.970.234.2497.69.62.382.882.821.183.69
HUN3177483.41.14.040.260.000.004.0497.85.43.421.691.690.493.82
IRL2745482.62.02.890.610.550.192.9488.35.02.021.411.360.672.44
ISL2510501.03.02.140.760.740.352.2695.02.52.090.690.610.292.18
ITA21,379478.00.92.090.240.180.092.1098.05.51.401.321.320.941.92
JPN4207529.93.13.771.151.110.293.93101.77.92.612.612.540.973.64
KOR3447544.77.93.712.522.450.664.4594.015.72.383.903.751.584.45
LUX3197483.41.61.880.530.460.241.94103.65.11.781.361.300.732.21
NLD3318521.52.05.190.560.510.105.2296.44.52.061.571.490.732.54
NOR3230493.34.12.760.880.870.322.8992.62.81.470.850.740.501.65
POL3401487.01.22.990.280.150.052.9995.45.91.902.462.441.283.10
PRT4391480.12.32.990.540.490.163.0397.74.71.931.531.490.772.44
SWE3139487.41.43.020.530.470.153.0699.33.51.911.141.080.572.19
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).
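The last two columns of each block follow directly from SE and MEbc. A minimal R illustration, using the AUT country-mean values for mathematics from the table above:

```r
# Error ratio and total error as defined in the table note:
# ER = MEbc / SE and TE = sqrt(SE^2 + MEbc^2).
error_ratio <- function(SE, MEbc) MEbc / SE
total_error <- function(SE, MEbc) sqrt(SE^2 + MEbc^2)

# AUT country mean in mathematics: SE = 3.17, MEbc = 0.71
error_ratio(3.17, 0.71)   # about 0.22
total_error(3.17, 0.71)   # about 3.25
```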
Table A4. Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics.
CNT | N | Country 10th Percentile: M, rg, SE, ME, MEbc, ER, TE | Country 90th Percentile: M, rg, SE, ME, MEbc, ER, TE
AUS9889380.22.23.120.760.610.203.18641.98.44.062.652.560.634.80
AUT4575355.516.24.224.744.601.096.25627.17.33.862.242.090.544.39
BEL5978367.05.84.461.691.530.344.71654.914.92.944.674.661.595.51
CAN16,040402.34.62.751.501.500.543.13643.710.22.013.153.151.573.73
CHE8157393.93.64.291.140.970.234.40666.620.34.165.765.711.377.07
CZE4223361.99.74.802.912.850.595.58617.32.23.910.630.300.083.92
DEU3503371.86.75.121.891.850.365.45642.611.63.753.833.751.005.30
DNK4088379.24.53.491.661.510.433.80616.13.53.691.191.140.313.86
ESP17,920354.413.33.443.753.701.085.05600.84.92.731.121.060.392.92
EST3279401.78.84.282.312.240.524.83616.76.63.661.801.680.464.03
FIN4019425.08.73.392.662.610.774.28650.913.83.124.604.571.475.54
FRA2965354.310.95.452.822.720.506.09623.97.84.853.403.280.685.86
GBR8431366.86.33.322.162.090.633.92609.52.03.930.580.260.073.94
GRC3445332.422.85.636.556.441.148.55584.06.74.641.771.690.364.94
HUN3177356.712.36.073.573.570.597.04608.56.46.031.631.510.256.21
IRL2745368.08.04.452.362.220.504.97594.65.13.391.541.470.433.70
ISL2510378.33.43.701.251.050.283.84622.24.73.461.701.600.463.82
ITA21,379351.511.32.473.413.411.384.21604.34.42.890.910.830.293.01
JPN4207397.87.26.312.152.020.326.63658.716.74.174.974.851.166.40
KOR3447424.910.64.523.222.920.655.38666.732.25.098.047.881.559.38
LUX3197348.514.13.544.124.031.145.36615.83.32.511.271.140.462.75
NLD3318396.94.65.921.180.920.166.00645.810.15.023.313.220.645.96
NOR3230373.75.13.441.861.720.503.85612.94.13.260.900.750.233.35
POL3401364.013.23.604.714.711.315.93610.37.24.092.612.500.614.79
PRT4391354.512.03.473.683.641.055.02607.02.64.220.740.490.124.25
SWE3139359.610.03.743.313.250.874.95616.03.53.991.131.000.254.11
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).
Table A5. Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 science.
CNT | N | Country Mean: M, rg, SE, ME, MEbc, ER, TE | Country Standard Deviation: M, rg, SE, ME, MEbc, ER, TE
AUS9864517.62.72.720.840.830.302.84104.93.41.750.650.580.331.84
AUT4577488.11.13.640.290.180.053.64105.72.22.910.630.530.182.96
BEL5938498.12.42.510.550.550.222.57106.72.41.980.610.570.292.06
CAN16,075519.60.71.810.150.090.051.8193.83.61.240.940.910.741.54
CHE8215509.21.03.010.400.350.123.0398.92.11.820.480.350.191.86
CZE4252494.12.93.431.000.980.293.5799.11.12.660.300.000.002.66
DEU3477513.92.13.080.550.530.173.12103.35.32.251.091.050.472.48
DNK4101488.34.72.621.921.890.723.2395.23.61.981.111.090.552.26
ESP17,876478.22.12.180.460.420.192.2287.94.01.641.000.970.591.90
EST3272517.51.02.750.310.230.082.7687.34.11.911.091.060.562.18
FIN4016546.53.52.480.840.790.322.6192.810.91.552.352.331.502.80
FRA2960488.23.73.911.101.020.264.04105.34.13.091.271.150.373.29
GBR8413505.01.12.780.360.280.102.79102.61.91.850.640.580.311.94
GRC3452461.44.54.101.261.260.314.2996.88.82.222.052.000.902.99
HUN3193494.65.03.461.431.360.393.7289.82.52.920.590.500.172.97
IRL2738497.01.03.310.360.270.083.3299.41.72.810.500.330.122.83
ISL2501487.63.32.011.091.090.542.2899.55.11.891.171.130.602.20
ITA21,344479.70.61.820.210.170.091.8399.15.91.491.201.200.811.91
JPN4222534.67.83.762.292.290.614.40106.710.33.152.722.690.854.14
KOR3451530.63.63.301.421.420.433.5986.97.91.932.412.341.213.04
LUX3195474.83.51.940.910.870.452.12107.96.51.531.631.581.032.20
NLD3323514.22.65.770.980.930.165.8599.74.72.321.201.110.482.57
NOR3204491.03.22.671.151.100.412.8893.23.21.650.810.740.451.81
POL3397499.63.12.720.730.700.262.8192.72.31.930.580.520.272.00
PRT4336483.44.43.060.890.880.293.1986.04.21.540.890.850.551.76
SWE3157487.31.52.850.390.340.122.87102.42.21.580.500.380.241.63
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).
Table A6. Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science.
CNT | N | Country 10th Percentile: M, rg, SE, ME, MEbc, ER, TE | Country 90th Percentile: M, rg, SE, ME, MEbc, ER, TE
AUS9864383.33.83.191.091.010.323.34650.312.74.072.852.760.684.92
AUT4577350.97.95.682.392.290.406.13621.73.84.291.050.900.214.38
BEL5938358.77.14.181.961.960.474.62632.911.52.892.282.250.783.66
CAN16,075398.51.42.590.510.360.142.62638.710.22.252.432.391.063.29
CHE8215379.43.43.951.110.940.244.06634.28.23.832.012.000.524.32
CZE4252366.86.15.491.491.360.255.66621.44.04.261.030.930.224.36
DEU3477379.82.34.870.830.360.074.88645.412.53.502.432.380.684.23
DNK4101366.86.53.641.981.920.534.12610.96.43.592.482.450.684.34
ESP17,876365.27.13.451.681.650.483.82590.33.82.490.980.900.362.65
EST3272404.33.14.051.060.990.244.16629.19.53.322.062.020.613.89
FIN4016426.88.93.412.112.070.613.98665.121.23.054.614.561.495.48
FRA2960349.714.16.263.653.430.557.14619.26.04.701.391.260.274.87
GBR8413372.56.53.561.741.690.473.94635.89.43.862.702.620.684.67
GRC3452336.720.46.024.524.430.747.47584.85.34.011.431.310.334.22
HUN3193378.82.56.410.820.370.066.42609.84.93.771.191.110.303.93
IRL2738370.37.35.602.081.930.345.93623.24.34.011.091.010.254.13
ISL2501357.810.43.772.562.480.664.51613.33.72.781.121.040.372.97
ITA21,344350.714.02.872.852.851.004.04605.72.22.130.610.570.272.21
JPN4222390.55.67.551.481.260.177.66663.427.83.406.976.942.047.72
KOR3451417.46.33.832.021.860.494.26639.916.74.454.914.911.106.63
LUX3195334.618.62.984.624.551.535.44612.21.62.670.490.180.072.68
NLD3323385.43.76.361.341.080.176.45642.411.25.482.522.440.456.00
NOR3204371.15.83.311.421.320.403.56611.33.53.581.231.070.303.74
POL3397380.63.73.810.930.860.223.90619.73.63.511.060.930.263.63
PRT4336373.75.43.671.311.170.323.85595.46.93.461.271.240.363.68
SWE3157355.510.43.422.272.200.644.07617.57.33.621.811.750.484.02
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).

Figure 1. Item response functions P_i (left panel) and locally optimal weights ν_i (right panel) for the 1PL, 1PCL and 1PLL models.
Figure 2. Item response functions P_i (left panel) and locally optimal weights ν_i (right panel) for the 4PL, 3PL and 2PL models.
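As a rough illustration of the item response functions plotted in the left panels of Figures 1 and 2, the R sketch below evaluates a four-parameter logistic IRF together with its 3PL and 2PL special cases. The item parameters are arbitrary values chosen for display only, not the parameters underlying the figures, and the locally optimal weights shown in the right panels are not reproduced here.

```r
# 4PL item response function: P(theta) = c + (d - c) / (1 + exp(-a * (theta - b))).
# The 3PL fixes d = 1, and the 2PL additionally fixes c = 0.
irf_4pl <- function(theta, a = 1, b = 0, c = 0, d = 1) {
  c + (d - c) * plogis(a * (theta - b))
}

theta <- seq(-4, 4, length.out = 201)
plot(theta, irf_4pl(theta, a = 1.2, b = 0), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = "P(theta)")                          # 2PL
lines(theta, irf_4pl(theta, a = 1.2, b = 0, c = 0.20), lty = 2)            # 3PL (guessing)
lines(theta, irf_4pl(theta, a = 1.2, b = 0, c = 0.20, d = 0.95), lty = 3)  # 4PL (slipping)
```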
Figure 3. Item response functions P_i (left panel) and locally optimal weights ν_i (right panel) for different IRFs of the 3PLRH model.
Figure 4. Dendrogram of the cluster analysis (Ward method) of the 11 different scaling models, based on a distance matrix defined as the average absolute difference between the country means of two models, for PISA 2009 reading data.
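A minimal R sketch of the analysis summarized in Figure 4, assuming that country_means is a hypothetical country-by-model matrix of country means (one column per scaling model). Whether the dendrogram was produced with hclust's ward.D or ward.D2 option is not stated here, so ward.D2 is an assumption.

```r
# Distance between two scaling models = average absolute difference of their
# country means; the dendrogram is obtained by hierarchical (Ward) clustering.
cluster_scaling_models <- function(country_means) {
  d <- dist(t(country_means), method = "manhattan") / nrow(country_means)
  hclust(d, method = "ward.D2")
}
# plot(cluster_scaling_models(country_means))   # dendrogram as in Figure 4
```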
Table 1. Model comparisons based on information criteria for the three ability domains—mathematics, reading and science—in PISA 2009.
Model | Mathematics: AIC, BIC, ΔGHP | Reading: AIC, BIC, ΔGHP | Science: AIC, BIC, ΔGHP
1PL | 217510, 217779, 0.0059 | 413555, 414317, 0.0055 | 347819, 348222, 0.0062
1PCL | 220022, 220291, 0.0122 | 414757, 415519, 0.0070 | 348756, 349160, 0.0077
1PLL | 216882, 217151, 0.0043 | 416988, 417751, 0.0098 | 348984, 349388, 0.0081
1PGL | 216784, 217068, 0.0041 | 413369, 414146, 0.0053 | 347804, 348223, 0.0062
2PL | 215621, 216144, 0.0012 | 410032, 411541, 0.0011 | 344597, 345389, 0.0009
4PGL | 215142, 216188, 0.0000 | 409163, 412182, 0.0000 | 344064, 345648, 0.0000
3PLQ | 215153, 215938, 0.0000 | 409327, 411591, 0.0002 | 344097, 345285, 0.0001
3PLRH | 215174, 215959, 0.0001 | 409275, 411539, 0.0001 | 344083, 345271, 0.0000
3PL | 215486, 216099, 0.0009 | 409767, 411605, 0.0008 | 344420, 345362, 0.0006
4PL | 215179, 216060, 0.0001 | 409296, 411852, 0.0002 | 344105, 345368, 0.0001
4PLQ | 215168, 216102, 0.0001 | 409245, 411913, 0.0001 | 344089, 345464, 0.0000
Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; ΔGHP = difference in the Gilula–Haberman penalty (GHP) between a particular model and the best-fitting model in terms of GHP. For model descriptions, see Section 2.1 and Equations (3) to (14). For AIC and BIC, the best-fitting model and models whose information criteria did not deviate from the minimum value by more than 100 are printed in bold. For ΔGHP, the model with the smallest value and models with ΔGHP values smaller than 0.0005 are printed in bold.
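The comparison in Table 1 can be tabulated with a few lines of R, assuming each fitted scaling model provides its log-likelihood and number of estimated parameters. The Gilula–Haberman penalty is computed here as the negative log-likelihood per observed item response, which is an assumption; the exact definition used in the paper is given in the main text.

```r
# Hedged sketch: AIC, BIC, and Gilula-Haberman penalty (GHP) differences for a
# set of fitted scaling models, given vectors of log-likelihoods and parameter counts.
compare_models <- function(logLik, npars, n_persons, n_responses,
                           models = paste0("M", seq_along(logLik))) {
  AIC <- -2 * logLik + 2 * npars
  BIC <- -2 * logLik + log(n_persons) * npars
  GHP <- -logLik / n_responses              # assumed GHP definition
  data.frame(model = models, AIC = AIC, BIC = BIC, dGHP = GHP - min(GHP))
}
```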
Table 2. Detailed results for all 11 different scaling models for country means in PISA 2009 reading.
CNT | M | rg | MEbc | 1PL | 1PCL | 1PLL | 1PGL | 2PL | 4PGL | 3PLQ | 3PLRH | 3PL | 4PL | 4PLQ
AUS515.21.250.29515.1515.8514.8515.2515.7515.2515.2515.5515.0515.0514.5
AUT470.82.360.65470.2469.6470.6470.1470.9472.0471.6471.7470.6471.6471.9
BEL509.52.910.78508.9507.8509.4508.8509.7510.7510.4510.5509.4510.7510.6
CAN525.01.790.43525.1525.6525.2525.1525.4524.3524.5524.8524.9524.0523.8
CHE501.71.270.39501.3501.3501.0501.4501.5502.3502.3502.2501.8502.3502.3
CZE479.90.890.27479.5480.2479.5479.6480.1480.0480.0479.8480.4480.1480.0
DEU498.51.830.39498.2499.3497.5498.5498.4499.0498.9498.9498.7498.8499.1
DNK493.75.461.58495.0497.3492.9495.6492.6491.9492.0491.8493.5492.1492.1
ESP480.11.430.43480.0480.7479.5480.1480.3479.6479.8479.6480.9479.7479.7
EST501.52.430.75501.2502.8500.4501.4502.0500.9501.0501.0502.8500.7500.8
FIN539.01.660.41539.0538.7539.2538.9538.7539.8539.2539.6538.4539.7540.1
FRA498.04.541.13497.4495.1499.0497.0497.7499.4499.4499.5497.7499.6499.3
GBR494.01.290.20494.0494.7493.4494.1494.0494.0494.1494.0494.2493.8493.8
GRC480.63.420.96481.7479.6482.8481.1480.3479.4480.0479.7480.0479.6479.6
HUN494.21.740.40494.4495.0493.8494.4494.5493.5493.6493.7494.3493.3493.4
IRL496.82.040.51496.5497.7495.7496.8497.4496.4496.6496.6497.5496.5496.4
ISL501.20.780.15501.3501.6501.5501.2501.3501.1500.8501.0500.8501.3501.2
ITA486.51.370.32486.3485.6486.6486.2486.8486.7487.0486.9486.6486.8486.9
JPN521.37.701.60522.3517.7525.4521.4520.4521.6521.0520.7519.8522.2522.2
KOR539.74.031.45541.3541.4541.5541.2538.7538.2538.5538.7538.5537.4537.6
LUX472.74.381.22471.7470.0473.0471.3473.2474.4474.2474.4472.5474.0474.2
NLD509.01.570.28509.1509.8508.2509.4508.6508.9509.1508.7508.8509.2509.1
NOR503.30.890.14503.3503.6503.7503.1503.2503.3503.2503.0503.3503.7503.9
POL501.72.240.72501.0501.2500.4501.3502.2502.0502.5502.2502.7502.2502.1
PRT489.22.790.70489.4490.8488.0489.8489.3488.3488.5488.4489.9488.3488.3
SWE497.00.340.00496.9497.0497.0496.9496.9497.2497.0497.1496.9497.1497.2
Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; rg = range of estimates across models; MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)). For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean across the 11 different models by more than 1 point are printed in bold.
Table 3. Results and model uncertainty of 11 different scaling models for country means and country standard deviations in PISA 2009 reading.
CNT | N | Country Mean: M, rg, SE, ME, MEbc, ER, TE | Country Standard Deviation: M, rg, SE, ME, MEbc, ER, TE
AUS14,247515.21.22.510.320.290.122.52104.72.61.450.680.640.441.59
AUT6585470.82.43.340.690.650.193.40104.66.82.161.661.640.762.71
BEL8500509.52.92.490.800.780.322.61107.53.11.920.690.650.342.02
CAN23,200525.01.81.490.450.430.291.5595.64.61.121.181.181.051.62
CHE11,801501.71.32.720.420.390.142.7599.70.81.670.230.000.001.67
CZE6059479.90.93.170.320.270.093.1895.21.31.860.390.200.111.87
DEU4975498.51.83.050.420.390.133.08100.11.32.010.300.000.002.01
DNK5920493.75.52.101.581.580.752.6388.03.51.310.700.680.521.48
ESP25,828480.11.42.120.440.430.202.1791.94.61.181.161.130.961.64
EST4726501.52.42.700.770.750.282.8085.53.81.710.850.820.481.89
FIN5807539.01.72.270.430.410.182.3091.59.81.312.682.682.052.98
FRA4280498.04.53.921.161.130.294.08112.21.82.920.550.410.142.95
GBR12,172494.01.32.470.250.200.082.4799.62.81.340.770.730.551.53
GRC4966480.63.44.261.010.960.234.3799.85.42.091.461.380.662.50
HUN4604494.21.73.620.460.400.113.6494.82.72.780.670.580.212.84
IRL3931496.82.03.240.550.510.163.2898.84.22.631.241.190.452.89
ISL3628501.20.81.670.230.150.091.68102.03.51.401.030.960.681.69
ITA30,905486.51.41.610.330.320.201.64101.43.71.350.810.770.571.55
JPN6082521.37.73.711.621.600.434.04107.38.03.161.591.520.483.50
KOR4989539.74.03.101.511.450.473.4284.28.41.762.232.021.152.68
LUX4622472.74.41.191.231.221.021.70109.38.01.212.011.991.652.33
NLD4760509.01.65.580.350.280.055.5995.14.11.891.121.010.542.14
NOR4660503.30.92.610.220.140.062.6196.83.71.550.980.930.601.81
POL4917501.72.22.720.720.720.262.8192.83.61.320.900.840.631.56
PRT6298489.22.83.170.710.700.223.2591.83.21.750.740.710.401.89
SWE4565497.00.33.000.090.000.003.00103.61.71.630.420.270.171.66
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).
Table 4. Results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 reading.
CNT | N | Country 10th Percentile: M, rg, SE, ME, MEbc, ER, TE | Country 90th Percentile: M, rg, SE, ME, MEbc, ER, TE
AUS14,247379.55.52.981.521.490.503.33646.811.23.333.103.040.914.51
AUT6585332.920.54.825.375.321.107.18602.84.83.641.261.070.303.79
BEL8500369.07.74.092.152.080.514.59644.716.82.784.244.241.525.07
CAN23,200400.84.92.401.421.410.592.78646.711.91.923.003.001.563.56
CHE11,801370.57.53.681.831.770.484.09627.710.93.363.113.090.924.56
CZE6059357.58.44.672.192.130.465.14603.36.23.181.581.530.483.53
DEU4975366.07.54.791.951.810.385.12624.49.22.732.642.580.953.76
DNK5920378.24.12.820.960.910.322.96604.04.72.571.451.430.562.94
ESP25,828359.08.73.242.182.120.663.87595.13.01.860.780.740.402.00
EST4726390.97.33.831.811.760.464.21610.76.23.171.501.460.463.49
FIN5807419.210.02.902.452.450.853.80653.321.62.665.755.752.166.34
FRA4280350.513.85.933.683.590.606.93638.616.34.923.883.820.786.23
GBR12,172365.99.93.002.572.570.863.95621.75.03.011.451.390.463.31
GRC4966350.516.26.243.513.290.537.05607.53.63.061.030.970.323.21
HUN4604368.67.06.081.561.400.236.24613.44.54.081.211.120.284.23
IRL3931370.09.65.612.452.380.436.09619.75.72.841.311.240.443.10
ISL3628366.36.02.671.401.280.482.96628.211.22.332.842.761.183.62
ITA30,905352.412.22.652.672.651.003.75613.77.71.862.012.001.072.73
JPN6082381.04.87.461.171.010.147.52652.925.93.395.735.671.686.60
KOR4989430.513.84.183.533.310.795.33644.514.73.513.683.601.025.03
LUX4622328.324.52.426.366.312.616.76609.85.91.831.631.550.852.40
NLD4760386.83.55.840.910.730.135.89632.712.95.353.473.360.636.31
NOR4660377.13.53.470.850.770.223.55625.713.73.283.453.451.054.76
POL4917381.95.03.251.251.240.383.48620.512.83.183.463.431.084.68
PRT6298369.96.64.511.431.340.304.70606.83.53.200.830.740.233.29
SWE4565363.18.83.972.192.130.544.51627.67.93.602.132.060.574.15
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as MEbc/SE; TE = total error computed by TE = √(SE² + MEbc²) (see Equation (24)).
Table 5. Sensitivity analysis for country means and country standard deviations for original and uniform model weighting for PISA 2009 reading.
CNT | Country Mean: M (W1), M (W2), SE (W1), SE (W2), MEbc (W1), MEbc (W2) | Country Standard Deviation: M (W1), M (W2), SE (W1), SE (W2), MEbc (W1), MEbc (W2)
AUS515.2515.22.512.510.290.33104.7104.71.451.460.640.74
AUT470.8471.03.343.330.650.74104.6104.32.162.181.641.90
BEL509.5509.72.492.490.780.90107.5107.61.921.910.650.74
CAN525.0524.81.491.490.430.5395.695.81.121.131.181.34
CHE501.7501.82.722.730.390.4399.799.71.671.680.000.00
CZE479.9479.93.173.160.270.2095.295.21.861.860.200.15
DEU498.5498.73.053.040.390.44100.1100.12.012.000.000.03
DNK493.7493.42.102.101.581.7588.087.81.311.330.680.84
ESP480.1480.02.122.110.430.4491.991.51.181.161.131.34
EST501.5501.42.702.700.750.7785.585.31.711.720.820.99
FIN539.0539.22.272.310.410.4691.592.41.311.312.683.14
FRA498.0498.33.923.931.131.35112.2112.12.922.920.410.49
GBR494.0494.02.472.470.200.2599.699.41.341.350.730.82
GRC480.6480.34.264.230.961.0099.899.42.092.061.381.55
HUN494.2494.03.623.610.400.4794.894.62.782.780.580.66
IRL496.8496.73.243.210.510.5298.898.32.632.601.191.38
ISL501.2501.21.671.680.150.14102.0102.31.401.410.961.07
ITA486.5486.61.611.610.320.36101.4101.51.351.340.770.87
JPN521.3521.33.713.711.601.79107.3107.73.163.161.521.96
KOR539.7539.43.103.131.451.4884.284.71.761.782.022.33
LUX472.7473.01.191.191.221.38109.3108.91.211.231.992.29
NLD509.0509.05.585.620.280.3295.195.51.891.901.011.17
NOR503.3503.42.612.630.140.2096.897.21.551.560.931.14
POL501.7501.82.722.730.720.6792.893.01.321.340.840.96
PRT489.2489.03.173.160.700.8391.891.51.751.740.710.88
SWE497.0497.03.003.000.000.00103.6103.41.631.640.270.32
Note. CNT = country label (see Appendix B); M = weighted mean across different scaling models; SE = standard error (computed with balanced half sampling); MEbc = bias-corrected estimate of model error based on balanced half sampling (see Equation (23)); W1 = model weighting used in the main analysis (see Section 4.2 and results in other tables); W2 = uniform weighting of models.
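The W1/W2 comparison amounts to re-averaging the 11 model-specific estimates with different weight vectors. A small R illustration with the AUS reading means from Table 2; the W1 weights themselves are defined in Section 4.2 and are not reproduced here, so only the uniform weighting (W2) is evaluated.

```r
# Weighted average of the model-specific country means for a given weight vector.
weighted_country_mean <- function(estimates, weights) {
  sum(weights * estimates) / sum(weights)
}

aus_reading <- c(515.1, 515.8, 514.8, 515.2, 515.7, 515.2,
                 515.2, 515.5, 515.0, 515.0, 514.5)           # Table 2, AUS
weighted_country_mean(aus_reading, weights = rep(1, 11))      # uniform weighting (W2), about 515.2
```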