1. Introduction
A number of psychological measurements produce item scores that consist of a count of events in a fixed amount of time. For example, verbal fluency tasks may ask participants to generate as many words as possible that start with a specific letter. Some measurement procedures may involve examinees making multiple attempts (e.g., to solve math problems, to read words) in a given amount of time; we may then count the number of successful (or unsuccessful) attempts. In the field of creativity research, divergent thinking tasks, which generally ask examinees to produce as many creative ideas as possible in response to a given prompt, are often scored (at least partly) based on fluency, which refers to the count of ideas generated in response to the prompt. For example, divergent thinking tasks may ask participants to generate as many uses as possible for an object (e.g., a brick) (
Torrance 2008) or to find as many different explanations as possible for a social situation (
Preckel et al. 2011). Beyond the measurement of divergent thinking ability in general, the same kind of test has been adapted to more specific domains. For example, one may attempt to measure managerial divergent thinking by asking examinees to find as many ideas as possible to strengthen cohesion within a team (
Myszkowski et al. 2015), or to measure engineering design divergent thinking by asking examinees to find as many design ideas as possible for a given problem (e.g., design an object that produces sounds) (
Charyton and Merrill 2009).
The tasks given as examples here therefore produce scores which are discrete and have a lower bound at 0, which generally means that they are right-skewed when low counts are expected. Also, when higher scores are expected, they tend to be more variable, which implies a mean–variance relationship. These characteristics make them unsuitable for traditional factor analysis, which generally assumes continuous, normally distributed variables with fixed variance. Consequently, item response models that assume scores to follow different probability distributions, such as the Poisson distribution, have been suggested as more appropriate for divergent thinking tasks (
Forthmann et al. 2016;
Myszkowski 2024;
Myszkowski and Storme 2021), as well as for other tasks that produce counts (
Doebler and Holling 2016;
Jansen 1986;
Meredith 1971;
Rasch 1960;
Spray 1990).
1.1. The Rasch Poisson Counts Model
The original item response theory model developed for count item responses is the Rasch Poisson counts model (RPCM;
Rasch 1960). It defines the fluency score $X_{ij}$ for person $i$ (where $i = 1, \dots, N$) on item $j$ (where $j = 1, \dots, J$) as following a Poisson distribution of rate $\lambda_{ij}$:

$$X_{ij} \sim \mathrm{Poisson}(\lambda_{ij}) \quad (1)$$
In the Poisson distribution, the rate parameter $\lambda_{ij}$ is equal to both the mean and the variance of the distribution. This equality of the mean and the variance is a characteristic of the Poisson distribution known as equidispersion. In the RPCM, the rate parameter is modeled as a function of an item easiness parameter $b_j$, a person ability parameter $\theta_i$ and a general discrimination parameter $a$, as follows:

$$\lambda_{ij} = \exp(b_j + a\,\theta_i) \quad (2)$$
The use of the exponential function as the inverse link defines this model as a log-linear model (
Doebler et al. 2014;
Mellenbergh 1994;
Myszkowski 2024), although this has also been referred to as a multiplicative model (e.g.,
Jansen 1986). The absence of other item scores in the response function reflects that, as is usual in measurement models, it is assumed that scores are conditionally independent, meaning that they are not related over and beyond $\theta_i$. This assumption, commonly referred to in the IRT literature as local independence (
Chen and Thissen 1997), can be violated in various tests, including divergent thinking tests. For example,
Myszkowski and Storme (
2021) found that this was the case when there are similarities between certain prompts in a test (e.g., all tasks that are alternate use tasks in a broader set of tasks) that are not accounted for (using, for example, a bifactor/testlet model).
1.2. The Need for Item-Specific Discrimination Parameters
In tasks like verbal fluency tasks, one may expect (or voluntarily assume) that the number of responses only depends on a person’s latent ability and the difficulty of the prompt. For example, we would expect that generating words starting with the letters “A” or “Z” are prompts that vary in difficulty, but not in how strongly they are related to the underlying ability of verbal fluency. Certainly, in such contexts, the RPCM may be a reasonable model.
But, in other tasks, such as divergent thinking tasks, as illustrated by
Myszkowski and Storme (
2021), it may be suspected that different items could tap into (for example) different domains of expertise or cognitive abilities. For example, generating alternate uses of a knife could be related to expertise in cooking, while generating uses of a brick could be related to expertise in construction. In addition, different items could reflect more or less the underlying ability of interest due to nuisance factors (e.g., typing/writing speed, social inhibition). For example, high social inhibition may prevent certain ideas from being produced for alternate uses of certain objects (e.g., a knife) as opposed to others (e.g., a tire). As a consequence, because they may be confounded by different item-specific factors, it has been suggested that measurement models applied to divergent thinking fluency scores should allow for the possibility that, even accounting for item difficulty, different items may be more or less sensitive to variation in the underlying ability (latent fluency).
1.3. The 2-Parameter Poisson Counts Model (2PPCM)
The 2-parameter Poisson counts model (2PPCM;
Myszkowski and Storme 2021) was therefore proposed as a generalization of the RPCM, which allows for item-specific discrimination parameters. In the original paper, a re-analysis of a dataset containing various divergent thinking item responses (
Silvia 2008a,
2008b,
2008c;
Silvia et al. 2008) indicated that fluency tests may be more accurately modeled using the 2PPCM than the RPCM, even though, at this stage, further research is needed to confirm whether the 2PPCM is generally more accurate than the RPCM in divergent thinking tasks.
Like the Rasch Poisson counts model, the 2-parameter Poisson counts model (2PPCM) defines the fluency score $X_{ij}$ for a participant $i$ on item $j$ as follows:

$$X_{ij} \sim \mathrm{Poisson}(\lambda_{ij}) \quad (3)$$

but, unlike the RPCM, the rate distributional parameter is modeled using a second item parameter, which is a discrimination parameter $a_j$:

$$\lambda_{ij} = \exp(b_j + a_j\,\theta_i) \quad (4)$$

It can be seen that the 2PPCM is a generalization of the Rasch Poisson counts model (RPCM), which is obtained when all $a_j$ are constrained to be equal.
In the psychometric literature, the former parametrization is typically referred to as a slope–intercept parametrization. Alternatively, we may reparametrize the model in a more traditional IRT parametrization, which focuses on the distance between the difficulty parameter $d_j$ and the ability parameter $\theta_i$:

$$\lambda_{ij} = \exp\big(a_j\,(\theta_i - d_j)\big) \quad (5)$$
A similar reparametrization can be used for the RPCM. Because it is more convenient to use a slope–intercept parametrization when using a regression framework, we will use it in the following sections. In other words, the rest of this paper uses Equation (
4) as the model equation for the 2PPCM, not Equation (
5).
1.4. Model Identification
Model identification for the 2PPCM is subject to the same constraints and solutions as other item response models with variable slope parameters. More specifically, in order to identify the 2PPCM, we generally choose to fix the latent variance (to 1, typically), which is often referred to as the variance standardization method. Because it has practical advantages when studying psychometric instruments, such as easily producing person location estimates on a standard normal (i.e.,
z) scale, facilitating the interpretation of discrimination parameters for all items, or easily interpreting person covariate estimates in latent regression models (by standardizing the covariate, we directly obtain standardized regression coefficients), this is often the preferred method in IRT. Although it may not be preferred for all modeling scenarios (e.g., using anchor items in differential item functioning), it is the default in popular IRT packages like
ltm (
Rizopoulos 2006) and
mirt (
Chalmers 2012).
An alternative which is more frequently the default in structural equation modeling software—such as
lavaan (
Rosseel 2012)—is to fix one of the slope parameters $a_j$ (to 1, typically), while the variance of $\theta$ is freely estimated. This is known as the marker method. We can note that, in the RPCM, fixing any $a_j$ to 1 implies that we are fixing all slopes to 1 (since the slopes are constrained to be equal).
1.5. Software
The 2PPCM can be estimated using generalized structural equation modeling (GSEM) software that accept Poisson distributions and logarithm link functions, such as
Mplus or
Stata, and was originally presented using this type of environment (
Myszkowski and Storme 2021). Unfortunately, to the best of our knowledge, there is no open-source GSEM package that supports Poisson distributions and logarithm link functions. An alternative is to estimate models using generalized linear mixed model (GLMM) software, such as the
R package
lme4 (4.4.2) (
Bates et al. 2015), which can be used to estimate the RPCM (see
Baghaei and Doebler 2019, for a tutorial). However, GLMM software does not allow for the estimation of the 2PPCM, because it does not allow for item-specific discrimination parameters. A notable exception is the
R package
PLMixed (
Jeon and Rockwood 2018), which extends
lme4 to allow for discrimination parameters like in the 2PPCM.
We shall note that different software environments impose constraints on the identification method. While GSEM software (e.g.,
Mplus) typically allow both identification methods, frequentist multilevel estimation software often do not. For example,
lme4 (
Bates et al. 2015)—which can be used for the RPCM (
Baghaei and Doebler 2019)—and its extension for factor structures
PLMixed (
Jeon and Rockwood 2018) do not allow for fixing the latent variance and therefore limit the identification to the marker method.
Because recent research has highlighted that Bayesian estimation through
Stan (
Carpenter et al. 2017) and the
brms package (
Bürkner 2017) allows a number of interesting flexibilities for item response theory analysis (
Bürkner 2020,
2021), we propose to explore its capacities in the analysis of count data using Poisson models. In other words, in this paper, we discuss the use of
brms and
Stan to estimate the 2PPCM (and, by extension, the RPCM) on a divergent thinking dataset.
1.6. Bayesian Item Response Theory Using brms and Stan
Although more extensive overviews of Bayesian item response theory (IRT) models using
Stan and
brms have been published (
Bürkner 2020,
2021), we will provide a brief overview of the approach here.
brms is an R package that provides an interface for R users to Stan (
Carpenter et al. 2017), which is a probabilistic programming language that allows for the estimation of various Bayesian models using Hamiltonian Monte Carlo (HMC) sampling (
Neal 2011), a type of Markov Chain Monte Carlo (MCMC) sampling.
brms allows for the use of a regression-type syntax similar to the one used in the popular frequentist package
lme4 (
Bates et al. 2015) (and more generally, to the
formula argument in
R functions), which makes it substantially easier to use for researchers familiar with this package compared to using
Stan directly. Many different response distributions can be used with various link functions, including, respectively, the Poisson distribution and the logarithm link function, which are, of course, particularly relevant here.
In a Bayesian framework, we specify a prior distribution for the parameters of the model, which is then updated using the data to obtain a posterior distribution. This posterior distribution is then used to make inferences about the parameters of the model. The principal downside of Bayesian estimation is that it is computationally intensive, and can be slow for complex models with many parameters. However, recent advances in Hamiltonian Monte Carlo sampling have made Bayesian estimation more efficient and accessible.
In the context of IRT, Bayesian estimation offers several advantages. One key benefit is the ability to set prior distributions on parameters, which can serve multiple purposes. First, priors can incorporate prior knowledge or beliefs about item (or person) parameters, which may be valuable when data are limited or when drawing on previous research findings. Second, priors can aid in convergence by stabilizing estimates, especially in complex models with many parameters. Finally, priors can act as a form of regularization, shrinking parameter estimates toward more “reasonable” values, thereby mitigating issues like over-fitting. Moreover, Bayesian IRT enables a straightforward interpretation of uncertainty through posterior distributions, providing richer information than point estimates alone. Rather than focusing solely on parameter estimates, researchers can examine the full distribution of possible values for each parameter, which helps in understanding the precision of estimates and the credibility of model-based inferences. A more thorough discussion of the benefits of Bayesian IRT in general can be found in
Fox (
2010), while a more specific discussion of Bayesian estimation in
brms in IRT can be found in
Bürkner (
2021). In addition, an example tutorial for estimating binary logistic response models can be found in
Bürkner (
2020).
1.7. Aim of the Present Paper
In this paper, we show how the 2PPCM can be estimated in a Bayesian multilevel regression framework and interpreted using brms. We will illustrate this using the example dataset provided for the special issue, which contains fluency scores for 3 divergent thinking tasks (i.e., 3 items) and 202 respondents. We will discuss model specification, estimation, convergence, fit and comparisons. Furthermore, we will provide instructions on plotting item response functions and item information functions, comparing models, diagnosing model fit, checking equidispersion, calculating reliability, and extracting factor scores. Although we limit ourselves to the core components of IRT analysis and do not address all possible topics (e.g., differential item functioning, explanatory IRT, other counts distributions), we hope that this paper will provide a useful starting point for researchers interested in using Bayesian estimation for IRT models in the context of divergent thinking tasks.
In the paper itself, we present only the most critical aspects of the code (e.g., we do not show how to customize plots). This is both to keep the paper (relatively) concise and to minimize the risk of it becoming obsolete as packages evolve. The full code used, including the data, is available on the Open Science Framework (OSF) at
https://osf.io/z8r7v/ (accessed on 14 February 2025). Because it may evolve and depends on operating system characteristics, we defer to online tutorials for the installation of
brms and
Stan. Currently, links to install the
brms package and its necessary components can be found at
https://github.com/paul-buerkner/brms (accessed on 14 February 2025).
2. Model Estimation
2.1. Data Preparation
We used the dataset provided for the special issue, which has been presented in previous research (
Forthmann et al. 2019;
Forthmann and Doebler 2022;
Forthmann et al. 2020). The participants were prompted to generate as many uses as possible for a rope (item 1), a paperclip (item 2) and a garbage bag (item 3); for the purpose of this paper, we analyzed only the part that contains the fluency scores. For convenience, the OSF repository contains the subset of the dataset that was used in the analysis. The item responses were pivoted to a long format (i.e., one row per participant per item). The data used throughout are called
data_long, and contain a variable with the subject identifier (
Person), the item identifier (
Item), and the fluency score (
Score). All cases have fluency scores for all items, except for one person who only has a score for the paperclip item; this case was kept in the analysis. To note, a wide format of the data (data_wide) is also provided in the OSF repository, as it was used to produce the Mplus analysis used as a benchmark.
2.2. Loading Libraries
We first load the brms library with
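```r
# Load brms, which interfaces Stan for Bayesian model estimation
library(brms)
```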
Throughout, we will also use the
dplyr library (
Wickham et al. 2023), notably to filter data frames conveniently:
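```r
# Load dplyr for convenient data manipulation (e.g., filtering data frames)
library(dplyr)
```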
2.3. Model Specification
The distributional assumption and the item response function of the 2PPCM, respectively, presented in Equations (
3) and (
4), can be specified in
brms using the
bf() function:
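A sketch of this specification, following the elements described below, could be the following (the object name formula_2PPCM is assumed; its RPCM counterpart, formula_RPCM, is defined later):

```r
# Assumed reconstruction of the 2PPCM specification
formula_2PPCM <- bf(
  Score ~ 0 + easiness + slope * theta,  # item response function (on the log scale)
  easiness ~ 0 + Item,                   # item-specific easiness parameters
  slope ~ 0 + Item,                      # item-specific discrimination (slope) parameters
  theta ~ 0 + (1 | Person),              # person ability as a random intercept by Person
  nl = TRUE,                             # declare the model as nonlinear
  family = poisson(link = "log")         # Poisson responses with a log link
)
```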
The formula is defined in a similar way to the formula used in the
lme4 package, with the response variable on the left side of the tilde and the predictors on the right side. In the first part of the formula, we define the outcome variable (
Score) as a function of the item and person parameters with
Score ∼ 0 + easiness + slope * theta, which is a direct equivalent of the item response function of the 2PPCM as defined in Equation (
4), except that it is not exponentiated, because the logarithm is later defined as a link function. The intercept is omitted with the
0 +, as is typical in IRT (this allows for item parameter estimates to correspond to item locations). To note, the two item parameters and person parameter in this first part of the formula are not observed in the dataset, but we define them as being predicted by variables in the dataset in the next lines of code.
The latent variable $\theta_i$ is defined using theta ∼ 0 + (1 | Person), which defines it as a random intercept (i.e., location) grouped by the variable Person in the dataset. Again, 0 + is used, which implies that the population mean of $\theta$ is set to 0. Both the easiness and the slope parameters are defined using fixed effects of the
Item variable, using, respectively,
easiness ∼ 0 + Item and
slope ∼ 0 + Item. Alternatively, they may be defined as random effects, although this is not common practice in IRT; see
Bürkner (
2021) for an example of how to do this. Finally, we declare the model as nonlinear with
nl = TRUE and specify that the outcome variable follows a Poisson distribution with a log link function using
family = poisson(link = "log"). We summarize the arguments of the
bf() function in
Table 1.
To note, although there are less verbose ways to specify it, it is easy to reuse the formula for the RPCM by fixing the slope parameter to be constant across items with slope ∼ 1:
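For example (the object name formula_RPCM is the one referred to later when estimating the RPCM):

```r
# RPCM: same specification, but with a single slope shared by all items
formula_RPCM <- bf(
  Score ~ 0 + easiness + slope * theta,
  easiness ~ 0 + Item,
  slope ~ 1,                             # constant slope across items (Rasch constraint)
  theta ~ 0 + (1 | Person),
  nl = TRUE,
  family = poisson(link = "log")
)
```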
2.4. Setting Prior Distributions
Before setting priors, it is advisable to inspect what default priors are used by brms for the model. This can be carried out using
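One way to do this is with the get_prior() function:

```r
# List the default priors implied by the 2PPCM formula and the data
get_prior(formula_2PPCM, data = data_long)
```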
Currently, non-informative flat priors are used by default for all item parameters (easiness and slopes), and truncated Student’s t priors with a lower bound at 0, 3 degrees of freedom and a scale of 2.5 are used for the standard deviation of the Person random effect (i.e., the standard deviation of $\theta$). Because we need to identify the model and have chosen to do so using the variance standardization method, it is necessary to fix the standard deviation of $\theta$ to a constant (1) instead. We can do this using the prior() function:
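A minimal sketch of this step (the object name priors_2PPCM is assumed) uses a constant() prior on the standard deviation of the Person random effect:

```r
# Fix the SD of theta (the Person random intercept) to 1:
# this implements the variance standardization method of identification
priors_2PPCM <- prior(constant(1), class = "sd", group = "Person", nlpar = "theta")
```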
Priors may be set on easiness and slope parameters to stabilize the model, incorporate prior knowledge or regularize parameters. When doing this, one must keep in mind the logarithmic link function, which implies that the scale of the priors is not the same as for linear models. More specifically, the exponentiated easiness parameter corresponds to the expected count for someone of average ability ($\theta = 0$), and the exponentiated slope parameter corresponds to the multiplicative effect of a one-standard-deviation increase in $\theta$ on the expected count. In our experimentations, we found that using flat priors on the easiness parameters was not problematic for model estimation, but that informative priors on the slope parameters were necessary to stabilize the model in some cases, including in the dataset at hand. Although it was only necessary for slope parameters, we discuss priors for both item parameters.
2.4.1. Informative Priors for Easiness Parameters
Since the exponentiated easiness parameter corresponds to the expected count for someone of average ability, to help the model converge, a useful prior for easiness would be a distribution that spans (on a log scale) the (hypothesized or observed) range of observed counts. As a weakly informative prior, for example here, since scores range between 1 and 23 (thus a range of 22), we could use a normal distribution, truncated at 0, with a mean of 2.48 (which corresponds to the log of the midpoint) and a standard deviation corresponding to (the log of) half the range (log(11) ≈ 2.40). We can add this new prior with the following:
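A sketch of such a prior, added to the previously defined priors_2PPCM object, could be:

```r
# Weakly informative truncated-normal prior on the easiness parameters:
# mean = log of the score midpoint (log(12) ≈ 2.48), SD = log of half the range (log(11) ≈ 2.40)
priors_2PPCM <- priors_2PPCM +
  set_prior(paste0("normal(", log(12), ", ", log(11), ")"),
            class = "b", nlpar = "easiness", lb = 0)
```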
The first line of the prior() function generates the string that defines the distribution (lb = 0 is used to truncate the distribution), while the rest describes which prior is being modified.
2.4.2. Informative Priors for Slope Parameters
We propose as a weakly informative prior for the slope parameters a normal distribution (truncated to positive values) with a mean of 0 and a standard deviation that corresponds to some expectation for the (log) maximum slope. In our example, we can hypothesize that a one-standard-deviation increase in $\theta$ would at most quadruple the number of ideas in this context (i.e., a 4-fold increase), so we set the standard deviation of the prior to log(4) (≈1.39). We can add this new prior with the following:
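For example:

```r
# Weakly informative half-normal prior on the slope parameters,
# with SD = log(4) ≈ 1.39 (an at-most 4-fold increase per SD of theta)
priors_2PPCM <- priors_2PPCM +
  set_prior(paste0("normal(0, ", log(4), ")"),
            class = "b", nlpar = "slope", lb = 0)
```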
To note, although it is generally not a good idea to set priors based on the data (because doing so risks over-fitting), a possible alternative, which was not carried out here, could be to first estimate an RPCM on the data to obtain a common estimate for the discrimination parameter $a$, and to use this estimate as the mean of the prior for the slope parameters in the 2PPCM.
2.4.3. Notes on Using the Marker Method of Identification
In this paper, we chose to use the variance standardization method of identification, which is the most common method in IRT. To use the marker method of identification, one would need to fix one of the slope parameters (usually the first) to a constant (usually 1).
Because the first item slope is fixed to 1, we expect slopes to be close to 1 for all the other items, as long as the test is homogeneous (i.e., similar slopes for all items). In this case, because the items are all alternate uses tasks taken in similar conditions, we expect it to be the case. We can thus use as a weakly informative prior for the other slope parameters a normal distribution with a mean of 1 and a standard deviation that corresponds to some expectation for the range of slopes. Larger ranges shall be expected when items are more heterogeneous. We would suggest 1 as a default choice for the standard deviation of the slopes, which can be implemented with the following:
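A sketch of this identification strategy could be the following (the object name priors_marker and the coefficient label Item1 are assumed and depend on how the Item factor is coded):

```r
# Marker method: fix the first item's slope to 1 and place a normal(1, 1)
# prior on the remaining slope parameters; the SD of theta is left free
priors_marker <-
  prior(constant(1), class = "b", coef = "Item1", nlpar = "slope") +
  prior(normal(1, 1), class = "b", nlpar = "slope", lb = 0)
```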
Finally, as with the variance standardization method, the easiness parameter corresponds to the expected count for someone of average ability ($\theta = 0$). Thus, the same distributions may be used to specify weakly informative priors for the easiness parameters:
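For example:

```r
# Same weakly informative prior on the easiness parameters as before
priors_marker <- priors_marker +
  set_prior(paste0("normal(", log(12), ", ", log(11), ")"),
            class = "b", nlpar = "easiness", lb = 0)
```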
Although this is only one example, these priors led to successful model estimation (per the convergence inspection methods later described) with the marker method in the present dataset.
2.5. Model Estimation
We can now estimate the model (with the variance standardization method) using the brm() function:
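A sketch of this call could look as follows (the numbers of iterations and chains and the seed are illustrative, not necessarily those used in the original analysis):

```r
fit_2PPCM <- brm(
  formula = formula_2PPCM,
  data = data_long,
  prior = priors_2PPCM,
  iter = 4000, warmup = 1000,          # iterations per chain, including warmup
  chains = 4, cores = 4,
  seed = 2025,                         # for reproducibility
  save_pars = save_pars(all = TRUE)    # keep all parameters (needed for, e.g., bridge sampling)
)
```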
In this function, we specify the formula, the data, the priors, the number of iterations (including warmup), the number of chains, the number of cores, and a seed for reproducibility. We also specify that we want to save all parameters using the save_pars argument (this is useful for certain methods used on brmsfit objects). The model will then be estimated using Hamiltonian Monte Carlo sampling, and the results will be stored in the fit_2PPCM object. The same code can be used to estimate the RPCM, by changing the formula to the formula_RPCM defined earlier. We also estimated it at this stage to later illustrate model comparison possibilities.
3. Post-Estimation Analyses
3.1. Model Summary
After estimation, a first step that most researchers would take is to inspect the model summary. This can be carried out using the summary() method:
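```r
summary(fit_2PPCM)
```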
The parameter section of the output is presented below. For users not familiar with Bayesian estimation, it is important to note that, contrary to maximum likelihood estimation, the actual outcome of the estimation for each parameter is a distribution (which is the estimated posterior distribution), not (directly) a point estimate. By default, the summary method provides as a point estimate the mean of the posterior distribution. The estimate error is the standard deviation of the posterior distribution, and the 95% credible intervals are the 2.5% and 97.5% quantiles of the posterior distribution. Finally, some convergence diagnostics are presented for each parameter, which we will discuss in the next section.
[Parameter summary output of the 2PPCM]
The first part is not informative here, because it presents the value of the standard deviation of the latent variable $\theta$, which was fixed for identification (hence the convergence diagnostics being NA and the estimate error being 0). Note that, if the marker method of identification had been used, the standard deviation of $\theta$ would be estimated and presented here.
The second part presents the fixed effects, which are the item parameters. The estimates (as well as their error and credible intervals) are presented on the log scale (not on the scale of the count responses). Easiness parameters are interpreted as the expected log count for someone of average ability ($\theta = 0$). Slope parameters are interpreted as the expected change in the log count for a one-standard-deviation increase in $\theta$. In log-linear models, it is common to exponentiate the estimates when one wants to interpret them. This can be carried out with the following:
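```r
# Exponentiate the item parameter estimates (and interval bounds)
exp(fixef(fit_2PPCM))
```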
The exponentiated easiness parameters are interpreted as the expected count for someone of average ability ($\theta = 0$), while the exponentiated slope parameters are interpreted as the multiplicative effect of a one-standard-deviation increase in $\theta$ on the expected count. For example, for item 1 (rope), the expected count of ideas produced by someone of average ability is , and a one-standard-deviation increase in $\theta$ multiplies the expected count by .
3.2. Inspecting Model Convergence
In general,
brms will output warnings if the model does not converge well. In addition, the
summary() method already provides some convergence diagnostics. First, because in MCMC (including HRC) sampling, we generally use multiple estimation chains (in this example, 4), it is important to check that the chains have converged to the same distribution for each parameter. When it is the case, we say that the chains have mixed well. A commonly reported convergence diagnostic is the Gelman–Rubin statistic (
$\hat{R}$;
Gelman and Rubin 1992), which is a measure of how well the different chains mixed (i.e., how similar the posterior distributions are across chains). It is expected to be close to 1.00 for good convergence across chains. In addition, we also want to make sure that the posterior distribution, which consists, for a parameter, of the values of the parameter for each (post-warmup) iteration of the chain, consists of a large number of independent samples. In other words, we want to make sure that the posterior distribution is not too auto-correlated. This can be assessed by looking at the effective sample size (ESS), which is a measure of the number of independent samples in the posterior distribution. The ESS is expected to be large (e.g., above 400) for good convergence. Bulk ESS values evaluate how well the chains are exploring the bulk of the posterior distribution, while tail ESS values evaluate how well the chains are exploring the tails of the distribution. In general, both ESS values should be above 400 (
Vehtari et al. 2021) for all parameters.
Both the
$\hat{R}$ and ESS values are presented in the summary output shown in
Section 3.1. In addition, during estimation, the number of divergent transitions (ideally 0) is reported. A divergent transition occurs when the HMC algorithm encounters difficulties navigating the parameter space. This usually happens when the model is poorly specified, the priors are too restrictive, or the parameters are strongly correlated, leading to regions of the posterior distribution that are hard to explore accurately. In the present examples, we found that the 2PPCM converged well, with $\hat{R}$ values close to 1.00, ESS values well above 400, and no divergent transitions.
Among graphical methods to investigate convergence, it is common to use a trace plot, which shows the value of the parameter at each iteration of the chain, as well as a histogram or density plot of the posterior distribution in each chain. The
mcmc_plot() method uses the
bayesplot package (
Gabry et al. 2019) to easily produce these plots (here, for the item parameters):
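For example (the regular expression used to select the item parameters is assumed):

```r
# Trace plots and posterior histograms for the six item (fixed-effect) parameters
mcmc_plot(fit_2PPCM, type = "trace", variable = "^b_", regex = TRUE)
mcmc_plot(fit_2PPCM, type = "hist", variable = "^b_", regex = TRUE)
```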
We present in
Figure 1 and
Figure 2, respectively, the trace plots and posterior distributions for the six item parameters. We can see that the chains mixed well, and that the posterior distributions are well behaved. We can also see that the chains explored the tails of the distribution well, as the histograms are well shaped.
In addition to the
brms and
bayesplot functions and methods, the package
shinystan (
Gabry and Veen 2022) provides an interactive and convenient way to quickly explore and diagnose potential convergence issues.
If the model does not converge well, it is generally advisable (besides verifying that the model was correctly specified) to try increasing the number of iterations, increasing the number of chains, or using less restrictive priors. For divergent transitions, it is advisable to increase adapt_delta, which is a tuning parameter for the HMC algorithm that controls how the step size is adapted during sampling. A higher value (e.g., 0.95) makes the HMC sampler more conservative, which increases computation time but can help reduce the number of divergent transitions. This can be carried out by using control = list(adapt_delta = 0.95) in the brm() function.
3.3. Extracting Item Parameter Estimates
Item parameter point estimates are already reported when using the summary() method, but they can also be more directly accessed (along with estimate errors) using the fixef() method:
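```r
# Item parameter point estimates, estimate errors, and credible intervals
fixef(fit_2PPCM)
```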
As a benchmark, we also estimated the 2PPCM using
Mplus (which used maximum likelihood estimation), following the code of the original paper presenting the model (
Myszkowski and Storme 2021). The item estimates obtained with
brms and
Mplus are presented for comparison in
Figure 3. We can see that the estimates were very similar, and so were the credible/confidence intervals. This suggests that the 2PPCM can be estimated using
brms with results that are very similar to those obtained using maximum likelihood estimation. However, it should be noted that this similarity depends on the priors used in
brms. The priors in this example application were purposely weakly informative, and used in order to help model estimation (i.e., they were not used to convey any prior belief about the instrument). More informative priors would lead to different estimates, and thus to potentially less similarity with maximum likelihood methods.
3.4. Comparing Models
Model comparisons can be made using a number of methods. Recommended methods include leave-one-out (LOO) cross-validation (
Vehtari et al. 2017) and the widely applicable information criterion (WAIC) (
Watanabe 2010), both implemented in
brms using the
loo package (
Vehtari et al. 2017). They can be obtained using the following:
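For example, assuming the RPCM fit is stored in an object called fit_RPCM:

```r
set.seed(2025)                                # for reproducible results
loo(fit_2PPCM, fit_RPCM, compare = TRUE)      # LOO cross-validation
waic(fit_2PPCM, fit_RPCM, compare = TRUE)     # widely applicable information criterion
```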
Since both of these methods involve random sampling, it is advisable to set a random seed to obtain reproducible results. Both of these methods provide an estimate of the out-of-sample predictive accuracy of the model through their expected log predictive density (ELPD). A higher ELPD indicates better out-of-sample predictive accuracy. The difference in ELPD between two models is provided when the compare = TRUE argument is used.
In this example dataset, we found that the 2PPCM had a lower ELPD than the RPCM (per both the WAIC and LOOIC), indicating better out-of-sample predictive accuracy for the RPCM. The ELPD difference was
(
) for the WAIC method and
(
) for the LOO method, which suggests that the RPCM is more parsimonious and should be preferred in this dataset. This is not surprising, considering how similar the item slope parameters appeared previously and considering that, contrary to the dataset analyzed in the original paper (
Myszkowski and Storme 2021), the dataset at hand contains responses to three tasks of the very same type (all alternate uses), reducing the risk of nuisance factors.
If model evidence (i.e., support for a model in the data) is of more interest than predictive performance (i.e., prediction of out-of-sample data)—this could be the case, if, for example, we were interested in knowing which factor scoring method to use for the examinees in the sample rather than in concluding on which model to use in general with the test—an estimate of the Bayes factor can also be obtained using the bayesfactor() function:
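A sketch of such a comparison, here using bridge sampling through brms's bayes_factor() function (and again assuming the RPCM fit is stored in fit_RPCM):

```r
# Bayes factor of the RPCM over the 2PPCM; requires both models to have been
# fitted with save_pars = save_pars(all = TRUE)
bayes_factor(fit_RPCM, fit_2PPCM)
```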
In this dataset, we found that the Bayes factor was in favor of the RPCM, which provides substantial evidence for the RPCM being a better-fitting model in this dataset.
Finally, apart from a model comparison approach, and although it is more of a component-wise approach, the hypothesis testing feature of brms can also be used to test whether the slopes differ using a series of pairwise comparisons:
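A sketch of these comparisons (the coefficient names depend on how the Item factor is coded):

```r
# Pairwise comparisons of the item slope parameters
h <- hypothesis(fit_2PPCM, c("slope_Item1 - slope_Item2 = 0",
                             "slope_Item1 - slope_Item3 = 0",
                             "slope_Item2 - slope_Item3 = 0"))
h
plot(h)
```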
Printing the hypothesis object h will provide the Bayes factor for each comparison, while the plot() method will show the distribution of posterior draws for the slope differences. In this dataset, we found that the 95% credible intervals of slope differences included 0, suggesting no slope differences, which was in line with the RPCM outperforming the 2PPCM in this dataset.
3.5. Inspecting Model Fit
3.5.1. Model Fit
Model fit is typically assessed using the posterior predictive checks (PPCs) method, which involves comparing the observed data to data simulated from the posterior predictive distribution. Good model fit is indicated when the observed data are plausible under the model. The
pp_check() method uses the
bayesplot package to produce a number of plots that can be used to assess this. In the context of IRT, we may want to compare observed score distributions (
y) and random draws of the posterior predictive distributions (
$y_{\mathrm{rep}}$). This plot, presented in
Figure 4, can be obtained using the following:
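For example (the number of posterior predictive draws shown is arbitrary):

```r
# Density overlay of observed scores (y) and posterior predictive draws (y_rep)
pp_check(fit_2PPCM, type = "dens_overlay", ndraws = 100)
```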
Figure 4. Posterior predictive check for all scores.
In this plot, we can see that the distribution of observed scores and the posterior predictive distributions are similar, which suggests that the model fits the data well. For a diagnosis of misfit that is less based on graphical examination, it is also possible to generate posterior predictive draws using the posterior_predict() method and to compare these draws to observations numerically. For example, one can verify if the observations are plausible under the model by verifying that they fall within the 95% credible interval of the posterior predictive distribution. We found that only one observation (Item 1, Person 192) fell outside of the 95% credible interval, which suggests good fit (code on the OSF repository).
3.5.2. Item Fit
In the IRT tradition, however, we often focus more on item fit (how well the model fits each item separately). Fortunately, the same kind of approach used for model fit can be used to assess item fit, by specifying the
group argument to the
Item variable (plot shown in
Figure 5):
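For example:

```r
# Posterior predictive densities, grouped by item
pp_check(fit_2PPCM, type = "dens_overlay_grouped", group = "Item", ndraws = 100)
```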
Figure 5. Posterior predictive check (density plot by item).
This plot suggests that the observed item scores were plausible under the model, as the observed score distributions were similar to the posterior predictive distributions, indicating that the 2PPCM is a good fit for the data.
3.5.3. Person Fit
It is also possible to use the same kind of procedure to inspect person fit. This can be carried out by specifying the group argument to the Person variable. Because of the number of cases, however, it can be useful to use the newdata argument to specify a subset of the data to plot. For example, to plot the posterior predictive distributions for the first five participants, we can use the following:
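A sketch of this (the selection of the first five Person identifiers is illustrative):

```r
# Posterior predictive check for the first five participants only
first_five <- data_long %>% filter(Person %in% unique(Person)[1:5])
pp_check(fit_2PPCM, type = "dens_overlay_grouped", group = "Person",
         ndraws = 100, newdata = first_five)
```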
We show in
Figure 6 the posterior predictive distributions for the first five participants as an example. This plot suggests that their observed scores were plausible under the model. To note, in this dataset, there are only three scores per person, so the observed scores are subject to greater sampling variability than when using posterior predictive checks by item in the previous section. Thus, more discrepancies between the observed and posterior predictive distributions are to be expected than before.
3.5.4. Covariate-Adjusted Frequency Plots
Another option to inspect or illustrate model fit that has been used in the context of IRT for count responses (
Forthmann et al. 2019,
2018;
Jendryczko et al. 2020) is the covariate-adjusted frequency plot (
Holling et al. 2016). This plot compares the observed frequencies to the expected frequencies for each score. While the observed frequencies are directly observed in the data, the expected frequencies are obtained by first predicting the expected means, which correspond to (predicted) rate parameters of the Poisson distribution. This can be carried out using the
fitted() method. Afterwards, the expected frequencies are calculated by summing, for a given possible score, the probability mass of the Poisson distribution across all the expected means, which can be performed using the
dpois() function. Below, we show how to obtain observed and expected frequencies for Item 1. In
Figure 7, we show the covariate-adjusted frequency plot for all items (full code on OSF).
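A sketch of this computation for Item 1 (assuming the rope item corresponds to the first level of the Item variable):

```r
# Observed and covariate-adjusted expected frequencies for Item 1
lambda_hat <- fitted(fit_2PPCM)[, "Estimate"]              # expected means (rates)
idx <- data_long$Item == unique(data_long$Item)[1]         # rows for Item 1
scores <- 0:max(data_long$Score[idx])
observed <- sapply(scores, function(s) sum(data_long$Score[idx] == s))
expected <- sapply(scores, function(s) sum(dpois(s, lambda = lambda_hat[idx])))
```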
Figure 7. Covariate-adjusted frequency plots by item.
If the model fits the data well, the expected and observed frequencies should be close. In this case, we can see that the 2PPCM seems to fit the data well.
3.6. Inspecting Item Response Functions
Once item parameter estimates are extracted, we can use the estimates for a given item to calculate (and plot) the item response function in a range of plausible
theta values (e.g.,
-3 to 3) by applying the response model formula in Equation (
4). For example, for item 1, we can use the following:
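A sketch of this (the coefficient names, e.g., easiness_Item1, depend on how the Item factor is coded):

```r
# Item response function for Item 1 over a plausible range of theta
pars <- fixef(fit_2PPCM)
theta <- seq(-3, 3, length.out = 100)
irf_item1 <- exp(pars["easiness_Item1", "Estimate"] +
                   pars["slope_Item1", "Estimate"] * theta)
plot(theta, irf_item1, type = "l",
     xlab = "Theta", ylab = "Expected count (Item 1)")
```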
We show in
Figure 8 the item response functions for all items and for both the RPCM and the 2PPCM. In line with what was seen in the parameter estimates of the 2PPCM, the item response functions have slopes that are very close, but item 2 (paperclip) seems more difficult than the other items. We may speculate that this higher difficulty is attributable to the object itself being less versatile (i.e., people tend to use paperclips for a smaller number of uses than the other objects).
To note, although it can hardly be seen in this case due to the similarity among the discrimination parameters, the item response functions of the 2PPCM are intersecting (i.e., the items do not present invariant item ordering), contrary to those of the RPCM (
Myszkowski 2024).
3.7. Calculating Test and Item Information
In IRT, information $I(\theta)$ is used to quantify the amount of information that a test (or an item) provides about a person’s ability $\theta_i$. Since information functions have not been presented for the 2PPCM in the literature, we provide the formula for the 2PPCM here. Fisher information about a parameter (here $\theta_i$) is defined as the negative of the expectation of the second derivative of the log-likelihood with respect to it:

$$I(\theta_i) = -\,\mathbb{E}\left[\frac{\partial^2 \ln L}{\partial \theta_i^2}\right] \quad (6)$$
By calculation of the first and second derivative (further discussed in
Appendix A), we obtain that the test information function $I(\theta_i)$ is as follows:

$$I(\theta_i) = \sum_{j=1}^{J} a_j^2 \exp(b_j + a_j\,\theta_i) \quad (7)$$
and the item information function $I_j(\theta_i)$ is as follows:

$$I_j(\theta_i) = a_j^2 \exp(b_j + a_j\,\theta_i) = a_j^2\,\lambda_{ij} \quad (8)$$
To note, for the RPCM, these formulae are the same, but the discrimination parameters do not vary by item. We can note that information increases with the (squared) discrimination parameter $a_j$, with the easiness parameter $b_j$, and with the ability parameter $\theta_i$. In other words, all else being equal, items that are easy and have high discrimination parameters provide more information, and more information is provided about persons with high ability parameters (this last point is in contrast with logistic models, where information is maximal when a person’s ability is close to the item’s difficulty).
Because there is no optimum of information, one might argue that examining the information function is less useful than in logistic models. However, its calculation could be important in contexts such as adaptive testing or optimal test design, where items may be selected based on the expected information that they provide about a person’s location. Furthermore, whereas, in the RPCM, items with higher easiness parameters provide more information at any given $\theta$ (which makes item selection based on information straightforward, as it implies that items with higher easiness should be preferred), this is not necessarily the case in the 2PPCM, due to the introduction of variable discrimination parameters (e.g., an item may be very easy but provide low information if its discrimination parameter is low).
To compute test information functions, we can use point estimates of the parameters (which we previously extracted). It is typical to examine item information through item information curves (IICs), which show how much information an item provides for a range of plausible
$\theta$ values. We can calculate the IIC for item 1 (reusing the previously extracted parameters) using Equation (
8), and plot it using the following code:
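For example (reusing the pars and theta objects defined above):

```r
# Item information curve for Item 1, following Equation (8)
iic_item1 <- pars["slope_Item1", "Estimate"]^2 *
  exp(pars["easiness_Item1", "Estimate"] + pars["slope_Item1", "Estimate"] * theta)
plot(theta, iic_item1, type = "l",
     xlab = "Theta", ylab = "Information (Item 1)")
```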
We show in
Figure 9 the item information curve for all items.
The total test information can simply be obtained by summing item information:
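For example, computing all item information curves at once and summing them (coefficient names assumed as above):

```r
# Item information for each item over the theta grid, then total test information
items <- c("Item1", "Item2", "Item3")
iic <- sapply(items, function(it) {
  pars[paste0("slope_", it), "Estimate"]^2 *
    exp(pars[paste0("easiness_", it), "Estimate"] +
          pars[paste0("slope_", it), "Estimate"] * theta)
})
test_info <- rowSums(iic)
```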
The calculation of test information can also be useful in the calculation of expected (i.e., marginal) reliability.
3.8. Examining Equidispersion
The Poisson distribution assumes that the variance is equal to the mean, an assumption known as equidispersion. Violations of this assumption lead to biased estimates of reliability and standard errors, with estimates being more conservative in the case of underdispersion, and more liberal in the case of overdispersion (
Forthmann et al. 2019). Hence, underdispersion tends to be considered a less important problem than overdispersion (
Jendryczko et al. 2020).
As suggested for the RPCM (
Baghaei and Doebler 2019) and used for the 2PPCM (
Myszkowski and Storme 2021), this assumption can be checked using the dispersion parameter $\phi$. The overall dispersion parameter can be calculated using the following formula:

$$\hat{\phi} = \frac{1}{n - p} \sum_{i,j} \frac{(x_{ij} - \hat{\lambda}_{ij})^2}{\hat{\lambda}_{ij}}$$

where $n$ is the number of observations and $p$ is the number of estimated item parameters.
Equidispersion implies that $\phi = 1$, underdispersion implies that $\phi < 1$, and overdispersion implies that $\phi > 1$. To compute it from our brmsfit object, we retrieve predictions in the sample using the fitted() method (which provides the posterior mean for each observation):
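A sketch of this calculation, under the assumption that the residual degrees of freedom are taken as the number of observations minus the number of item parameters:

```r
# Overall dispersion from posterior-mean fitted rates
lambda_hat <- fitted(fit_2PPCM)[, "Estimate"]
pearson_sq <- (data_long$Score - lambda_hat)^2 / lambda_hat
dispersion <- sum(pearson_sq) / (length(pearson_sq) - nrow(fixef(fit_2PPCM)))
dispersion
```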
We found that the dispersion parameter was
, which indicates underdispersion. In a similar manner, we may want to calculate dispersion by item:
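For example (here using two item parameters per item for the degrees of freedom, which is an assumed convention):

```r
# Dispersion computed separately for each item
data_long %>%
  mutate(lambda_hat = fitted(fit_2PPCM)[, "Estimate"]) %>%
  group_by(Item) %>%
  summarise(dispersion = sum((Score - lambda_hat)^2 / lambda_hat) / (n() - 2))
```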
All items showed underdispersion (
,
,
). Other packages like
DHARMa (
Hartig 2024) use different formulae to test for non-equidispersion and can also be used on
brms models’ objects:
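A sketch of this, building a DHARMa object from posterior predictive simulations of the brms model:

```r
library(DHARMa)
# Simulate from the posterior predictive distribution and wrap it for DHARMa
sims <- posterior_predict(fit_2PPCM)                   # draws x observations
dharma_res <- createDHARMa(
  simulatedResponse = t(sims),                         # observations x draws
  observedResponse = data_long$Score,
  fittedPredictedResponse = fitted(fit_2PPCM)[, "Estimate"],
  integerResponse = TRUE
)
testDispersion(dharma_res)
```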
The DHARMa dispersion test indicates significant underdispersion (, ), which is in line with the previous calculation. When looking at dispersion by item (code provided in OSF repository), all items show underdispersion (, , , ).
As was proposed for the 2PPCM and RPCM (
Myszkowski and Storme 2021), for Poisson models, a graphical representation of the dispersion by item can be obtained using Pearson residuals (
$r_{ij} = (x_{ij} - \hat{\lambda}_{ij}) / \sqrt{\hat{\lambda}_{ij}}$) as a function of the predicted values. Ideally, the residuals should be evenly distributed around 0, with no clear trend or structure. When this is not the case, this plot can allow for the detection of systematic patterns in dispersion (e.g., predicted score values where the model consistently overpredicts or underpredicts). Pearson residuals are obtained using the following:
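For example:

```r
# Pearson residuals from the posterior-mean predictions (used for Figure 10)
data_long$lambda_hat <- fitted(fit_2PPCM)[, "Estimate"]
data_long$pearson_res <- (data_long$Score - data_long$lambda_hat) /
  sqrt(data_long$lambda_hat)
```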
We show in
Figure 10 the Pearson residuals as a function of the predicted values by item.
Although the residuals appear relatively stable and clustered around 0, we can note, overall, an increasing trend, which indicates that the model tends to systematically overpredict for low predicted scores and underpredict for high predicted scores.
3.9. Extracting Factor Scores
Point estimates, standard errors and credible intervals for $\theta_i$ can be obtained (notably) using the ranef() method:
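For example (the coefficient name theta_Intercept is assumed and depends on the model specification):

```r
# Person parameter (theta) estimates, errors, and credible intervals
theta_est <- ranef(fit_2PPCM)$Person[, , "theta_Intercept"]
head(theta_est)
```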
To fully leverage the Bayesian framework when describing a person’s level, one may be interested in extracting the posterior distribution of $\theta_i$ using posterior draws. This can be performed using the as_draws set of functions. Here, we show how to extract the posterior distribution of $\theta_i$ across all participants, chains and iterations (without the warmup):
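A sketch of this, selecting the person-level draws by name (the naming pattern r_Person__theta[...] is assumed):

```r
# Post-warmup posterior draws of the person effects, across all chains
draws_theta <- as_draws_df(fit_2PPCM, variable = "^r_Person", regex = TRUE)
```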
We show a comparison of the $\theta$ estimates obtained through ML estimation in Mplus and brms in Figure 11. We can see that the estimates are practically identical, and that the errors (standard errors for Mplus, and the standard deviation of the posterior distribution for brms) are also very similar.
3.10. Calculating Reliability
There are several approaches to the calculation of reliability in the context of IRT models. One common approach, which is the one used in the
empirical_rxx() function of the
mirt package (
Chalmers 2012), is based on the typical formula for reliability in classical test theory, which is the ratio of the true score variance to the sum of the true score variance and the error variance:

$$r_{xx} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$
In the context of Bayesian IRT, for group-level empirical reliability, we can use as a proxy for the true score variance the variance of the $\theta$ point estimates (which implies that this method is dependent upon the method used to obtain a point estimate from the posterior distribution), and as a proxy for the error variance the mean of the square of the error estimates:
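A sketch of this calculation, reusing the theta_est object extracted above:

```r
# Group-level empirical reliability from theta point estimates and their errors
true_var <- var(theta_est[, "Estimate"])
error_var <- mean(theta_est[, "Est.Error"]^2)
rel_group <- true_var / (true_var + error_var)
rel_group
```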
In this dataset, we found that the group-level empirical reliability was . For person-level reliability, we can use the same method, but using the squared error for the person instead of the mean squared error:
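For example:

```r
# Person-level reliability, using each person's squared error estimate
rel_person <- true_var / (true_var + theta_est[, "Est.Error"]^2)
```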
We show a scatter plot of reliability as a function of
$\theta$ in
Figure 12. In line with item information curves, we can see that empirical reliability tends to increase with
$\theta$ in the 2PPCM. Expectedly, the examinee who had only responded to one item has a lower reliability than they would have with scores for all items, and is therefore an outlier in this plot.
3.11. Suggested Workflow
To wrap up this section, we present a suggested workflow for the estimation of Poisson IRT models using brms. After the package has been installed and tested (e.g., using an example in the package documentation), we would recommend specifying the 2PPCM using the code we provided using the default priors. After estimation, one should carefully inspect convergence (e.g., divergent transitions, low ESS, poor mixing of the chains), and in case convergence issues are found, focus on using informative priors (especially for the discrimination parameters). If the model still does not converge, then try increasing the number of iterations, the number of chains, and/or the adapt_delta parameter.
Once a model has been estimated and the output and summary show no (or nearly no) convergence issues, we would suggest estimating other candidate models (e.g., the RPCM). Once all candidate models have been estimated, compare their fit and select the model with the best fit. Use the covariate-adjusted frequency plots and posterior predictive checks to inspect the fit of the selected model. Calculate dispersion and use the Pearson residuals to inspect the equidispersion assumption.
Afterwards, if the model is to be used for scoring persons, extract the factor scores and their errors (or entire posterior distributions). If the model is used to investigate the properties of the test, to refine the test, discard items or assemble new tests, then primarily inspect the item response functions and item information curves, and calculate the overall test reliability.