2.1. Participants
This study was conducted as part of a larger project whose goal was to develop a spatial reasoning battery for students in grades 2–8. To ensure that test items had sufficient variability in difficulty, a new form of 15 items was developed as part of this work. We collected two focal samples for this study to increase respondent variability on the latent trait. First, we recruited 73 students in grades 3 to 7 through summer camp programs focused on a range of academic, artistic, and STEM-related topics. Students in this sample were aged 9–12 (median = 11), with 44% female, 65% white, 21% Asian, 7% Black, and 7% other. These camps occurred at either a southeastern university or a midwestern university. Parents or guardians were approached for permission for their students to participate in the research study. Students received a gift card for their participation. Second, we collected data from 101 undergraduate students from a southeastern university, primarily Education majors, to augment our sample. These respondents were 68% female, 81% white, 10% Black, 7% Asian, and 3% other groups. We did not collect their ages. These students were recruited through their college’s research portal and offered extra credit in their coursework for their participation. Results from preliminary analyses indicated that the two samples could be combined for analyses (discussed further in the Results section). All data were collected under the approval of the Institutional Review Board of the university.
2.2. Instrument
An initial set of object assembly items was developed drawing on older versions of the test format and prior work on the cognitive processes they involve. Based on pilot testing, fifteen items with good classical discrimination statistics and a range of difficulty were selected for this study. Our list of item characteristics was based on characteristics defined by Embretson and Gorin (2001) that showed potential in their original analyses to explain item difficulty (see also Ivie and Embretson 2010). Some of the item characteristics were self-explanatory, while others were less so.
Number of pieces (Npieces), total edges (Tedges) across pieces, maximum edges (Medges) on any one piece, and curved pieces (Cpieces) were judged by the test developer and checked by a collaborator for accuracy. These characteristics described the stem. The item developer and a collaborator rated the more subjective item characteristics concurrently. For example, the decision of whether a distractor was “easily excluded” (EED) could vary by rater. Using the definitions provided in Table 1, two raters independently categorized each item. The few discrepancies were discussed and resolved.
Pieces with labels (Lpieces) was a measure of how many pieces in the stem had clear labels (square, triangle, [pie] slice). Irregular shapes without obvious labels were not counted.
Regular-shape solution (RSS) was judged based on the key having a standard shape (circle, equilateral triangle, right triangle, or square).
Displaced pieces (Dpieces) was based on the number of pieces in the stem that were moved to a different location in the key.
Rotated pieces (Rpieces) was based on the number of pieces that had to be rotated from their orientation in the stem to their orientation in the key to reach the correct answer.
Easily excluded distractors (EED; called falsifiable distractors by Embretson and Gorin 2001) was the most subjective characteristic and included any distractors with a different number of pieces or obviously different shapes from the stem. A description of the item characteristics is presented in Table 1.
Figure 1 shows an example item. In this item, Npieces is 3; Tedges is 8; Medges is 4; Cpieces is 3; Lpieces is 0; RSS is 1 (yes; the key has a circle shape); Dpieces is 2; Rpieces is 2; and EED is 0 (all options have the same number of pieces as the stem).
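As a concrete illustration, the coding of this example item could be recorded as follows. This is a minimal sketch in R; the object name (example_item) and the data-frame layout are our own illustration rather than part of the original scoring materials.

```r
# Characteristic codes for the example item in Figure 1
# (abbreviations follow Table 1; the data frame itself is illustrative)
example_item <- data.frame(
  Npieces = 3,  # number of pieces in the stem
  Tedges  = 8,  # total edges across all pieces
  Medges  = 4,  # maximum edges on any one piece
  Cpieces = 3,  # pieces with curved edges
  Lpieces = 0,  # pieces with clear labels
  RSS     = 1,  # regular-shape solution (the key is a circle)
  Dpieces = 2,  # displaced pieces
  Rpieces = 2,  # rotated pieces
  EED     = 0   # easily excluded distractors
)
```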
2.3. Data Analysis
We analyzed the OA items using three steps. First, we evaluated the instrument’s overall psychometric characteristics. Specifically, we evaluated the internal consistency using Cronbach’s alpha, which assesses the degree of item covariance on a 0 to 1 scale. An α value closer to 1 indicates a stronger correlation among the items, implying that there are consistent response patterns between items. Additionally, we calculated descriptive statistics for the scored responses to each item, including the mean (or the proportion of correct responses), standard deviation, and corrected item–total correlation, which is the correlation of the item with the total score computed without that item. This analysis gave us preliminary insights into the degree to which the OA items could be interpreted as a measure of spatial reasoning. We used the psych (Procedures for Psychological, Psychometric, and Personality Research; Revelle 2023) package to conduct these analyses in R (R Core Team 2022).
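A minimal sketch of these analyses with the psych package is shown below; the scored response matrix oa_scored (rows = respondents, columns = the 15 OA items, coded 0/1) is a hypothetical object standing in for our data.

```r
library(psych)

# oa_scored: hypothetical 0/1 matrix (rows = respondents, columns = 15 OA items)

# Cronbach's alpha and corrected item-total correlations
alpha_out <- psych::alpha(oa_scored)
alpha_out$total$raw_alpha    # internal consistency estimate
alpha_out$item.stats$r.drop  # corrected item-total correlation for each item

# Proportion correct (item mean) and standard deviation for each item
item_desc <- data.frame(
  prop_correct = colMeans(oa_scored, na.rm = TRUE),
  sd           = apply(oa_scored, 2, sd, na.rm = TRUE)
)
item_desc
```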
Then, we analyzed the responses using the dichotomous Rasch model (Rasch 1960) via the eRm (Mair et al. 2021) R package. We selected this model for several reasons. First, Rasch models are well suited to relatively small sample sizes compared to the requirements for other, more complex, parametric IRT models. For example, researchers have noted that it is possible to obtain stable estimates with the dichotomous Rasch model with samples as small as n = 30 participants (Bond et al. 2020; Linacre 1994). Second, this model allowed us to evaluate the characteristics of the OA items before we explored the contributions of item characteristics to item difficulty. As Green and Smith (1987) pointed out, evidence of adequate psychometric characteristics, including acceptable item fit, is essential before the results of extended IRT models can be meaningfully interpreted. Accordingly, we evaluated item properties based on the dichotomous Rasch model as a preliminary step in our LLTM analysis. This model allowed us to explore the degree to which the OA items reflected a unidimensional construct in which items exhibited useful psychometric characteristics. Specifically, unidimensionality was examined with a principal components analysis of standardized residuals. A maximum eigenvalue of 2.00 is recommended for sufficient unidimensionality (Chou and Wang 2010). We evaluated the overall model fit using a likelihood ratio test.
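The corresponding eRm workflow can be sketched as follows; oa_scored is again the hypothetical scored response matrix, and we assume that the standardized residuals exposed by itemfit() (the st.res element) are what feed the principal components analysis.

```r
library(eRm)

# Dichotomous Rasch model for the scored OA responses
rm_fit <- RM(oa_scored)

# Overall model fit: Andersen's likelihood ratio test
LRtest(rm_fit)

# Unidimensionality: PCA of standardized residuals;
# the largest eigenvalue should not exceed 2.00 (Chou and Wang 2010)
pp      <- person.parameter(rm_fit)
std_res <- itemfit(pp)$st.res          # person-by-item standardized residuals
max(eigen(cor(std_res))$values)
```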
Additionally, the Rasch model assumes that the items exhibit local independence, such that, after controlling for the primary latent variable, item responses are statistically independent. We evaluated this assumption by calculating the residual correlations between each pair of items after controlling for the model. The absolute values of these correlation coefficients should be less than 0.2 to indicate adherence to local independence (Linacre 1994). Furthermore, the infit and outfit mean square error (MSE) statistics for items and respondents help evaluate adherence to invariant item difficulty across participants (i.e., item difficulty is the same for all participants) and invariant person locations across items (i.e., person estimates of spatial ability do not depend on the specific items). Specifically, a value of 1.0 for both infit and outfit MSE and a value of 0.0 for both infit and outfit z indicate good fit.
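Continuing the sketch above (reusing pp and std_res), the local independence check and the fit statistics could be obtained as follows; the field names (i.infitMSQ, i.outfitMSQ, and so on) reflect our reading of the eRm itemfit output rather than a definitive API reference.

```r
# Local independence: residual correlations between item pairs;
# absolute values below 0.2 suggest adherence (Linacre 1994)
res_cor <- cor(std_res)
max(abs(res_cor[upper.tri(res_cor)]))

# Item infit/outfit mean squares and standardized (z) fit statistics
ifit <- itemfit(pp)
cbind(infitMSQ  = ifit$i.infitMSQ,
      outfitMSQ = ifit$i.outfitMSQ,
      infitZ    = ifit$i.infitZ,
      outfitZ   = ifit$i.outfitZ)

# Person fit statistics
personfit(pp)
```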
To understand the degree to which our items functioned differently across the two focal samples (subgroups), a differential item functioning (DIF; Wright and Masters 1982) analysis was conducted. Specifically, using a concurrent calibration approach, we estimated the item difficulty and standard errors specific to each subgroup with the dichotomous Rasch model and calculated the standardized difference in item difficulty between the two subgroups, given by

z = (b₁ − b₂) / √(SE₁² + SE₂²),

where z is the standardized difference, b₁ and b₂ are the item difficulties specific to subgroups 1 and 2, respectively, and SE₁ and SE₂ are the standard errors of the item difficulties specific to subgroups 1 and 2, respectively. Higher values of z indicate greater item locations (i.e., more difficult items) for subgroup 1 compared to subgroup 2.
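A sketch of this calculation with eRm follows. Here, group is a hypothetical indicator aligned with the rows of oa_scored (1 = camp sample, 2 = undergraduate sample); for simplicity the model is refit separately within each subgroup under eRm's default sum-to-zero constraint, which approximates the concurrent-calibration scaling described above, and signs are reversed because eRm reports easiness rather than difficulty. eRm's Waldtest() provides a closely related built-in item-level comparison between subgroups.

```r
# Fit the dichotomous Rasch model within each subgroup
fit_g1 <- RM(oa_scored[group == 1, ])
fit_g2 <- RM(oa_scored[group == 2, ])

# Item difficulties (eRm estimates easiness, so multiply by -1) and standard errors
b1  <- -fit_g1$betapar
b2  <- -fit_g2$betapar
se1 <- fit_g1$se.beta
se2 <- fit_g2$se.beta

# Standardized difference in item difficulty between the two subgroups
z_dif <- (b1 - b2) / sqrt(se1^2 + se2^2)
z_dif
```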
Finally, we applied the LLTM to the scored OA responses using the eRm (Mair et al. 2021) package with the Q-matrix illustrated in Table 2. Item classifications shown in the Q-matrix were specified based on expert classification of the OA items related to nine of the characteristics included by Embretson and Gorin (2001; see also Ivie and Embretson 2010). To facilitate interpretability and model robustness, we dichotomized the polytomous characteristics. For each of these characteristics, we calculated the mean of all the unique values and coded the values lower than the mean as “0” and the values equal to or higher than the mean as “1” (i.e., a mean split). For instance, the Npieces for the 15 items were [2, 2, 2, 3, 5, 4, 4, 3, 5, 3, 4, 4, 4, 5, 4], and the unique values were 2, 3, 4, and 5, with a mean of 3.5. The values lower than 3.5 (2 and 3) were denoted as 0, and the values equal to or above 3.5 (4 and 5) were denoted as 1. As a result, the dichotomized Npieces became [0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]. This resulted in acceptable variability in each dichotomized variable.
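The mean split for Npieces can be reproduced directly from the values listed above; the same few lines of R apply to the other polytomous characteristics.

```r
# Mean split of the Npieces characteristic across the 15 items
npieces <- c(2, 2, 2, 3, 5, 4, 4, 3, 5, 3, 4, 4, 4, 5, 4)
cutoff  <- mean(unique(npieces))          # mean of the unique values (3.5)
npieces_dich <- as.integer(npieces >= cutoff)
npieces_dich
#> 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1
```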
We evaluated the fit of the LLTM in three ways, following Baghaei and Kubinger (2015). First, we used the log-likelihood chi-square test for both the dichotomous Rasch model and the LLTM, and we compared the difference between the −2 log-likelihoods of the two models against a critical value of chi-square (i.e., the value at the 0.95 quantile of the chi-square distribution) with the degrees of freedom equal to the difference between the number of parameters in the two models (Fischer 1973). A difference between the −2 log-likelihoods less than the corresponding critical value indicated a good fit for the LLTM, implying that the identified item characteristics appreciably accounted for the item difficulty parameters. Second, we calculated Pearson’s correlation coefficient between the item difficulty parameters of the dichotomous Rasch model and the item difficulty parameters based on the LLTM. The coefficient can range from −1 to 1, and higher values indicated that the item characteristics in the LLTM accounted for more variance in the item difficulty estimated by the dichotomous Rasch model. Third, we examined the alignment between the item difficulty parameters of the LLTM and the dichotomous Rasch model. To do this, we normalized and plotted the LLTM estimates against the item difficulty parameters of the dichotomous Rasch model.
After we examined these fit indices, we evaluated the LLTM results to better understand the influence of specific item characteristics on item difficulty. We examined the β parameter for each item and the η parameter for each item characteristic with their standard errors and 95% confidence intervals. Both the β and η parameters were estimated on a log-odds (i.e., “logit”) scale. The β parameter in the LLTM was interpreted in the same way as the b or difficulty parameter in the dichotomous Rasch model. Specifically, a larger value of the β parameter indicated that the corresponding item was more difficult. A larger value of the η parameter indicated that the corresponding item characteristic made the items more difficult.
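A short sketch for extracting these estimates from the fitted LLTM is given below; etapar and se.eta are the field names we assume eRm uses for the item-characteristic parameters and their standard errors, and summary() reports comparable information.

```r
# Item-characteristic (eta) parameters, standard errors, and 95% confidence intervals
eta    <- lltm_fit$etapar
eta_se <- lltm_fit$se.eta
data.frame(eta      = eta,
           se       = eta_se,
           ci_lower = eta - 1.96 * eta_se,
           ci_upper = eta + 1.96 * eta_se)

# summary(lltm_fit) gives similar output; note that eRm parameterizes easiness,
# so signs may need reversing to read larger values as "more difficult", as in the text.
summary(lltm_fit)
```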