3.1. Model Fit and Comparison
We chose a higher-order LLM because it simultaneously estimates a general latent ability and specific attributes. We also conducted a pilot study to compare the higher-order LLM with other CDMs and to explore its efficiency in spatial ability assessment. To simplify model selection, we compared all eight CDMs using only the more informative six-attribute Q-matrix (based on Lakin’s spatial ability classification). Because only DINA, DINO, and GDINA are nested models, Table 6 reports only relative fit indices (AIC, BIC, CAIC, and SABIC), which allow a comparison across all models. Among these indices, the Akaike Information Criterion (AIC) is the most widely used and most basic; it helps select a model by balancing complexity against fit. The Bayesian Information Criterion (BIC) imposes a heavier penalty on model complexity. The Consistent Akaike Information Criterion (CAIC) is stricter still than BIC, increasingly favoring simpler models as the sample size grows. The Sample-Size Adjusted BIC (SABIC) relaxes the BIC penalty and is better suited to small samples. For all four indices, lower values indicate better fit. Across the eight CDMs, the higher-order LLM (HO-LLM) provides the best relative fit, with the lowest AIC (4744.01), BIC (5316.03), CAIC (5520.03), and SABIC (4671.03), which led us to select it for this research.
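As a reading aid for the four indices, the following minimal sketch computes them from a model's log-likelihood, number of free parameters, and sample size, using the standard textbook definitions. The function name and the input values are hypothetical placeholders; they are not the estimates behind Table 6, and the software used in the study may report these quantities through its own routines.

```python
import math


def information_criteria(log_lik, n_params, n_obs):
    """Return AIC, BIC, CAIC and SABIC for one fitted model (lower is better)."""
    deviance = -2.0 * log_lik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
        "SABIC": deviance + n_params * math.log((n_obs + 2) / 24),
    }


# Hypothetical inputs for two competing models (illustration only).
fits = {
    "model_A": information_criteria(log_lik=-2200.0, n_params=60, n_obs=150),
    "model_B": information_criteria(log_lik=-2150.0, n_params=200, n_obs=150),
}

# Pick the model with the smallest BIC.
best_by_bic = min(fits, key=lambda name: fits[name]["BIC"])
```

Under these definitions CAIC exceeds BIC by exactly the number of free parameters, which is why the two indices tend to rank models similarly while AIC and SABIC are more tolerant of complex models.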
In Table 7, the model with Lakin’s Q-matrix (Q_LH) provides competitive relative fit based on AIC (4744.01) and SABIC (4671.03), while the model with Linn’s Q-matrix (Q_LP) is competitive on BIC (5179.74) and CAIC (5307.74). This means that both Lakin’s and Linn’s models balance fit and complexity well. For absolute fit, Hegarty’s Q-matrix (Q_H) has the lowest RMSEA2 (0.046), indicating the best absolute fit among the models; RMSEA values closer to 0 indicate better fit, and values below 0.05 are considered good. Lakin’s Q-matrix has the lowest SRMSR (0.085) among the models, but this value is still slightly above the recommended threshold of 0.08, indicating some minor misfit.
Table 7 also reports classification accuracy (CA), that is, how accurately the CDM identifies an individual’s mastery or non-mastery of specific skills or attributes. Linn’s Q-matrix achieves the best classification accuracy (0.931), meaning that it identifies mastery patterns correctly in about 93% of cases, an impressive value. Lakin’s Q-matrix has the lowest CA (0.708), although values above 0.70 are still often considered acceptable in educational and psychological assessment. One likely reason for the gap is that Linn’s Q-matrix has only three attributes, which makes attribute mastery easier to detect.
Based on Table 7, we cannot single out one Q-matrix as clearly optimal. The four Q-matrices performed similarly on model fit and classification accuracy, suggesting that model selection should take into account the specific context and purpose of the analysis.
3.2. Item Fit
In the CDM framework, the Proportion of Variance Accounted For (PVAF) reflects the proportion of the total variability in the data that is explained by the model. Item-level PVAF offers more detail about how well the model accounts for the variance in responses to each specific item, which helps us decide which items to filter. In Figure 5, Linn’s model consistently outperforms the others across all task types, with PVAF around 0.7, indicating that it better captures the variance in the data. Hegarty’s model shows competitive performance, particularly for the MR item type. Some items (e.g., 6 and 14) show markedly lower item-level PVAF (below 0.4) across all models, suggesting item misfit or modeling issues. These items may not be well represented by the Q-matrix, or they may be particularly challenging or ambiguous. Additionally, all models capture the mental rotation tasks well, with the MR items exhibiting relatively high and stable PVAF. The OA and IP items fluctuate more, and their item-specific dips need further investigation.
We applied the Bonferroni correction (Bonferroni 1936) to obtain, for each item, the adjusted p-value associated with its maximum correlation with all other items (Adj.p.max), in order to identify potential item redundancy or model misfit. In Figure 6, the heatmap shows the adjusted p-values for each item across the four models, listed by rows (Linn, Hegarty, Uttal, and Lakin). An Adj.p.max below 0.05 indicates a significant correlation with other items, which signals potential misfit (shown in light grey). An Adj.p.max above 0.05 suggests acceptable fit (shown in blue).
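The flagging rule described above can be sketched as follows. This is a minimal illustration of a Bonferroni adjustment followed by a per-item summary, assuming that the adjustment is applied across all item pairs and that each item is summarized by its smallest adjusted p-value; the exact procedure in the software used for Figure 6 may differ. The function names and the p-values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def flag_items(pairwise_p, alpha=0.05):
    """Summarize each item by its smallest adjusted pairwise p-value.

    pairwise_p maps (item_i, item_j) -> unadjusted p-value for that pair.
    Returns (per-item smallest adjusted p-value, sorted list of flagged items).
    """
    pairs = list(pairwise_p)
    adjusted = dict(zip(pairs, bonferroni_adjust([pairwise_p[p] for p in pairs])))
    worst = {}
    for (i, j), p_adj in adjusted.items():
        for item in (i, j):
            worst[item] = min(worst.get(item, 1.0), p_adj)
    flagged = sorted(item for item, p in worst.items() if p < alpha)
    return worst, flagged


# Hypothetical unadjusted p-values for three item pairs (illustration only).
pairwise_p = {(1, 2): 0.20, (1, 3): 0.004, (2, 3): 0.60}
worst, flagged = flag_items(pairwise_p)
```

In this toy input, items 1 and 3 share one strongly associated pair, so both are flagged, while item 2 is not.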
Across models, most items exhibit acceptable fit (blue), especially for the MR and OA tasks. However, the problematic items are concentrated in the IP tasks (Items 26–45). Specifically, items 32 and 33 misfit consistently across all models, which suggests that these items may be redundant or may underrepresent this subset. Among the four models, the Lakin model performs best on the MR and OA tasks and has fewer misfitting IP items. Overall, the heatmap of adjusted p-values identifies which items require further review and shows that the Lakin model fits the diverse spatial item types somewhat better than the other models.
In the HO-LLM framework, guessing and slipping parameters play a crucial role in interpreting unexpected student performance. The guessing parameter is the probability that a participant without the necessary skill still answers an item correctly. A high guessing parameter suggests that an item is vulnerable to random guessing, or that it depends on additional mental resources or unmeasured attributes. The slipping parameter is the probability that a student who has the required skill answers the item incorrectly. A high slipping rate indicates that other factors (such as carelessness, misunderstanding, or fatigue) are leading to unexpected errors. We interpret items with high guessing or slipping as potentially problematic for diagnostic accuracy (De La Torre et al. 2010). There is no fixed threshold for filtering on guessing or slipping parameters, but in most cases values below 0.3 are considered acceptable (Cuhadar and Salih 2022).
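To make the two parameters concrete, the sketch below derives them for a single item under a logit-link model with main effects only, which is the usual form of the linear logistic model: guessing is the success probability with none of the required attributes, and slipping is one minus the success probability with all of them. The item parameters are hypothetical, and treating 1 − guessing − slipping as the "signal" share of an item is our assumption for illustration, not a definition taken from the study.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def guess_slip(intercept, main_effects):
    """Guessing and slipping for one item under a logit-link, main-effects model.

    intercept: log-odds of a correct answer with no required attribute mastered.
    main_effects: log-odds gain contributed by each required attribute.
    """
    guess = expit(intercept)
    slip = 1.0 - expit(intercept + sum(main_effects))
    return guess, slip


def review_item(guess, slip, cutoff=0.3):
    """Flag an item whose guessing or slipping reaches the cutoff (0.3 here)."""
    signal = 1.0 - guess - slip  # assumed definition of the 'signal' share
    return {"signal": signal, "flag": guess >= cutoff or slip >= cutoff}


# Hypothetical item requiring two attributes (illustration only).
g, s = guess_slip(intercept=-1.5, main_effects=[1.5, 2.0])
report = review_item(g, s)
```

With these placeholder values both parameters fall below 0.3, so the item would not be flagged.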
In Figure 7, the Uttal and Lakin models show lower rates of guessing and slipping and higher rates of useful signal for most tasks, especially the MR and OA tasks. The Linn model has a better rate of useful signal for some IP tasks. Overall, the Uttal and Lakin models may provide more information for estimating person ability, with less contamination from slipping and guessing.
Regarding task type, the OA task performed best, with a larger signal section (shown in blue) and smaller guessing and slipping parameters (grey and light grey, respectively), except for item 14. Some MR items, such as items 6, 8, and 19, have higher guessing parameters, indicating that some participants who lack the required attributes could still answer them correctly. The IP task type exhibits lower slipping parameters but higher guessing parameters than the other two task types. This suggests that children who understand the task tend to answer correctly without errors, whereas others may guess, indicating that this task type may be particularly challenging for younger examinees.
3.3. Person Parameters
The higher-order LLM simultaneously estimates higher-order latent ability and attribute mastery probabilities. We created an Ability-Mastery Alignment Plot (AMAP) to visualize the relationship between higher-order latent ability (also known as theta, shown in the top section) and the mastery pattern profiles derived from CDMs (shown in the bottom section). For example, in Figure 4, the top section displays the higher-order ability estimates, which reflect each person’s general spatial ability level; each dot represents an individual. The bottom section shows mastery pattern profiles across multiple attributes (e.g., spatial visualization (SV), mental rotation (MR), and spatial perception (SP)), in which “×” markers indicate mastery (1) and blanks indicate non-mastery (0) of each attribute. Taken together, the two sections show how individuals’ theta aligns with their mastery profile.
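The link between theta and the mastery profile that the AMAP visualizes can be sketched with the usual higher-order structure, in which the probability of mastering each attribute is a logistic function of theta. The intercepts and slopes below are hypothetical values chosen only to produce a step-like pattern; they are not estimates from this study. Thresholding the probability at 0.5 is a simplification for illustration, since in the model mastery is probabilistic given theta.

```python
import math


def mastery_probability(theta, intercept, slope):
    """P(attribute mastered | theta) under a logistic higher-order structure."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * theta)))


# Hypothetical (intercept, slope) pairs for three attributes (illustration only).
PARAMS = {"SP": (1.0, 1.0), "MR": (-0.5, 2.0), "SV": (0.5, 1.5)}


def profile(theta, cutoff=0.5):
    """Most likely mastery profile at a given theta, as a tuple of 0/1 flags."""
    return tuple(
        int(mastery_probability(theta, b0, b1) >= cutoff)
        for b0, b1 in PARAMS.values()
    )


high = profile(0.95)   # all three attributes mastered
mid = profile(-0.20)   # the hardest attribute (MR here) drops out first
low = profile(-1.50)   # none mastered
```

Attributes with lower intercepts require a higher theta before mastery becomes likely, which is what produces the ordered, step-like loss of attributes as theta decreases.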
Figure 8 shows the relationship between higher-order spatial ability (theta) and attribute mastery patterns in the Linn model. The rightmost points in the top plot have the greatest theta values (around 0.95), and these individuals have full mastery of all three attributes (SP, MR, and SV) in the bottom plot. Participants who have not mastered the mental rotation attribute (mastery pattern (1,0,1)) have a much lower theta than those who have mastered all attributes (mastery pattern (1,1,1)): theta drops from 0.95 to −0.20. This suggests that mental rotation is the most critical attribute for higher-order spatial ability. One possible reason is that mental rotation involves greater executive demands, such as transformation and spatial updating, compared with SV, which only requires recognition or static mental imaging. When spatial perception is also absent (mastery pattern (0,0,1)), the higher-order ability drops only slightly, from −0.20 to −0.25, suggesting that spatial perception has a minor impact. Moreover, mastering only spatial perception (mastery pattern (1,0,0)) results in a low spatial ability of around −1.0, which is just slightly higher than that of participants who have mastered no attribute (mastery pattern (0,0,0)). This finding indicates that SP alone does not substantially support higher spatial ability in Linn’s model. In conclusion, the results suggest that in Linn’s spatial ability framework, mental rotation is the most important attribute for supporting high spatial ability, while spatial perception plays a less significant role.
Figure 9 shows the relationship between higher-order spatial abilities and attribute mastery patterns in the Hegarty model. Like the Linn model, the rightmost points in the top plot also have the highest theta values (around 0.95) with full mastery across all four attributes (MT, SV, SO, and SWM) in the bottom plot. Theta decreases progressively with the absence of attributes, following the order of MT, SV, SO, and SWM. Notably, the absence of spatial working memory (SWM) (mastery pattern (1,1,1,0)) has only a slight impact on general ability (around 0.8), whereas the absence of mental transformation (MT) causes a substantial decrease in ability (around −0.2). This pattern indicates that while mastering mental transformation attributes is comparatively challenging for individuals, it is also the key element for achieving the highest general spatial ability.
Figure 10 shows the relationship between higher-order spatial abilities and attribute mastery patterns in the Uttal model. As in the Linn and Hegarty models, the rightmost points in the top plot have the highest theta values (around 0.95) and full mastery of all four attributes (intrinsic, extrinsic, static, and dynamic) in the bottom plot. Theta in the Uttal model follows a more linear trend than the clear step-like structure of the previous models, which indicates that the combination of attributes has a more nuanced effect on theta. Dynamic spatial ability is the most important attribute, yet only a small percentage of individuals master it, whereas intrinsic spatial ability plays a minor role and is mastered by the majority. In general, strengthening training on dynamic ability appears to be an effective way to improve students’ general spatial ability.
In Figure 11, the rightmost points in the top section, representing the highest theta values, align with full mastery of all six attributes (VI, RO, OR, PE, MR, and DSA) in the bottom section. Most participants consistently master OR (Orientation) and MR (Mechanical Reasoning), indicating that these two spatial ability attributes are comparatively easier to master than the others. Moving down from the highest theta values, the first attribute to drop out is RO (Rotation), which indicates both the difficulty of mastering RO and its major impact on general spatial ability. This finding is consistent with the result from the Linn model, which also includes a mental rotation attribute.
In conclusion, the analyses across Figure 8, Figure 9, Figure 10, and Figure 11 reveal distinct patterns in how various spatial attributes contribute to general spatial reasoning ability (theta) within different spatial ability classification frameworks. Across all models and frameworks, certain attributes emerge as critical to higher general spatial reasoning ability: mental rotation in the Linn model, mental transformation in the Hegarty model, dynamic spatial processing in the Uttal model, and rotation (RO) in the Lakin model. These attributes likely serve as core components in fostering a robust spatial skillset, regardless of the specific classification framework.