3.1. Background
As mentioned in the introduction, an approach to identifying within-person shifts to construct-irrelevant responding is grounded in the notion that both rapid and protracted response times are important indicators of alternative and noisy response processes. In contrast to the traditional approach of distinguishing strategy-driven response behavior from guessing behavior, the modeling in Study 2 uses a more general distinction between strategy-driven response behavior and noisy/erratic response behavior. While unexpectedly short response times may be indicative of disengaged, guessing behavior, unexpectedly long response times may be indicative of early processes in identifying and settling into an appropriate response strategy. For example, if an insufficient number of items are presented for practice, some examinees may still be becoming familiar with the nature of the task and searching for an appropriate strategy when the test items are administered. Such a search process would be expected to produce exceedingly long response times for those items, as well as responses that are less informative for making claims about the construct of interest. Alternatively, as the duration of testing increases, fatigue and attentional lapses may have a greater influence over the response process, thereby producing extended response times and ability estimates that are less informative about the construct of interest. The modeling approach in Study 2 can account for any number of transition processes, without making strong a priori assumptions. Given these general principles, our modeling approach is formally presented in the following sections.
Accuracy Model. As in Study 1, a Rasch modeling framework is used to model the accuracy data and derive latent ability estimates for the hypothetical construct. The Rasch model adheres to the foundational measurement property of specific objectivity: person abilities may be estimated independently of the distribution of item difficulties, and therefore remain invariant across different item sets within the construct. For any given person and item, the probability of a correct response is a function of a simple additive relation between the person's ability and the item's difficulty:

P(X_{ij} = 1) = f(θ_j − b_i) = exp(θ_j − b_i) / (1 + exp(θ_j − b_i))    (1)

where P(X_{ij} = 1) represents the probability of a correct response on item i for person j, θ_j represents the latent ability of person j, and b_i is the difficulty of item i. For dichotomously scored items, responses are Bernoulli distributed, and the logistic function is a common choice for f, as shown in Equation (1).
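As a concrete illustration, Equation (1) can be evaluated directly; the ability and difficulty values below are arbitrary examples, not estimates from the ART data:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model (Equation (1))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly .5.
p_equal = rasch_p(theta=0.0, b=0.0)

# A person 1 logit above the item's difficulty answers correctly ~73% of the time.
p_above = rasch_p(theta=1.0, b=0.0)
```

Because only the difference θ_j − b_i enters the model, shifting every ability and difficulty by a constant leaves all probabilities unchanged, which is the invariance property noted above.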
However, if transitions occur between construct-driven and erratic response styles, then a measurement model that accounts for a mixture of response styles must be adopted:

P(X_{ij} = 1 | s) = exp(θ_{js} − b_{is}) / (1 + exp(θ_{js} − b_{is}))

Note that an additional subscript, s, has been affixed to the parameters of the model. This s subscript indicates that the different response styles will be represented by distinct parameter values. In this approach, s may take on two values (1, 2), corresponding to construct-driven and erratic response styles.
RT Model. As mentioned above, these different response styles are assumed to be strongly tied to different patterns of response times. In particular, unexpectedly long and unexpectedly short response times are both associated with a response style contaminated by construct-irrelevant processes. However, even for principled, construct-driven responses, there are substantial individual differences in response time behavior. Furthermore, certain items induce longer response times, depending on item complexity and the associated processing demands. Thus, any designation of an 'unexpected' observed response time must be conditional on both person and item effects.
Van der Linden (2006) proposed a response time model that accounts for both person and item effects, and we adopt his general approach as a foundation for deriving residual response time scores after conditioning on person and item effects. As in Study 1, log response time (lnRT) was used; thus, in the modeling of item i and person j, the log-transformed response times are assumed to be normally distributed, and their expected values are an additive function of both person speed (τ_j) and item time intensity (β_i):

ln T_{ij} ~ N(β_i − τ_j, σ_ε²)
Thus, σ_ε² represents the residual variance in response times after the person and item effects have been accounted for. In the case that both construct-driven and erratic response styles are observed during a single testing occasion, one would anticipate some degree of heteroscedasticity. After an examinee transitions to an erratic response style, σ_ε² would be expected to be quite large, because the person has shifted to a response style inconsistent with their behavior during the other portions of the test. Thus, van der Linden's model is extended to the following mixture case:

ln T_{ij} | s ~ N(β_i − τ_j, σ_{εs}²)

where the states (response styles) are defined by the size of the residual variance term (note the s subscript on σ_ε² now). It is worth explicating here that although the accuracy and response time models are presented separately above, all parameters are estimated simultaneously in a joint model.
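To make the role of the residual term concrete, a minimal sketch of a single observation's lnRT residual after conditioning on person and item effects is given below; the response times, item time intensity, and person speed values are invented for illustration:

```python
import math

def lnrt_residual(rt_seconds: float, beta_i: float, tau_j: float) -> float:
    """Residual log response time after removing item time intensity (beta_i)
    and person speed (tau_j), following the lognormal RT model sketched above."""
    return math.log(rt_seconds) - (beta_i - tau_j)

# Illustrative values: expected lnRT of 3.0 (about 20 s) for this person-item pair.
res_typical = lnrt_residual(rt_seconds=20.0, beta_i=3.5, tau_j=0.5)     # near zero
res_rapid = lnrt_residual(rt_seconds=1.0, beta_i=3.5, tau_j=0.5)        # far too fast
res_protracted = lnrt_residual(rt_seconds=120.0, beta_i=3.5, tau_j=0.5) # far too slow
```

Note that both the rapid and the protracted response produce large absolute residuals; it is exactly this pattern that inflates the residual variance in the erratic state, regardless of the direction of the deviation.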
Modeling transitions. Given this framework, one can test different hypotheses regarding the pattern of transitions from one response style to another. For example, constraints can be applied such that transitions only occur at certain time points, or that only a single transition occurs (e.g., a single change-point model). Alternatively, if multiple transitions are expected during the course of a test (e.g., examinees transition into a construct-driven response style once task familiarity is sufficient, but then transition back into an erratic response style once time pressure emerges), then a multiple change-point structure or a Markov process might be integrated into the framework. A Markov process makes no assumptions regarding the number of change points, so it provides a useful exploratory technique for examining transition patterns. One simplifying assumption is made, however, in order to ensure a tractable solution: an examinee's state (response style) for any given item, i, depends only on the state expressed for the previous item, i − 1, rather than the entire history of states. Put another way, a lag of only 1 is considered for the dependency structure of states expressed over time. Formally, if S_{ij} represents the state expressed by subject j for item i, then

P(S_{ij} | S_{(i−1)j}, S_{(i−2)j}, …, S_{1j}) = P(S_{ij} | S_{(i−1)j})
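The lag-1 dependency can be illustrated with a short simulation of a two-state style sequence; the initial and transition probabilities below are hypothetical, not estimates from the analysis:

```python
import random

def simulate_states(n_items: int, pi_init=(0.95, 0.05),
                    trans=((0.97, 0.03), (0.10, 0.90)), seed=1) -> list:
    """Simulate a lag-1 Markov chain of response styles.
    State 1 = construct-driven, state 2 = erratic.
    pi_init and trans are hypothetical, not estimated, values."""
    rng = random.Random(seed)
    # Initial state drawn from the initial state distribution.
    states = [1 if rng.random() < pi_init[0] else 2]
    # Each subsequent state depends only on the immediately preceding one.
    for _ in range(n_items - 1):
        row = trans[states[-1] - 1]
        states.append(1 if rng.random() < row[0] else 2)
    return states

states = simulate_states(n_items=32)
```

The key property is visible in the loop: the draw for item i consults only `states[-1]`, never the earlier history, which is precisely the lag-1 assumption stated above.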
For the empirical analysis, transitions are initially modeled according to such a Markov process. This allows an understanding of which response style examinees exhibit at the start of testing, whether transitions occur in a bidirectional or unidirectional manner, and the expected distribution of states if the test were to continue indefinitely. As will be discussed in more detail in Section 3.4, a change-point process is also considered, based on the results of the Markov process analysis.
3.2. Method
Test and examinees. The same test and examinees as described in Study 1 above were used to examine the change-point model.
Estimation. All item and transition parameters were estimated via a fully Bayesian approach, using MCMC methods with Gibbs sampling from the posterior in the JAGS 4.3 software package. Normal priors were used for all item parameters, and a gamma prior was used for the response time residual variance terms. Initial state classifications at time 1 for subject j (S_{1j}) were drawn from a categorical distribution according to a pair of initial event probabilities:

S_{1j} ~ Categorical(π_1, π_2)

and the event probabilities (initial state probabilities) were given a Dirichlet prior with concentration parameters fixed to 1:

(π_1, π_2) ~ Dirichlet(1, 1)

For state classifications at all remaining time points,

S_{ij} | S_{(i−1)j} = s ~ Categorical(π_{s1}, π_{s2})

such that transitions follow a Markov process. These event probabilities (transition probabilities) are likewise drawn from a Dirichlet prior with concentration parameters fixed to 1:

(π_{s1}, π_{s2}) ~ Dirichlet(1, 1),  s = 1, 2
For the change-point analysis, a fully Bayesian approach with Gibbs sampling from the posterior was also implemented. Each subject's vector of state classifications across items (S_j) is constrained to the following parameterization:

S_{ij} = 1 if i < δ_j;  S_{ij} = 2 if i ≥ δ_j

where δ_j is an estimated parameter indicating the time point at which a given examinee's response style transition occurs. The δ_j values were assumed to be drawn from a uniform prior distribution. In all cases, 10,000 samples were taken from the posterior after a period of 10,000 'burn-in' iterations. Inspection of convergence plots, autocorrelation plots, and the Gelman-Rubin statistic ensured an appropriate burn-in and sampling interval.
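The change-point constraint amounts to a deterministic mapping from each examinee's estimated change point (denoted delta_j here) to a full state vector; a sketch, using a hypothetical 32-item test length and change point:

```python
def change_point_states(n_items: int, delta_j: int) -> list:
    """State classifications under a single change-point parameterization:
    construct-driven (state 1) for items before the change point delta_j,
    erratic (state 2) from delta_j onward. Item positions are 1-indexed."""
    return [1 if i < delta_j else 2 for i in range(1, n_items + 1)]

# A hypothetical subject whose transition occurs at item 28 of a 32-item test:
states = change_point_states(n_items=32, delta_j=28)
```

Compared with the unconstrained Markov process, this parameterization replaces a per-item state draw with a single estimated parameter per examinee, which is what makes it the more restrictive model.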
Once all item parameters and state classifications were obtained, EAP estimates were derived for abilities, treating the item parameters and state classifications as fixed. Two ability estimates were obtained for each examinee: one for construct-driven states and one for erratic response states. Since only the ability estimates for construct-driven states were retained for further analysis, this effectively trimmed erratic responses from the data.
3.3. Alternative Trims
In order to evaluate the relative utility of emphasizing both positive and negative response time residuals in identifying threats to validity, we consider a few additional trims based on alternative criteria. First, we consider a test trim based on only negative residuals. If simple rapid-guessing processes can fully account for any threats to validity, then eliminating only highly negative residuals should produce ability estimates with improved relationships with external measures. To explore this possibility, we evaluate the correlation between ability estimates and AFQT scores across several different trims based on the magnitude of negative residuals (<−6 SD, <−5 SD, <−3.5 SD, <−3 SD, <−2.5 SD, <−2 SD, <−1.5 SD, <−1 SD, <−.5 SD, and <0 SD).
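The negative-residual trims can be sketched as a simple threshold filter over standardized residuals; the residual values below are invented for illustration:

```python
def trim_negative(residuals, threshold_sd):
    """Keep responses whose standardized lnRT residual is not more negative
    than the threshold (e.g., -2.0 retains all residuals >= -2 SD).
    Returns (item index, residual) pairs for the retained responses."""
    return [(idx, r) for idx, r in enumerate(residuals) if r >= threshold_sd]

# Hypothetical standardized residuals for one examinee:
resids = [0.1, -0.4, 2.3, -2.8, 0.0, -1.2, -6.1]
kept = trim_negative(resids, threshold_sd=-2.0)
```

Note that a filter of this kind retains the +2.3 SD response: trims based only on negative residuals leave uncharacteristically long response times in the data, which is exactly the contrast being evaluated here.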
Second, in order to rule out the possibility that disengaged responding may be captured by individual mean shifts in response time over the course of a test (rather than a shift towards erratic responding in either direction), we consider a change-point model based on individual shifts in the person speed parameter, τ:

ln T_{ij} | s ~ N(β_i − τ_{js}, σ_ε²)

which is similar to our proposed model in Equation (7), except with the states defined with respect to the person speediness parameter rather than the residual variance parameter. This change-point model is similar to the one proposed by Zhu et al. (2023).
3.4. Results
Can the transition patterns be directly modeled and hence used to inform future test administration practice? An initial analysis using a Markov process transition model indicated that the vast majority of examinees began the test with a construct-driven response style and then transitioned towards a more erratic response style (see Table 4 for the estimated initial and stationary distributions of the response style states). At the onset of testing, nearly 95% of examinees were responding to ART items in a seemingly consistent and principled fashion (log-transformed residual variance = .206). However, if testing were to continue indefinitely, this proportion would be expected to drop to 70%.
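For a two-state Markov chain, the stationary (long-run) distribution follows directly from the two transition probabilities via the balance condition π_1 p_{12} = π_2 p_{21}; the transition values below are hypothetical, chosen only so that the long-run construct-driven proportion matches the reported 70%:

```python
def stationary_2state(p12: float, p21: float) -> tuple:
    """Stationary distribution of a 2-state Markov chain, where p12 is the
    probability of moving from construct-driven (1) to erratic (2) per item,
    and p21 is the probability of the reverse transition."""
    pi1 = p21 / (p12 + p21)  # solves pi1 * p12 = (1 - pi1) * p21
    return (pi1, 1.0 - pi1)

# Hypothetical per-item transition probabilities yielding a 70% long-run
# construct-driven proportion, as reported for the Markov analysis:
pi = stationary_2state(p12=0.03, p21=0.07)
```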
Figure 3 displays the increase in the proportion of examinees engaged in erratic response styles over the course of the testing window. The figure shows somewhat of an elbow: the proportion of erratic response styles remains low and unchanged for the first half of the test, and then begins to increase precipitously after item 16. Further, 20% of individuals have transitioned to the erratic response state by the final item of the test.
Together, these results support a potential change-point process, such that virtually all examinees begin responding to items in a seemingly consistent and principled way, and then transition into more erratic response styles as the test proceeds (perhaps as fatigue, attentional lapses, and guessing behavior influence the response process). Therefore, we re-examine the pattern of response style states and transitions using a change-point model.
Figure 4 shows the distribution of the estimated change points based on changes in log-transformed residual response time variances. Note that these indicate change points in transitioning from a construct-driven response style to an erratic response style. The vast majority of change points occur late in the testing window, indicating testing fatigue may play a prominent role in affecting the response process.
Table 5 shows that after the transition, the magnitude of the log-transformed residual variance increases by a factor of almost 9, suggesting that towards the end of the test, examinees demonstrate dramatically inconsistent patterns of responding. Furthermore, the probability of a correct response decreases from the pre-change-point value of .65 to .26 post-change-point, further underscoring the notion that the transition is from an optimal to a non-optimal state. Note that Figure 5 shows the distribution of residual values after the change point. While the distribution is negatively skewed (indicating the presence of some extremely unexpected short response times, perhaps characteristic of rapid guessing behavior), there is also a substantial number of positive values, indicating that a single mean shift in response speed cannot account for the unexpected response times.
Can information from these transition patterns be used to adjust the ability estimates in an effort to improve the underlying validity of the measure? Having established the pattern and direction of these transitions, the next logical question is whether or not this information can be used to isolate the construct-driven responses and improve the validity coefficient of the test. The second column of
Table 6 displays the correlation coefficient between the corrected ability estimates (i.e., after erratic responses have been trimmed according to the change-point model) and the AFQT general score. Note that the table displays the correlation coefficient after it has been adjusted to account for the degree of unreliability in the ART test post-trim (see
Appendix A for more information). The corrected ability estimates account for almost 40% of the variance in AFQT general scores, a full 8% gain in variance accounted for over and above ability estimates derived from the non-trimmed ART test. Additionally, this gain was statistically significant at
p < .05 (see
Appendix A for a description of the bootstrapping method used to construct the associated null distribution for the significance test). Also, note that the corrected ability estimates here demonstrate an improved relationship with the AFQT scores, relative to the cluster ability estimates from Study 1 (see
Table 2).
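The reliability adjustment referenced above (Appendix A) is, in spirit, a correction for attenuation; a minimal sketch, assuming Spearman's classical disattenuation formula is the one applied, with illustrative (not actual) values:

```python
import math

def disattenuate(r_xy: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Spearman's correction for attenuation: the observed correlation divided
    by the square root of the product of the two measures' reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Illustrative values only (not the actual ART/AFQT figures): an observed
# correlation of .55 and a hypothetical post-trim reliability of .80.
r_corrected = disattenuate(r_xy=0.55, rel_x=0.80)
```

Because trimming shortens the effective test, the post-trim reliability enters the denominator; the corrected coefficient is therefore larger than the observed one.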
The gain in the predictive quality of the trimmed ART test seems to be a function of underestimated abilities in the original whole test analysis.
Figure 6 shows a scatterplot of corrected ability estimates from the trimmed data against the traditional, whole-test ability estimates. Note that although most points fall along the diagonal (indicating invariance in the ability estimates), a sizable number of points fall above the diagonal, particularly at the lower end of the original (whole-test) ability distribution. By excluding potentially erratic responses, a subset of ability estimates was boosted upward, producing stronger correlations with the AFQT score.
A critical question is whether the recommended use of response time residual variance as an index of construct-irrelevant responding provides any utility over an approach that simply focuses on guessing and rapid responding behavior. To evaluate whether protracted response times are additionally informative for identifying disengaged responding, the ART data were trimmed according to only highly negative residual response times, and the correlation between the resulting ability estimates and AFQT general scores was evaluated. These results are displayed in Table 6. Regardless of the threshold used to define an uncharacteristically short response time, none of the resulting trims produced significant improvements in the validity coefficient. Furthermore, a test trim informed by a change-point model based simply on individual-level mean shifts in response speed (similar to the model proposed by Zhu et al. (2023)) produced a smaller improvement in the validity coefficient than our trim based on the residual analysis. Only when data trims were informed by both uncharacteristically short and uncharacteristically long response times was there a significant improvement in the validity coefficient.
3.5. Discussion
The approach outlined above adopted contemporary concepts from the psychometric modeling of response times and formalized these concepts into a time series measurement model to derive corrected ability estimates with clearer interpretive value. In particular, large residual response times, after conditioning on person and item effects, were used as a potential indicator of erratic response styles.
Van der Linden and Guo (
2008) had previously suggested the use of similar residual terms as a potential indicator of aberrant responding, and we extended the notion into a formal approach that (1) identifies how erratic responding tendencies may unfold over time and (2) produces ability estimates that are less contaminated by erratic response processes.
With respect to ability estimates with clearer interpretive value, an empirical analysis with the analogical reasoning items demonstrated an improved validity coefficient once erratic responses were removed from the analysis. Note that these corrected ability estimates also demonstrated an improved validity coefficient over the cluster-specific ability estimates derived from Study 1. Furthermore, nearly all examinees began the testing process with a construct-driven response style, with a net gain in erratic responses over time, implying a change-point process. This case suggests that administering additional items may not be a useful approach to improving the psychometric qualities of the test, as response processes were already compromised by the current length of the test.
From a broader perspective, the results speak to the importance of implementing such a procedure in practical testing settings to guide future administrations and obtain construct-pure estimates of ability. They also underscore the importance of considering both uncharacteristically short and uncharacteristically long response times as evidence of erratic responding. Simply emphasizing guessing and rapid responding styles (as is often the case in explorations of response styles; see Molenaar and DeBoeck (2018), Qian et al. (2016), and Wise and DeMars (2006) for a few examples) has the potential to neglect other types of disengaged responding, producing biased estimates of ability. Our method, in contrast, additionally considers uncharacteristically long response times as evidence of construct-irrelevant influences on the response process, and the utility of this approach was established empirically.
One additional note is worth mentioning regarding the interpretation of the engaged versus erratic states. If two states are identified with sizable differences in the response time residual variance, one might immediately assume that the state with the larger residual term reflects an erratic or non-optimal response process. However, this may not necessarily be the case, and inspection of additional model parameters is required in order to develop a more valid characterization of the states. An observation with a large residual variance term simply indicates that the recorded response time was quite unexpected, given a subject's general pattern of responding on all other items on the test. For example, if a subject engages most items with a principled solution strategy, but becomes fatigued towards the end of the test and responds with rapid guessing behavior for the last few items, then these last few observed response times will produce large residuals. In this case, it is clear that the state with a large residual variance term reflects a disengaged response style. However, it is alternatively possible that an examinee may be engaged for only the first few items, and then begin rapidly guessing for the vast majority of items that follow. In this latter case, the longer response times for the first few items represent the exception, and would actually produce the larger residual values. Thus, there is the potential for a state with increased residuals to actually reflect a more engaged, principled response process. In order to disentangle the two possibilities, it is helpful to examine the parameters from the accuracy model. A disengaged response style should produce much higher error rates and increased item difficulty estimates; these parameters can therefore be evaluated to confirm the suspected nature of the two response states.
In our application, the error rates and item difficulties increased precipitously after the change point when inflated residual values were observed, implying a general transition towards disengaged responding for the final items.