*5.1. Design*

The items of the generated item pool were distributed and piloted within a multimatrix design [56]. Within each grade level, the items were divided into eight different word lists. Due to the multimatrix design, each list shared a proportion of identical words (so-called anchor items) with other lists within the same grade level and between grade levels. The tests were administered by the teachers in the middle of the school year without a time limit, as is usually the case with CBM, so that characteristic values could be calculated for each item.
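The linking logic of such a multimatrix design can be illustrated with a small sketch: every list within a grade contains the same anchor items, so any two lists overlap and the data set becomes cross-linked. All item names and list sizes below are invented for illustration; only the structure (eight lists per grade sharing anchors) mirrors the design described above.

```python
# Hypothetical multimatrix booklet design: eight word lists for one
# grade level, each containing the same small set of anchor items
# plus a disjoint portion of unique (non-anchor) items.

# Invented item names; the real study used graded word lists.
anchors = {"house", "ball", "sun"}                 # shared anchor items
unique_pool = [f"word{i}" for i in range(1, 41)]   # unique items

# Build eight lists: anchors appear in every list, unique items
# are split disjointly across the lists.
lists = []
for k in range(8):
    unique_part = unique_pool[k * 5:(k + 1) * 5]
    lists.append(sorted(anchors) + unique_part)

# Any two lists overlap exactly in the anchor items, which is what
# links all lists into one common scale for a joint IRT calibration.
for a in range(8):
    for b in range(a + 1, 8):
        assert set(lists[a]) & set(lists[b]) == anchors
```

Because every pair of lists shares the anchors, item parameters estimated from different lists can be placed on one common Rasch scale.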

The multimatrix design of the test templates made it possible to generate a cross-linked data set. Analyses based on item response theory (IRT) allowed the determination of psychometric parameters for all items based on the total sample. Since the present data matrix is binary-coded ("correctly solved" vs. "incorrectly solved"), a dichotomous Rasch model was estimated. The Rasch analyses were performed with the statistics program R [57] using the pairwise package [58]. The model fit of the items was judged by their estimated infit values. Since the outfit statistics are strongly influenced by outlier values, whereas the infit values are more sensitive in the range of medium ability values [59], the infit statistics were primarily examined in the present study for deviations from the expected value of 1 (0.70 ≤ infit ≤ 1.30) [60]. For further analysis of the quality of the items, common item statistics (difficulty and selectivity) were calculated.
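The infit and outfit statistics referred to here follow standard formulas: both are mean squared residuals between observed and model-expected responses, with infit weighting each residual by its model variance (which is why it is less sensitive to outlying persons). The sketch below illustrates these formulas for a single item; all abilities, difficulties, and responses are invented, and the actual analyses were run with the pairwise package in R.

```python
import math

def p_correct(theta, beta):
    """Dichotomous Rasch model: probability of a correct response
    for a person with ability theta on an item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def infit_outfit(responses, thetas, beta):
    """Infit (information-weighted) and outfit (unweighted) mean
    squares for one item, given person abilities and item difficulty."""
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        p = p_correct(theta, beta)
        w = p * (1.0 - p)                 # response variance under the model
        sq_resid.append((x - p) ** 2)     # squared residual
        variances.append(w)
        z_sq.append((x - p) ** 2 / w)     # squared standardized residual
    infit = sum(sq_resid) / sum(variances)   # variance-weighted mean square
    outfit = sum(z_sq) / len(z_sq)           # plain mean square
    return infit, outfit

# Invented toy data: four persons answering one item of difficulty 0.
thetas = [-1.0, 0.0, 1.0, 2.0]
responses = [0, 1, 1, 1]
infit, outfit = infit_outfit(responses, thetas, 0.0)

# The study's acceptance window would be checked as:
item_fits = 0.70 <= infit <= 1.30
```

Because extreme persons have small model variance, their residuals carry little weight in the infit numerator and denominator, which is why infit is the more informative statistic in the medium ability range.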

In order to check for differences in item difficulties between boys and girls (test fairness regarding gender), a graphical model test was carried out to assess the measurement invariance of the items.
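A graphical model test estimates the item difficulties separately in the two subgroups and plots the two sets of estimates against each other; measurement-invariant items fall close to the identity line. The sketch below only illustrates this comparison logic: the proportions are invented, the logit-based difficulty proxy is a simplification (the actual test uses subgroup Rasch estimates), and the 0.5-logit cutoff is an arbitrary illustrative threshold.

```python
import math

# Invented per-group item statistics: proportion correct among boys
# and girls for five items (not the study's actual data).
p_boys  = [0.85, 0.70, 0.55, 0.40, 0.25]
p_girls = [0.84, 0.72, 0.53, 0.42, 0.24]

def crude_difficulty(p):
    # Logit of the failure rate as a rough difficulty proxy; a real
    # graphical model test would use Rasch estimates per subgroup.
    return math.log((1.0 - p) / p)

flagged = []
for i, (pb, pg) in enumerate(zip(p_boys, p_girls)):
    d_boys, d_girls = crude_difficulty(pb), crude_difficulty(pg)
    # Items whose difficulties differ markedly between groups would
    # fall off the identity line in the graphical model test.
    if abs(d_boys - d_girls) > 0.5:   # illustrative cutoff
        flagged.append(i)
```

With the invented proportions above, no item deviates strongly, i.e., all points would lie near the identity line.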

To assess reliability, Cronbach's α was reported. Construct validity was tested by correlating the items of the item pool with an external criterion (the ELFE-II test).
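Cronbach's α can be computed directly from the binary data matrix using the standard formula α = k/(k−1) · (1 − Σσ²ᵢ/σ²ₜ), where σ²ᵢ are the item variances and σ²ₜ is the variance of the sum scores. The sketch below applies this formula to an invented persons × items matrix, not the study's data.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items matrix of 0/1 scores."""
    k = len(scores[0])               # number of items
    n = len(scores)                  # number of persons

    def variance(values):            # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Invented toy data: four persons, three items (a perfect Guttman
# pattern, for illustration only).
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(data)   # 0.75 for this toy matrix
```

In practice, α would be computed per word list (or on the linked data set), and the item–criterion correlations with the ELFE-II scores would be computed separately as validity evidence.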
