**6. Results**

Because the items were scaled according to item response theory (IRT), empirical characteristic values (difficulty, selectivity, and model fit statistics) are available for each item and were considered during item selection. The item difficulties σ result from the Thurstonian threshold values estimated in the Rasch model (the response format is dichotomous). The σ-values can be interpreted like z-values, so that a value of zero corresponds to average difficulty. Values below zero indicate words that were easier for the children to read; values above zero indicate more difficult items. The selectivity values were calculated as point-biserial correlations of the item raw scores with the respective total score of the test template. These values indicate how well an item separates students with low and high levels of performance. A value close to one means that the item assesses the same aspect as the overall test. A value close to zero indicates that an item has little in common with the overall test. In this study, a value of rpbis = 0.2 and above served as the minimum criterion.
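To make the selectivity criterion concrete, the following minimal sketch computes a point-biserial correlation as the Pearson correlation between a dichotomous item score and a total score. The function name `point_biserial` and the toy data are hypothetical, and the sketch assumes the uncorrected total score (item included), which the text does not specify.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and a total score."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    ss_x = sum((x - mean_x) ** 2 for x in item_scores)
    ss_y = sum((y - mean_y) ** 2 for y in total_scores)
    if ss_x == 0 or ss_y == 0:
        return float("nan")  # undefined if everyone has the same score
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical toy data: six students, one item, and their total scores.
item = [1, 1, 0, 1, 0, 0]
total = [9, 8, 4, 7, 5, 3]

r_pbis = point_biserial(item, total)
meets_criterion = r_pbis >= 0.2  # minimum criterion used in this study
print(f"rpbis = {r_pbis:.2f}, meets criterion: {meets_criterion}")
```

In this toy example, the students who solved the item also have the higher total scores, so the correlation is close to one.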

Of the items analyzed, 206 showed too little selectivity (rpbis < 0.2) or an underfit or overfit in the infit statistics (fit < 0.70 or fit > 1.30). These items were eliminated from the item pool for further analysis. The reduced item pool was then scaled again using a one-dimensional Rasch model. The selectivity values ranged from a minimum of rpbis = 0.20 to a maximum of rpbis = 0.66, with an average of rpbis = 0.41. The item fit statistics (InfitMnSq) varied between min = 0.70 and max = 1.30, with a mean InfitMnSq of 0.92. This indicates that there were no model violations and that all items meet the requirements of a one-dimensional Rasch model. All items thus form a one-dimensional scale, i.e., they measure the same construct.
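The elimination rule described above can be sketched as a simple filter. The thresholds are taken from the text; the class `ItemStats`, the function `passes_selection`, and the four example items are hypothetical and serve only as an illustration. The sketch does not perform the subsequent Rasch re-scaling.

```python
from dataclasses import dataclass

# Thresholds as reported in the text.
RPBIS_MIN = 0.20
INFIT_LOW, INFIT_HIGH = 0.70, 1.30


@dataclass
class ItemStats:
    name: str
    rpbis: float   # point-biserial selectivity
    infit: float   # infit mean square (InfitMnSq)


def passes_selection(item: ItemStats) -> bool:
    """Keep an item only if selectivity and infit are within bounds."""
    return item.rpbis >= RPBIS_MIN and INFIT_LOW <= item.infit <= INFIT_HIGH


# Hypothetical example items, not values from the study.
pool = [
    ItemStats("item_a", rpbis=0.41, infit=0.92),  # retained
    ItemStats("item_b", rpbis=0.15, infit=1.00),  # too little selectivity
    ItemStats("item_c", rpbis=0.35, infit=1.42),  # infit above 1.30
    ItemStats("item_d", rpbis=0.20, infit=0.70),  # exactly on both bounds
]

retained = [item.name for item in pool if passes_selection(item)]
print(retained)
```

Items that lie exactly on a boundary are retained here, which is consistent with the reported minimum of rpbis = 0.20 and the InfitMnSq range of 0.70 to 1.30 in the reduced pool.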

The mean item difficulty was σ = 0.00, and the values ranged between min = −2.99 and max = 3.68. The item pool thus covered a range from very easy to very difficult items. An analysis of the item difficulties separated by grade levels showed an increase in the mean values with constant standard deviation (see Table 3), i.e., the items became easier with the increasing grade level. The results show that there is a wide range of item difficulties in every grade level.

**Table 3.** Mean, minimum, and maximum item difficulties, separated according to grade levels.


Due to the use of the Rasch model, it was also possible to map the item and person parameters on the same scale. The person parameters (WLE, weighted maximum likelihood estimates) [62] were determined by a pairwise item comparison [63,64]. This method is particularly suitable for data sets with missing values [64,65]. The WLE can be used to assess whether the difficulty of the items is appropriate for the sample. The person–item map (see Figure 1) shows the person parameters as a histogram, together with the item difficulties. It becomes clear that the measurement range of the items essentially corresponds to the distribution of the person parameters. However, there is a lack of items for students with particularly high skills.
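As an illustration of what a common scale for persons and items means: in the dichotomous Rasch model, the probability of a correct response depends only on the difference between the person parameter and the item difficulty. The following minimal sketch assumes the standard logistic form of the model. The function name `rasch_probability` and the person parameter of zero are hypothetical; only the two item difficulties (−2.99 and 3.68) are taken from the text.

```python
import math


def rasch_probability(theta: float, sigma: float) -> float:
    """P(correct) for person ability theta and item difficulty sigma."""
    return 1.0 / (1.0 + math.exp(-(theta - sigma)))


# Hypothetical student with average ability (theta = 0).
theta = 0.0

p_matched = rasch_probability(theta, 0.0)    # ability equals difficulty
p_easiest = rasch_probability(theta, -2.99)  # easiest item in the pool
p_hardest = rasch_probability(theta, 3.68)   # hardest item in the pool

print(f"matched item: {p_matched:.2f}")
print(f"easiest item: {p_easiest:.2f}")
print(f"hardest item: {p_hardest:.2f}")
```

When ability and difficulty coincide, the probability is exactly 0.5, which is why a person–item map can show directly which items are well targeted to which students.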

**Figure 1.** A person–item map (distribution of the person abilities as a histogram on the left side; measuring range of the items ordered by item difficulty).

In order to analyze the differences in item difficulties between boys and girls, the item difficulties estimated separately by gender were plotted against each other on the *x*- and *y*-axes (see Figure 2). If the parameters are constant across the sexes, the points lie along the bisecting line (the diagonal). Here, there are differences in the item difficulties of individual items. A few items (N = 21) showed very large deviations from the bisecting line. An analysis of variance showed a significant influence of sex (F(1, 4115) = 9.753, *p* < 0.05), but the effect was small (η<sup>2</sup> = 0.002 or d = 0.10).
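The reported effect sizes can be checked against the F statistic. The sketch below uses two standard conversions: η² = F·df₁ / (F·df₁ + df₂), which holds for a one-way design, and d = 2·√(η² / (1 − η²)), which assumes two groups of roughly equal size. The text does not state how the authors derived d, so this is an assumption, and the function names are hypothetical.

```python
import math


def eta_squared_from_f(f_value: float, df_effect: int, df_error: int) -> float:
    """Eta squared recovered from an F statistic in a one-way design."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohens_d_from_eta_squared(eta_sq: float) -> float:
    """Convert eta squared to Cohen's d (two groups, similar sizes assumed)."""
    return 2.0 * math.sqrt(eta_sq / (1.0 - eta_sq))


# Values reported in the text: F(1, 4115) = 9.753.
eta_sq = eta_squared_from_f(9.753, 1, 4115)
d = cohens_d_from_eta_squared(eta_sq)

print(f"eta squared = {eta_sq:.3f}")
print(f"Cohen's d   = {d:.2f}")
```

Under these assumptions, the rounded results agree with the reported values of η<sup>2</sup> = 0.002 and d = 0.10.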

For the eight word lists from each grade level, we determined Cronbach's α. The overall values varied between α = 0.74 and α = 0.97. On average, the values were high (grade 1: α = 0.95; grade 2: α = 0.89; grade 3: α = 0.89; and grade 4: α = 0.85).
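For readers unfamiliar with the coefficient, the following minimal sketch computes Cronbach's α from a person-by-item score matrix using the usual formula α = k/(k − 1) · (1 − Σ item variances / variance of the total scores). The function name `cronbach_alpha` and the toy responses are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; rows are persons, columns are items."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 persons")
    # Sample variance of each item (column).
    item_vars = [variance([row[j] for row in responses])
                 for j in range(n_items)]
    # Sample variance of the persons' total scores.
    total_var = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: five students, four dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

In this toy example, the item variances sum to 1.0 and the total-score variance is 2.5, which gives α = 4/3 · (1 − 0.4) = 0.80.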

According to Cohen [66], the correlation with the ELFE-II test is high: r = 0.64 (N = 178).

**Figure 2.** Graphical model test for the assessment of measurement invariance across gender. Note: a few items cannot be displayed here, as they show very large deviations from the bisecting line.
