#### 2.7.4. Reliability Analyses

Inter- and intra-rater reliability were calculated for each item using three measures: the raw agreement proportion (the proportion of pairs of ratings that agree exactly), Cohen's kappa statistic, and Gwet's AC1 [30], which provides more stable estimates of agreement [31] when trait (pass/fail) prevalence is high.
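To make the contrast between the three measures concrete, the following is a minimal sketch for two sets of ratings of a single item. It is an illustration under stated assumptions, not the code used in the study: the function name `agreement_stats`, the 0/1 coding of fail/pass, and the example ratings are all hypothetical. Chance agreement for kappa uses the product of the two raters' marginal proportions, and for AC1 it uses the two-rater form based on the average marginal proportion per category.

```python
import math
from collections import Counter


def agreement_stats(ratings_a, ratings_b, categories=(0, 1)):
    """Raw agreement, Cohen's kappa and Gwet's AC1 for two sets of ratings.

    `categories` lists every possible rating (assumed here: 0 = fail, 1 = pass).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    q = len(categories)

    # Observed agreement: proportion of pairs that agree exactly.
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    count_a, count_b = Counter(ratings_a), Counter(ratings_b)

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else math.nan

    # Gwet's AC1: chance agreement from the average marginal per category.
    pi = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)

    return {"raw_agreement": p_obs, "kappa": kappa, "ac1": ac1}


# Hypothetical item that almost every child passes (high prevalence).
rater_1 = [1] * 18 + [1, 0]
rater_2 = [1] * 18 + [0, 1]
print(agreement_stats(rater_1, rater_2))
```

In this made-up example the raters agree on 18 of 20 children (raw agreement 0.90), yet kappa is slightly negative (about -0.05) because almost all ratings are passes, whereas AC1 is about 0.89. This is the high-prevalence behaviour that motivates reporting AC1 alongside kappa.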

#### 2.7.5. Creation of a Development-for-Age *z*-Score (DAZ)

Once the final items had been selected, a scoring system for the final tool was set up. As the data contained a mixture of binary and three-category ordinal responses, a polytomous IRT model was required. A generalized partial credit model (GPCM) [32] was therefore fitted to the data using the R [35] package mirt [27], with an empirical histogram prior [33,34] to account for non-normality in the ability (development) distribution. Taking the latent scores for each child from the IRT model, the LMS (lambda, mu, sigma) method of centile estimation [36,37] was used to remove the effect of age and create age-contingent *z*-scores, which we termed "development-for-age *z*-scores", or DAZ for short.
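The two building blocks of this scoring step can be sketched as follows. This is an illustration only, not the fitted model: `gpcm_probs` uses the classical slope/threshold form of the GPCM, which may differ from the parameterization reported by the fitting software, and `lms_z` applies the standard LMS transformation given age-specific values of lambda, mu and sigma. Both function names and all numeric inputs are hypothetical, and `lms_z` assumes the score is on a positive scale, which the text above does not specify.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities for one item under the classical GPCM.

    theta      : latent ability (development) of the child
    a          : item slope (discrimination)
    thresholds : step parameters b_1..b_m; an item with m steps has m + 1
                 categories, so a binary item has a single threshold
    """
    # Cumulative sums of a * (theta - b_v); category 0 has the empty sum 0.
    cum = [0.0]
    for b in thresholds:
        cum.append(cum[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cum)
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]


def lms_z(y, lam, mu, sigma):
    """LMS z-score of a positive measurement y at a given age.

    lam, mu, sigma are the Box-Cox power, median and coefficient of
    variation at the child's age (assumed already estimated elsewhere).
    """
    if y <= 0 or mu <= 0 or sigma <= 0:
        raise ValueError("y, mu and sigma must be positive")
    if abs(lam) < 1e-12:
        # Limiting case of the Box-Cox transform as lambda tends to zero.
        return math.log(y / mu) / sigma
    return ((y / mu) ** lam - 1.0) / (lam * sigma)


# A binary item reduces to a two-parameter logistic curve.
print(gpcm_probs(theta=0.5, a=1.2, thresholds=[0.0]))
# A three-category ordinal item has two step parameters.
print(gpcm_probs(theta=0.5, a=1.2, thresholds=[-0.5, 1.0]))
# A score equal to the age-specific median maps to z = 0.
print(lms_z(y=10.0, lam=0.8, mu=10.0, sigma=0.15))
```

Because the binary items are the two-category special case of the same model, a single GPCM can accommodate both response types.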

#### 2.7.6. Validation

The DAZ scores were plotted against the demographic and contextual variables to explore known-groups construct validity of the tool by comparing countries, maternal education categories and gender. Differences in mean scores were tested using a *t*-test or analysis of variance (ANOVA). Concurrent validity with respect to FCI scores, SES scores, WAZ, HAZ and WHZ was examined using Pearson or Spearman correlation coefficients. A more detailed analysis is described elsewhere [28].
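The difference between the two correlation coefficients used for concurrent validity can be illustrated with a short sketch. The helper names and the toy data below are hypothetical and are not taken from the study. Spearman's coefficient is computed here as Pearson's coefficient applied to average ranks, which handles ties.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores with a monotone but non-linear relationship.
score_a = [1, 2, 3, 4, 5]
score_b = [1, 4, 9, 16, 25]
print(pearson(score_a, score_b), spearman(score_a, score_b))
```

For these toy values the Spearman coefficient is 1 because the relationship is perfectly monotone, while the Pearson coefficient is about 0.98 because it is not linear. This is the usual reason for choosing between the two coefficients.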
