### *4.2. Limitations of the Available Literature*

The evidence base supporting the ability of child development assessment tools to predict long-term outcomes remains limited to remarkably few studies, and there is a need for more high-quality studies that are adequately powered and have follow-up sufficient to reveal associations with adult-life outcomes. Figures 2 and 3 illustrate that there are high-quality studies distributed across the three outcomes of interest and all three assessment tool domains. However, the included studies were heterogeneous with respect to study design, assessment tools, outcome measures, and statistical models. This heterogeneity precludes direct comparison, even between studies that used the same tool (e.g., WISC-R), to determine whether these associations are repeatable and whether the effect sizes are consistent across populations. Our quality assessment suggests that attrition remains a challenge; because continuing to engage and track study participants over decades is difficult in any longitudinal study, this finding is not altogether surprising. However, it is notable that two studies did not clearly describe attrition, which threatens the evaluation of both sample size and effect measures [20,24].

All included studies in this review were observational cohort studies, which are susceptible to several limitations. Cohort studies are prone to differential loss to follow-up of participants with medical or financial challenges, which can bias findings. While many studies accounted for confounding with adjusted effect estimates, additional sources of residual confounding likely remained, including family and community contextual factors, the impact of developmental interventions, and children's physical health. Longitudinal studies that document and control for these contextual factors are needed.

Additionally, the use of multiple or composite assessment tools was framed as a "best fit" approach by some authors. However, testing multiple predictors can undermine the statistical validity of significant results, because the probability of obtaining at least one significant result by chance alone increases with the number of hypothesis tests. A priori assertions grounded in theoretical rationale for the utility of composite or multiple-domain assessment tools can help to mitigate this issue and provide better evidence as to whether composite assessments improve prediction of outcomes; alternatively, assessing predictors separately would help to isolate the effect of individual tools.
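The inflation of chance findings described above can be made concrete with a short worked calculation. This is only an illustrative sketch: the number of tests ($m = 10$), the significance threshold ($\alpha = 0.05$), and the assumption that the tests are independent are hypothetical choices made for the example and are not drawn from any of the included studies.

```latex
% Family-wise error rate (FWER): probability of at least one false-positive
% result among m independent tests, each conducted at significance level \alpha.
\[
  \mathrm{FWER} = 1 - (1 - \alpha)^{m}
\]
% Hypothetical example: m = 10 predictors tested at \alpha = 0.05.
\[
  \mathrm{FWER} = 1 - (0.95)^{10} \approx 1 - 0.599 = 0.401
\]
% A Bonferroni correction tests each hypothesis at \alpha / m instead,
% which keeps the family-wise error rate at or below \alpha.
\[
  \alpha_{\text{per test}} = \frac{\alpha}{m} = \frac{0.05}{10} = 0.005
\]
```

Under these assumptions, roughly a 40% chance exists of at least one spuriously significant predictor when no true association is present. Correlated predictors, which are likely when several developmental tools are administered to the same children, would change the exact figure, but the direction of the problem is the same.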

Finally, the generalizability of findings from this review is limited by the fact that all of the studies took place in high-income countries among relatively homogeneous racial and ethnic groups. Few of the tools assessed in this review have been validated for use in African, Asian, and South American populations. The absence of studies from low- and middle-income countries may reflect the small number of tools validated for use in these populations, and it limits generalizability of findings to populations from low-income countries and to populations with high rates of malnutrition or limited access to education.

### *4.3. Limitations of Present Study*

There are several limitations to this review. First, the study was designed with the specific purpose of identifying developmental assessment tools that predict long-term outcomes related to the academic and economic potential of individuals and communities, and it did not include research assessing other long-term outcomes with high relevance for health and quality of life. Despite efforts to be comprehensive in the inclusion of tools by completing a broad search of the PsycTESTS database and reviewing almost 1400 tools, some studies were excluded at full-text review because they did not include an assessment tool from the original search list (e.g., a study that examined educational attainment among three large cohorts from Finland, the UK, and the Philippines and found significant positive associations between cognitive development scores at early ages and attainment in adulthood [30]). Despite a thorough search of three robust databases, there is likely additional relevant research that was not captured. In particular, grey literature, such as non-peer-reviewed organizational reports, and economics literature (e.g., the EconLit database) were not considered and may be a source of additional information regarding the socioeconomic outcome of interest. Additionally, only English and French literature was reviewed due to the linguistic capacity of the research team; literature in other languages may therefore have been missed, and it may be particularly relevant to the issue mentioned above regarding generalizability of findings to the low- and middle-income country context.

Next, this review was completed in 2018; to address the concern that subsequently published literature is not reflected in this review, in January 2021 we conducted post hoc abstract screening of articles published in 2018–2021 in all three databases (PubMed, Educational Resources Information Center (ERIC), and PsycINFO), using the same search terms. Of 158 results across the three databases, five articles passed abstract screening and were full-text reviewed, and only two additional studies met inclusion criteria [31,32]. First, Samuels et al., 2019 found that the Behavior Rating Inventory of Executive Function (BRIEF) and BRIEF Self-Report (BRIEF-SR) were significantly associated with the upcoming cumulative grade point average in a diverse population of 259 New York middle and high school students, independent of gender, free/reduced lunch, and special education status [31]. However, it is unclear whether this instrument predicts longer-term academic performance because the time interval between tool assessment and outcome assessment was notably short. Second, Kosik et al., 2018 found in a U.S.-based birth cohort that the WISC at age seven was significantly associated with educational attainment, employment, and wealth in adulthood [32]. Despite the identification of these two additional studies, of which likely only Kosik et al., 2018 would be considered high-quality, we are confident that the findings reported in our main review remain relevant and continue to address a gap in the literature. These studies' findings do not conflict with the findings of the five high-quality studies in the main review, and in fact further support our review's overall conclusions.

Finally, all of the high-quality studies reviewed reported positive associations, suggesting publication bias and potential underreporting of null findings. Coupled with the small sample sizes and shorter follow-up of the low- and neutral-quality studies reviewed, additional research is needed to support the associations identified between tools and outcomes studied herein.
