*2.4. Criteria for Risk of Bias Assessment*

Because of the heterogeneity of the statistical methods employed by the selected original studies, the high number of tests included, and the limited number of studies per test, a meta-analysis was not conducted. Two researchers (N.M.J. and F.M.A.) independently assessed the risk of bias of each eligible original study and systematic review. Discrepancies were resolved in a consensus meeting. Inter-rater agreement on the risk of bias was calculated as the percentage agreement (96% (Kappa = 0.962) before consensus, and 100% agreement after the consensus meeting).
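
As an illustration of the agreement statistics mentioned above, the following minimal sketch computes percentage agreement and Cohen's kappa for two raters. The function names and the example ratings are hypothetical and are not the data of this review; the kappa shown is the standard unweighted form, which is an assumption, since the variant used is not specified here.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which both raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used a single identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings (0-2) given by two raters to ten items.
ratings_a = [2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
ratings_b = [2, 2, 1, 0, 2, 1, 0, 0, 2, 2]
agreement = percent_agreement(ratings_a, ratings_b)
kappa = cohens_kappa(ratings_a, ratings_b)
```

Because kappa corrects for the agreement expected by chance, it is generally lower than the raw percentage agreement computed on the same ratings.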

The risk of bias criteria for original studies were determined according to the quality assessment list employed by Castro-Piñero et al. [27], which includes the following three criteria: (1) an adequate number of participants; (2) an adequate description of the study population; and (3) adequate statistical analysis (see Supplementary Table S1). Each criterion was rated from 0 to 2, with 2 being the best score. For each study, a total score was calculated by summing the ratings of the three criteria (a total score between 0 and 6). Studies were categorized as very low quality (0–2), low quality (3–4), or high quality (5–6).
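
The scoring rule for original studies can be sketched as follows. This is a minimal illustration that assumes the total is the sum of the three criterion ratings; the function name `study_quality` and its parameter names are hypothetical.

```python
def study_quality(participants, population, statistics):
    """Return (total score, quality category) for one original study.

    Each argument is the 0-2 rating of one criterion: number of participants,
    description of the study population, and statistical analysis.
    """
    ratings = (participants, population, statistics)
    for rating in ratings:
        if rating not in (0, 1, 2):
            raise ValueError("each criterion must be rated 0, 1, or 2")
    total = sum(ratings)  # assumption: total score = sum of the three ratings (0-6)
    if total <= 2:
        category = "very low quality"
    elif total <= 4:
        category = "low quality"
    else:
        category = "high quality"
    return total, category


# Hypothetical study rated 2, 2, and 1 on the three criteria.
example = study_quality(2, 2, 1)
```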

The methodological quality of each systematic review was appraised using the 'Assessment of Multiple Systematic Reviews' (AMSTAR) rating scale [34]. AMSTAR contains 11 items that assess the methodological aspects of reviews; each item was scored as 1 if the answer was "Yes", and 0 if the answer was "No", "Cannot Answer" or "Not Applicable" (see Supplementary Table S2). The total score ranged from 0 to 11. The item on conflict of interest requires that the systematic review and all primary studies be assessed. We modified this item to assess only the review itself, as proposed by Biddle et al. [35], given that PRISMA does not require a conflict-of-interest assessment for each primary study. The final quality ratings were computed by tertiles: the first tertile ranged from 0 to 3 points (low quality), the second tertile from 4 to 7 points (medium quality), and the third tertile from 8 to 11 points (high quality).
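
The AMSTAR scoring and tertile classification described above can be sketched as follows. The function name `amstar_quality` and the example answers are hypothetical; the sketch takes the 11 item answers as strings and does not model the content of the individual items.

```python
VALID_ANSWERS = {"Yes", "No", "Cannot Answer", "Not Applicable"}


def amstar_quality(answers):
    """Return (total score, quality category) for one systematic review.

    `answers` holds the 11 item answers; only "Yes" scores 1 point.
    """
    if len(answers) != 11:
        raise ValueError("AMSTAR has 11 items")
    for answer in answers:
        if answer not in VALID_ANSWERS:
            raise ValueError(f"unexpected answer: {answer!r}")
    total = sum(1 for answer in answers if answer == "Yes")
    # Tertile cut-offs as stated in the text: 0-3, 4-7, 8-11.
    if total <= 3:
        category = "low quality"
    elif total <= 7:
        category = "medium quality"
    else:
        category = "high quality"
    return total, category


# Hypothetical review with 8 "Yes" answers out of 11 items.
example_answers = ["Yes"] * 8 + ["No", "Cannot Answer", "Not Applicable"]
example_rating = amstar_quality(example_answers)
```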
