Systematic Review

Assessing the Measurement Properties of the Test of Gross Motor Development-3 Using the COSMIN Methodology—A Systematic Review

1
Zhejiang Sports Science Institute, Hangzhou 310004, China
2
Institute of Human Movement and Sports Engineering, College of Physical Education and Health Sciences, Zhejiang Normal University, Jinhua 321004, China
3
Institute of Culture Creativity, Weifang Vocational College, Weifang 261000, China
4
China Volleyball Sport College, Beijing Sport University, Beijing 100084, China
*
Authors to whom correspondence should be addressed.
Behav. Sci. 2025, 15(1), 62; https://doi.org/10.3390/bs15010062
Submission received: 27 November 2024 / Revised: 6 January 2025 / Accepted: 7 January 2025 / Published: 13 January 2025

Abstract

This study aimed to systematically review the measurement properties of the Test of Gross Motor Development-3 (TGMD-3) using the COSMIN methodology. A search of four databases (PubMed, EMBASE, Web of Science, CINAHL) identified 23 relevant studies. The methodological quality of the studies was assessed using the COSMIN risk of bias checklist; the measurement properties of the TGMD-3 were evaluated by the COSMIN quality criteria; and the quality of the evidence was rated using a modified GRADE approach. The findings indicated that the test–retest, inter-rater, and intra-rater reliability, as well as measurement invariance and two aspects of content validity (relevance and comprehensibility), were sufficient, supported by high-quality evidence. The bifactor structure was found to be a more appropriate model for the TGMD-3, with structural validity and internal consistency rated as sufficient, though based on moderate-quality evidence. However, hypothesis testing for construct validity produced inconsistent results, also supported by moderate-quality evidence. Responsiveness was rated as inconsistent, based on low-quality evidence. Overall, the TGMD-3 is graded as “B”, meaning it has the potential to be recommended, but further research is needed to fully establish its measurement properties. Future studies should focus on verifying the comprehensiveness of the TGMD-3 items to optimise its application.

1. Introduction

Shortly after birth, infants display instinctive behaviours such as sucking and grasping (Widström et al., 2019). As children’s body coordination improves, they gradually learn to lift their heads and roll over, among other motor behaviours. More generally, motor behaviour development brings about new opportunities for acquiring knowledge about the world, and burgeoning motor skills can instigate cascades of developmental changes in perceptual, cognitive, and social domains (Adolph & Franchak, 2017). Motor skills development is commonly categorised into two main areas: gross motor skills and fine motor skills. Gross motor skills encompass movements that require the use of large muscle groups, like sitting unsupported, crawling, walking, and running. On the other hand, fine motor skills pertain to the use of smaller muscles for activities such as grasping objects, manipulating them, or engaging in tasks like drawing (Gonzalez et al., 2019). Fundamental movement skills (FMSs) are defined as the “basic learning movement patterns in preschool children” (Hu et al., 2023). These skills encompass essential abilities that enable children to perform structured movements (Wick et al., 2017; Zheng et al., 2022), primarily involving gross motor skills such as running, jumping, sliding, striking, catching, and kicking. Milestones provide a framework for observing and monitoring a child over time (Gerber et al., 2010). According to a survey by the WHO, generally speaking, infants achieve the milestone of sitting independently between 3.8 and 9.2 months, standing between 6.9 and 16.9 months, and walking between 8.2 and 17.6 months (WHO Multicentre Growth Reference Study Group & de Onis, 2006). By the age of 2, a child possesses the ability to kick a ball, jump with both feet leaving the ground, and throw a large ball overhand (Gerber et al., 2010). 
The milestones for subsequent ages (after the age of 3) indicate advancements in the duration, frequency, or the successful execution distance of each task (Gerber et al., 2010). Proficiency in FMSs is crucial for children’s overall development, with numerous studies demonstrating positive correlations between FMSs and various health-related outcomes, including body composition (Okely et al., 2004), academic achievement (de Waal, 2019), cognitive function (O’Hagan et al., 2022), and social skills (Ecevit & Sahin, 2021). Inadequate FMS development can hinder children’s ability to participate in physical activities, preventing them from reaching key developmental milestones (Stodden et al., 2008). Over the past few decades, the importance of assessing motor competence has gained significant attention, prompting the development of various reliable and valid instruments to accurately screen for FMS (Cools et al., 2009; Eddy et al., 2020; Klingberg et al., 2019). These tools play a central role in promoting physical literacy, informing targeted interventions, and addressing public health concerns related to childhood obesity and sedentary lifestyles (Lubans et al., 2010).
All versions of the Test of Gross Motor Development (TGMD), including TGMD-1, TGMD-2, and TGMD-3, are process-oriented assessments designed to evaluate fundamental movement skills (Valentini et al., 2021). The original TGMD was developed by Ulrich in 1985 in his doctoral dissertation to assess FMSs in physical education settings (Ulrich, 2017). The normative sample consisted of 909 children from throughout the US (Wiart & Darrah, 2001). Several studies have examined its psychometric properties. Ulrich (1983) reported that the TGMD possesses excellent content validity, while Ulrich and Ulrich (1984) found the test to be highly sensitive to changes in preschool children’s fundamental motor skill performance. Evaggelinou et al. (2002) assessed the construct validity of the TGMD and found support for its bifactor structure. According to national standards for evaluating educational and psychological tests, norm-referenced tests should be re-standardised approximately every 15 years to account for population changes over time (Ulrich, 2017). This led to the creation of the TGMD-2 in 2000, which updated the normative sample, and subsequently the TGMD-3, which incorporated feedback from users to refine the instrument (Ulrich, 2017).
The measurement properties of an instrument, as outlined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN), are essential for determining its reliability, validity, and responsiveness. According to COSMIN, reliability includes test–retest, inter-rater, and intra-rater reliability, while validity encompasses content, construct (including structural, cross-cultural, and hypothesis testing), and criterion validity (Mokkink et al., 2024). TGMD-3 is now widely utilised in both research and physical education contexts to assess children’s motor competence (Chen et al., 2024; Estevan et al., 2017; Magistro et al., 2020; Maïano et al., 2022; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Pitchford & Webster, 2020). Since the introduction of the TGMD-3, numerous studies have examined its measurement properties (Estevan et al., 2017; Magistro et al., 2020; Maïano et al., 2022; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Pitchford & Webster, 2020; Valentini et al., 2017; Webster et al., 2015). While many of these studies confirm the reliability and validity of the TGMD-3, inconsistent findings remain. For instance, some research supports a bifactor structure (the scale having two dimensions) for the TGMD-3, on the basis of which the scale is divided into two subscales, the locomotor subscale and the ball skills subscale (A. Brian et al., 2018; A. S. Brian et al., 2021; Duncan et al., 2022; Magistro et al., 2020; Salami et al., 2021; Wagner et al., 2017). Other studies argue that a one-factor structure is also appropriate (Estevan et al., 2017; Garn & Webster, 2021; Marinšek et al., 2023; Mohammadi et al., 2019; Webster & Ulrich, 2017), in which case the TGMD-3 cannot be divided into subscales.
Furthermore, while some studies have reported sufficient internal consistency for the TGMD-3’s locomotor subscale (Mohammadi et al., 2019), others have found it inadequate (Valentini et al., 2017). These inconsistent reports highlight the need for a comprehensive, systematic review to provide clarity on the measurement properties of the TGMD-3.
The COSMIN methodology is widely recognised as a robust framework for evaluating the measurement properties of health-related instruments, offering a systematic approach to assessing their reliability, validity, and responsiveness (Mokkink et al., 2024). Its application has proven effective in a variety of domains, including motor competence assessments, as it ensures that instruments are both scientifically rigorous and applicable across different populations. For instance, Hulteen et al. utilised the COSMIN framework in their systematic review of motor assessment tools in children and adolescents, revealing key insights into the psychometric properties of these instruments (Hulteen et al., 2020). The COSMIN methodology has also been applied to single instruments, as demonstrated by reviews of the Body Image Scale (Melissant et al., 2018) and the Peabody Developmental Motor Scales-2 (Zhu et al., 2024), which confirmed its versatility in evaluating a wide range of scales. Given the inconsistencies in previous studies regarding the TGMD-3’s measurement properties, applying the COSMIN methodology to systematically review the TGMD-3 offers an opportunity to standardise the evidence, ensuring a more reliable and comprehensive understanding of its psychometric robustness. This review aims to consolidate existing research on the TGMD-3 and synthesise the quality of evidence through the COSMIN methodology, thereby offering valuable guidance for both researchers and practitioners in the fields of physical education and child development. By critically examining the reliability, validity, and overall measurement properties of the TGMD-3, this review seeks to determine the extent to which the instrument can be trusted to accurately assess fundamental movement skills of children.

2. Materials and Methods

2.1. Literature Search Strategy

A systematic search was conducted using English as the search language across four major electronic databases—PubMed, EMBASE, Web of Science, and CINAHL—targeting studies that evaluated the measurement properties of the Test of Gross Motor Development, Third Edition (TGMD-3) up to September 2024. The search strategy was designed to comprehensively capture all relevant studies using an array of search terms associated with the TGMD-3. These included variations in the instrument’s name: the test of gross motor development OR TGMD OR “the test of gross motor development-3” OR “TGMD-3” OR “the test of gross motor development-third edition” OR “the test of gross motor development-3rd” OR “the Test of Gross Motor Development, Third Edition” OR “Test of Gross Motor Development-Third Edition” OR “Test of Gross Motor Development, 3rd Edition”.
To identify studies examining the psychometric properties of TGMD-3, search terms were combined with keywords representing specific measurement properties, such as reliability OR “internal consistency” OR “measurement error” OR validity OR “content validity” OR “face validity” OR “construct validity” OR “structural validity” OR “hypotheses testing” OR “cross cultural validity” OR “criterion validity” OR responsiveness OR “measurement properties” OR “psychometric properties” OR “measurement property” OR “psychometric property” OR “divergent validity” OR “concurrent validity” OR “predictive validity”.
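As a rough illustration of how the two term lists can be assembled into one boolean query, the following Python sketch joins each list with OR and then combines the two blocks. The helper names (`or_block`, `build_query`) are hypothetical, only a subset of the terms listed above is used, and joining the two blocks with AND is an assumption, since the text states only that the terms "were combined".

```python
def or_block(terms):
    """Join terms with OR, quoting phrases that contain a space or hyphen."""
    quoted = [f'"{t}"' if (" " in t or "-" in t) else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(instrument_terms, property_terms):
    """Combine the instrument block and the property block.

    Assumption: the blocks are joined with AND; the review does not
    state the operator explicitly.
    """
    return or_block(instrument_terms) + " AND " + or_block(property_terms)


# Subset of the terms listed in the search strategy (not the full lists).
instrument_terms = [
    "TGMD",
    "TGMD-3",
    "Test of Gross Motor Development-Third Edition",
]
property_terms = ["reliability", "validity", "internal consistency", "responsiveness"]

query = build_query(instrument_terms, property_terms)
print(query)
```

Individual databases differ in field tags and phrase syntax, so a string built this way would still need adapting to each platform.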
The search process adhered to the latest Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021), ensuring a rigorous, transparent, and reproducible approach. Full-text articles of the studies meeting the inclusion criteria were accessed either via the publisher’s platform or, when necessary, through institutional resources or external collaborations. The study protocol was registered in PROSPERO (https://www.crd.york.ac.uk/prospero/, accessed on 4 November 2024; registration number CRD42024600851).

2.2. Inclusion and Exclusion Criteria

The studies included in this review had to meet the following criteria: (1) studies focusing on typically developing children aged 3 to 11 years or children with disabilities; (2) studies evaluating the measurement properties of the Test of Gross Motor Development-3 (TGMD-3); and (3) studies assessing at least one measurement property of the TGMD-3.
Studies were excluded if they (1) utilised the TGMD-3 solely to investigate children’s FMSs without evaluating the scale’s measurement properties; (2) utilised the TGMD-3 solely for evaluating the efficacy of the intervention; (3) were reviews, systematic reviews, or meta-analyses; or (4) provided only an abstract, lacked full-text access, or were not peer-reviewed.

2.3. Literature Selection and Data Extraction

Two reviewers, YZ and JW, independently handled the process of selecting the literature and extracting data. Discrepancies between the reviewers were resolved through discussion and, if necessary, consultation with a third reviewer (YQ). For any unresolved disagreements, further input was sought from additional review authors (WY and QC).
All identified references were managed using EndNote software (Version 20.2.1), where duplicate records were automatically excluded. The selection process involved two stages. First, the titles and abstracts were screened to remove irrelevant studies. In the second stage, the full texts of the remaining articles were thoroughly reviewed based on the predefined inclusion and exclusion criteria.
From the eligible studies, the following information was systematically gathered: (1) study attributes such as the primary author’s name, publication year, study cohort, geographical area, sample size, and demographics in terms of age and gender; (2) TGMD-3 measurement characteristics assessed, encompassing internal consistency, content and structural validity, cross-cultural validity, measurement invariance, reliability, measurement error, criterion validity, and responsiveness; (3) detailed data pertaining to each evaluated measurement characteristic.

2.4. Assessment of Risk of Bias and Evidence Quality in Included Studies

To assess the methodological quality of the included studies, the COSMIN risk of bias checklist (Mokkink et al., 2024) was employed. This checklist comprises ten areas, namely, “PROM development, content validity, internal consistency, structural validity, measurement invariance/cross-cultural validity, reliability, criterion validity, hypothesis testing for construct validity, measurement error, and responsiveness”. Based on the specific measurement properties discussed in each study, the relevant sections were chosen. The methodological quality of each item was rated as “very good”, “adequate”, “doubtful”, or “inadequate”, adhering to standardised scoring criteria. The overall methodological quality of each study was decided using the “worst score principle”, where the lowest score within a section determined the study’s overall rating. Two reviewers, YZ and JW, independently conducted the bias risk assessment for all articles. Disagreements between the reviewers were resolved through discussion and, if required, by consulting a third reviewer, YQ.
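The “worst score principle” described above can be sketched in a few lines of Python. This is only an illustration: the function name `worst_score` and the example item ratings are hypothetical, while the four rating labels are those named in the text.

```python
# Rating labels from the COSMIN risk of bias checklist, ordered worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score(item_ratings):
    """Return the lowest rating among the items of one checklist section."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # min() with the list position as key picks the worst label.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one section of one study.
section_items = ["very good", "adequate", "very good", "doubtful"]
print(worst_score(section_items))  # doubtful
```

A single “doubtful” item is enough to cap the section at “doubtful”, regardless of how many items were rated “very good”.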
To synthesise the quality of evidence, a modified version of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) method (Mokkink et al., 2024) was employed. This adaptation is tailored to the COSMIN framework, refining the original GRADE system. Evidence levels were categorised as “high”, “moderate”, “low”, or “very low”. All included studies initially received a “high” level of evidence, with potential downgrades applied based on specific study characteristics. Unlike the conventional GRADE approach, this modified version omits the “publication bias” consideration. The quality of evidence may be reduced due to factors such as risk of bias, inconsistency, indirectness, and imprecision.
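The downgrading logic can be illustrated with a minimal Python sketch. The function name `grade_evidence` and the dictionary layout are hypothetical, and the number of levels to downgrade for each factor is supplied by the reviewer; the sketch encodes only the starting point (“high”), the four levels, and the four downgrade factors named above.

```python
# Evidence levels from best to worst, as listed in the text.
LEVELS = ["high", "moderate", "low", "very low"]
# Downgrade factors named in the text (publication bias is not considered).
FACTORS = ("risk_of_bias", "inconsistency", "indirectness", "imprecision")


def grade_evidence(downgrades):
    """Start at 'high' and move down one level per downgrade step.

    downgrades maps a factor name to the number of levels to downgrade;
    how many levels each factor warrants is the reviewer's judgement.
    """
    unknown = set(downgrades) - set(FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    if any(steps < 0 for steps in downgrades.values()):
        raise ValueError("downgrade steps must be non-negative")
    total = sum(downgrades.values())
    # The scale bottoms out at 'very low'.
    return LEVELS[min(total, len(LEVELS) - 1)]


print(grade_evidence({}))                    # high
print(grade_evidence({"inconsistency": 1}))  # moderate
print(grade_evidence({"risk_of_bias": 2}))   # low
```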

2.5. Overall Rating of the Measurement Properties

The overall rating for each measurement property of the TGMD-3 was determined following the COSMIN user manual for assessing the content validity of patient-reported outcome measures (PROMs) (Terwee et al., 2018) and the COSMIN user manual for systematic reviews of PROMs (COSMIN manual) (Mokkink et al., 2024).
The evaluation encompassed various measurement properties such as content validity, internal consistency, structural validity, criterion validity, cross-cultural validity/measurement invariance, reliability, measurement error, hypothesis testing for construct validity, and responsiveness, as detailed in Supplementary Materials Table S1. Each of the reported items was assessed and categorised as “sufficient (+)”, “insufficient (−)”, or “indeterminate (?)” based on the criteria outlined in Supplementary Materials Table S2. The overall rating for each measurement property was then determined as “sufficient (+)”, “insufficient (−)”, “inconsistent (±)”, or “indeterminate (?)”, reflecting the comprehensiveness and rigour of the evaluation process.
In cases where results were inconsistent, further analysis was undertaken to identify potential reasons for these discrepancies. For hypothesis testing of construct validity, the research team pre-defined hypotheses. Specifically, construct convergent or concurrent validity was deemed sufficient when the correlation coefficient between TGMD-3 and a comparator instrument measuring a similar construct was ≥0.50. Construct validity was rated as “sufficient (+)” if 75% or more of the results aligned with the hypotheses, “insufficient (−)” if 75% or more did not, or “indeterminate (?)” if no hypotheses were established.
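The decision rule for hypothesis testing can be sketched as follows. The function names and the example correlations are hypothetical. The 0.50 cut-off and the 75% thresholds come from the text; treating every remaining case as “inconsistent (±)” is an assumption based on the rating categories listed in the previous paragraph.

```python
def rate_correlation(r, cutoff=0.50):
    """Rate one convergent/concurrent result against the pre-defined hypothesis."""
    return "+" if r >= cutoff else "-"


def overall_rating(results, threshold=0.75):
    """Pool individual '+'/'-' results into an overall rating."""
    rated = [r for r in results if r in ("+", "-")]
    if not rated:
        return "?"  # no hypotheses, hence nothing to rate
    n_sufficient = rated.count("+")
    n_insufficient = len(rated) - n_sufficient
    if n_sufficient >= threshold * len(rated):
        return "+"
    if n_insufficient >= threshold * len(rated):
        return "-"
    return "±"  # assumption: anything in between counts as inconsistent


# Hypothetical correlations with comparator instruments (illustrative only).
correlations = [0.62, 0.55, 0.71, 0.41]
ratings = [rate_correlation(r) for r in correlations]
print(ratings, overall_rating(ratings))
```

With three of four results at or above 0.50, exactly 75% align with the hypothesis, so the pooled rating is “+”; an even split would be rated “±”.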

3. Results

3.1. Literature Search Results

A total of 1054 articles were retrieved from the database search, including 198 from CINAHL, 57 from PubMed, 737 from Web of Science, and 62 from EMBASE. The search was conducted up to 26 September 2024, with no restriction on the earliest date of publication.
Following the import of all identified articles into EndNote, 313 duplicates were removed. The titles and abstracts of the remaining 741 articles were then screened, resulting in the exclusion of 707 articles due to irrelevance. This initial screening left 34 articles for further consideration. Twelve articles were excluded due to the unavailability of their full texts (being conference abstracts only), and subsequently, twenty-two articles underwent eligibility assessment. Upon further evaluation, five articles were excluded for the following reasons: two studies used the TGMD-3 to assess other scales (Aadland et al., 2022; Copetti et al., 2022), one study did not evaluate the measurement properties of the TGMD-3 (A. Brian et al., 2021), and two studies focused on assessing other versions of the TGMD-3 (Duncan et al., 2022; Valentini et al., 2021). Subsequently, we added six articles by screening the references in the included studies. Ultimately, 23 articles met the inclusion criteria and were included in this systematic review. The figure below (Figure 1) illustrates the comprehensive selection process, outlining the number of articles at each stage of the process.
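The counts reported above can be checked with simple arithmetic. The short Python sketch below uses only the numbers stated in this section; the variable names are illustrative.

```python
# Records retrieved per database, as reported in the text.
retrieved = {"CINAHL": 198, "PubMed": 57, "Web of Science": 737, "EMBASE": 62}

total = sum(retrieved.values())             # records identified
after_dedup = total - 313                   # duplicates removed in EndNote
after_screening = after_dedup - 707         # excluded on title/abstract
full_text_assessed = after_screening - 12   # conference abstracts without full text
eligible = full_text_assessed - 5           # excluded after full-text assessment
included = eligible + 6                     # added from reference lists

print(total, after_dedup, after_screening, full_text_assessed, eligible, included)
# 1054 741 34 22 17 23
```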

3.2. Characteristics of the Included Studies

Breaking the included studies down by country, seven studies were conducted in the United States (A. Brian et al., 2018; A. S. Brian et al., 2021; Garn & Webster, 2021; Maeng et al., 2017; Pitchford & Webster, 2020; Staples et al., 2020; Webster & Ulrich, 2017); three in Spain (Carballo-Fazanes et al., 2021; Estevan et al., 2017); two in Brazil (Valentini et al., 2022, 2017); two in Italy (Magistro et al., 2020; Magistro et al., 2018); and two in Iran (Mohammadi et al., 2019; Salami et al., 2021). Additional studies originated from Australia (Allen et al., 2017), Ireland (Duncan et al., 2022), Canada (Maïano et al., 2022), Peru (Mamani-Ramos et al., 2023), Slovenia (Marinšek et al., 2023), Bosnia and Herzegovina (Mehmedinović et al., 2021), Indonesia (Rizkyanto et al., 2024), and Germany (Wagner et al., 2017).

3.3. Synthesis of Evidence for the Measurement Properties of TGMD-3

The synthesis of evidence for the measurement properties of the TGMD-3, along with the corresponding quality of evidence for each property, is summarised in Table 2. These findings offer a comprehensive evaluation of the psychometric properties of the instrument, allowing for a more informed understanding of its reliability and validity. Detailed data on the quality of evidence for each measurement property are presented in the Supplementary Materials (Table S3), providing further insight into the robustness of the results.

3.3.1. Content Validity

Out of the 23 included studies, four specifically assessed the content validity of the TGMD-3 (Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017). These studies systematically evaluated the content validity by consulting experts in the field to determine the relevance and comprehensibility of the TGMD-3. The overall rating for the relevance and comprehensibility of the scale was deemed sufficient, with a moderate quality of evidence. However, since none of the studies addressed the comprehensiveness of items of the TGMD-3, an overall rating for this aspect could not be determined, nor was it possible to provide a qualitative synthesis of content validity (Table 2).

3.3.2. Structural Validity

Thirteen of the twenty-three studies evaluated the structural validity of the TGMD-3 using classical test theory (CTT) (A. Brian et al., 2018; A. S. Brian et al., 2021; Estevan et al., 2017; Garn & Webster, 2021; Magistro et al., 2020; Maïano et al., 2022; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mehmedinović et al., 2021; Mohammadi et al., 2019; Salami et al., 2021; Valentini et al., 2017; Webster & Ulrich, 2017). Thirteen studies explored the bifactor structure of the TGMD-3, with five studies also assessing a one-factor structure. The bifactor model was rated as sufficient, with 79% of studies supporting this structure, and the quality of evidence was considered moderate. In contrast, the one-factor model exhibited inconsistent results, leading to a moderate quality of evidence (Table 2).

3.3.3. Internal Consistency

As summarised in Table 2, thirteen studies assessed the internal consistency of the TGMD-3 (Allen et al., 2017; A. Brian et al., 2018; Estevan et al., 2017; Garn & Webster, 2021; Magistro et al., 2018; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mehmedinović et al., 2021; Mohammadi et al., 2019; Rizkyanto et al., 2024; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017). Cronbach’s alpha values for the Locomotor subscale ranged from 0.63 to 0.92, with 92% of results exceeding the threshold of 0.7, indicating sufficient internal consistency. Similarly, the Ball Skills subscale demonstrated alpha values ranging from 0.60 to 0.95, with 92% of results surpassing 0.7, also rated as sufficient. The Total TGMD-3 scale exhibited alpha values between 0.74 and 0.96, confirming sufficient internal consistency. The overall rating for internal consistency was deemed sufficient, although the quality of evidence was moderate due to some inconsistencies across the studies.
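For readers less familiar with the statistic, Cronbach’s alpha for k items is commonly computed as α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The Python sketch below implements this standard formula. The function name and the tiny data sets are hypothetical and are not taken from any included study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one per child; each row holds that child's
    score on every item (all rows must have the same length).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) and of the row totals.
    item_var_sum = sum(variance(column) for column in zip(*scores))
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical data: three items that rise and fall together.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Hypothetical data: two items that are unrelated to each other.
unrelated = [[1, 2], [2, 1], [1, 1], [2, 2]]

THRESHOLD = 0.7  # cut-off used in this review
for name, data in (("consistent", consistent), ("unrelated", unrelated)):
    alpha = cronbach_alpha(data)
    print(name, round(alpha, 2), "sufficient" if alpha > THRESHOLD else "insufficient")
```

The first data set yields an alpha of 1 and the second an alpha of 0, which brackets the 0.60 to 0.96 range reported for the TGMD-3.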

3.3.4. Reliability

Thirteen studies examined the reliability of the TGMD-3 (Allen et al., 2017; A. Brian et al., 2018; Carballo-Fazanes et al., 2021; Estevan et al., 2017; Maeng et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017). Following the COSMIN guidelines (Mokkink et al., 2024), these studies assessed test–retest reliability, inter-rater reliability, and intra-rater reliability.
Seven studies explored the test–retest reliability of the TGMD-3 (Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017). Intraclass correlation coefficients (ICCs) (Allen et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Wagner et al., 2017; Webster & Ulrich, 2017) and Pearson correlation coefficients (r) (Mohammadi et al., 2019; Valentini et al., 2017) were the primary metrics used to assess this property. The ICCs ranged from 0.81 to 0.996 for the Locomotor subscale, 0.84 to 0.997 for the Ball Skills subscale, and 0.92 to 0.996 for the Total TGMD-3 score. Pearson correlation coefficients ranged from 0.92 to 0.93 for the Locomotor subscale, 0.81 to 0.94 for the Ball Skills subscale, and 0.90 to 0.95 for the Total TGMD-3. Overall, the test–retest reliability was rated as sufficient, with high-quality evidence (Table 2).
Nine studies evaluated the inter-rater reliability of the TGMD-3 (A. Brian et al., 2018; Estevan et al., 2017; Maeng et al., 2017; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017). The ICCs ranged from 0.82 to 0.97 for the Locomotor subscale, 0.778 to 0.98 for the Ball Skills subscale, and 0.842 to 0.98 for the total TGMD-3. These results were deemed sufficient, and the quality of evidence was judged to be high, as all studies were identified as having very good methodological quality (Table 2).
Intra-rater reliability was assessed in nine studies (Allen et al., 2017; Carballo-Fazanes et al., 2021; Estevan et al., 2017; Maeng et al., 2017; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017). The ICCs ranged from 0.865 to 0.988 for the Locomotor subscale, 0.85 to 0.99 for the Ball Skills subscale, and 0.90 to 0.99 for the Total TGMD-3. As all ICC values exceeded 0.7, the overall rating was deemed sufficient, with high-quality evidence (Table 2). Collectively, the findings indicate that the TGMD-3 exhibits sufficient reliability across test–retest, inter-rater, and intra-rater measures.

3.3.5. Measurement Invariance

Six studies assessed the measurement invariance of the TGMD-3 across different groups (gender, age, and disability) using multi-group confirmatory factor analysis (MCFA) and differential item functioning (DIF) (Magistro et al., 2020; Magistro et al., 2018; Marinšek et al., 2023; Salami et al., 2021; Valentini et al., 2022; Wagner et al., 2017) (Table 2).
Four studies found no significant differences across gender groups using MCFA, and one study using DIF also found no significant differences, suggesting sufficient measurement invariance across genders. The quality of evidence for these studies was judged to be high.
For age group comparisons, one study found no significant differences using MCFA, and another using DIF also reported no significant differences, indicating sufficient measurement invariance across age groups. Both studies were rated as high-quality evidence. Additionally, one study examined measurement invariance between children with and without disabilities, finding no significant differences via MCFA. The quality of evidence for this study was also rated as high. Taken together, the findings suggest that the TGMD-3 has sufficient measurement invariance across gender, age, and disability groups, supported by high-quality evidence.

3.3.6. Hypothesis Testing for Construct Validity

Two studies evaluated the construct validity of the TGMD-3 (A. Brian et al., 2018; Wagner et al., 2017). These studies assessed construct validity by correlating the TGMD-3 with instruments from a similar domain, namely the Test of Gross Motor Development-2 (TGMD-2), the Movement Assessment Battery for Children-2 (M-ABC2), and the German Youth Games ball-throwing distance performance (GYGBT).
One study assessed the convergent validity of the TGMD-3 with the TGMD-2 (A. Brian et al., 2018), reporting Pearson correlation coefficients of 0.98 for the Locomotor subtest, 0.98 for the Ball Skills subtest, and 0.99 for the Total scale, indicating sufficient convergent validity with high-quality evidence.
Another study assessed the divergent validity of the TGMD-3 with the M-ABC2 and the concurrent validity of the TGMD-3 with the GYGBT (Wagner et al., 2017). The overall rating for these tests was insufficient, with correlation coefficients below 0.5. Despite this, the quality of evidence was still rated as high. Overall, the construct validity of the TGMD-3 was deemed inconsistent, with moderate-quality evidence due to divergent findings (Table 2).

3.3.7. Responsiveness

Two studies examined the responsiveness of the TGMD-3 (Pitchford & Webster, 2020; Staples et al., 2020). One study compared the FMS performance of typically developing children with that of children with Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorder (ASD), Intellectual Disability (ID), and Language or Articulation Disorders (LAD) (Pitchford & Webster, 2020). The TGMD-3 demonstrated sufficient responsiveness in comparisons between typically developing children and those with ADHD, ASD, and ID (p < 0.05). However, responsiveness was insufficient in comparisons between typically developing children and those with LAD. The quality of evidence for these results was rated as low due to significant bias. The second study assessed responsiveness by comparing FMSs before and after an intervention (Staples et al., 2020), showing sufficient results. However, the quality of evidence remained low due to bias concerns (Table 2).

4. Discussion

To the best of our knowledge, this is the first systematic review to employ the COSMIN methodology to assess the measurement properties of the Test of Gross Motor Development-3 (TGMD-3). This review synthesised data from 23 studies, evaluating the TGMD-3 across various psychometric properties.
In accordance with the COSMIN guidelines, content validity holds the utmost importance among all measurement properties for any instrument or scale (Mokkink et al., 2024). Burns (1993) highlights three primary sources for establishing content validity: the existing literature, judgement from representatives of the target population, and expert evaluation. Although expert judgement is commonly used (Almanasreh et al., 2019), the COSMIN methodology emphasises the integration of both patient (or population) and expert viewpoints to assess three crucial aspects of content validity: relevance (the pertinence of all items to the construct within a specific context), comprehensiveness (the inclusion of all essential aspects of the construct), and comprehensibility (the understanding of items as intended by the population) (Terwee et al., 2018).
In this review, four studies focused on evaluating the content validity of the TGMD-3 (Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017). These studies primarily evaluated the relevance and comprehensibility of the TGMD-3 through expert consultations. However, no study incorporated feedback from the target population (children) due to the nature of the TGMD-3, where children are required only to follow evaluator instructions to complete the movement tasks, rendering direct comprehension of the items less critical (Mohammadi et al., 2019; Valentini et al., 2017). Therefore, although the content validity assessment did not involve participants, the relevance and comprehensibility of the TGMD-3 were still deemed sufficient.
The aspect of comprehensiveness, which ensures that no key facets of the construct are missing, was notably absent from the included studies (Terwee et al., 2018). Since no studies have specifically evaluated this component, a thorough assessment of the overall content validity of the TGMD-3 cannot be provided. It is recommended that future research endeavour to fill this gap by investigating whether the TGMD-3 comprehensively encompasses all crucial aspects of gross motor development across varied populations and settings.
Structural validity pertains to how well the scores of a scale capture the dimensional characteristics of the underlying construct it is intended to measure (Mokkink et al., 2024). In the case of TGMD-3, there is debate regarding its dimensionality. While some studies support a bifactor structure, others advocate for a one-factor model. Our systematic review, utilising COSMIN’s approach to synthesising evidence, found sufficient moderate-quality evidence favouring a bifactor structure, with inconsistent moderate-quality evidence supporting a one-factor model. Thus, the bifactor structure appears to be a more appropriate representation of the TGMD-3’s dimensionality. This finding aligns with previous research, such as that by Garn and Webster (2021), who also concluded that the bifactor model provided a better fit compared to both one-factor and two-factor models (Garn & Webster, 2021).
Cronbach’s alpha values exceeding 0.7 are typically regarded as indicative of adequate internal consistency (Terwee et al., 2007). Of the studies included in this review, 92% demonstrated sufficient internal consistency for the TGMD-3. According to the COSMIN manual, if at least 75% of the results meet the threshold for sufficiency, the measurement property can be rated as sufficient. However, the quality of evidence must be downgraded due to inconsistent findings, reducing it from high-quality evidence to moderate-quality evidence (Mokkink et al., 2024). Therefore, based on our findings, moderate-quality evidence supports the conclusion that the internal consistency of the TGMD-3 is sufficient. This outcome, as per COSMIN guidelines, indicates that the inter-relatedness among the items on the TGMD-3 scale is adequately high (Mokkink et al., 2024).
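The two thresholds described above can be illustrated with a minimal Python sketch. It is not part of the review's analysis: the function names, the data layout, and the example ratings are hypothetical, and the mirrored rule for an insufficient rating (at least 75% of results insufficient) is an assumption added for completeness rather than something stated in the text.

```python
from statistics import variance
from typing import List


def cronbach_alpha(item_scores: List[List[float]]) -> float:
    """Cronbach's alpha from per-item score lists (one inner list per item,
    participants in the same order in every list)."""
    k = len(item_scores)
    # Total score of each participant across all items.
    totals = [sum(person) for person in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def pooled_rating(results: List[str], threshold: float = 0.75) -> str:
    """Summarise per-study ratings ('+' or '-') into '+', '-', or '±'."""
    if not results:
        return "?"  # no evidence available
    if results.count("+") / len(results) >= threshold:
        return "+"
    # Assumption: the same threshold applies symmetrically to '-' results.
    if results.count("-") / len(results) >= threshold:
        return "-"
    return "±"


# Hypothetical illustration: two perfectly correlated items.
alpha = cronbach_alpha([[1, 2, 3], [1, 2, 3]])
print(alpha > 0.7)

# Hypothetical illustration: 12 sufficient results out of 13 (about 92%).
print(pooled_rating(["+"] * 12 + ["-"]))
```

A pooled rating of "+" produced this way says nothing about the quality of the evidence, which is graded separately and, as noted above, is downgraded when the underlying results are inconsistent.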
According to the COSMIN guidelines, reliability refers to the degree to which scores remain consistent across various conditions, such as different time intervals (test–retest reliability), different raters at the same time (inter-rater reliability), and the same rater at different times (intra-rater reliability) (Mokkink et al., 2024). Our results reinforce that there is high-quality evidence supporting the sufficiency of reliability (test–retest, intra-rater, and inter-rater reliability) of the TGMD-3. These findings are consistent with another systematic review by Rey et al. (2020), which also found the TGMD-3 to be sufficiently reliable (Rey et al., 2020).
As outlined in the COSMIN manual, cross-cultural validity refers to “the extent to which the performance of items on translated or culturally adapted measurement instruments accurately reflects the performance of the items in the original version of those instruments” (Mokkink et al., 2024). Our analysis revealed that no studies have evaluated the cross-cultural validity of the TGMD-3 using the COSMIN-recommended methodology. Consequently, it remains uncertain whether translated versions of the TGMD-3 accurately represent the original instrument. Therefore, we recommend establishing the cross-cultural validity of these translated versions before utilising them to assess children’s fundamental motor skills (FMSs).
Measurement invariance refers to the consistency of a test’s scores across different groups (e.g., gender, age) (Van De Schoot et al., 2015). Our assessment demonstrated that high-quality evidence supports the sufficiency of TGMD-3’s measurement invariance across three groups: different genders, ages, and children with and without disabilities. These results suggest that TGMD-3 is appropriate for use across diverse populations, ensuring it is a robust tool in various demographic settings.
Responsiveness refers to a scale’s ability to detect meaningful changes in the construct being measured over time (Mokkink et al., 2024). Two studies included in our review (Pitchford & Webster, 2020; Staples et al., 2020) evaluated the responsiveness of the TGMD-3 by comparing the fundamental movement skill (FMS) scores of different groups of children (e.g., children with and without ADHD, ASD, or ID, and children before and after a physical education intervention). While the TGMD-3 showed sufficient responsiveness in most comparisons, it was insufficient in the comparison between children with and without LAD. The overall quality of evidence for these studies was judged to be low due to serious risk of bias. Notably, this bias arose because the studies employed a paired t-test, which COSMIN does not recommend because it measures statistical significance rather than the validity of change; the method is therefore considered inappropriate for assessing responsiveness (Mokkink et al., 2024).
Based on COSMIN guidelines, a measurement instrument or scale can be categorised as “A” if it has sufficient content validity (regardless of evidence quality level) and sufficient internal consistency (at least low-quality evidence). Conversely, instruments with high-quality evidence demonstrating an insufficient measurement property are classified as “C”. Instruments that do not meet the criteria for either “A” or “C” are classified as “B” (Prinsen et al., 2018). Our review found that while the internal consistency of the TGMD-3 was sufficient with moderate-quality evidence, the lack of evaluation regarding the comprehensiveness of the TGMD-3’s content validity prevents a full assessment of content validity. Therefore, the TGMD-3 can be categorised as “B”. Instruments in this category hold promise for recommendation but require further research to validate their quality comprehensively. We recommend that future studies specifically address the comprehensiveness of items of the TGMD-3 to strengthen its overall utility and validity.
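The categorisation logic can be sketched as a small decision function. This is an illustrative reading only: the function name, argument names, and rating symbols are hypothetical, the content validity and internal consistency criteria for “A” are treated as jointly required (the reading under which the “B” conclusion above follows), and checking the “C” condition first is an assumption about precedence.

```python
# Evidence levels ordered from weakest to strongest.
EVIDENCE_LEVELS = ["very low", "low", "moderate", "high"]


def cosmin_category(
    content_validity: str,          # '+', '-', '±', or '?' (not fully assessed)
    internal_consistency: str,      # '+', '-', '±', or '?'
    ic_evidence: str,               # one of EVIDENCE_LEVELS
    any_high_quality_insufficient: bool = False,
) -> str:
    """Return 'A', 'B', or 'C' under the assumptions stated in the lead-in."""
    # "C": high-quality evidence for an insufficient measurement property.
    if any_high_quality_insufficient:
        return "C"
    # "A": sufficient content validity at any evidence level, plus sufficient
    # internal consistency backed by at least low-quality evidence.
    ic_ok = (
        internal_consistency == "+"
        and EVIDENCE_LEVELS.index(ic_evidence) >= EVIDENCE_LEVELS.index("low")
    )
    if content_validity == "+" and ic_ok:
        return "A"
    # Everything else falls into "B".
    return "B"


# TGMD-3 as summarised in this review: content validity cannot be fully rated
# because comprehensiveness was not evaluated; internal consistency is
# sufficient with moderate-quality evidence.
print(cosmin_category("?", "+", "moderate"))  # prints "B"
```

Under this sketch, changing the first argument to "+" (for example, once comprehensiveness has been evaluated and found sufficient) would return "A", which mirrors the upgrade path discussed in the recommendations.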
This study has several limitations. Firstly, our screening was limited to studies published in English and Chinese, potentially leading to the omission of important research on the cross-cultural validity of TGMD-3. Cross-cultural validity, defined as “the degree to which the performance of the items on a translated scale reflects the performance of the items in the original version”, is especially relevant when translating the TGMD-3 into different languages. The outcomes of cross-cultural validation studies are often published in the local language of the study; thus, excluding non-English- and non-Chinese-language studies may have restricted the comprehensiveness of our review. To thoroughly assess cross-cultural validity, future reviews should consider including studies published in a broader range of languages.
Secondly, our study focused solely on evaluating the measurement properties of the original version of the TGMD-3, excluding other versions such as the TGMD-3 short form (Duncan et al., 2022) and the Motor Skills Sequential Pictures (MSSP) version of the TGMD-3 (Copetti et al., 2022). As different versions of the TGMD-3 emerge, each with potentially unique measurement properties, they warrant separate examination to determine their reliability and validity in different contexts and populations.
Drawing from our findings, we propose the following recommendations for future research. First, in accordance with the COSMIN manual, a measurement instrument or scale can be categorised as “A” if it has sufficient content validity (regardless of evidence quality level) and sufficient internal consistency (at least low-quality evidence) (Prinsen et al., 2018). Our review suggests that, with additional research verifying the comprehensiveness of the TGMD-3’s items, the tool could potentially be upgraded to an “A” classification. Instruments in this category provide robust, trustworthy results. Therefore, we suggest that future studies should specifically examine the comprehensiveness of the TGMD-3’s content validity to provide more conclusive evidence on this aspect. Secondly, further investigations into the TGMD-3’s responsiveness and measurement error should be conducted using the COSMIN methodology. These properties are essential for assessing the ability of the TGMD-3 to detect meaningful changes over time and for ensuring the accuracy of measurements. By expanding the evidence base on these measurement properties, future research can solidify the TGMD-3’s utility across diverse settings and populations, ultimately enhancing its applicability and reliability in both clinical and educational environments.

5. Conclusions

The assessment of the TGMD-3 using the COSMIN methodology demonstrated that its test–retest reliability, inter-rater reliability, intra-rater reliability, and measurement invariance were sufficient, supported by high-quality evidence. The content validity, in terms of relevance and comprehensibility, was also deemed sufficient with high-quality evidence. Furthermore, a bifactor structure emerged as a more suitable model for the TGMD-3 than a one-factor structure, with structural validity and internal consistency supported by moderate-quality evidence. However, hypothesis testing for construct validity produced inconsistent results based on moderate-quality evidence, and responsiveness was rated as inconsistent based on low-quality evidence.
Overall, TGMD-3 was classified as a “B” grade instrument. Instruments in this category show potential for recommendation but require further research to comprehensively evaluate their measurement properties. In particular, future studies should aim to assess the comprehensiveness of items of the TGMD-3 to strengthen its overall validity and reliability across diverse populations and contexts.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bs15010062/s1: COSMIN and TGMD-3 measurement property evaluation framework. Table S1. COSMIN Definitions of Measurement Properties; Table S2. COSMIN Criteria for Assessing Measurement Properties; Table S3. Levels of Evidence for the Measurement Properties of the TGMD-3.

Author Contributions

All authors played a crucial role in devising and planning this study. Y.Z. and J.W. were responsible for conducting the literature search, gathering data, and performing evaluations. Y.Z., Y.D. and Y.Q. worked together to produce the first version of the manuscript. All authors were involved in verifying the data and contributed to refining the draft. W.Y., Q.C. and M.K. took charge of revising and polishing the manuscript until its final form. Each author has reviewed and endorsed the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by grants from the Zhejiang Normal University Horizontal Research Fund (Grant numbers: KYH06Y21383 and KYH34324171). The authors also wish to acknowledge the support of Jinhua Maimiao Education Technology Co., Ltd.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that this study received funding from Jinhua Maimiao Education Technology Co., Ltd. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

  1. Aadland, K. N., Nilsen, A. K. O., Lervåg, A. O., & Aadland, E. (2022). Structural validity of a test battery for assessment of fundamental movement skills in Norwegian 3–6-year-old children. Journal of Sports Sciences, 40(15), 1688–1699.
  2. Adolph, K. E., & Franchak, J. M. (2017). The development of motor behavior. Wiley Interdisciplinary Reviews: Cognitive Science, 8(1–2), e1430.
  3. Allen, K. A., Bredero, B., Van Damme, T., Ulrich, D. A., & Simons, J. (2017). Test of gross motor development-3 (TGMD-3) with the use of visual supports for children with autism spectrum disorder: Validity and reliability. Journal of Autism and Developmental Disorders, 47(3), 813–833.
  4. Almanasreh, E., Moles, R., & Chen, T. F. (2019). Evaluation of methods used for estimating content validity. Research in Social and Administrative Pharmacy, 15(2), 214–221.
  5. Brian, A. S., Starrett, A., Pennell, A., Beach, P. H., Miedema, S. T., Stribing, A., & Lieberman, L. J. (2021). The brief form of the test of gross motor development-3 for individuals with visual impairments. International Journal of Environmental Research and Public Health, 18(15), 7962.
  6. Brian, A., Miedema, S. T., Johnson, J. L., & Chica, I. (2021). A comparison of the fundamental motor skills of preschool-aged children with and without visual impairments. Adapted Physical Activity Quarterly, 38(3), 349–358.
  7. Brian, A., Taunton, S., Lieberman, L. J., Haibach-Beach, P., Foley, J., & Santarossa, S. (2018). Psychometric properties of the test of gross motor development-3 for children with visual impairments. Adapted Physical Activity Quarterly, 35(2), 145–158.
  8. Burns, N. (1993). The practice of nursing research: Conduct, critique & utilization. WB Saunders Co.
  9. Carballo-Fazanes, A., Rey, E., Valentini, N. C., Rodríguez-Fernández, J. E., Varela-Casal, C., Rico-Díaz, J., Barcala-Furelos, R., & Abelairas-Gómez, C. (2021). Intra-rater (Live vs. video assessment) and inter-rater (expert vs. novice) reliability of the test of gross motor—Third edition. International Journal of Environmental Research and Public Health, 18(4), 1652.
  10. Chen, B., Liu, Y., Tang, J., Wang, J., Hong, F., & Ye, W. (2024). Cross-sectional survey of gender differences in gross motor skills among preschool children in Jinhua city, China. Heliyon, 10(21), e39872.
  11. Cools, W., De Martelaer, K., Samaey, C., & Andries, C. (2009). Movement skill assessment of typically developing preschool children: A review of seven movement skill assessment tools. Journal of Sports Science & Medicine, 8(2), 154.
  12. Copetti, F., Valentini, N. C., Deslandes, A. C., & Webster, E. K. (2022). Pedagogical support for the test of gross motor development-3 for children with neurotypical development and with autism spectrum disorder: Validity for an animated mobile application. Physical Education and Sport Pedagogy, 27(5), 483–501.
  13. de Waal, E. (2019). Fundamental movement skills and academic performance of 5-to 6-year-old preschoolers. Early Childhood Education Journal, 47(4), 455–464.
  14. Duncan, M. J., Martins, C., Ribeiro Bandeira, P. F., Issartel, J., Peers, C., Belton, S., O’Connor, N. E., & Behan, S. (2022). TGMD-3 short version: Evidence of validity and associations with sex in Irish children. Journal of Sports Sciences, 40(2), 138–145.
  15. Ecevit, R. G., & Sahin, M. (2021). Relationship between motor skills and social skills in preschool children. European Journal of Education Studies, 8(10), 46–60.
  16. Eddy, L. H., Bingham, D. D., Crossley, K. L., Shahid, N. F., Ellingham-Khan, M., Otteslev, A., Figueredo, N. S., Mon-Williams, M., & Hill, L. J. B. (2020). The validity and reliability of observational assessment tools available to measure fundamental movement skills in school-age children: A systematic review. PLoS ONE, 15(8), e0237919.
  17. Estevan, I., Molina-García, J., Queralt, A., Álvarez, O., Castillo, I., & Barnett, L. (2017). Validity and reliability of the Spanish version of the test of gross motor development–3. Journal of Motor Learning and Development, 5(1), 69–81.
  18. Evaggelinou, C., Tsigilis, N., & Papa, A. (2002). Construct validity of the test of gross motor development: A cross-validation approach. Adapted Physical Activity Quarterly, 19(4), 483–495.
  19. Garn, A. C., & Webster, E. K. (2021). Bifactor structure and model reliability of the test of gross motor development—3rd edition. Journal of Science and Medicine in Sport, 24(1), 67–73.
  20. Gerber, R. J., Wilks, T., & Erdie-Lalena, C. (2010). Developmental milestones: Motor development. Pediatrics in Review, 31(7), 267–277.
  21. Gonzalez, S. L., Alvarez, V., & Nelson, E. L. (2019). Do gross and fine motor skills differentially contribute to language outcomes? A systematic review. Frontiers in Psychology, 10, 2670.
  22. Hu, J., Zhang, S., Ye, W., Zhu, Y., Zhou, H., Lu, L., Chen, Q., & Korivi, M. (2023). Influence of different caregiving styles on fundamental movement skills among children. Frontiers in Public Health, 11, 1232551.
  23. Hulteen, R. M., Barnett, L. M., True, L., Lander, N. J., del Pozo Cruz, B., & Lonsdale, C. (2020). Validity and reliability evidence for motor competence assessments in children and adolescents: A systematic review. Journal of Sports Sciences, 38(15), 1717–1798.
  24. Klingberg, B., Schranz, N., Barnett, L. M., Booth, V., & Ferrar, K. (2019). The feasibility of fundamental movement skill assessments for pre-school aged children. Journal of Sports Sciences, 37(4), 378–386.
  25. Lubans, D. R., Morgan, P. J., Cliff, D. P., Barnett, L. M., & Okely, A. D. (2010). Fundamental movement skills in children and adolescents: Review of associated health benefits. Sports Medicine, 40, 1019–1035.
  26. Maeng, H., Webster, E. K., Pitchford, E. A., & Ulrich, D. A. (2017). Inter-and intrarater reliabilities of the test of gross motor development-third edition among experienced TGMD-2 Raters. Adapted Physical Activity Quarterly, 34(4), 442–455.
  27. Magistro, D., Piumatti, G., Carlevaro, F., Sherar, L. B., & Esliger, D. W. (2020). Psychometric proprieties of the test of gross motor development-third edition in a large sample of Italian children. Journal of Science and Medicine in Sport, 23(9), 860–865.
  28. Magistro, D., Piumatti, G., Carlevaro, F., Sherar, L. B., Esliger, D. W., Bardaglio, G., Magno, F., Zecca, M., & Musella, G. (2018). Measurement invariance of TGMD-3 in children with and without mental and behavioral disorders. Psychological Assessment, 30(11), 1421–1429.
  29. Maïano, C., Morin, A. J. S., April, J., Webster, E. K., Hue, O., Dugas, C., & Ulrich, D. (2022). Psychometric properties of a french-canadian version of the test of gross motor development-third edition (TGMD-3): A Bifactor Structural Equation Modeling Approach. Measurement in Physical Education and Exercise Science, 26(1), 51–62.
  30. Mamani-Ramos, A. A., Damian-Nunez, E. F., Torres-Cruz, F., Dextre-Mendoza, C. W., Alcarraz-Curi, M., Quisocala-Ramos, J. A., Mamani-Cari, Y. A., Roncal-Serpa, F. R., Quispe-Cruz, H., Paucar-Pancca, A., & Montoya-Castillo, P. M. (2023). Psychometric properties of the peruvian version of the gross motor development test-third edition. Retos-Nuevas Tendencias En Educacion Fisica Deporte Y Recreacion, 50, 1180–1187. Available online: https://recyt.fecyt.es/index.php/retos/issue/view/4423 (accessed on 5 January 2025).
  31. Marinšek, M., Bedenik, K., & Kovač, M. (2023). Psychometric proprieties of the slovenian version of the test of gross motor development–3. Journal of Motor Learning and Development, 1(aop), 1–15.
  32. Mehmedinović, S., Bratovčić, V., Kuduzović, E., Avdić, B., & Kožljak, L. (2021). Metric characteristics of the test of gross motor development (tgmd 3). Research in Education and Rehabilitation, 4(2), 146–155.
  33. Melissant, H. C., Neijenhuijs, K. I., Jansen, F., Aaronson, N. K., Groenvold, M., Holzner, B., Terwee, C. B., van Uden-Kraan, C. F., Cuijpers, P., & Verdonck-de Leeuw, I. M. (2018). A systematic review of the measurement properties of the Body Image Scale (BIS) in cancer patients. Supportive Care in Cancer, 26, 1715–1726.
  34. Mohammadi, F., Bahram, A., Khalaji, H., Ulrich, D. A., & Ghadiri, F. (2019). Evaluation of the psychometric properties of the persian version of the test of gross motor development–3rd edition. Journal of Motor Learning and Development, 7(1), 106–121.
  35. Mokkink, L. B., Elsman, E. B., & Terwee, C. B. (2024). COSMIN guideline for systematic reviews of patient-reported outcome measures version 2.0. Quality of Life Research, 33(11), 2929–2939.
  36. O’Hagan, A. D., Behan, S., Peers, C., Belton, S., O’Connor, N., & Issartel, J. (2022). Do our movement skills impact our cognitive skills? Exploring the relationship between cognitive function and fundamental movement skills in primary school children. Journal of Science and Medicine in Sport, 25(11), 871–877.
  37. Okely, A. D., Booth, M. L., & Chey, T. (2004). Relationships between body composition and fundamental movement skills among children and adolescents. Research Quarterly for Exercise and Sport, 75(3), 238–247.
  38. Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., . . . Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71.
  39. Pitchford, E. A., & Webster, E. K. (2020). Clinical validity of the test of gross motor development-3 in children with disabilities from the U.S. national normative sample. Adapted Physical Activity Quarterly: APAQ, 38(1), 62–78.
  40. Prinsen, C. A., Mokkink, L. B., Bouter, L. M., Alonso, J., Patrick, D. L., De Vet, H. C., & Terwee, C. B. (2018). COSMIN guideline for systematic reviews of patient-reported outcome measures. Quality of Life Research, 27, 1147–1157.
  41. Rey, E., Carballo-Fazanes, A., Varela-Casal, C., Abelairas-Gómez, C., & ALFA-MOV Project Collaborators. (2020). Reliability of the test of gross motor development: A systematic review. PLoS ONE, 15(7), e0236070.
  42. Rizkyanto, W. I., Gani, I., Iswant, A., Yudhistira, D., & Shahril, M. I. (2024). Validity and reliability of the gross motor development III test for indonesian children. Fizjoterapia Polska, 24(2), 171–177.
  43. Salami, S., Bandeira, P. F. R., Gomes, C. M. A., & Dehkordi, P. S. (2021). The test of gross motor development—Third edition: A bifactor model, dimensionality, and measurement invariance. Journal of Motor Learning and Development, 10(1), 116–131.
  44. Staples, K. L., Pitchford, E. A., & Ulrich, D. A. (2020). The instructional sensitivity of the test of gross motor development-3 to detect changes in performance for young children with and without down syndrome. Adapted Physical Activity Quarterly, 38(1), 95–108.
  45. Stodden, D. F., Goodway, J. D., Langendorfer, S. J., Roberton, M. A., Rudisill, M. E., Garcia, C., & Garcia, L. E. (2008). A developmental perspective on the role of motor skill competence in physical activity: An emergent relationship. Quest, 60(2), 290–306.
  46. Terwee, C. B., Bot, S. D., de Boer, M. R., Van der Windt, D. A., Knol, D. L., Dekker, J., Bouter, L. M., & de Vet, H. C. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60(1), 34–42.
  47. Terwee, C. B., Prinsen, C. A., Chiarotto, A., Westerman, M. J., Patrick, D. L., Alonso, J., Bouter, L. M., de Vet, H. C. W., & Mokkink, L. B. (2018). COSMIN methodology for evaluating the content validity of patient-reported outcome measures: A delphi study. Quality of Life Research, 27, 1159–1170.
  48. Ulrich, D. A. (1983). The standardization of a criterion-referenced test in fundamental motor and physical fitness skills. [Dissertation Abstracts International Section A: Humanities and Social Sciences]. Available online: https://elibrary.ru/item.asp?id=7352604 (accessed on 5 January 2025).
  49. Ulrich, D. A. (2017). Introduction to the special section: Evaluation of the psychometric properties of the TGMD-3. Journal of Motor Learning and Development, 5(1), 1–4.
  50. Ulrich, D. A., & Ulrich, B. D. (1984). The objectives-based motor-skill assessment instrument: Validation of instructional sensitivity. Perceptual and Motor Skills, 59(1), 175–179.
  51. Valentini, N. C., Duarte, M. G., Zanella, L. W., & Nobre, G. C. (2022). Test of gross motor development-3: Item difficulty and item differential functioning by gender and age with rasch analysis. International Journal of Environmental Research and Public Health, 19(14), 8667.
  52. Valentini, N. C., Nobre, G. C., Zanella, L. W., Pereira, K. G., Albuquerque, M. R., & Rudisill, M. E. (2021). Test of gross motor development-3 validity and reliability: A screening form. Journal of Motor Learning and Development, 9(3), 438–455.
  53. Valentini, N. C., Zanella, L. W., & Webster, E. K. (2017). Test of gross motor development—Third edition: Establishing content and construct validity for brazilian children. Journal of Motor Learning and Development, 5(1), 15–28.
  54. Van De Schoot, R., Schmidt, P., De Beuckelaer, A., Lek, K., & Zondervan-Zwijnenburg, M. (2015). Measurement invariance. Frontiers in Psychology, 6, 1064.
  55. Wagner, M. O., Webster, E. K., & Ulrich, D. A. (2017). Psychometric properties of the test of gross motor development, (German translation): Results of a pilot study. Journal of Motor Learning and Development, 5(1), 29–44.
  56. Webster, E. K., Pitchford, E. A., & Ulrich, D. A. (2015). Psychometric properties for a united states cohort for the test of gross motor development-3rd edition. Journal of Sport & Exercise Psychology, 37, S17.
  57. Webster, K., & Ulrich, D. (2017). Evaluation of the psychometric properties of the test of gross motor development—Third edition. Journal of Motor Learning and Development, 5(1), 45–58.
  58. WHO Multicentre Growth Reference Study Group & de Onis, M. (2006). WHO Motor Development Study: Windows of achievement for six gross motor development milestones. Acta Paediatrica, 95, 86–95.
  59. Wiart, L., & Darrah, J. (2001). Review of four tests of gross motor development. Developmental Medicine and Child Neurology, 43(4), 279–285.
  60. Wick, K., Leeger-Aschmann, C. S., Monn, N. D., Radtke, T., Ott, L. V., Rebholz, C. E., Cruz, S., Gerber, N., Schmutz, E. A., Puder, J. J., Munsch, S., Kakebeeke, T. H., Jenni, O. G., Granacher, U., & Kriemler, S. (2017). Interventions to promote fundamental movement skills in childcare and kindergarten: A systematic review and meta-analysis. Sports Medicine, 47, 2045–2068.
  61. Widström, A. M., Brimdyr, K., Svensson, K., Cadwell, K., & Nissen, E. (2019). Skin-to-skin contact the first hour after birth, underlying implications and clinical practice. Acta Paediatrica, 108(7), 1192–1204.
  62. Zheng, Y., Ye, W., Korivi, M., Liu, Y., & Hong, F. (2022). Gender differences in fundamental motor skills proficiency in children aged 3–6 years: A systematic review and meta-analysis. International Journal of Environmental Research and Public Health, 19(14), 8318.
  63. Zhu, Y., Hu, J., Ye, W., Korivi, M., & Qian, Y. (2024). Assessment of the measurement properties of the peabody developmental motor scales-2 by applying the COSMIN methodology. Italian Journal of Pediatrics, 50(1), 87.
Figure 1. Flow diagram of article selection according to PRISMA.
Table 1. Basic characteristics of the included articles.
| Author (Year) | N | Age (Years) | Sex (M/F) | Studied Population | Country/Region | Measurement Properties |
|---|---|---|---|---|---|---|
| Allen et al. (2017) | 35 | 4–10 | 22/13 | Typically developing children and children with ASD | Australia | IC, TR, IER, IAR, SV |
| A. Brian et al. (2018) | 66 | 12.93 ± 2.4 | 41/25 | Children with visual impairment | USA | IC, TR, CoV, SV |
| A. S. Brian et al. (2021) | 302 | 13 ± 2.5 | 175/127 | Children with visual impairment | USA | SV |
| Carballo-Fazanes et al. (2021) | 25 | 9.16 ± 1.31 | 60% girls | Typically developing children | Spain | IAR, IER |
| Duncan et al. (2022) | 1608 | 9.2 ± 2.04 | 47% girls | Typically developing children | Ireland | SV |
| Estevan et al. (2017) | 178 | 3–11 | 47.5% girls | Typically developing children | Spain | IC, SV, IER, IAR |
| Garn and Webster (2021) | 862 | 6.5 ± 2.23 | 49% girls | Typically developing children | USA | SV, IC |
| Maeng et al. (2017) | 10 | 6.57 ± 2.51 | 6/4 | Typically developing children | USA | IAR, IER |
| Magistro et al. (2018) | 1075 | 3–11 | 565/510 | Children with mental disorders and typically developing children | Italy | MI, IC |
| Magistro et al. (2020) | 5210 | 8.38 ± 1.97 | 48% girls | Typically developing children | Italy | MI, SV, TR |
| Maïano et al. (2022) | 127 | 5–11 | 70/57 | Typically developing children | Canada | SV |
| Mamani-Ramos et al. (2023) | 348 | 6–10 | 48.6% girls | Typically developing children | Peru | SV, IC, SV, IC, TR |
| Marinšek et al. (2023) | 452 | 7.32 ± 2.16 | 50.4% girls | Typically developing children | Slovenia | CV, IAR, IER, TR, SV, IC, MI |
| Mehmedinović et al. (2021) | 146 | 6.8 ± 2.23 | 53.4% girls | Typically developing children | Bosnia and Herzegovina | IC, SV |
| Mohammadi et al. (2019) | 1600 | 6.56 ± 2.29 | 50% girls | Typically developing children | Iran | CV, IC, IR, IER, TR, SV |
| Pitchford and Webster (2020) | 170 | 3–11 | 122/48 | Disabled children and typically developing children | USA | Re |
| Rizkyanto et al. (2024) | 290 | 11–13 | 180/110 | Typically developing children | Indonesia | CV, IC |
| Salami et al. (2021) | 496 | 7.23 ± 2.03 | 53.8% girls | Typically developing children | Iran | SV, MI |
| Staples et al. (2020) | 48 | 5.10 ± 0.74 | 28/20 | Children with Down syndrome | USA | Re |
| Valentini et al. (2022) | 989 | 3–10.9 | 498/491 | Typically developing children | Brazil | MI |
| Valentini et al. (2017) | 597 | 3–10 | 295/320 | Typically developing children | Brazil | SV, CV, IER, IAR, TR, IC |
| Wagner et al. (2017) | 189 | 7.15 ± 2.02 | 99/90 | Typically developing children | Germany | IER, IAR, TR, IC, MI, CoV, SV |
| Webster and Ulrich (2017) | 807 | 6.33 ± 2.09 | 47.5% girls | Typically developing children | USA | IC, TR, SV |

Note: M/F = male/female; CV: content validity; SV: structural validity; IC: internal consistency; MI: measurement invariance; TR: test–retest reliability; IER: inter-rater reliability; IAR: intra-rater reliability; Re: responsiveness; CoV: concurrent validity.
Table 2. Summary of the findings.
| Measurement Property | Summary or Pooled Results | Overall Rating | Quality of Evidence |
| Content validity | Content validity (Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017) | | |
| | Relevant (Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017) | Sufficient (+) | High: multiple very good studies |
| | Comprehensible (Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017) | Sufficient (+) | High: multiple very good studies |
| Structural validity | Structural validity (A. Brian et al., 2018; A. S. Brian et al., 2021; Estevan et al., 2017; Garn & Webster, 2021; Magistro et al., 2020; Maïano et al., 2022; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mehmedinović et al., 2021; Mohammadi et al., 2019; Salami et al., 2021; Valentini et al., 2017; Webster & Ulrich, 2017) | | |
| | Bifactor structure (A. Brian et al., 2018; A. S. Brian et al., 2021; Estevan et al., 2017; Garn & Webster, 2021; Magistro et al., 2020; Maïano et al., 2022; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mehmedinović et al., 2021; Mohammadi et al., 2019; Salami et al., 2021; Valentini et al., 2017; Webster & Ulrich, 2017) | Qualitative summary: Sufficient (+); 79% supported | Moderate: multiple very good studies, inconsistent results |
| | One-factor structure (Estevan et al., 2017; Garn & Webster, 2021; Marinšek et al., 2023; Mohammadi et al., 2019; Webster & Ulrich, 2017) | Qualitative summary: Inconsistent (±) | Moderate: multiple very good studies, inconsistent results |
| Internal consistency | Internal consistency (Allen et al., 2017; A. Brian et al., 2018; Estevan et al., 2017; Garn & Webster, 2021; Magistro et al., 2018; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mehmedinović et al., 2021; Mohammadi et al., 2019; Rizkyanto et al., 2024; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017) | Qualitative summary: Sufficient (+) | Moderate: multiple very good studies, inconsistent results |
| | Locomotor Subtest, α: 0.63–0.92 | Sufficient (+); 92% supported | |
| | Ball Skills Subtest, α: 0.60–0.95 | Sufficient (+); 92% supported | |
| | Total TGMD-3, α: 0.74–0.96 | Sufficient (+) | |
| Reliability | Reliability (Allen et al., 2017; A. Brian et al., 2018; Carballo-Fazanes et al., 2021; Estevan et al., 2017; Maeng et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017) | | |
| | Test–retest reliability (Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017; Webster & Ulrich, 2017) | Qualitative summary: Sufficient (+) | High: all studies are very good |
| | Locomotor Subtest, ICC: 0.81–0.996 (Allen et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Wagner et al., 2017; Webster & Ulrich, 2017) | Sufficient (+) | |
| | Locomotor Subtest, r: 0.92–0.93 (Mohammadi et al., 2019; Valentini et al., 2017) | Sufficient (+) | |
| | Ball Skills Subtest, ICC: 0.84–0.997 (Allen et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Wagner et al., 2017; Webster & Ulrich, 2017) | Sufficient (+) | |
| | Ball Skills Subtest, r: 0.81–0.94 (Mohammadi et al., 2019; Valentini et al., 2017) | Sufficient (+) | |
| | Total TGMD-3, ICC: 0.92–0.996 (Allen et al., 2017; Magistro et al., 2020; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Wagner et al., 2017; Webster & Ulrich, 2017) | Sufficient (+) | |
| | Total TGMD-3, r: 0.90–0.95 (Mohammadi et al., 2019; Valentini et al., 2017) | Sufficient (+) | |
| | Inter-rater reliability (A. Brian et al., 2018; Estevan et al., 2017; Maeng et al., 2017; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017) | Qualitative summary: Sufficient (+) | High: all studies are very good |
| | Locomotor Subtest, ICC: 0.82–0.97 | Sufficient (+) | |
| | Ball Skills Subtest, ICC: 0.778–0.98 | Sufficient (+) | |
| | Total TGMD-3, ICC: 0.842–0.98 | Sufficient (+) | |
| | Intra-rater reliability (Allen et al., 2017; Carballo-Fazanes et al., 2021; Estevan et al., 2017; Maeng et al., 2017; Mamani-Ramos et al., 2023; Marinšek et al., 2023; Mohammadi et al., 2019; Valentini et al., 2017; Wagner et al., 2017) | Qualitative summary: Sufficient (+) | High: all studies are very good |
| | Locomotor Subtest, ICC: 0.865–0.988 | Sufficient (+) | |
| | Ball Skills Subtest, ICC: 0.85–0.99 | Sufficient (+) | |
| | Total TGMD-3, ICC: 0.90–0.99 | Sufficient (+) | |
| Measurement invariance | Measurement invariance (Magistro et al., 2020; Magistro et al., 2018; Marinšek et al., 2023; Salami et al., 2021; Valentini et al., 2022; Wagner et al., 2017) | Qualitative summary: Sufficient (+) | High: all studies are very good |
| | Across gender groups: no important differences (Magistro et al., 2020; Marinšek et al., 2023; Salami et al., 2021; Wagner et al., 2017) or no important differential item functioning (DIF) (Valentini et al., 2022) | Sufficient (+) | |
| | Across age groups: no important differences (Magistro et al., 2020) or no important DIF (Valentini et al., 2022) | Sufficient (+) | |
| | Across groups with and without disability: no important differences (Magistro et al., 2018) | Sufficient (+) | |
| Hypothesis testing for construct validity | Hypothesis testing for construct validity (A. Brian et al., 2018; Wagner et al., 2017) | Qualitative summary: Inconsistent (±) | Moderate: multiple very good studies, inconsistent results |
| | TGMD-3 and TGMD-2 (A. Brian et al., 2018), convergent validity: Locomotor Subtest r: 0.98; Ball Skills Subtest r: 0.98; Total Scale r: 0.99 | Sufficient (+) | |
| | TGMD-3 and M-ABC2 (Wagner et al., 2017), divergent validity: Locomotor Subtest r: 0.22–0.33; Ball Skills Subtest r: 0.23–0.30 | Insufficient (−) | |
| | TGMD-3 and GYGBT (Wagner et al., 2017), concurrent validity: r: 0.36 | Insufficient (−) | |
| Responsiveness | Responsiveness (Pitchford & Webster, 2020; Staples et al., 2020) | Qualitative summary: Inconsistent (±) | Low: severe bias |
| | Hypothesis testing: comparison between subgroups (Pitchford & Webster, 2020) | | |
| | Children with and without ADHD: p < 0.05 | Sufficient (+) | Low: severe bias |
| | Children with and without LAD: p > 0.534 | Insufficient (−) | Low: severe bias |
| | Children with and without ASD: p < 0.001 | Sufficient (+) | Low: severe bias |
| | Children with and without ID: p < 0.001 | Sufficient (+) | Low: severe bias |
| | Hypothesis testing: before and after intervention (Staples et al., 2020) | | |
| | Locomotor Subtest: p < 0.01; Ball Skills Subtest: p < 0.01 | Sufficient (+) | Low: severe bias |
Note: TGMD-2 = Test of Gross Motor Development-2; M-ABC2 = Movement Assessment Battery for Children-2; GYGBT = German Youth Games ball-throwing distance performance; ADHD = Attention Deficit Hyperactivity Disorder; LAD = Language or Articulation Disorders; ASD = Autism Spectrum Disorder; ID = Intellectual Disability.
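The "% supported" figures and the sufficient/inconsistent labels in Table 2 reflect a pooling step in which per-study results are combined into one overall rating. The sketch below illustrates one way such pooling could work. It assumes the cut-offs commonly associated with COSMIN guidance (a coefficient of at least 0.70 for a single sufficient result, and at least 75% agreement for an overall rating); these thresholds are not stated in this section, and the function names and input counts are hypothetical, not data from the review.

```python
from typing import Sequence, Tuple

SUFFICIENT = "+"
INSUFFICIENT = "-"
INCONSISTENT = "±"


def rate_result(value: float, threshold: float = 0.70) -> str:
    """Rate one coefficient (e.g. ICC or Cronbach's alpha) against an assumed cut-off."""
    return SUFFICIENT if value >= threshold else INSUFFICIENT


def pool_ratings(ratings: Sequence[str], majority: float = 0.75) -> Tuple[str, float]:
    """Combine per-study ratings into an overall rating.

    Returns the overall rating and the share of sufficient results.
    The 75% majority rule is an assumption taken from COSMIN guidance.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    share_sufficient = ratings.count(SUFFICIENT) / len(ratings)
    share_insufficient = ratings.count(INSUFFICIENT) / len(ratings)
    if share_sufficient >= majority:
        overall = SUFFICIENT
    elif share_insufficient >= majority:
        overall = INSUFFICIENT
    else:
        overall = INCONSISTENT
    return overall, share_sufficient


# Illustrative input only: 12 sufficient results and 1 insufficient result.
overall, share = pool_ratings([SUFFICIENT] * 12 + [INSUFFICIENT])
print(overall, f"{round(share * 100)}% supported")  # prints: + 92% supported
```

Under these assumptions, a property with an even split of sufficient and insufficient results would be rated inconsistent (±), which matches how the mixed findings in Table 2 are labelled.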
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Zhu, Y.; Wang, J.; Ding, Y.; Qian, Y.; Korivi, M.; Chen, Q.; Ye, W. Assessing the Measurement Properties of the Test of Gross Motor Development-3 Using the COSMIN Methodology—A Systematic Review. Behav. Sci. 2025, 15, 62. https://doi.org/10.3390/bs15010062

