### 3.2.3. Flexibility

Only one study [82] that examined the criterion-related validity of flexibility tests was classified as high quality (see Supplementary Table S4). It found that the sit-and-reach was not a valid test (r = 0.44–0.48, *p* < 0.05).

A meta-analysis [31], which included 28 studies on adults (see Supplementary Table S5), found that the sit-and-reach test and its different versions had moderate validity for estimating hamstring extensibility (rp ranged from 0.49; 95% CI: 0.29–0.68 to 0.68; 95% CI: 0.55–0.80), but low validity for estimating lumbar extensibility (rp ranged from 0.16; 95% CI: −0.10–0.41 to 0.35; 95% CI: 0.15–0.54). Moreover, another meta-analysis [29] of six studies carried out on adults reported that the toe-touch test had moderate validity for assessing hamstring extensibility (rp = 0.66; 95% CI: 0.56–1.00).
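One common way to read coefficients like these is through the coefficient of determination, r², which gives the share of variance in the criterion measure that the field test score accounts for. The sketch below is only illustrative: the function name is hypothetical, and the review does not state that its validity categories were derived in this way.

```python
def shared_variance(r: float) -> float:
    """Return r squared: the proportion of criterion variance explained by the test score."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


# Coefficients quoted in the text above (single-study or pooled values).
reported = {
    "sit-and-reach, single study (lower bound)": 0.44,
    "sit-and-reach, single study (upper bound)": 0.48,
    "sit-and-reach vs hamstring (highest pooled)": 0.68,
    "sit-and-reach vs lumbar (lowest pooled)": 0.16,
    "toe-touch vs hamstring (pooled)": 0.66,
}

for label, r in reported.items():
    print(f"{label}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

Under this reading, even the highest pooled coefficient for hamstring extensibility (0.68) corresponds to less than half of the criterion variance.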

### 3.2.4. Motor Fitness

No study investigating the criterion-related validity of motor fitness tests was classified as high quality (see Supplementary Table S3).

#### **4. Discussion**

The present systematic review comprehensively studied the criterion-related validity of the existing field-based fitness tests used in adults. The findings of this review provide an evidence-based proposal for the most valid field-based fitness tests for the adult population.

#### *4.1. Cardiorespiratory Fitness*

The gold standard to assess VO2max is the Douglas bag method, although there is agreement that the respiratory gas analyser is a valid method of assessing oxygen uptake [83]. All high-quality studies measured VO2max or peak oxygen consumption when performing a submaximal/maximal treadmill or cycle test, except Manttari et al. [52], who directly measured VO2max when performing the 6 min walk test.

#### 4.1.1. Distance and Time-Based Run/Walk Tests

The run/walk field tests are probably the most widely used tests [27,84]; however, until recently, there was no consensus regarding the most appropriate distance or time to use for these tests [85]. Mayorga et al. [30] performed a meta-analysis which examined the criterion-related validity of the 5000 m, 3-mile, 2-mile, 3000 m, 1.5-mile, 1-mile, 1000 m, 1/2-mile, 600 m, 600 yd, 1/4-mile, 15 min, 12 min, 9 min, and 6 min run/walk tests. They found that the criterion-related validity of the run/walk tests, only considering the performance score, ranged from low to high, with the 1.5-mile and the 12 min run/walk tests being the most appropriate tests for estimating cardiorespiratory fitness in adults aged 19–64 years. Sex, age or VO2max level did not affect criterion-related validity, whereas when multiple predictors (i.e., performance score, sex, age or body mass) were considered, the criterion-related validity values were higher. In this sense, two high-quality original studies reinforced these results, and showed that the 12 min [41] and the 1.5-mile [40,41] run/walk tests were fairly accurate for estimating cardiorespiratory fitness in adults aged 18–26 years (r = 0.87–0.93, *p* < 0.05).

Overall, the run/walk tests are not user-friendly tests, due to the difficulty of developing an appropriate pace, which may affect the test outcome (some participants start too fast, so they are unable to maintain their speed throughout the test; others start too slow, so when they wish to increase their speed the test is already finished). These problems are more likely to occur in longer distance tests. Other factors affecting the test outcome include the individual's willingness to endure the discomfort of strenuous exercise, a short attention span, poor motivation, and limited interest in a monotonous task [86–88].

The 2 km and 6 min walk tests are probably the most widely used walk tests in adults [39,51]. Both tests require submaximal effort, thus avoiding the problem of enduring the discomfort of strenuous exercise. In addition, they make it possible to evaluate people who have a low level of physical fitness or are unable to run. Three high-quality studies [36,37,39] observed that Oja's equation derived from the 2 km walk test has high validity (r = 0.80–0.87, all *p* < 0.05) in untrained and/or overweight/obese adults aged 20–64 years. One high-quality study [38] reported that the 2 km walk test is a reasonably valid field test for estimating the cardiorespiratory fitness of moderately active adults aged 35–45 years, but not of adults with very high maximal aerobic power.

Many studies developed prediction equations for the 6 min test based on spirometry [89]. However, only three high-quality studies [50–52] analysed the criterion-related validity of the 6 min test based on VO2max in adults. They showed a moderate-to-high validity (r = 0.70–0.93, all *p* < 0.001) in obese and healthy adults aged 18–64 years. Burr et al. [90] suggested that, on its own, the 6 min walk test can be useful to discriminate between broad categories of high, moderate and low fitness, but that this approach may be associated with a degree of error, especially in the high fitness group.

According to these findings, the 2 km and 6 min walk tests are valid for use in adults aged 19–64 years with low or moderate fitness levels, but not in adults with a high fitness level.

Regarding the 1-mile walk test, conflicting results were found, especially when examining the accuracy of Kline's [42] and Dolgener's [46] equations in adults aged 19–64 years.

#### 4.1.2. Twenty-Metre Shuttle Run Test

The 20 m shuttle run test was developed by Leger et al. [91] to solve the pace issue of the run/walk tests. The test consists of 1 min stages of continuous running at an increasing speed. Recently, a meta-analysis [28] showed that the performance score of the 20 m shuttle run test had a moderate-to-high criterion-related validity for estimating VO2max (*r*p = 0.66–0.84) in youth and adults aged 18–64 years, and that validity was higher when other variables (i.e., sex, age or body mass) were accounted for (*r*p = 0.78–0.95). This study also reported that Leger's protocol had a greater average criterion-related validity coefficient (*r*p = 0.84; 95% CI: 0.80–0.89) than the Eurofit, QUB and Dong-HO protocols, and that the coefficient for Leger's protocol was significantly higher for adults (*r*p = 0.94; 95% CI: 0.87–1.00) than for children (*r*p = 0.78; 95% CI: 0.72–0.85). These values are higher than those reported for the 1500 m and 12 min run/walk tests [30]. Moreover, the meta-analysis showed that sex did not seem to affect the criterion-related validity values.
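To illustrate how a staged, progressively faster protocol sets the pace for the participant, the sketch below computes the running speed and the time allowed per 20 m shuttle at each 1 min stage. The starting speed (8.5 km/h) and the increment (0.5 km/h per stage) are assumptions chosen as plausible defaults; they are not stated in this review, and the function names are hypothetical.

```python
SHUTTLE_LENGTH_M = 20.0  # distance between the two lines, taken from the test's name


def stage_speed_kmh(stage: int, start_kmh: float = 8.5, step_kmh: float = 0.5) -> float:
    """Speed for a given 1 min stage (1-based); default values are assumptions."""
    if stage < 1:
        raise ValueError("stage numbers start at 1")
    return start_kmh + (stage - 1) * step_kmh


def seconds_per_shuttle(speed_kmh: float) -> float:
    """Time available to cover one 20 m shuttle at the given speed."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    speed_m_per_s = speed_kmh / 3.6  # convert km/h to m/s
    return SHUTTLE_LENGTH_M / speed_m_per_s


for stage in range(1, 6):
    speed = stage_speed_kmh(stage)
    print(f"stage {stage}: {speed:.1f} km/h, {seconds_per_shuttle(speed):.2f} s per shuttle")
```

Because the pace is imposed externally at every stage, participants do not have to choose their own speed, which is the pacing problem of the run/walk tests described above.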

On the other hand, Cooper et al. [54] showed that Brewer's protocol and equation were not valid for assessing active young people aged 18–26 years (mean difference = 1.8 ± 6.3 mL/kg/min; *p* = 0.004). In line with these findings, Kim et al. [58] observed that Leger's protocol and equation were more accurate than Brewer's protocol and equation (mean difference −0.54 mL/kg/min; %CV: 1.39 vs. mean difference −2.944 mL/kg/min; %CV: 8.87) in Korean adults, especially in women. Nonetheless, the authors suggested the need to develop new equations for Korean adults.
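Agreement figures such as these are often summarised as Bland–Altman 95% limits of agreement, computed as the mean difference ± 1.96 times the standard deviation of the differences. The sketch below applies that formula to the values reported by Cooper et al. [54], assuming that 6.3 mL/kg/min is the standard deviation of the differences (the text does not say so explicitly); the function name is hypothetical.

```python
from typing import Tuple


def limits_of_agreement(mean_diff: float, sd_diff: float, z: float = 1.96) -> Tuple[float, float]:
    """Return (lower, upper) limits of agreement: mean_diff -/+ z * sd_diff."""
    if sd_diff < 0:
        raise ValueError("a standard deviation cannot be negative")
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


# Assumption: 1.8 is the mean difference and 6.3 the SD of the differences (mL/kg/min).
lower, upper = limits_of_agreement(1.8, 6.3)
print(f"95% limits of agreement: {lower:.1f} to {upper:.1f} mL/kg/min")
```

Under that assumption the limits span roughly −10.5 to 14.1 mL/kg/min, which illustrates how a small mean bias can coexist with wide individual-level error.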

It is important to note that the 20 m square shuttle run test [55,57] was proposed as an alternative to the 20 m shuttle run test to reduce the test's turning angle from 180° to 90°. This test was a better predictor of VO2max than the 20 m shuttle run test in young male adults aged 18–25 years.

#### 4.1.3. Step Tests

Step tests are a safe, simple, inexpensive and practical method of assessing cardiorespiratory fitness under submaximal conditions, and they require minimal space [32]; they are also a good alternative to laboratory tests in clinical settings. There is a wide variety of step test protocols, which differ in terms of stepping frequency, test duration and number of test stages. Bennett et al. [32] analysed the criterion-related validity of different step tests (the Chester step test, a personalised step test, the STEP tool step test, the Queen's College step test, the Skubic and Hodgkins step test, a height-adjusted, rate-specific, single-stage step test, the Astrand–Ryhming step test, and a modified YMCA 3 min step test) in adults aged 18–64 years. The validity of these tests ranged from moderate to high, and the authors suggested that the Chester step test was the most valid step test to evaluate cardiorespiratory fitness in adults. However, this systematic review included only two studies of the Chester step test, with contradictory results, as was also the case for the Queen's College step test.

Analysing the 12 high-quality studies that examined the criterion-related validity of step tests in adults aged 19–64 years, we can conclude that the YMCA step test [67,71] seems to be the most appropriate step test to estimate VO2max in this population. However, it is important to note that there is no single equation, since the equation obtained depends on the sample used. Santo and Golding [92] even altered the protocol by adjusting the step height to the individual participant's height in order to increase the accuracy of this test.

#### 4.1.4. Levels of Evidence

Strong evidence indicated that (a) the 20 m shuttle run test using Leger's equation, the 2 km walk test using Oja's equation, the 6 min walk test and the YMCA step test are valid for estimating cardiorespiratory fitness; and (b) the criterion-related validity of the distance and time-based run/walk tests ranges from low to high, with the 1.5-mile and 12 min run/walk tests being the best predictors. Moderate evidence indicated that the 20 m square shuttle run test is valid for estimating cardiorespiratory fitness. Due to the inconsistent results found in high-quality studies, limited evidence was found for the validity of the 1-mile walk, treadmill jogging, incremental shuttle walking, Chester, and Queen's College step tests. Due to the low number of high-quality studies, limited evidence indicated that (a) the 3 min walk, the 1/4-mile walk, Mankato submaximal, modified Astrand–Ryhming, University Montreal, modified Canadian aerobic fitness step, 6 min single 15 cm step, Tecumseh step, modified Harvard step and Astrand–Ryhming step tests are valid for estimating cardiorespiratory fitness; and (b) the YMCA cycle, Ruffier, Danish step, and 2 min step tests are not valid for estimating cardiorespiratory fitness. Due to the consistent results found in multiple low-quality studies, limited evidence supported using the 6 min step test for estimating cardiorespiratory fitness.

#### *4.2. Muscular Strength*

The specificity of the type of muscular work performed and the use of different energy systems are both major challenges for establishing a gold standard method for maximal, endurance and explosive muscular strength tests [93]. One repetition maximum (1RM) and repetitions to a certain percentage of 1RM (i.e., 50% of 1RM or 70% of 1RM) [27], isokinetic dynamometer strength [94–96], and electromyography [78,80] were used as gold standards.

#### 4.2.1. Maximal Isometric Strength

The TKK dynamometer [73–75] seemed the most appropriate instrument to assess maximal isometric strength in adults. All the studies used known weights as the criterion reference.

Several studies examined whether the elbow position (extended or flexed at 90 degrees) affected the hand maximal isometric strength score in children [75], adolescents [97] and young adults [98]. They observed that performing the handgrip strength test with the elbow extended seems the most appropriate protocol to evaluate hand maximal isometric strength in these populations, which is in accordance with the protocol recommended by the US Centers for Disease Control and Prevention [99].

Ruiz et al. [100] also investigated whether the position (grip span) on the standard grip dynamometer determined the hand maximal isometric strength in adults. They found that when measuring hand maximal isometric strength in women, hand size must be taken into consideration, providing the mathematical equation (*y* = *x*/5 + 1.5 cm) to adapt optimal grip span (*y*) to hand size (*x*). In adult men, optimal grip span could be set at a fixed value (5.5 cm) and is not influenced by hand size.
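The grip span rule described above can be sketched as a small helper. This is a minimal illustration rather than the authors' procedure: the function name and the sex labels are hypothetical, and hand size is assumed to be entered in centimetres, as the units of the equation imply.

```python
def optimal_grip_span_cm(sex: str, hand_size_cm: float) -> float:
    """Optimal grip span (y, in cm) for an adult, following the rule in the text.

    Women: y = x / 5 + 1.5, where x is hand size in cm.
    Men: fixed at 5.5 cm, independent of hand size.
    """
    if hand_size_cm <= 0:
        raise ValueError("hand size must be positive")
    sex = sex.strip().lower()
    if sex == "female":
        return hand_size_cm / 5 + 1.5
    if sex == "male":
        return 5.5
    raise ValueError("sex must be 'female' or 'male' (labels are illustrative)")


# Example: a woman with an 18 cm hand size and a man with a 22 cm hand size.
print(f"woman, 18 cm hand: {optimal_grip_span_cm('female', 18.0):.1f} cm")
print(f"man, 22 cm hand:   {optimal_grip_span_cm('male', 22.0):.1f} cm")
```

In practice the computed value would still need to be matched to the nearest setting that the dynamometer allows, a step not covered here.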

Importantly, just like the step test, the handgrip strength test can be very useful in clinical settings because it requires minimal equipment and space, is time-efficient and easy to administer.

#### 4.2.2. Endurance Strength

The Biering–Sørensen test, a trunk holding test in an antigravity prone position, is commonly used to measure the endurance strength of the back and hip muscles, which is associated with lower back pain [101]. Mannion et al. [77] and Coorevits et al. [78] showed that the test endurance time was highly associated with isometric/endurance hip and back musculature strength (r = 0.84–0.98, *p* < 0.01). On the other hand, Kankaanpää et al. [79] found that this association was moderate (r = 0.60–0.71, *p* < 0.05). However, when BMI (r = −0.49–0.51, *p* < 0.001) in women and age (r = 0.25–0.29, *p* < 0.05) in men were accounted for in the prediction model, the explained variance increased considerably. Thus, the Biering–Sørensen test might be considered valid for measuring back muscle endurance strength.

Assessing abdominal muscle functionality is clinically relevant since it is considered to be related to lower back pain [102,103]. The curl-up test, or its different versions, was the field test originally used to assess this capacity. In the present review, no original studies evaluating the criterion-related validity of this test were classified as high quality. An alternative to the curl-up test could be the prone bridging test, an isometric holding test in a prone position, which is currently used on the assumption that it measures abdominal endurance strength. The prone bridging test time is inversely associated with lower back pain [104,105]. In relation to the validity of this test, De Blaiser et al. [80] found a higher activation of the abdominal core musculature during the test than of the back and hip musculature, showing a high association between test time and abdominal endurance strength. Future high-quality studies are necessary to clarify the validity of this test.

It should be noted that no study that analysed the criterion-related validity of lower and upper body endurance strength tests was classified as high quality.

#### 4.2.3. Explosive Strength

The standing long jump is proposed in health-related fitness test batteries for preschool children [106], as well as for children and adolescents [107], to assess lower body explosive strength, given its criterion-related and predictive validity. However, to our knowledge, the criterion-related validity of this test has not been studied in adults. Bui et al. [81] observed that the Sargent jump test is not appropriate to evaluate lower body explosive strength, because it overestimates the height of a vertical jump and its accuracy decreases as the jump height increases (mean difference: 4.4 ± 5.1, *p* < 0.001). Given the close relationship between lower body maximal/explosive strength and adult health [22,23], more high-quality studies are required to analyse the criterion-related validity of these tests.

#### 4.2.4. Levels of Evidence

Strong evidence indicated that (a) the handgrip strength test with the elbow extended and with the grip span adapted to the hand size and sex (using the TKK dynamometer) is a valid test for assessing hand maximal isometric strength; and (b) the Biering–Sørensen test is a valid test for assessing endurance strength of hip and back muscles. Moderate evidence indicated that handgrip strength (Jamar) has acceptable validity for assessing hand maximal isometric strength. Limited evidence was found in three cases: (a) due to the low number of high-quality studies (only one study), limited evidence supported the use of prone bridging for assessing abdominal endurance strength and the Sargent jump test for assessing lower body explosive strength; (b) due to the inconsistent results found in multiple high-quality studies, limited evidence was found for the validity of handgrip strength (DynEx) for assessing hand maximal isometric strength; and (c) due to the consistent results found in multiple low or very low-quality studies, limited evidence indicated that the curl-up test, or its different versions, is not valid for assessing abdominal endurance strength.

#### *4.3. Flexibility*

Radiography seems to be the best criterion measurement of flexibility, but goniometry is also used as a criterion measure [108,109].

Goniometers are relatively easy to obtain; nevertheless, their use requires a certain technical qualification since it is a sensitive method, and thus it is not feasible for use in all settings [110]. Traditionally, the sit-and-reach test, originally designed by Wells and Dillon [111], and its different versions, are included in the fitness test batteries for measuring hamstring and lower back flexibility, which are probably the most widely used measures of flexibility [27].

Mayorga et al. [31] performed a meta-analysis to analyse the criterion-related validity of the sit-and-reach and its different versions (modified sit-and-reach, back-saver sit-and-reach, modified back-saver sit-and-reach, V sit-and-reach, modified V sit-and-reach, unilateral sit-and-reach and chair sit-and-reach). These tests showed moderate validity for estimating hamstring extensibility, but low validity for estimating lumbar extensibility. They also found that the classic sit-and-reach test had the highest criterion-related validity coefficient for both hamstring and lumbar extensibility compared to the other tests, which does not seem to justify modifying the classic protocol in order to solve the problems attributed to it (i.e., the length proportion between the upper and lower limbs or the position of the head and ankles).

The toe-touch test is another field-based test for measuring hamstring flexibility, in which individuals are assessed standing instead of sitting on the floor [112]. Although this test is easy to administer and can be an alternative to the sit-and-reach test when the participant has problems being measured sitting, it is not included in any field-based fitness test battery. A meta-analysis [29] analysed the criterion-related validity of the toe-touch test for measuring hamstring flexibility, reporting similar validity coefficients to those of the classic sit-and-reach.

It is interesting to highlight that Nuzzo [113] has recently suggested that flexibility should be invalidated as a major component of fitness, due to its lack of predictive and concurrent validity in terms of meaningful health and performance outcomes.

#### Levels of Evidence

Strong evidence indicated that (a) the sit-and-reach test and its modified versions have moderate validity for estimating hamstring extensibility, but low validity for estimating lumbar extensibility; and (b) the toe-touch test has moderate validity for estimating hamstring extensibility.

#### *4.4. Motor Fitness*

The validity of motor fitness tests is the least studied in adults. None of the three studies that analysed the criterion-related validity of motor fitness tests was classified as high quality. Given that motor fitness tests (i.e., gait/walking speed, balance, timed up and go) are associated with all-cause mortality [114–116], falls and fractures [117], disability in activities of daily living [118] and depression [119], it would be useful to know their criterion-related validity.

#### Levels of Evidence

Due to the consistent results found in multiple low-quality studies, we found limited evidence that the ten-step test had moderate validity in assessing agility.

#### **5. Conclusions**

The systematic review highlighted several major points regarding the criterion-related validity of adult field-based fitness tests (Figure 2):

**Figure 2.** Major points regarding criterion-related validity of adult field-based fitness tests.

Cardiorespiratory fitness: the 20 m shuttle run test, using Leger's equation, best assessed cardiorespiratory fitness. Alternatively, the 1.5-mile and 12 min run/walk tests and the YMCA step test were other cardiorespiratory testing options. When cardiorespiratory fitness was low, or if running was not possible, the 2 km walk test with Oja's equation or the 6 min walk test were appropriate alternatives.

Muscular strength: strong evidence indicated that (a) the handgrip strength test, with the elbow extended and with the grip span adapted to the individual's hand size (using the TKK dynamometer), offers a valid means to assess hand maximal isometric strength; and (b) the Biering–Sørensen test estimated the endurance strength of hip and back muscles. Limited evidence (only one study) supported the prone bridging and Sargent jump tests as abdominal endurance strength and lower body explosive strength surrogate markers, respectively.

Flexibility: strong evidence indicated that the sit-and-reach test and its different versions, and the toe-touch test, are not valid for assessing hamstring and lower back flexibility.

Motor fitness: limited evidence existed regarding the criterion-related validity of motor fitness tests.

When there are problems of space and time, as in clinical settings, the YMCA step and the handgrip strength tests are good alternatives for assessing cardiorespiratory fitness and isometric muscular strength, respectively.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/jcm10163743/s1: Supplementary Tables and Supplementary Material.

**Author Contributions:** J.C.-P. and M.C.-G. conceived the study idea. J.C.-P. led the writing of the review and carried out methodological aspects with N.M.-J., F.M.-A. and J.R.F.-S.; F.M.-A., V.S.-J., R.I.-G. and J.R.R. contributed to the writing, review and editing of the final manuscript. All authors discussed the results and contributed to the final manuscript, and agreed with the order of presentation of the authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** This project was supported by the Ministry of Economy, Industry and Competitiveness in the 2017 call for R&D Projects of the State Program for Research, Development and Innovation Oriented to the Challenges of the Company; the National Plan for Scientific and Technical Research and of Innovation 2017-2020 (DEP2017-88043-R); and the Regional Government of Andalusia and University of Cadiz: Research and Knowledge Transfer Fund (PPIT-FPI19).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

