#### *3.1. Soil Test P*

Shallow-depth (0–5 cm) STP averaged 61 mg P kg<sup>−1</sup> across the fields and agronomic-depth (0–20 cm) STP averaged 40 mg P kg<sup>−1</sup> for the tile drainage dataset (Table 1). Soil test P of individual fields ranged from 19 to 202 and from 12 to 150 mg P kg<sup>−1</sup> for the shallow and agronomic depths, respectively. Within-field variability of STP was high: the average CV was 32–39% across all sampling events and both depths, and 14% of the sampling events surpassed a CV of 50% (data not shown). The field-average Pstrat was 1.88, with large variation among the studied fields (range 1.18–3.35).
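As a brief illustration of the within-field CV reported above, the following sketch computes a coefficient of variation from a set of subsample STP values. The function name and the input values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the CV in percent: 100 * sample standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m

# Hypothetical STP subsamples (mg P/kg) from one field; not data from the study.
hypothetical_subsamples = [30.0, 40.0, 50.0]
cv_percent = coefficient_of_variation(hypothetical_subsamples)
```

A field whose CV exceeds 50% under this definition has a standard deviation larger than half of its mean STP.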

**Table 1.** Soil test P (STP) concentrations and P stratification ratios (Pstrat) across soil sampling events at 39 fields, and edge-of-field discharge and dissolved reactive P (DRP) and total P (TP) flow weighted mean (FWM) concentrations during the relevant sampling windows.


\* Surface runoff soil sampling windows *n* = 52; tile drainage *n* = 86. † P stratification ratio.

Soil test P and Pstrat within individual fields varied considerably between subsequent soil sampling events (Table 2). The agronomic STP in a given field increased or decreased between subsequent soil sampling events by >15 mg P kg<sup>−1</sup> in 12% of cases (average STP change: 9.4 ± 1.1 mg P kg<sup>−1</sup>). Shallow STP changed by >15 mg P kg<sup>−1</sup> in 23% of cases (average STP change: 13.6 ± 2.1 mg P kg<sup>−1</sup>). Averaged across all fields and sampling events, the absolute change in STP was an increase of 1.8 ± 1.7 and 4.3 ± 3.1 mg P kg<sup>−1</sup> for agronomic and shallow depths, respectively. Furthermore, in many fields the within-field variability in STP changed markedly between subsequent soil sampling events; for example, the CV for agronomic STP differed by >20% in 15 of the 48 soil sampling event comparisons (average CV change: 15.1 ± 2.1% for agronomic depth, 13.6 ± 1.8% for shallow depth). Similarly, Pstrat often demonstrated significant within-field changes, with 21% of cases changing by >0.50 between soil sampling events (average change: 0.43 ± 0.07).
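The change statistics above can be read as three different summaries of the same event-to-event differences. The sketch below assumes that the "average STP change" is the mean magnitude of change, that the reported increase is the mean signed change, and that the percentage of cases counts magnitudes above a threshold. This interpretation, the function names, and the input values are all assumptions for illustration.

```python
from statistics import mean

def consecutive_changes(stp_by_event):
    """Signed change between each pair of subsequent sampling events."""
    return [later - earlier
            for earlier, later in zip(stp_by_event, stp_by_event[1:])]

def summarize_changes(changes, threshold):
    """Mean magnitude, mean signed change, and share of |change| > threshold."""
    magnitudes = [abs(c) for c in changes]
    return {
        "mean_magnitude": mean(magnitudes),
        "mean_signed": mean(changes),
        "share_exceeding": sum(m > threshold for m in magnitudes) / len(magnitudes),
    }

# Hypothetical agronomic STP (mg P/kg) for one field over three sampling events.
hypothetical_stp = [40.0, 58.0, 50.0]
summary = summarize_changes(consecutive_changes(hypothetical_stp), threshold=15.0)
```

Because increases and decreases offset each other in the signed mean, it can be small even when the mean magnitude is large.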

**Table 2.** Summary of changes in soil test P (STP), STP coefficient of variation, and P stratification ratio (Pstrat) between subsequent soil sampling events in individual fields.


\* STP = mg P kg<sup>−1</sup>, coefficient of variation = %. † P stratification ratio.

Management practices occurring between soil sampling events did not consistently explain the observed changes in STP or Pstrat (Table S2). Manure application between soil sampling events was the only management factor that significantly influenced STP changes: shallow and agronomic STP were 13.3 and 10.6 mg kg<sup>−1</sup> greater, respectively, in fields with manure than in those without (*t* test; *t*-statistics −2.3 and −3.4, P = 0.029 and P = 0.0014 for shallow and agronomic depths, respectively). However, neither chemical P fertilizer application (*t* test; *t*-statistics 0.41 and 1.1, P = 0.68 and P = 0.26 for shallow and agronomic depths, respectively) nor the amount of P applied between soil sampling events (regression; *t*-statistics 1.04 and 1.04, P = 0.3 and P = 0.3 for shallow and agronomic depths, respectively) influenced changes in STP. Similarly, changes in STP did not differ between fields that were tilled between sampling events and fields that did not undergo a tillage operation (*t* test; *t*-statistics −0.35 and 0.01, P = 0.7 and P = 0.9 for shallow and agronomic depths, respectively). Additionally, changes in Pstrat were not related to the form or amount of P applied or to tillage (*t* tests and regression, *p*-values > 0.52 for all).
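For readers unfamiliar with the two-group comparisons above, a minimal sketch of a two-sample *t* statistic follows. It uses Welch's unequal-variance form, which is an assumption because the text says only "*t* test"; the function name and input values are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    sa = variance(sample_a) / na
    sb = variance(sample_b) / nb
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t_stat, df

# Hypothetical STP changes (mg P/kg) in fields without and with manure.
no_manure = [1.0, 2.0, 3.0]
with_manure = [2.0, 3.0, 4.0]
t_stat, df = welch_t(no_manure, with_manure)
```

With the groups in this order, a negative statistic indicates a larger mean change in the second group. Converting the statistic to a P value requires the *t* distribution, which is not in the Python standard library.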

#### *3.2. Surface Runoff and Tile Drainage Phosphorus Concentrations*

The average FWM DRP concentration in surface runoff across all soil sampling windows was 0.19 ± 0.02 mg DRP L<sup>−1</sup>, and the FWM TP concentration averaged 0.65 ± 0.05 mg TP L<sup>−1</sup> (Table 1). Phosphorus concentrations were lower in tile drainage, averaging 0.066 ± 0.008 mg DRP L<sup>−1</sup> and 0.28 ± 0.02 mg TP L<sup>−1</sup>. The FWM P concentrations varied widely between sampling windows; for example, DRP concentrations ranged from 0.02 to 0.66 mg DRP L<sup>−1</sup> for surface runoff and from 0.01 to 0.27 mg DRP L<sup>−1</sup> for tile drainage.
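A flow-weighted mean concentration is conventionally the total load divided by the total discharge over the window. The sketch below shows that calculation under the assumption that each sample concentration is paired with the discharge volume it represents; the function name and the values are hypothetical.

```python
def flow_weighted_mean(concentrations, volumes):
    """Return sum(c_i * v_i) / sum(v_i) for paired concentrations and volumes."""
    if len(concentrations) != len(volumes):
        raise ValueError("concentrations and volumes must have the same length")
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("total discharge volume must be positive")
    load = sum(c * v for c, v in zip(concentrations, volumes))
    return load / total_volume

# Hypothetical DRP concentrations (mg/L) and discharge volumes (L) for two samples.
hypothetical_drp = [0.1, 0.3]
hypothetical_volumes = [1.0, 3.0]
fwm_drp = flow_weighted_mean(hypothetical_drp, hypothetical_volumes)
```

Weighting by volume means that a high concentration during a large discharge event raises the FWM more than the same concentration during low flow.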
