**1. Introduction**

Phosphorus (P) losses from croplands are a major driver of hypoxia and harmful and nuisance algal blooms in waterbodies worldwide [1,2]. Dissolved reactive P (DRP) is the P fraction that primarily drives these algal blooms as it is readily bioavailable, though other fractions of total P (TP) also contribute bioavailable P and so play a secondary role [3]. Phosphorus lost from croplands can be directly derived from recent fertilizers (i.e., incidental P) [4] or from P stored in soil pools (i.e., legacy soil P) [5], but recent research suggests legacy soil P is the dominant P source in the Western Lake Erie Basin (WLEB) [6]. Previous studies have shown that DRP and TP are readily transported via both the artificial subsurface tile drainage network (i.e., tile drainage) as well as overland surface runoff in the

WLEB [7–9]. Thus, to improve predictions of environmental P losses it is necessary to further our understanding of how soil P pools relate to surface and subsurface edge-of-field P losses.

Soil P is routinely measured in agricultural systems as a component of agronomic soil testing to estimate crop fertility requirements [10]. Agronomic soil testing was designed to estimate potential crop nutrient demand and yield response to fertilizer applications, so sampling depths typically target the crop's primary rooting zone. For example, in Ohio, Michigan, and Indiana, USA, the 0–20 cm depth is the typical agronomic soil testing depth and is the basis of university extension fertilizer recommendations [11]. Results of these tests are also useful for predicting the potential risk of P loss; P risk assessment tools typically use STP as a risk prediction factor [12–14], and previous studies have reported positive relationships between agronomic soil test P (STP) and P concentrations in surface runoff and tile drainage [15–20]. However, agronomic testing programs were not designed to monitor potential water quality impacts of soil P, so modifications to these agronomic sampling approaches hold the potential to improve soil testing for environmental purposes.

Depth of sampling is one component of agronomic soil testing that could be adapted to better understand environmental P losses. Soil P can be highly stratified with greater P concentrations in the surface layer, particularly in no-till or reduced tillage systems [21–23]. In the WLEB, stratification of soil P in croplands was shown to be prevalent across the Sandusky river watershed, and was hypothesized to be a contributing factor to increasing DRP concentration trends in surface waters [18]. The shallow surface soil horizon is the dominant zone of interaction between soil P and surface runoff water and thus has a disproportionate influence on surface runoff P concentrations [24–27]. In contrast, water discharged via subsurface tile drains passes through the soil matrix thereby expanding the zone of interaction and increasing opportunities for P sorption or assimilation. However, the interaction between water and the soil matrix decreases where preferential macropore flow dominates, particularly in medium- and fine-textured soils, resulting in elevated P concentrations in tile discharge [28–31]. In these situations, tile drainage water chemistry may largely reflect the surface soil characteristics. Stratified soil sampling that quantifies soil P in the shallow zone of interaction (0–5 cm) in addition to the agronomic depth (0–20 cm) has the potential to better predict environmental P loss in surface runoff and tile drainage as compared to traditional soil sampling approaches that only quantify soil P in the agronomic depth.

The objective of this study was to determine whether a stratified soil sampling regime could explain more variability in environmental P losses than traditional agronomic depth samples. Stratified soil sampling was conducted on 39 fields distributed throughout NW Ohio, USA instrumented with edge-of-field (EoF) water quality monitoring, which enabled examination of relationships between STP and EoF P losses across a broad range of soil characteristics and management regimes. The hypothesis tested was that STP from shallow (0–5 cm) soil samples will better predict DRP and TP concentrations in both surface runoff and tile drainage compared to STP from agronomic (0–20 cm) soil samples. In addition, repeated soil sampling in individual fields enabled assessments of changes in STP and P stratification ratio (0–5 cm STP/0–20 cm STP; Pstrat) over the course of a 3-year study period.

#### **2. Materials and Methods**

#### *2.1. Experimental Sites*

Edge-of-field water quality monitoring was used to assess relationships between STP and surface runoff and tile drainage P concentrations across a network of 39 crop fields in the NW quadrant of Ohio underlain with artificial subsurface tile drainage systems. Field locations and management were previously described in detail [32,33]. Soils were medium to fine textured, with drainage classification of somewhat poorly drained to very poorly drained. Fields were nearly level to gently sloping (average slope range 0.4–5.1%; mean 1.6%), and were generally representative of regional topography, soils, and management practices (i.e., nutrient management, tillage, and subsurface drainage). General soil characteristics (textural class, slope, pH, soil organic matter) of the fields in this study are presented

in Table S1. Field management was performed by farmers and followed common practices in the region. Soybean (*Glycine max*, (L.) Merr)-corn (*Zea mays*, L) rotations were the primary cropping system, but winter wheat (*Triticum aestivum*, L.), alfalfa (*Medicago sativa*, L.), and winter cover crops (various species) were also included in a subset of fields. Tillage practices ranged from multiple tillage passes each year to long-term no tillage. Soil fertility management typically consisted of springtime N fertilizer applications to corn and wheat and fall broadcast applications of P and K fertilizers once per crop rotation. Starter fertilizers containing N, P, and K were commonly applied at crop planting. Additionally, several fields received manure applications within 2 years of this study.

#### *2.2. Runo*ff *Phosphorus Concentrations*

Surface and subsurface water quality was monitored from outlets at the edge of fields; sampling approach and instrumentation were described in depth by Williams et al. 2016 [32]. Contributing drainage areas for surface and subsurface flow were between 1.1–18.5 ha (mean = 8.1 ± 4.6 ha). Water quality data were collected via automated measurement of flow rate at 10-min intervals and automated collection of water samples using both time- and flow-weighted basis. Tile drain flow was monitored with compound weirs (Thel-Mar, Brevard NC, USA) and bubbler modules (Teledyne ISCO, Lincoln NE, USA), while surface runoff was measured with 0.6 m H-flumes (Tracom, Alpharetta GA, USA) and bubbler modules. Water samples were collected with Teledyne Isco (Lincoln, NE, USA) automated samplers. Event-based water samples were automatically collected on a flow-weighted basis from surface flumes for the entire period of record. For tile drainage and prior to 2015, samples were automatically collected on a time interval approach (aliquots collected every 6 h and composited for a 24 h period). Starting in 2015, the time interval samples were supplemented with additional event-based samples collected over the rise and fall of the hydrograph to better represent the discharge events. Event sampling was triggered by increased discharge, with a 200 mL aliquot taken for each 1 mm of volumetric depth. Ten aliquots were combined into a single 2 L sample. Event sampling ceased when flow ceased (surface runoff) or when flow declined to baseline levels (tile drainage). Water samples were retrieved from the field at least weekly and were stored at 4 ◦C until laboratory analysis.

Water samples were analyzed for both DRP and TP concentrations. Briefly, samples were split into two aliquots, with one filtered at 0.45 um for DRP analysis, and an unfiltered sample used for TP analysis. The unfiltered sample underwent alkaline-persulfate digestion prior to TP analysis [34]. The filtered (DRP) and digested (TP) samples were analyzed for orthophosphate concentration using a flow injection analyzer (Lachat Instruments, Loveland, CO, USA) via the ascorbic acid reduction method. The resulting discrete P concentration data were converted into 10 min P concentrations by linear interpolation. The 10 min constructed P concentration values were then multiplied by the corresponding 10 min flow values to calculate 10 min P loads [35]. Resulting load data were summed into daily cumulative P loads. Daily estimates of flow and P load were summed into total flow and P load for the relevant period of each soil sampling event (described below). The total P load over the period of a given soil sampling event was then divided by total flow from that period to calculate the flow-weighted mean DRP (FWM DRP) and TP concentrations (FWM TP) associated with the soil sampling event.

#### *2.3. Soil Test Phosphorus*

Stratified soil samples were collected from contributing field areas of each monitored outlet between December 2014 and December 2017. The frequency of soil sampling events within a given field depended on crop rotation, establishment of EoF water quality monitoring instrumentation, and resource availability; 14 EoF sites were sampled on three occasions, 24 sites were sampled twice, and two were sampled once. The surface runoff dataset included a total of 52 individual soil sampling events, while 86 soil sampling events were included in the tile drainage dataset. Each soil sampling event consisted of taking three to nine samples at discrete locations within a given field. Sampling locations were selected based on USDA-NRCS soil maps and local topography to ensure

the sampling captured the variability in soils across the areas contributing to discharge at the field outlets. On average, one sample was collected for each 1.5 ha of contributing field area. Individual soil sampling locations were somewhat consistent from year to year, but limited precision of GPS coordinates meant subsequent samples were likely >10 m apart. Samples collected from 2014–2016 were taken with hand-held push probes (2 cm diameter) and the 2017 samples were collected using a hydraulic soil probe (5 cm diameter). At each location, five individual cores, distributed within ~2 m of a central point, were collected, split into 0–5 cm and 5–20 cm depths, and combined into one sample. Soil samples were air dried, ground, and analyzed for STP with Mehlich-3 extractant by the Ohio State University Service Testing and Research laboratory (Wooster, OH, USA). A simple average of the discrete STP data was used to estimate the field average STP concentration values, and within field variation in STP was characterized with the coefficient of variation (CV). The field average STP data were used to calculate the *Pstrat* for each field according to Equation (1):

$$P\_{\text{strat}} = \frac{[\text{STP}]0 - 5 \text{ cm}}{[\text{STP}]5 - 20 \text{ cm}} \tag{1}$$

Management information was used to assign a range of dates for each soil sampling event for which the STP measurements were considered most relevant for predicting EoF runoff P concentrations. The period of relevance for a given soil sampling event could extend up to one year before and after the sampling date, which would correspond to soil sampling occurring once in a 2-year crop rotation. This period was truncated to less than 2 years if the field was subjected to either a tillage operation or a P fertilizer or manure application as these operations were expected to alter the STP concentrations and *Pstrat*. Additionally, if a P fertilizer or manure application occurred prior to the soil sampling event, the first 2 weeks following the application were also excluded from date range to restrict the influence of short-lived direct P fertilizer losses [4]. Thus, no P applications or tillage operations occurred during the period of relevance for the soil sampling events. Resulting lengths of the date ranges were from 159 to 673 days, with an average length of 384 days.

#### *2.4. Statistical Analysis*

Relationships between STP and FWM P concentrations were examined with ordinary least squares regressions. Prior to regression analysis, the FWM DRP and TP concentrations were natural log transformed to comply with normality assumptions. The goodness of fit of the resulting regressions was assessed using the coefficient of determination (R2) and root mean square error (RMSE; the standard deviation of the model residuals). The field average STP and the maximum STP value in each field were both tested as predictors of FWM DRP and TP, and regression model fits were compared. Residuals from the field average agronomic STP-FWM P concentration regressions were extracted and subsequently correlated with Pstrat, with Pearson's *r* used to assess correlation strength. Multiple linear regressions were constructed in a stepwise manner from two predictor variables: agronomic STP and Pstrat. These predictor variables were checked for multicollinearity. The improvement in model fit provided by the addition of Pstrat to the agronomic STP model was further examined by extracting residuals from the simple regressions of agronomic STP vs. FWM-P concentration, as well as the multiple linear regression. For a given observation, the magnitude of the two model residuals were compared and the difference between residuals was analyzed for simple linear relationships to soil textural class, field average slope, agronomic STP, and average daily discharge.

In fields with more than one soil sampling event, the influence of management practices on changes in STP and Pstrat within the fields was examined. The effects of management factors on changes in STP and Pstrat (e.g., STP in soil sampling period 1—STP in soil sampling period 2) were tested with t-tests for three class variables (+/− P fertilizer, +/− manure, +/− tillage) and with ordinary least squares regression for the rate of P applied (only for fields that received manure or P fertilizer). Analyses were conducted in SAS v. 9.4 (SAS Institute, Cary NC, USA) and Sigmaplot (Systat Software, San Jose, CA, USA).
