*2.1. Study Area*

In this study, we considered ten counties (Allegan, Barry, Eaton, van Buren, Kalamazoo, Calhoun, Berrien, Cass, St. Joseph and Branch) covering 28,281 km<sup>2</sup> in southwestern Michigan (Figure 1a). The study area is part of the US Corn Belt with a subhumid climate [20], with annual precipitation ranging from 590 mm (in the 2012 drought) to 858 mm during the study period 2001–2016 (Figure 1b, [21]). Short term droughts, common in this region, induce plant water stress and reduce grain yields of corn and soybeans, which account for 45% and 32% of total agricultural area, respectively [22]. During extended drought periods, the sandy soils prevalent in this region cannot store sufficient soil water to allow crops to reach full yield potential.

**Figure 1.** (**a**). The location (inset) and remotely sensed crop types (CDL, [22]) of the study area, including ten southwestern Michigan counties. (**b**) Precipitation from 15 June to 31 July (blue), and the remainder of year (gray) in the study area.

#### *2.2. Basic Remotely Sensed, Land Surface Model, and Climate Input Data*

We used a variety of time-varying and static input data for the random forest (RF) classification model (summarized in Tables 1 and 2). The static input variables describe terrain, soil, and geographic location. Dynamic inputs are derived from remote sensing and climate data, as well as land surface model output. For most time varying data, we focus on the June 15th to July 31st period, which is the time before canopy closure occurs in corn and soybeans to avoid reflectance saturation. Data were either obtained from, or uploaded to, the Google Earth Engine (GEE) cloud computing platform [23] for classification.



**Table 2.** Weather-sensitive remote sensing, spatial anomaly and composite indices.



**Table 2.** *Cont.*

We included five temporally static input variables to account for possible relations between yield and terrain attributes, soil properties, crop characteristics, and geographic information. These variables are: (1) slope calculated from 30-m National Elevation Dataset (NED) Digital Elevation Model (DEM) [24], (2) soil available water capacity (AWC, field capacity minus wilting point), (3) vertical saturated hydraulic conductivity (kSat), (4) latitude (lat), and (5) longitude (long). The AWC and *kSat* are based on the top 25 cm soil properties provided by the USDA Soil Survey Geographic Dataset (SSURGO) Web Soil Survey [17]. For each SSURGO map unit polygon, we calculated depthand component-fraction weighted averages of all soil horizon textures (%sand, %clay) within the top 25 cm. We then used the ROSETTA database [25] to relate these textures to estimates of field capacity, wilting point, and soil hydraulic conductivity. Vertical hydraulic conductivity was calculated for each component as the harmonic mean of individual horizontal saturated conductivities.

Climate inputs were derived from daily 4-km resolution PRISM [21] and Gridded Surface Meteorological dataset (GRIDMET) [26]: (1) precipitation, (2) aridity calculated as the ratio of growing season rainfall to potential evapotranspiration (PET), (3) average Palmer Drought Severity Index (PDSI), (4) dryspell (maximum consecutive days with less than 5 mm precipitation), (5) average daily maximum VPD (vapor pressure deficit), (6) daily mean temperature, (7) heatwave (maximum consecutive days with daily mean temperature above 25 ◦C, (8) GDD (growing degree days = cumulative degree obtained from the difference between air temperature and base temperature for corn and soybeans, 25 ◦C in this study), and (9) as a measure of pre- and in season-wetness, we calculated the total precipitation before June 15th (p\_early) and from June 15th to Jul. 31st (p\_sum), respectively.

Irrigation decisions are often based on soil water content [8]. Here, we use the root zone soil water content at noon from NLDAS-Noah with 1/8◦ spatial resolution and hourly time step [27], which is currently the best readily available product at regional scale that has sufficiently fine temporal resolution for our application. The NLDAS-Noah product does not implement irrigation, thus its soil water content data serves as a reference that represents wetness under rainfed conditions.

We used remote sensing data from Landsat Surface Reflectance Products at an 8-day interval in all years except 2012, when there was a 16-day interval since only Landsat 7 ETM+ was operational. We included all scenes within 5 days of the June 15th to July 31st key growing season period. The actual number of available scenes during this period varies spatially and inter-annually as the 8-day (16-day in 2012) return interval is simultaneously reduced by cloud coverage and augmented by overlapping scene edges. From 2001 to 2016, the average number of valid observations among pixels in the study domain varied from 1.67 (2012) to 6.35 (2001) (Table S1, Supplementary Material). All Landsat 7 images collected after May 31, 2003 have data gaps due to the Scan Line Corrector (SLC) failure, which leads to significant data shortage in 2012. We used a moving window average method to fill in the gaps caused by the SLC failure. For every pixel within a gap, we set its value as the average within a five pixel by one pixel rectangle, oriented perpendicular to the scanline.

After filling the data gaps, we extracted the thermal, near-infrared, short-wave infrared, red, green, and blue bands from Landsat images between June 10th and August 5th and calculated NDVI, EVI, GI and NDWI. We then created statistical composites from the available imagery following a best pixel approach [28] to generate mean, maximum, minimum and range (i.e., maximum subtracted by minimum) composites for all four indices and the thermal band.

#### *2.3. Weather-Sensitive Scene Selection, Spatial Anomaly Calculation, and Novel Composite Indices*

Initially we applied the methods from previous studies (i.e., [12–14]) to construct a RF classifier for this region, but found the accuracies were inadequate. We thus developed and tested a variety of approaches, including several that we deemed unsuccessful because they did not increase the accuracy of the RF classifier. Ultimately, we implemented three methods to create layers beyond the basic remotely sensed, land surface model (LSM), and climate indices described in Section 2.2.

First, we calculated two composite indices recently proposed in [13] that adjust GI in an attempt to account for regional variations in available water. The first composite variable is the product of maximum GI and growing-season mean NDWI (i.e., water-adjusted green index, WGI); we anticipate that for irrigated fields GI and NDWI will both be high. A pixel with high GI but low NDWI may be rainfed (low NDWI) but treated with abundant nutrients (high GI). The other composite variable is maximum GI divided by seasonal aridity (i.e., aridity-normalized green index, AGI). The rationale behind AGI is that irrigated fields should have high GI even during relatively dry growing conditions.

Second, we developed weather-sensitive remote sensing indices calculated from Landsat scenes at the time most favorable for distinguishing irrigated from rainfed fields. We assumed this time immediately follows a dry period identified based on three separate criteria: (1) maximum consecutive days with monotonically descending root zone (up to 1-m depth) soil water content, (2) lowest PDSI of the season, and (3) greatest number of consecutive days with daily precipitation less than 5 mm. The criteria were calculated using climate and LSM model outputs listed in Table 1. The resulting input variables are denoted with suffix \_SM, \_pdsi, \_ppt, respectively. Irrigated crops generally exhibit higher vegetation indices and NDWI than rainfed crops [13,14,27,29,30]; we expect that this difference is amplified under water stress conditions during dry periods. Further, the three-day average vapor pressure deficit (VPD) before the day of maximum Landsat GI (VPDMaxGI) and number of consecutive days with rainfall not exceeding 5 mm before maximum GI day (dryspellMaxGI), were calculated as rainfed crops are unlikely to exhibit maximum GI when VPD is high or after a dry period.

Third, we calculated spatial anomaly remote sensing indices to better distinguish irrigated from rainfed fields. We first calculated the neighborhood percentiles (40% and 90%) of the vegetation indices using a circular kernel with a 10-km radius for every year. This radius was selected based on the range of a spherical fit to the empirical variogram of the climate and LSM model outputs. Two percentiles were selected to provide anomalies that would be useful in areas where irrigation is relatively sparse (where an anomaly relative to the 90%would be more appropriate), and where irrigation is predominant (similarly, where the 40% might indicate irrigated fields). We then subtract the neighborhood percentiles from the vegetation indices to produce annual anomaly maps; resulting input variables are denoted with suffix \_p40 and \_p90, respectively. A positive value points to a higher-than-neighbor vegetation index under similar climate conditions, which we expected to be related to irrigation activity.

All together, the basic remote sensing, climate, and LSM simulated indices (Section 2.2) and the weather-sensitive remote sensing, spatial anomaly, and composite indices comprise 98 input variables of the RF classifier. A complete list of these variables is provided in Table S2, Supplementary Material.

Several "failed" attempts to define improved indices were made, and then abandoned based on lack of improvement in classification accuracy. We provide these here as information for others seeking to further the work of humid region irrigation remote sensing. A full list of these variables is included in Table S3, Supplementary Material. Many of these variables were extracted from MODIS products. Due to short overpass time, MODIS products are less subject to cloud coverage than Landsat products. We expected that MODIS thermal bands, terrestrial evapotranspiration (ET), and potential evapotranspiration (PET) estimates [31] would provide valuable information to identify irrigation activity [32]. We thus used monthly statistics of these products as well as calculated composite indices, including the difference between precipitation and MODIS ET, and ET divided by VPD. Another climate index that we tested is temporal anomaly of precipitation (annual precipitation subtracted by the multi-year average precipitation). We also derived the monthly ratio of vegetation indices

such as maximum GI in July divided by maximum GI in June, ratio among vegetation indices such as GI divided by EVI, and GI divided by NLDAS-Noah soil water content. Our preliminary results suggested that these indices did not improve classification accuracy on our validation points and were thus not used to generate the final results. Likely reasons include the coarse spatial resolution of MODIS and climate products as well as possible errors embedded in these products. These may be useful in areas with larger fields than the pilot study area.

#### *2.4. Random Forest Classifier*

We use a random forest (RF) machine learning algorithm to inductively build a classifier of irrigated versus rainfed areas. We selected RF because the algorithm was successful in various hydrologic and remote sensing applications (e.g., [13,14,33–35] is robust with relatively large number of inputs, provides input variable importance measures and probabilistic outputs [36], and is supported in the GEE cloud computing platform.

A random forest is comprised of an ensemble of decision trees. Given a set of training data {xi, yi}, i = 1, ... , n, where xi denotes input variables, and yi is the corresponding output. In this study, yi is a categorical variable with two classes: irrigated and rainfed. The algorithm randomly draws *n* samples with replacements from the training dataset to train a single tree. The process is repeated N times, resulting in a forest of N trees. Once trained, each tree predicts the class of a new data point, and the N trees may predict M classes. The RF algorithm outputs the percentage of trees that provide a prediction of the M classes. The class that receives the highest probability is the final prediction.

In the mapping process, a composite of images is created for each year as input data layers (Tables 1 and 2, Table S2, Supplementary Material). For each year since 2007, the composite is masked using the Cropland Data Layers (CDL, [22]) to keep only corn and soybean fields for this region. For years before 2007, the composite is masked using the National Land Cover Dataset (NLCD, [37]), the primary available product for the study region. The NLCD-based crop mask includes all row crop fields because NLCD does not distinguish among row crops. The trained RF classifier is then applied to input composites and labels each pixel as either irrigated or rainfed. In this way, we develop irrigation probability maps for every year from 2001 to 2016. The probability value ranges from 0 to 1, with higher values suggesting larger likelihood of irrigation activity in the pixel. A pixel is classified as irrigated if it receives a probability greater than 0.5, and as rainfed otherwise. The resulting binary maps are postprocessed in two steps. Due to cloud coverage, 2014 and 2015 have 8.2% and 5.7% pixels, respectively, with no Landsat scene from June 10th to August 5th. The gap pixels in 2014 are labeled irrigated if it was classified as irrigated in both 2012 and 2013. Similarly, the gap pixels in 2015 are labeled irrigated if it is classified as irrigated in the 2013 map and gap-filled 2014 map. In the second step, all pixels that are classified as irrigated only once during 2001–2015 are labeled as rainfed as it is unlikely that farmers will irrigate only one time due to high infrastructure costs. We then examine the final irrigation maps to track the spatial extent and the changing irrigation dynamics.

During training, the RF algorithm also calculates a variable importance score based on the total decrease in node impurities by splitting on the variable, averaged over all trees [38]. The variable importance scores provide a measure of the relative importance of each input variable for capturing the spatiotemporal irrigation pattern. In this study, we used all 98 input variables described in Sections 2.2 and 2.3, and used the RF calculated variable importance measure to draw insights into the data worth of various indices in similar irrigation mapping applications. We note that RF algorithms are robust with the presence of a large number of inputs. Depending on specific applications, and especially when using other machine learning algorithms that are less robust to high input number, more sophisticated feature selection techniques (e.g., [34,39]) can be used to constrain the input space dimension.

#### *2.5. Manually Labeled Dataset*

The RF classifier is built based on a training dataset compiled from high-resolution aerial photography that was acquired during growing seasons (National Agriculture Imagery Program, NAIP [40]) from the study area. Approximately half of the training points are sampled in two representative years, 2012 and 2014. 2012 is known as a dry year in which limited rainfall induced water stress during the crop growth period. We expect that the irrigated crops are more productive than rainfed crops in this year, and the difference should be reflected in our selected indices. Therefore, 2012 represents an "easy" irrigation mapping case for the classifier. On the other hand, the 2014 growing season receives plenty of precipitation and thus represents a challenging case for the algorithm. In addition, we sampled 100 locations across multiple years (2005, 2006, 2009, 2010) to track shifts between rainfed and irrigated fields. The locations of the data points were randomly sampled after applying an agriculture land cover mask. From 2001 to 2006 the mask is derived from the NLCD and included pixels categorized as cultivated crops. Since 2007 when CDL was first available in the study area, the masks include pixels labeled as corn and soybean fields in CDL.

Through the GEE cloud computing platform, we manually labeled each point as either irrigated or rainfed based on multiple lines of evidence, including the presence of visible irrigation infrastructure, high vegetation indices, and limited water supply from remote sensing and climate data. The presence of irrigation infrastructure, primarily central pivot irrigation systems, is identified from NAIP images. When such infrastructure is identified, we examine the time series of vegetation indices, NDWI, precipitation, and NLDAS-Noah root zone soil water content to estimate whether a particular location is irrigated. As described previously, the vegetation indices and NDWI are derived from available Landsat scenes, with precipitation data from PRISM. A data point is discarded when a decision cannot be made. In total, the manually labeled dataset include 1536 data points (locations in Figure 2).

**Figure 2.** Training points are randomly generated, scattered in crop areas. One-time training points (green triangles) are generated for 2012 (dry) and 2014 (wet) years. Additional points (red dots) are generated for 2005, 2006, 2009, 2010, 2012 and 2014.

#### *2.6. Classification Accuracy Assessment*

We evaluated the accuracy of irrigation mapping using two validation data sources. First, we randomly divided the manually labeled dataset into training (80%) and validation (20%). We trained a random forest on the training dataset, and then tested its performance on the validation data points. To reduce the effects of random sampling, we repeated this sample-and-train process 20 times. We note that some of the remote sensing and climate information are used both in the manual labeling of the validation dataset as well as inputs to the random forest classifier. Such overlap may favorably influence classifier performance evaluation on the manually labeled validation dataset. However, the manually labeled reference dataset primarily relies upon visual cues in the NAIP high-resolution imagery, which was not included in the RF classification. Overlapping datasets provided only supporting evidence for manual labeling. Therefore, the validation points still provide valuable insights into irrigation mapping accuracy, especially given the lack of ground truth data. As a second, independent assessment, we calculated the total irrigated area of corn and soybean for each county in the study domain. The results are compared with county-wise statistics of the two crops available through NASS Agricultural Census [2] in the available years (2002, 2007, 2012).

For comparison, we also evaluate the accuracy of MIrAD-US, a national, 250-m resolution irrigation dataset available in 2002, 2007 and 2012 [12], using the manually labeled dataset. The MIrAD-US product identifies irrigated pixels when MODIS annual peak NDVI exceeds a threshold, which is adjusted for each county such that the resulting total irrigated acreage agrees with the USDA NASS statistics. We compare MIrAD-US label (irrigated versus rainfed) with the manually labeled dataset (Section 2.5), with error rates reported in Section 3.1. The MIrAD-US error rate is then compared with RF classification validation error for data points in 2012 (no training data is generated in 2002 and 2007 due to lack of NAIP images), averaged over 20 repeated experiments.

#### **3. Results and Discussions**

We developed annual irrigation probability maps for 2001 to 2016 (Figure 3) by integrating Landsat remote sensing imagery and hydroclimatic variables in a RF analysis as discussed above. Figure 3 shows the irrigation probability maps for a 2.7 by 3.2 km area in 2012. Comparing the irrigation maps with NAIP imagery (Figure 3b,c), the RF classifier can identify irrigation with detailed sub-field spatial pattern, which national products such as MIrAD-US cannot capture due to its coarse resolution (Figure 3e). This is important, as small fragmented fields are common in the study area. It is important to note that NAIP imagery was only used to label training points and not included in the input data to produce the irrigation maps. Figure 3c shows the green indices calculated from the Landsat scene that immediately follows a dry period (GI\_ppt), which is identified as the longest consecutive days with daily precipitation less than 5 mm (Section 2.4). This variable is the most important input variable according to RF important score (see Section 3.2 for more details).

**Figure 3.** (**a**) Map of fraction of years classified as irrigated since earliest year irrigated according to the fandom forest (RF)-based annual irrigation maps spanning 2001–2016. For example, a pixel that is irrigated every year since the start of irrigation in 2012 will receive a fraction of 1.0. 2012 insets of (**b**) NAIP aerial image showing irrigated farms with varying sizes, (**c**) GI calculated from the Landsat scene that immediately follows the largest dryspell (GI\_ppt), (**d**) random forest-based irrigation probability map with 30-m resolution and (**e**) MIrAD-US irrigation map with 250-m resolution. Images (**b**–**<sup>e</sup>**) are for 2012. Areas not classified as corn or soybeans (USDA-NASS, 2016) are shown in dark.
