Article

Low-Flow (7-Day, 10-Year) Classical Statistical and Improved Machine Learning Estimation Methodologies

by Andrew DelSanto 1,*, Md Abul Ehsan Bhuiyan 1,2, Konstantinos M. Andreadis 1 and Richard N. Palmer 1

1 Department of Civil and Environmental Engineering, University of Massachusetts, Amherst, MA 01003, USA
2 Climate Prediction Center, National Oceanic & Atmospheric Administration (NOAA), College Park, MD 20742, USA
* Author to whom correspondence should be addressed.
Water 2023, 15(15), 2813; https://doi.org/10.3390/w15152813
Submission received: 17 June 2023 / Revised: 23 July 2023 / Accepted: 1 August 2023 / Published: 3 August 2023
(This article belongs to the Special Issue A Safer Future—Prediction of Water-Related Disasters)

Abstract

Water resource managers require accurate estimates of the 7-day, 10-year low flow (7Q10) of streams for many reasons, including protecting aquatic species, designing wastewater treatment plants, and calculating municipal water availability. StreamStats, a publicly available web application developed by the United States Geological Survey that is commonly used by resource managers for estimating the 7Q10 in states where it is available, utilizes state-by-state, locally calibrated regression equations for estimation. This paper expands StreamStats’ methodology and improves 7Q10 estimation by developing a more regionally applicable and generalized methodology for 7Q10 estimation. In addition to classical methodologies, namely multiple linear regression (MLR) and multiple linear regression in log space (LTLR), three promising machine learning algorithms, random forest (RF) decision trees, neural networks (NN), and generalized additive models (GAM), are tested to determine if more advanced statistical methods offer improved estimation. For illustrative purposes, this methodology is applied to and verified for the full range of unimpaired, gaged basins in both the northeast and mid-Atlantic hydrologic regions of the United States (with basin sizes ranging from 2 to 1419 mi²) using leave-one-out cross-validation (LOOCV). The coefficient of determination (R²), root mean square error (RMSE), Kling–Gupta Efficiency (KGE), and Nash–Sutcliffe Efficiency (NSE) are used to evaluate the performance of each method. Results suggest that the performance of each method varies with basin size, with RF displaying the smallest average RMSE (5.85) across all ranges of basin sizes.

1. Introduction

Estimates of the magnitude and recurrence intervals of low-flow events on rivers and streams are a necessary input for many natural resources planning activities, including municipal, industrial, and agricultural planning [1]. Resource managers in the northeast and mid-Atlantic United States are specifically interested in 7-day, 10-year low-flow (7Q10) estimation for protecting aquatic species that may be impacted by water withdrawal, hydropower production, or discharge of wastewater. In addition, 7Q10 estimation is used in a variety of other design facets, including water quality management, water supply planning, cooling plant design, hydropower regulation, irrigation, recreation, and more [2]. Classical 7Q10 estimation in ungaged basins commonly relies on simple statistical models that are calculated at similar, gaged sites [3]. For example, the USGS’s widely used statistical estimation program StreamStats uses multiple linear regression equations derived in log space, calibrated on gaged sites, to estimate flow statistics at ungaged sites [4]. These regression equations make use of the concept of “stationarity”, i.e., the assumption that the statistical properties of streams do not change over time. Relatively recent studies have suggested that the climate and associated hydrologic processes no longer satisfy that assumption, exposing a weakness in the stationary modelling approach [5,6,7,8]. For instance, it is estimated that the southwestern United States is currently experiencing its driest 22-year period since 800 CE and approximately 20% of it can be attributed to recent anthropogenic changes [9]. In contrast, studies in the northeast have found both average baseflows and 7-day summer baseflows are increasing with statistical significance [10,11]. In the mid-Atlantic, Blum et al. (2019) found increasing 7Q10s in the northern part of the mid-Atlantic (New York, Pennsylvania) and decreasing 7Q10s in the lower mid-Atlantic (Virginia, Maryland), concluding that because of these trends, “using the most recent 30 years of record when a trend is detected reduces error and bias in 7Q10 estimators compared to use of the full record” [2]. Outside of trend detection, few statistical alternatives exist in practice to account for changing climatic conditions in statistical 7Q10 estimation in ungaged basins.
In addition to assuming stationarity, StreamStats’ 7Q10 estimation suffers from a variety of other drawbacks, including (1) lack of development in some states, (2) being applicable to only relatively small basins, and (3) minimal statistically significant input variables that vary greatly by state. Because the 7Q10 is an extremely common planning metric, many states rely on 7Q10 estimation for permits related to stream withdrawals and wastewater treatment. In 2019, the Connecticut Department of Energy and Environmental Protection was forced to change its permitting laws from using the 7Q10 to using a similar drought metric, the Q99 flow, because StreamStats 7Q10 estimation has not been developed for Connecticut: https://portal.ct.gov/DEEP/Water/Water-Quality/Triennial-Review-of-the-Connecticut-Water-Quality-Standards (accessed on 6 April 2021). The use of state boundaries to dictate homogeneous hydrologic areas also limits the amount of unimpaired data that is available to develop the regression equations. Developing these regression equations requires ample unimpaired training data in a homogeneous area of interest, which can sometimes be impossible to achieve in practice [12], as exemplified by the case of Rhode Island. Due to a lack of sufficient unimpaired, gaged basins in Rhode Island, some gages from Massachusetts and Connecticut (which itself does not have 7Q10 estimation developed) were used to develop the 7Q10 regression equations for Rhode Island [13]. This suggests that these regression equations may be expanded to cover larger geographic footprints that are not dictated by state lines, as data from other states are already being used to develop them. Classical 7Q10 estimation techniques also rely on regression equations that were only developed for relatively small basin sizes (i.e., <100 mi²), either state by state or localized to larger watersheds [4]. 
This helps maintain the homogeneity of the applicable area, which maintains the accuracy of estimates but does not allow for larger basin estimation and limits the ability to compare estimates between states, regions, and watersheds since different equations and input variables are used to make estimates in nearby states. In the extreme case, ungaged locations a few feet apart on the same stream but across a state border can utilize differing regression equations, which can result in different estimates. These equations rely heavily on the watershed area as the most significant variable, but the other variables vary significantly depending on the state. In some cases, watershed size is the only significant variable used to make 7Q10 estimates [14]. Many other studies in this area have attempted to apply landcover, climate, and topographical variables with varying levels of success [15,16,17,18]. One set of statistically significant input variables for the entire northeast and mid-Atlantic would allow (1) data augmentation where 7Q10 estimation has not been developed, (2) comparisons of 7Q10 estimates between states, and (3) better understanding of the input variables themselves, including potential sensitivity analyses that involve changing climate and/or landcover inputs.
Regression equations typically rely on multiple linear regression in log space (LTLR) rather than standard multiple linear regression (MLR) because Tasker and Stedinger (1989) [19] demonstrated that (log-transformed) GLS analysis is theoretically most appropriate and generally provides the best results when used for hydrologic regressions, which was then used in standard regression analysis of peak- and low-flow frequency statistics, such as the 100-year peak flow and the 7-day, 10-year low flow [20]. Applying a more advanced statistical method may allow for improved estimation, requiring less input data, detecting the subtle importance of additional variables, and maintaining accuracy over larger spatial footprints. Though machine learning has been used in hydrology for several decades, the application of this technique has accelerated with increased access to data and computational power [21]. Many recent studies have benefitted from machine learning to improve streamflow estimation, including using artificial neural networks (ANN), support vector machines (SVM), and random forests (RF) [22,23,24,25,26]. Studies have even demonstrated that machine-learning-based models, which are calibrated based on historical streamflow records in gaged basins, can produce more accurate streamflow predictions in ungaged basins than traditional process-based models [27]. Specifically for low-flow prediction, machine learning algorithms have been used to estimate low-flow indices [28], for direct low-flow prediction using random forest [29], and for evaluation of statistical methods in low-flow prediction [30]. Due to these successes, Nearing et al. (2021) state, “it is entirely possible that successful water resources and water hazard predictions might not require anything that looks even like a simple hydrology model in the future” [27].
To address the issues noted previously, this paper suggests three strategies to improve estimations of 7Q10 flow for the northeast and mid-Atlantic United States:
  • Develop a single, generalized methodology for 7Q10 estimation that is applicable to larger geographical regions (such as the northeast and mid-Atlantic regions of the United States). This methodology will make use of publicly available data as inputs, allowing resource managers to create accurate 7Q10 estimates in states where StreamStats 7Q10 estimation has not been developed or as an alternative to StreamStats where 7Q10 estimation has been developed;
  • Expand the range of applicable basin sizes to account for every gaged basin in the northeast and mid-Atlantic that has been determined to be unimpaired. StreamStats’ state-by-state 7Q10 estimation relies on regression equations that were only developed for small basins (i.e., <100 mi²) in most states, but unimpaired gaged basins in the study area range from 2 to 1400 mi². Our methodology is trained on every location, ensuring sufficient locations for model training and allowing the application of the method to a larger range of basin sizes than classical methods;
  • Include multiple landcover, climate, and topographical variables as inputs for estimation. The additional input variables will increase the accuracy of 7Q10 estimates over the large area of study, and the inclusion of landcover and climate variables will facilitate future sensitivity analyses related to changing landcover and climate variables in conjunction with physical hydrology models.
Additionally, we test three machine learning algorithms, random forest (RF) decision trees, neural networks (NN), and generalized additive models (GAM), against classical statistical methods for 7Q10 estimation (MLR and LTLR), which have been found to perform similarly in practice [31]. These machine learning methods are applied for three reasons: (1) they do not make assumptions about the underlying distribution of the data; (2) even though complex and essentially non-parametric, they are accurate and widely applied; and (3) they are relatively easy to implement, making them appealing for resource managers to use as alternatives to classical methods.

2. Data and Study Area

2.1. Study Area and Gages

The study area for this research is the northeast and mid-Atlantic United States, defined here as the states of Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut, New York, Pennsylvania, New Jersey, Delaware, Maryland, Virginia, and West Virginia. The USGS’s Hydro-Climatic Data Network, HCDN-2009 [32], was used to identify unimpaired streams in gaged watersheds of varying sizes and physical attributes in the study area (ranging from 2.1 mi² to 1419 mi², Figure 1). Data for these 106 stations from the HCDN were downloaded from the USGS Current Water Data for the Nation: https://waterdata.usgs.gov/nwis/rt (accessed on 18 December 2020). After reviewing the watershed size distribution of the HCDN sites, we determined it lacked a sufficient number of gages in extremely small watersheds (<30 mi²) for training data. In addition to the HCDN sites, 59 small sites in Massachusetts, determined to be sufficient for 7Q10 training data [20], were added to the training data for a total of 165 sites throughout the area of study. Appendix A includes a table of all sites used, and Figure 1 displays the corresponding watersheds.

2.2. Input Variables/Data

Daily precipitation and temperature data were extracted at each gage from the Livneh et al. (2015) [33] hydrometeorological dataset [34]. This dataset contains air temperature and precipitation data from approximately 20,000 weather stations monitored by GHCN-daily (U.S.), Environment Canada, and Servicio Meteorológico Nacional (Mexico) [33]. A minimum of 20 years of data was required for CONUS and Canadian stations, but due to the relative paucity of station data in Mexico, the authors followed the procedure recommended by Zhu and Lettenmaier (2007) [35], which requires a minimum of 50 valid days of data in any given year for a station to be included. From this station data, the data was interpolated using the SYMAP algorithm [36], which employs statistical methods such as clustering analysis, regression analysis, and correlation analysis to identify patterns and relationships within spatial data. After interpolation, the authors followed procedures for quality control by computing a monthly coefficient of variation based on the standard deviation of the daily values compared to their monthly mean and removed months with a ratio of less than 0.18, determined empirically using 25 stations from 7 states with at least 15 years of data [33]. This dataset is publicly available, with gridded climate variables at 1/16° horizontal resolution (~6 km) from 1950–2013 [34]. Static land data, including mean basin elevation, mean basin slope, forest and wetland percentages of the basin, and watershed area, were collected from USGS StreamStats Data-Collection Station reports: https://streamstatsags.cr.usgs.gov/gagePages/html/ (accessed on 19 July 2021).

2.3. 7Q10 Comparison Data

To compare 7Q10 estimates from this experiment to current statistical methodologies, we use the USGS’s statistical estimation program StreamStats because of its wide usage in the states for which it has been developed in the northeast and mid-Atlantic. This program uses logarithmic-transformed linear regression (LTLR) equations to estimate flow statistics [4]. Where applied, different regression equations and variables are calculated for each state. Furthermore, states in the mid-Atlantic region use different equations (and in some cases, different variables) based on hydrologic regions within each state. Table 1 lists the candidate states and the corresponding variables used for 7Q10 estimation. Connecticut, Delaware, Maryland, New Jersey, New York, and Vermont are included last, as StreamStats 7Q10 estimates have not been developed for these states. Out of the 165 gaged sites used for training, raw 7Q10 estimates from StreamStats were available for 128 sites.

3. Materials and Methods

In this section, the materials and methods used in this research are described. This includes the calculation of the historical 7Q10 at each site (Section 3.1) based on the historical data, each statistical method being compared (Section 3.2), the input variables included (Section 3.3), a cross-validation procedure for testing (Section 3.4), and the various efficiency/error metrics used to evaluate the performance of each method (Section 3.5).

3.1. 7Q10 Values at Each Site

The historical 7Q10 values based on historical data for each site were extracted from the USGS’s StreamStats Data-Collection Station Reports described in Section 2.2. In addition, if at least 30 years of continuous, daily streamflow data is available for a site, the “fasstr” software package: https://cran.r-project.org/web/packages/fasstr/index.html (accessed on 19 July 2021) is used to calculate the 7Q10 directly from the daily streamflow data. This package fits a quantile distribution to daily streamflow data that allows for the efficient calculation of low-flow frequency analysis metrics, including the 7Q10. As expected, these 7Q10 values were virtually identical to the 7Q10 values calculated by the USGS at each site. These values, noted as the “true 7Q10” values for each site, can also be found in Appendix A.
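As a rough illustration of what a 7Q10 represents, the sketch below computes the annual minimum of the 7-day mean flow for each year and then takes the value with a 10% annual chance of not being exceeded. This is a simplification and not the fasstr procedure: it swaps the fitted distribution for an empirical quantile, and the function names and the synthetic record are hypothetical.

```python
def seven_day_means(daily_flows):
    """Moving 7-day average of a list of daily flows."""
    return [sum(daily_flows[i:i + 7]) / 7 for i in range(len(daily_flows) - 6)]


def annual_minimum_7day(flows_by_year):
    """Lowest 7-day mean flow within each year of record."""
    return [min(seven_day_means(flows)) for flows in flows_by_year.values()]


def empirical_quantile(values, probability):
    """Quantile by linear interpolation between sorted values."""
    ordered = sorted(values)
    position = probability * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def empirical_7q10(flows_by_year):
    """Annual-minimum 7-day flow with a 10% chance of not being exceeded in a year.

    Assumption: an empirical quantile stands in for the fitted frequency distribution.
    """
    return empirical_quantile(annual_minimum_7day(flows_by_year), 0.1)


# Synthetic record (invented): ten years of 14 days each, constant flow of 1..10 cfs.
flows_by_year = {2000 + k: [float(k + 1)] * 14 for k in range(10)}
estimate = empirical_7q10(flows_by_year)
```

With short records an empirical quantile is noisy, which is one reason a fitted distribution is normally preferred for frequency analysis.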

3.2. Statistical Methods

In this analysis, five statistical methods are applied. Two classical statistical methods, namely multiple linear regression (Section 3.2.1) and logarithmic-transformed linear regression (Section 3.2.2), are tested alongside three machine learning algorithms, namely random forest decision trees (Section 3.2.3), neural networks (Section 3.2.4), and generalized additive models (Section 3.2.5). For the machine learning algorithms, feature scaling is applied to the input variables before method application using min-max normalization:
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
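A minimal Python sketch of min-max normalization for a single input variable follows. The function name, the handling of a constant feature, and the sample basin areas are assumptions made for illustration only.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers to [0, 1] using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical basin areas (mi^2), invented for illustration.
areas = [2.1, 30.0, 100.0, 1419.0]
scaled_areas = min_max_scale(areas)
```

In a cross-validation setting, the minimum and maximum would normally be taken from the training sites only, so that the held-out site does not influence the scaling.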

3.2.1. Multiple Linear Regression

Multiple linear regression (MLR) is a simple, common methodology that takes the general form of
$$Y_i = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + \varepsilon_i$$
where $Y_i$ is the estimate of the dependent variable for site $i$, $X_1$ to $X_n$ are the $n$ independent variables, $b_0$ to $b_n$ are the $n + 1$ regression model coefficients, and $\varepsilon_i$ is the residual error for site $i$. Assumptions for use of MLR are (1) the relationship displays linearity, (2) the mean of $\varepsilon_i$ is zero, (3) the variance of the $\varepsilon_i$ is constant and independent of $X_n$, (4) the $\varepsilon_i$ are normally distributed, and (5) the $\varepsilon_i$ are independent [37]. For this study, we force the intercept $b_0$ to be 0 since a basin with 0 area should have 0 flow.
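The zero-intercept constraint can be sketched by solving the least-squares normal equations without a column of ones. The sketch below uses only the Python standard library. The helper names and the synthetic data are hypothetical, and it is not the R workflow used in the study.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_through_origin(X, y):
    """Least-squares b_1..b_n for Y = b_1 X_1 + ... + b_n X_n with b_0 fixed at 0."""
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y; no column of ones because b_0 = 0.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic rows of (area, precipitation); all values are invented.
X = [[10.0, 2.0], [50.0, 3.0], [200.0, 1.5], [400.0, 4.0]]
true_coefficients = [0.05, 0.3]
y = [sum(b * x for b, x in zip(true_coefficients, row)) for row in X]
coefficients = fit_through_origin(X, y)
```

Because the synthetic responses contain no noise, the fit recovers the coefficients used to generate them, which makes the example easy to check.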

3.2.2. Logarithmic-Transformed Linear Regression

Logarithmic-transformed linear regression (LTLR) is the most used method for 7Q10 estimation because it can correct for spatial correlation and differences in streamflow record lengths [19]. In addition, streamflow and basin characteristics used in hydrologic regression have been found to be log-normally distributed, with residuals (calculated by subtracting the estimated values from the observed values) that were not randomly distributed when multiple linear regression was applied, suggesting that the variables should be transformed to log space [20]. This results in a model of the form
$$\log Y_i = b_0 + b_1 \log X_1 + b_2 \log X_2 + \cdots + b_n \log X_n + \varepsilon_i$$
Using base 10, the equation takes the general form of
$$Y_i = 10^{b_0} X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n} 10^{\varepsilon_i}$$
Though theory suggests that LTLR is the preferred method for 7Q10 estimation, in practice, both MLR and LTLR have been found to perform similarly [31].
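For a single predictor, the log-space fit and its back-transformation to standard space can be sketched as follows. The sketch assumes strictly positive flows and predictors (a zero cannot be log-transformed), and the function names and synthetic data are hypothetical.

```python
import math


def fit_log_log(x, y):
    """Fit log10(y) = b0 + b1 * log10(x) by ordinary least squares."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mean_lx = sum(lx) / len(lx)
    mean_ly = sum(ly) / len(ly)
    sxx = sum((a - mean_lx) ** 2 for a in lx)
    sxy = sum((a - mean_lx) * (b - mean_ly) for a, b in zip(lx, ly))
    b1 = sxy / sxx
    b0 = mean_ly - b1 * mean_lx
    return b0, b1


def predict_standard_space(b0, b1, x):
    """Back-transform the log-space model: y = 10**b0 * x**b1."""
    return (10 ** b0) * (x ** b1)


# Synthetic data generated from y = 0.02 * x**1.3 (invented for illustration).
areas = [5.0, 20.0, 80.0, 320.0, 1280.0]
flows = [0.02 * a ** 1.3 for a in areas]
b0, b1 = fit_log_log(areas, flows)
estimate = predict_standard_space(b0, b1, 100.0)
```

The slope fitted in log space becomes an exponent in standard space, and the intercept becomes a multiplicative constant.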

3.2.3. Random Forest

The random forest (RF) algorithm applied here is a non-parametric, tree-based regression model [38]. RFs use bootstrap aggregation, where bootstrap samples are randomly drawn with replacement, seeking a lower test error through variance reduction. RFs consist of numerous decision trees (Figure 2).
The RF model is optimized by tuning or calibrating its three major hyperparameters: (1) “mtry”, the number of predictors that will be randomly sampled at each split when creating the tree models; (2) “ntrees”, the number of decision trees contained in the ensemble; and (3) the minimum size of terminal nodes, “nt”. All parameters were manually tuned to create a stable model using the package “randomForest”: https://cran.r-project.org/web/packages/randomForest/index.html (accessed on 20 July 2021) in R.

3.2.4. Neural Networks

Neural networks (NN) are a class of machine learning algorithms inspired by the structure and function of the human brain [39]. The three primary types of layers are the input layer, one or more hidden layers, and the output layer. Each neuron takes inputs, performs a weighted sum of these inputs, applies an activation function to produce an output, and then passes this output to neurons in the next layer. The connections between neurons have associated weights, which the network learns from data during the training process. The network adjusts its weights iteratively using optimization algorithms to minimize the difference between its predictions and the actual targets. A simple neural network structure is highlighted in Figure 3.
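The forward pass described above (weighted sum, activation, then hand-off to the next layer) can be sketched for a network with one hidden layer. This is an illustrative toy rather than the "neuralnet" configuration used in the study: the bias terms, the sigmoid activation, the linear output, and all weights are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Each neuron: weighted sum of inputs plus a bias, passed through an activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Input layer -> one hidden layer (sigmoid) -> single linear output neuron."""
    hidden = dense_layer(inputs, hidden_weights, hidden_biases, sigmoid)
    return dense_layer(hidden, output_weights, output_biases, lambda z: z)[0]


# Hypothetical scaled inputs and weights; zero hidden weights make the result easy to verify.
prediction = forward(
    inputs=[0.3, 0.7],
    hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[[1.0, 1.0]],
    output_biases=[0.0],
)
```

Training would adjust the weights and biases iteratively to reduce the prediction error. That step is omitted here.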
Neural networks are optimized by tuning major parameters, including the number of hidden layers, the limiting threshold for the partial derivatives of the error function as stopping criteria, and the maximum allowable steps for training. Additionally, the number of neurons per hidden layer, initial weights, activation functions, and learning rate can be customized for different scenarios. All parameters were manually tuned to create a stable model that converges using the “neuralnet” package: https://www.rdocumentation.org/packages/neuralnet/versions/1.44.2/topics/neuralnet (accessed on 21 July 2021) in R.

3.2.5. Generalized Additive Models

Generalized additive models (GAM) [40] represent a flexible extension of traditional linear models. GAMs capture non-linear dependencies of data through the application of smoothing functions which allows for finding complex relationships between variables. By employing smoothing functions such as cubic regression splines, GAMs accommodate non-linear patterns and mitigate issues of model misspecification. This property of GAMs makes it well-suited for scenarios where linear models may fall short in capturing intricate patterns. Furthermore, GAM does not impose assumptions on the underlying distribution of the response variable, enabling it to incorporate various response distributions appropriately. An example of a standard linear model and a GAM applied to the same data is provided in Figure 4.
GAMs are optimized by tuning several parameters, including “gamma” to increase smoothing, “family” to specify the distribution to be used, and “weights” to designate prior weights on the contribution of the data to the log-likelihood. All parameters were manually tuned to create a stable model using the GAM function from the “mgcv” package: https://www.rdocumentation.org/packages/mgcv/versions/1.9-0/topics/gam (accessed on 17 August 2021) in R.

3.3. Input Variables

Static land data, including mean basin elevation, percent mean basin slope, percent landcover considered wetland and forest, and basin area, were collected from the USGS’s StreamStats Data-Collection Station Reports. These data are direct inputs into the statistical models. In addition, time series of daily precipitation and maximum temperature were extracted at each of the gages from Livneh et al. (2015) [33]. A running cumulative 30-day precipitation value was calculated, as well as the corresponding average 30-day maximum daily temperature. To isolate the conditions under which a 7Q10 flow would occur, we extracted the lowest 30-day cumulative precipitation, limited to only 30-day periods of high temperatures (>90th percentile). The 30-day cumulative precipitation and corresponding high average temperature were recorded. A list of all input variables is included in Table 2.
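One possible reading of the climate-variable extraction is sketched below: compute rolling precipitation totals and rolling mean maximum temperatures, keep only windows whose mean temperature exceeds the 90th percentile of all window means, and return the driest of those windows. The percentile definition (linear interpolation over window means), the function names, and the toy series are assumptions, since the text does not specify them.

```python
def rolling_windows(series, width):
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def percentile(values, fraction):
    """Percentile by linear interpolation between sorted values (an assumed definition)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def driest_hot_window(precip, tmax, width=30, fraction=0.9):
    """Return (cumulative precipitation, mean max temperature) of the driest hot window."""
    precip_sums = [sum(w) for w in rolling_windows(precip, width)]
    temp_means = [sum(w) / width for w in rolling_windows(tmax, width)]
    threshold = percentile(temp_means, fraction)
    hot = [(p, t) for p, t in zip(precip_sums, temp_means) if t > threshold]
    if not hot:
        raise ValueError("no window exceeds the temperature threshold")
    return min(hot, key=lambda pair: pair[0])


# Toy 12-day series with a 3-day window and an 80th-percentile cut, for readability only.
precip = [5, 0, 2, 1, 4, 0, 3, 2, 6, 1, 0, 4]
tmax = [float(day) for day in range(1, 13)]
result = driest_hot_window(precip, tmax, width=3, fraction=0.8)
```

The defaults (`width=30`, `fraction=0.9`) mirror the 30-day window and 90th percentile named in the text. The toy call overrides them only so the result can be checked by hand.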

3.4. Leave-One-Out Cross-Validation (LOOCV)

The leave-one-out cross-validation (LOOCV) method is an extreme version of K-fold cross-validation where K = N [41]. It is an iterative process that is executed the same number of times as the number of data points. Only one value is used as the test set, while all other values are used as the training set. This iterative process is run for every value so that there is a test set value for every value in the dataset, which allows for a new test set to be created with all of the individual test set values (Figure 5). The new test set can then be evaluated using traditional error metrics and analysis. LOOCV can be computationally intensive, but for relatively small datasets, it can provide better performance than K-fold cross-validation due to the largest possible training set being used to estimate each test set value [41].
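The LOOCV loop can be sketched generically, with the model supplied as a pair of fit and predict functions. The stand-in model below (a one-variable regression through the origin) and the data are hypothetical, chosen only to keep the example self-contained.

```python
def loocv_predictions(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and predict the held-out sample."""
    predictions = []
    for i in range(len(y)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        predictions.append(predict(model, X[i]))
    return predictions


def fit_slope(X, y):
    """Stand-in model: least-squares slope of y = b * x with no intercept."""
    return sum(x * t for x, t in zip(X, y)) / sum(x * x for x in X)


def predict_slope(slope, x):
    return slope * x


# Invented data lying exactly on y = 2x, so each held-out prediction matches its target.
X = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
held_out_predictions = loocv_predictions(X, y, fit_slope, predict_slope)
```

The list of held-out predictions plays the role of the assembled test set described above and can be passed directly to the error metrics.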

3.5. Error Metrics

R² and RMSE are used to directly evaluate the random and systematic error of each method. In addition, the Nash–Sutcliffe Efficiency (NSE) and the Kling–Gupta Efficiency (KGE) are also included because of their frequent usage for evaluating streamflow models.
The widely known coefficient of determination, R² [42], will be one of the metrics for evaluating model performance. Values range from 0 (no correlation) to 1 (perfect correlation). This value is calculated using the following equation:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Here, $y_i$ is the observed 7Q10, $\hat{y}_i$ is the model-predicted 7Q10, $\bar{y}$ is the mean of the observed 7Q10s, and $n$ is the number of samples used in the calculation.
RMSE is used to evaluate error because it is among the most used indicators for evaluation of model performance [43]. Similar studies have also chosen RMSE over MAE for its sensitivity to outliers [44]. The general equation is given by
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Hydrologists commonly use the Nash–Sutcliffe Efficiency [45] and Kling–Gupta Efficiency [46] for streamflow modelling evaluation. The Nash–Sutcliffe Efficiency (NSE) signals a model’s ability to predict variables different from the mean. NSE is calculated as
$$NSE = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}$$
where $O_i$ is the observed value, $P_i$ is the predicted value, and $\bar{O}$ is the mean of the observed values. NSE values range from negative infinity (indicating a poor model) to 1 (indicating a perfect fit between observed and predicted values). Negative values indicate that the mean is a better predictor of the observed values than the model.
Furthermore, the Kling–Gupta Efficiency (KGE) [46] is widely used for hydrologic applications [47,48]. KGE provides three components, the general correlation (r term), the bias (beta term), and the relative variability (alpha term), between the modelled and observed values. KGE is calculated using the following formula:
$$KGE = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}$$
$$\alpha = \frac{\sigma_m}{\sigma_o}$$
$$\beta = \frac{\mu_m}{\mu_o}$$
where $\sigma_m$ is the standard deviation of the model, $\sigma_o$ is the standard deviation of the reference, $\mu_m$ is the mean of the model, and $\mu_o$ is the mean of the reference. Like NSE, KGE values range from negative infinity (poor performance) to 1 (best performance). Here, $r$ is Pearson’s correlation coefficient, $\alpha$ represents the variability error, and $\beta$ indicates the bias error.
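The four metrics can be written compactly in Python. The sketch below follows the formulas in this section. The function names and test vectors are hypothetical, and population (divide-by-n) standard deviations are assumed for KGE, which does not affect the ratio α as long as the same convention is used for both series.

```python
import math


def mean(values):
    return sum(values) / len(values)


def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot, as written in the text."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(observed, predicted)]))


def nse(observed, predicted):
    """As written, the NSE formula is the same ratio of squared errors as R^2."""
    return r_squared(observed, predicted)


def std(values):
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


def pearson_r(observed, predicted):
    mo, mp = mean(observed), mean(predicted)
    cov = mean([(o - mo) * (p - mp) for o, p in zip(observed, predicted)])
    return cov / (std(observed) * std(predicted))


def kge(observed, predicted):
    r = pearson_r(observed, predicted)
    alpha = std(predicted) / std(observed)    # variability ratio
    beta = mean(predicted) / mean(observed)   # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Invented vectors: predictions that are exactly double the observations.
observed = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]
kge_doubled = kge(observed, doubled)
```

For the doubled predictions, correlation is perfect while both α and β equal 2, so KGE penalizes variability and bias even though r = 1.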
The methodology of this paper is summarized in Figure 6. This figure includes all datasets, variables, and methods utilized.

4. Results and Discussion

In this section, the development (Sections 4.1–4.5) and general performance (Sections 4.6 and 4.7) of each statistical model are evaluated. For all methods except neural networks, where it is not applicable, we present the significance of variables using the standard p-value, with four thresholds given by minorly significant (0.1 > X > 0.05), moderately significant (0.05 > X > 0.01), largely significant (0.01 > X > 0.001), and extremely significant (X < 0.001).

4.1. Multiple Linear Regression

Applying multiple linear regression to the input variables, with the constraint that the intercept  b 0  is set to 0, gives the following results (Table 3).
As expected, the area was found to be extremely significant at the 0.001 level. In addition, slope and precipitation were found to be largely significant at the 0.01 level. Elevation was found to be only minorly significant at the 0.1 level but did improve the results. Neither landcover variable (the percentage of the basin considered to be forest or wetland) showed significance. Including only the statistically significant variables results in the following equation:
$$7Q10 = 0.0579833\,\mathrm{Area} + 0.0014461\,\mathrm{Elevation} - 0.2804801\,\mathrm{Slope} + 0.3421899\,\mathrm{Precip}$$
(R Square = 0.7481, Residual Standard Error = 7.693, p-Value = 2 × 10⁻¹⁶).
The sign of each variable aligns intuitively with the expected relationship between that variable and the 7Q10. Increasing area and precipitation allows for additional water during low flows, leading to positive coefficients. Elevation and slope relate directly to baseflows, and as baseflows play a large role in low flows, the relationship between baseflow and 7Q10s should be similar. High elevations are typically associated with higher baseflows [49], leading to an increasing relationship between 7Q10s and elevation. The relationship between slope and 7Q10 was expected to be positive, but as noted by Rumsey et al. (2015), “positive correlations between slope and baseflow are expected to be related to effects of elevation, but slope steepness is known to affect rates of groundwater transmission and determines whether groundwater will reach a channel network or be retained in the soil” [49], making it reasonable that increasing slope may not increase 7Q10s.
Figure 7 plots the MLR 7Q10 estimates against the actual historical 7Q10s. The line represents a perfect fit, and points closer to the line indicate smaller estimation error.
A known weakness of applying MLR for extremely low flow estimation is that the residuals are not randomly distributed, as noted in most StreamStats low flow reports cited earlier. Plotting the residuals, from the smallest to largest observed 7Q10, confirms this (Figure 8).
Though the MLR fit is statistically significant, the plot of residuals suggests that MLR may not be the appropriate method for this case. Because of this, the next method examined is logarithmic-transformed linear regression (LTLR), the traditional method for estimating 7Q10s in small basins. This method is traditionally used by StreamStats, though StreamStats reports refer to it as generalized least squares regression with a logarithmic transformation. This methodology accounts for the drawbacks of simple MLR, including the non-random residuals, spatially distributed correlation, and differences in record lengths.

4.2. Logarithmic-Transformed Linear Regression

Similarly, applying LTLR to the same input data leads to the following results, given in Table 4.
In log space, only area was found to be extremely significant at the 0.001 level, while slope, precipitation, and temperature are moderately significant at the 0.05 level. Additionally, the percentage of the basin considered to be forest was found to be moderately significant at the 0.05 level in log space when it was not found to be significant using basic MLR. Once again, the percentage of the basin considered to be wetland was not found to be significant, but surprisingly, elevation was not found to be significant in log space, while it was just above the threshold for moderate importance (p-value~0.05) using MLR. Only including the significant variables leads to the following equation:
$$\log(7Q10) = 1.31308\,\log(\mathrm{Area}) - 0.19413\,\log(\mathrm{Slope}) + 0.22437\,\log(\mathrm{Forest}) + 0.31049\,\log(\mathrm{Precip}) - 4.37462\,\log(\mathrm{Temp})$$
R² = 0.67, Residual Standard Error = 0.6139, p-value = 2 × 10⁻¹⁶.
Figure 9 plots the estimated values against the actual historical 7Q10s, this time in log space.
This fit is noticeably more linear but moves away from linearity for extremely small 7Q10 values (actual 7Q10s < 10⁻¹ cfs). These extremely small points can be ignored due to significant figures, as these very small numbers suggest that the stream’s 7Q10 is essentially zero flow (ephemeral streams). More importantly, a goal of this experiment is to include much larger basin areas (and their corresponding 7Q10s) than are traditionally accounted for in regression equations. The largest 7Q10s, which correspond with the largest basins in the analysis, are found in the top right of Figure 9 and seem to continue to fit the general trend of linearity in log space. However, it should be noted that even though those points are similar distances from the line in log space, the difference in standard space is much larger than Figure 9 suggests. The three points correspond to actual 7Q10s of 79.67, 59.16, and 90.51 cfs, with corresponding estimates of 151.21, 170.09, and 167.49 cfs, respectively. To further examine this, Figure 10 displays the residuals in log space, once again arranged from smallest to largest 7Q10.
Besides the three points in the bottom right corner of Figure 10, which were discussed previously, the residuals in Figure 10 appear to be more consistently distributed than the residuals in Figure 8, suggesting that using LTLR is the preferred method over general MLR for 7Q10 estimation.
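The gap between log-space and standard-space error for the three largest basins can be checked with a few lines of arithmetic. The sketch below uses only the actual and estimated values quoted above; the use of base-10 logarithms and the helper names are assumptions made for illustration.

```python
import math

# Actual and LTLR-estimated 7Q10s (cfs) for the three largest basins, as quoted in the text.
actual = [79.67, 59.16, 90.51]
estimated = [151.21, 170.09, 167.49]

def log10_gap(est, obs):
    """Distance from the perfect-fit line in log10 space (assumed base 10)."""
    return math.log10(est / obs)

def standard_space_error(obs, gap):
    """Error in cfs implied by a given log10 gap at a given actual flow."""
    return obs * (10 ** gap - 1)

for obs, est in zip(actual, estimated):
    gap = log10_gap(est, obs)
    print(f"actual={obs:6.2f} cfs  log10 gap={gap:.3f}  error={est - obs:6.2f} cfs")

# The same log10 gap of 0.3 (roughly a factor of 2) at a small and a large 7Q10.
small_error = standard_space_error(0.5, 0.3)
large_error = standard_space_error(80.0, 0.3)
print(f"gap 0.3 at 0.5 cfs -> {small_error:.2f} cfs; at 80 cfs -> {large_error:.2f} cfs")
```

An identical distance from the line in log space therefore corresponds to an error that scales with the flow itself, which is why the largest basins dominate the standard-space error.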
The final equation, translated back into standard space, is given below.
$$7Q10 = \mathrm{Area}^{1.31308} \cdot \mathrm{Slope}^{-0.19413} \cdot \mathrm{Forest}^{0.22437} \cdot \mathrm{Precip}^{0.31049} \cdot \mathrm{Temp}^{-4.37462}$$
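As a minimal sketch of how the standard-space equation could be evaluated, the snippet below multiplies each predictor raised to its LTLR exponent and confirms that the result matches the log-space form. The negative signs on the slope and temperature exponents are an assumption, and the predictor values and their units are hypothetical, since neither is stated in this section.

```python
import math

# Exponents from the LTLR equation. The negative signs on slope and
# temperature are an assumption, not confirmed by this section.
EXPONENTS = {
    "area": 1.31308,
    "slope": -0.19413,
    "forest": 0.22437,
    "precip": 0.31049,
    "temp": -4.37462,
}

def ltlr_7q10(predictors):
    """Standard-space form: product of each predictor raised to its exponent."""
    result = 1.0
    for name, exponent in EXPONENTS.items():
        result *= predictors[name] ** exponent
    return result

def ltlr_log10_7q10(predictors):
    """Log-space form: sum of exponent * log10(predictor)."""
    return sum(exponent * math.log10(predictors[name])
               for name, exponent in EXPONENTS.items())

# Hypothetical basin: values and units are placeholders, not study data.
basin = {"area": 50.0, "slope": 8.0, "forest": 70.0, "precip": 45.0, "temp": 9.0}

estimate = ltlr_7q10(basin)
print(f"standard-space estimate: {estimate:.4f}")
print(f"back-transformed log-space estimate: {10 ** ltlr_log10_7q10(basin):.4f}")
```

Because every predictor enters through a logarithm, all inputs must be strictly positive; a basin with zero percent forest, for example, could not be evaluated with this form as written.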

4.3. Random Forest

Applying the random forest (RF) machine learning algorithm to the input data yields the following results in Table 5.
Area was found to be significant at the 0.01 level, with elevation and precipitation significant at the 0.05 level, and both slope and percent forest significant at the 0.1 level. Temperature was not found to be significant using the random forest model or the multiple linear regression model, making it only significant in log space. In all three cases so far (MLR, LTLR, and RF), the percentage of the basin considered wetland was not found to be significant. Figure 11 displays the estimated out-of-bag error as a function of the number of decision trees.
Figure 11 displays that the model stabilized around 100 trees. Though the RF method does not make assumptions about normality, a plot of the residuals given in Figure 12, once again arranged from the smallest to largest 7Q10, shows that they are not randomly distributed.
Figure 13 plots the 7Q10 values estimated using the random forest method against the actual historical 7Q10s.
For relatively smaller 7Q10s, RF estimates are similar to the other methods. However, RF underestimates actual historical 7Q10s that are over 20 cfs. For this range, there are only 3 points above the line (overestimation) and 11 points under the line (underestimation). This range is particularly difficult to estimate because there are very few unimpaired watersheds in the study area that are large enough to have 7Q10s in this range, limiting the available training data.
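The over- and underestimation counts above come from comparing each estimate with its actual value within a flow range. A minimal sketch of that tally is shown below; the function name and the sample values are hypothetical and are not the study data.

```python
def count_over_under(actual, estimated, threshold):
    """Count estimates above and below the perfect-fit line
    for sites whose actual value exceeds `threshold`."""
    over = under = 0
    for obs, est in zip(actual, estimated):
        if obs <= threshold:
            continue  # outside the range of interest
        if est > obs:
            over += 1
        elif est < obs:
            under += 1
    return over, under

# Hypothetical 7Q10s in cfs, for illustration only.
actual = [5.0, 22.0, 30.0, 45.0, 60.0, 90.0]
estimated = [6.0, 18.0, 33.0, 40.0, 41.0, 70.0]

over, under = count_over_under(actual, estimated, threshold=20.0)
print(f"above line: {over}, below line: {under}")
```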

4.4. Neural Network

Neural networks were applied to the input data using a variety of tuning parameters. The addition of multiple hidden layers increased computation time, caused failure to converge in some cases, and did not improve model performance, so the final neural network included only one hidden layer, an associated convergence threshold of 0.01, and a maximum step of 1 × 10⁵. In this section, no table of variable importance and significance is included, as calculating p-values for neural networks is not common practice. Neural networks are highly complex models with many weights and parameters, and calculating a p-value for each weight or parameter amounts to conducting a separate hypothesis test for each one. This introduces the risk of the multiple comparisons problem [50], where the probability of obtaining false positives (significant p-values) increases, which can lead to misleading results. Instead, we display the general results in Figure 14.
Figure 14 shows that the NN model overestimates smaller 7Q10s (especially in the 0–20 cfs range) and underestimates 7Q10s larger than 20 cfs (7 points below the line, as opposed to 2 above). This should be corroborated by the residuals, which are displayed in Figure 15, once again organized from smallest actual 7Q10 to largest.
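The multiple comparisons problem can be illustrated with a short calculation of the chance of at least one false positive across many tests. This is a sketch under the simplifying assumption that the tests are independent, which network weights generally are not; the test counts are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across
    n_tests independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

# Hypothetical numbers of tested weights.
for n_tests in (1, 7, 50):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:3d} tests at alpha=0.05 -> familywise error rate {rate:.3f}")
```

With even a few dozen weights, a handful of nominally significant p-values would be expected by chance alone, which is the concern raised above.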
Figure 15 suggests that the neural network consistently overestimates the smaller 7Q10 values and underestimates the larger 7Q10 values. Even if this model proves to have the smallest error metrics in Section 4.6 and Section 4.7, this is a significant drawback.

4.5. Generalized Additive Model

GAM was applied to the input data with a variety of tuning parameters. No initial weights or scale parameters were given, and the optimal model was found using the Gaussian distribution with generalized cross-validation (GCV). The optimal GAM model yields the following results in Table 6.
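For readers unfamiliar with GCV, the sketch below shows the standard generalized cross-validation score for a Gaussian model, n · RSS / (n − edf)², where edf is the effective degrees of freedom of the fitted smooths. The residual sums of squares and degrees of freedom used here are hypothetical, and the exact scoring used by the authors' software is not stated in this section.

```python
def gcv_score(rss, n_obs, edf):
    """Standard GCV score: n * RSS / (n - edf)^2.
    Lower is better; edf penalizes overly flexible fits."""
    if edf >= n_obs:
        raise ValueError("effective degrees of freedom must be below n_obs")
    return n_obs * rss / (n_obs - edf) ** 2

# Two hypothetical candidate fits to 165 sites.
smooth_fit = gcv_score(rss=1200.0, n_obs=165, edf=6.0)
wiggly_fit = gcv_score(rss=1100.0, n_obs=165, edf=25.0)
print(f"smoother fit GCV: {smooth_fit:.3f}")
print(f"more flexible fit GCV: {wiggly_fit:.3f}")
```

In this example the more flexible fit has the lower residual sum of squares but the higher GCV score, so the smoother fit would be preferred.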
Area and elevation were both found to be extremely significant, while precipitation and temperature were found to be moderately and largely significant, respectively. The only other methodology where temperature was found to be significant was LTLR (linear regression in log space), suggesting that there may be a subtle importance of this variable that was not detected in standard space by MLR or RF. In addition to slope, neither of the landcover variables was found to be significant using GAM. The estimated 7Q10 values vs. the actual historical values are presented in Figure 16.
Figure 16 does not suggest an obvious pattern in residuals. To further analyze the model fit, we plot the residuals in Figure 17, once again organized from smallest actual 7Q10 to largest.
The residuals for the larger basins seem to deviate from the perfect fit line, but the residuals in Figure 17 seem to be much more randomly distributed than those of the neural network model, which displayed a clear pattern in Figure 15.
In the following sections, we will evaluate how each method performs in comparison to StreamStats (Section 4.6) and in relation to each other (Section 4.7). Additionally, all methods seem to perform worst for large 7Q10s, so Section 4.7 will highlight under what range of basin sizes each method performs best, using RMSE as the main metric compared to StreamStats estimates.

4.6. Comparisons to StreamStats Estimates

Current 7Q10 estimates were derived using the USGS’s StreamStats program, discussed in Section 2.3. Estimates are only available for some states in the domain, leaving 128 data points for comparison out of the 165 total used for training. In addition, no test set was used in the StreamStats derivations, so Table 7 compares the estimates from each method without applying a validation procedure.
Most methods perform similarly, but GAM displays the best R², KGE, NSE, and RMSE by far out of all the methods. Because of the high flexibility of GAMs, they are prone to overfitting, and this success will be further tested in Section 4.7 with LOOCV. The NN model also displays a high R² and KGE, but with the drawback that it overestimates small 7Q10s and underestimates large 7Q10s, as highlighted in Figure 15. MLR and RF outperform current estimates by an average of 12% in terms of KGE, as well as 25% in terms of RMSE, which is arguably the most important metric as it directly measures error. RF’s success could also be due to overfitting, which is tested in the next section. MLR, however, is a classical method for 7Q10 estimation and is not prone to overfitting. Given its success compared to StreamStats for the exact same locations that were derived using state-by-state equations, results suggest that a single, generalized methodology is appropriate.
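The four evaluation metrics can be computed from paired actual and estimated values as sketched below. The KGE form used here is the common 2009 formulation based on correlation, variability ratio, and mean ratio; whether the authors used this exact variant is an assumption, and the sample values are hypothetical.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def _std(values):
    """Population standard deviation."""
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def pearson_r(obs, sim):
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))

def r_squared(obs, sim):
    """Square of Pearson's correlation coefficient."""
    return pearson_r(obs, sim) ** 2

def rmse(obs, sim):
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of observations about their mean."""
    mo = _mean(obs)
    sse = sum((s - o) ** 2 for o, s in zip(obs, sim))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)

def kge(obs, sim):
    """Kling-Gupta Efficiency (assumed 2009 form)."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # mean (bias) ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Hypothetical values (cfs): every estimate is 1 cfs too high.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [2.0, 3.0, 4.0, 5.0]
scores = {
    "R2": r_squared(observed, simulated),
    "RMSE": rmse(observed, simulated),
    "NSE": nse(observed, simulated),
    "KGE": kge(observed, simulated),
}
print(scores)
```

The example shows why the metrics can disagree: a constant offset leaves R² perfect while RMSE, NSE, and KGE all register the bias.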

4.7. General Performance

Comparing each method using leave-one-out cross-validation results in the following metrics, given in Table 8.
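Leave-one-out cross-validation refits the model once per site, each time withholding that site and predicting it from the remaining ones. The sketch below illustrates the procedure with a one-predictor least-squares line as a stand-in model; the data and helper names are hypothetical, and the study's models use several predictors.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loocv_predictions(xs, ys):
    """Predict each point from a model fitted to all other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a + b * xs[i])
    return preds

def rmse(obs, sim):
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

# Hypothetical predictor and response values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]

a, b = fit_line(xs, ys)
in_sample_rmse = rmse(ys, [a + b * x for x in xs])
loocv_rmse = rmse(ys, loocv_predictions(xs, ys))
print(f"in-sample RMSE: {in_sample_rmse:.3f}, LOOCV RMSE: {loocv_rmse:.3f}")
```

The difference between the in-sample and cross-validated error is the quantity used below to judge whether a method's apparent success reflects overfitting.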
Table 8 confirms that the success displayed by both NN and GAM in the previous section was due to overfitting. They display the two worst R² values and have RMSEs larger than both RF and MLR. With the addition of a test set, RF performance declined only 3.17% for R² (0.61, down from 0.63), 9.21% for KGE (0.69, down from 0.76), and 22.64% for NSE (0.41, down from 0.53), while RMSE increased 5.27% (8.39, up from 7.97). The average decline for MLR and LTLR was similar, at 3.41% for R², 8.08% for KGE, 12.70% for NSE, and an increase of 16.18% for RMSE. This suggests that RF’s previous success was not due to overfitting, as it displayed similar declines to LTLR and MLR, which utilize straight lines for fitting.
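The percentage changes reported for RF follow directly from the before and after values quoted above, as the short check below shows. Only the dictionary and function names are introduced here for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# RF metrics without validation and after LOOCV, as quoted in the text.
rf_in_sample = {"R2": 0.63, "KGE": 0.76, "NSE": 0.53, "RMSE": 7.97}
rf_loocv = {"R2": 0.61, "KGE": 0.69, "NSE": 0.41, "RMSE": 8.39}

changes = {name: percent_change(rf_in_sample[name], rf_loocv[name])
           for name in rf_in_sample}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```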
Each method performs differently based on the evaluation metric used. One explanation for this is the wide range of basin sizes included in this analysis. This is highlighted in the plots of residuals earlier, which demonstrated that many models perform worse for larger 7Q10s. Splitting the data into three distinct subsets based on basin size allows us to further examine why some methods display high R² but poor RMSE. For this analysis, small basins are defined as basins under 15 mi², medium basins are basins between 15 and 70 mi², and large basins are basins larger than 70 mi². These thresholds do not have any physical meaning and are simply selected to divide the full dataset into three equally sized subsets for further analysis. In Table 9, we provide the RMSE for each method applied to each basin size range. Especially for these subsets, RMSE is the best metric to base success on, as it measures the error of estimates vs. the actual values, which is the most important metric for resource managers who need accurate 7Q10 estimates.
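The size-based comparison can be sketched as a grouping step followed by an RMSE per group. The thresholds below are the ones stated above; treating exactly 15 and 70 mi² as medium is an assumption, and the sample basins are hypothetical.

```python
import math

def size_class(area_mi2):
    """Classify a basin by area using the 15 and 70 square-mile thresholds.
    Boundary values are assigned to 'medium' (an assumption)."""
    if area_mi2 < 15:
        return "small"
    if area_mi2 <= 70:
        return "medium"
    return "large"

def rmse_by_size(areas, actual, estimated):
    """RMSE of the estimates within each basin size class."""
    squared_errors = {}
    for area, obs, est in zip(areas, actual, estimated):
        squared_errors.setdefault(size_class(area), []).append((est - obs) ** 2)
    return {name: math.sqrt(sum(errs) / len(errs))
            for name, errs in squared_errors.items()}

# Hypothetical basins: area in square miles, 7Q10s in cfs.
areas = [5.0, 12.0, 20.0, 60.0, 150.0, 400.0]
actual = [0.2, 0.5, 1.0, 5.0, 10.0, 30.0]
estimated = [0.3, 0.4, 1.5, 4.0, 14.0, 27.0]

result = rmse_by_size(areas, actual, estimated)
for name in ("small", "medium", "large"):
    print(f"{name}: RMSE = {result[name]:.3f} cfs")
```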
Table 9 demonstrates that LTLR and RF perform similarly well for 7Q10 estimation in both small- and medium-sized basins, greatly outperforming MLR, NN, and GAM. However, for large basins, LTLR performs poorly because of the extreme overestimation discussed earlier, while MLR, RF, and NN perform similarly well (RMSE = 14.02, 14.01, and 15.60, respectively). Based on this experiment, RF performs the best across all ranges of basin sizes (Avg. RMSE = 5.85), while LTLR performs similarly well in small- and medium-sized basins where it is traditionally used, and MLR performs similarly well in large basins. Based on RF’s success for all ranges of basin sizes, we include Figure 18 to compare raw 7Q10 estimates for the RF method to both classical methods, MLR and LTLR. Figure 18a–c displays the actual 7Q10, as well as the LTLR, MLR, and RF estimates, arranged from the smallest historical 7Q10 to the largest.
The results in Figure 18a–c corroborate the results from Table 9. Logarithmic-transformed linear regression performs best overall in terms of R² and NSE but performs poorly in KGE and RMSE due to its poor estimates in large basins, reflected in Figure 18c. LTLR estimates the largest historical 7Q10, which is 90 cfs, to be 181 cfs, more than double the actual value. Similarly, LTLR estimates the next two largest 7Q10s, which are 59.2 cfs and 79.7 cfs, to be 186 cfs and 160 cfs, respectively. Though the LTLR vs. actual 7Q10 graph (Figure 9) previously suggested that LTLR could be extended to large basins because the largest 7Q10s seemed to maintain a constant distance from the perfect fit line, Figure 9 is in log space, and differences in log space are amplified for larger numbers when returning to standard space. Conversely, multiple linear regression performs well in large basins but poorly in small basins, as it attempts to minimize the overall error and gives more weight to larger 7Q10s. Lastly, the random forest method performs well overall, especially in larger basins, suggesting that this flexible machine learning algorithm may be able to account for the drawbacks of both LTLR and MLR.

5. Conclusions

This research improves upon current methodologies for statistically estimating the 7Q10 by analyzing multiple statistical methods, testing various topographical, landcover, and climate variables for significance, and widening the geographical and watershed size ranges of current methodologies. Results support that a single, generalized methodology can be used for 7Q10 estimation throughout the entire northeast and mid-Atlantic, with similar R², RMSE, NSE, and KGE compared to current state-by-state StreamStats estimates while only requiring one equation/model. Estimates from StreamStats display an R², KGE, NSE, and RMSE of 0.66, 0.66, 0.65, and 9.88, respectively, for the unimpaired gages where it is available in the study area, while the random forest method displays an R², KGE, NSE, and RMSE of 0.61, 0.69, 0.41, and 8.39, respectively, for the full range of gages even after cross-validation was applied. Two other machine learning algorithms (neural networks and generalized additive models) were tested as well but displayed significantly worse R², KGE, NSE, and RMSE values (0.53, 0.69, 0.36, and 9.41 and 0.53, 0.65, 0.52, and 12.15, respectively) with the addition of the cross-validation test set.
Future work could involve testing other advanced statistical methods and/or machine learning algorithms for better low-flow estimation. Because we were able to successfully apply this methodology to such a large geographical footprint, other future work may include determining the boundaries at which assuming hydrologic homogeneity is no longer satisfied. Additional future work directly related to this study may involve landcover and climate-altered futures. The inclusion of climate and landcover input variables, which were both found to be statistically significant, can be used prescriptively in conjunction with physical hydrology models to test how changing landcover and climate conditions affect 7Q10 estimates. Because of additional stakeholder input, future work may also involve only using 7Q10 data derived from the last 30 years of streamflow data at each site for use as the actual historical 7Q10. Blum et al. (2019) found that using the most recent 30 years of streamflow record to derive the “true” 7Q10 when a trend is detected reduces error and bias in 7Q10 estimators compared to using the full record of streamflow [2]. This may account for recent climatic and hydrologic conditions that will be more representative of future 7Q10 conditions at a particular site, but additional studies must be completed to confirm this relationship.

Author Contributions

A.D. obtained the input data, applied each method, and wrote the draft manuscript; M.A.E.B. provided machine learning expertise, including suggesting random forests and helping to manually tune the method; K.M.A. and R.N.P. helped provide expertise on hydrology and improved the language of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a U.S. Geological Survey Northeast Climate Adaptation Science Center award G21AC10556, A Decision Support System for Estimating Changes in Extreme Floods and Droughts in the Northeast U.S., to Andrew DelSanto.

Data Availability Statement

All data and codes used in this project can be obtained by contacting the first author via email ([email protected]).

Conflicts of Interest

The authors declare no competing interest.

Appendix A

Station | Source | State | Station Name | Watershed Area (mi²) | 7Q10 (cfs)
01013500 | HCDN | ME | Fish River near Fort Kent, Maine | 869.8 | 79.67
01022500 | HCDN | ME | Narraguagus River at Cherryfield, Maine | 221.5 | 29.24
01030500 | HCDN | ME | Mattawamkeag River near Mattawamkeag, Maine | 1419.4 | 59.16
01031500 | HCDN | ME | Piscataquis River near Dover-Foxcroft, Maine | 296.9 | 15.54
01047000 | HCDN | ME | Carrabassett River near North Anson, Maine | 351 | 44.96
01052500 | HCDN | NH | Diamond River near Wentworth Location, NH | 148.2 | 16.95
01054200 | HCDN | ME | Wild River at Gilead, Maine | 69.9 | 9.59
01055000 | HCDN | ME | Swift River near Roxbury, Maine | 96.8 | 6.82
01057000 | HCDN | ME | Little Androscoggin River near South Paris, Maine | 73.7 | 2.39
01073000 | HCDN | NH | Oyster River near Durham, NH | 12.1 | 0.35
01073860 | HCDN | MA | Small Pox Brook at Salisbury, MA | 1.83 | 0.15
01078000 | HCDN | NH | Smith River near Bristol, NH | 85.9 | 6.04
01094340 | MA Low Flow Report | MA | Whitman River near Westminster, Mass. | 21.7 | 0.89
01094396 | MA Low Flow Report | MA | Philips Brook at Fitchburg, Mass. | 15.8 | 0.34
01094760 | MA Low Flow Report | MA | Waushacum Brook near West Boylston, Mass. | 7.41 | 0.06
01095220 | MA Low Flow Report | MA | Stillwater River near Sterling, Mass. | 30.4 | 1.06
01095380 | MA Low Flow Report | MA | Trout Brook near Holden, Mass. | 6.79 | 0.05
01095915 | MA Low Flow Report | MA | Mulpus Brook near Shirley, Mass. | 15.7 | 0.39
01095928 | MA Low Flow Report | MA | Trapfall Brook near Ashby, Mass. | 5.89 | 0.02
01096000 | MA Low Flow Report | MA | Squannacook River near West Groton, MA | 64.4 | 6.52
01096504 | MA Low Flow Report | MA | Reedy Meadow Brook at East Pepperell, Mass. | 1.92 | 0.24
01096505 | MA Low Flow Report | MA | Unkety Brook near Pepperell, Mass. | 6.84 | 0.46
01096515 | MA Low Flow Report | MA | Salmon Brook at Main Street at Dunstable, Mass. | 18.2 | 2.34
01096805 | MA Low Flow Report | MA | North Brook near Berlin, Mass. | 15.4 | 0.54
01096855 | MA Low Flow Report | MA | Danforth Brook at Hudson, Mass. | 6.62 | 0.14
01096935 | MA Low Flow Report | MA | Elizabeth Brook at Wheeler Street at Stow, Mass. | 17.2 | 0.76
01097280 | MA Low Flow Report | MA | Fort Pond Brook at West Concord, Mass. | 24.9 | 0.89
01097300 | MA Low Flow Report | MA | Nashoba Brook near Acton, MA | 12.9 | 0.12
01099400 | MA Low Flow Report | MA | River Meadow Brook at Lowell, Mass. | 25.6 | 0.98
01100608 | MA Low Flow Report | MA | Meadow Brook near Tewksbury, Mass. | 4.09 | 0.15
01100700 | MA Low Flow Report | MA | East Meadow River near Haverhill, MA | 5.54 | 0.15
01101100 | MA Low Flow Report | MA | Mill River near Rowley, Mass. | 7.7 | 0.39
01102490 | MA Low Flow Report | MA | Shaker Glen Brook near Woburn, Mass. | 3.05 | 0.17
01103015 | MA Low Flow Report | MA | Mill Brook at Arlington, Mass. | 5.35 | 0.38
01103253 | MA Low Flow Report | MA | Chicken Brook near West Medway, Mass. | 7.23 | 0.18
01103435 | MA Low Flow Report | MA | Waban Brook at Wellesley, Mass. | 10.2 | 0.13
01103440 | MA Low Flow Report | MA | Fuller Brook at Wellesley, Mass. | 3.91 | 0.11
01104960 | MA Low Flow Report | MA | Germany Brook near Norwood, Mass. | 2.37 | 0.08
01104980 | MA Low Flow Report | MA | Hawes Brook at Norwood, Mass. | 8.64 | 0.29
01105568 | MA Low Flow Report | MA | Cochato River at Holbrook, Mass. | 4.31 | 0.09
01105575 | MA Low Flow Report | MA | Cranberry Brook at Braintree Highlands, Mass. | 1.72 | 0.01
01105600 | MA Low Flow Report | MA | Old Swamp River near South Weymouth, MA | 4.47 | 0.16
01105630 | MA Low Flow Report | MA | Crooked Meadow River near Hingham Center, Mass. | 4.91 | 0.27
01105820 | MA Low Flow Report | MA | Second Herring Brook at Norwell, Mass. | 3.17 | 0.03
01105830 | MA Low Flow Report | MA | First Herring Brook near Scituate Center, Mass. | 1.72 | 0.01
01105861 | MA Low Flow Report | MA | Jones River Brook near Kingston, Mass. | 4.74 | 0.49
01105930 | MA Low Flow Report | MA | Paskamanset River at Turner Pond near New Bedford, | 8.09 | 0.32
01105937 | MA Low Flow Report | MA | Shingle Island River near North Dartmouth, Mass. | 8.59 | 0.06
01105947 | MA Low Flow Report | MA | Bread and Cheese Brook at Head of Westport, Mass. | 9.25 | 0.14
01106000 | MA Low Flow Report | RI | Adamsville Brook at Adamsville, RI | 7.99 | 0.05
01106460 | MA Low Flow Report | MA | Beaver Brook near East Bridgewater, Mass. | 8.94 | 0.34
01107400 | MA Low Flow Report | MA | Fall Brook near Middleboro, Mass. | 9.3 | 1.32
01108180 | MA Low Flow Report | MA | Cotley River at East Taunton, Mass. | 7.48 | 0.47
01108600 | MA Low Flow Report | MA | Hodges Brook at West Mansfield, Mass. | 3.83 | 0.03
01109087 | MA Low Flow Report | MA | Assonet River at Assonet, Mass. | 20.7 | 0.62
01109090 | MA Low Flow Report | MA | Rattlesnake Brook near Assonet, Mass. | 4.22 | 0.11
01109225 | MA Low Flow Report | MA | Rocky Run near Rehoboth, Mass. | 7.21 | 0.07
01109460 | MA Low Flow Report | MA | Dark Brook at Auburn, Mass. | 11.1 | 0.94
01111200 | MA Low Flow Report | MA | West River below West Hill Dam, near Uxbridge, MA | 27.8 | 1.80
01111225 | MA Low Flow Report | MA | Emerson Brook near Uxbridge, Mass. | 7.26 | 0.63
01111300 | MA Low Flow Report | RI | Nipmuc River near Harrisville, RI | 16 | 0.25
01112190 | MA Low Flow Report | MA | Muddy Brook at South Milford, Mass. | 6.17 | 0.14
01117468 | HCDN | RI | Beaver River near Usquepaug, RI | 8.87 | 1.78
01118300 | HCDN | CT | Pendleton Hill Brook near Clarks Falls, CT | 4 | 0.02
01121000 | HCDN | CT | Mount Hope River near Warrenville, CT | 27.1 | 0.65
01123000 | HCDN | CT | Little River near Hanover, CT | 30.1 | 4.36
01123140 | MA Low Flow Report | MA | Mill Brook at Brimfield, Mass. | 13.8 | 1.29
01123200 | MA Low Flow Report | MA | Stevens Brook at Holland, Mass. | 4.39 | 0.09
01124390 | MA Low Flow Report | MA | Little River at Richardson Corners, Mass. | 8.58 | 0.20
01134500 | HCDN | VT | Moose River at Victory, VT | 75.3 | 5.97
01137500 | HCDN | NH | Ammonoosuc River at Bethlehem Junction, NH | 88.2 | 27.36
01139000 | HCDN | VT | Wells River at Wells River, VT | 95.1 | 14.12
01139800 | HCDN | VT | East Orange Branch at East Orange, VT | 8.8 | 0.71
01142500 | HCDN | VT | Ayers Brook at Randolph, VT | 31.7 | 2.11
01144000 | HCDN | VT | White River at West Hartford, VT | 691.2 | 90.51
01150900 | HCDN | VT | Ottauquechee River near West Bridgewater, VT | 23.3 | 3.31
01162500 | HCDN | MA | Priest Brook near Winchendon, MA | 19.2 | 0.47
01162900 | MA Low Flow Report | MA | Otter River at Gardner, Mass. | 19.2 | 2.57
01164300 | MA Low Flow Report | MA | Lawrence Brook at Royalston, Mass. | 15.6 | 0.32
01167200 | MA Low Flow Report | MA | Fall River at Bernardston, Mass. | 22.3 | 1.46
01168300 | MA Low Flow Report | MA | Cold River near Zoar, Mass. | 29.6 | 1.69
01168400 | MA Low Flow Report | MA | Chickley River near Charlemont, Mass. | 27.1 | 3.24
01169000 | HCDN | MA | North River at Shattuckville, MA | 89.1 | 8.82
01170100 | HCDN | MA | Green River near Colrain, MA | 41.3 | 4.73
01181000 | HCDN | MA | West Branch Westfield River at Huntington, MA | 94 | 6.03
01187300 | HCDN | MA | Hubbard River near West Hartland, CT | 20.8 | 0.45
01195100 | HCDN | CT | Indian River near Clinton, CT | 5.68 | 0.02
01208990 | HCDN | CT | Saugatuck River near Redding, CT | 20.8 | 0.31
01333000 | HCDN | MA | Green River at Williamstown, MA | 43.3 | 4.80
01350000 | HCDN | NY | Schoharie Creek at Prattsville NY | 236.5 | 9.47
01350080 | HCDN | NY | Manor Kill at West Conesville near Gilboa, NY | 32.4 | 1.67
01350140 | HCDN | NY | Mine Kill near North Blenheim, NY | 16.9 | 0.22
01362200 | HCDN | NY | Esopus Creek at Allaben, NY | 63.7 | 5.48
01365000 | HCDN | NY | Rondout Creek near Lowes Corners, NY | 38.4 | 6.26
01411300 | HCDN | NJ | Tuckahoe River at Head of River, NJ | 30.6 | 6.70
01413500 | HCDN | NY | East Brook Delaware River at Margaretville, NY | 163.7 | 11.17
01414500 | HCDN | NY | Mill Brook near Dunraven, NY | 24.9 | 2.26
01415000 | HCDN | NY | Tremper Kill near Andes, NY | 33.1 | 1.59
01423000 | HCDN | NY | West Branch Delaware River at Walton, NY | 331.9 | 23.49
01434025 | HCDN | NY | Biscuit Brook above Pigeon Brook at Frost Valley, NY | 3.72 | 0.45
01435000 | HCDN | NY | Neversink River near Claryville, NY | 66.6 | 13.99
01439500 | HCDN | PA | Bush Kill at Shoemakers, PA | 118.1 | 7.75
01440000 | HCDN | NJ | Flat Brook near Flatbrookville, NJ | 64.8 | 7.53
01440400 | HCDN | PA | Brodhead Creek near Analomink, PA | 67.6 | 7.60
01451800 | HCDN | PA | Jordan Creek near Schnecksville, PA | 52.4 | 2.78
01466500 | HCDN | NJ | McDonalds Branch in Lebanon State Forest, NJ | 2.1 | 0.82
01484100 | HCDN | DE | Beaverdam Branch at Houston, DE | 3.5 | 0.13
01485500 | HCDN | MD | Nassawango Creek near Snow Hill, MD | 54.6 | 1.17
01486000 | HCDN | MD | Manokin Branch near Princess Anne, MD | 4.3 | 0.04
01487000 | HCDN | DE | Nanticoke River near Bridgeville, DE | 72.4 | 15.99
01491000 | HCDN | MD | Choptank River near Greensboro, MD | 112.8 | 4.09
01510000 | HCDN | NY | Otselic River at Cincinnatus, NY | 147.9 | 9.42
01516500 | HCDN | PA | Corey Creek near Mainesburg, PA | 12.2 | 0.01
01518862 | HCDN | PA | Cowanesque River at Westfield, PA | 90.6 | 1.45
01532000 | HCDN | PA | Towanda Creek near Monroeton, PA | 213.9 | 2.96
01539000 | HCDN | PA | Fishing Creek near Bloomsburg, PA | 271 | 17.83
01542810 | HCDN | PA | Waldy Run near Emporium, PA | 5.3 | 0.09
01543000 | HCDN | PA | Driftwood Br Sinnemahoning Cr at Sterling Run, PA | 272.4 | 4.51
01543500 | HCDN | PA | Sinnemahoning Creek at Sinnemahoning, PA | 686.6 | 15.50
01544500 | HCDN | PA | Kettle Creek at Cross Fork, PA | 137.1 | 5.07
01545600 | HCDN | PA | Young Womans Creek near Renovo, PA | 46.3 | 1.60
01547700 | HCDN | PA | Marsh Creek at Blanchard, PA | 43.8 | 0.66
01548500 | HCDN | PA | Pine Creek at Cedar Run, PA | 601.2 | 24.84
01549500 | HCDN | PA | Blockhouse Creek near English Center, PA | 37.6 | 0.79
01550000 | HCDN | PA | Lycoming Creek near Trout Run, PA | 174.8 | 7.84
01552000 | HCDN | PA | Loyalsock Creek at Loyalsockville, PA | 436.1 | 23.63
01552500 | HCDN | PA | Muncy Creek near Sonestown, PA | 23.4 | 1.19
01557500 | HCDN | PA | Bald Eagle Creek at Tyrone, PA | 44.5 | 3.19
01564500 | HCDN | PA | Aughwick Creek near Three Springs, PA | 205 | 4.41
01567500 | HCDN | PA | Bixler Run near Loysville, PA | 15 | 2.32
01568000 | HCDN | PA | Sherman Creek at Shermans Dale, PA | 206.3 | 15.95
01580000 | HCDN | MD | Deer Creek at Rocks, MD | 94.4 | 23.91
01583500 | HCDN | MD | Western Run at Western Run, MD | 60.2 | 11.57
01586610 | HCDN | MD | Morgan Run near Louisville, MD | 26 | 3.63
01591400 | HCDN | MD | Cattail Creek near Glenwood, MD | 22.8 | 1.71
01594950 | HCDN | MD | McMillan Fort near Fort Pendleton, MD | 2.3 | 0.00
01596500 | HCDN | MD | Savage River near Barton, MD | 48.1 | 1.03
01605500 | HCDN | WV | South Branch Potomac River at Franklin, WV | 179.1 | 20.64
01606500 | HCDN | WV | South Branch Potomac River near Petersburg, WV | 650.4 | 54.55
01613050 | HCDN | PA | Tonoloway Creek near Needmore, PA | 10.8 | 0.00
01620500 | HCDN | VA | North River near Stokesville, VA | 17.3 | 0.23
01632000 | HCDN | VA | N F Shenandoah River at Cootes Store, VA | 209.8 | 0.84
01632900 | HCDN | VA | Smith Creek near New Market, VA | 94.6 | 7.64
01634500 | HCDN | VA | Cedar Creek near Winchester, VA | 101.9 | 4.67
01638480 | HCDN | VA | Catoctin Creek at Taylorstown, VA | 89.6 | 0.66
01639500 | HCDN | MD | Big Pipe Creek at Bruceville, MD | 103.2 | 8.10
01644000 | HCDN | VA | Goose Creek near Leesburg, VA | 331.7 | 2.01
01658500 | HCDN | VA | S F Quantico Creek near Independent Hill, VA | 7.5 | 0.00
01664000 | HCDN | VA | Rappahannock River at Remington, VA | 619.7 | 10.76
01666500 | HCDN | VA | Robinson River near Locust Dale, VA | 178.8 | 9.18
01667500 | HCDN | VA | Rapidan River near Culpeper, VA | 467.1 | 17.08
01669000 | HCDN | VA | Piscataway Creek near Tappahannock, VA | 27.7 | 0.38
01669520 | HCDN | VA | Dragon Swamp at Mascot, VA | 109 | 0.01
02011400 | HCDN | VA | Jackson River near Bacova, VA | 157.4 | 17.06
02011460 | HCDN | VA | Back Creek near Sunrise, VA | 60.4 | 2.01
02013000 | HCDN | VA | Dunlap Creek near Covington, VA | 164 | 10.68
02014000 | HCDN | VA | Potts Creek near Covington, VA | 153.2 | 17.64
02015700 | HCDN | VA | Bullpasture River at Williamsville, VA | 110.2 | 26.07
02016000 | HCDN | VA | Cowpasture River near Clifton Forge, VA | 461.2 | 57.01
02017500 | HCDN | VA | Johns Creek at New Castle, VA | 106.6 | 7.72
02018000 | HCDN | VA | Craig Creek at Parr, VA | 329.1 | 30.75
02027000 | HCDN | VA | Tye River near Lovingston, VA | 93 | 4.03
02027500 | HCDN | VA | Piney River at Piney River, VA | 47.6 | 2.60
02028500 | HCDN | VA | Rockfish River near Greenfield, VA | 94.8 | 2.50
02038850 | HCDN | VA | Holiday Creek near Andersonville, VA | 8.5 | 0.33

References

  1. Smakhtin, V.U. Low flow hydrology: A review. J. Hydrol. 2001, 240, 147–186.
  2. Blum, A.G.; Archfield, S.A.; Hirsch, R.M.; Vogel, R.M.; Kiang, J.E.; Dudley, R.W. Updating estimates of low-streamflow statistics to account for possible trends. Hydrol. Sci. J. 2019, 64, 1404–1414.
  3. Salinas, J.L.; Laaha, G.; Rogger, M.; Parajka, J.; Viglione, A.; Sivapalan, M.; Blöschl, G. Comparative assessment of predictions in ungauged basins—Part 2: Flood and low flow studies. Hydrol. Earth Syst. Sci. 2013, 17, 2637–2652.
  4. Ries, K.G., III; Guthrie, J.D.; Rea, A.H.; Steeves, P.A.; Stewart, D.W. StreamStats: A Water Resources Web Application: U.S. Geological Survey Fact Sheet 2008-3067; U.S. Geological Survey: Reston, VA, USA, 2008; 6p.
  5. Milly, P.C.D.; Betancourt, J.; Falkenmark, M.; Hirsch, R.M.; Kundzewicz, Z.W.; Lettenmaier, D.P.; Stouffer, R.J. Stationarity Is Dead: Whither Water Management? Science 2008, 319, 573–574.
  6. Bayazit, M. Nonstationarity of Hydrological Records and Recent Trends in Trend Analysis: A State-of-the-art Review. Environ. Process. 2015, 2, 527–542.
  7. Salas, J.D.; Obeysekera, J.; Vogel, R.M. Techniques for assessing water infrastructure for nonstationary extreme events: A review. Hydrol. Sci. J. 2018, 63, 325–352.
  8. Hesarkazzazi, S.; Arabzadeh, R.; Hajibabaei, M.; Rauch, W.; Kjeldsen, T.R.; Prosdocimi, I.; Castellarin, A.; Sitzenfrei, R. Stationary vs. non-stationary modelling of flood frequency distribution across northwest England. Hydrol. Sci. J. 2021, 66, 729–744.
  9. Williams, A.P.; Cook, B.I.; Smerdon, J.E. Rapid intensification of the emerging southwestern North American megadrought in 2020–2021. Nat. Clim. Chang. 2022, 12, 232–234.
  10. Ayers, J.; Villarini, G.; Jones, C.; Schilling, K.; Farmer, W. The Role of Climate in Monthly Baseflow Changes across the Continental United States. J. Hydrol. Eng. 2022, 27, 04022006.
  11. Hodgkins, G.A.; Dudley, R.W. Historical summer base flow and stormflow trends for New England rivers. Water Resour. Res. 2011, 47, W07528.
  12. Chaves, H.M.L.; Rosa, J.W.C.; Vadas, R.G.; Oliveira, R.V.T. Regionalization of Minimum Flows in Basins Through Interpolation in Geographic Information Systems. RBRH Braz. J. Water 2002, 7.
  13. Bent, G.C.; Steeves, P.A.; Waite, A.M. Equations for Estimating Selected Streamflow Statistics in Rhode Island: U.S. Geological Survey Scientific Investigations Report 2014-5010; U.S. Geological Survey: Reston, VA, USA, 2014; 65p.
  14. Austin, S.H.; Krstolic, J.L.; Wiegand, U. Low-Flow Characteristics of Virginia Streams: U.S. Geological Survey Scientific Investigations Report 2011-5143; U.S. Geological Survey: Reston, VA, USA, 2011; 122p.
  15. Dudley, R.W. Estimating Monthly, Annual, and Low 7-Day, 10-Year Streamflows for Ungaged Rivers in Maine: U.S. Geological Survey Scientific Investigations Report 2004-5026; U.S. Geological Survey: Reston, VA, USA, 2004; 22p.
  16. Flynn, R.H.; Tasker, G.D. Development of Regression Equations to Estimate Flow Durations and Low-Flow-Frequency Statistics in New Hampshire Streams: U.S. Geological Survey Scientific Investigations Report 02-4298; U.S. Geological Survey: Reston, VA, USA, 2002; 66p.
  17. Stuckey, M.H. Low-Flow, Base-Flow, and Mean-Flow Regression Equations for Pennsylvania Streams: U.S. Geological Survey Scientific Investigations Report 2006-5130; U.S. Geological Survey: Reston, VA, USA, 2006; 84p.
  18. Wiley, J.B. Estimating Selected Streamflow Statistics Representative of 1930–2002 in West Virginia: U.S. Geological Survey Scientific Investigations Report 2008-5105; Version 2; U.S. Geological Survey: Reston, VA, USA, 2008; 24p.
  19. Tasker, G.D.; Stedinger, J.R. An operational GLS model for hydrologic regression. J. Hydrol. 1989, 111, 361–375.
  20. Ries, K.G., III. Methods for Estimating Low-Flow Statistics for Massachusetts Streams: U.S. Geological Survey Water Resources Investigations Report 00-4135; U.S. Geological Survey: Reston, VA, USA, 2000; 81p.
  21. Kratzert, F.; Klotz, D.; Herrnegger, M.; Sampson, A.K.; Hochreiter, S.; Nearing, G.S. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resour. Res. 2019, 55, 11344–11354. [Google Scholar] [CrossRef] [Green Version]
  22. Zhang, S.; Lu, L.; Yu, J.; Zhou, H. Short-term water level prediction using different artificial intelligent models. In Proceedings of the 2016 Fifth International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Tianjin, China, 18–20 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar] [CrossRef]
  23. Soleymani, S.A.; Goudarzi, S.; Anisi, M.H.; Hassan, W.H.; Idris, M.Y.I.; Shamshirband, S.; Ahmedy, I. A novel method to water level prediction using RBF and FFA. Water Resour. Manag. 2016, 30, 3265–3283. [Google Scholar] [CrossRef]
  24. Mosavi, A.; Ozturk, P.; Chau, K.-W. Flood prediction using machine learning models: Literature review. Water 2018, 10, 1536. [Google Scholar] [CrossRef] [Green Version]
  25. Kratzert, F.; Klotz, D.; Shalev, G.; Klambauer, G.; Hochreiter, S.; Nearing, G. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrol. Earth Syst. Sci. 2019, 23, 5089–5110. [Google Scholar] [CrossRef] [Green Version]
  26. Tongal, H.; Martijn, J.B. Simulation and forecasting of streamflows using machine learning models coupled with base flow separation. J. Hydrol. 2018, 564, 266–282. [Google Scholar] [CrossRef]
  27. Nearing, G.S.; Kratzert, F.; Sampson, A.K.; Pelissier, C.S.; Klotz, D.; Frame, J.M.; Prieto, C.; Gupta, H.V. What role does hydrological science play in the age of machine learning? Water Resour. Res. 2021, 57, e2020WR028091.
  28. Worland, S.C.; Farmer, W.H.; Kiang, J.E. Improving predictions of hydrological low-flow indices in ungaged basins using machine learning. Environ. Model. Softw. 2018, 101, 169–182.
  29. Ferreira, R.G.; da Silva, D.D.; Elesbon, A.A.A.; Fernandes-Filho, E.I.; Veloso, G.V.; de Souza Fraga, M.; Ferreira, L.B. Machine learning models for streamflow regionalization in a tropical watershed. J. Environ. Manag. 2021, 280, 111713.
  30. Laimighofer, J.; Melcher, M.; Laaha, G. Parsimonious statistical learning models for low-flow estimation. Hydrol. Earth Syst. Sci. 2022, 26, 129–148.
  31. Vogel, R.M.; Kroll, C.N. Generalized low-flow frequency relationships for ungaged sites in Massachusetts. J. Am. Water Resour. Assoc. 1990, 26, 241–253.
  32. Lins, H.F. USGS Hydro-Climatic Data Network 2009 (HCDN-2009); Fact Sheet 2012-3047; U.S. Geological Survey: Reston, VA, USA, 2012. Available online: https://pubs.er.usgs.gov/publication/fs20123047 (accessed on 18 December 2020).
  33. Livneh, B.; Bohn, T.J.; Pierce, D.W.; Muñoz-Arriola, F.; Nijssen, B.; Vose, R.; Cayan, D.R.; Brekke, L. A Spatially Comprehensive, Meteorological Data Set for Mexico, the U.S., and Southern Canada (NCEI Accession 0129374). NOAA National Centers for Environmental Information. Dataset. 2015. Available online: https://doi.org/10.7289/v5x34vf6 (accessed on 20 May 2021).
  34. Livneh, B.; National Center for Atmospheric Research Staff (Eds.) Last Modified 12 Dec 2019. The Climate Data Guide: Livneh Gridded Precipitation and Other Meteorological Variables for Continental US, Mexico and Southern Canada. 2019. Available online: https://climatedataguide.ucar.edu/climate-data/livneh-gridded-precipitation-and-other-meteorological-variables-continental-us-mexico (accessed on 18 December 2020).
  35. Zhu, C.; Lettenmaier, D.P. Long-term climate and derived surface hydrology and energy flux data for Mexico: 1925–2004. J. Clim. 2007, 20, 1936–1946.
  36. Shepard, D.S. Computer mapping: The SYMAP interpolation algorithm. In Spatial Statistics and Models; Gaile, G.L., Willmott, C.J., Reidel, D., Eds.; Springer: Dordrecht, The Netherlands, 1984; pp. 133–145.
  37. Iman, R.L.; Conover, W.J. A Modern Approach to Statistics; John Wiley: New York, NY, USA, 1983; 497p.
  38. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  39. McCulloch, W.S.; Pitts, W. A Logical Calculus of Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 1943, 5, 115–133.
  40. Hastie, T.; Tibshirani, R.J. Generalized Additive Models; Chapman and Hall: Boca Raton, FL, USA, 1986.
  41. Molinaro, A.M.; Simon, R.; Pfeiffer, R.M. Prediction error estimation: A comparison of resampling methods. Bioinformatics 2005, 21, 3301–3307.
  42. Wright, S. Correlation and causation. J. Agric. Res. 1921, 20, 557–585.
  43. Shortridge, J.E.; Guikema, S.D.; Zaitchik, B.F. Machine learning methods for empirical streamflow simulation: A comparison of model accuracy, interpretability, and uncertainty in seasonal watersheds. Hydrol. Earth Syst. Sci. 2016, 20, 2611–2628.
  44. Mekanik, F.; Imteaz, M.A.; Talei, A. Seasonal rainfall forecasting by adaptive network-based fuzzy inference system (ANFIS) using large scale climate signals. Clim. Dynam. 2016, 46, 3097–3111.
  45. Nash, J.E.; Sutcliffe, J.V. River flow forecasting through conceptual models part I—A discussion of principles. J. Hydrol. 1970, 10, 282–290.
  46. Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modeling. J. Hydrol. 2009, 377, 80–91.
  47. Formetta, G.; Mantilla, R.; Franceschi, S.; Antonello, A.; Rigon, R. The JGrass-NewAge system for forecasting and managing the hydrological budgets at the basin scale: Models of flow generation and propagation/routing. Geosci. Model Dev. 2011, 4, 943–955.
  48. Beck, H.E.; van Dijk, A.I.J.M.; de Roo, A.; Miralles, D.G.; McVicar, T.R.; Schellekens, J.; Bruijnzeel, L.A. Global-scale regionalization of hydrologic model parameters. Water Resour. Res. 2016, 52, 3599–3622.
  49. Rumsey, C.A.; Miller, M.P.; Susong, D.D.; Tillman, F.D.; Anning, D.W. Regional scale estimates of baseflow and factors influencing baseflow in the Upper Colorado River Basin. J. Hydrol. Reg. Stud. 2015, 4, 91–107.
  50. Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70.
Figure 1. The 165 gages and their corresponding watersheds used for this analysis. Gages are displayed as blue triangles, the standard convention by the USGS for designating gaging stations. Watersheds are displayed as distinct colors to facilitate differentiation.
Figure 2. Schematic representation of the random forest algorithm.
Figure 3. Schematic representation of a simple neural network with two hidden layers.
Figure 4. A standard linear model vs. a generalized additive model for the same data.
Figure 5. Example structure of leave-one-out cross-validation.
Figure 6. Flowchart summarizing the methodology. Includes data from the USGS’ Hydro-Climatic Data Network [32], USGS’ StreamStats Reports [13,14,15,16,17,18,19,20], USGS’ MA Low Flow Report [20], and the Livneh Climate Dataset [33].
Figure 7. Multiple linear regression estimates vs. actual historical 7Q10s (cfs).
Figure 8. Residuals for multiple linear regression (cfs).
Figure 9. Logarithmic-transformed linear regression estimates vs. actual historical 7Q10s (cfs).
Figure 10. Residuals for logarithmic-transformed linear regression (cfs).
Figure 11. Estimated out-of-bag error vs. number of trees applied.
Figure 12. Residuals for the random forest estimates (cfs).
Figure 13. Random forest estimated 7Q10 values vs. actual historical 7Q10s (cfs).
Figure 14. Neural network estimated 7Q10 values vs. actual historical 7Q10s (cfs).
Figure 15. Residuals for the neural network model (cfs).
Figure 16. Generalized additive model estimated 7Q10 values vs. actual historical 7Q10s (cfs).
Figure 17. Residuals for the generalized additive model.
Figure 18. (a–c) Raw 7Q10 estimates using each methodology.
Table 1. StreamStats 7Q10 estimation by state. For states divided into regions, the numbers in parentheses after a variable identify the regions in which it is used.
| State | Variables Used for 7Q10 Estimation |
| --- | --- |
| Massachusetts [20] | Drainage area; area of stratified-drift deposits per unit of stream length plus 0.1; mean basin slope; indicator variable (0 in the eastern region, 1 in the western region) |
| Rhode Island [13] | Drainage area; stream density |
| New Hampshire [16] | Drainage area; mean annual temperature; Jun to Oct average gage precipitation |
| Maine [15] | Drainage area; fraction of sand and gravel aquifers |
| Pennsylvania [17]. Regions: 1 Southeast, 2 Central-east, 3 Northwest, 4 Southwest, 5 Northeast | Drainage area (1,2,3,4,5); basin slope (1); mean elevation (3,4); mean annual precipitation (2,3); stream density (2); soil thickness (1,2); percent glaciation (5); percent carbonate bedrock (2); percent forested area (5); percent urban area (1) |
| Virginia [14]. Regions: 1 Coastal Plain, 2 Piedmont, 3 Blue Ridge, 4 Valley and Ridge, 5 Appalachian Plateaus, 6 Mesozoic Basins | Drainage area (1,2,3,4,5,6) |
| West Virginia [18]. Regions: 1 North, 2 South Central, 3 Eastern Panhandle | Drainage area (1,2,3); longitude of basin centroid (1) |
| Connecticut, Delaware, Maryland, New Jersey, New York, Vermont | Unavailable |
Table 2. Input variables for estimating 7Q10.
| Variable | Description |
| --- | --- |
| Area (mi²) | Watershed area |
| Mean Elevation (ft) | Average elevation of the watershed |
| Slope (%) | Average basin slope |
| Percent Wetland (%) | Wetland percentage of the watershed |
| Percent Forest (%) | Forest percentage of the watershed |
| Min 30-day Cumulative Precipitation (mm) | Lowest 30-day cumulative precipitation, limited to abnormally hot periods (X > 90th percentile temperatures) |
| Average 30-day High Temperatures (°C) | Average 30-day temperature during the corresponding period of low cumulative precipitation |
Table 3. Multiple linear regression variables and significance.
| Variable | Estimate | p-Value | Significance |
| --- | --- | --- | --- |
| Area (mi²) | 0.0579833 | 2 × 10⁻¹⁶ | <0.001 |
| Mean Elevation (ft) | 0.0014461 | 0.05343 | <0.1 |
| Slope (%) | −0.2804801 | 0.00197 | <0.01 |
| Percent Wetland (%) | −0.0618991 | 0.27690 | No significance |
| Percent Forest (%) | 0.0048811 | 0.86061 | No significance |
| Min 30-day Cumulative Precipitation (mm) | 0.3421899 | 0.00246 | <0.01 |
| Average 30-day High Temperatures (°C) | −0.0466718 | 0.58980 | No significance |
Table 4. Logarithmic-transformed linear regression variables and significance.
| Variable | Estimate | p-Value | Significance |
| --- | --- | --- | --- |
| Intercept | 4.27157 | 0.1590 | No significance |
| Area (mi²) | 1.31308 | 2 × 10⁻¹⁶ | <0.001 |
| Mean Elevation (ft) | −0.11573 | 0.2908 | No significance |
| Slope (%) | −0.19413 | 0.0303 | <0.05 |
| Percent Wetland (%) | −0.02036 | 0.7542 | No significance |
| Percent Forest (%) | 0.22437 | 0.0489 | <0.05 |
| Min 30-day Cumulative Precipitation (mm) | 0.31049 | 0.0160 | <0.05 |
| Average 30-day High Temperatures (°C) | −4.37462 | 0.0309 | <0.05 |
Table 5. Random forest variables and significance.
| Variable | % Included MSE | p-Value | Significance |
| --- | --- | --- | --- |
| Area (mi²) | 57.580982 | 0.0099 | <0.01 |
| Mean Elevation (ft) | 6.911398 | 0.03465 | <0.05 |
| Slope (%) | 2.257348 | 0.07228 | <0.1 |
| Percent Wetland (%) | −2.099959 | 0.9901 | No significance |
| Percent Forest (%) | 3.335443 | 0.08911 | <0.1 |
| Min 30-day Cumulative Precipitation (mm) | 7.726635 | 0.04275 | <0.05 |
| Average 30-day High Temperatures (°C) | 1.265636 | 0.6634 | No significance |
Table 6. Generalized additive model variables and significance.
| Variable | Estimated Degrees of Freedom | p-Value | Significance |
| --- | --- | --- | --- |
| Area (mi²) | 8.407 | 0.000000 | <0.001 |
| Mean Elevation (ft) | 8.143 | 0.000608 | <0.001 |
| Slope (%) | 1.000 | 0.230267 | No significance |
| Percent Wetland (%) | 1.315 | 0.160834 | No significance |
| Percent Forest (%) | 3.548 | 0.484701 | No significance |
| Min 30-day Cumulative Precipitation (mm) | 1.661 | 0.017755 | <0.05 |
| Average 30-day High Temperatures (°C) | 1.466 | 0.007338 | <0.01 |
Table 7. Multiple method comparisons to current estimates.
| Method | R² | KGE | NSE | RMSE |
| --- | --- | --- | --- | --- |
| StreamStats Estimates | 0.66 | 0.66 | 0.65 | 9.88 |
| Log-Transformed Linear Regression | 0.67 | 0.54 | 0.62 | 13.50 |
| Multiple Linear Regression | 0.70 | 0.80 | 0.63 | 7.14 |
| Random Forest | 0.63 | 0.76 | 0.53 | 7.97 |
| Neural Network | 0.73 | 0.82 | 0.67 | 6.79 |
| Generalized Additive Model | 0.84 | 0.91 | 0.83 | 5.19 |
Table 8. Comparisons between statistical methods using leave-one-out cross-validation.
| Method | R² | KGE | NSE | RMSE |
| --- | --- | --- | --- | --- |
| Log-Transformed Linear Regression | 0.72 | 0.50 | 0.62 | 15.24 |
| Multiple Linear Regression | 0.60 | 0.73 | 0.47 | 8.53 |
| Random Forest | 0.61 | 0.69 | 0.41 | 8.39 |
| Neural Network | 0.53 | 0.69 | 0.36 | 9.41 |
| Generalized Additive Model | 0.53 | 0.65 | 0.52 | 12.15 |
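The four scores reported in Tables 7 and 8, together with the leave-one-out procedure of Figure 5, can be illustrated with a minimal Python sketch. This is not the authors' code: the single-predictor least-squares model, all function names, and the sample data are assumptions made for illustration. R² is taken here as the squared Pearson correlation, and KGE follows the 2009 formulation of Gupta et al. [46].

```python
import math
from statistics import mean, pstdev


def pearson_r(obs, est):
    """Pearson correlation between observed and estimated values."""
    mo, me = mean(obs), mean(est)
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_e = sum((e - me) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)


def rmse(obs, est):
    """Root mean square error."""
    return math.sqrt(mean((o - e) ** 2 for o, e in zip(obs, est)))


def nse(obs, est):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    mo = mean(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, est))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)


def kge(obs, est):
    """Kling-Gupta Efficiency from correlation, variability ratio, and bias ratio."""
    r = pearson_r(obs, est)
    alpha = pstdev(est) / pstdev(obs)
    beta = mean(est) / mean(obs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def loocv_predictions(x, y):
    """Refit the model n times, each time predicting the one held-out site."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds


# Hypothetical basin areas (mi^2) and 7Q10 values (cfs); not data from the paper.
area = [5.0, 12.0, 30.0, 55.0, 90.0, 140.0]
q710 = [0.3, 0.9, 2.1, 4.2, 6.0, 10.5]
est = loocv_predictions(area, q710)
r = pearson_r(q710, est)
print(f"R2={r * r:.2f} KGE={kge(q710, est):.2f} "
      f"NSE={nse(q710, est):.2f} RMSE={rmse(q710, est):.2f}")
```

Because each held-out site is predicted by a model that never saw it, scores computed this way are expected to be weaker than in-sample scores, which is the pattern seen when comparing Table 8 with Table 7 for most methods.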
Table 9. RMSE comparisons between methods for specified size ranges.
RMSE for Each Methodology
| Subset | MLR | LTLR | RF | NN | GAM |
| --- | --- | --- | --- | --- | --- |
| Small Basins (<15 mi²) | 2.11 | 0.34 | 0.44 | 2.77 | 2.97 |
| Medium Basins (15–70 mi²) | 3.96 | 2.83 | 3.09 | 3.87 | 4.66 |
| Large Basins (>70 mi²) | 14.02 | 26.23 | 14.01 | 15.60 | 20.32 |
| Average | 6.70 | 9.80 | 5.85 | 7.41 | 9.31 |
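The Average row of Table 9 appears consistent with an unweighted mean of the three size-class RMSE values for each method. The short Python check below assumes that interpretation (the paper does not state the formula in this section) and uses only the rounded values printed in the table; the variable names are illustrative.

```python
from statistics import mean

# RMSE values copied from Table 9, ordered small, medium, large basins.
rmse_by_method = {
    "MLR": [2.11, 3.96, 14.02],
    "LTLR": [0.34, 2.83, 26.23],
    "RF": [0.44, 3.09, 14.01],
    "NN": [2.77, 3.87, 15.60],
    "GAM": [2.97, 4.66, 20.32],
}

# Assumption: the Average row is the unweighted mean over the three size classes.
averages = {method: mean(values) for method, values in rmse_by_method.items()}
best = min(averages, key=averages.get)

for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
print("lowest average RMSE:", best)
```

Recomputing from the rounded entries reproduces the Average row to within 0.01 (the GAM value rounds to 9.32 rather than the reported 9.31, presumably because the published average was computed from unrounded values) and identifies RF as the method with the lowest average RMSE.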
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
