1. Introduction
Urban stormwater models are primary components of the monitoring system for real-time water flow and water quality simulation and prediction. In the literature, many urban hydrology models are well-established. However, there are few studies that attempt to model both flow and water quality taking into account the whole complexity of the physical, chemical, and biological processes involved [
1,
2]. Moreover, urban water quality studies need to combine hydrological modelling of natural surfaces with the performance of urban man-made structures and impervious areas in a comprehensive hydrological modelling approach. The importance of access to and preservation of clean water is emphasised by the United Nations Sustainable Development Goals to “ensure availability and sustainable management of water and sanitation for all” (Goal 6) and to “conserve and sustainably use the oceans, seas and marine resources for sustainable development” (Goal 14) [
3].
Zoppou [
4] presents a review of eight urban stormwater models specifically designed for simulating water quantity and quality: among others, Quantity–Quality Simulation (QQS) [
5]; Storm Water Management Model (SWMM) [
6]; and MIKE-SWMM, a combination of MIKE 11 [
7] and SWMM. Although QQS can simulate chemical oxygen demand (COD) and total nitrogen, it does not provide the capability to simulate ammonium (NH₄). Similarly, the reviewed SWMM version does not provide a routine for simulating COD or NH₄. Additionally, although MIKE-SWMM simulates several water quality variables, it does not provide for specific simulation of COD.
Mitchell et al. [
8] present a state-of-the-art review of integrated urban drainage models, in which seven models are reviewed in detail: Aquacycle [
9], Hydro Planner [
10], Krakatoa [
11], UrbanCycle [
12], Mike Urban [
13], UVQ [
14], and WaterCress [
15]. Mitchell et al. [
8] concluded that these models are weak in terms of handling temporal and spatial scales, input data uncertainty, and representation of urban infrastructure dynamics over time within a 10 to 100 year horizon.
Bach et al. [
16] present a critical review of integrated urban drainage modelling (UDM) and compare 20 different software tools used for integrated modelling: among others, integrated urban drainage models (IUDMs) such as InfoWorks CS [
17], Simulation of Biological Wastewater Systems (SIMBA) [
18], SWMM [
19], and WEST [
20]; integrated urban water cycle models (IUWCMs) such as City Drain 3 [
21], Model for Urban Stormwater Improvement Conceptualisation (MUSIC) [
22], MIKE URBAN [
23], UrbanCycle [
12], and UrbanDeveloper [
24]; and integrated urban water system models (IUWSMs) such as Dynamic Adaptation for eNabling City Evolution for Water (DAnCE4Water) [
25]. In their comparison, they evaluated nine different urban drainage processes, five urban drainage components, and eight types of model applications. As a future outlook of integrated urban water models, they highlight that improvements are required for representing spatial and temporal processes in these models [
8], with special attention required to address long-time-series simulation [
1,
26]. Additionally, they recognise that integrated urban water modelling must explore parallel computing with efforts to improve the performance of existing software [
21,
27] and encourage researchers to adapt to emerging computational technology. The review above suggests that there is still room to improve urban water models, specifically urban drainage models. One problem is that most models are very complex and require a large amount of data for calibration and for accurate simulation of the processes. Following [
28], these complex mechanistic models describe the flow routing in pipes by the de Saint-Venant equations, which are based on the conservation of mass and momentum. These partial differential equations are solved by numerical algorithms that are often computationally demanding. Therefore, this approach is impractical for long-term simulation or optimisation tasks. As an alternative, surrogate models are frequently mentioned in the literature [
28,
29,
30,
31,
32]. These models are faster and represent an approximate substitute of the “real process”, that is, the complex mechanistic model that better represents reality. Meirlaen et al. [
28] distinguish between two types of possible simplifications, the empirical (black box) and the mechanistic (white box) approaches, and present a framework for developing a mechanistic surrogate model from a complex mechanistic model (CMM), reducing the computational time by a factor of 3.
Jin [
33] presents a comprehensive survey of fitness approximation in evolutionary computation, whereby polynomials, the kriging model, neural networks, and support vector machines are described as the most often used methods of surrogate modelling to improve computational efficiency. However, these methods are of the black box type, which implies that the physical description and meaning of the processes that are simulated are lost.
From a different perspective, efforts have been made to simplify CMMs [
34,
35,
36,
37,
38,
39], but these approaches remain complex. Complex urban drainage models can be even more troublesome when Monte Carlo (MC) based uncertainty propagation analysis is required, because this analysis requires formidable computation times. Therefore, it becomes increasingly important to address scalability issues [
27]. By scalability, we refer to the capability to deploy adaptive algorithms and run models efficiently in different hardware configurations (i.e., the number of threads) in distributed computing environments. Parallel computation is a key component in hydrological modelling for expediting computations.
To the best knowledge of the authors, in the realm of urban drainage modelling, there are only a few examples of scalable implementations for solving intensive computational tasks in watershed-distributed or semi-distributed modelling. Some examples of parallel computing are found in other fields, such as in the application of watershed-distributed eco-hydrological models [
40] and in large-scale integrated hydrological modelling [
41,
42], but examples in the urban drainage domain are very scarce [
27,
43,
44].
The above indicates that urban drainage modelling still requires simplified or surrogate models and implementations of parallel computing, specifically scalable frameworks, that are, in addition, easily accessible. In this paper, we address this need by developing and presenting EmiStatR, “Emissions and Statistics in R for Wastewater and Pollutants in Combined Sewer Systems”, a mechanistic simplified urban water model for the simulation of combined sewer overflow (CSO) emissions. Specifically, we contribute a tool for performing short- and long-term simulations, developed in a parallel computing framework and allowing fast calculations while preserving the physical description and meaning of the processes simulated.
We also demonstrate that it is possible to obtain similar accuracy for water quantity and quality with this simplified and scalable model, compared to results of a complex mechanistic full hydrodynamic model. We focus on COD and NH₄ as water quality measures. COD is a standard for dimensioning CSO structures. NH₄ represents a diluted substance that can have a significant impact on surface water quality because of possible transformation to ammonia (NH₃). Additionally, COD and NH₄ are key variables for evaluation of the performance of wastewater treatment plants (WwTPs) and the quality status of receiving water bodies. A detailed outlook regarding the relevance of transformation and nutrient removal from the water column is presented by Bell and co-workers [
45].
This paper has three main objectives: (1) The development of a simplified mechanistic urban water model, EmiStatR, which represents the overall dynamic behaviour of the CSO spill volume, load, and concentration of COD and NH₄. (2) The presentation of an implementation of the model in R with parallel computation capabilities, allowing fast and scalable calculations, particularly for scenarios with long simulation periods and in MC uncertainty propagation mode. (3) The calibration and application of EmiStatR to a Luxembourg case study and validation by comparing the performance against a CMM that uses the de Saint-Venant partial differential equations to describe the flow routing in the pipes of the sewer network.
3. Case Study
3.1. Study Area
A test case was created to evaluate the use and performance of EmiStatR. A sub-catchment of the Haute-Sûre catchment in the northwest of Luxembourg was chosen. The combined sewer system of the sub-catchment drains the three villages Goesdorf, Kaundorf, and Nocher-Route. In the local sewer system downstream from the villages, three CSOCs are located to store pollutant peaks in the first flush of CSFs.
Figure 3 depicts their locations and the delineation of the catchment. The topography of the area is characterised by a hilly landscape. The elevations around Goesdorf are between 390 and 490 m, around Kaundorf are between 370 and 464 m, and in the area of Nocher-Route vary between 400 and 485 m. The main land use types in the villages are residential, smaller industries, and farms. Outside of the villages, forest as well as agricultural areas and grassland are the dominating land uses. The receiving water bodies at CSO structures in Goesdorf, Kaundorf, and Nocher-Route are tributaries of the river Sûre (Sauer, in German) (
Figure 3).
3.2. Model Calibration
Measured precipitation time series at the Goesdorf CSOC served as input for the model calibration for water quantity output variables. This time series was recorded from May 15, 2011 to June 3, 2011 at 1 min resolution. Seven water quantity parameters were selected for calibration: (1) water consumption; (2) infiltration flow; (3) time flow; (4) run-off coefficient for impervious area; (5) run-off coefficient for pervious area; (6) orifice coefficient of discharge; and (7) initial level of water in the CSOC.
For calibration, we used the DREAM algorithm [
48]. DREAM has the capability of running and evaluating multiple different chains simultaneously for global exploration. The algorithm tunes the proposal distribution in randomised subspaces during the search. DREAM enhances the applicability of Markov chain Monte Carlo (MCMC) sampling approaches in complex problems [
48]. The main building block of the DREAM algorithm is the Differential Evolution Markov Chain (DE-MC) method presented by ter Braak [
60]. In DE-MC, several Markov chains are run in parallel, and at any given iteration their current states form a population. A jump in each chain is generated by adding to its current state a fixed multiple of the difference between the states of two other chains, drawn at random without replacement. To accept or reject candidate points, the Metropolis ratio is used [
60].
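The DE-MC jump and acceptance rule described above can be illustrated with a minimal Python sketch. It is not the DREAM implementation used in this study: the function names, the toy uniform target density, the small Gaussian perturbation `noise_sd`, and the choice of `gamma` (the value 2.38/√(2d) commonly quoted for DE-MC) are assumptions made for illustration only.

```python
import math
import random


def de_mc_sweep(chains, log_density, gamma, noise_sd, rng):
    """Update every chain once with a DE-MC jump and return the new population.

    chains      : list of parameter vectors (one list of floats per chain)
    log_density : hypothetical callable returning the log of the target density
    gamma       : fixed multiple applied to the difference of two chains
    noise_sd    : standard deviation of a small random perturbation (assumed)
    """
    population = [list(chain) for chain in chains]
    n_chains = len(population)
    for i in range(n_chains):
        # Draw two *other* chains at random, without replacement.
        others = [j for j in range(n_chains) if j != i]
        r1, r2 = rng.sample(others, 2)
        current = population[i]
        proposal = [
            x + gamma * (a - b) + rng.gauss(0.0, noise_sd)
            for x, a, b in zip(current, population[r1], population[r2])
        ]
        # Metropolis ratio on the log scale: accept with probability min(1, ratio).
        log_ratio = log_density(proposal) - log_density(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            population[i] = proposal
    return population


def log_uniform_unit(theta):
    """Toy target (illustrative only): uniform density on [0, 1] per dimension."""
    return 0.0 if all(0.0 <= x <= 1.0 for x in theta) else -math.inf


rng = random.Random(42)
chains = [[0.1], [0.4], [0.6], [0.9]]  # four chains, one parameter each
gamma = 2.38 / math.sqrt(2 * 1)        # d = 1 parameter
for _ in range(200):
    chains = de_mc_sweep(chains, log_uniform_unit, gamma, 1e-3, rng)
```

Because candidates outside the unit interval have zero density, they are always rejected, so every chain stays inside the support of the toy target while still moving within it.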
The DREAM algorithm is implemented in R in the R package dream [
49]. Observations of water level in the Goesdorf storage CSOC served as reference for optimising the model parameters. The water level was recorded from April 19, 2011 to July 15, 2011 at 30 s time steps. The precipitation and water level observations were aggregated to 10 min intervals to ensure that the model simulations and observations had the same temporal support before comparison. The observations were divided into two sets, one for calibration and one for validation. The calibration set comprised the initial section of the measurements from May 15 to June 3, 2011, a total of 2698 records at 10 min time steps. The validation set comprised the measurements from June 3 to July 7, 2011, a total of 4901 records at 10 min time steps.
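As an illustration of bringing both series to a common 10 min temporal support, the following Python sketch aggregates regular time series by fixed blocks. The aggregation rules are assumptions, since the text does not state them: rainfall depths are summed and water levels are averaged. The function name and the sample data are hypothetical.

```python
def aggregate(values, source_step_s, target_step_s, how):
    """Aggregate a regular time series to a coarser regular time step.

    how = "sum"  adds the values in each block (e.g. rainfall depth),
    how = "mean" averages them (e.g. water level).
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    if target_step_s % source_step_s != 0:
        raise ValueError("target step must be a multiple of the source step")
    block = target_step_s // source_step_s
    n_blocks = len(values) // block  # an incomplete trailing block is dropped
    out = []
    for k in range(n_blocks):
        chunk = values[k * block:(k + 1) * block]
        total = sum(chunk)
        out.append(total if how == "sum" else total / len(chunk))
    return out


# Hypothetical data: 20 min of 1 min rainfall depths and 30 s water levels.
rain_1min = [0.0] * 5 + [0.2] * 10 + [0.0] * 5
level_30s = [1.0] * 20 + [1.5] * 20

rain_10min = aggregate(rain_1min, 60, 600, "sum")     # two 10 min totals
level_10min = aggregate(level_30s, 30, 600, "mean")   # two 10 min means
```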
DREAM optimises by minimising the root-mean-squared error (RMSE). The calibration results were evaluated using three accuracy measures: the mean error (ME), the RMSE, and the Nash–Sutcliffe model efficiency coefficient (NSE) [
61]:
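$$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - O_i\right)^{2}}$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(S_i - O_i\right)^{2}}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^{2}}$$

where $O_i$ is the ith observation, $S_i$ is the ith simulation, $\bar{O}$ is the mean of the observations, and N is the number of observations (and simulations).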
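The three accuracy measures can be computed as in the following Python sketch. It uses the standard definitions of RMSE and NSE; the sign convention of the mean error (simulation minus observation, so that a negative value indicates under-prediction) is an assumption, and the function names and sample values are illustrative.

```python
import math


def mean_error(obs, sim):
    """Mean error, taken here as simulation minus observation (assumed sign)."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)


def rmse(obs, sim):
    """Root-mean-squared error between simulations and observations."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((s - o) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


# Hypothetical values, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 3.0, 4.0, 5.0]

me_value = mean_error(obs, sim)
rmse_value = rmse(obs, sim)
nse_value = nse(obs, sim)
```

ME and RMSE carry the units of the compared variable (cubic metres for the CSOC volume), whereas NSE is dimensionless.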
For Kaundorf and Nocher-Route, sufficient calibration data were not available. We therefore used the reference values (
Table 2).
Regarding the water quality module of EmiStatR, six parameters are required to define pollution: (1) COD load per PE per day in the wastewater; (2) NH₄ load per PE per day in the wastewater; (3) COD load per PE per day in the infiltration water; (4) NH₄ load per PE per day in the infiltration water; (5) COD concentration in the run-off; and (6) NH₄ concentration in the run-off. If these parameters are not measured directly, they can be calibrated when observations of COD or NH₄ (concentrations or loads) in the CSO spill are available. In this case study, we did not need to calibrate the COD and NH₄ loads in the wastewater for Goesdorf, Kaundorf, or Nocher-Route, because 91 observations in total under DWF conditions were available. The measured COD load had a mean value of 104 g·PE⁻¹·d⁻¹ with a standard deviation of 87.5 g·PE⁻¹·d⁻¹. The measured NH₄ load had a mean value of 4.7 g·PE⁻¹·d⁻¹ with a standard deviation of 1.92 g·PE⁻¹·d⁻¹. The temporal support of these observations was 120 min. The other four input parameters of the water quality module (COD and NH₄ in the infiltration water and in the run-off) were set to zero, because the concentrations in rainfall and infiltration water were judged negligible compared to that of household sewage. We chose periods from 2010 and 2011 for both calibration and validation.
Calibration Results of the Water Quantity Model
Table 3 and
Figure 4a present the final calibration results of the hydraulic model obtained with the DREAM algorithm. The calibration required 980 function evaluations. The optimised set of parameters produced an ME of −1.35 m³, an RMSE of 6.85 m³, and an NSE of 0.95. In this case, the throttled outflow of the CSOC was set to 5 L·s⁻¹ and the CSOC volume V was set to 190 m³ (actual conditions for 2011).
Figure 4a shows the precipitation input time series for the calibration dataset (upper inset) and the comparison of observed and simulated time series of the CSOC volume (bottom inset). For the events presented in
Figure 4, the values of ME and RMSE are in cubic metres, whereas the NSE is dimensionless. From
Figure 4a, it can be inferred that the calibrated model adequately simulated the volume in the CSOC (NSE = 0.95). The simulation was slightly below the observations, specifically under low-rainfall conditions. Additionally, the simulation over-predicted the peak CSOC volume.
3.3. Validation of Model Predictions
Besides the calibration set, another set of measurements was used as independent observations to assess the accuracy of the model predictions for validation of the water quantity model. Input precipitation was recorded from June 3 to July 7, 2011 at a temporal resolution of 1 min, aggregated to 10 min. The observations of water level in the storage CSOC correspond to this period.
Figure 4b shows the results of the hydraulic model validation. It shows the precipitation input time series (upper inset), the comparison of observed and simulated time series of the CSOC volume (middle inset), and the comparison with the output of a CMM (bottom inset).
The CMM was implemented in the software InfoWorks ICM 7.5 (Innovyze Ltd, Wallingford, Oxfordshire, United Kingdom), and it served as a benchmark to calibrate and validate EmiStatR for water quantity and quality variables. The CMM was a full hydrodynamic flow and pollution load model, which implemented the de Saint-Venant partial differential equations and was built initially in the software InfoWorks CS® (Innovyze) [
62]. This model was used to simulate surface run-off and discharge characteristics in local sewer systems and the behaviour of CSO structures in the Goesdorf sub-catchment and future sewer systems linked to weather periods. Besides the catchment data and structural data of sewer sections planned and in operation, the simulations were based on local rain data for local calibration and on regional long-term rain data to simulate the long-term performance of the system. A coarse calibration and validation process showed that the model reproduced the discharge characteristics in the local sewer systems of selected villages sufficiently well. The resulting parameterisation to model surface run-off characteristics from impervious areas in the villages, such as initial losses, was applied to further catchments showing similar characteristics [
62]. The calibrated model of the catchment and drainage network of the case study, implemented in InfoWorks CS and upgraded to InfoWorks ICM 7.5, was used to validate the performance of EmiStatR. We followed a similar procedure as presented by Meirlaen et al. [
28] for developing a mechanistic surrogate model from a CMM.
In general, the validation showed good agreement between simulation and observations (NSE of 0.78). The simulated CSOC volume was slightly below the observations, and as a consequence, the simulated peaks were lower than the observed ones, which also agreed with the behaviour shown in
Figure 4a.
Regarding the water quality module of EmiStatR, we performed a validation on the basis of a 1 year simulation with the CMM. We ran the validation at 10 min time steps and aggregated the results to 120 min to eliminate short-time variability. Our interest was in the average load of pollutants over several hours, which corresponded well with the usual time for taking water samples for further laboratory analysis. The input values of the two main parameters were 104 g·PE⁻¹·d⁻¹ for the COD load in the wastewater and 4.7 g·PE⁻¹·d⁻¹ for the NH₄ load in the wastewater. These values corresponded to wastewater quality (WwQ) measurements. The total COD and NH₄ were monitored in the CSOC under DWF conditions.
Figure 4b (bottom inset) shows how the model simulation agreed with observations (NSE of 0.79). The model simulation was also systematically below the observations of CSOC volume.
Additionally, to perform a more extensive validation of the water quality model, we compared its output with simulations obtained with the CMM for a 1 year time series at 10 min time steps.
Table 4 and
Figure 5 summarise the results of this validation. The results suggest that EmiStatR performed with good accuracy (NSE ≈ 0.80) when compared with the CMM.
3.4. Scalability and Performance
A hardware set-up was defined to execute the scalability test. We used an Intel(R) Xeon(R) CPU E7-L8867 server (Santa Clara, CA, USA) at 2.13 GHz with 40 physical cores (and 40 virtual cores) at 1.064 GHz, 516 GByte of random access memory (RAM), and the operating system (OS) Linux Ubuntu 12.04.5 LTS 64-bit. We used a maximum of 32 cores. Additionally, we multiplied the number of simulations by 10 and 100 to evaluate the model runtime under repeated model calls, such as would typically be required in MC uncertainty analyses. As a result, the selected numbers of simulations were 32, 320, and 3200.
Regarding the results of the scalability test, the code implemented in EmiStatR allowed for specifying the number of cores to be used in the simulation according to the number of cores available. In the scalability test, a single simulation referred to a full year at 10 min time steps. We used the calibrated values for Goesdorf. For Kaundorf and Nocher-Route, we used the reference values given in
Section 2.3.1.
Table 5 and
Table 6 summarise the general input data and the CSO structures in the simulation mode, respectively.
Table 7 presents the runtime results in minutes depending on the number of cores used. The row “speed-up factor” (SF) was calculated as the ratio between the maximum computation time and the current computation time. The maximum computation time was taken as that of the computation with just one core, that is, non-parallel computing. The minimum time is presented in bold font for each test. The results indicated speed-up factors of 12.2 (32 MC simulations), 22.0 (320 MC simulations), and 23.6 (3200 MC simulations). The highest speed-up factor (23.6) was obtained in scenario 3 (3200 MC simulations) using 32 cores. Although scenario 1 (32 MC simulations) had the lowest computation time, it also had the lowest speed-up factor (12.2).
This test was run with the model set up to simulate three sub-catchments at the same time in parallel mode. The results therefore suggest that parallelising over sub-catchments also speeds up the overall computation by similar factors.
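The speed-up factor defined above is a simple ratio, sketched below in Python. The runtimes in the example are hypothetical placeholders chosen only to reproduce a factor of 23.6; they are not values from Table 7. The parallel efficiency (speed-up divided by the number of cores) is a standard derived quantity added here for illustration and is not reported in this study.

```python
def speed_up_factor(time_single_core, time_parallel):
    """Ratio between the single-core runtime and the parallel runtime."""
    return time_single_core / time_parallel


def parallel_efficiency(speed_up, n_cores):
    """Fraction of the ideal linear speed-up that is achieved (1.0 = ideal)."""
    return speed_up / n_cores


# Hypothetical runtimes in minutes, for illustration only.
sf_example = speed_up_factor(118.0, 5.0)

# Reported speed-up for 3200 MC simulations on 32 cores.
efficiency_3200 = parallel_efficiency(23.6, 32)
```

Under these definitions, a speed-up of 23.6 on 32 cores corresponds to roughly 74% of the ideal linear speed-up.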
5. Conclusions
We show, using a case study, that adequate simulation of CSO spill volume as well as COD and NH₄ loads and concentrations is possible using a scalable surrogate model. Compared with a CMM, EmiStatR requires less input data, provides automatic calibration procedures, and can present outputs in a way that is accessible to practitioners. Another advantage is the large body of R functionalities available to tools such as EmiStatR, for example, compatibility with input and output data formats for temporal and geospatial data and advanced calibration techniques such as DREAM.
We show that EmiStatR provides a satisfactory representation of CSO spill volume and COD and NH₄ loads, which confirms that white box simplification can lead to well-performing surrogate models. Moreover, its inherent parallel computation and scalable capabilities allow fast calculations for scenarios of high complexity and for long-term simulations to test hypotheses in urban drainage modelling.
We compared the results of EmiStatR with those obtained using a well-known CMM. The behaviour of the volume in the CSOC and the estimated loads of COD and NH₄ were very similar. Our case study showed that this small catchment (i.e., area of ≤30 ha) could be modelled with EmiStatR with satisfactory accuracy compared to models of much higher complexity. Future usage will show how EmiStatR performs in other case studies. Because the basis of EmiStatR is formed by generic equations, it is expected that the performance will be similar.
For future work, it would be of interest to the scientific and practitioner communities to take the spatial distribution of some of the input variables, such as precipitation, impervious areas, and land use, into account. The literature shows that spatial variation in precipitation is not considered in many commonly used models [
4,
16]. Usually, precipitation is assumed to be uniformly distributed in a sub-catchment. This is not a very realistic assumption, particularly in applications for which the response time is short. The integration of geostatistical probability models that interpolate and simulate precipitation data in space and time would be an important advancement in urban drainage modelling.
It should be emphasised that integrated urban drainage modelling often lacks uncertainty propagation tools that assist in quantifying the spatial and temporal (correlated) distributions [
68]. It also lacks tools for sensitivity analysis to apportion contributions of the different sources of uncertainty to the overall model output uncertainty. Therefore, future work should address these topics and include an economic analysis, also taking the potential failure of CSO infrastructures into account. Such analyses benefit from fast and scalable implementations such as EmiStatR.