1. Introduction
Groundwater is an essential source of freshwater for much of the world’s population. In many places, however, available groundwater resources are under stress due to increasing anthropogenic influences and demand, as well as a changing climate. The appropriate monitoring and modelling of these groundwater systems are critical to enabling management decisions that will lead to a sustainable future.
Classical groundwater modelling approaches use mathematical models consisting of complex systems of differential equations to represent the physical processes known to contribute to groundwater levels. However, these models require substantial assumptions and are typically subject to considerable uncertainty. In particular, accurately characterizing the hydrogeological properties of an area with a physically based model requires extensive expert hydrogeological knowledge and many assumptions about the nature of underground structures and the mechanisms involved in groundwater recharge. Groundwater systems are often complex, with water levels depending on many static and time-varying influences, including long- and short-term climate conditions, vegetation, land use, soil permeability, hydraulic conductivities, subsurface geological structures, aquifer size and connectivity, extraction patterns, recharge from local rivers and lakes, overland flooding, and irrigation activities. Gathering relevant information on each of these variables is usually difficult, time-consuming, and expensive. Building models for areas where this information is not available requires the incorporation of many assumptions. Even in areas where subsurface information is available, the climate, soil, and vegetation characteristics (e.g., transpiration rate, cover, root systems, etc.) continually evolve over time, contributing temporal changes to groundwater recharge mechanisms and rendering these systems even more difficult to understand and represent through explicitly defined mathematical relationships. In general, the more realistic a physically based model is, the more data and assumptions will be required. See [1] for a recent textbook on the topic.
To obviate the need for extensive knowledge about subsurface systems, groundwater modellers are increasingly turning to data-driven approaches that use statistical modelling and machine learning to make predictions. In recent years, these data-driven approaches have gravitated towards the use of neural networks and deep learning algorithms (see [
2,
3,
4]). The growing popularity of neural networks in hydrogeological prediction is due to their ability to extract features, represent relationships, and make predictions on complex systems without requiring detailed knowledge about the physical bases of the underlying system. This in turn allows patterns to emerge without requiring strong assumptions about the unknown factors and linkages [
5,
6,
7,
8]. Depending on data availability, these models also have the potential to incorporate the impact of anthropogenic influences on the hydrogeologic system without explicitly quantifying the relationships in advance. Keeping up with the increasing capacity to collect hydrologic data, deep learning systems provide a means of efficiently processing these large data sets [
6]. Big data techniques (e.g., incorporating global climate models, remote sensing, citizen science, etc.) have been shown to benefit sustainable groundwater management by overcoming a lack of relevant data at the local scale [
9]. There is increasing awareness of the amount of information that can be extracted with data-driven models, and acknowledgment that our ability to make predictions from data is improving at a greater rate than our ability to make predictions from hydrologic theory [
8]. The rapid expansion of neural network methodology from the machine learning field is continually adding new ideas to existing groundwater modelling possibilities.
In a recent project sponsored by the Department of Planning, Industry and Environment (DPIE) in New South Wales (NSW), Australia, we compared machine learning approaches based on neural network models with classic time series methods to model the level of groundwater in an aquifer in the Richmond River catchment, in the northeast part of the state [
10]. We found that, while each approach has some unique advantages and disadvantages, both did a remarkably good job at capturing changing patterns over time. The results of this project are depicted in
Figure 1, showing groundwater levels (dark blue), along with rainfall levels (light blue), evaporation (green) and predictions (orange), based on classic autoregressive integrated moving average (ARIMA) modelling (top panel) and a neural network model (bottom panel). Neither approach gives a perfect prediction, but both do very well in terms of capturing the key features of how the aquifer levels changed over time. See [
10] for more detail.
In a follow-up project for DPIE, we set out to apply the same analyses to data from a different catchment area, the Namoi River in the north-west part of the state. However, the results were not nearly as good. It quickly became apparent that, in contrast to the Richmond catchment, the Namoi catchment is significantly more difficult to model: it is an area of relatively low rainfall combined with high groundwater extractions to meet the demands of intensive agriculture and mining industries. It was clear that effective modelling needed to draw on a more extensive range of potential predictors, including data related to extractions, as well as river flow rates (which also serve as indirect indicators of dam releases during dry periods). Additionally, the data sets from each monitoring bore were relatively small in terms of the number of observations and exhibited large proportions of missing data, impeding the application of individual neural network time series models to each well. We decided that it would make sense to work with a larger, richer dataset that combined data from multiple aquifers in the catchment, sharing information across the region rather than simply analysing on a bore-by-bore basis, as had been successful for the Richmond River catchment. The analysis was complicated by high levels of spatiotemporal variability between the individual groundwater time series measured in the region. The present paper is a detailed case study describing our efforts to undertake this analysis based on multiple, inter-related time series corresponding to 165 groundwater monitoring bores in the Namoi River catchment.
There is currently growing interest from the machine learning community around the use of global, rather than local, models for time series analysis [
11,
12,
13]. Local (or individual) time series prediction models are trained on a single time series, usually under the assumption that each time series results from a unique data-generating process. Global models, on the other hand, are trained on data from multiple time series simultaneously; the data are aggregated into a single data set and a single model is produced. Another variation, known as partitioned models, falls between local and global, working on subsets of related time series. Global or partitioned models by nature have access to larger sample sizes than local models for the same data, and it is well known that machine learning models work best when there is a large volume of input data. While individual models representing a complex system with relatively few data points may tend to overfit, global models have the advantage of borrowing strength from a larger pool of data and are therefore less likely to overfit, potentially leading to better generalisation to new data.
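To make the distinction concrete, the minimal sketch below (in Python, with a plain linear regression standing in for the neural network models considered later) contrasts the three strategies. The long-format table, file name, and column names are hypothetical placeholders, not the data structures used in this study.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical long-format table: one row per (bore, date) observation,
# with placeholder columns 'rain', 'evap', 'level', 'bore_id' and 'cluster'.
df = pd.read_csv("groundwater_long_format.csv")
features, target = ["rain", "evap"], "level"

# Local strategy: one model per bore, trained only on that bore's own data.
local_models = {
    bore: LinearRegression().fit(g[features], g[target])
    for bore, g in df.groupby("bore_id")
}

# Partitioned strategy: one model per cluster of similar bores
# ('cluster' is assumed to come from some grouping step, e.g. a SOM).
partitioned_models = {
    cl: LinearRegression().fit(g[features], g[target])
    for cl, g in df.groupby("cluster")
}

# Global strategy: a single model trained on the pooled data from all bores,
# with bore identity one-hot encoded as an additional feature.
pooled = pd.get_dummies(df[features + ["bore_id", target]], columns=["bore_id"])
global_model = LinearRegression().fit(pooled.drop(columns=[target]), pooled[target])
```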
It has been shown [
11] that global models do indeed generalise better than individual models to data that have not been seen during the training process, and that ‘long memory patterns and related effects’ that would require manual introduction into local models are better able to be learned automatically by global models. Moreover, Ref. [
12] found that an LSTM model (long short-term memory [
14]) trained over multiple time series performed better than univariate methods if the time series are similar. The authors discuss the capabilities of neural networks to operate as universal function approximators that ‘make them ideal for exploiting information across many time series’. They used a global LSTM to make predictions on time series that had been clustered based on the similarity of their features, and determined that grouping the time series by similarity increased prediction capabilities. Ref. [
13] concluded that the combination of recurrent neural networks such as LSTM with the leveraging of cross-series information through global models led to benefits in forecasting accuracy. Ref. [
15] proposed a global time series model, implemented with publicly available software called DeepAR, that incorporated the LSTM structure along with the inclusion of lagged outcomes as additional predictors to produce probabilistic forecasts on multiple time series at once.
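For readers unfamiliar with DeepAR, the hedged sketch below shows how a single global model over many series might be set up using the open-source GluonTS implementation of DeepAR; the placeholder data, monthly frequency, and hyperparameters are illustrative assumptions only (and the exact API varies between GluonTS versions), not the configuration used later in this paper.

```python
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# Placeholder data: five series of monthly levels plus two dynamic covariates
# (e.g. rainfall and evapotranspiration). Covariates extend prediction_length
# steps past the target so that forecasts can be conditioned on them.
rng = np.random.default_rng(0)
level_series = [rng.normal(10.0, 1.0, size=240) for _ in range(5)]
covariate_series = [rng.normal(size=(2, 252)) for _ in range(5)]

train_ds = ListDataset(
    [
        {
            "target": levels,             # groundwater levels for one bore
            "start": "1974-01",           # timestamp of the first observation
            "feat_dynamic_real": covs,    # exogenous covariate series
        }
        for levels, covs in zip(level_series, covariate_series)
    ],
    freq="M",
)

estimator = DeepAREstimator(
    freq="M",
    prediction_length=12,           # forecast horizon (months)
    use_feat_dynamic_real=True,     # include the exogenous covariates
    trainer=Trainer(epochs=50),
)
predictor = estimator.train(training_data=train_ds)  # one model fitted to all series
forecasts = list(predictor.predict(train_ds))        # probabilistic (sample) forecasts
```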
In a hydrological context, Ref. [
8] describes how traditional models perform best when calibrated on individual basins, but deep learning works best when trained on multiple catchments, as shown by [
16]. In that study, the authors train a single LSTM on runoff time series from over 500 basins and produce better predictions than those obtained with various benchmark hydrological models calibrated individually by catchment. Ref. [
17] demonstrated the power of a global model to forecast groundwater levels across the state of Victoria in Australia. Ref. [
5] reviews machine learning applications in hydrology to date, finding that most are studies of small datasets at individual sites that are not transferable to other locations. The authors suggest that the future lies in multitask learning, where machine learning tasks have several target variables.
The main purpose of this paper is to present a case study comparing and evaluating the use of local, global, and partitioned time series models on the time series from the 165 wells of the Namoi River catchment. Because the wells are located in the same region, the time series have been generated by similar, though not identical, data-generating processes, suggesting that it may be useful to share some information across the system. Climatic conditions of rainfall and evapotranspiration are closely related for all wells, but subsurface conditions affecting recharge rates, such as soil permeability, aquifer depth, and hydraulic connectivity, will affect each time series differently. We explore whether these system differences mean that a separate model should be built for each time series, or whether combining the time series is beneficial for prediction performance; if so, should they be subsetted in a meaningful way (i.e., partitioned based on the similarities of their temporal patterns) or simply all combined together? The benefits of modelling this set of related time series with the possible approaches (individually, partitioned, or agglomerated into a single global model) are investigated and quantified here.
To the best of our knowledge, this study represents the first application of the DeepAR technology in the context of hydrogeology. We also provide a general discussion of the benefits and limitations of applying various contemporary machine learning multiple-time-series methods to real-world environmental monitoring data. A key insight is that, while machine learning strategies such as DeepAR can do well in terms of short-term predictions, long-term predictions require models where key drivers have been identified, measured well, and incorporated into the modelling process.
Section 2 provides more detail about the Namoi River catchment and the data available for our analysis. An overview of the various methods that we have considered is given in
Section 3.
Section 4 describes the results of applying these methods to the Namoi data. In
Section 5 and
Section 6, the results are discussed, and some final conclusions are drawn, including some discussion about potentially useful extensions of the modelling framework.
2. Data and Study Area
In this paper, we focus on the analysis of 165 groundwater level time series from 70 different monitoring locations across the Namoi River catchment in northern NSW, located just to the west of the Great Dividing Range. Multiple monitoring bores, or wells, are often established at the same location, in order to allow access to aquifers at different depths, hence the greater number of time series than sites. The locations of the monitoring sites and environmental monitoring locations are shown in
Figure 2. The study period is 1 January 1974 to 31 December 2018.
Groundwater level monitoring data have been provided by the water division of the NSW Department of Planning, Industry and Environment (DPIE). These data are also publicly available for download from a website maintained by WaterNSW [
19], a state-owned corporation established under 2014 legislation to manage and oversee NSW water resources. WaterNSW owns and operates the largest surface and groundwater monitoring system in the southern hemisphere. The recorded groundwater measurements show high variability from well to well, and many of the time series contain strong temporal patterns, as can be seen in the sample of four sets of measurements shown in
Figure 3. The time series are of differing lengths due to different dates of station commissioning and/or decommissioning. They are characterised by sporadic measurement frequencies and high levels of missing data. At the beginning of the records, the measurements were made manually on a 2–3-month rotation. In recent years, a few stations have had automatic telemetry equipment installed and are recording regular daily measurements. There was a total of 11 bores in this study with automatic telemeters installed.
Figure 4 gives an indication of the measurement frequency at each bore and gaps in the data set.
Rainfall and evapotranspiration measurements were obtained at a daily resolution from the SILO database [
20], constructed and maintained by the Australian Bureau of Meteorology (BOM) and comprising observed values, along with infilled missing values. The specific SILO evapotranspiration variable used is ‘Penman-Monteith reference evapotranspiration (FAO56)’. Rainfall and evapotranspiration data from five climate stations across the Namoi catchment are used in this study: Walgett Council Depot (station 52026), Narrabri West Post Office (station 53030), Gunnedah Resource Centre (station 55024), Tamworth (station 55054), and Quirindi Post Office (station 55049). The rainfall data vary greatly across the region, while evapotranspiration follows a similar pattern for these stations.
River discharge data were downloaded directly from the WaterNSW website at daily resolution for the following stations: Goangra (station 419026, upstream of Walgett), Mollee (station 419039, downstream of Mollee weir between Narrabri and Wee Waa), downstream of Keepit Dam (station 419007), the Peel River at Carroll Gap (station 419006), and the Mooki River at Breeza (station 419027). Patterns of measured streamflow are complicated in this region, influenced by a combination of natural phenomena and human interventions. Dam outflow rates may be increased during periods of low rainfall due to intentional dam releases, leading to elevated downstream flow. Of course, streamflow also fluctuates naturally during periods of high and low rainfall.
Extraction (groundwater pumping) data were provided by DPIE in the form of annual extracted volumes (ML/year) at locations specified by latitude and longitude. These records begin in 1967 for some of the wells and in 1985 for others. Due to the lack of recorded data, in this study we assume that no pumping occurred at wells with no records before 1985. In actuality, the limited reliability of the recorded extraction data is a source of uncertainty that may affect the reliability of the fitted models, and we return to this issue in the discussion section. To integrate this annual lump-sum data with the daily environmental measurements, extractions have been set at a constant value throughout each year. As discussed below, the inclusion of a day- or month-within-year variable in our models gives the neural networks the flexibility to create interactions that help explain annual fluctuations.
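As an illustration of this step, the short sketch below spreads a hypothetical annual extraction record evenly over the days of each year; the table layout, column names, and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical annual extraction records: one row per (bore, year), volume in ML/year.
annual = pd.DataFrame({
    "bore_id": ["EX001", "EX001"],
    "year": [2004, 2005],
    "volume_ml": [365.0, 730.0],
})

daily_frames = []
for _, rec in annual.iterrows():
    days = pd.date_range(f"{rec.year}-01-01", f"{rec.year}-12-31", freq="D")
    daily_frames.append(pd.DataFrame({
        "bore_id": rec.bore_id,
        "date": days,
        # spread the annual lump sum evenly across the days of that year
        "extraction_ml": rec.volume_ml / len(days),
    }))
daily_extractions = pd.concat(daily_frames, ignore_index=True)
```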
Figure 5 shows water level data from one groundwater monitoring well, GW030344_2, with some of the predictor time series that will be used to model the water levels. Only data from a single gauge of each of the predictors are shown, though in the models, there are multiple inputs of each type of predictor (i.e., many extraction bores). Note that this is one of the wells where an automatic telemeter had been installed around 2010.
The study region forms part of the Murray–Darling basin and is a highly productive agricultural area sustained by large volumes of groundwater extractions. As the subsurface characteristics of this area are complicated and not yet fully defined, classical, process-based hydrogeological modelling is difficult to apply. It is known that the groundwater system consists of multiple layers of aquifers, with unmeasured lateral through-flow and vertical leakage occurring between the shallow and deep aquifers. The surface water and groundwater systems are closely connected, meaning that groundwater depletions due to extractions may be masked by incoming surface water. Large amounts of water extracted from deep aquifers end up finding their way into shallow aquifers via irrigation. Surface waterways are substantially regulated by dams and weirs, weakening the direct link between rainfall events and groundwater recharge. This relationship is complicated further in that periods of low rainfall can drive high extractions that in turn deplete groundwater, and yet low rainfall can also trigger dam releases that recharge aquifers through the streambeds. The amount of rainfall varies greatly between the east and west of the region, and the response of groundwater levels to precipitation can vary between areas even within the same aquifer, due to differences in soil permeability. Geological fault lines further disrupt hydraulic connectivity in the subsurface. The complex hydrogeology of the region, along with the extensive monitoring network in place, makes it very natural to consider empirically based approaches.
5. Discussion
We have explored the benefits of using global models to incorporate information from multiple time series, rather than analysing data from single wells one at a time. This global modelling strategy has a number of advantages: the input datasets are larger, which generally improves the performance of machine learning approaches by providing more training examples, and, by exploiting common patterns across the different wells or subsets of wells, relevant patterns can be captured in a more efficient and parsimonious manner. In terms of prediction accuracy, global time series neural network approaches have shown impressively good properties in practice. Results from the recent M4 and M5 time series prediction competitions [33], run on tens of thousands of time series, showed overall domination by RNN-based deep learning global modelling strategies.
In this study of groundwater monitoring data, both the local models and the global models did well in terms of the overall prediction RMSE. When it came to predictions for individual wells, we found that, in some cases, the global models outperformed the local models, and in other cases, the reverse was true. While both the local and the global models were moderately successful in explaining the strong annual patterns observed for some of the wells, neither was completely ideal for all of the wells. This may be a result of the differing responses to the predictor variables or the differing number of observations at the various wells. The limitations of the available data on extractions may also be a major contributing factor.
When it came to forecasting, the methods struggled as we tried to predict further ahead. In particular, while results could be considered satisfactory in terms of RMSE when predicting over the relatively short term (<5 years), they deteriorated substantially when trying to predict over a longer period of 10 years. This phenomenon is quite typical for time series prediction, which works by exploiting the autocorrelation structure inherent in the data to predict the future. For reliable long-term prediction, the best strategy is unquestionably to make sure that the analysis has access to the right predictors or features that can explain the observed patterns.
Although there are limitations in terms of how reliably the data can be projected into the far future, the results demonstrate the usefulness of exploring “what if” scenarios, such as setting extractions to zero. These explorations could be expanded to see how the predictions might look under other scenarios, such as multiple successive years of very low or very high rainfall.
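A “what if” run of this kind requires little more than zeroing out the relevant covariates and re-predicting, as in the hedged sketch below; the model object, feature table, and column names are hypothetical.

```python
# Hypothetical what-if comparison: `model` is any fitted regressor (e.g. one of
# the global models) and `X` is its feature table, with the extraction-related
# columns listed in `extraction_cols`.
def what_if_no_pumping(model, X, extraction_cols):
    baseline = model.predict(X)
    scenario_X = X.copy()
    scenario_X[extraction_cols] = 0.0        # counterfactual: no extractions anywhere
    scenario = model.predict(scenario_X)
    return scenario - baseline               # predicted change in groundwater levels
```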
In terms of identifying a best modelling strategy for this set of environmental monitoring time series, no single approach was found to be uniformly best across all time series. The choice is complicated by the fact that differences between the various methods can be subtle, and there are many trade-offs in terms of prediction outcome and ease of working with the data at hand. The best strategy also depends strongly on the context; the best strategy for short-term prediction is likely to be quite different from the best strategy for long-term prediction. There are also differences in how successfully the various approaches can be adapted to handle limitations in the available data. Whilst it is straightforward to adapt the MLP-type models to handle time series with missing data points or time series measured at sporadic timepoints, it is difficult to do this for the LSTM model. There are also some technical differences between the various methods in terms of the software available for implementation.
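The reason the MLP-type models cope with sporadic records is that each measurement can be turned into a free-standing row of covariate and calendar features, so the spacing between measurement dates is irrelevant. The sketch below illustrates this idea with hypothetical column names; it is not the feature set actually used in the study.

```python
import numpy as np
import pandas as pd

# Each groundwater measurement becomes one independent row: accumulated climate
# covariates plus calendar features derived from the measurement date itself.
# The input table `obs` and its column names are hypothetical placeholders.
def make_design_matrix(obs: pd.DataFrame) -> pd.DataFrame:
    X = obs[["rain_30d", "evap_30d", "extraction_year"]].copy()
    doy = obs["date"].dt.dayofyear
    X["sin_doy"] = np.sin(2 * np.pi * doy / 365.25)         # within-year seasonality
    X["cos_doy"] = np.cos(2 * np.pi * doy / 365.25)
    X["trend"] = (obs["date"] - obs["date"].min()).dt.days  # long-term time term
    return X

# An LSTM, by contrast, expects a regularly spaced input sequence, which is why
# the sporadic records had to be aggregated before LSTM models could be applied.
```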
This particular analysis was challenged by the variability in the frequency of groundwater level data collection. Although data collection goes back to the early 1970s for many wells, the earlier measurements were based on manual collection every 6 weeks or so. Starting in the early 2000s, some of the wells were fitted with automatic telemeters that provided continuous monitoring. As a result of this variability in measurement frequency and timing, it was virtually impossible to fit LSTM-type models to the daily data. This can be circumvented by aggregating to monthly data; however, this aggregation potentially results in some loss of predictive power. Based on our local MLP analyses of individual wells, we found that RMSE was indeed better when daily data were used. Another strategy may be to employ approaches to infill the missing data, but we have not done this, as this study was aimed at determining what strategies could be used with the raw, sporadic data, as measured.
Strong temporal patterns were found for many of the wells, as seen with the MLP modelling on the individual telemetered wells and with the SOM clustering analysis. As discussed earlier, we were able to develop models that fit the observed data by including appropriate time terms in our models, along with the observed climate and extraction predictors. However, while such models may explain observed data very well, they cannot be reliably used to predict far outside the range of observable data. Similarly, we found that models incorporating appropriate autoregressive terms could do a good job in terms of short-term predictions. For long-term prediction, however, it is critical to develop models that have a rich enough set of reliable predictors to explain the time trends. The importance of exogenous variables for improving time series forecasting accuracy was also listed as one of the main takeaways from the M5 competition [
33].
To further explore the importance, for long-term prediction, of ensuring that the right predictors are captured, a small computer simulation was conducted. Specifically, we generated a time series containing 2000 datapoints. The first 200 points were used for model fitting, and the models were then evaluated on how well they could predict (a) the next 200 points and (b) the final 200 points. Scenario (a) can be considered an example of short-term prediction, whilst scenario (b) represents long-term prediction. The data-generating model included a predictor that mimicked a rain variable, as well as some seasonal patterns and a long-term trend. The fitted models were: a linear regression model (ordinary least squares, or OLS) that included the correct predictors; a linear model that did not include all of the correct predictors but included a lagged outcome variable; and, finally, a classic time series model (ARIMA), which could exploit the autocorrelation structure in the data.
Figure 18 provides the results, with the top panel showing the data used for model fitting (blue dots), along with the fitted values from the three models. The black line corresponding to the fitted values from the linear model is virtually identical to the fit from the linear model with lags (green line) and the fit from the ARIMA model (red line). The same can be seen in the middle panel, which shows the predictions for the short-term scenario (predicting the next 200 timepoints). However, the bottom panel, showing the long-term prediction results, reveals that, while the linear model still does an excellent job, the two “time series” models (the linear model with lags and ARIMA) both do very poorly. This is a fairly simplistic exercise based on a simple simulation, but it underscores the reality that classic time series methodologies cannot be relied on for long-term predictions. A much better strategy is to make sure that all of the right features or predictors have been identified and, where possible, used to predict the future.
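The sketch below re-creates the spirit of this exercise in Python; the data-generating model, coefficients, and ARIMA order are illustrative assumptions rather than the exact settings behind Figure 18.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.arima.model import ARIMA

# Illustrative simulation: a rain-like predictor plus seasonality and a trend.
rng = np.random.default_rng(0)
n = 2000
t = np.arange(n)
rain = rng.gamma(2.0, 1.0, size=n)
season = np.sin(2 * np.pi * t / 365.0)
y = 2.0 * rain + 1.5 * season + 0.005 * t + rng.normal(0.0, 0.5, size=n)

train, short, long_ = slice(0, 200), slice(200, 400), slice(1800, 2000)

# 1. OLS with the correct predictors (rain, season, linear trend).
X = np.column_stack([rain, season, t])
ols = LinearRegression().fit(X[train], y[train])

# 2. Linear model missing the trend and season, but including the lagged outcome;
#    future values must be obtained recursively from its own predictions.
lag_lm = LinearRegression().fit(np.column_stack([rain[1:200], y[0:199]]), y[1:200])

def lag_forecast(horizon, last_y=y[199], start=200):
    preds, prev = [], last_y
    for h in range(horizon):
        prev = lag_lm.predict([[rain[start + h], prev]])[0]
        preds.append(prev)
    return np.array(preds)

# 3. Classic ARIMA fitted to the outcome series alone.
arima_fc = ARIMA(y[train], order=(2, 1, 2)).fit().forecast(steps=1800)  # t = 200..1999
lag_fc = lag_forecast(1800)                                             # t = 200..1999

rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
for label, idx in [("short-term", short), ("long-term", long_)]:
    rel = slice(idx.start - 200, idx.stop - 200)
    print(label,
          "OLS:", round(rmse(ols.predict(X[idx]), y[idx]), 3),
          "lagged LM:", round(rmse(lag_fc[rel], y[idx]), 3),
          "ARIMA:", round(rmse(arima_fc[rel], y[idx]), 3))
```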
While it is beyond the scope of this paper to discuss in more detail, it is interesting to note that some time series analysis programs provide error bands around their predictions, reflecting the uncertainty in predictions of the future (see, for example, Figure 1, which includes such bands in the ARIMA results fitted to the Richmond River catchment). Typically, these prediction bands quickly become very wide, illustrating the phenomenon well. DeepAR is in fact one of the programs that provides prediction intervals as an option. Whilst not shown, we did generate some prediction bands for the DeepAR models and, as expected, they became very wide quite quickly. The other programs used here (MLPs and LSTMs via Keras in R) did not provide the option to generate prediction intervals. It is a well-known phenomenon that prediction intervals based on time series methodologies tend to become very wide as they predict further and further into the future. This is because time series predictions of the immediate, short-term future are able to exploit the autocorrelation structure inherent in the data. Longer term, however, it is only via reliable exogenous predictors that one is able to obtain an accurate prediction.
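As a simple illustration of how quickly such bands widen, the sketch below fits a classical ARIMA model (via statsmodels, not one of the packages used in this study) to a short synthetic series and compares the interval width at the first and last forecast steps.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative only: a random-walk-like synthetic series stands in for a water level.
rng = np.random.default_rng(1)
y = 10.0 + 0.1 * np.cumsum(rng.normal(size=300))

res = ARIMA(y, order=(1, 1, 0)).fit()
ci = res.get_forecast(steps=120).conf_int(alpha=0.05)   # 95% prediction bands
width = ci[:, 1] - ci[:, 0]
print(round(width[0], 2), round(width[-1], 2))          # bands widen sharply with horizon
```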
There are a number of analytical strategies that could potentially be explored to improve the predictive power of our models. Transfer learning, for example, involves pre-training global models on all of the data and then fine-tuning the models for a small number of epochs to obtain a better fit for each specific well. Ref. [34] applied such a strategy in the context of predicting runoff, and the fine-tuning step was found to improve runoff prediction over simply using the global model as-is. The authors also found that the process works even if the data used in the global model are quite inhomogeneous. Ref. [35] used transfer learning with models developed for a set of monitored lakes to make predictions at unmonitored lakes, and found it to be a powerful technique for transferring knowledge learnt in areas with sufficient data to areas with scarce or inadequate data.
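The following sketch illustrates the pre-train/fine-tune idea with a small feed-forward network in Python Keras (the study itself used Keras in R); the placeholder data, architecture, and training settings are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X_global, y_global = rng.normal(size=(5000, 8)), rng.normal(size=5000)  # placeholder pooled data
X_well, y_well = rng.normal(size=(60, 8)), rng.normal(size=60)          # placeholder single-well data

def build_mlp(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

# Pre-train a single global network on the pooled data from all wells.
global_model = build_mlp(X_global.shape[1])
global_model.fit(X_global, y_global, epochs=100, batch_size=256, verbose=0)

# Fine-tune a copy for a few epochs, at a lower learning rate, on one well's data.
well_model = tf.keras.models.clone_model(global_model)
well_model.set_weights(global_model.get_weights())
well_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
well_model.fit(X_well, y_well, epochs=5, batch_size=16, verbose=0)
```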
Our analysis had a catchment-specific focus, but the literature contains examples where similar kinds of statistical and machine learning methods have been used to model data from multiple catchments. Ref. [17], for example, developed models that applied to the entire state of Victoria, Australia. The authors did not have access to extractions data, but used remote sensing data collected through the GRACE satellite program [36]. The use of such remote sensing data in place of the annual extractions data may be a useful option for further work on our study, as would considering other data sources related to land use.
Exploratory analyses using some simple GAM models suggested that there were strong interaction effects between some of the variables and the time indicators. Such a phenomenon could well be expected, in the sense that some of the observed predictors may be effectively explaining the rate of decline or change in groundwater levels. The LSTM models are flexible enough to capture such effects. However, it is very difficult to tell whether or not the models are adequately explaining how levels change over time in a way that can be reliably projected into the future for forecasting. There are some analytical strategies that could potentially be explored to address this issue. One option is to apply some of the recently developed strategies related to explainable AI (see [
37]). Another worthwhile angle would be to explore state–space modelling ideas (see the classic text [38]). Such models have been developed and explored extensively in the time series literature and involve the assumption of a latent time trend in the data; however, they have limitations in terms of how predictors can be incorporated. An ideal situation may be to combine ideas from neural network modelling with state–space modelling, including the possibility that covariates can impact the rate of change of the process being modelled.
6. Conclusions
Local, partitioned, and global algorithms were investigated here for making predictions based on monitoring time series data for groundwater levels, and each was found to have benefits and drawbacks depending on the context and characteristics of the monitoring data.
The local (individual) MLP models were able to incorporate the sporadic datasets and provide quick and easy predictions, though it was necessary to add temporal features manually. These individual models were restricted to the few groundwater wells that had automatic telemeters installed, as single wells with only manual measurements do not have enough data points for the MLPs. The results for the telemetered wells showed relatively good predictive accuracy with the local MLPs.
The partitioned models, which used the SOM algorithm to partition the data sets into clusters that could each be modelled with an LSTM, were also able to provide acceptable predictions. The benefit of these partitioned models is that they are able to provide predictions for the entire set of time series, even those with very few measurements. A drawback of this method is that the same LSTM prediction is given for a number of individual wells, which may result in a loss of accuracy in the prediction of groundwater levels at telemetered wells for which the individual models worked well.
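As an illustration of the partitioning step, the sketch below uses the MiniSom library (one of several available SOM implementations; the package and the per-bore feature matrix are assumptions, not the exact setup used in this study) to assign each bore to a SOM node, after which one LSTM would be trained on the pooled series of each resulting cluster.

```python
import numpy as np
from minisom import MiniSom

# Placeholder feature matrix: one row of summary features per monitoring bore
# (e.g. standardised descriptors of each bore's temporal pattern).
rng = np.random.default_rng(0)
bore_features = rng.normal(size=(165, 6))

som = MiniSom(3, 3, bore_features.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(bore_features, num_iteration=1000)

# Bores mapped to the same SOM node form one cluster; a separate LSTM is then
# fitted to the combined time series of each cluster.
cluster_of_bore = [som.winner(f) for f in bore_features]
```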
The global models were a good solution for increasing the size of the data set, for sharing information catchment-wide, and again for providing predictions at wells with few data points. The DeepAR algorithm offers a powerful modelling framework for fitting a global LSTM model to the time series from all of the wells. The framework has a number of advantages, including the ability to create needed interactions and nonlinearities; however, a primary emphasis of DeepAR is the incorporation of autoregressive terms (this is the ‘AR’ part of DeepAR). We have discussed how these are very helpful for predictions in the short term, but not for long-term predictions.
Overall, it was determined that, whilst these methods are able to do a satisfactory job of modelling groundwater levels with the use of appropriate covariates, it is nonetheless not straightforward to use them for prediction of the future. Our analyses of the Namoi catchment data suggest that, even though data related to climate, extractions, and streamflow/dam releases can explain a relatively high proportion of the observed trends in regional groundwater levels, some residual temporal effects remain even after these variables have been accounted for. Model fit can be improved through the inclusion of time effects in the modelling process, and the inclusion of autoregressive terms can also help to “soak up” temporal effects. However, for the purpose of projecting further into the future, it is important to have models that can explain temporal effects through accurately measured features. Additionally, it follows that the predictions of these features into the future (e.g., future expected rainfall or future extractions) must also be accurate if they are to be relied on as predictors in the groundwater level models.
An interesting outcome of this study is the potential for the use of these methods for a relatively easy analysis of ‘what-if’ scenarios, as shown in
Figure 8, where the groundwater levels were estimated for a fictitious scenario in which no pumping had occurred in the catchment. The ability to easily add and remove potential predictors from an analysis and compare the predictions is a strong drawcard for these methods.
There are a number of possible directions for future work. As discussed above, improved strategies for reliable measurement of extractions would likely improve model fits. It is intriguing to consider the possibility of using other data sources such as satellite data to capture land use and changes in water storage. We also believe that there are further analytical directions that could be explored. In particular, we recommend exploring the use of state–space modelling combined with a deep learning framework. This would require the development of new statistical methods, as well as software engineering. It would also be useful to develop a modelling framework that incorporated a global LSTM structure like DeepAR, but which allowed the option to turn off the autoregressive component, or allowed for the number of AR terms to be delinked from the “lookback” component of DeepAR. Such a modified version of DeepAR would allow for analyses that focus on explaining the observed trends in the data, without simply “soaking up” the effects with AR terms.