Probabilistic Prediction of Satellite-Derived Water Quality for a Drinking Water Reservoir

Bertone, Edoardo; Peters Hughes, Sara

doi:10.3390/su151411302

by Edoardo Bertone ¹,²,³,* and Sara Peters Hughes ⁴

¹ School of Engineering and Built Environment, Griffith University, Southport, QLD 4222, Australia
² Australian Rivers Institute, Griffith University, 170 Kessels Road, Nathan, QLD 4111, Australia
³ Cities Research Institute, Griffith University, Edmund Rice Drive, Southport, QLD 4222, Australia
⁴ Seqwater, 117 Brisbane Street, Ipswich, QLD 4305, Australia
* Author to whom correspondence should be addressed.

Sustainability 2023, 15(14), 11302; https://doi.org/10.3390/su151411302

Submission received: 25 May 2023 / Revised: 18 July 2023 / Accepted: 18 July 2023 / Published: 20 July 2023

(This article belongs to the Special Issue Monitoring, Modeling, and Automation of Water and Wastewater Processes)

Abstract

A Bayesian network-based modelling framework was proposed to predict the probability of exceeding critical thresholds for chlorophyll-a and turbidity in an Australian subtropical drinking water reservoir, based on Sentinel-2 data and prior knowledge. The model was trained with quasi-synchronous historical in situ and satellite data for 2018–2023 and achieved satisfactory accuracy (Brier score < 0.27 for all models) despite the limited number of poor water quality events in the final dataset. The graphical output of the model (posterior probability maps of high turbidity or chlorophyll-a) provides an effective means for the user to evaluate both the prediction and the uncertainty behind it in a single map. This avoids loss of trust in the model and can trigger spatially targeted data collection in order to reduce uncertainty. Future work will focus on refining the modelling methodology and its automation, as well as including other data such as in situ high-frequency sensors.

Keywords: Bayesian networks; remote sensing; water quality; water resources management

1. Introduction

Remote sensing, in particular through satellite observations, offers opportunities to monitor crucial water quality parameters in drinking water reservoirs at much higher spatial and temporal resolutions than those typically feasible for the relevant water authority. A paradigm shift in recent decades has led to extensive satellite datasets being made freely available to the research community, and with the advent of newer satellites with increased spatial resolution (e.g., 10 m or 20 m for most bands of Sentinel-2) and more appropriate spectral bands, an increasing amount of research has focused on smaller inland waters such as large rivers and medium to large lakes and reservoirs [1]. Several studies, taking advantage of such datasets and, in some cases, of machine learning approaches, have focused on monitoring/predicting optically active constituents such as chlorophyll-a (chl-a) [2,3,4,5,6], total suspended/dissolved solids [3,7,8] and coloured dissolved organic matter (cDOM) [9], and, through indirect approaches, other parameters such as nutrients [10,11], dissolved carbon dioxide [12] and other carbon fractions [7,13]. Reviews are available [1] with comprehensive breakdowns of target application and modelling approach. As a result, these water quality parameters can be estimated regularly (up to every 5 days with, e.g., Sentinel-2 data, and even more frequently with commercial sensors), everywhere in the water body (one estimate every 100–400 m²), and, if relying on open-access datasets, virtually for free.

Despite such indisputable benefits and an immense amount of research in this space, satellite-derived water quality monitoring is not yet as widely adopted by water authorities around the world as might be expected. Aside from the resources and expertise required both initially to build the model and over time to continuously extract/process newly acquired satellite images, several errors and uncertainties can arise and propagate, from satellite data collection to model development [14]. While a number of studies have quantified and incorporated uncertainty in their estimations [15,16], the majority of research studies seem to assess model accuracy with traditional metrics (e.g., R², although applied in a non-standardised way, see [1]) in the model development/validation phase, but fail to incorporate the error/uncertainty in the model application stage (e.g., in the visual representation; see, e.g., [2,3,5]). As a consequence, the resulting maps provide only the specific predicted water quality value for each pixel, which could carry a relatively large error, especially for areas of the water body where fewer in situ data were available for training/validation. If predictions prove to be wrong a number of times during model deployment, the water authority can quickly lose trust in the model and revert to relying only on its previously established routine water quality monitoring. As such, it is important to clearly quantify and explain the uncertainty behind the predictions; this, however, is not necessarily straightforward, as it requires adding more information to already data-intensive water quality prediction maps.

Previous studies in this specific field have resorted to creating two separate maps: one displaying the predictions, and another displaying the related uncertainty in each prediction, for each spatial data point [16]. Although this is a comprehensive approach, it might not be a trivial task for the average operator to jointly interpret such a large amount of information on two separate, but connected, maps. Particularly for high-resolution applications, it would be very difficult to match the same pixel on both images in order to understand the uncertainty behind a pixel-specific prediction. Another two-map approach might involve showing predicted (at satellite resolution) and observed (at lower resolution, interpolated) values [11]; however, in addition to the issues mentioned above, this cannot be performed in the model deployment stage when, ideally, the field monitoring work would be greatly reduced as a result of using the satellite-based estimations. As such, new approaches are required that neither overwhelm the average end-user with too much information nor, by providing too little information on model uncertainty, cause them to lose trust in the model when it is wrong.

A simpler visualisation method, yet one that is effective in both representing uncertainty and providing useful information for the water utility, is to provide a map of the predicted probabilities of exceeding critical thresholds (e.g., [17]). We herein propose to achieve this under a Bayesian Network (BN) framework. BNs are user-friendly probabilistic graphical modelling tools, relying on the principles of Bayesian inference and Bayes’ theorem [18]. Typically, variables in BNs are discretised, which is appealing for such a threshold-based application. Probability distributions of inputs and outputs are inferred from the datasets and any other prior information (e.g., expert knowledge, outputs from other models) deemed important. The model then updates from prior to posterior distribution (through Bayes’ theorem) when new evidence is provided. The probability of exceeding the threshold of interest can be easily retrieved by discretising the variables accordingly. Applications of Bayesian probability theory in the field of satellite sensing of water quality exist [16], though linked to artificial neural networks and relying on large datasets. A specific application of BNs to satellite imagery also exists [19], though in relation to earthquakes and relying on relatively large in situ datasets from damage proxy maps. In addition to the aforementioned advantages, BNs can effectively handle smaller datasets too [20,21,22], which can be the case if only a small number of match-ups is available due to limited/unsynchronised in situ monitoring. As such, an attempt to use a BN approach for satellite-derived estimation of freshwater quality constituents with a limited in situ water quality dataset can be considered novel, especially in relation to the purpose of more clearly visualising the uncertainty behind such predictions.

In the research described herein, such an approach was applied to predict chl-a and turbidity in a relatively challenging case study of a drinking water reservoir in South-East Queensland, Australia. The challenges arose from low in situ chl-a and turbidity readings for the majority of the dataset; such low values have been shown to yield lower modelling accuracy due to issues such as a lower signal-to-noise ratio, although the Sentinel-2 Multispectral Instrument is able to handle this issue better [1]. This, however, provides an opportunity to more effectively highlight the need for a clear visualisation of uncertainty for models/areas yielding lower prediction accuracy.

2. Materials and Methods

2.1. Research Domain and Data Collection

Advancetown Lake, bounded by Hinze Dam, is the largest drinking water reservoir in the Gold Coast region (Queensland, Australia); following the latest “Stage 3” upgrade in 2011, its full capacity is 310,730 ML and it supplies, under normal operations, raw water for most of the Gold Coast population, which at the time of writing numbers well over half a million inhabitants.

The reservoir is relatively stable, with good water quality, and is surrounded by a pristine catchment. The main water treatment issues occur during winter circulation events, which bring anoxic, manganese-rich waters from the hypolimnion to the epilimnion, where the water is usually drawn from [23,24]. Despite this, in recent years a few algal blooms have been recorded; turbidity can also increase following wet weather events. Climate change and increased water demand might lead this reservoir to experience less regular winter turnover events and warmer summer waters [25], potentially leading to more frequent algal blooms.

The raw water is collected via two intake towers: the lower intake (HLI), near the dam wall, and the upper intake (HUI), from where it is pumped to a smaller drinking water treatment plant. At these locations, two vertical profiling systems (VPSs) autonomously collect water quality data for the entire water column every hour. Manual sample collection is also undertaken once a month by Seqwater (the bulk water supplier in the region) at (1) these two sites, (2) upstream locations at both inlets, and (3) a few recreational sites. Figure 1 shows (yellow dots) the location of the two monitoring sites near the intake towers, as well as (orange dots) three of the sites where data used for this work were collected.

Historical manual water quality data were collected for all such sites for the period 2016–2023. We focused on chl-a and turbidity data. VPS data, despite providing a high-frequency dataset, were kept out of the scope of this work because the deployed optical sensors, such as the chl-a sensor, are affected by significant interferences, such as non-photochemical quenching [26]; future work could focus on appropriate compensation of such sensors to expand the dataset.

Data from the Multispectral Instrument on board the twin Sentinel-2A and Sentinel-2B satellites were accessed at no charge from the European Space Agency’s Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home (accessed on 17 March 2023)). Level-2A (processed) images were downloaded. For Level-2A, the granules, also called tiles, are 110 × 110 km² ortho-images in UTM/WGS84 projection. Satellite data were available from December 2018. Images with cloud cover >9% were filtered out, with further manual omission of images having cloud presence above the in situ monitoring locations. Images were retained for analysis if the retrieval date was <2 days from the in situ sample collection date. Following data cleaning for both satellite and manual sampling results, the final combined dataset comprised 68 entries for chl-a and 77 for turbidity.
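
The two automatic filters described above (cloud cover and date gap) can be sketched as follows. This is a minimal illustration in Python, not the authors' workflow: the function name `find_matchups`, the tuple layout, and the example cloud-cover values and sampling dates are all hypothetical, and the manual check for cloud over the monitoring sites is not reproduced.

```python
from datetime import date

MAX_CLOUD_PERCENT = 9.0  # images with cloud cover above this are filtered out
MAX_DAY_GAP = 2          # retrieval must be fewer than this many days from sampling


def find_matchups(images, samples):
    """Pair in situ sampling dates with satellite images passing both filters.

    images  : list of (acquisition_date, cloud_cover_percent) tuples (hypothetical layout)
    samples : list of in situ sampling dates
    """
    matchups = []
    for sample_date in samples:
        for image_date, cloud in images:
            if cloud > MAX_CLOUD_PERCENT:
                continue  # too cloudy
            if abs((image_date - sample_date).days) < MAX_DAY_GAP:
                matchups.append((sample_date, image_date))
    return matchups


# Made-up example values, for illustration only
images = [(date(2022, 11, 25), 3.0), (date(2022, 11, 30), 40.0), (date(2023, 1, 4), 8.5)]
samples = [date(2022, 11, 24), date(2022, 11, 30), date(2023, 1, 10)]
print(find_matchups(images, samples))
```

In this made-up example only the first sample finds a match: the second coincides with an image that is too cloudy, and the third is six days away from the nearest clear image.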

2.2. Data Analysis and Model Development

Satellite files were processed in the R environment (RStudio 2022.12.0) using libraries including “shapefiles”, “raster”, “rgdal”, and “rasterVis”. A script was developed to first extract and plot images of the retrieved reflectance values for the individual bands (1 to 8), as well as the true colour image, for visual analysis. Retrieval included a transformation based on date-specific “quantification” and “offset” values. A separate script was developed to bring all bands to the same 10 × 10 m resolution by applying local bilinear interpolation. Subsequently, the band reflectance values at the coordinates of the in situ monitoring points were extracted and compiled in Microsoft Excel for analysis with the in situ data.
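
The two preprocessing steps named above can be illustrated with a short sketch. The authors worked in R; the Python below is only an assumed equivalent. The conversion `(DN + offset) / quantification` and the default values 10000 and -1000 follow the convention commonly documented for recent Sentinel-2 Level-2A products and are assumptions here, since the paper only states that date-specific values were used. The function names are hypothetical.

```python
import numpy as np


def to_reflectance(dn, quantification=10000.0, offset=-1000.0):
    """Convert digital numbers to reflectance (assumed convention, see lead-in)."""
    return (np.asarray(dn, dtype=float) + offset) / quantification


def bilinear_sample(grid, row, col):
    """Bilinearly interpolate a 2-D grid at a fractional (row, col) position."""
    grid = np.asarray(grid, dtype=float)
    r0, c0 = int(np.floor(row)), int(np.floor(col))
    r1 = min(r0 + 1, grid.shape[0] - 1)  # clamp at the grid edge
    c1 = min(c0 + 1, grid.shape[1] - 1)
    fr, fc = row - r0, col - c0          # fractional distances from the top-left cell
    top = (1 - fc) * grid[r0, c0] + fc * grid[r0, c1]
    bottom = (1 - fc) * grid[r1, c0] + fc * grid[r1, c1]
    return (1 - fr) * top + fr * bottom


coarse = np.array([[0.0, 2.0], [4.0, 6.0]])
print(bilinear_sample(coarse, 0.5, 0.5))  # value halfway between the four cells
print(to_reflectance(2000))
```

Resampling a 20 m band to 10 m amounts to evaluating `bilinear_sample` at the centres of the finer grid cells; dedicated raster libraries do this with proper georeferencing, which this sketch omits.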

In order to convert the reflectance signal coming from different satellite spectral bands to a water quality variable such as chl-a or turbidity, different types of modelling approaches can be applied; they can be broadly classified as physical or empirical [27], with the latter being by far the most common over the last 20 years in the freshwater quality monitoring field [1]. Physical models try to provide a general relationship based on known scientific/theoretical laws; their attempts to simplify complex systems unavoidably lead to assumptions, and the need of observational data for validation. Empirical models, on the other hand, are fully site-specific, and derive statistical relationships purely from the data, without any prior assumptions on how the target variable might be expected to affect reflectance at different bands. Empirical models range from simpler linear regressions to more complex machine learning models accounting, among others, for nonlinear features in the data [1]. There exists also an intermediate option, called “semi-empirical” [27], which involves, for instance, the calculation of previously developed (and tested) indices, with known physical relationships with the parameter of interest, as a potential predictor for the new site-specific application.

There have been several indices developed in this context, especially for chl-a estimation. Usually, this involves a combination/ratio among bands in the red spectrum and bands in the NIR spectrum [3,5], especially for moderate to high concentrations [28]; however, several different or more complex transformations of bands’ reflectance have been proposed in the past (see, e.g., Table 3 in [3] or more recently [29,30]). For turbidity, there is not a clear band or band ratio either [1] as the type of suspended particles and their reflectance will be site-specific and related to other water quality parameters (such as chl-a too); however, information about the red part of the spectrum is usually necessary (as summarised in [1] from other studies). From previous studies summarised in [10], the equivalent Sentinel-2 bands used for turbidity predictions would include B4 and B5.

For this application, a combined empirical (i.e., individual bands’ reflectance) and semi-empirical (i.e., new and previously developed indices) approach was implemented. The best predictors were selected based on either higher values for relevant accuracy metrics (e.g., R²) or visual interpretation (i.e., showing particular features appropriate for BN, such as clear threshold-specific behaviours from scatter plots with the target variables). Initially, such assessment was conducted for the individual band reflectance data inputs; different band ratios/combinations (selected based on previous studies and including variations of those) were then also assessed the same way. For each model, the two best-performing (from accuracy metrics and/or from visual assessment) inputs were then selected for the BN development. Due to the limited dataset, we opted for simpler models with a limited number of predictors. Figure 2 illustrates the entire methodological approach of this study.

The data for the identified best predictors as well as the target variable (i.e., chl-a and turbidity) were used to train a BN with the Netica software (version 5.22 × 64 Bit; Norsys Software Corp., Vancouver, BC, Canada). A BN is a graphical probabilistic model relying on the Bayes Theorem to generate posterior probabilities based on prior knowledge (e.g., historical data) and new provided evidence (e.g., a new satellite image); the structure, which is an acyclic graph, connects variables (called “nodes”) from “parent” to “child”. The numerical relationship between parent and child nodes is defined by conditional probability tables, which can be derived from, e.g., empirical data, expert opinion, outputs of other models, or a combination of these. To enable the creation and computation of conditional probability tables in Netica (and in general in BNs), nodes have to be discretised into a number of “states” (i.e., intervals). More information on properties, benefits, and limitations of BN are available elsewhere [18,22].

Selected best input variables were discretised according to critical thresholds identified through data analysis, while target variables’ thresholds in this case were selected based on their practical application. For instance, a threshold of 8 mg/m³ for chl-a was selected as being a well-established [31] threshold for classification between oligotrophic lakes (<8 mg/m³) and meso- to eutrophic lakes (>8 mg/m³). For turbidity, a high threshold of 40 NTU was used, based on data analysis and the identification of the same threshold as critical for optimal water treatment operations in another Australian drinking water reservoir [32]. Overall, both thresholds were discussed and chosen with the water utility stakeholders according to site-specific knowledge of water treatment technologies and related capacity/limits. Based on the selected thresholds, historical data for inputs and targets were used to train the BN in Netica and fill the conditional probability tables. For those nodes without parent nodes (i.e., in this case the target water quality variables), the conditional probability tables represent the prior probabilities of those nodes, i.e., probability of those nodes assuming different states, in absence of any evidence. These were also initially estimated from the available historical data but, as shown later in the results, they were subsequently adjusted to account for the experts’ knowledge of their fluctuations when data were not available for this application.
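
As an illustration of how a conditional probability table can be filled from discretised matchups, the sketch below counts state co-occurrences. It is not Netica's learning algorithm: the helper names are hypothetical, the add-one pseudo-count is an assumption made here to avoid zero probabilities with small datasets, and the data values are made up. Only the 8 mg/m³ and −0.035 thresholds come from the text.

```python
from collections import Counter


def discretise(value, thresholds):
    """Return the index of the interval containing value (thresholds ascending)."""
    return sum(value >= t for t in thresholds)


def learn_cpt(parent_states, child_states, n_parent, n_child, pseudo_count=1.0):
    """Estimate P(child state | parent state) from paired, discretised observations."""
    counts = Counter(zip(parent_states, child_states))
    cpt = []
    for p in range(n_parent):
        row = [counts[(p, c)] + pseudo_count for c in range(n_child)]
        total = sum(row)
        cpt.append([x / total for x in row])  # each row sums to 1
    return cpt


# Made-up matchups: chl-a in mg/m3 (parent) and a spectral index (child)
chla = [2.0, 3.0, 9.0, 12.0]
index = [0.10, -0.01, -0.05, -0.06]
parent = [discretise(v, [8.0]) for v in chla]     # 0 = low chl-a, 1 = high chl-a
child = [discretise(v, [-0.035]) for v in index]  # 0 = below threshold, 1 = above
print(learn_cpt(parent, child, n_parent=2, n_child=2))
```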

The last step involved generating the probabilities of exceeding such critical thresholds for the entire lake, based on the provided satellite images and related band-specific reflectance. The developed BN has discretised variables, and as such the estimated probabilities are limited to a handful of values, depending on the combination of input states. In order to improve the visualisation by creating a continuum of probability values, a simplified Monte Carlo approach was applied to generate a dataset of predicted posterior probabilities based on different specific input values; this dataset was then modelled with multiple nonlinear regression. Future work will focus on a full integration of the BN into the R code, to enable the automated incorporation of specific input values (and their uncertainty) for each pixel. BN model performance (for both the discretised and the continuous model versions) was assessed over a subset of the data using the Brier score [18].

$$B = \frac{\sum_{i=1}^{n} (X_i - q_i)^2}{n} \qquad (1)$$

where $X_i$ is the actual occurrence (or not) of the event (hence usually 1 or 0), $q_i$ is the predicted probability of an event (e.g., high turbidity or chl-a), and $n$ is the number of data points. The closer $B$ is to zero, the better the prediction.
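
Equation (1) translates directly into a few lines of code. The sketch below is illustrative; the function name and the example values are made up and are not taken from the paper's dataset.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between observed events (0/1) and predicted probabilities."""
    if len(outcomes) != len(probabilities) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - q) ** 2 for x, q in zip(outcomes, probabilities)) / len(outcomes)


# Made-up example: two exceedance events and two non-events
observed = [1, 0, 0, 1]
predicted = [0.8, 0.1, 0.5, 0.6]
print(brier_score(observed, predicted))  # approximately 0.115
```

A useful reference point is that always predicting 0.5 gives a score of 0.25 regardless of the outcomes, so scores are best judged against that value and against the observed event frequency.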

3. Results

3.1. Water Quality Data Features and Inputs Selection

Figure 3 shows the variation of chl-a and turbidity across the study period, for in situ results (at all locations used) matching the considered satellite retrieval times. It can be seen that the reservoir overall had very low chl-a and turbidity, aside from a limited number of occasions on which the predefined thresholds were exceeded, as far as this reduced dataset can tell. Based on this reduced dataset, there are no long-term trends. Chl-a shows the expected seasonality, with usually slightly higher values during the warmer summer months, though the highest values in the analysed time period occurred in autumn/winter 2022. Turbidity does not show a clear seasonality, rather reflecting a likely response to shorter-term events such as rainfall. It is also important to note the lack of any significant cross-correlation, with spikes in turbidity not linked to spikes in chl-a; this implies a different source of turbidity and thus the need for a separate band combination for the turbidity model.

No individual band, or ratio between individual bands, provided significant correlations with turbidity data. Of all the ratios/indices developed, the two that provided useful information for turbidity prediction are given in Equations (2) and (3). For specific information about each Sentinel-2 band, the reader can refer to, e.g., Table 1 in [4].

$$\mathrm{Input}_{Tb,1} = \frac{B8}{VIS} = \frac{B8}{(B2 + B3 + B4)/3} \qquad (2)$$

$$\mathrm{Input}_{Tb,2} = \frac{VIS}{SWIR} = \frac{B2 + B3 + B4}{B5 + B6 + B7} \qquad (3)$$

where B2, B3, B4, and B8 are the retrieved reflectance values for, respectively, Band 2, 3, 4, and 8 of the Sentinel-2 data. VIS, as the equations show, is the average reflectance for the three bands used in the visible spectrum; meanwhile, SWIR is the average reflectance for the three bands used in the short-wave infrared spectrum. With turbid water reflecting more than clear water in the entire visible spectrum, such ratios, which quantify the relative variation in the visible spectrum reflectance compared to other spectral regions (e.g., near infrared (NIR) for $\mathrm{Input}_{Tb,1}$ and the short-wave infrared for $\mathrm{Input}_{Tb,2}$), seemed to provide valuable information to distinguish between high and low turbidity. Specifically, although traditional metrics such as R² were low, it was possible to identify specific thresholds of interest. For instance, a threshold of 1 for $\mathrm{Input}_{Tb,1}$ made it possible to distinguish between low turbidity ($\mathrm{Input}_{Tb,1}$ < 1 in 95% of the low turbidity dataset) and high turbidity.
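
Equations (2) and (3) can be computed per pixel as in the sketch below. The function name and the reflectance values are made up for illustration; only the band combinations and the threshold of 1 come from the text.

```python
def turbidity_inputs(b2, b3, b4, b5, b6, b7, b8):
    """Return (Input_Tb1, Input_Tb2) from band reflectances, per Equations (2) and (3)."""
    vis = (b2 + b3 + b4) / 3.0   # average visible reflectance
    swir = (b5 + b6 + b7) / 3.0  # average of B5-B7, labelled SWIR in the text
    return b8 / vis, vis / swir


# Made-up reflectances for one pixel
tb1, tb2 = turbidity_inputs(b2=0.04, b3=0.05, b4=0.03, b5=0.03, b6=0.02, b7=0.01, b8=0.02)
print(tb1, tb2)
print("below threshold of 1:", tb1 < 1)
```

Because both averages in Equation (3) divide by three, the ratio of the averages equals the ratio of the sums, which is why the simplified right-hand side of Equation (3) carries no factor of three.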

Similarly, for chl-a, the two following indices were used as final predictors, with the second one corresponding to the Normalized Difference Water Index (NDWI) [33].

$$\mathrm{Input}_{chla,1} = \frac{B4 - B8}{B4 + B8} \qquad (4)$$

$$\mathrm{Input}_{chla,2} = NDWI = \frac{B3 - B8}{B3 + B8} \qquad (5)$$

While these indices provide limited useful information for low chl-a, the majority of high chl-a results corresponded to NDWI < 0.2 and $\mathrm{Input}_{chla,1}$ < −0.035. These thresholds were then used for discretisation in the BN software.
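
For whole-image use, Equations (4) and (5) and the two discretisation thresholds can be applied to band arrays at once. The sketch below uses NumPy; the function names and reflectance values are hypothetical, and pixels where a denominator is zero (e.g., masked pixels) would need separate handling that is not shown.

```python
import numpy as np

INPUT1_THRESHOLD = -0.035  # from the text: high chl-a mostly below this value
NDWI_THRESHOLD = 0.2       # from the text: high chl-a mostly below this value


def chla_inputs(b3, b4, b8):
    """Return (Input_chla1, NDWI) arrays, per Equations (4) and (5)."""
    b3, b4, b8 = (np.asarray(b, dtype=float) for b in (b3, b4, b8))
    input1 = (b4 - b8) / (b4 + b8)
    ndwi = (b3 - b8) / (b3 + b8)
    return input1, ndwi


def chla_states(b3, b4, b8):
    """Boolean masks marking pixels whose indices fall below each threshold."""
    input1, ndwi = chla_inputs(b3, b4, b8)
    return input1 < INPUT1_THRESHOLD, ndwi < NDWI_THRESHOLD


# Two made-up pixels
below1, below2 = chla_states(b3=[0.06, 0.03], b4=[0.03, 0.02], b8=[0.02, 0.025])
print(below1, below2)
```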

3.2. Bayesian Network and Outputs Visualisation

Figure 4 shows the simple BN structure as implemented in the Netica software. The blue-delimited part (“Sentinel-2 data”), to be used for estimation of the input indices, is a valuable component which was nevertheless out of the scope of this work, as it requires a full integration of BN capabilities into the R environment to fully automate the pixel-by-pixel prediction and visualisation. Future work will focus on this component; this will be valuable since it will allow the incorporation of any uncertainty related to the satellite data retrieval and preprocessing (by assigning a probability distribution to the variables B2 to B8). In the core part of the BN, the target variables are the “parent” nodes (i.e., they affect the “child” nodes, which are in fact the model inputs): this is because the level of turbidity and chl-a affects the band reflectance and, in turn, the indices selected as inputs. For prediction purposes, however, the unknown parameters are the water quality variables, whose state probabilities are inferred via back-propagation (i.e., bottom-up, from the observed child nodes to the parent nodes), which is a feature of BNs. Additionally, and importantly with BNs, as previously mentioned, prior probabilities can be provided for, in this case, the target water quality variables. For turbidity, we provided as a prior the frequency distribution of our historical dataset (92%–6%–2% for the three intervals, see Figure 4). For chl-a, we preferred not to provide any prior information (i.e., we kept the prior at 50%–50%, see Figure 4), as the dataset was limited and, based on discussions with water utility stakeholders, there had been a higher number of high chl-a events than were captured in our final dataset; as such, we did not want to bias the model toward underprediction.
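
The bottom-up inference described above can be sketched for a target node with two observed child nodes. The sketch assumes the children are conditionally independent given the target, which is one plausible reading of the structure but is not confirmed here, and all conditional probabilities are made-up numbers rather than the values learned in Netica.

```python
def posterior(prior, cpts, evidence):
    """P(target state | observed child states) via Bayes' theorem.

    prior    : list of prior probabilities, one per target state
    cpts     : one table per child, cpt[target_state][child_state]
    evidence : observed state index for each child
    """
    unnormalised = []
    for t, p_t in enumerate(prior):
        likelihood = p_t
        for cpt, state in zip(cpts, evidence):
            likelihood *= cpt[t][state]
        unnormalised.append(likelihood)
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


prior_chla = [0.5, 0.5]              # uninformative prior, as used for chl-a
cpt_input1 = [[0.3, 0.7], [0.8, 0.2]]  # hypothetical: rows = low/high chl-a
cpt_ndwi = [[0.4, 0.6], [0.9, 0.1]]    # hypothetical: columns = below/above threshold
post = posterior(prior_chla, [cpt_input1, cpt_ndwi], evidence=(0, 0))
print(post)  # probability of high chl-a is about 0.857
```

If the rows of every table were identical, meaning the inputs carry no information about the target, the posterior would simply equal the prior.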

In terms of validation, the calculated Brier scores are reported in Table 1. Overall, with a score of no more than 0.27 for any model, the performance can be considered satisfactory given the limited dataset (and small number of threshold exceedances). The average posterior probability value estimated when an exceedance event was recorded was also very high. For this particular assessment, we considered an exceedance of the 10 NTU threshold as a turbidity event, in order to have a larger dataset of assessable events.

In terms of visualisation effectiveness, Figure 5 and Figure 6 represent two probabilistic maps, specifically for high chl-a (Figure 5) and high turbidity (Figure 6), on the 25 November 2022 and 4 January 2023, i.e., on days when a Sentinel-2 image was retrieved. In Figure 5, a higher probability of high chl-a seems to be expected in the Eastern arm. However, looking at the numbers, such probability is in the 60–70% range: rather than predicting a specific value based on model accuracy which may prove to be wrong, we provide decision makers with a probabilistic estimation based on historical occurrences and model performance (in turn affected by historical data availability). The operators can decide on the most appropriate action (e.g., increased monitoring for confirmation/validation, or, directly, water treatment adjustments) based on their risk tolerance. On the other hand, the Western arm seems to display a lower risk of high chl-a. However, looking at the numbers, the probabilities associated with those grey and dark blue colours are around 50%. This means that uncertainty is at its maximum, and that the model, based on available historical data and their accuracy, and its own structure, does not have enough information to provide any meaningful prediction. This offers an opportunity for the stakeholders to make decisions which can improve the model over time, such as increased data collection in that area of the reservoir. This will ensure that the BN can be continuously updated and refined with relevant data, thus reducing that uncertainty.

Similarly, Figure 6 shows a relatively low risk of high turbidity everywhere. It should be noted that, unlike for chl-a, prior knowledge (based on historical data) was provided for the turbidity model, thus the BN updates such prior probabilities which, for high turbidity, were low based on historical frequency. By looking at the numbers, however, it can be seen that, overall, there is a lower probability (around 25%, light blue) in the Eastern arm, and a higher probability (around 40%, darker blue) on the Western arm. This can be seen, again, as higher uncertainty in the Western arm. As the Eastern arm has a large source of data (HUI) which the Western arm does not, this is a confirmation that future monitoring work could increase data collection on a few strategic points in the Western arm, to ensure less uncertain predictions overall, and in particular on this large section of the reservoir.
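
The reading of probabilities close to 50% as the most uncertain, used for both maps above, can be made explicit with the binary entropy of an exceedance probability $p$. The paper does not compute this quantity; it is offered here only as one common way to express the idea:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

$H(p)$ reaches its maximum of 1 bit at $p = 0.5$ and falls to 0 as $p$ approaches 0 or 1. For the approximate values read from Figure 6, $H(0.25) \approx 0.81$ bits and $H(0.40) \approx 0.97$ bits, consistent with the interpretation of higher uncertainty in the Western arm.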

4. Discussion

The proposed BN methodology leads to visual outputs which, while inevitably more complex to interpret than a map with the predicted “crisp” values only, provide a compromise between user-friendliness and rigorous representation of predictions’ uncertainty. Water decision makers can lose trust in models for a number of reasons [34], and these include over-confident predictions which prove to be wrong. The majority of satellite-based freshwater quality predictions, while quantifying model accuracy and uncertainties, fail to visually represent it in the generated prediction maps. Even in the most recent works of the Intergovernmental Panel on Climate Change, while the two-map approach (prediction and mean error) is presented in the detailed climate model evaluation chapters (e.g., Figure 9.4 in [35]), this information is lost when summarised for policy makers, who are instead only provided with the predicted values (e.g., Figure SPM.5 in [36]), and overlook the model error, which greatly differs based on, e.g., latitude.

The provided visual output is not meant to be “passively” interpreted, and used only when prediction uncertainty is low. Rather, this offers an opportunity to proactively identify reservoir areas with consistently higher uncertainty. While uncertainty can arise from a number of sources, obtaining more in situ data, which for this application was limited, can lead to a better model performance. Optimising the frequency and locations of surface water quality monitoring networks is not trivial, and several methods have been proposed [37]. While for practical reasons (e.g., proximity to intake towers) the current in situ manual sampling work is disproportionately more focused on the Eastern arm of Advancetown Lake, the presence of poor water quality in the Western arm can still affect the raw water quality at other locations in the reservoir (at the very least, at HLI). Moreover, while the Eastern arm receives water mainly from overflow of the upstream Little Nerang Dam, whose raw water is regularly monitored and used in conjunction with HUI for treatment, the Western arm receives water from the upstream section of the Nerang River and its catchment, which is much less frequently monitored. It thus seems appropriate to increase data collection in this arm, to reduce uncertainty in this part of the reservoir, and to potentially use the outputs to calibrate/validate existing process-based models. This would in turn enhance the understanding of some key hydrodynamic/water quality processes occurring in this arm and in the reservoir as a whole, potentially making the extra monitoring a cost-effective investment.

While the benefits of the proposed Bayesian Network approach have been outlined, the presented work has a number of limitations that future work could address.

From a coding perspective, the BN runs on its own software, making its application to newly acquired and processed images time-consuming and complex. The simplified approach applied herein (i.e., embedding BN predictions to a set of simplified probabilistic, nonlinear equations in R) can add to the error and uncertainty. Future work can focus on fully implementing the BN principles in the same R code, potentially through dedicated BN/Bayesian packages in R.

Importantly, this could ultimately incorporate an uncertainty factor (as a probability distribution; see left side of Figure 4) for the band-specific reflectance values. For our dataset, in order to increase the number of matchups, the satellite retrieval date was often up to a few days different from the in situ sampling date. Depending on the target water quality parameter and reservoir, this can be considered acceptable; however, it is logical to consider synchronous data more reliable (i.e., justifying a narrower probability distribution around the retrieved reflectance value). Similarly, while data are processed and corrected for atmospheric/radiometric (among other) issues, such correction algorithms are not perfect, and a correct retrieval can be challenging, especially under certain conditions. Such varying uncertainty could also be incorporated in the BN.

The BN also allows the incorporation of prior knowledge for the target variable. This was performed for turbidity but, as explained, not for chl-a, because the limited dataset is likely not representative of overall historical conditions. Future work, backed by further data collection, can lead to a better understanding of chl-a variations through weather events, seasons, and years, allowing the incorporation of a realistic prior distribution. This would help refine the predictions according to the knowledge of the system. In addition, the chl-a prediction could be refined based on other inputs, such as season or temperature, which can affect its concentration. These could be incorporated into the BN structure and would also help reduce uncertainty.
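The effect of a prior on the prediction can be illustrated with Bayes' rule for a two-state target (high/not high). The sketch below is a generic example rather than the network used in this study; the function name and all probabilities are hypothetical.

```python
def posterior_high(prior_high, lik_high, lik_low):
    """Bayes' rule for a two-state target.

    prior_high: prior probability of the 'high' state
    lik_high:   probability of the observed evidence given 'high'
    lik_low:    probability of the observed evidence given 'not high'
    """
    numerator = lik_high * prior_high
    denominator = numerator + lik_low * (1.0 - prior_high)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both states")
    return numerator / denominator

# Same satellite evidence (likelihood ratio of 4), two hypothetical priors
uniform = posterior_high(0.5, lik_high=0.8, lik_low=0.2)   # uninformative prior
informed = posterior_high(0.1, lik_high=0.8, lik_low=0.2)  # 'high' events assumed rare
```

With the uninformative prior the posterior probability of the high state is 0.8, whereas with the prior reflecting rare events it drops to roughly 0.31, showing how a realistic prior can temper predictions based on weak spectral evidence.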

A larger dataset, in addition to further exploring different locations (i.e., in the Western arm), can also be achieved by increasing the number of historical matchups. Sentinel-2 data were available only from the end of 2018, and due to the different data collection frequencies (5 days for Sentinel-2, monthly for in situ), only a limited number of matchups was obtained. Other matchups could be obtained by exploring other satellite platforms such as Landsat-8, or by increasing the in situ dataset through the inclusion of high-frequency VPS data for HLI and HUI. These provide, every hour, an entire profile for chl-a and turbidity, among other useful parameters such as temperature, pH, organic matter, and conductivity; as such, this would ensure a matchup for every “acceptable” (e.g., low cloud cover) satellite image, thus increasing the dataset to several matchups a month. This was out of the scope of this work since it would require considerable further sensor data analysis and compensation work. Optical sensors such as the chl-a and the fluorescent dissolved organic matter (fDOM) ones, in addition to sensor drift, are greatly affected by factors such as sunlight, turbidity, temperature, and pH, to name a few [26,38,39,40]. As such, future work should first ensure full reliability of these sensors (including by running laboratory-based compensation experiments and checking for consistency with laboratory results) before their readings can be used as inputs. Otherwise, this would just add further uncertainty to the data and in turn to the model predictions. Similarly, the inclusion of other satellites’ data, despite many having similar spectral features, would require work to be fully integrated with the existing Sentinel-2 dataset, and checked for consistency [41].
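The matchup step described above can be sketched as a nearest-date search with a tolerance window. The dates, the three-day tolerance and the function name below are hypothetical and serve only to show why monthly in situ sampling yields few matchups with 5-day imagery once unusable images are discarded.

```python
from datetime import date

def find_matchups(in_situ_dates, satellite_dates, max_offset_days=3):
    """Pair each in situ date with the nearest satellite date within the tolerance.

    Returns a list of (in_situ_date, satellite_date, offset_days) tuples, where
    offset_days is positive when the image was acquired after the sampling.
    """
    pairs = []
    for d in in_situ_dates:
        nearest = min(satellite_dates, key=lambda s: abs((s - d).days))
        offset = (nearest - d).days
        if abs(offset) <= max_offset_days:
            pairs.append((d, nearest, offset))
    return pairs

# Hypothetical monthly sampling dates and usable (e.g., low cloud cover) image dates
in_situ = [date(2022, 11, 23), date(2022, 12, 21), date(2023, 1, 18)]
satellite = [date(2022, 11, 25), date(2022, 12, 30), date(2023, 1, 4), date(2023, 1, 19)]
matchups = find_matchups(in_situ, satellite, max_offset_days=3)
```

Here the December sampling has no usable image within three days and is dropped, leaving two matchups out of three sampling dates. Hourly profiler data would, by contrast, provide a near-synchronous value for every usable image.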

Finally, as mentioned, this was a challenging location, due to its relatively pristine catchment and most of the dataset showing clear, low chl-a waters. Future data collection might expand the number of high chl-a and high turbidity events, which have a higher signal-to-noise ratio and, in turn, are typically easier to predict. This would also improve model performance and reduce uncertainty. Importantly, the modelled uncertainty is concentration-specific, with higher inaccuracies and variability for lower chl-a concentrations, as in several past studies. Unlike in other approaches, the uncertainty represented by the BN therefore varies with chl-a concentration, thus overcoming a common issue.

5. Conclusions

A probabilistic modelling approach, built around a Bayesian Network and implemented in R, was deployed to predict the risk of exceeding critical thresholds of chl-a and turbidity in a medium-sized, subtropical drinking water reservoir. The generated maps are relatively user-friendly in quantifying and representing the pixel-specific prediction uncertainty. Decision makers, based on their risk tolerance, could take immediate actions at the water treatment plant, or adjust/increase the in situ regular data collection at locations suggested by the model. This would lead to a reduction in uncertainty over time. Future work will focus on incorporating a number of modelling improvements as well as integrating compensated high-frequency sensor data to significantly expand the available dataset. It is suggested that such a modelling/visualisation approach can be re-applied to several fields of remote sensing research, particularly where high prediction uncertainty can be expected and/or where the spatial/temporal features of future in situ monitoring campaigns need to be defined (e.g., a number of environmental applications).

Author Contributions

Conceptualization, E.B.; methodology, E.B.; software, E.B.; validation, E.B.; formal analysis, E.B.; investigation, E.B.; resources, E.B. and S.P.H.; data curation, E.B. and S.P.H.; writing—original draft preparation, E.B.; writing—review and editing, S.P.H.; visualization, E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author and upon approval by Seqwater.

Acknowledgments

We are thankful to Seqwater for sharing the in situ data used in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Topp, S.N.; Pavelsky, T.M.; Jensen, D.; Simard, M.; Ross, M.R. Research trends in the use of remote sensing for inland water quality science: Moving towards multidisciplinary applications. Water 2020, 12, 169.
2. Hu, H.; Fu, X.; Li, H.; Wang, F.; Duan, W.; Zhang, L.; Liu, M. Prediction of lake chlorophyll concentration using the BP neural network and Sentinel-2 images based on time features. Water Sci. Technol. 2023, 87, 539–554.
3. Saberioon, M.; Brom, J.; Nedbal, V.; Souček, P.; Císař, P. Chlorophyll-a and total suspended solids retrieval and mapping using Sentinel-2A and machine learning for inland waters. Ecol. Indic. 2020, 113, 106236.
4. Bramich, J.; Bolch, C.J.S.; Fischer, A. Improved red-edge chlorophyll-a detection for Sentinel 2. Ecol. Indic. 2021, 120, 106876.
5. Li, S.; Song, K.; Wang, S.; Liu, G.; Wen, Z.; Shang, Y.; Lyu, L.; Chen, F.; Xu, S.; Tao, H.; et al. Quantification of chlorophyll-a in typical lakes across China using Sentinel-2 MSI imagery with machine learning algorithm. Sci. Total Environ. 2021, 778, 146271.
6. Seegers, B.N.; Werdell, P.J.; Vandermeulen, R.A.; Salls, W.; Stumpf, R.P.; Schaeffer, B.A.; Owens, T.J.; Bailey, S.W.; Scott, J.P.; Loftin, K.A. Satellites for long-term monitoring of inland U.S. lakes: The MERIS time series and application for chlorophyll-a. Remote Sens. Environ. 2021, 266, 112685.
7. Cherukuru, N.; Martin, P.; Sanwlani, N.; Mujahid, A.; Müller, M. A semi-analytical optical remote sensing model to estimate suspended sediment and dissolved organic carbon in tropical coastal waters influenced by peatland-draining river discharges off Sarawak, Borneo. Remote Sens. 2020, 13, 99.
8. Rahul, T.S.; Brema, J. Assessment of water quality parameters in Muthupet estuary using hyperspectral PRISMA satellite and multispectral images. Environ. Monit. Assess. 2023, 195, 880.
9. Valerio, A.d.M.; Kampel, M.; Vantrepotte, V.; Ward, N.D.; Sawakuchi, H.O.; Less, D.F.D.S.; Neu, V.; Cunha, A.; Richey, J. Using CDOM optical properties for estimating DOC concentrations and pCO₂ in the Lower Amazon River. Opt. Express 2018, 26, A657–A677.
10. Sagan, V.; Peterson, K.T.; Maimaitijiang, M.; Sidike, P.; Sloan, J.; Greeling, B.A.; Maalouf, S.; Adams, C. Monitoring inland water quality using remote sensing: Potential and limitations of spectral indices, bio-optical simulations, machine learning, and cloud computing. Earth Sci. Rev. 2020, 205, 103187.
11. Markogianni, V.; Kalivas, D.; Petropoulos, G.P.; Dimitriou, E. An appraisal of the potential of Landsat 8 in estimating chlorophyll-a, ammonium concentrations and other water quality indicators. Remote Sens. 2018, 10, 1018.
12. Qi, T.; Xiao, Q.; Cao, Z.; Shen, M.; Ma, J.; Liu, D.; Duan, H. Satellite Estimation of Dissolved Carbon Dioxide Concentrations in China’s Lake Taihu. Environ. Sci. Technol. 2020, 54, 13709–13718.
13. Kutser, T.; Verpoorter, C.; Paavel, B.; Tranvik, L.J. Estimating lake carbon fractions from remote sensing data. Remote Sens. Environ. 2015, 157, 138–146.
14. Zheng, G.; DiGiacomo, P.M. Uncertainties and applications of satellite-derived coastal water quality products. Prog. Oceanogr. 2017, 159, 45–72.
15. Liu, X.; Steele, C.; Simis, S.; Warren, M.; Tyler, A.; Spyrakos, E.; Selmes, N.; Hunter, P. Retrieval of Chlorophyll-a concentration and associated product uncertainty in optically diverse lakes and reservoirs. Remote Sens. Environ. 2021, 267, 112710.
16. Werther, M.; Odermatt, D.; Simis, S.G.H.; Gurlin, D.; Lehmann, M.K.; Kutser, T.; Gupana, R.; Varley, A.; Hunter, P.D.; Tyler, A.N.; et al. A Bayesian approach for remote sensing of chlorophyll-a and associated retrieval uncertainty in oligotrophic and mesotrophic lakes. Remote Sens. Environ. 2022, 283, 113295.
17. Roncoroni, M.; Mancini, D.; Kohler, T.J.; Miesen, F.; Gianini, M.; Battin, T.J.; Lane, S.N. Centimeter-scale mapping of phototrophic biofilms in glacial forefields using visible band ratios and UAV imagery. Int. J. Remote Sens. 2022, 43, 4723–4757.
18. Fenton, N.; Neil, M. Risk Assessment and Decision Analysis with Bayesian Networks; CRC Press: New York, NY, USA, 2018.
19. Xu, S.; Dimasaka, J.; Wald, D.J.; Noh, H.Y. Seismic multi-hazard and impact estimation via causal inference from satellite imagery. Nat. Commun. 2022, 13, 7793.
20. Chen, S.H.; Pollino, C.A. Good practice in Bayesian network modelling. Environ. Model. Softw. 2012, 37, 134–145.
21. Barton, D.N.; Kuikka, S.; Varis, O.; Uusitalo, L.; Henriksen, H.J.; Borsuk, M.; de la Hera, A.; Farmani, R.; Johnson, S.; Linnell, J.D.C. Bayesian networks in environmental and resource management. Integr. Environ. Assess. Manag. 2012, 8, 418–429.
22. Uusitalo, L. Advantages and challenges of Bayesian networks in environmental modelling. Ecol. Model. 2007, 203, 312–318.
23. Bertone, E.; Stewart, R.A.; Zhang, H.; O’Halloran, K. Analysis of the mixing processes in the subtropical Advancetown Lake, Australia. J. Hydrol. 2015, 522, 67–79.
24. Bertone, E.; Stewart, R.A.; Zhang, H.; Bartkow, M.; Hacker, C. An autonomous decision support system for manganese forecasting in subtropical water reservoirs. Environ. Model. Softw. 2015, 73, 133–147.
25. Bertone, E.; Stewart, R.; Zhang, H.; O’Halloran, K. Numerical Study On Climate Variation And Population Growth Impacts On An Australian Subtropical Water Supply Reservoir. In Proceedings of the 11th International Conference on Hydroinformatics, New York, NY, USA, 17–21 August 2014.
26. Rousso, B.Z.; Bertone, E.; Stewart, R.A.; Rinke, K.; Hamilton, D.P. Light-induced fluorescence quenching leads to errors in sensor measurements of phytoplankton chlorophyll and phycocyanin. Water Res. 2021, 198, 117133.
27. Chuvieco, E. Fundamentals of Satellite Remote Sensing: An Environmental Approach, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2020.
28. Alba, G.; Anabella, F.; Marcelo, S.; Andrea, G.A.; Ivana, T.; Guillermo, I.; Sandra, T.; Michal, S. Spectral monitoring of algal blooms in an eutrophic lake using sentinel-2. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 306–309.
29. Barraza-Moraga, F.; Alcayaga, H.; Pizarro, A.; Félez-Bernal, J.; Urrutia, R. Estimation of Chlorophyll-a Concentrations in Lanalhue Lake Using Sentinel-2 MSI Satellite Images. Remote Sens. 2022, 14, 5647.
30. Shi, X.; Gu, L.; Jiang, T.; Zheng, X.; Dong, W.; Tao, Z. Retrieval of Chlorophyll-a Concentrations Using Sentinel-2 MSI Imagery in Lake Chagan Based on Assessments with Machine Learning Models. Remote Sens. 2022, 14, 4924.
31. Vollenweider, R.; Kerekes, J. Eutrophication of Waters. Monitoring, Assessment and Control; Organisation for Economic Co-Operation and Development: Paris, France, 1982; p. 156.
32. Bertone, E.; Sahin, O.; Richards, R.; Roiko, A. Extreme events, water quality and health: A participatory Bayesian risk assessment tool for managers of reservoirs. J. Clean. Prod. 2016, 135, 657–667.
33. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432.
34. Silva-Hidalgo, H.; Martín-Domínguez, I.R.; Alarcón-Herrera, M.T.; Granados-Olivas, A. Mathematical Modelling for the Integrated Management of Water Resources in Hydrological Basins. Water Resour. Manag. 2009, 23, 721–730.
35. Flato, G.; Marotzke, J.; Abiodun, B.; Braconnot, P.; Chou, S.C.; Collins, W.; Cox, P.; Driouech, F.; Emori, S.; Eyring, V. Evaluation of climate models. In Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK, 2014; pp. 741–866.
36. Masson-Delmotte, V.; Zhai, P.; Pirani, S.; Connors, C.; Péan, S.; Berger, N.; Caud, Y.; Chen, L.; Goldfarb, M.; Scheel Monteiro, P.M. IPCC, 2021: Summary for Policymakers. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK, 2021.
37. Jiang, J.; Tang, S.; Han, D.; Fu, G.; Solomatine, D.; Zheng, Y. A comprehensive review on the design and optimization of surface water quality monitoring networks. Environ. Model. Softw. 2020, 132, 104792.
38. Bertone, E.; Chuang, A.; Burford, M.A.; Hamilton, D.P. In-situ fluorescence monitoring of cyanobacteria: Laboratory-based quantification of species-specific measurement accuracy. Harmful Algae 2019, 87, 101625.
39. Choo, F.; Zamyadi, A.; Stuetz, R.M.; Newcombe, G.; Newton, K.; Henderson, R.K. Enhanced real-time cyanobacterial fluorescence monitoring through chlorophyll-a interference compensation corrections. Water Res. 2019, 148, 86–96.
40. Choo, F.; Zamyadi, A.; Newton, K.; Newcombe, G.; Bowling, L.; Stuetz, R.; Henderson, R.K. Performance evaluation of in situ fluorometers for real-time cyanobacterial monitoring. H2Open J. 2018, 1, 26–46.
41. Lessio, A.; Fissore, V.; Borgogno-Mondino, E. Preliminary Tests and Results Concerning Integration of Sentinel-2 and Landsat-8 OLI for Crop Monitoring. J. Imaging 2017, 3, 49.

Figure 1. Advancetown Lake and monitoring locations used for this study. Yellow dots (main monitoring sites): HUI = Hinze Dam upper intake; HLI = Hinze Dam lower intake (28°02′42.7″ S 153°16′56.3″ E). Orange dots (additional monitoring sites): S1 = Hinze Dam Little Nerang Creek inlet; S2 = Hinze Dam Nerang River inlet; S3 = Hinze Dam at Belliss Creek inlet. Yellow star = Advancetown Lake’s approximate location. Modified from Google Earth (2023).

Figure 2. Methodology flow chart. Blue indicates tasks performed in R; green indicates tasks performed in MS Excel; orange indicates tasks performed in Netica. P = probability; H = high.

Figure 3. Chl-a and turbidity for Advancetown Lake (all deployed sites), 2018–2023 (only dates matching satellite retrievals).

Figure 4. Bayesian Network structure. B1 to B8 = Sentinel-2 retrieved reflectance for Bands 1 to 8. For definitions of the four BN predictors, please see Equations (2)–(5). Variables discretisation based on data analysis and expert input.

Figure 5. BN-estimated posterior probability of chl-a exceeding 8 mg/m³, Advancetown Lake, 25 November 2022.

Figure 6. BN-estimated posterior probability of turbidity exceeding 40 NTU, Advancetown Lake, 4 January 2023.

Table 1. BNs accuracy assessment.

BN Model	Brier Score	Average Predicted Probability for Events ¹
Discrete chl-a	0.211	58.1%
Continuous chl-a	0.275	62.1%
Discrete Tb	0.083	74.3%
Continuous Tb	0.234	90.8%

¹ Event = chl-a > 8 mg/m³ or Tb > 10 NTU.
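For reference, the two metrics in Table 1 can be computed as in the minimal sketch below. The Brier score is the mean squared difference between the predicted probability and the observed binary outcome (0 is perfect; a constant prediction of 0.5 scores 0.25). The second column is read here as the mean predicted probability over the samples in which the event actually occurred, which is an assumption about how it was derived. The probabilities and outcomes are invented and do not reproduce the values in the table.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes (0 or 1)."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)

def mean_probability_for_events(probabilities, outcomes):
    """Average predicted probability over the samples where the event occurred."""
    event_probs = [p for p, o in zip(probabilities, outcomes) if o == 1]
    if not event_probs:
        raise ValueError("no events in the sample")
    return sum(event_probs) / len(event_probs)

# Hypothetical predictions for four matchups (1 = threshold exceeded)
predicted = [0.9, 0.2, 0.6, 0.1]
observed = [1, 0, 1, 0]

bs = brier_score(predicted, observed)
event_mean = mean_probability_for_events(predicted, observed)
```

For these invented values the Brier score is 0.055 and the mean predicted probability for events is 0.75.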

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
