Article

Deep Learning for Water Quality Prediction—A Case Study of the Huangyang Reservoir

1 College of Electrical Engineering, Northwest Minzu University, Lanzhou 730030, China
2 Gansu Engineering Research Center for Eco-Environmental Intelligent Networking, Lanzhou 730030, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8755; https://doi.org/10.3390/app14198755
Submission received: 20 August 2024 / Revised: 24 September 2024 / Accepted: 26 September 2024 / Published: 27 September 2024

Abstract

Water quality prediction is a fundamental prerequisite for effective water resource management and pollution prevention. Accurate predictions of water quality information can provide essential technical support and strategic planning for the protection of water resources. This study aims to enhance the accuracy of water quality prediction, considering the temporal characteristics, variability, and complex nature of water quality data. We utilized the LTSF-Linear model to predict water quality at the Huangyang Reservoir. Comparative analysis with three other models (ARIMA, LSTM, and Informer) revealed that the Linear model outperforms them, achieving reductions of 8.55% and 10.51% in mean square error (MSE) and mean absolute error (MAE), respectively. This research introduces a novel method and framework for predicting hydrological parameters relevant to water quality in the Huangyang Reservoir. These findings offer a valuable new approach and reference for enhancing the intelligent and sustainable management of the reservoir.

1. Introduction

Water is a crucial resource for the sustenance of humanity, playing a vital role in all human production activities. As a fundamental component of the ecological environment, it cannot be substituted by any other resource. In recent years, China’s economy has experienced remarkable growth, accompanied by rapid industrialization. However, this rapid industrialization has also led to a concerning increase in surface water pollution caused by industrial activities. Water quality prediction is a fundamental prerequisite for effective water resource management and pollution prevention. Accurate water quality predictions can reflect the current pollution status of water bodies and forecast future trends. This information provides essential technical support and strategic planning for the protection of water resources, helping to prevent water pollution events before they occur. Accurate water quality prediction has significant implications for enhancing the conservation and utilization of water resources, addressing the current challenges in water pollution prevention, and facilitating the restoration of the ecological environment [1]. To achieve more accurate water quality predictions, numerous scholars have conducted extensive research.
Water quality prediction methods are categorized into gray system theory prediction methods, chaos theory prediction methods, statistical methods, and machine learning methods [2,3]. Gray system theory prediction was proposed by Deng Julong in 1982. This method is utilized to address the relationships within systems where information is unclear due to insufficient data. Li integrated Markov chain theory with gray system theory to enhance the predictive accuracy of the model when analyzing river water quality data characterized by high random volatility [4]. Chaos theory was introduced by American meteorologist Edward Norton Lorenz in 1963. It primarily explains that the occurrence of seemingly random phenomena is not the result of a single influencing factor; rather, it is produced by the combined influence of multiple factors within an entire system. The development of chaos theory, along with its ongoing expansion and application in environmental science [5,6], has provided new methodologies for predicting river water quality [7,8]. Xu Min et al. analyzed dissolved oxygen (DO) levels in water bodies using chaos theory and the concept of phase space reconstruction [9]. They investigated the underlying evolutionary patterns of complex systems from a macroscopic perspective, decoupled and simplified the intricate relationships in water quality into a univariate system, and successfully made short-term predictions of water quality using a chaotic phase space model. Statistical methods have evolved since the beginning of the last century. The DO-BOD coupled water quality model, proposed by Streeter and Phelps in 1925, is considered the earliest statistical approach for water quality prediction modeling [10]. Another notable model is the ARIMA (Autoregressive Integrated Moving Average) model, proposed by Box and Jenkins for time-series forecasting in the early 1970s. However, the ARIMA method has certain limitations.
It requires a large amount of data, cannot handle nonlinear trends or multivariate inputs, and requires the time-series data to be stationary or to become stationary after differencing. Other statistical models, such as the Soil and Water Assessment Tool (SWAT) model and the Hydrological Simulation Program-Fortran (HSPF) model [11,12,13], also typically rely on a substantial amount of measured data for support and involve complex computational processes with relatively low simulation accuracy [14,15]. In recent years, machine learning algorithms have become prevalent in the domain of water quality prediction [16]. On the one hand, neural network models exhibit strong nonlinear expression capabilities; on the other hand, they offer inherent benefits in adaptivity. Nevertheless, due to the escalating volume of data, conventional neural network architectures prove inadequate for highly intricate datasets [17,18]; consequently, their performance diminishes significantly when confronted with extensive water quality data. In a study conducted by Ta et al. [19], a simplified Convolutional Neural Network (CNN) was employed to forecast the concentration of DO. The performance of the CNN model was compared with that of the BPNN model, and the CNN exhibited superior effectiveness and stability in predicting water quality. Meanwhile, other researchers have proposed the utilization of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) algorithms [20,21]. These models can extract time-series features related to water quality and make predictions [22]. The LSTM algorithm has been shown to effectively predict water quality data with long time-series characteristics [23,24].
However, this method requires more data [25], and the obtained results are unsatisfactory: the model lacks parallelism and is difficult to deploy in practice. Therefore, Zou Qinghong [26] searched for dependency relationships in water quality data across multiple time scales. Additionally, Siami-Namini et al. utilized BiLSTM’s robust capacity to capture and store information for forecasting water quality [27]. Their findings also confirmed the effectiveness of the BiLSTM model in extracting water quality information: the experiments demonstrated that the BiLSTM model outperformed both the LSTM model and other conventional models in prediction accuracy, especially in the field of water quality prediction. In addition, a research group at the Institute of Automation, Chinese Academy of Sciences, proposed the Informer model for time-series prediction based on the Transformer. Despite the model’s limited parallel processing capabilities, difficulty in stabilizing training, and high computational requirements, it validates the potential value of Transformer-like models in capturing individual long-term dependencies between the outputs and inputs of long time series.
Based on the aforementioned information, the objective of this study was to construct a water quality prediction model using a Linear model approach. The modeling process of a Linear model is efficient, as it does not require complex computations, and remains fast even when dealing with large datasets. The proposed model aims to address the aforementioned issues, and it was tested using the Huangyang Reservoir as a case study. Furthermore, it is anticipated that the model can be extrapolated to the entire country. An extensive analysis was conducted on the water quality of the Huangyang Reservoir, covering three key water quality monitoring indicators: DO, Turbidity, and pH. Next, to enable real-time predictions of the three water quality indicators, we employed three comparison techniques (ARIMA, LSTM, and Informer) to construct an optimal model capable of accurately predicting water quality. A Linear-based model was chosen as the baseline model for comparison with the deep learning models. Calculations were subsequently conducted using data collected from 1 January 2022, at 0:00, to 12 July 2023, at 10:00, encompassing a duration of approximately one and a half years. The comparison of model prediction accuracy was based on the evaluation metrics of mean absolute error (MAE) and mean square error (MSE).

2. Materials and Methods

The aim of this study was to elucidate the fundamental variables at various time points in order to evaluate the future changes in water quality in the Huangyang Reservoir.
The Huangyang Reservoir is situated in Gansu Province. The dam is specifically located in Liangzhou District of Wuwei City, which is in the eastern part of the West Corridor. The selected dam site is located on the Huangyang River, about 50 km southeast of Wuwei City. The reservoir control basin covers an area of 828,000 square meters and serves as a medium-sized annual regulation reservoir primarily utilized for irrigation purposes. It also provides benefits such as flood control, power generation, and tourism, making it a comprehensive and multifunctional facility. In recent years, the problem of water quality pollution in the Huangyang Reservoir has worsened significantly because of the increase in urban pollution. Water pollution originates from three primary sources: firstly, domestic sewage discharged by urban areas along rivers; secondly, industrial effluents released by nearby factories; and thirdly, rainwater that carries air and surface pollution into the river.

2.1. System Framework

The objective of this study was to employ various deep learning models to simulate and analyze the spatial and temporal changes in pH, Turbidity, and DO. This study aims to enhance the precision of water quality prediction by comprehensively considering the temporal characteristics, variability, and chaotic nature of water quality data. To achieve this objective, Figure 1 summarizes the five essential steps of the proposed model, spanning data acquisition, preprocessing, model construction, and evaluation.

2.2. Data and Data Processing

2.2.1. Data Collection and Processing

  • Data Acquisition: The model test data for this study consisted of hourly measurements of water quality indicators in the Huangyang Reservoir, Wuwei City, from 1 January 2017, to 1 October 2023. These indicators include pH value, Turbidity, and DO in the water body, among others.
  • Selection of Research Subjects: Upon analyzing the water quality data and comparing the data with the “Environmental Quality Standards for Surface Water”, it was observed that the DO, Turbidity, and pH content in the Huangyang Reservoir, located in Wuwei City, had the most significant influence on its water quality classification. Therefore, this study focused on evaluating the water quality of the Huangyang Reservoir in Wuwei City by examining the changes in its DO, Turbidity, and pH levels during the study period.
  • Standardization: Different evaluation indicators often have different quantitative scales and units, which can impact the results of data analysis. To mitigate the influence of these differences in quantitative scales, data standardization is necessary to ensure the comparability of data indicators. By standardizing the raw data, the indicators are brought to the same scale, facilitating comprehensive comparative evaluation. Therefore, the DO, Turbidity, and pH values are standardized. This process involves dividing the raw data by the standard deviation after subtracting the mean value, resulting in data that are transformed into a distribution with a mean of 0 and a standard deviation of 1. The standardized processing formula is as follows:
X_{\text{norm}} = \frac{X - \mu}{\delta}
where X represents the original data, μ represents the mean, δ represents the standard deviation, and X_norm represents the normalized value.
  • Dataset Division: Sample data are divided into a training set, testing set, and validation set in a 7:2:1 ratio. Specifically, the initial 70% of the sample data are allocated for model training, the subsequent 20% are designated for testing the model, and the remaining 10% are utilized for validating the model’s performance.
  • Data Reduction: Data reduction is a crucial step in the research process. During the evaluation of the model at the conclusion of the training process, the normalized data are subjected to inverse normalization to assess the error in the model’s predicted values. The inverse normalization formula is as follows:
X = X_{\text{norm}} \cdot \delta + \mu
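As a concrete illustration, the standardization, 7:2:1 chronological split, and inverse normalization described above can be sketched in a few lines of NumPy (a minimal sketch with synthetic DO readings, not the authors' code):

```python
import numpy as np

def standardize(x):
    """X_norm = (X - mu) / delta, giving mean 0 and standard deviation 1."""
    mu, delta = x.mean(), x.std()
    return (x - mu) / delta, mu, delta

def inverse_standardize(x_norm, mu, delta):
    """Inverse normalization: X = X_norm * delta + mu."""
    return x_norm * delta + mu

def split_series(data, ratios=(0.7, 0.2, 0.1)):
    """Chronological 7:2:1 split into training, testing, and validation sets."""
    n_train = int(len(data) * ratios[0])
    n_test = int(len(data) * ratios[1])
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])

# Synthetic hourly DO readings (illustrative values, not reservoir data)
do = np.array([7.2, 7.5, 6.9, 7.1, 7.8, 7.3, 7.0, 7.4, 7.6, 7.2])
do_norm, mu, delta = standardize(do)
train, test, val = split_series(do_norm)
```

Applying the inverse transform with the training-set μ and δ restores predictions to their physical units for error assessment.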

2.2.2. Data Padding

Our dataset potentially includes missing values and outliers due to factors such as equipment failures, station maintenance, and technical complexity. The original dataset contains some null values, and the presence of incomplete data can significantly impact the accuracy of the model’s predictions. If eliminated directly, this action results in a reduction in the available data and can also lead to inadequate model training. Considering the significant correlation observed among the parameter data, the k-means algorithm has been selected as the preferred method for analysis in this study.
The k-means algorithm is a fundamental clustering algorithm with a predefined number of clusters [28,29]. The algorithm utilized in this study is a traditional distance-based clustering approach that uses distance as a metric to assess similarity: objects separated by shorter distances are considered more similar. The algorithm groups objects by proximity, with the ultimate objective of obtaining compact and independent clusters. The measurement is conducted using the Euclidean distance metric. The algorithm can process and manage extensive datasets with high efficiency [30]. The inputs of this algorithm are a dataset and the number of categories. The Euclidean distance formula is expressed in Equation (3):
\text{dist}(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}
In Equation (3), X and Y denote two samples, i indexes the features, and n represents the number of features.
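To illustrate how distance-based k-means clustering can fill null values, the sketch below assigns each incomplete sample to its nearest centroid (nearness judged on the observed features only) and fills the gaps with the centroid values. The cluster count, initialization, and iteration budget are illustrative assumptions, as the study does not report them:

```python
import numpy as np

def euclidean(x, y):
    """Equation (3): dist(X, Y) = sqrt(sum_i (X_i - Y_i)^2)."""
    return np.sqrt(np.sum((x - y) ** 2))

def kmeans_impute(data, k=2, n_iter=20, seed=0):
    """Cluster the complete rows with k-means, then fill each missing entry
    with the corresponding value of the nearest centroid."""
    rng = np.random.default_rng(seed)
    complete = data[~np.isnan(data).any(axis=1)]
    centroids = complete[rng.choice(len(complete), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.array([np.argmin([euclidean(row, c) for c in centroids])
                           for row in complete])
        # keep the old centroid if a cluster happens to be empty
        centroids = np.array([complete[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    filled = data.copy()
    for row in filled:                      # rows are views; edits stick
        mask = np.isnan(row)
        if mask.any():
            dists = [euclidean(row[~mask], c[~mask]) for c in centroids]
            row[mask] = centroids[int(np.argmin(dists))][mask]
    return filled

# Example: the fourth sample has a missing second reading
data = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.1], [5.1, np.nan]])
filled = kmeans_impute(data, k=2)
```
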

2.3. Assessment Metrics

A comparative analysis was conducted to evaluate the accuracy of model predictions using two metrics: the mean absolute error (MAE) and the mean squared error (MSE).
The MSE and MAE are commonly employed metrics for evaluating the error of a model. Lower values of MSE and MAE indicate a higher level of prediction accuracy. The MSE and MAE formulas are as follows:
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
where N represents the total number of samples in the test set, while y_i and \hat{y}_i denote the true and predicted values, respectively.
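Both metrics translate directly into code; a minimal NumPy version of the formulas above, with illustrative values:

```python
import numpy as np

def mae(y_true, y_pred):
    """MAE = (1/N) * sum_i |y_i - y_hat_i|"""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """MSE = (1/N) * sum_i (y_i - y_hat_i)^2"""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0])   # illustrative values
y_pred = np.array([1.0, 2.0, 4.0])
```
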

2.4. Model Building

The ARIMA model is a popular and widely used statistical method for time-series forecasting. It attempts to extract the time-series patterns hidden within the data through autocorrelation and differencing, using these patterns to forecast future data. However, the ARIMA model does not perform well with non-smooth and nonlinear data. To address this issue, deep learning methods have been introduced for time-series prediction. Traditional RNNs have limitations in handling long-term dependencies, which LSTM can largely overcome. Nevertheless, LSTM may lose features from long time series, limiting its ability to extract features effectively. To enhance the efficiency of predicting long time sequences, the Informer model has been developed. Informer is based on the attention mechanism of the Transformer, ensuring effective feature learning while reducing computational complexity through probabilistic sparsification. This technique enhances the sequence prediction speed and accuracy of the model. However, Informer only utilizes positional encoding and labeled embedded subsequences, which, although helpful in retaining some ordering information, may lead to the loss of temporal information due to the permutation invariance of the self-attention mechanism. To address these problems, the LTSF-Linear model can be used for multi-step prediction to achieve the same predictive effect without the need for complex computations, remaining fast even when handling large datasets.

2.4.1. Construction of a Water Quality Prediction Model Based on the ARIMA Model

The normalized time series exhibits consistent patterns regardless of the specific point in time at which it is observed. Three parameters, namely p, d, and q, describe the three main components of an ARIMA model. There are multiple approaches for transforming a non-stationary time series into a stationary one; one of them is differencing, whose order is the parameter d. With differencing, the ARMA model can be extended to the ARIMA (p, d, q) model. The process of time-series analysis involves three main steps: model identification, model estimation, and model testing. It is crucial to select a model that is suitable for the specific time-series data at hand.
In the autoregressive (AR) model, the variable of interest is predicted using past values of the variables. This model is suitable for data that exhibit smoothness. On the other hand, the moving average (MA) model represents a weighted average of past prediction errors, similar to a regression model. Equation (7) represents the AR model, while Equation (8) represents the MA model:
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}
The identification of the ARIMA model is determined by analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) for the difference order (d), autoregressive order (p), and moving average order (q). After identifying the appropriate ARIMA (p, d, q) model, it is imperative to estimate the chosen model and assess its suitability through testing. In this study, the quality of the statistical model is assessed using the Akaike Information Criterion (AIC) as a testing method (Equation (9)). The AIC is a valuable tool for comparing different estimated models, with lower values indicating better model quality. The time series model that has been developed has the capability to accurately forecast and identify anomalies by incorporating temporal features in the provided data. In this paper, p, d, and q are 0, 1, and 0, respectively.
AIC = -2 \log(L) + 2(p + q + k + 1)
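With the orders used in this paper (p = 0, d = 1, q = 0), the model reduces to a random walk: the differenced series is treated as zero-mean noise, so every forecasted difference is 0 and the point forecast stays flat at the last observed value. A minimal sketch of this special case (not the authors' implementation; the sample values are illustrative):

```python
import numpy as np

def difference(series, d=1):
    """Apply d-th order differencing to induce stationarity."""
    for _ in range(d):
        series = np.diff(series)
    return series

def arima_forecast_010(series, steps):
    """ARIMA(0, 1, 0): with p = q = 0, each forecasted difference is 0,
    so the forecast repeats the last observed value (a random walk)."""
    return np.full(steps, series[-1], dtype=float)

def aic(log_likelihood, p, q, k=1):
    """Equation (9): AIC = -2 log(L) + 2(p + q + k + 1)."""
    return -2.0 * log_likelihood + 2 * (p + q + k + 1)

series = np.array([7.2, 7.3, 7.5, 7.4])   # illustrative DO readings
diffs = difference(series)
forecast = arima_forecast_010(series, steps=3)
```

For nonzero p and q, the AR and MA coefficients would be estimated from the data, and candidate orders compared via their AIC values.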

2.4.2. Construction of a Water Quality Prediction Model Based on the LSTM Model

The fundamental RNN has a self-connected hidden-layer structure, distinguishing it from conventional neural networks. The RNN updates the hidden-layer state at the current time step by incorporating the hidden-layer state from the previous time step, which makes it well suited to time-series data. However, as the length of the time series increases, training the RNN becomes challenging because early time-series information is ‘forgotten’: gradients either vanish or explode, which makes the training process difficult. The introduction of LSTM addresses this limitation of the RNN to some extent [31]. The hidden layer of LSTM maintains a self-connected structure, but LSTM carries two types of information from the previous time step, namely the cell state and the hidden-layer state. It achieves this through a gating structure consisting of the ‘forgetting gate’, ‘input gate’, and ‘output gate’, which control the transmission and updating of both the cell state and the hidden-layer state. The LSTM hidden-layer structure is shown in Figure 2, where C_{t−1} and C_t are the cell states at times t − 1 and t, respectively; C̃_t is the candidate update at time t; h_{t−1} and h_t are the hidden-layer states at times t − 1 and t, respectively; X_t is the input at time t; σ is the Sigmoid function; and f_t, i_t, and o_t are the control coefficients of the ‘forgetting gate’, ‘input gate’, and ‘output gate’, respectively.
The ‘forgetting gate’ uses f_t to determine how much of the cell state C_{t−1} is retained at time t. f_t is a value between 0 and 1, computed from X_t and h_{t−1} as inputs. The closer f_t is to 0, the more information is removed from C_{t−1}; the closer f_t is to 1, the more information is retained. The control coefficients i_t and o_t behave similarly. The ‘input gate’ determines which information is added to C_t and works together with the ‘forgetting gate’ to determine the update of C_t. The ‘output gate’ controls the output h_t of the hidden-layer state at time t. The formulas of the LSTM network are as follows:
f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)
i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)
C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t
o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)
h_t = o_t \cdot \tanh(C_t)
In the given formulas, W_f, W_i, W_C, and W_o represent weight matrices, while b_f, b_i, b_C, and b_o represent biases.
In this paper, the optimal parameters were determined after numerous experiments. The parameter settings in this model are as follows: the number of training epochs is 500, the batch size is 64, the initial learning rate is set to 0.001, the RMSProp decay coefficient is 0.8, and the error function used is the mean square error.
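The gate formulas above translate directly into code. The following NumPy sketch runs a single LSTM cell over a short sequence with small illustrative random weights (the trained model's dimensions and weights are not reproduced here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate formulas above. Each W[...] maps
    the concatenated [h_{t-1}, X_t] to one gate's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forgetting gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                  # hidden-layer state output
    return h_t, c_t

# Tiny example: 1 input feature, 2 hidden units, illustrative random weights
rng = np.random.default_rng(0)
W = {g: 0.1 * rng.normal(size=(2, 3)) for g in "fico"}
b = {g: np.zeros(2) for g in "fico"}
h, c = np.zeros(2), np.zeros(2)
for x in [0.5, -0.2, 0.7]:                    # a short input sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
```
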

2.4.3. Construction of a Water Quality Prediction Model Based on the Informer Model

The Informer model is a sequence prediction model that utilizes an autoregressive transformation mechanism [32]. It is primarily used for predicting time-series data. The main principles of this concept are outlined below. Its structure is shown in Figure 3.
Autoregressive Transformation Mechanism: The Informer model initially takes the time-series data as input for autoregressive transformation. Autoregressive transformation involves partitioning the original time-series data into subsequences of a fixed length, denoted as L. This process effectively converts the time-series data into text data, resembling the historical data used in natural language processing (NLP) tasks. By employing this approach, the initial task of time-series prediction can be transformed into a prediction problem focused on forecasting the subsequent time step.
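The fixed-length partitioning described above is a standard sliding-window transformation; a minimal sketch (the window lengths here are illustrative, not the paper's settings):

```python
import numpy as np

def make_windows(series, input_len, pred_len):
    """Slide a fixed-length window over the series, producing (input, target)
    subsequence pairs: the model sees input_len steps and predicts the
    following pred_len steps."""
    X, Y = [], []
    for i in range(len(series) - input_len - pred_len + 1):
        X.append(series[i:i + input_len])
        Y.append(series[i + input_len:i + input_len + pred_len])
    return np.array(X), np.array(Y)

series = np.arange(12)                       # a toy series 0..11
X, Y = make_windows(series, input_len=4, pred_len=2)
```
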
Encoder–Decoder Structure: The Informer model employs an encoder–decoder structure to facilitate sequence prediction. Both the encoder and decoder components comprise multiple layers of self-attention mechanisms, fully connected layers, and regularization layers. The encoder and decoder are equipped with a cross-time-step interaction layer, which helps to capture the interdependencies among various time steps.
The Informer model incorporates a self-attention mechanism to capture the interdependencies among various positions within the input sequence. The mechanism employed in this study involves the mapping of each position in the input sequence to a low-dimensional attention vector. Subsequently, the similarity between these attention vectors is calculated to determine the degree of correlation between different positions.
Interaction Layer across Time Steps: The Informer model effectively captures the interdependencies between different time steps by incorporating an interaction layer across time steps. This layer computes the similarity between the attention vectors at the current time step and the attention vectors at previous time steps in order to generate a temporal association matrix. This facilitates the transfer and interaction of information across different time steps.
The Informer model incorporates a multi-scale training approach to address prediction tasks across various temporal scales. By segregating the prediction problems based on distinct time scales, it is possible to achieve more precise capture of both long-term and short-term patterns in time-series data. In summary, the Informer model demonstrates its efficacy in effectively addressing the task of time-series data prediction.
This is achieved through the utilization of various techniques, including autoregressive transformation; the use of an encoder–decoder structure, a self-attention mechanism, and a cross time-step interaction layer; and multi-scale training.
In this paper, the optimal parameters were determined after numerous experiments. The parameter settings in this model are as follows: the hidden-layer feature dimension is 624; the number of attention heads is 14; there are 6 encoder and 4 decoder stacking layers, respectively; the query sampling factor is set to 5; the dimension of the fully connected layer is 2048; the number of training epochs is 400; the batch size is 32; and the learning rate is 0.0001.

2.4.4. Construction of a Water Quality Prediction Model Based on the Linear Model

In the context of Transformer-based LTSF solutions, it is worth noting that all the non-Transformer baselines used for comparison are IMS (iterated multi-step) prediction techniques, which are known to exhibit substantial error accumulation. We hypothesize that the observed performance enhancements in these studies can be primarily attributed to the DMS (direct multi-step) strategy employed in them.
To examine the validity of this hypothesis, we adopt the most straightforward DMS model, a single temporal linear layer known as LTSF-Linear, which serves as a baseline for comparison. The fundamental structure of LTSF-Linear is a direct regression of the historical time series to forecast the future, achieved through a weighting and summing operation, as illustrated in Figure 4. The mathematical expression is \hat{X}_i = W X_i, where W \in \mathbb{R}^{T \times L} represents a linear layer along the time axis, and \hat{X}_i and X_i are the predictions and inputs for each variable [33]. It should be noted that LTSF-Linear, as a modeling approach, shares weights across variables and does not account for any spatial correlation.
In this paper, the optimal parameters were determined after numerous experiments. The parameter settings in this model are as follows: the linear layer consists of 2 layers, the number of training epochs is 400, the batch size is 64, the initial learning rate is set to 0.0001, the ReLU activation function is utilized, and the loss function is the mean square error, optimized using the Adam optimizer.
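The LTSF-Linear forecast is a single matrix product along the time axis. The sketch below applies an illustrative weight matrix W of shape (T, L), hand-crafted to repeat the last observation purely for demonstration; in practice W is learned from the training data:

```python
import numpy as np

def ltsf_linear_predict(history, W):
    """Direct multi-step (DMS) forecast X_hat = W @ X, with W of shape (T, L)
    mapping L past steps straight to T future steps; the same W is shared
    across variables, with no spatial correlation modeled."""
    return W @ history

# Illustrative hand-crafted W that simply repeats the last observation
L_in, T_out = 6, 3
W = np.zeros((T_out, L_in))
W[:, -1] = 1.0
history = np.array([7.1, 7.2, 7.3, 7.2, 7.4, 7.5])
forecast = ltsf_linear_predict(history, W)
```

Because the forecast is one linear map, inference cost and memory stay low even for long input and output horizons.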

2.4.5. Modeling Comparison

A comparison of the various prediction methods is shown in Table 1, considering criteria such as methodology and proposed modeling capabilities. We conducted a comparison of the scalability and computational burden of the analyzed methods, focusing on the dimension of the case study. This comparison demonstrates that the utilization of the complex network approach is advantageous for modeling a greater number of parameters as it imposes a reduced computational burden compared to alternative methods.

3. Results

The four models were trained separately using the preprocessed dataset from the Huangyang Reservoir. For each model, two variants were trained, with output horizons of 96 time steps (sixteen days at four-hour steps) and 336 time steps (fifty-six days at four-hour steps), respectively. The results are shown in Figure 5, Figure 6 and Figure 7, and the detailed results are shown in Table 2. The purpose of training the models with different output time steps is to verify the effectiveness of the models for long time-series prediction and to analyze the effect of different output time steps on the models. After the model training was completed, the divided test set of water quality data of the Huangyang Reservoir was used as input to the trained models for prediction and compared with real data values obtained from the National Automatic Integrated Water Quality Supervision Platform.
From the table above, it can be seen that the Linear model has the smallest values of MSE and MAE, and the Linear model has reduced the MSE by 16.8% and the MAE by 17.3% for the short-term prediction of DO concentration (time step 96; 16 days) compared to the best results of the other models. The Linear model has reduced the MSE by 9.7% and the MAE by 3.7% for the long-term prediction of DO concentration (time step 336; 56 days). The MSE for the short-term prediction of pH concentration by the Linear model decreased by 2.9%, and the MAE decreased by 23.9%. The MSE for the long-term prediction of pH concentration decreased by 1.6%, and the MAE decreased by 4.9%. The MSE for the short-term prediction of Turbidity concentration by the Linear model decreased by 15.4%, and the MAE decreased by 10.2%. The MSE for long-term prediction of Turbidity concentration was reduced by 6.5%, and the MAE was reduced by 3.1%. This leads to the conclusion that the Linear model is the most effective and that it has good applicability.

4. Discussion

The predictions of the four models for DO, pH, and turbidity in the Huangyang Reservoir all reflected the trend of change; prediction accuracy was higher at the beginning and deteriorated as the prediction time step grew. In the later prediction period, the differences among the four models’ predictions of water quality index concentrations become larger: the ARIMA model predicts the trend poorly, and the information it provides is of little value. The Informer model shows a clearly higher degree of overlap with the actual data than the LSTM model in both short- and long-term prediction, so the Informer model is significantly better than the LSTM model. Compared to the other three models, in terms of the overlap between predicted and actual values, the Linear model works better and is of great reference value.

4.1. Discussion of LSTM and Informer Results

LSTM-based time-series prediction is easy to understand, since the prediction at time t depends on the output of the previous moment, t − 1. However, as the length of the time series increases, the speed of prediction decreases and convergence becomes more difficult. The reason for this difficulty is that LSTM uses a backpropagation algorithm to optimize the sequence by calculating the loss function. With longer sequences, computing the gradient becomes more challenging, resulting in slower convergence and poorer performance.
The Informer model is built upon the Transformer architecture, which uses an attention mechanism to relate the current time point to previous ones. Because the attention computation is parallel, the model achieves a significant improvement in speed.

4.2. Discussion of Linear and Informer Results

In time-series forecasting, the order of the sequence often carries significant information. We contend that, despite its positional and temporal embeddings, Informer still loses part of this temporal information. The Linear model, although simple, has several noteworthy properties. Because the path between any input and output is short, it captures both short-term and long-term temporal dependencies effectively. The model consists of at most two linear layers and therefore uses less memory, has far fewer parameters, and runs faster at inference than Informer. After training, the weights of the seasonal and trend branches can be inspected visually, which provides useful insight into the predicted values. Moreover, the Linear model can be trained readily without tuning hyperparameters. In terms of performance, its average error reduction surpasses that of the Informer-based method in all scenarios.
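A minimal sketch of such a decomposition-based linear forecaster, in the spirit of the LTSF-Linear family of Zeng et al. (AAAI 2023), makes the seasonal and trend branches concrete. This is an illustration with randomly initialized (untrained) weights; the kernel size, variable names, and initialization are our assumptions, not the paper's code:

```python
import numpy as np

# A moving average extracts the trend, the remainder is the seasonal part,
# and each branch is a single linear map from the look-back window (length L)
# to the forecast horizon (length H).
L, H = 96, 24      # look-back window and prediction horizon (illustrative)
kernel = 25        # moving-average window for trend extraction (assumed)

def decompose(x):
    pad = kernel // 2
    xp = np.pad(x, (pad, pad), mode="edge")
    trend = np.convolve(xp, np.ones(kernel) / kernel, mode="valid")
    return x - trend, trend          # seasonal part, trend part

rng = np.random.default_rng(0)
W_season = rng.normal(scale=1 / L, size=(H, L))  # learnable in practice
W_trend = rng.normal(scale=1 / L, size=(H, L))   # learnable in practice

def forecast(x):
    # x: the last L observations of one water-quality series
    seasonal, trend = decompose(x)
    return W_season @ seasonal + W_trend @ trend

y = forecast(rng.normal(size=L))
print(y.shape)  # → (24,)
```

After training, `W_season` and `W_trend` can be plotted directly, which is what makes the weight inspection mentioned above possible.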

4.3. Discussion of ARIMA, LSTM, Informer, and Linear Model Results

The MSE and MAE values were generally low for all four models, so the ARIMA, LSTM, Informer, and Linear models can all be said to predict water quality effectively. The Linear model improved on the baselines for nearly all water quality measurements, showing a clear advantage over the other models. On average, its MAE was approximately 0.8% lower than that of the LSTM model and 0.4% lower than that of the ARIMA model. The Informer model performed slightly better than the LSTM model across all time series, which suggests that it captures nonlinear information more effectively. For pH prediction, the MSE analysis shows the Informer model only marginally ahead of the LSTM model; the two perform very similarly. For turbidity prediction, the ARIMA model ranked second. Overall, the Linear model achieved the highest prediction accuracy, indicating that it is effective in improving short-term water quality prediction in the Huangyang Reservoir and is also better suited to long-term water quality prediction there, as well as elsewhere in the country.

5. Conclusions

Surface water resources are crucial for human survival, as they directly supply purified water for daily production and consumption. Accurately predicting the levels of key indicators is therefore essential for managing pollution in the watershed. The data presented in this paper were water quality measurements from the Huangyang Reservoir in Wuwei City, and the study focused on three key parameters: DO, pH, and Turbidity. Prediction with a model including a Linear neural network yielded better results than the other models, with average reductions in MSE and MAE of about 8.55% and 10.51%, respectively, compared to the best of the competing models. The experiments demonstrate that the Linear model is appropriate for predicting water quality parameters in the Huangyang Reservoir; it can improve prediction performance and could help shift the focus of water pollution management from post-treatment to prevention. This finding introduces a novel approach to predicting water quality parameters in the reservoir. However, the Linear model has limited modeling capacity and provides only a simple but competitive baseline for future research. In addition, future water quality prediction will rely not only on traditional monitoring data but will also integrate geographic information system, meteorological, hydrological, and other multi-source data. Such data exhibit clear nonlinearity, stochasticity, and chaotic behavior, making them better suited to nonlinear methods, which can use them to analyze the factors influencing water quality more comprehensively and thereby enhance the accuracy and reliability of predictions.

Author Contributions

Article structure and experimental methods, X.W. and J.C.; data collection and organization, J.C., Y.L., Z.L. and Z.B.; data processing and experimental operation, J.C. and Y.L.; writing—original draft preparation, J.C.; writing—review and editing, J.C., Z.L., C.Z. and Z.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 12205241), the Natural Science Foundation of Gansu Province (No. 20JR10RA115), the Fundamental Research Funds for the Central Universities (Nos. 31920220049 and 31920230138), and the Higher Education Innovation Fund Project of Gansu Province (No. 2022B-074).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because this article forms part of the authors' ongoing follow-up research.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. System framework.
Figure 2. Structure of hidden layer in LSTM model.
Figure 3. An overview of the Informer model.
Figure 4. Illustration of the basic linear model.
Figure 5. Comparison of DO for each model at different time steps.
Figure 6. Comparison of pH for each model at different time steps.
Figure 7. Comparison of Turbidity for each model at different time steps.
Table 1. Advantages and disadvantages of forecasting methods.
ARIMA. Advantages: the model is simple, relying solely on endogenous variables without any exogenous inputs. Drawbacks: the time-series data must be stationary, or made stationary through differencing.
LSTM. Advantages: the gate mechanism of its long-term memory function partially addresses gradient explosion and gradient vanishing. Drawbacks: training cannot be parallelized, and execution in practice is slow.
Informer. Advantages: significant potential for capturing long-range dependencies. Drawbacks: high memory usage and the limitations inherent in the encoder–decoder architecture.
Linear. Advantages: modeling is efficient, requires no complex computation, and remains fast even on large datasets. Drawbacks: the model fits nonlinear data poorly, and a preliminary analysis is needed to establish the linear relationship between variables.
Table 2. Univariate long-sequence time-series forecasting results (the best result in each row, lowest MSE and MAE, is that of the Linear model).

Indicator  Step   Linear             Informer           LSTM               ARIMA
                  MSE      MAE       MSE      MAE       MSE      MAE       MSE      MAE
DO         96     0.05479  0.15481   0.06586  0.18727   0.07077  0.17953   0.25039  0.49658
           336    0.10412  0.25121   0.11539  0.26084   0.20794  0.32633   0.54285  0.69872
pH         96     4.49666  0.87898   4.63165  1.15597   5.84864  1.47656   6.85472  3.54125
           336    5.67115  1.30048   5.76526  1.36833   5.77852  1.43615   7.85423  3.58954
Turbidity  96     0.63564  0.37395   1.66478  0.58936   0.75222  0.41665   0.94851  0.56314
           336    1.38009  0.56380   1.47716  0.58168   1.58312  0.58772   1.65789  0.78954

Share and Cite

MDPI and ACS Style

Chen, J.; Wei, X.; Liu, Y.; Zhao, C.; Liu, Z.; Bao, Z. Deep Learning for Water Quality Prediction—A Case Study of the Huangyang Reservoir. Appl. Sci. 2024, 14, 8755. https://doi.org/10.3390/app14198755
