Article

Research on Water Quality Prediction Model Based on Spatiotemporal Weighted Fusion and Hierarchical Cross-Attention Mechanisms

Wuxi Ecological Environmental Monitoring and Control Center, Wuxi 214121, China
* Author to whom correspondence should be addressed.
Water 2025, 17(9), 1244; https://doi.org/10.3390/w17091244
Submission received: 13 March 2025 / Revised: 15 April 2025 / Accepted: 16 April 2025 / Published: 22 April 2025
(This article belongs to the Section Water Quality and Contamination)

Abstract:
In the context of drinking water safety assurance, water quality prediction faces challenges due to temporal fluctuations, seasonal cycles, and the impacts of sudden events. To address the cumulative prediction bias caused by the simplistic feature fusion of traditional methods, this study proposes a neural network architecture that integrates spatiotemporal features with a hierarchical cross-attention mechanism. Innovatively, the model constructs a parallel feature extraction framework, integrating BiGRUs (Bidirectional Gated Recurrent Units) and BiTCNs (Bidirectional Temporal Convolutional Networks). By incorporating a bidirectional spatiotemporal interaction mechanism, the model effectively captures long-term dependencies in time series and local associations in spatial topology. During the feature fusion phase, layer-by-layer weighting through learnable parameters enables adaptive spatiotemporal feature processing. A hierarchical cross-attention module is designed to achieve deep feature integration, enhancing the discriminative expression of spatial features while preserving the dynamics of time series. The experimental results demonstrate that when predicting water quality monitoring data from the Xidong Water Plant, this model excels in forecasting key indicators such as total phosphorus and total nitrogen. Compared to traditional hybrid models, it reduces the MSE (Mean Squared Error) by 33.35%, the MAE (Mean Absolute Error) by 38.05%, and the RMSE (Root Mean Square Error) by 19.35%, and increases the R2 (coefficient of determination) by 2.15 percentage points. These results overcome the limitations of the rigid, simplistic feature fusion used by traditional methods, fully demonstrating the model’s superiority in prediction accuracy and generalization capability.

1. Introduction

Water is a crucial resource upon which all terrestrial life depends. China’s per capita water resources are less than a quarter of the global average [1]. Human activities inevitably impact the aquatic environment in various ways [2]. As industrialization and urbanization accelerate, various pollutants are continually discharged into water bodies, making water quality assessment increasingly complex and variable. Indicators such as TN (total nitrogen), TP (total phosphorus), and CODMn (the permanganate index) are key drivers of eutrophication in water bodies [3]. By predicting key indicators, early warning signals can be provided regarding the water quality of drinking water sources. This helps to optimize water treatment processes, adjust the dosage of chemicals, and enhance the efficiency of water treatment. Therefore, conducting precise and accurate water quality monitoring and prediction is crucial for ensuring drinking water safety, resource conservation, and environmental management [4].
Although ARIMA-based models have demonstrated statistical validity [5], they lack nonlinear modeling capabilities and prediction error benchmarks, highlighting the necessity of the more expressive and performance-driven learning methods proposed in our research. Despite the relative success of the improved Bayesian network [6], a 21% error rate indicates that there is still significant room for improvement in performance, especially in terms of generalization across different water bodies and capturing real-time fluctuations. Some existing studies have combined traditional water quality prediction algorithms with neural networks to address the nonlinearity issue in water quality time series prediction. Jadhav, A. R. et al. proposed the use of ANNs (Artificial Neural Networks) [7] for predictions in water and wastewater treatment systems, but the model showed a poor generalization ability. Gao Y et al. [8] introduced a Particle Swarm Optimization algorithm to enhance the performance of the LS-SVM (Least Squares Support Vector Machine) model in predicting leaf water potential, but the model suffers from high computational complexity when dealing with large-scale data.
The variation in water quality parameters is typically nonlinear, dynamic, and complex. Recurrent Neural Networks (RNNs) are capable of effectively capturing long-term dependencies in time series data and have achieved significant results in time series forecasting tasks. Sabri et al. [9] proposed a photovoltaic power prediction model based on Long Short-Term Memory (LSTM) autoencoders. In this architecture, the LSTM encoder is employed to extract temporal features from the input sequence, while the decoder reconstructs the feature representations and performs photovoltaic power forecasting. Hu, Z. et al. [10] introduced a novel water quality prediction method based on deep LSTM learning networks to forecast parameters such as pH and water temperature. Chen, L. et al. [11] addressed the challenges associated with the complex and low-accuracy prediction of dissolved oxygen in indoor aquaculture environments by proposing a VMD-CNN-BiGRU (Variational Mode Decomposition–Convolutional Neural Network–Bidirectional Gated Recurrent Unit) model augmented with an attention mechanism. Their approach provides a robust framework for monitoring and early warning control in aquaculture water quality systems. Yan, J. et al. [12] used a 1-DRCNN (1D Residual Convolutional Neural Network) and Bidirectional Gated Recurrent Units to extract the latent features of water quality parameters, improving the accuracy of water quality prediction. Bi, J. et al. [13] proposed a hybrid model based on LSTM autoencoder neural networks and Savitzky–Golay filters for water quality time series prediction. Niu D and Yu M et al. [14] developed an optimized CNN-BiGRU model, utilizing an attention mechanism for short-term multi-energy load forecasting. This model employs CNN to extract features from time series data, while BiGRUs are used to capture data dependencies. Additionally, Sun and Jin et al. 
[15] introduced a CNN-TCN (Convolutional Neural Network-Temporal Convolutional Network) architecture in the same year. This architecture enhances long-term temporal correlations using a multi-head attention mechanism and retrieves more significant information from various time points. Furthermore, Yuan J, Li Y, and colleagues [16] proposed a novel CA-TCN-BiGRU model for multi-parameter water quality prediction, integrating a channel attention mechanism, a Temporal Convolutional Network, and a Bidirectional Gated Recurrent Unit. These innovations represent significant advancements in the field of predictive modeling, addressing complex dependencies and improving prediction accuracy across various applications.
Among the many derivatives of RNNs (Recurrent Neural Networks), Gated Recurrent Units stand out due to their simple yet efficient gating mechanisms, which have shown broad applicability across various applications [17]. In response to the long-term dependencies and complexities of water quality time series data, this paper proposes the use of BiTCNs to capture the global spatial features in water quality data. Through bidirectional convolution operations, this network comprehensively explores the interrelationships of water quality parameters across different spatial positions. Concurrently, the data are fed into BiGRUs to extract dynamic features from the time series. The spatial and temporal features extracted are then subject to a weighted fusion process, employing a hierarchical cross-attention mechanism. This mechanism calculates attention weights, automatically focusing on the most critical features. The fused features are ultimately passed through a fully connected layer, where nonlinear transformations are applied to produce the final decision output, thereby enabling high-precision water quality prediction (See Table 1).

2. Data Preprocessing

This study utilizes water quality data from the automatic monitoring station at Xidong Water Plant, which are sourced from the China National Environmental Monitoring Centre. The monitoring station is situated in Gonghu Bay, located in the northern region of Lake Taihu, and functions as one of the centralized drinking water sources for Wuxi City. Data are collected every 4 h and include parameters such as water temperature, pH, dissolved oxygen, electrical conductivity, turbidity, CODMn, ammonia nitrogen, total phosphorus, and total nitrogen. This study selected three years of water quality monitoring data from January 2021 to December 2023, totaling 6370 records.
To conduct a thorough analysis of the water quality data at this location over the past three years, this study employs Pearson correlation coefficient analysis [18]. This method is used to avoid using all water quality indicators as input variables, thereby reducing the complexity of the model and preventing the degradation of its predictive capability. The Pearson correlation coefficients calculated for various key water quality indicators are visually presented in the form of a heatmap, where the depth of color indicates the strength of the correlation.
As illustrated in Figure 1, the Pearson correlation coefficient heatmap serves as an intuitive and powerful visualization tool that effectively reveals the linear relationships between various water quality parameters. Each square on the heatmap represents the correlation coefficient between a pair of water quality parameters, with the color depth directly reflecting the strength of their correlation. By analyzing the heatmap in depth, indicators that are significantly correlated with total phosphorus and total nitrogen are identified. Specifically, total phosphorus is positively correlated with turbidity, CODMn, ammonia nitrogen, and total nitrogen. These characteristics are crucial for water quality assessment and pollutant monitoring, as they reflect the overall condition of the water body. Consequently, these indicators are considered as input variables for building a model to predict total phosphorus concentration. Similarly, highly correlated indicators identified from the heatmap are also used as input variables for constructing a model to predict total nitrogen concentration. The utilization of Pearson correlation coefficient analysis and the heatmap visualization aids in selecting indicators that are highly correlated with total phosphorus and total nitrogen, providing strong support for the development of subsequent water quality prediction models. This strategic selection ensures that the models are equipped with relevant variables, enhancing the predictive accuracy and reliability of the water quality forecasts.
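As a sketch of this selection step, the correlation screening can be reproduced with pandas. The column names and synthetic data below are illustrative stand-ins for the station’s indicators, not the actual monitoring records; the 0.3 threshold matches the cutoff applied later in the paper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: total phosphorus is correlated with turbidity
# by construction, while water temperature is independent noise.
rng = np.random.default_rng(0)
n = 200
turbidity = rng.normal(10, 2, n)
total_phosphorus = 0.02 * turbidity + rng.normal(0, 0.01, n)
df = pd.DataFrame({
    "turbidity": turbidity,
    "total_phosphorus": total_phosphorus,
    "water_temperature": rng.normal(18, 5, n),
})

# Pairwise Pearson correlation matrix; entries near +/-1 indicate
# strong linear association (the quantity visualized in the heatmap).
corr = df.corr(method="pearson")

# Keep indicators whose |r| with the target exceeds the 0.3 threshold.
target = "total_phosphorus"
strength = corr[target].drop(target).abs()
inputs = strength[strength > 0.3].index.tolist()
print(inputs)
```

With the constructed data, `turbidity` passes the threshold while the independent `water_temperature` series does not, mirroring how input variables are screened for the prediction models.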
Due to significant differences in the dimensions and value ranges of different features, using raw data directly for modeling could lead to certain features having a disproportionate impact on model training. To address this issue, we standardized the data using the Z-score normalization method [19], transforming each feature’s data into a format with a mean of 0 and a standard deviation of 1. This ensures that all features are on the same scale, preventing the model from being unduly influenced by features with larger value ranges during training.
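The standardization step can be sketched as follows. The toy matrix is illustrative; in practice the mean and standard deviation should be fitted on the training split only and then applied to the test split.

```python
import numpy as np

def zscore_fit(X):
    """Compute per-feature mean and standard deviation (fit on training data only)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def zscore_apply(X, mu, sigma):
    """Transform each feature to mean 0 and standard deviation 1."""
    return (X - mu) / sigma

# Toy matrix: rows are time steps, columns are indicators on very
# different scales (e.g. pH around 7 vs. conductivity around 500).
X = np.array([[7.1, 450.0],
              [7.3, 520.0],
              [6.9, 480.0],
              [7.2, 510.0]])
mu, sigma = zscore_fit(X)
Xs = zscore_apply(X, mu, sigma)
```

After the transform, both columns contribute on the same scale, so no single indicator dominates gradient updates during training.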

3. Model Structure Design

Considering the non-stationary and dynamic characteristics of water quality data, a single model is insufficient to fully capture their complexity and diversity. The spatiotemporal variability and uncertainty of water quality data may also lead to biases in the prediction results of a single model. To address these challenges, this study explores the complementary integration of TCNs (Temporal Convolutional Networks) and GRUs (Gated Recurrent Units) for spatiotemporal feature extraction and fusion. A TCN utilizes convolutional kernels to extract local features from water quality data, effectively capturing fine-grained information. With its distinctive gating mechanism, a GRU can effectively capture temporal dependencies within water quality data, thereby enhancing the model’s capability to learn sequential patterns.

3.1. TCN Module

In time series problems such as water quality monitoring, the data typically exhibit both short-term and long-term dependencies. Moreover, different time points may carry varying levels of informative content, which necessitates the use of models capable of adaptively capturing temporal relevance across multiple scales. TCNs represent an improvement over traditional one-dimensional convolution. Introduced by Bai et al. in 2018 [20], TCNs have been widely applied in time series modeling tasks [21]. The study in reference [22] shows that compared to RNN and its variants, the TCN model has a simpler structure and significant effects, and it also significantly reduces hardware costs. These advantages make TCN an attractive option for effectively handling the complexities inherent in time series data.
The introduction of causal convolution in the TCN is key to its ability to handle time series data effectively. It efficiently excludes future data, ensuring the causality of the time series. For one-dimensional data $[x_0, x_1, \ldots, x_t, x_{t+1}]$, the output $y_t$ at time $t$ in the hidden layer after convolution is calculated as shown in Equation (1), where $y_t$ is the output at time $t$ after causal convolution and $f$ represents the one-dimensional convolutional kernel.
$y_t = f(x_0, x_1, \ldots, x_t)$
Dilated convolution is key for interval sampling, as varying the dilation factor allows for an expanded receptive field. This operation reduces the number of convolutions required for long sequences, thereby effectively decreasing the computational load. As shown in Figure 2, the introduction of dilated convolutions in TCNs enables the flexible adjustment of the model’s receptive field, effectively extracting features from local to global levels and providing a comprehensive analysis of the data’s deep features. For one-dimensional data $[x_0, x_1, \ldots, x_t, x_{t+1}]$ processed through dilated convolution, the output at time $t$ is shown in Equation (2). Here, $h_t$ represents the output at time $t$ after the dilated convolution, $f(i)$ is the element at position $i$ in the convolution kernel, $kernel$ is the size of the convolution kernel, $x_{t - m \cdot i}$ is the element of the sequence after interval sampling, and $m$ is the dilation factor. This setup allows the convolution operation to span a wider range of the input data, effectively capturing broader contextual relationships within the sequence.
$h_t = \sum_{i=0}^{kernel-1} f(i) \cdot x_{t - m \cdot i}$
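Equation (2) can be checked with a direct, framework-free implementation. The sequence and kernel below are illustrative; causal zero-padding is assumed for positions before the start of the sequence.

```python
import numpy as np

def dilated_causal_conv(x, kernel, m):
    """Equation (2) as a direct loop: h[t] = sum_i kernel[i] * x[t - m*i].

    x      : 1-D input sequence
    kernel : 1-D filter
    m      : dilation factor (m = 1 recovers plain causal convolution)
    Positions before the start of the sequence are treated as zero, so no
    future value x[t'] with t' > t ever contributes to h[t] (causality).
    """
    T, k = len(x), len(kernel)
    h = np.zeros(T)
    for t in range(T):
        for i in range(k):
            j = t - m * i
            if j >= 0:
                h[t] += kernel[i] * x[j]
    return h

x = np.arange(8, dtype=float)  # [0, 1, ..., 7]
h = dilated_causal_conv(x, kernel=np.array([1.0, 1.0]), m=2)
# With kernel [1, 1] and m = 2, h[t] = x[t] + x[t-2]; e.g. h[5] = 5 + 3 = 8.
```

Increasing `m` widens the receptive field without adding kernel weights, which is the interval-sampling effect described above.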
As illustrated in Figure 3, the original features of the model are added to the features calculated as described above through a residual connection. This approach prevents the issue of gradient vanishing, which could cause the model to lose important original features, and effectively increases stability during the training of long sequences.
A convolutional kernel of size $kernel = 1 \times 3$ with a dilation factor $d = 1$ typically constitutes a residual block in a TCN. This residual block usually comprises causally dilated convolution, a spatial dropout layer [23], a residual connection, a ReLU activation function, and a weight normalization layer [24]. Assuming that the TCN consists of $k$ stacked residual blocks, the output after passing through these residual blocks is expressed by Equations (3) and (4).
$y^{(j,k)} = [y_0^{(j,k)}, \ldots, y_T^{(j,k)}]$
$y_t^{(j,k)} = \sum_{i=0}^{kernel} \left( f(i) \cdot y_{t - m \cdot i}^{(j-1,k)} \right) + y_t^{(1,k)}$
In the formulas, $y^{(j,k)}$ represents the output at layer $j$ within the $k$-th residual block, where $j \in \{1, 2\}$ and $k$ denotes the index of the residual block; $T$ represents the length of the sequence.

3.2. GRU Module

The GRU, as an improvement over traditional RNNs, has been shown to extract crucial information from the time-space domain of sequential data such as HRRP (High-Resolution Range Profile) signals [25]. It not only maintains the relevance of time series data but also reduces the number of parameters compared to LSTM networks. The network structure of the GRU is depicted in Figure 4, and the model is expressed in Equations (5)–(8). In the GRU model, $X_t$ represents the input data at time $t$; $\sigma$ and $\tanh$ are activation functions. $r_t$ is the reset gate, which controls the influence of the previous state’s information, and $z_t$ is the update gate, which controls how information from the previous moment affects the current state transition. $W_r$ and $W_z$ are the weight matrices of the reset and update gates, and $b_r$ and $b_z$ are their bias vectors. $\tilde{h}_t$ represents the new candidate memory content, $h_{t-1}$ is the previous hidden state, $W_{\tilde{h}}$ is the weight matrix for the candidate state, and $b_{\tilde{h}}$ is its bias vector.
$r_t = \sigma(W_r [h_{t-1}, X_t] + b_r)$
$z_t = \sigma(W_z [h_{t-1}, X_t] + b_z)$
$\tilde{h}_t = \tanh(W_{\tilde{h}} [r_t \odot h_{t-1}, X_t] + b_{\tilde{h}})$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
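Equations (5)–(8) can be traced with a plain NumPy single-step GRU. The dimensions and random weights below are illustrative, not the trained model’s parameters; the weight matrices are assumed to act on the concatenation of the previous hidden state and the current input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, Wr, br, Wz, bz, Wh, bh):
    """One GRU step following Equations (5)-(8)."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ hx + br)                                      # reset gate, Eq. (5)
    z = sigmoid(Wz @ hx + bz)                                      # update gate, Eq. (6)
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate state, Eq. (7)
    return (1.0 - z) * h_prev + z * h_cand                         # new hidden state, Eq. (8)

# Illustrative dimensions: 3 input features, 4 hidden units, untrained random weights.
rng = np.random.default_rng(1)
d_in, d_h = 3, 4
Wr = rng.normal(0, 0.1, (d_h, d_h + d_in))
Wz = rng.normal(0, 0.1, (d_h, d_h + d_in))
Wh = rng.normal(0, 0.1, (d_h, d_h + d_in))
br = bz = bh = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(0, 1, (5, d_in)):  # unroll over 5 time steps
    h = gru_cell(x_t, h, Wr, br, Wz, bz, Wh, bh)
```

Because Equation (8) is a convex combination of the previous state and a `tanh`-bounded candidate, the hidden state stays bounded in (−1, 1) starting from zero, which contributes to the GRU’s training stability.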

3.3. Weighted Hierarchical Cross Attention Based on Spatiotemporal Features

Water quality data can be influenced by different time points, making it essential to capture both historical and future information for a comprehensive understanding of their dynamic evolution. Traditional models such as TCNs and GRUs only perform forward computations on the input sequence, neglecting the implicit information contained in the reverse direction. This limitation hinders their ability to fully learn the global dependencies of historical data. Therefore, the model proposed in this study leverages forward and backward convolutional networks to simultaneously learn bidirectional information, allowing it to consider both past and future data dependencies. This approach enables the more accurate capture of the temporal patterns of water quality variations, thereby enhancing the model’s capability to handle complex temporal relationships and improving its overall predictive performance.
Traditional self-attention mechanisms can effectively capture dependencies within an input sequence. However, they lack the ability to directly interact across temporal and spatial domains. This limitation often leads to redundant information during processing, causing an over-concentration of attention weights and ultimately affecting the overall model performance. To address this issue, this study introduces the cross-attention mechanism [26], which enables the model to simultaneously process information from two different input sequences. This mechanism establishes selective information integration pathways, allowing for more effective feature fusion. However, this mechanism is limited to computing the weighted sum of a single attention head, making it more suitable for simpler feature fusion tasks. Li H et al. [27] proposed using self-attention to enhance the internal features of each modality, thereby improving the interaction of features (complementary information) between different modalities. Building on this idea, this study integrates temporal and spatial domain features to complementarily extract the complex features of water quality data.
As shown in Figure 5, the structure diagram of the TSCA (Time-Space Cross-Attention) model designed in this study is presented. It is primarily divided into three parts: bidirectional sliding window processing, bidirectional spatiotemporal feature complementarity, and bidirectional feature cross-fusion. The input water quality data are processed through BiTCN and BiGRU computations to capture the spatial correlations and temporal dependencies of the data. The BiTCN module is composed of multiple Temporal Blocks, each containing two convolutional layers, a dropout layer, a ReLU activation function, and a flattening layer. This multi-layer structure effectively extracts deep temporal-spatial features. To prevent potential gradient vanishing issues as the network deepens, residual connections are designed. Additionally, to accelerate convergence and improve the model’s generalization ability, the convolutional kernels undergo weight normalization, which helps avoid gradient vanishing or explosion during training.
BiGRUs and BiTCNs are used to extract temporal and spatial domain features, respectively. The key issue is how to efficiently associate and weight these features. This study investigates a cross-attention mechanism suited to the spatiotemporal feature fusion of water quality data. By employing parallel processing with multi-head attention, the model achieves weighted cross-layer feature fusion. Through the QKV mechanism, cross-layer feature fusion is applied to each layer of the BiGRU and BiTCN. The temporal features extracted by the $i$-th BiGRU layer, $BiGRU\_outputs[i]$, are used as Q, and the spatial features extracted by the $i$-th BiTCN layer, $BiTCN\_outputs[i]$, are used as K and V; these are input into the cross-attention mechanism. The similarity between Q and K is computed, and a weighted summation is applied to obtain the fused features. Finally, the cross-layer attention features are concatenated and pooled to adapt to the fully connected layer for the final output. The calculation of the weighted fusion features is shown in Formula (9), where $\sigma$ is a learnable weight parameter, initialized to 1 and updated by the gradients from the Adam optimizer. The model adapts the fusion ratio of BiGRU and BiTCN features at each layer, enabling it to adjust dynamically to the characteristics of the input data. This allows the model to learn the correlations between different positions in the spatiotemporal features, ultimately enhancing its performance and generalization ability. By adapting the fusion process at each layer, the model effectively captures both temporal and spatial dependencies, leading to improved predictive accuracy and robustness (see Table 2).
$f = \sigma \cdot BiGRU\_outputs[i] + (1 - \sigma) \cdot BiTCN\_outputs[i]$
The weighted fused features undergo hierarchical cross-attention computation, as shown in Formula (10). Finally, the cross-layer attention features are concatenated and pooled to adapt to the fully connected layer for the final output.
$AttentionOutput = \mathrm{softmax}(QK^{T})V$
In the above equation, $Q$ comes from $BiGRU\_outputs[i]$, while $K$ and $V$ come from $BiTCN\_outputs[i]$.
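A minimal NumPy sketch of Formulas (9) and (10) for one layer follows. A fixed scalar sigma stands in for the learnable weight, single-head attention stands in for the multi-head version described above, and the feature arrays are illustrative random tensors rather than real BiGRU/BiTCN outputs.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_attend(gru_feats, tcn_feats, sigma):
    """Sketch of Formulas (9) and (10) for one layer.

    gru_feats : (T, d) temporal features from a BiGRU layer -> queries Q
    tcn_feats : (T, d) spatial features from a BiTCN layer  -> keys K and values V
    sigma     : scalar fusion weight (learnable in the paper, fixed here)
    """
    # Formula (9): learnable convex mix of the two feature streams.
    fused = sigma * gru_feats + (1.0 - sigma) * tcn_feats
    # Formula (10): cross-attention with Q from the BiGRU stream and
    # K, V from the BiTCN stream (no 1/sqrt(d_k) scaling, as in the paper).
    Q, K, V = gru_feats, tcn_feats, tcn_feats
    attn = softmax(Q @ K.T) @ V
    return fused, attn

rng = np.random.default_rng(2)
T, d = 5, 8  # illustrative sequence length and feature width
fused, attn = fuse_and_attend(rng.normal(size=(T, d)),
                              rng.normal(size=(T, d)), sigma=0.5)
```

Each attention row is a probability-weighted average over the BiTCN feature rows, so the temporal stream selects which spatial positions to draw from, which is the cross-layer interaction the mechanism is designed to provide.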

4. Experimental Results and Analysis

The experimental environment of the model in this study consists of a 64-bit Windows 10 operating system with a 64 GB RAM computing platform. The CPU is an Intel i7-13700F (Intel Corporation, Santa Clara, CA, USA), and the GPU is an NVIDIA GeForce RTX 4080 with 16 GB of VRAM. The research algorithm is implemented using Python 3.8, PyTorch 2.1.1, and CUDA 11.8 frameworks.
Since water quality monitoring data are time series data, with parameters changing over time, we employed a sliding window technique to model the time series. The sliding window effectively extracts the correlation information from the sequential data, transforming it into a format suitable for deep learning models. We selected a sliding window size of 5, meaning that the data from the previous 5 time points are used to predict the water quality parameter for the next time point. For each sample, the water quality data from the previous 5 time points serve as feature inputs, while the label at the current time is the value of total phosphorus or total nitrogen. Each sample consists of a time series of length 5, which is then used to train the model. The dataset is divided into a training set and a test set in a 9:1 ratio. This data processing ensures the stability of model training. It also facilitates the prediction of total phosphorus and total nitrogen in the Xidong Water Plant’s water.
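The sliding-window construction and chronological 9:1 split can be sketched as follows. The toy series and the convention that column 0 holds the target are illustrative assumptions; the real input has shape [6370, 5, 5].

```python
import numpy as np

def make_windows(series, window=5):
    """Turn a (T, n_features) array into samples of shape (window, n_features),
    each labelled with the target value at the next time step."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])  # previous `window` observations as features
        y.append(series[t, 0])          # column 0 = target (e.g. total phosphorus)
    return np.stack(X), np.array(y)

# Toy series: 20 time steps, 5 indicators (matching the 5-feature input).
data = np.random.default_rng(3).normal(size=(20, 5))
X, y = make_windows(data, window=5)

# Chronological 9:1 train/test split, as in the paper; no shuffling,
# so the test set is strictly later in time than the training set.
split = int(len(X) * 0.9)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

A 20-step series with a window of 5 yields 15 samples of shape (5, 5), illustrating on a small scale how the 6370 monitoring records become the [6370, 5, 5] model input.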
Based on the correlation analysis of three years of data from the Xidong Water Plant, data with a Pearson correlation coefficient greater than 0.3 are selected as inputs to the model for the prediction of total phosphorus and total nitrogen, respectively. As shown in Figure 6 and Figure 7, these represent the three years of data for total phosphorus and total nitrogen, which are used as the Y labels for the model. As shown in Table 3, the model input data consist of a total of 6370 samples. Each sample contains 5 time steps and 5 features, and so the input data size is [6370, 5, 5].
The prediction of water quality indicators in Xidong Water Plant essentially constitutes a regression model. To evaluate the effectiveness of the improved network structure in this study, four metrics are selected: MSE, MAE, RMSE, and R2 [28]. These metrics are used to assess the model’s prediction performance. MSE and RMSE are primarily used to assess the difference between predicted and actual values. MAE reflects the average level of the model’s prediction error, while R2 measures the goodness of fit of the model. The calculations for MSE, MAE, and RMSE are shown in Formulas (11)–(13), where smaller values indicate better model prediction performance. As shown in Formula (14), an R2 closer to 1 indicates a stronger model fitting ability.
$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$
$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$
$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$
In the formulas, $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the true values, and $N$ is the total number of samples.
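Formulas (11)–(14) translate directly into code; the small vectors below are illustrative values, not measurements from the Xidong dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, RMSE, and R^2 following Formulas (11)-(14)."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                 # Formula (11)
    mae = np.mean(np.abs(err))              # Formula (12)
    rmse = np.sqrt(mse)                     # Formula (13)
    # Formula (14): 1 minus residual sum of squares over total sum of squares.
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, mae, rmse, r2

y_true = np.array([0.10, 0.12, 0.15, 0.11])  # illustrative true values
y_pred = np.array([0.11, 0.12, 0.14, 0.10])  # illustrative predictions
mse, mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

Smaller MSE, MAE, and RMSE indicate better prediction performance, while an R^2 closer to 1 indicates a stronger model fit, matching the interpretation given above.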
The loss function is set as MSE loss, and the optimizer used is the Adam Optimizer. By testing different batch sizes, the model’s performance in predicting total phosphorus and total nitrogen water quality indicators was evaluated. As shown in Table 4, the experimental results indicate that as the batch size decreases, the model becomes better at capturing spatiotemporal features. This leads to more accurate predictions of water quality indicators. When the batch size is 4, the model achieves high accuracy in predicting both total phosphorus and total nitrogen, demonstrating the best performance.
To further investigate and analyze the impact of the number of output channels on the model’s performance, the learning rate is set to 0.0001, the batch size is 4, and the number of training epochs is 50. A series of comparative experiments were conducted, as shown in Table 5. When the number of output channels is small, the model may fail to capture the complex patterns in the data. This results in higher training and testing losses. As the number of output channels increases, the model’s representational capacity improves, allowing it to better fit the training data. Consequently, the training loss decreases. However, when the number of channels increases to 128, the testing loss for total nitrogen increases slightly. This suggests a potential overfitting issue. Although the testing loss at 512 channels is slightly lower than that at 128 channels, the difference is not significant. More importantly, the model with 128 output channels shows a clear advantage in computational complexity. This is crucial for enhancing the model’s operational efficiency and practical application value. Therefore, when the number of output channels is set to 128, the model typically achieves the best balance between training and testing performance. At this point, the model has sufficient representational capacity while maintaining good generalization ability.
Therefore, the initial learning rate for the model training in this study is set to 0.0001, with a batch size of 4 and 50 training epochs. The number of channels is [128, 128]. The MSE loss values for the training and testing sets, as functions of the training epochs, are shown in Figure 8. In the early training stages, the training loss decreases rapidly, indicating that the model can quickly fit the training data. The decrease in testing loss is relatively gradual, suggesting that the model’s performance on the test data becomes more stable over time. This trend indicates that the model gradually reached an optimal performance on the training set. No significant overfitting was observed during the prediction of total phosphorus and total nitrogen, achieving good training and testing results.
Figure 9 and Figure 10 present visualizations of the spatiotemporal feature fusion cross-attention prediction model proposed in this study. The pink curve represents the original signal, while the blue curve represents the signal predicted by the model. The predicted signal closely follows the trend of the original signal; even in regions with fluctuations and abrupt changes, the model captures the dynamic variations of the original signal in a timely manner, indicating strong dynamic learning capability. Although the overall trend is consistent, slight deviations are noticeable between the predicted and original signals at certain peaks and valleys. Overall, the magnitude of these errors is small, indicating that the model achieves high prediction accuracy.
To evaluate the importance of different features in the model, the contribution of each feature to the model’s predictions was calculated [29]. As shown in Figure 11a,b, total phosphorus and total nitrogen dominate among all the features, indicating that these two features have a crucial impact on the model’s prediction accuracy and results.
In contrast, the contributions of ammonia nitrogen and CODMn are relatively lower, but they play a supplementary role in the model’s learning process. Specifically, ammonia nitrogen is directly related to the content of harmful substances in the water. Turbidity and water temperature make relatively smaller contributions. Especially in cases with minimal changes or when environmental conditions remain stable, these features may serve more as auxiliary variables.
The spatiotemporal feature fusion cross-attention prediction model proposed in this study demonstrates strong capabilities in modeling temporal data. To validate its effectiveness, it is compared with several commonly used benchmark algorithms. All comparison algorithms are trained and evaluated on the same training and testing datasets.
Table 6 shows that TSCA achieves the best performance, with an MSE of 0.035 for total phosphorus prediction and an R2 of 0.982 for total nitrogen prediction. In the total phosphorus task, its MSE is 23.9% lower than that of the second-best model, CAST (0.046). For total nitrogen prediction, its R2 is 3.5% higher than that of IOOA-SRF-CAALSTM (0.949) [30].
The proposed TSCA model achieves a substantially lower Mean Absolute Error (MAE) than traditional models: 0.131 for total phosphorus and 0.080 for total nitrogen, a 35% to 72% reduction relative to the CNN-BiGRU baseline (MAE of 0.203 for total phosphorus and 0.287 for total nitrogen). This indicates that TSCA has lower prediction bias and greater stability. Compared with traditional models such as CA-TCN-BiGRU, TSCA significantly enhances modeling capability by integrating spatiotemporal features through a cross-attention mechanism. For instance, the MSE for total nitrogen prediction is 0.024, which is 70.4% lower than that of VMD-CNN-BiGRU (0.081), highlighting the advantage of spatiotemporal interaction. Furthermore, TSCA performs consistently across both tasks, with high R2 values of 0.952 for total phosphorus and 0.982 for total nitrogen. The experimental results demonstrate that TSCA achieves significant advantages in prediction accuracy, error control, and model generalization through spatiotemporal feature interaction optimization.
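The four evaluation metrics and the relative reductions quoted above can be reproduced directly from Table 6. A short sketch (function names are ours, not the authors'):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, RMSE and R^2 as used in the paper's comparisons."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, mae, rmse, r2

def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100 * (baseline - value) / baseline
```

For example, `relative_reduction(0.081, 0.024)` gives the 70.4% total-nitrogen MSE reduction over VMD-CNN-BiGRU, and `relative_reduction(0.287, 0.080)` the 72.1% MAE reduction over CNN-BiGRU.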
To visually present the comparison metrics of the different algorithms in Table 6, bar-line charts for total phosphorus and total nitrogen prediction are plotted in Figure 12 and Figure 13. The proposed TSCA model shows the lowest MAE, MSE, and RMSE bars among all algorithms, while its R2 curve is the highest, highlighting TSCA's clear advantage in predicting total phosphorus and total nitrogen.
As shown in Figure 14 and Figure 15, the TSCA model demonstrates significant advantages in predicting both total phosphorus and total nitrogen, with prediction trajectories closely aligned with the original data. This is particularly evident in regions of intense concentration fluctuation, such as time steps 100–300 for phosphorus and 200–400 for nitrogen, and the model responds to abrupt changes with precise synchronization, for example the sharp decline in phosphorus from 1.5 to 0.5 between steps 600 and 700. In contrast, traditional models such as CNN-BiGRU and IOOA-SRF-CAALSTM exhibit widespread lagging and overfitting. The stability of TSCA is especially notable: in low-concentration nitrogen predictions it avoids the abnormal fluctuations seen in VMD-CNN-BiGRU, and its long-term trends, particularly over steps 500–700 for nitrogen, consistently match the ground truth, addressing the persistent overestimation issue of the CAST model. By dynamically optimizing spatiotemporal feature interactions through the cross-attention mechanism, TSCA substantially enhances its ability to capture complex temporal patterns; its nitrogen prediction MSE is 70.4% lower than that of VMD-CNN-BiGRU, and performance varies little across tasks. These results comprehensively validate TSCA's high accuracy, strong robustness, and generalization capability.

5. Conclusions

Total phosphorus (TP) and total nitrogen (TN) concentrations are important indicators for water quality monitoring, and both represent significant pollutants in aquatic environments. Accurate prediction of TP and TN contributes to effective water quality management and treatment, while also providing valuable monitoring data for government agencies and relevant departments to ensure the safety and sustainability of public water sources. The hierarchical spatiotemporal cross-attention (TSCA) prediction model proposed in this study integrates spatial and temporal features, effectively capturing both the temporal dynamics and the complex interdependencies inherent in water quality data. Compared with traditional convolutional and recurrent neural network models, the TSCA model adapts better to the characteristics of water quality datasets. By employing bidirectional convolution to extract spatiotemporal features and integrating them through a weighted mechanism, the model enhances the representation of both local and sequential patterns, and the hierarchical cross-attention mechanism further improves its ability to handle water quality data with complex, dynamic variations. Together, these components yield high prediction accuracy and robust generalization. Despite its effectiveness, the performance of the TSCA model may be affected by the quality and quantity of available historical data. Additionally, its architectural complexity may introduce relatively high computational costs, which could limit its applicability in real-time scenarios on low-resource devices. Future work may focus on developing lightweight variants of the TSCA model for edge deployment, incorporating multi-source heterogeneous data (e.g., meteorological information, pollution sources), and adopting self-supervised learning techniques to reduce dependence on labeled data.

Author Contributions

J.Z.: Conceptualization, methodology. K.W.: writing—original draft. J.H.: investigation, supervision. L.Y.: writing—review and editing. J.S.: supervision. All authors have read and agreed to the published version of the manuscript.

Funding

Ministry of Ecology and Environment of the People’s Republic of China Yangtze River Ecological Environment Protection and Restoration “One City, One Policy” On-site Tracking Research (Phase II), Wuxi City Yangtze River Water Ecological Environment Protection Study, Subproject Number: 2022-LHYJ-02-0502-01.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Song, L.; Liu, Y.; Yang, L.; Li, D. A review of the artificial neural network models for water quality prediction. Appl. Sci. 2020, 10, 5776. [Google Scholar] [CrossRef]
  2. Wang, Q.; Hao, D.; Li, F.; Guan, X.; Chen, P. Development of a new framework to identify pathways from socioeconomic development to environmental pollution. J. Clean. Prod. 2020, 253, 119962. [Google Scholar] [CrossRef]
  3. Li, J.; Zhang, J.; Liu, L.; Fan, Y.; Li, L.; Yang, Y.; Lu, Z.; Zhang, X. Annual periodicity in planktonic bacterial and archaeal community composition of eutrophic Lake Taihu. Sci. Rep. 2015, 5, 15488. [Google Scholar] [CrossRef] [PubMed]
  4. Babaeinesami, A.; Tohidi, H.; Ghasemi, P.; Goodarzian, F.; Tirkolaee, E.B. A closed-loop supply chain configuration considering environmental impacts: A self-adaptive NSGA-II algorithm. Appl. Intell. 2022, 52, 13478–13496. [Google Scholar] [CrossRef]
  5. Katimon, A.; Shahid, S.; Mohsenipour, M. Modeling water quality and hydrological variables using ARIMA: A case study of Johor River, Malaysia. Sustain. Water Resour. Manag. 2018, 4, 991–998. [Google Scholar] [CrossRef]
  6. Avila, R.; Horn, B.; Moriarty, E.; Hodson, R.; Moltchanova, E. Evaluating statistical model performance in water quality prediction. J. Environ. Manag. 2018, 206, 910–919. [Google Scholar] [CrossRef]
  7. Jadhav, A.R.; Pathak, P.D.; Raut, R.Y. Water and wastewater quality prediction: Current trends and challenges in the implementation of artificial neural network. Environ. Monit. Assess. 2023, 195, 321. [Google Scholar] [CrossRef]
  8. Gao, Y.; Zhao, T.; Zheng, Z.; Liu, D. A Cotton Leaf Water Potential Prediction Model Based on Particle Swarm Optimisation of the LS-SVM Model. Agronomy 2023, 13, 2929. [Google Scholar] [CrossRef]
  9. Sabri, M.; El Hassouni, M. Photovoltaic power forecasting with a long short-term memory autoencoder networks. Soft Comput. 2023, 27, 10533–10553. [Google Scholar] [CrossRef]
  10. Hu, Z.; Zhang, Y.; Zhao, Y.; Xie, M.; Zhong, J.; Tu, Z.; Liu, J. A water quality prediction method based on the deep LSTM network considering correlation in smart mariculture. Sensors 2019, 19, 1420. [Google Scholar] [CrossRef]
  11. Chen, L.; Zhang, Y.; Xu, B.; Shao, K.; Yan, J.; Bhatti, U.A. An IoT-based VMD-CNN-BiGRU indoor mariculture water quality prediction method including attention mechanism. Int. J. High Speed Electron. Syst. 2024, 2540010. [Google Scholar] [CrossRef]
  12. Yan, J.; Liu, J.; Yu, Y.; Xu, H. Water quality prediction in the Luan river based on 1-DRCNN and BiGRU hybrid neural network model. Water 2021, 13, 1273. [Google Scholar] [CrossRef]
  13. Bi, J.; Lin, Y.; Dong, Q.; Yuan, H.; Zhou, M. Large-scale water quality prediction with integrated deep neural network. Inf. Sci. 2021, 571, 191–205. [Google Scholar] [CrossRef]
  14. Niu, D.; Yu, M.; Sun, L.; Gao, T.; Wang, K. Short-term multi-energy load forecasting for integrated energy systems based on CNN-BiGRU optimized by attention mechanism. Appl. Energy 2022, 313, 118801. [Google Scholar] [CrossRef]
  15. Sun, F.; Jin, W. CAST: A convolutional attention spatiotemporal network for predictive learning. Appl. Intell. 2023, 53, 23553–23563. [Google Scholar] [CrossRef]
  16. Yuan, J.; Li, Y. Wastewater quality prediction based on channel attention and TCN-BiGRU model. Environ. Monit. Assess. 2025, 197, 219. [Google Scholar] [CrossRef]
  17. Waqas, M.; Humphries, U.W. A critical review of RNN and LSTM variants in hydrological time series predictions. MethodsX 2024, 13, 102946. [Google Scholar] [CrossRef]
  18. Li, G.; Zhang, A.; Zhang, Q.; Wu, D.; Zhan, C. Pearson correlation coefficient-based performance enhancement of broad learning system for stock price prediction. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2413–2417. [Google Scholar] [CrossRef]
  19. Rahmad Ramadhan, L.; Anne Mudya, Y. A Comparative Study of Z-Score and Min-Max Normalization for Rainfall Classification in Pekanbaru. J. Data Sci. 2024, 2024, 1–8. [Google Scholar]
  20. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  21. Tan, H.; Shang, Y.; Luo, H.; Lin, T.R. A Combined Temporal Convolutional Network and Gated Recurrent Unit for the Remaining Useful Life Prediction of Rolling Element Bearings. In Proceedings of the International Conference on the Efficiency and Performance Engineering Network, Huddersfield, UK, 29 August–1 September 2023; Springer Nature: Cham, Switzerland, 2023; pp. 853–862. [Google Scholar]
  22. Zhao, W.; Gao, Y.; Ji, T.; Wan, X.; Ye, F.; Bai, G. Deep temporal convolutional networks for short-term traffic flow forecasting. IEEE Access 2019, 7, 114496–114507. [Google Scholar] [CrossRef]
  23. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  24. Salimans, T.; Kingma, D.P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv 2016, arXiv:1602.07868. [Google Scholar] [CrossRef]
  25. Chu, Y.; Guo, Z. Attention enhanced spatial temporal neural network for HRRP recognition. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 3805–3809. [Google Scholar]
  26. Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  27. Li, H.; Wu, X.J. CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach. Inf. Fusion 2024, 103, 102147. [Google Scholar] [CrossRef]
  28. Barzegar, R.; Aalami, M.T.; Adamowski, J. Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model. Stoch. Environ. Res. Risk Assess. 2020, 34, 415–433. [Google Scholar] [CrossRef]
  29. Christoph, M. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable; Leanpub: Victoria, BC, Canada, 2020. [Google Scholar]
  30. Antony, T.; Maruthuperumal, S. An effective illustrative approach for water quality prediction using spatio temporal features with cross attention-based adaptive long short-term memory. Intell. Decis. Technol. 2024, 18724981241298876. [Google Scholar]
Figure 1. A heat map of the Pearson correlation coefficient of water quality monitoring indicators in the past three years.
Figure 2. TCN Model Diagram.
Figure 3. Schematic diagram of residual block.
Figure 4. Schematic diagram of GRU unit.
Figure 5. Structure diagram of TSCA model.
Figure 6. Total phosphorus.
Figure 7. Total nitrogen.
Figure 8. Visualization of training process for total phosphorus prediction in proposed model.
Figure 9. Predicted total phosphorus data curve.
Figure 10. Predicted total nitrogen data curve.
Figure 11. Feature importance analysis of all samples in the prediction model.
Figure 12. Comparison of evaluation metrics for total phosphorus prediction across different algorithms.
Figure 13. Comparison of evaluation metrics for total nitrogen prediction across different algorithms.
Figure 14. Comparison of TSCA and other algorithms for predicting total phosphorus curve.
Figure 15. Comparison of TSCA and other algorithms for predicting total nitrogen curve.
Table 1. A comparison of the advantages and disadvantages of different algorithms in the current research status.

| Method | Advantages | Limitations |
|---|---|---|
| ARIMA | Statistically robust; suitable for linear time series prediction. | Lacks nonlinear modeling capability; unclear basis for forecasting accuracy. |
| Improved Bayesian network | Relatively successful; demonstrates certain predictive capabilities. | 21% error rate; limited generalization, especially across different water bodies. |
| ANN | Effective in handling nonlinear problems; suitable for water and wastewater treatment system prediction. | Prone to overfitting; limited generalization ability. |
| LS-SVM | Enhanced prediction performance through swarm optimization; suitable for large-scale datasets. | Computationally complex; low efficiency when handling massive datasets. |
| LSTM | Captures long-term dependencies in time series; reliable in forecasting. | Sensitive to training duration and hyperparameter tuning. |
| VMD-CNN-BiGRU | Incorporates attention mechanism; improves dissolved oxygen prediction accuracy. | Complex architecture; requires large data support. |
| CNN-BiGRU | Captures short-term dependencies in time series; suitable for short-term load forecasting. | May struggle with long-term dependencies due to limited temporal perception. |
| CNN-TCN | Enhances long-term temporal correlations using a multi-head attention mechanism. | Performance can be highly sensitive to kernel size, dilation rate, and attention head settings. |
| CA-TCN-BiGRU | Combines channel attention and temporal convolution; suitable for multi-parameter water quality prediction. | Complex structure; higher demand for computational optimization. |
Table 2. Summary of key methodological components and advantages.

| Component | Advantages |
|---|---|
| Bidirectional Sliding Window Preprocessing | Enhances short- and long-term pattern learning; preserves temporal continuity; augments feature diversity; increases data utilization. |
| BiGRU | Captures temporal dependencies from both past and future contexts; strong performance on time series; reduces gradient vanishing; faster convergence. |
| BiTCN | Enables parallel computation and handles long-range dependencies via dilation; extracts deep spatial and local contextual features; reduces gradient vanishing; faster convergence. |
| Weighted cross-layer cross-attention mechanism | Allows dynamic feature weighting and hierarchical interaction across BiGRU and BiTCN outputs; enhances spatiotemporal correlation modeling; improves interpretability and generalization. |
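The weighted cross-attention fusion in the last row can be illustrated with a minimal NumPy sketch. The authors' implementation details (dimensions, projection matrices, exact weighting scheme) are not reproduced here, so all names and shapes below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: one feature stream (query)
    attends to the other (key/value), as in a hierarchical fusion module."""
    Q, K, V = query_feats @ Wq, key_value_feats @ Wk, key_value_feats @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
T, d = 5, 8                           # hypothetical: 5 time steps, 8-dim features
temporal = rng.normal(size=(T, d))    # e.g., BiGRU output
spatial = rng.normal(size=(T, d))     # e.g., BiTCN output
alpha = 0.6                           # a learnable fusion weight in the real model
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = (alpha * cross_attention(temporal, spatial, Wq, Wk, Wv)
         + (1 - alpha) * temporal)    # weighted residual fusion
print(fused.shape)                    # (5, 8)
```

Here the temporal stream queries the spatial stream; a symmetric second pass (spatial querying temporal) plus layer-by-layer weighting would give the bidirectional interaction described in the abstract.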
Table 3. Input data. Input data size: [6370, 5, 5].

| Input variable 1 | Input variable 2 | Input variable 3 | Input variable 4 | Y label |
|---|---|---|---|---|
| Turbidity | Ammonia nitrogen | CODMn | Total nitrogen | Total phosphorus |
| Water temperature | CODMn | Ammonia nitrogen | Total phosphorus | Total nitrogen |
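The [6370, 5, 5] input tensor in Table 3 is consistent with a sliding window of length 5 over 5 monitored variables. The paper's bidirectional sliding window preprocessing is not fully specified here, so the following is a plain forward-window sketch under that assumption (the 6374-sample record length is hypothetical, chosen so the window count matches):

```python
import numpy as np

def sliding_windows(series, window=5):
    """Stack overlapping windows of a multivariate series into a
    [num_windows, window, num_features] tensor."""
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

# hypothetical raw record: 6374 samples x 5 variables
raw = np.random.default_rng(0).normal(size=(6374, 5))
X = sliding_windows(raw, window=5)
print(X.shape)  # (6370, 5, 5), matching the input size in Table 3
```

Each window supplies 5 consecutive observations of the 5 variables; the Y label is the target variable (total phosphorus or total nitrogen, per Table 3).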
Table 4. Prediction results of different batch sizes. TP = total phosphorus; TN = total nitrogen.

| Batch Size | TP MSE | TP MAE | TP RMSE | TP R2 | TN MSE | TN MAE | TN RMSE | TN R2 |
|---|---|---|---|---|---|---|---|---|
| 128 | 0.198 | 0.348 | 0.445 | 0.579 | 0.728 | 0.810 | 0.853 | 0.292 |
| 64 | 0.076 | 0.218 | 0.275 | 0.857 | 0.154 | 0.357 | 0.393 | 0.874 |
| 32 | 0.048 | 0.160 | 0.219 | 0.909 | 0.057 | 0.197 | 0.240 | 0.952 |
| 16 | 0.040 | 0.142 | 0.200 | 0.925 | 0.031 | 0.117 | 0.176 | 0.974 |
| 8 | 0.036 | 0.132 | 0.192 | 0.950 | 0.024 | 0.087 | 0.156 | 0.980 |
| 4 | 0.035 | 0.131 | 0.189 | 0.953 | 0.022 | 0.080 | 0.150 | 0.982 |
Table 5. Comparison of effects of different output channel numbers on the model. TP = total phosphorus; TN = total nitrogen.

| Number of Channels | TP MSE | TP MAE | TP RMSE | TP R2 | TN MSE | TN MAE | TN RMSE | TN R2 |
|---|---|---|---|---|---|---|---|---|
| [32, 32] | 0.046 | 0.165 | 0.216 | 0.941 | 0.013 | 0.114 | 0.177 | 0.974 |
| [32, 64] | 0.044 | 0.151 | 0.211 | 0.944 | 0.029 | 0.099 | 0.173 | 0.976 |
| [64, 64] | 0.041 | 0.139 | 0.204 | 0.946 | 0.028 | 0.088 | 0.167 | 0.977 |
| [64, 128] | 0.035 | 0.135 | 0.189 | 0.951 | 0.022 | 0.085 | 0.151 | 0.981 |
| [128, 128] | 0.035 | 0.132 | 0.188 | 0.952 | 0.022 | 0.080 | 0.150 | 0.982 |
| [128, 256] | 0.035 | 0.132 | 0.188 | 0.952 | 0.022 | 0.082 | 0.151 | 0.981 |
| [256, 256] | 0.035 | 0.131 | 0.188 | 0.952 | 0.023 | 0.083 | 0.152 | 0.981 |
| [256, 512] | 0.035 | 0.131 | 0.188 | 0.952 | 0.023 | 0.085 | 0.154 | 0.981 |
| [512, 512] | 0.035 | 0.131 | 0.189 | 0.953 | 0.024 | 0.085 | 0.155 | 0.981 |
| [512, 1024] | 0.036 | 0.132 | 0.190 | 0.951 | 0.024 | 0.080 | 0.156 | 0.980 |
Table 6. Comparison of different algorithms for predicting total phosphorus and total nitrogen. TP = total phosphorus; TN = total nitrogen.

| Comparison Algorithm | TP MSE | TP MAE | TP RMSE | TP R2 | TN MSE | TN MAE | TN RMSE | TN R2 |
|---|---|---|---|---|---|---|---|---|
| CNN-BiGRU | 0.134 | 0.203 | 0.367 | 0.917 | 0.169 | 0.287 | 0.412 | 0.908 |
| 1-DRCNN–BiGRU [12] | 0.170 | 0.325 | 0.413 | 0.925 | 0.139 | 0.298 | 0.374 | 0.928 |
| VMD-CNN-BiGRU [11] | 0.094 | 0.259 | 0.307 | 0.928 | 0.081 | 0.237 | 0.285 | 0.937 |
| CA-TCN-BiGRU [16] | 0.337 | 0.255 | 0.581 | 0.901 | 0.210 | 0.242 | 0.459 | 0.921 |
| IOOA-SRF-CAALSTM [30] | 0.187 | 0.227 | 0.433 | 0.924 | 0.068 | 0.205 | 0.261 | 0.949 |
| CAST [15] | 0.046 | 0.184 | 0.214 | 0.936 | 0.042 | 0.152 | 0.206 | 0.958 |
| TSCA | 0.035 | 0.132 | 0.188 | 0.952 | 0.024 | 0.080 | 0.150 | 0.982 |

Citation: Zhou, J.; Wei, K.; Huang, J.; Yang, L.; Shi, J. Research on Water Quality Prediction Model Based on Spatiotemporal Weighted Fusion and Hierarchical Cross-Attention Mechanisms. Water 2025, 17, 1244. https://doi.org/10.3390/w17091244