1. Introduction
As the scale of industrial development expands, environmental problems caused by production activities have become increasingly prominent. Industrial wastewater from sectors such as metallurgy, dyeing, and pulp and papermaking contains excessive amounts of heavy metals and refractory organics, which severely pollute soil and water resources, damage ecosystems, and harm human health [1,2]. Additionally, harmful substances in industrial wastewater can volatilize as the temperature rises, resulting in air pollution. Thus, industrial wastewater has become a significant global environmental challenge. Effective treatment and enhanced regulatory mechanisms are crucial for environmental protection and the sustainable development of industries [3].
It is essential to set specific wastewater discharge standards based on industrial categories and pollutant types to control water pollution and maintain ecological balance. Because industrial wastewater is discharged in large volumes, has a complex pollutant composition, and is poorly biodegradable [4], pre-treatment is required to convert pollutants into small-molecule substances before biochemical treatment is carried out [5]. By combining hardware sensors with data-driven soft measurement methods, wastewater treatment plants (WWTPs) can monitor complex processes in real time and indirectly estimate key indicators that are difficult to measure in practice [6,7]. Furthermore, this enables early warning of abnormal conditions, leading to more efficient and reliable process control.
Chemical oxygen demand (COD) is the amount of oxygen consumed when dissolved and suspended matter is exposed to a specific oxidizing agent, and it is considered a critical index for evaluating the organic pollution level in wastewater [8]. However, monitoring COD concentrations online with hardware sensors is costly, and some substances in the water can react irreversibly with the sensor elements, which degrades their accuracy and shortens their lifespan [9]. Thus, data-driven soft sensor methods for modeling COD concentrations in wastewater are widely applied in practical industrial operations.
Since machine learning methods can systematically learn relationships between variables from historical data, they have been widely studied and applied in recent years [10]. Alavi and co-workers [11] provided new insight into real-time COD prediction using several novel models that hybridize kernel-based extreme learning machines (ELM) with intelligent optimization algorithms. To make reliable predictions with time series data, Lotfi et al. [12] used the autoregressive integrated moving average (ARIMA) as the fundamental structure and incorporated the outlier-robust ELM technique to model the linear and nonlinear components separately, thereby improving the accuracy of the predicted effluent COD. Although ARIMA can explain the relationship between current and historical values using autoregression, differencing, and moving average calculations, its reliability decreases significantly when the input is an irregularly sampled time series with multiple data patterns. The emergence of deep learning methods offers an effective way to address these problems.
Deep learning models based on the recurrent neural network (RNN), which has a chain structure, have been applied in smart WWTPs in recent years because of their ability to learn historical trends in the temporal dimension [13,14]. Examples include the long short-term memory network (LSTM), which introduces gating mechanisms into the RNN units [15,16,17], and the gated recurrent unit (GRU), which simplifies the gating structure of LSTM to reduce the number of trainable parameters [18,19,20]. These models generate future trends from historical dependencies in time series, and all of them achieved accurate estimations of effluent COD, heavy metals, and sludge volume index with inputs observed in both single and entire wastewater treatment processes.
Owing to the advantages of the convolution operation, models based on the convolutional neural network (CNN) have performed well in specific time series tasks involving multi-scale patterns [21]. In [22], the CNN was combined with LSTM and GRU to compensate for their limited ability to extract spatial features and thereby realize more reliable water quality predictions. Xie et al. [23] considered the temporal convolutional network (TCN) an effective framework for processing sequence data since it can perform dilated causal convolution in parallel. Their experimental results showed that TCN can significantly improve the simulation of total nitrogen concentrations in WWTP effluent for the next eight hours. A study on forecasting crowd flows used a graph convolutional network (GCN) to capture the complicated interactions among external factors and integrated it with a prediction method that excels at extracting spatio-temporal dependencies [24]. Thus, selecting and building the most suitable model according to the data patterns and characteristics is crucial for achieving accurate predictions.
In this paper, a novel deep learning framework based on time scale decomposition and spatio-temporal feature extraction, which is denoted as MSTCN, is proposed to achieve more accurate predictions of the wastewater effluent indicator COD by identifying more complex time patterns in the data and realizing hierarchical feature extractions. The Fourier transform method is applied first to determine the top three main frequencies in the original data, according to which the inputs can be reconstructed into series at multiple time scales. After that, the historical dependencies and interactions between variables at each time scale can be captured by the TCN and GCN layers, respectively. The feature fusion across different time scales allows the proposed model to forecast utilizing information from various temporal dimensions and spatial interactions. Based on the actual wastewater quality data, the effectiveness and accuracy of the MSTCN model are demonstrated compared to widely applied prediction methods for time series.
The aim of this study is to achieve stable and accurate predictions of the COD concentrations in water bodies after treatment. This research makes efforts to develop an effective soft-sensor model to analyze monitoring data and provide reliable predictions for managing wastewater treatment processes.
This paper is organized as follows: Section 2 introduces the data collection, detailed methods, and the proposed model. Section 3 describes the experimental design, including data preprocessing, parameter optimization, and the comparison methods. In Section 4, the experimental results are shown and discussed. Finally, Section 5 presents the conclusion of this paper.
2. Materials and Methods
2.1. Data Collection
The dataset used for training and validating the predictive performance of the models in this work was sampled from a wastewater treatment plant in South Korea [25] between 9 March 2007 and 29 February 2008. This plant employs advanced biological treatment processes to gradually remove suspended solids, organic matter, and nutrients from the wastewater, and it mainly consists of a secondary clarifier and four basins for the denitrification, anaerobic, anoxic, and oxic processes. The detailed schematic is shown in Figure 1.
The parameters monitored by hardware sensors during the whole process are influent flow, suspended solids (SS), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), and COD of the effluent (COD-E). The measurements in this dataset are the daily average values of these seven variables over about one year. In the following training, the first six are chosen as predictive variables, while COD-E is the target variable, which is used to assess whether the water quality meets discharge standards. The curve of each variable is shown in Figure 2.
2.2. Fourier Transform
The Fourier transform is a mathematical method that converts the time-domain signal f(t) into the frequency-domain signal F(ω) and is widely applied in fields such as physics, engineering, and digital signal processing [26]. Since any periodic signal can be represented as a sum of simple oscillating functions of different frequencies, complex signals that meet the Dirichlet convergence conditions can be decomposed into the superposition of infinitely many discrete sine and cosine waves using Fourier series.
If the period of a continuous signal is T_1, its Fourier expansion is defined in Equation (1) as follows [27]:
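In the standard trigonometric form, which is assumed here to be the one intended, the expansion reads:

```latex
f(t) = a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos(n \omega_1 t) + b_n \sin(n \omega_1 t) \right],
\qquad \omega_1 = \frac{2\pi}{T_1}
```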
where ω_1 is the fundamental frequency of f(t), and each decomposed component has a frequency that is an integer multiple of ω_1. While a_0 represents the direct current component, a_n and b_n are the amplitudes of the cosine and sine signals, respectively. Over the interval [t_0, t_0 + T_1], they can be calculated as follows:
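With the usual normalization, assumed here to follow the convention in which the constant term is the mean (direct current) level of the signal, the coefficients are:

```latex
a_0 = \frac{1}{T_1} \int_{t_0}^{t_0 + T_1} f(t)\, dt, \qquad
a_n = \frac{2}{T_1} \int_{t_0}^{t_0 + T_1} f(t) \cos(n \omega_1 t)\, dt, \qquad
b_n = \frac{2}{T_1} \int_{t_0}^{t_0 + T_1} f(t) \sin(n \omega_1 t)\, dt
```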
However, many variables sampled from the wastewater treatment process in practice have no significant periodicity, in which case the analysis is equivalent to applying the Fourier transform to a function with an infinitely large period. Thus, the value of ω_1 approaches zero, and the Fourier transform can be described as follows:
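In the standard convention with angular frequency ω, which is assumed here, the transform is:

```latex
F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt
```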
where e^(−iωt) is the rotating complex exponential used to transform the time-domain signal, and i is the imaginary unit [28]. Each frequency component F(ω) contains amplitude and phase information, which determine the height and position of the wave, respectively. From the top k components F(ω) with the largest amplitudes, the k most predominant periods in the time series can be identified.
Thereby, the original data can be reconstructed across different time scales utilizing these periods obtained by the Fourier transform approach, which is essential for extracting specific features at each time scale and then enhancing the prediction accuracy.
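As a minimal sketch of how the dominant periods might be read off the amplitude spectrum, the snippet below applies NumPy's discrete Fourier transform to a single series. The function name `top_k_periods`, the mean removal, and the synthetic test signal are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def top_k_periods(series, k=3):
    """Return the k dominant periods (in samples) and their amplitudes."""
    x = np.asarray(series, dtype=float)
    n = x.size
    # Remove the mean so the zero-frequency (DC) term does not dominate.
    amplitudes = np.abs(np.fft.rfft(x - x.mean()))
    amplitudes[0] = 0.0  # ignore the DC bin explicitly
    # Indices of the k largest amplitudes, strongest first.
    top_bins = np.argsort(amplitudes)[::-1][:k]
    # Bin j corresponds to j cycles over n samples, i.e. a period of n / j.
    periods = n / top_bins
    return periods, amplitudes[top_bins]

# Synthetic signal with known periods of 20 and 50 samples (illustration only).
t = np.arange(200)
signal = 2.0 * np.sin(2 * np.pi * t / 20) + 1.0 * np.sin(2 * np.pi * t / 50)
periods, amps = top_k_periods(signal, k=2)
print(periods)
```

For a multivariate dataset, the same function could be applied to each variable and the most frequent periods kept, which is one possible reading of selecting periods by the mode of the main frequencies.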
2.3. Temporal Convolutional Network
Recurrent neural networks and their variants, such as LSTM and GRU, have demonstrated superior capabilities in capturing dynamic characteristics compared to convolutional models when dealing with time-series data. However, their serial computation during forward propagation requires excessive memory to store intermediate results for each time step, which makes training less efficient.
TCN is a time-series modeling method based on convolutional operations that can work on entire sequences in parallel, enhancing the utilization of computing resources. Compared to LSTM, the architecture of TCN is simpler and there are fewer parameters, which further reduces the risk of overfitting [
29].
TCN combines the causal convolution and dilated convolution to expand the receptive field of the filters without losing any input information. Additionally, layers are stacked as the residual structure to eliminate the instability that arises from the increase in model depth.
Figure 3 displays the TCN architecture.
Causal convolution is a unidirectional structure, which implies that the output of layer d at moment t depends only on the elements at moment t and earlier in layer d − 1. Given an input sequence X = (x_1, x_2, x_3, …, x_T) and a filter F = (f_1, f_2, f_3, …, f_d), the convolution operation at x_s can be calculated via Equation (6) as follows [30]:
Instead of utilizing pooling to increase the receptive field like CNN, TCN allows interval sampling of inputs. A standard convolutional layer can adjust the number of kernel intervals by a hyperparameter denoted dilation rate
r to control the size of the convolution window, as shown in Equation (7).
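As a hedged illustration of the two operations, one common indexing convention writes the causal convolution and its dilated form for a filter with k taps as shown below. The index j and the exact alignment of the taps are assumptions and may differ from the convention used in [30].

```latex
F(x_s) = \sum_{j=1}^{k} f_j \, x_{s-(k-j)}, \qquad
F_r(x_s) = \sum_{j=1}^{k} f_j \, x_{s-(k-j)\, r}
```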
where k is the filter size. Since the dilated convolution inserts r − 1 empty units between adjacent filter elements in the one-dimensional convolutional layer, it is necessary to pad each layer with the same number of zeros at the end.
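The interplay of dilation and zero padding can be sketched in a few lines of NumPy. This is an illustrative single-channel version under stated assumptions, not the paper's Keras layer: the function name `dilated_causal_conv1d` is hypothetical, and the zeros are placed on the past (left) side of the sequence, which is how causal padding is commonly implemented.

```python
import numpy as np

def dilated_causal_conv1d(x, weights, dilation=1):
    """y[s] = sum_j weights[j] * x[s - (k - 1 - j) * dilation], with zeros before the start."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    k = w.size
    pad = (k - 1) * dilation
    # Pad on the past side so that y[s] never depends on x[s + 1], x[s + 2], ...
    x_padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for j in range(k):
        offset = j * dilation
        y += w[j] * x_padded[offset:offset + x.size]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Two taps with dilation 2: each output is x[s - 2] + x[s].
y = dilated_causal_conv1d(x, [1.0, 1.0], dilation=2)
print(y)  # [1. 2. 4. 6. 8.]
```

With k taps and dilation rate r, one such layer covers (k − 1)·r + 1 time steps, which is why stacking layers with growing dilation rates enlarges the receptive field quickly.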
As presented in Figure 3b, each residual block in TCN consists of two dilated causal convolutional layers and ReLU nonlinear mapping layers. Moreover, weight normalization and dropout layers are inserted after them to regularize the network and mitigate gradient vanishing or explosion. To prevent the performance of TCN from degrading as the number of layers increases, the input is added directly to the features extracted by the second convolutional layer. Since the two tensors can have different shapes, a convolutional layer with a kernel size of 1 × 1 is applied to match the shapes before the residual summation.
2.4. Improvements
Because TCN is limited in learning the interactions between variables, this work incorporates an adjacency matrix from graph algorithms to describe the information between nodes. This matrix stores the Spearman correlation coefficients and serves as an additional input to help the model better understand and extract the dynamic relationships among variables.
GCN is specifically designed to deal with graph-structured data [
31]. During the training process, GCN can achieve abundant feature representations by graph convolutions utilizing the constructed adjacency matrix. Thus, GCN is built to adapt the weights of the input variables according to the correlations in the adjacency matrix and aggregate highly correlated features. Then, the output tensor is merged with time characteristics captured by the TCN module, which is composed of several residual blocks. By combining TCN and GCN, the hybrid model can obtain an output that contains both cross-sequence common patterns of features and the historical trends within the sequence.
As shown in Figure 4, the extra calculations associated with GCN enable TCN to explicitly learn the data patterns and diverse relationships in multivariate time series, thereby improving the prediction accuracy and generalization capability.
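The following sketch shows one way a Spearman-based adjacency matrix could drive a single graph convolution. It is written under several assumptions that the text does not specify: ranks are computed without tie handling, absolute correlations are used as edge weights, and the symmetric normalization of Kipf and Welling is applied. The names `spearman_adjacency` and `gcn_layer` are hypothetical.

```python
import numpy as np

def spearman_adjacency(data):
    """Spearman correlation matrix of the columns of data (shape: time x variables)."""
    # argsort of argsort yields 0-based ranks; ties are broken by order of
    # appearance, which is a simplification of the usual average-rank rule.
    ranks = np.argsort(np.argsort(data, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def gcn_layer(features, adjacency, weight):
    """One graph convolution: ReLU(D^-1/2 |A| D^-1/2 H W)."""
    # Assumption: absolute correlations act as edge weights. The unit diagonal
    # of a correlation matrix already plays the role of self-loops.
    a_hat = np.abs(adjacency)
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 6))        # 20 time steps, 6 variables (synthetic)
adj = spearman_adjacency(data)         # (6, 6) adjacency matrix
node_features = data.T                 # each variable is a node with 20 features
out = gcn_layer(node_features, adj, rng.normal(size=(20, 4)))
print(adj.shape, out.shape)
```

Each row of `out` is a representation of one variable that mixes in information from the variables it correlates with most strongly, which is the aggregation behaviour described above.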
2.5. Modelling Methodology
As illustrated in Figure 5, modelling of the effluent COD by the multi-scale temporal convolutional network (MSTCN) can be carried out in the following three steps:
Fourier transform. Decompose the time series into the frequency domain and identify the k frequency components with the largest amplitudes. Then, reconstruct the original data into two-dimensional series at different time scales according to the top k periods.
Model training. For each time scale, construct an adjacency matrix to represent the correlations between variables and feed it, along with the time series data, into the GCN layer to capture finer-grained relationships between features in the spatial dimension. Similarly, build a corresponding TCN layer to extract broad historical dependencies within the sequences. The representations learned at that time scale are then obtained by fusing the outputs of the GCN and TCN layers, and the final output of the hybrid model is aggregated adaptively from the k scales using a fully connected layer.
Evaluation. Root mean squared error (RMSE), mean absolute error (MAE) [32], and the coefficient of determination (R²) [33] are chosen as the metrics to evaluate the prediction accuracy and statistically validate the performance of the proposed model. Their mathematical definitions are as follows:
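In their standard forms, which are assumed here to be the ones intended, the three metrics are:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - y_{\mathrm{pred},i} \right)^2}, \qquad
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - y_{\mathrm{pred},i} \right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - y_{\mathrm{pred},i} \right)^2}{\sum_{i=1}^{N} \left( y_i - \mu \right)^2}
```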
where N is the sample size, μ denotes the mean of the measurements, and y_i and y_pred,i are the measured and estimated values, respectively. The R² value typically ranges from 0 to 1, with values closer to 1 indicating a higher proportion of the variance in the target variable that can be explained by the predictive variables, thus signifying a better fit of the model. Conversely, lower values of RMSE and MAE reflect a smaller error between the predictions and observations.
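The three metrics can be computed directly from their standard definitions. The sketch below uses NumPy and a small made-up example; the function names and the sample values are illustrative only and do not come from the paper's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up measurements and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Note that with this definition R² becomes negative when a model predicts worse than the mean of the measurements, which can happen on a held-out test set.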
3. Experimental Design
All experiments in this work were run on a machine with the Windows 11 operating system, an AMD Ryzen 5800H processor, and 16 GB of memory. All models were built with Keras 2.3.1 in Python 3.7.0 and trained on an NVIDIA GeForce RTX 3060 GPU. Both pieces of equipment are from manufacturers located in Santa Clara, CA, USA.
3.1. Data Pre-Processing
Missing values are removed directly, as most of them occur in the first five records and the missing rate is less than 1%. The boxplots and violin plots for the remaining 358 samples are displayed in Figure 6.
After removing the outlier in Flow, the wastewater treatment dataset contains 357 samples, which are split into training and test sets at a ratio of 7:3. To preserve the integrity and continuity of the time series, the first 250 samples are used as the training set, and the remaining 107 measurements are used as the test set. Among all violin plots, SS, BOD, COD, and TN exhibit symmetrical shapes around their median lines, with high-density areas aligning closely with the median, indicating that these variables approximately follow a normal distribution. However, the data points of TP and COD-E are clustered heavily in the smaller value areas, which suggests that these two variables follow a negatively skewed distribution. Thus, it is necessary to apply an exponential transformation to them before training to reduce the skewness.
Taken together, the results of the time lag analysis in Figure 7 and the Spearman correlation analysis in Figure 8 show that there are strong interactions between the variables of the experimental data, while each series has a complex temporal pattern of its own. As observed in Figure 7, for BOD, COD, and COD-E there are still rather strong dependencies between the data point at time t and the one 30 timesteps earlier. Both factors are crucial for accuracy when forecasting with multivariate time series. Therefore, a multi-scale model can better deal with the dynamic characteristics, and the introduction of the adjacency matrix allows the model to flexibly capture diverse relationships in the data, enhancing its predictive performance and interpretability.
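A time lag analysis of this kind is often based on the sample autocorrelation. The sketch below is a minimal NumPy version applied to a synthetic series; whether the paper's Figure 7 uses exactly this estimator is an assumption, and the function name `autocorrelation` is hypothetical.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    # For each lag, correlate the series with a copy of itself shifted by that lag.
    return np.array([
        np.dot(x[:x.size - lag], x[lag:]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic series with a 30-step cycle (illustration only).
t = np.arange(300)
series = np.sin(2 * np.pi * t / 30)
acf = autocorrelation(series, max_lag=30)
print(acf[0], acf[15], acf[30])
```

A value that stays well above zero at lag 30, as in this synthetic example, is the kind of long-range dependency described above for BOD, COD, and COD-E.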
3.2. Fourier Transform
To construct the input for MSTCN, the major periods of each variable in the original data are first identified using the Fourier transform. The amplitude spectrum is plotted in Figure 9, where the three largest magnitudes are marked with red dots. According to the mode of the main frequencies, the top three periods for the entire dataset are determined to be 20, 25, and 41, respectively.
Then, the training set, with an initial size of (250, 6), is reshaped according to these periods into three-dimensional arrays with sizes of (13, 20, 6), (10, 25, 6), and (7, 41, 6), respectively. The Spearman correlation coefficients for the input variables at each time scale are calculated and stored in the corresponding adjacency matrices, as shown in Figure 10.
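The reported shapes can be reproduced with a simple period-wise reshape. Since 250 is not a multiple of 20 or 41, the sketch below assumes that the series is zero-padded at the end up to the next multiple of the period. The text does not state how the remainder is handled, so the padding and the name `reshape_by_period` are assumptions that merely match the stated sizes.

```python
import numpy as np

def reshape_by_period(data, period):
    """Fold a (time, features) array into (segments, period, features)."""
    n_steps, n_features = data.shape
    n_segments = -(-n_steps // period)          # ceiling division
    pad = n_segments * period - n_steps
    # Assumption: fill the incomplete last segment with zeros.
    padded = np.concatenate([data, np.zeros((pad, n_features))], axis=0)
    return padded.reshape(n_segments, period, n_features)

train = np.zeros((250, 6))                      # placeholder for the training set
shapes = [reshape_by_period(train, p).shape for p in (20, 25, 41)]
print(shapes)  # [(13, 20, 6), (10, 25, 6), (7, 41, 6)]
```

Each slice along the first axis holds one full period, so convolutions within a slice see intra-period variation while comparisons across slices expose inter-period variation.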
3.3. Optimization
Considering the complex structures and training costs of deep learning models, the Bayesian optimization algorithm with a Gaussian process is used for hyperparameter tuning. It is a classic method for the global optimization of unknown functions: based on the results of previous iterations, it selects the parameters for the next iteration within the specified range and updates the posterior distribution of the objective function until it closely aligns with the true distribution [34].
In this study, RMSE is taken as the optimization target, and the optimal values of the following hyperparameters are searched within 100 iterations: kernel size ks, dilation rate r, batch size, optimizer, and learning rate. To prevent overfitting and reduce the computational resources required, early stopping is also applied at this stage. As a result, the MSTCN model with a kernel size of 7 and a dilation rate of 3, trained at a learning rate of 0.01 and a batch size of 64 using the Adam optimizer, is found to outperform the other configurations.
Since the kernel size of the convolutional layer in GCN is generally the same as the size of the adjacency matrix, the critical factors that can significantly influence the final performance of MSTCN are the kernel size of the TCN module and its dilation rate. Thus, the grid search method is applied to further determine their values, and the R² of MSTCN on the test set for different parameter combinations is plotted in Figure 11. The model fits the future trends of the sequence best when ks is 7 and r is 3, which is consistent with the results of Bayesian optimization. The detailed structure of the optimized MSTCN model is listed in Table 1.
3.4. Comparison Methods
To validate the superiority of the proposed MSTCN framework, several classical deep learning models are trained in this study to predict the effluent COD concentrations for the following day, including CNN, LSTM, attention-based LSTM (ALSTM), and the standard TCN. The hyperparameters involved in building and training these models are listed in Table 2.
4. Results and Discussion
Table 3 presents the predictive performance of the proposed MSTCN and comparison models. The training duration required for each model to converge is also recorded in it for more comprehensive evaluations from different dimensions.
MSTCN not only demonstrates remarkable improvements in predictive accuracy over the other models but also exhibits superior training efficiency. MSTCN has a training R² of 0.9786, RMSE of 0.1834, and MAE of 0.1480, while it achieves the highest R² of 0.9044, the lowest RMSE of 0.3927, and the lowest MAE of 0.2765 on the test set. Figure 12 plots its predictions against the measured values. Although there are some larger deviations between predictions and observations for the last few samples, MSTCN still captures the overall trends in the test data and maintains a high degree of accuracy. When dealing with an unseen sequence that contains both long-term dependencies and short-term fluctuations, MSTCN generates predictions with only slight biases, confirming its robustness and reliability in forecasting effluent COD concentrations. Additionally, it takes 8 s to reach convergence, which is quite efficient given the model complexity.
Figure 13 presents the predictions of CNN and LSTM on the test set, which suggest that LSTM fits the observation curve better, with smaller errors, when similar trends persist over a relatively long period. However, both models generalize poorly on this wastewater treatment dataset. Although they can explain over 90% of the variability in the COD-E data during training, their predictive performance decreases significantly on the test set. The test R² of MSTCN is 17.63% and 12.67% higher than those of CNN and LSTM, respectively, and its test RMSE is 35.68% and 30.42% lower, respectively.
Unlike the fixed receptive field used by CNN to capture local characteristics, the causal convolution in TCN enables it to refer to the state at time t − 1 in the previous hidden layer when learning the changes at time t. Thus, TCN can provide more accurate predictions of the overall trends in the time series but is sometimes less effective than CNN on short-term abrupt changes, which can be confirmed by comparing Figure 13a and Figure 14a. Because the dilation rate allows TCN to adjust the size of its receptive field and makes its learning range more flexible, TCN outperforms LSTM on this dataset, which contains many fluctuations that exist only for a short period. The RMSE of the TCN model on the test set is 0.5071, which is 10.14% lower than that of LSTM, while its R² value is 4.72% higher. However, MSTCN still shows significant improvements over TCN in both training and test metrics, as its multi-scale approach enables its TCN layers to be trained separately to learn the unique temporal patterns of the target variable at each scale. As a result, MSTCN achieves a 7.59% increase in test R² and a 22.56% decrease in test RMSE compared with TCN.
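The relative improvements quoted in this section are percentage changes with respect to the baseline model. As a small worked check, the sketch below recomputes the RMSE reduction of MSTCN relative to TCN from the two test values given in the text; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(baseline, value):
    """Relative decrease of value with respect to baseline, in percent."""
    return (baseline - value) / baseline * 100.0

mstcn_rmse = 0.3927   # test RMSE of MSTCN reported in the text
tcn_rmse = 0.5071     # test RMSE of TCN reported in the text
reduction = percent_reduction(tcn_rmse, mstcn_rmse)
print(round(reduction, 2))  # 22.56
```

The same formula with the sign reversed gives the relative increase in R², so small differences between recomputed and reported percentages would most likely come from rounding of the underlying metric values.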
The attention mechanism highlights important information by assigning higher weights to critical features in the input sequence [35]. By inserting an attention module between the input layer and the LSTM, the negative influence of redundant information on LSTM performance can be effectively reduced during training. Figure 14b plots the forecasting results of ALSTM on the test set, and the differences between the ALSTM curve and the observed one are smaller during the stage with a gradually increasing trend. Comparing Figure 14a and Figure 14b shows that ALSTM explains the dynamic changes at the end of the sequence better than TCN. Its test R², RMSE, and MAE are 0.8747, 0.4496, and 0.3552, respectively, which ranks second among all the models after MSTCN. Owing to the GCN layers, MSTCN can extract more diverse feature representations from the relationships encoded in the adjacency matrix at each scale. As a result, MSTCN improves on these ALSTM metrics by 3.39%, 12.64%, and 22.14%, respectively.
Because CNN, TCN, and MSTCN rely on convolution operations that can be computed in parallel, their training time is much shorter than that of LSTM and ALSTM, which are based on serial computation. MSTCN takes 8 s to reach its optimal performance, which is 2.2 s shorter than ALSTM and 6.3 s shorter than LSTM. The ability of MSTCN to converge faster while maintaining high predictive performance suggests its scalability and applicability in complex and large-scale industrial settings [36].
Overall, the proposed MSTCN model shows high predictive accuracy in both training and test sets, making it a reliable tool for forecasting effluent COD concentrations after wastewater treatment. The model effectively identifies and simulates both long-term trends and short-term fluctuations, which is crucial for achieving accurate predictions from time series with complex data patterns [37]. These findings highlight the potential of MSTCN for improving accuracy and operational efficiency in wastewater quality prediction.
5. Conclusions
In this work, a hybrid model based on time scale decomposition and spatio-temporal feature extraction, denoted as MSTCN, is proposed to identify more complex patterns and variable interactions in the data and achieve more accurate predictions of the wastewater effluent indicator COD-E. The Fourier transform is applied first to determine the top three main frequencies in the original data, according to which the inputs are reconstructed into series at three time scales. For each time scale, the Spearman correlation coefficients between variables are stored in a separate adjacency matrix. Then, TCN and GCN layers are built to learn temporal dependencies within the sequences and aggregate highly weighted feature representations, respectively. Finally, the output of MSTCN is obtained by fusing the features from the TCN and GCN layers across the different time scales.
The dataset used to validate the model performance is sampled from the nutrient removal process of a wastewater treatment plant in South Korea. The evaluation indicator COD-E for effluent quality was used as the target variable, which included both abrupt short-term fluctuations and long-term overall trends in its series. The proposed MSTCN model achieved an R2 of 0.9786, an RMSE of 0.1834, and an MAE of 0.1480 on the training set, while the values of R2, RMSE, and MAE were 0.9044, 0.3927, and 0.2765 for the test set, respectively. The experimental results demonstrate the effectiveness of MSTCN in predicting future COD-E concentrations with multivariate and multi-pattern time series data.
In future work, the effluent biochemical oxygen demand (BOD) will be considered as another critical indicator for evaluating the wastewater quality after treatment, and the relationship between BOD and COD will be examined to achieve more comprehensive monitoring and more accurate predictions of the effluent quality.