1. Introduction
Dissolved oxygen (DO) is an important index of environmental water quality and is affected by complex, dynamic, and non-linear factors. As one of the critical elements of the aquatic environment, DO controls the survival of aquatic organisms, the decomposition type of organic pollutants, and the strength of the self-purification of water bodies. An anoxic environment further leads to the release of nitrogen, phosphorus, and other pollutants from the sediment into the water, which degrades the water quality [1,2,3]. Changes in the dissolved oxygen content in intensive aquaculture ponds are influenced by meteorological conditions and other water quality parameters [4,5]. The temperature, body mass, mineralization of organic matter, and several environmental factors, such as solar irradiance, water temperature, pH, turbidity, rainfall, and wind speed, also affect the dynamics of DO [6]. The physical and chemical factors that influence the level of DO in water also include water surface agitation and chemical oxygen consumption, such as sulfide oxidation and nitrification. The biological factors are photosynthesis and aerobic respiration. A significant discharge of nutrients into an aquatic system can decrease the DO concentration through the associated biological and chemical processes [7,8].
Because DO data change in a complex, non-linear way, predicting and simulating the dissolved oxygen content and its dynamics is very important for water environment health assessment and management. Scholars around the world have established various numerical models to study the time series fluctuation of DO. A dynamic DO model based on a one-dimensional differential equation was proposed to study the dynamics of DO in relation to the solar irradiance, temperature, salinity, and mineralization of particulate organic matter, as well as the influence of reaeration and wind velocity on estuarine DO dynamics [9]. A dynamic DO model was developed using STELLA to predict the variations of the net ecosystem metabolism (NEM) in the Yellow River Estuary, in which the simulation error was kept under 23%. A sensitivity analysis of the model identified temperature as the most influential factor affecting the metabolic rates at individual sites [10]. DO concentrations were simulated numerically using the TELEMAC-WAQTEL-O2 model, which was applied to investigate the effect of different weather conditions, including the tide, mean and maximum wind, and different water temperatures, on DO in Egypt [11]. A process-based model chain (the lake model PROTECH driven by the INCA-N and INCA-P catchment models) was used to quantify the effectiveness of terrestrial nutrient control measures on DO concentrations. Nutrient load reductions were a significant driver of increased DO concentrations, associated with changes in the water temperature and chemistry [12]. As a multi-factor hydrological process simulation model, the Soil and Water Assessment Tool (SWAT) is often used to simulate and predict hydrological processes. The numerical simulation of non-point source pollution can also be realized using the SWAT model [13]. However, an obvious shortcoming of this model is that it requires a large amount of input information, such as the hydrology, meteorology, geology, land use, farming methods, crop types, and regional economy, which complicates model construction and calibration [14].
To overcome the inconvenience caused by the lack of data in numerical models, machine learning models have been applied to simulate and predict water quality [15]. Machine learning techniques, which comprise a set of algorithms and statistical models, are widely used in water quality monitoring and prediction [16,17]. These techniques, particularly artificial neural networks (ANNs), can enhance predictive capabilities by considering non-linear interactions and capturing underlying complexities. An ANN model trained with the Levenberg–Marquardt algorithm, which is effective in optimizing network parameters, can capture complex relationships and provide reliable DO predictions [18]. Two ANN models, the feed-forward neural network (FFNN) and radial basis function neural network (RBFNN), were developed to predict the dissolved oxygen from the biochemical oxygen demand (BOD) and chemical oxygen demand (COD) in the Surma River; the results indicated that ANN models could be employed successfully to estimate the dissolved oxygen in this river [19]. Another machine learning method, the support vector machine (SVM), was also applied in water environment assessments and was thought to be better than the ANN because the SVM can deal with non-linear problems when identifying some predictive solutions [20,21]. A predictive modeling framework based on support vector regression (SVR) was proposed to analyze and predict the spatiotemporal variations of the DO concentration in a world-renowned mega project; the framework could also identify the key parameters that should be monitored under changing environmental factors [22]. As a classical neural network model, the backpropagation neural network (BPNN) can extract non-linear relationships from the input factors with high interpretability, while convergence to local optima can be avoided. It is a multi-layer feed-forward network trained according to an error backpropagation algorithm, which usually contains an input layer, an output layer, and one or more hidden layers [23]. An AEABC-BPNN model was built on the non-linear input-output fitting capability of the BPNN; it has the dual advantage of self-renewal and global iterative updates under the adaptive evolution strategy [24]. A novel clustering-based Softplus extreme learning machine (CSELM) was proposed to predict dissolved oxygen changes from time series data, and it achieved better accuracy and efficiency than other models in a real-world dissolved oxygen prediction task [4].
Deep learning models have the characteristics of multi-layer feedback simulations, which can overcome the critical time limitation of machine learning models. In recent years, deep learning models have received increasing attention in water quality analysis and prediction. The recurrent neural network (RNN) is a neural network with recurrent connections that can learn the non-linear features of sequence data to realize time series predictions [25]. The long short-term memory (LSTM) network, as a typical deep learning model, is particularly suited to time series prediction because of its high accuracy and good scalability, and it can handle time series with long delays and long intervals [26]. An LSTM model combined with a gradient boosting decision tree (GBDT), used to select the characteristic factors with a strong influence on DO, was proposed for the standard pond of the Jintan fishery base in China [27]. The gated recurrent unit (GRU), derived from the LSTM, simplifies the LSTM network structure while maintaining its performance [28]. With a fixed number of parameters, a GRU has advantages over the LSTM in convergence in CPU time, parameter updates, and generalization [29]. The ability of a GRU to learn long-term dependencies allows it to capture the long-term correlation and periodic patterns of water quality data, so it can predict water quality accurately by reflecting the continuous water pollution process. The GRU model was verified to be better than some linear models, such as the autoregressive integrated moving average (ARIMA) model [30]. A PCA-PFA-GRU model was established to predict the dissolved oxygen in perch culture water, in which a principal component analysis (PCA) was first used to eliminate redundant variables and reduce the data dimension and complexity. The dissolved oxygen, water temperature, and conductivity were chosen to substitute for the original variables [31].
As a time-frequency domain transformation method, variational mode decomposition (VMD) can reduce the non-stationarity of highly complex and strongly non-linear time series and yield relatively stable subsequences at multiple frequency scales. It is an adaptive and completely non-recursive variational signal processing method proposed in 2014 [32]. It has the advantage that the number of modes can be specified. Its adaptability lies in choosing the number of modes for a specific sequence according to the actual situation, after which the center frequency and finite bandwidth of each mode are adaptively matched during the subsequent search and solution process. In this way, the intrinsic mode functions (IMFs) are effectively separated and the signal is partitioned in the frequency domain, which yields the effective decomposition components of the given signal and, finally, the optimal solution to the variational problem. The VMD method overcomes the end effect and mode aliasing problems of the empirical mode decomposition (EMD) method; it avoids aliasing by controlling the bandwidth and is therefore suitable for non-stationary sequences.
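For reference, the constrained variational problem that VMD solves can be written as below, following the standard formulation of the method. The symbols are the usual ones and are assumptions with respect to this section, which does not define them: f(t) is the input signal, u_k(t) the k-th mode, ω_k its center frequency, K the chosen number of modes, δ the Dirac distribution, and * convolution.

```latex
\min_{\{u_k\},\{\omega_k\}}
\left\{ \sum_{k=1}^{K}
\left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right]
e^{-j\omega_k t} \right\|_2^2 \right\}
\quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t)
```

Each term measures the bandwidth of one mode after it has been shifted to baseband, so minimizing the sum yields K band-limited modes that together reconstruct the signal.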
According to the characteristics and advantages of frequency division, it can be speculated that the interpretability and accuracy of the model will be improved by first using the VMD method to decompose and screen the factors affecting water quality and then conducting a model simulation. There are few reports on constructing a predictive framework that combines frequency division with a machine learning or deep learning model in non-point source pollution research. Taking DO as the main research index and six rivers in the Chengdu area as the research object, this paper uses the VMD method to decompose and screen the factors influencing water quality in the Chengdu area. The GRU, random forest, and extreme gradient boosting (XGBoost) models were then used to predict DO before and after variational mode decomposition in order to compare the effect of the decomposition on DO prediction in practice. The new prediction framework combining VMD with an intelligent model is expected to improve the simulation accuracy of DO and contribute to water area management.
3. Results and Discussion
3.1. Results Analysis of Variational Mode Decomposition
Water quality is affected by different factors. Water quality data are composed of entangled signals from different frequency bands, each with its own characteristics. Variational mode decomposition can be used to separate the signals of different frequency bands from the original data so that the underlying patterns can be identified more clearly.
Figure 7 compares the original data with the decomposed signals, in which IMF1 is a low-frequency signal representing the long-term trend of the data, and IMF2–IMF5 are high-frequency signals representing the short-term mutations of the data.
Figure 8 shows the Fourier spectrum of each signal obtained from the variational mode decomposition of DO.
Figure 8 shows that there is basically no frequency band entanglement among the signals after decomposition of the DO data; the frequency bands that were entangled in the original data are effectively separated and can be used for further research.
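A spectrum of the kind shown in Figure 8 could be computed along the following lines. This is a minimal sketch that assumes each decomposed mode is available as an equally spaced NumPy array; the function names and the two synthetic modes are illustrative assumptions, not the authors' code or data.

```python
import numpy as np


def amplitude_spectrum(signal, sample_spacing=1.0):
    """Return (frequencies, amplitudes) of the one-sided Fourier spectrum."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    freqs = np.fft.rfftfreq(n, d=sample_spacing)
    amps = np.abs(np.fft.rfft(signal)) / n
    return freqs, amps


def dominant_frequency(signal, sample_spacing=1.0):
    """Frequency at which the mean-removed signal has its largest amplitude."""
    signal = np.asarray(signal, dtype=float)
    freqs, amps = amplitude_spectrum(signal - signal.mean(), sample_spacing)
    return freqs[np.argmax(amps)]


# Synthetic stand-ins for a low-frequency and a high-frequency mode
# (hypothetical data, used only to exercise the functions).
n_samples = 512
t = np.arange(n_samples)
slow_mode = np.sin(2 * np.pi * 4 * t / n_samples)         # 4 cycles per record
fast_mode = 0.3 * np.sin(2 * np.pi * 64 * t / n_samples)  # 64 cycles per record

slow_peak = dominant_frequency(slow_mode)
fast_peak = dominant_frequency(fast_mode)
print(slow_peak, fast_peak)
```

Well-separated peaks for the individual modes correspond to the absence of frequency band entanglement described above.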
3.2. Correlation Analysis
The correlation between the variables was analyzed using the Pearson correlation analysis method. The correlation values between DO and the meteorological factors, pollution factors, and other variables in the original data and in the decomposed signals IMF1–IMF5 were obtained and are listed in Table 4. Significance statistics were calculated automatically, and p-values were generated accordingly. The validity of the correlation coefficients was determined by checking whether the p-value was less than 0.05. All the correlation coefficients greater than 0.3 had p-values less than 0.05, which confirms the validity of the correlation analysis. The p-values of the correlation coefficients are shown in Table 5.
It is generally believed that variables with an absolute correlation value less than 0.3 are not correlated, those with 0.3–0.8 are relatively strongly correlated, and those greater than 0.8 are very strongly correlated [33]. Based on the correlation analysis and VMD decomposition, the main factors influencing the long-term and short-term dissolved oxygen content could be determined.
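The screening rule implied by these thresholds can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names, the dictionary layout, and the synthetic series are hypothetical, and only the 0.3 cut-off comes from the text. Where SciPy is available, scipy.stats.pearsonr returns the coefficient together with its p-value.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def screen_by_correlation(candidates, target, threshold=0.3):
    """Keep the candidate series whose |r| with the target reaches the threshold."""
    kept = {}
    for name, series in candidates.items():
        r = pearson_r(series, target)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Synthetic series (hypothetical names and values, not the paper's data).
n_samples = 100
phase = 2 * np.pi * np.arange(n_samples) / n_samples
do_target = np.sin(phase)
candidates = {
    "water_temp_imf1": 1.0 - 2.0 * np.sin(phase),  # perfectly anti-correlated
    "wind_speed_imf3": np.cos(phase),              # uncorrelated with the target
}

kept = screen_by_correlation(candidates, do_target)
print(kept)
```

The absolute value is used so that strongly negative relationships are retained as well as positive ones.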
The results show that the long-term trend characteristic of DO is influenced by the superposition of meteorological factors, hydrological factors, and water pollution factors, but it is not strongly correlated with any single factor. The absolute values of the correlations between DO and WT, T, TP, MT, MBP, BP, LBP, pH, CODMn, and NH3-N are 0.3–0.8, which indicates that these factors have a relatively strong influence on the long-term characteristics of DO. Among them, the influence of temperature, phosphorus, and air pressure is relatively high, while that of the pH value, CODMn, and nitrogen is relatively low. According to the results, the remaining factors have weak or no correlation with the long-term trend of DO. The short-term mutation characteristics of DO are mainly affected by the water temperature, pH value, and eutrophication factors, such as NH3-N, TP, and TN in water bodies, which indicates that they are mainly determined by the characteristics of the water body itself, so the influence of the meteorological factors can basically be ignored.
3.3. Comparative Analysis of the Model Simulation Results before and after Decomposition
3.3.1. Model Simulation Results with the Original Data
Three models, the GRU, random forest, and XGBoost, were used to predict the DO content in the Chengdu area with the original data. The results are shown in Figure 9.
The mean absolute error (MAE), root mean square error (RMSE), and symmetric mean absolute percentage error (SMAPE) were used to evaluate the deviation and accuracy of the three models and thereby judge their prediction performance. The calculated RMSE, MAE, and SMAPE values are shown in Table 6.
As seen in Table 6, the RMSE values are greater than 0.905, the SMAPE values are less than 0.104, and the MAE values, 0.519–0.742, are moderate. The evaluation results show that, affected by the interference factors, the three models fit the original data relatively poorly, and the simulation performance is unsatisfactory.
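The three evaluation metrics can be sketched as follows. The RMSE and MAE definitions are standard; the SMAPE is written here in one common fractional form, mean(|ŷ − y| / ((|y| + |ŷ|) / 2)), which is an assumption because this section does not state the exact definition used. The function names and sample values are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, returned as a fraction.

    Assumes |y_true| + |y_pred| is never zero, which holds for positive
    DO concentrations.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_pred - y_true) / denom))


# Hypothetical observed and predicted DO values, for illustration only.
observed = [8.0, 6.0, 10.0]
predicted = [7.0, 6.0, 12.0]
print(rmse(observed, predicted), mae(observed, predicted), smape(observed, predicted))
```

All three metrics are non-negative, and smaller values indicate a closer fit between the predicted and observed series.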
3.3.2. Model Simulation Results with the Decomposed Signals
The signals whose correlation with the predicted target DO was less than 0.3 were deleted first. Then, using the signal data obtained from the mode decomposition of the other variables, the DO was predicted using the GRU, random forest, and XGBoost techniques. The simulation results are shown in Figure 10. The RMSE, MAE, and SMAPE were calculated in the same way, and the results are shown in Table 7.
After eliminating the influence of the interference factors, the simulation results of the GRU, random forest, and XGBoost models improved significantly. The RMSE value ranges from 0.482 to 0.511, the SMAPE value ranges from 0.052 to 0.059, and the MAE ranges from 0.364 to 0.435. All the values improved greatly compared with those before decomposition. The simulation results show that variational mode decomposition is very helpful for improving the simulation results. Additionally, the XGBoost and random forest models have a high simulation fitting degree and can be used to predict DO in the water environment of Chengdu.
3.4. Discussion
In this paper, an in-depth analysis of the original water quality data was conducted by combining variational mode decomposition with Fourier transformation. The analysis results indicate that the data consist of multiple modal signals at different frequencies, which may be related to various factors influencing the water quality data, including the long-term trends caused by seasonal changes and short-term fluctuations due to occasional events. Based on this, an innovative water quality prediction framework for DO was developed by combining VMD with machine learning and deep learning techniques. In this framework, the original data were first decomposed into several intrinsic mode components, and these components were then screened using a filter. Then, by combining intelligent models, the relationship between the intrinsic mode components and DO was explored to predict future changes. The prediction results of the GRU, random forest, and XGBoost models with and without the framework constructed in this paper were compared in order to verify the effectiveness of the framework. The prediction evaluation metrics were the RMSE, MAE, and SMAPE. The experimental results demonstrated that the prediction accuracy of the GRU deep learning model was improved by 27.56%, 16.4%, and 16.5% for the RMSE, MAE, and SMAPE, respectively. The improvement of the machine learning models was more obvious: the random forest was improved by 44.5%, 47.8%, and 46.8%, and XGBoost by 46.8%, 49.3%, and 48.4%. This obvious improvement may be attributed to two effects. VMD separates the entangled modal signals in the original data into modal components representing different frequency bands, which prevents the essential change patterns of the data from being obscured by the entangled modes, and the filter eliminates the noise unrelated to DO, thereby enhancing the prediction accuracy.
Through the constructed framework, the machine learning models can capture the inherent law of complex non-linear water quality sequence data more accurately, so that their prediction accuracy even exceeds that of the GRU deep learning model, which further verifies the effectiveness of the framework.
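Percentage improvements of this kind are conventionally computed as the relative reduction of an error metric. The sketch below assumes that convention, since the exact formula is not stated in this section; the function name and the numbers in the example are hypothetical and are not taken from Tables 6 and 7.

```python
def relative_improvement(error_before, error_after):
    """Relative reduction of an error metric, in percent.

    Positive values mean the error decreased after decomposition.
    """
    if error_before <= 0:
        raise ValueError("error_before must be positive")
    return (error_before - error_after) / error_before * 100.0


# Hypothetical example: an error that drops from 0.8 to 0.6.
example_gain = relative_improvement(0.8, 0.6)
print(f"{example_gain:.1f}%")
```

Because the measure is relative to the error before decomposition, it can be compared across metrics with different scales, such as the RMSE and the SMAPE.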
4. Conclusions
A correlation analysis between the influencing factors and DO in the water environment of the Chengdu area was carried out. It was concluded that the long-term trend characteristic of DO is influenced by the superposition of meteorological factors, hydrological factors, and water pollution factors. The air temperature, water temperature, phosphorus, air pressure, pH value, chemical oxygen demand, and nitrogen are relatively strongly correlated with the long-term trend characteristics of DO. Further, a signal decomposition analysis was carried out using the VMD method. It was concluded that the characteristics of the water body itself, such as the water temperature, pH value, and eutrophication factors, have the greatest influence on the short-term mutation characteristics of DO. By combining frequency division with the GRU, random forest, and XGBoost models, the DO content was predicted before and after variational mode decomposition. It was found that variational mode decomposition could effectively improve the simulation accuracy of the models.
DO is only one of the factors that can be used to measure environmental water quality. In future studies, we can further explore the actual impact of variational mode decomposition on the prediction models of pollution factors other than DO. We could attempt to develop a new water quality forecasting model system using variational mode decomposition combined with a water quality prediction model. The application scope of the new forecasting model system could also be extended from the Chengdu area to a larger regional scope.