Novel Ensemble Learning Approach for Predicting COD and TN: Model Development and Implementation

Cheng, Qiangqiang; Kim, Ji-Yeon; Wang, Yu; Ren, Xianghao; Guo, Yingjie; Park, Jeong-Hyun; Park, Sung-Gwan; Lee, Sang-Youp; Zheng, Guili; Wang, Yawei; Lee, Young-Jae; Hwang, Moon-Hyun

doi:10.3390/w16111561


by Qiangqiang Cheng ¹,†, Ji-Yeon Kim ²,†, Yu Wang ¹,*, Xianghao Ren ¹, Yingjie Guo ¹, Jeong-Hyun Park ³, Sung-Gwan Park ², Sang-Youp Lee ², Guili Zheng ⁴, Yawei Wang ⁴, Young-Jae Lee ⁵ and Moon-Hyun Hwang ²,*

¹ Key Laboratory of Urban Stormwater System and Water Environment, Ministry of Education, Beijing University of Civil Engineering and Architecture, Beijing 100044, China

² Institute of Conversions Science, Korea University, 145, Anam-ro, Sungbuk-gu, Seoul 02841, Republic of Korea

³ Graduate School of Engineering Practice, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

⁴ Research Center, Xinhua Pharmaceutical (Shouguang) Co., Ltd., 10 Chayan Road, Shouguang 262700, China

⁵ Department of Water Resources, Graduate School of Water Resources, Sungkyunkwan University, Suwon 16419, Republic of Korea

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Water 2024, 16(11), 1561; https://doi.org/10.3390/w16111561 (registering DOI)

Submission received: 15 May 2024 / Revised: 28 May 2024 / Accepted: 28 May 2024 / Published: 29 May 2024

(This article belongs to the Special Issue Membrane Separation and Water Treatment: Modeling and Application)


Abstract:

Wastewater treatment plants (WWTPs) generate useful data, but effectively utilizing these data remains a challenge. This study developed novel ensemble tree-based models to enhance real-time predictions of chemical oxygen demand (COD) and total nitrogen (TN) concentrations, which are difficult to monitor directly. The effectiveness of these models, particularly the Voting Regressor, was demonstrated by achieving excellent predictive performance even with the small, volatile, and interconnected datasets typical of WWTP scenarios. By utilizing real-time sensor data from the anaerobic–anoxic–oxic (A2O) process, the model successfully predicted COD concentrations with an R² of 0.7722 and TN concentrations with an R² of 0.9282. In addition, a novel approach was proposed to assess A2O process performance by analyzing the correlation between the predicted C/N ratio and the removal efficiencies of COD and TN. During a one and a half year monitoring period, the predicted C/N ratio accurately reflected changes in COD and TN removal efficiencies across the different A2O bioreactors. The results provide real-time COD and TN predictions and a method for assessing A2O process performance based on the C/N ratio, which can significantly aid in the operation and maintenance of biological wastewater treatment processes.

Keywords:

ensemble model; water quality prediction; COD & TN; A2O process; WWTPs

1. Introduction

Since 2010, the volume of data generated and stored within wastewater treatment plants (WWTPs) has grown exponentially. More broadly, it has been observed that 90% of global data have been generated within the past two years alone, signifying the onset of the big data era [1].

This abundance of data holds significant value in the realm of artificial intelligence (AI), primarily for predictive analytics and analysis [2]. However, the escalation in data quantity has also presented several challenges in their efficient utilization.

Despite the extensive data collection facilitated by water quality sensors in WWTPs, certain parameters such as chemical oxygen demand (COD), total nitrogen (TN), and total phosphorus (TP) still require analysis through time-consuming experimentation [3]. Moreover, the complex interactions between biological and physicochemical treatment processes, coupled with fluctuations in influent quality and quantity, lead to significant variations in operational conditions and treatment efficiencies in WWTPs, presenting challenges for monitoring, operation, and maintenance. Consequently, conventional modeling methodologies may prove insufficient for practical operational environments.

Recent research has observed a significant rise in the use of artificial intelligence (AI) techniques for predicting parameters in wastewater treatment plants (WWTPs), specifically aiming to improve treatment efficiency for COD, TN, and TP [4]. Artificial neural networks (ANN) have proven to be effective in these studies, offering advantages over traditional WWTP modeling by requiring shorter processing times and demonstrating the ability to quickly adapt to new conditions through retraining only on relevant data [5]. Several studies have employed ANN methods to predict water quality parameters and simulate the treatment efficiency of WWTPs [6,7,8,9]. Comparisons between predicted values and actual data, conducted using both ANN and adaptive neuro-fuzzy inference systems (ANFIS), unveiled a significant correlation of 96.0% for COD [10]. This strong correlation indicates the potential applicability of AI-based techniques for predicting and controlling the performance of aerobic biological processes in WWTPs. Utilizing a variety of water quality data to forecast how WWTPs will function under diverse scenarios, encompassing changes in influent quantity and quality, operating conditions, and seasonal variations, enables the development of decision support systems that optimize WWTP operations, leading to enhanced efficiency, cost savings, and other beneficial outcomes. In practice, the acquisition of comprehensive water quality datasets encompassing parameters such as biochemical oxygen demand (BOD), COD, total suspended solid (TSS), total dissolved solid (TDS), ammonia nitrogen (NH₃-N), TN, TP, dissolved oxygen (DO), pH, temperature, etc., for implementation in AI-based models within WWTPs poses a significant challenge [11]. This limitation substantially impedes the effective deployment of AI-based models such as ANN and ANFIS, which heavily rely on extensive training datasets to establish intricate relationships between input and output variables. 
The operational reality of WWTPs often involves incomplete or sparse data due to factors like sensor malfunctions, sampling limitations, and operational constraints [12]. Consequently, the absence of robust data undermines the performance and reliability of AI-based models, resulting in compromised predictive accuracy and increased uncertainty. Addressing these challenges necessitates innovative approaches to data collection, preprocessing, and modeling to enhance the robustness and applicability of AI-based predictive models within practical WWTP settings.

The development of an AI-based approach for real-time prediction of COD and TN, which are challenging to measure directly in wastewater treatment processes, offers significant advantages for monitoring biofilm formation in biological treatment systems and membrane fouling in membrane bioreactor (MBR) processes. Biofilm formation and activity are closely linked to the availability of organic carbon sources (represented by COD) and nutrients like nitrogen (represented by TN). Accurate real-time prediction of these parameters enables the optimization of operational conditions to promote stable and efficient biofilm growth, leading to enhanced removal of organic matter, nitrogen, and other contaminants. Furthermore, in MBR processes, membrane fouling is influenced by the characteristics of the influent, including COD, TN, and the C/N ratio. The ability to predict these parameters in real time can facilitate monitoring and control strategies to mitigate membrane fouling, thereby improving the overall performance and longevity of the MBR system. Additionally, the AI-based approach can provide valuable insights into the dynamics and interactions between biofilm formation and membrane fouling in integrated biological and membrane-based treatment processes. By leveraging real-time predictions, it becomes possible to optimize the synergy between these two critical components, ultimately enhancing the overall efficiency and sustainability of wastewater treatment operations.

This study introduces a novel ensemble learning approach that combines predictions from various regression models to generate the final prediction. The utilization of ensemble learning methods is driven by the objective of predicting optimal outcomes using limited data obtainable from actual WWTPs (specifically focusing on the anaerobic–anoxic–oxic (A2O) process). One of the primary goals of this study is to develop and apply AI-based models that enable reasonably satisfactory predictions despite the constraints of limited data availability. This method operates by averaging the predictions of individual models or taking a weighted average, integrating multiple models using different algorithms to learn diverse data patterns and characteristics. Combining multiple models generally leads to reduced error compared to a single model, with averaging or following the majority opinion helping to mitigate extreme prediction values and provide more stable prediction outcomes. Therefore, in scenarios where data are scarce or missing, this new ensemble learning method could offer a practical solution. By leveraging relatively easily obtainable data (i.e., NH₃-N, pH, TDS, temperature, etc.) through real-time measurements at actual WWTP sites, it is possible to implement a method for real-time prediction of water quality parameters that would otherwise require time-consuming experiments for measurement such as COD and TN. The results from this study are expected to be applicable for monitoring and predicting COD and TN concentrations in membrane bioreactor (MBR) processes integrated with the A2O process. Particularly, in MBR processes, the COD and TN concentrations in the influent (i.e., effluent from the A2O process) play a crucial role in membrane fouling. The proposed ensemble learning approach can provide reliable real-time predictions of critical water quality parameters related to MBR fouling, such as COD and TN, even with limited data availability. 
This capability is particularly valuable for MBR processes, where membrane fouling as well as biofilm formation is influenced by the effluent from the A2O process, including COD and TN. By accurately predicting these parameters, the ensemble learning method can be used to monitor the performance of this type of integrated A2O-MBR process. Furthermore, based on the ability to predict COD and TN concentrations in real time, a method was proposed to evaluate the performance of each reactor within the A2O process through correlation analysis between the C/N ratio (i.e., the ratio of COD to TN) and the removal efficiencies of COD and TN in each reactor.

2. Model Development and Data Preparation

2.1. Ensemble Model

Machine learning problems are primarily divided into supervised learning, where labels are known, and unsupervised learning, where patterns within the data are identified without labels. Supervised learning typically addresses classification problems, in which data are classified as “True” or “False” (i.e., the process of identifying the category relationship of existing data and autonomously determining the category of newly observed data), and regression problems, which involve predicting continuous numerical values. In this study, from the perspective of solving regression problems, real-time measurable parameters were utilized to predict the values of COD and TN, which are difficult to measure in real time.

There are various models for solving regression problems, including simple linear regression, which uses a single input variable, multiple regression, which employs multiple input variables, and polynomial regression, which identifies nonlinear relationships. Generalized regression models include traditional techniques such as Ridge, Lasso, and Elastic Net regression, as well as more recently developed machine learning methods like Random Forest and XGBoost. Recently, deep learning models have been developed to solve regression problems by stacking multiple layers, utilizing simple linear equations, activation functions, and backpropagation. However, these models require a substantial amount of consistent data, and on tabular data tree-based models such as XGBoost and Random Forest still outperform deep learning approaches [13]. Therefore, considering the typical data collection scenario in WWTPs, the present study applies tree-based machine learning techniques to address regression problems.

In the present study, ensemble learning methods, which combine various machine learning regression models to generate a final prediction, were employed. Representative ensemble models include Bagging [14], Boosting [15], and Voting models [16]. Bagging, exemplified by Random Forest [17], involves independently training models on sub-datasets obtained through a bootstrap resampling process from the training dataset and combining their outputs. The Boosting approach, such as Adaptive Boosting, Gradient Boosting, and XGBoost, sequentially trains baseline models while assigning higher weights to data points that were incorrectly predicted in previous iterations, thereby reducing errors. Voting combines the prediction results of multiple models through a voting process.

Such ensemble models can incorporate multiple models utilizing different algorithms, enabling the learning of diverse data patterns and characteristics by averaging or weighted averaging of individual model predictions. Additionally, combining multiple models generally reduces errors compared to a single model, and the process of averaging or following the majority opinion mitigates extreme predictions, providing more stable results [16]. The governing equation for the weighted ensemble models is represented by Equation (1), where n represents the number of regressors in the ensemble, w_i is the weight assigned to the i-th regressor, and y_i(x) is the prediction made by the i-th regressor for the input x.

\hat{y} (x) = \sum_{i = 1}^{n} w_{i} \cdot y_{i} (x)

(1)
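As a minimal sketch of Equation (1), the weighted combination can be written in a few lines of Python. The function name `weighted_ensemble`, the example predictions, and the normalization of the weights so that they sum to 1 are illustrative assumptions; they are not taken from the study's implementation.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions as in Equation (1): y_hat(x) = sum_i w_i * y_i(x).

    predictions: list of n lists, one per regressor, each holding that
                 regressor's predictions for the same samples.
    weights:     list of n non-negative weights, one per regressor.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per regressor is required")
    total = sum(weights)
    # Assumption: weights are normalised to sum to 1 (Equation (1) does not state this).
    norm = [w / total for w in weights]
    n_samples = len(predictions[0])
    return [
        sum(w * preds[j] for w, preds in zip(norm, predictions))
        for j in range(n_samples)
    ]


# Hypothetical predictions (mg/L) from three regressors for two samples.
model_preds = [[400.0, 520.0], [420.0, 500.0], [440.0, 480.0]]
equal = weighted_ensemble(model_preds, [1, 1, 1])      # plain averaging
weighted = weighted_ensemble(model_preds, [2, 1, 1])   # first model counts double
```

With equal weights this reduces to simple averaging, which is why a single extreme prediction from one regressor is damped in the ensemble output.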

2.2. Exploratory Data Analysis (EDA)

The data utilized in this study were collected from 1 January 2022, to 29 September 2023 (a total of 352 days) at a WWTP in China, which operates using the A2O process. The collected data include pH, temperature, COD, ammonia nitrogen (NH₃-N), TN, and total dissolved solids (TDS). Detailed information regarding these water quality parameters and a detailed description of the A2O process are provided in Section 3.1.

Table 1 presents the Pearson correlation coefficients for four water quality parameters: COD, NH₃-N, TN, and TDS in each reactor of the A2O process. Regardless of the reactor, COD and NH₃-N exhibited a strong correlation with the Pearson correlation coefficients greater than 0.5. TDS showed a moderate correlation with the nitrogen parameters (NH₃-N and TN); however, excluding the anaerobic reactor, the coefficients were below 0.5. This implies that it is essential to consider that the correlations among the collected data may differ for each water quality parameter and reactor when utilizing machine learning to derive correlations between the data. Figure 1 presents the pairwise correlations among the four water quality parameters (i.e., COD, NH₃-N, TN, and TDS) for each reactor in the A2O process, derived using pairplots. As depicted in Figure 1, all four parameters exhibit an approximately normal distribution. The anaerobic reactor exhibited stronger linearity compared to other reactors as shown in Figure 1. Additionally, as presented in Table 1, the Pearson correlation value between NH₃-N and TN in the anaerobic reactor was 0.8433, indicating the strongest correlation among the reactors. This observation indicates that each reactor in the A2O process demonstrates different performance in terms of organic matter degradation and nitrification or denitrification processes.

In this study, upon examining the raw data obtained from the WWTP, it was evident that a substantial number of missing values were present. Table 2 presents the analysis results of missing value occurrences for water quality parameters (i.e., COD, NH₃-N, TDS, temperature (T), and pH) across different reactors. The analysis results of the overall missing values occurrence distribution obtained from the WWTP are presented in Figure 2. As evident from Figure 2, the distribution of missing values across different reactors varies significantly, indicating that the available data for machine learning training are limited. This situation is commonly encountered in most WWTP data collection scenarios. Therefore, to secure a larger usable dataset, consistent measurement and analysis practices must be implemented. One of the key reasons for developing and employing ensemble models in this study is to maximize the prediction performance even in situations where the available data are limited.

Table 3 presents the descriptive statistics for water quality parameters, including the data count (Count), mean value (Mean), standard deviation (Std), minimum value (Min), and maximum value (Max). Additionally, the 25th, 50th, and 75th percentile values are provided. Figure 3 illustrates the data before (Figure 3a) and after (Figure 3b) scaling to the standard normal distribution, which was applied to address the outliers identified during the data preprocessing stage [17,18]. As shown in Figure 3, numerous outliers were still present even after scaling. This suggests that when water quality parameter data, either automatically recorded in real time by sensors or recorded through analysis, are collected over an extended period in a typical WWTP, numerous outliers can occur within the dataset. Therefore, it is crucial to appropriately handle these outliers in the data before utilizing them as a training dataset for machine learning models. As will be discussed in more detail later in Section 4, the correlation matrix illustrated in Figure 4 clearly shows darker shades of red towards the bottom-right corner, indicating stronger correlations among the data within this region. This suggests that the reactor in the A2O process in which highly correlated data were collected is closely related to changes in the corresponding water quality parameters. The outliers identified during the EDA process were handled in the data preprocessing and feature engineering stages (see Section 2.3).

2.3. Data Preprocessing

In this study, an exploratory data analysis (EDA) was conducted to understand the characteristics of the data and determine appropriate preprocessing steps. Data with more than 50% missing values or data not utilized for model development were dropped. However, for the TDS parameter, which was used for model development, the ‘0’ values were treated as missing and imputed using the multivariate imputation by chained equations (MICE) method. Traditionally, statistical values such as the mean, mode, or median are used for imputation, or predicted values obtained through linear regression are employed. However, using a single summary value such as the mean, mode, or median for imputation can lead to reduced model performance, and values predicted through linear regression are deterministic, so they can appear more precise than actual measurements would be. To address these issues, this study utilized the MICE method for imputation, which not only considers the relationships with other variables but also accounts for potential errors or uncertainties in the measurement procedure [19,20].
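The imputation step can be illustrated with scikit-learn's `IterativeImputer`, which is modeled on chained-equation imputation. This is only a sketch under stated assumptions: the synthetic NH₃-N, TN, and TDS columns, the column order, and all parameter values are hypothetical, and the study does not state which software it used. A full MICE analysis would repeat the imputation with several random seeds; a single draw is shown here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical, synthetic data standing in for NH3-N, TN, and TDS (mg/L).
rng = np.random.default_rng(0)
nh3 = rng.normal(50.0, 10.0, size=200)
tn = 1.2 * nh3 + rng.normal(0.0, 2.0, size=200)
tds = 3000.0 + 20.0 * nh3 + rng.normal(0.0, 50.0, size=200)
X_true = np.column_stack([nh3, tn, tds])

# Simulate TDS readings logged as 0.
X_obs = X_true.copy()
zero_rows = rng.choice(200, size=30, replace=False)
X_obs[zero_rows, 2] = 0.0

# Treat the '0' readings as missing, then impute each incomplete column
# from the other columns in round-robin fashion.
X_obs[X_obs[:, 2] == 0.0, 2] = np.nan
imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_obs)
```

Setting `sample_posterior=True` draws each imputed value from the predictive distribution instead of using the point prediction, which is what lets the imputation carry some of the measurement uncertainty described above.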

After imputation, the local outlier factor (LOF) method was employed to remove outliers from the dataset without missing values. The LOF method detects outliers by measuring the local density deviation of a sample with respect to its neighboring data points. This approach compares the density of a sample to that of its K-nearest neighbors (KNN), considering the local characteristics, hence the term “local outliers”. The degree of deviation from the neighboring data points determines the outlier score. By adjusting the K value, the sensitivity of the LOF algorithm to different neighborhood sizes can be tuned. After outlier removal, scaling was performed under the assumption that the data follow a normal distribution. It is noteworthy that while tree-based models and Naive Bayes models are not affected by the scale of input data, linear regression and distance-based KNN models are influenced by the scale of input features. Therefore, in this study, scaling was performed to ensure the applicability of various models, including linear regression, support vector regression, K-nearest neighbors regression, and tree-based regression models. The data were scaled to a standard normal distribution with a mean of “0” and a standard deviation of “1”.
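The outlier removal and scaling steps might look as follows with scikit-learn. The synthetic data, `n_neighbors=20`, and the choice to standardize temporarily before LOF (so that no single feature dominates the distance calculation) are assumptions made for illustration, not settings reported in the study.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature dataset (e.g., NH3-N and TDS in mg/L) with two injected outliers.
rng = np.random.default_rng(1)
inliers = rng.normal(loc=[50.0, 3500.0], scale=[5.0, 100.0], size=(100, 2))
outliers = np.array([[120.0, 6000.0], [5.0, 500.0]])
X = np.vstack([inliers, outliers])

# Assumption: features are standardized before LOF because LOF is distance based.
X_for_lof = StandardScaler().fit_transform(X)
lof = LocalOutlierFactor(n_neighbors=20)   # K in the K-nearest-neighbor density comparison
labels = lof.fit_predict(X_for_lof)        # -1 marks outliers, 1 marks inliers

# Keep inliers only, then scale the cleaned data to mean 0 and standard deviation 1.
X_clean = X[labels == 1]
X_scaled = StandardScaler().fit_transform(X_clean)
```

A smaller `n_neighbors` makes the density estimate more local and therefore more sensitive to small clusters of unusual points; a larger value smooths the estimate.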

2.4. Feature Engineering

One of the primary objectives of this study is to predict the COD and TN values, which require time-consuming measurements, based on real-time measurable data. Consequently, in addition to addressing a regression problem that involves predicting continuous numerical values, feature engineering was performed to transform the raw data into a format suitable for machine learning. To achieve this, four water quality parameters—COD, TN, NH₃-N, and TDS—were utilized as the dataset, as they are applicable to both the influent and the three reactors of the A2O process. To obtain a suitable dataset for model training, outliers were addressed through normalization followed by the application of the LOF algorithm, and missing values were handled through imputation techniques.

To address the limited number of features and based on the findings from the EDA process, which revealed a high correlation of 0.84 between NH₃-N and TN in the anaerobic reactor, additional features were introduced to mitigate multicollinearity. Specifically, the NH₃-N removal amount and removal rate for each reactor were added as features. Furthermore, interaction terms were incorporated. Interaction terms model the combined effects between variables, capturing more nuanced relationships than individual variables alone [21], and are particularly useful when nonlinear and complex interactions exist among the variables. For linear or distance-based models, appropriate scaling is necessary as they are influenced by the range of the data. Additionally, converting the data to float type (floating-point) can enhance computational efficiency. In this study, standardization was employed for scaling, which transforms the data to have a mean of 0 and a standard deviation of 1. This technique preserves the shape of the data distribution while placing all features on a comparable scale. Additionally, the Yeo–Johnson power transformation was applied to bring the data closer to normality and stabilize their variance [20]. For a variable, y, the Yeo–Johnson transformation, y(λ), is defined as Equations (2) and (3), where λ is the power parameter. The transformation is presented in piecewise form to ensure continuity at the points of singularity (λ = 0 for y ≥ 0 and λ = 2 for y < 0).

For y_i ≥ 0:

y_{i}^{(\lambda)} = \begin{cases} \frac{(y_{i} + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln (y_{i} + 1) & \text{if } \lambda = 0 \end{cases}

(2)

For y_i < 0:

y_{i}^{(\lambda)} = \begin{cases} - \frac{(- y_{i} + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2 \\ - \ln (- y_{i} + 1) & \text{if } \lambda = 2 \end{cases}

(3)
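The piecewise definition in Equations (2) and (3) translates directly into code. The sketch below is a plain-Python illustration; the function name `yeo_johnson` is hypothetical, and in practice λ is usually estimated from the data (for example by maximum likelihood) instead of being supplied by hand.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y with power parameter lam."""
    if y >= 0:
        if lam == 0:
            return math.log(y + 1.0)                 # Equation (2), lambda = 0
        return ((y + 1.0) ** lam - 1.0) / lam        # Equation (2), lambda != 0
    if lam == 2:
        return -math.log(-y + 1.0)                   # Equation (3), lambda = 2
    return -((-y + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)  # Equation (3), lambda != 2


# lambda = 1 leaves the data unchanged; other values bend the distribution.
identity_pos = yeo_johnson(3.0, 1.0)
identity_neg = yeo_johnson(-3.0, 1.0)
sqrt_like = yeo_johnson(3.0, 0.5)
```

Note that with λ = 1 both branches reduce to the identity, which is a convenient sanity check on the signs in Equation (3).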

Finally, the feature-engineered data were split into an 8:2 ratio. The 80% portion was further divided into an 8:2 ratio to create training and validation datasets, respectively. The remaining 20% of the data were utilized as the test dataset for model evaluation.
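Applying the 8:2 split twice yields overall proportions of 64% training, 16% validation, and 20% test data. A minimal sketch is given below; the function name `split_indices`, the random shuffle, and the seed are assumptions, since the study does not state whether the split was random or chronological.

```python
import random


def split_indices(n_samples, seed=0):
    """Return (train, validation, test) index lists using two nested 8:2 splits."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # assumption: random, not time-ordered
    n_test = round(0.2 * n_samples)           # 20% held out for final evaluation
    test_idx, rest = idx[:n_test], idx[n_test:]
    n_val = round(0.2 * len(rest))            # 20% of the remaining 80% for validation
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
```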

2.5. Hyperparameter Tuning

The A2O process of the WWTP in this study, like many other WWTPs, faces the common challenge of having a limited number of variables and data points. In such cases, generalization through hyperparameters becomes crucial. Hyperparameters are components of machine learning models that directly influence the learning process. Identifying optimal hyperparameters is essential not only for improving model performance but also for enhancing generalization capabilities [20]. A common approach to tuning the optimal hyperparameters is to combine grid search and cross-validation.

Grid search is a method that constructs a grid of various hyperparameter combinations, trains the model with each combination, and identifies the combination that yields the best performance. Cross-validation involves partitioning the data into k subsets, using k-1 subsets for training and the remaining subset for validation, and repeating this process k times to evaluate the model’s performance. The typical hyperparameter tuning process involves performing cross-validation for each combination of hyperparameters in the grid search.
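The combination of grid search and k-fold cross-validation can be sketched in plain Python. A one-dimensional ridge regression stands in for the actual models here, and the grid values, synthetic data, and function names are all hypothetical; libraries such as scikit-learn provide the same procedure through `GridSearchCV`.

```python
import itertools
import random


def fit_ridge_1d(xs, ys, alpha, fit_intercept):
    """Closed-form 1-D ridge regression; returns (slope, intercept)."""
    mx = sum(xs) / len(xs) if fit_intercept else 0.0
    my = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / (sxx + alpha)
    return w, my - w * mx


def k_fold_cv_mse(xs, ys, alpha, fit_intercept, k=5):
    """Average validation MSE over k folds (every k-th sample forms one fold)."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        w, b = fit_ridge_1d(train_x, train_y, alpha, fit_intercept)
        sq_err = [(ys[i] - (w * xs[i] + b)) ** 2 for i in val_idx]
        fold_scores.append(sum(sq_err) / len(sq_err))
    return sum(fold_scores) / k


def grid_search(xs, ys, grid, k=5):
    """Cross-validate every hyperparameter combination and return the best one."""
    scores = {}
    for alpha, fit_intercept in itertools.product(grid["alpha"], grid["fit_intercept"]):
        scores[(alpha, fit_intercept)] = k_fold_cv_mse(xs, ys, alpha, fit_intercept, k)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data: y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(60)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}
best_params, cv_scores = grid_search(xs, ys, param_grid)
```

Because every combination is scored on data it was not trained on, the selected hyperparameters favor generalization over fit to the training folds, which matters most when the dataset is small.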

2.6. Evaluation Metrics

In this study, various evaluation metrics are utilized to assess the performance of supervised machine learning models employed for regression problems. These metrics, which express the error between observed and predicted values, include R-squared (R²), mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), and root mean squared error (RMSE) [14]. The coefficient of determination (R²) indicates how well the model explains the variability in the data; it has an upper bound of 1 and typically lies between 0 and 1. An R² value of 0 indicates that the model’s predictive power is equivalent to predicting the mean, while an R² value of 1 signifies that the model accurately predicts all data points without any error. A negative R² value implies that the model’s predictions are worse than simply predicting the mean value. MAE represents the average of the absolute errors, providing an intuitive understanding of the error magnitude. MAPE expresses the relative size of the errors, allowing for the examination of error as a percentage of actual values. MSE assigns a penalty to larger errors by squaring them, which makes it sensitive to outliers. Similarly, RMSE, being the square root of MSE, provides an error metric in the same units as the output variable and emphasizes larger errors more than smaller ones due to its quadratic nature. In this study, a comparative analysis of the performance of various machine learning models used for the continuous prediction of COD and TN in each reactor of the A2O process has been conducted utilizing these evaluation metrics. The equations for calculating these evaluation metrics are shown in Equations (4)–(8), where n is the number of samples, y_i represents the actual value for the i-th input, \hat{y}_{i} is the predicted value for the i-th input, and \bar{y} is the mean of the actual values.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(4)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}|

(5)

MAPE = \frac{1}{n} \sum_{i = 1}^{n} |\frac{y_{i} - {\hat{y}}_{i}}{y_{i}}| \times 100

(6)

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(7)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(8)
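Equations (4)–(8) can be checked with a short worked example. The function name `regression_metrics` and the sample values are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MAPE (%), MSE, and RMSE as defined in Equations (4)-(8)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)              # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "MAPE": 100.0 / n * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Hypothetical observed and predicted concentrations (mg/L); every error is 10 mg/L.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(observed, predicted)

# Predicting the mean everywhere gives R2 = 0; doing worse than that gives R2 < 0.
mean_only = regression_metrics(observed, [250.0] * 4)
worse = regression_metrics(observed, [400.0, 300.0, 200.0, 100.0])
```

Note that MAPE is undefined when any actual value is zero, so it should be used with care for parameters that can reach zero.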

3. Materials and Methods

3.1. A2O Process

The WWTP examined in this study incorporates a sequence of unit processes as depicted in Figure 5. This WWTP was designed on a small scale, with a design treatment capacity of 360 m³/day, while in reality, it processes approximately 500 m³/day of wastewater. The key treatment methodology utilized is the A2O process, which consists of anaerobic, anoxic, and oxic (aerobic) reactors arranged in series. The anaerobic reactor is primarily designed to promote the release of phosphorus while also facilitating a partial breakdown of organic matter. Subsequent to this stage, in the anoxic reactor, denitrification takes place, utilizing nitrate that is recirculated from the oxic (aerobic) reactor through the process of sludge return. The oxic (aerobic) reactor is tasked with further phosphorus removal via excess phosphorus uptake, in addition to the continued degradation of organic compounds. The effective volumes of the anaerobic, anoxic, and oxic reactors are 2250 m³, 1125 m³, and 5400 m³, respectively. The hydraulic retention times (HRT) for the anaerobic, anoxic, and oxic reactors are set at 4.5 d, 2.25 d, and 10.8 d, respectively.
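The hydraulic retention time of each reactor follows from HRT = V/Q. As a quick arithmetic check, the stated HRT values are reproduced when each effective volume is divided by the actual flow of approximately 500 m³/day (not the design capacity of 360 m³/day). The sketch below assumes that recirculation flows are not included in Q, and the variable names are illustrative.

```python
# Effective reactor volumes (m^3) and the actual influent flow (m^3/day) from the text.
volumes_m3 = {"anaerobic": 2250.0, "anoxic": 1125.0, "oxic": 5400.0}
flow_m3_per_day = 500.0  # assumption: recycle streams are not counted in the flow

# HRT = V / Q for each reactor, in days.
hrt_days = {name: volume / flow_m3_per_day for name, volume in volumes_m3.items()}
total_hrt_days = sum(hrt_days.values())
```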

The wastewater characteristics pertinent to each treatment process are concisely compiled in Table 4. The influent is characterized by a COD concentration of 6077.3 ± 1173.0 mg/L and elevated salinity levels, averaging 3528.0 ± 503.3 mg/L. Following the application of the A2O treatment system, the COD concentration in the effluent was reduced to 434.3 ± 163.4 mg/L, corresponding to a removal efficiency of 92.9%. Additionally, the TN and NH₃-N exhibited removal efficiencies of 89.7% and 93.0%, respectively. It was also noted that the mixed liquor suspended solids (MLSS) concentrations within both the aerobic and anoxic tanks were relatively low, at approximately 4000 mg/L.
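The reported COD removal efficiency can be reproduced from the mean influent and effluent concentrations, using removal efficiency = (C_in − C_out) / C_in × 100. The function name below is hypothetical, and only the mean values are used (the standard deviations are ignored).

```python
def removal_efficiency(c_in, c_out):
    """Percentage of a constituent removed between influent and effluent."""
    return (c_in - c_out) / c_in * 100.0


# Mean COD concentrations (mg/L) quoted in the text.
cod_removal = removal_efficiency(6077.3, 434.3)
```

The same function applies to TN and NH₃-N once their influent and effluent concentrations are taken from Table 4.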

3.2. Analytical Methods

Samples from the influent, each reactor tank in the A2O process, and the effluent were filtered through a 0.45 µm mixed cellulose filter (Advantec, Tokyo, Japan). The following parameters were analyzed for each sample according to standard methods: COD, TN, NH₃-N, and TP. The total salinity and MLSS concentration were determined using the gravimetric method. The pH and temperature were quantified using a portable multifunctional electrode with a meter matching network (Multi 3630, Munich, Germany). The concentrations of sulfate and chloride ions were determined using an ion chromatograph (ICS-1100, Massachusetts, USA). Volatile phenols were determined using a spectrophotometer (PE lambda 750, Waltham, MA, USA).

4. Results and Discussion

4.1. Applicability of Various Regression Models

In this study, various artificial intelligence (AI) models were employed to predict the effluent concentrations of COD and TN in the A2O process based on real-time measurements of pH, TDS, temperature, and NH₃-N. Initially, the suitability of traditional regression models including multiple linear regression and multilayer perceptron models for such predictions was compared and evaluated.

4.1.1. Multiple Linear Regression (MLR) Model

Multiple linear regression (MLR) is a method that predicts the dependent variable using independent variables. However, MLR requires a linear relationship between the independent variables and the dependent variable, and the observed data points must be independent. Additionally, the error term of the dependent variable with respect to the independent variables should follow a normal distribution with constant variance, and there should be no strong correlation among the independent variables. For the anaerobic reactor, as shown in Figure 6 and the exploratory data analysis (EDA) process (see Section 2.2), a linear relationship between COD and TN with NH₃-N and TDS was observed. Consequently, COD and TN were designated as the dependent variables, while the real-time measurable parameters NH₃-N and TDS were selected as the independent variables. The MLR was performed using the ordinary least squares (OLS) method, which minimizes the difference between the actual and predicted values.
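An OLS fit with two predictors can be sketched with NumPy's least-squares solver. The data below are synthetic, and the generating coefficients (100, 1.5, 0.02) are hypothetical values chosen for illustration; they are not the coefficients reported in Table 5.

```python
import numpy as np

# Hypothetical synthetic data standing in for NH3-N_ana, TDS_ana, and COD_ana.
rng = np.random.default_rng(0)
n = 200
nh3 = rng.normal(50.0, 10.0, n)
tds = rng.normal(3500.0, 500.0, n)
cod = 100.0 + 1.5 * nh3 + 0.02 * tds + rng.normal(0.0, 5.0, n)

# Design matrix with an intercept column; OLS minimizes the sum of squared residuals.
A = np.column_stack([np.ones(n), nh3, tds])
coef, *_ = np.linalg.lstsq(A, cod, rcond=None)
intercept, beta_nh3, beta_tds = coef

# Coefficient of determination of the fitted model.
pred = A @ coef
r2 = 1.0 - np.sum((cod - pred) ** 2) / np.sum((cod - cod.mean()) ** 2)
```

Each slope is read as the average change in the dependent variable per one-unit change in that predictor, with the other predictor held constant.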

The results of predicting the dependent variables COD_ana and TN_ana from the independent variables NH₃-N_ana and TDS_ana through MLR are presented in Table 5. For the regression predicting COD_ana, the t-tests of the independent variables yielded p-values below 0.05, rejecting the null hypothesis that the coefficients are zero and indicating statistical significance. Consequently, a one-unit change in NH₃-N_ana is associated, on average, with a 1.6232-unit change in COD_ana, with TDS_ana held constant. The R² value of 0.305 indicates a somewhat low explanatory power of the model. The F-statistic tests the overall significance of the regression model by taking the ratio of the variance explained by the model to the variance of the error term. It indicates how much the independent variables contribute to explaining the variability of the dependent variable, and a high value implies that the explained variation is larger than the unexplained variation. For the COD_ana prediction model, the F-statistic of 52.0, with a probability value far below 0.05, leads to the rejection of the null hypothesis that none of the independent variables influences the dependent variable, so the model can be considered statistically significant.

The normality of the residuals from the multiple regression prediction is presented in Figure 7. The predicted values of COD_ana and TN_ana and the corresponding residuals in multiple regression are shown in Figure 8. As shown in Figure 7, the residuals (the differences between the values given by the regression equation and the observed values) of the predicted COD_ana followed a normal distribution in the Quantile-Quantile plot (QQ plot). Additionally, the Breusch-Pagan test was conducted to check for heteroscedasticity in the residuals. The p-value of 0.0099, lower than 0.05, led to the rejection of the null hypothesis of homoscedasticity (that the independent variables do not influence the variance of the residuals), indicating that the residual variance depends on some of the independent variables. Finally, the variance inflation factor (VIF) was used to measure multicollinearity among the independent variables in the regression analysis, and the value of 10.5067, exceeding 10, indicated a very high level of multicollinearity. In the case of the TN_ana prediction results, the R² value of 0.7510 indicated a relatively good explanatory power. However, similar to the COD_ana prediction model, some assumptions of the regression analysis were not met. In other words, for predicting both dependent variables, conventional regression analysis does not fully satisfy its statistical assumptions. As explained in Section 2, appropriate treatment of the independent variables, such as feature engineering or the use of alternative regression models like ridge, lasso, or elastic net, is necessary.
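For a model with exactly two predictors, the VIF of either predictor reduces to 1/(1 − r²), where r is the Pearson correlation between the two. The sketch below computes this on synthetic data and back-calculates the r² implied by the reported VIF of 10.5067. The function names and the sample values are illustrative assumptions, not data from the plant.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Synthetic, strongly correlated predictors (placeholders for NH3-N and TDS).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]
vif = vif_two_predictors(x1, x2)

# Squared correlation between the two predictors implied by the reported VIF.
implied_r2 = 1.0 - 1.0 / 10.5067
print(round(vif, 1), round(implied_r2, 4))
```

Read this way, a VIF just above 10 corresponds to the two predictors sharing roughly 90% of their variance, which is why their individual coefficients become unstable.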

4.1.2. Multilayer Perceptron Model (MLP)

The Multilayer Perceptron (MLP) is a type of ANN consisting of an input layer, hidden layer(s), and an output layer. Input data are processed through the hidden layer(s) to predict the output, a process known as feedforward propagation. Activation functions are applied at each layer to determine the output values, allowing the MLP to solve nonlinear problems. The difference between the predicted and actual values is considered the error, which is used to set an objective function. Through partial derivatives and the gradient descent method, the weights and biases of each neuron are updated in the backward direction to minimize the objective function, a process called backpropagation. This iterative process of updating weights and biases is generally referred to as training. Typically, the MLP performs better than traditional machine learning algorithms when dealing with large datasets with many features.

For performance testing of the MLP, a baseline model was constructed with two input features, two hidden layers with 32 nodes each, and the Rectified Linear Unit (ReLU) as the activation function. The MAE was used as the objective function, and the Adam optimizer with a learning rate of 0.001 was employed. The experiment was conducted for 50 epochs. The dataset for the baseline model was randomly shuffled and split into an 8:2 ratio, with the eight portions further divided into an 8:2 ratio for training and validation, respectively, while the remaining two portions were used for testing.
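The nested 8:2 split described above amounts to roughly 64% training, 16% validation, and 20% test data. A minimal sketch of such a split is given below. The function name, the fixed seed, and the rounding rule are assumptions, and the use of n = 352 (the number of real-time data points mentioned elsewhere in this article) for the baseline MLP dataset is also assumed here.

```python
import random


def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle indices, hold out a test set, then split the remainder into train/validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round((n - n_test) * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(352)
print(len(train), len(val), len(test))
```

With only a few dozen validation and test points, the metrics of any single split are noisy, which is relevant when interpreting the baseline MLP results.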

The results of predicting the target variables COD_ana and TN_ana using NH₃-N_ana and TDS_ana as features with the MLP are presented in Table 6 and Figure 9. The MAPE for both COD_ana and TN_ana was around 0.18, but the R² value for the COD_ana prediction was −0.02, which is essentially equivalent to predicting the mean of the observed values. This is significantly lower than the value of 0.3050 obtained using MLR, indicating that the baseline MLP model has poorer explanatory power than MLR. This may be attributed to the MAE loss function used in the MLP, which focuses on minimizing the prediction error for each data point rather than capturing the overall pattern of the dataset. Additionally, the limited data availability might have led to overfitting during the training process, preventing the model from learning the overall pattern of the dataset and causing it to generalize poorly during the testing phase.
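The interpretation of an R² near or below zero follows from its definition, R² = 1 − SS_res/SS_tot. The short sketch below, using made-up numbers, shows that a model that always predicts the mean of the observations scores exactly zero, and that any constant prediction away from the mean scores below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations (not plant data).
y_obs = [10.0, 12.0, 14.0, 16.0, 18.0]
r2_mean = r_squared(y_obs, [14.0] * 5)    # always predict the mean
r2_biased = r_squared(y_obs, [15.0] * 5)  # constant prediction off the mean
print(r2_mean, r2_biased)
```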

4.2. Prediction of COD and TN Using Ensemble Model

In the WWTP data utilized in this study, linear relationships between variables were observed only in certain reaction processes. Additionally, the residuals did not satisfy the homoscedasticity assumption, and multicollinearity among independent variables was high. Consequently, linear regression was not considered suitable for predicting COD and TN in this study. Furthermore, the limited number of real-time measured data points (n = 352) is insufficient for training deep learning models. Previous studies have shown that deep learning models are prone to overfitting when the number of data points is small and that they are sensitive to minor changes in input values, leading to poor performance on small datasets [13]. In contrast, ensemble models such as Random Forest and XGBoost utilize techniques like bagging and boosting to generalize models, enabling them to perform well on relatively small datasets and provide interpretability that deep learning models lack [13]. However, while deep learning models only require simple data formatting to match the input shape, machine learning models necessitate data preprocessing and feature engineering tailored to the specific model, as their performance is influenced by characteristics like input data scaling. Consequently, in this study, ensemble models including Random Forest Regressor, Gradient Boosting Regressor, and XGB Regressor were selected from among the various machine learning models [14,15,16,17,18,22]. Additionally, a Voting Regressor, which combines the predictions of these models using weights, was utilized to predict COD and TN.
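Conceptually, a weighted Voting Regressor outputs a weighted average of the predictions of its base models. The sketch below shows only that final averaging step. The base-model predictions and the weights are hypothetical, since the weights used in the study are not stated in this section.

```python
def weighted_vote(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: dict mapping model name -> list of predicted values
    weights:     dict mapping model name -> non-negative weight
    """
    total = sum(weights[name] for name in predictions)
    n = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items()) / total
        for i in range(n)
    ]


# Hypothetical COD predictions (mg/L) from three already-fitted base models.
base_preds = {
    "random_forest": [40.0, 50.0],
    "gradient_boosting": [44.0, 48.0],
    "xgboost": [42.0, 52.0],
}
model_weights = {"random_forest": 1.0, "gradient_boosting": 2.0, "xgboost": 1.0}  # assumed

voted = weighted_vote(base_preds, model_weights)
print(voted)
```

Averaging tends to cancel errors that the base models do not share, which is one plausible reason a voting ensemble can be more stable than any single member on small datasets.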

The prediction of COD and TN effluent concentrations for each reaction process was performed using real-time measured parameters such as pH, TDS, temperature, and NH₃-N. The prediction performance metrics are presented in Table 7, where the best-performing metric for each process is highlighted in bold. Figure 10 depicts the comparison between the predicted and observed values for the effluent COD and TN concentrations from each reactor, illustrating the case with the highest R² value. The evaluation of prediction performance was primarily based on the R² value. Although MAE and RMSE provide an intuitive understanding of the actual error magnitude, the ranges of COD and TN vary across processes, making it difficult to compare them directly [23,24]. Therefore, the R² value, which allows for an overall comparison of COD and TN predictions across processes, was primarily utilized.
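The reason for preferring R² when comparing processes with different concentration ranges can be illustrated numerically. In the sketch below, which uses invented values, the same relative errors are applied to a low-range series and to a series ten times larger: MAE and RMSE scale with the data, whereas R² is unchanged.

```python
import math


def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical low-range series and the same series scaled by 10.
y_low, p_low = [10.0, 12.0, 14.0, 16.0, 18.0], [11.0, 12.0, 13.0, 17.0, 18.0]
y_high, p_high = [10 * v for v in y_low], [10 * v for v in p_low]

print(mae(y_low, p_low), mae(y_high, p_high))
print(rmse(y_low, p_low), rmse(y_high, p_high))
print(r2(y_low, p_low), r2(y_high, p_high))
```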

The predictions have been divided into two main categories. Firstly, COD and TN were predicted using feature-engineered data with the influent pH, temperature, and the respective reactor’s NH₃-N and TDS as the primary features (Figure 10a–h). In this case, for COD prediction, the Voting Regressor exhibited the highest explanatory power with an R² value of 0.7722 in the oxic reactor (Figure 10e). Similarly, for TN prediction, the Gradient Boosting Regressor demonstrated excellent performance with an R² value of 0.9282 in the oxic reactor (Figure 10f). Notably, for COD prediction, the R² values improved as the process progressed, with the highest values being 0.3704 in the anaerobic reactor (Figure 10a), 0.5271 in the anoxic reactor (Figure 10c), and 0.7722 in the oxic reactor (Figure 10e). A broadly similar pattern was observed for TN prediction, with R² values of 0.8334 in the anaerobic reactor (Figure 10b), 0.8062 in the anoxic reactor (Figure 10d), and 0.9282 in the oxic reactor (Figure 10f), although the R² value for the anaerobic reactor was slightly higher than that for the anoxic reactor. In the A2O process, the organic matter and nitrogen compounds present in the influent are sequentially degraded and removed as they pass through the anaerobic, anoxic, and oxic reactors. This sequential removal process results in a reduction in the variability of the measured values in the final stage, the oxic (aerobic) reactor, as the majority of the pollutants have been eliminated. This phenomenon can explain the observed trend of improved model performance, as indicated by higher R² values, in the later stages of the treatment process.

It may be noted that the observed pattern for TN prediction, wherein the R² value for the anaerobic reactor exceeds that of the anoxic reactor, may be attributed to the differential impacts of nitrification and denitrification processes, as well as the efficiency of sludge return within the system. In the anaerobic reactor, the absence of oxygen precludes nitrification; hence, the variability in nitrogen compounds is primarily influenced by the release of phosphorus and the hydrolysis of organic matter [25]. This could lead to a more predictable pattern in COD and TN concentrations, resulting in a relatively high R² value. Conversely, in the anoxic reactor, denitrification introduces additional variability due to its dependence on factors such as the availability of nitrate (recirculated from the oxic reactor) and readily biodegradable carbon. Moreover, the efficiency of sludge return can significantly affect the concentration of biodegradable substrates and microorganisms available for denitrification, potentially leading to less predictability and a lower R² value compared to the anaerobic stage. Therefore, examining how factors related to the biological variability of the reactors are linked to the accuracy of data-based predictions presents itself as a highly intriguing subject for further research.

The second category of prediction focused on estimating the COD and TN concentrations in the final treated water of the A2O process, specifically the effluent from the oxic reactor, utilizing both the influent water quality parameters of the A2O process and the water quality parameters from each reactor. To achieve this, real-time measurable parameters such as pH, temperature, NH₃-N, and TDS of the A2O process influent, as well as NH₃-N and TDS from each reactor, were adopted as feature parameters (i.e., independent parameters). In other words, all water quality parameters that can be measured in real time across the entire process were used to predict the COD and TN of the final effluent. The ability to predict COD and TN, which are challenging to measure in real time, can significantly improve the efficiency of process operation. Moreover, the predicted concentrations of COD and TN can serve as important data for process monitoring and maintenance. According to the prediction results, the Random Forest algorithm achieved the highest performance in COD prediction with an R² value of 0.7564 (Figure 10g). Similarly, for TN prediction, the Random Forest algorithm again demonstrated superior performance with an R² value of 0.8527 (Figure 10h).
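The feature set described above (four influent parameters plus NH₃-N and TDS from each of the three reactors) gives ten raw inputs per sample. A minimal sketch of assembling one such feature row is shown below. The dictionary layout, key names, and sensor values are hypothetical, and any additional engineered features used in the study are not represented.

```python
INFLUENT_KEYS = ("pH", "temperature", "NH3-N", "TDS")
REACTORS = ("anaerobic", "anoxic", "oxic")
REACTOR_KEYS = ("NH3-N", "TDS")


def build_feature_row(influent, reactors):
    """Flatten influent and per-reactor sensor readings into one ordered feature row."""
    row = [influent[k] for k in INFLUENT_KEYS]
    for name in REACTORS:
        row.extend(reactors[name][k] for k in REACTOR_KEYS)
    return row


# Hypothetical sensor readings for a single time step.
influent = {"pH": 7.2, "temperature": 21.5, "NH3-N": 35.0, "TDS": 900.0}
reactors = {
    "anaerobic": {"NH3-N": 30.0, "TDS": 880.0},
    "anoxic": {"NH3-N": 12.0, "TDS": 860.0},
    "oxic": {"NH3-N": 2.0, "TDS": 850.0},
}

row = build_feature_row(influent, reactors)
print(len(row), row)
```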

Among the eight prediction results presented in Table 7, the Voting Regressor exhibited the best performance based on R² values for predicting the COD and TN concentrations in the effluent of the anaerobic reactor, COD prediction in the anoxic reactor, and COD prediction in the oxic reactor. This suggests that while tree-based ensemble models such as Random Forest, Gradient Boosting, and XGBoost perform well with small and highly dispersed datasets, the Voting Regressor, which further ensembles these models for decision-making, can provide more stable and superior predictive performance. Despite the need for large amounts of data to apply AI techniques, it has been reported that even with the use of substantial data, the correlation between actual and predicted data remains low in WWTPs due to their highly nonlinear characteristics and high biological variability in usual biological wastewater treatment facilities [26]. The results obtained in this study, however, indicate that for WWTP datasets with limited data points and features in tabular form, satisfactory prediction results can be expected by adopting the data preprocessing methods and feature engineering techniques presented in this study, followed by the application of ensemble techniques. The present study demonstrated that using only four easily measurable feature parameters (i.e., pH, temperature, NH₃-N, and TDS), the COD and TN concentrations, which are difficult to measure in real time and require extensive analysis time, can be predicted with maximum R² values of 0.7722 (Figure 10e) and 0.9282 (Figure 10f), respectively.

In this study, despite the limited number of data points and feature parameters, the application of the ensemble model through the Voting Regressor achieved satisfactory performance in predicting both COD and TN to a certain extent. Particularly for TN prediction, the inclusion of NH₃-N, which can be easily measured in real time using sensors, as one of the feature parameters contributed to the higher predictive performance. Therefore, to maximize the predictive performance for COD, it is suggested to identify feature parameters that have a high correlation with COD and can be easily measured in real time using sensors or other means. Parameters such as UV absorbance (UVA), turbidity, oxidation-reduction potential (ORP), and dissolved oxygen (DO) are known to exhibit a certain degree of correlation with COD and can be easily measured in real time using sensors at actual WWTPs. Utilizing these data as COD prediction feature parameters, either individually or in combination, is expected to significantly enhance the COD prediction performance. It is therefore recommended that WWTPs consistently collect data on the aforementioned potential feature parameters over an extended period. By doing so, WWTPs can leverage these data, along with the techniques presented in this study, to improve the accuracy of COD predictions.

4.3. Predicting COD and TN Removal Efficiencies Based on the C/N Ratio

The C/N ratio plays a crucial role in determining the removal efficiencies of COD and TN in biological wastewater treatment processes. A balanced C/N ratio is essential for optimal growth and activity of microorganisms responsible for organic matter degradation and nitrogen removal [27,28]. If the C/N ratio is too high, indicating an excess of organic carbon relative to nitrogen, the microorganisms may preferentially utilize the available nitrogen for cell synthesis, leaving residual organic matter untreated and resulting in poor COD removal. Conversely, if the C/N ratio is too low, with limited organic carbon sources, the microorganisms may not have sufficient energy and carbon sources for effective nitrogen removal processes, leading to poor TN removal. Therefore, real-time monitoring of the C/N ratio at WWTPs would offer significant advantages in terms of optimal operation and maintenance of biological processes. However, as mentioned earlier, real-time measurement of COD and TN is not feasible at WWTPs. In this study, as demonstrated in Section 4.2, the effluent COD and TN concentrations in each reactor of the A2O process can be successfully predicted in real time. Utilizing these predicted COD and TN values, the removal efficiency of COD and TN as well as the C/N ratios in each reactor can be calculated. Table 8 presents a comparison of the COD and TN removal efficiencies calculated using the predicted COD and TN values obtained from the ensemble model and the removal efficiencies calculated using the measured values for each reactor in the A2O process. The C/N ratios calculated using the predicted COD and TN values for each reactor in the A2O process are also summarized in Table 8. The R² values presented in Table 8 represent the correlation between the COD and TN removal efficiencies (based on measured values) and the corresponding C/N ratios (based on predicted values) for each reactor in the A2O process.
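The two derived quantities used in this section can be written out explicitly. The sketch below assumes the conventional definitions, i.e., removal efficiency as the percentage decrease from inlet to outlet concentration and the C/N ratio as COD divided by TN. The concentrations are hypothetical, and whether the study computes efficiencies relative to each reactor's inlet or to the process influent is not specified here.

```python
def removal_efficiency(c_in, c_out):
    """Percentage of a pollutant removed between inlet and outlet."""
    return 100.0 * (c_in - c_out) / c_in


def cn_ratio(cod, tn):
    """Carbon-to-nitrogen ratio expressed as COD / TN (same concentration units)."""
    return cod / tn


# Hypothetical predicted concentrations in mg/L.
cod_in, cod_out = 400.0, 120.0
tn_in, tn_out = 50.0, 20.0

cod_eff = removal_efficiency(cod_in, cod_out)
tn_eff = removal_efficiency(tn_in, tn_out)
ratio_out = cn_ratio(cod_out, tn_out)
print(cod_eff, tn_eff, ratio_out)
```

Feeding the predicted COD and TN values into these two functions at each time step yields the real-time removal efficiencies and C/N ratios discussed in the remainder of this section.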

The C/N ratios calculated using the predicted COD and TN concentrations from the ensemble model ranged from 2.2 to 14.8 across all reactors in the A2O process. Considering the typical C/N ratio range of 1 to 20 for efficient biological wastewater treatment processes, the results indicate that the A2O process in this study was operating within an acceptable C/N ratio range. Examining the mean C/N ratios for each reactor, the anoxic reactor’s C/N ratio (5.8) showed a significant increase compared to the anaerobic reactor’s C/N ratio (3.7). However, the oxic reactor’s C/N ratio (6.2) did not differ substantially from the anoxic reactor’s C/N ratio (5.8). This observation can be attributed to the denitrification process occurring in the anoxic reactor, which reduces nitrogen concentrations, leading to an increase in the C/N ratio [29,30]. The minimal change in the C/N ratio between the anoxic and oxic reactors is a result of the concurrent removal of COD and TN, maintaining a relatively stable C/N ratio. For the predicted COD removal efficiencies, the first two stages of the A2O process, the anaerobic and anoxic reactors, exhibited removal efficiencies of approximately 70%, which closely matched the measured COD removal efficiencies. Regarding TN removal, both the predicted and measured removal efficiencies showed the highest removal efficiency of 80% in the anoxic reactor, where denitrification occurs. The predicted and measured removal efficiencies for TN were in close agreement. This result implies that by using predicted COD and TN values to monitor the C/N ratio in real time, WWTPs can enhance the management and efficiency of their biological treatment processes despite the challenges of direct COD and TN measurement.

To predict the COD and TN removal efficiencies in the A2O process using the C/N ratio, it is necessary to understand the correlation between the C/N ratio of each reactor and the corresponding COD and TN removal efficiencies in that reactor. The relationships between the COD and TN removal efficiencies and the C/N ratio for each reactor in the A2O process are presented in Figure 11. Figure 11a depicts the relationship between the COD removal efficiency and the C/N ratio, while Figure 11b illustrates the relationship between the TN removal efficiency and the C/N ratio. The C/N ratios used in these figures were calculated based on the predicted COD and TN concentrations obtained from the ensemble model. The COD and TN removal efficiencies were calculated using the measured concentrations. Based on the R² values, the correlation between the COD removal efficiency and the C/N ratio was very high in the anaerobic and anoxic reactors, with values of 0.9387 and 0.9396, respectively (Figure 11a). However, the correlation was somewhat lower in the oxic (aerobic) reactor, with an R² value of 0.6971. For TN, the anoxic reactor exhibited the highest correlation between the TN removal efficiency and the C/N ratio, with an R² value of 0.9228 (Figure 11b). It is understandable that the correlation between COD and TN removal efficiencies and the C/N ratio in each reactor of the A2O process may differ. This is because each reactor in the A2O process is designed with different roles. Generally, in the A2O process, the anaerobic reactor is designed for organic matter removal and phosphorus release, the anoxic reactor for denitrification, and the aerobic reactor for residual organic matter removal and excessive phosphorus uptake. Therefore, the correlation between the C/N ratio of each reactor and the COD and TN removal efficiency of the corresponding reactor may vary. In this study, as shown in Figure 11b, for TN, the highest correlation between the C/N ratio and TN removal efficiency was observed in the anoxic reactor, where denitrification is the main objective.
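One way to quantify such a relationship is a simple linear fit of removal efficiency against the C/N ratio, for which R² equals the squared Pearson correlation. The sketch below assumes this linear form, which the section does not explicitly confirm as the regression behind Figure 11, and uses invented data points.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept, (sxy * sxy) / (sxx * syy)


# Hypothetical C/N ratios and removal efficiencies (%) for one reactor.
cn = [2.0, 4.0, 6.0, 8.0]
eff_exact = [10.0 + 5.0 * v for v in cn]  # perfectly linear case
eff_noisy = [20.0, 33.0, 38.0, 52.0]      # scattered case

slope_e, intercept_e, r2_exact = linear_fit_r2(cn, eff_exact)
slope_n, intercept_n, r2_noisy = linear_fit_r2(cn, eff_noisy)
print(r2_exact, r2_noisy)
```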

Figure 12 and Figure 13 present the comparisons between the measured and predicted COD and TN removal efficiencies, respectively, for each reactor in the A2O process, using data collected from January 2022 to September 2023. Additionally, these figures depict the change patterns of the C/N ratio calculated based on the COD and TN concentrations predicted by the ensemble model during the same period. As shown in the time-series plot of Figure 12, the measured and predicted COD removal efficiencies exhibited good agreement for all reactors in the A2O process throughout the data collection period. Notably, significant fluctuations in COD removal efficiency were observed in the anoxic reactor in August 2022 and May 2023 (Figure 12b). Concurrently, alterations in the C/N ratio during these intervals were also observed. This correlation aligns with the established scientific understanding that the C/N ratio exerts a considerable influence on the biodegradation capabilities of bioreactors deployed in biological wastewater treatment processes. There were some temporal discrepancies between the change in COD removal efficiency and the change in C/N ratio. These discrepancies could be attributed to the differences in hydraulic retention time (HRT) of each reactor. It is worth noting that the variations in the C/N ratio within each reactor could precede or lag changes in the COD removal efficiency, due to the different retention times of each reactor [31,32]. In this study, the hydrodynamic conditions of each reactor in the A2O process were not considered as feature parameters (due to the lack of recorded data from the plant). However, incorporating these conditions could potentially enable the early detection of changes in COD removal efficiency through the prediction of the C/N ratio.
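The temporal offset between C/N ratio changes and removal efficiency changes could, in principle, be estimated with a lagged correlation. The sketch below is one illustrative approach, not an analysis performed in this study: it scans a range of lags and returns the one with the strongest correlation. The series are synthetic, with the efficiency constructed to follow the C/N ratio by two samples.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_lag(cn, eff, max_lag):
    """Return (lag, r) maximizing |corr(cn[t], eff[t + lag])|.

    A positive lag means the efficiency responds after the C/N ratio changes.
    """
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = cn[:len(cn) - lag], eff[lag:]
        else:
            a, b = cn[-lag:], eff[:len(eff) + lag]
        r = pearson(a, b)
        if best is None or abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Synthetic series: efficiency repeats the C/N pattern two samples later.
cn_series = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 9.0, 8.0, 6.0]
eff_series = [cn_series[0], cn_series[0]] + cn_series[:-2]

lag, r = best_lag(cn_series, eff_series, max_lag=3)
print(lag, round(r, 3))
```

If HRT records were available, the lag found this way could be compared against the retention time of each reactor to check whether the offset is hydraulically plausible.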

From a time-series perspective, the discrepancies between the measured and predicted COD removal efficiency were somewhat pronounced in the oxic reactor (Figure 12c) compared to the anaerobic (Figure 12a) and anoxic reactors (Figure 12b). This observation can be attributed to the fact that most of the COD was removed in the anaerobic and anoxic reactors (see Table 8), as previously observed, resulting in lower COD concentrations in the oxic reactor. Consequently, the lower COD levels in the oxic reactor led to higher variability in the predicted values. However, it is important to note that such abrupt fluctuations may not represent actual phenomena but could be attributed to measurement outliers. Therefore, while monitoring the predicted C/N ratio, it is recommended to assess whether there are any issues with the real-time measurement of the water quality parameters used for the C/N ratio prediction.
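A simple screening step for suspect sensor readings is sketched below using the modified z-score based on the median absolute deviation (MAD). This is one common robust outlier rule, offered only as an illustration: the study does not report using it, and the threshold of 3.5 and the pH readings are assumptions.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (MAD-based) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing can be flagged
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical pH sensor readings with one abrupt spike.
ph_readings = [7.1, 7.0, 7.2, 7.1, 12.5, 7.0, 7.1]
flags = flag_outliers(ph_readings)
print(flags)
```

Readings flagged in this way would be candidates for inspection before the predicted C/N ratio is used for operational decisions.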

Figure 13 presents a time-series analysis for each reactor within the A2O process, illustrating the comparison between predicted and measured TN removal efficiency, as well as the correlation between TN removal efficiency and the C/N ratio. The time-series plot in Figure 13 demonstrates that the measured and predicted TN removal efficiencies were closely aligned across all reactors within the A2O process, with particularly strong consistency observed in the anoxic reactor. Similar to the case of COD as depicted in Figure 12, the TN removal efficiencies also showed pronounced fluctuations during specific periods within the monitoring period, which tended to correspond closely with the variations in the C/N ratio during the same periods. In particular, the anoxic reactor, where denitrification occurs and the majority of TN is removed, clearly demonstrates a good correlation between the fluctuations in TN removal efficiency and changes in the C/N ratio.

In Figure 13b, it can be observed that in May 2023, there was a significant decline in the TN removal efficiency within the anoxic reactor, as indicated by both the predicted and measured values, and the C/N ratio also decreased substantially during this period. For efficient denitrification in the anoxic reactor of the A2O process, an adequate carbon source is essential [33]. The low C/N ratio implies a scarcity of carbon, leading to reduced denitrification efficiency and, consequently, less effective removal of TN. As previously mentioned, the significant fluctuations in TN removal efficiency could reflect either actual phenomena or anomalies in measurement. Regardless of the cause, such pronounced variability has the potential to unsettle WWTP operators. Consequently, the ability to anticipate these changes in TN removal efficiency based on trends in C/N ratio fluctuations, as proposed in this study, could greatly contribute to the operational efficiency and maintenance of WWTPs. As with the correlation between COD removal efficiency and the C/N ratio, the temporal discrepancies caused by the varying HRT of each reactor also apply to the relationship between TN removal efficiency and the C/N ratio. These results indicate that understanding the relationship between TN (and COD) removal efficiency and the C/N ratio, as well as the way they fluctuate together, particularly considering the unique role of each reactor in the A2O process, is crucial for enhancing the prediction performance.

5. Conclusions

This study has successfully demonstrated the potential of ensemble tree-based machine learning models to predict COD and TN concentrations in the A2O process. Among the models employed, the Voting Regressor, which integrates Random Forest, Gradient Boosting, and XGBoost, emerged as the superior approach, yielding robust R² values of 0.7722 for COD and 0.9282 for TN predictions. Such performance indicates a significant advancement over traditional multiple linear regression, especially when dealing with the small, volatile, and interconnected datasets characteristic of WWTPs.

A notable finding is the sequential improvement of model performance across the treatment stages, with the oxic reactor displaying the highest predictive accuracy (i.e., R² = 0.3704 (anaerobic) < 0.5271 (anoxic) < 0.7722 (oxic) for COD and R² = 0.8062 (anoxic) < 0.8334 (anaerobic) < 0.9282 (oxic) for TN). This can be attributed to the reduced variability in water quality parameters following pollutant removal. The incorporation of real-time measurable parameters (pH, temperature, NH₃-N, and TDS) as features in the model was pivotal, allowing for real-time predictions of COD and TN that traditionally require extensive laboratory analysis.

Further, this research highlighted the importance of the C/N ratio in monitoring and optimizing the A2O process. The ensemble models facilitated real-time computation of C/N ratios, reflecting the process’s efficacy in COD and TN removal across different reactors in the A2O process. In this study, the accuracy of predicting COD removal efficiency based on the C/N ratio was high, with R² values exceeding 0.92 for both the anaerobic and anoxic reactors. Additionally, the accuracy of predicting TN removal efficiency through the C/N ratio was high at 0.9228 for the anoxic reactor. The ability to predict the removal efficiencies of COD and TN through the C/N ratio offers WWTP operators a valuable tool for process control and decision-making.

It is important to acknowledge the limitations encountered in this study, including the reliance on the size and variability of the dataset collected over a one-and-a-half-year period. The models’ performance is contingent on the quality and granularity of input data, which might limit their generalizability to other WWTPs with differing operational parameters or environmental conditions. Looking forward, the methodology employed here presents promising avenues for further research. Future work could explore the application of these models in a broader range of treatment plants to validate their effectiveness across diverse operational contexts. Additionally, continuous refinement of model features and integration of more advanced data preprocessing techniques could further enhance predictive accuracy.

In conclusion, the findings from this study underscore the value of ensemble tree-based models as a tool capable of delivering the most reliable predictions within the given data environment. By enabling real-time monitoring and predictive insights into COD and TN concentrations, this approach represents a significant step forward in the management and optimization of WWTPs.

Author Contributions

Conceptualization, J.-Y.K. and S.-Y.L.; methodology, Q.C., J.-H.P., S.-G.P. and Y.-J.L.; software, J.-H.P.; validation, Q.C., J.-Y.K. and M.-H.H.; formal analysis, Q.C., J.-Y.K. and Y.-J.L.; investigation, Q.C., Y.G. and Y.-J.L.; resources, X.R., G.Z. and Y.W. (Yu Wang); experiment, G.Z. and Y.W. (Yawei Wang); data curation, Q.C. and Y.G.; writing—original draft preparation, Q.C., J.-Y.K., J.-H.P. and S.-Y.L.; writing—review and editing, Y.W. (Yu Wang), S.-Y.L. and M.-H.H.; visualization, Q.C. and J.-Y.K.; supervision, Y.W. (Yu Wang), S.-Y.L., X.R. and M.-H.H.; project administration, Y.W. (Yu Wang) and M.-H.H.; funding acquisition, Y.W. (Yu Wang) and M.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by “the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00144137)”.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

Authors Guili Zheng and Yawei Wang were employed by the company Xinhua Pharmaceutical (Shouguang) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Wongburi, P.; Park, J.K. Big Data Analytics from a Wastewater Treatment Plant. Sustainability 2021, 13, 12383.
Maiza, M.; Beltrán, S.; Westling, K.; Carlsson, B.; Mulas, M.; Bergström, P.; Hyyryläinen, S.; Gorka, U. DIAMOND: AdvanceD data management and InformAtics for the optimuM operatiON anD control of WWTPs. In Proceedings of the ICA 2013, Narbonne, France, 18–20 September 2013.
Siegrist, R.L. Introduction to Decentralized Infrastructure for Wastewater Treatment and Water Reclamation. In Decentralized Water Reclamation Engineering: A Curriculum Workbook; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–37.
Aghdam, E.; Mohandes, S.R.; Manu, P.; Cheung, C.; Yunusa-Kaltungo, A.; Zayed, T. Predicting quality parameters of wastewater treatment plants using artificial intelligence techniques. J. Clean. Prod. 2023, 405, 137019.
Häck, M.; Köhne, M. Estimation of wastewater process parameters using neural networks. Water Sci. Technol. 1996, 33, 101–115.
Haimi, H.; Mulas, M.; Corona, F.; Vahala, R. Data-derived soft-sensors for biological wastewater treatment plants: An overview. Environ. Model. Softw. 2013, 47, 88–107.
Pai, T.-Y. Gray and Neural Network Prediction of Effluent from the Wastewater Treatment Plant of Industrial Park Using Influent Quality. Environ. Eng. Sci. 2008, 25, 757–766.
Fan, R.; Wang, S.; Chen, H. A COD measurement method with turbidity compensation based on a variable radial basis function neural network. Anal. Methods 2023, 15, 5360–5368.
Akbar, M.A.; Sharif, O.; Selvaganapathy, P.R.; Kruse, P. Identification and Quantification of Aqueous Disinfectants Using an Array of Carbon Nanotube-Based Chemiresistors. ACS Appl. Eng. Mater. 2023, 1, 3040–3052.
Civelekoglu, G.; Yigit, N.O.; Diamadopoulos, E.; Kitis, M. Modelling of COD removal in a biological wastewater treatment plant using adaptive neuro-fuzzy inference system and artificial neural network. Water Sci. Technol. 2009, 60, 1475–1487.
Zhang, S.; Zhou, P.; Xie, Y.; Chai, T. Improved model-free adaptive predictive control method for direct data-driven control of a wastewater treatment process with high performance. J. Process Control 2022, 110, 11–23.
Jafar, R.; Awad, A.; Jafar, K.; Shahrour, I. Predicting Effluent Quality in Full-Scale Wastewater Treatment Plants Using Shallow and Deep Artificial Neural Networks. Sustainability 2022, 14, 15598.
Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst. 2022, 35, 507–520.
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
Schapire, R.E. The Strength of Weak Learnability. Mach. Learn. 1990, 5, 197–227.
Erdebilli, B.; Devrim-İçtenbaş, B. Ensemble Voting Regression Based on Machine Learning for Predicting Medical Waste: A Case from Turkey. Mathematics 2022, 10, 2466.
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 2011, 45, 1–67.
Yeo, I.K.; Johnson, R.A. A new family of power transformations to improve normality or symmetry. Biometrika 2000, 87, 954–959.
Grotenhuis, M.t.; Thijs, P. Dummy variables and their interactions in regression analysis: Examples from research on body mass index. arXiv 2015, arXiv:1511.05728.
Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Nadiri, A.A.; Shokri, S.; Tsai, F.T.-C.; Moghaddam, A.A. Prediction of effluent quality parameters of a wastewater treatment plant using a supervised committee fuzzy logic model. J. Clean. Prod. 2018, 180, 539–549.
Khair, U.; Fahmi, H.; Al Hakim, S.; Rahim, R. Forecasting error calculation with mean absolute deviation and mean absolute percentage error. J. Phys. Conf. Ser. 2017, 930, 012002. [Google Scholar] [CrossRef]
Güçlü, D.; Dursun, S. Amelioration of carbon removal prediction for an activated sludge process using an artificial neural network (ANN). CLEAN–Soil Air Water 2008, 36, 781–787. [Google Scholar] [CrossRef]
Nasr, M.S.; Moustafa, M.A.; Seif, H.A.; El Kobrosy, G. Application of Artificial Neural Network (ANN) for the prediction of EL-AGAMY wastewater treatment plant performance-EGYPT. Alex. Eng. J. 2012, 51, 37–43. [Google Scholar] [CrossRef]
Wang, X.; Wang, S.; Xue, T.; Li, B.; Dai, X.; Peng, Y. Treating low carbon/nitrogen (C/N) wastewater in simultaneous nitrification-endogenous denitrification and phosphorous removal (SNDPR) systems by strengthening anaerobic intracellular carbon storage. Water Res. 2015, 77, 191–200. [Google Scholar] [CrossRef] [PubMed]
Zhu, G.-C.; Lu, Y.-Z.; Xu, L.-R. Effects of the carbon/nitrogen (C/N) ratio on a system coupling simultaneous nitrification and denitrification (SND) and denitrifying phosphorus removal (DPR). Environ. Technol. 2021, 42, 3048–3054. [Google Scholar] [CrossRef] [PubMed]
Lai, T.M.; Dang, H.V.; Nguyen, D.D.; Yim, S.; Hur, J. Wastewater treatment using a modified A2O process based on fiber polypropylene media. J. Environ. Sci. Health Part A 2011, 46, 1068–1074. [Google Scholar] [CrossRef] [PubMed]
Lim, E.-T.; Jeong, G.-T.; Bhang, S.-H.; Park, S.-H.; Park, D.-H. Evaluation of pilot-scale modified A2O processes for the removal of nitrogen compounds from sewage. Bioresour. Technol. 2009, 100, 6149–6154. [Google Scholar] [CrossRef] [PubMed]
Guo, Y.; Guo, L.; Sun, M.; Zhao, Y.; Gao, M.; She, Z. Effects of hydraulic retention time (HRT) on denitrification using waste activated sludge thermal hydrolysis liquid and acidogenic liquid as carbon sources. Bioresour. Technol. 2017, 224, 147–156. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Zhang, Y.; Yang, M.; Kamagata, Y. Effects of hydraulic retention time on nitrification activities and population dynamics of a conventional activated sludge system. Front. Environ. Sci. Eng. 2013, 7, 43–48. [Google Scholar] [CrossRef]
Mohan, T.K.; Nancharaiah, Y.; Venugopalan, V.; Sai, P.S. Effect of C/N ratio on denitrification of high-strength nitrate wastewater in anoxic granular sludge sequencing batch reactors. Ecol. Eng. 2016, 91, 441–448. [Google Scholar] [CrossRef]

Figure 1. Distribution and correlation analysis of COD, NH₃-N, TN, and TDS for each A2O process using pairplots during the data preprocessing step. The colors are differentiated based on the processes from which the data were obtained, and the process names are shown on the right side of the figure (note that “Influent” refers to the influent of the A2O process, and “Effluent” refers to the final treated effluent). Note that data for pH and temperature (Temp) are only applicable to the influent of the A2O process.

Figure 2. Identification of missing values for water quality parameters across different stages of the A2O process (the total number of data points = 352). Missing values are depicted by lighter colors, while the black portions indicate that data were acquired normally.

Figure 3. Outlier detection for COD, NH₃-N, TN, and TDS across different stages of the A2O process using boxplots: (a) before standardization and (b) after standardization.

Figure 4. Correlation of water quality parameters in each bioreactor of the A2O process based on exploratory data analysis (EDA).

Figure 5. Schematic diagram of the A2O process.

Figure 6. Linear relationships between water quality parameters in the anaerobic reactor: (a) COD and NH₃-N, (b) COD and TDS, (c) TN and NH₃-N, and (d) TN and TDS.

Figure 7. Normality of the residuals from the multiple regression prediction: (a) histogram of residuals for COD, (b) QQ plot for COD, (c) histogram of residuals for TN, and (d) QQ plot for TN.

Figure 8. Predicted values and residuals in multiple regression: (a) COD and (b) TN.

Figure 9. MLP learning curves and prediction results: (a) loss curve for COD (left) and TN (right), (b) true value vs. predicted value for COD (left) and TN (right), and (c) distribution of prediction error for COD (left) and TN (right).

Figure 10. Comparison of predicted and observed values for the model that exhibited the highest performance (based on the R² values displayed in the top-left corner of each graph) in predicting COD and TN concentrations: (a) COD_ana; (b) TN_ana; (c) COD_ano; (d) TN_ano; (e) COD_oxi; (f) TN_oxi; (g) COD_oxi (predicted with all data); and (h) TN_oxi (predicted with all data). The subscripts “ana”, “ano”, and “oxi” denote anaerobic, anoxic, and oxic reactors, respectively. The name of the model exhibiting the best predictive performance is displayed at the top center of the respective graph.

Figure 11. Correlation between the removal efficiency and the C/N ratio for each reactor in the A2O process: (a) COD removal efficiency and C/N ratio, and (b) TN removal efficiency and C/N ratio (each removal efficiency is based on measured values, and the C/N ratio is based on predicted values).

Figure 12. Time-series comparison of predicted and measured COD removal efficiencies along with the calculated C/N ratio based on the ensemble model predictions: (a) anaerobic reactor, (b) anoxic reactor, and (c) oxic reactor of the A2O process monitored in this study. Data used were collected from January 2022 to September 2023.

Figure 13. Time-series comparison of predicted and measured TN removal efficiencies along with the calculated C/N ratio based on the ensemble model predictions: (a) anaerobic reactor, (b) anoxic reactor, and (c) oxic reactor of the A2O process monitored in this study. Data used were collected from January 2022 to September 2023.

Table 1. Pearson correlation coefficients for the four water quality parameters (COD, NH₃-N, TN, and TDS) in each reactor of the A2O process (each parameter was measured in the respective reactor, i.e., the anaerobic, anoxic, or oxic reactor).

Parameter		COD	NH₃-N	TN	TDS
Anaerobic	COD_ana	1.0000
	NH₃-N_ana	0.5336 *	1.0000
	TN_ana	0.4913	0.8433	1.0000
	TDS_ana	0.3761	0.4504	0.5764	1.0000
Anoxic	COD_ano	1.0000
	NH₃-N_ano	0.6555	1.0000
	TN_ano	0.5493	0.4512	1.0000
	TDS_ano	0.0471	0.3045	0.0212	1.0000
Oxic	COD_oxi	1.0000
	NH₃-N_oxi	0.5749	1.0000
	TN_oxi	0.5357	0.5860	1.0000
	TDS_oxi	0.0150	0.0263	−0.0666	1.0000

Note(s): * The highest correlation value is highlighted in bold font for each water quality parameter and reactor.
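
For readers who want to reproduce coefficients of the kind listed in Table 1, the following is a minimal sketch of the Pearson correlation computation using only the Python standard library. The function name `pearson_r` and the sample values are illustrative assumptions; the numbers are made up and are not measurements from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (hypothetical, not plant data), in mg/L.
nh3n_ana = [150.0, 210.0, 260.0, 330.0, 400.0]
tn_ana = [380.0, 450.0, 520.0, 600.0, 690.0]
r = pearson_r(nh3n_ana, tn_ana)
print(f"Pearson r = {r:.4f}")
```

Applying such a function to every pair of parameters within one reactor would fill one lower-triangular block of the table.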

Table 2. Analysis of the number of missing values for each water quality parameter across different stages of the A2O process.

Parameter	No. of Zero Values	Parameter	No. of Zero Values
COD_inf *	0	NH₃-N_ano	0
NH₃-N_inf	0	TN_ano	0
TDS_inf	0	TDS_ano	112
T_inf	0	COD_oxi	0
TN_inf	0	NH₃-N_oxi	0
pH_inf	0	TN_oxi	0
COD_ana	0	TDS_oxi	112
NH₃-N_ana	0	COD_eff	0
TN_ana	0	NH₃-N_eff	0
TDS_ana	112	TN_eff	0
COD_ano	0	TDS_eff	112

Note(s): * The subscripts “inf”, “ana”, “ano”, “oxi”, and “eff” are used as abbreviations for “influent”, “anaerobic”, “anoxic”, “oxic”, and “effluent”, respectively. These abbreviations have been used in the figures and tables later in this paper.

Table 3. Descriptive statistics for water quality parameters in the A2O process.

Parameter		Count	Mean	Std	Min	25%	50%	75%	Max
Influent	COD	352	6087.0	1170.3	2390.0	5290.0	5980.0	6792.5	9780.0
	NH₃-N	352	305.0	88.4	104.0	235.0	305.0	370.0	554.0
	TDS	352	6638.5	1353.7	3540.0	5820.0	6565.0	7420.0	12,800.0
	Temp.	352	35.8	4.7	18.0	33.0	36.1	39.0	46.6
	TN	352	707.6	124.4	360.0	610.0	700.0	800.0	1120.0
	pH	352	7.3	0.5	5.8	7.2	7.3	7.4	10.3
Anaerobic	COD	352	1857.6	472.2	863.0	1497.3	1780.0	2152.8	3294.0
	NH₃-N	352	267.0	115.2	76.0	166.0	249.5	369.3	570.0
	TN	352	517.1	147.2	180.0	400.0	490.0	660.0	840.0
	TDS	352	5058.4	3571.7	0.0	0.0	6700.0	7662.5	10,900.0
Anoxic	COD	352	558.3	189.7	66.0	421.8	524.5	642.3	1338.0
	NH₃-N	352	59.9	63.5	12.0	27.0	37.0	64.0	360.0
	TN	352	99.2	44.7	58.0	81.0	93.0	102.0	430.0
	TDS	352	4773.7	3376.2	0.0	0.0	6210.0	7342.5	9600.0
Oxic	COD	352	457.4	184.5	149.0	329.8	398.5	552.0	1261.0
	NH₃-N	352	24.8	54.1	3.0	5.0	7.0	12.0	336.0
	TN	352	78.7	44.5	44.0	64.0	70.0	76.0	400.0
	TDS	352	4582.2	3285.7	0.0	0.0	5690.0	7280.0	9980.0
Effluent	COD	352	434.3	163.4	138.0	320.0	392.5	524.3	1058.0
	NH₃-N	352	21.4	49.3	2.0	4.0	6.0	9.3	309.0
	TN	352	73.0	43.2	40.0	60.0	65.0	70.0	390.0
	TDS	352	4422.0	3211.8	0.0	0.0	5400.0	7172.5	9770.0

Table 4. Characteristics of wastewater in each process.

Parameters (mg/L)	Influent	Anaerobic Reactor	Anoxic Reactor	Oxic Reactor	Effluent
COD	6077.3 ± 1173.0	1857.6 ± 472.2	558.3 ± 190.0	457.4 ± 184.5	434.3 ± 163.4
NH₃-N	304.6 ± 88.9	267.0 ± 115.2	59.9 ± 63.5	24.8 ± 54.1	21.4 ± 49.3
TDS	6612.6 ± 1332.3	7419.0 ± 1071.8	7001.4 ± 1038.4	6720.6 ± 1193.0	6485.5 ± 1308.0
TP	19.9 ± 8.5	-	-	-	-
TN	706.7 ± 124.8	517.1 ± 147.2	99.2 ± 44.7	78.7 ± 44.5	73.0 ± 43.2
Sulfate	978.7 ± 601.1	-	-	-	-
Chloride ion	869.2 ± 226.0	-	-	-	-
Total salinity	3528.0 ± 503.3	-	-	-	-
Volatile phenol	85.1 ± 30.3	-	-	-	-
Temp. (°C)	35.9 ± 4.7	-	-	-	-
pH	7.3 ± 0.5	-	-	-	-
MLSS	-	-	4212.7 ± 1038.4	4097.4 ± 409.5	-
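
The TDS means for the reactors and effluent differ between Table 3 and Table 4. A short arithmetic check suggests that the Table 4 values are consistent with excluding the 112 zero-valued TDS records counted in Table 2. The sketch below is an interpretation offered under that assumption, not a description of the authors' procedure; the helper name `mean_excluding_zeros` is hypothetical.

```python
N_TOTAL = 352  # records per parameter (Table 3, "Count")
N_ZERO = 112   # zero-valued TDS records per stage (Table 2)

# TDS means in mg/L: computed over all records (Table 3) and as listed in Table 4.
tds_mean_with_zeros = {"anaerobic": 5058.4, "anoxic": 4773.7,
                       "oxic": 4582.2, "effluent": 4422.0}
tds_mean_table4 = {"anaerobic": 7419.0, "anoxic": 7001.4,
                   "oxic": 6720.6, "effluent": 6485.5}

def mean_excluding_zeros(mean_all, n_total, n_zero):
    """Mean of the nonzero records, given the mean over all records.

    Zeros contribute nothing to the sum, so the sum equals
    mean_all * n_total; dividing by the nonzero count gives the result.
    """
    return mean_all * n_total / (n_total - n_zero)

recomputed = {stage: mean_excluding_zeros(m, N_TOTAL, N_ZERO)
              for stage, m in tds_mean_with_zeros.items()}

for stage, value in recomputed.items():
    print(f"{stage:9s} recomputed = {value:7.1f}   Table 4 = {tds_mean_table4[stage]:7.1f}")
```

The recomputed means agree with Table 4 to within rounding, which also explains why the 25th percentile of TDS in Table 3 is 0.0 for these stages.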

Table 5. Evaluation metrics for the MLR model.

Dep. Variable	Intercept & Independent Variables	Coefficient	Standard Error	t-Value	P > |t|	R² (Adj.)	F-Statistics (Prob.)	Breusch–Pagan (Prob.)	VIF
COD_ana	Intercept_ana	648.6080	184.5770	3.5140	0.001	0.3050 (0.2990)	52.00 (1.88 × 10⁻¹⁹)	9.2257 (0.0099)	10.5067
	NH₃-N_ana	1.6232	0.2630	6.1710	0.000
	TDS_ana	0.1074	0.0280	3.7840	0.000
TN_ana	Intercept_ana	226.3749	10.3800	21.8090	0.000	0.7510 (0.7490)	468.40 (1.49 × 10⁻⁹⁹)	8.3245 (0.0155)
	NH₃-N_ana	0.9651	0.0430	22.5470	0.000
	TDS_ana	0.0065	0.0010	4.7390	0.000
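
The coefficients in Table 5 define two linear equations for the anaerobic reactor. As a minimal sketch, the snippet below evaluates them at the anaerobic mean NH₃-N and TDS values listed in Table 4. The function name `predict_mlr` is hypothetical, and choosing the mean values as inputs is only an illustration.

```python
# Coefficients copied from Table 5: (intercept, NH3-N_ana, TDS_ana).
MLR_COEFFS = {
    "COD_ana": (648.6080, 1.6232, 0.1074),
    "TN_ana": (226.3749, 0.9651, 0.0065),
}

def predict_mlr(target, nh3n_ana, tds_ana):
    """Evaluate the fitted linear equation for the given target (mg/L)."""
    intercept, b_nh3n, b_tds = MLR_COEFFS[target]
    return intercept + b_nh3n * nh3n_ana + b_tds * tds_ana

# Inputs: anaerobic reactor means from Table 4 (mg/L).
cod_hat = predict_mlr("COD_ana", nh3n_ana=267.0, tds_ana=7419.0)
tn_hat = predict_mlr("TN_ana", nh3n_ana=267.0, tds_ana=7419.0)
print(f"COD_ana ~ {cod_hat:.1f} mg/L, TN_ana ~ {tn_hat:.1f} mg/L")
```

At these inputs the equations give roughly 1879 mg/L for COD and 532 mg/L for TN, close to the anaerobic means of 1857.6 and 517.1 mg/L in Table 4. Agreement at the mean says little about fit across the data range, which is what the R² values in Table 5 describe.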

Table 6. Evaluation metrics for the MLP model.

Process	Features	Target	Model	R²	MAE	MAPE	MSE	RMSE
Anaerobic	TDS_ana NH₃-N_ana	COD	Multilayer Perceptron	−0.0281	306.8756	0.1864	1.478 × 10⁵	384.4962
Anaerobic	TDS_ana NH₃-N_ana	TN	Multilayer Perceptron	0.6096	82.2764	0.1858	9.667 × 10³	98.3218

Table 7. Performance evaluation metrics of various machine learning models employed to predict the effluent COD and TN concentrations for each reactor using real-time measured data (pH, temperature, NH₃-N, and TDS).

Process	Features	Target	Model	R²	MAE	MAPE	MSE	RMSE
Anaerobic	pH_inf Temp_inf NH₃-N_ana TDS_ana	COD	Random Forest Regressor	0.3611	314.0597	0.1730	139,249.0139	373.1608
			Gradient Boosting Regressor	0.3667	307.3038	0.1692	138,028.2257	371.5215
			XGB Regressor	0.2808	320.7301	0.1768	156,761.5667	395.9313
			Voting Regressor	0.3704	314.8617	0.1714	137,239.0012	370.4578
		TN	Random Forest Regressor	0.7972	48.2324	0.0971	3996.0776	63.2145
			Gradient Boosting Regressor	0.8033	47.4283	0.0952	3876.5724	62.2621
			XGB Regressor	0.7982	49.5103	0.1017	3976.5846	63.0602
			Voting Regressor	0.8334	43.4740	0.0870	3282.3695	57.2920
Anoxic	pH_inf Temp_inf NH₃-N_ano TDS_ano	COD	Random Forest Regressor	0.5068	86.1109	0.1833	12,174.7390	110.3392
			Gradient Boosting Regressor	0.4807	82.7199	0.1772	12,819.2609	113.2222
			XGB Regressor	0.4401	92.1794	0.1995	13,822.4896	117.5691
			Voting Regressor	0.5272	82.6990	0.1815	11,671.4214	108.0344
		TN	Random Forest Regressor	0.7611	13.1522	0.1211	496.6391	22.2854
			Gradient Boosting Regressor	0.8062	11.8176	0.1091	402.9547	20.0737
			XGB Regressor	0.8003	11.7854	0.1084	415.2215	20.3770
			Voting Regressor	0.8010	11.7671	0.1015	413.6942	20.3395
Oxic	pH_inf Temp_inf NH₃-N_oxi TDS_oxi	COD	Random Forest Regressor	0.7375	60.6232	0.1552	7892.0438	88.8372
			Gradient Boosting Regressor	0.7611	63.0409	0.1607	7184.0118	84.7586
			XGB Regressor	0.7264	67.8336	0.1701	8225.6465	90.6953
			Voting Regressor	0.7722	60.1776	0.1523	6849.8258	82.7637
		TN	Random Forest Regressor	0.8633	10.7976	0.1291	325.6551	18.0459
			Gradient Boosting Regressor	0.9282	9.7993	0.1229	170.9395	13.0744
			XGB Regressor	0.8887	10.8189	0.1370	265.1009	16.2819
			Voting Regressor	0.9005	10.5414	0.1265	237.0049	15.3950
Oxic with all data *	pH_inf Temp_inf NH₃-N_inf TDS_inf NH₃-N_ana TDS_ana NH₃-N_ano TDS_ano NH₃-N_oxi TDS_oxi	COD	Random Forest Regressor	0.7564	54.4847	0.1502	5328.1426	72.9941
			Gradient Boosting Regressor	0.7399	58.7900	0.1525	5690.3028	75.4341
			XGB Regressor	0.7280	58.2190	0.1582	5950.2342	77.1378
			Voting Regressor	0.7350	57.6162	0.1545	5796.6046	76.1354
		TN	Random Forest Regressor	0.8527	9.8803	0.1250	317.1213	17.8079
			Gradient Boosting Regressor	0.7609	11.1788	0.1391	514.7406	22.6879
			XGB Regressor	0.7805	11.3402	0.1261	472.6688	21.7409
			Voting Regressor	0.8477	10.2401	0.1237	328.0104	18.1111

Notes: * The oxic (aerobic) reactor is the final stage of the A2O process; therefore, it was considered the final treatment stage, and predictions were made using data from the influent and all three reactors in the A2O process. Bold values indicate the highest accuracy for each evaluation metric.
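
The metrics in Table 7 can be reproduced from paired observed and predicted values. The following sketch uses only the standard library and makes two assumptions: the Voting Regressor is treated as an unweighted average of its base models' predictions (the usual default for a voting regressor), and MAPE is expressed as a fraction, which matches the magnitude of the values in the table. The function names and the toy numbers are hypothetical and are not data from this study.

```python
import math

def voting_average(predictions_per_model):
    """Unweighted voting: average the base models' predictions per sample."""
    return [sum(vals) / len(vals) for vals in zip(*predictions_per_model)]

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MAPE (as a fraction), MSE, and RMSE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"R2": 1.0 - ss_res / ss_tot, "MAE": mae, "MAPE": mape,
            "MSE": mse, "RMSE": math.sqrt(mse)}

# Toy observed values and three base-model predictions (illustrative only).
y_true = [400.0, 500.0, 600.0, 700.0]
model_a = [420.0, 480.0, 630.0, 690.0]
model_b = [390.0, 530.0, 580.0, 720.0]
model_c = [410.0, 490.0, 590.0, 660.0]

ensemble = voting_average([model_a, model_b, model_c])
m = regression_metrics(y_true, ensemble)
print({name: round(value, 4) for name, value in m.items()})

# Consistency check on a Table 7 row: RMSE should equal sqrt(MSE).
print(round(math.sqrt(137239.0012), 4))  # anaerobic COD, Voting Regressor
```

Because RMSE is the square root of MSE, the two columns in Table 7 carry the same information on different scales; RMSE is in mg/L and is therefore easier to compare with the concentration ranges in Table 3.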

Table 8. Comparison between the measured and predicted removal efficiencies (R² represents the correlation performance) along with the C/N ratio calculated based on the predicted COD and TN concentrations.

Parameter	Process	R²	Calculated C/N	Measured Removal Efficiency (fraction)	Calculated Removal Efficiency (fraction)
COD	Anaerobic	0.9387	2.2~8.3 (3.7) *	0.24~0.86 (0.68)	0.23~0.86 (0.68)
	Anoxic	0.9396	2.7~10.9 (5.8)	0.18~0.94 (0.68)	0.17~0.85 (0.69)
	Oxic	0.6971	2.4~14.8 (6.2)	0.00~0.55 (0.17)	0.00~0.45 (0.19)
TN	Anaerobic	0.5970	2.2~8.3 (3.7)	0.00~0.70 (0.25)	0.00~0.64 (0.25)
	Anoxic	0.9228	2.7~10.9 (5.8)	0.09~0.90 (0.79)	0.12~0.88 (0.79)
	Oxic	0.7385	2.4~14.8 (6.2)	0.00~0.49 (0.21)	0.00~0.57 (0.22)

Note: * The numbers in parentheses represent the mean values.
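
As a rough illustration of the quantities in Table 8, the sketch below computes stage-wise removal efficiencies and C/N ratios from the mean concentrations in Table 4. Two assumptions are made here: removal efficiency is taken as (inlet − outlet)/inlet across each reactor, expressed as a fraction, and the C/N ratio is taken as COD divided by TN in the same reactor. The function names are hypothetical.

```python
def removal_efficiency(c_in, c_out):
    """Fraction removed across one reactor: (c_in - c_out) / c_in."""
    if c_in <= 0:
        raise ValueError("inlet concentration must be positive")
    return (c_in - c_out) / c_in

def cn_ratio(cod, tn):
    """Carbon-to-nitrogen ratio, taken here as COD / TN (assumption)."""
    return cod / tn

# Mean concentrations from Table 4 (mg/L).
cod = {"influent": 6077.3, "anaerobic": 1857.6, "anoxic": 558.3, "oxic": 457.4}
tn = {"influent": 706.7, "anaerobic": 517.1, "anoxic": 99.2, "oxic": 78.7}

stages = ["influent", "anaerobic", "anoxic", "oxic"]
results = {}
for upstream, reactor in zip(stages, stages[1:]):
    results[reactor] = {
        "COD_removal": removal_efficiency(cod[upstream], cod[reactor]),
        "TN_removal": removal_efficiency(tn[upstream], tn[reactor]),
        "C/N": cn_ratio(cod[reactor], tn[reactor]),
    }

for reactor, values in results.items():
    print(reactor, {name: round(v, 2) for name, v in values.items()})
```

Values computed from mean concentrations land near, but not exactly on, the means in parentheses in Table 8, because the mean of per-sample ratios generally differs from the ratio of means.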


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
