Article

Prediction of Cavity Length Using an Interpretable Ensemble Learning Approach

1
School of Hydraulic Engineering, Faculty of Infrastructure Engineering, Dalian University of Technology, Dalian 116024, China
2
Water Conservancy and Hydropower Engineering, Xi'an University of Technology, Xi'an 710048, China
*
Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2023, 20(1), 702; https://doi.org/10.3390/ijerph20010702
Submission received: 4 December 2022 / Revised: 27 December 2022 / Accepted: 27 December 2022 / Published: 30 December 2022
(This article belongs to the Section Environmental Science and Engineering)

Abstract

The cavity length, a vital index in aeration and corrosion-reduction engineering, is affected by many factors and is challenging to calculate. In this study, 10-fold cross-validation was performed to select the optimal input configuration. Additionally, the hyperparameters of three ensemble learning models, namely random forest (RF), gradient boosting decision tree (GBDT), and extreme gradient boosting tree (XGBOOST), were fine-tuned by the Bayesian optimization (BO) algorithm to improve the prediction accuracy, and the models were compared with five empirical methods. The XGBOOST method presented the highest prediction accuracy. A further interpretability analysis carried out using the Sobol method demonstrated its ability to reasonably capture the varying relative significance of different input features under different flow conditions. The Sobol sensitivity analysis also revealed two patterns by which the ML models extract information from the input features: (1) the main effects of individual features in the ensemble learning models and (2) the interactive effects between features in support vector regression (SVR). The models drawing on individual-feature information predicted the cavity length more accurately than the model relying on interactive information. Moreover, XGBOOST extracted the most physically consistent information from the features, so its Sobol indices varied in accordance with the observed phenomena and its predictions fit the experimental points best.

1. Introduction

Flow aeration using aerators is an inexpensive and effective technique for preventing cavitation erosion in spillways [1]. A ramp set in the chute deflects the high-speed flow, creating an aeration cavity that entrains air into the water through the air supply ducts. As a critical parameter in evaluating the air entrainment efficiency of a chute aerator [2,3], the cavity length has been the focus of many researchers, and different empirical correlations have been proposed in the literature. Rutschmann [4] and Chanson [5] used the classical jet trajectory computation to calculate the cavity length. Wu [3] further took the effects of flow depth and transverse fluctuating velocity into account when calculating the emergence angle, and the predicted cavity length agrees with the experimental data significantly better than other correlations. Pfister [6] conducted a dimensional analysis of cavity length prediction and found that the cavity length is mainly affected by the aerator geometry, the chute bottom angle, and the deflector angle.
Though regression equations provide a reasonable estimation of the cavity length, their limited range of applicability and accuracy have also been reported in previous studies [7,8,9], owing to their inability to capture the complex interactions between different input parameters. In recent years, machine learning (ML) models have gained popularity in modeling such problems. Early studies employed stand-alone ML methods, including artificial neural networks [10], support vector regression (SVR) [11,12], adaptive neuro-fuzzy inference systems (ANFIS) [13,14], and multiple linear regression [15]. In the field of hydraulic engineering, most of the abovementioned algorithms are sole learners, whereas ensemble learning techniques, which are more accurate, robust, and powerful [16,17], remain comparatively unexplored for such problems. Qiu et al. [18] combined extreme gradient boosting (XGBOOST) with different optimization algorithms, such as the grey wolf optimization, the whale optimization algorithm, and the Bayesian optimization (BO) algorithm, to predict blast-induced ground vibration with 150 datasets and obtained high prediction performance. Afan [19] predicted the groundwater level using a combination of ensemble and deep learning models (ensemble DL) and found that ensemble DL is a more reliable tool than an individual DL model. Chen [20] estimated the wall shear stress for an ultra-high-pressure water-jet nozzle based on a hybrid BPNN model. Wu et al. [21] used XGBOOST to identify the water leakage zone and predict the leakage level, and their results confirmed that the prediction accuracy of the XGBOOST algorithm was better than that of the BPNN algorithm. However, the abovementioned studies focused only on improving the prediction accuracy; no attempt was made to further interpret the ML models.
In recent years, researchers have started to focus on the interpretability of ML models, which can bring significant improvements in prediction accuracy and make models more credible [22,23,24,25]. For example, by calculating the contribution of each feature, the SHAP (Shapley Additive Explanations) method [26] uses the Shapley value to explain the prediction results. However, it can only extract the effect of each individual feature on the prediction. To reveal both the individual and interactive effects of the input parameters, the Sobol method [27], which calculates multiple sensitivity indices of the input features, can be adopted.
This study aims at developing ML models for predicting the cavity length of spillway aerators. Experimental data from the State Key Laboratory of Hydraulic and Mountain River Engineering of Sichuan University and the Hydraulic Laboratory of Kunming University of Science and Technology are adopted, and the input parameters are selected based on an SVR test with 10-fold cross-validation. The training is then performed using three ensemble learning models, i.e., RF, GBDT, and XGBOOST. The effect of the BO algorithm in optimization is also tested. Finally, the Sobol method is used to explore the relative significance of different input parameters in cavity length prediction.

2. Machine Learning Models

2.1. Support Vector Regression

Vapnik [28] established the SVR model with the main goal of reducing the structural risk of complex problems by building a high-dimensional mapping relationship between input and output variables. Assuming that (x, y) is the observed data, where x is the input vector and y is the output of the observations, a linear relationship between the input and the output can be established as:
$y' = f(x) = \omega \phi(x) + b$ (1)
where y′ is the predicted value of the model, ϕ(x) is the function that maps x from low to high dimensions, and ω and b represent the weight and bias of the model, respectively.
The reduction of the difference between the model outputs y′ and actual outputs y is achieved by minimizing the structural risk; the final nonlinear regression function is given as:
$f(x) = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) K(x, x_i) + b$ (2)
where αi and αi* are the Lagrange multipliers, b is the bias of the SVR, and K(x, xi) is the kernel function; the radial basis function (RBF) kernel is adopted in this study.
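As an illustration, below is a minimal sketch of such an RBF-kernel SVR baseline, assuming scikit-learn; the data and hyperparameter values are synthetic stand-ins, not the experimental dataset or the tuned values reported later.

```python
# Minimal SVR sketch (assumes scikit-learn); data and hyperparameters are
# illustrative stand-ins, not the experimental dataset or tuned values.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(size=(270, 4))      # stand-in for (alpha, Fr, phi, s/h)
y = X[:, 1] + 0.5 * X[:, 3]         # synthetic stand-in for L/h

# Standardize the inputs, then fit an epsilon-SVR with the RBF kernel.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr.fit(X, y)
print(f"train R2: {svr.score(X, y):.3f}")
```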

2.2. Random Forest (RF)

RF belongs to the bagging algorithm framework and has been adopted in many engineering problems [29]. RF utilizes two powerful algorithms, bootstrap aggregation and random subspace, to reduce the variance of the model and improve the generalizability.
In the construction of each RF regression tree, subsets of samples and features are drawn at random from the training dataset, forming bootstrap sets that are used to train ntree regression trees. The final output aggregates all regression trees:
$y' = \frac{1}{n_{tree}} \sum_{i=1}^{n_{tree}} y_i(x)$ (3)
where y′ is the predicted value, and ntree is the number of decision trees.
Additionally, some of the data, known as out-of-bag (OOB) data, are not included in the training. Instead, they are used to evaluate the performance of the trees, which avoids the extra computational cost of cross-validation [30].
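A minimal sketch of this scheme follows, assuming scikit-learn, where oob_score=True reproduces the out-of-bag evaluation described above (data are synthetic stand-ins):

```python
# Random forest with out-of-bag evaluation (assumes scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(270, 4))      # stand-in for (alpha, Fr, phi, s/h)
y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.05, size=270)

# Each tree is trained on a bootstrap sample; the samples left out of a
# tree's bootstrap set (the OOB data) give a built-in generalization estimate.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB R2: {rf.oob_score_:.3f}")
```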

2.3. Gradient Boosting Decision Tree (GBDT)

GBDT is an ensemble ML algorithm built on the boosting framework. The decision trees are not established independently: each new tree is built on the residuals of the previous one [31]. The residuals are reduced along the negative gradient direction in each iteration, and the final output is the weighted average of all the decision trees. The procedure for building the decision trees is as follows (a hand-rolled sketch for the squared-error loss is given after the list):
(1)
Initialize the iteration starting point h0(x).
$h_0(x) = \arg\min_{f} \sum_{i=1}^{n} L(y_i, y_i')$ (4)
where h0(x) is the initialized regression tree-based learner and L(y, y′) represents the error between the true value y and the predicted value y′ of the regression tree.
(2)
The mth residual along the gradient direction is:
$r_{mi} = -\frac{\partial L(y_i, h_{m-1}(x_i))}{\partial h_{m-1}(x_i)}, \quad i = 1, 2, \ldots, n$ (5)
where rmi represents the pseudo-residual, hm−1(x) represents the prediction of the (m − 1)th tree, and n represents the number of samples.
(3)
The mth tree is established from the dataset x and the residuals rmi. The prediction result y′ of each sample is used to update the strong learner to obtain hm(xi).
(4)
After completing m iterations, the final strong learner H(x) is obtained.
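To make steps (1)-(4) concrete, here is a hand-rolled sketch for the squared-error loss, for which the negative gradient in Equation (5) reduces to the plain residual; the data are synthetic stand-ins, not the paper's implementation.

```python
# Hand-rolled gradient boosting for squared-error loss (uses scikit-learn
# only for the base regression trees); data are synthetic stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(270, 4))
y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.05, size=270)

learning_rate, trees = 0.1, []
H = np.full_like(y, y.mean())      # step (1): initialize h0(x)
for m in range(100):               # steps (2)-(3): one tree per iteration
    r = y - H                      # pseudo-residual = negative gradient of 0.5*(y - H)^2
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    H += learning_rate * tree.predict(X)   # update the strong learner
    trees.append(tree)
# step (4): the final strong learner is the sum of all fitted trees
print(f"train RMSE: {np.sqrt(np.mean((y - H) ** 2)):.4f}")
```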

2.4. Extreme Gradient Boosting

The XGBOOST algorithm, which can play a powerful role in gradient enhancement, was proposed based on the GBDT structure [32]. The main difference between XGBOOST and GBDT is that XGBOOST adds a regularization term to the loss function L(y, y′) to form the objective function O(y, y′):
$O(y, y') = L(y, y') + R(f) + C$ (6)
where L(y, y′) measures the difference between the prediction y′ and the target y, R(f) represents the traditional regularization term that is used to penalize the complexity of the model and helps to smoothen the final learned weights to avoid overfitting, and C is a constant.
Moreover, XGBOOST adopts a second-order Taylor series of the objective functions to optimize the objective quickly in a general setting:
$O(y, y') = \sum_{i=1}^{n} \left[ g_i f(x_i) + \frac{1}{2} h_i f^{2}(x_i) \right] + \alpha T + \frac{1}{2} \eta \lVert w \rVert^{2}$ (7)
where gi and hi denote the first and second derivatives of the loss function, α is the weight of the min split loss, T indicates the number of leaves, η represents the weight of the regularization term, and w is the output of each leaf node.
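A minimal sketch with the xgboost library follows; under the common parameter mapping, gamma corresponds to the weight of the min split loss (the αT term) and reg_lambda to the L2 penalty on leaf weights (the ½η‖w‖² term). The values are placeholders, not the BO-tuned ones in Table 3, and the data are synthetic.

```python
# XGBoost regressor sketch (assumes the xgboost package); synthetic data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(270, 4))
y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.05, size=270)

model = xgb.XGBRegressor(
    objective="reg:squarederror",
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    gamma=0.0,        # min split loss: penalizes adding leaves (the alpha*T term)
    reg_lambda=1.0,   # L2 regularization on leaf outputs (the ||w||^2 term)
)
model.fit(X, y)
```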

2.5. Bayesian Optimization

BO can effectively handle a correlation whose mathematical expressions are unknown, and the calculation complexity is high [33]. The BO algorithm combines the probabilistic model with the acquisition function to find the minimum value of the goal function f(x). The probabilistic surrogate model adopts a Gaussian process that can quickly obtain the prior distribution of the function. The distribution of the function is typically determined by the mean μ(x) and variance σ(x).
Then, in each iteration t = 1, 2, …, T, the acquisition function uses μ(x) and σ(x) to select the point xt with the highest confidence and adds the point (xt, f(xt)) to the observations used to approximate the function in the next iteration. Commonly used acquisition functions include the probability of improvement aPI [34], the expected improvement aEI [35], and the lower confidence bound aLCB [36]. The equations are:
$a_{PI} = \Phi(\gamma(x)), \quad \gamma(x) = \frac{f(x_{best}) - \mu(x)}{\sigma(x)}$ (8)
$a_{EI} = \sigma(x) \left[ \gamma(x) \Phi(\gamma(x)) + \varphi(\gamma(x)) \right]$ (9)
$a_{LCB} = \mu(x) - \kappa \sigma(x)$ (10)
where xbest is the best value point in the current iteration; μ(x) and σ(x) represent the mean and variance, respectively; Φ and φ are the cumulative distribution function and probability density function of the standard normal distribution, respectively; and κ is the weight coefficient to balance μ(x) and σ(x).
The BO in this study adopts the aEI criterion and uses the cross-validation method to determine the combination of the hyperparameters. Their relative performance is evaluated using R2.
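One way to reproduce this loop, assuming the scikit-optimize library: BayesSearchCV couples a Gaussian-process surrogate with the EI acquisition and ranks hyperparameter combinations by cross-validated R2. The search space and data below are illustrative, not the paper's exact settings.

```python
# Bayesian hyperparameter search sketch (assumes scikit-optimize and xgboost).
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(270, 4))
y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.05, size=270)

search = BayesSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    {
        "n_estimators": Integer(10, 200),
        "max_depth": Integer(2, 10),
        "learning_rate": Real(0.01, 0.5, prior="log-uniform"),
    },
    n_iter=30,                               # BO iterations
    cv=10,                                   # 10-fold cross-validation
    scoring="r2",                            # combinations ranked by R2
    optimizer_kwargs={"acq_func": "EI"},     # expected-improvement criterion
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```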

2.6. Model Evaluation Indices

To evaluate the accuracy of the prediction of the machine learning models, the coefficient of determination (R2), root mean squared error (RMSE), and mean absolute error (MAE) [37,38] are calculated, with their definitions being:
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i^{*} - y_i)^{2}}$ (11)
$R^{2} = 1 - \frac{\sum_{i=1}^{N} (y_i^{*} - y_i)^{2}}{\sum_{i=1}^{N} (y_i - \bar{y})^{2}}$ (12)
$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i^{*} - y_i \right|$ (13)
where N represents the number of samples, yi is the original experimental data, ȳ is its mean value, and yi* is the result derived from the ML models.
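For reference, the three indices computed with scikit-learn and NumPy; y_true and y_pred are illustrative stand-ins, not the study's data.

```python
# Evaluation indices sketch (assumes scikit-learn); values are illustrative.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.2, 2.5, 3.1, 4.0])    # experimental L/h (stand-in)
y_pred = np.array([1.1, 2.7, 3.0, 4.2])    # model output (stand-in)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}, MAE = {mae:.3f}")
```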

2.7. Sobol Sensitivity Analysis

To acquire a further understanding of the relative significance of different input parameters in the development of the ML models, Sobol sensitivity analysis [27] is also carried out. The established model is assumed to be expressed as a function Y = f(X), where X = (x1, x2, …, xn) is the model input and Y is the output. The Sobol method decomposes the function f(X) into a sum of $2^n$ terms of increasing dimensionality:
$Y = f(X) = f_0 + \sum_{i=1}^{n} f_i(x_i) + \sum_{1 \le i < j \le n} f_{i,j}(x_i, x_j) + \cdots + f_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n)$ (14)
Based on the variance decomposition method, the first-order index Si, second-order index Si,j, and total effect index STi are:
$S_i = \frac{V_i}{V(Y)}$ (15)
$S_{ij} = \frac{V_{ij}}{V(Y)}$ (16)
$S_{Ti} = 1 - \frac{V_{\sim i}}{V(Y)}$ (17)
where V(Y) is the total variance of Y, Vi is the variance contributed by the ith variable, Vij is the variance of the interaction between the two variables Xi and Xj, and V∼i is the variance contributed by all variables except Xi. Si represents the contribution of variable i, also known as the main effect of Xi; Sij represents the effect of the interaction between Xi and Xj; and STi represents the combined effect of Xi and its interactions with the other variables.
To accurately calculate the sensitivity indices, it is necessary to sample the input data reasonably [39]. The optimal Latin hypercube design of Saltelli [40,41] was used. Because the second-order sensitivity indices must be calculated, sampling must be performed n(2m + 2) times [40], where n and m are the number of samples and the number of input features, respectively. Using more sampled points than the number of experimental records helps calculate the sensitivity indices accurately. Therefore, in this study, m = 4 and n = 500 (>270 sets of data) are selected, giving 5000 uniformly distributed points for calculating the Sobol indices.
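A sketch of this sampling and index computation with the SALib library is shown below; the parameter bounds and the stand-in predictor are illustrative (in the study, the fitted ML model supplies Y).

```python
# Sobol sensitivity sketch (assumes SALib); bounds and predictor are stand-ins.
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 4,
    "names": ["alpha", "Fr", "phi", "s/h"],
    "bounds": [[0.07, 0.122], [2.46, 7.77], [0.1, 0.2], [0.1, 1.6]],  # illustrative
}
# Saltelli sampling yields n(2m + 2) = 500 * (2*4 + 2) = 5000 points when
# second-order indices are requested.
X_s = saltelli.sample(problem, 500, calc_second_order=True)
Y_s = X_s[:, 1] + 0.5 * X_s[:, 3]          # stand-in for model.predict(X_s)

Si = sobol.analyze(problem, Y_s, calc_second_order=True)
print(Si["S1"])   # first-order (main-effect) indices
print(Si["S2"])   # second-order (pairwise interaction) indices
print(Si["ST"])   # total-effect indices
```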

2.8. Dataset and Dimensional Analysis

To develop the prediction models discussed in this study, we used 270 experimental data points from two universities: the State Key Laboratory of Hydraulic and Mountain River Engineering of Sichuan University and the Hydraulic Laboratory of Kunming University of Science and Technology. The experiments were performed in inclined rectangular open channels with channel widths of 30 cm and 10 cm, respectively. The experimental setup is illustrated in Figure 1. Air entrained by the ramp was supplied from the lateral duct to the cavity below the water nappe. A head tank that controls the flow rate is placed on the upstream side of the ramp. A Cartesian coordinate system is defined with its origin at point O; the x-axis is parallel to the flume bottom centerline, whereas the z-axis is perpendicular to the bottom surface.
As a dominant factor that influences the air-entraining efficiency, the cavity length is influenced by the ramp height s, the ramp angle α, the chute bottom angle φ, the approaching flow velocity v, and the water depth h. The cavity length was measured from x = 0 to the impact point. Details of the two datasets are listed in Table 1. The sub-pressure is approximately atmospheric, as the cavity is sufficiently open to the atmosphere. Therefore, dimensional analysis was performed to obtain Equation (18) for the cavity length:
$\frac{L}{h} = f\left( \alpha, \varphi, Fr = \frac{v}{\sqrt{gh}}, \frac{s}{h} \right)$ (18)
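For clarity, computing the dimensionless inputs of Equation (18) from raw measurements looks like this; the numerical values are illustrative only.

```python
# Dimensionless inputs of Equation (18); the numbers are illustrative only.
import numpy as np

g = 9.81                          # gravitational acceleration, m/s^2
v, h, s = 3.0, 0.05, 0.02         # flow velocity (m/s), depth (m), ramp height (m)

Fr = v / np.sqrt(g * h)           # Froude number of the approaching flow
s_h = s / h                       # relative ramp height
print(f"Fr = {Fr:.2f}, s/h = {s_h:.2f}")
```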
Figure 2 shows the technical route of this study, which mainly includes three steps: (1) selection of the model input, (2) establishment of different ensemble learning models compared with traditional empirical regression, and (3) model interpretability analysis.

3. Results

As the selection of the input features and the training set ratio is critical in the development of the ML models, a pre-processing step that adopts SVR is carried out to analyze the effects of the feature selection and the size of the training dataset. Details of the cases for the test are listed in Table 2.
The statistical analysis of the errors produced by different inputs and training dataset sizes is illustrated in Figure 3. Among the configurations in Table 2, input1 with a training set ratio of 90% presents the best performance, with CV-R2 = 0.915, CV-RMSE = 0.0742, and CV-MAE = 0.0589. Changing only the ratio of the training set decreases CV-R2 and increases CV-RMSE and CV-MAE: when the proportion declined from 90% to 50%, CV-R2 decreased by 3.4%, while CV-RMSE and CV-MAE increased by 13.9% and 19.2%, respectively.
Compared with the training set ratio, the input features are the dominant factor influencing the prediction accuracy. For input3 at the 90% training set ratio, CV-R2, CV-RMSE, and CV-MAE are only 0.782, 0.118, and 0.093, respectively. Hence, richer input features allow the ML model to learn more information and predict the cavity length more accurately.
To improve the prediction ability of the ML models, this study therefore uses the input1 combination with a 90% training set ratio; a sketch of this selection step follows.
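A sketch of the input-selection test, assuming scikit-learn and a synthetic stand-in for the 270 experimental records (column names are illustrative):

```python
# Input-selection sketch: 10-fold CV R2 of the SVR baseline per feature set.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(270, 4)), columns=["alpha", "Fr", "phi", "s_h"])
df["L_h"] = df["Fr"] + 0.5 * df["s_h"] + rng.normal(scale=0.05, size=270)

input_sets = {                    # candidate configurations from Table 2
    "input1": ["alpha", "Fr", "phi", "s_h"],
    "input2": ["Fr", "phi", "s_h"],
    "input3": ["Fr", "s_h"],
}
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
for name, cols in input_sets.items():
    cv_r2 = cross_val_score(svr, df[cols], df["L_h"], cv=10, scoring="r2")
    print(f"{name}: CV-R2 = {cv_r2.mean():.3f}")
```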

3.1. Cavity Length Prediction Using Different Ensemble Learning Models

3.1.1. Effect of BO Optimization

The hyperparameters are first determined by the BO method using 72 prior observation points in the Gaussian process with the 10-fold CV technique, followed by the implementation of the different ensemble learning models, namely RF, GBDT, and XGBOOST, as well as SVR. The derived hyperparameters and the time required for training are listed in Table 3.
As shown in Table 3, GBDT requires the least training time among the different ensemble learning models, about 10% of the time spent by RF.
Deviations between the predicted results and the experimental data are shown in Figure 4. The ML models that implement BO optimization always present smaller errors than the cases without BO optimization (Table 4), and all ensemble learning models agree with the experiment better than SVR. Among the different ensemble learning models adopted, XGBOOST-BO presents the best performance.

3.1.2. Comparison with Empirical Correlations

An evaluation of the relative performance of the ML models with respect to traditional empirical correlations is also carried out. Five widely implemented empirical correlations from the literature are selected; their expressions are listed in Table 5. The comparison of the results derived from these correlations with those of XGBOOST-BO, which performs best among the ML models, is illustrated in Figure 5. Marker points with different colors represent the results derived from different models, and the lines are linear fits to the results of the same color. Apparent deviation from the line y = x exists for the correlations of Rutschmann, Chanson, Yang, and Pfister, whereas the results of Wu agree with y = x nearly as well as XGBOOST-BO, which may result from the consideration of the transverse turbulent flow velocity u′ and the water depth in their correlation. To further evaluate the relative performance of the correlations, the R2 of their results is also listed in Table 5. The correlation of Wu presents the highest R2 among the empirical correlations; compared with R2 = 0.996 for XGBOOST-BO, the superior accuracy of the ML models is further demonstrated.

3.2. Model Interpretation Using the Sobol Technique

3.2.1. Global Interpretations

In this study, the Sobol analysis is carried out to evaluate the relative importance of different input parameters in the training process. The total sensitivity indices, which include the effects of the first- and second-order indices, are listed in Table 6. Focusing on the SVR model first, Fr and s/h are the predominant input features in cavity length prediction. The first- and second-order indices, which represent the main and interactive effects of the input features in the ML models, are shown in Table 6 and Table 7. For SVR, the effect of individual features is limited, as the magnitude of Si is much lower than that of STi. Consequently, the SVR output is predominantly affected by the interactions among the different input features.
As for the ensemble learning models, in contrast to SVR, individual features play important roles in the learning process. Among all the input features, Fr is the most important factor, as STFr presents a higher magnitude than the other input variables in all the implemented models. All second-order indices are close to 0, except for SFr,s/h (0.14, 0.12, and 0.10), indicating that interactions between two variables are less important in predicting the cavity length. Thus, information from individual input features dominates the prediction results in the ensemble learning models.
Overall, the above analysis implies that the SVR model relies more on the interactive effects between two or more features in the prediction, whereas the ensemble learning models prefer to extract information from individual input parameters.

3.2.2. Physical Interpretation of the Sobol Analysis Results

Under physical conditions, the variation of s/h has a significant effect on the cavity length at a relatively low Fr because the gravity of the water jet plays the dominant role. As Fr increases, the impact of subpressure and backwater becomes prominent, which significantly limits the cavity length [4,40,41]; even a small increase in subpressure can drastically reduce the cavity length [1]. Because the variations in subpressure and backwater become more sensitive to changes in φ [42] at relatively high flow velocities, the dominant influence on the cavity length can be expected to shift from s/h to φ at a high Fr.
To further evaluate the interpretability of the different ML models against the physics under different flow conditions, the variation of STi for α, φ, and s/h at different Fr values is plotted in Figure 6. At 2.46 < Fr < 5.5, all ML models show an STi of s/h higher than 0.5, implying its dominant effect on the cavity length at a relatively low Fr. When Fr > 5.5, an abrupt fluctuation of ST with respect to Fr exists for the ensemble learning models, whereas its variation in SVR is relatively smooth. The magnitude of STα drops to 0 in all ensemble learning models, which corresponds to the fact that α becomes invariant in the experiments at Fr > 5.5. However, the STα of SVR still presents a magnitude near 0.6 at Fr > 5.5, implying the inability of SVR to model the actual effect of the different input features. The ST of s/h for XGBOOST presents a relatively high magnitude at a low Fr and decreases dramatically as Fr increases, in good correspondence with the actual effect of s/h in the formation of the cavity, and the variation of the ST of φ also complies with the change in the relative significance of φ in the experiments. This compliance with the physics demonstrates the stronger robustness of XGBOOST compared with the other ML models for such a problem.

3.2.3. Evaluation of Model Accuracy at Different Fr values

As a main factor that influences the cavity length, the accuracy with which the different ML models capture the dependence of L/h on Fr is also evaluated. A sampling space composed of 5000 data points is generated by the Sobol sampling scheme for different Froude numbers. The variation of the average ML-derived L/h over the 5000 samples with respect to Fr is plotted in Figure 7.
When Fr < 5.5, a slight positive correlation between L/h and Fr exists, and all the ML models present similar results. When Fr > 5.5, an apparent increase in L/h with the increase in Fr can be found. Under this condition, a significant underestimation of L/h exists for SVR, whereas the RF and GBDT tend to overpredict the L/h. The results of XGBOOST present the best agreement with the experimental data, which may result from its ability to capture the actual relative significance of different input features in the prediction of cavity length.

4. Conclusions

In this study, a high-precision ensemble learning model for cavity length prediction in aerators of spillways is established. Sobol sensitivity analysis is conducted to further explain the ML models based on the relationship between features and cavity length. The following conclusions can be drawn:
The optimal input combination and training set ratio are selected using the SVR model. The hyperparameters of the ensemble learning models are selected by combining 10-fold cross-validation with the BO algorithm. The results show that the BO algorithm significantly improves the prediction performance. Among the different models implemented, the XGBOOST-BO model presents the highest test accuracy (R2 = 0.964, RMSE = 0.051, and MAE = 0.036). Additionally, the results imply that the ensemble learning models always outperform the SVR model, even without the implementation of the BO optimization.
As for the comparison with the empirical correlations, the correlation proposed by Wu et al. presents the best agreement with the experimental data among all the correlations considered, whereas its performance is still slightly worse than that of XGBOOST-BO, indicating the superiority of ML models over empirical correlations in cavity length prediction.
Finally, attempts to interpret the ML model are made by carrying out the Sobol analysis. The results imply that the SVR prefers to extract interactive information from two features, including Fr and s/h, while the ensemble learning models rely more on the individual effect among all input features. The Sobol analysis at different Fr values implies that all ML models correctly predicted the dominant effect of s/h on the cavity length at relatively low Fr values (Fr < 5.5), whereas only XGBOOST captured the diminished effect of s/h as Fr became higher than 5.5. The difference in the compliance with the physics is reflected in the accuracy of the cavity length prediction at different Fr values, where the results from XGBOOST present the best agreement with the experiments.
It should be noted that the input data adopted in this study come from a relatively simple aerator composed of a ramp. More complicated aerators can combine offsets with the ramp; although such designs have wider applicability in hydraulic engineering, they introduce more input features and are therefore more challenging for ML model development, which will be the focus of future investigations.

Author Contributions

Conceptualization, Y.L. and G.G.; methodology, Z.C. and S.L.; software, G.G.; validation, Z.C. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (Grant Nos. 52179060 and 51909024).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Pfister, M.; Hager, W.H. Chute aerators. II: Hydraulic design. J. Hydraul. Eng. 2010, 136, 360–367.
2. Wu, J.; Ruan, S. Emergence angle of flow over an aerator. J. Hydrodyn. 2007, 19, 601–606.
3. Wu, J.; Ruan, S. Cavity length below chute aerators. Sci. China Ser. E Technol. Sci. 2008, 51, 170–178.
4. Rutschmann, P.; Hager, W.H. Air entrainment by spillway aerators. J. Hydraul. Eng. 1990, 116, 765–782.
5. Chanson, H. Predicting the filling of ventilated cavities behind spillway aerators. J. Hydraul. Res. 2010, 33, 361–372.
6. Pfister, M.; Hager, W.H. Chute aerators. I: Air transport characteristics. J. Hydraul. Eng. 2010, 136, 352–359.
7. Ahmed, A.N.; Yafouz, A.; Birima, A.H.; Kisi, O.; Huang, Y.F.; Sherif, M.; Sefelnasr, A.; El-Shafie, A. Water level prediction using various machine learning algorithms: A case study of Durian Tunggal river, Malaysia. Eng. Appl. Comput. Fluid Mech. 2022, 16, 422–440.
8. Pal, M.; Singh, N.K.; Tiwari, N.K. Support vector regression based modeling of pier scour using field data. Eng. Appl. Artif. Intell. 2011, 24, 911–916.
9. Zaji, A.H.; Bonakdari, H. Optimum support vector regression for discharge coefficient of modified side weirs prediction. INAE Lett. 2017, 2, 25–33.
10. Bhattarai, A.; Dhakal, S.; Gautam, Y.; Bhattarai, R. Prediction of nitrate and phosphorus concentrations using machine learning algorithms in watersheds with different landuse. Water 2021, 13, 3096.
11. AlDahoul, N.; Ahmed, A.N.; Allawi, M.F.; Sherif, M.; Sefelnasr, A.; Chau, K.W.; El-Shafie, A. A comparison of machine learning models for suspended sediment load classification. Eng. Appl. Comput. Fluid Mech. 2022, 16, 1211–1232.
12. Çimen, M. Estimation of daily suspended sediments using support vector machines. Hydrol. Sci. J. 2008, 53, 656–666.
13. Hu, Z.; Karami, H.; Rezaei, A.; DadrasAjirlou, Y.; Piran, M.J.; Band, S.S.; Chau, K.W.; Mosavi, A. Using soft computing and machine learning algorithms to predict the discharge coefficient of curved labyrinth overflows. Eng. Appl. Comput. Fluid Mech. 2021, 15, 1002–1015.
14. Dursun, O.F.; Kaya, N.; Firat, M. Estimating discharge coefficient of semi-elliptical side weir using ANFIS. J. Hydrol. 2012, 426, 55–62.
15. Roushangar, K.; Akhgar, S.; Salmasi, F. Estimating discharge coefficient of stepped spillways under nappe and skimming flow regime using data driven approaches. Flow Meas. Instrum. 2018, 59, 79–87.
16. Liang, W.; Luo, S.; Zhao, G.; Wu, H. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics 2020, 8, 765.
17. Zhang, W.; Yu, J.; Zhao, A.; Zhou, X. Predictive model of cooling load for ice storage air-conditioning system by using GBDT. Energy Rep. 2021, 7, 1588–1597.
18. Qiu, Y.; Zhou, J.; Khandelwal, M.; Yang, H.; Yang, P.; Li, C. Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Eng. Comput. 2021, 38, 4145–4162.
19. Afan, H.A.; Ibrahem Ahmed Osman, A.; Essam, Y.; Ahmed, A.N.; Huang, Y.F.; Kisi, O.; Sherif, M.; Sefelnasr, A.; Chau, K.W.; El-Shafie, A. Modeling the fluctuations of groundwater level by employing ensemble deep learning techniques. Eng. Appl. Comput. Fluid Mech. 2021, 15, 1420–1439.
20. Chen, Y.-J.; Chen, Z.-S. A prediction model of wall shear stress for ultra-high-pressure water-jet nozzle based on hybrid BP neural network. Eng. Appl. Comput. Fluid Mech. 2022, 16, 1902–1920.
21. Wu, J.; Ma, D.; Wang, W. Leakage identification in water distribution networks based on XGBoost algorithm. J. Water Resour. Plan. Manag. 2022, 148, 04021107.
22. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347.
23. Mi, J.-X.; Li, A.-D.; Zhou, L.-F. Review study of interpretation methods for future interpretable machine learning. IEEE Access 2020, 8, 191969–191985.
24. Wang, S.; Peng, H.; Liang, S. Prediction of estuarine water quality using interpretable machine learning approach. J. Hydrol. 2022, 605, 127320.
25. Hall, J.W.; Boyce, S.A.; Wang, Y.; Dawson, R.J.; Tarantola, S.; Saltelli, A. Sensitivity analysis for hydraulic models. J. Hydraul. Eng. 2009, 135, 959–969.
26. Feng, D.-C.; Wang, W.J.; Mangalathu, S.; Taciroglu, E. Interpretable XGBoost-SHAP machine-learning model for shear strength prediction of squat RC walls. J. Struct. Eng. 2021, 147, 04021173.
27. Sobol, I.M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul. 2001, 55, 271–280.
28. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999.
29. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
30. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
31. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
32. Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
33. Jones, D.R. A taxonomy of global optimization methods based on response surfaces. J. Glob. Optim. 2001, 21, 345–383.
34. Kushner, H.J. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 1964, 86, 97–106.
35. Mockus, J.; Tiesis, V.; Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Glob. Optim. 1978, 2, 2.
36. Srinivas, N.; Krause, A.; Kakade, S.M.; Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv 2009, arXiv:0912.3995.
37. Wang, G.C.; Zhang, Q.; Band, S.S.; Dehghani, M.; Chau, K.W.; Tho, Q.T.; Zhu, S.; Samadianfard, S.; Mosavi, A. Monthly and seasonal hydrological drought forecasting using multiple extreme learning machine models. Eng. Appl. Comput. Fluid Mech. 2022, 16, 1364–1381.
38. Singh, V.K.; Panda, K.C.; Sagar, A.; Al-Ansari, N.; Duan, H.F.; Paramaguru, P.K.; Vishwakarma, D.K.; Kumar, A.; Kumar, D.; Kashyap, P.S.; et al. Novel genetic algorithm (GA) based hybrid machine learning-pedotransfer function (ML-PTF) for prediction of spatial pattern of saturated hydraulic conductivity. Eng. Appl. Comput. Fluid Mech. 2022, 16, 1082–1099.
39. Campolongo, F.; Saltelli, A.; Cariboni, J. From screening to quantitative sensitivity analysis. A unified approach. Comput. Phys. Commun. 2011, 182, 978–988.
40. Saltelli, A. Making best use of model evaluations to compute sensitivity indices. Comput. Phys. Commun. 2002, 145, 280–297.
41. Saltelli, A.; Annoni, P.; Azzini, I.; Campolongo, F.; Ratto, M.; Tarantola, S. Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Comput. Phys. Commun. 2010, 181, 259–270.
42. Yang, Y.; Yang, Y.; Shuai, Q. The hydraulic and aeration characteristics of low Froude number flow over a step aerator. J. Hydraul. Eng. 2000, 31, 27–31.
Figure 1. Schematic diagram of the spillway aerator and the formed cavity.
Figure 2. Construction of the ML models together with the interpreting analysis.
Figure 3. R2, RMSE, and MAE in cavity length prediction using the SVR model based on different input combinations.
Figure 4. The prediction results of the optimal XGB-BO and XGB regressor on the test set.
Figure 5. Cavity depth prediction derived from empirical correlations and XGBOOST-BO.
Figure 6. Evolution of the total Sobol indices. (a) SVR. (b) RF. (c) GBR. (d) XGB.
Figure 7. Prediction of L/h with respect to Fr using different ML models.
Table 1. Parameter settings in the experiments.

Dataset | α | h (cm) | v (m/s) | Fr | s (cm) | φ | L (cm) | Samples
Data1 (Zhang) | 0.087, 0.105, 0.122 | 2.5~8.5 | 1.5~6.0 | 2.95~7.77 | 1.5, 2.5, 3.0, 4.0 | 0.1 | 13.5~75.0 | 108
Data2 (Shen) | 0.07, 0.087, 0.096 | 1.25~3.40 | 1.0~1.8 | 2.46~3.90 | 1.0, 2.0, 3.0 | 0.2, 0.143, 0.1 | 4.9~26.5 | 162
Table 2. Input combinations of SVR.

Model | Input Name | Features | Training Set Ratio (%)
SVR | input1 | α, Fr, φ, s/h | 90, 80, 70, 60, 50
SVR | input2 | Fr, φ, s/h | 90, 80, 70, 60, 50
SVR | input3 | Fr, s/h | 90, 80, 70, 60, 50
Table 3. Hyperparameter optimization results.

Model | Hyperparameters | Time (s)
SVR-BO | C = 47.84, γ = 8.08, ε = 0.068 | 21.8
RF-BO | n_estimators = 28, max_depth = 50 | 305.1
GBDT-BO | n_estimators = 55, max_depth = 9, learning_rate = 0.43 | 29.7
XGBOOST-BO | n_estimators = 47, max_depth = 3, learning_rate = 0.29 | 145.7
Table 4. Performance evaluation of ML models with and without BO optimization.

Model | Train R2 | Train RMSE | Train MAE | Test R2 | Test RMSE | Test MAE
SVR | 0.931 | 0.067 | 0.060 | 0.921 | 0.081 | 0.060
SVR-BO | 0.929 | 0.068 | 0.057 | 0.936 | 0.069 | 0.058
RF | 0.989 | 0.0268 | 0.020 | 0.924 | 0.076 | 0.057
RF-BO | 0.989 | 0.0264 | 0.020 | 0.947 | 0.063 | 0.050
GBDT | 0.976 | 0.040 | 0.029 | 0.949 | 0.062 | 0.047
GBDT-BO | 1.000 | 0 | 0 | 0.957 | 0.056 | 0.046
XGBOOST | 0.999 | 0.00077 | 0.0016 | 0.921 | 0.077 | 0.055
XGBOOST-BO | 0.964 | 0.015 | 0.038 | 0.964 | 0.051 | 0.036
Table 5. Empirical correlations for cavity length prediction.

Reference | Equation | R2
Rutschmann, 1990 [4] | $L = F_r^2 \Theta d_0 \cos\alpha \left[ 1 + \sqrt{1 + \frac{2 (t_r + t_s) \cos\alpha}{\Theta^2 F_r^2 d}} \right] \left( 1 - 0.4 P_N^{0.5} \right)$, with $\Theta = \varphi \tanh\left( \frac{t_r}{d_0 \varphi} \right)$ the emergence angle | 0.617
Chanson, 2010 [5] | $L = V_0 T \cos\varphi + 0.5 g T^2 \sin\alpha$, with $T = \frac{V_0 \sin\varphi}{g (\cos\alpha + P_N)} \left[ 1 + \sqrt{1 + \frac{2 (t_r + t_s) g (\cos\alpha + P_N)}{(V_0 \sin\varphi)^2}} \right]$ | 0.758
Yang, 2000 [42] | $L = V_1 T \cos\varphi + 0.5 g T^2 (\sin\alpha - 0.00625 F_r^2)$, with $V_1 = 0.908 V_0$, $\Theta = \varphi \tanh\left( \frac{t_r}{d_0 \varphi} \right)$, $T = \frac{V_0 \sin\Theta}{g (\cos\alpha + P_N)} \left[ 1 + \sqrt{1 + \frac{2 (t_r + t_s) g (\cos\alpha + P_N)}{(V_0 \sin\Theta)^2}} \right]$ | 0.945
Wu, 2008 [3] | $L = V_0 T \cos\Theta + 0.5 g T^2 \sin\alpha_2$, with $\Theta = 0.48 \left[ \Theta + \alpha_2 - \alpha_1 \right] + 0.52 \left[ \varphi + \alpha_2 - \alpha_1 - \arctan\left( \frac{u'}{V_0} \right) \right]$ | 0.948
Pfister, 2010 [6] | $\frac{L}{h} = 0.77 F_r (1 + \sin\varphi)^{1.5} \left( \frac{s + t}{h} + F_r \tan\alpha \right)$ | 0.868
Table 6. First-order and total Sobol indices of ML models with BO.

Model | α (Total, First) | Fr (Total, First) | φ (Total, First) | s/h (Total, First)
SVR-BO | 0.235, 0.0057 | 0.711, 0.091 | 0.126, 0.038 | 0.646, 0.20
RF-BO | 0.044, 0.020 | 0.813, 0.658 | 0.020, 0.008 | 0.263, 0.152
GBDT-BO | 0.049, 0.021 | 0.789, 0.607 | 0.057, 0.016 | 0.292, 0.145
XGBOOST-BO | 0.064, 0.023 | 0.703, 0.484 | 0.123, 0.098 | 0.332, 0.171
Table 7. Second-order Sobol indices of different ML models.

Model | α, Fr | α, φ | α, s/h | Fr, φ | Fr, s/h | φ, s/h
SVR-BO | 0.143 | 0.057 | 0.068 | 0.041 | 0.39 | 0.014
RF-BO | 0.017 | 0.0 | 0.036 | 0.013 | 0.100 | 0.006
GBDT-BO | 0.010 | 0.0 | 0.0 | 0.020 | 0.120 | 0.010
XGBOOST-BO | 0.020 | 0.0 | 0.014 | 0.030 | 0.140 | 0.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
