Article

Assessing the Suitability of Boosting Machine-Learning Algorithms for Classifying Arsenic-Contaminated Waters: A Novel Model-Explainable Approach Using SHapley Additive exPlanations

1 Department of Geological Engineering, University of Mines and Technology, Tarkwa P.O. Box 237, Ghana
2 UniSA STEM, University of South Australia, Mawson Lakes, SA 5095, Australia
* Author to whom correspondence should be addressed.
Water 2022, 14(21), 3509; https://doi.org/10.3390/w14213509
Submission received: 20 October 2022 / Revised: 31 October 2022 / Accepted: 31 October 2022 / Published: 2 November 2022

Abstract

There is growing tension between high-performance machine-learning (ML) models and explainability within the scientific community. In arsenic modelling, understanding why ML models make certain predictions, for instance, “high arsenic” instead of “low arsenic”, is as important as the prediction accuracy. In response, this study aims to explain model predictions by assessing the influence of the input variables, i.e., pH, turbidity (Turb), total dissolved solids (TDS), and electrical conductivity (Cond), on arsenic mobility. The two main objectives of this study are to: (i) classify arsenic concentrations in multiple water sources using novel boosting algorithms such as natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB) and compare them with other existing representative boosting algorithms, and (ii) introduce a novel SHapley Additive exPlanation (SHAP) approach for interpreting the performance of ML models. The outcome of this study indicates that the newly introduced boosting algorithms produced efficient performances, comparable to the state-of-the-art boosting algorithms and a benchmark random forest model. Interestingly, extreme gradient boosting (XGB) proved superior to the remaining models in terms of overall and single-class performance metrics. Global and local interpretation (using SHAP with XGB) revealed that high pH water is highly correlated with high arsenic water and vice versa. In general, high pH, high Cond, and high TDS were found to be potential indicators of high arsenic water sources. Conversely, low pH, low Cond, and low TDS were the main indicators of low arsenic water sources. This study provides new insights into the use of ML and explainable methods for arsenic modelling.

1. Introduction

In recent years, elevated concentrations of arsenic in water bodies have increasingly become a global health challenge due to their toxic nature and adverse effects on human health [1]. It is estimated that more than 200 million people worldwide are chronically exposed to elevated arsenic concentrations in drinking water [2]. The real danger lies in situations where an exposed community is unaware of arsenic contamination, because the long-term health effects of arsenic poisoning are caused by chronic exposure [3]. Such instances resulted in the largest arsenic poisoning in Bangladesh and West Bengal, India, where 35–77 million people were estimated to be at risk of ingesting arsenic-contaminated water above the WHO recommended limit of 10 µg/L [4]. The fundamental intervention is to test all drinking water sources in order to identify and provide arsenic-free drinking water [4]. Unfortunately, this intervention is mostly a burden, especially for rural communities in developing countries, because the quantification of arsenic usually involves very expensive methods such as atomic absorption spectrophotometry, together with highly trained technicians and high maintenance costs. Hence, an indirect way of estimating arsenic concentration from physicochemical parameters such as pH, electrical conductivity (Cond), total dissolved solids (TDS) and turbidity (Turb), which are much simpler to measure, could be vital in detecting arsenic contamination.
Over the past few years, many studies have demonstrated the applicability of machine learning (ML) methods in modelling arsenic concentrations in water sources and have found that ML can achieve consistent or even better results than traditional methods [5,6,7,8,9]. Among other ML methods, boosting algorithms have emerged as robust and competitive techniques that have consistently placed among the top contenders in most Kaggle competitions [10,11]. Their performance is strongly justified and backed up theoretically by their ability to improve upon weak classification models by combining the outputs of many “weak” classifiers [12]. Furthermore, these models can absorb more input variables and adequately describe non-linear and complicated relations between variables. For example, boosted regression trees (BRT) have been widely adopted in modelling arsenic concentration in water sources [5,6,7,8]. Ayotte et al. [13] applied BRT and logistic regression (LR) to predict the probability of arsenic exceeding 10 μg/L or 5 μg/L in drinking water wells in the Central Valley, California. They reported higher predictive accuracy for BRT compared to LR. A recent study by Ibrahim et al. [9] also introduced two novel boosting variants, extreme gradient boosting (XGB) and light gradient boosting (LGB), to the arsenic modelling domain. Their study indicated that boosting algorithms are very efficient and produce satisfactory predictions comparable to state-of-the-art ML methods. It is evident that the applicability and superiority of boosting algorithms in modelling complex underlying relationships are well justified. However, limited studies have investigated the performance of other new variants of boosting algorithms in arsenic modelling. Interestingly, new variants of boosting algorithms such as natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB) exist in the literature and have been used successfully in various hydrological modelling tasks, such as estimating daily reference evapotranspiration [14,15], runoff probability prediction [16] and pan evaporation estimation [17]. Given that no single boosting algorithm is consistently better than another [18], it is critical to investigate the suitability of these new boosting models for modelling arsenic concentration in water sources.
Aside from their predictive ability, ML and boosting models rely on complex algorithms whose decisions and internal processes are difficult to understand. In critical areas such as arsenic modelling, the inability of end users to understand these models is problematic, because entrusting critical decisions to a system that cannot explain itself poses obvious dangers. ML models frequently struggle to generalise beyond the circumstances seen during training, yet, when given out-of-distribution samples, they still make (mostly incorrect) predictions with high confidence [19]. To make these models safe, fair, and reliable, explainable ML models are needed to assist end users, decision makers and regulators in accurately determining why models make certain predictions [20]. Existing research [5,6,7,8,9] has used methods such as variable importance scores and partial dependency plots to determine the influence of predictor variables on arsenic mobility. Such methods, however, are incapable of determining the importance of predictor variables for individual classes in a classification problem and/or their impact on individual predictions. Again, such methods assume feature independence even when the features are correlated [20]. The recent introduction of the novel SHapley Additive exPlanation (SHAP) approach has proved to be useful in understanding the response of the ML output to the inputs [21]. SHAP has shown satisfactory results in several scientific research projects on feature analysis [22,23,24,25,26,27]. To the best of our knowledge, only a few studies have used explainable ML algorithms to analyse variable attribution in hydrology. Hence, more research on model explainability and interpretability is required to decipher the complex hydrogeochemical controls of arsenic mobilisation.
In the quest to introduce more efficient boosting algorithms while also explaining why certain decisions are made by these models, the current study therefore seeks to: (i) determine the efficacy of six boosting algorithms i.e., gradient boosting machine (GBM), XGB, LGB, NGB, CATB and ADAB in classifying low (<5 µg/L), medium (>5 to ≤10 μg/L) and high (>10 µg/L) arsenic water; and (ii) identify the importance and contribution of each predictor variable in each individual class (high, low and medium) and visually interpret the complex non-linear behaviour underlying arsenic mobility. The comparison is extended to a non-boosting model, i.e., random forest (RF), which can be considered a benchmark model due to its success in arsenic modelling [8,9,28,29].

2. Study Area

In this study, samples were collected in a few selected mining communities near Ayanfuri in Ghana’s Upper Denkyira West district (Figure 1). The district is located between latitudes 6°0′ N and 5°55′ N and longitudes 1°57′ W and 1°52′ W, and it borders the districts of Bibiani-Anhwiaso-Bekwai to the north-west, Amansie West and Amansie Central to the north-east, Wassa Amenfi East and Wassa Amenfi West to the south-west, and Upper Denkyira East Municipal to the south [30]. The average temperature of the study area, specifically Ayanfuri, is 26.4 °C, with the warmest and coldest temperatures recorded in March (average of 27.6 °C) and August (average of 24.6 °C), respectively [30].
As shown in Figure 1, the majority of the research area is underlain by Paleoproterozoic Birimian flysch-type metasediments composed of dacitic volcaniclastics, greywackes, and argillaceous (phyllitic) sediments that have been strongly folded, faulted, and metamorphosed to upper greenschist facies [31,32]. The sediments have been intruded by a variety of small granite masses as well as several regional formations. Gold mineralisation occurs within the Ashanti Gold Belt’s granitic plugs and sills or dykes in southern Ghana, West Africa, as well as along two or three regional shear structures [31]. Mineralisation occurs as extremely fine grains, frequently at sulphide grain boundaries and in sulphide fractures, primarily at or near vein edges, with coarse visible gold seen in the quartz on occasion [31]. High-grade gold intercepts are frequently associated with very coarse arsenopyrite +/− sphalerite, chalcopyrite, and galena [31]. Groundwater (boreholes, pipe-borne, and hand-dug wells) and surface water (rivers and streams) supply more than 70% and 13% of the area’s drinking water, respectively [33].

3. Materials and Methods

3.1. Data Description

A total of 597 water samples, with 354 from groundwater sources and 243 from surface water sources, were obtained within the study area over the period 2011 to 2018. The sampling locations are shown in Figure 1. Groundwater samples were mostly taken from residential and public boreholes as well as hand-dug wells, while surface water samples were obtained from rivers and streams. Prior to sampling, 1 L polythene bottles were first cleaned with weak HNO3 and rinsed with distilled water. Physicochemical parameters such as pH, Turb, TDS and Cond were measured in the field using the HQ40d18 Series Portable Meter, while arsenic was determined using inductively coupled plasma mass spectrometry (ICP–MS). Here, H2 was used in the collision cell at a flow rate of 80 mL/min to maximise sensitivity while minimising potential polyatomic interferences with the target element (arsenic). The flow was optimised by monitoring the counts of the target element in a 2.0 M nitric acid eluate. Finally, determinations were made using aqueous standards in 2.0 M nitric acid with metal concentrations ranging from 0 to 0.25 mg/L. Triplicate analysis was conducted to ensure the accuracy of the arsenic measurements, and the mean of the triplicate measurements was recorded as the representative concentration. The dataset obtained consists of 324 (54%) low, 100 (17%) medium, and 173 (29%) high arsenic concentrations.

3.2. Model Development

A total of 417 samples (70% of the dataset) were used to train the ML models, with the remaining 180 (30%) used to assess (test) the models’ performance. The dataset was split using stratified sampling to ensure a uniform proportion of the target classes in the training and testing datasets. The input variables pH, Turb, TDS, and Cond were used to classify the low, medium, and high arsenic concentrations. Figure 2 depicts the model development workflow.
It is important to note that the hyperparameter search space was limited to only two hyperparameters to avoid model complexity while also ensuring fast computational time. Table 1 shows the libraries and optimal parameters used to build the various models.
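As a minimal illustration of this workflow, the sketch below performs the 70/30 stratified split and a two-hyperparameter grid search; the file name, column names, and grid values are illustrative assumptions rather than the tuned settings reported in Table 1:

```python
# Minimal sketch: stratified 70/30 split and a two-hyperparameter grid search.
# Assumes `arsenic_samples.csv` (hypothetical file) holds the 597 samples with
# an integer-coded target column "As_class" (0 = low, 1 = medium, 2 = high).
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier

df = pd.read_csv("arsenic_samples.csv")
X = df[["pH", "Turb", "TDS", "Cond"]]
y = df["As_class"]

# Stratified sampling keeps the class proportions equal in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Search space limited to two hyperparameters, as in the study.
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [3, 6, 9]}
search = GridSearchCV(XGBClassifier(eval_metric="mlogloss"),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```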
By using standard performance metrics, the overall predictive efficiency of the developed models was evaluated and compared. Finally, the relationship between the predictor variables and arsenic mobility was explained using the best-performing boosting model and SHAP. The theoretical concept of SHAP and the boosting algorithms used in the study are presented briefly in the following subsections. Table 2 summarises the main advantages and limitations of the boosting algorithms investigated. The RF model is not discussed because it has been extensively treated in the literature [9,38,39,40].

3.2.1. Categorical Boosting

CATB, introduced by Dorogush et al. [42], is a tree-based gradient boosting algorithm that is capable of handling categorical data in both regression and classification problems. The algorithm introduces two new advances to the boosting implementation in order to address the problem of prediction shift caused by target leakage, which is found in all other existing gradient boosting algorithms. The first is ordered boosting, a permutation-driven alternative to the conventional boosting method, while the second is a novel technique for dealing with categorical data [35]. Unlike other gradient boosting techniques, which require converting categorical variables before implementation, CATB uses oblivious decision trees (OBT) as base predictors when building a tree [44]. OBTs are well balanced, less prone to overfitting, and increase the speed of CATB during implementation. In dealing with categorical variables, CATB employs a more efficient technique that reduces overfitting and allows the entire dataset to be used in training. It randomly permutes the dataset and computes an average label value for each sample, with samples of the same category value placed before the given one in the permutation. More theoretical intuition is presented by Dorogush et al. [42].
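The following is a minimal CATB sketch for the three-class arsenic task, reusing the train/test split from the earlier sketch; the hyperparameter values are illustrative, not the tuned values in Table 1:

```python
# Minimal sketch: CATB for the three-class arsenic task (illustrative settings).
from catboost import CatBoostClassifier

catb = CatBoostClassifier(iterations=300,          # boosting rounds (oblivious trees)
                          depth=6,                 # depth of each oblivious tree
                          loss_function="MultiClass",
                          verbose=False)
catb.fit(X_train, y_train)
proba_catb = catb.predict_proba(X_test)            # probabilities for low/medium/high
```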

3.2.2. Natural Gradient Boosting

NGB, introduced by Duan et al. [37], is a key innovation in the gradient boosting family, which enables predictive uncertainty estimation with gradient boosting by employing a probabilistic forecast. The NGB algorithm has the advantage of being simpler as it requires relatively less expertise to implement. The major contribution to the boosting family is that it uses multiparameter boosting and natural gradients to integrate any choice of base learner (e.g., regression tree), parametric distribution (normal), and scoring rule (MLE), which are all chosen during configuration [37]. In this study, the base learner used was a decision tree with a maximum depth of 3. The categorical distribution was used with the default Friedman MSE error as a criterion.
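A minimal NGB sketch using the ngboost library, configured as described above with a depth-3 decision tree (Friedman MSE criterion) as the base learner and a categorical distribution for the three classes; the number of estimators is an illustrative assumption:

```python
# Minimal sketch: NGB with a depth-3 tree base learner and a categorical
# distribution over the three arsenic classes (reuses the earlier split).
from ngboost import NGBClassifier
from ngboost.distns import k_categorical
from sklearn.tree import DecisionTreeRegressor

base = DecisionTreeRegressor(criterion="friedman_mse", max_depth=3)
ngb = NGBClassifier(Dist=k_categorical(3),   # 3 classes: low, medium, high
                    Base=base, n_estimators=500, verbose=False)
ngb.fit(X_train, y_train)                    # labels must be coded 0..2
proba_ngb = ngb.predict_proba(X_test)        # probabilistic class forecast
```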

3.2.3. Adaptive Boosting

ADAB [45] creates an ensemble by focusing on previously misclassified cases. Like all ensembles, it generates a set of classifiers and then votes on them to classify test examples. However, here, the various classifiers are built in sequence by focusing the underlying learning algorithm on the training examples misclassified by earlier classifiers. The algorithm’s efficiency depends on building a diverse, yet accurate, collection of classifiers [46]. The key idea behind ADAB is to use weighted versions of the same training samples, rather than random subsamples [12].
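A minimal ADAB sketch, assuming the scikit-learn implementation with its default decision-stump weak learner (the settings shown are illustrative):

```python
# Minimal sketch: ADAB; each round re-weights the training samples misclassified
# by the classifiers built so far, and test examples are classified by a
# weighted vote of the whole sequence.
from sklearn.ensemble import AdaBoostClassifier

adab = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
adab.fit(X_train, y_train)       # sample weights are updated internally each round
y_pred_adab = adab.predict(X_test)
```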

3.2.4. Light Gradient Boosting

LGB [34] is a gradient boosting algorithm based on decision trees and the idea of combining weak learners into powerful learners. It uses histogram-based algorithms [15,47] that discretise continuous feature values into p bins and construct histograms of width p, resulting in enhanced speed and lower memory usage. LGB grows trees using the leaf-wise technique [15,34,48], which improves efficiency by selecting the leaf with the highest branching gain; however, this renders the algorithm prone to overfitting. The maximum depth parameter is therefore used to restrict the tree depth in order to prevent overfitting while maintaining efficiency. Furthermore, LGB employs gradient-based one-side sampling and exclusive feature bundling for faster learning [11].
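A minimal LGB sketch showing the leaf-wise growth controls discussed above (the num_leaves and max_depth values are illustrative):

```python
# Minimal sketch: LGB with leaf-wise tree growth capped by max_depth.
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(n_estimators=300,
                     num_leaves=31,   # leaf-wise growth: split the leaf with the largest gain
                     max_depth=6)     # depth cap to curb the overfitting tendency
lgb.fit(X_train, y_train)
y_pred_lgb = lgb.predict(X_test)
```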

3.2.5. Extreme Gradient Boosting

XGB is an efficient technique created by Chen et al. [49] that is built on the gradient boosting framework and can handle both regression and classification problems [50,51]. In an iterative procedure, the algorithm learns the functional relationship between the input and target features by training individual trees successively on the residuals of the preceding trees [6]. In this way, it iteratively combines weak base learners into a stronger learner that maximises the objective function.
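A minimal XGB sketch for the three-class problem (hyperparameter values are illustrative; the tuned values used in the study are in Table 1):

```python
# Minimal sketch: XGB; each new tree is trained on the residuals (gradients)
# of the preceding ensemble to maximise the objective function.
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=300,
                    max_depth=6,
                    learning_rate=0.1,
                    objective="multi:softprob",   # per-class probabilities
                    eval_metric="mlogloss")
xgb.fit(X_train, y_train)                         # labels coded 0..2
proba = xgb.predict_proba(X_test)
y_pred = proba.argmax(axis=1)
```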

3.2.6. Gradient Boosting Machine

The GBM technique by Friedman [50] is a form of boosting that iteratively generates new base learners focused on the errors of the current ensemble. Unlike ADAB, which reweights misclassified samples, GBM works on the loss function’s negative partial derivatives at each training observation. These partial derivatives, also known as pseudo-residuals, are used to iteratively expand the ensemble. As a result, the feature space is partitioned, with similar pseudo-residuals grouped together [52].
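A minimal GBM sketch, assuming the scikit-learn implementation (illustrative settings):

```python
# Minimal sketch: GBM; each new tree is fitted to the pseudo-residuals
# (negative gradients of the loss) of the current ensemble.
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
gbm.fit(X_train, y_train)
y_pred_gbm = gbm.predict(X_test)
```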

3.2.7. SHapley Additive exPlanation

SHAP, introduced by Lundberg and Lee [53], is a game theory-based model explainability method for explaining predictions from a ML model. The main goal of SHAP is to interpret individual predictions of an instance (e.g., high arsenic) by computing the contribution of each predictor variable (e.g., pH, Turb, Cond, and TDS) using the coalition game theory. Here, each predictor variable acts as a player in a coalition. The contribution of each input variable, also known as the payoff, is the increase in the probability of a particular class occurring when conditioning on the feature. The outcome of the model is explained using the concept of additive feature attribution. SHAP specifies the explanation in Equation (1) as:
$$ g(x') = \phi_0 + \sum_{i=1}^{M} \phi_i x'_i \qquad (1) $$
where $g$ represents the explanation model, $x' \in \{0,1\}^M$ is a coalition vector whose entry $x'_i$ indicates whether the $i$th predictor is present ($x'_i = 1$) or absent ($x'_i = 0$), $\phi_0$ is the base value when all inputs are unavailable, and $M$ is the maximum coalition size. $\phi_i \in \mathbb{R}$ is the Shapley value, which represents the feature attribution for a feature $i$.
Shapley values, $\phi_i$, have properties that make them suitable for evaluating feature importance [20,54]:
Dummy: If a feature $i$ does not contribute any marginal value, then $\phi_i = 0$.
Additivity: If a model $S$ is an ensemble of $m$ submodels, the contributions of a feature $i$ in the submodels should add up: $\phi_i^S = \sum_{k=1}^{m} \phi_i^{(k)}$.
Efficiency: All Shapley values must add up to the difference between the prediction and the expected value.
Substitutability: If two given features $i$ and $j$ contribute equally to all their possible subsets, then their Shapley values are equal: $\phi_i = \phi_j$.
Since SHAP computes Shapley values, it is known to be the only method that satisfies all the properties of efficiency, symmetry, dummy, and additivity [54]. Interestingly, SHAP is proven to provide a unique solution with three vital properties [20,53,54]:
Local accuracy: Equivalent to the Shapley efficiency property.
Consistency: Follows from the additivity, substitutability, and dummy properties of the Shapley values.
Missingness: Missing features get a Shapley value of zero. It is important to note that, in theory, a missing feature can have an arbitrary Shapley value without violating local accuracy. In practice, this property is only needed when features are constant.
The precise determination of SHAP values is difficult since it necessitates an exponential computation over each feasible subset of variables. Hence, Lundberg et al. [55] introduced TreeSHAP to efficiently approximate SHAP values for tree-based ML models such as XGB, LGB, and RF. Therefore, this study employed the TreeSHAP variant with XGB for model explainability.
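A minimal TreeSHAP sketch using the shap library with the fitted XGB model from the earlier sketch; depending on the shap version, the multiclass output is a list of per-class arrays or a single 3-D array:

```python
# Minimal sketch: TreeSHAP attributions for the fitted XGB model.
import shap

explainer = shap.TreeExplainer(xgb)           # `xgb` from the earlier sketch
shap_values = explainer.shap_values(X_test)   # Shapley values per class and sample
```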

3.3. Statistical Evaluation of Model Performance

The selection of suitable metrics for discriminating the optimal solution is an important step towards obtaining an optimised classifier [56]. In this study, accuracy (Acc), kappa, precision, F1, sensitivity, area under the receiver operating characteristic curve (AUC) and Matthews correlation coefficient (MCC) were used to measure the effectiveness of the classifiers on unseen (testing) data. These metrics allow a clear and intuitive interpretation of the performance of the classifiers in all classes [57]. The metrics and corresponding mathematical representations (Equations (2)–(8)) are presented in Table 3.
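A minimal sketch of these metrics computed with scikit-learn on the testing set, reusing the predictions from the XGB sketch (AUC requires the predicted class probabilities):

```python
# Minimal sketch: overall (Acc, kappa, MCC, AUC) and single-class
# (precision, sensitivity, F1) metrics on the testing set.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, matthews_corrcoef,
                             precision_recall_fscore_support, roc_auc_score)

acc = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
auc = roc_auc_score(y_test, proba, multi_class="ovr")   # one-vs-rest multiclass AUC
# Per-class precision, sensitivity (recall) and F1 for low/medium/high:
prec, sens, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=None)
```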

4. Results and Discussion

4.1. Hydrogeochemistry of Input Parameters and Arsenic Pollution

Table 4 shows the summary data for the major hydrochemical parameters. The pH of the samples obtained from surface water sources varied in the range of 3.90 to 8.51 with a mean of 6.35, whereas samples obtained from groundwater sources also varied in a pH range of 4.23 to 7.30 with a mean of 5.73. This means that the pH of the samples ranges from very acidic to mildly alkaline. The acidic pH in some of the samples, notably in the surface water samples, may be due to the presence of sulfur-bearing minerals in the aquifer system, which encouraged the accumulation of acidity from rainwater and other sources, lowering the pH [30,61,62]. Furthermore, the sample locations contain carbonate rocks (argillitic and volcaniclastic deposits), which can cause carbonate minerals to dissolve and mix with surface and groundwater, increasing the pH.
The Cond values of the surface water samples are in the range of 206 to 2040 μS/cm with a mean value of 183.50 μS/cm, whereas the groundwater samples are in the range of 83 to 1070 μS/cm with a mean value of 245.91 μS/cm. When compared with the guideline value (2500 μS/cm) of Cond in drinking water [3], the Cond values of all the samples in both surface water and groundwater are below the guideline value (Table 4). The higher Cond values are attributable to inputs from anthropogenic activities in the region, such as aquaculture and indiscriminate garbage dumping [30].
The TDS values of the surface water samples vary from 8440 to 2,390,000 μg/L, with a mean value of 104,169 μg/L, whereas the groundwater samples vary from 48,300 to 934,000 μg/L, with a mean value of 150,003 μg/L (Table 4). It is interesting to note that some of the surface water samples have TDS concentrations above the WHO guideline value of 1,000,000 μg/L. The increased TDS concentrations in the surface water samples indicate that pesticide and herbicide runoff from agricultural operations is a serious issue in the area. Additionally, leachate from adverse mining and mineral processing activities in the area could contribute to the elevated TDS concentrations.
The surface water samples’ turbidity readings vary from 0.60 to 292,600 NTU, with a mean value of 1312.72 NTU, whereas the groundwater samples are in the range of 0.20 to 142 NTU, with a mean value of 18.17 NTU. The turbidity values of a majority of the samples in both surface water and groundwater exceed the WHO guideline value of 5 NTU (Table 4). Surface water samples had higher turbidity values than groundwater samples, presumably due to severe rainfall or disturbances to land near raw water sources caused by undesirable farming and mining operations.
In recent years, the adverse mining activities in the area have resulted in elevated arsenic concentrations in water sources [63], leaving a majority of the population highly exposed to arsenic contamination [64]. The arsenic concentrations of the surface water samples are in the range of 2.0–620 μg/L with a mean value of 28.51 μg/L, whereas the groundwater samples are in the range of 2.0–88.29 μg/L with a mean value of 4.23 μg/L. When compared with the guideline value (10 μg/L) of arsenic in drinking water [3], the arsenic concentrations of a considerable proportion of the samples in both surface water and groundwater exceed the guideline value (Table 4). Surface water samples contained significant levels of arsenic, which might be attributed to the extensive surface mining of gold-bearing rocks containing sulphide minerals such as pyrite and arsenopyrite. Elevated arsenic content in groundwater samples can also be related to the oxidation of sulphidic aquifers. Figure 3 depicts low, medium and high arsenic concentrations in surface water and groundwater.

4.2. Overall Model Performance

The models are analysed and compared, in this section, based on how well they maximised the performance metrics. The overall performance of the testing dataset was assessed and compared using the performance measures kappa, Acc, MCC, and AUC. Table 5 displays the overall performance measures of the developed models.
Figure 4 depicts a plot of the overall performance of the developed models. As shown in Figure 4, XGB is more efficient (with an Acc of 0.86) in classifying the various water classes, followed closely by LGB (0.83), NGB (0.82), GBM (0.82), CATB (0.81), and ADAB (0.76). In terms of the area under the curve (AUC) score, all the models, except ADAB (AUC of 0.83), obtained more than 0.9 (Figure 4).
In interpreting the strength of the agreement, kappa values of 0.01–0.20 are considered slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect [65]. The kappa results presented indicate that LGB, XGB, CATB, NGB, and GBM reached substantial agreement with kappa values of 0.71, 0.75, 0.67, 0.68, and 0.69, respectively, whereas ADAB achieved moderate agreement with a score of 0.58 (Figure 4).
MCC has a range of [−1,1], with values close to 1 indicating a very good correlation between the predicted and observed class and values close to 0 indicating poor correlation. XGB produced the highest correlation of 0.75, followed, sequentially, by LGB, GBM, NGB, CATB, and ADAB with correlation values of 0.72, 0.69, 0.69, 0.68, and 0.58, respectively (Figure 4).
In terms of overall evaluation (Acc, kappa, MCC, and AUC), XGB outperformed the other models by achieving the highest MCC, Acc, and kappa scores, followed by LGB. ADAB had the worst overall performance across all metrics. All the boosting models developed in this study performed comparably to the benchmark RF model in classifying arsenic in surface water and groundwater. In most cases, XGB performed admirably; however, these findings should be further investigated in future studies using a large number of environmental datasets representing a wide range of environmental settings and compartments in order to draw broad conclusions.

4.3. Single-Class Model Performance

Single-class metrics are less sensitive to class imbalance, making them ideal for evaluating classifiers in skewed data domains [57]. Precision, sensitivity, and F1 measures were used to assess comparative single-class performance. Table 6 displays the single-class performance results.
The identification of water samples with low arsenic concentrations is critical in quantifying safe drinking water sources for human consumption [8]. Figure 5a shows that all of the models performed admirably in identifying waters with low arsenic concentrations. In terms of precision, more than 77% of the samples predicted as low arsenic were indeed low. The precision score for XGB was 0.87, followed by GBM (0.86), LGB (0.84), CATB (0.84), NGB (0.80), and ADAB (0.78). This is critical for arsenic modelling because it lowers the number of false positive predictions (predicting a high or medium arsenic water source as low). This means that the models are safe to use because they rarely misclassify high or medium arsenic water sources as low. Again, all of the developed models had a sensitivity greater than 0.85. The highest sensitivity score was 0.93, achieved by XGB, followed by NGB (0.91), LGB (0.89), CATB (0.89), ADAB (0.87), and GBM (0.86). All of the models have a high F1 score (>0.81), which represents a balance between sensitivity and precision. XGB had the highest predictive efficiency (F1 score of 0.90), while ADAB had the lowest (F1 score of 0.82). Overall, XGB demonstrated the greatest predictive efficiency for low arsenic concentrations, with the highest precision, sensitivity, and F1 score.
The boosting models performed poorly in estimating the medium arsenic concentration, as shown in Figure 5b. XGB had the highest precision (0.85), while NGB had the highest sensitivity (0.72) and F1 score (0.72). The medium arsenic’s poor performance can be attributed to the dataset’s small sample size (17% of the total data used).
Typically, the primary goal of arsenic modelling is to identify and accurately classify high arsenic areas in order to reduce arsenic contamination and pollution [13]. Thus, sensitivity is critical because it serves as a protective buffer for the population. High sensitivity indicates low false negatives (predicting a low arsenic concentration in a water sample when it is high) and vice versa. From Figure 5c, it can be seen that all the models except ADAB and NGB could correctly predict the high arsenic class in the testing dataset with a sensitivity score greater than 0.802. Again, all the models could predict the high arsenic concentration with very good precision (>0.83). In terms of the F1 score, which represents the balance between precision and sensitivity, LGB is more efficient in predicting the high arsenic waters.
In terms of single-class assessment (precision, sensitivity, and F1), all the models achieved very good performance in classifying the high and low arsenic waters but relatively poor performance in classifying the medium arsenic waters. XGB achieved the highest sensitivity for the high and low arsenic waters, the highest precision for the low and medium arsenic waters, and the highest F1 score for the low arsenic class. NGB obtained the highest sensitivity and F1 score for the medium class. Previous comparative studies [41] have similarly justified XGB’s superiority over other boosting variants. Generally, ADAB performed the worst in terms of precision, sensitivity, and F1 score across all classes, which is consistent with previous comparisons [14,41].

4.4. Relative Importance of Predictor Variables

The relative importance of the predictor variables can be used to identify the primary input factors influencing the predictions [13]. The variable importance plots in Figure 6 show a similar trend, albeit with a slight variation. Overall, pH and Cond had the greatest impact on arsenic distribution in water sources, while TDS and TurB had a moderate influence. These findings are consistent with the domain knowledge that arsenic mobility in water sources is often controlled by pH and Cond [66].

4.5. SHAP Global Interpretation

The global importance of the individual predictor variables can be explored using the SHAP feature importance plot in Figure 7. Unlike the variable importance plots in Figure 6, which are based on the decrease in model performance, SHAP feature importance is based on the magnitude of feature attributions. Here, the contribution of the predictor variables to the individual classes can be verified (Figure 7). Such insight has not been explored in previous studies, which, like Lombard et al. [8], mostly adopt variable importance plots that only account for the influence on the overall classification and not on individual classes.
From Figure 7, it can be seen that the higher the mean SHAP value, the more important the predictor variable. It is clear that Cond is the most influential variable overall, followed by pH, TDS, and Turb. In predicting the high and low arsenic classes, pH is the most important variable, followed by Cond. With regard to the medium class, Cond is the most important variable, followed by TDS. The plot is very insightful as it establishes how variable importance varies according to the concentration of arsenic (high, medium, or low) in water sources.
The directional marginal contribution of each predictor variable to the various classes is presented in Figure 8. Here, each point on the plot corresponds to a row in the dataset. The gradient colour of each point represents the magnitude of the input variable, i.e., the red or blue plots represent the higher or lower values of inputs, respectively. The y-axis represents the variable names, ranked from top to bottom in order of importance, and the x-axis depicts the SHAP value. The presence of coloured points on both sides for all features indicates how much a feature impacts the model negatively (left) or positively (right). The overlapping points are jittered in the y-axis direction to show the distribution of the SHAP values per feature.
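A minimal sketch of how these plots can be produced with the shap library, assuming the multiclass output is a list of per-class arrays (the class index is illustrative):

```python
# Minimal sketch: SHAP global plots.
import shap

shap.summary_plot(shap_values, X_test, plot_type="bar")  # mean |SHAP| importance (Figure 7 style)
shap.summary_plot(shap_values[2], X_test)                # beeswarm for one class, e.g., "high" (Figure 8 style)
```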
From Figure 8a, it is evident that water sources have a high probability of containing low arsenic concentrations when pH values are low (i.e., high values correlate negatively and vice versa). Low Cond values generally increase the chance of water having a low arsenic concentration. Turb is the least important variable in determining low arsenic water; although its low values show no clear relationship with arsenic mobility, high Turb values generally indicate low arsenic water. On the other hand, high values of pH and TDS reduce the chances of water being low in arsenic. Overall, low pH and low Cond are identified as the main indicators of limited arsenic mobility in water, whereas the converse, together with high TDS, encourages arsenic mobility.
According to Figure 8b, the most important determinant of medium arsenic waters is Cond, followed by TDS, Turb, and pH. It is observed that a source of water with low Cond is likely to contain a medium arsenic concentration. Similarly, low TDS values mostly indicate medium arsenic waters. The spread of the low values on opposite sides of the plot is indicative of how difficult it was for the model to learn the classification rules between the input and target variables. On the other hand, high Cond, high TDS, high Turb and high pH of a water source make the occurrence of medium arsenic water unlikely.
Figure 8c shows that high pH water is highly correlated with high arsenic water. Similarly, high Cond is a potential indicator of a high arsenic water source. Additionally, high TDS is mostly associated with high arsenic water. On the other hand, low pH, Cond, Turb and TDS generally indicate the absence of a high arsenic concentration.
Figure 9, Figure 10 and Figure 11 show the relationship between SHAP values and changes in individual predictor variables for low, medium, and high arsenic waters, respectively. The plots depict the variation of SHAP values in relation to the input variables. The colour coding on the right corresponds to the interaction term values. When vertical dispersion of points is observed, interaction among predictor variables can be identified.
Figure 9a–c show that there is no clear relationship between SHAP values and TDS, Turb, or Cond, respectively. However, beyond a TDS of 2500, the probability of low arsenic water occurring increases with increasing TDS, as shown in Figure 9b. Waters with a high Cond (>250) and a low pH are also likely to be low in arsenic, as shown in Figure 9c. The inverse relationship between pH and SHAP values is depicted in Figure 9d. It is important to note that high SHAP values indicate a high likelihood of low arsenic water occurrence. As can be seen, increasing pH reduces the likelihood of low arsenic water occurrence and vice versa. An interaction with TDS is also observed: according to the interaction effect, water with a high pH and medium TDS is less likely to be low in arsenic.
In Figure 10a–d, it is very difficult to decipher a defined relationship between SHAP values and the various input parameters. This explains why the models performed relatively poorly in classifying medium-arsenic water. Moreover, variable interaction seems to be relatively poor. However, beyond a Cond of 300, SHAP values can be observed to increase with increasing Cond in Figure 10c.
In Figure 11a,b, TDS and Turb show no clear relationship with the presence of high arsenic water. Waters with zero Turb and high pH, on the other hand, are highly likely to contain high arsenic, according to the interaction effect in Figure 11b. Figure 11c shows that low Cond values rule out the possibility of arsenic-rich water. Figure 11d demonstrates why pH is the most important variable in determining arsenic levels in water sources. There is a direct relationship between pH and SHAP values (Figure 11d), indicating that the probability of the occurrence of high arsenic water increases with an increase in pH. This observation is consistent with past studies where higher concentrations of arsenic were detected in groundwater of high pH [67,68] and vice versa [69].
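A minimal sketch of a SHAP dependence plot of the kind shown in Figure 11d, assuming class index 2 corresponds to the high arsenic class and using TDS as the colour-coded interaction term:

```python
# Minimal sketch: SHAP dependence plot for pH in the high-arsenic class.
import shap

shap.dependence_plot("pH", shap_values[2], X_test, interaction_index="TDS")
```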

4.6. SHAP Local Interpretation

Since the global interpretations are based on the training dataset, it is critical to understand how the model makes decisions when an unknown (testing) dataset is introduced. In most ML predictive tasks, the ultimate goal is generalisable predictive performance, i.e., performance on unknown datasets. In this regard, SHAP local interpretation was used to justify decision-making on the testing dataset. SHAP waterfall plots are used to explain how individual predictions were made; they help in understanding how the input variables contribute to the model’s prediction. In the waterfall plots, f(x) indicates the predicted class occurrence probability, and E[f(x)] indicates the expected model output. The x-axis represents the range of responses. The y-axis represents the variable name and the corresponding observed value. The bottom of the plot starts with the expected value of the model output, and then each row shows how the positive (red) or negative (blue) contribution of each feature moves the value from the expected model output to the final prediction. For instance, in Figure 12, a water sample with pH = 7.3, Cond = 415, Turb = 335, and TDS = 207 is considered. The sample has a high arsenic concentration, which was correctly predicted by the XGB model. Figure 12a–c illustrate the probabilities of the sample occurring as low, medium, and high, respectively. The individual probabilities (f(x)) are indicated at the top of each plot.
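A minimal sketch of a waterfall plot for a single testing sample and class, built from the per-class Shapley values of the earlier sketches (the row index i and class index k are illustrative):

```python
# Minimal sketch: SHAP waterfall plot for one sample and one class.
import shap

i, k = 0, 2   # sample row and class index ("high"); both illustrative
e = shap.Explanation(values=shap_values[k][i],
                     base_values=explainer.expected_value[k],  # E[f(x)] for class k
                     data=X_test.iloc[i],
                     feature_names=list(X_test.columns))
shap.plots.waterfall(e)   # red bars push towards the class, blue bars away
```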
The global interpretations made thus far are critical in deciphering why the sample was predicted as high rather than the other options (low or medium). In terms of the probability of the sample being low in arsenic (Figure 12a), Cond, Turb and TDS all work in favour (positively). However, the pH value makes it highly unlikely that the sample is low in arsenic, driving the prediction in the opposite direction. Furthermore, the sample was not classified as medium because all of the input variables drove the prediction negatively, reducing the likelihood of medium arsenic water occurrence (Figure 12b). According to Figure 12c, the variables driving the prediction positively (to the right) are the high pH and high Cond, which were previously established to have the greatest global influence on high arsenic. Their influence is stronger than that of Turb and TDS, which drive the prediction in the opposite direction. This explains why the sample was ultimately classified as high in arsenic.
In Figure 13, it can be seen that a water sample with pH = 6, Cond = 55, Turb = 28, and TDS = 33 was predicted as medium. All of the input features in Figure 13a, except TDS, drove the prediction negatively. It is clear that the pH and Cond values significantly reduced the likelihood of the sample being low in arsenic. The Cond and pH values pushed the prediction to the right, increasing the likelihood of encountering medium arsenic water, as shown in Figure 13b. Low Cond values, as shown in Figure 8b, increase the likelihood of medium arsenic waters. Low Cond is found to be the major variable driving the prediction in favour of medium arsenic water, followed by pH. Concerning the possibility of the sample containing high arsenic (Figure 13c), it is easy to see that pH, TDS and Turb greatly reduced such a possibility.

4.7. Contribution and Limitations

Arsenic in drinking water is becoming more widely recognised as a potential health risk for rural populations in developing countries such as Ghana, West Africa. Recognising the potential health risks, it has become critical to regularly monitor arsenic levels in drinking water supply systems such as surface water and groundwater. However, traditional testing and monitoring approaches are somewhat costly and time-consuming. This study evaluated, for the first time, the predictive efficiency of various boosting algorithms as ML techniques for classifying low, medium, and high arsenic concentrations in surface water and groundwater. A standard ML technique known as RF was used as a benchmark model. The findings from this study suggest that the concentrations of arsenic in some samples obtained from the study area exceeded the WHO limit of 10 μg/L, with some samples showing a maximum concentration of about 620 μg/L. This is a major environmental concern because several potential sources of arsenic pollution in the area are increasing, including mining, fuel combustion, wood preservation, and the use of As-based pesticides in agriculture.
This study provides a significant contribution to the existing knowledge in the literature by developing ML algorithms that can be used as a cost-effective and quicker approach for monitoring and classifying low, medium, and high arsenic concentrations in various water supply systems. In terms of overall and single-class performance, all the developed boosting algorithms showed excellent performance in classifying arsenic concentrations. More importantly, the XGB model exhibited exceptional performance compared to the other boosting models and can be adopted in future studies for classifying and predicting arsenic concentrations in water supply systems.
The study, also for the first time, employed SHAP to identify important influence variables on arsenic mobilization. Models’ predictions were explained using SHAP to promote transparency in ML modelling. Interaction effects among input variables were also assessed.
Despite the high predictive tendencies of the ML algorithms developed in this study, the models were developed using predictor variables available at a regional scale or smaller geographic area (a specific region in Ghana, West Africa), and hence, caution should be applied for direct translation of knowledge. Furthermore, the dataset used in this study was obtained through the analysis of water samples collected between the years 2011 and 2018, so care should be taken when interpreting the results in current times because of potential uncertainty due to possible temporal variation. Also, including other relevant predictors such as redox conditions could lead to better performance and could be considered in future studies.

5. Conclusions

This study presented an empirical comparison of six representative categories of the most popular boosting algorithms, including extreme gradient boosting (XGB), gradient boosting machine (GBM), light gradient boosting (LGB), natural gradient boosting (NGB), categorical boosting (CATB) and adaptive boosting (ADAB), for arsenic modelling. SHapley Additive exPlanation (SHAP) was also used to explain model decisions in order to decipher the complex underlying non-linear relationship between the influencing input variables (pH, Turb, TDS, and Cond) and arsenic mobility. The major findings are as follows:
  • In terms of overall assessment metrics (Acc, MCC, kappa and AUC), all the boosting models (XGB, NGB, LGB, ADAB, CATB, and GBM) developed proved efficient in the arsenic modelling task, with minimum AUC, MCC, kappa, and Acc scores of 0.83, 0.58, 0.58, and 0.76, respectively.
  • The single class assessment metrics (precision, sensitivity and F1 score) indicate that the boosting models are more efficient at recognising high and low arsenic contaminated waters.
  • Essentially, the XGB algorithm outperformed the remaining models in terms of overall and single-class assessment metrics, whereas ADAB performed the worst.
  • High pH water was found to be highly correlated with high arsenic water, and vice versa. Water with high pH, Cond and TDS increases the likelihood of encountering high arsenic water sources. Low pH, Cond, and TDS levels are all indicators of low arsenic water. Medium arsenic waters are mostly associated with low Cond and low TDS.
Overall, this study provides a comprehensive evaluation of boosting algorithms and explainable ML that may be useful for future prediction, categorisation, and control of arsenic concentrations in various water supply systems. Although the models developed in this study performed well, the data used for validation and testing were limited to the study region and timeframe. As a result, future studies should validate these models using larger and more current datasets.

Author Contributions

Conceptualization, B.I. and I.A.; methodology, B.I.; software, B.I.; validation, B.I. and I.A.; formal analysis, B.I.; investigation, B.I.; resources, B.I.; data curation, B.I.; writing—original draft preparation, B.I.; writing—review and editing, B.I., I.A. and A.E.; visualization, B.I.; supervision, I.A. and A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cho, K.H.; Sthiannopkao, S.; Pachepsky, Y.A.; Kim, K.-W.; Kim, J.H. Prediction of Contamination Potential of Groundwater Arsenic in Cambodia, Laos, and Thailand Using Artificial Neural Network. Water Res. 2011, 45, 5535–5544. [Google Scholar] [CrossRef] [PubMed]
  2. Naujokas, M.F.; Anderson, B.; Ahsan, H.; Aposhian, H.V.; Graziano, J.H.; Thompson, C.; Suk, W.A. The Broad Scope of Health Effects from Chronic Arsenic Exposure: Update on a Worldwide Public Health Problem. Environ. Health Perspect. 2013, 121, 295–302. [Google Scholar] [CrossRef] [PubMed]
  3. World Health Organization. Guidelines for Drinking-Water Quality; World Health Organization: Geneva, Switzerland, 2017. [Google Scholar]
  4. Smith, A.H.; Lingas, E.O.; Rahman, M. Contamination of Drinking-Water by Arsenic in Bangladesh: A Public Health Emergency. Bull. World Health Organ. 2000, 78, 1093–1103. [Google Scholar]
  5. Tan, Z.; Yang, Q.; Zheng, Y. Machine Learning Models of Groundwater Arsenic Spatial Distribution in Bangladesh: Influence of Holocene Sediment Depositional History. Environ. Sci. Technol. 2020, 54, 9454–9463. [Google Scholar] [CrossRef] [PubMed]
  6. Chakraborty, M.; Sarkar, S.; Mukherjee, A.; Shamsudduha, M.; Ahmed, K.M.; Bhattacharya, A.; Mitra, A. Modeling Regional-Scale Groundwater Arsenic Hazard in the Transboundary Ganges River Delta, India and Bangladesh: Infusing Physically-Based Model with Machine Learning. Sci. Total Environ. 2020, 748, 141107. [Google Scholar] [CrossRef]
  7. Erickson, M.L.; Elliott, S.M.; Brown, C.J.; Stackelberg, P.E.; Ransom, K.M.; Reddy, J.E.; Cravotta III, C.A. Machine-Learning Predictions of High Arsenic and High Manganese at Drinking Water Depths of the Glacial Aquifer System, Northern Continental United States. Environ. Sci. Technol. 2021, 55, 5791–5805. [Google Scholar] [CrossRef] [PubMed]
  8. Lombard, M.A.; Bryan, M.S.; Jones, D.K.; Bulka, C.; Bradley, P.M.; Backer, L.C.; Focazio, M.J.; Silverman, D.T.; Toccalino, P.; Argos, M.; et al. Machine Learning Models of Arsenic in Private Wells Throughout the Conterminous United States As a Tool for Exposure Assessment in Human Health Studies. Environ. Sci. Technol. 2021, 55, 5012–5023. [Google Scholar] [CrossRef]
  9. Ibrahim, B.; Ewusi, A.; Ahenkorah, I.; Ziggah, Y.Y. Modelling of Arsenic Concentration in Multiple Water Sources: A Comparison of Different Machine Learning Methods. Groundw. Sustain. Dev. 2022, 17, 100745. [Google Scholar] [CrossRef]
  10. Taieb, S.B.; Hyndman, R.J. A Gradient Boosting Approach to the Kaggle Load Forecasting Competition. Int. J. Forecast. 2014, 30, 382–394. [Google Scholar] [CrossRef] [Green Version]
  11. Chen, T.; Guestrin, C. Xgboost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  12. Ferreira, A.J.; Figueiredo, M.A. Boosting Algorithms: A Review of Methods, Theory, and Applications. Ensemble Mach. Learn. 2012, 35–85. [Google Scholar] [CrossRef]
  13. Ayotte, J.D.; Nolan, B.T.; Gronberg, J.A. Predicting Arsenic in Drinking Water Wells of the Central Valley, California. Environ. Sci. Technol. 2016, 50, 7555–7563. [Google Scholar] [CrossRef] [PubMed]
  14. Wu, T.; Zhang, W.; Jiao, X.; Guo, W.; Hamoud, Y.A. Comparison of Five Boosting-Based Models for Estimating Daily Reference Evapotranspiration with Limited Meteorological Variables. PLoS ONE 2020, 15, e0235324. [Google Scholar] [CrossRef]
  15. Fan, J.; Ma, X.; Wu, L.; Zhang, F.; Yu, X.; Zeng, W. Light Gradient Boosting Machine: An Efficient Soft Computing Model for Estimating Daily Reference Evapotranspiration with Local and External Meteorological Data. Agric. Water Manag. 2019, 225, 105758. [Google Scholar] [CrossRef]
  16. Shen, K.; Qin, H.; Zhou, J.; Liu, G. Runoff Probability Prediction Model Based on Natural Gradient Boosting with Tree-Structured Parzen Estimator Optimization. Water 2022, 14, 545. [Google Scholar] [CrossRef]
  17. Dong, L.; Zeng, W.; Wu, L.; Lei, G.; Chen, H.; Srivastava, A.K.; Gaiser, T. Estimating the Pan Evaporation in Northwest China by Coupling CatBoost with Bat Algorithm. Water 2021, 13, 256. [Google Scholar] [CrossRef]
  18. Wolpert, D.H.; Macready, W.G. No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Computat. 1997, 1, 67–82. [Google Scholar] [CrossRef] [Green Version]
  19. Escalante, H.J.; Escalera, S.; Guyon, I.; Baró, X.; Güçlütürk, Y.; Güçlü, U.; van Gerven, M.; van Lier, R. Explainable and Interpretable Models in Computer Vision and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  20. Masís, S. Interpretable Machine Learning with Python: Learn to Build Interpretable High-Performance Models with Hands-on Real-World Examples; Packt Publishing Ltd.: Birmingham, UK, 2021. [Google Scholar]
  21. Štrumbelj, E.; Kononenko, I. Explaining Prediction Models and Individual Predictions with Feature Contributions. Knowl. Inf. Syst. 2014, 41, 647–665. [Google Scholar] [CrossRef]
  22. Lama, L.; Wilhelmsson, O.; Norlander, E.; Gustafsson, L.; Lager, A.; Tynelius, P.; Wärvik, L.; Östenson, C.-G. Machine Learning for Prediction of Diabetes Risk in Middle-Aged Swedish People. Heliyon 2021, 7, e07419. [Google Scholar] [CrossRef]
  23. Mangalathu, S.; Hwang, S.-H.; Jeon, J.-S. Failure Mode and Effects Analysis of RC Members Based on Machine-Learning-Based SHapley Additive exPlanations (SHAP) Approach. Eng. Struct. 2020, 219, 110927. [Google Scholar] [CrossRef]
  24. Ibrahim, B.; Ahenkorah, I.; Ewusi, A. Explainable Risk Assessment of Rockbolts’ Failure in Underground Coal Mines Based on Categorical Gradient Boosting and SHapley Additive exPlanations (SHAP). Sustainability 2022, 14, 11843. [Google Scholar] [CrossRef]
  25. Wen, X.; Xie, Y.; Wu, L.; Jiang, L. Quantifying and Comparing the Effects of Key Risk Factors on Various Types of Roadway Segment Crashes with LightGBM and SHAP. Accid. Anal. Prev. 2021, 159, 106261. [Google Scholar] [CrossRef] [PubMed]
  26. Wang, R.; Kim, J.-H.; Li, M.-H. Predicting Stream Water Quality under Different Urban Development Pattern Scenarios with an Interpretable Machine Learning Approach. Sci. Total Environ. 2021, 761, 144057. [Google Scholar] [CrossRef] [PubMed]
  27. Wang, S.; Peng, H.; Hu, Q.; Jiang, M. Analysis of Runoff Generation Driving Factors Based on Hydrological Model and Interpretable Machine Learning Method. J. Hydrol. Reg. Stud. 2022, 42, 101139. [Google Scholar] [CrossRef]
  28. Podgorski, J.; Berg, M. Global Threat of Arsenic in Groundwater. Science 2020, 368, 845–850. [Google Scholar] [CrossRef]
  29. Podgorski, J.; Wu, R.; Chakravorty, B.; Polya, D.A. Groundwater Arsenic Distribution in India by Machine Learning Geospatial Modeling. Int. J. Environ. Res. Public Health 2020, 17, 7119. [Google Scholar] [CrossRef]
  30. Amponsah, N.; Bakobie, N.; Cobbina, S.; Duwiejuah, A. Assessment of Rainwater Quality in Ayanfuri, Ghana. Am. Chem. Sci. J. 2015, 6, 172–182. [Google Scholar] [CrossRef]
  31. Agbenyezi, T.K.; Foli, G.; Gawu, S.K. Geochemical Characteristics of Gold-Bearing Granitoids At Ayanfuri In The Kumasi Basin, Southwestern Ghana: Implications For The Orogenic Related Gold Systems. Earth Sci. Malays. (ESMY) 2020, 4, 127–134. [Google Scholar] [CrossRef]
  32. Majeed, F.; Ziggah, Y.Y.; Kusi-Manu, C.; Ibrahim, B.; Ahenkorah, I. A Novel Artificial Intelligence Approach for Regolith Geochemical Grade Prediction Using Multivariate Adaptive Regression Splines. Geosyst. Geoenviron. 2022, 1, 100038. [Google Scholar] [CrossRef]
  33. Ghana Statistical Service. 2010 Population and Housing Census: District Analytical Report, Tarkwa Nsuaem Municipal. Available online: https://www.statsghana.gov.gh/ (accessed on 25 October 2014).
  34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  35. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  36. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. Mach. Learn. Python 2011, 12, 2825–2830. [Google Scholar]
  37. Duan, T.; Anand, A.; Ding, D.Y.; Thai, K.K.; Basu, S.; Ng, A.; Schuler, A. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. In Proceedings of the International Conference on Machine Learning, PMLR, 2020; pp. 2690–2700. Available online: http://proceedings.mlr.press/v119/duan20a.html (accessed on 20 October 2022).
  38. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  39. Peters, J.; Baets, B.D.; Verhoest, N.E.C.; Samson, R.; Degroeve, S.; Becker, P.D.; Huybrechts, W. Random Forests as a Tool for Ecohydrological Distribution Modelling. Ecol. Model. 2007, 207, 304–318. [Google Scholar] [CrossRef]
  40. Ibrahim, B.; Majeed, F.; Ewusi, A.; Ahenkorah, I. Residual Geochemical Gold Grade Prediction Using Extreme Gradient Boosting. Environ. Chall. 2022, 6, 100421. [Google Scholar] [CrossRef]
  41. Kadiyala, A.; Kumar, A. Applications of Python to Evaluate the Performance of Decision Tree-Based Boosting Algorithms. Environ. Prog. Sustain. Energy 2018, 37, 618–623. [Google Scholar] [CrossRef]
  42. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363. [Google Scholar]
  43. Peng, T.; Zhi, X.; Ji, Y.; Ji, L.; Tian, Y. Prediction Skill of Extended Range 2-m Maximum Air Temperature Probabilistic Forecasts Using Machine Learning Post-Processing Methods. Atmosphere 2020, 11, 823. [Google Scholar] [CrossRef]
  44. Ferov, M.; Modrỳ, M. Enhancing LambdaMART Using Oblivious Trees. arXiv 2016, arXiv:1609.05610. [Google Scholar]
  45. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef] [Green Version]
  46. Margineantu, D.D.; Dietterich, T.G. Pruning Adaptive Boosting. ICML 1997, 97, 211–218. [Google Scholar]
  47. Alsabti, K.; Ranka, S.; Singh, V. CLOUDS: A Decision Tree Classifier for Large Datasets. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 27–31 August 1998; Volume 2. No. 8. [Google Scholar]
  48. Shi, H. Best-First Decision Tree Learning. Available online: https://researchcommons.waikato.ac.nz/handle/10289/2317 (accessed on 19 October 2022).
  49. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. XGBoost: Extreme Gradient Boosting. R Package Version 0.4-2, 2015; pp. 1–4. [Google Scholar]
  50. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  51. Friedman, J.; Hastie, T.; Tibshirani, R. Additive Logistic Regression: A Statistical View of Boosting (with Discussion and a Rejoinder by the Authors). Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  52. Dev, V.A.; Eden, M.R. Formation Lithology Classification Using Scalable Gradient Boosted Decision Trees. Comput. Chem. Eng. 2019, 128, 392–404. [Google Scholar] [CrossRef]
  53. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  54. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 29 September 2022).
  55. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv 2018, arXiv:1802.03888. [Google Scholar] [CrossRef]
  56. Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1. [Google Scholar]
  57. Tanha, J.; Abdi, Y.; Samadi, N.; Razzaghi, N.; Asadpour, M. Boosting Methods for Multi-Class Imbalanced Data Classification: An Experimental Review. J. Big Data 2020, 7, 70. [Google Scholar] [CrossRef]
  58. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  59. Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef]
  60. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar]
  61. Ewusi, A.; Ahenkorah, I.; Kuma, J.S.Y. Groundwater Vulnerability Assessment of the Tarkwa Mining Area Using SINTACS Approach and GIS. Ghana Min. J. 2017, 17, 18–30. [Google Scholar] [CrossRef]
  62. Ewusi, A.; Apeani, B.Y.; Ahenkorah, I.; Nartey, R.S. Mining and Metal Pollution: Assessment of Water Quality in the Tarkwa Mining Area. Ghana Min. J. 2017, 17, 17–31. [Google Scholar] [CrossRef]
  63. Kusimi, J.M.; Kusimi, B.A. The Hydrochemistry of Water Resources in Selected Mining Communities in Tarkwa. J. Geochem. Explor. 2012, 112, 252–261. [Google Scholar] [CrossRef]
  64. Asante, K.A.; Agusa, T.; Kubota, R.; Subramanian, A.; Ansa-Asare, O.D.; Biney, C.A.; Tanabe, S. Evaluation of Urinary Arsenic as an Indicator of Exposure to Residents of Tarkwa, Ghana. West Afr. J. Appl. Ecol. 2008, 12, 45751. [Google Scholar] [CrossRef]
  65. Landis, J.R.; Koch, G.G. An Application of Hierarchical Kappa-Type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics 1977, 33, 363–374. [Google Scholar] [CrossRef]
  66. Welch, A.H.; Stollenwerk, K.G. Arsenic in Ground Water: Geochemistry and Occurrence; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003; ISBN 978-1-4020-7317-5. [Google Scholar]
  67. Asante, K.A.; Agusa, T.; Subramanian, A.; Ansa-Asare, O.D.; Biney, C.A.; Tanabe, S. Contamination Status of Arsenic and Other Trace Elements in Drinking Water and Residents from Tarkwa, a Historic Mining Township in Ghana. Chemosphere 2007, 66, 1513–1522. [Google Scholar] [CrossRef] [PubMed]
  68. Smedley, P.L. Arsenic in Rural Groundwater in Ghana: Part Special Issue: Hydrogeochemical Studies in Sub-Saharan Africa. J. Afr. Earth Sci. 1996, 22, 459–470. [Google Scholar] [CrossRef]
  69. Bortey-Sam, N.; Nakayama, S.M.; Ikenaka, Y.; Akoto, O.; Baidoo, E.; Mizukawa, H.; Ishizuka, M. Health Risk Assessment of Heavy Metals and Metalloid in Drinking Water from Communities near Gold Mines in Tarkwa, Ghana. Environ. Monit. Assess. 2015, 187, 397. [Google Scholar] [CrossRef]
Figure 1. Location and geological map of the study area.
Figure 2. Model development and workflow of the study.
Figure 3. Arsenic concentrations in surface water and groundwater illustrated using a box-and-whisker plot.
Figure 4. Overall performance evaluation of developed models using AUC, MCC, Kappa, and Acc.
Figure 5. Single-class performance of developed models in classifying: (a) low arsenic concentrations; (b) medium arsenic concentrations; and (c) high arsenic concentrations.
Figure 6. Assessment of input variable importance score using: (a) LGB; (b) XGB; (c) CATB; (d) ADAB; (e) GBM; (f) NGB; and (g) RF.
Figure 7. SHAP feature importance plot for different arsenic concentrations.
Figure 8. SHAP summary plot for different As concentrations: (a) low; (b) medium; and (c) high.
Figure 9. SHAP dependency and interaction effect plots for low As concentrations: (a) TDS; (b) Turb; (c) Cond; and (d) pH.
Figure 10. SHAP dependency and interaction effect plots for medium As concentrations: (a) TDS; (b) Turb; (c) Cond; and (d) pH.
Figure 11. SHAP dependency and interaction effect plots for high As concentrations: (a) TDS; (b) Turb; (c) Cond; and (d) pH.
Figure 12. Local interpretations of a correct prediction considering the various possibilities of occurrence for different As concentrations: (a) low; (b) medium; and (c) high.
Figure 13. Local interpretations of a correct prediction considering the various possibilities of occurrence for different As concentrations: (a) low; (b) medium; and (c) high.
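For readers who wish to reproduce explanation plots of the kind shown in Figures 7–13, the sketch below illustrates one plausible workflow using the shap library [53,55] with an XGB classifier. It is not the authors' exact pipeline: the stand-in data, the class coding (0 = low, 1 = medium, 2 = high), and the plotting choices are assumptions, and only the XGB hyperparameters follow Table 1.

```python
# Illustrative sketch only (not the authors' exact pipeline). Assumes a shap
# version (~0.41, contemporary with the study) in which multiclass
# TreeExplainer.shap_values returns a list with one array per class.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# Hypothetical stand-in data covering the study's four inputs; the survey
# data themselves are not reproduced here.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "pH": rng.uniform(3.9, 8.5, 300),
    "Turb": rng.uniform(0.2, 300.0, 300),
    "TDS": rng.uniform(8_000, 900_000, 300),
    "Cond": rng.uniform(6.8, 2_000.0, 300),
})
y = rng.integers(0, 3, 300)  # assumed coding: 0 = low, 1 = medium, 2 = high As

model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)  # Table 1

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # list of per-class SHAP arrays

shap.summary_plot(shap_values, X, plot_type="bar")  # global importance, cf. Figure 7
shap.summary_plot(shap_values[2], X)                # beeswarm for "high", cf. Figure 8c
shap.dependence_plot("pH", shap_values[2], X)       # dependence plot, cf. Figure 11d
shap.force_plot(explainer.expected_value[2],        # local explanation of one
                shap_values[2][0], X.iloc[0],       # sample, cf. Figures 12 and 13
                matplotlib=True)
```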
Table 1. Optimal hyperparameters for building various ML models.

| Model | Optimal Hyperparameters | Library |
|---|---|---|
| LGB | n_estimators = 150; max_depth = 3 | LightGBM [34] |
| XGB | n_estimators = 200; max_depth = 3 | XGBoost [11] |
| CATB | n_estimators = 140; max_depth = 3 | CatBoost [35] |
| ADAB | n_estimators = 200; learning_rate = 0.1 | Scikit-learn [36] |
| GBM | n_estimators = 100; max_depth = 5; k-neighbors = 5 | Scikit-learn [36] |
| NGB | n_estimators = 80; learning_rate = 0.01 | NGBoost [37] |
| RF | n_estimators = 170; max_depth = 5 | Scikit-learn [36] |
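The settings in Table 1 map directly onto the constructor arguments of the corresponding Python libraries. The following is a minimal sketch of this assumed mapping, with all unlisted hyperparameters left at library defaults; the "k-neighbors" entry reported for GBM has no direct GradientBoostingClassifier counterpart and is omitted, and the three-class target distribution supplied to NGBoost is an assumption based on the study's low/medium/high classes.

```python
# Minimal sketch (assumed mapping of Table 1 onto each library's API);
# all other hyperparameters stay at their library defaults.
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from ngboost import NGBClassifier
from ngboost.distns import k_categorical
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from xgboost import XGBClassifier

models = {
    "LGB": LGBMClassifier(n_estimators=150, max_depth=3),
    "XGB": XGBClassifier(n_estimators=200, max_depth=3),
    "CATB": CatBoostClassifier(n_estimators=140, max_depth=3, verbose=0),
    "ADAB": AdaBoostClassifier(n_estimators=200, learning_rate=0.1),
    # Table 1 also lists "k-neighbors = 5" for GBM; GradientBoostingClassifier
    # has no such argument, so it is omitted from this sketch.
    "GBM": GradientBoostingClassifier(n_estimators=100, max_depth=5),
    # assumed: three arsenic classes (low/medium/high)
    "NGB": NGBClassifier(Dist=k_categorical(3), n_estimators=80,
                         learning_rate=0.01),
    "RF": RandomForestClassifier(n_estimators=170, max_depth=5),
}
```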
Table 2. Summary of advantages and limitations of the boosting algorithms used in the study.

| Model | Advantages | Limitations |
|---|---|---|
| LGB | (i) Fast learning [41]; (ii) lower memory consumption [41] | (i) Can lose predictive performance owing to gradient-based one-side sampling (GOSS) split approximations [20]; (ii) sensitive to noisy data [34] |
| XGB | (i) High execution speed [41]; (ii) less prone to overfitting; (iii) supports parallelisation; (iv) scalable | (i) Performs sub-optimally on sparse and unstructured data |
| CATB | (i) Less prone to overfitting [17,42] | (i) Predictions are somewhat sensitive to the choice of random seed [34] |
| ADAB | (i) Easy implementation [41]; (ii) less prone to overfitting [34]; (iii) simpler feature selection [41] | (i) Sensitive to outliers and noisy data [41] |
| GBM | (i) Insensitive to missing data [14]; (ii) reduced bias [14]; (iii) reduced overfitting [14] | (i) Computationally expensive |
| NGB | (i) Flexible and scalable [37]; (ii) performs probabilistic prediction [37]; (iii) efficient for joint prediction [37]; (iv) modular with respect to base learners | (i) Limited for some skewed probability distributions [43] |
Table 3. Evaluation metrics with formula and description.

| Metric | Formula | Equation | Description |
|---|---|---|---|
| Acc | $\frac{TP + TN}{TP + TN + FP + FN}$ | (2) | The ratio of correct predictions to the total number of instances evaluated. |
| Kappa | $\frac{c \times s - \sum_{k}^{K} p_k \times t_k}{s^2 - \sum_{k}^{K} p_k \times t_k}$ | (3) | The kappa coefficient quantifies the agreement between the observed and the predicted classes [58]. |
| AUC | $\int_{0}^{1} R_{tp}(R_{fp}) \, \mathrm{d}R_{fp}$ | (4) | Indicates how well the predicted probabilities of the positive class are separated from those of the negative class. |
| MCC | $\frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{\left(s^2 - \sum_{k}^{K} p_k^2\right)\left(s^2 - \sum_{k}^{K} t_k^2\right)}}$ | (5) | A balanced metric for evaluating classification performance on data with varying class sizes [59]; a good indicator for unbalanced prediction models [60]. |
| Sensitivity | $\frac{TP}{TP + FN}$ | (6) | The percentage of relevant instances that were correctly identified. |
| Precision | $\frac{TP}{TP + FP}$ | (7) | The proportion of predicted positive cases that are truly positive [56]. |
| F1 | $\frac{2ps}{p + s}$ | (8) | The harmonic mean of precision and sensitivity [56]. |

Note(s): TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The true positive rate $R_{tp}$ is a function of the false positive rate $R_{fp}$ along the receiver operating characteristic curve. For kappa and MCC, $c$ is the total number of correctly predicted elements, $s$ is the total number of elements, $p_k$ is the number of times class $k$ was predicted, and $t_k$ is the number of times class $k$ truly occurs. In Equation (8), $p$ and $s$ denote precision and sensitivity, respectively.
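All of the metrics in Table 3 are available in, or derivable from, scikit-learn [36]. The short sketch below shows one way to compute them for a three-class problem; `y_true`, `y_pred`, and the class-probability matrix `y_prob` are illustrative placeholders, not the study's data, and the one-vs-rest averaging for AUC is an assumption.

```python
# Minimal sketch: computing the Table 3 metrics with scikit-learn for a
# three-class problem. y_true, y_pred, and y_prob are hypothetical.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, precision_recall_fscore_support,
                             roc_auc_score)

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]
y_prob = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7],
          [0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8]]

acc = accuracy_score(y_true, y_pred)                    # Equation (2)
kappa = cohen_kappa_score(y_true, y_pred)               # Equation (3)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # Equation (4), one-vs-rest
mcc = matthews_corrcoef(y_true, y_pred)                 # Equation (5)
# Per-class precision, sensitivity (recall), and F1: Equations (6)-(8)
precision, sensitivity, f1, _ = precision_recall_fscore_support(y_true, y_pred)
```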
Table 4. Statistical summary of all the measured parameters and WHO guideline values.

| Parameter | Unit | Limit | Surface Water (Mean / Std. Dev. / Min / Max) | Groundwater (Mean / Std. Dev. / Min / Max) |
|---|---|---|---|---|
| pH | units | 6.5–8.5 | 6.35 / 0.77 / 3.90 / 8.51 | 5.73 / 0.58 / 4.23 / 7.30 |
| Cond | μS/cm | 2500 | 183.50 / 206.67 / 6.80 / 2040.00 | 245.91 / 140.84 / 83.30 / 1070.00 |
| TDS | μg/L | 1,000,000 | 104,169 / 129,714 / 8440 / 2,390,000 | 150,003 / 100,199 / 48,300 / 934,000 |
| Turbidity | NTU | 5 | 1312.72 / 11,465.35 / 0.60 / 292,600.00 | 18.17 / 27.36 / 0.20 / 142.00 |
| As | μg/L | 10 | 28.51 / 65.88 / 2.00 / 620.00 | 4.23 / 7.59 / 2.00 / 88.29 |
Table 5. Overall performance of developed models on testing data.

| Model | AUC | MCC | Kappa | Acc |
|---|---|---|---|---|
| LGB | 0.93 | 0.72 | 0.71 | 0.83 |
| XGB | 0.93 | 0.75 | 0.75 | 0.86 |
| CATB | 0.93 | 0.68 | 0.67 | 0.81 |
| ADAB | 0.83 | 0.58 | 0.58 | 0.76 |
| GBM | 0.93 | 0.69 | 0.69 | 0.82 |
| NGB | 0.92 | 0.69 | 0.68 | 0.82 |
| RF | 0.93 | 0.68 | 0.68 | 0.81 |
Table 6. Single-class performance of developed models on testing data.

| Model | Low (≤5 μg/L): Precision / Sensitivity / F1 | Medium (>5 to ≤10 μg/L): Precision / Sensitivity / F1 | High (>10 μg/L): Precision / Sensitivity / F1 |
|---|---|---|---|
| XGB | 0.87 / 0.93 / 0.90 | 0.85 / 0.57 / 0.68 | 0.84 / 0.88 / 0.86 |
| LGB | 0.84 / 0.89 / 0.87 | 0.66 / 0.63 / 0.64 | 0.92 / 0.85 / 0.88 |
| ADAB | 0.78 / 0.87 / 0.82 | 0.52 / 0.53 / 0.52 | 0.88 / 0.67 / 0.76 |
| CATB | 0.84 / 0.89 / 0.86 | 0.62 / 0.53 / 0.57 | 0.86 / 0.83 / 0.84 |
| GBM | 0.86 / 0.86 / 0.86 | 0.63 / 0.63 / 0.63 | 0.85 / 0.85 / 0.85 |
| NGB | 0.80 / 0.91 / 0.85 | 0.75 / 0.70 / 0.72 | 0.90 / 0.71 / 0.80 |
| RF | 0.84 / 0.88 / 0.86 | 0.62 / 0.53 / 0.57 | 0.85 / 0.85 / 0.85 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
