Biodegradation is an interfacial phenomenon influenced by a chemical’s tendency to partition to various phases in the environment. Equilibrium partitioning between solid and liquid interfaces [1] strongly influences the biodegradability of chemicals in the presence of surfaces (e.g., soils and sediments). The resulting inaccessibility of solutes to microorganisms that are responsible for degradation can limit biodegradation [2,3]. Due to the need to predict the ultimate fate of chemicals in the environment, many methods have been developed for estimating or predicting a chemical’s biodegradation potential. These methods have each been constructed and are utilized in different ways in an effort to manage the tradeoffs between model complexity, availability of input data, and model reliability. Model inputs include expert opinion assessment, physical property correlations, group contribution, and other qualitative and quantitative indicators of biodegradability.
Modeling techniques used include linear and nonlinear regression, chemometric analysis, neural networks and artificial intelligence. Each of these techniques has individual advantages and disadvantages and tradeoffs are managed such that all models have various limitations in their utility and predictive ability [4,5,6]. For example, individual models tend to have some level of chemical class specificity, either by design, or as an artifact of the breadth of the model training data set. Basic attributes such as model complexity, range of chemical structures and size of data set can be used to subjectively assess the general utility of specific models [5].
This paper presents a discussion of the various methods to estimate biodegradability and, more importantly, an evaluation of an artificial intelligence technique based on inductive machine learning that allows consideration of physical properties and group contribution effects [7,8]. The evaluation has been conducted using an independent, critically reviewed database of biochemical oxygen demand (BOD) values that has seen limited use in model development. The inductive machine learning approach allows for the development of models with simple logical rules that indicate important structural features for biodegradability. It may also help elucidate the factors that determine a chemical’s availability in the environment in the presence of solid surfaces, and therefore its propensity to biodegrade. Factors such as acclimation and chemical concentration may also be incorporated in future inductive machine learning models to account for environmental variability and more reliably predict biodegradation.
In this study, the inductive machine learning approach is demonstrated as sound when evaluated against an independent, highly reviewed data set that is not related to its training set. While the development of reliable and realistic biodegradability QSARs will require data from different types of tests to better simulate actual environmental conditions [9], the inductive machine learning approach shows promise for incorporating important surface interface and other environmental impacts into future modeling efforts.
Data availability for environmental fate assessment of chemicals
There are hundreds of thousands of anthropogenic chemicals manufactured and ultimately released to the environment, either through their intended use or through accidental discharge. The ultimate disposition of these chemicals in the environment is important in assessing their short and long term impact on living systems, and ultimately, on human health. While new standards and requirements for testing and providing data for High Production Volume (HPV) chemicals have promise for improving data availability for new chemicals, the sheer number of chemicals currently in use makes individual testing and assessment impractical. For example, it has been reported that there are more than 100,000 compounds existing in the European Union as indicated by the contents of the EINECS database [10]. Furthermore, a recent study reviewed more than 10,000 pre-manufacture notices submitted to the United States Environmental Protection Agency between 1995 and 2001 and was able to find only 305 chemicals with biodegradability data [11].
Relatively new requirements for screening tests in the European Union, Canada and Japan will undoubtedly improve the availability of data for biodegradability and other environmental fate parameters. Even with these requirements, however, information provided from these tests may not be sufficient to conduct risk assessments [4]. In addition, consistently measuring whether or not a chemical is likely to biodegrade and at what rate can be difficult. For example, analytically determined biodegradation half-lives have covered a wide range even when tested under similar conditions [12]. Even if the consistency of the results can be resolved, test conditions such as acclimation and test chemical concentration can produce results that are of potentially questionable relevance to a chemical’s actual fate in the environment [9].
The development of models for predicting biodegradability has provided a number of useful tools for generally assessing the fate of various chemicals in the environment and even in helping to understand the mechanisms of degradation; however, work remains to be done for these tools to reach a level of general utility. While years of research in physical property modeling and structure activity relationships have resulted in the ability to predict many chemical properties with acceptable reliability from knowledge only of chemical structure, prediction of biodegradability, among other properties, still needs improvement [13]. Russom et al. [14] reported, for example, that for the BIOWIN package [15], the EU recommends only using a slow biodegradation output as confirmation that a substance is not readily biodegradable and recommends against relying on fast biodegradation outputs.
There are two frequently referenced, broadly available sources of biodegradation data, commonly referred to as the BIODEG and MITI-I databases. BIODEG is a file of biodegradation data within the Environmental Fate Database [16], which is available commercially from Syracuse Research Corporation (Syracuse, N.Y., U.S.A., http://www.syrres.com/esc/). The MITI-I database is available directly from the Chemicals Evaluation Research Institute (Tokyo, Japan) and can be downloaded from http://www.cerij.or.jp/ceri_en/otoiawase/otoiawase_menu.html. These databases, in addition to the expert opinion survey conducted by Boethling and Sabljić [17], have been used extensively for model development and validation. These data sets are generally available and are regarded as being of high quality. Notably, however, the BIODEG and MITI-I datasets contain contradictory data for a small subset of the chemicals they have in common [8]. Chemicals within these databases are generally classified as biodegradable or non-biodegradable, or as fast or slowly biodegradable.
The BIODEG and MITI-I datasets are sufficiently distinct that it is common for independent models to be generated based on each. Gamberger et al. [8], for example, created two different rules, each designed to best predict data from one or the other dataset. The commonly used BIOWIN model package recommended by the EU Risk Ranking Method [14] includes separate linear and non-linear models built from the MITI-I and the BIODEG data [11,18]. It has been reported that, due to cross correlations, it is possible to develop a model that fits the training set data well but is not reliable as a predictor for chemicals outside the training set [19]. Based on this fact and the extensive use of the BIODEG and MITI-I data in model development, it would be useful to evaluate models on an independent data-set to see how they perform.
Another set of critically reviewed BOD data has been prepared by the American Institute of Chemical Engineers Design Institute for Physical Properties (DIPPR)® and is available commercially from EPCON International (http://www.epcon.com/Product22.htm). The DIPPR database includes 56 chemical properties for approximately 600 chemicals selected from U.S. Environmental Protection Agency regulatory lists [20]. Each BOD data point in the DIPPR database has been critically evaluated using a 10-point criteria system which utilizes five rating parameters, as shown in Table 1. Each data source received a score between 0 and 2 for each parameter, and the scores were then totaled across all of the parameters. For chemicals that had multiple data points from multiple sources, only the highest rated data point was chosen for this study. For a complete discussion of the criteria and a summary of the BOD/ThOD data see [21]. As a critically evaluated data-set that has seen limited use for biodegradation model development, this data-set is ideal for evaluation of models and modeling approaches developed to-date.
Table 1.
Evaluation Criteria used for BOD Data in the DIPPR database
Rating Parameter | Required for Highest Rating |
---|
Experimental Technique | Follow Standard Methods |
Temperature | Maintained at 20 °C |
Seed Acclimation | Used acclimated seed |
Concentration of Chemical Dilution | 2-6 mg/L O2 depletion |
Internal Consistency | ThOD ≥ BOD |
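The rating scheme in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class, field names, and example values are hypothetical and are not taken from the DIPPR database, and the tie-breaking behavior (first point wins) is an assumption because the text does not specify one.

```python
from dataclasses import dataclass

# The five rating parameters of Table 1; each is scored from 0 to 2.
PARAMETERS = (
    "experimental_technique",
    "temperature",
    "seed_acclimation",
    "dilution_concentration",
    "internal_consistency",
)


@dataclass
class BodDataPoint:
    source: str    # hypothetical label for the data source
    value: float   # hypothetical reported BOD value
    scores: dict   # parameter name -> score between 0 and 2


def total_rating(point):
    """Sum the five parameter scores, giving a rating between 0 and 10."""
    total = 0
    for name in PARAMETERS:
        score = point.scores[name]
        if not 0 <= score <= 2:
            raise ValueError(f"{name} score must lie between 0 and 2")
        total += score
    return total


def best_data_point(points):
    """Return the highest rated data point for one chemical.

    Ties go to the first point listed (an assumption of this sketch).
    """
    return max(points, key=total_rating)


# Hypothetical data points for a single chemical (illustrative values only).
candidates = [
    BodDataPoint("source A", 0.45, dict.fromkeys(PARAMETERS, 2)),
    BodDataPoint(
        "source B",
        0.60,
        {**dict.fromkeys(PARAMETERS, 2), "seed_acclimation": 0},
    ),
]
chosen = best_data_point(candidates)
print(chosen.source, total_rating(chosen))
```

In this sketch the second source loses two points for not using acclimated seed, so the first source is retained.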
Review of modeling efforts
There are a large number of correlations and models for biodegradability currently in the literature. For example, Raymond et al. [5] presented 41 correlations for various individual homologous series of chemicals and Loonen et al. [22] referred to an EU study that evaluated 84 individual models. Most models generate results that generally indicate propensity for biodegradability, such as readily biodegradable, slowly biodegradable, or not readily biodegradable, and typically do not produce quantitative results such as half lives or degradation rates. These semi-qualitative model outputs have been noted as useful for screening tools but lacking in utility for full scale fate modeling, as environmental compartment models, or “box models”, typically require at least compartmental half lives [4]. The fact that even consistent analytical results are difficult to obtain suggests, however, that screening level tools likely represent the finest level of detail that can be reasonably obtained given the complexity of the systems involved and the current level of understanding of biodegradation mechanisms. While the models constructed to-date certainly have utility, continued research is needed to develop models that have predictable accuracy, reasonably account for multiple factors, and provide insight into fundamental modes of action related to biodegradability, including interface phenomena.
A number of detailed reviews of modeling efforts are available [4,5,6,23]. This paper does not intend to repeat that work, but rather presents a brief discussion of general modeling efforts to-date with a more detailed discussion and evaluation of an inductive machine learning method utilized by Gamberger et al. [7]. The evaluation was conducted using a critically reviewed database that has seen limited use in model development and therefore should provide for reasonable independent assessment of the models’ ability to predict the biodegradability of chemicals not included in the model training sets. This discussion also includes considerations of potential future directions related to interface considerations.
For the purposes of this study, the approaches to modeling are generally categorized as regression models, human expert system models, and machine learning models. Rorije [10] noted that the rule based artificial intelligence approach used by Gamberger et al. [7] cannot be compared in a straightforward fashion to other types of modeling approaches and, as such, this method has seen limited review in the literature.
Regression models
Regression models consist of linear, multiple linear, and non-linear correlations of biodegradation rates with parameters including physical or chemical properties and/or molecular connectivity indices. Commonly used properties include molecular weight, solubility, and structural fragment or group contributions. Molecular connectivity indices have also been used that relate to branching, volume, and molecular weight as well as other factors. A number of previously published regression models are presented in Table 2.
Table 2.
Examples of Published Biodegradation Models Representative of Common Modeling Approaches
Model Reference | Training Data Set | Descriptors used | Modeling Technique Used |
---|
Boethling and Sabljić [17] | Results of expert opinion survey | Molecular connectivity indices 2Xv and 4Xpc, molecular weight, and number of chlorine atoms | Linear and multiple linear regression |
Boethling et al. [29] | BIODEG and results of expert opinion survey | Molecular weight and calculated structural fragment/group contributions | Multiple linear and nonlinear regression |
Howard et al. [15] | BIODEG | Structural fragment/group contributions | Linear and nonlinear regression |
Huuskonen [19] | Results of expert opinion survey | Various atom-type electrotopological state indices | Multiple linear regression and artificial neural network |
Loonen et al. [30] | Data measured using MITI-I protocol | Structural fragment/group contributions | Partial least squares discriminant analysis |
Loonen et al. [22] | Data measured using MITI-I protocol | Structural fragment/group contributions | Partial least squares discriminant analysis |
Cambon and Devillers [26] | Results of expert opinion survey | Structural features and molecular weight | Neural network |
Gamberger et al. [7,8] | BIODEG, expert opinion survey, and MITI-I | Structural features and molecular weight | Inductive machine learning |
Klopman [31,32] | BIODEG | Method uses machine learning techniques to determine relevant descriptors mathematically from data on activity and basic chemical structure. | Knowledge-based learning system |
Rorije et al. [33] (model specific to anaerobic degradation) | Anaerobic degradation data from Environmental Fate Database EFDB [34] | Used Klopman method [32] to generate fragments important for anaerobic biodegradation. | Used Klopman [32] method |
These models are attractive in their relative ease of development given reasonable availability of data and model inputs, but are generally limited to specific chemical classes. Additionally, while statistical measures can be undertaken to reduce the risk of chance correlations, their possibility remains. It has been reported, for example, that the significance of some variables may be difficult to rationalize given known factors that influence biodegradation [15]. The inability to rationalize the significance of some variables may suggest that they are the result of chance correlations.
Machine learning based models
Machine learning techniques include neural networks and inductive learning and utilize computers to process available data against chemical structural features and properties to elucidate important features and properties relevant for biodegradation. Neural network techniques were noted some time ago as a promising tool for summarizing biodegradability data [24] and have been described as attractive for developing robust models due to their ability to account for a variety of interacting factors that influence a chemical’s biodegradability [25]. These models follow a similar logic to the expert system/survey models in that they seek to identify subtleties that are not initially obvious, but utilize computer and mathematical analysis to more rigorously identify the important structural features and properties. These techniques are attractive for modeling complex processes like biodegradation due to their dynamic nature and ability to modify their behavior in response to their environment, store experimental knowledge, and make that knowledge available for modeling [26]. Another advantage of using machine learning techniques is their ability to point out the importance of specific descriptors and relations among descriptors that are likely to stimulate further investigations into the specific mechanisms of biodegradation [8]. Similarly, understanding structural features discovered through machine learning analysis may be additionally helpful in designing chemicals with a higher propensity for degradability by including substituents that promote degradability and removing substituents that inhibit degradability [27]. Examples of published machine learning modeling efforts are presented in Table 2.
Application of model batteries
In addition to the discrete use of individual models, it has been suggested in the past that a number of models can be used successively to evaluate confidence in the results. It is logical that if multiple models are run for the same chemical and produce conflicting results, then those results are potentially questionable. At a minimum, the user is faced with a decision about which one, among the conflicting models, is more accurate given comparably appropriate models (e.g. no class specificity or other issues with either model training set).
The general concept of utilizing multiple models and concluding that reliable results cannot be obtained given conflicting results has been suggested in previous studies [6,15]. More recently, however, this concept has been rigorously evaluated. A recent study presented the use of model batteries selected through Bayesian analysis to improve the reliability of predictions or better qualify questionable predictions [28]. The model battery approach consists of selecting a series of models and qualifying confidence in the model results based on whether or not the models agree. While not a fundamentally new approach to modeling, the battery test approach is a new method of formally assessing the reliability of the results obtained from various models or sequential combinations of models.
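The agreement check at the heart of the battery approach can be illustrated with a short Python sketch. It covers only the step of qualifying a result by whether the models agree; it does not implement the Bayesian selection of models described in [28], and the function name, labels, and model names are hypothetical.

```python
def battery_verdict(predictions):
    """Qualify a set of model predictions by whether the models agree.

    predictions maps a model name to True (fast biodegradation)
    or False (slow biodegradation).
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    outcomes = set(predictions.values())
    if outcomes == {True}:
        return "fast (models agree)"
    if outcomes == {False}:
        return "slow (models agree)"
    # Conflicting outputs: the result is flagged rather than trusted.
    return "questionable (models conflict)"


# Hypothetical outputs from three models for two chemicals.
agreeing = {"model_1": True, "model_2": True, "model_3": True}
conflicting = {"model_1": True, "model_2": False, "model_3": True}
print(battery_verdict(agreeing))
print(battery_verdict(conflicting))
```

A majority vote would be an alternative design, but flagging any disagreement follows the text more closely, since conflicting results are treated there as potentially questionable.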
Gamberger et al. inductive machine learning artificial intelligence model
As described above, Gamberger et al. have developed inductive machine learning models for predicting biodegradation potential of organic chemicals. Two of these models have been selected for further analysis and are termed for this study “Rule A” [7] and “Rule B” [8]. Rule A was developed from the expert opinion data-set reported by Boethling and Sabljić [17] and Rule B was developed from MITI-I test data. These Rules use the structural descriptors noted in Table 3, but have different outcomes regarding the significance of those descriptors in biodegradability based on the nature of the data on which they were built. The MITI-I data has been reported to have a tendency to under-predict biodegradability and therefore classifies some compounds as non-degradable that are classified as degradable under other test conditions, such as those under which the chemicals in the BIODEG database were tested [6]. This under-prediction has been reported to be caused in part by the relatively high chemical concentration used in the MITI-I test, which is higher than what is likely to be experienced in the environment and may produce toxic effects on the test inoculum [4]. Based on these data differences, it is reasonable that two distinct models be developed: one as a general utility biodegradation model based on the Boethling and Sabljić [17] survey data (i.e., Rule A) and one developed to more closely predict the results of the MITI-I test (i.e., Rule B).
The inductive machine learning method involves describing each chemical with a number of structural descriptors as input variables. The structural descriptors used by Gamberger et al. are presented in Table 3. Binary output variables are assigned to each chemical with a 1 for fast biodegradability and 0 for slow biodegradability based on the training set data. Each chemical represents a learning example and analysis is conducted to find individual rules that satisfy all of the learning examples. The simplest rule is assumed to have the greatest chance of being most correct against test data. Once the simplest rules are identified, they are further analyzed to determine if the exclusion of any single chemical can reduce the number of basic logical elements. If this occurs, that chemical is removed as a potential outlier or incorrect data point. Chemicals are removed in this manner until a simple non-reducible solution is obtained, which is the rule that models the data best. Rules A and B are presented in common language format in Table 4 and in mathematical format in Table 5.
Table 3.
Structural descriptors used in construction of Artificial Intelligence biodegradation models (from [7,8])
Descriptor Designation | Rule A Descriptors | Rule B Descriptors |
---|
a | Presence of heterocyclic or anhydride groups | Presence of heterocyclic nitrogen atom |
b | Presence of ester, amide, or anhydride groups | Presence of ester, amide, or anhydride groups |
c | Number of chlorine atoms | Number of chlorine atoms |
d | Bicyclic alkanes | Bicyclic alkanes |
e | Chemical composed only of carbon, hydrogen, nitrogen, and oxygen atoms | Chemical composed of only carbon, hydrogen, nitrogen, and oxygen atoms |
f | Presence of nitro group | Presence of nitro group |
g | Number of rings | Number of rings |
h | Presence of epoxy group | Presence of epoxy group |
i | Primary alcohols and phenols | Primary alcohols and phenols |
j | Molecular weight | Molecular weight |
k | Number of all C-O bonds | Number of all C-O bonds |
l | | Number of tertiary amino groups |
m | | Number of quaternary carbon atoms |
n | | Number of C=C bonds |
o | | Number of aromatic amino groups |
p | | Number of acid groups |
r | | Number of ester groups |
Table 4.
Rules developed for inductive machine learning model by Gamberger et al.
Rule A [7] | Rule B [8] |
---|
A chemical will biodegrade fast if any of the following conditions is met: (a) chemicals with one or more C-O bonds and molecular weight below 180; (b) chemicals built of C, H, N, and O atoms but without a nitro group and having a number of rings equal to or smaller than the number of C-O bonds; (c) chemicals built of C, H, N, and O atoms but without a nitro group and with molecular weight in the range from 95 to 135 | A chemical will biodegrade fast if any of the following conditions is satisfied: (a) acyclic chemicals with one C-O bond, but without quaternary carbons; (b) esters, amides, or anhydrides built of C, H, N, and O atoms, but without or with 2 C=C bonds; (c) acyclic esters, amides, or anhydrides without quaternary carbons; (d) esters, amides, or anhydrides built of C, H, N, and O atoms, having one ring or less but without quaternary carbons; (e) acyclic chemicals built of C, H, N, and O atoms, but without either quaternary carbons or tertiary amino groups and without or with 2 C=C bonds; (f) chemicals built of C, H, N, and O atoms, acyclic or with 1 ring, with at least one C-O bond, but without either quaternary carbons or tertiary amino groups and without or with 2 C=C bonds |
Table 5.
Mathematical representation of two Rules developed by Gamberger et al. (See Table 3 for structural descriptors with letter designations)
Rule A [7] | Rule B [8] |
---|
Chemical will biodegrade fast if any of the following terms is satisfied: | Chemical will biodegrade fast if any of the following terms is satisfied: [(m = 0) (k = 1) (g = 0)] or [(b = 1) (n ≠ 1) (e = 1)] or [(b = 1) (m = 0) (g = 0)] or [(b = 1) (m = 0) (e = 1) (g ≤ 1)] or [(m = 0) (e = 1) (l = 0) (n ≠ 1) (g = 0)] or [(m = 0) (e = 1) (l = 0) (n ≠ 1) (k ≠ 0) (g ≤ 1)] |
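Because both Rules are simple disjunctions of conjunctions, they translate directly into code. The Python sketch below is an illustration only: the Rule B terms follow the mathematical terms of Table 5, while the Rule A terms are transcribed from the common language statement in Table 4, with the assumption that the 95 to 135 molecular weight range is inclusive. The class and field names are hypothetical, and the descriptor values in the examples were assigned by hand.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Subset of the Table 3 descriptors needed by Rules A and B."""
    has_ester_amide_anhydride: bool  # b
    only_chno: bool                  # e
    has_nitro: bool                  # f
    rings: int                       # g
    mol_weight: float                # j
    co_bonds: int                    # k
    tertiary_amino: int = 0          # l
    quaternary_carbons: int = 0      # m
    cc_double_bonds: int = 0         # n


def rule_a(d):
    """Rule A as worded in Table 4 (range bounds assumed inclusive)."""
    return (
        (d.co_bonds >= 1 and d.mol_weight < 180)
        or (d.only_chno and not d.has_nitro and d.rings <= d.co_bonds)
        or (d.only_chno and not d.has_nitro and 95 <= d.mol_weight <= 135)
    )


def rule_b(d):
    """Rule B: True (fast) if any of the six Table 5 terms is satisfied."""
    b, e = d.has_ester_amide_anhydride, d.only_chno
    no_quat = d.quaternary_carbons == 0   # m = 0
    no_tert = d.tertiary_amino == 0       # l = 0
    cc_ok = d.cc_double_bonds != 1        # n != 1
    terms = (
        no_quat and d.co_bonds == 1 and d.rings == 0,
        b and cc_ok and e,
        b and no_quat and d.rings == 0,
        b and no_quat and e and d.rings <= 1,
        no_quat and e and no_tert and cc_ok and d.rings == 0,
        no_quat and e and no_tert and cc_ok and d.co_bonds != 0 and d.rings <= 1,
    )
    return any(terms)


# Ethanol: acyclic, one C-O bond, built only of C, H, and O.
ethanol = Descriptors(False, True, False, rings=0, mol_weight=46.07, co_bonds=1)
# Hexachlorobenzene: contains chlorine, one ring, no C-O bonds.
# (The C=C count is left at 0; it does not affect the outcome here.)
hcb = Descriptors(False, False, False, rings=1, mol_weight=284.78, co_bonds=0)

for name, d in (("ethanol", ethanol), ("hexachlorobenzene", hcb)):
    print(name, "Rule A:", rule_a(d), "Rule B:", rule_b(d))
```

In this sketch ethanol satisfies both Rules through their first terms, whereas hexachlorobenzene satisfies no term of either Rule and is classified as slowly biodegradable.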
Rules A and B have been subject to review against the expert survey results of Boethling and Sabljić [17] and the BIODEG and MITI-I data. Summaries of these evaluations have been reported in the literature [8] and are presented below in Table 6 and Table 7.
Table 6.
Results of Rule A when applied to Boethling and Sabljić [17] expert survey data and data from the BIODEG database [35].
Test Set | Biodegradability indication | Number of correct predictions | Percent of correct predictions |
---|
23 Chemicals from Boethling and Sabljić [17] expert survey | Readily Biodegradable | 8/8 | 100% |
23 Chemicals from Boethling and Sabljić [17] expert survey | Slowly Biodegradable | 14/15 | 93% |
17 Chemicals selected from BIODEG database | Readily Biodegradable | 9/9 | 100% |
17 Chemicals selected from BIODEG database | Slowly Biodegradable | 8/8 | 100% |
Table 7.
Results of Rule B when applied to MITI-I data test set [8]
Test Set | Biodegradability indication | Number of correct predictions | Percent of correct predictions |
---|
762 MITI-I data points | Fast Biodegradation | 279/364 | 77% |
762 MITI-I data points | Slow Biodegradation | 355/398 | 89% |
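The percentages in Table 7 can be reproduced from the reported counts, and the same counts give an overall figure that the table does not state. The short Python sketch below is only a worked calculation; the variable names are illustrative.

```python
# Counts reported in Table 7: (correct predictions, total data points).
results = {
    "fast": (279, 364),
    "slow": (355, 398),
}


def percent_correct(correct, total):
    """Percentage of correct predictions."""
    return 100.0 * correct / total


per_class = {label: percent_correct(*counts) for label, counts in results.items()}

# Overall figure pooled across both classes (not reported in Table 7).
total_correct = sum(correct for correct, _ in results.values())
total_points = sum(total for _, total in results.values())
overall = percent_correct(total_correct, total_points)

for label, pct in per_class.items():
    print(f"{label}: {pct:.1f}%")
print(f"overall: {overall:.1f}% of {total_points} data points")
```

The pooled figure weights each class by its number of data points, so it falls between the two per-class percentages.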
With these positive results as an indication of the power of the method, a further analysis was conducted with the critically reviewed DIPPR data set as an external check of the soundness of the method for predicting biodegradation. The results of this check are presented in the following section.