Article

An Interpretable Predictive Model for Health Aspects of Solvents via Rough Set Theory

by Wey Ying Hoo 1, Jecksin Ooi 1, Nishanth Gopalakrishnan Chemmangattuvalappil 2, Jia Wen Chong 2, Chun Hsion Lim 1 and Mario Richard Eden 3,*
1 School of Engineering and Physical Sciences, Heriot-Watt University Malaysia, No. 1, Jalan Venna P5/2, Precinct 5, Putrajaya 62200, Malaysia
2 Department of Chemical & Environmental Engineering, University of Nottingham Malaysia, Jalan Broga, Semenyih 43500, Malaysia
3 Department of Chemical Engineering, Auburn University, Auburn, AL 36849, USA
* Author to whom correspondence should be addressed.
Processes 2023, 11(8), 2293; https://doi.org/10.3390/pr11082293
Submission received: 13 June 2023 / Revised: 28 July 2023 / Accepted: 28 July 2023 / Published: 31 July 2023
(This article belongs to the Section Chemical Processes and Systems)

Abstract

This paper presents a machine learning (ML) approach to predict the potential health issues of solvents by uncovering the hidden relationship between substances and toxicity. Solvent selection is a crucial step in industrial processes. However, prolonged exposure to solvents has been found to pose significant risks to human health. To mitigate these hazards, it is crucial to develop a predictive model for health performance by identifying the factors that contribute to solvent toxicity. Among the various ML algorithms, Rough Set Machine Learning (RSML) was chosen for this work because the models it generates are interpretable. The models were developed through data collection on the toxicity of various organic solvents, the construction of predictive models with decision rules, and model verification. The results reveal correlations between solvent toxicity and the Balaban index, valence connectivity index, Wiener index, and boiling point. The predictive model generated using RSML provides insightful observations about the correlation between human toxicity and molecular attributes.

1. Introduction

Solvents have been widely used in the chemical industry for dissolving, suspending, diluting, and separating substances. The paints and coatings sector holds the largest share in the global solvent market, followed by the printing inks segment, with the industrial cleaning industry in third position [1]. However, prolonged exposure to solvents, particularly those containing volatile organic compounds (VOCs), can adversely affect human health, especially the respiratory, nervous, and reproductive systems. According to the National Institute of Occupational Safety and Health (NIOSH), approximately 9.8 million workers [2] are regularly exposed to high dosages of solvents each year through various exposure pathways. To address these health issues, organizations such as the Occupational Safety and Health Administration (OSHA) and the U.S. Environmental Protection Agency (EPA) have established guidelines, such as permissible exposure limits (PEL), to detect and classify the associated health consequences.
Research on organic solvents has largely focused on the potential issues associated with their use, and recent work has highlighted the health issues related to solvent exposure [3]. The findings revealed that N-methyl-2-pyrrolidone (NMP), a solvent conventionally used to manufacture membranes, has neurotoxic, hepatotoxic, genotoxic, carcinogenic, and mutagenic effects on humans [3]. It is crucial to replace such toxic solvents with less toxic alternatives. Other researchers have reviewed the toxicokinetics, general toxicity, and reproductive toxicity of various frequently used solvents to which humans are exposed primarily through inhalation [4]. These reports suggested that more detailed information is needed to understand the potential toxicity of solvents. However, current research lacks a generalized predictive model that determines the health performance of solvents in a systematic way. Although the concentration of a harmful substance can be determined, its toxic effect becomes known only after the solvent has been applied in a process. Therefore, it is crucial to identify which molecular attributes affect toxicity, so that the results can then be applied in solvent design. This research aims to bridge the gap between health issues and predictive models by determining the underlying relationship between solvents and health hazards based on their molecular structure.
This paper is divided into five main sections. The first section reviews the literature on solvent toxicity, topological indices, and rough set machine learning (RSML), which leads to the identification of the research gap. Section 2 details the proposed methodology to close the identified research gap. Section 3 and Section 4 present the results and discussion and highlight the main insights gained from the results. The summary and key contribution of this work, as well as potential future work, are given in the last section.

1.1. Toxicity

Aromatic organic solvents represent 35% [5] of industrial utilization and have high solvency [6], forming solutions by dissolving a significant amount of solute. Owing to their high vapor pressure and low boiling point, solvents with a boiling point in the range of 50 °C to 260 °C [7] are known as volatile organic compounds (VOCs) and are easily emitted into the atmosphere at room temperature. The vaporized solvents can be readily absorbed by the human body through inhalation, ingestion, and skin contact, thereby affecting human health performance.
The impairment of health performance caused by exposure to VOCs varies depending on the exposure route, exposure level, and type of solvent. Based on exposure level, which considers both concentration and duration, the effects on human health, mainly through inhalation, have been classified as short-term or long-term. The short-term health effects are dizziness, headache, nausea, and irritation of the eyes, nose, and throat [8]. The dispersed chemicals attach to the mucous layer of the membrane, leading to irritation and inflammation. In contrast, prolonged exposure to solvents has long-term health effects in terms of mutagenicity, toxicity, and carcinogenicity [8]. Thus, it is necessary to investigate the long-term health effects of solvents in terms of toxicity and carcinogenicity.
In the context of occupational health, solvent exposure through inhalation is significantly more likely, in both quantity and frequency, than exposure through the oral and dermal routes [9]. Depending on the dosage, the degree of toxicity can be ranked according to the Hodge and Sterner scale in terms of lethal concentration 50% (LC50). Table 1 summarizes the toxicity rating of the Hodge and Sterner scale. A substance that causes death at a low dosage of 10 ppm or less is classified as highly toxic under Class 1. Based on the toxicity rating, organic solvents with a rating of 1 to 4 are considered toxic, and extreme exposure to them must be avoided.
Since the toxicity of organic solvents is related to their chemical structures, structural descriptors, such as Topological Indices (TI), have the potential to form a predictive model for toxicity. TIs are valuable tools in providing unique information about the structure of a molecule and are commonly used in predicting physicochemical properties of molecules.

1.2. Topological Indices

Topological indices (TIs) are numerical values that characterize the structural features or properties of chemical compounds based on their molecular graphs [11]. The molecular topologies are influenced by the structure of the molecules in dimensions, configuration, bonding, symmetry, and degree of complexity. The most commonly used topological indices are the connectivity index, Wiener index, Randic index, Balaban index, and Zagreb index [12]. The TIs can be computed for any molecule based on their molecular structure. These indices can then be used to develop correlations in the studies of quantitative structure-activity (QSAR)/property (QSPR)/toxicity relationship (QSTR) [11].
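As an illustration of how a TI is obtained from a molecular graph, the sketch below computes the Wiener index, defined as the sum of the shortest-path (bond-count) distances over all pairs of atoms in the hydrogen-suppressed graph. The adjacency-list encoding and the function name `wiener_index` are assumptions made for this example and are not taken from the cited references.

```python
from collections import deque
from itertools import combinations


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs."""
    def distances_from(start):
        # Breadth-first search gives bond-count distances in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        return dist

    all_dist = {atom: distances_from(atom) for atom in adjacency}
    return sum(all_dist[a][b] for a, b in combinations(adjacency, 2))


# Hydrogen-suppressed carbon skeletons; atom labels are arbitrary.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same formula but different indices, which is the sense in which a TI encodes branching and shape.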
TIs have been utilized for predicting the toxicity and properties of chemicals [11]. For instance, the valence connectivity index, the Balaban index, and the electropy index were used to predict the underlying relationship between the molecular structure of ethers and toxicity in mice [13]. In toxicology, molecular topological indices have been applied to study the toxicity of alcohols, pesticides, and ionic liquids. To assess how accurately the TIs describe toxicity, log LC50 was plotted against the TIs, with high accuracy and correlation represented by a quadratic model. In addition, the toxicity of organophosphorus pesticides has been determined with the Randic–Kier–Hall connectivity indices and Topological Charge Indices (TCI) [14]. The indices are linked to a corresponding regression equation to calculate the LD50 value, and the calculated LD50 is then cross-validated against the experimental LD50 to identify the most correlated TIs. These studies have shown that topological indices correlate very well with LD50. Furthermore, topological studies significantly reduce costs and save time compared to conducting experiments, making them an efficient approach [11]. However, there is a research gap in connecting topological indices to a predictive model for assessing human health issues. With the incorporation of RSML, a predictive model can be developed based on the topological indices of organic solvents.

1.3. Rough Set Machine Learning (RSML)

Machine learning (ML) is a specialized area of artificial intelligence (AI) that focuses on the automatic learning, analysis, and discovery of data [15]. Machine learning is capable of analyzing massive and fuzzy databases to discover patterns, reveal the underlying structure and relationships, and subsequently develop a robust prediction model. Among the learning methods, supervised learning uncovers the structure of the data by learning the relationship between past input-output training pairs under supervision. Supervised algorithms include decision trees, support vector machines (SVM), K-Nearest Neighbors (KNN), RSML, Artificial Neural Networks (ANN), and Bayesian Networks.
In a recent study, individual sets of mathematical tools, including fuzzy sets, rough sets, and soft sets, have been combined into a framework named Z-fuzzy soft β-covering-based rough matrices to solve a multiple attributes group decision-making (MAGDM) problem [16]. This combination has shown satisfactory results in recruiting the best applicant for the assistant professor job and can be further applied for decision-making problems. It further shows the capability and flexibility of rough sets to be combined with other tools in developing decision rules that would be useful for real-world problems.
RSML can process data without prior or complete information, whereas SVM approaches require maximum information to solve binary classification problems [17]. Moreover, RSML is preferred over ANN because it does not require human intervention and achieves similar accuracy within a shorter period [18]. Although ANNs have been used for decision-making tasks, such as Hepatitis B prediction and control in medical applications, they are challenging to apply to uncertain data and uncommon diseases [19]. The Frequent Pattern (FP) Growth algorithm is good at identifying patterns and generating association rules between items based on support and confidence, but it cannot provide explanations that humans can understand and is less suitable when the data are uncertain and incomplete. Random forest (RF) is another commonly used ML method owing to its high predictive precision and flexibility. Nevertheless, compared with RSML, it lacks the interpretability needed to understand how a decision is made, mainly because RF combines multiple decision trees, which makes the individual contributions of features harder to interpret. It is challenging to unveil meaningful physical insights through such “black-box” ML approaches, as they often disclose relevance rather than cause and effect [20]. The importance of interpretable ML for driving knowledge generation has been emphasized in various research fields. For example, interpretable ML models are important in electrocatalysis for offering new insights into novel catalytic materials and their mechanisms [20]. Recent advances in interpretable ML for estimating the reactivity properties of solid surfaces, and the remaining challenges, were also critically discussed in a recent contribution [21].
Owing to these benefits, RSML is capable of identifying and predicting the health performance of solvents, handling incomplete and uncertain data during feature selection with excellent data efficiency and high versatility. Furthermore, RSML generates if-then decision rules that express the relationship between the conditional attributes of the objects and the decision attribute in a straightforward manner, and these rules can then be used to establish predictive models. In short, RSML supports data analysis and decision-making on datasets without requiring probability statistics and assumptions [22].
Rough set theory (RST) is a mathematical approach to analyzing imprecise and uncertain data or knowledge [23]. RST performs data classification, feature selection, and knowledge discovery tasks on large datasets. RSML is based on approximation: indiscernible data are grouped into elementary sets, and a concept is bounded by its lower and upper approximations. This is illustrated in Figure 1 [22]. The approximation method can be applied to toxic solvents, incorporating various pieces of information to perform classification based on the boundaries. By using the lower and upper approximations to handle the uncertainty caused by missing data, RSML is capable of handling uncertain and incomplete data. Likewise, RSML selects features by determining reducts, which are the minimal subsets of attributes affecting the decision attribute. This is important in generating meaningful decision rules that help in understanding and interpreting the data.
Rough set theory presents the information in a decision table consisting of objects, conditional attributes, and decision attributes. An example of an information table is presented in Table 2.
In the decision table, the object is the desired element for analysis, along with the conditional attributes [24] for the properties and characteristics of the corresponding object. The decision (class) attribute is determined based on the conditional input attributes with generated decision rules [24]. By referring to Table 2, type of chemical is the object, conditional attributes comprise both boiling point and Wiener index, whereas toxicity class is the decision attribute. Both attributes can be quantitative (integer or decimal values) or nominal characteristics for numerical, categorical, and binary attributes. A structured information table in RSML benefits data classification, analysis, and discovery of the object-attribute relationship.
RSML has been extensively applied in data mining and pattern recognition. For example, RSML has been applied to CO2 capture and storage (CCS) for carbon management in selecting potential CO2 storage sites [25]. RSML approaches can predict storage sites with high certainty, improving the CCS and negative emissions technologies (NETs) process. Besides storage sites, the RSML approach can also be applied to predict storage depth and geographical location. Moreover, pyrolysis bio-oil properties can be predicted by RSML algorithms from data on pyrolysis temperature and feedstock characteristics [26]. The bio-oil properties of interest are the higher heating value and pH, and the identified characteristics of the feedstock samples relate to the carbon, nitrogen, and oxygen content. In addition, RSML has been applied to construct a predictive model of the odor of fragrances [27]. RSML relates topological indices and dilution, as conditional attributes, to the odor characteristics. Based on the induced rules, the topological indices Kappa 3 and Kappa 2 dominate the odor characteristics. The fragrance prediction model obtained through RSML greatly impacts the development of chemical products. Because of its interpretability, RSML was also used to construct predictive models for estimating physical and transport properties of polymers, such as glass transition temperature and cohesive energy [28]. The promising rules generated from RSML were then incorporated as property constraints in a computer-aided molecular design (CAMD) model to determine potential polymeric membrane molecular structures for air separation [28].
The RSML applications above show that rough set theory, with feature selection, clustering, and rule induction, gives promising outcomes in dealing with vague, ambiguous, and fuzzy datasets. As a result, the RSML algorithm is appropriate for developing an interpretable predictive model of the health performance of organic solvents. RSML efficiently aids in identifying the hidden structure and relationships with minimal resources and time consumption. Hence, this research aims to develop a predictive model of the health performance of organic solvents. The potential conditional attributes for the toxicity of organic solvents to human health focus on the topological indices and physical properties of solvents.

2. Methodology

This section explains the steps to identify the underlying structure and develop predictive models for health performance. The main steps include data collection (Step 1), developing a rough set model (Step 2), and verifying the prediction model (Step 3). The proposed methodology is illustrated in Figure 2.
Step 1: Data collection on organic solvents
The first step involves collecting toxicity data for a variety of chemicals, which include aromatics, alcohols, ketones, ethers, esters, alkanes, and alkenes. The objects in this research are organic solvents, for which 100 data points were collected from various sources and chemical compositions. Training data account for 70% of the dataset and are used for modeling, whereas the remaining 30% [26] is reserved for validating the generated decision rules to ensure the accuracy of the predictive model. The database of organic solvents used for training and validation can be found in Appendix A, Table A1 and Table A2, respectively.
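A 70/30 partition of this kind can be sketched as follows. How the solvents were actually assigned to the training and validation sets is not stated here, so the seeded random shuffle, the function name `split_dataset`, and the placeholder solvent names are all assumptions for illustration.

```python
import random


def split_dataset(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder names standing in for the 100 collected data points.
solvents = [f"solvent_{i:03d}" for i in range(100)]
train, validation = split_dataset(solvents)
print(len(train), len(validation))  # 70 30
```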
Step 2a: Identify the health indices as the decision attribute
Organic solvents cause various human health issues, such as toxicity and carcinogenicity. Therefore, this step identifies the targeted health issue of toxicity as a decision attribute in quantifying the toxic effects. The toxicity within this study encompasses all the toxic effects in the human body and is not restricted to particular human organs or systems.
Step 2b: Classify the decision attribute
Once the decision attribute has been identified, the attribute has to be classified and well-defined to distribute the object into the corresponding class. Based on the type of expression, the classification can be expressed in either category (categories 1,2,3), labels (high or low), or binary (yes/no) form. The toxicity as a decision attribute has been classified according to the Hodge and Sterner scale of standard LC50 through inhalation routes expressed in the form of categories (categories 1, 2, and 3).
In the predictive model, the toxic substances are classified into a simplified version based on Hodge and Sterner scale for rules induction, as shown in Table 3. For instance, the simplified toxicity rating 1 merges classes 1 and 2 from Hodge and Sterner scale. The toxicity classification for the predictive model of the health performance of solvents is then defined and classified into three main classes.
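The simplified rating can be written as a small lookup on the inhalation LC50. The cut-offs below are assembled from statements in this article (rating 1 merges Hodge and Sterner classes 1 and 2, i.e., up to 100 ppm; class 2 spans 100 to 100,000 ppm; class 3 is relatively harmless), since Table 3 itself is not reproduced here. Whether each boundary value belongs to the lower or the upper class is an assumption, as is the function name.

```python
def simplified_toxicity_class(lc50_ppm):
    """Map an inhalation LC50 (ppm) to the simplified three-class rating."""
    if lc50_ppm <= 0:
        raise ValueError("LC50 must be positive")
    if lc50_ppm <= 100:
        return 1  # Hodge and Sterner classes 1 and 2 merged
    if lc50_ppm <= 100_000:
        return 2  # boundary handling at 100 and 100,000 ppm is assumed
    return 3      # relatively harmless


print(simplified_toxicity_class(50))       # 1
print(simplified_toxicity_class(5_000))    # 2
print(simplified_toxicity_class(250_000))  # 3
```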
Step 2c: Identify the possible conditional attributes
A conditional attribute describes how the objects contribute to the decision attribute, expressed in a range or an exact value. The possible conditional attributes can be linked to the solvent’s topological indices and physical properties.
In identifying conditional attributes, several topological indices related to human toxicity are chosen, including the Balaban Index, Wiener Index, and molecular connectivity index, whereas the chosen physical property is the boiling point of the solvent. The chosen topological indices have been supported by literature on the correlations of the Balaban index and valence connectivity index to toxicity [13].
Step 2d: Establish a decision table
The identified conditional attributes are then translated into quantifiable properties through calculations. For example, numerical values of relevant topological indices such as the Balaban index [29], the valence connectivity index [30], and the Wiener index [30] are calculated for each molecule. A decision table serves as the foundation of the RSML model. Table 4 represents the simplified decision table for the organic solvents in the toxicity study.
In Table 4, the objects are the organic solvents, and the condition attributes comprise the topological indices (Balaban index, valence connectivity index, Wiener index) and the physical property (boiling point). The decision attribute is the toxicity classification, with three main classes based on the toxicity rating in Table 1. In rough set theory (RST), a decision table, also known as an information table, is made up of the universe $U$ (a nonempty set of objects) and a set of attributes $A$, expressed as $S = (U, A)$. Each attribute $a \in A$ takes values from a set $V_a$.
Step 2e: Perform data processing
The collected data must be pre-processed to eliminate redundant and unessential data before simulating the prediction model. In data processing, RST uses the indiscernibility relation, an equivalence relation defined on a set of condition attributes $B \subseteq A$ [31]. In training for rule induction, the objects are grouped by indiscernibility on the conditional attributes, and each class is then approximated. The lower and upper approximations are expressed in Equations (1) and (2) [31], respectively. The lower approximation is the set of objects that certainly belong to the class. In contrast, the upper approximation is the set of objects that possibly belong to the class.
$$B_*(X) = \bigcup_{x \in U} \{ B(x) : B(x) \subseteq X \} \tag{1}$$
$$B^*(X) = \bigcup_{x \in U} \{ B(x) : B(x) \cap X \neq \emptyset \} \tag{2}$$
The difference between the upper and lower approximation is named boundary region, expressed in Equation (3).
$$BN_B(X) = B^*(X) - B_*(X) \tag{3}$$
where $B(x)$ is the indiscernibility class of object $x$ with respect to $B$, $X$ is the concept (decision class) being approximated, $U$ is the dataset, and $BN_B(X)$ is the boundary region of the concept.
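The approximations in Equations (1)–(3) can be made concrete with a short sketch. The decision table below is hypothetical and already discretized (it is not data from this study), and the function names are chosen for this example only.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group objects that share the same values on the chosen attributes."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attributes)].add(obj)
    return list(groups.values())


def approximations(table, attributes, concept):
    """Return the (lower, upper, boundary) approximations of a set of objects."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= concept:
            lower |= block  # Equation (1): block lies entirely inside X
        if block & concept:
            upper |= block  # Equation (2): block overlaps X
    return lower, upper, upper - lower  # Equation (3)


# Hypothetical decision table: two condition attributes and a toxicity class.
table = {
    "s1": {"bp": "high", "wiener": "low", "tox": 1},
    "s2": {"bp": "high", "wiener": "low", "tox": 1},
    "s3": {"bp": "low", "wiener": "low", "tox": 1},
    "s4": {"bp": "low", "wiener": "low", "tox": 2},
    "s5": {"bp": "low", "wiener": "high", "tox": 2},
}
class_1 = {obj for obj, row in table.items() if row["tox"] == 1}
lower, upper, boundary = approximations(table, ["bp", "wiener"], class_1)
print(sorted(lower))     # ['s1', 's2']
print(sorted(upper))     # ['s1', 's2', 's3', 's4']
print(sorted(boundary))  # ['s3', 's4']
```

Solvents s3 and s4 are indiscernible on the two attributes but carry different classes, so they end up in the boundary region rather than in the lower approximation.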
Furthermore, data pre-processing makes use of the concepts of reduct and core. In attribute reduction, a reduct identifies the conditional attributes that actually determine the decision attribute, as defined in Equation (4) [32].
$$\gamma(A', X) = \gamma(A, X) \quad \text{for } A' \subseteq A \tag{4}$$
where A′ represents a subset of the attribute set A and X is a decision class. A′ is referred to as a reduct when it is a minimal subset for which the dependency of X on A′ is identical to its dependency on A. It is worth noting that multiple reducts can exist for the same dataset. When this occurs, one can select reducts by considering criteria such as mechanistic plausibility or consistency with first principles. The core, in turn, is defined as the intersection of all reducts, as represented by Equation (5).
$$\mathrm{Core} = \bigcap_{i=1}^{n} R_i \tag{5}$$
where $R_i$ represents the $i$th reduct set.
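A brute-force sketch of Equations (4) and (5) is given below. It assumes that the dependency γ is measured as the fraction of objects whose indiscernibility class carries a single decision value, and it enumerates attribute subsets exhaustively, which is only practical for a handful of attributes. The toy table and function names are hypothetical and do not reproduce the search performed by ROSE2.

```python
from collections import defaultdict
from itertools import combinations


def dependency(table, attributes, decision):
    """Fraction of objects whose indiscernibility class has a single decision value."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in attributes)].append(row[decision])
    consistent = sum(len(v) for v in groups.values() if len(set(v)) == 1)
    return consistent / len(table)


def reducts(table, attributes, decision):
    """Minimal attribute subsets that preserve the full dependency (Equation (4))."""
    full = dependency(table, attributes, decision)
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduct is already contained in this subset
            if dependency(table, subset, decision) == full:
                found.append(subset)
    return found


def core(reduct_list):
    """Attributes shared by every reduct (Equation (5))."""
    return set.intersection(*(set(r) for r in reduct_list)) if reduct_list else set()


# Hypothetical discretized table in which 'wiener' and 'balaban' carry the same information.
toy = [
    {"bp": "L", "wiener": "L", "balaban": "L", "tox": 1},
    {"bp": "L", "wiener": "H", "balaban": "H", "tox": 2},
    {"bp": "H", "wiener": "L", "balaban": "L", "tox": 2},
    {"bp": "H", "wiener": "H", "balaban": "H", "tox": 3},
]
found = reducts(toy, ["bp", "wiener", "balaban"], "tox")
print(found)        # [('bp', 'wiener'), ('bp', 'balaban')]
print(core(found))  # {'bp'}
```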
Step 2f: Simulate the prediction model using Rough Set Data Explorer2 (ROSE2)
The cores and reducts were generated for model development with the software system Rough Set Data Explorer, version 2 (ROSE2). ROSE2, which is based on rough set theory, has its core modules coded in C++ [33]. ROSE2 performs data pre-processing with core and reduct functions, and rule induction based on the Learning from Examples Module (LEM2) algorithm [33]. ROSE2 has a significantly lower 5% error rate than feature selection [34]. The rules generated by LEM2 take the form of “IF-THEN” decision rules. An example of an “IF-THEN” rule is as follows: “(Balaban Index in [34, 63)) & (Boiling Point ≥ 104), Decision: 1”. This rule states that if the molecular structure of an organic solvent has a Balaban index in the range of 34 to 63 (63 excluded) and a boiling point greater than or equal to 104 °C, then the solvent is classified under class 1 toxicity.
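To show how such a rule reads as executable logic, the sketch below encodes the example rule as half-open intervals on the conditional attributes. The `Interval` and `Rule` classes, the first-match policy in `classify`, and the attribute values passed in are assumptions for illustration; they do not describe the ROSE2 rule format or its classification strategy.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [low, high) on one conditional attribute."""
    low: float = -math.inf
    high: float = math.inf

    def contains(self, value):
        return self.low <= value < self.high


@dataclass
class Rule:
    conditions: dict  # attribute name -> Interval
    decision: int

    def matches(self, solvent):
        return all(iv.contains(solvent[name]) for name, iv in self.conditions.items())


def classify(solvent, rules):
    """Return the decision of the first matching rule, or None if no rule fires."""
    for rule in rules:
        if rule.matches(solvent):
            return rule.decision
    return None


# Example rule from the text: Balaban index in [34, 63) and boiling point >= 104.
rule = Rule({"balaban": Interval(34, 63), "boiling_point": Interval(low=104)}, decision=1)

# Hypothetical attribute values, chosen only to exercise the rule.
print(classify({"balaban": 40.0, "boiling_point": 110.0}, [rule]))  # 1
print(classify({"balaban": 63.0, "boiling_point": 110.0}, [rule]))  # None
```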
Step 3a: Perform validation on the developed predictive model
The developed predictive model and its generated rules are validated with the 30% of the collected data that was excluded from the training data. Validation is a vital step in rough set theory for evaluating the model and determining its appropriateness for assessing the human toxicity of an organic solvent. The validation techniques numerically determine the strength, certainty, and coverage of the models with Equations (6)–(8). High certainty and coverage indicate satisfactory model performance with a well-determined underlying relationship. A model with low coverage and accuracy, however, can be re-modeled by repeating the previous steps.
Strength, $\sigma_x(C, D)$, refers to the degree to which the objects support the decision rule. Equation (6) gives the number of supporting objects, $\mathrm{supp}_x(C, D)$, over the total number of objects, $|U|$, in the decision table.
$$\sigma_x(C, D) = \frac{\mathrm{supp}_x(C, D)}{|U|} \tag{6}$$
where $C$ and $D$ denote the condition and decision attributes, respectively.
The certainty, $\mathrm{cer}_x(C, D)$, shown in Equation (7), measures the probability that an object satisfying the conditions of a rule belongs to the decision class of that rule. A certainty of 100% (certainty factor $\mathrm{cer}_x(C, D) = 1$) indicates that every object meeting the conditions of the generated rule is classified in the respective decision class, whereas an uncertain rule has a certainty below 100%, that is, a certainty factor of less than 1 ($0 < \mathrm{cer}_x(C, D) < 1$).
$$\mathrm{cer}_x(C, D) = \frac{|C(x) \cap D(x)|}{|C(x)|} = \frac{\mathrm{supp}_x(C, D)}{|C(x)|} = \frac{\sigma_x(C, D)}{\sigma_x(C)} \tag{7}$$
In simplified words, certainty is the number of objects that fulfill the decision rule and belong to the particular decision class ($\sigma_x(C, D)$) divided by the number of all objects that meet the conditions of the decision rule ($\sigma_x(C)$).
In addition, coverage, $\mathrm{cov}_x(C, D)$, measures the percentage of objects in the corresponding class that fall under the rule.
$$\mathrm{cov}_x(C, D) = \frac{|C(x) \cap D(x)|}{|D(x)|} = \frac{\mathrm{supp}_x(C, D)}{|D(x)|} = \frac{\sigma_x(C, D)}{\sigma_x(D)} \tag{8}$$
Based on Equation (8), coverage is the number of objects that fulfill the decision rule and belong to the particular decision class ($\sigma_x(C, D)$) divided by the total number of objects in that decision class ($\sigma_x(D)$).
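The three measures can be computed directly from a decision table, as in the sketch below. The ten-row table and the single-attribute rule are hypothetical, and `rule_metrics` is a name introduced for this example; coverage is taken relative to the number of objects in the decision class, following the verbal definition above.

```python
def rule_metrics(table, condition, decision_attr, decision_value):
    """Return (strength, certainty, coverage) of one rule on a decision table."""
    matched = [row for row in table if condition(row)]  # objects meeting the rule conditions
    in_class = [row for row in table if row[decision_attr] == decision_value]
    support = sum(1 for row in matched if row[decision_attr] == decision_value)
    strength = support / len(table)                          # Equation (6)
    certainty = support / len(matched) if matched else 0.0   # Equation (7)
    coverage = support / len(in_class) if in_class else 0.0  # Equation (8)
    return strength, certainty, coverage


# Hypothetical (Balaban index, toxicity class) pairs, not data from this study.
pairs = [(40, 1), (45, 1), (50, 1), (60, 2), (20, 1),
         (25, 2), (30, 2), (70, 3), (80, 3), (35, 1)]
table = [{"balaban": b, "tox": t} for b, t in pairs]

# Rule under test: Balaban index in [34, 63) -> class 1.
print(rule_metrics(table, lambda row: 34 <= row["balaban"] < 63, "tox", 1))
# (0.4, 0.8, 0.8)
```

Here five rows meet the condition, four of them are in class 1, and class 1 has five members in total, so certainty and coverage are both 4/5 while strength is 4/10.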

3. Results

3.1. Cores and Reducts

Five reduct sets were identified from the four conditional attributes: the Balaban index, valence connectivity index, Wiener index, and boiling point. The number of rules generated for each reduct is shown in Table 5. However, no core was identified across the reducts. The absence of a core indicates that no conditional attribute is common to all reduct sets. Thus, the results are analyzed based on the corresponding conditional attributes in each reduct.
The study generated a total of 166 decision rules across the five reduct sets. In reduct 1, 36 decision rules were generated, all determined by the Balaban index and valence connectivity index. Reduct 2 was determined by the valence connectivity index and Wiener index, whereas reduct 3 comprises the Balaban index and boiling point. Furthermore, reduct 4 comprises the valence connectivity index and boiling point. Lastly, reduct 5 consists of the Wiener index and boiling point. A summary of the reduct sets and the number of rules generated is shown in Table 5.
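The absence of a core can be checked directly from the five reducts listed above by intersecting them, as in this short sketch. Only the attribute sets come from the text; the variable names are arbitrary.

```python
from collections import Counter

# Attribute sets of the five reducts as described in the text.
reducts = {
    1: {"Balaban index", "valence connectivity index"},
    2: {"valence connectivity index", "Wiener index"},
    3: {"Balaban index", "boiling point"},
    4: {"valence connectivity index", "boiling point"},
    5: {"Wiener index", "boiling point"},
}

# The core is the intersection of all reducts.
core = set.intersection(*reducts.values())
print(core)  # set()

# How often each attribute appears across the five reducts.
counts = Counter(attr for attrs in reducts.values() for attr in attrs)
print(sorted(counts.items()))
# [('Balaban index', 2), ('Wiener index', 2), ('boiling point', 3), ('valence connectivity index', 3)]
```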

3.2. Rule-Based Prediction Models

In the predictive model, five different sets of decision rules, along with certainty and coverage, were generated. The complete set of decision rules from each reduct is shown in Appendix A Table A3, Table A4, Table A5, Table A6 and Table A7. In order to perform rules interpretation, rules with good coverage and certainty are normally considered. Table 6 shows some examples of decision rules in reduct 5, and the rules will be used to illustrate the explanation.
Rule 4 can be expressed as the following “IF-THEN” statement: “If the molecular structure of an organic solvent has a Balaban index in the range of 34 to 63 and a boiling point greater than or equal to 104 °C, then the inhalation LC50 falls within the range of 10 to 100 ppm and the organic solvent is classified as extremely to highly toxic to human health.” The coverage of rule 4 was 32.26%, as the rule was fulfilled by 10 of the 31 organic solvents in the training data under class 1 toxicity. The certainty of 100% indicates that all 10 organic solvents were classified under the correct decision attribute.
For class 2, rule 22 has the following “IF-THEN” statement: “If the organic solvent has a Balaban index in the range of 17 to 33 and a boiling point between 36 °C and 78 °C, then the organic solvent is classified under class 2 toxicity as slightly toxic, with an inhalation LC50 in the range of 100 to 100,000 ppm.” The coverage of rule 22 is 19.23%, which is lower than that of the class 1 rule, with 100% certainty.
For class 3, the “IF-THEN” statement for the decision rule was “If the Balaban index of the organic solvent is greater than or equal to 153 and the boiling point is greater than or equal to 199 °C, then the organic solvent is classified under class 3 as relatively harmless to human health.” The decision rule has 23.08% coverage and 100% certainty.
Similarly, for the remaining reduct sets, the interpretation of each decision rule is guided by coverage and certainty. A high coverage indicates that a large number of the organic solvents in a particular class meet the rule’s requirements. Meanwhile, certainty shows how accurately the objects are classified under the correct class. Thus, decision rules with high coverage and certainty yield a robust model for predicting human toxicity from inhalation exposure to organic solvents.
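As a worked check, the figures quoted for rule 4 follow from Equations (6)–(8). The strength value is not reported in the text; it is computed here under the assumption that the decision table contains the 70 training objects (70% of the 100 collected data points).

```latex
\begin{align*}
\mathrm{cov}_x(C, D) &= \frac{\mathrm{supp}_x(C, D)}{|D(x)|} = \frac{10}{31} \approx 0.3226 = 32.26\% \\
\mathrm{cer}_x(C, D) &= \frac{\mathrm{supp}_x(C, D)}{|C(x)|} = \frac{10}{10} = 1 = 100\% \\
\sigma_x(C, D) &= \frac{\mathrm{supp}_x(C, D)}{|U|} = \frac{10}{70} \approx 0.1429 \quad \text{(assuming } |U| = 70\text{)}
\end{align*}
```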

3.3. Validation of Decision Rules

The 30% of the dataset that was not used in developing the model was used to validate the decision rules. The validation data consist of exactly 30 data points of organic solvents, which were applied to all five reduct sets to validate the coverage and certainty of the generated decision rules. The validation results exhibited high certainty for a class and were promising in showing the underlying relationship between the condition attributes and the decision attribute. Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7 show the validation results for each reduct set.
A reduct set is a minimal feature subset that correlates with the decision attribute while retaining the main backbone of the dataset. For class 1 toxicity, reduct 3 had the largest number of decision rules fulfilled in validation, with 9 out of 13 rules, followed by reduct 1 with 7 out of 15 rules. In terms of certainty, Figure 5 illustrates that reduct 3 has 5 validated rules with 100% certainty, whereas reduct 1, shown in Figure 3, has only 4 rules with certainty higher than 70%. However, rule no. 10 (R10) in reduct 1 is met by a remarkable 5 data points with 83.33% certainty. Thus, reduct 1, with high to moderate certainty and the maximum coverage, was chosen for class 1 toxicity. In general, the conditional attributes in reduct 1, the Balaban index and valence connectivity index, affect class 1 toxicity.
For class 2 toxicity, the validation dataset met four decision rules, of which R17 and R24 had 100% certainty. However, only one data point was present in each of these decision rules. Hence, reduct 5, which has the second highest number of validated decision rules, was reviewed: the certainty of all three rules was comparatively high, with 66.67% certainty and 2 data points for R12, and 100% certainty and 1 data point each for R20 and R22. With a high number of data points and high certainty, reduct 5 correlates better with class 2 toxicity.
Molecules in class 3 were non-toxic to human health and were determined by reduct 3 with 3 decision rules. The Balaban index and boiling point mainly influence class 3 toxicity.
In summary, each toxicity class as a decision attribute is affected by various conditional attributes as shown in Table 7. The Balaban and valence connectivity indices from reduct 1 affect class 1 toxicity. Moreover, class 2 toxicity was influenced by the Wiener index and boiling point from reduct 5. Lastly, the Balaban index and boiling point from reduct 3 lead to the non-toxicity of class 3.
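The validation figures can be spot-checked with a short script. The sketch below is illustrative only: the names `Solvent`, `rule_stats`, `r10_reduct1` and `r4_reduct5` are hypothetical, the data rows are the validation set as read from Table A2, and plain interval matching with inclusive range bounds is an assumption about how rules are applied. For the two rules shown (reduct 1, rule 10 and reduct 5, rule 4) it reproduces the tabulated validation coverage and certainty; this is not guaranteed for every rule in the appendix.

```python
from typing import Callable, List, NamedTuple, Tuple


class Solvent(NamedTuple):
    name: str
    balaban: float   # Balaban index
    valence: float   # valence connectivity index
    wiener: int      # Wiener index
    bp: int          # boiling point in °C
    tox: int         # toxicity class (1, 2 or 3)


# Validation set as read from Table A2.
VALIDATION: List[Solvent] = [
    Solvent("Toluene", 3.02, 2.41, 42, 111, 1),
    Solvent("Cyclohexanol", 2.12, 3.07, 42, 161, 1),
    Solvent("N,N-dimethylaniline", 2.85, 3.03, 88, 194, 1),
    Solvent("Tetrahydrofuran", 2.08, 2.08, 15, 65, 1),
    Solvent("Hexane", 2.34, 2.91, 35, 69, 1),
    Solvent("1,1-Dichloroethene", 2.80, 1.49, 9, 57, 1),
    Solvent("1-pentanol", 2.34, 2.52, 35, 138, 1),
    Solvent("Methylcyclohexane", 2.12, 3.39, 42, 101, 1),
    Solvent("N,N-Dimethylacetamide", 3.26, 1.82, 29, 165, 1),
    Solvent("Cyclohexane", 2.00, 3.00, 27, 81, 1),
    Solvent("1,1,1-Trichloroethane", 3.02, 2.20, 16, 75, 1),
    Solvent("n-octane", 2.53, 3.91, 84, 125, 1),
    Solvent("Acetonitrile", 2.48, 0.72, 4, 82, 1),
    Solvent("1-heptanol", 2.53, 3.52, 84, 176, 1),
    Solvent("NMP", 2.48, 2.54, 40, 202, 1),
    Solvent("2-pentanol", 2.63, 2.45, 32, 119, 1),
    Solvent("Acetone", 2.80, 1.20, 9, 56, 2),
    Solvent("Aniline", 3.02, 2.20, 42, 184, 2),
    Solvent("Isopropyl acetate", 3.13, 2.30, 48, 89, 2),
    Solvent("Methyl acetate", 2.85, 1.32, 18, 57, 2),
    Solvent("Isobutyl acetate", 3.05, 2.76, 74, 117, 2),
    Solvent("Triethylamine", 2.99, 3.07, 48, 89, 2),
    Solvent("2-Butanone", 2.85, 1.76, 18, 80, 2),
    Solvent("1-Amyl Alcohol", 2.34, 2.52, 35, 138, 2),
    Solvent("Benzaldehyde", 2.99, 2.44, 64, 178, 2),
    Solvent("Diethyl ether", 2.19, 1.99, 20, 35, 3),
    Solvent("2,2,4-Trimethyl pentane", 3.39, 3.42, 66, 99, 3),
    Solvent("3-pentanone", 2.99, 2.33, 31, 102, 3),
    Solvent("Diethylene Glycol", 2.45, 2.21, 56, 246, 3),
    Solvent("Acetaldehyde", 2.19, 0.81, 4, 20, 3),
]


def rule_stats(data: List[Solvent],
               condition: Callable[[Solvent], bool],
               target_class: int) -> Tuple[float, float]:
    """Return (coverage %, certainty %) of one rule, rounded to 2 decimals."""
    in_class = [s for s in data if s.tox == target_class]
    matched = [s for s in data if condition(s)]
    correct = [s for s in matched if s.tox == target_class]
    coverage = 100.0 * len(correct) / len(in_class) if in_class else 0.0
    certainty = 100.0 * len(correct) / len(matched) if matched else 0.0
    return round(coverage, 2), round(certainty, 2)


def r10_reduct1(s: Solvent) -> bool:
    # Table A3, rule 10: Balaban < 2.545 and valence connectivity 2.445-3.215
    return s.balaban < 2.545 and 2.445 <= s.valence <= 3.215


def r4_reduct5(s: Solvent) -> bool:
    # Table A7, rule 4: Wiener 34-63 and boiling point >= 104 °C
    return 34 <= s.wiener <= 63 and s.bp >= 104


if __name__ == "__main__":
    print(rule_stats(VALIDATION, r10_reduct1, 1))  # (31.25, 83.33)
    print(rule_stats(VALIDATION, r4_reduct5, 1))   # (25.0, 57.14)
```

For R10 of reduct 1, five of the 16 class 1 validation solvents match, and one class 2 solvent also matches, which gives the 83.33% certainty quoted in the text.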

4. Discussion

Each toxicity class can be explained with conditional attributes based on the selected reduct set. The results in each class of decision attributes are interpretable with the chosen reduct set.

4.1. Class 1 Toxicity

Class 1 toxicity is defined as high to moderate toxicity of organic solvents via inhalation on the scale of LC50. The conditional attributes affecting class 1 toxicity include the Balaban index, denoted by A1, and the valence connectivity index, denoted by A2. In the validated decision rules, both topological indices (A1 and A2) have higher values in class 1 than in other classes. Table 8 shows the decision rules for classes 1 and 2 in reduct 1.
The Balaban index, also known as the averaged distance sum connectivity index [35], is denoted as the J index. The J index increases with increasing branching and number of (aromatic) rings. In other words, the J index is determined by the branching (shape) of a molecule, in a proportional relationship. In addition, the valence connectivity index is a sum over all bonds, in which each bond is weighted by the valence states of the two atoms it connects. This study focuses on the first-order valence connectivity index. The index measures the degree of connectivity between atoms based on their valence electron counts. The index correlates with the polarity of a molecule through its charge distribution. The polarity, in turn, is linked to the organic solvent’s intermolecular forces and boiling point.
Solvent lipophilicity increases with molecular weight [36]. Hence, an increase in the Balaban index and aromaticity increases the hydrophobicity of non-polar, lipid-soluble molecules [37]. When an organic solvent with a high Balaban index and high lipophilicity is dispersed in the atmosphere and taken in by inhalation, it tends to bind and accumulate in hydrophobic regions of the human body, such as lipid-rich tissues. Moreover, a solvent with a high valence connectivity index has a high number of valence electrons in a uniform distribution, resulting in low electronegativity and hence low polarity. The low polarity indicates weak intermolecular forces, so that little energy is required to separate the molecules. As a result, molecules with low polarity have a low boiling point. Molecules with a low boiling point are highly volatile and vaporize readily. Therefore, the volatile solvent tends to disperse into the air and accumulate in the human body [35], risking extreme toxicity to human health. In summary, a high Balaban index and a high valence connectivity index of organic solvents can be scientifically linked to high to moderate toxicity to human health via the inhalation route. Figure 8 summarizes how the topological indices contribute to high toxicity.
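As a rough illustration of how the two indices are obtained from a molecular graph, the sketch below computes the Balaban J index and the first-order valence connectivity index for a hydrogen-suppressed skeleton. It is a minimal sketch under stated assumptions: the function names are hypothetical, every bond is treated as an unweighted edge (so aromatic or multiple-bond weighting is not modelled), and the valence delta values are supplied by the caller. For n-pentane it gives 2.19 and 2.41, matching the Table A1 entries for pentane.

```python
from collections import deque
from math import sqrt
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def distance_sums(n: int, edges: List[Edge]) -> List[int]:
    """Sum of shortest-path distances from each vertex to all others (BFS)."""
    adj: Dict[int, List[int]] = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    sums = []
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        sums.append(sum(dist.values()))
    return sums


def balaban_j(n: int, edges: List[Edge]) -> float:
    """J = m / (mu + 1) * sum over edges of (s_i * s_j) ** -0.5."""
    m = len(edges)
    mu = m - n + 1  # number of rings in a connected graph
    s = distance_sums(n, edges)
    return m / (mu + 1) * sum(1.0 / sqrt(s[a] * s[b]) for a, b in edges)


def valence_connectivity_1(delta_v: List[int], edges: List[Edge]) -> float:
    """First-order valence connectivity: sum over bonds of (dv_i * dv_j) ** -0.5."""
    return sum(1.0 / sqrt(delta_v[a] * delta_v[b]) for a, b in edges)


# n-pentane, hydrogen-suppressed skeleton: C0-C1-C2-C3-C4
pentane_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
pentane_dv = [1, 2, 2, 2, 1]  # valence delta: CH3 = 1, CH2 = 2

if __name__ == "__main__":
    print(round(balaban_j(5, pentane_edges), 2))                        # 2.19
    print(round(valence_connectivity_1(pentane_dv, pentane_edges), 2))  # 2.41
```

A six-membered ring of CH2 groups (cyclohexane) gives J = 2.0 with the same functions, which agrees with the cyclohexane entry in Table A2.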
In short, it can be concluded that organic solvents with high values of the Balaban index and valence connectivity index exhibit high to moderate toxicity to human health when inhaled.

4.2. Class 2 Toxicity

Class 2 toxicity is classified as slightly toxic to humans, with an inhalation LC50 in rats for 4 h of exposure ranging from 1000 to 100,000 ppm. Based on the validated decision rules, the conditional attributes of the Wiener index, denoted by A3, and the boiling point, a physical property denoted by A4, are expected to contribute to class 2 toxicity. From the decision rules, class 2 is characterized by a low Wiener index, with the lowest range below 13, and a moderately high boiling point of up to 288 °C.
In topological studies, the Wiener index is the sum of the shortest-path distances [35] between all pairs of vertices. The Wiener index correlates with molecular properties in QSAR and QSPR. This distance-based index is a good measure of the compactness of a molecule [38], in an inversely proportional relationship. Hence, a molecule’s Wiener index relates to its compactness and size. Moreover, boiling point is related to volatility: there is an inverse relationship between boiling point and volatility.
A slightly toxic organic solvent exhibits a low Wiener index, indicating that the molecule is closely packed, with high compactness. A compact molecule is smaller, with its vertices confined to a small space. Small molecules in particle form are easily inhaled into the lungs, then absorbed and distributed through the bloodstream. At the same time, volatile organic compounds (VOCs) with moderately high boiling points have stronger intermolecular forces, resulting in moderate volatility for molecules escaping into the atmosphere. Thus, small particles with moderate volatility lead to slight toxicity. A summary of the relationship between the conditional attributes leading to toxicity is shown in Figure 9.
Table 9 shows a few decision rules extracted from reduct 5 that result in class 2 toxicity. Rules 13 to 15 show the binary relation of the Wiener index and boiling point in a decision rule. The rules with two conditional attributes can be read as “If the Wiener index is lower, then the solvent has a moderate boiling point”, and conversely as “If the Wiener index is high, the boiling point of the solvent has to be higher for the solvent to remain in class 2 toxicity”. On the other hand, a single conditional attribute, either the Wiener index or the boiling point, can also determine the toxicity class. In rule 20, the Wiener index alone, with a higher value ranging from 72 to 83, leads to slight toxicity. A higher value is reasonable here, given that a low Wiener index of about 10 as a single attribute would lead to high toxicity. Moreover, in a single-attribute rule, a boiling point of 154 °C has been identified with class 2 toxicity.
Therefore, class 2 toxicity is characterized by a low Wiener index and moderately high boiling points of organic solvents. The compactness and intermolecular forces of the molecules allow for easy inhalation into the human body.
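The definition can be made concrete with a small graph computation. The sketch below is illustrative only: `wiener_index` is a hypothetical helper, and each molecule is represented as an unweighted hydrogen-suppressed graph, which is an assumption about how the tabulated values were obtained. Under that assumption the results agree with the appendix entries for pentane (20), hexane (35), cyclohexane (27) and carbon tetrachloride (16).

```python
from collections import deque
from typing import Dict, List, Tuple


def wiener_index(n: int, edges: List[Tuple[int, int]]) -> int:
    """Sum of shortest-path distances over all unordered vertex pairs."""
    adj: Dict[int, List[int]] = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for start in range(n):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


pentane = [(0, 1), (1, 2), (2, 3), (3, 4)]          # linear chain of 5 atoms
hexane = [(i, i + 1) for i in range(5)]             # linear chain of 6 atoms
cyclohexane = [(i, (i + 1) % 6) for i in range(6)]  # six-membered ring
star5 = [(0, 1), (0, 2), (0, 3), (0, 4)]            # central atom with 4 neighbours

if __name__ == "__main__":
    print(wiener_index(5, pentane))      # 20
    print(wiener_index(6, hexane))       # 35
    print(wiener_index(6, cyclohexane))  # 27
    print(wiener_index(5, star5))        # 16
```

The five-atom star (16) has a lower index than the five-atom chain (20), which illustrates the inverse relationship between the Wiener index and compactness.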

4.3. Class 3 Toxicity

An organic solvent with low toxicity is classified as class 3, with an LC50 of 100,000 ppm via the inhalation route. In the validation results, there is a high level of certainty for reduct 3 in class 3 toxicity. The conditional attributes, namely the Balaban index, denoted by A1, and the boiling point, denoted by A4, have been interpreted as dominating whether an organic solvent meets the criteria of class 3. A non-toxic organic solvent is predicted to have a low Balaban index and a high boiling point.
Organic solvents with low toxicity have a low Balaban index, with less branching and aromaticity, which decreases lipophilicity. Low lipophilicity decreases the tendency of a molecule to be absorbed into body cells and tissues, resulting in low accumulation and hence lower toxicity. In addition, a solvent with a high boiling point has stronger intermolecular forces and requires more energy to vaporize, leading to lower volatility. Therefore, solvents with a less complicated structure, low aromaticity, and low volatility undergo the phase change from liquid to gas with relative difficulty and are less likely to interact with the human body, resulting in a non-toxic effect. The overall relationship between the Balaban index and boiling point is presented in Figure 10.
The validated decision rules in Table 10 support the attribute relationship between a low Balaban index and a high boiling point. In rule 27 and rule 28, the Balaban indices are lower than those in class 1, reduct 1. Similarly, a higher boiling point is obtained in reduct 3 than in class 3 of reduct 5, as tabulated in Table 10. Rule 27 can be stated as “If an organic solvent has a Balaban index ranging from 2.22 to 3.16 and a boiling point in the range of 199 °C to 284 °C, then the solvent is classified in class 3 toxicity.” Moreover, the single conditional attribute of a low Balaban index between 2.67 and 2.755 in rule 31 results in extremely low toxicity. Hence, the validated decision rules for class 3 toxicity highlight the significance of a low Balaban index and a high boiling point in determining the non-toxic effects of organic solvents.
In summary, five reduct sets were determined from the data inputs in the predictive model. The generated decision rules were validated with reasonably good certainty and coverage and can be explained scientifically. When interpreting the results, the decision rules induced by the RSML approach showed the relationship between the molecular structure and the respective toxicity classification. When an organic solvent is classified as extremely toxic (class 1), this is mainly attributed to high Balaban and valence connectivity indices. When a solvent is classified as slightly toxic (class 2), it is found to have a low Wiener index and a moderately high boiling point. Lastly, a low Balaban index and a high boiling point are significant factors that contribute to low toxicity (class 3). Compared to other machine learning approaches, which are normally “black-box models” with limited explainability, this work has successfully revealed the key molecular attributes that lead to distinct classes of toxicity. This understanding is significant for the design of new molecules or products.

5. Conclusions

This research paper presents the topological indices and physical properties of solvents as conditional attributes in a predictive model of the human toxicity of solvents based on RSML. The impacts of solvents on human health can be estimated from factors such as the boiling point of the solvent and topological indices, including the Balaban index, the valence connectivity index, and the Wiener index. The predictive model, based on uncertain and ambiguous data, has generated rules that uncover the underlying molecular structure contributing to human toxicity. The Balaban index, valence connectivity index, and Wiener index provide quantitative values that connect molecular structure to the toxicity of organic solvents.
The proposed RSML-based predictive model of the health performance of solvents offers significant advantages in evaluating the health performance of solvents by discovering the conditional attributes that affect different classes of toxicity. This is particularly useful in solvent design and screening. However, the research is limited to the assessment of human toxicity and may not account for other health issues caused by solvents. Further research should focus on developing prediction models for other health issues caused by solvents, such as carcinogenicity and mutagenicity. Additionally, future research should enhance the machine learning techniques and use larger datasets with more comprehensive data in order to further improve the model’s performance. In conclusion, this research successfully demonstrated the potential of using topological indices and physical properties of solvents in a predictive model using machine learning tools for assessing human toxicity.

Author Contributions

Conceptualization, J.O. and N.G.C.; Methodology, W.Y.H., J.O. and N.G.C.; Software, W.Y.H. and J.W.C.; Validation, J.O. and N.G.C.; formal analysis, W.Y.H.; Investigation, W.Y.H. and J.O.; resources, N.G.C. and J.W.C.; data curation, W.Y.H.; Writing—original draft preparation, W.Y.H.; Writing—review & editing, J.O., N.G.C., J.W.C., C.H.L. and M.R.E.; visualization, W.Y.H.; Supervision, J.O. and N.G.C.; project administration, J.O.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

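Table A1. Database of organic solvents for training.

| No. | Organic Chemical Solvents | Balaban Index | Valence Connectivity Index | Wiener Index | Boiling Point (°C) | Toxicity |
|---|---|---|---|---|---|---|
| 1 | 2-Ethoxyethanol | 2.34 | 2.10 | 35 | 135 | 1 |
| 2 | Dimethylformamide (DMF) | 2.83 | 1.39 | 18 | 153 | 1 |
| 3 | Carbon tetrachloride | 3.02 | 2.27 | 16 | 77 | 1 |
| 4 | 1-hexanol | 2.45 | 3.02 | 56 | 156 | 1 |
| 5 | Acetyl Acetone | 3.32 | 2.12 | 48 | 138 | 1 |
| 6 | Formamide | 2.19 | 0.57 | 4 | 210 | 1 |
| 7 | 1,2-dichloroethane | 1.97 | 2.10 | 10 | 84 | 1 |
| 8 | 1,2-dimethoxyethane (glyme, DME) | 2.34 | 1.89 | 35 | 85 | 1 |
| 9 | 1,2-Dichloroethene | 2.55 | 1.64 | 10 | 60 | 1 |
| 10 | Nitromethane | 2.80 | 0.81 | 9 | 101 | 1 |
| 11 | Benzene | 3.00 | 2.00 | 27 | 80 | 1 |
| 12 | Trichloroethylene | 3.14 | 2.08 | 18 | 87 | 1 |
| 13 | Dibutyl ether | 2.60 | 3.99 | 120 | 142 | 1 |
| 14 | n-Decane | 2.65 | 4.91 | 165 | 174 | 1 |
| 15 | Methylisobutylketone | 3.13 | 2.62 | 48 | 118 | 1 |
| 16 | Pyridine | 3.00 | 1.85 | 27 | 115 | 1 |
| 17 | m-xylene | 3.08 | 2.82 | 61 | 106 | 1 |
| 18 | 1,4-Dioxane | 2.00 | 2.15 | 27 | 12 | 1 |
| 19 | Methanol | 1.00 | 0.45 | 1 | 65 | 1 |
| 20 | Ethylene glycol | 1.97 | 1.13 | 10 | 195 | 1 |
| 21 | Cyclohexanone | 2.25 | 2.91 | 42 | 155 | 1 |
| 22 | Ethylbenzene | 2.83 | 2.97 | 64 | 136 | 1 |
| 23 | Sulfolane | 2.76 | 4.23 | 39 | 285 | 1 |
| 24 | Chlorobenzene | 3.02 | 2.48 | 42 | 132 | 1 |
| 25 | Iso-Butanol | 2.54 | 1.88 | 18 | 108 | 1 |
| 26 | 2-Methoxyethanol | 2.19 | 1.51 | 20 | 125 | 1 |
| 27 | Methylene chloride | 1.63 | 1.60 | 4 | 40 | 1 |
| 28 | p-xylene | 3.03 | 2.82 | 62 | 106 | 1 |
| 29 | o-xylene | 3.13 | 2.83 | 60 | 106 | 1 |
| 30 | Chloroform | 2.32 | 1.96 | 9 | 61 | 1 |
| 31 | 1,1,2-Trichloroethene | 2.54 | 2.52 | 18 | 114 | 1 |
| 32 | 1-propanol | 1.97 | 1.52 | 10 | 97 | 2 |
| 33 | 2-Butanol | 2.54 | 1.95 | 18 | 100 | 2 |
| 34 | Pentene | 2.40 | 2.02 | 20 | 30 | 2 |
| 35 | Dimethyl Sulfoxide (DMSO) | 2.80 | 2.95 | 9 | 189 | 2 |
| 36 | Butyl acetate | 2.82 | 2.90 | 79 | 125 | 2 |
| 37 | Ethyl formate | 2.40 | 1.47 | 20 | 55 | 2 |
| 38 | 2-Methyl-1-propanol | 3.17 | 1.96 | 28 | 165 | 2 |
| 39 | 2-propanol | 2.32 | 1.41 | 9 | 82 | 2 |
| 40 | Pentane | 2.19 | 2.41 | 20 | 36 | 2 |
| 41 | Anisole | 2.83 | 2.52 | 64 | 154 | 2 |
| 42 | 2-pentanone | 2.83 | 2.76 | 32 | 102 | 2 |
| 43 | Benzonitrile | 3.05 | 2.38 | 64 | 190 | 2 |
| 44 | Heptane | 2.45 | 3.41 | 56 | 98 | 2 |
| 45 | 1-octanol | 2.60 | 4.02 | 120 | 195 | 2 |
| 46 | Methyl t-butyl ether (MTBE) | 3.17 | 2.11 | 28 | 55 | 2 |
| 47 | 2-Amyl Alcohol | 2.54 | 2.38 | 32 | 130 | 2 |
| 48 | Hexamethylphosphoramide (HMPA) | 4.69 | 5.03 | 142 | 231 | 2 |
| 49 | Ethyl acetate | 2.83 | 1.90 | 32 | 77 | 2 |
| 50 | Formic acid | 2.19 | 0.49 | 4 | 100 | 2 |
| 51 | 1-Butanol | 2.19 | 2.02 | 20 | 118 | 2 |
| 52 | Carbon disulfide | 3.27 | 1.22 | 4 | 47 | 2 |
| 53 | 2-aminoethanol | 1.97 | 1.22 | 10 | 171 | 2 |
| 54 | Ethanol | 1.63 | 1.02 | 4 | 79 | 2 |
| 55 | Acetic Acid | 2.80 | 0.93 | 9 | 118 | 2 |
| 56 | Ethyl Ether | 1.97 | 1.40 | 10 | 11 | 2 |
| 57 | Diethylamine | 2.19 | 2.12 | 20 | 56 | 2 |
| 58 | Di-n-butyl phthalate | 2.71 | 7.14 | 912 | 337 | 3 |
| 59 | Glycerin | 2.75 | 1.71 | 31 | 290 | 3 |
| 60 | Ethyl benzoate | 2.69 | 3.56 | 164 | 212 | 3 |
| 61 | t-butyl alcohol | 3.02 | 1.72 | 16 | 82 | 3 |
| 62 | Ethyl acetoacetate | 3.39 | 2.82 | 102 | 181 | 3 |
| 63 | Dimethyl phthalate | 3.15 | 3.96 | 295 | 283 | 3 |
| 64 | Diisopropyl ether | 2.95 | 2.78 | 48 | 69 | 3 |
| 65 | 3-pentanol | 2.75 | 2.49 | 31 | 115 | 3 |
| 66 | Ether | 2.19 | 1.99 | 20 | 35 | 3 |
| 67 | Benzyl Alcohol | 2.83 | 2.58 | 64 | 205 | 3 |
| 68 | Diglyme (Diethylene glycol dimethyl ether) | 2.60 | 2.97 | 120 | 162 | 3 |
| 69 | Acetophenone | 2.98 | 2.86 | 88 | 202 | 3 |
| 70 | 1,2-Propanediol | 2.54 | 1.56 | 18 | 188 | 3 |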
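Table A2. Database of organic solvents for validation.

| No. | Organic Chemical Solvents | Balaban Index | Valence Connectivity Index | Wiener Index | Boiling Point (°C) | Toxicity |
|---|---|---|---|---|---|---|
| 1 | Toluene | 3.02 | 2.41 | 42 | 111 | 1 |
| 2 | Cyclohexanol | 2.12 | 3.07 | 42 | 161 | 1 |
| 3 | N,N-dimethylaniline | 2.85 | 3.03 | 88 | 194 | 1 |
| 4 | Tetrahydrofuran (THF) | 2.08 | 2.08 | 15 | 65 | 1 |
| 5 | Hexane | 2.34 | 2.91 | 35 | 69 | 1 |
| 6 | 1,1-Dichloroethene | 2.80 | 1.49 | 9 | 57 | 1 |
| 7 | 1-pentanol | 2.34 | 2.52 | 35 | 138 | 1 |
| 8 | Methylcyclohexane | 2.12 | 3.39 | 42 | 101 | 1 |
| 9 | N,N-Dimethylacetamide | 3.26 | 1.82 | 29 | 165 | 1 |
| 10 | Cyclohexane | 2.00 | 3.00 | 27 | 81 | 1 |
| 11 | 1,1,1-Trichloroethane | 3.02 | 2.20 | 16 | 75 | 1 |
| 12 | n-octane | 2.53 | 3.91 | 84 | 125 | 1 |
| 13 | Acetonitrile | 2.48 | 0.72 | 4 | 82 | 1 |
| 14 | 1-heptanol | 2.53 | 3.52 | 84 | 176 | 1 |
| 15 | N-methyl-2-pyrrolidone (NMP) | 2.48 | 2.54 | 40 | 202 | 1 |
| 16 | 2-pentanol | 2.63 | 2.45 | 32 | 119 | 1 |
| 17 | Acetone | 2.80 | 1.20 | 9 | 56 | 2 |
| 18 | Aniline | 3.02 | 2.20 | 42 | 184 | 2 |
| 19 | Isopropyl acetate | 3.13 | 2.30 | 48 | 89 | 2 |
| 20 | Methyl acetate | 2.85 | 1.32 | 18 | 57 | 2 |
| 21 | Isobutyl acetate | 3.05 | 2.76 | 74 | 117 | 2 |
| 22 | Triethylamine | 2.99 | 3.07 | 48 | 89 | 2 |
| 23 | 2-Butanone | 2.85 | 1.76 | 18 | 80 | 2 |
| 24 | 1-Amyl Alcohol | 2.34 | 2.52 | 35 | 138 | 2 |
| 25 | Benzaldehyde | 2.99 | 2.44 | 64 | 178 | 2 |
| 26 | Diethyl ether | 2.19 | 1.99 | 20 | 35 | 3 |
| 27 | 2,2,4-Trimethyl pentane | 3.39 | 3.42 | 66 | 99 | 3 |
| 28 | 3-pentanone | 2.99 | 2.33 | 31 | 102 | 3 |
| 29 | Diethylene Glycol | 2.45 | 2.21 | 56 | 246 | 3 |
| 30 | Acetaldehyde | 2.19 | 0.81 | 4 | 20 | 3 |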
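Table A3. Decision rules generated for reduct 1.

| No | Balaban Index | Valence Connectivity Index | Toxicity Class | Training Coverage (%) | Training Certainty (%) | Validation Coverage (%) | Validation Certainty (%) |
|---|---|---|---|---|---|---|---|
| 1 | 2.495–2.625 | 3.215–4.005 | 1 | 3.23 | 100.00 | 12.50 | 100.00 |
| 2 | 2.22–2.37 | ≥1.58 | 1 | 12.90 | 100.00 | 12.50 | 66.67 |
| 3 | <2.095 | ≥1.58 | 1 | 9.68 | 100.00 | 12.50 | 100.00 |
| 4 | 1.985–2.22 | 0.87–1.785 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 5 | <2.575 | 1.58–1.895 | 1 | 12.90 | 100.00 | 0.00 | 0.00 |
| 6 | 2.99–3.04 | ≥1.785 | 1 | 16.13 | 100.00 | 12.50 | 40.00 |
| 7 | 3.065–3.145 | - | 1 | 12.90 | 100.00 | 0.00 | 0.00 |
| 8 | 2.755–2.89 | ≥2.96 | 1 | 6.45 | 100.00 | 6.25 | 100.00 |
| 9 | - | 0.53–0.87 | 1 | 6.45 | 100.00 | 6.25 | 50.00 |
| 10 | <2.545 | 2.445–3.215 | 1 | 9.68 | 100.00 | 31.25 | 83.33 |
| 11 | - | 1.305–1.395 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 12 | <2.67 | ≥4.125 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 13 | 1.8–1.895 | <1.175 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 14 | 3.295–3.355 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 15 | <1.315 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 16 | 2.095–2.22 | ≥2.01 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 17 | 2.425–2.89 | 2.69–2.96 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 18 | 3.16–3.295 | - | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 19 | 2.78–2.89 | 1.54–2.55 | 2 | 7.69 | 100.00 | 11.11 | 100.00 |
| 20 | <1.985 | 1.175–1.54 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 21 | 2.37–2.625 | 1.895–2.445 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 22 | - | 0.87–1.075 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 23 | <2.625 | ≥4.005 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 24 | <2.495 | ≥3.215 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 25 | - | 1.395–1.49 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 26 | 3.04–3.065 | - | 2 | 3.85 | 100.00 | 11.11 | 100.00 |
| 27 | ≥4.04 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 28 | ≥1.315 | <0.53 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 29 | 2.89–2.99 | - | 3 | 15.38 | 100.00 | 20.00 | 100.00 |
| 30 | 2.67–2.755 | - | 3 | 30.77 | 100.00 | 0.00 | 0.00 |
| 31 | 2.495–2.625 | 2.6–2.995 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 32 | 3.01–3.145 | <1.995 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 33 | - | 1.975–1.995 | 3 | 7.69 | 100.00 | 20.00 | 100.00 |
| 34 | 3.145–4.04 | ≥2.485 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |
| 35 | - | 2.55–2.6 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 36 | - | 1.54–1.58 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |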
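Table A4. Decision rules generated for reduct 2.

| No | Valence Connectivity Index | Wiener Index | Toxicity Class | Training Coverage (%) | Training Certainty (%) | Validation Coverage (%) | Validation Certainty (%) |
|---|---|---|---|---|---|---|---|
| 1 | 1.955–2.55 | <19 | 1 | 16.13 | 100.00 | 12.50 | 100.00 |
| 2 | 1.58–1.675 | - | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 3 | 2.8–3.215 | 34–63 | 1 | 16.13 | 100.00 | 12.50 | 66.67 |
| 4 | <2.69 | 34–52 | 1 | 16.13 | 100.00 | 18.75 | 50.00 |
| 5 | - | 24–28 | 1 | 9.68 | 100.00 | 6.25 | 100.00 |
| 6 | 1.785–1.895 | - | 1 | 9.68 | 100.00 | 6.25 | 100.00 |
| 7 | 1.49–1.515 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 8 | 0.53–0.87 | - | 1 | 6.45 | 100.00 | 6.25 | 50.00 |
| 9 | 1.305–1.395 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 10 | 4.125–4.97 | - | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 11 | 3.975–4.005 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 12 | 1.075–1.175 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 13 | 2.96–2.995 | <71 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 14 | <0.47 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 15 | 1.175–1.54 | <13 | 2 | 19.23 | 100.00 | 11.11 | 50.00 |
| 16 | 2.01–2.135 | 19–29 | 2 | 15.38 | 100.00 | 0.00 | 0.00 |
| 17 | 2.69–2.77 | - | 2 | 3.85 | 100.00 | 11.11 | 100.00 |
| 18 | ≥3.215 | 52–72 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 19 | 1.895–1.975 | ≥17 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 20 | ≥4.005 | 63–153 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 21 | 2.325–2.445 | - | 2 | 11.54 | 100.00 | 11.11 | 33.33 |
| 22 | 0.87–1.075 | - | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 23 | 2.88–2.905 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 24 | <2.55 | ≥63 | 2 | 7.69 | 100.00 | 11.11 | 100.00 |
| 25 | 1.395–1.49 | - | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 26 | ≥1.975 | <10 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 27 | 0.47–0.53 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 28 | 2.77–2.8 | - | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 29 | <3.975 | ≥84 | 3 | 38.46 | 100.00 | 0.00 | 0.00 |
| 30 | 2.55–2.6 | - | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 31 | 1.54–1.58 | - | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 32 | - | 30–32 | 3 | 15.38 | 100.00 | 20.00 | 50.00 |
| 33 | 1.975–1.995 | - | 3 | 7.69 | 100.00 | 20.00 | 100.00 |
| 34 | ≥6.085 | - | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 35 | 1.675–1.785 | - | 3 | 15.38 | 100.00 | 0.00 | 0.00 |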
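Table A5. Decision rules generated for reduct 3.

| No | Balaban Index | Boiling Point (°C) | Toxicity Class | Training Coverage (%) | Training Certainty (%) | Validation Coverage (%) | Validation Certainty (%) |
|---|---|---|---|---|---|---|---|
| 1 | <2.81 | 101–115 | 1 | 9.68 | 100.00 | 6.25 | 100.00 |
| 2 | ≥2.99 | 83–159 | 1 | 25.81 | 100.00 | 0.00 | 0.00 |
| 3 | 2.22–2.37 | <67 | 1 | 3.23 | 100.00 | 6.25 | 20.00 |
| 4 | 1.985–2.37 | ≥121 | 1 | 12.90 | 100.00 | 12.50 | 66.67 |
| 5 | - | 131–154 | 1 | 19.35 | 100.00 | 6.25 | 50.00 |
| 6 | 2.425–2.495 | ≥101 | 1 | 3.23 | 100.00 | 6.25 | 50.00 |
| 7 | ≥2.99 | 58–81 | 1 | 6.45 | 100.00 | 6.25 | 100.00 |
| 8 | <2.095 | 12–67 | 1 | 9.68 | 100.00 | 6.25 | 100.00 |
| 9 | 2.755–2.78 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 10 | 2.33–2.67 | 58–92 | 1 | 6.45 | 100.00 | 12.50 | 100.00 |
| 11 | <1.985 | ≥173 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 12 | 2.625–2.67 | - | 1 | 3.23 | 100.00 | 6.25 | 100.00 |
| 13 | <1.985 | 80–92 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 14 | - | 92–101 | 2 | 15.38 | 100.00 | 0.00 | 0.00 |
| 15 | 2.81–2.89 | <128 | 2 | 11.54 | 100.00 | 22.22 | 100.00 |
| 16 | ≥2.625 | 154–172 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 17 | 2.545–2.625 | ≥172 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 18 | - | 189–192 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 19 | - | 164–172 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 20 | ≥2.22 | <58 | 2 | 15.38 | 100.00 | 22.22 | 66.67 |
| 21 | 1.985–2.33 | 73–121 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 22 | 2.22–2.81 | 117–131 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 23 | 1.315–2.22 | 44–80 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 24 | - | 36–38 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 25 | ≥4.04 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 26 | - | <12 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 27 | 2.22–3.16 | 199–284 | 3 | 30.77 | 100.00 | 20.00 | 50.00 |
| 28 | 2.495–2.625 | 159–189 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |
| 29 | 3.01–3.025 | 81–116 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 30 | - | 67–73 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 31 | 2.67–2.755 | - | 3 | 30.77 | 100.00 | 0.00 | 0.00 |
| 32 | 3.355–4.04 | - | 3 | 7.69 | 100.00 | 20.00 | 100.00 |
| 33 | - | 33–36 | 3 | 7.69 | 100.00 | 20.00 | 100.00 |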
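Table A6. Decision rules generated for reduct 4.

| No | Valence Connectivity Index | Boiling Point (°C) | Toxicity Class | Training Coverage (%) | Training Certainty (%) | Validation Coverage (%) | Validation Certainty (%) |
|---|---|---|---|---|---|---|---|
| 1 | 1.955–2.69 | 58–115 | 1 | 19.35 | 100.00 | 18.75 | 60.00 |
| 2 | <2.135 | 121–159 | 1 | 12.90 | 100.00 | 0.00 | 0.00 |
| 3 | 2.55–2.69 | <159 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 4 | ≥2.905 | 101–159 | 1 | 12.90 | 100.00 | 12.50 | 100.00 |
| 5 | - | 104–115 | 1 | 16.13 | 100.00 | 6.25 | 100.00 |
| 6 | 1.785–1.895 | - | 1 | 6.25 | 100.00 | 6.25 | 100.00 |
| 7 | 1.58–1.675 | - | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 8 | - | 131–154 | 1 | 19.35 | 100.00 | 6.25 | 50.00 |
| 9 | <1.175 | ≥121 | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 10 | 4.125–4.97 | - | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 11 | 0.53–0.87 | - | 1 | 6.45 | 100.00 | 6.25 | 50.00 |
| 12 | ≥1.49 | <21 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 13 | <0.47 | - | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 14 | 1.895–2.55 | ≥154 | 1 | 11.54 | 100.00 | 22.22 | 50.00 |
| 15 | ≥2.01 | 21–58 | 2 | 15.38 | 100.00 | 0.00 | 0.00 |
| 16 | ≥2.05 | 189–198 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 17 | 0.47–1.54 | <101 | 2 | 26.92 | 100.00 | 22.22 | 40.00 |
| 18 | 1.515–2.445 | 117–131 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 19 | 4.97–6.085 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 20 | 1.895–1.955 | - | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 21 | <1.305 | 104–172 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 22 | ≥2.88 | <128 | 2 | 7.69 | 100.00 | 11.11 | 16.67 |
| 23 | - | 102–104 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 24 | 1.54–3.975 | ≥199 | 3 | 38.46 | 100.00 | 20.00 | 50.00 |
| 25 | 2.505–2.995 | 159–189 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |
| 26 | ≥1.58 | 81–83 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 27 | ≥1.995 | 115 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 28 | ≥2.485 | <73 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 29 | 1.975–1.995 | - | 3 | 7.69 | 100.00 | 20.00 | 100.00 |
| 30 | ≥6.085 | - | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 31 | - | 178–189 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |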
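Table A7. Decision rules generated for reduct 5.

| No | Wiener Index | Boiling Point (°C) | Toxicity Class | Training Coverage (%) | Training Certainty (%) | Validation Coverage (%) | Validation Certainty (%) |
|---|---|---|---|---|---|---|---|
| 1 | <24 | 101–115 | 1 | 9.68 | 100.00 | 0.00 | 0.00 |
| 2 | - | 83–92 | 1 | 9.68 | 100.00 | 0.00 | 0.00 |
| 3 | <24 | 58–78 | 1 | 12.90 | 100.00 | 12.50 | 100.00 |
| 4 | 34–63 | ≥104 | 1 | 32.26 | 100.00 | 25.00 | 57.14 |
| 5 | <24 | 121–154 | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 6 | 24–28 | - | 1 | 9.68 | 100.00 | 25.00 | 100.00 |
| 7 | ≥131 | <198 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 8 | - | 131–154 | 1 | 19.35 | 100.00 | 0.00 | 0.00 |
| 9 | <13 | ≥193 | 1 | 6.45 | 100.00 | 0.00 | 0.00 |
| 10 | <6 | <43 | 1 | 3.23 | 100.00 | 0.00 | 0.00 |
| 11 | 23–33 | 117–189 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 12 | - | 44–58 | 2 | 15.38 | 100.00 | 22.22 | 66.67 |
| 13 | ≥63 | 189–198 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 14 | <10 | 73–83 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 15 | <13 | 117–192 | 2 | 11.54 | 100.00 | 0.00 | 0.00 |
| 16 | ≥10 | 92–104 | 2 | 15.38 | 100.00 | 0.00 | 0.00 |
| 17 | 131–153 | - | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 18 | <32 | 117–121 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 19 | - | 92–101 | 2 | 15.38 | 100.00 | 0.00 | 0.00 |
| 20 | 72–83 | - | 2 | 3.85 | 100.00 | 11.11 | 100.00 |
| 21 | <23 | <32 | 2 | 7.69 | 100.00 | 0.00 | 0.00 |
| 22 | 17–33 | 36–78 | 2 | 19.23 | 100.00 | 11.11 | 100.00 |
| 23 | - | 154 | 2 | 3.85 | 100.00 | 0.00 | 0.00 |
| 24 | ≥30 | 115 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 25 | ≥153 | ≥199 | 3 | 23.08 | 100.00 | 0.00 | 0.00 |
| 26 | 45–131 | ≥199 | 3 | 15.38 | 100.00 | 20.00 | 100.00 |
| 27 | ≥13 | 81–83 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 28 | - | 159–163 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 29 | ≥30 | <73 | 3 | 7.69 | 100.00 | 0.00 | 0.00 |
| 30 | - | 178–189 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |
| 31 | - | ≥288 | 3 | 15.38 | 100.00 | 0.00 | 0.00 |
| 32 | - | 33–36 | 3 | 7.69 | 100.00 | 20.00 | 100.00 |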

References

  1. Future Business Insights. Industrial Solvents Market. In Market Research Report; Future Business Insights: Pune, India, 2019.
  2. National Institute of Occupational Safety and Health. Organic Solvent Neurotoxicity; NIOSH Current Intelligence Bulletin 48. U.S. Dept. of Health and Human Services, Public Health Service, Centers for Disease Control, National Institute for Occupational Safety and Health: Cincinnati, OH, USA, 1987.
  3. Tarrass, F.; Benjelloun, M. Health and environmental effects of the use of N-methyl-2-pyrrolidone as a solvent in the manufacture of hemodialysis membranes: A sustainable reflexion. Nefrología (Engl. Ed.) 2022, 42, 122–124.
  4. Vulimiri, S.V.; Pratt, M.M.; Kulkarni, S.; Beedanagari, S.; Mahadevan, B. Chapter 18—Reproductive and Developmental Toxicity of Solvents and Gases. In Reproductive and Developmental Toxicology, 3rd ed.; Gupta, R.C., Ed.; Academic Press: Cambridge, MA, USA, 2022; pp. 339–355.
  5. Ǎrija Baķe, M.; Eglite, M.; Martinsone, Ž.; Buiķe, I.; Piķe, A.; Sudmalis, P. Organic Solvents as Chemical Risk Factors of the Work Environment in Different Branches of Industry and Possible Impact of Solvents on Workers’ Health. Proc. Latv. Acad. Sci. Sect. B Nat. Exact Appl. Sci. 2010, 64, 25–32.
  6. Stauffer, E.; Dolan, J.A.; Newman, R. Chapter 7—Flammable and Combustible Liquids. In Fire Debris Analysis; Stauffer, E., Dolan, J.A., Newman, R., Eds.; Academic Press: Burlington, NJ, USA, 2008; pp. 199–233.
  7. Soni, V.; Singh, P.; Shree, V.; Goel, V. Effects of VOCs on Human Health. In Energy, Environment, and Sustainability; Springer Nature: Singapore, 2018; pp. 119–142.
  8. Pruthu, K. Organic Solvents-Health Hazards. J. Chem. Pharm. Sci. 2014, 3, 83–86.
  9. Institute of Medicine; Board on Health Promotion and Disease Prevention; Committee on Gulf War and Health: Literature Review of Pesticides and Solvents. Gulf War and Health: Volume 2: Insecticides and Solvents; National Academies Press: Washington, DC, USA, 2003; Volume 2.
  10. Canadian Centre for Occupational Health and Safety. What is a LD50 and LC50? Canadian Centre for Occupational Health and Safety: Hamilton, ON, Canada, 2023.
  11. Basak, S.C.; Mills, D.; Gute, B.D.; Grunwald, G.D.; Balaban, A.T. Applications of Topological Indices in the Property/Bioactivity/Toxicity Prediction of Chemicals. In Topology in Chemistry; Elsevier: Amsterdam, The Netherlands, 2002; pp. 113–184.
  12. Chemmangattuvalappil, N.G.; Eden, M.R. A Novel Methodology for Property-Based Molecular Design Using Multiple Topological Indices. Ind. Eng. Chem. Res. 2013, 52, 7090–7103.
  13. Bonchev, D. Applications of Topological Indices to QSAR. The Use of the Balaban Index and the Electropy Index for Correlations with Toxicity of Ethers on Mice. Acta Pharm. Jugosl. 1987, 37, 75–86.
  14. García-Domenech, R.; Alarcon-Elbal, P.; Bolas, G.; Bueno-Marí, R.; Chordá-Olmos, F.A.; Delacour, S.A.; Mouriño, M.C.; Vidal, A.; Gálvez, J. Prediction of acute toxicity of organophosphorus pesticides using topological indices. SAR QSAR Environ. Res. 2007, 18, 745–755.
  15. Kononenko, I.; Kukar, M. Machine Learning and Data Mining. In Machine Learning and Data Mining; Elsevier: Amsterdam, The Netherlands, 2007; pp. 1–36.
  16. Sivaprakasam, P.; Angamuthu, M. Generalized Z-Fuzzy Soft Β-Covering Based Rough Matrices and Its Application To Magdm Problem Based On Ahp Method. Decis. Mak. Appl. Manag. Eng. 2023, 6, 134–152.
  17. Ibrahim, H.; Anwar, S.A.; Ahmad, M.I. Classification of imbalanced data using support vector machine and rough set theory: A review. J. Phys. Conf. Ser. 2021, 1878, 12054.
  18. Juneja, M.; Walia, E.; Sandhu, P.S.; Mohana, R. Implementation and comparative analysis of rough set, Artificial Neural Network (ANN) and Fuzzy-Rough classifiers for satellite image classification. In Proceedings of the 2009 International Conference on Intelligent Agent & Multi-Agent Systems, Chennai, India, 22–24 July 2009; pp. 1–6.
  19. Albu, A.; Precup, R.E.; Teban, T.A. Results and challenges of artificial neural networks used for decision-making and control in medical applications. Facta Univ. Ser. Mech. Eng. 2019, 17, 285–308.
  20. Zhang, X.; Tian, Y.; Chen, L.; Hu, X.; Zhou, Z. Machine Learning: A New Paradigm in Computational Electrocatalysis. J. Phys. Chem. Lett. 2022, 13, 7920–7930.
  21. Omidvar, N.; Pillai, H.S.; Wang, S.H.; Mou, T.; Wang, S.; Athawale, A.; Achenie, L.E.; Xin, H. Interpretable Machine Learning of Chemical Bonding at Solid Surfaces. J. Phys. Chem. Lett. 2021, 12, 11476–11487.
  22. Pawlak, Z.; Skowron, A. Rudiments of rough sets. Inf. Sci. 2007, 177, 3–27.
  23. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
  24. Mahajan, P.; Kandwal, R.; Vijay, R. Rough Set Approach in Machine Learning: A Review. Int. J. Comput. Appl. 2012, 56, 1–13.
  25. Aviso, K.B.; Janairo, J.I.B.; Promentilla, M.A.B.; Tan, R.R. Prediction of CO2 storage site integrity with rough set-based machine learning. Clean Technol. Environ. Policy 2019, 21, 1655–1664.
  26. Chong, J.W.; Thangalazhy-Gopakumar, S.; Tan, R.R.; Aviso, K.B.; Chemmangattuvalappil, N.G. Estimation of fast pyrolysis bio-oil properties from feedstock characteristics using rough-set-based machine learning. Int. J. Energy Res. 2022, 46, 19159–19176.
  27. Heng, Y.P.; Lee, H.Y.; Chong, J.W.; Tan, R.R.; Aviso, K.B.; Chemmangattuvalappil, N.G. Incorporating Machine Learning in Computer-Aided Molecular Design for Fragrance Molecules. Processes 2022, 10, 1767.
  28. Cheun, J.-Y.; Liew, J.-Y.-L.; Tan, Q.-Y.; Chong, J.-W.; Ooi, J.; Chemmangattuvalappil, N.G. Design of Polymeric Membranes for Air Separation by Combining Machine Learning Tools with Computer Aided Molecular Design. Processes 2023, 11, 2004.
  29. Balaban, A.T. Highly discriminating distance-based topological index. Chem. Phys. Lett. 1982, 89, 399–404.
  30. Balaban, A.T.; Khadikar, P.V.; Supuran, C.T.; Thakur, A.; Thakur, M. Study on supramolecular complexing ability vis-à-vis estimation of pKa of substituted sulfonamides: Dominating role of Balaban index (J). Bioorg. Med. Chem. Lett. 2005, 15, 3966–3973.
  31. Pawlak, Z. Rough set theory and its applications. J. Telecommun. Inf. Technol. 2002, 3, 7–10.
  32. Vashist, R.; Vaishno, S.M.; Garg, M.L. Rule Generation based on Reduct and Core: A Rough Set Approach. Int. J. Comput. Appl. 2011, 29, 975–8887.
  33. Predki, B.; Słowiński, R.; Stefanowski, J.; Susmaga, R.; Wilk, S. ROSE—Software Implementation of the Rough Set Theory. In Rough Sets and Current Trends in Computing; Polkowski, L., Skowron, A., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 605–608.
  34. Grzymala-Busse, J.W. An Empirical Comparison of Rule Induction Using Feature Selection with the LEM2 Algorithm. In Advances on Computational Intelligence; Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 270–279.
  35. Balaban, A.T. Applications of Graph Theory in Chemistry. J. Chem. Inf. Comput. Sci. 1985, 25, 334–343.
  36. Bruckner, J.V.; Anand, S.S.; Warren, D.A. Toxic Effects of Solvents and Vapors. In Casarett & Doull’s Essentials of Toxicology, 3rd ed.; Klaassen, C.D., Watkins, J.B., III, Eds.; McGraw-Hill Education: New York, NY, USA, 2015.
  37. Kanu, I.; Anyanwu, E. Impact of hydrophobic pollutants’ behavior on occupational and environmental health. Sci. J. 2005, 5, 211–220.
  38. Nikolić, S.; Trinajstić, N.; Mihalić, Z. The Wiener Index: Development and Applications. Croat. Chem. Acta Ccacaa 1995, 68, 105–129.
Figure 1. Illustration of Rough Set Theory.
Figure 2. Proposed methodology to develop health performance predictive model.
Figure 3. Validation results for reduct 1.
Figure 4. Validation results for reduct 2.
Figure 5. Validation results for reduct 3.
Figure 6. Validation results for reduct 4.
Figure 7. Validation results for reduct 5.
Figure 8. Summary of the effects of conditional attributes on class 1 toxicity.
Figure 9. Summary of the effects of conditional attributes on class 2 toxicity.
Figure 10. Summary of the effects of conditional attributes on class 3 toxicity.
Table 1. Toxicity rating based on Hodge and Sterner scale [10].
| Toxicity Rating | Commonly Used Term | Inhalation LC50 (4 h exposure), ppm |
|---|---|---|
| 1 | Extremely Toxic | 10 or less |
| 2 | Highly Toxic | 10–100 |
| 3 | Moderately Toxic | 100–1000 |
| 4 | Slightly Toxic | 1000–10,000 |
| 5 | Practically Non-Toxic | 10,000–100,000 |
| 6 | Relatively Harmless | >100,000 |
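The rating scale in Table 1 amounts to a threshold lookup on the inhalation LC50. The following is a minimal sketch of that lookup, not code from the paper. The function name `hodge_sterner_rating` is hypothetical, and because the table's ranges share their end points, the sketch assumes each shared bound belongs to the more toxic (lower-numbered) rating.

```python
import bisect

# Upper LC50 bounds (ppm) of ratings 1 to 5 from Table 1.
# Assumption: a value equal to a bound falls in the more toxic rating,
# and anything above the last bound is rating 6.
UPPER_BOUNDS_PPM = [10, 100, 1_000, 10_000, 100_000]

TERMS = {
    1: "Extremely Toxic",
    2: "Highly Toxic",
    3: "Moderately Toxic",
    4: "Slightly Toxic",
    5: "Practically Non-Toxic",
    6: "Relatively Harmless",
}


def hodge_sterner_rating(lc50_ppm: float) -> int:
    """Return the 1-6 toxicity rating for a 4 h inhalation LC50 in ppm."""
    if lc50_ppm <= 0:
        raise ValueError("LC50 must be positive")
    # bisect_left counts how many bounds lie strictly below the value.
    return bisect.bisect_left(UPPER_BOUNDS_PPM, lc50_ppm) + 1


for value in (10, 50, 5_000, 250_000):
    rating = hodge_sterner_rating(value)
    print(value, rating, TERMS[rating])
```
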
Table 2. Example of a decision table.
| Object: Type of Chemical | Conditional Attribute: Boiling Point (°C) | Conditional Attribute: Wiener Index | Decision Attribute: Toxicity Class |
|---|---|---|---|
| Toluene | 100 | 2.5 | Class 1 |
Table 3. Simplified toxicity rating for the prediction model.
| Toxicity Rating | Commonly Used Term | Inhalation LC50 (4 h exposure), ppm |
|---|---|---|
| 1 | Extremely to Highly Toxic | <10–100 |
| 2 | Slightly Toxic | 100–100,000 |
| 3 | Non-Toxic | >100,000 |
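Comparing the LC50 ranges, Table 3 appears to merge ratings 1 and 2 of Table 1 into class 1, ratings 3 to 5 into class 2, and rating 6 into class 3. A minimal sketch of that three-class labelling is given below. The function name `simplified_class` is hypothetical, and assigning the shared bounds (100 and 100,000 ppm) to the more toxic class is an assumption the table does not settle.

```python
# Mapping from the six Hodge and Sterner ratings to the three simplified
# classes, inferred from the LC50 ranges in Tables 1 and 3.
RATING_TO_CLASS = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3}


def simplified_class(lc50_ppm: float) -> int:
    """Return the simplified toxicity class (1-3) for an LC50 in ppm."""
    if lc50_ppm <= 0:
        raise ValueError("LC50 must be positive")
    if lc50_ppm <= 100:        # assumption: 100 ppm counts as class 1
        return 1
    if lc50_ppm <= 100_000:    # assumption: 100,000 ppm counts as class 2
        return 2
    return 3


for value in (8, 750, 40_000, 150_000):
    print(value, "-> class", simplified_class(value))
```
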
Table 4. Simplified information table of organic solvent toxicity.
| Object: Organic Solvent | Conditional Attribute: Balaban Index | Conditional Attribute: Valence Connectivity Index | Conditional Attribute: Wiener Index | Conditional Attribute: Boiling Point (°C) | Decision Attribute: Toxicity Class |
|---|---|---|---|---|---|
| Acetyl Acetone | 3.32 | 2.12 | 48 | 138 | Class 1 |
| 1-Octanol | 2.60 | 4.02 | 120 | 195 | Class 2 |
| Acetophenone | 2.98 | 2.86 | 88 | 202 | Class 3 |
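The Wiener index column of Table 4 can be checked by hand: the index is the sum of shortest-path distances, counted in bonds, over all pairs of non-hydrogen atoms. The sketch below is an illustration rather than the authors' tooling. It assumes a hydrogen-suppressed graph with every bond counted as one edge regardless of bond order, and the atom numbering in the three bond lists is arbitrary.

```python
from collections import deque


def graph_from_bonds(bonds):
    """Build an undirected adjacency map from (atom, atom) bond pairs."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives bond-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


# 1-Octanol: eight carbons plus the hydroxyl oxygen form a 9-atom chain.
octanol = graph_from_bonds([(i, i + 1) for i in range(8)])
# Acetyl acetone: five-carbon chain 0-4 with oxygens 5 and 6 on carbons 1 and 3.
acetyl_acetone = graph_from_bonds(
    [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (3, 6)]
)
# Acetophenone: ring 0-5, carbonyl carbon 6 on ring atom 0, oxygen 7, methyl 8.
acetophenone = graph_from_bonds(
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7), (6, 8)]
)

print(wiener_index(acetyl_acetone), wiener_index(octanol), wiener_index(acetophenone))
# prints: 48 120 88
```

The three values agree with the Wiener index column of Table 4.
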
Table 5. Reduct sets and the number of rules generated.
| Reduct | Conditional Attributes | Number of Rules Generated |
|---|---|---|
| 1 | Balaban Index, Valence Connectivity Index | 36 |
| 2 | Valence Connectivity Index, Wiener Index | 34 |
| 3 | Balaban Index, Boiling Point | 33 |
| 4 | Valence Connectivity Index, Boiling Point | 31 |
| 5 | Wiener Index, Boiling Point | 32 |
Table 6. Decision rules in reduct 5.
| No. | Rule | Decision | Coverage | Certainty |
|---|---|---|---|---|
| 4 | (Wiener Index in [34, 63]) & (Boiling Point ≥ 104) | 1 | 32.26% | 100% |
| 22 | (Wiener Index in [17, 33]) & (Boiling Point in [36, 78]) | 2 | 19.23% | 100% |
| 25 | (Wiener Index ≥ 153) & (Boiling Point ≥ 199) | 3 | 23.08% | 100% |
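The coverage and certainty columns of Table 6 can be illustrated with the usual rough-set definitions: certainty is the share of objects matching the rule's conditions that belong to the decision class, and coverage is the share of objects in the decision class that match the conditions. The sketch below applies those definitions to rule 4 of Table 6 (an index in [34, 63] and a boiling point of at least 104 °C). The rows are hypothetical and are not taken from the paper's data set, so the printed figures are not expected to reproduce Table 6. The names `Solvent`, `rule_metrics` and `rule_4` are illustrative only.

```python
from typing import Callable, List, NamedTuple, Tuple


class Solvent(NamedTuple):
    index: float           # topological index used by the rule
    boiling_point: float   # degrees Celsius
    toxicity_class: int    # decision attribute (1, 2 or 3)


def rule_metrics(
    rows: List[Solvent],
    condition: Callable[[Solvent], bool],
    decision: int,
) -> Tuple[float, float]:
    """Return (coverage, certainty) of 'if condition then decision'."""
    matched = [r for r in rows if condition(r)]
    support = sum(1 for r in matched if r.toxicity_class == decision)
    in_class = sum(1 for r in rows if r.toxicity_class == decision)
    certainty = support / len(matched) if matched else 0.0
    coverage = support / in_class if in_class else 0.0
    return coverage, certainty


# Hypothetical rows for illustration only.
ROWS = [
    Solvent(40, 120, 1),
    Solvent(60, 150, 1),
    Solvent(20, 110, 1),
    Solvent(90, 200, 1),
    Solvent(25, 60, 2),
    Solvent(150, 210, 3),
]


def rule_4(row: Solvent) -> bool:
    # Closed interval on the index is an assumption about the bracket notation.
    return 34 <= row.index <= 63 and row.boiling_point >= 104


coverage, certainty = rule_metrics(ROWS, rule_4, decision=1)
print(f"coverage={coverage:.2%}, certainty={certainty:.2%}")
# prints: coverage=50.00%, certainty=100.00%
```

Here two of the four class 1 rows match the rule (coverage 50%), and every matching row is class 1 (certainty 100%).
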
Table 7. Conditional attributes for each class.
| Toxicity Classification | Conditional Attributes | Reduct |
|---|---|---|
| 1 | Balaban Index, Valence Connectivity Index | 1 |
| 2 | Wiener Index, Boiling Point | 5 |
| 3 | Balaban Index, Boiling Point | 3 |
Table 8. Comparison of index values among classes (reduct 1).
| No. | Rule | Decision | Coverage | Certainty |
|---|---|---|---|---|
| 8 | (A1 in [2.755, 2.89]) & (A2 ≥ 2.96) | 1 | 6.45% | 100% |
| 19 | (A1 in [2.78, 2.89]) & (A2 in [1.54, 2.55]) | 2 | 11.11% | 100% |
Table 9. Extracted rules on class 2 toxicity from reduct 5.
| No. | Rule |
|---|---|
| 13 | (A3 ≥ 63) & (A4 in [189, 198]) |
| 14 | (A3 < 10) & (A4 in [73, 83]) |
| 15 | (A3 < 13) & (A4 in [117, 192]) |
| 17 | (A3 in [131, 153]) |
| 20 | (A3 in [72, 83]) |
| 23 | (A4 = 154) |
Table 10. Extracted rules on class 3 toxicity from reduct 3.
| No. | Rule |
|---|---|
| 27 | (A1 in [2.22, 3.16]) & (A4 in [199, 284]) |
| 28 | (A1 in [2.495, 2.625]) & (A4 in [159, 189]) |
| 31 | (A1 in [2.67, 2.755]) |
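The interval rules in Table 10 can be read as simple predicates over a solvent's attributes. The sketch below encodes the three rules and tests them against the rows of Table 4. It assumes that A1 denotes the Balaban index and A4 the boiling point in °C (reduct 3 in Table 5 consists of exactly these two attributes) and that the bracketed intervals are closed at both ends. The names `CLASS_3_RULES` and `matching_rules` are hypothetical.

```python
from typing import Dict, List, Tuple

# Rule number -> {attribute: (lower bound, upper bound)}, from Table 10.
CLASS_3_RULES: Dict[int, Dict[str, Tuple[float, float]]] = {
    27: {"A1": (2.22, 3.16), "A4": (199, 284)},
    28: {"A1": (2.495, 2.625), "A4": (159, 189)},
    31: {"A1": (2.67, 2.755)},
}


def matching_rules(sample: Dict[str, float]) -> List[int]:
    """Return the numbers of all class 3 rules whose conditions hold."""
    return [
        number
        for number, conditions in CLASS_3_RULES.items()
        if all(low <= sample[attr] <= high for attr, (low, high) in conditions.items())
    ]


# Balaban index (A1) and boiling point (A4) taken from Table 4.
acetophenone = {"A1": 2.98, "A4": 202}    # labelled Class 3
octanol = {"A1": 2.60, "A4": 195}         # labelled Class 2
acetyl_acetone = {"A1": 3.32, "A4": 138}  # labelled Class 1

print(matching_rules(acetophenone))    # prints: [27]
print(matching_rules(octanol))         # prints: []
print(matching_rules(acetyl_acetone))  # prints: []
```

Only acetophenone, the class 3 entry in Table 4, triggers a class 3 rule, which is consistent with the labels in that table.
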
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hoo, W.Y.; Ooi, J.; Chemmangattuvalappil, N.G.; Chong, J.W.; Lim, C.H.; Eden, M.R. An Interpretable Predictive Model for Health Aspects of Solvents via Rough Set Theory. Processes 2023, 11, 2293. https://doi.org/10.3390/pr11082293
