Article

QSAR Study for Carcinogenic Potency of Aromatic Amines Based on GEP and MLPs

1 Department of Public Health, Qingdao University Medical College, Qingdao 266071, China
2 Modern Educational Technology Center, Qingdao University, Qingdao 266071, China
3 Institute for Computational Science and Engineering, Laboratory of New Fibrous Materials and Modern Textile, The Growing Base for State Key Laboratory, Qingdao University, Ningxia Road 308, Qingdao 266071, China
4 Department of Chemistry, Lanzhou University, Lanzhou 730000, China
* Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2016, 13(11), 1141; https://doi.org/10.3390/ijerph13111141
Submission received: 7 September 2016 / Revised: 21 October 2016 / Accepted: 24 October 2016 / Published: 15 November 2016

Abstract

A new analysis strategy was used to classify the carcinogenicity of aromatic amines. The physicochemical parameters of a compound are closely related to its carcinogenicity. The quantitative structure-activity relationship (QSAR) approach predicts the carcinogenicity of aromatic amines by revealing the relationship between carcinogenicity and physicochemical parameters. This study applied gene expression programming (GEP), run in APS software, and multilayer perceptrons (MLPs), run in Weka software, to predict the carcinogenicity of aromatic amines. Both methods relied on molecular descriptors calculated with CODESSA software, and eight molecular descriptors were selected to build the models. The accuracy of gene expression programming was 0.92 on the training set and 0.82 on the test set; the accuracy of multilayer perceptrons was 0.84 on the training set and 0.74 on the test set. Gene expression programming was therefore more accurate than multilayer perceptrons on both the training set and the test set. QSAR is an efficient method for identifying carcinogenic compounds.

1. Introduction

Aromatic amines (AAs) are indispensable materials in the synthesis of azo colorants, which have strong tinting strength, bright color, and durability. Azo colorants have therefore been widely applied in the textile industry, food additives, cosmetics, and plastics [1,2]. In daily life, people can come into contact with AAs at any time through brightly dyed clothes, colored food, and polluted air and water. The main routes by which AAs enter the body are skin contact and the digestive tract [3]. Some AAs are confirmed or suspected human carcinogens. The enzyme P450 can help convert AAs into arylnitrenium ions in the body, which bind to the C8 position of guanine in DNA.
Through extended exposure to these compounds, the structure of DNA is changed and malignant tumors can appear, leading to bladder, ureteral, renal, and pelvic carcinoma and other malignant diseases [4,5,6]. European Commission Regulation 552/2009/EC bans the use of carcinogenic AAs in textile and leather articles [7]. With the rapid development of the chemical industry, a large number of compounds are produced and used. These compounds are eventually distributed in the environment through their various uses and strongly influence environmental and human health [8,9].
Because of the high carcinogenicity of AAs, recognizing the toxicity and carcinogenicity of new AAs has special significance in toxicology, and it is very important to assess the safety risk of compounds. However, assaying such a large number of compounds experimentally is a huge undertaking. Toxicity testing of new compounds is harmful to experimental animals, and some experiments even violate ethical requirements [10]. It is therefore necessary to develop a simple, fast, and accessible approach for assessing the safety risk of compounds. The quantitative structure-activity relationship (QSAR) method can not only quickly establish a reliable predictive model, but also reveal the mechanism by which a toxicant damages the body and provide reference information for designing and synthesizing safer, eco-friendly green compounds [11]. A previous study [12] used combined quantum mechanics/molecular mechanics (QM/MM) computations to explore the detoxifying mechanism of agGSTe2 toward DDT. In this study, all AAs were randomly divided into a training set and a test set, and prediction models were then built from the molecular descriptors of the AAs.
In the last two decades, many scholars have solved prediction problems by building bio-inspired mathematical models, with impressive results [13]. Our goal is a stable and rapid classification model. Gene expression programming (GEP), introduced by Ferreira [14], is an automatic programming approach that overcomes certain limitations of genetic algorithms and genetic programming by working with two elements, the chromosome and the expression tree [15]. The advantage of GEP in designing decision trees makes it a successful method for solving classification problems [16,17]. Each physical and chemical parameter of an AA acts as a gene unit in gene expression programming, and the algorithm weaves these units into a multivariate nonlinear equation. GEP shows clear advantages in carcinogenicity classification. The multilayer perceptron (MLP) is a biologically inspired computational tool for solving pattern recognition problems and is efficient at recognizing previously trained patterns. The ability of neural networks to handle multiple inputs and multiple outputs enables parallel data processing and self-learning [13,18]. The parameters, acting as neurons, are linked by mathematical functions into a network that divides compounds into carcinogens and non-carcinogens. In the current research, GEP and MLPs are used as new analysis strategies for classifying the carcinogenicity of AAs. Compared with MLPs, the proposed GEP performs better in predicting the carcinogenic potency of a set of AA samples.

2. Methodology

2.1. Source of AA Data

Compounds were taken from the literature [19], where molecular structures and carcinogenicity data are available. Twenty-five compounds containing ionic pentavalent nitrogen atoms or hexavalent sulfur atoms were eliminated because their physical and chemical parameters could not be computed, leaving 128 fused-ring aromatic amines (including heterocyclic compounds). Carcinogens are coded as 1 and non-carcinogens as 0. Carcinogenic activity is indicated by rat liver tumors. Random allocation was used so that every compound had the same chance of being assigned to the training set or the test set. Each compound was given a code from 1 to 128, and 35 random numbers were then generated in IBM SPSS 19.0 software (IBM Corporation, Chicago, IL, USA). A compound whose code matched one of the random numbers was assigned to the test set. In this way, the 128 compounds were divided into a training set of 93 compounds (Table 1) and a test set of 35 compounds (Table 2). The test set is used to evaluate the stability of the QSAR model.
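The random allocation described above can be sketched in a few lines of Python. This is only an illustration: the study used IBM SPSS, so the random number generator, the seed, and the function name `split_compounds` are assumptions, and the resulting codes will not match Tables 1 and 2.

```python
import random

def split_compounds(n_compounds=128, n_test=35, seed=0):
    """Draw n_test distinct codes for the test set; the rest form the training set."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    codes = list(range(1, n_compounds + 1))  # compounds are coded 1..128
    test_codes = sorted(rng.sample(codes, n_test))  # sampling without replacement
    test_lookup = set(test_codes)
    train_codes = [code for code in codes if code not in test_lookup]
    return train_codes, test_codes

train_codes, test_codes = split_compounds()
print(len(train_codes), len(test_codes))
```

Sampling without replacement guarantees exactly 35 distinct test compounds, so the training set always contains the remaining 93.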

2.2. Calculation of Molecular Descriptors

In the QSAR model, the molecular structure of each compound is represented by its physical and chemical parameters so that numerical equations can be established. All AA structures were drawn in ChemDraw. Geometry optimization was first carried out in HyperChem 7.5 software (HyperCube Inc., Gainesville, FL, USA) using the MM+ molecular mechanics force field. A more precise optimization was then obtained with the semi-empirical AM1 method in MOPAC. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.01 [20]. The HIN files were generated by the geometry optimization, and the MNO files were generated by the MOPAC calculation. The CODESSA program (Semichem, Shawnee, KS, USA) provides five classes of descriptors: constitutional, topological, geometrical, electrostatic, and quantum-chemical. Semi-empirical quantum chemistry methods are based on the Hartree-Fock formalism but make some approximations and take some parameters from empirical data. They are well suited to computing physicochemical properties of large molecules, and the semi-empirical AM1 calculation has proved successful in QSAR studies. Constitutional descriptors reflect the molecular composition of a compound without using its geometry or electronic structure; they include the number of atoms, molecular weight, and average atomic weight. Topological descriptors describe the atomic connectivity in the molecule; they include the Wiener index and the information content index and its derivatives. Geometrical descriptors provide information about the size of the molecule and require the 3D coordinates of its atoms; they include shadow indices and molecular volume. Electrostatic descriptors reflect the charge distribution of the molecule; they include charged partial surface area descriptors and partial positive surface area.
Quantum-chemical descriptors add important information to the conventional descriptors; they include the HOMO-LUMO energy gap and reactivity indices. The descriptors needed for the model were selected by preprocessing according to the following three rules [21]: (1) the parameter is common to the vast majority of the compounds; (2) for all the compounds, the numerical value of the descriptor decreases; (3) the correlation coefficient of any two variables should be <0.8. If any two descriptors have a correlation coefficient of 0.8 or higher, one of them should be removed; otherwise, the prediction efficiency of the QSAR model will be reduced. Alternatively, well-established statistical projection techniques such as PLS [22] or PCA [23] could be used to construct uncorrelated variables.
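Rule (3) can be illustrated with a small correlation filter. The sketch below is not the CODESSA or SPSS procedure; the greedy keep-first strategy, the function names, and the toy descriptor values are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.8):
    """Keep a descriptor only if |r| < threshold with every descriptor already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical descriptor values for six compounds (not taken from the study).
toy_descriptors = {
    "n_carbon": [6, 7, 8, 10, 12, 13],
    "mol_weight": [93, 107, 121, 143, 169, 183],  # nearly proportional to n_carbon
    "lumo": [0.5, -0.2, 0.4, -0.1, 0.3, 0.0],
}
selected = drop_correlated(toy_descriptors)
print(selected)
```

In this toy data the molecular weight tracks the carbon count almost linearly, so it is dropped, while the weakly correlated LUMO column is kept. Which member of a correlated pair to remove is a modeling decision; keeping the first one encountered is only the simplest option.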

2.3. Theory of Gene Expression Programming

Gene expression programming (GEP) is a data mining algorithm based on the structure and function of biological genes [24]. It inherits the advantages of both the genetic algorithm (GA) and genetic programming (GP) while eliminating some of their limitations. GEP uses fixed-length linear strings (chromosomes) to solve complex problems; these are expressed as nonlinear expression trees of different shapes and sizes when their fitness is evaluated [25]. The search space of GEP is separated from the solution space, which allows an unconstrained search of the genome space and makes it possible to solve classification problems with a simple coding.
Each GEP gene consists of a head and a tail. Head elements are drawn from both the function set and the terminal set, whereas tail elements are drawn only from the terminal set. The head length h is not strictly limited; it is chosen according to the number of terminals (such as a, b, c, 1, 2…) and the function set (such as sin, tan…). A common function set is F = {+, −, ×, ÷, Q}, where Q represents the square root function. The tail contains only terminals. The tail length t is computed as t = h(n − 1) + 1, where n is the largest number of arguments taken by any function in the function set. The chromosomes function as a genome: after being modified by mutation, transposition, root transposition, gene transposition, gene recombination, and one-point and two-point recombination, they are translated into expression trees. Figure 1 shows one of the simplest expression trees, which can be processed into the QSAR formula F = b ( a + ( c d ) ). Different operations on the parameters were tried to set up various models until the best results were obtained. The use of complex functions can improve the predictive ability of the QSAR model.
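The head and tail structure can be made concrete with a small decoder. The sketch below computes the tail length and evaluates a single gene by reading it breadth-first into an expression tree. It is a minimal illustration, not the APS implementation; the gene `Q*+-abcda`, the terminal values, and the function names are assumptions chosen for the example.

```python
import math

# Number of arguments taken by each function symbol; Q is the square root.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}

def tail_length(head_length, max_arity):
    """Tail length t = h(n - 1) + 1."""
    return head_length * (max_arity - 1) + 1

def evaluate_gene(gene, values):
    """Decode a gene breadth-first into an expression tree and evaluate it."""
    # Find how many leading symbols the expression tree actually uses.
    needed, used = 1, 0
    while used < needed:
        needed += ARITY.get(gene[used], 0)
        used += 1
    symbols = gene[:used]

    # Assign children level by level: each function takes the next free symbols.
    children, next_free = [], 1
    for symbol in symbols:
        k = ARITY.get(symbol, 0)
        children.append(list(range(next_free, next_free + k)))
        next_free += k

    def node_value(i):
        symbol = symbols[i]
        args = [node_value(c) for c in children[i]]
        if symbol == "+":
            return args[0] + args[1]
        if symbol == "-":
            return args[0] - args[1]
        if symbol == "*":
            return args[0] * args[1]
        if symbol == "/":
            return args[0] / args[1]
        if symbol == "Q":
            return math.sqrt(args[0])
        return values[symbol]  # terminals look up their numeric value

    return node_value(0)

head = "Q*+-"   # h = 4, functions allowed
tail = "abcda"  # t = 4 * (2 - 1) + 1 = 5, terminals only
gene_value = evaluate_gene(head + tail, {"a": 1.0, "b": 3.0, "c": 5.0, "d": 1.0})
print(tail_length(len(head), 2), gene_value)
```

The example gene encodes sqrt((a + b) * (c - d)); the last tail symbol is never read. The tail length formula guarantees that any head, however many functions it contains, can be completed with terminals into a valid tree.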
It is important that individuals are selected and copied into the next generation according to the fitness function. The advantage of this kind of fitness function is that the system can find the optimal solution by itself. The fitness function [26] is calculated by Equations (1)–(3):
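$$\mathrm{fitness}(R) = \begin{cases} 0, & \text{if } \mathrm{consig}(R) < 0 \\ \mathrm{consig}(R) \cdot \ln\left(\mathrm{compl}(R) - 1\right), & \text{otherwise} \end{cases} \tag{1}$$
$$\mathrm{consig}(R) = \left(\frac{p}{p+n} - \frac{P}{P+N}\right) \cdot \frac{P}{P+N} \tag{2}$$
$$\mathrm{compl}(R) = \frac{p}{P} \tag{3}$$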
Here p, n, P, and N are the number of all the positive compounds, the number of all the negative compounds, the number of positive compounds in the training set, and the number of negative compounds in the training set, respectively.
For a two-class prediction problem, only one GEP classification rule (R) is needed. Each instance is evaluated with the GEP rule: if the result is positive, the instance is assigned to one class; otherwise, it is assigned to the other class. The exact representation is as follows:
  • If GEP_Rule (X) > 0 Then X ∈ class A
  • Else X ∈ class B
  • X stands for the properties of the instance.
The classification process consists of decoding each chromosome, calculating its fitness, performing the genetic operations, and updating the chromosomes. This process is repeated for a pre-established number of generations or until the best model has been found [20]. The flow chart of the GEP classification algorithm is shown in Figure 2.
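The sign-based decision rule can be sketched as follows. The rule `toy_rule` is a hypothetical stand-in for an evolved GEP expression and is not the model reported in this paper; the instance values are invented for the example.

```python
import math

def toy_rule(x):
    """Hypothetical GEP-style expression over an instance's properties."""
    return x[0] - 2.0 * math.sin(x[1])

def classify(rule, x):
    """Assign class A when the rule value is positive, class B otherwise."""
    return "A" if rule(x) > 0 else "B"

def rule_accuracy(rule, instances, labels):
    """Fraction of instances whose predicted class equals the known label."""
    hits = sum(classify(rule, x) == y for x, y in zip(instances, labels))
    return hits / len(instances)

instances = [[3.0, 0.0], [0.0, math.pi / 2], [1.0, 0.1]]
labels = ["A", "B", "A"]
print([classify(toy_rule, x) for x in instances], rule_accuracy(toy_rule, instances, labels))
```

During evolution, each candidate chromosome would be decoded into such a rule and scored on the training set, and the best-scoring rule would then be applied unchanged to the test set.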

2.4. Multilayer Perceptrons (MLPs)

An artificial neural network (ANN) is modeled on the structure and function of biological neural networks; it simplifies, abstracts, and simulates them. ANNs have been widely and successfully used in classification, prediction, associative memory, pattern recognition, and other fields. What distinguishes an MLP is that some of its neurons use a nonlinear activation function, which was developed to model the frequency of action potentials, or firing, of biological neurons in the brain. Weka software provides a multilayer perceptron artificial neural network. The use of the back-propagation algorithm gives MLPs wider application than other artificial neural networks. Figure 3 shows the structure of an MLP.
The size of the input layer is determined by the dimensions of the objects, and the received signal is transmitted directly to the hidden layers. The size of the hidden layers cannot be calculated from an exact analytical formula and is usually determined by experience. In Weka, the universal symbol “a” for the hidden layers stands for a = (attribs + classes)/2. Signals are passed between the nodes of the hidden layer and the output layer through an activation function [27]. The basic idea of classifying the carcinogenicity of AAs with MLPs is that samples with known results are used to train the network, and the carcinogenicity of compounds can then be identified by the trained network.
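For the data in this study, the “a” setting can be worked out directly from the eight descriptors and the two classes. The helper below is a sketch; the function name is hypothetical, and the use of integer division is an assumption about how Weka rounds the result.

```python
def weka_hidden_nodes_a(n_attribs, n_classes):
    """Hidden-layer size for the 'a' setting: (attribs + classes) / 2."""
    # Integer division is assumed here; it makes no difference when the sum is even.
    return (n_attribs + n_classes) // 2

hidden_nodes = weka_hidden_nodes_a(n_attribs=8, n_classes=2)
print(hidden_nodes)
```

With eight input descriptors and two output classes, this gives a single hidden layer of five nodes.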

2.5. Platform of Weka

Waikato Environment for Knowledge Analysis (Weka) was developed by Ian H. Witten and Eibe Frank of the University of Waikato and is written in Java. As professional data mining software, Weka contains almost all of the classification methods used in machine learning [28,29]. Researchers without a good background in data mining usually find it difficult to preprocess complex data. Weka provides a unified interface for users and removes the need for manual programming in data analysis. Weka can not only apply a single classification algorithm to a data set, but also combine several algorithms for prediction. To our knowledge, the rationale and complexity of a classification algorithm can affect the accuracy of its predictions. Therefore, we chose different algorithms and tested both GEP and MLPs, so that we could obtain better prediction results and provide a good model.

3. Results and Discussion

3.1. Significance of the Descriptors

Number of carbon atoms (NCOS): The number of benzene rings is associated with carcinogenicity, and a growing number of C atoms increases the incidence of cancer [30]. In addition, the binding of methyl groups to DNA can change the conformation of the double helix and affect transcription, which alters tumor suppressor genes, and the resulting gene mutations increase the risk of cancer [31]. Using the number of C atoms in nitrobenzene as a descriptor to build the QSAR model therefore has important significance.
Number of nitrogen atoms (NNOS): The metabolic activation sites of aromatic amines are on the amino N atoms. Preliminary metabolic activation occurs in the liver and includes N-oxidation catalyzed by cytochrome P450 1A2 and N-acetylation catalyzed by acetylating enzymes. This process produces N-hydroxy derivatives. The aryl amines generated by oxidation can form adducts with DNA in urinary tract epithelial cells, which likewise changes the structure of DNA. An N-O-sulfate ester is formed by sulfate transfer to the N-hydroxy group. Alternatively, reaction of the N-hydroxy group with acetyltransferase produces an N-O-acetate ester. The unstable N-O-sulfate and N-O-acetate esters generate nitrogen ions on hydrolysis, which can combine with DNA bases through nucleophilic reactions [30,32]. Highly activated free-radical nitrogen ions cause mutations in normal cells.
Kier flexibility index (KFBI), Balaban index (BBI), structural information content index (order 0) (SICI), and topographic electronic index (all bonds) (TEIA) are topological descriptors. As a structural characterization, the molecular connectivity index provides an intuitive, quantitative description of molecular structure in terms of molecular size, shape, the connection sequence of chemical bonds, and branching. Topological descriptors express structural differences between molecules quantitatively as functions of molecular connectivity, so different topological values represent different molecular structures [32,33,34]. These four molecular descriptors are closely connected with the carcinogenicity of AAs and can be used in the QSAR model.
Polarity parameter (PLPT) is closely related to the solubility of molecules. The lower the polarity of a compound, the larger its lipid-water partition coefficient and the higher its lipid solubility. Such compounds easily cross the lipid bilayer by simple diffusion and accumulate in adipose tissue. Highly polar compounds have better water solubility. Water solubility directly affects toxicity and the target organ [35]. The polarity of an aromatic amine determines its metabolism time in the body. The polarity parameter is therefore a crucial discriminant factor.
LUMO energy (LUMO): The lower the LUMO energy, the more conducive it is to electrophilic reaction. Electrophilic reagents are related to the carcinogenicity of AA compounds. AAs can be converted into electrophilic reagents carrying a partial or full positive charge under the action of cytochrome P-450 or other oxidases [36]. Electron-rich atoms in nucleophilic reagents react easily with the electrophilic reagents by sharing electrons. AAs acting as promoters can enhance the carcinogenic effect of other toxicants.
The correlations among the eight descriptors were calculated with SPSS 20.0; the correlation coefficient between any two variables is <0.8 (Table 3). This means that no two variables are highly correlated or redundant in the GEP models, so all eight parameters could be used in the QSAR study.

3.2. Results of GEP

The 128 compounds include 35 carcinogenic and 93 non-carcinogenic compounds. The numbers of carcinogenic and non-carcinogenic compounds in the training set are 24 and 64, respectively. The function set was {+, −, ×, ÷, Mod, Exp, Log, Sin, Tan}, and the eight descriptors were used to build the GEP model in Automatic Problem Solver 3.0 (Gepsoft Limited Company, Bristol, UK). It takes about 25 min to select the optimal model. The prediction result for each compound, the accuracy, the positive predictive value, and the negative predictive value (Table 4) are given by APS. We converted the C++ function into Equation (4).
F ( x ) = x 1 + tan [ log ( x 8 + x 5 ) x 1 · x 6 x 3 + x 2 ] + tan ( x 5 + x 3 ) mod [ log ( x 6 · x 2 ) , log x 8 ] + x 5 + tan ( x 2 + x 3 + x 7 x 1 ) + tan ( x 4 + x 8 x 1 ) + tan { exp [ log ( x 5 mod ( x 3 , x 1 ) ) + x 5 ] }
The variables x1, x2, x3, x4, x5, x6, x7, and x8 represent the number of C atoms, number of N atoms, Kier flexibility index, Balaban index, structural information content index (order 0), topographic electronic index (all bonds), polarity parameter, and LUMO energy, respectively.
This is a complex nonlinear function, but its classification results are good: the accuracies for the training set and the test set are 0.92 and 0.82, respectively.

3.3. The Results of MLPs

The hidden layers were set to “a”, the training time to 500, and the validation threshold to 20. The test set is the same as that used for GEP. The training set is used to adjust the parameters of the model, and the test set is used to evaluate its predictive power. MLPs use the back-propagation algorithm and keep adjusting the weights during training to minimize the global error.
The prediction accuracy of MLPs for the carcinogenicity of aromatic amines is 0.84 for the training set and 0.74 for the test set. Grid squares represent incorrect predictions. The margin curve intuitively reflects the quality of the classification results. The margin is the difference between the predicted probability of the actual class and the highest predicted probability among the wrong classes. The vertical axes represent the sequence numbers of the AAs, and the horizontal axes represent the margin values. The closer the margin of a sample is to 1, the better the classification. Figure 4 and Figure 5 show that the vast majority of margin values are close to 1, which indicates that MLPs can accurately predict the carcinogenicity of AAs. The results of MLPs are given by Weka (Table 4). The running time was 0.08 s for the training set and 0.20 s for the test set.
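The margin of a single prediction can be computed as below. This is a minimal sketch of the definition given above, not Weka's own code; the function name and the example probabilities are hypothetical.

```python
def prediction_margin(probabilities, actual_class):
    """Probability of the actual class minus the highest probability of any wrong class."""
    p_actual = probabilities[actual_class]
    p_wrong = max(p for i, p in enumerate(probabilities) if i != actual_class)
    return p_actual - p_wrong

# Class 0 = non-carcinogen, class 1 = carcinogen (coding used in this study).
confident_correct = prediction_margin([0.05, 0.95], actual_class=1)
wrong_prediction = prediction_margin([0.70, 0.30], actual_class=1)
print(confident_correct, wrong_prediction)
```

A margin near 1 means the network placed almost all probability on the true class, a margin near 0 means it was undecided, and a negative margin means the sample was misclassified.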

3.4. Comparison between GEP and MLPs

The purpose of this study is to establish a precise prediction model that accurately identifies potential carcinogens among AAs. Prediction of carcinogenic compounds has been rare in previous studies. The GEP model, which is inspired by biological gene expression, could accurately identify the carcinogenicity of AAs. The performance assessment of the classification algorithms in Table 4 uses the recognized indicators accuracy, sensitivity, specificity, and Youden’s index, calculated with Equations (5)–(7). These indexes are taken from the “screening test” of epidemiology. Screening tests have been widely employed to find potential patients so that medical help can be provided in time. The indexes (accuracy, sensitivity, specificity, and Youden’s index) show the reliability of a screening test. Our study combined QSAR with screening test methods from epidemiology.
$$\text{sensitivity} = \frac{A}{A + C} \tag{5}$$
$$\text{specificity} = \frac{D}{B + D} \tag{6}$$
$$\text{Youden's index} = (\text{sensitivity} + \text{specificity}) - 1 \tag{7}$$
where A and C are the numbers of carcinogenic compounds predicted correctly and wrongly by the QSAR model, and B and D are the numbers of non-carcinogenic compounds predicted wrongly and correctly by the QSAR model, respectively.
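Equations (5)–(7) can be checked with a short calculation. The sketch below reads A, B, C, and D as the cells of the usual screening-test 2×2 table (true positives, false positives, false negatives, true negatives), which is an assumption that matches the form of the equations; the counts are hypothetical and are not the values behind Table 4.

```python
def screening_metrics(A, B, C, D):
    """Screening-test indexes from a 2x2 table.

    A: carcinogens predicted as carcinogens      (true positives)
    B: non-carcinogens predicted as carcinogens  (false positives)
    C: carcinogens predicted as non-carcinogens  (false negatives)
    D: non-carcinogens predicted as non-carcinogens (true negatives)
    """
    sensitivity = A / (A + C)
    specificity = D / (B + D)
    return {
        "accuracy": (A + D) / (A + B + C + D),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }

# Hypothetical counts for a 30-compound set.
metrics = screening_metrics(A=8, B=2, C=2, D=18)
print(metrics)
```

Youden's index ranges from -1 to 1; a value of 0 means the classifier does no better than chance, so it is a useful single-number summary when the two classes are unbalanced.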
GEP performed better than MLPs in this study. This is mainly because the GEP algorithm constructs an adaptive function through its own evolution and establishes a nonlinear relationship between the descriptors and carcinogenicity. Owing to its unique coding and genetic operations, GEP has a remarkable ability to predict the carcinogenicity of AAs, and the resulting expression reflects differences between compounds in more detail. GEP gives an explicit predictive expression, while MLPs provide only prediction results. However, the GEP model is a complex nonlinear function, and the process of establishing the model is complicated.
The results of MLPs in this study are not as good as those of GEP. GEP often takes a long time to reach a satisfactory result, whereas the MLP run time is within one second. For MLPs, there are no universal rules that specify how to set up the training method, build the network structure, and select the parameters. A “trial and error” approach was adopted, in which a large number of neural networks were tested until an optimal result was obtained. The network structure and parameter settings are usually chosen from personal experience [37]. In addition, MLPs cannot accurately reflect the nonlinear relationship between multiple parameters. MLPs can be applied to existing AA carcinogenicity data, but they cannot give a mathematical expression of the model with which to predict the properties of new compounds. The probability is obtained by assuming that the properties are independent of one another, which is not the case in practice and may lower the accuracy. Although the computational time of GEP is much longer than that of MLPs, forecasting accuracy is more important as long as the computational time stays within an acceptable range.

4. Conclusions

The study of the carcinogenicity of compounds is essential in toxicology. The structure of a chemical compound is the basis of its toxicity and affects the metabolism of toxic chemicals in the body. QSAR is an innovative approach to predicting the carcinogenicity of AAs. QSAR can evaluate the superiority of the experimental group and give valuable information for risk assessment. In this study, the computational time of MLPs was shorter than that of GEP, but the forecasting ability of GEP was better than that of MLPs. The unique advantage of GEP is that it can establish a mathematical model to predict the toxicity of new compounds. In the design of AA compounds, certain structures can be added or removed to reduce carcinogenic potential. Thus, GEP is a promising research direction in toxicology.

Author Contributions

Fucheng Song, Anling Zhang, Hui Liang and Lianhua Cui conceived and designed the experiments; Fucheng Song performed the experiments; Wenlian Li and Hongzong Si analyzed the data; Yunbo Duan and Honglin Zhai contributed reagents/materials/analysis tools; Fucheng Song and Hongzong Si wrote the paper. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Harding, A.P.; Popelier, P.L.A.; Harvey, J.; Giddings, A.; Foster, G.; Kranz, M. Evaluation of aromatic amines with different purities and different solvent vehicles in the Ames test. Regul. Toxicol. Pharm. 2015, 71, 244–250.
  2. Garrigós, M.C.; Reche, F.; Marín, M.L.; Pernías, K.; Jiménez, A. Optimization of the extraction of azo colorants used in toy products. J. Chromatogr. A 2002, 963, 427–433.
  3. Sanchis, Y.; Coscollà, C.; Roca, M.; Yusà, V. Target analysis of primary aromatic amines combined with a comprehensive screening of migrating substances in kitchen utensils by liquid chromatography-high resolution mass spectrometry. Talanta 2015, 138, 290–297.
  4. Ewald, R.; Thomas, M.; Joerg, D.; Lynda, C.; Regina, S. Heterocyclic aromatic amines and their contribution to the bacterial mutagenicity of the particulate phase of cigarette smoke. Toxicol. Lett. 2016, 243, 40–47.
  5. Wellner, T.; Lüersen, L.; Schaller, K.H.; Angerer, J.; Xler, H.; Korinth, G. Percutaneous absorption of aromatic amines—A contribution for human health risk assessment. Food Chem. Toxicol. 2008, 46, 1960–1968.
  6. Inami, K.; Okazawa, M.; Mochizuki, M. Mutagenicity of aromatic amines and amides with chemical models for cytochrome p450 in Ames assay. Toxicol. In Vitro 2009, 23, 986–991.
  7. Akceylan, E.; Bahadir, M.; Yılmaz, M. Removal efficiency of a calix[4]arene-based polymer for water-soluble carcinogenic direct azo dyes and aromatic amines. J. Hazard. Mater. 2009, 162, 960–966.
  8. Bratberg, M.; Olsvik, P.A.; Edvardsen, R.B.; Brekken, H.K.; Vadla, R.; Meier, S. Effects of oil pollution and persistent organic pollutants (POPs) on glycerophospholipids in liver and brain of male Atlantic cod (Gadus morhua). Chemosphere 2013, 90, 2157–2171.
  9. Merwe, J.P.V.D.; Hodge, M.; Olszowy, H.A.; Whittier, J.M.; Lee, S.Y. Using blood samples to estimate persistent organic pollutants and metals in green sea turtles (Chelonia mydas). Mar. Pollut. Bull. 2010, 60, 579–588.
  10. Sondra, S.T.; Corrella, S.D. The biomechanisms of metal and metal-oxide nanoparticles’ interactions with cells. Int. J. Environ. Res. Public Health 2015, 12, 1112–1134.
  11. Sama, A.; Ayoub, K.J. Verifying the performance of artificial neural network and multiple linear regression in predicting the mean seasonal municipal solid waste generation rate: A case study of Fars province, Iran. Waste Manag. 2016, 48, 14–23.
  12. Li, Y.W.; Shi, X.L.; Zhang, Q.Z.; Hu, J.T.; Chen, J.M.; Wang, W.X. Computational evidence for the detoxifying mechanism of epsilon class glutathione transferase toward the insecticide DDT. Environ. Sci. Technol. 2014, 48, 5008–5016.
  13. Uysala, M.; Tanyildizi, H. Estimation of compressive strength of self compacting concrete containing polypropylene fiber and mineral additives exposed to high temperature using artificial neural network. Constr. Build. Mater. 2012, 27, 404–414.
  14. Ferreira, C. Gene expression programming: A new adaptive algorithm for solving problems. Complex Syst. 2001, 1, 87–129.
  15. Shaw, A.K.; Majumder, S.; Sarkar, S.; Sarkar, S.K. A novel EMD based watermarking of fingerprint biometric using GEP. Procedia Technol. 2013, 10, 172–183.
  16. Prasenjit, D.; Ajoy, K.D. A utilization of GEP (gene expression programming) meta model and PSO (particle swarm optimization) tool to predict and optimize the forced convection around a cylinder. Energy 2016, 95, 447–458.
  17. Jędrzejowicz, J.; Jędrzejowicz, P. Experimental evaluation of two new GEP-based ensemble classifiers. Expert Syst. Appl. 2011, 38, 10932–10939.
  18. Silva, A.A.; Lima Neto, I.A.; Misságia, R.M.; Ceia, M.A.; Carrasquilla, A.G.; Archilha, N.L. Artificial neural networks to support petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys. 2015, 117, 118–125.
  19. Zhu, Y.P.; Yu, Y.N.; Chen, X.R. Fisher discriminant analysis for carcinogenic potency of aromatic amines. Chin. J. Prev. Med. 1999, 1, 1–11.
  20. Si, H.Z.; Wang, T.; Zhang, K.J.; Hu, Z.D.; Fan, B.T. QSAR study of 1,4-dihydropyridine calcium channel antagonists based on gene expression programming. Bioorganic Med. Chem. 2006, 14, 4834–4841.
  21. Li, X.Y.; Luan, F.; Si, H.Z.; Hu, Z.D.; Liu, M.C. Prediction of retention times for a large set of pesticides or toxicants based on support vector machine and the heuristic method. Toxicol. Lett. 2007, 175, 136–144.
  22. Servien, R.; Mamy, L.; Li, Z.; Rossard, V.; Latrille, E.; Bessac, F.; Patureau, D.; Benoit, P. Typol—A new methodology for organic compounds clustering based on their molecular characteristics and environmental behavior. Chemosphere 2014, 111, 613–622.
  23. Zhou, C.; Xiao, W.; Tirpak, T.M.; Nelson, P.C. Evolving Accurate and Compact Classification Rules with Gene Expression Programming. IEEE Trans. Evol. Comput. 2003, 7, 519–531.
  24. Eriksson, L.; Andersson, P.L.; Johansson, E.; Tysklind, M. Megavariate analysis of environmental QSAR data. Part I—A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD). Mol. Divers. 2006, 10, 169–186.
  25. Duan, L.; Tang, C.; Zhang, T.; Wei, D.; Zhang, H. Distance guided classification with gene expression programming. Adv. Data Min. Appl. 2006, 4093, 239–246.
  26. Teodorescu, L.; Sherwood, D. High energy physics event selection with gene expression programming. Comput. Phys. Commun. 2008, 178, 409–419.
  27. Yadav, A.K.; Malik, H.; Chandel, S.S. Selection of most relevant input parameters using WEKA for artificial neural network based solar radiation prediction models. Renew. Sust. Energy Rev. 2014, 31, 509–519.
  28. Mohammad, M.N.; Sulaiman, N.; Muhsin, O.A. A novel intrusion detection system by using intelligent data mining in WEKA environment. Procedia Comput. Sci. 2011, 3, 1237–1242.
  29. Lievens, S.; Baets, B.D. Supervised ranking in the WEKA environment. Inf. Sci. 2010, 180, 4763–4771.
  30. Singh, K.P.; Gupta, S.; Rai, P. Identifying pollution sources and predicting urban air quality using ensemble learning methods. Atmos. Environ. 2013, 80, 426–437.
  31. Kar, S.; Roy, K. First report on development of quantitative interspecies structure-carcinogenicity relationship models and exploring discriminatory features for rodent carcinogenicity of diverse organic chemicals using OECD guidelines. Chemosphere 2012, 87, 339–355.
  32. Wu, X.C.; Zhang, Q.Z.; Wang, H.; Hu, J.T. Predicting carcinogenicity of organic compounds based on CPDB. Chemosphere 2015, 139, 81–90.
  33. Helguera, A.M.; Pérez, M.A.C.; González, M.P.; Ruiz, R.M.; Díaz, H.G. A topological substructural approach applied to the computational prediction of rodent carcinogenicity. Bioorganic Med. Chem. 2005, 13, 2477–2488.
  34. Basavaraja, J.; Inamdar, S.R.; Kumar, H.M.S. Solvents effect on the absorption and fluorescence spectra of 7-diethylamino-3-thenoylcoumarin: Evaluation and correlation between solvatochromism and solvent polarity parameters. Spectrochim. Acta A 2015, 137, 527–534. [Google Scholar] [CrossRef] [PubMed]
  35. Sambathkumar, K.; Jeyavijayan, S.; Arivazhagan, M. Electronic structure investigations of 4-aminophthal hydrazide by UV-visible, NMR spectral studies and HOMO-LUMO analysis byab initioand DFT calculations. Spectrochim. Acta A 2015, 147, 124–138. [Google Scholar] [CrossRef] [PubMed]
  36. Lin, I.S.; Fan, P.L.; Chen, H.I.; Loh, C.H.; Shih, T.S.; Liou, S.H. Rapid and intermediate N-acetylators are less susceptible to oxidative damage among 4,4-methylenebis(2-chloroaniline) (MBOCA)-exposed workers. Int. J. Hydrogen Energy 2013, 216, 515–520. [Google Scholar] [CrossRef] [PubMed]
  37. Szczuka, M.; Ślęzak, D. Feedforward neural networks for compound signals. Theor. Comput. Sci. 2011, 412, 5960–5973. [Google Scholar] [CrossRef]
Figure 1. Expression trees.
Figure 2. The flow chart of GEP.
Figure 3. Structure of the multilayer perceptron (MLP) artificial neural network.
Figure 4. Margin curve of the training set. The vertical axis shows the number of aromatic amines (AAs); the horizontal axis shows the margin, i.e., the predicted probability of the actual category minus the highest predicted probability among the wrong categories.
Figure 5. Margin curve of the test set. The vertical axis shows the number of aromatic amines (AAs); the horizontal axis shows the margin, i.e., the predicted probability of the actual category minus the highest predicted probability among the wrong categories.
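The margin plotted on the horizontal axis of Figures 4 and 5 can be illustrated with a minimal Python sketch. It assumes that each prediction is available as a mapping from category label to predicted probability; the function name `prediction_margin` and the example probabilities are hypothetical and are not taken from the study.

```python
def prediction_margin(probabilities, actual):
    """Margin of one prediction.

    probabilities: dict mapping each category label to its predicted probability.
    actual: the experimentally observed category label.
    Returns P(actual) minus the largest probability assigned to any wrong category.
    """
    p_actual = probabilities[actual]
    p_best_wrong = max(p for label, p in probabilities.items() if label != actual)
    return p_actual - p_best_wrong


# Hypothetical two-category example (0 = non-carcinogenic, 1 = carcinogenic).
example = {0: 0.75, 1: 0.25}
print(prediction_margin(example, actual=0))  # 0.5  (correct, fairly confident)
print(prediction_margin(example, actual=1))  # -0.5 (misclassified)
```

A positive margin means the compound is assigned to its actual category, a negative margin means it is misclassified, and values close to +1 indicate a confident correct prediction.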
Table 1. Carcinogenic activity of aromatic amines in the training set.

| No. | Aromatic Amines | Carcinogenicity (exp) | Carcinogenicity (GEP) | Carcinogenicity (MLPs) |
|---|---|---|---|---|
| 1 | N-Acetoxy-4-biphenylacetamide | 0 | 0 | 0 |
| 2 | N-Acetoxy-2-fluorenylacetamide | 0 | 0 | 0 |
| 3 | N-Acetoxy-4-phenanthrylacetamide | 0 | 0 | 0 |
| 4 | N-Acetoxy-N-(4-stilbenyl)acetamide | 0 | 0 | 0 |
| 5 | 3-Amino-s-triazole | 1 | 1 | 1 |
| 6 | 1-Anthramine | 0 | 0 | 0 |
| 7 | 9-Anthramine | 0 | 0 | 0 |
| 8 | 2-Anthranilacetamide | 0 | 0 | 0 |
| 9 | Benzidine | 1 | 0 | 1 |
| 10 | N-(Benzoyloxy)-fluorenylacetamide | 0 | 1 | 0 |
| 11 | 4-Biphenyldimethylamine | 0 | 0 | 0 |
| 12 | 3,6-Bis(dimethylamino)acridine | 1 | 0 | 1 |
| 13 | 2-Chloro-4-phenylaniline | 0 | 1 | 0 |
| 14 | 4′-Chloro-4-stilbenyl-N,N-dimethylamine | 0 | 0 | 0 |
| 15 | 2-Cyano-4-stilbenamine | 1 | 0 | 0 |
| 16 | 4,6-Diamino-2-(5-nitro-2-furyl)-s-triazine | 0 | 1 | 1 |
| 17 | o,o′-Dianisidine | 0 | 0 | 0 |
| 18 | 3-Dibenzofuranylacetamide | 0 | 0 | 0 |
| 19 | 3-Dibenzothiophenylacetamide | 0 | 0 | 0 |
| 20 | 2,2′-Dichloro-4,4′-diaminostilbene | 1 | 0 | 1 |
| 21 | 3,3′-Dichloro-4,4′-diaminostilbene | 0 | 1 | 0 |
| 22 | 9,10-Dihydro-2-phenanthramine | 0 | 0 | 0 |
| 23 | 3,3′-Dihydroxybenzidine | 0 | 0 | 0 |
| 24 | 2-(4-(N,N-Dimethylamino)styryl) quinoline | 0 | 0 | 0 |
| 25 | 3,2′-Dimethyl-4-biphenylamine | 0 | 0 | 0 |
| 26 | 3,3′-Dimethyl-4-biphenylamine | 0 | 0 | 0 |
| 27 | 2-Fluorenylacetamide | 1 | 0 | 0 |
| 28 | 3-Fluorenylacetamide | 0 | 0 | 0 |
| 29 | 1-Fluorenylacetohydroxamic acid | 0 | 0 | 0 |
| 30 | 2-Fluorenylacetohydroxamic acid | 1 | 0 | 0 |
| 31 | N-Fluorenyl-2-benzamide | 0 | 1 | 0 |
| 32 | N-Fluorenyl-2-benzohydroxamic acid | 0 | 0 | 0 |
| 33 | 2-Fluorenyldiacetamide | 1 | 0 | 1 |
| 34 | 2-Fluorenyldimethylamine | 1 | 1 | 1 |
| 35 | 2,5-Fluorenylenediacetamide | 0 | 0 | 0 |
| 36 | 2-Fluorenylhydroxylamine | 0 | 0 | 0 |
| 37 | N-(2-Fluorenyl)-2,2,2-trifluoroacetamide | 1 | 0 | 1 |
| 38 | 4′-Fluoro-4-biphenylamine | 1 | 0 | 1 |
| 39 | 1-Fluoro-2-fluorenylacetamide | 0 | 0 | 1 |
| 40 | 3-Fluoro-2-fluorenylacetamide | 1 | 0 | 0 |
| 41 | 4-Fluoro-2-fluorenylacetamide | 0 | 0 | 0 |
| 42 | 5-Fluoro-2-fluorenylacetamide | 0 | 1 | 0 |
| 43 | 6-Fluoro-2-fluorenylacetamide | 1 | 0 | 0 |
| 44 | 7-Fluoro-2-fluorenylacetamide | 1 | 0 | 0 |
| 45 | 7-Fluoro-2-N-fluorenylacetohydroxamic acid | 1 | 1 | 0 |
| 46 | 4′-Fluoro-p-phenylaniline | 0 | 1 | 0 |
| 47 | 4′-Fluoro-4-stilbenamine | 1 | 1 | 0 |
| 48 | 4′-Fluoro-4-stilbenyl-N,N-dimethylamine | 1 | 0 | 0 |
| 49 | 2-Hydrazino-4-phenylthiazole | 0 | 1 | 0 |
| 50 | N-Hydroxy-N-(4-stilbenyl) acetamide | 0 | 1 | 0 |
| 51 | 3-Iodo-2-fluorenylacetamide | 0 | 0 | 0 |
| 52 | 7-Iodo-2-fluorenylacetamide | 0 | 0 | 0 |
| 53 | 2-Methoxy-3-benzofuranylamine | 0 | 0 | 0 |
| 54 | 7-Methoxy-2-fluorenylacetamide | 1 | 0 | 1 |
| 55 | 1-Methoxy-2-fluorenylamine | 1 | 0 | 1 |
| 56 | 3-Methoxy-2-fluorenylamine | 0 | 1 | 0 |
| 57 | 4-((p-Methoxyphenyl)azo)-o-anisidine | 1 | 0 | 1 |
| 58 | 2-Methyldiacetylbenzidine | 0 | 0 | 1 |
| 59 | 4,4′-Methylenebis(2-chloroaniline) | 1 | 1 | 1 |
| 60 | 4′-Methyl-4-phenylacetanilide | 0 | 0 | 0 |
| 61 | 3-Methyl-4-phenylaniline | 0 | 1 | 0 |
| 62 | 3-Methyl-4-stilbenamine | 0 | 0 | 0 |
| 63 | 1-Naphthylacetohydroxamic acid | 0 | 0 | 0 |
| 64 | 2-Naphthylhydroxylamine | 0 | 0 | 0 |
| 65 | 9-Oxo-2-fluorenylacetamide | 1 | 0 | 0 |
| 66 | 1-Phenanthrylacetamide | 0 | 0 | 0 |
| 67 | 2-Phenanthrylacetamide | 0 | 1 | 0 |
| 68 | 1-Phenanthrylamine | 0 | 0 | 0 |
| 69 | 3-Phenanthrylamine | 0 | 0 | 0 |
| 70 | 9-Phenanthrylamine | 0 | 0 | 0 |
| 71 | 4-(Phenylazo) acetanilide | 0 | 0 | 0 |
| 72 | 4-(Phenylazo) aniline | 0 | 0 | 0 |
| 73 | 4-(Phenylazo) diacetanilide | 0 | 0 | 0 |
| 74 | 4-(Phenylazo)-N-phenylacetohydroxamic acid | 0 | 0 | 0 |
| 75 | 4-Stilbenamine | 0 | 0 | 0 |
| 76 | N-(4-Stilbenyl) acetamide | 0 | 0 | 0 |
| 77 | 4-Stilbenyl-N,N-diethylamine | 0 | 0 | 0 |
| 78 | 4-Stilbenyl-N,N-dimethylamine | 0 | 0 | 0 |
| 79 | N-(4-Styrylphenyl) hydroxylamine | 0 | 0 | 0 |
| 80 | 3,2′,4′,6′-Tetramethyl-4-biphenylamine | 1 | 0 | 0 |
| 81 | o,o′-Tolidine | 0 | 1 | 0 |
| 82 | 4-(m-Tolylazo) acetanilide | 0 | 0 | 0 |
| 83 | 4-(m-Tolylazo) aniline | 0 | 0 | 0 |
| 84 | 2-(o-Tolylazo)-p-toluidine | 1 | 0 | 1 |
| 85 | 2-(p-Tolylazo)-p-toluidine | 0 | 0 | 0 |
| 86 | 4-(o-Tolylazo)-o-toluidine | 1 | 1 | 0 |
| 87 | 4-(o-Tolylazo)-m-toluidine | 0 | 0 | 0 |
| 88 | 4-(m-Tolylazo)-m-toluidine | 0 | 0 | 0 |
| 89 | 4-(p-Tolylazo)-o-toluidine | 0 | 0 | 0 |
| 90 | 4-(p-Tolylazo)-m-toluidine | 0 | 0 | 0 |
| 91 | N,N,2′-Trimethyl-4-stilbenamine | 0 | 0 | 0 |
| 92 | N,N,3′-Trimethyl-4-stilbenamine | 0 | 0 | 0 |
| 93 | N,N,4′-Trimethyl-4-stilbenamine | 0 | 0 | 0 |
Table 2. Carcinogenic activity of aromatic amines in the test set.

| No. | Aromatic Amines | Carcinogenicity (exp) | Carcinogenicity (GEP) | Carcinogenicity (MLPs) |
|---|---|---|---|---|
| 1 | 2-Anthramine | 0 | 0 | 0 |
| 2 | 4-Biphenylacetamide | 0 | 0 | 0 |
| 3 | 4-Biphenylacetohydroxamic acid | 0 | 1 | 0 |
| 4 | 3-Carbazolylacetamide | 0 | 0 | 1 |
| 5 | 2,7-Diaminofluorene | 0 | 0 | 1 |
| 6 | 4,4′-Diaminostilbene | 1 | 1 | 0 |
| 7 | 2-Dibenzothiophenylacetamide | 0 | 0 | 0 |
| 8 | 3,3′-Dichlorobenzidine | 0 | 0 | 0 |
| 9 | 2-Fluorenamine | 1 | 1 | 0 |
| 10 | 1-Fluorenylacetamide | 0 | 0 | 0 |
| 11 | 3-Fluorenylacetohydroxamic acid | 0 | 0 | 0 |
| 12 | 2,7-Fluorenyldiacetamide | 1 | 1 | 0 |
| 13 | 2-Fluorenyldiethylamine | 0 | 0 | 0 |
| 14 | N,2-Fluorenylformamide | 0 | 1 | 0 |
| 15 | 2-Fluorenylmethylamine | 1 | 0 | 0 |
| 16 | N,2-Fluorenylsuccinamic acid | 1 | 0 | 0 |
| 17 | 8-Fluoro-2-fluorenylacetamide | 1 | 0 | 1 |
| 18 | 2-Fluoro-4-phenylaniline | 0 | 0 | 0 |
| 19 | 3′-Fluoro-4-phenylaniline | 0 | 0 | 0 |
| 20 | 3-Methoxy-4-biphenylamine | 0 | 1 | 1 |
| 21 | 3-Methoxy-2-fluorenylacetamide | 0 | 1 | 0 |
| 22 | 4,4′-Methylenebis(2-methylaniline) | 1 | 0 | 1 |
| 23 | 3-Methyl-2-naphthylamine | 0 | 0 | 0 |
| 24 | 2-Methyl-4-phenylaniline | 0 | 0 | 0 |
| 25 | 2′-Methyl-4-phenylaniline | 0 | 0 | 0 |
| 26 | 2-Methyl-4-stilbenamine | 0 | 1 | 0 |
| 27 | 2-Naphthylamine | 0 | 0 | 0 |
| 28 | 1-Naphthylhydroxylamine | 0 | 0 | 0 |
| 29 | 9-Phenanthrylacetamide | 0 | 0 | 0 |
| 30 | 2-Phenanthrylacetohydroxamic acid | 0 | 0 | 0 |
| 31 | 2-Phenanthrylamine | 0 | 1 | 0 |
| 32 | 4-(Phenylazo)-o-anisidine | 1 | 1 | 0 |
| 33 | 1-(Phenylazo)-2-naphthylamine | 0 | 0 | 0 |
| 34 | 4-(Phenylazo)-N-phenylhydroxylamine | 0 | 0 | 0 |
| 35 | 3,2′,5′-Trimethyl-4-diphenylamine | 1 | 0 | 1 |
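Table 3. Correlation matrix of the eight descriptors.

| Correlation | NCOS | NNOS | KFBI | BBI | SICI | TEIA | PLPT | LUMO |
|---|---|---|---|---|---|---|---|---|
| NCOS | 1.000 | −0.227 | 0.649 | −0.708 | 0.667 | 0.234 | −0.034 | −0.374 |
| NNOS | | 1.000 | 0.175 | −0.014 | 0.159 | 0.312 | −0.201 | −0.111 |
| KFBI | | | 1.000 | −0.569 | 0.730 | 0.433 | 0.007 | −0.18 |
| BBI | | | | 1.000 | −0.681 | −0.259 | −0.173 | 0.438 |
| SICI | | | | | 1.000 | 0.620 | 0.250 | −0.456 |
| TEIA | | | | | | 1.000 | 0.339 | −0.107 |
| PLPT | | | | | | | 1.000 | −0.277 |
| LUMO | | | | | | | | 1.000 |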
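A correlation table of this kind is typically read by looking for the most strongly correlated descriptor pairs. The following Python sketch does this for the coefficients of Table 3. The cutoff `THRESHOLD = 0.8` is a hypothetical value chosen only for illustration, and the helper name `descriptor_pairs` is likewise an assumption; neither comes from the study.

```python
NAMES = ["NCOS", "NNOS", "KFBI", "BBI", "SICI", "TEIA", "PLPT", "LUMO"]

# Upper triangle of Table 3, row by row; each row starts at its diagonal entry.
UPPER = [
    [1.000, -0.227, 0.649, -0.708, 0.667, 0.234, -0.034, -0.374],
    [1.000, 0.175, -0.014, 0.159, 0.312, -0.201, -0.111],
    [1.000, -0.569, 0.730, 0.433, 0.007, -0.18],
    [1.000, -0.681, -0.259, -0.173, 0.438],
    [1.000, 0.620, 0.250, -0.456],
    [1.000, 0.339, -0.107],
    [1.000, -0.277],
    [1.000],
]


def descriptor_pairs(names, upper):
    """Return (descriptor_a, descriptor_b, r) for every off-diagonal entry."""
    pairs = []
    for i, row in enumerate(upper):
        # row[0] is the diagonal (r = 1); the rest pair names[i] with later names.
        for offset, r in enumerate(row[1:], start=1):
            pairs.append((names[i], names[i + offset], r))
    return pairs


PAIRS = descriptor_pairs(NAMES, UPPER)
strongest = max(PAIRS, key=lambda pair: abs(pair[2]))

THRESHOLD = 0.8  # hypothetical cutoff, not taken from the study
flagged = [pair for pair in PAIRS if abs(pair[2]) >= THRESHOLD]

print(strongest)  # ('KFBI', 'SICI', 0.73)
print(flagged)    # []
```

With the tabulated values, the largest absolute coefficient is 0.730 (KFBI and SICI), so no pair reaches the illustrative cutoff.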
Table 4. Results of GEP and MLPs.

| | Accuracy | Sensitivity | Specificity | Youden’s Index |
|---|---|---|---|---|
| Training set of GEP | 0.914 | 0.947 | 0.905 | 0.852 |
| Test set of GEP | 0.829 | 0.667 | 0.885 | 0.552 |
| Training set of MLPs | 0.838 | 0.844 | 0.813 | 0.657 |
| Test set of MLPs | 0.743 | 0.793 | 0.500 | 0.293 |
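Youden’s index is conventionally defined as sensitivity + specificity − 1, and the values in Table 4 are consistent with that definition. The short Python sketch below recomputes the index from the tabulated sensitivity and specificity; the function name `youden_index` and the dictionary layout are illustrative assumptions.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# (sensitivity, specificity, Youden's index) as reported in Table 4.
REPORTED = {
    "Training set of GEP": (0.947, 0.905, 0.852),
    "Test set of GEP": (0.667, 0.885, 0.552),
    "Training set of MLPs": (0.844, 0.813, 0.657),
    "Test set of MLPs": (0.793, 0.500, 0.293),
}

RECOMPUTED = {
    name: round(youden_index(sens, spec), 3)
    for name, (sens, spec, _) in REPORTED.items()
}

for name, (_, _, reported_j) in REPORTED.items():
    # Each recomputed value agrees with the tabulated one to three decimals.
    print(f"{name}: recomputed {RECOMPUTED[name]:.3f}, reported {reported_j:.3f}")
```

An index of 1 corresponds to perfect separation of the two classes, and 0 to a classifier no better than chance.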

Song, F.; Zhang, A.; Liang, H.; Cui, L.; Li, W.; Si, H.; Duan, Y.; Zhai, H. QSAR Study for Carcinogenic Potency of Aromatic Amines Based on GEP and MLPs. Int. J. Environ. Res. Public Health 2016, 13, 1141. https://doi.org/10.3390/ijerph13111141
