Next Article in Journal
Fifteen Shades of Grey: Combined Analysis of Genome-Wide SNP Data in Steppe and Mediterranean Grey Cattle Sheds New Light on the Molecular Basis of Coat Color
Next Article in Special Issue
Bioinformatics Analysis Revealed Novel 3′UTR Variants Associated with Intellectual Disability
Previous Article in Journal
Insights into Mobile Genetic Elements of the Biocide-Degrading Bacterium Pseudomonas nitroreducens HBP-1
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data

1
Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
2
Department of Computer Science & Engineering, Aliah University, Newtown WB-700160, India
3
Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
4
MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
*
Author to whom correspondence should be addressed.
Genes 2020, 11(8), 931; https://doi.org/10.3390/genes11080931
Submission received: 21 July 2020 / Revised: 3 August 2020 / Accepted: 6 August 2020 / Published: 12 August 2020

Abstract

:
DNA methylation change has been useful for cancer biomarker discovery, classification, and potential treatment development. So far, existing methods use either differentially methylated CpG sites or combined CpG sites, namely differentially methylated regions, that can be mapped to genes. However, such methylation signal mapping has limitations. To address these limitations, in this study, we introduced a combinatorial framework using linear regression, differential expression, deep learning method for accurate biological interpretation of DNA methylation through integrating DNA methylation data and corresponding TCGA gene expression data. We demonstrated it for uterine cervical cancer. First, we pre-filtered outliers from the data set and then determined the predicted gene expression value from the pre-filtered methylation data through linear regression. We identified differentially expressed genes (DEGs) by Empirical Bayes test using Limma . Then we applied a deep learning method, “nnet” to classify the cervical cancer label of those DEGs to determine all classification metrics including accuracy and area under curve (AUC) through 10-fold cross validation. We applied our approach to uterine cervical cancer DNA methylation dataset (NCBI accession ID: GSE30760, 27,578 features covering 63 tumor and 152 matched normal samples). After linear regression and differential expression analysis, we obtained 6287 DEGs with false discovery rate (FDR) < 0.001 . After performing deep learning analysis, we obtained average classification accuracy 90.69 % ( ± 1.97 % ) of the uterine cervical cancerous labels. This performance is better than that of other peer methods. We performed in-degree and out-degree hub gene network analysis using Cytoscape . We reported five top in-degree genes ( PAIP 2 , GRWD 1 , VPS 4 B , CRADD and LLPH ) and five top out-degree genes ( MRPL 35 , FAM 177 A 1 , STAT 4 , ASPSCR 1 and FABP 7 ). After that, we performed KEGG pathway and Gene Ontology enrichment analysis of DEGs using tool WebGestalt(WEB-based Gene SeT AnaLysis Toolkit). In summary, our proposed framework that integrated linear regression, differential expression, deep learning provides a robust approach to better interpret DNA methylation analysis and gene expression data in disease study.

1. Introduction

DNA methylation has been found a promising biomarker in cancer detection and cancer classification. DNA methylation can be defined as a heritable epigenetic mark where a methyl group can transfer covalently to the C-5 position of the cytosine ring of DNA through DNA methyltransferases (DNMTs). DNA methylation is vital for normal development. It plays very important role in a number of key operations including genomic imprinting, inactivation of X-chromosome, repression of repetitive element transcription and transposition, and different diseases including cancer [1]. To biologically interpret the DNA methylation data, two kinds of analysis are available: (i) single differentially methylated genes (CpG sites) finding [2,3] and (ii) differentially methylated region (DMR) finding [4,5,6]. These two kinds of analysis are only specific to performing a single task. Therefore, it is important to incorporate different factors to correctly interpret DNA methylation data by which it can work as multi-functionalities from different directions such as prediction of gene expression using DNA methylation, differential expression analysis, cancer classification [7], hub gene finding, and others.
In practical scenarios, it is observed that DNA methylation normally reduces gene expression levels [8,9]. However, this opinion varies on different factors. There are different kinds of method to integrate DNA methylation and gene expression data. There are several shortcomings of those existing methods. Firstly, it is not easy to determine the directionality of the evaluated gene expression estimated from the DNA methylation. Normally, the suppression of gene expression is caused by hypermethylation in the promoter region, while the activation correlates the hypermethylation in the gene body. Therefore, the prediction of changing in gene expression based on simple DNA methylation results is difficult [10]. Secondly, an accurate measure of gene promoter methylation is difficult due to the variance in the size of canonical promoters as well as the presence of the distal augments, which initiates biases into the association of methylated regions with gene models [10]. Thirdly, the high probability of selecting a long gene due to the nearby differentially methylated CpGs or overlapping (or nested) with other genes [10]. Fourthly, some specific tools are required for reformatting the methylation data into the genomic region formats (e.g., BED) for some web-based methods such as GREAT [11], Galaxy [12]. It creates more complications in their usage [10].
Cervical cancer is a cancer which starts in the cervix, a hollow cylinder that connects the lower part of uterus to a woman’s vagina. Most of the cervical cancers grow in the cells on the outer surface of the cervix. Normally women are unable to realize this disease in the initial stage since the symptoms are more or less similar with the common conditions such as menstrual periods and urinary tract infections. The normal symptoms of the cervical cancer include abnormal bleeding during mensuration time or after having sex, pain in the pelvis, as well as pain during the urination [13]. Here, we used a DNA methylation dataset for uterine cervical cancer from NCBI (Accession ID: GSE30760) [14] which have two types of samples, one is normal sample and another one is uterine cervical cancer sample.
So far, there has been no method to integrate regression, differential expression and deep learning strategies for accurate interpretation of DNA methylation in a complex disease like cancer. To resolve the previously mentioned drawbacks, in this article, we provided an integrated framework using regression, differential expression and deep learning methods to correctly interpret biologically of the DNA methylation data through integrating that DNA methylation data and corresponding TCGA (The Cancer Genome Atlas) gene expression data for uterine cervical data (NCBI accession ID GSE30760) [14,15,16]. We pre-filtered the redundant CpG sites, eliminated outliers, and replaced missing values. Next, we predicted corresponding gene expression value from the pre-filtered DNA methylation data through linear regression algorithm where the impact between DNA methylation and TCGA gene expression has been determined. As a result, we obtained the predicted gene expression matrix for the preprocessed DNA methylation data. Through the entire analysis, we used ByMethyl R package [10]. Next, we identified differentially expressed genes (DEGs) using downstream analysis, Empirical Bayes test using Limma [17,18,19]. After we applied a recently released deep learning method, “nnet” (feed-forward neural network based model) [20] to interpret those DEGs for determining the classification capacity of uterine cancer and normal groups, we then estimated all classification metrics (average accuracy, average sensitivity, average specificity, average precision, average overall error rate and area under curve (AUC)) using 10-fold cross validation. We trained our predicted DEG expression data using “nnet” with the default parameter settings (i) size (=number of units in hidden layer), (ii) rang (=initial random weights) while [−rang, rang], (iii) decay (=parameter for weight decay), (iv) maxit (=the maximum number of iterations or number of epochs), (v) MaxNWts (=the maximum allowable number of weights) and other default parameters. Remarkably, we obtained 90.69 % ( ± 1.97 % ) as average classification accuracy of the uterine cervical cancer samples and normal samples by using DEG expression data. According to comparative study, the classification accuracy of our proposed method is higher than that of other state-of-the-art methods. We further performed in-degree and out-degree hub gene network analysis using Cytoscape [21]. We reported the five top in-degree genes ( PAIP 2 , GRWD 1 , VPS 4 B , CRADD and LLPH ) and the five top out-degree genes ( MRPL 35 , FAM 177 A 1 , STAT 4 , ASPSCR 1 and FABP 7 ). After that, we performed Gene Set Enrichment Analysis (GESA) to determine enriched KEGG pathways and Gene Ontology (GO) terms including Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) on the set of all DEGs having F D R < 0.001 using WebGestalt (WEB-based Gene SeT AnaLysis Toolkit) [22]. Finally, our proposed integrated framework using linear regression, differential expression and deep learning method can interpret the DNA methylation data better than using single differential methylation analysis or differentially methylated region finding strategies for any kind of cancer.

2. Materials and Methods

The steps of our proposed framework are demonstrated as follows, as well as in Figure 1.

2.1. Data Collection

In this study, we used a cervical cancer methylation dataset(NCBI accession ID: GSE30760) [14,15,16]. This dataset included 63 uterine cervical tumor samples and 152 matched normal samples. Of note, the initial analysis had 27,578 genes.

2.2. Preprocessing of Methylation Data

In this article, we provided an extensive analysis to integrate DNA methylation and corresponding TCGA gene expression data by utilization of regression, differential expression and deep learning. In this method, we have utilized different steps as below.

2.2.1. Data Preprocessing

First we eliminated the CpG sites that had missing values in more than half of the samples and then the remaining missing values would be imputed through integrating a new traditional quality control R package ENmix [23], which is widely useful in Illumina Human Methylation data analysis. The functions under the ENmix package are used to remove unwanted experimental noise and to improve accuracy and reproducibility of methylation measures. ENmix functions are very flexible and transparent. In our proposed method this quality control ‘ENmix’ was used in our methylation data to discard the outliers and to replace missing values using the popular k nearest neighbors (KNN) algorithm.

2.2.2. Computing Predicted Expression Scores of Gene through Linear Regression Analysis

In this step, we computed the predicted gene expression scores from the preprocessed of DNA methylation profiles and corresponding TCGA CESC cancer type through linear regression analysis along with corresponding pre-trained weight factor.
To do so, we utilized the linear regression algorithm to measure the impact between DNA methylation and gene expression for uterine cervical cancer on preprocessed DNA methylation and corresponding TCGA CESC cancer type [10]. In a statistical point of view, linear regression is a linear approach for molding the relationship between a scalar variable (or, dependent variable) and one or more explanatory variables (or independent variables). In regression analysis, gene expression ( E x ) is the dependent variable and DNA methylation ( M t ) is the independent variable. For an i-indexed gene denoted by g e n e i , E x = { e x 1 , e x 2 , . . . , e x n } is the gene expression across n samples, and M t is the corresponding methylation matrix ( 27 , 578 × 215 matrix here). Here, we chose those CpGs ( c p g n , j ) whose beta values were correlated, i.e., Pearson’s correlation coefficient was greater than | 0.8 | ) with gene expression label ( g e n e i ) for building the model, g e n e i where c p g n , j is the beta value of j -th CpG in sample n. The equation for the linear regression model was described as follows:
E x = α + b e t a M t
where α denotes the linear regression intercepting factor, and beta refers to the coefficient vector. In our case, through this linear regression model, we generated the predicted gene expression matrix for the provided genes (CpG sites) using DNA methylation data. Then we applied 10-fold cross-validation to validate our model. That means, we need to check the quality of the gene expression inferred by the linear regression model. Basically, for each validation, to train the model we used 9/10 samples as training dataset. Then, we computed a gene expression profile for the rest 1/10 samples by integrating the DNA methylation data and trained model. After completion of 10-fold cross-validation, our further step was to merge test sample profiles to a gene expression profile containing all samples. For conducting downstream validation we compared the gene expression with the RNA-seq data.

2.2.3. Voom Normalization and Identifying Differentially Expressed Genes Using Limma

In this step Voom normalization [24] was used and after that we applied Limma [18,25]. After applying Voom normalization tool, we detected DEGs from the predicted gene expression data for downstream analysis through Limma [19]. According to benchmark methods the performance of Limma is very good for any kind of data distributions for any sample size. The definition of the moderated t-statistic of Limma is as follows [19]:
t ¯ k = 1 1 m 1 + 1 m 2 β k ^ p ¯ k
where m 1 denotes the sample size for diseased group and m 2 signifies the sample size for control group, and total sample size m = m 1 + m 2 . β k ^ , p k notify corresponding contrast estimator and posterior sample variance for the genes, respectively.
To find the false discovery rate (FDR) adjusted p-value using Empirical Bayes t-statistics, we used t-table or cumulative distribution function (cdf). FDR adjusted p-value less than 0.001 indicates the differentially expressed genes (DEGs) here. This p-value denotes the probability of observing a t-value which is either equal to or greater than the actually observed t-value in which the given null hypothesis is true.
Here, we applied the Empirical Bayes test using Limma to compute t-score, p-value and FDR, where normal uterine samples group had 152 samples and uterine cervical cancer samples group included 63 samples. Finally, we selected those genes as differentially expressed genes whose F D R < 0.001 . However, all the differentially expressed genes were considered as a single potential gene signature which could be verified at classification analysis through deep learning.

2.3. Disease Classification of DEGs through Deep Learning

Here, we used a latest deep learning method “nnet” (feed-forward neural network based model), [20]. We used this deep learning technique with 10-fold cross validation to examine the class-label (normal and Uterine Cervical cancer groups) of the differentially expressed genes with a repeat of thirty times. In the cross-validation, we divided the predicted gene expression data of the DEGs into 10 folds of samples of which nine folds of samples were used as training set, while remaining one fold of samples was utilized as the test set. From this sub step, we ran “nnet” tool using a certain number of epochs (termed as “maxit”) that means the deep learning method was internally repeated for that “maxit” times, and then computed the classification metrics at one time iteration of each fold. From this sub step, we obtained a confusion matrix consisting of True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN). This sub procedure was repeated for each fold of samples (i.e., nine other fold samples). Then we added all these metrics for these 10 times internal repetition and then produced a final confusion matrix. Then we added all these metrics for these 10 internal repetitions and then produced a final confusion matrix. Thereafter, we repeated this entire procedure multiple times (30 times) here to obtain the average classification metric values (average accuracy, average sensitivity, average specificity, average precision, average overall error rate and area under curve (AUC)). Here, we used the test sample as a validation sample also. In this deep learning method, we used “nnet” with the default parameter settings (i) size (=number of units in hidden layer), (ii) rang (=initial random weights) while [−rang, rang], (iii) decay (=parameter for weight decay), (iv) maxit (=the maximum number of iterations or number of epochs), (v) MaxNWts (=the maximum allowable number of weights) and other default parameters also.

2.4. Hub Gene Finding

In this regard, we applied Pearson’s correlation analysis on the DEGs identified by our method for finding out the active edges among genes having correlation value ≥0.8 or ≤ 0.8 . After obtaining the set of active edges, we performed degree centrality analysis through Cytoscape online tool [21] and determined in-degree and out-degree scores of each DEG. We marked top 10 in-degree hub DEGs and top 10 out-degree hub DEGs.

2.5. Gene Set Enrichment Analysis

The potential function, biological significance, and disease relevance of a list of signature genes can be assessed by Gene Set Enrichment Analysis (GSEA). After identifying differentially expressed genes we used KEGG pathways and Gene Ontology (GO) annotations (three domains: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF)) on a set of top differentially expressed genes by WebGestalt (WEB-based Gene SeT AnaLysis Toolkit) [22]. We obtained all KEGG pathways and Gene Ontology (GO) terms accompanied by number of genes in that pathway or GO-term, enriched p-value and FDR. We filtered out those KEGG pathways or GO terms whose FDR was greater than or equal to 0.05.

3. Results and Discussion

In this case study, we had 27,578 features and 215 samples including 152 normal samples and 63 uterine cervical cancer samples. After data preprocessing, linear regression and differential expression analysis, we obtained 6287 DEGs having F D R < 0.001 by Limma , in a list accompanied by computed t-score, p-value and FDR. Top 25 DEGs are shown in Table 1. For example, ADCY 2 was the topmost DEG with minimum FDR (FDR = 5.64 × 10 119 ). We provided the list of all DEGs obtained by differential expression analysis by Empirical Bayes test using Limma with FDR corrected p-value in a Supplementary File, Additional file 1: Table S1. Furthermore, the predicted gene expression matrix of all DEGs computed from original pre-filtered uterine cervical cancer DNA methylation data through linear regression analysis was provided in another Supplementary File, Additional file 2: Table S2.
After that, we applied the latest deep learning method “nnet” (feed-forward neural network based model), [20] on our computed DEG expression dataset which have 6287 features with 215 samples. We used this deep learning technique with 10-fold cross validation to examine the class-label (normal and uterine cervical cancer groups) of the differentially expressed genes with a repeat of 30 times. In the cross-validation, we divided all the samples of the predicted gene expression data of the DEGs into 10 folds of samples of which nine-fold of samples was used as training set, while the remaining one-fold of the samples was utilized as the test set. From this sub step, we ran “nnet” tool using maxit (number of epochs) equal to 100, that means the deep learning method was internally repeated for 100 times, and then computed the classification metrics at one time iteration of each fold. From this sub step, we obtained a confusion matrix consisting of True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN). This sub procedure was repeated for each fold of samples (i.e., nine other folds). Then, we added all these metrics for these 10 times internal repetitions and produced a final confusion matrix. Thereafter, we repeated this entire procedure for multiple times (30 times) and obtained thirty confusion metrics. Using this, we obtained the average classification metric values (average accuracy, average sensitivity, average specificity, average precision, average overall error rate and area under curve (AUC)). Note that our deep learning method has already repeated 30,000 times ( 30 × 10 × 100 ) from which we computed the average accuracy, where every sample was used as a test set at least once (i.e., no sample was missing as a test sample). Here we used test sample as validation 163 sample. In this deep learning method, we used “nnet” with the default parameter settings (i) size (=number of units in hidden layer) (=2), (ii) rang (=initial random weights)(=0.1) while [−rang, rang], (iii) decay (=parameter for weight decay)(= 5 × e 4 ), (iv) maxit (=the maximum number of iterations or number of epochs)(=100), (v) MaxNWts (=the maximum allowable number of weights)(=84,581) and other default parameters. As we used 10-fold cross validation, 9/10 of 215 samples (i.e., 194 or 193 samples) were considered as training set and 1/10 of 215 samples (i.e., 21 or 22 samples) were taken as test set. of which nine-fold of samples was used as a training set, while remaining one-fold of samples was utilized as a test set. Thus, each sample participated in each role, either in training sample or test sample, at least once. Here, we also used the test sample as the validation sample. We obtained 90.69 % ( ± 1.97 % ) average classification accuracy and value of AUC was 0.858. For more details, see Table 2. We have plotted all metrics in Figure 2.
We carried out a comparative study between proposed method and an existing method “RSNNS” (Stuttgart Neural Network Simulator (SNNS) based deep learning tool) with 10-fold cross validation with repeating 30 times. In case of “RSNNS” we also used same default parameter settings like (i) size (=number of units in hidden layer)(=2), (ii) maxit (=maximum number of iterations or number of epochs) (=100), among others. In both cases we have repeated entire procedure 30 times to to obtain a reliable classification. Our proposed method produced an average classification accuracy of 90.69 % ( ± 1.97 % ) whereas existing method “RSNNS” had 87.27 % ( ± 5.92 % ) as average classification accuracy (see Figure 3). We considered our framework had better performance than all other methods using deep learning tool.
Here, we applied Pearson’s correlation analysis on our DEGs for finding out edges among genes having correlation value greater than or equal to 0.8 or, less than or equal to (−0.8). Then, we also performed in-degree and out-degree hub gene network analysis using Cytoscape [21]. As an example the five top genes with highest in-degree values were namely PAIP 2 , GRWD 1 , VPS 4 B , CRADD and LLPH , see Table 3. Similarly, the five top most out-degree genes were namely MRPL 35 , FAM 177 A 1 , STAT 4 , ASPSCR 1 and FABP 7 , see Table 4. We provided detail hub gene network structure in a Supplementary File, Additional file 7: Table S7.
In the corresponding literature survey, we found that most of the topmost hub genes detected by our method played an important role in the respective cancer. PAIP 2 gene and cervical cancer were found to be associated by Berlanga et al. [26]. GRWD 1 was utilized as the negatively regulator of p53 in tumorigenesis [27]. It had been also used as a potential bio-marker in DNA methylation at the time of treatment and risk assessment of cancer. Methylation of GRWD 1 might be a protective factor in the development of tumor [28]. VPS 4 B gene and cervical cancer were reported in the literature Broniarczyk et al. [29]. Similarly, CRADD gene is involved in cervical cancer, as reported in Sundaram et al. [30], while LLPH gene was associated with cervical cancer in Feron et al. [31]. Similarly, the topmost out-degree hub genes were mostly associated with cervical cancer through literature search. For example, the association between FAM 177 A 1 and cervical cancer were documented in Wen et al. [32], whereas STAT 4 was connected with the respective cervical cancer in Luo et al. [33]. In addition, ASPSCR 1 and cervical cancer are reported in Liang et al. [34], while FABP 7 was found to be linked to cervical cancer in Zhang et al. [35].
These 6287 DEGs, which have F D R < 0.001 , were taken for Gene Set Enrichment Analysis using WebGestalt (WEB-based Gene SeT AnaLysis Toolkit) [22]. We had applied WebGestalt (WEB-based Gene SeT AnaLysis Toolkit) database on our DEG set to obtain all KEGG pathways and Gene Ontology (GO) terms [Biological Process (BP), Cellular Component (CC) and Molecular Function (MF)], accompanied by number of genes in that pathway or GO-term, enriched p-value and FDR. Here, we took our input data set in the prescribed format of WebGestalt which was in a two-columns pattern, first one was gene name and second one was score. Here we used t-score as score. Significant pathways and GO-terms were described in below and also for more details see Table 5, Table 6, Table 7 and Table 8. For example, hsa05205:Proteoglycans in cancer was a top significant KEGG pathway which has minimum FDR value ( 2.16 × 10 5 ). A total of 198 genes were associated in this pathway with enriched p-value 6.65 × 10 8 . For the remaining top 10 significant KEGG pathways, see Table 5. We provided the list of all KEGG pathways in a Supplementary File, Additional file 3: Table S3. In addition, the volcano plot of the of normalized enrichment score of those FDR significant KEGG pathways is shown in Figure 4. Similarly, GO:0008283 cell proliferation was one of the top significant GO-BP terms with FDR value 0. A total of 1986 genes were associated with this GO-BP term, enriched p-value 0. For the remaining terms, see Table 6. We provided the list of all GO-BP terms in a Supplementary File, Additional file 4: Table S4. In such analysis, we found GO:0005783 endoplasmic reticulum as one of the top significant GO-CC terms with FDR value 0. A total of 1861 genes were associated with this GO-CC term, enriched p-value 0. For the rest, see Table 7. We provided the list of all GO-CC terms in a Supplementary File, Additional file 5: Table S5. Furthermore, GO:0042802 identical protein binding was one of the top significant GO-MF terms with minimum FDR value 0. A total of 1696 genes were associated with this GO-MF term having the enriched p-value 0. For details, see Table 8. We provided the list of all GO-MF terms in a Supplementary File, Additional file 6: Table S6.

4. Conclusions and Future Work

In this article, we provided a framework using linear regression, differential expression, and deep learning to provide correct biological interface for integrating DNA methylation and corresponding TCGA gene expression data to uterine cervical cancer. To develop the framework, first we eliminated outliers, then applied linear regression to determine predicted gene expression data from the preprocessed DNA methylation data by the use of TCGA gene expression data. Then we identified 6287 differentially expressed gene with FDR cut off less than 0.001 using downstream analysis through Empirical Bayes test using Limma . After that, we applied “nnet” deep learning method to interpret differentially expressed genes with 10-fold cross validation and with the default parameter settings (i) size (=number of units in hidden layer), (ii) rang (=initial random weights) while [−rang, rang], (iii) decay (=parameter for weight decay), (iv) maxit (=the maximum number of iterations or number of epochs), (v) MaxNWts (=the maximum allowable number of weights) and other default parameters also. We obtained 90.69 % ( ± 1.97 % ) as average classification accuracy of the uterine cervical cancer samples and normal samples for DEG expression data, which is more significant than other existing methods. So through the deep learning and comparative study, we can say that our obtained DEGs are strong and efficient in disease classification.
Here, we also performed in-degree and out-degree hub gene network analysis using Cytoscape [21]. We reported the five highest in-degree genes ( PAIP 2 , GRWD 1 , VPS 4 B , CRADD and LLPH ) and the five highest out-degree genes ( MRPL 35 , FAM 177 A 1 , STAT 4 , ASPSCR 1 and FABP 7 ). Furthermore, we used pathway analysis on DEGs with F D R < 0.001 using WebGestalt . Finally, our framework is useful for better biological interpretation of the DNA methylation data rather than single differential methylation analysis or differentially methylated region finding.
In our future study, we will extend our current work through integrating random forest ensemble method into deep learning strategy to obtain a better classification model in all prospective, and then apply that on big data (e.g., single cell RNA sequencing data or, other TCGA cancer tissue specific data) for cancer classification.

Supplementary Materials

The code is available online at https://drive.google.com/open?id=1LsYe3ypiweox2OSmnD5LDaYMJePeozDd, Table S1: List of all DEGs obtained by differential expression analysis by Empirical Bayes test using limma with FDR corrected p-value; Table S2: The predicted gene expression matrix of all DEGs computed from original prefiltered Uterine Cervical cancer DNA methylation data through linear regression analysis; Table S3: List of all KEGG pathways obtained by WebGestalt ; Table S4: List of all GO-BP terms obtained by WebGestalt ; Table S5: List of all GO-CC terms obtained by WebGestalt ; Table S6: List of all GO-MF obtained by WebGestalt ; Table S7: Image of hub-finding network by Cytoscape .

Author Contributions

Conceived and designed the experiments: S.M., S.S. and T.B. Execution of the experiments: S.M. and S.S. Data analysis: S.M. and S.S. Manuscript writing: S.M., S.S. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Z.Z. was partially supported by Cancer Prevention and Research Institute of Texas (CPRIT RP170668 and RP180734). The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflicts of Interest

The authors have declared that no competing interests exist.

References

  1. Jin, B.; Li, Y.; Robertson, K.D. DNA Methylation: Superior or Subordinate in the Epigenetic Hierarchy? Genes Cancer 2011, 2, 607–617. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Kim, J.H.; Karnovsky, A.; Mahavisno, V.; Weymouth, T.; Pande, M.; Dolinoy, D.C.; Rozek, L.S.; Sartor, M.A. LRpath analysis reveals common pathways dysregulated via DNA methylation across cancer types. BMC Genom. 2012, 13, 526. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Mallik, S.; Mukhopadhyay, A.; Maulik, U. Integrated Statistical and Rule-Mining Techniques for DNA Methylation and Gene Expression Data Analysis. JAISCR 2013, 3, 101–115. [Google Scholar] [CrossRef] [Green Version]
  4. Marsit, C.J.; Koestler, D.C.; Christensen, B.C.; Karagas, M.R.; Houseman, E.A.; Kelsey, K.T. DNA methylation array analysis identifies profiles of blood-derived DNA methylation associated with bladder cancer. J. Clin. Oncol. 2011, 29, 1133–1139. [Google Scholar] [CrossRef] [Green Version]
  5. Rijlaarsdam, M.A.; van der Zwan, Y.G.; Dorssers, L.C.J.; Looijenga, L.H.J. DMRforPairs: Identifying differentially methylated regions between unique samples using array based methylation profiles. BMC Bioinform. 2014, 15, 141. [Google Scholar] [CrossRef] [Green Version]
  6. Mallik, S.; Odom, G.J.; Gao, Z.; Gomez, L.; Chen, X.; Wang, L. An evaluation of supervised methods for identifying differentially methylated regions in Illumina methylation arrays. Briefings Bioinform. 2019, 20, 2224–2235. [Google Scholar] [CrossRef] [Green Version]
  7. Mallik, S.; Zhao, Z. Graph- and rule-based learning algorithms: A comprehensive review of their applications for cancer type classification and prognosis using genomic data. Briefings Bioinform. 2020, 21, 368–394. [Google Scholar] [CrossRef]
  8. Qin, G.; Mallik, S.; Mitra, R.; Li, A.; Jia, P.; Eischen, C.; Zhao, Z. MicroRNA and transcription factor co-regulatory networks and subtype classification of seminoma and non-seminoma in testicular germ cell tumors. Sci. Rep. 2020, 10, 852. [Google Scholar] [CrossRef] [Green Version]
  9. Mallik, S.; Qin, G.; Jia, P.; Zhao, Z. Molecular signatures identified by integrating gene expression and methylation in non-seminoma and seminoma of testicular germ cell tumors. Epigenetics 2020. [Google Scholar] [CrossRef]
  10. Wang, Y.; Franks, J.M.; Whitfield, M.L.; Cheng, C. BioMethyl: An R package for biological interpretation of DNA methylation data. Bioinformatics 2019, 35, 3635–3641. [Google Scholar] [CrossRef]
  11. McLean, C.Y.; Bristor, D.; Hiller, M.; Clarke, S.L.; Schaar, B.T.; Lowe, C.B.; Wenger, A.M.; Bejerano, G. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 2010, 28, 495–501. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Goecks, J.; Nekrutenko, A.; Taylor, J.; Team, G. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11, R86. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Everything You Need to Know About Cervical Cancer. Available online: https://www.healthline.com/health/cervical-cancer (accessed on 17 March 2020).
  14. Zhuang, J.; Jones, A.; Lee, S.; Ng, E.; Fiegl, H.; Zikan, M.; Cibula, D.; Sargent, A.; Salvesen, H.B.; Jacobs, I.J.; et al. The dynamics and prognostic potential of DNA methylation changes at stem cell gene loci in women’s cancer. PLoS Genet. 2012, 8, e1002517. [Google Scholar] [CrossRef]
  15. Teschendorff, A.E.; Jones, A.; Fiegl, H.; Sargent, A.; Zhuang, J.J.; Kitchener, H.C.; Widschwendter, M. Epigenetic variability in cells of normal cytology is associated with the risk of future morphological transformation. Genome Med. 2012, 4, 24. [Google Scholar] [CrossRef] [Green Version]
  16. Teschendorff, A.E.; Jones, A.; Widschwendter, M. Stochastic epigenetic outliers can define field defects in cancer. BMC Bioinform. 2016, 17, 178. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Bandyopadhyay, S.; Mallik, S.; Mukhopadhyay, A. A survey and comparative study of statistical tests for identifying differential expression from microarray data. IEEE/ACM Trans. Comput. Biol. Bioinf. 2014, 11, 95–115. [Google Scholar] [CrossRef] [PubMed]
  18. Mallik, S.; Seth, S.; Bhadra, T.; Tomar, N.; Zhao, Z. A Multi-classifier Model to Identify Mitochondrial Respiratory Gene Signatures in Human Cancer. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 1928–1935. [Google Scholar]
  19. Mallik, S.; Mukhopadhyay, A.; Maulik, U. RANWAR: Rank-Based Weighted Association Rule Mining From Gene Expression and Methylation Data. IEEE Trans. NanoBiosci. 2015, 14, 59–66. [Google Scholar] [CrossRef]
  20. Venables, W.N.; Ripley, B.D. Feed-Forward Neural Networks and Multinomial Log-Linear Models (Package “nnet”). CRAN R Project Repository. 2020. Available online: http://www.stats.ox.ac.uk/pub/MASS4/ (accessed on 2 May 2020).
  21. Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003, 13, 2498–2504. [Google Scholar] [CrossRef]
  22. Liao, Y.; Wang, J.; Jaehnig, E.J.; Shi, Z.; Zhang, B. WebGestalt 2019: Gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019, 47, W199–W205. [Google Scholar] [CrossRef] [Green Version]
  23. Xu, Z.; Niu, L.; Li, L.; Taylor, J.A. ENmix: A novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res. 2016, 44, e20. [Google Scholar] [CrossRef]
  24. Law, C.W.; Chen, Y.; Shi, W.; Smyth, G.K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014, 15, R29. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47. [Google Scholar] [CrossRef] [PubMed]
  26. Berlanga1, J.J.; Baass, A.; Sonenberg, N. Regulation of poly(A) binding protein function in translation: Characterization of the Paip2 homolog, Paip2B. RNA 2006, 12, 1556–1568. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  27. Kayama, K.; Watanabe, S.; Takafuji, T.; Tsuji, T.; Hironaka, K.; Matsumoto, M.; Nakayama, K.I.; Enari, M.; Kohno, T.; Shiraishi, K.; et al. GRWD1 negatively regulates p53 via the RPL11-MDM2 pathway and promotes tumorigenesis. EMBO Rep. 2017, 18, 123–137. [Google Scholar] [CrossRef] [Green Version]
  28. Gao, C.; Zhuang, J.; Li, H.; Liu, C.; Zhou, C.; Liu, L.; Sun, C. Exploration of methylation-driven genes for monitoring and prognosis of patients with lung adenocarcinoma. Cancer Cell Int. 2018, 18, 194. [Google Scholar] [CrossRef]
  29. Broniarczyk, J.; Pim, D.; Massimi, P.; Bergant, M.; Jozefiak, A.; Crump, C.; Banks, L. The VPS4 component of the ESCRT machinery plays an essential role in HPV infectious entry and capsid disassembly. Sci. Rep. 2017, 7, 45159. [Google Scholar] [CrossRef] [Green Version]
  30. Sundaram, M.K.; Raina, R.; Afroze, N.; Bajbouj, K.; Hamad, M.; Haque, S.; Hussain, A. Quercetin modulates signaling pathways and induces apoptosis in cervical cancer cells. Biosci. Rep. 2019, 39, BSR20190720. [Google Scholar] [CrossRef] [Green Version]
  31. Feron, O.; Boidot, R.; Branders, S.; Dupont, P.; Helleputte, T. Signature of Cycling Hypoxia and Use Thereof for the Prognosis of Cancer, International Application Published under the Patent Cooperation Treaty (PCT). WO 2015/015000 Al, 5 February 2015. Available online: https://patentimages.storage.googleapis.com/80/1a/3c/eac6d250b2943a/WO2015015000A1.pdf (accessed on 10 April 2020).
  32. Wen, H.; Chen, L.; He, J.; Lin, J. MicroRNA Expression Profiles and Networks in Placentas Complicated with Selective Intrauterine Growth Restriction. Mol. Med. Rep. 2017, 16, 6650–6673. [Google Scholar] [CrossRef] [Green Version]
  33. Luo, J.; Huang, Q.; Lin, X. STAT4 Expression Is Correlated with Clinicopathological Characteristics of Cervical Lesions. Available online: https://www.researchgate.net/publication/303787920_STAT4_expression_is_correlated_with_clinicopathological_characteristics_of_cervical_lesions/citations (accessed on 10 April 2020).
  34. Liang, W.S.; Aldrich, J.; Nasser, S. Simultaneous Characterization of Somatic Events and HPV-18 Integration in a Metastatic Cervical Carcinoma Patient Using DNA and RNA Sequencing. Int. J. Gynecol. Cancer 2014, 24, 329–338. [Google Scholar] [CrossRef]
  35. Zhang, C.; Liao, Y.; Liu, P.; Du, Q.; Liang, Y.; Ooi, S.; Qin, S.; He, S.; Yao, S.; Wang, W. FABP5 promotes lymph node metastasis in cervical cancer by reprogramming fatty acid metabolism. Theranostics 2020, 10, 6561–6580. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the proposed framework.
Figure 1. Flowchart of the proposed framework.
Genes 11 00931 g001
Figure 2. ROC plots of all classification metrics for the proposed method.
Figure 2. ROC plots of all classification metrics for the proposed method.
Genes 11 00931 g002
Figure 3. Comparative bar plot: proposed method vs state-of-the-art method (RSNNS)).
Figure 3. Comparative bar plot: proposed method vs state-of-the-art method (RSNNS)).
Genes 11 00931 g003
Figure 4. The volcano plot of normalized enrichment score of the FDR significant KEGG pathways from GSEA analysis of DEGs.
Figure 4. The volcano plot of normalized enrichment score of the FDR significant KEGG pathways from GSEA analysis of DEGs.
Genes 11 00931 g004
Table 1. List of differentially expressed genes (false discovery rate (FDR) sorted).
Table 1. List of differentially expressed genes (false discovery rate (FDR) sorted).
Gene Symbolt-Scorep-ValueFDR
ADCY 2 45.22 5.64 × 10 119 5.95 × 10 115
PTPN 6 32.50 1.43 × 10 89 7.55 × 10 86
LHFPL 2 32.09 1.63 × 10 88 5.71 × 10 85
VAV 1 30.24 1.41 × 10 83 3.72 × 10 80
EYA 4 −29.40 2.97 × 10 81 6.27 × 10 78
PNPLA 2 29.02 3.38 × 10 80 5.94 × 10 77
ARID 3 A −28.71 2.37 × 10 79 3.56 × 10 76
HOXD 10 28.19 6.97 × 10 78 9.20 × 10 75
TWIST 1 27.69 1.85 × 10 76 2.17 × 10 73
BHMT 26.49 5.42 × 10 73 5.72 × 10 70
TSLP 26.25 2.76 × 10 72 2.65 × 10 69
ACCN 4 26.16 5.23 × 10 72 4.60 × 10 69
HOXA 6 25.94 2.22 × 10 71 1.80 × 10 68
PRR 5 25.67 1.40 × 10 70 1.06 × 10 67
NODAL 25.45 6.44 × 10 70 4.53 × 10 67
EFCAB 1 25.41 8.60 × 10 70 5.45 × 10 67
WNT 2 25.40 8.86 × 10 70 5.45 × 10 67
PC 25.40 9.31 × 10 70 5.45 × 10 67
S 100 A 8 25.23 2.84 × 10 69 1.58 × 10 66
VWCE 24.89 3.05 × 10 68 1.61 × 10 65
IGFBP 2 24.86 3.61 × 10 68 1.81 × 10 65
ZNF 385 A 24.74 8.36 × 10 68 4.01 × 10 65
C 1 orf 220 24.71 1.04 × 10 67 4.75 × 10 65
COG 2 24.63 1.85 × 10 67 8.15 × 10 65
QRFP −24.51 4.16 × 10 67 1.76 × 10 64
Table 2. Values of disease classification metrics by proposed method.
Table 2. Values of disease classification metrics by proposed method.
MetricsAverage Value (std *)
Average accuracy 90.69 % ( ± 1.97 % )
Average sensitivity 73.97 % ( ± 1.06 % )
Average specificity 97.63 % ( ± 1.71 % )
Average precision 93.38 % ( ± 4.17 % )
Average overall error rate 9.30 % ( ± 1.97 % )
Area under curve (AUC)0.858
* std: standard deviation.
Table 3. Top 10 hub genes according to the in-degree centrality from our proposed method.
Table 3. Top 10 hub genes according to the in-degree centrality from our proposed method.
Gene SymbolIn-DegreeOut-DegreeAverage Shortest Path LengthBetweenness CentralityCloseness CentralityClustering Coefficient
PAIP 2 439323.5870.8020.2790.188
GRWD 1 425663.43511.0010.2910.178
VPS 4 B 406683.4602.2760.2890.191
CRADD 4061783.08711.0030.3240.152
LLPH 403403.5452.3130.2820.182
NDUFA 4 390893.5561.9270.2810.168
NDUFB 6 3721113.2944.6610.3040.175
ZKSCAN 4 372883.3641.4340.2970.200
SMARCD 1 365433.5150.7340.2840.214
TMED 10 348393.5464.1240.2820.193
Table 4. Top 10 hub genes according to the out-degree centrality from our proposed method.
Table 4. Top 10 hub genes according to the out-degree centrality from our proposed method.
Gene SymbolIn-DegreeOut-DegreeAverage Shortest Path LengthBetweenness CentralityCloseness CentralityClustering Coefficient
MRPL 35 2393762.7659.3540.3620.141
FAM 177 A 1 213393.0020.2630.3330.225
STAT 4 943322.8722.7440.3480.211
ASPSCR 1 683292.8881.1320.3460.212
FABP 7 2043152.7793.0080.3600.171
HNRNPA 0 653113.0101.2300.3320.191
ANGPTL 4 182992.8870.3640.3460.249
DDX 19 A 862832.9931.3850.3340.218
TRNT 1 402823.1190.4770.3210.221
PFDN 1 522743.0050.5260.3330.243
Table 5. Top significant KEGG Pathways (FDR sorted).
Table 5. Top significant KEGG Pathways (FDR sorted).
KEGG Pathway Name *#GenesEnriched p-ValueFDR
hsa05205 Proteoglycans in cancer198 6.65 × 10 8 2.16 × 10 5
hsa04550 Signaling pathways regulating pluripotency of stem cells139 1.32 × 10 7 2.16 × 10 5
hsa05166 Human T-cell leukemia virus 1 infection255 6.95 × 10 7 7.29 × 10 5
hsa04510 Focal adhesion199 8.94 × 10 7 7.29 × 10 5
hsa05200 Pathways in cancer524 1.26 × 10 6 8.19 × 10 5
hsa04015 Rap1 signaling pathway206 1.93 × 10 6 1.05 × 10 4
hsa04514 Cell adhesion molecules (CAMs)144 2.31 × 10 7 1.07 × 10 4
hsa04611 Platelet activation123 7.22 × 10 6 2.94 × 10 4
hsa04072 Phospholipase D signaling pathway146 1.02 × 10 5 3.69 × 10 4
hsa04640 Hematopoietic cell lineage97 4.01 × 10 5 1.31 × 10 3
* See Supplementary Table S3 for details.
Table 6. Top significant GO-BP term enriched (FDR sorted).
Table 6. Top significant GO-BP term enriched (FDR sorted).
GO-BP Term Name *#GenesEnriched p-ValueFDR
GO:0008283 cell proliferation198600
GO:0006928 movement of cell or subcellular component196700
GO:0009891 positive regulation of biosynthetic process194900
GO:0016192 vesicle-mediated transport194200
GO:0006955 immune response191900
GO:0031328 positive regulation of cellular biosynthetic process191900
GO:0006915 apoptotic process191100
GO:0010628 positive regulation of gene expression191100
GO:2000026 regulation of multicellular organismal development190800
GO:0006468 protein phosphorylation186000
* See Supplementary Table S4 for details.
Table 7. Top significant GO-CC term enriched (FDR sorted).
Table 7. Top significant GO-CC term enriched (FDR sorted).
GO-CC Term Name *#GenesEnriched p-ValueFDR
GO:0005783 endoplasmic reticulum186100
GO:0097458 neuron part169000
GO:0031984 intrinsic component of plasma membrane167300
GO:0031984 organelle subcompartment166100
GO:0098805 whole membrane163000
GO:0005887 integral component of plasma membrane159600
GO:0005794 Golgi apparatus151600
GO:0044433 cytoplasmic vesicle part146200
GO:0044463 cell projection part142500
GO:0120038 plasma membrane bounded cell projection part142500
* See Supplementary Table S5 for details.
Table 8. Top significant GO-MF term enriched (FDR sorted).
Table 8. Top significant GO-MF term enriched (FDR sorted).
GO-MF Term Name *#GenesEnriched p-ValueFDR
GO:0042802 identical protein binding169600
GO:0005102 signaling receptor binding153800
GO:0019904 protein domain specific binding68400
GO:0044212 transcription regulatory region DNA binding896 1.33 × 10 15 6.25 × 10 13
GO:0001067 regulatory region nucleic acid binding898 2.00 × 10 15 7.50 × 10 13
GO:0003690 double-stranded DNA binding915 1.02 × 10 14 3.16 × 10 12
GO:0008134 transcription factor binding638 1.18 × 10 14 3.16 × 10 12
GO:0016301 kinase activity845 2.19 × 10 14 5.00 × 10 12
GO:1990837 sequence-specific double-stranded DNA binding823 2.49 × 10 14 5.00 × 10 12
GO:0000976 transcription regulatory region sequence-specific DNA binding781 2.66 × 10 14 5.00 × 10 12
* See Supplementary Table S6 for details.

Share and Cite

MDPI and ACS Style

Mallik, S.; Seth, S.; Bhadra, T.; Zhao, Z. A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. Genes 2020, 11, 931. https://doi.org/10.3390/genes11080931

AMA Style

Mallik S, Seth S, Bhadra T, Zhao Z. A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. Genes. 2020; 11(8):931. https://doi.org/10.3390/genes11080931

Chicago/Turabian Style

Mallik, Saurav, Soumita Seth, Tapas Bhadra, and Zhongming Zhao. 2020. "A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data" Genes 11, no. 8: 931. https://doi.org/10.3390/genes11080931

APA Style

Mallik, S., Seth, S., Bhadra, T., & Zhao, Z. (2020). A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. Genes, 11(8), 931. https://doi.org/10.3390/genes11080931

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop