Data Analysis
Hypophosphatasia (HPP) is a rare inherited disorder caused by TNAP deficiency. Previous studies have demonstrated that TNAP deficiency results in sensorimotor dysfunction [3]. Here, two RNA-seq datasets were collected from the brain neocortex and spinal cord of wild-type and −/− (global TNAP knockout) mice. This transgenic mouse model of infantile HPP represents the more severe phenotype of HPP in the human population. We expect that RNA transcriptional profiles near candidate genes may enrich the effects of genetic variants that fall within the sequenced regions, and that any new findings will help clarify the mechanisms of pediatric brain injury during brain development. In this study, each RNA-seq dataset includes 36,500 gene markers and 16 samples. With respect to the single covariate, gender, male and female mice are allocated equally to the normal and disease groups, as shown in
Appendix A Table A5. Notably, the number of genes is much larger than the number of samples. Consequently, both the spinal cord and neocortex data are cleaned to reduce their feature dimensionality. The initial analysis, which provides the fold-change, the log fold-change standard error, and the t statistic, serves as the reference for cleaning the data. Under these criteria, genes that show only a small difference in fold change between the wild-type and knockout mice and that also have an extremely small standard error are removed. Such genes can easily produce false positive signals and increase detection noise when all genes in an RNA-seq study are scanned. After cleaning, each of the two RNA-seq datasets includes 28,426 genes and 16 samples.
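The cleaning rule described above can be illustrated with a minimal sketch. The cutoffs `fc_cutoff` and `se_cutoff`, the `GeneStat` record, and the toy gene names are hypothetical; the text does not state the thresholds actually used. The sketch only shows why such genes are risky: a tiny fold change divided by a tiny standard error can still give a large t statistic.

```python
from dataclasses import dataclass


@dataclass
class GeneStat:
    name: str
    log_fc: float  # log fold-change, knockout vs. wild-type
    se: float      # standard error of the log fold-change


def clean_genes(stats, fc_cutoff=0.05, se_cutoff=0.01):
    """Drop genes that have BOTH a tiny fold change and a tiny standard error.

    The cutoffs are illustrative assumptions, not values from the study.
    """
    kept = []
    for g in stats:
        tiny_effect = abs(g.log_fc) < fc_cutoff
        tiny_se = g.se < se_cutoff
        if tiny_effect and tiny_se:
            continue  # likely to yield an inflated t statistic
        kept.append(g)
    return kept


# Toy input with made-up values.
genes = [
    GeneStat("geneA", 1.20, 0.30),
    GeneStat("geneB", 0.01, 0.002),   # removed: t = 0.01 / 0.002 = 5 despite a negligible effect
    GeneStat("geneC", 0.02, 0.25),    # kept: standard error is not tiny
    GeneStat("geneD", -0.90, 0.005),  # kept: effect is not tiny
]
kept_names = [g.name for g in clean_genes(genes)]
print(kept_names)
```

Requiring both conditions keeps genes with a large effect and a small standard error, which are the signals of interest.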
The statistic Z estimates the genetic effect for each gene, and its approximate normality underpins the successful application of the empirical Bayes method. Figure 4 shows four histograms of the Z scores of all genes, calculated by the spinal cord, neocortex, weighted gene, and weighted Z methods, respectively. The red line in each subplot is the standard normal density curve. Compared with the integration methods, the first two subplots, corresponding to the spinal cord and neocortex methods, show Z statistics with lighter tails, which makes the detection of novel genes more challenging. One possible reason is that each of these analyses uses information from only a single RNA-seq dataset. Comparing the spinal cord and neocortex subplots further, the spinal cord analysis appears to have a heavier tail while remaining close to normal, which may lead to better gene detection in the RNA-seq study. In contrast, the two integration methods, weighted gene and weighted Z, display heavier tails on both the negative and positive sides, which suggests that they may more easily detect causal genes associated with HPP.
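One simple way to quantify the tail comparison described above is to compare the observed fraction of |Z| values beyond a cutoff with the fraction expected under the standard normal curve. The sketch below is an illustration under assumptions: the cutoff of 2 and the toy Z scores are made up, and this ratio is not a statistic reported in the study.

```python
import math


def normal_two_sided_tail(c):
    """P(|Z| > c) for a standard normal Z."""
    return math.erfc(c / math.sqrt(2.0))


def empirical_two_sided_tail(z_scores, c):
    """Observed fraction of scores with |z| > c."""
    return sum(1 for z in z_scores if abs(z) > c) / len(z_scores)


def tail_ratio(z_scores, c=2.0):
    """Ratio above 1 means heavier tails than the standard normal at cutoff c."""
    return empirical_two_sided_tail(z_scores, c) / normal_two_sided_tail(c)


# Made-up Z scores for illustration only.
toy_z = [-3.1, -2.4, -1.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.7, 1.1,
         1.5, 2.2, 2.8, 0.4, -0.9, -1.3, 0.6, -0.1, 0.9, 3.5]
ratio = tail_ratio(toy_z, 2.0)
print(f"observed tail fraction: {empirical_two_sided_tail(toy_z, 2.0):.3f}")
print(f"normal tail fraction:   {normal_two_sided_tail(2.0):.4f}")
print(f"ratio:                  {ratio:.2f}")
```

A ratio well above 1 corresponds to a histogram that extends beyond the red reference curve on both sides.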
One key step in the empirical Bayes method [22] is to rank the selected genes (features). All methods rank the most important genes by their respective empirical Bayes estimates. The top gene lists of the four methods (spinal cord, neocortex, weighted gene, and weighted Z) are summarized in Tables 4 and 5. Genetic effects in both the spinal cord and the brain neocortex are important to the central nervous system. By integrating the two RNA-seq datasets, we expect the combined method to search for the most important genetic markers more efficiently and to outperform an analysis based on a single RNA-seq dataset. Most importantly, HPP is characterized by reduced serum alkaline phosphatase (ALP), and its molecular diagnosis is established by identifying loss-of-function variants in ALPL [31], the gene most strongly related to HPP. In this study, Table 4 displays the top 10 genes from the spinal cord data analysis, which successfully detects the Alpl gene (position: No. 4).
Table 4 also summarizes the top 10 genes from the neocortex data analysis, which also detects Alpl, although it is ranked much lower, at No. 26. Interestingly, another gene, Cirbp, is ranked much higher (position: No. 2). Cirbp encodes the cold-inducible RNA-binding protein and may be important in the diagnosis of HPP. Because two RNA-seq datasets are available, the two integration methods described in Section 2.4.1 and Section 2.4.2 can be applied to strengthen as many genetic signals as possible. In particular,
Table 5 summarizes the top 10 genes from the weighted gene method; the list includes Alpl (position: No. 6) and Cirbp (position: No. 9). The weighted gene method thus places both genes in its top 10 by combining the genetic effects from the two RNA-seq datasets. Additionally, four genes detected by the single spinal cord data analysis, namely GM13230, zfp990, Tmprss11d, and Eno1b, occupy the top four positions. The second integration method, the weighted Z method, also displays its top 10 genes in Table 5. It also detects Alpl and Cirbp, but the two genes identified by the single neocortex data analysis (Cirbp and Fkbp5) are ranked first, followed by Alpl, GM13230, zfp990, and Eno1b, which are detected by the single spinal cord data analysis. This suggests that the weighted Z method provides an alternative way to maximize the genetic effects from the two RNA-seq datasets. When the lists are extended from the top 10 to the top 40 genes, more of the top-ranked genes detected by the spinal cord or neocortex data analysis are selected by the integration methods. The relevant top 40 gene lists are summarized in
Appendix A (Table A1, Table A2, Table A3, and Table A4). Overall, Alpl, the most important causal gene for HPP, is successfully detected by all four methods, and in particular by the integration methods, which can combine multiple gene signals from the spinal cord and neocortex RNA-seq data. HPP is a complex disease, and its diagnosis is based on the collective actions of multiple genes. Identifying more genes will help us better understand their biological effects in the diagnosis of HPP.
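To illustrate how combining two per-gene Z statistics can change a ranking, the following sketch uses Stouffer's weighted Z combination, a generic technique. It is an assumption made for illustration and is not necessarily the weighted Z method defined in Section 2.4.2; the weights, gene names, and Z values are hypothetical.

```python
import math


def weighted_z(z1, z2, w1=1.0, w2=1.0):
    """Stouffer-style combination of two Z scores (illustrative assumption)."""
    return (w1 * z1 + w2 * z2) / math.sqrt(w1 * w1 + w2 * w2)


def rank_genes(z_a, z_b, w1=1.0, w2=1.0):
    """Rank genes present in both tissues by the absolute combined Z, largest first."""
    combined = {g: weighted_z(z_a[g], z_b[g], w1, w2) for g in z_a if g in z_b}
    return sorted(combined, key=lambda g: abs(combined[g]), reverse=True)


# Made-up per-tissue Z scores for three hypothetical genes.
z_tissue_a = {"gene_1": 3.0, "gene_2": 0.5, "gene_3": -2.0}
z_tissue_b = {"gene_1": 1.0, "gene_2": 3.0, "gene_3": -2.5}

ranking = rank_genes(z_tissue_a, z_tissue_b)
print(ranking)
```

In this toy example, the gene with a moderate but consistent signal in both tissues ranks ahead of genes that are strong in only one tissue, which is the kind of behavior an integration method is meant to capture.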
Identifying the most important genes completes the feature selection procedure. The next step is to evaluate risk prediction based on the selected candidate genes. We split the samples into a training set and a test set. Because the sample size is small, 50% of the subjects are randomly assigned to the training set and the remainder to the test set. This cross-validation procedure is repeated five times to calculate the average test error.
Table 6 summarizes and compares the prediction errors of the four methods. Specifically, the neocortex data analysis gives the largest error rate (0.15), which is consistent with its poor performance in the Z histogram. When the EB prediction rule is applied to the spinal cord data, it performs slightly better, with a smaller prediction error (0.125). The proposed integration methods, weighted gene and weighted Z, are expected to search more effectively for the most important genes for risk prediction. In particular, the error rate of the weighted gene method is 0.1, which is smaller than the error rates of the single spinal cord and neocortex data analyses. The weighted Z method combines the two Z statistics from the two RNA-seq datasets to select the causal genes, and the selected genes are then applied separately to the spinal cord and neocortex data to compute the test error rates. The final error is the average of the two results; it is approximately 0.0625, the smallest prediction error among the four methods. In summary, the integration methods, weighted Z and weighted gene, yield smaller prediction error rates and may better capture the genetic effects in the risk prediction of HPP.
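The repeated 50/50 split described above can be sketched as follows. Several parts are assumptions for illustration: a nearest-centroid classifier stands in for the EB prediction rule, the split is stratified by class so that both groups appear in every training set, and the toy expression values are made up.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest (stand-in classifier)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda lab: sqdist(centroids[lab], x))


def repeated_split_error(x, y, n_repeats=5, seed=0):
    """Average test error over repeated random 50/50 train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train_idx = []
        for label in sorted(set(y)):
            idx = [i for i, lab in enumerate(y) if lab == label]
            rng.shuffle(idx)
            train_idx.extend(idx[: len(idx) // 2])  # half of each class for training
        train_set = set(train_idx)
        test_idx = [i for i in range(len(y)) if i not in train_set]
        tx = [x[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        wrong = sum(nearest_centroid_predict(tx, ty, x[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)


# Toy data: 16 samples, 2 selected genes, two well-separated groups.
toy_x = [[0.0 + 0.1 * i, 1.0] for i in range(8)] + [[5.0 + 0.1 * i, -1.0] for i in range(8)]
toy_y = [0] * 8 + [1] * 8
avg_error = repeated_split_error(toy_x, toy_y)
print(avg_error)
```

With 16 samples, each repeat tests on 8 subjects, so five repeats yield 40 test predictions in total and the average error moves in steps of 1/40.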
We also calculate the area under the receiver operating characteristic (ROC) curve (AUC) to compare the prediction accuracy of the four methods. Table 7 reports the results. Specifically, the neocortex data analysis gives the lowest prediction accuracy (AUC: 85%), followed by the spinal cord data analysis (AUC: 87.5%); the weighted gene approach achieves an AUC of 90%, and the weighted Z method has the highest prediction accuracy (AUC: 93.75%). This analysis shows the importance of empirical Bayes estimates in developing a large-scale risk prediction model, and the two integration methods further improve prediction accuracy by jointly analyzing the two RNA-seq datasets. Figure 5 compares the ROC curves of the four methods under the cross-validation procedure. The figure shows that the two integration methods outperform the individual RNA-seq data analyses in terms of prediction accuracy. The results suggest that our proposed methods are efficient prediction tools for large-scale RNA-seq data analysis.
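For reference, the AUC can be computed directly from prediction scores as the probability that a randomly chosen diseased sample receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below uses this rank-based definition; the scores and labels are made up and do not come from the study.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


# Made-up risk scores: label 1 = disease group, label 0 = normal group.
toy_scores = [0.9, 0.5, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(round(toy_auc, 4))
```

Here 7 of the 9 positive/negative pairs are ordered correctly, so the toy AUC is 7/9.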