Article

New Empirical Bayes Models to Jointly Analyze Multiple RNA-Sequencing Data in a Hypophosphatasia Disease Study

1
Department of Mathematics and Statistics, University of Michigan-Dearborn, Dearborn, MI 48128, USA
2
Manufacturing Systems Engineering, University of Michigan-Dearborn, Dearborn, MI 48128, USA
3
Department of Natural Sciences, University of Michigan-Dearborn, Dearborn, MI 48128, USA
*
Author to whom correspondence should be addressed.
Genes 2024, 15(4), 407; https://doi.org/10.3390/genes15040407
Submission received: 23 February 2024 / Revised: 17 March 2024 / Accepted: 18 March 2024 / Published: 26 March 2024
(This article belongs to the Section RNA)

Abstract

Hypophosphatasia is a rare inherited metabolic disorder caused by a deficiency of tissue-nonspecific alkaline phosphatase. More severe and early-onset cases present with muscle weakness, diminished motor coordination, and epileptic seizures. These neurological manifestations are poorly characterized. It is therefore urgent to discover novel differentially expressed genes in order to investigate the genetic mechanisms underlying the neurological manifestations of hypophosphatasia. RNA-sequencing data offer a high-resolution and highly accurate transcript profile. In this study, we apply an empirical Bayes model separately to RNA-sequencing data acquired from the spinal cord and neocortex tissues of a mouse model, to estimate the genetic effects more accurately while reducing bias. More importantly, we develop two integration methods, the weighted gene approach and the weighted Z method, that incorporate the two RNA-sequencing datasets into one model to enhance the effects of genetic markers in the diagnostics of hypophosphatasia. Simulation and real data analysis demonstrate the effectiveness of the proposed integration methods, which maximize the genetic signals identified from the spinal cord and neocortex tissues, minimize the prediction error, and substantially improve the accuracy of risk prediction.

1. Introduction

Hypophosphatasia (HPP) is a rare inherited metabolic disorder, and its severe forms were estimated to affect 1 in 100,000 births in Canada and 1 in 300,000 in Europe, while moderate forms of HPP are 50 times more frequent [1]. US data suggest that HPP is more prevalent in white people than in black people [1]. The ethnic group with the highest reported incidence of HPP is the Mennonites in Manitoba, Canada, with 1 in 25 individuals carrying a tissue-nonspecific alkaline phosphatase (TNAP) mutation and around 1 in 25,000 newborns having lethal HPP [2]. In fact, HPP is caused by the deficiency in TNAP, and TNAP activity is essential for bone formation, mineralization, and differentiation of bone marrow stromal cells. Specifically, TNAP deficiency in bone and muscle progenitor cells results in mitochondrial hyperfunction and increased ATP levels, both of which greatly influence cell function and survival [3]. The clinical symptoms of HPP widely vary according to the patient’s age at the onset of the disorder. Less severe forms of HPP are often characterized by pain or fractures over time [4,5] due to poor bone mineralization. More severe and early onset cases of HPP present issues with muscle weakness, diminished motor coordination [3], and epileptic seizures [6]. These neurological manifestations are poorly characterized, especially in children. Therefore, the development of strategies to improve the neurological outcomes is urgently needed. One goal of this study is to propose a statistical model to accurately reveal the genetic mechanism underlying the neurological manifestations of HPP. In this study, the tissue samples come from Alpl+/+ (wild type) and Alpl−/− (Global TNAP knockout) mice to represent the phenotype of patients with infantile HPP in the human population [3]. We have collected two RNA-sequencing (RNA-seq) data from the spinal cord and neocortex tissues of a mouse model of infantile hypophosphatasia. 
The more important task is to develop a new method to integrate multiple RNA-seq data for discovering influential genes associated with the HPP disease and predicting the disease presence in subjects.
In 2008, RNA-seq was first introduced to the field of transcriptomics [7,8,9,10], and it has since greatly transformed the genetic study. Transcriptomics is the study of the complete set of RNA transcripts that are produced by the genome under specific conditions or in a specific cell. Studying transcript levels is essential to understand the translation process from information encoded in the genome to cellular functions. Consequently, transcript profiling is an important tool for predicting the presence of disease. RNA-seq offers several advantages over other types of transcriptome profiling, particularly single-cell RNA-seq which can identify novel transcripts not corresponding to an existing genomic sequence. Additionally, RNA-seq has a low background signal, and a large dynamic range of expression levels to detect transcripts, and it can precisely locate transcription boundaries down to a single-base resolution [11]. More practically, a large amount of RNA transcript is needed, but the cost of mapping whole genome is lower than other data, such as microarrays or cDNA sequencing [11].
Although RNA-seq is now widely used, analyzing the resulting data properly remains a major challenge [12]. Several pipelines and publications outline best practices for the entire process, from extraction to data analysis [12,13,14,15]. Many traditional statistical methods have been applied to RNA-seq data, such as likelihood ratio tests [16] and negative binomial models [16,17]. With the rapid growth of deep learning and machine learning in many domains, deep learning models have recently been applied to RNA-seq data [18,19], but with limited success, because high-throughput sequencing data typically contain a large number of genetic markers and only a small number of samples.
As mentioned above, the number of genetic markers in an RNA-seq dataset far exceeds the number of samples, which makes it difficult to estimate genetic effects accurately from high-dimensional variants. Several frequentist methods have been developed to shrink genetic estimates and reduce bias, such as LASSO [20] and the shrunken centroids method [21]. However, these methods tend to overestimate the genetic effects and include noisy signals. Bayesian approaches, by contrast, produce estimates that are less prone to this estimation bias in high-dimensional data analysis. In addition, an empirical Bayes (EB) method [22] can simplify the calculations of Bayesian approaches and help address the selection bias that arises in a large feature space. This method has been successfully applied to microarray data in a prostate cancer study [22,23]. Later, the EB method was extended to exome DNA sequencing data to predict the risk of cardiovascular disease [24]. More recently, a weighted EB model was derived to incorporate multiple traits, namely a quantitative disease trait and a binary disease status, in a whole-genome DNA sequencing analysis to better characterize the genetic mechanism underlying hypertension [25]. Here, with RNA-seq data collected from both the spinal cord and neocortex tissues of a mouse model, there remains a need for a new EB method that combines the two RNA-seq datasets in one model to maximize the effects of genetic markers and effectively reduce estimation bias in high-dimensional data analysis.
In this study, we apply the EB method to the spinal cord and neocortex RNA-seq data, separately, for more accurately discovering the differentially expressed (DE) genes in the HPP disease study. More importantly, we further propose two integration methods, weighted gene approach and weighted Z method, to incorporate two RNA-seq data into a model for enhancing the effects of genetic variants in the diagnostics of HPP disease. We expect that the proposed integration methods can strengthen the genetic signals and improve the prediction performance, because some disease-causal genes may be differentially expressed at one specific tissue, not at both tissues. To test this conjecture, we compare the empirical Bayes method, which is applied to the single RNA-seq data, with two integration empirical Bayes methods jointly analyzing two RNA-seq data. The results will demonstrate the efficiency of our proposed integration methods.
This paper is organized as follows. Section 2 describes four EB methods. Section 3 evaluates the effectiveness of the two integration methods through simulation. Section 4 presents the real RNA-seq data analysis, including a description of the datasets, the distributions of the statistics, the ranking of important features, and a comparison of the methods. Section 5 concludes with remarks on the data analysis, the limitations of the current methods, and directions for future work.

2. Methods

2.1. Step I: Gene Expression

The experimental procedures of this study were approved by the Institutional Animal Care and Use Committee (IACUC) of the University of Michigan (Protocol number: PRO00010860, date of approval: 22 August 2022). Two RNA-seq datasets were collected, one from spinal cord tissue and one from neocortex tissue. Each dataset comprises tens of thousands of genes measured in 16 mice. The design is balanced: 8 mice belong to the wild-type group and 8 mice to the knock-out group. The wild type is treated as the normal group, and the knock-out class is treated as the disease group. The two RNA-seq datasets share the same samples, and we analyze them jointly under the assumption that combining DE genes detected from the two tissues may strengthen the genetic signal in the diagnostics of HPP disease. Moreover, the raw RNA-seq data are transformed with a binary logarithm (log2), so that the distribution of the transformed data more closely resembles a normal distribution, which better suits the empirical Bayes method. The workflow of the study is summarized in Figure 1.

2.2. Step II: Empirical Bayes Prediction Rule for a Spinal Cord RNA-Sequencing Data

The spinal cord is a critical part of the central nervous system, and the analysis of RNA-seq data collected from it helps us understand the genetic mechanism of the HPP disease. We assume that there are N genes $X_p = (X_{1|p}, \ldots, X_{N|p})$, where $X_p$ is an $n \times N$ matrix measuring N log-transformed genes of n subjects, and p is the label for spinal cord. The corresponding standardized gene matrix $W_p = (W_{1|p}, \ldots, W_{N|p})$ is defined by $W_{i|p} = \frac{X_{i|p} - \frac{\mu_{i,1|p} + \mu_{i,2|p}}{2} 1_n}{\sigma_{i|p}}$, $i = 1, \ldots, N$, where $X_{i|p}$ is the column of the matrix $X_p$ measuring the $i$th transformed gene for all subjects, $1_n$ is a vector of ones of size n, and $\mu_{i,1|p}$ and $\mu_{i,2|p}$ denote the mean of the $i$th gene in the normal and disease groups, respectively. $\sigma_{i|p}$ is the standard deviation of the $i$th gene, and n represents the total number of subjects, consisting of $n_1$ controls (normal subjects) and $n_2$ cases (disease subjects) with $n_1 = n_2 = n/2$. Thus, the prediction rule is to classify subjects into the disease group if
$$\sum_{i=1}^{N} \delta_{i|p} W_{i|p} > 0$$
where the parameter $\delta_{i|p} = d_0 \frac{\mu_{i,2|p} - \mu_{i,1|p}}{\sigma_{i|p}}$ (with $d_0 = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \frac{\sqrt{n}}{2}$) measures the impact of the $i$th gene on the HPP disease. Because a balanced design is applied in this study ($n_1 = n_2$), a threshold of zero is selected to define the decision boundary. In the above inequality, 0 denotes a zero vector, so subjects with large prediction values are allocated to the disease group.
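As a quick numerical check of the scaling constant, reading $d_0$ as the usual two-sample factor and taking the balanced design of this study ($n_1 = n_2 = 8$, so $n = 16$):

$$d_0 = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \sqrt{\frac{8 \times 8}{16}} = 2 = \frac{\sqrt{16}}{2}$$

Under this reading, a gene whose group means differ by exactly one standard deviation ($\mu_{i,2|p} - \mu_{i,1|p} = \sigma_{i|p}$) would have $\delta_{i|p} = 2$. The one-standard-deviation gap is an illustrative value, not a result from this study.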
In reality, $\sigma_{i|p}$ is often unknown, but it can be estimated by the statistic $s_{i|p} = \sqrt{\frac{(n_1 - 1) s^2_{i,1|p} + (n_2 - 1) s^2_{i,2|p}}{n - 2}}$, where $s^2_{i,1|p}$ and $s^2_{i,2|p}$ are the sample variances of the $i$th gene in the normal group and the disease group, respectively. Hence, the parameter $\delta_{i|p}$ will be estimated by
$$t_{i|p} = d_0 \frac{\bar{X}_{i,2|p} - \bar{X}_{i,1|p}}{s_{i|p}} \sim t_{n-2}(\delta_i, 1)$$
Note that $\bar{X}_{i,k|p} = \frac{\sum_{j=1}^{n} X_{ij|p} I(Y_{j|p} = k - 1)}{n_k}$ (k = 1 or 2) is the sample mean of the $i$th gene in the normal or disease group, where $Y_{j|p}$ (= 0 or 1) is the $j$th subject's disease status. To take advantage of normality, $t_{i|p}$ is converted into a normal variable $Z_{i|p} = \Phi^{-1}(P(T \le t_{i|p}))$, where $\Phi^{-1}$ is the inverse of the cumulative distribution function of the standard normal and $P(T \le t_{i|p})$ is the cumulative distribution function of the t distribution with $n - 2$ degrees of freedom. Thus, $Z_{i|p}$ becomes an estimate of $\delta_i$ when we analyze the spinal cord RNA-seq data, and it follows a normal distribution with mean $\delta_i$ and standard deviation close to 1 ([22], Remark F). The standardized gene $W_{i|p}$ is estimated by $\hat{W}_{i|p} = \frac{X_{i|p} - \frac{\bar{X}_{i,1|p} + \bar{X}_{i,2|p}}{2} 1_n}{s_{i|p}}$. The prediction rule based on $Z_{i|p}$ is defined below,
$$\sum_{i=1}^{N} Z_{i|p} \hat{W}_{i|p} > 0$$
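The conversion from a two-sample t statistic to a Z score can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names (`pooled_t`, `t_cdf`, `t_to_z`) are hypothetical, the t distribution function is approximated by Simpson integration, and the clipping constant `eps` is an assumption added to keep the inverse normal finite for extreme statistics.

```python
import math
from statistics import NormalDist, mean, variance


def pooled_t(group1, group2):
    """Two-sample t statistic d0 * (mean2 - mean1) / s, with s the pooled SD."""
    n1, n2 = len(group1), len(group2)
    d0 = math.sqrt(n1 * n2 / (n1 + n2))
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return d0 * (mean(group2) - mean(group1)) / math.sqrt(pooled_var)


def t_cdf(t, df, steps=2000):
    """Student t CDF, via Simpson integration of the density from 0 to |t|."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    half = total * h / 3  # P(0 < T < |t|)
    return 0.5 + half if t >= 0 else 0.5 - half


def t_to_z(t, df, eps=1e-12):
    """Z = Phi^{-1}(P(T <= t)); the probability is clipped away from 0 and 1."""
    p = min(max(t_cdf(t, df), eps), 1 - eps)
    return NormalDist().inv_cdf(p)


# Toy log2 expression values for one gene (made-up numbers, 8 mice per group).
wild_type = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
knock_out = [5.6, 5.4, 5.9, 5.5, 5.7, 5.3, 5.8, 5.5]
t_stat = pooled_t(wild_type, knock_out)
z_score = t_to_z(t_stat, df=len(wild_type) + len(knock_out) - 2)
```

With 16 subjects the t distribution has 14 degrees of freedom, so the conversion pulls large t statistics toward smaller Z scores; with many more subjects the two would nearly coincide.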
However, the estimate $Z_{i|p}$ may result in a selection bias, wherein genetic effects are often overestimated in the above prediction rule. It is known that Bayesian methods may provide less biased estimates [26,27], and previous studies [28,29] have shown that the marginal density function of the statistic $Z_{i|p}$ is capable of deriving the Bayesian estimate without knowing the prior of $\delta_i$. For any pair $(Z, \delta)$, the Bayesian estimate $\hat{\delta}$ is obtained from the first derivative of the logarithm of the marginal density of Z [22],
$$Z \mid \delta \sim N(\delta, 1), \qquad l\hat{f}(z) = \log(\hat{f}(z)), \qquad \hat{\delta} = z + s^2 \, l\hat{f}'(z) = \hat{E}(\delta \mid z)$$
where $l\hat{f}(\cdot)$ estimates the logarithm of the marginal density of Z, $l\hat{f}'(z)$ is its first derivative function, and $s^2$ (the sample variance) is numerically calculated. Basically, the empirical Bayes method provides a numerical approximation of the Bayes estimates [22].
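A worked special case may help show how the shrinkage acts. Suppose, purely for illustration and not as an assumption made in this study, that the effects follow a normal prior $\delta \sim N(0, A)$ and that $s^2 = 1$. Then the marginal distribution of $Z$ is $N(0, 1 + A)$, and

$$l f(z) = -\frac{z^2}{2(1 + A)} + \text{const}, \qquad l f'(z) = -\frac{z}{1 + A}, \qquad \hat{\delta} = z - \frac{z}{1 + A} = \frac{A}{1 + A}\, z$$

With $A = 1$, an observed $z = 4$ is shrunk to $\hat{\delta} = 2$. In practice the marginal density is estimated from the data rather than assumed, so the amount of shrinkage varies with $z$.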
Finally, a subset of genes with large EB estimates, denoted $\hat{\delta}_{i|p}$, defines the following prediction rule:
$$\sum_{i \in I_1} \hat{\delta}_{i|p} \hat{W}_{i|p} > 0$$
where the set $I_1$ collects all potential genes with strong EB estimates ($\hat{\delta}_{i|p}$, $i \in I_1$) in the spinal cord RNA-seq analysis.
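The prediction rule over a selected gene set can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the gene set is chosen as the m largest absolute EB estimates (the text does not state the cutoff), and a new subject is standardized with the training-set midpoint and pooled standard deviation.

```python
import math
from statistics import mean, variance


def center_and_scale(values, labels):
    """Return the class-mean midpoint and pooled SD of one gene.

    values: log2 expression of the gene in the training subjects
    labels: 0 (normal) or 1 (disease) for the same subjects
    """
    g1 = [v for v, y in zip(values, labels) if y == 0]
    g2 = [v for v, y in zip(values, labels) if y == 1]
    midpoint = (mean(g1) + mean(g2)) / 2
    pooled_var = ((len(g1) - 1) * variance(g1)
                  + (len(g2) - 1) * variance(g2)) / (len(values) - 2)
    return midpoint, math.sqrt(pooled_var)


def select_top(delta_hat, m):
    """Indices of the m genes with the largest |EB estimate| (assumed cutoff rule)."""
    order = sorted(range(len(delta_hat)), key=lambda i: abs(delta_hat[i]), reverse=True)
    return order[:m]


def classify(train_genes, labels, delta_hat, selected, subject):
    """Return 1 (disease) if the sum of delta_hat_i * W_hat_i over selected genes is > 0."""
    score = 0.0
    for i in selected:
        midpoint, s = center_and_scale(train_genes[i], labels)
        score += delta_hat[i] * (subject[i] - midpoint) / s
    return 1 if score > 0 else 0


# Toy data: two genes (rows) measured in six training subjects (columns).
genes = [[1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
         [4.0, 5.0, 6.0, 4.5, 5.5, 6.5]]
labels = [0, 0, 0, 1, 1, 1]
delta_hat = [2.0, 0.0]            # made-up EB estimates
chosen = select_top(delta_hat, 1)  # keep only the strongest gene
prediction = classify(genes, labels, delta_hat, chosen, [6.5, 5.0])
```

Restricting the sum to the selected set keeps near-null genes from adding noise to the score, which is the motivation for ranking by the EB estimates before prediction.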

2.3. Step III: Empirical Bayes Prediction Rule for a Neocortex RNA-Sequencing Data

The second RNA-seq dataset is collected from the brain neocortex of the mouse, and it is also used to explore the genetic mechanism of HPP disease. Here, the feature dimension of the neocortex RNA-seq data is the same as that of the above spinal cord data. For N genes, $X_c = (X_{1|c}, \ldots, X_{N|c})$ denotes all genes measured from the neocortex tissue, and the corresponding standardized gene matrix $W_c = (W_{1|c}, \ldots, W_{N|c})$ is calculated by $W_{i|c} = \frac{X_{i|c} - \frac{\mu_{i,1|c} + \mu_{i,2|c}}{2} 1_n}{\sigma_{i|c}}$, $i = 1, \ldots, N$, where c is the label for the neocortex, $\mu_{i,1|c}$ and $\mu_{i,2|c}$ denote the mean of the $i$th gene in the normal and disease groups, respectively, $\sigma_{i|c}$ denotes the standard deviation of the $i$th gene, and n represents the total number of subjects. Since the two RNA-seq datasets share the same samples, n, $n_1$, and $n_2$ are defined in the same way. Thus, the prediction rule is to classify subjects into the disease group if
$$\sum_{i=1}^{N} \delta_i W_{i|c} > 0$$
where the parameter $\delta_i = d_0 \frac{\mu_{i,2|c} - \mu_{i,1|c}}{\sigma_{i|c}}$ assesses the impact of the $i$th gene on the disease when we analyze the neocortex RNA-seq data. Since an equal number of samples are assigned to the normal and disease groups, the balanced design allows a zero threshold to determine the decision boundary.
Similarly, an estimator $t_{i|c}$ of $\delta_i$ is calculated below:
$$t_{i|c} = d_0 \frac{\bar{X}_{i,2|c} - \bar{X}_{i,1|c}}{s_{i|c}} \sim t_{n-2}(\delta_i, 1)$$
Note that this statistic $t_{i|c}$ is also converted into a normal variable $Z_{i|c} = \Phi^{-1}(P(T \le t_{i|c}))$, which has a selection bias in practice. We shrink $Z_{i|c}$ to obtain the relevant EB estimator $\hat{\delta}_{i|c}$ for the $i$th gene ($i = 1, \ldots, N$). The final prediction rule of the neocortex RNA-seq data analysis is defined as
$$\sum_{i \in I_2} \hat{\delta}_{i|c} \hat{W}_{i|c} > 0$$
where $\hat{W}_{i|c}$ is the $i$th sample standardized gene vector calculated from the neocortex data, with a formula similar to that of Section 2.2, and the set $I_2$ collects all genes with important EB estimates ($\hat{\delta}_{i|c}$, $i \in I_2$).

2.4. Step IV: Empirical Bayes Prediction Rule When Integrating the Spinal Cord and Neocortex RNA-Seq Data

2.4.1. Weighted Gene Approach

Since two RNA-seq data are collected from spinal cord and neocortex tissues, we expect that combining the genetic signals from multiple RNA-seq data may improve the genetic effects in the diagnostics of HPP disease. The idea of weighted gene method was first developed in the empirical Bayes model by integrating the synonymous gene and non-synonymous gene together for improving the cardiovascular disease prediction performance in an exome DNA sequencing study [24]. Here, we extend this idea to combine two RNA-seq data acquired from the spinal cord and neocortex tissues, and it may maximize the genetic signals.
For all subjects, $X_p$ and $X_c$ denote the N genes of the spinal cord and neocortex, respectively, where p represents spinal cord and c denotes neocortex. When we analyze the two RNA-seq datasets, a weight $w_i$ measuring the relative importance of the $i$th gene effect detected from the spinal cord data compared with that from the neocortex data is calculated by
$$w_i = \frac{-\log(p_{i|p})}{-\log(p_{i|p}) - \log(p_{i|c})} \quad \text{for } i = 1, \ldots, N$$
where $p_{i|p}$ and $p_{i|c}$ are p-values calculated from a simple logistic regression model that tests the effect of the $i$th gene on the HPP disease in the spinal cord data and the neocortex data, respectively. A larger weight suggests that the genetic effect identified from the spinal cord may have a relatively stronger association with the disease than that from the neocortex. A new weighted gene $X_i^*$ uses this weight to combine the genetic effects from the two RNA-seq datasets, and it is defined below:
$$X_i^* = w_i X_{i|p} + (1 - w_i) X_{i|c} \quad \text{for } i = 1, \ldots, N$$
These N weighted genes $(X_1^*, \ldots, X_N^*)$ are intended to maximize the genetic effects detected from the spinal cord and neocortex tissues. The corresponding empirical Bayes prediction rule classifies subjects into the disease group if
$$\sum_{i=1}^{N} \delta_i W_i^* > 0$$
Note that $\delta_i$ measures the genetic effect of the $i$th standardized weighted gene, where $W_i^* = \frac{X_i^* - \frac{\mu_{i,1}^* + \mu_{i,2}^*}{2} 1_n}{\sigma_i^*}$ ($i = 1, \ldots, N$) is the $i$th standardized weighted gene, defined in the same way as in the preceding sections. Similarly, $\mu_{i,1}^*$ and $\mu_{i,2}^*$ are the mean values of the $i$th weighted gene in the normal group and the disease group, and $\sigma_i^*$ is the standard deviation of the $i$th weighted gene. As before, the statistic $t_i^*$ is calculated to estimate the parameter $\delta_i$, which reflects the effect of the $i$th weighted gene.
$$t_i^* = d_0 \frac{\bar{X}_{i,2}^* - \bar{X}_{i,1}^*}{s_i^*} \sim t_{n-2}(\delta_i, 1)$$
$$Z_i^* = \Phi^{-1}(P(T \le t_i^*))$$
where $\bar{X}_{i,1}^*$ and $\bar{X}_{i,2}^*$ are the sample means of the $i$th weighted gene in the normal class and disease class, respectively, $s_i^*$ is the pooled sample standard deviation of the $i$th weighted gene, and $d_0 = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$. The statistic $t_i^*$ is converted into an estimator $Z_i^*$ through the inverse standard normal function $\Phi^{-1}$ and the cumulative distribution function $P(T \le t_i^*)$.
The normality of $Z_i^*$ supports the estimation of the parameter $\delta_i$. To avoid selection bias, the empirical Bayes method shrinks the estimator $Z_i^*$ to an EB estimate $\hat{\delta}_i$. The final prediction rule is based on all important EB estimates,
$$\sum_{i \in I_3} \hat{\delta}_i \hat{W}_i^* > 0$$
where the set $I_3$ collects all standardized weighted genes having strong EB estimates, and $\hat{W}_i^*$ is the estimate of the $i$th standardized weighted gene combining the spinal cord and neocortex genetic signals, with a standardization formula defined in the same way as in the previous section.
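The weighting step of the weighted gene approach can be sketched as follows. This is an illustration under assumptions, not the authors' code: the per-gene p-values are taken as given (they would come from the simple logistic regressions described above), the weight is read as the share of the negative log p-value contributed by the spinal cord, the function names are hypothetical, and the fallback of 0.5 when both p-values equal 1 is an added convention.

```python
import math


def gene_weight(p_spinal, p_cortex):
    """w_i = -log(p_spinal) / (-log(p_spinal) - log(p_cortex)); p-values must be > 0."""
    a, b = -math.log(p_spinal), -math.log(p_cortex)
    if a + b == 0:  # both p-values equal 1: no evidence in either tissue
        return 0.5
    return a / (a + b)


def weighted_gene(x_spinal, x_cortex, w):
    """X*_i = w * X_{i|p} + (1 - w) * X_{i|c}, computed subject by subject."""
    return [w * xp + (1 - w) * xc for xp, xc in zip(x_spinal, x_cortex)]


# Made-up p-values: strong evidence in the spinal cord, weak in the neocortex.
w = gene_weight(0.001, 0.1)
# Made-up log2 expression of one gene in two subjects, for each tissue.
x_star = weighted_gene([4.0, 8.0], [0.0, 4.0], 0.75)
```

Because the weight is a ratio of logarithms, the base of the logarithm does not matter; a gene that is significant in only one tissue is represented almost entirely by that tissue's expression values.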

2.4.2. Weighted Z Method

The weighted gene method calculates a weight $w_i$ for each gene in order to amplify the genetic signal from the two RNA-seq datasets. An alternative way to strengthen the genetic signal is to calculate a weighted Z score, which combines the Z statistic from the spinal cord data with that from the neocortex data. We expect this to increase the genetic effects detected when scanning the two RNA-seq datasets.
$Z_{i|p}$, defined in Section 2.2, is a genetic estimate of the $i$th gene, and this statistic reflects the importance of the $i$th gene identified from the spinal cord data. Similarly, $Z_{i|c}$, defined in Section 2.3, is also an estimate of the $i$th gene, and it captures the genetic effect calculated from the neocortex data. A weight $zw_i$ is defined below:
$$zw_i = \frac{|Z_{i|p}|}{|Z_{i|p}| + |Z_{i|c}|} \quad \text{for } i = 1, \ldots, N$$
where $|Z_{i|p}|$ and $|Z_{i|c}|$ are the absolute values of the statistics calculated from the spinal cord data and neocortex data, respectively. For a fixed sample, the weight $zw_i$ reflects the relative importance of the $i$th gene effect detected from the spinal cord data compared with that from the neocortex data. A larger weight indicates that the $i$th gene shows a stronger signal in the spinal cord than in the neocortex. To better capture the genetic effect from the two RNA-seq datasets, a weighted Z statistic is calculated below:
$$Z_i^w = zw_i Z_{i|p} + (1 - zw_i) Z_{i|c}$$
According to Section 2.2 and Section 2.3, $Z_{i|p}$ and $Z_{i|c}$ each follow a normal distribution with mean $\delta_i$ and variance close to 1. The weighted $Z_i^w$ is a function of the two variables $Z_{i|p}$ and $Z_{i|c}$. With the data fixed, this new statistic $Z_i^w$ approximately follows a normal distribution with mean $\delta_i$ and standard deviation $s_{Z_i^w} = \sqrt{zw_i^2 + (1 - zw_i)^2 + 2\, zw_i (1 - zw_i) \rho}$, where $\rho$ is the correlation coefficient between the two statistics $Z_p$ and $Z_c$ calculated from the spinal cord and neocortex data. In a real application, the statistic $Z_i^w$ reflects the genetic effect of the $i$th gene after combining the two statistics $Z_{i|p}$ and $Z_{i|c}$. We then apply the empirical Bayes method to shrink $Z_i^w$, and the resulting EB estimate $\hat{\delta}_i$ is used to avoid selection bias. The final prediction rule is determined by all important EB estimates.
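The weighted Z calculation for a single gene can be sketched as follows. The function name and the default `rho=0.0` are hypothetical; in the study, ρ would be the correlation between the spinal cord and neocortex Z statistics. The fallback weight of 0.5 when both Z scores are zero is an added convention, not something stated in the text.

```python
import math


def weighted_z(z_spinal, z_cortex, rho=0.0):
    """Return (zw, Z^w, sd) for one gene.

    zw  = |Z_p| / (|Z_p| + |Z_c|)
    Z^w = zw * Z_p + (1 - zw) * Z_c
    sd  = sqrt(zw^2 + (1 - zw)^2 + 2 * zw * (1 - zw) * rho)
    """
    a, b = abs(z_spinal), abs(z_cortex)
    zw = 0.5 if a + b == 0 else a / (a + b)
    z_w = zw * z_spinal + (1 - zw) * z_cortex
    sd = math.sqrt(zw ** 2 + (1 - zw) ** 2 + 2 * zw * (1 - zw) * rho)
    return zw, z_w, sd


# Made-up Z scores: a strong spinal cord signal and a weak neocortex signal.
zw, z_w, sd = weighted_z(3.0, 1.0, rho=1.0)
```

Note that when the two tissues give Z scores of opposite sign, the weighted statistic partially cancels (for example, 3 and -1 combine to 2), so the method favors genes whose direction of effect agrees across tissues or that are strong in one tissue only.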

3. Simulation

Simulation Analysis

To assess the performance of our proposed methods, a semi-parametric simulation method (SPsimSeq [30]) is applied to generate two RNA-seq datasets corresponding to the spinal cord and neocortex RNA-seq data. This simulation method takes raw count data as input and is designed to retain as many characteristics of the real RNA-seq data as possible. In particular, the two simulated datasets capture the gene-wise distributions and the between-gene correlation structure of the real source data. In the simulation scenario, 3000 genes and 16 samples are simulated. Around 2% of the genes (65 genes) are selected to be DE genes that are important to HPP disease, and the remaining genes are null genes. The 16 samples are allocated to two classes, with eight subjects assigned to the disease group and eight to the normal group. Figure 2 and Figure 3 compare the variability and the distribution of mean expression levels between the simulated data and the real data. They illustrate that both simulated datasets have retained the major characteristics of the real spinal cord and neocortex RNA-seq data.
Feature selection is a key step in risk prediction, where genes are ranked by their importance to the disease. All methods are compared by how many true DE genes they discover. We expect more of the DE genes selected by our proposed methods to have larger effects and to occupy top positions in the list. Table 1 summarizes the number of DE genes among the top 100 or 200 genes identified by each method; the more DE genes a method discovers, the better it performs. The first two methods (Spinal and Cortex) apply the empirical Bayes model to the spinal cord simulation data alone and to the neocortex simulation data alone, respectively (Table 1). The spinal cord analysis discovers 21 true DE genes among the top 100/200 genes, whereas the neocortex analysis identifies only five DE genes among the top 100/200 genes. These two simulation results are consistent with the findings of the real data analysis: compared with the spinal cord data, the neocortex data show weak genetic signals. Table 1 also summarizes the results of the two integration methods. The weighted gene approach detects 53 and 58 DE genes among the top 100 and top 200 genes, respectively. Because this method tries to maximize the genetic signals from the two RNA-seq datasets, the 53 selected DE genes consist of 44 spinal cord DE genes and nine neocortex DE genes, and the 58 DE genes include 49 spinal cord DE genes and nine neocortex DE genes. The alternative integration method, weighted Z, detects an even larger number of true DE genes: 56 among the top 100 genes and 61 among the top 200 genes. This finding is also consistent with the real data analysis, where weighted Z performs better in terms of prediction error and accuracy. In addition, we display the details of the top 30 genes ranked by the various methods (Table 2), where yellow highlights the spinal cord DE genes and cyan denotes the neocortex DE genes. These results illustrate that the two integration methods are capable of strengthening the genetic signals from the two simulated RNA-seq datasets.
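The comparison metric used in Table 1, the number of true DE genes among the top-ranked genes, can be sketched as follows. The function name, the gene labels, and the scores are hypothetical; the ranking by absolute EB estimate is an assumption about how "top" genes are ordered.

```python
def count_true_de(scores, true_de, k):
    """Number of true DE genes among the k genes with the largest |score|.

    scores:  dict mapping gene name -> EB estimate
    true_de: set of gene names simulated as differentially expressed
    """
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return sum(1 for g in ranked[:k] if g in true_de)


# Made-up estimates for four genes, two of which are truly DE.
scores = {"g1": 3.0, "g2": -2.5, "g3": 0.1, "g4": -0.2}
true_de = {"g2", "g4"}
hits_top2 = count_true_de(scores, true_de, 2)
hits_top4 = count_true_de(scores, true_de, 4)
```

In the simulation the true DE set is known by construction, which is what makes this count available; in the real data only known disease genes such as Alpl can play that role.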
We further compare the prediction errors of the various methods. Both simulated datasets are split into two sets, a training dataset (50% of samples) and a test dataset (50% of samples). Each proposed model is fitted to the training set, and the fitted model is then applied to the test set to calculate the prediction error. We repeat this cross-validation (CV) procedure five times to calculate the average test error. Table 3 summarizes the error rate and its standard error for each method. The results show that the two integration methods perform best: weighted Z has the smallest error rate (0.16), followed by the weighted gene approach (0.225), the single spinal cord data analysis (0.275), and the single neocortex data analysis (0.3). Overall, the simulation study leads to four observations about the RNA-seq data. (1) The two simulated datasets retain the major characteristics of the real spinal cord and neocortex RNA-seq data. (2) The two integration methods, weighted gene and weighted Z, detect a larger number of DE genes and achieve a smaller prediction error than the single spinal cord or single neocortex data analysis. (3) Of the two single-dataset analyses, the spinal cord analysis detects DE genes more easily, which suggests that stronger gene effects are present in the spinal cord tissue. (4) The results on simulated data are generally consistent with those on real data.
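The repeated 50/50 split described above can be sketched as follows. This is a generic illustration, not the authors' procedure: the function names are hypothetical, the split is stratified by class (an assumption, so that both groups appear in every training set), and the midpoint classifier is a toy stand-in for the EB prediction rule.

```python
import random


def repeated_holdout_error(features, labels, fit, predict, repeats=5, seed=0):
    """Average test error over repeated 50/50 splits, stratified by class."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        train, test = [], []
        for cls in sorted(set(labels)):
            idx = [j for j, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            half = len(idx) // 2
            train += idx[:half]
            test += idx[half:]
        model = fit([features[j] for j in train], [labels[j] for j in train])
        wrong = sum(predict(model, features[j]) != labels[j] for j in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)


def fit_midpoint(xs, ys):
    """Toy model: the midpoint between the two class means of a single score."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2


def predict_midpoint(cut, x):
    """Toy rule: classify as disease (1) when the score exceeds the midpoint."""
    return 1 if x > cut else 0


# Made-up, well-separated scores for 8 subjects (4 per class).
features = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
avg_error = repeated_holdout_error(features, labels, fit_midpoint, predict_midpoint)
```

With only 16 subjects, each test set holds 8 subjects, so a single repeat can only produce error rates in steps of 0.125; averaging over five repeats smooths this but the estimate remains coarse.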

4. Real Application

Data Analysis

Hypophosphatasia is a rare inheritable disorder caused by TNAP deficiency. Previous studies have demonstrated that TNAP deficiency results in sensorimotor dysfunction [3]. Here, two RNA-seq datasets were collected from the brain neocortex and spinal cord of Alpl+/+ (wild-type) and Alpl−/− (global TNAP knockout) mice. This transgenic mouse model of infantile HPP represents the more severe phenotype of HPP in the human population. We expect that RNA transcriptional profile data near candidate genes may enrich the effects of genetic variants that fall within the sequence data, and any new findings will help us better understand the mechanisms of pediatric brain injuries during brain development. In this study, both RNA-seq datasets include 36,500 gene markers and 16 samples. With respect to the one covariate, gender, male and female mice are equally allocated to the normal and disease groups (Appendix A, Table A5). The number of genes is much larger than the number of samples, so both the spinal cord and neocortex data are cleaned to reduce their feature dimensionality. An initial analysis, which provides the log2 fold change, the standard error of the log fold change, and the t statistic, serves as the reference for cleaning the data. Genes that show a small difference in fold change between the wild-type and knock-out mice together with an extremely small standard error are removed, because they can easily produce false positive signals and increase the detection noise when all genes in an RNA-seq study are scanned. After cleaning, each of the two RNA-seq datasets contains 28,426 genes and 16 samples.
The statistic Z estimates the genetic effect of each gene, and its normality underpins the application of the empirical Bayes method. Figure 4 shows four histograms of the Z scores of all genes calculated with the spinal cord, neocortex, weighted gene, and weighted Z methods, respectively. The red line in each subplot represents the standard normal density curve. Compared with the integration methods, the first two subplots (spinal cord and neocortex) show Z statistics with light tails, which makes the detection of novel genes more challenging. One possible reason is that these two analyses use the information of only a single RNA-seq dataset. Comparing the spinal cord and neocortex subplots with each other, the spinal cord analysis appears to have a heavier tail and to follow the normal shape well, which may lead to better performance in gene detection. In contrast, the two integration methods, weighted gene and weighted Z, display heavier tails on both the negative and positive sides, which suggests that they may more easily detect causal genes associated with the HPP disease.
One key step in the empirical Bayes method [22] is to rank the selected genes (features). All methods rank the most important genes by their respective empirical Bayes estimates, and the top gene lists of the four methods (spinal cord, neocortex, weighted gene, and weighted Z) are summarized. The genetic effects from the spinal cord and brain neocortex are both important to the central nervous system. When we integrate the two RNA-seq datasets, we expect the combined method to search for the most important genetic markers more efficiently and to perform better than an analysis based on a single RNA-seq dataset. Most importantly, HPP is characterized by reduced serum alkaline phosphatase (ALP), and its molecular diagnosis is established by identifying loss-of-function variants of ALPL [31], the gene most strongly related to HPP disease. In this study, Table 4 displays the top 10 genes from the spinal cord data analysis, which successfully detects the Alpl gene (position: No. 4). Table 4 also summarizes the top 10 genes from the neocortex data analysis, which also detects Alpl, although it is ranked much lower, at No. 26. Interestingly, another gene, Cirbp, is ranked much higher in this analysis (position: No. 2). This gene encodes the cold-inducible RNA-binding protein and may be important in the diagnosis of HPP disease. Because two RNA-seq datasets are collected, we can apply the two integration methods described in Section 2.4.1 and Section 2.4.2 to strengthen as many genetic signals as possible. In particular, Table 5 summarizes the top 10 genes from the weighted gene method, and the list includes Alpl (position: No. 6) and Cirbp (position: No. 9). The weighted gene method thus identifies both genes among its top 10 by maximizing the genetic effects from the two RNA-seq datasets.
Additionally, four genes detected by the single spinal cord data analysis, namely Gm13230, Zfp990, Tmprss11d, and Eno1b, occupy the top four positions of the weighted gene list. The second integration method, the weighted Z method, displays its top 10 genes in Table 5. It also detects Alpl and Cirbp, but two genes identified by the single neocortex data analysis (Cirbp and Fkbp5) are ranked near the top, followed by the genes Alpl, Gm13230, Zfp990, and Eno1b detected by the single spinal cord data analysis. This suggests that the weighted Z method provides an alternative way to combine the genetic effects from the two RNA-seq datasets. When we extend the top 10 genes to the top 40, more of the top-ranked genes detected by the spinal cord or neocortex data analysis are selected by the integration methods. The corresponding top 40 gene lists are given in Appendix A (Table A1, Table A2, Table A3 and Table A4). Overall, the most important causal gene for HPP disease, Alpl, is successfully detected by all four methods, and the integration methods are additionally able to combine gene signals from the spinal cord and neocortex RNA-seq data. HPP is a complex disease, and its diagnosis depends on the collective actions of multiple genes. Identifying more genes will help us better understand their biological effects in the diagnosis of HPP disease.
The identification of the most important genes completes the feature selection procedure. The next step is to evaluate risk prediction based on the selected candidate genes. We split the samples into a training set and a test set. Since the sample size is rather small, 50% of the subjects are randomly assigned to the training set, and the remaining subjects form the test set. This random split is repeated five times to calculate the average test error. Table 6 summarizes and compares the prediction errors of the four methods. Specifically, the neocortex data analysis gives the largest error rate (0.15), which is consistent with its poor performance in the Z histogram. When the EB prediction rule is applied to the spinal cord data, it performs slightly better, with a smaller prediction error (0.125). The proposed integration methods, weighted gene and weighted Z, are expected to search more effectively for the most important genes for risk prediction. In particular, the error rate of the weighted gene method is 0.1, which is smaller than the error rates of the single spinal cord and neocortex data analyses. The weighted Z method combines the two Z statistics obtained from the two RNA-seq datasets to select the causal genes, and the selected genes are then applied separately to the spinal cord and neocortex data to compute the test error rates. The final error is the average of the two results, 0.0625, which is the smallest prediction error among the four methods. In summary, the integration methods, weighted Z and weighted gene, yield smaller prediction error rates, and may therefore better capture the genetic effects in the risk prediction of HPP disease.
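The repeated 50/50 split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the nearest-centroid classifier is only a stand-in for the EB prediction rule, the split is stratified by group (a choice made here so that both groups appear in every training set), and the data are simulated.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (expression values, group label: 0 normal, 1 disease)
Predictor = Callable[[Sequence[float]], int]


def nearest_centroid_fit(train: Sequence[Sample]) -> Predictor:
    """Stand-in classifier (not the EB rule): predict the group with the closer mean."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]

    def predict(x: Sequence[float]) -> int:
        dist = {k: sum((a - b) ** 2 for a, b in zip(x, c)) for k, c in centroids.items()}
        return min(dist, key=dist.get)

    return predict


def repeated_split_error(data, fit, n_repeats=5, seed=0) -> Tuple[float, float]:
    """Mean and SD of the test error over repeated half/half splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = [], []
        for label in (0, 1):  # split each group in half so both appear in training
            rows = [s for s in data if s[1] == label]
            rng.shuffle(rows)
            half = len(rows) // 2
            train.extend(rows[:half])
            test.extend(rows[half:])
        predict = fit(train)
        wrong = sum(predict(x) != y for x, y in test)
        errors.append(wrong / len(test))
    return statistics.fmean(errors), statistics.stdev(errors)


# Hypothetical data (assumption): 16 samples, 8 per group, 3 informative features.
rng = random.Random(1)
demo = [([rng.gauss(1.5 * y, 1.0) for _ in range(3)], y) for y in (0, 1) for _ in range(8)]
mean_err, sd_err = repeated_split_error(demo, nearest_centroid_fit)
print(f"mean test error = {mean_err:.4f}, SD = {sd_err:.4f}")
```

With 8 test samples per split, each repeat yields an error that is a multiple of 1/8, so the average over five repeats is a multiple of 1/40; this is why a small sample produces coarse, high-variance error estimates.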
We also calculate the area under the receiver operating characteristic (ROC) curve (AUC) to compare the prediction accuracy of the four methods (Table 7). The neocortex data analysis gives the lowest prediction accuracy (AUC: 85%), followed by the spinal cord data analysis (AUC: 87.5%); the weighted gene approach achieves an AUC of 90%, and the weighted Z method has the highest prediction accuracy (AUC: 93.75%). This analysis shows the value of empirical Bayes estimates in developing a large-scale risk prediction model, and the two integration methods further improve the prediction accuracy by jointly analyzing the two RNA-seq datasets. Figure 5 compares the ROC curves of the four methods under the cross-validation procedure and shows that the two integration methods outperform the individual RNA-seq data analyses in terms of prediction accuracy. These results suggest that our proposed methods are efficient prediction tools for large-scale RNA-seq data analysis.
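For reference, the AUC can be computed directly from prediction scores as the proportion of case-control pairs in which the case receives the higher score, with ties counted as one half. The sketch below uses hypothetical scores and labels; it is not the code used to produce Table 7.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random case outranks a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores for 8 test mice (assumption): 4 disease (1) and 4 normal (0).
demo_scores = [0.9, 0.8, 0.7, 0.15, 0.6, 0.4, 0.2, 0.1]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auc_from_scores(demo_scores, demo_labels))  # 13 of 16 pairs ordered correctly -> 0.8125
```

Because the AUC depends only on the ordering of scores, it is unaffected by any monotone rescaling of the prediction rule.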

5. Discussion

Hypophosphatasia is a rare inherited metabolic disorder caused by tissue-nonspecific alkaline phosphatase deficiency, which influences cell function and survival [3]. It manifests as a variety of clinical symptoms, and more severe and early onset cases are often characterized by neurological symptoms, such as sensorimotor dysfunction [3] or epileptic seizures [6]. However, these neurological manifestations are poorly characterized, particularly in children. RNA-seq data offer a high-resolution and highly accurate transcript profile, which is an important tool for predicting disease risk. In this study, two RNA-seq datasets are acquired from the spinal cord and neocortex tissues of a mouse model to explore the genetic mechanism underlying the diagnostics of HPP disease. Typically, RNA-seq data involve a large number of predictors and a limited number of samples, which challenges many traditional statistical methods because they can easily overestimate the genetic effects due to selection bias. The empirical Bayes method [22] mitigates this problem by shrinking the estimates in the gene expression study. Building on it, we propose two integration methods, weighted gene and weighted Z, to jointly analyze multiple RNA-seq datasets and maximize the genetic signals in an empirical Bayes prediction rule. This study may help develop age-specific strategies and improve neurological outcomes in patients.
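To illustrate how shrinkage counters selection bias, the sketch below uses a deliberately simple normal-normal model, in which each observed Z score is a true effect plus standard normal noise and the prior variance is estimated from the data. This is a simplification assumed here for illustration; the EB estimates of [22] are more general, and all names and simulated values are hypothetical.

```python
import random
import statistics
from typing import List, Sequence


def eb_shrink(z: Sequence[float]) -> List[float]:
    """Posterior means under z_i | d_i ~ N(d_i, 1), d_i ~ N(0, A), with A estimated from z."""
    marginal_var = statistics.fmean([v * v for v in z])  # estimates A + 1
    a_hat = max(marginal_var - 1.0, 0.0)                 # method-of-moments estimate of A
    factor = a_hat / (a_hat + 1.0)                       # shrinkage factor in [0, 1)
    return [factor * v for v in z]


# Hypothetical simulation (assumption): 2000 genes with true effects drawn from N(0, 1).
rng = random.Random(0)
true_effect = [rng.gauss(0.0, 1.0) for _ in range(2000)]
z_obs = [d + rng.gauss(0.0, 1.0) for d in true_effect]
shrunk = eb_shrink(z_obs)

# Select the 50 genes with the largest |Z|, as a ranking procedure would.
top = sorted(range(len(z_obs)), key=lambda i: abs(z_obs[i]), reverse=True)[:50]
raw_gap = statistics.fmean([abs(z_obs[i]) - abs(true_effect[i]) for i in top])
eb_gap = statistics.fmean([abs(shrunk[i]) - abs(true_effect[i]) for i in top])
print(f"average overestimate among selected genes: raw = {raw_gap:.3f}, EB = {eb_gap:.3f}")
```

Genes are selected because their observed Z scores are extreme, so the raw scores of the selected genes overstate the true effects on average; the shrunken estimates pull them back toward zero and reduce that gap.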
This project focuses on the discovery of differentially expressed genes from multiple RNA-seq datasets, but HPP disease may be caused by complex relationships between genetic and non-genetic factors. In real applications, genes often interact with each other or with environmental factors; therefore, research that adds gene–gene and gene–environment interactions is likely to be beneficial. More importantly, differentially expressed genes in mammals play an important role in controlling brain growth and development, especially in children. The malfunction of genes at any developmental stage could lead to substantially abnormal characteristics such as genetic disorders. Development is a highly complex process, and changes in gene expression can redirect the developmental trajectory to better adapt to environmental conditions. For this reason, incorporating such information into the empirical Bayes model should provide more information about the genetic architecture of a dynamic developmental trait.
The long-term goal is to discover the interactions between the transcriptome and metabolic markers underlying the diagnostics of HPP disease. In addition to RNA-seq data, we have collected metabolome data from the spinal cord and neocortex tissues of mice. These multi-omics data will help develop age-specific and pathway-specific personalized or targeted treatments. Since metabolite data directly reflect the changes in disease phenotype and the ensuing effects of post-translational modifications [32], they are increasingly collected for biomarker discovery. Current integration approaches fail to capture complex or indirect relationships between transcripts and metabolites. Furthermore, pathway-based methods are limited because only a small fraction of metabolites have been mapped to pathways. A promising method, IntLIM [33], integrates transcriptomic and metabolomic data via a linear model, which provides a way to discover gene–metabolite associations that differ between normal and disease groups. Thus, our future work will integrate and augment different omics data, such as RNA-seq and metabolomics data, to characterize the complex architecture of markers identified from multiple omics data during the development of HPP disease.
The most difficult obstacle in this study is the small number of samples (n = 16). Owing to this input uncertainty, the genetic estimates from our analysis may not be very stable. Specifically, when calculating the average prediction error, it is impossible to follow a strict cross-validation procedure. Instead, we split the raw data into 50-50 subsets and repeat the split five times. This procedure raises a potential issue (a kind of false positive): the five training sets, and likewise the five test sets, share a large proportion of common samples. However, it is rather difficult to enlarge the sample size because of budget limitations. For this reason, we plan to work on a new theoretical approach, distributionally robust optimization (DRO), to obtain stable genetic estimates. The DRO method [34] is attractive because it combines features of robust optimization and stochastic optimization and inherits the benefits of both. In general, DRO can overcome the conservative estimates of robust optimization without requiring the exact distribution needed in stochastic optimization. Thus, a new EB model incorporating the principle of DRO may take advantage of its theoretical properties and computational advantages, which would benefit both the discovery of DE genes and the study of interactions between metabolomic markers and DE genes when analyzing a small number of samples.
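The amount of sharing between repeated splits can be quantified with a short calculation. If each training set is drawn as an independent random subset of 8 of the 16 samples (an assumption about the splitting scheme), the number of samples shared by two training sets follows a hypergeometric distribution with mean 8 × 8 / 16 = 4, so on average half of each training set reappears in another. The simulation below checks this figure; the function name and defaults are hypothetical.

```python
import random
from itertools import combinations
from statistics import fmean


def mean_pairwise_overlap(n_samples=16, train_size=8, n_repeats=5, n_trials=2000, seed=0):
    """Average number of samples shared by two training sets across repeated random splits."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(n_trials):
        # Draw the training sets of one experiment: n_repeats independent random subsets.
        splits = [set(rng.sample(range(n_samples), train_size)) for _ in range(n_repeats)]
        # Record the overlap of every pair of training sets within the experiment.
        overlaps.extend(len(a & b) for a, b in combinations(splits, 2))
    return fmean(overlaps)


expected = 8 * 8 / 16  # hypergeometric mean: train_size * train_size / n_samples
print("theoretical mean overlap:", expected)
print("simulated mean overlap:  ", round(mean_pairwise_overlap(), 3))
```

Since the five error estimates are computed on heavily overlapping data, they are not independent, and their standard deviation understates the true uncertainty of the average error.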

Author Contributions

Conceptualization, G.L. and Z.Z.; methodology, G.L. and J.H.; software, D.K. and G.L.; validation, D.K., J.H., Z.Z. and G.L.; formal analysis, D.K. and G.L.; investigation, D.K. and G.L.; resources, Z.Z.; data curation, Z.Z.; writing—original draft preparation, D.K. and G.L.; writing—review and editing, J.H. and Z.Z.; visualization, G.L.; supervision, G.L.; project administration, G.L.; funding acquisition, G.L., J.H. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Thematic Research Planning U080680.

Institutional Review Board Statement

Experimental procedures were approved by the Institutional Animal Care and Use Committee (IACUC) of the University of Michigan (Protocol number: PRO00010860, date of approval: 22 August 2022).

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets and software code are available online at https://drive.google.com/drive/folders/1M9iU_2lG24329jiGFZI81UFFj8LtDlUZ?usp=sharing (accessed on 23 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RNA-seq   RNA-sequencing data
HPP       Hypophosphatasia
TNAP      Tissue-nonspecific alkaline phosphatase
EB        Empirical Bayes
DE        Differentially expressed

Appendix A

Table A1. Top 40 most important genes for the EB classifier using spinal cord data.
Ranking  Gene           Ranking  Gene
1        Gm13230        21       Masp2
2        Zfp990         22       Plcd3
3        Eno1b          23       Kirrel2
4        Alpl           24       Mt2
5        Tmprss11d      25       Map3k6
6        Ubiad1         26       Zfp992
7        Gm29367        27       Elf5
8        Rbm3           28       Gm44421
9        Sult1a1        29       Smim3
10       Miip           30       Pxmp2
11       Tnfrsf8        31       Deaf1
12       Hsph1          32       Arrdc1
13       Klf15          33       Acer2
14       Tekt4          34       Fkbp5
15       Gm17244        35       Gm38246
16       Slc25a34       36       Paqr5
17       Gstm1          37       Aacs
18       Cirbp          38       Prodh
19       Gtdc1          39       Cnmd
20       1810010H24Rik  40       Sbsn
Table A2. Top 40 most important genes for the EB classifier using a neocortex data analysis.
Ranking  Gene           Ranking  Gene
1        Gm49327        21       Uevld
2        Cirbp          22       Hexim2
3        Txnip          23       Itih2
4        Igfbp3         24       Tsga10
5        Pla2g3         25       Rnpep
6        Cbln3          26       Alpl
7        Fkbp5          27       Steap3
8        Gm20521        28       Lyg2
9        Col23a1        29       Vstm5
10       Gm30003        30       H3f3aos
11       Gm5914         31       Gm10570
12       Slc25a34       32       Ces2b
13       Cdkl3          33       Gm45548
14       Fcna           34       Gm45453
15       F830212C03Rik  35       Rsph10b
16       Scgb3a1        36       Gm7628
17       Hif3a          37       Wrn
18       Asprv1         38       Gm14252
19       Slco2a1        39       Lsp1
20       Klhdc9         40       Slc6a12
The ranking of the top 40 genes using the EB method on the neocortex data; EB: empirical Bayes estimate; Neocortex: neocortex RNA-seq data.
Table A3. Top 40 most important genes for the EB classifier using the weighted gene method.
Ranking  Gene           Weight   Ranking  Gene           Weight
1        Gm13230        0.94847  21       Ighg3          0.98984
2        Zfp990         0.96008  22       Paqr5          0.51213
3        Tmprss11d      0.98508  23       Txnip          0.06262
4        Eno1b          0.92767  24       1810010H24Rik  0.92349
5        Hsph1          0.50443  25       Ephx2          0.50384
6        Alpl           0.50826  26       Mt2            0.94691
7        Masp2          0.50653  27       Myoc           0.92971
8        Gm29367        0.92111  28       Rex2           0.95681
9        Cirbp          0.49959  29       Amy1           0.51611
10       Slc25a34       0.50247  30       Map3k6         0.50406
11       Slco2a1        0.50414  31       Krt28          0.52644
12       Miip           0.93724  32       Igsf1          0.98553
13       Igfbp3         0.48639  33       Gm17244        0.51114
14       Elf5           0.93675  34       Kirrel2        0.99110
15       Tnfrsf8        0.94475  35       Ces2b          0.50348
16       Sult1a1        0.93020  36       Acer2          0.93197
17       Tekt4          0.92519  37       Fkbp5          0.50177
18       Zfp992         0.94891  38       Gm9925         0.50643
19       Cnmd           0.97520  39       Pxmp2          0.92845
20       Gm43150        0.48275  40       Gm5914         0.49911
Table A4. Top 40 most important genes for the EB classifier using the weighted Z method.
Ranking  Gene           Weight   Ranking  Gene           Weight
1        Cirbp          0.49072  21       Igfbp3         0.46329
2        Slc25a34       0.54211  22       Acer2          0.56409
3        Miip           0.61340  23       Paqr5          0.55760
4        Fkbp5          0.51282  24       Ak4            0.51710
5        Gm29367        0.61020  25       Gm13071        0.54671
6        Sult1a1        0.60701  26       Elf5           0.67904
7        Gm13230        0.69359  27       Zfp992         0.71311
8        Zfp990         0.75402  28       Sbsn           0.56089
9        Eno1b          0.66139  29       Cd180          0.52914
10       Alpl           0.59642  30       Ighg3          0.91566
11       Gm17244        0.58839  31       Gkn3           0.51397
12       Masp2          0.55767  32       Gm7628         0.52943
13       Hsph1          0.59928  33       Txnip          0.40891
14       Tekt4          0.60080  34       Rsph10b        0.52757
15       Tnfrsf8        0.64046  35       Znf41-ps       0.54300
16       Map3k6         0.56794  36       Gm9925         0.53913
17       Gm5914         0.50121  37       Cnmd           0.77883
18       1810010H24Rik  0.57946  38       Kirrel2        0.92445
19       Tmprss11d      0.88090  39       Apln           0.49753
20       Ces2b          0.50177  40       Hif3a          0.50933
The ranking of the top 40 genes using the EB method on the weighted Z scores: the weighted Z scores using the score of the spinal cord and neocortex data; EB: empirical Bayes estimate.
Table A5. The distribution of mice in terms of gender.
Gender      Normal Group  Disease Group
Male (n)    4             4
Female (n)  4             4

References

  1. Choida, V.; Bubbear, J.S. Update on the management of hypophosphatasia. Ther. Adv. Musculoskelet. Dis. 2019, 11, 1759720X19863997.
  2. Greenberg, C.R.; Taylor, C.L.; Haworth, J.C.; Seargeant, L.E.; Philipps, S.; Triggs-Raine, B.; Chodirker, B.N. A homoallelic Gly317-Asp mutation in ALPL causes the perinatal (lethal) form of hypophosphatasia in Canadian mennonites. Genomics 1993, 17, 215–217.
  3. Zhang, Z.; Nam, H.K.; Crouch, S.; Hatch, N.E. Tissue Nonspecific Alkaline Phosphatase Function in Bone and Muscle Progenitor Cells: Control of Mitochondrial Respiration and ATP Production. Int. J. Mol. Sci. 2021, 22, 1140.
  4. Berkseth, K.E.; Tebben, P.J.; Drake, M.T.; Hefferan, T.E.; Jewison, D.E.; Wermers, R.A. Clinical spectrum of hypophosphatasia diagnosed in adults. Bone 2013, 54, 21–27.
  5. Schmidt, T.; Mussawy, H.; Rolvien, T.; Hawellek, T.; Hubert, J.; Rüther, W.; Amling, M.; Barvencik, F. Clinical, radiographic and biochemical characteristics of adult hypophosphatasia. Osteoporos. Int. 2017, 28, 2653–2662.
  6. Whyte, M.P.; Wenkert, D.; Zhang, F. Hypophosphatasia: Natural history study of 101 affected children investigated at one research center. Bone 2016, 93, 125–138.
  7. Holt, R.A.; Jones, S.J. The new paradigm of flow cell sequencing. Genome Res. 2008, 18, 839–846.
  8. Lister, R.; O’Malley, R.C.; Tonti-Filippini, J.; Gregory, B.D.; Berry, C.C.; Millar, A.H.; Ecker, J.R. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 2008, 133, 523–536.
  9. Mortazavi, A.; Williams, B.A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5, 621–628.
  10. Nagalakshmi, U.; Wang, Z.; Waern, K.; Shou, C.; Raha, D.; Gerstein, M.; Snyder, M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320, 1344–1349.
  11. Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63.
  12. Corchete, L.A.; Rojas, E.A.; Alonso-López, D.; De Las Rivas, J.; Gutiérrez, N.C.; Burguillo, F.J. Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Sci. Rep. 2020, 10, 19737.
  13. Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szcześniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X.; et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016, 17, 13.
  14. Fang, Z.; Martin, J.; Wang, Z. Statistical methods for identifying differentially expressed genes in RNA-Seq experiments. Cell Biosci. 2012, 2, 26.
  15. Koch, C.M.; Chiu, S.F.; Akbarpour, M.; Bharat, A.; Ridge, K.M.; Bartom, E.T.; Winter, D.R. A Beginner’s Guide to Analysis of RNA Sequencing Data. Am. J. Respir. Cell Mol. Biol. 2018, 59, 145–157.
  16. Marioni, J.C.; Mason, C.E.; Mane, S.M.; Stephens, M.; Gilad, Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18, 1509–1517.
  17. Zhao, L.; Wu, W.; Feng, D.; Jiang, H.; Nguyen, X. Bayesian Analysis of RNA-Seq Data Using a Family of Negative Binomial Models. Bayesian Anal. 2017, 13, 411–436.
  18. Urda, D.; Montes-Torres, J.; Moreno, F.; Franco, L.; Jerez, J.M. Deep Learning to Analyze RNA-Seq Gene Expression Data. In Advances in Computational Intelligence. IWANN; Rojas, I., Joya, G., Catala, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10306.
  19. Zhang, Z.; Pan, Z.; Ying, Y.; Xie, Z.; Adhikari, S.; Phillips, J.; Carstens, R.P.; Black, D.L.; Wu, Y.; Xing, Y. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods 2019, 16, 307–310.
  20. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B 1996, 58, 267–288.
  21. Hoerl, A.E.; Kennard, R. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
  22. Efron, B. Empirical Bayes Estimates for Large-Scale Prediction Problems. J. Am. Stat. Assoc. 2009, 104, 1015–1028.
  23. Singh, D.; Febbo, P.G.; Ross, K.; Jackson, D.G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A.A.; D’Amico, A.V.; Richie, J.P.; et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1, 203–209.
  24. Li, G.; Ferguson, J.; Zheng, W.; Lee, J.S.; Zhang, X.; Li, L.; Kang, J.; Yan, X.; Zhao, H. Large-scale risk prediction applied to Genetic Analysis Workshop 17 mini-exome sequence data. BMC Proc. 2011, 5 (Suppl. S9), S46.
  25. Li, G.; Cui, Y.; Zhao, H. An Empirical Bayes risk prediction model using multiple traits for sequencing data. Stat. Appl. Genet. Mol. Biol. 2015, 14, 551–573.
  26. Dawid, A.P. Selection paradoxes of Bayesian inference. In Multivariate Analysis and Its Applications (Hong Kong 1992); IMS: Hayward, CA, USA, 1994; pp. 211–220.
  27. Senn, S. A note concerning a selection ‘Paradox’ of Dawid’s. Am. Stat. 2008, 62, 206–210.
  28. Brown, L.D. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Stat. 1971, 42, 855–903.
  29. Stein, C.M. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 1981, 9, 1135–1151.
  30. Assefa, A.; Vandesompele, J.; Thas, O. SPsimSeq: Semi-parametric simulation tool for bulk and single-cell RNA sequencing data. Bioinformatics 2020, 36, 3276–3278.
  31. Nunes, M.E. Hypophosphatasia. In GeneReviews; University of Washington: Seattle, WA, USA, 2007; [Updated 2023].
  32. Zhang, A.; Sun, H.; Yan, G.; Wang, P.; Han, Y.; Wang, X. Metabolomics in diagnosis and biomarker discovery of colorectal cancer. Cancer Lett. 2014, 345, 17–20.
  33. Siddiqui, J.K.; Baskin, E.; Liu, M.; Cantemir-Stone, C.Z.; Zhang, B.; Bonneville, R.; McElroy, J.P.; Coombes, K.R.; Mathé, E.A. IntLIM: Integration using linear models of metabolomics and gene expression data. BMC Bioinform. 2018, 19, 81.
  34. Delage, E.; Ye, Y. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems. Oper. Res. 2010, 58, 595–612.
Figure 1. The workflow of this study.
Figure 2. Simulation of spinal cord RNA-seq data.
Figure 3. Simulation of neocortex RNA-seq data.
Figure 4. Distribution of Z statistics generated by the spinal cord data, the neocortex data, and the two integration methods.
Figure 5. ROC curves of various methods.
Table 1. Number of true genes in the top 100/200 genes.
Simulation       Top Genes  Spinal  Cortex  Weighted Genes      Weighted Z
Simulation data  100        21      5       53 (S = 44, C = 9)  56 (S = 47, C = 9)
Simulation data  200        21      5       58 (S = 49, C = 9)  61 (S = 51, C = 10)
Note: Spinal: EB method is applied to a single spinal cord simulation data. Cortex: EB method is applied to a single neocortex simulation data. Weighted Genes (p): the weighted gene approach is jointly applied to two simulation data. Weighted Z: the weighted Z method is jointly applied to two simulation data. Row 1 is used to calculate how many true DE genes are among top 100 genes detected by various methods. Row 2 is used to calculate how many true DE genes are among top 200 genes identified by various methods.
Table 2. Top 30 genes detected by 4 methods.
Ranking  Spinal         Cortex         Weighted Genes  Weighted Z
1        Glul           Gm44677        Klhl6           Gm43364
2        Tmem109        Lpl            Jpt2            Gm36947
3        Zic4           Sema4b         Gng11           Gm20939
4        Gm43300        Spint2         Aqp4            Gm13230
5        Asprv1         Inpp5f         Rmi2            Rex2
6        Rbm24          Gucy2f         Inpp5f          Gm3608
7        Per1           Edem1          Gm3608          Eno1b
8        Gm9768         2810029C07Rik  Mzb1            Idi1
9        Pmaip1         Phf1           Hemgn           Aplnr
10       Tmem64         Ncapg2         Cideb           Klhl6
11       Zfp687         Ccser2         Prr11           Pif1
12       Smim3          Slco3a1        Gm36947         Rmi2
13       Dsp            Dgcr8          Uhrf1           Ska1
14       Gm37567        Gnptab         Kntc1           Gng11
15       Mest           Ccdc166        Idi1            Igsf1
16       Fam107a        AI197445       Aplnr           Mki67
17       C2cd4a         Stil           Btnl10          Kntc1
18       Cdkn1a         Camk2d         Nxnl2           Alpl
19       St6galnac2     Mfsd10         Alpl            Hemgn
20       Sp6            Nxnl2          Ska1            Knl1
21       Ldoc1          Gm3203         Igsf1           Rdh5
22       Cyp2d22        Adam15         Gm43364         Aqp4
23       Rtbdn          Gm43364        Mki67           Mzb1
24       Mertk          Nup160         Cdc6            Jpt2
25       4930481A15Rik  Esyt3          Has2            Has2
26       Znf41-ps       Gm11427        Kcnj5           Cideb
27       Gm37310        Cpz            Gm20939         Nxnl2
28       Cd209f         Bub1           Bub1            Prr11
29       Gm37885        Rex2           Rex2            Btnl10
30       Gm4949         Eno1b          Eno1b           Cdc6
Note: yellow color denotes the spinal cord DE genes. Cyan color represents the neocortex DE genes.
Table 3. Prediction errors of various methods.
Simulation  Spinal  Cortex  Weighted Genes  Weighted Z
Error       0.275   0.3     0.225           0.1625
SD          0.285   0.068   0.185           0.0948
Note: Error denotes the average prediction error; SD is a standard deviation of prediction error; Spinal cord: a spinal cord RNA-seq data analysis; Cortex: a neocortex RNA-seq data analysis; Weighted Genes (p): weighted gene method; Weighted Z: weighted Z method.
Table 4. Top 10 gene list for a spinal cord data analysis and a neocortex data analysis.
Ranking  Spinal Cord  Neocortex
1        Gm13230      Gm49327
2        Zfp990       Cirbp
3        Eno1b        Txnip
4        Alpl         Igfbp3
5        Tmprss11d    Pla2g3
6        Ubiad1       Cbln3
7        Gm29367      Fkbp5
8        Rbm3         Gm20521
9        Sult1a1      Col23a1
10       Miip         Gm30003
Spinal Cord: empirical Bayes estimates for spinal cord RNA-seq data; Neocortex: empirical Bayes estimates for neocortex RNA-seq data.
Table 5. Top 10 gene list for two integration methods, weighted gene and weighted Z.
Ranking  Weighted Gene  Weighted Z
1        Gm13230        Cirbp
2        Zfp990         Slc25a34
3        Tmprss11d      Miip
4        Eno1b          Fkbp5
5        Hsph1          Gm29367
6        Alpl           Sult1a1
7        Masp2          Gm13230
8        Gm29367        Zfp990
9        Cirbp          Eno1b
10       Slc25a34       Alpl
Note: Top 10 genes based on two integration methods; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.
Table 6. Average prediction error of four methods.
Statistic  Spinal Cord  Neocortex  Weighted Gene  Weighted Z
Error      0.125        0.1500     0.1000         0.0625
SD         0.1531       0.0559     0.1630         0.0765
Note: Error denotes the average prediction error; SD is a standard deviation of prediction error; Spinal cord: a spinal cord RNA-seq data analysis; Neocortex: a neocortex RNA-seq data analysis; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.
Table 7. The prediction accuracy of four methods.
Method         AUC
Spinal         87.5%
Neocortex      85%
Weighted Gene  90%
Weighted Z     93.75%
AUC: the area under the receiver operating characteristic curve. Spinal: a spinal cord RNA-seq data analysis; Neocortex: a neocortex RNA-seq data analysis; Weighted Gene: the weighted gene method; Weighted Z: the weighted Z method.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kinsman, D.; Hu, J.; Zhang, Z.; Li, G. New Empirical Bayes Models to Jointly Analyze Multiple RNA-Sequencing Data in a Hypophosphatasia Disease Study. Genes 2024, 15, 407. https://doi.org/10.3390/genes15040407

AMA Style

Kinsman D, Hu J, Zhang Z, Li G. New Empirical Bayes Models to Jointly Analyze Multiple RNA-Sequencing Data in a Hypophosphatasia Disease Study. Genes. 2024; 15(4):407. https://doi.org/10.3390/genes15040407

Chicago/Turabian Style

Kinsman, Dawson, Jian Hu, Zhi Zhang, and Gengxin Li. 2024. "New Empirical Bayes Models to Jointly Analyze Multiple RNA-Sequencing Data in a Hypophosphatasia Disease Study" Genes 15, no. 4: 407. https://doi.org/10.3390/genes15040407
