2. Materials and Methods
2.1. A Framework for Regression-Based Modeling of GAN
Let us suppose that the research objective is to determine how q target genes Y1, Y2, …, Yq are associated with p predictor genes X1, X2, …, Xp, and that n independent experiments (e.g., microarrays) are conducted for this purpose. For the i-th experiment, the observed expression levels of genes Yj and Xk are yij and xik, respectively; for convenience, we drop the subscript i here. A regression model for the GAN of q simultaneous equations can then be expressed as [23]:
where the error terms, εy1, εy2, …, εyq, as well as the expression levels of genes Xs and Ys, are random variables (i.e., non-constants).
We used the architecture of (1) for GAN construction mainly for two reasons. First, even though distribution-free modeling under the regression framework has been reported, we would like to develop a new approach with fewer assumptions. Second, unlike other approaches, such as Bayesian models, our approach can use a standard p-value cutoff of 0.05 to infer an association under the architecture of (1). This is attractive, especially when prior knowledge of a gene's regulatory role in the GAN is lacking.
2.2. Conventional Strategies for Estimating the Association Parameters
Generally speaking, Equation (1) can be a distribution-free regression model of GAN if we do not specify a probability density function for the expression and error terms, but in previous studies, including the work of [23], a normal distribution function was used to model gene expressions and measuring noises. Note also that although [25] showed that large errors or outliers of expression data do not need to be modeled by a Gaussian distribution function in regression-based inference of gene regulatory networks, their method nonetheless required all the errors and outliers to be modeled as symmetrically distributed residuals, which is unrealistic for real-world non-Gaussian noises. For the i-th observation (experiment), the j-th equation of Equation (1) can be rewritten as follows:
If we do not know the specific probability density functions for the observed expressions and error terms, there are, in general, three conventional LS-based strategies to estimate the association parameters β in Equation (2). The first is the ordinary LS estimation (LSE) strategy, in which the association is detected by minimizing the following objective function Qj0, based on the sum of squares of the error terms εyij [26]:
Assuming that these error terms are independently and identically distributed with zero mean and finite variance, LSE has some well-behaved statistical properties, including unbiasedness and minimal variance among linear unbiased estimators, as summarized by the Gauss–Markov theorem [26].
The second is the L1-norm penalized strategy, called the least absolute shrinkage and selection operator, or LASSO [27], in which the association of Equation (2) is obtained by minimizing Qj1, which adds an L1 penalty to Equation (3):
where λj1 is the tuning (weight) parameter in the L1 penalty term of Qj1.
The third is the L2-norm penalized strategy, called ridge regression estimation, or RRE [28], in which the association is obtained by minimizing the following objective function Qj2, which adds an L2 penalty to Equation (3):
where λj2 is the tuning (weight) parameter in the L2 penalty term of Qj2.
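For reference, the three objective functions described in this subsection can be written out as below. This is a sketch consistent with the definitions above rather than a reproduction of the original displays; it assumes an intercept-free form of Equation (2), in which yij is the sum over k of βjk xik plus εyij, with βjk the association parameter between target gene Yj and predictor gene Xk.

```latex
\begin{align*}
Q_{j0} &= \sum_{i=1}^{n} \varepsilon_{y_{ij}}^{2}
        = \sum_{i=1}^{n} \Bigl( y_{ij} - \sum_{k=1}^{p} \beta_{jk}\, x_{ik} \Bigr)^{2}, \\
Q_{j1} &= Q_{j0} + \lambda_{j1} \sum_{k=1}^{p} \lvert \beta_{jk} \rvert, \\
Q_{j2} &= Q_{j0} + \lambda_{j2} \sum_{k=1}^{p} \beta_{jk}^{2}.
\end{align*}
```

Setting λj1 or λj2 to zero recovers the ordinary LSE objective Qj0 in both penalized cases.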
Generally speaking, RRE is used to combat multi-collinearity by shrinking the inflated estimation variances that arise from highly correlated gene expression data, while LASSO is used to exclude zero coefficients in large-scale regression-based GAN prediction through the adjustment of the shrinkage parameter in the penalty term. Penalized LS strategies are discussed further in [24,29].
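The three LS-based strategies can be compared on toy data with a few lines of NumPy. The sketch below is illustrative only: the function names, the simulated data, the penalty values, and the use of coordinate descent for LASSO are assumptions made here, and the model is taken to be intercept-free for a single target gene.

```python
import numpy as np

def lse(X, y):
    """Ordinary least squares: minimizes the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def rre(X, y, lam):
    """Ridge: minimizes ||y - X b||^2 + lam * ||b||_2^2 (closed form)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=500):
    """LASSO: minimizes ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(p):
            # Partial residual with the contribution of predictor k removed.
            r_k = y - X @ beta + X[:, k] * beta[k]
            beta[k] = soft_threshold(X[:, k] @ r_k, lam / 2.0) / col_sq[k]
    return beta

# Hypothetical toy data: three predictor genes, one with no association.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
beta_true = np.array([1.5, 0.0, -2.0])
y_demo = X_demo @ beta_true + 0.1 * rng.normal(size=200)

b_lse = lse(X_demo, y_demo)
b_rre = rre(X_demo, y_demo, lam=10.0)
b_lasso = lasso(X_demo, y_demo, lam=50.0)
```

With a sufficiently large penalty, the LASSO estimate for the unassociated predictor is set exactly to zero, while the ridge estimate is shrunk toward zero without being eliminated.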
2.3. An Alternative Parameter Estimation Method
In addition to these LS-based methods (LSE, RRE, and LASSO), an alternative, distribution-free estimation in regression models is the method of grouping estimators. Wald [30] proposed a special kind of grouping estimator called the Wald-type estimator (WTE) to tackle measuring noises (variations arising from measuring processes) in simple linear regression. Wald's method divides the data into two groups according to predictor X: those above and those below the observation median. The association parameters can then be estimated simply by computing the gradient from four means (those of the observed X and Y values, respectively, in the two groups). WTE has received little attention in the literature because it is inefficient compared to LSE [31] and inconsistent with respect to measuring noises [32]. In addition, an assumption of independence between predictor variables is needed in multiple linear regression models, which causes its poor performance in highly correlated data [17]. For more about WTE and methods of grouping, readers are referred to [31,33].
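Wald's grouping idea for simple linear regression can be sketched as follows. The function name and the handling of observations equal to the median (assigned to the lower group here) are assumptions of this sketch, and the intercept is recovered from the overall means.

```python
import numpy as np

def wald_estimator(x, y):
    """Wald-type grouping estimator for y = a + b * x.

    Splits the data at the median of x and takes the slope of the line
    joining the two group centroids (the 'gradient of four means').
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    upper = x > np.median(x)   # observations above the median of x
    lower = ~upper             # assumption: ties go to the lower group
    slope = (y[upper].mean() - y[lower].mean()) / (x[upper].mean() - x[lower].mean())
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Noise-free check on hypothetical data: y = 1 + 2x.
x_demo = np.arange(10.0)
y_demo = 1.0 + 2.0 * x_demo
a_hat, b_hat = wald_estimator(x_demo, y_demo)
```

No distributional form is specified anywhere in the computation, which is what makes the estimator distribution-free; only group means are used.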
Recently, a generalized version of WTE called the adjusted Wald-type estimator (AWTE) has been developed to tackle Berkson-type uncertainties (i.e., noises that appear in measurements but are not errors caused by the measuring process) and collinearity problems [17]. This non-parametric approach has several merits. First, for the multi-collinearity problem, AWTE is statistically consistent and asymptotically unbiased (overcoming the drawbacks of LSE, RRE and LASSO). Second, for uncertainties in measurement error in conjunction with collinearities, whereas LSE may cause completely erroneous conclusions [34], AWTE can solve both problems simultaneously. It should be noted that, as Wu and Fang [17] pointed out, Berkson-type uncertainties are fundamentally different from the measuring noises discussed in [23,30] and are also different from the outliers treated in [25]: namely, Berkson-type uncertainties can arise from biological noises, while the other types are products of measuring processes. The application of AWTE to GAN construction will be formally described later.
2.4. Modeling Biological Noises and Correlated Expressions
Contributions from extrinsic and intrinsic noises in biological processes and correlated expressions may lead to biased regression modeling and incorrect predictions for a GAN [15,16]. To avoid such biases and to recover true associations, we consider the effects of both intrinsic and extrinsic noises in the framework of a linear regression system.
Let us begin with the consideration of intrinsic noises for not only the target gene Yj but also the predictor gene Xk in Equation (2). As pointed out by Fujita and coworkers [23], the error term, εyij, in the regression model can be seen as intrinsic noise in the expression of the target gene, yij. However, the intrinsic noises of predictor genes, defined as εxik below, are independent of measuring devices, although they also appear in measurements [35]. In other words, biological noises, which are Berkson-type uncertainties, can affect both the true and the observed expressions of target genes, while measuring noises such as those discussed in [23] affect only the observed expressions [36]. Therefore, if we would like to explicitly model intrinsic biological noise εxik in predictor gene expression xik, we can employ a Berkson-type uncertainty model [17] and rewrite Equation (2) as follows:
Next, to model extrinsic noises, it is suggested by [37] that a total noise should be identified, which can be the sum of intrinsic and extrinsic noises, and that these two types of noises should be presented separately to distinguish the contributions of their different origins. Thus, to account for noises of both intrinsic and extrinsic origins simultaneously in the regression system, we can rewrite Equation (4) as follows:
where the total noises in the expression of predictor gene Xk and target gene Yj are εxik + vxik and εyij + vyij, respectively, in which vxik and vyij are extrinsic noises and εxik and εyij are intrinsic noises. Notice that the total noise for predictor gene Xk may not influence target gene Yj if the association of these two genes is negligible, i.e., if the regression coefficient (βjk) is very close to zero. In contrast, the total noise for target gene Yj can cause its expression to fluctuate significantly whether or not the interactions between predictor and target genes are negligible. As a result, combining all noises into a single term in the modeling is problematic if the complexities of uncertainties are overly simplified.
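The noise structure described in this subsection can be written out explicitly. The display below is a sketch inferred from the definitions of the total noises, not a reproduction of Equations (4) and (5); it assumes an intercept-free model and that the noises on predictor gene Xk enter additively inside the term weighted by βjk, as in a Berkson-type model. The first line carries intrinsic noises only and the second adds the extrinsic noises.

```latex
\begin{align*}
y_{ij} &= \sum_{k=1}^{p} \beta_{jk}\,\bigl(x_{ik} + \varepsilon_{x_{ik}}\bigr)
          + \varepsilon_{y_{ij}}, \\
y_{ij} &= \sum_{k=1}^{p} \beta_{jk}\,\bigl(x_{ik} + \varepsilon_{x_{ik}} + v_{x_{ik}}\bigr)
          + \varepsilon_{y_{ij}} + v_{y_{ij}}.
\end{align*}
```

Written this way, the predictor-side noise is multiplied by βjk and so drops out when βjk is close to zero, whereas the target-side noise remains regardless of the value of βjk.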
Finally, to deal with the potential presence of collinearity, i.e., highly correlated gene expression data, we can assume that a predictor gene Xl (l < p) is linearly dependent on another predictor gene Xp (see [38] for a similar assumption about linear dependence between two genes); that is,
Equation (6) intuitively divides gene expression xil (l < p) into two additive sources: the former source, uil, is a unique component for the predictor gene itself (i.e., independent of other genes), and the latter source, xip, is a common interaction component among the p predictor genes, with rl as their correlation parameter. Note that the framework of Equation (6) allows for ease of interpreting the structure of correlated observations and has been commonly used in the literature to address collinear configuration in regression analysis [17,39].
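The additive decomposition described for Equation (6) can be illustrated by simulation. The sketch below assumes the form xil = uil + rl xip for l < p, with xip itself a unique component; the function name, the standard normal unique components, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_collinear_predictors(n, r, rng):
    """Draw n observations of p predictors with x_l = u_l + r_l * x_p (l < p).

    r holds the p - 1 correlation parameters r_1, ..., r_{p-1};
    the last column plays the role of the common component x_p.
    """
    r = np.asarray(r, dtype=float)
    p = r.size + 1
    u = rng.normal(size=(n, p))      # unique components (assumed standard normal)
    x = u.copy()
    x[:, :-1] += r * u[:, [-1]]      # add r_l * x_p to every x_l with l < p
    return x

rng = np.random.default_rng(1)
x_sim = simulate_collinear_predictors(20000, [3.0, 0.0], rng)
corr = np.corrcoef(x_sim, rowvar=False)
```

Under these assumptions the correlation between Xl and Xp is rl / sqrt(1 + rl²), so a value of rl = 3 gives a correlation near 0.95, while rl = 0 leaves the two genes uncorrelated.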
In summary, if highly correlated expression data and intracellular molecular noises are significant, WTE can be unstable, and conventional regression strategies (LSE, RRE, or LASSO) for deducing the values of association parameters β’s may be greatly biased. This is because specifying the exact means and variances of the total noise contributed from manifold origins is difficult, and as a result, the assumptions of the Gauss–Markov theorem do not hold. Furthermore, it is possible to over-adjust the penalty terms in Qj1 (for LASSO) or Qj2 (for RRE) for ill-conditioned problems due to the requirement of information on regression coefficients when estimating weight parameters.
2.5. A Robust Distribution-Free Regression Method for Modeling GAN Using AWTE
To consider the effects of biological noises on inferring a GAN, we can rewrite Equation (1) for the i-th independent experiment according to Equation (5):
In addition, to deal with the influence of highly correlated data in regression models, the regressors can be constrained on Equation (6). To account for the complexity that may arise from stochastic noises of manifold origins, such as those described as non-Gaussian noises, we employed AWTE to obtain the association parameters β’s in Equation (7). The whole analytical procedure of the proposed distribution-free method, also referred to as AWTE, can be summarized by three primary steps.
Step 1. Determine the common interaction component (i.e., the second source of the additive combination, xip) in Equation (6) among the p predictor genes according to their observed expression levels; this can be done by
where ρ is the Pearson correlation coefficient and xk is the expression of gene Xk.
Step 2. Estimate all the correlation parameters r1, r2, …, rp−1 in Equation (6), which can be achieved by
where l = 1, …, p−1, I denotes the indicator function, i.e., I[A] = 1 if A is true, and 0 otherwise, xk is an n × 1 vector with its i-th value equal to xik for all k ≤ p, and M(xk) is the median of all values in vector xk (i.e., the median of x1k, x2k, …, xnk).
Equation (8) is the so-called AWTE, where rp is zero and τjk is given by
Note that, if we take all the values of rk to be zero, Equation (8) reduces to WTE.
A few remarks need to be made regarding the present approach. First, it has been pointed out by [17] that AWTE (Equation (8)) is a two-stage estimation method: estimating the whole set of regression coefficients except for the case of k = p first, then all the other regression coefficients by calculating the τjp values. In this way, AWTE can be calculated directly without using iterative or optimization algorithms. Second, the computational cost of AWTE is O(p²) if q = 1 [17], and hence that of Equation (8) under Equation (6) is O(max[p², pq]). In addition, as demonstrated in [17], by using this approach, we have theoretical guarantees for the robustness of the predicted GAN (see Appendix A for the theorem of robustness and its proof).
4. Discussion
The LS strategy, which is mathematically equivalent to maximum likelihood estimation under Gaussian errors, is known to perform well for systems with Gaussian noises, i.e., noises that are characterized by normal distributions [26]. However, when noises are non-Gaussian, LS-based methods can be unsatisfactory. For example, as can be seen in Figure 1, when the sample size is as large as 3200, conventional methods can be unstable under conditions of non-Gaussian noises, with LSE and RRE having a wide range of INER even when the level of noise is not high (e.g., σ² = 1). In addition, as can be seen from Figure 3 at sample size = 3200, LASSO performed better (lower median in the box plot) than LSE and RRE in the case of σ² > 1 but worse in the case of σ² ≤ 1, which suggests that in this example LASSO failed the test of robustness. Taken together, we can conclude, as others have [15,48], that a predicted gene network might be non-functional (e.g., with high INER and high FDR values) or even incorrect if the effects of intrinsic or extrinsic noises are ignored or overly simplified to reduce analysis complexities. Indeed, the expressions of eight of the fifteen genes analyzed in Table A1 for lymphoma did not pass the normality test (Appendix B Figure A4), which could be a reason for the poor ROC results of the LS-based methods for predicting the B-cell GAN (Figure 5).
To assess the potential use of the gene trio regulation motif for practical applications, we conducted a subtype analysis in DLBCL GCB and ABC using data from GSE60. As may be seen in Figure A5A, the regulations of the motif, as suggested in GCB patients' gene expression data, are consistent with the overall trend of our model (Table A1). However, in Figure A5B, the down-regulating function of BACH2 is nearly non-existent, while the up-regulating function of SPIB and OCT2 on the two oncogenes in the ABC subtype is stronger than in the GCB subtype. Although we do not know the specific mechanisms by which these TF genes can help differentiate the subtypes, these observations may suggest that over-expression of SPIB and OCT2, as well as malfunction of BACH2, could be probable causes leading to higher IRF4/AID expressions and resulting in different clinical outcomes for patients with different subtypes.
In the present work, we did not consider measuring noises because their modeling may require additional experimental data and/or analysis procedures, as well as a distribution-dependent approach (e.g., [23]). This problem is outside the scope of the present study but will be a subject of our future research.
There are a few other limitations of our method in its current form. Firstly, although a wider range of design specifications can be used to construct GANs because AWTE can model uncertain noises with fewer constraints, our method may not perform as well as conventional LS-based methods if the number of observations is not sufficiently large, as Figure 1, Figure 2 and Figure 3 indicate. However, array-based experiments and other high-throughput technologies that produce very large expression datasets have become increasingly available in recent years, as in studies using TCPA (The Cancer Proteome Atlas [49]; sample size > 3000), TCGA (The Cancer Genome Atlas [50]; sample size > 800) or UK Biobank ([51]; sample size of around 500,000), so this limitation of sample size may soon become a non-issue in many applications.
Secondly, if the number of predictor genes in the model, p, is larger than the number of experiments, n, overfitting may occur, which is a major statistical limitation of linear regression analysis [24]. To circumvent this problem, automatic variable selection techniques (e.g., stepwise, forward, or backward selection) can potentially help to screen for preferred predictor genes, so that only a smaller number of them is considered (i.e., n > p) when applying the proposed approach. Alternatively, as demonstrated by the case study of lymphoma in the present work, knowledge and information from the literature, despite being far from complete and often not straightforward, can be harnessed for the new method to make insightful discoveries on gene regulations.
Thirdly, we did not consider time series data mainly because regression modeling for time series observations often requires distribution-dependent procedures (see, e.g., [52]) or a distribution-free procedure as in the work of [53], for which, however, theoretical justifications are still lacking to prove that a generalized LS-based approach can address well the manifold uncertainties associated with the predictors of interest. Further studies are required to fully address this statistical issue.
Fourthly, our model was derived from data from an older array platform, which may cause biases in the analysis and hence reduce the accuracy of the results. The predictive value of the gene trio motif has also not been fully investigated, although in a preliminary analysis we found that the trio can be a prognostic signature to distinguish survival risks of lymphoma patients (Figure A6). Further validation of the model and the gene trio motif in cancer gene regulation with newer data is ongoing.
Finally, our method in the present work was applied to only a handful of variables (genes). In principle, one could consider all TFs as predictor genes to regulate all other genes and build a whole-genome TF-centered GAN. However, it remains to be investigated if the existing data are sufficient to overcome overfitting for such an undertaking. A strategy such as principal component analysis to shrink the dimension of these TFs while keeping all the data in the analysis may be necessary.
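The dimension-reduction idea mentioned above can be sketched with a principal component analysis computed from a singular value decomposition. This is a generic illustration, not part of the proposed method; the function name, the simulated TF expression matrix, and the choice of two components are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the columns of X (samples x genes) onto leading principal components.

    Returns the component scores and the fraction of variance each
    retained component explains.
    """
    Xc = X - X.mean(axis=0)                        # center each gene
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T              # samples x n_components
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]

# Hypothetical data: 10 TF genes driven by 2 latent factors plus small noise.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 10))
tf_expr = latent @ loadings + 0.01 * rng.normal(size=(300, 10))

scores, explained = pca_reduce(tf_expr, n_components=2)
```

The component scores could then serve as a smaller set of predictors in place of the original TF expressions, at the cost of a less direct interpretation of each coefficient.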