1. Introduction
The tremendous development of high-throughput sequencing techniques allows for the generation of massive genomic data, e.g., gene expression and single-nucleotide polymorphisms (SNPs). These data provide an unprecedented opportunity to uncover biomarkers associated with outcomes such as the development and progression of complex diseases, e.g., cancers and type II diabetes. Numerous studies on this topic have been carried out. However, most existing studies assume that a covariate has an identical effect on the outcome variable for all subjects, which is often unrealistic in practice. For example, Ford et al. [1] found that the risk of breast and ovarian cancers in BRCA2 mutation carriers increases with age. Another example is that the effects of some genes in the nicotinic 15q25 locus on lung cancer risk are mediated by nicotine dependence [2]. These findings suggest that the effects of a specific covariate can be heterogeneous, and discrepancies in covariate effects or covariate-outcome associations may arise from differences in clinical characteristics and other traits across subjects. As such, ignoring such effect heterogeneity in genomic data analysis can result in biased estimates and misleading inferences.
The most commonly used strategy for handling heterogeneity is subgroup analysis, under which subjects form subgroups and each subgroup has unique covariate-outcome associations. A number of approaches have been proposed, such as the finite mixture model [3,4,5] and penalization-based approaches, including concave fusion penalization [6,7] and C-Lasso [8]. However, these approaches assume that the effects of covariates are the same within each subgroup. As suggested by the literature, covariate (e.g., genetic) effects are typically associated with clinical measures (e.g., age and number of cigarettes smoked per day), which are often continuous variables. As such, in some applications, covariate effects are more likely to vary smoothly rather than being locally constant within each subgroup.
In this study, we focus on a scenario where the subjects can be ordered by an auxiliary variable (see Section 2 for details). We consider a linear regression model with heterogeneous covariate effects by allowing the regression coefficients to vary smoothly across subjects. We then propose a novel penalization approach to capture the smooth changes of the coefficients. Under this approach, a "spline-lasso" penalty is imposed on the second-order derivatives of the coefficients to encourage smoothness in the coefficients' changes. Additionally, we introduce a penalty of the group Lasso form to accommodate the high dimensionality of genomic data (i.e., the number of genes is larger than the sample size) and select the relevant covariates.
Our work is related to varying-coefficient models, a class of classical semi-parametric models. Such models treat the coefficients as functions of certain characteristics and use various nonparametric smoothing techniques, such as spline-based methods [9,10] and local polynomial smoothing [11], to approximate the unknown coefficient functions. Examples include the high-dimensional varying-coefficient models proposed by Wei et al. [12], Xue and Qu [13], Song et al. [14], and Chen et al. [15], the finite mixture of varying-coefficient models [16], and the additive varying-coefficient model for nonlinear gene-environment interactions [17]. Compared to these varying-coefficient regression approaches, the proposed method has fewer requirements on the distribution of the auxiliary variable and better estimates the regression coefficients when the auxiliary variable is unevenly distributed (Figure 1).
Moreover, the proposed approach is related to, but also significantly advances, existing ones. First, it advances existing genomic marker identification studies by considering the heterogeneity of covariate effects. Second, it advances gene-environment interaction analysis methods [18,19] by allowing more flexibility in the relationship pattern (not limited to a prespecified relationship) between covariate (genetic) effects and environmental factors (auxiliary variables). Finally, it advances existing multiple change-point regression studies [20,21] by tracking gradual changes of the coefficients rather than abrupt ones (Figure 1). Overall, this approach is practically useful for analyzing genomic data and may lead to important new findings.
To further illustrate the differences between the proposed method and varying-coefficient models and multiple change-point regression methods, consider a simple simulation example with three significant variables. The coefficient of each variable varies among individuals as a function of a certain environmental factor, e.g., age. Suppose age is unevenly distributed among subjects, with subjects concentrated between the ages of 25–35 and 45–55, as indicated by the denser rugs in Figure 1. We compare the proposed method with the varying-coefficient model [12] and the change-point regression model [22]. The simulation results show that the alternative methods perform relatively poorly (root mean squared error (RMSE) = 4.853 and root prediction error (RPE) = 1.325 for the varying-coefficient model; RMSE = 3.158 and RPE = 1.242 for the change-point regression model), while the proposed method identifies the true coefficient pathway consistently (RMSE = 0.954, RPE = 0.893).
The rest of this paper is organized as follows. In Section 2, we introduce the proposed approach, present the algorithm, and discuss some theoretical properties. Simulations are shown in Section 3. Section 4 presents the analysis of two The Cancer Genome Atlas (TCGA) datasets. Section 5 concludes the paper. The technical details of the proofs and additional numerical results are provided in Appendix A, Appendix B, Appendix C and Appendix D.
2. Materials and Methods
Assume a dataset consisting of $N$ independent subjects. For subject $n$, let $y_n$ and $\boldsymbol{x}_n$ denote the response variable and the $p$-dimensional vector of genomic measurements, respectively. In our numerical study, we analyze gene expression data; it is noted that the proposed approach can also be applied to other types of omics measurements. Assume the data have been standardized, and consider a heterogeneous linear regression model given by:

$$y_n = \boldsymbol{x}_n^{\top}\boldsymbol{\beta}_n + \varepsilon_n, \quad n = 1, \ldots, N, \qquad (1)$$

where the $\varepsilon_n$'s are independent and identically distributed (i.i.d.) random errors and the $\boldsymbol{\beta}_n$'s are the regression coefficients. Different from the standard regression model, which imposes an identical $\boldsymbol{\beta}$ on all subjects, model (1) allows $\boldsymbol{\beta}_n$ to be subject-specific. Here, we consider a linear regression, which is standard for modeling the relationship between covariates and outcomes; the proposed approach is also applicable to other models, for example, the accelerated failure time (AFT) model. More details are provided in Appendix A. In this paper, we focus on a scenario where the heterogeneity analysis of covariate effects can be conducted with the aid of an auxiliary variable whose measurement is available for the $N$ subjects. Specifically, we assume that the subjects have been sorted according to the auxiliary variable's values. Further, the effect of a relevant covariate on the response variable is expected to vary smoothly across subjects. The studies reviewed in Section 1 and other similar ones suggest that covariate (e.g., genetic) effects are usually associated with clinical traits. As such, we choose an auxiliary variable with known interactions with clinical variables; please see the examples in the data analysis section (Section 4) for details.
Remark 1. In subgroup-level heterogeneity analysis, an auxiliary variable may not be needed. However, a subject-level heterogeneity analysis is intractable without the auxiliary variable due to non-identifiability. To date, the existing methods that can handle this type of heterogeneity, for example, varying-coefficients and interaction analysis, all require an auxiliary variable. Note that, in our analysis, the auxiliary variable does not need to be “precise.” Consider, for example, a sample of size 5. Auxiliary variable A has the values 1, 3, 7, 2, and 9 for the five subjects and auxiliary variable B has the values , 0.4, 0.5, 0.0, and 3. Although auxiliary variables A and B do not match, the proposed method can lead to the same covariate effects when using both auxiliary variables as an ordering index.
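The point of Remark 1, that the auxiliary variable is used only as an ordering index, can be illustrated with a short sketch. The first value of auxiliary variable B is elided in the text above, so the value below is a hypothetical stand-in chosen so that B ranks the five subjects the same way A does:

```python
import numpy as np

# Auxiliary variables A and B for the five subjects of the Remark 1 example.
# B's first value (-0.2) is hypothetical: the original value is not shown.
A = np.array([1.0, 3.0, 7.0, 2.0, 9.0])
B = np.array([-0.2, 0.4, 0.5, 0.0, 3.0])

# The proposed method sorts subjects by the auxiliary variable, so any two
# variables inducing the same ranking are interchangeable as ordering indices
# and lead to the same covariate-effect estimates.
order_A = np.argsort(A)
order_B = np.argsort(B)
```

Here `order_A` and `order_B` coincide even though the raw values of A and B do not match, which is exactly the sense in which the auxiliary variable need not be "precise."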
As previously mentioned, we propose a novel penalized estimation. Denote $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^{\top}, \ldots, \boldsymbol{\beta}_N^{\top})^{\top}$ and, for covariate $j$, $\boldsymbol{\beta}^{(j)} = (\beta_{1j}, \ldots, \beta_{Nj})^{\top}$. Then, we define estimator $\hat{\boldsymbol{\beta}}$ as the solution of the following optimization problem:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2}\sum_{n=1}^{N}\left(y_n - \boldsymbol{x}_n^{\top}\boldsymbol{\beta}_n\right)^2 + \lambda_1 \sum_{j=1}^{p} w_j \left\|\boldsymbol{\beta}^{(j)}\right\|_2 + \frac{\lambda_2}{2} \sum_{j=1}^{p}\sum_{n=2}^{N-1} \left(\beta_{(n+1)j} - 2\beta_{nj} + \beta_{(n-1)j}\right)^2, \qquad (2)$$

where $\|\boldsymbol{u}\|_2$ represents the two-norm of any vector $\boldsymbol{u}$, the $w_j$'s are weights, and $\lambda_1$ and $\lambda_2$ are data-dependent tuning parameters. We also introduce an "expanded" measurement matrix $\tilde{\boldsymbol{X}} \in \mathbb{R}^{N \times Np}$:

$$\tilde{\boldsymbol{X}} = \mathrm{diag}\left(\boldsymbol{x}_1^{\top}, \boldsymbol{x}_2^{\top}, \ldots, \boldsymbol{x}_N^{\top}\right),$$

whose $n$th row is $\boldsymbol{e}_n^{\top} \otimes \boldsymbol{x}_n^{\top}$, with $\boldsymbol{e}_n$ being the column vector whose $n$th element is 1 and the others are 0. We denote $\boldsymbol{y} = (y_1, \ldots, y_N)^{\top}$. Then, objective function (2) can be rewritten in a more compact form:

$$\frac{1}{2}\left\|\boldsymbol{y} - \tilde{\boldsymbol{X}}\boldsymbol{\beta}\right\|_2^2 + \lambda_1 \sum_{j=1}^{p} w_j \left\|\boldsymbol{\beta}^{(j)}\right\|_2 + \frac{\lambda_2}{2} \sum_{j=1}^{p} \left\|\boldsymbol{D}\boldsymbol{\beta}^{(j)}\right\|_2^2,$$

where $\boldsymbol{D}$ is the $(N-2) \times N$ second-order difference matrix.
Rationale. In (2), the first term is the lack-of-fit measure, expressed as a sum over the $N$ individual subjects. The first penalty is the group Lasso on $\boldsymbol{\beta}$. Here, a "group" refers to the regression coefficients of the $N$ subjects for a specific covariate. This penalty accommodates the high dimensionality of the data and allows for the regularized estimation and selection of relevant covariates. The "all-in-all-out" property of the group Lasso leads to a homogeneous sparsity structure, that is, the $N$ subjects have the same set of important covariates. To obtain an oracle estimator, we add weight $w_j$ to the sparsity penalty, which is determined by an initial estimator. Assuming that an initial estimator $\tilde{\boldsymbol{\beta}}$ is available, let $w_j = \|\tilde{\boldsymbol{\beta}}^{(j)}\|_2^{-1}$.

The main advancement is the second penalty, which has a spline form. It penalizes the second-order derivatives (in discrete version) of coefficients $\boldsymbol{\beta}^{(j)}$ to promote the smoothness of the coefficients between adjacent subjects. Note that the coefficients of any two adjacent subjects are assigned a penalty of the same magnitude regardless of the distance between the subjects as measured by the auxiliary variable. Different from standard spline-lasso penalties [23], it is imposed on the regression coefficients of different subjects. Furthermore, different from some alternatives that promote first-order smoothness, such as the fused Lasso [24] and smooth Lasso [25], this penalty encourages second-order smoothness. Additionally, the quadratic form of this penalty makes it computationally easier than absolute-value-form penalties such as the Lasso. It is noted that gene-environment interaction analysis can also capture the smooth change of covariate effects over an auxiliary variable (environmental factor). However, the interaction analysis approach requires specifying a parametric form of the relationship between covariate effects and the auxiliary variable, which is not very flexible in practice, particularly for high-dimensional data.
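The discrete second-order penalty described above admits a simple matrix form. The following sketch (with illustrative names and values, not the paper's released code) builds the second-order difference matrix and contrasts the penalty on a smooth versus a rough coefficient path:

```python
import numpy as np

def second_difference_matrix(N):
    """(N-2) x N matrix with rows (..., 1, -2, 1, ...): applied to a
    coefficient path, it returns the discrete second-order derivatives."""
    D = np.zeros((N - 2, N))
    for n in range(N - 2):
        D[n, n], D[n, n + 1], D[n, n + 2] = 1.0, -2.0, 1.0
    return D

N = 100
beta_smooth = np.sin(np.linspace(0.0, np.pi, N))        # smoothly varying path
beta_rough = np.random.default_rng(0).normal(size=N)    # erratic path

D = second_difference_matrix(N)
# Group-lasso penalty: one two-norm per covariate ("all-in-all-out").
group_norm = np.linalg.norm(beta_smooth)
# Spline penalty: quadratic in the second-order differences, so smooth
# paths are penalized far less than rough ones.
spline_smooth = np.sum((D @ beta_smooth) ** 2)
spline_rough = np.sum((D @ beta_rough) ** 2)
```

A linear coefficient path has exactly zero second-order differences, so, unlike a first-order (fused-lasso-type) penalty, this penalty does not shrink linear trends in the coefficients.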
2.1. Computation
Optimization (2) can be solved using a block coordinate descent (CD) algorithm. For each covariate $j$, its measurements on the $N$ subjects form a group, and the corresponding coefficients $\boldsymbol{\beta}^{(j)}$ are updated simultaneously. The algorithm optimizes the objective function with respect to one group of coefficients at a time and iteratively cycles through all groups until convergence is reached. Let $\tilde{\boldsymbol{X}}_j$ represent the sub-matrix of $\tilde{\boldsymbol{X}}$ corresponding to $\boldsymbol{\beta}^{(j)}$, which is a diagonal matrix. We denote by $\hat{\boldsymbol{\beta}}^{(j),k}$ the estimate of $\boldsymbol{\beta}^{(j)}$ in the $k$th iteration. The proposed algorithm proceeds as follows:
- 1.
Initialize the coefficient estimates and set the iteration counter $k = 0$.
- 2.
Update $\hat{\boldsymbol{\beta}}^{(j),k}$: for $j = 1, \ldots, p$, minimize the objective function with respect to $\boldsymbol{\beta}^{(j)}$. This can be realized by executing the following steps:
- (a)
Set an initial step size and adjust it until the line search condition is satisfied.
- (b)
Compute the gradient step and update the estimate of $\boldsymbol{\beta}^{(j)}$ accordingly.
- 3.
Repeat Step 2 until convergence is achieved. In our numerical study, convergence is declared when the change in the estimates between consecutive iterations falls below a pre-specified threshold.
To speed up the algorithm, we add a momentum term to the last iterate in the update step and determine step size $t$ via the backtracking line search method. After the algorithm converges, some groups of coefficients are estimated as zeros. To further improve estimation accuracy, in practice, we can remove the covariates with zero coefficients and re-estimate the nonzero coefficients by minimizing objective function (2) without the sparsity penalty. The proposed approach involves two tuning parameters, which are selected using a grid search and $K$-fold cross-validation.
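The update steps above can be sketched as a proximal-gradient block CD. This is a simplified illustration under our own naming (`block_cd`, `lam1`, `lam2` are not from the released code): it uses a fixed step size in place of backtracking line search and omits the momentum term:

```python
import numpy as np

def group_soft_threshold(z, t):
    """Proximal operator of t * ||.||_2: shrinks the whole group to zero
    when its norm is below t ("all-in-all-out")."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= t else (1.0 - t / nz) * z

def block_cd(X, y, lam1=0.5, lam2=1.0, step=1e-2, n_iter=200):
    """Simplified block CD: for each covariate j, take a gradient step on
    the smooth part (loss + spline penalty) of the objective with respect
    to that covariate's coefficient path, then group-soft-threshold."""
    N, p = X.shape
    D = np.zeros((N - 2, N))                  # second-order difference matrix
    for n in range(N - 2):
        D[n, n], D[n, n + 1], D[n, n + 2] = 1.0, -2.0, 1.0
    DtD = D.T @ D
    B = np.zeros((N, p))                      # subject-specific coefficients
    w = np.ones(p)                            # adaptive weights (from an initial fit)
    for _ in range(n_iter):
        for j in range(p):
            r = y - np.sum(X * B, axis=1)     # residuals under current estimates
            grad = -X[:, j] * r + lam2 * (DtD @ B[:, j])
            B[:, j] = group_soft_threshold(B[:, j] - step * grad,
                                           step * lam1 * w[j])
    return B

# Toy usage: one smoothly varying relevant covariate among four.
rng = np.random.default_rng(1)
N, p = 60, 4
X = rng.normal(size=(N, p))
beta0 = np.zeros((N, p))
beta0[:, 0] = np.linspace(0.5, 1.5, N)
y = np.sum(X * beta0, axis=1) + rng.normal(scale=0.1, size=N)
B_hat = block_cd(X, y)
```

The fixed step size is chosen conservatively so that each group-wise proximal step decreases the objective; the paper's backtracking line search adapts this automatically.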
Realization. To facilitate data analysis within and beyond this study, we have developed a Python implementation of the proposed approach and made it publicly available at https://github.com/foliag/SSA (accessed on 21 March 2022). The proposed approach is computationally affordable: as shown in Figure A1, its computational time grows linearly with the number of features.
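The grid search with K-fold cross-validation used to select the two tuning parameters can be sketched generically as follows; `fit` and `predict` are hypothetical placeholders standing in for the estimator, not functions from the released code:

```python
import numpy as np
from itertools import product

def cv_select(X, y, fit, predict, lam1_grid, lam2_grid, K=5, seed=0):
    """Grid search over (lam1, lam2) with K-fold cross-validation.
    fit(X, y, lam1, lam2) returns a fitted model; predict(model, X)
    returns predictions. Both are user-supplied placeholders."""
    N = len(y)
    folds = np.random.default_rng(seed).permutation(N) % K  # balanced folds
    best, best_err = None, np.inf
    for lam1, lam2 in product(lam1_grid, lam2_grid):
        err = 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            model = fit(X[tr], y[tr], lam1, lam2)
            err += np.mean((y[te] - predict(model, X[te])) ** 2)
        if err < best_err:
            best, best_err = (lam1, lam2), err
    return best

# Trivial demonstration with a constant-mean "model".
def fit(X, y, lam1, lam2):
    return (lam1, lam2, np.mean(y))

def predict(model, X):
    return np.full(len(X), model[2])

best = cv_select(np.zeros((20, 3)), np.arange(20.0), fit, predict,
                 lam1_grid=[0.1, 1.0], lam2_grid=[0.5])
```

Note that, because the coefficients are subject-specific, a real `predict` must map coefficient paths onto held-out subjects (e.g., via their auxiliary-variable positions); the sketch leaves that to the supplied routines.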
2.2. Statistical Properties
Here, we establish the consistency properties of the proposed approach. We define a new dataset by transforming the response vector and the expanded measurement matrix so that objective function (2) can be converted to an adaptive group Lasso form. Let $\boldsymbol{\beta}^0$ denote the true parameter values, and let $q$ denote the number of nonzero coefficient vectors. Without loss of generality, assume that the first $q$ coefficient vectors are nonzero, and define two index sets corresponding to the nonzero and zero coefficient vectors, respectively. We use $\lambda_{\min}(\cdot)$ to represent the minimal eigenvalue of a matrix. The following conditions are assumed:
- (C0)
Errors are i.i.d. sub-Gaussian random variables with mean zero. That is, for certain positive constants $C$ and $K$, the tail probabilities of the errors satisfy $P(|\varepsilon_n| > t) \le C\exp(-t^2/K^2)$ for all $t \ge 0$ and all $n$.
- (C1)
Let . Then, .
- (C2)
Let . Then, . Moreover, there exists a constant so that .
- (C3)
and .
- (C4)
.
Condition (C0) is the sub-Gaussian condition commonly assumed in the literature [26]. Condition (C1) assumes that the measurement matrix is bounded; similar conditions have been considered by Martinussen and Scheike [27] and Binkiewicz and Vogelstein [28]. Condition (C2) puts a lower bound on the size of the smallest signal and assumes that the initial estimate is not too small for the relevant covariates; similar conditions have been considered by Wei and Huang [29]. Condition (C3) is similar to the assumption made in Case I of Guo et al. [23], which requires the relevant sub-matrix to be invertible and its minimal eigenvalue to converge to 0 at a controlled rate. Condition (C4) makes a weak constraint that can be satisfied when, for any nonzero coefficient vector, the largest gap between two adjacent components is bounded.
Theorem 1. Assume that Conditions (C0)–(C4) hold, as does the stated event on the initial estimator, as $N$ goes to infinity. Then, with a probability converging to one, the stated estimation error bound holds. The proof is provided in Appendix B. If $q$ is not too large and the tuning parameters are not too small, the required rate relations can hold (more details below); then, we can find tuning parameters satisfying the two rate conditions simultaneously. It is not difficult to prove that the required event holds for the marginal regression estimator as the initial estimator. As a result, under Conditions (C3) and (C4), the gap between $\hat{\boldsymbol{\beta}}$ and the true coefficients converges to 0. This theorem thus establishes estimation consistency.
The following additional conditions are assumed:
- (C5)
Initial estimators are $r$-consistent for the estimation of certain limiting values, where each limit is an unknown constant vector that is nonzero for the relevant covariates.
- (C6)
The constants involved satisfy the stated rate conditions relating the dimensions, the penalty parameters, and the minimal signal.
Condition (C5) is similar to Condition (A2) in Huang et al. [26], which ensures that the weights are not too small for the relevant covariates. Condition (C6) restricts the numbers of covariates with zero and nonzero coefficients, the penalty parameters, the minimal eigenvalue, and the smallest nonzero coefficient. Given all the conditions in Theorems 1 and 2, under suitable rate assumptions, the number of nonzero coefficients $q$ can grow polynomially with $N$, while the number of zero coefficients can be much larger.
Theorem 2. Under Conditions (C0)–(C6), the stated selection result holds. The proof is provided in Appendix C. This theorem establishes the selection consistency of the proposed approach in a high-dimensional setting.
3. Simulation
We set the sample size and dimensions as described below. The data are generated from true model (1), where the random errors are simulated independently from a normal distribution. We investigate nine scenarios for the coefficients as follows:
- Scenario 1.
The coefficients are generated from trigonometric functions with randomly drawn parameters.
- Scenario 2.
The coefficients are generated from exponential functions with randomly drawn parameters.
- Scenario 3.
The coefficients are generated from logarithmic functions with randomly drawn parameters.
- Scenario 4.
The coefficients are generated from linear functions with randomly drawn parameters.
- Scenario 5.
The coefficients are constants, with values drawn at random.
- Scenario 6.
The coefficients are generated from the four functions above (trigonometric, exponential, logarithmic, and linear), with each function generating an equal number of coefficients.
- Scenario 7.
The coefficients are generated from the four above functions, where 40% and 35% of the coefficients are generated from the trigonometric and linear functions, respectively, and 10% and 15% of the coefficients are generated from the exponential and logarithmic functions, respectively.
- Scenario 8.
The coefficients are generated from the four functions. The trigonometric, exponential, logarithmic, and linear functions generate 35%, 15%, 20%, and 30% of the coefficients, respectively.
- Scenario 9.
The coefficients are generated as in Scenario 5. We select 40% of the coefficients and add random perturbations to their values in one or two ranges, where each range includes 20 consecutive subjects.
In Scenarios 1–5, the $q$ nonzero coefficients are generated from the same function, whereas in Scenarios 6–9 they are generated from different functions. The coefficients in Scenario 5 are constants, that is, there is no heterogeneity in covariate effects. Some of the coefficients in Scenario 9 do not change smoothly across subjects but have a few discontinuous areas.
Figure A2 presents the nonzero coefficients as a function of the subjects under the nine scenarios. In the first eight scenarios, the $p$ covariates are generated from a multivariate normal distribution with marginal mean 0 and variance 1. We consider an auto-regressive correlation structure, where covariates $j$ and $k$ have correlation coefficient $\rho^{|j-k|}$, with $\rho$ set to a smaller and a larger value corresponding to weak and strong correlations, respectively. In Scenario 9, the $p$ covariates are generated independently from a uniform distribution. It is noted that the aforementioned nonlinear functions of the regression coefficients are widely used in simulation studies of varying-coefficient models for genomic data [30,31].
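This data-generating setup can be sketched as follows, using Scenario 1-style trigonometric coefficient paths; all constants (sizes, the AR parameter, amplitudes, noise scale) are illustrative, since the exact settings are elided above:

```python
import numpy as np

rng = np.random.default_rng(42)
N, p, q, rho = 200, 100, 5, 0.3   # illustrative sizes and correlation

# Auto-regressive correlation: corr(X_j, X_k) = rho ** |j - k|.
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=N)

# Subjects are ordered by the auxiliary variable; the coefficients of the
# q relevant covariates vary smoothly along that ordering.
u = np.linspace(0.0, 1.0, N)
beta = np.zeros((N, p))
for j in range(q):
    beta[:, j] = 2.0 * np.sin(2.0 * np.pi * u + rng.uniform(0.0, np.pi))

# Responses from the heterogeneous linear model.
y = np.sum(X * beta, axis=1) + rng.normal(scale=0.5, size=N)
```

Swapping the trigonometric paths for exponential, logarithmic, linear, or constant functions reproduces the structure of the other scenarios.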
We consider two versions of the proposed approach. One uses the "standard" Lasso to obtain the initial estimator of the coefficients (New-Lasso), and the other uses marginal regression (New-Mar). Both initial estimators are homogeneous, that is, the coefficients are the same for all subjects. To better gauge the proposed approach, we compare it with three alternatives: (a) Lasso, which directly applies the Lasso method to the entire dataset but does not account for the heterogeneity of coefficients across subjects; (b) AdLasso, the group adaptive Lasso in the varying-coefficient model [12]; and (c) IVIS, which uses the independent screening technique for fitting the varying-coefficient model [14]. The last two methods focus on variable selection and estimation of the varying-coefficient model in high-dimensional settings, where each nonzero coefficient is assumed to be a smooth function of a known auxiliary variable.
For the proposed approach and its alternatives, we evaluate the variable selection performance by TP (number of true positives) and FP (number of false positives). Estimation and prediction are also evaluated: estimation is measured by the RMSE (root mean squared error) of the coefficient estimates, and prediction is measured by the RPE (root prediction error) on independent test data.
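Under the standard reading of these two metrics (RMSE over all $N \times p$ coefficient estimates and RPE over predictions on an independent test set, both of which are assumptions, since the exact formulas are not shown above), they can be computed as:

```python
import numpy as np

def rmse(B_hat, B_true):
    """Root mean squared error over all N x p coefficient estimates."""
    d = np.asarray(B_hat) - np.asarray(B_true)
    return float(np.sqrt(np.mean(d ** 2)))

def rpe(y_pred, y_test):
    """Root prediction error on an independent test set."""
    d = np.asarray(y_pred) - np.asarray(y_test)
    return float(np.sqrt(np.mean(d ** 2)))
```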
Table 1 summarizes the simulation results over 100 replications for the main settings; the rest of the results are presented in Table A1, Table A2 and Table A3. Across the simulation spectrum, the proposed approach has superior performance in terms of variable selection, as it identifies more important variables while having a low number of false positives. For example, in Scenario 1 (Table 1), New-Lasso has (TP, FP) = (18.44, 0.16), while Lasso has (TP, FP) = (14.56, 0.30), AdLasso (TP, FP) = (16.64, 0.70), and IVIS (TP, FP) = (13.76, 3.28). Consider another example, Scenario 9 (Table 1). For the identification of important variables, the four approaches have TP values of 18.30 (New-Lasso), 15.40 (Lasso), 15.74 (AdLasso), and 14.24 (IVIS), and FP values of 0.00 (New-Lasso), 2.60 (Lasso), 0.40 (AdLasso), and 4.64 (IVIS), suggesting the proposed approach is robust to perturbations. In most scenarios, New-Lasso outperforms New-Mar when covariates are weakly correlated but performs worse than New-Mar when covariates are strongly correlated. These results stem from the fact that the Lasso is not good at dealing with highly correlated covariates. In practice, we can select one of the two versions according to the correlations among covariates; examples are provided in Section 4. Lasso identifies a reasonable number of important variables but with more false positives than the proposed approach. AdLasso shows good variable selection performance, but it is inferior to the proposed approach under most simulation settings. IVIS has the worst performance among the five approaches.
In the evaluation of estimation, the proposed approach again performs favorably. We plot the estimated nonzero coefficients as a function of the subjects, together with 95% point-wise confidence intervals (Figure A3). In Scenario 6, the estimated coefficients are close to the true ones, and the confidence intervals contain the true coefficients for most subjects. However, the estimation results deteriorate for the coefficients of the first and last few subjects, because less information is available to estimate these coefficients than the intermediate ones. This problem can be alleviated by increasing the sample size (Figure A4). Additionally, the proposed approach outperforms the alternatives in terms of prediction under most scenarios.
Overall, the simulations suggest a favorable performance of the proposed approach. It is interesting to note that it performs satisfactorily even under the no-heterogeneity scenario (Scenario 5). Thus, it provides a safe choice for practical data analysis when the degree of heterogeneity in covariate effects is unknown. The other simulation settings yield similar results; due to space constraints, we do not describe them here.
5. Discussion
The mature application of high-throughput technology has produced large amounts of genomic data. With the rapid development of precision medicine, the heterogeneity of covariate effects has received increasing attention in disease genomic studies. However, most existing studies focus on subgroup-specific effects, under which the effects are the same within each subgroup, thus neglecting possible varying effects within a subgroup. In this paper, we consider a setting in which the effects of covariates change smoothly across subjects. We propose a novel penalization-based estimation method that combines a group-lasso penalty and a spline-lasso penalty, advancing subgroup-based studies by capturing the varying effects within each subgroup. It also advances existing varying-coefficient studies by lowering the requirements on the distribution of the auxiliary variable. We show that, under appropriate conditions, the proposed approach correctly selects the important covariates with a probability converging to one and estimates the coefficients consistently. Simulations demonstrate satisfactory practical performance, and the data analysis leads to sensible findings that differ significantly from those obtained using alternative methods.
With the proposed regression model, it is impossible to directly estimate the subject-specific covariate effects due to the non-identifiability problem. This is resolved by introducing an auxiliary variable, which can have a biological interpretation. As such, it would be of interest to develop other frameworks that can differentiate heterogeneous covariate effects in the (partial) absence of an auxiliary variable. Additionally, the data analysis results also warrant further investigation.