*2.3. Penalized Identification*

When the dimensionality of lipid factors is high, the total number of main and interaction effects is even higher. However, only a small subset of important effects is associated with the phenotype, which naturally leads to a variable selection problem. Penalized GEE-based methods, including Wang et al. [5] and Ma et al. [6], have been proposed for conducting selection under correlated longitudinal responses. However, those published studies focus on the main effects and ignore the interactions. As shown in (1), the lipid–environment interactions are modeled on the group level, that is, as the interaction between all the *q* treatment factors and the *h*th lipidomics measurement (1 ≤ *h* ≤ *p*). Such a group structure cannot be accommodated by variable selection methods from existing longitudinal studies. This fact motivates us to develop a method for the interaction analysis of repeated measures data, termed interep, with the following penalized generalized estimating equation:

$$Q(\beta) = U(\beta) - \sum_{j=1}^{p} \rho'(|\beta_{2j}|; \lambda_1, \gamma)\, \text{sign}(\beta_{2j}) - \sum_{h=1}^{p} \rho'(\|\beta_{3h}\|_{\Sigma_h}; \sqrt{q}\lambda_2, \gamma), \tag{3}$$

where *U*(*β*) is the score equation in GEE and *ρ*′(·) is the first derivative of the minimax concave penalty (MCP) [22]. Since the environmental factors are usually of low dimension and are predetermined as important ones, they are not subject to penalized selection. *U*(*β*) is defined as:

$$U(\beta) = \sum_{i=1}^{n} Z_i^T V_i^{-1} (Y_i - \mu_i(\beta)),$$

and the MCP can be expressed as:

$$\rho(t;\lambda,\gamma) = \lambda \int_0^t \left(1 - \frac{x}{\gamma \lambda}\right)_+ dx,$$

where *λ* is the tuning parameter and *γ* is the regularization parameter that controls the concavity of the penalty. For *t* ≥ 0, the first derivative of the MCP is:

$$
\rho'(t; \lambda, \gamma) = (\lambda - \frac{t}{\gamma})I(t \le \gamma \lambda).
$$

MCP can be adopted for the regularized selection on both individual and group level effects. It is fast, continuous, and nearly unbiased [22].
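As a minimal sketch of the two formulas above, the Python functions below evaluate the MCP and its first derivative for a nonnegative argument. The function names `mcp` and `mcp_derivative` are hypothetical and chosen for illustration only; they are not part of the interep package. The closed form in `mcp` follows from integrating the definition directly.

```python
def mcp(t, lam, gamma):
    """MCP value rho(t; lam, gamma) for t >= 0.

    Closed form of lam * integral_0^t (1 - x / (gamma * lam))_+ dx.
    """
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    # Beyond gamma * lam the integrand is zero, so the penalty stays constant.
    return 0.5 * gamma * lam * lam


def mcp_derivative(t, lam, gamma):
    """First derivative rho'(t; lam, gamma) = (lam - t / gamma) * I(t <= gamma * lam)."""
    return lam - t / gamma if t <= gamma * lam else 0.0
```

The derivative equals *λ* at zero and decreases linearly to zero at *γλ*, so coefficients larger than *γλ* receive no shrinkage, which is the sense in which the penalty is nearly unbiased.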

In (3), the vector *β*<sub>2</sub> = (*β*<sub>21</sub>, ..., *β*<sub>2*p*</sub>) denotes the regression parameters for all the *p* lipid factors. *β*<sub>3</sub> = (*β*<sub>31</sub><sup>T</sup>, ..., *β*<sub>3*p*</sub><sup>T</sup>)<sup>T</sup>, which denotes the regression parameters for lipid–environment interactions, is a vector of length *pq*. *β*<sub>3*h*</sub> is a vector of length *q* (*h* = 1, 2, ..., *p*), corresponding to the interactions between the *h*th lipid feature and the environment factors. With the control as the baseline, the environment factors have been formulated as a group of dummy variables. With high-dimensional main and interaction effects, penalization is critical for the identification of important effects out of the large number of candidates. In the penalized generalized estimating equation (3), the first penalty term adopts MCP directly to conduct the selection of main lipid effects on the individual level. The second penalty, in the form of group MCP, imposes shrinkage on the product between the lipid factors and the dummy variable group, which corresponds to the lipid–environment interactions. The group level selection of interaction effects is consistent with the mechanism of creating the dummy variable group of environmental factors. Note that the rationale for formulating the penalized generalized estimating equation (3) is deeply rooted in group LASSO [16].
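The partition of the coefficients described above can be illustrated with a short Python sketch. It assumes, for illustration only, that the full coefficient vector is ordered as intercept, *q* environment effects, *p* lipid main effects, and then *p* blocks of *q* interaction effects; the helper names `split_coefficients` and `main_effect_penalty` are hypothetical and do not come from the interep package. The second function evaluates only the individual-level term of (3) for the lipid main effects.

```python
def split_coefficients(beta, p, q):
    """Split beta into (beta_2, [beta_31, ..., beta_3p]).

    Assumed ordering (an illustration, not taken from the package):
    intercept, q environment effects, p lipid main effects,
    then p blocks of q lipid-environment interaction effects.
    """
    expected = 1 + q + p + p * q
    if len(beta) != expected:
        raise ValueError(f"expected {expected} coefficients, got {len(beta)}")
    start = 1 + q                       # skip intercept and environment effects
    beta_2 = list(beta[start:start + p])
    start += p
    beta_3 = [list(beta[start + h * q:start + (h + 1) * q]) for h in range(p)]
    return beta_2, beta_3


def main_effect_penalty(beta_2, lam1, gamma):
    """Individual-level term rho'(|b|; lam1, gamma) * sign(b) for each main effect."""
    terms = []
    for b in beta_2:
        a = abs(b)
        rate = lam1 - a / gamma if a <= gamma * lam1 else 0.0
        sign = (b > 0) - (b < 0)
        terms.append(rate * sign)
    return terms
```

With *p* = 2 and *q* = 2, for example, a coefficient vector of length 9 is split into two main effects and two interaction groups of length 2 each. The environment effects are skipped because they are not subject to penalization.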

In particular, *λ*<sub>1</sub> and *λ*<sub>2</sub> in (3) are tuning parameters. *ρ*′(||*β*<sub>3*h*</sub>||<sub>Σ<sub>*h*</sub></sub>; √*q* *λ*<sub>2</sub>, *γ*) is the group MCP penalty term that corresponds to the interactions between the *h*th (*h* = 1, 2, ..., *p*) lipid factor and the *q* environment factors. The empirical norm ||*β*<sub>3*h*</sub>||<sub>Σ<sub>*h*</sub></sub> is defined as ||*β*<sub>3*h*</sub>||<sub>Σ<sub>*h*</sub></sub> = (*β*<sub>3*h*</sub><sup>T</sup> Σ<sub>*h*</sub> *β*<sub>3*h*</sub>)<sup>1/2</sup> with Σ<sub>*h*</sub> = *n*<sup>−1</sup>*B*<sub>*h*</sub><sup>T</sup>*B*<sub>*h*</sub>. Here *B*<sub>*h*</sub> = *Z*[, (2 + *q* + *p* + (*h* − 1) × *q*) : (1 + *q* + *p* + *h* × *q*)], and it contains the *q* columns in *Z* that correspond to the interactions of the *h*th lipid factor with the *q* environment factors.
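The column range of *B*<sub>*h*</sub> and the empirical norm can be checked with a small Python sketch. The function names below are hypothetical, matrices are represented as plain lists of rows, and the group-level quantity computed in `group_penalty_rate` is only the scalar *ρ*′ term as it appears in (3). This is an illustration under those assumptions, not the implementation used in the interep package.

```python
import math


def interaction_columns(h, p, q):
    """1-based, inclusive column range of B_h within Z, as given in the text."""
    return 2 + q + p + (h - 1) * q, 1 + q + p + h * q


def empirical_norm(beta_3h, B_h):
    """||beta_3h||_{Sigma_h} with Sigma_h = B_h^T B_h / n.

    Uses the identity beta^T (B^T B / n) beta = ||B beta||^2 / n,
    so Sigma_h never has to be formed explicitly.
    """
    n = len(B_h)
    fitted = [sum(x * b for x, b in zip(row, beta_3h)) for row in B_h]
    return math.sqrt(sum(v * v for v in fitted) / n)


def group_penalty_rate(beta_3h, B_h, lam2, gamma):
    """rho'(||beta_3h||_{Sigma_h}; sqrt(q) * lam2, gamma) for one interaction group."""
    lam = math.sqrt(len(beta_3h)) * lam2   # group-size scaling sqrt(q)
    t = empirical_norm(beta_3h, B_h)
    return lam - t / gamma if t <= gamma * lam else 0.0
```

For *p* = 3 and *q* = 2, `interaction_columns(1, 3, 2)` gives columns 7 to 8 and `interaction_columns(3, 3, 2)` gives columns 11 to 12, which matches a design matrix with 1 + *q* + *p* + *pq* = 12 columns.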

A variety of penalized variable selection methods have been developed in the past two decades for high-dimensional longitudinal omics data, such as gene expressions, single nucleotide polymorphisms (SNPs), and copy number variations (CNVs) [5,6]. However, lipidomics data have rarely been investigated with high-dimensional variable selection methods. We developed a package, interep (https://cran.r-project.org/package=interep), that incorporates our recently developed penalization procedures to conduct interaction analysis for high-dimensional lipidomics data with repeated measurements [21].

Remark: The uniqueness of the proposed study lies in accounting for the group structure of lipid–environment interactions through penalized identification. The main lipid effects and lipid–environment interactions are therefore penalized on the individual and group levels, respectively, which leads to a formulation with both MCP and group MCP penalties. Although our model has been motivated by a specific lipidomics profiling study in weight-controlled mice [15], it can be readily extended to accommodate more general cases in interaction studies where the environmental factors are not dummy variables formulated from the ANOVA setting. In such a case, for each lipid factor, the main lipid effect and the lipid–environment interactions form a group, with the leading component of the group being a vector of 1s. As not all the effects in the group are expected to be associated with the phenotype, a sparse group type of variable selection is required. Such a formulation has been investigated in survival analysis [23], but not yet in longitudinal studies. With a simple modification that penalizes the main and interaction effects on the individual and group levels simultaneously, the proposed model becomes a penalized sparse group GEE model and can be adopted to handle general environmental factors in high-dimensional cancer genomics studies.
