*3.1. Simulation*

We evaluated the performance of all six methods (A1–A6) through extensive simulation studies. Among them, A1–A3 were developed for accommodating the interaction structures with different working correlations, while A4–A6 were only focused on the identification of main effects so the structure of the group level interaction effects were not respected. Note that there are existing studies that can also achieve the selection of main effects in longitudinal studies. For example, Wang et al. [5] adopted the smoothly clipped absolute deviation (SCAD) penalty for conducting the selection of main effects. Since the MCP is incorporated as the baseline penalty in A1–A3, A4–A6 have thus been developed based on MCP and used as benchmark methods for comparison.

The responses were generated from the model (2) with sample size *n* = 250 and 500. The number of time points *k* was set to five. The dimensions for lipid factors *Xij* were *p* = 75, 150 and 300. With *q* = 3 for *Eij*, we first simulated a vector of length *n* from the standard normal distribution. A group of three binary dummy variables for environmental factors could then be generated after dichotomizing the vector at the 30th and 70th percentiles. In addition, the lipids were simulated from a multivariate normal distribution with mean zero and the AR1 covariance matrix with marginal variance one and auto-correlation coefficient 0.5. We simulated the random error from a multivariate normal distribution by assuming a zero mean vector and an AR1 covariance structure with *ρ* = 0.5 and 0.8. Note that when considering the interactions, the actual dimensionality was much larger than *p*. For instance, given *n* = 250, *p* = 150, and *q* = 3, the total dimension for all the main and interaction effects was 604.

The coefficients were simulated from *U*[0.4, 0.8] for 17 nonzero effects, consisting of the intercept, 3 environmental dummy variables, 4 lipid main effects, and 3 groups of lipid–environment interactions

(9 interaction effects). We generated 100 replicates for the four settings: (1) *n* = 250 and *p* = 75, (2) *n* = 250 and *p* = 150, (3) *n* = 500 and *p* = 150, and (4) *n* = 500 and *p* = 300. All the rest of the coefficients were set to zero. For each setting, we considered two correlation coefficients (*ρ* = 0.5 and 0.8) for the random error. The number of true positives (TP) and false positives (FP) was recorded.

In addition to identification results, we also calculated the estimation accuracy in terms of the difference between estimated and true coefficients. In particular, the mean squared error corresponding to the true nonzero coefficients and true zero coefficients (for noisy effects) were termed as MSE and NMSE, respectively. The total mean squared error for the coefficient vector, or TMSE, is computed as:

$$\text{TMSE} = \frac{1}{100} \sum\_{r=1}^{100} ||\mathcal{J}^{(r)} - \mathcal{J}||^2 / p\_{\beta}$$

where *p<sup>β</sup>* is the dimension of *β* and *β*ˆ(*r*) is the estimated value of *β* in the *r*th simulated dataset. MSE and NMSE were calculated in a similar way as for TMSE.

Identification results of the six methods (A1–A6) are tabulated in Tables 1–4. In general, A1–A3, which account for both the lipid main effects and lipid–environment interactions, had better performance than A4–A6, which only accommodated the main effects. For example, in Table 1, given *n* = 250, *ρ* = 0.5, *p* = 75, the actual dimension is 304. A1 identified 14.5 (sd 1.9) nonzero effects out of all the 17 true positives, with a relatively small number of false positives of 4.8 (sd 3.1). On the other hand, A4 identified a smaller number of true positives, 1.3 (sd 1.5), with a larger number of false positives, 6.6 (sd 4.2). Among the identified effects, A1 identified 7.4 (sd 1.5) interactions, with 3.1 (sd 2.6) false positives. A4 identified a smaller TP of 6.1 (sd 1.1) and a higher FP of 5.1 (sd 3.3) of the lipid–environment interactions. We could observe that the difference in identification performance between A1 and A4 came mainly from the interaction effects, which was due to the fact that A4 could not accommodate the group level selection corresponding to the lipid–environment interactions. As the dimension increased, A1 outperformed A4 more significantly. For instance, in Table 4, the overall dimension for *n* = 500, *ρ* = 0.8, *p* = 300 is 1204. A1 had a TP of 15.9 (sd 1.2) and an FP of 3 (sd 2.6), while A4 had a smaller TP 14.5 (sd 1.2) and a higher FP 4.5 (sd 3.0). Figures 1 and 2 are plotted based on the identification results from Tables 1–4. We can observe that overall, A1–A3 outperformed A4–A6 with a higher TP and a lower FP under each setting.


**Table 1.** Identification results for *n* = 250, *p* = 75 with an actual dimension of 304.

Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.


**Table 2.** Identification results for *n* = 250, *p* = 150 with an actual dimension of 604.

Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

**Table 3.** Identification results for *n* = 500, *p* = 150 with an actual dimension of 604.


Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.


**Table 4.** Identification results for *n* = 500, *p* = 300 with an actual dimension of 1204.

Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

**Figure 1.** Plot of the identification results for *n* = 250, *p* = 75 with an actual dimension of 304. *p* = 150 with an actual dimension of 604. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

**Figure 2.** Plot of the identification results for *n* = 500, *p* = 150 with an actual dimension of 604. *p* = 300 with an actual dimension of 1204. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

In terms of estimation accuracy, A1–A3 also had a better performance compared with A4–A4, as shown in Tables 5 and 6. For the panel corresponding to *n* = 250, *ρ* = 0.5, and *p* = 75 in Table 5, the mean squared error for the nonzero coefficients of A1 was 0.1055, which was less than half of that of A4 (0.2321). Besides, A1 also had a smaller total mean squared error (TMSE). All the pieces of evidence suggested that A1 had higher estimation accuracy than A4. We can observe the pattern for the rest of the four methods. As the dimension increased to *n* = 500, *ρ* = 0.8, and *p* = 300 (so the total dimension was 1204) in Table 6, the MSE of A1 (0.0688) was also smaller than that of A4 (0.1949). There were no obvious differences in NMSE among these settings.

Another important conclusion we make from the simulation study is that, for the methods that differ only in working correlation, i.e., A1 (exchangeable), A2 (AR1), and A3 (independence), there was no significant difference in terms of either identification or estimation accuracy, as shown by Tables 1–6, as well as Figures 1 and 2. Such an observation suggests that the proposed methods under the GEE framework were robust to the misspecification of the working correlation, and this is consistent with the conclusions from main effects only models in longitudinal studies [7].


**Table 5.** Estimation accuracy results for *n* = 250, *p* = 75 with an actual dimension of 304. *p* = 150 with an actual dimension of 604.

Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

**Table 6.** Estimation accuracy results for *n* = 500, *p* = 150 with an actual dimension of 604. *p* = 300 with an actual dimension of 1204.


Mean (sd) based on 100 replicates. A1–A3: methods accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4–A6: methods not accommodating the lipid–environment interactions with exchangeable, AR(1), and independence working correlations, respectively.

To mimic the sample size and number of lipid factors in the case study, we also conducted a simulation in settings with *n* = 60, *p* = 30, and *q* = 3. Therefore, the overall dimension of main and interaction effects was 124. The coefficients were generated from *U*[1.4,1.8] for 17 nonzero effects. The identification and prediction results are summarized in Tables A1 and A2 in the Appendix A, respectively. Consistent patterns were observed. For example, in terms of identification, under *ρ* = 0.5,

A1 had a higher TP of 13.6 (sd 2.5) compared to the 11.1 (sd 2.6) of A4, and a lower FP of 4.7 (sd 2.7), compared to the FP of 5.4 (sd 2.8) identified by A4.

Evaluation of all the methods, especially A1–A3, was also conducted when the true underlying model was misspecified. We generated the response (phenotype) from a main effect only model with eight true main effects when *n* = 250, *p* = 75, *ρ* = 0.8 with a total dimension of 304. Results are provided in Table A3. When the interaction effects did not exist, A1 had only identified a very small number of false interaction effects, with 0.7 (sd 1.7) false positives. A2–A6 performed similarly in terms of identifying false interaction effects. All six methods identified a comparable number of true main effects. Overall, all methods had similar performance in identification, as well as prediction, when the data generating model had only main effects. Such a phenomenon is reasonable by further examining the results in Table 1. We found that the major difference between A1–A3 and A4–A6 was due to the identification of interaction effects. Therefore, when only main effects were present, all the methods had comparable performances.

Penalized regression and hypothesis testing are two related, but distinct aspects in statistical analysis. The proposed study was not aimed at developing test statistics, computing the power functions, and assessing the control of type 1 error, so these statistical test related results are not available, just like most of the studies on penalized regression. Recently, efforts devoted to bridging the two areas have been mainly restricted to linear models under high-dimensional settings [24–26]. Extensions to interaction models have not been reported so far. In particular, we are not aware of results reported for longitudinal models. Nevertheless, we conducted the simulation by assuming the null model and tabulate the identification results in Table A4. The results should be interpreted as identification with misspecified models. As we observed, under the null model, all six methods led to a very small number of false positives.

To assess the consistency of variable selection in longitudinal settings, we carried out the stability selection [27] under *n* = 250, *p* = 75, and *ρ* = 0.8. Each time, we selected 200 out of the total of 250 subjects without replacement and then conducted selection. The process was repeated 100 times, which yielded a proportion of selected effects. Larger proportions of being selected suggested stable results. Stability selection is well known for assessing the stability of penalized selection, and it alleviates the concern that the effects have only been identified by chance. We investigate the selection proportions of the 17 true main and interaction effects for all six methods in Table A5. A1 identified 14 true effects with proportions above 70%, which is consistent with the results shown in the lower panel of Table 1, where 13.7 TPs (sd 2.3) were identified. Such a consistent pattern can be observed across all six methods.

Although no consensus on the optimal criterion of selecting tuning parameters has been reached so far, cross-validation is perhaps the most well accepted criterion to select tuning parameters in the community of high-dimensional data analysis [3,4]. To further justify its appropriateness, under the setting of *n* = 250 and *p* = 75, we performed the analysis by selecting tuning parameters using an independently generated testing dataset with a sample size of 1000 and *p* = 75. The models were fitted on the training dataset, and prediction was assessed based on the independently generated testing dataset, so no data were used in training the model. The identification and prediction results are tabulated in Tables A6 and A7, respectively. A comparison to Tables 1 and 5 demonstrates that the results obtained by cross-validation and validation were very close.
