*3.2. Real Data Analysis*

We applied the proposed and alternative methods on a dataset from one of our previous studies in animal models [15]. In the study, 60 female CD-1 mice were assigned to four different treatment groups, which were control (ad libitum feeding and sedentary), AE (exercise and ad libitum feeding), PE (exercise and pair feeding), and DCR (sedentary and 20% dietary calorie restriction). The phenotype of interest was mice's body weight, which was measured every week for 10 weeks. Mice were sedentary and given ad libitum feeding in the control group, where they could eat as much as they wanted without doing treadmill exercises. In the AE group, mice received ad libitum feeding and ran on the

treadmill every day at a speed of 0.5 mph, 1 hour per day, and 5 days a week, while mice in the PE group did the same exercise, but were given the same amount of diet as the mice in the control group. Mice in the DCR group had 20% less calorie intake than the control group, but they had the same intake of protein, vitamins, and minerals. The composition of 176 plasma neutral lipid species of interest was measured. In the current study, we only focused on diacylglycerols. In addition, the diacylglycerol lipid species that have a majority of samples lower than the detection limits were excluded so there were 31 diacylglycerols. In total, there were 31 lipid main effects and 93 lipid–environment interactions.

Using the method A1 (interep with the exchangeable working correlation) as shown in Table 7, we identified seven lipid species that had different effects in weight control of mice (AE, PE, or DCR) on body weight compared to those of the control mice. Among them, C20:1/16:1 and C20:1/20:4 had negative interactions in AE mice, where C denotes carbon. For the lipid species of C20:1/16:1, *C*39*H*76*O*5*N*, the regression coefficient was −2.9145 for AE mice. That is, mice with an increased amount of C20:1/16:1 tended to have a lower body weight compared to that of the control. In the AE mice, both C16:0/C16:0 and C22:6/C18:1 had strong positive associations with body weights. It is interesting that C16:0/C16:0 were negatively associated with body weight in both PE and DCR mice. C16:0 is also called palmitic acid and is one of most common saturated fatty acids. Increased consumption of palmitic acid is associated with higher risk of cardiovascular disease, type 2 diabetes, and cancer [28]. The negative association of C16:0/16:0 and body weight in DCR and PE suggests that when the calories of the diet are restricted, the accumulation of saturated fat in the body actually decreased compared to the control. Another lipid that is negatively associated with body weight in DCR and PE mice is C18:1/16:1. The lipids that were positively associated with body weight in PE were C18:2/C16:1, C20:1/C16:1, and C22:6/C18:1. All species contain unsaturated fatty acids. Among them, C22:6 is one of the omega-3 polyunsaturated fatty acids (PUFA). In DCR, the two lipids that were positively associated with body weight were C18:2/16:1 and C20:1/20:4. Both fatty acids C18:2 and C20:4 were PUFA. The results seem to be consistent with our previous finding that exercise with paired feeding may increase the amount of PUFA in phospholipids in mice skin [29].


**Table 7.** Real data analysis result from method A1 (method accommodating the lipid–environment interactions with exchangeable working correlation).

AE, exercise and ad libitum feeding; PE, exercise and pair feeding; DCR, sedentary and 20% dietary calorie restriction.

In addition, we adopted A4 to analyze the lipid data. A4 also had the exchangeable working correlation, but it could not conduct group level selection of the lipid–environment interactions. The identification results are tabulated in Table 8. Note that the selection of interactions with individual dummy environment factors was not consistent with the formulation of the lipid–environment interactions. In terms of prediction, A1 had a smaller prediction error (4.04) than that of A4 (4.97).


**Table 8.** Real data analysis result from method A4 (method not accommodating the lipid–environment interactions with exchangeable working correlation).

#### **4. Discussion**

Investigation of the potential roles of lipids in the regulation and control of cellular function and the interactions between lipids and environmental factors are very important in the understanding of physiology and disease processes. Traditionally, the analyses mostly focus on the total amount of a particular type of lipid, such as total triglyceride, total cholesterol, and omega-3 fatty acid. With the recent advances in instrumental technology, it is feasible to analyze quantitatively a broad range of lipid species in a single platform [13,15,30–32]. The vast arrays of data generated in lipid profiling studies bring challenges to the statistical analysis of lipidomics data [33–35].

In this study, we proposed a penalized variable selection method to identify important lipid–environmental effects in longitudinal studies. Some statistical methods have already been reported for lipidomics studies, including the marginal test and variable selection methods [15,32,34,35]; however, they cannot be directly extended to longitudinal studies. On the other hand, existing variable selection methods for longitudinal data have been predominately developed for the identification of main effects and cannot accommodate the group level interaction structure unique to our studies. Both the simulation and case study have convincingly demonstrated the merit of the proposed interep over alternatives.

We selected tuning parameters based on cross-validation. A further investigation of different tuning criteria is interesting, but beyond the scope of this study, especially given the fact that many well known variable selection methods in longitudinal studies, such as [5], have been conducted using cross-validation. To facilitate a fair cross-comparison with existing relevant studies, we believe it is reasonable to adopt cross-validation to choose tuning parameters. Note that the aforementioned stability selection analysis also partially justifies the usage of cross-validation. We acknowledge that other criteria for selecting tunings, such as double cross-validation [36], could be a potential reliable choice. However, as it is not a widely accepted tuning criterion for high-dimensional data analysis and has not been adopted in any longitudinal studies so far, we postpone the investigation to the future.

Interaction studies have been historically pursued by statisticians [37]. Within the high-dimensional scenario, accounting for such a complex structure, in both gene–gene (G × G) and gene–environment (G × E) interaction studies, is challenging, but also rewarding [38]. The proposed study is among the first to investigate penalized identification of lipid–environment interactions in longitudinal studies. Both the simulation study and case study yielded interesting findings. G × G interaction is computationally more challenging than G × E interactions since both main effects involved in the interactions are of high dimensionality. Following the representative G × G interaction studies [39,40], we can extend the proposed study to lipid–lipid interactions, which has not been investigated in longitudinal studies so far. Besides, when multi-omics measurements are available, it is also of great interest to examine interaction effects through multi-omics integration studies in the longitudinal setting [41,42].

The proposed model can also be estimated using the quadratic inference functions (QIF). GEE relies on the working correlation matrix *R*(*η*), and it enables us to find the consistent estimator of the regression parameter if consistent estimators of the nuisance parameters *η* can be obtained. However, consistent estimators of *η* do not always exist in some cases. QIF has been proposed to avoid explicit estimation of the nuisance parameters by assuming the inverse of the working correlation matrix *R*(*η*) can be approximated by a linear combination of a class of base matrices [7,43]. Thus, QIF is robust to the misspecification of the working correlation.

In this paper, we are interested in the identification of lipid-treatment (or environment) interactions through penalization. The success of set based analysis, including those for the gene set [44] and SNP set [45,46], has tremendously motivated the development of statistical methods for G × E interactions from marginal analyses ([47,48]) to penalization methods [17,18,49]. Our model can be potentially extended in the following aspects. First, as data contamination and outliers have been widely observed in repeated measurements, robust variable selection methods in G × E interaction studies [23,50–52] can be extended to longitudinal settings. Second, recently, multiple Bayesian methods have been proposed for pinpointing important G × E interaction effects [53–55]. Within the framework of analyzing repeated measurements, Bayesian variable selection for interactions has not been extensively examined. Investigations of all these possible directions will be postponed to the near future.

**Author Contributions:** Conceptualization, C.W. and Y.J.; resources, Y.J., W.W., and C.W.; methodology, F.Z., J.R., Y.J., and C.W.; writing, original draft preparation, F.Z. and C.W.; software, F.Z. and J.R.; data analysis, F.Z., Y.J., and C.W.; writing, review and editing, all authors; supervision, C.W. and Y.J.; project administration, C.W.; funding acquisition, C.W.

**Funding:** This study was partially supported by an Innovative Research Award from the Johnson Cancer Research Center at Kansas State University and a Kansas State University Faculty Enhancement Award.

**Conflicts of Interest:** The authors declare no conflict of interest.
