Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data
Abstract
1. Introduction
2. Statistical Methods
2.1. Data Structure and Model Setup for the Longitudinal Interaction Analysis
2.2. An Overview of Generalized Estimating Equations in the Interaction Analysis
2.3. Penalized Identification of G × E Interactions in the Longitudinal Study
2.4. Computational Algorithms
- Step 1. Specify an appropriate range for the two-dimensional grid of (λ1, λ2);
- Step 2. For a fixed (λ1, λ2),
- Step 3. For each (λ1, λ2) over the grid, repeat Step 2 until convergence and locate the optimal pair corresponding to the smallest cross-validation error.
- Step 4. Report with respect to the optimal (λ1, λ2).
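The grid search described in these steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `cv_error()` is a hypothetical stand-in for fitting the penalized model on training folds and scoring the held-out fold, replaced here by a toy function so the block runs on its own, and the grid ranges are arbitrary.

```r
# Hypothetical stand-in for a K-fold cross-validation error at (lam1, lam2).
# A real version would refit the penalized model on each training fold.
cv_error <- function(lam1, lam2) {
  (lam1 - 0.5)^2 + (lam2 - 1)^2   # toy surface with its minimum at (0.5, 1)
}

# Step 1: candidate values for the two tuning parameters (assumed ranges).
lambda1 <- seq(0.25, 1, by = 0.25)
lambda2 <- seq(0.5, 1.5, by = 0.5)
grid <- expand.grid(lam1 = lambda1, lam2 = lambda2)

# Steps 2-3: evaluate the criterion at every grid point.
grid$cv <- mapply(cv_error, grid$lam1, grid$lam2)

# Step 4: keep the pair with the smallest cross-validation error.
best <- grid[which.min(grid$cv), c("lam1", "lam2")]
print(best)
```

In the package itself, the `lambda1`, `lambda2` and `nfolds` arguments of `cv.interep()` play the role of the grid and the number of folds.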
3. The R Package Interep
3.1. The Main Functions
- interep(e, g, y, beta0, corre, pmethod, lam1, lam2, maxits)
- cv.interep(e, g, y, beta0, lambda1, lambda2, nfolds, corre, pmethod, maxits)
3.2. Other Supporting Functions
4. Simulation
    Data1 <- function(n, p, k, q, rho) {
      # n: sample size; p: number of G factors;
      # k: number of time points; q: number of E factors
      # Requires MASS::mvrnorm(), so call library(MASS) first.
      # The three dummy variables below fix the number of E factors at 3,
      # so the function should be called with q = 3.
      y = matrix(rep(0, n*k), n, k)
      # G factors: multivariate normal with covariance 0.5^|i-j|
      sig = matrix(0, p, p)
      for (i in 1:p) {
        for (j in 1:p) { sig[i,j] = 0.5^abs(i-j) }
      }
      x = mvrnorm(n, rep(0,p), sig)
      g = x
      # generate binary variables
      dummy0 <- as.numeric(x[,1] <= -0.5)
      dummy1 <- as.numeric(x[,1] > -0.5 & x[,1] <= 0)
      dummy2 <- as.numeric(x[,1] > 0 & x[,1] <= 0.5)
      # generate environment factors
      e = cbind(dummy0, dummy1, dummy2)
      # set up the design matrix for the interaction model:
      # columns are E (1..q), G (q+1..q+p), then G x E products
      x = cbind(dummy0, dummy1, dummy2, x)
      for (i in (q+1):(p+q)) {
        for (j in 1:q) {
          x = cbind(x, x[,j]*x[,i]) }
      }
      x = scale(x)
      ll = 0.4
      ul = 0.8
      coef1 = runif(q, ll, ul) # for interaction effects
      coef2 = runif(q, ll, ul) # for interaction effects
      coef3 = runif(q, ll, ul) # for interaction effects
      coef4 = runif(7, ll, ul) # for E and G main effects
      coef = c(coef4, coef1, coef2, coef3)
      # columns of the design matrix that carry nonzero effects
      mat = x[, c(1,2,3,5,7,10,15,(p+q+1):(p+q+3),
                  (p+5*q+1):(p+5*q+3),(p+10*q+1):(p+10*q+3))]
      # the same mean structure is used at every time point
      for (u in 1:k) {
        y[,u] = 0.6 + rowSums(coef*mat) }
      sig1 = matrix(0, k, k) # AR(1) correlation
      diag(sig1) = 1
      for (i in 1:k) {
        for (j in 1:k) { sig1[i,j] = rho^abs(i-j) }
      }
      # correlated errors across the k repeated measurements
      error = mvrnorm(n, rep(0,k), sig1)
      y = y + error
      # positions of the true effects, shifted by 1 for the intercept
      index = 1 + c(1,2,3,5,7,10,15,(p+q+1):(p+q+3),
                    (p+5*q+1):(p+5*q+3),(p+10*q+1):(p+10*q+3))
      dat = list(y=y, e=e, g=g, index=index, coef=coef)
      return(dat)
    }
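One R detail is worth checking when adapting this generator. For a vector `coef` and a matrix `mat`, the expression `coef * mat` recycles `coef` down the columns of `mat` in column-major order, which is not the same as weighting column j by `coef[j]`. Which of the two is intended in `rowSums(coef*mat)` is not stated here, so the sketch below only shows the two computations side by side. The toy inputs are assumptions chosen for illustration.

```r
# Toy inputs (assumed for illustration): 4 observations, 2 predictors.
mat  <- matrix(1, nrow = 4, ncol = 2)
coef <- c(10, 1)

# Element-wise product: coef is recycled down each column (10, 1, 10, 1, ...).
recycled <- rowSums(coef * mat)

# Matrix product: column j is weighted by coef[j] in every row.
per_column <- as.vector(mat %*% coef)

# Equivalent per-column weighting written with sweep().
per_column_sweep <- rowSums(sweep(mat, 2, coef, "*"))

print(recycled)     # 20 2 20 2
print(per_column)   # 11 11 11 11
```

If one coefficient per column is wanted, `mat %*% coef` or the `sweep()` form expresses that directly.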
    > library(interep)
    > library(MASS)
    > set.seed(1000)
    > n=250;p=75;k=5;q=3;rho=0.5
    > dat=Data1(n,p,k,q,rho)
    > e=dat$e
    > g=dat$g
    > y=dat$y
    > dim(e)
    [1] 250 3
    > dim(g)
    [1] 250 75
    > dim(y)
    [1] 250 5
    > dat$coef
    [1] 0.5934948 0.7166512 0.7787346 0.4494069 0.4576405
    [6] 0.6088111 0.7519327 0.7800568 0.7447117 0.5241370
    [11] 0.6688325 0.6749083 0.7816159 0.6365433 0.7102527
    [16] 0.4383118
    > dat$index
    [1] 2 3 4 6 8 11 16 80 81 82 92 93 94
    [14] 107 108 109
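The positions in `dat$index` are easier to read once they are mapped to effect names. The helper below, `effect_labels()`, is a hypothetical convenience function and not part of interep. It assumes the coefficient vector is ordered as intercept, E main effects, G main effects, then the G×E products in the order built by the nested loop in `Data1`.

```r
# Hypothetical helper: label coefficient positions assuming the ordering
# intercept, E1..Eq, G1..Gp, then G1xE1..G1xEq, G2xE1, ... as in Data1.
effect_labels <- function(p, q) {
  inter <- paste0("G", rep(seq_len(p), each = q),
                  "xE", rep(seq_len(q), times = p))
  c("Intercept", paste0("E", seq_len(q)), paste0("G", seq_len(p)), inter)
}

labels <- effect_labels(p = 75, q = 3)

# Positions copied from the dat$index output.
index <- c(2, 3, 4, 6, 8, 11, 16, 80, 81, 82, 92, 93, 94, 107, 108, 109)
print(labels[index])
```

Under that assumed ordering, the sixteen positions correspond to the three E main effects, four G main effects, and the interactions of three G factors with each E factor.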
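    index.true=dat$index
    # starting values: intercept, q E effects, p G effects, p*q interactions
    beta0=rep(0.1,1+q+p+p*q)
    lambda1=0.45
    lambda2=1
    beta.est = interep(e, g, y,beta0,corre="e",pmethod="mixed",
                       lam1=lambda1,lam2=lambda2,maxits=30)
    # treat estimates smaller than 0.05 in absolute value as zero
    beta.est[abs(beta.est)<0.05]=0
    # positions of the selected effects; [-1] drops the first one (the intercept)
    index.est = which(beta.est != 0)[-1]
    tp = length(intersect(index.true, index.est))  # true positives
    fp = length(index.est) - tp                    # false positives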
    > tp
    [1] 13
    > fp
    [1] 2
    > index.true
    [1] 2 3 4 6 8 11 16 80 81 82 92 93 94
    [14] 107 108 109
    > index.est
    [1] 2 3 4 6 8 9 11 14 16 92 93 94 107
    [14] 108 109
    > head(beta.est,20)
          [,1]
     [1,] 0.5962702
     [2,] 0.1635084
     [3,] 0.2501227
     [4,] 1.2758346
     [5,] 0.0000000
     [6,] 0.7307669
     [7,] 0.0000000
     [8,] 0.6030248
     [9,] 0.4935733
    [10,] 0.0000000
    [11,] 0.5921602
    [12,] 0.0000000
    [13,] 0.0000000
    [14,] 0.6130440
    [15,] 0.0000000
    [16,] 0.6448346
    [17,] 0.0000000
    [18,] 0.0000000
    [19,] 0.0000000
    [20,] 0.0000000
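The selection summary can be recomputed directly from the two printed index vectors. The sketch below uses only base R. The false-negative count and the names `fn` and `missed` are additions for illustration and do not appear in the original script.

```r
# Index vectors copied from the console output.
index.true <- c(2, 3, 4, 6, 8, 11, 16, 80, 81, 82, 92, 93, 94, 107, 108, 109)
index.est  <- c(2, 3, 4, 6, 8, 9, 11, 14, 16, 92, 93, 94, 107, 108, 109)

tp <- length(intersect(index.true, index.est))  # true effects that were selected
fp <- length(setdiff(index.est, index.true))    # selected effects that are not true
missed <- setdiff(index.true, index.est)        # true effects that were not selected
fn <- length(missed)

cat("TP:", tp, "FP:", fp, "FN:", fn, "\n")
print(missed)
```

The counts agree with the reported `tp` and `fp`, and the three missed positions are 80, 81 and 82.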
    reps=100
    time=rep(0,reps)   # elapsed seconds for each replicate
    for (h in 1:reps) {
      # simulate a fresh data set for each replicate
      n=250;p=75;k=5;q=3;rho=0.5
      dat=Data1(n,p,k,q,rho)
      e=dat$e
      g=dat$g
      y=dat$y
      beta0=rep(0.1,1+q+p+p*q)
      lambda1=0.5
      lambda2=0.5
      # time only the model fit, not the data generation
      start_time <- Sys.time()
      b = interep(e, g, y,beta0,corre="e",pmethod="mixed",lam1=lambda1,
                  lam2=lambda2,maxits=30)
      end_time <- Sys.time()
      time[h]=as.numeric(end_time)-as.numeric(start_time)
    }
    # mean and standard deviation of the run time in seconds
    mean(time);sd(time)
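An alternative way to record run times is `system.time()`, which reports elapsed seconds without manual subtraction of time stamps. The sketch below wraps it in a hypothetical helper, `time_reps()`. The workload `slow_fit()` is a placeholder standing in for a model fit such as the `interep(...)` call, so the block runs without the package.

```r
# Hypothetical helper: run fit_fun() `reps` times and return elapsed seconds.
time_reps <- function(fit_fun, reps) {
  vapply(seq_len(reps), function(h) {
    system.time(fit_fun())[["elapsed"]]
  }, numeric(1))
}

# Placeholder workload standing in for a model fit.
slow_fit <- function() {
  x <- matrix(rnorm(200 * 50), 200, 50)
  solve(crossprod(x) + diag(50))
}

set.seed(1)
elapsed <- time_reps(slow_fit, reps = 5)
cat("mean:", mean(elapsed), "sd:", sd(elapsed), "\n")
```

Passing the fit as a function keeps data generation outside the timed region, as in the loop with `Sys.time()`.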
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
GEE | Generalized estimating equation |
LASSO | Least absolute shrinkage and selection operator |
PGEE | Penalized generalized estimating equation |
PQIF | Penalized quadratic inference function |
MCP | Minimax concave penalty |
SCAD | Smoothly clipped absolute deviation |
SNP | Single-nucleotide polymorphism |
CNV | Copy number variation |
QIF | Quadratic inference function |
ρ = 0.5 | A1 | 0.99 (0.10) | 3.26 (0.35) | 4.90 (0.28) | 19.05 (2.12) |
ρ = 0.5 | A2 | 1.26 (0.29) | 4.85 (1.51) | 7.82 (2.96) | 44.35 (22.40) |
ρ = 0.5 | A3 | 0.80 (0.06) | 3.01 (1.26) | 4.43 (0.26) | 23.11 (1.75) |
ρ = 0.5 | A4 | 0.92 (0.18) | 2.84 (1.17) | 4.44 (0.47) | 18.05 (2.66) |
ρ = 0.5 | A5 | 1.49 (0.48) | 5.54 (2.30) | 13.40 (5.00) | 48.31 (28.75) |
ρ = 0.5 | A6 | 0.87 (0.09) | 2.83 (0.64) | 4.47 (0.32) | 21.61 (2.19) |
ρ = 0.8 | A1 | 0.99 (0.10) | 3.27 (0.39) | 4.92 (0.39) | 18.43 (2.49) |
ρ = 0.8 | A2 | 1.27 (0.38) | 4.83 (1.85) | 8.89 (3.57) | 42.62 (21.26) |
ρ = 0.8 | A3 | 0.88 (0.44) | 3.21 (1.50) | 4.58 (0.26) | 23.40 (1.79) |
ρ = 0.8 | A4 | 0.89 (0.16) | 2.53 (0.24) | 4.43 (0.39) | 17.79 (3.27) |
ρ = 0.8 | A5 | 1.43 (0.54) | 5.45 (2.93) | 11.59 (5.55) | 53.22 (29.40) |
ρ = 0.8 | A6 | 0.88 (0.09) | 2.60 (0.34) | 4.58 (0.34) | 14.58 (8.26) |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhou, F.; Ren, J.; Liu, Y.; Li, X.; Wang, W.; Wu, C. Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data. Genes 2022, 13, 544. https://doi.org/10.3390/genes13030544