**1. Introduction**

Findings from the Adventist Health Study-2 (AHS-2) cohort have been prominent among epidemiologic studies linking a vegetarian dietary pattern to lower risks of diabetes, metabolic syndrome, and coronary heart disease [1–5] as well as lower risks of gastrointestinal and prostate cancers, among others [6–8]. Besides avoiding meat, vegetarians and particularly vegans in this cohort have markedly higher consumption of plant-based foods including fruits, vegetables, whole grains, soy, and nuts, but lower consumption of sweets, snack foods, refined grains, and caloric beverages [9], and healthier profiles of biomarkers of dietary intake, including phytochemicals or bioactive compounds [10] relative to non-vegetarians. However, there is limited understanding of the underlying molecular mechanisms that could help explain the probable ability of vegetarian dietary patterns to prevent chronic diseases, or to buttress causal interpretations.

DNA methylation involves the chemical addition of a methyl group at the 5 position of a cytosine residue, and may be influenced by diet, among other lifestyle or environmental factors. Such modifications can potentially alter gene expression and the development of chronic, autoimmune, or aging-related diseases [11,12]. Diet may influence gene-specific or global DNA methylation by (1) providing or modifying the availability of methyl donor compounds or cofactors that regulate onecarbon metabolism [13], (2) regulating enzymes that catalyze or reverse methylation, such as DNA methyl transferases (DNMTs) or ten or eleven translocation (TET) enzymes [14,15], and (3) indirectly regulating inflammation or oxidative stress [16]. Vegetarian or plant-based diets are rich in polyphenols and secondary plant metabolites that could inhibit the activity of DNMT to prevent cancer [14,17,18]. On the other hand, diets high in animal-based and fatty foods may generate pro-inflammatory compounds [19–22] and metabolites [15,23] that alter methylation and increase activity of TET enzymes to promote cancer.

Besides specific dietary components, there is some evidence from animal studies and human intervention trials that dietary conditions such as high fat content or caloric restriction may alter genome-wide and gene-specific DNA methylation [24–30]. However, it is not clear how habitual dietary patterns differing in consumption of plant- and animal-based foods differ in terms of DNA methylation patterns. The aim of the current study was to examine genome-wide DNA methylation patterns in long-term vegan and non-vegetarian participants of the AHS-2 cohort to identify differentially methylated genes. Our approach involved a covariate-adjusted analysis of methylation of individual CpG sites, as well as of CpGs annotated to their respective genes (gene methylation), considering various gene regions, and employing an adapted Storey et al. [31,32] permutation method to correct for false discovery.

#### **2. Materials and Methods**

#### *2.1. Study Population*

Participants in the current study are members of the larger AHS-2 cohort, established in 2002–2007 as a national study of Seventh-day Adventists, about <sup>1</sup> <sup>2</sup> of whom are vegetarian. In this sub-study, 143 participants were Caucasian, and 61 were African-American or Black. Subjects were selected from the AHS-2 biorepository for DNA methylation analyses conducted at three separate times variably balanced by age, sex, diet group, and/or ethnic group. All subjects were pooled for the analyses performed in the current study. Participants were classified a priori into vegetarian diet groups based on their responses to the food frequency questionnaire (FFQ) completed upon enrollment as follows: vegans never or rarely (less than once per month) consumed eggs, dairy, fish and other meats, and non-vegetarians consumed non-fish meats at least once a month and any meat (including but not only fish) more than once per week.

All investigations were carried out following the rules of the Declaration of Helsinki of 1975, revised in 2013. The project was originally approved by the Loma Linda University institutional review board, protocol: 48134.

#### *2.2. Laboratory Analysis and Data Processing*

Fasting blood samples were collected at field clinics held in church halls, as described previously [33]. Buffy coat was removed and diluted to a final volume of 8 mL with phosphate buffered saline, then aliquoted into straws and frozen at −180 ◦C in nitrogen vapor for storage in the AHS-2 biorepository. Genomic DNA from buffy coat samples was isolated using the Quick-gDNA MiniPrep kit (Zymo Research, Irvine, CA, USA) according to the manufacturer's protocol. After bisulfite sequencing, genome-wide methylation analysis was performed using the Infinium MethylationEPIC (*n* = 67) and HumanMethylation450 BeadChip (*n* = 137) arrays at the University of California, Los Angeles Neuroscience Genomics Core. Raw data were summarized into BeadStudio IDAT files for further analysis.

Methylation data were processed using R version 3.6.1. Initial preprocessing and normalization procedures were conducted using the minfi package in R [34]. Raw data generated from methylation analysis were normalized with the single-sample normal-exponential out-of-band (ssNoob) method as recommended by Fortin (2017) for combined 450 k and EPIC datasets. Probes were excluded that were cross-reactive, within 10 bp of single nucleotide polymorphisms (SNPs) [35,36], or present on X/Y chromosomes. Furthermore, CpG probes with bead counts less than 3 in more than 2% of samples or detection *p* values above 0.01 in more than 1% samples were excluded. After these normalization and processing steps, 313,206 CpG sites (common to both BeadChip arrays) remained for analysis.

As an additional layer of control and primarily to adjust for unwanted variation including batch, chip, positional, cell heterogeneity, and study effects, a surrogate variable analysis was used from the SmartSVA R package. With this method, influential principal components of residuals from regressions of CpG methylation on dietary groups are converted into surrogate variables, and subsequently included in final regression models, thereby accounting for unwanted variation from other sources of "error" [37].

### *2.3. Statistical Analyses*

DNA methylation β values were obtained from the BeadChip array, representing the methylation frequency within an individual sample (ratio of methylated signal divided by the sum of methylated and unmethylated signals) for each target CpG or gene ranging from 0 (not methylated) to 1 (fully methylated). Subsequently the log2 ratio of the methylated to unmethylated signal was calculated to obtain M values. Using SmartSVA (39), surrogate variables were created from regression residuals conditional on covariates, age (continuous), sex (male, female), ethnic group (Black, Caucasian), and body mass index (BMI; continuous). To examine the association between dietary pattern and DNA methylation, a linear model was then fitted where the response variable was the residual when the M values were regressed on these covariates (except BMI) and the surrogate variables [38]. This represents non-BMI covariate-adjusted methylation levels for individual CpG sites [39], or the average of covariate-adjusted methylation level for CpGs associated with each gene. The independent variable came from residuals when dietary pattern was regressed on these same covariates, this then representing covariate-adjusted dietary pattern. Body mass index was not included as a covariate, given that it may be a mediator in some cases between diet and CpG methylation.

For comparison, separate linear regression models were fitted excluding the SmartSVA method, where models were additionally adjusted for position on the BeadChip (*n* = 8 rows), the original sub-study (*n* = 3) along with array (450 K, 800 K), and cell type heterogeneity (CD8T, CD4T, NK, Bcell Mono, Gran) using the approach of Houseman et al. [38].

CpG sites were associated with genes by calculating an unweighted average of covariate-adjusted M values of all CpG sites in the dataset that were annotated to the gene of interest and in the nominated gene region, where applicable. When analyzing a particular region, genes were only included where at least two CpG sites were available in that region. Thus, besides analyzing methylation of all represented sites, or average methylation across a gene, methylation was also analyzed according to 1) genic/intergenic region (200 or 1500 base pairs upstream of the transcription start site (TSS200 and TSS1500, respectively), the 5 untranslated region (UTR), first exon, gene body, 3 UTR, and intergenic regions, 2) in relation to CpG islands (CpG island, north shore, south shore, north shelf, south shelf, and open sea regions), and 3) in the promoter, including CpG islands within the promoter.

An adapted permutation-identified null distribution (Storey et al. method, references 31, 32) was used to adjust for multiple comparisons to avoid the risk of severe false discovery. The measure of DNA methylation differences by diet pattern for each CpG site or gene is the corresponding partial t statistic for each CpG site, and for genes a joint t statistic averaging multivariable-adjusted methylation status across relevant CpGs in a particular gene. The permutation method was used to define the null distribution of the t scores for CpG sites. Then for genes, in separate analyses, all CpG sites in a nominated region relevant to that gene were used in the averaging process, not only the CpG sites that were previously statistically significant. To adjust for covariates when planning a permutation approach, the residual method was used as has been recommended [40], this being applied to both the independent variables aside from diet (the exposure of interest) and the dependent variable as described above. This reduces the independent variables to one for permutation, which already takes account of covariances between the covariates. Covariances between the residualized dependent M values are retained as these dependent variables are not permuted. Recognizing that the real data are a mixture of null and non-null methylation sites/genes, an estimate of the proportion of null features allows a direct estimate of the false discovery rate (FDR) [32], and hence the more useful selection of only those with small FDR. The method also allows an estimate of the number of all differentially methylated sites/genes [31,32]), although only a proportion can be individually identified with acceptable FDR in this relatively small dataset. All statistical analyses were performed with R software.

#### **3. Results**

#### *3.1. Methylation Di*ff*erences in Vegans and Non-Vegetarians*

Demographic characteristics of study participants are shown in Table 1. There were no statistically significant differences in the distribution of males and females, or in age, comparing vegans and non-vegetarians. As expected, vegans had lower BMI relative to non-vegetarians (24.3 vs. 28.0; *p* < 0.001). This was viewed as a potentially mediating variable in statistical analyses.


**Table 1.** Demographic characteristics of study population 1.

<sup>1</sup> Values presented as mean (SD) or as *n* (%) where indicated.

For this study, we compared methylation of individual CpG sites and genes in vegans relative to non-vegetarians, dividing CpG sites into separate categories defined by gene region, namely, genic/intergenic, relation to CpG islands, and promoter regions.

3.1.1. Differential Methylation of Select Regions Associated with Individual Genes

1. Proportions of Differentially Methylated Gene Regions

There were greater proportions of genes with at least two probes in a related CpG island (74%) or TSS1500 (72%) region, followed by those with sites in the gene body (65%) (Table 2). The majority of differentially methylated genes were hypomethylated in vegans. When differential methylation was analyzed after averaging CpG methylation across an entire gene, within each of the 18,627 genes (i.e., labelled "All") represented on the array and in our analytical dataset, a total of 18 genes (4 hypermethylated, 14 hypomethylated) were found to be differentially methylated after adjustment

for false discovery (Table 2). When differential methylation was analyzed according to methylation of genic/intergenic regions, 21 genes (including three identified in the analysis of overall gene methylation, "All") showed differential methylation in the gene body, representing the most differentially methylated genes for any region. Identifying a central linear region of a plot of cumulative probabilities of the null vs. actual distributions of t scores (adaptation of methods in references [31,32]) revealed that an estimated 1081 (6%) of the 18,627 genes analyzed are non-null (Table 2 and Appendix A), although the majority of these cannot be specifically identified here. The greatest proportion of estimated non-null genes relative to the respective total number of genes for methylation of a given region were genes defined by methylation of the gene body (9%, see Figure A1 of Appendix A), with relatively high proportions also noted when defined by methylation of the TSS1500 (7%), north shelf (7%), and open sea (7%) regions. As only t-scores of non-null sites will systematically increase (in absolute value) as sample sizes increase, it is possible to estimate the effects of larger sample size on the proportion of differentially methylated genes that could then be identified (Appendix A). With increasing sample size, the number of differentially methylated genes comparing vegans with non-vegetarians steadily increases (Table 3). Fewer differentially methylated genes were observed with exclusion of surrogate variables (SmartSVA) from the analytical approach (Supplementary Table S1), although greater numbers of differentially methylated CpG sites than genes (observed and non-null) were noted in analyses excluding surrogate variables (Supplementary Table S2).
