#### *2.3. Assessment of Physiological Markers of Cardiometabolic Health*

A panel of clinicians reviewed lab measurements across all VA locations and adjudicated discrepancies to ensure lab measure consistency. Electronic medical records, adjudicated by two clinicians, were used to ascertain the markers of cardiometabolic health, which included systolic and diastolic blood pressures, as well as the concentrations of total cholesterol, low-density lipoprotein cholesterol (LDLC), high-density lipoprotein cholesterol (HDLC), triglycerides, HbA1c, and glucose.

This study included the physiological variables that were assessed during the 1–4 years after administration of the lifestyle questionnaire. For each individual marker of cardiometabolic health, analyses utilized the mean of all measurements taken during this period, as well as the maximum measurement recorded during this period. For each physiology parameter, measurements were excluded if they were taken within 4 days of a previous measurement [14], if they had negative values, or if they lay more than three interquartile ranges below the 25th percentile or above the 75th percentile. Data from casually obtained specimens were used in the analysis without regard to fasting status.
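The exclusion rules and the mean/maximum summaries described above can be sketched as follows. This is an illustrative Python sketch, not the study's SAS code. The function names and toy values are hypothetical, and three choices are assumptions the text does not specify: fences are computed from a pooled set of values for one marker, "within 4 days" is read as a gap of 4 days or fewer, and the gap is measured from the last retained measurement.

```python
from datetime import date
from statistics import mean, quantiles


def iqr_fences(values, k=3.0):
    """Return (low, high) fences lying k interquartile ranges outside Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_series(records, low, high, window_days=4):
    """Filter one participant's (date, value) records for a single marker."""
    kept = []
    for day, value in sorted(records):
        # Drop negative values and values outside the interquartile fences.
        if value < 0 or value < low or value > high:
            continue
        # Assumption: drop a value taken within `window_days` of the last retained one.
        if kept and (day - kept[-1][0]).days <= window_days:
            continue
        kept.append((day, value))
    return kept


def summarize(kept):
    """Mean and maximum of the retained measurements (None if nothing is left)."""
    values = [v for _, v in kept]
    return {"mean": mean(values), "max": max(values)} if values else None


# Hypothetical pooled systolic blood pressure values used to set the fences.
pooled = [110, 115, 118, 120, 122, 125, 128, 130, 135, 140]
low, high = iqr_fences(pooled)

# Hypothetical records for one participant.
records = [
    (date(2015, 1, 1), 120.0),
    (date(2015, 1, 3), 122.0),   # within 4 days of the previous retained value
    (date(2015, 2, 1), 130.0),
    (date(2015, 3, 1), -5.0),    # negative value
    (date(2015, 4, 1), 400.0),   # beyond the upper fence
]
kept = clean_series(records, low, high)
summary = summarize(kept)
```

Both summaries are derived from the same cleaned series, so any excluded measurement affects the mean and the maximum alike.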

#### *2.4. Assessment of Medication Usage*

Electronic medical records were used to ascertain antilipemic agent usage during the 1 year preceding exposure assessment and during the outcome assessment period. Two clinicians adjudicated the list of antilipemic agents and included specific medications and doses from the following Generic Adjudication Classes: alirocumab; atorvastatin; atromid; bezafibrate; bezalip retard; cholestyramine; choloxin; clofibrate; colesevelam; colestipol; dextrothyroxine; evolocumab; ezetimibe; ezetimibe/simvastatin; fenofibrate; fenofibric acid; fish oil; gemfibrozil; icosapent ethyl; lomitapide; mevacor; mipomersen; niacinamide; omega-3; omega-3 acid; probucol; rosuvastatin; and statin. The same method of adjudication was used to ascertain usage of antihypertensive medications and combinations, as well as use of hypoglycemic agents.

#### *2.5. Statistical Analysis*

Before analysis, participants were randomly assigned to one of two groups: a training dataset containing approximately two-thirds of participants (*n* = 24,411) or a validation dataset containing the remaining one-third of participants (*n* = 12,085). See Supplementary Figure S1 for details.
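A random partition of this kind can be sketched in a few lines. The Python below is illustrative only: the function name, the seed, and the use of integer identifiers are hypothetical, and the exact randomization procedure used in the study is not described in the text. The group sizes are the ones reported above.

```python
import random


def split_participants(participant_ids, n_train, seed=0):
    """Shuffle participant identifiers and split them into two disjoint groups."""
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers for the 36,496 included participants.
all_ids = range(36496)
training_ids, validation_ids = split_participants(all_ids, n_train=24411, seed=1)
```

Splitting on participant identifiers, rather than on individual measurements, keeps every record of a given participant in a single dataset.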

The advent of multidimensional cohorts that both assess the exposome and are linked to electronic health records has resulted in large and complex datasets that comprise normally and non-normally distributed data that can be continuous, ordinal, categorical, and binary. There are many techniques available for reducing data dimensionality, which can be broadly categorized into supervised analyses (such as decision trees) and unsupervised analyses. Of the unsupervised analytic approaches, methods such as cluster analysis were not implemented as they would have reduced the number of observations (participants) by grouping them into a smaller set of clusters. Instead, we were aiming to achieve a reduction in the number of variables by grouping them into a smaller set of factors. We achieved this aim through implementation of common exploratory factor analysis. In fact, the use of common exploratory factor analysis in biomedical research is well-tested and effective [15,16], representing an established method whereby "hidden" relationships between the assumed latent variables and the initial observed (measured) variables can be uncovered [17]. To make sense of these data, this study aimed to (i) holistically examine the complex networks of interrelationships that define the exposome and clinical cardiometabolic risk profile; (ii) represent theoretical constructs that are unmeasurable or unmeasured; (iii) include parameter-specific measurement error; and (iv) integrate a number of techniques into one framework, accounting for the range of distributions, units, and relations within and between exposures and cardiometabolic health. 
By applying common exploratory factor analysis based on tetrachoric and polychoric correlations, followed by multivariable-adjusted regression analysis, we were able to observe the structure of relationships within and between the human exposome and subsequent markers of cardiometabolic health in this large, metadata-rich, prospective cohort of adult US Veterans.

The first stage of the analysis identified the latent constructs (common factors) that best described the shared covariance of the observed (measured) exposures and the physiological variables in the training dataset. These unobservable latent constructs are essentially hypothetical constructs that are used to represent groups of interconnected measured variables [18]. Exploratory factor analyses were used to evaluate the latent constructs and underlying structure because there were multiple hypotheses and extremely limited *a priori* knowledge of how observed (measured) variables might cluster, and because we aimed to develop a measurement model of latent variables, and not to merely identify a linear combination of variables, as is the case in principal component analysis.

For exposure variables, the exploratory factor analysis used tetrachoric and polychoric correlation coefficients between measures of exposure, with an oblique promax rotation following a varimax prerotation to make the exposure factors more parsimonious [19]. For physiology variables, Spearman's correlation coefficients were estimated and used in a common exploratory factor analysis with the orthogonal parsimax rotation [20]. Factors were estimated with maximum likelihood methods [20,21]. As a sensitivity analysis, we implemented an alternative method for extracting factors: iterated principal factor analysis. We determined the number of factors to extract through parallel analysis, where each of the eigenvalues of the input correlation matrix was compared against an empirical distribution of eigenvalues. The empirical distribution of eigenvalues was obtained from 10,000 simulations of generated random correlation matrices. We retained all factors with corresponding eigenvalues that exceeded the one-sided critical value (α = 0.01) of the empirical eigenvalue distribution [20,22].
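The parallel analysis step can be illustrated with a short sketch. The Python/NumPy code below is an assumption-laden illustration, not the SAS procedure used in the study: it uses Pearson correlations of simulated standard normal data (the study used tetrachoric, polychoric, and Spearman correlations), it works on the unreduced correlation matrix, and it stops at the first eigenvalue that fails to exceed its critical value. The function name and toy data are hypothetical.

```python
import numpy as np


def parallel_analysis(data, n_sims=1000, alpha=0.01, seed=0):
    """Count leading eigenvalues that exceed the (1 - alpha) quantile of
    eigenvalues obtained from random data of the same shape."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape

    def sorted_eigenvalues(matrix):
        corr = np.corrcoef(matrix, rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order

    observed = sorted_eigenvalues(data)
    simulated = np.empty((n_sims, n_vars))
    for i in range(n_sims):
        simulated[i] = sorted_eigenvalues(rng.standard_normal((n_obs, n_vars)))

    # One-sided critical value for each eigenvalue position.
    critical = np.quantile(simulated, 1.0 - alpha, axis=0)

    n_retain = 0
    for obs, crit in zip(observed, critical):
        if obs <= crit:
            break
        n_retain += 1
    return n_retain, observed, critical


# Toy data: six variables driven by two independent latent factors.
toy_rng = np.random.default_rng(42)
latent = toy_rng.standard_normal((500, 2))
loadings = np.zeros((2, 6))
loadings[0, :3] = 0.8
loadings[1, 3:] = 0.8
toy = latent @ loadings + 0.6 * toy_rng.standard_normal((500, 6))

n_factors, observed, critical = parallel_analysis(toy, n_sims=200)
```

In the toy data the two leading eigenvalues are far above what random data produce, so two factors are retained. The example uses 200 simulations for speed, whereas the analysis described above used 10,000.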

The eigenvalues and eigenvectors were then used to compute the standardized (mean = 0, standard deviation = 1) latent constructs in the validation dataset, to which multivariable-adjusted regression analysis that simultaneously adjusted for all of the exposure latent constructs could be applied to identify the structure of relationships between exposure latent constructs and latent constructs representing cardiometabolic health. These interrelationships were visualized using Cytoscape Version 3.7.2, with the following criteria dictating which associations were displayed: rotated factor pattern (standardized regression coefficients ≥ 0.5); uniqueness (display = all); inter-factor correlations (correlation coefficient ≥ 0.4); and multivariable-adjusted regression coefficients (significance under the Bonferroni criterion).
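The scoring and regression stage can be sketched as follows. This Python/NumPy code is a minimal illustration under stated assumptions, not the study's implementation: the scoring weights are hypothetical stand-ins for coefficients derived in the training dataset, *p* values use a normal approximation to the *t* distribution (reasonable for large samples), and the family-wise α of 0.05 and the number of tests passed to the Bonferroni threshold are assumptions, since the text does not state them.

```python
import math

import numpy as np


def standardize(matrix):
    """Rescale each column (or a single vector) to mean 0 and SD 1."""
    return (matrix - matrix.mean(axis=0)) / matrix.std(axis=0, ddof=1)


def score_factors(observed, scoring_weights):
    """Project standardized observed variables onto training-derived weights,
    then rescale the resulting factor scores to mean 0 and SD 1."""
    return standardize(standardize(observed) @ scoring_weights)


def adjusted_regression(exposure_scores, outcome_score, n_tests, alpha=0.05):
    """Regress one outcome factor on all exposure factors simultaneously and
    flag coefficients that pass a Bonferroni threshold of alpha / n_tests."""
    n_obs, n_factors = exposure_scores.shape
    design = np.column_stack([np.ones(n_obs), exposure_scores])
    coef, *_ = np.linalg.lstsq(design, outcome_score, rcond=None)
    residuals = outcome_score - design @ coef
    sigma2 = residuals @ residuals / (n_obs - n_factors - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    z = coef / se
    # Two-sided p value from the standard normal distribution (approximation).
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return coef[1:], p[1:], p[1:] < alpha / n_tests


# Toy validation data: four observed exposures and two hypothetical factors.
rng = np.random.default_rng(7)
observed_exposures = rng.standard_normal((2000, 4))
weights = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
exposure_scores = score_factors(observed_exposures, weights)

# Toy outcome factor that depends on the first exposure factor only.
outcome = standardize(0.3 * exposure_scores[:, 0] + rng.standard_normal(2000))
coef, p_values, significant = adjusted_regression(exposure_scores, outcome, n_tests=2)
```

Because all exposure factors enter the design matrix together, each coefficient is adjusted for the remaining factors, which mirrors the simultaneous adjustment described above.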

All analyses were conducted using SAS version 9.4, maintenance release #6.

#### **3. Results**

#### *3.1. Participant Characteristics*

All of the exposome variables that were included in the models are detailed in Table 1 and Supplementary Table S2. Of the 36,496 MVP participants analyzed, 86% were men, 85% were Caucasians and 11% were African-Americans (Table 1). The mean ± SD body mass index was 28 ± 5 kg/m² (Supplementary Table S2). Markers of cardiometabolic health are presented in Table 2.

**Table 1.** Key baseline characteristics of all Million Veteran Program participants included in this study.




Number of participants: 36,496. Results are mean ± standard deviation or %, where appropriate.

**Table 2.** Physiological markers of cardiometabolic health in all included Million Veteran Program participants.


Number of participants: 36,496. Mean of measurements reflects mean value for all measurements, whereas maximum measurement reflects the maximum value for all measurements.

Two-thirds of participants (*n* = 24,411) were randomly assigned to the training dataset and the remaining one-third (*n* = 12,085) were randomly assigned to the validation dataset. The mean ± SD age of the participants in the training and validation datasets was identical (62.4 ± 13.4 years).

#### *3.2. Latent Constructs Describing the Exposure Variables in the Training Dataset*

Common exploratory factor analysis based on tetrachoric and polychoric correlations in the training dataset revealed 19 common factors that explained the shared covariance of the observed (measured) exposure variables. The common factors could be broadly categorized according to the observed (measured) variables they represented (Figure 1 and Supplementary Figure S2). For example, Common Exposure Factor E1 represented shared covariance in the intakes of many commonly consumed vegetables. Furthermore, Common Exposure Factor E17 had a strong positive weighting for intake of whole milk but a strong negative weighting for intake of skim milk, representing the fact that, in this cohort, participants who frequently consumed whole milk were less likely to frequently consume skim milk.

Different types of physical activity were grouped together in three separate Common Exposure Factors. In particular, Common Exposure Factor E6 represented moderate and vigorous physical activity at home and during leisure time, Common Exposure Factor E7 represented moderate and vigorous physical activity at work, and Common Exposure Factor E10 represented light levels of physical activity at home, during leisure and at work.

**Figure 1.** Rotated factor pattern based on tetrachoric and polychoric, common exploratory factor analysis of measured exposure variables in the training dataset; limited to observed (measured) variables that had a standardized regression coefficient ≥ 0.5 for at least one latent construct. These latent constructs (common factors) are those that best described the shared covariance of the observed (measured) exposures in the training dataset. Number of participants: 24,411. Key: standardized regression coefficient.

#### *3.3. Latent Constructs Describing the Physiological Variables in the Training Dataset*

Common exploratory factor analysis in the training dataset revealed 5 common factors that explained shared physiology variable covariance. These broadly represented (i) total cholesterol and LDLC; (ii) glycemic control; (iii) blood pressure; (iv) HDLC; and (v) triglycerides (Figure 2). Common Physiology Factor P1 had positive loadings for all measures of total cholesterol and LDLC, and Common Physiology Factor P3 had high loadings for all of the measures of blood pressure. In fact, the final model applied similar loadings to mean and maximum values of the observed (measured) variables. As a sensitivity analysis, we implemented iterated principal factor analysis as the extraction method and observed similar factor loadings, with the exception of mean and maximum glucose, which went from having a factor loading < 0.5 for Common Physiology Factor P2 in the primary analysis to having a factor loading > 0.5 (0.64 and 0.60, respectively) for Common Physiology Factor P2 in the sensitivity analysis.

**Figure 2.** Rotated factor pattern based on common exploratory factor analysis of measured physiology variables in the training dataset. These latent constructs (common factors) are those that best described the shared covariance of the observed (measured) physiology variables in the training dataset. Number of participants: 24,411. "Mean" represents the mean measurement of all assessments. "Max" represents the maximum value of all the assessments. Abbreviations: LDLC: low-density lipoprotein cholesterol; HDLC: high-density lipoprotein cholesterol; HbA1c: glycated hemoglobin. Key: standardized regression coefficient.

#### *3.4. Relationships between Human Exposures and Physiology in the Validation Dataset*

Identification of the 19 Common Exposure Factors was done without knowledge of the physiological variables. Likewise, the creation of the 5 Common Physiology Factors was independent of the exposure variables. In Figure 3a–e, we present the complex patterns underlying the structure of relationships between Common Exposure Factors and Common Physiology Factors that remain after taking into account the non-independence of the assessed exposome. Some Common Exposure Factors had no association with the Common Physiology Factors, whereas others showed strong associations, either inverse or positive. Specifically, even though the Common Exposure Factor describing intake of processed meat and fried potato (E2) was associated with the Common Physiology Factors describing total cholesterol and LDLC (P1), triglycerides (P5), blood pressure (P3), and glycemic control (P2), the Common Exposure Factor representing red meat intake from main and mixed dishes (E14) was not associated with any of the physiological common factors. Similarly, although the Common Exposure Factor describing moderate and vigorous physical activity at home and during leisure (E6) was associated with the Common Physiology Factors describing total cholesterol and LDLC (P1), triglycerides (P5), and HDLC (P4), the Common Exposure Factor representing moderate and vigorous physical activity at work (E7) was not associated with any of the physiological common factors.

When considering individual physiology factors, the fruit latent construct (Common Exposure Factor E3), but not the vegetable latent constructs (Common Exposure Factors E1 and E19), was inversely associated (estimate: −0.03, *p*: 0.0077) with the latent construct summarizing total cholesterol and LDLC (Common Physiology Factor P1) (Figure 3a). Conversely, the latent construct with a positive weighting for intake of whole milk but a strong negative weighting for intake of skim milk (Common Exposure Factor E17) had a positive association with Common Physiology Factor P1, as well as with Common Physiology Factor P3, the latent construct summarizing measures of blood pressure (Figure 3d). The latent construct describing light levels of physical activity (Common Exposure Factor E10) was inversely associated with the blood pressure latent construct.

**Figure 3.** Common exploratory factor analysis (training dataset) and multiple regression analysis (validation dataset) outlining the interrelationships between human exposures and markers of cardiometabolic health. (**a**) Association of latent constructs representing exposure to various dietary and lifestyle common factors with the latent construct that explains the shared covariance in total cholesterol and low-density lipoprotein cholesterol (LDLC) concentrations. (**b**) Association of latent constructs representing exposure to various dietary and lifestyle common factors with the latent construct representing triglyceride concentrations. (**c**) Association of latent constructs representing exposure to various dietary and lifestyle common factors with the latent construct representing high-density lipoprotein cholesterol (HDLC) concentrations. (**d**) Association of latent constructs representing exposure to various dietary and lifestyle common factors with the latent construct representing blood pressure. (**e**) Association of latent constructs representing exposure to various dietary and lifestyle common factors with the latent construct representing glycemic control. Number of participants for the common exploratory factor analysis that was conducted in the training dataset: n = 24,411. Number of participants for the multivariable-adjusted regression analysis that was conducted in the validation dataset: n = 12,085. Observed (measured) exposure and physiology variables are presented in the order in which they appear in Figures 1 and 2, respectively. Criteria for displaying observed (measured) variables: rotated factor pattern (standardized regression coefficient ≥ 0.5). Criteria for displaying association lines: uniqueness (display all); inter-factor correlations (correlation coefficient ≥ 0.4); and multivariable-adjusted regression coefficients (*p* value significant using the Bonferroni threshold). For multivariable-adjusted regression coefficients, the line thickness represents the value of −log10(*p* value) (range: 2.11 to 9.05).

#### **4. Discussion**

In this prospective cohort study of U.S. male and female Veterans, we reported an association between the exposome and markers of cardiometabolic health. Specifically, using factors identified in a training dataset, multiple multivariable-adjusted regression analyses revealed significant positive and inverse associations between exposure latent constructs and latent constructs describing observed (measured) physiology variables when applied to a separate validation dataset containing different participants. This provided us with critical insights and observations that represent steps forward in enhancing our understanding of how the exposome, as a holistic entity, shapes human physiology.

We employed common exploratory factor analysis to reveal the structure of interrelationships between individual exposures and physiology variables in a way that substantially advances our understanding of how observed (measured) exposome variables relate to the cardiometabolic risk profile. For example, a Common Exposure Factor was created to reflect the close relationship in study participants between high levels of moderate and vigorous physical activity at home and high levels of moderate and vigorous physical activity during leisure time. This relationship was not strongly correlated with levels of moderate and vigorous physical activity at work, which was represented by a different Common Exposure Factor. This suggests that the amount of moderate to vigorous physical activity participants perform at work did not covary with the amount of moderate to vigorous physical activity performed during leisure and at home [23]. By unveiling "hidden" relationships between the latent constructs and the observed (measured) variables they represent that matched our understanding of biology and variable representation, the utility of common exploratory factor analysis for both questionnaire-derived assessments of exposome and electronic medical record-derived assessments of cardiometabolic health was highlighted. However, it is unclear what the causal implications for these relationships are.

Very few studies have attempted to determine the influence of the exposome as a whole on cardiometabolic risk (as determined through electronic medical records). Modelling the system of relationships underlying the way in which the exposome, as a whole, shaped the cardiometabolic risk profile was therefore an important aim of our investigation. In addition to revealing the structure of the exposome and the structure of physiology variables, this study also revealed the structure of relationships between the exposome and cardiometabolic risk profile through multiple regression of latent constructs. An example of this was the creation of a latent construct in the training dataset that represented the reciprocal relationship between consumption of whole and skim milk. In other words, any associations of whole milk with cardiometabolic disease risk could not be separated from the effects of skim milk and should not be interpreted in isolation. When applied to the separate validation dataset, this milk-based latent construct was positively associated with the latent constructs representing total cholesterol and LDLC as well as systolic and diastolic blood pressure. This observation is supported by randomized controlled trials directly comparing non-fermented whole milk to non-fermented skim milk that suggest adverse effects of whole milk, compared to skim milk, on total cholesterol and LDLC [24,25]. Further, skim but not whole milk has been shown to exhibit antihypertensive properties [26,27]. It is not yet clear whether dairy fat intake increases cardiovascular disease risk [28]. Despite this, results of our exposome analysis support the 2006 American Heart Association Diet and Lifestyle Recommendations and the 2015–2020 Dietary Guidelines for Americans, which both encourage adults to select milk products that are either fat-free or low in fat rather than whole milk products [29,30]. 
The finding that individual components of the exposome are both numerically and biologically intertwined highlights the urgent need to implement analytic techniques that holistically examine the complex networks of interrelationships within and between observed (measured) variables. This was achieved through the representation of unmeasurable or unmeasured theoretical constructs as well as parameter-specific measurement error in order to draw robust generalizations regarding the complex interactions between the many exposome components and human physiology.

The use of latent factors to describe interrelationships between individual exposome and physiology components was able to shed light on hypothesized relationships. For example, the factor describing fruit consumption was inversely associated with the factor describing concentrations of total cholesterol and LDLC. This is supported by (i) our previous findings from the National Heart, Lung, and Blood Institute Family Heart Study, which found that consumption of fruit and vegetables was inversely related to LDLC in both men and women [31] and (ii) results from other cohort studies and randomized controlled trials [32,33]. Although the benefits of fruit consumption on cholesterol concentrations are not conclusive, with some studies showing no benefit of fruit consumption [34], the cholesterol-lowering capacity of fruit has been attributed to its high fiber content [35,36]. In this study, peaches, oranges, and apples contributed to the fruit factor. Apples have been shown to increase the clearance of plasma cholesterol by enhancing the fecal excretion of bile acids and cholesterol [37], and the peel of peaches has been shown to lower total cholesterol and LDLC in rats fed a high-sucrose diet. Further, the polyphenols in apples have been shown to have beneficial effects on cholesterol metabolism [38–41], as have the pectins of apples and oranges [42]. A randomized controlled trial testing the effect of the combination of peaches, oranges, and apples on serum lipid profile is needed in order to ascertain causality of this observed association. It is important to note that there were cases where hypothesized relationships were not observed. 
For example, despite vegetables being a rich source of dietary fiber and higher consumption of vegetables being associated with lower risk of all-cause mortality and cardiovascular mortality [43], the vegetable consumption factors in this study were not significantly associated with the factor describing total cholesterol and LDLC. The absence of confirmatory findings regarding vegetables in this study may be explained by the absence of data on intake of nutrients, such as fiber, which can summarize biologically important contributions from many different foods. Another reason may be that measurement error is lower for fruit intake than for vegetable intake. However, another interpretation may be that, after controlling for other exposome components, vegetables are not associated with total cholesterol and LDLC concentrations in this cohort. Further studies using longitudinal data are needed to confirm these findings.

Although the methods implemented were well-tested and effective, it is important to note that diet was self-reported, health outcomes were captured through electronic medical records, there was a lack of data on medication adherence, and there was a limited number of women and non-whites in this U.S. Veteran cohort. Additionally, causality of observed relationships could not be established due to the observational nature of the study. Nevertheless, it is important to note that the exposome was measured at least one year prior to any of the physiologic variables being assessed, which, although not ruling it out, does reduce the likelihood and impact of reverse causation. An additional factor to consider when interpreting the results is the possibility of false-positive findings, which was reduced through implementation of the conservative Bonferroni correction [44]. Furthermore, although residual or unmeasured confounders cannot be ruled out, the common exploratory methods implemented in this study do aim to represent unmeasured and/or immeasurable variables through the creation of latent constructs. This analytic approach also enabled us to model the measurement error inherent when using self-reported exposome assessments as well as collations from electronic medical records, even when there are some exposome variables, such as environmental variables, that are not measured directly. By conducting the analysis in both a training and a validation dataset, we were able to demonstrate the utility of our analytic strategy for use in the increasingly prevalent type of cohort that has extensive questionnaire-based assessments of the exposome as well as markers of human physiology derived from electronic medical records. However, further replication in separate datasets is warranted.

It is becoming increasingly recognized that studies of the complete exposome are more biologically representative than fragmented models based on subsets of factors. There is no more clear example of this than the position of the Academy of Nutrition and Dietetics, which plainly states that the "total diet or overall pattern of food eaten is the most important focus of healthy eating" [45]. In recognition of this, as opposed to focusing on individual nutrient recommendations, the Dietary Guidelines for Americans highlight key elements of healthy eating patterns [30]. Some patterns, such as the Mediterranean Dietary Pattern, are based on a priori knowledge of how individual dietary components influence human health, and are often represented by a pattern score that reflects relative adherence to the dietary pattern, as is the case for the Healthy Eating Index [46,47]. Although hypothesis-driven, there is no general consensus in the scientific or clinical community as to what is the ideal dietary pattern for optimal health [48]. Importantly, these dietary patterns typically reflect only a select group of dietary components, and not necessarily the diet as a whole [46,47]. Given that the aim of the present paper was to identify the importance of the entire exposome in influencing cardiometabolic health, we chose to adopt a data-driven approach which allowed us to identify existing patterns in the population, and how individual foods, lifestyle factors, and demographic features related to each other at a population level. In this study, the measured exposome variables were represented by 19 common factors that were created independently from the physiological variables. We observed heterogeneous patterns of association of exposome constructs with each of the physiological constructs, which represented measured biomarkers that are important indicators of cardiometabolic health. 
As such, even though the exposome overall contributed to cardiometabolic health, each facet of the exposome had differential implications for different aspects of cardiometabolic health. It is evident that the overall complex exposome that individuals are exposed to needs to be studied more in order to more fully understand what is contributing to cardiometabolic risk profiles in communities of free-living individuals. To be more comprehensive, a logical extension to this current work would be to incorporate more environmental exposome variables, such as air pollution and access to green space, into the current models.

In conclusion, we observed a complex pattern of associations between the exposome and markers of cardiometabolic health. Given that we are increasingly recognizing the potential of the exposome as a whole to have far-reaching health implications beyond the effects associated with the sum of its parts, it is more important than ever that we make available analytic tools and approaches that are capable of dealing with this directly. The analytic strategy implemented in this paper could be applied to address a range of research questions that utilize questionnaire and electronic health record data, thus bringing us one step closer to understanding how the exposome, as a whole, impacts human health.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/nu13041364/s1, Table S1: Number of participants excluded based on use of antilipemic, antihypertensive and/or hypoglycemic medications during either the exposure-assessment or outcome-assessment periods, Table S2: Additional baseline characteristics of all Million Veterans Program participants included in this study, Figure S1: Schematic overview of the training and validation datasets, Figure S2: Rotated factor pattern based on tetrachoric and polychoric, common exploratory factor analysis of measured exposure variables in the training dataset; limited to observed (measured) variables that did not have a standardized regression coefficient ≥ 0.5 for at least one latent construct.

**Author Contributions:** K.L.I. conducted analyses and wrote the manuscript. X.-M.T.N., K.C., J.M.G. and L.D. collected data. All authors critically reviewed the manuscript for content and clarity. K.L.I., X.-M.T.N., D.P., G.B.R., D.K.T., R.S., Y.-L.H., R.L., P.W.F.W., K.C., J.M.G., F.B.H., W.C.W. and L.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is based on data from the Million Veteran Program supported by the MVP001 and MVP#000 awards from the Department of Veterans Affairs. This research was also supported by the VA Merit Award I01 BX003340-01. Support for VA/CMS data was provided by the Department of Veterans Affairs, VA Health Services Research and Development Service, VA Information Resource Center (Project Numbers SDR 02-237 and 98-004). This research is based on data from the VA Million Veteran Program supported by award MVP#000 from the Department of Veterans Affairs. This publication does not represent the views of the Department of Veterans Affairs or the U.S. government.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the VA Central IRB [13] (protocol code MVP001 approved in 2010).

**Informed Consent Statement:** Written informed consent has been obtained from the participants in accordance with all VA policies and under the authority of the VA Central IRB [13].

**Data Availability Statement:** Data described in the article, code book, and analytic code will not be made available to other researchers for purposes of reproducing the results or replicating the procedure, in order to comply with current VA privacy regulations pursuant to the US Department of Veterans Administration policies on compliance with the confidentiality of US veterans' data.

**Acknowledgments:** The authors thank the members of the Million Veteran Program Core, those who have contributed to the Million Veteran Program, and especially the Veteran participants for their generous contributions. This work was supported using resources and facilities of the Department of Veterans Affairs (VA) Informatics and Computing Infrastructure (VINCI), VA HSR RES 13-457. More information can be found in Appendix A.

**Conflicts of Interest:** No authors have any relevant conflict of interest.
